Consider a MIPS-like instruction set. All instructions are of the form:
Computers spend most of their time in loops, so multiple loop iterations are great places to speculatively find more work to keep CPU resources busy. Nothing is ever easy, though: the compiler emitted only one copy of the loop's code, so even though multiple iterations are handling distinct data, they will appear to use the same registers. To keep register usage from multiple iterations from colliding, we rename their registers. The following code segment shows an example that we would like our hardware to rename.
Figure 2-1. Rename table and on-the-fly register substitution logic for superscalar machines. (Note: “src” is source, “dst” is destination.)
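To make the renaming mechanics concrete, here is a minimal Python sketch of the rename-table idea Figure 2-1 depicts. The register counts, free-list policy, and names are illustrative assumptions, not part of the figure.

# Minimal rename-table sketch: map architectural registers to physical
# registers so that successive loop iterations no longer collide.
class RenameTable:
    def __init__(self, num_arch=32, num_phys=64):
        # Start with an identity mapping; the remaining physical
        # registers form the free list.
        self.map = {f"r{i}": f"p{i}" for i in range(num_arch)}
        self.free = [f"p{i}" for i in range(num_arch, num_phys)]

    def rename(self, dst, srcs):
        # Sources read the current mapping BEFORE the destination is
        # remapped; the destination then gets a fresh physical register.
        renamed_srcs = [self.map[s] for s in srcs]
        self.map[dst] = self.free.pop(0)
        return self.map[dst], renamed_srcs

rt = RenameTable()
# Two dynamic instances of "addi r5, r5, -1" from successive iterations:
# same architectural register, distinct physical registers after renaming.
print(rt.rename("r5", ["r5"]))  # ('p32', ['p5'])
print(rt.rename("r5", ["r5"]))  # ('p33', ['p32'])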
Problem 3: Coherence [30 points]
Part A: Single processor coherence [5 points]
A processor such as the PowerPC G3, widely deployed in Apple Macintosh systems, is primarily intended for use in uniprocessor systems, and hence has a very simple MEI cache coherence protocol. MEI is the same as MESI, except the Shared (S) state is omitted. Identify and discuss one reason why even a uniprocessor design should support cache coherence. Is the MEI protocol of the G3 adequate for this purpose? Why or why not? (Hint: think about Direct Memory Access (DMA))
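As a toy illustration of the hint (a sketch only; it models neither the G3 nor any real DMA engine): if a DMA device writes memory directly and the cache is not kept coherent, the CPU keeps reading a stale copy.

# Toy model of the DMA hazard: a non-coherent cache keeps serving a
# stale value after a DMA engine updates memory behind its back.
memory = {0x100: "old"}
cache = {0x100: "old"}  # the CPU loaded address 0x100 earlier

def cpu_read(addr):
    # Cache hit wins; memory is consulted only on a miss.
    return cache.get(addr, memory.get(addr))

memory[0x100] = "new"   # DMA writes memory directly, with no snoop
print(cpu_read(0x100))  # "old" -- stale data; this is why coherence matters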
Part B: MOESIF cache coherence protocol [10 points]
Many modern systems use cache-to-cache transfers to avoid the penalty of going off-chip for a memory access. The MOESIF cache coherence protocol extends the MESI protocol, with the semantics of the two additional states as follows:
- The O state indicates that the line is shared-dirty: multiple copies may exist, but the other copies are in the S state, and the cache that holds the line in the O state is responsible for writing the line back if it is evicted.
- The F state indicates that the line is shared-clean: multiple copies may exist in the S state, and this cache is responsible for supplying the line on a fill request.
Fill in the table below with the action taken on every event trigger. If nothing needs to be done, write in "Do nothing." If an event is invalid for a given state, write in "Error."
Event and Local Coherence Controller Responses and Actions (s' refers to the next state):

| Current State s | Local Read | Local Write | Local Eviction | Bus Read | Bus Write | Bus Upgrade |
|-----------------|------------|-------------|----------------|----------|-----------|-------------|
| Invalid (I)     |            |             |                |          |           |             |
| Shared (S)      |            |             |                |          |           |             |
| Forwarding (F)  |            |             |                |          |           |             |
| Exclusive (E)   |            |             |                |          |           |             |
| Owned (O)       |            |             |                |          |           |             |
| Modified (M)    |            |             |                |          |           |             |
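One convenient way to organize (and completeness-check) an answer is to encode the protocol as a transition table keyed by state and event. The Python scaffold below is a sketch under that assumption; the single entry shown is clearly correct in any MESI-family protocol and is only illustrative, not a supplied answer.

# Sketch: encode the protocol as (state, event) -> (next state s', action).
# Filling in all 36 cells, and flagging the ones still missing, becomes
# mechanical with this layout.
STATES = ["I", "S", "F", "E", "O", "M"]
EVENTS = ["Local Read", "Local Write", "Local Eviction",
          "Bus Read", "Bus Write", "Bus Upgrade"]

transitions = {
    # Illustrative entry only; the remaining MOESIF cells are the exercise.
    ("M", "Local Read"): ("M", "Do nothing"),
}

def lookup(state, event):
    # Anything not filled in yet is flagged, making missing cells easy to spot.
    return transitions.get((state, event), ("?", "TODO"))

for s in STATES:
    for e in EVENTS:
        print(f"{s:>2} / {e:<14} -> {lookup(s, e)}")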
Part C: Snoopy Coherence [5 points]
Assuming a processor frequency of 1 GHz, a target CPI of 2, a per-instruction level-2 cache miss rate of 1% per instruction, a snoop-based cache coherent system with 32 processors, and 8-byte address messages (including command and snoop addresses), compute the inbound and outbound snoop bandwidth required at each processor node.
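A sketch of the arithmetic, under the common assumptions that every L2 miss broadcasts exactly one 8-byte address message to all other nodes and that no other traffic counts:

# Snoop bandwidth sketch. Assumptions: one 8-byte broadcast snoop
# message per L2 miss; the 31 other nodes broadcast at the same rate.
freq_hz = 1e9        # 1 GHz
cpi = 2.0
miss_per_instr = 0.01
num_procs = 32
msg_bytes = 8

instrs_per_sec = freq_hz / cpi                      # 500M instructions/s
misses_per_sec = instrs_per_sec * miss_per_instr    # 5M misses/s

outbound = misses_per_sec * msg_bytes                   # this node's snoops out
inbound = misses_per_sec * msg_bytes * (num_procs - 1)  # others' snoops in

print(f"outbound: {outbound / 1e6:.0f} MB/s")  # 40 MB/s
print(f"inbound:  {inbound / 1e6:.0f} MB/s")   # 1240 MB/s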
Part D: Memory Consistency [10 points]
Consider a simple multicore processor using a snoopy MSI cache coherence protocol. Each processor has a single, private cache that is direct-mapped with four blocks each holding two words. The initial cache state of the system is shown in the figure below. To simplify the illustration, the cache-address tag contains the full address.
P0:

| Block | Coherency State | Address Tag |
|-------|-----------------|-------------|
| B0    | I               | 100         |
| B1    | S               | 108         |
| B2    | M               | 110         |
| B3    | I               | 118         |

P1:

| Block | Coherency State | Address Tag |
|-------|-----------------|-------------|
| B0    | I               | 100         |
| B1    | M               | 128         |
| B2    | I               | 110         |
| B3    | S               | 118         |

P2:

| Block | Coherency State | Address Tag |
|-------|-----------------|-------------|
| B0    | S               | 120         |
| B1    | S               | 108         |
| B2    | I               | 110         |
| B3    | I               | 118         |
Reads and writes will experience stall cycles depending on the state of the cache line:
- CPU read and write hits generate no stall cycles.
- CPU read and write misses generate Nmemory and Ncache stall cycles if satisfied by memory and by another cache, respectively.
- CPU write hits that generate an invalidate incur Ninvalidate stall cycles.
- A write-back of a block, due to either a conflict or another processor's request to an exclusive block, incurs an additional Nwriteback stall cycles.
The exact cycle count for each event is given in the table below:

| Parameter   | Cycles |
|-------------|--------|
| Nmemory     | 100    |
| Ncache      | 40     |
| Ninvalidate | 15     |
| Nwriteback  | 10     |
Sequential consistency (SC) requires that all reads and writes appear to have executed in some total order. This may require the processor to stall in certain cases before committing a read or write instruction. Consider the following code sequence:
write A
read B
where the write A results in a cache miss and the read B results in a cache hit. Under SC, the processor must stall read B until after it can order (and thus perform) write A. Simple implementations of SC will stall the processor until the cache receives the data and can perform the write.

Weaker consistency models relax the ordering constraints on reads and writes, reducing the cases in which the processor must stall. The Total Store Order (TSO, or Processor Order) consistency model requires that all writes appear to occur in a total order but allows a processor's reads to pass its own writes. This allows processors to implement write buffers that hold committed writes that have not yet been ordered with respect to other processors' writes. Reads are allowed to pass (and potentially bypass) the write buffer in TSO, which they could not do under SC.

Assume that one memory operation can be performed per cycle and that operations that hit in the cache or that can be satisfied by the write buffer introduce no stall cycles. Operations that miss incur the latencies stated above. How many stall cycles occur prior to each operation under both the SC and TSO consistency models for the cases listed below? Show your work; a correct answer without any work shown will receive no credit.
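As a worked illustration of this write A / read B pattern (a sketch: it assumes the write miss is satisfied from memory at Nmemory cycles and attributes the stall to the operation that waits, per the description above):

# Stall accounting for "write A (miss), read B (hit)" under SC vs. TSO.
# Assumption: the write miss is satisfied from memory (Nmemory = 100).
N_MEMORY = 100

def stalls(model):
    # Returns (stall cycles before write A, stall cycles before read B).
    if model == "SC":
        # read B cannot be ordered until write A performs, so the read
        # waits out the full miss latency.
        return (0, N_MEMORY)
    if model == "TSO":
        # write A retires into the write buffer; read B bypasses it and
        # commits from the cache immediately.
        return (0, 0)

print("SC :", stalls("SC"))   # (0, 100)
print("TSO:", stalls("TSO"))  # (0, 0)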
| Instructions                        | Stall cycles (SC) | Stall cycles (TSO) |
|-------------------------------------|-------------------|--------------------|
| P0: write 110 80 ; P0: read 108     |                   |                    |
| P0: write 100 80 ; P0: read 108     |                   |                    |
| P0: write 110 80 ; P0: write 100 90 |                   |                    |
| P0: write 100 80 ; P0: write 110 90 |                   |                    |
| P0: read 118 ; P0: write 110 80     |                   |                    |
Problem 4: Instruction Flow and Branch Prediction [30 points]
This problem investigates the effects of branches and control-flow changes on program performance for a scalar pipeline (to keep the focus on branch prediction). Branch penalties increase as the number of pipeline stages between instruction fetch and branch resolution (or condition and target resolution) grows. This effect of pipelined execution drives the need for branch prediction. This problem explores both static branch prediction in Part C and dynamic branch prediction in Part D. For this problem, the base machine is a 5-stage pipeline.
The 5-Stage Pipeline without Dynamic Branch Prediction
Execution Assumptions:
- Unconditional branches execute in the decode stage.
- Conditional branches execute in the execute stage.
- Effective-address calculation is performed in the execute stage.
- All memory access is performed in the memory-access stage.
- All necessary forwarding paths exist.
- The register file is read after it is written within a cycle.
The fetch address is a choice between the sequential address generation logic and the branch correction logic: if a mispredicted branch is being corrected, the correction address is chosen over the sequential address for the next fetch address.
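Under these assumptions (no prediction, redirect only when the branch resolves), the penalty in lost fetch slots is just the distance from fetch to the resolving stage. A small sketch:

# Branch penalty = number of stages between fetch and the stage where
# the branch resolves (assuming one fetch per cycle and redirect at
# resolution, per the assumptions above).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def branch_penalty(resolve_stage):
    return STAGES.index(resolve_stage) - STAGES.index("IF")

print(branch_penalty("ID"))  # unconditional branches resolve in decode
print(branch_penalty("EX"))  # conditional branches resolve in execute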
Part A: Branch Penalties. [2 points]
What are the branch penalties for unconditional and conditional branches?
Unconditional ______________ Conditional _______________
Part B: No Branch Prediction. [4 points]
This problem will use the insertion sort program. An execution trace, or a sequence of executed basic blocks, is provided for this problem. A basic block is a group of consecutive instructions that are always executed together in sequence.
Example Code: Insertion Sort

BB  Line#  Label   Assembly_Instruction     Comment
1   1      main:   addi r2, r0, ListArray   r2 <- ListArray
    2              addi r3, r0, ListLength  r3 <- ListLength
    3              add  r4, r0, r0          i = 0;
2   4      loop1:  bge  r4, r3, end         while (i < Length) {
3   5              add  r5, r4, r0          j = i;
4   6      loop2:  ble  r5, r0, cont        while (j > 0) {
5   7              addi r6, r5, -1          k = j - 1;
    8              lw   r7, r5(r2)          temp1 = ListArray[j];
    9              lw   r8, r6(r2)          temp2 = ListArray[k];
    10             bge  r7, r8, cont        if (temp1 >= temp2) break;
6   11             sw   r8, r5(r2)          ListArray[j] = temp2;
    12             sw   r7, r6(r2)          ListArray[k] = temp1;
    13             addi r5, r5, -1          j--;
    14             ba   loop2               }
7   15     cont:   addi r4, r4, 1           i++;
    16             ba   loop1               }
8   17     end:    lw   r1, (sp)            r1 <- Return Pointer
    18             ba   r1
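For readability, here is a Python rendering of the same algorithm, following the comment column above (a reading aid only, not part of the problem):

# Python mirror of the assembly above, using the names from the comments.
def insertion_sort(ListArray):
    i = 0
    while i < len(ListArray):            # loop1 (line 4)
        j = i
        while j > 0:                     # loop2 (line 6)
            k = j - 1
            temp1 = ListArray[j]
            temp2 = ListArray[k]
            if temp1 >= temp2:           # line 10: exit inner loop
                break
            ListArray[j] = temp2         # lines 11-13: swap and j--
            ListArray[k] = temp1
            j -= 1
        i += 1                           # line 15
    return ListArray

print(insertion_sort([3, 1, 2]))  # [1, 2, 3]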
Execution Trace: Sequence of Basic Blocks Executed:
1 2 3 4 5 7 2 3 4 5 6 4 5 6 4 7 2 3 4 5 6 4 5 7 2 3 4 5 6 4 5 6 4 5 7 2 8
[Hint: An alternate way to represent the same execution trace above is to use the sequence of branch instructions, both conditional and unconditional (i.e. ba), executed.]
1. Fill in the branch execution table with an N for not taken and a T for taken. This table records the execution pattern for each (static) branch instruction. Use the execution trace given above.
Branch Execution - Assume No Branch Prediction:

Branch Instruction Execution (dynamic executions of each branch, 1st through 10th):

| Branch Instruction No. (Line#) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|--------------------------------|---|---|---|---|---|---|---|---|---|----|
| 4                              |   |   |   |   |   |   |   |   |   |    |
| 6                              |   |   |   |   |   |   |   |   |   |    |
| 10                             |   |   |   |   |   |   |   |   |   |    |
| 14                             |   |   |   |   |   |   |   |   |   |    |
| 16                             |   |   |   |   |   |   |   |   |   |    |
| 18                             |   |   |   |   |   |   |   |   |   |    |
Use the branch execution table above to calculate the statistics requested in the following table.
Branch Execution Statistics:

| Branch Instr. No. | Times Executed | Times Taken | Times Not Taken | % Taken | % Not Taken |
|-------------------|----------------|-------------|-----------------|---------|-------------|
| 4                 |                |             |                 |         |             |
| 6                 |                |             |                 |         |             |
| 10                |                |             |                 |         |             |
| 14                |                |             |                 |         |             |
| 16                |                |             |                 |         |             |
| 18                |                |             |                 |         |             |
2. How many cycles does the trace take to execute (include all pipeline fill and drain cycles)? [Hint: you don’t need to physically simulate the execution trace, just compute the cycle count.]
3. How many cycles are lost to control dependency stalls?
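One way to organize the arithmetic for questions 2 and 3 (a sketch: the penalty values come from Part A and the event counts from your tables above; the call at the bottom uses placeholder names, not answers):

# Cycle accounting sketch for a 5-stage scalar pipeline with no branch
# prediction. Assumptions: 1 cycle per instruction once the pipe is full,
# 4 extra fill/drain cycles, and a penalty on every unconditional branch
# and on every TAKEN conditional branch (sequential fetch is otherwise
# correct).
def total_cycles(num_instructions, num_uncond, num_taken_cond,
                 pen_uncond, pen_cond):
    lost = num_uncond * pen_uncond + num_taken_cond * pen_cond  # question 3
    return num_instructions + 4 + lost, lost                    # question 2

# Example call, with the counts and penalties filled in from your tables:
# cycles, lost = total_cycles(num_instructions, num_uncond,
#                             num_taken_cond, pen_uncond, pen_cond)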