Final project

Download 96 Kb.

Date	31.01.2017
Size	96 Kb.
	#13201

FINAL PROJECT
Tomasulo with Reorder Buffer

Kevin Moody

Sheng Li

Gary Gong

Omid Barkhordarian
CS152 Spring 2003

Section: Th 4-6pm

TABLE OF CONTENTS

Abstract 1

Division of Labor 1

Overall Design 1

Register File 3

Issue Unit 3

Reservation Station 4

Arithmetic Functional Unit 8

Memory Functional Unit 8

Load/Store Address Computation Unit 8

Reorder Buffer 9

Common Data Bus Arbiter 14

Branch Prediction 15

Results 20

Conclusion 20

Appendix I Table of contents (Notebook) 22

Appendix II Links (Schematics) 28

Appendix III Links (Verilog Files) 29

Appendix IV Links (Test benches) 30

ABSTRACT

The goal of this project is to implement out of order execution with a reorder buffer. We talked about how to implement the processor extensively before we began to write our code. We made sure that we could implement it correctly before we choose to do the Tomasulo processor. An advantage of the Tomasulo processor is its scalability because we can add other functional units and branch prediction. The subcomponents of our Tomasulo processor contains a fetch/decode unit, functional units, reservation stations, reorder buffer a common data bus arbiter.
We devised a list of inputs and outputs for each module and what signals the modules output to one another. We also determine the internal data structure of the modules. We then split up the modules and each person has to test his/her own module. The key issue is timing. We made sure that each person knows what time other people’s module need an output. The biggest problem that faces us is timing issues because if there’s any miscommunication between the modules, we could miss a value that needs to be written back to the regfile or memory.
Back to the main table of contents

DIVISION OF LABOR
Alice: reorder buffer and common data bus arbiter

Kevin: functional units, reservation stations, fix Lab 6

Gary: register file and issue unit

Omid: branch and jump prediction

Alice & Gary: wiring up the top level

Alice & Gary & Kevin: debugging, debugging, and more debugging…..:(

Alice & Kevin: put together final report with everyone else’s parts
Back to the main table of contents

OVERALL DESIGN
Our design consists of four parts: issue, execute, broadcast result, and commit:

Issue: Dispatch an instruction is there is space in the reorder buffer and reservation station (RS). If there isn’t space, then we must stall.
Execute: If the operands are not available in the regfile, then we must monitor the common data bus (CDB) for the value to be broadcasted by a functional unit (FU). When both operands are ready, then the reservation station (RS) executes that instruction. (Execution may take more than one cycle. This leads to the possibility of instructions executing out-of-order.)
Broadcast result: When result is available, broadcast it on the CDB with the ROB number.
Commit: When an instruction reached the head of the ROB: If the result isn’t ready, then we must stall. When it is ready, we write the result to the regfile or memory. Also, at this point, if we predicted a branch incorrectly or we detect a load/store conflict (i.e. memory disambiguation; explained in more detail below).

Our choice of reservation stations is the most interesting aspect of the design:

Address calculation unit: We dedicate a RS to calculating the memory address for loads and stores. This simplifies the way the ROB distinguishes between an address and a write-back/store value. Therefore, we broadcast the address on a separate bus to the ROB and load unit.
Load unit: This unit executes load instructions, which will take a variable amount of time.
Arithmetic unit: This unit executes all arithmetic instructions in a single cycle.
Branch unit: See section on branch prediction (not implemented in final result).

Back to the main table of contents

REGISTER FILE
We modified the regfile to support the following concurrent accesses:

When an instruction is issued, the ROB number must be updated.
The ROB commits a value to the regfile.
The issue unit reads from the regfile when in the process of issuing an instruction (for more details, see the section on issue above).

So we divided the value and tag (ROB number) fields and have a valid bit to distinguish between the two. We also decided to do this in order to support flushing: When we flush, we must be able to recover the old value of the register, in case it was “overwritten” by a new ROB number. Using our scheme, we only need to set the valid bit to 1 (the value field is valid) when we flush.
Dealing with concurrent accesses to the same register: Our scheme handles writes to both value and ROB number fields easily. The difficult part is deciding what happens on a read of the same register. In this case, we will output the current value to be committed only when issuing ROB number is equal to the ROB number of this register. Otherwise, we will output the newly issued ROB number. This ensures that the output is the most recent value at any particular time.
Write backs: When the ROB wants to write back a value to the regfile, we first check to see if that register’s ROB number is equal to the ROB number of that instruction. If they aren’t equal, then we don’t write the value. This consistency check ensures that an old value doesn’t get written into the regfile, because a newer version is in the ROB.
Back to the main table of contents

ISSUE UNIT
This module fetches and disassembles instructions, and dispatches them to the ROB and the appropriate reservation stations. A series of actions are taken to issue on instruction (all in a single cycle):

Ask the regfile for the current value of the relevant registers to send to the ROB and RS.
Ask the ROB for the same values. This checks for the possibility that an instruction has been completed, but not yet written back into the regfile.
Choose between these values and send the disassembled instruction to the ROB and RS.
The module sends to the ROB:
- PC: to accommodate flushes
- RegDst: to know which register to write back into
- inst_type: to let the ROB know what type of instruction this is. To see more on the exact encoding, see the section that discusses the ROB below.
The module sends to the RS:
- Src1 & Src2: this is either a “real” value or a ROB number. If it is a ROB number, then it must wait for the value to be broadcasted. If they are both valid, then the instruction is ready to execute.
- Src1_valid & Src2_valid: to let the RS know if the Src1 & Src2 values are valid or ROB numbers.
- i_choose_u: to let the RS know that it should receive this instruction.

Branches and jump register: Before we integrated branch and jump prediction, this is how we handled branches and jump register: If the value is not ready, then we stall until the regfile shows the register as valid (i.e. not a ROB number).

This module also contains the instruction cache, which it uses to fetch instructions based on a handshaking protocol. To read more on the cache, see the Lab 6 report.
Back to the main table of contents

RESERVATION STATION
Our reservation stations had four slots. There were a total of three reservation stations, one for each of the functional units, in our final (dysfunctional) design. A special memory reservation station was designed for the memory functional unit, which contained the data cache. This was because of the difference in the signal indicating that the functional unit was done. The basic design of both types of reservation stations was four slots that could be sent to the functional unit in random order, or it could send straight from the issued values. The following depicts this action:

Issued Instruction

It would only send the issued value when none of the slots were ready. Sending to the functional unit meant selecting which inputs to register in the functional unit. This is easy to see from the various functional unit schematics. The reservation station provides the inputs to a set of registers, which are the values needed to complete the operation of an instruction in the functional unit. To get a better understanding of how this works, let me describe the inputs and outputs:

Inputs:
cdb_tag – the tag, or reorder buffer number, of the value on the common data bus

cdb_value – the value on the common data bus

decode_tag – the tag of the current instruction that is being issued to the reservation stations and reorder buffer

decode_src1_value – the value of the first source in the instruction that is being issued

decode_src2_value – the value of the second source in the instruction that is being issued

decode_src1_valbit – the valid bit for the first source that indicates if the source is an actual value or a tag

decode_src2_valbit – the valid bit for the second source

decode_cntrl – the control signals required for the instruction that is being issued

fu_tag_out – the output of the functional unit’s tag register

fu_cntrl_out – the output of the functional unit’s control register

fu_src1_value_out – the output of the functional unit’s first source register

fu_src2_value_out – the output of the functional unit’s second source register

clk – the processor clock

rst – the processor reset

issued_inst – the one bit signal that indicates that a instruction is being issued to that particular reservation station (i_choose_u signal in the overall design)

flush – the flush signal that clears the reservation station

ack – the acknowledge signal that tells the functional unit that it’s value has been accepted by the common data bus arbiter
Outputs:
fu_src1_value – output to the first source register’s input in the functional unit

fu_src2_value – output to the second source register’s input in the functional unit

fu_cntrl – output to the control register’s input in the functional unit

fu_tag – output to the tag register’s input in the functional unit

full – output to the instruction issue to indicate that the reservation station is full

done – output to the common data bus arbiter to indicate that the functional unit is done (this is different in the memory reservation station, as specified below)

Inserting an instruction into the functional unit:
The following four-state state machine controls sending values to the functional unit:

The four different states simply specify which slot was sent to the functional unit last. For example, using ready signals, the issued1 state will check to see if slot 2 is now ready because slot 1 was the last one issued. Then, it will check slot 3, then slot 4, then back to slot 1, and finally the issued instruction. The state machine simply sets an issue1, issue2, issue3, issue4, or issue_decode signal to indicate which one to issue. Then, it decides which slot to replace, using empty signals. The ready and empty signals are all generated from assign statements. A slot is considered ready if the functional unit is ready and the sources are ready. A functional unit is ready if it is done and the common data bus acknowledges it, or the functional unit tag is zero. The sources are ready if both valid bits are set, or one or both of the values are on the common data bus. A slot is considered empty if the tag is zero.

Updating the slots:
The common data bus tag and value are taken in so that each value can be updated. A value is updated if the valid bit is low and the value of that source matches the tag on the common data bus. If the slot is not being updated, the inputs are set to the outputs of the slot registers so that the values remain in that slot.
Taking in an instruction:
To make life easier on the overall design and several other components, I designed the reservation station to do a lot of error-checking here. If the instruction is issued to the reservation station (represented by issued_inst going high) and the tag does not match anything in the reservation station or functional unit (known as redundant_tag and should never occur, but I check for it anyways), then the instruction is considered decoded. If decoded_inst is high, then the state machine controller will attempt to place it somewhere. If no slots have been issued and none of the slots are open, it sets the full signal, which indicates that the reservation station is full and the instruction fetch and decode should stall. When there is a decoded_inst and a place to put the instruction, the common data bus is checked to see if the sources need that value. If they do, the reservation station ignores the decoded source values and retrieves this value for the matching source. This is determined by another assign signal that checks to see if the decoded source valid bit is low and the decoded source value matches the common data bus. This ensures that the reservation station does not miss anything from the common data bus, which will prevent deadlock where the reorder buffer fills up waiting for the reservation station to complete something that will never be done.
Arithmetic vs. Memory Reservation Station:
Since, our design incorporated a memory bus for memory address and the done signals for the other two functional units where finished in a cycle, I needed to modify the reservation station. I simply cut out the control and source two registers because the memory unit only needed an address. Also, I generated the done signal in the schematic and took it in as an input to the reservation station to determine when the functional unit was ready. See the memory unit for more details.
Testing:
I made an original reservation station, but it had gotten changed so much that I never went back and modified the tests. So, the following are example outputs from the dumber test file:

Back to the main table of contents

ARITHMETIC FUNCTIONAL UNIT
The arithmetic functional unit is the same basic unit as the execute stage of the pipeline. However, I did remove all of the forwarding because obviously this does not apply to Tomasulo. This functional unit used one of the arithmetic reservation stations to retrieve the inputs to the source, control, and tag registers. Then, it simply calculated the value and outputted the values to the common data bus arbiter. All the rest of the functionality, including the done signal, was taken care of by the reservation station. This unit was not tested individually and did not have a monitor module of its own. In the end, we found that it did not need any monitors either because there were no bugs.
Back to the main table of contents

MEMORY FUNCTIONAL UNIT
The memory functional unit included the data cache and the memory reservation station. Also, it was the only functional unit to produce its own done signal. It only needed two registers: tag and address. The done signal was produced with the use of the tag register output and the data cache stall signal. If there was a lw instruction, indicated by the tag being not empty, and the we weren’t stalling then the lw instruction was complete. Because of the timing with the stall signal, we had to register it so that the output was not read a cycle too early. This functional unit took several more inputs than the others because it had to handle the committing of the sw instructions. Also, this meant modifying the data cache to handle sw and lw instructions, simultaneously. This required several changes, which had not been anticipated to the data cache. Therefore, it was not tested after we fixed the issues in lab 6. Furthermore, once again, there was no testing for this unit alone because we needed to spend more time on the integration of the units.
Back to the main table of contents

LOAD/STORE ADDRESS CALCULATION UNIT
The load/store address unit was the most basic of them all. This unit included an adder, the arithmetic reservation station, and two register: source 1 and source 2. The only difference here was that the output was put on the memory address bus, instead of the common data bus. Therefore, the acknowledge signal was wired to Vcc.
Back to the main table of contents

REORDER BUFFER
The reorder buffer has eight slots and has four tasks and is implemented with pointers. The buffer has to fill a slot from a new instruction from issue; it has to pick values off of the common data bus to update a slot; when the head of the buffer has its write back value and/or the memory address, the reorder buffer commits the value to the regfile/memory; and lastly, the reorder buffer supplies values of registers for issuing.
Inputs:
PC_val .from issue (to fill in a new slot in the reorder buffer)

regDst………………..from issue (to fill in a new slot in the reorder buffer)

instr_type…………….from issue (to fill in a new slot in the reorder buffer)

src1_tag……………...from issue (to get latest value of reg from reorder buffer)

src2_tag…………...... from issue (to get latest value of reg from reorder buffer)

WB_val………….…...from common data bus (to update a slot in reorder buffer)

WB_index……….…...from common data bus (to update a slot in reorder buffer)

mem_addr……….…...from common mem bus (to update a slot in reorder buffer)

mem_index……….….from common mem bus (to update a slot in reorder buffer)

wr_buff_full………….from write buffer (if write buff full, don’t write into mem)

CLK…………….……from top level

RST………….……….from top level

Outputs:
ROB_index_issue........ to everyone (the slot number that reorder buffer fills)

ROB_index_cmt….…. to regfile (the slot number that reorder buffer commits)

store_data……….…… to data cache (the value to be stored in the memory)

store_addr…………... to data cache (the address of the value to be stored in mem)

WB_data………….…. to regfile (the value to be stored in the regfile)

WB_regDst……….…. to regfile (the reg# of the value to be stored in regfile)

Full……………….….. to issue (issue would stall until reorder buffer not full)

regWrite……………... to regfile (tells the regfile whether to commit a value)

memWrite…………… to data cache (tells the memory whether to store a value)

error…………………. to issue (tells issue to reset the PC)

src1_data…………….. to regfile (the latest value in the reorder buffer of a reg#)

src2_data…………….. to regfile (the latest value in the reorder buffer of a reg#)

src1_valid……………..to regfile (whether the value is valid)

src2_valid……………. to regfile (whether the value is valid)

PC_out………………...to regfile (if error is high, reset the PC to this PC)
Fields:
The reorder buffer has seven fields, the type of the instruction, the register destination of the instruction, the memory address of the instruction, the value that needs to be written back to the memory or regfile, the PC of the instruction, the done field and the error field.
Type:

There are four types of instruction: type branch/jump, type jal, type load word, type store word and all the arithmetic instructions go into the last type.

The register destination of store words and branches/jumps are set to zero.

Memory Address:

The memory addresses of arithmetic instructions are ignored.

Done:

The done field has two bits; the top bit is set to one when the write back value is present in the buffer and the lower bit is set to one when the memory address is present in the buffer.

Write Back value:

This field is ignored for branches/jumps.

Error bit:

If error bit of the “head” of the buffer is high, then the entire buffer is flushed.

PC:

This value is passed out with every committed instruction. If the error bit associated with the PC is high, then the issue unit needs to start fetching instructions from that PC.

All the fields are implemented with registers. There are a total of eight slots and each slot has seven fields, so there are 56 registers. Each register is named the filed name and the slot number. This way I don’t need to keep an index field.
Pointers:
The current pointer points to the “head” of the buffer, or the oldest instruction in the buffer; the next pointer points to the next slot in the buffer to be filled with a new instruction. The buffer is full when the next pointer plus one equals the current pointer. This implementation means that there are only seven slots that can be used at one time. The reason that I didn’t have a valid bit is because I separated putting in a new instruction into the buffer and committing a slot in the buffer in different always blocks so I can’t modify the valid bit in both always blocks. I combined committing a slot and picking values off of the common data bus in one always block and it’s huge so if I combine all three parts together in an always block, it would be very hard to debug. The pointers keep track of which slot to fill and which slot of commit.

CP

NP

Filling in a slot in the reorder buffer:
The reorder buffer gets three fields from the issue unit when writing a new instruction into the buffer. It receives the type of the instruction, the register destination of the register and the PC of the instruction. These three fields in the reorder buffer do not change after issue. An instruction is a NOP is both the type and the destination are equal to zero. The reorder buffer only fills a new slot that is a non-NOP instruction. The new instruction resides in the slot pointed by the next pointer and the pointer advances to the next slot. The reorder buffer sends out the index number where the new instruction is filled to all the functional units and the regfile.
Picking off values from the Common Data Bus and Memory Address Bus:
When the reorder buffer sees a value from the common data bus that doesn’t have index zero, then it fills in the value in the slot with the index number. The reorder buffer sets the top bit of the done field in the updated slot to one. The reorder buffer also pick values off of the memory address bus because all the memory addresses of loads and stores are calculated in a separate unit that doesn’t go through the common data bus. If the index associated with the memory address on the bus is nonzero, the reorder buffer fills in the value in the memory address field of the indexed slot. It sets the lower bit of the done field to one. For store words, issue sends the address calculation unit the base addresses and offset and it also sends a fake add instruction to the arithmetic unit that adds the register to be stored in memory to zero.
Committing a slot in the Reorder Buffer:
When the slot pointed to by the current pointer has all the values the instruction needs for write back and the error bit of that slot is low, the reorder buffer commits the value to the regfile or memory.
Type arithmetic instructions:

The reorder buffer checks to makes sure that the top bit of the done field (the write back done) is set to one. If so, the slot is committed to the regfile. The regWrite signal is set to high and WB_data and WB_regDst are set to the WBval and RegDst of the slot memAddr of the slot that’s being committed.

Type load word instructions:

The reorder buffer checks to make sure both bits of the done field are high and that the error bit is low. If so, the slot is committed to the regfile. If the error bit is high, the instruction is not committed and the buffer is flushed. The regWrite signal is set to high and WB_data and WB_regDst are set to the WBval and RegDst of the slot memAddr of the slot that’s being committed.

Type store word instructions:

The reorder buffer checks to make sure both bits of the done field are high and that the write buffer is not full. It also checks if there are any load word instructions with the same memory address. If the load word received the load value from the common data bus (the high bit of the done field is high), then the error bit for that load word instruction is set to high. The reorder buffer also checks that if there is a load word with the same address as the store word and from the common data bus, there’s something arriving with the same index number as the load word. If that’s the case, the error bit of the load word is also set to high. The reason that I check the index from the common data bus is because if the index equals to the index of the load word instruction, then the data associated with that index has to be the loaded value and has been broadcasted to all the functional units. Therefore we need to flush everything and start over. The memWrite signal is set to high and mem_data and mem_addr are set to the WBval and memAddr of the slot that’s being committed.

Type branches/jumps instructions:

The reorder buffer automatically commits branch/jump instructions because these instructions don’t need any values from the common data bus or the memory address bus. Both the memWrite and regWrite signals are set to low

Type jal instructions:

The reorder buffer sets the destination register to $31 and puts the PC+4 in the write back field of the slot. The regWrite signal is set to high and WB_data and WB_regDst are set to the WBval and RegDst of the slot memAddr of the slot that’s being committed.

To give the latest values of registers to issue:
A potential problem with our design is that completed calculations from the functional units are only broadcasted once over the common data bus. The reorder buffer updates a slot with the value from the common data bus (let’s say that the destination register for that instruction is $2 and the index associated with that instruction is index #5), but that slot does not get written back to the regfile until all the older instructions before it gets committed. The issue unit can issue an instruction that depends on $2, so it gets $2 from the regfile. Since the value that writes back to $2 hasn’t gotten to the head of the buffer yet, it’s not committed to the regfile. Therefore in the functional unit, the instruction that depends on $2 will have the index #5 in the source field and it’s going to wait for the value of index #5 from the common data bus. Because the common data bus broadcasted the value for $2 (or index #5) already, the new instruction in the function unit would never complete. Therefore when the issuing unit issues a new instruction, it checks for the newest value of a register from the reorder buffer.
Testing methodology:
To test the reorder buffer, I made a test bench and faked the inputs from the issuing unit, the common data bus, the memory calculation unit and the write buffer. I tested putting new instruction into the buffer first to make sure that it only puts in valid instructions (no NOPs) and that when the buffer is full, no new instructions goes in. I make sure that the reorder buffer issues the correct index number to everyone.
When that works, I start to give it values from the common data bus and the memory address calculation unit to make sure that the right slots are updated and the done field is updated accordingly.
Afterwards, I test the commit part of the reorder buffer and make sure the commit index numbers are outputted correctly. I first test the non-store word instructions because they don’t deal with memory disambiguation. When non-store word instructions are committing correctly, I start of test the various scenarios for committing a store word. The following are the different scenarios that can happen with a store word at the head of the buffer:
Scenario #1:

There are no load word instructions in the write buffer. No error.

Scenario #2:

There are load word instruction/s in the write buffer and the address of the load word does not equal to the address of the store word that we’re currently committing. No error.

Scenario #3:

There are load word instructions/s in the write buffer and the address of the load word matches the address of the store word that we’re currently committing. The load word has not received the load value from memory yet and the common data bus is not currently broadcasting the load value of the load instruction. No error.

Scenario #4:

There are load word instructions/s in the write buffer and the address of the load word matches the address of the store word that we’re currently committing. The load word has not received the load value from memory yet and the common data bus is currently broadcasting the load value of the load instruction. Set the error register for the load word instruction to one.

Scenario #5:

There are load word instructions/s in the write buffer and the address of the load word matches the address of the store word that we’re currently committing. The load word has received the load value from memory. Set the error register for the load word instruction to one.

I then test for the last function of the reorder buffer—to give issue the latest value of the reorder buffer slots. I output the WB field of a slot and the valid bit is set to the upper bit of the done field.
Back to the main table of contents

COMMON DATA BUS ARBITER
The common bus arbiter issues out time slots in a round robin fashion. The order starts from data from the branch unit, to data from the load unit and lastly to data from the arithmetic unit. When none of the functional units have any data to broadcast, the common data bus arbiter outputs a tag with the value zero. Since zero is not a valid tag value, no value in the functional units or the reorder buffer would update any values backed on the index zero.

State Transition Diagram of the arbiter:

The init state waits for a request signal to go high. If more than one request signal goes high, the priority of who gets to get broadcasted depends on the state. In the init state, branch get priority, load is second and arithmetic last. In the branch state, load gets priority, arithmetic is second and branch third. In the arithmetic state, branch get priority, load is second and arithmetic is last. When the arbiter chooses to broadcast a value from a functional unit, it also sends an acknowledgement signal to that functional unit.
Testing methodology:

There are a total of 32 test cases. For each state, there are 2^3 possibilities for branch request, load request and arithmetic request combinations and since there are four states, that’s 8*4 test cases. I test each of the possible test cases to make sure that the state machine transitions to the correct state and outputs the tag and value of the correct request.

Back to the main table of contents

Branch Prediction
Conditional branches are major obstacles to achieve higher performance for a high performance CPU. Accurate branch prediction is required to overcome this performance limitation imposed on high performance architectures and is the key to many techniques for enhancing and exploiting Instruction Level parallelism (ILP). Many different branch prediction schemes have been proposed. Most of these works has been based on benchmark programs including SPEC89 and SPEC92.
Our Branch Prediction Implementation:
We use a combination of:

State machine and look up table

Inputs:

EN_branch…… this should solve the problem when ever the branch can be resolved in branch unit .It should notify ROB so why not use that to give branch prediction to tell him that we know the answer to an specific branch which it's address is on PC_branch and the answer (should have taken or not taken on should_take_branch)

PC_IF…………this PC_IF is PC form IF stage(unit)
PC_branch…….Comes form branch unit every time that a branch can be solve.

Should_take_branch…comes form branch unit. Basic idea is when the branch unit can resolve both registers then it should tell the prediction unit that if that branch should have been taken or not. In order to figure out which branch, is resolved. It passes the PC_branch address of that branch instruction. The next cycle if PC_branch stays the same it means that we took that branch twice or more (in a loop) This is very important, in the next clock cycle b/c there is no way for me to now that the PC_branch the PC that comes form branch unit) is actually the same branch instruction and we have not resolved anything else or simply it's the same branch in a loop that know we can resolve in one clock cycle, I have to relay on the branch unit, so the branch unit should make EN_branch high when it resolved any branch.

Outputs:

take_branch…………….. The Prediction that if we should take the branch or (High or Low)

Error_miss_prediction….. This is the address of the branch that should have been taken/not taken. This is zero if there is no error
Why did we choose this approach?

It provided us a custom history of the branches that has already been executed.
It also provided us the accuracy of the two state machine prediction: For example with a 1k lookup table we can look at history of every i individual branch in 4k program space.
We can easily expand this scheme based on number of registers that we can dedicate. (it grows by the factor of two)

The main components:

Table of registers (Number of rows is divisible by two).
Each row in the table holds 2 bits which constructs our state machine.
Our table is made of 1024 two entries.

A brief description of how it works:
We use the PC address of the branch instruction to map it to our look up table.
Example:

0x80000020 beq $3,$4,0X8

(What we know PC[1]=PC[0]=0 word aligned )

So we can use PC[12:2] to find the corresponding branch state machine. Something very close to direct map but without tags.

In the same clock cycle we can find the corresponding row in our look up table look at the value of our 4 state state-machine (2bits) and decided if we want to take the branch or not.

If the value of the state machine is zero then we predict not to take the branch. If anything else (01,10,11) we will take the branch. In our example this will correspond to (10’b0000001000). Because the instruction address of beq is 0x80000020 and we don’t use the first two bits. We look at the 2 bit value of state-machine and make our prediction.

Why do we use Look up table?
1. It gives us a custom history of branches in our code with reasonable range.

2. We can change this range based on the average characteristic of our programs.

3. It provides the easiest way to incorporate the two bit state-machine and individual history with a minimum hardware requirement.

4. If we would use any kind of associatively; finding the correct state machine in our look-up-table in one clock cycle, would require a 2 bit comparator for every row and n bit MUX ( in this case 1024 to one) to tell us if there was a match (very expensive and very slow) Then we should have drastically limit our look-up-table size which would mean that we would lose our ability to keep a reasonable history of our branches.

More details about look-up-table:
Each row is actually made two bit registers. This means that if we increment our table size, since its power of two every time we increase the size of our table we need

(n^2)*2

registers. This is really a drain in our resources.
Example:

In our 1024 Row table we are using 2048 registers.(2 bits state)

Expanding our table means we need to increase our number of registers from 1024 to 2048 so we would need 4096 registers.
How could we solve this problem?
Replace this array of registers with memory. But this won’t really work.
Let’s assume this scenario:

Our ID stage is dispersedly waiting for prediction unit. At the same time the branch unit resolved an out standing branch and wants to tell the prediction unit if it made the correct prediction.

There are two units; both of the units want to talk to the same unit at the same time. And we don’t want to make either of them wait. Single ported memory won’t work.
What is another practical solution to this problem?
One of the easiest ways to deal with this is to use a dual ported memory.

First Port is used to tell the IF unit if it should take the branch or not (branch prediction).

Second port is what will actually update the states based on the branch unit input (if we made the correct prediction or not).
Is this legal? I don’t know, so we didn’t use it.
A brief review of our Branch prediction:

We look at the PC address; find the according row in our look-up table. Based on the state of our state-machine on that specific row we make our prediction.

When the branch unit solved the corresponding branch it will tell us if we actually had to take the branch or not.

If we (miss) predict we use the PC from branch unit to update the corresponding state-machine in our look up table.

At the same time we let the branch unit know that we miss predicted (or not ) and branch unit will use the common data bus (CDB) to let the Reorder buffer knows that the branch has been resolved and the prediction was (was not) successful. If the prediction is wrong the ROB will flush when branch is at the bottom of ROB.

Testing methodology:
We choose some random inputs and look at the prediction that our branch prediction unit provided as output. Then we resolved some of the inputs and told the branch prediction unit that if it should have predicted to take the branch or not and the branch prediction found out if it miss predicted or not.

The next time that we provided some of the inputs randomly again the prediction was correct due to the fact that the state machine had a correct state based on the previous history that it had.

Jump Prediction:
The only prediction that we need is for Jr. so let’s consider this facts:

Usually Jump registers goes back to where it came from!! (JAL)

This means jumps can not have any dependence on other jumps. Any kind of global history does not make any sense.

We need to keep individual of current jumps.

Implementation:

8 Rows look-up-table. It works like a buffer.
32 bit address of the jump is stored with its latest jump address. The probability that it goes back to the same address is very high.
If the address is zero, there is no previous history. There won’t be any prediction and we will wait till it’s resolved.

What would I do if I had more time?

Implement the global history in the branch prediction unit.

Look at other ways to implement branch prediction.

Since we did all the prediction, why not look at Data prediction!!

Back to the main table of contents

RESULTS
Our final project results are:
Critical Path: 133.676ns
Design Summary

--------------

Number of Slices: 8,124 out of 19,200 42%

Number of Slices containing

unrelated logic: 0 out of 8,124 0%

Total Number Slice Registers: 6,651 out of 38,400 17%

Number used as Flip Flops: 5,012

Number used as Latches: 1,639

Total Number 4 input LUTs: 11,450 out of 38,400 29%

Number used as LUTs: 11,447

Number used as a route-thru: 1

Number used as 16x1 ROMs: 2

Number of bonded IOBs: 117 out of 512 22%

Number of Tbufs: 32 out of 19,520 1%

Number of Block RAMs: 56 out of 160 35%

Number of GCLKs: 3 out of 4 75%

Number of GCLKIOBs: 1 out of 4 25%

Total equivalent gate count for design: 1,045,982

Additional JTAG gate count for IOBs: 5,664
Back to the main table of contents

CONCLUSION
We put in a lot of time and effort, but our design still did not work. We are pretty sure that the bug lies somewhere in our implementation of jumps. We passed all of the corner cases, except where it incremented the $ra. We could not seem to figure out the bug there, otherwise we would have passed all tests in simulation. We were unable to add the branch prediction and jump prediction that Omid attempted. Also, we were unable to product traces because somehow or memory-mapped IO person (Omid) decided to leave that out of Labs 6 and 7. In conclusion, we learned that communication is key to a group successfully working. We couldn’t simply go back and redo old modules. Therefore, it was important to have all partners working on integration and documentation.
Back to the main table of contents

APPENDIX I TABLE OF CONTENTS

NOTEBOOKS

Kevin

Alice

Gary

Omid

Back to the main table of contents

KEVIN’S NOTEBOOK

May 4^th

I worked all weekend on lab 6 trying to get the cache to work properly. It seems to be fine now. We had some minor bugs. There is still one bug in the memory-mapped IO that Gary said he would fix.

May 5^th

Gary and I started researching Tomasulo and spent a little time discussing the implementation over the phone and online. I started writing the reservation station because I figure it is a very crucial part.

May 6^th

Gary and I spent several hours discussing our implementation of Tomasulo and trying to figure out the hang-ups and details. We think there should be three units: arithmetic, branch, and load/store. Also, we think we might as well do a reorder buffer because it will make the design more modular. Otherwise, the common data bus arbiter has to be much smarter or the issue or something. Also, common data bus arbiter is going to be a simple round-robin design. I think Alice can do this. I drew up a state diagram for her. Since, she wants to get started right away. My reservation station is coming along. We discussed how stuff works with Alice.

May 7^th

Gary and I spent several more hours trying to figure out the specifics. We decided now to have a memory address bus and a load/store address unit that will simply calculate the memory address for the reorder buffer and the load/store memory unit. This bus doesn’t require anything special. We still don’t know how to handle memory disambiguation. We discussed the implementation later with Alice.

May 8^th

All four of us met for several hours to discuss the final implementation. Unfortunately, Omid left early, so he may not know what is going on. He is supposed to do the functional units.

May 9^th

Finished my reservation station and testbench. It work….yeah!

May 10^th-Today

I did a lot debugging and discussing and debugging and the functional units and debugging….and now the report

Please see Gary’s and Alice’s entries for more…

Back to the Appendix I table of contents

ALICE’S NOTEBOOK
May 8^th

Start writing the CDB arbiter and its test bench. I didn’t have many problems with the code and debugging went smoothly because there weren’t many bugs. I finished the writing and testing of the module in around four hours.

May 9^th

Start writing the reorder buffer. A lot of the code were copying and pasting so I made quite a few errors by forgetting to modify the index of a register and that caused many problems. After writing the reorder buffer code, I start to write the test bench. It took more around seven hours to write the code. I tested the code for around an hour and went home.

May 10^th

I continued to test the reorder buffer. It took me around five hours to fully test my reorder buffer to the point where as a stand alone module, it works as expected. Gary then told me that I need to add two inputs and four outputs to the reorder buffer. The two inputs are the register number of the source registers of the instruction that the issuing unit issues and the four outputs are the current write back values in my reorder buffer (if they’re present) and whether they’re valid. I continued to test the module and the part that I added worked fine.

May 11^th

Gary and I start to put together our modules to test how they work with one another. We put together the issuing unit, the register file and the reorder buffer. We faked the outputs of the functional units to the reorder buffer. We had a problem with one of Gary’s signals because it changes on the negative edge of the clock and that messes up the enable to my current pointer and next pointer. We had to register a few of the signals that Gary’s issuing unit sends to me so everything changes on the positive edge of the clock. It took us about 5 hours to load instruction from the issuing unit correctly to the reorder buffer.

May 12^th

I synthesized the reorder buffer code and had to complete sensitivity lists. Adding signals to my sensitivity list caused a lot of problems and it took me and Gary around 4 hours to fix all the mistakes and still be able to load signals correctly from the issuing unit to the reorder buffer. There are also problems because I modified a reg in more than one always block so the module wouldn’t synthesize. It took me around 4 hours to move the signals around the always blocks and combining always blocks so a signal is modified in only one always block.

May 13^th

Once everything seems to be working properly with the reorder buffer, issuing unit and register file, Gary and I started to wire up the rest of the modules, including Kevin’s functional units. It took us around 3 hours to wire up everything because the memory I/O stuff was kind of confusing and Omid (who wrote mem I/O) never came to lab to help us. It took us around an hour to debug all the wiring mistakes. We were still issuing instructions and putting them in the reorder buffer correctly, the Kevin’s functional units never returned any values to update the slots in the reorder buffer. So we emailed Kevin and told him to look at his reservation station when he has time.

May 15^th

I didn’t work on the processor yesterday because I had an E190 paper due and so did Kevin. More debugging…..The issuing unit seems to be branching to the wrong address because of the signals that Gary has to register in order for everything to change on the positive edge of the clock. It took us around 3 hours to fix the problem so that we’re branching to the right address. Since Kevin is busy with CS169 project, we did all the debugging that we could without going into his modules so we finished for the day.

May 16^th

Still more debugging…..Me, Kevin and Gary together trying to fix problems in the code. We had all kinds of timing issues where a signal does not stay high for long enough. We spend around eight hours debugging.

May 17^th

Debugging even more. There seems to be a lot of problems with the cache and the write buffer that we didn’t fix from the last lab because even though we’re pulling instructions correctly, we seem to be storing things incorrectly. Around ten more hours of debugging for Kevin, Gary and I.

May 18^th

Gary, Kevin and I debugged a little more for around 4 hours but Gary and I had to stop because we have finals on Tuesday and Wednesday to study for. Gary and I studied in the second floor lounge in Cory so when Kevin has questions about our modules, he would come up and ask questions and we would go down stairs with Kevin to help him debug. But generally, the problem was with the memory system.

Back to the Appendix I table of contents

GARY’S NOTEBOOK
May 9

I spent a few hours creating the issue module. Essentially, I merged the IF and ID stages with the control module from Lab 6. This makes the most sense because the control already disassembles the instruction, and we only need to decide which reservation station we need to send it to. I labeled all the inputs and outputs to this module so you guys know how to connect it with your own modules. Actually, I’ll be in lab all this week to test this module and also work on putting everything together.

May 10

I expanded regfile.v to support the new fields that we are storing. So now we can write a ROB#, the actual value, and the valid bit to distinguish between the two. This could be a problem because it’s possible that we have two writes and a read happen concurrently. I need to discuss this with Kevin to decide what the correct behavior is. Otherwise, the changes were fairly straightforward, and everything seems to be working in the testbench I wrote to test this module.

Please see Kevin and Alice’s entries for more…
Back to the Appendix I table of contents

OMID’S NOTEBOOK
Friday May/10/2003 2:35pm
I will do the branch prediction it was different form what I have though.

Finally after talking to Kevin I have a clear idea of what they need.

Friday May/10/2003 3:55pm

I will design a 4 state state-machine and a 1024 row look up table in this way

I use PC as an input to find the corresponding state machine in my table. Bit form 2 to 12 should do it.

Saturday May/10/2003 9:50 am

I ‘m working on the branch unit there is no know problem since lat night that I was working on it. It looks as if it works fine.

Monday May/12/2003 8:50 pm

I’m working on the jump prediction unit. 8 rows very close to fully associative.
Tuesday May/12/2003 12:07 pm

I’m designing the registers that we need for the branch unit.

Tuesday May/12/2003 2:31 pm

Me and Kevin thought we can use the control unit to distinguish between the 4 different branches the problem is that they control unit are the same for beq and bne and for bgz and bltz so I had to decode the whole branch instruction again.

Wednesday May/14/2003 8:50 pm

I’m making the power point presentation for branch prediction unit and jump prediction unit.

Thursday May/14/2003 12:05 pm

I had 14 slides and I can only use 4 maximum so busy redoing slides!!!

Back to the Appendix I table of contents
APPENDIX II LINKS

SCHEMATICS

Top Level mini_top.sch

Arithmetic Functional Unit arith_unit.sch

Memory Functional Unit mem_unit.sch

Load/Store Address Calculation Unit ld_st_addr_unit.sch
All other schematic files are virtually the same as Lab 6. Please check the project zip folder with the project named “reorder_buffer”.
Back to the main table of contents

APPENDIX III LINKS

VERILOG FILES

Reorder Buffer reorder_buf.v

Arithmetic Reservation Station arith_reserv_stat4.v

Memory Reservation Station mem_reserv_stat4.v

Issue Unit issue.v

Common Data Bus Arbiter cdb_arbiter.v

Branch Prediction branch_predict.v…….Branch_reg_unit.v

(If you cannot open a link, please manually open the file in the directory that contains this document and associate the file extension with a program)

Back to the main table of contents
APPENDIX IV LINKS

TEST BENCHES

Reorder Buffer reorder_buff_tf.tf

Branch Prediction branch_prediction_tbw.tbw

Common Data Bus Arbiter cdb_arb_tf.tf

(*Warning – this is a very large file. We are truly sorry but all of our testing was done with corner.s. Therefore, it wouldn’t have turned out very pretty.)

Overall Testing Output for dumber.s…………………….lab7_transcript.txt

(If you cannot open a link, please manually open the file in the directory that contains this document and associate the file extension with a program)

Back to the main table of contents

Download 96 Kb.

Share with your friends: