Practical 9 Pipelined hazard resolution

Objectives

This section is not a list of tasks for you to do. It is a list of skills you will have or things you will know after you complete the practical.

Following completion of this practical you should be able to:

Implement data hazard resolution in a pipelined processor by employing write-before-read, data forwarding, and stalls.
Use waveform diagrams to debug a pipelined processor implementation with instructions in all stages of the pipeline.
Use Verilog test benches and a testing framework to test a processor implementation.

Guidelines

Because you will be iteratively adding functionality to one processor module, we strongly recommend that you periodically add and commit your progress to git as a backup.

Time Estimate: This practical will take approximately 6-9 hours per student, varying depending on your familiarity with Verilog and the pipelined architecture covered in class.

Preliminary Tasks

You will be working in the same groups as Practical 8, so you should use the same repository: your RISC-V-pipelined-processor repository. This also means you should be continuing to use the same .mpf file that you created for the last practical.

Obtain the worksheet.

The general sequence for this practical is (1) try out the tests and see how they work, (2) implement data forwarding, then (3) add stalling.

Run the hazards tests

During this practical, you will gradually be fixing data hazards for R-types until you've fixed them all; then you will look at other types of hazards to fix.

To begin, open up the file in test_asm/datahaz/test_datahaz_x2.asm and read the test code (and comments) provided.
(Q) On the worksheet, answer the first question about the need for forwarding in a pipelined processor.
Open up the tb_Pipe_hazards.v test bench and scroll to the bottom. Notice that there is a sequence of test tasks, most of them commented out, much like in the last practical; the first one (test_no_hazard_detection()) is the only one uncommented.
Scroll up to the implementation of test_no_hazard_detection() and observe that it (and many other tasks) simply check that the final states of the registers are correct. Answer the question in the worksheet about these tests.
Review check_data_hazard_general() to ensure it will work with your pipelined processor implementation. (Here, "work" does not mean "pass"; it means that you understand the code and can see where it will fail in your current implementation.) It uses the same shortcuts in pipeline_test_tools.vh that you may have edited for Practical 8, so hopefully there won't be much to change.
Open the ModelSim project you created for Practical 8 and add tb_Pipe_hazards.v to the project. Compile it and simulate this test bench. Fix any bugs or errors until you can get test_no_hazard_detection() to pass its test. (Note that you may not pass this test if you have already implemented the write-then-read behavior. Consider your answer to the 1.4 question on the worksheet.)

Write then read

Once you've passed the check_data_hazard_general() tests, comment it out in the test bench's main initial block.
In that same initial block, uncomment the test_write_then_read_hazard_detection() task and the call to CLEAR_PIPE() that follows it. (See comments in that block)
Compile and run the test bench in ModelSim. It might fail if you've not implemented write-before-read in your datapath. That's ok!
Figure out how to make your reg file write before it reads
- hint: consider when you should write to the register file so it can be read at the right time (but before the pipeline stage registers get written).
Once you get this test to pass, answer the next question on the worksheet: Describe the process you plan to follow to incrementally address data hazards in your pipeline for R-type instructions. If you’re not sure what process to follow, review the comments in the ASM file (test_datahaz_x2.asm) and the Test Bench (tb_Pipe_hazards.v).
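
One common way to get write-before-read is to write the register file on the falling clock edge while the pipeline stage registers capture on the rising edge. The sketch below illustrates that idea only; the module name, port names, and the assumption that your stage registers are rising-edge triggered are all hypothetical and may not match your Processor.

```verilog
// Hypothetical register file sketch (names and widths are assumptions).
module RegFileWBR (
    input  wire        clk,
    input  wire        RegWrite,
    input  wire [4:0]  rs1,
    input  wire [4:0]  rs2,
    input  wire [4:0]  rd,
    input  wire [31:0] writeData,
    output wire [31:0] readA,
    output wire [31:0] readB
);
    reg [31:0] regs [0:31];
    integer i;

    // Start with every register cleared so simulation has no X values.
    initial begin
        for (i = 0; i < 32; i = i + 1) regs[i] = 32'b0;
    end

    // Write at the falling edge (mid-cycle), assuming the pipeline stage
    // registers capture on the rising edge. x0 is never written.
    always @(negedge clk) begin
        if (RegWrite && (rd != 5'd0))
            regs[rd] <= writeData;
    end

    // Combinational reads: a value written mid-cycle is visible before the
    // next rising edge latches the read data into ID_EX.
    assign readA = (rs1 == 5'd0) ? 32'b0 : regs[rs1];
    assign readB = (rs2 == 5'd0) ? 32'b0 : regs[rs2];
endmodule
```

An alternative with the same effect is to keep a rising-edge write and bypass writeData to the read ports whenever rd matches rs1 or rs2. Either way, check the timing in your waveform rather than trusting the sketch.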

Data forwarding

Uncomment the next test in the test bench (test_WB_to_EX_fwd()) and compile then run the test bench again.
(Q) On the worksheet, write some pseudocode that describes how you will detect the need to forward data to one of the two register operands (A or B) when an instruction in EX needs data from WB.

Recall that forwarding happens when there is a data hazard between a register being written by one instruction and a second instruction that reads the same register before the writer puts it in the register file. We pass this "to be written" data from one pipeline stage to another to compensate for the fact that the writing happens too late when these two instructions are too close together in the pipeline.

Create a forwarding unit module and add it to your Processor.

Tip: forwarding unit module shape

     module ForwardingUnit (
         input  wire [6:0] opcode,  // decides whether rs1 / rs2 need checking
         input  wire [4:0] rs1,     // checked for a dependency on rd
         input  wire [4:0] rs2,     // checked for a dependency on rd
         ... // fill in remaining input values
         output reg  [1:0] ALUSrcA, // controls ALU source A mux
         output reg  [1:0] ALUSrcB  // controls ALU source B mux
     );

Get the first forwarding (WB -> EX) working before you try to address the other conditions.

Handle WB -> EX forwarding
- SUGGESTION: connect some outputs from the MEM_WB pipeline stage register and from the ID_EX pipeline stage register to determine whether the hazard exists, then create an output that will control a mux to use forwarded data (from MEM_WB) or the standard data from the EX cycle.
- uncomment the test for this in the test bench, and update your forwarding unit and Processor accordingly.
Handle MEM -> EX forwarding
- uncomment the test for this in the test bench, and update your forwarding unit and Processor accordingly.
Tip: suggestions for implementing forwarding

Recall the forwarding logic we covered in lecture. Forwarding data into the EX stage might come from MEM or WB.

Generally, this is the plan:
1. Any forwarding from MEM is prioritized over forwarding from WB since MEM is "newer" data.
2. The forwarding unit in EX will choose data from the ALUOut value in MEM instead of A if:
  - The instruction in MEM is writing rd and the instruction in EX has read the same register as rs1.
  - and the instruction in EX has an opcode for an R-type, I-type, S-type, or SB-type.
3. The forwarding unit in EX will choose data from the ALUOut value in MEM instead of B if:
  - the instruction in MEM is writing rd and the instruction in EX has read the same register as rs2.
  - and the instruction in EX has an opcode for an R-type, S-type, or SB-type.
4. The forwarding unit in EX will choose data from whatever is going into the register file (either ALUOut, PC+4, or MemOut) instead of A if:
  - The forwarding unit is not forwarding from the MEM stage into A
  - and the instruction in WB is writing rd and the instruction in EX has read that register as rs1
  - and the instruction in EX has an opcode for an R-type, I-type, S-type, or SB-type.
5. The forwarding unit in EX will choose data from whatever is going into the register file (either ALUOut, PC+4, or MemOut) instead of B if:
  - The forwarding unit is not forwarding from the MEM stage into B
  - and the instruction in WB is writing rd and the instruction in EX has read that register as rs2
  - and the instruction in EX has an opcode for an R-type, S-type, or SB-type.
As bulky as the logic for the forwarding unit may feel, you can leverage the behavior of if and else statements to prioritize MEM forwarding over WB forwarding. If done correctly, the forwarding logic will be much more approachable.
At the end of this step, your test bench should run and pass the following tests, in sequence:
- test_write_then_read_hazard_detection()
- test_WB_to_EX_fwd()
- test_MEM_to_EX_fwd()
Add, commit, and push your code changes to git. Be sure to add your assembled versions of the asm files.
(Q) On the worksheet, answer the questions about implementing and testing forwarding, and whether your forwarding worked the first time you tried.
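
The forwarding plan in the tip above could be sketched roughly as follows. This is one possible shape, not the reference solution: every port name other than opcode, rs1, rs2, ALUSrcA, and ALUSrcB is hypothetical, the 2-bit select encoding is an assumption, and the rd != x0 check is an addition that the tip does not mention (x0 is hardwired to zero in RISC-V, so it should never be forwarded).

```verilog
// Sketch only: EX_MEM_* / MEM_WB_* names and the select encoding are assumed.
module ForwardingUnitSketch (
    input  wire [6:0] opcode,          // opcode of the instruction in EX
    input  wire [4:0] rs1,             // rs1 of the instruction in EX
    input  wire [4:0] rs2,             // rs2 of the instruction in EX
    input  wire [4:0] EX_MEM_rd,       // rd of the instruction in MEM
    input  wire       EX_MEM_RegWrite, // instruction in MEM writes rd
    input  wire [4:0] MEM_WB_rd,       // rd of the instruction in WB
    input  wire       MEM_WB_RegWrite, // instruction in WB writes rd
    output reg  [1:0] ALUSrcA,
    output reg  [1:0] ALUSrcB
);
    // Standard RV32I opcodes.
    localparam OP_R    = 7'b0110011;
    localparam OP_I    = 7'b0010011;
    localparam OP_LOAD = 7'b0000011;
    localparam OP_JALR = 7'b1100111;
    localparam OP_S    = 7'b0100011;
    localparam OP_SB   = 7'b1100011;

    // Assumed mux encoding: 00 = register value, 01 = from WB, 10 = from MEM.
    localparam SRC_REG = 2'b00;
    localparam SRC_WB  = 2'b01;
    localparam SRC_MEM = 2'b10;

    // Which source registers the instruction in EX actually reads.
    wire usesRs1 = (opcode == OP_R) || (opcode == OP_I) || (opcode == OP_LOAD) ||
                   (opcode == OP_JALR) || (opcode == OP_S) || (opcode == OP_SB);
    wire usesRs2 = (opcode == OP_R) || (opcode == OP_S) || (opcode == OP_SB);

    wire memWrites = EX_MEM_RegWrite && (EX_MEM_rd != 5'd0);
    wire wbWrites  = MEM_WB_RegWrite && (MEM_WB_rd != 5'd0);

    always @(*) begin
        // The if / else-if order gives MEM (newer data) priority over WB.
        if (usesRs1 && memWrites && (EX_MEM_rd == rs1))
            ALUSrcA = SRC_MEM;
        else if (usesRs1 && wbWrites && (MEM_WB_rd == rs1))
            ALUSrcA = SRC_WB;
        else
            ALUSrcA = SRC_REG;

        if (usesRs2 && memWrites && (EX_MEM_rd == rs2))
            ALUSrcB = SRC_MEM;
        else if (usesRs2 && wbWrites && (MEM_WB_rd == rs2))
            ALUSrcB = SRC_WB;
        else
            ALUSrcB = SRC_REG;
    end
endmodule
```

Assigning both outputs in every branch keeps the always block purely combinational, so no latches are inferred.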

Stalling the Pipeline

Recall that stalls happen when there is a data hazard and the data is not yet available. Commonly this happens when an instruction follows a lw and depends on what the lw loads.

(Q) On the worksheet, answer the question about the need for stalls.

Adding the `lw` stall

Examine test asm file test_datahaz_lw.asm, then assemble it.
Create a hazard detection unit module
Add logic to the new module that creates a stall when lw is in EX and the next instruction will use its rd value (see page 322 in the textbook)
- This should be handled in the decode stage
- hint: There is a special case for UJ and U types that follow a lw: they don't use register sources and don't need to stall!
  Suggestions for implementing the Hazard Detection Unit
  
  Here are some things to consider while implementing the HDU:
  - This reads rs1 and rs2 (source reg numbers) from the decode stage, and also the opcode to determine if those registers are getting read
  - It compares those to the rd register in the EX stage, including the MemRead and RegWrite signals from EX, to determine if the instruction in EX is writing rd.
  - It then does three things when a stall is needed:
    1. Turns off PCWrite so the instruction in IF stays there.
    2. Disables writing to the IF_ID pipeline stage register, so the instruction in ID stays there.
    3. Inserts a "bubble" into EX by writing all zeroes into the ID_EX register control bits.
  This effectively separates the instructions that were in ID and EX so that during the next cycle they are in ID and MEM (and there's a nop in EX)
Uncomment and run our tests (test_lw_stall())
(Q) On the worksheet, answer the questions about your hazard detection unit and about the feasibility of stalling every instruction once.
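
The suggestions above for the hazard detection unit might translate into something like the following sketch. All module and port names here are hypothetical, and the active-high ID_EX_Bubble output is an assumption about how your ID_EX register zeroes its control bits.

```verilog
// Sketch only: port names and signal polarity are assumptions.
module HazardDetectionSketch (
    input  wire [6:0] ID_opcode,    // opcode of the instruction in decode
    input  wire [4:0] ID_rs1,
    input  wire [4:0] ID_rs2,
    input  wire [4:0] EX_rd,        // rd of the instruction in EX
    input  wire       EX_MemRead,   // instruction in EX is a load
    input  wire       EX_RegWrite,  // instruction in EX writes rd
    output wire       PCWrite,      // 0 holds the instruction in IF
    output wire       IF_ID_Write,  // 0 holds the instruction in ID
    output wire       ID_EX_Bubble  // 1 zeroes the ID_EX control bits
);
    // Standard RV32I opcodes.
    localparam OP_R    = 7'b0110011;
    localparam OP_I    = 7'b0010011;
    localparam OP_LOAD = 7'b0000011;
    localparam OP_JALR = 7'b1100111;
    localparam OP_S    = 7'b0100011;
    localparam OP_SB   = 7'b1100011;

    // U and UJ types match neither list, so they never cause a stall.
    wire usesRs1 = (ID_opcode == OP_R) || (ID_opcode == OP_I) ||
                   (ID_opcode == OP_LOAD) || (ID_opcode == OP_JALR) ||
                   (ID_opcode == OP_S) || (ID_opcode == OP_SB);
    wire usesRs2 = (ID_opcode == OP_R) || (ID_opcode == OP_S) ||
                   (ID_opcode == OP_SB);

    // Load-use hazard: the load in EX writes a register that ID reads.
    wire loadUse = EX_MemRead && EX_RegWrite && (EX_rd != 5'd0) &&
                   ((usesRs1 && (ID_rs1 == EX_rd)) ||
                    (usesRs2 && (ID_rs2 == EX_rd)));

    assign PCWrite      = ~loadUse;
    assign IF_ID_Write  = ~loadUse;
    assign ID_EX_Bubble = loadUse;
endmodule
```

The three outputs correspond to the three actions listed above: hold the PC, hold IF_ID, and insert a bubble into EX.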

Forwarding into `sw`

Special care needs to be taken for sw since it requires both the forwarded B value (from rs2) and the imm. Most other instructions only require one or the other. sw may also require a stall if a lw precedes it.

Ensure that your implementation forwards B into EX and carries that into the MEM stage of the pipeline while the ALU uses the imm value instead of B.
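
A minimal sketch of that arrangement is shown below: the forwarding mux comes first and its output is what gets saved for the store, while a second mux picks the immediate for the ALU. All names are hypothetical, and splitting the selection into a 2-bit forward select plus a separate useImm signal is an assumption; your datapath may fold both into a single ALUSrcB mux.

```verilog
// Sketch only: signal names and the two-mux split are assumptions.
module ExOperandBSketch (
    input  wire [31:0] ID_EX_B,        // rs2 value read in decode
    input  wire [31:0] EX_MEM_ALUOut,  // value forwarded from MEM
    input  wire [31:0] WB_writeData,   // value forwarded from WB
    input  wire [31:0] ID_EX_imm,      // sign-extended immediate
    input  wire [1:0]  fwdB,           // 00 = register, 01 = WB, 10 = MEM
    input  wire        useImm,         // 1 for instructions such as sw
    output wire [31:0] forwardedB,     // saved in EX_MEM as the store data
    output wire [31:0] aluInB          // second ALU operand
);
    // First mux: resolve the rs2 dependency.
    assign forwardedB = (fwdB == 2'b10) ? EX_MEM_ALUOut :
                        (fwdB == 2'b01) ? WB_writeData  :
                                          ID_EX_B;

    // Second mux: the ALU uses imm for the address, but forwardedB still
    // travels to the MEM stage untouched.
    assign aluInB = useImm ? ID_EX_imm : forwardedB;
endmodule
```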

Test sw. Read and assemble test_datahaz_sw.asm.

Performance issues when forwarding into sw

In the implementation we covered in lecture, which is also the one the test bench checks, sw’s rs2 dependencies are resolved in the EX stage by forwarding the correct B value into the EX stage, then saving it in the EX_MEM pipeline stage register.

While there is another variation that forwards the value directly into the MEM stage from WB, that variation will require additional forwarding checks that we opted to remove for the sake of simplicity.

Our variation of resolving sw in the EX stage does occasionally cause a stall (a lw followed by a dependent sw), but that is a performance sacrifice we are making for simplicity of implementation.
Uncomment and run our tests (test_sw_forwarding()). Note that one of the tests causes a stall in addition to forwarding. See the above performance note for details.
Fix any errors that you need to make those tests pass. (You may have to update your forwarding unit.)
(Q) On the worksheet, answer the question about forwarding from lw to sw.
Now is a great time to commit your changes to git. Include any assembled versions of the asm files.

Handling Additional Hazards

There are some additional forwarding cases you should resolve that are not covered by the testbenches we provide.

Forwarding into Branches

Since branches are resolved in the decode stage, there are additional edge cases to consider. For example:

add x5, x8, x9 ; F D X m w    // x5 calculated in execute stage
beq x5, x0, L  ;   F d X M W  // needs x5 in decode stage

Notice that add only has x5 ready for forwarding in the MEM and WB stages, but beq needs x5 in the decode stage. This means stalling is required.

Work with your team to determine how this can be resolved. Note that there are two strategies. One approach optimizes performance but has more datapath complexity, while the other optimizes datapath consistency at the cost of lower performance. Explore both to see which one is most approachable for you and your team.

There is no test provided to you to test this. You should write your own .asm test, assemble it, and write your own testbench task to verify you are resolving this correctly.

(Q) After resolving and testing the add to beq hazard, answer the questions on the worksheet and take a screenshot of your tests running in ModelSim. Be sure to make the screenshot clearly show how you know the tests are working as intended.
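
As a starting point for discussion, here is a sketch of the simpler of the two strategies: hold the branch in decode while any instruction in EX or MEM is going to write one of its source registers, and rely on write-before-read once the producer reaches WB. It assumes no forwarding path into the decode stage, and all names are hypothetical. The higher-performance strategy would replace part of this stall with a forwarding path into decode.

```verilog
// Sketch only: stall-only branch hazard handling, names are assumptions.
module BranchStallSketch (
    input  wire       ID_isBranch,   // instruction in decode is an SB-type
    input  wire [4:0] ID_rs1,
    input  wire [4:0] ID_rs2,
    input  wire [4:0] EX_rd,
    input  wire       EX_RegWrite,
    input  wire [4:0] MEM_rd,
    input  wire       MEM_RegWrite,
    output wire       stall          // 1 holds PC and IF_ID, bubbles ID_EX
);
    // A producer still in EX has not computed (or loaded) its result yet.
    wire exHazard  = EX_RegWrite && (EX_rd != 5'd0) &&
                     ((EX_rd == ID_rs1) || (EX_rd == ID_rs2));

    // A producer in MEM has not reached the register file yet.
    wire memHazard = MEM_RegWrite && (MEM_rd != 5'd0) &&
                     ((MEM_rd == ID_rs1) || (MEM_rd == ID_rs2));

    assign stall = ID_isBranch && (exHazard || memHazard);
endmodule
```

Under these assumptions the add followed by beq pair above costs two stall cycles, and the same logic also covers a lw followed by a branch.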

Forwarding lui immediate and LinkAddr from MEM

With the base forwarding datapath that we set up in lecture, the MEM stage forwards ALUOut, whereas the WB stage forwards the value after the MemToReg mux (in order to forward the correct value if the instruction is a lw, lui, jal, or jalr).

However, if the forwarding value needs to come from lui, jal, or jalr and the instruction is in the MEM stage, the datapath can only forward ALUOut. This is an issue, especially for lui+addi combos that we saw in the first two weeks of lecture:

li t0, IMM[31:0]              ; li t0, IMM pseudoinstruction decomposition
; .... turns into: ....
lui t0, IMM[31:12] + IMM[11]  ; F D X m W   // fwd EX_MEM.imm
addi t0, t0, IMM[11:0]        ;   F D x M W // but EX_MEM.imm is not setup

There's a similar problem for the link address being written by jal or jalr.

You will need to set up some datapath structure to handle this. This should not be overly complicated as you can re-use an existing control signal to make this almost trivially simple.
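
One possible structure is a small mux in the MEM stage that picks the value to forward using the same select that later drives the write-back mux. The sketch below assumes a 2-bit MemToReg encoding and that imm and PC+4 are carried in the EX_MEM register; both are assumptions about your datapath, and all names are hypothetical.

```verilog
// Sketch only: the MemToReg encoding and EX_MEM field names are assumed.
module MemForwardSelectSketch (
    input  wire [1:0]  EX_MEM_MemToReg,
    input  wire [31:0] EX_MEM_ALUOut,
    input  wire [31:0] EX_MEM_imm,
    input  wire [31:0] EX_MEM_PCplus4,
    output reg  [31:0] MEM_fwdData     // value forwarded from MEM into EX
);
    // Assumed encoding of the write-back select.
    localparam WB_ALU = 2'b00;  // R-type and I-type ALU results
    localparam WB_MEM = 2'b01;  // lw (data not available yet in MEM)
    localparam WB_PC4 = 2'b10;  // jal / jalr link address
    localparam WB_IMM = 2'b11;  // lui immediate

    always @(*) begin
        case (EX_MEM_MemToReg)
            WB_IMM:  MEM_fwdData = EX_MEM_imm;
            WB_PC4:  MEM_fwdData = EX_MEM_PCplus4;
            // lw cannot be forwarded from MEM (that case stalls), so ALUOut
            // is a harmless default for both WB_ALU and WB_MEM.
            default: MEM_fwdData = EX_MEM_ALUOut;
        endcase
    end
endmodule
```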

There is no test provided to test this. You should write your own .asm test, assemble it, and write your own testbench task to verify you are resolving this correctly.

(Q) After resolving and testing this hazard, answer the questions on the worksheet and take a screenshot of your tests running in ModelSim. Be sure to make the screenshot clearly show how you know the tests are working as intended.

Write and run bigger tests (programs!)

Examine the following code:

// Array A's memory location is in x5
int[] A = {1, 2, 3, 4, 5};
int idx = 0;
while(idx < 5) {
    A[idx] = A[idx] + 1;
    idx = idx + 1;
}

Write RISC-V for this in an .asm file in the test_asm folder.
- Use comments to explain what you're doing.
- Add, commit, and push it to git.
- To initialize the array, it is ok to pick an address in memory and put the integers in your assembled .txt file there. (You don't need to write RISC-V instructions to do that).
- To initialize x5 to have the address of A, load the address as an immediate (remember lui and addi? Or maybe you have an assembler that supports pseudoinstructions like li?) in your code.
- idx can be any register of your choice and does not need to be stored in memory.

Open tb_Processor_Program.v in VS Code and observe how it loads a .txt file and runs the program in that file.
Make a copy of the testProgramA() task in the test bench and modify the copy to run the code you wrote above.
- HINT: you can use CHECK_MEM() to check contents of memory in your test bench. Do this to see what the array values are after the program runs.
- HINT: testProgramA takes an argument and an expected result; you can remove those from your copy for this test.
For the last practical, you used your relPrime and gcd program. Assemble your code again for those procedures into something that your processor can run, but this time do not add the nop instructions to eliminate hazards. Put that code in the test_asm folder in your git repo, replacing the code you added nop instructions to.
- Add, commit, and push your assembly (.asm file) and the assembled code (.txt file).
You should already have a copy of the testProgramA() task in the test bench that will run your relPrime program.
- Verify that your newly assembled versions (without nops to reduce hazards) run and produce the right answers.
(Q) On the worksheet, explain how you plan to test that relPrime works; specifically, how will you pass the input argument to your program from the test bench, and how will your test bench know when the program has finished running (so it can check the result)?
- There are many ways to do this; think about the Input/Output lecture from class for a few ideas, or think about how you could tell that the program is done by inspecting a register or the PC.
Test your relPrime program on your processor with many inputs, including at least these three:
- relPrime(6) = 5
- relPrime(5040) = 11
- relPrime(30030) = 17
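
One way a test bench can tell that a program is done is to watch the PC until it reaches a known halt address, with a cycle limit as a safety net. The self-contained sketch below shows only that waiting pattern. The tiny always block is a stand-in for your processor, and HALT_PC, MAX_CYCLES, and run_until_halt are hypothetical names; in your own test bench you would watch your Processor's PC instead and pick the address where your program parks itself.

```verilog
`timescale 1ns/1ps
// Sketch only: the "processor" here is a stand-in that just advances a PC.
module tb_wait_for_halt_sketch;
    localparam [31:0] HALT_PC     = 32'h0000_0020; // assumed halt-loop address
    localparam integer MAX_CYCLES = 1000;          // assumed safety limit

    reg        clk = 1'b0;
    reg [31:0] pc  = 32'b0;       // stand-in for the processor's PC
    integer    cycles = 0;
    reg        timed_out = 1'b0;

    always #5 clk = ~clk;

    // Stand-in behavior: PC advances by 4 until it parks at HALT_PC.
    always @(posedge clk)
        if (pc != HALT_PC) pc <= pc + 32'd4;

    // Run the clock until the PC reaches HALT_PC or the limit is hit.
    task run_until_halt;
        begin
            cycles = 0;
            timed_out = 1'b0;
            while ((pc != HALT_PC) && !timed_out) begin
                @(posedge clk);
                #1;                       // let nonblocking updates settle
                cycles = cycles + 1;
                if (cycles >= MAX_CYCLES) timed_out = 1'b1;
            end
        end
    endtask

    initial begin
        run_until_halt;
        if (timed_out)
            $display("FAIL: no halt after %0d cycles", cycles);
        else
            $display("halted at pc=%h after %0d cycles", pc, cycles);
        $finish;
    end
endmodule
```

The cycle count collected this way is also a convenient number to report when you compare runtimes later in this practical.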

Design a new instruction

Similar to what you did in practical 6, your last task is to design a new instruction and implement it in your pipeline. You need to provide clear documentation for how it will work, and justify its inclusion in the instruction set.

As you plan your design, you should consider inventing an instruction that makes relPrime run faster (this generally would combine multiple instructions into one new instruction).

(Q) Document the design and format (in the practical worksheet) and explain how you plan to resolve hazards in the pipeline.
- maybe add a stage to support extra work
- or stall the pipeline
- or add more hardware to existing stages
(Q) Explain how you expect the new instruction to impact the performance of your processor.
Implement your design.
Run relPrime with your new instruction (you'll have to rewrite relPrime; make sure you keep both versions in your repository).
(Q) Compare the two runtimes (number of cycles for each run, before and after your new instruction)

BONUS: Implement Memory-Mapped Input/Output (MMIO)

We discussed I/O in class; one way of implementing I/O is memory-mapped I/O (MMIO). For extra points on this practical, you can implement MMIO. You will need to write a test bench to show that it works. If you do this, you need to do the following:

Add a datapath drawing to the worksheet which shows the modifications for MMIO.
Put a Test Plan (following the format from previous practicals) together to show that I/O works.
Include a clear screenshot of a waveform in your worksheet that shows that the IO succeeded. You should annotate this waveform to indicate key events (e.g. point an arrow at a signal when an input number gets into a register.)

Full credit will only be awarded if you communicate how this works sufficiently in your worksheet. The graders will not look at your code for this problem.

This is a challenge problem, so there is less support for it. You are expected to take ownership if you want to complete this challenge.

Working Ahead

Take a look at Practical 10 if you want to work ahead. This is mostly creating a presentation.

Submission and Grading

Functional Requirements

At the end of the practical you should have done these things:

Implement data forwarding in Processor.v and pass the following test bench tasks:
- test_write_then_read_hazard_detection
- test_WB_to_EX_fwd
- test_MEM_to_EX_fwd
- test_sw_forwarding
Implement pipeline stalls in Processor.v and pass the following test bench tasks:
- test_lw_stall
Handle additional hazard cases and create test tasks for them:
- lw->branch stalling
- Forwarding into branches
- Forwarding from lui and jumps
Run relPrime(5040) without artificially-added nop instructions
OPTIONALLY implement MMIO
Complete and submit the Practical Worksheet.

Git Requirements

Remember: do not add and commit every single file ModelSim creates. Only add, commit, and push .v, .do, .asm, .txt, and .mpf files.

In addition to the list below, you should regularly commit and push whenever you fix a bug, work to a stopping point, or make any incremental updates. You must have at least 5 commits in your repo for this practical:

Git commit 1: upon completion of data forwarding
Git commit 2: upon completion of stalling (because of lw)
Git commit 3: upon completion and testing of additional hazard cases
Git commit 4: upon completion and testing of relPrime
Git commit 5: upon completion and testing of your new instruction

Since this is a team-based practical, there should be numerous iterative commits from each team member.

Worksheet Requirement

All the practicals for CSSE232 have these general requirements:

General Requirements for all Practicals

The solution fits the need
Aspects of performance are discussed
The solution is tested for correctness
The submission shows iteration and documentation

Some practicals will hit some of these requirements more than others. But you should always be thinking about them.

(Q) Complete the practical worksheet and write your final git commit on the worksheet where required.

Final Checklist

Verify that your code compiles and your tests pass (or at least run).
Verify your Verilog code is committed and the commits are pushed to GitHub.
Submit your completed worksheet to Gradescope.

Grading Breakdown

Practical 9 Rubric items	Possible Points	Weight
Worksheet	86	52%
Code	80	48%
Total out of		100%