CIS 351

Lab 10: Pipeline

Winter 2021

This lab was designed by Prof. Wolffe

Pre-lab

Please have questions 1 - 5 answered when you arrive in lab.

Overview

The purpose of this lab is to become more familiar with the concept of pipelining, as found in modern CPU architectures. Specifically, the lab involves the use of a graphical pipeline simulator. This tool will be used to further your understanding of how a pipeline works, how it is implemented, and how it deals with the problems introduced by using this high-performance technique.

Activities

Work your way through the following brief tutorial.
Submit a lab report that includes answers to the numbered questions and your annotated source code.

Resources

There is an on-line User Manual available for the visualization tool used in this lab.

A graphical pipeline simulator

The name of this simulation tool is DLXview. Its purpose is to graphically illustrate the operation of the pipeline implemented in the DLX architecture. DLX is the name Patterson and Hennessy gave to the hypothetical machine and instruction set we have been studying. (Although, I don't think they use the DLX name in the current edition of the textbook.)

Operating DLXview

To start DLXview, simply type dlxview at the command prompt.

Configuring the system

The first step is to configure the machine to simulate a specific pipelined architecture. Feel free to experiment with alternative configurations. However, this lab utilizes only the default mode with default parameters. See the user manual for a description of the various configuration parameters available.

Click the configure button. A pop-up window will present a set of scheduling modes.
Select basic pipeline. A pop-up window will present a set of configuration parameters appropriate for the pipeline mode you have selected.
Accept the default configuration by clicking OK.
Again, click OK.

Loading an executable

Once the system has been configured, a file can be loaded for execution.

Download the assembly program file exampleDLX-1.s into your current directory.
Click the load button. A file selection window will pop up.
Select the file (exampleDLX-1.s) by clicking on it.
Click load.
Click done.

Running a program

To execute a program:

Click the step forward button to begin execution.; A pop-up window will request a starting address.
Click the default button.
Click the step forward button or the next cycle button to progress through the assembly program. Stepping advances execution forward to the next instruction. Cycling advances the simulation by one clock cycle. These are usually, but not always, the same thing (i.e., most instructions take a single cycle). The step back and previous cycle buttons will also be very helpful in observing and understanding pipeline operation.

How many cycles does it take to fill the pipeline? (Notice that DLXview labels the cycles beginning with zero. To avoid ambiguity, fill in the blank: "In DLXview, the first cycle during which all the pipeline stages are busy is labeled _____.")
How many cycles does it take for the computer to execute the first instruction completely?
What pipeline stage is the instruction and r6, r7, r8 in during cycle number 5 (i.e., the cycle labeled 5)?
During which cycle does the the processor begin computing the instruction and r12, r13, r14? (Make sure your answer clearly states "the xth cycle" or "the cycle labeled x").
During which cycle does the the processor finish computing the instruction and r12, r13, r14? (Make sure your answer clearly states "the xth cycle" or "the cycle labeled x").

Views

The top half of the display window is an instruction timing diagram, calibrated in clock cycles. Each instruction is uniquely colored to make it easier to follow its progress through the pipeline. In addition to the timing diagram, DLXview provides two views of the pipeline in operation. The button near the bottom of the screen allows you to toggle between the two viewing modes.

the block diagram shows the entire pipeline as labeled stages
the integer datapath presents a detailed device-level view of the pipeline

The DLXview implementation of the DLX architecture

DLXview is a fairly straightforward implementation of the DLX architecture. Recall the differences between a pipelined processor and the simpler processor we described in class:

Pipeline latches have been added between datapath stages, and named appropriately. Think of them as "holding" the results of the previous stage of the pipeline (left) and forwarding those results on to the next stage of the pipeline (right).
Additional wiring has been added to efficiently distribute results. For example, multiple transfer wires emanate from the Write Back stage and serve as input (feedback) to previous stages of the pipeline. This loop back technique is known as data forwarding and is used to solve some of the timing problems associated with pipeline hazards.
Branches are resolved in the Instruction Decode stage. Hence the necessary logic has been moved into that stage of the pipeline.

A simple pipelined program

The assembly program loaded earlier (exampleDLX-1.s) is a very simple program with no data dependencies, branches, or other potential problems. The code is well-documented and should be easily understood. Take some time now to execute the program, familiarizing yourself with the different views and the operation of the simulator.

Pay particular attention to the register file. Notice how it indicates which registers are being read from and written to. Also pay close attention to the muxes. They help indicate which of the many values generated in the previous stage are actually being used in the current stage. (In other words, watching the lines through the muxes shows you where the inputs to the ALU are coming from.) Stepping backward and forward will help you understand how an individual instruction progresses through the pipeline.

Note: we are only concerned here with the operation and control of the pipeline. In other words, we don't care what data values are actually in the registers or what the results of the operations are. It is possible, however, to initialize registers to a desired value using a *.i file. See the User's manual for additional information.

Note: it is possible at any time to change the operation of the simulator by editing the source code (see the file selection window).

Let's trace through the execution of a single instruction. The first instruction in the example program loads a word from memory into a register. Recall that DLX data addressing uses the indexed method. The word to be read from memory is located at the address contained in the instruction itself (i.e. the constant value 0), offset (or indexed) by the value contained in register R2. The word read from memory is to be loaded into register R1.

In the first phase of the instruction cycle, the instruction is fetched (note the color: it's the yucky brown instruction in the pipeline).
In the second phase, the instruction is decoded (although this is not illustrated, the opcode is being used to specify a load operation). In addition, any necessary values are read from the register file. In this case, the value of register R2 is read because it is needed for subsequent memory addressing. The simulator indicates the register read by displaying a colored R2 in the box representing the register file.
After executing the third cycle, the effective address of the operand has been calculated. Note the paths drawn through the two muxes in the Execute stage. Tracing back the path in the upper mux indicates a value originating in the register file -- this represents the contents of R2. The path through the lower mux traces back to the sign-extended immediate operand datapath -- this represents the constant 0 from the instruction.
In the fourth cycle the memory word is read. This is not really evident until the next phase.
The last stage is the Write Back phase. This stage always uses a value generated in the previous stage. In this case, the instruction was a load from memory into a register. Examining the register file shows that data is being written to register R1. (Values displayed near the top of the register file are read values, the value displayed near the bottom represents a write operation). Tracing back the wire representing the register file write operation reveals the origin of the value being written -- the data memory.

Trace through the execution of the other instructions in the program until you are confident you understand the operation of the pipeline.

Which registers are being read during cycle 5 (the cycle labeled 5)?
Which registers are being written during cycle 5 (the cycle labeled 5)?
Where does input to the main ALU come from during cycle 2 (the cycle labeled 2)?
Where does input to the main ALU come from during cycle 3 (the cycle labeled 3)?
What is the purpose of all of the nops at the end of the sample program?

Structural hazards

A structural hazard is another term for a resource conflict. Resource conflicts (we will discuss this in more detail in class) refer to contention for a specific functional unit. Take a moment to study the integer data path of the DLX pipeline.

It is possible that the Instruction Fetch stage might be accessing memory at the same time as the Memory Access stage. One stage is reading an instruction and the other is reading a data value. Only one word can be read from a memory unit at time. This is an example of a resource conflict.

How does the DLX pipeline architecture resolve this conflict?

Locate the adder in the Instruction Decode stage. Trace its backward and forward connections to understand its operation well enough to answer the question:

What purpose does this adder serve?

Data hazards

The next type of hazard is a data hazard, which always involves some type of data dependency. In a typical instance, the current instruction requires some operand that is being generated by the preceding instruction (e.g. a load or an ALU operation). Program exampleDLX-2.s contains such a dependency. Download the program and study the code. Load and run the program and study its execution.

Identify all pairs of instructions that have a data dependency. In particular identify all pairs of instructions (not necessarily adjacent) where the result of the second instruction depends directly on the result of the first.
Describe in your own words how the DLX hardware addresses this dependency problem. Your answer should be precise enough to convince me that you understand the mechanism used.
Trace through the progress of the fourth instruction in a manner similar to that used in the section above ("A simple pipelined program"). There is a "visual bug" or discrepancy that you will notice as you trace through the execution of this program. Identify it.
Step to cycle 4. Use your favorite screenshot tool to capture the main DLXview window. Print the snapshot and highlight the set of wires that shows how the result of sub r3, r4, r5 is routed directly to the main ALU.

Some data dependencies span several instructions. Program exampleDLX-3.s demonstrates this problem in cycle 5, when the first add instruction is writing register r3 during the same cycle that the second add instruction is reading r3.

Which value of r3 will be read if the DLX processor used the register file provided for Project 4? If the wrong value would be read, describe how you would properly coordinate the reading and writing of the registers.

Not all data dependencies can be solved by forwarding. Program exampleDLX-4.s contains an example of a type of hazard that forwarding cannot eliminate. Download the program and study the code. Load and run the program and study its execution.

Examine all of the instructions in the example program and identify their dependencies. Notice that the add instruction immediately follows a load instruction. This is an example of a dependency immune to forwarding. Execute the program using cycles instead of stepping. Refer to the timing diagram and notice the presence of stalls in the pipeline. A stall is a delay cycle, or bubble, inserted into the pipeline. When this type of data dependency (called a load data hazard) is detected, hardware known as a load interlock will delay subsequent instructions to resolve the timing problem. Study the operation of this program until you understand both the problem and the solution.

Why must the add operation be delayed one cycle? Your answer should consider timing issues and functional units. Be sure to explain why forwarding cannot solve the problem.
Describe how hardware can detect a load data hazard.

Control hazards

The final type of hazard involves control transfer -- branches and jumps. Program exampleDLX-5.s implements control transfer in the form of a loop. Download, study, and execute the code until you understand its operation. The comments in the code should help you understand what the loop is doing.

The DLX hardware resolves branches in the Instruction Decode stage of the pipeline. Instead of a beq instruction that uses subtraction in the ALU to compare two registers, the DLX hardware has bez and bnz instructions that branch if the specified register is equal or not equal to 0 respectively. These new branch instructions allow the hardware to determine by the end of the second cycle whether to take the branch.

Notice that the instruction immediately following the branch instruction is a nop. This nop is called a branch delay slot and is typically inserted by the compiler.

Why is the branch delay slot necessary? In other words, what would go wrong if we removed the nop?
Suppose you had a smarter compiler. Explain what it could put in the branch delay slot instead of a nop.
Write an optimized assembly program incorporating your code adjustments. Your solution must still contain a loop.
How many cycles does the original program take to assign the final result to register R6? (In other words, during which cycle is the instruction add r6, r4, r5 in the "write-back" phase?)
How many cycles does your optimized assembly program take to do the same work?

Updated Saturday, 27 March 2021, 8:01 PM