CIS 451 Week 7

Superscalar / Superpipelining

Instructions form a partial order
- Think about building a house.
- a = 0; b = 3; a = a + c; b = b + c; d = a + b
Having multiple pipelines allows us to take advantage of this instruction-level parallelism.
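
The partial order in the five-statement example can be made concrete with a small sketch. The `Stmt` encoding and the function names below are hypothetical, and the model assumes unlimited functional units and one step per statement; it only computes how long the longest dependence chain is.

```c
#include <stddef.h>

/* Hypothetical encoding: each statement names up to two EARLIER statements
 * (by index) that it depends on; -1 means "no dependence". */
typedef struct {
    int dep1;
    int dep2;
} Stmt;

/* Fills level[i] with the earliest step (1-based) at which statement i
 * could run, and returns the length of the longest dependence chain. */
int schedule_levels(const Stmt *stmts, size_t n, int *level)
{
    int longest = 0;
    for (size_t i = 0; i < n; i++) {
        int lvl = 1;
        if (stmts[i].dep1 >= 0 && level[stmts[i].dep1] + 1 > lvl)
            lvl = level[stmts[i].dep1] + 1;
        if (stmts[i].dep2 >= 0 && level[stmts[i].dep2] + 1 > lvl)
            lvl = level[stmts[i].dep2] + 1;
        level[i] = lvl;
        if (lvl > longest)
            longest = lvl;
    }
    return longest;
}

/* The statements from the notes:
 * 0: a = 0;  1: b = 3;  2: a = a + c;  3: b = b + c;  4: d = a + b */
int example_depth(void)
{
    const Stmt prog[5] = {
        {-1, -1}, {-1, -1}, {0, -1}, {1, -1}, {2, 3}
    };
    int level[5];
    return schedule_levels(prog, 5, level);
}
```

Under these assumptions `example_depth()` returns 3: the five statements fit into three steps (two statements, two statements, one statement), so two pipelines are enough to exploit all the parallelism in this fragment.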
Modern CPUs have multiple functional units – as many as six or eight!
Ideally an “x-way” pipeline will get work done x times as fast.
What goes wrong?
Hazards have a larger effect on wider pipelines.
In a regular pipeline, a load followed immediately by an instruction that uses the loaded value requires 1 stall cycle.
- How many stalls are required in a 2-way pipeline? Why?
- How about an n-way processor?
What about control hazards?
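
A rough way to see why any stall cycle (data or control) costs more on a wider pipeline is to count issue slots. The sketch below is a deliberately simplified model, not a description of a real CPU: it assumes every stall cycle idles all issue slots, ignores pipeline fill time, and treats the number of stall cycles as given. The function names are made up for illustration.

```c
/* Cycles needed to issue `instructions` on a `width`-way pipeline that
 * suffers `stall_cycles` stall cycles (simplified model, see lead-in). */
long total_cycles(long instructions, long width, long stall_cycles)
{
    long issue_cycles = (instructions + width - 1) / width; /* ceiling */
    return issue_cycles + stall_cycles;
}

/* Issue slots wasted: each stall cycle idles all `width` slots. */
long slots_lost(long width, long stall_cycles)
{
    return width * stall_cycles;
}
```

For example, with 100 instructions and 10 stall cycles, `total_cycles(100, 1, 10)` is 110 and `total_cycles(100, 2, 10)` is 60. The 2-way machine is about 1.83 times as fast rather than 2 times, and the same 10 stall cycles waste 20 issue slots instead of 10.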
How can we schedule instructions to functional units?
- Statically (VLIW)
- Dynamically
What are the advantages and disadvantages of each?
- Static
  - Can be simpler and, therefore, faster; but,
  - Can’t react to stalls and other dynamic events.
- Dynamic
  - More complex
  - Can react to dynamic events (e.g., cache misses)
  - Presents a common interface (small implementation changes don’t require recompilation).
Which is more common today? Why?
Itanium was a joint HP/Intel attempt at static scheduling.
- Performance wasn’t good enough to justify the cost.
Dynamic scheduling introduces new data hazards
Standard hazard is RAW – Read after Write
- This is what forwarding and stalls address in a standard pipeline.

If a CPU has out-of-order issue (it looks at a window of instructions and chooses the next one that is ready to run), it must also worry about WAR hazards: write after read.

lw $t0, 40($s0)
or $t3, $s5, $s6
sw $s7, 80($t3)
add $t1, $t0, $s1  
sub $t0, $s2, $s3  # WAR hazard: be careful moving this up; the add above must read the old $t0 first
and $t2, $s4, $t0

WAR hazards are called “fake” hazards. Why?
- They can be eliminated by register renaming.
- Also called “anti-dependence” or “name dependence”
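
Register renaming can be sketched in a few lines. This is a minimal illustration, not how any particular CPU implements it: it assumes an unlimited supply of physical registers (no free list or reclamation), and the `Instr` struct and `rename_registers` function are hypothetical names.

```c
#define NUM_ARCH_REGS 32

typedef struct {
    int dest;  /* architectural destination register, or -1 if none */
    int src1;  /* architectural source registers, or -1 if unused   */
    int src2;
} Instr;

/* Rename n instructions in program order. Sources read the current
 * architectural-to-physical mapping; every destination is given a fresh
 * physical register, so a later write never reuses a name that an
 * earlier instruction still has to read. */
void rename_registers(const Instr *in, Instr *out, int n)
{
    int map[NUM_ARCH_REGS];
    int next_phys = NUM_ARCH_REGS;  /* assumption: unlimited physical regs */

    for (int r = 0; r < NUM_ARCH_REGS; r++)
        map[r] = r;

    for (int i = 0; i < n; i++) {
        out[i].src1 = (in[i].src1 >= 0) ? map[in[i].src1] : -1;
        out[i].src2 = (in[i].src2 >= 0) ? map[in[i].src2] : -1;
        if (in[i].dest >= 0) {
            map[in[i].dest] = next_phys++;
            out[i].dest = map[in[i].dest];
        } else {
            out[i].dest = -1;
        }
    }
}
```

Using MIPS register numbers ($t0 = 8, $t1 = 9, $t2 = 10, $s1 = 17, and so on), the add/sub/and sequence above becomes: add reads physical register 8, sub writes a new physical register 33, and the final and reads 33. The sub no longer overwrites anything the add reads, so the WAR hazard is gone while the true RAW dependence from sub to and is preserved.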
Most CPUs require in-order completion (instructions are held and committed in the original order).
- What is the challenge with out-of-order completion?
  - Unwinding exceptions.
Out-of-order completion raises the possibility of WAW data hazards.
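
The three hazard types can be told apart by comparing the registers two instructions read and write. The sketch below is one possible way to express that check; the `Op` struct, the flag names, and `classify_hazards` are hypothetical, and it ignores special cases such as a hard-wired zero register.

```c
enum { HAZ_NONE = 0, HAZ_RAW = 1, HAZ_WAR = 2, HAZ_WAW = 4 };

typedef struct {
    int dest;  /* destination register, or -1 if none */
    int src1;  /* source registers, or -1 if unused   */
    int src2;
} Op;

/* `earlier` precedes `later` in program order. Returns a bitmask of the
 * dependences that constrain reordering the two instructions. */
int classify_hazards(Op earlier, Op later)
{
    int h = HAZ_NONE;

    /* RAW: later reads what earlier writes (true dependence). */
    if (earlier.dest >= 0 &&
        (later.src1 == earlier.dest || later.src2 == earlier.dest))
        h |= HAZ_RAW;

    /* WAR: later writes what earlier reads (anti-dependence). */
    if (later.dest >= 0 &&
        (earlier.src1 == later.dest || earlier.src2 == later.dest))
        h |= HAZ_WAR;

    /* WAW: both write the same register (output dependence). */
    if (earlier.dest >= 0 && later.dest == earlier.dest)
        h |= HAZ_WAW;

    return h;
}
```

Applied to the MIPS example above (with $t0 = 8, $t1 = 9, $t2 = 10, $s0 = 16, ...): add followed by sub gives `HAZ_WAR`, sub followed by and gives `HAZ_RAW`, and the lw that writes $t0 followed by the sub that also writes $t0 gives `HAZ_WAW`.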

Compiler optimizations can help reduce stalls / unused functional units.

for (int i = 0; i < MAX; i++) { 
   c[i] = a[i] + b[i];
}
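
One common optimization of this kind is loop unrolling; whether that is the specific optimization these notes go on to use is an assumption. The sketch below wraps the loop above in a function and shows a version unrolled by 4 (the factor and the function names are arbitrary choices for illustration).

```c
#include <stddef.h>

/* The loop from the notes, wrapped in a function for comparison. */
void add_arrays(const int *a, const int *b, int *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Same computation, unrolled by 4. The four statements in the body are
 * independent of one another, and the counter update and branch are
 * paid once per four elements instead of once per element. */
void add_arrays_unrolled(const int *a, const int *b, int *c, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    /* Cleanup loop for the leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Both functions produce the same result for any `n`. The unrolled body gives a scheduler (static or dynamic) several independent loads and adds to spread across functional units, where the original loop offers only one element's worth of work between branches.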
  
  
  
; The base address of array a is in r1 
; The base address of array b is in r2
; The base address of array c is in r3
  
      addi r4, r0, 4            ; Set r4 (the loop counter) to MAX (here MAX = 4)
LOOP: lw r5, 0(r1)              ; Load from a into r5
      lw r6, 0(r2)              ; Load from b into r6
      add r7, r5, r6            ; r7 = r5 + r6
      sw  0(r3), r7             ; store result back in array c
      addi r1, r1, 4            ; increment pointers for arrays a, b, and c
      addi r2, r2, 4
      addi r3, r3, 4
      subi r4, r4, 1            ; decrement loop counter
      bnez r4, LOOP             ; branch
      nop                       ; branch delay slot
      sw 0(r3), r0              ; store 0 at the address in r3, which now points just past the last element of c (not important, just something to do after the loop)
      nop;
      nop; 
      nop;
      nop;      
      trap #0;