CIS 451 Week 9
Intel Superscalar
- Intel breaks instructions down into micro operations.
- These micro-ops are bigger than those shown for the multi-cycle CPU.
- They are more like RISC ops.
- For example,
add eax, ebx
is (probably) one micro-op
add eax, [MEM]
is probably two (a read followed by an add)
add [MEM], eax
is probably three (read, add, store)
- Classic CISC instructions (e.g., string compare) are split into
many (more than 4) micro-ops.
- These micro-ops typically come from a microcoded routine (often with a loop or other complex control)
- It is typically faster to write code from scratch using the newer, simpler instructions than to use the old complex instructions.
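As a rough illustration, here is a hedged sketch of how a string compare might expand. This Python loop models the *idea* of a microcoded routine with a loop of simple RISC-like operations; it is not Intel's actual microcode, and the per-line uop annotations are illustrative guesses.

```python
# Toy model (not real microcode) of repe cmpsb: compare bytes at
# mem[esi] and mem[edi] while they are equal, at most ecx times.
# Each line corresponds roughly to one simple micro-op.
def repe_cmpsb(mem, esi, edi, ecx):
    zf = 1
    while ecx != 0:              # uop: test ecx / branch
        a = mem[esi]             # uop: load
        b = mem[edi]             # uop: load
        zf = 1 if a == b else 0  # uop: compare, set flags
        esi += 1                 # uop: increment
        edi += 1                 # uop: increment
        ecx -= 1                 # uop: decrement
        if zf == 0:              # uop: conditional branch
            break
    return esi, edi, ecx, zf

mem = list(b"hellohelp!")
print(repe_cmpsb(mem, 0, 5, 5))  # (4, 9, 1, 0): differs at the 4th byte
```

Even one trip around this loop is roughly seven simple operations, which is why a handful of ordinary instructions scheduled by the out-of-order core usually beats the microcoded routine.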
- Out-of-order execution typically done at micro-instruction level
- Consider the two versions below. In version 2,
sub esp, 4
can execute immediately, even if
eax
isn't ready yet.
call
requires
esp
. Allowing
sub esp, 4
to proceed immediately lets the CPU begin setting up the function call, even if
eax
is still being computed.
# version 1
push eax
call some_function
# version 2
sub esp, 4
mov [esp], eax
call some_function
- Look at PII pipeline slide
- Two stages for branch prediction
- Three stages to fetch instruction (and build 16+ byte block aligned to next instruction)
- Two stages for decode
- Figure out where the next three instructions are
- Decode
- Three parallel decoders on the PII: D0, D1, D2 (later cores add a fourth, D3)
- D0 can handle instructions with up to 4 micro-ops
- D1 and D2 can handle 1 micro-op each
- Thus, use the 4-1-1 rule (4-1-1-1 on cores with four decoders) to maximize throughput
- Notice that although execution units and buffers increase across generations, issue/retire width stalls out at 4.
- Example 6.1a. Instruction decoding requiring 3 clock cycles
mov [esi], eax ; 2 uops, D0
add ebx, [edi] ; 2 uops, D0
sub eax, 1 ; 1 uop, D1
cmp ebx, ecx ; 1 uop, D2
je L1 ; 1 uop, D0
- Example 6.1b. Instructions reordered for improved decoding (only two clock cycles)
mov [esi], eax ; 2 uops, D0
sub eax, 1 ; 1 uop, D1
add ebx, [edi] ; 2 uops, D0
cmp ebx, ecx ; 1 uop, D1
je L1 ; 1 uop, D2
- Show Intel Micro-architecture slide
- Show “Intel Growth over time” slide
- Tomasulo's algorithm
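The core idea can be sketched in a toy model (hedged: a simplification, not cycle-accurate hardware, and the example program below is illustrative): instructions issue in order to reservation stations, source registers are renamed to the tags of their producing stations, and results broadcast on a common data bus so waiting stations capture values by tag.

```python
import operator

# Toy model of Tomasulo's algorithm: issue in order, rename operands to
# station tags, fire any station whose operands are ready, broadcast
# results on a common data bus (CDB).
def tomasulo(program, regs):
    """program: list of (dest, op, src1, src2).
    Returns (completion order, final register values)."""
    stations = {}   # tag -> [op, qj, vj, qk, vk]
    reg_tag = {}    # register -> tag of the station that will produce it
    for tag, (dest, op, s1, s2) in enumerate(program):   # issue, in order
        qj, vj = (reg_tag[s1], None) if s1 in reg_tag else (None, regs[s1])
        qk, vk = (reg_tag[s2], None) if s2 in reg_tag else (None, regs[s2])
        stations[tag] = [op, qj, vj, qk, vk]
        reg_tag[dest] = tag          # later readers wait on this tag
    completed = []
    while stations:                  # execute + write-back
        ready = [t for t, (_, qj, _, qk, _) in stations.items()
                 if qj is None and qk is None]
        for t in ready:
            op, _, vj, _, vk = stations.pop(t)
            result = op(vj, vk)
            completed.append(t)
            for rs in stations.values():     # CDB broadcast by tag
                if rs[1] == t: rs[1], rs[2] = None, result
                if rs[3] == t: rs[3], rs[4] = None, result
            dest = program[t][0]
            if reg_tag.get(dest) == t:       # only the latest writer
                regs[dest] = result          # updates the register
                del reg_tag[dest]
    return completed, regs

# Echoes the push/call example: the independent sub on esp (instr 2)
# completes before the add (instr 1) that waits on the multiply.
prog = [("eax", operator.mul, "eax", "ebx"),
        ("ecx", operator.add, "eax", "one"),
        ("esp", operator.sub, "esp", "four")]
done, regs = tomasulo(prog, {"eax": 3, "ebx": 5, "one": 1,
                             "esp": 100, "four": 4})
print(done)  # [0, 2, 1]: out-of-order completion
```

Because operands are tracked by tag rather than register name, WAR and WAW hazards disappear; only true data dependences serialize execution.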
- Other goodies
- Macro-fusion ==> a CMP (or TEST) and the following conditional jump decode into a single micro-op
- Loop detection buffer (loop stream detector): small loops replay from a buffer without re-fetching or re-decoding
- Show slides with pictures of CPUs over time.
- At this point, the processing power of a single CPU is leveling out. Show “Single processor performance” slide.
- Faster clocks use too much power / generate too much heat
- Hard to utilize more functional units
- too many dependencies
- cache misses cause long delays
- (Thus, transistors went into larger caches instead of more execution units.)
- “Sometimes bigger and dumber is better”
- Solution is more parallelism.
- Multi core
- Simultaneous multithreading (aka “Hyperthreading”)
- SIMD (both in applications – SSE, AVX — and GPUs)