CIS 451 Week 9
Intel Superscalar
- Intel breaks instructions down into micro operations.
- These micro-ops are bigger than those shown for the multi-cycle CPU.
- They are more like RISC ops.
- For example,
add eax, ebx
is (probably) one micro-op
add eax, [MEM]
is probably two (a read followed by an add)
add [MEM], eax
is probably three (read, add, store)
- Classic CISC instructions (e.g., string compare) are split into
many (more than 4) micro-ops.
- These micro-ops typically come from a microcoded routine (often with a loop or other complex control)
- It is typically faster to write code from scratch using the newer, simpler instructions than to use the old complex instructions.
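As a rough illustration, here is a hedged sketch of how a string compare might expand. This Python loop models the *idea* of a microcoded routine with a loop of simple RISC-like operations; it is not Intel's actual microcode, and the per-line uop annotations are illustrative guesses.

```python
# Toy model (not real microcode) of repe cmpsb: compare bytes at
# mem[esi] and mem[edi] while they are equal, at most ecx times.
# Each line corresponds roughly to one simple micro-op.
def repe_cmpsb(mem, esi, edi, ecx):
    zf = 1
    while ecx != 0:              # uop: test ecx / branch
        a = mem[esi]             # uop: load
        b = mem[edi]             # uop: load
        zf = 1 if a == b else 0  # uop: compare, set flags
        esi += 1                 # uop: increment
        edi += 1                 # uop: increment
        ecx -= 1                 # uop: decrement
        if zf == 0:              # uop: conditional branch
            break
    return esi, edi, ecx, zf

mem = list(b"hellohelp!")
print(repe_cmpsb(mem, 0, 5, 5))  # (4, 9, 1, 0): differs at the 4th byte
```

Even one trip around this loop is roughly seven simple operations, which is why a handful of ordinary instructions scheduled by the out-of-order core usually beats the microcoded routine.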
- Out-of-order execution typically done at micro-instruction level
- Consider the two versions below. In version 2,
sub esp, 4
can execute immediately, even if
eax
isn't ready yet.
call
requires
esp
. Allowing
sub esp, 4
to proceed immediately lets the CPU begin setting up the function call, even if
eax
is still being computed.
# version 1
push eax
call some_function
# version 2
sub esp, 4
mov [esp], eax
call some_function
- Look at PII pipeline slide
- Two stages for branch prediction
- Three stages to fetch instruction (and build 16+ byte block aligned to next instruction)
- Two stages for decode
- Figure out where the next three instructions are
- Decode
- Three parallel decoders on the PII: D0, D1, D2 (later cores add a fourth, D3)
- D0 can handle instructions with up to 4 micro-ops
- D1 and D2 can handle 1 micro-op each
- Thus, use the 4-1-1 rule (4-1-1-1 on cores with four decoders) to maximize throughput
- Notice that although execution units and buffers increase across generations, issue/retire width stalls out at 4.
- Example 6.1a. Instruction decoding requiring 3 clock cycles
mov [esi], eax ; 2 uops, D0
add ebx, [edi] ; 2 uops, D0
sub eax, 1 ; 1 uop, D1
cmp ebx, ecx ; 1 uop, D2
je L1 ; 1 uop, D0
- Example 6.1b. Instructions reordered for improved decoding (only two clock cycles)
mov [esi], eax ; 2 uops, D0
sub eax, 1 ; 1 uop, D1
add ebx, [edi] ; 2 uops, D0
cmp ebx, ecx ; 1 uop, D1
je L1 ; 1 uop, D2
- Show Intel Micro-architecture slide
- Show “Intel Growth over time” slide
- Tomasulo's algorithm
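The core idea can be sketched in a toy model (hedged: a simplification, not cycle-accurate hardware, and the example program below is illustrative): instructions issue in order to reservation stations, source registers are renamed to the tags of their producing stations, and results broadcast on a common data bus so waiting stations capture values by tag.

```python
import operator

# Toy model of Tomasulo's algorithm: issue in order, rename operands to
# station tags, fire any station whose operands are ready, broadcast
# results on a common data bus (CDB).
def tomasulo(program, regs):
    """program: list of (dest, op, src1, src2).
    Returns (completion order, final register values)."""
    stations = {}   # tag -> [op, qj, vj, qk, vk]
    reg_tag = {}    # register -> tag of the station that will produce it
    for tag, (dest, op, s1, s2) in enumerate(program):   # issue, in order
        qj, vj = (reg_tag[s1], None) if s1 in reg_tag else (None, regs[s1])
        qk, vk = (reg_tag[s2], None) if s2 in reg_tag else (None, regs[s2])
        stations[tag] = [op, qj, vj, qk, vk]
        reg_tag[dest] = tag          # later readers wait on this tag
    completed = []
    while stations:                  # execute + write-back
        ready = [t for t, (_, qj, _, qk, _) in stations.items()
                 if qj is None and qk is None]
        for t in ready:
            op, _, vj, _, vk = stations.pop(t)
            result = op(vj, vk)
            completed.append(t)
            for rs in stations.values():     # CDB broadcast by tag
                if rs[1] == t: rs[1], rs[2] = None, result
                if rs[3] == t: rs[3], rs[4] = None, result
            dest = program[t][0]
            if reg_tag.get(dest) == t:       # only the latest writer
                regs[dest] = result          # updates the register
                del reg_tag[dest]
    return completed, regs

# Echoes the push/call example: the independent sub on esp (instr 2)
# completes before the add (instr 1) that waits on the multiply.
prog = [("eax", operator.mul, "eax", "ebx"),
        ("ecx", operator.add, "eax", "one"),
        ("esp", operator.sub, "esp", "four")]
done, regs = tomasulo(prog, {"eax": 3, "ebx": 5, "one": 1,
                             "esp": 100, "four": 4})
print(done)  # [0, 2, 1]: out-of-order completion
```

Because operands are tracked by tag rather than register name, WAR and WAW hazards disappear; only true data dependences serialize execution.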
- Other goodies
- Macro-fusion ==> a CMP (or TEST) and the following conditional jump decode into a single micro-op
- Loop detection buffer (loop stream detector): small loops replay from a buffer without re-fetching or re-decoding
- Show slides with pictures of CPUs over time.
- At this point, the processing power of a single CPU is leveling out. Show “Single processor performance” slide.
- Faster clocks use too much power / generate too much heat
- Hard to utilize more functional units
- too many dependencies
- cache misses cause long delays
- (Thus, transistors went into larger caches instead of more execution units.)
- “Sometimes bigger and dumber is better”
- Solution is more parallelism.
- Multi core
- Simultaneous multithreading (aka “Hyperthreading”)
- SIMD (both in applications – SSE, AVX — and GPUs)