CIS 451 Week 9
Intel Superscalar
- Intel breaks instructions down into micro operations.
- These micro-ops are bigger than those shown for the multi-cycle CPU.
- They are more like RISC ops.
- For example,
  - add eax, ebx is (probably) one micro-op
  - add eax, [MEM] is probably two (a read followed by an add)
  - add [MEM], eax is probably three (read, add, store)
- Classic CISC instructions (e.g., string compare) are split into
many (more than 4) micro-ops.
- These micro-ops typically come from a microcoded routine (often with a loop or other complex control)
- It is typically faster to write code from scratch using the newer, simpler instructions than to use the old complex ones.
- Out-of-order execution is typically done at the micro-op level
- Consider the code below
  - sub can execute immediately, even if eax isn't ready yet. call requires esp. Allowing sub esp, 4 to proceed immediately lets the CPU begin setting up the function call, even if eax is still being computed.
# version 1
push eax
call some_function
# version 2
sub esp, 4
mov [esp], eax
call some_function
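The point can be sketched with a toy dependence check (can_issue is a made-up helper for illustration; real out-of-order hardware tracks this with register renaming and reservation stations):

```python
# Toy issue check (a sketch, not real hardware): an instruction may begin
# executing as soon as none of the registers it reads is still being produced
# by an older, unfinished instruction.

def can_issue(reads, pending):
    """reads: registers the instruction reads; pending: registers not ready."""
    return not (set(reads) & set(pending))

pending = ['eax']                           # eax is still being computed

# version 1: push eax reads eax, so it must wait
print(can_issue(['eax', 'esp'], pending))   # False

# version 2: sub esp, 4 reads only esp, so it can go immediately
print(can_issue(['esp'], pending))          # True
```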
- Look at PII pipeline slide
- Two stages for branch prediction
- Three stages to fetch instruction (and build 16+ byte block aligned to next instruction)
- Two stages for decode
- Figure out where the next three instructions are
- Decode
- Three parallel decoders D0, D1, D2
- D0 can handle instructions with up to 4 micro-ops
- D1 and D2 can handle 1 micro-op each
- Thus, use the 4-1-1 rule (follow a multi-uop instruction with two 1-uop instructions) to maximize throughput
- Notice that although execution units and buffers increase over generations, issue/retire width levels off at 4.
- Example 6.1a. Instruction decoding requiring 3 clock cycles
mov [esi], eax ; 2 uops, D0
add ebx, [edi] ; 2 uops, D0
sub eax, 1 ; 1 uop, D1
cmp ebx, ecx ; 1 uop, D2
je L1 ; 1 uop, D0
- Example 6.1b. Instructions reordered for improved decoding (only two clock cycles)
mov [esi], eax ; 2 uops, D0
sub eax, 1 ; 1 uop, D1
add ebx, [edi] ; 2 uops, D0
cmp ebx, ecx ; 1 uop, D1
je L1 ; 1 uop, D2
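The cycle counts in examples 6.1a and 6.1b can be reproduced with a tiny simulator (a sketch assuming in-order decode and instructions of at most 4 uops each; decode_cycles is a made-up helper, not an Intel tool):

```python
# Sketch of the PII/PIII 4-1-1 decode rule: each cycle, D0 takes the next
# instruction (up to 4 uops), then D1 and D2 each take one following
# instruction, but only if it is a single uop.

def decode_cycles(uop_counts):
    """Clock cycles to decode instructions with the given uop counts."""
    cycles, i, n = 0, 0, len(uop_counts)
    while i < n:
        cycles += 1
        i += 1                      # D0 takes the next instruction
        slots = 2                   # D1 and D2: one 1-uop instruction each
        while i < n and slots > 0 and uop_counts[i] == 1:
            i += 1
            slots -= 1
    return cycles

print(decode_cycles([2, 2, 1, 1, 1]))  # example 6.1a order -> 3 cycles
print(decode_cycles([2, 1, 2, 1, 1]))  # example 6.1b order -> 2 cycles
```

In 6.1a the second 2-uop instruction cannot go to D1, so it waits a full cycle for D0; interleaving 1-uop instructions between the 2-uop ones (6.1b) keeps all three decoders busy.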
- Show Intel Micro-architecture slide
- Show “Intel Growth over time” slide
- Tomasulo's algorithm
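The core idea can be sketched in a few lines (hypothetical data structures, not Intel's implementation): each issued instruction gets a reservation station; a source operand is either a value or the tag of the station that will produce it; a broadcast on the common data bus (CDB) wakes up waiting stations.

```python
# Minimal sketch of Tomasulo's algorithm: register renaming via tags plus
# a CDB broadcast that converts waiting tags into values.

class Station:
    def __init__(self, op, src1, src2, reg_status, regs):
        self.op = op
        # Each source is ('val', number) or ('tag', producing_station).
        self.src = [reg_status.get(r, ('val', regs[r])) for r in (src1, src2)]
        self.result = None

    def ready(self):
        return all(kind == 'val' for kind, _ in self.src)

    def execute(self):
        a, b = (v for _, v in self.src)
        self.result = a + b if self.op == 'add' else a - b
        return self.result

def broadcast(stations, tag, value):
    """CDB broadcast: replace waiting tags with the produced value."""
    for s in stations:
        s.src = [('val', value) if (k == 'tag' and t is tag) else (k, t)
                 for k, t in s.src]

regs = {'eax': 5, 'ebx': 7, 'ecx': 2}
reg_status = {}                       # reg -> ('tag', station) while in flight

# Issue: add eax, ebx   (eax = eax + ebx)
s1 = Station('add', 'eax', 'ebx', reg_status, regs)
reg_status['eax'] = ('tag', s1)       # rename: eax is now produced by s1
# Issue: sub ecx, eax   (needs the NEW eax, so it gets s1's tag)
s2 = Station('sub', 'ecx', 'eax', reg_status, regs)

print(s2.ready())                     # False: s2 is waiting on s1
broadcast([s2], s1, s1.execute())     # s1 finishes; CDB wakes s2
print(s2.ready(), s2.execute())       # True, 2 - 12 = -10
```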
- Other goodies
- Macro-fusion ==> a CMP (or TEST) and the following conditional jump fuse into a single micro-op
- Loop stream detector: small loops replay from a buffer, skipping fetch and decode
- Show slides with pictures of CPUs over time.
- At this point the processing power of a single CPU is leveling out. Show “Single processor performance” slide.
- Faster clocks use too much power / generate too much heat
- Hard to utilize more functional units
- too many dependencies
- cache misses cause long delays
- (Thus, transistors went into larger caches instead of more execution units.)
- “Sometimes bigger and dumber is better”
- Solution is more parallelism.
- Multi core
- Simultaneous multithreading (aka “Hyperthreading”)
- SIMD (both in applications, via SSE and AVX, and in GPUs)
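The SIMD idea in miniature (a conceptual sketch only; real SSE/AVX does this in one hardware instruction across register lanes, not a Python loop):

```python
# SIMD: one instruction applies the same operation to every lane of a
# short vector at once, instead of one scalar at a time.

def simd_add(xs, ys):
    """One conceptual 'vector add' across all lanes."""
    return [x + y for x, y in zip(xs, ys)]

print(simd_add([1, 2, 3, 4], [10, 20, 30, 40]))  # [11, 22, 33, 44]
```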