CIS 451 Week 13
I/O
- Programmed I/O
- Instructions that specifically route data to an I/O device.
- Hard-coded / not flexible
- Harder to program
- Polling vs. interrupts
- Polling still exists in ultra-cheap, low-performance embedded devices
- DMA
- How does this affect caching?
- Memory Mapped I/O (H&H Chapter 8)
- Some memory reads and writes are routed to I/O
- Need to be careful with caching (make sure writes go through to the device and cached copies are properly updated; see the polling sketch after this list)
- Hierarchy of buses
- Most critical things (memory, graphics) are handled first
- Often asynchronous (the CPU moves on to other things after the request is sent off-chip.)
- Pendulum: move off-chip to distribute, then move back on-chip for performance
- Interrupt priority
- RAID (switch to P&H Slides #29)
- Note why RAM accesses are expensive.
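A minimal sketch of programmed, memory-mapped I/O with polling, in C. The base address, register offsets, and status bit below are hypothetical, stand-ins for what a real device's datasheet would specify. Note `volatile`: it forces every access to actually reach the bus, the software-level cousin of the caching concerns above.

```c
#include <stdint.h>

/* Hypothetical device mapped into the address space. The base address,
 * register offsets, and status bit are invented for illustration; a real
 * platform's datasheet defines them. */
#define DEV_BASE      0x10000000u
#define DEV_STATUS    (*(volatile uint32_t *)(DEV_BASE + 0x0))
#define DEV_DATA      (*(volatile uint32_t *)(DEV_BASE + 0x4))
#define STATUS_READY  0x1u   /* assumed "data available" bit */

/* Polling: spin on the status register until the device has data.
 * volatile makes the compiler issue a real load on every iteration,
 * so it can't cache the status in a register; the hardware analogue is
 * making sure these addresses bypass (or write through) the data cache. */
uint32_t dev_read_blocking(void)
{
    while ((DEV_STATUS & STATUS_READY) == 0)
        ;  /* busy-wait; an interrupt-driven driver would sleep instead */
    return DEV_DATA;
}
```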
Domain Specific Architectures
- Tapped out ILP and multi-core
- Doing more work using more transistors is difficult to scale, because more transistors mean more energy (heat)
- Not much room to optimize existing architectures
- Arithmetic operations themselves use little energy.
- Energy cost comes from overhead (fetch, decode, memory access, etc.)
- Thus, big improvements in performance mean reducing overhead.
- Focusing only on the arithmetic means switching from general-purpose CPUs to domain-specific chips
- (Chips designed to solve a specific problem.)
- They tend to map the data flow through the HW directly onto the problem, thereby reducing overhead
- For example: sending data directly from where it’s produced to where it’s consumed without the overhead of storing in registers or cache.
- Another example: Video/image processing doesn’t have a lot of data reuse, so multi-level caches aren’t helpful (and just consume energy)
- “Desperation is the reason architects are now working on DSAs” (Hennessy and Patterson)
DSAs target frequently used subproblems.
- One idea: FPGAs
- Pros / Cons?
- Pro: Economy of scale
- Pro: Reconfigurable
- Con: Comparatively slow. Slow enough that the benefit is rather modest.
Guidelines for DSAs
Ask students what each of these means.
- Use dedicated memories to minimize the distance over which data is moved
- Invest resources saved from dropping advanced microarchitectural optimizations into more arithmetic or memory
- Reduce branch predictors, schedulers, register renaming and such
- Possible because the workload is more specific; we don’t need to be prepared for “anything.”
- Use the easiest form of parallelism that matches the domain
- Most key targets for DSAs have a lot of parallelism, but in a specific pattern
- Reduce data size and type to the simplest needed for the domain
- Don't use 64-bit ints if 8 or 16 bits will do. This lets you do more in parallel.
- Also saves on memory bandwidth. (See the sketch after this list.)
- Use a domain specific language
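To make the "reduce data size" guideline concrete, here's a sketch of the arithmetic a quantized accelerator performs: 8-bit operands, with a 32-bit accumulator so the running sum can't overflow. The function name and toy setup are mine, not any particular chip's interface.

```c
#include <stdint.h>
#include <stddef.h>

/* Dot product on 8-bit quantized values. Four int8 elements occupy the
 * bandwidth of one int32, so a fixed-width SIMD unit gets 4x the
 * multiplies per fetch. The accumulator stays 32-bit because a sum of
 * many 8-bit x 8-bit products quickly overflows 16 bits. */
int32_t dot_q8(const int8_t *w, const int8_t *x, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)w[i] * (int32_t)x[i];
    return acc;
}
```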
Notice the “pendulum” again:
- In the 60s and 70s memory was expensive and we worked hard to save each bit.
- Then memory became cheap, and we didn't care.
- Now we're paying attention again, but for a slightly different reason: bandwidth and energy as opposed to monetary cost.
- Google’s TPU for Machine Learning
- Designed for inference phase of machine learning
- quick overview of learning and inference phases
- Key piece of work is evaluating the matrix-vector product y = Wx (weight matrix times input vector)
- Easy to do the multiplications in parallel; but
- Need a way to do all the additions.
- Systolic array pushes data through “diagonally” (see the sketch after this TPU section)
- Computing a row takes n steps, doing one multiply and one add at each step; but
- Can be pipelined.
- Notice that the systolic array avoids the need to move intermediate results into and out of registers (or worse, back into main memory)
- Instructions are CISC style.
- A single vector multiply takes n cycles;
- But, that’s OK because
- It facilitates pipelining
- Avoids repeated instruction fetches (a kind of overhead)
- Point out
- Accumulators (the output from the vector multiply)
- Activation (hardware to apply the nonlinear function)
- After activation, data sent back to “Unified Buffer” to be used for next step in the process (again avoiding the long trip back to main memory).
- Show the die. Notice that the datapath is the majority of the chip (unlike the general-purpose CPU die I showed a few weeks ago).
- How TPU applies general principles
- Has dedicated memory
- Data path is 50% of the chip, instead of that area being spent on branch prediction, caches, and such.
- Easiest form of parallelism: Focuses on the basic operation of the neural net.
- Uses 8 bit integers instead of floating point: Good enough, but much simpler.
- Uses TensorFlow.
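A software sketch of the dataflow described above, assuming 8-bit weights/inputs and 32-bit accumulators as in the TPU: the product y = Wx done as multiply-accumulates whose partial sums stay in accumulators, followed by activation, with the result kept on-chip for the next layer. This runs the arithmetic sequentially; in the real systolic array the same MACs are spread across a pipelined grid of cells.

```c
#include <stdint.h>
#include <stddef.h>

#define N 4  /* toy layer size; a real TPU tile is 256x256 */

/* One neural-net layer, TPU-style. In the systolic array each
 * acc[i] += W*x below happens inside a dedicated MAC cell as data is
 * pushed through diagonally; partial sums never visit registers or
 * main memory on the way. */
void layer(const int8_t W[N][N], const int8_t x[N], int8_t y[N])
{
    int32_t acc[N] = {0};  /* the "Accumulators" block */

    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            acc[i] += (int32_t)W[i][j] * (int32_t)x[j];

    /* The "Activation" hardware: a nonlinear function (ReLU here),
     * then back to the Unified Buffer as input for the next layer,
     * avoiding a round trip to main memory. */
    for (size_t i = 0; i < N; i++) {
        int32_t v = acc[i] > 0 ? acc[i] : 0;  /* ReLU */
        y[i] = (int8_t)(v > 127 ? 127 : v);   /* crude requantization */
    }
}
```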
- Microsoft Catapult
- Use FPGAs to accelerate certain server functions
- Convolutions in CNNs
- Feature extraction on search for Bing
- Main downside? FPGA programming is still hard (done at a hardware level)
- How catapult applies principles
- FPGAs have on-chip memory
- The ALUs in an FPGA can be dedicated to the specific application
- Can be configured to match the best form of parallelism
- Can choose the optimal data size
- !! Doesn’t fit !! RTL/Verilog goes in the other direction (it lowers the abstraction level rather than raising it).
Cross-Cutting issues
- Heterogeneity and System on a Chip
- Easiest way to attach a DSA to a system is over the I/O bus; but the bus is relatively slow.
- How to get custom hardware embedded on a SOC?
- IP (Intellectual Property) block
- Verilog or VHDL code that can be bought/licensed and dropped into a new chip.
- Apple A4 used 8 such blocks in 2010
- Apple A8 used 28 such blocks in 2014
- IP blocks make up 2/3 of the A8 chip.
- “Designing an SOC is like city planning where groups lobby for limited resources”
- “Compromise is difficult”
- Limited space and power. Must decide how much to allocate to CPU, GPU, cache, video encoder, etc.
- Variety of IP designs allow SOC designers to pick components of the right size for the target audience.
- A small version of an IP block is important for entering a market: existing designs are not likely to make radical changes to try a new concept.
- Open Instruction Set
- Historically, instruction sets (MIPS, x86, PowerPC, Sparc, etc.) belong to one company
- This makes it difficult for IP vendors to provide instruction sets that are easily incorporated.
- Chip designers don’t want to deal with lots of licensing.
- RISC-V now provides an open-source instruction set with opcode space reserved for custom extensions (see the sketch below).
- Chips using RISC-V can incorporate IP cores and their instructions into their custom RISC-V instruction set.
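As a hedged illustration of how that reserved opcode space gets used: RISC-V permanently sets aside the custom-0/custom-1 major opcodes for vendor extensions, and GNU as's `.insn` directive can emit instructions in that space. Everything specific below, the funct3/funct7 values and the instruction's supposed meaning, is invented; on a core without the matching IP block this would simply trap as an illegal instruction.

```c
#include <stdint.h>

/* Hypothetical accelerator instruction in RISC-V's reserved custom-0
 * opcode space (major opcode 0x0b). The funct3=0/funct7=0 encoding and
 * the "multiply-accumulate" semantics are made up for this sketch. */
static inline int32_t ip_mac(int32_t a, int32_t b)
{
    int32_t rd;
    asm volatile(".insn r 0x0b, 0, 0, %0, %1, %2"
                 : "=r"(rd)
                 : "r"(a), "r"(b));
    return rd;
}
```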