CIS 451 Week 13
I/O
- Programmed I/O
- Instructions that specifically route data to an I/O device.
- Hard-coded / not flexible
- Harder to program
- Polling vs. interrupts
- Polling still exists in ultra-cheap, low-performance embedded devices
- DMA
- How does this affect caching?
- Memory Mapped I/O (H&H Chapter 8)
- Some memory reads and writes are routed to I/O
- Need to be careful with caching (make sure writes go through to the device and cached copies are properly updated; see the polling sketch after this list)
- Hierarchy of buses
- Most critical things (memory, graphics) are handled first
- Often asynchronous (the CPU moves on to other things after the request is sent off-chip.)
- Pendulum: move off-chip to distribute, then move back on-chip for performance
- Interrupt priority
- RAID (switch to P&H Slides #29)
- Note why RAM accesses are expensive.
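A minimal sketch of programmed, memory-mapped I/O with polling, in C. The base address, register offsets, and status bit below are hypothetical, stand-ins for what a real device's datasheet would specify. Note `volatile`: it forces every access to actually reach the bus, the software-level cousin of the caching concerns above.

```c
#include <stdint.h>

/* Hypothetical device mapped into the address space. The base address,
 * register offsets, and status bit are invented for illustration; a real
 * platform's datasheet defines them. */
#define DEV_BASE      0x10000000u
#define DEV_STATUS    (*(volatile uint32_t *)(DEV_BASE + 0x0))
#define DEV_DATA      (*(volatile uint32_t *)(DEV_BASE + 0x4))
#define STATUS_READY  0x1u   /* assumed "data available" bit */

/* Polling: spin on the status register until the device has data.
 * volatile makes the compiler issue a real load on every iteration,
 * so it can't cache the status in a register; the hardware analogue is
 * making sure these addresses bypass (or write through) the data cache. */
uint32_t dev_read_blocking(void)
{
    while ((DEV_STATUS & STATUS_READY) == 0)
        ;  /* busy-wait; an interrupt-driven driver would sleep instead */
    return DEV_DATA;
}
```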
Domain Specific Architectures
- Tapped out ILP and multi-core
- Doing more work using more transistors is difficult to scale, because more transistors mean more energy (heat)
- Not much room to optimize existing architectures
- Arithmetic operations themselves use little energy.
- Energy cost comes from overhead (fetch, decode, memory access, etc.)
- Thus, big improvements in performance mean reducing overhead.
- Focusing only on the arithmetic means switching from general-purpose CPUs to domain-specific chips
- (Chips designed to solve a specific problem.)
- They tend to map the data flow through the HW directly onto the problem, thereby reducing overhead
- For example: sending data directly from where it’s produced to where it’s consumed without the overhead of storing in registers or cache.
- Another example: Video/image processing doesn’t have a lot of data reuse, so multi-level caches aren’t helpful (and just consume energy)
- “Desperation is the reason architects are now working on DSAs” (Hennessy and Patterson)
DSAs target frequently used subproblems.
- One idea: FPGAs
- Pros / Cons?
- Pro: Economy of scale
- Pro: Reconfigurable
- Con: Comparatively slow. Slow enough that the benefit is rather modest.
Guidelines for DSAs
Ask students what each of these means.
- Use dedicated memories to minimize the distance over which data is moved
- Invest resources saved from dropping advanced microarchitectural optimizations into more arithmetic or memory
- Reduce branch predictors, schedulers, register renaming and such
- Possible because the workload is more specific; we don’t need to be prepared for “anything.”
- Use the easiest form of parallelism that matches the domain
- Most key targets for DSAs have a lot of parallelism, but in a specific pattern
- Reduce data size and type to the simplest needed for the domain
- Don't use 64-bit ints if 8 or 16 bits will do. This lets you do more in parallel.
- Also saves on memory bandwidth. (See the sketch after this list.)
- Use a domain specific language
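To make the "reduce data size" guideline concrete, here's a sketch of the arithmetic a quantized accelerator performs: 8-bit operands, with a 32-bit accumulator so the running sum can't overflow. The function name and toy setup are mine, not any particular chip's interface.

```c
#include <stdint.h>
#include <stddef.h>

/* Dot product on 8-bit quantized values. Four int8 elements occupy the
 * bandwidth of one int32, so a fixed-width SIMD unit gets 4x the
 * multiplies per fetch. The accumulator stays 32-bit because a sum of
 * many 8-bit x 8-bit products quickly overflows 16 bits. */
int32_t dot_q8(const int8_t *w, const int8_t *x, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)w[i] * (int32_t)x[i];
    return acc;
}
```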
Notice the “pendulum” again:
- In the 60s and 70s memory was expensive and we worked hard to save each bit.
- Then memory became cheap, and we didn't care.
- Now we're paying attention again, but for a slightly different reason: bandwidth and energy as opposed to monetary cost.
- Google’s TPU for Machine Learning
- Designed for inference phase of machine learning
- quick overview of learning and inference phases
- Key piece of work is evaluating the matrix-vector product y = Wx (weight matrix times input vector)
- Easy to do the multiplications in parallel; but
- Need a way to do all the additions.
- Systolic array pushes data through “diagonally” (see the sketch after this TPU section)
- Computing a row takes n steps, doing one multiply and one add at each step; but
- Can be pipelined.
- Notice that the systolic array avoids the need to move intermediate results into and out of registers (or worse, back into main memory)
- Instructions are CISC style.
- A single vector multiply takes n cycles;
- But, that’s OK because
- It facilitates pipelining
- Avoids repeated instruction fetches (a kind of overhead)
- Point out
- Accumulators (the output from the vector multiply)
- Activation (hardware to apply the nonlinear function)
- After activation, data sent back to “Unified Buffer” to be used for next step in the process (again avoiding the long trip back to main memory).
- Show the die. Notice that the datapath is the majority of the chip (unlike the general-purpose CPU die I showed a few weeks ago).
- How TPU applies general principles
- Has dedicated memory
- Data path is 50% of the chip, instead of that area being spent on branch prediction, caches, and such.
- Easiest form of parallelism: Focuses on the basic operation of the neural net.
- Uses 8 bit integers instead of floating point: Good enough, but much simpler.
- Uses TensorFlow.
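A software sketch of the dataflow described above, assuming 8-bit weights/inputs and 32-bit accumulators as in the TPU: the product y = Wx done as multiply-accumulates whose partial sums stay in accumulators, followed by activation, with the result kept on-chip for the next layer. This runs the arithmetic sequentially; in the real systolic array the same MACs are spread across a pipelined grid of cells.

```c
#include <stdint.h>
#include <stddef.h>

#define N 4  /* toy layer size; a real TPU tile is 256x256 */

/* One neural-net layer, TPU-style. In the systolic array each
 * acc[i] += W*x below happens inside a dedicated MAC cell as data is
 * pushed through diagonally; partial sums never visit registers or
 * main memory on the way. */
void layer(const int8_t W[N][N], const int8_t x[N], int8_t y[N])
{
    int32_t acc[N] = {0};  /* the "Accumulators" block */

    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            acc[i] += (int32_t)W[i][j] * (int32_t)x[j];

    /* The "Activation" hardware: a nonlinear function (ReLU here),
     * then back to the Unified Buffer as input for the next layer,
     * avoiding a round trip to main memory. */
    for (size_t i = 0; i < N; i++) {
        int32_t v = acc[i] > 0 ? acc[i] : 0;  /* ReLU */
        y[i] = (int8_t)(v > 127 ? 127 : v);   /* crude requantization */
    }
}
```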
- Microsoft Catapult
- Use FPGAs to accelerate certain server functions
- Convolutions in CNNs
- Feature extraction on search for Bing
- Main downside? FPGA programming is still hard (done at a hardware level)
- How catapult applies principles
- FPGAs have on-chip memory
- The ALUs in an FPGA can be dedicated to the specific application
- Can be configured to match the best form of parallelism
- Can choose the optimal data size
- !! Doesn’t fit !! RTL/Verilog goes in the other direction (it lowers the abstraction level rather than raising it).
Cross-Cutting issues
- Heterogeneity and System on a Chip
- Easiest way to attach a DSA to a system is over the I/O bus; but the bus is relatively slow.
- How to get custom hardware embedded on a SOC?
- IP (Intellectual Property) block
- Verilog or VHDL code that can be bought/licensed and dropped into a new chip.
- Apple A4 used 8 such blocks in 2010
- Apple A8 used 28 such blocks in 2014
- IP blocks make up 2/3 of the A8 chip.
- “Designing an SOC is like city planning where groups lobby for limited resources”
- “Compromise is difficult”
- Limited space and power. Must decide how much to allocate to CPU, GPU, cache, video encoder, etc.
- Variety of IP designs allow SOC designers to pick components of the right size for the target audience.
- A small version of an IP block is important for entering a market: existing designs are not likely to make radical changes to try a new concept.
- Open Instruction Set
- Historically, instruction sets (MIPS, x86, PowerPC, Sparc, etc.) belong to one company
- This makes it difficult for IP vendors to provide instruction sets that are easily incorporated.
- Chip designers don’t want to deal with lots of licensing.
- RISC-V now provides an open-source instruction set with opcode space reserved for custom extensions (see the sketch below).
- Chips using RISC-V can incorporate IP cores and their instructions into their custom RISC-V instruction set.
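As a hedged illustration of how that reserved opcode space gets used: RISC-V permanently sets aside the custom-0/custom-1 major opcodes for vendor extensions, and GNU as's `.insn` directive can emit instructions in that space. Everything specific below, the funct3/funct7 values and the instruction's supposed meaning, is invented; on a core without the matching IP block this would simply trap as an illegal instruction.

```c
#include <stdint.h>

/* Hypothetical accelerator instruction in RISC-V's reserved custom-0
 * opcode space (major opcode 0x0b). The funct3=0/funct7=0 encoding and
 * the "multiply-accumulate" semantics are made up for this sketch. */
static inline int32_t ip_mac(int32_t a, int32_t b)
{
    int32_t rd;
    asm volatile(".insn r 0x0b, 0, 0, %0, %1, %2"
                 : "=r"(rd)
                 : "r"(a), "r"(b));
    return rd;
}
```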