CIS 451 Week 12
- Predication: Marking instructions as conditional on some predicate.
- Can be more efficient than control flow (no branch mispredicts); see the sketch below.
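- A minimal sketch of the idea in C (illustrative, not from the lecture): the branchy version may suffer mispredictions on unpredictable data, while the branch-free version computes both outcomes and selects one, which compilers typically lower to a conditional move (a simple form of predication).

// Branchy version: an unpredictable branch can cause mispredict stalls.
int abs_branch(int x) {
    if (x < 0) return -x;
    return x;
}

// Predicated (branch-free) version: both values exist and one is selected.
// Compilers usually emit a conditional move (cmov) rather than a branch here.
int abs_predicated(int x) {
    int neg = -x;
    return (x < 0) ? neg : x;
}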
Data-Level parallelism
- “If you were plowing a field, which would you rather use: Two strong oxen, or 1024 chickens?”
- Seymour Cray arguing that two powerful vector processors are better than many simple processors.
- What does SIMD stand for? (Single Instruction, Multiple Data)
- SIMD is applying the same operation (instruction) to an entire vector of values.
- What kind of applications use SIMD?
- Matrix-oriented scientific computing
- Image processing
- Sound processing
- SIMD is more energy efficient because there is less overhead (e.g., fetching and decoding operations)
- Programmers can still (mostly) think sequentially (unlike more complex multi-threaded environments)
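- For example (a sketch, not lecture code), the canonical pattern SIMD targets is an elementwise loop: one SIMD instruction applies the same operation to several elements at once.

// Elementwise add: the same operation applied across an entire vector.
// A scalar core issues one add per element; a SIMD unit covers several
// elements (e.g., 8 floats in a 256-bit register) per instruction.
void vec_add(int n, const float *a, const float *b, float *c) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}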
- The first SIMD approach was vector processors (supercomputers of the 1970s and 80s).
- Key issues were
- Cost of that many transistors
- Cost of sufficient memory bandwidth to provide data to that many ALUs (especially given the need for …)
- Other key SIMD challenges
- Stride in vectors that are not lined up with memory.
- Consider matrix multiply: either the rows or the columns will be contiguous in memory, so accesses along the other dimension have a large stride (see the sketch after this list).
- How can this affect the cache?
- It can cause problems with large block sizes, since an entire block is brought in for only one useful item.
- One solution is to prepare special memory instructions that understand stride and bring only the needed data into a vector.
- Adds considerable complexity to memory system.
- Limited memory bandwidth
- Mark which registers are unused so we don’t spend time sending them to memory on a context switch
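- A minimal sketch of the stride problem (assuming row-major C arrays; not the lecture's example):

#define N 1024

// Naive matrix multiply. a[i][k] is accessed with stride 1 (cache- and
// vector-friendly), but b[k][j] is accessed with a stride of N doubles:
// each iteration of the k loop touches a different cache block, so most
// of every block fetched for b goes unused.
void matmul(const double a[N][N], const double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            double sum = 0.0;
            for (int k = 0; k < N; ++k)
                sum += a[i][k] * b[k][j];   // b is walked column-wise: stride N
            c[i][j] = sum;
        }
}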
- Other vector processor instructions
- AXPY: a*X + Y (where a is a scalar)
- disable: Disable all vector registers so they aren't saved during a context switch.
- vector mask: Prevent an operation from affecting some cells in the vector.
- It's one way to handle if statements; a type of predication.
- Gather / Scatter for sparse matrices (see the sketch after this list)
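- A minimal sketch in plain C (not actual vector ISA code) of what a vector mask and a gather express:

// Vector mask: only elements whose mask bit is set are updated. A vector
// processor runs this as one masked AXPY rather than a branch per element.
void masked_axpy(int n, double a, const double *x, double *y, const char *mask) {
    for (int i = 0; i < n; ++i)
        if (mask[i])                     // per-element predicate
            y[i] = a * x[i] + y[i];
}

// Gather: load elements through an index vector, e.g. the column indices
// of one row of a sparse matrix (compressed sparse row storage).
double sparse_dot(int nnz, const double *vals, const int *col, const double *x) {
    double sum = 0.0;
    for (int i = 0; i < nnz; ++i)
        sum += vals[i] * x[col[i]];      // x[col[i]] is a gather
    return sum;
}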
- Next came x86 extensions
- MMX
- SSE
- AVX
- Focuses on floating point
- Latest version is 512 bits wide. That's sixteen 32-bit ops at once.
- At first, only basic vector ops: add, multiply
- No scalar
- No masking
- But, more complex ops have crept in over time.
- Different op codes for different data sizes (8-bit, 16-bit, etc.)
- Thus, large number of op codes added over time
- The number of op codes roughly doubles every time the vector register width doubles.
- At first, compilers couldn’t do much with MMX/SSE, etc.
- Lack of more sophisticated ops
- Only used to build specialized libraries in assembly.
- Modern compilers can automatically use AVX in some situations (see the sketch after this list)
- Look at AVX_Demo
- It’s not difficult to write code that the compiler can’t automatically optimize
- Popular because they are a cheap addition to existing hardware
- Just divide up existing adders
- No stride, so no memory complexities.
- Only a few registers, so less context switch costs.
- Depending on the application, performance is limited by either
- Total # of flops available
- Memory bandwidth.
- Look at AVX Assembly code
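- A sketch (not the lecture's AVX_Demo) of what explicit AVX code looks like, using the standard immintrin.h intrinsics; compile with something like gcc -O3 -mavx.

#include <immintrin.h>

// AXPY with 256-bit AVX: each instruction operates on 4 doubles at once.
void daxpy_avx(int n, double a, const double *x, double *y) {
    __m256d va = _mm256_set1_pd(a);           // broadcast a into all 4 lanes
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d vx = _mm256_loadu_pd(x + i);  // load 4 doubles of x
        __m256d vy = _mm256_loadu_pd(y + i);  // load 4 doubles of y
        vy = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);
        _mm256_storeu_pd(y + i, vy);          // store 4 results
    }
    for (; i < n; ++i)                        // scalar cleanup for leftover elements
        y[i] = a * x[i] + y[i];
}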
GPUs
- Many Parallel ALUs
- Several multithreaded SIMD processors.
- Each SIMD processor has many (e.g., 32) “lanes”
- Consider typical vector add loop
for (int i = 0; i < n; ++i) {
    y[i] = a*x[i] + y[i];
}
- Each iteration of the loop is conceptually a separate thread that can run in parallel with all the other “threads”.
- We could, in theory, assign each thread to a lane, let them run in any order and synchronize at the end.
- However, each SIMD processor has only one Program Counter: Threads must be executed in batches of 32.
- We want to abstract away from the specifics of the GPU (so our code will run well on different GPUs.)
- Therefore, we define a concept called “thread blocks”: Groups of threads all sent to the same SIMD.
- The GPU software itself will group each block's threads into batches of num_lanes. (Called "SIMD threads" by H&P.)
- CUDA Code:
__host__
int nblocks = (n + 511) / 512;   // ceiling of n/512: (n/512) + 1 would over-allocate when n is a multiple of 512
daxpy<<<nblocks, 512>>>(n, 2.0, x, y);

__global__
void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
- We choose to put 512 threads into a thread block because
- we want to be able to keep ALUs busy when other ops stall, and
- it is a multiple of num_lanes (so there aren't gaps).
- Note, we don’t necessarily know the number of SIMD processors or lanes, but we can choose values that work well for a variety of GPUs.
- 512 threads per thread block means 512/32 = 16 "SIMD threads" per block.
- Because each SIMD thread is independent, it can be scheduled in any order. Thus, the GPU scheduler is like a superscalar scheduler: it issues a SIMD thread as soon as all of its dependencies are satisfied.
- It has a "scoreboard", which is a predecessor of Tomasulo's algorithm.
- Notice that threads are identified by blockIdx and threadIdx, not just a single thread ID alone.
- Look at the assembly code
- Note where stalls can happen after arithmetic
- Because “SIMD Threads” share a PC, their paths cannot diverge.
- In other words, they can’t run different paths of a branch.
- In effect, branches are turned into predicated instructions.
- With an IF-THEN group, those instructions lose 1/2 the "bandwidth" (see the sketch after this list)
- Unless they all go the same way; then the other path is skipped entirely.
- Typical when checking for error conditions.
- Or, as with the if (i < n) statement above, it lets the odd leftover threads at the end not be a problem.
- Can have nested branches.
- Then saved PCs go on a stack
- But bandwidth drops correspondingly: to 25%, 12.5%, 6.25%, etc.
- Summary: Each thread is either
- Executing the same instruction as every other thread, or
- Idle.
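- A sketch (hypothetical kernels, not from the lecture) of divergence: when lanes in the same 32-thread batch take different sides of a branch, each side executes while the other lanes sit idle (masked off).

__global__ void divergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)                   // even and odd lanes in the same 32-thread batch disagree
        data[i] = data[i] * 2.0f;     // runs with the odd lanes masked off
    else
        data[i] = data[i] + 1.0f;     // then runs with the even lanes masked off
    // Each half executes while the other idles, so throughput drops to ~50%.
}

__global__ void uniform(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x % 2 == 0)          // every lane in a batch agrees, so no divergence
        data[i] = data[i] * 2.0f;
    else
        data[i] = data[i] + 1.0f;
}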
GPU Instruction set
- PTX (Parallel Thread Execution)
- Abstraction of “real” hardware instruction set
- Real instruction set is kept hidden to promote compatibility
GPU Memory:
- Each thread has private memory
- Other threads have no access
- Used for stacks, spilling registers
- Generally kept in L1 and L2.
- Each SIMD Processor has local memory
- Used for sharing among the threads within a Thread Block (see the sketch after this list)
- Tends to be small
- Can’t share with other processors
- GPU Memory
- Shared by all Thread Blocks
- Example above used GPU Memory
- Only memory accessible to the host
- Nearby threads should ideally have nearby memory accesses so they can be grouped together.
- Newer chips are beginning to offer direct CPU-to-GPU connections to avoid the comparatively slow PCIe bus.
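- A sketch (assumed kernel and names, not from the lecture) of where the three memory spaces appear in CUDA:

// Per-block reduction. Assumes 512 threads per block and n a multiple of 512.
// Scalars like tid live in per-thread private registers/memory.
__global__ void block_sum(const float *in, float *out) {
    __shared__ float buf[512];        // local memory: visible only within this thread block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];   // in/out live in GPU (global) memory
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction within the block
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];
}

// Host side: only GPU (global) memory is reachable from the host, so data
// must be copied in and out explicitly.
void run(const float *h_in, float *h_out, int n, int nblocks) {
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, nblocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    block_sum<<<nblocks, 512>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, nblocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}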
- Step through loop unrolling lab. Look closely at where the savings come from.