CIS 451 Week 12
- Predication: Marking instructions as conditional on some predicate.
- Can be more efficient than control flow (no branch mispredicts); see the sketch below.
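- A minimal sketch of the idea in C (illustrative, not from the lecture): the branchy version may suffer mispredictions on unpredictable data, while the branch-free version computes both outcomes and selects one, which compilers typically lower to a conditional move (a simple form of predication).

// Branchy version: an unpredictable branch can cause mispredict stalls.
int abs_branch(int x) {
    if (x < 0) return -x;
    return x;
}

// Predicated (branch-free) version: both values exist and one is selected.
// Compilers usually emit a conditional move (cmov) rather than a branch here.
int abs_predicated(int x) {
    int neg = -x;
    return (x < 0) ? neg : x;
}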
Data-Level parallelism
- “If you were plowing a field, which would you rather use: Two strong oxen, or 1024 chickens?”
- Seymour Cray arguing that two powerful vector processors are better than many simple processors.
- What does SIMD stand for? (Single Instruction, Multiple Data)
- SIMD is applying the same operation (instruction) to an entire vector of values.
- What kind of applications use SIMD?
- Matrix-oriented scientific computing
- Image processing
- Sound processing
- SIMD is more energy efficient because there is less overhead (e.g., fetching and decoding operations)
- Programmers can still (mostly) think sequentially (unlike more complex multi-threaded environments)
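- For example (a sketch, not lecture code), the canonical pattern SIMD targets is an elementwise loop: one SIMD instruction applies the same operation to several elements at once.

// Elementwise add: the same operation applied across an entire vector.
// A scalar core issues one add per element; a SIMD unit covers several
// elements (e.g., 8 floats in a 256-bit register) per instruction.
void vec_add(int n, const float *a, const float *b, float *c) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}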
- The first SIMD approach was vector processors (supercomputers of the 1970s and 80s).
- Key issues were
- Cost of that many transistors
- Cost of sufficient memory bandwidth to provide data to that many ALUs (especially given the need for …)
- Other key SIMD challenges
- Stride in vectors that are not lined up with memory.
- Consider matrix multiply: either the rows or the columns will be contiguous in memory, so accesses along the other dimension have a large stride (see the sketch after this list).
- How can this affect the cache?
- It can cause problems with large block sizes, since an entire block is brought in for only one useful item.
- One solution is to prepare special memory instructions that understand stride and bring only the needed data into a vector.
- Adds considerable complexity to memory system.
- Limited memory bandwidth
- Mark which registers are unused so we don’t spend time sending them to memory on a context switch
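- A minimal sketch of the stride problem (assuming row-major C arrays; not the lecture's example):

#define N 1024

// Naive matrix multiply. a[i][k] is accessed with stride 1 (cache- and
// vector-friendly), but b[k][j] is accessed with a stride of N doubles:
// each iteration of the k loop touches a different cache block, so most
// of every block fetched for b goes unused.
void matmul(const double a[N][N], const double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            double sum = 0.0;
            for (int k = 0; k < N; ++k)
                sum += a[i][k] * b[k][j];   // b is walked column-wise: stride N
            c[i][j] = sum;
        }
}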
- Other vector processor instructions
- AXPY: a*X + Y (where a is a scalar)
- disable: Disable all vector registers so they aren't saved during a context switch.
- vector mask: Prevent an operation from affecting some cells in the vector.
- It's one way to handle if statements; a type of predication.
- Gather / Scatter for sparse matrices (see the sketch after this list)
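- A minimal sketch in plain C (not actual vector ISA code) of what a vector mask and a gather express:

// Vector mask: only elements whose mask bit is set are updated. A vector
// processor runs this as one masked AXPY rather than a branch per element.
void masked_axpy(int n, double a, const double *x, double *y, const char *mask) {
    for (int i = 0; i < n; ++i)
        if (mask[i])                     // per-element predicate
            y[i] = a * x[i] + y[i];
}

// Gather: load elements through an index vector, e.g. the column indices
// of one row of a sparse matrix (compressed sparse row storage).
double sparse_dot(int nnz, const double *vals, const int *col, const double *x) {
    double sum = 0.0;
    for (int i = 0; i < nnz; ++i)
        sum += vals[i] * x[col[i]];      // x[col[i]] is a gather
    return sum;
}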
- Next came x86 extensions
- MMX
- SSE
- AVX
- Focuses on floating point
- Latest version is 512 bits wide. That's sixteen 32-bit ops at once.
- At first, only basic vector ops: add, multiply
- No scalar
- No masking
- But, more complex ops have crept in over time.
- Different op codes for different data sizes (8-bit, 16-bit, etc.)
- Thus, large number of op codes added over time
- The number of op codes roughly doubles every time the vector register width doubles.
- At first, compilers couldn’t do much with MMX/SSE, etc.
- Lack of more sophisticated ops
- Only used to build specialized libraries in assembly.
- Modern compilers can automatically use AVX in some situations (see the sketch after this list)
- Look at AVX_Demo
- It’s not difficult to write code that the compiler can’t automatically optimize
- Popular because they are a cheap addition to existing hardware
- Just divide up existing adders
- No stride, so no memory complexities.
- Only a few registers, so less context switch costs.
- Depending on the application, performance is limited by either
- Total # of flops available
- Memory bandwidth.
- Look at AVX Assembly code
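- A sketch (not the lecture's AVX_Demo) of what explicit AVX code looks like, using the standard immintrin.h intrinsics; compile with something like gcc -O3 -mavx.

#include <immintrin.h>

// AXPY with 256-bit AVX: each instruction operates on 4 doubles at once.
void daxpy_avx(int n, double a, const double *x, double *y) {
    __m256d va = _mm256_set1_pd(a);           // broadcast a into all 4 lanes
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d vx = _mm256_loadu_pd(x + i);  // load 4 doubles of x
        __m256d vy = _mm256_loadu_pd(y + i);  // load 4 doubles of y
        vy = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);
        _mm256_storeu_pd(y + i, vy);          // store 4 results
    }
    for (; i < n; ++i)                        // scalar cleanup for leftover elements
        y[i] = a * x[i] + y[i];
}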
GPUs
- Many Parallel ALUs
- Several multithreaded SIMD processors.
- Each SIMD processor has many (e.g., 32) “lanes”
- Consider typical vector add loop
for (int i = 0; i < n; ++i) {
    y[i] = a*x[i] + y[i];
}
- Each iteration of the loop is conceptually a separate thread that can run in parallel with all the other “threads”.
- We could, in theory, assign each thread to a lane, let them run in any order and synchronize at the end.
- However, each SIMD processor has only one Program Counter: Threads must be executed in batches of 32.
- We want to abstract away from the specifics of the GPU (so our code will run well on different GPUs.)
- Therefore, we define a concept called “thread blocks”: Groups of threads all sent to the same SIMD.
- The GPU software itself will group each block's threads into batches of num_lanes. (Called "SIMD threads" by H&P.)
- CUDA Code:
__host__
int nblocks = (n + 511) / 512;   // ceiling of n/512: (n/512) + 1 would over-allocate when n is a multiple of 512
daxpy<<<nblocks, 512>>>(n, 2.0, x, y);

__global__
void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
- We choose to put 512 threads into a thread block because
- we want to be able to keep ALUs busy when other ops stall, and
- it is a multiple of num_lanes (so there aren't gaps).
- Note, we don’t necessarily know the number of SIMD processors or lanes, but we can choose values that work well for a variety of GPUs.
- 512 threads per thread block means 512/32 = 16 "SIMD threads" per block.
- Because each SIMD thread is independent, it can be scheduled in any order. Thus, the GPU scheduler is like a superscalar scheduler: it issues a SIMD thread as soon as all of its dependencies are satisfied.
- It has a "scoreboard", which is a predecessor of Tomasulo's algorithm.
- Notice that threads are identified by blockIdx and threadIdx, not just a single thread ID alone.
- Look at the assembly code
- Note where stalls can happen after arithmetic
- Because “SIMD Threads” share a PC, their paths cannot diverge.
- In other words, they can’t run different paths of a branch.
- In effect, branches are turned into predicated instructions.
- With an IF-THEN group, those instructions lose 1/2 the "bandwidth" (see the sketch after this list)
- Unless they all go the same way; then the other path is skipped entirely.
- Typical when checking for error conditions.
- Or, as with the if (i < n) statement above, it lets the odd leftover threads at the end not be a problem.
- Can have nested branches.
- Then saved PCs go on a stack
- But bandwidth drops correspondingly: to 25%, 12.5%, 6.25%, etc.
- Summary: Each thread is either
- Executing the same instruction as every other thread, or
- Idle.
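- A sketch (hypothetical kernels, not from the lecture) of divergence: when lanes in the same 32-thread batch take different sides of a branch, each side executes while the other lanes sit idle (masked off).

__global__ void divergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)                   // even and odd lanes in the same 32-thread batch disagree
        data[i] = data[i] * 2.0f;     // runs with the odd lanes masked off
    else
        data[i] = data[i] + 1.0f;     // then runs with the even lanes masked off
    // Each half executes while the other idles, so throughput drops to ~50%.
}

__global__ void uniform(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x % 2 == 0)          // every lane in a batch agrees, so no divergence
        data[i] = data[i] * 2.0f;
    else
        data[i] = data[i] + 1.0f;
}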
GPU Instruction set
- PTX (Parallel Thread Execution)
- Abstraction of “real” hardware instruction set
- Real instruction set is kept hidden to promote compatibility
GPU Memory:
- Each thread has private memory
- Other threads have no access
- Used for stacks, spilling registers
- Generally kept in L1 and L2.
- Each SIMD Processor has local memory
- Used for sharing among the threads within a Thread Block (see the sketch after this list)
- Tends to be small
- Can’t share with other processors
- GPU Memory
- Shared by all Thread Blocks
- Example above used GPU Memory
- Only memory accessible to the host
- Nearby threads should ideally have nearby memory accesses so they can be grouped together.
- Newer chips are beginning to offer direct CPU-to-GPU connections to avoid the comparatively slow PCIe bus.
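- A sketch (assumed kernel and names, not from the lecture) of where the three memory spaces appear in CUDA:

// Per-block reduction. Assumes 512 threads per block and n a multiple of 512.
// Scalars like tid live in per-thread private registers/memory.
__global__ void block_sum(const float *in, float *out) {
    __shared__ float buf[512];        // local memory: visible only within this thread block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];   // in/out live in GPU (global) memory
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction within the block
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];
}

// Host side: only GPU (global) memory is reachable from the host, so data
// must be copied in and out explicitly.
void run(const float *h_in, float *h_out, int n, int nblocks) {
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, nblocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    block_sum<<<nblocks, 512>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, nblocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}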
- Step through loop unrolling lab. Look closely at where the savings come from.