CIS 351

Cache

Fall 2021

Overview

This lab was originally written to use the Simplescalar CPU simulator. The update to use Cachegrind instead was prepared by Prof. Bowman.

Lab 9 is problems 1-11. (Or 1-7 for F21 since I messed up.)
Lab 10 is problems 12-19.

Cachegrind

You may have used Valgrind in the past to check for memory leaks in your C code. Valgrind, however, is actually a set of tools built on a common framework, and one of those tools, Cachegrind, simulates caches of various configurations: it takes as input a description of a machine's memory hierarchy (i.e., its cache levels), runs a specified program, and reports the number of hits and misses in each level of the cache. Note: when looking at examples of Cachegrind, be careful to distinguish between "1" (the numeral "one") and "l" (a lower-case letter "L").

See the documentation to learn how to specify cache parameters to Cachegrind. This lab focuses on the level 1 data cache, which Cachegrind calls D1. You can leave the other caches (the L1 instruction cache, I1, and the last-level cache, LL) with their default configurations. A few notes:

  - For example, to simulate a machine with an 8KB, direct-mapped L1 data cache with 32-byte blocks, use this command: valgrind --tool=cachegrind --D1=8192,1,32 {program_to_run}
  - If you do not want Cachegrind to clutter your working directory with log files, you can use the argument --cachegrind-out-file=/dev/null to prevent the file from being created. But be sure you are recording the information you need somewhere.

Availability

Code

The code for this lab is in this GitHub repo: https://github.com/KurmasGVSU/CacheLabCode. This is not a GitHub Classroom repository. If you want your own repository where you can commit changes, you need to fork this repository as opposed to cloning it. If you simply clone this repository, you will get a read-only copy of the code and won't be able to push any changes back to GitHub.

To fork a repository, go to the repo's GitHub page and look for the "fork" button near the upper-right corner of the screen.

Effects of block size

Your first task is to examine the effects of block size on a "toy" program. Look at blockSize1.c. This small program iterates through each byte in a large array. Begin by compiling this C program using gcc as you normally would. The instructions below assume that you produced the default executable (a.out). If you named your executable something else using the -o flag, update the instructions accordingly.
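If you would like to see the general shape of the program before opening the repo, blockSize1.c looks roughly like the sketch below. (The array size, loop count, and variable names here are illustrative guesses, not the repo's actual values; consult the real file for those.)

    #include <stdio.h>

    #define ARRAY_SIZE (1024 * 1024)   /* hypothetical "large array" size */
    #define NUM_LOOPS  10              /* hypothetical repeat count */

    char data[ARRAY_SIZE];

    int main(void) {
        long sum = 0;
        for (int loop = 0; loop < NUM_LOOPS; loop++) {
            for (int i = 0; i < ARRAY_SIZE; i++) {
                sum += data[i];        /* touch every byte, in order */
            }
        }
        printf("%ld\n", sum);          /* use sum so the loops are not optimized away */
        return 0;
    }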

  1. Use Cachegrind to determine the miss rates of an 8KB, direct-mapped cache with the following block sizes: 32 bytes, 64 bytes, 128 bytes, and 256 bytes. To do so, use commands that look like this: valgrind --tool=cachegrind --D1=8192,1,{block} ./a.out where {block} is 32, 64, 128, or 256. When the program runs, Cachegrind will print many lines describing cache performance. The one you are interested in is D1 miss rate -- the miss rate for the L1 data cache. This value is rounded to the nearest 0.1%; if you want more precision, divide the number of D1 misses by the total number of D1 accesses (both are also reported). After you have run Cachegrind for each block size, save the miss rate from each run. List the miss rate for each block size tested.

    Hints for running Cachegrind -- you can ignore these or ask me if you are not comfortable with bash: a for loop can run all four configurations in one command, e.g., for b in 32 64 128 256; do valgrind --tool=cachegrind --cachegrind-out-file=/dev/null --D1=8192,1,$b ./a.out; done (the summary still prints to the terminal even when the output file is discarded).

  2. Based on your observations, determine a formula for the miss rate in terms of block size. The formula will not be exact, but it should track the miss rate quite closely. Two hints: the block size should appear in the denominator, and think about how many of the accesses to each block can miss.
  3. Your formula above should have an intuitive explanation (i.e., it should not just be a line fit to the data). Explain what is happening in the cache during each memory access to produce the results you observed.

Writing Code with Cache in Mind

  1. Now, write a C program for which the miss rate is considerably higher with a 64-byte block than with a 32-byte block. (The easiest way to do this is to find two array locations that conflict with a 64-byte block but not with a 32-byte block. If you do this, you will see that the cache with 32-byte blocks has a nearly 0% miss rate while the cache with 64-byte blocks has a nearly 100% miss rate.) You may not change the associativity or overall cache size. List your source code, all cache parameters used, and the resulting hit rates. Hint: You need not loop through the entire array. Instead, find two addresses that collide in the cache, remove the inner loop, and make NUM_LOOPS 10,000,000.
Hints for writing programs with a specified cache behavior: declare your array as a global so its address is fixed and easy to reason about, use the values you load (e.g., add them to a running sum that you print) so the compiler cannot optimize the accesses away, and check Cachegrind's total access count to confirm the program is doing what you think it is.
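As a concrete illustration of those hints, here is a minimal sketch of the kind of program that produces the desired conflict. It assumes the 8KB direct-mapped cache from the earlier problems; the stride arithmetic and names are mine, so rework the addresses if you choose different cache parameters.

    #include <stdio.h>

    /* data[0] and data[STRIDE] land in the same set when blocks are 64 bytes
     * (set = (addr >> 6) & 0x7F for an 8KB direct-mapped cache) but in
     * different sets when blocks are 32 bytes (set = (addr >> 5) & 0xFF). */
    #define STRIDE    (8192 + 32)
    #define NUM_LOOPS 10000000

    _Alignas(64) char data[2 * STRIDE];   /* alignment keeps the set math exact */

    int main(void) {
        long sum = 0;
        for (long i = 0; i < NUM_LOOPS; i++) {
            sum += data[0];          /* with 64-byte blocks, these two accesses */
            sum += data[STRIDE];     /* keep evicting each other                */
        }
        printf("%ld\n", sum);
        return 0;
    }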

Benefits of a large cache

Let's observe how the cache miss rate changes as the cache gets larger. We will use various sorting algorithms as our test programs. To compile the first test program, cd into the git repo and run make runInsertionSort.

Use Cachegrind to measure the D1 miss rate when running an insertion sort on the input file input_5e4, which you will find in the Data directory of the git repo. (The file is named input_5e4 because it contains 5x10^4 = 50,000 data points.)
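For example, assuming runInsertionSort takes the input file as a command-line argument (check the repo if the invocation differs), you would measure a 1KB direct-mapped cache with 32-byte blocks like this: valgrind --tool=cachegrind --D1=1024,1,32 ./runInsertionSort Data/input_5e4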

  1. Plot the miss rate as the cache size increases from 1KB to 512KB. (Use powers of two for the cache size. Choose any block size you like.) Running Cachegrind on this input should take 15 to 20 seconds on EOS/Arch and should generate about 5.6 million memory accesses. If you see a much faster run time or far fewer memory accesses, you are doing something wrong; the most common problem is a misconfigured input file. Similarly, if the program hasn't terminated after 45 seconds or so, you probably forgot the input file. Attach your plot to your lab report.
  2. When (for what cache size) does the miss rate reach zero?
  3. Why do you think this cache size produces a 100% hit rate? (Hint: Why is the input file named input_5e4?)

Optimal block size

Your next task is to determine the optimal block size for insertion sort and quick sort.
  1. Determine the optimal block size for insertion sort. Run runInsertionSort on input_5e4 with cache sizes of 2KB, 8KB, and 32KB and block sizes from 32 to 512 (powers of two only). Present your results as a graph with block size on the x-axis and miss rate on the y-axis. Please generate one graph with three lines, one for each cache size. Your graph should have a form similar to Figure 8.18 in Harris and Harris (2nd edition). There are a couple of ways to generate these graphs: one option is to place the data points into a spreadsheet that can generate graphs; another is to use a tool called gnuplot. To use gnuplot, place your data in a plain text file similar to sample_gnuplot_data; this sample file also contains the gnuplot commands that will generate the graph.
  2. Generate a similar plot for runMyQsort. Use the same cache and block sizes. (To compile the quick sort runner, run make runMyQsort.)
  3. When comparing the plots, you will notice that the optimal block size for quick sort is smaller than for insertion sort. Why is that? (Hint: Think about how each algorithm accesses the array as it runs.)
  4. (Optional) The makefile compiles the sort programs using the -O3 compile flag (optimization level 3). Recompile insertion sort using the -O0 flag (no optimization). What do you think accounts for the difference in miss rates? (Hint: Look at the total number of cache accesses for each optimization level.)

Effects of associativity

  1. Write a simple C program for which a 4-way set associative cache has a significantly lower miss rate than an equally sized 2-way set associative cache. You may choose any cache size and block size you wish, but they must remain the same for the entire problem. Submit the source code, all cache parameters, and the resulting hit rates. Hint: Write the simplest program you can that will produce a 100% miss rate for the 2-way cache. (A sketch of one such access pattern appears after this list.)
  2. Write a simple C program for which an associativity of 2 has a higher miss rate than a direct-mapped cache. You may choose any cache size and block size you wish, but they must remain the same for the entire problem. Submit the source code, all cache parameters, and resulting hit rates. As with the previous problem, start by writing a program that has a 100% miss rate for a 2-way cache.
  3. Explain why your code above produces the miss rates observed (i.e., why your code has a higher miss rate on the two-way cache).
  4. Graph the miss rates of quicksort, insertion sort, and gcc as the associativity ranges over 1, 2, 4, 8, 16, and fully associative. Use a 64-byte block and generate lines for both 1KB and 4KB caches. Your graph should have six lines: 1KB and 4KB for each of the three programs. Your graph should have a form similar to Figure 5.30 in Patterson and Hennessy (4th edition, revised). Notice how quickly the miss rate levels off as the associativity increases.
  5. Which programs benefit the most from increased associativity (i.e., associativity larger than two)? Why do you think these programs are more sensitive to the associativity?
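For the first two problems in this list, here is a sketch of one access pattern that works, assuming (hypothetically) an 8KB cache: three blocks exactly one cache size apart map to the same set at any associativity, so cycling through them thrashes a 2-way LRU cache (every access misses) while a 4-way cache holds all three (nearly every access hits).

    #include <stdio.h>

    #define CACHE_SIZE 8192        /* hypothetical: must match your --D1 size */
    #define NUM_LOOPS  10000000

    char data[3 * CACHE_SIZE];

    int main(void) {
        long sum = 0;
        for (long i = 0; i < NUM_LOOPS; i++) {
            sum += data[0];                /* three conflicting blocks:     */
            sum += data[CACHE_SIZE];       /* a 2-way LRU cache misses on   */
            sum += data[2 * CACHE_SIZE];   /* every access; a 4-way doesn't */
        }
        printf("%ld\n", sum);
        return 0;
    }

For the second problem, the same idea works with the stride shrunk to half the cache size (the 2-way cache's number of sets times its block size): the three blocks then share one set in the 2-way cache but occupy two different lines in the direct-mapped cache.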

Effects of replacement policy

  1. (Optional for Fall 2021) Write a simple C program for which a random replacement policy results in a significantly lower miss rate than LRU. (Use an 8-way set associative cache with a size and block size of your choosing.) Submit the source code. (A sketch of one workable access pattern appears after this list.)
  2. (Optional for Fall 2021) Unfortunately, Cachegrind only simulates LRU replacement. Explain why your code should have a lower miss rate with a random replacement policy than with LRU.
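Here is a sketch of one classic pattern for the problems above, assuming (hypothetically) an 8KB, 8-way set associative cache with 32-byte blocks, i.e., 32 sets; adjust STRIDE to (number of sets) x (block size) for your own parameters. Sweeping cyclically through nine blocks that all map to one set makes LRU evict exactly the block that is needed next, so every access misses, while random replacement usually evicts one of the other eight blocks, so many accesses hit.

    #include <stdio.h>

    #define STRIDE     1024        /* hypothetical: 32 sets x 32-byte blocks */
    #define NUM_BLOCKS 9           /* one more block than an 8-way set holds */
    #define NUM_LOOPS  1000000

    char data[NUM_BLOCKS * STRIDE];

    int main(void) {
        long sum = 0;
        for (long i = 0; i < NUM_LOOPS; i++) {
            for (int j = 0; j < NUM_BLOCKS; j++) {
                sum += data[j * STRIDE];   /* nine blocks, all in the same set */
            }
        }
        printf("%ld\n", sum);
        return 0;
    }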

Updated Tuesday, 23 November 2021, 1:52 PM
