CIS 351 | Cache | Winter 2022
Lab 9 is problems 1-11. Lab 10 is problems 12-18.
You can see the documentation to learn how to specify cache parameters to Cachegrind. This lab focuses on the level 1 data cache, which is referred to as D1 in Cachegrind. You can leave the other caches with their default configurations.
A few notes:

The D1 configuration is given as --D1=&lt;size&gt;,&lt;associativity&gt;,&lt;line size&gt;. For example:

valgrind --tool=cachegrind --D1=8192,1,32 {program_to_run}

If you do not want Cachegrind to clutter your working directory with log files, you can use the argument --cachegrind-out-file=/dev/null to prevent the file from being created. But, be sure you are recording the information you need somewhere.
The code for this lab is in this GitHub repo: https://github.com/KurmasGVSU/CacheLabCode. This is not a GitHub Classroom repository. If you want your own repository where you can commit changes, you need to fork this repository as opposed to cloning it. If you simply clone this repository, you will get a read-only copy of the code and won't be able to push any changes back to GitHub. To fork a repository, go to the repo's GitHub page and look for the "fork" button near the upper-right corner of the screen (example).
Your first task is to examine the effects of block size on a "toy" program. Look at blockSize1.c. This small program iterates through each byte in a large array. Begin by compiling this C program using gcc as you normally would. The instructions below assume that you produced the default executable (a.out). If you named your executable something else using the -o flag, update the instructions accordingly.
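For reference, the core of blockSize1.c looks roughly like the sketch below. This is only an illustration of the access pattern described above -- the array size, the NUM_LOOPS value, and the loop details are assumptions, so consult the actual file in the repo.

/* Sketch only -- not the actual blockSize1.c.  The array size, NUM_LOOPS,
 * and indexing below are assumptions; see the repo for the real code. */
#include <stdio.h>

#define ARRAY_SIZE (1 << 20)   /* assumed: a large array of chars (1 MB) */
#define NUM_LOOPS  10000000    /* assumed: the value mentioned later in the lab */

static char array[ARRAY_SIZE];

int main(void)
{
    long sum = 0;

    /* Walk through the array one byte at a time.  Because the array is much
     * larger than the D1 cache, each new cache block the loop touches is a
     * miss, so larger blocks should lower the miss rate. */
    for (long i = 0; i < NUM_LOOPS; i++) {
        sum += array[i % ARRAY_SIZE];
    }

    printf("%ld\n", sum);   /* use the result so the loop isn't optimized away */
    return 0;
}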
Then run

valgrind --tool=cachegrind --D1=8192,1,{block} ./a.out

where {block} ranges from 32 to 256.
When the program runs correctly, Cachegrind will print many lines describing cache performance. The one you are interested in is "D1 miss rate" -- the miss rate for the L1 data cache. This value is rounded to the nearest 0.1%. If you want to see the result with more precision, you can divide the number of L1 data misses by the total number of L1 data accesses (the "D1 misses" and "D refs" lines in the output).
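If you would rather not do that division by hand, a few lines of C will do it. This is just a convenience sketch, not part of the lab code; the two counts are placeholders to replace with the "D refs" and "D1 misses" values from your own Cachegrind output.

/* Convenience sketch: compute the D1 miss rate with more precision than
 * the rounded percentage Cachegrind prints.  The two counts below are
 * placeholders -- substitute the values from your own Cachegrind output. */
#include <stdio.h>

int main(void)
{
    double d_refs    = 1000000.0;   /* placeholder: total L1 data accesses ("D refs")   */
    double d1_misses = 1250.0;      /* placeholder: L1 data misses ("D1 misses")        */

    printf("D1 miss rate: %.4f%%\n", 100.0 * d1_misses / d_refs);
    return 0;
}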
After you have run Cachegrind for each block size, save the miss rate from each run. List the miss rate for each block size tested.
Hints for running Cachegrind -- you can ignore these, or ask me if you are not comfortable with bash:

You can add an alias to your .bashrc similar to the following:

alias cachegrind='valgrind --tool=cachegrind --cachegrind-out-file=/dev/null'

You can also write a for loop right in the bash terminal.

Cachegrind writes its summary to standard error, so you can (1) append 2> cache_{block}.txt to the end of an instruction, then (2) grep the output files rather than searching them by hand. For example, grep 'D1  miss rate:' cache_64.txt (Note that the search string above has two spaces after D1.)

Putting these together:

for block in 32 64 128 256; do cachegrind --D1=8192,1,${block} ./a.out 2> output_${block}; done
grep 'D1  miss rate' output_*
Note that NUM_LOOPS is 10,000,000. Recall that in blockSize1.c, array is an array of characters; therefore, each item in the cache is exactly 1 byte. As a result, it is easy to identify data items that will or will not conflict in the cache. For example, in an 8KB direct-mapped cache, array bytes 0 and 8192 will conflict. Your job is to find sets of array elements that conflict with a 64-byte block, but not with a 32-byte block. This is basically a pencil-and-paper problem that we happen to be using a cache simulator to verify.
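If you want to sanity-check your pencil-and-paper work before running the simulator, the set arithmetic is easy to script. The helper below is not part of the lab code; it computes the set each byte address maps to in an 8KB direct-mapped cache and confirms the example above that bytes 0 and 8192 conflict.

/* Helper sketch (not part of the lab code): compute which set a byte
 * address maps to in an 8KB direct-mapped cache.  Two addresses conflict
 * when they map to the same set but lie in different blocks. */
#include <stdio.h>

#define CACHE_SIZE 8192L   /* 8KB, direct-mapped (associativity 1) */

static long set_index(long byte_addr, long block_size)
{
    long num_sets = CACHE_SIZE / block_size;       /* number of sets = cache size / block size */
    return (byte_addr / block_size) % num_sets;    /* block number modulo number of sets       */
}

int main(void)
{
    /* The lab's example: bytes 0 and 8192 map to set 0 for both block
     * sizes, so they conflict in this cache regardless of block size. */
    long block_sizes[] = { 32, 64 };

    for (int i = 0; i < 2; i++) {
        long b = block_sizes[i];
        printf("block=%3ld bytes: set(0)=%ld, set(8192)=%ld\n",
               b, set_index(0, b), set_index(8192, b));
    }
    return 0;
}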
Let's observe how the cache miss rate changes as the cache gets larger. We will use various sorting algorithms as our test programs.
To compile the first test program, cd into the git repo and run make runInsertionSort. Use cachegrind to measure the D1 miss rate when running an insertion sort on the input file input_5e4. You will find it in the Data directory in the git repo. (This file is named input_5e4 because it contains 5x10^4 = 50,000 data points.)
Running cachegrind on this input should take 15 to 20 seconds on EOS/Arch and should generate about 5.6 million memory accesses. If you are noticing a faster run time or fewer memory accesses, then you are doing something wrong; the most common problem is a misconfigured input file. Similarly, if the program doesn't terminate after 45 seconds or so, then you probably forgot the input file. Attach your plot
to your lab report. (What is the D1 miss rate for input_5e4?)

Measure the D1 miss rate of runInsertionSort using input_5e4 as input, given cache sizes of 2KB, 8KB, and 32KB and block sizes from 32 to 512 (powers of two only).
Present your results using a graph with block size on the x-axis and the miss rate on the y-axis. Please generate one graph with three lines: one for each cache size. Your graph should have a form similar to Figure 8.18 in Harris and Harris (2nd edition).
There are a couple of ways to generate these graphs. One option is to place the data points into a spreadsheet that can generate graphs. Another is to use a tool called gnuplot. To use gnuplot, place your data in a plain text file similar to sample_gnuplot_data. This sample file also contains the gnuplot commands that will generate the graph.
Repeat these measurements for runMyQsort. Use the same cache and block sizes. (To compile the quick sort runner, run make runMyQsort.)

These programs are compiled with the -O3 compile flag (optimization level 3). Recompile insertion sort using the -O0 flag (no optimization). What do you think accounts for the difference in miss rates? (Hint: Look at the total number of cache accesses for each optimization level.)

If you prefer to invoke gcc directly rather than using make, run gcc -O3 CacheLabCode/cacheTestTools.c CacheLabCode/myQsort.c CacheLabCode/runMyQsort.c -o test