CIS 351

Cache

Fall 2020

Overview

This lab uses the Simplescalar CPU simulator to investigate the effects of different cache configurations on performance.

This lab will be due in two pieces. Part 1 is problems 1-5; it will be due next week. Part 2 is problems 6 - 11; it will be due the Monday after Thanksgiving. Officially, there is not lab Thanksgiving week; but, I will be online to answer questions with Part 2. Anyone is welcome to addend (regardless of your scheduled lab time).

Simplescalar

Simplescalar is a suite of programs that simulate the execution of programs compiled using a MIPS-like instruction set called PISA. You can simulate the execution of any program using Simplescalar by simply re-compiling it using a version of gcc that knows how to generate PISA instructions as well as x86 instructions.

For this lab, you will be using the tool sim-cache. This tool takes as input a description of a machine's memory hierarchy (i.e., cache levels) and reports on the number of hits and misses in each cache. Section 4.2 of The Simplescalar Tech Report explains how to describe the cache setup you want to simulate. When you read through this section, take note of three things:

With sim-cache, you don't specify the size of the cache directly. Instead you specify (1) the number of lines, (2) the block size, and (3) the associativity of the cache. The size of the cache is the product of these three numbers. Thus, a 4-way set associative cache with 1024 lines of 16 bytes each is 4*1024*16 = 65536 bytes (64 kilobytes).
The cache to be configured is mentioned twice on the command line. (The phrase "dl1" will appear twice when configuring the L1 data cache.)
The parameters to sim-cache frequently uses both "1" (the numeral "one") and "l" (a lower-case letter "L"). Watch carefully because the differences in print between the two can be subtle.

For example, to configure a machine with an 8KB, direct-mapped L1 data cache with 32 byte blocks, use this command: -cache:dl1 dl1:256:32:1:l. Notice that 256 blocks times 32 bytes per block equals 8192 bytes.

When running sim-cache, I recommend sending the output directly to a file using the command line parameters -redir:sim file1 and -redir:prog file2. file1 will contain the results of the simulation (i.e., cache hit and miss rates). file2 will contain the output produced by the program simulated. This data is generally not interesting.

Using a VM for Simplescalar

Simplescalar is designed to run on 32-bit machines. So, instead of using Arch/EOS directly, we will be using a Virtual Machine. Our sysadmin Tom has set up five virtual machines for us to use. They have these IP addresses:

192.168.216.22
192.168.216.26
192.168.216.27
192.168.216.28
192.168.216.29

To access these machines, you must first log into an EOS or Arch machine. Then, use ssh to log into one of the VMs: ssh simple@192.168.216.X (where X is the number of the particular machine you chose. Just pick one at random.) The username is simple and the password is SimpleSim2020.

You will all be sharing these machines; so,

the first time please create a directory to use as your workspace (so you don't accidentally mess up others' work).
Be nice.

Note: The VMs do not have a GUI.

Effects of block size

Your first task is to examine the effects of block size on a "toy" program. Look at blockSize1.c. This small program iterates through each byte in a large array. First, you will need to copy this file to your space on a VM: scp blockSize1.c simple@192.168.216.X:yourDirectory Next, compile this C program for Simplescalar using the following command: ss_gcc blockSize1.c (Make sure you are in your directory so you don't end up overwriting someone else's file.) Running this command will produce a file named a.out. (As with the normal version of gcc, you can specify the name of the executable generated using the -o flag.) This file will not run by itself. It will run only as input to one of the Simplescalar programs. If it does, you generated it using the wrong version of gcc.

Use sim-cache to determine the miss-rates of an 8KB, direct-mapped cache with the following block sizes: 8 bytes, 16 bytes, 32 bytes, and 64 bytes. To do so, use commands that look like this:

sim-cache -cache:dl1 dl1:line:block:1:l -redir:prog /dev/null -redir:sim output_block a.out

Where block ranges from 8 to 64, and line is set such that product of block times line is 8192.
Hints for running sim-cache:
- Notice that the end of the cache configuration parameter is the number 1 followed by the letter l (as in "lru").
- sim-cache is in /Simplescalar/simplesim-3.0/sim-cache
- It's not difficult to make a script to automatically runs sim-cache with varying block sizes and present the results. This line provides an example of how to perform arithmetic in a bash script: let num_lines=8192/$i
After you have run sim-cache for each block size, grep each output file (output_8, output_16, etc.) for the line "dl1.miss_rate". List the miss rate for each block size tested.
Based on your observations, determine a formula for the miss rate in terms of block_size.
Explain your results (i.e., what is happening in the cache during each memory access to produce the results you observed).
Now, write a C program for which the miss rate is considerably higher for a 16 byte block than for an 8 byte block. (The easiest way to do this is to find two array locations that conflict with a 16-byte block, but not with an 8-byte block. If you do this, you will see the cache with the 8-byte blocks have a nearly 0% miss rate while the cache with the 16-byte blocks has nearly a 100% miss rate.) List your source code, all cache parameters used, and the resulting hit rates. Hint: You need not loop through the entire array. Instead, find two addresses that collide in the cache. Remove the inner loop and make NUM_LOOPS 1000000.

Hints for writing programs with a specified cache behavior:

Notice in blockSize1.c that array is an array of characters; therefore, each item in the cache is exactly 1 byte. As a result, it is easy to identify data items that will or will not conflict in the cache. For example, in an 8KB direct-mapped cache, array bytes 0 and 8192 will conflict. Your job is to find sets of array elements that conflict with a 16 byte block, but not an 8 byte block.
gcc is a C compiler. Your code must be straight C. No iostreams; no "//"-style comments; and, all variables must be declared at the beginning of each function.
Declaring the local variables as "register" variables encourages the compiler to keep the values of these variables in a register, thereby reducing their effect on the cache hit rate.
Ideally, the array in the C code will line up with the beginning of the cache. In other words, array[0] should be mapped to cache slot 0. Sometimes, it gets mapped to array[8]. If you are confident your solution should work; but, it doesn't try adding 8 to each array index.

Optimal block size

Your next task is to determine the optimal block size for qsort.

Determine the optimal block size for qsort given a 1KB, 4KB, and 16KB cache. Present your results using a graph with block size on the x-axis and the miss rate on the y-axis. Please generate one graph with three lines: One each for 1KB, 4KB, and 16KB. Valid block sizes are 8, 16, 32, and 64. Your graph should have a form similar to Figure 8.18 in Harris and Harris (2nd edition).

Use input_1e4 for input. (It contains 50,000 randomly generated integers.)
The ss_qsort executable and sample inputs are found in ~/TestData.
To simulate a 1KB, direct-mapped cache with 16 byte blocks, use the following (very long) command line:
sim-cache -cache:dl1 dl1:64:16:1:l -redir:prog opt -redir:sim output_dl1:64:16:1:l ~/TestData/ss_qsort ~/TestData/input_1e4
This program should take a few seconds to run. If it finishes instantly, then something isn't configured correctly. The file opt may give you a clue. If not, ask the instructor for help.
If you are using gnuplot, entering the line set style data linespoints will plot all data files using the "linespoints" style. Using this shortcut means that you won't have to type with linespoints after every file.

Effects of associativity

Write a simple C program for which a 4-way associative cache has a significantly lower miss rate than a equally sized 2-way set associative cache. You may choose any cache size and block size you wish, but they must remain the same for the entire problem. Submit the source code, all cache parameters, and resulting hit rates. Hint: Write the simplest program you can that will produce a 100% miss rate for the 2-way cache.
Write a simple C program for which an associativity of 2 has a higher miss rate than a direct-mapped cache. You may choose any cache size and block size you wish, but they must remain the same for the entire problem. Submit the source code, all cache parameters, and resulting hit rates. As with the previous problem, start by writing a program that has a 100% miss rate for a 2-way cache.
Explain why your code above produces the miss rates observed (i.e., why your code has a higer miss rate on the two-way cache).
Choose qsort (or another interesting program of your choice) and a cache size. Produce a graph showing miss rates as associativity ranges over 1, 2, 4, 8, 16, and fully associative. Your graph should have associativity on the x-axis, and miss-rate on the y-axis. It should also contain four lines: one for each block size. Be sure to clearly label your graph with the cache size. Your graph should have a form similar to Figure 5.30 in Patterson and Hennessey (4th edition, revised).

Effects of replacement policy

Write a simple C program for which a random replacement policy results in a significantly lower miss rate than LRU. (Use an 8-way set associative cache with a size and block size of your choosing.) Submit the source code, all cache parameters, and resulting hit rates.
Choose a cache size and block size. Plot the effects of associativity and replacement policy on qsort. Your graph should have associativity on the x-axis and miss rate on the y-axis. There should be three lines: one for each replacement policy.

Updated Monday, 16 November 2020, 2:05 PM