CIS 451

IA Differences

Fall 2019

Warning: This assignment is designed to be completed in-class. There are details missing that you will need to talk to me about if you are attempting to make this assignment up.

The purpose of this lab is to explore the different behaviors of different CPUs that implement the Intel instruction set. To put it another way, we want to observe an AMD processor behaving differently from an Intel processor, especially with respect to the implementation of their superpipelines.

Background

This lab will use a lot of Intel assembly language as well as many different UNIX tools. Before you begin, you may want to review the Intel Assembly Overview from the Intel Machine Language assignment, as well as the following UNIX topics:

C programming
Bash scripts
pipes
perl
gnuplot

Determining the pipeline width

At a high level, we are going to determine the number of cycles it takes to execute n instructions. This will give us a rough estimate of the number of instructions executed per cycle (IPC). For example, if the CPU can execute 1000 instructions in 500 cycles, then we know that, on average, it is completing 2 instructions per cycle.
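The arithmetic above can be sketched in C. This is only an illustration: the function names `instructions_per_cycle` and `cycles_per_instruction` are made up for this example and do not appear in the lab's files.

```c
#include <assert.h>

/* Average number of instructions completed per cycle (IPC) for one
   measurement. Hypothetical helper; not part of the lab's code. */
double instructions_per_cycle(unsigned long instructions, unsigned long cycles) {
    assert(cycles > 0);  /* a measurement of zero cycles is meaningless */
    return (double)instructions / (double)cycles;
}

/* The reciprocal view: average cycles per instruction (CPI). */
double cycles_per_instruction(unsigned long instructions, unsigned long cycles) {
    assert(instructions > 0);
    return (double)cycles / (double)instructions;
}
```

With the numbers from the example, `instructions_per_cycle(1000, 500)` gives 2.0, and `cycles_per_instruction(1000, 500)` gives 0.5.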

Timing Code with `rdtsc`

All x86 processors since the Pentium have a register called the time stamp counter. The CPU sets this register to 0 when it boots up and increments it every time the clock ticks. You can use this register to time code: simply use the rdtsc assembly instruction to read the time stamp counter immediately before and after the code you are timing.

For this lab, we will be timing code that looks something like this:

    rdtsc            # read the starting cycle count (rdtsc puts its answer in %edx and %eax)
    push %eax        # save the timestamp on the stack so it doesn't get clobbered by the next call to rdtsc
    addl $1, %ecx
    addl $1, %ecx
    ...              # repeat this instruction many times.
    addl $1, %ecx
    addl $1, %ecx
    rdtsc            # read the ending cycle count
    pop %ebx         # retrieve the starting cycle count
    subl %ebx, %eax  # Compute the elapsed number of cycles (subtract the starting value, %ebx, from the ending value)

In theory, the difference in the values returned by the two calls to rdtsc is the number of cycles required to run the n intervening instructions. (This isn't completely true, for reasons that will be discussed later, but it is good enough for now.)
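The listing above keeps only %eax, the low 32 bits of the 64-bit counter. The sketch below shows, in C, why the subtraction still works if the low half wraps around between the two reads. The names `elapsed_low32` and `combine_tsc` are invented for this illustration, and the result is only meaningful under the assumption that fewer than 2^32 cycles actually elapsed.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors `subl %ebx, %eax`: elapsed cycles computed from the low 32 bits
   of two counter reads. Unsigned subtraction wraps modulo 2^32, so the
   answer is still right if the low half overflowed once between reads.
   Assumption: the true elapsed count is below 2^32 cycles. */
uint32_t elapsed_low32(uint32_t start_eax, uint32_t end_eax) {
    return (uint32_t)(end_eax - start_eax);
}

/* rdtsc returns the full counter split across %edx (high half) and
   %eax (low half); this rebuilds the 64-bit value from the two halves. */
uint64_t combine_tsc(uint32_t edx, uint32_t eax) {
    return ((uint64_t)edx << 32) | (uint64_t)eax;
}
```

For example, a start value of 0xFFFFFFF0 and an end value of 0x00000010 yield an elapsed count of 0x20 (32 cycles), even though the end value is numerically smaller than the start value.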

Collecting Evidence

There is a fair amount of noise in the measurement of the assembly code shown above: The superscalar processor will likely schedule the rdtsc instructions in parallel with other nearby instructions. Therefore, we cannot be sure we are timing precisely n instructions. To compensate for this uncertainty, we will take many measurements with increasing values of n. The overall trend will give us a clearer picture of the superpipeline's width than the measurement for a single n.
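One way to summarize "the overall trend" is to fit a straight line to the (n, cycles) measurements and look at its slope. The sketch below assumes a simple model, cycles ≈ overhead + slope × n, in which the fixed cost of the timing code ends up in the intercept and the slope approximates cycles per instruction. The function name `fit_slope` is hypothetical, and ordinary least squares is just one reasonable choice; reading the slope off a plot works too.

```c
#include <assert.h>
#include <stddef.h>

/* Least-squares slope of cycles versus n. Under the assumed linear model
   this estimates cycles per instruction; its reciprocal estimates IPC.
   Hypothetical helper; not part of the lab's code. */
double fit_slope(const double *n, const double *cycles, size_t count) {
    assert(count >= 2);  /* a line needs at least two points */

    double sum_n = 0.0, sum_c = 0.0, sum_nn = 0.0, sum_nc = 0.0;
    for (size_t i = 0; i < count; i++) {
        sum_n  += n[i];
        sum_c  += cycles[i];
        sum_nn += n[i] * n[i];
        sum_nc += n[i] * cycles[i];
    }

    double denom = (double)count * sum_nn - sum_n * sum_n;
    assert(denom != 0.0);  /* the n values must not all be identical */
    return ((double)count * sum_nc - sum_n * sum_c) / denom;
}
```

As a check with made-up numbers (not real measurements): the points (1000, 600), (2000, 1100), and (3000, 1600) lie on the line cycles = 100 + 0.5n, so the fitted slope is 0.5 cycles per instruction, which corresponds to about 2 instructions per cycle.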

The code to be measured will not contain a loop: Instead of looping over a body of n instructions, we will "hand write" a sequence of n instructions. We avoid using a loop because the loop structure would add extra instructions with different dependencies. To avoid loops, we write a different function for each value of n we wish to test. The Ruby script generate_experiments3.rb automates this process.
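To make the "hand written" idea concrete, here is a minimal C sketch of how a generator could emit a loop-free run of n instructions. It is not the logic of generate_experiments3.rb (which is a Ruby script); the function name `emit_unrolled` and its parameters are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes n assembly instructions into buf, one per line, cycling through
   the strings in pool. Returns the number of characters written (not
   counting the terminating NUL), or 0 if buf is too small.
   Hypothetical helper; the lab's real generator is a Ruby script. */
size_t emit_unrolled(char *buf, size_t cap,
                     const char *const pool[], size_t pool_len, size_t n) {
    assert(buf != NULL && cap > 0 && pool_len > 0);

    size_t used = 0;
    buf[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        /* The instruction text is passed as an argument, so the '%' in
           register names is never interpreted as a format specifier. */
        int written = snprintf(buf + used, cap - used, "\t%s\n",
                               pool[i % pool_len]);
        if (written < 0 || (size_t)written >= cap - used) {
            buf[0] = '\0';  /* did not fit: report failure with empty output */
            return 0;
        }
        used += (size_t)written;
    }
    return used;
}
```

Calling it with a one-entry pool containing `"addl $1, %ecx"` and n = 4 produces four identical lines with no branch or counter in between, which is the property the lab needs.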

Run generate_experiments3.rb. Examine the resulting output. Notice:

The function step_group contains the sequence of instructions to be repeated and timed.
The functions time_xx_ops contain the code to time a sequence of the desired length. Notice that this function:
- follows conventional stack frame set-up/tear-down (whereas step_group does not),
- calls step_group enough times to generate a sequence of the desired length, and
- contains the calls to rdtsc that time the sequence of operations.

The output from generate_experiments3.rb contains only the code being timed. time_instructions2.c contains main as well as a loop that calls the functions to be tested and prints the results.

Comparing CPUs and interpreting the results

This semester, we have three CPUs available for experimentation:

The Intel i7s in EOS
The Intel i5s in Data Com
The AMDs in Arch

Begin by timing a sequence of addl instructions:

Run generate_experiments3.rb and redirect the output to a file (I'll call it asm1.s).
Examine asm1.s and make sure you understand what it's doing.
Compile asm1.s and time_instructions2.c together into a single executable and run it.
You will notice that the CPI for the initial workload is less than 1. Have me explain why, then adjust time_instructions2.c to account for the problem.
Run the previous test on the different CPUs and plot the results together on the same graph. Attach a copy of your plot to your lab report. If using gnuplot, you can change your output format and save the plot to a file. First type "set term mode", where mode is one of many output formats, including "postscript", "jpeg", "png", and "pbm". (Type help term for the complete list.) After you have set the term, type "set output filename". Typing "set term x11" will return the output to the screen. WARNING: gnuplot doesn't always correctly overwrite files. So, make sure your plots look good on the screen before changing the terminal and output file.
Find the approximate slope of each line.
Describe the differences you see between CPUs (if any).
Now, edit generate_experiments3.rb so that instead of timing a sequence of addl $1, %eax, it times a sequence of addl $1, %eax; addl $1, %ecx. (Look for where instruction_pool is defined near line 154.) Run it on the different hardware. What are the slopes of the lines?
The slopes of the lines should be significantly less than 1. What does this tell you?
The first test produces code that looks like this:
```
		addl $1, %eax
		addl $1, %eax
		addl $1, %eax
		addl $1, %eax
		...
		
```
whereas the second produces code that looks like this:
```
		addl $1, %eax
		addl $1, %ecx
		addl $1, %eax
		addl $1, %ecx
		...
		
```
Why can the CPU run the second example faster?
Figure out how many add instructions each different CPU can do in parallel. Attach graphs demonstrating this. (In other words, show me how you figured it out.) In general, each plot should contain more than one line. Look most closely at the slope of the line in the range [10000, 15000]. (If you don't see at least one CPU that can handle more than 2 instructions at a time, let me know.) WARNING: Make sure your timing code doesn't use %ebx, %r13d, or any other register that is in use at the point where the code being timed is run.
The i7 can, in theory, complete 4 instructions in parallel. Try to find a sequence of four instructions that the i7 will issue and execute in parallel.