CIS 451 | IA Differences | Fall 2019
Warning: This assignment is designed to be completed in class. There are details missing that you will need to talk to me about if you are attempting to make this assignment up.
The purpose of this lab is to explore the different behaviors of different CPUs that implement the Intel instruction set. To put it another way, we want to observe an AMD processor behaving differently from an Intel processor, especially with respect to the implementation of their superpipelines.
Background
This lab will use a lot of Intel assembly language as well as many different UNIX tools. Before you begin, you may want to review the Intel Assembly Overview from the Intel Machine Language assignment, as well as the following UNIX topics:
- C programming
- Bash scripts
- pipes
- perl
- gnuplot
Determining the pipeline width
At a high level, we are going to determine the number of cycles it takes to execute n instructions. This will give us a rough estimate of the number of instructions executed per cycle (IPC). For example, if the CPU can execute 1000 instructions in 500 cycles, then we know that, on average, it is completing 2 instructions per cycle.
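The arithmetic in the example above can be written out as a pair of small C helpers. This is only an illustrative sketch: the names `ipc` and `cpi` are made up for this example and do not appear in the lab's files.

```c
#include <assert.h>

/* Hypothetical helper (not part of the lab code): average number of
 * instructions completed per cycle. */
double ipc(unsigned long instructions, unsigned long cycles)
{
    assert(cycles > 0);   /* a zero-cycle measurement is meaningless */
    return (double)instructions / (double)cycles;
}

/* Hypothetical helper: cycles per instruction, the reciprocal of IPC. */
double cpi(unsigned long instructions, unsigned long cycles)
{
    assert(instructions > 0);
    return (double)cycles / (double)instructions;
}
```

For the numbers in the text, `ipc(1000, 500)` evaluates to 2.0 and `cpi(1000, 500)` to 0.5.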
Timing Code with rdtsc
All x86 processors have a register called the time stamp counter. The CPU sets this register to 0 when it boots up and increments it every time the clock ticks. You can use this register to time code: simply use the rdtsc assembly instruction to read the time stamp counter immediately before and after the code you are timing.
For this lab, we will be timing code that looks something like this:
    rdtsc              # read the starting cycle count (rdtsc puts its answer in %edx and %eax)
    push %eax          # save the timestamp on the stack so it doesn't get clobbered by the next call to rdtsc
    addl $1, %ecx
    addl $1, %ecx
    ...                # repeat this instruction many times.
    addl $1, %ecx
    addl $1, %ecx
    rdtsc              # read the ending cycle count
    pop %ebx           # retrieve the starting cycle count
    subl %ebx, %eax    # compute the elapsed number of cycles (subtract the starting value, %ebx, from the ending value)
In theory, the difference in the values returned by the two calls to rdtsc is the number of cycles required to run the n intervening instructions. (This isn't completely true for reasons that will be discussed later, but it is good enough for right now.)
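The same before/after measurement can be sketched in C. This is an illustration under stated assumptions, not the lab's timing code: the names `combine_tsc`, `elapsed_low32`, and `read_tsc` are hypothetical, and the inline assembly assumes GCC or Clang on an x86 machine. The assembly above keeps only %eax (the low 32 bits of the counter); `read_tsc` instead combines %edx and %eax into one 64-bit value.

```c
#include <stdint.h>

/* Combine the two 32-bit halves that rdtsc leaves in %edx (high) and %eax (low). */
uint64_t combine_tsc(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Elapsed count using only the low 32 bits, as in the assembly above.
 * Unsigned subtraction wraps modulo 2^32, so a single rollover of the
 * low half between the two reads still yields the right difference. */
uint32_t elapsed_low32(uint32_t start_lo, uint32_t end_lo)
{
    return (uint32_t)(end_lo - start_lo);
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Read the time stamp counter (assumes GCC/Clang inline assembly on x86). */
uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return combine_tsc(hi, lo);
}
#endif
```

A measurement would call `read_tsc()` immediately before and after the code of interest and subtract the first value from the second.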
Collecting Evidence
There is a fair amount of noise in the measurement of the assembly code shown above:
The superscalar processor will likely schedule the rdtsc instructions in parallel with other nearby instructions. Therefore, we cannot be sure we are timing precisely n instructions. To compensate for this uncertainty, we will take many measurements with increasing values of n. The overall trend will give us a clearer picture of the superpipeline's width than the measurement for a single n.
The code to be measured will not contain a loop: Instead of looping over a body of n instructions, we will "hand write" a sequence of n instructions. We avoid using a loop because the loop structure would add extra instructions with different dependencies. To avoid loops, we write a different function for each value of n we wish to test. The Ruby script generate_experiments3.rb automates this process.
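To illustrate the idea of generating one loop-free function per value of n, here is a rough C sketch. It is not a translation of generate_experiments3.rb, and its output is not meant to match that script's output: the function `emit_unrolled`, its parameters, and the emitted layout are all assumptions made for this example. Note that the loop lives in the generator only; the generated assembly contains no loop.

```c
#include <stdio.h>

/* Hypothetical sketch (not generate_experiments3.rb): write an assembly
 * function called `name` whose body is `instr` repeated n times.
 * Returns the number of copies of `instr` written, or -1 on bad input. */
int emit_unrolled(FILE *out, const char *name, const char *instr, int n)
{
    if (out == NULL || name == NULL || instr == NULL || n < 0) {
        return -1;
    }
    fprintf(out, "    .globl %s\n%s:\n", name, name);
    for (int i = 0; i < n; i++) {
        fprintf(out, "    %s\n", instr);   /* one copy of the instruction per pass */
    }
    fprintf(out, "    ret\n");
    return n;
}
```

For example, `emit_unrolled(stdout, "demo_4_ops", "addl $1, %ecx", 4)` prints a label followed by four addl lines and a ret. The name demo_4_ops is a placeholder chosen for this example.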
Run generate_experiments3.rb. Examine the resulting output. Notice:
- The function step_group contains the sequence of instructions to be repeated and timed.
- The functions time_xx_ops contain the code to time a sequence of the desired length. Notice that each of these functions:
  - follows conventional stack frame set-up/tear-down (whereas step_group does not),
  - calls step_group enough times to generate a sequence of the desired length, and
  - contains the calls to rdtsc that time the sequence of operations.
The output from generate_experiments3.rb contains only the code being timed. time_instructions2.c contains main as well as a loop that calls the functions to be tested and prints the results.
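A driver of the kind described could be organized around a table of function pointers. The sketch below is a guess at the general shape only and does not reproduce time_instructions2.c: the `experiment` struct, `run_experiments`, the stand-in function `fake_time_100_ops`, and the assumption that each timed function returns its elapsed cycle count are all hypothetical.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical table entry: a timed function and the number of
 * instructions it contains. Assumes the function returns elapsed cycles. */
typedef struct {
    unsigned (*timed_fn)(void);
    unsigned n_instructions;
} experiment;

/* Stand-in for a generated assembly function, so the sketch compiles
 * without the real generated code. The value 50 is made up. */
unsigned fake_time_100_ops(void)
{
    return 50;
}

/* Call each function once and print "n cycles" pairs, one per line.
 * Returns the number of rows printed. */
size_t run_experiments(const experiment *table, size_t count, FILE *out)
{
    size_t rows = 0;
    for (size_t i = 0; i < count; i++) {
        unsigned cycles = table[i].timed_fn();
        fprintf(out, "%u %u\n", table[i].n_instructions, cycles);
        rows++;
    }
    return rows;
}
```

Printing whitespace-separated columns keeps the output in a form that gnuplot can plot without further processing.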
Comparing CPUs and interpreting the results
This semester, we have three CPUs available for experimentation:
- The Intel i7s in EOS
- The Intel i5s in Data Com
- The AMDs in Arch
Begin by timing a sequence of addl instructions:
- Run generate_experiments3.rb and redirect the output to a file (I'll call it asm1.s).
- Examine asm1.s and make sure you understand what it's doing.
- Compile asm1.s and time_instructions2.c together into a single executable and run it.
- You will notice that the CPI for the initial workload is less than 1. Have me explain why, then adjust time_instructions to account for the problem.
- Run the previous test on the different CPUs and plot the results together on the same graph. Attach a copy of your plot to your lab report. If using gnuplot, you can change your output format and save the plot to a file. First type "set term mode", where mode is one of many output formats, including "postscript", "jpeg", "png", and "pbm". (Type help term for the complete list.) After you have set the term, type "set output filename". Typing "set term x11" will return the output to the screen. WARNING: gnuplot doesn't always correctly overwrite files. So, make sure your plots look good on the screen before changing the terminal and output file.
- Find the approximate slope of each line.
- Describe the differences you see between CPUs (if any).
- Now, edit generate_experiments3.rb so that instead of timing a sequence of addl $1, %eax, it times a sequence of addl $1, %eax; addl $1, %ecx. (Look for where instruction_pool is defined near line 154.) Run it on the different hardware. What are the slopes of the lines?
- The slopes of the lines should be significantly less than 1. What does this tell you?
- The first test produces code that looks like this:
    addl $1, %eax
    addl $1, %eax
    addl $1, %eax
    addl $1, %eax
    ...
  whereas the second produces code that looks like this:
    addl $1, %eax
    addl $1, %ecx
    addl $1, %eax
    addl $1, %ecx
    ...
  Why can the CPU run the second example faster?
- Figure out how many add instructions each different CPU can do in parallel. Attach graphs demonstrating this. (In other words, show me how you figured it out.) In general, each plot should contain more than one line. Look most closely at the slope of the line in the range [10000, 15000]. (If you don't see at least one CPU that can handle more than 2 instructions at a time, let me know.) WARNING: Make sure your timing code doesn't use %ebx, %r13d, or any other register that is in use at the point where the code being timed is run.
- The i7 can, in theory, complete 4 instructions in parallel. Try to find a sequence of four instructions that the i7 will issue and execute in parallel.
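One way to find the approximate slope of a line of measurements is an ordinary least-squares fit. The sketch below is one possible approach, not a method prescribed by this lab. The name `fit_slope` is hypothetical, and it assumes the instruction count is on the x axis and the measured cycle count on the y axis. Restricting the input to the points in the range of interest (for example, [10000, 15000]) is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: least-squares slope of y against x.
 * Assumes x[i] is a number of instructions and y[i] the measured cycles.
 * Requires at least two points whose x values are not all equal. */
double fit_slope(const double *x, const double *y, size_t n)
{
    assert(n >= 2);
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (size_t i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    double denom = (double)n * sxx - sx * sx;
    assert(denom != 0.0);   /* all x values equal: slope undefined */
    return ((double)n * sxy - sx * sy) / denom;
}
```

With these axes, the slope is measured in cycles per instruction, so the constant offset contributed by the timing overhead does not affect it.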