CIS 451

Vector Processing

Fall 2019

Objective

The objective of this lab assignment is to allow students to examine source code for vector processing and observe the performance benefits.

Chaos

Consider the (obviously broken) recursive function below:

  float chaos(x, init) {
    return chaos(x*x + init);
  }

From a programming perspective, this function is broken because it has no base case and will never terminate. However, from a mathematical perspective, some of the inputs have well-defined outputs. If we were to run this broken code, we might find that the values stop changing much at some point. For example, chaos(0, 0) will limit to 0, but chaos(-1.0, 0) just keeps growing.

It turns out that for inputs 0.25 ≤ c < 2, this function has a limit. If we plot this function it looks like this:

Notice that the plot is reasonably smooth for 0.25 ≤ c < 1.5; but, when c reaches 1.5, the values start jumping all over the place. In this interval, we say that this function is chaotic. (Yes, there really is a mathematical definition of chaos.)

The Mandelbrot Set

If we mapped outputs to colors, we could make an interesting drawing; however, since the function is one-dimensional, it would just be stripes. So, lets change the function so that it operates on a complex numbers. That way, we can define a two-dimensional image.

This is the new function:

  float mandelbrot(z, init) {
    return mandelbrot(z*z + init);
  }

This time, instead of plotting the limit, we are going to keep track of how fast it grows --- how many iterations we can go before norm(z) > 2.0.

The result is this familiar picture:

Brighter colors represent slower divergence. The bright green represents points that diverge very slowly: It takes many iterations for the point to reach a distance of 2.0 from the origin. Dark green colors diverge very quickly: They reach a distance of 2.0 from the origin within a few iterations. Black points are those that do not diverge at all. They are still withing 2.0 of the origin after the maximum number of iterations. (See math.hws.edu/eck/js/mandelbrot/MB-info.html for more information.)

Download mandelbrotComplex.cpp and figure out how it works. (This is the simplest implementation. It uses a class to represent complex numbers. That class overloads the arithmetic operators.)

What do the variables dx and dy represent?
What does the line *image++ = count; do? Explain how this line works. (Be certain to explain all the parts of *image++.)
Compile mandelbrotComplex with the -O0 flag. Then run the program and report how long it takes. (Note: The flag is a capital O then a zero. The -O flag is for Optimization. Contrast this with the lower-case o flag, which is for output.)
g++ -O0 mandelbrotComplex.cpp
time ./a.out
Now compile mandelbrotComplex with the -O3, run it, and report how long it takes.

Now, download mandelbrot.c and figure out how the function madelbrot_c_regular works. The benefit of this C version is that it results in much simpler assembly code; however, we must (1) store the real and imaginary parts of the complex numbers separately, and (2) write our own code to implement complex operations.

What is the variable tzr used for? Why is it needed?
Compile mandelbrot.c with the -O0 flag. Then run the program and report how long it takes.
gcc -mfma -O0 mandelbrot.c
time ./a.out
Now compile mandelbrot.c with the -O3, run it, and report how long it takes.
What is the speedup of the optimized C code over the optimized C++ code?

The AVX instructions

Intel provides a set of instructions that can process several elements in an array at once. The machines in our labs can process up to 8 floating point values at once.

Look at mandelbrot_c_avx. Notice that the inner for loop (the one that uses col as the loop index) increments by 8. This is because each iteration of this loop processes 8 points at once. The functions beginning with __mm256_ call Intel's special vector instructions. Thoroughly examine this code until you understand how it works.

What does the _mm256_broadcast_ss function do?
What is the purpose of the incr array?
What is the purpose of this call: _mm256_xor_ps(dx_ymm0, dx_ymm0)?
What is the purpose of this line of code? test = _mm256_movemask_ps(norm_z_ymm15) & 255U; Is it necessary to the correct operation of the algorithm? If so, what would go wrong if it were missing. If not, why is it optional; and, what is the benefit of keeping it?
Modify main to call mandelbrot_c_avx, then compile the program, run it, and report how long it takes.
Now compile the updated program with -O3, run it, and report how long it takes.
What is the speedup of the optimized code over the unoptimized code?
What is the speedup of the optimized AVX code over the optimized C code?
Each AVX operaton performs 8 arithmetic operations. Why is the speedup less than 8?

Examine the assembly

Compile mandelbrot.c down to assembly: gcc -mfma -S mandelbrot.c.

Find the assembly code for mandelbrot_c_avx (look for the label).

How much room is allocated on the stack for this function?
List the AVX instructions used by this method.

Now, examine the assembly code for mandelbrot_c_regular.

Why does this "non-avx" assembly code use AVX assembly instructions?

If you are curious, you can comment out mandelbrot_c_avx then run gcc -mno-avx -S mandelbrot.c to see what Intel's older style of floating point assembly looked like.

Your turn

Write your own AVX code. It can be almost anything: a different fractal, a matrix multiply, a simple simulation. (The only thing you can't do is download and compile somebody's else's completed AVX code.)

Updated Monday, 18 November 2019, 8:41 PM