CIS 451

Intel Machine Language

Winter 2019

The overview sections of this lab were originally written by Prof. Wolffe.

You will want to work in pairs. You don't want to tackle this alone. Trust me.

Start early! This is a challenging assignment, and you will need time to ask many questions. If you wait until two or three days before the deadline, you won't have time to get all your questions answered. (Also, remember Piazza.)

Overview

The purpose of this lab is to explore an instruction set and machine language that is different from MIPS. In particular, we will be exploring the Intel 64/IA-32 machine language. As you work through this assignment, pay attention to the differences between MIPS and IA-32 including instruction length, addressing modes, and the variety of instructions included in the instruction sets.

Resources

Overview of the Intel 80x86 Architecture

A typical view of the Intel register set shows eight general-purpose, 32-bit registers (eax, ebx, ecx, edx, esi, edi, esp, ebp). Although these registers are technically general purpose (meaning that they can hold any 32-bit value), in practice only eax, ebx, ecx, and edx are used for general data. esp is typically the stack pointer, and ebp is typically the frame pointer.

Local variables are typically stored in the memory around an address stored in register ebp, also called the frame pointer. (Think of a "frame" around some segment of memory.) For example, int x may be at location %ebp - 4, and int y at location %ebp - 8. (Remember, ints are 4 bytes long.)
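As a rough illustration of the paragraph above, here is a tiny C function with two int locals. The function name add_locals is hypothetical (it is not one of the lab files), and the offsets in the comments are only what an unoptimized 32-bit build might choose; the actual layout depends on the compiler, its version, and its flags.

```c
/* Hypothetical example; not part of the lab's source files. */
int add_locals(void)
{
    int x = 3;   /* might live at -4(%ebp) in an unoptimized 32-bit build */
    int y = 4;   /* might live at -8(%ebp); the compiler is free to swap these */
    return x + y;
}
```

Compiling a file containing this function with gcc -S and searching the output for (%ebp) (or (%rbp) on a 64-bit build) is one way to see which offsets your compiler actually picked.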

Like many machines, Intel processors use a stack. The register esp holds the address of the top of the stack (the most recently pushed item). The stack grows down, meaning that the value of esp decreases as data are added to the stack. Functions store their data on the stack.

Functions are required to leave the stack and frame pointers as they found them. Therefore, the first few assembly language instructions of any function save the current values of esp and ebp (typically on the stack). In addition, the frame for local variables is typically on the stack. Therefore, functions with local data also decrease the stack pointer (remember, the stack grows down) far enough to make room for all the local variables. The last thing a function does is restore the stack and frame pointers to their original values.
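The save/allocate/restore sequence can be sketched as a toy simulation in C. Everything here (ToyCpu, push32, pop32, toy_call, the 64-byte stack) is a made-up model for illustration, not real hardware behavior or compiler output. It assumes the common pattern in which the old ebp is pushed on the stack and the old esp is remembered in ebp.

```c
#include <stdint.h>
#include <string.h>

#define STACK_BYTES 64

/* Toy model: a small byte-addressed stack plus the two pointer registers. */
typedef struct {
    uint8_t  mem[STACK_BYTES];
    uint32_t esp;   /* offset of the most recently pushed item */
    uint32_t ebp;   /* frame pointer */
} ToyCpu;

static void store32(ToyCpu *c, uint32_t addr, uint32_t v)
{
    memcpy(&c->mem[addr], &v, sizeof v);
}

static uint32_t load32(const ToyCpu *c, uint32_t addr)
{
    uint32_t v;
    memcpy(&v, &c->mem[addr], sizeof v);
    return v;
}

/* The stack grows down: a push decreases esp, a pop increases it. */
static void push32(ToyCpu *c, uint32_t v) { c->esp -= 4; store32(c, c->esp, v); }
static uint32_t pop32(ToyCpu *c) { uint32_t v = load32(c, c->esp); c->esp += 4; return v; }

/* Models a function with two int locals. */
uint32_t toy_call(ToyCpu *c)
{
    push32(c, c->ebp);          /* save the caller's frame pointer */
    c->ebp = c->esp;            /* the new frame is anchored here */
    c->esp -= 8;                /* make room for two 4-byte locals */

    store32(c, c->ebp - 4, 3);  /* int x = 3, at ebp - 4 */
    store32(c, c->ebp - 8, 4);  /* int y = 4, at ebp - 8 */
    uint32_t result = load32(c, c->ebp - 4) + load32(c, c->ebp - 8);

    c->esp = c->ebp;            /* discard the locals */
    c->ebp = pop32(c);          /* restore the caller's frame pointer */
    return result;
}
```

After toy_call returns, esp and ebp hold exactly the values they had on entry, which is the "leave them as you found them" requirement described above.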

Finally, a few additional notes:

Generating and Understanding Assembler Code

The easiest way to write correct assembly code is to let a compiler do it for us! Then we just have to figure out what it's doing and why. A quick look at the man pages for the gcc compiler under Linux shows that the '-S' switch directs the compiler to generate assembly and stop (without producing an executable). We can use this feature to learn the nature of a particular instruction set by writing simple, understandable programs in a high-level language (like C) and studying what the compiler produces.

A brief note about assembler notation in the interest of making the programs easier to read:

As an example, begin with the program exampleIML-1b.c. Run gcc -S exampleIML-1b.c to produce assembly code for the native machine. (I compiled this code on a 64-bit Intel i7. If you use a different machine, your code may look different.)

Look at the resulting code. Lines that begin with a dot (.) are either labels or directives. They are used by the assembler, but are not instructions that are executed when the program runs.

  1. Explain what each of the assembly language instructions in exampleIML-1b.s does and why. (A couple of the "whys" aren't obvious, so don't hesitate to ask for help.) Two hints:

Intel Machine Language

In comparison to MIPS, the Intel machine language is extremely complex: instructions can be anywhere from 1 to 15 bytes long, and each instruction supports many different addressing modes. Fortunately, most of the extremely complex instructions are very specialized and rarely used (e.g., the MMX instructions). The instructions you will examine today are much simpler.

Figure 2-1 on page 2-1 (aka page 31) of Volume 2 of the Software Developer's Manual shows the basic format of an Intel instruction. Some instructions have a prefix of up to four bytes. The next 1 to 3 bytes contain the opcode. The instructions we will examine all have one-byte opcodes. The byte after the opcode (the ModR/M byte) describes the operands and the addressing modes of the operands. As shown in Figure 2-1, this byte is divided into three fields:

Look at Table 2-2 on page 2-6. The leftmost column lists the possible values for the parameter that may specify a memory address. The notation EAX specifies a basic register-direct access to register eax (i.e., the data for the instruction is contained directly in register eax). Brackets specify a memory access. For example [EAX] means that the data for the instruction is not stored directly in eax, but rather the memory location specified by eax. (In other words, eax contains the memory address at which the data is located.)
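The difference between EAX and [EAX] is loosely analogous to the difference between using a C variable's value and dereferencing a pointer. The sketch below is only an analogy; the function names operand_direct and operand_indirect are invented for illustration.

```c
/* EAX: the operand is the register's own contents. */
int operand_direct(int eax)
{
    return eax;
}

/* [EAX]: the register holds an address; the operand is whatever is stored there. */
int operand_indirect(const int *eax)
{
    return *eax;
}
```

For example, if an int variable cell holds 42, then operand_indirect(&cell) yields 42 even though the "register" itself holds only the address of cell.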

The top row lists the registers that can serve as the "non-memory" parameter. (Remember, if there are two parameters, the second parameter may not be a memory address.) The cell where the row and column intersect contains the entire ModR/M byte. For example, an instruction with parameters [EDX] and EBX would have a ModR/M byte of 0x1A. Notice that the bits listed in the row and column headers (00, 011, and 010 → 0001 1010) equal 0x1A, the listed ModR/M byte. (See the yellow and green highlighted items in this marked-up version of Table 2-2.)
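The arithmetic in this example can be checked with a few lines of C. This is a minimal sketch that assumes the bit layout implied by the worked example above (Mod in the top two bits, the register field in the middle three, R/M in the bottom three). The names ModRM, modrm_encode, and modrm_decode are invented for illustration.

```c
#include <stdint.h>

/* The three fields of a ModR/M byte. */
typedef struct {
    uint8_t mod;   /* bits 7-6 */
    uint8_t reg;   /* bits 5-3 */
    uint8_t rm;    /* bits 2-0 */
} ModRM;

/* Pack the three fields into one byte. */
uint8_t modrm_encode(uint8_t mod, uint8_t reg, uint8_t rm)
{
    return (uint8_t)(((mod & 0x3) << 6) | ((reg & 0x7) << 3) | (rm & 0x7));
}

/* Split one byte back into its three fields. */
ModRM modrm_decode(uint8_t byte)
{
    ModRM f;
    f.mod = (uint8_t)(byte >> 6);
    f.reg = (uint8_t)((byte >> 3) & 0x7);
    f.rm  = (uint8_t)(byte & 0x7);
    return f;
}
```

With the values from the example, modrm_encode(0, 3, 2) (Mod 00, register 011, R/M 010) produces 0x1A, and modrm_decode(0x1A) recovers the same three fields. The same helper may be handy for checking your own table lookups later in the lab.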

  1. Using Table 2-2, identify the addressing mode that corresponds to each of the four possible values of Mod. Hint: Look at the operands in the "Effective Address" column. If you saw EBP or %ebp in an assembly instruction, where would the data for that instruction come from? What addressing mode does that correspond to? How about if you saw [EBP]-16 (which gcc writes as -16(%ebp))?

Now, let's look at some real Intel machine code:

Now, it's your turn:

  1. List the machine instruction for each of the instructions marked with a number, and identify the meaning of each byte. Clearly indicate how the source(s) and destination are specified. Some hints and sample output appear below.

    Your answers should look something like this:

    You may find this template helpful.

  2. Notice that the push instruction is only one byte long. How did the designers squeeze both the opcode and the operand into one byte?
  3. When using Table 2-2, sometimes the Mod and R/M bits (the leftmost column) refer to the source operand, and sometimes they refer to the destination. How can you tell whether the leftmost column refers to the first or second operand? Hint: Compare instructions main+36 and main+46.
  4. How and where does instruction main+15 encode that one of the parameters is an immediate value? How is the ModR/M byte for this instruction used?
  5. Extra Credit: Explain how the Intel 64 (x86-64) machine language encodes the "r" registers (r8d, r9d, ..., r15d). Your explanation should include a table for this instruction: sub %r11d, %r14d

Updated Monday, 28 January 2019, 10:29 AM
