CIS 451

Intel Machine Language

Winter 2019

The overview sections of this lab were originally written by Prof. Wolffe.

You will want to work in pairs. You don't want to tackle this alone. Trust me.

Start early! This is a challenging assignment, and you will need time to ask many questions. If you wait until two or three days before the deadline, you won't have time to get all your questions answered. (Also, remember Piazza.)

Overview

The purpose of this lab is to explore an instruction set and machine language that is different from MIPS. In particular, we will be exploring the Intel 64/IA-32 machine language. As you work through this assignment, pay attention to the differences between MIPS and IA-32 including instruction length, addressing modes, and the variety of instructions included in the instruction sets.

Resources

Section 6.8 in the Harris and Harris textbook discusses the IA-32 architecture. Pay particular attention to how the registers are organized, and the instruction format.
You may also want to check out Sections 2.16 (p. 134) and 6.10 (p. 448) in the Patterson and Hennessy textbook.
There is a link to the Intel architecture support page on the course info page.
Here are links to the Intel Software Developer's Manuals: Volume 1, Volume 2, and Volume 3. You can also access the development manuals directly from the Intel Developer's web site; however, if they have updated the documents since June 2014, the page numbers may not line up with the page numbers I reference below.
- These are massive PDF documents. Do not print them out!

Overview of the Intel 80x86 Architecture

A typical view of the Intel register set shows eight general-purpose, 32-bit registers (eax , ebx , ecx , edx , esi , edi , esp , ebp). Although these registers are technically general purpose (meaning that they can hold any 32-bit value), in practice only eax , ebx , ecx , and edx are used for general data. esp is typically the stack pointer; and ebp is typically the frame pointer.

Local variables are typically stored in the memory around an address stored in register ebp, also called the frame pointer. (Think of a "frame" around some segment of memory.) For example, int x may be at location %ebp -4, and int y at location %ebp -8. (Remember, ints are 4 bytes long.)

Like many machines, Intel processors use a stack. The register esp holds the address of the top of the stack (the most recently pushed item). The stack grows down, meaning that the value of esp decreases as data are added to the stack. Functions store their data on the stack.

Functions are required to leave the stack and frame pointers as they found them. Therefore, the first few assembly language instructions of any function save the current value of ebp (typically on the stack) and copy esp into ebp. In addition, the frame for local variables is typically on the stack. Therefore, functions with local data also decrease the stack pointer (remember, the stack grows down) enough to hold all the local variables. The last thing a function does is restore the stack and frame pointers to their original values.
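The frame setup and teardown described above can be modeled in a few lines of C. This is only a sketch under stated assumptions, not compiler output: the Regs struct, the memory array, and the function names are hypothetical, and a 32-bit frame with 4-byte pushes is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two registers involved in a stack frame. */
typedef struct {
    uint32_t esp;  /* stack pointer */
    uint32_t ebp;  /* frame pointer */
} Regs;

/* Simulated memory; an "address" is just an index into this array. */
static uint8_t memory[256];

/* Store a 32-bit value at addr, least significant byte first. */
static void store32(uint32_t addr, uint32_t value) {
    assert(addr + 4 <= sizeof memory);
    for (int i = 0; i < 4; i++)
        memory[addr + i] = (uint8_t)(value >> (8 * i));
}

/* Load a 32-bit value stored least significant byte first. */
static uint32_t load32(uint32_t addr) {
    assert(addr + 4 <= sizeof memory);
    uint32_t value = 0;
    for (int i = 0; i < 4; i++)
        value |= (uint32_t)memory[addr + i] << (8 * i);
    return value;
}

/* Models: push ebp; mov ebp = esp; sub esp, local_bytes */
void prologue(Regs *r, uint32_t local_bytes) {
    r->esp -= 4;               /* the stack grows down */
    store32(r->esp, r->ebp);   /* save the caller's frame pointer */
    r->ebp = r->esp;           /* new frame starts here */
    r->esp -= local_bytes;     /* reserve room for locals */
}

/* Models: leave (mov esp = ebp; pop ebp) */
void epilogue(Regs *r) {
    r->esp = r->ebp;           /* discard the locals */
    r->ebp = load32(r->esp);   /* restore the caller's frame pointer */
    r->esp += 4;
}
```

Starting from esp = 200 and ebp = 240, calling prologue with 8 bytes of locals leaves ebp = 196 and esp = 188, so the two ints sit at ebp - 4 and ebp - 8. Calling epilogue afterwards restores both registers to their starting values.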

Finally, a few additional notes:

Intel assembly follows different conventions depending on context, so you must pay careful attention:
- The assembly produced by gcc follows a different convention from MIPS: The destination of an operation is typically the last operand, not the first.
- However, the Intel Software Developer's Manual lists the destination first.
Arithmetic/logical instructions use only two operands — one operand doubles as both source and destination.
Because Intel processors are backward compatible with earlier 16- and 8-bit machines, the lower 16 bits of each 32-bit register have their own names, namely ax, bx, cx, dx, si, di, sp, bp. Similarly, the names ah, bh, ch, dh and al, bl, cl, dl designate the most significant and least significant bytes, respectively, of ax, bx, cx, and dx (in earlier microprocessors Intel used 8-bit registers). See Figure 6.36 in Harris and Harris, or Figure 2.40 in the Patterson and Hennessy text. Similarly, on a 64-bit machine, e?x is the lower 32 bits of the 64-bit register r?x. If you look at 64-bit code you will see both the 32-bit "e" versions and the 64-bit "r" versions of the registers used. In general, memory addresses will use the "r" registers (e.g., rsp and rbp) while the 32-bit arithmetic operands will use the "e" registers (eax, ebx, etc.).
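The register naming scheme above amounts to bit masking. The following sketch shows which bits each name refers to; the function names are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* ax within eax: the lower 16 bits of the 32-bit value. */
uint16_t low16(uint32_t e) {
    return (uint16_t)(e & 0xFFFFu);
}

/* al within eax: the least significant byte. */
uint8_t low_byte(uint32_t e) {
    return (uint8_t)(e & 0xFFu);
}

/* ah within eax: bits 8 through 15, the upper byte of ax. */
uint8_t high_byte(uint32_t e) {
    return (uint8_t)((e >> 8) & 0xFFu);
}

/* eax within rax: the lower 32 bits of the 64-bit value. */
uint32_t low32(uint64_t r) {
    return (uint32_t)(r & 0xFFFFFFFFu);
}
```

For example, if eax held 0x12345678, then ax would be 0x5678, ah would be 0x56, and al would be 0x78.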

Generating and Understanding Assembler Code

The easiest way to write correct assembly code is to let a compiler do it for us! Then we just have to figure out what it's doing and why. A quick look at the man pages for the gcc compiler under Linux shows that the '-S' switch directs the compiler to generate assembly and stop (without producing an executable). We can use this feature to learn the nature of a particular instruction set by writing simple, understandable programs in a high-level language (like C) and studying what the compiler produces.

A brief note about assembler notation in the interest of making the programs easier to read:

registers are preceded by a (%)
constants are preceded by a ($)
labels are indicated by (.L)
offsets and memory references are indicated by parentheses (). For example, movl %esp, %ebp means to take the value of %esp and put it in %ebp. In contrast, movl (%esp), %ebp means to take the value at the top of the stack and put it in %ebp.

As an example, begin with the program exampleIML-1b.c. Run gcc -S exampleIML-1b.c to produce assembly code for the native machine. (I compiled this code on a 64-bit Intel i7. If you use a different machine, your code may look different.)

Look at the resulting code. Lines that begin with a dot (.) are either labels or directives. They are used by the assembler, but are not instructions that are executed when the program runs.

The first three instructions (the pushq, movq, and subq) set up the stack and frame pointers.
The next four instructions initialize local variables a, b, difference and printf_answer. Notice that variable names do not appear in the assembly code. Instead, each local variable is assigned to a memory location referenced as an offset from the frame pointer.
The next several instructions handle the subtraction.
The eight instructions beginning with movl -8(%rbp), %ecx set up the parameters, call printf, and save the return value. Notice how the parameters are passed in spare registers.
Finally, the return value is set up; the leave instruction restores the stack and frame pointers; and the method ends.

Explain what each of the assembly language instructions in exampleIML-1b.s does and why. (A couple of the "whys" aren't obvious, so don't hesitate to ask for help.) Two hints:
- Remember, this code is un-optimized, so not all operations may have been completed in as few instructions as possible.
- Review the section above discussing how the "r" registers are related to the "e" registers.

Intel Machine Language

In comparison to MIPS, the Intel machine language is extremely complex: instructions can be anywhere from 1 to 15 bytes long, and each instruction supports many different addressing modes. Fortunately, most of the extremely complex instructions are very specialized and rarely used (e.g., the MMX instructions). The instructions you will examine today are much simpler.

Figure 2-1 on page 2-1 (aka page 31) of Volume 2 of the Software Developer's Manual shows the basic format of an Intel instruction. Some instructions have a prefix of up to four bytes. The next 1 to 3 bytes contain the op code. The instructions we will examine all have one byte opcodes. The byte after the opcode describes the operands and the addressing modes of the operands. As shown in Figure 2-1, this byte is divided into three fields:

Mod: Bits 6 and 7 represent the addressing mode of the "memory" parameter. If an instruction has 2 parameters, at most one may be a memory address. The other must be a register or an immediate value. The "Mod" bits specify the addressing mode of the parameter that may be a memory address.
R/M: Bits 0 - 2 identify the particular register (eax, ebx, etc.) that uses the addressing mode specified by bits 6 and 7.
Reg/Opcode: For some instructions, bits 3-5 specify the instruction's "non-memory" parameter (i.e., the one that must be a register or immediate value). Other instructions use these 3 bits as an extension to the op code. (This is indicated by a slash and a number after the op-code in the instruction reference.)
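The three fields can be pulled out of a ModR/M byte with shifts and masks. This is a minimal sketch; the ModRM struct and the decode_modrm name are hypothetical, introduced here only for illustration.

```c
#include <stdint.h>

/* The three fields of a ModR/M byte. */
typedef struct {
    uint8_t mod;  /* bits 7-6: addressing mode */
    uint8_t reg;  /* bits 5-3: register or opcode extension */
    uint8_t rm;   /* bits 2-0: register used by the addressing mode */
} ModRM;

/* Split a ModR/M byte into its fields. */
ModRM decode_modrm(uint8_t byte) {
    ModRM f;
    f.mod = (byte >> 6) & 0x3;
    f.reg = (byte >> 3) & 0x7;
    f.rm  = byte & 0x7;
    return f;
}
```

For example, decode_modrm(0x1A) yields mod = 0, reg = 3, and rm = 2.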

Look at Table 2-2 on page 2-6. The leftmost column lists the possible values for the parameter that may specify a memory address. The notation EAX specifies a basic register-direct access to register eax (i.e., the data for the instruction is contained directly in register eax). Brackets specify a memory access. For example [EAX] means that the data for the instruction is not stored directly in eax, but rather the memory location specified by eax. (In other words, eax contains the memory address at which the data is located.)

The top row lists the registers that can serve as the "non-memory" parameter. (Remember, if there are two parameters, the second parameter may not be a memory address.) The cell where the row and column intersect contains the entire ModR/M byte. For example, an instruction with parameters [EDX] and EBX would have a ModR/M byte of 0x1A. Notice that the bits listed in the row and column headers (00, 011, and 010 → 0001 1010) are equal to 0x1A, the listed ModR/M byte. (See the yellow and green highlighted items in this marked-up version of Table 2-2.)
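The table lookup above is equivalent to packing the three fields into one byte. The sketch below assumes the standard 3-bit register numbers (EAX = 0 through EDI = 7); the enum and the make_modrm name are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 3-bit register numbers used in the reg and R/M fields. */
enum {
    REG_EAX = 0, REG_ECX = 1, REG_EDX = 2, REG_EBX = 3,
    REG_ESP = 4, REG_EBP = 5, REG_ESI = 6, REG_EDI = 7
};

/* Pack mod (2 bits), reg (3 bits), and rm (3 bits) into a ModR/M byte. */
uint8_t make_modrm(uint8_t mod, uint8_t reg, uint8_t rm) {
    assert(mod <= 3 && reg <= 7 && rm <= 7);
    return (uint8_t)((mod << 6) | (reg << 3) | rm);
}
```

With mod = 0 (the [EDX] row), reg = REG_EBX, and rm = REG_EDX, make_modrm returns 0x1A, matching the cell in the table.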

Using Table 2-2, identify the addressing mode that corresponds to each of the four possible values of Mod. Hint: Look at the operands in the "Effective Address" column. If you saw EBP or %ebp in an assembly instruction, where would the data for that instruction come from? What addressing mode does that correspond to? How about if you saw [EBP]-16 (which gcc writes as -16(%ebp))?

Now, let's look at some real Intel machine code:

(You will want to do this on EOS/Arch. The compilers on other CPU/OS combinations are almost always slightly different. These differences almost always give students a lot of trouble.)
Begin by compiling exampleIML-1b.c down to assembly code (gcc -S exampleIML-1b.c).
Next, assemble and link exampleIML-1b.s with the debug flag (gcc -g exampleIML-1b.s -o ex1).
Launch the GNU debugger (gdb) on your compiled file (gdb ex1).

Issue the command disassemble main. You should see output that looks something like the sample below. If it looks different and you are on EOS/Arch, let me know. The first column lists the address of each instruction in main. The second column lists the address of the instruction relative to the beginning of main. The third and fourth columns contain the assembly instruction. Notice that the instruction movq %rsp, %rbp has been replaced with a simple mov instruction.

(7)   0x0000000000001139 <+0>:	push   %rbp
(6)   0x000000000000113a <+1>:	mov    %rsp,%rbp
(ex)  0x000000000000113d <+4>:	sub    $0x10,%rsp
(ex)  0x0000000000001141 <+8>:	movl   $0x52c,-0x10(%rbp)
(1)   0x0000000000001148 <+15>:	movl   $0x1619,-0xc(%rbp)
      0x000000000000114f <+22>:	movl   $0x2694,-0x8(%rbp)
      0x0000000000001156 <+29>:	movl   $0x8ad,-0x4(%rbp)
(2)   0x000000000000115d <+36>:	mov    -0x10(%rbp),%eax
      0x0000000000001160 <+39>:	sub    -0xc(%rbp),%eax
(3)   0x0000000000001163 <+42>:	mov    %eax,-0x8(%rbp)
      0x0000000000001166 <+45>:	mov    -0x8(%rbp),%ecx
      0x0000000000001169 <+48>:	mov    -0xc(%rbp),%edx
      0x000000000000116c <+51>:	mov    -0x10(%rbp),%eax
(ex)  0x000000000000116f <+54>:	mov    %eax,%esi
      0x0000000000001171 <+56>:	lea    0xe8c(%rip),%rdi        # 0x2004
(4)   0x0000000000001178 <+63>:	mov    $0x0,%eax
      0x000000000000117d <+68>:	callq  0x1030 <printf@plt>
      0x0000000000001182 <+73>:	mov    %eax,-0x4(%rbp)
      0x0000000000001185 <+76>:	mov    -0x8(%rbp),%eax
(5)   0x0000000000001188 <+79>:	leaveq
      0x0000000000001189 <+80>:	retq

Cut-and-paste this information into another window.
Type x main+54 to look at the machine code for the fourteenth instruction (mov %eax, %esi). Bytes with lower addresses are displayed to the right; therefore, the instructions will look as if they are printed "backwards". In this example, the first byte of the mov instruction is 0x89; the second is 0xc6. The remaining two bytes are part of the next instruction. If you look on page 3-508 in Volume 2 of the Developer's Guide, you will see that 0x89 is one of many opcodes for the mov instruction. If you look in Table 2-2 (on page 2-6), you will see that a ModR/M byte of 0xc6 indicates that the parameters are registers %eax and %esi. (See the purple items in this marked-up version of Table 2-2.)
Now, type x/2 main+8. The /2 tells gdb to print two four-byte words. (We need to print two words because the instruction is 7 bytes long.) The words are displayed left-to-right in increasing order; but, within each word, the byte with the lowest address appears on the right. Thus, you would read this seven-byte instruction as 0xc745f02c050000. (I know it's confusing. Remember, I didn't design it, I'm just showing you how it works.)
The first byte of this instruction is 0xc7. If you look on page 3-508 of the Developer's Guide, you will see that 0xc7 is an opcode for mov. (For some reason, the instruction is listed as movl in your assembly code; but, you look up mov in the Developer's Guide.) Notice the /0 after the opcode. If you look on page 3-2 of the Developer's Guide, you will see that the /0 means the reg field of the ModR/M byte holds the opcode extension 0 instead of naming a register (i.e., ignore the register listed at the top). Looking at the Mod and R/M bits of the second byte, 0x45 (i.e., the leftmost column), tells us that one of the operands is memory location [rbp] plus an 8-bit displacement. In fact, the destination of this instruction is memory location %RBP - 0x10; and the next byte, 0xf0, is the two's complement, hexadecimal representation of -0x10. Finally, the last four bytes are the immediate value being stored.

When read "first-to-last", which is low-to-high, the last four bytes are 0x2c050000. However, we conventionally write numbers high-to-low. Thus, when you reverse the order of these bytes you get 0x0000052c, which is the immediate value being moved. (Does your head hurt yet?)
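The byte shuffling described above can be checked mechanically. The sketch below uses hypothetical helper names. The two word values in the usage note are built from the seven instruction bytes quoted in the text; the top byte of the second word belongs to the next instruction and is assumed to be 0xc7.

```c
#include <stddef.h>
#include <stdint.h>

/* Split 32-bit words (as gdb's x command prints them) into bytes in
 * increasing address order: the lowest-address byte is the rightmost
 * two hex digits of each word. */
void words_to_bytes(const uint32_t *words, size_t nwords, uint8_t *out) {
    for (size_t w = 0; w < nwords; w++)
        for (size_t i = 0; i < 4; i++)
            out[4 * w + i] = (uint8_t)(words[w] >> (8 * i));
}

/* Read a 4-byte immediate stored least significant byte first. */
uint32_t read_le32(const uint8_t *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Interpret an 8-bit displacement as a two's complement signed value. */
int32_t sign_extend8(uint8_t b) {
    return (b & 0x80) ? (int32_t)b - 256 : (int32_t)b;
}
```

Given the words 0x2cf045c7 and 0xc7000005, words_to_bytes produces c7 45 f0 2c 05 00 00 followed by the assumed c7. Applying sign_extend8 to the third byte gives -16, and read_le32 on bytes four through seven gives 0x52c.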
Finally, type x main+4 to look at the sub instruction. When you look on page 4-394 of the Developer's Guide, you will notice that the first byte, 0x48, does not correspond to any of the sub opcodes. However, the second byte, 0x83, does. In addition, one of the choices for opcode is REX.W + 83. A trip back up to the top of Chapter 2 (specifically, Section 2.2.1 beginning on page 2-9) tells us that the REX prefixes are used to indicate that the instruction takes at least one 64-bit parameter. In particular, Table 2-4 tells us that the prefix 0x48 indicates that the operand size is 64 bits. All REX prefixes begin with a 4. At this point, we note the /5 after the opcode and look up the third byte, 0xec, in Table 2-2 to find that one of the parameters is %rsp. Page 4-394 tells us that the final parameter is an 8-bit immediate value: in this case, 0x10.
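The REX check described in the last step can be written as two bit tests. This is a minimal sketch that assumes 64-bit mode and looks only at the W bit; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* In 64-bit mode, a REX prefix is any byte whose upper four bits are 0100. */
bool is_rex_prefix(uint8_t byte) {
    return (byte & 0xF0) == 0x40;
}

/* The W bit (bit 3) of a REX prefix selects a 64-bit operand size. */
bool rex_w(uint8_t rex) {
    return (rex & 0x08) != 0;
}
```

For the sub instruction above, is_rex_prefix(0x48) and rex_w(0x48) are both true, while is_rex_prefix(0x83) is false, which is how a decoder would know that 0x83 is the opcode.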

Now, it's your turn:

List the machine instruction for each of the instructions marked with a number, and identify the meaning of each byte. Clearly indicate how the source(s) and destination are specified. Some hints and sample output appear below.

Remember that instructions have different lengths. Don't try to incorporate too much into an instruction.
Remember that within each word, the least significant bits are on the right. This makes the instructions feel "backwards", but the constants appear "forward".
Ignore the "l" on the end of mnemonics (i.e., "movl" means "mov").
When necessary, see Table 3-1 for the meaning of the +rw in the opcode.

Your answers should look something like this:

assembly instruction	mov %eax,%esi
Machine instruction (hex)	0x89c6
Byte number	1	2
field name	op code	Mod R/M
Field value	0x89	0xc6
Field meaning	mov	source: %eax, destination %esi
Info source	Page 3-508	Table 2-2

assembly instruction	movl $0x52c,-0x10(%rbp)
Machine instruction (hex)	0xc745f02c050000
Byte number	1	2	3	4-7
field name	op code	Mod R/M	offset	immediate value
Field value	0xc7	0x45	0xf0	0x0000052c
Field meaning	mov	destination [RBP] + offset	subtract 16 from RBP	immediate value
Info source	Page 3-508	Table 2.2	Table 2.2	page 3-508

assembly instruction	sub $0x10, %rsp
Machine instruction (hex)	0x4883ec10
Byte number	1	2	3	4
field name	prefix	op code	Mod R/M	Immediate value
Field value	0x48	0x83	0xec	0x10
Field meaning	64-bit operands	subtract	destination %rsp	immediate value 0x10
Info source	Table 2-4	Page 4-394	Table 2.2	Page 4-394

You may find this template helpful.

Notice that the push instruction is only one byte long. How did the designers squeeze both the opcode and the operand into one byte?
When using Table 2-2, sometimes the Mod and R/M bits (the leftmost column) refer to the source operand, and sometimes they refer to the destination. How can you tell whether the leftmost column refers to the first or second operand? Hint: Compare instructions main+36 and main+42.
How/where does instruction main+15 encode that one of the parameters is an immediate value? How is the ModR/M byte for this instruction used?
Extra Credit: Explain how the Intel 64 machine language encodes the "r" registers (r8d, r9d, ..., r15d). Your explanation should include a table for this instruction: sub %r11d, %r14d

Updated Monday, 28 January 2019, 10:29 AM