CIS 451
Intel Machine Language
Fall 2019
The overview sections of this lab were originally written by Prof. Wolffe.
You will want to work in pairs. You don't want to tackle this alone. Trust me.
Start early! This is a challenging assignment, and you will need time to ask many questions. If you wait until two or three days before the deadline, you won't have time to get all your questions answered. (Also, remember Piazza.)
A typical view of the Intel register set shows eight general-purpose, 32-bit registers (eax, ebx, ecx, edx, esi, edi, esp, ebp). Although these registers are technically general purpose (meaning that they can hold any 32-bit value), in practice only eax, ebx, ecx, and edx are used for general data. esp is typically the stack pointer, and ebp is typically the frame pointer.
Local variables are typically stored in the memory around an address stored in register ebp, also called the frame pointer. (Think of a "frame" around some segment of memory.) For example, int x may be at location %ebp - 4, and int y at location %ebp - 8. (Remember, ints are 4 bytes long.)
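As a rough illustration of the offsets above, this Python sketch computes where x and y would sit for a made-up frame pointer. The value 0x7FFF0010 and the helper name local_address are assumptions chosen for the example; real addresses differ from run to run.

```python
INT_SIZE = 4  # bytes per int, as noted above


def local_address(frame_pointer, slot):
    """Return the address of the slot-th int local (1-based) below the frame pointer."""
    return frame_pointer - slot * INT_SIZE


ebp = 0x7FFF0010                 # hypothetical frame pointer value
x_addr = local_address(ebp, 1)   # corresponds to %ebp - 4
y_addr = local_address(ebp, 2)   # corresponds to %ebp - 8
print(hex(x_addr), hex(y_addr))
```

Because the offsets are negative, y ends up at a lower address than x.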
Like many machines, Intel processors use a stack. The register esp holds the address of the top of the stack (the most recently pushed item). The stack grows down, meaning that the value of esp decreases as data are added to the stack. Functions store their data on the stack. Functions are required to leave the stack and frame pointers as they found them. Therefore, the first few assembly language instructions of any function store the current value of esp and ebp (typically on the stack). In addition, the frame for local variables is typically on the stack. Therefore, functions with local data also decrease the stack pointer enough to make room for all the local variables. The last thing a function does is restore the stack and frame pointers to their original values.
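The bookkeeping described above can be modeled in a few lines of Python. This is only a toy model under stated assumptions: the Machine class, the function call_with_locals, and the starting values 0x1000 and 0x2000 are all invented for illustration, and a 32-bit machine with 4-byte pushes is assumed.

```python
class Machine:
    """Toy 32-bit machine with a downward-growing stack."""

    def __init__(self, esp, ebp):
        self.esp = esp
        self.ebp = ebp
        self.memory = {}

    def push(self, value):
        self.esp -= 4                  # the stack grows toward lower addresses
        self.memory[self.esp] = value

    def pop(self):
        value = self.memory[self.esp]
        self.esp += 4
        return value


def call_with_locals(m, local_bytes):
    """Run a typical prologue and epilogue; return the lowest esp reached."""
    # Prologue: save the caller's frame pointer, start a new frame, reserve locals.
    m.push(m.ebp)
    m.ebp = m.esp
    m.esp -= local_bytes
    lowest = m.esp
    # Epilogue: discard the locals and restore the caller's frame pointer.
    m.esp = m.ebp
    m.ebp = m.pop()
    return lowest


m = Machine(esp=0x1000, ebp=0x2000)
lowest_esp = call_with_locals(m, local_bytes=8)
print(hex(lowest_esp), hex(m.esp), hex(m.ebp))
```

After the call, esp and ebp hold their original values, which is the requirement the paragraph above describes.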
Finally, a few additional notes:
gcc follows a different convention from MIPS: The destination of an operation is typically the last operand, not the first.
The lower 16 bits of each 32-bit register can be accessed using the names ax, bx, cx, dx, si, di, sp, bp. Similarly, the names (ah, bh, ch, dh, al, bl, cl, dl) designate the most significant and least significant bytes of ax, bx, cx, and dx (in earlier microprocessors Intel used 8-bit registers). See Figure 6.36 in Harris and Harris, or Figure 2.40 in the Patterson and Hennessy text. Similarly, on a 64-bit machine, e?x is the lower 32 bits of the 64-bit register r?x.
If you look at 64-bit code you will see both the 32-bit "e" versions and the 64-bit "r" versions of the registers used. In general, memory addresses will use the "r" registers (e.g., rsp and rbp) while the 32-bit arithmetic operands will use the "e" registers (eax, ebx, etc.).
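The register naming scheme in these notes amounts to bit masking. The Python sketch below shows the relationship for one register family; the function name sub_registers and the sample value are assumptions made for illustration.

```python
def sub_registers(rax):
    """Split a 64-bit rax value into the narrower views described above."""
    eax = rax & 0xFFFFFFFF    # lower 32 bits of rax
    ax = eax & 0xFFFF         # lower 16 bits of eax
    ah = (ax >> 8) & 0xFF     # most significant byte of ax
    al = ax & 0xFF            # least significant byte of ax
    return {"eax": eax, "ax": ax, "ah": ah, "al": al}


parts = sub_registers(0x1122334455667788)
for name, value in parts.items():
    print(name, hex(value))
```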
The easiest way to write correct assembly code is to let a compiler do it for us! Then we just have to figure out what it's doing and why. A quick look at the man pages for the gcc compiler under Linux shows that the '-S' switch directs the compiler to generate assembly and stop (without producing an executable). We can use this feature to learn the nature of a particular instruction set by writing simple, understandable programs in a high-level language (like C) and studying what the compiler produces.
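A brief note about assembler notation in the interest of making the programs easier to read:
Register names are prefixed with a percent sign (%).
Immediate (constant) values are prefixed with a dollar sign ($).
Compiler-generated local labels begin with a dot and an L (.L).
Memory accesses are denoted by parentheses ().
For example, movl %esp, %ebp means to take the value of %esp and put it in %ebp. In contrast, movl (%esp), %ebp means to take the value at the top of the stack and put it in %ebp.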
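The difference between the two movl examples can be mimicked with a tiny Python model. Everything here is an assumption for illustration: the movl helper, the dictionaries standing in for registers and memory, and the values 0x1000 and 42.

```python
def movl(src, dst_reg, regs, mem):
    """Toy movl: the source is a register name or a parenthesized register."""
    if src.startswith("(") and src.endswith(")"):
        address = regs[src[1:-1]]       # parentheses: use the register as an address
        regs[dst_reg] = mem[address]
    else:
        regs[dst_reg] = regs[src]       # no parentheses: copy the register itself


regs = {"%esp": 0x1000, "%ebp": 0}
mem = {0x1000: 42}                      # the value at the top of the stack

movl("%esp", "%ebp", regs, mem)
direct = regs["%ebp"]                   # the stack pointer value itself

movl("(%esp)", "%ebp", regs, mem)
indirect = regs["%ebp"]                 # the value stored at that address

print(hex(direct), indirect)
```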
As an example, begin with the program exampleIML-1b.c. Run gcc -S exampleIML-1b.c to produce assembly code for the native machine. (I compiled this code on a 64-bit Intel i7. If you use a different machine, your code may look different.)
Look at the resulting code. Lines that begin with a dot (.) are either labels or directives. They are used by the assembler, but are not instructions that are executed when the program runs.
The first three instructions (pushq, movq, and subq) set up the stack and frame pointers.
The program's local variables are a, b, difference, and printf_answer. Notice that variable names do not appear in the assembly code. Instead, each local variable is assigned to a memory location referenced as an offset from the frame pointer.
The instructions beginning with movl -8(%rbp), %ecx set up the parameters, call printf, and save the return value. Notice how the parameters are passed in spare registers.
The leave instruction restores the stack and frame pointers, and the function ends.
Describe what each instruction in exampleIML-1b.s does and why. (A couple of the "whys" aren't obvious, so don't hesitate to ask for help.) Two hints:
In comparison to MIPS, the Intel machine language is extremely complex: Instructions can be anywhere from 1 to 15 bytes long, and each instruction supports many different addressing modes. Fortunately, most of the extremely complex instructions are very specialized and rarely used (e.g., the MMX instructions). The instructions you will examine today are much simpler.
Figure 2-1 on page 2-1 (aka page 31) of Volume 2 of the Software Developer's Manual shows the basic format of an Intel instruction. Some instructions have a prefix of up to four bytes. The next 1 to 3 bytes contain the opcode. The instructions we will examine all have one-byte opcodes. The byte after the opcode (the ModR/M byte) describes the operands and the addressing modes of the operands. As shown in Figure 2-1, this byte is divided into three fields:
Bits 6 and 7 (Mod) specify the addressing mode.
Bits 3 through 5 (Reg/Opcode) specify a register operand or, for some instructions, extra opcode bits.
Bits 0 through 2 (R/M) specify a register (eax, ebx, etc.) that uses the addressing mode specified by bits 6 and 7.
Look at Table 2-2 on page 2-6. The leftmost column lists the possible values for the parameter that may specify a memory address. The notation EAX specifies a basic register-direct access to register eax (i.e., the data for the instruction is contained directly in register eax). Brackets specify a memory access. For example, [EAX] means that the data for the instruction is not stored directly in eax, but rather in the memory location specified by eax. (In other words, eax contains the memory address at which the data is located.)
The top row lists the registers that can serve as the "non-memory" parameter. (Remember, if there are two parameters, the second parameter may not be a memory address.) The cell where the row and column intersect contains the entire ModR/M byte. For example, an instruction with parameters [EDX] and EBX would have a ModR/M byte of 0x1A. Notice that the bits listed in the row and column headers (00, 011, and 010 → 0001 1010) are equal to 0x1A, the listed ModR/M byte. (See the yellow and green highlighted items in this marked-up version of Table 2.2.)
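The [EDX]/EBX example can be checked with a short Python sketch that packs and unpacks the three fields. The function names make_modrm and split_modrm are assumptions for illustration; the register numbering follows the order used in Table 2-2. The sketch deliberately ignores the special meanings that some R/M values take on in memory modes, so treat it as a reading aid rather than a full decoder.

```python
# Register numbers in the order used by the ModR/M encoding.
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def make_modrm(mod, reg, rm):
    """Pack Mod (bits 7-6), Reg (bits 5-3), and R/M (bits 2-0) into one byte."""
    return (mod << 6) | (REGS.index(reg) << 3) | REGS.index(rm)


def split_modrm(byte):
    """Return (mod, reg name, rm name) for a ModR/M byte."""
    return (byte >> 6) & 0b11, REGS[(byte >> 3) & 0b111], REGS[byte & 0b111]


# Mod 00 means plain [register] addressing, so this is [EDX] with EBX.
byte = make_modrm(0b00, "ebx", "edx")
fields = split_modrm(byte)
print(hex(byte), fields)
```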
If you saw either EBP or %ebp in an assembly instruction, where would the data for that instruction come from? What addressing mode does that correspond to? How about if you saw [EBP]-16 (which gcc writes as -16(%ebp))?
Now, let's look at some real Intel machine code:
Compile exampleIML-1b.c down to assembly code (gcc -S exampleIML-1b.c).
Assemble exampleIML-1b.s with the debug flag (gcc -g exampleIML-1b.s -o ex1).
Run the debugger (gdb) on your compiled file (gdb ex1).
Type disassemble main. You should see output that looks something like the sample below. If it looks different and you are on EOS/Arch, let me know. The first column lists the address of each instruction in main. The second column lists the address of the instruction relative to the beginning of main. The third and fourth columns contain the assembly instruction. Notice that the instruction movq %rsp, %rbp has been replaced with a simple mov instruction.
(7)   0x0000000000001139 <+0>:   push   %rbp
(6)   0x000000000000113a <+1>:   mov    %rsp,%rbp
(ex)  0x000000000000113d <+4>:   sub    $0x10,%rsp
(ex)  0x0000000000001141 <+8>:   movl   $0x52c,-0x10(%rbp)
(1)   0x0000000000001148 <+15>:  movl   $0x1619,-0xc(%rbp)
      0x000000000000114f <+22>:  movl   $0x2694,-0x8(%rbp)
      0x0000000000001156 <+29>:  movl   $0x8ad,-0x4(%rbp)
(2)   0x000000000000115d <+36>:  mov    -0x10(%rbp),%eax
      0x0000000000001160 <+39>:  sub    -0xc(%rbp),%eax
(3)   0x0000000000001163 <+42>:  mov    %eax,-0x8(%rbp)
      0x0000000000001166 <+45>:  mov    -0x8(%rbp),%ecx
      0x0000000000001169 <+48>:  mov    -0xc(%rbp),%edx
      0x000000000000116c <+51>:  mov    -0x10(%rbp),%eax
(ex)  0x000000000000116f <+54>:  mov    %eax,%esi
      0x0000000000001171 <+56>:  lea    0xe8c(%rip),%rdi        # 0x2004
(4)   0x0000000000001178 <+63>:  mov    $0x0,%eax
      0x000000000000117d <+68>:  callq  0x1030 <printf@plt>
      0x0000000000001182 <+73>:  mov    %eax,-0x4(%rbp)
      0x0000000000001185 <+76>:  mov    -0x8(%rbp),%eax
(5)   0x0000000000001188 <+79>:  leaveq
      0x0000000000001189 <+80>:  retq
Type x main+54 to look at the machine code for the fourteenth instruction (mov %eax, %esi). Bytes with lower addresses are displayed to the right; therefore, the instructions will look as if they are printed "backwards". In this example, the first byte of the mov instruction is 0x89; the second is 0xc6. The remaining two bytes are part of the next instruction. If you look on page 3-508 in Volume 2 of the Developer's Guide, you will see that 0x89 is one of many opcodes for the mov instruction. If you look in Table 2-2 (on page 2-6), you will see that a ModR/M byte of 0xc6 indicates that the parameters are registers %eax and %esi. (See the purple items in this marked-up version of Table 2.2.)
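To see why the bytes look "backwards", here is a small Python sketch of how a four-byte word is formed from memory. The first two bytes are the ones named above; the last two (0x48, 0x8d) are an assumption about how the following lea instruction begins, so the word gdb prints on your machine may differ. The helper name as_gdb_word is also invented for illustration.

```python
def as_gdb_word(four_bytes):
    """Combine four bytes into one little-endian word, lowest address on the right."""
    return int.from_bytes(four_bytes, "little")


# Bytes in memory order starting at main+54.
memory = bytes([0x89, 0xC6, 0x48, 0x8D])
printed = f"0x{as_gdb_word(memory):08x}"

opcode = memory[0]   # first byte in memory, rightmost in the printed word
modrm = memory[1]    # second byte in memory
print(printed, hex(opcode), hex(modrm))
```

Reading the printed word from right to left recovers the bytes in memory order: 89, then c6.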
Type x/2 main+8. The /2 tells gdb to print two four-byte words. (We need to print two words because the instruction is 7 bytes long.) The words are displayed left-to-right in increasing order; but, within each word, the byte with the lowest address appears on the right. Thus, you would read this seven-byte instruction as 0xc745f02c050000. (I know it's confusing. Remember, I didn't design it, I'm just showing you how it works.)
The first byte of this instruction is 0xc7. If you look on page 3-508 of the Developer's Guide, you will see that 0xc7 is an opcode for mov. (The instruction is listed as movl in your assembly code because gcc adds the l suffix to mark a 32-bit operand; you still look up mov in the Developer's Guide.) Notice the /0 after the opcode. If you look on page 3-2 of the Developer's Guide, you will see that the /0 means the reg field of the ModR/M byte holds the digit 0 as extra opcode bits instead of naming a register (i.e., ignore the register listed at the top).
Looking at the Mod and R/M bits (i.e., the leftmost column) tells us that one of the operands is memory location [rbp] plus an 8-bit displacement. In fact, the destination of this instruction is memory location %rbp - 0x10; and, as luck would have it, the next byte, 0xf0, happens to be the two's complement, hexadecimal representation of -0x10.
Finally, the last four bytes are the immediate value being stored. When read "first-to-last", which is low-to-high, the last four bytes are 0x2c050000. However, we conventionally write numbers high-to-low. Thus, when you reverse the order of these bytes you get 0x0000052c, which is the immediate value being moved. (Does your head hurt yet?)
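The same decoding can be done mechanically. This Python sketch slices the seven bytes of the instruction into the fields discussed above; the variable names are assumptions for illustration, and the field boundaries are hard-coded for this one instruction rather than derived by a general decoder.

```python
raw = bytes.fromhex("c745f02c050000")   # the seven bytes in memory order

opcode = raw[0]
modrm = raw[1]
mod = (modrm >> 6) & 0b11               # 01: a one-byte displacement follows
reg_field = (modrm >> 3) & 0b111        # the /0 opcode extension
rm = modrm & 0b111                      # 101: the base register is rbp

# The displacement is a signed byte; the immediate is stored low byte first.
displacement = int.from_bytes(raw[2:3], "little", signed=True)
immediate = int.from_bytes(raw[3:7], "little")

print(hex(opcode), mod, reg_field, rm, displacement, hex(immediate))
```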
Type x main+4 to look at the sub instruction. When you look on page 4-394 of the Developer's Guide, you will notice that the first byte, 0x48, does not correspond to any of the sub opcodes. However, the second byte, 0x83, does. In addition, one of the choices for opcode is REX.W + 83. A trip back up to the top of Chapter 2 (specifically, Section 2.2.1 beginning on page 2-9) tells us that the REX prefixes are used to indicate that the instruction takes at least one 64-bit parameter. In particular, Table 2-4 tells us that the prefix 0x48 indicates that the operand size is 64 bits. All REX prefixes begin with a 4.
At this point, we note the /5 after the opcode and look up the third byte, 0xec, in Table 2-2 to find that one of the parameters is %rsp. Page 4-394 tells us that the final parameter is an 8-bit immediate value, in this case 0x10.
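As a check on the walk-through above, this Python sketch pulls apart the four bytes of the sub instruction. The helper decode_rex and the variable names are assumptions for illustration; the sketch only names the four low bits of the prefix and handles just this one instruction layout.

```python
def decode_rex(byte):
    """Return the four flag bits of a REX prefix, or None if byte is not one."""
    if byte >> 4 != 0b0100:             # every REX prefix has a high nibble of 4
        return None
    return {
        "W": (byte >> 3) & 1,           # 1 means 64-bit operand size
        "R": (byte >> 2) & 1,
        "X": (byte >> 1) & 1,
        "B": byte & 1,
    }


raw = bytes.fromhex("4883ec10")         # the four bytes in memory order
rex = decode_rex(raw[0])
opcode = raw[1]
modrm = raw[2]
extension = (modrm >> 3) & 0b111        # the /5 that selects sub for opcode 0x83
rm = modrm & 0b111                      # 100 with Mod 11: the stack pointer
immediate = raw[3]

print(rex, hex(opcode), extension, rm, hex(immediate))
```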
Now, it's your turn:
movl
" means "mov
").+rw
in the opcode.
Your answers should look something like this:
Assembly instruction | mov %eax,%esi
Machine instruction (hex) | 0x89c6
Byte number | 1 | 2
Field name | op code | Mod R/M
Field value | 0x89 | 0xc6
Field meaning | mov | source: %eax, destination: %esi
Info source | Page 3-508 | Table 2-2
Assembly instruction | movl $0x52c,-0x10(%rbp)
Machine instruction (hex) | 0xc745f02c050000
Byte number | 1 | 2 | 3 | 4-7
Field name | op code | Mod R/M | offset | immediate value
Field value | 0xc7 | 0x45 | 0xf0 | 0x0000052c
Field meaning | mov | destination: [RBP] + offset | subtract 16 from RBP | immediate value
Info source | Page 3-508 | Table 2-2 | Table 2-2 | Page 3-508
Assembly instruction | sub $0x10,%rsp
Machine instruction (hex) | 0x4883ec10
Byte number | 1 | 2 | 3 | 4
Field name | prefix | op code | Mod R/M | immediate value
Field value | 0x48 | 0x83 | 0xec | 0x10
Field meaning | 64-bit operands | subtract | destination: %rsp | immediate value 0x10
Info source | Table 2-4 | Page 4-394 | Table 2-2 | Page 4-394
You may find this template helpful.
The push instruction is only one byte long. How did the designers squeeze both the opcode and the operand into one byte? (In other words, how can you tell from the machine code which register to push onto the stack?)
main+36
and main+42
.
How does the instruction at main+15 encode that one of the parameters is an immediate value? How is the ModR/M byte for this instruction used? (In other words, how can you tell by looking at the machine code that one parameter is an immediate value?)
r8d
, r9d
,
..., r15d
). Your explanation should include a table for this instruction: sub %r11d, %r14d
Updated Tuesday, 3 September 2019, 10:42 AM