| CIS 451 | Intel Machine Language | Winter 2019 | 
The overview sections of this lab were originally written by Prof. Wolffe.
You will want to work in pairs. You don't want to tackle this alone. Trust me.
Start Early! This is a challenging assignment; and you will need time to ask many questions. If you wait until two or three days before the deadline, you won't have time to get all your questions answered. (Also, remember Piazza.)
A typical view of the Intel register set shows eight general-purpose, 32-bit registers (eax, ebx, ecx, edx, esi, edi, esp, ebp). Although these registers are technically general purpose (meaning that they can hold any 32-bit value), in practice only eax, ebx, ecx, and edx are used for general data. esp is typically the stack pointer, and ebp is typically the frame pointer.
Local variables are typically stored in the memory around an address stored in register ebp, also called the frame pointer. (Think of a "frame" around some segment of memory.) For example, int x may be at location %ebp - 4, and int y at location %ebp - 8. (Remember, ints are 4 bytes long.)
Like many machines, Intel processors use a stack. The register esp holds the address of the item most recently placed on the stack (the "top" of the stack). The stack grows down, meaning that the value of esp decreases as data are added to the stack. Functions store their data on the stack.
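The "grows down" behavior can be sketched with a toy stack in C. This is only an illustration, not how the hardware is implemented: the names ToyStack, toy_init, toy_push, and toy_pop are hypothetical, the stack is a 64-byte array, and esp is modeled as an offset into that array rather than a real address. The sketch assumes the usual x86 behavior in which a push first subtracts 4 from esp and then stores the value there.

```c
#include <stdint.h>
#include <string.h>

enum { STACK_BYTES = 64 };   /* hypothetical size of the toy stack */

typedef struct {
    uint8_t  mem[STACK_BYTES];
    uint32_t esp;            /* offset into mem, standing in for an address */
} ToyStack;

/* An empty stack starts with esp at the high end of memory. */
static void toy_init(ToyStack *s) {
    memset(s->mem, 0, sizeof s->mem);
    s->esp = STACK_BYTES;
}

/* Push: move esp down by 4 bytes, then store the value at the new top.
   (No overflow check; this is a sketch.) */
static void toy_push(ToyStack *s, uint32_t value) {
    s->esp -= 4;
    memcpy(&s->mem[s->esp], &value, sizeof value);
}

/* Pop: read the value at the top, then move esp back up by 4 bytes. */
static uint32_t toy_pop(ToyStack *s) {
    uint32_t value;
    memcpy(&value, &s->mem[s->esp], sizeof value);
    s->esp += 4;
    return value;
}
```

After toy_init, two calls to toy_push leave esp eight bytes lower than where it started, and two calls to toy_pop return the values in reverse order and bring esp back to its starting point.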
Functions are required to leave the stack and frame pointers as they found them. Therefore, the first few assembly language instructions of any function store the current value of esp and ebp (typically on the stack). In addition, the frame for local variables is typically on the stack. Therefore, functions with local data also decrease the stack pointer (remember, the stack grows down) enough to hold all the local variables. The last thing a function does is restore the stack and frame pointers to their original values.
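The save-and-restore pattern can be sketched in C as a small simulation. This is a simplified model under stated assumptions, not the code gcc emits: Cpu, store32, load32, prologue, and epilogue are hypothetical names, memory is a 64-byte array, and esp and ebp are offsets into it. The prologue mimics "push ebp; copy esp to ebp; subtract from esp", and the epilogue mimics what the leave instruction does.

```c
#include <stdint.h>
#include <string.h>

enum { MEM_BYTES = 64 };     /* hypothetical size of the toy memory */

typedef struct {
    uint8_t  mem[MEM_BYTES];
    uint32_t esp;            /* stack pointer (offset into mem) */
    uint32_t ebp;            /* frame pointer (offset into mem) */
} Cpu;

static void store32(Cpu *c, uint32_t addr, uint32_t value) {
    memcpy(&c->mem[addr], &value, sizeof value);
}

static uint32_t load32(const Cpu *c, uint32_t addr) {
    uint32_t value;
    memcpy(&value, &c->mem[addr], sizeof value);
    return value;
}

/* Function entry: save the caller's ebp on the stack, make ebp point at
   the new frame, and move esp down to make room for the locals. */
static void prologue(Cpu *c, uint32_t local_bytes) {
    c->esp -= 4;
    store32(c, c->esp, c->ebp);
    c->ebp = c->esp;
    c->esp -= local_bytes;
}

/* Function exit: discard the locals, then restore the caller's ebp. */
static void epilogue(Cpu *c) {
    c->esp = c->ebp;
    c->ebp = load32(c, c->esp);
    c->esp += 4;
}
```

With 8 bytes of locals, the two ints live at ebp - 4 and ebp - 8 between the prologue and the epilogue, and both esp and ebp hold their original values again once epilogue returns.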
Finally, a few additional notes:
gcc follows a different convention from MIPS: The destination of an operation is typically the last operand, not the first.
The lower 16 bits of these registers can be accessed using the names ax, bx, cx, dx, si, di, sp, bp. Similarly, the names (ah, bh, ch, dh, al, bl, cl, dl) designate the most significant and least significant bytes of ax, bx, cx, and dx (in earlier microprocessors Intel used 8-bit registers). See Figure 6.36 in Harris and Harris, or Figure 2.40 in the Patterson and Hennessy text.
Similarly, on a 64-bit machine, e?x is the lower 32 bits of the 64-bit register r?x. If you look at 64-bit code you will see both the 32-bit "e" versions and the 64-bit "r" versions of the registers used. In general, memory addresses will use the "r" registers (e.g., rsp and rbp) while the 32-bit arithmetic operands will use the "e" registers (eax, ebx, etc.).
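The relationship between the register names can be sketched with plain masks and shifts in C. The helper names below (reg_low32, reg_low16, reg_high8, reg_low8) are hypothetical; the sketch assumes the usual x86 aliasing, in which eax is the low 32 bits of rax, ax is the low 16 bits of eax, and ah and al are the high and low bytes of ax.

```c
#include <stdint.h>

/* eax: the low 32 bits of the 64-bit register rax. */
static uint32_t reg_low32(uint64_t rax) {
    return (uint32_t)(rax & 0xFFFFFFFFu);
}

/* ax: the low 16 bits of eax. */
static uint16_t reg_low16(uint32_t eax) {
    return (uint16_t)(eax & 0xFFFFu);
}

/* ah: the most significant byte of ax (bits 8 through 15 of eax). */
static uint8_t reg_high8(uint32_t eax) {
    return (uint8_t)((eax >> 8) & 0xFFu);
}

/* al: the least significant byte of ax (bits 0 through 7 of eax). */
static uint8_t reg_low8(uint32_t eax) {
    return (uint8_t)(eax & 0xFFu);
}
```

For example, if eax held 0x12345678, then ax would be 0x5678, ah would be 0x56, and al would be 0x78.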
   The easiest way to write correct assembly code is to let a compiler do it for us! Then we just have to figure out
  what it's doing and why. A quick look at the man pages for the gcc compiler under Linux shows that
  the '-S' switch directs the compiler to generate assembly and stop (without producing an
  executable). We can use this feature to learn the nature of a particular instruction set by writing simple,
  understandable programs in a high-level language (like C) and studying what the compiler produces. 
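A brief note about assembler notation in the interest of making the programs easier to read:
Register names begin with a percent sign (%).
Immediate values begin with a dollar sign ($).
Labels generated by the compiler begin with a dot and an L (.L).
Parentheses indicate a memory access (()).
For example, movl %esp, %ebp means to take the value of %esp and put it in %ebp. In contrast, movl (%esp), %ebp means to take the value at the top of the stack and put it in %ebp.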
  As an example, begin with the program exampleIML-1b.c. Run
  gcc -S exampleIML-1b.c to produce assembly code for the native machine. (I
  compiled this code on a 64-bit Intel i7. If you use a different machine, your code may look different.)
Look at the resulting code. Lines that begin with a dot (.) are either labels or directives. They are used by the assembler, but are not instructions that are executed when the program runs.
The first three instructions (pushq, movq, and subq) set up the stack and frame pointers.
The next instructions initialize the local variables a, b, difference and printf_answer. Notice that variable names do not appear in the assembly code. Instead, each local variable is assigned to a memory location referenced as an offset from the frame pointer.
The instructions beginning with movl -8(%rbp), %ecx set up the parameters, call printf, and save the return value. Notice how the parameters are passed in spare registers.
The leave instruction restores the stack and frame pointers; and the method ends.
Explain what each instruction in exampleIML-1b.s does and why. (A couple of the "whys" aren't obvious, so don't hesitate to ask for help.) Two hints:
    In comparison to MIPS, the Intel machine language is extremely complex: Instructions can be anywhere from 1 to 15 bytes long; and, each instruction supports many different addressing modes. Fortunately, most of the extremely complex instructions are very specialized and rarely used (e.g., the MMX instructions). The instructions you will examine today are much simpler.
Figure 2-1 on page 2-1 (aka page 31) of Volume 2 of the Software Developer's Manual shows the basic format of an Intel instruction. Some instructions have a prefix of up to four bytes. The next 1 to 3 bytes contain the op code. The instructions we will examine all have one byte opcodes. The byte after the opcode describes the operands and the addressing modes of the operands. As shown in Figure 2-1, this byte is divided into three fields:
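Mod (bits 6 and 7): the addressing mode.
Reg/Opcode (bits 3 through 5): either a register or additional opcode bits, depending on the instruction.
R/M (bits 0 through 2): a register (eax, ebx, etc.) that uses the addressing mode specified by bits 6 and 7.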
  Look at Table 2-2 on page 2-6. The leftmost column lists the possible values for
  the parameter that may specify a memory address. The notation EAX specifies a basic
  register-direct access to register eax (i.e., the data for the instruction is contained directly in
  register eax). Brackets specify a memory access. For example
  [EAX] means that the data for the instruction is not stored directly in eax, but rather
  the memory location specified by eax. (In other words, eax contains the memory
  address at which the data is located.)
The top row lists the registers that can serve as the "non-memory" parameter. (Remember, if there are two parameters, the second parameter may not be a memory address.) The cell where the row and column intersect contains the entire ModR/M byte. For example, an instruction with parameters [EDX] and EBX would have a ModR/M byte of 0x1A. Notice that the bits listed in the row and column headers (00, 011, and 010 → 0001 1010) are equal to 0x1A, the listed ModR/M byte. (See the yellow and green highlighted items in this marked-up version of Table 2.2.)
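The way the three fields pack into one byte can be sketched in C. The names ModRM, modrm_encode, and modrm_decode are hypothetical helpers for illustration only; the field layout (Mod in bits 6 and 7, Reg in bits 3 through 5, R/M in bits 0 through 2) follows Figure 2-1.

```c
#include <stdint.h>

typedef struct {
    uint8_t mod;   /* bits 6-7: addressing mode              */
    uint8_t reg;   /* bits 3-5: register (or opcode bits)    */
    uint8_t rm;    /* bits 0-2: register used with the mode  */
} ModRM;

/* Pack the three fields into a single ModR/M byte. */
static uint8_t modrm_encode(uint8_t mod, uint8_t reg, uint8_t rm) {
    return (uint8_t)(((mod & 0x3u) << 6) | ((reg & 0x7u) << 3) | (rm & 0x7u));
}

/* Split a ModR/M byte back into its three fields. */
static ModRM modrm_decode(uint8_t byte) {
    ModRM f;
    f.mod = (uint8_t)((byte >> 6) & 0x3u);
    f.reg = (uint8_t)((byte >> 3) & 0x7u);
    f.rm  = (uint8_t)(byte & 0x7u);
    return f;
}
```

Using the example from the table, modrm_encode(0, 3, 2) combines Mod 00, Reg 011 (EBX), and R/M 010 ([EDX]) into 0x1A, and modrm_decode(0x1A) recovers the same three values.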
If you saw EBP or %ebp in an assembly instruction, where would the data for that instruction come from? What addressing mode does that correspond to? How about if you saw [EBP]-16 (which gcc writes as -16(%ebp))?
  Now, let's look at some real Intel machine code:
Compile exampleIML-1b.c down to assembly code (gcc -S exampleIML-1b.c).
Assemble exampleIML-1b.s with the debug flag (gcc -g exampleIML-1b.s -o ex1).
Run the debugger (gdb) on your compiled file (gdb ex1).
Type disassemble main. You should see output that looks something like the sample below. If it looks different and you are on EOS/Arch, let me know. The first column lists the address of each instruction in main. The second column lists the address of the instruction relative to the beginning of main. The third and fourth columns contain the assembly instruction. Notice that the instruction movq %rsp, %rbp has been replaced with a simple mov instruction.
    
(7)   0x0000000000001139 <+0>:	push   %rbp
(6)   0x000000000000113a <+1>:	mov    %rsp,%rbp
(ex)  0x000000000000113d <+4>:	sub    $0x10,%rsp
(ex)  0x0000000000001141 <+8>:	movl   $0x52c,-0x10(%rbp)
(1)   0x0000000000001148 <+15>:	movl   $0x1619,-0xc(%rbp)
      0x000000000000114f <+22>:	movl   $0x2694,-0x8(%rbp)
      0x0000000000001156 <+29>:	movl   $0x8ad,-0x4(%rbp)
(2)   0x000000000000115d <+36>:	mov    -0x10(%rbp),%eax
      0x0000000000001160 <+39>:	sub    -0xc(%rbp),%eax
(3)   0x0000000000001163 <+42>:	mov    %eax,-0x8(%rbp)
      0x0000000000001166 <+45>:	mov    -0x8(%rbp),%ecx
      0x0000000000001169 <+48>:	mov    -0xc(%rbp),%edx
      0x000000000000116c <+51>:	mov    -0x10(%rbp),%eax
(ex)  0x000000000000116f <+54>:	mov    %eax,%esi
      0x0000000000001171 <+56>:	lea    0xe8c(%rip),%rdi        # 0x2004
(4)   0x0000000000001178 <+63>:	mov    $0x0,%eax
      0x000000000000117d <+68>:	callq  0x1030 <printf@plt>
      0x0000000000001182 <+73>:	mov    %eax,-0x4(%rbp)
      0x0000000000001185 <+76>:	mov    -0x8(%rbp),%eax
(5)   0x0000000000001188 <+79>:	leaveq
      0x0000000000001189 <+80>:	retq
    
Type x main+54 to look at the machine code for the fourteenth instruction (mov %eax, %esi). Bytes with lower addresses are displayed to the right; therefore, the instructions will look as if they are printed "backwards". In this example, the first byte of the mov instruction is 0x89; the second is 0xc6. The remaining two bytes are part of the next instruction. If you look on page 3-508 in Volume 2 of the Developer's Guide, you will see that 0x89 is one of many opcodes for the mov instruction. If you look in Table 2-2 (on page 2-6), you will see that a ModR/M byte of 0xc6 indicates that the parameters are registers %eax and %esi. (See the purple items in this marked-up version of Table 2.2.)
Type x/2 main+8. The /2 tells gdb to print two four-byte words. (We need to print two words because the instruction is 7 bytes long.) The words are displayed left-to-right in increasing order; but, within each word, the byte with the lowest address appears on the right. Thus, you would read this seven-byte instruction as 0xc745f02c050000. (I know it's confusing. Remember, I didn't design it, I'm just showing you how it works.)
The first byte of this instruction is 0xc7. If you look on page 3-508 of the Developer's Guide, you will see that 0xc7 is an opcode for mov. (The instruction is listed as movl in your assembly code because the l suffix marks a 32-bit operand; but, you look up mov in the Developer's Guide.) Notice the /0 after the opcode. If you look on page 3-2 of the Developer's Guide, you will see that the /0 means the reg field of the ModR/M byte holds the value 0 as an extension of the opcode rather than naming a register (i.e., ignore the register listed at the top). Looking at the Mod and R/M bits (i.e., the leftmost column) tells us that one of the operands is memory location [rbp] plus an 8-bit displacement. In fact, the destination of this instruction is memory location %rbp - 0x10; and, as luck would have it, the next byte, 0xf0, happens to be the two's complement, hexadecimal representation of -0x10. Finally, the last four bytes are the immediate value being stored.
When read "first-to-last", which is low-to-high, the last four bytes are 0x2c050000. However, we conventionally write numbers high-to-low. Thus, when you reverse the order of these bytes you get 0x0000052c, which is the immediate value being moved. (Does your head hurt yet?)
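The byte shuffling for this instruction can be sketched in C. The helper names (words_to_bytes, disp8_value, imm32_le) are hypothetical. The two input words used in the usage note, 0x2cf045c7 and 0xc7000005, are what one would expect gdb to print for the bytes quoted above; the final byte, 0xc7, is assumed to be the first byte of the next movl instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Unpack 32-bit words, as printed by gdb's x command, into bytes in
   address order (lowest address first). out must hold 4 * nwords bytes. */
static void words_to_bytes(const uint32_t *words, size_t nwords, uint8_t *out) {
    for (size_t i = 0; i < nwords; i++) {
        for (size_t b = 0; b < 4; b++) {
            out[4 * i + b] = (uint8_t)(words[i] >> (8 * b));
        }
    }
}

/* Interpret one byte as a signed, two's complement 8-bit displacement. */
static int32_t disp8_value(uint8_t byte) {
    return byte >= 0x80 ? (int32_t)byte - 256 : (int32_t)byte;
}

/* Assemble a 32-bit immediate from four bytes stored low-to-high. */
static uint32_t imm32_le(const uint8_t *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Passing the two words to words_to_bytes yields c7 45 f0 2c 05 00 00 c7 in address order. The third byte gives disp8_value(0xf0) = -16, and imm32_le applied to the four bytes starting at index 3 gives 0x0000052c.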
Type x main+4 to look at the sub instruction. When you look on page 4-394 of the Developer's Guide, you will notice that the first byte, 0x48, does not correspond to any of the sub opcodes. However, the second byte, 0x83, does. In addition, one of the choices for opcode is REX.W + 83. A trip back up to the top of Chapter 2 (specifically, Section 2.2.1 beginning on page 2-9) tells us that the REX prefixes are used to indicate that the instruction takes at least one 64-bit parameter. In particular, Table 2-4 tells us that the prefix 0x48 indicates that the operand size is 64 bits. All REX prefixes begin with a 4. At this point, we note the /5 after the opcode and look up the third byte, 0xec, in Table 2-2 to find that one of the parameters is %rsp. Page 4-394 tells us that the final parameter is an 8-bit immediate value: in this case, 0x10.
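The layout of a REX prefix can be sketched in C. The names Rex, is_rex, and decode_rex are hypothetical; the bit layout assumed here is the 0100WRXB pattern described for REX prefixes in the Developer's Manual, where the high four bits are always 0100 (hence "begin with a 4").

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool w;   /* bit 3: 64-bit operand size                 */
    bool r;   /* bit 2: extends the ModR/M reg field        */
    bool x;   /* bit 1: extends the SIB index field         */
    bool b;   /* bit 0: extends the ModR/M r/m field        */
} Rex;

/* In 64-bit mode, a REX prefix is any byte whose high four bits are 0100. */
static bool is_rex(uint8_t byte) {
    return (byte & 0xF0u) == 0x40u;
}

/* Pull the four flag bits out of a REX prefix byte. */
static Rex decode_rex(uint8_t byte) {
    Rex f;
    f.w = ((byte >> 3) & 1u) != 0;
    f.r = ((byte >> 2) & 1u) != 0;
    f.x = ((byte >> 1) & 1u) != 0;
    f.b = (byte & 1u) != 0;
    return f;
}
```

For the sub instruction above, is_rex(0x48) is true and decode_rex(0x48) has only the W bit set, which matches the "REX.W + 83" opcode listing. The opcode byte 0x83 itself is not a REX prefix.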
  Now, it's your turn:
movl" means "mov").+rw in the opcode.
      Your answers should look something like this:
| assembly instruction | mov %eax,%esi | |
| Machine instruction (hex) | 0x89c6 | |
| Byte number | 1 | 2 | 
| field name | op code | Mod R/M | 
| Field value | 0x89 | 0xc6 | 
| Field meaning | mov | source: %eax, destination %esi | 
| Info source | Page 3-508 | Table 2.2. | 
| assembly instruction | movl $0x52c,-0x10(%rbp) | |||
| Machine instruction (hex) | 0xc745f02c050000 | |||
| Byte number | 1 | 2 | 3 | 4-7 | 
| field name | op code | Mod R/M | offset | immediate value | 
| Field value | 0xc7 | 0x45 | 0xf0 | 0x0000052c | 
| Field meaning | mov | destination [RBP] + offset | subtract 16 from RBP | immediate value | 
| Info source | Page 3-508 | Table 2.2 | Table 2.2 | page 3-508 | 
| assembly instruction | sub $0x10, %rsp | |||
| Machine instruction (hex) | 0x4883ec10 | |||
| Byte number | 1 | 2 | 3 | 4 | 
| field name | prefix | op code | Mod R/M | Immediate value | 
| Field value | 0x48 | 0x83 | 0xec | 0x10 | 
| Field meaning | 64-bit operands | subtract | destination %rsp | immediate value 0x10 | 
| Info source | Table 2-4 | Page 4-394 | Table 2.2 | Page 4-394 | 
You may find this template helpful.
The push instruction is only one byte long. How did the designers squeeze both the opcode and the operand into one byte?
  main+36 and main+46.
How does the instruction at main+15 encode that one of the parameters is an immediate value? How is the ModR/M byte for this instruction used?
  r8d, r9d,
    ..., r15d). Your explanation should include a table for this instruction: sub %r11d, %r14d
  Updated Monday, 28 January 2019, 10:29 AM
