├── .gitignore
├── Makefile
├── README.md
├── add-eax-ebx.md
├── bibliography.md
├── global-structure.md
├── intel-manual-format.md
├── introduction.md
├── mov-al-1.asm
├── mov-al-1.md
├── mov-ax-1.asm
├── mov-ax-1.md
├── mov-eax-1.asm
├── mov-eax-1.md
├── mov-eax-address.asm
├── mov-eax-address.md
├── mov-eax-ebx.asm
├── mov-eax-ebx.md
├── mov-eax-memory.asm
├── mov-eax-memory.md
├── mov-ebx-1.asm
├── mov-ebx-1.md
├── nop.asm
├── nop.md
├── push-eax.asm
└── push-eax.md


/.gitignore:
--------------------------------------------------------------------------------
1 | *.bin
2 | *.hd
3 | *.o
4 | test.asm
5 | 


--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
 1 | .POSIX:
 2 | 
 3 | BIN_EXT ?= .bin
 4 | IN_EXT ?= .asm
 5 | OBJ_EXT ?= .o
 6 | OUT_EXT ?= .hd
 7 | 
 8 | INS := $(wildcard *$(IN_EXT))
 9 | OUTS := $(patsubst %$(IN_EXT),%$(OUT_EXT),$(INS))
10 | 
11 | .PHONY: all clean run
12 | .PRECIOUS: %$(BIN_EXT) %$(OBJ_EXT)
13 | 
14 | all: $(OUTS)
15 | 
16 | %$(OUT_EXT): %$(BIN_EXT)
17 | 	od -An -tx1 '$<' | tail -c+2 > '$@'
18 | 
19 | %$(BIN_EXT): %$(OBJ_EXT)
20 | 	objcopy -O binary --only-section=.text '$<' '$@'
21 | 
22 | %$(OBJ_EXT): %$(IN_EXT)
23 | 	nasm -f elf32 -o '$@' '$<'
24 | 	@# For raw 16 bit. Would need to remove the objcopy step.
25 | 	@#nasm -f bin -o '$@' '$<'
26 | 
27 | clean:
28 | 	rm -f *$(BIN_EXT) *$(OBJ_EXT) *$(OUT_EXT)
29 | 
30 | run: all
31 | 	tail -n+1 *$(OUT_EXT)
32 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # x86 Instruction Encoding Tutorial
 2 | 
 3 | 1.  [Introduction](introduction.md)
 4 | 1.  [Global structure](global-structure.md)
 5 | 1.  [Intel manual format](intel-manual-format.md)
 6 | 1.  Examples
 7 |     1. [nop](nop.md)
 8 |     1. [push eax](push-eax.md)
 9 |     1. [mov eax, 1](mov-eax-1.md)
10 |     1. [mov ebx, 1](mov-ebx-1.md)
11 |     1. [mov ax, 1](mov-ax-1.md)
12 |     1. [mov al, 1](mov-al-1.md)
13 |     1. [mov eax, ebx](mov-eax-ebx.md)
14 |     1. [mov eax, address](mov-eax-address.md)
15 |     1. [mov eax, [memory]](mov-eax-memory.md)
16 | 1.  [Bibliography](bibliography.md)
17 | 


--------------------------------------------------------------------------------
/add-eax-ebx.md:
--------------------------------------------------------------------------------
 1 | # add eax, ebx
 2 | 
 3 | Output:
 4 | 
 5 |     01 d8
 6 |     ^^ ^^
 7 | 
 8 | 1. Opcode
 9 | 1. ModR/M
10 | 
11 | Opcode bits:
12 | 
13 |     0 0 0 0 0 0 0 1
14 |     ^^^^^^^^^^^ ^ ^
15 |     1           2 3
16 | 
17 | 1. This is an add.
 18 | 2. Direction bit. 0: add REG to R/M, as given in the ModR/M byte. 1: the other way around.
 19 | 3. Width bit. 1: 32-bit operands. 0: 8-bit operands.
20 | 
21 | ModR/M bits:
22 | 
23 |     1 1 0 1 1 0 0 0
24 |     ^^^ ^^^^^ ^^^^^
25 |     1   2     3
26 | 
27 | 1. MOD = 3: REG and R/M are registers.
28 | 2. REG = 3: EBX
 29 | 3. R/M = 0: EAX
30 | 
 31 | So from the opcode, we add REG (EBX) to R/M (EAX).
32 | 
 33 | Note that two encodings are possible for register-to-register operations: we could set the second-to-last opcode bit to 1 and swap both registers in the ModR/M byte, which gives `03 c3`.
34 | 
35 | Both possible encodings are documented on the instruction table:
36 | 
37 |     01 /r    ADD r/m32, r32
38 |     03 /r    ADD r32, r/m32
39 | 
 40 | `/r` says that a ModR/M byte follows the opcode, and that it contains both a register operand (REG) and an r/m operand (R/M).
41 | 


--------------------------------------------------------------------------------
/bibliography.md:
--------------------------------------------------------------------------------
 1 | # Bibliography
 2 | 
 3 | -   Intel® 64 and IA-32 Architectures Software Developer’s Manual
 4 | 
 5 |     - section 2.1: binary serialization
 6 |     - section 3.1: documentation format
 7 | 
 8 | -   <http://www.c-jump.com/CIS77/CPU/x86/lecture.html>
 9 | 
10 | -   <http://www.codeproject.com/Articles/662301/x-Instruction-Encoding-Revealed-Bit-Twiddling-fo>
11 | 
12 | -   <http://wiki.osdev.org/X86-64_Instruction_Encoding>
13 | 
14 | -   <http://www.strchr.com/machine_code_redundancy>
15 | 


--------------------------------------------------------------------------------
/global-structure.md:
--------------------------------------------------------------------------------
 1 | # Global structure
 2 | 
 3 | Legend: `X-Y: description`, where `X` is the minimum, and `Y` is the maximum number of bytes.
 4 | 
 5 | - 0-4: instruction prefixes
 6 | - 1-3: opcode
 7 | - 0-1: ModR/M
 8 | - 0-1: SIB
 9 | - 0-4: displacement
10 | - 0-4: immediate
11 | 
12 | The most interesting bytes to start learning are the opcode and ModR/M.
13 | 
14 | ## Opcode
15 | 
16 | Says which instruction is being run.
17 | 
 18 | Sometimes the opcode can be further decomposed into smaller fields, some of which say where the data comes from. E.g. [push eax](push-eax.md), documented in the manual as `+rd`.
19 | 
20 | ## ModR/M
21 | 
 22 | Says which operands the instruction uses, i.e. where data comes from and where it goes. Bits, from most to least significant:
 23 | 
 24 |     7 6 5 4 3 2 1 0
25 |     ^^^ ^^^^^ ^^^^^
26 |     1   2     3
27 | 
28 | 1.  MOD
29 | 
30 |     Determines how the next fields are interpreted.
31 | 
32 |     - 00: Indirect addressing mode.
 33 |     - 01: Same as 00, but an 8-bit displacement is added to the value before dereferencing.
 34 |     - 10: Same as 01, but the displacement is 32 bits wide.
 35 |     - 11: The REG and R/M fields each refer to a register.
36 | 
37 | 2.  REG
38 | 
39 |     - 000 (0): EAX (AX if data size is 16 bits, AL if data size is 8 bits)
40 |     - 001 (1): ECX/CX/CL
41 |     - 010 (2): EDX/DX/DL
42 |     - 011 (3): EBX/BX/BL
43 |     - 100 (4): ESP/SP (AH if data size is defined as 8 bits)
44 |     - 101 (5): EBP/BP (CH if data size is defined as 8 bits)
45 |     - 110 (6): ESI/SI (DH if data size is defined as 8 bits)
46 |     - 111 (7): EDI/DI (BH if data size is defined as 8 bits)
47 | 
48 | 3.  R/M
49 | 
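As an illustration of the three fields above, here is a minimal Python sketch that splits a ModR/M byte with shifts and masks. It is not part of the original repository, and `split_modrm` and `REG32` are hypothetical names.

```python
# Register names indexed by the 3-bit REG / R/M value (32-bit data size).
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def split_modrm(byte):
    """Split a ModR/M byte into its (mod, reg, rm) fields."""
    mod = (byte >> 6) & 0b11   # two most significant bits
    reg = (byte >> 3) & 0b111  # middle three bits
    rm = byte & 0b111          # three least significant bits
    return mod, reg, rm


mod, reg, rm = split_modrm(0xD8)
print(mod, REG32[reg], REG32[rm])  # prints "3 ebx eax"
```

The register names are only meaningful for R/M when MOD is 3; for the other MOD values R/M selects an addressing mode instead.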
50 | ## Prefixes
51 | 
52 | ### 66
53 | 
 54 | If given while in 16-bit mode, treat the operands as 32-bit.
 55 | 
 56 | If given while in 32-bit mode, treat the operands as 16-bit.
57 | 


--------------------------------------------------------------------------------
/intel-manual-format.md:
--------------------------------------------------------------------------------
 1 | # Intel manual format
 2 | 
 3 | How the Intel manual documents the instruction encodings.
 4 | 
 5 | - Opcode
 6 | - Instruction
 7 | - Op / En
 8 | - 64-Bit Mode
 9 | - Compat / Leg Mode
10 | - `CPUID` feature flag
11 | - Description
12 | 
13 | They are explained in section 3.1.
14 | 
15 | ### Instruction
16 | 
17 | E.g.:
18 | 
19 |     XCHG EAX, r32
20 | 
21 | Means: takes 2 arguments:
22 | 
23 | - `EAX`: TODO
24 | - `r32`: a 32-bit register
25 | 
26 | Other important values:
27 | 
 28 | - `r/m32`: either a 32-bit register or a 32-bit memory operand
 29 | - `imm32`: a 32-bit immediate value, encoded directly in the instruction
30 | 
31 | ### Op/En
32 | 
33 | ### Operand Encoding
34 | 
35 | Refers to an entry on the "Instruction Operand Encoding" table.
36 | 
 37 | Every instruction has its own "Instruction Operand Encoding" table.
38 | 
39 | TODO understand an operand encoding table, e.g. for `mov`.
40 | 
41 | ### CPUID feature flag
42 | 
 43 | Which CPU feature flag, as reported by `CPUID`, indicates support for the instruction.
44 | 
45 | ### Compat / Leg Mode
46 | 
47 | - valid
48 | - invalid: can be encoded, but generates an exception
49 | - N.E.: not encodable
50 | 
51 | ### 64-bit mode
52 | 
53 | - V: Supported.
54 | - I: Not supported.
55 | - N.E.: instruction syntax is not encodable in 64-bit mode (it may represent part of a sequence of valid instructions in other modes).
56 | - N.P.: REX prefix does not affect the legacy instruction in 64-bit mode.
57 | - N.I.: opcode is treated as a new instruction in 64-bit mode.
 58 | - N.S.: requires an address override prefix in 64-bit mode and is not supported. Using an address override prefix in 64-bit mode may result in model-specific execution behavior.
59 | 


--------------------------------------------------------------------------------
/introduction.md:
--------------------------------------------------------------------------------
 1 | # Introduction
 2 | 
 3 | Assemble all `.asm` inputs and dump the resulting machine code as `.hd` hexdumps:
 4 | 
 5 |     sudo apt-get install nasm
 6 |     make run
 7 | 
 8 | Prerequisites: basics of how x86 assembly works, Intel syntax.
 9 | 
10 | More assembly info at: <https://github.com/cirosantilli/assembly-cheat>
11 | 
12 | To learn, rotate quickly between:
13 | 
14 | - the examples
15 | - the general instruction organization
16 | - the Intel manual
17 | 
18 | until your brain starts to absorb them.
19 | 


--------------------------------------------------------------------------------
/mov-al-1.asm:
--------------------------------------------------------------------------------
1 | mov al, 1
2 | 


--------------------------------------------------------------------------------
/mov-al-1.md:
--------------------------------------------------------------------------------
 1 | # mov al, 1
 2 | 
 3 | Output:
 4 | 
 5 |     b0 01
 6 | 
 7 | Intel manual says:
 8 | 
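 9 |     B0 +rb ib
10 | 
11 | This follows the same pattern as the previous examples: `+rb` puts the 8-bit destination register in the last 3 bits of the opcode (`000` is `al`), and `ib` says that a byte immediate follows.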
12 | 


--------------------------------------------------------------------------------
/mov-ax-1.asm:
--------------------------------------------------------------------------------
1 | mov ax, 1
2 | 


--------------------------------------------------------------------------------
/mov-ax-1.md:
--------------------------------------------------------------------------------
 1 | # mov ax, 1
 2 | 
 3 | Output:
 4 | 
 5 |     66 b8 01 00
 6 |     ^^ ^^ ^^^^^
 7 |     1  2  3
 8 | 
 9 | 1. 66 prefix: indicates that this instruction uses 16-bit operands instead of 32-bit operands.
10 | 2. Same as `mov eax, 1`.
11 | 3. Same as `mov eax, 1`, but 16 bit only.
12 | 
13 | Intel manual says: `B8 +rw iw`, which is analogous to the `B8 +rd id` of `mov eax, 1`.
14 | 
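To see how the prefix changes the interpretation of the bytes that follow, here is a minimal Python sketch. It is not part of the original repository: `decode_mov_imm` and the register tables are hypothetical names, and the code assumes the CPU runs in 32-bit mode, as in the rest of this tutorial.

```python
# Register names indexed by the 3 low bits of the opcode.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]


def decode_mov_imm(code):
    """Decode `[66] B8+r imm` into (register, immediate), assuming 32-bit mode."""
    regs, imm_size = REG32, 4
    if code[0] == 0x66:  # operand-size override prefix: switch to 16-bit operands
        regs, imm_size = REG16, 2
        code = code[1:]
    opcode = code[0]
    if opcode & 0xF8 != 0xB8 or len(code) != 1 + imm_size:
        raise ValueError("not a B8+r mov immediate")
    # The immediate is stored in little endian right after the opcode.
    return regs[opcode & 0b111], int.from_bytes(code[1:], "little")


print(decode_mov_imm(bytes.fromhex("66b80100")))
print(decode_mov_imm(bytes.fromhex("b801000000")))
```

The opcode byte is identical in both cases; only the prefix decides whether a 2-byte or a 4-byte immediate follows.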


--------------------------------------------------------------------------------
/mov-eax-1.asm:
--------------------------------------------------------------------------------
1 | mov eax, 1
2 | 


--------------------------------------------------------------------------------
/mov-eax-1.md:
--------------------------------------------------------------------------------
 1 | # mov eax, 1
 2 | 
 3 | Basic mov immediate instruction.
 4 | 
 5 | Output:
 6 | 
 7 |     b8 01 00 00 00
 8 |     ^^ ^^^^^^^^^^^
 9 |     1  2
10 | 
11 | 1.  Opcode
12 | 2.  Immediate value: `1` in little endian
13 | 
14 | Opcode bits:
15 | 
16 |     1 0 1 1 1 0 0 0
17 |     ^^^^^^^^^ ^^^^^
18 |     1         2
19 | 
20 | 1. What to do.
21 | 2. Where to move to. `000` is `eax`.
22 | 
23 | Intel documentation says:
24 | 
25 | -   Opcode: `B8 + rd id`.
26 | 
27 |     `+rd` says that the 3 bits at the end are the destination register.
28 | 
29 |     `id` says that a double word immediate follows.
30 | 
31 | -   Op/En: `OI`.
32 | 
33 |     The "Instruction Operand Encoding" table for `mov` and `OI` says:
34 | 
35 |     Operand 1: `opcode + rd (w)`
36 |     Operand 2: `imm8/16/32/64`
37 | 
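The `B8 +rd id` layout can be reproduced by hand. The following minimal Python sketch is not part of the original repository; `encode_mov_r32_imm32` and `REG32` are hypothetical names.

```python
# Register names indexed by the value that `+rd` adds to the base opcode.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def encode_mov_r32_imm32(reg, imm):
    """Encode `mov reg, imm` with the `B8 +rd id` form."""
    opcode = 0xB8 + REG32.index(reg)                     # +rd: register in the low 3 bits
    immediate = (imm & 0xFFFFFFFF).to_bytes(4, "little")  # id: 4 bytes, little endian
    return bytes([opcode]) + immediate


print(encode_mov_r32_imm32("eax", 1).hex())  # prints "b801000000"
print(encode_mov_r32_imm32("ebx", 1).hex())  # prints "bb01000000"
```

The results can be compared against the `.hd` files that `make run` prints.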


--------------------------------------------------------------------------------
/mov-eax-address.asm:
--------------------------------------------------------------------------------
1 | mov eax, address
2 | address:
3 | db 0xFF
4 | 


--------------------------------------------------------------------------------
/mov-eax-address.md:
--------------------------------------------------------------------------------
 1 | # mov eax, address
 2 | 
 3 | Output:
 4 | 
 5 |     b8 05 00 00 00 ff
 6 |     ^^^^^^^^^^^^^^ ^^
 7 | 
 8 | 1. Moving an immediate of value 5 to `eax`: 5 is the offset of the `address` label, right after the 5 bytes of the `mov`.
 9 | 2. The hardcoded `ff` byte we point to.
10 | 
11 | If we do `objdump -Sr` on the object file, we see:
12 | 
13 | 
14 |     00000000 <address-0x5>:
15 |        0:   b8 05 00 00 00          mov    $0x5,%eax
16 |                 1: R_386_32 .text
17 | 
18 | The relocation applied is `R_386_32`.
19 | 
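`R_386_32` is computed as S + A: the final value of the symbol (here the address of `.text`) plus the addend already stored in the instruction (here 5). The following minimal Python sketch shows what a linker would do with it. It is not part of the original repository: `apply_r_386_32` is a hypothetical name, and the load address `0x08048000` is only an assumption chosen for illustration.

```python
def apply_r_386_32(code, offset, symbol_value):
    """Return a copy of `code` with an R_386_32 relocation (S + A) applied at `offset`."""
    # A: the addend is the 32-bit little endian value already stored in place.
    addend = int.from_bytes(code[offset:offset + 4], "little")
    # S + A, truncated to 32 bits.
    value = (symbol_value + addend) & 0xFFFFFFFF
    patched = bytearray(code)
    patched[offset:offset + 4] = value.to_bytes(4, "little")
    return bytes(patched)


code = bytes.fromhex("b805000000ff")
# Hypothetical final address of the .text section.
print(apply_r_386_32(code, 1, 0x08048000).hex())  # prints "b805800408ff"
```

After relocation the immediate holds the absolute address of the `ff` byte instead of its offset inside the section.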


--------------------------------------------------------------------------------
/mov-eax-ebx.asm:
--------------------------------------------------------------------------------
1 | mov eax, ebx
2 | 


--------------------------------------------------------------------------------
/mov-eax-ebx.md:
--------------------------------------------------------------------------------
 1 | # mov eax, ebx
 2 | 
 3 | Output:
 4 | 
 5 |     89 d8
 6 |     ^^ ^^
 7 |     1  2
 8 | 
 9 | 1. Opcode
10 | 1. ModR/M
11 | 
12 | Opcode bits:
13 | 
14 |     1 0 0 0 1 0 0 1
15 |     ^^^^^^^^^^^ ^ ^
16 |     1           2 3
17 | 
18 | 1. This is a `mov`.
 19 | 2. Direction bit. 0: move REG to R/M, as given in the ModR/M byte. 1: the other way around.
 20 | 3. Width bit. 1: 32-bit operands. 0: 8-bit operands.
21 | 
22 | ModR/M bits:
23 | 
24 |     1 1 0 1 1 0 0 0
25 |     ^^^ ^^^^^ ^^^^^
26 |     1   2     3
27 | 
28 | 1. MOD = 3: REG and R/M are registers.
29 | 2. REG = 3: EBX
 30 | 3. R/M = 0: EAX
31 | 
32 | So from the opcode, we move REG (EBX) into R/M (EAX).
33 | 
 34 | Note that two encodings are possible for register-to-register operations: we could set the second-to-last opcode bit to 1 and swap both registers in the ModR/M byte, which gives `8b c3`.
35 | 
36 | Both possible encodings are documented on the instruction table:
37 | 
 38 |     89 /r    MOV r/m32, r32
 39 |     8B /r    MOV r32, r/m32
40 | 
 41 | `/r` says that a ModR/M byte follows the opcode, and that it contains both a register operand (REG) and an r/m operand (R/M).
42 | 
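Both encodings can be built by hand from the two table rows. The following minimal Python sketch is not part of the original repository; `encode_mov_r32_r32` and `REG32` are hypothetical names, and only 32-bit register operands are handled.

```python
# Register names indexed by the 3-bit REG / R/M value.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

MOD_REGISTER = 0b11 << 6  # MOD = 3: both REG and R/M name registers


def encode_mov_r32_r32(dst, src):
    """Return both valid encodings of `mov dst, src` for 32-bit registers."""
    d, s = REG32.index(dst), REG32.index(src)
    # MOV r/m32, r32: REG holds the source, R/M the destination.
    reg_is_source = bytes([0x89, MOD_REGISTER | (s << 3) | d])
    # MOV r32, r/m32: REG holds the destination, R/M the source.
    reg_is_destination = bytes([0x8B, MOD_REGISTER | (d << 3) | s])
    return reg_is_source, reg_is_destination


first, second = encode_mov_r32_r32("eax", "ebx")
print(first.hex(), second.hex())  # prints "89d8 8bc3"
```

Which of the two an assembler emits is its own choice; the output above shows that NASM picked the first one here.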


--------------------------------------------------------------------------------
/mov-eax-memory.asm:
--------------------------------------------------------------------------------
1 | mov eax, [memory]
2 | memory:
3 | db 0xFF
4 | 


--------------------------------------------------------------------------------
/mov-eax-memory.md:
--------------------------------------------------------------------------------
1 | # mov eax, [memory]
2 | 
3 | TODO
4 | 


--------------------------------------------------------------------------------
/mov-ebx-1.asm:
--------------------------------------------------------------------------------
1 | mov ebx, 1
2 | 


--------------------------------------------------------------------------------
/mov-ebx-1.md:
--------------------------------------------------------------------------------
 1 | # mov ebx, 1
 2 | 
 3 | Compare with [mov eax, 1](mov-eax-1.md) to see how `ebx` is encoded.
 4 | 
 5 | Output:
 6 | 
 7 |     bb 01 00 00 00
 8 | 
 9 | Bits of opcode:
10 | 
11 |     1 0 1 1 1 0 1 1
12 |     ^^^^^^^^^ ^^^^^
13 |     1         2
14 | 
15 | Field 2 contains `3` which corresponds to `ebx` as expected.
16 | 


--------------------------------------------------------------------------------
/nop.asm:
--------------------------------------------------------------------------------
1 | nop
2 | 


--------------------------------------------------------------------------------
/nop.md:
--------------------------------------------------------------------------------
1 | # nop
2 | 
3 | `0x90` is the simplest, single-byte form.
4 | 
5 | It also has multi-byte forms that can be used for alignment.
6 | 


--------------------------------------------------------------------------------
/push-eax.asm:
--------------------------------------------------------------------------------
1 | push eax
2 | 


--------------------------------------------------------------------------------
/push-eax.md:
--------------------------------------------------------------------------------
 1 | # push eax
 2 | 
 3 | Output:
 4 | 
 5 |     50
 6 | 
 7 | Which is a single opcode.
 8 | 
 9 | The opcode can be further decomposed into the following bits:
10 | 
11 |     0 1 0 1 0 0 0 0
12 |     ^^^^^^^^^ ^^^^^
13 |     1         2
14 | 
15 | 1. This is a `push` instruction.
16 | 2. From where we will push. `000` is `eax`.
17 | 
18 | This is documented as: opcode == `50+rd` in the Intel manual. The `+rd` part says that the 3 last bits indicate where to push from.
19 | 
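The `+rd` split can be checked with a mask. The following minimal Python sketch is not part of the original repository; `decode_push_r32` and `REG32` are hypothetical names, and only the one-byte `50+rd` form is handled.

```python
# Register names indexed by the 3 low bits of the opcode.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_push_r32(opcode):
    """Return the register pushed by a one-byte `50+rd` opcode."""
    if opcode & 0b11111000 != 0x50:  # the 5 high bits identify the instruction
        raise ValueError("not a 50+rd push opcode")
    return REG32[opcode & 0b111]     # the 3 low bits select the register


print(decode_push_r32(0x50))  # prints "eax"
print(decode_push_r32(0x55))  # prints "ebp"
```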


--------------------------------------------------------------------------------