├── .gitignore ├── README.md ├── hello-world ├── build-linux.sh ├── build-macos-sh ├── hello ├── hello-linux.asm └── hello-macos.asm ├── images └── cardiac2-s.jpg └── instruction-set ├── addressing.asm └── build.sh /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | *.o 3 | *.lst 4 | hello-macos 5 | hello-linux 6 | 7 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Programming in assembly language tutorial 2 | 3 | This tutorial covers AMD64/Intel 64 bit programming. Instruction sets for other processors, such as ARM or RISC-V are radically different, though the concepts are the same. They all have instructions, registers, stacks, and so on. Once you know one processor's assembly language, adapting to a different processor is rather easy. 4 | 5 | I found that I was writing code for a new processor within hours, and writing quality code within a week or two. This is going from Z80 to 6502 to 6809 to 8086 to 68000 and so on. It is interesting to be able to look at a processor's technical manuals and evaluate the power and flexibility of its instruction set. 6 | 7 | This tutorial is aimed at novices and beginners who want to learn the first thing about assembly language programming. If you are an expert, you may or may not get a lot out of this. 
8 | 9 | - [Programming in assembly language tutorial](#programming-in-assembly-language-tutorial) 10 | - [Introduction](#introduction) 11 | - [Bits, Bytes, Words, and Number Bases](#bits-bytes-words-and-number-bases) 12 | - [Math](#math) 13 | - [Boolean Algebra](#boolean-algebra) 14 | - [Bit Shifting](#bit-shifting) 15 | - [Memory](#memory) 16 | - [ELF Files and the Loader](#elf-files-and-the-loader) 17 | - [Permissions](#permissions-sections-and-privileged-instructions) 18 | - [MMU](#mmu) 19 | - [Paging and Swapping](#paging-and-swapping) 20 | - [Other exceptions](#other-exceptions) 21 | - [Segfault](#segfault) 22 | - [Divide By Zero](#divide-by-zero) 23 | - [Invalid Opcode](#invalid-opcode) 24 | - [General Protection](#general-protection) 25 | - [ALU](#alu) 26 | - [x64/AMD64 Registers](#x64amd64-registers) 27 | - [General Purpose Registers](#general-purpose-registers) 28 | - [Special Purpose Registers](#special-purpose-registers) 29 | - [CPU Control Registers](#cpu-control-registers) 30 | - [Stack](#stack) 31 | - [Instruction Pointer](#instruction-pointer) 32 | - [Flags](#flags) 33 | - [AMD64 Instruction Set](#amd64-instruction-set) 34 | - [Assembly source](#assembly-source) 35 | - [Addressing Modes](#addressing-modes) 36 | - [Register Operands](#register-operands) 37 | - [Direct Memory Operands](#direct-memory-operands-better-known-as-immediate-operands) 38 | - [Indirect Operands](#indirect-operands) 39 | - [Indirect with Displacement](#indirect-with-displacement) 40 | - [Indirect with displacement and scaled index](#indirect-with-displacement-and-scaled-index) 41 | - [Commonly Used Instructions](#commonly-used-instructions) 42 | - [Aritmetic](#aritmetic) 43 | - [Boolean Algebra](#boolean-algebra-1) 44 | - [Branching and Subroutines](#branching-and-subroutines) 45 | - [Bit Manipulation](#bit-manipulation) 46 | - [Register Manipulation, Casting/Conversions](#register-manipulation-castingconversions) 47 | - [Flags Manipulation](#flags-manipulation) 48 | - [Stack 
Manipulation](#stack-manipulation) 49 | - [Assembler Source, Directives, and Macros](#assembler-source-directives--and-macros) 50 | - [Assembler Directives](#assembler-directives) 51 | - [section type](#section-type-options) 52 | - [bits 16, bits 32, and bits 64, use16, use32, use64](#bits-16-bits-32-and-bits-64-use16-use32-use64) 53 | - [Comments](#comments) 54 | - [Constants](#constants) 55 | - [Program Variables and Strings](#program-variables-and-strings) 56 | - [Assembler Variables and Labels](#assembler-variables-and-labels) 57 | - [Repetion](#repetion) 58 | - [Macros](#macros) 59 | - [Conditional Assembly](#conditional-assembly) 60 | - [Alignment](#alignment) 61 | - [Structures](#structures) 62 | - [Includes](#includes) 63 | - [Hello, World](#hello-world) 64 | - [MacOS Version](#macos-version) 65 | - [Linux version](#linux-version) 66 | - [How it works](#how-it-works) 67 | - [Linux Syscalls](#linux-syscalls) 68 | - [MacOS Syscalls](#macos-syscalls) 69 | 70 | ## Introduction 71 | 72 | How CPUs work has become something of a lost art. Only a small percentage of software engineers need to understand the inner workings of CPUs, typically those who work on embedded software, operating systems, compilers, or JIT compilers. 73 | 74 | Assembly language was one of the first languages I ever learned. Back in the early/mid 1970s, my high school classes progressed from BASIC to FORTRAN IV, to BAL (Basic Assembly Language) for the IBM 360 to which we had access. One of the earliest lessons we were taught used a cardboard teaching aid, CARDIAC. CARDIAC stands for "CARDboard Illustrative Aid to Computation"; it was developed at Bell Labs, which was a big deal back then (Unix was invented there, as well as the C programming language). 75 | 76 | See https://www.cs.drexel.edu/~bls96/museum/cardiac.html. 77 | 78 | With CARDIAC, you simulated the memory, operation, and CPU cycles of a mythical CPU. 
The numbers and instructions for this CPU were in base 10, so the student doesn't have to understand how to convert to the common base 2, base 8, or base 16 used in computing. CARDIAC provided a cardboard device that had representation for memory, program steps, and ALU (math and logic operations). 79 | 80 | You wrote your program and variables on the cardboard and then, step by step, followed the program and performed the operations for each step. The steps are identified by a single digit, 0-9: 81 | 82 | - 0 INP read a card into memory 83 | - 1 CLA clear accumulator and add from memory 84 | - 2 ADD add from memory to accumulator 85 | - 3 TAC test accumulator and jump if negative 86 | - 4 SFT shift accumulator 87 | - 5 OUT write memory location to output card 88 | - 6 STO store accumulator to memory 89 | - 7 SUB subtract memory from accumulator 90 | - 8 JMP jump and save PC 91 | - 9 HRS halt and reset 92 | 93 | These values are "opcodes" and the encoded instructions/steps include the opcode plus address, number of bits to shift, etc. 94 | 95 | The CPU features only two registers: accumulator and program counter. More complex and modern CPUs have many more registers than these two. 96 | 97 | These instructions and registers are enough to learn from. You learn about memory layout, instruction opcodes, instruction encoding, memory access, and so on. 98 | 99 | In this tutorial, I will cover the basics of programming the x64/AMD64 CPU in assembly language. As I progress, you will see how the CPU is really a glorified version of CARDIAC! 100 | 101 | ## Bits, Bytes, Words, and Number Bases 102 | 103 | The smallest piece of information that a CPU processes is a "bit." A bit is a small integer or boolean type value, either 0 (off/false) or 1 (on/true). 104 | 105 | Bits are then organized as "bytes", or 8 bits grouped together. 
You can visualize a byte like this: 106 | 107 | ``` 108 | 76543210 109 | ``` 110 | 111 | The digits represent what we call a bit number, and each digit (bits 0-7) may be a 0 or a 1. A byte can represent an unsigned value of 0-255, or a signed value of -128-127. Bit 7 of the byte is considered the "sign bit" - if it is 1, then the byte as a signed value is negative, if it is 0, then the byte is positive. Note that you decide whether the byte is processed as signed or unsigned; more on this later, but for now it is important to understand how the bits make up bytes and signed/unsigned values are represented. 112 | 113 | A "word" is two bytes grouped together, which means we have 16 bits together. You can visualize a word like this: 114 | ``` 115 | 5432109876543210 116 | 111111 117 | ``` 118 | 119 | The high order, sign bit, is bit 15. 120 | 121 | The x86 also has DWORD values, which are two words combined. It also has QWORD values which are two DWORDs combined. The pattern is the same for any of these size values - the high bit is the sign bit, etc. 122 | 123 | From this point forward, I'll use "word" to mean one of these sized values, unless otherwise stated. 124 | 125 | When we talk about the value of the word, we typically use base 2, base 4, base 8, base 10, and base 16. Of these, base 8 isn't used much, but I'll explain a common use case for base 8. 126 | 127 | In base 2 (also called "binary"), we simply talk about the value as the bits. That is, an unsigned byte might be 11111111, or 11101110, and so on. We might add a lead 0 and terminating b for clarity (and this is the syntax used in assembly programming): 011111111b. 128 | 129 | Base 10 is the number base we use every day. You count from 0 to 9 for each digit position in base 10. When you add 1 to the value 9, you clear it (set to 0), and bump the 10s digit. That is, 9+1 becomes 10. 
As you go right to left in base 10, the digits are worth n x 1 (10 to the power of 0), n x 10 (10 to the power of 1), n x 100 (10 to the power of 2), and so on. 130 | 131 | In base 2, we count from 0 to 1 for each digit position. When you add 1 to a 1 in a position in the byte, you clear it and increment the next higher bit (and continue until you find an existing 0 in position, which becomes 1). As you go right to left in base 2, the digits are worth n x 1 (2 to the power of 0), n x 2 (2 to the power of 1), n x 4 (2 to the power of 2), and so on. 132 | 133 | In base 8 (also called "octal"), we count from 0-7 for each digit position. Going right to left, n x 8 to the power of 0, n x 8 to the power of 1, n x 8 to the power of 2, etc. 134 | 135 | In base 16 (also called "hex"), we count from 0-15 for each digit position. We use a counting system that is 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, then 10. So going from right to left in a hex number, the digits are n x 16 to the power of 0, n x 16 to the power of 1, n x 16 to the power of 2, and so on. 136 | 137 | A "nybble" is useful for working with hex. A nybble is 4 bits. It turns out that the value you can store in 4 bits is 0-15, perfect for hex. Each nybble is exactly one hex digit. 138 | 139 | Let's look at the unsigned value ranges for a few bit counts: 140 | ``` 141 | 1 bit: 0-1 142 | 2 bits: 0-3 143 | 3 bits: 0-7 144 | 4 bits: 0-15 145 | 5 bits: 0-31 146 | ... 147 | ``` 148 | 149 | 150 | The pattern here is that the max value is 2 to the power of the number of bits, minus 1. That is, for 5 bits, the max value 31 is 2 to the 5th power (32) minus 1. 151 | 152 | When we convert a binary byte to hex, we visualize it something like this: 153 | 154 | ``` 155 | 76543210 is 7654 3210 156 | ``` 157 | We've grouped the bits as two nybbles. We can then convert the two nybbles (4 bits each) to two hex digits. 158 | 159 | This table makes the conversion simple. 
But if you practice using hex, you will know this table by heart. 160 | 161 | ``` 162 | 0000 | 0 163 | 0001 | 1 164 | 0010 | 2 165 | 0011 | 3 166 | 0100 | 4 167 | 0101 | 5 168 | 0110 | 6 169 | 0111 | 7 170 | 1000 | 8 171 | 1001 | 9 172 | 1010 | A 173 | 1011 | B 174 | 1100 | C 175 | 1101 | D 176 | 1110 | E 177 | 1111 | F 178 | ``` 179 | 180 | For example, we visualize the binary value 010100101b as 1010 0101. Using the table above, we see 1010 is A, and 0101 is 5. So the byte value is A5. We represent hex numbers in assembly as 0xa5, or 0a5h, or sometimes $a5. 181 | 182 | We can use the same scheme to convert 16 bit or 32 bit or 64 bit values to hex! 183 | 184 | I promised to discuss a use for Octal, something we might use every day. In the linux/mac/*nix filesystem, permissions are actually octal values. 185 | ``` 186 | -rw-r--r-- 1 mschwartz staff 5.9K Feb 16 14:13 README.md 187 | ``` 188 | See the -rw-r--r-- ? What we have here is 9 bits in octal. rw- is 110, r-- is 100, r-- is 100. So we can convert this to the internal filesystem representation of 644. If you want to make a file rw-r--r--, you use the chmod command: 189 | ``` 190 | chmod 644 README.md 191 | ``` 192 | The three bits, technically, are "able to read", "able to write", and "able to execute." The first octal value is for the owner, the second is for anyone in the same user group as the owner, and the third is for everyone else. So to allow the owner and group to read and write, but nobody else can read or write the file, we want rw-rw---- or 660. To set a file to be executable, I typically use ```chmod 755```. 193 | 194 | ## Math 195 | 196 | Adding two values of the same word size is simple. The byte 100 plus the byte 50 = 150. 100 + 50 = 150. 197 | 198 | This works for signed and unsigned values. The math is always unsigned, but the result is up to you. If the high order bit (bit 7 of a byte, bit 15 of a 16-bit word...) is 1, the signed value is negative. 
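The point that the same bits can be read as signed or unsigned can be checked with a few lines of C. This is a minimal sketch, not part of the tutorial's repository; the function names `add_bytes`, `as_signed`, and `sign_bit` are made up for illustration. It assumes a two's complement machine (which x64 is), where converting a byte value above 127 to `int8_t` wraps to a negative number.

```c
#include <stdint.h>

/* Add two bytes; only the low 8 bits of the sum are kept. */
uint8_t add_bytes(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}

/* Read the same 8 bits as a signed value.
   Assumption: two's complement conversion, as on x64 with mainstream compilers. */
int8_t as_signed(uint8_t v) {
    return (int8_t)v;
}

/* Bit 7 of a byte is the sign bit. */
unsigned sign_bit(uint8_t v) {
    return (v >> 7) & 1u;
}
```

With these, `add_bytes(100, 50)` is 150 when read as an unsigned byte, and `as_signed` of that same byte is -106, because bit 7 is set.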
199 | 200 | What happens when we add a byte value to a 16-bit word value? The byte value is really a 16-bit value, but the upper 8 bits are zeros. That is, 0xaa can be visualized as 0x00aa. We just add the full 16-bit values together. 201 | 202 | What happens when we add 1 to a byte size value of 255? We only have 8 bits for the result, but we have 9 bits of actual value. That is, 255 + 1 is 256. Represented in binary, you have 255 = 011111111b + 1 = 0100000000b (9 bits!). The 9th bit is basically ignored as far as the result byte goes (more on this later). So if you look at the lower 8 bits of our 9 bit result, we get 0! 203 | 204 | All this extends to 32 bit and 64 bit words. 205 | 206 | Multiplication of two values requires a double-sized result, or you lose a lot more than just the 9th bit! Consider 255 x 255 = 65025 (0xfe01), which fits in 16 bits but not in 8. If we have a byte result, we get 0x01 due to the overflow, losing over 65000 in result value. 207 | 208 | ## Boolean Algebra 209 | 210 | Boolean Algebra is a form of math that we use to deal with true/false values. We use Boolean Algebra all the time in various programming languages, with operators like & (AND), | (OR), ^ (exclusive OR, or XOR), and ! (NOT), ~ (also NOT) and so on. These operators are equivalent to "math-like" operators. 211 | 212 | The simplest way to visualize Boolean Algebra is using single bit values and truth tables. 0 = false, 1 = true. For single bit value operands, there are only (always) 4 combinations possible. 213 | 214 | ``` 215 | AND (if both operands are true, the result is true) 216 | 0 & 0 = 0 217 | 0 & 1 = 0 218 | 1 & 0 = 0 219 | 1 & 1 = 1 220 | 221 | OR (if either operand is true, the result is true) 222 | 0 | 0 = 0 223 | 0 | 1 = 1 224 | 1 | 0 = 1 225 | 1 | 1 = 1 226 | 227 | XOR (if only one operand is true, the result is true) 228 | 0 ^ 0 = 0 229 | 0 ^ 1 = 1 230 | 1 ^ 0 = 1 231 | 1 ^ 1 = 0 232 | ``` 233 | 234 | The ! (NOT) operator only has one operand. 
If the operand is true, the result is false. If the operand is false, the result is true. Applied to every bit of a word, the result is also known as a 1's complement: we've just inverted the state of all the bits. 235 | 236 | The ~ (1's complement) operator inverts the bits in the word. 237 | 238 | If we look at the operands as byte values, we have something like: 239 | ``` 240 | 00000000 & 00000000 = 0 241 | 00000000 & 00000001 = 0 242 | ... 243 | ``` 244 | BUT, we have 8 bits, so the operation is performed on all 8 bits in the two operands. 245 | ``` 246 | 10000000 247 | OR 00000001 248 | -------- 249 | ^ ^ 250 | = 10000001 251 | ^ ^ 252 | 253 | NOT 10000001 254 | = 01111110 255 | ``` 256 | This is a most important concept to grasp! 257 | 258 | We use the Boolean Algebra operators on words to achieve useful results. 259 | 260 | A typical use of the AND operator is to clear bits in a value. If we AND with a value that is the inverse of a power of 2, we are simply clearing a bit. n AND ~4 clears bit 2 in n (4 is 1<<2, and bits are numbered from 0). 261 | 262 | A typical use of the OR operator is to set bits in a value. If we OR with a value that is a power of 2, we are simply setting a bit. n OR 4 sets bit 2 in n. 263 | 264 | A great use of the AND operator is to do a modulo of a number to a power of 2. For example, AND with 3 gets you a result between 0 and 3 (modulo 4). AND with 7 gets you a result between 0 and 7 (modulo 8). 265 | 266 | ## Bit Shifting 267 | 268 | You can shift the bits of a byte to the left (<< operator in C) by 1-7 bits. For example: 269 | 270 | ``` 271 | 001111101b << 1 = 011111010b 272 | 273 | 001111101b shifted left becomes 274 | //////// 275 | x011111010b (bit 0 becomes 0, bit 1 becomes 1, bit 2 becomes 0) 276 | ``` 277 | Note that we have the overflow problem here, as we did with addition. We have an upper bit that ends up in the "bit bucket" (thrown away). 278 | 279 | A left shift of 1 bit is effectively a multiplication by 2. Consider 001b<<1 is 010b, or 2. A left shift of 2 bits is a multiply by 4, and so on. 
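The left shift and the AND-as-modulo trick described above can be tried out in C. This is a small sketch under stated assumptions: the helper names `shift_left` and `mod_pow2` are hypothetical, and the values are bytes (`uint8_t`) to match the examples in the text.

```c
#include <stdint.h>

/* Shift a byte left by n bits (0-7). Bits shifted past bit 7
   fall into the "bit bucket" when the result is cut back to 8 bits. */
uint8_t shift_left(uint8_t v, unsigned n) {
    return (uint8_t)(v << n);
}

/* Modulo by a power of 2 using AND.
   Only valid when m is a power of 2 (2, 4, 8, ...). */
uint8_t mod_pow2(uint8_t v, uint8_t m) {
    return (uint8_t)(v & (m - 1));
}
```

For example, `shift_left(0x7D, 1)` is 0xFA, matching the binary example above, while `shift_left(0x81, 1)` is 0x02 because the old high bit is lost. `mod_pow2(13, 4)` is 1, the same as 13 % 4.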
280 | 281 | Shifting to the right works similarly, but we now end up with the high bit being cleared and the low bit in the bit bucket. 282 | 283 | A right shift of 1 bit is effectively a divide by 2. But this right shift will take a negative number and make it positive because the sign bit is cleared. So we need a second kind of right shift (arithmetic shift right) for signed values that sets the high bit in the result to the high bit in the initial value. 284 | 285 | A rotation left/right is the same as a shift, except instead of the lost bit ending up in the bit bucket, it wraps around to the other end of the word (a rotate left moves the old high bit into the low bit, and a rotate right does the reverse). 286 | 287 | Other than for the multiply and divide effects, we use bit shifting frequently with Boolean Algebra. 288 | 289 | ``` 290 | To set bit 3: 291 | 292 | n | (1<<3) 293 | 294 | To clear bit 3: 295 | 296 | n & ~(1<<3) 297 | 298 | Note that 1<<3 = 00001000b, 299 | and ~(1<<3) is ~00001000b 300 | or 11110111b. (all the bits are inverted) 301 | When you AND with 11110111b, you are clearing bit 3. 302 | ``` 303 | 304 | ## Memory 305 | 306 | Memory (RAM) can be viewed as an array of bytes. If you have 1MB of RAM, your array is indexed from 0 to 1MB-1. The index is better known as an address. 307 | 308 | Memory is used to store your program, for your program stack, for your program's heap (memory allocation) and to store your variables. In a simple CPU and RAM setup, you might have your program start at index 0, your variables start at the end of the program, your heap starts at the end of your variables, and your stack starts at the top of memory and works its way downward as you push onto it. 
309 | 310 | ``` 311 | HIGH memory address 312 | +--------------+ 313 | | | 314 | | stack | 315 | | grows down | 316 | | address 1M | 317 | | | 318 | +--------------+ 319 | | | 320 | | heap | 321 | | grows up | 322 | | | 323 | +--------------+ 324 | | | 325 | | uninitialized| 326 | | global | 327 | | variables | 328 | | | 329 | +--------------+ 330 | | | 331 | | initialized | 332 | | global | 333 | | variables | 334 | | | 335 | +--------------+ 336 | | | 337 | | code | 338 | | address 0 | 339 | | | 340 | +--------------+ 341 | LOW memory address 342 | ``` 343 | 344 | ## ELF Files and the Loader 345 | The compiler/assembler/linker generate ELF formatted files (this is the Linux format; MacOS uses the Mach-O format, but the ideas are the same). An ELF file is divided into various sections. The more common sections are ```.text``` (code), ```.data``` (initialized data), ```.rodata``` (read only data, i.e. constants), ```.bss``` (uninitialized data), and assorted debugging info sections. 346 | 347 | The operating system program loader reads in the ELF file and allocates memory for the .text section and loads that data from the file into that memory. 348 | 349 | Then the loader allocates memory for the initialized data (.data) and reads that data from the file into that memory. 350 | 351 | Then the loader allocates memory for the constant data (.rodata) and reads that data from the file into that memory. 352 | 353 | The loader allocates memory for the .bss section. Since the .bss section is uninitialized, it only needs to be allocated. 354 | 355 | The linker reads in intermediate object files (```.o```) and links them together to make the final executable. Each .o file may declare variables that can be accessed from other .o files, and may access variables that are defined in some other .o file. The linker fixes up the addresses in the final output so the code works as expected! 356 | 357 | ### Permissions (Sections and Privileged Instructions) 358 | The compiler/assembler/linker generally makes the code execute only. 
If you try to store to those addresses, you will get a segfault. 359 | 360 | The .data and .bss sections are marked as read/write and the .rodata is marked as read-only. 361 | 362 | The way words of the different sizes are stored in memory is determined by the "endianness" of the CPU. A CPU that is big endian stores the high byte first in memory, the next highest byte next, ... and finally the lowest byte last. A CPU that is little endian stores the low byte first, ... the high byte last. The x64/AMD64 processors are little endian. 363 | 364 | The CPU has special features that enforce these permissions. If you try to defeat the permissions, a segfault exception is thrown. The operating system sets up these features when the program is started. When a violation occurs, it kills the program and potentially generates a core dump file of the program. The core dump file can be used later to do forensic debugging/analysis of the failure. 365 | 366 | ### MMU 367 | 368 | In modern operating systems, the CPU uses an MMU (Memory Management Unit) to assign a subset of the system's memory to each program that you run. The MMU maps an address in physical memory to a logical address that the program sees and uses. This allows, for example, a CPU to split the 1MB of RAM into 2x 512K address spaces to run two programs. The address translation makes it so each program thinks it has 512K of RAM starting at address 0 and ending at address 512K - 1. 369 | 370 | The use of the MMU is much more clever than I just explained, but the end result is the same. When a program is launched, it is allocated a small amount of RAM, enough for the program's code and variables and stack and a minimal heap. As the program needs more stack or more heap, the OS adds physical memory to the program's address space using the MMU. The program grows on demand. 371 | 372 | For our purposes, we can assume we're the only program running on the machine. It matters not if there's an OS using the MMU or not, the programming effort and techniques are the same either way. 
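The 2x 512K split described above can be modeled in a few lines of C. This is a toy sketch only: a real MMU translates addresses through page tables rather than a single base address, and the names `SPACE_SIZE`, `NO_MAPPING`, and `translate` are invented for this illustration.

```c
#include <stdint.h>

#define SPACE_SIZE 0x80000u    /* 512K per program, as in the example above */
#define NO_MAPPING UINT32_MAX  /* hypothetical marker for "address not mapped" */

/* Translate a logical address into a physical one for a program whose
   512K region starts at physical address `base`. */
uint32_t translate(uint32_t base, uint32_t logical) {
    if (logical >= SPACE_SIZE) {
        return NO_MAPPING;     /* outside the program's address space */
    }
    return base + logical;
}
```

The first program would use `base` 0 and the second `base` 0x80000. Both see logical addresses 0 through 512K - 1, but `translate(0x80000, 0x100)` lands at physical address 0x80100, and any logical address of 512K or more has no mapping.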
373 | 374 | #### Paging and Swapping 375 | 376 | The operating system only needs to set up the MMU for enough physical memory for the program to execute. Memory is allocated for the MMU in 4096 byte chunks (pages); this is required by the MMU implementation (hardware). 377 | 378 | This scheme is quite efficient, as a small assembly program might only need a couple of megabytes of RAM (2MB for stack is default in the OS!), and your computer might have 16 Gigabytes of RAM. This efficient allocation of the CPU's memory allows you to load and run many programs at the same time. 379 | 380 | When your program tries to access an address in memory that isn't mapped by the OS using the MMU, a page fault exception is raised. The OS sees this and might map in an additional page so that the access can succeed. 381 | 382 | If the system is out of memory, the OS might compress programs and/or their data to make more RAM available. The OS has to decompress this memory when it's those programs' turn to execute, though. MacOS does this compression, and it's very clever. 383 | 384 | Another thing the OS can do when there is an out of memory (OOM) condition is to "page" one or more 4096 byte pages from memory to the system's swap file/partition. This frees up enough pages to use to handle the page fault. When a program that has memory paged to disk is scheduled to run (use the CPU), the code might cause further page faults to read back in the paged memory. It's possible the program never accesses that memory, and that's perfectly fine. 385 | 386 | Yet another thing the OS can do is to swap out entire programs (and their data) to the swap file/partition. When those programs get to run, they have to be entirely read back into memory (and MMU set up), and perhaps swapping another program to disk. When the system is tight on free memory and is swapping heavily, it will become very unresponsive! 
387 | 388 | Finally, if the OS cannot resolve the OOM condition with one of those (or potentially other clever) strategies, it just kills a running program (which one is up to the OS). This seems evil, but what else can it do? 389 | 390 | The stack grows down from high memory. If the stack overflows (grows below the memory allocated for it), a page fault occurs and the OS can add additional pages to the memory map so the stack has more room. 391 | 392 | The heap initially has a small but reasonable amount of RAM allocated. It can be expanded using the ```sbrk``` syscall. This is what the malloc() function in C may do behind the scenes when it needs more heap, though the sbrk() function can be called directly if you know what you're doing. 393 | 394 | ### Other exceptions 395 | 396 | #### Segfault 397 | It should be noted that a program might just randomly access some address that is truly outside the bounds of the program's memory map. Paging or swapping is not performed in this case. The MMU is set up so these addresses are simply not mapped into the program's memory map. Instead of raising a page fault exception, the CPU/MMU raises a segfault exception. 398 | 399 | This is a hard program crash, and the operating system will terminate the program. 400 | 401 | #### Divide By Zero 402 | If your program attempts to divide by zero, this exception is raised and the program is terminated. 403 | 404 | #### Invalid Opcode 405 | If your program somehow executes instructions that are not valid x64/amd64 instructions, this exception is raised and the program is terminated. This will occur, for example, if you push a random number on the stack and then return. Your program starts executing at that random address and who knows what data are there? If the random number/return causes the program to execute outside its address space, you get a Segfault instead. 406 | 407 | #### General Protection 408 | If your program attempts to execute a privileged instruction, this exception is raised and the program is terminated. 
There are quite a few privileged instructions, such as CLI/STI (disable/enable interrupts). An OS should not allow programs to disable interrupts, or your multitasking stops working! 409 | 410 | ## ALU 411 | 412 | The cost of having circuitry to add two arbitrary memory locations together is prohibitive. You have 1M x 1M add circuits required, and that's just for addition! 413 | 414 | The math (add) capability is, instead, implemented in the ALU (Arithmetic-Logic Unit) of the CPU. The CPU provides some (small) number of general purpose "registers" and the ALU implements the add circuitry just between those registers. 415 | 416 | You can think of a register as a (temporary) variable that is on chip, usable by the ALU to do math and logic operations. You have to load your operand or operands into registers to perform math, then you can store the result to a variable in memory. 417 | 418 | For example, to add two numbers at memory locations (addresses) 0x100 and 0x200 and store the result at address 0x300, and we have two registers named a and b: 419 | ``` 420 | load value at 0x100 into a 421 | load value at 0x200 into b 422 | add a and b, leaving result in a 423 | store a at 0x300 424 | ``` 425 | 426 | I have just introduced something like a snippet of assembly language code! We need operations to be able to load memory into registers, add registers together, and store registers to memory. Each of these operations is a CPU "opcode." The CPU reads the byte opcode from memory and executes it. Some opcodes, like the load and store ones, require parameters like the address to load from or store to. These addresses are stored in the program immediately following the opcode. As we progress, we're going to see that the instruction sizes (op code plus parameters) are different depending on the instruction (op code) and parameters. 427 | 428 | In the simplest view of the CPU, the above program is 4 instructions. 
The load and store instructions use 1 byte for opcode and 2 more for the addresses. The add uses just the one byte for the opcode (add b to a). 429 | 430 | Each instruction uses 1 or more "clock cycles," depending on the complexity of the operation. The load instruction requires a clock cycle to load the opcode, another 2 to read the two bytes of the address, and another 2 to load the value from RAM at the address specified in the parameters, for 5 total clock cycles. The add instruction takes just 1 clock cycle. The store takes 5 as well. 431 | 432 | ## x64/AMD64 Registers 433 | 434 | For all intents and purposes, the Intel and AMD processors have the same registers until you get into exotic features (like hardware video decoding). I use the terms x64 and AMD64 interchangeably throughout this tutorial. 435 | 436 | ### General Purpose Registers 437 | 438 | You have 4 general purpose registers, A, B, C, and D, though we don't use these specific names for the registers. The size of the register/contents matters. So for a byte value, we use AL or AH, or BL/BH, or CL/CH, or DL/DH. The L means "low order byte" and H means "high order byte." For word values, we use AX, BX, CX, and DX. For 32 bit word values, we use EAX, EBX, ECX, and EDX. And for 64 bit word values, we use RAX, RBX, RCX, and RDX. 439 | 440 | When we write to the 8 bit or 16 bit registers, the remaining bits in the full register are not affected. For example, if AX contains 0x0102 and we load 0x03 into AL, AX will contain 0x0103. The 32 bit registers are the exception: in 64 bit mode, writing to EAX (or any other 32 bit register) clears the upper 32 bits of the matching 64 bit register. This will only matter if you load bytes into registers and add word registers together, in error. There might be tricks you play to take advantage of the nature of the register loads/stores. 441 | 442 | AMD64 and x64 add 8 more general purpose registers, R8, R9, R10, R11, R12, R13, R14, and R15. These are accessed as 8, 16, 32, and 64 bit registers. R8 through R15 (64 bits), R8D-R15D (32 bits), R8W-R15W (16 bits), and R8B-R15B (8 bits). 
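The way AL, AH, and AX overlay the larger register can be modeled in C with masks. This is only a model of the behavior described above, written as a sketch: the function names `set_al`, `set_ah`, and `set_ax` are hypothetical, and a plain `uint64_t` stands in for RAX.

```c
#include <stdint.h>

/* Model of loading AL: replace bits 0-7, leave the rest alone. */
uint64_t set_al(uint64_t rax, uint8_t v) {
    return (rax & ~(uint64_t)0xFF) | v;
}

/* Model of loading AH: replace bits 8-15, leave the rest alone. */
uint64_t set_ah(uint64_t rax, uint8_t v) {
    return (rax & ~(uint64_t)0xFF00) | ((uint64_t)v << 8);
}

/* Model of loading AX: replace bits 0-15, leave the rest alone. */
uint64_t set_ax(uint64_t rax, uint16_t v) {
    return (rax & ~(uint64_t)0xFFFF) | v;
}
```

`set_al(0x0102, 0x03)` gives 0x0103, which is the AX example from the text, and `set_ax` on a full 64 bit value changes only the low 16 bits.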
443 | 444 | ### Special Purpose Registers 445 | 446 | The RCX/ECX/CX (CX) register doubles as a counter for dedicated instructions. The AMD64 instruction set includes instructions to fill, copy, and compare memory, and loops that use this register as the number of bytes/words/dwords/qwords to fill/copy/compare. The special loop instructions use this register as the loop counter as well. 447 | 448 | The RSI/ESI/SI and RDI/EDI/DI/ registers are general purpose "source" and "destination" registers for the fill, copy, and compare instructions. 449 | 450 | The RBP register is a general purpose register that is typically used as a base address register or by high level language compilers to maintain function stack frames (arguments, return address, and local variables allocated on the stack). 451 | 452 | ### CPU Control Registers 453 | 454 | #### Stack 455 | The RSP register contains the address of the last thing pushed on the processor stack. You can push registers on the stack to preserve their values, you can pop them to restore their values, address values already on the stack by index, etc. 456 | 457 | #### Instruction Pointer 458 | The RIP register contains the address of the next instruction to be executed. The CPU automatically adds the correct number to it as it executes instructions to keep it pointed at the correct next instruction. When you call a subroutine, the RIP is pushed on the RSP stack and RIP is loaded with the address of the subroutine. When the subroutine returns, the RIP that was pushed before the call is popped from the stack into RIP. Execution continues at the instruction after the call. 459 | 460 | #### Flags 461 | The FLAGS register is 64 bits containing information provided by the CPU to the program, and commands from the program to the CPU. Not all the bits are used. See https://en.wikipedia.org/wiki/FLAGS_register. 462 | 463 | An example of the bits in FLAGS set by the CPU is the Carry Flag. 
It is set when you have a carry after an arithmetic operation. For example, if you add 1 to the AL register that contains 255, you will get AL=0, Carry = 1. If you add 1 to AL=254, the Carry will be 0.

An example of the bits in the FLAGS set by the program is the Direction Flag. If this is 0, the fill/copy/etc. instructions work from start address forward (auto-increments SI and DI). If this is 1, the operations are done backward (auto-decrement).

The FLAGS register is there to use, but we might really only directly use the Carry bit and Direction bit. We might use the Carry bit to return a true/false result from a function. The CLC and STC instructions clear and set the Carry bit.

The various conditional branch instructions test the Carry and Zero bits (and, for signed compares, the Sign and Overflow bits) internally.

There are several instructions that set and clear these bits, programmatically.

# AMD64 Instruction Set

You will learn the instruction set as you go. The instruction set is documented as a reference manual, not a programming manual. That is, each instruction is documented as to what it does. But there is no particular "how to use this instruction" documentation.

You can find the instruction set documented on various web sites. The best source is the Intel Programmer's Manual or the AMD64 Programmer's Manual.

Here is a decent web page that lists the instructions in a table, one line per instruction with a short description.

https://www.felixcloutier.com/x86/

There are over 1500 instructions, from AAA to XTEST, that we can use. That is too many to document every one here. However, there are far fewer commonly used instructions that we use for most things.
484 | 485 | The format of a line of source code in assembly is: 486 | ``` 487 | [optional label] instruction 488 | or 489 | [optional label] instruction operand 490 | or 491 | [optional label] instruction operand1, operand2 492 | ``` 493 | 494 | 495 | When assembled, the instructions are encoded as opcode and operands as a sequence of bytes. The CPU is able to execute these instructions. 496 | 497 | ## Assembly source 498 | 499 | In assembly source, the NASM assembler expects operands to be specified as ```destination, source``` (Intel syntax) while the gas assembler expects operands to be specified as ```source, destination``` (AT&T syntax). The assembler language for the various CPUs (e.g. MC68000, AMD64, ARM, etc.) each specify whether the left operand is source or destination. The gas assembler can be used to assemble source for various processors so it defaults to source, destination format, though you can tell it to use Intel (NASM) syntax. 500 | 501 | In Intel syntax source programs, the semicolon (;) character introduces the start of a comment. All characters from that point on, to the end of the line, are ignored. 502 | 503 | Before we look at some of these instructions, we need to look at addressing modes. 504 | 505 | ## Addressing Modes 506 | 507 | Addressing modes are the means by which operands to instructions are described and how they execute. For example, Register operands indicate specific registers, but memory operands can be addressed through a variety of combinations of offsets and/or register contents. 508 | 509 | To examine the addressing modes, we'll use the MOV instruction, which copies a value in a register to memory or loads a value to a register from memory. 510 | 511 | The source and/or destination operand is specified using one of the addressing modes. 512 | 513 | The instruction-set/addressing.asm file contains example usage of the various addressing modes. 

### Register Operands

Rather than memory being the source or destination, the operand is a register. For example,
```
mov rax, rbx ; moves contents of rbx register into the rax register.
```

### Direct Memory Operands (better known as Immediate operands)

This mode moves a constant into a register. The constant is encoded in the instruction, after the opcode. For example,
```
mov rax, 10 ; source operand is a constant
```

#### Indirect Operands

This mode uses a register as the address of a memory location to be operated on (e.g. load from, store to). For example,
```
mov [rax], rbx ; store contents of rbx to memory location contained in rax
```

#### Indirect with Displacement

This mode uses a register as the base address of a memory location, added to a fixed offset, to determine the address of a memory location to be operated on. For example,
```
mov rax, [rbx+24] ; access memory at 24 + contents of rbx
```
The purpose of this addressing mode is to facilitate accessing a structure and its members. Consider:
```
struct {
    char *name,
         *address,
         *phone;
} person;
person.name = nullptr;
person.address = nullptr;
person.phone = nullptr;
```

In assembly, we'd do something like this:
```
NAME    equ 0
ADDRESS equ 8
PHONE   equ 16                  ; each pointer is 8 bytes in 64 bit mode

    mov rsi, person             ; load address of person into RSI
    mov rax, 0                  ; nullptr
    mov [rsi + NAME], rax
    mov [rsi + ADDRESS], rax
    mov [rsi + PHONE], rax
```

Another use of this addressing mode is for stack frames for a language such as "C", especially for calling subroutines. A subroutine may have arguments passed to it on the stack, by value (like an int) or reference (like an address of a struct or string or whatever).
A subroutine may need its own local variables. When a subroutine is called recursively, each recursive call must prepare the stack so it has arguments to pass, and allow for the next iteration's local variables on the stack.

The RBP register is used for stack frames when stack conventions are used for calling functions in "C".

The calling function pushes arguments on the stack (right to left). That is, for foo(a, b, c);, the compiler will generate code to push c, then b, then a.

Upon entry to a function, RBP contains the stack frame pointer for the calling function. The compiler generates code to immediately push it. Then the RSP stack pointer is copied into RBP.

At this point, RBP points to the caller's saved RBP on the stack. The return address sits just above it, and positive offsets from RBP beyond the return address are the arguments to the function.

For local variables, the compiler generates a subtract from RSP to make the desired space on the stack. When the function calls another, RSP is below the allocated variables, so it all works. Negative offsets from RBP are used to access the local variables.

To return, the compiler generates code to copy RBP back into RSP (discarding the local variables), pop rbp (restore caller's stack frame), and return. The LEAVE instruction does the first two steps in one. The calling code has to adjust RSP to remove the pushed arguments.

Note: AMD64/X64 use a register scheme for passing arguments to functions and uses the stack when there are too many arguments to pass (not enough registers). See https://en.wikipedia.org/wiki/X86_calling_conventions. I present this information because you will likely run across stack frames, especially when viewing GDB (command line debugger) backtraces.

Let's see a little bit of example code and the assembly generated by the compiler. Note that this is 32 bit code (EBP and ESP rather than RBP and RSP), where all arguments are passed on the stack, and that it is in AT&T syntax, ```source, destination``` format. The register names are prefixed with %.

```
// source
void bar(int a, int b) {
    int x, y;

    x = 555;
    y = a+b;
}

void foo(void) {
    bar(111,222);
}

; compiles to:
bar:
    pushl %ebp
    movl %esp, %ebp
    subl $16, %esp
    movl $555, -4(%ebp)
    movl 12(%ebp), %eax
    movl 8(%ebp), %edx
    addl %edx, %eax
    movl %eax, -8(%ebp)
    leave
    ret

foo:
    pushl %ebp
    movl %esp, %ebp
    subl $8, %esp
    movl $222, 4(%esp)
    movl $111, (%esp)
    call bar
    leave
    ret
```

Note the use of indirect with offset addressing modes! The arguments are at positive offsets from EBP (8 and 12) and the local variables are at negative offsets (-4 and -8).

#### Indirect with displacement and scaled index

This addressing mode is used to access array elements. To illustrate how this mode works:

* an array of bytes, each element is 1 byte
* an array of words, each element is 2 bytes
* an array of dwords, each element is 4 bytes
* an array of qwords, each element is 8 bytes

As you index the array, you have to "scale" the index before adding it to the base of the array. The scaling ensures we are addressing byte, word, dword, or qword elements properly.

```
mov [rsi + member + rbx*4], eax ; store dword in eax at rsi + member (offset) + rbx x 4
```
The above example stores a dword into memory. We are accessing a struct member that is an array of dwords. The rbx register contains the index into the array, [0 ... array.length-1]. The 4 is the scale factor, or size of the dword.

Note that member may be 0 - in this case, rsi simply contains the address of the array.
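
As an illustration of the scaled index mode, here is a small sketch that sums an array of dwords. It assumes NASM on 64-bit Linux. The names `numbers` and `COUNT` and the loop label are made up for this example. The sum is returned as the process exit status so you can see it with `echo $?` after running the program.

```nasm
; Sketch only: sum four dwords using base + index*4 addressing.
        bits 64
        section .text
        global _start
_start:
        mov rsi, numbers            ; base address of the array
        xor rbx, rbx                ; index = 0
        xor eax, eax                ; sum = 0
next_element:
        add eax, [rsi + rbx*4]      ; element i lives at base + i * 4
        inc rbx
        cmp rbx, COUNT
        jb next_element             ; keep going while index < COUNT (unsigned)

        mov edi, eax                ; exit status = sum (10+20+30+40 = 100)
        mov eax, 60                 ; exit (Linux syscall number)
        syscall

        section .data
numbers: dd 10, 20, 30, 40
COUNT   equ ($ - numbers) / 4       ; number of dword elements
```

For an array of qwords the only changes would be `dq` for the data and a scale factor of 8.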



# Commonly Used Instructions

## Aritmetic

```
ADC - add with carry (add a value, plus the carry flag)
ADD - add two registers together
DEC - decrement by 1
DIV - unsigned divide
IDIV - signed divide
IMUL - signed multiply
INC - increment by 1
MUL - unsigned multiply
NEG - two's complement (multiply by -1)
SBB - subtract with borrow (carry flag)
SUB - subtract
LEA - load effective address (formed by some expression / addressing mode) into register
```

## Boolean Algebra
```
AND - logical AND two registers together
NOT - one's complement (invert all the bits in the operand)
OR - logical OR
XOR - logical exclusive or
TEST - logical compare
```

## Branching and Subroutines
```
CALL - call a subroutine/function/procedure
SYSCALL - call an OS function (Linux, Mac)
ENTER - make stack frame for procedure parameters
LEAVE - high level procedure exit
RET - return from subroutine
CMP - compare two operands
JA - jump if result of unsigned compare is above
JAE - jump if result of unsigned compare is above or equal
JB - jump if result of unsigned compare is below
JBE - jump if result of unsigned compare is below or equal
JC - jump if carry flag is set
JE - jump if equal
JG - jump if result of signed compare is greater than
JGE - jump if result of signed compare is greater than or equal
JNC - jump if carry not set
JMP - go to / jmp (simply loads the RIP register with the address)
```

## Bit Manipulation
```
BT - bit test (test a bit)
BTC - bit test and complement
BTR - bit test and reset
BTS - bit test and set
RCL - rotate 9 bits (carry flag, 8 bits in operand) left count bits
RCR - rotate 9 bits (carry flag, 8 bits in operand) right count bits
ROL - rotate 8 bits in operand left count bits
ROR - rotate 8
bits in operand right count bits
SAL - arithmetic shift operand left count bits
SAR - arithmetic shift operand right count bits (maintains sign bit)
SHL - logical shift operand left count bits (same as SAL)
SHR - logical shift operand right count bits (does not maintain sign bit)
```

## Register Manipulation, Casting/Conversions
```
MOV - move register to register, move register to memory, move memory to register
XCHG - exchange register/memory with register
CBW - convert byte to word
CWD/CDQ - convert word to double word/convert double word to quad word
```

## Flags Manipulation
```
CLC - clear carry flag/bit in flags register
CLD - clear direction bit in flags register
STC - set carry flag
STD - set direction flag
```

## Stack Manipulation
```
POP - pop a register off the stack
POPF - pop stack into flags register
PUSH - push a register on the stack
PUSHF - push flags register on the stack
```

# Assembler Source, Directives, and Macros
The assembler is a program that reads assembly source code and generates a binary output file or ELF .o file. The assembler reads a line at a time and writes the encoded program instructions for that line to the output file.

NASM is a great free assembler. The LLVM Assembler (as) and the Gnu Assembler/as/gas (part of GNU binutils, which is installed alongside gcc) are two more assemblers that are used for Linux and MacOS assembly development/programming. For all intents and purposes, the LLVM and Gnu assemblers are identical. There are other assemblers out there, but they are beyond the scope of this tutorial.


There are two styles of assembly source for x64: Intel and AT&T.

* Intel syntax expects operands to be specified as ```destination, source```.
* AT&T syntax expects operands to be specified as ```source, destination```.
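
To make the difference concrete, here is a short side by side sketch. The instructions are written in Intel syntax as NASM expects them, and the comment on each line shows what I believe is the equivalent AT&T form for gas. Nothing here is meant to be run. It only needs to assemble.

```nasm
; Intel syntax (NASM): destination first, no register prefix, [] for memory.
; AT&T syntax (gas):   source first, % on registers, $ on constants,
;                      displacement(base,index,scale) for memory.
        bits 64
        section .text
        mov rax, rbx                    ; AT&T: movq %rbx, %rax
        mov rax, 10                     ; AT&T: movq $10, %rax
        mov rax, [rbx + 24]             ; AT&T: movq 24(%rbx), %rax
        mov [rsi + rbx*4 + 8], eax      ; AT&T: movl %eax, 8(%rsi,%rbx,4)
```

Note that AT&T syntax also puts the operand size in the instruction name (the `q` and `l` suffixes), where Intel syntax infers it from the register used.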

The NASM assembler uses Intel syntax and the GNU/LLVM assemblers can use either Intel or AT&T; you choose which using an assembler directive.

## Assembler Directives
An assembler directive is not a machine instruction. Instead, directives are used to convey information to the assembler to affect code generation as you prefer. Assembler directives are specific to the assembler you are using and source code using them is not portable between assemblers. The operand order of Intel and AT&T syntax also makes code written for one not portable to an assembler using the other.

The gas (gnu/llvm) assembler uses the .intel_syntax directive to tell the assembler that the source format of the file is Intel syntax. Otherwise, AT&T syntax is assumed.

I'm not going to expand on all the directives for gas and NASM. There are basically similar directives for both assemblers. I prefer using NASM, though there is no reason you can't use gas - whichever you prefer. I'll document the common NASM directives here.

There are a lot of directives; I'm not covering all of them. For expanded information, see the NASM manual online at https://nasm.us. Hopefully, you find what is covered here to be enough to get you going.

### section type [options]
The section directive specifies that the following instructions/directives apply to the specified section. Examples:
```
section .text
section .bss exec
section .rodata
```
These types were defined earlier in this document. The exec option (for ELF output) marks this bit of .bss as having execute permission in addition to read/write.

### bits 16, bits 32, and bits 64, use16, use32, use64
These directives tell the assembler to generate instructions for the CPU running in the specified mode.

When the system first boots, the CPU is in 16 bit mode. The instructions it executes at that point must be assembled with ```bits 16``` or ```use16```.
You probably won't be writing code for 16 bit mode.

A 32-bit operating system sets the CPU into 32 bit mode. The instructions it executes at that point must be assembled with ```bits 32``` or ```use32```.

This document assumes 64-bit mode, so we use ```bits 64```. In 64-bit mode, the assembler can generate either 64-bit or 32-bit instructions, whichever is appropriate.

### Comments

In a NASM source program, the semicolon (;) character introduces the start of a comment. All characters from that point on, to the end of the line, are ignored.

Note that gas supports a couple of comment styles, including ```/* */``` C-style multiline comments, or pound sign ```#``` to introduce the start of a comment.

### Constants
NASM supports constants of the form:
```
0x10 ; base 16
010h ; base 16
011100b ; base 2
```

### Program Variables and Strings
Programming is useless if you can't create variables and create and operate on strings. The assembler provides directives to reserve space for variables or to define initialized memory.

Reserving space examples:
```
resb 1 ; reserve 1 byte
resw 1 ; reserve 1 word (2 bytes)
resd 1 ; reserve 1 dword (4 bytes)
resq 1 ; reserve 1 qword (8 bytes)
resb 16 ; reserve 16 bytes
...
```

Initializing memory examples:
```
db 10 ; reserve 1 byte with the value 10 at the memory location
dw 11 ; reserve 1 word with the value 11 at the memory location
dd 10 ; reserve 1 dword with the value 10 at the memory location
dq 10 ; reserve 1 qword with the value 10 at the memory location
db 10, 11, 12 ; reserve 3 bytes with values 10, 11, and 12
...
```

You can use the memory initializer directives for strings:
```
; create a null terminated string
db 'now is the time for all good men to come to the aid of their country!', 0
; create a null terminated string with carriage return/linefeed at the end
db 'now is the time for all good men to come to the aid of their country!', 13, 10, 0
```

### Assembler Variables and Labels
A label is a type of variable, and is the first thing on a line of source code. The value of the label is the current program counter as viewed by the assembler, which is the address that line has when the program is actually running. You typically use a label to define a variable to access from assembly code or the address for jumps or subroutines.

You use the ```global``` directive to make a label visible to other .o files at link time. If you want to reference a label defined in a different .o file, you use the ```extern``` directive.
```
    extern external_message     ; defined in a different .o file

    section .text
    ...
    ; find length of message
    mov rsi, message            ; load address of message into rsi
    call length
    ; print rcx, it has the length of the string
    ...
    mov rsi, external_message
    call length
    ; print rcx, it has the length of the string
    ...
length:
    xor rcx, rcx                ; fast way to set rcx to 0
length_next:                    ; "loop" is an instruction name, so it can't be used as a label
    mov al, [rsi + rcx]         ; get character from string
    test al, al                 ; is it the terminating 0?
    je length_done
    inc rcx                     ; count it and move on to the next character
    jmp length_next
length_done:
    ; rcx has the length of the string (not counting the terminating 0)
    ret
    ...

    section .rodata
    global message
message: db 'hello, world!', 13, 10, 0 ; accessed above with: mov rsi, message
```

A Variable is a string of text that refers to any numeric value you like, with a few exceptions. A common use is to define constants/expressions, as you would use ```#define``` in "C". You use the EQU directive to specify the variable's value.

Examples:
```
ANSWER equ 42
CR equ 13
NEWLINE equ 10
STDIN equ 0
STDOUT equ 1
STDERR equ 2
```

The ```$``` character can be used in these expressions, too. It represents the current value of the program counter as the assembler sees it.

```
    section .text
    mov rax, message            ; load address of message into rax
    mov rcx, message_len

    section .rodata
message: db 'hello, world!', 13, 10 ; you can access message in an instruction
message_len equ $ - message     ; length of message string in bytes
```

If you try to use EQU twice on the same variable name, it is an error. To create a variable that you can update, use the ```%assign``` directive instead.

```
%assign count 0
%assign count count+1
```

There is a directive to assign a string to a variable, too. This is similar to the "C" ```#define``` preprocessor directive; the string is substituted in the source code when the variable is encountered.

```
%define hello 'hello, world!', 13, 10
    section .text
    mov rax, message            ; load address of message into rax
    mov rcx, message_len

    section .rodata
message: db hello
message_len equ $ - message     ; length of message string in bytes
```

You can undefine one of these variables created with ```%define``` using ```%undef```.

You can use local labels so you don't have to keep track of every label/variable you have defined to avoid collisions. A local label begins with a period. It belongs to the true (non-local) label that precedes it, so it is only valid between that true label and the next one.

```
; subroutines to return address of string in RSI
get_string1:
    mov rsi, .string
    ret
.string: db 'string1'

get_string2:
    mov rsi, .string
    ret
.string: db 'string2'
```

Creating a variable or label does not generate any code!

### Repetion
The ```times``` directive is used to repeat an initialization:

```
    section .data
stars: times 32 db '*' ; creates 32 bytes containing * at memory location "stars".
```

### Macros
A macro is similar to a subroutine, but is substituted inline and has powerful text processing/substitution facilities.

A macro is defined using the ```%macro``` and ```%endmacro``` directives. Everything between these two directives is the content of the macro, or the text to be substituted. The ```%macro``` directive requires the number of parameters to the macro.

```
; two handy macros that save me a lot of typing.
%macro pushg 0
    push rax
    push rbx
    push rcx
    push rdx
%endmacro

; note these have to be popped in the reverse order they are pushed!
%macro popg 0
    pop rdx
    pop rcx
    pop rbx
    pop rax
%endmacro

    ...
    ; short and convenient
    pushg
    ; use registers rax, rbx, rcx, rdx
    popg
```

If you want to pass arguments to your macro, you specify a non-zero number on the ```%macro``` directive. Within the macro body, you can access the parameters using ```%1```, ```%2``` and so on. Here's a macro definition that demonstrates some of the power of macros.

```
%macro print 1
    mov rsi, .message
    call print_message
    jmp .over
.message: db %1, 0 ; %1 is already a quoted string; parameters are not expanded inside quotes
.over:
%endmacro

    ...
    print "hello, world!"

```

The problem with our print macro is that it generates .message and .over local labels and you might use the macro more than once between real labels:

```
    print "hello, world!"
    print "goodbye cruel world!"
```

What happens is we have duplicate local labels and the assembler generates an error. Local labels are incredibly useful in macros, so there has to be a way, and there is. Local labels within macros are defined using the form ```%%label```. The assembler generates a unique label name each time it expands the macro. This is the working print macro:

```
%macro print 1
    mov rsi, %%message
    call print_message
    jmp %%over
%%message: db %1, 0
    align 8
%%over:
%endmacro
```

### Conditional Assembly
NASM provides ```%if```, ```%elif```, ```%else```, and ```%endif``` directives that allow for conditional assembly.

```
; a totally contrived useless example, for illustrative purposes
%assign foo 1
%if foo=1
    mov rax, 32
%else
    mov rax, 42
%endif
```

NASM also provides the ```%ifdef``` directive that works with ```%elif``` and the other conditional assembly directives. Instead of testing a condition as ```%if``` does, it tests the existence of a variable created with ```%define```.

```
; comment out the undef to enable the LINUX "do things" code
%define LINUX
%undef LINUX
%ifdef LINUX
    ; do linux things
%else
    ; do mac things
%endif
```

NASM provides the ```%ifidn``` directive, which tests whether two pieces of text are identical, and works with ```%elif``` and the other conditional assembly directives. NASM provides default defined variables that you can use to conditionally assemble using ```%ifidn```.
A particularly useful one is ```__?OUTPUT_FORMAT?__``` which you can test to determine whether to generate code for Linux or MacOS (or other):

```
%ifidn __?OUTPUT_FORMAT?__, macho64
    ; do macos stuff
%else
    ; do linux stuff
%endif
```

See: https://nasm.us/xdoc/2.15.03rc8/html/nasmdoc5.html for all the predefined variables.


### Alignment
As you are writing your code, you may want instructions or data aligned on a word, dword, qword, or other size boundaries. Typical uses are to align code on word/dword/qword boundaries. You get a performance boost by having the target of a branching instruction such as jmp, call, and so on aligned.

```
    align 8 ; align next code/data generated at next 8 byte boundary/address
    align 16 ; align next code/data at next 16 byte boundary

    db 'hello'
    align 8
my_code_is_aligned:
```

Alignment is also useful for data structure definitions so your assembly structs can match up with ones defined in C.

### Structures
You can define high-level like structures using the ```struc``` and ```endstruc``` directives (these are NASM standard macros and are written without a leading %). The ```struc``` directive takes one parameter, the name of the structure. The structure members are defined using the resb/resw/resd/resq space allocation directives. The ```alignb``` directive, the variant of align meant for reserved space, is used to align structure members on the desired boundaries.

```
struc Contact
.company: resb 1 ; true for company, false for individual
    alignb 4 ; put the dword that follows on a 4 byte boundary
.company_id: resd 1 ; identifier
.name: resb 64 ; max 64 characters for name
.address: resb 64 ; also 64 for address
.phone: resb 16 ; 16 characters for phone number
endstruc
```

Using a structure is straightforward:

```
    mov rsi, [person] ; fetch address of Contact struct into RSI
    mov al, [rsi+Contact.company]
    test al,al
    jne .company
    ; is an individual
    push rsi ; save struct address, the print macro uses rsi
    print "Person"
    mov rsi, [rsp] ; reload struct address
    lea rsi, [rsi+Contact.name] ; address of the name member
    call printit
    pop rsi
    ...
.company:
    ; is a company
    push rsi
    print "Company"
    mov rsi, [rsp]
    lea rsi, [rsi+Contact.name]
    call printit
    pop rsi
    ...
```

You use the ```istruc``` and ```iend``` directives to declare instances of structures. Each ```at``` line names a member with its full name.

```
a_company: istruc Contact
    at Contact.company, db 1
    at Contact.company_id, dd 100
    at Contact.name, db 'Engulf and Devour Corp', 0
    at Contact.address, db '1 Main Street, Anytown USA', 0
    at Contact.phone, db '1-800-devour!', 0
iend
```

### Includes
NASM provides two commonly used include directives:
```
%include "path/to/file"
incbin "path/to/file"
```

The ```%include``` directive works like the "C" ```#include``` directive - it simply reads the specified file in place and assembles it as if it were part of the current file. You can arbitrarily nest these includes, like you do in "C".

The ```incbin``` directive (no leading %) includes a raw binary, verbatim, in the output file at the current position.
You can use it, for example, to include a .gif file in your code:

```
my_gif:
    incbin '/path/to/my/picture.gif'
my_gif_size equ $-my_gif
```

# Hello, World

## MacOS Version

See hello-world/ directory for a build script and this assembly source.

```
; Use the build-macos.sh script to assemble and link this.

    bits 64

    section .text

    global start
start:
    mov rax, 0x2000004 ; write
    mov rdi, 1 ; stdout
    mov rsi, msg
    mov rdx, msg.len
    syscall

    mov rax, 0x2000001 ; exit
    mov rdi, 0
    syscall


    section .data

msg: db "Hello, world!", 10
.len: equ $ - msg
```

It works. Here's the output:

```
# ./build-macos.sh
Run it via ./hello-macos
# ./hello-macos
Hello, world!
#
```

## Linux version

Linux has different (from MacOS) syscall numbers passed in rax. The entry point for Linux programs is "_start" vs "start" on MacOS.

Otherwise, the program is the same.

```
; use the build-linux.sh script to assemble and link this
    bits 64

    section .text

    global _start
_start:
    mov rax, 1 ; write
    mov rdi, 1 ; stdout
    mov rsi, msg
    mov rdx, msg.len
    syscall

    mov rax, 60 ; exit
    mov rdi, 0
    syscall


    section .data

msg: db "Hello, world!", 10
.len: equ $ - msg
```

```
# ./build-linux.sh
Run it via ./hello-linux
# ./hello-linux
Hello, world!
#
```

## How it works

MacOS and Linux provide quite a few syscalls each, or operating system calls that we can call from any language.
There are quite a few syscalls in common between the two, but they are different flavors of Unix (Linux vs. BSD-ish/MacOS). The two flavors have several syscalls that are provided in one OS but not the other. The syscall numbers (passed in rax) are also different between the operating systems.

The C libraries contain code similar to our code above, to write strings to a file. For our purposes we use the file number for stdout to write to the console.

For most C calls that are not provided by a library or the standard C/C++ libraries, there is a syscall. For example, malloc and free are provided by libc so there is no syscall for them. However, sbrk() is not provided by the libraries and is provided as a syscall.

The syscalls take arguments in the CPU registers. RAX contains the syscall number (one for write, one for exit in the above). The first three arguments go in RDI, RSI, and RDX, as you can see in the write call above (file number, address of the string, length).

### Linux Syscalls

Linux syscalls are documented here:
https://chromium.googlesource.com/chromiumos/docs/+/master/constants/syscalls.md
The syscalls for Linux are defined in:
/usr/include/sys/syscall.h

### MacOS Syscalls

The syscalls for MacOS are defined in:
./Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/sys/syscall.h
These syscall numbers are subject to change, so you should, at least, use the defines in your syscall.h and realize that when you update your OS, you need to verify the numbers haven't changed.

Alternatively, you can programmatically scan the syscall.h file and generate assembly EQU for each syscall and always have the correct syscall numbers in your program.

If the parameters to the OS syscalls somehow change, your program will crash. It's not likely every syscall is going to have these changes, but you will need to fix your code when this does happen.
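
As a sketch of the EQU idea, the two hello world programs can share one source file if the syscall numbers are given names and selected by output format. The names `SYS_WRITE`, `SYS_EXIT`, `STDOUT`, and `msg_len` are made up for this example, the numbers are the ones used in the programs above, and the `__?OUTPUT_FORMAT?__` spelling assumes NASM 2.15 or later. I have written it to match the two listings above, but treat it as a starting point rather than a tested build.

```nasm
; Sketch only: one hello world source for both "nasm -f elf64" and "nasm -f macho64".
%ifidn __?OUTPUT_FORMAT?__, macho64
SYS_WRITE equ 0x2000004         ; MacOS numbers, as in hello-macos.asm
SYS_EXIT  equ 0x2000001
%else
SYS_WRITE equ 1                 ; Linux numbers, as in hello-linux.asm
SYS_EXIT  equ 60
%endif
STDOUT    equ 1

        bits 64
        section .text
        global start            ; entry point name used on MacOS
        global _start           ; entry point name used on Linux
start:
_start:
        mov rax, SYS_WRITE
        mov rdi, STDOUT
        mov rsi, msg
        mov rdx, msg_len
        syscall

        mov rax, SYS_EXIT
        mov rdi, 0
        syscall

        section .data
msg:    db "Hello, world!", 10
msg_len equ $ - msg
```

If the numbers ever change, only the EQU lines at the top need to be regenerated; the rest of the program refers to the syscalls by name.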
--------------------------------------------------------------------------------
/hello-world/build-linux.sh:
--------------------------------------------------------------------------------
#!/bin/sh

nasm -f elf64 -o hello-linux.o -l hello-linux.lst hello-linux.asm
ld -static -o hello-linux hello-linux.o

echo "Run it via ./hello-linux"

--------------------------------------------------------------------------------
/hello-world/build-macos-sh:
--------------------------------------------------------------------------------
#!/bin/sh

nasm -f macho64 -o hello-macos.o -l hello-macos.lst hello-macos.asm
ld -static -o hello-macos hello-macos.o

echo "Run it via ./hello-macos"

--------------------------------------------------------------------------------
/hello-world/hello:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mschwartz/assembly-tutorial/c03c88092092894f6dfb0b87693868df6f5d21b6/hello-world/hello

--------------------------------------------------------------------------------
/hello-world/hello-linux.asm:
--------------------------------------------------------------------------------
; use the build-linux.sh script to assemble and link this
        bits 64

        section .text

        global _start
_start:
        mov rax, 1              ; write
        mov rdi, 1              ; stdout
        mov rsi, msg            ; address of the bytes to write
        mov rdx, msg.len        ; number of bytes to write
        syscall

        mov rax, 60             ; exit
        mov rdi, 0              ; exit status 0
        syscall


        section .data

msg:    db "Hello, world!", 10
.len:   equ $ - msg

--------------------------------------------------------------------------------
/hello-world/hello-macos.asm:
--------------------------------------------------------------------------------
; use the build-macos-sh script to assemble and link this

        section .text

        global start
start:
        mov rax, 0x2000004      ; write
        mov rdi, 1              ; stdout
        mov rsi, msg            ; address of the bytes to write
        mov rdx, msg.len        ; number of bytes to write
        syscall

        mov rax, 0x2000001      ; exit
        mov rdi, 0              ; exit status 0
        syscall


        section .data

msg:    db "Hello, world!", 10
.len:   equ $ - msg

--------------------------------------------------------------------------------
/images/cardiac2-s.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mschwartz/assembly-tutorial/c03c88092092894f6dfb0b87693868df6f5d21b6/images/cardiac2-s.jpg

--------------------------------------------------------------------------------
/instruction-set/addressing.asm:
--------------------------------------------------------------------------------
;; NOTE that this is not a program meant to be run. It is just a way to demonstrate
;; that the instructions and addressing modes assemble without error.

        section .text
        global start
start:
        ;; register addressing mode
        mov rax, rbx

        ; immediate addressing mode (the tutorial also calls this "direct")
        ; you cannot store to a constant, so only the source may be a constant
        mov rax, 10             ; source operand is a constant

        ; indirect addressing mode
        ; one operand is a memory location whose address is held in a register
        mov rax, [rbx]
        mov [rbx], rax
        ; invalid! both operands cannot be memory references
        ; mov [rax], [rbx]


        ; indirect with displacement
        ; address = base + displacement
        ;
        ; typical use is to access structure elements (the displacement is the offset
        ; to the structure member)
        mov rax, [24 + rbx]     ; base is rbx, displacement is 24
        mov [24 + rbx], rax

        ; indirect with displacement and scaled index
        ; address = displacement + index * scale
        mov rax, [array + rbx * 4] ; displacement is array, index is rbx, scale is 4
        mov [array + rbx * 4], rax

        ; indirect with base and index registers
        ; address = base + index
        mov rax, [rbx + rcx]
        mov [rbx + rcx], rax

        ; indirect with base and scaled index
        ; address = base + index * scale
        mov rax, [rbx + rcx * 4]
        mov [rbx + rcx * 4], rax

        ; indirect with displacement, base, and scaled index
        ; address = displacement + base + index * scale
        mov rax, [24 + rbx + rcx * 4]
        mov [24 + rbx + rcx * 4], rax

        section .bss
array:  resb 8192

--------------------------------------------------------------------------------
/instruction-set/build.sh:
--------------------------------------------------------------------------------
#!/bin/sh

nasm -f elf64 -o addressing.o -l addressing.lst addressing.asm

--------------------------------------------------------------------------------