├── .gitignore
├── README.md
├── hello-world
    ├── build-linux.sh
    ├── build-macos-sh
    ├── hello
    ├── hello-linux.asm
    └── hello-macos.asm
├── images
    └── cardiac2-s.jpg
└── instruction-set
    ├── addressing.asm
    └── build.sh


/.gitignore:
--------------------------------------------------------------------------------
1 | *~
2 | *.o
3 | *.lst
4 | hello-macos
5 | hello-linux
6 | 
7 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
   1 | # Programming in assembly language tutorial
   2 | 
   3 | This tutorial covers AMD64/Intel 64 bit programming.  Instruction sets for other processors, such as ARM or RISC-V, are radically different, though the concepts are the same.  They all have instructions, registers, stacks, and so on.  Once you know one processor's assembly language, adapting to a different processor is rather easy.
   4 | 
   5 | I found that I was writing code for a new processor within hours, and writing quality code within a week or two.  This is going from Z80 to 6502 to 6809 to 8086 to 68000 and so on.  It is interesting to be able to look at a processor's technical manuals and evaluate the power and flexibility of its instruction set.
   6 | 
   7 | This tutorial is aimed at novices and beginners who want to learn the first thing about assembly language programming.  If you are an expert, you may or may not get a lot out of this.  
   8 | 
   9 | - [Programming in assembly language tutorial](#programming-in-assembly-language-tutorial)
  10 |     - [Introduction](#introduction)
  11 |     - [Bits, Bytes, Words, and Number Bases](#bits-bytes-words-and-number-bases)
  12 |     - [Math](#math)
  13 |     - [Boolean Algebra](#boolean-algebra)
  14 |     - [Bit Shifting](#bit-shifting)
  15 |     - [Memory](#memory)
  16 |     - [ELF Files and the Loader](#elf-files-and-the-loader)
  17 |     - [Permissions](#permissions-sections-and-privileged-instructions)
  18 |     - [MMU](#mmu)
  19 |         - [Paging and Swapping](#paging-and-swapping)
  20 |     - [Other exceptions](#other-exceptions)
  21 |         - [Segfault](#segfault)
  22 |         - [Divide By Zero](#divide-by-zero)
  23 |         - [Invalid Opcode](#invalid-opcode)
  24 |         - [General Protection](#general-protection)
  25 |     - [ALU](#alu)
  26 |     - [x64/AMD64 Registers](#x64amd64-registers)
  27 |         - [General Purpose Registers](#general-purpose-registers)
  28 |         - [Special Purpose Registers](#special-purpose-registers)
  29 |         - [CPU Control Registers](#cpu-control-registers)
  30 |             - [Stack](#stack)
  31 |             - [Instruction Pointer](#instruction-pointer)
  32 |             - [Flags](#flags)
  33 | - [AMD64 Instruction Set](#amd64-instruction-set)
  34 |     - [Assembly source](#assembly-source)
  35 |     - [Addressing Modes](#addressing-modes)
  36 |         - [Register Operands](#register-operands)
  37 |         - [Direct Memory Operands](#direct-memory-operands-better-known-as-immediate-operands)
  38 |             - [Indirect Operands](#indirect-operands)
  39 |             - [Indirect with Displacement](#indirect-with-displacement)
  40 |             - [Indirect with displacement and scaled index](#indirect-with-displacement-and-scaled-index)
  41 | - [Commonly Used Instructions](#commonly-used-instructions)
  42 |     - [Aritmetic](#aritmetic)
  43 |     - [Boolean Algebra](#boolean-algebra-1)
  44 |     - [Branching and Subroutines](#branching-and-subroutines)
  45 |     - [Bit Manipulation](#bit-manipulation)
  46 |     - [Register Manipulation, Casting/Conversions](#register-manipulation-castingconversions)
  47 |     - [Flags Manipulation](#flags-manipulation)
  48 |     - [Stack Manipulation](#stack-manipulation)
  49 | - [Assembler Source, Directives,  and Macros](#assembler-source-directives--and-macros)
  50 |     - [Assembler Directives](#assembler-directives)
  51 |         - [section type](#section-type-options)
  52 |         - [bits 16, bits 32, and bits 64, use16, use32, use64](#bits-16-bits-32-and-bits-64-use16-use32-use64)
  53 |         - [Comments](#comments)
  54 |         - [Constants](#constants)
  55 |         - [Program Variables and Strings](#program-variables-and-strings)
  56 |         - [Assembler Variables and Labels](#assembler-variables-and-labels)
  57 |         - [Repetion](#repetion)
  58 |         - [Macros](#macros)
  59 |         - [Conditional Assembly](#conditional-assembly)
  60 |         - [Alignment](#alignment)
  61 |         - [Structures](#structures)
  62 |         - [Includes](#includes)
  63 | - [Hello, World](#hello-world)
  64 |     - [MacOS Version](#macos-version)
  65 |     - [Linux version](#linux-version)
  66 |     - [How it works](#how-it-works)
  67 |         - [Linux Syscalls](#linux-syscalls)
  68 |         - [MacOS Syscalls](#macos-syscalls)
  69 | 
  70 | ## Introduction
  71 | 
  72 | How CPUs work has become something of a lost art.  Only a small percentage of software engineers need to understand the inner workings of CPUs, typically those who work on embedded software, operating systems, compilers, or JIT compilers.
  73 | 
  74 | Assembly language was one of the first languages I ever learned.  Back in the early/mid 1970s, my high school classes progressed from BASIC to FORTRAN IV, to BAL (Basic Assembly Language) for the IBM 360 to which we had access.  One of the earliest lessons we were taught used a cardboard teaching aid, CARDIAC.  CARDIAC stands for "CARDboard Illustrative Aid to Computation"; it was developed at Bell Labs, which was a big deal back then (Unix was invented there, as well as the C programming language).
  75 | 
  76 | See https://www.cs.drexel.edu/~bls96/museum/cardiac.html.
  77 | 
  78 | With CARDIAC, you simulated the memory, operation, and CPU cycles of a mythical CPU.  The numbers and instructions for this CPU were in base 10, so the student doesn't have to understand how to convert to the base 2, base 8, or base 16 commonly used in computing.  CARDIAC provided a cardboard device that had representations for memory, program steps, and the ALU (math and logic operations).
  79 | 
  80 | You wrote your program and variables on the cardboard and then, step by step, followed the program and performed the operations for each step.  Each kind of step (instruction) is identified by a single digit, 0-9:
  81 | 
  82 | - 0 INP read a card into memory
  83 | - 1 CLA clear accumulator and add from memory
  84 | - 2 ADD add from memory to accumulator
  85 | - 3 TAC test accumulator and jump if negative
  86 | - 4 SFT shift accumulator
  87 | - 5 OUT write memory location to output card
  88 | - 6 STO store accumulator to memory
  89 | - 7 SUB subtract memory from accumulator
  90 | - 8 JMP jump and save PC
  91 | - 9 HRS halt and reset
  92 | 
  93 | These values are "opcodes" and the encoded instructions/steps include the opcode plus address, number of bits to shift, etc.
  94 | 
  95 | The CPU features only two registers:  accumulator and program counter.  More complex and modern CPUs have many more registers than these two.
  96 | 
  97 | These instructions and registers are enough to learn from.  You learn about memory layout, instruction opcodes, instruction encoding, memory access, and so on.
  98 | 
  99 | In this tutorial, I will cover the basics of programming the x64/AMD64 CPU in assembly language.  As I progress, you will see how the CPU is really a glorified version of CARDIAC!
 100 | 
 101 | ## Bits, Bytes, Words, and Number Bases
 102 | 
 103 | The smallest piece of information that a CPU processes is a "bit."  A bit is a small integer or boolean type value, either 0 (off/false) or 1 (on/true).
 104 | 
 105 | Bits are then organized as "bytes", or 8 bits grouped together.  You can visualize a byte like this:
 106 | 
 107 | ```
 108 | 76543210
 109 | ```
 110 | 
 111 | The digits represent what we call a bit number, and each digit (bits 0-7) may be a 0 or a 1.  A byte can represent an unsigned value of 0 to 255, or a signed value of -128 to 127.  Bit 7 of the byte is considered the "sign bit": if it is 1, then the byte as a signed value is negative; if it is 0, the value is zero or positive.  Note that you decide whether the byte is processed as signed or unsigned; more on this later, but for now it is important to understand how bits make up bytes and how signed/unsigned values are represented.
 112 | 
 113 | A "word" is two bytes grouped together, which means we have 16 bits together.  You can visualize a word like this:
 114 | ```
 115 | 1111110000000000
 116 | 5432109876543210
 117 | ```
 118 | 
 119 | The high order, sign bit, is bit 15.
 120 | 
 121 | The x86 also has DWORD values, which are two words combined.  It also has QWORD values which are two DWORDs combined.  The pattern is the same for any of these size values - the high bit is the sign bit, etc.  
 122 | 
 123 | From this point forward, I'll use "word" to mean one of these sized values, unless otherwise stated.
 124 | 
 125 | When we talk about the value of a word, we typically use base 2, base 8, base 10, or base 16.  Of these, base 8 isn't used much, but I'll explain a common use case for it.
 126 | 
 127 | In base 2 (also called "binary"), we simply talk about the value as the bits.  That is, an unsigned byte might be 11111111, or 11101110, and so on.  We might add a lead 0 and terminating b for clarity (and this is the syntax used in assembly programming): 011111111b.
 128 | 
 129 | Base 10 is the number base we use every day.  You count from 0 to 9 for each digit position in base 10.  When you add 1 to the value 9, you clear it (set to 0), and bump the 10s digit.  That is, 9+1 becomes 10.  As you go right to left in base 10, each digit n is worth n x 1 (10 to the power of 0), n x 10 (10 to the power of 1), n x 100 (10 to the power of 2), and so on.
 130 | 
 131 | In base 2, we count from 0 to 1 for each digit position.  When you add 1 to a 1 in a position in the byte, you clear it and increment the next higher bit (and continue until you find an existing 0 in position, which becomes 1).  As you go right to left in base 2, each digit n is worth n x 1 (2 to the power of 0), n x 2 (2 to the power of 1), n x 4 (2 to the power of 2), and so on.
 132 | 
 133 | In base 8 (also called "octal"), we count from 0-7 for each digit position.  Going right to left, n x 8 to the power of 0, n x 8 to the power of 1, n x 8 to the power of 2, etc.
 134 | 
 135 | In base 16 (also called "hex"), we count from 0-15 for each digit position.  We use a counting system that is 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, then 10 (A through F stand for 10 through 15).  So going from right to left in a hex number, the digits are n x 16 to the power of 0, n x 16 to the power of 1, n x 16 to the power of 2, and so on.
 136 | 
 137 | A "nybble" is useful for working with hex.  A nybble is 4 bits, and the value you can store in 4 bits is 0-15, which is exactly one hex digit.  So a byte is two hex digits, a 16-bit word is four hex digits, and so on.
 138 | 
 139 | Let's look at the unsigned value ranges for the common word sizes:
 140 | ```
 141 | 1 bit: 0-1
 142 | 2 bits: 0-3
 143 | 3 bits: 0-7
 144 | 4 bits: 0-15
 145 | 5 bits: 0-31
 146 | ...
 147 | ```
 148 | 
 149 | 
 150 | The pattern here is that the max value is 2 to the power of the number of bits, minus 1.  That is, for 5 bits, the max value 31 is 2 to the 5th power (32) minus 1.
 151 | 
 152 | When we convert a binary byte to hex, we visualize it something like this:
 153 | 
 154 | ```
 155 | 76543210 is 7654 3210
 156 | ```
 157 | We've grouped the bits as two nybbles.  We can then convert the two nybbles (4 bits each) to two hex digits.
 158 | 
 159 | This table makes the conversion simple.  But if you practice using hex, you will know this table by heart.
 160 | 
 161 | ```
 162 | 0000 | 0
 163 | 0001 | 1
 164 | 0010 | 2
 165 | 0011 | 3
 166 | 0100 | 4
 167 | 0101 | 5
 168 | 0110 | 6
 169 | 0111 | 7
 170 | 1000 | 8
 171 | 1001 | 9
 172 | 1010 | A
 173 | 1011 | B
 174 | 1100 | C
 175 | 1101 | D
 176 | 1110 | E
 177 | 1111 | F
 178 | ```
 179 | 
 180 | For example, we visualize the binary value 010100101b as 1010 0101.  Using the table above, we see 1010 is A, and 0101 is 5.  So the byte value is A5.  We represent hex numbers in assembly as 0xa5, or 0a5h, or sometimes $a5.
 181 | 
 182 | We can use the same scheme to convert 16 bit or 32 bit or 64 bit values to hex!
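
Here is a minimal sketch in C of the nybble-by-nybble conversion described above.  The function names `byte_to_hex` and `byte_to_binary` are made up for this illustration; they are not from any library.

```c
#include <stdint.h>

/* Hex digit for each 4-bit nybble value 0-15 (same as the table above). */
static const char HEX_DIGITS[] = "0123456789ABCDEF";

/* Write the two hex digits of `value` into out[0..1], plus a terminating NUL. */
void byte_to_hex(uint8_t value, char out[3])
{
    out[0] = HEX_DIGITS[value >> 4];   /* high nybble: bits 7-4 */
    out[1] = HEX_DIGITS[value & 0x0F]; /* low nybble: bits 3-0 */
    out[2] = '\0';
}

/* Write the 8 bits of `value`, bit 7 first, into out[0..7], plus a NUL. */
void byte_to_binary(uint8_t value, char out[9])
{
    for (int bit = 7; bit >= 0; bit--) {
        out[7 - bit] = ((value >> bit) & 1) ? '1' : '0';
    }
    out[8] = '\0';
}
```

For a wider word you would repeat the same two steps (shift right by 4, mask with 0x0F) once per nybble.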
 183 | 
 184 | I promised to discuss a use for Octal, something we might use every day.  In the linux/mac/*nix filesystem, permissions are actually octal values.
 185 | ```
 186 | -rw-r--r--  1 mschwartz  staff   5.9K Feb 16 14:13 README.md
 187 | ```
 188 | See the -rw-r--r-- ?  What we have here is 9 bits, written as three octal digits.  rw- is 110 (6), r-- is 100 (4), r-- is 100 (4).  So we can convert this to the internal filesystem representation of 644.  If you want to make a file rw-r--r--, you use the chmod command:
 189 | ```
 190 | chmod 644 README.md
 191 | ```
 192 | The three bits, technically, are "able to read", "able to write", and "able to execute."  The first octal value is for the owner, the second is for anyone in the same user group as the owner, and the third is for everyone else.  So to allow the owner and group to read and write, while nobody else can read or write the file, we want rw-rw---- or 660.  To set a file to be executable, I typically use ```chmod 755``` (rwxr-xr-x).
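
To make the bit layout concrete, here is a small C sketch that turns a 9-bit permission value into the rwx string that ls prints.  `mode_to_string` is a name I made up for this example; it only looks at the low 9 bits and ignores everything else a real file mode carries.

```c
/* Fill `out` with the 9-character rwx string for the low 9 bits of `mode`. */
void mode_to_string(unsigned mode, char out[10])
{
    static const char letters[] = "rwxrwxrwx";

    for (int i = 0; i < 9; i++) {
        /* Bit 8 is the owner's read bit; bit 0 is execute for everyone else. */
        unsigned bit = 1u << (8 - i);
        out[i] = (mode & bit) ? letters[i] : '-';
    }
    out[9] = '\0';
}
```

In C, a numeric constant with a leading 0 is octal, so `mode_to_string(0644, buf)` uses the same digits you would pass to chmod.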
 193 | 
 194 | ## Math
 195 | 
 196 | Adding two values of the same word size is simple.  The byte 100 plus the byte 50 is the byte 150.
 197 | 
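 198 | This works for signed and unsigned values.  The CPU adds the bits the same way in both cases; whether you treat the result as signed or unsigned is up to you.  If the high order bit (bit 7 of a byte, bit 15 of a 16-bit word...) is 1, the signed value is negative.  For example, the byte 150 (0x96) read as a signed value is -106.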
 199 | 
 200 | What happens when we add a byte value to a 16-bit word value?  The byte value is really a 16-bit value, but the upper 8 bits are zeros.  That is, 0xaa can be visualized as 0x00aa.  We just add the full 16-bit values together.  (For a signed byte, the upper 8 bits are copies of the sign bit instead, so -1 = 0xff becomes 0xffff.)
 201 | 
 202 | What happens when we add 1 to a byte size value of 255?  We only have 8 bits for the result, but we have 9 bits of actual value.  That is, 255 + 1 is 256.  Represented in binary, you have 255 = 011111111b + 1 = 0100000000b (9 bits!).  The 9th bit is basically ignored as far as the result byte goes (more on this later).  So if you look at the lower 8 bits of our 9 bit result, we get 0!
 203 | 
 204 | All this extends to 32 bit and 64 bit words.
 205 | 
 206 | Multiplication of two values requires a double-sized result, or you lose a lot more than just the 9th bit!  Consider 255 x 255 = 65025 (0xfe01), which fits in 16 bits but not in 8.  If we have a byte result, we get 0x01 due to the overflow, losing over 65000 in result value.
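
The wraparound is easy to try out in C using the fixed-width types from `<stdint.h>`.  This is only a sketch; the helper names are made up for illustration.

```c
#include <stdint.h>

/* Add two bytes and keep only the low 8 bits of the result. */
uint8_t add_bytes(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Interpret the same 8 bits as a signed value (-128 to 127). */
int as_signed(uint8_t v)
{
    return v >= 128 ? (int)v - 256 : (int)v;
}

/* Multiply two bytes into a double-sized (16-bit) result: nothing is lost. */
uint16_t mul_bytes_wide(uint8_t a, uint8_t b)
{
    return (uint16_t)(a * b);
}

/* Multiply two bytes but keep only the low 8 bits of the result. */
uint8_t mul_bytes_narrow(uint8_t a, uint8_t b)
{
    return (uint8_t)(a * b);
}
```

With these, `add_bytes(255, 1)` is 0, `mul_bytes_wide(255, 255)` is 0xfe01, and `mul_bytes_narrow(255, 255)` is 0x01, matching the numbers above.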
 207 | 
 208 | ## Boolean Algebra
 209 | 
 210 | Boolean Algebra is a form of math that we use to deal with true/false values.  We use Boolean Algebra all the time in various programming languages, with operators like & (AND), | (OR), ^ (exclusive OR, or XOR), and ! (NOT), ~ (also NOT) and so on.  These operators are equivalent to "math-like" operators.
 211 | 
 212 | The simplest way to visualize Boolean Algebra is using single bit values and truth tables.  0 = false, 1 = true.  For single bit value operands, there are only (always) 4 combinations possible.
 213 | 
 214 | ```
 215 | AND (if both operands are true, the result is true)
 216 | 0 & 0 = 0
 217 | 0 & 1 = 0
 218 | 1 & 0 = 0
 219 | 1 & 1 = 1
 220 | 
 221 | OR (if either operand is true, the result is true)
 222 | 0 | 0 = 0
 223 | 0 | 1 = 1
 224 | 1 | 0 = 1
 225 | 1 | 1 = 1
 226 | 
 227 | XOR (if only one operand is true, the result is true)
 228 | 0 ^ 0 = 0
 229 | 0 ^ 1 = 1
 230 | 1 ^ 0 = 1
 231 | 1 ^ 1 = 0
 232 | ```
 233 | 
 234 | The NOT operator has only one operand.  If the operand is true, the result is false.  If the operand is false, the result is true.  In C, ! is the logical NOT: it treats the whole value as true (non-zero) or false (zero) and yields 0 or 1.
 235 | 
 236 | The ~ operator is the bitwise NOT, also known as the 1's complement: it inverts the state of every bit in the word.
 237 | 
 238 | If we look at the operands as byte values, we have something like:
 239 | ```
 240 | 00000000 & 00000000 = 0
 241 | 00000000 & 00000001 = 0
 242 | ...
 243 | ```
 244 | BUT, we have 8 bits, so the operation is performed on all 8 bits in the two operands.
 245 | ```
 246 |    10000000 
 247 | OR 00000001 
 248 |    --------
 249 |    ^      ^
 250 | =  10000001
 251 |    ^      ^
 252 |    
 253 | NOT 10000001
 254 | =   01111110
 255 | ```
 256 | This is a most important concept to grasp!
 257 | 
 258 | We use the Boolean Algebra operators on words to achieve useful results.  
 259 | 
 260 | A typical use of the AND operator is to clear bits in a value.  If we AND with a value that is the inverse of a power of 2, we are simply clearing a bit.  n AND ~4 clears bit 2 in n (4 is 0100b, and bits are numbered from 0).
 261 | 
 262 | A typical use of the OR operator is to set bits in a value.  If we OR with a value that is a power of 2, we are simply setting a bit.  n OR 4 sets bit 2 in n.
 263 | 
 264 | A great use of the AND operator is to do a modulo of an unsigned number by a power of 2: AND with the power of 2 minus 1.  For example, AND with 3 is modulo 4 and gets you a result between 0 and 3.  AND with 7 is modulo 8 and gets you a result between 0 and 7.
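
These three uses look like this in C.  It is a sketch with made-up helper names, working on bytes; the same expressions work for any word size.

```c
#include <assert.h>
#include <stdint.h>

/* Set every bit of n that is 1 in mask. */
uint8_t set_mask(uint8_t n, uint8_t mask)
{
    return n | mask;
}

/* Clear every bit of n that is 1 in mask (AND with the inverted mask). */
uint8_t clear_mask(uint8_t n, uint8_t mask)
{
    return n & (uint8_t)~mask;
}

/* Invert all 8 bits (1's complement). */
uint8_t invert_bits(uint8_t n)
{
    return (uint8_t)~n;
}

/* n modulo pow2, where pow2 must be a power of 2 (1, 2, 4, 8, ...). */
uint8_t mod_pow2(uint8_t n, uint8_t pow2)
{
    assert(pow2 != 0 && (pow2 & (pow2 - 1)) == 0);
    return n & (uint8_t)(pow2 - 1);
}
```

For example, `set_mask(0x80, 0x01)` gives 0x81 and `invert_bits(0x81)` gives 0x7e, the same results as the OR and NOT diagram above.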
 265 | 
 266 | ## Bit Shifting
 267 | 
 268 | You can shift the bits of a byte to the left (<< operator in C) by 1-7 bits.  For example:
 269 | 
 270 | ```
 271 | 001111101b << 1 = 011111010b
 272 | 
 273 |  001111101b  shifted left becomes
 274 |  ////////
 275 | x011111010b  (bit 0 becomes 0, bit 1 becomes 1, bit 2 becomes 0)
 276 | ```
 277 | Note that we have the overflow problem here, as we did with addition.  We have an upper bit that ends up in the "bit bucket" (thrown away).
 278 | 
 279 | A left shift of 1 bit is effectively a multiplication by 2.  Consider 001b<<1 is 010b, or 2.  A left shift of 2 bits is a multiply by 4, and so on.
 280 | 
 281 | Shifting to the right works similarly, but we now end up with the high bit being cleared and the low bit in the bit bucket. 
 282 | 
 283 | A right shift of 1 bit is effectively a divide by 2. But this right shift will take a negative number and make it positive because the sign bit is cleared.  So we need a second kind of right shift (arithmetic shift right) for signed values that sets the high bit in the result to the high bit in the initial value.
 284 | 
 285 | A rotation left/right is the same as a shift, except instead of the lost bit ending up in the bit bucket, it becomes the new high/low bit.
 286 | 
 287 | Other than for the multiply and divide effects, we use bit shifting frequently with Boolean Algebra.
 288 | 
 289 | ```
 290 | To set bit 3:
 291 | 
 292 | n | (1<<3)
 293 | 
 294 | To clear bit 3:
 295 | 
 296 | n & ~(1<<3)
 297 | 
 298 | Note that 1<<3 = 000001000b, 
 299 | and ~(1<<3) is  ~000001000b 
 300 |               or 011110111b.   (all the bits are inverted)
 301 | When you AND with 011110111b, you are clearing bit 3.
 302 | ```
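
The different kinds of shift can be sketched in C on byte values.  The function names below are made up for this example.  The arithmetic shift is written out by hand one bit at a time so that it does not depend on how a particular compiler treats >> on negative numbers.

```c
#include <stdint.h>

/* Shift left: the high bits fall into the bit bucket, zeros come in at bit 0. */
uint8_t shift_left(uint8_t v, unsigned count)
{
    return (uint8_t)(v << count);
}

/* Logical shift right: zeros come in at bit 7, so the sign bit is cleared. */
uint8_t shift_right_logical(uint8_t v, unsigned count)
{
    return (uint8_t)(v >> count);
}

/* Arithmetic shift right: bit 7 (the sign bit) is copied back in each step. */
uint8_t shift_right_arithmetic(uint8_t v, unsigned count)
{
    uint8_t sign = v & 0x80;

    for (unsigned i = 0; i < count; i++) {
        v = (uint8_t)((v >> 1) | sign);
    }
    return v;
}

/* Rotate left: bits that fall off the top come back in at the bottom. */
uint8_t rotate_left(uint8_t v, unsigned count)
{
    count &= 7; /* rotating a byte by 8 is the same as rotating by 0 */
    return (uint8_t)((v << count) | (v >> ((8 - count) & 7)));
}
```

Take 0xf0, which is -16 as a signed byte: the logical shift right by 1 gives 0x78 (120), while the arithmetic shift gives 0xf8 (-8), which is the divide by 2 we wanted.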
 303 | 
 304 | ## Memory
 305 | 
 306 | Memory (RAM) can be viewed as an array of bytes.  If you have 1MB of RAM, your array is indexed from 0 to 1MB-1.  The index is better known as an address.
 307 | 
 308 | Memory is used to store your program, for your program stack, for your program's heap (memory allocation) and to store your variables.  In a simple CPU and RAM setup, you might have your program start at index 0, your variables start at the end of the program, your heap starts at the end of your variables, and your stack starts at the top of memory and works its way downward as you push onto it.
 309 | 
 310 | ```
 311 | HIGH memory address
 312 | +--------------+
 313 | |              |
 314 | | stack        |
 315 | | grows down   |
 316 | | address 1M   |
 317 | |              |
 318 | +--------------+
 319 | |              |
 320 | | heap         |
 321 | | grows up     |
 322 | |              |
 323 | +--------------+
 324 | |              |
 325 | | uninitialized|
 326 | | global       |
 327 | | variables    |
 328 | |              |
 329 | +--------------+
 330 | |              |
 331 | | initialized  |
 332 | | global       |
 333 | | variables    |
 334 | |              |
 335 | +--------------+
 336 | |              |
 337 | | code         |
 338 | | address 0    |
 339 | |              |
 340 | +--------------+
 341 | LOW memory address
 342 | ```
 343 | 
 344 | ## ELF Files and the Loader
 345 | On Linux, the compiler/assembler/linker generate ELF formatted files (MacOS uses the Mach-O format, which is organized along similar lines).  An ELF file is divided into various sections.  The more common sections are ```.text``` (code), ```.data``` (initialized data), ```.rodata``` (read only data, i.e. constants), ```.bss``` (uninitialized data), and assorted debugging info sections.
 346 | 
 347 | The operating system program loader reads in the ELF file and allocates memory for the .text section and loads that data from the file into that memory.  
 348 | 
 349 | Then the loader allocates memory for the initialized data (.data) and reads that data from the file into that memory.  
 350 | 
 351 | Then the loader allocates memory for the constant data (.rodata) and reads that data from the file into that memory.  
 352 | 
 353 | The loader allocates memory for the .bss section.  Since the .bss section is uninitialized, nothing needs to be read from the file; the memory only needs to be allocated (the OS zero-fills it).
 354 | 
 355 | The linker reads in intermediate object files (```.o```) and links them together to make the final executable.  Each .o file may define variables that are accessed from other .o files, and may access variables that are defined in some other .o file.  The linker fixes up the addresses in the final output so the code works as expected!
 356 | 
 357 | ### Permissions (Sections and Privileged Instructions)
 358 | The compiler/assembler/linker generally marks the code as readable and executable, but not writable.  If you try to store to those addresses, you will get a segfault.
 359 | 
 360 | The .data and .bss sections are marked as read/write and the .rodata is marked as read-only.
 361 | 
 362 | The way words of the different sizes are stored in memory is determined by the "endianness" of the CPU.  A CPU that is big endian stores the high byte first (at the lowest address) in memory, the next highest byte next, ... and finally the lowest byte last.  A CPU that is little endian stores the low byte first, ... the high byte last.  AMD64/x64 is little endian.
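
You can see the byte order of whatever machine you are on with a few lines of C.  This is a sketch with made-up function names; it copies a 32-bit value into a byte array exactly as it sits in memory.

```c
#include <stdint.h>
#include <string.h>

/* Copy the 4 bytes of `value` in the order this machine stores them. */
void bytes_in_memory(uint32_t value, uint8_t out[4])
{
    memcpy(out, &value, 4);
}

/* Returns 1 if the low byte is stored first (little endian), else 0. */
int is_little_endian(void)
{
    uint8_t bytes[4];

    bytes_in_memory(1, bytes);
    return bytes[0] == 1;
}
```

On a little endian machine, the value 0x01020304 is stored as the bytes 04 03 02 01; on a big endian machine it is stored as 01 02 03 04.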
 363 | 
 364 | The CPU has special features that enforce these permissions.  If you try to defeat the permissions, a segfault exception is thrown.  The operating system sets up these features when the program is started.  On a violation, it kills the program and potentially generates a core dump file of the program.  The core dump file can be used later to do forensic debugging/analysis of the failure.
 365 | 
 366 | ### MMU
 367 | 
 368 | In modern operating systems, the CPU uses an MMU (Memory Management Unit) to assign a subset of the system's memory to each program that you run.  The MMU translates each logical (virtual) address that the program sees and uses into an address in physical memory.  This allows, for example, a CPU to split the 1MB of RAM into 2x 512K address spaces to run two programs.  The address translation makes it so each program thinks it has 512K of RAM starting at address 0 and ending at address 512K - 1.
 369 | 
 370 | The use of the MMU is much more clever than I just explained, but the end result is the same.  When a program is launched, it is allocated a small amount of RAM, enough for the program's code and variables and stack and a minimal heap.  As the program needs more stack or more heap, the OS adds physical memory to the program's address space using the MMU.  The program grows on demand.
 371 | 
 372 | For our purposes, we can assume we're the only program running on the machine.  It matters not if there's an OS using the MMU or not, the programming effort and techniques are the same either way.
 373 | 
 374 | #### Paging and Swapping
 375 | 
 376 | The operating system only needs to set up the MMU for enough physical memory for the program to execute.  Memory is allocated for the MMU in 4096 byte chunks (pages); this is required by the MMU implementation (hardware).
 377 | 
 378 | This scheme is quite efficient, as a small assembly program might only need a couple of megabytes of RAM (2MB for stack is default in the OS!), and your computer might have 16 Gigabytes of RAM.  This efficient allocation of the CPU's memory allows you to load and run many programs at the same time.
 379 | 
 380 | When your program tries to access an address in memory that isn't mapped by the OS using the MMU, a page fault exception is raised.  The OS sees this and might map in an additional page so that the access can succeed.  
 381 | 
 382 | If the system is out of memory, the OS might compress programs and/or their data to make more RAM available.  The OS has to decompress this memory when it's those programs' turn to execute, though.  MacOS does this compression, and it's very clever.
 383 | 
 384 | Another thing the OS can do when there is an out of memory (OOM) condition is to "page" one or more 4096 byte pages from memory to the system's swap file/partition.  This frees up enough pages to use to handle the page fault.  When a program that has memory paged to disk is scheduled to run (use the CPU), the code might cause further page faults to read back in the paged memory.  It's possible the program never accesses that memory, and that's perfectly fine.
 385 | 
 386 | Yet another thing the OS can do is to swap out entire programs (and their data) to the swap file/partition.  When those programs get to run, they have to be entirely read back into memory (and the MMU set up), perhaps after swapping another program out to disk.  When the system is tight on free memory and is swapping heavily, it will become very unresponsive!
 387 | 
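 388 | Finally, if the OS cannot resolve the OOM condition with one of those (or potentially other clever) strategies, it picks a running program and kills it (Linux's OOM killer, for example, chooses its victim using a per-process score rather than purely at random).  This seems evil, but what else can it do?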
 389 | 
 390 | The stack grows down from high memory. If the stack overflows (grows below the memory allocated for it), a page fault occurs and the OS can add additional pages to the memory map so the stack has more room.
 391 | 
 392 | The heap initially has a small but reasonable amount of RAM allocated.  On Linux, it can be expanded using the ```brk``` syscall, which the C library wraps as sbrk().  The malloc() function in C has traditionally used this to grow the heap (many implementations also use mmap for large allocations), though sbrk() can be called directly if you know what you're doing.
 393 | 
 394 | ### Other exceptions
 395 | 
 396 | #### Segfault
 397 | It should be noted that a program might just randomly access some address that is truly outside the bounds of the program's memory map.  Paging or swapping is not performed in this case.  The MMU is set up so these addresses are simply not mapped into the program's memory map.  The CPU/MMU still raises a page fault, but the OS finds no valid mapping to bring in, so instead of fixing up the access it reports a segfault (the SIGSEGV signal on Linux and MacOS) to the program.
 398 | 
 399 | This is a hard program crash, and the operating system will terminate the program.
 400 | 
 401 | #### Divide By Zero
 402 | If your program attempts to divide by zero, this exception is raised and the program is terminated.
 403 | 
 404 | #### Invalid Opcode
 405 | If your program somehow executes instructions that are not valid x64/amd64 instructions, this exception is raised and the program is terminated.  This will occur, for example, if you push a random number on the stack and then return.  Your program starts executing at that random address and who knows what data are there?  If the random number/return causes the program to execute outside its address space, you get a Segfault instead.
 406 | 
 407 | #### General Protection 
 408 | If your program attempts to execute a privileged instruction, this exception is raised and the program is terminated.  There are quite a few privileged instructions, such as CLI/STI (disable/enable interrupts).  An OS should not allow programs to disable interrupts, or your multitasking stops working!
 409 | 
 410 | ## ALU
 411 | 
 412 | The cost of having circuitry to add two arbitrary memory locations together is prohibitive.  You have 1M x 1M add circuits required, and that's just for addition!  
 413 | 
 414 | The math (add) capability is, instead, implemented in the ALU (Arithmetic-Logic Unit) of the CPU.  The CPU provides some (small) number of general purpose "registers" and the ALU implements the add circuitry just between those registers.  
 415 | 
 416 | You can think of a register as a (temporary) variable that is on chip, usable by the ALU to do math and logic operations.  You have to load your operand or operands into registers to perform math, then you can store the result to a variable in memory.
 417 | 
 418 | For example, to add two numbers at memory locations (addresses) 0x100 and 0x200 and store the result at address 0x300, and we have two registers named a and b:
 419 | ```
 420 |   load value at 0x100 into a
 421 |   load value at 0x200 into b
 422 |   add a and b, leaving result in a
 423 |   store a at 0x300
 424 | ```
 425 | 
 426 | I have just introduced something like a snippet of assembly language code!  We need operations to be able to load memory into registers, add registers together, and store registers to memory.  Each of these operations is a CPU "opcode."  The CPU reads the one-byte opcode from memory and executes it.  Some opcodes, like the load and store ones, require parameters like the address to load from or store to.  These addresses are stored in the program immediately following the opcode.  As we progress, we're going to see that the instruction sizes (opcode plus parameters) are different depending on the instruction (opcode) and parameters.
 427 | 
 428 | In the simplest view of the CPU, the above program is 4 instructions.  The load and store instructions use 1 byte for opcode and 2 more for the addresses.  The add uses just the one byte for the opcode (add b to a).
 429 | 
 430 | Each instruction uses 1 or more "clock cycles," depending on the complexity of the operation.  In this simple model, the load instruction requires 1 clock cycle to load the opcode, 2 more to load the two address bytes (1 each), and another 2 to load the value from RAM at the address specified in the parameters, for 5 total clock cycles.  The add instruction takes just 1 clock cycle.  The store takes 5 as well.
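
To tie the pieces together, here is a sketch in C of a toy CPU that runs the four-instruction program above and counts clock cycles using the simple costs just described.  Everything about it is made up for illustration: the opcode numbers, the `ToyCpu` struct, storing the address low byte first, and the extra halt opcode (0) that I added so the loop knows where to stop.

```c
#include <stdint.h>

/* Hypothetical opcode numbers for the toy CPU. */
enum { OP_HALT = 0, OP_LOAD_A = 1, OP_LOAD_B = 2, OP_ADD = 3, OP_STORE_A = 4 };

typedef struct {
    uint8_t  a, b;    /* the two registers */
    uint16_t pc;      /* address of the next instruction */
    unsigned cycles;  /* clock cycles used so far */
} ToyCpu;

/* Read the 2-byte address parameter that follows an opcode (low byte first). */
static uint16_t fetch_address(ToyCpu *cpu, const uint8_t *mem)
{
    uint16_t addr = (uint16_t)(mem[cpu->pc] | (mem[cpu->pc + 1] << 8));
    cpu->pc += 2;
    return addr;
}

/* Fetch and execute instructions until a halt (or unknown) opcode is read. */
void toy_run(ToyCpu *cpu, uint8_t *mem)
{
    for (;;) {
        uint8_t opcode = mem[cpu->pc++];

        switch (opcode) {
        case OP_LOAD_A:
            cpu->a = mem[fetch_address(cpu, mem)];
            cpu->cycles += 5;
            break;
        case OP_LOAD_B:
            cpu->b = mem[fetch_address(cpu, mem)];
            cpu->cycles += 5;
            break;
        case OP_ADD:
            cpu->a = (uint8_t)(cpu->a + cpu->b);
            cpu->cycles += 1;
            break;
        case OP_STORE_A:
            mem[fetch_address(cpu, mem)] = cpu->a;
            cpu->cycles += 5;
            break;
        default: /* OP_HALT or anything we don't recognize */
            return;
        }
    }
}
```

With the program placed at address 0 (load a from 0x100, load b from 0x200, add, store a at 0x300, halt), the four instructions occupy 3 + 3 + 1 + 3 = 10 bytes and take 5 + 5 + 1 + 5 = 16 clock cycles in this model.  Note that `pc` plays the same role as CARDIAC's program counter.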
 431 | 
 432 | ## x64/AMD64 Registers
 433 | 
 434 | For all intents and purposes, the Intel and AMD processors have the same registers until you get into exotic features (like hardware video decoding).  I use the terms x64 and AMD64 interchangeably throughout this tutorial.
 435 | 
 436 | ### General Purpose Registers
 437 | 
 438 | You have 4 general purpose registers, A, B, C, and D, though we don't use these specific names for the registers.  The size of the register/contents matters.  So for a byte value, we use AL or AH, or BL/BH, or CL/CH, or DL/DH.  The L means "low order byte" and H means "high order byte."  For word values, we use AX, BX, CX, and DX.  For 32 bit word values, we use EAX, EBX, ECX, and EDX.  And for 64 bit word values, we use RAX, RBX, RCX, and RDX.
 439 | 
 440 | When we use the 8 and 16 bit registers, the remaining bits in the register are not affected.  For example, if AX contains 0x0102 and we load 0x03 into AL, AX will contain 0x0103.  The 32 bit registers are the exception: writing to EAX (or any other 32 bit register) clears the upper 32 bits of the 64 bit register.  This will only matter if you load bytes into registers and add word registers together, in error.  There might be tricks you play to take advantage of the nature of the register loads/stores.
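
The way the smaller registers overlay the 64 bit register can be modeled in C with masks.  This is only a model with made-up function names, not how you access registers in practice.  It follows the documented x64/AMD64 behavior: 8 and 16 bit writes leave the other bits alone, while a 32 bit write zeroes the upper 32 bits.

```c
#include <stdint.h>

/* Write AL: replace bits 0-7, keep everything else. */
uint64_t write_al(uint64_t rax, uint8_t v)
{
    return (rax & ~0xFFull) | v;
}

/* Write AH: replace bits 8-15, keep everything else. */
uint64_t write_ah(uint64_t rax, uint8_t v)
{
    return (rax & ~0xFF00ull) | ((uint64_t)v << 8);
}

/* Write AX: replace bits 0-15, keep everything else. */
uint64_t write_ax(uint64_t rax, uint16_t v)
{
    return (rax & ~0xFFFFull) | v;
}

/* Write EAX: the old contents are discarded and bits 32-63 become zero. */
uint64_t write_eax(uint64_t rax, uint32_t v)
{
    (void)rax;
    return (uint64_t)v;
}
```

So `write_al(0x0102, 0x03)` gives 0x0103 as in the example above, while `write_eax(0x1122334455667788, 0xaabbccdd)` gives 0x00000000aabbccdd.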
 441 | 
 442 | AMD64 and x64 add 8 more general purpose registers, R8, R9, R10, R11, R12, R13, R14, and R15.  These are accessed as 8, 16, 32, and 64 bit registers.  R8 through R15 (64 bits), R8D-R15D (32 bits), R8W-R15W (16 bits), and R8B-R15B (8 bits).
 443 | 
 444 | ### Special Purpose Registers
 445 | 
 446 | The RCX/ECX/CX (CX) register doubles as a counter for dedicated instructions.  The AMD64 instruction set includes instructions to fill, copy, and compare memory, and loops that use this register as the number of bytes/words/dwords/qwords to fill/copy/compare.  The special loop instructions use this register as the loop counter as well.
 447 | 
The RSI/ESI/SI and RDI/EDI/DI registers are general purpose "source" and "destination" registers for the fill, copy, and compare instructions.
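
As a rough illustration of how RCX, RSI, and RDI work together, this sketch fills a 16 byte buffer and then copies it using the REP STOSB and REP MOVSB instructions.  The names `src`, `dst`, and `fill_and_copy` are made up for this example.

```nasm
        bits 64

        section .bss
src:    resb 16             ; hypothetical source buffer
dst:    resb 16             ; hypothetical destination buffer

        section .text
fill_and_copy:
        cld                 ; direction flag = 0, so RSI/RDI auto-increment
        mov rdi, src        ; destination of the fill
        mov al, '*'         ; byte value to store
        mov rcx, 16         ; number of bytes to fill
        rep stosb           ; store AL at [RDI], RCX times

        mov rsi, src        ; source of the copy
        mov rdi, dst        ; destination of the copy
        mov rcx, 16         ; number of bytes to copy
        rep movsb           ; copy byte at [RSI] to [RDI], RCX times
        ret
```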
 449 | 
 450 | The RBP register is a general purpose register that is typically used as a base address register or by high level language compilers to maintain function stack frames (arguments, return address, and local variables allocated on the stack).
 451 | 
 452 | ### CPU Control Registers
 453 | 
 454 | #### Stack 
 455 | The RSP register contains the address of the last thing pushed on the processor stack. You can push registers on the stack to preserve their values, you can pop them to restore their values, address values already on the stack by index, etc.
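
A minimal sketch of those stack operations follows.  The label name is made up; the comments assume 64 bit mode, where each push moves RSP down by 8 bytes.

```nasm
        bits 64
        section .text
stack_demo:
        push rax            ; rsp decreases by 8, [rsp] = rax
        push rbx            ; rsp decreases by 8 again, [rsp] = rbx, [rsp+8] = rax
        mov  rcx, [rsp+8]   ; read the saved rax by index, without popping
        pop  rbx            ; restore in the reverse order of the pushes
        pop  rax
        ret
```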
 456 | 
 457 | #### Instruction Pointer
 458 | The RIP register contains the address of the next instruction to be executed.  The CPU automatically adds the correct number to it as it executes instructions to keep it pointed at the correct next instruction.  When you call a subroutine, the RIP is pushed on the RSP stack and RIP is loaded with the address of the subroutine.  When the subroutine returns, the RIP that was pushed before the call is popped from the stack into RIP.  Execution continues at the instruction after the call.
 459 | 
 460 | #### Flags
 461 | The FLAGS register is 64 bits containing information provided by the CPU to the program, and commands from the program to the CPU.  Not all the bits are used.  See https://en.wikipedia.org/wiki/FLAGS_register.
 462 | 
 463 | An example of the bits in FLAGS set by the CPU is the Carry Flag.  It is set when you have a carry after an arithmetic operation.  For example, if you add 1 to the AL register that contains 255, you will get AL=0, Carry = 1.  If you add 1 to AL=254, the Carry will be 0.
 464 | 
 465 | An example of the bits in the FLAGS set by the program is the Direction Flag.  If this is 0, the fill/copy/etc. instructions work from start address forward (auto-increments SI and DI).  If this is 1, the operations are done backward (auto-decrement).
 466 | 
 467 | The FLAGS register is there to use, but we might really only directly use the Carry bit and Direction bit.  We might use the Carry bit to return a true/false result from a function.  The CLC and STC instructions clear and set the Carry bit.  
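
Here is one way the Carry bit might be used as a true/false return value.  Both `is_digit` and `check_seven` are hypothetical routines written for this example, not part of any library.

```nasm
        bits 64
        section .text
; is_digit (hypothetical helper)
; in:  al = character
; out: carry set if al is '0'..'9', carry clear otherwise
is_digit:
        cmp al, '0'
        jb  .no             ; below '0', not a digit
        cmp al, '9'
        ja  .no             ; above '9', not a digit
        stc                 ; carry = 1 means "true"
        ret
.no:
        clc                 ; carry = 0 means "false"
        ret

check_seven:
        mov  al, '7'
        call is_digit
        jc   .digit         ; taken, because '7' is a digit
        xor  eax, eax       ; eax = 0 for "not a digit"
        ret
.digit:
        mov  eax, 1         ; eax = 1 for "digit"
        ret
```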
 468 | 
 469 | The various branch instructions use the Carry and Zero bits internally.
 470 | 
 471 | There are several instructions that set and clear these bits, programmatically.
 472 | 
 473 | # AMD64 Instruction Set
 474 | 
 475 | You will learn the instruction set as you go.  The instruction set is documented as a reference manual, not a programming manual.  That is, each instruction is documented as to what it does.  But there is no particular "how to use this instruction" documentation.
 476 | 
 477 | You can find the instruction set documented on various Web Sites.  The best source is the Intel Programmer's Manual or the AMD64 Programmer's Manual.
 478 | 
 479 | Here is a decent Web Page that lists the instructions in a table, one line per instruction with a short description.
 480 | 
 481 | https://www.felixcloutier.com/x86/
 482 | 
There are over 1500 instructions, from AAA to XTEST, that we can use.  That is too many to document every one here.  However, far fewer instructions are commonly used, and those cover most needs.
 484 | 
 485 | The format of a line of source code in assembly is:
 486 | ```
 487 | [optional label] instruction
 488 | or
 489 | [optional label] instruction operand
 490 | or
 491 | [optional label] instruction operand1, operand2
 492 | ```
 493 | 
 494 | 
 495 | When assembled, the instructions are encoded as opcode and operands as a sequence of bytes.  The CPU is able to execute these instructions.
 496 | 
 497 | ## Assembly source
 498 | 
 499 | In assembly source, the NASM assembler expects operands to be specified as ```destination, source``` (Intel syntax) while the gas assembler expects operands to be specified as ```source, destination``` (AT&T syntax).  The assembler language for the various CPUs (e.g. MC68000, AMD64, ARM, etc.) each specify whether the left operand is source or destination.  The gas assembler can be used to assemble source for various processors so it defaults to source, destination format, though you can tell it to use Intel (NASM) syntax.
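
To make the operand order difference concrete, here are a few instructions in Intel (NASM) syntax with what I believe is the equivalent AT&T form in the comment.  This is meant to be read and assembled, not run; the registers hold no meaningful addresses.

```nasm
        bits 64
        section .text
        mov rax, rbx        ; AT&T: movq %rbx, %rax
        mov rax, 10         ; AT&T: movq $10, %rax
        mov [rax], rbx      ; AT&T: movq %rbx, (%rax)
        mov rax, [rbx+24]   ; AT&T: movq 24(%rbx), %rax
```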
 500 | 
 501 | In Intel syntax source programs, the semicolon (;) character introduces the start of a comment.  All characters from that point on, to the end of the line, are ignored.
 502 | 
 503 | Before we look at some of these instructions, we need to look at addressing modes.
 504 | 
 505 | ## Addressing Modes
 506 | 
 507 | Addressing modes are the means by which operands to instructions are described and how they execute.  For example, Register operands indicate specific registers, but memory operands can be addressed through a variety of combinations of offsets and/or register contents.
 508 | 
 509 | To examine the addressing modes, we'll use the MOV instruction, which copies a value in a register to memory or loads a value to a register from memory.
 510 | 
 511 | The source and/or destination operand is specified using one of the addressing modes.
 512 | 
 513 | The instruction-set/addressing.asm file contains example usage of the various addressing modes.
 514 | 
 515 | ### Register Operands
 516 | 
 517 | Rather than memory being the source or destination, the operand is a register.  For example, 
 518 | ```
 519 | 	mov rax, rbx ; moves contents of rbx register into the rax register.
 520 | ```
 521 | 
### Immediate Operands
 523 | 
 524 | This mode moves a constant into a register.  The constant is encoded in the instruction, after the opcode. For example,
 525 | ```
 526 |         mov rax, 10 	; source operand is a constant
 527 | ```
 528 | 
 529 | #### Indirect Operands
 530 | 
 531 | This mode uses a register as the address of a memory location to be operated on (e.g. load from, store to).  For example,
 532 | ```
 533 |         mov [rax], rbx   ; store contents of rbx to memory location contained in rax
 534 | ```
 535 | 
 536 | #### Indirect with Displacement
 537 | 
 538 | This mode uses a register as the base address of a memory location, added to a fixed offset, to determine the address of a memory location to be operated on.  For example,
 539 | ```
 540 |         mov rax, [rbx+24]  ; access memory at 24 + contents of rbx
 541 | ```
 542 | The purpose of this addressing mode is to facilitate accessing a structure and its members.  Consider:
 543 | ```
 544 | struct {
 545 |   char *name,
 546 |        *address,
 547 |        *phone;
 548 | } person;
 549 | person.name = nullptr;
 550 | person.address = nullptr;
 551 | person.phone = nullptr;
 552 | 
 553 | ```
 554 | 
 555 | In assembly, we'd do something like this:
 556 | ```
NAME equ 0
ADDRESS equ 8
PHONE equ 16     ; each pointer is 8 bytes in 64 bit mode

mov rsi, person  ; load address of person into RSI
mov rax, 0       ; nullptr
mov [rsi+NAME], rax
mov [rsi+ADDRESS], rax
mov [rsi+PHONE], rax
 566 | ```
 567 | 
 568 | Another use of this addressing mode is for stack frames for a language such as "C", especially for calling subroutines.  A subroutine may have arguments passed to it on the stack, by value (like an int) or reference (like an address of a struct or string or whatever).  A subroutine may need its own local variables.  When a subroutine is called recursively, each recursive call must prepare the stack so it has arguments to pass, and allow for the next iteration's local variables on the stack.
 569 | 
 570 | The RBP register is used for stack frames when stack conventions are used for calling functions in "C".  
 571 | 
 572 | The calling function pushes arguments on the stack (right to left).  That is, for foo(a, b, c);, the compiler will generate code to push c, then b, then a.   
 573 | 
 574 | Upon entry to a function, RBP contains the stack frame pointer for the calling function.  The compiler generates code to immediately push it.  Then the RSP stack pointer is loaded into RBP.  
 575 | 
At this point, RBP points to the saved copy of the caller's RBP on the stack.  The return address is just above it (at RBP+8 in 64 bit code), and the arguments to the function are at positive offsets above the return address.

For local variables, the compiler generates a subtract from RSP to make the desired space on the stack.  When the function calls another, RSP is below the allocated variables, so it all works.  Negative offsets from RBP are used to access the local variables.

To return, the compiler generates code to copy RBP back into RSP (discarding the local variables), pop RBP (restoring the caller's stack frame), and return.  The LEAVE instruction performs the first two steps.  The calling code has to adjust RSP to remove the pushed arguments.
 581 | 
 582 | Note: AMD64/X64 use a register scheme for passing arguments to functions and uses the stack when there are too many arguments to pass (not enough registers).  See https://en.wikipedia.org/wiki/X86_calling_conventions.  I present this information because you will likely run across stack frames, especially when viewing GDB (command line debugger) backtraces.
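
Here is a sketch of such a stack frame in 64 bit NASM syntax.  It deliberately passes both arguments on the stack to show the offsets; `add_two` and `caller` are hypothetical names, and real AMD64 code would normally pass these arguments in registers.

```nasm
        bits 64
        section .text
; add_two (hypothetical): both arguments passed on the stack, result in rax
add_two:
        push rbp            ; save caller's frame pointer
        mov  rbp, rsp       ; [rbp] = saved rbp, [rbp+8] = return address
        sub  rsp, 16        ; reserve 16 bytes for local variables
        mov  rax, [rbp+16]  ; first argument (pushed last)
        add  rax, [rbp+24]  ; second argument (pushed first)
        mov  [rbp-8], rax   ; store the sum in a local variable
        mov  rsp, rbp       ; discard the locals (LEAVE does this and the pop)
        pop  rbp            ; restore caller's frame pointer
        ret

caller:
        push 222            ; second argument is pushed first (right to left)
        push 111            ; first argument
        call add_two        ; rax = 333
        add  rsp, 16        ; caller removes the two 8 byte arguments
        ret
```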
 583 | 
Let's see a little bit of example code and the assembly generated by the compiler.  Note that this is 32 bit code (it uses EBP and ESP, and passes the arguments on the stack) in AT&T syntax, ```source, destination``` format.  The register names are prefixed with %.
 585 | 
 586 | ```
 587 | // source
 588 | void bar(int a, int b) {
 589 |     int x, y;
 590 | 
 591 |     x = 555;
 592 |     y = a+b;
 593 | }
 594 | 
 595 | void foo(void) {
 596 |     bar(111,222);
 597 | }
 598 | 
 599 | ; compiles to:
 600 | bar:
 601 |     pushl   %ebp
 602 |     movl    %esp, %ebp
 603 |     subl    $16, %esp
 604 |     movl    $555, -4(%ebp)
 605 |     movl    12(%ebp), %eax
 606 |     movl    8(%ebp), %edx
 607 |     addl    %edx, %eax
 608 |     movl    %eax, -8(%ebp)
 609 |     leave
 610 |     ret
 611 | 
 612 | foo:
 613 |     pushl   %ebp
 614 |     movl    %esp, %ebp
 615 |     subl    $8, %esp
 616 |     movl    $222, 4(%esp)
 617 |     movl    $111, (%esp)
 618 |     call    bar
 619 |     leave
 620 |     ret
 621 | ```
 622 | 
Note the use of the indirect with displacement addressing mode!
 624 | 
 625 | #### Indirect with displacement and scaled index
 626 | 
 627 | This addressing mode is used to access array elements.  To illustrate how this mode works:
 628 | 
 629 | * an array of bytes, each element is 1 byte each
 630 | * an array of words, each element is 2 bytes each
 631 | * an array of dwords, each element is 4 bytes each
 632 | * an array of qwords, each element is 8 bytes each
 633 | 
As you index the array, you have to "scale" the index before adding it to the base of the array.  The scaling operation ensures we are addressing byte, word, dword, or qword elements properly.
 635 | 
 636 | ```
        mov [rsi + rbx*4 + member], eax   ; store dword in eax at rsi + member (offset) + rbx x 4
 638 | ```
 639 | The above example stores a dword into memory.  We are accessing a struct member that is an array of dwords.  The rbx register contains the index into the array, [0 ... array.length-1].  The 4 is the scale factor, or size of the dword.  
 640 | 
 641 | Note that member may be 0 - in this case, rsi simply contains the address of the array.
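
As a small sketch of scaled indexing in practice, the routine below sums an array of dwords.  The names `values`, `COUNT`, and `sum_values` are invented for this example, and the displacement is 0 because RSI points directly at the array.

```nasm
        bits 64

        section .data
values: dd 10, 20, 30, 40       ; hypothetical array of 4 dwords
COUNT   equ ($ - values) / 4    ; number of elements (bytes / 4)

        section .text
sum_values:
        mov rsi, values         ; base address of the array
        xor eax, eax            ; running total = 0
        xor ebx, ebx            ; index = 0 (also clears the top of rbx)
.next:
        add eax, [rsi + rbx*4]  ; add element at base + index * 4
        inc rbx                 ; next index
        cmp rbx, COUNT
        jb  .next               ; loop while index < COUNT
        ret                     ; eax = 100
```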
 642 | 
 643 | 
 644 | 
 645 | # Commonly Used Instructions
 646 | 
## Arithmetic
 648 | 
 649 | ```
ADC - add with carry (adds the two operands plus the carry flag)
ADD - add two operands together
 652 | DEC - decrement by 1
 653 | DIV - unsigned divide
 654 | IDIV - signed divide
 655 | IMUL - signed multiply
 656 | INC - increment by 1
 657 | MUL - unsigned multiply
 658 | NEG - two's complement (multiply by -1)
 659 | SBB - subtract with borrow (carry flag)
 660 | SUB - subtract
 661 | LEA - load effective address (formed by some expression / addressing mode) into register
 662 | ```
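
A few of these are easier to understand in use.  The sketch below is only an illustration (the label is made up): DIV divides the 128 bit value in RDX:RAX, LEA can do simple arithmetic without touching the flags, and ADC picks up the carry from a previous ADD.

```nasm
        bits 64
        section .text
arith_demo:
        mov rax, 100
        xor edx, edx            ; DIV divides rdx:rax, so clear the high half
        mov rcx, 7
        div rcx                 ; rax = 14 (quotient), rdx = 2 (remainder)
        lea rbx, [rax + rax*4]  ; rbx = rax * 5 = 70
        neg rbx                 ; rbx = -70

        mov rax, -1             ; all bits set
        add rax, 1              ; rax = 0, carry flag = 1
        mov rdx, 0              ; MOV does not change the flags
        adc rdx, 0              ; rdx = 0 + 0 + carry = 1
        ret
```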
 663 | 
 664 | ## Boolean Algebra
 665 | ```
AND - logical AND of two operands
 667 | NOT - one's complement (invert all the bits in the operand)
 668 | OR - logical OR
 669 | XOR - logical exclusive or
 670 | TEST - logical compare
 671 | ```
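
Here is a short sketch of these instructions used for bit masking.  The label names are hypothetical; the values in the comments are worked out by hand from the starting value 0xb6.

```nasm
        bits 64
        section .text
mask_demo:
        mov al, 0xb6        ; al = 1011 0110
        and al, 0x0f        ; keep the low 4 bits: al = 0x06
        or  al, 0x80        ; set bit 7:           al = 0x86
        xor al, 0xff        ; flip every bit:      al = 0x79
        not al              ; flip them back:      al = 0x86
        test al, 0x80       ; AND without storing the result; ZF = 0 since bit 7 is set
        jnz .bit7_set       ; taken
        xor eax, eax        ; not reached in this example
        ret
.bit7_set:
        mov eax, 1
        ret
```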
 672 | 
 673 | ## Branching and Subroutines
 674 | ```
 675 | CALL - call a subroutine/function/procedure
 676 | SYSCALL - call an OS function (Linux, Mac)
ENTER - make stack frame for procedure parameters
LEAVE - high level procedure exit
RET - return from subroutine
CMP - compare two operands
 681 | JA - jump if result of unsigned compare is above
 682 | JAE - jump if result of unsigned compare is above or equal
 683 | JB - jump if result of unsigned compare is below
 684 | JBE - jump if result of unsigned compare is below or equal
 685 | JC - jump if carry flag is set
 686 | JE - jump if equal
 687 | JG - jump if greater than 
 688 | JGE - jump if greater than or equal
 689 | JNC - jump if carry not set
JMP - go to / jmp (simply loads the RIP register with the address)
 691 | ```
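
The difference between the unsigned (above/below) and signed (greater/less) jumps is easiest to see with a value that has its top bit set.  This is a sketch with made-up label names; both jumps test the flags set by the same CMP.

```nasm
        bits 64
        section .text
; compare_demo (hypothetical): returns 1 in eax if the unsigned branch is taken
compare_demo:
        mov rax, -1             ; all bits set
        cmp rax, 1
        jg  .signed_greater     ; not taken: as a signed value, -1 < 1
        ja  .unsigned_above     ; taken: as unsigned, 0xffffffffffffffff > 1
        xor eax, eax            ; not reached in this example
        ret
.signed_greater:
        mov eax, 2
        ret
.unsigned_above:
        mov eax, 1
        ret
```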
 692 | 
 693 | ## Bit Manipulation
 694 | ```
 695 | BT - bit test (test a bit)
 696 | BTC - bit test and complement
 697 | BTR - bit test and reset
 698 | BTS - bit test and set
RCL - rotate operand and carry flag together (9 bits for an 8 bit operand) left count bits
RCR - rotate operand and carry flag together (9 bits for an 8 bit operand) right count bits
ROL - rotate bits in operand left count bits
ROR - rotate bits in operand right count bits
 703 | SAL - arithmetic shift operand left count bits
 704 | SAR - arithmetic shift operand right count bits (maintains sign bit)
 705 | SHL - logical shift operand left count bits (same as SAL)
 706 | SHR - logical shift operand right count bits (does not maintain sign bit)
 707 | ```
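
The sketch below runs the same 8 bit value through several of these instructions so the results can be compared.  The label is hypothetical and the values in the comments are worked out by hand.

```nasm
        bits 64
        section .text
shift_demo:
        mov al, 0x81        ; al = 1000 0001
        shl al, 1           ; al = 0x02, CF = 1 (bit 7 shifted out)
        mov al, 0x81
        shr al, 1           ; al = 0x40, CF = 1 (bit 0 shifted out)
        mov al, 0x81
        sar al, 1           ; al = 0xc0, CF = 1 (sign bit is kept)
        mov al, 0x81
        rol al, 1           ; al = 0x03, CF = 1 (bit 7 wraps around to bit 0)
        bt  eax, 1          ; CF = bit 1 of eax = 1, eax unchanged
        btr eax, 1          ; CF = 1 (old bit 1), then bit 1 is cleared: al = 0x01
        ret
```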
 708 | 
 709 | ## Register Manipulation, Casting/Conversions
 710 | ```
 711 | MOV - move register to register, move register to memory, move memory to register
 712 | XCHG - exchange register/memory with register
 713 | CBW - convert byte to word
CWD/CDQ - convert word to double word (CWD)/convert double word to quad word (CDQ)
 715 | ```
 716 | 
 717 | ## Flags Manipulation
 718 | ```
 719 | CLC - clear carry flag/bit in flags register
 720 | CLD - clear direction bit in flags register
 721 | STC - set carry flag
 722 | STD - set direction flag
 723 | ```
 724 | 
 725 | ## Stack Manipulation
 726 | ```
 727 | POP - pop a register off the stack
 728 | POPF - pop stack into flags register
 729 | PUSH - push a register on the stack
 730 | PUSHF - push flags register on the stack
 731 | ```
 732 | 
 733 | # Assembler Source, Directives,  and Macros
 734 | The assembler is a program that reads assembly source code and generates a binary output file or ELF .o file.  The assembler reads a line at a time and writes the encoded program instructions for that line to the output file.  
 735 | 
NASM is a great free assembler.  The LLVM assembler and the GNU Assembler (as/gas, part of the GNU binutils package) are two other assemblers that are used for Linux and MacOS assembly development/programming.  For most purposes, the LLVM and GNU assemblers accept the same source.  There are other assemblers out there, but they are beyond the scope of this tutorial.
 737 | 
 738 | 
 739 | There are two styles of assembly source for x64: Intel and AT&T. 
 740 | 
 741 | * Intel syntax expects operands to be specified as ```destination, source```.
 742 | * AT&T syntax expects operands to be specified as ```source, destination```.  
 743 | 
 744 | The NASM assembler uses Intel syntax and the GNU/LLVM assemblers can use either Intel or AT&T; you choose which using an assembler directive.  
 745 | 
 746 | ## Assembler Directives
An assembler directive is not a machine instruction.  Instead, directives convey information to the assembler to affect code generation as you prefer.  Assembler directives are specific to the assembler you are using, so source code that uses them is not portable between assemblers.  The operand order of Intel and AT&T syntax also makes code written for one not portable to an assembler using the other.
 748 | 
 749 | The gas (gnu/llvm) assembler uses the .intel_syntax directive to tell the assembler that the source format of the file is Intel syntax.  Otherwise, AT&T syntax is assumed.
 750 | 
 751 | I'm not going to expand on all the directives for gas and NASM.  There are basically similar directives for both assemblers.  I prefer using NASM, though there is no reason you can't use gas - whichever you prefer.  I'll document the common NASM directives here.
 752 | 
 753 | There are a lot of directives; I'm not covering all of them. For expanded information, see the NASM manual online at https://nasm.us.  Hopefully, you find what is covered here to be enough to get you going.
 754 | 
 755 | ### section type [options]
 756 | The section directive specifies that the following instructions/directives apply to the specified section.  Examples:
 757 | ```
 758 | section .text
section .bss exec
section .rodata
```
These types were defined earlier in this document.  The exec option (for ELF output formats) marks this bit of .bss as executable, in addition to readable and writable.
 763 | 
 764 | ### bits 16, bits 32, and bits 64, use16, use32, use64
 765 | These directives tell the assembler to generate instructions for the CPU running in the specified mode.  
 766 | 
 767 | When the system first boots, the CPU is in 16 bit mode.  The instructions it executes at that point must be ```bits 16``` or ```use16```.  You probably won't be writing code for 16 bit mode.
 768 | 
 769 | A 32-bit operating system sets the CPU into 32 bit mode.  The instructions it executes at that point must be ```bits 32``` or ```use32```.
 770 | 
 771 | This document assumes 64-bit mode, so we use ```bits 64```.  In 64-bit mode, the assembler can generate either 64-bit or 32-bit instructions, whichever is appropriate.
 772 | 
 773 | ### Comments
 774 | 
 775 | In a NASM source program, the semicolon (;) character introduces the start of a comment.  All characters from that point on, to the end of the line, are ignored.
 776 | 
 777 | Note that gas supports a couple of comment styles, including ```/* */``` C-style multiline comments, or pound sign ```#``` to introduce the start of a comment.
 778 | 
 779 | ### Constants
 780 | NASM supports constants of the form:
 781 | ```
 782 | 0x10 ; base 16
 783 | 010h ; base 16
 784 | 011100b ; base 2
 785 | ```
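
NASM accepts a few more notations than the ones above, including decimal, octal, digit separators, and character constants.  As a sketch, every value on the first `db` line below is the same number, 200; the label names are made up.

```nasm
        section .data
; every value on this line is 200 decimal
two_hundred: db 200, 0xc8, 0c8h, 310q, 0o310, 11001000b, 1100_1000b
; a character constant is just its character code
letter_a:    db 'A'         ; same as db 65
```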
 786 | 
 787 | ### Program Variables and Strings
Programming is useless if you can't create variables and create and operate on strings.  The assembler provides directives to reserve space for variables or to define initialized memory.

Reserving space examples:
 791 | ```
 792 |     resb 1  ; reserve 1 byte
 793 | 	resw 1  ; reserve 1 word (2 bytes)
 794 | 	resd 1  ; reserve 1 dword (4 bytes)
 795 | 	resq 1  ; reserve 1 qword (8 bytes)
 796 | 	resb 16 ; reserve 16 bytes
 797 | 	...
 798 | ```
 799 | 
 800 | Initializing memory examples:
 801 | ```
 802 |      db 10  ; reserve 1 byte with the value 10 at the memory location
 803 |      dw 11  ; reserve 1 word with the value 11 at the memory location
 804 |      dd 10  ; reserve 1 dword with the value 10 at the memory location
 805 |      dq 10  ; reserve 1 qword with the value 10 at the memory location
 806 | 	 db 10, 11, 12 ; reserve 3 bytes with values 10, 11, and 12
 807 | 	 ...
 808 | ```
 809 | 
 810 | You can use the memory initializer directives for strings:
 811 | ```
 812 |      ; create a null terminated string
 813 |      db 'now is the time for all good men to come to the aid of their country!', 0
 814 | 	 ; create a null terminated string with carriage return/linefeed at the end
 815 |      db 'now is the time for all good men to come to the aid of their country!', 13, 10, 0
 816 | ```
 817 | 
 818 | ### Assembler Variables and Labels
 819 | A label is a type of variable, and is the first thing on a line of source code.  The value of the label is the current program counter as viewed by the assembler and when the program is actually running.  You typically use a label to define a variable to access from assembly code or the address for jumps or subroutines.
 820 | 
 821 | You use the ```global``` directive to make a label's scope visible to other .o files at link time.  If you want to reference a label defined in a different .o file, you use the ```extern``` directive.
 822 | ```
            section .text
            extern external_message ; defined in a different .o file
            ...
; find length of message
            mov rsi, message    ; load address of message into rsi
            call length
            ; print rcx, it has the length of the string
            ...
            mov rsi, external_message
            call length
            ; print rcx, it has the length of the string
            ...
; length: in rsi = address of null terminated string, out rcx = length
length:
            xor rcx, rcx        ; fast way to set rcx to 0
length_next:
            mov al, [rsi]       ; get character from string
            test al, al         ; is it the terminating 0?
            jz length_done      ; yes, don't count it
            inc rsi             ; point to next character
            inc rcx             ; increment length counter
            jmp length_next
length_done:
            ; rcx has the length of the string
            ret
            ...

            section .rodata
            global message      ; make message visible to other .o files
message:    db 'hello, world!', 13, 10, 0
 849 | ```
 850 | 
A variable is a string of text that refers to any numeric value you like, with a few exceptions.  A common use is to define constants/expressions, as you would use ```#define``` in "C".  You use the EQU directive to specify the variable's value.
 852 | 
 853 | Examples:
 854 | ```
 855 | ANSWER  equ 42
 856 | CR      equ 13
 857 | NEWLINE equ 10
 858 | STDIN   equ 0
 859 | STDOUT  equ 1
 860 | STDERR  equ 2
 861 | ```
 862 | 
 863 | The ```$```  character can be used in these expressions, too.  It represents the current value of the program counter as the assembler sees it.
 864 | 
 865 | ```
            section .text
            mov rax, message     ; load address of message into rax
            mov rcx, message_len ; load length of message into rcx

            section .rodata
message:    db 'hello, world!', 13, 10
message_len equ $ - message ; length of message string in bytes
 873 | ```
 874 | 
 875 | You can also use the ```%assign``` directive to create and update a variable.  If you try to use EQU twice on the same variable name, it is an error.  
 876 | 
 877 | ```
 878 | %assign count 0
 879 | %assign count count+1
 880 | ```
 881 | 
 882 | There is a directive to assign a string to a variable, too.  This is similar to the "C" ```#define``` preprocessor directive; the string is substituted in the source code when the variable is encountered.
 883 | 
 884 | ```
 885 | %define hello 'hello, world!', 13, 10
 886 | 			section .text
            mov rax, message     ; load address of message into rax
            mov rcx, message_len ; load length of message into rcx
 889 | 
 890 | 			section .rodata
 891 | message:    db hello
 892 | message_len equ $ - message ; length of message string in bytes
 893 | ```
 894 | 
 895 | You can undefine one of these variables created with ```%define``` using ```%undef```.
 896 | 
 897 | You can use local labels so you don't have to keep track of every label/variable you have defined to avoid collisions.  A local label begins with a period.  Its scope is valid only between two true labels.
 898 | 
 899 | ```
 900 | ; subroutines to return address of string in RSI
 901 | get_string1:
 902 |             mov rsi, .string
 903 | 			ret
 904 | .string:    db 'string1'
 905 | 
 906 | get_string2:
 907 |             mov rsi, .string
 908 | 			ret
 909 | .string:    db 'string2'
 910 | ```
 911 | 
 912 | Creating a variable or label does not generate any code!
 913 | 
### Repetition
 915 | The ```times``` directive is used to repeat an initialization:
 916 | 
 917 | ```
 918 |         section .data
 919 | stars:  times 32 db '*' ; creates 32 bytes containing * at memory location "stars".
 920 | ```
 921 | 
 922 | ### Macros
A macro is similar to a subroutine, but is substituted inline and has powerful text processing/substitution facilities.
 924 | 
 925 | A macro is defined using the ```%macro``` and ```%endmacro``` directives.  Everything between these two directives is the content of the macro, or the text to be substituted.  The ```%macro``` directive requires the number of parameters to the macro.
 926 | 
 927 | ```
 928 | ; two handy macros that save me a lot of typing.
 929 | %macro pushg 0
 930 |     push rax
 931 | 	push rbx
 932 | 	push rcx
 933 | 	push rdx
 934 | %endmacro
 935 | 
 936 | ; note these have to be popped in the reverse order they are pushed!
 937 | %macro popg 0
 938 |     pop rdx
 939 | 	pop rcx
 940 | 	pop rbx
 941 | 	pop rax
 942 | %endmacro
 943 | 
 944 |     ...
 945 | 	; short and convenient
 946 | 	pushg
 947 | 	; use registers rax, rbx, rcx, rdx
 948 | 	popg
 949 | ```
 950 | 
 951 | If you want to pass arguments to your macro, you specify a non-zero number on the ```%macro``` directive.  Within the macro body, you can access the parameters using ```%1```, ```%2``` and so on.  Here's a macro definition that demonstrates some of the power of macros.
 952 | 
 953 | ```
 954 | %macro print 1
 955 |     mov rsi, .message
 956 | 	call print_message
 957 | 	jmp .over
.message: db %1, 0   ; %1 already includes the quotes; it is not expanded inside a quoted string
 959 | .over:
 960 | %endmacro
 961 | 
 962 |    ...
 963 |    print "hello, world!"
 964 |    
 965 | ```
 966 | 
 967 | The problem with our print macro is that it generates .message and .over local labels and you might use the macro more than once between real labels:
 968 | 
 969 | ```
 970 |    print "hello, world!"
 971 |    print "goodbye cruel world!"
 972 | ```
 973 | 
What happens is we have duplicate local labels and the assembler generates an error.  Local labels are incredibly useful in macros, so there has to be a way, and there is.  Local labels within macros are defined using the form ```%%label```.  The assembler generates a unique label name each time it expands the macro.  This is the working print macro:
 975 | 
 976 | ```
 977 | %macro print 1
 978 |     mov rsi, %%message
 979 | 	call print_message
 980 | 	jmp %%over
%%message: db %1, 0  ; %1 already includes the quotes; it is not expanded inside a quoted string
 982 |     align 8
 983 | %%over:
 984 | %endmacro
 985 | ```
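
Macro parameters are not limited to strings.  As a sketch of a macro with two parameters, this hypothetical `swap_regs` macro exchanges two 64 bit registers through the stack (XCHG would do the same job in one instruction; this is only to show ```%1``` and ```%2```).

```nasm
        bits 64
; swap_regs (hypothetical macro): exchange two 64 bit registers through the stack
%macro swap_regs 2
        push %1
        push %2
        pop  %1             ; %1 receives the old value of %2
        pop  %2             ; %2 receives the old value of %1
%endmacro

        section .text
swap_demo:
        mov rax, 1
        mov rbx, 2
        swap_regs rax, rbx  ; now rax = 2 and rbx = 1
        ret
```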
 986 | 
 987 | ### Conditional Assembly
 988 | NASM provides ```%if```, ```%elif```, ```%else```, and ```%endif``` directives that allow for conditional assembly.  
 989 | 
 990 | ```
 991 | ; a totally contrived useless example, for illustrative purposes
 992 | %assign foo 1
 993 | %if foo=1
 994 |    mov rax, 32
 995 | %else
 996 |    mov rax, 42
 997 | %endif
 998 | ```
 999 | 
NASM also provides the ```%ifdef``` directive that works with ```%else``` and the other conditional assembly directives.  Instead of testing a condition as ```%if``` does, it tests for the existence of a name defined with ```%define```.
1001 | 
1002 | ```
; comment out the undef to enable the LINUX "do things" code
%define LINUX
%undef LINUX
%ifdef LINUX
; do linux things
%else
; do mac things
%endif
1012 | ```
1013 | 
NASM provides the ```%ifidn``` directive, which tests whether two pieces of text are identical, and works with ```%else``` and the other conditional assembly directives.  NASM provides predefined variables that you can use to conditionally assemble using ```%ifidn```.  A particularly useful one is ```__?OUTPUT_FORMAT?__```, which you can test to determine whether to generate code for Linux or MacOS (or other):
1015 | 
1016 | ```
%ifidn __?OUTPUT_FORMAT?__, macho64
1018 |   ; do macos stuff
1019 | %else
1020 |   ; do linux stuff
1021 | %endif
1022 | ```
1023 | 
1024 | See: https://nasm.us/xdoc/2.15.03rc8/html/nasmdoc5.html for all the predefined variables.
1025 | 
1026 | 
1027 | ### Alignment
As you are writing your code, you may want instructions or data aligned on a word, dword, qword, or other size boundary.  A typical use is to align code on word/dword/qword boundaries.  You can get a performance boost by aligning the target of a branching instruction such as jmp, call, and so on.
1029 | 
1030 | ```
1031 |     align 8 ; align next code/data generated at next 8 byte boundary/address
1032 | 	align 16 ; align next code/data at next 16 byte boundary
1033 | 	
1034 | 	db 'hello'
1035 | 	align 8
1036 | my_code_is_aligned:
1037 | ```
1038 | 
1039 | Alignment is also useful for data structure definitions so your assembly structs can match up with ones defined in C.
1040 | 
1041 | ### Structures
You can define high-level-language-like structures using the ```struc``` and ```endstruc``` directives (these are standard NASM macros, so they take no ```%``` prefix).  The ```struc``` directive takes one parameter, the name of the structure.  The structure members are defined using the resb/resw/resd/resq space allocation directives.  The ```alignb``` directive is used to align structure members on the desired boundaries.

```
struc Contact
.company: resb 1 ; true for company, false for individual
   alignb 4 ; pad so the dword below starts on a 4 byte boundary
.company_id: resd 1 ; identifier
.name: resb 64 ; max 64 characters for name
.address: resb 64 ; also 64 for address
.phone: resb 16 ; 16 characters for phone number
endstruc
1053 | ```
1054 | 
1055 | Using a structure is straightforward:
1056 | 
1057 | ```
   mov rsi, [person] ; fetch address of Contact struct into RSI
   mov al, [rsi+Contact.company]
   test al,al
   jne .company
   ; is an individual
   push rsi                    ; save RSI, the print macro overwrites it
   print "Person"
   mov rsi, [rsp]              ; reload the struct address saved above
   lea rsi, [rsi+Contact.name] ; name is stored inside the struct, so take its address
   call printit
   pop rsi
   ...
.company:
   ; is a company
   push rsi                    ; save RSI, the print macro overwrites it
   print "Company"
   mov rsi, [rsp]              ; reload the struct address saved above
   lea rsi, [rsi+Contact.name] ; name is stored inside the struct, so take its address
   call printit
   pop rsi
   ...
1077 | ```
1078 | 
You use the ```istruc```, ```at```, and ```iend``` directives to declare instances of structures.

```
a_company: istruc Contact
  at Contact.company, db 1
  at Contact.company_id, dd 100
  at Contact.name, db 'Engulf and Devour Corp', 0
  at Contact.address, db '1 Main Street, Anytown USA', 0
  at Contact.phone, db '1-800-devour!', 0
iend
1089 | ```
1090 | 
1091 | ### Includes
NASM provides two commonly used include directives:
```
    %include "path/to/file"
    incbin "path/to/file"
```

The ```%include``` directive works like the "C" ```#include``` directive - it simply reads the specified file in place and assembles it as if it were part of the current file.  You can arbitrarily nest these includes, like you do in "C".

The ```incbin``` directive (a pseudo-instruction, so it takes no ```%``` prefix) includes a raw binary, verbatim, in the output file at the current position.  You can use it, for example, to include a .gif file in your code:

```
my_gif:
   incbin '/path/to/my/picture.gif'
1105 | my_gif_size equ $-my_gif
1106 | ```
1107 | 
1108 | # Hello, World
1109 | 
1110 | ## MacOS Version
1111 | 
1112 | See hello-world/ directory for a build script and this assembly source.
1113 | 
1114 | ```
1115 | ; Use the build-macos.sh script to assemble and link this.
1116 | 
1117 |         bits 64
1118 | 
1119 | 		section .text
1120 | 
1121 | 		global start
1122 | start:
1123 | 		mov     rax, 0x2000004 ; write
1124 | 		mov     rdi, 1 ; stdout
1125 | 		mov     rsi, msg
1126 | 		mov     rdx, msg.len
1127 | 		syscall
1128 | 
1129 | 		mov     rax, 0x2000001 ; exit
1130 | 		mov     rdi, 0
1131 | 		syscall
1132 | 
1133 | 
1134 | 		section .data
1135 | 
1136 | msg:    db      "Hello, world!", 10
1137 | .len:   equ     $ - msg
1138 | ```
1139 | 
1140 | It works.  Here's the output:
1141 | 
1142 | ```
# ./build-macos.sh
Run it via ./hello-macos
# ./hello-macos
Hello, world!
1147 | #
1148 | ```
1149 | 
1150 | ## Linux version
1151 | 
Linux has different (from MacOS) syscall numbers passed in rax.  The entry point for Linux programs is "_start" vs "start" on MacOS.
1153 | 
1154 | Otherwise, the program is the same.
1155 | 
1156 | ```
1157 | ; use the build-linux.sh script to assemble and link this
1158 |         bits 64
1159 | 
1160 |         section .text
1161 | 
1162 |         global _start
1163 | _start:
1164 |         mov     rax, 1 ; write
1165 |         mov     rdi, 1 ; stdout
1166 |         mov     rsi, msg
1167 |         mov     rdx, msg.len
1168 |         syscall
1169 | 
1170 |         mov     rax, 60 ; exit
1171 |         mov     rdi, 0
1172 |         syscall
1173 | 
1174 | 
1175 |         section .data
1176 | 
1177 | msg:    db      "Hello, world!", 10
1178 | .len:   equ     $ - msg
1179 | ```
1180 | 
1181 | ```
1182 | # ./build-linux.sh
1183 | Run it via ./hello-linux
1184 | # ./hello-linux
1185 | Hello, world!
1186 | #
1187 | ```
1188 | 
1189 | ## How it works
1190 | 
1191 | MacOS and Linux each provide a large set of syscalls (operating system calls) that can be invoked from any language.  The two share many syscalls, but they are different flavors of Unix (Linux vs. BSD-ish/MacOS), so each has several syscalls that the other lacks.  The syscall numbers (passed in rax) also differ between the operating systems.
1192 | 
1193 | The C libraries contain code similar to ours above to write strings to a file.  For our purposes, we pass the file descriptor for stdout (1) to write to the console.
1194 | 
1195 | Most C functions that are not implemented entirely inside a library are backed by a syscall.  For example, malloc and free are implemented in libc, so there are no syscalls for them.  However, libc has to get its memory from the kernel, and it does so through syscalls such as brk (which sbrk() wraps on Linux) and mmap.
1196 | 
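1197 | The syscalls take their arguments in CPU registers.  RAX holds the syscall number (write for the first syscall above, exit for the second), and the arguments go in RDI, RSI, RDX, R10, R8, and R9, in that order.  The result comes back in RAX, and the syscall instruction itself overwrites RCX and R11.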
1198 | 
1199 | ### Linux Syscalls
1200 | 
1201 | Linux syscalls are documented here:
1202 |     https://chromium.googlesource.com/chromiumos/docs/+/master/constants/syscalls.md
1203 | The syscalls for Linux are defined in:
1204 |     /usr/include/sys/syscall.h
1205 | 
1206 | ### MacOS Syscalls
1207 | 
1208 | The syscalls for MacOS are defined in:
1209 |     ./Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/sys/syscall.h
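1210 | These syscall numbers are subject to change, so you should, at least, use the defines in your syscall.h and verify after each OS update that the numbers haven't changed.  Note that the header lists the bare numbers (4 for write); the code above adds 0x2000000 to each one, which is how 64-bit MacOS marks a call as a BSD/Unix syscall.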
1211 | 
1212 | Alternatively, you can programmatically scan the syscall.h file and generate an assembly EQU for each syscall, so your program always has the correct syscall numbers.
1213 | 
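The scanning step mentioned above could look something like the following Python sketch.  It is only an illustration, not part of this repository: the function name `generate_equs`, the regular expression, and the `offset` parameter are all assumptions.  It accepts lines in the style of the MacOS header (`#define SYS_write 4`) as well as the `#define __NR_write 1` style used by Linux's asm/unistd_64.h, and the `offset` argument is there because the MacOS header holds the bare numbers while the listing above uses 0x2000004 for write.

```python
import re

# Assumed line formats: "#define SYS_write 4" (MacOS syscall.h) or
# "#define __NR_write 1" (Linux asm/unistd_64.h).  Anything else is skipped.
DEFINE_RE = re.compile(
    r"^\s*#\s*define\s+(?:SYS_|__NR_)(\w+)\s+(\d+)\s*(?:/\*.*)?$"
)


def generate_equs(header_text, offset=0):
    """Return NASM 'SYS_name equ value' lines for each syscall #define found.

    offset is added to every number; passing 0x2000000 reproduces the
    values used in the MacOS listing (an assumption of this sketch).
    """
    lines = []
    for line in header_text.splitlines():
        match = DEFINE_RE.match(line)
        if match:
            name, number = match.group(1), int(match.group(2))
            lines.append(f"SYS_{name} equ 0x{number + offset:x}")
    return lines


# Small demonstration using two lines in the MacOS header style.
sample = "#define\tSYS_exit           1\n#define\tSYS_write          4\n"
for equ in generate_equs(sample, offset=0x2000000):
    print(equ)
# prints:
# SYS_exit equ 0x2000001
# SYS_write equ 0x2000004
```

You could write the returned lines to a file (say, a hypothetical syscalls.inc) from your build script, pull it in with ```%include```, and then write `mov rax, SYS_write` instead of a hard-coded number.
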
1214 | If the parameters to an OS syscall change, your program may misbehave or crash.  Such changes are unlikely to affect every syscall, but you will need to fix your code when they do happen.
1215 | 
1216 | 


--------------------------------------------------------------------------------
/hello-world/build-linux.sh:
--------------------------------------------------------------------------------
1 | #!/bin/sh
2 | 
3 | nasm -f elf64 -o hello-linux.o -l hello-linux.lst hello-linux.asm
4 | ld -static -o hello-linux  hello-linux.o
5 | 
6 | echo "Run it via ./hello-linux"
7 | 


--------------------------------------------------------------------------------
/hello-world/build-macos-sh:
--------------------------------------------------------------------------------
1 | #!/bin/sh
2 | 
3 | nasm -f macho64 -o hello-macos.o -l hello-macos.lst hello-macos.asm
4 | ld -static -o hello-macos hello-macos.o
5 | 
6 | echo "Run it via ./hello-macos"
7 | 


--------------------------------------------------------------------------------
/hello-world/hello:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mschwartz/assembly-tutorial/c03c88092092894f6dfb0b87693868df6f5d21b6/hello-world/hello


--------------------------------------------------------------------------------
/hello-world/hello-linux.asm:
--------------------------------------------------------------------------------
 1 | ; use the build-linux.sh script to assemble and link this
 2 |         bits 64
 3 | 
 4 | section .text
 5 | 
 6 |         global _start
 7 | _start:
 8 |         mov     rax, 1 ; write
 9 |         mov     rdi, 1 ; stdout
10 |         mov     rsi, msg
11 |         mov     rdx, msg.len
12 |         syscall
13 | 
14 |         mov     rax, 60 ; exit
15 |         mov     rdi, 0
16 |         syscall
17 | 
18 | 
19 |         section .data
20 | 
21 | msg:    db      "Hello, world!", 10
22 | .len:   equ     $ - msg
23 | 


--------------------------------------------------------------------------------
/hello-world/hello-macos.asm:
--------------------------------------------------------------------------------
 1 | ; use the build-macos.sh script to assemble and link this
 2 | 
 3 |         section .text
 4 | 
 5 |         global start
 6 | start:
 7 |         mov     rax, 0x2000004 ; write
 8 |         mov     rdi, 1 ; stdout
 9 |         mov     rsi, msg
10 |         mov     rdx, msg.len
11 |         syscall
12 | 
13 |         mov     rax, 0x2000001 ; exit
14 |         mov     rdi, 0
15 |         syscall
16 | 
17 | 
18 |         section .data
19 | 
20 | msg:    db      "Hello, world!", 10
21 | .len:   equ     $ - msg
22 | 


--------------------------------------------------------------------------------
/images/cardiac2-s.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mschwartz/assembly-tutorial/c03c88092092894f6dfb0b87693868df6f5d21b6/images/cardiac2-s.jpg


--------------------------------------------------------------------------------
/instruction-set/addressing.asm:
--------------------------------------------------------------------------------
 1 | ;; NOTE that this is not a program meant to be run.  It is just a way to demonstrate
 2 | ;; that the instructions and addressing modes assemble without error.
 3 | 
 4 |         section .text
 5 | global start
 6 | start:
 7 |         ;; register addressing mode
 8 |         mov rax, rbx
 9 | 
10 |         ; direct (or immediate) addressing mode
11 |         ; you cannot store to a constant, so only the source may be a constant
12 |         mov rax, 10 	; source operand is a constant
13 | 
14 |         ; indirect addressing mode
15 |         ; one of the operands is the address of the memory location in a register
16 |         mov rax, [rbx]
17 |         mov [rbx], rax
18 |         ; invalid!
19 |         ; mov [rax], [rbx]
20 | 
21 |         
22 |         ; indirect with displacement
23 |         ; address = base + displacement
24 |         ;
25 |         ; typical use is to access structure elements (the displacement is the offset
26 |         ; to the structure member)
27 |         mov rax, [24+rbx] 	; base is rbx, displacement is 24
28 |         mov [24+rbx], rax
29 | 
30 |         ; indirect with displacement and scaled index
31 |         mov rax, [array + rbx * 4]
32 |         mov [array + rbx * 4], rax
33 | 
34 |         ; indirect with displacement in a second register
35 |         mov rax, [rbx + rcx]
36 |         mov [rbx + rcx], rax
37 |         
38 |         ; indirect with displacement in a second register scaled
39 |         mov rax, [rbx + rcx *4]
40 |         mov [rbx + rcx *4], rax
41 |         
42 |         ; indirect with displacement and another displacement in a second register scaled
43 |         mov rax, [24 + rbx + rcx *4]
44 |         mov [24 + rbx + rcx *4], rax
45 |         
46 |        section .bss 
47 | array: resb 8192
48 |         
49 | 


--------------------------------------------------------------------------------
/instruction-set/build.sh:
--------------------------------------------------------------------------------
1 | #!/bin/sh
2 | 
3 | nasm -f elf64 -o addressing.o -l addressing.lst addressing.asm
4 | 


--------------------------------------------------------------------------------