├── LICENSE
└── README.md


/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2022 CNLohr

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# assembly-notes

My repo for notes on assembly language, inlining, etc.

## Overall

## Dedicated Assembly

* https://github.com/AEFeinstein/Super-2023-Swadge-FW/blob/main/tools/sandbox_test/
* Example linker script: https://github.com/AEFeinstein/Super-2023-Swadge-FW/blob/main/tools/bootload_reboot_stub/esp32_s2_stub.ld
* Manual on special commands for dedicated (gas) assembly: http://www.math.utah.edu/docs/info/as_7.html

## Using Inline Assembly

Webpages (For inline assembly)
* https://dmalcolm.fedorapeople.org/gcc/2015-08-31/rst-experiment/how-to-use-inline-assembly-language-in-c-code.html
* https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Clobbers-and-Scratch-Registers
* https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

Tools:
* https://godbolt.org
* Get a map file from an elf: `objdump -t `
* Get an assembly listing from an elf: `objdump -S `
* GCC Pre-link assembly tools (you RARELY want these)
  * Make GCC produce in-line assembly: Add `-Wa,-a,-ad`
  * Make GCC produce in-line map file: Add `-Wl,-Map,test.debug.map`

Constraint Modifiers:
* `=` write-only -- the operand may **only** be written to; its value is **undefined** at the start of the inline assembly
  * NOTE: this only applies to the value passed into the inline assembly expression; once you have written to it yourself, you are free to read it back, of course
* `+` read-and-write -- the operand may be both read **and** written; its value is **well-defined** at the start of the inline assembly
* `&` early-clobber. Applies to both `=` and `+`. Without this, an input operand is allowed to be assigned to the same register as your output. You will want to use this if you write to an output operand **before** reading from a separate input operand.
  * NOTE: Do not modify InputOperands! It will break things. By telling the compiler something is an input operand you are promising it **will not** be written to.
* `g` general operand: any general register, memory location, or immediate integer
* `r` register
* `a` architecture-specific, see https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html for more information

Demonstration: read-after-write can lead to undesired behavior without `&`
```c
#include <stdint.h>

static __attribute__((noinline)) uint32_t incorrect_1(uint32_t a) {
	uint32_t ret = 0; // NOTE: it would technically be legal to keep this value uninitialized
	asm volatile(
		"mov $10, %0\n"
		"add %1, %0" // add eax, eax -> bad! (input and output share a register)
		:"=r"(ret):"r"(a));
	return ret;
}

static __attribute__((noinline)) uint32_t correct_1(uint32_t a) {
	uint32_t ret = 0;
	asm volatile(
		"mov $10, %0\n"
		"add %1, %0" // add eax, edx -> fine
		:"=&r"(ret):"r"(a));
	return ret;
}

int main() {
	incorrect_1(0);
	correct_1(0);
}
```

Demonstration: _never_ write to input constraints!
```c
uint32_t incorrect_2() {
	uint32_t ret = 0;
	asm volatile(
		"mov $10, %0\n"
		::"r"(ret));
	return ret; // returns 10
}

uint32_t incorrect_3() {
	uint32_t ret = 0;
	asm volatile(
		"mov $10, %0\n"
		::"r"(ret));
	return ret + 1; // returns 1!
}
```

Labels in inline assembly: `%=` expands to a number that is unique to each instance of the asm statement, so a jump to `skip_pix%=` lands on the `skip_pix%=:` label in the same statement.


Named operand syntax: Refer to an operand as `%[registername]` by declaring it as `: [registername]"constraint"(C-value),... : `

```c
asm [volatile] ( AssemblerTemplate
	: OutputOperands
	[ : InputOperands
	[ : Clobbers ] ])

asm [volatile] goto ( AssemblerTemplate
	:
	: InputOperands
	: Clobbers
	: GotoLabels)
```

## `"memory"` Clobber.
**If you modify memory inside your assembly, you will often need to add `"memory"` to your clobber list.** There are two major uses. First, if code after your assembly can read memory that your assembly has altered, the `"memory"` clobber stops GCC from using stale copies of that memory that it has cached (in a register or similar). Second, it guarantees that the compiler has emitted all earlier memory accesses before any later code runs (this is a compiler barrier, not a hardware fence). For example:

```
static inline void spin_unlock(spinlock_t *lock)
{
	__asm__ __volatile__(""::: "memory");
	lock->lock = 0;
}
```

## More examples.

Example: Read the processor cycle counter on ESP32, ESP8266:

```c
static inline uint32_t getCycleCount() {
	uint32_t ccount;
	asm volatile("rsr %0,ccount":"=a" (ccount));
	return ccount;
}
```


Example (Xtensa LX7, ESP32S2, ESP32):
```c
#define TURBO_SET_PIXEL(opxc, opy, colorVal ) \
	asm volatile( "mul16u a4, %[width], %[y]\nadd a4, a4, %[px]\nadd a4, a4, %[opx]\ns8i %[val],a4, 0" \
		: : [opx]"a"(opxc),[y]"a"(opy),[px]"a"(dispPx),[val]"a"(colorVal),[width]"a"(dispWidth) : "a4" );

// Very tricky:
// We do bgeu (an unsigned compare), so one branch checks 0 <= x < MAX
// Other than that, it's basically the same as above.
#define TURBO_SET_PIXEL_BOUNDS(opxc, opy, colorVal ) \
	asm volatile( "bgeu %[opx], %[width], failthrough%=\nbgeu %[y], %[height], failthrough%=\nmul16u a4, %[width], %[y]\nadd a4, a4, %[px]\nadd a4, a4, %[opx]\ns8i %[val],a4, 0\nfailthrough%=:\nnop" \
		: : [opx]"a"(opxc),[y]"a"(opy),[px]"a"(dispPx),[val]"a"(colorVal),[width]"a"(dispWidth),[height]"a"(dispHeight) : "a4" );
```

x86_64 assembly: strlen, but no loops (sort of)
```c
// rcx starts at -1 and repnz scasb decrements it once per byte scanned,
// including the terminating NUL, so ~rcx is the length plus one.
// The final dec drops the NUL from the count.
unsigned long long ret = 0;
asm volatile("\n\
	xor %%rax, %%rax\n\
	xor %%rcx, %%rcx\n\
	dec %%rcx\n\
	mov %[strin], %%rdi\n\
	repnz scasb (%%rdi), %%al\n\
	not %%rcx\n\
	dec %%rcx\n\
	mov %%rcx, %[ret]\n\
	" : [ret]"=r"(ret) : [strin]"r"(str) : "rdi","rcx","rax" );
return ret;
```


Side-note: you can put raw opcodes in your code, in case the compiler doesn't know about some processor quirk.

```c
static void CUSTOMLOG( const char * c ) {
	asm volatile( "csrrw x0, 0x137, %0\n.long 0xff100073" : : "r" (c));
}
```

## Relative Jump Labels

You can use numeric local labels for relative jumps: `1f` jumps forward to the next `1:` label, and `1b` jumps back to the previous one.

```c
asm volatile(
	// This clears BSS.
"	la a0, _sbss\n\
	la a1, _ebss\n\
	li a2, 0\n\
	bge a0, a1, 2f\n\
1:	sw a2, 0(a0)\n\
	addi a0, a0, 4\n\
	blt a0, a1, 1b\n\
2:"
	// This loads DATA from FLASH to RAM.
"	la a0, _data_lma\n\
	la a1, _data_vma\n\
	la a2, _edata\n\
1:	beq a1, a2, 2f\n\
	lw a3, 0(a0)\n\
	sw a3, 0(a1)\n\
	addi a0, a0, 4\n\
	addi a1, a1, 4\n\
	bne a1, a2, 1b\n\
2:\n" );
```

## Random Tricks:

* Prevent the compiler from emitting dumb `l32r`s, i.e.
tell the compiler "Please do not load another literal; instead compute `c1` from `c0`" (thanks, @duk-37)

```c
uint32_t c0 = 0xF0F0F0F0;
asm("":"+r"(c0)); // Force the compiler to not do an l32r.
uint32_t c1 = c0 >> 4;
// c1 = 0x0f0f0f0f, now, but without extra l32r
```

* Super-fast integer absolute value

```c
// Four alternative definitions of ABS; use only one at a time.

// .725 seconds
static inline int ABS( int x ) { if( x < 0 ) return -x; else return x; }

// .727 seconds
static inline int ABS( int x ) { int mask = x>>31; return (mask + x)^mask; }

// .673 seconds (abs() comes from <stdlib.h>)
#define ABS(x) abs(x)

// .554 seconds!!!
// x is modified by the assembly, so it must be a "+r" output, not an input.
static inline int ABS( int x ) { asm volatile("\nmov %[x], %%ebx\nneg %[x]\ncmovl %%ebx,%[x]\n" : [x]"+r"(x) : : "ebx" ); return x; }
```
(-O4, gcc 9.4.0, x86_64. Run times are for my day 15 Advent of Code 2022 Challenge, Part 1, which uses a lot of abs's.)

### Random example

This does an integer multiply on systems that don't have a multiply instruction.

```c
uint32_t ret = 0;
// ret is read by the first add, so it must be "+&r" rather than "=&r".
asm volatile( "\n\
	.option rvc;\n\
1:	andi t0, %[small], 1\n\
	beqz t0, 2f\n\
	add %[ret], %[ret], %[big]\n\
2:	srli %[small], %[small], 1\n\
	slli %[big], %[big], 1\n\
	bnez %[small], 1b\n\
	" :
	[ret]"+&r"(ret) , [big]"+&r"(big_num), [small]"+&r"(small_num) : :
	"t0" );
return ret;
```

### Cursed binary generation at runtime when you use constants.

Use the `"i"` (immediate integer) constraint, so that a compile-time constant can be turned into part of an opcode.

Example here: https://gist.github.com/cleverca22/79143cb23a50d572b9d527c9ea479492#file-vpu-support-native-h

```c
static inline void inner(int a) {
	asm volatile (".long (%[rega] << 4) | 42" ::[rega]"i"(4));
}

void outer() {
	inner(42);
}
```

## Inline assembly for x86 SIMD (AVX2)

Just a quick example showing how to load some ints, do a gather load, and save some data off.

```c
#include <stdio.h>
#include <immintrin.h>

int main()
{
	__attribute__((aligned(256))) const float in_vals[8] = { 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0 };
	__attribute__((aligned(256))) int in_load_map[8] = { 1, 3, 5, 7, 1, 3, 5, 7 }; // Indices into in_vals to load data from.

	// Initialize with junk, to make sure it's working.
	__attribute__((aligned(512))) float stored[16] = { 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, };

	__m256 output_in_m256;
	// output_in_m256 is only written, so it is "=x"; the "memory" clobber
	// tells the compiler that the gather reads in_vals through a pointer.
	asm volatile( "\
		vpcmpeqd %%ymm0, %%ymm0, %%ymm0 /* Make ymm0 filled with 0xffffffff */ \n\
		vmovdqu %[in_load_map], %%ymm1 /* Load ymm1 with all the ints from in_load_map */ \n\
		vgatherdps %%ymm0, 0(%[in_vals],%%ymm1,4), %[output_in_m256] /* treat that input map as indices in the array pointed to by in_vals, and load up output_in_m256 */ \n\
		vmovaps %[output_in_m256], %[stored]\n\
		" : [output_in_m256] "=x" (output_in_m256), [stored] "=m" (stored): [in_load_map] "m" (in_load_map), [in_vals] "r" (in_vals): "ymm0", "ymm1", "ymm2", "memory" );

	printf( "STORED: %f\n", stored[2] );
}
```

## The presentation.

https://drive.google.com/drive/folders/1WUkw5rC5yDKR2lT6nQkdDeGCpMPmpI3f?usp=sharing

### References from the presentation:

1) https://en.wikipedia.org/wiki/RollerCoaster_Tycoon_(video_game)#/media/File:RollerCoaster_Tycoon_Screenshot.png Retrieved 2022-10-24
2) https://www.reddit.com/r/ProgrammerHumor/comments/ji8sx6/still_assembly_is_one_of_my_favourite_language/ Retrieved 2022-10-24
3) https://www.youtube.com/watch?v=eAhWIO1Ra6M (Published Jun 6, 2015)
4) https://hacks.mozilla.org/2013/12/gap-between-asm-js-and-native-performance-gets-even-narrower-with-float32-optimizations/ (Published Dec 20, 2013)
5) https://www.reddit.com/r/ProgrammerHumor/comments/6tifn2/the_holy_trinity_and_javascript/ (Published August 13 2017)
6) https://knowyourmeme.com/memes/ackchyually-actually-guy (Retrieved 2022-10-25)
7) https://www.youtube.com/watch?v=-E36UNXMjnU (2018-12-21)
8) https://www.youtube.com/watch?v=QM1iUe6IofM (2016-01-18)
9) https://www.youtube.com/watch?v=GKYCA3UsmrU (Cut from original video uploaded Jun 11, 2015)
10) https://en.wikipedia.org/w/index.php?title=Red%E2%80%93black_tree&oldid=675908841 (2015-08-13)
11) https://www.youtube.com/watch?v=m4f4OzEyueg (2014-09-05)
12) https://github.com/WebAssembly/design/issues/796 (Retrieved 2022-10-29)
13) https://imgur.com/lZLyoaA (2017-11-09)
14) http://ww1.microchip.com/downloads/en/devicedoc/atmel-0856-avr-instruction-set-manual.pdf (2016-11-01)
15) https://gist.github.com/cnlohr/e11d59f98c94f748683ba7ec80667d51 (2022-10-28)
16) https://www.youtube.com/watch?v=etdKnhwWtwA (2013-02-12)
17) https://luplab.cs.ucdavis.edu/2022/06/03/rvcodec-js.html (2022-06-03)
18) https://blog.pimaker.at/texts/rvc1/ (2021-08-25)
19) https://www.pinatafarm.com/memegenerator/4a771794-6d5a-42ef-a1c8-7744545233d8 (Retrieved 2022-11-04)
20) https://knowyourmeme.com/memes/surprised-pikachu
(Retrieved 2022-11-04)
21) https://en.wikipedia.org/wiki/Amdahl%27s_law#/media/File:Optimizing-different-parts.svg (Retrieved 2008-01-10)
22) https://azeria-labs.com/arm-data-types-and-registers-part-2/ (Retrieved 2022-11-04)
23) https://www.youtube.com/watch?v=wU2UoxLtIm8 (2021-02-09)
24) https://knowyourmeme.com/memes/salt-bae (Retrieved 2022-11-05)
25) https://github.com/cnlohr/esp32s2-cookbook (Retrieved 2022-11-06)

--------------------------------------------------------------------------------