├── LICENSE.txt ├── README.md ├── curve25519-m4f-ptrswap-keil.s ├── linux_example.c ├── x25519-cortex-m4-gcc.s ├── x25519-cortex-m4-keil.s └── x25519-cortex-m4.h /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Copyright (c) 2017, Emil Lenngren 2 | 3 | All rights reserved. 4 | 5 | Redistribution and use in source and binary forms, with or without modification, 6 | are permitted provided that the following conditions are met: 7 | 8 | 1. Redistributions of source code must retain the above copyright notice, this 9 | list of conditions and the following disclaimer. 10 | 11 | 2. Redistributions in binary form, except as embedded into a Nordic 12 | Semiconductor ASA or Dialog Semiconductor PLC integrated circuit in a product 13 | or a software update for such product, must reproduce the above copyright 14 | notice, this list of conditions and the following disclaimer in the 15 | documentation and/or other materials provided with the distribution. 16 | 17 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 18 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 19 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 20 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR 21 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 22 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 23 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 24 | ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 25 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 26 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
27 | 28 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # X25519 for ARM Cortex-M4 and other ARM processors 2 | 3 | This implements highly optimized assembly versions of X25519 for ARMv7. It's optimized for Cortex-M4 but works on other ARM processors as well (ARMv7 and newer 32-bit architectures). 4 | 5 | ## X25519 6 | X25519 is an elliptic-curve variant of the Diffie-Hellman key exchange protocol, using Curve25519 as the elliptic curve, as introduced in https://cr.yp.to/ecdh.html. 7 | 8 | ### API 9 | ``` 10 | void X25519_calc_public_key(uint8_t output_public_key[32], const uint8_t input_secret_key[32]); 11 | void X25519_calc_shared_secret(uint8_t output_shared_secret[32], const uint8_t my_secret_key[32], const uint8_t their_public_key[32]); 12 | ``` 13 | 14 | * To use, first generate a 32-byte random value using a cryptographically secure random number generator (specifically, do NOT use `rand()` from the C library); this is your secret key. 15 | * Feed that secret key into `X25519_calc_public_key`, which will give you the corresponding public key, which you then transfer to the other party. The other party does the same. 16 | * When you get the other party's public key, feed it into `X25519_calc_shared_secret` together with your secret key, which will give you the shared secret. Rather than using this shared secret directly, it should be hashed (for example with SHA-256) on both sides before use. For further usage instructions, see the official web site. 17 | 18 | Note that, unlike some other implementations, this library automatically "clamps" the secret key for you (i.e. clears the three lowest bits, clears the highest bit and sets the second-highest bit). 19 | 20 | ### Setup 21 | * The header file `x25519-cortex-m4.h` should be included when using the API from C/C++. 22 | * For Keil, the file `x25519-cortex-m4-keil.s` must be added to the project as a Source file. 
23 | * When compiling with GCC, `x25519-cortex-m4-gcc.s` must be added to the project as a compilation unit. The compiler switch `-march=armv7-a`, `-march=armv7e-m` or similar might be needed, depending on the target architecture. 24 | 25 | ### Example 26 | An example that uses `/dev/urandom` to get random data can be seen in `linux_example.c`. It can be compiled on, for example, a Raspberry Pi 3 with: 27 | ``` 28 | gcc linux_example.c x25519-cortex-m4-gcc.s -o linux_example -march=armv7-a 29 | ``` 30 | 31 | ### Performance 32 | The library uses only 1892 bytes of code space when compiled, uses 368 bytes of stack, and runs one scalar multiplication in 548 873 cycles on Cortex-M4, which is a speed record as far as I know. For a 64 MHz processor, that means less than 9 ms per operation! 33 | 34 | There is also an even more optimized version that uses the FPU, which runs in 476 275 cycles on ARM Cortex-M4F. 35 | 36 | ### Code 37 | The code is written in Keil's assembler format (`x25519-cortex-m4-keil.s`) and was converted to GCC's assembler syntax (`x25519-cortex-m4-gcc.s`) using the following regex command: 38 | `perl -0777 -pe 's/^;/\/\//g;s/(\s);/\1\/\//g;s/export/\.global/g;s/(([a-zA-Z0-9_]+) proc[\W\w]+?)endp/\1\.size \2, \.-\2/g;s/([a-zA-Z0-9_]+) proc/\t\.type \1, %function\n\1:/g;s/end//g;s/(\r?\n)(\d+)(\r?\n)/\1\2:\3/g;s/%b(\d+)/\1b/g;s/%f(\d+)/\1f/g;s/(frame[\W\w]+?\n)/\/\/\1/g;s/area \|([^\|]+)\|[^\n]*\n/\1\n/g;s/align /\.align /g;s/^/\t.syntax unified\n\t.thumb\n/' < x25519-cortex-m4-keil.s > x25519-cortex-m4-gcc.s` 39 | 40 | ### Security 41 | The basic implementation runs in constant time and uses a constant memory access pattern regardless of the private key, in order to protect against side-channel attacks. The FPU version, however, reads data from RAM in a pattern that depends on secret data, so that version is only suited for embedded devices without a data cache, such as Cortex-M4 and Cortex-M33. 
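A constant memory access pattern means that secret-dependent choices are made with arithmetic masking rather than branches or indexed loads. As an illustration only (the function name `cond_swap` is not part of this library's API), a constant-time conditional swap of two 8-word field elements can be sketched in C like this:

```c
#include <stdint.h>

/* Swap the 8-word values a and b if and only if bit == 1,
   using a mask instead of a secret-dependent branch. */
static void cond_swap(uint32_t a[8], uint32_t b[8], uint32_t bit)
{
    uint32_t mask = (uint32_t)0 - bit; /* 0x00000000 or 0xFFFFFFFF */
    for (int i = 0; i < 8; i++) {
        uint32_t t = mask & (a[i] ^ b[i]);
        a[i] ^= t;
        b[i] ^= t;
    }
}
```

The same sequence of loads, stores and XORs executes whether or not the swap happens, so timing and the access pattern reveal nothing about `bit`.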
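The "clamping" mentioned in the API section is the standard X25519 scalar adjustment from RFC 7748; this library applies it internally, so you can pass raw random bytes. A minimal sketch of what it does (the function name `clamp_secret_key` is illustrative, not an exported symbol):

```c
#include <stdint.h>

/* Standard X25519 scalar clamping (RFC 7748): clear the three lowest
   bits, clear the highest bit and set the second-highest bit. */
static void clamp_secret_key(uint8_t key[32])
{
    key[0]  &= 0xF8; /* clear bits 0, 1, 2 */
    key[31] &= 0x7F; /* clear bit 255 */
    key[31] |= 0x40; /* set bit 254 */
}
```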
42 | 43 | ### License 44 | The code is licensed under a 2-clause BSD license, with an extra exception for Nordic Semiconductor and Dialog Semiconductor devices. 45 | -------------------------------------------------------------------------------- /curve25519-m4f-ptrswap-keil.s: -------------------------------------------------------------------------------- 1 | ; Curve25519 scalar multiplication 2 | ; Copyright (c) 2017-2019, Emil Lenngren 3 | ; 4 | ; All rights reserved. 5 | ; 6 | ; Redistribution and use in source and binary forms, with or without modification, 7 | ; are permitted provided that the following conditions are met: 8 | ; 9 | ; 1. Redistributions of source code must retain the above copyright notice, this 10 | ; list of conditions and the following disclaimer. 11 | ; 12 | ; 2. Redistributions in binary form, except as embedded into a Nordic 13 | ; Semiconductor ASA or Dialog Semiconductor PLC integrated circuit in a product 14 | ; or a software update for such product, must reproduce the above copyright 15 | ; notice, this list of conditions and the following disclaimer in the 16 | ; documentation and/or other materials provided with the distribution. 17 | ; 18 | ; THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 19 | ; ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 20 | ; WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 21 | ; DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR 22 | ; ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 23 | ; (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 24 | ; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 25 | ; ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 26 | ; (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 27 | ; SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 28 | 29 | 30 | gbla use_mul_for_sqr 31 | use_mul_for_sqr seta 0 32 | 33 | gbla has_fpu 34 | has_fpu seta 1 35 | 36 | ; This is an armv7 implementation of X25519. 37 | ; It follows the reference implementation where the representation of 38 | ; a field element [0..2^255-19) is represented by a 256-bit little endian integer, 39 | ; reduced modulo 2^256-38, and may possibly be in the range [2^256-38..2^256). 40 | ; The scalar is a 256-bit integer where certain bits are hardcoded per specification. 41 | ; 42 | ; The implementation runs in constant time (measured 476 275 cycles on ARM Cortex-M4F, 43 | ; few wait states), and no conditional branches depend on secret data. 44 | ; 45 | ; This implementation conditionally loads data from different RAM locations depending 46 | ; on secret data, so this version should not be used on CPUs that have data cache, 47 | ; such as Cortex-A53. It's suited for embedded devices running CPUs like Cortex-M4 and 48 | ; Cortex-M33. 
49 | 50 | area |.text|, code, readonly 51 | align 4 52 | 53 | ; input: *r8=a, *r9=b 54 | ; output: r0-r7 55 | ; clobbers all other registers 56 | ; cycles: 45 57 | fe25519_add proc 58 | export fe25519_add 59 | ldr r0,[r8,#28] 60 | ldr r4,[r9,#28] 61 | adds r0,r0,r4 62 | mov r11,#0 63 | adc r11,r11,r11 64 | lsl r11,r11,#1 65 | add r11,r11,r0, lsr #31 66 | movs r7,#19 67 | mul r11,r11,r7 68 | bic r7,r0,#0x80000000 69 | 70 | ldm r8!,{r0-r3} 71 | ldm r9!,{r4-r6,r10} 72 | mov r12,#1 73 | umaal r0,r11,r12,r4 74 | umaal r1,r11,r12,r5 75 | umaal r2,r11,r12,r6 76 | umaal r3,r11,r12,r10 77 | ldm r9,{r4-r6} 78 | ldm r8,{r8-r10} 79 | umaal r4,r11,r12,r8 80 | umaal r5,r11,r12,r9 81 | umaal r6,r11,r12,r10 82 | add r7,r7,r11 83 | bx lr 84 | 85 | endp 86 | 87 | ; input: *r8=a, *r9=b 88 | ; output: r0-r7 89 | ; clobbers all other registers 90 | ; cycles: 46 91 | fe25519_sub proc 92 | export fe25519_sub 93 | 94 | ldm r8,{r0-r7} 95 | ldm r9!,{r8,r10-r12} 96 | subs r0,r8 97 | sbcs r1,r10 98 | sbcs r2,r11 99 | sbcs r3,r12 100 | ldm r9,{r8-r11} 101 | sbcs r4,r8 102 | sbcs r5,r9 103 | sbcs r6,r10 104 | sbcs r7,r11 105 | 106 | ; if subtraction goes below 0, set r8 to -1 and r9 to -38, else set both to 0 107 | sbc r8,r8 108 | and r9,r8,#-38 109 | 110 | adds r0,r9 111 | adcs r1,r8 112 | adcs r2,r8 113 | adcs r3,r8 114 | adcs r4,r8 115 | adcs r5,r8 116 | adcs r6,r8 117 | adcs r7,r8 118 | 119 | ; if the subtraction did not go below 0, we are done and (r8,r9) are set to 0 120 | ; if the subtraction went below 0 and the addition overflowed, we are done, so set (r8,r9) to 0 121 | ; if the subtraction went below 0 and the addition did not overflow, we need to add once more 122 | ; (r8,r9) will be correctly set to (-1,-38) only when r8 was -1 and we don't have a carry, 123 | ; note that the carry will always be 0 in case (r8,r9) was (0,0) since then there was no real addition 124 | ; also note that it is extremely unlikely we will need an extra addition: 125 | ; that can only happen if input1 was 
slightly >= 0 and input2 was > 2^256-38 (really input2-input1 > 2^256-38) 126 | ; in that case we currently have 2^256-38 < (r0...r7) < 2^256, so adding -38 will only affect r0 127 | adcs r8,#0 128 | and r9,r8,#-38 129 | 130 | adds r0,r9 131 | 132 | bx lr 133 | 134 | endp 135 | 136 | if use_mul_for_sqr == 1 137 | ;thumb_func 138 | fe25519_sqr ;label definition 139 | mov r2,r1 140 | ; fallthrough 141 | endif 142 | 143 | if has_fpu == 1 144 | ; input: *r1=a, *r2=b 145 | ; output: r0-r7 146 | ; clobbers all other registers 147 | ; cycles: 154 148 | align 4 ; don't know why but this direction here reduces cycle count with 30 000... 149 | fe25519_mul proc 150 | export fe25519_mul 151 | push {lr} 152 | frame push {lr} 153 | 154 | vmov d2[0],r2 ;s4 155 | vldm r1,{s8-s15} 156 | 157 | ldm r2,{r2,r3,r4,r5} 158 | 159 | vmov r0,r10,s8,s9 160 | umull r6,r1,r2,r0 161 | 162 | umull r7,r12,r3,r0 163 | umaal r7,r1,r2,r10 164 | 165 | vmov s0,s1,r6,r7 166 | 167 | umull r8,r6,r4,r0 168 | umaal r8,r1,r3,r10 169 | 170 | umull r9,r7,r5,r0 171 | umaal r9,r1,r4,r10 172 | 173 | umaal r1,r7,r5,r10 174 | 175 | vmov lr,r0,s10,s11 176 | 177 | umaal r8,r12,r2,lr 178 | umaal r9,r12,r3,lr 179 | umaal r1,r12,r4,lr 180 | umaal r12,r7,r5,lr 181 | 182 | umaal r9,r6,r2,r0 183 | umaal r1,r6,r3,r0 184 | umaal r12,r6,r4,r0 185 | umaal r6,r7,r5,r0 186 | 187 | vmov s2,s3,r8,r9 188 | 189 | vmov r10,lr,s12,s13 190 | 191 | mov r9,#0 192 | umaal r1,r9,r2,r10 193 | umaal r12,r9,r3,r10 194 | umaal r6,r9,r4,r10 195 | umaal r7,r9,r5,r10 196 | 197 | mov r10,#0 198 | umaal r12,r10,r2,lr 199 | umaal r6,r10,r3,lr 200 | umaal r7,r10,r4,lr 201 | umaal r9,r10,r5,lr 202 | 203 | vmov r8,d7[0] ;s14 204 | mov lr,#0 205 | umaal lr,r6,r2,r8 206 | umaal r7,r6,r3,r8 207 | umaal r9,r6,r4,r8 208 | umaal r10,r6,r5,r8 209 | 210 | ;_ _ _ _ _ 6 10 9| 7 | lr 12 1 _ _ _ _ 211 | 212 | vmov r8,d7[1] ;s15 213 | mov r11,#0 214 | umaal r7,r11,r2,r8 215 | umaal r9,r11,r3,r8 216 | umaal r10,r11,r4,r8 217 | umaal r6,r11,r5,r8 218 | 219 | ;_ _ _ 
_ 11 6 10 9| 7 | lr 12 1 _ _ _ _ 220 | 221 | vmov r2,d2[0] ;s4 222 | adds r2,r2,#16 223 | ldm r2,{r2,r3,r4,r5} 224 | 225 | vmov r8,d4[0] ;s8 226 | movs r0,#0 227 | umaal r1,r0,r2,r8 228 | vmov s4,r1 229 | umaal r12,r0,r3,r8 230 | umaal lr,r0,r4,r8 231 | umaal r0,r7,r5,r8 ; 7=carry for 9 232 | 233 | ;_ _ _ _ 11 6 10 9+7| 0 | lr 12 _ _ _ _ _ 234 | 235 | vmov r8,d4[1] ;s9 236 | movs r1,#0 237 | umaal r12,r1,r2,r8 238 | vmov s5,r12 239 | umaal lr,r1,r3,r8 240 | umaal r0,r1,r4,r8 241 | umaal r1,r7,r5,r8 ; 7=carry for 10 242 | 243 | ;_ _ _ _ 11 6 10+7 9+1| 0 | lr _ _ _ _ _ _ 244 | 245 | vmov r8,d5[0] ;s10 246 | mov r12,#0 247 | umaal lr,r12,r2,r8 248 | vmov s6,lr 249 | umaal r0,r12,r3,r8 250 | umaal r1,r12,r4,r8 251 | umaal r10,r12,r5,r8 ; 12=carry for 6 252 | 253 | ;_ _ _ _ 11 6+12 10+7 9+1| 0 | _ _ _ _ _ _ _ 254 | 255 | vmov r8,d5[1] ;s11 256 | mov lr,#0 257 | umaal r0,lr,r2,r8 258 | vmov d3[1],r0 ;s7 259 | umaal r1,lr,r3,r8 260 | umaal r10,lr,r4,r8 261 | umaal r6,lr,r5,r8 ; lr=carry for saved 262 | 263 | ;_ _ _ _ 11+lr 6+12 10+7 9+1| _ | _ _ _ _ _ _ _ 264 | 265 | vmov r0,r8,s12,s13 266 | umaal r1,r9,r2,r0 267 | vmov d4[0],r1 ;s8 268 | umaal r9,r10,r3,r0 269 | umaal r10,r6,r4,r0 270 | umaal r11,r6,r5,r0 ; 6=carry for next 271 | 272 | ;_ _ _ 6 11+lr 10+12 9+7 _ | _ | _ _ _ _ _ _ _ 273 | 274 | umaal r9,r7,r2,r8 275 | umaal r10,r7,r3,r8 276 | umaal r11,r7,r4,r8 277 | umaal r6,r7,r5,r8 278 | 279 | vmov r0,r8,s14,s15 280 | umaal r10,r12,r2,r0 281 | umaal r11,r12,r3,r0 282 | umaal r6,r12,r4,r0 283 | umaal r7,r12,r5,r0 284 | 285 | umaal r11,lr,r2,r8 286 | umaal r6,lr,r3,r8 287 | umaal lr,r7,r4,r8 288 | umaal r7,r12,r5,r8 289 | 290 | ; 12 7 lr 6 11 10 9 fpu*9 291 | 292 | ;now reduce 293 | 294 | vmov r4,r5,s7,s8 295 | movs r3,#38 296 | mov r8,#0 297 | umaal r4,r8,r3,r12 298 | lsl r8,r8,#1 299 | orr r8,r8,r4, lsr #31 300 | and r12,r4,#0x7fffffff 301 | movs r4,#19 302 | mul r8,r8,r4 303 | 304 | vmov r0,r1,s0,s1 305 | vmov r2,d1[0] ;s2 306 | umaal r0,r8,r3,r5 307 | umaal 
r1,r8,r3,r9 308 | umaal r2,r8,r3,r10 309 | mov r9,#38 310 | vmov r3,r4,s3,s4 311 | umaal r3,r8,r9,r11 312 | umaal r4,r8,r9,r6 313 | vmov r5,r6,s5,s6 314 | umaal r5,r8,r9,lr 315 | umaal r6,r8,r9,r7 316 | add r7,r8,r12 317 | 318 | pop {pc} 319 | 320 | endp 321 | 322 | if use_mul_for_sqr == 0 323 | ; input/result in (r0-r7) 324 | ; clobbers all other registers 325 | ; cycles: 106 326 | fe25519_sqr proc 327 | export fe25519_sqr 328 | push {lr} 329 | frame push {lr} 330 | 331 | ;mul 01, 00 332 | umull r9,r10,r0,r0 333 | umull r11,r12,r0,r1 334 | adds r11,r11,r11 335 | mov lr,#0 336 | umaal r10,r11,lr,lr 337 | 338 | ;r9 r10 done 339 | ;r12 carry for 3rd before col 340 | ;r11+C carry for 3rd final col 341 | 342 | vmov s0,s1,r9,r10 343 | 344 | ;mul 02, 11 345 | mov r8,#0 346 | umaal r8,r12,r0,r2 347 | adcs r8,r8,r8 348 | umaal r8,r11,r1,r1 349 | 350 | ;r8 done (3rd col) 351 | ;r12 carry for 4th before col 352 | ;r11+C carry for 4th final col 353 | 354 | ;mul 03, 12 355 | umull r9,r10,r0,r3 356 | umaal r9,r12,r1,r2 357 | adcs r9,r9,r9 358 | umaal r9,r11,lr,lr 359 | 360 | ;r9 done (4th col) 361 | ;r10+r12 carry for 5th before col 362 | ;r11+C carry for 5th final col 363 | 364 | vmov s2,s3,r8,r9 365 | 366 | ;mul 04, 13, 22 367 | mov r9,#0 368 | umaal r9,r10,r0,r4 369 | umaal r9,r12,r1,r3 370 | adcs r9,r9,r9 371 | umaal r9,r11,r2,r2 372 | 373 | ;r9 done (5th col) 374 | ;r10+r12 carry for 6th before col 375 | ;r11+C carry for 6th final col 376 | 377 | vmov s4,r9 378 | 379 | ;mul 05, 14, 23 380 | umull r9,r8,r0,r5 381 | umaal r9,r10,r1,r4 382 | umaal r9,r12,r2,r3 383 | adcs r9,r9,r9 384 | umaal r9,r11,lr,lr 385 | 386 | ;r9 done (6th col) 387 | ;r10+r12+r8 carry for 7th before col 388 | ;r11+C carry for 7th final col 389 | 390 | vmov s5,r9 391 | 392 | ;mul 06, 15, 24, 33 393 | mov r9,#0 394 | umaal r9,r8,r1,r5 395 | umaal r9,r12,r2,r4 396 | umaal r9,r10,r0,r6 397 | adcs r9,r9,r9 398 | umaal r9,r11,r3,r3 399 | 400 | ;r9 done (7th col) 401 | ;r8+r10+r12 carry for 8th before col 402 
| ;r11+C carry for 8th final col 403 | 404 | vmov s6,r9 405 | 406 | ;mul 07, 16, 25, 34 407 | umull r0,r9,r0,r7 408 | umaal r0,r10,r1,r6 409 | umaal r0,r12,r2,r5 410 | umaal r0,r8,r3,r4 411 | adcs r0,r0,r0 412 | umaal r0,r11,lr,lr 413 | 414 | ;r0 done (8th col) 415 | ;r9+r8+r10+r12 carry for 9th before col 416 | ;r11+C carry for 9th final col 417 | 418 | ;mul 17, 26, 35, 44 419 | umaal r9,r8,r1,r7 ;r1 is now dead 420 | umaal r9,r10,r2,r6 421 | umaal r12,r9,r3,r5 422 | adcs r12,r12,r12 423 | umaal r11,r12,r4,r4 424 | 425 | ;r11 done (9th col) 426 | ;r8+r10+r9 carry for 10th before col 427 | ;r12+C carry for 10th final col 428 | 429 | ;mul 27, 36, 45 430 | umaal r9,r8,r2,r7 ;r2 is now dead 431 | umaal r10,r9,r3,r6 432 | movs r2,#0 433 | umaal r10,r2,r4,r5 434 | adcs r10,r10,r10 435 | umaal r12,r10,lr,lr 436 | 437 | ;r12 done (10th col) 438 | ;r8+r9+r2 carry for 11th before col 439 | ;r10+C carry for 11th final col 440 | 441 | ;mul 37, 46, 55 442 | umaal r2,r8,r3,r7 ;r3 is now dead 443 | umaal r9,r2,r4,r6 444 | adcs r9,r9,r9 445 | umaal r10,r9,r5,r5 446 | 447 | ;r10 done (11th col) 448 | ;r8+r2 carry for 12th before col 449 | ;r9+C carry for 12th final col 450 | 451 | ;mul 47, 56 452 | movs r3,#0 453 | umaal r3,r8,r4,r7 ;r4 is now dead 454 | umaal r3,r2,r5,r6 455 | adcs r3,r3,r3 456 | umaal r9,r3,lr,lr 457 | 458 | ;r9 done (12th col) 459 | ;r8+r2 carry for 13th before col 460 | ;r3+C carry for 13th final col 461 | 462 | ;mul 57, 66 463 | umaal r8,r2,r5,r7 ;r5 is now dead 464 | adcs r8,r8,r8 465 | umaal r3,r8,r6,r6 466 | 467 | ;r3 done (13th col) 468 | ;r2 carry for 14th before col 469 | ;r8+C carry for 14th final col 470 | 471 | ;mul 67 472 | umull r4,r5,lr,lr ; set 0 473 | umaal r4,r2,r6,r7 474 | adcs r4,r4,r4 475 | umaal r4,r8,lr,lr 476 | 477 | ;r4 done (14th col) 478 | ;r2 carry for 15th before col 479 | ;r8+C carry for 15th final col 480 | 481 | ;mul 77 482 | adcs r2,r2,r2 483 | umaal r8,r2,r7,r7 484 | adcs r2,r2,lr 485 | 486 | ;r8 done (15th col) 487 | ;r2 done 
(16th col) 488 | 489 | ;msb -> lsb: r2 r8 r4 r3 r9 r10 r12 r11 r0 s6 s5 s4 s3 s2 s1 s0 490 | ;lr: 0 491 | ;now do reduction 492 | 493 | movs r6,#38 494 | umaal r0,lr,r6,r2 495 | lsl lr,lr,#1 496 | orr lr,lr,r0, lsr #31 497 | and r7,r0,#0x7fffffff 498 | movs r5,#19 499 | mul lr,lr,r5 500 | 501 | vmov r0,r1,s0,s1 502 | umaal r0,lr,r6,r11 503 | umaal r1,lr,r6,r12 504 | 505 | mov r11,r3 506 | mov r12,r4 507 | 508 | vmov r2,r3,s2,s3 509 | vmov r4,r5,s4,s5 510 | umaal r2,lr,r6,r10 511 | umaal r3,lr,r6,r9 512 | umaal r4,lr,r6,r11 513 | umaal r5,lr,r6,r12 514 | 515 | vmov r6,s6 516 | mov r12,#38 517 | umaal r6,lr,r12,r8 518 | add r7,r7,lr 519 | 520 | pop {pc} 521 | 522 | endp 523 | endif 524 | 525 | else 526 | 527 | ; input: *r1=a, *r2=b 528 | ; output: r0-r7 529 | ; clobbers all other registers 530 | ; cycles: 173 531 | fe25519_mul proc 532 | export fe25519_mul 533 | push {r2,lr} 534 | frame push {lr} 535 | frame address sp,8 536 | 537 | sub sp,#28 538 | frame address sp,36 539 | ldm r2,{r2,r3,r4,r5} 540 | 541 | ldm r1!,{r0,r10,lr} 542 | umull r6,r11,r2,r0 543 | 544 | umull r7,r12,r3,r0 545 | umaal r7,r11,r2,r10 546 | 547 | push {r6,r7} 548 | frame address sp,44 549 | 550 | umull r8,r6,r4,r0 551 | umaal r8,r11,r3,r10 552 | 553 | umull r9,r7,r5,r0 554 | umaal r9,r11,r4,r10 555 | 556 | umaal r11,r7,r5,r10 557 | 558 | umaal r8,r12,r2,lr 559 | umaal r9,r12,r3,lr 560 | umaal r11,r12,r4,lr 561 | umaal r12,r7,r5,lr 562 | 563 | ldm r1!,{r0,r10,lr} 564 | 565 | umaal r9,r6,r2,r0 566 | umaal r11,r6,r3,r0 567 | umaal r12,r6,r4,r0 568 | umaal r6,r7,r5,r0 569 | 570 | strd r8,r9,[sp,#8] 571 | 572 | mov r9,#0 573 | umaal r11,r9,r2,r10 574 | umaal r12,r9,r3,r10 575 | umaal r6,r9,r4,r10 576 | umaal r7,r9,r5,r10 577 | 578 | mov r10,#0 579 | umaal r12,r10,r2,lr 580 | umaal r6,r10,r3,lr 581 | umaal r7,r10,r4,lr 582 | umaal r9,r10,r5,lr 583 | 584 | ldr r8,[r1],#4 585 | mov lr,#0 586 | umaal lr,r6,r2,r8 587 | umaal r7,r6,r3,r8 588 | umaal r9,r6,r4,r8 589 | umaal r10,r6,r5,r8 590 | 591 | ;_ _ _ 
_ _ 6 10 9| 7 | lr 12 11 _ _ _ _ 592 | 593 | ldr r8,[r1],#-28 594 | mov r0,#0 595 | umaal r7,r0,r2,r8 596 | umaal r9,r0,r3,r8 597 | umaal r10,r0,r4,r8 598 | umaal r6,r0,r5,r8 599 | 600 | push {r0} 601 | frame address sp,48 602 | 603 | ;_ _ _ _ s 6 10 9| 7 | lr 12 11 _ _ _ _ 604 | 605 | ldr r2,[sp,#40] 606 | adds r2,r2,#16 607 | ldm r2,{r2,r3,r4,r5} 608 | 609 | ldr r8,[r1],#4 610 | mov r0,#0 611 | umaal r11,r0,r2,r8 612 | str r11,[sp,#16+4] 613 | umaal r12,r0,r3,r8 614 | umaal lr,r0,r4,r8 615 | umaal r0,r7,r5,r8 ; 7=carry for 9 616 | 617 | ;_ _ _ _ s 6 10 9+7| 0 | lr 12 _ _ _ _ _ 618 | 619 | ldr r8,[r1],#4 620 | mov r11,#0 621 | umaal r12,r11,r2,r8 622 | str r12,[sp,#20+4] 623 | umaal lr,r11,r3,r8 624 | umaal r0,r11,r4,r8 625 | umaal r11,r7,r5,r8 ; 7=carry for 10 626 | 627 | ;_ _ _ _ s 6 10+7 9+11| 0 | lr _ _ _ _ _ _ 628 | 629 | ldr r8,[r1],#4 630 | mov r12,#0 631 | umaal lr,r12,r2,r8 632 | str lr,[sp,#24+4] 633 | umaal r0,r12,r3,r8 634 | umaal r11,r12,r4,r8 635 | umaal r10,r12,r5,r8 ; 12=carry for 6 636 | 637 | ;_ _ _ _ s 6+12 10+7 9+11| 0 | _ _ _ _ _ _ _ 638 | 639 | ldr r8,[r1],#4 640 | mov lr,#0 641 | umaal r0,lr,r2,r8 642 | str r0,[sp,#28+4] 643 | umaal r11,lr,r3,r8 644 | umaal r10,lr,r4,r8 645 | umaal r6,lr,r5,r8 ; lr=carry for saved 646 | 647 | ;_ _ _ _ s+lr 6+12 10+7 9+11| _ | _ _ _ _ _ _ _ 648 | 649 | ldm r1!,{r0,r8} 650 | umaal r11,r9,r2,r0 651 | str r11,[sp,#32+4] 652 | umaal r9,r10,r3,r0 653 | umaal r10,r6,r4,r0 654 | pop {r11} 655 | frame address sp,44 656 | umaal r11,r6,r5,r0 ; 6=carry for next 657 | 658 | ;_ _ _ 6 11+lr 10+12 9+7 _ | _ | _ _ _ _ _ _ _ 659 | 660 | umaal r9,r7,r2,r8 661 | umaal r10,r7,r3,r8 662 | umaal r11,r7,r4,r8 663 | umaal r6,r7,r5,r8 664 | 665 | ldm r1!,{r0,r8} 666 | umaal r10,r12,r2,r0 667 | umaal r11,r12,r3,r0 668 | umaal r6,r12,r4,r0 669 | umaal r7,r12,r5,r0 670 | 671 | umaal r11,lr,r2,r8 672 | umaal r6,lr,r3,r8 673 | umaal lr,r7,r4,r8 674 | umaal r7,r12,r5,r8 675 | 676 | ; 12 7 lr 6 11 10 9 stack*9 677 | 678 | ;now reduce 679 | 
680 | ldrd r4,r5,[sp,#28] 681 | movs r3,#38 682 | mov r8,#0 683 | umaal r4,r8,r3,r12 684 | lsl r8,r8,#1 685 | orr r8,r8,r4, lsr #31 686 | and r12,r4,#0x7fffffff 687 | movs r4,#19 688 | mul r8,r8,r4 689 | 690 | pop {r0-r2} 691 | frame address sp,32 692 | umaal r0,r8,r3,r5 693 | umaal r1,r8,r3,r9 694 | umaal r2,r8,r3,r10 695 | mov r9,#38 696 | pop {r3,r4} 697 | frame address sp,24 698 | umaal r3,r8,r9,r11 699 | umaal r4,r8,r9,r6 700 | pop {r5,r6} 701 | frame address sp,16 702 | umaal r5,r8,r9,lr 703 | umaal r6,r8,r9,r7 704 | add r7,r8,r12 705 | 706 | add sp,#12 707 | frame address sp,4 708 | pop {pc} 709 | 710 | endp 711 | 712 | if use_mul_for_sqr == 0 713 | ; input/result in (r0-r7) 714 | ; clobbers all other registers 715 | ; cycles: 115 716 | fe25519_sqr proc 717 | export fe25519_sqr 718 | push {lr} 719 | frame push {lr} 720 | sub sp,#20 721 | frame address sp,24 722 | 723 | ;mul 01, 00 724 | umull r9,r10,r0,r0 725 | umull r11,r12,r0,r1 726 | adds r11,r11,r11 727 | mov lr,#0 728 | umaal r10,r11,lr,lr 729 | 730 | ;r9 r10 done 731 | ;r12 carry for 3rd before col 732 | ;r11+C carry for 3rd final col 733 | 734 | push {r9,r10} 735 | frame address sp,32 736 | 737 | ;mul 02, 11 738 | mov r8,#0 739 | umaal r8,r12,r0,r2 740 | adcs r8,r8,r8 741 | umaal r8,r11,r1,r1 742 | 743 | ;r8 done (3rd col) 744 | ;r12 carry for 4th before col 745 | ;r11+C carry for 4th final col 746 | 747 | ;mul 03, 12 748 | umull r9,r10,r0,r3 749 | umaal r9,r12,r1,r2 750 | adcs r9,r9,r9 751 | umaal r9,r11,lr,lr 752 | 753 | ;r9 done (4th col) 754 | ;r10+r12 carry for 5th before col 755 | ;r11+C carry for 5th final col 756 | 757 | strd r8,r9,[sp,#8] 758 | 759 | ;mul 04, 13, 22 760 | mov r9,#0 761 | umaal r9,r10,r0,r4 762 | umaal r9,r12,r1,r3 763 | adcs r9,r9,r9 764 | umaal r9,r11,r2,r2 765 | 766 | ;r9 done (5th col) 767 | ;r10+r12 carry for 6th before col 768 | ;r11+C carry for 6th final col 769 | 770 | str r9,[sp,#16] 771 | 772 | ;mul 05, 14, 23 773 | umull r9,r8,r0,r5 774 | umaal r9,r10,r1,r4 775 | 
umaal r9,r12,r2,r3 776 | adcs r9,r9,r9 777 | umaal r9,r11,lr,lr 778 | 779 | ;r9 done (6th col) 780 | ;r10+r12+r8 carry for 7th before col 781 | ;r11+C carry for 7th final col 782 | 783 | str r9,[sp,#20] 784 | 785 | ;mul 06, 15, 24, 33 786 | mov r9,#0 787 | umaal r9,r8,r1,r5 788 | umaal r9,r12,r2,r4 789 | umaal r9,r10,r0,r6 790 | adcs r9,r9,r9 791 | umaal r9,r11,r3,r3 792 | 793 | ;r9 done (7th col) 794 | ;r8+r10+r12 carry for 8th before col 795 | ;r11+C carry for 8th final col 796 | 797 | str r9,[sp,#24] 798 | 799 | ;mul 07, 16, 25, 34 800 | umull r0,r9,r0,r7 801 | umaal r0,r10,r1,r6 802 | umaal r0,r12,r2,r5 803 | umaal r0,r8,r3,r4 804 | adcs r0,r0,r0 805 | umaal r0,r11,lr,lr 806 | 807 | ;r0 done (8th col) 808 | ;r9+r8+r10+r12 carry for 9th before col 809 | ;r11+C carry for 9th final col 810 | 811 | ;mul 17, 26, 35, 44 812 | umaal r9,r8,r1,r7 ;r1 is now dead 813 | umaal r9,r10,r2,r6 814 | umaal r12,r9,r3,r5 815 | adcs r12,r12,r12 816 | umaal r11,r12,r4,r4 817 | 818 | ;r11 done (9th col) 819 | ;r8+r10+r9 carry for 10th before col 820 | ;r12+C carry for 10th final col 821 | 822 | ;mul 27, 36, 45 823 | umaal r9,r8,r2,r7 ;r2 is now dead 824 | umaal r10,r9,r3,r6 825 | movs r2,#0 826 | umaal r10,r2,r4,r5 827 | adcs r10,r10,r10 828 | umaal r12,r10,lr,lr 829 | 830 | ;r12 done (10th col) 831 | ;r8+r9+r2 carry for 11th before col 832 | ;r10+C carry for 11th final col 833 | 834 | ;mul 37, 46, 55 835 | umaal r2,r8,r3,r7 ;r3 is now dead 836 | umaal r9,r2,r4,r6 837 | adcs r9,r9,r9 838 | umaal r10,r9,r5,r5 839 | 840 | ;r10 done (11th col) 841 | ;r8+r2 carry for 12th before col 842 | ;r9+C carry for 12th final col 843 | 844 | ;mul 47, 56 845 | movs r3,#0 846 | umaal r3,r8,r4,r7 ;r4 is now dead 847 | umaal r3,r2,r5,r6 848 | adcs r3,r3,r3 849 | umaal r9,r3,lr,lr 850 | 851 | ;r9 done (12th col) 852 | ;r8+r2 carry for 13th before col 853 | ;r3+C carry for 13th final col 854 | 855 | ;mul 57, 66 856 | umaal r8,r2,r5,r7 ;r5 is now dead 857 | adcs r8,r8,r8 858 | umaal r3,r8,r6,r6 859 | 860 
| ;r3 done (13th col) 861 | ;r2 carry for 14th before col 862 | ;r8+C carry for 14th final col 863 | 864 | ;mul 67 865 | umull r4,r5,lr,lr ; set 0 866 | umaal r4,r2,r6,r7 867 | adcs r4,r4,r4 868 | umaal r4,r8,lr,lr 869 | 870 | ;r4 done (14th col) 871 | ;r2 carry for 15th before col 872 | ;r8+C carry for 15th final col 873 | 874 | ;mul 77 875 | adcs r2,r2,r2 876 | umaal r8,r2,r7,r7 877 | adcs r2,r2,lr 878 | 879 | ;r8 done (15th col) 880 | ;r2 done (16th col) 881 | 882 | ;msb -> lsb: r2 r8 r4 r3 r9 r10 r12 r11 r0 sp+24 sp+20 sp+16 sp+12 sp+8 sp+4 sp 883 | ;lr: 0 884 | ;now do reduction 885 | 886 | mov r6,#38 887 | umaal r0,lr,r6,r2 888 | lsl lr,lr,#1 889 | orr lr,lr,r0, lsr #31 890 | and r7,r0,#0x7fffffff 891 | movs r5,#19 892 | mul lr,lr,r5 893 | 894 | pop {r0,r1} 895 | frame address sp,24 896 | umaal r0,lr,r6,r11 897 | umaal r1,lr,r6,r12 898 | 899 | mov r11,r3 900 | mov r12,r4 901 | 902 | pop {r2,r3,r4,r5} 903 | frame address sp,8 904 | umaal r2,lr,r6,r10 905 | umaal r3,lr,r6,r9 906 | 907 | umaal r4,lr,r6,r11 908 | umaal r5,lr,r6,r12 909 | 910 | pop {r6} 911 | frame address sp,4 912 | mov r12,#38 913 | umaal r6,lr,r12,r8 914 | add r7,r7,lr 915 | 916 | pop {pc} 917 | 918 | endp 919 | endif 920 | endif 921 | 922 | ; in: r0-r7, count: r8 923 | ; out: r0-r7 + sets result also to top of stack 924 | ; clobbers all other registers 925 | ; cycles: 19 + 114*n 926 | fe25519_sqr_many proc 927 | export fe25519_sqr_many 928 | push {r8,lr} 929 | frame push {r8,lr} 930 | 0 931 | bl fe25519_sqr 932 | 933 | ldr r8,[sp,#0] 934 | subs r8,r8,#1 935 | str r8,[sp,#0] 936 | bne %b0 937 | 938 | add sp,sp,#4 939 | frame address sp,4 940 | add r8,sp,#4 941 | stm r8,{r0-r7} 942 | pop {pc} 943 | endp 944 | 945 | ; This kind of load supports unaligned access 946 | ; in: *r1 947 | ; out: r0-r7 948 | ; cycles: 22 949 | loadm proc 950 | ldr r0,[r1,#0] 951 | ldr r2,[r1,#8] 952 | ldr r3,[r1,#12] 953 | ldr r4,[r1,#16] 954 | ldr r5,[r1,#20] 955 | ldr r6,[r1,#24] 956 | ldr r7,[r1,#28] 957 | ldr 
r1,[r1,#4] 958 | bx lr 959 | endp 960 | 961 | ; in: *r0 = result, *r1 = scalar, *r2 = basepoint (all pointers may be unaligned) 962 | ; cycles: 475 469 963 | curve25519_scalarmult proc 964 | export curve25519_scalarmult 965 | 966 | ; stack layout: xp zp xq zq x0 bitpos lastbit cswap? scalar result_ptr r4-r11,lr 967 | ; 0 32 64 96 128 160 161 162 164 196 200 968 | 969 | push {r0,r4-r11,lr} 970 | frame push {r4-r11,lr} 971 | frame address sp,40 972 | 973 | mov r10,r2 974 | bl loadm 975 | 976 | and r0,r0,#0xfffffff8 977 | ;and r7,r7,#0x7fffffff not needed since we don't inspect the msb anyway 978 | orr r7,r7,#0x40000000 979 | push {r0-r7} 980 | frame address sp,72 981 | mov r8,#0 982 | 983 | ;ldm r1,{r0-r7} 984 | mov r1,r10 985 | bl loadm 986 | 987 | and r7,r7,#0x7fffffff 988 | push {r0-r8} 989 | frame address sp,108 990 | 991 | movs r9,#1 992 | umull r10,r11,r8,r8 993 | mov r12,r8 994 | push {r8,r10,r11,r12} 995 | frame address sp,124 996 | push {r9,r10,r11,r12} 997 | frame address sp,140 998 | 999 | push {r0-r7} 1000 | frame address sp,172 1001 | 1002 | umull r6,r7,r8,r8 1003 | push {r6,r7,r8,r10,r11,r12} 1004 | frame address sp,196 1005 | push {r6,r7,r8,r10,r11,r12} 1006 | frame address sp,220 1007 | push {r9,r10,r11,r12} 1008 | frame address sp,236 1009 | 1010 | movs r0,#254 1011 | movs r3,#0 1012 | ; 127 cycles so far 1013 | 0 1014 | ; load scalar bit into r1 1015 | lsrs r1,r0,#5 1016 | adds r2,sp,#164 1017 | ldr r1,[r2,r1,lsl #2] 1018 | and r4,r0,#0x1f 1019 | lsrs r1,r1,r4 1020 | and r1,r1,#1 1021 | 1022 | strb r0,[sp,#160] 1023 | strb r1,[sp,#161] 1024 | 1025 | eors r1,r1,r3 1026 | strb r1,[sp,#162] 1027 | 1028 | ; A = X2 + Z2 1029 | add r8,sp,r1, lsl #6 1030 | add r9,r8,#32 1031 | bl fe25519_add 1032 | push {r0-r7} 1033 | frame address sp,268 1034 | 1035 | ; AA = A^2 1036 | bl fe25519_sqr 1037 | push {r0-r7} 1038 | frame address sp,300 1039 | 1040 | ; D = X3 - Z3 1041 | ldrb r0,[sp,#162+64] 1042 | add r8,sp,#128 1043 | sub r8,r8,r0, lsl #6 1044 | add r9,r8,#32 
1045 | bl fe25519_sub 1046 | push {r0-r7} 1047 | frame address sp,332 1048 | 1049 | ; DA = D * A 1050 | mov r1,sp 1051 | add r2,sp,#64 1052 | bl fe25519_mul 1053 | add r8,sp,#64 1054 | stm r8,{r0-r7} 1055 | 1056 | ; B = X2 - Z2 1057 | ldrb r0,[sp,#162+96] 1058 | add r8,sp,#96 1059 | add r8,r8,r0, lsl #6 1060 | add r9,r8,#32 1061 | bl fe25519_sub 1062 | stm sp,{r0-r7} 1063 | 1064 | ; BB = B^2 1065 | bl fe25519_sqr 1066 | push {r0-r7} 1067 | frame address sp,364 1068 | 1069 | ; C = X3 + Z3 1070 | ldrb r0,[sp,#162+128] 1071 | add r8,sp,#192 1072 | sub r8,r8,r0, lsl #6 1073 | add r9,r8,#32 1074 | bl fe25519_add 1075 | add r8,sp,#128 1076 | stm r8,{r0-r7} 1077 | 1078 | ; CB = C * B 1079 | add r1,sp,#128 1080 | add r2,sp,#32 1081 | bl fe25519_mul 1082 | add r8,sp,#32 1083 | stm r8,{r0-r7} 1084 | 1085 | ; X2 = BB * AA 1086 | mov r1,sp 1087 | add r2,sp,#64 1088 | bl fe25519_mul 1089 | add r8,sp,#128 1090 | stm r8,{r0-r7} 1091 | 1092 | ; E = AA - BB 1093 | add r8,sp,#64 1094 | mov r9,sp 1095 | bl fe25519_sub 1096 | add r8,sp,#64 1097 | stm r8,{r0-r7} 1098 | 1099 | ; 134 + 2*45 + 3*46 + 3*154 + 2*106 = 1036 cycles 1100 | 1101 | ; T1 = BB + ((a+2)/4) * E 1102 | ;multiplies (r0-r7) with 121666, adds *sp and puts the result on the top of the stack (replacing old content) 1103 | ldr lr,=121666 1104 | ;mov lr,#56130 1105 | ;add lr,lr,#65536 1106 | ldr r12,[sp,#28] 1107 | mov r11,#0 1108 | umaal r12,r11,lr,r7 1109 | lsl r11,r11,#1 1110 | add r11,r11,r12, lsr #31 1111 | movs r7,#19 1112 | mul r11,r11,r7 1113 | bic r7,r12,#0x80000000 1114 | ldm sp!,{r8,r9,r10,r12} 1115 | frame address sp,348 1116 | umaal r8,r11,lr,r0 1117 | umaal r9,r11,lr,r1 1118 | umaal r10,r11,lr,r2 1119 | umaal r12,r11,lr,r3 1120 | ldm sp!,{r0,r1,r2} 1121 | frame address sp,336 1122 | umaal r0,r11,lr,r4 1123 | umaal r1,r11,lr,r5 1124 | umaal r2,r11,lr,r6 1125 | add r7,r7,r11 1126 | add sp,sp,#4 1127 | frame address sp,332 1128 | push {r0,r1,r2,r7} 1129 | frame address sp,348 1130 | push {r8,r9,r10,r12} 1131 | 
frame address sp,364 1132 | ; 39 cycles 1133 | 1134 | ; Z2 = T1 * E 1135 | mov r1,sp 1136 | add r2,sp,#64 1137 | bl fe25519_mul 1138 | add r8,sp,#160 1139 | stm r8,{r0-r7} 1140 | 1141 | ; X3 = (DA + CB)^2 1142 | add r8,sp,#96 1143 | add r9,sp,#32 1144 | bl fe25519_add 1145 | bl fe25519_sqr 1146 | add r8,sp,#192 1147 | stm r8,{r0-r7} 1148 | 1149 | ; T2 = (DA - CB)^2 1150 | add r8,sp,#96 1151 | add r9,sp,#32 1152 | bl fe25519_sub 1153 | bl fe25519_sqr 1154 | stm sp,{r0-r7} 1155 | 1156 | ; Z3 = T2 * X1 1157 | mov r1,sp 1158 | add r2,sp,#256 1159 | bl fe25519_mul 1160 | add r8,sp,#224 1161 | stm r8,{r0-r7} 1162 | 1163 | add sp,sp,#128 1164 | frame address sp,236 1165 | 1166 | ldrb r2,[sp,#160] 1167 | ldrb r3,[sp,#161] 1168 | subs r0,r2,#1 1169 | ; 92 + 1*45 + 1*46 + 2*154 + 2*106 = 703 cycles 1170 | bpl %b0 1171 | ; in total 1742 cycles per iteration, in total 444 210 cycles for 255 iterations 1172 | 1173 | ;These cswap lines are not needed for curve25519 since the lowest bit is hardcoded to 0 1174 | ;---------- 1175 | ;rsbs lr,r3,#0 1176 | 1177 | ;mov r0,sp 1178 | ;add r1,sp,#64 1179 | 1180 | ;mov r11,#4 1181 | ;1 1182 | ;ldm r0,{r2-r5} 1183 | ;ldm r1!,{r6-r9} 1184 | 1185 | ;eors r2,r2,r6 1186 | ;and r10,r2,lr 1187 | ;eors r6,r6,r10 1188 | ;eors r2,r2,r6 1189 | 1190 | ;eors r3,r3,r7 1191 | ;and r10,r3,lr 1192 | ;eors r7,r7,r10 1193 | ;eors r3,r3,r7 1194 | 1195 | ;eors r4,r4,r8 1196 | ;and r10,r4,lr 1197 | ;eors r8,r8,r10 1198 | ;eors r4,r4,r8 1199 | 1200 | ;eors r5,r5,r9 1201 | ;and r10,r5,lr 1202 | ;eors r9,r9,r10 1203 | ;eors r5,r5,r9 1204 | 1205 | ;stm r0!,{r2-r5} 1206 | 1207 | ;subs r11,#1 1208 | ;bne %b1 1209 | ;---------- 1210 | 1211 | ; now we must invert zp 1212 | add r0,sp,#32 1213 | ldm r0,{r0-r7} 1214 | bl fe25519_sqr 1215 | push {r0-r7} 1216 | frame address sp,268 1217 | 1218 | bl fe25519_sqr 1219 | bl fe25519_sqr 1220 | push {r0-r7} 1221 | frame address sp,300 1222 | 1223 | add r1,sp,#96 1224 | mov r2,sp 1225 | bl fe25519_mul 1226 | stm sp,{r0-r7} 1227 | 
1228 | mov r1,sp 1229 | add r2,sp,#32 1230 | bl fe25519_mul 1231 | add r8,sp,#32 1232 | stm r8,{r0-r7} 1233 | 1234 | ; current stack: z^(2^9) z^(2^11) x z 1235 | 1236 | bl fe25519_sqr 1237 | push {r0-r7} 1238 | frame address sp,332 1239 | 1240 | mov r1,sp 1241 | add r2,sp,#32 1242 | bl fe25519_mul 1243 | add r8,sp,#32 1244 | stm r8,{r0-r7} 1245 | 1246 | ; current stack: _ z^(2^5 - 2^0) z^(2^11) x z 1247 | 1248 | mov r8,#5 1249 | ; 959 cycles 1250 | bl fe25519_sqr_many ; 589 cycles 1251 | 1252 | mov r1,sp 1253 | add r2,sp,#32 1254 | bl fe25519_mul 1255 | add r8,sp,#32 1256 | stm r8,{r0-r7} 1257 | 1258 | ; current stack: _ z^(2^10 - 2^0) z^(2^11) x z ... 1259 | 1260 | movs r8,#10 1261 | bl fe25519_sqr_many ; 1159 cycles 1262 | ;z^(2^20 - 2^10) 1263 | 1264 | mov r1,sp 1265 | add r2,sp,#32 1266 | bl fe25519_mul 1267 | stm sp,{r0-r7} 1268 | ;z^(2^20 - 2^0) 1269 | 1270 | ; current stack: z^(2^20 - 2^0) z^(2^10 - 2^0) z^(2^11) x z ... 1271 | 1272 | movs r8,#20 1273 | sub sp,sp,#32 1274 | frame address sp,364 1275 | bl fe25519_sqr_many ; 2299 cycles 1276 | ;z^(2^40 - 2^20) 1277 | 1278 | mov r1,sp 1279 | add r2,sp,#32 1280 | bl fe25519_mul 1281 | add sp,sp,#32 1282 | frame address sp,332 1283 | ;z^(2^40 - 2^0) 1284 | 1285 | movs r8,#10 1286 | bl fe25519_sqr_many ; 1159 cycles 1287 | ;z^(2^50 - 2^10) 1288 | 1289 | mov r1,sp 1290 | add r2,sp,#32 1291 | bl fe25519_mul 1292 | add r8,sp,#32 1293 | stm r8,{r0-r7} 1294 | 1295 | ; current stack: _ z^(2^50 - 2^0) z^(2^11) x z ... 1296 | 1297 | movs r8,#50 1298 | bl fe25519_sqr_many ; 5719 cycles 1299 | ;z^(2^100 - 2^50) 1300 | 1301 | mov r1,sp 1302 | add r2,sp,#32 1303 | bl fe25519_mul 1304 | stm sp,{r0-r7} 1305 | 1306 | ; 13751/12708 cycles so far for inversion 1307 | 1308 | ; current stack: z^(2^100 - 2^0) z^(2^50 - 2^0) z^(2^11) x z ... 
1309 | 1310 | movs r8,#100 1311 | sub sp,sp,#32 1312 | frame address sp,364 1313 | bl fe25519_sqr_many ; 11419 cycles 1314 | ;z^(2^200 - 2^100) 1315 | 1316 | mov r1,sp 1317 | add r2,sp,#32 1318 | bl fe25519_mul 1319 | add sp,sp,#32 1320 | frame address sp,332 1321 | ;z^(2^200 - 2^0) 1322 | 1323 | ; current stack: _ z^(2^50 - 2^0) z^(2^11) x z ... 1324 | 1325 | movs r8,#50 1326 | bl fe25519_sqr_many ; 5719 cycles 1327 | ;z^(2^250 - 2^50) 1328 | 1329 | mov r1,sp 1330 | add r2,sp,#32 1331 | bl fe25519_mul 1332 | ;z^(2^250 - 2^0) 1333 | 1334 | movs r8,#5 1335 | bl fe25519_sqr_many ; 589 cycles 1336 | ;z^(2^255 - 2^5) 1337 | 1338 | mov r1,sp 1339 | add r2,sp,#64 1340 | bl fe25519_mul 1341 | stm sp,{r0-r7} 1342 | ;z^(2^255 - 21) 1343 | 1344 | ; 19661/18209 for second half of inversion 1345 | 1346 | ; done inverting! 1347 | ; total inversion cost: 33412/30917 cycles 1348 | 1349 | mov r1,sp 1350 | add r2,sp,#96 1351 | bl fe25519_mul 1352 | 1353 | ; now final reduce 1354 | lsr r8,r7,#31 1355 | mov r9,#19 1356 | mul r8,r8,r9 1357 | mov r10,#0 1358 | 1359 | ; handle the case when 2^255 - 19 <= x < 2^255 1360 | add r8,r8,#19 1361 | 1362 | adds r8,r0,r8 1363 | adcs r8,r1,r10 1364 | adcs r8,r2,r10 1365 | adcs r8,r3,r10 1366 | adcs r8,r4,r10 1367 | adcs r8,r5,r10 1368 | adcs r8,r6,r10 1369 | adcs r8,r7,r10 1370 | adcs r11,r10,r10 1371 | 1372 | lsr r8,r8,#31 1373 | orr r8,r8,r11, lsl #1 1374 | mul r8,r8,r9 1375 | 1376 | ldr r9,[sp,#292] 1377 | 1378 | adds r0,r0,r8 1379 | str r0,[r9,#0] 1380 | movs r0,#0 1381 | adcs r1,r1,r0 1382 | str r1,[r9,#4] 1383 | mov r1,r9 1384 | adcs r2,r2,r0 1385 | adcs r3,r3,r0 1386 | adcs r4,r4,r0 1387 | adcs r5,r5,r0 1388 | adcs r6,r6,r0 1389 | adcs r7,r7,r0 1390 | and r7,r7,#0x7fffffff 1391 | 1392 | str r2,[r1,#8] 1393 | str r3,[r1,#12] 1394 | str r4,[r1,#16] 1395 | str r5,[r1,#20] 1396 | str r6,[r1,#24] 1397 | str r7,[r1,#28] 1398 | 1399 | add sp,sp,#296 1400 | frame address sp,36 1401 | 1402 | pop {r4-r11,pc} 1403 | 1404 | ; 215 cycles after 
inversion 1405 | ; in total for whole function 475 469 cycles theoretically 1406 | 1407 | endp 1408 | 1409 | end 1410 | -------------------------------------------------------------------------------- /linux_example.c: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | #include <stdlib.h> 3 | #include <string.h> 4 | 5 | #include <fcntl.h> 6 | #include <unistd.h> 7 | #include <errno.h> 8 | 9 | #include "x25519-cortex-m4.h" 10 | 11 | static int rand_fd = -1; 12 | 13 | static void init_rand(void) { 14 | rand_fd = open("/dev/urandom", O_RDONLY); 15 | if (rand_fd < 0) { 16 | perror("opening /dev/urandom"); 17 | exit(1); 18 | } 19 | } 20 | 21 | static void get_random_bytes(unsigned char* buf, int len) { 22 | if (rand_fd == -1) { 23 | fprintf(stderr, "rand_fd not initialized\n"); 24 | exit(1); 25 | } 26 | 27 | int nread = 0; 28 | while (len) { 29 | int nbytes = read(rand_fd, buf + nread, len); 30 | if (nbytes < 0) { 31 | if (errno == EINTR) { 32 | continue; 33 | } 34 | perror("get_random_bytes"); 35 | exit(1); 36 | } 37 | if (nbytes == 0) { 38 | fprintf(stderr, "rand_fd closed\n"); 39 | exit(1); 40 | } 41 | nread += nbytes; 42 | len -= nbytes; 43 | } 44 | } 45 | 46 | int main() { 47 | unsigned char secret_key_alice[32], secret_key_bob[32]; 48 | unsigned char public_key_alice[32], public_key_bob[32]; 49 | unsigned char shared_secret_alice[32], shared_secret_bob[32]; 50 | 51 | init_rand(); 52 | 53 | // Alice computes 54 | get_random_bytes(secret_key_alice, 32); 55 | X25519_calc_public_key(public_key_alice, secret_key_alice); 56 | 57 | // Bob computes 58 | get_random_bytes(secret_key_bob, 32); 59 | X25519_calc_public_key(public_key_bob, secret_key_bob); 60 | 61 | // The public keys are now exchanged over some protocol 62 | 63 | // Alice computes 64 | X25519_calc_shared_secret(shared_secret_alice, secret_key_alice, public_key_bob); 65 | 66 | // Bob computes 67 | X25519_calc_shared_secret(shared_secret_bob, secret_key_bob, public_key_alice); 68 | 69 | if (memcmp(shared_secret_alice,
shared_secret_bob, 32) == 0) { 70 | puts("SUCCESS: Both Bob and Alice computed the same shared secret"); 71 | } else { 72 | puts("FAILED: Bob and Alice did not compute the same shared secret"); 73 | exit(1); 74 | } 75 | 76 | return 0; 77 | } 78 | -------------------------------------------------------------------------------- /x25519-cortex-m4-gcc.s: -------------------------------------------------------------------------------- 1 | .syntax unified 2 | .thumb 3 | // Curve25519 scalar multiplication 4 | // Copyright (c) 2017, Emil Lenngren 5 | // 6 | // All rights reserved. 7 | // 8 | // Redistribution and use in source and binary forms, with or without modification, 9 | // are permitted provided that the following conditions are met: 10 | // 11 | // 1. Redistributions of source code must retain the above copyright notice, this 12 | // list of conditions and the following disclaimer. 13 | // 14 | // 2. Redistributions in binary form, except as embedded into a Nordic 15 | // Semiconductor ASA or Dialog Semiconductor PLC integrated circuit in a product 16 | // or a software update for such product, must reproduce the above copyright 17 | // notice, this list of conditions and the following disclaimer in the 18 | // documentation and/or other materials provided with the distribution. 19 | // 20 | // THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 21 | // ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 22 | // WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | // DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR 24 | // ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 25 | // (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 26 | // LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 27 | // ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 28 | // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 29 | // SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 30 | 31 | 32 | 33 | // This is an armv7 implementation of X25519. 34 | // It follows the reference implementation where the representation of 35 | // a field element [0..2^255-19) is represented by a 256-bit little endian integer, 36 | // reduced modulo 2^256-38, and may possibly be in the range [2^256-38..2^256). 37 | // The scalar is a 256-bit integer where certain bits are hardcoded per specification. 38 | // 39 | // The implementation runs in constant time (548 873 cycles on ARM Cortex-M4, 40 | // assuming no wait states), and no conditional branches or memory access 41 | // pattern depend on secret data.
42 | 43 | .text 44 | .align 2 45 | 46 | // input: *r8=a, *r9=b 47 | // output: r0-r7 48 | // clobbers all other registers 49 | // cycles: 45 50 | .type fe25519_add, %function 51 | fe25519_add: 52 | .global fe25519_add 53 | ldr r0,[r8,#28] 54 | ldr r4,[r9,#28] 55 | adds r0,r0,r4 56 | mov r11,#0 57 | adc r11,r11,r11 58 | lsl r11,r11,#1 59 | add r11,r11,r0, lsr #31 60 | movs r7,#19 61 | mul r11,r11,r7 62 | bic r7,r0,#0x80000000 63 | 64 | ldm r8!,{r0-r3} 65 | ldm r9!,{r4-r6,r10} 66 | mov r12,#1 67 | umaal r0,r11,r12,r4 68 | umaal r1,r11,r12,r5 69 | umaal r2,r11,r12,r6 70 | umaal r3,r11,r12,r10 71 | ldm r9,{r4-r6} 72 | ldm r8,{r8-r10} 73 | umaal r4,r11,r12,r8 74 | umaal r5,r11,r12,r9 75 | umaal r6,r11,r12,r10 76 | add r7,r7,r11 77 | bx lr 78 | 79 | .size fe25519_add, .-fe25519_add 80 | 81 | // input: *r8=a, *r9=b 82 | // output: r0-r7 83 | // clobbers all other registers 84 | // cycles: 46 85 | .type fe25519_sub, %function 86 | fe25519_sub: 87 | .global fe25519_sub 88 | 89 | ldm r8,{r0-r7} 90 | ldm r9!,{r8,r10-r12} 91 | subs r0,r8 92 | sbcs r1,r10 93 | sbcs r2,r11 94 | sbcs r3,r12 95 | ldm r9,{r8-r11} 96 | sbcs r4,r8 97 | sbcs r5,r9 98 | sbcs r6,r10 99 | sbcs r7,r11 100 | 101 | // if subtraction goes below 0, set r8 to -1 and r9 to -38, else set both to 0 102 | sbc r8,r8 103 | and r9,r8,#-38 104 | 105 | adds r0,r9 106 | adcs r1,r8 107 | adcs r2,r8 108 | adcs r3,r8 109 | adcs r4,r8 110 | adcs r5,r8 111 | adcs r6,r8 112 | adcs r7,r8 113 | 114 | // if the subtraction did not go below 0, we are done and (r8,r9) are set to 0 115 | // if the subtraction went below 0 and the addition overflowed, we are done, so set (r8,r9) to 0 116 | // if the subtraction went below 0 and the addition did not overflow, we need to add once more 117 | // (r8,r9) will be correctly set to (-1,-38) only when r8 was -1 and we don't have a carry, 118 | // note that the carry will always be 0 in case (r8,r9) was (0,0) since then there was no real addition 119 | // also note that it is extremely 
unlikely we will need an extra addition: 120 | // that can only happen if input1 was slightly >= 0 and input2 was > 2^256-38 (really input2-input1 > 2^256-38) 121 | // in that case we currently have 2^256-38 < (r0...r7) < 2^256, so adding -38 will only affect r0 122 | adcs r8,#0 123 | and r9,r8,#-38 124 | 125 | adds r0,r9 126 | 127 | bx lr 128 | 129 | .size fe25519_sub, .-fe25519_sub 130 | 131 | // input: *r1=a, *r2=b 132 | // output: r0-r7 133 | // clobbers all other registers 134 | // cycles: 173 135 | .type fe25519_mul, %function 136 | fe25519_mul: 137 | .global fe25519_mul 138 | push {r2,lr} 139 | //frame push {lr} 140 | //frame address sp,8 141 | 142 | sub sp,#28 143 | //frame address sp,36 144 | ldm r2,{r2,r3,r4,r5} 145 | 146 | ldm r1!,{r0,r10,lr} 147 | umull r6,r11,r2,r0 148 | 149 | umull r7,r12,r3,r0 150 | umaal r7,r11,r2,r10 151 | 152 | push {r6,r7} 153 | //frame address sp,44 154 | 155 | umull r8,r6,r4,r0 156 | umaal r8,r11,r3,r10 157 | 158 | umull r9,r7,r5,r0 159 | umaal r9,r11,r4,r10 160 | 161 | umaal r11,r7,r5,r10 162 | 163 | umaal r8,r12,r2,lr 164 | umaal r9,r12,r3,lr 165 | umaal r11,r12,r4,lr 166 | umaal r12,r7,r5,lr 167 | 168 | ldm r1!,{r0,r10,lr} 169 | 170 | umaal r9,r6,r2,r0 171 | umaal r11,r6,r3,r0 172 | umaal r12,r6,r4,r0 173 | umaal r6,r7,r5,r0 174 | 175 | strd r8,r9,[sp,#8] 176 | 177 | mov r9,#0 178 | umaal r11,r9,r2,r10 179 | umaal r12,r9,r3,r10 180 | umaal r6,r9,r4,r10 181 | umaal r7,r9,r5,r10 182 | 183 | mov r10,#0 184 | umaal r12,r10,r2,lr 185 | umaal r6,r10,r3,lr 186 | umaal r7,r10,r4,lr 187 | umaal r9,r10,r5,lr 188 | 189 | ldr r8,[r1],#4 190 | mov lr,#0 191 | umaal lr,r6,r2,r8 192 | umaal r7,r6,r3,r8 193 | umaal r9,r6,r4,r8 194 | umaal r10,r6,r5,r8 195 | 196 | //_ _ _ _ _ 6 10 9| 7 | lr 12 11 _ _ _ _ 197 | 198 | ldr r8,[r1],#-28 199 | mov r0,#0 200 | umaal r7,r0,r2,r8 201 | umaal r9,r0,r3,r8 202 | umaal r10,r0,r4,r8 203 | umaal r6,r0,r5,r8 204 | 205 | push {r0} 206 | //frame address sp,48 207 | 208 | //_ _ _ _ s 6 10 9| 7 | lr 12 11 _ _ 
_ _ 209 | 210 | ldr r2,[sp,#40] 211 | adds r2,r2,#16 212 | ldm r2,{r2,r3,r4,r5} 213 | 214 | ldr r8,[r1],#4 215 | mov r0,#0 216 | umaal r11,r0,r2,r8 217 | str r11,[sp,#16+4] 218 | umaal r12,r0,r3,r8 219 | umaal lr,r0,r4,r8 220 | umaal r0,r7,r5,r8 // 7=carry for 9 221 | 222 | //_ _ _ _ s 6 10 9+7| 0 | lr 12 _ _ _ _ _ 223 | 224 | ldr r8,[r1],#4 225 | mov r11,#0 226 | umaal r12,r11,r2,r8 227 | str r12,[sp,#20+4] 228 | umaal lr,r11,r3,r8 229 | umaal r0,r11,r4,r8 230 | umaal r11,r7,r5,r8 // 7=carry for 10 231 | 232 | //_ _ _ _ s 6 10+7 9+11| 0 | lr _ _ _ _ _ _ 233 | 234 | ldr r8,[r1],#4 235 | mov r12,#0 236 | umaal lr,r12,r2,r8 237 | str lr,[sp,#24+4] 238 | umaal r0,r12,r3,r8 239 | umaal r11,r12,r4,r8 240 | umaal r10,r12,r5,r8 // 12=carry for 6 241 | 242 | //_ _ _ _ s 6+12 10+7 9+11| 0 | _ _ _ _ _ _ _ 243 | 244 | ldr r8,[r1],#4 245 | mov lr,#0 246 | umaal r0,lr,r2,r8 247 | str r0,[sp,#28+4] 248 | umaal r11,lr,r3,r8 249 | umaal r10,lr,r4,r8 250 | umaal r6,lr,r5,r8 // lr=carry for saved 251 | 252 | //_ _ _ _ s+lr 6+12 10+7 9+11| _ | _ _ _ _ _ _ _ 253 | 254 | ldm r1!,{r0,r8} 255 | umaal r11,r9,r2,r0 256 | str r11,[sp,#32+4] 257 | umaal r9,r10,r3,r0 258 | umaal r10,r6,r4,r0 259 | pop {r11} 260 | //frame address sp,44 261 | umaal r11,r6,r5,r0 // 6=carry for next 262 | 263 | //_ _ _ 6 11+lr 10+12 9+7 _ | _ | _ _ _ _ _ _ _ 264 | 265 | umaal r9,r7,r2,r8 266 | umaal r10,r7,r3,r8 267 | umaal r11,r7,r4,r8 268 | umaal r6,r7,r5,r8 269 | 270 | ldm r1!,{r0,r8} 271 | umaal r10,r12,r2,r0 272 | umaal r11,r12,r3,r0 273 | umaal r6,r12,r4,r0 274 | umaal r7,r12,r5,r0 275 | 276 | umaal r11,lr,r2,r8 277 | umaal r6,lr,r3,r8 278 | umaal lr,r7,r4,r8 279 | umaal r7,r12,r5,r8 280 | 281 | // 12 7 lr 6 11 10 9 stack*9 282 | 283 | //now reduce 284 | 285 | ldrd r4,r5,[sp,#28] 286 | movs r3,#38 287 | mov r8,#0 288 | umaal r4,r8,r3,r12 289 | lsl r8,r8,#1 290 | orr r8,r8,r4, lsr #31 291 | and r12,r4,#0x7fffffff 292 | movs r4,#19 293 | mul r8,r8,r4 294 | 295 | pop {r0-r2} 296 | //frame address sp,32 297 | 
umaal r0,r8,r3,r5 298 | umaal r1,r8,r3,r9 299 | umaal r2,r8,r3,r10 300 | mov r9,#38 301 | pop {r3,r4} 302 | //frame address sp,24 303 | umaal r3,r8,r9,r11 304 | umaal r4,r8,r9,r6 305 | pop {r5,r6} 306 | //frame address sp,16 307 | umaal r5,r8,r9,lr 308 | umaal r6,r8,r9,r7 309 | add r7,r8,r12 310 | 311 | add sp,#12 312 | //frame address sp,4 313 | pop {pc} 314 | 315 | .size fe25519_mul, .-fe25519_mul 316 | 317 | // input/result in (r0-r7) 318 | // clobbers all other registers 319 | // cycles: 115 320 | .type fe25519_sqr, %function 321 | fe25519_sqr: 322 | .global fe25519_sqr 323 | push {lr} 324 | //frame push {lr} 325 | sub sp,#20 326 | //frame address sp,24 327 | 328 | //mul 01, 00 329 | umull r9,r10,r0,r0 330 | umull r11,r12,r0,r1 331 | adds r11,r11,r11 332 | mov lr,#0 333 | umaal r10,r11,lr,lr 334 | 335 | //r9 r10 done 336 | //r12 carry for 3rd before col 337 | //r11+C carry for 3rd final col 338 | 339 | push {r9,r10} 340 | //frame address sp,32 341 | 342 | //mul 02, 11 343 | mov r8,#0 344 | umaal r8,r12,r0,r2 345 | adcs r8,r8,r8 346 | umaal r8,r11,r1,r1 347 | 348 | //r8 done (3rd col) 349 | //r12 carry for 4th before col 350 | //r11+C carry for 4th final col 351 | 352 | //mul 03, 12 353 | umull r9,r10,r0,r3 354 | umaal r9,r12,r1,r2 355 | adcs r9,r9,r9 356 | umaal r9,r11,lr,lr 357 | 358 | //r9 done (4th col) 359 | //r10+r12 carry for 5th before col 360 | //r11+C carry for 5th final col 361 | 362 | strd r8,r9,[sp,#8] 363 | 364 | //mul 04, 13, 22 365 | mov r9,#0 366 | umaal r9,r10,r0,r4 367 | umaal r9,r12,r1,r3 368 | adcs r9,r9,r9 369 | umaal r9,r11,r2,r2 370 | 371 | //r9 done (5th col) 372 | //r10+r12 carry for 6th before col 373 | //r11+C carry for 6th final col 374 | 375 | str r9,[sp,#16] 376 | 377 | //mul 05, 14, 23 378 | umull r9,r8,r0,r5 379 | umaal r9,r10,r1,r4 380 | umaal r9,r12,r2,r3 381 | adcs r9,r9,r9 382 | umaal r9,r11,lr,lr 383 | 384 | //r9 done (6th col) 385 | //r10+r12+r8 carry for 7th before col 386 | //r11+C carry for 7th final col 387 | 388 | str 
r9,[sp,#20] 389 | 390 | //mul 06, 15, 24, 33 391 | mov r9,#0 392 | umaal r9,r8,r1,r5 393 | umaal r9,r12,r2,r4 394 | umaal r9,r10,r0,r6 395 | adcs r9,r9,r9 396 | umaal r9,r11,r3,r3 397 | 398 | //r9 done (7th col) 399 | //r8+r10+r12 carry for 8th before col 400 | //r11+C carry for 8th final col 401 | 402 | str r9,[sp,#24] 403 | 404 | //mul 07, 16, 25, 34 405 | umull r0,r9,r0,r7 406 | umaal r0,r10,r1,r6 407 | umaal r0,r12,r2,r5 408 | umaal r0,r8,r3,r4 409 | adcs r0,r0,r0 410 | umaal r0,r11,lr,lr 411 | 412 | //r0 done (8th col) 413 | //r9+r8+r10+r12 carry for 9th before col 414 | //r11+C carry for 9th final col 415 | 416 | //mul 17, 26, 35, 44 417 | umaal r9,r8,r1,r7 //r1 is now dead 418 | umaal r9,r10,r2,r6 419 | umaal r12,r9,r3,r5 420 | adcs r12,r12,r12 421 | umaal r11,r12,r4,r4 422 | 423 | //r11 done (9th col) 424 | //r8+r10+r9 carry for 10th before col 425 | //r12+C carry for 10th final col 426 | 427 | //mul 27, 36, 45 428 | umaal r9,r8,r2,r7 //r2 is now dead 429 | umaal r10,r9,r3,r6 430 | movs r2,#0 431 | umaal r10,r2,r4,r5 432 | adcs r10,r10,r10 433 | umaal r12,r10,lr,lr 434 | 435 | //r12 done (10th col) 436 | //r8+r9+r2 carry for 11th before col 437 | //r10+C carry for 11th final col 438 | 439 | //mul 37, 46, 55 440 | umaal r2,r8,r3,r7 //r3 is now dead 441 | umaal r9,r2,r4,r6 442 | adcs r9,r9,r9 443 | umaal r10,r9,r5,r5 444 | 445 | //r10 done (11th col) 446 | //r8+r2 carry for 12th before col 447 | //r9+C carry for 12th final col 448 | 449 | //mul 47, 56 450 | movs r3,#0 451 | umaal r3,r8,r4,r7 //r4 is now dead 452 | umaal r3,r2,r5,r6 453 | adcs r3,r3,r3 454 | umaal r9,r3,lr,lr 455 | 456 | //r9 done (12th col) 457 | //r8+r2 carry for 13th before col 458 | //r3+C carry for 13th final col 459 | 460 | //mul 57, 66 461 | umaal r8,r2,r5,r7 //r5 is now dead 462 | adcs r8,r8,r8 463 | umaal r3,r8,r6,r6 464 | 465 | //r3 done (13th col) 466 | //r2 carry for 14th before col 467 | //r8+C carry for 14th final col 468 | 469 | //mul 67 470 | umull r4,r5,lr,lr // set 0 471 | 
umaal r4,r2,r6,r7 472 | adcs r4,r4,r4 473 | umaal r4,r8,lr,lr 474 | 475 | //r4 done (14th col) 476 | //r2 carry for 15th before col 477 | //r8+C carry for 15th final col 478 | 479 | //mul 77 480 | adcs r2,r2,r2 481 | umaal r8,r2,r7,r7 482 | adcs r2,r2,lr 483 | 484 | //r8 done (15th col) 485 | //r2 done (16th col) 486 | 487 | //msb -> lsb: r2 r8 r4 r3 r9 r10 r12 r11 r0 sp+24 sp+20 sp+16 sp+12 sp+8 sp+4 sp 488 | //lr: 0 489 | //now do reduction 490 | 491 | mov r6,#38 492 | umaal r0,lr,r6,r2 493 | lsl lr,lr,#1 494 | orr lr,lr,r0, lsr #31 495 | and r7,r0,#0x7fffffff 496 | movs r5,#19 497 | mul lr,lr,r5 498 | 499 | pop {r0,r1} 500 | //frame address sp,24 501 | umaal r0,lr,r6,r11 502 | umaal r1,lr,r6,r12 503 | 504 | mov r11,r3 505 | mov r12,r4 506 | 507 | pop {r2,r3,r4,r5} 508 | //frame address sp,8 509 | umaal r2,lr,r6,r10 510 | umaal r3,lr,r6,r9 511 | 512 | umaal r4,lr,r6,r11 513 | umaal r5,lr,r6,r12 514 | 515 | pop {r6} 516 | //frame address sp,4 517 | mov r12,#38 518 | umaal r6,lr,r12,r8 519 | add r7,r7,lr 520 | 521 | pop {pc} 522 | 523 | .size fe25519_sqr, .-fe25519_sqr 524 | 525 | // in: r0-r7, count: r8 526 | // out: r0-r7 + sets result also to top of stack 527 | // clobbers all other registers 528 | // cycles: 19 + 123*n 529 | .type fe25519_sqr_many, %function 530 | fe25519_sqr_many: 531 | .global fe25519_sqr_many 532 | push {r8,lr} 533 | //frame push {r8,lr} 534 | 0: 535 | bl fe25519_sqr 536 | 537 | ldr r8,[sp,#0] 538 | subs r8,r8,#1 539 | str r8,[sp,#0] 540 | bne 0b 541 | 542 | add sp,sp,#4 543 | //frame address sp,4 544 | add r8,sp,#4 545 | stm r8,{r0-r7} 546 | pop {pc} 547 | .size fe25519_sqr_many, .-fe25519_sqr_many 548 | 549 | // This kind of load supports unaligned access 550 | // in: *r1 551 | // out: r0-r7 552 | // cycles: 22 553 | .type loadm, %function 554 | loadm: 555 | ldr r0,[r1,#0] 556 | ldr r2,[r1,#8] 557 | ldr r3,[r1,#12] 558 | ldr r4,[r1,#16] 559 | ldr r5,[r1,#20] 560 | ldr r6,[r1,#24] 561 | ldr r7,[r1,#28] 562 | ldr r1,[r1,#4] 563 | bx lr 564 | 
.size loadm, .-loadm 565 | 566 | // in: *r0 = result, *r1 = scalar, *r2 = basepoint (all pointers may be unaligned) 567 | // cycles: 548 873 568 | .type curve25519_scalarmult, %function 569 | curve25519_scalarmult: 570 | .global curve25519_scalarmult 571 | 572 | // stack layout: xp zp xq zq x0 bitpos lastbit scalar result_ptr r4-r11,lr 573 | // 0 32 64 96 128 160 164 168 200 204 574 | 575 | push {r0,r4-r11,lr} 576 | //frame push {r4-r11,lr} 577 | //frame address sp,40 578 | 579 | mov r10,r2 580 | bl loadm 581 | 582 | and r0,r0,#0xfffffff8 583 | //and r7,r7,#0x7fffffff not needed since we don't inspect the msb anyway 584 | orr r7,r7,#0x40000000 585 | push {r0-r7} 586 | //frame address sp,72 587 | movs r8,#0 588 | push {r2,r8} 589 | //frame address sp,80 590 | 591 | //ldm r1,{r0-r7} 592 | mov r1,r10 593 | bl loadm 594 | 595 | and r7,r7,#0x7fffffff 596 | push {r0-r7} 597 | //frame address sp,112 598 | 599 | movs r9,#1 600 | umull r10,r11,r8,r8 601 | mov r12,#0 602 | push {r8,r10,r11,r12} 603 | //frame address sp,128 604 | push {r9,r10,r11,r12} 605 | //frame address sp,144 606 | 607 | push {r0-r7} 608 | //frame address sp,176 609 | 610 | umull r6,r7,r8,r8 611 | push {r6,r7,r8,r10,r11,r12} 612 | //frame address sp,200 613 | push {r6,r7,r8,r10,r11,r12} 614 | //frame address sp,224 615 | push {r9,r10,r11,r12} 616 | //frame address sp,240 617 | 618 | movs r0,#254 619 | movs r3,#0 620 | // 129 cycles so far 621 | 0: 622 | // load scalar bit into r1 623 | lsrs r1,r0,#5 624 | adds r2,sp,#168 625 | ldr r1,[r2,r1,lsl #2] 626 | and r4,r0,#0x1f 627 | lsrs r1,r1,r4 628 | and r1,r1,#1 629 | 630 | strd r0,r1,[sp,#160] 631 | 632 | eors r1,r1,r3 633 | rsbs lr,r1,#0 634 | 635 | mov r0,sp 636 | add r1,sp,#64 637 | 638 | mov r11,#4 639 | // 15 cycles 640 | 1: 641 | ldm r0,{r2-r5} 642 | ldm r1,{r6-r9} 643 | 644 | eors r2,r2,r6 645 | and r10,r2,lr 646 | eors r6,r6,r10 647 | eors r2,r2,r6 648 | 649 | eors r3,r3,r7 650 | and r10,r3,lr 651 | eors r7,r7,r10 652 | eors r3,r3,r7 653 | 654 | eors 
r4,r4,r8 655 | and r10,r4,lr 656 | eors r8,r8,r10 657 | eors r4,r4,r8 658 | 659 | eors r5,r5,r9 660 | and r10,r5,lr 661 | eors r9,r9,r10 662 | eors r5,r5,r9 663 | 664 | stm r0!,{r2-r5} 665 | stm r1!,{r6-r9} 666 | 667 | subs r11,#1 668 | bne 1b 669 | // 40*4 - 2 = 158 cycles 670 | 671 | mov r8,sp 672 | add r9,sp,#32 673 | bl fe25519_add 674 | push {r0-r7} 675 | //frame address sp,272 676 | 677 | bl fe25519_sqr 678 | push {r0-r7} 679 | //frame address sp,304 680 | 681 | add r8,sp,#64 682 | add r9,sp,#96 683 | bl fe25519_sub 684 | push {r0-r7} 685 | //frame address sp,336 686 | 687 | bl fe25519_sqr 688 | push {r0-r7} 689 | //frame address sp,368 690 | 691 | mov r1,sp 692 | add r2,sp,#64 693 | bl fe25519_mul 694 | add r8,sp,#128 695 | stm r8,{r0-r7} 696 | 697 | add r8,sp,#64 698 | mov r9,sp 699 | bl fe25519_sub 700 | add r8,sp,#64 701 | stm r8,{r0-r7} 702 | 703 | // 64 + 1*45 + 2*46 + 1*173 + 2*115 = 604 cycles 704 | 705 | //multiplies (r0-r7) with 121666, adds *sp and puts the result on the top of the stack (replacing old content) 706 | ldr lr,=121666 707 | //mov lr,#56130 708 | //add lr,lr,#65536 709 | ldr r12,[sp,#28] 710 | mov r11,#0 711 | umaal r12,r11,lr,r7 712 | lsl r11,r11,#1 713 | add r11,r11,r12, lsr #31 714 | movs r7,#19 715 | mul r11,r11,r7 716 | bic r7,r12,#0x80000000 717 | ldm sp!,{r8,r9,r10,r12} 718 | //frame address sp,352 719 | umaal r8,r11,lr,r0 720 | umaal r9,r11,lr,r1 721 | umaal r10,r11,lr,r2 722 | umaal r12,r11,lr,r3 723 | ldm sp!,{r0,r1,r2} 724 | //frame address sp,340 725 | umaal r0,r11,lr,r4 726 | umaal r1,r11,lr,r5 727 | umaal r2,r11,lr,r6 728 | add r7,r7,r11 729 | add sp,sp,#4 730 | //frame address sp,338 731 | push {r0,r1,r2,r7} 732 | //frame address sp,352 733 | push {r8,r9,r10,r12} 734 | //frame address sp,368 735 | // 39 cycles 736 | 737 | mov r1,sp 738 | add r2,sp,#64 739 | bl fe25519_mul 740 | add r8,sp,#160 741 | stm r8,{r0-r7} 742 | 743 | add r8,sp,#192 744 | add r9,sp,#224 745 | bl fe25519_add 746 | stm sp,{r0-r7} 747 | 748 | mov 
r1,sp 749 | add r2,sp,#32 750 | bl fe25519_mul 751 | add r8,sp,#32 752 | stm r8,{r0-r7} 753 | 754 | add r8,sp,#192 755 | add r9,sp,#224 756 | bl fe25519_sub 757 | stm sp,{r0-r7} 758 | 759 | mov r1,sp 760 | add r2,sp,#96 761 | bl fe25519_mul 762 | stm sp,{r0-r7} 763 | 764 | mov r8,sp 765 | add r9,sp,#32 766 | bl fe25519_add 767 | 768 | bl fe25519_sqr 769 | 770 | add r8,sp,#192 771 | stm r8,{r0-r7} 772 | 773 | mov r8,sp 774 | add r9,sp,#32 775 | bl fe25519_sub 776 | 777 | bl fe25519_sqr 778 | stm sp,{r0-r7} 779 | 780 | mov r1,sp 781 | add r2,sp,#256 782 | bl fe25519_mul 783 | add r8,sp,#224 784 | stm r8,{r0-r7} 785 | 786 | add sp,sp,#128 787 | //frame address sp,240 788 | 789 | ldrd r2,r3,[sp,#160] 790 | subs r0,r2,#1 791 | // 97 + 2*45 + 2*46 + 4*173 + 2*115 = 1201 cycles 792 | bpl 0b 793 | // in total 2020 cycles per iteration, in total 515 098 cycles for 255 iterations 794 | 795 | //These cswap lines are not needed for curve25519 since the lowest bit is hardcoded to 0 796 | //---------- 797 | //rsbs lr,r3,#0 798 | 799 | //mov r0,sp 800 | //add r1,sp,#64 801 | 802 | //mov r11,#4 803 | //1 804 | //ldm r0,{r2-r5} 805 | //ldm r1!,{r6-r9} 806 | 807 | //eors r2,r2,r6 808 | //and r10,r2,lr 809 | //eors r6,r6,r10 810 | //eors r2,r2,r6 811 | 812 | //eors r3,r3,r7 813 | //and r10,r3,lr 814 | //eors r7,r7,r10 815 | //eors r3,r3,r7 816 | 817 | //eors r4,r4,r8 818 | //and r10,r4,lr 819 | //eors r8,r8,r10 820 | //eors r4,r4,r8 821 | 822 | //eors r5,r5,r9 823 | //and r10,r5,lr 824 | //eors r9,r9,r10 825 | //eors r5,r5,r9 826 | 827 | //stm r0!,{r2-r5} 828 | 829 | //subs r11,#1 830 | //bne 1b 831 | //---------- 832 | 833 | // now we must invert zp 834 | add r0,sp,#32 835 | ldm r0,{r0-r7} 836 | bl fe25519_sqr 837 | push {r0-r7} 838 | //frame address sp,272 839 | 840 | bl fe25519_sqr 841 | bl fe25519_sqr 842 | push {r0-r7} 843 | //frame address sp,304 844 | 845 | add r1,sp,#96 846 | mov r2,sp 847 | bl fe25519_mul 848 | stm sp,{r0-r7} 849 | 850 | mov r1,sp 851 | add r2,sp,#32 852 | 
bl fe25519_mul 853 | add r8,sp,#32 854 | stm r8,{r0-r7} 855 | 856 | // current stack: z^(2^9) z^(2^11) x z 857 | 858 | bl fe25519_sqr 859 | push {r0-r7} 860 | //frame address sp,336 861 | 862 | mov r1,sp 863 | add r2,sp,#32 864 | bl fe25519_mul 865 | add r8,sp,#32 866 | stm r8,{r0-r7} 867 | 868 | // current stack: _ z^(2^5 - 2^0) z^(2^11) x z 869 | 870 | mov r8,#5 871 | // 1052 cycles 872 | bl fe25519_sqr_many // 634 cycles 873 | 874 | mov r1,sp 875 | add r2,sp,#32 876 | bl fe25519_mul 877 | add r8,sp,#32 878 | stm r8,{r0-r7} 879 | 880 | // current stack: _ z^(2^10 - 2^0) z^(2^11) x z ... 881 | 882 | movs r8,#10 883 | bl fe25519_sqr_many // 1249 cycles 884 | //z^(2^20 - 2^10) 885 | 886 | mov r1,sp 887 | add r2,sp,#32 888 | bl fe25519_mul 889 | stm sp,{r0-r7} 890 | //z^(2^20 - 2^0) 891 | 892 | // current stack: z^(2^20 - 2^0) z^(2^10 - 2^0) z^(2^11) x z ... 893 | 894 | movs r8,#20 895 | sub sp,sp,#32 896 | //frame address sp,368 897 | bl fe25519_sqr_many // 2479 cycles 898 | //z^(2^40 - 2^20) 899 | 900 | mov r1,sp 901 | add r2,sp,#32 902 | bl fe25519_mul 903 | add sp,sp,#32 904 | //frame address sp,336 905 | //z^(2^40 - 2^0) 906 | 907 | movs r8,#10 908 | bl fe25519_sqr_many // 1249 cycles 909 | //z^(2^50 - 2^10) 910 | 911 | mov r1,sp 912 | add r2,sp,#32 913 | bl fe25519_mul 914 | add r8,sp,#32 915 | stm r8,{r0-r7} 916 | 917 | // current stack: _ z^(2^50 - 2^0) z^(2^11) x z ... 918 | 919 | movs r8,#50 920 | bl fe25519_sqr_many // 6169 cycles 921 | //z^(2^100 - 2^50) 922 | 923 | mov r1,sp 924 | add r2,sp,#32 925 | bl fe25519_mul 926 | stm sp,{r0-r7} 927 | 928 | // 13751 cycles so far for inversion 929 | 930 | // current stack: z^(2^100 - 2^0) z^(2^50 - 2^0) z^(2^11) x z ... 
931 | 932 | movs r8,#100 933 | sub sp,sp,#32 934 | //frame address sp,368 935 | bl fe25519_sqr_many // 12319 cycles 936 | //z^(2^200 - 2^100) 937 | 938 | mov r1,sp 939 | add r2,sp,#32 940 | bl fe25519_mul 941 | add sp,sp,#32 942 | //frame address sp,336 943 | //z^(2^200 - 2^0) 944 | 945 | // current stack: _ z^(2^50 - 2^0) z^(2^11) x z ... 946 | 947 | movs r8,#50 948 | bl fe25519_sqr_many // 6169 cycles 949 | //z^(2^250 - 2^50) 950 | 951 | mov r1,sp 952 | add r2,sp,#32 953 | bl fe25519_mul 954 | //z^(2^250 - 2^0) 955 | 956 | movs r8,#5 957 | bl fe25519_sqr_many // 634 cycles 958 | //z^(2^255 - 2^5) 959 | 960 | mov r1,sp 961 | add r2,sp,#64 962 | bl fe25519_mul 963 | stm sp,{r0-r7} 964 | //z^(2^255 - 21) 965 | 966 | // 19661 for second half of inversion 967 | 968 | // done inverting! 969 | // total inversion cost: 33412 cycles 970 | 971 | mov r1,sp 972 | add r2,sp,#96 973 | bl fe25519_mul 974 | 975 | // now final reduce 976 | lsr r8,r7,#31 977 | mov r9,#19 978 | mul r8,r8,r9 979 | mov r10,#0 980 | 981 | // handle the case when 2^255 - 19 <= x < 2^255 982 | add r8,r8,#19 983 | 984 | adds r8,r0,r8 985 | adcs r8,r1,r10 986 | adcs r8,r2,r10 987 | adcs r8,r3,r10 988 | adcs r8,r4,r10 989 | adcs r8,r5,r10 990 | adcs r8,r6,r10 991 | adcs r8,r7,r10 992 | adcs r11,r10,r10 993 | 994 | lsr r8,r8,#31 995 | orr r8,r8,r11, lsl #1 996 | mul r8,r8,r9 997 | 998 | ldr r9,[sp,#296] 999 | 1000 | adds r0,r0,r8 1001 | str r0,[r9,#0] 1002 | movs r0,#0 1003 | adcs r1,r1,r0 1004 | str r1,[r9,#4] 1005 | mov r1,r9 1006 | adcs r2,r2,r0 1007 | adcs r3,r3,r0 1008 | adcs r4,r4,r0 1009 | adcs r5,r5,r0 1010 | adcs r6,r6,r0 1011 | adcs r7,r7,r0 1012 | and r7,r7,#0x7fffffff 1013 | 1014 | str r2,[r1,#8] 1015 | str r3,[r1,#12] 1016 | str r4,[r1,#16] 1017 | str r5,[r1,#20] 1018 | str r6,[r1,#24] 1019 | str r7,[r1,#28] 1020 | 1021 | add sp,sp,#300 1022 | //frame address sp,36 1023 | 1024 | pop {r4-r11,pc} 1025 | 1026 | // 234 cycles after inversion 1027 | // in total for whole function 548 873 cycles 1028 
| 1029 | .size curve25519_scalarmult, .-curve25519_scalarmult 1030 | 1031 | 1032 | -------------------------------------------------------------------------------- /x25519-cortex-m4-keil.s: -------------------------------------------------------------------------------- 1 | ; Curve25519 scalar multiplication 2 | ; Copyright (c) 2017, Emil Lenngren 3 | ; 4 | ; All rights reserved. 5 | ; 6 | ; Redistribution and use in source and binary forms, with or without modification, 7 | ; are permitted provided that the following conditions are met: 8 | ; 9 | ; 1. Redistributions of source code must retain the above copyright notice, this 10 | ; list of conditions and the following disclaimer. 11 | ; 12 | ; 2. Redistributions in binary form, except as embedded into a Nordic 13 | ; Semiconductor ASA or Dialog Semiconductor PLC integrated circuit in a product 14 | ; or a software update for such product, must reproduce the above copyright 15 | ; notice, this list of conditions and the following disclaimer in the 16 | ; documentation and/or other materials provided with the distribution. 17 | ; 18 | ; THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 19 | ; ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 20 | ; WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 21 | ; DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR 22 | ; ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 23 | ; (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 24 | ; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 25 | ; ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 26 | ; (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 27 | ; SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 28 | 29 | 30 | 31 | ; This is an armv7 implementation of X25519. 
32 | ; It follows the reference implementation where the representation of 33 | ; a field element [0..2^255-19) is represented by a 256-bit little endian integer, 34 | ; reduced modulo 2^256-38, and may possibly be in the range [2^256-38..2^256). 35 | ; The scalar is a 256-bit integer where certain bits are hardcoded per specification. 36 | ; 37 | ; The implementation runs in constant time (548 873 cycles on ARM Cortex-M4, 38 | ; assuming no wait states), and no conditional branches or memory access 39 | ; pattern depend on secret data. 40 | 41 | area |.text|, code, readonly 42 | align 2 43 | 44 | ; input: *r8=a, *r9=b 45 | ; output: r0-r7 46 | ; clobbers all other registers 47 | ; cycles: 45 48 | fe25519_add proc 49 | export fe25519_add 50 | ldr r0,[r8,#28] 51 | ldr r4,[r9,#28] 52 | adds r0,r0,r4 53 | mov r11,#0 54 | adc r11,r11,r11 55 | lsl r11,r11,#1 56 | add r11,r11,r0, lsr #31 57 | movs r7,#19 58 | mul r11,r11,r7 59 | bic r7,r0,#0x80000000 60 | 61 | ldm r8!,{r0-r3} 62 | ldm r9!,{r4-r6,r10} 63 | mov r12,#1 64 | umaal r0,r11,r12,r4 65 | umaal r1,r11,r12,r5 66 | umaal r2,r11,r12,r6 67 | umaal r3,r11,r12,r10 68 | ldm r9,{r4-r6} 69 | ldm r8,{r8-r10} 70 | umaal r4,r11,r12,r8 71 | umaal r5,r11,r12,r9 72 | umaal r6,r11,r12,r10 73 | add r7,r7,r11 74 | bx lr 75 | 76 | endp 77 | 78 | ; input: *r8=a, *r9=b 79 | ; output: r0-r7 80 | ; clobbers all other registers 81 | ; cycles: 46 82 | fe25519_sub proc 83 | export fe25519_sub 84 | 85 | ldm r8,{r0-r7} 86 | ldm r9!,{r8,r10-r12} 87 | subs r0,r8 88 | sbcs r1,r10 89 | sbcs r2,r11 90 | sbcs r3,r12 91 | ldm r9,{r8-r11} 92 | sbcs r4,r8 93 | sbcs r5,r9 94 | sbcs r6,r10 95 | sbcs r7,r11 96 | 97 | ; if subtraction goes below 0, set r8 to -1 and r9 to -38, else set both to 0 98 | sbc r8,r8 99 | and r9,r8,#-38 100 | 101 | adds r0,r9 102 | adcs r1,r8 103 | adcs r2,r8 104 | adcs r3,r8 105 | adcs r4,r8 106 | adcs r5,r8 107 | adcs r6,r8 108 | adcs r7,r8 109 | 110 | ; if the subtraction did not go below 0, we are done and (r8,r9) are set to 
0 111 | ; if the subtraction went below 0 and the addition overflowed, we are done, so set (r8,r9) to 0 112 | ; if the subtraction went below 0 and the addition did not overflow, we need to add once more 113 | ; (r8,r9) will be correctly set to (-1,-38) only when r8 was -1 and we don't have a carry, 114 | ; note that the carry will always be 0 in case (r8,r9) was (0,0) since then there was no real addition 115 | ; also note that it is extremely unlikely we will need an extra addition: 116 | ; that can only happen if input1 was slightly >= 0 and input2 was > 2^256-38 (really input2-input1 > 2^256-38) 117 | ; in that case we currently have 2^256-38 < (r0...r7) < 2^256, so adding -38 will only affect r0 118 | adcs r8,#0 119 | and r9,r8,#-38 120 | 121 | adds r0,r9 122 | 123 | bx lr 124 | 125 | endp 126 | 127 | ; input: *r1=a, *r2=b 128 | ; output: r0-r7 129 | ; clobbers all other registers 130 | ; cycles: 173 131 | fe25519_mul proc 132 | export fe25519_mul 133 | push {r2,lr} 134 | frame push {lr} 135 | frame address sp,8 136 | 137 | sub sp,#28 138 | frame address sp,36 139 | ldm r2,{r2,r3,r4,r5} 140 | 141 | ldm r1!,{r0,r10,lr} 142 | umull r6,r11,r2,r0 143 | 144 | umull r7,r12,r3,r0 145 | umaal r7,r11,r2,r10 146 | 147 | push {r6,r7} 148 | frame address sp,44 149 | 150 | umull r8,r6,r4,r0 151 | umaal r8,r11,r3,r10 152 | 153 | umull r9,r7,r5,r0 154 | umaal r9,r11,r4,r10 155 | 156 | umaal r11,r7,r5,r10 157 | 158 | umaal r8,r12,r2,lr 159 | umaal r9,r12,r3,lr 160 | umaal r11,r12,r4,lr 161 | umaal r12,r7,r5,lr 162 | 163 | ldm r1!,{r0,r10,lr} 164 | 165 | umaal r9,r6,r2,r0 166 | umaal r11,r6,r3,r0 167 | umaal r12,r6,r4,r0 168 | umaal r6,r7,r5,r0 169 | 170 | strd r8,r9,[sp,#8] 171 | 172 | mov r9,#0 173 | umaal r11,r9,r2,r10 174 | umaal r12,r9,r3,r10 175 | umaal r6,r9,r4,r10 176 | umaal r7,r9,r5,r10 177 | 178 | mov r10,#0 179 | umaal r12,r10,r2,lr 180 | umaal r6,r10,r3,lr 181 | umaal r7,r10,r4,lr 182 | umaal r9,r10,r5,lr 183 | 184 | ldr r8,[r1],#4 185 | mov lr,#0 186 | umaal 
lr,r6,r2,r8 187 | umaal r7,r6,r3,r8 188 | umaal r9,r6,r4,r8 189 | umaal r10,r6,r5,r8 190 | 191 | ;_ _ _ _ _ 6 10 9| 7 | lr 12 11 _ _ _ _ 192 | 193 | ldr r8,[r1],#-28 194 | mov r0,#0 195 | umaal r7,r0,r2,r8 196 | umaal r9,r0,r3,r8 197 | umaal r10,r0,r4,r8 198 | umaal r6,r0,r5,r8 199 | 200 | push {r0} 201 | frame address sp,48 202 | 203 | ;_ _ _ _ s 6 10 9| 7 | lr 12 11 _ _ _ _ 204 | 205 | ldr r2,[sp,#40] 206 | adds r2,r2,#16 207 | ldm r2,{r2,r3,r4,r5} 208 | 209 | ldr r8,[r1],#4 210 | mov r0,#0 211 | umaal r11,r0,r2,r8 212 | str r11,[sp,#16+4] 213 | umaal r12,r0,r3,r8 214 | umaal lr,r0,r4,r8 215 | umaal r0,r7,r5,r8 ; 7=carry for 9 216 | 217 | ;_ _ _ _ s 6 10 9+7| 0 | lr 12 _ _ _ _ _ 218 | 219 | ldr r8,[r1],#4 220 | mov r11,#0 221 | umaal r12,r11,r2,r8 222 | str r12,[sp,#20+4] 223 | umaal lr,r11,r3,r8 224 | umaal r0,r11,r4,r8 225 | umaal r11,r7,r5,r8 ; 7=carry for 10 226 | 227 | ;_ _ _ _ s 6 10+7 9+11| 0 | lr _ _ _ _ _ _ 228 | 229 | ldr r8,[r1],#4 230 | mov r12,#0 231 | umaal lr,r12,r2,r8 232 | str lr,[sp,#24+4] 233 | umaal r0,r12,r3,r8 234 | umaal r11,r12,r4,r8 235 | umaal r10,r12,r5,r8 ; 12=carry for 6 236 | 237 | ;_ _ _ _ s 6+12 10+7 9+11| 0 | _ _ _ _ _ _ _ 238 | 239 | ldr r8,[r1],#4 240 | mov lr,#0 241 | umaal r0,lr,r2,r8 242 | str r0,[sp,#28+4] 243 | umaal r11,lr,r3,r8 244 | umaal r10,lr,r4,r8 245 | umaal r6,lr,r5,r8 ; lr=carry for saved 246 | 247 | ;_ _ _ _ s+lr 6+12 10+7 9+11| _ | _ _ _ _ _ _ _ 248 | 249 | ldm r1!,{r0,r8} 250 | umaal r11,r9,r2,r0 251 | str r11,[sp,#32+4] 252 | umaal r9,r10,r3,r0 253 | umaal r10,r6,r4,r0 254 | pop {r11} 255 | frame address sp,44 256 | umaal r11,r6,r5,r0 ; 6=carry for next 257 | 258 | ;_ _ _ 6 11+lr 10+12 9+7 _ | _ | _ _ _ _ _ _ _ 259 | 260 | umaal r9,r7,r2,r8 261 | umaal r10,r7,r3,r8 262 | umaal r11,r7,r4,r8 263 | umaal r6,r7,r5,r8 264 | 265 | ldm r1!,{r0,r8} 266 | umaal r10,r12,r2,r0 267 | umaal r11,r12,r3,r0 268 | umaal r6,r12,r4,r0 269 | umaal r7,r12,r5,r0 270 | 271 | umaal r11,lr,r2,r8 272 | umaal r6,lr,r3,r8 273 | umaal 
lr,r7,r4,r8 274 | umaal r7,r12,r5,r8 275 | 276 | ; 12 7 lr 6 11 10 9 stack*9 277 | 278 | ;now reduce 279 | 280 | ldrd r4,r5,[sp,#28] 281 | movs r3,#38 282 | mov r8,#0 283 | umaal r4,r8,r3,r12 284 | lsl r8,r8,#1 285 | orr r8,r8,r4, lsr #31 286 | and r12,r4,#0x7fffffff 287 | movs r4,#19 288 | mul r8,r8,r4 289 | 290 | pop {r0-r2} 291 | frame address sp,32 292 | umaal r0,r8,r3,r5 293 | umaal r1,r8,r3,r9 294 | umaal r2,r8,r3,r10 295 | mov r9,#38 296 | pop {r3,r4} 297 | frame address sp,24 298 | umaal r3,r8,r9,r11 299 | umaal r4,r8,r9,r6 300 | pop {r5,r6} 301 | frame address sp,16 302 | umaal r5,r8,r9,lr 303 | umaal r6,r8,r9,r7 304 | add r7,r8,r12 305 | 306 | add sp,#12 307 | frame address sp,4 308 | pop {pc} 309 | 310 | endp 311 | 312 | ; input/result in (r0-r7) 313 | ; clobbers all other registers 314 | ; cycles: 115 315 | fe25519_sqr proc 316 | export fe25519_sqr 317 | push {lr} 318 | frame push {lr} 319 | sub sp,#20 320 | frame address sp,24 321 | 322 | ;mul 01, 00 323 | umull r9,r10,r0,r0 324 | umull r11,r12,r0,r1 325 | adds r11,r11,r11 326 | mov lr,#0 327 | umaal r10,r11,lr,lr 328 | 329 | ;r9 r10 done 330 | ;r12 carry for 3rd before col 331 | ;r11+C carry for 3rd final col 332 | 333 | push {r9,r10} 334 | frame address sp,32 335 | 336 | ;mul 02, 11 337 | mov r8,#0 338 | umaal r8,r12,r0,r2 339 | adcs r8,r8,r8 340 | umaal r8,r11,r1,r1 341 | 342 | ;r8 done (3rd col) 343 | ;r12 carry for 4th before col 344 | ;r11+C carry for 4th final col 345 | 346 | ;mul 03, 12 347 | umull r9,r10,r0,r3 348 | umaal r9,r12,r1,r2 349 | adcs r9,r9,r9 350 | umaal r9,r11,lr,lr 351 | 352 | ;r9 done (4th col) 353 | ;r10+r12 carry for 5th before col 354 | ;r11+C carry for 5th final col 355 | 356 | strd r8,r9,[sp,#8] 357 | 358 | ;mul 04, 13, 22 359 | mov r9,#0 360 | umaal r9,r10,r0,r4 361 | umaal r9,r12,r1,r3 362 | adcs r9,r9,r9 363 | umaal r9,r11,r2,r2 364 | 365 | ;r9 done (5th col) 366 | ;r10+r12 carry for 6th before col 367 | ;r11+C carry for 6th final col 368 | 369 | str r9,[sp,#16] 370 | 
371 | ;mul 05, 14, 23 372 | umull r9,r8,r0,r5 373 | umaal r9,r10,r1,r4 374 | umaal r9,r12,r2,r3 375 | adcs r9,r9,r9 376 | umaal r9,r11,lr,lr 377 | 378 | ;r9 done (6th col) 379 | ;r10+r12+r8 carry for 7th before col 380 | ;r11+C carry for 7th final col 381 | 382 | str r9,[sp,#20] 383 | 384 | ;mul 06, 15, 24, 33 385 | mov r9,#0 386 | umaal r9,r8,r1,r5 387 | umaal r9,r12,r2,r4 388 | umaal r9,r10,r0,r6 389 | adcs r9,r9,r9 390 | umaal r9,r11,r3,r3 391 | 392 | ;r9 done (7th col) 393 | ;r8+r10+r12 carry for 8th before col 394 | ;r11+C carry for 8th final col 395 | 396 | str r9,[sp,#24] 397 | 398 | ;mul 07, 16, 25, 34 399 | umull r0,r9,r0,r7 400 | umaal r0,r10,r1,r6 401 | umaal r0,r12,r2,r5 402 | umaal r0,r8,r3,r4 403 | adcs r0,r0,r0 404 | umaal r0,r11,lr,lr 405 | 406 | ;r0 done (8th col) 407 | ;r9+r8+r10+r12 carry for 9th before col 408 | ;r11+C carry for 9th final col 409 | 410 | ;mul 17, 26, 35, 44 411 | umaal r9,r8,r1,r7 ;r1 is now dead 412 | umaal r9,r10,r2,r6 413 | umaal r12,r9,r3,r5 414 | adcs r12,r12,r12 415 | umaal r11,r12,r4,r4 416 | 417 | ;r11 done (9th col) 418 | ;r8+r10+r9 carry for 10th before col 419 | ;r12+C carry for 10th final col 420 | 421 | ;mul 27, 36, 45 422 | umaal r9,r8,r2,r7 ;r2 is now dead 423 | umaal r10,r9,r3,r6 424 | movs r2,#0 425 | umaal r10,r2,r4,r5 426 | adcs r10,r10,r10 427 | umaal r12,r10,lr,lr 428 | 429 | ;r12 done (10th col) 430 | ;r8+r9+r2 carry for 11th before col 431 | ;r10+C carry for 11th final col 432 | 433 | ;mul 37, 46, 55 434 | umaal r2,r8,r3,r7 ;r3 is now dead 435 | umaal r9,r2,r4,r6 436 | adcs r9,r9,r9 437 | umaal r10,r9,r5,r5 438 | 439 | ;r10 done (11th col) 440 | ;r8+r2 carry for 12th before col 441 | ;r9+C carry for 12th final col 442 | 443 | ;mul 47, 56 444 | movs r3,#0 445 | umaal r3,r8,r4,r7 ;r4 is now dead 446 | umaal r3,r2,r5,r6 447 | adcs r3,r3,r3 448 | umaal r9,r3,lr,lr 449 | 450 | ;r9 done (12th col) 451 | ;r8+r2 carry for 13th before col 452 | ;r3+C carry for 13th final col 453 | 454 | ;mul 57, 66 455 | umaal 
r8,r2,r5,r7 ;r5 is now dead 456 | adcs r8,r8,r8 457 | umaal r3,r8,r6,r6 458 | 459 | ;r3 done (13th col) 460 | ;r2 carry for 14th before col 461 | ;r8+C carry for 14th final col 462 | 463 | ;mul 67 464 | umull r4,r5,lr,lr ; set 0 465 | umaal r4,r2,r6,r7 466 | adcs r4,r4,r4 467 | umaal r4,r8,lr,lr 468 | 469 | ;r4 done (14th col) 470 | ;r2 carry for 15th before col 471 | ;r8+C carry for 15th final col 472 | 473 | ;mul 77 474 | adcs r2,r2,r2 475 | umaal r8,r2,r7,r7 476 | adcs r2,r2,lr 477 | 478 | ;r8 done (15th col) 479 | ;r2 done (16th col) 480 | 481 | ;msb -> lsb: r2 r8 r4 r3 r9 r10 r12 r11 r0 sp+24 sp+20 sp+16 sp+12 sp+8 sp+4 sp 482 | ;lr: 0 483 | ;now do reduction 484 | 485 | mov r6,#38 486 | umaal r0,lr,r6,r2 487 | lsl lr,lr,#1 488 | orr lr,lr,r0, lsr #31 489 | and r7,r0,#0x7fffffff 490 | movs r5,#19 491 | mul lr,lr,r5 492 | 493 | pop {r0,r1} 494 | frame address sp,24 495 | umaal r0,lr,r6,r11 496 | umaal r1,lr,r6,r12 497 | 498 | mov r11,r3 499 | mov r12,r4 500 | 501 | pop {r2,r3,r4,r5} 502 | frame address sp,8 503 | umaal r2,lr,r6,r10 504 | umaal r3,lr,r6,r9 505 | 506 | umaal r4,lr,r6,r11 507 | umaal r5,lr,r6,r12 508 | 509 | pop {r6} 510 | frame address sp,4 511 | mov r12,#38 512 | umaal r6,lr,r12,r8 513 | add r7,r7,lr 514 | 515 | pop {pc} 516 | 517 | endp 518 | 519 | ; in: r0-r7, count: r8 520 | ; out: r0-r7 + sets result also to top of stack 521 | ; clobbers all other registers 522 | ; cycles: 19 + 123*n 523 | fe25519_sqr_many proc 524 | export fe25519_sqr_many 525 | push {r8,lr} 526 | frame push {r8,lr} 527 | 0 528 | bl fe25519_sqr 529 | 530 | ldr r8,[sp,#0] 531 | subs r8,r8,#1 532 | str r8,[sp,#0] 533 | bne %b0 534 | 535 | add sp,sp,#4 536 | frame address sp,4 537 | add r8,sp,#4 538 | stm r8,{r0-r7} 539 | pop {pc} 540 | endp 541 | 542 | ; This kind of load supports unaligned access 543 | ; in: *r1 544 | ; out: r0-r7 545 | ; cycles: 22 546 | loadm proc 547 | ldr r0,[r1,#0] 548 | ldr r2,[r1,#8] 549 | ldr r3,[r1,#12] 550 | ldr r4,[r1,#16] 551 | ldr r5,[r1,#20] 
552 | ldr r6,[r1,#24] 553 | ldr r7,[r1,#28] 554 | ldr r1,[r1,#4] 555 | bx lr 556 | endp 557 | 558 | ; in: *r0 = result, *r1 = scalar, *r2 = basepoint (all pointers may be unaligned) 559 | ; cycles: 548 873 560 | curve25519_scalarmult proc 561 | export curve25519_scalarmult 562 | 563 | ; stack layout: xp zp xq zq x0 bitpos lastbit scalar result_ptr r4-r11,lr 564 | ; 0 32 64 96 128 160 164 168 200 204 565 | 566 | push {r0,r4-r11,lr} 567 | frame push {r4-r11,lr} 568 | frame address sp,40 569 | 570 | mov r10,r2 571 | bl loadm 572 | 573 | and r0,r0,#0xfffffff8 574 | ;and r7,r7,#0x7fffffff not needed since we don't inspect the msb anyway 575 | orr r7,r7,#0x40000000 576 | push {r0-r7} 577 | frame address sp,72 578 | movs r8,#0 579 | push {r2,r8} 580 | frame address sp,80 581 | 582 | ;ldm r1,{r0-r7} 583 | mov r1,r10 584 | bl loadm 585 | 586 | and r7,r7,#0x7fffffff 587 | push {r0-r7} 588 | frame address sp,112 589 | 590 | movs r9,#1 591 | umull r10,r11,r8,r8 592 | mov r12,#0 593 | push {r8,r10,r11,r12} 594 | frame address sp,128 595 | push {r9,r10,r11,r12} 596 | frame address sp,144 597 | 598 | push {r0-r7} 599 | frame address sp,176 600 | 601 | umull r6,r7,r8,r8 602 | push {r6,r7,r8,r10,r11,r12} 603 | frame address sp,200 604 | push {r6,r7,r8,r10,r11,r12} 605 | frame address sp,224 606 | push {r9,r10,r11,r12} 607 | frame address sp,240 608 | 609 | movs r0,#254 610 | movs r3,#0 611 | ; 129 cycles so far 612 | 0 613 | ; load scalar bit into r1 614 | lsrs r1,r0,#5 615 | adds r2,sp,#168 616 | ldr r1,[r2,r1,lsl #2] 617 | and r4,r0,#0x1f 618 | lsrs r1,r1,r4 619 | and r1,r1,#1 620 | 621 | strd r0,r1,[sp,#160] 622 | 623 | eors r1,r1,r3 624 | rsbs lr,r1,#0 625 | 626 | mov r0,sp 627 | add r1,sp,#64 628 | 629 | mov r11,#4 630 | ; 15 cycles 631 | 1 632 | ldm r0,{r2-r5} 633 | ldm r1,{r6-r9} 634 | 635 | eors r2,r2,r6 636 | and r10,r2,lr 637 | eors r6,r6,r10 638 | eors r2,r2,r6 639 | 640 | eors r3,r3,r7 641 | and r10,r3,lr 642 | eors r7,r7,r10 643 | eors r3,r3,r7 644 | 645 | eors 
r4,r4,r8 646 | and r10,r4,lr 647 | eors r8,r8,r10 648 | eors r4,r4,r8 649 | 650 | eors r5,r5,r9 651 | and r10,r5,lr 652 | eors r9,r9,r10 653 | eors r5,r5,r9 654 | 655 | stm r0!,{r2-r5} 656 | stm r1!,{r6-r9} 657 | 658 | subs r11,#1 659 | bne %b1 660 | ; 40*4 - 2 = 158 cycles 661 | 662 | mov r8,sp 663 | add r9,sp,#32 664 | bl fe25519_add 665 | push {r0-r7} 666 | frame address sp,272 667 | 668 | bl fe25519_sqr 669 | push {r0-r7} 670 | frame address sp,304 671 | 672 | add r8,sp,#64 673 | add r9,sp,#96 674 | bl fe25519_sub 675 | push {r0-r7} 676 | frame address sp,336 677 | 678 | bl fe25519_sqr 679 | push {r0-r7} 680 | frame address sp,368 681 | 682 | mov r1,sp 683 | add r2,sp,#64 684 | bl fe25519_mul 685 | add r8,sp,#128 686 | stm r8,{r0-r7} 687 | 688 | add r8,sp,#64 689 | mov r9,sp 690 | bl fe25519_sub 691 | add r8,sp,#64 692 | stm r8,{r0-r7} 693 | 694 | ; 64 + 1*45 + 2*46 + 1*173 + 2*115 = 604 cycles 695 | 696 | ;multiplies (r0-r7) with 121666, adds *sp and puts the result on the top of the stack (replacing old content) 697 | ldr lr,=121666 698 | ;mov lr,#56130 699 | ;add lr,lr,#65536 700 | ldr r12,[sp,#28] 701 | mov r11,#0 702 | umaal r12,r11,lr,r7 703 | lsl r11,r11,#1 704 | add r11,r11,r12, lsr #31 705 | movs r7,#19 706 | mul r11,r11,r7 707 | bic r7,r12,#0x80000000 708 | ldm sp!,{r8,r9,r10,r12} 709 | frame address sp,352 710 | umaal r8,r11,lr,r0 711 | umaal r9,r11,lr,r1 712 | umaal r10,r11,lr,r2 713 | umaal r12,r11,lr,r3 714 | ldm sp!,{r0,r1,r2} 715 | frame address sp,340 716 | umaal r0,r11,lr,r4 717 | umaal r1,r11,lr,r5 718 | umaal r2,r11,lr,r6 719 | add r7,r7,r11 720 | add sp,sp,#4 721 | frame address sp,338 722 | push {r0,r1,r2,r7} 723 | frame address sp,352 724 | push {r8,r9,r10,r12} 725 | frame address sp,368 726 | ; 39 cycles 727 | 728 | mov r1,sp 729 | add r2,sp,#64 730 | bl fe25519_mul 731 | add r8,sp,#160 732 | stm r8,{r0-r7} 733 | 734 | add r8,sp,#192 735 | add r9,sp,#224 736 | bl fe25519_add 737 | stm sp,{r0-r7} 738 | 739 | mov r1,sp 740 | add r2,sp,#32 
741 | bl fe25519_mul 742 | add r8,sp,#32 743 | stm r8,{r0-r7} 744 | 745 | add r8,sp,#192 746 | add r9,sp,#224 747 | bl fe25519_sub 748 | stm sp,{r0-r7} 749 | 750 | mov r1,sp 751 | add r2,sp,#96 752 | bl fe25519_mul 753 | stm sp,{r0-r7} 754 | 755 | mov r8,sp 756 | add r9,sp,#32 757 | bl fe25519_add 758 | 759 | bl fe25519_sqr 760 | 761 | add r8,sp,#192 762 | stm r8,{r0-r7} 763 | 764 | mov r8,sp 765 | add r9,sp,#32 766 | bl fe25519_sub 767 | 768 | bl fe25519_sqr 769 | stm sp,{r0-r7} 770 | 771 | mov r1,sp 772 | add r2,sp,#256 773 | bl fe25519_mul 774 | add r8,sp,#224 775 | stm r8,{r0-r7} 776 | 777 | add sp,sp,#128 778 | frame address sp,240 779 | 780 | ldrd r2,r3,[sp,#160] 781 | subs r0,r2,#1 782 | ; 97 + 2*45 + 2*46 + 4*173 + 2*115 = 1201 cycles 783 | bpl %b0 784 | ; in total 2020 cycles per iteration, in total 515 098 cycles for 255 iterations 785 | 786 | ;These cswap lines are not needed for curve25519 since the lowest bit is hardcoded to 0 787 | ;---------- 788 | ;rsbs lr,r3,#0 789 | 790 | ;mov r0,sp 791 | ;add r1,sp,#64 792 | 793 | ;mov r11,#4 794 | ;1 795 | ;ldm r0,{r2-r5} 796 | ;ldm r1!,{r6-r9} 797 | 798 | ;eors r2,r2,r6 799 | ;and r10,r2,lr 800 | ;eors r6,r6,r10 801 | ;eors r2,r2,r6 802 | 803 | ;eors r3,r3,r7 804 | ;and r10,r3,lr 805 | ;eors r7,r7,r10 806 | ;eors r3,r3,r7 807 | 808 | ;eors r4,r4,r8 809 | ;and r10,r4,lr 810 | ;eors r8,r8,r10 811 | ;eors r4,r4,r8 812 | 813 | ;eors r5,r5,r9 814 | ;and r10,r5,lr 815 | ;eors r9,r9,r10 816 | ;eors r5,r5,r9 817 | 818 | ;stm r0!,{r2-r5} 819 | 820 | ;subs r11,#1 821 | ;bne %b1 822 | ;---------- 823 | 824 | ; now we must invert zp 825 | add r0,sp,#32 826 | ldm r0,{r0-r7} 827 | bl fe25519_sqr 828 | push {r0-r7} 829 | frame address sp,272 830 | 831 | bl fe25519_sqr 832 | bl fe25519_sqr 833 | push {r0-r7} 834 | frame address sp,304 835 | 836 | add r1,sp,#96 837 | mov r2,sp 838 | bl fe25519_mul 839 | stm sp,{r0-r7} 840 | 841 | mov r1,sp 842 | add r2,sp,#32 843 | bl fe25519_mul 844 | add r8,sp,#32 845 | stm r8,{r0-r7} 846 | 
847 | ; current stack: z^(2^9) z^(2^11) x z 848 | 849 | bl fe25519_sqr 850 | push {r0-r7} 851 | frame address sp,336 852 | 853 | mov r1,sp 854 | add r2,sp,#32 855 | bl fe25519_mul 856 | add r8,sp,#32 857 | stm r8,{r0-r7} 858 | 859 | ; current stack: _ z^(2^5 - 2^0) z^(2^11) x z 860 | 861 | mov r8,#5 862 | ; 1052 cycles 863 | bl fe25519_sqr_many ; 634 cycles 864 | 865 | mov r1,sp 866 | add r2,sp,#32 867 | bl fe25519_mul 868 | add r8,sp,#32 869 | stm r8,{r0-r7} 870 | 871 | ; current stack: _ z^(2^10 - 2^0) z^(2^11) x z ... 872 | 873 | movs r8,#10 874 | bl fe25519_sqr_many ; 1249 cycles 875 | ;z^(2^20 - 2^10) 876 | 877 | mov r1,sp 878 | add r2,sp,#32 879 | bl fe25519_mul 880 | stm sp,{r0-r7} 881 | ;z^(2^20 - 2^0) 882 | 883 | ; current stack: z^(2^20 - 2^0) z^(2^10 - 2^0) z^(2^11) x z ... 884 | 885 | movs r8,#20 886 | sub sp,sp,#32 887 | frame address sp,368 888 | bl fe25519_sqr_many ; 2479 cycles 889 | ;z^(2^40 - 2^20) 890 | 891 | mov r1,sp 892 | add r2,sp,#32 893 | bl fe25519_mul 894 | add sp,sp,#32 895 | frame address sp,336 896 | ;z^(2^40 - 2^0) 897 | 898 | movs r8,#10 899 | bl fe25519_sqr_many ; 1249 cycles 900 | ;z^(2^50 - 2^10) 901 | 902 | mov r1,sp 903 | add r2,sp,#32 904 | bl fe25519_mul 905 | add r8,sp,#32 906 | stm r8,{r0-r7} 907 | 908 | ; current stack: _ z^(2^50 - 2^0) z^(2^11) x z ... 909 | 910 | movs r8,#50 911 | bl fe25519_sqr_many ; 6169 cycles 912 | ;z^(2^100 - 2^50) 913 | 914 | mov r1,sp 915 | add r2,sp,#32 916 | bl fe25519_mul 917 | stm sp,{r0-r7} 918 | 919 | ; 13751 cycles so far for inversion 920 | 921 | ; current stack: z^(2^100 - 2^0) z^(2^50 - 2^0) z^(2^11) x z ... 922 | 923 | movs r8,#100 924 | sub sp,sp,#32 925 | frame address sp,368 926 | bl fe25519_sqr_many ; 12319 cycles 927 | ;z^(2^200 - 2^100) 928 | 929 | mov r1,sp 930 | add r2,sp,#32 931 | bl fe25519_mul 932 | add sp,sp,#32 933 | frame address sp,336 934 | ;z^(2^200 - 2^0) 935 | 936 | ; current stack: _ z^(2^50 - 2^0) z^(2^11) x z ... 
937 | 938 | movs r8,#50 939 | bl fe25519_sqr_many ; 6169 cycles 940 | ;z^(2^250 - 2^50) 941 | 942 | mov r1,sp 943 | add r2,sp,#32 944 | bl fe25519_mul 945 | ;z^(2^250 - 2^0) 946 | 947 | movs r8,#5 948 | bl fe25519_sqr_many ; 634 cycles 949 | ;z^(2^255 - 2^5) 950 | 951 | mov r1,sp 952 | add r2,sp,#64 953 | bl fe25519_mul 954 | stm sp,{r0-r7} 955 | ;z^(2^255 - 21) 956 | 957 | ; 19661 for second half of inversion 958 | 959 | ; done inverting! 960 | ; total inversion cost: 33412 cycles 961 | 962 | mov r1,sp 963 | add r2,sp,#96 964 | bl fe25519_mul 965 | 966 | ; now final reduce 967 | lsr r8,r7,#31 968 | mov r9,#19 969 | mul r8,r8,r9 970 | mov r10,#0 971 | 972 | ; handle the case when 2^255 - 19 <= x < 2^255 973 | add r8,r8,#19 974 | 975 | adds r8,r0,r8 976 | adcs r8,r1,r10 977 | adcs r8,r2,r10 978 | adcs r8,r3,r10 979 | adcs r8,r4,r10 980 | adcs r8,r5,r10 981 | adcs r8,r6,r10 982 | adcs r8,r7,r10 983 | adcs r11,r10,r10 984 | 985 | lsr r8,r8,#31 986 | orr r8,r8,r11, lsl #1 987 | mul r8,r8,r9 988 | 989 | ldr r9,[sp,#296] 990 | 991 | adds r0,r0,r8 992 | str r0,[r9,#0] 993 | movs r0,#0 994 | adcs r1,r1,r0 995 | str r1,[r9,#4] 996 | mov r1,r9 997 | adcs r2,r2,r0 998 | adcs r3,r3,r0 999 | adcs r4,r4,r0 1000 | adcs r5,r5,r0 1001 | adcs r6,r6,r0 1002 | adcs r7,r7,r0 1003 | and r7,r7,#0x7fffffff 1004 | 1005 | str r2,[r1,#8] 1006 | str r3,[r1,#12] 1007 | str r4,[r1,#16] 1008 | str r5,[r1,#20] 1009 | str r6,[r1,#24] 1010 | str r7,[r1,#28] 1011 | 1012 | add sp,sp,#300 1013 | frame address sp,36 1014 | 1015 | pop {r4-r11,pc} 1016 | 1017 | ; 234 cycles after inversion 1018 | ; in total for whole function 548 873 cycles 1019 | 1020 | endp 1021 | 1022 | end 1023 | -------------------------------------------------------------------------------- /x25519-cortex-m4.h: -------------------------------------------------------------------------------- 1 | /* Curve25519 scalar multiplication 2 | * Copyright (c) 2017, Emil Lenngren 3 | * 4 | * All rights reserved. 
5 | * 6 | * Redistribution and use in source and binary forms, with or without modification, 7 | * are permitted provided that the following conditions are met: 8 | * 9 | * 1. Redistributions of source code must retain the above copyright notice, this 10 | * list of conditions and the following disclaimer. 11 | * 12 | * 2. Redistributions in binary form, except as embedded into a Nordic 13 | * Semiconductor ASA or Dialog Semiconductor PLC integrated circuit in a product 14 | * or a software update for such product, must reproduce the above copyright 15 | * notice, this list of conditions and the following disclaimer in the 16 | * documentation and/or other materials provided with the distribution. 17 | * 18 | * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 19 | * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 20 | * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 21 | * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR 22 | * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 23 | * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 24 | * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 25 | * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 26 | * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 27 | * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 28 | */ 29 | 30 | /* 31 | * This is an armv7 implementation of X25519. 32 | * It follows the reference implementation where the representation of 33 | * a field element [0..2^255-19) is represented by a 256-bit little endian integer, 34 | * reduced modulo 2^256-38, and may possibly be in the range [2^256-38..2^256). 35 | * The scalar is a 256-bit integer where certain bits are hardcoded per specification. 
36 | * 37 | * The implementation runs in constant time (548 873 cycles on ARM Cortex-M4, 38 | * assuming no wait states), and no conditional branches or memory access 39 | * pattern depend on secret data. 40 | */ 41 | 42 | #ifndef X25519_CORTEX_M4_H 43 | #define X25519_CORTEX_M4_H 44 | 45 | // Assembler function 46 | void curve25519_scalarmult(unsigned char result[32], const unsigned char scalar[32], const unsigned char point[32]); 47 | 48 | // User macros 49 | #define X25519_calc_public_key(output_public_key, input_secret_key) do { \ 50 | static const unsigned char basepoint[32] = {9}; \ 51 | curve25519_scalarmult(output_public_key, input_secret_key, basepoint); \ 52 | } while(0) 53 | 54 | #define X25519_calc_shared_secret(output_shared_secret, my_secret_key, their_public_key) \ 55 | curve25519_scalarmult(output_shared_secret, my_secret_key, their_public_key) 56 | 57 | #endif 58 | --------------------------------------------------------------------------------