├── LICENSE.txt ├── README.md ├── curve25519-m4f-ptrswap-keil.s ├── linux_example.c ├── x25519-cortex-m4-gcc.s ├── x25519-cortex-m4-keil.s └── x25519-cortex-m4.h /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Copyright (c) 2017, Emil Lenngren 2 | 3 | All rights reserved. 4 | 5 | Redistribution and use in source and binary forms, with or without modification, 6 | are permitted provided that the following conditions are met: 7 | 8 | 1. Redistributions of source code must retain the above copyright notice, this 9 | list of conditions and the following disclaimer. 10 | 11 | 2. Redistributions in binary form, except as embedded into a Nordic 12 | Semiconductor ASA or Dialog Semiconductor PLC integrated circuit in a product 13 | or a software update for such product, must reproduce the above copyright 14 | notice, this list of conditions and the following disclaimer in the 15 | documentation and/or other materials provided with the distribution. 16 | 17 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 18 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 19 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 20 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR 21 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 22 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 23 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 24 | ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 25 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 26 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
27 | 28 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # X25519 for ARM Cortex-M4 and other ARM processors 2 | 3 | This implements highly optimized assembly versions of X25519 for ARMv7. It's optimized for Cortex-M4 but works on other ARM processors as well (ARMv7 and newer 32-bit architectures). 4 | 5 | ## X25519 6 | X25519 is an elliptic-curve variant of the Diffie-Hellman key exchange protocol, using Curve25519 as the elliptic curve, as introduced in https://cr.yp.to/ecdh.html. 7 | 8 | ### API 9 | ``` 10 | void X25519_calc_public_key(uint8_t output_public_key[32], const uint8_t input_secret_key[32]); 11 | void X25519_calc_shared_secret(uint8_t output_shared_secret[32], const uint8_t my_secret_key[32], const uint8_t their_public_key[32]); 12 | ``` 13 | 14 | * To use, first generate a 32-byte random value using a cryptographically secure random number generator (specifically, do NOT use `rand()` from the C library); this is your secret key. 15 | * Feed that secret key into `X25519_calc_public_key`, which will give you the corresponding public key, which you then transfer to the other party. The other party does the same. 16 | * When you get the other party's public key, feed it into `X25519_calc_shared_secret` together with your secret key, which will give you the shared secret. Rather than using this shared secret directly, it should be hashed (for example with SHA-256) on both sides before use. For further usage instructions, see the official web site. 17 | 18 | Note that, unlike some other implementations, this library automatically "clamps" the secret key for you (i.e. clears the three lowest bits, clears the highest bit and sets the second-highest bit). 19 | 20 | ### Setup 21 | * The header file `x25519-cortex-m4.h` should be included when using the API from C/C++. 22 | * For Keil, the file `x25519-cortex-m4-keil.s` must be added to the project as a Source file. 
23 | * When compiling with GCC, `x25519-cortex-m4-gcc.s` must be added to the project as a compilation unit. The compiler switch `-march=armv7-a`, `-march=armv7e-m` or similar might be needed, depending on the target architecture. 24 | 25 | ### Example 26 | An example that uses `/dev/urandom` to get random data can be seen in `linux_example.c`. It can be compiled on, for example, a Raspberry Pi 3 with: 27 | ``` 28 | gcc linux_example.c x25519-cortex-m4-gcc.s -o linux_example -march=armv7-a 29 | ``` 30 | 31 | ### Performance 32 | The library uses only 1892 bytes of code space when compiled, uses 368 bytes of stack, and runs one scalar multiplication in 548 873 cycles on Cortex-M4, which is a speed record as far as I know. For a 64 MHz processor, that means less than 9 ms per operation! 33 | 34 | There is also an even more optimized version that uses the FPU, which runs in 476 275 cycles on ARM Cortex-M4F. 35 | 36 | ### Code 37 | The code is written in Keil's assembler format (`x25519-cortex-m4-keil.s`) and was converted to GCC's assembler syntax (`x25519-cortex-m4-gcc.s`) using the following regex command: 38 | `perl -0777 -pe 's/^;/\/\//g;s/(\s);/\1\/\//g;s/export/\.global/g;s/(([a-zA-Z0-9_]+) proc[\W\w]+?)endp/\1\.size \2, \.-\2/g;s/([a-zA-Z0-9_]+) proc/\t\.type \1, %function\n\1:/g;s/end//g;s/(\r?\n)(\d+)(\r?\n)/\1\2:\3/g;s/%b(\d+)/\1b/g;s/%f(\d+)/\1f/g;s/(frame[\W\w]+?\n)/\/\/\1/g;s/area \|([^\|]+)\|[^\n]*\n/\1\n/g;s/align /\.align /g;s/^/\t.syntax unified\n\t.thumb\n/' < x25519-cortex-m4-keil.s > x25519-cortex-m4-gcc.s` 39 | 40 | ### Security 41 | The basic implementation runs in constant time and uses a constant memory access pattern regardless of the private key, in order to protect against side-channel attacks. The FPU version, however, reads data from RAM in a pattern that depends on secret data, so that version is only suited for embedded devices without a data cache, such as Cortex-M4 and Cortex-M33. 
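A constant memory access pattern means that secret-dependent choices are made with arithmetic masking rather than branches or indexed loads. As an illustration only (the function name `cond_swap` is not part of this library's API), a constant-time conditional swap of two 8-word field elements can be sketched in C like this:

```c
#include <stdint.h>

/* Swap the 8-word values a and b if and only if bit == 1,
   using a mask instead of a secret-dependent branch. */
static void cond_swap(uint32_t a[8], uint32_t b[8], uint32_t bit)
{
    uint32_t mask = (uint32_t)0 - bit; /* 0x00000000 or 0xFFFFFFFF */
    for (int i = 0; i < 8; i++) {
        uint32_t t = mask & (a[i] ^ b[i]);
        a[i] ^= t;
        b[i] ^= t;
    }
}
```

The same sequence of loads, stores and XORs executes whether or not the swap happens, so timing and the access pattern reveal nothing about `bit`.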
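The "clamping" mentioned in the API section is the standard X25519 scalar adjustment from RFC 7748; this library applies it internally, so you can pass raw random bytes. A minimal sketch of what it does (the function name `clamp_secret_key` is illustrative, not an exported symbol):

```c
#include <stdint.h>

/* Standard X25519 scalar clamping (RFC 7748): clear the three lowest
   bits, clear the highest bit and set the second-highest bit. */
static void clamp_secret_key(uint8_t key[32])
{
    key[0]  &= 0xF8; /* clear bits 0, 1, 2 */
    key[31] &= 0x7F; /* clear bit 255 */
    key[31] |= 0x40; /* set bit 254 */
}
```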
42 | 43 | ### License 44 | The code is licensed under a 2-clause BSD license, with an extra exception for Nordic Semiconductor and Dialog Semiconductor devices. 45 | -------------------------------------------------------------------------------- /curve25519-m4f-ptrswap-keil.s: -------------------------------------------------------------------------------- 1 | ; Curve25519 scalar multiplication 2 | ; Copyright (c) 2017-2019, Emil Lenngren 3 | ; 4 | ; All rights reserved. 5 | ; 6 | ; Redistribution and use in source and binary forms, with or without modification, 7 | ; are permitted provided that the following conditions are met: 8 | ; 9 | ; 1. Redistributions of source code must retain the above copyright notice, this 10 | ; list of conditions and the following disclaimer. 11 | ; 12 | ; 2. Redistributions in binary form, except as embedded into a Nordic 13 | ; Semiconductor ASA or Dialog Semiconductor PLC integrated circuit in a product 14 | ; or a software update for such product, must reproduce the above copyright 15 | ; notice, this list of conditions and the following disclaimer in the 16 | ; documentation and/or other materials provided with the distribution. 17 | ; 18 | ; THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 19 | ; ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 20 | ; WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 21 | ; DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR 22 | ; ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 23 | ; (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 24 | ; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 25 | ; ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 26 | ; (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 27 | ; SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 28 | 29 | 30 | gbla use_mul_for_sqr 31 | use_mul_for_sqr seta 0 32 | 33 | gbla has_fpu 34 | has_fpu seta 1 35 | 36 | ; This is an armv7 implementation of X25519. 37 | ; It follows the reference implementation where the representation of 38 | ; a field element [0..2^255-19) is represented by a 256-bit little endian integer, 39 | ; reduced modulo 2^256-38, and may possibly be in the range [2^256-38..2^256). 40 | ; The scalar is a 256-bit integer where certain bits are hardcoded per specification. 41 | ; 42 | ; The implementation runs in constant time (measured 476 275 cycles on ARM Cortex-M4F, 43 | ; few wait states), and no conditional branches depend on secret data. 44 | ; 45 | ; This implementation conditionally loads data from different RAM locations depending 46 | ; on secret data, so this version should not be used on CPUs that have data cache, 47 | ; such as Cortex-A53. It's suited for embedded devices running CPUs like Cortex-M4 and 48 | ; Cortex-M33. 
49 | 50 | area |.text|, code, readonly 51 | align 4 52 | 53 | ; input: *r8=a, *r9=b 54 | ; output: r0-r7 55 | ; clobbers all other registers 56 | ; cycles: 45 57 | fe25519_add proc 58 | export fe25519_add 59 | ldr r0,[r8,#28] 60 | ldr r4,[r9,#28] 61 | adds r0,r0,r4 62 | mov r11,#0 63 | adc r11,r11,r11 64 | lsl r11,r11,#1 65 | add r11,r11,r0, lsr #31 66 | movs r7,#19 67 | mul r11,r11,r7 68 | bic r7,r0,#0x80000000 69 | 70 | ldm r8!,{r0-r3} 71 | ldm r9!,{r4-r6,r10} 72 | mov r12,#1 73 | umaal r0,r11,r12,r4 74 | umaal r1,r11,r12,r5 75 | umaal r2,r11,r12,r6 76 | umaal r3,r11,r12,r10 77 | ldm r9,{r4-r6} 78 | ldm r8,{r8-r10} 79 | umaal r4,r11,r12,r8 80 | umaal r5,r11,r12,r9 81 | umaal r6,r11,r12,r10 82 | add r7,r7,r11 83 | bx lr 84 | 85 | endp 86 | 87 | ; input: *r8=a, *r9=b 88 | ; output: r0-r7 89 | ; clobbers all other registers 90 | ; cycles: 46 91 | fe25519_sub proc 92 | export fe25519_sub 93 | 94 | ldm r8,{r0-r7} 95 | ldm r9!,{r8,r10-r12} 96 | subs r0,r8 97 | sbcs r1,r10 98 | sbcs r2,r11 99 | sbcs r3,r12 100 | ldm r9,{r8-r11} 101 | sbcs r4,r8 102 | sbcs r5,r9 103 | sbcs r6,r10 104 | sbcs r7,r11 105 | 106 | ; if subtraction goes below 0, set r8 to -1 and r9 to -38, else set both to 0 107 | sbc r8,r8 108 | and r9,r8,#-38 109 | 110 | adds r0,r9 111 | adcs r1,r8 112 | adcs r2,r8 113 | adcs r3,r8 114 | adcs r4,r8 115 | adcs r5,r8 116 | adcs r6,r8 117 | adcs r7,r8 118 | 119 | ; if the subtraction did not go below 0, we are done and (r8,r9) are set to 0 120 | ; if the subtraction went below 0 and the addition overflowed, we are done, so set (r8,r9) to 0 121 | ; if the subtraction went below 0 and the addition did not overflow, we need to add once more 122 | ; (r8,r9) will be correctly set to (-1,-38) only when r8 was -1 and we don't have a carry, 123 | ; note that the carry will always be 0 in case (r8,r9) was (0,0) since then there was no real addition 124 | ; also note that it is extremely unlikely we will need an extra addition: 125 | ; that can only happen if input1 was 
slightly >= 0 and input2 was > 2^256-38 (really input2-input1 > 2^256-38) 126 | ; in that case we currently have 2^256-38 < (r0...r7) < 2^256, so adding -38 will only affect r0 127 | adcs r8,#0 128 | and r9,r8,#-38 129 | 130 | adds r0,r9 131 | 132 | bx lr 133 | 134 | endp 135 | 136 | if use_mul_for_sqr == 1 137 | ;thumb_func 138 | fe25519_sqr ;label definition 139 | mov r2,r1 140 | ; fallthrough 141 | endif 142 | 143 | if has_fpu == 1 144 | ; input: *r1=a, *r2=b 145 | ; output: r0-r7 146 | ; clobbers all other registers 147 | ; cycles: 154 148 | align 4 ; don't know why but this direction here reduces cycle count with 30 000... 149 | fe25519_mul proc 150 | export fe25519_mul 151 | push {lr} 152 | frame push {lr} 153 | 154 | vmov d2[0],r2 ;s4 155 | vldm r1,{s8-s15} 156 | 157 | ldm r2,{r2,r3,r4,r5} 158 | 159 | vmov r0,r10,s8,s9 160 | umull r6,r1,r2,r0 161 | 162 | umull r7,r12,r3,r0 163 | umaal r7,r1,r2,r10 164 | 165 | vmov s0,s1,r6,r7 166 | 167 | umull r8,r6,r4,r0 168 | umaal r8,r1,r3,r10 169 | 170 | umull r9,r7,r5,r0 171 | umaal r9,r1,r4,r10 172 | 173 | umaal r1,r7,r5,r10 174 | 175 | vmov lr,r0,s10,s11 176 | 177 | umaal r8,r12,r2,lr 178 | umaal r9,r12,r3,lr 179 | umaal r1,r12,r4,lr 180 | umaal r12,r7,r5,lr 181 | 182 | umaal r9,r6,r2,r0 183 | umaal r1,r6,r3,r0 184 | umaal r12,r6,r4,r0 185 | umaal r6,r7,r5,r0 186 | 187 | vmov s2,s3,r8,r9 188 | 189 | vmov r10,lr,s12,s13 190 | 191 | mov r9,#0 192 | umaal r1,r9,r2,r10 193 | umaal r12,r9,r3,r10 194 | umaal r6,r9,r4,r10 195 | umaal r7,r9,r5,r10 196 | 197 | mov r10,#0 198 | umaal r12,r10,r2,lr 199 | umaal r6,r10,r3,lr 200 | umaal r7,r10,r4,lr 201 | umaal r9,r10,r5,lr 202 | 203 | vmov r8,d7[0] ;s14 204 | mov lr,#0 205 | umaal lr,r6,r2,r8 206 | umaal r7,r6,r3,r8 207 | umaal r9,r6,r4,r8 208 | umaal r10,r6,r5,r8 209 | 210 | ;_ _ _ _ _ 6 10 9| 7 | lr 12 1 _ _ _ _ 211 | 212 | vmov r8,d7[1] ;s15 213 | mov r11,#0 214 | umaal r7,r11,r2,r8 215 | umaal r9,r11,r3,r8 216 | umaal r10,r11,r4,r8 217 | umaal r6,r11,r5,r8 218 | 219 | ;_ _ _ 
_ 11 6 10 9| 7 | lr 12 1 _ _ _ _ 220 | 221 | vmov r2,d2[0] ;s4 222 | adds r2,r2,#16 223 | ldm r2,{r2,r3,r4,r5} 224 | 225 | vmov r8,d4[0] ;s8 226 | movs r0,#0 227 | umaal r1,r0,r2,r8 228 | vmov s4,r1 229 | umaal r12,r0,r3,r8 230 | umaal lr,r0,r4,r8 231 | umaal r0,r7,r5,r8 ; 7=carry for 9 232 | 233 | ;_ _ _ _ 11 6 10 9+7| 0 | lr 12 _ _ _ _ _ 234 | 235 | vmov r8,d4[1] ;s9 236 | movs r1,#0 237 | umaal r12,r1,r2,r8 238 | vmov s5,r12 239 | umaal lr,r1,r3,r8 240 | umaal r0,r1,r4,r8 241 | umaal r1,r7,r5,r8 ; 7=carry for 10 242 | 243 | ;_ _ _ _ 11 6 10+7 9+1| 0 | lr _ _ _ _ _ _ 244 | 245 | vmov r8,d5[0] ;s10 246 | mov r12,#0 247 | umaal lr,r12,r2,r8 248 | vmov s6,lr 249 | umaal r0,r12,r3,r8 250 | umaal r1,r12,r4,r8 251 | umaal r10,r12,r5,r8 ; 12=carry for 6 252 | 253 | ;_ _ _ _ 11 6+12 10+7 9+1| 0 | _ _ _ _ _ _ _ 254 | 255 | vmov r8,d5[1] ;s11 256 | mov lr,#0 257 | umaal r0,lr,r2,r8 258 | vmov d3[1],r0 ;s7 259 | umaal r1,lr,r3,r8 260 | umaal r10,lr,r4,r8 261 | umaal r6,lr,r5,r8 ; lr=carry for saved 262 | 263 | ;_ _ _ _ 11+lr 6+12 10+7 9+1| _ | _ _ _ _ _ _ _ 264 | 265 | vmov r0,r8,s12,s13 266 | umaal r1,r9,r2,r0 267 | vmov d4[0],r1 ;s8 268 | umaal r9,r10,r3,r0 269 | umaal r10,r6,r4,r0 270 | umaal r11,r6,r5,r0 ; 6=carry for next 271 | 272 | ;_ _ _ 6 11+lr 10+12 9+7 _ | _ | _ _ _ _ _ _ _ 273 | 274 | umaal r9,r7,r2,r8 275 | umaal r10,r7,r3,r8 276 | umaal r11,r7,r4,r8 277 | umaal r6,r7,r5,r8 278 | 279 | vmov r0,r8,s14,s15 280 | umaal r10,r12,r2,r0 281 | umaal r11,r12,r3,r0 282 | umaal r6,r12,r4,r0 283 | umaal r7,r12,r5,r0 284 | 285 | umaal r11,lr,r2,r8 286 | umaal r6,lr,r3,r8 287 | umaal lr,r7,r4,r8 288 | umaal r7,r12,r5,r8 289 | 290 | ; 12 7 lr 6 11 10 9 fpu*9 291 | 292 | ;now reduce 293 | 294 | vmov r4,r5,s7,s8 295 | movs r3,#38 296 | mov r8,#0 297 | umaal r4,r8,r3,r12 298 | lsl r8,r8,#1 299 | orr r8,r8,r4, lsr #31 300 | and r12,r4,#0x7fffffff 301 | movs r4,#19 302 | mul r8,r8,r4 303 | 304 | vmov r0,r1,s0,s1 305 | vmov r2,d1[0] ;s2 306 | umaal r0,r8,r3,r5 307 | umaal 
r1,r8,r3,r9 308 | umaal r2,r8,r3,r10 309 | mov r9,#38 310 | vmov r3,r4,s3,s4 311 | umaal r3,r8,r9,r11 312 | umaal r4,r8,r9,r6 313 | vmov r5,r6,s5,s6 314 | umaal r5,r8,r9,lr 315 | umaal r6,r8,r9,r7 316 | add r7,r8,r12 317 | 318 | pop {pc} 319 | 320 | endp 321 | 322 | if use_mul_for_sqr == 0 323 | ; input/result in (r0-r7) 324 | ; clobbers all other registers 325 | ; cycles: 106 326 | fe25519_sqr proc 327 | export fe25519_sqr 328 | push {lr} 329 | frame push {lr} 330 | 331 | ;mul 01, 00 332 | umull r9,r10,r0,r0 333 | umull r11,r12,r0,r1 334 | adds r11,r11,r11 335 | mov lr,#0 336 | umaal r10,r11,lr,lr 337 | 338 | ;r9 r10 done 339 | ;r12 carry for 3rd before col 340 | ;r11+C carry for 3rd final col 341 | 342 | vmov s0,s1,r9,r10 343 | 344 | ;mul 02, 11 345 | mov r8,#0 346 | umaal r8,r12,r0,r2 347 | adcs r8,r8,r8 348 | umaal r8,r11,r1,r1 349 | 350 | ;r8 done (3rd col) 351 | ;r12 carry for 4th before col 352 | ;r11+C carry for 4th final col 353 | 354 | ;mul 03, 12 355 | umull r9,r10,r0,r3 356 | umaal r9,r12,r1,r2 357 | adcs r9,r9,r9 358 | umaal r9,r11,lr,lr 359 | 360 | ;r9 done (4th col) 361 | ;r10+r12 carry for 5th before col 362 | ;r11+C carry for 5th final col 363 | 364 | vmov s2,s3,r8,r9 365 | 366 | ;mul 04, 13, 22 367 | mov r9,#0 368 | umaal r9,r10,r0,r4 369 | umaal r9,r12,r1,r3 370 | adcs r9,r9,r9 371 | umaal r9,r11,r2,r2 372 | 373 | ;r9 done (5th col) 374 | ;r10+r12 carry for 6th before col 375 | ;r11+C carry for 6th final col 376 | 377 | vmov s4,r9 378 | 379 | ;mul 05, 14, 23 380 | umull r9,r8,r0,r5 381 | umaal r9,r10,r1,r4 382 | umaal r9,r12,r2,r3 383 | adcs r9,r9,r9 384 | umaal r9,r11,lr,lr 385 | 386 | ;r9 done (6th col) 387 | ;r10+r12+r8 carry for 7th before col 388 | ;r11+C carry for 7th final col 389 | 390 | vmov s5,r9 391 | 392 | ;mul 06, 15, 24, 33 393 | mov r9,#0 394 | umaal r9,r8,r1,r5 395 | umaal r9,r12,r2,r4 396 | umaal r9,r10,r0,r6 397 | adcs r9,r9,r9 398 | umaal r9,r11,r3,r3 399 | 400 | ;r9 done (7th col) 401 | ;r8+r10+r12 carry for 8th before col 402 
| ;r11+C carry for 8th final col 403 | 404 | vmov s6,r9 405 | 406 | ;mul 07, 16, 25, 34 407 | umull r0,r9,r0,r7 408 | umaal r0,r10,r1,r6 409 | umaal r0,r12,r2,r5 410 | umaal r0,r8,r3,r4 411 | adcs r0,r0,r0 412 | umaal r0,r11,lr,lr 413 | 414 | ;r0 done (8th col) 415 | ;r9+r8+r10+r12 carry for 9th before col 416 | ;r11+C carry for 9th final col 417 | 418 | ;mul 17, 26, 35, 44 419 | umaal r9,r8,r1,r7 ;r1 is now dead 420 | umaal r9,r10,r2,r6 421 | umaal r12,r9,r3,r5 422 | adcs r12,r12,r12 423 | umaal r11,r12,r4,r4 424 | 425 | ;r11 done (9th col) 426 | ;r8+r10+r9 carry for 10th before col 427 | ;r12+C carry for 10th final col 428 | 429 | ;mul 27, 36, 45 430 | umaal r9,r8,r2,r7 ;r2 is now dead 431 | umaal r10,r9,r3,r6 432 | movs r2,#0 433 | umaal r10,r2,r4,r5 434 | adcs r10,r10,r10 435 | umaal r12,r10,lr,lr 436 | 437 | ;r12 done (10th col) 438 | ;r8+r9+r2 carry for 11th before col 439 | ;r10+C carry for 11th final col 440 | 441 | ;mul 37, 46, 55 442 | umaal r2,r8,r3,r7 ;r3 is now dead 443 | umaal r9,r2,r4,r6 444 | adcs r9,r9,r9 445 | umaal r10,r9,r5,r5 446 | 447 | ;r10 done (11th col) 448 | ;r8+r2 carry for 12th before col 449 | ;r9+C carry for 12th final col 450 | 451 | ;mul 47, 56 452 | movs r3,#0 453 | umaal r3,r8,r4,r7 ;r4 is now dead 454 | umaal r3,r2,r5,r6 455 | adcs r3,r3,r3 456 | umaal r9,r3,lr,lr 457 | 458 | ;r9 done (12th col) 459 | ;r8+r2 carry for 13th before col 460 | ;r3+C carry for 13th final col 461 | 462 | ;mul 57, 66 463 | umaal r8,r2,r5,r7 ;r5 is now dead 464 | adcs r8,r8,r8 465 | umaal r3,r8,r6,r6 466 | 467 | ;r3 done (13th col) 468 | ;r2 carry for 14th before col 469 | ;r8+C carry for 14th final col 470 | 471 | ;mul 67 472 | umull r4,r5,lr,lr ; set 0 473 | umaal r4,r2,r6,r7 474 | adcs r4,r4,r4 475 | umaal r4,r8,lr,lr 476 | 477 | ;r4 done (14th col) 478 | ;r2 carry for 15th before col 479 | ;r8+C carry for 15th final col 480 | 481 | ;mul 77 482 | adcs r2,r2,r2 483 | umaal r8,r2,r7,r7 484 | adcs r2,r2,lr 485 | 486 | ;r8 done (15th col) 487 | ;r2 done 
(16th col) 488 | 489 | ;msb -> lsb: r2 r8 r4 r3 r9 r10 r12 r11 r0 s6 s5 s4 s3 s2 s1 s0 490 | ;lr: 0 491 | ;now do reduction 492 | 493 | movs r6,#38 494 | umaal r0,lr,r6,r2 495 | lsl lr,lr,#1 496 | orr lr,lr,r0, lsr #31 497 | and r7,r0,#0x7fffffff 498 | movs r5,#19 499 | mul lr,lr,r5 500 | 501 | vmov r0,r1,s0,s1 502 | umaal r0,lr,r6,r11 503 | umaal r1,lr,r6,r12 504 | 505 | mov r11,r3 506 | mov r12,r4 507 | 508 | vmov r2,r3,s2,s3 509 | vmov r4,r5,s4,s5 510 | umaal r2,lr,r6,r10 511 | umaal r3,lr,r6,r9 512 | umaal r4,lr,r6,r11 513 | umaal r5,lr,r6,r12 514 | 515 | vmov r6,s6 516 | mov r12,#38 517 | umaal r6,lr,r12,r8 518 | add r7,r7,lr 519 | 520 | pop {pc} 521 | 522 | endp 523 | endif 524 | 525 | else 526 | 527 | ; input: *r1=a, *r2=b 528 | ; output: r0-r7 529 | ; clobbers all other registers 530 | ; cycles: 173 531 | fe25519_mul proc 532 | export fe25519_mul 533 | push {r2,lr} 534 | frame push {lr} 535 | frame address sp,8 536 | 537 | sub sp,#28 538 | frame address sp,36 539 | ldm r2,{r2,r3,r4,r5} 540 | 541 | ldm r1!,{r0,r10,lr} 542 | umull r6,r11,r2,r0 543 | 544 | umull r7,r12,r3,r0 545 | umaal r7,r11,r2,r10 546 | 547 | push {r6,r7} 548 | frame address sp,44 549 | 550 | umull r8,r6,r4,r0 551 | umaal r8,r11,r3,r10 552 | 553 | umull r9,r7,r5,r0 554 | umaal r9,r11,r4,r10 555 | 556 | umaal r11,r7,r5,r10 557 | 558 | umaal r8,r12,r2,lr 559 | umaal r9,r12,r3,lr 560 | umaal r11,r12,r4,lr 561 | umaal r12,r7,r5,lr 562 | 563 | ldm r1!,{r0,r10,lr} 564 | 565 | umaal r9,r6,r2,r0 566 | umaal r11,r6,r3,r0 567 | umaal r12,r6,r4,r0 568 | umaal r6,r7,r5,r0 569 | 570 | strd r8,r9,[sp,#8] 571 | 572 | mov r9,#0 573 | umaal r11,r9,r2,r10 574 | umaal r12,r9,r3,r10 575 | umaal r6,r9,r4,r10 576 | umaal r7,r9,r5,r10 577 | 578 | mov r10,#0 579 | umaal r12,r10,r2,lr 580 | umaal r6,r10,r3,lr 581 | umaal r7,r10,r4,lr 582 | umaal r9,r10,r5,lr 583 | 584 | ldr r8,[r1],#4 585 | mov lr,#0 586 | umaal lr,r6,r2,r8 587 | umaal r7,r6,r3,r8 588 | umaal r9,r6,r4,r8 589 | umaal r10,r6,r5,r8 590 | 591 | ;_ _ _ 
_ _ 6 10 9| 7 | lr 12 11 _ _ _ _ 592 | 593 | ldr r8,[r1],#-28 594 | mov r0,#0 595 | umaal r7,r0,r2,r8 596 | umaal r9,r0,r3,r8 597 | umaal r10,r0,r4,r8 598 | umaal r6,r0,r5,r8 599 | 600 | push {r0} 601 | frame address sp,48 602 | 603 | ;_ _ _ _ s 6 10 9| 7 | lr 12 11 _ _ _ _ 604 | 605 | ldr r2,[sp,#40] 606 | adds r2,r2,#16 607 | ldm r2,{r2,r3,r4,r5} 608 | 609 | ldr r8,[r1],#4 610 | mov r0,#0 611 | umaal r11,r0,r2,r8 612 | str r11,[sp,#16+4] 613 | umaal r12,r0,r3,r8 614 | umaal lr,r0,r4,r8 615 | umaal r0,r7,r5,r8 ; 7=carry for 9 616 | 617 | ;_ _ _ _ s 6 10 9+7| 0 | lr 12 _ _ _ _ _ 618 | 619 | ldr r8,[r1],#4 620 | mov r11,#0 621 | umaal r12,r11,r2,r8 622 | str r12,[sp,#20+4] 623 | umaal lr,r11,r3,r8 624 | umaal r0,r11,r4,r8 625 | umaal r11,r7,r5,r8 ; 7=carry for 10 626 | 627 | ;_ _ _ _ s 6 10+7 9+11| 0 | lr _ _ _ _ _ _ 628 | 629 | ldr r8,[r1],#4 630 | mov r12,#0 631 | umaal lr,r12,r2,r8 632 | str lr,[sp,#24+4] 633 | umaal r0,r12,r3,r8 634 | umaal r11,r12,r4,r8 635 | umaal r10,r12,r5,r8 ; 12=carry for 6 636 | 637 | ;_ _ _ _ s 6+12 10+7 9+11| 0 | _ _ _ _ _ _ _ 638 | 639 | ldr r8,[r1],#4 640 | mov lr,#0 641 | umaal r0,lr,r2,r8 642 | str r0,[sp,#28+4] 643 | umaal r11,lr,r3,r8 644 | umaal r10,lr,r4,r8 645 | umaal r6,lr,r5,r8 ; lr=carry for saved 646 | 647 | ;_ _ _ _ s+lr 6+12 10+7 9+11| _ | _ _ _ _ _ _ _ 648 | 649 | ldm r1!,{r0,r8} 650 | umaal r11,r9,r2,r0 651 | str r11,[sp,#32+4] 652 | umaal r9,r10,r3,r0 653 | umaal r10,r6,r4,r0 654 | pop {r11} 655 | frame address sp,44 656 | umaal r11,r6,r5,r0 ; 6=carry for next 657 | 658 | ;_ _ _ 6 11+lr 10+12 9+7 _ | _ | _ _ _ _ _ _ _ 659 | 660 | umaal r9,r7,r2,r8 661 | umaal r10,r7,r3,r8 662 | umaal r11,r7,r4,r8 663 | umaal r6,r7,r5,r8 664 | 665 | ldm r1!,{r0,r8} 666 | umaal r10,r12,r2,r0 667 | umaal r11,r12,r3,r0 668 | umaal r6,r12,r4,r0 669 | umaal r7,r12,r5,r0 670 | 671 | umaal r11,lr,r2,r8 672 | umaal r6,lr,r3,r8 673 | umaal lr,r7,r4,r8 674 | umaal r7,r12,r5,r8 675 | 676 | ; 12 7 lr 6 11 10 9 stack*9 677 | 678 | ;now reduce 679 | 
680 | ldrd r4,r5,[sp,#28] 681 | movs r3,#38 682 | mov r8,#0 683 | umaal r4,r8,r3,r12 684 | lsl r8,r8,#1 685 | orr r8,r8,r4, lsr #31 686 | and r12,r4,#0x7fffffff 687 | movs r4,#19 688 | mul r8,r8,r4 689 | 690 | pop {r0-r2} 691 | frame address sp,32 692 | umaal r0,r8,r3,r5 693 | umaal r1,r8,r3,r9 694 | umaal r2,r8,r3,r10 695 | mov r9,#38 696 | pop {r3,r4} 697 | frame address sp,24 698 | umaal r3,r8,r9,r11 699 | umaal r4,r8,r9,r6 700 | pop {r5,r6} 701 | frame address sp,16 702 | umaal r5,r8,r9,lr 703 | umaal r6,r8,r9,r7 704 | add r7,r8,r12 705 | 706 | add sp,#12 707 | frame address sp,4 708 | pop {pc} 709 | 710 | endp 711 | 712 | if use_mul_for_sqr == 0 713 | ; input/result in (r0-r7) 714 | ; clobbers all other registers 715 | ; cycles: 115 716 | fe25519_sqr proc 717 | export fe25519_sqr 718 | push {lr} 719 | frame push {lr} 720 | sub sp,#20 721 | frame address sp,24 722 | 723 | ;mul 01, 00 724 | umull r9,r10,r0,r0 725 | umull r11,r12,r0,r1 726 | adds r11,r11,r11 727 | mov lr,#0 728 | umaal r10,r11,lr,lr 729 | 730 | ;r9 r10 done 731 | ;r12 carry for 3rd before col 732 | ;r11+C carry for 3rd final col 733 | 734 | push {r9,r10} 735 | frame address sp,32 736 | 737 | ;mul 02, 11 738 | mov r8,#0 739 | umaal r8,r12,r0,r2 740 | adcs r8,r8,r8 741 | umaal r8,r11,r1,r1 742 | 743 | ;r8 done (3rd col) 744 | ;r12 carry for 4th before col 745 | ;r11+C carry for 4th final col 746 | 747 | ;mul 03, 12 748 | umull r9,r10,r0,r3 749 | umaal r9,r12,r1,r2 750 | adcs r9,r9,r9 751 | umaal r9,r11,lr,lr 752 | 753 | ;r9 done (4th col) 754 | ;r10+r12 carry for 5th before col 755 | ;r11+C carry for 5th final col 756 | 757 | strd r8,r9,[sp,#8] 758 | 759 | ;mul 04, 13, 22 760 | mov r9,#0 761 | umaal r9,r10,r0,r4 762 | umaal r9,r12,r1,r3 763 | adcs r9,r9,r9 764 | umaal r9,r11,r2,r2 765 | 766 | ;r9 done (5th col) 767 | ;r10+r12 carry for 6th before col 768 | ;r11+C carry for 6th final col 769 | 770 | str r9,[sp,#16] 771 | 772 | ;mul 05, 14, 23 773 | umull r9,r8,r0,r5 774 | umaal r9,r10,r1,r4 775 | 
umaal r9,r12,r2,r3 776 | adcs r9,r9,r9 777 | umaal r9,r11,lr,lr 778 | 779 | ;r9 done (6th col) 780 | ;r10+r12+r8 carry for 7th before col 781 | ;r11+C carry for 7th final col 782 | 783 | str r9,[sp,#20] 784 | 785 | ;mul 06, 15, 24, 33 786 | mov r9,#0 787 | umaal r9,r8,r1,r5 788 | umaal r9,r12,r2,r4 789 | umaal r9,r10,r0,r6 790 | adcs r9,r9,r9 791 | umaal r9,r11,r3,r3 792 | 793 | ;r9 done (7th col) 794 | ;r8+r10+r12 carry for 8th before col 795 | ;r11+C carry for 8th final col 796 | 797 | str r9,[sp,#24] 798 | 799 | ;mul 07, 16, 25, 34 800 | umull r0,r9,r0,r7 801 | umaal r0,r10,r1,r6 802 | umaal r0,r12,r2,r5 803 | umaal r0,r8,r3,r4 804 | adcs r0,r0,r0 805 | umaal r0,r11,lr,lr 806 | 807 | ;r0 done (8th col) 808 | ;r9+r8+r10+r12 carry for 9th before col 809 | ;r11+C carry for 9th final col 810 | 811 | ;mul 17, 26, 35, 44 812 | umaal r9,r8,r1,r7 ;r1 is now dead 813 | umaal r9,r10,r2,r6 814 | umaal r12,r9,r3,r5 815 | adcs r12,r12,r12 816 | umaal r11,r12,r4,r4 817 | 818 | ;r11 done (9th col) 819 | ;r8+r10+r9 carry for 10th before col 820 | ;r12+C carry for 10th final col 821 | 822 | ;mul 27, 36, 45 823 | umaal r9,r8,r2,r7 ;r2 is now dead 824 | umaal r10,r9,r3,r6 825 | movs r2,#0 826 | umaal r10,r2,r4,r5 827 | adcs r10,r10,r10 828 | umaal r12,r10,lr,lr 829 | 830 | ;r12 done (10th col) 831 | ;r8+r9+r2 carry for 11th before col 832 | ;r10+C carry for 11th final col 833 | 834 | ;mul 37, 46, 55 835 | umaal r2,r8,r3,r7 ;r3 is now dead 836 | umaal r9,r2,r4,r6 837 | adcs r9,r9,r9 838 | umaal r10,r9,r5,r5 839 | 840 | ;r10 done (11th col) 841 | ;r8+r2 carry for 12th before col 842 | ;r9+C carry for 12th final col 843 | 844 | ;mul 47, 56 845 | movs r3,#0 846 | umaal r3,r8,r4,r7 ;r4 is now dead 847 | umaal r3,r2,r5,r6 848 | adcs r3,r3,r3 849 | umaal r9,r3,lr,lr 850 | 851 | ;r9 done (12th col) 852 | ;r8+r2 carry for 13th before col 853 | ;r3+C carry for 13th final col 854 | 855 | ;mul 57, 66 856 | umaal r8,r2,r5,r7 ;r5 is now dead 857 | adcs r8,r8,r8 858 | umaal r3,r8,r6,r6 859 | 860 
| ;r3 done (13th col) 861 | ;r2 carry for 14th before col 862 | ;r8+C carry for 14th final col 863 | 864 | ;mul 67 865 | umull r4,r5,lr,lr ; set 0 866 | umaal r4,r2,r6,r7 867 | adcs r4,r4,r4 868 | umaal r4,r8,lr,lr 869 | 870 | ;r4 done (14th col) 871 | ;r2 carry for 15th before col 872 | ;r8+C carry for 15th final col 873 | 874 | ;mul 77 875 | adcs r2,r2,r2 876 | umaal r8,r2,r7,r7 877 | adcs r2,r2,lr 878 | 879 | ;r8 done (15th col) 880 | ;r2 done (16th col) 881 | 882 | ;msb -> lsb: r2 r8 r4 r3 r9 r10 r12 r11 r0 sp+24 sp+20 sp+16 sp+12 sp+8 sp+4 sp 883 | ;lr: 0 884 | ;now do reduction 885 | 886 | mov r6,#38 887 | umaal r0,lr,r6,r2 888 | lsl lr,lr,#1 889 | orr lr,lr,r0, lsr #31 890 | and r7,r0,#0x7fffffff 891 | movs r5,#19 892 | mul lr,lr,r5 893 | 894 | pop {r0,r1} 895 | frame address sp,24 896 | umaal r0,lr,r6,r11 897 | umaal r1,lr,r6,r12 898 | 899 | mov r11,r3 900 | mov r12,r4 901 | 902 | pop {r2,r3,r4,r5} 903 | frame address sp,8 904 | umaal r2,lr,r6,r10 905 | umaal r3,lr,r6,r9 906 | 907 | umaal r4,lr,r6,r11 908 | umaal r5,lr,r6,r12 909 | 910 | pop {r6} 911 | frame address sp,4 912 | mov r12,#38 913 | umaal r6,lr,r12,r8 914 | add r7,r7,lr 915 | 916 | pop {pc} 917 | 918 | endp 919 | endif 920 | endif 921 | 922 | ; in: r0-r7, count: r8 923 | ; out: r0-r7 + sets result also to top of stack 924 | ; clobbers all other registers 925 | ; cycles: 19 + 114*n 926 | fe25519_sqr_many proc 927 | export fe25519_sqr_many 928 | push {r8,lr} 929 | frame push {r8,lr} 930 | 0 931 | bl fe25519_sqr 932 | 933 | ldr r8,[sp,#0] 934 | subs r8,r8,#1 935 | str r8,[sp,#0] 936 | bne %b0 937 | 938 | add sp,sp,#4 939 | frame address sp,4 940 | add r8,sp,#4 941 | stm r8,{r0-r7} 942 | pop {pc} 943 | endp 944 | 945 | ; This kind of load supports unaligned access 946 | ; in: *r1 947 | ; out: r0-r7 948 | ; cycles: 22 949 | loadm proc 950 | ldr r0,[r1,#0] 951 | ldr r2,[r1,#8] 952 | ldr r3,[r1,#12] 953 | ldr r4,[r1,#16] 954 | ldr r5,[r1,#20] 955 | ldr r6,[r1,#24] 956 | ldr r7,[r1,#28] 957 | ldr 
r1,[r1,#4] 958 | bx lr 959 | endp 960 | 961 | ; in: *r0 = result, *r1 = scalar, *r2 = basepoint (all pointers may be unaligned) 962 | ; cycles: 475 469 963 | curve25519_scalarmult proc 964 | export curve25519_scalarmult 965 | 966 | ; stack layout: xp zp xq zq x0 bitpos lastbit cswap? scalar result_ptr r4-r11,lr 967 | ; 0 32 64 96 128 160 161 162 164 196 200 968 | 969 | push {r0,r4-r11,lr} 970 | frame push {r4-r11,lr} 971 | frame address sp,40 972 | 973 | mov r10,r2 974 | bl loadm 975 | 976 | and r0,r0,#0xfffffff8 977 | ;and r7,r7,#0x7fffffff not needed since we don't inspect the msb anyway 978 | orr r7,r7,#0x40000000 979 | push {r0-r7} 980 | frame address sp,72 981 | mov r8,#0 982 | 983 | ;ldm r1,{r0-r7} 984 | mov r1,r10 985 | bl loadm 986 | 987 | and r7,r7,#0x7fffffff 988 | push {r0-r8} 989 | frame address sp,108 990 | 991 | movs r9,#1 992 | umull r10,r11,r8,r8 993 | mov r12,r8 994 | push {r8,r10,r11,r12} 995 | frame address sp,124 996 | push {r9,r10,r11,r12} 997 | frame address sp,140 998 | 999 | push {r0-r7} 1000 | frame address sp,172 1001 | 1002 | umull r6,r7,r8,r8 1003 | push {r6,r7,r8,r10,r11,r12} 1004 | frame address sp,196 1005 | push {r6,r7,r8,r10,r11,r12} 1006 | frame address sp,220 1007 | push {r9,r10,r11,r12} 1008 | frame address sp,236 1009 | 1010 | movs r0,#254 1011 | movs r3,#0 1012 | ; 127 cycles so far 1013 | 0 1014 | ; load scalar bit into r1 1015 | lsrs r1,r0,#5 1016 | adds r2,sp,#164 1017 | ldr r1,[r2,r1,lsl #2] 1018 | and r4,r0,#0x1f 1019 | lsrs r1,r1,r4 1020 | and r1,r1,#1 1021 | 1022 | strb r0,[sp,#160] 1023 | strb r1,[sp,#161] 1024 | 1025 | eors r1,r1,r3 1026 | strb r1,[sp,#162] 1027 | 1028 | ; A = X2 + Z2 1029 | add r8,sp,r1, lsl #6 1030 | add r9,r8,#32 1031 | bl fe25519_add 1032 | push {r0-r7} 1033 | frame address sp,268 1034 | 1035 | ; AA = A^2 1036 | bl fe25519_sqr 1037 | push {r0-r7} 1038 | frame address sp,300 1039 | 1040 | ; D = X3 - Z3 1041 | ldrb r0,[sp,#162+64] 1042 | add r8,sp,#128 1043 | sub r8,r8,r0, lsl #6 1044 | add r9,r8,#32 
1045 | bl fe25519_sub 1046 | push {r0-r7} 1047 | frame address sp,332 1048 | 1049 | ; DA = D * A 1050 | mov r1,sp 1051 | add r2,sp,#64 1052 | bl fe25519_mul 1053 | add r8,sp,#64 1054 | stm r8,{r0-r7} 1055 | 1056 | ; B = X2 - Z2 1057 | ldrb r0,[sp,#162+96] 1058 | add r8,sp,#96 1059 | add r8,r8,r0, lsl #6 1060 | add r9,r8,#32 1061 | bl fe25519_sub 1062 | stm sp,{r0-r7} 1063 | 1064 | ; BB = B^2 1065 | bl fe25519_sqr 1066 | push {r0-r7} 1067 | frame address sp,364 1068 | 1069 | ; C = X3 + Z3 1070 | ldrb r0,[sp,#162+128] 1071 | add r8,sp,#192 1072 | sub r8,r8,r0, lsl #6 1073 | add r9,r8,#32 1074 | bl fe25519_add 1075 | add r8,sp,#128 1076 | stm r8,{r0-r7} 1077 | 1078 | ; CB = C * B 1079 | add r1,sp,#128 1080 | add r2,sp,#32 1081 | bl fe25519_mul 1082 | add r8,sp,#32 1083 | stm r8,{r0-r7} 1084 | 1085 | ; X2 = BB * AA 1086 | mov r1,sp 1087 | add r2,sp,#64 1088 | bl fe25519_mul 1089 | add r8,sp,#128 1090 | stm r8,{r0-r7} 1091 | 1092 | ; E = AA - BB 1093 | add r8,sp,#64 1094 | mov r9,sp 1095 | bl fe25519_sub 1096 | add r8,sp,#64 1097 | stm r8,{r0-r7} 1098 | 1099 | ; 134 + 2*45 + 3*46 + 3*154 + 2*106 = 1036 cycles 1100 | 1101 | ; T1 = BB + ((a+2)/4) * E 1102 | ;multiplies (r0-r7) with 121666, adds *sp and puts the result on the top of the stack (replacing old content) 1103 | ldr lr,=121666 1104 | ;mov lr,#56130 1105 | ;add lr,lr,#65536 1106 | ldr r12,[sp,#28] 1107 | mov r11,#0 1108 | umaal r12,r11,lr,r7 1109 | lsl r11,r11,#1 1110 | add r11,r11,r12, lsr #31 1111 | movs r7,#19 1112 | mul r11,r11,r7 1113 | bic r7,r12,#0x80000000 1114 | ldm sp!,{r8,r9,r10,r12} 1115 | frame address sp,348 1116 | umaal r8,r11,lr,r0 1117 | umaal r9,r11,lr,r1 1118 | umaal r10,r11,lr,r2 1119 | umaal r12,r11,lr,r3 1120 | ldm sp!,{r0,r1,r2} 1121 | frame address sp,336 1122 | umaal r0,r11,lr,r4 1123 | umaal r1,r11,lr,r5 1124 | umaal r2,r11,lr,r6 1125 | add r7,r7,r11 1126 | add sp,sp,#4 1127 | frame address sp,332 1128 | push {r0,r1,r2,r7} 1129 | frame address sp,348 1130 | push {r8,r9,r10,r12} 1131 | 
frame address sp,364 1132 | ; 39 cycles 1133 | 1134 | ; Z2 = T1 * E 1135 | mov r1,sp 1136 | add r2,sp,#64 1137 | bl fe25519_mul 1138 | add r8,sp,#160 1139 | stm r8,{r0-r7} 1140 | 1141 | ; X3 = (DA + CB)^2 1142 | add r8,sp,#96 1143 | add r9,sp,#32 1144 | bl fe25519_add 1145 | bl fe25519_sqr 1146 | add r8,sp,#192 1147 | stm r8,{r0-r7} 1148 | 1149 | ; T2 = (DA - CB)^2 1150 | add r8,sp,#96 1151 | add r9,sp,#32 1152 | bl fe25519_sub 1153 | bl fe25519_sqr 1154 | stm sp,{r0-r7} 1155 | 1156 | ; Z3 = T2 * X1 1157 | mov r1,sp 1158 | add r2,sp,#256 1159 | bl fe25519_mul 1160 | add r8,sp,#224 1161 | stm r8,{r0-r7} 1162 | 1163 | add sp,sp,#128 1164 | frame address sp,236 1165 | 1166 | ldrb r2,[sp,#160] 1167 | ldrb r3,[sp,#161] 1168 | subs r0,r2,#1 1169 | ; 92 + 1*45 + 1*46 + 2*154 + 2*106 = 703 cycles 1170 | bpl %b0 1171 | ; in total 1742 cycles per iteration, in total 444 210 cycles for 255 iterations 1172 | 1173 | ;These cswap lines are not needed for curve25519 since the lowest bit is hardcoded to 0 1174 | ;---------- 1175 | ;rsbs lr,r3,#0 1176 | 1177 | ;mov r0,sp 1178 | ;add r1,sp,#64 1179 | 1180 | ;mov r11,#4 1181 | ;1 1182 | ;ldm r0,{r2-r5} 1183 | ;ldm r1!,{r6-r9} 1184 | 1185 | ;eors r2,r2,r6 1186 | ;and r10,r2,lr 1187 | ;eors r6,r6,r10 1188 | ;eors r2,r2,r6 1189 | 1190 | ;eors r3,r3,r7 1191 | ;and r10,r3,lr 1192 | ;eors r7,r7,r10 1193 | ;eors r3,r3,r7 1194 | 1195 | ;eors r4,r4,r8 1196 | ;and r10,r4,lr 1197 | ;eors r8,r8,r10 1198 | ;eors r4,r4,r8 1199 | 1200 | ;eors r5,r5,r9 1201 | ;and r10,r5,lr 1202 | ;eors r9,r9,r10 1203 | ;eors r5,r5,r9 1204 | 1205 | ;stm r0!,{r2-r5} 1206 | 1207 | ;subs r11,#1 1208 | ;bne %b1 1209 | ;---------- 1210 | 1211 | ; now we must invert zp 1212 | add r0,sp,#32 1213 | ldm r0,{r0-r7} 1214 | bl fe25519_sqr 1215 | push {r0-r7} 1216 | frame address sp,268 1217 | 1218 | bl fe25519_sqr 1219 | bl fe25519_sqr 1220 | push {r0-r7} 1221 | frame address sp,300 1222 | 1223 | add r1,sp,#96 1224 | mov r2,sp 1225 | bl fe25519_mul 1226 | stm sp,{r0-r7} 1227 | 
1228 | mov r1,sp 1229 | add r2,sp,#32 1230 | bl fe25519_mul 1231 | add r8,sp,#32 1232 | stm r8,{r0-r7} 1233 | 1234 | ; current stack: z^(2^9) z^(2^11) x z 1235 | 1236 | bl fe25519_sqr 1237 | push {r0-r7} 1238 | frame address sp,332 1239 | 1240 | mov r1,sp 1241 | add r2,sp,#32 1242 | bl fe25519_mul 1243 | add r8,sp,#32 1244 | stm r8,{r0-r7} 1245 | 1246 | ; current stack: _ z^(2^5 - 2^0) z^(2^11) x z 1247 | 1248 | mov r8,#5 1249 | ; 959 cycles 1250 | bl fe25519_sqr_many ; 589 cycles 1251 | 1252 | mov r1,sp 1253 | add r2,sp,#32 1254 | bl fe25519_mul 1255 | add r8,sp,#32 1256 | stm r8,{r0-r7} 1257 | 1258 | ; current stack: _ z^(2^10 - 2^0) z^(2^11) x z ... 1259 | 1260 | movs r8,#10 1261 | bl fe25519_sqr_many ; 1159 cycles 1262 | ;z^(2^20 - 2^10) 1263 | 1264 | mov r1,sp 1265 | add r2,sp,#32 1266 | bl fe25519_mul 1267 | stm sp,{r0-r7} 1268 | ;z^(2^20 - 2^0) 1269 | 1270 | ; current stack: z^(2^20 - 2^0) z^(2^10 - 2^0) z^(2^11) x z ... 1271 | 1272 | movs r8,#20 1273 | sub sp,sp,#32 1274 | frame address sp,364 1275 | bl fe25519_sqr_many ; 2299 cycles 1276 | ;z^(2^40 - 2^20) 1277 | 1278 | mov r1,sp 1279 | add r2,sp,#32 1280 | bl fe25519_mul 1281 | add sp,sp,#32 1282 | frame address sp,332 1283 | ;z^(2^40 - 2^0) 1284 | 1285 | movs r8,#10 1286 | bl fe25519_sqr_many ; 1159 cycles 1287 | ;z^(2^50 - 2^10) 1288 | 1289 | mov r1,sp 1290 | add r2,sp,#32 1291 | bl fe25519_mul 1292 | add r8,sp,#32 1293 | stm r8,{r0-r7} 1294 | 1295 | ; current stack: _ z^(2^50 - 2^0) z^(2^11) x z ... 1296 | 1297 | movs r8,#50 1298 | bl fe25519_sqr_many ; 5719 cycles 1299 | ;z^(2^100 - 2^50) 1300 | 1301 | mov r1,sp 1302 | add r2,sp,#32 1303 | bl fe25519_mul 1304 | stm sp,{r0-r7} 1305 | 1306 | ; 13751/12708 cycles so far for inversion 1307 | 1308 | ; current stack: z^(2^100 - 2^0) z^(2^50 - 2^0) z^(2^11) x z ... 
1309 | 1310 | movs r8,#100 1311 | sub sp,sp,#32 1312 | frame address sp,364 1313 | bl fe25519_sqr_many ; 11419 cycles 1314 | ;z^(2^200 - 2^100) 1315 | 1316 | mov r1,sp 1317 | add r2,sp,#32 1318 | bl fe25519_mul 1319 | add sp,sp,#32 1320 | frame address sp,332 1321 | ;z^(2^200 - 2^0) 1322 | 1323 | ; current stack: _ z^(2^50 - 2^0) z^(2^11) x z ... 1324 | 1325 | movs r8,#50 1326 | bl fe25519_sqr_many ; 5719 cycles 1327 | ;z^(2^250 - 2^50) 1328 | 1329 | mov r1,sp 1330 | add r2,sp,#32 1331 | bl fe25519_mul 1332 | ;z^(2^250 - 2^0) 1333 | 1334 | movs r8,#5 1335 | bl fe25519_sqr_many ; 589 cycles 1336 | ;z^(2^255 - 2^5) 1337 | 1338 | mov r1,sp 1339 | add r2,sp,#64 1340 | bl fe25519_mul 1341 | stm sp,{r0-r7} 1342 | ;z^(2^255 - 21) 1343 | 1344 | ; 19661/18209 for second half of inversion 1345 | 1346 | ; done inverting! 1347 | ; total inversion cost: 33412/30917 cycles 1348 | 1349 | mov r1,sp 1350 | add r2,sp,#96 1351 | bl fe25519_mul 1352 | 1353 | ; now final reduce 1354 | lsr r8,r7,#31 1355 | mov r9,#19 1356 | mul r8,r8,r9 1357 | mov r10,#0 1358 | 1359 | ; handle the case when 2^255 - 19 <= x < 2^255 1360 | add r8,r8,#19 1361 | 1362 | adds r8,r0,r8 1363 | adcs r8,r1,r10 1364 | adcs r8,r2,r10 1365 | adcs r8,r3,r10 1366 | adcs r8,r4,r10 1367 | adcs r8,r5,r10 1368 | adcs r8,r6,r10 1369 | adcs r8,r7,r10 1370 | adcs r11,r10,r10 1371 | 1372 | lsr r8,r8,#31 1373 | orr r8,r8,r11, lsl #1 1374 | mul r8,r8,r9 1375 | 1376 | ldr r9,[sp,#292] 1377 | 1378 | adds r0,r0,r8 1379 | str r0,[r9,#0] 1380 | movs r0,#0 1381 | adcs r1,r1,r0 1382 | str r1,[r9,#4] 1383 | mov r1,r9 1384 | adcs r2,r2,r0 1385 | adcs r3,r3,r0 1386 | adcs r4,r4,r0 1387 | adcs r5,r5,r0 1388 | adcs r6,r6,r0 1389 | adcs r7,r7,r0 1390 | and r7,r7,#0x7fffffff 1391 | 1392 | str r2,[r1,#8] 1393 | str r3,[r1,#12] 1394 | str r4,[r1,#16] 1395 | str r5,[r1,#20] 1396 | str r6,[r1,#24] 1397 | str r7,[r1,#28] 1398 | 1399 | add sp,sp,#296 1400 | frame address sp,36 1401 | 1402 | pop {r4-r11,pc} 1403 | 1404 | ; 215 cycles after 
inversion 1405 | ; in total for whole function 475 469 cycles theoretically 1406 | 1407 | endp 1408 | 1409 | end 1410 | -------------------------------------------------------------------------------- /linux_example.c: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | #include <stdlib.h> 3 | #include <string.h> 4 | 5 | #include <fcntl.h> 6 | #include <unistd.h> 7 | #include <errno.h> 8 | 9 | #include "x25519-cortex-m4.h" 10 | 11 | static int rand_fd = -1; 12 | 13 | static void init_rand(void) { 14 | rand_fd = open("/dev/urandom", O_RDONLY); 15 | if (rand_fd < 0) { 16 | perror("opening /dev/urandom"); 17 | exit(1); 18 | } 19 | } 20 | 21 | static void get_random_bytes(unsigned char* buf, int len) { 22 | if (rand_fd == -1) { 23 | fprintf(stderr, "rand_fd not initialized\n"); 24 | exit(1); 25 | } 26 | 27 | int nread = 0; 28 | while (len) { 29 | int nbytes = read(rand_fd, buf + nread, len); 30 | if (nbytes < 0) { 31 | if (errno == EINTR) { 32 | continue; 33 | } 34 | perror("get_random_bytes"); 35 | exit(1); 36 | } 37 | if (nbytes == 0) { 38 | fprintf(stderr, "rand_fd closed\n"); 39 | exit(1); 40 | } 41 | nread += nbytes; 42 | len -= nbytes; 43 | } 44 | } 45 | 46 | int main() { 47 | unsigned char secret_key_alice[32], secret_key_bob[32]; 48 | unsigned char public_key_alice[32], public_key_bob[32]; 49 | unsigned char shared_secret_alice[32], shared_secret_bob[32]; 50 | 51 | init_rand(); 52 | 53 | // Alice computes 54 | get_random_bytes(secret_key_alice, 32); 55 | X25519_calc_public_key(public_key_alice, secret_key_alice); 56 | 57 | // Bob computes 58 | get_random_bytes(secret_key_bob, 32); 59 | X25519_calc_public_key(public_key_bob, secret_key_bob); 60 | 61 | // The public keys are now exchanged over some protocol 62 | 63 | // Alice computes 64 | X25519_calc_shared_secret(shared_secret_alice, secret_key_alice, public_key_bob); 65 | 66 | // Bob computes 67 | X25519_calc_shared_secret(shared_secret_bob, secret_key_bob, public_key_alice); 68 | 69 | if (memcmp(shared_secret_alice,
shared_secret_bob, 32) == 0) { 70 | puts("SUCCESS: Both Bob and Alice computed the same shared secret"); 71 | } else { 72 | puts("FAILED: Bob and Alice did not compute the same shared secret"); 73 | exit(1); 74 | } 75 | 76 | return 0; 77 | } 78 | -------------------------------------------------------------------------------- /x25519-cortex-m4-gcc.s: -------------------------------------------------------------------------------- 1 | .syntax unified 2 | .thumb 3 | // Curve25519 scalar multiplication 4 | // Copyright (c) 2017, Emil Lenngren 5 | // 6 | // All rights reserved. 7 | // 8 | // Redistribution and use in source and binary forms, with or without modification, 9 | // are permitted provided that the following conditions are met: 10 | // 11 | // 1. Redistributions of source code must retain the above copyright notice, this 12 | // list of conditions and the following disclaimer. 13 | // 14 | // 2. Redistributions in binary form, except as embedded into a Nordic 15 | // Semiconductor ASA or Dialog Semiconductor PLC integrated circuit in a product 16 | // or a software update for such product, must reproduce the above copyright 17 | // notice, this list of conditions and the following disclaimer in the 18 | // documentation and/or other materials provided with the distribution. 19 | // 20 | // THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 21 | // ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 22 | // WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | // DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR 24 | // ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 25 | // (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 26 | // LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 27 | // ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 28 | // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 29 | // SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 30 | 31 | 32 | 33 | // This is an armv7 implementation of X25519. 34 | // It follows the reference implementation where the representation of 35 | // a field element [0..2^255-19) is represented by a 256-bit little endian integer, 36 | // reduced modulo 2^256-38, and may possibly be in the range [2^256-38..2^256). 37 | // The scalar is a 256-bit integer where certain bits are hardcoded per specification. 38 | // 39 | // The implementation runs in constant time (548 873 cycles on ARM Cortex-M4, 40 | // assuming no wait states), and no conditional branches or memory access 41 | // pattern depend on secret data.
42 | 43 | .text 44 | .align 2 45 | 46 | // input: *r8=a, *r9=b 47 | // output: r0-r7 48 | // clobbers all other registers 49 | // cycles: 45 50 | .type fe25519_add, %function 51 | fe25519_add: 52 | .global fe25519_add 53 | ldr r0,[r8,#28] 54 | ldr r4,[r9,#28] 55 | adds r0,r0,r4 56 | mov r11,#0 57 | adc r11,r11,r11 58 | lsl r11,r11,#1 59 | add r11,r11,r0, lsr #31 60 | movs r7,#19 61 | mul r11,r11,r7 62 | bic r7,r0,#0x80000000 63 | 64 | ldm r8!,{r0-r3} 65 | ldm r9!,{r4-r6,r10} 66 | mov r12,#1 67 | umaal r0,r11,r12,r4 68 | umaal r1,r11,r12,r5 69 | umaal r2,r11,r12,r6 70 | umaal r3,r11,r12,r10 71 | ldm r9,{r4-r6} 72 | ldm r8,{r8-r10} 73 | umaal r4,r11,r12,r8 74 | umaal r5,r11,r12,r9 75 | umaal r6,r11,r12,r10 76 | add r7,r7,r11 77 | bx lr 78 | 79 | .size fe25519_add, .-fe25519_add 80 | 81 | // input: *r8=a, *r9=b 82 | // output: r0-r7 83 | // clobbers all other registers 84 | // cycles: 46 85 | .type fe25519_sub, %function 86 | fe25519_sub: 87 | .global fe25519_sub 88 | 89 | ldm r8,{r0-r7} 90 | ldm r9!,{r8,r10-r12} 91 | subs r0,r8 92 | sbcs r1,r10 93 | sbcs r2,r11 94 | sbcs r3,r12 95 | ldm r9,{r8-r11} 96 | sbcs r4,r8 97 | sbcs r5,r9 98 | sbcs r6,r10 99 | sbcs r7,r11 100 | 101 | // if subtraction goes below 0, set r8 to -1 and r9 to -38, else set both to 0 102 | sbc r8,r8 103 | and r9,r8,#-38 104 | 105 | adds r0,r9 106 | adcs r1,r8 107 | adcs r2,r8 108 | adcs r3,r8 109 | adcs r4,r8 110 | adcs r5,r8 111 | adcs r6,r8 112 | adcs r7,r8 113 | 114 | // if the subtraction did not go below 0, we are done and (r8,r9) are set to 0 115 | // if the subtraction went below 0 and the addition overflowed, we are done, so set (r8,r9) to 0 116 | // if the subtraction went below 0 and the addition did not overflow, we need to add once more 117 | // (r8,r9) will be correctly set to (-1,-38) only when r8 was -1 and we don't have a carry, 118 | // note that the carry will always be 0 in case (r8,r9) was (0,0) since then there was no real addition 119 | // also note that it is extremely 
unlikely we will need an extra addition: 120 | // that can only happen if input1 was slightly >= 0 and input2 was > 2^256-38 (really input2-input1 > 2^256-38) 121 | // in that case we currently have 2^256-38 < (r0...r7) < 2^256, so adding -38 will only affect r0 122 | adcs r8,#0 123 | and r9,r8,#-38 124 | 125 | adds r0,r9 126 | 127 | bx lr 128 | 129 | .size fe25519_sub, .-fe25519_sub 130 | 131 | // input: *r1=a, *r2=b 132 | // output: r0-r7 133 | // clobbers all other registers 134 | // cycles: 173 135 | .type fe25519_mul, %function 136 | fe25519_mul: 137 | .global fe25519_mul 138 | push {r2,lr} 139 | //frame push {lr} 140 | //frame address sp,8 141 | 142 | sub sp,#28 143 | //frame address sp,36 144 | ldm r2,{r2,r3,r4,r5} 145 | 146 | ldm r1!,{r0,r10,lr} 147 | umull r6,r11,r2,r0 148 | 149 | umull r7,r12,r3,r0 150 | umaal r7,r11,r2,r10 151 | 152 | push {r6,r7} 153 | //frame address sp,44 154 | 155 | umull r8,r6,r4,r0 156 | umaal r8,r11,r3,r10 157 | 158 | umull r9,r7,r5,r0 159 | umaal r9,r11,r4,r10 160 | 161 | umaal r11,r7,r5,r10 162 | 163 | umaal r8,r12,r2,lr 164 | umaal r9,r12,r3,lr 165 | umaal r11,r12,r4,lr 166 | umaal r12,r7,r5,lr 167 | 168 | ldm r1!,{r0,r10,lr} 169 | 170 | umaal r9,r6,r2,r0 171 | umaal r11,r6,r3,r0 172 | umaal r12,r6,r4,r0 173 | umaal r6,r7,r5,r0 174 | 175 | strd r8,r9,[sp,#8] 176 | 177 | mov r9,#0 178 | umaal r11,r9,r2,r10 179 | umaal r12,r9,r3,r10 180 | umaal r6,r9,r4,r10 181 | umaal r7,r9,r5,r10 182 | 183 | mov r10,#0 184 | umaal r12,r10,r2,lr 185 | umaal r6,r10,r3,lr 186 | umaal r7,r10,r4,lr 187 | umaal r9,r10,r5,lr 188 | 189 | ldr r8,[r1],#4 190 | mov lr,#0 191 | umaal lr,r6,r2,r8 192 | umaal r7,r6,r3,r8 193 | umaal r9,r6,r4,r8 194 | umaal r10,r6,r5,r8 195 | 196 | //_ _ _ _ _ 6 10 9| 7 | lr 12 11 _ _ _ _ 197 | 198 | ldr r8,[r1],#-28 199 | mov r0,#0 200 | umaal r7,r0,r2,r8 201 | umaal r9,r0,r3,r8 202 | umaal r10,r0,r4,r8 203 | umaal r6,r0,r5,r8 204 | 205 | push {r0} 206 | //frame address sp,48 207 | 208 | //_ _ _ _ s 6 10 9| 7 | lr 12 11 _ _ 
_ _ 209 | 210 | ldr r2,[sp,#40] 211 | adds r2,r2,#16 212 | ldm r2,{r2,r3,r4,r5} 213 | 214 | ldr r8,[r1],#4 215 | mov r0,#0 216 | umaal r11,r0,r2,r8 217 | str r11,[sp,#16+4] 218 | umaal r12,r0,r3,r8 219 | umaal lr,r0,r4,r8 220 | umaal r0,r7,r5,r8 // 7=carry for 9 221 | 222 | //_ _ _ _ s 6 10 9+7| 0 | lr 12 _ _ _ _ _ 223 | 224 | ldr r8,[r1],#4 225 | mov r11,#0 226 | umaal r12,r11,r2,r8 227 | str r12,[sp,#20+4] 228 | umaal lr,r11,r3,r8 229 | umaal r0,r11,r4,r8 230 | umaal r11,r7,r5,r8 // 7=carry for 10 231 | 232 | //_ _ _ _ s 6 10+7 9+11| 0 | lr _ _ _ _ _ _ 233 | 234 | ldr r8,[r1],#4 235 | mov r12,#0 236 | umaal lr,r12,r2,r8 237 | str lr,[sp,#24+4] 238 | umaal r0,r12,r3,r8 239 | umaal r11,r12,r4,r8 240 | umaal r10,r12,r5,r8 // 12=carry for 6 241 | 242 | //_ _ _ _ s 6+12 10+7 9+11| 0 | _ _ _ _ _ _ _ 243 | 244 | ldr r8,[r1],#4 245 | mov lr,#0 246 | umaal r0,lr,r2,r8 247 | str r0,[sp,#28+4] 248 | umaal r11,lr,r3,r8 249 | umaal r10,lr,r4,r8 250 | umaal r6,lr,r5,r8 // lr=carry for saved 251 | 252 | //_ _ _ _ s+lr 6+12 10+7 9+11| _ | _ _ _ _ _ _ _ 253 | 254 | ldm r1!,{r0,r8} 255 | umaal r11,r9,r2,r0 256 | str r11,[sp,#32+4] 257 | umaal r9,r10,r3,r0 258 | umaal r10,r6,r4,r0 259 | pop {r11} 260 | //frame address sp,44 261 | umaal r11,r6,r5,r0 // 6=carry for next 262 | 263 | //_ _ _ 6 11+lr 10+12 9+7 _ | _ | _ _ _ _ _ _ _ 264 | 265 | umaal r9,r7,r2,r8 266 | umaal r10,r7,r3,r8 267 | umaal r11,r7,r4,r8 268 | umaal r6,r7,r5,r8 269 | 270 | ldm r1!,{r0,r8} 271 | umaal r10,r12,r2,r0 272 | umaal r11,r12,r3,r0 273 | umaal r6,r12,r4,r0 274 | umaal r7,r12,r5,r0 275 | 276 | umaal r11,lr,r2,r8 277 | umaal r6,lr,r3,r8 278 | umaal lr,r7,r4,r8 279 | umaal r7,r12,r5,r8 280 | 281 | // 12 7 lr 6 11 10 9 stack*9 282 | 283 | //now reduce 284 | 285 | ldrd r4,r5,[sp,#28] 286 | movs r3,#38 287 | mov r8,#0 288 | umaal r4,r8,r3,r12 289 | lsl r8,r8,#1 290 | orr r8,r8,r4, lsr #31 291 | and r12,r4,#0x7fffffff 292 | movs r4,#19 293 | mul r8,r8,r4 294 | 295 | pop {r0-r2} 296 | //frame address sp,32 297 | 
umaal r0,r8,r3,r5 298 | umaal r1,r8,r3,r9 299 | umaal r2,r8,r3,r10 300 | mov r9,#38 301 | pop {r3,r4} 302 | //frame address sp,24 303 | umaal r3,r8,r9,r11 304 | umaal r4,r8,r9,r6 305 | pop {r5,r6} 306 | //frame address sp,16 307 | umaal r5,r8,r9,lr 308 | umaal r6,r8,r9,r7 309 | add r7,r8,r12 310 | 311 | add sp,#12 312 | //frame address sp,4 313 | pop {pc} 314 | 315 | .size fe25519_mul, .-fe25519_mul 316 | 317 | // input/result in (r0-r7) 318 | // clobbers all other registers 319 | // cycles: 115 320 | .type fe25519_sqr, %function 321 | fe25519_sqr: 322 | .global fe25519_sqr 323 | push {lr} 324 | //frame push {lr} 325 | sub sp,#20 326 | //frame address sp,24 327 | 328 | //mul 01, 00 329 | umull r9,r10,r0,r0 330 | umull r11,r12,r0,r1 331 | adds r11,r11,r11 332 | mov lr,#0 333 | umaal r10,r11,lr,lr 334 | 335 | //r9 r10 done 336 | //r12 carry for 3rd before col 337 | //r11+C carry for 3rd final col 338 | 339 | push {r9,r10} 340 | //frame address sp,32 341 | 342 | //mul 02, 11 343 | mov r8,#0 344 | umaal r8,r12,r0,r2 345 | adcs r8,r8,r8 346 | umaal r8,r11,r1,r1 347 | 348 | //r8 done (3rd col) 349 | //r12 carry for 4th before col 350 | //r11+C carry for 4th final col 351 | 352 | //mul 03, 12 353 | umull r9,r10,r0,r3 354 | umaal r9,r12,r1,r2 355 | adcs r9,r9,r9 356 | umaal r9,r11,lr,lr 357 | 358 | //r9 done (4th col) 359 | //r10+r12 carry for 5th before col 360 | //r11+C carry for 5th final col 361 | 362 | strd r8,r9,[sp,#8] 363 | 364 | //mul 04, 13, 22 365 | mov r9,#0 366 | umaal r9,r10,r0,r4 367 | umaal r9,r12,r1,r3 368 | adcs r9,r9,r9 369 | umaal r9,r11,r2,r2 370 | 371 | //r9 done (5th col) 372 | //r10+r12 carry for 6th before col 373 | //r11+C carry for 6th final col 374 | 375 | str r9,[sp,#16] 376 | 377 | //mul 05, 14, 23 378 | umull r9,r8,r0,r5 379 | umaal r9,r10,r1,r4 380 | umaal r9,r12,r2,r3 381 | adcs r9,r9,r9 382 | umaal r9,r11,lr,lr 383 | 384 | //r9 done (6th col) 385 | //r10+r12+r8 carry for 7th before col 386 | //r11+C carry for 7th final col 387 | 388 | str 
r9,[sp,#20] 389 | 390 | //mul 06, 15, 24, 33 391 | mov r9,#0 392 | umaal r9,r8,r1,r5 393 | umaal r9,r12,r2,r4 394 | umaal r9,r10,r0,r6 395 | adcs r9,r9,r9 396 | umaal r9,r11,r3,r3 397 | 398 | //r9 done (7th col) 399 | //r8+r10+r12 carry for 8th before col 400 | //r11+C carry for 8th final col 401 | 402 | str r9,[sp,#24] 403 | 404 | //mul 07, 16, 25, 34 405 | umull r0,r9,r0,r7 406 | umaal r0,r10,r1,r6 407 | umaal r0,r12,r2,r5 408 | umaal r0,r8,r3,r4 409 | adcs r0,r0,r0 410 | umaal r0,r11,lr,lr 411 | 412 | //r0 done (8th col) 413 | //r9+r8+r10+r12 carry for 9th before col 414 | //r11+C carry for 9th final col 415 | 416 | //mul 17, 26, 35, 44 417 | umaal r9,r8,r1,r7 //r1 is now dead 418 | umaal r9,r10,r2,r6 419 | umaal r12,r9,r3,r5 420 | adcs r12,r12,r12 421 | umaal r11,r12,r4,r4 422 | 423 | //r11 done (9th col) 424 | //r8+r10+r9 carry for 10th before col 425 | //r12+C carry for 10th final col 426 | 427 | //mul 27, 36, 45 428 | umaal r9,r8,r2,r7 //r2 is now dead 429 | umaal r10,r9,r3,r6 430 | movs r2,#0 431 | umaal r10,r2,r4,r5 432 | adcs r10,r10,r10 433 | umaal r12,r10,lr,lr 434 | 435 | //r12 done (10th col) 436 | //r8+r9+r2 carry for 11th before col 437 | //r10+C carry for 11th final col 438 | 439 | //mul 37, 46, 55 440 | umaal r2,r8,r3,r7 //r3 is now dead 441 | umaal r9,r2,r4,r6 442 | adcs r9,r9,r9 443 | umaal r10,r9,r5,r5 444 | 445 | //r10 done (11th col) 446 | //r8+r2 carry for 12th before col 447 | //r9+C carry for 12th final col 448 | 449 | //mul 47, 56 450 | movs r3,#0 451 | umaal r3,r8,r4,r7 //r4 is now dead 452 | umaal r3,r2,r5,r6 453 | adcs r3,r3,r3 454 | umaal r9,r3,lr,lr 455 | 456 | //r9 done (12th col) 457 | //r8+r2 carry for 13th before col 458 | //r3+C carry for 13th final col 459 | 460 | //mul 57, 66 461 | umaal r8,r2,r5,r7 //r5 is now dead 462 | adcs r8,r8,r8 463 | umaal r3,r8,r6,r6 464 | 465 | //r3 done (13th col) 466 | //r2 carry for 14th before col 467 | //r8+C carry for 14th final col 468 | 469 | //mul 67 470 | umull r4,r5,lr,lr // set 0 471 | 
umaal r4,r2,r6,r7 472 | adcs r4,r4,r4 473 | umaal r4,r8,lr,lr 474 | 475 | //r4 done (14th col) 476 | //r2 carry for 15th before col 477 | //r8+C carry for 15th final col 478 | 479 | //mul 77 480 | adcs r2,r2,r2 481 | umaal r8,r2,r7,r7 482 | adcs r2,r2,lr 483 | 484 | //r8 done (15th col) 485 | //r2 done (16th col) 486 | 487 | //msb -> lsb: r2 r8 r4 r3 r9 r10 r12 r11 r0 sp+24 sp+20 sp+16 sp+12 sp+8 sp+4 sp 488 | //lr: 0 489 | //now do reduction 490 | 491 | mov r6,#38 492 | umaal r0,lr,r6,r2 493 | lsl lr,lr,#1 494 | orr lr,lr,r0, lsr #31 495 | and r7,r0,#0x7fffffff 496 | movs r5,#19 497 | mul lr,lr,r5 498 | 499 | pop {r0,r1} 500 | //frame address sp,24 501 | umaal r0,lr,r6,r11 502 | umaal r1,lr,r6,r12 503 | 504 | mov r11,r3 505 | mov r12,r4 506 | 507 | pop {r2,r3,r4,r5} 508 | //frame address sp,8 509 | umaal r2,lr,r6,r10 510 | umaal r3,lr,r6,r9 511 | 512 | umaal r4,lr,r6,r11 513 | umaal r5,lr,r6,r12 514 | 515 | pop {r6} 516 | //frame address sp,4 517 | mov r12,#38 518 | umaal r6,lr,r12,r8 519 | add r7,r7,lr 520 | 521 | pop {pc} 522 | 523 | .size fe25519_sqr, .-fe25519_sqr 524 | 525 | // in: r0-r7, count: r8 526 | // out: r0-r7 + sets result also to top of stack 527 | // clobbers all other registers 528 | // cycles: 19 + 123*n 529 | .type fe25519_sqr_many, %function 530 | fe25519_sqr_many: 531 | .global fe25519_sqr_many 532 | push {r8,lr} 533 | //frame push {r8,lr} 534 | 0: 535 | bl fe25519_sqr 536 | 537 | ldr r8,[sp,#0] 538 | subs r8,r8,#1 539 | str r8,[sp,#0] 540 | bne 0b 541 | 542 | add sp,sp,#4 543 | //frame address sp,4 544 | add r8,sp,#4 545 | stm r8,{r0-r7} 546 | pop {pc} 547 | .size fe25519_sqr_many, .-fe25519_sqr_many 548 | 549 | // This kind of load supports unaligned access 550 | // in: *r1 551 | // out: r0-r7 552 | // cycles: 22 553 | .type loadm, %function 554 | loadm: 555 | ldr r0,[r1,#0] 556 | ldr r2,[r1,#8] 557 | ldr r3,[r1,#12] 558 | ldr r4,[r1,#16] 559 | ldr r5,[r1,#20] 560 | ldr r6,[r1,#24] 561 | ldr r7,[r1,#28] 562 | ldr r1,[r1,#4] 563 | bx lr 564 | 
.size loadm, .-loadm 565 | 566 | // in: *r0 = result, *r1 = scalar, *r2 = basepoint (all pointers may be unaligned) 567 | // cycles: 548 873 568 | .type curve25519_scalarmult, %function 569 | curve25519_scalarmult: 570 | .global curve25519_scalarmult 571 | 572 | // stack layout: xp zp xq zq x0 bitpos lastbit scalar result_ptr r4-r11,lr 573 | // 0 32 64 96 128 160 164 168 200 204 574 | 575 | push {r0,r4-r11,lr} 576 | //frame push {r4-r11,lr} 577 | //frame address sp,40 578 | 579 | mov r10,r2 580 | bl loadm 581 | 582 | and r0,r0,#0xfffffff8 583 | //and r7,r7,#0x7fffffff not needed since we don't inspect the msb anyway 584 | orr r7,r7,#0x40000000 585 | push {r0-r7} 586 | //frame address sp,72 587 | movs r8,#0 588 | push {r2,r8} 589 | //frame address sp,80 590 | 591 | //ldm r1,{r0-r7} 592 | mov r1,r10 593 | bl loadm 594 | 595 | and r7,r7,#0x7fffffff 596 | push {r0-r7} 597 | //frame address sp,112 598 | 599 | movs r9,#1 600 | umull r10,r11,r8,r8 601 | mov r12,#0 602 | push {r8,r10,r11,r12} 603 | //frame address sp,128 604 | push {r9,r10,r11,r12} 605 | //frame address sp,144 606 | 607 | push {r0-r7} 608 | //frame address sp,176 609 | 610 | umull r6,r7,r8,r8 611 | push {r6,r7,r8,r10,r11,r12} 612 | //frame address sp,200 613 | push {r6,r7,r8,r10,r11,r12} 614 | //frame address sp,224 615 | push {r9,r10,r11,r12} 616 | //frame address sp,240 617 | 618 | movs r0,#254 619 | movs r3,#0 620 | // 129 cycles so far 621 | 0: 622 | // load scalar bit into r1 623 | lsrs r1,r0,#5 624 | adds r2,sp,#168 625 | ldr r1,[r2,r1,lsl #2] 626 | and r4,r0,#0x1f 627 | lsrs r1,r1,r4 628 | and r1,r1,#1 629 | 630 | strd r0,r1,[sp,#160] 631 | 632 | eors r1,r1,r3 633 | rsbs lr,r1,#0 634 | 635 | mov r0,sp 636 | add r1,sp,#64 637 | 638 | mov r11,#4 639 | // 15 cycles 640 | 1: 641 | ldm r0,{r2-r5} 642 | ldm r1,{r6-r9} 643 | 644 | eors r2,r2,r6 645 | and r10,r2,lr 646 | eors r6,r6,r10 647 | eors r2,r2,r6 648 | 649 | eors r3,r3,r7 650 | and r10,r3,lr 651 | eors r7,r7,r10 652 | eors r3,r3,r7 653 | 654 | eors 
r4,r4,r8 655 | and r10,r4,lr 656 | eors r8,r8,r10 657 | eors r4,r4,r8 658 | 659 | eors r5,r5,r9 660 | and r10,r5,lr 661 | eors r9,r9,r10 662 | eors r5,r5,r9 663 | 664 | stm r0!,{r2-r5} 665 | stm r1!,{r6-r9} 666 | 667 | subs r11,#1 668 | bne 1b 669 | // 40*4 - 2 = 158 cycles 670 | 671 | mov r8,sp 672 | add r9,sp,#32 673 | bl fe25519_add 674 | push {r0-r7} 675 | //frame address sp,272 676 | 677 | bl fe25519_sqr 678 | push {r0-r7} 679 | //frame address sp,304 680 | 681 | add r8,sp,#64 682 | add r9,sp,#96 683 | bl fe25519_sub 684 | push {r0-r7} 685 | //frame address sp,336 686 | 687 | bl fe25519_sqr 688 | push {r0-r7} 689 | //frame address sp,368 690 | 691 | mov r1,sp 692 | add r2,sp,#64 693 | bl fe25519_mul 694 | add r8,sp,#128 695 | stm r8,{r0-r7} 696 | 697 | add r8,sp,#64 698 | mov r9,sp 699 | bl fe25519_sub 700 | add r8,sp,#64 701 | stm r8,{r0-r7} 702 | 703 | // 64 + 1*45 + 2*46 + 1*173 + 2*115 = 604 cycles 704 | 705 | //multiplies (r0-r7) with 121666, adds *sp and puts the result on the top of the stack (replacing old content) 706 | ldr lr,=121666 707 | //mov lr,#56130 708 | //add lr,lr,#65536 709 | ldr r12,[sp,#28] 710 | mov r11,#0 711 | umaal r12,r11,lr,r7 712 | lsl r11,r11,#1 713 | add r11,r11,r12, lsr #31 714 | movs r7,#19 715 | mul r11,r11,r7 716 | bic r7,r12,#0x80000000 717 | ldm sp!,{r8,r9,r10,r12} 718 | //frame address sp,352 719 | umaal r8,r11,lr,r0 720 | umaal r9,r11,lr,r1 721 | umaal r10,r11,lr,r2 722 | umaal r12,r11,lr,r3 723 | ldm sp!,{r0,r1,r2} 724 | //frame address sp,340 725 | umaal r0,r11,lr,r4 726 | umaal r1,r11,lr,r5 727 | umaal r2,r11,lr,r6 728 | add r7,r7,r11 729 | add sp,sp,#4 730 | //frame address sp,338 731 | push {r0,r1,r2,r7} 732 | //frame address sp,352 733 | push {r8,r9,r10,r12} 734 | //frame address sp,368 735 | // 39 cycles 736 | 737 | mov r1,sp 738 | add r2,sp,#64 739 | bl fe25519_mul 740 | add r8,sp,#160 741 | stm r8,{r0-r7} 742 | 743 | add r8,sp,#192 744 | add r9,sp,#224 745 | bl fe25519_add 746 | stm sp,{r0-r7} 747 | 748 | mov 
r1,sp 749 | add r2,sp,#32 750 | bl fe25519_mul 751 | add r8,sp,#32 752 | stm r8,{r0-r7} 753 | 754 | add r8,sp,#192 755 | add r9,sp,#224 756 | bl fe25519_sub 757 | stm sp,{r0-r7} 758 | 759 | mov r1,sp 760 | add r2,sp,#96 761 | bl fe25519_mul 762 | stm sp,{r0-r7} 763 | 764 | mov r8,sp 765 | add r9,sp,#32 766 | bl fe25519_add 767 | 768 | bl fe25519_sqr 769 | 770 | add r8,sp,#192 771 | stm r8,{r0-r7} 772 | 773 | mov r8,sp 774 | add r9,sp,#32 775 | bl fe25519_sub 776 | 777 | bl fe25519_sqr 778 | stm sp,{r0-r7} 779 | 780 | mov r1,sp 781 | add r2,sp,#256 782 | bl fe25519_mul 783 | add r8,sp,#224 784 | stm r8,{r0-r7} 785 | 786 | add sp,sp,#128 787 | //frame address sp,240 788 | 789 | ldrd r2,r3,[sp,#160] 790 | subs r0,r2,#1 791 | // 97 + 2*45 + 2*46 + 4*173 + 2*115 = 1201 cycles 792 | bpl 0b 793 | // in total 2020 cycles per iteration, in total 515 098 cycles for 255 iterations 794 | 795 | //These cswap lines are not needed for curve25519 since the lowest bit is hardcoded to 0 796 | //---------- 797 | //rsbs lr,r3,#0 798 | 799 | //mov r0,sp 800 | //add r1,sp,#64 801 | 802 | //mov r11,#4 803 | //1 804 | //ldm r0,{r2-r5} 805 | //ldm r1!,{r6-r9} 806 | 807 | //eors r2,r2,r6 808 | //and r10,r2,lr 809 | //eors r6,r6,r10 810 | //eors r2,r2,r6 811 | 812 | //eors r3,r3,r7 813 | //and r10,r3,lr 814 | //eors r7,r7,r10 815 | //eors r3,r3,r7 816 | 817 | //eors r4,r4,r8 818 | //and r10,r4,lr 819 | //eors r8,r8,r10 820 | //eors r4,r4,r8 821 | 822 | //eors r5,r5,r9 823 | //and r10,r5,lr 824 | //eors r9,r9,r10 825 | //eors r5,r5,r9 826 | 827 | //stm r0!,{r2-r5} 828 | 829 | //subs r11,#1 830 | //bne 1b 831 | //---------- 832 | 833 | // now we must invert zp 834 | add r0,sp,#32 835 | ldm r0,{r0-r7} 836 | bl fe25519_sqr 837 | push {r0-r7} 838 | //frame address sp,272 839 | 840 | bl fe25519_sqr 841 | bl fe25519_sqr 842 | push {r0-r7} 843 | //frame address sp,304 844 | 845 | add r1,sp,#96 846 | mov r2,sp 847 | bl fe25519_mul 848 | stm sp,{r0-r7} 849 | 850 | mov r1,sp 851 | add r2,sp,#32 852 | 
bl fe25519_mul 853 | add r8,sp,#32 854 | stm r8,{r0-r7} 855 | 856 | // current stack: z^(2^9) z^(2^11) x z 857 | 858 | bl fe25519_sqr 859 | push {r0-r7} 860 | //frame address sp,336 861 | 862 | mov r1,sp 863 | add r2,sp,#32 864 | bl fe25519_mul 865 | add r8,sp,#32 866 | stm r8,{r0-r7} 867 | 868 | // current stack: _ z^(2^5 - 2^0) z^(2^11) x z 869 | 870 | mov r8,#5 871 | // 1052 cycles 872 | bl fe25519_sqr_many // 634 cycles 873 | 874 | mov r1,sp 875 | add r2,sp,#32 876 | bl fe25519_mul 877 | add r8,sp,#32 878 | stm r8,{r0-r7} 879 | 880 | // current stack: _ z^(2^10 - 2^0) z^(2^11) x z ... 881 | 882 | movs r8,#10 883 | bl fe25519_sqr_many // 1249 cycles 884 | //z^(2^20 - 2^10) 885 | 886 | mov r1,sp 887 | add r2,sp,#32 888 | bl fe25519_mul 889 | stm sp,{r0-r7} 890 | //z^(2^20 - 2^0) 891 | 892 | // current stack: z^(2^20 - 2^0) z^(2^10 - 2^0) z^(2^11) x z ... 893 | 894 | movs r8,#20 895 | sub sp,sp,#32 896 | //frame address sp,368 897 | bl fe25519_sqr_many // 2479 cycles 898 | //z^(2^40 - 2^20) 899 | 900 | mov r1,sp 901 | add r2,sp,#32 902 | bl fe25519_mul 903 | add sp,sp,#32 904 | //frame address sp,336 905 | //z^(2^40 - 2^0) 906 | 907 | movs r8,#10 908 | bl fe25519_sqr_many // 1249 cycles 909 | //z^(2^50 - 2^10) 910 | 911 | mov r1,sp 912 | add r2,sp,#32 913 | bl fe25519_mul 914 | add r8,sp,#32 915 | stm r8,{r0-r7} 916 | 917 | // current stack: _ z^(2^50 - 2^0) z^(2^11) x z ... 918 | 919 | movs r8,#50 920 | bl fe25519_sqr_many // 6169 cycles 921 | //z^(2^100 - 2^50) 922 | 923 | mov r1,sp 924 | add r2,sp,#32 925 | bl fe25519_mul 926 | stm sp,{r0-r7} 927 | 928 | // 13751 cycles so far for inversion 929 | 930 | // current stack: z^(2^100 - 2^0) z^(2^50 - 2^0) z^(2^11) x z ... 
931 | 932 | movs r8,#100 933 | sub sp,sp,#32 934 | //frame address sp,368 935 | bl fe25519_sqr_many // 12319 cycles 936 | //z^(2^200 - 2^100) 937 | 938 | mov r1,sp 939 | add r2,sp,#32 940 | bl fe25519_mul 941 | add sp,sp,#32 942 | //frame address sp,336 943 | //z^(2^200 - 2^0) 944 | 945 | // current stack: _ z^(2^50 - 2^0) z^(2^11) x z ... 946 | 947 | movs r8,#50 948 | bl fe25519_sqr_many // 6169 cycles 949 | //z^(2^250 - 2^50) 950 | 951 | mov r1,sp 952 | add r2,sp,#32 953 | bl fe25519_mul 954 | //z^(2^250 - 2^0) 955 | 956 | movs r8,#5 957 | bl fe25519_sqr_many // 634 cycles 958 | //z^(2^255 - 2^5) 959 | 960 | mov r1,sp 961 | add r2,sp,#64 962 | bl fe25519_mul 963 | stm sp,{r0-r7} 964 | //z^(2^255 - 21) 965 | 966 | // 19661 for second half of inversion 967 | 968 | // done inverting! 969 | // total inversion cost: 33412 cycles 970 | 971 | mov r1,sp 972 | add r2,sp,#96 973 | bl fe25519_mul 974 | 975 | // now final reduce 976 | lsr r8,r7,#31 977 | mov r9,#19 978 | mul r8,r8,r9 979 | mov r10,#0 980 | 981 | // handle the case when 2^255 - 19 <= x < 2^255 982 | add r8,r8,#19 983 | 984 | adds r8,r0,r8 985 | adcs r8,r1,r10 986 | adcs r8,r2,r10 987 | adcs r8,r3,r10 988 | adcs r8,r4,r10 989 | adcs r8,r5,r10 990 | adcs r8,r6,r10 991 | adcs r8,r7,r10 992 | adcs r11,r10,r10 993 | 994 | lsr r8,r8,#31 995 | orr r8,r8,r11, lsl #1 996 | mul r8,r8,r9 997 | 998 | ldr r9,[sp,#296] 999 | 1000 | adds r0,r0,r8 1001 | str r0,[r9,#0] 1002 | movs r0,#0 1003 | adcs r1,r1,r0 1004 | str r1,[r9,#4] 1005 | mov r1,r9 1006 | adcs r2,r2,r0 1007 | adcs r3,r3,r0 1008 | adcs r4,r4,r0 1009 | adcs r5,r5,r0 1010 | adcs r6,r6,r0 1011 | adcs r7,r7,r0 1012 | and r7,r7,#0x7fffffff 1013 | 1014 | str r2,[r1,#8] 1015 | str r3,[r1,#12] 1016 | str r4,[r1,#16] 1017 | str r5,[r1,#20] 1018 | str r6,[r1,#24] 1019 | str r7,[r1,#28] 1020 | 1021 | add sp,sp,#300 1022 | //frame address sp,36 1023 | 1024 | pop {r4-r11,pc} 1025 | 1026 | // 234 cycles after inversion 1027 | // in total for whole function 548 873 cycles 1028 
| 1029 | .size curve25519_scalarmult, .-curve25519_scalarmult 1030 | 1031 | 1032 | -------------------------------------------------------------------------------- /x25519-cortex-m4-keil.s: -------------------------------------------------------------------------------- 1 | ; Curve25519 scalar multiplication 2 | ; Copyright (c) 2017, Emil Lenngren 3 | ; 4 | ; All rights reserved. 5 | ; 6 | ; Redistribution and use in source and binary forms, with or without modification, 7 | ; are permitted provided that the following conditions are met: 8 | ; 9 | ; 1. Redistributions of source code must retain the above copyright notice, this 10 | ; list of conditions and the following disclaimer. 11 | ; 12 | ; 2. Redistributions in binary form, except as embedded into a Nordic 13 | ; Semiconductor ASA or Dialog Semiconductor PLC integrated circuit in a product 14 | ; or a software update for such product, must reproduce the above copyright 15 | ; notice, this list of conditions and the following disclaimer in the 16 | ; documentation and/or other materials provided with the distribution. 17 | ; 18 | ; THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 19 | ; ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 20 | ; WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 21 | ; DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR 22 | ; ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 23 | ; (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 24 | ; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 25 | ; ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 26 | ; (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 27 | ; SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 28 | 29 | 30 | 31 | ; This is an armv7 implementation of X25519. 
32 | ; It follows the reference implementation where the representation of 33 | ; a field element [0..2^255-19) is represented by a 256-bit little endian integer, 34 | ; reduced modulo 2^256-38, and may possibly be in the range [2^256-38..2^256). 35 | ; The scalar is a 256-bit integer where certain bits are hardcoded per specification. 36 | ; 37 | ; The implementation runs in constant time (548 873 cycles on ARM Cortex-M4, 38 | ; assuming no wait states), and no conditional branches or memory access 39 | ; pattern depend on secret data. 40 | 41 | area |.text|, code, readonly 42 | align 2 43 | 44 | ; input: *r8=a, *r9=b 45 | ; output: r0-r7 46 | ; clobbers all other registers 47 | ; cycles: 45 48 | fe25519_add proc 49 | export fe25519_add 50 | ldr r0,[r8,#28] 51 | ldr r4,[r9,#28] 52 | adds r0,r0,r4 53 | mov r11,#0 54 | adc r11,r11,r11 55 | lsl r11,r11,#1 56 | add r11,r11,r0, lsr #31 57 | movs r7,#19 58 | mul r11,r11,r7 59 | bic r7,r0,#0x80000000 60 | 61 | ldm r8!,{r0-r3} 62 | ldm r9!,{r4-r6,r10} 63 | mov r12,#1 64 | umaal r0,r11,r12,r4 65 | umaal r1,r11,r12,r5 66 | umaal r2,r11,r12,r6 67 | umaal r3,r11,r12,r10 68 | ldm r9,{r4-r6} 69 | ldm r8,{r8-r10} 70 | umaal r4,r11,r12,r8 71 | umaal r5,r11,r12,r9 72 | umaal r6,r11,r12,r10 73 | add r7,r7,r11 74 | bx lr 75 | 76 | endp 77 | 78 | ; input: *r8=a, *r9=b 79 | ; output: r0-r7 80 | ; clobbers all other registers 81 | ; cycles: 46 82 | fe25519_sub proc 83 | export fe25519_sub 84 | 85 | ldm r8,{r0-r7} 86 | ldm r9!,{r8,r10-r12} 87 | subs r0,r8 88 | sbcs r1,r10 89 | sbcs r2,r11 90 | sbcs r3,r12 91 | ldm r9,{r8-r11} 92 | sbcs r4,r8 93 | sbcs r5,r9 94 | sbcs r6,r10 95 | sbcs r7,r11 96 | 97 | ; if subtraction goes below 0, set r8 to -1 and r9 to -38, else set both to 0 98 | sbc r8,r8 99 | and r9,r8,#-38 100 | 101 | adds r0,r9 102 | adcs r1,r8 103 | adcs r2,r8 104 | adcs r3,r8 105 | adcs r4,r8 106 | adcs r5,r8 107 | adcs r6,r8 108 | adcs r7,r8 109 | 110 | ; if the subtraction did not go below 0, we are done and (r8,r9) are set to 
0 111 | ; if the subtraction went below 0 and the addition overflowed, we are done, so set (r8,r9) to 0 112 | ; if the subtraction went below 0 and the addition did not overflow, we need to add once more 113 | ; (r8,r9) will be correctly set to (-1,-38) only when r8 was -1 and we don't have a carry, 114 | ; note that the carry will always be 0 in case (r8,r9) was (0,0) since then there was no real addition 115 | ; also note that it is extremely unlikely we will need an extra addition: 116 | ; that can only happen if input1 was slightly >= 0 and input2 was > 2^256-38 (really input2-input1 > 2^256-38) 117 | ; in that case we currently have 2^256-38 < (r0...r7) < 2^256, so adding -38 will only affect r0 118 | adcs r8,#0 119 | and r9,r8,#-38 120 | 121 | adds r0,r9 122 | 123 | bx lr 124 | 125 | endp 126 | 127 | ; input: *r1=a, *r2=b 128 | ; output: r0-r7 129 | ; clobbers all other registers 130 | ; cycles: 173 131 | fe25519_mul proc 132 | export fe25519_mul 133 | push {r2,lr} 134 | frame push {lr} 135 | frame address sp,8 136 | 137 | sub sp,#28 138 | frame address sp,36 139 | ldm r2,{r2,r3,r4,r5} 140 | 141 | ldm r1!,{r0,r10,lr} 142 | umull r6,r11,r2,r0 143 | 144 | umull r7,r12,r3,r0 145 | umaal r7,r11,r2,r10 146 | 147 | push {r6,r7} 148 | frame address sp,44 149 | 150 | umull r8,r6,r4,r0 151 | umaal r8,r11,r3,r10 152 | 153 | umull r9,r7,r5,r0 154 | umaal r9,r11,r4,r10 155 | 156 | umaal r11,r7,r5,r10 157 | 158 | umaal r8,r12,r2,lr 159 | umaal r9,r12,r3,lr 160 | umaal r11,r12,r4,lr 161 | umaal r12,r7,r5,lr 162 | 163 | ldm r1!,{r0,r10,lr} 164 | 165 | umaal r9,r6,r2,r0 166 | umaal r11,r6,r3,r0 167 | umaal r12,r6,r4,r0 168 | umaal r6,r7,r5,r0 169 | 170 | strd r8,r9,[sp,#8] 171 | 172 | mov r9,#0 173 | umaal r11,r9,r2,r10 174 | umaal r12,r9,r3,r10 175 | umaal r6,r9,r4,r10 176 | umaal r7,r9,r5,r10 177 | 178 | mov r10,#0 179 | umaal r12,r10,r2,lr 180 | umaal r6,r10,r3,lr 181 | umaal r7,r10,r4,lr 182 | umaal r9,r10,r5,lr 183 | 184 | ldr r8,[r1],#4 185 | mov lr,#0 186 | umaal 
lr,r6,r2,r8 187 | umaal r7,r6,r3,r8 188 | umaal r9,r6,r4,r8 189 | umaal r10,r6,r5,r8 190 | 191 | ;_ _ _ _ _ 6 10 9| 7 | lr 12 11 _ _ _ _ 192 | 193 | ldr r8,[r1],#-28 194 | mov r0,#0 195 | umaal r7,r0,r2,r8 196 | umaal r9,r0,r3,r8 197 | umaal r10,r0,r4,r8 198 | umaal r6,r0,r5,r8 199 | 200 | push {r0} 201 | frame address sp,48 202 | 203 | ;_ _ _ _ s 6 10 9| 7 | lr 12 11 _ _ _ _ 204 | 205 | ldr r2,[sp,#40] 206 | adds r2,r2,#16 207 | ldm r2,{r2,r3,r4,r5} 208 | 209 | ldr r8,[r1],#4 210 | mov r0,#0 211 | umaal r11,r0,r2,r8 212 | str r11,[sp,#16+4] 213 | umaal r12,r0,r3,r8 214 | umaal lr,r0,r4,r8 215 | umaal r0,r7,r5,r8 ; 7=carry for 9 216 | 217 | ;_ _ _ _ s 6 10 9+7| 0 | lr 12 _ _ _ _ _ 218 | 219 | ldr r8,[r1],#4 220 | mov r11,#0 221 | umaal r12,r11,r2,r8 222 | str r12,[sp,#20+4] 223 | umaal lr,r11,r3,r8 224 | umaal r0,r11,r4,r8 225 | umaal r11,r7,r5,r8 ; 7=carry for 10 226 | 227 | ;_ _ _ _ s 6 10+7 9+11| 0 | lr _ _ _ _ _ _ 228 | 229 | ldr r8,[r1],#4 230 | mov r12,#0 231 | umaal lr,r12,r2,r8 232 | str lr,[sp,#24+4] 233 | umaal r0,r12,r3,r8 234 | umaal r11,r12,r4,r8 235 | umaal r10,r12,r5,r8 ; 12=carry for 6 236 | 237 | ;_ _ _ _ s 6+12 10+7 9+11| 0 | _ _ _ _ _ _ _ 238 | 239 | ldr r8,[r1],#4 240 | mov lr,#0 241 | umaal r0,lr,r2,r8 242 | str r0,[sp,#28+4] 243 | umaal r11,lr,r3,r8 244 | umaal r10,lr,r4,r8 245 | umaal r6,lr,r5,r8 ; lr=carry for saved 246 | 247 | ;_ _ _ _ s+lr 6+12 10+7 9+11| _ | _ _ _ _ _ _ _ 248 | 249 | ldm r1!,{r0,r8} 250 | umaal r11,r9,r2,r0 251 | str r11,[sp,#32+4] 252 | umaal r9,r10,r3,r0 253 | umaal r10,r6,r4,r0 254 | pop {r11} 255 | frame address sp,44 256 | umaal r11,r6,r5,r0 ; 6=carry for next 257 | 258 | ;_ _ _ 6 11+lr 10+12 9+7 _ | _ | _ _ _ _ _ _ _ 259 | 260 | umaal r9,r7,r2,r8 261 | umaal r10,r7,r3,r8 262 | umaal r11,r7,r4,r8 263 | umaal r6,r7,r5,r8 264 | 265 | ldm r1!,{r0,r8} 266 | umaal r10,r12,r2,r0 267 | umaal r11,r12,r3,r0 268 | umaal r6,r12,r4,r0 269 | umaal r7,r12,r5,r0 270 | 271 | umaal r11,lr,r2,r8 272 | umaal r6,lr,r3,r8 273 | umaal 
lr,r7,r4,r8 274 | umaal r7,r12,r5,r8 275 | 276 | ; 12 7 lr 6 11 10 9 stack*9 277 | 278 | ;now reduce 279 | 280 | ldrd r4,r5,[sp,#28] 281 | movs r3,#38 282 | mov r8,#0 283 | umaal r4,r8,r3,r12 284 | lsl r8,r8,#1 285 | orr r8,r8,r4, lsr #31 286 | and r12,r4,#0x7fffffff 287 | movs r4,#19 288 | mul r8,r8,r4 289 | 290 | pop {r0-r2} 291 | frame address sp,32 292 | umaal r0,r8,r3,r5 293 | umaal r1,r8,r3,r9 294 | umaal r2,r8,r3,r10 295 | mov r9,#38 296 | pop {r3,r4} 297 | frame address sp,24 298 | umaal r3,r8,r9,r11 299 | umaal r4,r8,r9,r6 300 | pop {r5,r6} 301 | frame address sp,16 302 | umaal r5,r8,r9,lr 303 | umaal r6,r8,r9,r7 304 | add r7,r8,r12 305 | 306 | add sp,#12 307 | frame address sp,4 308 | pop {pc} 309 | 310 | endp 311 | 312 | ; input/result in (r0-r7) 313 | ; clobbers all other registers 314 | ; cycles: 115 315 | fe25519_sqr proc 316 | export fe25519_sqr 317 | push {lr} 318 | frame push {lr} 319 | sub sp,#20 320 | frame address sp,24 321 | 322 | ;mul 01, 00 323 | umull r9,r10,r0,r0 324 | umull r11,r12,r0,r1 325 | adds r11,r11,r11 326 | mov lr,#0 327 | umaal r10,r11,lr,lr 328 | 329 | ;r9 r10 done 330 | ;r12 carry for 3rd before col 331 | ;r11+C carry for 3rd final col 332 | 333 | push {r9,r10} 334 | frame address sp,32 335 | 336 | ;mul 02, 11 337 | mov r8,#0 338 | umaal r8,r12,r0,r2 339 | adcs r8,r8,r8 340 | umaal r8,r11,r1,r1 341 | 342 | ;r8 done (3rd col) 343 | ;r12 carry for 4th before col 344 | ;r11+C carry for 4th final col 345 | 346 | ;mul 03, 12 347 | umull r9,r10,r0,r3 348 | umaal r9,r12,r1,r2 349 | adcs r9,r9,r9 350 | umaal r9,r11,lr,lr 351 | 352 | ;r9 done (4th col) 353 | ;r10+r12 carry for 5th before col 354 | ;r11+C carry for 5th final col 355 | 356 | strd r8,r9,[sp,#8] 357 | 358 | ;mul 04, 13, 22 359 | mov r9,#0 360 | umaal r9,r10,r0,r4 361 | umaal r9,r12,r1,r3 362 | adcs r9,r9,r9 363 | umaal r9,r11,r2,r2 364 | 365 | ;r9 done (5th col) 366 | ;r10+r12 carry for 6th before col 367 | ;r11+C carry for 6th final col 368 | 369 | str r9,[sp,#16] 370 | 
371 | ;mul 05, 14, 23 372 | umull r9,r8,r0,r5 373 | umaal r9,r10,r1,r4 374 | umaal r9,r12,r2,r3 375 | adcs r9,r9,r9 376 | umaal r9,r11,lr,lr 377 | 378 | ;r9 done (6th col) 379 | ;r10+r12+r8 carry for 7th before col 380 | ;r11+C carry for 7th final col 381 | 382 | str r9,[sp,#20] 383 | 384 | ;mul 06, 15, 24, 33 385 | mov r9,#0 386 | umaal r9,r8,r1,r5 387 | umaal r9,r12,r2,r4 388 | umaal r9,r10,r0,r6 389 | adcs r9,r9,r9 390 | umaal r9,r11,r3,r3 391 | 392 | ;r9 done (7th col) 393 | ;r8+r10+r12 carry for 8th before col 394 | ;r11+C carry for 8th final col 395 | 396 | str r9,[sp,#24] 397 | 398 | ;mul 07, 16, 25, 34 399 | umull r0,r9,r0,r7 400 | umaal r0,r10,r1,r6 401 | umaal r0,r12,r2,r5 402 | umaal r0,r8,r3,r4 403 | adcs r0,r0,r0 404 | umaal r0,r11,lr,lr 405 | 406 | ;r0 done (8th col) 407 | ;r9+r8+r10+r12 carry for 9th before col 408 | ;r11+C carry for 9th final col 409 | 410 | ;mul 17, 26, 35, 44 411 | umaal r9,r8,r1,r7 ;r1 is now dead 412 | umaal r9,r10,r2,r6 413 | umaal r12,r9,r3,r5 414 | adcs r12,r12,r12 415 | umaal r11,r12,r4,r4 416 | 417 | ;r11 done (9th col) 418 | ;r8+r10+r9 carry for 10th before col 419 | ;r12+C carry for 10th final col 420 | 421 | ;mul 27, 36, 45 422 | umaal r9,r8,r2,r7 ;r2 is now dead 423 | umaal r10,r9,r3,r6 424 | movs r2,#0 425 | umaal r10,r2,r4,r5 426 | adcs r10,r10,r10 427 | umaal r12,r10,lr,lr 428 | 429 | ;r12 done (10th col) 430 | ;r8+r9+r2 carry for 11th before col 431 | ;r10+C carry for 11th final col 432 | 433 | ;mul 37, 46, 55 434 | umaal r2,r8,r3,r7 ;r3 is now dead 435 | umaal r9,r2,r4,r6 436 | adcs r9,r9,r9 437 | umaal r10,r9,r5,r5 438 | 439 | ;r10 done (11th col) 440 | ;r8+r2 carry for 12th before col 441 | ;r9+C carry for 12th final col 442 | 443 | ;mul 47, 56 444 | movs r3,#0 445 | umaal r3,r8,r4,r7 ;r4 is now dead 446 | umaal r3,r2,r5,r6 447 | adcs r3,r3,r3 448 | umaal r9,r3,lr,lr 449 | 450 | ;r9 done (12th col) 451 | ;r8+r2 carry for 13th before col 452 | ;r3+C carry for 13th final col 453 | 454 | ;mul 57, 66 455 | umaal 
r8,r2,r5,r7 ;r5 is now dead 456 | adcs r8,r8,r8 457 | umaal r3,r8,r6,r6 458 | 459 | ;r3 done (13th col) 460 | ;r2 carry for 14th before col 461 | ;r8+C carry for 14th final col 462 | 463 | ;mul 67 464 | umull r4,r5,lr,lr ; set 0 465 | umaal r4,r2,r6,r7 466 | adcs r4,r4,r4 467 | umaal r4,r8,lr,lr 468 | 469 | ;r4 done (14th col) 470 | ;r2 carry for 15th before col 471 | ;r8+C carry for 15th final col 472 | 473 | ;mul 77 474 | adcs r2,r2,r2 475 | umaal r8,r2,r7,r7 476 | adcs r2,r2,lr 477 | 478 | ;r8 done (15th col) 479 | ;r2 done (16th col) 480 | 481 | ;msb -> lsb: r2 r8 r4 r3 r9 r10 r12 r11 r0 sp+24 sp+20 sp+16 sp+12 sp+8 sp+4 sp 482 | ;lr: 0 483 | ;now do reduction 484 | 485 | mov r6,#38 486 | umaal r0,lr,r6,r2 487 | lsl lr,lr,#1 488 | orr lr,lr,r0, lsr #31 489 | and r7,r0,#0x7fffffff 490 | movs r5,#19 491 | mul lr,lr,r5 492 | 493 | pop {r0,r1} 494 | frame address sp,24 495 | umaal r0,lr,r6,r11 496 | umaal r1,lr,r6,r12 497 | 498 | mov r11,r3 499 | mov r12,r4 500 | 501 | pop {r2,r3,r4,r5} 502 | frame address sp,8 503 | umaal r2,lr,r6,r10 504 | umaal r3,lr,r6,r9 505 | 506 | umaal r4,lr,r6,r11 507 | umaal r5,lr,r6,r12 508 | 509 | pop {r6} 510 | frame address sp,4 511 | mov r12,#38 512 | umaal r6,lr,r12,r8 513 | add r7,r7,lr 514 | 515 | pop {pc} 516 | 517 | endp 518 | 519 | ; in: r0-r7, count: r8 520 | ; out: r0-r7 + sets result also to top of stack 521 | ; clobbers all other registers 522 | ; cycles: 19 + 123*n 523 | fe25519_sqr_many proc 524 | export fe25519_sqr_many 525 | push {r8,lr} 526 | frame push {r8,lr} 527 | 0 528 | bl fe25519_sqr 529 | 530 | ldr r8,[sp,#0] 531 | subs r8,r8,#1 532 | str r8,[sp,#0] 533 | bne %b0 534 | 535 | add sp,sp,#4 536 | frame address sp,4 537 | add r8,sp,#4 538 | stm r8,{r0-r7} 539 | pop {pc} 540 | endp 541 | 542 | ; This kind of load supports unaligned access 543 | ; in: *r1 544 | ; out: r0-r7 545 | ; cycles: 22 546 | loadm proc 547 | ldr r0,[r1,#0] 548 | ldr r2,[r1,#8] 549 | ldr r3,[r1,#12] 550 | ldr r4,[r1,#16] 551 | ldr r5,[r1,#20] 
552 | ldr r6,[r1,#24] 553 | ldr r7,[r1,#28] 554 | ldr r1,[r1,#4] 555 | bx lr 556 | endp 557 | 558 | ; in: *r0 = result, *r1 = scalar, *r2 = basepoint (all pointers may be unaligned) 559 | ; cycles: 548 873 560 | curve25519_scalarmult proc 561 | export curve25519_scalarmult 562 | 563 | ; stack layout: xp zp xq zq x0 bitpos lastbit scalar result_ptr r4-r11,lr 564 | ; 0 32 64 96 128 160 164 168 200 204 565 | 566 | push {r0,r4-r11,lr} 567 | frame push {r4-r11,lr} 568 | frame address sp,40 569 | 570 | mov r10,r2 571 | bl loadm 572 | 573 | and r0,r0,#0xfffffff8 574 | ;and r7,r7,#0x7fffffff not needed since we don't inspect the msb anyway 575 | orr r7,r7,#0x40000000 576 | push {r0-r7} 577 | frame address sp,72 578 | movs r8,#0 579 | push {r2,r8} 580 | frame address sp,80 581 | 582 | ;ldm r1,{r0-r7} 583 | mov r1,r10 584 | bl loadm 585 | 586 | and r7,r7,#0x7fffffff 587 | push {r0-r7} 588 | frame address sp,112 589 | 590 | movs r9,#1 591 | umull r10,r11,r8,r8 592 | mov r12,#0 593 | push {r8,r10,r11,r12} 594 | frame address sp,128 595 | push {r9,r10,r11,r12} 596 | frame address sp,144 597 | 598 | push {r0-r7} 599 | frame address sp,176 600 | 601 | umull r6,r7,r8,r8 602 | push {r6,r7,r8,r10,r11,r12} 603 | frame address sp,200 604 | push {r6,r7,r8,r10,r11,r12} 605 | frame address sp,224 606 | push {r9,r10,r11,r12} 607 | frame address sp,240 608 | 609 | movs r0,#254 610 | movs r3,#0 611 | ; 129 cycles so far 612 | 0 613 | ; load scalar bit into r1 614 | lsrs r1,r0,#5 615 | adds r2,sp,#168 616 | ldr r1,[r2,r1,lsl #2] 617 | and r4,r0,#0x1f 618 | lsrs r1,r1,r4 619 | and r1,r1,#1 620 | 621 | strd r0,r1,[sp,#160] 622 | 623 | eors r1,r1,r3 624 | rsbs lr,r1,#0 625 | 626 | mov r0,sp 627 | add r1,sp,#64 628 | 629 | mov r11,#4 630 | ; 15 cycles 631 | 1 632 | ldm r0,{r2-r5} 633 | ldm r1,{r6-r9} 634 | 635 | eors r2,r2,r6 636 | and r10,r2,lr 637 | eors r6,r6,r10 638 | eors r2,r2,r6 639 | 640 | eors r3,r3,r7 641 | and r10,r3,lr 642 | eors r7,r7,r10 643 | eors r3,r3,r7 644 | 645 | eors 
r4,r4,r8 646 | and r10,r4,lr 647 | eors r8,r8,r10 648 | eors r4,r4,r8 649 | 650 | eors r5,r5,r9 651 | and r10,r5,lr 652 | eors r9,r9,r10 653 | eors r5,r5,r9 654 | 655 | stm r0!,{r2-r5} 656 | stm r1!,{r6-r9} 657 | 658 | subs r11,#1 659 | bne %b1 660 | ; 40*4 - 2 = 158 cycles 661 | 662 | mov r8,sp 663 | add r9,sp,#32 664 | bl fe25519_add 665 | push {r0-r7} 666 | frame address sp,272 667 | 668 | bl fe25519_sqr 669 | push {r0-r7} 670 | frame address sp,304 671 | 672 | add r8,sp,#64 673 | add r9,sp,#96 674 | bl fe25519_sub 675 | push {r0-r7} 676 | frame address sp,336 677 | 678 | bl fe25519_sqr 679 | push {r0-r7} 680 | frame address sp,368 681 | 682 | mov r1,sp 683 | add r2,sp,#64 684 | bl fe25519_mul 685 | add r8,sp,#128 686 | stm r8,{r0-r7} 687 | 688 | add r8,sp,#64 689 | mov r9,sp 690 | bl fe25519_sub 691 | add r8,sp,#64 692 | stm r8,{r0-r7} 693 | 694 | ; 64 + 1*45 + 2*46 + 1*173 + 2*115 = 604 cycles 695 | 696 | ;multiplies (r0-r7) with 121666, adds *sp and puts the result on the top of the stack (replacing old content) 697 | ldr lr,=121666 698 | ;mov lr,#56130 699 | ;add lr,lr,#65536 700 | ldr r12,[sp,#28] 701 | mov r11,#0 702 | umaal r12,r11,lr,r7 703 | lsl r11,r11,#1 704 | add r11,r11,r12, lsr #31 705 | movs r7,#19 706 | mul r11,r11,r7 707 | bic r7,r12,#0x80000000 708 | ldm sp!,{r8,r9,r10,r12} 709 | frame address sp,352 710 | umaal r8,r11,lr,r0 711 | umaal r9,r11,lr,r1 712 | umaal r10,r11,lr,r2 713 | umaal r12,r11,lr,r3 714 | ldm sp!,{r0,r1,r2} 715 | frame address sp,340 716 | umaal r0,r11,lr,r4 717 | umaal r1,r11,lr,r5 718 | umaal r2,r11,lr,r6 719 | add r7,r7,r11 720 | add sp,sp,#4 721 | frame address sp,338 722 | push {r0,r1,r2,r7} 723 | frame address sp,352 724 | push {r8,r9,r10,r12} 725 | frame address sp,368 726 | ; 39 cycles 727 | 728 | mov r1,sp 729 | add r2,sp,#64 730 | bl fe25519_mul 731 | add r8,sp,#160 732 | stm r8,{r0-r7} 733 | 734 | add r8,sp,#192 735 | add r9,sp,#224 736 | bl fe25519_add 737 | stm sp,{r0-r7} 738 | 739 | mov r1,sp 740 | add r2,sp,#32 
741 | bl fe25519_mul 742 | add r8,sp,#32 743 | stm r8,{r0-r7} 744 | 745 | add r8,sp,#192 746 | add r9,sp,#224 747 | bl fe25519_sub 748 | stm sp,{r0-r7} 749 | 750 | mov r1,sp 751 | add r2,sp,#96 752 | bl fe25519_mul 753 | stm sp,{r0-r7} 754 | 755 | mov r8,sp 756 | add r9,sp,#32 757 | bl fe25519_add 758 | 759 | bl fe25519_sqr 760 | 761 | add r8,sp,#192 762 | stm r8,{r0-r7} 763 | 764 | mov r8,sp 765 | add r9,sp,#32 766 | bl fe25519_sub 767 | 768 | bl fe25519_sqr 769 | stm sp,{r0-r7} 770 | 771 | mov r1,sp 772 | add r2,sp,#256 773 | bl fe25519_mul 774 | add r8,sp,#224 775 | stm r8,{r0-r7} 776 | 777 | add sp,sp,#128 778 | frame address sp,240 779 | 780 | ldrd r2,r3,[sp,#160] 781 | subs r0,r2,#1 782 | ; 97 + 2*45 + 2*46 + 4*173 + 2*115 = 1201 cycles 783 | bpl %b0 784 | ; in total 2020 cycles per iteration, in total 515 098 cycles for 255 iterations 785 | 786 | ;These cswap lines are not needed for curve25519 since the lowest bit is hardcoded to 0 787 | ;---------- 788 | ;rsbs lr,r3,#0 789 | 790 | ;mov r0,sp 791 | ;add r1,sp,#64 792 | 793 | ;mov r11,#4 794 | ;1 795 | ;ldm r0,{r2-r5} 796 | ;ldm r1!,{r6-r9} 797 | 798 | ;eors r2,r2,r6 799 | ;and r10,r2,lr 800 | ;eors r6,r6,r10 801 | ;eors r2,r2,r6 802 | 803 | ;eors r3,r3,r7 804 | ;and r10,r3,lr 805 | ;eors r7,r7,r10 806 | ;eors r3,r3,r7 807 | 808 | ;eors r4,r4,r8 809 | ;and r10,r4,lr 810 | ;eors r8,r8,r10 811 | ;eors r4,r4,r8 812 | 813 | ;eors r5,r5,r9 814 | ;and r10,r5,lr 815 | ;eors r9,r9,r10 816 | ;eors r5,r5,r9 817 | 818 | ;stm r0!,{r2-r5} 819 | 820 | ;subs r11,#1 821 | ;bne %b1 822 | ;---------- 823 | 824 | ; now we must invert zp 825 | add r0,sp,#32 826 | ldm r0,{r0-r7} 827 | bl fe25519_sqr 828 | push {r0-r7} 829 | frame address sp,272 830 | 831 | bl fe25519_sqr 832 | bl fe25519_sqr 833 | push {r0-r7} 834 | frame address sp,304 835 | 836 | add r1,sp,#96 837 | mov r2,sp 838 | bl fe25519_mul 839 | stm sp,{r0-r7} 840 | 841 | mov r1,sp 842 | add r2,sp,#32 843 | bl fe25519_mul 844 | add r8,sp,#32 845 | stm r8,{r0-r7} 846 | 
847 | ; current stack: z^(2^9) z^(2^11) x z 848 | 849 | bl fe25519_sqr 850 | push {r0-r7} 851 | frame address sp,336 852 | 853 | mov r1,sp 854 | add r2,sp,#32 855 | bl fe25519_mul 856 | add r8,sp,#32 857 | stm r8,{r0-r7} 858 | 859 | ; current stack: _ z^(2^5 - 2^0) z^(2^11) x z 860 | 861 | mov r8,#5 862 | ; 1052 cycles 863 | bl fe25519_sqr_many ; 634 cycles 864 | 865 | mov r1,sp 866 | add r2,sp,#32 867 | bl fe25519_mul 868 | add r8,sp,#32 869 | stm r8,{r0-r7} 870 | 871 | ; current stack: _ z^(2^10 - 2^0) z^(2^11) x z ... 872 | 873 | movs r8,#10 874 | bl fe25519_sqr_many ; 1249 cycles 875 | ;z^(2^20 - 2^10) 876 | 877 | mov r1,sp 878 | add r2,sp,#32 879 | bl fe25519_mul 880 | stm sp,{r0-r7} 881 | ;z^(2^20 - 2^0) 882 | 883 | ; current stack: z^(2^20 - 2^0) z^(2^10 - 2^0) z^(2^11) x z ... 884 | 885 | movs r8,#20 886 | sub sp,sp,#32 887 | frame address sp,368 888 | bl fe25519_sqr_many ; 2479 cycles 889 | ;z^(2^40 - 2^20) 890 | 891 | mov r1,sp 892 | add r2,sp,#32 893 | bl fe25519_mul 894 | add sp,sp,#32 895 | frame address sp,336 896 | ;z^(2^40 - 2^0) 897 | 898 | movs r8,#10 899 | bl fe25519_sqr_many ; 1249 cycles 900 | ;z^(2^50 - 2^10) 901 | 902 | mov r1,sp 903 | add r2,sp,#32 904 | bl fe25519_mul 905 | add r8,sp,#32 906 | stm r8,{r0-r7} 907 | 908 | ; current stack: _ z^(2^50 - 2^0) z^(2^11) x z ... 909 | 910 | movs r8,#50 911 | bl fe25519_sqr_many ; 6169 cycles 912 | ;z^(2^100 - 2^50) 913 | 914 | mov r1,sp 915 | add r2,sp,#32 916 | bl fe25519_mul 917 | stm sp,{r0-r7} 918 | 919 | ; 13751 cycles so far for inversion 920 | 921 | ; current stack: z^(2^100 - 2^0) z^(2^50 - 2^0) z^(2^11) x z ... 922 | 923 | movs r8,#100 924 | sub sp,sp,#32 925 | frame address sp,368 926 | bl fe25519_sqr_many ; 12319 cycles 927 | ;z^(2^200 - 2^100) 928 | 929 | mov r1,sp 930 | add r2,sp,#32 931 | bl fe25519_mul 932 | add sp,sp,#32 933 | frame address sp,336 934 | ;z^(2^200 - 2^0) 935 | 936 | ; current stack: _ z^(2^50 - 2^0) z^(2^11) x z ... 
937 | 938 | movs r8,#50 939 | bl fe25519_sqr_many ; 6169 cycles 940 | ;z^(2^250 - 2^50) 941 | 942 | mov r1,sp 943 | add r2,sp,#32 944 | bl fe25519_mul 945 | ;z^(2^250 - 2^0) 946 | 947 | movs r8,#5 948 | bl fe25519_sqr_many ; 634 cycles 949 | ;z^(2^255 - 2^5) 950 | 951 | mov r1,sp 952 | add r2,sp,#64 953 | bl fe25519_mul 954 | stm sp,{r0-r7} 955 | ;z^(2^255 - 21) 956 | 957 | ; 19661 for second half of inversion 958 | 959 | ; done inverting! 960 | ; total inversion cost: 33412 cycles 961 | 962 | mov r1,sp 963 | add r2,sp,#96 964 | bl fe25519_mul 965 | 966 | ; now final reduce 967 | lsr r8,r7,#31 968 | mov r9,#19 969 | mul r8,r8,r9 970 | mov r10,#0 971 | 972 | ; handle the case when 2^255 - 19 <= x < 2^255 973 | add r8,r8,#19 974 | 975 | adds r8,r0,r8 976 | adcs r8,r1,r10 977 | adcs r8,r2,r10 978 | adcs r8,r3,r10 979 | adcs r8,r4,r10 980 | adcs r8,r5,r10 981 | adcs r8,r6,r10 982 | adcs r8,r7,r10 983 | adcs r11,r10,r10 984 | 985 | lsr r8,r8,#31 986 | orr r8,r8,r11, lsl #1 987 | mul r8,r8,r9 988 | 989 | ldr r9,[sp,#296] 990 | 991 | adds r0,r0,r8 992 | str r0,[r9,#0] 993 | movs r0,#0 994 | adcs r1,r1,r0 995 | str r1,[r9,#4] 996 | mov r1,r9 997 | adcs r2,r2,r0 998 | adcs r3,r3,r0 999 | adcs r4,r4,r0 1000 | adcs r5,r5,r0 1001 | adcs r6,r6,r0 1002 | adcs r7,r7,r0 1003 | and r7,r7,#0x7fffffff 1004 | 1005 | str r2,[r1,#8] 1006 | str r3,[r1,#12] 1007 | str r4,[r1,#16] 1008 | str r5,[r1,#20] 1009 | str r6,[r1,#24] 1010 | str r7,[r1,#28] 1011 | 1012 | add sp,sp,#300 1013 | frame address sp,36 1014 | 1015 | pop {r4-r11,pc} 1016 | 1017 | ; 234 cycles after inversion 1018 | ; in total for whole function 548 873 cycles 1019 | 1020 | endp 1021 | 1022 | end 1023 | -------------------------------------------------------------------------------- /x25519-cortex-m4.h: -------------------------------------------------------------------------------- 1 | /* Curve25519 scalar multiplication 2 | * Copyright (c) 2017, Emil Lenngren 3 | * 4 | * All rights reserved. 
5 | * 6 | * Redistribution and use in source and binary forms, with or without modification, 7 | * are permitted provided that the following conditions are met: 8 | * 9 | * 1. Redistributions of source code must retain the above copyright notice, this 10 | * list of conditions and the following disclaimer. 11 | * 12 | * 2. Redistributions in binary form, except as embedded into a Nordic 13 | * Semiconductor ASA or Dialog Semiconductor PLC integrated circuit in a product 14 | * or a software update for such product, must reproduce the above copyright 15 | * notice, this list of conditions and the following disclaimer in the 16 | * documentation and/or other materials provided with the distribution. 17 | * 18 | * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 19 | * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 20 | * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 21 | * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR 22 | * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 23 | * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 24 | * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 25 | * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 26 | * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 27 | * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 28 | */ 29 | 30 | /* 31 | * This is an armv7 implementation of X25519. 32 | * It follows the reference implementation where the representation of 33 | * a field element [0..2^255-19) is represented by a 256-bit little endian integer, 34 | * reduced modulo 2^256-38, and may possibly be in the range [2^256-38..2^256). 35 | * The scalar is a 256-bit integer where certain bits are hardcoded per specification. 
36 | * 37 | * The implementation runs in constant time (548 873 cycles on ARM Cortex-M4, 38 | * assuming no wait states), and no conditional branches or memory access 39 | * pattern depend on secret data. 40 | */ 41 | 42 | #ifndef X25519_CORTEX_M4_H 43 | #define X25519_CORTEX_M4_H 44 | 45 | // Assembler function 46 | void curve25519_scalarmult(unsigned char result[32], const unsigned char scalar[32], const unsigned char point[32]); 47 | 48 | // User macros 49 | #define X25519_calc_public_key(output_public_key, input_secret_key) do { \ 50 | static const unsigned char basepoint[32] = {9}; \ 51 | curve25519_scalarmult(output_public_key, input_secret_key, basepoint); \ 52 | } while(0) 53 | 54 | #define X25519_calc_shared_secret(output_shared_secret, my_secret_key, their_public_key) \ 55 | curve25519_scalarmult(output_shared_secret, my_secret_key, their_public_key) 56 | 57 | #endif 58 | --------------------------------------------------------------------------------