# Missed optimizations in C compilers

This is a list of some missed optimizations in C compilers. To my knowledge,
these have not been reported by others (but it is difficult to search for
such things, and people may not have bothered to write these up before). I
have reported some of these in bug trackers and/or developed more or less
mature patches for some; see links below.

These are *not* correctness bugs, only cases where a compiler misses an
opportunity to generate simpler or otherwise (seemingly) more efficient
code. None of these are likely to affect the actual, measurable performance
of real applications. Also, I have not measured most of these: the seemingly
inefficient code may somehow turn out to be faster on real hardware.

I have tested GCC, Clang, and CompCert, mostly targeting ARM at `-O3` and
with `-fomit-frame-pointer`.
The exact versions tested are specified below.
For each example, at least one of these three compilers (typically GCC or
Clang) produces the "right" code.

This list was put together by Gergö Barany. I'm interested in feedback.
I have described how I found these missed optimizations in a [paper
(PDF)](missed_optimizations_preprint.pdf) that I presented at CC'18, where
it won the best paper award. The software described in the paper is not
released yet. I also made a quick-and-dirty [poster
(PDF)](missed_optimizations_poster.pdf) and some more informative
[presentation slides (PDF)](cc2018slides.pdf).

Since initial publication, there have been some insightful comments on this
list [on Reddit](https://www.reddit.com/r/programming/comments/6ylrpi/missed_optimizations_in_c_compilers/)
and [on Hacker News](https://news.ycombinator.com/item?id=15187505).


# Added 2018-03-25

Unless noted otherwise, the examples below can be reproduced with the
following compiler versions and configurations:

Compiler | Version | Revision | Date | Configuration
---------|---------|----------|------|--------------
GCC | 8.0.1 20180325 | r258845 | 2018-03-25 | `--target=armv7a-eabihf --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float-abi=hard --with-float=hard`
Clang | 7.0.0 (trunk 328450) | LLVM r328450, Clang r328447 | 2018-03-25 | `--target=armv7a-eabihf -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=hard`

## GCC

### Loop vectorization regression

This issue concerns GCC for x86-64, configured with
`--target=x86_64-pc-linux-gnu`. It used to unroll and vectorize the loop in
It used to unroll and vectorize the loop in 50 | the following function: 51 | 52 | ``` 53 | int N; 54 | long fn1(void) { 55 | short i; 56 | long a; 57 | i = a = 0; 58 | while (i < N) 59 | a -= i++; 60 | return a; 61 | } 62 | ``` 63 | 64 | After revision r258036 (2018-02-27), this was no longer the case, and the 65 | non-vectorized version showed a slowdown of about 1.8x on my machine. 66 | 67 | Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84986, fixed. 68 | 69 | ## Clang 70 | 71 | ### Spill despite available registers 72 | 73 | ``` 74 | double fn1(double p1, double p2, double p3, double p4, double p5, double p6, 75 | double p7, double p8) { 76 | double a, b, c, d, v, e; 77 | a = p4 - 5 + 9 * p1 * 7; 78 | c = 2 - p6 + 1; 79 | b = 3 * p6 * 4 - (p2 + 10) * a; 80 | d = (8 + p3) * p5 * 5 * p7 - p4; 81 | v = p1 - p8 - 2 - (p8 + p2 - b) * ((p3 + 6) * c); 82 | e = v * p5 + d; 83 | return e; 84 | } 85 | ``` 86 | 87 | I will omit the full code generated by Clang, but you can see it at 88 | https://godbolt.org/g/8ZuYbz (generated by a slightly older version, but the 89 | code is the same). 90 | 91 | The interesting issue is the treatment of register `d6`, which is spilled in 92 | the middle of the function and used for temporaries in some calculations, 93 | before its value is reloaded into `d1` towards the end of the function: 94 | 95 | ``` 96 | ... some computations omitted ... 
vmov.f64 d13, #4.000000e+00
vmov.f64 d14, #1.000000e+01
vmul.f64 d10, d10, d13    @ last use of d13
vadd.f64 d11, d1, d14     @ last use of d14
vmov.f64 d12, #2.000000e+00
vmov.f64 d15, #8.000000e+00
vsub.f64 d5, d12, d5
vadd.f64 d12, d2, d15     @ last use of d15
vmls.f64 d10, d11, d9
vstr d6, [sp]             @ spill original value of d6
vmov.f64 d6, #6.000000e+00 @ use d6 for constant 6.0
vmov.f64 d8, #1.000000e+00
vadd.f64 d1, d1, d7
vadd.f64 d2, d2, d6       @ last use of constant 6.0 in d6
vadd.f64 d5, d5, d8
vsub.f64 d0, d0, d7
vmul.f64 d7, d12, d4
vmov.f64 d8, #5.000000e+00
vmov.f64 d6, #-2.000000e+00 @ use d6 for constant -2.0
vmul.f64 d2, d2, d5
vsub.f64 d1, d1, d10
vmul.f64 d5, d7, d8
vadd.f64 d0, d0, d6       @ last use of constant -2.0 in d6
vmls.f64 d0, d2, d1
vldr d1, [sp]             @ reload original value of d6
vnmls.f64 d3, d5, d1
...
```

It is reasonable to spill `d6` and to use it for temporaries during the
computation if no other registers are free. But that is not the case here:
the registers `d13`, `d14`, and `d15` have short previous uses but are then
unused for the rest of the function. The live ranges allocated to `d6` in
the above code could have been allocated to these registers instead, saving
the spill of `d6` and its reload into `d1`.

In this particular case the spill costs the store, the load, and one
instruction each to allocate and free the stack frame. GCC generates fewer
instructions overall, with lower register use and without this inline spill.

Reported at https://bugs.llvm.org/show_bug.cgi?id=37073. It turns out that
this is due to bad spilling decisions before register allocation that are
not visible in the example above because they are undone by smart scheduling
after register allocation.
This can be fixed by selecting a different scheduler to run before
allocation instead of the (bad) default choice.


### Excessive copying

```
double fn1(double p1, double p2, double p3, double p4, double p5, double p6,
           double p7, double p8) {
    double a, b, c, d, e, f;
    a = d = b = p1 + p3;
    e = p3 + p2 * 8.7 + ((p4 - p4) * (a * p7) - 1.7);
    c = b - d - (p2 - 7.4) * 8.1;
    f = e - p2 * p3 - d * (4.6 + c) * p4 + p5 * p6 - p7 - p8 + p1;
    return f;
}
```

GCC's code starts with an add, a multiply, and a subtract (interspersed with
constant loads I omit here):

```
vadd.f64 d8, d0, d2
vmul.f64 d9, d8, d6
vsub.f64 d10, d3, d3
```

The subtraction is independent of the other instructions; it corresponds to
the expression `(p4 - p4)` in the source code.

Clang starts with the same operations, but for unclear reasons it inserts
four (!) register copy instructions, at least two of which have the effect
of "setting up" the register `d4` for use by the subtraction, although it
could just use the value from `d3`:

```
vadd.f64 d8, d0, d2
vmov.f64 d10, d7
vmov.f64 d7, d5
vmov.f64 d5, d4
vmov.f64 d4, d3
vmul.f64 d11, d8, d6
vsub.f64 d12, d4, d4
```

This seems to be related to scheduling as well; selecting a different
scheduler with `-mllvm -pre-RA-sched=list-burr` makes it go away.


# Added 2017-11-04

Unless noted otherwise, the examples below can be reproduced with the
following compiler versions and configurations:

Compiler | Version | Revision | Date | Configuration
---------|---------|----------|------|--------------
GCC | 8.0.0 20171101 | r254306 | 2017-11-01 | `--target=armv7a-eabihf --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float-abi=hard --with-float=hard`
Clang | 6.0.0 (trunk 317088) | LLVM r317088, Clang r317076 | 2017-11-01 | `--target=armv7a-eabihf -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=hard`
CompCert | 3.1 | 0c00ace | 2017-10-26 | `armv7a-linux`

## GCC

### Two dead stores for type conversion

Similar to a case below, but maybe more striking:
```
unsigned fn1(float p1, double p2) {
    unsigned a = p2;
    if (p1)
        a = 8;
    return a;
}
```

GCC predicates the entire function, nicely moving the conversion of `p2` to
`unsigned` into a branch.
But for some reason, in both branches it spills
the value for `a` to the stack, without ever reloading it:
```
vcmp.f32 s0, #0
sub sp, sp, #8
vmrs APSR_nzcv, FPSCR
vcvteq.u32.f64 s15, d1
movne r3, #8
strne r3, [sp, #4]
movne r0, r3
vmoveq r0, s15            @ int
vstreq.32 s15, [sp, #4]   @ int
add sp, sp, #8
@ sp needed
bx lr
```

### Missed simplification of modulo-and

```
short fn1(int p1) {
    signed a = (p1 % 4) & 1;
    return a;
}
```

GCC computes modulo 4 as bitwise-and 3, carefully handling the case where
`p1` might be negative, then does the bitwise-and 1:
```
rsbs r3, r0, #0
and r0, r0, #3
and r3, r3, #3
rsbpl r0, r3, #0
and r0, r0, #1
```

But the result of `(p1 % 4) & 1` is 1 iff `p1 % 4` is odd, iff `p1` is odd,
iff `p1 & 1` is 1, so Clang just computes this directly:

```
and r0, r0, #1
```

### Missed simplification of division in condition

In the function:
```
char fn1(char p1) {
    signed a = 4;
    if (p1 / (p1 + 3))
        a = 1;
    return a;
}
```
GCC misses the fact that the condition is always false because `p1 + 3` is
always greater than `p1` (as `char` is unsigned on ARM), and generates code
to evaluate it:
```
add r1, r0, #3
bl __aeabi_idiv
cmp r0, #0
moveq r0, #4
movne r0, #1
```

Clang returns 4 unconditionally.

### Missed simplification/floating-point constant propagation?

```
float fn1(short p1) {
    char a;
    float b, c;
    a = p1;
    b = c = 765;
    if (a == b)
        c = 0;
    return c;
}
```

At the comparison `a == b`, `a` is a `char`, and it cannot hold a value that
compares equal to `b`'s value of 765.

GCC computes the comparison anyway; Clang returns 765.0f unconditionally.

### Missed bitwise tricks

```
char fn1(short p1, int p2) {
    long a;
    int v;
    a = p2 & (p1 | 415615);
    v = a << 1;
    return v;
}
```

The least significant byte of 415615 is `0x7f`, so the lowest 7 bits of `a`
are the same as the lowest 7 bits of `p2`, hence the lowest 8 bits of `v`
are the same as the lowest 8 bits of `p2 << 1`, and that is all that is
needed for the result. GCC generates all the operations present at the
source level, while Clang simplifies nicely:
```
lsl r0, r1, #1
uxtb r0, r0
```

### Unnecessary spilling due to badly scheduled move-immediates

This is just a much smaller example of an issue with the same title below.

```
long fn1(char p1, short p2, long p3, long p4, short p5, long p6) {
    long a, b = 1000893;
    a = 600 - p2 * 3;
    if (p4 >= p5 * !(p6 * a))
        b = p3;
    return b;
}
```

GCC loads the constant 1000893 into a register early on, causing it to spill
a register, but only uses the value much later and conditionally:

```
str lr, [sp, #-4]!   @ spill lr
movw lr, #17853      @ move lower part of 1000893 into lr
ldr r1, [sp, #8]
movt lr, 15          @ move upper part of 1000893 into lr
...
cmp r1, r3
movle r0, r2
movgt r0, lr         @ use 1000893
```

Clang just loads the constant conditionally, at the point where it is
actually needed, and avoids spilling:

```
...
movwgt r2, #17853
movtgt r2, #15
mov r0, r2
```


## Clang

### Missed simplification of integer-valued floating-point arithmetic

In some cases (see one further below), Clang can perform floating-point
arithmetic in integers where the result is an exact integer; GCC doesn't
usually do this. But here is a counterexample:

```
char fn1(char p1) {
    float a;
    short b;
    a = !p1;
    b = -a;
    return b;
}
```

Clang converts to `float`, negates, and converts back:
```
vmov.f32 s2, #1.000000e+00
vldr s0, .LCPI0_0    @ load 0.0
cmp r0, #0
vmoveq.f32 s0, s2
vneg.f32 s0, s0
vcvt.s32.f32 s0, s0
vmov r0, s0
```

GCC doesn't:
```
cmp r0, #0
moveq r0, #255
movne r0, #0
```

### Missed conditional constant propagation on `double`

```
short fn1(double p1) {
    unsigned a = 4;
    if (p1 == 604884)
        if (0 * p1)
            a = 7;
    return a;
}
```

If `p1` is 604884, then `0 * p1` is false. Clang evaluates both conditions,
GCC returns 4 unconditionally.


### Missed simplification of bitwise operations

```
float fn1(short p1) {
    float a = 516.403544893f;
    if (p1 & 6 >> (p1 & 31) ^ ~0)
        a = 1;
    return a;
}
```

Clang evaluates the branch condition because it misses that `((p1 & (6 >>
(p1 & 31))) ^ ~0)` is always nonzero: almost all bits of the inner
expression are zero, so the XOR with `~0` sets them.


## CompCert

### Reluctance to select `mvn` instruction

In this function:
```
int fn1(int p1) {
    return -7 - p1;
}
```
CompCert computes the subtraction by adding a sequence of large constants:
```
rsb r0, r0, #249
add r0, r0, #65280
add r0, r0, #16711680
add r0, r0, #-16777216
```

It's simpler to synthesize the constant -7 by negating 6 (`mvn`) and then
subtracting, as GCC does:
```
mvn r3, #6
sub r0, r3, r0
```


# Added 2017-09-06

Unless noted otherwise, the examples below can be reproduced with the
following compiler versions and configurations:

Compiler | Version | Revision | Date | Configuration
---------|---------|----------|------|--------------
GCC | 8.0.0 20170906 | r251801 | 2017-09-06 | `--target=armv7a-eabihf --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float-abi=hard --with-float=hard`
Clang | 6.0.0 (trunk 312637) | LLVM r312635, Clang r312634 | 2017-09-06 | `--target=armv7a-eabihf -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=hard`
CompCert | 3.0.1 | f3c60be | 2017-08-28 | `armv7a-linux`

## GCC

### Useless initialization of struct passed by value

```
struct S0 {
    int f0;
    int f1;
    int f2;
    int f3;
};

int f1(struct S0 p) {
    return p.f0;
}
```

The struct is passed in registers, and the function's result is already in
`r0`, which is also the return register. The function could return
immediately, but GCC first stores all the struct fields to the stack and
reloads the first field:

```
sub sp, sp, #16
add ip, sp, #16
stmdb ip, {r0, r1, r2, r3}
ldr r0, [sp]
add sp, sp, #16
bx lr
```

Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80905.

### `float` to `char` type conversion goes through memory

```
char fn1(float p1) {
    return (char) p1;
}
```
results in:
```
vcvt.u32.f32 s15, s0
sub sp, sp, #8
vstr.32 s15, [sp, #4]   @ int
ldrb r0, [sp, #4]       @ zero_extendqisi2
```
i.e., the result of the conversion in `s15` is stored to the stack and then
reloaded into `r0` instead of just copying it between the registers (and
possibly truncating to `char`'s range). Same for `double` instead of
`float`, but *not* for `short` instead of `char`.

Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80861.

If you are worried about the conversion overflowing, you might prefer this
version:

```
#include <math.h>
#include <limits.h>

char fn1(float p1) {
    if (isfinite(p1) && CHAR_MIN <= p1 && p1 <= CHAR_MAX) {
        return (char) p1;
    }
    return 0;
}
```

GCC still goes through memory to perform the zero extension.


### Spill instead of register copy

```
int N;
int fn2(int p1, int p2) {
    int a = p2;
    if (N)
        a *= 10.5;
    return a;
}
```

GCC spills `a` although enough integer registers would be available:
```
str r1, [sp, #4]        // initialize a to p2
cmp r3, #0
beq .L13
... compute multiplication, put into d7 ...
vcvt.s32.f64 s15, d7
vstr.32 s15, [sp, #4]   @ int // update a in branch
.L13:
ldr r0, [sp, #4]        // reload a
```

Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81012 (together
with another issue further below).

### Missed simplification of multiplication by integer-valued floating-point constant

Variant of the above code with the constant changed slightly:
```
int N;
int fn5(int p1, int p2) {
    int a = p2;
    if (N)
        a *= 10.0;
    return a;
}
```

GCC converts `a` to double and back as above, but the result must be the
same as simply multiplying by the integer 10. Clang realizes this and
generates an integer multiply, removing all floating-point operations.

If you are worried about the multiplication overflowing, you might prefer
this version:

```
#include <limits.h>

int N;
int fn5(int p1, int p2) {
    int a = p2;
    if (N && INT_MIN / 10.0 <= a && a <= INT_MAX / 10.0)
        a *= 10.0;
    return a;
}
```

Clang still generates code for multiplying integers, while GCC multiplies as
`double`.

### Dead store for `int` to `double` conversion

```
double fn3(int p1, short p2) {
    double a = p1;
    if (1881 * p2 + 5)
        a = 2;
    return a;
}
```

`a` is initialized to the value of `p1`, and GCC spills this value. The
spill store is dead; the value is never reloaded:

```
...
str r0, [sp, #4]
cmn r1, #5
vmovne.f64 d0, #2.0e+0
vmoveq s15, r0          @ int
vcvteq.f64.s32 d0, s15
add sp, sp, #8
@ sp needed
bx lr
```

Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81012 (together
with another issue above).

### Missed strength reduction of division to comparison

```
unsigned int fn1(unsigned int a, unsigned int p2) {
    int b;
    b = 3;
    if (p2 / a)
        b = 7;
    return b;
}
```

The condition `p2 / a` is true (nonzero) iff `p2 >= a`.
Clang compares; GCC
generates a division (on ARM, x86, and PowerPC).

### Missed simplification of floating-point addition

```
long fn1(int p1) {
    double a;
    long b = a = p1;
    if (0 + a + 0)
        b = 9;
    return b;
}
```

It is not correct in general to replace `a + 0` by `a` in floating-point
arithmetic due to NaNs and negative zeros and whatnot, but here `a` is
initialized from an `int`'s value, so none of these cases can happen. GCC
generates all the floating-point operations, while Clang just compares `p1`
to zero.

### Missed reassociation of multiplications

```
double fn1(double p1) {
    double v;
    v = p1 * p1 * (-p1 * p1);
    return v;
}

int fn2(int p1) {
    int v;
    v = p1 * p1 * (-p1 * p1);
    return v;
}
```

For each function, GCC generates three multiplications:

```
vnmul.f64 d7, d0, d0
vmul.f64 d0, d0, d0
vmul.f64 d0, d7, d0
```
and
```
rsb r2, r0, #0
mul r3, r0, r0
mul r0, r0, r2
mul r0, r3, r0
```

Clang can do the same with two multiplications each:

```
vmul.f64 d0, d0, d0
vnmul.f64 d0, d0, d0
```
and
```
mul r0, r0, r0
rsb r1, r0, #0
mul r0, r0, r1
```

### Heroic loop optimization instead of removing the loop

```
int N;
double fn1(double *p1) {
    int i = 0;
    double v = 0;
    while (i < N) {
        v = p1[i];
        i++;
    }
    return v;
}
```

This function returns 0 if `N` is 0 or negative; otherwise, it returns
`p1[N-1]`. On x86-64, GCC generates a complex unrolled loop (and a simpler
loop on ARM). Clang removes the loop and replaces it by a simple branch.
GCC's code is too long and boring to show here, but it's similar enough to
https://godbolt.org/g/RYwgq4.

*Note:* An earlier version of this document claimed that GCC also removed the
loop on ARM. It was pointed out to me that this claim was false; GCC does
generate a loop on ARM too. My (stupid) mistake: I had misread the assembly
code.

### Unnecessary spilling due to badly scheduled move-immediates

```
double fn1(double p1, float p2, double p3, float p4, double *p5, double p6)
{
    double a, c, d, e;
    float b;
    a = p3 / p1 / 38 / 2.56380695236e38f * 5;
    c = 946293016.021f / a + 9;
    b = p1 - 6023397303.74f / p3 * 909 * *p5;
    d = c / 683596028.301 + p3 - b + 701.296936339 - p4 * p6;
    e = p2 + d;
    return e;
}
```

Sorry about the mess. GCC generates a bunch of code for this, including:

```
vstr.64 d10, [sp]
vmov.f64 d11, #5.0e+0
vmov.f64 d13, #9.0e+0
... computations involving d10 but not d11 or d13 ...
vldr.64 d10, [sp]
vcvt.f64.f32 d6, s0
vmul.f64 d11, d7, d11
vdiv.f64 d5, d12, d11
vadd.f64 d13, d5, d13
vdiv.f64 d0, d13, d14
```

The register `d10` must be spilled to make room for the constants 5 and 9
being loaded into `d11` and `d13`, respectively. But these loads come much
too early: their values are only needed after `d10` is restored. These move
instructions (at least one of them) should be closer to their uses, freeing
up registers and avoiding the spill.

### Missed constant propagation (or so it seemed)

```
int fn1(int p1) {
    int a = 6;
    int b = (p1 / 12 == a);
    return b;
}
int fn2(int p1) {
    int b = (p1 / 12 == 6);
    return b;
}
```

These functions should be compiled identically.
The condition is equivalent
to `6 * 12 <= p1 < 7 * 12`, and the two signed comparisons can be replaced
by a subtraction and a single unsigned comparison:

```
sub r0, r0, #72
cmp r0, #11
movhi r0, #0
movls r0, #1
bx lr
```

GCC used to do this for the first function, but not the second one. It
seemed like constant propagation failed to replace `a` by 6.

Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81346 and fixed in
revision r250342; it turns out that this pattern used to be implemented
(only) in a very early folding pass that ran before constant propagation.

### Missed tail-call optimization for compiler-generated call to intrinsic

```
int fn1(int a, int b) {
    return a / b;
}
```

On ARM, integer division is turned into a function call:

```
push {r4, lr}
bl __aeabi_idiv
pop {r4, pc}
```

In the code above, GCC allocates a stack frame, not realizing that this call
in tail position could be simply
```
b __aeabi_idiv
```

(When the call, even to `__aeabi_idiv`, appears in the source code, GCC is
perfectly capable of compiling it as a tail call.)


## Clang

### Incomplete range analysis for certain types

```
short fn1(short p) {
    short c = p / 32769;
    return c;
}

int fn2(signed char p) {
    int c = p / 129;
    return c;
}
```

In both cases, the divisor is outside the possible range of values for `p`,
so the function's result is always 0. Clang doesn't realize this and
generates code to compute the result. (It does optimize corresponding cases
for unsigned `short` and `char`.)

A report with this and all the other range analysis examples below was filed
by [@gnzlbg](https://github.com/gnzlbg) at
https://bugs.llvm.org/show_bug.cgi?id=34517, and most cases have been fixed.

### More incomplete range analysis

```
int fn1(short p4, int p5) {
    int v = 0;
    if ((p4 - 32779) - !p5)
        v = 7;
    return v;
}
```

The condition is always true, and the function always returns 7. Clang
generates code to evaluate the condition. Similar to the case above, but
seems to be more complex: if the `- !p5` is removed, Clang also realizes
that the condition is always true.

### Even more incomplete range analysis

```
int fn1(int p, int q) {
    int a = q + (p % 6) / 9;
    return a;
}

int fn2(int p) {
    return (1 / p) / 100;
}
int fn3(int p5, int p6) {
    int a, c, v;
    v = -!p5;
    a = v + 7;
    c = (4 % a);
    return c;
}
int fn4(int p4, int p5) {
    int a = 5;
    if ((5 % p5) / 9)
        a = p4;
    return a;
}
```

The first function always returns `q`, the others always return 0, 4, and 5,
respectively. Clang evaluates at least one division or modulo operation in
each case.

(I have a *lot* more examples like this, but this list is boring enough as
it is.)

### Redundant code generated for `double` to `int` conversion

```
int fn1(double c, int *p, int *q, int *r, int *s) {
    int i = (int) c;
    *p = i;
    *q = i;
    *r = i;
    *s = i;
    return i;
}
```

There is a single type conversion on the source level.
Clang duplicates the
conversion operation (`vcvt`) for each use of `i`:

```
vcvt.s32.f64 s2, d0
vstr s2, [r0]
vcvt.s32.f64 s2, d0
vstr s2, [r1]
vcvt.s32.f64 s2, d0
vstr s2, [r2]
vcvt.s32.f64 s2, d0
vcvt.s32.f64 s0, d0
vmov r0, s0
vstr s2, [r3]
```

Several times, the result of the conversion is written into the register
that already holds that same value (`s2`).

Reported at https://bugs.llvm.org/show_bug.cgi?id=33199, fixed.

### Missing instruction selection pattern for `vnmla`

```
double fn1(double p1, double p2, double p3) {
    double a = -p1 - p2 * p3;
    return a;
}
```

Clang generates
```
vneg.f64 d0, d0
vmls.f64 d0, d1, d2
```
but this could just be
```
vnmla.f64 d0, d1, d2
```

The instruction selector has a pattern for selecting `vnmla` for the
expression `-(a * b) - dst`, but not for the symmetric `-dst - (a * b)`.

Patch submitted: https://reviews.llvm.org/D35911


## CompCert

### No dead code elimination for useless branches

```
void fn1(int r0) {
    if (r0 + 42)
        0;
}
```

The conditional branch due to the `if` statement is removed, but the
computation for its condition remains in the generated code:

```
add r0, r0, #42
```

This appears to be because dead code elimination runs before useless
branches are identified and eliminated.

Fixed in
https://github.com/AbsInt/CompCert/commit/93d2fc9e3b23d69c0c97d229f9fa4ab36079f507
due to my report.

### Missing constant folding patterns for modulo

```
int fn4(int p1) {
    int a = 0 % p1;
    return a;
}

int fn5(int a) {
    int b = a % a;
    return b;
}
```

In both cases, CompCert generates a modulo operation (on ARM, it generates a
library call, a multiply, and a subtraction). In both cases, this operation
could be constant folded to zero.

My patch at
https://github.com/gergo-/CompCert/commit/78555eb7afacac184d94db43c9d4438d20502be8
fixes the first one, but not the second. It also fails to deal with the
equivalent case of
```
int zero = 0;
int a = zero % p1;
```
because, by the time the constant propagator runs, the modulo operation has
been mangled into a complex call/multiply/subtract sequence that is
inconvenient to recognize.

Status: CompCert developers are (understandably) uninterested in this
trivial issue.

### Failure to generate `movw` instruction

```
int fn1(int p1) {
    int a = 506 * p1;
    return a;
}
```
results in:
```
mov r1, #250
orr r1, r1, #256
mul r0, r0, r1
```

Two instructions are spent on loading the constant. However, ARM's `movw`
instruction can take an immediate between 0 and 65535, so this can be
simplified to:

```
movw r1, #506
```

Fixed in a series of patches from me, merged at
https://github.com/AbsInt/CompCert/commit/c4dcf7c08016f175ba6c06d20c530ebaaad67749.

### Failure to constant fold `mla`

Call this code `mla.c`:

```
int fn1(int p1) {
    int a, b, c, d, e, v, f;
    a = 0;
    b = c = 0;
    d = e = p1;
    v = 4;
    f = e * d | a * p1 + b;
    return f;
}
```

The expression `a * p1 + b` evaluates to `0 * p1 + 0`, i.e., 0. CompCert is
able to fold both `0 * x` and `x + 0` in isolation, but on ARM it generates
an `mla` (multiply-add) instruction, for which it has no constant folding
rules:

```
mov r4, #0
mov r12, #0
...
mla r2, r4, r0, r12
```

Fixed in a patch from me, merged at
https://github.com/AbsInt/CompCert/commit/4f46e57884a909d2f62fa7cea58b3d933a6a5e58.

### Unnecessary spilling due to dead code

On `mla.c` from above without the `mla` constant folding patch, CompCert
spills the `r4` register, which it then uses to store a temporary:

```
str r4, [sp, #8]
mov r4, #0
mov r12, #0
mov r1, r0
mov r2, r1
mul r3, r2, r1
mla r2, r4, r0, r12
orr r0, r3, r2
ldr r4, [sp, #8]
```

However, the spill goes away if the line `v = 4;` is removed from the
program:

```
mov r3, #0
mov r12, #0
mov r1, r0
mov r2, r1
mul r1, r1, r2
mla r12, r3, r0, r12
orr r0, r1, r12
```

The value of `v` is never used. This assignment is dead code, and it is
compiled away. It should not affect register allocation.

### Missed copy propagation

In the previous example, the copies for the arguments to the `mul` operation
(`r0` to `r1`, then on to `r2`) are redundant. They could be removed, and
the multiplication written as just `mul r1, r0, r0`.

*Note:* This entry previously spoke of copy coalescing, but [Sebastian
Hack](http://compilers.cs.uni-saarland.de/people/hack/) pointed out that
it's actually copy propagation that is missed here.

### Failure to propagate folded constants

Continuing the `mla.c` example further, but this time after applying the
`mla` constant folding patch, CompCert generates:

```
mov r1, r0
mul r2, r1, r0
mov r1, #0
orr r0, r2, r1
```

The `x | 0` operation is redundant. CompCert is able to fold this
operation if it appears in the source code, but in this case the 0 comes
from the previously folded `mla` operation. Such constants are not
propagated by the "constant propagation" (in reality, only local folding)
pass. Values are only propagated by the value analysis that runs before, but
for technical reasons the value analysis cannot currently take advantage of
the algebraic properties of operators' neutral and zero elements.

### Missed reassociation and folding of constants

```
int fn1(int p1) {
    int a, v, b;
    v = 1;
    a = 3 * p1 * 2 * 3;
    b = v + a * 83;
    return b;
}
```

All the multiplications could be folded into a single multiplication of `p1`
by 1494, but CompCert generates a series of adds and shifts before a
multiplication by 83:

```
add r3, r0, r0, lsl #1
mov r0, r3, lsl #1
add r1, r0, r0, lsl #1
mov r2, #83
mla r0, r2, r1, r12
```