├── cc2018slides.pdf
├── missed_optimizations_poster.pdf
├── missed_optimizations_preprint.pdf
└── README.md


/cc2018slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gergo-/missed-optimizations/HEAD/cc2018slides.pdf


--------------------------------------------------------------------------------
/missed_optimizations_poster.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gergo-/missed-optimizations/HEAD/missed_optimizations_poster.pdf


--------------------------------------------------------------------------------
/missed_optimizations_preprint.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gergo-/missed-optimizations/HEAD/missed_optimizations_preprint.pdf


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
   1 | # Missed optimizations in C compilers
   2 | 
   3 | This is a list of some missed optimizations in C compilers. To my knowledge,
   4 | these have not been reported by others (but it is difficult to search for
   5 | such things, and people may not have bothered to write these up before). I
   6 | have reported some of these in bug trackers and/or developed more or less
   7 | mature patches for some; see links below.
   8 | 
   9 | These are *not* correctness bugs, only cases where a compiler misses an
  10 | opportunity to generate simpler or otherwise (seemingly) more efficient
  11 | code. None of these are likely to affect the actual, measurable performance
  12 | of real applications. Also, I have not measured most of these: The seemingly
  13 | inefficient code may somehow turn out to be faster on real hardware.
  14 | 
  15 | I have tested GCC, Clang, and CompCert, mostly targeting ARM at `-O3` and
  16 | with `-fomit-frame-pointer`. The exact versions tested are specified below.
  17 | For each example, at least one of these three compilers (typically GCC or
  18 | Clang) produces the "right" code.
  19 | 
  20 | This list was put together by Gergö Barany <gergo.barany@inria.fr>. I'm
  21 | interested in feedback.
  22 | I have described how I found these missed optimizations in a [paper
  23 | (PDF)](missed_optimizations_preprint.pdf) that I presented at CC'18, where
  24 | it won the best paper award. The software described in the paper is not
  25 | released yet. I also made a quick-and-dirty [poster
  26 | (PDF)](missed_optimizations_poster.pdf) and some more informative
  27 | [presentation slides (PDF)](cc2018slides.pdf).
  28 | 
  29 | Since initial publication, there have been some insightful comments on this
  30 | list [on Reddit](https://www.reddit.com/r/programming/comments/6ylrpi/missed_optimizations_in_c_compilers/)
  31 | and [on Hacker News](https://news.ycombinator.com/item?id=15187505).
  32 | 
  33 | 
  34 | # Added 2018-03-25
  35 | 
  36 | Unless noted otherwise, the examples below can be reproduced with the
  37 | following compiler versions and configurations:
  38 | 
  39 | Compiler | Version | Revision | Date | Configuration
  40 | ---------|---------|----------|------|--------------
  41 | GCC      | 8.0.1 20180325 | r258845 | 2018-03-25 | `--target=armv7a-eabihf --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float-abi=hard --with-float=hard`
  42 | Clang    | 7.0.0 (trunk 328450) | LLVM r328450, Clang r328447 | 2018-03-25 | `--target=armv7a-eabihf -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=hard`
  43 | 
  44 | ## GCC
  45 | 
  46 | ### Loop vectorization regression
  47 | 
  48 | This issue concerns GCC for x86-64, configured with
  49 | `--target=x86_64-pc-linux-gnu`. It used to unroll and vectorize the loop in
  50 | the following function:
  51 | 
  52 | ```
  53 | int N;
  54 | long fn1(void) {
  55 |   short i;
  56 |   long a;
  57 |   i = a = 0;
  58 |   while (i < N)
  59 |     a -= i++;
  60 |   return a;
  61 | }
  62 | ```
  63 | 
  64 | After revision r258036 (2018-02-27), this was no longer the case, and the
  65 | non-vectorized version showed a slowdown of about 1.8x on my machine.
  66 | 
  67 | Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84986, fixed.
  68 | 
  69 | ## Clang
  70 | 
  71 | ### Spill despite available registers
  72 | 
  73 | ```
  74 | double fn1(double p1, double p2, double p3, double p4, double p5, double p6,
  75 |            double p7, double p8) {
  76 |   double a, b, c, d, v, e;
  77 |   a = p4 - 5 + 9 * p1 * 7;
  78 |   c = 2 - p6 + 1;
  79 |   b = 3 * p6 * 4 - (p2 + 10) * a;
  80 |   d = (8 + p3) * p5 * 5 * p7 - p4;
  81 |   v = p1 - p8 - 2 - (p8 + p2 - b) * ((p3 + 6) * c);
  82 |   e = v * p5 + d;
  83 |   return e;
  84 | }
  85 | ```
  86 | 
  87 | I will omit the full code generated by Clang, but you can see it at
  88 | https://godbolt.org/g/8ZuYbz (generated by a slightly older version, but the
  89 | code is the same).
  90 | 
  91 | The interesting issue is the treatment of register `d6`, which is spilled in
  92 | the middle of the function and used for temporaries in some calculations,
  93 | before its value is reloaded into `d1` towards the end of the function:
  94 | 
  95 | ```
  96 |     ... some computations omitted ...
  97 |     vmov.f64    d13, #4.000000e+00
  98 |     vmov.f64    d14, #1.000000e+01
  99 |     vmul.f64    d10, d10, d13           @ last use of d13
 100 |     vadd.f64    d11, d1, d14            @ last use of d14
 101 |     vmov.f64    d12, #2.000000e+00
 102 |     vmov.f64    d15, #8.000000e+00
 103 |     vsub.f64    d5, d12, d5
 104 |     vadd.f64    d12, d2, d15            @ last use of d15
 105 |     vmls.f64    d10, d11, d9
 106 |     vstr    d6, [sp]                    @ spill original value of d6
 107 |     vmov.f64    d6, #6.000000e+00       @ use d6 for constant 6.0
 108 |     vmov.f64    d8, #1.000000e+00
 109 |     vadd.f64    d1, d1, d7
 110 |     vadd.f64    d2, d2, d6              @ last use of constant 6.0 in d6
 111 |     vadd.f64    d5, d5, d8
 112 |     vsub.f64    d0, d0, d7
 113 |     vmul.f64    d7, d12, d4
 114 |     vmov.f64    d8, #5.000000e+00
 115 |     vmov.f64    d6, #-2.000000e+00      @ use d6 for constant -2.0
 116 |     vmul.f64    d2, d2, d5
 117 |     vsub.f64    d1, d1, d10
 118 |     vmul.f64    d5, d7, d8
 119 |     vadd.f64    d0, d0, d6              @ last use of constant -2.0 in d6
 120 |     vmls.f64    d0, d2, d1
 121 |     vldr    d1, [sp]                    @ reload original value of d6
 122 |     vnmls.f64    d3, d5, d1
 123 |     ...
 124 | ```
 125 | 
 126 | It is reasonable to spill `d6` and to use it for temporaries during the
 127 | computation if no other registers are free. But that is not the case here:
 128 | The registers `d13`, `d14`, and `d15` have short previous uses but are then
 129 | unused for the rest of the function. The live ranges allocated to `d6` in
 130 | the above code could have been allocated to these registers instead, saving
 131 | the spill of `d6` and its reload into `d1`.
 132 | 
 133 | In this particular case this spill causes an overhead of the store, the
 134 | load, and one instruction each to allocate and free the stack frame.  GCC
 135 | generates fewer instructions overall, with lower register use and without
 136 | this inline spill.
 137 | 
 138 | Reported at https://bugs.llvm.org/show_bug.cgi?id=37073. It turns out that
 139 | this is due to bad spilling decisions before register allocation that are
 140 | not visible in the example above because they are undone by smart scheduling
 141 | after register allocation. Can be fixed by selecting a different scheduler
 142 | to run before allocation instead of the (bad) default choice.
 143 | 
 144 | 
 145 | ### Excessive copying
 146 | 
 147 | ```
 148 | double fn1(double p1, double p2, double p3, double p4, double p5, double p6,
 149 |            double p7, double p8) {
 150 |   double a, b, c, d, e, f;
 151 |   a = d = b = p1 + p3;
 152 |   e = p3 + p2 * 8.7 + ((p4 - p4) * (a * p7) - 1.7);
 153 |   c = b - d - (p2 - 7.4) * 8.1;
 154 |   f = e - p2 * p3 - d * (4.6 + c) * p4 + p5 * p6 - p7 - p8 + p1;
 155 |   return f;
 156 | }
 157 | ```
 158 | 
 159 | GCC's code starts with an add, a multiply, and a subtract (interspersed with
 160 | constant loads I omit here):
 161 | 
 162 | ```
 163 |     vadd.f64    d8, d0, d2
 164 |     vmul.f64    d9, d8, d6
 165 |     vsub.f64    d10, d3, d3
 166 | ```
 167 | 
 168 | The subtraction is independent of the other instructions, it corresponds to
 169 | the expression `(p4 - p4)` in the source code.
 170 | 
 171 | Clang starts with the same operations, but for unclear reasons it inserts
 172 | four (!) register copy instructions, at least two of which have the effect
 173 | of "setting up" the register `d4` for use by the subtraction, although it
 174 | could just use the values from `d3`.
 175 | 
 176 | ```
 177 |     vadd.f64    d8, d0, d2
 178 |     vmov.f64    d10, d7
 179 |     vmov.f64    d7, d5
 180 |     vmov.f64    d5, d4
 181 |     vmov.f64    d4, d3
 182 |     vmul.f64    d11, d8, d6
 183 |     vsub.f64    d12, d4, d4
 184 | ```
 185 | 
 186 | This seems to be related to scheduling as well; selecting a different
 187 | scheduler with `-mllvm -pre-RA-sched=list-burr` makes it go away.
 188 | 
 189 | 
 190 | # Added 2017-11-04
 191 | 
 192 | Unless noted otherwise, the examples below can be reproduced with the
 193 | following compiler versions and configurations:
 194 | 
 195 | Compiler | Version | Revision | Date | Configuration
 196 | ---------|---------|----------|------|--------------
 197 | GCC      | 8.0.0 20171101 | r254306 | 2017-11-01 | `--target=armv7a-eabihf --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float-abi=hard --with-float=hard`
 198 | Clang    | 6.0.0 (trunk 317088) | LLVM r317088, Clang r317076 | 2017-11-01 | `--target=armv7a-eabihf -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=hard`
 199 | CompCert | 3.1 | 0c00ace | 2017-10-26 | `armv7a-linux`
 200 | 
 201 | ## GCC
 202 | 
 203 | ### Two dead stores for type conversion
 204 | 
 205 | Similar to a case below, but maybe more striking:
 206 | ```
 207 | unsigned fn1(float p1, double p2) {
 208 |   unsigned a = p2;
 209 |   if (p1)
 210 |     a = 8;
 211 |   return a;
 212 | }
 213 | ```
 214 | 
 215 | GCC predicates the entire function, nicely moving the conversion of `p2` to
 216 | `unsigned` into a branch. But for some reason, in both branches it spills
 217 | the value for `a` to the stack, without ever reloading it:
 218 | ```
 219 |     vcmp.f32    s0, #0
 220 |     sub sp, sp, #8
 221 |     vmrs    APSR_nzcv, FPSCR
 222 |     vcvteq.u32.f64  s15, d1
 223 |     movne   r3, #8
 224 |     strne   r3, [sp, #4]
 225 |     movne   r0, r3
 226 |     vmoveq  r0, s15 @ int
 227 |     vstreq.32   s15, [sp, #4]   @ int
 228 |     add sp, sp, #8
 229 |     @ sp needed
 230 |     bx  lr
 231 | ```
 232 | 
 233 | ### Missed simplification of modulo-and
 234 | 
 235 | ```
 236 | short fn1(int p1) {
 237 |   signed a = (p1 % 4) & 1;
 238 |   return a;
 239 | }
 240 | ```
 241 | 
 242 | GCC computes modulo 4 as bitwise-and 3, carefully handling the case where
 243 | `p1` might be negative, then does the bitwise-and 1:
 244 | ```
 245 |     rsbs    r3, r0, #0
 246 |     and r0, r0, #3
 247 |     and r3, r3, #3
 248 |     rsbpl   r0, r3, #0
 249 |     and r0, r0, #1
 250 | ```
 251 | 
 252 | But the result of `(p1 % 4) & 1` is 1 iff `p1 % 4` is odd iff `p1` is odd
 253 | iff `p1 & 1` is 1, so Clang just computes this directly:
 254 | 
 255 | ```
 256 |     and r0, r0, #1
 257 | ```
 258 | 
 259 | ### Missed simplification of division in condition
 260 | 
 261 | In the function:
 262 | ```
 263 | char fn1(char p1) {
 264 |   signed a = 4;
 265 |   if (p1 / (p1 + 3))
 266 |     a = 1;
 267 |   return a;
 268 | }
 269 | ```
 270 | GCC misses the fact that the condition is always false because `p1 + 3` is
 271 | always greater than `p1` (as `char` is unsigned on ARM), and generates code
 272 | to evaluate it:
 273 | ```
 274 |     add r1, r0, #3
 275 |     bl  __aeabi_idiv
 276 |     cmp r0, #0
 277 |     moveq   r0, #4
 278 |     movne   r0, #1
 279 | ```
 280 | 
 281 | Clang returns 4 unconditionally.
 282 | 
 283 | ### Missed simplification/floating-point constant propagation?
 284 | 
 285 | ```
 286 | float fn1(short p1) {
 287 |   char a;
 288 |   float b, c;
 289 |   a = p1;
 290 |   b = c = 765;
 291 |   if (a == b)
 292 |     c = 0;
 293 |   return c;
 294 | }
 295 | ```
 296 | 
 297 | At the comparison `a == b`, `a` is a `char`, and it cannot hold a value that
 298 | compares equal to `b`'s value of 765.
 299 | 
 300 | GCC computes the comparison anyway, Clang returns 765.0f unconditionally.
 301 | 
 302 | ### Missed bitwise tricks
 303 | 
 304 | ```
 305 | char fn1(short p1, int p2) {
 306 |   long a;
 307 |   int v;
 308 |   a = p2 & (p1 | 415615);
 309 |   v = a << 1;
 310 |   return v;
 311 | }
 312 | ```
 313 | 
 314 | The least significant byte of 415615 is `0x7f`, so the lowest 7 bits of `a`
 315 | are the same as the lowest 7 bits of `p2`, hence the lowest 8 bits of `v`
 316 | are the same as the lowest 8 bits of `p2 << 1`, and that is all that is
 317 | needed for the result. GCC generates all the operations present at the
 318 | source level, while Clang simplifies nicely:
 319 | ```
 320 |     lsl r0, r1, #1
 321 |     uxtb    r0, r0
 322 | ```
 323 | 
 324 | ### Unnecessary spilling due to badly scheduled move-immediates
 325 | 
 326 | This is just a much smaller example of an issue with the same title below.
 327 | 
 328 | ```
 329 | long fn1(char p1, short p2, long p3, long p4, short p5, long p6) {
 330 |   long a, b = 1000893;
 331 |   a = 600 - p2 * 3;
 332 |   if (p4 >= p5 * !(p6 * a))
 333 |     b = p3;
 334 |   return b;
 335 | }
 336 | ```
 337 | 
 338 | GCC loads the constant 1000893 into a register early on, causing it to spill
 339 | a register, but only uses the value much later and conditionally:
 340 | 
 341 | ```
 342 |     str lr, [sp, #-4]!      @ spill lr
 343 |     movw    lr, #17853      @ move lower part of 1000893 into lr
 344 |     ldr r1, [sp, #8]
 345 |     movt    lr, 15          @ move upper part of 1000893 into lr
 346 |     ...
 347 |     cmp r1, r3
 348 |     movle   r0, r2
 349 |     movgt   r0, lr          @ use 1000893
 350 | ```
 351 | 
 352 | Clang just loads the constant conditionally, at the point where it is
 353 | actually needed, and avoids spilling:
 354 | 
 355 | ```
 356 |     ...
 357 |     movwgt  r2, #17853
 358 |     movtgt  r2, #15
 359 |     mov r0, r2
 360 | ```
 361 | 
 362 | 
 363 | ## Clang
 364 | 
 365 | ### Missed simplification of integer-valued floating-point arithmetic
 366 | 
 367 | In some cases (see one further below), Clang can perform floating-point
 368 | arithmetic in integers where the result is an exact integer; GCC doesn't
 369 | usually do this. But here is a counterexample:
 370 | 
 371 | ```
 372 | char fn1(char p1) {
 373 |   float a;
 374 |   short b;
 375 |   a = !p1;
 376 |   b = -a;
 377 |   return b;
 378 | }
 379 | ```
 380 | 
 381 | Clang converts to `float`, negates, and converts back:
 382 | ```
 383 |     vmov.f32    s2, #1.000000e+00
 384 |     vldr    s0, .LCPI0_0        @ load 0.0
 385 |     cmp r0, #0
 386 |     vmoveq.f32  s0, s2
 387 |     vneg.f32    s0, s0
 388 |     vcvt.s32.f32    s0, s0
 389 |     vmov    r0, s0
 390 | ```
 391 | 
 392 | GCC doesn't:
 393 | ```
 394 |     cmp r0, #0
 395 |     moveq   r0, #255
 396 |     movne   r0, #0
 397 | ```
 398 | 
 399 | ### Missed conditional constant propagation on `double`
 400 | 
 401 | ```
 402 | short fn1(double p1) {
 403 |   unsigned a = 4;
 404 |   if (p1 == 604884)
 405 |     if (0 * p1)
 406 |       a = 7;
 407 |   return a;
 408 | }
 409 | ```
 410 | 
 411 | If `p1` is 604884, then `0 * p1` is false. Clang evaluates both conditions,
 412 | GCC returns 4 unconditionally.
 413 | 
 414 | 
 415 | ### Missed simplification of bitwise operations
 416 | 
 417 | ```
 418 | float fn1(short p1) {
 419 |   float a = 516.403544893f;
 420 |   if (p1 & 6 >> (p1 & 31) ^ ~0)
 421 |     a = 1;
 422 |   return a;
 423 | }
 424 | ```
 425 | 
 426 | Clang evaluates the branch condition because it misses that `((p1 & (6 >>
 427 | (p1 & 31))) ^ ~0)` is nonzero because almost all the bits of the inner
 428 | expression are zero.
 429 | 
 430 | 
 431 | ## CompCert
 432 | 
 433 | ### Reluctance to select `mvn` instruction
 434 | 
 435 | In this function:
 436 | ```
 437 | int fn1(int p1) {
 438 |   return -7 - p1;
 439 | }
 440 | ```
 441 | CompCert computes the subtraction by adding a sequence of large constants:
 442 | ```
 443 |     rsb r0, r0, #249
 444 |     add r0, r0, #65280
 445 |     add r0, r0, #16711680
 446 |     add r0, r0, #-16777216
 447 | ```
 448 | 
 449 | It's simpler to synthesize the constant -7 as the negation of 6 and
 450 | subtracting, as GCC does:
 451 | ```
 452 |     mvn r3, #6
 453 |     sub r0, r3, r0
 454 | ```
 455 | 
 456 | 
 457 | # Added 2017-09-06
 458 | 
 459 | Unless noted otherwise, the examples below can be reproduced with the
 460 | following compiler versions and configurations:
 461 | 
 462 | Compiler | Version | Revision | Date | Configuration
 463 | ---------|---------|----------|------|--------------
 464 | GCC      | 8.0.0 20170906 | r251801 | 2017-09-06 | `--target=armv7a-eabihf --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float-abi=hard --with-float=hard`
 465 | Clang    | 6.0.0 (trunk 312637) | LLVM r312635, Clang r312634 | 2017-09-06 | `--target=armv7a-eabihf -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=hard`
 466 | CompCert | 3.0.1 | f3c60be | 2017-08-28 | `armv7a-linux`
 467 | 
 468 | ## GCC
 469 | 
 470 | ### Useless initialization of struct passed by value
 471 | 
 472 | ```
 473 | struct S0 {
 474 |   int f0;
 475 |   int f1;
 476 |   int f2;
 477 |   int f3;
 478 | };
 479 | 
 480 | int f1(struct S0 p) {
 481 |     return p.f0;
 482 | }
 483 | ```
 484 | 
 485 | The struct is passed in registers, and the function's result is already in
 486 | `r0`, which is also the return register. The function could return
 487 | immediately, but GCC first stores all the struct fields to the stack and
 488 | reloads the first field:
 489 | 
 490 | ```
 491 |     sub sp, sp, #16
 492 |     add ip, sp, #16
 493 |     stmdb   ip, {r0, r1, r2, r3}
 494 |     ldr r0, [sp]
 495 |     add sp, sp, #16
 496 |     bx  lr
 497 | ```
 498 | 
 499 | Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80905.
 500 | 
 501 | ### `float` to `char` type conversion goes through memory
 502 | 
 503 | ```
 504 | char fn1(float p1) {
 505 |   return (char) p1;
 506 | }
 507 | ```
 508 | results in:
 509 | ```
 510 |     vcvt.u32.f32    s15, s0
 511 |     sub sp, sp, #8
 512 |     vstr.32 s15, [sp, #4]   @ int
 513 |     ldrb    r0, [sp, #4]    @ zero_extendqisi2
 514 | ```
 515 | i.e., the result of the conversion in `s15` is stored to the stack and then
 516 | reloaded into `r0` instead of just copying it between the registers (and
 517 | possibly truncating to `char`'s range). Same for `double` instead of
 518 | `float`, but *not* for `short` instead of `char`.
 519 | 
 520 | Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80861.
 521 | 
 522 | If you are worried about the conversion overflowing, you might prefer this
 523 | version:
 524 | 
 525 | ```
 526 | #include <math.h>
 527 | #include <limits.h>
 528 | 
 529 | char fn1(float p1) {
 530 |   if (isfinite(p1) && CHAR_MIN <= p1 && p1 <= CHAR_MAX) {
 531 |     return (char) p1;
 532 |   }
 533 |   return 0;
 534 | }
 535 | ```
 536 | 
 537 | GCC still goes through memory to perform the zero extension.
 538 | 
 539 | 
 540 | ### Spill instead of register copy
 541 | 
 542 | ```
 543 | int N;
 544 | int fn2(int p1, int p2) {
 545 |   int a = p2;
 546 |   if (N)
 547 |     a *= 10.5;
 548 |   return a;
 549 | }
 550 | ```
 551 | 
 552 | GCC spills `a` although enough integer registers would be available:
 553 | ```
 554 |     str r1, [sp, #4]    // initialize a to p2
 555 |     cmp r3, #0
 556 |     beq .L13
 557 |     ... compute multiplication, put into d7 ...
 558 |     vcvt.s32.f64    s15, d7
 559 |     vstr.32 s15, [sp, #4]   @ int   // update a in branch
 560 | .L13:
 561 |     ldr r0, [sp, #4]    // reload a
 562 | ```
 563 | 
 564 | Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81012 (together
 565 | with another issue further below).
 566 | 
 567 | ### Missed simplification of multiplication by integer-valued floating-point constant
 568 | 
 569 | Variant of the above code with the constant changed slightly:
 570 | ```
 571 | int N;
 572 | int fn5(int p1, int p2) {
 573 |   int a = p2;
 574 |   if (N)
 575 |     a *= 10.0;
 576 |   return a;
 577 | }
 578 | ```
 579 | 
 580 | GCC converts `a` to double and back as above, but the result must be the
 581 | same as simply multiplying by the integer 10. Clang realizes this and
 582 | generates an integer multiply, removing all floating-point operations.
 583 | 
 584 | If you are worried about the multiplication overflowing, you might prefer
 585 | this version:
 586 | 
 587 | ```
 588 | #include <limits.h>
 589 | 
 590 | int N;
 591 | int fn5(int p1, int p2) {
 592 |   int a = p2;
 593 |   if (N && INT_MIN / 10.0 <= a && a <= INT_MAX / 10.0)
 594 |     a *= 10.0;
 595 |   return a;
 596 | }
 597 | ```
 598 | 
 599 | Clang still generates code for multiplying integers, while GCC multiplies as
 600 | `double`.
 601 | 
 602 | ### Dead store for `int` to `double` conversion
 603 | 
 604 | ```
 605 | double fn3(int p1, short p2) {
 606 |   double a = p1;
 607 |   if (1881 * p2 + 5)
 608 |     a = 2;
 609 |   return a;
 610 | }
 611 | ```
 612 | 
 613 | `a` is initialized to the value of `p1`, and GCC spills this value. The
 614 | spill store is dead, the value is never reloaded:
 615 | 
 616 | ```
 617 |     ...
 618 |     str r0, [sp, #4]
 619 |     cmn r1, #5
 620 |     vmovne.f64  d0, #2.0e+0
 621 |     vmoveq  s15, r0 @ int
 622 |     vcvteq.f64.s32  d0, s15
 623 |     add sp, sp, #8
 624 |     @ sp needed
 625 |     bx  lr
 626 | ```
 627 | 
 628 | Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81012 (together
 629 | with another issue above).
 630 | 
 631 | ### Missed strength reduction of division to comparison
 632 | 
 633 | ```
 634 | unsigned int fn1(unsigned int a, unsigned int p2) {
 635 |   int b;
 636 |   b = 3;
 637 |   if (p2 / a)
 638 |     b = 7;
 639 |   return b;
 640 | }
 641 | ```
 642 | 
 643 | The condition `p2 / a` is true (nonzero) iff `p2 >= a`. Clang compares, GCC
 644 | generates a division (on ARM, x86, and PowerPC).
 645 | 
 646 | ### Missed simplification of floating-point addition
 647 | 
 648 | ```
 649 | long fn1(int p1) {
 650 |   double a;
 651 |   long b = a = p1;
 652 |   if (0 + a + 0)
 653 |     b = 9;
 654 |   return b;
 655 | }
 656 | ```
 657 | 
 658 | It is not correct in general to replace `a + 0` by `a` in floating-point
 659 | arithmetic due to NaNs and negative zeros and whatnot, but here `a` is
 660 | initialized from an `int`'s value, so none of these cases can happen. GCC
 661 | generates all the floating-point operations, while Clang just compares `p1`
 662 | to zero.
 663 | 
 664 | ### Missed reassociation of multiplications
 665 | 
 666 | ```
 667 | double fn1(double p1) {
 668 |   double v;
 669 |   v = p1 * p1 * (-p1 * p1);
 670 |   return v;
 671 | }
 672 | 
 673 | int fn2(int p1) {
 674 |   int v;
 675 |   v = p1 * p1 * (-p1 * p1);
 676 |   return v;
 677 | }
 678 | ```
 679 | 
 680 | For each function, GCC generates three multiplications:
 681 | 
 682 | ```
 683 |     vnmul.f64   d7, d0, d0
 684 |     vmul.f64    d0, d0, d0
 685 |     vmul.f64    d0, d7, d0
 686 | ```
 687 | and
 688 | ```
 689 |     rsb r2, r0, #0
 690 |     mul r3, r0, r0
 691 |     mul r0, r0, r2
 692 |     mul r0, r3, r0
 693 | ```
 694 | 
 695 | Clang can do the same with two multiplications each:
 696 | 
 697 | ```
 698 |     vmul.f64    d0, d0, d0
 699 |     vnmul.f64   d0, d0, d0
 700 | ```
 701 | and
 702 | ```
 703 |     mul r0, r0, r0
 704 |     rsb r1, r0, #0
 705 |     mul r0, r0, r1
 706 | ```
 707 | 
 708 | ### Heroic loop optimization instead of removing the loop
 709 | 
 710 | ```
 711 | int N;
 712 | double fn1(double *p1) {
 713 |   int i = 0;
 714 |   double v = 0;
 715 |   while (i < N) {
 716 |     v = p1[i];
 717 |     i++;
 718 |   }
 719 |   return v;
 720 | }
 721 | ```
 722 | 
 723 | This function returns 0 if `N` is 0 or negative; otherwise, it returns
 724 | `p1[N-1]`. On x86-64, GCC generates a complex unrolled loop (and a simpler
 725 | loop on ARM). Clang removes the loop and replaces it by a simple branch.
 726 | GCC's code is too long and boring to show here, but it's similar enough to
 727 | https://godbolt.org/g/RYwgq4.
 728 | 
 729 | *Note:* An earlier version of this document claimed that GCC also removed the
 730 | loop on ARM. It was pointed out to me that this claim was false, GCC does
 731 | generate a loop on ARM too. My (stupid) mistake, I had misread the assembly
 732 | code.
 733 | 
 734 | ### Unnecessary spilling due to badly scheduled move-immediates
 735 | 
 736 | ```
 737 | double fn1(double p1, float p2, double p3, float p4, double *p5, double p6)
 738 | {
 739 |   double a, c, d, e;
 740 |   float b;
 741 |   a = p3 / p1 / 38 / 2.56380695236e38f * 5;
 742 |   c = 946293016.021f / a + 9;
 743 |   b = p1 - 6023397303.74f / p3 * 909 * *p5;
 744 |   d = c / 683596028.301 + p3 - b + 701.296936339 - p4 * p6;
 745 |   e = p2 + d;
 746 |   return e;
 747 | }
 748 | ```
 749 | 
 750 | Sorry about the mess. GCC generates a bunch of code for this, including:
 751 | 
 752 | ```
 753 |     vstr.64 d10, [sp]
 754 |     vmov.f64    d11, #5.0e+0
 755 |     vmov.f64    d13, #9.0e+0
 756 |     ... computations involving d10 but not d11 or d13 ...
 757 |     vldr.64 d10, [sp]
 758 |     vcvt.f64.f32    d6, s0
 759 |     vmul.f64    d11, d7, d11
 760 |     vdiv.f64    d5, d12, d11
 761 |     vadd.f64    d13, d5, d13
 762 |     vdiv.f64    d0, d13, d14
 763 | ```
 764 | 
 765 | The register `d10` must be spilled to make room for the constants 5 and 9
 766 | being loaded into `d11` and `d13`, respectively. But these loads are much
 767 | too early: Their values are only needed after `d10` is restored. These move
 768 | instructions (at least one of them) should be closer to their uses, freeing
 769 | up registers and avoiding the spill.
 770 | 
 771 | ### Missed constant propagation (or so it seemed)
 772 | 
 773 | ```
 774 | int fn1(int p1) {
 775 |   int a = 6;
 776 |   int b = (p1 / 12 == a);
 777 |   return b;
 778 | }
 779 | int fn2(int p1) {
 780 |   int b = (p1 / 12 == 6);
 781 |   return b;
 782 | }
 783 | ```
 784 | 
 785 | These functions should be compiled identically. The condition is equivalent
 786 | to `6 * 12 <= p1 < 7 * 12`, and the two signed comparisons can be replaced
 787 | by a subtraction and a single unsigned comparison:
 788 | 
 789 | ```
 790 |     sub r0, r0, #72
 791 |     cmp r0, #11
 792 |     movhi   r0, #0
 793 |     movls   r0, #1
 794 |     bx  lr
 795 | ```
 796 | 
 797 | GCC used to do this for the first function, but not the second one. It
 798 | seemed like constant propagation failed to replace `a` by 6.
 799 | 
 800 | Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81346 and fixed in
 801 | revision r250342; it turns out that this pattern used to be implemented
 802 | (only) in a very early folding pass that ran before constant propagation.
 803 | 
 804 | ### Missed tail-call optimization for compiler-generated call to intrinsic
 805 | 
 806 | ```
 807 | int fn1(int a, int b) {
 808 |   return a / b;
 809 | }
 810 | ```
 811 | 
 812 | On ARM, integer division is turned into a function call:
 813 | 
 814 | ```
 815 |     push    {r4, lr}
 816 |     bl  __aeabi_idiv
 817 |     pop {r4, pc}
 818 | ```
 819 | 
 820 | In the code above, GCC allocates a stack frame, not realizing that this call
 821 | in tail position could be simply
 822 | ```
 823 |     b   __aeabi_idiv
 824 | ```
 825 | 
 826 | (When the call, even to `__aeabi_idiv`, appears in the source code, GCC is
 827 | perfectly capable of compiling it as a tail call.)
 828 | 
 829 | 
 830 | ## Clang
 831 | 
 832 | ### Incomplete range analysis for certain types
 833 | 
 834 | ```
 835 | short fn1(short p) {
 836 |   short c = p / 32769;
 837 |   return c;
 838 | }
 839 | 
 840 | int fn2(signed char p) {
 841 |   int c = p / 129;
 842 |   return c;
 843 | }
 844 | ```
 845 | 
 846 | In both cases, the divisor is outside the possible range of values for `p`,
 847 | so the function's result is always 0. Clang doesn't realize this and
 848 | generates code to compute the result. (It does optimize corresponding cases
 849 | for unsigned `short` and `char`.)
 850 | 
 851 | A report with this and all the other range analysis examples below was filed
 852 | by [@gnzlbg](https://github.com/gnzlbg) at
 853 | https://bugs.llvm.org/show_bug.cgi?id=34517, and most cases have been fixed.
 854 | 
 855 | ### More incomplete range analysis
 856 | 
 857 | ```
 858 | int fn1(short p4, int p5) {
 859 |   int v = 0;
 860 |   if ((p4 - 32779) - !p5)
 861 |     v = 7;
 862 |   return v;
 863 | }
 864 | ```
 865 | 
 866 | The condition is always true, and the function always returns 7. Clang
 867 | generates code to evaluate the condition. Similar to the case above, but
 868 | seems to be more complex: If the `- !p5` is removed, Clang also realizes
 869 | that the condition is always true.
 870 | 
 871 | ### Even more incomplete range analysis
 872 | 
 873 | ```
 874 | int fn1(int p, int q) {
 875 |   int a = q + (p % 6) / 9;
 876 |   return a;
 877 | }
 878 | 
 879 | int fn2(int p) {
 880 |   return (1 / p) / 100;
 881 | }
 882 | int fn3(int p5, int p6) {
 883 |   int a, c, v;
 884 |   v = -!p5;
 885 |   a = v + 7;
 886 |   c = (4 % a);
 887 |   return c;
 888 | }
 889 | int fn4(int p4, int p5) {
 890 |   int a = 5;
 891 |   if ((5 % p5) / 9)
 892 |     a = p4;
 893 |   return a;
 894 | }
 895 | ```
 896 | 
 897 | The first function always returns `q`, the others always return 0, 4, and 5,
 898 | respectively. Clang evaluates at least one division or modulo operation in
 899 | each case.
 900 | 
 901 | (I have a *lot* more examples like this, but this list is boring enough as
 902 | it is.)
 903 | 
 904 | ### Redundant code generated for `double` to `int` conversion
 905 | 
 906 | ```
 907 | int fn1(double c, int *p, int *q, int *r, int *s) {
 908 |   int i = (int) c;
 909 |   *p = i;
 910 |   *q = i;
 911 |   *r = i;
 912 |   *s = i;
 913 |   return i;
 914 | }
 915 | ```
 916 | 
 917 | There is a single type conversion on the source level. Clang duplicates the
 918 | conversion operation (`vcvt`) for each use of `i`:
 919 | 
 920 | ```
 921 |     vcvt.s32.f64    s2, d0
 922 |     vstr    s2, [r0]
 923 |     vcvt.s32.f64    s2, d0
 924 |     vstr    s2, [r1]
 925 |     vcvt.s32.f64    s2, d0
 926 |     vstr    s2, [r2]
 927 |     vcvt.s32.f64    s2, d0
 928 |     vcvt.s32.f64    s0, d0
 929 |     vmov    r0, s0
 930 |     vstr    s2, [r3]
 931 | ```
 932 | 
 933 | Several times, the result of the conversion is written into the register
 934 | that already holds that same value (`s2`).
 935 | 
 936 | Reported at https://bugs.llvm.org/show_bug.cgi?id=33199, fixed.
 937 | 
 938 | ### Missing instruction selection pattern for `vnmla`
 939 | 
 940 | ```
 941 | double fn1(double p1, double p2, double p3) {
 942 |   double a = -p1 - p2 * p3;
 943 |   return a;
 944 | }
 945 | ```
 946 | 
 947 | Clang generates
 948 | ```
 949 |     vneg.f64    d0, d0
 950 |     vmls.f64    d0, d1, d2
 951 | ```
 952 | but this could just be
 953 | ```
 954 |     vnmla.f64   d0, d1, d2
 955 | ```
 956 | 
 957 | The instruction selector has a pattern for selecting `vnmla` for the
 958 | expression `-(a * b) - dst`, but not for the symmetric `-dst - (a * b)`.
 959 | 
 960 | Patch submitted: https://reviews.llvm.org/D35911
 961 | 
 962 | 
 963 | ## CompCert
 964 | 
 965 | ### No dead code elimination for useless branches
 966 | 
 967 | ```
 968 | void fn1(int r0) {
 969 |   if (r0 + 42)
 970 |     0;
 971 | }
 972 | ```
 973 | 
 974 | The conditional branch due to the `if` statement is removed, but the
 975 | computation for its condition remains in the generated code:
 976 | 
 977 | ```
 978 |     add r0, r0, #42
 979 | ```
 980 | 
 981 | This appears to be because dead code elimination runs before useless
 982 | branches are identified and eliminated.
 983 | 
 984 | Fixed in
 985 | https://github.com/AbsInt/CompCert/commit/93d2fc9e3b23d69c0c97d229f9fa4ab36079f507
 986 | due to my report.
 987 | 
 988 | ### Missing constant folding patterns for modulo
 989 | 
 990 | ```
 991 | int fn4(int p1) {
 992 |   int a = 0 % p1;
 993 |   return a;
 994 | }
 995 | 
 996 | int fn5(int a) {
 997 |   int b = a % a;
 998 |   return b; 
 999 | } 
1000 | ```
1001 | 
1002 | In both cases, CompCert generates a modulo operation (on ARM, it generates a
1003 | library call, a multiply, and a subtraction). In both cases, this operation
1004 | could be constant folded to zero.
1005 | 
1006 | My patch at 
1007 | https://github.com/gergo-/CompCert/commit/78555eb7afacac184d94db43c9d4438d20502be8
1008 | fixes the first one, but not the second. It also fails to deal with the
1009 | equivalent case of
1010 | ```
1011 |   int zero = 0;
1012 |   int a = zero % p1;
1013 | ```
1014 | because, by the time the constant propagator runs, the modulo operation has
1015 | been mangled into a complex call/multiply/subtract sequence that is
1016 | inconvenient to recognize.
1017 | 
1018 | Status: CompCert developers are (understandably) uninterested in this
1019 | trivial issue.
1020 | 
1021 | ### Failure to generate `movw` instruction
1022 | 
1023 | ```
1024 | int fn1(int p1) {
1025 |   int a = 506 * p1;
1026 |   return a;
1027 | }
1028 | ```
1029 | results in:
1030 | ```
1031 |     mov r1, #250
1032 |     orr r1, r1, #256
1033 |     mul r0, r0, r1
1034 | ```
1035 | 
1036 | Two instructions are spent on loading the constant. However, ARM's `movw`
1037 | instruction can take an immediate between 0 and 65535, so this can be
1038 | simplified to:
1039 | 
1040 | ```
1041 |     movw r1, #506
1042 | ```
1043 | 
1044 | Fixed in a series of patches from me, merged at
1045 | https://github.com/AbsInt/CompCert/commit/c4dcf7c08016f175ba6c06d20c530ebaaad67749.
1046 | 
1047 | ### Failure to constant fold `mla`
1048 | 
1049 | Call this code `mla.c`:
1050 | 
1051 | ```
1052 | int fn1(int p1) {
1053 |   int a, b, c, d, e, v, f;
1054 |   a = 0;
1055 |   b = c = 0;
1056 |   d = e = p1;
1057 |   v = 4;
1058 |   f = e * d | a * p1 + b;
1059 |   return f;
1060 | }
1061 | ```
1062 | 
1063 | The expression `a * p1 + b` evaluates to `0 * p1 + 0`, i.e., 0. CompCert is
1064 | able to fold both `0 * x` and `x + 0` in isolation, but on ARM it generates
1065 | an `mla` (multiply-add) instruction, for which it has no constant folding
1066 | rules:
1067 | 
1068 | ```
1069 |     mov r4, #0
1070 |     mov r12, #0
1071 |     ...
1072 |     mla r2, r4, r0, r12
1073 | ```
1074 | 
1075 | Fixed in a patch from me, merged at
1076 | https://github.com/AbsInt/CompCert/commit/4f46e57884a909d2f62fa7cea58b3d933a6a5e58.
1077 | 
1078 | ### Unnecessary spilling due to dead code
1079 | 
1080 | On `mla.c` from above without the `mla` constant folding patch, CompCert
1081 | spills the `r4` register which it then uses to store a temporary:
1082 | 
1083 | ```
1084 |     str r4, [sp, #8]
1085 |     mov r4, #0
1086 |     mov r12, #0
1087 |     mov r1, r0
1088 |     mov r2, r1
1089 |     mul r3, r2, r1
1090 |     mla r2, r4, r0, r12
1091 |     orr r0, r3, r2
1092 |     ldr r4, [sp, #8]
1093 | ```
1094 | 
1095 | However, the spill goes away if the line `v = 4;` is removed from the
1096 | program:
1097 | 
1098 | ```
1099 |     mov r3, #0
1100 |     mov r12, #0
1101 |     mov r1, r0
1102 |     mov r2, r1
1103 |     mul r1, r1, r2
1104 |     mla r12, r3, r0, r12
1105 |     orr r0, r1, r12
1106 | ```
1107 | 
1108 | The value of `v` is never used. This assignment is dead code, and it is
1109 | compiled away. It should not affect register allocation.
1110 | 
1111 | ### Missed copy propagation
1112 | 
1113 | In the previous example, the copies for the arguments to the `mul` operation
1114 | (`r0` to `r1`, then on to `r2`) are redundant. They could be removed, and
1115 | the multiplication written as just `mul r1, r0, r0`.
1116 | 
1117 | *Note:* This entry previously spoke of copy coalescing, but [Sebastian
1118 | Hack](http://compilers.cs.uni-saarland.de/people/hack/) pointed out that
1119 | it's actually copy propagation that is missed here.
1120 | 
1121 | ### Failure to propagate folded constants
1122 | 
1123 | Continuing the `mla.c` example further, but this time after applying the
1124 | `mla` constant folding patch, CompCert generates:
1125 | 
1126 | ```
1127 |     mov r1, r0
1128 |     mul r2, r1, r0
1129 |     mov r1, #0
1130 |     orr r0, r2, r1
1131 | ```
1132 | 
1133 | The `x | 0` operation is redundant. CompCert is able to fold this
1134 | operation if it appears in the source code, but in this case the 0 comes
1135 | from the previously folded `mla` operation. Such constants are not
1136 | propagated by the "constant propagation" (in reality, only local folding)
1137 | pass. Values are only propagated by the value analysis that runs before, but
1138 | for technical reasons the value analysis cannot currently take advantage of
1139 | the algebraic properties of operators' neutral and zero elements.
1140 | 
1141 | ### Missed reassociation and folding of constants
1142 | 
1143 | ```
1144 | int fn1(int p1) {
1145 |   int a, v, b;
1146 |   v = 1;
1147 |   a = 3 * p1 * 2 * 3;
1148 |   b = v + a * 83;
1149 |   return b;
1150 | }
1151 | ```
1152 | 
1153 | All the multiplications could be folded into a single multiplication of `p1`
1154 | by 1494, but CompCert generates a series of adds and shifts before a
1155 | multiplication by 83:
1156 | 
1157 | ```
1158 |     add r3, r0, r0, lsl #1
1159 |     mov r0, r3, lsl #1
1160 |     add r1, r0, r0, lsl #1
1161 |     mov r2, #83
1162 |     mla r0, r2, r1, r12
1163 | ```
1164 | 


--------------------------------------------------------------------------------