├── .gitignore
├── Makefile
├── README.md
├── benchmarks
│   ├── Makefile
│   └── bench.cpp
├── figures
│   ├── HotSpotOffCPU.png
│   ├── benchmark_linear_complexity.png
│   ├── dot_disassembly.png
│   ├── frontend_v_backend.png
│   ├── hotspotoffcpuconfig.png
│   ├── logo.svg
│   ├── perf_report_homescreen.png
│   ├── read_portal.svg
│   ├── vtkm_cuda.svg
│   ├── vtkm_openmp.svg
│   ├── vtkm_tbb_rendering.svg
│   └── ymm_dot_product.png
├── perf_permissions.sh
└── src
    ├── decent_code.cpp
    ├── dot.asm
    ├── mwe.cpp
    └── use_asm.cpp

/.gitignore:
--------------------------------------------------------------------------------
dot
perf.data
perf.data.old
a.out
*.o
*.x

--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
CXX = g++
CPPFLAGS = --std=c++17 -g -fno-omit-frame-pointer -O3 -march=native -fno-finite-math-only -ffast-math -fopenmp

all: dot

dot: src/mwe.cpp
	$(CXX) $(CPPFLAGS) $< -o $@

use_asm.x: src/use_asm.cpp
	yasm -f elf64 -g dwarf2 src/dot.asm -o dot.o
	$(CXX) -o $@ dot.o $<

clean:
	rm -f dot *.x *.o perf.data perf.data.old a.out src/*.o src/*.x src/perf.data src/perf.data.old
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
slidenumbers: true

# Performance Tuning

![inline](figures/logo.svg)

Nick Thompson

^ Thanks for coming. First let's give a shoutout to Matt Wolf for putting this tutorial together, and to Barney Maccabe for putting the support behind it to make it happen.

---

Session 1: Using `perf`

[Follow along](https://github.com/NAThompson/performance_tuning_tutorial):

```
$ git clone https://github.com/NAThompson/performance_tuning_tutorial
$ cd performance_tuning_tutorial
$ make
$ ./dot 100000000
```

^ This is a tutorial, so definitely follow along. I will be pacing this under the assumption that you are following along, so you'll get bored if you're only watching. In addition, at the end of the tutorial we'll do a short quiz, not to stress anyone out, but to solidify the concepts. I hope that'll galvanize us to bring a bit more intensity than is usually brought to a six-hour training session! If the stakes are too low, we're just gonna waste two good mornings.

^ Please get the notes from github, and attempt to issue the commands.

---

## What is `perf`?

- Performance tools for Linux
- Designed to profile the kernel, but can also profile userspace apps
- Sampling based
- Maintained in the Linux kernel source tree

---

## Installing `perf`: Ubuntu

```bash
$ sudo apt install linux-tools-common
$ sudo apt install linux-tools-generic
$ sudo apt install linux-tools-`uname -r`
```

^ Installation is pretty easy on Ubuntu.

---

## Installing `perf`: CentOS

```bash
$ yum install perf
```

---

## Access `perf`:

`perf` is available on Summit (summit.olcf.ornl.gov), Andes (andes.olcf.ornl.gov), and the SNS nodes (analysis.sns.gov).

I have verified that all the commands of this tutorial work on Andes.

---

## Installing `perf`: Source build

```bash
$ git clone --depth=1 https://github.com/torvalds/linux.git
$ cd linux/tools/perf
$ make
$ ./perf
```

^ I like doing source builds of `perf`. Not only because I often don't have root, but also because `perf` improves over time, so I like to get the latest version. For example, new hardware counters were recently added for the Power9 architecture.

---

## Please do a source build for this tutorial!

A source build is the first step to owning your tools, and will help us all be on the same page.

---

## Ubuntu Dependencies

```
$ sudo apt install -y bison flex libslang2-dev systemtap-sdt-dev \
     libnuma-dev libcap-dev libbabeltrace-ctf-dev libiberty-dev python-dev
```

---

# `perf_permissions.sh`

```bash
#!/bin/bash

# Taken from Milian Wolff's talk "Linux perf for Qt developers"
sudo mount -o remount,mode=755 /sys/kernel/debug
sudo mount -o remount,mode=755 /sys/kernel/debug/tracing
echo "0" | sudo tee /proc/sys/kernel/kptr_restrict
echo "-1" | sudo tee /proc/sys/kernel/perf_event_paranoid
sudo chown `whoami` /sys/kernel/debug/tracing/uprobe_events
sudo chmod a+rw /sys/kernel/debug/tracing/uprobe_events
```

^ If we have root, we have the ability to extract more information from `perf` traces. Kernel debug symbols are a nice-to-have, not a need-to-have, so if you don't have root, don't fret too much.

---

## `perf` MWE

```bash
$ perf stat ls
data  Desktop  Documents  Downloads  Music  Pictures  Public  Templates  TIS  Videos

 Performance counter stats for 'ls':

              2.78 msec task-clock:u              #    0.094 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               283      page-faults:u             #    0.102 M/sec
           838,657      cycles:u                  #    0.302 GHz
           584,659      instructions:u            #    0.70  insn per cycle
           128,106      branches:u                #   46.109 M/sec
             7,907      branch-misses:u           #    6.17% of all branches

       0.029630910 seconds time elapsed

       0.000000000 seconds user
       0.003539000 seconds sys
```

^ This is the `perf` "hello world". You might see something a bit different depending on your architecture and `perf` version.

---

## Why `perf`?

There are lots of great performance analysis tools (Intel VTune, Score-P, TAU, cachegrind), but my opinion is that `perf` should be the first tool you reach for.

---

## Why `perf`?

- No fighting for a license, and no installing Java runtimes on HPC clusters
- No need to vandalize source code, or be constrained to a specific set of languages
- Text GUI, so easy to use in a terminal and over `ssh`

---

## Why `perf`?

- Available on any Linux system
- Not limited to x86: works on ARM, RISC-V, PowerPC, SPARC
- Samples your program rather than modeling it
- Doesn't noticeably slow your program down

^ I was trained in mathematics, and I love learning math because it feels permanent. The situation in computer science is much worse. For example, if no one decides to write a Fortran compiler that targets the new Apple M1 chip, there's no Fortran on the Apple M1! So learning tools which will last is important to me.

^ `perf` is part of the Linux kernel, so it has credibility that it will survive for a long time. It also works on any architecture Linux compiles on, so it's widely available. As a sampling profiler, it relies on statistics, not a model of your program.

---

## Why not `perf`?

- Text GUI, so fancy graphics must be generated by post-processing
- *Only* available on Linux
- Significant limitations when profiling GPUs

---

### `src/mwe.cpp`

```cpp
#include <iostream>
#include <vector>

double dot_product(double* a, double* b, size_t n) {
    double d = 0;
    for (size_t i = 0; i < n; ++i) {
        d += a[i]*b[i];
    }
    return d;
}

int main(int argc, char** argv) {
    if (argc != 2) {
        std::cerr << "Usage: ./dot 10\n";
        return 1;
    }
    size_t n = atoi(argv[1]);
    std::vector<double> a(n);
    std::vector<double> b(n);
    for (size_t i = 0; i < n; ++i) {
        a[i] = i;
        b[i] = 1/double(i+3);
    }
    double d = dot_product(a.data(), b.data(), n);
    std::cout << "a·b = " << d << "\n";
}
```

---

## Running the MWE under `perf`

```bash
$ g++ src/mwe.cpp
$ perf stat ./a.out 1000000000
a.b = 1e+09

 Performance counter stats for './a.out 1000000000':

         14,881.09 msec task-clock:u              # 0.999 CPUs utilized
                 0      context-switches:u        # 0.000 K/sec
                 0      cpu-migrations:u          # 0.000 K/sec
            17,595      page-faults:u             # 0.001 M/sec
    39,657,728,345      cycles:u                  # 2.665 GHz                    (50.00%)
    27,974,789,022      stalled-cycles-frontend:u # 70.54% frontend cycles idle  (50.01%)
     6,000,965,962      stalled-cycles-backend:u  # 15.13% backend cycles idle   (50.01%)
    88,999,950,765      instructions:u            # 2.24 insn per cycle
                                                  # 0.31 stalled cycles per insn (50.00%)
    15,998,544,101      branches:u                # 1075.093 M/sec               (49.99%)
            37,578      branch-misses:u           # 0.00% of all branches        (49.99%)

      14.892496917 seconds time elapsed

      13.566616000 seconds user
       1.199643000 seconds sys
```

^ If you have a different `perf` version, you might see `stalled-cycles:frontend` and `stalled-cycles:backend`.
Stalled frontend cycles are those where instructions could not be fetched and decoded fast enough to keep the execution units fed.
Stalled backend cycles are those where data did not arrive fast enough. Backend cycles stall much more frequently than frontend cycles. See [here](https://stackoverflow.com/questions/22165299) for more details.

---

## Learning from `perf stat`

- 2.24 instructions/cycle and a large number of stalled frontend cycles means we're probably CPU bound. Right? Right?? (Stay tuned)
- Our branch miss rate is really good!

But it's not super informative, nor is it actionable.

---

## Aside on 'frontend-cycles' vs 'backend-cycles'

![inline](figures/frontend_v_backend.png)

[Source](https://software.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/methodologies/top-down-microarchitecture-analysis-method.html)

This is how Intel divvies up the "frontend" and "backend" of the CPU. The frontend is responsible for instruction fetch, decode, and scheduling; the backend is responsible for executing instructions and fetching data.
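
---

## Feeling a backend stall

To see a backend-bound workload in isolation, here's a small experiment you can run under `perf stat -d`. It's a sketch of mine, not a file in the tutorial repo: each load depends on the previous one and lands on an unpredictable cache line, so the backend spends most of its cycles waiting on memory.

```cpp
#include <algorithm>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main() {
    // Walk a random permutation: every load depends on the previous
    // load's result, so the CPU cannot hide the cache misses.
    size_t n = 1 << 24;
    std::vector<size_t> next(n);
    std::iota(next.begin(), next.end(), size_t(0));
    std::shuffle(next.begin(), next.end(), std::mt19937_64(42));

    size_t i = 0;
    for (size_t hops = 0; hops < 4*n; ++hops) {
        i = next[i];
    }
    std::cout << i << "\n"; // keep the chase from being optimized out
}
```

Compare its `stalled-cycles-backend` against the dot product's: the dot product streams sequentially and prefetches well; this walk does not.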

---

> The cycles stalled in the back-end are a waste because the CPU has to wait for resources (usually memory) or to finish long latency instructions (e.g. transcendentals: sqrt, reciprocals, divisions, etc.). The cycles stalled in the front-end are a waste because that means that the Front-End does not feed the Back End with micro-operations. This can mean that you have misses in the Instruction cache, or complex instructions that are not already decoded in the micro-op cache. Just-in-time compiled code usually expresses this behavior.

-- [stackoverflow](https://stackoverflow.com/a/29059380/)

---

## Learning from `perf stat`

`perf` is written by kernel developers, so the `perf stat` defaults are for them.

At ORNL, we're HPC developers, so let's make some changes. What stats do we have available?

---

```
$ perf list
List of pre-defined events (to be used in -e):

  branch-misses                              [Hardware event]
  cache-misses                               [Hardware event]
  cache-references                           [Hardware event]
  instructions                               [Hardware event]
  task-clock                                 [Software event]

  L1-dcache-load-misses                      [Hardware cache event]
  L1-dcache-loads                            [Hardware cache event]
  LLC-load-misses                            [Hardware cache event]
  LLC-loads                                  [Hardware cache event]

  cache-misses OR cpu/cache-misses/          [Kernel PMU event]
  cache-references OR cpu/cache-references/  [Kernel PMU event]
  power/energy-cores/                        [Kernel PMU event]
  power/energy-pkg/                          [Kernel PMU event]
  power/energy-ram/                          [Kernel PMU event]
```

^ Every architecture has a different set of PMCs, so this list will be different for everyone. I like the `power` measurements, since speed is not the only sensible objective we might want to pursue.

---

## Custom events

```
$ perf stat -e instructions,cycles,L1-dcache-load-misses,L1-dcache-loads,LLC-load-misses,LLC-loads ./dot 100000000
a.b = 9.99999e+07

 Performance counter stats for './dot 100000000':

     8,564,368,466      instructions:u          # 1.41 insn per cycle          (49.98%)
     6,060,955,584      cycles:u                                               (66.65%)
        34,089,080      L1-dcache-load-misses:u # 0.90% of all L1-dcache hits  (83.34%)
     3,805,929,303      L1-dcache-loads:u                                      (83.32%)
           854,522      LLC-load-misses:u       # 39.87% of all LL-cache hits  (33.31%)
         2,143,437      LLC-loads:u                                            (33.31%)

       5.045450844 seconds time elapsed

       2.856660000 seconds user
       2.185739000 seconds sys
```

^ Hmm . . . a 40% LL cache miss rate, yet 1.4 instructions/cycle. This CPU-bound vs memory-bound distinction is a bit complicated . . .

^ Personally I don't regard CPU-bound vs memory-bound as an "actionable" way of thinking. We can turn a slow CPU-bound program into a fast memory-bound program just by not doing dumb stuff.

---

## Custom events: gotchas

These events are not stable across CPU architectures, nor even across `perf` versions!

The events expose the functionality of hardware counters; different hardware has different counters.

And someone needs to do the work of exposing them in `perf`!

---

```
$ perf list
  cycle_activity.stalls_l1d_pending
       [Execution stalls due to L1 data cache misses]
  cycle_activity.stalls_l2_pending
       [Execution stalls due to L2 cache misses]
  cycle_activity.stalls_ldm_pending
       [Execution stalls due to memory subsystem]
$ perf stat -e cycle_activity.stalls_ldm_pending,cycle_activity.stalls_l2_pending,cycle_activity.stalls_l1d_pending,cycles ./dot 100000000
a.b = 9.99999e+07

 Performance counter stats for './dot 100000000':

       509,998,525      cycle_activity.stalls_ldm_pending:u
       127,137,070      cycle_activity.stalls_l2_pending:u
        70,555,574      cycle_activity.stalls_l1d_pending:u
     5,708,220,052      cycles:u

       3.637099623 seconds time elapsed

       2.463966000 seconds user
       1.172459000 seconds sys
```

---

## Kinda painful typing these events: Use `-d` (`--detailed`)

```
$ perf stat -d ./dot 100000000

 Performance counter stats for './dot 100000000':

          1,945.17 msec task-clock:u            # 0.970 CPUs utilized
                 0      context-switches:u      # 0.000 K/sec
                 0      cpu-migrations:u        # 0.000 K/sec
           390,463      page-faults:u           # 0.201 M/sec
     3,329,516,701      cycles:u                # 1.712 GHz                    (49.97%)
     1,272,884,914      instructions:u          # 0.38 insn per cycle          (62.50%)
       150,445,759      branches:u              # 77.343 M/sec                 (62.55%)
            14,766      branch-misses:u         # 0.01% of all branches        (62.53%)
        76,672,490      L1-dcache-loads:u       # 39.417 M/sec                 (62.53%)
        51,315,841      L1-dcache-load-misses:u # 66.93% of all L1-dcache hits (62.52%)
         7,867,383      LLC-loads:u             # 4.045 M/sec                  (49.94%)
         7,618,746      LLC-load-misses:u       # 96.84% of all LL-cache hits  (49.96%)

       2.005801176 seconds time elapsed

       0.982545000 seconds user
       0.963534000 seconds sys
```

---

## `perf stat -d` output on Andes

```
[nthompson@andes-login1]~/performance_tuning_tutorial% perf stat -d ./dot 1000000000
a.b = 1e+09

 Performance counter stats for './dot 1000000000':

          2,242.43 msec task-clock:u              # 0.999 CPUs utilized
                 0      context-switches:u        # 0.000 K/sec
                 0      cpu-migrations:u          # 0.000 K/sec
             8,456      page-faults:u             # 0.004 M/sec
     2,972,264,893      cycles:u                  # 1.325 GHz                    (29.99%)
         1,366,982      stalled-cycles-frontend:u # 0.05% frontend cycles idle   (30.02%)
       747,429,126      stalled-cycles-backend:u  # 25.15% backend cycles idle   (30.07%)
     3,499,896,128      instructions:u            # 1.18 insn per cycle
                                                  # 0.21 stalled cycles per insn (30.06%)
       749,888,957      branches:u                # 334.410 M/sec                (30.02%)
             9,206      branch-misses:u           # 0.00% of all branches        (29.98%)
     1,108,395,106      L1-dcache-loads:u         # 494.284 M/sec                (29.97%)
        36,998,921      L1-dcache-load-misses:u   # 3.34% of all L1-dcache accesses (29.97%)
                 0      LLC-loads:u               # 0.000 K/sec                  (29.97%)
                 0      LLC-load-misses:u         # 0.00% of all LL-cache accesses (29.97%)

       2.244079417 seconds time elapsed

       1.000742000 seconds user
       1.214037000 seconds sys
```

^ Lots of backend cycles stalled in this one. This could be from high-latency operations like divisions, or from slow memory accesses.
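
---

## Are the divisions to blame?

You can isolate the division hypothesis directly. This is a sketch of mine, not repo code: the two loops do the same arithmetic, but the first pays a long-latency double divide per element. Compile without `-ffast-math` (which would do the reciprocal rewrite for you) and compare `stalled-cycles-backend`, or the `cycle_activity` counters above, across the two loops.

```cpp
#include <iostream>
#include <vector>

int main() {
    size_t n = 100000000;
    std::vector<double> a(n, 1.5);
    double s1 = 0, s2 = 0;
    for (size_t i = 0; i < n; ++i) {
        s1 += a[i] / 3.0;   // long latency: one double divide per element
    }
    const double third = 1.0/3.0;
    for (size_t i = 0; i < n; ++i) {
        s2 += a[i] * third; // same math; multiply latency is far lower
    }
    std::cout << s1 << " " << s2 << "\n";
}
```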

---

## `perf stat` on a different type of computation

```
[nthompson@andes-login1]~/performance_tuning_tutorial% perf stat -d git archive --format=tar.gz --prefix=HEAD/ HEAD > HEAD.tar.gz

 Performance counter stats for 'git archive --format=tar.gz --prefix=HEAD/ HEAD':

             99.81 msec task-clock:u              # 0.795 CPUs utilized
                 0      context-switches:u        # 0.000 K/sec
                 0      cpu-migrations:u          # 0.000 K/sec
             1,408      page-faults:u             # 0.014 M/sec
       276,165,489      cycles:u                  # 2.767 GHz                    (28.07%)
        72,227,873      stalled-cycles-frontend:u # 26.15% frontend cycles idle  (28.36%)
        60,614,109      stalled-cycles-backend:u  # 21.95% backend cycles idle   (29.37%)
       394,352,577      instructions:u            # 1.43 insn per cycle
                                                  # 0.18 stalled cycles per insn (30.58%)
        66,882,750      branches:u                # 670.113 M/sec                (31.95%)
         2,974,856      branch-misses:u           # 4.45% of all branches        (32.23%)
       183,326,327      L1-dcache-loads:u         # 1836.788 M/sec               (31.19%)
            49,245      L1-dcache-load-misses:u   # 0.03% of all L1-dcache accesses (30.05%)
                 0      LLC-loads:u               # 0.000 K/sec                  (29.37%)
                 0      LLC-load-misses:u         # 0.00% of all LL-cache accesses (28.84%)

       0.125489614 seconds time elapsed

       0.092736000 seconds user
       0.006905000 seconds sys
```

^ Compression has much higher instruction complexity than a dot product, and we see that reflected here in the stalled frontend cycles. We also have a much higher branch miss rate.

---

## `perf stat` is great for reporting . . .

But not super actionable.

---

## Get Actionable Data

```
$ perf record -g ./dot 100000000
a.b = 9.99999e+07
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.735 MB perf.data (5894 samples) ]
$ perf report -g -M intel
```

![inline](figures/perf_report_homescreen.png)

---

## Wait, what's actionable about this?

See how half the time is spent in the `std::vector` allocator?

That's a clue.

---

## Self and Children

- The `Self` column says how much time was spent within the function itself.
- The `Children` column says how much time was spent in the function *plus* the functions it calls.

- If a function's `Children` value is large but its `Self` value is small, the hotspot is in its callees, not in the function itself!

---

If `Self` and `Children` is confusing, just get rid of it:

```bash
$ perf report -g -M intel --no-children
```

---

## More intelligible `perf report`

```bash
$ perf report --no-children -s dso,sym,srcline
```

Best to put this in a `perf config`:

```
$ perf config --user report.children=false
$ cat ~/.perfconfig
[report]
	children = false
```

---

## Some other nice config options

```
$ perf config --user annotate.disassembler_style=intel
$ perf config --user report.percent-limit=0.1
```

---

## Disassembly

![inline](figures/dot_disassembly.png)
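
---

## Reading the hot loop

Before reaching for references, it helps to know the shape of what you're looking at. Below is the hot loop from `src/mwe.cpp`, with the sort of scalar x86_64 assembly an unoptimized or `-O2` build produces written as comments. This is a sketch: your compiler's exact registers and instruction choices will differ, but the shape, two loads, a multiply, an add, and loop bookkeeping, is what the `perf` annotation shows.

```cpp
// a arrives in rdi, b in rsi, n in rdx (System V ABI; see the detour below).
double dot_product(double* a, double* b, size_t n) {
    double d = 0;                    //   pxor   xmm0, xmm0         ; d = 0
    for (size_t i = 0; i < n; ++i) { //   xor    eax, eax           ; i = 0
        d += a[i]*b[i];              //   movsd  xmm1, [rdi+rax*8]  ; load a[i]
                                     //   mulsd  xmm1, [rsi+rax*8]  ; * b[i]
                                     //   addsd  xmm0, xmm1         ; d += ...
    }                                //   inc rax; cmp rax, rdx; jb <loop>
    return d;                        //   ret                       ; result in xmm0
}
```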

---

## What is happening?????

- If you don't know x86 assembly, I recommend Ray Seyfarth's [Introduction to 64 Bit Assembly Language Programming for Linux and OS X](http://rayseyfarth.com/asm/)

- If you need to look up instructions one at a time, Felix Cloutier's [x64 reference](https://www.felixcloutier.com/x86/) is a great resource.

- If you need to examine how compiler flags interact with generated assembly, try [godbolt](https://godbolt.org).

---

## Detour: System V ABI (Linux)

- Floating point arguments are passed in registers `xmm0`-`xmm7`.
- Integer arguments are passed in registers `rdi`, `rsi`, `rdx`, `rcx`, `r8`, and `r9`, in that order.
- A floating point return value is placed in register `xmm0`.
- Integer return values are placed in `rax`.

Knowing this makes your godbolt output a bit easier to read!

---

## The default assembly generated by gcc is braindead

See the [godbolt](https://godbolt.org/z/8qqhGj).

- Superfluous stack writes.
- No AVX instructions, no fused multiply-adds.

Consequence: lots of time spent moving data around.

---

## Sidetrack: Fused multiply-add

We'll use the fused multiply-add instruction as a "canonical" example of an instruction which we *want* generated, but which, due to history, chaos, and dysfunction, generally *isn't*.

---

## Sidetrack: Fused multiply-add

The fma is defined as

$$
\mathrm{fma}(a,b,c) := \mathrm{rnd}(a*b + c)
$$

i.e., the multiplication and addition are performed in a single instruction, with a single rounding.

^ I recently determined `gcc` wasn't generating fma's in our flagship product VTK-m. It's often said that it's meaningless to talk about performance of code compiled without optimizations. Implicit in this statement is another: It's incredibly difficult to convince the compiler to generate optimized assembly! (The Intel compiler is very good in this regard.)

---

## My preferred CPPFLAGS:

```
-g -O3 -ffast-math -fno-finite-math-only -march=native -fno-omit-frame-pointer
```

How does that look on [godbolt](https://godbolt.org/z/4dnfYb)?

Key instruction: `vfmadd132pd`; a vectorized fused multiply-add on `xmm`/`ymm` registers.

---

## Recompile with good flags

```
$ make
$ perf stat ./dot 100000000
a.b = 9.99999e+07

 Performance counter stats for './dot 100000000':

          2,428.06 msec task-clock:u       # 0.998 CPUs utilized
                 0      context-switches:u # 0.000 K/sec
                 0      cpu-migrations:u   # 0.000 K/sec
           390,994      page-faults:u      # 0.161 M/sec
     3,651,637,732      cycles:u           # 1.504 GHz
     1,676,766,309      instructions:u     # 0.46 insn per cycle
       225,636,250      branches:u         # 92.929 M/sec
             9,303      branch-misses:u    # 0.00% of all branches

       2.432163719 seconds time elapsed
```

1/3rd of the instructions/cycle, yet twice as fast, because it ran ~1/5th the number of instructions.

---

# Exercise

Look at the code of `src/mwe.cpp`. Is it really measuring a dot product? Look at it under `perf report`.

Fix it if not.

^ The performance of `src/mwe.cpp` is dominated by the cost of initializing data.
^ The data initialization converts integers to floats and does divisions. Removing these increases the performance.
^ Even once this is done, 40% of the time is spent in data allocation. This indicates a need for a more sophisticated approach.
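
---

## Exercise: one possible fix

A sketch of one way to act on those notes (the replacement initialization below is my arbitrary choice, not the repo's): keep the vectors, but make the setup loop cheap, so that `perf report` attributes the time to `dot_product` itself.

```cpp
// In src/mwe.cpp, replace the initialization loop. No int-to-double
// conversions and no divisions; the values are arbitrary nonzero doubles.
double x = 1.0;
for (size_t i = 0; i < n; ++i) {
    a[i] = x;
    b[i] = 3.0 - x;
    x += 1e-9;
}
```

The remaining allocation cost is harder to hide in a standalone binary; that's part of the motivation for the microbenchmark framework in Session 2.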

---

## Register width

- The 8008 architecture from 1972 had 8 bit registers, now vaguely resembling our current `al` register.

- 16 bit registers were added with the 8086 in 1978; these are now labelled `ax`.

- 32 bit registers arrived with the 80386 architecture in 1985; these are now prefixed with `e`, such as `eax`, `ebx`, and so on.

- 64 bit registers were added in 2003 for the `x86_64` architecture. They are prefixed with `r`, such as the `rax` and `rbx` registers.

---

## Register width

Compilers utilize the full width of integer registers without much fuss. The situation for floating point registers is much worse.

---

## Floating point register width

An `xmm` register is 128 bits wide, and can hold 2 doubles, or 4 floats.

AVX introduced the `ymm` registers, which are 256 bits wide, and can hold 4 doubles, or 8 floats. (AVX2 extended most integer instructions to this width.)

AVX-512 (2016) introduced the `zmm` registers, which can hold 8 doubles or 16 floats.

---

To determine whether your CPU has `ymm` registers, check for AVX instruction support:

```bash
$ lscpu | grep avx
Flags:
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush
dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon
pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64
monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb tpr_shadow vnmi
flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
```

or (on CentOS)

```bash
$ cat /proc/cpuinfo | grep avx
```

---

## Mind bogglement

I couldn't get `gcc` or `clang` to generate AVX-512 instructions, so I went looking for the story . . .

---

## Mind bogglement

> I hope AVX512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on. I hope Intel gets back to basics: gets their process working again, and concentrate more on regular code that isn't HPC or some other pointless special case.

-- [Linus Torvalds](https://www.realworldtech.com/forum/?threadid=193189&curpostid=193190)

---

## Vector instructions

Even if the CS people don't like AVX-512, it is still difficult to find the magical incantations required to generate AVX2 instructions.

It generally requires an `-march=native` compiler flag.

---

## Beautiful assembly:

![inline](figures/ymm_dot_product.png)

---

## Exercise

On *Andes*, what causes this error?

```
$ module load intel/19.0.3
$ icc -march=skylake-avx512 src/mwe.cpp
$ ./a.out 1000000
zsh: illegal hardware instruction (core dumped)
```

---

Compiler defaults are for *compatibility*, not for performance!
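
---

## Middle road: function multiversioning

If you must ship one binary to machines with different vector units, GCC's function multiversioning gives a middle road between `-march=native` speed and default-target compatibility. A sketch on our running example (the `target_clones` attribute is real GCC functionality; whether it pays off for your code is something to measure):

```cpp
#include <cstddef>

// GCC emits one clone per listed target plus a default, and an ifunc
// resolver picks the best clone for the host CPU at load time.
__attribute__((target_clones("avx2", "default")))
double dot_product(double* a, double* b, size_t n) {
    double d = 0;
    for (size_t i = 0; i < n; ++i) {
        d += a[i]*b[i];
    }
    return d;
}
```

Compile with `-O3` but no `-march` flag: the AVX2 clone still vectorizes, and the binary still runs on pre-AVX2 hardware.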

---

## `perf report` commands

- `k`: Show line numbers of source code
- `o`: Show instruction number
- `t`: Switch between percentage and samples
- `J`: Number of jump sources on target; number of places that can jump here.
- `s`: Hide/Show source code
- `h`: Show options

---

## perf gotchas

- perf sometimes attributes the time in a single instruction to the *next* instruction.

---

## perf gotchas

```
       │     if (absx < 1)
  7.76 │       ucomis xmm1,QWORD PTR [rbp-0x20]
  0.95 │     ↓ jbe    a6
  1.82 │       movsd  xmm0,QWORD PTR ds:0x46a198
  0.01 │       movsd  xmm1,QWORD PTR ds:0x46a1a0
  0.01 │       movsd  xmm2,QWORD PTR ds:0x46a100
```

Hmm, so moving data into `xmm1` and `xmm2` is 182x faster than moving data into `xmm0` . . .

Looks like a misattribution of the `jbe`.

---

> . . if you're trying to capture the IP on some PMC event, and there's a delay between the PMC overflow and capturing the IP, then the IP will point to the wrong address. This is skew. Another contributing problem is that micro-ops are processed in parallel and out-of-order, while the instruction pointer points to the resumption instruction, not the instruction that caused the event.

--[Brendan Gregg](http://www.brendangregg.com/perf.html)

---

## Two sensible goals

Reduce power consumption, and/or reduce runtime.

Not necessarily the same thing. Benchmark power consumption:

```
$ perf list | grep energy
  power/energy-cores/     [Kernel PMU event]
  power/energy-pkg/       [Kernel PMU event]
  power/energy-ram/       [Kernel PMU event]
$ perf stat -e energy-cores ./dot 100000000

 Performance counter stats for 'system wide':

              8.55 Joules power/energy-cores/
```

---

## Improving reproducibility

For small optimizations (< 2% gains), our perf data often gets swamped in noise.

```
$ perf stat -e uops_retired.all,instructions,cycles -r 5 ./dot 100000000

 Performance counter stats for './dot 100000000' (5 runs):

     1,817,358,542      uops_retired.all:u                      ( +- 0.00% )
     1,276,765,688      instructions:u     # 0.45 insn per cycle ( +- 0.00% )
     2,823,559,592      cycles:u                                ( +- 0.11% )

            2.1110 +- 0.0422 seconds time elapsed ( +- 2.00% )
```

---

# Improving reproducibility

Small optimizations are really important, but really hard to measure reliably.

See [Producing wrong data without doing anything obviously wrong!](https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf)

Link order, environment variables, [running in a new directory](https://youtu.be/koTf7u0v41o?t=1318), and the cache set of hot instructions can have a huge impact on performance!

---

## Improving reproducibility

Instruction count and uops are reproducible, but time and cycles are not.

Use instruction count and uops retired as imperfect metrics for small optimizations when variance in runtime will swamp improvements.

---

## Long tail `perf`

Attaching to a running process or MPI rank:

```bash
$ top # find rogue process
$ perf stat -d -p `pidof paraview`
^C
```

---

## Long tail `perf`

Sometimes, `perf` will gather *way* too much data, creating a huge `perf.data` file.

Solve this by reducing the sampling frequency:

```bash
$ perf record -F 10 ./dot 100000000
```

or by compressing (requires `perf` compiled with `zstd` support):

```bash
$ perf record -z ./dot 100000
```

---

## Exercise

Replace the computation of $$\mathbf{a}\cdot \mathbf{b}$$ with the computation of $$\left\|\mathbf{a}\right\|^2$$.

This halves the number of memory references/flop.

Is it observable under `perf stat`?

^ I see a meaningful reduction in L1 cache miss rate.

---

# Exercise

Parallelize the dot product using a framework of your choice.

How does it look under `perf`?

---

# Solution: OpenMP

```cpp
double dot_product(double* a, double* b, size_t n) {
    double d = 0;
    #pragma omp parallel for reduction(+:d)
    for (size_t i = 0; i < n; ++i) {
        d += a[i]*b[i];
    }
    return d;
}
```

---

# Solution: C++17

```cpp
double d = std::transform_reduce(std::execution::par_unseq,
                                 a.begin(), a.end(), b.begin(), 0.0);
```

(FYI: I had to do a [source build](https://github.com/oneapi-src/oneTBB/) of TBB to get this to work.)

---

## Parallel `perf` lessons

`perf` is great at finding *hotspots*, not so great at finding coldspots.

[hotspot](https://github.com/KDAB/hotspot), discussed later, will overcome this problem.

---

Break?

---

## Session 2 Goals

- Learn about google/benchmark
- Profile entire workflows and generate flamegraphs and timecharts

---

## Challenges we need to overcome

- Our MWE spent fully half its time initializing data. That's not very interesting.
- We could only specify one vector length at a time. What if we'd written a performance bug that induced quadratic scaling? (A classic example follows.)
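
---

## The classic quadratic-scaling bug

Here's the shape of the bug, as a sketch (hypothetical code, not from this repo): the loop looks like one linear pass, but `erase` at the front shifts the whole tail on every call.

```cpp
#include <vector>

// Looks like one pass over v; actually O(n^2), since each
// erase(begin()) moves every remaining element left by one.
void drain(std::vector<double>& v) {
    while (!v.empty()) {
        // ... use v.front() ...
        v.erase(v.begin());
    }
}
```

The `Complexity()` machinery demonstrated next makes this kind of bug jump out: the fitted curve comes back $$\mathcal{O}(N^2)$$ when you expected $$\mathcal{O}(N)$$.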

---

## A [google/benchmark](https://github.com/google/benchmark/) [example](https://github.com/boostorg/math/blob/develop/reporting/performance/chebyshev_clenshaw.cpp):

```bash
$ ./reporting/performance/chebyshev_clenshaw.x --benchmark_filter=^ChebyshevClenshaw
2020-10-16T15:36:34-04:00
Running ./reporting/performance/chebyshev_clenshaw.x
Run on (16 X 2300 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 2.49, 2.29, 2.09
----------------------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
----------------------------------------------------------------------------
ChebyshevClenshaw/2       0.966 ns        0.965 ns    637018028
ChebyshevClenshaw/4        1.69 ns         1.69 ns    413440355
ChebyshevClenshaw/8        4.26 ns         4.25 ns    161924589
ChebyshevClenshaw/16       13.3 ns         13.3 ns     52107759
ChebyshevClenshaw/32       39.4 ns         39.4 ns     17071255
ChebyshevClenshaw/64        108 ns          108 ns      6438439
ChebyshevClenshaw/128       246 ns          245 ns      2852707
ChebyshevClenshaw/256       522 ns          521 ns      1316359
ChebyshevClenshaw/512      1100 ns         1100 ns       640076
ChebyshevClenshaw/1024     2180 ns         2179 ns       311353
ChebyshevClenshaw/2048     4499 ns         4496 ns       152754
ChebyshevClenshaw/4096     9086 ns         9081 ns        79369
ChebyshevClenshaw_BigO     2.27 N          2.26 N
ChebyshevClenshaw_RMS         4 %             4 %
```

---

## Goals for google/benchmark

- Empirically determine asymptotic complexity; is it $$\mathcal{O}(N)$$, $$\mathcal{O}(N^2)$$, or $$\mathcal{O}(\log(N))$$?
- Test inputs of different lengths
- Test different types (`float`, `double`, `long double`)
- Dominate the runtime with interesting and relevant operations so our `perf` traces are more meaningful.

---

## Installation

- Grab a [release tarball](https://github.com/google/benchmark/releases)
- `pip install google-benchmark`
- `brew install google-benchmark`
- `spack install benchmark`

---

## Installation

Source build:

```bash
$ git clone https://github.com/google/benchmark.git
$ cd benchmark && mkdir build && cd build
build$ cmake -DCMAKE_BUILD_TYPE=Release -DBENCHMARK_ENABLE_TESTING=OFF ../ -G Ninja
build$ ninja
build$ sudo ninja install
```

---

## Example: `benchmarks/bench.cpp`

```cpp
#include <random>
#include <vector>
#include <benchmark/benchmark.h>

template<typename Real>
void DotProduct(benchmark::State& state) {
    std::vector<Real> a(state.range(0));
    std::vector<Real> b(state.range(0));
    std::random_device rd;
    std::uniform_real_distribution<Real> unif(-1,1);
    for (size_t i = 0; i < a.size(); ++i) {
        a[i] = unif(rd);
        b[i] = unif(rd);
    }

    for (auto _ : state) {
        benchmark::DoNotOptimize(dot_product(a.data(), b.data(), a.size()));
    }
    state.SetComplexityN(state.range(0));
}

BENCHMARK_TEMPLATE(DotProduct, float)->RangeMultiplier(2)->Range(1<<3, 1<<18)->Complexity();
BENCHMARK_TEMPLATE(DotProduct, double)->DenseRange(8, 1024*1024, 512)->Complexity();
BENCHMARK_TEMPLATE(DotProduct, long double)->RangeMultiplier(2)->Range(1<<3, 1<<18)->Complexity(benchmark::oN);

BENCHMARK_MAIN();
```

---

Instantiate a benchmark on type float:

```cpp
BENCHMARK_TEMPLATE(DotProduct, float);
```

Test on vectors of length 8, 16, 32, ..., 262144:

```
->RangeMultiplier(2)->Range(1<<3, 1<<18)
```

Regress the performance data against $$\mathcal{O}(\log(n)), \mathcal{O}(n), \mathcal{O}(n^2), \mathcal{O}(n^3)$$:

```
->Complexity();
```

---

Force regression against $$\mathcal{O}(n)$$:

```
->Complexity(benchmark::oN);
```

Repeat the calculation until confidence in the runtime is obtained:

```cpp
for (auto _ : state) { ... }
```

Make sure the compiler doesn't elide these instructions:

```cpp
benchmark::DoNotOptimize(dot_product(a.data(), b.data(), a.size()));
```

---

## google/benchmark party tricks: Visualize complexity

Set a counter to the length of the vector:

```
state.counters["n"] = state.range(0);
```

Then get the output as CSV:

```
benchmarks$ ./dot_bench --benchmark_format=csv
```

Finally, copy-paste the console output into [scatterplot.online](https://scatterplot.online/)

---

![inline](figures/benchmark_linear_complexity.png)

---

## `SetBytesProcessed`

We can attack the memory-bound vs CPU-bound question via `SetBytesProcessed`.
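The call itself isn't shown in the excerpt above, so here's a sketch of how the counter might be set inside `DotProduct`; the factor of 2 counts both input arrays:

```cpp
// Inside DotProduct, after the for (auto _ : state) loop:
// total bytes = iterations x (two arrays x n elements x sizeof(Real)).
state.SetBytesProcessed(int64_t(state.iterations()) *
                        int64_t(state.range(0)) * 2 * sizeof(Real));
```

With the counter set, google/benchmark reports a `bytes_per_second` column: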
1095 | 1096 | ``` 1097 | ./dot_bench --benchmark_filter=DotProduct\/64 0.004 us 0.004 us 155850953 bytes_per_second=212.277G/s n=64 1111 | DotProduct/128 0.010 us 0.010 us 73113102 bytes_per_second=200.232G/s n=128 1112 | DotProduct/256 0.015 us 0.015 us 45589300 bytes_per_second=247.706G/s n=256 1113 | DotProduct/512 0.029 us 0.029 us 24430471 bytes_per_second=266.21G/s n=512 1114 | DotProduct/1024 0.056 us 0.056 us 12490510 bytes_per_second=273.686G/s n=1024 1115 | DotProduct/2048 0.158 us 0.158 us 4413687 bytes_per_second=193.436G/s n=2.048k 1116 | DotProduct/4096 0.676 us 0.676 us 1035341 bytes_per_second=90.2688G/s n=4.096k 1117 | DotProduct/8192 1.33 us 1.33 us 520428 bytes_per_second=91.5784G/s n=8.192k 1118 | DotProduct/16384 2.71 us 2.71 us 258728 bytes_per_second=89.9407G/s n=16.384k 1119 | DotProduct/32768 5.51 us 5.51 us 127636 bytes_per_second=88.5911G/s n=32.768k 1120 | DotProduct/65536 19.9 us 19.9 us 35225 bytes_per_second=49.1777G/s n=65.536k 1121 | DotProduct/131072 77.7 us 77.7 us 9013 bytes_per_second=25.141G/s n=131.072k 1122 | DotProduct/262144 157 us 157 us 4458 bytes_per_second=24.8915G/s n=262.144k 1123 | DotProduct/524288 330 us 330 us 2129 bytes_per_second=23.6636G/s n=524.288k 1124 | DotProduct/1048576 812 us 812 us 835 bytes_per_second=19.2495G/s n=1048.58k 1125 | ``` 1126 | 1127 | --- 1128 | 1129 | ## Is this good or not? 1130 | 1131 | ```bash 1132 | $ sudo lshw -class memory 1133 | *-memory:0 1134 | description: System Memory 1135 | physical id: 3d 1136 | slot: System board or motherboard 1137 | *-bank:0 1138 | description: DIMM DDR4 Synchronous 2666 MHz (0.4 ns) 1139 | physical id: 0 1140 | serial: #@ 1141 | slot: CPU1_DIMM_A0 1142 | size: 8GiB 1143 | width: 64 bits 1144 | clock: 2666MHz (0.4ns) 1145 | ``` 1146 | 1147 | So our RAM can transfer 8bytes at 2.666Ghz--19.2GB/second. 1148 | 1149 | --- 1150 | 1151 | ## Exercise 1152 | 1153 | Determine the size of the lowest level cache on your machine. 1154 | 1155 | Can you empirically observe cache effects? 1156 | 1157 | Hint: Use the `DenseRange` option. 1158 | 1159 | --- 1160 | 1161 | ## Long tail `google/benchmark` 1162 | 1163 | If you have root, you can decrease run-to-run variance via 1164 | 1165 | ``` 1166 | $ sudo cpupower frequency-set --governor performance 1167 | ``` 1168 | 1169 | --- 1170 | 1171 | # perf + google/benchmark 1172 | 1173 | ``` 1174 | $ perf record -g ./dot_bench --benchmark_filter=DotProduct\ The number of iterations to run is determined dynamically by running the benchmark a few times and measuring the time taken and ensuring that the ultimate result will be statistically stable. 1199 | --[Google benchmark docs](https://github.com/google/benchmark) 1200 | 1201 | --- 1202 | 1203 | ## Exercise 1204 | 1205 | Profile a squared norm using google/benchmark. 1206 | 1207 | Compute it in both `float` and `double` precision, determine asymptotic complexity, and the number of bytes/second you are able to process. 1208 | 1209 | --- 1210 | 1211 | ## Exercise 1212 | 1213 | Compare interpolation search to binary search use `perf` and `googlebenchmark`. 1214 | 1215 | --- 1216 | 1217 | ## Break? 1218 | 1219 | --- 1220 | 1221 | ## Session 3 1222 | 1223 | Flamegraphs 1224 | 1225 | --- 1226 | 1227 | # _What is google/benchmark not good for?_ 1228 | 1229 | Profiling workflows. It's a *microbenchmark* library. 1230 | 1231 | But huge problems can often arise integrating even well-designed and performant functions. 1232 | 1233 | What to do? 

---

## [Flamegraph](https://gitlab.kitware.com/vtk/vtk-m/-/issues/499) of VTK-m graphics pipeline

![inline](figures/read_portal.svg)

---

Flamegraphs present *sorted*, unique stack frames, with each frame's width drawn proportional to its samples divided by the total samples.

Sorting the stack frames means the x-axis is not a time axis! Great for multithreaded code. The x-axis is sorted alphabetically.

The y-axis is the call stack depth.

See the [paper](https://queue.acm.org/detail.cfm?id=2927301).

---

## Flamegraphs

```
$ git clone https://github.com/brendangregg/FlameGraph.git
```

---

## Flamegraph MWE

In a directory with a `perf.data` file, run

```
$ perf script | ~/FlameGraph/stackcollapse-perf.pl | ~/FlameGraph/flamegraph.pl > flame.svg
$ firefox flame.svg
```

I find this hard to remember, so I have an alias:

```
$ alias | grep flame
flamegraph='perf script | ~/FlameGraph/stackcollapse-perf.pl | ~/FlameGraph/flamegraph.pl > flame.svg'
```

---

## Viewing flamegraphs

Firefox is best, but there's no Firefox on Andes. Try ImageMagick:

```
$ ssh -X `whoami`@andes.olcf.ornl.gov
$ module load imagemagick/7.0.8-7-py3
$ magick display flame.svg
```

---

## Flamegraph example: VTK-m Volume Rendering

```
$ git clone https://gitlab.kitware.com/vtk/vtk-m.git
$ cd vtk-m && mkdir build && cd build
$ cmake ../ \
    -DCMAKE_CXX_FLAGS="${CMAKE_CXX_FLAGS} -march=native -fno-omit-frame-pointer -Wfatal-errors -ffast-math -fno-finite-math-only -O3 -g" \
    -DVTKm_ENABLE_EXAMPLES=ON -DVTKm_ENABLE_OPENMP=ON -DVTKm_ENABLE_TESTING=OFF -G Ninja
$ ninja
$ perf stat -d ./examples/demo/Demo
$ perf record -g ./examples/demo/Demo
```

Note: If you have a huge program you'd like to profile, compile it now and follow along!

---

# Step by step: `perf script`

Dumps all recorded stack traces:

```
$ perf script
perf 20820 510465.112358:          1 cycles:
        ffffffff9f277a8a native_write_msr+0xa ([kernel.kallsyms])
        ffffffff9f20d7ed __intel_pmu_enable_all.constprop.31+0x4d ([kernel.kallsyms])
        ffffffff9f20dc29 intel_tfa_pmu_enable_all+0x39 ([kernel.kallsyms])
        ffffffff9f207aec x86_pmu_enable+0x11c ([kernel.kallsyms])
        ffffffff9f40ac26 ctx_resched+0x96 ([kernel.kallsyms])
        ffffffff9f415562 perf_event_exec+0x182 ([kernel.kallsyms])
        ffffffff9f4e65e2 setup_new_exec+0xc2 ([kernel.kallsyms])
        ffffffff9f55a9ff load_elf_binary+0x3af ([kernel.kallsyms])
        ffffffff9f4e4441 search_binary_handler+0x91 ([kernel.kallsyms])
        ffffffff9f4e5696 __do_execve_file.isra.39+0x6f6 ([kernel.kallsyms])
        ffffffff9f4e5a49 __x64_sys_execve+0x39 ([kernel.kallsyms])
        ffffffff9f204417 do_syscall_64+0x57 ([kernel.kallsyms])
```

---

## Step by step: `stackcollapse-perf.pl`

Merges duplicate stack samples and sorts them alphabetically:

```
$ perf script | ~/FlameGraph/stackcollapse-perf.pl > out.folded
$ cat out.folded | more
Demo;[libgomp.so.1.0.0] 1856
Demo;[unknown];[libgomp.so.1.0.0];vtkm::cont::DeviceAdapterAlgorithm::ScheduleTask 8
```

---

## Generate the flamegraph

```
$ perf script | ~/FlameGraph/stackcollapse-perf.pl > out.folded
$ ~/FlameGraph/flamegraph.pl out.folded --title="VTK-m rendering and isocontouring" > flame.svg
```

---

## Convert sorted/unique stack frames into a pretty picture

```
$ cat out.folded | ~/FlameGraph/flamegraph.pl
(writes the finished flamegraph as SVG/XML to stdout)
```