├── apr19.md
└── profiling_notes.md

/apr19.md:
--------------------------------------------------------------------------------
# Benchmarking the OCaml compiler: what we have learnt and suggestions for the future

_Tom Kelly ([ctk21@cl.cam.ac.uk](mailto:ctk21@cl.cam.ac.uk))_,
_Stephen Dolan_,
_Sadiq Jaffer_,
_KC Sivaramakrishnan_,
_Anil Madhavapeddy_

Our goal is to provide statistically meaningful benchmarking of the OCaml compiler
to aid our multicore upstreaming efforts through 2019. In particular we wanted an
infrastructure that operated continuously on all compiler commits and could handle
multiple compiler variants (e.g. flambda). This allows developers to be sure that no
performance regressions slip through and also guides efforts as to where to focus when
introducing new features like multicore, which can have a non-trivial impact on
performance as well as introduce additional non-determinism into measurements due
to parallelism.

This document describes what we did to build continuous benchmarking
websites[1](#ref1) that take a controlled-experiment approach to running the
operf-micro and sandmark benchmarking suites against tracked git branches of
the OCaml compiler.

Our high-level aims were:

* Try to reuse OCaml benchmarks that already exist and cover a range of micro and
  macro use cases.
* Try to incorporate best practice and reuse tools where appropriate by
  looking at what other compiler development communities are doing.
* Provide visualizations that are easy to use for OCaml developers, making it
  easier and less time consuming to merge complex features like multicore
  without performance regressions.
* Document the steps to set up the system, including its experimental
  environment for benchmarking, so that the knowledge is captured for others
  to reuse.

This document is structured as follows: we survey the OCaml benchmark packages
available; we survey the infrastructure other communities are using to catch
performance regressions in their compilers; we highlight existing continuous
benchmarks of the OCaml compiler; we describe the system we have built, from
selecting the code and running it in a controlled environment through to how the
results are displayed; we summarise what we have learnt; finally we discuss
directions for future work that could follow.

## Survey of OCaml benchmarks

### Operf-micro[2](#ref2)

Operf-micro collates a collection of micro-benchmarks originally put together
to help with the development of flambda. This tool compiles a micro-benchmark
function inside a harness. The harness runs the tested function in a loop to
generate samples _x(n)_ where _n_ is the number of loop iterations. The harness
will collate data for multiple iteration lengths _n_, which provides coverage of a
range of garbage collector behaviour. The number of samples _(n, x(n))_ is
determined by a runtime quota target. Once this _(n, x(n))_ data is collected, it
is post-processed using regression to provide the execution time of a single
iteration of the function.

The tool is designed to have minimal external dependencies and can be run
against a test compiler binary without additional packages.
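
To make the regression step described above concrete, here is a minimal OCaml sketch (our own illustration, not operf-micro's actual implementation): it estimates the per-iteration cost as the slope of an ordinary least-squares fit over the _(n, x(n))_ samples.

```ocaml
(* Minimal sketch, not operf-micro's code: given samples (n, x n), where
   x n is the measured time for n iterations, estimate the per-iteration
   cost as the slope of an ordinary least-squares fit.  The intercept
   absorbs any fixed harness overhead. *)
let per_iteration_time (samples : (float * float) list) =
  let len = float_of_int (List.length samples) in
  let sx = List.fold_left (fun acc (n, _) -> acc +. n) 0. samples in
  let sy = List.fold_left (fun acc (_, x) -> acc +. x) 0. samples in
  let sxx = List.fold_left (fun acc (n, _) -> acc +. (n *. n)) 0. samples in
  let sxy = List.fold_left (fun acc (n, x) -> acc +. (n *. x)) 0. samples in
  (len *. sxy -. sx *. sy) /. (len *. sxx -. sx *. sx)
```

A real harness also has to contend with outliers, for example from GC pauses or context switches, which this sketch ignores.
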
The experimental
method of running for multiple embedded iterations is also used by Jane
Street's `core_bench`[3](#ref3) and Haskell's criterion[4](#ref4).

### operf-macro[5](#ref5)

Operf-macro provides a framework to define and run macro-benchmarks. The
benchmarks themselves are opam packages and the compiler versions are opam
switches. Opam is used to handle the dependencies of larger benchmarks
and the maintenance of the macro-benchmark code base. Benchmark
descriptions are split across metadata stored in a customised opam-repository
that overlays the benchmarks onto existing codebases.

### Sandmark[6](#ref6)

Sandmark is a tool we have developed that takes a similar approach to operf-macro, but:
- utilizes opam v2 and pins the opam repository to an internally defined one, so that all the benchmarked
  code is fixed and not dependent on the global opam-repository
- does not have a ppx dependency for the tested packages, making it easier to work on
  development versions of the compiler
- has the benchmark descriptions in a single `dune` file in the sandmark repository,
  to make it easy to see what is being measured in one place

For the purposes of our benchmarking we decided to run both the operf-micro and
sandmark packages.

#### Single-threaded tests in Sandmark

Included in Sandmark are a range of tests ported from operf-macro that run on,
or could easily be modified to run on, multicore. In addition to these, a new
set of performance tests has been added that aims to expand coverage over
compiler and runtime areas that differ in implementation between vanilla and
multicore. These tests are designed to run on both vanilla and multicore and
should highlight changes in single-threaded performance between the two
implementations. The main areas we aimed to cover were:

* Stack overflow checks in function prologues
* Costs of switching stacks when making external calls (alloc/noalloc/various numbers of parameters)
* Lazy implementation changes
* Weak pointer implementation changes
* Finalizer implementation changes
* General GC performance under various different types of allocation behaviour
* Remembered set overhead implementation changes

Also included in Sandmark are some of the best single-threaded OCaml entries
for the Benchmarks Game; these are already well optimised and should serve to
highlight performance differences between vanilla and multicore on
single-threaded code.

#### Multicore-specific tests in Sandmark

In addition to the single-threaded tests in Sandmark, there are also
multicore-specific tests. They come in two flavours. First are the ones
intended to highlight performance changes between vanilla and multicore. These
include benchmarks that perform lots of external calls and callbacks (the multicore
runtime manages stacks), stress tests for lazy (different lazy layout in
multicore compared to vanilla), weak arrays and ephemerons (different layouts
and GC algorithms), and finalisers (different finaliser structures).

Secondly, there are multicore-only tests. These consist so far of various simple
lock-free data structure tests that stress the multicore GC in different ways,
e.g. some force many GC promotions to the major heap while others pre-allocate.
The intention is to expand the set of multicore-specific tests to include larger
benchmarks as well as tests that compare existing approaches to parallelism on
vanilla OCaml with reimplementations on multicore.

## How other compiler communities handle continuous benchmarking

### Python (CPython and PyPy)

The Python community has continuous benchmarking for both its
CPython[7](#ref7) and PyPy[8](#ref8) runtimes. The
benchmarking data is collected by running the Python Performance Benchmark
Suite[9](#ref9). The open-source web application
Codespeed[10](#ref10) provides a front end to navigate and visualize
the results. Codespeed is written in Python on Django and provides views into
the results via a revision table, a timeline by benchmark and a comparison over
all benchmarks between tagged versions. Codespeed has been picked up by other
projects as a way to quickly provide visualizations of performance data across
code revisions.

### LLVM

LLVM has a collection of micro-benchmarks in its C/C++ compiler test suite.
These micro-benchmarks are built on the google-benchmark library and produce
statistics that can be easily fed downstream. They also support external tests
(for example SPEC CPU 2006). LLVM has performance tracking software called LNT
which drives a continuous monitoring site[11](#ref11). While the LNT
software is packaged for people to use in other projects, we could not find
another project using it for visualizing performance data, and at first glance
it did not look easy to reuse.

### GHC

The Haskell community have performance regression tests with hardcoded
values which trip a continuous-integration failure. This method has proved
painful for them[12](#ref12) and they have been looking to change it
to a more data-driven approach[13](#ref13). At this time they did
not seem to have infrastructure running to help them.

### Rust

Rust have built their own tools to collect benchmark data and present it in a
web app[14](#ref14). This tool has some interesting features:

* It measures both compile-time and runtime performance.
* They commit the data from their experiments into a GitHub repo to
  make the data available to others.

## Relation to existing OCaml compiler benchmarking efforts

### OCamlPro Flambda benchmarking

OCamlPro put together operf-micro, operf-macro and the
[http://bench.flambda.ocamlpro.com/](http://bench.flambda.ocamlpro.com/) site
to provide a benchmarking environment for flambda. Our work builds on these
tools and is inspired by them.

### Initial OCaml multicore benchmarking site

OCaml Labs put together an initial hosted multicore benchmarking site
[http://ocamllabs.io/multicore](http://ocamllabs.io/multicore). This built on
the OCamlPro flambda site by (i) implementing visualization in a single
JavaScript library; (ii) upgrading to opam v2; (iii) updating the
macro-benchmarks to more recent versions.[15](#ref15) The work
presented here incorporates the experience of building that site and builds on
it.

## What we put together

We decided to use operf-micro and sandmark as our benchmarks.
We went with
Codespeed[10](#ref10) as a visualization tool since it looked easy
to set up and had instances running in multiple projects. We also wanted to try
an open-source tool where there is already a community in place.

The system runs a pipeline: it determines that a commit on a branch timeline it
is tracking is of interest, builds the compiler for that commit hash, and then
runs either operf-micro or sandmark on a clean experimental CPU. The data is
uploaded into Codespeed to allow users to view and explore it.

### Transforming git commits to a timeline

Mapping a git branch to a linear timeline requires a bit of care. We use the
following mapping:
- we take a branch tag and ask for its commits using the git first-parent
  option[16](#ref16) (i.e. `git log --first-parent`)
- we use the commit date from this as the timestamp for a commit.

In most repositories this will give a linear code order that makes sense, even
though the development process has been decentralised. It works well for
GitHub-style pull-request development, and has been verified to work with the
`ocaml/ocaml` git development workflow.

### Experimental setup

We wanted to remove sources of noise in the performance measurements
where possible. We configured our x86 Linux machines as follows:[17](#ref17)

* **Hyperthreading**: Hyperthreading was configured off in the machine BIOS
  to avoid cross-talk and resource sharing on a CPU.
* **Turbo boost**: Turbo boost was disabled on all CPUs to ensure that the
  processor did not enter turbo states, which could be throttled by external factors.
* **Pstate configuration**: We set the power state explicitly to performance
  rather than powersave.
* **Linux CPU isolation**: We isolated several cores on our machine using the
  Linux isolcpus kernel parameter. This allowed us to be sure that only the
  benchmarking process was scheduled on a core and that core did not schedule
  other processes on the system.
* **Interrupts**: We shifted all operating system interrupt handling away
  from the isolated CPUs we used for benchmarking.
* **ASLR (address space layout randomisation)**: We switched this off on our
  benchmarking runs to make experiments repeatable.

With this configuration, we found that we could run benchmarks such that they
were very likely to give the same result when rerun at another time. Note that
without making the changes above there was significantly more variance in the
test results, so it is important to apply this control.

### Visualization with Codespeed

Once the experimental results have been collected, we upload them into the
Codespeed visualization tool, which presents a web interface for browsing the
results. This tool presents three ways to interact with the data:

* **Changes**: This view allows you to browse the raw results across all
  benchmarks collected for a particular commit, build variant and machine
  environment.
* **Timeline**: This view allows you to look at a given benchmark through
  time on a given machine environment. You can compare multiple build
  variants on the same graph. It also has a grid view that gives a panel
  display of timelines for all benchmarks.
  It can be a good way to spot
  performance regressions and to see whether a regression is persistent.
* **Comparison**: This view allows you to compare tagged versions of the code
  to each other (including the latest commit of a branch) across all
  benchmarks. This can be a good compact view to ask questions like: “on which
  benchmarks is 4.08 faster or slower than 4.07.1?”

The tool provides cross-linking between commits and GitHub, which makes it quick
to drill into the specific code diff of a commit.

## What we learnt

### Very useful tool to discover performance regressions with multicore

Benchmarking over many compiler commits, combined with the visualization tool, has been
very useful in enabling us to determine which test cases have performance
regressions with multicore. Because the experimental environment is clean, the
results we see are much more repeatable when we investigate things that
look odd. The timeline view is also able to give clues as to which area of the
code changed to cause a performance regression. When performance regressions
are fixed, the benchmarks are automatically rerun, which provides a ratchet to
confirm that you are making things better without additional manual overhead.

### Getting a clean experimental setup is possible but takes some care

We found that getting our machine environment into a state where we could rerun
benchmarks and get repeatable performance numbers was possible, but it takes some
care. Often we only discovered we had a problem by looking at large numbers of
benchmarks and diving into commits that caused performance swings we didn’t
understand (for example changes to documentation that altered performance).
Hopefully, by collecting the configuration information together in a single
place, other people can quickly set up clean environments for their own
benchmarking[17](#ref17). Having a clean environment allowed us to really probe the
microstructure effects with x86 performance counters in a rigorous and
iterative way.

### Address space layout randomization (ASLR)

We found this issue while investigating some benchmark instability that showed up as
timeline changes on commits that did not touch compiler code. We tracked
the ~10-15% swings down to branch misprediction in some
operf-micro format benchmarks that were sensitive to how the address space was
laid out with randomization. We switched off ASLR so that our benchmarks would
be repeatable and an identical binary between commits would be very likely to
give the same result on a single run. User binaries in the wild will likely run
with ASLR on, but we have configured our environment to make benchmarks
repeatable without the need to take the average of many samples. It remains an
open area to benchmark in reasonable time, and allow easy developer interaction,
while investigating performance in the presence of ASLR[18](#ref18).

### Code layout can matter in surprising ways

It is well known that code layout can matter when evaluating the performance of
a binary on a modern processor. Modern x86 processors are often deeply
pipelined and can issue multiple decoded uops per cycle.
In order to maintain
good performance, it is critical that there is a ready stream of decoded uops
available to be executed on the back end of the processor. There are many ways
a uop could be unavailable.

A well-known reason for a uop to be unavailable is that the processor is
waiting for it to be fetched from the memory hierarchy. Within the operf-micro
benchmarking suite on the hardware we used, we didn’t see any performance
instability coming from CPU interactions with the instruction or data cache or
main memory. The instruction path memory bottleneck may still be worth
optimizing for, and we expect that changes which do optimize this area will be
seen in the sandmark macro-benchmarks[19](#ref19). There is a collection of
approaches being used to improve code layout for instruction memory latency
using profile-guided layout[20](#ref20).

We were surprised by the size of the performance impact that code layout can
have due to how instructions pass through the front end of an x86 processor on
their way to being issued[21](#ref21). On x86, the alignment of blocks of instructions and
how decoded uops are packed into decode stream buffers (DSB) can impact the
performance of smaller hot loops in code. Examples of some of the mechanical
effects are:

* DSB throughput and alignment
* DSB thrashing and alignment
* Branch-predictor effects and alignment

At a high level this can be thought of as taking blocks of x86 code and mapping
them into a scarce uop cache. If the blocks of x86 code in the hot path are
unfortunately aligned or have high jump and/or branch complexity, then this can
make the decoder and uop cache a bottleneck in your code. If you want to learn
more about the causes of performance swings due to hot code placement in IA,
there is a good LLVM dev talk given by Zia Ansari (Intel):
[https://www.youtube.com/watch?v=IX16gcX4vDQ](https://www.youtube.com/watch?v=IX16gcX4vDQ).

We came across some of these hot-code placement effects in our OCaml
micro-benchmarks when diving into performance swings that we didn’t understand
by looking at the commit diff. We have an example that can be downloaded[22](#ref22)
which alters the alignment of a hot loop in an OCaml micro-benchmark (without
changing the individual instructions in the hot loop) and leads to a ~15% range
in observed performance on our Xeon E5-2430L machine. From a compiler writer’s
perspective this can be a difficult area; we don’t have any simple solutions,
and other compiler communities also struggle with how to tame these
effects[23](#ref23).

It is important to realise that a code layout change can lead to changes in the
performance of hot loops. The best way to proceed when you see an unexpected
performance swing on x86 is to look at performance counters around the DSB and
branch predictor. We hope to add selected performance counter data to the
benchmarks we publish in the future.

### Logistical issues when scaling up

To run our experiments in a controlled fashion, we need to schedule
benchmarking runs so that they do not get impacted by external factors.
Right
now we have been managing with some scripts and manual placement, but we are
quickly approaching the point where we need some distributed job
scheduling and management to assist us. This should allow us to handle multiple
architectures, operating systems and the benchmarking of more variants of the
compiler. We have yet to decide what software to use for task scheduling and
management.

## Next Steps (as of April 2019)

### Managing a heterogeneous cluster of machines, operating systems and benchmarks in a low-overhead way

We want to support a large number of machine architectures, operating systems
and benchmarking suites. To do this we need to control a very large number of
concurrent benchmarking experiments where process scheduling is tightly
controlled and error management is structured. A core component needed will be a
distributed job scheduler. On top of the scheduler, we need a system to handle
iterative result generation, backfilling results for new benchmarks and
regeneration of additional data points on demand. By using a scheduler it
should be possible to get better resource utilization than under manual block
scheduling.

### Publishing data for downstream tools

We would like to provide a mechanism for publishing the raw data we collect
from running benchmarking experiments. We can see two possible ideas here: i)
similar to Rust, we could commit the data to a GitHub-hosted repo so that it
can be served using the git toolchain; ii) we could provide a GraphQL-style
server interface, which could also be a component of front-end web-app
visualization tools. This data format could be closely coupled with what
benchmarking suites publish as the result of running benchmark code.

### Modernise the front-end visualization and incorporate multi-dimensional data navigation

We have successfully used Codespeed to visualize our data. This codebase is
based on old JavaScript libraries and has internal mechanisms for transporting
data. It is not a trivial application to extend or adapt. A new front end based
on modern web development technologies which also handles multi-dimensional
benchmark output easily would be useful.

### Ability to sandbox the benchmarks while containing noise for pull-request testing

It would be useful for developer workflow to have the ability to benchmark
pull-request branches. There are challenges to doing this while keeping the
experiments controlled. One approach would be to set up a sandboxed environment
that is as close to the bare metal as possible but maintains the low-noise
environment we have from using low-level Linux options. Resource usage control
and general scheduling would also need to be considered in such a multi-user
system.

### Publish performance counter and GC data coming from benchmarks

We would like to publish multi-dimensional data from the benchmarks and have a
structured way to interpret it. There is a bit of work to standardise the
data we get out of the benchmark suites, and the collection of
performance counters would also need to be integrated. Once the data is collected, there
would be a little work to present visualizations of the data.
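
As an illustration of the kind of structured GC data a benchmark could emit, here is a minimal OCaml sketch; the choice of fields and the key=value output format are our own assumptions for illustration, not an existing sandmark interface:

```ocaml
(* Minimal sketch: emit a few GC counters at the end of a benchmark run in a
   line-oriented key=value form that downstream tools could parse.  The field
   selection and format are illustrative only. *)
let emit_gc_stats oc =
  let s = Gc.quick_stat () in
  Printf.fprintf oc "minor_words=%.0f\n" s.Gc.minor_words;
  Printf.fprintf oc "promoted_words=%.0f\n" s.Gc.promoted_words;
  Printf.fprintf oc "major_words=%.0f\n" s.Gc.major_words;
  Printf.fprintf oc "minor_collections=%d\n" s.Gc.minor_collections;
  Printf.fprintf oc "major_collections=%d\n" s.Gc.major_collections;
  Printf.fprintf oc "top_heap_words=%d\n" s.Gc.top_heap_words

let () =
  (* run the benchmark workload here, then dump the counters *)
  emit_gc_stats stdout
```

A simple line-oriented format like this is easy to diff and to load into downstream tools, but any agreed format would do; the harder work is standardising which counters every suite reports.
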

### Experimenting with x86 code alignment emitted from the compiler

LLVM and GCC have a collection of options available to alter the x86 code
alignment. Some of these get enabled at higher optimization levels and other
options help with debugging.[24](#ref24) For example, it might be useful to be able to
align all jump targets; code size would increase and performance decrease, but
it might allow you to isolate a performance swing as being due to
microstructure alignment when doing a deep dive.

### Compile-time performance benchmarking

We would like to produce numbers on the compile-time performance of compiler
versions. The first step would be to decide on a representative corpus of OCaml
code to benchmark (this could be sandmark).

### How to tell if your benchmark suite is giving good coverage of the compiler optimizations

It would be useful to have some coverage metrics for a benchmarking suite
relative to the compiler functionality. This might reveal holes in the
benchmarking suite, and having an understanding of which bits of the
optimization functionality are engaged on a given benchmark could be revealing
in itself.

### Using the infrastructure for continuous benchmarking of general OCaml binaries

It would be interesting to try our infrastructure on a larger OCaml project
where tracking performance is important to its community of developers and
users. The compiler version would be fixed but the project code would
move through time. One idea we had is odoc, but other projects might be more
interesting to explore.

## Conclusions

We are satisfied that we have produced infrastructure that will prove effective
for continuous performance regression testing of multicore. We are hoping that the
infrastructure can be more generally useful in the wider community. We hope
that by sharing our experiences with benchmarking OCaml code we can help people
quickly realise good benchmarking experiments and avoid non-trivial pitfalls.

# Notes

[^1]: [http://bench.ocamllabs.io](http://bench.ocamllabs.io) which presents operf-micro benchmark experiments and [http://bench2.ocamllabs.io](http://bench2.ocamllabs.io) which presents sandmark-based benchmark experiments.

[^2]: [https://github.com/OCamlPro/operf-micro](https://github.com/OCamlPro/operf-micro) - tool code
[https://hal.inria.fr/hal-01245844/document](https://hal.inria.fr/hal-01245844/document) - “Operf: Benchmarking the OCaml Compiler”, Chambart et al

[^3]: Code for core_bench [https://github.com/janestreet/core_bench](https://github.com/janestreet/core_bench) and blog post describing core_bench [https://blog.janestreet.com/core_bench-micro-benchmarking-for-ocaml/](https://blog.janestreet.com/core_bench-micro-benchmarking-for-ocaml/)

[^4]: Haskell code for criterion [https://github.com/bos/criterion](https://github.com/bos/criterion) and tutorial on using criterion [http://www.serpentine.com/criterion/tutorial.html](http://www.serpentine.com/criterion/tutorial.html)

[^5]: Code and documentation for operf-macro [https://github.com/OCamlPro/operf-macro](https://github.com/OCamlPro/operf-macro)

[^6]: Code for sandmark [https://github.com/ocamllabs/sandmark](https://github.com/ocamllabs/sandmark)

[^7]: CPython continuous benchmarking site [https://speed.python.org/](https://speed.python.org/)

[^8]: PyPy continuous benchmarking site [http://speed.pypy.org/](http://speed.pypy.org/)

[^9]: Python Performance Benchmark Suite [https://pyperformance.readthedocs.io/](https://pyperformance.readthedocs.io/)

[^10]: Codespeed web app for visualization of performance data [https://github.com/tobami/codespeed](https://github.com/tobami/codespeed)

[^11]: LNT software [http://llvm.org/docs/lnt/](http://llvm.org/docs/lnt/) and the performance site [https://lnt.llvm.org/](https://lnt.llvm.org/)

[^12]: Description of Haskell issues with hard performance continuous-integration tests [https://ghc.haskell.org/trac/ghc/wiki/Performance/Tests](https://ghc.haskell.org/trac/ghc/wiki/Performance/Tests)

[^13]: ‘2017 Summer of Code’ work on improving Haskell performance integration tests [https://github.com/jared-w/HSOC2017/blob/master/Proposal.pdf](https://github.com/jared-w/HSOC2017/blob/master/Proposal.pdf)

[^14]: Live tracking site [https://perf.rust-lang.org/](https://perf.rust-lang.org/) and code for it [https://github.com/rust-lang-nursery/rustc-perf](https://github.com/rust-lang-nursery/rustc-perf)

[^15]: More information here: [http://kcsrk.info/multicore/ocaml/benchmarks/2018/09/13/1543-multicore-ci/](http://kcsrk.info/multicore/ocaml/benchmarks/2018/09/13/1543-multicore-ci/)

[^16]: A nice description of why first-parent helps is here: [http://www.davidchudzicki.com/posts/first-parent/](http://www.davidchudzicki.com/posts/first-parent/)

[^17]: For more details please see [https://github.com/ocaml-bench/ocaml_bench_scripts/#notes-on-hardware-and-os-settings-for-linux-benchmarking](https://github.com/ocaml-bench/ocaml_bench_scripts/#notes-on-hardware-and-os-settings-for-linux-benchmarking)

[^18]: There is some academic literature in this direction; for example “Rigorous benchmarking in reasonable time”, Kalibera et al [https://dl.acm.org/citation.cfm?id=2464160](https://dl.acm.org/citation.cfm?id=2464160) and “STABILIZER: statistically sound performance evaluation”, Curtsinger et al [https://dl.acm.org/citation.cfm?id=2451141](https://dl.acm.org/citation.cfm?id=2451141).
However, randomization approaches need to take care that the layout randomization distribution captures the relevant features of the layout randomization that real user binaries will see from ASLR, link order or dynamic library linking.

[^19]: We feel that there must be some memory latency bottlenecks in the sandmark macro-benchmarking suite, but we have yet to deep-dive and investigate a performance instability due to noise in memory latency for fetching instructions to execute. That is, the performance may be bottlenecked on instruction memory fetch, but we haven’t seen instruction memory fetch latency being very noisy between benchmark runs in our setup.

[^20]: Google’s AutoFDO [https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45290.pdf](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45290.pdf), Facebook’s HFSort [https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf](https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf) and Facebook’s BOLT [https://github.com/facebookincubator/BOLT](https://github.com/facebookincubator/BOLT) are recent examples reporting improvements in some deployed data-centre workloads.

[^21]: The effects are not limited to x86, as they have also been observed on ARM Cortex-A53 and Cortex-A57 processors with LLVM [https://www.youtube.com/watch?v=COmfRpnujF8](https://www.youtube.com/watch?v=COmfRpnujF8)

[^22]: The OCaml example is here [https://github.com/ocaml-bench/ocaml_bench_scripts/tree/master/stability_example](https://github.com/ocaml-bench/ocaml_bench_scripts/tree/master/stability_example); for the super curious there are yet more C++ examples to be had (although not necessarily the same underlying micro mechanism and all dependent on processor) [https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues](https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues)

[^23]: The Q&A at the LLVM dev meeting videos [https://www.youtube.com/watch?v=IX16gcX4vDQ](https://www.youtube.com/watch?v=IX16gcX4vDQ) and [https://www.youtube.com/watch?v=COmfRpnujF8](https://www.youtube.com/watch?v=COmfRpnujF8) are examples of discussion around the area. The general area of code layout is also a problem area in the academic community: “Producing wrong data without doing anything obviously wrong!”, Mytkowicz et al [https://dl.acm.org/citation.cfm?id=1508275](https://dl.acm.org/citation.cfm?id=1508275)

[^24]: For more on LLVM alignment options, see [https://dendibakh.github.io/blog/2018/01/25/Code_alignment_options_in_llvm](https://dendibakh.github.io/blog/2018/01/25/Code_alignment_options_in_llvm)

--------------------------------------------------------------------------------
/profiling_notes.md:
--------------------------------------------------------------------------------
# Profiling OCaml code
_Tom Kelly ([ctk21@cl.cam.ac.uk](mailto:ctk21@cl.cam.ac.uk))_,
_Sadiq Jaffer_

There are many ways of profiling OCaml code. Here we describe several that we have used and that should help you get up and going.


## perf record profiling

perf is a statistical profiler on Linux that can profile a binary without needing instrumentation.

Basic recording of an OCaml program:
```
# vanilla
perf record --call-graph dwarf -- program-to-run program-arguments
# exclude child processes (-i), user cycles only
perf record --call-graph dwarf -i -e cycles:u -- program-to-run program-arguments
# expand the sampled stack (which will make the traces larger) to capture deeper call stacks
perf record --call-graph dwarf,32768 -i -e cycles:u -- program-to-run program-arguments
```

Basic viewing of a perf recording:
```
# top down view
perf report --children
# bottom up view
perf report --no-children
# bottom up view, including kernel symbols
perf report --no-children --kallsyms /proc/kallsyms
# inverted call-graph can be useful on some versions of perf
perf report --inverted
```

To allow non-root users to use perf on Linux, you may need to execute:
```
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
```

Installing debug symbols on Ubuntu:
https://wiki.ubuntu.com/Debug%20Symbol%20Packages

In particular the kernel symbols:
`apt-get install linux-image-$(uname -r)-dbgsym`
and libgmp:
`apt-get install `

To allow non-root users to see the kernel symbols:
```
echo 0 | sudo tee /proc/sys/kernel/kptr_restrict
```

For many more perf examples:
http://www.brendangregg.com/perf.html

The "Profiling with perf" zine by Julia Evans is a great resource for getting the hang of perf:
https://jvns.ca/perf-zine.pdf

Pros:
- it's fast as it samples the running program
- the CLI can be a very quick way of getting a call chain to start from with a big program

Limitations:
- you can sometimes get strange annotations
- it's statistical, so you don't get full coverage or call counts
- the OCaml runtime functions can be confusing if you aren't familiar with them

### perf flamegraphs

You can also visualize the perf data using the FlameGraph package:
```
# git clone https://github.com/brendangregg/FlameGraph # or download it from github
# cd FlameGraph
# perf record --call-graph dwarf -- program-to-run program-arguments
# perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf-flamegraph.svg
```

The flamegraph output allows you to see the dynamics of the function calls. For more on flamegraphs see:
http://www.brendangregg.com/perf.html#FlameGraphs
http://www.brendangregg.com/flamegraphs.html
https://github.com/brendangregg/FlameGraph

Another front end to perf files is the speedscope web application:
https://www.speedscope.app/

## OCaml gprof support

Documentation on gprof is here:
https://sourceware.org/binutils/docs/gprof/

To run your OCaml program with gprof you need to do:
```
ocamlopt -o myprog -p other-options files
./myprog
gprof myprog
```

The call counts are accurate and the timing information is statistical, but your program now has extra code which wouldn't be in a release binary.

Pros:
- fairly quick way to get a call graph without having to use other tools
- call counts are accurate

Limitations:
- your program is instrumented, so it isn't what you will really run in production
- not obvious how to profile third-party packages built through opam

Open questions:
- is there a way to build so that all of the stdlib is covered?
- is there a way to build so that opam packages are covered?

## Callgrind

A good introduction to callgrind, if you've never used it, is here:
https://web.stanford.edu/class/archive/cs/cs107/cs107.1196/resources/callgrind

You can run callgrind from a standard valgrind install on your OCaml binary (as with any other binary):
```
valgrind --tool=callgrind program-to-run program-arguments
callgrind_annotate callgrind.out.<pid>
kcachegrind callgrind.out.<pid>
```

After a while you will start to notice that you have strange symbols that don't make sense or that you can't see. There are a couple of things you can fix:
- You need to make sure you have the sources available for all dependent packages (including the compiler). This can be done in opam with:
```
opam switch reinstall --keep-build-dir
```
- You will want to have the debugging symbols available for the standard Linux libraries; refer to your distribution for how to do that.

With that done, you may still have some odd functions appearing. This can be because the OCaml compilers (at least before 4.08) don't export the size of several assembler functions (particularly `caml_c_call`) into the ELF binary. Ideally the OCaml runtime would export the size of functions in the ELF information, but right now it does not have `.size` directives in the assembler.

To fix this, you will need to build a patched valgrind which will let you see them.

The patch against 3.15.0 (available here http://www.valgrind.org/downloads/repository.html) is this:
```
diff --git a/coregrind/m_debuginfo/readelf.c b/coregrind/m_debuginfo/readelf.c
index b982a838a..8b75b260b 100644
--- a/coregrind/m_debuginfo/readelf.c
+++ b/coregrind/m_debuginfo/readelf.c
@@ -253,6 +253,7 @@ void show_raw_elf_symbol ( DiImage* strtab_img,
 to piece together the real size, address, name of the symbol from
 multiple calls to this function. Ugly and confusing.
 */
+#define ALLOW_ZERO_ELF_SYM 1
 static
 Bool get_elf_symbol_info (
 /* INPUTS */
@@ -282,7 +283,8 @@ Bool get_elf_symbol_info (
 Bool in_text, in_data, in_sdata, in_rodata, in_bss, in_sbss;
 Addr text_svma, data_svma, sdata_svma, rodata_svma, bss_svma, sbss_svma;
 PtrdiffT text_bias, data_bias, sdata_bias, rodata_bias, bss_bias, sbss_bias;
-# if defined(VGPV_arm_linux_android) \
+# if defined(ALLOW_ZERO_ELF_SYM) \
+   || defined(VGPV_arm_linux_android) \
 || defined(VGPV_x86_linux_android) \
 || defined(VGPV_mips32_linux_android) \
 || defined(VGPV_arm64_linux_android)
@@ -475,7 +477,8 @@ Bool get_elf_symbol_info (
 in /system/bin/linker: __dl_strcmp __dl_strlen
 */
 if (*sym_size_out == 0) {
-# if defined(VGPV_arm_linux_android) \
+# if defined(ALLOW_ZERO_ELF_SYM) \
+   || defined(VGPV_arm_linux_android) \
 || defined(VGPV_x86_linux_android) \
 || defined(VGPV_mips32_linux_android) \
 || defined(VGPV_arm64_linux_android)
```

You can then run the following, which will allow callgrind to skip through `caml_c_call`, to create a potentially more useful callgraph:
```
valgrind --tool=callgrind --fn-skip=_init --fn-skip=caml_c_call -- program-to-run program-arguments
```

Pros:
- the function call counts are accurate
- the instruction count information is very accurate
- you can instrument all the way down to basic block instructions and jumps taken
- kcachegrind gives you a graphical front end which can sometimes be easier to navigate
- with the fn-skip options you can look through `caml_c_call` to get a call chain from OCaml through underlying libraries (including the OCaml runtime)

Limitations:
- quite slow to run
- not supported out of the box; right now to get the function skip you will need to compile your own valgrind (this should change following PR8708 on the OCaml compiler)
- recursive functions give confusing total function cost scores which should be ignored
- the OCaml runtime functions can be confusing if you aren't familiar with them

## OCaml Landmarks

Landmarks allow you to instrument at the OCaml level and give visibility into call chains, call counts and memory allocations in blocks:
https://github.com/LexiFi/landmarks

There are a few ways to instrument a program: manually, using manual ppx extensions, or fully automatically using a ppx mode.

### Manual instrumentation
You set up your landmarks with `register` and need to call `enter` and `exit` on each landmark you are interested in. Care needs to be taken to handle exceptions so that the `enter`/`exit` pairs match up (see the sketch after these subsections).

### Manual ppx extensions
You annotate specific OCaml lines with a pre-processing command, e.g. `let[@landmark] f = ...`. This handles all the pairing for you and can be quite convenient. You need to add the ppx to your build.

### Automatic instrumentation
You run the pre-processor with `--auto` and this will instrument all top-level functions within a module. You need to add the ppx to your build.
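
To make the manual style concrete, here is a minimal sketch of an exception-safe `enter`/`exit` pairing. It assumes the `Landmark.register`, `Landmark.enter` and `Landmark.exit` functions described above; `expensive_step` is a hypothetical stand-in for the code you want to measure:

```ocaml
(* hypothetical workload standing in for the code you want to measure *)
let expensive_step x = List.init x (fun i -> i * i)

let expensive_lm = Landmark.register "expensive_step"

let instrumented_expensive_step x =
  Landmark.enter expensive_lm;
  (* keep the enter/exit pair matched even if the workload raises *)
  match expensive_step x with
  | result -> Landmark.exit expensive_lm; result
  | exception e -> Landmark.exit expensive_lm; raise e
```

The ppx-based modes above generate essentially this pairing for you.
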

#### ocamlbuild and ppx tags
To get things working with ocamlbuild and landmarks you can add the following to your `_tags` file:
```
ppx(`ocamlfind query landmarks.ppx`/ppx.exe --as-ppx)
```
or for auto instrumentation:
```
ppx(`ocamlfind query landmarks.ppx`/ppx.exe --as-ppx --auto)
```

Pros:
- it's very quick to set up automatic profiling and then determine which functions are the heavy hitters
- it's easy to selectively profile bits of OCaml code you are interested in

Limitations:
- the timing information includes the overhead of any landmarks within another landmark
- the allocation information includes the overhead of any landmarks within another landmark
- the addition of the landmark will change the compiled code and can remove inlining opportunities that would occur in a release binary
- can slow down your binary in some cases

## OCaml Spacetime

The OCaml Spacetime profiler allows you to see where in your program memory is allocated.

Set up an opam switch with Spacetime and all your opam packages:
```
$ opam switch export opam_existing_universe
$ opam switch create 4.07.1+spacetime
$ opam switch import opam_existing_universe
```

NB: you might have a stack size issue with the universe; you can expand it to 128M in bash with `ulimit -S -s 131072`.

In the 4.07.1+spacetime switch, build your binary. Now run the executable, using the environment variable OCAML_SPACETIME_INTERVAL to turn on profiling and specify how frequently Spacetime should inspect the OCaml heap (in milliseconds):
```
$ OCAML_SPACETIME_INTERVAL=100 program-to-run program-arguments
```
This will output a file of the form `spacetime-<pid>`.

To view the information in this file, we need to process it with `prof_spacetime`. Right now, this only runs on OCaml 4.06.1 and lower. Install as follows:
```
$ opam switch 4.06.1
$ opam install prof_spacetime
```
To post-process the results do:
```
$ prof_spacetime process -e spacetime-<pid>
```
This will produce a `spacetime-<pid>.p` file.

To serve this and interact with it through a web browser do:
```
$ prof_spacetime serve -p spacetime-<pid>.p
```

To look at the data through a CLI interface do:
```
$ prof_spacetime view -p spacetime-<pid>.p
```

For more on Spacetime and its output see:
- Jane Street blog post on Spacetime: https://blog.janestreet.com/a-brief-trip-through-spacetime/
- OCaml compiler Spacetime documentation: https://caml.inria.fr/pub/docs/manual-ocaml/spacetime.html

Pros:
- gives rich information about when during the run objects are allocated
- gives rich information about which functions are allocating on the minor or major heap
- the web-browser GUI is quick and intuitive to navigate

Limitations:
- can slow down your binary in some cases and require more memory to run
- can be a pain to juggle switches to view the profile
- is focused on memory rather than runtime performance (although these are often heavily linked)

## Statistical memory profiling for OCaml

Install the opam statistical-memprof switch:
```
$ opam switch create 4.07.1+statistical-memprof
$ opam switch import
```

Download the `MemprofHelpers` module from:
https://github.com/jhjourdan/ocaml/blob/memprof/memprofHelpers.ml

And within your binary you need to execute:
```
MemprofHelpers.start 1E-3 20 100
```

You may also need to set up a dump function by looking at `MemprofHelpers.start` if delivering SIGUSR1 to snapshot is not easy. Your mileage may vary on how useful the output is to look at.

Slightly out-of-date GitHub branch:
https://github.com/jhjourdan/ocaml/tree/memprof

Document describing the design:
https://hal.inria.fr/hal-01406809/document

Pros:
- quick to run
- gives rich information about which functions are allocating on the minor or major heap

Limitations:
- you need to write code in your binary to get this up and running
- is focused on memory rather than runtime performance (although these are often heavily linked)

Open questions:
- is there a better way than having to do `MemprofHelpers.start` in your binary?

## strace

This tool is very useful for knowing what system calls your program is making. You can wrap a run of your binary with:
```
strace -o outfile program-to-run program-arguments
```

## Compiler explorer

The compiler explorer supports OCaml as a language:
https://godbolt.org

Interesting things to try are:
```
let cmp a b =
  if a > b then a else b

let cmp_i (a:int) (b:int) =
  if a > b then a else b

let cmp_s (a:string) (b:string) =
  if a > b then a else b
```

Or what happens with a closure:
```
let fn ys z =
  let g = ((+) z) in
  List.map g ys

let fn_i ys (z:int) =
  let g = ((+) z) in
  List.map g ys
```

You can also use the `Analysis` mode to see how specific chunks of assembly will be executed (try something like: `--mcpu=skylake --timeline`).

Pros:
- can give you an intuition of what your code will turn into

Limitations:
- instructions don't tell you about runtime performance directly; you will always need to benchmark
- not really an efficient way to tackle existing code

## Other resources

OCaml.org tutorial on performance and profiling:
https://ocaml.org/learn/tutorials/performance_and_profiling.html

OCamlverse links for optimizing performance:
https://ocamlverse.github.io/content/optimizing_performance.html

Real World OCaml's compiler backend section has a little:
https://dev.realworldocaml.org/compiler-backend.html#compiling-fast-native-code

--------------------------------------------------------------------------------