├── apr19.md
└── profiling_notes.md
/apr19.md:
--------------------------------------------------------------------------------
1 | # Benchmarking the OCaml compiler: what we have learnt and suggestions for the future
2 |
3 | _Tom Kelly ([ctk21@cl.cam.ac.uk](mailto:ctk21@cl.cam.ac.uk))_,
4 | _Stephen Dolan_,
5 | _Sadiq Jaffer_,
6 | _KC Sivaramakrishnan_,
7 | _Anil Madhavapeddy_
8 |
9 | Our goal is to provide statistically meaningful benchmarking of the OCaml compiler
10 | to aid our multicore upstreaming efforts through 2019. In particular we wanted an
11 | infrastructure that operated continuously on all compiler commits and could handle
12 | multiple compiler variants (e.g. flambda). This lets developers be confident that no
13 | performance regressions slip through, and it guides where to focus effort when
14 | introducing new features like multicore, which can have a non-trivial impact on
15 | performance and can also introduce additional non-determinism into measurements
16 | due to parallelism.
17 |
18 | This document describes what we did to build continuous benchmarking
19 | websites[1](#ref1) that take a controlled experiment approach to running the
20 | operf-micro and sandmark benchmarking suites against tracked git branches of
21 | the OCaml compiler.
22 |
23 | Our high level aims were:
24 |
25 | * Try to reuse OCaml benchmarks that exist and cover a range of micro and
26 | macro use cases.
27 | * Try to incorporate best-practice and reuse tools where appropriate by
28 | looking at what other compiler development communities are doing.
29 | * Provide visualizations that are easy to use for OCaml developers, making it
30 | easier and less time-consuming to merge complex features like multicore
31 | without performance regressions.
32 | * Document the steps to set up the system including its experimental
33 | environment for benchmarking so that the knowledge is captured for others
34 | to reuse.
35 |
36 | This document is structured as follows: we survey the OCaml benchmark packages
37 | available; we survey the infrastructure other communities are using to catch
38 | performance regressions in their compilers; we highlight existing continuous
39 | benchmarks of the OCaml compiler; we describe the system we have built, from
40 | selecting the code and running it in a controlled environment through to how
41 | the results are displayed; we summarise what we have learnt; and finally we
42 | discuss directions that future work could follow.
43 |
44 | ## Survey of OCaml benchmarks
45 |
46 | ### Operf-micro[2](#ref2)
47 |
48 | Operf-micro collates a collection of micro benchmarks originally put together
49 | to help with the development of flambda. This tool compiles a micro-benchmark
50 | function inside a harness. The harness runs the tested function in a loop to
51 | generate samples x(n) where n is the number of loop iterations. The harness
52 | will collate data for multiple iteration lengths n which provides coverage of a
53 | range of garbage collector behaviour. The number of samples _(n, x(n))_ is
54 | determined by a runtime quota target. Once this _(n, x(n))_ data is collected, it
55 | is post-processed using regression to provide the execution time of a single
56 | iteration of the function.
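
To make the method concrete, here is a simplified sketch in OCaml of the same idea (it is not operf-micro's actual code): sample the total time x(n) for several iteration counts n, then fit x(n) against n by ordinary least squares so that the slope estimates the cost of one iteration.

```
(* A simplified illustration (not operf-micro's actual code): time the function
   for several iteration counts n, then least-squares fit x(n) ~ a*n + b; the
   slope a estimates the execution time of a single iteration. *)
let time_iterations f n =
  let t0 = Unix.gettimeofday () in
  for _ = 1 to n do ignore (Sys.opaque_identity (f ())) done;
  Unix.gettimeofday () -. t0

let per_iteration_cost f =
  let samples =
    List.map
      (fun n -> (float_of_int n, time_iterations f n))
      [ 100; 200; 400; 800; 1600; 3200 ]
  in
  let len = float_of_int (List.length samples) in
  let sx = List.fold_left (fun s (n, _) -> s +. n) 0. samples in
  let sy = List.fold_left (fun s (_, x) -> s +. x) 0. samples in
  let sxx = List.fold_left (fun s (n, _) -> s +. n *. n) 0. samples in
  let sxy = List.fold_left (fun s (n, x) -> s +. n *. x) 0. samples in
  (* slope of the ordinary least-squares fit of x against n *)
  (len *. sxy -. sx *. sy) /. (len *. sxx -. sx *. sx)

let () =
  Printf.printf "estimated seconds per iteration: %.3e\n"
    (per_iteration_cost (fun () -> List.init 100 (fun i -> i * i)))
```

This sketch, and the later ones that use `Unix.gettimeofday`, need to be linked against the unix library.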
57 |
58 | The tool is designed to have minimal external dependencies and can be run
59 | against a test compiler binary without additional packages. The experimental
60 | method of running for multiple embedded iterations is also used by Jane
61 | Street’s `core_bench`[3](#ref3) and Haskell’s criterion[4](#ref4).
62 |
63 | ### Operf-macro[5](#ref5)
64 |
65 | Operf-macro provides a framework to define and run macro-benchmarks. The
66 | benchmarks themselves are opam packages and the compiler versions are opam
67 | switches. Opam is used to handle the dependencies of the larger benchmarks and
68 | the maintenance of the macro-benchmark code base. Benchmark
69 | descriptions are split among metadata stored in a customised opam-repository
70 | that overlays the benchmarks onto existing codebases.
71 |
72 | ### Sandmark[6](#ref6)
73 |
74 | Sandmark is a tool we have developed that takes a similar approach to operf-macro, but:
75 | - utilizes opam v2 and pins the opam repo to one internally defined so that all the benchmarked
76 | code is fixed and not dependent on the global opam-repository
77 | - does not have a ppx dependency for the tested packages, making it easier to work on
78 | development versions of the compiler
79 | - has the benchmark descriptions in a single `dune` file in the sandmark repository,
80 | to make it easy to see what is being measured in one place
81 |
82 | For the purposes of our benchmarking we decided to run both the operf-micro and
83 | sandmark packages.
84 |
85 | #### Single-threaded tests in Sandmark
86 |
87 | Included in Sandmark are a range of tests ported from operf-macro that run on
88 | or could be easily modified to run on multicore. In addition to these, a new
89 | set of performance tests has been added that aims to expand coverage over
90 | compiler and runtime areas that differ in implementation between vanilla and
91 | multicore. These tests are designed to run on both vanilla and multicore and
92 | should highlight changes in single-threaded performance between the two
93 | implementations. The main areas we aimed to cover were:
94 |
95 | * Stack overflow checks in function prologues
96 | * Costs of switching stacks when making external calls (alloc/noalloc/various numbers of parameters)
97 | * Lazy implementation changes
98 | * Weak pointer implementation changes
99 | * Finalizer implementation changes
100 | * General GC performance under various different types of allocation behaviour
101 | * Remembered set overhead implementation changes
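
As an illustration of the kind of kernel behind the areas listed above (a hypothetical example, not one of the actual Sandmark tests), a small program that stresses lazy forcing, weak-pointer scanning and allocation looks like this:

```
(* Hypothetical illustration, not a Sandmark benchmark: force many lazy values
   and store the results through weak pointers, so that the lazy and weak
   implementations and the GC's weak scanning are all exercised. *)
let () =
  let n = 100_000 in
  let weak = Weak.create n in
  for i = 0 to n - 1 do
    let l = lazy (String.make 64 'x') in
    Weak.set weak i (Some (Lazy.force l))
  done;
  Gc.full_major ();
  let live = ref 0 in
  for i = 0 to n - 1 do
    match Weak.get weak i with
    | Some _ -> incr live
    | None -> ()
  done;
  Printf.printf "live weak entries after major GC: %d\n" !live
```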
102 |
103 | Also included in Sandmark are some of the best single-threaded OCaml entries
104 | for the Benchmarks Game; these are already well optimised and should serve to
105 | highlight performance differences between vanilla and multicore on
106 | single-threaded code.
107 |
108 | #### Multicore-specific tests in Sandmark
109 |
110 | In addition to the single-threaded tests in Sandmark, there are also
111 | multicore-specific tests. They come in two flavours. First are those
112 | intended to highlight performance changes between vanilla and multicore. These
113 | include benchmarks that perform lots of external calls and callbacks (the
114 | multicore runtime manages stacks differently), stress tests for lazy values (the
115 | lazy layout differs between multicore and vanilla), weak arrays and ephemerons
116 | (different layouts and GC algorithms), and finalisers (different finaliser structures).
117 |
118 | Secondly, there are multicore-only tests. These consist so far of various simple
119 | lock-free data structure tests that stress the multicore GC in different ways,
120 | e.g. some force many GC promotions to the major heap while others pre-allocate.
121 | The intention is to expand the set of multicore specific tests to include larger
122 | benchmarks as well as tests that compare existing approaches to parallelism on
123 | vanilla OCaml with reimplementations on multicore.
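
To give a flavour of these multicore-only tests, here is a sketch only (not one of the actual Sandmark tests), written against the `Domain` and `Atomic` modules of present-day multicore OCaml rather than the exact 2019 prototype API:

```
(* Sketch of a simple lock-free stress test: several domains push onto a shared
   Treiber-style stack implemented with an atomic list head, so freshly
   allocated cells become visible to other domains and stress the GC. *)
let push stack x =
  let rec loop () =
    let old = Atomic.get stack in
    if not (Atomic.compare_and_set stack old (x :: old)) then loop ()
  in
  loop ()

let () =
  let stack = Atomic.make [] in
  let domains =
    List.init 4 (fun d ->
        Domain.spawn (fun () ->
            for i = 1 to 100_000 do push stack (d, i) done))
  in
  List.iter Domain.join domains;
  Printf.printf "items pushed: %d\n" (List.length (Atomic.get stack))
```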
124 |
125 | ## How other compiler communities handle continuous benchmarking
126 |
127 | ### Python (CPython and PyPy)
128 |
129 | The Python community has continuous benchmarking both for their
130 | CPython[7](#ref7) and PyPy[8](#ref8) runtimes. The
131 | benchmarking data is collected by running the Python Performance Benchmark
132 | Suite[9](#ref9). The open-source web application
133 | Codespeed[10](#ref10) provides a front end to navigate and visualize
134 | the results. Codespeed is written in Python on Django and provides views into
135 | the results via a revision table, a timeline by benchmark and comparison over
136 | all benchmarks between tagged versions. Codespeed has been picked up by other
137 | projects as a way to quickly provide visualizations of performance data across
138 | code revisions.
139 |
140 | ### LLVM
141 |
142 | LLVM has a collection of micro-benchmarks in their C/C++ compiler test suite.
143 | These micro-benchmarks are built on the google-benchmark library and produce
144 | statistics that can be easily fed downstream. They also support external tests
145 | (for example SPEC CPU 2006). LLVM have performance tracking software called LNT
146 | which drives a continuous monitoring site[11](#ref11). While the LNT
147 | software is packaged for people to use in other projects, we could not find
148 | another project using it for visualizing performance data, and at first glance
149 | it did not look easy to reuse.
150 |
151 | ### GHC
152 |
153 | The Haskell community have performance regression tests with hardcoded
154 | values which trip a continuous-integration failure. This method has proved
155 | painful for them[12](#ref12) and they have been looking to change it
156 | to a more data-driven approach[13](#ref13). At this time they did
157 | not seem to have infrastructure running to help them.
158 |
159 | ### Rust
160 |
161 | Rust have built their own tools to collect benchmark data and present it in a
162 | web app[14](#ref14). This tool has some interesting features:
163 |
164 | * It measures both compile time and runtime performance.
165 | * They are committing the data from their experiments into a github repo to
166 | make the data available to others.
167 |
168 | ## Relation to existing OCaml compiler benchmarking efforts
169 |
170 | ### OCamlPro Flambda benchmarking
171 |
172 | OCamlPro put together the operf-micro, operf-macro and
173 | [http://bench.flambda.ocamlpro.com/](http://bench.flambda.ocamlpro.com/) site
174 | to provide a benchmarking environment for flambda. Our work builds on these
175 | tools and is inspired by them.
176 |
177 | ### Initial OCaml multicore benchmarking site
178 |
179 | OCaml Labs put together an initial hosted multicore benchmarking site
180 | [http://ocamllabs.io/multicore](http://ocamllabs.io/multicore). This built on
181 | the OCamlPro flambda site by (i) implementing visualization in a single
182 | javascript library; (ii) upgrading to opam v2; (iii) updating the
183 | macro-benchmarks to more recent versions.[15](#ref15) The work
184 | presented here incorporates the experience of building that site and builds on
185 | it.
186 |
187 | ## What we put together
188 |
189 | We decided to use operf-micro and sandmark as our benchmarks. We went with
190 | Codespeed[10](#ref10) as a visualization tool since it looked easy
191 | to set up and had instances running in multiple projects. We also wanted to try
192 | an open source tool where there is already a community in place.
193 |
194 | The system runs a pipeline: it determines which commits are of interest on the
195 | branch timelines it is tracking; it builds the compiler for each such commit
196 | hash; it then runs either operf-micro or sandmark on a clean experimental
197 | CPU; and it uploads the data into Codespeed for users to view and explore.
199 |
200 | ### Transforming git commits to a timeline
201 |
202 | Mapping a git branch to a linear timeline requires a bit of care. We use the
203 | following mapping:
204 | - we take a branch tag and ask for commits using the git first-parent option[16](#ref16)
205 | - we use the commit date of each such commit as its timestamp.
206 |
207 | In most repositories this will give a linear code order that makes sense, even
208 | though the development process has been decentralised. It works well for
209 | GitHub-style pull-request development, and has been verified to work with the
210 | `ocaml/ocaml` git development workflow.
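
A rough sketch of this mapping in OCaml (not our production scripts), assuming a local git checkout and `git` on the PATH; the branch name below is only an example:

```
(* List the first-parent commits of a branch together with their committer
   dates, which we use as the linear timeline. *)
let first_parent_timeline branch =
  let cmd =
    Printf.sprintf "git log --first-parent --pretty=format:%%H,%%cI %s" branch
  in
  let ic = Unix.open_process_in cmd in
  let rec read acc =
    match input_line ic with
    | exception End_of_file -> ignore (Unix.close_process_in ic); acc
    | line ->
      (match String.index_opt line ',' with
       | None -> read acc
       | Some i ->
         let hash = String.sub line 0 i in
         let date = String.sub line (i + 1) (String.length line - i - 1) in
         read ((hash, date) :: acc))
  in
  (* git log prints newest first; prepending as we read gives oldest first *)
  read []

let () =
  List.iter
    (fun (hash, date) -> Printf.printf "%s %s\n" date hash)
    (first_parent_timeline "upstream/trunk")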
211 |
212 | ### Experimental setup
213 |
214 | We wanted to try to remove sources of noise in the performance measurements
215 | where possible. We configured our x86 Linux machines as follows:[17](#ref17)
216 |
217 | * **Hyperthreading**: Hyperthreading was configured off in the machine BIOS
218 | to avoid cross-talk and resource sharing within a physical core.
219 | * **Turbo boost**: Turbo boost was disabled on all CPUs to ensure that the
220 | processor did not enter turbo states, which can be throttled by external factors.
221 | * **Pstate configuration:** We set the power state explicitly to performance
222 | rather than powersave.
223 | * **Linux CPU isolation**: We isolated several cores on our machine using the
224 | Linux isolcpus kernel parameter. This allowed us to be sure that only the
225 | benchmarking process was scheduled on an isolated core and that the kernel
226 | did not schedule other system processes onto that core.
227 | * **Interrupts:** We shifted all operating system interrupt handling away
228 | from the isolated CPUs we used for benchmarking.
229 | * **ASLR (address space layout randomisation):** We switched this off on our
230 | benchmarking runs to make experiments repeatable.
231 |
232 | With this configuration, we found that we could run benchmarks such that they
233 | were very likely to give the same result when rerun at another time. Note that
234 | without making the changes above, there was significantly more variance in the
235 | test results, so it is important to apply this control.
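
As a sanity check before a run, a small helper can verify that the machine has not drifted from the settings listed above. A minimal sketch follows; the sysfs/procfs paths are the usual ones on recent Linux kernels with intel_pstate, but they vary with kernel version and hardware, so treat them as assumptions:

```
(* Hedged helper sketch: warn if the machine is not in the benchmarking
   configuration described in this section. *)
let read_first_line path =
  try
    let ic = open_in path in
    let line = (try Some (input_line ic) with End_of_file -> None) in
    close_in ic;
    line
  with Sys_error _ -> None

let expect path expected setting =
  match read_first_line path with
  | Some v when String.trim v = expected -> ()
  | Some v -> Printf.eprintf "WARNING: %s is %S, expected %S\n" setting v expected
  | None -> Printf.eprintf "WARNING: could not read %s (%s)\n" path setting

let () =
  expect "/sys/devices/system/cpu/smt/control" "off" "hyperthreading";
  expect "/sys/devices/system/cpu/intel_pstate/no_turbo" "1" "turbo boost disabled";
  expect "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor" "performance" "cpu governor";
  expect "/proc/sys/kernel/randomize_va_space" "0" "ASLR disabled"
```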
236 |
237 | ### Visualization with Codespeed
238 |
239 | Once the experimental results have been collected, we upload them into the
240 | codespeed visualization tool which presents a web interface for browsing the
241 | results. This tool presents three ways to interact with the data:
242 |
243 | * **Changes**: This view allows you to browse the raw results across all
244 | benchmarks collected for a particular commit, build variant and machine
245 | environment.
246 | * **Timeline**: This view allows you to look at a given benchmark through
247 | time on a given machine environment. You can compare multiple build
248 | variants on the same graph. It also has a grid view that gives a panel
249 | display of timelines for all benchmarks. It can be a good way to spot
250 | regressions in the performance and see if the regression is persistent in
251 | nature.
252 | * **Comparison**: This view allows you to compare tagged versions of the code
253 | to each other (including the latest commit of a branch) across all
254 | benchmarks. This can be a good compact view to ask questions like: “on which
255 | benchmarks is 4.08 faster or slower than 4.07.1?”
256 |
257 | The tool provides cross-linking between commits and GitHub, which makes it quick
258 | to drill into the specific code diff of a commit.
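
For reference, uploading a single measurement to Codespeed is a plain HTTP form POST. The sketch below (not our production uploader) uses cohttp-lwt-unix; the form fields follow Codespeed's documented `/result/add/` endpoint, and the URL and values are placeholders for illustration:

```
(* Hedged sketch: POST one measurement to a Codespeed instance. *)
let upload_result () =
  let uri = Uri.of_string "http://localhost:8000/result/add/" in
  let params =
    [ ("commitid",     ["1a2b3c4d"]);
      ("branch",       ["trunk"]);
      ("project",      ["ocaml"]);
      ("executable",   ["vanilla"]);
      ("benchmark",    ["operf-micro.fib"]);
      ("environment",  ["xeon-e5-2430l"]);
      ("result_value", ["0.00123"]) ]
  in
  let open Lwt.Infix in
  Cohttp_lwt_unix.Client.post_form ~params uri >>= fun (resp, body) ->
  Cohttp_lwt.Body.to_string body >|= fun body_str ->
  Printf.printf "HTTP %d: %s\n"
    (Cohttp.Code.code_of_status (Cohttp.Response.status resp))
    body_str

let () = Lwt_main.run (upload_result ())
```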
259 |
260 | ## What we learnt
261 |
262 | ### Very useful tool to discover performance regressions with multicore
263 |
264 | Benchmarking over many compiler commits, combined with the visualization tool, has been
265 | very useful in enabling us to determine which test cases have performance
266 | regressions with multicore. Because the experimental environment is clean, the
267 | results we see are much more repeatable when we investigate things that
268 | look odd. The timeline view is also able to give clues as to which area of the
269 | code changed to cause a performance regression. When performance regressions
270 | are fixed, the benchmarks are automatically rerun, which provides a ratchet to
271 | know you are making things better without additional manual overhead.
272 |
273 | ### Getting a clean experimental setup is possible but takes some care
274 |
275 | We found that getting our machine environment into a state where we could rerun
276 | benchmarks and get repeatable performance numbers was possible, but takes some
277 | care. Often we only discovered we had a problem by looking at large numbers of
278 | benchmarks and diving into commits that caused performance swings we didn’t
279 | understand (for example changes to documentation that altered performance).
280 | Hopefully by collecting the configuration information together in a single
281 | place, other people can quickly set up clean environments for their own
282 | benchmarking[17](#ref17). Having a clean environment allowed us to really probe the
283 | microstructure effects with x86 performance counters in a rigorous and
284 | iterative way.
285 |
286 | ### Address space layout randomization (ASLR)
287 |
288 | This was found following investigation of some benchmark instability that showed
289 | up as timeline changes at commits that did not touch compiler code. We tracked
290 | down ~10-15% swings that were being caused by branch mis-prediction in some
291 | operf-micro format benchmarks that were sensitive to how the address space was
292 | laid out with randomization. We switched off ASLR so that our benchmarks would
293 | be repeatable and an identical binary between commits would be very likely to
294 | give the same result on a single run. User binaries in the wild will likely run
295 | with ASLR on, but we have configured our environment to make benchmarks
296 | repeatable without the need to take the average of many samples. It remains an
297 | open area to benchmark in reasonable time and allow easy developer interaction
298 | while investigating performance under the presence of ASLR[18](#ref18).
299 |
300 | ### Code layout can matter in surprising ways
301 |
302 | It is well known that code layout can matter when evaluating the performance of
303 | a binary on a modern processor. Modern x86 processors are often deeply
304 | pipelined and can issue multiple decoded uops per cycle. In order to maintain
305 | good performance it is critical that there is a ready stream of decoded uops
306 | available to be executed on the back end of the processor. There are many ways
307 | a uop could be unavailable.
308 |
309 | A well known way for a uop to be unavailable is because the processor is
310 | waiting for it to be fetched from the memory hierarchy. Within the operf-micro
311 | benchmarking suite on the hardware we used, we didn’t see any performance
312 | instability coming from CPU interactions with the instruction or data cache or
313 | main memory. The instruction path memory bottleneck may still be worth
314 | optimizing for and we expect that changes which do optimize this area will be
315 | seen in the sandmark macro-benchmarks[19](#ref19). There are a collection of
316 | approaches being used to improve code layout for instruction memory latency
317 | using profile layout[20](#ref20).
318 |
319 | We were surprised by the size of the performance impact that code layout can
320 | have due to how instructions pass through the front end of an x86 processor on
321 | their way to being issued[21](#ref21). On x86, the alignment of blocks of instructions and
322 | how decoded uops are packed into decode stream buffers (DSB) can impact the
323 | performance of smaller hot loops in code. Examples of some of the mechanical
324 | effects are:
325 |
326 | * DSB Throughput and alignment
327 | * DSB Thrashing and alignment
328 | * Branch-predictor effects and alignment
329 |
330 | At a high level this can be thought of as taking blocks of x86 code and mapping
331 | them into a scarce uop cache. If the blocks of x86 code in the hot path are
332 | unfortunately aligned or have high jump and/or branch complexity, then this can
333 | make the decoder and uop cache a bottleneck in your code. If you want to learn
334 | more about the causes of performance swings due to hot code placement in IA
335 | there is a good LLVM dev talk given by Zia Ansari (Intel)
336 | [https://www.youtube.com/watch?v=IX16gcX4vDQ](https://www.youtube.com/watch?v=IX16gcX4vDQ).
337 |
338 | We came across some of these hot-code placement effects in our OCaml
339 | micro-benchmarks when diving into performance swings that we didn’t understand
340 | by looking at the commit diff. We have an example that can be downloaded[22](#ref22)
341 | which alters the alignment of a hot loop in an OCaml microbenchmark (without
342 | changing the individual instructions in the hot loop), leading to a ~15% range
343 | in observed performance on our Xeon E5-2430L machine. From a compiler writer’s
344 | perspective this can be a difficult area; we don’t have any simple solutions
345 | and other compiler communities also struggle with how to tame these
346 | effects[23](#ref23).
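
For intuition, the sort of microbenchmark that shows this sensitivity is a small, branchy hot loop like the hypothetical one below (it is not the downloadable stability example): an unrelated edit that merely shifts where this code lands in the binary can change how its blocks map into the DSB and branch predictor, and hence its measured time, without changing a single instruction in the loop.

```
(* Hypothetical illustration: a small, branchy hot loop whose placement
   relative to instruction-fetch boundaries can influence decode and
   branch-prediction behaviour on x86. *)
let hot_loop xs =
  let rec go acc = function
    | [] -> acc
    | x :: rest ->
      if x land 1 = 0 then go (acc + x) rest else go (acc - x) rest
  in
  go 0 xs

let () =
  let xs = List.init 10_000 (fun i -> i) in
  let t0 = Unix.gettimeofday () in
  let result = ref 0 in
  for _ = 1 to 10_000 do result := hot_loop xs done;
  Printf.printf "result=%d time=%.3fs\n" !result (Unix.gettimeofday () -. t0)
```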
347 |
348 | It is important to realise that a code layout change can lead to changes in the
349 | performance of hot loops. The best way to proceed when you see an unexpected
350 | performance swing on x86 is to look at performance counters around the DSB and
351 | branch predictor. We hope to add selected performance counter data to the
352 | benchmarks we publish in the future.
353 |
354 | ### Logistical issues when scaling up
355 |
356 | To run our experiments in a controlled fashion, we need to schedule
357 | benchmarking runs so that they do not get impacted by external factors. Right
358 | now we have been managing with some scripts and manual placement, but we are
359 | quickly approaching the point where we need to have some distributed job
360 | scheduling and management to assist us. This should allow us to handle multiple
361 | architectures, operating systems and benchmarking of more variants of the
362 | compiler. We have yet to decide what software to use for task scheduling and
363 | management.
364 |
365 | ## Next Steps (as of April 2019)
366 |
367 | ### Managing a heterogeneous cluster of machines, operating systems and benchmarks in a low-overhead way
368 |
369 | We want to support a large number of machine architectures, operating systems
370 | and benchmarking suites. To do this we need to run a very large number of
371 | concurrent benchmarking experiments in which process scheduling is tightly
372 | controlled and errors are handled in a structured way. A core component needed will be a
373 | distributed job scheduler. On top of the scheduler, we need a system to handle
374 | iterative result generation, backfilling results for new benchmarks and
375 | regeneration of additional data points on demand. By using a scheduler it
376 | should be possible to get a better resource utilization than under manual block
377 | scheduling.
378 |
379 | ### Publishing data for downstream tools
380 |
381 | We would like to provide a mechanism for publishing the raw data we collect
382 | from running benchmarking experiments. We can see two possible ideas here: i)
383 | similar to Rust, we could commit the data to a GitHub-hosted repo and then the data
384 | can be served using the git toolchain; ii) we could provide a GraphQL-style
385 | server interface which would also serve as a backend for front-end web-app
386 | visualization tools. This data format could be closely coupled with what
387 | benchmarking suites publish as the result of running benchmark code.
388 |
389 | ### Modernise the front-end visualization and incorporate multi-dimensional data navigation
390 |
391 | We have successfully used Codespeed to visualize our data. This codebase is
392 | based on old Javascript libraries and has internal mechanisms for transporting
393 | data. It is not a trivial application to extend or adapt. A new front-end based
394 | on modern web development technologies which also handles multi-dimensional
395 | benchmark output easily would be useful.
396 |
397 | ### Ability to sandbox the benchmarks while containing noise for pull-request testing
398 |
399 | It would be useful for developer workflow to have the ability to benchmark
400 | pull-request branches. There are challenges to doing this while keeping the
401 | experiments controlled. One approach would be to set up a sandboxed environment
402 | that is as close to the bare metal as possible but maintains the low noise
403 | environment we have from using low-level Linux options. Resource usage control
404 | and general scheduling would also need to be considered in such a multi-user
405 | system.
406 |
407 | ### Publish performance counter and GC data coming from benchmarks
408 |
409 | We would like to publish multi-dimensional data from the benchmarks and have a
410 | structured way to interpret them. There is a bit of work to standardise the
411 | data we get out of the benchmark suites, and integrating the collection of
412 | performance counters would also need doing. Once the data is collected, there
413 | would be a little work to present visualizations of the data.
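
On the GC side, much of the per-benchmark data can already be collected from inside a harness with the standard Gc module; a minimal sketch of the kind of wrapper we have in mind (the output format here is purely illustrative):

```
(* Sketch: wrap a benchmark run and emit wall-clock time together with GC
   counters from the standard Gc module, ready to be published alongside the
   timing result. *)
let with_gc_stats f =
  let t0 = Unix.gettimeofday () in
  let s0 = Gc.quick_stat () in
  let result = f () in
  let s1 = Gc.quick_stat () in
  let t1 = Unix.gettimeofday () in
  Printf.printf
    "time_secs=%.6f minor_words=%.0f major_words=%.0f major_collections=%d\n"
    (t1 -. t0)
    (s1.Gc.minor_words -. s0.Gc.minor_words)
    (s1.Gc.major_words -. s0.Gc.major_words)
    (s1.Gc.major_collections - s0.Gc.major_collections);
  result

let () =
  ignore (with_gc_stats (fun () -> List.init 1_000_000 (fun i -> i * i)))
```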
414 |
415 | ### Experimenting with x86 code alignment emitted from the compiler
416 |
417 | LLVM and GCC have a collection of options available to alter the x86 code
418 | alignment. Some of these get enabled at higher optimization levels and other
419 | options help with debugging.[24](#ref24) For example, it might be useful to be able to
420 | align all jump targets; code size would increase and performance decrease, but
421 | it might allow you to attribute a performance swing to
422 | microstructure alignment when doing a deep dive.
423 |
424 | ### Compile time performance benchmarking
425 |
426 | We would like to produce numbers on the compile-time performance of compiler
427 | versions. The first step would be to decide on a representative corpus of OCaml
428 | code to benchmark (could be sandmark).
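
A minimal sketch of the measurement itself, assuming a corpus of self-contained `.ml` files (the file names below are placeholders) and the `ocamlopt` under test on the PATH:

```
(* Hypothetical sketch: wall-clock time a separate compilation of each corpus
   file with the compiler under test. *)
let time_compile compiler file =
  let t0 = Unix.gettimeofday () in
  let cmd = Printf.sprintf "%s -c %s" compiler (Filename.quote file) in
  let status = Sys.command cmd in
  let dt = Unix.gettimeofday () -. t0 in
  if status <> 0 then Printf.eprintf "compile failed: %s\n" file;
  dt

let () =
  let corpus = [ "corpus/lexer.ml"; "corpus/parser.ml"; "corpus/typing.ml" ] in
  List.iter
    (fun file -> Printf.printf "%s: %.3fs\n" file (time_compile "ocamlopt" file))
    corpus
```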
429 |
430 | ### How to tell if your benchmark suite is giving good coverage of the compiler optimizations
431 |
432 | It would be useful to have some coverage metrics of a benchmarking suite
433 | relative to the compiler functionality. This might reveal holes in the
434 | benchmarking suite and having an understanding of which bits of the
435 | optimization functionality are engaged on a given benchmark could be revealing
436 | in itself.
437 |
438 | ### Using the infrastructure for continuous benchmarking of general OCaml binaries
439 |
440 | It would be interesting to try our infrastructure on a larger OCaml project
441 | where tracking its performance is important to its community of developers and
442 | users. The compiler version would be fixed but the project code would
443 | move through time. One idea we had is odoc, but other projects might be more
444 | interesting to explore.
445 |
446 | ## Conclusions
447 |
448 | We are satisfied that we have produced infrastructure that will prove effective
449 | for continuous performance regression testing of multicore. We are hoping that the
450 | infrastructure can be more generally useful in the wider community. We hope
451 | that by sharing our experiences with benchmarking OCaml code we can help people
452 | quickly realise good benchmarking experiments and avoid non-trivial pitfalls.
453 |
454 | # Notes
455 |
456 | [^1]: [http://bench.ocamllabs.io](http://bench.ocamllabs.io) which presents operf-micro benchmark experiments and [http://bench2.ocamllabs.io](http://bench2.ocamllabs.io) which presents sandmark based benchmark experiments.
457 |
458 | [^2]: [https://github.com/OCamlPro/operf-micro](https://github.com/OCamlPro/operf-micro) - tool code
459 | [https://hal.inria.fr/hal-01245844/document](https://hal.inria.fr/hal-01245844/document) - “Operf: Benchmarking the OCaml Compiler”, Chambart et al
460 |
461 | [^3]: Code for core_bench [https://github.com/janestreet/core_bench](https://github.com/janestreet/core_bench) and blog post describing core_bench [https://blog.janestreet.com/core_bench-micro-benchmarking-for-ocaml/](https://blog.janestreet.com/core_bench-micro-benchmarking-for-ocaml/)
462 |
463 | [^4]: Haskell code for criterion [https://github.com/bos/criterion](https://github.com/bos/criterion) and tutorial on using criterion [http://www.serpentine.com/criterion/tutorial.html](http://www.serpentine.com/criterion/tutorial.html)
464 |
465 | [^5]: Code and documentation for operf-macro [https://github.com/OCamlPro/operf-macro](https://github.com/OCamlPro/operf-macro)
466 |
467 | [^6]: Code for sandmark [https://github.com/ocamllabs/sandmark](https://github.com/ocamllabs/sandmark)
468 |
469 | [^7]: CPython continuous benchmarking site [https://speed.python.org/](https://speed.python.org/)
470 |
471 | [^8]: PyPy continuous benchmarking site [http://speed.pypy.org/](http://speed.pypy.org/)
472 |
473 | [^9]: Python Performance Benchmark Suite [https://pyperformance.readthedocs.io/](https://pyperformance.readthedocs.io/)
474 |
475 | [^10]: Codespeed web app for visualization of performance data [https://github.com/tobami/codespeed](https://github.com/tobami/codespeed)
476 |
477 | [^11]: LNT software [http://llvm.org/docs/lnt/](http://llvm.org/docs/lnt/) and the performance site [https://lnt.llvm.org/](https://lnt.llvm.org/)
478 |
479 | [^12]: Description of Haskell issues with hardcoded performance continuous-integration tests [https://ghc.haskell.org/trac/ghc/wiki/Performance/Tests](https://ghc.haskell.org/trac/ghc/wiki/Performance/Tests)
480 |
481 | [^13]: ‘2017 summer of code’ work on improving Haskell performance integration tests [https://github.com/jared-w/HSOC2017/blob/master/Proposal.pdf](https://github.com/jared-w/HSOC2017/blob/master/Proposal.pdf)
482 |
483 | [^14]: Live tracking site [https://perf.rust-lang.org/](https://perf.rust-lang.org/) and code for it [https://github.com/rust-lang-nursery/rustc-perf](https://github.com/rust-lang-nursery/rustc-perf)
484 |
485 | [^15]: More information here: [http://kcsrk.info/multicore/ocaml/benchmarks/2018/09/13/1543-multicore-ci/](http://kcsrk.info/multicore/ocaml/benchmarks/2018/09/13/1543-multicore-ci/)
486 |
487 | [^16]: A nice description of why first-parent helps is here [http://www.davidchudzicki.com/posts/first-parent/](http://www.davidchudzicki.com/posts/first-parent/)
488 |
489 | [^17]: For more details please see [https://github.com/ocaml-bench/ocaml_bench_scripts/#notes-on-hardware-and-os-settings-for-linux-benchmarking](https://github.com/ocaml-bench/ocaml_bench_scripts/#notes-on-hardware-and-os-settings-for-linux-benchmarking)
490 |
491 | [^18]: There is some academic literature in this direction; for example “Rigorous benchmarking in reasonable time”, Kalibera et al [https://dl.acm.org/citation.cfm?id=2464160](https://dl.acm.org/citation.cfm?id=2464160) and “STABILIZER: statistically sound performance evaluation”, Curtsinger et al, [https://dl.acm.org/citation.cfm?id=2451141](https://dl.acm.org/citation.cfm?id=2451141). However, randomization approaches need to take care that the layout randomization distribution captures the relevant features of layout randomization that real user binaries will see from ASLR, link-order or dynamic library linking.
492 |
493 | [^19]: We feel that there must be some memory latency bottlenecks in the sandmark macro benchmarking suite, but we have yet to deep-dive and investigate a performance instability due to noise in memory latency for fetching instructions to execute. That is the performance may be bottlenecked on instruction memory fetch, but we haven’t seen instruction memory fetch latency being very noisy between benchmark runs in our setup.
494 |
495 | [^20]: Google’s AutoFDO [https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45290.pdf](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45290.pdf), Facebook’s HFSort [https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf](https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf) and Facebook’s Bolt [https://github.com/facebookincubator/BOLT](https://github.com/facebookincubator/BOLT) are recent examples reporting improvements in some deployed data-centre workloads.
496 |
497 | [^21]: The effects are not limited to x86 as it has also been observed on ARM Cortex-A53 and Cortex-A57 processors with LLVM [https://www.youtube.com/watch?v=COmfRpnujF8](https://www.youtube.com/watch?v=COmfRpnujF8)
498 |
499 | [^22]: The OCaml example is here [https://github.com/ocaml-bench/ocaml_bench_scripts/tree/master/stability_example](https://github.com/ocaml-bench/ocaml_bench_scripts/tree/master/stability_example), for the super curious there are yet more C++ examples to be had (although not necessarily the same underlying micro mechanism and all dependent on processor) [https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues](https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues)
500 |
501 | [^23]: The Q&A at LLVM dev meeting videos [https://www.youtube.com/watch?v=IX16gcX4vDQ](https://www.youtube.com/watch?v=IX16gcX4vDQ) and [https://www.youtube.com/watch?v=COmfRpnujF8](https://www.youtube.com/watch?v=COmfRpnujF8) are examples of discussion around the area. The general area of code layout is also a problem area in the academic community “Producing wrong data without doing anything obviously wrong!”, Mytkowicz et al [https://dl.acm.org/citation.cfm?id=1508275](https://dl.acm.org/citation.cfm?id=1508275)
502 |
503 | [^24]: For more on LLVM alignment options, see [https://dendibakh.github.io/blog/2018/01/25/Code_alignment_options_in_llvm](https://dendibakh.github.io/blog/2018/01/25/Code_alignment_options_in_llvm)
504 |
505 |
--------------------------------------------------------------------------------
/profiling_notes.md:
--------------------------------------------------------------------------------
1 | # Profiling OCaml code
2 | _Tom Kelly ([ctk21@cl.cam.ac.uk](mailto:ctk21@cl.cam.ac.uk))_,
3 | _Sadiq Jaffer_
4 |
5 | There are many ways of profiling OCaml code. Here we describe several that we have used, which should help you get up and running.
6 |
7 |
8 | ## perf record profiling
9 |
10 | perf is a statistical profiler on Linux that can profile a binary without needing instrumentation.
11 |
12 | Basic recording of an OCaml program:
13 | ```
14 | # vanilla
15 | perf record --call-graph dwarf -- program-to-run program-arguments
16 | # do not inherit counters into child processes (-i); count user-space cycles only (-e cycles:u)
17 | perf record --call-graph dwarf -i -e cycles:u -- program-to-run program-arguments
18 | # expand sampled stack (which will make the traces larger) to capture deeper call stacks
19 | perf record --call-graph dwarf,32768 -i -e cycles:u -- program-to-run program-arguments
20 | ```
21 |
22 | Basic viewing of a perf recording
23 | ```
24 | # top down view
25 | perf report --children
26 | # bottom up view
27 | perf report --no-children
28 | # bottom up view, including kernel symbols
29 | perf report --no-children --kallsyms /proc/kallsyms
30 | # inverted call-graph can be useful on some versions of perf
31 | perf report --inverted
32 | ```
33 |
34 | To allow non-root users to use perf on Linux, you may need to execute:
35 | ```
36 | echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
37 | ```
38 |
39 | Installing debug symbols on Ubuntu:
40 | https://wiki.ubuntu.com/Debug%20Symbol%20Packages
41 |
42 | In particular the kernel symbols:
43 | `apt-get install linux-image-$(uname -r)-dbgsym`
44 | and libgmp
45 | `apt-get install `
46 |
47 | To allow non-root users to see the kernel symbols:
48 | ```
49 | echo 0 | sudo tee /proc/sys/kernel/kptr_restrict
50 | ```
51 |
52 | For many more perf examples:
53 | http://www.brendangregg.com/perf.html
54 |
55 | The "Profiling with perf" zine by Julia Evans is a great resource for getting the hang of perf:
56 | https://jvns.ca/perf-zine.pdf
57 |
58 | Pros:
59 | - it's fast as it samples the running program
60 | - the CLI can be a very quick way of getting a call chain to start from in a big program
61 |
62 | Limitations:
63 | - sometimes can get strange annotations
64 | - it's statistical so you don't get full coverage or call counts
65 | - the OCaml runtime functions can be confusing if you aren't familiar with them
66 |
67 | ### perf flamegraphs
68 |
69 | You can also visualize the perf data using the FlameGraph package
70 | ```
71 | git clone https://github.com/brendangregg/FlameGraph   # or download it from github
72 | cd FlameGraph
73 | perf record --call-graph dwarf -- program-to-run program-arguments
74 | perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf-flamegraph.svg
75 | ```
76 |
77 | The flamegraph output allows you to see the dynamics of the function calls. For more on flamegraphs see:
78 | http://www.brendangregg.com/perf.html#FlameGraphs
79 | http://www.brendangregg.com/flamegraphs.html
80 | https://github.com/brendangregg/FlameGraph
81 |
82 | Another front end to perf files is the speedscope web application:
83 | https://www.speedscope.app/
84 |
85 | ## OCaml gprof support
86 |
87 | Documentation on gprof is here:
88 | https://sourceware.org/binutils/docs/gprof/
89 |
90 | To run your OCaml program with gprof you need to do:
91 | ```
92 | ocamlopt -o myprog -p other-options files
93 | ./myprog
94 | gprof myprog
95 | ```
96 |
97 | The call counts are accurate and the timing information is statistical, but your program now contains extra instrumentation code which wouldn't be in a release binary.
98 |
99 | Pros:
100 | - fairly quick way to get a call graph without having to use other tools
101 | - call counts are accurate
102 |
103 | Limitations:
104 | - your program is instrumented, so it isn't what you will really run in production
105 | - not obvious how to profile 3rd party packages built through OPAM
106 |
107 | Open questions:
108 | - is there a way to build to get all the stdlib covered?
109 | - is there a way to build to get opam covered?
110 |
111 | ## Callgrind
112 |
113 | A good introduction to callgrind, if you've never used it, is here:
114 | https://web.stanford.edu/class/archive/cs/cs107/cs107.1196/resources/callgrind
115 |
116 | You can run callgrind from a standard valgrind install on your OCaml binary (as with any other binary):
117 | ```
118 | valgrind --tool=callgrind program-to-run program-arguments
119 | callgrind_annotate callgrind.out.<pid>
120 | kcachegrind callgrind.out.<pid>
121 | ```
122 |
123 | After a while you will start to notice that you have strange symbols that don't make sense or that you can't see. There are a couple of things you can fix:
124 | - You need to make sure you have the sources available for all dependent packages (including the compiler). This can be done in opam with:
125 | ```
126 | opam switch reinstall --keep-build-dir
127 | ```
128 | - You will want to have the debugging symbols available for the standard Linux libraries; refer to your distribution for how to do that.
129 |
130 | With that done, you may still have some odd functions appearing. This can be because the OCaml compilers (at least before 4.08) don't export the size of several assembler functions (particularly `caml_c_call`) into the ELF binary. Ideally the OCaml runtime would export the size of functions in the ELF information, but right now it does not have `.size` directives in the assembler.
131 |
132 | To fix this, you will need to build a patched valgrind which will let you see them.
133 |
134 | The patch against 3.15.0 (available here http://www.valgrind.org/downloads/repository.html) is this:
135 | ```
136 | diff --git a/coregrind/m_debuginfo/readelf.c b/coregrind/m_debuginfo/readelf.c
137 | index b982a838a..8b75b260b 100644
138 | --- a/coregrind/m_debuginfo/readelf.c
139 | +++ b/coregrind/m_debuginfo/readelf.c
140 | @@ -253,6 +253,7 @@ void show_raw_elf_symbol ( DiImage* strtab_img,
141 | to piece together the real size, address, name of the symbol from
142 | multiple calls to this function. Ugly and confusing.
143 | */
144 | +#define ALLOW_ZERO_ELF_SYM 1
145 | static
146 | Bool get_elf_symbol_info (
147 | /* INPUTS */
148 | @@ -282,7 +283,8 @@ Bool get_elf_symbol_info (
149 | Bool in_text, in_data, in_sdata, in_rodata, in_bss, in_sbss;
150 | Addr text_svma, data_svma, sdata_svma, rodata_svma, bss_svma, sbss_svma;
151 | PtrdiffT text_bias, data_bias, sdata_bias, rodata_bias, bss_bias, sbss_bias;
152 | -# if defined(VGPV_arm_linux_android) \
153 | +# if defined(ALLOW_ZERO_ELF_SYM) \
154 | + ||defined(VGPV_arm_linux_android) \
155 | || defined(VGPV_x86_linux_android) \
156 | || defined(VGPV_mips32_linux_android) \
157 | || defined(VGPV_arm64_linux_android)
158 | @@ -475,7 +477,8 @@ Bool get_elf_symbol_info (
159 | in /system/bin/linker: __dl_strcmp __dl_strlen
160 | */
161 | if (*sym_size_out == 0) {
162 | -# if defined(VGPV_arm_linux_android) \
163 | +# if defined(ALLOW_ZERO_ELF_SYM) \
164 | + || defined(VGPV_arm_linux_android) \
165 | || defined(VGPV_x86_linux_android) \
166 | || defined(VGPV_mips32_linux_android) \
167 | || defined(VGPV_arm64_linux_android)
168 | ```
169 |
170 | You can then run the following, which will allow callgrind to skip through `caml_c_call`, to create a potentially more useful callgraph:
171 | ```
172 | valgrind --tool=callgrind --fn-skip=_init --fn-skip=caml_c_call -- program-to-run program-arguments
173 | ```
174 |
175 | Pros:
176 | - the function call counts are accurate
177 | - the instruction count information is very accurate
178 | - you can instrument all the way to basic block instructions and jumps taken
179 | - kcachegrind gives you a graphical front end which can sometimes be easier to navigate
180 | - with the `--fn-skip` options you can look through `caml_c_call` to get a call chain from OCaml through underlying libraries (including the OCaml runtime)
181 |
182 | Limitations:
183 | - quite slow to run
184 | - not supported out of the box; right now, to get the function skip, you will need to compile your own valgrind (this should change following PR8708 on the OCaml compiler)
185 | - recursive functions give confusing total function cost scores which should be ignored
186 | - the OCaml runtime functions can be confusing if you aren't familiar with them
187 |
188 | ## OCaml Landmarks
189 |
190 | Landmarks allow you to instrument at the OCaml level and give visibility into call chains, call counts and memory allocation within instrumented blocks:
191 | https://github.com/LexiFi/landmarks
192 |
193 | There are three ways to instrument a program: manually, using manual ppx annotations, or fully automatically using a ppx mode.
194 |
195 | ### Manual instrumentation
196 | You set up your landmarks with `register` and need to call `enter` and `exit` on each landmark you are interested in. Care needs to be taken to handle exceptions so that the `enter`/`exit` pairs match up, as in the sketch below.
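
A minimal sketch of manual instrumentation; the `register`/`enter`/`exit`/`start_profiling` calls follow the landmarks README, so check the current API for exact signatures:

```
(* Minimal manual-instrumentation sketch with the Landmark module. *)
let expensive_lm = Landmark.register "expensive_step"

let expensive_step xs =
  Landmark.enter expensive_lm;
  (* make sure the landmark is exited even if an exception escapes *)
  match List.fold_left ( + ) 0 xs with
  | result -> Landmark.exit expensive_lm; result
  | exception e -> Landmark.exit expensive_lm; raise e

let () =
  (* profiling can also be enabled with the OCAML_LANDMARKS environment variable *)
  Landmark.start_profiling ();
  Printf.printf "%d\n" (expensive_step (List.init 1_000_000 (fun i -> i)))
```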
197 |
198 | ### Manual ppx extensions
199 | You annotate specific OCaml definitions with an attribute handled by the ppx, e.g. `let[@landmark] f = ...`. This handles all the `enter`/`exit` pairing for you and can be quite convenient. You need to add the ppx to your build.
200 |
201 | ### Automatic instrumentation
202 | You run the pre-processor with `--auto` and this will instrument all top level functions within a module. You need to add the ppx to your build.
203 |
204 |
205 | #### ocamlbuild and ppx tags
206 | To get things working with ocamlbuild and landmarks you can add the following to your `_tags` file:
207 | ```
208 | ppx(`ocamlfind query landmarks.ppx`/ppx.exe --as-ppx)
209 | ```
210 | or for auto instrumentation:
211 | ```
212 | ppx(`ocamlfind query landmarks.ppx`/ppx.exe --as-ppx --auto)
213 | ```
214 |
215 | Pros:
216 | - it's very quick to set up automatic profiling and then determine which functions are the heavy hitters
217 | - it's easy to selectively profile bits of OCaml code you are interested in
218 |
219 | Limitations:
220 | - the timing information includes the overhead of any landmarks within another landmark
221 | - the allocation information includes the overhead of any landmarks within another landmark
222 | - the addition of the landmark will change the compiled code and can remove inlining opportunities that would occur in a release binary
223 | - can slow down your binary in some cases
224 |
225 | ## OCaml Spacetime
226 |
227 | The OCaml Spacetime profiler allows you to see where in your program memory is allocated.
228 |
229 | Setup an opam switch with Spacetime and all your opam packages:
230 | ```
231 | $ opam switch export opam_existing_universe
232 | $ opam switch create 4.07.1+spacetime
233 | $ opam switch import opam_existing_universe
234 | ```
235 |
236 | NB: you might hit a stack size issue when importing the universe; you can expand the stack to 128MB in bash with `ulimit -S -s 131072`.
237 |
238 | In the 4.07.1+spacetime switch, build your binary. Now run the executable, using the environment variable OCAML_SPACETIME_INTERVAL to turn on profiling and specify how frequently Spacetime should inspect the OCaml heap (in milliseconds):
239 | ```
240 | $ OCAML_SPACETIME_INTERVAL=100 program-to-run program-arguments
241 | ```
242 | This will output a file of the form `spacetime-<pid>`.
243 |
244 | To view the information in this file, we need to process it with `prof_spacetime`. Right now, this only runs on OCaml 4.06.1 and lower. Install as follows:
245 | ```
246 | $ opam switch 4.06.1
247 | $ opam install prof_spacetime
248 | ```
249 | To post-process the results do
250 | ```
251 | $ prof_spacetime process -e <executable> spacetime-<pid>
252 | ```
253 | This will produce a `spacetime-<pid>.p` file.
254 |
255 | To serve this and interact through a web-browser do:
256 | ```
257 | $ prof_spacetime serve -p spacetime-<pid>.p
258 | ```
259 |
260 | To look at the data through a CLI interface do:
261 | ```
262 | $ prof_spacetime view -p spacetime-<pid>.p
263 | ```
264 |
265 | For more on Spacetime and its output see:
266 | - Jane Street blog post on Spacetime: https://blog.janestreet.com/a-brief-trip-through-spacetime/
267 | - OCaml compiler Spacetime documentation: https://caml.inria.fr/pub/docs/manual-ocaml/spacetime.html
268 |
269 | Pros:
270 | - gives rich information about when, during the run of the program, objects are allocated
271 | - gives rich information about which functions are allocating on the minor or major heap
272 | - the web-browser gui is quick and intuitive to navigate
273 |
274 | Limitations:
275 | - can slow down your binary in some cases and require more memory to run
276 | - can be a pain to juggle switches to view the profile
277 | - is focused on memory rather than runtime performance (although these are often heavily linked)
278 |
279 | ## Statistical memory profiling for OCaml
280 |
281 | Install the opam statistical-memprof switch:
282 | ```
283 | $ opam switch create 4.07.1+statistical-memprof
284 | $ opam switch import opam_existing_universe
285 | ```
286 |
287 | Download the `MemprofHelpers` module from:
288 | https://github.com/jhjourdan/ocaml/blob/memprof/memprofHelpers.ml
289 |
290 | And within your binary you need to execute
291 | ```
292 | MemprofHelpers.start 1E-3 20 100
293 | ```
294 |
295 | You may also need to set up a dump function by looking at `MemprofHelpers.start` if delivering SIGUSR1 to trigger a snapshot is not easy. Your mileage may vary on how useful the output is to look at.
296 |
297 | Slightly out-of-date GitHub branch:
298 | https://github.com/jhjourdan/ocaml/tree/memprof
299 |
300 | Document describing the design:
301 | https://hal.inria.fr/hal-01406809/document
302 |
303 | Pros:
304 | - quick to run
305 | - gives rich information about which functions are allocating on the minor or major heap
306 |
307 | Limitations:
308 | - you need to write code in your binary to get this up and running
309 | - is focused on memory rather than runtime performance (although these are often heavily linked)
310 |
311 | Open questions:
312 | - is there a better way than having to do `MemprofHelpers.start` in your binary?
313 |
314 | ## strace
315 |
316 | This tool is very useful for knowing what system calls your program is making. You can wrap a run of your binary with:
317 | ```
318 | strace -o strace.out program-to-run program-arguments    # system-call trace written to strace.out
319 | ```
320 |
321 | ## Compiler explorer
322 |
323 | The compiler explorer supports OCaml as a language:
324 | https://godbolt.org
325 |
326 | Interesting things to try are:
327 | ```
328 | let cmp a b =
329 | if a > b then a else b
330 |
331 | let cmp_i (a:int) (b:int) =
332 | if a > b then a else b
333 |
334 | let cmp_s (a:string) (b:string) =
335 | if a > b then a else b
336 |
337 | ```
338 |
339 | Or what happens with a closure:
340 | ```
341 | let fn ys z =
342 | let g = ((+) z) in
343 | List.map g ys
344 |
345 | let fn_i ys (z:int) =
346 | let g = ((+) z) in
347 | List.map g ys
348 |
349 | ```
350 |
351 | You can also use the `Analysis` mode to see how specific chunks of assembly will be executed (try something like: `--mcpu=skylake --timeline`).
352 |
353 |
354 | Pros:
355 | - can give you an intuition of what your code will turn into
356 |
357 | Limitations:
358 | - instructions don't tell you about runtime performance directly; you will always need to benchmark
359 | - not really an efficient way to tackle existing code
360 |
361 | ## Other resources
362 |
363 | OCaml.org tutorial on performance and profiling:
364 | https://ocaml.org/learn/tutorials/performance_and_profiling.html
365 |
366 | OCamlverse links for optimizing performance:
367 | https://ocamlverse.github.io/content/optimizing_performance.html
368 |
369 | Real World OCaml's compiler backend section has a little:
370 | https://dev.realworldocaml.org/compiler-backend.html#compiling-fast-native-code
371 |
372 |
373 |
--------------------------------------------------------------------------------