├── .gitignore ├── .gitmodules ├── Cheatsheet.md ├── README.md └── img ├── flamegraph1.png ├── kcachegrind1.png ├── kcachegrind2.png ├── kcachegrind3.png ├── perf1.png ├── perf2.png ├── perf3.png ├── perf4.png ├── perf5.png ├── perf6.png ├── tracecompass1.png └── tracecompass2.png /.gitignore: -------------------------------------------------------------------------------- 1 | .vscode -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "c-ray"] 2 | path = c-ray 3 | url = https://github.com/Erdk/c-ray 4 | [submodule "FlameGraph"] 5 | path = FlameGraph 6 | url = https://github.com/brendangregg/FlameGraph 7 | [submodule "blkin"] 8 | path = blkin 9 | url = https://github.com/ceph/blkin 10 | -------------------------------------------------------------------------------- /Cheatsheet.md: -------------------------------------------------------------------------------- 1 | # Cheatsheet 2 | 3 | ## gprof 4 | 5 | - compile application with: `-O2 -g -pg --no-pie -fPIC`, link with `-pg` 6 | - to gather profile: just run your program 7 | - to inspect profile: `gprof <binary> gmon.out` 8 | - flat profile: `gprof -p <binary> gmon.out` 9 | - flat profile of chosen function: `gprof -p<function> <binary> gmon.out` 10 | - flat profile without chosen function: `gprof -P<function> <binary> gmon.out` 11 | - call graph: `gprof -q <binary> gmon.out` 12 | - call graph of specific function: `gprof -q<function> <binary> gmon.out` 13 | - call graph without chosen function: `gprof -Q<function> <binary> gmon.out` 14 | - annotated sources: `gprof -A <binary> gmon.out` 15 | 16 | ## perf 17 | - compile application with: `-O2`, if you need annotated C source compile with `-O2 -g`, 18 | - to get basic runtime statistics of application: `perf stat <app>` 19 | - to gather flat profile: `perf record <app>` 20 | - to gather call graph: `perf record -g <app>` 21 | 22 | executing with `-g` produces profile data which contains both a flat profile and a call graph; you can inspect both of them with the commands below.
23 | - to inspect flat profile: `perf report` 24 | - to inspect call graph: `perf report -g` 25 | - to compare two profiles: `perf diff <file1> <file2>` 26 | 27 | When recording a profile you can set a custom filename; to do so run `perf record -o <filename> ...` 28 | - get systemwide realtime look at performance counters: `sudo perf top` 29 | - get realtime look at performance counters: `perf top -p <pid>` 30 | - generate flame graph: 31 | ```bash 32 | $ perf record -g ./<app> 33 | $ perf script | <FlameGraph dir>/stackcollapse-perf.pl | <FlameGraph dir>/flamegraph.pl > <app>.svg 34 | ``` 35 | 36 | ## callgrind 37 | - compile with: `-O2 -g` 38 | - to gather profile: `valgrind --tool=callgrind <app>` 39 | - flat profile: `callgrind_annotate callgrind.out.<pid>` 40 | 41 | after each execution there will be a new file starting with `callgrind.out.` with the `pid` of the process appended 42 | 43 | - call graph, callers: `callgrind_annotate --tree=caller callgrind.out.<pid>` 44 | - call graph, callees: `callgrind_annotate --tree=calling callgrind.out.<pid>` 45 | - call graph, full tree: `callgrind_annotate --tree=both callgrind.out.<pid>` 46 | 47 | To use `callgrind_control` the application must be executed with `valgrind --tool=callgrind ...` 48 | - get realtime statistics: `callgrind_control -s <pid>` 49 | - get stack/back traces of running application: `callgrind_control -b <pid>` 50 | 51 | ## lttng 52 | - create session: `lttng create <session name>` 53 | - list events: `lttng list -u` 54 | - enable event: `lttng enable-event -u <provider>:<event>` 55 | - start recording: `lttng start` 56 | - stop recording: `lttng stop` 57 | - view trace: `lttng view` 58 | - finish session (this **won't** erase data): `lttng destroy` 59 | - link against to enable: `-llttng-ust -ldl` 60 | - basic tracepoint: `tracef()` defined in `<lttng/tracef.h>` 61 | 62 | Basic trace: 63 | 64 | tp.h 65 | ```H 66 | #undef TRACEPOINT_PROVIDER 67 | #define TRACEPOINT_PROVIDER cray 68 | 69 | #undef TRACEPOINT_INCLUDE 70 | #define TRACEPOINT_INCLUDE "./tp.h" 71 | 72 | #if !defined(_TP_H) || defined(TRACEPOINT_HEADER_MULTI_READ) 73 | #define _TP_H 74 | 75 | #include <lttng/tracepoint.h> 76 | 77 | TRACEPOINT_EVENT( 78 | cray, 79 | my_first_tracepoint, 80 | TP_ARGS( 81 | int, my_integer_arg, 82 | char*, my_string_arg ), 83 | 84 | TP_FIELDS( 85 | ctf_string(my_string_field, my_string_arg) 86 | ctf_integer(int, my_integer_field, my_integer_arg) 87 | ) 88 | ) 89 | 90 | #endif /* _TP_H */ 91 | 92 | #include <lttng/tracepoint-event.h> 93 | ``` 94 | 95 | tp.c 96 | ```C 97 | #define TRACEPOINT_CREATE_PROBES 98 | #define TRACEPOINT_DEFINE 99 | 100 | #include "tp.h" 101 | ``` 102 | 103 | In code: 104 | ```C 105 | #include "tp.h" 106 | ...
107 | // in some function 108 | tracepoint(cray, my_first_tracepoint, <int value>, <string value>); 109 | ``` 110 | 111 | Project compile with: 112 | `-I<dir with tp.h> -llttng-ust -ldl` -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Introduction to profiling 2 | 3 | ## Table of contents 4 | 5 | * [x] [Get sources](#get-sources) 6 | * [x] [Introduction](#introduction) 7 | * [x] [Throw against the wall and see what sticks](#throw-against-the-wall-and-see-what-sticks) 8 | * [x] [Pros & cons](#pros--cons) 9 | * [x] [gprof](#gprof) 10 | * [x] [perf](#perf) 11 | * [x] [perf report](#perf-report) 12 | * [x] [perf diff \<filename1\> \<filename2\>](#perf-diff-filename1-filename2) 13 | * [x] [perf data](#perf-data-convert---to-ctf-) 14 | * [x] [FlameGraph](#flamegraph) 15 | * [x] [callgrind](#callgrind) 16 | * [x] [kcachegrind](#kcachegrind) 17 | * [x] [summary](#summary) 18 | * ["Scalpel"](#scalpel) 19 | * [lttng & babeltrace](#lttng--babeltrace) 20 | * [x] [How to gather profile](#how-to-gather-profile) 21 | * [x] [Own trace](#own-trace) 22 | * [x] [Multiple tracepoints](#multiple-tracepoints) 23 | * [x] [Inspect trace with TraceCompass](#inspect-trace-with-tracecompass) 24 | * [ ] [zipkin & blkin](#zipkin--blkin) 25 | * [ ] [Other tools](#other-tools) 26 | * [ ] [go prof](#go-prof) 27 | * [x] [Sources](#sources) 28 | 29 | 30 | # Get sources 31 | 32 | Clone the repository and download the submodules: 33 | 34 | ```bash 35 | $ git clone https://github.com/Erdk/introtoprofiling 36 | $ cd introtoprofiling 37 | $ git submodule update --init 38 | ``` 39 | 40 | After (or before) cloning the repository install the applications and libraries from this list: 41 | 42 | - gprof (should be in distro repo) 43 | - perf (should be in distro repo) 44 | - valgrind (should be in distro repo) 45 | - lttng & babeltrace (if not available from repo check here: http://lttng.org/download/ for download instructions) 46 | - blkin (`git clone https://github.com/ceph/blkin`) 47 | - zipkin (`docker run -d -p 9411:9411 openzipkin/zipkin` or `wget -O zipkin.jar 'https://search.maven.org/remote_content?g=io.zipkin.java&a=zipkin-server&v=LATEST&c=exec' && java -jar zipkin.jar`) 48 | - (optional) go pprof 49 | 50 | # Introduction 51 | 52 | # "Throw against the wall and see what sticks" 53 | 54 | `gprof`, `perf` and `callgrind` (part of the `valgrind` suite) are relatively easy to use without any changes to the code. Thanks to this it's easy to use them with your application to get an overview of what's going on, which functions are called most frequently and in which functions the application spends most of its time. The penalty for the ease of use is a performance hit related to tracing the whole application, and relatively low granularity. `gprof` is the oldest of those applications; I've included it because it is widely available (and maybe because of historical reasons). Most likely you'll be more satisfied with `perf` (in terms of functionality and performance), or with `valgrind` when 100% correctness of the result is more important than execution time. 55 | 56 | Each of those tools produces three types of results: 57 | 58 | - flat profile: shows how much time your program spent in each function, and how many times that function was called. As the name suggests, this view doesn't show any relations between functions, 59 | - call graph: for each function shows which functions called it, which other functions were called by it, and how many times.
There is also an estimate of how much time was spent in the subroutines of each function. 60 | - annotated source code: returns the source code with per-line performance metrics, where applicable. 61 | 62 | ## Pros & cons 63 | 64 | Pros: 65 | 66 | - easy to enable and use, 67 | - all the tools give a good picture of where the possible hotspots are, which functions are called the most and which functions are the most computationally expensive. 68 | 69 | Cons: 70 | 71 | `gprof` & `perf`: 72 | 73 | - the computed times are statistical: both programs sample the call stack at fixed time intervals, 74 | - lower granularity than lttng, 75 | - adds overhead. 76 | 77 | `callgrind`: 78 | 79 | - executes code on a simulator on a single CPU core and tracks every call, so performance can be very low; in applications relying on network connections this can lead to timeouts and abnormal program behavior. 80 | 81 | ## gprof 82 | 83 | Simple tool with a very easy way to enable it: add '-pg' to CFLAGS and LDFLAGS (at the compiling and linking stages). Also, with recent GCC versions you have to add '--no-pie -fPIC' to the compiler options for it to work. The key advantage here is that there's no need to add new code (apart from the Makefile). In the `c-ray` directory you can test `gprof`: 84 | 85 | ```bash 86 | $ cd introtoprofiling/c-ray 87 | $ make profile 88 | ``` 89 | 90 | This creates the binary 'bin/c-ray-prof'. After executing it: 91 | 92 | ```bash 93 | $ bin/c-ray-prof 94 | ``` 95 | 96 | There will be a new file in the directory: `gmon.out`. With the `gprof` executable you can inspect the results. 97 | 98 | ```bash 99 | $ gprof bin/c-ray-prof gmon.out | less 100 | ``` 101 | 102 | You should see something like this: 103 | 104 | ``` 105 | Flat profile: 106 | 107 | Each sample counts as 0.01 seconds. 108 | % cumulative self self total 109 | time seconds seconds calls s/call s/call name 110 | 44.89 208.66 208.66 495703455 0.00 0.00 rayIntersectWithAABB 111 | 17.60 290.46 81.80 62948596 0.00 0.00 rayIntersectsWithNode 112 | 6.91 322.60 32.14 247988102 0.00 0.00 vectorCross 113 | 6.57 353.15 30.55 168061837 0.00 0.00 rayIntersectsWithPolygon 114 | 5.41 378.32 25.17 620561735 0.00 0.00 subtractVectors 115 | 5.15 402.26 23.94 8180655 0.00 0.00 getClosestIsect 116 | 4.18 421.70 19.44 408107946 0.00 0.00 scalarProduct 117 | 1.48 428.58 6.88 1799410 0.00 0.00 getHighlights 118 | ... 119 | ``` 120 | 121 | The output is relatively long; after each section there's a lengthy description of the columns. To suppress it run gprof with '-b' (brief), but I encourage you to read it at least once ;) 122 | 123 | Now we get to the options which can help with analysis. When you only want to inspect the flat profile use the '-p' switch: 124 | 125 | ```bash 126 | $ gprof -b -p bin/c-ray-prof gmon.out | head 127 | Flat profile: 128 | 129 | Each sample counts as 0.01 seconds.
130 | % cumulative self self total 131 | time seconds seconds calls s/call s/call name 132 | 44.89 208.66 208.66 495703455 0.00 0.00 rayIntersectWithAABB 133 | 17.60 290.46 81.80 62948596 0.00 0.00 rayIntersectsWithNode 134 | 6.91 322.60 32.14 247988102 0.00 0.00 vectorCross 135 | 6.57 353.15 30.55 168061837 0.00 0.00 rayIntersectsWithPolygon 136 | 5.41 378.32 25.17 620561735 0.00 0.00 subtractVectors 137 | 138 | ``` 139 | 140 | You could also inspect flat profile of functions matching "symspec" (function name, source file) '-p\' (note the lack of whitespace between symspec and -p): 141 | 142 | ```bash 143 | $ gprof -b -prayIntersectWithAABB bin/c-ray-prof gmon.out | head 144 | Flat profile: 145 | 146 | Each sample counts as 0.01 seconds. 147 | % cumulative self self total 148 | time seconds seconds calls us/call us/call name 149 | 100.02 208.66 208.66 495703455 0.42 0.42 rayIntersectWithAABB 150 | 151 | ``` 152 | 153 | This could be useful, when you have multiple static functions with the same name. To exclude functions from flat profile use '-P' switch: 154 | 155 | ```bash 156 | $ gprof -b -PrayIntersectWithAABB bin/c-ray-prof gmon.out | head 157 | Flat profile: 158 | 159 | Each sample counts as 0.01 seconds. 160 | % cumulative self self total 161 | time seconds seconds calls s/call s/call name 162 | 31.93 81.80 81.80 62948596 0.00 0.00 rayIntersectsWithNode 163 | 12.55 113.94 32.14 247988102 0.00 0.00 vectorCross 164 | 11.92 144.49 30.55 168061837 0.00 0.00 rayIntersectsWithPolygon 165 | 9.82 169.65 25.17 620561735 0.00 0.00 subtractVectors 166 | 9.35 193.60 23.94 8180655 0.00 0.00 getClosestIsect 167 | ``` 168 | 169 | When you compare with previous example you could see, that all values were adjusted. This could be useful when you want to exclude function you couldn't optimize and want to have a closer look on other potential candidates for optimization. 170 | 171 | In similar way you could inspect call graphs. To do this use '-q' and '-Q' switches. First display ony call graph: 172 | 173 | ```bash 174 | $ gprof -b -q bin/c-ray-prof gmon.out | head -n 27 175 | Call graph 176 | 177 | 178 | granularity: each sample hit covers 2 byte(s) for 0.00% of 464.88 seconds 179 | 180 | index % time self children called name 181 | 182 | [1] 98.3 3.86 453.32 renderThread [1] 183 | 1.71 448.86 2703244/2703244 newTrace [9] 184 | 0.21 1.11 1366674/1366676 transformCameraView [28] 185 | 0.35 0.25 1632466/10151652 normalizeVector [17] 186 | 0.54 0.00 1537256/1537256 getPixel [45] 187 | 0.18 0.00 4013744/19821218 getRandomDouble [36] 188 | 0.11 0.00 1441645/13375163 addColors [34] 189 | 0.00 0.00 6/6 computeTimeAverage [171] 190 | 0.00 0.00 5/5 getTile [172] 191 | ----------------------------------------------- 192 | [2] 96.9 1.71 448.86 2703244+4995853 [2] 193 | 0.53 274.74 1721452 getLighting [5] 194 | 0.44 173.35 3080079 newTrace [9] 195 | 0.74 0.77 2897566 getReflectsAndRefracts [27] 196 | ----------------------------------------------- 197 | 9.68 163.67 3308211/8180655 newTrace [9] 198 | 14.26 241.06 4872444/8180655 isInShadow [7] 199 | [3] 92.2 23.94 404.72 8180655 getClosestIsect [3] 200 | 81.80 315.75 62948596/62948596 rayIntersectsWithNode [4] 201 | 1.93 5.25 15315957/15315957 rayIntersectsWithSphereTemp [14] 202 | 203 | ``` 204 | 205 | Line with index denotes 'main' function of call graph. Functions above are 'parents', callers. Functions below are 'children', callees. 
In this example, call graph with index 3: 206 | 207 | Function | relation 208 | ---------|--------- 209 | newTrace | top most function 210 | isInShadow | function called by newTrace 211 | getClosestInsect | main function of this call graph 212 | rayIntersectsWithNode | function called by getClosestInsect 213 | rayIntersectsWithSphereTemp | function called by rayIntersectsWithNode 214 | 215 | Numbers in 'called' column denotes respectively number of calls in scope of the parent and number of total calls. Analogous to '-p' switch with '-q\' you could inspect call graph of function you choose: 216 | 217 | ```bash 218 | $ gprof -b -qrayIntersectsWithNode bin/c-ray-prof gmon.out 219 | Call graph 220 | 221 | 222 | granularity: each sample hit covers 2 byte(s) for 0.00% of 464.88 seconds 223 | 224 | index % time self children called name 225 | 213618252 rayIntersectsWithNode [4] 226 | 81.80 315.75 62948596/62948596 getClosestIsect (3) 227 | [4] 85.5 81.80 315.75 62948596+213618252 rayIntersectsWithNode [4] 228 | 208.66 0.00 495703455/495703455 rayIntersectWithAABB [8] 229 | 30.55 74.51 168061837/168061837 rayIntersectsWithPolygon [10] 230 | 1.58 0.00 18565813/25178963 vectorScale [22] 231 | 0.45 0.00 7636053/15517841 addVectors [35] 232 | 213618252 rayIntersectsWithNode [4] 233 | ----------------------------------------------- 234 | 208.66 0.00 495703455/495703455 rayIntersectsWithNode [4] 235 | [8] 44.9 208.66 0.00 495703455 rayIntersectWithAABB [8] 236 | ----------------------------------------------- 237 | 30.55 74.51 168061837/168061837 rayIntersectsWithNode [4] 238 | [10] 22.6 30.55 74.51 168061837 rayIntersectsWithPolygon [10] 239 | 31.88 0.00 245938645/247988102 vectorCross [11] 240 | 23.91 0.00 589609039/620561735 subtractVectors [12] 241 | 17.05 0.00 357801568/408107946 scalarProduct [13] 242 | 1.68 0.00 6623695/6623695 uvFromValues [23] 243 | ... 244 | ``` 245 | 246 | Last, but not least: `gprof` could annotate source file with profile information. As symspec the best is to choose file (e.g. render.c) to have annotated source file and files with callees: 247 | 248 | ```C 249 | $ gprof -b -Arender.c bin/c-ray-prof gmon.out | less 250 | ... 251 | *** File .../c-ray/src/sphere.c: 252 | // 253 | // sphere.c 254 | // C-Ray 255 | // 256 | // Created by Valtteri Koskivuori on 28/02/15. 257 | // Copyright (c) 2015 Valtteri Koskivuori. All rights reserved. 258 | // 259 | 260 | #include "includes.h" 261 | #include "sphere.h" 262 | 263 | 3 -> struct sphere newSphere(struct vector pos, double radius, int materialIndex) { 264 | return (struct sphere){pos, radius, materialIndex}; 265 | } 266 | 267 | //Just check for intersection, nothing else. 
268 | ##### -> bool rayIntersectsWithSphereFast(struct lightRay *ray, struct sphere *sphere) { 269 | double A = scalarProduct(&ray->direction, &ray->direction); 270 | struct vector distance = subtractVectors(&ray->start, &sphere->pos); 271 | double B = 2 * scalarProduct(&ray->direction, &distance); 272 | double C = scalarProduct(&distance, &distance) - (sphere->radius * sphere->radius); 273 | double trigDiscriminant = B * B - 4 * A * C; 274 | if (trigDiscriminant < 0) { 275 | return false; 276 | } else { 277 | return true; 278 | } 279 | } 280 | 281 | //Calculates intersection with a sphere and a light ray 282 | 39921178 -> bool rayIntersectsWithSphere(struct lightRay *ray, struct sphere *sphere, double *t) { 283 | bool intersects = false; 284 | 285 | //Vector dot product of the direction 286 | double A = scalarProduct(&ray->direction, &ray->direction); 287 | 288 | //Distance between start of a lightRay and the sphere position 289 | struct vector distance = subtractVectors(&ray->start, &sphere->pos); 290 | 291 | double B = 2 * scalarProduct(&ray->direction, &distance); 292 | 293 | double C = scalarProduct(&distance, &distance) - (sphere->radius * sphere->radius); 294 | 295 | double trigDiscriminant = B * B - 4 * A * C; 296 | 297 | //If discriminant is negative, no real roots and the ray has missed the sphere 298 | if (trigDiscriminant < 0) { 299 | intersects = false; 300 | } else { 301 | double sqrtOfDiscriminant = sqrt(trigDiscriminant); 302 | double t0 = (-B + sqrtOfDiscriminant)/(2); 303 | double t1 = (-B - sqrtOfDiscriminant)/(2); 304 | 305 | //Pick closest intersection 306 | if (t0 > t1) { 307 | t0 = t1; 308 | } 309 | 310 | //Verify intersection is larger than 0 and less than the original distance 311 | if ((t0 > 0.001f) && (t0 < *t)) { 312 | *t = t0; 313 | intersects = true; 314 | } else { 315 | intersects = false; 316 | } 317 | } 318 | return intersects; 319 | } 320 | 321 | 322 | Top 10 Lines: 323 | 324 | Line Count 325 | 326 | 31 39921178 327 | 12 3 328 | 329 | Execution Summary: 330 | 331 | 3 Executable lines in this file 332 | 3 Lines executed 333 | 100.00 Percent of the file executed 334 | 335 | 39921181 Total number of line executions 336 | 13307060.33 Average executions per line 337 | 338 | 339 | *** File .../c-ray/src/main.c: 340 | 341 | ``` 342 | 343 | Used functions are annotated with 'num ->' prior to function definition, denoting number of executions. Also after each source file there's footer with statistics, the hot lines, execution summary and total number of executions. 344 | 345 | ## perf 346 | 347 | Perf 348 | TODO: much better than gprof, faster execution, more detailed profile, easy to profile userspace app or kernel. Better tui. Con: annotation interleaves asm with C. Annotating with C only available when app was compiled with debug information. 349 | 350 | To get trace you have to run `perf record `: 351 | - Just flat profile (default): `perf record bin/c-ray` 352 | - Flat profile + call graph: `perf record --call-graph bin/c-ray` 353 | 354 | ### perf report 355 | 356 | `$ perf report` 357 | 358 | Main window: 359 | ![perf main window](/img/perf1.png) 360 | 361 | To get help press 'h': 362 | ![perf help window](/img/perf2.png) 363 | 364 | Select function with arrow keys and press 'a' to go to the annotated view. 
Without debug information there will be only asm: 365 | ![perf annotated asm](/img/perf3.png) 366 | 367 | When the app was compiled with debug information the asm code will be interleaved with C code: 368 | ![perf annotated asm and C](/img/perf4.png) 369 | 370 | GTK2 interface: `perf report --gtk` (needs compiled-in support, e.g. the screenshot below was taken on Arch Linux, because on Fedora this is not compiled in) 371 | ![perf gtk interface](/img/perf5.png) 372 | 373 | ### perf diff \<filename1\> \<filename2\> 374 | 375 | `perf diff <filename1> <filename2>` compares two perf files. 376 | ![perf diff](/img/perf6.png) 377 | 378 | The screenshot above contains `perf diff` output of two runs of `perf record -g`. By default `perf` creates a `perf.data` file which contains the performance counters. When you run it a second time the older session will be renamed to `perf.data.old` and the new one will be saved to `perf.data`. Then when you run `perf diff` (without any other options) it will compare those two sessions, with `perf.data.old` as the baseline against which `perf.data` is compared. The first column shows the baseline profile, the second the difference in execution time of matching symbols, the third the executable or library from which the symbol originates, and the last the symbol name. By default `perf diff` sorts results by absolute delta (abs(delta)), which is why the results in the first column appear out of order. 379 | 380 | ### perf data convert --to-ctf 381 | 382 | Converts the native perf data format to CTF, understandable by babeltrace. I haven't encountered a distro in which this is enabled, but it looks like it can be done by building `perf` from sources. Only for the courageous ;) 383 | 384 | ### FlameGraph 385 | 386 | FlameGraph is a bundle of helper scripts for better visualization of `perf` results. It's good for getting an idea of how all the functions look on a time graph and how each function contributes to the total execution time: 387 | 388 | ![flamegraph svg](/img/flamegraph1.png) 389 | 390 | How to use them? First ensure you have `perl` installed, then go to the `c-ray` directory and run: 391 | ```bash 392 | $ perf record -g bin/c-ray 393 | $ perf script | ../FlameGraph/stackcollapse-perf.pl | ../FlameGraph/flamegraph.pl > perf.data.svg 394 | ``` 395 | 396 | `perf.data.svg` contains the performance trace saved as 'towers of time spans' with additional JavaScript added to ease navigation between scopes. The bottom-most functions are at the bottom of the stack, and to nobody's surprise the higher the 'span tower' the deeper the call stack was at runtime. Horizontal size (width) corresponds to the execution time. I would recommend opening this file in a web browser because the included JS greatly helps with navigating and analyzing the results. When you click one of the bars it will show you the stack from this function up; clicking 'all' (at the bottom of the graph) after inspecting will get you back to the default view. Hovering over a bar will show statistics about it in the bottom-left corner. 397 | 398 | Screenshot with a deeper call stack: 399 | 400 | ![flamegraph of the kernel](http://www.brendangregg.com/FlameGraphs/cpu-linux-tcpsend.svg) 401 | 402 | Source: http://www.brendangregg.com/FlameGraphs/cpu-linux-tcpsend.svg 403 | 404 | ## callgrind 405 | 406 | Similarly to the previous tools `callgrind` doesn't require changes to the code; it's sufficient to compile with optimizations and debug information turned on. Contrary to `gprof` and `perf` it doesn't run the code directly on the host CPU, but via its own simulator.
The counters are gathered directly from simulator's state, which produces near real-life characteristic. On the downside simplator runs only on a single thread and serializes all the code, so multi-thread, computation heavy applications will run terribly slow. 407 | 408 | To gather profile: 409 | ```bash 410 | $ valgrind --tool=callgrind bin/c-ray 411 | ``` 412 | 413 | To inspect flat profile: 414 | ``` 415 | $ callgrind_annotate callgrind.out.13671 416 | -------------------------------------------------------------------------------- 417 | Profile data file 'callgrind.out.13671' (creator: callgrind-3.13.0) 418 | -------------------------------------------------------------------------------- 419 | I1 cache: 420 | D1 cache: 421 | LL cache: 422 | Timerange: Basic block 0 - 119733969377 423 | Trigger: Program termination 424 | Profiled target: bin/c-ray (PID 13671, part 1) 425 | Events recorded: Ir 426 | Events shown: Ir 427 | Event sort order: Ir 428 | Thresholds: 99 429 | Include dirs: 430 | User annotated: 431 | Auto-annotation: off 432 | 433 | -------------------------------------------------------------------------------- 434 | Ir 435 | -------------------------------------------------------------------------------- 436 | 936,450,558,371 PROGRAM TOTALS 437 | 438 | -------------------------------------------------------------------------------- 439 | Ir file:function 440 | -------------------------------------------------------------------------------- 441 | 442 | 238,763,680,382 .../c-ray/src/bbox.c:rayIntersectWithAABB [.../c-ray/ 443 | bin/c-ray] 444 | 168,001,295,859 .../c-ray/src/raytrace.c:rayIntersectsWithNode'2 [... 445 | /c-ray/bin/c-ray] 446 | 158,259,872,671 .../c-ray/src/poly.c:rayIntersectsWithPolygon [.../c- 447 | ray/bin/c-ray] 448 | ... 449 | ``` 450 | 451 | `I1 cache`, `D1 cache`, `LL cache` refers to simulated CPU caches, in this example we didn't turn option to gather them so there's no data here. 452 | 453 | With switch `--tree=` you could turn on information about call graph: `caller`, `calling`, and `both`: 454 | 455 | ``` 456 | $ callgrind_annotate --tree=caller callgrind.out.13671 457 | ... 458 | 187,684,131,006 < .../c-ray/src/raytrace.c:rayIntersectsWithNode'2 (3842892764x) 459 | [.../c-ray/bin/c-ray] 460 | 51,079,549,376 < .../c-ray/src/raytrace.c:rayIntersectsWithNode (1060050002x) [ 461 | .../c-ray/bin/c-ray] 462 | 238,763,680,382 * .../c-ray/src/bbox.c:rayIntersectWithAABB [.../c-r 463 | ay/bin/c-ray] 464 | ... 465 | ``` 466 | 467 | ``` 468 | $ callgrind_annotate --tree=caller callgrind.out.13671 469 | ... 470 | 238,763,680,382 * .../c-ray/src/bbox.c:rayIntersectWithAABB [.../c-r 471 | ay/bin/c-ray] 472 | 473 | 168,001,295,859 * .../c-ray/src/raytrace.c:rayIntersectsWithNode'2 [.../c-ray/bin/c-ray] 474 | 346,471,804,141 > .../c-ray/src/poly.c:rayIntersectsWithPolygon (2027649930x) [.../c-ray/bin/c-ray] 475 | 1,206,571,355 > .../c-ray/src/vector.c:vectorScale (109688305x) [.../c-ray/bin/c-ray] 476 | 187,684,131,006 > .../c-ray/src/bbox.c:rayIntersectWithAABB (3842892764x) [.../c-ray/bin/c-ray] 477 | Negative repeat count does nothing at /usr/bin/callgrind_annotate line 828. 478 | 3,118,318,860,534 > .../c-ray/src/raytrace.c:rayIntersectsWithNode'2 (3488504364x) [.../c-ray/bin/c-ray] 479 | 1,316,259,660 > .../c-ray/src/vector.c:addVectors (109688305x) [.../c-ray/bin/c-ray] 480 | ... 481 | ``` 482 | 483 | ``` 484 | $ callgrind_annotate --tree=both callgrind.out.13671 485 | ... 
486 | 3,118,318,860,534 < .../c-ray/src/raytrace.c:rayIntersectsWithNode'2 (3488504364x) [.../c-ray/bin/c-ray] 487 | 704,680,062,021 < .../c-ray/src/raytrace.c:rayIntersectsWithNode (354388400x) [.../c-ray/bin/c-ray] 488 | 168,001,295,859 * .../c-ray/src/raytrace.c:rayIntersectsWithNode'2 [.../c-ray/bin/c-ray] 489 | Negative repeat count does nothing at /usr/bin/callgrind_annotate line 828. 490 | 3,118,318,860,534 > .../c-ray/src/raytrace.c:rayIntersectsWithNode'2 (3488504364x) [.../c-ray/bin/c-ray] 491 | 187,684,131,006 > .../c-ray/src/bbox.c:rayIntersectWithAABB (3842892764x) [.../c-ray/bin/c-ray] 492 | 1,316,259,660 > .../c-ray/src/vector.c:addVectors (109688305x) [.../c-ray/bin/c-ray] 493 | 1,206,571,355 > .../c-ray/src/vector.c:vectorScale (109688305x) [.../c-ray/bin/c-ray] 494 | 346,471,804,141 > .../c-ray/src/poly.c:rayIntersectsWithPolygon (2027649930x) [.../c-ray/bin/c-ray] 495 | ... 496 | ``` 497 | 498 | With `callgrind` there also comes `callgrind_control`, tool which allows to inspect application during run. 499 | 500 | To get quick statistics about running application: 501 | ```bash 502 | $ callgrind_control -s 22710 503 | PID 22710: bin/c-ray 504 | sending command status internal to pid 22710 505 | Number of running threads: 9, thread IDs: 1 2 3 4 5 6 7 8 9 506 | Events collected: Ir 507 | Functions: 575 (executed 3,805,993,131, contexts 575) 508 | Basic blocks: 5,681 (executed 31,519,541,333, call sites 1,252) 509 | ``` 510 | 511 | Inspect backtrace: 512 | ```bash 513 | $ callgrind_control -b 22710 |head -n 30 514 | PID 22710: bin/c-ray 515 | sending command status internal to pid 22710 516 | 517 | Frame: Backtrace for Thread 1 518 | [ 0] nanosleep (843 x) 519 | [ 1] usleep (842 x) 520 | [ 2] sleepMSec (843 x) 521 | [ 3] main (1 x) 522 | [ 4] (below main) (1 x) 523 | [ 5] _start (1 x) 524 | [ 6] 0x0000000000000ed0 525 | 526 | 527 | Frame: Backtrace for Thread 2 528 | [ 0] subtractVectors (289348042 x) 529 | [ 1] rayIntersectsWithPolygon (289348055 x) 530 | [ 2] rayIntersectsWithNode (25762561 x) 531 | [ 3] rayIntersectsWithNode (128149695 x) 532 | [ 4] getClosestIsect (3052621 x) 533 | [ 5] newTrace (3052621 x) 534 | [ 6] renderThread (8 x) 535 | [ 7] start_thread (8 x) 536 | [ 8] clone 537 | ... 538 | ``` 539 | 540 | With `callgrind_control -i on|off` you could turn on or off instrumentation during runtime. You could combine it with `--instr-atstart=no|yes` option to `valgrind` when you start application, to start without instrumentation, then turn it on for a few minutes to gather profile and then turn it off again. 541 | 542 | ### kcachegrind 543 | 544 | It's possible to base analysis on `callgrind_annotate` output, but better suited to this is KCacheGrind. It provides graphical interface and it provides visualization of data, which helps with analysis. It's available for Linux, and probobly for Windows as QCacheGrind (I saw old builds on sourceforge but I didn't test them...). 545 | 546 | List of callers and callees of choosen function: 547 | ![kcachegrind callers and callees](/img/kcachegrind1.png) 548 | 549 | All callers and call graph: 550 | ![kcachegrind all callers and call graph](/img/kcachegrind2.png) 551 | 552 | Visual maps of execution time: 553 | ![kcachegrind maps](/img/kcachegrind3.png) 554 | 555 | ## Summary 556 | 557 | Below are total times of execution for `gprof`, `perf` and `callgrind`. Only one run, results measured with `time`, just to show how roughly those execution times differ. 
For obvious reasons there were differences in the compilation options of our application: 558 | 559 | - gprof: -O2 -g -pg --no-pie -fPIC 560 | - perf: -O2 -g 561 | - callgrind: -O2 -g 562 | 563 | `gprof` required the most changes; `perf` only requires `-g` when you want to see annotated code in C, not only in asm. `callgrind` requires `-g` and `-O2`: the first to gather the required information, the second to have semi-decent execution time. I've tested all with `-O2` because you **want** to run the optimized executable to gather results as close to reality as possible. 564 | 565 | app | time 566 | -------------------|-------- 567 | gprof | 5119.32s user 5.38s system 762% cpu 11:12.00 total 568 | perf (flat) | 309.79s user 86.39s system 725% cpu 54.629 total 569 | perf (flat + call) | 299.25s user 90.70s system 719% cpu 54.199 total 570 | callgrind (-O2 -g) | 4699.96s user 17.50s system 100% cpu 1:18:30.59 total 571 | 572 | Keep in mind that although `callgrind` used fewer CPU seconds than `gprof`, its total wall-clock execution time was much longer. Code instrumented with `gprof` runs directly on the CPU, while with `callgrind` it runs inside `valgrind`'s simulator, which runs on a single CPU and serializes all operations. The fastest of all three was `perf`, which in my opinion is the best option to use. If for some reason you can't run `perf` (e.g. you can't prepend the command) the next option would be `gprof`. `callgrind` could be used if you really, **really** need a 100% accurate profile. Given the methodology I propose here (get a rough idea of what's going on and then use other tools to inspect specific codepaths) I don't think it would be a good match for this stage. 573 | 574 | # "Scalpel" 575 | 576 | 577 | 578 | ## lttng & babeltrace 579 | 580 | When working with `lttng` there are two phases: in the first you need to add tracepoints to the application and link against lttng. In the second, in which we gather the profile, we need to create a tracing session, enable tracepoints and then continue with execution. At the end we will stop the session and review the gathered data. 581 | 582 | The easiest way to use `lttng` is to use the `tracef` function, which accepts the same parameters as `printf`. First we need to link our application with `lttng`, preferably in a way which will allow us to enable it at build time. Check the `Makefile` in the `c-ray` directory, below `LPROFILE`: 583 | 584 | ```Makefile 585 | ifdef LTTNG_SIMPLE 586 | CFLAGS += -DLTTNG -DLTTNG_SIMPLE 587 | LFLAGS += -llttng-ust -ldl 588 | endif 589 | ``` 590 | 591 | I've added two preprocessor defines, `LTTNG` and `LTTNG_SIMPLE`, and I've added `lttng-ust` to the libraries to link against. The preprocessor defines allow us to include or exclude parts of the code at compilation time. `LTTNG` is used in `src/main.c`, at the top of the `main` function: 592 | 593 | ```C 594 | int main(int argc, char *argv[]) { 595 | #ifdef LTTNG 596 | getchar(); 597 | #endif 598 | ``` 599 | 600 | It adds a `getchar()` call at the beginning of the program execution. Without it it would be harder for us to get the whole profile, because we would lose a few traces. Adding `getchar` gives us time to review the registered traces, enable those which we need and then resume execution with all required traces being logged. Why two options? Because I will use `getchar` in all `lttng` examples and `LTTNG_SIMPLE` enables only the `tracef` function. 601 | 602 | Keep in mind that `getchar` is purely optional here, you could start tracing at any moment of program execution, e.g.
create session, enable traces defined in already running server application, wait for a few minutes (and trigger execution of codepath you're interested in, if you have mean to do this) to get some results and then finish session and inspect gathered profile. 603 | 604 | In the `src/raytrace.c` you could see how I've added `tracef` calls: 605 | 606 | ```C 607 | #ifdef LTTNG_SIMPLE 608 | #include 609 | #endif 610 | ... 611 | #ifdef LTTNG_SIMPLE 612 | tracef("before rayIntersectWithAABB"); 613 | #endif 614 | ``` 615 | 616 | If `LTTNG_SIMPLE` is defined include `tracef` header file and leave calls to it in the sourcefile. 617 | 618 | To compile `c-ray` with `tracef`: 619 | 620 | ```bash 621 | $ make clean 622 | $ LTTNG_SIMPLE=1 make 623 | ``` 624 | 625 | ### How to gather profile? 626 | 627 | First you need to create `lttng session`, to do so type `lttng create `: 628 | ```bash 629 | $ lttng create my-session 630 | ``` 631 | 632 | After creating session you need to tell `lttng` which traces you want to record. There're few default (with `tracef` example we will use it), but below I show you how to do it in the 'generic way'. In the secod terminal build your application with enabled tracing and start it: 633 | 634 | ``` 635 | $ LTTNG_SIMPLE=1 make 636 | $ bin/c-ray 637 | ``` 638 | 639 | You'll see that the program started, but it's not running any computations yet. That's because out `getchar` call. `lttng` has initiation routine which will register all traces from application and will allow to review them and enable one or many of them. In the first terminal review available traces and enable `tracef`: 640 | 641 | ```bash 642 | $ lttng list -u 643 | UST events: 644 | ------------- 645 | 646 | PID: 14173 - Name: bin/c-ray 647 | lttng_ust_tracelog:TRACE_DEBUG (loglevel: TRACE_DEBUG (14)) (type: tracepoint) 648 | lttng_ust_tracelog:TRACE_DEBUG_LINE (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 649 | lttng_ust_tracelog:TRACE_DEBUG_FUNCTION (loglevel: TRACE_DEBUG_FUNCTION (12)) (type: tracepoint) 650 | lttng_ust_tracelog:TRACE_DEBUG_UNIT (loglevel: TRACE_DEBUG_UNIT (11)) (type: tracepoint) 651 | lttng_ust_tracelog:TRACE_DEBUG_MODULE (loglevel: TRACE_DEBUG_MODULE (10)) (type: tracepoint) 652 | lttng_ust_tracelog:TRACE_DEBUG_PROCESS (loglevel: TRACE_DEBUG_PROCESS (9)) (type: tracepoint) 653 | lttng_ust_tracelog:TRACE_DEBUG_PROGRAM (loglevel: TRACE_DEBUG_PROGRAM (8)) (type: tracepoint) 654 | lttng_ust_tracelog:TRACE_DEBUG_SYSTEM (loglevel: TRACE_DEBUG_SYSTEM (7)) (type: tracepoint) 655 | lttng_ust_tracelog:TRACE_INFO (loglevel: TRACE_INFO (6)) (type: tracepoint) 656 | lttng_ust_tracelog:TRACE_NOTICE (loglevel: TRACE_NOTICE (5)) (type: tracepoint) 657 | lttng_ust_tracelog:TRACE_WARNING (loglevel: TRACE_WARNING (4)) (type: tracepoint) 658 | lttng_ust_tracelog:TRACE_ERR (loglevel: TRACE_ERR (3)) (type: tracepoint) 659 | lttng_ust_tracelog:TRACE_CRIT (loglevel: TRACE_CRIT (2)) (type: tracepoint) 660 | lttng_ust_tracelog:TRACE_ALERT (loglevel: TRACE_ALERT (1)) (type: tracepoint) 661 | lttng_ust_tracelog:TRACE_EMERG (loglevel: TRACE_EMERG (0)) (type: tracepoint) 662 | lttng_ust_tracef:event (loglevel: TRACE_DEBUG (14)) (type: tracepoint) 663 | lttng_ust_lib:unload (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 664 | lttng_ust_lib:debug_link (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 665 | lttng_ust_lib:build_id (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 666 | lttng_ust_lib:load (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 667 | lttng_ust_statedump:end (loglevel: 
TRACE_DEBUG_LINE (13)) (type: tracepoint) 668 | lttng_ust_statedump:debug_link (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 669 | lttng_ust_statedump:build_id (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 670 | lttng_ust_statedump:bin_info (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 671 | lttng_ust_statedump:start (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 672 | $ lttng enable-event -u 'lttng_ust_tracef:event' 673 | UST event lttng_ust_tracef:event created in channel channel0 674 | $ lttng start 675 | ``` 676 | 677 | The last command, `lttng start`, starts the tracing of enabled events. Now in the second terminal press \, execution will resume and `lttng` will gather the traces. Keep in mind that if you will add traces in the frequently called functions the trace will grow to very big size (traces from `src/raytrace.c` added to 88GB(!!!) after 5-6min of execution). To prevent overfilling your storage you could stop collecting trace events after chosen time (e.g. 6 min) with `lttng stop`. 678 | 679 | During collecting trace events you could check current execution statistics: 680 | 681 | ```bash 682 | $ lttng status 683 | Tracing session my-session: [active] 684 | Trace path: /home/erdk/lttng-traces/my-session-20180218-181827 685 | 686 | === Domain: UST global === 687 | 688 | Buffer type: per UID 689 | 690 | Channels: 691 | ------------- 692 | - channel0: [enabled] 693 | 694 | Attributes: 695 | Event-loss mode: discard 696 | Sub-buffer size: 524288 bytes 697 | Sub-buffer count: 4 698 | Switch timer: inactive 699 | Read timer: inactive 700 | Monitor timer: 1000000 µs 701 | Blocking timeout: 0 µs 702 | Trace file count: 1 per stream 703 | Trace file size: unlimited 704 | Output mode: mmap 705 | 706 | Statistics: 707 | Discarded events: 9223372038833876951 708 | 709 | Event rules: 710 | lttng_ust_tracef:event (type: tracepoint) [enabled] 711 | ``` 712 | 713 | When you decide you collected everything you need stop gathering events and review profile: 714 | 715 | ```bash 716 | # after execution ends: 717 | $ lttng stop 718 | $ lttng view | less 719 | [18:44:30.951782113] (+?.?????????) 
andromeda lttng_ust_tracef:event: { cpu_id = 3 }, { _msg_length = 27, msg = "before rayIntersectWithAABB" } 720 | [18:44:30.951784713] (+0.000002600) andromeda lttng_ust_tracef:event: { cpu_id = 2 }, { _msg_length = 27, msg = "before rayIntersectWithAABB" } 721 | [18:44:30.951797505] (+0.000012792) andromeda lttng_ust_tracef:event: { cpu_id = 3 }, { _msg_length = 27, msg = "before rayIntersectWithAABB" } 722 | [18:44:30.951798075] (+0.000000570) andromeda lttng_ust_tracef:event: { cpu_id = 3 }, { _msg_length = 24, msg = "rayIntersectsWithPolygon" } 723 | [18:44:30.951798954] (+0.000000879) andromeda lttng_ust_tracef:event: { cpu_id = 2 }, { _msg_length = 27, msg = "before rayIntersectWithAABB" } 724 | [18:44:30.951799398] (+0.000000444) andromeda lttng_ust_tracef:event: { cpu_id = 2 }, { _msg_length = 24, msg = "rayIntersectsWithPolygon" } 725 | [18:44:30.951799736] (+0.000000338) andromeda lttng_ust_tracef:event: { cpu_id = 3 }, { _msg_length = 27, msg = "before rayIntersectWithAABB" } 726 | [18:44:30.951800243] (+0.000000507) andromeda lttng_ust_tracef:event: { cpu_id = 3 }, { _msg_length = 24, msg = "rayIntersectsWithPolygon" } 727 | [18:44:30.951800478] (+0.000000235) andromeda lttng_ust_tracef:event: { cpu_id = 2 }, { _msg_length = 27, msg = "before rayIntersectWithAABB" } 728 | [18:44:30.951800893] (+0.000000415) andromeda lttng_ust_tracef:event: { cpu_id = 2 }, { _msg_length = 24, msg = "rayIntersectsWithPolygon" } 729 | [18:44:30.951801014] (+0.000000121) andromeda lttng_ust_tracef:event: { cpu_id = 3 }, { _msg_length = 29, msg = "after intersecting with nodes" } 730 | [18:44:30.951801512] (+0.000000498) andromeda lttng_ust_tracef:event: { cpu_id = 2 }, { _msg_length = 29, msg = "after intersecting with nodes" } 731 | [18:44:30.951801890] (+0.000000378) andromeda lttng_ust_tracef:event: { cpu_id = 3 }, { _msg_length = 27, msg = "before rayIntersectWithAABB" } 732 | [18:44:30.951801950] (+0.000000060) andromeda lttng_ust_tracef:event: { cpu_id = 2 }, { _msg_length = 27, msg = "before rayIntersectWithAABB" } 733 | [18:44:30.951802927] (+0.000000977) andromeda lttng_ust_tracef:event: { cpu_id = 2 }, { _msg_length = 27, msg = "before rayIntersectWithAABB" } 734 | 735 | ``` 736 | 737 | `lttng view` actually calls `babeltrace` with current profile session path, equivalent `babeltrace command`: 738 | 739 | ```bash 740 | $ babeltrace ~/lttng-traces/my-session-20180218-184338/ust/uid//64-bit 741 | ``` 742 | 743 | In path `my-session` refers to the session name you provided to `lttng create` and following number are timestamp of when the trace was captured. 744 | 745 | You could see here messages defined in `src/raytrace.c`, time at which we're executed, time delta between events, event type (here is only one, becuase we only enabled one), CPU at which function was running. After quick look you could see the limitations of the `tracef` function: we called `tracef` in a function which was called from different threads. That's why we got here "three intertwine" profiles. Also `tracef` uses `vasprintf` to save data, which is not the most optimal way to save data. In the next example we will write own trace definition and use it in our application. 746 | 747 | ### Own trace 748 | 749 | Defining own trace definition and enabling it in application is a bit more complicated, but gives better controll what data will be saved in the profile. 
750 | 751 | Below is a typical implementation of a trace; first the header file `tp.h`: 752 | 753 | ```H 754 | #ifdef LTTNG_MYTRACE 755 | #undef TRACEPOINT_PROVIDER 756 | #define TRACEPOINT_PROVIDER cray 757 | 758 | #undef TRACEPOINT_INCLUDE 759 | #define TRACEPOINT_INCLUDE "./tp.h" 760 | 761 | #if !defined(_TP_H) || defined(TRACEPOINT_HEADER_MULTI_READ) 762 | #define _TP_H 763 | 764 | #include <lttng/tracepoint.h> 765 | 766 | TRACEPOINT_EVENT( 767 | cray, 768 | my_first_tracepoint, 769 | TP_ARGS( 770 | int, my_integer_arg, 771 | char*, my_string_arg ), 772 | 773 | TP_FIELDS( 774 | ctf_string(my_string_field, my_string_arg) 775 | ctf_integer(int, my_integer_field, my_integer_arg) 776 | ) 777 | ) 778 | 779 | #endif /* _TP_H */ 780 | 781 | #include <lttng/tracepoint-event.h> 782 | 783 | #endif /* LTTNG_MYTRACE */ 784 | ``` 785 | 786 | From the beginning: 787 | 788 | - `#ifdef LTTNG_MYTRACE` and `#endif /* LTTNG_MYTRACE */` conditionally enable compilation of the tracepoint. In a regular application it wouldn't be needed, but because the `Makefile` we're adjusting gathers the sources to compile with the glob '*.c', we need it to conditionally include this trace in the compilation. 789 | - the following `#undef` and `#define` register the `cray` tracepoint provider. In the previous example it was `lttng_ust_tracef`, 790 | - the next few preprocessor directives call macros which set options required to generate the tracepoint, generally boilerplate code, 791 | - `TRACEPOINT_EVENT` is the definition of our trace; at compilation time the preprocessor will generate code from this definition, 792 | - `cray` is the trace provider, it has to match `TRACEPOINT_PROVIDER` defined above, 793 | - `my_first_tracepoint` is the name of the tracepoint, 794 | - `TP_ARGS` is the list of arguments our trace function will accept, the format is: `type`, `name of the variable`, ... For obvious reasons the length of this list must be even, 795 | - `TP_FIELDS` defines how the parameters specified above should be logged: 796 | - `ctf_string(my_string_field, my_string_arg)` - string argument, it will be saved as `(my_string_field = XXX)` where `XXX` is the value passed in `my_string_arg` 797 | - `ctf_integer(int, my_integer_field, my_integer_arg)` - signed integer argument, it will be saved as `(my_integer_field = XXX)` where `XXX` is the value passed in `my_integer_arg` 798 | - at the end there are the matching `#endif` and the include of the header with `lttng`'s macros. 799 | 800 | The `tp.c` file is boring; most likely you won't need to change much of it. The two key lines are `#define TRACEPOINT_CREATE_PROBES` and `#define TRACEPOINT_DEFINE`, which at compilation time instruct the preprocessor to generate the tracepoint code from the definitions we wrote in `tp.h`. 801 | 802 | ```C 803 | #ifdef LTTNG_MYTRACE 804 | 805 | #define TRACEPOINT_CREATE_PROBES 806 | #define TRACEPOINT_DEFINE 807 | 808 | #include "tp.h" 809 | 810 | #endif /* LTTNG_MYTRACE */ 811 | ``` 812 | 813 | Lastly in the `Makefile`, below `LTTNG_SIMPLE`: 814 | 815 | ```Makefile 816 | 817 | ifdef LTTNG_MYTRACE 818 | CFLAGS += -Isrc -DLTTNG -DLTTNG_MYTRACE 819 | LFLAGS += -llttng-ust -ldl 820 | endif 821 | ``` 822 | 823 | It's very similar to the previous example; the only notable change here is `-Isrc`, which `tp.c` requires to correctly include the `tp.h` file. 824 | 825 | To compile the project: 826 | 827 | ```bash 828 | $ make clean 829 | $ LTTNG_MYTRACE=1 make 830 | ``` 831 | 832 | With this we compile the `tp.c` object and add it to our application, but we haven't placed calls to our newly defined tracepoint in any of the codepaths, so the resulting trace would be empty. Time to change this.
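A minimal sketch of what such a call site could look like (the variable name and message below are hypothetical, not taken from the repository); the arguments follow the order declared in `TP_ARGS`, integer first, string second:

```C
#ifdef LTTNG_MYTRACE
#include "tp.h"
#endif

/* ... inside the function we want to trace ... */
#ifdef LTTNG_MYTRACE
/* hypothetical call site: some_counter is a placeholder int variable */
tracepoint(cray, my_first_tracepoint, some_counter, "before rayIntersectWithAABB");
#endif
```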
833 | 834 | ... 835 | 836 | Now in term #1 start lttng session: 837 | 838 | ```bash 839 | $ lttng create my-trace 840 | Session my-trace created. 841 | Traces will be written in /home/erdk/lttng-traces/my-trace-20180218-202218 842 | ``` 843 | 844 | In term #2 execute c-ray: 845 | ```bash 846 | $ bin/c-ray 847 | ``` 848 | 849 | Now back to #1. When you list available events at the bottom you'll see our newly defined tracepoint: 850 | 851 | ```bash 852 | ~ » lttng list -u 853 | UST events: 854 | ------------- 855 | 856 | PID: 30053 - Name: bin/c-ray 857 | lttng_ust_tracelog:TRACE_DEBUG (loglevel: TRACE_DEBUG (14)) (type: tracepoint) 858 | lttng_ust_tracelog:TRACE_DEBUG_LINE (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 859 | lttng_ust_tracelog:TRACE_DEBUG_FUNCTION (loglevel: TRACE_DEBUG_FUNCTION (12)) (type: tracepoint) 860 | lttng_ust_tracelog:TRACE_DEBUG_UNIT (loglevel: TRACE_DEBUG_UNIT (11)) (type: tracepoint) 861 | lttng_ust_tracelog:TRACE_DEBUG_MODULE (loglevel: TRACE_DEBUG_MODULE (10)) (type: tracepoint) 862 | lttng_ust_tracelog:TRACE_DEBUG_PROCESS (loglevel: TRACE_DEBUG_PROCESS (9)) (type: tracepoint) 863 | lttng_ust_tracelog:TRACE_DEBUG_PROGRAM (loglevel: TRACE_DEBUG_PROGRAM (8)) (type: tracepoint) 864 | lttng_ust_tracelog:TRACE_DEBUG_SYSTEM (loglevel: TRACE_DEBUG_SYSTEM (7)) (type: tracepoint) 865 | lttng_ust_tracelog:TRACE_INFO (loglevel: TRACE_INFO (6)) (type: tracepoint) 866 | lttng_ust_tracelog:TRACE_NOTICE (loglevel: TRACE_NOTICE (5)) (type: tracepoint) 867 | lttng_ust_tracelog:TRACE_WARNING (loglevel: TRACE_WARNING (4)) (type: tracepoint) 868 | lttng_ust_tracelog:TRACE_ERR (loglevel: TRACE_ERR (3)) (type: tracepoint) 869 | lttng_ust_tracelog:TRACE_CRIT (loglevel: TRACE_CRIT (2)) (type: tracepoint) 870 | lttng_ust_tracelog:TRACE_ALERT (loglevel: TRACE_ALERT (1)) (type: tracepoint) 871 | lttng_ust_tracelog:TRACE_EMERG (loglevel: TRACE_EMERG (0)) (type: tracepoint) 872 | lttng_ust_tracef:event (loglevel: TRACE_DEBUG (14)) (type: tracepoint) 873 | lttng_ust_lib:unload (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 874 | lttng_ust_lib:debug_link (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 875 | lttng_ust_lib:build_id (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 876 | lttng_ust_lib:load (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 877 | lttng_ust_statedump:end (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 878 | lttng_ust_statedump:debug_link (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 879 | lttng_ust_statedump:build_id (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 880 | lttng_ust_statedump:bin_info (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 881 | lttng_ust_statedump:start (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 882 | cray:my_first_tracepoint (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 883 | ``` 884 | 885 | Now we enable it, resume program execution and after short time we will inspect our profile: 886 | 887 | In #1: 888 | ```bash 889 | $ lttng enable-event -u 'cray:my_first_tracepoint' 890 | $ lttng start 891 | ``` 892 | 893 | In #2 press \. After ~2min in #1: 894 | ```bash 895 | $ lttng stop 896 | $ lttng view | head -n 10 897 | ~ » lttng view | head -n 10 898 | [20:26:34.825989192] (+?.?????????) 
andromeda cray:my_first_tracepoint: { cpu_id = 3 }, { my_string_field = "before reyIntersectWithAABB", my_integer_field = 30344 } 899 | [20:26:34.825994723] (+0.000005531) andromeda cray:my_first_tracepoint: { cpu_id = 2 }, { my_string_field = "before reyIntersectWithAABB", my_integer_field = 30345 } 900 | [20:26:34.826003012] (+0.000008289) andromeda cray:my_first_tracepoint: { cpu_id = 2 }, { my_string_field = "before reyIntersectWithAABB", my_integer_field = 30346 } 901 | [20:26:34.826012860] (+0.000009848) andromeda cray:my_first_tracepoint: { cpu_id = 2 }, { my_string_field = "before reyIntersectWithAABB", my_integer_field = 30345 } 902 | [20:26:34.826013447] (+0.000000587) andromeda cray:my_first_tracepoint: { cpu_id = 2 }, { my_string_field = "rayIntersectsWithPolygon", my_integer_field = 30345 } 903 | [20:26:34.826015514] (+0.000002067) andromeda cray:my_first_tracepoint: { cpu_id = 2 }, { my_string_field = "before reyIntersectWithAABB", my_integer_field = 30345 } 904 | [20:26:34.826016248] (+0.000000734) andromeda cray:my_first_tracepoint: { cpu_id = 2 }, { my_string_field = "rayIntersectsWithPolygon", my_integer_field = 30345 } 905 | [20:26:34.826016936] (+0.000000688) andromeda cray:my_first_tracepoint: { cpu_id = 2 }, { my_string_field = "after intersecting with nodes", my_integer_field = 30345 } 906 | [20:26:34.826017850] (+0.000000914) andromeda cray:my_first_tracepoint: { cpu_id = 2 }, { my_string_field = "before reyIntersectWithAABB", my_integer_field = 30345 } 907 | [20:26:34.826018845] (+0.000000995) andromeda cray:my_first_tracepoint: { cpu_id = 2 }, { my_string_field = "before reyIntersectWithAABB", my_integer_field = 30345 } 908 | ``` 909 | 910 | Remeber that `lttng` stores traces in fiexd sized buffers, if size of the buffer is too low you'll see following warning: 911 | 912 | `[warning] Tracer discarded 2349 events between [20:26:35.656284527] and [20:26:35.669416457] in trace UUID 283815817ab4f419a7bb9ea5618927, at path: ".../lttng-traces/my-trace-20180218-202218/ust/uid/1000/64-bit", within stream id 0, at relative path: "channel0_0". You should consider recording a new trace with larger buffers or with fewer events enabled.` 913 | 914 | ### Multiple tracepoints 915 | 916 | Similarily you could add more traces, e.g. from `tp2.h`: 917 | 918 | ```H 919 | #ifdef LTTNG_MULTITRACE 920 | #undef TRACEPOINT_PROVIDER 921 | #define TRACEPOINT_PROVIDER cray 922 | 923 | #undef TRACEPOINT_INCLUDE 924 | #define TRACEPOINT_INCLUDE "./tp2.h" 925 | 926 | #if !defined(_TP2_H) || defined(TRACEPOINT_HEADER_MULTI_READ) 927 | #define _TP2_H 928 | 929 | #include 930 | 931 | TRACEPOINT_EVENT( 932 | cray, 933 | intersect_aabb, 934 | TP_ARGS(void), 935 | ) 936 | 937 | TRACEPOINT_EVENT( 938 | cray, 939 | intersect_nodes, 940 | TP_ARGS(void), 941 | ) 942 | 943 | TRACEPOINT_EVENT( 944 | cray, 945 | intersect_polygon, 946 | TP_ARGS(void), 947 | ) 948 | 949 | 950 | #endif /* _TP2_H */ 951 | 952 | #include 953 | 954 | #endif /* LTTNG_MULTITRACE */ 955 | ``` 956 | 957 | Here we defined 3 tracepoints, each without arguments. 
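Since these tracepoints take no arguments, a call consists of just the provider and event name. A minimal sketch of hypothetical call sites, guarded the same way as the earlier examples:

```C
#ifdef LTTNG_MULTITRACE
#include "tp2.h"
#endif

/* ... at the spots we want to mark ... */
#ifdef LTTNG_MULTITRACE
/* each call records only a timestamped event, with no payload */
tracepoint(cray, intersect_aabb);
tracepoint(cray, intersect_nodes);
tracepoint(cray, intersect_polygon);
#endif
```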
Youd could inspect them at runtime (remeber to create lttng session first!): 958 | 959 | ```bash 960 | $ lttng list -u 961 | UST events: 962 | ------------- 963 | 964 | PID: 1355 - Name: bin/c-ray 965 | lttng_ust_tracelog:TRACE_DEBUG (loglevel: TRACE_DEBUG (14)) (type: tracepoint) 966 | lttng_ust_tracelog:TRACE_DEBUG_LINE (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 967 | lttng_ust_tracelog:TRACE_DEBUG_FUNCTION (loglevel: TRACE_DEBUG_FUNCTION (12)) (type: tracepoint) 968 | lttng_ust_tracelog:TRACE_DEBUG_UNIT (loglevel: TRACE_DEBUG_UNIT (11)) (type: tracepoint) 969 | lttng_ust_tracelog:TRACE_DEBUG_MODULE (loglevel: TRACE_DEBUG_MODULE (10)) (type: tracepoint) 970 | lttng_ust_tracelog:TRACE_DEBUG_PROCESS (loglevel: TRACE_DEBUG_PROCESS (9)) (type: tracepoint) 971 | lttng_ust_tracelog:TRACE_DEBUG_PROGRAM (loglevel: TRACE_DEBUG_PROGRAM (8)) (type: tracepoint) 972 | lttng_ust_tracelog:TRACE_DEBUG_SYSTEM (loglevel: TRACE_DEBUG_SYSTEM (7)) (type: tracepoint) 973 | lttng_ust_tracelog:TRACE_INFO (loglevel: TRACE_INFO (6)) (type: tracepoint) 974 | lttng_ust_tracelog:TRACE_NOTICE (loglevel: TRACE_NOTICE (5)) (type: tracepoint) 975 | lttng_ust_tracelog:TRACE_WARNING (loglevel: TRACE_WARNING (4)) (type: tracepoint) 976 | lttng_ust_tracelog:TRACE_ERR (loglevel: TRACE_ERR (3)) (type: tracepoint) 977 | lttng_ust_tracelog:TRACE_CRIT (loglevel: TRACE_CRIT (2)) (type: tracepoint) 978 | lttng_ust_tracelog:TRACE_ALERT (loglevel: TRACE_ALERT (1)) (type: tracepoint) 979 | lttng_ust_tracelog:TRACE_EMERG (loglevel: TRACE_EMERG (0)) (type: tracepoint) 980 | lttng_ust_tracef:event (loglevel: TRACE_DEBUG (14)) (type: tracepoint) 981 | lttng_ust_lib:unload (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 982 | lttng_ust_lib:debug_link (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 983 | lttng_ust_lib:build_id (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 984 | lttng_ust_lib:load (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 985 | lttng_ust_statedump:end (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 986 | lttng_ust_statedump:debug_link (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 987 | lttng_ust_statedump:build_id (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 988 | lttng_ust_statedump:bin_info (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 989 | lttng_ust_statedump:start (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 990 | cray:intersect_polygon (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 991 | cray:intersect_nodes (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 992 | cray:intersect_aabb (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint) 993 | ``` 994 | 995 | Now you could enable one of them: 996 | ```bash 997 | $ lttng enable-event -u 'cray:intersect_polygon' 998 | ``` 999 | 1000 | Or all: 1001 | ```bash 1002 | $ lttng enable-event -u 'cray:*' 1003 | ``` 1004 | 1005 | ### Inspect trace with TraceComapss 1006 | 1007 | From http://tracecompass.org download TraceCompass and run it. It's based on Eclipse, so you'll need Java 7+ to run it (Java 9 not supported yet). After downloading run the program and from `File` menu chose `Open Trace`. Then navigate to `$HOME\lttng-traces\\ust\uid\\64-bit\` and choose `metadata`. 
1008 | 1009 | Here is screenshot with data collected whit enabled `cray:my_first_tracepoint`: 1010 | ![TraceCompass UI](/img/tracecompass1.png) 1011 | 1012 | And here with multi tracepoint example: 1013 | ![TraceCompass multi traces](/img/tracecompass2.png) 1014 | 1015 | ## zipkin & blkin 1016 | 1017 | # Other tools 1018 | 1019 | ## go pprof 1020 | 1021 | # Sources 1022 | 1023 | - gprof: https://www.thegeekstuff.com/2012/08/gprof-tutorial/ 1024 | - gprof: `man gprof` 1025 | - perf: https://dev.to/etcwilde/perf---perfect-profiling-of-cc-on-linux-of 1026 | - perf: https://perf.wiki.kernel.org/index.php/Tutorial 1027 | - perf: http://www.brendangregg.com/perf.html 1028 | - valgrind: http://valgrind.org/docs/manual/cl-manual.html 1029 | - (Q|K)cachegrind: https://github.com/KDE/kcachegrind 1030 | - LTTNG: https://lttng.org/docs/v2.10/#doc-what-is-tracing -------------------------------------------------------------------------------- /img/flamegraph1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Erdk/introtoprofiling/0aa5c4036eeddb770d28057d0846ff2b775f57bc/img/flamegraph1.png -------------------------------------------------------------------------------- /img/kcachegrind1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Erdk/introtoprofiling/0aa5c4036eeddb770d28057d0846ff2b775f57bc/img/kcachegrind1.png -------------------------------------------------------------------------------- /img/kcachegrind2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Erdk/introtoprofiling/0aa5c4036eeddb770d28057d0846ff2b775f57bc/img/kcachegrind2.png -------------------------------------------------------------------------------- /img/kcachegrind3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Erdk/introtoprofiling/0aa5c4036eeddb770d28057d0846ff2b775f57bc/img/kcachegrind3.png -------------------------------------------------------------------------------- /img/perf1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Erdk/introtoprofiling/0aa5c4036eeddb770d28057d0846ff2b775f57bc/img/perf1.png -------------------------------------------------------------------------------- /img/perf2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Erdk/introtoprofiling/0aa5c4036eeddb770d28057d0846ff2b775f57bc/img/perf2.png -------------------------------------------------------------------------------- /img/perf3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Erdk/introtoprofiling/0aa5c4036eeddb770d28057d0846ff2b775f57bc/img/perf3.png -------------------------------------------------------------------------------- /img/perf4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Erdk/introtoprofiling/0aa5c4036eeddb770d28057d0846ff2b775f57bc/img/perf4.png -------------------------------------------------------------------------------- /img/perf5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Erdk/introtoprofiling/0aa5c4036eeddb770d28057d0846ff2b775f57bc/img/perf5.png 
-------------------------------------------------------------------------------- /img/perf6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Erdk/introtoprofiling/0aa5c4036eeddb770d28057d0846ff2b775f57bc/img/perf6.png -------------------------------------------------------------------------------- /img/tracecompass1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Erdk/introtoprofiling/0aa5c4036eeddb770d28057d0846ff2b775f57bc/img/tracecompass1.png -------------------------------------------------------------------------------- /img/tracecompass2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Erdk/introtoprofiling/0aa5c4036eeddb770d28057d0846ff2b775f57bc/img/tracecompass2.png --------------------------------------------------------------------------------