├── .gitignore
├── pics
│   ├── aglaw.jpg
│   ├── ccbysa.png
│   ├── linus.jpg
│   ├── Fortran.jpeg
│   ├── kbroman.jpeg
│   ├── exercises.png
│   ├── performance.png
│   ├── saladwoman1.jpg
│   ├── digging-a-hole.jpg
│   └── multithreading.jpg
├── rebuild
├── README.md
├── headers.js
├── custom.css
└── openmp.Rmd

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
openmp_cache/

--------------------------------------------------------------------------------
/pics/aglaw.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wrathematics/RparallelGuide/HEAD/pics/aglaw.jpg

--------------------------------------------------------------------------------
/pics/ccbysa.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wrathematics/RparallelGuide/HEAD/pics/ccbysa.png

--------------------------------------------------------------------------------
/pics/linus.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wrathematics/RparallelGuide/HEAD/pics/linus.jpg

--------------------------------------------------------------------------------
/pics/Fortran.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wrathematics/RparallelGuide/HEAD/pics/Fortran.jpeg

--------------------------------------------------------------------------------
/pics/kbroman.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wrathematics/RparallelGuide/HEAD/pics/kbroman.jpeg

--------------------------------------------------------------------------------
/pics/exercises.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wrathematics/RparallelGuide/HEAD/pics/exercises.png

--------------------------------------------------------------------------------
/pics/performance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wrathematics/RparallelGuide/HEAD/pics/performance.png

--------------------------------------------------------------------------------
/pics/saladwoman1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wrathematics/RparallelGuide/HEAD/pics/saladwoman1.jpg

--------------------------------------------------------------------------------
/pics/digging-a-hole.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wrathematics/RparallelGuide/HEAD/pics/digging-a-hole.jpg

--------------------------------------------------------------------------------
/pics/multithreading.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wrathematics/RparallelGuide/HEAD/pics/multithreading.jpg

--------------------------------------------------------------------------------
/rebuild:
--------------------------------------------------------------------------------
#!/bin/sh

Rscript -e "rmarkdown::render('openmp.Rmd')"
mv openmp.html index.html

### TODO make this less dumb
mv index.html /tmp
mv custom.css /tmp
git checkout gh-pages
mv /tmp/index.html .
mv /tmp/custom.css .
git add index.html custom.css
git commit -m "updated from master"
git push origin gh-pages
git checkout master

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# [Parallelism, R, and OpenMP](https://wrathematics.github.io/RparallelGuide)

Do you want to know about parallelism? With R? And possibly OpenMP?
Then this is the guide for you maybe.

* Text is licensed CC BY-SA
* Code in the text is public domain or MIT, your choice (some legal
  entities do not recognize the public domain).
* The files `custom.css` and `headers.js` are licensed
  [GamahCode v1.2](http://gamahgpl.org/)

Many thanks to Karl Broman for catching a ton of typos.

--------------------------------------------------------------------------------
/headers.js:
--------------------------------------------------------------------------------
wrap=function(tag){
  var elems=document.getElementsByTagName(tag);
  for(var i=0; i<elems.length; i++){
[...]

--------------------------------------------------------------------------------
/openmp.Rmd:
--------------------------------------------------------------------------------
[...]
    if (length(x) > lines)
      x <- c(head(x, lines), more)
    else
      x <- c(more, x[lines], more)

  x <- paste(c(x, ""), collapse = "\n")
  hook_output(x, options)
})
```



45 | **WARNING:** This document was not written while sober. Apparently I think I'm hilarious when I'm drunk, so don't take the tone of this document too seriously. This is not an academic document, and it makes extensive use of adult language. 46 | 47 | 48 | 49 | # About This Document 50 | 51 | ## tl;dr 52 | 53 | * **Version:** 2.0 54 | * **Text License:** [CC BY-SA](http://creativecommons.org/licenses/by/4.0/). 55 | * **Code License:** public domain 56 | * **Naughty Language:** yes 57 | 58 | ## Long Version 59 | 60 | This document is licensed under a [creative commons license](http://creativecommons.org/licenses/by/4.0/). 61 | [![Creative Commons License](pics/ccbysa.png)](http://creativecommons.org/licenses/by/4.0/ "Creative Commons License") 62 | 63 | In addition, all source code presented in this document, appearing in boxes like this: 64 | 65 | ```r 66 | source_code() 67 | ``` 68 | 69 | is released into the public domain. Or if you live in some kind of fake country that doesn't recognize the public domain, you may treat it as MIT licensed. I really don't give a fuck is what I'm trying to say. 70 | 71 | This article contains the use of adult language. I assume that only an adult would be interested in this topic, but if you're a giant weepy baby, feel free to read someone else's inferior explanation of how this stuff works. 72 | 73 | 74 | 75 | # An Introduction to Parallelism 76 | 77 | People talk a lot about *parallelism* these days. The basic idea is really simple. Parallelism is all about independence, literally the ability to do multiple things at the same time in a deterministic way. Mostly, parallelism isn't the hard part. Really the only reason anyone thinks this stuff is hard is because software engineering is an inherently dysfunctional discipline practiced exclusively by sociopathic masochists. 78 | 79 | Programmers like to act like parallelism is this super complicated thing that the hoi polloi are too impossibly dumb to ever grasp. In actual fact, parallelism is really easy most of the time, especially for scientific workflows. Making good use of 15 cores might be a challenge for an Android app that interprets the stomach rumbling sounds of those near you as "hunger" or "you know damn well that guy just ate" (dear venture capitalists, call me 😉). But scientific workflows tend to be very (eheheheh) regular, and predictable. 80 | 81 | There's a great quote about this by trollmaster Linus Torvalds that assholes often like to try to take wildly out of context: 82 | 83 | > The whole "let's parallelize" thing is a huge waste of everybody's time. There's this huge body of "knowledge" that parallel is somehow more efficient, and that whole huge body is pure and utter garbage. Big caches are efficient. Parallel stupid small cores without caches are horrible unless you have a very specific load that is hugely regular (ie graphics). 84 | > 85 | > ... 86 | > 87 | > The only place where parallelism matters is in graphics or on the server side, where we already largely have it. Pushing it anywhere else is just pointless. 88 | 89 | 90 |
91 | 97 |
Little known fact, Linux is powered by smugness and flamewars. Of course, you'd know that already if you used a real operating system.
98 |
99 | 100 | For the record, I work in supercomputing professionally (or as professionally as a dipshit like me can be); and I agree with him 100%. But some people like to pretend that King Linus is saying parallelism is never worth pursuing. These people are frauds and idiots. I regularly work with 1000's of cores (with R no less!) to solve real problems that real people really care about. In case you're wondering what he's talking about with caches, he's really getting at the kinds of emerging architectures coming mostly from Intel. Linus is basically saying that Xeon Phi's aren't going to get shipped in the next iPhone. 101 | 102 | For basically all of this document, we're going to be focusing on *embarrassingly parallel* problems. Something is embarrassingly parallel if you really shouldn't have to think very hard about how to (conceptually) parallelize it. The opposite of "embarrassingly parallel" --- a phrase I think should be abandoned, by the way --- is "tightly coupled". The better way to say "embarrassingly parallel" is *loosely coupled*. Use of "embarrassingly parallel" leads to stupid shit like "unembarrassingly parallel", which makes me want to flip over a table. 103 | 104 | A simple example comparing the two is fitting linear models. Do you want to fit a bunch of different linear models (possibly even on completely different data?). Just throw models at your computer's cores until your electricity usage is so outrageous that the feds kick in your door because they think you're growing pot. But what if you want to fit just one model, on a really large dataset? Not so simple now is it, smart guy? 105 | 106 | The first is an example of an embarrassingly parallel problem. There's no dependence between fitting two linear models. There might be some computational advantages to thinking more carefully about how the load is distributed and making use of QR updates in the model fit...but that's not the point; I'm saying you *could* just fire the problems off blindly and call it a day. In the latter case, you basically need a parallel matrix-matrix multiply. Of course, there are plenty of multi-threaded BLAS out there, but think about how you would write one yourself and actually get decent performance. Now think about writing one that uses not only multiple cores, but multiple nodes. It's not so obvious. 107 | 108 | Some people might split this into additional concepts of *task parallelism* and *data parallelism*; and even just a few years ago I might have as well. But to be honest, I don't really think there is any such thing as task parallelism, and there are people way smarter than me who've said the same thing. Sorry I even brought it up. 109 | 110 |
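To make the "just throw models at your cores" case concrete, here's a minimal sketch using the `parallel` package (which we'll meet properly in a bit). The pile of data frames here is hypothetical; pretend each one has a response `y` and a predictor `x` you actually care about. Also note `mc.cores > 1` wants a unix-alike; Windows users get to watch.

```r
library(parallel)

# a hypothetical pile of datasets, each getting its own independent model fit
datasets <- replicate(100, data.frame(x=rnorm(50), y=rnorm(50)), simplify=FALSE)

# one fit per dataset: no communication between fits, no coordination, no thinking
fits <- mclapply(datasets, function(d) lm(y ~ x, data=d), mc.cores=4)

length(fits)  # 100 fitted models
```

The single-big-model case gets no such sketch, because that's the whole point: there isn't a one-liner for it.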
111 | 115 |
"I know you asked for a mound of dirt, but management read an article on hackernews about how wood chips are more scalable."
116 |
Forget computers for a second because they're miserable creations designed to make us question the value of human life. Say you have 100 workers with shovels in a field. You want to dig 100 holes? Perfect, just throw one worker at each hole and you improve your time-to-hole performance 100-fold! But what if you want to dig one really big hole? Throwing 100 people at it and telling them to dig probably won't work out for you. There's lots of inter-digger communication that has to take place to make sure that they're digging in the right place, that dirt is leaving the hole correctly, all kinds of things. And never mind that what you really needed in the first place was a mound of dirt.

For clarity, you would prefer to be in the 100-hole situation. It's just easier to manage. And like any good manager, when it's all done, you get to take credit for all the hard work. And you can even fire all the workers when you're done if you want to (this is called "the cloud"). Of course real life is usually more complicated than this. It tends to be more like: you need 20 holes dug, each of vastly differing sizes, and all you have is an ice cream scoop and a drill.
123 | 126 |
Your first homework exercise is to write a script to bury this hole-digging analogy.
127 |
Parallelism in R is very simple. This is often a huge shock to people who come to R from other languages, since everything else in R is so goddamn convoluted and stupid. But it's true! The language mandates certain guarantees about the behavior of functions, for example, that make parallelism in many cases completely trivial. And it even supports some very nice utilities for very easily writing parallel code. Unfortunately, like everything else in R, there are several official interfaces, and about a hundred unofficial ones.

In R, a good first place to start thinking about parallelism is the `parallel` package. It contains two separate interfaces: one modeled on the `multicore` package, and one modeled on the `snow` package. Presumably in an effort to keep things "interesting", the `parallel` package completely subsumes the `multicore` package, but *not* the `snow` package. It also makes no attempt to unify the two interfaces. But you're not using R because it's easy; you're using it because you have to, so shhh, no tears.

We're not going to talk about the `snow` side of things because it's ugly, inefficient, and stupid, and literally the only reason to care about it at this point is that it's needed for Windows users, who at this point should be grateful we even acknowledge their existence. For the `multicore` side of things, you just turn your `lapply()` calls into `mclapply()` calls:

```{r}
index <- 1:1e5
system.time(lapply(index, sqrt))
system.time(parallel::mclapply(index, sqrt, mc.cores=2))
```

Obviously using more cores means more performance, and I am so confident in this self-evident fact that I won't even bother to check the results of the benchmark before completing this sentence.

So that's R, but what about OpenMP? OpenMP is a parallel programming standard for the only compiled languages that matter: C, C++, and Fortran. Basically it's a way of asking the compiler to parallelize something for you. But much like that time you were playing dungeons and dragons and got a scroll of wish, you have to be extraordinarily careful how you phrase your request. Failure to form your request carefully will likely result in chaos and destruction.

OpenMP is a set of compiler pragmas. They look like this:

```c
void do_something(double *x, int i);
int i;

#pragma omp parallel for shared(x) private(i)
for (i=0; i<n; i++)
  do_something(x, i);
```

[...]


# Managing Your Expectations

Up and to the right means more better. But it's in 3d, so don't stand behind it, ok?
170 | 171 | 172 | Parallelism is kind of a hot topic right now. This is due at least in part to the fact that processors haven't really gotten any faster in the last 10 years; instead, hardware vendors have been stuffing more cores, and now hardware threads, as well as fraudulent things like hyperthreads (fyi, hardware threads != hyperthreads) into chips to maintain Moore's Law. 173 | 174 | Say you have 4 cores in your fancy new laptop. Well first of all, if it's a laptop, it probably only has 2 physical cores and a couple of hyperthreads. Anyway, chances are each one of those cores runs at the same clock rate (or slower) than the last 3 or 4 computers you've bought. One of the big reasons for this is if you continue to increase the clock rate beyond 3ghz, the chances of your computer turning into a molten box of sand increase dramatically. 175 | 176 | But not all is lost, dear consumer; there are still many good reasons to buy a new computer. For instance, hardware vendors have found some very crafty ways to be able to do more mathematical operations with processors without actually making them "faster". Intel in particular seems to have a real love affair with creating new forms of vector instructions nobody knows how to use. 177 | 178 |
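If you want to know what you're actually sitting on, R can tell you, more or less. A quick sketch; the `logical` argument is what separates real cores from hyperthread padding, at least on platforms where R can tell the difference:

```r
library(parallel)

detectCores()                # logical cores, i.e. hyperthreads included
detectCores(logical=FALSE)   # physical cores (NA if your platform won't say)
```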
179 | 183 |
When your hipster flash-in-the-pan language is forgotten, there will still be the eternal, undying Fortran.
184 |
185 | 186 | As noted, parallelism in R is very cheap from a programmer's perspective. Firing off an `mclapply()` is very simple, and you can get some decent speedup for a lot of problems. But if you have 4 cores, this strategy will never get better than 4x performance improvement over the serial version. Interestingly, sometimes you can use 4 cores to get better than 4x performance improvement in a compiled language, due to things like cache effects. But the problems where this can happen are pretty rare, and likely not the problems you have. 187 | 188 | So parallelism can get you let's say 3.5x improvement (optimistically) on that fancy little machine of yours. But moving your R code to a compiled language like C or C++ (or Fortran, as god intended) has an opening bid of 5x if you're unlucky. The more "R" your code is, the higher the performance improvement ceiling. It's not uncommon to achieve speedups on the order of 20-40x, really putting that measly 3.5x to shame. No joke, one time I got a 15,000-fold improvement over a client's R code by moving it to a carefully optimized compiled code, completely serial. 189 | 190 | On the other hand, maybe it would be a lot of work to move the code to a compiled language (it is). And that 3.5x might take you all of a few minutes. Or maybe you can combine the two approaches. Or maybe you realize what a terrible collection of life choices you've made that have brought you down this dark and lonely path to thinking about cpu caches or whatever. Look, what I'm trying to say is weigh your options. 191 | 192 | 193 | # Examples 194 | 195 | ## Fizzbuzz 196 | 197 | A really common first problem given to people in programming interviews is fizzbuzz. It's basically a way of quickly weeding out people who think using a pivot table in excel makes you some kind of techno spaceman from the future, and now want to be able to legally call themselves a developer. 198 | 199 | The problem is: 200 | 201 | > For the numbers 1 to 100, print "fizz" if the number is divisible by 3, "buzz" if the number is divisible by 5, "fizzbuzz" if the number is divisible by both 5 and 3, and otherwise print the number. 202 | 203 | That's honestly not the best phrasing of the problem, but it's how I first heard it. And guess what, nerd? If your clients could figure out how to carefully phrase their problems, they wouldn't need you and your shitty attitude. 204 | 205 | So how might you do this in R? Lots of ways, I'm sure, but I want to impose an extra restriction: you have to use `sapply()`. Now it's just a matter of ordering and formatting. You should come up with something reasonably similar to this: 206 | 207 | ```{r, output.lines=10} 208 | fizzbuzz <- function(i) 209 | { 210 | if (i%%15==0) 211 | print("fizzbuzz") 212 | else if (i%%3==0) 213 | print("fizz") 214 | else if (i%%5 == 0) 215 | print("buzz") 216 | else 217 | print(i) 218 | 219 | invisible() 220 | } 221 | 222 | sapply(1:100, fizzbuzz) %>% invisible() 223 | ``` 224 | 225 | 226 | 227 | Clearly this is a big data problem if I ever saw one, so what we really need is to use more cores. See figure below for details. 228 | 229 |
230 |
231 | 232 |
233 |
CORES FOR THE CORE GOD, THREADS FOR THE THREAD THRONE
234 |
If we were to use, say, `mclapply()` from the `parallel` package, we would expect to get identical outputs:

```r
parallel::mclapply(1:100, fizzbuzz, mc.cores=4) %>% invisible()
```

```
## [1] 1
## [1] "buzz"
## [1] "fizz"
## [1] 13
## [1] 17
## [1] "fizz"
## [1][1] 2 "buzz"
## 
## [1] 29
## [1] "fizz"

[[ ... results truncated ... ]]
```

It's perfect! Ship it!

Maybe a more common way to use `mclapply()` is to exploit the fact that it will return a list. So we could instead (conceptually) insert, as a value, the fizzbuzzed version of the integer into a vector.

```{r}
fb <- function(i)
{
  if (i%%15==0)
    "fizzbuzz"
  else if (i%%3==0)
    "fizz"
  else if (i%%5 == 0)
    "buzz"
  else
    i
}

sapply(1:100, fb) %>% head()
```

As before, just replace `sapply()` with `mclapply()`...then run `simplify2array()`. We can easily prove that the returns are identical:

```{r}
n <- 100

serial <- sapply(1:n, fb)
parallel <- parallel::mclapply(1:n, fb) %>% simplify2array()

all.equal(serial, parallel)
```

So next time you're at a job interview and they ask you to do a fizzbuzz, smugly smile and tell them that you operate at a higher level.



## Fizzbuzz OpenMP

Ok so now you're a parallel expert for R. But what about OpenMP? For the sake of "you've probably dealt with it before", we'll use Rcpp. Before doing so, we have to set some compiler flags:

```{r, flags}
Sys.setenv("PKG_CXXFLAGS"="-fopenmp -std=c++11")
```

These tell the compiler to use OpenMP and C++11. We're using C++11 because I don't want to terrify 99% of you with the eldritch, labyrinthine and eternal horror of working with strings at a low level. Anyway, here's how you might fizzbuzz in parallel with Rcpp, returning a string vector of values:

```{Rcpp, rcppfzbz, dependson="flags"}
#include <Rcpp.h>
#include <string>

// [[Rcpp::export]]
std::vector<std::string> fizzbuzz(int n)
{
  std::vector<std::string> ret;

  for (int i=1; i<=n; i++)
  {
    i%15 == 0 ? ret.push_back("fizzbuzz")
    : i%5 == 0 ? ret.push_back("buzz")
    : i%3 == 0 ? ret.push_back("fizz")
    : ret.push_back(std::to_string(i));
  }

  return ret;
}
```

We can now easily call the function from R:

```{r, dependson="rcppfzbz"}
fizzbuzz(100) %>% head()
```

For the remainder, we're not going to work with strings.

Fuck strings.



## Mean of a Vector

Let's say we have:

```{r, xdefine}
n <- 1e8
x <- rnorm(n)
```

So in R you can use `mean()`:

```{r, dependson="xdefine"}
system.time(mean(x))
```

And in serial, we can compute the mean in Rcpp like so:

```{Rcpp, meanserial}
#include <Rcpp.h>

// [[Rcpp::export]]
double mean_serial(Rcpp::NumericVector x)
{
  double sum = 0.0;
  for (int i=0; i<x.size(); i++)
    sum += x[i];

  return sum / x.size();
}
```

[...]

```{Rcpp, meanparallel, dependson="flags"}
#include <Rcpp.h>
#include <omp.h>

// [[Rcpp::export]]
double mean_parallel(Rcpp::NumericVector x)
{
  double sum = 0.0;
  #pragma omp parallel for simd shared(x) reduction(+:sum)
  for (int i=0; i<x.size(); i++)
    sum += x[i];

  return sum / x.size();
}
```

[...]
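The `reduction(+:sum)` clause is doing the real work in that pragma: each thread gets its own private copy of `sum` to accumulate into, and OpenMP combines the copies at the end, so the threads never fight over a single variable. Here's a rough sanity-check-and-timing sketch, assuming the two chunks above compiled and `x` from earlier is still sitting in memory:

```r
# same answer, up to floating point rounding (the parallel version sums in a different order)
all.equal(mean(x), mean_serial(x))
all.equal(mean(x), mean_parallel(x))

# the part you actually care about
system.time(mean(x))
system.time(mean_serial(x))
system.time(mean_parallel(x))
```

Don't expect miracles here; a mean is about as memory-bound as computations get, so the ceiling is your memory bandwidth, not your core count.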
Multithreading: Theory and Practice
OpenMP is also notoriously difficult to debug. This is because, as we have seen, it works via compiler pragmas. The problem is, the compiler is generating a lot of code that you really don't understand. If you're used to C++, you probably aren't worried because you're already brain-damaged from staring at million line template explosions. But in a language like C, if you really understand how your hardware works, you can generally get a rough idea of how your code will translate to assembly. This is no longer so when using OpenMP. So sometimes you just sit there scratching your head wondering if the thing even did what you just told it to do. Probably the number one thing I find myself thinking when using OpenMP is "did that even run in parallel?" This is why the OpenMP hello world is actually useful:

```c
#include <omp.h>
#include <stdio.h>

void hellomp()
{
  // thread id and number of threads
  int tid, nthreads;

  #pragma omp parallel private(tid)
  {
    nthreads = omp_get_num_threads();
    tid = omp_get_thread_num();

    printf("Hello from thread %d of %d\n", tid, nthreads);
  }
}
```

By calling `#pragma omp parallel` in this way, we are instructing the compiler to have each thread run the block that follows. Otherwise this should be fairly self-explanatory. If you need to test that OpenMP is working, this is always a great one to bring out.

Finally, be careful with your core management. If you're combining `mclapply()` calls and code using OpenMP, you need to be careful with how you allocate your cores, and where. You might be ok just throwing caution to the wind, or you might need to be super careful about who gets what resources. Life is hard.
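For example, on a 4-core machine you might give `mclapply()` two workers and let each worker's compiled code use two OpenMP threads, rather than letting 4 x 4 = 16 threads trample each other. Here's a minimal sketch of that idea; `my_openmp_thing()` is a hypothetical stand-in for your own Rcpp/OpenMP function, and the OpenMP runtime reads `OMP_NUM_THREADS` when the first parallel region starts, so set it early (or just export it from your shell):

```r
# limit each forked worker's OpenMP code to 2 threads;
# the environment variable is inherited by the forked children
Sys.setenv(OMP_NUM_THREADS=2)

# 2 R workers x 2 OpenMP threads = 4 busy cores, not 16 fighting ones
# my_openmp_thing() is hypothetical -- substitute your own compiled function
res <- parallel::mclapply(1:10, my_openmp_thing, mc.cores=2)
```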

# Miscellany

## R and Thread Safety

From the [Writing R Extensions manual](http://cran.r-project.org/doc/manuals/r-patched/R-exts.html#OpenMP-support):

> Calling any of the R API from threaded code is ‘for experts only’: they will need to read the source code to determine if it is thread-safe.

In general, `SEXP` functions (including most things you would write with Rcpp) should not be considered thread safe if you want to be cautious (and you probably do, because thread-safety problems are very hard to debug). Of note, memory allocation and the RNG are not thread safe. And it's not a bad idea to assume that every R function performs some memory allocation. What this means is, you can't call R functions in parallel. Well I mean, you *can*, but you shouldn't.

After this post was first published, there was some confusion about what I meant by this. And rightly so; it was poorly phrased and overly abstract. You can use OpenMP effectively with R. Focusing on Rcpp for the moment (since that is how we have been integrating compiled code), all of the above still holds. There are examples dating back to at least 2010 effectively using OpenMP with Rcpp. There is also an example in this document, there are yet more in the Romp package quoted by this document, and there are exercises below instructing you to create functions marrying Rcpp and OpenMP. It's definitely possible to do this safely.

But it's also possible to really screw up! So what can go wrong? Here's a simple example computing the column sums of the square root of a matrix (note: don't do this):

```C
#include <Rcpp.h>


double rcpp_rootsum_j(Rcpp::NumericVector x)
{
  Rcpp::NumericVector ret = sqrt(x);
  return sum(ret);
}

// [[Rcpp::export]]
Rcpp::NumericVector rcpp_rootsum(Rcpp::NumericMatrix x)
{
  const int nr = x.nrow();
  const int nc = x.ncol();
  Rcpp::NumericVector ret(nc);

  #pragma omp parallel for shared(x, ret)
  for (int j=0; j<nc; j++)
    ret[j] = rcpp_rootsum_j(x.column(j));

  return ret;
}
```

The problem is the allocation: `rcpp_rootsum_j()` creates a brand new `Rcpp::NumericVector` inside each thread (that's what the sugar `sqrt()` does), and as noted above, R's memory allocation is not thread safe.

[...]

For comparison, here is the per-column worker written against R's C API directly; it only reads the raw data and allocates nothing:

```C
#include <Rinternals.h>
#include <math.h>


double rootsum_j(SEXP x)
{
  double ret = 0.;

  for (int i=0; i<LENGTH(x); i++)
    ret += sqrt(REAL(x)[i]);

  return ret;
}
```

[...]

Recent versions of gcc (>= 4.9) support the `simd` pragma, and Clang's initial OpenMP release supports it as well. But if you are a huge asshole with an ancient software stack, you can still program for people living in the modern era pretty simply. In C/C++, you could use the preprocessor to easily check if the `simd` pragma is supported by checking the OpenMP version:

```C
#ifndef MY_OMP_H_
#define MY_OMP_H_

#ifdef _OPENMP
#include <omp.h>
#if _OPENMP >= 201307
#define OMP_VER_4
#endif
#endif

// Insert SIMD pragma if supported
#ifdef OMP_VER_4
#define SAFE_SIMD _Pragma("omp simd")
#define SAFE_FOR_SIMD _Pragma("omp for simd")
#else
#define SAFE_SIMD
#define SAFE_FOR_SIMD
#endif

#endif
```

This way, I can insert `SAFE_SIMD` or `SAFE_FOR_SIMD` into my code, and if `simd` is supported by my OpenMP version, use it; if not, don't. All error free.

For Fortran users, you can't do this sort of thing without using an "FPP", which is generally a cascade of headaches and heartbreaks. Basically because there's really no such thing as an "FPP", and people tend to just use the CPP and hope for the best. I like Fortran, but we need to be honest with ourselves; if this level of control is important to you, then you should be using C.



## OpenMP Alternatives

OpenACC is an alternative. Originally, OpenMP and OpenACC had vastly different goals, though more and more they look similar to me. I honestly don't know much about it because I never use it, but some people swear by it. It's probably more interesting to you if you work with accelerators a lot (I don't).

If you're interested in programming accelerator cards, like a gpu, there is an extension for OpenMP to dispatch to a gpu. I've never used it, but it seems like a bad idea to me, in the sense that I don't know how you could hope to get good performance out of it. But I've never used it, so maybe you can. In that case, I'd probably look at OpenACC, or things like the Thrust template library.

Additionally, if you're using C++, there are a *lot* of other threading frameworks out there. Most of them are shit. There's probably something in boost which, under a deeply strained legal interpretation of the word "works", works. There's also Intel Threading Building Blocks (TBB). TBB has some really cool ideas in it, but it's not portable (Intel only), and frankly by my assessment, it's pretty dead. I don't work for Intel, or have any special insight into how they do things, but I'm 99% sure they're just giving up on TBB in favor of OpenMP, which is a real and proper standard (supported deeply by Intel, no less).
The various parallel Rcpp efforts, like RcppParallel, seem to be heading towards TBB (because they all use Macs, and Clang only recently got its act together on OpenMP). JJ's a super duper smart guy, but I really think he's betting on the wrong horse here.

Oh there's also pthreads.
672 |
673 | 674 |
675 |
My face when someone suggests I use pthreads
676 |

# What next?

Norm Matloff, who I have a lot of respect and admiration for, has [written a great introduction](https://matloff.wordpress.com/2015/01/16/openmp-tutorial-with-r-interface/) to OpenMP with R. I actually consider his guide a bit terse and (relatively) advanced, and it contains a lot of great advice orthogonal to the incoherent ramblings presented here. So if this didn't do it for you, give his proper guide a try. Norm has also just finished a book [about parallel programming for data science](http://www.amazon.com/Parallel-Computing-Data-Science-Examples/dp/1466587016/) that I'm pretty excited about. Considering he wrote what is [easily the single best book on R](http://www.amazon.com/Art-Programming-Statistical-Software-Design/dp/1593273843/), it should be great.

If you're looking for stuff specifically about OpenMP, the [Using OpenMP](http://www.amazon.com/Using-OpenMP-Programming-Engineering-Computation/dp/0262533022/) book is pretty good. It's pretty out of date now, written for the 2.5 standard (4.0 has been out for a bit as of the time of writing), but if you're a beginner, you should be fine starting there.

The [OpenMP 4.0 Syntax Quick Reference Card](http://openmp.org/mp-documents/OpenMP-4.0-C.pdf) is useful once you kind of get the hang of OpenMP but don't use it regularly enough to remember how the order and/or presence/absence of keywords can drastically affect things.

In my opinion, books and tutorials are great to motivate you and give you a starting point, but the best way to really learn something is to dive into the deep end with a lead weight around your neck. Profile your code, find something that could be faster, and make it parallel or die trying.

We didn't talk about simulations and other things involving random number generation. Just know that it complicates things in a way I don't feel like getting into right now, and woe betide the person with such problems in the first place.



# Exercises

697 |
698 | 699 |
700 |
What is best in life? TO CRUSH YOUR BENCHMARKS, SEE YOUR MULTIPLE CORES AT FULL LOAD, AND HEAR THE LAMENTATION OF WHOEVER PAYS YOUR POWER BILL
701 |
702 | 703 | For the following exercises, use your choice of C, C++, and Fortran. For C++, avoid the functional-ish STL operations and use `for` loops. When I say "write a function", I mean "write a function in a compiled language, callable by R", such as via Rcpp. 704 | 705 | 1. Write a function that takes a vector of integers and multiplies each element by 2. 706 | 707 | 2. Revisit the above, but using OpenMP. 708 | 709 | 3. Experiment with various problem sizes and core counts benchmarking exercises 1 and 2 above, and an R implementation (`2*x`). 710 | 711 | 4. Write a function that takes a numeric matrix and produces a numeric vector containing the sums of the columns of the matrix. 712 | 713 | 5. Revisit the above, but using OpenMP. 714 | 715 | 6. Experiment with various problem sizes and core counts benchmarking exercises 4 and 5 above, as well as the R function `colSums()`. For bonus points, handle `NA`'s as R does. 716 | 717 | 7. Write a safe version of the `rootsum()` function in the thread safety subsection. 718 | 719 | 720 | 721 | # Thanks 722 | 723 | Special thanks to: 724 | 725 | * Karl Broman 726 | * Joseph Stachelek 727 | * Dirk Eddelbuettel 728 | 729 | who each corrected and/or pointed out a number of errors, ranging from typos and grammar issues to very serious, technical flaws. 730 | 731 | Also thanks to everyone who had kind words to say about this document. I really appreciate it. 732 | 733 | --------------------------------------------------------------------------------