├── README.md
├── RedcardData.csv
├── RedcardData_dictionary.txt
├── logistic0.stan
├── logistic1.stan
└── redcard_input.R

/README.md:
--------------------------------------------------------------------------------

# Multithreading and Map-Reduce in Stan 2.18.0: A Minimal Example

v1 - Richard McElreath - 19 Sep 2018

## Introduction

In version 2.18.0, Stan can now use parallel threads to speed up computation of log-probabilities. So while the Markov chain samples are still necessarily serial, running on a single core, the math library calculations needed to compute the samples can be parallelized across different cores. If the data can be split into pieces ("shards"), then this can produce tremendous speed improvements on the right hardware.

The benefits are not automatic, however. You'll have to modify your Stan models first. You might also have to do a little bit of compiler configuration. Finally, at the moment version 2.18.0 is only available for ``cmdstan``, the command line version of Stan. It should be available for ``rstan`` and other interfaces soon. So if you haven't used cmdstan before, you'll need to adjust your workflow a little.

Here I present a minimal example of modifying a Stan model to take advantage of multithreading. I also show the basic workflow for using ``cmdstan``, in case you are used to ``rstan``.

## Overview

The general approach to modifying a Stan model for multithreading has three steps.

(1) You must repack the data and parameters into slices, known as "shards". This looks weird the first time you do it, but it really is just repacking. Nothing fancy is going on.

(2) You must rewrite the model to use a special function to compute the log-probability for each slice ("shard") of the data. Then in the model block, you use a function called `map_rect` to call this function and pass it all the shards.

(3) Configure your compiler to enable multithreading. Then you can compile and sample as usual.

In the rest of this tutorial, I'll build up a simple model first, without multithreading. This will serve to clarify the changes and show you how to work with `cmdstan`, if you haven't before. Then I'll modify the data and model to get multithreading to work.

If you use Windows, it looks like multithreading isn't working yet. So flip over to a Linux partition, if you have one. If you don't, the hope is that an upcoming update to RTools will make things work.

## Prepping the data

Let's consider a simple logistic regression, just so the model doesn't get in the way. We'll use a reasonably big data table, the football red card data set from the recent crowdsourced data analysis project. This data table contains 146,028 player-referee dyads. For each dyad, the table records the total number of red cards the referee assigned to the player over the observed number of games.

The ``RedcardData.csv`` file is provided in the repository here. Load the data in R, and let's take a look at the distribution of red cards:
```R
d <- read.csv( "RedcardData.csv" , stringsAsFactors=FALSE )
table( d$redCards )
```
```text
     0      1      2
144219   1784     25
```
The vast majority of dyads have zero red cards. Only 25 dyads show 2 red cards. These counts are our inference target.

The motivating hypothesis behind these data is that referees are biased against darker skinned players. So we're going to try to predict these counts using the skin color ratings of each player. Not all players actually received skin color ratings in these data, so let's reduce down to dyads with ratings:
```R
d2 <- d[ !is.na(d$rater1) , ]
out_data <- list( n_redcards=d2$redCards , n_games=d2$games , rating=d2$rater1 )
out_data$N <- nrow(d2)
```
This leaves us with 124,621 dyads to predict.

At this point, you are thinking: "But there are repeat observations on players and referees! You need some cluster variables in there in order to build a proper multilevel model!" You are right. But let's start simple. Keep your partial pooling on idle for the moment.

Now to use these data with ``cmdstan``, we need to export the table to a file that ``cmdstan`` can read in. ``rstan``'s `stan_rdump` function makes this easy:
```R
library(rstan)
stan_rdump(ls(out_data), "redcard_input.R", envir = list2env(out_data))
```
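If you want to double-check the export, `rstan` also provides `read_rdump`, which reads the dumped file back into a list. A quick, optional sanity check:
```R
library(rstan)
# read the dumped file back in and confirm nothing was lost
check <- read_rdump("redcard_input.R")
str(check)
stopifnot( check$N == length(check$n_redcards) )
```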
## Making the model

A Stan model for this problem is just a simple logistic (binomial) GLM. I'll assume you know Stan well enough already that I can just plop the code down here. It's contained in the ``logistic0.stan`` file in the repository. It's not a complex model:
```c++
data {
    int N;
    int n_redcards[N];
    int n_games[N];
    real rating[N];
}
parameters {
    vector[2] beta;
}
model {
    beta ~ normal(0,1);
    n_redcards ~ binomial_logit( n_games , beta[1] + beta[2] * to_vector(rating) );
}
```
If you were going to run this model in ``rstan``, you'd now enter:
```R
m0 <- stan( file="logistic0.stan" , data=out_data , chains=1 )
```
But we're going to instead build and run this in ``cmdstan``.

## Installing and Building cmdstan

You'll need to download the 2.18.0 `cmdstan` tarball from the Stan website. Also grab the 2.18.0 user's guide while you are at it. It has a nice new section on this topic.

If you are used to unix and to `make` workflows, you can expand the thing and build it now. But if you are not, don't worry. I'm here to help. Many of my colleagues are very capable scientists who have never done much work in a unix shell. So I know even highly technical people who need some basic help here. I am going to assume, however, that you aren't using Windows. Because if you are, you can install and use `cmdstan` fine, but the multithreading won't work (yet).

Begin by putting the `cmdstan-2.18.0.tar.gz` file wherever you want `cmdstan` to live. You can move it later. Then open a terminal and navigate to that folder. Decompress with:
```bash
tar -xzf cmdstan-2.18.0.tar.gz
```
Enter the new `cmdstan-2.18.0` folder and build with:
```bash
make build
```
Your compiler will toil for a bit and then signal its satisfaction.

## Building and Running the Basic Model in cmdstan

Make a new folder called `redcard` inside your `cmdstan-2.18.0` folder and drop the `redcard_input.R` and `logistic0.stan` files in it. To build the Stan model, while still in the `cmdstan-2.18.0` folder, enter:
```bash
make redcard/logistic0
```
Again, compiler toil. Satisfaction. To sample from the model:
```bash
cd redcard
time ./logistic0 sample data file=redcard_input.R
```
I added the `time` command to the front so that we get a processor time report at the end. We'll want to compare this time to the time you get later, after you enable multithreading. On my machine, I get:
```
real	2m22.694s
user	2m22.369s
sys	0m0.247s
```
The first line is the "real" (wall-clock) time, the one you probably care about for benchmarking.

By default, the output file is called `output.csv`. You'll want to go back into R to analyze it. Just open R in the same folder and enter:
```R
library(rstan)
m0 <- read_stan_csv("output.csv")
```
Now you can proceed as usual to work with the samples.
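For example, the standard `rstan` methods work on the object returned by `read_stan_csv`:
```R
# posterior summary of the intercept and slope
print( m0 , pars="beta" )
# raw posterior samples, for any further calculations
post <- extract( m0 )
str( post$beta )
```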
## Rewriting the Model to Enable Multithreading

The first thing to do in order to get multithreading to work is to rewrite the model. Our rewrite needs to accomplish two things.

First, it needs to restructure the data into multiple "shards" of the same length. Each shard can be sent off into its own thread. The right number of shards depends upon your data, model, and hardware. You could make each dyad, in our example, into its own shard. That would be pretty inefficient, however, because vectorizing the calculation is usually beneficial, and you probably don't have enough processor cores to handle that many parallel threads at once. At the other end, two shards is better than one, and three is nearly always better than two. At some point, we reach an optimal number that trades off efficiency within each shard against the ability to run calculations in parallel.

Second, it needs to use Stan's new `map_rect` function to update the log-probability. This will mean moving the log-probability calculation to a special function and then calling that function in the model block using `map_rect`.

We'll take on each of these in turn.

### Making Shards

In this example, I am going to make 7 shards of equal size. I didn't find this number by any principled reasoning. It is just that we have 124,621 dyads, and that divides nicely by 7 into 17,803 dyads per shard. So 7 shards of 17,803 dyads each.

You'll want to do the splitting into shards in the `transformed data` block of your Stan model. I've built a modified version of `logistic0.stan` and named it, creatively, `logistic1.stan`. It's in the repository. Here is its `transformed data` block:
```c++
transformed data {
    // 7 shards
    // M = N/7 = 124621/7 = 17803
    int n_shards = 7;
    int M = N/n_shards;
    int xi[n_shards, 2*M];  // 2M because two variables, and they get stacked in array
    real xr[n_shards, M];
    // an empty set of per-shard parameters
    vector[0] theta[n_shards];
    // split into shards
    for ( i in 1:n_shards ) {
        int j = 1 + (i-1)*M;
        int k = i*M;
        xi[i,1:M] = n_redcards[ j:k ];
        xi[i,(M+1):(2*M)] = n_games[ j:k ];
        xr[i] = rating[ j:k ];
    }
}
```
The way that shards work is that we have to pack all of the variables particular to each shard into three arrays: an array of integer data, an array of real (continuous) data, and an array of parameters. I'm going to call the array of integers `xi`, the array of reals `xr`, and the array of parameters `theta`.

First, consider the array of reals, `xr`. It's the simplest. We need to pack the `rating` values into it. So we define:
```
real xr[n_shards, M];
```
The value `M` is just the number of dyads per shard. Then in the loop at the bottom of the block, we pack in the `rating` values for each shard, iterating over the shard indexes.

The array of integers is slightly harder. We have two integer variables to pack in: `n_redcards` and `n_games`. But these need to be stacked together in a single row per shard. The code above therefore defines:
```
int xi[n_shards, 2*M];
```
Each row of `xi` is twice as long as a row of `xr`, because we need room for two variables. These get packed in the loop at the bottom. We will unpack them later.

The array of parameters works the same way. But in this example, it is super easy. There are no parameters specific to each shard, because the parameters are global (no varying effects, for example). So we can just define a zero-length vector for each shard:
```
vector[0] theta[n_shards];
```
If there were actually parameters to pack in here, we'd define them in either the `parameters` or `transformed parameters` block.

Hopefully that explains the shard construction. Your data will need their own packing, but the principles are the same.
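If the index arithmetic in the packing loop is hard to see, here is the same idea written out in R, with a tiny made-up data set (purely an illustration, not part of the workflow):
```R
# toy version of the shard packing: 21 dyads, 7 shards of M = 3 each
N <- 21 ; n_shards <- 7 ; M <- N / n_shards
stopifnot( N %% n_shards == 0 )   # shard count must divide the sample size
n_redcards <- rbinom( N , 2 , 0.01 )
n_games <- rpois( N , 20 ) + 1
xi <- matrix( 0L , n_shards , 2*M )
for ( i in 1:n_shards ) {
    j <- 1 + (i-1)*M   # first dyad in shard i
    k <- i*M           # last dyad in shard i
    xi[ i , 1:M ] <- n_redcards[ j:k ]        # first half of row: red cards
    xi[ i , (M+1):(2*M) ] <- n_games[ j:k ]   # second half of row: games
}
```
One caveat worth stating: `M = N/n_shards` is integer division in Stan, so if `N` does not divide evenly, the loop silently drops the last `N %% n_shards` dyads. Pick a shard count that divides your sample size, or pad the data.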
### Using map_rect

In the new model, we replace the model block with:
```c++
model {
    beta ~ normal(0,1);
    target += sum( map_rect( lp_reduce , beta , theta , xr , xi ) );
}
```
Gone is the call to `binomial_logit`. Instead we call `map_rect` and give it five arguments:

(1) `lp_reduce`: This is the name of our new function (we'll write it in a moment) that computes the log-probability of the data.

(2) `beta`: This is a vector of global parameters that do not vary among shards. In this case, it is just the vector defined in the `parameters` block as before. Nothing changes.

(3) `theta`: This is the array of parameters that vary among shards. We defined this in the previous section.

(4) `xr` and (5) `xi`: These are the data arrays we defined above.

Then we make a new block, at the very top of our Stan model, to hold the function `lp_reduce`:
```c++
functions {
    vector lp_reduce( vector beta , vector theta , real[] xr , int[] xi ) {
        int n = size(xr);
        int y[n] = xi[1:n];
        int m[n] = xi[(n+1):(2*n)];
        real lp = binomial_logit_lpmf( y | m , beta[1] + to_vector(xr) * beta[2] );
        return [lp]';
    }
}
```
You can name this function whatever you want. But it has to have arguments of types `vector`, `vector`, `real[]`, and `int[]`. Those match the arguments in the call to `map_rect`.

Inside the function, we first unpack the data into separate arrays again. Then comes the familiar binomial-logit calculation, this time via `binomial_logit_lpmf`, storing the result in `lp`. Then we convert `lp` to a (length-1) vector and return it.

When you run the model, these vectors land back in the model block: `map_rect` collects the returns from all the shards into one long vector, `sum` adds them together, and the total updates the log-probability. Samples are born.
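If it helps to see the whole map-reduce pattern in one place, here is a serial R caricature of what the `map_rect` call computes. This is purely conceptual (the real gains come from Stan running each shard in its own thread), and it reuses the toy `xi`, `M`, and `n_shards` from the packing sketch above:
```R
# stand-in for lp_reduce: log-likelihood of one shard
lp_shard <- function( beta , y , m , x )
    sum( dbinom( y , size=m , prob=plogis( beta[1] + beta[2]*x ) , log=TRUE ) )

# "map" lp_shard over the shards, then "reduce" by summing
map_rect_serial <- function( beta , xi , xr , M )
    sum( sapply( 1:nrow(xi) , function(i)
        lp_shard( beta , xi[i,1:M] , xi[i,(M+1):(2*M)] , xr[i,] ) ) )

# toy ratings to go with the toy xi from before
xr <- matrix( runif( n_shards*M , -2 , 2 ) , n_shards , M )
map_rect_serial( c(-5, 0.2) , xi , xr , M )   # a single log-likelihood value
```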
## Running the Multithreaded Model

### Configure Compiler

Now we're ready to go. Almost. First you need to add some compiler variables. Go to your `cmdstan-2.18.0` folder from earlier. Then enter in the shell:
```bash
echo "CXXFLAGS += -DSTAN_THREADS" > make/local
```
What this does is add a special flag for `cmdstan` when it compiles your Stan models. If you are using g++ (instead of clang), you will also likely need one more flag:
```bash
echo "CXXFLAGS += -pthread" >> make/local
```
On my Linux machine, that was enough to make it work. On my Macintosh, all that was needed was `-DSTAN_THREADS`. If you want to see the `local` file you just made, do:
```bash
cat make/local
```

### Build and Run

Now build the Stan model:
```bash
make redcard/logistic1
```
And finally we can switch into the `redcard` folder, set the number of threads we want to use (-1 means use all available cores), and go:
```bash
cd redcard
export STAN_NUM_THREADS=-1
time ./logistic1 sample data file=redcard_input.R
```
The timings that I get for this model are:
```
real	1m42.789s
user	5m10.812s
sys	0m44.104s
```
So that's 1m43s versus 2m23s from before. We saved less than a minute by multithreading. Not that big a deal, in absolute time. But as a proportion it is really something. And once you start adding varying effects to this model, or move to a larger data table, you can expect quite a speedup.
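Before celebrating, it's worth confirming that the single-threaded and multithreaded models agree. A quick sketch (the file names here are hypothetical; `cmdstan` writes `output.csv` by default, so rename each run's output before the next run overwrites it):
```R
library(rstan)
m0 <- read_stan_csv("output0.csv")   # single-threaded run, renamed
m1 <- read_stan_csv("output1.csv")   # multithreaded run, renamed
print( m0 , pars="beta" )
print( m1 , pars="beta" )   # summaries should match, up to MCMC error
```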
## More

There is a new section in the 2.18 user manual that contains examples of using `map_rect`; see page 237. It also contains an example of a hierarchical model.
--------------------------------------------------------------------------------

/RedcardData_dictionary.txt:
--------------------------------------------------------------------------------
From a company for sports statistics, we obtained data and profile photos from all soccer players (N = 2053) playing in the first male divisions of England, Germany, France and Spain in the 2012-2013 season and all referees (N = 3147) that these players played under in their professional career. We created a dataset of player–referee dyads including the number of matches players and referees encountered each other and our dependent variable, the number of red cards given to a player by a particular referee throughout all matches the two encountered each other.

Player photos were available from the source for 1586 out of 2053 players. Players' skin tone was coded by two independent raters blind to the research question who, based on their profile photo, categorized players on a 5-point scale ranging from "very light skin" to "very dark skin" with "neither dark nor light skin" as the center value.

Additionally, implicit bias scores for each referee country were calculated using a race implicit association test (IAT), with higher values corresponding to faster white | good, black | bad associations. Explicit bias scores for each referee country were calculated using a racial thermometer task, with higher values corresponding to greater feelings of warmth toward whites versus blacks. Both these measures were created by aggregating data from many online users in referee countries taking these tests on Project Implicit (http://projectimplicit.net)

In all, the dataset has a total of 146028 dyads of players and referees. A detailed description of all variables in the dataset can be seen in the list below.

Variables:

playerShort - short player ID
player - player name
club - player club
leagueCountry - country of player club (England, Germany, France, and Spain)
birthday - player birthday
height - player height (in cm)
weight - player weight (in kg)
position - detailed player position
games - number of games in the player-referee dyad
victories - victories in the player-referee dyad
ties - ties in the player-referee dyad
defeats - losses in the player-referee dyad
goals - goals scored by a player in the player-referee dyad
yellowCards - number of yellow cards player received from referee
yellowReds - number of yellow-red cards player received from referee
redCards - number of red cards player received from referee
photoID - ID of player photo (if available)
rater1 - skin rating of photo by rater 1 (5-point scale ranging from "very light skin" to "very dark skin")
rater2 - skin rating of photo by rater 2 (5-point scale ranging from "very light skin" to "very dark skin")
refNum - unique referee ID number (referee name removed for anonymizing purposes)
refCountry - unique referee country ID number (country name removed for anonymizing purposes)
meanIAT - mean implicit bias score (using the race IAT) for referee country; higher values correspond to faster white | good, black | bad associations
nIAT - sample size for race IAT in that particular country
seIAT - standard error for mean estimate of race IAT
meanExp - mean explicit bias score (using a racial thermometer task) for referee country; higher values correspond to greater feelings of warmth toward whites versus blacks
nExp - sample size for explicit bias in that particular country
seExp - standard error for mean estimate of explicit bias measure
--------------------------------------------------------------------------------

/logistic0.stan:
--------------------------------------------------------------------------------
data {
    int N;
    int n_redcards[N];
    int n_games[N];
    real rating[N];
}
parameters {
    vector[2] beta;
}
model {
    beta ~ normal(0,1);
    n_redcards ~ binomial_logit( n_games , beta[1] + beta[2] * to_vector(rating) );
}
--------------------------------------------------------------------------------

/logistic1.stan:
--------------------------------------------------------------------------------
functions {
    vector lp_reduce( vector beta , vector theta , real[] xr , int[] xi ) {
        int n = size(xr);
        int y[n] = xi[1:n];
        int m[n] = xi[(n+1):(2*n)];
        real lp = binomial_logit_lpmf( y | m , beta[1] + to_vector(xr) * beta[2] );
        return [lp]';
    }
}
data {
    int N;
    int n_redcards[N];
    int n_games[N];
    real rating[N];
}
transformed data {
    // 7 shards
    // M = N/7 = 124621/7 = 17803
    int n_shards = 7;
    int M = N/n_shards;
    int xi[n_shards, 2*M];  // 2M because two variables, and they get stacked in array
    real xr[n_shards, M];
    // an empty set of per-shard parameters
    vector[0] theta[n_shards];
    // split into shards
    for ( i in 1:n_shards ) {
        int j = 1 + (i-1)*M;
        int k = i*M;
        xi[i,1:M] = n_redcards[ j:k ];
        xi[i,(M+1):(2*M)] = n_games[ j:k ];
        xr[i] = rating[ j:k ];
    }
}
parameters {
    vector[2] beta;
}
model {
    beta ~ normal(0,1);
    target += sum( map_rect( lp_reduce , beta , theta , xr , xi ) );
}
--------------------------------------------------------------------------------