├── README.md
├── RedcardData.csv
├── RedcardData_dictionary.txt
├── logistic0.stan
├── logistic1.stan
└── redcard_input.R

/README.md:
--------------------------------------------------------------------------------

# Multithreading and Map-Reduce in Stan 2.18.0: A Minimal Example

v1 - Richard McElreath - 19 Sep 2018

## Introduction

In version 2.18.0, Stan can now use parallel threads to speed up computation of log-probabilities. So while the Markov chain samples are still necessarily serial, running on a single core, the math library calculations needed to compute the samples can be parallelized across different cores. If the data can be split into pieces ("shards"), then this can produce tremendous speed improvements on the right hardware.

The benefits are not automatic, however. You'll have to modify your Stan models first. You might also have to do a little bit of compiler configuration. Finally, at the moment version 2.18.0 is only available for ``cmdstan``, the command line version of Stan. It should be available for ``rstan`` and other interfaces soon. So if you haven't used cmdstan before, you'll need to adjust your workflow a little.

Here I present a minimal example of modifying a Stan model to take advantage of multithreading. I also show the basic workflow for using ``cmdstan``, in case you are used to ``rstan``.

## Overview

The general approach to modifying a Stan model for multithreading has three steps.

(1) You must repack the data and parameters into slices, known as "shards". This looks weird the first time you do it, but it really is just repacking. Nothing fancy is going on.

(2) You must rewrite the model to use a special function to compute the log-probability for each slice ("shard") of the data. Then in the model block, you use a function called `map_rect` to call this function and pass it all the shards.

(3) Configure your compiler to enable multithreading. Then you can compile and sample as usual.

In the rest of this tutorial, I'll build up a simple model first, without multithreading. This will serve to clarify the changes and show you how to work with `cmdstan`, if you haven't before. Then I'll modify the data and model to get multithreading to work.

If you use Windows, it looks like multithreading isn't working yet. So flip over to a Linux partition, if you have one. If you don't, the hope is that an upcoming update to RTools will make things work.

## Prepping the data

Let's consider a simple logistic regression, just so the model doesn't get in the way. We'll use a reasonably big data table, the football red card data set from the recent crowdsourced data analysis project. This data table contains 146,028 player-referee dyads. For each dyad, the table records the total number of red cards the referee assigned to the player over the observed number of games.

The ``RedcardData.csv`` file is provided in the repository here. Load the data in R, and let's take a look at the distribution of red cards:
```R
d <- read.csv( "RedcardData.csv" , stringsAsFactors=FALSE )
table( d$redCards )
```
```text
     0      1      2
144219   1784     25
```
The vast majority of dyads have zero red cards. Only 25 dyads show 2 red cards. These counts are our inference target.

The motivating hypothesis behind these data is that referees are biased against darker skinned players. So we're going to try to predict these counts using the skin color ratings of each player. Not all players actually received skin color ratings in these data, so let's reduce down to dyads with ratings:
```R
d2 <- d[ !is.na(d$rater1) , ]
out_data <- list( n_redcards=d2$redCards , n_games=d2$games , rating=d2$rater1 )
out_data$N <- nrow(d2)
```
This leaves us with 124,621 dyads to predict.

At this point, you are thinking: "But there are repeat observations on players and referees! You need some cluster variables in there in order to build a proper multilevel model!" You are right. But let's start simple. Keep your partial pooling on idle for the moment.

Now to use these data with ``cmdstan``, we need to export the table to a file that ``cmdstan`` can read in. ``rstan``'s `stan_rdump` function makes this easy:
```R
library(rstan)
stan_rdump(ls(out_data), "redcard_input.R", envir = list2env(out_data))
```
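If you want to double-check the export, `rstan` also provides `read_rdump`, which reads the dumped file back into a list. A quick, optional sanity check:
```R
library(rstan)
# read the dumped file back in and confirm nothing was lost
check <- read_rdump("redcard_input.R")
str(check)
stopifnot( check$N == length(check$n_redcards) )
```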
## Making the model

A Stan model for this problem is just a simple logistic (binomial) GLM. I'll assume you know Stan well enough already that I can just plop the code down here. It's contained in the ``logistic0.stan`` file in the repository. It's not a complex model:
```c++
data {
    int N;
    int n_redcards[N];
    int n_games[N];
    real rating[N];
}
parameters {
    vector[2] beta;
}
model {
    beta ~ normal(0,1);
    n_redcards ~ binomial_logit( n_games , beta[1] + beta[2] * to_vector(rating) );
}
```
If you were going to run this model in ``rstan``, you'd now enter:
```R
m0 <- stan( file="logistic0.stan" , data=out_data , chains=1 )
```
But we're going to instead build and run this in ``cmdstan``.

## Installing and Building cmdstan

You'll need to download the 2.18.0 `cmdstan` tarball from the Stan website. Also grab the 2.18.0 user's guide while you are at it. It has a nice new section on this topic.

If you are used to unix and to `make` workflows, you can expand the thing and build it now. But if you are not, don't worry. I'm here to help. Many of my colleagues are very capable scientists who have never done much work in a unix shell. So I know even highly technical people who need some basic help here. I am going to assume, however, that you aren't using Windows. Because if you are, you can install and use `cmdstan` fine, but the multithreading won't work (yet).

Begin by putting the `cmdstan-2.18.0.tar.gz` file wherever you want `cmdstan` to live. You can move it later. Then open a terminal and navigate to that folder. Decompress with:
```bash
tar -xzf cmdstan-2.18.0.tar.gz
```
Enter the new `cmdstan-2.18.0` folder and build with:
```bash
make build
```
Your compiler will toil for a bit and then signal its satisfaction.

## Building and Running the Basic Model in cmdstan

Make a new folder called `redcard` inside your `cmdstan-2.18.0` folder and drop the `redcard_input.R` and `logistic0.stan` files in it. To build the Stan model, while still in the `cmdstan-2.18.0` folder, enter:
```bash
make redcard/logistic0
```
Again, compiler toil. Satisfaction. To sample from the model:
```bash
cd redcard
time ./logistic0 sample data file=redcard_input.R
```
I added the `time` command to the front so that we get a processor time report at the end. We'll want to compare this time to the time you get later, after you enable multithreading. On my machine, I get:
```
real	2m22.694s
user	2m22.369s
sys	0m0.247s
```
The first line is the "real" (wall-clock) time, the one you probably care about for benchmarking.

By default, the output file is called `output.csv`. You'll want to go back into R to analyze it. Just open R in the same folder and enter:
```R
library(rstan)
m0 <- read_stan_csv("output.csv")
```
Now you can proceed as usual to work with the samples.
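For example, the standard `rstan` methods work on the object returned by `read_stan_csv`:
```R
# posterior summary of the intercept and slope
print( m0 , pars="beta" )
# raw posterior samples, for any further calculations
post <- extract( m0 )
str( post$beta )
```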
## Rewriting the Model to Enable Multithreading

The first thing to do in order to get multithreading to work is to rewrite the model. Our rewrite needs to accomplish two things.

First, it needs to restructure the data into multiple "shards" of the same length. Each shard can be sent off into its own thread. The right number of shards depends upon your data, model, and hardware. You could make each dyad, in our example, into its own shard. That would be pretty inefficient, however, because vectorizing the calculation is usually beneficial, and you probably don't have enough processor cores to handle that many parallel threads at once. At the other end, two shards is better than one, and three is nearly always better than two. At some point, we reach an optimal number that trades off efficiency within each shard against the ability to run calculations in parallel.

Second, it needs to use Stan's new `map_rect` function to update the log-probability. This will mean moving the log-probability calculation to a special function and then calling that function in the model block using `map_rect`.

We'll take on each of these in turn.

### Making Shards

In this example, I am going to make 7 shards of equal size. I didn't find this number by any principled reasoning. It is just that we have 124,621 dyads, and that divides nicely by 7 into 17,803 dyads per shard. So 7 shards of 17,803 dyads each.

You'll want to do the splitting into shards in the `transformed data` block of your Stan model. I've built a modified version of `logistic0.stan` and named it, creatively, `logistic1.stan`. It's in the repository. Here is its `transformed data` block:
```c++
transformed data {
    // 7 shards
    // M = N/7 = 124621/7 = 17803
    int n_shards = 7;
    int M = N/n_shards;
    int xi[n_shards, 2*M];  // 2M because two variables, and they get stacked in array
    real xr[n_shards, M];
    // an empty set of per-shard parameters
    vector[0] theta[n_shards];
    // split into shards
    for ( i in 1:n_shards ) {
        int j = 1 + (i-1)*M;
        int k = i*M;
        xi[i,1:M] = n_redcards[ j:k ];
        xi[i,(M+1):(2*M)] = n_games[ j:k ];
        xr[i] = rating[ j:k ];
    }
}
```
The way that shards work is that we have to pack all of the variables particular to each shard into three arrays: an array of integer data, an array of real (continuous) data, and an array of parameters. I'm going to call the array of integers `xi`, the array of reals `xr`, and the array of parameters `theta`.

First, consider the array of reals, `xr`. It's the simplest. We need to pack the `rating` values into it. So we define:
```
real xr[n_shards, M];
```
The value `M` is just the number of dyads per shard. Then in the loop at the bottom of the block, we pack in the `rating` values for each shard, iterating over the shard indexes.

The array of integers is slightly harder. We have two integer variables to pack in: `n_redcards` and `n_games`. But these need to be stacked together in a single row per shard. The code above therefore defines:
```
int xi[n_shards, 2*M];
```
Each row of `xi` is twice as long as a row of `xr`, because we need room for two variables. These get packed in the loop at the bottom. We will unpack them later.

The array of parameters works the same way. But in this example, it is super easy. There are no parameters specific to each shard, because the parameters are global (no varying effects, for example). So we can just define a zero-length vector for each shard:
```
vector[0] theta[n_shards];
```
If there were actually parameters to pack in here, we'd define them in either the `parameters` or `transformed parameters` block.

Hopefully that explains the shard construction. Your data will need their own packing, but the principles are the same.
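If the index arithmetic in the packing loop is hard to see, here is the same idea written out in R, with a tiny made-up data set (purely an illustration, not part of the workflow):
```R
# toy version of the shard packing: 21 dyads, 7 shards of M = 3 each
N <- 21 ; n_shards <- 7 ; M <- N / n_shards
stopifnot( N %% n_shards == 0 )   # shard count must divide the sample size
n_redcards <- rbinom( N , 2 , 0.01 )
n_games <- rpois( N , 20 ) + 1
xi <- matrix( 0L , n_shards , 2*M )
for ( i in 1:n_shards ) {
    j <- 1 + (i-1)*M   # first dyad in shard i
    k <- i*M           # last dyad in shard i
    xi[ i , 1:M ] <- n_redcards[ j:k ]        # first half of row: red cards
    xi[ i , (M+1):(2*M) ] <- n_games[ j:k ]   # second half of row: games
}
```
One caveat worth stating: `M = N/n_shards` is integer division in Stan, so if `N` does not divide evenly, the loop silently drops the last `N %% n_shards` dyads. Pick a shard count that divides your sample size, or pad the data.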
### Using map_rect

In the new model, we replace the model block with:
```c++
model {
    beta ~ normal(0,1);
    target += sum( map_rect( lp_reduce , beta , theta , xr , xi ) );
}
```
Gone is the call to `binomial_logit`. Instead we call `map_rect` and give it five arguments:

(1) `lp_reduce`: This is the name of our new function (we'll write it in a moment) that computes the log-probability of the data.

(2) `beta`: This is a vector of global parameters that do not vary among shards. In this case, it is just the vector defined in the `parameters` block as before. Nothing changes.

(3) `theta`: This is the array of parameters that vary among shards. We defined this in the previous section.

(4) `xr` and (5) `xi`: These are the data arrays we defined above.

Then we make a new block, at the very top of our Stan model, to hold the function `lp_reduce`:
```c++
functions {
    vector lp_reduce( vector beta , vector theta , real[] xr , int[] xi ) {
        int n = size(xr);
        int y[n] = xi[1:n];
        int m[n] = xi[(n+1):(2*n)];
        real lp = binomial_logit_lpmf( y | m , beta[1] + to_vector(xr) * beta[2] );
        return [lp]';
    }
}
```
You can name this function whatever you want. But it has to have arguments of types `vector`, `vector`, `real[]`, and `int[]`. Those match the arguments in the call to `map_rect`.

Inside the function, we first unpack the data into separate arrays again. Then comes the familiar binomial-logit calculation, this time via `binomial_logit_lpmf`, storing the result in `lp`. Then we convert `lp` to a (length-1) vector and return it.

When you run the model, these vectors land back in the model block: `map_rect` collects the returns from all the shards into one long vector, `sum` adds them together, and the total updates the log-probability. Samples are born.
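If it helps to see the whole map-reduce pattern in one place, here is a serial R caricature of what the `map_rect` call computes. This is purely conceptual (the real gains come from Stan running each shard in its own thread), and it reuses the toy `xi`, `M`, and `n_shards` from the packing sketch above:
```R
# stand-in for lp_reduce: log-likelihood of one shard
lp_shard <- function( beta , y , m , x )
    sum( dbinom( y , size=m , prob=plogis( beta[1] + beta[2]*x ) , log=TRUE ) )

# "map" lp_shard over the shards, then "reduce" by summing
map_rect_serial <- function( beta , xi , xr , M )
    sum( sapply( 1:nrow(xi) , function(i)
        lp_shard( beta , xi[i,1:M] , xi[i,(M+1):(2*M)] , xr[i,] ) ) )

# toy ratings to go with the toy xi from before
xr <- matrix( runif( n_shards*M , -2 , 2 ) , n_shards , M )
map_rect_serial( c(-5, 0.2) , xi , xr , M )   # a single log-likelihood value
```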
## Running the Multithreaded Model

### Configure Compiler

Now we're ready to go. Almost. First you need to add some compiler variables. Go to your `cmdstan-2.18.0` folder from earlier. Then enter in the shell:
```bash
echo "CXXFLAGS += -DSTAN_THREADS" > make/local
```
What this does is add a special flag for `cmdstan` when it compiles your Stan models. If you are using g++ (instead of clang), you will also likely need one more flag:
```bash
echo "CXXFLAGS += -pthread" >> make/local
```
On my Linux machine, that was enough to make it work. On my Macintosh, all that was needed was `-DSTAN_THREADS`. If you want to see the `local` file you just made, do:
```bash
cat make/local
```

### Build and Run

Now build the Stan model:
```bash
make redcard/logistic1
```
And finally we can switch into the `redcard` folder, set the number of threads we want to use (-1 means use all available cores), and go:
```bash
cd redcard
export STAN_NUM_THREADS=-1
time ./logistic1 sample data file=redcard_input.R
```
The timings that I get for this model are:
```
real	1m42.789s
user	5m10.812s
sys	0m44.104s
```
So that's 1m43s versus 2m23s from before. We saved less than a minute by multithreading. Not that big a deal, in absolute time. But as a proportion it is really something. And once you start adding varying effects to this model, or move to a larger data table, you can expect quite a speedup.
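Before celebrating, it's worth confirming that the single-threaded and multithreaded models agree. A quick sketch (the file names here are hypothetical; `cmdstan` writes `output.csv` by default, so rename each run's output before the next run overwrites it):
```R
library(rstan)
m0 <- read_stan_csv("output0.csv")   # single-threaded run, renamed
m1 <- read_stan_csv("output1.csv")   # multithreaded run, renamed
print( m0 , pars="beta" )
print( m1 , pars="beta" )   # summaries should match, up to MCMC error
```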
## More

There is a new section in the 2.18 user manual that contains examples of using `map_rect`; see page 237. It also contains an example of a hierarchical model.
--------------------------------------------------------------------------------

/RedcardData_dictionary.txt:
--------------------------------------------------------------------------------
From a company for sports statistics, we obtained data and profile photos from all soccer players (N = 2053) playing in the first male divisions of England, Germany, France and Spain in the 2012-2013 season and all referees (N = 3147) that these players played under in their professional career. We created a dataset of player–referee dyads including the number of matches players and referees encountered each other and our dependent variable, the number of red cards given to a player by a particular referee throughout all matches the two encountered each other.

Player photos were available from the source for 1586 out of 2053 players. Players' skin tone was coded by two independent raters blind to the research question who, based on their profile photo, categorized players on a 5-point scale ranging from "very light skin" to "very dark skin" with "neither dark nor light skin" as the center value.

Additionally, implicit bias scores for each referee country were calculated using a race implicit association test (IAT), with higher values corresponding to faster white | good, black | bad associations. Explicit bias scores for each referee country were calculated using a racial thermometer task, with higher values corresponding to greater feelings of warmth toward whites versus blacks. Both these measures were created by aggregating data from many online users in referee countries taking these tests on Project Implicit (http://projectimplicit.net)

In all, the dataset has a total of 146028 dyads of players and referees. A detailed description of all variables in the dataset can be seen in the list below.

Variables:

playerShort - short player ID
player - player name
club - player club
leagueCountry - country of player club (England, Germany, France, and Spain)
birthday - player birthday
height - player height (in cm)
weight - player weight (in kg)
position - detailed player position
games - number of games in the player-referee dyad
victories - victories in the player-referee dyad
ties - ties in the player-referee dyad
defeats - losses in the player-referee dyad
goals - goals scored by a player in the player-referee dyad
yellowCards - number of yellow cards player received from referee
yellowReds - number of yellow-red cards player received from referee
redCards - number of red cards player received from referee
photoID - ID of player photo (if available)
rater1 - skin rating of photo by rater 1 (5-point scale ranging from "very light skin" to "very dark skin")
rater2 - skin rating of photo by rater 2 (5-point scale ranging from "very light skin" to "very dark skin")
refNum - unique referee ID number (referee name removed for anonymizing purposes)
refCountry - unique referee country ID number (country name removed for anonymizing purposes)
meanIAT - mean implicit bias score (using the race IAT) for referee country; higher values correspond to faster white | good, black | bad associations
nIAT - sample size for race IAT in that particular country
seIAT - standard error for mean estimate of race IAT
meanExp - mean explicit bias score (using a racial thermometer task) for referee country; higher values correspond to greater feelings of warmth toward whites versus blacks
nExp - sample size for explicit bias in that particular country
seExp - standard error for mean estimate of explicit bias measure
--------------------------------------------------------------------------------

/logistic0.stan:
--------------------------------------------------------------------------------
data {
    int N;
    int n_redcards[N];
    int n_games[N];
    real rating[N];
}
parameters {
    vector[2] beta;
}
model {
    beta ~ normal(0,1);
    n_redcards ~ binomial_logit( n_games , beta[1] + beta[2] * to_vector(rating) );
}
--------------------------------------------------------------------------------

/logistic1.stan:
--------------------------------------------------------------------------------
functions {
    vector lp_reduce( vector beta , vector theta , real[] xr , int[] xi ) {
        int n = size(xr);
        int y[n] = xi[1:n];
        int m[n] = xi[(n+1):(2*n)];
        real lp = binomial_logit_lpmf( y | m , beta[1] + to_vector(xr) * beta[2] );
        return [lp]';
    }
}
data {
    int N;
    int n_redcards[N];
    int n_games[N];
    real rating[N];
}
transformed data {
    // 7 shards
    // M = N/7 = 124621/7 = 17803
    int n_shards = 7;
    int M = N/n_shards;
    int xi[n_shards, 2*M];  // 2M because two variables, and they get stacked in array
    real xr[n_shards, M];
    // an empty set of per-shard parameters
    vector[0] theta[n_shards];
    // split into shards
    for ( i in 1:n_shards ) {
        int j = 1 + (i-1)*M;
        int k = i*M;
        xi[i,1:M] = n_redcards[ j:k ];
        xi[i,(M+1):(2*M)] = n_games[ j:k ];
        xr[i] = rating[ j:k ];
    }
}
parameters {
    vector[2] beta;
}
model {
    beta ~ normal(0,1);
    target += sum( map_rect( lp_reduce , beta , theta , xr , xi ) );
}
--------------------------------------------------------------------------------