# How to specify survey settings

## Stas Kolenikov, Brady T. West, Peter Lugtig

In this repository, we document our understanding of, and recommendations for,
appropriate best practices in specifying the complex sampling design settings
in statistical software that enables design-based analyses of survey data.
We discuss features of complex survey data such as stratification, clustering,
unequal probabilities of selection, and calibration, and outline their impact on estimation procedures.
We demonstrate how statistical software treats them, and how survey data providers
can make data users' lives easier by clearly documenting accurate and efficient ways to make sure
that their software properly accounts for the complex sampling design features.

- [Survey sampling features](#survey-sampling-features)
  + [Stratification](#stratification)
  + [Cluster Sampling](#cluster-sampling)
  + [Unequal Probabilities of Selection](#unequal-probabilities-of-selection)
  + [Weight Adjustments](#weight-adjustments)
  + [Sampling is about doing the best job for the money!](#sampling-is-about-doing-the-best-job-for-the-money-)
- [Survey settings in statistical software](#survey-settings-in-statistical-software)
  + [R](#r)
  + [Stata](#stata)
  + [SAS](#sas)
  + [See also](#see-also)
- [Documentation on appropriate design-based analysis techniques for complex sample survey data: rubrics](#documentation-on-appropriate-design-based-analysis-techniques-for-complex-sample-survey-data--rubrics)
- [Evaluating documentation in practice](#evaluating-documentation-in-practice)
  + [Dealing with existing documentation](#dealing-with-existing-documentation)
  + [The National Survey of Family Growth (NSFG), 2013--2015](#the-national-survey-of-family-growth--nsfg---2013--2015)
  + [The Population Assessment of Tobacco and Health](#the-population-assessment-of-tobacco-and-health)
  + [Understanding Society (Waves 1--8)](#understanding-society--waves-1--8-)
  + [European Social Survey](#european-social-survey)
  + [American Time Use Survey](#american-time-use-survey)
  + [The 2005 India Human Development Survey](#the-2005-india-human-development-survey)
  + [A Portrait of Jewish Americans](#a-portrait-of-jewish-americans)
  + [Placeholder: Survey name](#placeholder--survey-name)
- [Recommendations for survey organizations](#recommendations-for-survey-organizations)
  + [The Data Documentation Initiative](#the-data--documentation-initiative)
- [Additional resources](#additional-resources)
- [Acknowledgements](#acknowledgements)

### About authors

Stas Kolenikov is Principal Scientist at [Abt Associates](http://www.abtassociates.com);
@skolenik on GitHub; [@StatStas](https://twitter.com/StatStas) on Twitter.
His interest in this project is from the triple perspective of the data provider
at the Division of Data Science, Surveys, and Enabling Technologies of Abt Associates;
of an occasional continuing education instructor teaching courses on survey design and weighting, and estimation
with complex survey data; and of the developer of
[statistical software for survey weight calibration](https://econpapers.repec.org/software/bocbocode/s458430.htm).

Brady T. West is Research Associate Professor
in the [Survey Research Center](https://www.src.isr.umich.edu/)
at the [Institute for Social Research](http://www.isr.umich.edu/) on the University of Michigan-Ann Arbor campus.
His interest in this project arises from his recent NSF-funded work on analytic error
(https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0158120),
suggesting that the majority of secondary analyses published in peer-reviewed journals
do not correctly account for the features of complex sample designs in the estimation and inference process.
[@bradytwest](https://twitter.com/bradytwest) on Twitter.
Website: http://www-personal.umich.edu/~bwest/

Peter Lugtig is Associate Professor at
[the Department of Methodology and Statistics](https://www.uu.nl/en/organisation/faculty-of-social-and-behavioural-sciences/about-the-faculty/departments/methodology-statistics)
in the School of Social and Behavioural Sciences, University of Utrecht.
His interest in this project is from the perspective of teaching applied social scientists to work with complex survey data.
One of his interests is in estimating separate components of Total Survey Error in surveys. To this end, it is vital that
potential errors introduced via sampling can be identified and correctly adjusted.
[@PeterLugtig](https://twitter.com/PeterLugtig) on Twitter.
Website: http://www.peterlugtig.com/

___

## Survey sampling features

One of the objectives of survey research is to come up with survey designs that minimize Total Survey Error (TSE).
Sampling and adjustment errors are only two of the errors within the larger TSE framework.
However, when surveys are built on the principles of probability sampling, these are two types of error that can be estimated correctly.
When coverage and nonresponse can be estimated as well, it becomes possible to adjust for these errors so that the analysis of the survey represents the larger population.
If this is done well, the results from the survey analysis are asymptotically unbiased, while uncertainty due to the various errors can be estimated.
In this manifesto, we focus on the "big four" features of complex sampling designs: stratification, cluster sampling, unequal probabilities of selection, and weight adjustments.
Each design feature is described in more detail below.

### Stratification

Stratification = breaking up the population/frames into mutually exclusive groups (strata) before sampling. Common examples of strata include:

* Geographic regions for in-person samples
* Diagnostic groups for patient list samples
* Industry and/or employment size and/or geographical regions for establishment samples

Why do complex sampling designs employ stratification?

* Oversample subpopulations of interest if they can be identified on the frame(s)
* Oversample areas of higher concentration of the target rare population (cf. http://www.asasrms.org/Proceedings/y2006/Files/JSM2006-000557.pdf)
* Ensure specific accuracy targets in subpopulations of interest
* Utilize different sampling designs/frames in different strata
* Balance things around/avoid weird outlying samples/spread the sample across the whole population
* Optimize costs vs. precision via [Neyman-Chuprow](https://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_surveyselect_a0000000208.htm) or more complicated allocations

### Cluster Sampling

Cluster, or multistage, sampling design = sampling groups of units (clusters), rather than the ultimate observation units. Common examples of randomly sampled clusters include:

* Geographic units (e.g., census tracts) in face-to-face surveys
* Entities in natural hierarchies (e.g., health care providers within practices within hospitals, or students within classes within schools)

Why do complex sampling designs employ cluster sampling?

* Complete lists of all units are not available, but the survey statistician can obtain lists of administrative units for which residence or other relevant eligibility status of observation units can be easily identified
* Reduce interviewer travel time/cost in face-to-face surveys
* Analytic interest in multilevel modeling of hierarchical structures

Terminology: PSU = primary sampling unit = cluster for analytic purposes

### Unequal Probabilities of Selection

Why do complex sampling designs assign unequal probabilities of selection to different population units?

* Directly oversample (smaller) subpopulations of interest (e.g., ethnic/racial minorities) that would not have sufficient
  sample sizes in an equal probability of selection method (EPSEM) sample
* Indirectly oversample, by selecting areas with a higher concentration of the target rare population
* Result of multiple stage/cluster sampling
  - Most samples for face-to-face surveys are designed with probability proportional to size (PPS) sampling at the first few stages, and a fixed sample size at the last stage $\Rightarrow$ approximately EPSEM. In many cases, the sample size at the last stage of selection (e.g., the size of a household) is unknown in advance.
  - If measures of size are not accurate, or nonresponse depends on the cluster (size), the sample is no longer EPSEM.
* Unintended result of multiple frame sampling
  - dual phone users, i.e., those who have both landline and cell phone service, are more likely to be selected.

### Weight Adjustments

Why are design weights (computed as the inverse of the probability of selection) adjusted further? These adjustments are designed to be corrections for...

* eligibility
* frame noncoverage
* frame overlap in multiple frame surveys
* statistical efficiency
* nonresponse (unavoidable in the real world)

### Sampling is about doing the best job for the money!

At the end of the day, all of the complex sampling features described above are employed for one or more of the following reasons:

* Save money

  - use cluster samples to save on travel costs
  - use stratified samples to realize statistical efficiency gains

* Cannot get the full population listing

  - ... so have to use area samples to gradually zoom down to individuals
  - ... so have to use infrastructure created for a different purpose (telecom or postal) to contact people
  - ... so have to sample a larger, general population, and screen for the eligible rare/hard to reach population

* Overcome real world data collection difficulties

  - nonresponse weight adjustments

As a result of all considerations above, most serious surveys that use face-to-face data collection or address-based sampling employ a complex sample design in their fieldwork.
Data resulting from such a survey cannot be analyzed naively, and survey weights have to be used.
Survey statisticians routinely compute weights for users. These weights often take the form of a design weight that corrects for
eligibility, frame overlap, and unequal selection probabilities in sampling. A separate nonresponse weight corrects for
nonresponse, and sometimes for noncoverage errors in the frame used. In some surveys additional weights are provided for the
purpose of doing cross-national comparisons (multi-country surveys) or longitudinal analysis (cohort or panel studies).
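
To make the weight components described above concrete, here is a minimal sketch in plain Python. It is an illustration only: the selection probabilities, the `cell` adjustment classes and the helper names are hypothetical, and the weighting-class adjustment shown is just one simple way to correct for nonresponse. Production weighting typically involves more steps (eligibility adjustments, calibration, trimming).

```python
# Hypothetical sampled cases: selection probability, adjustment class, response status.
sample = [
    {"id": 1, "prob": 0.01, "cell": "urban", "responded": True},
    {"id": 2, "prob": 0.01, "cell": "urban", "responded": False},
    {"id": 3, "prob": 0.02, "cell": "rural", "responded": True},
    {"id": 4, "prob": 0.02, "cell": "rural", "responded": True},
]


def design_weight(prob):
    """Design (base) weight: the inverse of the probability of selection."""
    return 1.0 / prob


def nonresponse_adjusted_weights(cases):
    """Weighting-class adjustment (an assumed, simplified procedure):
    within each cell, respondents' design weights are inflated so that
    they sum to the design-weight total of all sampled cases in the cell."""
    cell_totals = {}       # design-weight total of all sampled cases per cell
    respondent_totals = {} # design-weight total of respondents per cell
    for case in cases:
        w = design_weight(case["prob"])
        cell_totals[case["cell"]] = cell_totals.get(case["cell"], 0.0) + w
        if case["responded"]:
            respondent_totals[case["cell"]] = respondent_totals.get(case["cell"], 0.0) + w
    return {
        case["id"]: design_weight(case["prob"])
        * cell_totals[case["cell"]] / respondent_totals[case["cell"]]
        for case in cases
        if case["responded"]
    }


weights = nonresponse_adjusted_weights(sample)
for case_id, weight in sorted(weights.items()):
    print(case_id, round(weight, 2))
```

In this toy example the nonrespondent receives no final weight, and the respondents' adjusted weights add up to the same total as the design weights of the full sample, which is the property the adjustment is meant to preserve.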
For more information on how modern surveys are efficiently designed, and how weights are computed, we refer the reader to Kalton and
Flores-Cervantes (2003), Lohr (2010), Bethlehem (2011), Valliant, Dever and Kreuter (2013), Valliant and Dever (2017) or Kolenikov
(2016).
The weights included in the dataset should be accompanied by detailed documentation on how the weights were computed and how they should be used in practice by applied researchers.
We have often found that the documentation of survey weights is inadequate.
Sometimes, details on how the weights were designed are missing.
More often, the description of the weights is sparse or very technical. This then leads to users not using weights at all, or using them incorrectly.
West, Sakshaug and Aurelien (2016) have shown, for example, that analytic errors are prevalent in 145 analyses of the survey 'Scientists and Engineers Statistical Data System' (SESTAT).
This paper seeks to provide a rubric for how survey weights should be documented. We will define a set of rubrics consisting of five main and two bonus elements, and then use these rubrics to discuss the survey documentation of several popular surveys originating in the U.S., U.K. and Europe.

This paper is accompanied by a website, where applied researchers can paste example code from SAS, Stata and R and generate corresponding code in other software packages to facilitate the correct use of weights in the future. Please visit https://statstas.shinyapps.io/svysettings/ for details.

___

## Survey settings in statistical software

The most common public use data file specification of an area probability sampling design is that
of a two-stage stratified clustered sample. It is nearly always an approximation to the true sampling design,
as most typically the design would include more stages, and some additional modifications of the
sampling design variables would be undertaken: true sampling strata or units would be combined
or split, units would be swapped with one another, etc., typically in order to mask the true geographical
locations of respondents, as geography is one of the strongest factors putting individuals
at risk of identification and disclosure (Heeringa et al., 2017, Chapter 4).

In the examples below, we use the following semi-standardized designs:

- a "public use" stratified two-stage design:
  * the data file in the package native format is `PUMS_svy`, with an appropriate extension
  * strata are `thisStrat`
  * clusters are `thisPSU`
  * weights are `thisWeight`
  * Taylor Series Linearization (TSL) is generally the default variance estimation procedure in these settings
- a "dual frame RDD" design, approximated by an unequal probability design:
  * the data file in the package native format is `RDD_svy`, with an appropriate extension
  * weights are `thisWeight`
- a design with the bootstrap replicate weights:
  * the data file in the package native format is `BSTRAP_svy`, with an appropriate extension
  * the main weights are `thisWeight`
  * the replicate weights are `bsWeight1`, `bsWeight2`, ..., `bsWeight100`

In addition, three analyses are discussed:

- estimation of the total of a continuous variable `y`;
- cross-tabulation of two categorical variables `sex` and `race`;
- analysis in a subpopulation/domain defined by an age restriction, `age` between 18 and 30.

### R

Implementation of complex sample survey estimation in `library(survey)` separates the steps of declaring the sampling design
and running estimation.

(In terms of reading the input data, we assume that the user follows the best practices of workflow management
and uses `library(here)` to identify the root of the project;
see [Bryan (2017)](https://www.tidyverse.org/articles/2017/12/workflow-vs-script/)).

The "public use" stratified two-stage design:

```
# prerequisites
library(survey)
library(here)
# read the data; load() is assumed to restore a data frame named thisSurvey
load(here("data/PUMS_svy.Rdata"))
# specify the design: data takes the data frame itself, not a formula
thisDesign <- svydesign(ids = ~thisPSU, strata = ~thisStrat, weights = ~thisWeight, data = thisSurvey)
# estimate the total
(total_y <- svytotal(~y, design = thisDesign))
# tabulate
(tab1_sex_race <- svymean(~interaction(sex, race, drop = TRUE), design = thisDesign))
(tab2_sex_race <- svytable(~sex + race, design = thisDesign))
(tab3_sex_race <- svyby(~sex, by = ~race, design = thisDesign, FUN = svymean))
# subpopulation estimation: subset the design object, not the raw data
young_adults <- subset(thisDesign, (age >= 18) & (age <= 30))
(total_y_young <- svytotal(~y, design = young_adults))
```

In the above, the line `( object <- function_call(input1, ... ) )` simultaneously creates and assigns
the object, and prints it. Lumley (2010) notes that by default, all functions give missing values (`NA`)
when they encounter item missing data. To discard the missing data from analysis, `na.rm=TRUE` should
be specified as an option to the `svy...(...,na.rm=TRUE)` functions, with the effect of treating
the non-missing data as a subpopulation. More complex analyses, such as linear models, are available through the survey package.
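
For intuition about what the total and subpopulation calls above compute as point estimates, here is a minimal sketch in plain Python with made-up records. It is not the `survey` package's implementation and it covers point estimates only: design-based standard errors additionally require the strata and PSU information. The `records` data and the `weighted_total` helper are hypothetical.

```python
# Made-up records: final weight w, outcome y (None = item missing), and age.
records = [
    {"w": 100.0, "y": 2.0, "age": 25},
    {"w": 150.0, "y": 4.0, "age": 40},
    {"w": 50.0, "y": None, "age": 19},
    {"w": 200.0, "y": 1.0, "age": 30},
]


def weighted_total(rows, in_domain=lambda row: True):
    """Weighted total of y over the rows that belong to the domain.

    Rows outside the domain stay in the data and simply contribute zero;
    rows with missing y are treated as falling outside the domain, which
    mirrors the 'non-missing data as a subpopulation' reading of na.rm=TRUE."""
    return sum(
        row["w"] * row["y"]
        for row in rows
        if row["y"] is not None and in_domain(row)
    )


total_y = weighted_total(records)
total_y_young = weighted_total(records, lambda row: 18 <= row["age"] <= 30)
print(total_y, total_y_young)
```

The design choice worth noting is that the domain is expressed as an indicator applied to the full data set rather than by deleting rows, which is the same idea as subsetting the design object instead of the raw data frame.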

The RDD unequal weights design:

```
# prerequisites
library(survey)
library(here)
# read the data; load() is assumed to restore a data frame named thisSurvey
load(here("data/RDD_svy.Rdata"))
# specify the design: no clusters (ids = ~1), weights only
thisDesign <- svydesign(ids = ~1, weights = ~thisWeight, data = thisSurvey)
# estimation can use the same syntax as above
```

The replicate weight design:

```
# prerequisites
library(survey)
library(here)
# read the data; load() is assumed to restore a data frame named thisSurvey
load(here("data/BSTRAP_svy.Rdata"))
# specify the design: replicate weights are selected by a regular expression
thisDesign <- svrepdesign(weights = ~thisWeight, data = thisSurvey,
    repweights = "bsWeight[0-9]+", type = "bootstrap",
    combined.weights = TRUE)
# estimation can use the same syntax as above
```

In the above syntax, `"bsWeight[0-9]+"` is a [*regular expression*](https://regexr.com/) which, in this case, builds
a filter for variable names as follows:

1. the name must contain the text `bsWeight` exactly (prefix the pattern with `^` to require that the name starts with it);
2. this text must be followed by a digit `[0-9]`;
3. the digit must occur at least once, and may occur an unlimited number of times (`+` modifier).

For more examples, see Thomas Lumley's documentation of the `library(survey)` package:

- [Lumley (2010)](https://www.amazon.com/Complex-Surveys-Guide-Analysis-Using/dp/0470284307) book;
- http://r-survey.r-forge.r-project.org/survey/, home of the `library(survey)` package.

### Stata

In Stata, survey settings can be specified once, and be used later with the `svy:` estimation prefix.
The settings can be saved with the data set. This is a recommended best practice for data providers.

```
use PUMS_svy, clear
svyset
* if empty, specify svyset on your own
svyset thisPSU [pw=thisWeight], strata(thisStrat)
* estimate the total
svy : total y
* tabulate
svy : tab sex race, col se
* subpopulation estimation: subpop option
svy , subpop( if inrange(age,18,30) ) : total y
```

The RDD unequal weights design:

```
use RDD_svy, clear
svyset
* if empty, specify svyset on your own
* there is no cluster variable: _n declares each observation as its own sampling unit
svyset _n [pw=thisWeight]
* estimation commands as before
* estimate the total
svy : total y
* tabulate
svy : tab sex race, col se
* subpopulation estimation: subpop option
svy , subpop( if inrange(age,18,30) ) : total y
```

The replicate weight design:

```
use BSTRAP_svy, clear
svyset
* if empty, specify svyset on your own
svyset [pw=thisWeight], vce(bootstrap) bsrw( bsWeight* ) mse
* estimate the total
svy : total y
* tabulate
svy : tab sex race, col se
* subpopulation estimation: subpop option
svy , subpop( if inrange(age,18,30) ) : total y
```

The estimation commands themselves are identical to those for the cluster+strata designs.

The `mse` option of the `svyset` command requests the MSE version of the variance estimator,
in which the original (full-sample) estimate is subtracted from each replicate estimate
when the squared differences are formed, as opposed to the variance version,
in which the mean of the replicate estimates (pseudo-values) is subtracted.

Starting with Stata 15.1, calibrated weights [are supported](https://www.stata.com/bookstore/survey-weights/).

### SAS

In SAS, survey settings need to be declared in every `SURVEY` procedure.

[Tabulations and cross-tabulations](https://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#surveyfreq_toc.htm):

```
PROC SURVEYFREQ data=thisSurveyLib.thisSurvey;
  WEIGHT thisWeight;
  CLUSTER thisPSU;
  STRATA thisStrat;
  TABLES sex*race;
RUN;
```

Subpopulation analysis:

```
DATA thisSurveyLib.thisSurvey;
  SET thisSurveyLib.thisSurvey;
  age_18to30 = (age>=18) & (age<=30);
RUN;
PROC SURVEYFREQ data=thisSurveyLib.thisSurvey;
  WEIGHT thisWeight;
  CLUSTER thisPSU;
  STRATA thisStrat;
  /* SURVEYFREQ has no DOMAIN statement: request the domain as a TABLES layer */
  TABLES age_18to30*sex*race;
RUN;
```

### See also

https://github.com/skolenik/ICHPS2018-svy

https://statstas.shinyapps.io/svysettings/

___

## Documentation on appropriate design-based analysis techniques for complex sample survey data: rubrics

Large scale data collections are nowadays routinely released to the public. They typically include anonymized survey microdata, along with some variables that include details about the fieldwork itself,
and one or several weighting variables that allow any data user to correct for unequal sampling probabilities introduced in the survey design,
as well as for errors introduced by noncoverage and/or nonresponse.
The survey datasets are accompanied by survey documentation that explains the design of the surveys and details of the measurements taken.
In what follows, we first propose a short checklist to assess the quality of survey documentation. We then apply the checklist to several existing and widely used public datasets. We argue that the documentation should be understandable and usable to the average applied scientist, and conclude with recommendations on how to improve survey documentation.


1. **Can a survey statistician figure out from the documentation how to set the data up for correct estimation?**
   This would be a person with training on par with or exceeding the level of the Lohr (1999) or Kish (1965) textbooks, and applied
   experience on par with or exceeding the Lumley (2010) or Heeringa, West and Berglund (2017) books.

2. **Can an applied researcher figure out from the documentation how to set the data up for correct estimation?**
   This would be a person who has some background / training in applied statistical analysis, but has only cursory knowledge of
   survey methodology, based on at most several hours of classroom instruction in their "methods" class or a short course at a
   conference.

3. **Is everything described succinctly in one place, or scattered throughout the document?**
   It is of course easier on the user when all the relevant information is easily available in a single section. However, some
   reports put information about weights in one place, e.g., where sampling was described, while information about other complex
   sampling features (e.g., cluster/strata/variance estimation) only appears some twenty pages away.

4. **Are examples of specific syntax to specify survey settings provided?**
   Has the data producer provided worked and clearly annotated examples of analyses of the complex sample survey data produced by a
   given survey using the syntax for existing procedures in one or more common statistical software packages? And as a bonus, have
   examples been provided in multiple languages (e.g., SAS, R, and Stata)?

5. **Are there examples given for how to answer substantive research questions?**
   In all languages, there are specific ways to run commands that are survey-design-aware. In other words, only specifying the
   design may not be sufficient to ensure that estimation is done correctly.
   For instance, are examples provided for both
   descriptive and analytic (i.e., regression-driven) research questions?

6. (Bonus) **Is an executive summary description of the sample design available?**
   Many researchers would appreciate a two- to three-sentence paragraph summarizing the sampling design that
   they could copy and paste into their papers, e.g.,

   > {This survey} is a survey with a three-stage area sampling design, with census tracts, households, and individuals
   > as sampling units. The final analysis weights provided by {the organization who collected the data} account
   > for unequal selection probabilities, nonresponse, and study eligibility, and are used in all analyses
   > reported in this paper. Standard errors are estimated using the complex survey bootstrap variance estimation procedures.

   or

   > {This survey} is a dual-frame RDD survey that collected data on both landline and mobile phones.
   > The final analysis weights provided by {the organization who collected the data} account
   > for unequal selection probabilities, nonresponse, and study eligibility, and are used in all analyses
   > reported in this paper. Standard errors are estimated using Taylor series linearization,
   > the default analytical method available in most statistical packages.

7. (Bonus) **What kinds of references are provided?**
   It is often helpful to the end users if the description of the sampling design features is
   accompanied by references to (a) methodological literature describing them in general
   (e.g., textbooks such as Korn & Graubard, Kish, Lohr, Heeringa-West-Berglund, Lumley, etc.),
   and (b) technical publications specific to the study in question, such as
   the JSM or AAPOR proceedings, technical reports on the provider website, or publications
   in technical literature describing the study, if appropriate.
   E.g., the description
   of clustered sampling designs used in the U.S. Census Bureau large scale surveys
   such as the American Community Survey or Current Population Survey could refer
   to general descriptions of stratified clustered surveys, to the user Handbooks
   ([*What Researchers Need to Know* ACS Handbook (Census Bureau 2009)](https://www.census.gov/library/publications/2009/acs/researchers.html)),
   and to the technical papers on variance estimation
   ([Ash 2011](http://www.citeulike.org/user/ctacmo/article/13018645)).

   (Secondary bonus) Are the references pointing to sources other than the authors? (Chances are if there's a JSM Proceedings
   paper by the same group of authors, it won't be any clearer, frankly.)

We now use the seven rubrics defined above to "score" several existing examples of documentation for public-use survey
data files. For example, if the documentation for a public-use data file satisfies
the first five rubrics above, the documentation will be scored 5/5. These scores are designed to be **illustrative**, in terms
of rating existing examples of documentation for public-use data files on how effectively they convey to users the complex sampling features
and how they should be employed in analysis. The scores are designed to motivate data producers to improve the clarity
of their documentation for a variety of data users hoping to analyze large (and usually publicly funded) survey data sets.

## Evaluating documentation in practice

In this section, we evaluate the documentation for a convenience sample of several public use survey data files (PUFs). We
apply the above rubrics to see how the documentation compares in terms of effectively describing appropriate analysis
techniques to data users.

### Dealing with existing documentation

1. Search documentation for the software footprint as keywords: `svyset` for Stata,
   `PROC SURVEY` for SAS, `svydesign` for R `library(survey)`.

2. If that fails, search for "sampling weight", "final weight", "analysis weight", "survey weight" or "design weight".
   You can search for "weight" per se but you should expect that in health studies, this is likely to produce
   many false positives.

3. See if there is any description of the sampling strata and clusters near the text where weights are mentioned.

4. Search for "*PSU*" and "*cluster*" and "*strata*" and "*stratification*" to find
   the variables that need to be specified in survey settings.

5. Search for "*variance estimation*", the generic technical term to deal with complexities of survey estimation.

6. Search for "*replicate weights*", "*BRR*", "*jackknife*" and "*bootstrap*", the keywords for the popular
   replicate variance estimation methods.

___

### The National Survey of Family Growth (NSFG), 2013--2015

⭐⭐⭐⭐⭐

**Funding**:

* Eunice Kennedy Shriver National Institute of Child Health and Human Development
* Office of Population Affairs
* NCHS, CDC
* Division of HIV/AIDS Prevention, CDC
* Division of Sexually Transmitted Disease Prevention, CDC
* Division of Reproductive Health, CDC
* Division of Birth Defects and Developmental Disabilities, CDC
* Division of Cancer Prevention and Control, CDC
* Children’s Bureau, Administration for Children and Families (ACF)
* Office of Planning, Research and Evaluation, ACF

**Data collection**: The University of Michigan Survey Research Center (http://src.isr.umich.edu)

**Host**: The National Center for Health Statistics (http://www.cdc.gov/nchs/)

**URL**: http://www.cdc.gov/nchs/nsfg

**Rubrics**:

1. **Can a survey statistician figure out from the documentation how to set the data up for correct estimation?**
   Yes. Electronic documents like
   [Example 1: Variance Estimates for Percentages](https://www.cdc.gov/nchs/data/nsfg/NSFG_2013_2015_VarEst_Ex1.pdf)
   linked from the [documentation page](https://www.cdc.gov/nchs/nsfg/nsfg_2013_2015_puf.htm)
   under the *Variance estimation* subtitle make it very easy for survey statisticians and applied researchers alike
   to correctly declare complex sampling features to survey analysis software for design-based analyses.

2. **Can an applied researcher figure out from the documentation how to set the data up for correct estimation?**
   Yes. See above.

3. **Is everything that the data user needs to know about the complex sampling contained in one place?**
   Yes, although very little (if anything) is said about the actual complex sample design.
   Instead this information appears in separate electronic files, such as
   [Sample Design Documentation](https://www.cdc.gov/nchs/data/nsfg/NSFG_2013-2015_Sample_Design_Documentation.pdf).
   This is out of necessity, however, given the complexity of the NSFG sample design,
   and all of the information that a user needs to compute weighted point estimates and estimate variance
   accounting for the complex sampling can be found in examples like the one indicated above.

4. **Are examples of specific syntax for performing correct design-based analyses provided?**
   Yes.
   Three examples are clearly documented
   ([tabulations for categorical variables](https://www.cdc.gov/nchs/data/nsfg/NSFG_2013_2015_VarEst_Ex1.pdf);
   [means for continuous variables](https://www.cdc.gov/nchs/data/nsfg/NSFG_2013_2015_VarEst_Ex2.pdf);
   [analysis with domains/subpopulations](https://www.cdc.gov/nchs/data/nsfg/NSFG_2013_2015_VarEst_Ex3.pdf))
   and linked on the main documentation page, and both syntax and output are included in each case.
   Bonus: syntax and output are provided for both SAS and Stata.

5. **Are examples of analyses needed for addressing specific substantive questions provided?**
   Yes; see previous item.

6. **(Bonus) Is an executive summary of the sample design provided?**
   Yes; such an executive summary is given in the first section of
   [the main sample document](https://www.cdc.gov/nchs/data/nsfg/NSFG_2013-2015_Sample_Design_Documentation.pdf).

7. **(Bonus) What kinds of references are provided?**
   There are several references to the most important sample design literature included in Section 11 of the document linked above.

**Score**: 5++/5

The NSFG provides an excellent example of the type of documentation
that needs to be provided to data users to minimize the risk of analytic error
due to a failure to account for complex sampling features.

Accessed on 2018-07-15.

___

### The Population Assessment of Tobacco and Health

⭐⭐⭐⭐⭐

**Funding**: The Population Assessment of Tobacco and Health (PATH) Study is a collaboration
between the National Institute on Drug Abuse (NIDA), National Institutes of Health (NIH),
and the Center for Tobacco Products (CTP), Food and Drug Administration (FDA).

**Data collection**: Westat (http://www.westat.com)

**Host**: The National Addiction and HIV Data Archive Program

**URL**: https://www.icpsr.umich.edu/icpsrweb/NAHDAP/series/606

**Rubrics**:

1. **Can a survey statistician figure out from the documentation how to set the data up for correct estimation?**
Yes. Section 5 of the Public-Use Files User Guide provides clear detail on the calculation and names
of the various weight variables that can be used for estimation. This section also discusses variance estimation,
and clearly describes the replicate weights that have been prepared to enable data users to estimate variances.
Software options are also discussed in this section, and code illustrating the use of multiple programs
for the prototype example analyses is provided in Appendix A.

2. **Can an applied researcher figure out from the documentation how to set the data up for correct estimation?**
Yes. Appendix A of the User Guide is very helpful, given that it provides annotated example code for several different packages.
Section 5 is aimed at survey statisticians, and will be overwhelming to an audience that is less technically prepared.

3. **Is everything that the data user needs to know about the complex sampling contained in one place?**
Yes; Section 5 provides all of the necessary sampling information for analysis purposes, and Appendix A contains all of the necessary code for actual practice.

4. **Are examples of specific syntax for performing correct design-based analyses provided?**
Yes. Appendix A of the Public-Use Files User Guide is an excellent example of providing this kind of resource for data users.

5. **Are examples of analyses needed for addressing specific substantive questions provided?**
Yes. Appendix A illustrates a variety of potential analyses that data users could perform.

6. **(Bonus) Is an executive summary of the sample design provided?**
Chapter 2 of the User Guide provides a detailed summary of the sample design, which serves as an executive summary.

7. **(Bonus) What kinds of references are provided?**
There are several references to the most important sample design literature included at the end of the User Guide.

**Score**: 5++/5

The PATH PUF user guide is another excellent, gold-standard example of detailed and useful information
designed to make the life of the survey data user easier.

Accessed on 2018-12-17.
___

### Understanding Society (Waves 1--8)

⭐⭐

**Funding**: Economic & Social Research Council (ESRC).

**Data collection**:
- The Institute for Social and Economic Research (ISER), University of Essex
- NatCen Social Research (wave 1-5) - Great Britain
- Central Survey Unit of NISRA (wave 1-5) - Northern Ireland
- Kantar Public UK (wave 6 onwards)

**Host**: The UK Data Archive: https://discover.ukdataservice.ac.uk/series/?sn=2000053

**URL**: https://www.understandingsociety.ac.uk/

User Guide, including information on sampling design and weighting:
https://www.understandingsociety.ac.uk/sites/default/files/downloads/documentation/mainstage/user-guides/mainstage-user-guide.pdf

**Rubrics**:

1. **Can a survey statistician figure out how to declare the complex sampling features?**
Yes. See the link above. The stratification is well described, both for Understanding Society and for its predecessor,
the British Household Panel Survey (BHPS). The sample design is complex, as this study is longitudinal.
The study included refreshment samples to increase sample sizes, include minorities, and add new regions into the study.
Sampling design variables are described in Section 3.2.7 of the User Guide referenced above.

2. **Can an applied researcher figure out from the documentation how to set the data up for correct estimation?**
This is problematic given the complexity of the design, and the different populations
that this data set could be used to analyze (e.g., the latest cross-section, panel based on BHPS starting Wave 1,
panel based on BHPS, GPS and EMBS starting from Wave 2, etc.).
The applied researcher needs to have
a very clear definition of exactly what the target population of their analysis is.
Guidance is provided on which survey weights to use depending on the choice of the target population
in Section 3.3.1 of the above referenced User Guide, but the description is highly technical.

3. **Is everything that the data user needs to know about the complex sampling contained in one place?**
Yes, Section 3.2.7 of the User Guide referenced above includes all information.

4. **Are examples of specific syntax for performing correct design-based analyses provided?**
No. Examples are provided for data management tasks such as combining waves or identifying individuals within
households, but no `svyset` syntax is provided.

5. **Are examples of analyses needed for addressing specific substantive questions provided?**
There are examples of the code provided, but these analyses are unweighted,
and thus arguably misleading.
The description is technical, and practical examples of the kind of questions
researchers would want to answer using these data may help users to select the right set of weights.

6. **(Bonus) Is an executive summary of the sample design provided?**
No. The sample design is also too complex for this.

7. **(Bonus) What kinds of references are provided?**
The documentation includes many references to additional papers and technical reports
written on the design and analysis of *Understanding Society* data.

**Score**: 2+/5

*Understanding Society* is a very complex survey. While technical documentation is excellent
for its technical purposes, guidance for applied researchers ranges from limited to confusing.
Examples in Stata and SPSS are provided, but they either cover data management or provide
examples of basic *unweighted* analyses, i.e., they do not help the researchers to set the data up
for correct analysis that would account for the complex sampling nature of the survey.

Accessed on 2018-12-12.
___

### European Social Survey

⭐⭐

**Funding**: European Commission, Horizon 2020.
Rounds 1-7 of ESS have been funded by national science foundations and/or European national governments.

**Data collection**: coordinated by City University, London, UK.
Data collection in individual European countries is coordinated within each country.

**Host**: European Social Survey, formerly at Norwegian data Archive

**URL**: www.europeansocialsurvey.org

Weighting documentation:
http://www.europeansocialsurvey.org/docs/methodology/ESS_weighting_data_1.pdf

**Rubrics**:

1. **Can a survey statistician figure out from the documentation how to set the data up for correct estimation?**
Yes. The European Social Survey is a repeated cross-sectional study conducted in about 30 different countries in Europe.
Sampling is conducted within every country, using either listing methods or registers (of individuals or addresses).
Three weights (design, poststratification and population equivalence weights) are included in the main datafile.
This allows for Horvitz-Thompson estimation, but not the specification of a complex survey design.
However, an Integrated Sample data file does include information on stratification and cluster variables,
as well as selection probabilities for every respondent.
On top of this, a multilevel file adds regional indicators to the main datafile, allowing for multilevel analysis.

2. **Can an applied researcher figure out from the documentation how to set the data up for correct estimation?**
Yes, three weights are provided: a design weight, a poststratification weight and a population equivalence weight.
Guidance is included on how to combine the three weights, and on which weight to use when, in some examples of analyses.

3. **Is everything that the data user needs to know about the complex sampling contained in one place?**
Documentation is scattered across many different documents and files on the ESS website.
However, most users in practice would use one round of ESS.
In that case, the country report files contain details on how fieldwork (including sampling) was conducted.
One good aspect of the European Social Survey is that the users are explicitly warned that data need to be weighted
when data are downloaded from the ESS website. However, there isn't an accompanying warning about using
the sample design variables for variance estimation as well.

4. **Are examples of specific syntax for performing correct design-based analyses provided?**
No.

5. **Are examples of analyses needed for addressing specific substantive questions provided?**
There are a few examples of data management code, but not of the complex survey analysis syntax.

6. **(Bonus) Is an executive summary of the sample design provided?**
There is an executive summary that describes the basic sampling methodology.
There is no easily accessible executive summary that explains how and why sampling differs across countries.

7. **(Bonus) What kinds of references are provided?**
There are references to standard textbooks on complex survey design, and references to other documents
on the ESS website, with more detailed documentation.

**Score**: 2/5

The ESS is a typical example of documentation written by survey statisticians for survey statisticians,
and it takes a survey statistician to process it and come up with the requisite syntax.
Novice users may be deterred by the complexity of the documentation, and would either underutilize
the resource or have to bombard the survey providers with additional questions.

Accessed on 2017-07-19.
___

### American Time Use Survey

**Funding**:

**Data collection**:

**Host**: ICPSR

**URL**: https://www.bls.gov/tus/

**Rubrics**:

**Score**:



___

### The 2005 India Human Development Survey

⭐

**Funding**: NIH Grants R01HD041455 and R01HD046166.

**Data collection**: Per the user guide, the fieldwork was carried out by 24 collaborating institutions under the supervision of the National Council of Applied Economic Research (NCAER) in New Delhi. Interviewers were hired by the institutions and trained by personnel from the University of Maryland and NCAER.

**Host**: The data are hosted by ICPSR (see web site below).

**URL**: https://www.icpsr.umich.edu/icpsrweb/content/DSDR/idhs-data-guide.html

**Rubrics**:

1. **Can a survey statistician figure out from the documentation how to set the data up for correct estimation?**
No.
Even after reviewing the technical documentation (https://ihds.umd.edu/sites/ihds.umd.edu/files/publications/papers/technical%20paper%201.pdf) in addition to the user guide, it is still unclear exactly what variable(s) should be used in variance estimation to reflect the stratification and cluster sampling that was performed. Reading carefully enough reveals the weights that should be used to compute population estimates at the individual or household level (SWEIGHT), but nothing is said about the appropriate design variables (stratum codes, PSU codes) in the data files. Complicating matters is the fact that two apparent PSU variables (IDPSU and PSUID) appear in the data file, with no mention of which one should be used for variance estimation and no references to an appropriate stratum variable.

2. **Can an applied researcher figure out from the documentation how to set the data up for correct estimation?**
No. The web site above provides a brief description of the weights in Section V, but the user guide only mentions the appropriate weight variable for individuals or households (SWEIGHT) in a section describing the constructed and geographic variables. There is also no mention of which variables should be used for appropriate variance estimation, despite the fact that there are variables called PSUID and IDPSU in the data file. There is no indication of what variable (or variables) should be used to capture the stratified sampling that was performed.

3. **Is everything that the data user needs to know about the complex sampling contained in one place?**
For the most part. The user guide contains a detailed section on sampling, and the technical documentation provides even more detail aimed at survey statisticians.

4. **Are examples of specific syntax for performing correct design-based analyses provided?**
No.

5. **Are examples of analyses needed for addressing specific substantive questions provided?**
No. This documentation is in serious need of analysis examples, demonstrating syntax that should be used for design-based analyses along with indications of how to form the appropriate sample design variables for variance estimation.

6. **(Bonus) Is an executive summary of the sample design provided?**
Yes, in the user guide.

7. **(Bonus) What kinds of references are provided?**
The technical documentation does not provide any references, and the user guide briefly refers to papers that have been published using these data (which won't help users who actually want to analyze the data).

**Score**: 1/5

The documentation is incomplete, making it impossible to analyze the data correctly.

Accessed 2018-12-21.
___


### A Portrait of Jewish Americans

⭐⭐⭐⭐

**Funding**: The Pew Research Center’s 2013 survey of U.S. Jews was conducted by the
center’s Religion & Public Life Project with generous funding from
The Pew Charitable Trusts and the Neubauer Family Foundation.

**Data collection**: Abt SRBI under contract to Pew Research Center

**Host**: Pew Research Center http://www.pewresearch.org/

**URL**: http://www.pewforum.org/dataset/a-portrait-of-jewish-americans/

**Rubrics**: how well the documentation matches the desired criteria

1. **Can a survey statistician figure out from the documentation how to set the data up for correct estimation?**
Yes; survey documentation explains the differences between the household and the person-level weights,
and stresses that the bootstrap weights should be used for variance estimation.

2. **Can an applied researcher figure out from the documentation how to set the data up for correct estimation?**
Yes; Stata syntax is provided early in the document, or can be found by search in the PDF file.

3. **Is everything described succinctly in one place, or scattered throughout the document?**
Yes; all of the relevant information is contained in the **Key Elements of the Data** section in about 2 pages.

4. **Are examples of specific syntax to specify survey settings provided?**
Yes; item 6 of the **Key Elements of the Data** section identifies the variables and provides Stata syntax
for individual level and household level analyses.
(Searching for any of `Stata`, `SAS`, `weight`, `svyset` leads the researcher to this information.)
A warning is given that the SPSS Statistics Base package cannot correctly compute standard errors.

5. **Are there examples given for how to answer substantive research questions?**
No examples are given.

6. (Bonus) **Is an executive summary description of the sample design available?**
Sample design is described in painstaking detail in about 9 pages. No short summary of the design is available
from the technical documentation, although such a summary can be found in the substantive report
(http://www.pewforum.org/2013/10/01/jewish-american-beliefs-attitudes-culture-survey/).

7. (Bonus) **What kinds of references are provided?**
No additional references are given.

**Score**: 4+/5

A Portrait of Jewish Americans is a very well described survey that most researchers will be able to
analyze correctly by following the instructions of the data provider.
Slight limitations of the documentation are that examples of the settings are only
given for one package, Stata, and no examples of substantive analyses, e.g.
those leading to the primary
tables in the substantive report, are provided.

Accessed on 2018-12-11.

___

### Placeholder: Survey name

**Funding**: the ultimate client of the study

**Data collection**: the organization that collected and documented the data

**Host**: the organization that hosts the data

**URL**:

**Rubrics**: how well the documentation matches the desired criteria

**Score**:

Accessed on [DATE].
___

## Recommendations for survey organizations

Follow best practices and the above rubrics, and strive to make 5 stars in our ratings!

Documentation of survey settings can also be strengthened in the existing documentation standards.

### The Data Documentation Initiative

DDI (http://www.ddialliance.org/) is an international standard for describing the data
produced by surveys and other observational methods in the social, behavioral, economic, and health sciences.
DDI is a free standard that can document and manage different stages in the research data lifecycle,
such as conceptualization, collection, processing, distribution, discovery, and archiving.
Documenting data with DDI facilitates understanding, interpretation, and use --
by people, software systems, and computer networks.

The version of DDI as of January 2019, DDI 3, provides support for documenting the weights in a free format.
DDI Codebook 2.5 provides the following specifications:

- http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/field_level_documentation_files/schemas/codebook_xsd/elements/weight.html


## Additional resources

Piotr Jabkowski (Adam Mickiewicz University, Poznan, Poland) has been maintaining the Surveys Quality Assessment Database (SQAD).
This project compiles metadata from 67 waves of six major European cross-country comparative surveys,
totaling 1537 individual surveys.
See [the project webpage](https://www.researchgate.net/project/Surveys-Quality-Assessment-Database-SQAD)
and [the Technical Report 1.0](https://www.researchgate.net/publication/327136417_COMPARATIVE_ANALYSIS_OF_THE_QUALITY_OF_SURVEY_SAMPLES_IN_THE_CROSSNATIONAL_STUDIES_SURVEY_ARCHIVISATION_AND_META-BASE_OF_RESULTS_TECHNICAL_REPORT_VERSION_10).

___

## Acknowledgements

The initial impetus for this work came from the AAPOR presentation
by Margaret Levenstein on how ICPSR handles the metadata of the surveys they
store and distribute, and from
[the discussion on Twitter that followed](https://twitter.com/MaryELosch/status/997213578917707778).

Table of contents generated with markdown-toc