├── missing-data-slides.pdf ├── .gitignore ├── README.md ├── 01_unit_missingness.ipynb ├── solutions ├── 01_unit_missingness.ipynb └── 00_interactive_plot.ipynb ├── 00_interactive_plot.ipynb └── 02_item_missingness.ipynb /missing-data-slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/matthewbrems/missing-data-workshop/HEAD/missing-data-slides.pdf -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # OSX DS Store 2 | .DS_Store 3 | 4 | # Byte-compiled / optimized / DLL files 5 | __pycache__/ 6 | *.py[cod] 7 | *$py.class 8 | 9 | # C extensions 10 | *.so 11 | 12 | # Distribution / packaging 13 | .Python 14 | env/ 15 | build/ 16 | develop-eggs/ 17 | dist/ 18 | downloads/ 19 | eggs/ 20 | .eggs/ 21 | lib/ 22 | lib64/ 23 | parts/ 24 | sdist/ 25 | var/ 26 | *.egg-info/ 27 | .installed.cfg 28 | *.egg 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *,cover 49 | .hypothesis/ 50 | 51 | # Translations 52 | *.mo 53 | *.pot 54 | 55 | # Django stuff: 56 | *.log 57 | local_settings.py 58 | 59 | # Flask stuff: 60 | instance/ 61 | .webassets-cache 62 | 63 | # Scrapy stuff: 64 | .scrapy 65 | 66 | # Sphinx documentation 67 | docs/_build/ 68 | 69 | # PyBuilder 70 | target/ 71 | 72 | # IPython Notebook 73 | *.ipynb_checkpoints 74 | 75 | # pyenv 76 | .python-version 77 | 78 | # celery beat schedule file 79 | celerybeat-schedule 80 | 81 | # dotenv 82 | .env 83 | 84 | # virtualenv 85 | venv/ 86 | ENV/ 87 | 88 | # Spyder project settings 89 | .spyderproject 90 | 91 | # Rope project settings 92 | .ropeproject 93 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Good, Fast, Cheap: How to do Data Science with Missing Data 2 | 3 | ## Resources 4 | 5 | #### Academic Content (roughly sorted from least technical to most) 6 | - [Good summary of single vs. 
multiple imputation](https://scikit-learn.org/stable/modules/impute.html#multiple-vs-single-imputation) 7 | - [UTexas Slides on Missing Data](https://liberalarts.utexas.edu/prc/_files/cs/Missing-Data.pdf) 8 | - [Flexible Imputation of Missing Data, 2nd ed.](https://stefvanbuuren.name/fimd/) 9 | - [Pattern Submodel Paper from "Biostatistics"](https://academic.oup.com/biostatistics/advance-article/doi/10.1093/biostatistics/kxy040/5092384) 10 | - [The prevention and handling of missing data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100/) 11 | - [Should you use a missing data indicator?](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3414599/) 12 | - [Are all biases missing data problems?](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4643276/) 13 | - [Accounting for missing data in statistical analyses: multiple imputation is not always the answer](https://academic.oup.com/ije/article/48/4/1294/5382162?login=false#.XVpWZLg4jrU.twitter) 14 | - [The proportion of missing data should not be used to guide decisions on multiple imputation](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6547017/) 15 | - [Boston University Technical Report on Missing Data, Assumptions, and Applications](http://www.bu.edu/sph/files/2014/05/Marina-tech-report.pdf) 16 | - [Andrew Gelman Chapter on Missing Data - thorough and very good, but academic](http://www.stat.columbia.edu/~gelman/arm/missing.pdf) 17 | 18 | #### Python 19 | - [Scikit-Learn Write-Up of Imputation](https://scikit-learn.org/stable/modules/impute.html) 20 | - [Scikit-Learn SimpleImputer Class](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) 21 | - [Scikit-Learn IterativeImputer Class](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html) **note: this is experimental** 22 | - [Scikit-Learn KNNImputer Class](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html) 23 | - [Scikit-Learn Mean 
Imputation](http://scikit-learn.org/stable/auto_examples/plot_missing_values.html#sphx-glr-auto-examples-plot-missing-values-py) 24 | - [MissingNo in Python](https://github.com/ResidentMario/missingno) 25 | - [MissingNo Paper](http://joss.theoj.org/papers/52b4115d6c03864b884fbf3334851322) 26 | 27 | ##### Rubin's Rules (combining parameter estimates across multiple imputations) 28 | - [Article about Rubin's Rules](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2727536/) 29 | - [Rubin's Rules Formulas](https://bookdown.org/mwheymans/bookmi/rubins-rules.html) 30 | - [R Code for Pooling Estimates](http://finzi.psych.upenn.edu/R/library/mice/html/pool.html) 31 | 32 | #### Non-Statistical References 33 | - [How One 19-Year-Old Illinois Man Is Distorting National Polling Averages - NYTimes](https://www.nytimes.com/2016/10/13/upshot/how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.html) 34 | 35 | 36 | ### Feel free to contact me afterward! 37 | - [LinkedIn](https://www.linkedin.com/in/matthewbrems) 38 | - [Twitter](https://www.twitter.com/matthewbrems) 39 | - [Medium](https://www.medium.com/@matthew.w.brems) 40 | -------------------------------------------------------------------------------- /01_unit_missingness.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Unit Missingness Demo\n", 8 | "\n", 9 | "When handling unit missingness, the most common method is to do **weight class adjustments**. This requires us to break our observations into classes and weight them before doing our analysis." 
10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "# Import libraries.\n", 19 | "import pandas as pd\n", 20 | "import numpy as np\n", 21 | "\n", 22 | "# Set random seed.\n", 23 | "np.random.seed(42)\n", 24 | "\n", 25 | "# Generate dataframe.\n", 26 | "value_score = [min(np.random.poisson(5), 10) if i % 2 == 0 else min(np.random.poisson(6), 10) for i in range(10_000)]\n", 27 | "value_score = [value_score[i] if (i % 8 == 0 or (i % 7 != 0 and i % 2 == 1)) else np.nan for i in range(10_000)]\n", 28 | "departments = ['finance' if i % 2 == 0 else 'accounting' for i in range(10_000)]\n", 29 | "df = pd.DataFrame({\n", 30 | " 'dept': departments,\n", 31 | " 'score': value_score\n", 32 | "})\n", 33 | "\n", 34 | "# Check first five rows.\n", 35 | "df.head()" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "# What is the distribution of department?\n", 45 | "df['dept'].value_counts(normalize = True)" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "# Check for nulls.\n", 55 | "df.isnull().sum()" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": {}, 62 | "outputs": [], 63 | "source": [ 64 | "# Drop NAs.\n", 65 | "df.dropna(inplace = True)" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "# What proportion of our responses came from accounting?\n", 75 | "df['dept'].value_counts(normalize = True)['accounting']" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "df['dept'].value_counts(normalize = True)" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | 
"source": [ 91 | "1. Take the full sample (observed and missing) and break them into subgroups based on characteristics we know.\n", 92 | "2. Calculate a weight for each observation:\n", 93 | "\n", 94 | "$$\n", 95 | "\\text{weight}_i = \\frac{\\text{true proportion in group }i}{\\text{proportion of observed values in group }i}\n", 96 | "$$" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "# Calculate and print the weight for accounting.\n", 106 | "w_accounting = (1/2) / df['dept'].value_counts(normalize = True)['accounting']\n", 107 | "\n", 108 | "print(f'The weight for each accounting vote is: {w_accounting}.')" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "# Calculate the and print weight for finance.\n", 118 | "w_finance = (1/2) / df['dept'].value_counts(normalize = True)['finance']\n", 119 | "\n", 120 | "print(f'The weight for each finance vote is: {w_finance}.')" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [ 129 | "# Let's confirm that the weights times the counts\n", 130 | "# yields a 50/50 split.\n", 131 | "print(w_accounting * df['dept'].value_counts()['accounting'])\n", 132 | "print(w_finance * df['dept'].value_counts()['finance'])" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "# Create column that stores the weights.\n", 142 | "\n", 143 | "df['weights'] = [w_accounting if i == 'accounting' else w_finance for i in df['dept']]" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "# Confirm counts.\n", 153 | "\n", 154 | "df['weights'].value_counts()" 155 | ] 156 
| }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "# Calculate raw mean of my employee satisfaction score.\n", 164 | "\n", 165 | "np.mean(df['score'])" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": null, 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [ 174 | "# Calculate weighted mean of my employee satisfaction score.\n", 175 | "\n", 176 | "np.mean(df['score'] * df['weights'])" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "
Our goal with post-weighting is to decrease bias. What should we be concerned about?\n", 184 | " \n", 185 | "- Due to the bias-variance tradeoff, as we decrease bias, we may cause an increase in variance.\n", 186 | "- This can be a really big deal, [said the New York Times in 2016](https://www.nytimes.com/2016/10/13/upshot/how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.html).\n", 187 | "
" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "
What might be a situation where we may not be able to use weight class adjustments?\n", 195 | " \n", 196 | "- If we don't know the true distribution of our classes.\n", 197 | "- For example, if I didn't know that half of our team was in accounting and half in finance.\n", 198 | "- Another example, let's say I wanted to apply this weighting method to understand the percentage of voters supporting the Democratic candidate in the upcoming election. I don't know how many people will be in each of the age groups 18-34, 35-54, and 55+. I'll have to make a guess. (Hopefully an educated one!)\n", 199 | "
" 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": {}, 205 | "source": [ 206 | "#### Have more variables and want to build a sophisticated model?\n", 207 | "Pass `df['weight']` into `sklearn` when fitting your model. [Source](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.fit).\n", 208 | "> `model.fit(X_train, y_train, X_train['weight'])`" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "In R, I am using `wtd.chi.sq`." 216 | ] 217 | } 218 | ], 219 | "metadata": { 220 | "kernelspec": { 221 | "display_name": "Python 3", 222 | "language": "python", 223 | "name": "python3" 224 | }, 225 | "language_info": { 226 | "codemirror_mode": { 227 | "name": "ipython", 228 | "version": 3 229 | }, 230 | "file_extension": ".py", 231 | "mimetype": "text/x-python", 232 | "name": "python", 233 | "nbconvert_exporter": "python", 234 | "pygments_lexer": "ipython3", 235 | "version": "3.8.3" 236 | } 237 | }, 238 | "nbformat": 4, 239 | "nbformat_minor": 4 240 | } 241 | -------------------------------------------------------------------------------- /solutions/01_unit_missingness.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Unit Missingness Demo\n", 8 | "\n", 9 | "When handling unit missingness, the most common method is to do **weight class adjustments**. This requires us to break our observations into classes and weight them before doing our analysis." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "data": { 19 | "text/html": [ 20 | "
\n", 21 | "\n", 34 | "\n", 35 | " \n", 36 | " \n", 37 | " \n", 38 | " \n", 39 | " \n", 40 | " \n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | "
deptscore
0finance5.0
1accounting4.0
2financeNaN
3accounting5.0
4financeNaN
\n", 70 | "
" 71 | ], 72 | "text/plain": [ 73 | " dept score\n", 74 | "0 finance 5.0\n", 75 | "1 accounting 4.0\n", 76 | "2 finance NaN\n", 77 | "3 accounting 5.0\n", 78 | "4 finance NaN" 79 | ] 80 | }, 81 | "execution_count": 1, 82 | "metadata": {}, 83 | "output_type": "execute_result" 84 | } 85 | ], 86 | "source": [ 87 | "# Import libraries.\n", 88 | "import pandas as pd\n", 89 | "import numpy as np\n", 90 | "\n", 91 | "# Set random seed.\n", 92 | "np.random.seed(42)\n", 93 | "\n", 94 | "# Generate dataframe.\n", 95 | "value_score = [min(np.random.poisson(5), 10) if i % 2 == 0 else min(np.random.poisson(6), 10) for i in range(10_000)]\n", 96 | "value_score = [value_score[i] if (i % 8 == 0 or (i % 7 != 0 and i % 2 == 1)) else np.nan for i in range(10_000)]\n", 97 | "departments = ['finance' if i % 2 == 0 else 'accounting' for i in range(10_000)]\n", 98 | "df = pd.DataFrame({\n", 99 | " 'dept': departments,\n", 100 | " 'score': value_score\n", 101 | "})\n", 102 | "\n", 103 | "# Check first five rows.\n", 104 | "df.head()" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 2, 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "data": { 114 | "text/plain": [ 115 | "accounting 0.5\n", 116 | "finance 0.5\n", 117 | "Name: dept, dtype: float64" 118 | ] 119 | }, 120 | "execution_count": 2, 121 | "metadata": {}, 122 | "output_type": "execute_result" 123 | } 124 | ], 125 | "source": [ 126 | "# What is the distribution of department?\n", 127 | "df['dept'].value_counts(normalize = True)" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 3, 133 | "metadata": {}, 134 | "outputs": [ 135 | { 136 | "data": { 137 | "text/plain": [ 138 | "dept 0\n", 139 | "score 4464\n", 140 | "dtype: int64" 141 | ] 142 | }, 143 | "execution_count": 3, 144 | "metadata": {}, 145 | "output_type": "execute_result" 146 | } 147 | ], 148 | "source": [ 149 | "# Check for nulls.\n", 150 | "df.isnull().sum()" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 
| "execution_count": 4, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "# Drop NAs.\n", 160 | "df.dropna(inplace = True)" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 5, 166 | "metadata": {}, 167 | "outputs": [ 168 | { 169 | "data": { 170 | "text/plain": [ 171 | "0.7742052023121387" 172 | ] 173 | }, 174 | "execution_count": 5, 175 | "metadata": {}, 176 | "output_type": "execute_result" 177 | } 178 | ], 179 | "source": [ 180 | "# What proportion of our responses came from accounting?\n", 181 | "df['dept'].value_counts(normalize = True)['accounting']" 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": {}, 187 | "source": [ 188 | "1. Take the full sample (observed and missing) and break them into subgroups based on characteristics we know.\n", 189 | "2. Calculate a weight for each observation:\n", 190 | "\n", 191 | "$$\n", 192 | "\\text{weight}_i = \\frac{\\text{true proportion in group }i}{\\text{proportion of observed values in group }i}\n", 193 | "$$" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 6, 199 | "metadata": {}, 200 | "outputs": [ 201 | { 202 | "name": "stdout", 203 | "output_type": "stream", 204 | "text": [ 205 | "The weight for each accounting vote is: 0.645823611759216.\n" 206 | ] 207 | } 208 | ], 209 | "source": [ 210 | "# Calculate and print the weight for accounting.\n", 211 | "w_accounting = (1/2) / df['dept'].value_counts(normalize = True)['accounting']\n", 212 | "\n", 213 | "print(f'The weight for each accounting vote is: {w_accounting}.')" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 7, 219 | "metadata": {}, 220 | "outputs": [ 221 | { 222 | "name": "stdout", 223 | "output_type": "stream", 224 | "text": [ 225 | "The weight for each finance vote is: 2.2144.\n" 226 | ] 227 | } 228 | ], 229 | "source": [ 230 | "# Calculate the and print weight for finance.\n", 231 | "w_finance = (1/2) / 
df['dept'].value_counts(normalize = True)['finance']\n", 232 | "\n", 233 | "print(f'The weight for each finance vote is: {w_finance}.')" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": 8, 239 | "metadata": {}, 240 | "outputs": [ 241 | { 242 | "name": "stdout", 243 | "output_type": "stream", 244 | "text": [ 245 | "2767.9999999999995\n", 246 | "2768.0\n" 247 | ] 248 | } 249 | ], 250 | "source": [ 251 | "# Let's confirm that the weights times the counts\n", 252 | "# yields a 50/50 split.\n", 253 | "print(w_accounting * df['dept'].value_counts()['accounting'])\n", 254 | "print(w_finance * df['dept'].value_counts()['finance'])" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": 9, 260 | "metadata": {}, 261 | "outputs": [], 262 | "source": [ 263 | "# Create column that stores the weights.\n", 264 | "\n", 265 | "df['weights'] = [w_accounting if i == 'accounting' else w_finance for i in df['dept']]" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": 10, 271 | "metadata": {}, 272 | "outputs": [ 273 | { 274 | "data": { 275 | "text/plain": [ 276 | "0.645824 4286\n", 277 | "2.214400 1250\n", 278 | "Name: weights, dtype: int64" 279 | ] 280 | }, 281 | "execution_count": 10, 282 | "metadata": {}, 283 | "output_type": "execute_result" 284 | } 285 | ], 286 | "source": [ 287 | "# Confirm counts.\n", 288 | "\n", 289 | "df['weights'].value_counts()" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": 11, 295 | "metadata": {}, 296 | "outputs": [ 297 | { 298 | "data": { 299 | "text/plain": [ 300 | "5.724530346820809" 301 | ] 302 | }, 303 | "execution_count": 11, 304 | "metadata": {}, 305 | "output_type": "execute_result" 306 | } 307 | ], 308 | "source": [ 309 | "# Calculate raw mean of my employee satisfaction score.\n", 310 | "\n", 311 | "np.mean(df['score'])" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": 12, 317 | "metadata": {}, 318 | "outputs": [ 
319 | { 320 | "data": { 321 | "text/plain": [ 322 | "5.450634997666867" 323 | ] 324 | }, 325 | "execution_count": 12, 326 | "metadata": {}, 327 | "output_type": "execute_result" 328 | } 329 | ], 330 | "source": [ 331 | "# Calculate weighted mean of my employee satisfaction score.\n", 332 | "\n", 333 | "np.mean(df['score'] * df['weights'])" 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [ 340 | "
Our goal with post-weighting is to decrease bias. What should we be concerned about?\n", 341 | " \n", 342 | "- Due to the bias-variance tradeoff, as we decrease bias, we may cause an increase in variance.\n", 343 | "- This can be a really big deal, [said the New York Times in 2016](https://www.nytimes.com/2016/10/13/upshot/how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.html).\n", 344 | "
" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": {}, 350 | "source": [ 351 | "
What might be a situation where we may not be able to use weight class adjustments?\n", 352 | " \n", 353 | "- If we don't know the true distribution of our classes.\n", 354 | "- For example, if I didn't know that half of our team was in accounting and half in finance.\n", 355 | "- Another example, let's say I wanted to apply this weighting method to understand the percentage of voters supporting the Democratic candidate in the upcoming election. I don't know how many people will be in each of the age groups 18-34, 35-54, and 55+. I'll have to make a guess. (Hopefully an educated one!)\n", 356 | "
" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "#### Have more variables and want to build a sophisticated model?\n", 364 | "Pass `df['weight']` into `sklearn` when fitting your model. [Source](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.fit).\n", 365 | "> `model.fit(X_train, y_train, X_train['weight'])`" 366 | ] 367 | } 368 | ], 369 | "metadata": { 370 | "kernelspec": { 371 | "display_name": "Python 3", 372 | "language": "python", 373 | "name": "python3" 374 | }, 375 | "language_info": { 376 | "codemirror_mode": { 377 | "name": "ipython", 378 | "version": 3 379 | }, 380 | "file_extension": ".py", 381 | "mimetype": "text/x-python", 382 | "name": "python", 383 | "nbconvert_exporter": "python", 384 | "pygments_lexer": "ipython3", 385 | "version": "3.8.3" 386 | } 387 | }, 388 | "nbformat": 4, 389 | "nbformat_minor": 4 390 | } 391 | -------------------------------------------------------------------------------- /00_interactive_plot.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "#### To confirm that you have the latest versions of these packages, uncomment and run the following command.\n", 10 | "# !pip install numpy pandas matplotlib sklearn ipywidgets IPython missingno --upgrade\n", 11 | "\n", 12 | "# To generate and store data.\n", 13 | "import numpy as np\n", 14 | "import pandas as pd\n", 15 | "\n", 16 | "# To visualize data.\n", 17 | "import matplotlib.pyplot as plt\n", 18 | "\n", 19 | "# To fit linear regression model.\n", 20 | "from sklearn.linear_model import LinearRegression\n", 21 | "\n", 22 | "# To allow interactive plot.\n", 23 | "from ipywidgets import *\n", 24 | "from IPython.display import display\n", 25 | "\n", 26 | "# There is a 
SciPy issue that won't affect our work, but a warning exists\n", 27 | "# and an update is not imminent.\n", 28 | "import warnings\n", 29 | "warnings.filterwarnings(action=\"ignore\")\n", 30 | "\n", 31 | "# To render plots in the notebook.\n", 32 | "%matplotlib inline" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "# Generate data and store in a dataframe.\n", 42 | "\n", 43 | "np.random.seed(42)\n", 44 | "\n", 45 | "age = np.random.uniform(20, 60, size = 100)\n", 46 | "income = 15000 + 750 * age + np.random.normal(0, 20000, size = 100)\n", 47 | "income = [i if i >= 0 else 0 for i in income]\n", 48 | "\n", 49 | "df = pd.DataFrame({'income':income,\n", 50 | " 'age': age})" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "# Create three functions to model missingness according to certain patterns.\n", 60 | "\n", 61 | "def create_mcar_column(df, missing_column = 'income', p_missing = 0.01, random_state = 42):\n", 62 | " \"\"\"\n", 63 | " Creates missingness indicator column, where data are MCAR (missing completely at random).\n", 64 | " \n", 65 | " User must specify:\n", 66 | " df = the pandas DataFrame the user wants to read in for analysis\n", 67 | " missing_column = the name of the column in df that is missing\n", 68 | " p_missing = the proportion of observations that are missing\n", 69 | " \n", 70 | " Function returns:\n", 71 | " mcar_column = a column that indicates whether data are missing, assuming MCAR\n", 72 | " \"\"\"\n", 73 | " np.random.seed(random_state)\n", 74 | " \n", 75 | " mcar_indices = [df.sample(n = 1).index[0] for i in range(round(p_missing * df.shape[0]))]\n", 76 | " \n", 77 | " while len(set(mcar_indices)) < round(p_missing * df.shape[0]):\n", 78 | " mcar_indices.append(df.sample(n = 1).index[0])\n", 79 | " \n", 80 | " mcar_column = [1 if i in mcar_indices else 0 for i
in range(df.shape[0])]\n", 81 | " \n", 82 | " return mcar_column\n", 83 | "\n", 84 | "def create_mar_column(df, missing_column = 'income', depends_on = 'age', method = 'linear', p_missing = 0.01, random_state = 42):\n", 85 | " \"\"\"\n", 86 | " Creates missingness indicator column, where data are MAR (missing at random).\n", 87 | " \n", 88 | " User must specify:\n", 89 | " df = the pandas DataFrame the user wants to read in for analysis\n", 90 | " missing_column = the name of the column in df that is missing\n", 91 | " depends_on = the name of the column in df which affects the missingness\n", 92 | " method = 'linear' or 'quadratic'\n", 93 | " - 'linear' means the probability of missingness is linearly related to the depends_on variable\n", 94 | " - 'quadratic' means the probability of missingness is quadratically related to the depends_on variable\n", 95 | " p_missing = the proportion of observations that are missing\n", 96 | " \n", 97 | " Function returns:\n", 98 | " mar_column = a column that indicates whether data are missing, assuming MAR\n", 99 | " \"\"\"\n", 100 | " np.random.seed(random_state)\n", 101 | " \n", 102 | " if method == 'linear':\n", 103 | " mar_indices = [df.sample(n = 1, weights = df[depends_on] ** -1).index[0] for i in range(round(p_missing * df.shape[0]))]\n", 104 | "\n", 105 | " while len(set(mar_indices)) < round(p_missing * df.shape[0]):\n", 106 | " mar_indices.append(df.sample(n = 1, weights = df[depends_on] ** -1).index[0])\n", 107 | " \n", 108 | " elif method == 'quadratic':\n", 109 | " mar_indices = [df.sample(n = 1, weights = df[depends_on] ** -2).index[0] for i in range(round(p_missing * df.shape[0]))]\n", 110 | "\n", 111 | " while len(set(mar_indices)) < round(p_missing * df.shape[0]):\n", 112 | " mar_indices.append(df.sample(n = 1, weights = df[depends_on] ** -2).index[0])\n", 113 | "\n", 114 | " mar_column = [1 if i in mar_indices else 0 for i in range(df.shape[0])]\n", 115 | " \n", 116 | " return mar_column\n", 117 | "\n", 118 | 
"def create_nmar_column(df, missing_column = 'income', method = 'linear', p_missing = 0.01, random_state = 42):\n", 119 | " \"\"\"\n", 120 | " Creates missingness indicator column, where data are NMAR (not missing at random).\n", 121 | " \n", 122 | " User must specify:\n", 123 | " df = the pandas DataFrame the user wants to read in for analysis\n", 124 | " missing_column = the name of the column in df that is missing\n", 125 | " method = 'linear' or 'quadratic'\n", 126 | " - 'linear' means the probability of missingness is linearly related to the depends_on variable\n", 127 | " - 'quadratic' means the probability of missingness is quadratically related to the depends_on variable\n", 128 | " p_missing = the proportion of observations that are missing\n", 129 | " \n", 130 | " Function returns:\n", 131 | " nmar_column = a column that indicates whether data are missing, assuming NMAR\n", 132 | " \"\"\"\n", 133 | " np.random.seed(random_state)\n", 134 | " \n", 135 | " if method == 'linear':\n", 136 | " nmar_indices = [df.sample(n = 1, weights = df[missing_column] ** -1).index[0] for i in range(round(p_missing * df.shape[0]))]\n", 137 | "\n", 138 | " while len(set(nmar_indices)) < round(p_missing * df.shape[0]):\n", 139 | " nmar_indices.append(df.sample(n = 1, weights = df[missing_column] ** -1).index[0])\n", 140 | " \n", 141 | " elif method == 'quadratic':\n", 142 | " nmar_indices = [df.sample(n = 1, weights = df[missing_column] ** -2).index[0] for i in range(round(p_missing * df.shape[0]))]\n", 143 | "\n", 144 | " while len(set(nmar_indices)) < round(p_missing * df.shape[0]):\n", 145 | " nmar_indices.append(df.sample(n = 1, weights = df[missing_column] ** -2).index[0])\n", 146 | " \n", 147 | " nmar_column = [1 if i in nmar_indices else 0 for i in range(df.shape[0])]\n", 148 | " \n", 149 | " return nmar_column" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "metadata": {}, 156 | "outputs": [], 157 | "source": [ 158 | "def 
generate_scatterplot(p_missing, missing_type, method = 'linear', missing_column = 'income', depends_on = 'age'):\n", 159 | " # Generate one plot.\n", 160 | " fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (16,9))\n", 161 | "\n", 162 | " # Set labels and axes.\n", 163 | " ax.set_xlabel(\"Age\", position = (0,0), ha = 'left', fontsize = 25, color = 'grey', alpha = 0.85)\n", 164 | " ax.set_ylabel(\"Income\", position = (0,1), ha = 'right', va = 'top', fontsize = 25, rotation = 0, color = 'grey', alpha = 0.85)\n", 165 | " \n", 166 | " ax.set_ylim([-1000, 100000])\n", 167 | " \n", 168 | " # Generate data with proportion p missing.\n", 169 | " if missing_type == 'MCAR':\n", 170 | " df['missingness'] = create_mcar_column(df,\n", 171 | " missing_column = missing_column,\n", 172 | " p_missing = p_missing)\n", 173 | " elif missing_type == 'MAR':\n", 174 | " df['missingness'] = create_mar_column(df,\n", 175 | " missing_column = missing_column,\n", 176 | " depends_on = depends_on,\n", 177 | " method = method,\n", 178 | " p_missing = p_missing)\n", 179 | " \n", 180 | " elif missing_type == 'NMAR':\n", 181 | " df['missingness'] = create_nmar_column(df,\n", 182 | " missing_column = missing_column,\n", 183 | " method = method,\n", 184 | " p_missing = p_missing)\n", 185 | " \n", 186 | " # Generate scatterplot.\n", 187 | " ax.scatter(df['age'][df['missingness'] == 0], df['income'][df['missingness'] == 0], s = 35, color = '#185fad', alpha = 0.75, label = 'Observed')\n", 188 | " ax.scatter(df['age'][df['missingness'] == 1], df['income'][df['missingness'] == 1], s = 35, color = 'grey', alpha = 0.25, label = '')\n", 189 | " \n", 190 | " # Generate lines of best fit based on observed and missing values.\n", 191 | " x = np.linspace(20, 60)\n", 192 | " ax.plot(x, 15000 + 750 * x, c = 'orange', alpha = 0.7, label = '\"True\" Line', lw = 3)\n", 193 | " model = LinearRegression().fit(df[['age']][df['missingness'] == 0], df['income'][df['missingness'] == 0])\n", 194 | " ax.plot(x, 
model.intercept_ + model.coef_ * x, c = '#185fad', alpha = 0.7, label='Observed Line', lw = 3)\n", 195 | "\n", 196 | " # Generate title and legend.\n", 197 | " ax.set_title(f'Type of Missing Data: {missing_type} \\nProportion Missing: {p_missing}', position = (0,1), ha = 'left', fontsize = 25)\n", 198 | " ax.legend(prop={'size': 20}, loc = 2)\n", 199 | " \n", 200 | " ax.set_xticks([])\n", 201 | " ax.set_yticks([])\n", 202 | " plt.show();" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "generate_scatterplot(p_missing=0.1,\n", 212 | " missing_type = 'MCAR',\n", 213 | " method = 'linear')" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": { 220 | "scrolled": false 221 | }, 222 | "outputs": [], 223 | "source": [ 224 | "def plot_interact(p_missing = 0.8, missing_type = 'MCAR', method = 'linear'):\n", 225 | " generate_scatterplot(p_missing, missing_type, method, missing_column = 'income', depends_on = 'age')\n", 226 | " \n", 227 | "interact(plot_interact, p_missing = (0, 0.99, 0.05), missing_type = ['MCAR','MAR','NMAR'], method = ['linear','quadratic']);" 228 | ] 229 | } 230 | ], 231 | "metadata": { 232 | "kernelspec": { 233 | "display_name": "Python 3", 234 | "language": "python", 235 | "name": "python3" 236 | }, 237 | "language_info": { 238 | "codemirror_mode": { 239 | "name": "ipython", 240 | "version": 3 241 | }, 242 | "file_extension": ".py", 243 | "mimetype": "text/x-python", 244 | "name": "python", 245 | "nbconvert_exporter": "python", 246 | "pygments_lexer": "ipython3", 247 | "version": "3.8.3" 248 | } 249 | }, 250 | "nbformat": 4, 251 | "nbformat_minor": 2 252 | } 253 | -------------------------------------------------------------------------------- /02_item_missingness.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 
5 | "metadata": {}, 6 | "source": [ 7 | "# Item Missingness Demo" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "# To generate and store data.\n", 17 | "import numpy as np\n", 18 | "import pandas as pd\n", 19 | "import scipy.stats as stats\n", 20 | "\n", 21 | "# To visualize data.\n", 22 | "import matplotlib.pyplot as plt\n", 23 | "\n", 24 | "# To fit linear regression model.\n", 25 | "from sklearn.linear_model import LinearRegression, LogisticRegression\n", 26 | "\n", 27 | "# Install and import missingno to visualize missingness patterns. (Uncomment first line to install missingno.)\n", 28 | "# !pip3 install missingno\n", 29 | "import missingno as msno\n", 30 | "\n", 31 | "# # There is a SciPy issue that won't affect our work, but a warning exists\n", 32 | "# # and an update is not imminent.\n", 33 | "import warnings\n", 34 | "warnings.filterwarnings(action=\"ignore\")\n", 35 | "\n", 36 | "# To render plots in the notebook.\n", 37 | "%matplotlib inline" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "### Let's generate some data. Specifically, we'll generate age, partnered, children, and income data, where income is linearly related to age, partnered, and children." 
45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "# To ensure we get the same results.\n", 54 | "np.random.seed(42)\n", 55 | "\n", 56 | "# Generate data.\n", 57 | "age = np.round(np.random.uniform(20, 60, size = 100))\n", 58 | "partnered = np.random.binomial(1, 0.8, size = 100)\n", 59 | "children = np.random.poisson(2, size = 100)\n", 60 | "income = 15000 + 750 * age + 20000 * partnered - 2500 * children + np.random.normal(0, 20000, size = 100)\n", 61 | "\n", 62 | "# Ensure income is not negative!\n", 63 | "income = [i if i >= 0 else 0 for i in income]\n", 64 | "\n", 65 | "# Combine our results into one dataframe.\n", 66 | "df = pd.DataFrame({'age': age,\n", 67 | " 'partnered': partnered,\n", 68 | " 'children': children,\n", 69 | " 'income': income})\n", 70 | "\n", 71 | "# Check the first five rows of df to make sure we did this properly.\n", 72 | "df.head()" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "### Run this cell. These are functions that will generate missing values according to MCAR, MAR, or NMAR." 
80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "def create_mcar_column(df, missing_column = 'income', p_missing = 0.01, random_state = 42):\n", 89 | " \"\"\"\n", 90 | " Creates missingness indicator column, where data are MCAR (missing completely at random).\n", 91 | " \n", 92 | " User must specify:\n", 93 | " df = the pandas DataFrame the user wants to read in for analysis\n", 94 | " column = the name of the column in df that is missing\n", 95 | " p_missing = the proportion of observations that are missing\n", 96 | " \n", 97 | " Function returns:\n", 98 | " mcar_column = a column that indicates whether data are missing, assuming MCAR\n", 99 | " \"\"\"\n", 100 | " np.random.seed(random_state)\n", 101 | " \n", 102 | " mcar_indices = [df.sample(n = 1).index[0] for i in range(round(p_missing * df.shape[0]))]\n", 103 | " \n", 104 | " while len(set(mcar_indices)) < round(p_missing * df.shape[0]):\n", 105 | " mcar_indices.append(df.sample(n = 1).index[0])\n", 106 | " \n", 107 | " mcar_column = [1 if i in mcar_indices else 0 for i in range(df.shape[0])]\n", 108 | " \n", 109 | " return mcar_column\n", 110 | "\n", 111 | "def create_mar_column(df, missing_column = 'income', depends_on = 'age', method = 'linear', p_missing = 0.01, random_state = 42):\n", 112 | " \"\"\"\n", 113 | " Creates missingness indicator column, where data are MAR (missing at random).\n", 114 | " \n", 115 | " User must specify:\n", 116 | " df = the pandas DataFrame the user wants to read in for analysis\n", 117 | " missing_column = the name of the column in df that is missing\n", 118 | " depends_on = the name of the column in df which affects the missingness\n", 119 | " method = 'linear' or 'quadratic'\n", 120 | " - 'linear' means the probability of missingness is linearly related to the depends_on variable\n", 121 | " - 'quadratic' means the probability of missingness is quadratically related to the depends_on 
variable\n", 122 | " p_missing = the proportion of observations that are missing\n", 123 | " \n", 124 | " Function returns:\n", 125 | " mar_column = a column that indicates whether data are missing, assuming MAR\n", 126 | " \"\"\"\n", 127 | " np.random.seed(random_state)\n", 128 | " \n", 129 | " if method == 'linear':\n", 130 | " mar_indices = [df.sample(n = 1, weights = depends_on).index[0] for i in range(round(p_missing * df.shape[0]))]\n", 131 | "\n", 132 | " while len(set(mar_indices)) < round(p_missing * df.shape[0]):\n", 133 | " mar_indices.append(df.sample(n = 1, weights = depends_on).index[0])\n", 134 | " \n", 135 | " elif method == 'quadratic':\n", 136 | " mar_indices = [df.sample(n = 1, weights = df[depends_on] ** 2).index[0] for i in range(round(p_missing * df.shape[0]))]\n", 137 | "\n", 138 | " while len(set(mar_indices)) < round(p_missing * df.shape[0]):\n", 139 | " mar_indices.append(df.sample(n = 1, weights = df[depends_on] ** 2).index[0])\n", 140 | "\n", 141 | " mar_column = [1 if i in mar_indices else 0 for i in range(df.shape[0])]\n", 142 | " \n", 143 | " return mar_column\n", 144 | "\n", 145 | "def create_nmar_column(df, missing_column = 'income', method = 'linear', p_missing = 0.01, random_state = 42):\n", 146 | " \"\"\"\n", 147 | " Creates missingness indicator column, where data are NMAR (not missing at random).\n", 148 | " \n", 149 | " User must specify:\n", 150 | " df = the pandas DataFrame the user wants to read in for analysis\n", 151 | " missing_column = the name of the column in df that is missing\n", 152 | " method = 'linear' or 'quadratic'\n", 153 | " - 'linear' means the probability of missingness is linearly related to the depends_on variable\n", 154 | " - 'quadratic' means the probability of missingness is quadratically related to the depends_on variable\n", 155 | " p_missing = the proportion of observations that are missing\n", 156 | " \n", 157 | " Function returns:\n", 158 | " nmar_column = a column that indicates whether data are 
missing, assuming NMAR\n", 159 | " \"\"\"\n", 160 | " np.random.seed(random_state)\n", 161 | " \n", 162 | " if method == 'linear':\n", 163 | " nmar_indices = [df.sample(n = 1, weights = missing_column).index[0] for i in range(round(p_missing * df.shape[0]))]\n", 164 | "\n", 165 | " while len(set(nmar_indices)) < round(p_missing * df.shape[0]):\n", 166 | " nmar_indices.append(df.sample(n = 1, weights = missing_column).index[0])\n", 167 | " \n", 168 | " elif method == 'quadratic':\n", 169 | " nmar_indices = [df.sample(n = 1, weights = df[missing_column] ** 2).index[0] for i in range(round(p_missing * df.shape[0]))]\n", 170 | "\n", 171 | " while len(set(nmar_indices)) < round(p_missing * df.shape[0]):\n", 172 | " nmar_indices.append(df.sample(n = 1, weights = df[missing_column] ** 2).index[0])\n", 173 | " \n", 174 | " nmar_column = [1 if i in nmar_indices else 0 for i in range(df.shape[0])]\n", 175 | " \n", 176 | " return nmar_column" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "### Let's generate some missing data!" 
184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "df['age_missingness'] = create_mcar_column(df,\n", 193 | " missing_column = 'age', \n", 194 | " p_missing = 0.3,\n", 195 | " random_state = 42)\n", 196 | "\n", 197 | "df['partnered_missingness'] = create_mar_column(df,\n", 198 | " missing_column = 'partnered',\n", 199 | " method = 'linear',\n", 200 | " p_missing = 0.2,\n", 201 | " random_state = 42)\n", 202 | "\n", 203 | "df['income_missingness'] = create_nmar_column(df,\n", 204 | " missing_column = 'income',\n", 205 | " method = 'quadratic',\n", 206 | " p_missing = 0.2,\n", 207 | " random_state = 42)\n", 208 | "\n", 209 | "print(df['age_missingness'].value_counts())\n", 210 | "print(df['partnered_missingness'].value_counts())\n", 211 | "print(df['income_missingness'].value_counts())" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "df.head()" 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "### Let's create a new dataframe with the values actually missing." 
228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": null, 233 | "metadata": {}, 234 | "outputs": [], 235 | "source": [ 236 | "df_missing = pd.DataFrame(df['children'])\n", 237 | "\n", 238 | "df_missing['age'] = [df.loc[i,'age'] if df.loc[i,'age_missingness'] == 0 else np.nan for i in range(100)]\n", 239 | "df_missing['partnered'] = [df.loc[i,'partnered'] if df.loc[i,'partnered_missingness'] == 0 else np.nan for i in range(100)]\n", 240 | "df_missing['income'] = [df.loc[i,'income'] if df.loc[i,'income_missingness'] == 0 else np.nan for i in range(100)]\n", 241 | "\n", 242 | "df_missing.head()" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "### Let's visualize our missing data.\n", 250 | "- Children is 100% observed.\n", 251 | "- Age is missing completely at random and is missing 30% of its observations.\n", 252 | "- Partnered is missing at random and is missing 20% of its observations.\n", 253 | "- Income is not missing at random and is missing 20% of its observations." 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": null, 259 | "metadata": {}, 260 | "outputs": [], 261 | "source": [ 262 | "msno.matrix(df_missing);" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "### Generate histograms." 
270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "metadata": {}, 276 | "outputs": [], 277 | "source": [ 278 | "def compare_histograms(df, imputed_column, original_column, missingness_column, x_label, y_label = 'Frequency'):\n", 279 | " fig, (ax0, ax1) = plt.subplots(nrows = 2, ncols = 1, figsize = (16,9))\n", 280 | "\n", 281 | " # Set axes of histograms.\n", 282 | " mode = stats.mode(df[imputed_column])\n", 283 | " rnge = max(df[original_column]) - min(df[original_column])\n", 284 | " xmin = min(df[original_column]) - 0.02 * rnge\n", 285 | " xmax = max(df[original_column]) + 0.02 * rnge\n", 286 | " ymax = 1.3 * (mode[1][0] + df[df[original_column] == mode[0][0]].shape[0])\n", 287 | "\n", 288 | " ax0.set_xlim(xmin, xmax)\n", 289 | " ax0.set_ylim(0, ymax)\n", 290 | " ax1.set_xlim(xmin, xmax)\n", 291 | " ax1.set_ylim(0, ymax)\n", 292 | "\n", 293 | " # Set top labels.\n", 294 | " ax0.set_title('Real Histogram', position = (0,1), ha = 'left', fontsize = 25)\n", 295 | " ax0.set_xlabel(x_label, position = (0,0), ha = 'left', fontsize = 25, color = 'grey', alpha = 0.85)\n", 296 | " ax0.set_ylabel(y_label, position = (0,1), ha = 'right', va = 'top', fontsize = 25, rotation = 0, color = 'grey', alpha = 0.85)\n", 297 | " ax0.set_xticks([])\n", 298 | " ax0.set_yticks([])\n", 299 | "\n", 300 | " # Generate top histogram.\n", 301 | " ax0.hist(df[original_column], bins = 15, color = '#185fad', alpha = 0.75, label = '')\n", 302 | " ax0.axvline(np.mean(df[original_column]), color = '#185fad', lw = 5, label = 'True Mean')\n", 303 | " ax0.legend(prop={'size': 15}, loc = 1)\n", 304 | "\n", 305 | " # Set bottom labels.\n", 306 | " ax1.set_title('Observed + Imputed Histogram', position = (0,1), ha = 'left', fontsize = 25)\n", 307 | " ax1.set_xlabel(x_label, position = (0,0), ha = 'left', fontsize = 25, color = 'grey', alpha = 0.85)\n", 308 | " ax1.set_ylabel(y_label, position = (0,1), ha = 'right', va = 'top', fontsize = 25, rotation = 0, color = 
'grey', alpha = 0.85)\n", 309 | "\n", 310 | " # Generate bottom histogram.\n", 311 | " ax1.hist([df[imputed_column][df[missingness_column] == 0], df[imputed_column][df[missingness_column] == 1]], bins = 15, color = ['#185fad','orange'], alpha = 0.75, label = '', stacked = True)\n", 312 | " ax1.axvline(np.mean(df[original_column]), color = '#185fad', lw = 5, label = 'True Mean')\n", 313 | " ax1.axvline(np.mean(df[original_column][df[missingness_column] == 0]), color = 'grey', alpha = 0.5, lw = 5, label = 'Observed Mean')\n", 314 | " ax1.axvline(np.mean(df[imputed_column]), color = 'orange', lw = 5, label = 'Observed and Imputed Mean')\n", 315 | " ax1.legend(prop={'size': 15}, loc = 1)\n", 316 | " \n", 317 | " plt.tight_layout()\n", 318 | "\n", 319 | " plt.show();" 320 | ] 321 | }, 322 | { 323 | "cell_type": "markdown", 324 | "metadata": {}, 325 | "source": [ 326 | "### Examine various imputation methods." 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "##### Mean Imputation" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "def impute_mean(df, impute_column, missingness_column):\n", 343 | " \"\"\"\n", 344 | " Imputes mean for any value where data is marked missing.\n", 345 | " \n", 346 | " User must specify:\n", 347 | " df = the pandas DataFrame the user wants to read in for analysis\n", 348 | " impute_column = the name of the column in df that is missing\n", 349 | " missingness_column = the name of the missingness indicator column\n", 350 | " \n", 351 | " Function returns:\n", 352 | " mean_impute = a column with the mean imputed for any missing value.\n", 353 | " \"\"\"\n", 354 | " mean_impute = [df.loc[i,impute_column] if df.loc[i,missingness_column] == 0 else np.mean(df[impute_column]) for i in range(df.shape[0])]\n", 355 | " \n", 356 | " return mean_impute" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 
361 | "execution_count": null, 362 | "metadata": {}, 363 | "outputs": [], 364 | "source": [ 365 | "df['age_mean_imputed'] = impute_mean(df, 'age', 'age_missingness')" 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": null, 371 | "metadata": {}, 372 | "outputs": [], 373 | "source": [ 374 | "compare_histograms(df = df,\n", 375 | " imputed_column = 'age_mean_imputed',\n", 376 | " original_column = 'age',\n", 377 | " missingness_column = 'age_missingness',\n", 378 | " x_label = 'Age',\n", 379 | " y_label = 'Frequency')" 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": {}, 385 | "source": [ 386 | "How to read the above chart:\n", 387 | "- The blue line is the true mean of all data (observed and unobserved).\n", 388 | "- The grey line is the mean of just the observed data. (i.e. no imputation)\n", 389 | "- The orange line is the mean of the observed and imputed data." 390 | ] 391 | }, 392 | { 393 | "cell_type": "markdown", 394 | "metadata": {}, 395 | "source": [ 396 | "$$\n", 397 | "\\begin{eqnarray*}\n", 398 | "s &=& \\sqrt{\\frac{\\sum_{i=1}^n(x_i - \\bar{x})^2}{n-1}} \\\\\n", 399 | "\\text{impute mean for values } k+1 \\text{ through } n \\Rightarrow s &=& \\sqrt{\\frac{\\sum_{i=1}^k(x_i - \\bar{x})^2}{n-1} + \\frac{\\sum_{i=k+1}^n(\\bar{x} - \\bar{x})^2}{n-1}} \\\\\n", 400 | "&=& \\sqrt{\\frac{\\sum_{i=1}^k(x_i - \\bar{x})^2}{n-1}} \\\\\n", 401 | "&\\Rightarrow& \\text{the denominator increases but numerator remains fixed} \\\\\n", 402 | "&\\Rightarrow& \\text{the sample standard deviation is underestimated} \\\\\n", 403 | "&\\Rightarrow& \\text{confidence intervals relying on the mean are narrower than they should be}\n", 404 | "\\end{eqnarray*}\n", 405 | "$$" 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "metadata": {}, 411 | "source": [ 412 | "##### Median Imputation" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": null, 418 | "metadata": {}, 419 | "outputs": [], 420 | 
"source": [ 421 | "def impute_median(df, impute_column, missingness_column):\n", 422 | " \"\"\"\n", 423 | " Imputes median for any value where data is marked missing.\n", 424 | " \n", 425 | " User must specify:\n", 426 | " df = the pandas DataFrame the user wants to read in for analysis\n", 427 | " impute_column = the name of the column in df that is missing\n", 428 | " missingness_column = the name of the missingness indicator column\n", 429 | " \n", 430 | " Function returns:\n", 431 | " median_impute = a column with the median imputed for any missing value.\n", 432 | " \"\"\"\n", 433 | " median_impute = [df.loc[i,impute_column] if df.loc[i,missingness_column] == 0 else np.median(df[impute_column]) for i in range(df.shape[0])]\n", 434 | " \n", 435 | " return median_impute" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": null, 441 | "metadata": {}, 442 | "outputs": [], 443 | "source": [ 444 | "df['age_median_imputed'] = impute_median(df, 'age', 'age_missingness')" 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": null, 450 | "metadata": {}, 451 | "outputs": [], 452 | "source": [ 453 | "compare_histograms(df = df,\n", 454 | " imputed_column = 'age_median_imputed',\n", 455 | " original_column = 'age',\n", 456 | " missingness_column = 'age_missingness',\n", 457 | " x_label = 'Age',\n", 458 | " y_label = 'Frequency')" 459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "metadata": {}, 464 | "source": [ 465 | "##### Mode Imputation" 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": null, 471 | "metadata": {}, 472 | "outputs": [], 473 | "source": [ 474 | "def impute_mode(df, impute_column, missingness_column):\n", 475 | " \"\"\"\n", 476 | " Imputes mode for any value where data is marked missing.\n", 477 | " \n", 478 | " User must specify:\n", 479 | " df = the pandas DataFrame the user wants to read in for analysis\n", 480 | " impute_column = the name of the column in df that is 
missing\n", 481 | " missingness_column = the name of the missingness indicator column\n", 482 | " \n", 483 | " Function returns:\n", 484 | " mode_impute = a column with the mode imputed for any missing value.\n", 485 | " \"\"\"\n", 486 | " mode_impute = [df.loc[i,impute_column] if df.loc[i,missingness_column] == 0 else stats.mode(df[impute_column])[0][0] for i in range(df.shape[0])]\n", 487 | " \n", 488 | " return mode_impute" 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": null, 494 | "metadata": {}, 495 | "outputs": [], 496 | "source": [ 497 | "df['age_mode_imputed'] = impute_mode(df, 'age', 'age_missingness')" 498 | ] 499 | }, 500 | { 501 | "cell_type": "code", 502 | "execution_count": null, 503 | "metadata": {}, 504 | "outputs": [], 505 | "source": [ 506 | "compare_histograms(df = df,\n", 507 | " imputed_column = 'age_mode_imputed',\n", 508 | " original_column = 'age',\n", 509 | " missingness_column = 'age_missingness',\n", 510 | " x_label = 'Age',\n", 511 | " y_label = 'Frequency')" 512 | ] 513 | }, 514 | { 515 | "cell_type": "markdown", 516 | "metadata": {}, 517 | "source": [ 518 | "##### Regression Imputation" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": null, 524 | "metadata": {}, 525 | "outputs": [], 526 | "source": [ 527 | "def regression_imputation(df, impute_column, X_columns, missingness_column, regression = 'linear'):\n", 528 | " \"\"\"\n", 529 | " Fits regression line to observed data, then imputes regression prediction\n", 530 | " for any value where data is marked missing.\n", 531 | " \n", 532 | " User must specify:\n", 533 | " df = the pandas DataFrame the user wants to read in for analysis\n", 534 | " impute_column = the name of the column in df that is missing\n", 535 | " X_columns = the names of the columns used as independent variables\n", 536 | " to impute the missing value\n", 537 | " missingness_column = the name of the missingness indicator column\n", 538 | " regression = the 
type of regression to run; only supports 'linear'\n", 539 | " for LinearRegression and 'logistic' for LogisticRegression\n", 540 | " \n", 541 | " Function returns:\n", 542 | " regression_impute = a column with the regression value imputed for any missing value.\n", 543 | " \n", 544 | " NOTE: Only set up to do linear or logistic regression.\n", 545 | " \"\"\"\n", 546 | " \n", 547 | " if regression == 'linear':\n", 548 | " model = LinearRegression()\n", 549 | " elif regression == 'logistic':\n", 550 | " model = LogisticRegression()\n", 551 | " \n", 552 | " model.fit(df[X_columns], df[impute_column])\n", 553 | " \n", 554 | " regression_impute = [df.loc[i,impute_column] if df.loc[i,missingness_column] == 0\n", 555 | " else model.predict(pd.DataFrame(df.loc[i,X_columns]).T)[0] \n", 556 | " for i in range(df.shape[0])]\n", 557 | " \n", 558 | " return regression_impute" 559 | ] 560 | }, 561 | { 562 | "cell_type": "code", 563 | "execution_count": null, 564 | "metadata": {}, 565 | "outputs": [], 566 | "source": [ 567 | "df['age_regression_imputed'] = regression_imputation(df, 'age', ['children', 'partnered', 'income'], 'age_missingness')" 568 | ] 569 | }, 570 | { 571 | "cell_type": "code", 572 | "execution_count": null, 573 | "metadata": {}, 574 | "outputs": [], 575 | "source": [ 576 | "compare_histograms(df = df,\n", 577 | " imputed_column = 'age_regression_imputed',\n", 578 | " original_column = 'age',\n", 579 | " missingness_column = 'age_missingness',\n", 580 | " x_label = 'Age',\n", 581 | " y_label = 'Frequency')" 582 | ] 583 | }, 584 | { 585 | "cell_type": "code", 586 | "execution_count": null, 587 | "metadata": {}, 588 | "outputs": [], 589 | "source": [ 590 | "np.std(df['age_regression_imputed'], ddof = 1)" 591 | ] 592 | }, 593 | { 594 | "cell_type": "code", 595 | "execution_count": null, 596 | "metadata": {}, 597 | "outputs": [], 598 | "source": [ 599 | "np.std(df['age'], ddof = 1)" 600 | ] 601 | }, 602 | { 603 | "cell_type": "markdown", 604 | 
"metadata": {}, 605 | "source": [ 606 | "### Work in progress:" 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": null, 612 | "metadata": {}, 613 | "outputs": [], 614 | "source": [ 615 | "def compare_scatterplots(df, imputed_column, original_X_column, original_Y_column, missingness_column, x_label, y_label):\n", 616 | " fig, (ax0, ax1) = plt.subplots(nrows = 1, ncols = 2, figsize = (20,8))\n", 617 | "\n", 618 | " # Set axes of scatterplots.\n", 619 | " x_rnge = max(df[original_X_column]) - min(df[original_X_column])\n", 620 | " xmin = min(df[original_X_column]) - 0.1 * x_rnge\n", 621 | " xmax = max(df[original_X_column]) + 0.1 * x_rnge\n", 622 | " y_rnge = max(df[original_Y_column]) - min(df[original_Y_column])\n", 623 | " ymin = min(df[original_Y_column]) - 0.1 * y_rnge\n", 624 | " ymax = max(df[original_Y_column]) + 0.1 * y_rnge\n", 625 | "\n", 626 | " ax0.set_xlim(xmin, xmax)\n", 627 | " ax0.set_ylim(ymin, ymax)\n", 628 | " ax1.set_xlim(xmin, xmax)\n", 629 | " ax1.set_ylim(ymin, ymax)\n", 630 | "\n", 631 | " # Set left labels.\n", 632 | " ax0.set_title('Real Scatterplot', position = (0,1), ha = 'left', fontsize = 25)\n", 633 | " ax0.set_xlabel(x_label, position = (0,0), ha = 'left', fontsize = 25, color = 'grey', alpha = 0.85)\n", 634 | " ax0.set_ylabel(y_label, position = (0,1), ha = 'right', va = 'top', fontsize = 25, rotation = 0, color = 'grey', alpha = 0.85)\n", 635 | " ax0.set_xticks([])\n", 636 | " ax0.set_yticks([])\n", 637 | "\n", 638 | " # Generate left scatterplot.\n", 639 | " ax0.scatter(df[original_X_column], df[original_Y_column], color = '#185fad', alpha = 0.5, label = 'True Values')\n", 640 | " ax0.legend(prop={'size': 15}, loc = 1)\n", 641 | " \n", 642 | " # Set right labels.\n", 643 | " ax1.set_title('Observed + Imputed Scatterplot', position = (0,1), ha = 'left', fontsize = 25)\n", 644 | " ax1.set_xlabel(x_label, position = (0,0), ha = 'left', fontsize = 25, color = 'grey', alpha = 0.85)\n", 645 | " 
ax1.set_ylabel(y_label, position = (0,1), ha = 'right', va = 'top', fontsize = 25, rotation = 0, color = 'grey', alpha = 0.85)\n", 646 | " ax1.set_xticks([])\n", 647 | " ax1.set_yticks([])\n", 648 | "\n", 649 | " # Generate right scatterplot.\n", 650 | " ax1.scatter(df[original_X_column][df[missingness_column] == 1], df[imputed_column][df[missingness_column] == 1], color = 'orange', alpha = 0.5, label = 'Imputed Values')\n", 651 | " ax1.scatter(df[original_X_column][df[missingness_column] == 0], df[imputed_column][df[missingness_column] == 0], color = '#185fad', alpha = 0.5, label = 'Observed Values')\n", 652 | "\n", 653 | " ax1.legend(prop={'size': 15}, loc = 1)\n", 654 | " \n", 655 | " plt.show();" 656 | ] 657 | }, 658 | { 659 | "cell_type": "code", 660 | "execution_count": null, 661 | "metadata": {}, 662 | "outputs": [], 663 | "source": [ 664 | "compare_scatterplots(df = df,\n", 665 | " imputed_column = 'age_regression_imputed',\n", 666 | " original_X_column = 'children',\n", 667 | " original_Y_column = 'age',\n", 668 | " missingness_column = 'age_missingness',\n", 669 | " x_label = 'Children',\n", 670 | " y_label = 'Age')" 671 | ] 672 | } 673 | ], 674 | "metadata": { 675 | "kernelspec": { 676 | "display_name": "Python 3", 677 | "language": "python", 678 | "name": "python3" 679 | }, 680 | "language_info": { 681 | "codemirror_mode": { 682 | "name": "ipython", 683 | "version": 3 684 | }, 685 | "file_extension": ".py", 686 | "mimetype": "text/x-python", 687 | "name": "python", 688 | "nbconvert_exporter": "python", 689 | "pygments_lexer": "ipython3", 690 | "version": "3.8.3" 691 | } 692 | }, 693 | "nbformat": 4, 694 | "nbformat_minor": 2 695 | } 696 | -------------------------------------------------------------------------------- /solutions/00_interactive_plot.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 
| "source": [ 9 | "#### To confirm that you have the latest versions of these packages, uncomment and run the following command.\n", 10 | "# !pip install numpy pandas matplotlib scikit-learn ipywidgets IPython missingno --upgrade\n", 11 | "\n", 12 | "# To generate and store data.\n", 13 | "import numpy as np\n", 14 | "import pandas as pd\n", 15 | "\n", 16 | "# To visualize data.\n", 17 | "import matplotlib.pyplot as plt\n", 18 | "\n", 19 | "# To fit linear regression model.\n", 20 | "from sklearn.linear_model import LinearRegression\n", 21 | "\n", 22 | "# To allow interactive plot.\n", 23 | "from ipywidgets import *\n", 24 | "from IPython.display import display\n", 25 | "\n", 26 | "# There is a SciPy issue that won't affect our work, but a warning exists\n", 27 | "# and an update is not imminent.\n", 28 | "import warnings\n", 29 | "warnings.filterwarnings(action=\"ignore\")\n", 30 | "\n", 31 | "# To render plots in the notebook.\n", 32 | "%matplotlib inline" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 2, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "# Generate data and store in a dataframe.\n", 42 | "\n", 43 | "np.random.seed(42)\n", 44 | "\n", 45 | "age = np.random.uniform(20, 60, size = 100)\n", 46 | "income = 15000 + 750 * age + np.random.normal(0, 20000, size = 100)\n", 47 | "income = [i if i >= 0 else 0 for i in income]\n", 48 | "\n", 49 | "df = pd.DataFrame({'income':income,\n", 50 | " 'age': age})" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 3, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "# Create three functions to model missingness according to certain patterns.\n", 60 | "\n", 61 | "def create_mcar_column(df, missing_column = 'income', p_missing = 0.01, random_state = 42):\n", 62 | " \"\"\"\n", 63 | " Creates missingness indicator column, where data are MCAR (missing completely at random).\n", 64 | " \n", 65 | " User must specify:\n", 66 | " df = the pandas DataFrame 
the user wants to read in for analysis\n", 67 | " column = the name of the column in df that is missing\n", 68 | " p_missing = the proportion of observations that are missing\n", 69 | " \n", 70 | " Function returns:\n", 71 | " mcar_column = a column that indicates whether data are missing, assuming MCAR\n", 72 | " \"\"\"\n", 73 | " np.random.seed(random_state)\n", 74 | " \n", 75 | " mcar_indices = [df.sample(n = 1).index[0] for i in range(round(p_missing * df.shape[0]))]\n", 76 | " \n", 77 | " while len(set(mcar_indices)) < round(p_missing * df.shape[0]):\n", 78 | " mcar_indices.append(df.sample(n = 1).index[0])\n", 79 | " \n", 80 | " mcar_column = [1 if i in mcar_indices else 0 for i in range(df.shape[0])]\n", 81 | " \n", 82 | " return mcar_column\n", 83 | "\n", 84 | "def create_mar_column(df, missing_column = 'income', depends_on = 'age', method = 'linear', p_missing = 0.01, random_state = 42):\n", 85 | " \"\"\"\n", 86 | " Creates missingness indicator column, where data are MAR (missing at random).\n", 87 | " \n", 88 | " User must specify:\n", 89 | " df = the pandas DataFrame the user wants to read in for analysis\n", 90 | " missing_column = the name of the column in df that is missing\n", 91 | " depends_on = the name of the column in df which affects the missingness\n", 92 | " method = 'linear' or 'quadratic'\n", 93 | " - 'linear' means the probability of missingness is linearly related to the depends_on variable\n", 94 | " - 'quadratic' means the probability of missingness is quadratically related to the depends_on variable\n", 95 | " p_missing = the proportion of observations that are missing\n", 96 | " \n", 97 | " Function returns:\n", 98 | " mar_column = a column that indicates whether data are missing, assuming MAR\n", 99 | " \"\"\"\n", 100 | " np.random.seed(random_state)\n", 101 | " \n", 102 | " if method == 'linear':\n", 103 | " mar_indices = [df.sample(n = 1, weights = df[depends_on] ** -1).index[0] for i in range(round(p_missing * df.shape[0]))]\n", 
104 | "\n", 105 | " while len(set(mar_indices)) < round(p_missing * df.shape[0]):\n", 106 | " mar_indices.append(df.sample(n = 1, weights = df[depends_on] ** -1).index[0])\n", 107 | " \n", 108 | " elif method == 'quadratic':\n", 109 | " mar_indices = [df.sample(n = 1, weights = df[depends_on] ** -2).index[0] for i in range(round(p_missing * df.shape[0]))]\n", 110 | "\n", 111 | " while len(set(mar_indices)) < round(p_missing * df.shape[0]):\n", 112 | " mar_indices.append(df.sample(n = 1, weights = df[depends_on] ** -2).index[0])\n", 113 | "\n", 114 | " mar_column = [1 if i in mar_indices else 0 for i in range(df.shape[0])]\n", 115 | " \n", 116 | " return mar_column\n", 117 | "\n", 118 | "def create_nmar_column(df, missing_column = 'income', method = 'linear', p_missing = 0.01, random_state = 42):\n", 119 | " \"\"\"\n", 120 | " Creates missingness indicator column, where data are NMAR (not missing at random).\n", 121 | " \n", 122 | " User must specify:\n", 123 | " df = the pandas DataFrame the user wants to read in for analysis\n", 124 | " missing_column = the name of the column in df that is missing\n", 125 | " method = 'linear' or 'quadratic'\n", 126 | " - 'linear' means the probability of missingness is linearly related to the depends_on variable\n", 127 | " - 'quadratic' means the probability of missingness is quadratically related to the depends_on variable\n", 128 | " p_missing = the proportion of observations that are missing\n", 129 | " \n", 130 | " Function returns:\n", 131 | " nmar_column = a column that indicates whether data are missing, assuming NMAR\n", 132 | " \"\"\"\n", 133 | " np.random.seed(random_state)\n", 134 | " \n", 135 | " if method == 'linear':\n", 136 | " nmar_indices = [df.sample(n = 1, weights = df[missing_column] ** -1).index[0] for i in range(round(p_missing * df.shape[0]))]\n", 137 | "\n", 138 | " while len(set(nmar_indices)) < round(p_missing * df.shape[0]):\n", 139 | " nmar_indices.append(df.sample(n = 1, weights = df[missing_column] 
** -1).index[0])\n", 140 | " \n", 141 | " elif method == 'quadratic':\n", 142 | " nmar_indices = [df.sample(n = 1, weights = df[missing_column] ** -2).index[0] for i in range(round(p_missing * df.shape[0]))]\n", 143 | "\n", 144 | " while len(set(nmar_indices)) < round(p_missing * df.shape[0]):\n", 145 | " nmar_indices.append(df.sample(n = 1, weights = df[missing_column] ** -2).index[0])\n", 146 | " \n", 147 | " nmar_column = [1 if i in nmar_indices else 0 for i in range(df.shape[0])]\n", 148 | " \n", 149 | " return nmar_column" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 4, 155 | "metadata": {}, 156 | "outputs": [], 157 | "source": [ 158 | "def generate_scatterplot(p_missing, missing_type, method = 'linear', missing_column = 'income', depends_on = 'age'):\n", 159 | " # Generate one plot.\n", 160 | " fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (16,9))\n", 161 | "\n", 162 | " # Set labels and axes.\n", 163 | " ax.set_xlabel(\"Age\", position = (0,0), ha = 'left', fontsize = 25, color = 'grey', alpha = 0.85)\n", 164 | " ax.set_ylabel(\"Income\", position = (0,1), ha = 'right', va = 'top', fontsize = 25, rotation = 0, color = 'grey', alpha = 0.85)\n", 165 | " \n", 166 | " ax.set_ylim([-1000, 100000])\n", 167 | " \n", 168 | " # Generate data with proportion p missing.\n", 169 | " if missing_type == 'MCAR':\n", 170 | " df['missingness'] = create_mcar_column(df,\n", 171 | " missing_column = missing_column,\n", 172 | " p_missing = p_missing)\n", 173 | " elif missing_type == 'MAR':\n", 174 | " df['missingness'] = create_mar_column(df,\n", 175 | " missing_column = missing_column,\n", 176 | " depends_on = depends_on,\n", 177 | " method = method,\n", 178 | " p_missing = p_missing)\n", 179 | " \n", 180 | " elif missing_type == 'NMAR':\n", 181 | " df['missingness'] = create_nmar_column(df,\n", 182 | " missing_column = missing_column,\n", 183 | " method = method,\n", 184 | " p_missing = p_missing)\n", 185 | " \n", 186 | " # Generate 
scatterplot.\n", 187 | " ax.scatter(df['age'][df['missingness'] == 0], df['income'][df['missingness'] == 0], s = 35, color = '#185fad', alpha = 0.75, label = 'Observed')\n", 188 | " ax.scatter(df['age'][df['missingness'] == 1], df['income'][df['missingness'] == 1], s = 35, color = 'grey', alpha = 0.25, label = '')\n", 189 | " \n", 190 | " # Generate lines of best fit based on observed and missing values.\n", 191 | " x = np.linspace(20, 60)\n", 192 | " ax.plot(x, 15000 + 750 * x, c = 'orange', alpha = 0.7, label = '\"True\" Line', lw = 3)\n", 193 | " model = LinearRegression().fit(df[['age']][df['missingness'] == 0], df['income'][df['missingness'] == 0])\n", 194 | " ax.plot(x, model.intercept_ + model.coef_ * x, c = '#185fad', alpha = 0.7, label='Observed Line', lw = 3)\n", 195 | "\n", 196 | " # Generate title and legend.\n", 197 | " ax.set_title(f'Type of Missing Data: {missing_type} \\nProportion Missing: {p_missing}', position = (0,1), ha = 'left', fontsize = 25)\n", 198 | " ax.legend(prop={'size': 20}, loc = 2)\n", 199 | " \n", 200 | " ax.set_xticks([])\n", 201 | " ax.set_yticks([])\n", 202 | " plt.show();" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 5, 208 | "metadata": {}, 209 | "outputs": [ 210 | { 211 | "data": { 212 | "image/png": "\n", 213 | "text/plain": [ 214 | "
" 215 | ] 216 | }, 217 | "metadata": {}, 218 | "output_type": "display_data" 219 | } 220 | ], 221 | "source": [ 222 | "generate_scatterplot(p_missing=0.1, missing_type = 'MCAR', method = 'linear')" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": 6, 228 | "metadata": { 229 | "scrolled": false 230 | }, 231 | "outputs": [ 232 | { 233 | "data": { 234 | "application/vnd.jupyter.widget-view+json": { 235 | "model_id": "2370061c73b749058d6d52b202778572", 236 | "version_major": 2, 237 | "version_minor": 0 238 | }, 239 | "text/plain": [ 240 | "interactive(children=(FloatSlider(value=0.0, description='p_missing', max=0.99, step=0.05), Dropdown(descripti…" 241 | ] 242 | }, 243 | "metadata": {}, 244 | "output_type": "display_data" 245 | } 246 | ], 247 | "source": [ 248 | "def plot_interact(p_missing = 0, missing_type = 'MCAR', method = 'linear'):\n", 249 | " generate_scatterplot(p_missing, missing_type, method, missing_column = 'income', depends_on = 'age')\n", 250 | " \n", 251 | "interact(plot_interact, p_missing = (0, 0.99, 0.05), missing_type = ['MCAR','MAR','NMAR'], method = ['linear','quadratic']);" 252 | ] 253 | } 254 | ], 255 | "metadata": { 256 | "kernelspec": { 257 | "display_name": "Python 3", 258 | "language": "python", 259 | "name": "python3" 260 | }, 261 | "language_info": { 262 | "codemirror_mode": { 263 | "name": "ipython", 264 | "version": 3 265 | }, 266 | "file_extension": ".py", 267 | "mimetype": "text/x-python", 268 | "name": "python", 269 | "nbconvert_exporter": "python", 270 | "pygments_lexer": "ipython3", 271 | "version": "3.8.3" 272 | } 273 | }, 274 | "nbformat": 4, 275 | "nbformat_minor": 2 276 | } 277 | --------------------------------------------------------------------------------