├── missing-data-slides.pdf
├── .gitignore
├── README.md
├── 01_unit_missingness.ipynb
├── solutions
│   ├── 01_unit_missingness.ipynb
│   └── 00_interactive_plot.ipynb
├── 00_interactive_plot.ipynb
└── 02_item_missingness.ipynb
/missing-data-slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/matthewbrems/missing-data-workshop/HEAD/missing-data-slides.pdf
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # OSX DS Store
2 | .DS_Store
3 |
4 | # Byte-compiled / optimized / DLL files
5 | __pycache__/
6 | *.py[cod]
7 | *$py.class
8 |
9 | # C extensions
10 | *.so
11 |
12 | # Distribution / packaging
13 | .Python
14 | env/
15 | build/
16 | develop-eggs/
17 | dist/
18 | downloads/
19 | eggs/
20 | .eggs/
21 | lib/
22 | lib64/
23 | parts/
24 | sdist/
25 | var/
26 | *.egg-info/
27 | .installed.cfg
28 | *.egg
29 |
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 |
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 |
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .coverage
44 | .coverage.*
45 | .cache
46 | nosetests.xml
47 | coverage.xml
48 | *,cover
49 | .hypothesis/
50 |
51 | # Translations
52 | *.mo
53 | *.pot
54 |
55 | # Django stuff:
56 | *.log
57 | local_settings.py
58 |
59 | # Flask stuff:
60 | instance/
61 | .webassets-cache
62 |
63 | # Scrapy stuff:
64 | .scrapy
65 |
66 | # Sphinx documentation
67 | docs/_build/
68 |
69 | # PyBuilder
70 | target/
71 |
72 | # IPython Notebook
73 | *.ipynb_checkpoints
74 |
75 | # pyenv
76 | .python-version
77 |
78 | # celery beat schedule file
79 | celerybeat-schedule
80 |
81 | # dotenv
82 | .env
83 |
84 | # virtualenv
85 | venv/
86 | ENV/
87 |
88 | # Spyder project settings
89 | .spyderproject
90 |
91 | # Rope project settings
92 | .ropeproject
93 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Good, Fast, Cheap: How to do Data Science with Missing Data
2 |
3 | ## Resources
4 |
5 | #### Academic Content (roughly sorted from least technical to most)
6 | - [Good summary of single vs. multiple imputation](https://scikit-learn.org/stable/modules/impute.html#multiple-vs-single-imputation)
7 | - [UTexas Slides on Missing Data](https://liberalarts.utexas.edu/prc/_files/cs/Missing-Data.pdf)
8 | - [Flexible Imputation of Missing Data, 2nd ed.](https://stefvanbuuren.name/fimd/)
9 | - [Pattern Submodel Paper from "Biostatistics"](https://academic.oup.com/biostatistics/advance-article/doi/10.1093/biostatistics/kxy040/5092384)
10 | - [The prevention and handling of missing data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100/)
11 | - [Should you use a missing data indicator?](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3414599/)
12 | - [Are all biases missing data problems?](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4643276/)
13 | - [Accounting for missing data in statistical analyses: multiple imputation is not always the answer](https://academic.oup.com/ije/article/48/4/1294/5382162?login=false#.XVpWZLg4jrU.twitter)
14 | - [The proportion of missing data should not be used to guide decisions on multiple imputation](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6547017/)
15 | - [Boston University Technical Report on Missing Data, Assumptions, and Applications](http://www.bu.edu/sph/files/2014/05/Marina-tech-report.pdf)
16 | - [Andrew Gelman Chapter on Missing Data - thorough and very good, but academic](http://www.stat.columbia.edu/~gelman/arm/missing.pdf)
17 |
18 | #### Python
19 | - [Scikit-Learn Write-Up of Imputation](https://scikit-learn.org/stable/modules/impute.html)
20 | - [Scikit-Learn SimpleImputer Class](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)
21 | - [Scikit-Learn IterativeImputer Class](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html) **note: this is experimental**
22 | - [Scikit-Learn KNNImputer Class](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html)
23 | - [Scikit-Learn Mean Imputation](http://scikit-learn.org/stable/auto_examples/plot_missing_values.html#sphx-glr-auto-examples-plot-missing-values-py)
24 | - [MissingNo in Python](https://github.com/ResidentMario/missingno)
25 | - [MissingNo Paper](http://joss.theoj.org/papers/52b4115d6c03864b884fbf3334851322)
26 |
27 | ##### Rubin's Rules (combining parameter estimates across multiple imputations)
28 | - [Article about Rubin's Rules](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2727536/)
29 | - [Rubin's Rules Formulas](https://bookdown.org/mwheymans/bookmi/rubins-rules.html)
30 | - [R Code for Pooling Estimates](http://finzi.psych.upenn.edu/R/library/mice/html/pool.html)
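
As a rough illustration of what the Rubin's Rules links above describe, here is a minimal sketch for pooling a single scalar parameter across `m` imputed datasets. The function name `pool_rubin` and the input numbers are hypothetical and exist only for this example; the sketch covers the pooled estimate and total variance, not the degrees-of-freedom adjustment.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool one scalar parameter across m imputed datasets (hypothetical helper)."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)

    pooled_estimate = estimates.mean()       # average of the m point estimates
    within_var = variances.mean()            # average squared standard error
    between_var = estimates.var(ddof=1)      # spread of the estimates across imputations
    total_var = within_var + (1 + 1 / m) * between_var

    return pooled_estimate, total_var

# Made-up coefficient estimates and squared standard errors from m = 5 imputations.
est = [1.9, 2.1, 2.0, 2.2, 1.8]
se2 = [0.04, 0.05, 0.04, 0.06, 0.05]

pooled, total = pool_rubin(est, se2)
print(f"pooled estimate = {pooled:.3f}, pooled SE = {np.sqrt(total):.3f}")
```

The between-imputation term is what a single imputation cannot capture: it grows when the imputed datasets disagree with one another.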
31 |
32 | #### Non-Statistical References
33 | - [How One 19-Year-Old Illinois Man Is Distorting National Polling Averages - NYTimes](https://www.nytimes.com/2016/10/13/upshot/how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.html)
34 |
35 |
36 | ### Feel free to contact me afterward!
37 | - [LinkedIn](https://www.linkedin.com/in/matthewbrems)
38 | - [Twitter](https://www.twitter.com/matthewbrems)
39 | - [Medium](https://www.medium.com/@matthew.w.brems)
40 |
--------------------------------------------------------------------------------
/01_unit_missingness.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Unit Missingness Demo\n",
8 | "\n",
9 | "When handling unit missingness, the most common method is to do **weight class adjustments**. This requires us to break our observations into classes and weight them before doing our analysis."
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": null,
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "# Import libraries.\n",
19 | "import pandas as pd\n",
20 | "import numpy as np\n",
21 | "\n",
22 | "# Set random seed.\n",
23 | "np.random.seed(42)\n",
24 | "\n",
25 | "# Generate dataframe.\n",
26 | "value_score = [min(np.random.poisson(5), 10) if i % 2 == 0 else min(np.random.poisson(6), 10) for i in range(10_000)]\n",
27 | "value_score = [value_score[i] if (i % 8 == 0 or (i % 7 != 0 and i % 2 == 1)) else np.nan for i in range(10_000)]\n",
28 | "departments = ['finance' if i % 2 == 0 else 'accounting' for i in range(10_000)]\n",
29 | "df = pd.DataFrame({\n",
30 | " 'dept': departments,\n",
31 | " 'score': value_score\n",
32 | "})\n",
33 | "\n",
34 | "# Check first five rows.\n",
35 | "df.head()"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": null,
41 | "metadata": {},
42 | "outputs": [],
43 | "source": [
44 | "# What is the distribution of department?\n",
45 | "df['dept'].value_counts(normalize = True)"
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": null,
51 | "metadata": {},
52 | "outputs": [],
53 | "source": [
54 | "# Check for nulls.\n",
55 | "df.isnull().sum()"
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": null,
61 | "metadata": {},
62 | "outputs": [],
63 | "source": [
64 | "# Drop NAs.\n",
65 | "df.dropna(inplace = True)"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": null,
71 | "metadata": {},
72 | "outputs": [],
73 | "source": [
74 | "# What proportion of our responses came from accounting?\n",
75 | "df['dept'].value_counts(normalize = True)['accounting']"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": null,
81 | "metadata": {},
82 | "outputs": [],
83 | "source": [
84 | "df['dept'].value_counts(normalize = True)"
85 | ]
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "metadata": {},
90 | "source": [
91 | "1. Take the full sample (observed and missing) and break it into subgroups based on characteristics we know.\n",
92 | "2. Calculate a weight for each observation:\n",
93 | "\n",
94 | "$$\n",
95 | "\\text{weight}_i = \\frac{\\text{true proportion in group }i}{\\text{proportion of observed values in group }i}\n",
96 | "$$"
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": null,
102 | "metadata": {},
103 | "outputs": [],
104 | "source": [
105 | "# Calculate and print the weight for accounting.\n",
106 | "w_accounting = (1/2) / df['dept'].value_counts(normalize = True)['accounting']\n",
107 | "\n",
108 | "print(f'The weight for each accounting vote is: {w_accounting}.')"
109 | ]
110 | },
111 | {
112 | "cell_type": "code",
113 | "execution_count": null,
114 | "metadata": {},
115 | "outputs": [],
116 | "source": [
117 | "# Calculate and print the weight for finance.\n",
118 | "w_finance = (1/2) / df['dept'].value_counts(normalize = True)['finance']\n",
119 | "\n",
120 | "print(f'The weight for each finance vote is: {w_finance}.')"
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": null,
126 | "metadata": {},
127 | "outputs": [],
128 | "source": [
129 | "# Let's confirm that the weights times the counts\n",
130 | "# yield a 50/50 split.\n",
131 | "print(w_accounting * df['dept'].value_counts()['accounting'])\n",
132 | "print(w_finance * df['dept'].value_counts()['finance'])"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": null,
138 | "metadata": {},
139 | "outputs": [],
140 | "source": [
141 | "# Create column that stores the weights.\n",
142 | "\n",
143 | "df['weights'] = [w_accounting if i == 'accounting' else w_finance for i in df['dept']]"
144 | ]
145 | },
146 | {
147 | "cell_type": "code",
148 | "execution_count": null,
149 | "metadata": {},
150 | "outputs": [],
151 | "source": [
152 | "# Confirm counts.\n",
153 | "\n",
154 | "df['weights'].value_counts()"
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "execution_count": null,
160 | "metadata": {},
161 | "outputs": [],
162 | "source": [
163 | "# Calculate raw mean of my employee satisfaction score.\n",
164 | "\n",
165 | "np.mean(df['score'])"
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": null,
171 | "metadata": {},
172 | "outputs": [],
173 | "source": [
174 | "# Calculate weighted mean of my employee satisfaction score.\n",
175 | "\n",
176 | "np.mean(df['score'] * df['weights'])"
177 | ]
178 | },
179 | {
180 | "cell_type": "markdown",
181 | "metadata": {},
182 | "source": [
183 | "Our goal with post-weighting is to decrease bias. What should we be concerned about?\n",
184 | " \n",
185 | "- Due to the bias-variance tradeoff, as we decrease bias, we may cause an increase in variance.\n",
186 | "- This can be a really big deal, [said the New York Times in 2016](https://www.nytimes.com/2016/10/13/upshot/how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.html).\n",
187 | " "
188 | ]
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
194 | "What might be a situation where we may not be able to use weight class adjustments?\n",
195 | " \n",
196 | "- If we don't know the true distribution of our classes.\n",
197 | "- For example, if I didn't know that half of our team was in accounting and half in finance.\n",
198 | "- Another example, let's say I wanted to apply this weighting method to understand the percentage of voters supporting the Democratic candidate in the upcoming election. I don't know how many people will be in each of the age groups 18-34, 35-54, and 55+. I'll have to make a guess. (Hopefully an educated one!)\n",
199 | " "
200 | ]
201 | },
202 | {
203 | "cell_type": "markdown",
204 | "metadata": {},
205 | "source": [
206 | "#### Have more variables and want to build a sophisticated model?\n",
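207 | "Pass `df['weights']` into `sklearn` as `sample_weight` when fitting your model, and keep the weights column out of the feature matrix. Here `w_train` stands for the training rows of `df['weights']`. [Source](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.fit).\n",
208 | "> `model.fit(X_train, y_train, sample_weight = w_train)`"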
209 | ]
210 | },
211 | {
212 | "cell_type": "markdown",
213 | "metadata": {},
214 | "source": [
215 | "In R, I use `wtd.chi.sq` (from the `weights` package) for weighted chi-squared tests."
216 | ]
217 | }
218 | ],
219 | "metadata": {
220 | "kernelspec": {
221 | "display_name": "Python 3",
222 | "language": "python",
223 | "name": "python3"
224 | },
225 | "language_info": {
226 | "codemirror_mode": {
227 | "name": "ipython",
228 | "version": 3
229 | },
230 | "file_extension": ".py",
231 | "mimetype": "text/x-python",
232 | "name": "python",
233 | "nbconvert_exporter": "python",
234 | "pygments_lexer": "ipython3",
235 | "version": "3.8.3"
236 | }
237 | },
238 | "nbformat": 4,
239 | "nbformat_minor": 4
240 | }
241 |
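
The weighting steps in this notebook can also be sketched in vectorized form. The block below is a minimal illustration on a small made-up dataset (the `responses` frame and the 50/50 `true_share` are assumptions for this example, not the notebook's data). It also computes Kish's approximate effective sample size, which is not part of the notebook but gives a rough sense of the variance cost the notebook warns about.

```python
import numpy as np
import pandas as pd

# Hypothetical respondents: accounting is over-represented among those who answered.
responses = pd.DataFrame({
    "dept": ["accounting"] * 6 + ["finance"] * 2,
    "score": [6, 7, 5, 6, 7, 5, 4, 2],
})

# Assumed known population shares (the "true proportion" in the weight formula).
true_share = pd.Series({"accounting": 0.5, "finance": 0.5})

# Observed share of each class among respondents.
observed_share = responses["dept"].value_counts(normalize=True)

# weight = true proportion / observed proportion, looked up per row.
responses["weights"] = responses["dept"].map(true_share / observed_share)

raw_mean = responses["score"].mean()
weighted_mean = np.average(responses["score"], weights=responses["weights"])

# Kish's approximate effective sample size: (sum of weights)^2 / sum of squared weights.
w = responses["weights"]
n_eff = w.sum() ** 2 / (w ** 2).sum()

print(raw_mean, weighted_mean, n_eff)
```

Using `np.average` with explicit weights gives the same answer as `np.mean(score * weights)` whenever the weights sum to the number of rows, as they do in the notebook, and it stays correct if they do not.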
--------------------------------------------------------------------------------
/solutions/01_unit_missingness.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Unit Missingness Demo\n",
8 | "\n",
9 | "When handling unit missingness, the most common method is to do **weight class adjustments**. This requires us to break our observations into classes and weight them before doing our analysis."
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 1,
15 | "metadata": {},
16 | "outputs": [
17 | {
18 | "data": {
72 | "text/plain": [
73 | " dept score\n",
74 | "0 finance 5.0\n",
75 | "1 accounting 4.0\n",
76 | "2 finance NaN\n",
77 | "3 accounting 5.0\n",
78 | "4 finance NaN"
79 | ]
80 | },
81 | "execution_count": 1,
82 | "metadata": {},
83 | "output_type": "execute_result"
84 | }
85 | ],
86 | "source": [
87 | "# Import libraries.\n",
88 | "import pandas as pd\n",
89 | "import numpy as np\n",
90 | "\n",
91 | "# Set random seed.\n",
92 | "np.random.seed(42)\n",
93 | "\n",
94 | "# Generate dataframe.\n",
95 | "value_score = [min(np.random.poisson(5), 10) if i % 2 == 0 else min(np.random.poisson(6), 10) for i in range(10_000)]\n",
96 | "value_score = [value_score[i] if (i % 8 == 0 or (i % 7 != 0 and i % 2 == 1)) else np.nan for i in range(10_000)]\n",
97 | "departments = ['finance' if i % 2 == 0 else 'accounting' for i in range(10_000)]\n",
98 | "df = pd.DataFrame({\n",
99 | " 'dept': departments,\n",
100 | " 'score': value_score\n",
101 | "})\n",
102 | "\n",
103 | "# Check first five rows.\n",
104 | "df.head()"
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": 2,
110 | "metadata": {},
111 | "outputs": [
112 | {
113 | "data": {
114 | "text/plain": [
115 | "accounting 0.5\n",
116 | "finance 0.5\n",
117 | "Name: dept, dtype: float64"
118 | ]
119 | },
120 | "execution_count": 2,
121 | "metadata": {},
122 | "output_type": "execute_result"
123 | }
124 | ],
125 | "source": [
126 | "# What is the distribution of department?\n",
127 | "df['dept'].value_counts(normalize = True)"
128 | ]
129 | },
130 | {
131 | "cell_type": "code",
132 | "execution_count": 3,
133 | "metadata": {},
134 | "outputs": [
135 | {
136 | "data": {
137 | "text/plain": [
138 | "dept 0\n",
139 | "score 4464\n",
140 | "dtype: int64"
141 | ]
142 | },
143 | "execution_count": 3,
144 | "metadata": {},
145 | "output_type": "execute_result"
146 | }
147 | ],
148 | "source": [
149 | "# Check for nulls.\n",
150 | "df.isnull().sum()"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": 4,
156 | "metadata": {},
157 | "outputs": [],
158 | "source": [
159 | "# Drop NAs.\n",
160 | "df.dropna(inplace = True)"
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": 5,
166 | "metadata": {},
167 | "outputs": [
168 | {
169 | "data": {
170 | "text/plain": [
171 | "0.7742052023121387"
172 | ]
173 | },
174 | "execution_count": 5,
175 | "metadata": {},
176 | "output_type": "execute_result"
177 | }
178 | ],
179 | "source": [
180 | "# What proportion of our responses came from accounting?\n",
181 | "df['dept'].value_counts(normalize = True)['accounting']"
182 | ]
183 | },
184 | {
185 | "cell_type": "markdown",
186 | "metadata": {},
187 | "source": [
188 | "1. Take the full sample (observed and missing) and break it into subgroups based on characteristics we know.\n",
189 | "2. Calculate a weight for each observation:\n",
190 | "\n",
191 | "$$\n",
192 | "\\text{weight}_i = \\frac{\\text{true proportion in group }i}{\\text{proportion of observed values in group }i}\n",
193 | "$$"
194 | ]
195 | },
196 | {
197 | "cell_type": "code",
198 | "execution_count": 6,
199 | "metadata": {},
200 | "outputs": [
201 | {
202 | "name": "stdout",
203 | "output_type": "stream",
204 | "text": [
205 | "The weight for each accounting vote is: 0.645823611759216.\n"
206 | ]
207 | }
208 | ],
209 | "source": [
210 | "# Calculate and print the weight for accounting.\n",
211 | "w_accounting = (1/2) / df['dept'].value_counts(normalize = True)['accounting']\n",
212 | "\n",
213 | "print(f'The weight for each accounting vote is: {w_accounting}.')"
214 | ]
215 | },
216 | {
217 | "cell_type": "code",
218 | "execution_count": 7,
219 | "metadata": {},
220 | "outputs": [
221 | {
222 | "name": "stdout",
223 | "output_type": "stream",
224 | "text": [
225 | "The weight for each finance vote is: 2.2144.\n"
226 | ]
227 | }
228 | ],
229 | "source": [
230 | "# Calculate the and print weight for finance.\n",
231 | "w_finance = (1/2) / df['dept'].value_counts(normalize = True)['finance']\n",
232 | "\n",
233 | "print(f'The weight for each finance vote is: {w_finance}.')"
234 | ]
235 | },
236 | {
237 | "cell_type": "code",
238 | "execution_count": 8,
239 | "metadata": {},
240 | "outputs": [
241 | {
242 | "name": "stdout",
243 | "output_type": "stream",
244 | "text": [
245 | "2767.9999999999995\n",
246 | "2768.0\n"
247 | ]
248 | }
249 | ],
250 | "source": [
251 | "# Let's confirm that the weights times the counts\n",
252 | "# yield a 50/50 split.\n",
253 | "print(w_accounting * df['dept'].value_counts()['accounting'])\n",
254 | "print(w_finance * df['dept'].value_counts()['finance'])"
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "execution_count": 9,
260 | "metadata": {},
261 | "outputs": [],
262 | "source": [
263 | "# Create column that stores the weights.\n",
264 | "\n",
265 | "df['weights'] = [w_accounting if i == 'accounting' else w_finance for i in df['dept']]"
266 | ]
267 | },
268 | {
269 | "cell_type": "code",
270 | "execution_count": 10,
271 | "metadata": {},
272 | "outputs": [
273 | {
274 | "data": {
275 | "text/plain": [
276 | "0.645824 4286\n",
277 | "2.214400 1250\n",
278 | "Name: weights, dtype: int64"
279 | ]
280 | },
281 | "execution_count": 10,
282 | "metadata": {},
283 | "output_type": "execute_result"
284 | }
285 | ],
286 | "source": [
287 | "# Confirm counts.\n",
288 | "\n",
289 | "df['weights'].value_counts()"
290 | ]
291 | },
292 | {
293 | "cell_type": "code",
294 | "execution_count": 11,
295 | "metadata": {},
296 | "outputs": [
297 | {
298 | "data": {
299 | "text/plain": [
300 | "5.724530346820809"
301 | ]
302 | },
303 | "execution_count": 11,
304 | "metadata": {},
305 | "output_type": "execute_result"
306 | }
307 | ],
308 | "source": [
309 | "# Calculate raw mean of my employee satisfaction score.\n",
310 | "\n",
311 | "np.mean(df['score'])"
312 | ]
313 | },
314 | {
315 | "cell_type": "code",
316 | "execution_count": 12,
317 | "metadata": {},
318 | "outputs": [
319 | {
320 | "data": {
321 | "text/plain": [
322 | "5.450634997666867"
323 | ]
324 | },
325 | "execution_count": 12,
326 | "metadata": {},
327 | "output_type": "execute_result"
328 | }
329 | ],
330 | "source": [
331 | "# Calculate weighted mean of my employee satisfaction score.\n",
332 | "\n",
333 | "np.mean(df['score'] * df['weights'])"
334 | ]
335 | },
336 | {
337 | "cell_type": "markdown",
338 | "metadata": {},
339 | "source": [
340 | "Our goal with post-weighting is to decrease bias. What should we be concerned about?\n",
341 | " \n",
342 | "- Due to the bias-variance tradeoff, as we decrease bias, we may cause an increase in variance.\n",
343 | "- This can be a really big deal, [said the New York Times in 2016](https://www.nytimes.com/2016/10/13/upshot/how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.html).\n",
344 | " "
345 | ]
346 | },
347 | {
348 | "cell_type": "markdown",
349 | "metadata": {},
350 | "source": [
351 | "What might be a situation where we may not be able to use weight class adjustments?\n",
352 | " \n",
353 | "- If we don't know the true distribution of our classes.\n",
354 | "- For example, if I didn't know that half of our team was in accounting and half in finance.\n",
355 | "- Another example, let's say I wanted to apply this weighting method to understand the percentage of voters supporting the Democratic candidate in the upcoming election. I don't know how many people will be in each of the age groups 18-34, 35-54, and 55+. I'll have to make a guess. (Hopefully an educated one!)\n",
356 | " "
357 | ]
358 | },
359 | {
360 | "cell_type": "markdown",
361 | "metadata": {},
362 | "source": [
363 | "#### Have more variables and want to build a sophisticated model?\n",
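364 | "Pass `df['weights']` into `sklearn` as `sample_weight` when fitting your model, and keep the weights column out of the feature matrix. Here `w_train` stands for the training rows of `df['weights']`. [Source](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.fit).\n",
365 | "> `model.fit(X_train, y_train, sample_weight = w_train)`"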
366 | ]
367 | }
368 | ],
369 | "metadata": {
370 | "kernelspec": {
371 | "display_name": "Python 3",
372 | "language": "python",
373 | "name": "python3"
374 | },
375 | "language_info": {
376 | "codemirror_mode": {
377 | "name": "ipython",
378 | "version": 3
379 | },
380 | "file_extension": ".py",
381 | "mimetype": "text/x-python",
382 | "name": "python",
383 | "nbconvert_exporter": "python",
384 | "pygments_lexer": "ipython3",
385 | "version": "3.8.3"
386 | }
387 | },
388 | "nbformat": 4,
389 | "nbformat_minor": 4
390 | }
391 |
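
The last cell of this notebook suggests passing the weights to `sklearn` when fitting a model. A minimal sketch of that idea follows, assuming a small made-up training frame; the `tenure` feature and all the numbers are hypothetical. It uses `LinearRegression` rather than the random forest linked in the notebook only because its result is easy to check by hand; both accept `sample_weight` in `fit`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data; 'weights' holds the class weights and is not a feature.
train = pd.DataFrame({
    "tenure": [1.0, 2.0, 3.0, 4.0],
    "score": [2.0, 3.0, 5.0, 4.0],
    "weights": [1.0, 1.0, 2.0, 1.0],
})

# Keep the weights column out of the feature matrix.
X_train = train[["tenure"]]
y_train = train["score"]

unweighted = LinearRegression().fit(X_train, y_train)
weighted = LinearRegression().fit(X_train, y_train, sample_weight=train["weights"])

print(unweighted.coef_, weighted.coef_)
```

For ordinary least squares, a weight of 2 on a row behaves like including that row twice, which is a handy way to sanity-check that the weights are being applied.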
--------------------------------------------------------------------------------
/00_interactive_plot.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "#### To confirm that you have the latest versions of these packages, uncomment and run the following command.\n",
10 | "# !pip install numpy pandas matplotlib scikit-learn ipywidgets ipython missingno --upgrade\n",
11 | "\n",
12 | "# To generate and store data.\n",
13 | "import numpy as np\n",
14 | "import pandas as pd\n",
15 | "\n",
16 | "# To visualize data.\n",
17 | "import matplotlib.pyplot as plt\n",
18 | "\n",
19 | "# To fit linear regression model.\n",
20 | "from sklearn.linear_model import LinearRegression\n",
21 | "\n",
22 | "# To allow interactive plot.\n",
23 | "from ipywidgets import interact\n",
24 | "from IPython.display import display\n",
25 | "\n",
26 | "# There is a SciPy issue that won't affect our work, but a warning exists\n",
27 | "# and an update is not imminent.\n",
28 | "import warnings\n",
29 | "warnings.filterwarnings(action=\"ignore\")\n",
30 | "\n",
31 | "# To render plots in the notebook.\n",
32 | "%matplotlib inline"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": null,
38 | "metadata": {},
39 | "outputs": [],
40 | "source": [
41 | "# Generate data and store in a dataframe.\n",
42 | "\n",
43 | "np.random.seed(42)\n",
44 | "\n",
45 | "age = np.random.uniform(20, 60, size = 100)\n",
46 | "income = 15000 + 750 * age + np.random.normal(0, 20000, size = 100)\n",
47 | "income = [i if i >= 0 else 0 for i in income]\n",
48 | "\n",
49 | "df = pd.DataFrame({'income':income,\n",
50 | " 'age': age})"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": null,
56 | "metadata": {},
57 | "outputs": [],
58 | "source": [
59 | "# Create three functions to model missingness according to certain patterns.\n",
60 | "\n",
61 | "def create_mcar_column(df, missing_column = 'income', p_missing = 0.01, random_state = 42):\n",
62 | " \"\"\"\n",
63 | " Creates missingness indicator column, where data are MCAR (missing completely at random).\n",
64 | " \n",
65 | " User must specify:\n",
66 | " df = the pandas DataFrame the user wants to read in for analysis\n",
67 | " missing_column = the name of the column in df that is missing\n",
68 | " p_missing = the proportion of observations that are missing\n",
69 | " \n",
70 | " Function returns:\n",
71 | " mcar_column = a column that indicates whether data are missing, assuming MCAR\n",
72 | " \"\"\"\n",
73 | " np.random.seed(random_state)\n",
74 | " \n",
75 | " mcar_indices = [df.sample(n = 1).index[0] for i in range(round(p_missing * df.shape[0]))]\n",
76 | " \n",
77 | " while len(set(mcar_indices)) < round(p_missing * df.shape[0]):\n",
78 | " mcar_indices.append(df.sample(n = 1).index[0])\n",
79 | " \n",
80 | " mcar_column = [1 if i in mcar_indices else 0 for i in range(df.shape[0])]\n",
81 | " \n",
82 | " return mcar_column\n",
83 | "\n",
84 | "def create_mar_column(df, missing_column = 'income', depends_on = 'age', method = 'linear', p_missing = 0.01, random_state = 42):\n",
85 | " \"\"\"\n",
86 | " Creates missingness indicator column, where data are MAR (missing at random).\n",
87 | " \n",
88 | " User must specify:\n",
89 | " df = the pandas DataFrame the user wants to read in for analysis\n",
90 | " missing_column = the name of the column in df that is missing\n",
91 | " depends_on = the name of the column in df which affects the missingness\n",
92 | " method = 'linear' or 'quadratic'\n",
93 | " - 'linear' means the sampling weight for missingness is proportional to 1 / depends_on (smaller values are more likely to be missing)\n",
94 | " - 'quadratic' means the sampling weight for missingness is proportional to 1 / depends_on ** 2\n",
95 | " p_missing = the proportion of observations that are missing\n",
96 | " \n",
97 | " Function returns:\n",
98 | " mar_column = a column that indicates whether data are missing, assuming MAR\n",
99 | " \"\"\"\n",
100 | " np.random.seed(random_state)\n",
101 | " \n",
102 | " if method == 'linear':\n",
103 | " mar_indices = [df.sample(n = 1, weights = df[depends_on] ** -1).index[0] for i in range(round(p_missing * df.shape[0]))]\n",
104 | "\n",
105 | " while len(set(mar_indices)) < round(p_missing * df.shape[0]):\n",
106 | " mar_indices.append(df.sample(n = 1, weights = df[depends_on] ** -1).index[0])\n",
107 | " \n",
108 | " elif method == 'quadratic':\n",
109 | " mar_indices = [df.sample(n = 1, weights = df[depends_on] ** -2).index[0] for i in range(round(p_missing * df.shape[0]))]\n",
110 | "\n",
111 | " while len(set(mar_indices)) < round(p_missing * df.shape[0]):\n",
112 | " mar_indices.append(df.sample(n = 1, weights = df[depends_on] ** -2).index[0])\n",
113 | "\n",
114 | " mar_column = [1 if i in mar_indices else 0 for i in range(df.shape[0])]\n",
115 | " \n",
116 | " return mar_column\n",
117 | "\n",
118 | "def create_nmar_column(df, missing_column = 'income', method = 'linear', p_missing = 0.01, random_state = 42):\n",
119 | " \"\"\"\n",
120 | " Creates missingness indicator column, where data are NMAR (not missing at random).\n",
121 | " \n",
122 | " User must specify:\n",
123 | " df = the pandas DataFrame the user wants to read in for analysis\n",
124 | " missing_column = the name of the column in df that is missing\n",
125 | " method = 'linear' or 'quadratic'\n",
126 | " - 'linear' means the sampling weight for missingness is proportional to 1 / missing_column (smaller values are more likely to be missing)\n",
127 | " - 'quadratic' means the sampling weight for missingness is proportional to 1 / missing_column ** 2\n",
128 | " p_missing = the proportion of observations that are missing\n",
129 | " \n",
130 | " Function returns:\n",
131 | " nmar_column = a column that indicates whether data are missing, assuming NMAR\n",
132 | " \"\"\"\n",
133 | " np.random.seed(random_state)\n",
134 | " \n",
135 | " if method == 'linear':\n",
136 | " nmar_indices = [df.sample(n = 1, weights = df[missing_column] ** -1).index[0] for i in range(round(p_missing * df.shape[0]))]\n",
137 | "\n",
138 | " while len(set(nmar_indices)) < round(p_missing * df.shape[0]):\n",
139 | " nmar_indices.append(df.sample(n = 1, weights = df[missing_column] ** -1).index[0])\n",
140 | " \n",
141 | " elif method == 'quadratic':\n",
142 | " nmar_indices = [df.sample(n = 1, weights = df[missing_column] ** -2).index[0] for i in range(round(p_missing * df.shape[0]))]\n",
143 | "\n",
144 | " while len(set(nmar_indices)) < round(p_missing * df.shape[0]):\n",
145 | " nmar_indices.append(df.sample(n = 1, weights = df[missing_column] ** -2).index[0])\n",
146 | " \n",
147 | " nmar_column = [1 if i in nmar_indices else 0 for i in range(df.shape[0])]\n",
148 | " \n",
149 | " return nmar_column"
150 | ]
151 | },
152 | {
153 | "cell_type": "code",
154 | "execution_count": null,
155 | "metadata": {},
156 | "outputs": [],
157 | "source": [
158 | "def generate_scatterplot(p_missing, missing_type, method = 'linear', missing_column = 'income', depends_on = 'age'):\n",
159 | " # Generate one plot.\n",
160 | " fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (16,9))\n",
161 | "\n",
162 | " # Set labels and axes.\n",
163 | " ax.set_xlabel(\"Age\", position = (0,0), ha = 'left', fontsize = 25, color = 'grey', alpha = 0.85)\n",
164 | " ax.set_ylabel(\"Income\", position = (0,1), ha = 'right', va = 'top', fontsize = 25, rotation = 0, color = 'grey', alpha = 0.85)\n",
165 | " \n",
166 | " ax.set_ylim([-1000, 100000])\n",
167 | " \n",
168 | " # Generate data with proportion p missing.\n",
169 | " if missing_type == 'MCAR':\n",
170 | " df['missingness'] = create_mcar_column(df,\n",
171 | " missing_column = missing_column,\n",
172 | " p_missing = p_missing)\n",
173 | " elif missing_type == 'MAR':\n",
174 | " df['missingness'] = create_mar_column(df,\n",
175 | " missing_column = missing_column,\n",
176 | " depends_on = depends_on,\n",
177 | " method = method,\n",
178 | " p_missing = p_missing)\n",
179 | " \n",
180 | " elif missing_type == 'NMAR':\n",
181 | " df['missingness'] = create_nmar_column(df,\n",
182 | " missing_column = missing_column,\n",
183 | " method = method,\n",
184 | " p_missing = p_missing)\n",
185 | " \n",
186 | " # Generate scatterplot.\n",
187 | " ax.scatter(df['age'][df['missingness'] == 0], df['income'][df['missingness'] == 0], s = 35, color = '#185fad', alpha = 0.75, label = 'Observed')\n",
188 | " ax.scatter(df['age'][df['missingness'] == 1], df['income'][df['missingness'] == 1], s = 35, color = 'grey', alpha = 0.25, label = '')\n",
189 | " \n",
190 | " # Plot the true line and the line of best fit through the observed values only.\n",
191 | " x = np.linspace(20, 60)\n",
192 | " ax.plot(x, 15000 + 750 * x, c = 'orange', alpha = 0.7, label = '\"True\" Line', lw = 3)\n",
193 | " model = LinearRegression().fit(df[['age']][df['missingness'] == 0], df['income'][df['missingness'] == 0])\n",
194 | " ax.plot(x, model.intercept_ + model.coef_ * x, c = '#185fad', alpha = 0.7, label='Observed Line', lw = 3)\n",
195 | "\n",
196 | " # Generate title and legend.\n",
197 | " ax.set_title(f'Type of Missing Data: {missing_type} \\nProportion Missing: {p_missing}', position = (0,1), ha = 'left', fontsize = 25)\n",
198 | " ax.legend(prop={'size': 20}, loc = 2)\n",
199 | " \n",
200 | " ax.set_xticks([])\n",
201 | " ax.set_yticks([])\n",
202 | " plt.show();"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": null,
208 | "metadata": {},
209 | "outputs": [],
210 | "source": [
211 | "generate_scatterplot(p_missing=0.1,\n",
212 | " missing_type = 'MCAR',\n",
213 | " method = 'linear')"
214 | ]
215 | },
216 | {
217 | "cell_type": "code",
218 | "execution_count": null,
219 | "metadata": {
220 | "scrolled": false
221 | },
222 | "outputs": [],
223 | "source": [
224 | "def plot_interact(p_missing = 0.8, missing_type = 'MCAR', method = 'linear'):\n",
225 | " generate_scatterplot(p_missing, missing_type, method, missing_column = 'income', depends_on = 'age')\n",
226 | " \n",
227 | "interact(plot_interact, p_missing = (0, 0.99, 0.05), missing_type = ['MCAR','MAR','NMAR'], method = ['linear','quadratic']);"
228 | ]
229 | }
230 | ],
231 | "metadata": {
232 | "kernelspec": {
233 | "display_name": "Python 3",
234 | "language": "python",
235 | "name": "python3"
236 | },
237 | "language_info": {
238 | "codemirror_mode": {
239 | "name": "ipython",
240 | "version": 3
241 | },
242 | "file_extension": ".py",
243 | "mimetype": "text/x-python",
244 | "name": "python",
245 | "nbconvert_exporter": "python",
246 | "pygments_lexer": "ipython3",
247 | "version": "3.8.3"
248 | }
249 | },
250 | "nbformat": 4,
251 | "nbformat_minor": 2
252 | }
253 |
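The missingness helpers this notebook calls (`create_mcar_column`, `create_mar_column`, `create_nmar_column`) all mark a fixed number of rows as missing, optionally weighted by a column. A minimal sketch of that idea in one function follows. It assumes NumPy's `Generator.choice` with `replace = False` is an acceptable stand-in for the notebook's repeated `df.sample` draws; the function name `missingness_indicator` and its `weights` argument are hypothetical and not part of the workshop code.

```python
import numpy as np
import pandas as pd

def missingness_indicator(df, p_missing, weights=None, random_state=42):
    """Return a 0/1 list marking round(p_missing * n) rows as missing.

    weights=None marks rows uniformly at random (MCAR-style). Passing a
    sequence of non-negative weights makes rows with larger weights more
    likely to be marked missing (MAR- or NMAR-style, depending on whether
    the weights come from another column or from the missing column itself).
    """
    rng = np.random.default_rng(random_state)
    n = df.shape[0]
    n_missing = round(p_missing * n)

    probs = None
    if weights is not None:
        w = np.asarray(weights, dtype=float)
        probs = w / w.sum()  # normalize weights to probabilities

    # Sample row positions without replacement so exactly n_missing rows are marked.
    chosen = rng.choice(n, size=n_missing, replace=False, p=probs)

    indicator = np.zeros(n, dtype=int)
    indicator[chosen] = 1
    return indicator.tolist()

# Hypothetical usage on data shaped like the notebook's.
demo = pd.DataFrame({'age': np.linspace(20, 60, 100)})
mcar = missingness_indicator(demo, p_missing=0.3)
mar = missingness_indicator(demo, p_missing=0.2, weights=demo['age'])
```

Sampling without replacement removes the need for the `while` loop that tops up duplicate draws, at the cost of using a different random stream than `df.sample`, so the rows selected will not match the notebook's output.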
--------------------------------------------------------------------------------
/02_item_missingness.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Item Missingness Demo"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "# To generate and store data.\n",
17 | "import numpy as np\n",
18 | "import pandas as pd\n",
19 | "import scipy.stats as stats\n",
20 | "\n",
21 | "# To visualize data.\n",
22 | "import matplotlib.pyplot as plt\n",
23 | "\n",
24 | "# To fit linear and logistic regression models.\n",
25 | "from sklearn.linear_model import LinearRegression, LogisticRegression\n",
26 | "\n",
27 | "# Install and import missingno to visualize missingness patterns. (Uncomment first line to install missingno.)\n",
28 | "# !pip3 install missingno\n",
29 | "import missingno as msno\n",
30 | "\n",
31 | "# There is a SciPy issue that won't affect our work, but a warning exists\n",
32 | "# and an update is not imminent.\n",
33 | "import warnings\n",
34 | "warnings.filterwarnings(action=\"ignore\")\n",
35 | "\n",
36 | "# To render plots in the notebook.\n",
37 | "%matplotlib inline"
38 | ]
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "metadata": {},
43 | "source": [
44 | "### Let's generate some data. Specifically, we'll generate age, partnered, children, and income data, where income is linearly related to age, partnered, and children."
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": null,
50 | "metadata": {},
51 | "outputs": [],
52 | "source": [
53 | "# To ensure we get the same results.\n",
54 | "np.random.seed(42)\n",
55 | "\n",
56 | "# Generate data.\n",
57 | "age = np.round(np.random.uniform(20, 60, size = 100))\n",
58 | "partnered = np.random.binomial(1, 0.8, size = 100)\n",
59 | "children = np.random.poisson(2, size = 100)\n",
60 | "income = 15000 + 750 * age + 20000 * partnered - 2500 * children + np.random.normal(0, 20000, size = 100)\n",
61 | "\n",
62 | "# Ensure income is not negative!\n",
63 | "income = [i if i >= 0 else 0 for i in income]\n",
64 | "\n",
65 | "# Combine our results into one dataframe.\n",
66 | "df = pd.DataFrame({'age': age,\n",
67 | " 'partnered': partnered,\n",
68 | " 'children': children,\n",
69 | " 'income': income})\n",
70 | "\n",
71 | "# Check the first five rows of df to make sure we did this properly.\n",
72 | "df.head()"
73 | ]
74 | },
75 | {
76 | "cell_type": "markdown",
77 | "metadata": {},
78 | "source": [
79 | "### Run this cell. These are functions that will generate missing values according to MCAR, MAR, or NMAR."
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": null,
85 | "metadata": {},
86 | "outputs": [],
87 | "source": [
88 | "def create_mcar_column(df, missing_column = 'income', p_missing = 0.01, random_state = 42):\n",
89 | " \"\"\"\n",
90 | " Creates missingness indicator column, where data are MCAR (missing completely at random).\n",
91 | " \n",
92 | " User must specify:\n",
93 | " df = the pandas DataFrame the user wants to read in for analysis\n",
94 | " missing_column = the name of the column in df that is missing\n",
95 | " p_missing = the proportion of observations that are missing\n",
96 | " \n",
97 | " Function returns:\n",
98 | " mcar_column = a column that indicates whether data are missing, assuming MCAR\n",
99 | " \"\"\"\n",
100 | " np.random.seed(random_state)\n",
101 | " \n",
102 | " mcar_indices = [df.sample(n = 1).index[0] for i in range(round(p_missing * df.shape[0]))]\n",
103 | " \n",
104 | " while len(set(mcar_indices)) < round(p_missing * df.shape[0]):\n",
105 | " mcar_indices.append(df.sample(n = 1).index[0])\n",
106 | " \n",
107 | " mcar_column = [1 if i in mcar_indices else 0 for i in range(df.shape[0])]\n",
108 | " \n",
109 | " return mcar_column\n",
110 | "\n",
111 | "def create_mar_column(df, missing_column = 'income', depends_on = 'age', method = 'linear', p_missing = 0.01, random_state = 42):\n",
112 | " \"\"\"\n",
113 | " Creates missingness indicator column, where data are MAR (missing at random).\n",
114 | " \n",
115 | " User must specify:\n",
116 | " df = the pandas DataFrame the user wants to read in for analysis\n",
117 | " missing_column = the name of the column in df that is missing\n",
118 | " depends_on = the name of the column in df which affects the missingness\n",
119 | " method = 'linear' or 'quadratic'\n",
120 | " - 'linear' means the probability of missingness is linearly related to the depends_on variable\n",
121 | " - 'quadratic' means the probability of missingness is quadratically related to the depends_on variable\n",
122 | " p_missing = the proportion of observations that are missing\n",
123 | " \n",
124 | " Function returns:\n",
125 | " mar_column = a column that indicates whether data are missing, assuming MAR\n",
126 | " \"\"\"\n",
127 | " np.random.seed(random_state)\n",
128 | " \n",
129 | " if method == 'linear':\n",
130 | " mar_indices = [df.sample(n = 1, weights = depends_on).index[0] for i in range(round(p_missing * df.shape[0]))]\n",
131 | "\n",
132 | " while len(set(mar_indices)) < round(p_missing * df.shape[0]):\n",
133 | " mar_indices.append(df.sample(n = 1, weights = depends_on).index[0])\n",
134 | " \n",
135 | " elif method == 'quadratic':\n",
136 | " mar_indices = [df.sample(n = 1, weights = df[depends_on] ** 2).index[0] for i in range(round(p_missing * df.shape[0]))]\n",
137 | "\n",
138 | " while len(set(mar_indices)) < round(p_missing * df.shape[0]):\n",
139 | " mar_indices.append(df.sample(n = 1, weights = df[depends_on] ** 2).index[0])\n",
140 | "\n",
141 | " mar_column = [1 if i in mar_indices else 0 for i in range(df.shape[0])]\n",
142 | " \n",
143 | " return mar_column\n",
144 | "\n",
145 | "def create_nmar_column(df, missing_column = 'income', method = 'linear', p_missing = 0.01, random_state = 42):\n",
146 | " \"\"\"\n",
147 | " Creates missingness indicator column, where data are NMAR (not missing at random).\n",
148 | " \n",
149 | " User must specify:\n",
150 | " df = the pandas DataFrame the user wants to read in for analysis\n",
151 | " missing_column = the name of the column in df that is missing\n",
152 | " method = 'linear' or 'quadratic'\n",
153 | " - 'linear' means the probability of missingness is linearly related to the missing_column variable itself\n",
154 | " - 'quadratic' means the probability of missingness is quadratically related to the missing_column variable itself\n",
155 | " p_missing = the proportion of observations that are missing\n",
156 | " \n",
157 | " Function returns:\n",
158 | " nmar_column = a column that indicates whether data are missing, assuming NMAR\n",
159 | " \"\"\"\n",
160 | " np.random.seed(random_state)\n",
161 | " \n",
162 | " if method == 'linear':\n",
163 | " nmar_indices = [df.sample(n = 1, weights = missing_column).index[0] for i in range(round(p_missing * df.shape[0]))]\n",
164 | "\n",
165 | " while len(set(nmar_indices)) < round(p_missing * df.shape[0]):\n",
166 | " nmar_indices.append(df.sample(n = 1, weights = missing_column).index[0])\n",
167 | " \n",
168 | " elif method == 'quadratic':\n",
169 | " nmar_indices = [df.sample(n = 1, weights = df[missing_column] ** 2).index[0] for i in range(round(p_missing * df.shape[0]))]\n",
170 | "\n",
171 | " while len(set(nmar_indices)) < round(p_missing * df.shape[0]):\n",
172 | " nmar_indices.append(df.sample(n = 1, weights = df[missing_column] ** 2).index[0])\n",
173 | " \n",
174 | " nmar_column = [1 if i in nmar_indices else 0 for i in range(df.shape[0])]\n",
175 | " \n",
176 | " return nmar_column"
177 | ]
178 | },
179 | {
180 | "cell_type": "markdown",
181 | "metadata": {},
182 | "source": [
183 | "### Let's generate some missing data!"
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": null,
189 | "metadata": {},
190 | "outputs": [],
191 | "source": [
192 | "df['age_missingness'] = create_mcar_column(df,\n",
193 | " missing_column = 'age', \n",
194 | " p_missing = 0.3,\n",
195 | " random_state = 42)\n",
196 | "\n",
197 | "df['partnered_missingness'] = create_mar_column(df,\n",
198 | " missing_column = 'partnered',\n",
199 | " method = 'linear',\n",
200 | " p_missing = 0.2,\n",
201 | " random_state = 42)\n",
202 | "\n",
203 | "df['income_missingness'] = create_nmar_column(df,\n",
204 | " missing_column = 'income',\n",
205 | " method = 'quadratic',\n",
206 | " p_missing = 0.2,\n",
207 | " random_state = 42)\n",
208 | "\n",
209 | "print(df['age_missingness'].value_counts())\n",
210 | "print(df['partnered_missingness'].value_counts())\n",
211 | "print(df['income_missingness'].value_counts())"
212 | ]
213 | },
214 | {
215 | "cell_type": "code",
216 | "execution_count": null,
217 | "metadata": {},
218 | "outputs": [],
219 | "source": [
220 | "df.head()"
221 | ]
222 | },
223 | {
224 | "cell_type": "markdown",
225 | "metadata": {},
226 | "source": [
227 | "### Let's create a new dataframe with the values actually missing."
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": null,
233 | "metadata": {},
234 | "outputs": [],
235 | "source": [
236 | "df_missing = pd.DataFrame(df['children'])\n",
237 | "\n",
238 | "df_missing['age'] = [df.loc[i,'age'] if df.loc[i,'age_missingness'] == 0 else np.nan for i in range(100)]\n",
239 | "df_missing['partnered'] = [df.loc[i,'partnered'] if df.loc[i,'partnered_missingness'] == 0 else np.nan for i in range(100)]\n",
240 | "df_missing['income'] = [df.loc[i,'income'] if df.loc[i,'income_missingness'] == 0 else np.nan for i in range(100)]\n",
241 | "\n",
242 | "df_missing.head()"
243 | ]
244 | },
245 | {
246 | "cell_type": "markdown",
247 | "metadata": {},
248 | "source": [
249 | "### Let's visualize our missing data.\n",
250 | "- Children is 100% observed.\n",
251 | "- Age is missing completely at random and is missing 30% of its observations.\n",
252 | "- Partnered is missing at random and is missing 20% of its observations.\n",
253 | "- Income is not missing at random and is missing 20% of its observations."
254 | ]
255 | },
256 | {
257 | "cell_type": "code",
258 | "execution_count": null,
259 | "metadata": {},
260 | "outputs": [],
261 | "source": [
262 | "msno.matrix(df_missing);"
263 | ]
264 | },
265 | {
266 | "cell_type": "markdown",
267 | "metadata": {},
268 | "source": [
269 | "### Generate histograms."
270 | ]
271 | },
272 | {
273 | "cell_type": "code",
274 | "execution_count": null,
275 | "metadata": {},
276 | "outputs": [],
277 | "source": [
278 | "def compare_histograms(df, imputed_column, original_column, missingness_column, x_label, y_label = 'Frequency'):\n",
279 | " fig, (ax0, ax1) = plt.subplots(nrows = 2, ncols = 1, figsize = (16,9))\n",
280 | "\n",
281 | " # Set axes of histograms.\n",
282 | " mode = stats.mode(df[imputed_column])\n",
283 | " rnge = max(df[original_column]) - min(df[original_column])\n",
284 | " xmin = min(df[original_column]) - 0.02 * rnge\n",
285 | " xmax = max(df[original_column]) + 0.02 * rnge\n",
286 | " ymax = 1.3 * (mode[1][0] + df[df[original_column] == mode[0][0]].shape[0])\n",
287 | "\n",
288 | " ax0.set_xlim(xmin, xmax)\n",
289 | " ax0.set_ylim(0, ymax)\n",
290 | " ax1.set_xlim(xmin, xmax)\n",
291 | " ax1.set_ylim(0, ymax)\n",
292 | "\n",
293 | " # Set top labels.\n",
294 | " ax0.set_title('Real Histogram', position = (0,1), ha = 'left', fontsize = 25)\n",
295 | " ax0.set_xlabel(x_label, position = (0,0), ha = 'left', fontsize = 25, color = 'grey', alpha = 0.85)\n",
296 | " ax0.set_ylabel(y_label, position = (0,1), ha = 'right', va = 'top', fontsize = 25, rotation = 0, color = 'grey', alpha = 0.85)\n",
297 | " ax0.set_xticks([])\n",
298 | " ax0.set_yticks([])\n",
299 | "\n",
300 | " # Generate top histogram.\n",
301 | " ax0.hist(df[original_column], bins = 15, color = '#185fad', alpha = 0.75, label = '')\n",
302 | " ax0.axvline(np.mean(df[original_column]), color = '#185fad', lw = 5, label = 'True Mean')\n",
303 | " ax0.legend(prop={'size': 15}, loc = 1)\n",
304 | "\n",
305 | " # Set bottom labels.\n",
306 | " ax1.set_title('Observed + Imputed Histogram', position = (0,1), ha = 'left', fontsize = 25)\n",
307 | " ax1.set_xlabel(x_label, position = (0,0), ha = 'left', fontsize = 25, color = 'grey', alpha = 0.85)\n",
308 | " ax1.set_ylabel(y_label, position = (0,1), ha = 'right', va = 'top', fontsize = 25, rotation = 0, color = 'grey', alpha = 0.85)\n",
309 | "\n",
310 | " # Generate bottom histogram.\n",
311 | " ax1.hist([df[imputed_column][df[missingness_column] == 0], df[imputed_column][df[missingness_column] == 1]], bins = 15, color = ['#185fad','orange'], alpha = 0.75, label = '', stacked = True)\n",
312 | " ax1.axvline(np.mean(df[original_column]), color = '#185fad', lw = 5, label = 'True Mean')\n",
313 | " ax1.axvline(np.mean(df[original_column][df[missingness_column] == 0]), color = 'grey', alpha = 0.5, lw = 5, label = 'Observed Mean')\n",
314 | " ax1.axvline(np.mean(df[imputed_column]), color = 'orange', lw = 5, label = 'Observed and Imputed Mean')\n",
315 | " ax1.legend(prop={'size': 15}, loc = 1)\n",
316 | " \n",
317 | " plt.tight_layout()\n",
318 | "\n",
319 | " plt.show();"
320 | ]
321 | },
322 | {
323 | "cell_type": "markdown",
324 | "metadata": {},
325 | "source": [
326 | "### Examine various imputation methods."
327 | ]
328 | },
329 | {
330 | "cell_type": "markdown",
331 | "metadata": {},
332 | "source": [
333 | "##### Mean Imputation"
334 | ]
335 | },
336 | {
337 | "cell_type": "code",
338 | "execution_count": null,
339 | "metadata": {},
340 | "outputs": [],
341 | "source": [
342 | "def impute_mean(df, impute_column, missingness_column):\n",
343 | " \"\"\"\n",
344 | " Imputes mean for any value where data is marked missing.\n",
345 | " \n",
346 | " User must specify:\n",
347 | " df = the pandas DataFrame the user wants to read in for analysis\n",
348 | " impute_column = the name of the column in df that is missing\n",
349 | " missingness_column = the name of the missingness indicator column\n",
350 | " \n",
351 | " Function returns:\n",
352 | " mean_impute = a column with the mean imputed for any missing value.\n",
353 | " \"\"\"\n",
354 | " mean_impute = [df.loc[i,impute_column] if df.loc[i,missingness_column] == 0 else np.mean(df[impute_column][df[missingness_column] == 0]) for i in range(df.shape[0])]\n",
355 | " \n",
356 | " return mean_impute"
357 | ]
358 | },
359 | {
360 | "cell_type": "code",
361 | "execution_count": null,
362 | "metadata": {},
363 | "outputs": [],
364 | "source": [
365 | "df['age_mean_imputed'] = impute_mean(df, 'age', 'age_missingness')"
366 | ]
367 | },
368 | {
369 | "cell_type": "code",
370 | "execution_count": null,
371 | "metadata": {},
372 | "outputs": [],
373 | "source": [
374 | "compare_histograms(df = df,\n",
375 | " imputed_column = 'age_mean_imputed',\n",
376 | " original_column = 'age',\n",
377 | " missingness_column = 'age_missingness',\n",
378 | " x_label = 'Age',\n",
379 | " y_label = 'Frequency')"
380 | ]
381 | },
382 | {
383 | "cell_type": "markdown",
384 | "metadata": {},
385 | "source": [
386 | "How to read the above chart:\n",
387 | "- The blue line is the true mean of all data (observed and unobserved).\n",
388 | "- The grey line is the mean of just the observed data. (i.e. no imputation)\n",
389 | "- The orange line is the mean of the observed and imputed data."
390 | ]
391 | },
392 | {
393 | "cell_type": "markdown",
394 | "metadata": {},
395 | "source": [
396 | "$$\n",
397 | "\\begin{eqnarray*}\n",
398 | "s &=& \\sqrt{\\frac{\\sum_{i=1}^n(x_i - \\bar{x})^2}{n-1}} \\\\\n",
399 | "\\text{impute mean for values } k+1 \\text{ through } n \\Rightarrow s &=& \\sqrt{\\frac{\\sum_{i=1}^k(x_i - \\bar{x})^2}{n-1} + \\frac{\\sum_{i=k+1}^n(\\bar{x} - \\bar{x})^2}{n-1}} \\\\\n",
400 | "&=& \\sqrt{\\frac{\\sum_{i=1}^k(x_i - \\bar{x})^2}{n-1}} \\\\\n",
401 | "&\\Rightarrow& \\text{the denominator increases but numerator remains fixed} \\\\\n",
402 | "&\\Rightarrow& \\text{the sample standard deviation is underestimated} \\\\\n",
403 | "&\\Rightarrow& \\text{confidence intervals relying on the mean are narrower than they should be}\n",
404 | "\\end{eqnarray*}\n",
405 | "$$"
406 | ]
407 | },
408 | {
409 | "cell_type": "markdown",
410 | "metadata": {},
411 | "source": [
412 | "##### Median Imputation"
413 | ]
414 | },
415 | {
416 | "cell_type": "code",
417 | "execution_count": null,
418 | "metadata": {},
419 | "outputs": [],
420 | "source": [
421 | "def impute_median(df, impute_column, missingness_column):\n",
422 | " \"\"\"\n",
423 | " Imputes median for any value where data is marked missing.\n",
424 | " \n",
425 | " User must specify:\n",
426 | " df = the pandas DataFrame the user wants to read in for analysis\n",
427 | " impute_column = the name of the column in df that is missing\n",
428 | " missingness_column = the name of the missingness indicator column\n",
429 | " \n",
430 | " Function returns:\n",
431 | " median_impute = a column with the median imputed for any missing value.\n",
432 | " \"\"\"\n",
433 | " median_impute = [df.loc[i,impute_column] if df.loc[i,missingness_column] == 0 else np.median(df[impute_column][df[missingness_column] == 0]) for i in range(df.shape[0])]\n",
434 | " \n",
435 | " return median_impute"
436 | ]
437 | },
438 | {
439 | "cell_type": "code",
440 | "execution_count": null,
441 | "metadata": {},
442 | "outputs": [],
443 | "source": [
444 | "df['age_median_imputed'] = impute_median(df, 'age', 'age_missingness')"
445 | ]
446 | },
447 | {
448 | "cell_type": "code",
449 | "execution_count": null,
450 | "metadata": {},
451 | "outputs": [],
452 | "source": [
453 | "compare_histograms(df = df,\n",
454 | " imputed_column = 'age_median_imputed',\n",
455 | " original_column = 'age',\n",
456 | " missingness_column = 'age_missingness',\n",
457 | " x_label = 'Age',\n",
458 | " y_label = 'Frequency')"
459 | ]
460 | },
461 | {
462 | "cell_type": "markdown",
463 | "metadata": {},
464 | "source": [
465 | "##### Mode Imputation"
466 | ]
467 | },
468 | {
469 | "cell_type": "code",
470 | "execution_count": null,
471 | "metadata": {},
472 | "outputs": [],
473 | "source": [
474 | "def impute_mode(df, impute_column, missingness_column):\n",
475 | " \"\"\"\n",
476 | " Imputes mode for any value where data is marked missing.\n",
477 | " \n",
478 | " User must specify:\n",
479 | " df = the pandas DataFrame the user wants to read in for analysis\n",
480 | " impute_column = the name of the column in df that is missing\n",
481 | " missingness_column = the name of the missingness indicator column\n",
482 | " \n",
483 | " Function returns:\n",
484 | " mode_impute = a column with the mode imputed for any missing value.\n",
485 | " \"\"\"\n",
486 | " mode_impute = [df.loc[i,impute_column] if df.loc[i,missingness_column] == 0 else stats.mode(df[impute_column][df[missingness_column] == 0])[0][0] for i in range(df.shape[0])]\n",
487 | " \n",
488 | " return mode_impute"
489 | ]
490 | },
491 | {
492 | "cell_type": "code",
493 | "execution_count": null,
494 | "metadata": {},
495 | "outputs": [],
496 | "source": [
497 | "df['age_mode_imputed'] = impute_mode(df, 'age', 'age_missingness')"
498 | ]
499 | },
500 | {
501 | "cell_type": "code",
502 | "execution_count": null,
503 | "metadata": {},
504 | "outputs": [],
505 | "source": [
506 | "compare_histograms(df = df,\n",
507 | " imputed_column = 'age_mode_imputed',\n",
508 | " original_column = 'age',\n",
509 | " missingness_column = 'age_missingness',\n",
510 | " x_label = 'Age',\n",
511 | " y_label = 'Frequency')"
512 | ]
513 | },
514 | {
515 | "cell_type": "markdown",
516 | "metadata": {},
517 | "source": [
518 | "##### Regression Imputation"
519 | ]
520 | },
521 | {
522 | "cell_type": "code",
523 | "execution_count": null,
524 | "metadata": {},
525 | "outputs": [],
526 | "source": [
527 | "def regression_imputation(df, impute_column, X_columns, missingness_column, regression = 'linear'):\n",
528 | " \"\"\"\n",
529 | " Fits regression line to observed data, then imputes regression prediction\n",
530 | " for any value where data is marked missing.\n",
531 | " \n",
532 | " User must specify:\n",
533 | " df = the pandas DataFrame the user wants to read in for analysis\n",
534 | " impute_column = the name of the column in df that is missing\n",
535 | " X_columns = the names of the columns used as independent variables\n",
536 | " to impute the missing value\n",
537 | " missingness_column = the name of the missingness indicator column\n",
538 | " regression = the type of regression to run; only supports 'linear'\n",
539 | " for LinearRegression and 'logistic' for LogisticRegression\n",
540 | " \n",
541 | " Function returns:\n",
542 | " regression_impute = a column with the regression value imputed for any missing value.\n",
543 | " \n",
544 | " NOTE: Only set up to do linear or logistic regression.\n",
545 | " \"\"\"\n",
546 | " \n",
547 | " if regression == 'linear':\n",
548 | " model = LinearRegression()\n",
549 | " elif regression == 'logistic':\n",
550 | " model = LogisticRegression()\n",
551 | " \n",
552 | " model.fit(df[X_columns][df[missingness_column] == 0], df[impute_column][df[missingness_column] == 0])\n",
553 | " \n",
554 | " regression_impute = [df.loc[i,impute_column] if df.loc[i,missingness_column] == 0\n",
555 | " else model.predict(pd.DataFrame(df.loc[i,X_columns]).T)[0]\n",
556 | " for i in range(df.shape[0])]\n",
557 | " \n",
558 | " return regression_impute"
559 | ]
560 | },
561 | {
562 | "cell_type": "code",
563 | "execution_count": null,
564 | "metadata": {},
565 | "outputs": [],
566 | "source": [
567 | "df['age_regression_imputed'] = regression_imputation(df, 'age', ['children', 'partnered', 'income'], 'age_missingness')"
568 | ]
569 | },
570 | {
571 | "cell_type": "code",
572 | "execution_count": null,
573 | "metadata": {},
574 | "outputs": [],
575 | "source": [
576 | "compare_histograms(df = df,\n",
577 | " imputed_column = 'age_regression_imputed',\n",
578 | " original_column = 'age',\n",
579 | " missingness_column = 'age_missingness',\n",
580 | " x_label = 'Age',\n",
581 | " y_label = 'Frequency')"
582 | ]
583 | },
584 | {
585 | "cell_type": "code",
586 | "execution_count": null,
587 | "metadata": {},
588 | "outputs": [],
589 | "source": [
590 | "np.std(df['age_regression_imputed'], ddof = 1)"
591 | ]
592 | },
593 | {
594 | "cell_type": "code",
595 | "execution_count": null,
596 | "metadata": {},
597 | "outputs": [],
598 | "source": [
599 | "np.std(df['age'], ddof = 1)"
600 | ]
601 | },
602 | {
603 | "cell_type": "markdown",
604 | "metadata": {},
605 | "source": [
606 | "### Work in progress:"
607 | ]
608 | },
609 | {
610 | "cell_type": "code",
611 | "execution_count": null,
612 | "metadata": {},
613 | "outputs": [],
614 | "source": [
615 | "def compare_scatterplots(df, imputed_column, original_X_column, original_Y_column, missingness_column, x_label, y_label):\n",
616 | " fig, (ax0, ax1) = plt.subplots(nrows = 1, ncols = 2, figsize = (20,8))\n",
617 | "\n",
618 | " # Set axes of scatterplots.\n",
619 | " x_rnge = max(df[original_X_column]) - min(df[original_X_column])\n",
620 | " xmin = min(df[original_X_column]) - 0.1 * x_rnge\n",
621 | " xmax = max(df[original_X_column]) + 0.1 * x_rnge\n",
622 | " y_rnge = max(df[original_Y_column]) - min(df[original_Y_column])\n",
623 | " ymin = min(df[original_Y_column]) - 0.1 * y_rnge\n",
624 | " ymax = max(df[original_Y_column]) + 0.1 * y_rnge\n",
625 | "\n",
626 | " ax0.set_xlim(xmin, xmax)\n",
627 | " ax0.set_ylim(ymin, ymax)\n",
628 | " ax1.set_xlim(xmin, xmax)\n",
629 | " ax1.set_ylim(ymin, ymax)\n",
630 | "\n",
631 | " # Set left labels.\n",
632 | " ax0.set_title('Real Scatterplot', position = (0,1), ha = 'left', fontsize = 25)\n",
633 | " ax0.set_xlabel(x_label, position = (0,0), ha = 'left', fontsize = 25, color = 'grey', alpha = 0.85)\n",
634 | " ax0.set_ylabel(y_label, position = (0,1), ha = 'right', va = 'top', fontsize = 25, rotation = 0, color = 'grey', alpha = 0.85)\n",
635 | " ax0.set_xticks([])\n",
636 | " ax0.set_yticks([])\n",
637 | "\n",
638 | " # Generate left scatterplot.\n",
639 | " ax0.scatter(df[original_X_column], df[original_Y_column], color = '#185fad', alpha = 0.5, label = 'True Values')\n",
640 | " ax0.legend(prop={'size': 15}, loc = 1)\n",
641 | " \n",
642 | " # Set right labels.\n",
643 | " ax1.set_title('Observed + Imputed Scatterplot', position = (0,1), ha = 'left', fontsize = 25)\n",
644 | " ax1.set_xlabel(x_label, position = (0,0), ha = 'left', fontsize = 25, color = 'grey', alpha = 0.85)\n",
645 | " ax1.set_ylabel(y_label, position = (0,1), ha = 'right', va = 'top', fontsize = 25, rotation = 0, color = 'grey', alpha = 0.85)\n",
646 | " ax1.set_xticks([])\n",
647 | " ax1.set_yticks([])\n",
648 | "\n",
649 | " # Generate right scatterplot.\n",
650 | " ax1.scatter(df[original_X_column][df[missingness_column] == 1], df[imputed_column][df[missingness_column] == 1], color = 'orange', alpha = 0.5, label = 'Imputed Values')\n",
651 | " ax1.scatter(df[original_X_column][df[missingness_column] == 0], df[imputed_column][df[missingness_column] == 0], color = '#185fad', alpha = 0.5, label = 'Observed Values')\n",
652 | "\n",
653 | " ax1.legend(prop={'size': 15}, loc = 1)\n",
654 | " \n",
655 | " plt.show();"
656 | ]
657 | },
658 | {
659 | "cell_type": "code",
660 | "execution_count": null,
661 | "metadata": {},
662 | "outputs": [],
663 | "source": [
664 | "compare_scatterplots(df = df,\n",
665 | " imputed_column = 'age_regression_imputed',\n",
666 | " original_X_column = 'children',\n",
667 | " original_Y_column = 'age',\n",
668 | " missingness_column = 'age_missingness',\n",
669 | " x_label = 'Children',\n",
670 | " y_label = 'Age')"
671 | ]
672 | }
673 | ],
674 | "metadata": {
675 | "kernelspec": {
676 | "display_name": "Python 3",
677 | "language": "python",
678 | "name": "python3"
679 | },
680 | "language_info": {
681 | "codemirror_mode": {
682 | "name": "ipython",
683 | "version": 3
684 | },
685 | "file_extension": ".py",
686 | "mimetype": "text/x-python",
687 | "name": "python",
688 | "nbconvert_exporter": "python",
689 | "pygments_lexer": "ipython3",
690 | "version": "3.8.3"
691 | }
692 | },
693 | "nbformat": 4,
694 | "nbformat_minor": 2
695 | }
696 |
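The notebook's mean-imputation derivation argues that filling missing values with the observed mean leaves the sum of squared deviations unchanged while the denominator grows, so the sample standard deviation shrinks. The sketch below checks this numerically. It is illustrative only: it generates its own uniform ages and an MCAR mask with `np.random.default_rng` rather than reusing the notebook's `df`, so the numbers will not match the notebook's plots.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated ages, with 30 of 100 values marked missing completely at random.
n, n_missing = 100, 30
age = np.round(rng.uniform(20, 60, size=n))
missing = np.zeros(n, dtype=bool)
missing[rng.choice(n, size=n_missing, replace=False)] = True

# Mean imputation: fill missing entries with the mean of the observed values.
observed = age[~missing]
imputed = age.copy()
imputed[missing] = observed.mean()

# Sample standard deviations (ddof = 1) before and after imputation.
sd_observed = np.std(observed, ddof=1)
sd_imputed = np.std(imputed, ddof=1)

# The derivation predicts the same numerator over a larger denominator:
# (k - 1) for the observed data versus (n - 1) after imputation.
k = n - n_missing
predicted_ratio = np.sqrt((k - 1) / (n - 1))

print(f"observed sd: {sd_observed:.3f}")
print(f"imputed sd:  {sd_imputed:.3f}")
print(f"ratio: {sd_imputed / sd_observed:.3f} (predicted {predicted_ratio:.3f})")
```

Because every imputed value equals the observed mean, the overall mean is unchanged and the ratio of the two standard deviations depends only on how many values were imputed, not on the data themselves.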
--------------------------------------------------------------------------------
/solutions/00_interactive_plot.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "#### To confirm that you have the latest versions of these packages, uncomment and run the following command.\n",
10 | "# !pip install numpy pandas matplotlib scikit-learn ipywidgets IPython missingno --upgrade\n",
11 | "\n",
12 | "# To generate and store data.\n",
13 | "import numpy as np\n",
14 | "import pandas as pd\n",
15 | "\n",
16 | "# To visualize data.\n",
17 | "import matplotlib.pyplot as plt\n",
18 | "\n",
19 | "# To fit linear regression model.\n",
20 | "from sklearn.linear_model import LinearRegression\n",
21 | "\n",
22 | "# To allow interactive plot.\n",
23 | "from ipywidgets import *\n",
24 | "from IPython.display import display\n",
25 | "\n",
26 | "# There is a SciPy issue that won't affect our work, but a warning exists\n",
27 | "# and an update is not imminent.\n",
28 | "import warnings\n",
29 | "warnings.filterwarnings(action=\"ignore\")\n",
30 | "\n",
31 | "# To render plots in the notebook.\n",
32 | "%matplotlib inline"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 2,
38 | "metadata": {},
39 | "outputs": [],
40 | "source": [
41 | "# Generate data and store in a dataframe.\n",
42 | "\n",
43 | "np.random.seed(42)\n",
44 | "\n",
45 | "age = np.random.uniform(20, 60, size = 100)\n",
46 | "income = 15000 + 750 * age + np.random.normal(0, 20000, size = 100)\n",
47 | "income = [i if i >= 0 else 0 for i in income]\n",
48 | "\n",
49 | "df = pd.DataFrame({'income':income,\n",
50 | " 'age': age})"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": 3,
56 | "metadata": {},
57 | "outputs": [],
58 | "source": [
59 | "# Create three functions to model missingness according to certain patterns.\n",
60 | "\n",
61 | "def create_mcar_column(df, missing_column = 'income', p_missing = 0.01, random_state = 42):\n",
62 | " \"\"\"\n",
63 | " Creates missingness indicator column, where data are MCAR (missing completely at random).\n",
64 | " \n",
65 | " User must specify:\n",
66 | " df = the pandas DataFrame the user wants to read in for analysis\n",
67 | " missing_column = the name of the column in df that is missing\n",
68 | " p_missing = the proportion of observations that are missing\n",
69 | " \n",
70 | " Function returns:\n",
71 | " mcar_column = a column that indicates whether data are missing, assuming MCAR\n",
72 | " \"\"\"\n",
73 | " np.random.seed(random_state)\n",
74 | " \n",
75 | " mcar_indices = [df.sample(n = 1).index[0] for i in range(round(p_missing * df.shape[0]))]\n",
76 | " \n",
77 | " while len(set(mcar_indices)) < round(p_missing * df.shape[0]):\n",
78 | " mcar_indices.append(df.sample(n = 1).index[0])\n",
79 | " \n",
80 | " mcar_column = [1 if i in mcar_indices else 0 for i in range(df.shape[0])]\n",
81 | " \n",
82 | " return mcar_column\n",
83 | "\n",
84 | "def create_mar_column(df, missing_column = 'income', depends_on = 'age', method = 'linear', p_missing = 0.01, random_state = 42):\n",
85 | " \"\"\"\n",
86 | " Creates missingness indicator column, where data are MAR (missing at random).\n",
87 | " \n",
88 | " User must specify:\n",
89 | " df = the pandas DataFrame the user wants to read in for analysis\n",
90 | " missing_column = the name of the column in df that is missing\n",
91 | " depends_on = the name of the column in df which affects the missingness\n",
92 | " method = 'linear' or 'quadratic'\n",
93 | " - 'linear' means the probability of missingness is linearly related to the depends_on variable\n",
94 | " - 'quadratic' means the probability of missingness is quadratically related to the depends_on variable\n",
95 | " p_missing = the proportion of observations that are missing\n",
96 | " \n",
97 | " Function returns:\n",
98 | " mar_column = a column that indicates whether data are missing, assuming MAR\n",
99 | " \"\"\"\n",
100 | " np.random.seed(random_state)\n",
101 | " \n",
102 | " if method == 'linear':\n",
103 | " mar_indices = [df.sample(n = 1, weights = df[depends_on] ** -1).index[0] for i in range(round(p_missing * df.shape[0]))]\n",
104 | "\n",
105 | " while len(set(mar_indices)) < round(p_missing * df.shape[0]):\n",
106 | " mar_indices.append(df.sample(n = 1, weights = df[depends_on] ** -1).index[0])\n",
107 | " \n",
108 | " elif method == 'quadratic':\n",
109 | " mar_indices = [df.sample(n = 1, weights = df[depends_on] ** -2).index[0] for i in range(round(p_missing * df.shape[0]))]\n",
110 | "\n",
111 | " while len(set(mar_indices)) < round(p_missing * df.shape[0]):\n",
112 | " mar_indices.append(df.sample(n = 1, weights = df[depends_on] ** -2).index[0])\n",
113 | "\n",
114 | " mar_column = [1 if i in mar_indices else 0 for i in df.index]\n",
115 | " \n",
116 | " return mar_column\n",
117 | "\n",
118 | "def create_nmar_column(df, missing_column = 'income', method = 'linear', p_missing = 0.01, random_state = 42):\n",
119 | " \"\"\"\n",
120 | " Creates missingness indicator column, where data are NMAR (not missing at random).\n",
121 | " \n",
122 | " User must specify:\n",
123 | " df = the pandas DataFrame the user wants to read in for analysis\n",
124 | " missing_column = the name of the column in df that is missing\n",
125 | " method = 'linear' or 'quadratic'\n",
126 | " - 'linear' means the probability of missingness is linearly related to the depends_on variable\n",
127 | " - 'quadratic' means the probability of missingness is quadratically related to the depends_on variable\n",
128 | " p_missing = the proportion of observations that are missing\n",
129 | " \n",
130 | " Function returns:\n",
131 | " nmar_column = a column that indicates whether data are missing, assuming NMAR\n",
132 | " \"\"\"\n",
133 | " np.random.seed(random_state)\n",
134 | " \n",
135 | " if method == 'linear':\n",
136 | " nmar_indices = [df.sample(n = 1, weights = df[missing_column] ** -1).index[0] for i in range(round(p_missing * df.shape[0]))]\n",
137 | "\n",
138 | " while len(set(nmar_indices)) < round(p_missing * df.shape[0]):\n",
139 | " nmar_indices.append(df.sample(n = 1, weights = df[missing_column] ** -1).index[0])\n",
140 | " \n",
141 | " elif method == 'quadratic':\n",
142 | " nmar_indices = [df.sample(n = 1, weights = df[missing_column] ** -2).index[0] for i in range(round(p_missing * df.shape[0]))]\n",
143 | "\n",
144 | " while len(set(nmar_indices)) < round(p_missing * df.shape[0]):\n",
145 | " nmar_indices.append(df.sample(n = 1, weights = df[missing_column] ** -2).index[0])\n",
146 | " \n",
147 | " nmar_column = [1 if i in nmar_indices else 0 for i in df.index]\n",
148 | " \n",
149 | " return nmar_column"
150 | ]
151 | },
152 | {
153 | "cell_type": "code",
154 | "execution_count": 4,
155 | "metadata": {},
156 | "outputs": [],
157 | "source": [
158 | "def generate_scatterplot(p_missing, missing_type, method = 'linear', missing_column = 'income', depends_on = 'age'):\n",
159 | " # Generate one plot.\n",
160 | " fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (16,9))\n",
161 | "\n",
162 | " # Set labels and axes.\n",
163 | " ax.set_xlabel(\"Age\", position = (0,0), ha = 'left', fontsize = 25, color = 'grey', alpha = 0.85)\n",
164 | " ax.set_ylabel(\"Income\", position = (0,1), ha = 'right', va = 'top', fontsize = 25, rotation = 0, color = 'grey', alpha = 0.85)\n",
165 | " \n",
166 | " ax.set_ylim([-1000, 100000])\n",
167 | " \n",
168 | " # Generate data with proportion p missing.\n",
169 | " if missing_type == 'MCAR':\n",
170 | " df['missingness'] = create_mcar_column(df,\n",
171 | " missing_column = missing_column,\n",
172 | " p_missing = p_missing)\n",
173 | " elif missing_type == 'MAR':\n",
174 | " df['missingness'] = create_mar_column(df,\n",
175 | " missing_column = missing_column,\n",
176 | " depends_on = depends_on,\n",
177 | " method = method,\n",
178 | " p_missing = p_missing)\n",
179 | " \n",
180 | " elif missing_type == 'NMAR':\n",
181 | " df['missingness'] = create_nmar_column(df,\n",
182 | " missing_column = missing_column,\n",
183 | " method = method,\n",
184 | " p_missing = p_missing)\n",
185 | " \n",
186 | " # Generate scatterplot.\n",
187 | " ax.scatter(df['age'][df['missingness'] == 0], df['income'][df['missingness'] == 0], s = 35, color = '#185fad', alpha = 0.75, label = 'Observed')\n",
188 | " ax.scatter(df['age'][df['missingness'] == 1], df['income'][df['missingness'] == 1], s = 35, color = 'grey', alpha = 0.25, label = '')\n",
189 | " \n",
190 | " # Plot the 'true' line and the line of best fit estimated from the observed values only.\n",
191 | " x = np.linspace(20, 60)\n",
192 | " ax.plot(x, 15000 + 750 * x, c = 'orange', alpha = 0.7, label = '\"True\" Line', lw = 3)\n",
193 | " model = LinearRegression().fit(df[['age']][df['missingness'] == 0], df['income'][df['missingness'] == 0])\n",
194 | " ax.plot(x, model.intercept_ + model.coef_ * x, c = '#185fad', alpha = 0.7, label='Observed Line', lw = 3)\n",
195 | "\n",
196 | " # Generate title and legend.\n",
197 | " ax.set_title(f'Type of Missing Data: {missing_type} \\nProportion Missing: {p_missing}', position = (0,1), ha = 'left', fontsize = 25)\n",
198 | " ax.legend(prop={'size': 20}, loc = 2)\n",
199 | " \n",
200 | " ax.set_xticks([])\n",
201 | " ax.set_yticks([])\n",
202 | " plt.show();"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": 5,
208 | "metadata": {},
209 | "outputs": [
210 | {
211 | "data": {
212 | "image/png": "\n",
213 | "text/plain": [
214 | ""
215 | ]
216 | },
217 | "metadata": {},
218 | "output_type": "display_data"
219 | }
220 | ],
221 | "source": [
222 | "generate_scatterplot(p_missing=0.1, missing_type = 'MCAR', method = 'linear')"
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": 6,
228 | "metadata": {
229 | "scrolled": false
230 | },
231 | "outputs": [
232 | {
233 | "data": {
234 | "application/vnd.jupyter.widget-view+json": {
235 | "model_id": "2370061c73b749058d6d52b202778572",
236 | "version_major": 2,
237 | "version_minor": 0
238 | },
239 | "text/plain": [
240 | "interactive(children=(FloatSlider(value=0.0, description='p_missing', max=0.99, step=0.05), Dropdown(descripti…"
241 | ]
242 | },
243 | "metadata": {},
244 | "output_type": "display_data"
245 | }
246 | ],
247 | "source": [
248 | "def plot_interact(p_missing = 0, missing_type = 'MCAR', method = 'linear'):\n",
249 | " generate_scatterplot(p_missing, missing_type, method, missing_column = 'income', depends_on = 'age')\n",
250 | " \n",
251 | "interact(plot_interact, p_missing = (0, 0.99, 0.05), missing_type = ['MCAR','MAR','NMAR'], method = ['linear','quadratic']);"
252 | ]
253 | }
254 | ],
255 | "metadata": {
256 | "kernelspec": {
257 | "display_name": "Python 3",
258 | "language": "python",
259 | "name": "python3"
260 | },
261 | "language_info": {
262 | "codemirror_mode": {
263 | "name": "ipython",
264 | "version": 3
265 | },
266 | "file_extension": ".py",
267 | "mimetype": "text/x-python",
268 | "name": "python",
269 | "nbconvert_exporter": "python",
270 | "pygments_lexer": "ipython3",
271 | "version": "3.8.3"
272 | }
273 | },
274 | "nbformat": 4,
275 | "nbformat_minor": 2
276 | }
277 |
--------------------------------------------------------------------------------