├── Fitting Statistical Models to Data with Python ├── Week4 │ └── Readme.md ├── Week3 │ ├── Readme.md │ ├── utf-8''w3_assessment.ipynb │ └── utf-8''week3_nhanes_practice.ipynb ├── Week2 │ ├── Readme.md │ ├── Linear Regression Overview.pdf │ ├── Logistic Regression Overview.pdf │ ├── utf-8''week2_nhanes_practice.ipynb │ └── utf-8''week2_nhanes_condensed_tutorial.ipynb └── Readme.md ├── Understanding & Visualizing Data with Python ├── Week1 │ ├── Readme.md │ ├── utf-8''Cartwheeldata.csv │ ├── utf-8''data_types.ipynb │ ├── utf-8''introduction_jupyter.ipynb │ ├── utf-8''week1_python_resources.ipynb │ ├── utf-8''libraries_data_management.ipynb │ └── utf-8''nhanes_data_basics.ipynb ├── Week2 │ ├── Readme.md │ ├── utf-8''Cartwheeldata.csv │ └── utf-8''nhanes_univariate_practice.ipynb ├── Week3 │ ├── Readme.md │ ├── utf-8''Cartwheeldata.csv │ ├── utf-8''Multivariate_Distributions.ipynb │ ├── utf-8''Unit_Testing.ipynb │ └── utf-8''nhanes_multivariate_practice.ipynb ├── Week4 │ └── Readme.md └── Readme.md ├── README.md └── Inferential Statistical Analysis With Python ├── Week1 ├── Readme.md ├── utf-8''functions_lambdas_help.ipynb ├── utf-8''week1_assessment.ipynb └── utf-8''listsvsarrays.ipynb ├── Week2 ├── Readme.md └── utf-8''intro_confidenceintervals.ipynb ├── Week3 └── Readme.md ├── Week4 ├── Readme.md ├── utf-8''nhanes_confidence_intervals_practice.ipynb └── utf-8''nhanes_hypothesis_test_practice.ipynb └── ReadMe.md /Fitting Statistical Models to Data with Python/Week4/Readme.md: -------------------------------------------------------------------------------- 1 | Material For Week 4 2 | -------------------------------------------------------------------------------- /Understanding & Visualizing Data with Python/Week1/Readme.md: -------------------------------------------------------------------------------- 1 | Material for Week 1 2 | -------------------------------------------------------------------------------- /Understanding & Visualizing Data with 
Python/Week2/Readme.md: -------------------------------------------------------------------------------- 1 | Material for Week 2 2 | -------------------------------------------------------------------------------- /Understanding & Visualizing Data with Python/Week3/Readme.md: -------------------------------------------------------------------------------- 1 | Material For Week 3 2 | -------------------------------------------------------------------------------- /Understanding & Visualizing Data with Python/Week4/Readme.md: -------------------------------------------------------------------------------- 1 | Material For Week 4 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Statistics-with-Python 2 | Material for Statistics With Python Specialization 3 | -------------------------------------------------------------------------------- /Fitting Statistical Models to Data with Python/Week3/Readme.md: -------------------------------------------------------------------------------- 1 | Contains the files for Week3 2 | -------------------------------------------------------------------------------- /Inferential Statistical Analysis With Python/Week1/Readme.md: -------------------------------------------------------------------------------- 1 | Week1 Resources for this course 2 | -------------------------------------------------------------------------------- /Inferential Statistical Analysis With Python/Week2/Readme.md: -------------------------------------------------------------------------------- 1 | Week2 Resources for this course 2 | -------------------------------------------------------------------------------- /Inferential Statistical Analysis With Python/Week3/Readme.md: -------------------------------------------------------------------------------- 1 | Week 3 Modules of this course 2 | 
-------------------------------------------------------------------------------- /Inferential Statistical Analysis With Python/Week4/Readme.md: -------------------------------------------------------------------------------- 1 | Week4 Resources for this course 2 | -------------------------------------------------------------------------------- /Understanding & Visualizing Data with Python/Readme.md: -------------------------------------------------------------------------------- 1 | Material For This Course for All Weeks 2 | -------------------------------------------------------------------------------- /Fitting Statistical Models to Data with Python/Week2/Readme.md: -------------------------------------------------------------------------------- 1 | Material for Week 2 of this course 2 | -------------------------------------------------------------------------------- /Fitting Statistical Models to Data with Python/Readme.md: -------------------------------------------------------------------------------- 1 | Contains code and course material for this course 2 | -------------------------------------------------------------------------------- /Inferential Statistical Analysis With Python/ReadMe.md: -------------------------------------------------------------------------------- 1 | This folder contains the necessary files and datasets for the 2nd course in the Statistics with Python specialization 2 | -------------------------------------------------------------------------------- /Fitting Statistical Models to Data with Python/Week2/Linear Regression Overview.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RohanMathur17/Statistics-with-Python/HEAD/Fitting Statistical Models to Data with Python/Week2/Linear Regression Overview.pdf -------------------------------------------------------------------------------- /Fitting Statistical Models to Data with Python/Week2/Logistic Regression Overview.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/RohanMathur17/Statistics-with-Python/HEAD/Fitting Statistical Models to Data with Python/Week2/Logistic Regression Overview.pdf -------------------------------------------------------------------------------- /Understanding & Visualizing Data with Python/Week1/utf-8''Cartwheeldata.csv: -------------------------------------------------------------------------------- 1 | ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score 2 | 1,56,F,1,Y,1,62,61,79,Y,1,7 3 | 2,26,F,1,Y,1,62,60,70,Y,1,8 4 | 3,33,F,1,Y,1,66,64,85,Y,1,7 5 | 4,39,F,1,N,0,64,63,87,Y,1,10 6 | 5,27,M,2,N,0,73,75,72,N,0,4 7 | 6,24,M,2,N,0,75,71,81,N,0,3 8 | 7,28,M,2,N,0,75,76,107,Y,1,10 9 | 8,22,F,1,N,0,65,62,98,Y,1,9 10 | 9,29,M,2,Y,1,74,73,106,N,0,5 11 | 10,33,F,1,Y,1,63,60,65,Y,1,8 12 | 11,30,M,2,Y,1,69.5,66,96,Y,1,6 13 | 12,28,F,1,Y,1,62.75,58,79,Y,1,10 14 | 13,25,F,1,Y,1,65,64.5,92,Y,1,6 15 | 14,23,F,1,N,0,61.5,57.5,66,Y,1,4 16 | 15,31,M,2,Y,1,73,74,72,Y,1,9 17 | 16,26,M,2,Y,1,71,72,115,Y,1,6 18 | 17,26,F,1,N,0,61.5,59.5,90,N,0,10 19 | 18,27,M,2,N,0,66,66,74,Y,1,5 20 | 19,23,M,2,Y,1,70,69,64,Y,1,3 21 | 20,24,F,1,Y,1,68,66,85,Y,1,8 22 | 21,23,M,2,Y,1,69,67,66,N,0,2 23 | 22,29,M,2,N,0,71,70,101,Y,1,8 24 | 23,25,M,2,N,0,70,68,82,Y,1,4 25 | 24,26,M,2,N,0,69,71,63,Y,1,5 26 | 25,23,F,1,Y,1,65,63,67,N,0,3 -------------------------------------------------------------------------------- /Understanding & Visualizing Data with Python/Week2/utf-8''Cartwheeldata.csv: -------------------------------------------------------------------------------- 1 | ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score 2 | 1,56,F,1,Y,1,62,61,79,Y,1,7 3 | 2,26,F,1,Y,1,62,60,70,Y,1,8 4 | 3,33,F,1,Y,1,66,64,85,Y,1,7 5 | 4,39,F,1,N,0,64,63,87,Y,1,10 6 | 5,27,M,2,N,0,73,75,72,N,0,4 7 | 6,24,M,2,N,0,75,71,81,N,0,3 8 | 
7,28,M,2,N,0,75,76,107,Y,1,10 9 | 8,22,F,1,N,0,65,62,98,Y,1,9 10 | 9,29,M,2,Y,1,74,73,106,N,0,5 11 | 10,33,F,1,Y,1,63,60,65,Y,1,8 12 | 11,30,M,2,Y,1,69.5,66,96,Y,1,6 13 | 12,28,F,1,Y,1,62.75,58,79,Y,1,10 14 | 13,25,F,1,Y,1,65,64.5,92,Y,1,6 15 | 14,23,F,1,N,0,61.5,57.5,66,Y,1,4 16 | 15,31,M,2,Y,1,73,74,72,Y,1,9 17 | 16,26,M,2,Y,1,71,72,115,Y,1,6 18 | 17,26,F,1,N,0,61.5,59.5,90,N,0,10 19 | 18,27,M,2,N,0,66,66,74,Y,1,5 20 | 19,23,M,2,Y,1,70,69,64,Y,1,3 21 | 20,24,F,1,Y,1,68,66,85,Y,1,8 22 | 21,23,M,2,Y,1,69,67,66,N,0,2 23 | 22,29,M,2,N,0,71,70,101,Y,1,8 24 | 23,25,M,2,N,0,70,68,82,Y,1,4 25 | 24,26,M,2,N,0,69,71,63,Y,1,5 26 | 25,23,F,1,Y,1,65,63,67,N,0,3 -------------------------------------------------------------------------------- /Understanding & Visualizing Data with Python/Week3/utf-8''Cartwheeldata.csv: -------------------------------------------------------------------------------- 1 | ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score 2 | 1,56,F,1,Y,1,62,61,79,Y,1,7 3 | 2,26,F,1,Y,1,62,60,70,Y,1,8 4 | 3,33,F,1,Y,1,66,64,85,Y,1,7 5 | 4,39,F,1,N,0,64,63,87,Y,1,10 6 | 5,27,M,2,N,0,73,75,72,N,0,4 7 | 6,24,M,2,N,0,75,71,81,N,0,3 8 | 7,28,M,2,N,0,75,76,107,Y,1,10 9 | 8,22,F,1,N,0,65,62,98,Y,1,9 10 | 9,29,M,2,Y,1,74,73,106,N,0,5 11 | 10,33,F,1,Y,1,63,60,65,Y,1,8 12 | 11,30,M,2,Y,1,69.5,66,96,Y,1,6 13 | 12,28,F,1,Y,1,62.75,58,79,Y,1,10 14 | 13,25,F,1,Y,1,65,64.5,92,Y,1,6 15 | 14,23,F,1,N,0,61.5,57.5,66,Y,1,4 16 | 15,31,M,2,Y,1,73,74,72,Y,1,9 17 | 16,26,M,2,Y,1,71,72,115,Y,1,6 18 | 17,26,F,1,N,0,61.5,59.5,90,N,0,10 19 | 18,27,M,2,N,0,66,66,74,Y,1,5 20 | 19,23,M,2,Y,1,70,69,64,Y,1,3 21 | 20,24,F,1,Y,1,68,66,85,Y,1,8 22 | 21,23,M,2,Y,1,69,67,66,N,0,2 23 | 22,29,M,2,N,0,71,70,101,Y,1,8 24 | 23,25,M,2,N,0,70,68,82,Y,1,4 25 | 24,26,M,2,N,0,69,71,63,Y,1,5 26 | 25,23,F,1,Y,1,65,63,67,N,0,3 -------------------------------------------------------------------------------- /Understanding & Visualizing Data with 
Python/Week3/utf-8''Multivariate_Distributions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Multivariate Distributions in Python\n", 8 | "\n", 9 | "Sometimes we can get a lot of information about how two variables (or more) relate if we plot them together. This tutorial aims to show how plotting two variables together can give us information that plotting each one separately may miss.\n", 10 | "\n" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "# import the packages we are going to be using\n", 20 | "import numpy as np # for getting our distribution\n", 21 | "import matplotlib.pyplot as plt # for plotting\n", 22 | "import seaborn as sns; sns.set() # For a different plotting theme\n", 23 | "\n", 24 | "# Don't worry so much about what rho is doing here\n", 25 | "# Just know if we have a rho of 1 then we will get a perfectly\n", 26 | "# upward sloping line, and if we have a rho of -1, we will get \n", 27 | "# a perfectly downward sloping line. 
A rho of 0 will \n", 28 | "# get us a 'cloud' of points\n", 29 | "r = 1\n", 30 | "\n", 31 | "# Don't worry so much about the following three lines of code for now\n", 32 | "# this is just getting the data for us to plot\n", 33 | "mean = [15, 5]\n", 34 | "cov = [[1, r], [r, 1]]\n", 35 | "x, y = np.random.multivariate_normal(mean, cov, 400).T\n", 36 | "\n", 37 | "# Adjust the figure size\n", 38 | "plt.figure(figsize=(10,5))\n", 39 | "\n", 40 | "# Plot the histograms of X and Y next to each other\n", 41 | "plt.subplot(1,2,1)\n", 42 | "plt.hist(x = x, bins = 15)\n", 43 | "plt.title(\"X\")\n", 44 | "\n", 45 | "plt.subplot(1,2,2)\n", 46 | "plt.hist(x = y, bins = 15)\n", 47 | "plt.title(\"Y\")\n", 48 | "\n", 49 | "plt.show()" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "# Plot the data\n", 59 | "plt.figure(figsize=(10,10))\n", 60 | "plt.subplot(2,2,2)\n", 61 | "plt.scatter(x = x, y = y)\n", 62 | "plt.title(\"Joint Distribution of X and Y\")\n", 63 | "\n", 64 | "# Plot the Marginal X Distribution\n", 65 | "plt.subplot(2,2,4)\n", 66 | "plt.hist(x = x, bins = 15)\n", 67 | "plt.title(\"Marginal Distribution of X\")\n", 68 | "\n", 69 | "\n", 70 | "# Plot the Marginal Y Distribution\n", 71 | "plt.subplot(2,2,1)\n", 72 | "plt.hist(x = y, orientation = \"horizontal\", bins = 15)\n", 73 | "plt.title(\"Marginal Distribution of Y\")\n", 74 | "\n", 75 | "# Show the plots\n", 76 | "plt.show()" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [] 85 | } 86 | ], 87 | "metadata": { 88 | "kernelspec": { 89 | "display_name": "Python 2", 90 | "language": "python", 91 | "name": "python2" 92 | }, 93 | "language_info": { 94 | "codemirror_mode": { 95 | "name": "ipython", 96 | "version": 2 97 | }, 98 | "file_extension": ".py", 99 | "mimetype": "text/x-python", 100 | "name": "python", 101 | "nbconvert_exporter": 
"python", 102 | "pygments_lexer": "ipython2", 103 | "version": "2.7.15" 104 | } 105 | }, 106 | "nbformat": 4, 107 | "nbformat_minor": 2 108 | } 109 | -------------------------------------------------------------------------------- /Fitting Statistical Models to Data with Python/Week3/utf-8''w3_assessment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "QQdjVZNGikFv" 8 | }, 9 | "source": [ 10 | "# Week 3 Assessment\n", 11 | "\n", 12 | "This Jupyter Notebook is auxiliary to the following assessment in this week. To complete this assessment, you will complete the 5 questions outlined in this document and use the output from the Python cells to answer them.\n", 13 | "\n", 14 | "Run the following cell to initialize your environment and begin the assessment." 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": { 21 | "colab": {}, 22 | "colab_type": "code", 23 | "id": "pfrQgsCOikFw" 24 | }, 25 | "outputs": [], 26 | "source": [ 27 | "#### RUN THIS\n", 28 | "\n", 29 | "import warnings\n", 30 | "warnings.filterwarnings('ignore')\n", 31 | "\n", 32 | "import numpy as np\n", 33 | "import statsmodels.api as sm\n", 34 | "import pandas as pd \n", 35 | "\n", 36 | "url = \"nhanes_2015_2016.csv\"\n", 37 | "da = pd.read_csv(url)\n", 38 | "\n", 39 | "# Drop unused columns, drop rows with any missing values.\n", 40 | "vars = [\"BPXSY1\", \"RIDAGEYR\", \"RIAGENDR\", \"RIDRETH1\", \"DMDEDUC2\", \"BMXBMI\",\n", 41 | " \"SMQ020\", \"SDMVSTRA\", \"SDMVPSU\"]\n", 42 | "da = da[vars].dropna()\n", 43 | "\n", 44 | "da[\"group\"] = 10*da.SDMVSTRA + da.SDMVPSU\n", 45 | "\n", 46 | "da[\"smq\"] = da.SMQ020.replace({2: 0, 7: np.nan, 9: np.nan})\n", 47 | "\n", 48 | "np.random.seed(123)" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": { 54 | "colab_type": "text", 55 | "id": "eSA_KJQvikF1" 56 | 
}, 57 | "source": [ 58 | "#### Question 1: What is clustered data? (You'll answer this question within the quiz that follows this notebook)\n", 59 | "\n", 60 | "\n", 61 | "#### Question 2: (You'll answer this question within the quiz that follows this notebook)\n", 62 | "\n", 63 | "Utilize the following output for this question:" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 2, 69 | "metadata": { 70 | "colab": {}, 71 | "colab_type": "code", 72 | "id": "2oU4HEnPikF2" 73 | }, 74 | "outputs": [ 75 | { 76 | "name": "stdout", 77 | "output_type": "stream", 78 | "text": [ 79 | "BPXSY1 The correlation between two observations in the same cluster is 0.030\n", 80 | "SDMVSTRA The correlation between two observations in the same cluster is 0.959\n", 81 | "RIDAGEYR The correlation between two observations in the same cluster is 0.035\n", 82 | "BMXBMI The correlation between two observations in the same cluster is 0.039\n", 83 | "smq The correlation between two observations in the same cluster is 0.026\n" 84 | ] 85 | } 86 | ], 87 | "source": [ 88 | "for v in [\"BPXSY1\", \"SDMVSTRA\", \"RIDAGEYR\", \"BMXBMI\", \"smq\"]:\n", 89 | " model = sm.GEE.from_formula(v + \" ~ 1\", groups=\"group\",\n", 90 | " cov_struct=sm.cov_struct.Exchangeable(), data=da)\n", 91 | " result = model.fit()\n", 92 | " print(v, result.cov_struct.summary())" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": { 98 | "colab_type": "text", 99 | "id": "4zRZc-tOikF8" 100 | }, 101 | "source": [ 102 | "Which of the listed features has the highest correlation between two observations in the same cluster? 
" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": { 108 | "colab_type": "text", 109 | "id": "or7LyZNWikF9" 110 | }, 111 | "source": [ 112 | "#### Question 3: (You'll answer this question within the quiz that follows this notebook)\n", 113 | "\n", 114 | "What is true about multiple linear regression and marginal linear models when dependence is present in data?\n", 115 | "\n", 116 | "\n", 117 | "#### Question 4: (You'll answer this question within the quiz that follows this notebook)\n", 118 | "\n", 119 | "Multilevel models are expressed in terms of _____.\n", 120 | "\n", 121 | "\n", 122 | "#### Question 5: (You'll answer this question within the quiz that follows this notebook)\n", 123 | "\n", 124 | "Which of the following is NOT true regarding reasons why we fit marginal models?" 125 | ] 126 | } 127 | ], 128 | "metadata": { 129 | "colab": { 130 | "collapsed_sections": [], 131 | "name": "w3_assessment.ipynb", 132 | "provenance": [], 133 | "version": "0.3.2" 134 | }, 135 | "kernelspec": { 136 | "display_name": "Python 3", 137 | "language": "python", 138 | "name": "python3" 139 | }, 140 | "language_info": { 141 | "codemirror_mode": { 142 | "name": "ipython", 143 | "version": 3 144 | }, 145 | "file_extension": ".py", 146 | "mimetype": "text/x-python", 147 | "name": "python", 148 | "nbconvert_exporter": "python", 149 | "pygments_lexer": "ipython3", 150 | "version": "3.6.3" 151 | } 152 | }, 153 | "nbformat": 4, 154 | "nbformat_minor": 1 155 | } 156 | -------------------------------------------------------------------------------- /Understanding & Visualizing Data with Python/Week3/utf-8''Unit_Testing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Unit Testing\n", 8 | "While we will not cover the [unit testing library](https://docs.python.org/3/library/unittest.html) that python has, we wanted to introduce 
you to a simple way that you can test your code.\n", 9 | "\n", 10 | "Unit testing is important because it is the only way you can be sure that your code is doing what you think it is doing. \n", 11 | "\n", 12 | "Remember, just because there are no errors does not mean your code is correct." 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "metadata": { 19 | "collapsed": true 20 | }, 21 | "outputs": [], 22 | "source": [ 23 | "import numpy as np\n", 24 | "import pandas as pd\n", 25 | "import matplotlib.pyplot as plt\n", 26 | "pd.set_option('display.max_columns', 100) # Show up to 100 columns when displaying a DataFrame" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": null, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "# Load the NHANES 2015-2016 data from a local CSV file\n", 36 | "df = pd.read_csv(\"nhanes_2015_2016.csv\")\n", 37 | "df.index = range(1,df.shape[0]+1)" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "df.head()" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "### Goal\n", 54 | "We want to find the mean of the first 100 rows of 'BPXSY1' when 'RIDAGEYR' > 60" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "# One possible way of doing this is:\n", 64 | "pd.Series.mean(df[df.RIDAGEYR > 60].loc[range(0,100), 'BPXSY1']) \n", 65 | "# Depending on the pandas version, this emits a warning about missing labels or (pandas 1.0 and later) raises a KeyError" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "# test our code on only ten rows so we can easily check\n", 75 | "test = pd.DataFrame({'col1': np.repeat([3,1],5), 'col2': range(3,13)}, index=range(1,11))\n", 76 | "test" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | 
"metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "# The analogue of pd.Series.mean(df[df.RIDAGEYR > 60].loc[range(0,5), 'BPXSY1'])\n", 86 | "# should return 5 (the mean of 3, 4, 5, 6, 7)\n", 87 | "\n", 88 | "pd.Series.mean(test[test.col1 > 2].loc[range(0,5), 'col2'])" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "What went wrong?" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "test[test.col1 > 2].loc[range(0,5), 'col2']\n", 105 | "# 0 is not among the row index labels, because the index of test starts at 1. Older pandas versions fill this\n", 106 | "# label with NaN; pandas 1.0 and later raise a KeyError instead" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "# Using the .iloc method instead, we are correctly choosing the first 5 rows, regardless of their row labels\n", 116 | "test[test.col1 >2].iloc[range(0,5), 1]" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "pd.Series.mean(test[test.col1 >2].iloc[range(0,5), 1])" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "# We can compare what our real dataframe looks like with the incorrect and correct methods\n", 135 | "df[df.RIDAGEYR > 60].loc[range(0,5), :] # Filled with NaN whenever a row label does not meet the condition" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": null, 141 | "metadata": {}, 142 | "outputs": [], 143 | "source": [ 144 | "df[df.RIDAGEYR > 60].iloc[range(0,5), :] # Correctly picks the first five rows such that 'RIDAGEYR' > 60" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "# Applying the correct 
method to the original question about BPXSY1\n", 154 | "print(pd.Series.mean(df[df.RIDAGEYR > 60].iloc[range(0,100), 16]))\n", 155 | "\n", 156 | "# Another way to reference the BPXSY1 variable\n", 157 | "print(pd.Series.mean(df[df.RIDAGEYR > 60].iloc[range(0,100), df.columns.get_loc('BPXSY1')]))" 158 | ] 159 | } 160 | ], 161 | "metadata": { 162 | "kernelspec": { 163 | "display_name": "Python 3", 164 | "language": "python", 165 | "name": "python3" 166 | }, 167 | "language_info": { 168 | "codemirror_mode": { 169 | "name": "ipython", 170 | "version": 3 171 | }, 172 | "file_extension": ".py", 173 | "mimetype": "text/x-python", 174 | "name": "python", 175 | "nbconvert_exporter": "python", 176 | "pygments_lexer": "ipython3", 177 | "version": "3.6.3" 178 | } 179 | }, 180 | "nbformat": 4, 181 | "nbformat_minor": 2 182 | } 183 | -------------------------------------------------------------------------------- /Understanding & Visualizing Data with Python/Week1/utf-8''data_types.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import math\n", 10 | "import numpy as np" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "## Data Types in Python " 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "The following data types can be used in base python:\n", 25 | "* **boolean**\n", 26 | "* **integer**\n", 27 | "* **float**\n", 28 | "* **string**\n", 29 | "* **list**\n", 30 | "* **None**\n", 31 | "* long (Python 2 only)\n", 32 | "* complex\n", 33 | "* object\n", 34 | "\n", 35 | "We will only focus on the **bolded** ones" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "Let's connect these data types to the variable types we learned from the Variable Types video." 
43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "### Numerical or Quantitative\n", 50 | "* Discrete\n", 51 | " * Integer (int)\n", 52 | "* Continuous\n", 53 | " * Float (float)" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "type(-4)" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "type(np.mean([2, 3, 4, 5]))" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "type(3/5)" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "type(math.pi)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [ 98 | "type(4.0)" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "type(np.mean([math.pi, 3/5, 4.1]))" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "### Categorical or Qualitative\n", 115 | "* Nominal\n", 116 | " * Boolean (bool)\n", 117 | " * String (str)\n", 118 | " * None (NoneType)\n", 119 | "* Ordinal\n", 120 | " * Only defined by how you use the data\n", 121 | " * Often important when creating visuals" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": null, 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "# Boolean\n", 131 | "type(True)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "# Boolean\n", 141 | "if 6 < 5:\n", 142 | " print(\"Yes!\")" 143 | ] 144 | }, 145 | { 146 | 
"cell_type": "code", 147 | "execution_count": null, 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [ 151 | "# String\n", 152 | "type(\"math.pi\")" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "# None\n", 162 | "type(None)" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "metadata": {}, 169 | "outputs": [], 170 | "source": [ 171 | "# None\n", 172 | "x = None\n", 173 | "type(x)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": {}, 179 | "source": [ 180 | "### Lists\n", 181 | "A list can hold many types, so its category depends on how you use it. That being said, list elements are ordered by index, so there is always order information available (it just may or may not be useful to you)." 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": null, 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "# List\n", 191 | "my_list = [1, 1.1, \"This is a sentence\", None]\n", 192 | "for element in my_list:\n", 193 | " print(type(element))" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "metadata": {}, 200 | "outputs": [], 201 | "source": [ 202 | "np.mean(my_list)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "# List\n", 212 | "my_list = [1, 2, 3]\n", 213 | "for element in my_list:\n", 214 | " print(type(element))\n", 215 | "np.mean(my_list) # note that this outputs a float" 216 | ] 217 | } 218 | ], 219 | "metadata": { 220 | "kernelspec": { 221 | "display_name": "Python 3", 222 | "language": "python", 223 | "name": "python3" 224 | }, 225 | "language_info": { 226 | "codemirror_mode": { 227 | "name": "ipython", 228 | "version": 3 229 | }, 230 | "file_extension": ".py", 231 | "mimetype": "text/x-python", 232 | "name": 
"python", 233 | "nbconvert_exporter": "python", 234 | "pygments_lexer": "ipython3", 235 | "version": "3.6.3" 236 | } 237 | }, 238 | "nbformat": 4, 239 | "nbformat_minor": 2 240 | } 241 | -------------------------------------------------------------------------------- /Fitting Statistical Models to Data with Python/Week3/utf-8''week3_nhanes_practice.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Practice notebook for regression analysis with dependent data in NHANES\n", 8 | "\n", 9 | "This notebook will give you the opportunity to perform some analyses\n", 10 | "using the regression methods for dependent data that we are focusing\n", 11 | "on in this week of the course.\n", 12 | "\n", 13 | "Enter the code in response to each question in the boxes labeled \"enter your code here\".\n", 14 | "Then enter your responses to the questions in the boxes labeled \"Type\n", 15 | "Markdown and Latex\".\n", 16 | "\n", 17 | "This notebook is based on the NHANES case study notebook for\n", 18 | "regression with dependent data. Most of the code that you will need\n", 19 | "to write below is very similar to code that appears in the case study\n", 20 | "notebook. 
You will need to edit code from that notebook in small ways\n", 21 | "to adapt it to the prompts below.\n", 22 | "\n", 23 | "To get started, we will use the same module imports and read the data\n", 24 | "in the same way as we did in the case study:" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 1, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "%matplotlib inline\n", 34 | "import matplotlib.pyplot as plt\n", 35 | "import seaborn as sns\n", 36 | "import pandas as pd\n", 37 | "import statsmodels.api as sm\n", 38 | "import numpy as np\n", 39 | "\n", 40 | "url = \"nhanes_2015_2016.csv\"\n", 41 | "da = pd.read_csv(url)\n", 42 | "\n", 43 | "# Keep the columns used below (including BPXDI1, which the questions ask about), drop rows with any missing values.\n", 44 | "vars = [\"BPXSY1\", \"BPXDI1\", \"RIDAGEYR\", \"RIAGENDR\", \"RIDRETH1\", \"DMDEDUC2\", \"BMXBMI\",\n", 45 | " \"SMQ020\", \"SDMVSTRA\", \"SDMVPSU\"]\n", 46 | "da = da[vars].dropna()\n", 47 | "\n", 48 | "# This is the grouping variable\n", 49 | "da[\"group\"] = 10*da.SDMVSTRA + da.SDMVPSU" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "## Question 1: \n", 57 | "\n", 58 | "Build a marginal linear model using GEE for the first measurement of diastolic blood pressure (`BPXDI1`), accounting for the grouping variable defined above. This initial model should have no covariates, and will be used to assess the ICC of this blood pressure measure." 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 2, 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [ 67 | "# enter your code here" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "__Q1a.__ What is the ICC for diastolic blood pressure? What can you\n", 75 | " conclude by comparing it to the ICC for systolic blood pressure?" 
76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "## Question 2: \n", 88 | "\n", 89 | "Take your model from question 1, and add gender, age, and BMI to it as covariates." 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 3, 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [ 98 | "# enter your code here" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "__Q2a.__ What is the ICC for this model? What can you conclude by comparing it to the ICC for the model that you fit in question 1?" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "## Question 3: \n", 118 | "\n", 119 | "Split the data into separate datasets for females and for males and fit two separate marginal linear models for diastolic blood pressure, one only for females, and one only for males." 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 4, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "# enter your code here" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "__Q3a.__ What do you learn by comparing these two fitted models?" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "## Question 4: \n", 148 | "\n", 149 | "Using the entire data set, fit a multilevel model for diastolic blood pressure, predicted by age, gender, BMI, and educational attainment. Include a random intercept for groups." 
150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 5, 155 | "metadata": {}, 156 | "outputs": [], 157 | "source": [ 158 | "# enter your code here" 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": {}, 164 | "source": [ 165 | "__Q4a.__ How would you describe the strength of the clustering in this analysis?" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "__Q4b:__ Include a random intercept for age, and describe your findings." 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [] 184 | } 185 | ], 186 | "metadata": { 187 | "kernelspec": { 188 | "display_name": "Python 3", 189 | "language": "python", 190 | "name": "python3" 191 | }, 192 | "language_info": { 193 | "codemirror_mode": { 194 | "name": "ipython", 195 | "version": 3 196 | }, 197 | "file_extension": ".py", 198 | "mimetype": "text/x-python", 199 | "name": "python", 200 | "nbconvert_exporter": "python", 201 | "pygments_lexer": "ipython3", 202 | "version": "3.6.3" 203 | } 204 | }, 205 | "nbformat": 4, 206 | "nbformat_minor": 1 207 | } 208 | -------------------------------------------------------------------------------- /Understanding & Visualizing Data with Python/Week3/utf-8''nhanes_multivariate_practice.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Practice notebook for multivariate analysis using NHANES data\n", 8 | "\n", 9 | "This notebook will give you the opportunity to perform some multivariate analyses on your own using the NHANES study data. 
These analyses are similar to what was done in the week 3 NHANES case study notebook.\n", 10 | "\n", 11 | "You can enter your code into the cells that say \"enter your code here\", and you can type responses to the questions into the cells that say \"Type Markdown and Latex\".\n", 12 | "\n", 13 | "Note that most of the code that you will need to write below is very similar to code that appears in the case study notebook. You will need to edit code from that notebook in small ways to adapt it to the prompts below.\n", 14 | "\n", 15 | "To get started, we will use the same module imports and read the data in the same way as we did in the case study:" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": null, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "%matplotlib inline\n", 25 | "import matplotlib.pyplot as plt\n", 26 | "import seaborn as sns\n", 27 | "import pandas as pd\n", 28 | "import statsmodels.api as sm\n", 29 | "import numpy as np\n", 30 | "\n", 31 | "da = pd.read_csv(\"nhanes_2015_2016.csv\")\n", 32 | "da.columns" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "## Question 1\n", 40 | "\n", 41 | "Make a scatterplot showing the relationship between the first and second measurements of diastolic blood pressure ([BPXDI1](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.htm#BPXDI1) and [BPXDI2](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.htm#BPXDI2)). Also obtain the 4x4 matrix of correlation coefficients among the first two systolic and the first two diastolic blood pressure measures." 
42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "# enter your code here" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "__Q1a.__ How does the correlation between repeated measurements of diastolic blood pressure relate to the correlation between repeated measurements of systolic blood pressure?" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "__Q1b.__ Are the second systolic and second diastolic blood pressure measures more correlated or less correlated than the first systolic and first diastolic blood pressure measures?" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "## Question 2\n", 82 | "\n", 83 | "Log transform the four blood pressure variables and repeat question 1." 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [ 92 | "# enter your code here" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "__Q2a.__ Does the correlation analysis on log-transformed data lead to any important insights that the correlation analysis on the untransformed data missed?" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "## Question 3\n", 112 | "\n", 113 | "Construct a grid of scatterplots between the first systolic and the first diastolic blood pressure measurement. Stratify the plots by gender (rows) and by race/ethnicity groups (columns)."
114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "# insert your code here" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "__Q3a.__ Comment on the extent to which these two blood pressure variables are correlated to different degrees in different demographic subgroups." 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "## Question 4\n", 142 | "\n", 143 | "Use \"violin plots\" to compare the distributions of ages within groups defined by gender and educational attainment." 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "# insert your code here" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "__Q4a.__ Comment on any evident differences among the age distributions in the different demographic groups." 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "## Question 5\n", 172 | "\n", 173 | "Use violin plots to compare the distributions of BMI within a series of 10-year age bands. Also stratify these plots by gender." 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": {}, 180 | "outputs": [], 181 | "source": [ 182 | "# insert your code here" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "__Q5a.__ Comment on the trends in BMI across the demographic groups." 
190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "## Question 6\n", 202 | "\n", 203 | "Construct a frequency table for the joint distribution of ethnicity groups ([RIDRETH1](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#RIDRETH1)) and health-insurance status ([HIQ210](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/HIQ_I.htm#HIQ210)). Normalize the results so that the values within each ethnic group are proportions that sum to 1." 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": null, 209 | "metadata": {}, 210 | "outputs": [], 211 | "source": [ 212 | "# insert your code here" 213 | ] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "metadata": {}, 218 | "source": [ 219 | "__Q6a.__ Which ethnic group has the highest rate of being uninsured in the past year?" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [] 226 | } 227 | ], 228 | "metadata": { 229 | "kernelspec": { 230 | "display_name": "Python 3", 231 | "language": "python", 232 | "name": "python3" 233 | }, 234 | "language_info": { 235 | "codemirror_mode": { 236 | "name": "ipython", 237 | "version": 3 238 | }, 239 | "file_extension": ".py", 240 | "mimetype": "text/x-python", 241 | "name": "python", 242 | "nbconvert_exporter": "python", 243 | "pygments_lexer": "ipython3", 244 | "version": "3.6.3" 245 | } 246 | }, 247 | "nbformat": 4, 248 | "nbformat_minor": 2 249 | } 250 | -------------------------------------------------------------------------------- /Understanding & Visualizing Data with Python/Week1/utf-8''introduction_jupyter.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### What is Jupyter Notebooks?\n", 8 | "\n", 9 | "Jupyter is a web-based interactive 
development environment that supports multiple programming languages, but is most commonly used with the Python programming language.\n", 10 | "\n", 11 | "The interactive environment that Jupyter provides enables students, scientists, and researchers to create reproducible analyses and formulate a story within a single document.\n", 12 | "\n", 13 | "Let's take a look at an example of a completed Jupyter Notebook: [Example Notebook](http://nbviewer.jupyter.org/github/cossatot/lanf_earthquake_likelihood/blob/master/notebooks/lanf_manuscript_notebook.ipynb)" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "### Jupyter Notebook Features\n", 21 | "\n", 22 | "* File Browser\n", 23 | "* Markdown Cells & Syntax\n", 24 | "* Kernels, Variables, & Environment\n", 25 | "* Command vs. Edit Mode & Shortcuts" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "### What is Markdown?\n", 33 | "\n", 34 | "Markdown is a markup language that uses plain text formatting syntax. 
This means that we can modify the formatting of our text with the use of various symbols on our keyboard as indicators.\n", 35 | "\n", 36 | "Some examples include:\n", 37 | "\n", 38 | "* Headers\n", 39 | "* Text modifications such as italics and bold\n", 40 | "* Ordered and Unordered lists\n", 41 | "* Links\n", 42 | "* Tables\n", 43 | "* Images\n", 44 | "* Etc.\n", 45 | "\n", 46 | "Now I'll showcase some examples of how this formatting is done:" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "Headers:\n", 54 | "\n", 55 | "# H1\n", 56 | "## H2\n", 57 | "### H3\n", 58 | "#### H4\n", 59 | "##### H5\n", 60 | "###### H6" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "Text modifications:\n", 68 | "\n", 69 | "Emphasis, aka italics, with *asterisks* or _underscores_.\n", 70 | "\n", 71 | "Strong emphasis, aka bold, with **asterisks** or __underscores__.\n", 72 | "\n", 73 | "Combined emphasis with **asterisks and _underscores_**.\n", 74 | "\n", 75 | "Strikethrough uses two tildes. ~~Scratch this.~~" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "Lists:\n", 83 | "\n", 84 | "1. First ordered list item\n", 85 | "2. Another item\n", 86 | " * Unordered sub-list. \n", 87 | "1. Actual numbers don't matter, just that it's a number\n", 88 | " 1. Ordered sub-list\n", 89 | "4. And another item.\n", 90 | "\n", 91 | "* Unordered list can use asterisks\n", 92 | "- Or minuses\n", 93 | "+ Or pluses" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "Links:\n", 101 | "\n", 102 | "http://www.umich.edu\n", 103 | "\n", 104 | "\n", 105 | "\n", 106 | "[The University of Michigan's Homepage](http://www.umich.edu/)\n", 107 | "\n", 108 | "To look into more examples of Markdown syntax and features such as tables, images, etc. 
head to the following link: [Markdown Reference](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "### Kernels, Variables, and Environment\n", 116 | "\n", 117 | "A notebook kernel is a “computational engine” that executes the code contained in a Notebook document. There are kernels for various programming languages, but we are solely using the Python kernel, which executes Python code.\n", 118 | "\n", 119 | "When a notebook is opened, the associated kernel is automatically launched for our convenience." 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": { 126 | "collapsed": true 127 | }, 128 | "outputs": [], 129 | "source": [ 130 | "### This is python\n", 131 | "print(\"This is a python code cell\")" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "A kernel is the back-end of our notebook, which not only executes our Python code but also stores our initialized variables." 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": null, 144 | "metadata": { 145 | "collapsed": true 146 | }, 147 | "outputs": [], 148 | "source": [ 149 | "### For example, let's initialize variable x\n", 150 | "\n", 151 | "x = 1738\n", 152 | "\n", 153 | "print(\"x has been set to \" + str(x))" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": { 160 | "collapsed": true 161 | }, 162 | "outputs": [], 163 | "source": [ 164 | "### Print x\n", 165 | "\n", 166 | "print(x)" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": {}, 172 | "source": [ 173 | "Issues arise when we restart our kernel and attempt to run code with variables that have not been reinitialized.\n", 174 | "\n", 175 | "If the kernel is reset, make sure to rerun the code where variables are initialized."
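The effect of a kernel restart on stored variables can be sketched roughly as follows. This is only a simulation, not how Jupyter itself is implemented: the name `fresh_namespace` is an assumption made for illustration, and an empty dictionary stands in for the memory of a freshly restarted kernel.

```python
# A rough simulation (not Jupyter's actual mechanism): an empty dict stands in
# for the variable store of a kernel that has just been restarted.
fresh_namespace = {}  # hypothetical name, used only for this sketch

try:
    # x was never defined in this namespace, so looking it up fails
    exec("print(x)", fresh_namespace)
except NameError as err:
    message = str(err)
    print("NameError:", message)

# Rerunning the cell that initializes x makes it available again
exec("x = 1738", fresh_namespace)
exec("print(x)", fresh_namespace)
```

In a real notebook the same `NameError` appears when a cell that uses `x` is run before the cell that assigns it.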
176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": { 182 | "collapsed": true 183 | }, 184 | "outputs": [], 185 | "source": [ 186 | "## We can also run code that accepts input\n", 187 | "\n", 188 | "name = input(\"What is your name? \")\n", 189 | "\n", 190 | "print(\"The name you entered is \" + name)" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "It is important to note that Jupyter Notebooks have in-line cell execution. This means that a running cell must complete its operations before another cell can be executed. A cell that is still executing is indicated by the [*] on the left-hand side of the cell." 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": null, 203 | "metadata": { 204 | "collapsed": true 205 | }, 206 | "outputs": [], 207 | "source": [ 208 | "print(\"This won't print until all prior cells have finished executing.\")" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "### Command vs. Edit Mode & Shortcuts\n", 216 | "\n", 217 | "There is an edit mode and a command mode for Jupyter notebooks. The mode is easily identifiable by the color of the left border of the cell.\n", 218 | "\n", 219 | "Blue = Command Mode.\n", 220 | "\n", 221 | "Green = Edit Mode.\n", 222 | "\n", 223 | "Command Mode can be toggled by pressing **esc** on your keyboard.\n", 224 | "\n", 225 | "Commands can be used to execute notebook functions. 
For example, changing the format of a markdown cell or adding line numbers.\n", 226 | "\n", 227 | "Let's toggle line numbers while in command mode by pressing **L**.\n", 228 | "\n", 229 | "#### Additional Shortcuts\n", 230 | "\n", 231 | "There are a lot of shortcuts that can be used to improve productivity while using Jupyter Notebooks.\n", 232 | "\n", 233 | "Here is a list:\n", 234 | "\n", 235 | "![Jupyter Notebook Shortcuts](img/shortcuts.png)" 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "### How do you install Jupyter Notebooks?\n", 243 | "\n", 244 | "**Note:** *Coursera provides embedded Jupyter notebooks within the course, so the download is not required unless you wish to explore Jupyter further on your own computer.*\n", 245 | "\n", 246 | "Official Installation Guide: https://jupyter.readthedocs.io/en/latest/install.html\n", 247 | "\n", 248 | "Jupyter recommends utilizing Anaconda, which is a platform compatible with Windows, macOS, and Linux systems. 
\n", 249 | "\n", 250 | "Anaconda Download: https://www.anaconda.com/download/#macos" 251 | ] 252 | } 253 | ], 254 | "metadata": { 255 | "kernelspec": { 256 | "display_name": "Python 3", 257 | "language": "python", 258 | "name": "python3" 259 | }, 260 | "language_info": { 261 | "codemirror_mode": { 262 | "name": "ipython", 263 | "version": 3 264 | }, 265 | "file_extension": ".py", 266 | "mimetype": "text/x-python", 267 | "name": "python", 268 | "nbconvert_exporter": "python", 269 | "pygments_lexer": "ipython3", 270 | "version": "3.6.3" 271 | } 272 | }, 273 | "nbformat": 4, 274 | "nbformat_minor": 2 275 | } 276 | -------------------------------------------------------------------------------- /Inferential Statistical Analysis With Python/Week1/utf-8''functions_lambdas_help.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import pandas as pd\n", 11 | "import matplotlib.pyplot as plt\n", 12 | "%matplotlib inline" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "## Functions, lambda functions, and reading help documents" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "We are going to introduce a few more python concepts. If you've been working through the NHANES example notebooks, you will have seen these in use already. There is a lot to say about these new concepts, but we will only be giving a brief introduction to each. For more information, follow the links provided or do your own search for the many great resources available on the web. 
" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "### Functions\n", 34 | "\n", 35 | "If you use a snippet of code multiple times, it is best practice to put that code into a function instead of copying and pasting it. For example, if you wanted several of the same plots with different data, you could create a function that returns that style of plot for arbitrary (though with correct dimension and type) data. \n", 36 | "\n", 37 | "In Python, indentation is very important. If done incorrectly, your code will not run and instead will give an error. When defining a function, all code after the ':' must be indented properly. The indentation conveys the scope of the code. [Some further explanation](https://docs.python.org/2.0/ref/indentation.html).\n", 38 | "```\n", 39 | "def function_name(arguments):\n", 40 | " \"\"\"\n", 41 | " Header comment: brief description of what this function does\n", 42 | " \n", 43 | " Args:\n", 44 | " obj: input for this function\n", 45 | " Returns:\n", 46 | " out: the output of this function\n", 47 | " \"\"\"\n", 48 | " \n", 49 | " some code\n", 50 | " \n", 51 | " return out \n", 52 | " ```\n", 53 | " \n", 54 | " Exactly how to structure the header comments is up to you if you work alone, or will likely be specified if working for an established company. \n", 55 | " \n", 56 | "Function names should start with a lower case letter (they cannot start with a number), and can be in camelCase or snake_case.\n", 57 | "\n", 58 | "If your function returns a variable, you use 'return' to specify that variable. A function doesn't always have to return something though. For example, you could have a function that creates a plot and then saves it in the current directory. 
" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": null, 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [ 67 | "def sum_x_y(x, y): # don't need comments if immediately clear what the function does\n", 68 | " out = x + y\n", 69 | " return out\n", 70 | "\n", 71 | "sum_x_y(4, 6)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "def get_max(x):\n", 81 | " current_max = x[0]\n", 82 | " for i in x[1:]:\n", 83 | " if i > current_max:\n", 84 | " current_max = i\n", 85 | " return current_max" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "get_max(np.random.choice(400, 100)) \n", 95 | "# np.random.choice(400, 100) will randomly choose 100 integers from 0 to 399 (with replacement) " 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "There is a lot more to be said about functions that we don't have time to cover in this course, so I leave you with examples of [common gotchas](https://docs.python-guide.org/writing/gotchas/) that you may run into." 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "### lambda functions\n", 110 | "\n", 111 | "These are also known as anonymous functions because they are unnamed. A lambda function can have any number of arguments but only one expression. Lambda functions, unlike defined functions, always return a value.\n", 112 | "The format of a lambda function is \n", 113 | "```\n", 114 | "lambda arguments: expression \n", 115 | "```\n", 116 | "\n", 117 | "They can look similar to a mathematical expression for evaluating a function. 
\n", 118 | "For example:\n", 119 | "```\n", 120 | "(lambda x: x**2)(3)\n", 121 | "```\n", 122 | "Is the same as mathematically writing \n", 123 | "$f(x) = x^2$ and then evaluating the function $f$ at $x=3$, \n", 124 | "$f(3) = 9$" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": {}, 131 | "outputs": [], 132 | "source": [ 133 | "(lambda x: x**2)(3)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "Another way to use a lambda function is to store it in a variable like in the example below." 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [ 149 | "f = lambda x: np.sin(x)\n", 150 | "x = np.linspace(-np.pi, np.pi, 100)\n", 151 | "y = [f(i) for i in x]\n", 152 | "plt.plot(x, y)\n", 153 | "plt.show()\n", 154 | "# we could have made this several ways, can you think of another?" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "You shouldn't come across many (if any) cases where you would have to use a lambda function, but we present them briefly here so that you can recognize them in the wild. " 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "### Reading help documentation\n", 169 | "A key skill in being a successful programmer is being able to read the documentation for a function and understand what that function does and what the arguments are. \n", 170 | "\n", 171 | "To get the documentation, use the help function. 
First, let's call the help function on help, to see what it does:" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": null, 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [ 180 | "help(help)" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "We can see that calling help(thing) will print the documentation for 'thing'. Generally, this documentation will first list the function with its arguments (also called parameters), showing what the default arguments are. Then, it will list these arguments (parameters) and specify what they are and their type. Then it will document what the function returns, errors it may raise, and possibly other documentation as necessary. Often, the bottom of the document will contain examples.\n", 188 | "\n", 189 | "Let's look at another example, the pandas drop function. This is used to drop rows or columns from a DataFrame. If you had a DataFrame called 'my_df', you would call this function by\n", 190 | "```\n", 191 | "my_df.drop(some arguments)\n", 192 | "```\n", 193 | "Unfortunately, we cannot simply call \n", 194 | "```\n", 195 | "help(drop)\n", 196 | "```\n", 197 | "because drop is not a function in base python. Instead, we must call\n", 198 | "```\n", 199 | "help(pd.DataFrame.drop)\n", 200 | "```\n", 201 | "because we need to specify that this is from the pandas library (pd) and is applied to a DataFrame. If you're wondering why I'm capitalizing DataFrame as such, it is because that is a data type in the python pandas library. Without the capitalization, it has no meaning. " 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "help(pd.DataFrame.drop)" 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "metadata": {}, 216 | "source": [ 217 | "If you wanted to drop the column 'this one' from the DataFrame 'my_df', how would you do it?"
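One possible answer to the question above, as a minimal sketch: the contents of `my_df` below are made up purely for illustration, and the `columns=` keyword is one of several ways the help text for `pd.DataFrame.drop` describes for selecting what to drop.

```python
import pandas as pd

# Hypothetical DataFrame, invented only so the example can run
my_df = pd.DataFrame({"this one": [1, 2, 3], "that one": [4, 5, 6]})

# drop returns a new DataFrame by default; my_df itself is left unchanged
dropped = my_df.drop(columns=["this one"])

print(dropped.columns.tolist())
print(my_df.columns.tolist())
```

Because `drop` does not modify `my_df` unless asked to, the result has to be assigned to a name (here `dropped`) if you want to keep it.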
218 | ] 219 | } 220 | ], 221 | "metadata": { 222 | "kernelspec": { 223 | "display_name": "Python 3", 224 | "language": "python", 225 | "name": "python3" 226 | }, 227 | "language_info": { 228 | "codemirror_mode": { 229 | "name": "ipython", 230 | "version": 3 231 | }, 232 | "file_extension": ".py", 233 | "mimetype": "text/x-python", 234 | "name": "python", 235 | "nbconvert_exporter": "python", 236 | "pygments_lexer": "ipython3", 237 | "version": "3.6.3" 238 | } 239 | }, 240 | "nbformat": 4, 241 | "nbformat_minor": 2 242 | } 243 | -------------------------------------------------------------------------------- /Inferential Statistical Analysis With Python/Week4/utf-8''nhanes_confidence_intervals_practice.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Practice notebook for confidence intervals using NHANES data\n", 8 | "\n", 9 | "This notebook will give you the opportunity to practice working with confidence intervals using the NHANES data.\n", 10 | "\n", 11 | "You can enter your code into the cells that say \"enter your code here\", and you can type responses to the questions into the cells that say \"Type Markdown and Latex\".\n", 12 | "\n", 13 | "Note that most of the code that you will need to write below is very similar to code that appears in the case study notebook. 
You will need to edit code from that notebook in small ways to adapt it to the prompts below.\n", 14 | "\n", 15 | "To get started, we will use the same module imports and read the data in the same way as we did in the case study:" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 2, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "%matplotlib inline\n", 25 | "import matplotlib.pyplot as plt\n", 26 | "import pandas as pd\n", 27 | "import numpy as np\n", 28 | "import seaborn as sns\n", 29 | "import statsmodels.api as sm\n", 30 | "\n", 31 | "da = pd.read_csv(\"nhanes_2015_2016.csv\")" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "## Question 1\n", 39 | "\n", 40 | "Restrict the sample to women between 35 and 50, then use the marital status variable [DMDMARTL](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDMARTL) to partition this sample into two groups - women who are currently married, and women who are not currently married. Within each of these groups, calculate the proportion of women who have completed college. Calculate 95% confidence intervals for each of these proportions." 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "# enter your code here" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "__Q1a.__ Identify which of the two confidence intervals is wider, and explain why this is the case. 
" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "__Q1b.__ Write 1-2 sentences summarizing these findings for an audience that does not know what a confidence interval is (the goal here is to report the substance of what you learned about how marital status and educational attainment are related, not to teach a person what a confidence interval is)." 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "## Question 2\n", 81 | "\n", 82 | "Construct 95% confidence intervals for the proportion of smokers who are female, and for the proportion of smokers who are male. Then construct a 95% confidence interval for the difference between these proportions." 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": null, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "# enter your code here" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "__Q2a.__ Discuss why it may be relevant to report the proportions of smokers who are female and male, and contrast this to reporting the proportions of males and females who smoke." 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "__Q2b.__ How does the width of the confidence interval for the difference of the two proportions compare to the widths of the confidence intervals for each proportion separately?" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "## Question 3\n", 123 | "\n", 124 | "Construct a 95% confidence interval for height ([BMXHT](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm#BMXHT)) in centimeters. Then convert height from centimeters to inches by dividing by 2.54, and construct a 95% confidence interval for height in inches. Finally, convert the endpoints (the lower and upper confidence limits) of the confidence interval from inches back to centimeters."
111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "## Question 3\n", 123 | "\n", 124 | "Construct a 95% interval for height ([BMXHT](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm#BMXHT)) in centimeters. Then convert height from centimeters to inches by dividing by 2.54, and construct a 95% confidence interval for height in inches. Finally, convert the endpoints (the lower and upper confidence limits) of the confidence interval from inches to back to centimeters " 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": {}, 131 | "outputs": [], 132 | "source": [ 133 | "# enter your code here" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "__Q3a.__ Describe how the confidence interval constructed in centimeters relates to the confidence interval constructed in inches." 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "## Question 4\n", 153 | "\n", 154 | "Partition the sample based on 10-year age bands, i.e. the resulting groups will consist of people with ages from 18-28, 29-38, etc. Construct 95% confidence intervals for the difference between the mean BMI for females and for males within each age band." 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "# enter your code here" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "__Q4a.__ How do the widths of these confidence intervals differ? Provide an explanation for any substantial differences in the confidence interval widths that you see."
171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "## Question 5\n", 178 | "\n", 179 | "Construct a 95% confidence interval for the first and second systolic blood pressure measures, and for the difference between the first and second systolic blood pressure measurements within a subject." 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [ 188 | "# enter code here" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "__Q5a.__ Based on these confidence intervals, would you say that a difference of zero between the population mean values of the first and second systolic blood pressure measures is consistent with the data?" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "__Q5b.__ Discuss how the width of the confidence interval for the within-subject difference compares to the widths of the confidence intervals for the first and second measures." 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "metadata": {}, 218 | "source": [ 219 | "## Question 6\n", 220 | "\n", 221 | "Construct a 95% confidence interval for the mean difference between the average age of a smoker, and the average age of a non-smoker." 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "# insert your code here" 231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "metadata": {}, 236 | "source": [ 237 | "__Q6a.__ Use graphical and numerical techniques to compare the variation in the ages of smokers to the variation in the ages of non-smokers. 
" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": 1, 243 | "metadata": {}, 244 | "outputs": [], 245 | "source": [ 246 | "# insert your code here" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "__Q6b.__ Does it appear that uncertainty about the mean age of smokers, or uncertainty about the mean age of non-smokers contributed more to the uncertainty for the mean difference that we are focusing on here?" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [] 260 | } 261 | ], 262 | "metadata": { 263 | "kernelspec": { 264 | "display_name": "Python 3", 265 | "language": "python", 266 | "name": "python3" 267 | }, 268 | "language_info": { 269 | "codemirror_mode": { 270 | "name": "ipython", 271 | "version": 3 272 | }, 273 | "file_extension": ".py", 274 | "mimetype": "text/x-python", 275 | "name": "python", 276 | "nbconvert_exporter": "python", 277 | "pygments_lexer": "ipython3", 278 | "version": "3.6.3" 279 | } 280 | }, 281 | "nbformat": 4, 282 | "nbformat_minor": 2 283 | } 284 | -------------------------------------------------------------------------------- /Inferential Statistical Analysis With Python/Week4/utf-8''nhanes_hypothesis_test_practice.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Practice notebook for hypothesis tests using NHANES data\n", 8 | "\n", 9 | "This notebook will give you the opportunity to perform some hypothesis tests with the NHANES data that are similar to\n", 10 | "what was done in the week 3 case study notebook.\n", 11 | "\n", 12 | "You can enter your code into the cells that say \"enter your code here\", and you can type responses to the questions into the cells that say \"Type Markdown and Latex\".\n", 13 | "\n", 14 | "Note that most of the code that you will need to write below 
is very similar to code that appears in the case study notebook. You will need to edit code from that notebook in small ways to adapt it to the prompts below.\n", 15 | "\n", 16 | "To get started, we will use the same module imports and read the data in the same way as we did in the case study:" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 1, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "%matplotlib inline\n", 26 | "import matplotlib.pyplot as plt\n", 27 | "import seaborn as sns\n", 28 | "import pandas as pd\n", 29 | "import statsmodels.api as sm\n", 30 | "import numpy as np\n", 31 | "\n", 32 | "da = pd.read_csv(\"nhanes_2015_2016.csv\")" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "## Question 1\n", 40 | "\n", 41 | "Conduct a hypothesis test (at the 0.05 level) for the null hypothesis that the proportion of women who smoke is equal to the proportion of men who smoke." 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 1, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "# insert your code here" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "__Q1a.__ Write 1-2 sentences explaining the substance of your findings to someone who does not know anything about statistical hypothesis tests." 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "__Q1b.__ Construct three 95% confidence intervals: one for the proportion of women who smoke, one for the proportion of men who smoke, and one for the difference in the rates of smoking between women and men." 
70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "# insert your code here" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "__Q1c.__ Comment on any ways in which the confidence intervals that you found in part b reinforce, contradict, or add support to the hypothesis test conducted in part a." 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "## Question 2\n", 98 | "\n", 99 | "Partition the population into two groups based on whether a person has graduated college or not, using the educational attainment variable [DMDEDUC2](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDEDUC2). Then conduct a test of the null hypothesis that the average heights (in centimeters) of the two groups are equal. Next, convert the heights from centimeters to inches, and conduct a test of the null hypothesis that the average heights (in inches) of the two groups are equal." 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "# insert your code here" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "__Q2a.__ Based on the analysis performed here, are you confident that people who graduated from college have a different average height compared to people who did not graduate from college?" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "__Q2b:__ How do the results obtained using the heights expressed in inches compare to the results obtained using the heights expressed in centimeters?" 
128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "## Question 3\n", 140 | "\n", 141 | "Conduct a hypothesis test of the null hypothesis that the average BMI for men between 30 and 40 is equal to the average BMI for men between 50 and 60. Then carry out this test again after log transforming the BMI values." 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "# insert your code here" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": {}, 156 | "source": [ 157 | "__Q3a.__ How would you characterize the evidence that mean BMI differs between these age bands, and how would you characterize the evidence that mean log BMI differs between these age bands?" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "## Question 4\n", 170 | "\n", 171 | "Suppose we wish to compare the mean BMI between college graduates and people who have not graduated from college, focusing on women between the ages of 30 and 40. First, consider the variance of BMI within each of these subpopulations using graphical techniques, and through the estimated subpopulation variances. Then, calculate pooled and unpooled estimates of the standard error for the difference between the mean BMI in the two populations being compared. Finally, test the null hypothesis that the two population means are equal, using each of the two different standard errors." 
 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": null, 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [ 180 | "# insert your code here" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "__Q4a.__ Comment on the strength of evidence against the null hypothesis that these two populations have equal mean BMI." 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "metadata": {}, 198 | "source": [ 199 | "__Q4b.__ Comment on the degree to which the two populations have different variances, and on the extent to which the different approaches to estimating the standard error of the mean difference give divergent results." 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": {}, 205 | "source": [] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "## Question 5\n", 212 | "\n", 213 | "Conduct a test of the null hypothesis that the first and second diastolic blood pressure measurements within a subject have the same mean values." 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "# insert your code here" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "__Q5a.__ Briefly describe your findings for an audience that is not familiar with statistical hypothesis testing." 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "__Q5b.__ Pretend that the first and second diastolic blood pressure measurements were taken on different people. Modify the analysis above as appropriate for this setting." 
 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": {}, 248 | "outputs": [], 249 | "source": [ 250 | "# insert your code here" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "__Q5c.__ Briefly describe how the approaches used and the results obtained in the preceding two parts of the question differ." 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [] 264 | } 265 | ], 266 | "metadata": { 267 | "kernelspec": { 268 | "display_name": "Python 3", 269 | "language": "python", 270 | "name": "python3" 271 | }, 272 | "language_info": { 273 | "codemirror_mode": { 274 | "name": "ipython", 275 | "version": 3 276 | }, 277 | "file_extension": ".py", 278 | "mimetype": "text/x-python", 279 | "name": "python", 280 | "nbconvert_exporter": "python", 281 | "pygments_lexer": "ipython3", 282 | "version": "3.6.3" 283 | } 284 | }, 285 | "nbformat": 4, 286 | "nbformat_minor": 2 287 | } 288 | -------------------------------------------------------------------------------- /Inferential Statistical Analysis With Python/Week1/utf-8''week1_assessment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "J5p3JZwvGQEe" 8 | }, 9 | "source": [ 10 | "You will use the values you find in this assignment to answer questions in the quiz that follows. You may want to display this notebook side-by-side on screen with that quiz." 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "colab_type": "text", 17 | "id": "20Hp_V-eFzbI" 18 | }, 19 | "source": [ 20 | "1. 
Write a function that inputs an integer and returns the negative" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 1, 26 | "metadata": { 27 | "colab": {}, 28 | "colab_type": "code", 29 | "id": "tFPbRKR4FzbL" 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "# Write your function here\n", 34 | "def funct(s):\n", 35 | " s=-abs(s)\n", 36 | " return s\n" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 2, 42 | "metadata": { 43 | "colab": {}, 44 | "colab_type": "code", 45 | "id": "jW5MfUUnFzbQ" 46 | }, 47 | "outputs": [ 48 | { 49 | "data": { 50 | "text/plain": [ 51 | "-4" 52 | ] 53 | }, 54 | "execution_count": 2, 55 | "metadata": {}, 56 | "output_type": "execute_result" 57 | } 58 | ], 59 | "source": [ 60 | "# Test your function with input x\n", 61 | "x = 4\n", 62 | "funct(x)" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": { 68 | "colab_type": "text", 69 | "id": "f6kLOf6_FzbU" 70 | }, 71 | "source": [ 72 | "2. 
Write a function that inputs a list of integers and returns the minimum value" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 3, 78 | "metadata": { 79 | "colab": {}, 80 | "colab_type": "code", 81 | "id": "IHV-wS_hFzbW" 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "# Write your function here\n", 86 | "def funct(l):\n", 87 | " # min() returns the smallest value without reordering the caller's list\n", 88 | " return min(l)" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 4, 94 | "metadata": { 95 | "colab": {}, 96 | "colab_type": "code", 97 | "id": "EfvSeoaOFzba" 98 | }, 99 | "outputs": [ 100 | { 101 | "name": "stdout", 102 | "output_type": "stream", 103 | "text": [ 104 | "-3\n" 105 | ] 106 | } 107 | ], 108 | "source": [ 109 | "# Test your function with input lst\n", 110 | "lst = [-3, 0, 2, 100, -1, 2]\n", 111 | "print(funct(lst))\n", 112 | "\n", 113 | "# Create your own input list to test with" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": { 119 | "colab_type": "text", 120 | "id": 
"-yjvHCDuFzbd" 121 | }, 122 | "source": [ 123 | "#### Challenge problem: \n", 124 | "Write a function that takes in four arguments: lst1, lst2, str1, str2, and returns a pandas DataFrame that has the first column labeled str1 and the second column labeled str2, that have values lst1 and lst2 scaled to be between 0 and 1.\n", 125 | "\n", 126 | "For example\n", 127 | "```\n", 128 | "lst1 = [1, 2, 3]\n", 129 | "lst2 = [2, 4, 5]\n", 130 | "str1 = 'one'\n", 131 | "str2 = 'two'\n", 132 | "\n", 133 | "my_function(lst1, lst2, str1, str2)\n", 134 | "``` \n", 135 | "should return a DataFrame that looks like:\n", 136 | "\n", 137 | "\n", 138 | "\n", 139 | "| | one | two |\n", 140 | "| --- | --- | --- |\n", 141 | "| 0 | 0 | 0 |\n", 142 | "| 1 | .5 | .666 |\n", 143 | "| 2 | 1 | 1 |\n", 144 | "\n" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 0, 150 | "metadata": { 151 | "colab": {}, 152 | "colab_type": "code", 153 | "id": "0yABet-jFzbi" 154 | }, 155 | "outputs": [], 156 | "source": [ 157 | "# test your challenge problem function\n", 158 | "import numpy as np\n", 159 | "\n", 160 | "lst1 = np.random.randint(-234, 938, 100)\n", 161 | "lst2 = np.random.randint(-522, 123, 100)\n", 162 | "str1 = 'one'\n", 163 | "str2 = 'alpha'" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 6, 169 | "metadata": { 170 | "colab": {}, 171 | "colab_type": "code", 172 | "id": "UhTlOZX1Fzbf" 173 | }, 174 | "outputs": [ 175 | { 176 | "name": "stdout", 177 | "output_type": "stream", 178 | "text": [ 179 | "Help on function drop in module pandas.core.frame:\n", 180 | "\n", 181 | "drop(self, labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')\n", 182 | " Drop specified labels from rows or columns.\n", 183 | " \n", 184 | " Remove rows or columns by specifying label names and corresponding\n", 185 | " axis, or by specifying directly index or column names. 
When using a\n", 186 | " multi-index, labels on different levels can be removed by specifying\n", 187 | " the level.\n", 188 | " \n", 189 | " Parameters\n", 190 | " ----------\n", 191 | " labels : single label or list-like\n", 192 | " Index or column labels to drop.\n", 193 | " axis : {0 or 'index', 1 or 'columns'}, default 0\n", 194 | " Whether to drop labels from the index (0 or 'index') or\n", 195 | " columns (1 or 'columns').\n", 196 | " index, columns : single label or list-like\n", 197 | " Alternative to specifying axis (``labels, axis=1``\n", 198 | " is equivalent to ``columns=labels``).\n", 199 | " \n", 200 | " .. versionadded:: 0.21.0\n", 201 | " level : int or level name, optional\n", 202 | " For MultiIndex, level from which the labels will be removed.\n", 203 | " inplace : bool, default False\n", 204 | " If True, do operation inplace and return None.\n", 205 | " errors : {'ignore', 'raise'}, default 'raise'\n", 206 | " If 'ignore', suppress error and only existing labels are\n", 207 | " dropped.\n", 208 | " \n", 209 | " Returns\n", 210 | " -------\n", 211 | " dropped : pandas.DataFrame\n", 212 | " \n", 213 | " Raises\n", 214 | " ------\n", 215 | " KeyError\n", 216 | " If none of the labels are found in the selected axis\n", 217 | " \n", 218 | " See Also\n", 219 | " --------\n", 220 | " DataFrame.loc : Label-location based indexer for selection by label.\n", 221 | " DataFrame.dropna : Return DataFrame with labels on given axis omitted\n", 222 | " where (all or any) data are missing.\n", 223 | " DataFrame.drop_duplicates : Return DataFrame with duplicate rows\n", 224 | " removed, optionally only considering certain columns.\n", 225 | " Series.drop : Return Series with specified index labels removed.\n", 226 | " \n", 227 | " Examples\n", 228 | " --------\n", 229 | " >>> df = pd.DataFrame(np.arange(12).reshape(3,4),\n", 230 | " ... 
columns=['A', 'B', 'C', 'D'])\n", 231 | " >>> df\n", 232 | " A B C D\n", 233 | " 0 0 1 2 3\n", 234 | " 1 4 5 6 7\n", 235 | " 2 8 9 10 11\n", 236 | " \n", 237 | " Drop columns\n", 238 | " \n", 239 | " >>> df.drop(['B', 'C'], axis=1)\n", 240 | " A D\n", 241 | " 0 0 3\n", 242 | " 1 4 7\n", 243 | " 2 8 11\n", 244 | " \n", 245 | " >>> df.drop(columns=['B', 'C'])\n", 246 | " A D\n", 247 | " 0 0 3\n", 248 | " 1 4 7\n", 249 | " 2 8 11\n", 250 | " \n", 251 | " Drop a row by index\n", 252 | " \n", 253 | " >>> df.drop([0, 1])\n", 254 | " A B C D\n", 255 | " 2 8 9 10 11\n", 256 | " \n", 257 | " Drop columns and/or rows of MultiIndex DataFrame\n", 258 | " \n", 259 | " >>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],\n", 260 | " ... ['speed', 'weight', 'length']],\n", 261 | " ... codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],\n", 262 | " ... [0, 1, 2, 0, 1, 2, 0, 1, 2]])\n", 263 | " >>> df = pd.DataFrame(index=midx, columns=['big', 'small'],\n", 264 | " ... data=[[45, 30], [200, 100], [1.5, 1], [30, 20],\n", 265 | " ... [250, 150], [1.5, 0.8], [320, 250],\n", 266 | " ... 
[1, 0.8], [0.3,0.2]])\n", 267 | " >>> df\n", 268 | " big small\n", 269 | " lama speed 45.0 30.0\n", 270 | " weight 200.0 100.0\n", 271 | " length 1.5 1.0\n", 272 | " cow speed 30.0 20.0\n", 273 | " weight 250.0 150.0\n", 274 | " length 1.5 0.8\n", 275 | " falcon speed 320.0 250.0\n", 276 | " weight 1.0 0.8\n", 277 | " length 0.3 0.2\n", 278 | " \n", 279 | " >>> df.drop(index='cow', columns='small')\n", 280 | " big\n", 281 | " lama speed 45.0\n", 282 | " weight 200.0\n", 283 | " length 1.5\n", 284 | " falcon speed 320.0\n", 285 | " weight 1.0\n", 286 | " length 0.3\n", 287 | " \n", 288 | " >>> df.drop(index='length', level=1)\n", 289 | " big small\n", 290 | " lama speed 45.0 30.0\n", 291 | " weight 200.0 100.0\n", 292 | " cow speed 30.0 20.0\n", 293 | " weight 250.0 150.0\n", 294 | " falcon speed 320.0 250.0\n", 295 | " weight 1.0 0.8\n", 296 | "\n" 297 | ] 298 | } 299 | ], 300 | "source": [ 301 | "import pandas as pd\n", 302 | "help(pd.DataFrame.drop)" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": null, 308 | "metadata": {}, 309 | "outputs": [], 310 | "source": [] 311 | } 312 | ], 313 | "metadata": { 314 | "colab": { 315 | "collapsed_sections": [], 316 | "name": "week1_assessment.ipynb", 317 | "provenance": [], 318 | "version": "0.3.2" 319 | }, 320 | "kernelspec": { 321 | "display_name": "Python 3", 322 | "language": "python", 323 | "name": "python3" 324 | }, 325 | "language_info": { 326 | "codemirror_mode": { 327 | "name": "ipython", 328 | "version": 3 329 | }, 330 | "file_extension": ".py", 331 | "mimetype": "text/x-python", 332 | "name": "python", 333 | "nbconvert_exporter": "python", 334 | "pygments_lexer": "ipython3", 335 | "version": "3.6.3" 336 | } 337 | }, 338 | "nbformat": 4, 339 | "nbformat_minor": 1 340 | } 341 | -------------------------------------------------------------------------------- /Inferential Statistical Analysis With Python/Week2/utf-8''intro_confidenceintervals.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Statistical Inference with Confidence Intervals\n", 8 | "\n", 9 | "Throughout week 2, we have explored the concept of confidence intervals, how to calculate them, interpret them, and what confidence really means. \n", 10 | "\n", 11 | "In this tutorial, we're going to review how to calculate confidence intervals of population proportions and means.\n", 12 | "\n", 13 | "To begin, let's go over some of the material from this week and why confidence intervals are useful tools when deriving insights from data.\n", 14 | "\n", 15 | "### Why Confidence Intervals?\n", 16 | "\n", 17 | "A confidence interval is a calculated range around an estimate of a parameter that is supported mathematically with a certain level of confidence. For example, in the lecture, we estimated, with 95% confidence, that the population proportion of parents with a toddler that use a car seat for all travel with their toddler was somewhere between 82.2% and 87.7%.\n", 18 | "\n", 19 | "This is *__different__* than having a 95% probability that the true population proportion is within our confidence interval.\n", 20 | "\n", 21 | "Essentially, if we were to repeat this sampling process many times, about 95% of our calculated confidence intervals would contain the true proportion.\n", 22 | "\n", 23 | "### How are Confidence Intervals Calculated?\n", 24 | "\n", 25 | "Our equation for calculating confidence intervals is as follows:\n", 26 | "\n", 27 | "$$Best\\ Estimate \\pm Margin\\ of\\ Error$$\n", 28 | "\n", 29 | "Where the *Best Estimate* is the **observed population proportion or mean** and the *Margin of Error* is the **t-multiplier times the standard error**.\n", 30 | "\n", 31 | "The t-multiplier is calculated based on the degrees of freedom and desired confidence level. 
For samples with more than 30 observations and a confidence level of 95%, the t-multiplier is approximately 1.96.\n", 32 | "\n", 33 | "The equation to create a 95% confidence interval can also be shown as:\n", 34 | "\n", 35 | "$$Population\\ Proportion\\ or\\ Mean\\ \\pm (t-multiplier *\\ Standard\\ Error)$$\n", 36 | "\n", 37 | "Lastly, the Standard Error is calculated differently for population proportion and mean:\n", 38 | "\n", 39 | "$$Standard\\ Error \\ for\\ Population\\ Proportion = \\sqrt{\\frac{Population\\ Proportion * (1 - Population\\ Proportion)}{Number\\ Of\\ Observations}}$$\n", 40 | "\n", 41 | "$$Standard\\ Error \\ for\\ Mean = \\frac{Standard\\ Deviation}{\\sqrt{Number\\ Of\\ Observations}}$$\n", 42 | "\n", 43 | "Let's replicate the car seat example from lecture:" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 1, 49 | "metadata": {}, 50 | "outputs": [], 51 | "source": [ 52 | "import numpy as np" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 3, 58 | "metadata": {}, 59 | "outputs": [ 60 | { 61 | "data": { 62 | "text/plain": [ 63 | "0.01390952774409444" 64 | ] 65 | }, 66 | "execution_count": 3, 67 | "metadata": {}, 68 | "output_type": "execute_result" 69 | } 70 | ], 71 | "source": [ 72 | "tstar = 1.96\n", 73 | "p = .85\n", 74 | "n = 659\n", 75 | "\n", 76 | "se = np.sqrt((p * (1 - p))/n)\n", 77 | "se # standard error" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 4, 83 | "metadata": {}, 84 | "outputs": [ 85 | { 86 | "data": { 87 | "text/plain": [ 88 | "(0.8227373256215749, 0.8772626743784251)" 89 | ] 90 | }, 91 | "execution_count": 4, 92 | "metadata": {}, 93 | "output_type": "execute_result" 94 | } 95 | ], 96 | "source": [ 97 | "lcb = p - tstar * se # lower confidence bound\n", 98 | "ucb = p + tstar * se # upper confidence bound\n", 99 | "(lcb, ucb)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 5, 105 | "metadata": {}, 106 | 
"outputs": [], 107 | "source": [ 108 | "import 
statsmodels.api as sm" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 6, 114 | "metadata": {}, 115 | "outputs": [ 116 | { 117 | "data": { 118 | "text/plain": [ 119 | "(0.8227378265796143, 0.8772621734203857)" 120 | ] 121 | }, 122 | "execution_count": 6, 123 | "metadata": {}, 124 | "output_type": "execute_result" 125 | } 126 | ], 127 | "source": [ 128 | "sm.stats.proportion_confint(n * p, n) # first argument: count of successes (n * p); second: total number of observations (n)" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "Now, let's take our Cartwheel dataset introduced in lecture and calculate a confidence interval for our mean cartwheel distance:" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 7, 141 | "metadata": {}, 142 | "outputs": [], 143 | "source": [ 144 | "import pandas as pd\n", 145 | "\n", 146 | "df = pd.read_csv(\"Cartwheeldata.csv\")" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 8, 152 | "metadata": {}, 153 | "outputs": [ 154 | { 155 | "data": { 156 | "text/html": [ 157 | "
<table><thead><tr><th></th><th>ID</th><th>Age</th><th>Gender</th><th>GenderGroup</th><th>Glasses</th><th>GlassesGroup</th><th>Height</th><th>Wingspan</th><th>CWDistance</th><th>Complete</th><th>CompleteGroup</th><th>Score</th></tr></thead><tbody>
<tr><th>0</th><td>1</td><td>56</td><td>F</td><td>1</td><td>Y</td><td>1</td><td>62.0</td><td>61.0</td><td>79</td><td>Y</td><td>1</td><td>7</td></tr>
<tr><th>1</th><td>2</td><td>26</td><td>F</td><td>1</td><td>Y</td><td>1</td><td>62.0</td><td>60.0</td><td>70</td><td>Y</td><td>1</td><td>8</td></tr>
<tr><th>2</th><td>3</td><td>33</td><td>F</td><td>1</td><td>Y</td><td>1</td><td>66.0</td><td>64.0</td><td>85</td><td>Y</td><td>1</td><td>7</td></tr>
<tr><th>3</th><td>4</td><td>39</td><td>F</td><td>1</td><td>N</td><td>0</td><td>64.0</td><td>63.0</td><td>87</td><td>Y</td><td>1</td><td>10</td></tr>
<tr><th>4</th><td>5</td><td>27</td><td>M</td><td>2</td><td>N</td><td>0</td><td>73.0</td><td>75.0</td><td>72</td><td>N</td><td>0</td><td>4</td></tr>
</tbody></table>
" 268 | ], 269 | "text/plain": [ 270 | " ID Age Gender GenderGroup Glasses GlassesGroup Height Wingspan \\\n", 271 | "0 1 56 F 1 Y 1 62.0 61.0 \n", 272 | "1 2 26 F 1 Y 1 62.0 60.0 \n", 273 | "2 3 33 F 1 Y 1 66.0 64.0 \n", 274 | "3 4 39 F 1 N 0 64.0 63.0 \n", 275 | "4 5 27 M 2 N 0 73.0 75.0 \n", 276 | "\n", 277 | " CWDistance Complete CompleteGroup Score \n", 278 | "0 79 Y 1 7 \n", 279 | "1 70 Y 1 8 \n", 280 | "2 85 Y 1 7 \n", 281 | "3 87 Y 1 10 \n", 282 | "4 72 N 0 4 " 283 | ] 284 | }, 285 | "execution_count": 8, 286 | "metadata": {}, 287 | "output_type": "execute_result" 288 | } 289 | ], 290 | "source": [ 291 | "df.head()" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": 9, 297 | "metadata": {}, 298 | "outputs": [ 299 | { 300 | "data": { 301 | "text/plain": [ 302 | "25" 303 | ] 304 | }, 305 | "execution_count": 9, 306 | "metadata": {}, 307 | "output_type": "execute_result" 308 | } 309 | ], 310 | "source": [ 311 | "mean = df[\"CWDistance\"].mean()\n", 312 | "sd = df[\"CWDistance\"].std()\n", 313 | "n = len(df)\n", 314 | "\n", 315 | "n" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": null, 321 | "metadata": {}, 322 | "outputs": [], 323 | "source": [ 324 | "tstar = 2.064 # fewer than 30 observations, so use the t-multiplier for n - 1 = 24 degrees of freedom\n", 325 | "\n", 326 | "se = sd/np.sqrt(n) # standard error of the mean\n", 327 | "\n", 328 | "se" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": null, 334 | "metadata": {}, 335 | "outputs": [], 336 | "source": [ 337 | "lcb = mean - tstar * se\n", 338 | "ucb = mean + tstar * se\n", 339 | "(lcb, ucb)" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": null, 345 | "metadata": {}, 346 | "outputs": [], 347 | "source": [ 348 | "sm.stats.DescrStatsW(df[\"CWDistance\"]).zconfint_mean() # normal-based (z) interval, so slightly narrower than the t-based interval above" 349 | ] 350 | } 351 | ], 352 | "metadata": 
{ 353 | "kernelspec": { 354 | "display_name": "Python 3", 355 | "language": "python", 356 | "name": "python3" 357 | }, 358 | "language_info": { 
359 | "codemirror_mode": { 360 | "name": "ipython", 361 | "version": 3 362 | }, 363 | "file_extension": ".py", 364 | "mimetype": "text/x-python", 365 | "name": "python", 366 | "nbconvert_exporter": "python", 367 | "pygments_lexer": "ipython3", 368 | "version": "3.6.3" 369 | } 370 | }, 371 | "nbformat": 4, 372 | "nbformat_minor": 2 373 | } 374 | -------------------------------------------------------------------------------- /Understanding & Visualizing Data with Python/Week2/utf-8''nhanes_univariate_practice.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Practice notebook for univariate analysis using NHANES data\n", 8 | "\n", 9 | "This notebook will give you the opportunity to perform some univariate analyses on your own using the NHANES. These analyses are similar to what was done in the week 2 NHANES case study notebook.\n", 10 | "\n", 11 | "You can enter your code into the cells that say \"enter your code here\", and you can type responses to the questions into the cells that say \"Type Markdown and Latex\".\n", 12 | "\n", 13 | "Note that most of the code that you will need to write below is very similar to code that appears in the case study notebook. 
You will need to edit code from that notebook in small ways to adapt it to the prompts below.\n", 14 | "\n", 15 | "To get started, we will use the same module imports and read the data in the same way as we did in the case study:" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 2, 21 | "metadata": {}, 22 | "outputs": [ 23 | { 24 | "name": "stdout", 25 | "output_type": "stream", 26 | "text": [ 27 | "Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR',\n", 28 | " 'RIDRETH1', 'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR',\n", 29 | " 'SDMVPSU', 'SDMVSTRA', 'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2',\n", 30 | " 'BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC',\n", 31 | " 'BMXWAIST', 'HIQ210'],\n", 32 | " dtype='object')\n" 33 | ] 34 | } 35 | ], 36 | "source": [ 37 | "%matplotlib inline\n", 38 | "import matplotlib.pyplot as plt\n", 39 | "import seaborn as sns\n", 40 | "import pandas as pd\n", 41 | "import statsmodels.api as sm\n", 42 | "import numpy as np\n", 43 | "\n", 44 | "da = pd.read_csv(\"nhanes_2015_2016.csv\")" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## Question 1\n", 52 | "\n", 53 | "Relabel the marital status variable [DMDMARTL](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDMARTL) to have brief but informative character labels. Then construct a frequency table of these values for all people, then for women only, and for men only. Then construct these three frequency tables using only people whose age is between 30 and 40." 
54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "# insert your code here" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "__Q1a.__ Briefly comment on some of the differences that you observe in the distribution of marital status between women and men, for people of all ages." 70 | ] 71 | }, 72 | { 73 | "cell_type": "raw", 74 | "metadata": {}, 75 | "source": [] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "__Q1b.__ Briefly comment on the differences that you observe in the distribution of marital status between women in the overall population and women between the ages of 30 and 40." 82 | ] 83 | }, 84 | { 85 | "cell_type": "raw", 86 | "metadata": {}, 87 | "source": [] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "__Q1c.__ Repeat part b for the men." 94 | ] 95 | }, 96 | { 97 | "cell_type": "raw", 98 | "metadata": {}, 99 | "source": [] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "## Question 2\n", 106 | "\n", 107 | "Restricting to the female population, stratify the subjects into age bands no wider than ten years, and construct the distribution of marital status within each age band. Within each age band, present the distribution in terms of proportions that must sum to 1." 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "# insert your code here" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "__Q2a.__ Comment on the trends that you see in this series of marginal distributions." 
124 | ] 125 | }, 126 | { 127 | "cell_type": "raw", 128 | "metadata": {}, 129 | "source": [] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "__Q2b.__ Repeat the construction for males." 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": null, 141 | "metadata": {}, 142 | "outputs": [], 143 | "source": [ 144 | "# insert your code here" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "__Q2c.__ Comment on any notable differences that you see when comparing these results for females and for males." 152 | ] 153 | }, 154 | { 155 | "cell_type": "raw", 156 | "metadata": {}, 157 | "source": [] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "## Question 3\n", 164 | "\n", 165 | "Construct a histogram of the distribution of heights using the BMXHT variable in the NHANES sample." 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": null, 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [ 174 | "# insert your code here" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "__Q3a.__ Use the `bins` argument to [distplot](https://seaborn.pydata.org/generated/seaborn.distplot.html) to produce histograms with different numbers of bins. Assess whether the default value for this argument gives a meaningful result, and comment on what happens as the number of bins grows excessively large or excessively small. " 182 | ] 183 | }, 184 | { 185 | "cell_type": "raw", 186 | "metadata": {}, 187 | "source": [] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "__Q3b.__ Make separate histograms for the heights of women and men, then make a side-by-side boxplot showing the heights of women and men." 
194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 3, 199 | "metadata": {}, 200 | "outputs": [], 201 | "source": [ 202 | "# insert your code here" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "__Q3c.__ Comment on what features, if any, are not represented clearly in the boxplots, and what features, if any, are easier to see in the boxplots than in the histograms." 210 | ] 211 | }, 212 | { 213 | "cell_type": "raw", 214 | "metadata": {}, 215 | "source": [] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "## Question 4\n", 222 | "\n", 223 | "Make a boxplot showing the distribution of within-subject differences between the first and second systolic blood pressure measurements ([BPXSY1](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.htm#BPXSY1) and [BPXSY2](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.htm#BPXSY2))." 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "metadata": {}, 230 | "outputs": [], 231 | "source": [ 232 | "# insert your code here" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": {}, 238 | "source": [ 239 | "__Q4a.__ What proportion of the subjects have a lower SBP on the second reading compared to the first?" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": null, 245 | "metadata": {}, 246 | "outputs": [], 247 | "source": [ 248 | "# insert your code here" 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "metadata": {}, 254 | "source": [ 255 | "__Q4b.__ Make side-by-side boxplots of the two systolic blood pressure variables." 
256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 4, 261 | "metadata": {}, 262 | "outputs": [], 263 | "source": [ 264 | "# insert your code here" 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "metadata": {}, 270 | "source": [ 271 | "__Q4c.__ Comment on the variation within either the first or second systolic blood pressure measurements, and the variation in the within-subject differences between the first and second systolic blood pressure measurements." 272 | ] 273 | }, 274 | { 275 | "cell_type": "raw", 276 | "metadata": {}, 277 | "source": [] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "## Question 5\n", 284 | "\n", 285 | "Construct a frequency table of household sizes for people within each educational attainment category (the relevant variable is [DMDEDUC2](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDEDUC2)). Convert the frequencies to proportions." 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": null, 291 | "metadata": {}, 292 | "outputs": [], 293 | "source": [ 294 | "# insert your code here" 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "__Q5a.__ Comment on any major differences among the distributions." 302 | ] 303 | }, 304 | { 305 | "cell_type": "raw", 306 | "metadata": {}, 307 | "source": [] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": {}, 312 | "source": [ 313 | "__Q5b.__ Restrict the sample to people between 30 and 40 years of age. Then calculate the median household size for women and men within each level of educational attainment." 
314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": 7, 319 | "metadata": {}, 320 | "outputs": [], 321 | "source": [ 322 | "# insert your code here" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "## Question 6\n", 330 | "\n", 331 | "The participants can be clustered into \"masked variance units\" (MVU) based on every combination of the variables [SDMVSTRA](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#SDMVSTRA) and [SDMVPSU](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#SDMVPSU). Calculate the mean age ([RIDAGEYR](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#RIDAGEYR)), height ([BMXHT](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm#BMXHT)), and BMI ([BMXBMI](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm#BMXBMI)) for each gender ([RIAGENDR](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#RIAGENDR)), within each MVU, and report the ratio between the largest and smallest mean (e.g. for height) across the MVUs." 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": 1, 337 | "metadata": {}, 338 | "outputs": [], 339 | "source": [ 340 | "# insert your code here" 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "__Q6a.__ Comment on the extent to which mean age, height, and BMI vary among the MVUs." 348 | ] 349 | }, 350 | { 351 | "cell_type": "raw", 352 | "metadata": {}, 353 | "source": [] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "__Q6b.__ Calculate the inter-quartile range (IQR) for age, height, and BMI for each gender and each MVU. Report the ratio between the largest and smallest IQR across the MVUs." 
360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": null, 365 | "metadata": {}, 366 | "outputs": [], 367 | "source": [ 368 | "# insert your code here" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": {}, 374 | "source": [ 375 | "__Q6c.__ Comment on the extent to which the IQR for age, height, and BMI vary among the MVUs." 376 | ] 377 | }, 378 | { 379 | "cell_type": "raw", 380 | "metadata": {}, 381 | "source": [] 382 | } 383 | ], 384 | "metadata": { 385 | "kernelspec": { 386 | "display_name": "Python 3", 387 | "language": "python", 388 | "name": "python3" 389 | }, 390 | "language_info": { 391 | "codemirror_mode": { 392 | "name": "ipython", 393 | "version": 3 394 | }, 395 | "file_extension": ".py", 396 | "mimetype": "text/x-python", 397 | "name": "python", 398 | "nbconvert_exporter": "python", 399 | "pygments_lexer": "ipython3", 400 | "version": "3.6.3" 401 | } 402 | }, 403 | "nbformat": 4, 404 | "nbformat_minor": 2 405 | } 406 | -------------------------------------------------------------------------------- /Inferential Statistical Analysis With Python/Week1/utf-8''listsvsarrays.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import pandas as pd\n", 11 | "import matplotlib.pyplot as plt\n", 12 | "%matplotlib inline" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "# Lists vs. numpy arrays, dictionaries, functions, and lambda functions " 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "We are going to introduce a few more python concepts. If you've been working through the NHANES example notebooks, you will have seen these in use already. 
There is a lot to say about these new concepts, but we will only be giving a brief introduction to each. For more information, follow the links provided or do your own search for the many great resources available on the web. " 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "### Lists vs numpy arrays" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "Lists can have multiple datatypes. For example, one element can be a string, another an int, and another a float. Lists are defined by using the square brackets: \\[ \\], with elements separated by commas, ','.\n", 41 | "Ex:\n", 42 | "```\n", 43 | "my_list = [1, 'Colorado', 4.7, 'rain']\n", 44 | "```\n", 45 | "Lists are indexed by position. Remember, in Python, the index starts at 0 and ends at len(my_list)-1. So to retrieve the first element of the list you call:\n", 46 | "```\n", 47 | "my_list[0]\n", 48 | "```\n", 49 | "\n", 50 | "Numpy arrays (np.array) differ from lists in that they contain only one datatype. For example, all the elements might be ints or strings or floats or objects. An array is created with np.array(object), where the input 'object' can be, for example, a list or a tuple. \n", 51 | "Ex:\n", 52 | "```\n", 53 | "my_array = np.array([1, 4, 5, 2])\n", 54 | "```\n", 55 | "or \n", 56 | "```\n", 57 | "my_array = np.array((1, 4, 5, 2))\n", 58 | "```\n", 59 | "\n", 60 | "Lists and numpy arrays differ in their speed and memory efficiency. An intuitive reason for this is that Python lists have to store the value of each element and also the type of each element (since the types can differ), whereas numpy arrays only need to store the type once because it is the same for all the elements in the array. \n", 61 | "\n", 62 | "You can do calculations with numpy arrays that can't be done on lists. 
\n", 63 | "Ex:\n", 64 | "```\n", 65 | "my_array/3\n", 66 | "```\n", 67 | "will return a numpy array, with each of the elements divided by 3. Whereas:\n", 68 | "```\n", 69 | "my_list/3\n", 70 | "```\n", 71 | "will throw an error.\n", 72 | "\n", 73 | "You can append items to the end of lists and numpy arrays, though they have slightly different commands. It is also worth noting that lists can append an item 'in place', but numpy arrays cannot.\n", 74 | "\n", 75 | "```\n", 76 | "my_list.append('new item')\n", 77 | "np.append(my_array, 5) # returns a new array; the new element is converted to a common dtype\n", 78 | "```\n", 79 | "\n", 80 | "Links to python docs: \n", 81 | "[Lists](https://docs.python.org/3/tutorial/datastructures.html), [arrays](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.array.html), [more on arrays](https://docs.scipy.org/doc/numpy-1.15.0/user/basics.creation.html)" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 2, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "my_list = [1, 2, 3]\n", 91 | "my_array = np.array([1, 2, 3])" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 3, 97 | "metadata": {}, 98 | "outputs": [ 99 | { 100 | "data": { 101 | "text/plain": [ 102 | "1" 103 | ] 104 | }, 105 | "execution_count": 3, 106 | "metadata": {}, 107 | "output_type": "execute_result" 108 | } 109 | ], 110 | "source": [ 111 | "# Both indexed by position\n", 112 | "my_list[0]" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 4, 118 | "metadata": {}, 119 | "outputs": [ 120 | { 121 | "data": { 122 | "text/plain": [ 123 | "1" 124 | ] 125 | }, 126 | "execution_count": 4, 127 | "metadata": {}, 128 | "output_type": "execute_result" 129 | } 130 | ], 131 | "source": [ 132 | "my_array[0]" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 5, 138 | "metadata": {}, 139 | "outputs": [ 140 | { 141 | "data": { 142 | "text/plain": [ 143 | 
"array([0.33333333, 0.66666667, 1. ])" 144 | ] 145 | }, 146 | "execution_count": 5, 147 | "metadata": {}, 148 | "output_type": "execute_result" 149 | } 150 | ], 151 | "source": [ 152 | "my_array/3" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 6, 158 | "metadata": {}, 159 | "outputs": [ 160 | { 161 | "ename": "TypeError", 162 | "evalue": "unsupported operand type(s) for /: 'list' and 'int'", 163 | "output_type": "error", 164 | "traceback": [ 165 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 166 | "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", 167 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mmy_list\u001b[0m\u001b[0;34m/\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 168 | "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for /: 'list' and 'int'" 169 | ] 170 | } 171 | ], 172 | "source": [ 173 | "my_list/3" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 7, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "data": { 183 | "text/plain": [ 184 | "[1, 2, 3, 5]" 185 | ] 186 | }, 187 | "execution_count": 7, 188 | "metadata": {}, 189 | "output_type": "execute_result" 190 | } 191 | ], 192 | "source": [ 193 | "my_list.append(5) # inplace\n", 194 | "my_list" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 8, 200 | "metadata": {}, 201 | "outputs": [ 202 | { 203 | "data": { 204 | "text/plain": [ 205 | "array([1, 2, 3, 5])" 206 | ] 207 | }, 208 | "execution_count": 8, 209 | "metadata": {}, 210 | "output_type": "execute_result" 211 | } 212 | ], 213 | "source": [ 214 | "my_array = np.append(my_array, 5) # not in place: np.append returns a new array, so reassign the name\n", 215 | "my_array" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "### Dictionaries \n", 223 | 
"Dictionaries store
key-value pairs and are indexed by the keys.\n", 224 | "They are written as \\{key1: value1, key2: value2\\}. The keys must be unique; the values do not need to be unique. \n", 225 | "They can be used for many tasks, for example, creating DataFrames and changing column names. \n", 226 | "\n", 227 | "[More about dictionaries in section 5.5](https://docs.python.org/3/tutorial/datastructures.html)" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": 9, 233 | "metadata": {}, 234 | "outputs": [ 235 | { 236 | "data": { 237 | "text/plain": [ 238 | "2" 239 | ] 240 | }, 241 | "execution_count": 9, 242 | "metadata": {}, 243 | "output_type": "execute_result" 244 | } 245 | ], 246 | "source": [ 247 | "dct = {'thing 1': 2, 'thing 2': 1}\n", 248 | "dct['thing 1']" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 10, 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "# adding to a dictionary\n", 258 | "dct['new thing'] = 'woooo'" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": 11, 264 | "metadata": {}, 265 | "outputs": [ 266 | { 267 | "data": { 268 | "text/plain": [ 269 | "'woooo'" 270 | ] 271 | }, 272 | "execution_count": 11, 273 | "metadata": {}, 274 | "output_type": "execute_result" 275 | } 276 | ], 277 | "source": [ 278 | "dct['new thing']" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": 12, 284 | "metadata": {}, 285 | "outputs": [ 286 | { 287 | "data": { 288 | "text/html": [ 289 | "
DataFrame rendered as an HTML table (columns col1, col2; rows 0: 0, 3; 1: 1, 4; 2: 2, 5)
" 330 | ], 331 | "text/plain": [ 332 | " col1 col2\n", 333 | "0 0 3\n", 334 | "1 1 4\n", 335 | "2 2 5" 336 | ] 337 | }, 338 | "execution_count": 12, 339 | "metadata": {}, 340 | "output_type": "execute_result" 341 | } 342 | ], 343 | "source": [ 344 | "# Create DataFrame\n", 345 | "df = pd.DataFrame({'col1':range(3), 'col2':range(3,6)})\n", 346 | "df" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": 13, 352 | "metadata": {}, 353 | "outputs": [ 354 | { 355 | "data": { 356 | "text/html": [ 357 | "
DataFrame rendered as an HTML table (columns apples, oranges; rows 0: 0, 3; 1: 1, 4; 2: 2, 5)
" 398 | ], 399 | "text/plain": [ 400 | " apples oranges\n", 401 | "0 0 3\n", 402 | "1 1 4\n", 403 | "2 2 5" 404 | ] 405 | }, 406 | "execution_count": 13, 407 | "metadata": {}, 408 | "output_type": "execute_result" 409 | } 410 | ], 411 | "source": [ 412 | "# Change column names\n", 413 | "df.rename(columns={'col1': 'apples', 'col2':'oranges'})" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": null, 419 | "metadata": {}, 420 | "outputs": [], 421 | "source": [] 422 | } 423 | ], 424 | "metadata": { 425 | "kernelspec": { 426 | "display_name": "Python 3", 427 | "language": "python", 428 | "name": "python3" 429 | }, 430 | "language_info": { 431 | "codemirror_mode": { 432 | "name": "ipython", 433 | "version": 3 434 | }, 435 | "file_extension": ".py", 436 | "mimetype": "text/x-python", 437 | "name": "python", 438 | "nbconvert_exporter": "python", 439 | "pygments_lexer": "ipython3", 440 | "version": "3.6.3" 441 | } 442 | }, 443 | "nbformat": 4, 444 | "nbformat_minor": 2 445 | } 446 | -------------------------------------------------------------------------------- /Understanding & Visualizing Data with Python/Week1/utf-8''week1_python_resources.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "tPxurgKK6mUZ" 8 | }, 9 | "source": [ 10 | "### Python Resources\n", 11 | "The purpose of this document is to direct you to resources that you may find useful if you decide to do a deeper dive into Python. This course is not meant to be an introduction to programming, nor an introduction to Python, but if you find yourself interested in exploring Python further, or feel as if this is a useful skill, this document aims to direct you to resources that you may find useful. 
If you have a background in Python or programming, style guides are included below to show how Python may differ from other programming languages or give you a launching point for diving deeper into more advanced packages. This course does not endorse the use or non-use of any particular resource, but the author has found these resources useful in their exploration of programming and Python in particular.\n", 12 | "\n", 13 | "### The Python Documentation\n", 14 | "Any reference that does not begin with the Python documentation would not be complete. The authors of the language, as well as the community that supports it, have developed a great set of tutorials, documentation, and references around Python. When in doubt, this is often the first place that you should look if you run into a scary error or would like to learn more about a specific function. The documentation can be found here: [Python Documentation](https://docs.python.org/3/)\n", 15 | "\n", 16 | "\n" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": { 22 | "colab_type": "text", 23 | "id": "8OH-7kq_3Ei4" 24 | }, 25 | "source": [ 26 | "### Python Programming Introductions\n", 27 | "\n", 28 | "Below are resources to help you along your way in learning Python. While it is great to consume material, in programming there is no substitute for actually writing code. For every hour that you spend learning, you should spend about twice that amount of time writing code for cool problems or working out examples. Coding is best learned through actually coding!\n", 29 | "\n", 30 | "* [Coursera](https://www.coursera.org/courses?query=python) has several offerings for Python that you can take in addition to this course. These courses will go into depth on Python programming and how to use it in an applied setting \n", 31 | "* [Codecademy](https://www.codecademy.com/learn/learn-python) is another resource that is great for learning Python (and other programming languages). 
While not as focused as Coursera, this is a quick way to get up and running with Python\n", 32 | "* YouTube is another great resource for online learning and there are several \"courses\" for learning Python. We recommend trying several sets of videos to see which you like best and using multiple video series to learn since each will present the material in a slightly different way\n", 33 | "* There are dozens of books on programming in Python that are great if you prefer to read. More so than the other resources, be sure to code what you learn. It is easy to read about coding, but you really learn to code by coding!\n", 34 | "* If you have a background in coding, the authors have found the tutorial at [Tutorials Point](https://www.tutorialspoint.com/python/index.htm) to be useful in getting started with Python. This tutorial assumes that you have some background in coding in another language" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": { 40 | "colab_type": "text", 41 | "id": "PdurMQjI2rQn" 42 | }, 43 | "source": [ 44 | "### Cheatsheets and References\n", 45 | "\n", 46 | "There are a variety of one-pagers and cheat-sheets available for Python that summarize the language in a few simple pages. These resources tend to be more aimed at someone who knows the language, or has experience in the language, but would like a refresher on how the language works. 
\n", 47 | "\n", 48 | "* [Cheatsheet for Numpy](https://www.datacamp.com/community/blog/python-numpy-cheat-sheet#gs.AK5ZBgE)\n", 49 | "* [Cheatsheet for Data Wrangling](https://www.datacamp.com/community/blog/pandas-cheat-sheet-python#gs.HPFoRIc)\n", 50 | "* [Cheatsheet for Pandas](https://www.datacamp.com/community/blog/python-pandas-cheat-sheet#gs.oundfxM)\n", 51 | "* [Cheatsheet for SciPy](https://www.datacamp.com/community/blog/python-scipy-cheat-sheet#gs.JDSg3OI)\n", 52 | "* [Cheatsheet for Matplotlib](https://www.datacamp.com/community/blog/python-matplotlib-cheat-sheet#gs.uEKySpY)\n", 53 | "\n", 54 | "\n" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": { 60 | "colab_type": "text", 61 | "id": "swjetdNA2_iq" 62 | }, 63 | "source": [ 64 | "### Python Style Guides\n", 65 | "\n", 66 | "As you learn to code, you will find that you begin to develop your own style. Sometimes this is good. More often, it can be detrimental to the readability of your code and, in extreme cases, can even keep you from finding bugs in your own code. \n", 67 | "\n", 68 | "It is best to learn good coding habits from the beginning, and the [Google Style Guide](https://github.com/google/styleguide/blob/gh-pages/pyguide.md) is a great place to start. We will mention some of these best practices here." 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": { 74 | "colab_type": "text", 75 | "id": "iNXpVVq6zSfY" 76 | }, 77 | "source": [ 78 | "#### Consistent Indenting\n", 79 | "Python will generally 'yell' at you if your indenting is incorrect. It is good to use an editor that takes care of this for you. In general, four spaces are preferred for indenting and you should not mix tabs and spaces. 
" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 0, 85 | "metadata": { 86 | "colab": {}, 87 | "colab_type": "code", 88 | "id": "xc6e9f0SCvbh" 89 | }, 90 | "outputs": [], 91 | "source": [ 92 | "# Good Indenting - four spaces are standard but consistency is key\n", 93 | "result = []\n", 94 | "for x in range(10):\n", 95 | "    for y in range(5):\n", 96 | "        if x * y > 10:\n", 97 | "            result.append((x, y))\n", 98 | "print(result)\n", 99 | "\n", 100 | "# Bad indenting - inconsistent widths make the nesting hard to follow\n", 101 | "result = []\n", 102 | "for x in range(10):\n", 103 | "  for y in range(5):\n", 104 | "       if x * y > 10:\n", 105 | "         result.append((x, y))\n", 106 | "print(result)" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": { 112 | "colab_type": "text", 113 | "id": "ErsdjHdUDixX" 114 | }, 115 | "source": [ 116 | "#### Commenting\n", 117 | "\n", 118 | "Comments seem weird when you first begin programming - why would I include 'code' that doesn't run? Comments are probably some of the most important aspects of code. They help others read code that is difficult for them to understand and, more importantly, they help you when you look at the code a few weeks later and need clarity on why you did something. 
Always comment and comment well.\n", 119 | "\n" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 0, 125 | "metadata": { 126 | "colab": {}, 127 | "colab_type": "code", 128 | "id": "qBTE3rwqEFMm" 129 | }, 130 | "outputs": [], 131 | "source": [ 132 | "################################################################################\n", 133 | "# #\n", 134 | "# Good Commenting #\n", 135 | "# #\n", 136 | "################################################################################\n", 137 | "\n", 138 | "################################ Bad Commenting ################################\n", 139 | "\n", 140 | "# My loop\n", 141 | "for x in range(10):\n", 142 | "    print(x)\n", 143 | " \n", 144 | "############################## Better Commenting ###############################\n", 145 | "\n", 146 | "# Looping from zero to nine\n", 147 | "for x in range(10):\n", 148 | "    print(x)\n", 149 | "\n", 150 | "############################# Preferred Commenting #############################\n", 151 | "\n", 152 | "# Print out the numbers from zero to nine\n", 153 | "for x in range(10):\n", 154 | "    print(x)" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 0, 160 | "metadata": { 161 | "colab": {}, 162 | "colab_type": "code", 163 | "id": "c0kgjakBF5X8" 164 | }, 165 | "outputs": [], 166 | "source": [ 167 | "################################################################################\n", 168 | "# #\n", 169 | "# Mixing Commenting Strategies #\n", 170 | "# #\n", 171 | "################################################################################\n", 172 | "\n", 173 | "# Try not to mix commenting styles in the same blocks - just be consistent\n", 174 | "\n", 175 | "########### Bad - mixing doc-strings commenting and line commenting ############\n", 176 | "\n", 177 | "''' Print one to six, and then six to nine'''\n", 178 | "for x in range(10):\n", 179 | "    # If x > 5, then print the value\n", 180 | "    if x > 5:\n", 181 | "        
print(x)\n", 182 | "    else:\n", 183 | "        print(x + 1)\n", 184 | " \n", 185 | "##################### Good - no mixing of comment types ########################\n", 186 | "\n", 187 | "# Print one to six, and then six to nine\n", 188 | "for x in range(10):\n", 189 | "    # If x > 5, then print the value\n", 190 | "    if x > 5:\n", 191 | "        print(x)\n", 192 | "    else:\n", 193 | "        print(x + 1)" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": { 199 | "colab_type": "text", 200 | "id": "UDSKv5J3G9j-" 201 | }, 202 | "source": [ 203 | "#### Line Length\n", 204 | "\n", 205 | "Try to avoid excessively long lines. Standard practice is to keep lines to no longer than 80 characters. While this is not a hard rule, it is a good practice to follow for readability." 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 0, 211 | "metadata": { 212 | "colab": {}, 213 | "colab_type": "code", 214 | "id": "QhSQsFKeHZae" 215 | }, 216 | "outputs": [], 217 | "source": [ 218 | "######################### Bad - This code is too long ##########################\n", 219 | "\n", 220 | "my_random_array = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n", 221 | "\n", 222 | "############ Good - this code is wrapped to avoid excessive length #############\n", 223 | "\n", 224 | "my_random_array = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9,\n", 225 | "                   10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8,\n", 226 | "                   9, 10]" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": { 232 | "colab_type": "text", 233 | "id": "-IVW9wL8JC9O" 234 | }, 235 | "source": [ 236 | "#### White Space\n", 237 | "Using whitespace well is a great way to improve the way that your code looks. 
In general, the following can help improve the look of your code:\n", 238 | "\n", 239 | "* Try to space out your code and introduce whitespace to improve readability\n", 240 | "* Use spacing to separate function arguments\n", 241 | "* Do not over-do spacing. Too many blank lines between code blocks make it difficult to organize code well" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 0, 247 | "metadata": { 248 | "colab": {}, 249 | "colab_type": "code", 250 | "id": "tkzduWdeiwSX" 251 | }, 252 | "outputs": [], 253 | "source": [ 254 | "################ Bad - this code has bad whitespace management #################\n", 255 | "\n", 256 | "my_player = player()\n", 257 | "player_attributes = get_player_attributes(my_player,height,weight, birthday)\n", 258 | "\n", 259 | "\n", 260 | "player_attributes[0]*=12 # convert from feet to inches\n", 261 | "\n", 262 | "\n", 263 | "\n", 264 | "\n", 265 | "my_player.shoot_ball()\n", 266 | "\n", 267 | "\n", 268 | "########################## Good whitespace management ##########################\n", 269 | "\n", 270 | "my_player = player()\n", 271 | "player_attributes = get_player_attributes(my_player, height, weight, birthday)\n", 272 | "\n", 273 | "# convert from feet to inches\n", 274 | "player_attributes[0] *= 12\n", 275 | "\n", 276 | "my_player.shoot_ball()" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": { 282 | "colab_type": "text", 283 | "id": "5Wafm9FsuNoi" 284 | }, 285 | "source": [ 286 | "#### The tip of the iceberg\n", 287 | "\n", 288 | "Take a look at code out in the wild if you are really curious. How do other developers code specific things? How do they manage spacing in loops? How do they manage the whitespace in argument lists? \n", 289 | "\n", 290 | "You will learn to code by coding, and you will develop your own style, but starting out with good habits ensures that your code is easy to read by others and, most importantly, yourself. Good luck!" 
291 | ] 292 | } 293 | ], 294 | "metadata": { 295 | "colab": { 296 | "name": "Diving Deeper Into Python.ipynb", 297 | "provenance": [], 298 | "version": "0.3.2" 299 | }, 300 | "kernelspec": { 301 | "display_name": "Python 3", 302 | "language": "python", 303 | "name": "python3" 304 | }, 305 | "language_info": { 306 | "codemirror_mode": { 307 | "name": "ipython", 308 | "version": 3 309 | }, 310 | "file_extension": ".py", 311 | "mimetype": "text/x-python", 312 | "name": "python", 313 | "nbconvert_exporter": "python", 314 | "pygments_lexer": "ipython3", 315 | "version": "3.6.3" 316 | } 317 | }, 318 | "nbformat": 4, 319 | "nbformat_minor": 1 320 | } 321 | -------------------------------------------------------------------------------- /Fitting Statistical Models to Data with Python/Week2/utf-8''week2_nhanes_practice.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Practice notebook for regression analysis with NHANES\n", 8 | "\n", 9 | "This notebook will give you the opportunity to perform some\n", 10 | "regression analyses with the NHANES data that are similar to\n", 11 | "the analyses done in the week 2 case study notebook.\n", 12 | "\n", 13 | "You can enter your code into the cells that say \"enter your code here\",\n", 14 | "and you can type responses to the questions into the cells that say \"Type Markdown and Latex\".\n", 15 | "\n", 16 | "Note that most of the code that you will need to write below is very similar\n", 17 | "to code that appears in the case study notebook. 
You will need\n", 18 | "to edit code from that notebook in small ways to adapt it to the\n", 19 | "prompts below.\n", 20 | "\n", 21 | "To get started, we will use the same module imports and\n", 22 | "read the data in the same way as we did in the case study:" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "%matplotlib inline\n", 32 | "import matplotlib.pyplot as plt\n", 33 | "import seaborn as sns\n", 34 | "import pandas as pd\n", 35 | "import statsmodels.api as sm\n", 36 | "import numpy as np\n", 37 | "\n", 38 | "url = \"https://raw.githubusercontent.com/kshedden/statswpy/master/NHANES/merged/nhanes_2015_2016.csv\"\n", 39 | "da = pd.read_csv(url)\n", 40 | "\n", 41 | "# Drop unused columns, drop rows with any missing values.\n", 42 | "vars = [\"BPXSY1\", \"RIDAGEYR\", \"RIAGENDR\", \"RIDRETH1\", \"DMDEDUC2\", \"BMXBMI\", \"SMQ020\"]\n", 43 | "da = da[vars].dropna()" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "## Question 1:\n", 51 | "\n", 52 | "Use linear regression to relate the expected body mass index (BMI) to a person's age." 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | "# enter your code here" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "__Q1a.__ According to your fitted model, do older people tend to have higher or lower BMI than younger people?" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "__Q1b.__ Based on your analysis, are you confident that there is a relationship between BMI and age in the population that NHANES represents?" 
81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "__Q1c.__ By how much does the average BMI of a 40 year old differ from the average BMI of a 20 year old?" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "__Q1d.__ What fraction of the variation of BMI in this population is explained by age?" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "## Question 2: \n", 117 | "\n", 118 | "Add gender and ethnicity as additional control variables to your linear model relating BMI to age. You will need to recode the ethnic groups based\n", 119 | "on the values in the codebook entry for [RIDRETH1](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#RIDRETH1)." 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "# enter your code here" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "__Q2a.__ How did the mean relationship between BMI and age change when you added additional covariates to the model?" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "__Q2b.__ How did the standard error for the regression parameter for age change when you added additional covariates to the model?" 
148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "__Q2c.__ How much additional variation in BMI is explained by age, gender, and ethnicity that is not explained by age alone?" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "__Q2d.__ What reference level did the software select for the ethnicity variable?" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "__Q2e.__ What is the expected difference between the BMI of a 40 year-old non-Hispanic black man and a 30 year-old non-Hispanic black man?" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "__Q2f.__ What is the expected difference between the BMI of a 50 year-old Mexican American woman and a 50 year-old non-Hispanic black man?" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "## Question 3: \n", 208 | "\n", 209 | "Randomly sample 25% of the NHANES data, then fit the same model you used in question 2 to this data set." 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "metadata": {}, 216 | "outputs": [], 217 | "source": [ 218 | "# enter your code here" 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": {}, 224 | "source": [ 225 | "__Q3a.__ How do the estimated regression coefficients and their standard errors compare between these two models? 
Do you see any systematic relationship between the two sets of results?" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "metadata": {}, 236 | "source": [ 237 | "## Question 4:\n", 238 | "\n", 239 | "Generate a scatterplot of the residuals against the fitted values for the model you fit in question 2." 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": null, 245 | "metadata": {}, 246 | "outputs": [], 247 | "source": [ 248 | "# enter your code here" 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "metadata": {}, 254 | "source": [ 255 | "__Q4a.__ What mean/variance relationship do you see?" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "## Question 5: \n", 268 | "\n", 269 | "Generate a plot showing the fitted mean BMI as a function of age for Mexican American men. Include a 95% simultaneous confidence band on your graph." 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "metadata": {}, 276 | "outputs": [], 277 | "source": [ 278 | "# enter your code here" 279 | ] 280 | }, 281 | { 282 | "cell_type": "markdown", 283 | "metadata": {}, 284 | "source": [ 285 | "__Q5a.__ According to your graph, what is the longest interval starting at year 30 following which the mean BMI could be constant? *Hint:* What is the longest horizontal line starting at age 30 that remains within the confidence band?" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "__Q5b.__ Add an additional line and confidence band to the same plot, showing the relationship between age and BMI for Mexican American women. 
At what ages do these intervals not overlap?" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "metadata": {}, 308 | "source": [ 309 | "## Question 6:\n", 310 | "\n", 311 | "Use an added variable plot to assess the linearity of the relationship between BMI and age (when controlling for gender and ethnicity)." 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": null, 317 | "metadata": {}, 318 | "outputs": [], 319 | "source": [ 320 | "# enter your code here" 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "__Q6a.__ What is your interpretation of the added variable plot?" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "## Question 7: \n", 340 | "\n", 341 | "Generate a binary variable reflecting whether a person has had at least 12 drinks in their lifetime, based on the [ALQ110](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/ALQ_I.htm#ALQ110) variable in NHANES. Calculate the marginal probability, odds, and log odds of this variable for women and for men. Then calculate the odds ratio for females relative to males." 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": null, 347 | "metadata": {}, 348 | "outputs": [], 349 | "source": [ 350 | "# enter your code here" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "__Q7a.__ Based on the log odds alone, do more than 50% of women drink alcohol?" 
358 | ] 359 | }, 360 | { 361 | "cell_type": "markdown", 362 | "metadata": {}, 363 | "source": [] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": {}, 368 | "source": [ 369 | "__Q7b.__ Does there appear to be an important difference between the alcohol use rate of women and men?" 370 | ] 371 | }, 372 | { 373 | "cell_type": "markdown", 374 | "metadata": {}, 375 | "source": [] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "metadata": {}, 380 | "source": [ 381 | "## Question 8: \n", 382 | "\n", 383 | "Use logistic regression to express the log odds that a person drinks (based on the binary drinking variable that you constructed above) in terms of gender." 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": null, 389 | "metadata": {}, 390 | "outputs": [], 391 | "source": [ 392 | "# enter your code here" 393 | ] 394 | }, 395 | { 396 | "cell_type": "markdown", 397 | "metadata": {}, 398 | "source": [ 399 | "__Q8a.__ Is there statistical evidence that the drinking rate differs between women and men? If so, in what direction is there a difference?" 400 | ] 401 | }, 402 | { 403 | "cell_type": "markdown", 404 | "metadata": {}, 405 | "source": [] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "metadata": {}, 410 | "source": [ 411 | "__Q8b.__ Confirm that the log odds ratio between drinking and gender calculated using the logistic regression model matches the log odds ratio calculated directly in question 7." 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": {}, 417 | "source": [] 418 | }, 419 | { 420 | "cell_type": "markdown", 421 | "metadata": {}, 422 | "source": [ 423 | "## Question 9: \n", 424 | "\n", 425 | "Use logistic regression to relate drinking to age, gender, and education." 
426 | ] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": null, 431 | "metadata": {}, 432 | "outputs": [], 433 | "source": [ 434 | "# enter your code here" 435 | ] 436 | }, 437 | { 438 | "cell_type": "markdown", 439 | "metadata": {}, 440 | "source": [ 441 | "__Q9a.__ Which of these predictor variables shows a statistically significant association with drinking?" 442 | ] 443 | }, 444 | { 445 | "cell_type": "markdown", 446 | "metadata": {}, 447 | "source": [] 448 | }, 449 | { 450 | "cell_type": "markdown", 451 | "metadata": {}, 452 | "source": [ 453 | "__Q9b.__ What are the odds of a college educated, 50 year old woman drinking?" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": {}, 459 | "source": [] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "metadata": {}, 464 | "source": [ 465 | "__Q9c.__ What is the odds ratio between the drinking status for college graduates and high school graduates (with no college), holding gender and age fixed?" 466 | ] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "metadata": {}, 471 | "source": [] 472 | }, 473 | { 474 | "cell_type": "markdown", 475 | "metadata": {}, 476 | "source": [ 477 | "__Q9d.__ Did the regression parameter for gender change to a meaningful degree when age and education were added to the model?" 478 | ] 479 | }, 480 | { 481 | "cell_type": "markdown", 482 | "metadata": {}, 483 | "source": [] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": {}, 488 | "source": [ 489 | "## Question 10:\n", 490 | "\n", 491 | "Construct a CERES plot for the relationship between drinking and age (using the model that controls for gender and educational attainment)." 
492 | ] 493 | }, 494 | { 495 | "cell_type": "code", 496 | "execution_count": null, 497 | "metadata": {}, 498 | "outputs": [], 499 | "source": [ 500 | "# enter your code here" 501 | ] 502 | }, 503 | { 504 | "cell_type": "markdown", 505 | "metadata": {}, 506 | "source": [ 507 | "__Q10a.__ Does the plot indicate any major non-linearity in the relationship between age and the log odds for drinking?" 508 | ] 509 | } 510 | ], 511 | "metadata": { 512 | "kernelspec": { 513 | "display_name": "Python 3", 514 | "language": "python", 515 | "name": "python3" 516 | }, 517 | "language_info": { 518 | "codemirror_mode": { 519 | "name": "ipython", 520 | "version": 3 521 | }, 522 | "file_extension": ".py", 523 | "mimetype": "text/x-python", 524 | "name": "python", 525 | "nbconvert_exporter": "python", 526 | "pygments_lexer": "ipython3", 527 | "version": "3.6.3" 528 | } 529 | }, 530 | "nbformat": 4, 531 | "nbformat_minor": 1 532 | } 533 | -------------------------------------------------------------------------------- /Understanding & Visualizing Data with Python/Week1/utf-8''libraries_data_management.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "GPScbsDWhjjS" 8 | }, 9 | "source": [ 10 | "# Python Libraries\n", 11 | "\n", 12 | "Python, like other programming languages, has an abundance of additional modules or libraries that augment the base framework and functionality of the language.\n", 13 | "\n", 14 | "Think of a library as a collection of functions that can be accessed to complete certain programming tasks without having to write your own algorithm.\n", 15 | "\n", 16 | "For this course, we will focus primarily on the following libraries:\n", 17 | "\n", 18 | "* **Numpy** is a library for working with arrays of data.\n", 19 | "\n", 20 | "* **Pandas** provides high-performance, easy-to-use data structures and data analysis 
tools.\n", 21 | "\n", 22 | "* **Scipy** is a library of techniques for numerical and scientific computing.\n", 23 | "\n", 24 | "* **Matplotlib** is a library for making graphs.\n", 25 | "\n", 26 | "* **Seaborn** is a higher-level interface to Matplotlib that can be used to simplify many graphing tasks.\n", 27 | "\n", 28 | "* **Statsmodels** is a library that implements many statistical techniques.\n", 29 | "\n", 30 | "# Documentation\n", 31 | "\n", 32 | "Reliable and accessible documentation is an absolute necessity when it comes to knowledge transfer of programming languages. Luckily, Python provides a significant amount of detailed documentation that explains the ins and outs of the language syntax, libraries, and more. \n", 33 | "\n", 34 | "Understanding how to read documentation is crucial for any programmer as it will serve as a fantastic resource when learning the intricacies of Python.\n", 35 | "\n", 36 | "Here is the link to the documentation of the Python standard library: [Python Standard Library](https://docs.python.org/3/library/index.html#library-index)" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": { 42 | "colab_type": "text", 43 | "id": "b9FN-daXhjjT" 44 | }, 45 | "source": [ 46 | "### Importing Libraries\n", 47 | "\n", 48 | "When using Python, you must always begin your scripts by importing the libraries that you will be using. 
\n", 49 | "\n", 50 | "The following statements import the numpy and pandas libraries and give them abbreviated names:" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": { 57 | "colab": {}, 58 | "colab_type": "code", 59 | "id": "uRwVhX-YhjjU" 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "import numpy as np\n", 64 | "import pandas as pd" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": { 70 | "colab_type": "text", 71 | "id": "9oG4uoyAhjjX" 72 | }, 73 | "source": [ 74 | "### Utilizing Library Functions\n", 75 | "\n", 76 | "After importing a library, its functions can then be called from your code by prepending the library name to the function name. For example, to use the '`dot`' function from the '`numpy`' library, you would enter '`numpy.dot`'. To avoid repeatedly having to type the library name in your scripts, it is conventional to define a two- or three-letter abbreviation for each library, e.g. '`numpy`' is usually abbreviated as '`np`'. This allows us to use '`np.dot`' instead of '`numpy.dot`'. Similarly, the Pandas library is typically abbreviated as '`pd`'." 
77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": { 82 | "colab_type": "text", 83 | "id": "fip4VOLMhjjY" 84 | }, 85 | "source": [ 86 | "The next cell shows how to call functions within an imported library:" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": null, 92 | "metadata": { 93 | "colab": {}, 94 | "colab_type": "code", 95 | "id": "9S03bwnkhjja", 96 | "outputId": "f3a621eb-281b-4f96-b4a9-ca112e4f8604" 97 | }, 98 | "outputs": [], 99 | "source": [ 100 | "a = np.array([[1,2],[3,4]]) \n", 101 | "b = np.array([[11,12],[13,14]]) \n", 102 | "\n", 103 | "np.dot(a,b)" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": { 109 | "colab_type": "text", 110 | "id": "ljjQpFYfhjje" 111 | }, 112 | "source": [ 113 | "As you can see, we used the dot() function within the numpy library to calculate the dot product of two arrays, a and b." 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": { 119 | "colab_type": "text", 120 | "id": "auNcbdxVhjjf" 121 | }, 122 | "source": [ 123 | "# Data Management\n", 124 | "\n", 125 | "Data management is a crucial component of statistical analysis and data science work. The following code will show how to import data via the pandas library, view your data, and transform your data.\n", 126 | "\n", 127 | "The main data structure that Pandas works with is called a **Data Frame**. This is a two-dimensional table of data in which the rows typically represent cases (e.g. Cartwheel Contest Participants), and the columns represent variables. Pandas also has a one-dimensional data structure called a **Series** that we will encounter when accessing a single column of a Data Frame.\n", 128 | "\n", 129 | "Pandas has a variety of functions named '`read_xxx`' for reading data in different formats. Right now we will focus on reading '`csv`' files, which stands for comma-separated values. 
Other supported file formats include Excel, JSON, and SQL, to name a few.\n", 130 | "\n", 131 | "There are many other options to '`read_csv`' that are very useful. For example, you would use the option `sep='\\t'` instead of the default `sep=','` if the fields of your data file are delimited by tabs instead of commas. See [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for the full documentation for '`read_csv`'." 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": { 137 | "colab_type": "text", 138 | "id": "MbGKgakihjjg" 139 | }, 140 | "source": [ 141 | "### Importing Data" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": { 148 | "colab": {}, 149 | "colab_type": "code", 150 | "id": "DXHHhj2jhjjg", 151 | "outputId": "af99e411-9c09-44aa-d619-553b2d2a5aa8" 152 | }, 153 | "outputs": [], 154 | "source": [ 155 | "# Store the path (or url) string that points to our .csv file\n", 156 | "url = \"Cartwheeldata.csv\"\n", 157 | "\n", 158 | "# Read the .csv file and store it as a pandas Data Frame\n", 159 | "df = pd.read_csv(url)\n", 160 | "\n", 161 | "# Output object type\n", 162 | "type(df)" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": { 168 | "colab_type": "text", 169 | "id": "-TrO3YWShjjl" 170 | }, 171 | "source": [ 172 | "### Viewing Data" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": { 179 | "colab": {}, 180 | "colab_type": "code", 181 | "id": "IMgR30w4hjjl", 182 | "outputId": "9a269897-78a7-4380-9634-81d2d6165c2d" 183 | }, 184 | "outputs": [], 185 | "source": [ 186 | "# We can view our Data Frame by calling the head() function\n", 187 | "df.head()" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": { 193 | "colab_type": "text", 194 | "id": "rwcqqpCrhjjp" 195 | }, 196 | "source": [ 197 | "The head() function simply shows the first 5 rows of our Data Frame. 
If we wanted to show the entire Data Frame we would simply write the following:" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": null, 203 | "metadata": { 204 | "colab": {}, 205 | "colab_type": "code", 206 | "id": "clB7ZnfOhjjq", 207 | "outputId": "32b34b9e-c481-4c46-b443-488ac7097c55" 208 | }, 209 | "outputs": [], 210 | "source": [ 211 | "# Output entire Data Frame\n", 212 | "df" 213 | ] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "metadata": { 218 | "colab_type": "text", 219 | "id": "-F1DVbu4hjju" 220 | }, 221 | "source": [ 222 | "As you can see, we have a 2-Dimensional object where each row is an independent observation of our cartwheel data.\n", 223 | "\n", 224 | "To gather more information regarding the data, we can view the column names and data types of each column with the following functions:" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": null, 230 | "metadata": { 231 | "colab": {}, 232 | "colab_type": "code", 233 | "id": "pEdgVYnDhjjv", 234 | "outputId": "3c4a5edc-e29d-4665-b58c-b2e9fa125442" 235 | }, 236 | "outputs": [], 237 | "source": [ 238 | "df.columns" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "metadata": { 244 | "colab_type": "text", 245 | "id": "oeR8lBkmhjjz" 246 | }, 247 | "source": [ 248 | "Let's say we would like to slice our data frame and select only specific portions of our data. There are three different ways of doing so:\n", 249 | "\n", 250 | "1. .loc()\n", 251 | "2. .iloc()\n", 252 | "3. .ix() (deprecated; removed in pandas 1.0)\n", 253 | "\n", 254 | "We will cover the .loc() and .iloc() slicing functions.\n", 255 | "\n", 256 | "### .loc()\n", 257 | ".loc() takes two selectors, each a single label, a list, or a range, separated by ','. The first one selects the rows and the second one selects the columns." 
258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "metadata": { 264 | "colab": {}, 265 | "colab_type": "code", 266 | "id": "HpUEXXovhjj0", 267 | "outputId": "d4f26af0-9769-4fc1-c7ae-679e3dbf7c69" 268 | }, 269 | "outputs": [], 270 | "source": [ 271 | "# Return all observations of CWDistance\n", 272 | "df.loc[:,\"CWDistance\"]" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": null, 278 | "metadata": { 279 | "colab": {}, 280 | "colab_type": "code", 281 | "id": "vmZyHBk_hjj4", 282 | "outputId": "3d76358a-cd47-43bb-fd69-9f858f03add1" 283 | }, 284 | "outputs": [], 285 | "source": [ 286 | "# Select all rows for multiple columns, [\"CWDistance\", \"Height\", \"Wingspan\"]\n", 287 | "df.loc[:,[\"CWDistance\", \"Height\", \"Wingspan\"]]" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": null, 293 | "metadata": { 294 | "colab": {}, 295 | "colab_type": "code", 296 | "id": "xi9U34Kwhjj8", 297 | "outputId": "e6413dbf-1ffe-4af4-e6b9-95e6d7c56dc0" 298 | }, 299 | "outputs": [], 300 | "source": [ 301 | "# Select the first few rows for multiple columns, [\"CWDistance\", \"Height\", \"Wingspan\"]\n", 302 | "df.loc[:9, [\"CWDistance\", \"Height\", \"Wingspan\"]]" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": null, 308 | "metadata": { 309 | "colab": {}, 310 | "colab_type": "code", 311 | "id": "8SACpAZDhjkA", 312 | "outputId": "79f0f222-85ed-4224-8242-69a8ae2742ff" 313 | }, 314 | "outputs": [], 315 | "source": [ 316 | "# Select range of rows for all columns\n", 317 | "df.loc[10:15]" 318 | ] 319 | }, 320 | { 321 | "cell_type": "markdown", 322 | "metadata": { 323 | "colab_type": "text", 324 | "id": "ALS06G3rhjkC" 325 | }, 326 | "source": [ 327 | "The .loc() function takes two arguments: the indices of the rows and the column names you wish to observe.\n", 328 | "\n", 329 | "In the above case **:** specifies all rows, and our column is **CWDistance**. 
df.loc[**:**,**\"CWDistance\"**]" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": { 335 | "colab_type": "text", 336 | "id": "DG7GYn4nhjkE" 337 | }, 338 | "source": [ 339 | "Now, let's say we only want to return the first 10 observations:" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": null, 345 | "metadata": { 346 | "colab": {}, 347 | "colab_type": "code", 348 | "id": "rKfblg4KhjkE", 349 | "outputId": "b3f4df29-8b8e-4a58-e69c-403dcd5a4dc5" 350 | }, 351 | "outputs": [], 352 | "source": [ 353 | "df.loc[:9, \"CWDistance\"]" 354 | ] 355 | }, 356 | { 357 | "cell_type": "markdown", 358 | "metadata": { 359 | "colab_type": "text", 360 | "id": "DhluNGL1hjkI" 361 | }, 362 | "source": [ 363 | "### .iloc()\n", 364 | ".iloc() is integer-based slicing, whereas .loc() uses labels/column names. Here are some examples:" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": null, 370 | "metadata": { 371 | "colab": {}, 372 | "colab_type": "code", 373 | "id": "6u1A-2drhjkJ", 374 | "outputId": "1eaf5856-74d9-4b2e-d8a4-93e163973cae" 375 | }, 376 | "outputs": [], 377 | "source": [ 378 | "df.iloc[:4]" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": null, 384 | "metadata": { 385 | "colab": {}, 386 | "colab_type": "code", 387 | "id": "L7U4Db6WhjkM", 388 | "outputId": "4068bbfc-c246-4847-bcd6-bce52656a71b" 389 | }, 390 | "outputs": [], 391 | "source": [ 392 | "df.iloc[1:5, 2:4]" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": null, 398 | "metadata": { 399 | "colab": {}, 400 | "colab_type": "code", 401 | "id": "Cot4DwdUhjkQ", 402 | "outputId": "52430bbe-8bab-48ac-b43d-deba59750797" 403 | }, 404 | "outputs": [], 405 | "source": [ 406 | "df.iloc[1:5, df.columns.get_indexer([\"Gender\", \"GenderGroup\"])]  # .iloc() needs integer positions, not labels" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": { 412 | "colab_type": "text", 413 | "id": 
"FIekmzj6hjkT" 414 | }, 415 | "source": [ 416 | "We can view the 
data types of our data frame columns by calling .dtypes on our data frame:" 417 | ] 418 | }, 419 | { 420 | "cell_type": "code", 421 | "execution_count": null, 422 | "metadata": { 423 | "colab": {}, 424 | "colab_type": "code", 425 | "id": "VVvtjY7ghjkU", 426 | "outputId": "4cb2e4f1-19f7-46f8-b5f4-3d1d3b779c67" 427 | }, 428 | "outputs": [], 429 | "source": [ 430 | "df.dtypes" 431 | ] 432 | }, 433 | { 434 | "cell_type": "markdown", 435 | "metadata": { 436 | "colab_type": "text", 437 | "id": "fxzHKNfKhjkX" 438 | }, 439 | "source": [ 440 | "The output indicates we have integers, floats, and objects within our Data Frame.\n", 441 | "\n", 442 | "We may also want to observe the different unique values within a specific column. Let's do this for Gender:" 443 | ] 444 | }, 445 | { 446 | "cell_type": "code", 447 | "execution_count": null, 448 | "metadata": { 449 | "colab": {}, 450 | "colab_type": "code", 451 | "id": "brIC2kbKhjkZ", 452 | "outputId": "b3c7b6f1-3c6b-4145-ff19-a3437e298212" 453 | }, 454 | "outputs": [], 455 | "source": [ 456 | "# List unique values in the df['Gender'] column\n", 457 | "df.Gender.unique()" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": null, 463 | "metadata": { 464 | "colab": {}, 465 | "colab_type": "code", 466 | "id": "Js4ikCVWhjkc", 467 | "outputId": "4cfbdf05-044e-4c56-9430-d392a589e9ee" 468 | }, 469 | "outputs": [], 470 | "source": [ 471 | "# Let's explore df[\"GenderGroup\"] as well\n", 472 | "df.GenderGroup.unique()" 473 | ] 474 | }, 475 | { 476 | "cell_type": "markdown", 477 | "metadata": { 478 | "colab_type": "text", 479 | "id": "a3pqgpifhjkf" 480 | }, 481 | "source": [ 482 | "It seems that these fields may serve the same purpose, which is to specify male vs. female. 
Let's check this quickly by observing only these two columns:" 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": null, 488 | "metadata": { 489 | "colab": {}, 490 | "colab_type": "code", 491 | "id": "8Oqj-XOghjkf", 492 | "outputId": "a02d15e9-bd6c-4f41-aa26-10dc0d8697e4" 493 | }, 494 | "outputs": [], 495 | "source": [ 496 | "# Use .loc() to specify a list of multiple column names\n", 497 | "df.loc[:,[\"Gender\", \"GenderGroup\"]]" 498 | ] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": { 503 | "colab_type": "text", 504 | "id": "n0S7vCwphjkj" 505 | }, 506 | "source": [ 507 | "From eyeballing the output, it seems to check out. We can streamline this by utilizing the groupby() and size() functions." 508 | ] 509 | }, 510 | { 511 | "cell_type": "code", 512 | "execution_count": null, 513 | "metadata": { 514 | "colab": {}, 515 | "colab_type": "code", 516 | "id": "nNvUQetJhjkj", 517 | "outputId": "3eedb9e8-0d1a-4a1f-ff65-f7ee4a178c5e" 518 | }, 519 | "outputs": [], 520 | "source": [ 521 | "df.groupby(['Gender','GenderGroup']).size()" 522 | ] 523 | }, 524 | { 525 | "cell_type": "markdown", 526 | "metadata": { 527 | "colab_type": "text", 528 | "id": "7bHLzOH2hjkn" 529 | }, 530 | "source": [ 531 | "This output indicates that we have two types of combinations. \n", 532 | "\n", 533 | "* Case 1: Gender = F & GenderGroup = 1 \n", 534 | "* Case 2: Gender = M & GenderGroup = 2 \n", 535 | "\n", 536 | "This validates our initial assumption that these two fields essentially portray the same information." 
537 | ] 538 | } 539 | ], 540 | "metadata": { 541 | "colab": { 542 | "name": "Introduction to Libraries and Data Management.ipynb", 543 | "provenance": [], 544 | "version": "0.3.2" 545 | }, 546 | "kernelspec": { 547 | "display_name": "Python 2", 548 | "language": "python", 549 | "name": "python2" 550 | }, 551 | "language_info": { 552 | "codemirror_mode": { 553 | "name": "ipython", 554 | "version": 2 555 | }, 556 | "file_extension": ".py", 557 | "mimetype": "text/x-python", 558 | "name": "python", 559 | "nbconvert_exporter": "python", 560 | "pygments_lexer": "ipython2", 561 | "version": "2.7.15" 562 | } 563 | }, 564 | "nbformat": 4, 565 | "nbformat_minor": 1 566 | } 567 | -------------------------------------------------------------------------------- /Fitting Statistical Models to Data with Python/Week2/utf-8''week2_nhanes_condensed_tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "Drk8n9Fugd9R" 8 | }, 9 | "source": [ 10 | "# Linear Regression with NHANES Data\n", 11 | "\n", 12 | "This tutorial will be taking an excerpt from the NHANES case study provided in this week and reviewing the linear regression portion. We will cover model parameters such as coefficients, r-squared, and correlation. 
Additionally, we will construct models using more than one predictor, introduce how categorical variables are handled, and generate visualizations of our models.\n", 13 | "\n", 14 | "As with our previous work, we will be using the\n", 15 | "[Pandas](http://pandas.pydata.org) library for data management, the\n", 16 | "[Numpy](http://www.numpy.org) library for numerical calculations, and\n", 17 | "the [Statsmodels](http://www.statsmodels.org) library for statistical\n", 18 | "modeling.\n", 19 | "\n", 20 | "We begin by importing the libraries that we will be using:" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": { 27 | "colab": {}, 28 | "colab_type": "code", 29 | "id": "hKJ2797Egd9T" 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "%matplotlib inline\n", 34 | "import matplotlib.pyplot as plt\n", 35 | "import seaborn as sns\n", 36 | "import pandas as pd\n", 37 | "import statsmodels.api as sm\n", 38 | "import numpy as np" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "metadata": { 45 | "colab": {}, 46 | "colab_type": "code", 47 | "id": "VjPLPhj1gd9W" 48 | }, 49 | "outputs": [], 50 | "source": [ 51 | "url = \"nhanes_2015_2016.csv\"\n", 52 | "da = pd.read_csv(url)" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "metadata": { 59 | "colab": {}, 60 | "colab_type": "code", 61 | "id": "yCAc2JGegd9Z" 62 | }, 63 | "outputs": [], 64 | "source": [ 65 | "# Drop unused columns, drop rows with any missing values.\n", 66 | "vars = [\"BPXSY1\", \"RIDAGEYR\", \"RIAGENDR\", \"RIDRETH1\", \"DMDEDUC2\", \"BMXBMI\", \"SMQ020\"]\n", 67 | "da = da[vars].dropna()" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "metadata": { 74 | "colab": {}, 75 | "colab_type": "code", 76 | "id": "0bk9HZGMgd9b", 77 | "outputId": "57ad69ff-584a-49aa-f019-e08e4b7b54fc" 78 | }, 79 | "outputs": [], 80 | "source": [ 81 | "da.head()" 82 | ] 83 | }, 84 | { 85 
| "cell_type": "markdown", 86 | "metadata": { 87 | "colab_type": "text", 88 | "id": "bn3-A4H6gd9g" 89 | }, 90 | "source": [ 91 | "## Linear regression\n", 92 | "\n", 93 | "\n", 94 | "### Simple Linear Regression with One Covariate" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": { 101 | "colab": {}, 102 | "colab_type": "code", 103 | "id": "XfDp1slKgd9h", 104 | "outputId": "287ba8c7-859b-4e85-b354-506c278039f3" 105 | }, 106 | "outputs": [], 107 | "source": [ 108 | "### OLS Model of BPXSY1 with RIDAGEYR\n", 109 | "model = sm.OLS.from_formula(\"BPXSY1 ~ RIDAGEYR\", data=da)\n", 110 | "result = model.fit()\n", 111 | "result.summary()" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": { 118 | "colab": {}, 119 | "colab_type": "code", 120 | "id": "ykkYJIjogd9l", 121 | "outputId": "ee9589b1-3004-4ce5-ab4d-7ea3a8055fbf" 122 | }, 123 | "outputs": [], 124 | "source": [ 125 | "da.BPXSY1.std()" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": { 131 | "colab_type": "text", 132 | "id": "rs88E0iMgd9n" 133 | }, 134 | "source": [ 135 | "### R-squared and correlation\n", 136 | "\n", 137 | "The primary summary statistic for assessing the strength of a\n", 138 | "predictive relationship in a regression model is the *R-squared*, which is\n", 139 | "shown to be 0.207 in the regression output above. This means that 21%\n", 140 | "of the variation in SBP is explained by age. Note that this value is\n", 141 | "exactly the same as the squared Pearson correlation coefficient\n", 142 | "between SBP and age, as shown below." 
143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "metadata": { 149 | "colab": {}, 150 | "colab_type": "code", 151 | "id": "ziJPW3M-gd9p", 152 | "outputId": "1382b4aa-a528-4f30-a575-6399995813ff" 153 | }, 154 | "outputs": [], 155 | "source": [ 156 | "cc = da[[\"BPXSY1\", \"RIDAGEYR\"]].corr()\n", 157 | "print(cc.BPXSY1.RIDAGEYR**2)" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": { 163 | "colab_type": "text", 164 | "id": "1AmPwraTgd9s" 165 | }, 166 | "source": [ 167 | "### Adding a Second Predictor\n", 168 | "\n", 169 | "Now we will add gender to our initial model so we have two predictors, age and gender." 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": null, 175 | "metadata": { 176 | "colab": {}, 177 | "colab_type": "code", 178 | "id": "ztS5b0ptgd9t" 179 | }, 180 | "outputs": [], 181 | "source": [ 182 | "# Create a labeled version of the gender variable\n", 183 | "da[\"RIAGENDRx\"] = da.RIAGENDR.replace({1: \"Male\", 2: \"Female\"})" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": { 190 | "colab": {}, 191 | "colab_type": "code", 192 | "id": "1VSTpMLWgd9u", 193 | "outputId": "8ad2ccc3-cb12-48f0-d4da-2aeddbf638b7" 194 | }, 195 | "outputs": [], 196 | "source": [ 197 | "model = sm.OLS.from_formula(\"BPXSY1 ~ RIDAGEYR + RIAGENDRx\", data=da)\n", 198 | "result = model.fit()\n", 199 | "result.summary()" 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": { 205 | "colab_type": "text", 206 | "id": "xfPLqR0agd9y" 207 | }, 208 | "source": [ 209 | "The syntax `RIDAGEYR + RIAGENDRx` in the cell above does not mean\n", 210 | "that these two variables are literally added together. 
Instead,\n", 211 | "it means that these variables are both included in the model as\n", 212 | "predictors of blood pressure (`BPXSY1`).\n", 213 | "\n", 214 | "The model that was fit above uses both age and gender to explain the\n", 215 | "variation in SBP. It finds that two people with the same gender whose\n", 216 | "ages differ by one year tend to have blood pressure values differing\n", 217 | "by 0.47 units, which is essentially the same age coefficient that we found above in\n", 218 | "the model based on age alone. This model also shows us that comparing\n", 219 | "a man and a woman of the same age, the man will on average have 3.23 units\n", 220 | "greater SBP.\n", 221 | "\n", 222 | "It is very important to emphasize that the age coefficient of 0.47 is\n", 223 | "only meaningful when comparing two people of the same gender, and the\n", 224 | "gender coefficient of 3.23 is only meaningful when comparing two\n", 225 | "people of the same age.\n", 226 | "Moreover, these effects are additive, meaning that if we compare, say, a 50 year\n", 227 | "old man to a 40 year old woman, the man's blood pressure will on\n", 228 | "average be around 3.23 + 10*0.47 = 7.93 units higher, with the first\n", 229 | "term in this sum being attributable to gender, and the second term\n", 230 | "being attributable to age.\n", 231 | "\n", 232 | "We noted above that the regression coefficient for age did not change\n", 233 | "by much when we added gender to the model. It is important to note,\n", 234 | "however, that in general the estimated coefficient of a variable in a\n", 235 | "regression model will change when other variables are added or\n", 236 | "removed. 
We see here that a coefficient is nearly unchanged if any\n", 237 | "variables that are added to or removed from the model are\n", 238 | "approximately uncorrelated with the other covariates that are already\n", 239 | "in the model.\n", 240 | "\n", 241 | "Below we confirm that gender and age are nearly uncorrelated in this\n", 242 | "data set (the correlation of around -0.02 is negligible):" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": { 249 | "colab": {}, 250 | "colab_type": "code", 251 | "id": "pQXMD09lgd9z", 252 | "outputId": "a7d887fa-c813-4d90-8c8b-0bc3df23c289" 253 | }, 254 | "outputs": [], 255 | "source": [ 256 | "# We need to use the original, numerical version of the gender\n", 257 | "# variable to calculate the correlation coefficient.\n", 258 | "da[[\"RIDAGEYR\", \"RIAGENDR\"]].corr()" 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "metadata": { 264 | "colab_type": "text", 265 | "id": "k0Tl6mtngd92" 266 | }, 267 | "source": [ 268 | "### A model with three variables\n", 269 | "\n", 270 | "Next we add a third variable, body mass index (BMI), to the model predicting SBP.\n", 271 | "[BMI](https://en.wikipedia.org/wiki/Body_mass_index) is a measure that is used\n", 272 | "to assess if a person has healthy weight given their height.\n", 273 | "[BMXBMI](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm#BMXBMI)\n", 274 | "is the NHANES variable containing the BMI value for each subject." 
275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | "metadata": { 281 | "colab": {}, 282 | "colab_type": "code", 283 | "id": "mxFsbNOFgd93", 284 | "outputId": "d6324212-40a1-43ac-b872-c5c037d4dcbf" 285 | }, 286 | "outputs": [], 287 | "source": [ 288 | "model = sm.OLS.from_formula(\"BPXSY1 ~ RIDAGEYR + BMXBMI + RIAGENDRx\", data=da)\n", 289 | "result = model.fit()\n", 290 | "result.summary()" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": { 296 | "colab_type": "text", 297 | "id": "NmdQDd1ugd96" 298 | }, 299 | "source": [ 300 | "Not surprisingly, BMI is positively associated with SBP. Given two\n", 301 | "subjects with the same gender and age, and whose BMI differs by 1\n", 302 | "unit, the person with greater BMI will have, on average, 0.31 units\n", 303 | "greater systolic blood pressure (SBP). Also note that after adding\n", 304 | "BMI to the model, the coefficient for gender became somewhat greater.\n", 305 | "This is due to the fact that the three covariates in the model, age,\n", 306 | "gender, and BMI, are mutually correlated, as shown next:" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": null, 312 | "metadata": { 313 | "colab": {}, 314 | "colab_type": "code", 315 | "id": "BweSe7ozgd97", 316 | "outputId": "fc64c74e-d096-4577-c666-9ca1d0048a8c" 317 | }, 318 | "outputs": [], 319 | "source": [ 320 | "da[[\"RIDAGEYR\", \"RIAGENDR\", \"BMXBMI\"]].corr()" 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": { 326 | "colab_type": "text", 327 | "id": "Ksj41HDQgd-B" 328 | }, 329 | "source": [ 330 | "Although the correlations among these three variables are not strong,\n", 331 | "they are sufficient to induce fairly substantial differences in the\n", 332 | "regression coefficients (e.g. the gender coefficient changes from 3.23\n", 333 | "to 3.58). 
In this example, the gender effect becomes larger after we\n", 334 | "control for BMI - we can take this to mean that BMI was masking part\n", 335 | "of the association between gender and blood pressure. In other settings, including\n", 336 | "additional covariates can reduce the association between a covariate\n", 337 | "and an outcome.\n", 338 | "\n", 339 | "### Visualization of the Fitted Models\n", 340 | "\n", 341 | "In this section we demonstrate some graphing techniques that can be\n", 342 | "used to gain a better understanding of a regression model that has\n", 343 | "been fit to data." 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": null, 349 | "metadata": { 350 | "colab": {}, 351 | "colab_type": "code", 352 | "id": "bo4n664cgd-C", 353 | "outputId": "d52b0012-b92a-4db8-8e9e-115ba106c0ca" 354 | }, 355 | "outputs": [], 356 | "source": [ 357 | "from statsmodels.sandbox.predict_functional import predict_functional\n", 358 | "\n", 359 | "# Fix certain variables at reference values. Not all of these\n", 360 | "# variables are used here, but we provide them with a value anyway\n", 361 | "# to prevent a warning message from appearing.\n", 362 | "values = {\"RIAGENDRx\": \"Female\", \"RIAGENDR\": 1, \"BMXBMI\": 25,\n", 363 | " \"DMDEDUC2\": 1, \"RIDRETH1\": 1, \"SMQ020\": 1}\n", 364 | "\n", 365 | "pr, cb, fv = predict_functional(result, \"RIDAGEYR\",\n", 366 | " values=values, ci_method=\"simultaneous\")\n", 367 | "\n", 368 | "# Pass x and y by keyword; recent seaborn releases no longer accept them positionally.\n", 369 | "ax = sns.lineplot(x=fv, y=pr, lw=4)\n", 369 | "ax.fill_between(fv, cb[:, 0], cb[:, 1], color='grey', alpha=0.4)\n", 370 | "ax.set_xlabel(\"Age\")\n", 371 | "_ = ax.set_ylabel(\"SBP\")" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "metadata": { 377 | "colab_type": "text", 378 | "id": "mxsw1flSgd-H" 379 | }, 380 | "source": [ 381 | "The analogous plot for BMI is shown next. 
Here we fix the\n", 382 | "gender as \"female\" and the age at 50, so we are looking\n", 383 | "at the relationship between expected SBP and BMI for women\n", 384 | "of age 50." 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": null, 390 | "metadata": { 391 | "colab": {}, 392 | "colab_type": "code", 393 | "id": "TPwn6l0Jgd-H", 394 | "outputId": "84544c55-ec66-4391-dc0f-86a62e2b8fa2" 395 | }, 396 | "outputs": [], 397 | "source": [ 398 | "del values[\"BMXBMI\"]\n", 399 | "values[\"RIDAGEYR\"] = 50\n", 400 | "pr, cb, fv = predict_functional(result, \"BMXBMI\",\n", 401 | " values=values, ci_method=\"simultaneous\")\n", 402 | "\n", 403 | "ax = sns.lineplot(x=fv, y=pr, lw=4)\n", 404 | "ax.fill_between(fv, cb[:, 0], cb[:, 1], color='grey', alpha=0.4)\n", 405 | "ax.set_xlabel(\"BMI\")\n", 406 | "_ = ax.set_ylabel(\"SBP\")" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": { 412 | "colab_type": "text", 413 | "id": "ei8VexJogd-K" 414 | }, 415 | "source": [ 416 | "Below we show the plot of residuals on fitted values for the NHANES\n", 417 | "data. It appears that we have a modestly increasing mean/variance\n", 418 | "relationship. That is, the scatter around the mean blood pressure is\n", 419 | "greater when the mean blood pressure itself is greater." 
420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": null, 425 | "metadata": { 426 | "colab": {}, 427 | "colab_type": "code", 428 | "id": "1uqfmjxbgd-M", 429 | "outputId": "1f6a0c57-b2b5-45ef-9df3-bc22c77917ba" 430 | }, 431 | "outputs": [], 432 | "source": [ 433 | "pp = sns.scatterplot(x=result.fittedvalues, y=result.resid)\n", 434 | "pp.set_xlabel(\"Fitted values\")\n", 435 | "_ = pp.set_ylabel(\"Residuals\")" 436 | ] 437 | }, 438 | { 439 | "cell_type": "markdown", 440 | "metadata": { 441 | "colab_type": "text", 442 | "id": "js_aaKzqgd-P" 443 | }, 444 | "source": [ 445 | "A \"component plus residual plot\" or \"partial residual plot\" is\n", 446 | "intended to show how the data would look if all but one covariate\n", 447 | "could be fixed at reference values. By controlling the values of\n", 448 | "these covariates, all remaining variation is due either to the \"focus\n", 449 | "variable\" (the one variable that is left unfixed, and is plotted on\n", 450 | "the horizontal axis), or to sources of variation that are unexplained\n", 451 | "by any of our covariates.\n", 452 | "\n", 453 | "For example, the partial residual plot below shows how age (horizontal\n", 454 | "axis) and SBP (vertical axis) would be related if gender and BMI were\n", 455 | "fixed. Note that the origin of the vertical axis in these plots is\n", 456 | "not meaningful (we are not implying that anyone's blood pressure would\n", 457 | "be negative), but the differences along the vertical axis are\n", 458 | "meaningful. This plot implies that when BMI and gender are held\n", 459 | "fixed, the average blood pressures of an 80 and 18 year old differ by\n", 460 | "around 30 mm Hg. This plot also shows, as discussed above,\n", 461 | "that the deviations from the\n", 462 | "mean are somewhat smaller at the low end of the range compared to the\n", 463 | "high end of the range. 
We also see that at the high end of the range, the\n", 464 | "deviations from the mean are somewhat right-skewed, with\n", 465 | "exceptionally high SBP values being more common than exceptionally low SBP values." 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": null, 471 | "metadata": { 472 | "colab": {}, 473 | "colab_type": "code", 474 | "id": "XbRQLUd7gd-Q", 475 | "outputId": "044c8133-9214-4e7d-a7a6-65bed2eefa72" 476 | }, 477 | "outputs": [], 478 | "source": [ 479 | "from statsmodels.graphics.regressionplots import plot_ccpr\n", 480 | "\n", 481 | "ax = plt.axes()\n", 482 | "plot_ccpr(result, \"RIDAGEYR\", ax)\n", 483 | "_ = ax.lines[0].set_alpha(0.2) # Reduce overplotting with transparency" 484 | ] 485 | }, 486 | { 487 | "cell_type": "markdown", 488 | "metadata": { 489 | "colab_type": "text", 490 | "id": "Rsh67Jfygd-T" 491 | }, 492 | "source": [ 493 | "Next we have a partial residual plot that shows how BMI (horizontal\n", 494 | "axis) and SBP (vertical axis) would be related if gender and age were\n", 495 | "fixed. Compared to the plot above, we see here that age is more\n", 496 | "uniformly distributed than BMI. Also, it appears that there is more\n", 497 | "scatter in the partial residuals for BMI compared to what we saw above\n", 498 | "for age. Thus there seems to be less information about SBP in BMI,\n", 499 | "although a trend certainly exists." 
500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": null, 505 | "metadata": { 506 | "colab": {}, 507 | "colab_type": "code", 508 | "id": "YAhOQlzfgd-U", 509 | "outputId": "da1a8224-7f21-4804-9024-537fb73bf7c7" 510 | }, 511 | "outputs": [], 512 | "source": [ 513 | "ax = plt.axes()\n", 514 | "plot_ccpr(result, \"BMXBMI\", ax)\n", 515 | "_ = ax.lines[0].set_alpha(0.2)" 516 | ] 517 | } 518 | ], 519 | "metadata": { 520 | "colab": { 521 | "collapsed_sections": [], 522 | "name": "Linear Regression NHANES Walkthrough.ipynb", 523 | "provenance": [], 524 | "version": "0.3.2" 525 | }, 526 | "kernelspec": { 527 | "display_name": "Python 3", 528 | "language": "python", 529 | "name": "python3" 530 | }, 531 | "language_info": { 532 | "codemirror_mode": { 533 | "name": "ipython", 534 | "version": 3 535 | }, 536 | "file_extension": ".py", 537 | "mimetype": "text/x-python", 538 | "name": "python", 539 | "nbconvert_exporter": "python", 540 | "pygments_lexer": "ipython3", 541 | "version": "3.6.3" 542 | } 543 | }, 544 | "nbformat": 4, 545 | "nbformat_minor": 1 546 | } 547 | -------------------------------------------------------------------------------- /Understanding & Visualizing Data with Python/Week1/utf-8''nhanes_data_basics.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Using Python to read data files and explore their contents\n", 8 | "\n", 9 | "This notebook demonstrates using the [Pandas](http://pandas.pydata.org) data processing library to read a dataset into Python, and obtain a basic understanding of its contents.\n", 10 | "\n", 11 | "Note that Python by itself is a general-purpose programming language and does not provide high-level data processing capabilities. The Pandas library was developed to meet this need. 
Pandas is the most popular Python library for data manipulation, and we will use it extensively in this course.\n", 12 | "\n", 13 | "In addition to Pandas, we will also make use of the following Python libraries:\n", 14 | "\n", 15 | "* [Numpy](http://www.numpy.org) is a library for working with arrays of data\n", 16 | "\n", 17 | "* [Matplotlib](https://matplotlib.org) is a library for making graphs\n", 18 | "\n", 19 | "* [Seaborn](https://seaborn.pydata.org) is a higher-level interface to Matplotlib that can be used to simplify many graphing tasks\n", 20 | "\n", 21 | "* [Statsmodels](https://www.statsmodels.org/stable/index.html) is a library that implements many statistical techniques\n", 22 | "\n", 23 | "* [Scipy](https://www.scipy.org) is a library of techniques for numerical and scientific computing\n" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "### Importing libraries\n", 31 | "\n", 32 | "When using Python, you must always begin your scripts by importing the libraries that you will be using. After importing a library, its functions can then be called from your code by prepending the library name to the function name. For example, to use the '`dot`' function from the '`numpy`' library, you would enter '`numpy.dot`'. To avoid repeatedly having to type the library name in your scripts, it is conventional to define a two- or three-letter abbreviation for each library, e.g. '`numpy`' is usually abbreviated as '`np`'. This allows us to use '`np.dot`' instead of '`numpy.dot`'. Similarly, the Pandas library is typically abbreviated as '`pd`'.\n", 33 | "\n", 34 | "The following statement imports the Pandas library, and gives it the abbreviated name 'pd'." 
35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 1, 40 | "metadata": { 41 | "collapsed": true 42 | }, 43 | "outputs": [], 44 | "source": [ 45 | "import pandas as pd" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "### Reading a data file\n", 53 | "\n", 54 | "We will be working with the NHANES (National Health and Nutrition Examination Survey) data from the 2015-2016 wave, which has been discussed earlier in this course. The raw data for this study are available here:\n", 55 | "\n", 56 | "https://wwwn.cdc.gov/nchs/nhanes/Default.aspx\n", 57 | "\n", 58 | "As in many large studies, the NHANES data are spread across multiple files. The NHANES files are stored in [SAS transport](https://v8doc.sas.com/sashtml/files/z0987199.htm) (Xport) format. This is a somewhat obscure format, and while Pandas is perfectly capable of reading the NHANES data directly from the xport files, accomplishing this task is a more advanced topic than we want to get into here. Therefore, for this course we have prepared some merged datasets in text/csv format." 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "Pandas is a large and powerful library. Here we will only use a few of its basic features. The main data structure that Pandas works with is called a \"data frame\". This is a two-dimensional table of data in which the rows typically represent cases (e.g. NHANES subjects), and the columns represent variables. Pandas also has a one-dimensional data structure called a `Series` that we will encounter occasionally.\n", 66 | "\n", 67 | "Pandas has a variety of functions named with the pattern '`read_xxx`' for reading data in different formats into Python. Right now we will focus on reading '`csv`' files, so we are using the '`read_csv`' function, which can read csv (and \"tsv\") format files that are exported from spreadsheet software like Excel. 
The '`read_csv`' function by default expects the first row of the data file to contain column names. \n", 68 | "\n", 69 | "Using '`read_csv`' in its default mode is fairly straightforward. There are many options to '`read_csv`' that are useful for handling less-common situations. For example, you would use the option `sep='\\t'` instead of the default `sep=','` if the fields of your data file are delimited by tabs instead of commas. See [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for the full documentation for '`read_csv`'.\n", 70 | "\n", 71 | "Pandas can read a data file over the internet when provided with a URL, or from the local filesystem when provided with a file path. In the Python script we will name the data set '`da`', i.e. this is the name of the Python variable that will hold the data frame after we have loaded it. \n", 72 | "\n", 73 | "The variable '`url`' holds a string (text) value that tells '`read_csv`' where the data are located. This can be an internet URL or, as in the cell below, a path to a file in your local filesystem, e.g. `pd.read_csv(\"my_file.csv\")` would read a file named `my_file.csv` that is located in your current working directory." 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 2, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "url = \"nhanes_2015_2016.csv\"\n", 83 | "da = pd.read_csv(url)" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "To confirm that we have actually obtained the data that we are expecting, we can display the shape (number of rows and columns) of the data frame in the notebook. Note that the final expression in any Jupyter notebook cell is automatically printed, but you can force other expressions to be printed by using the '`print`' function, e.g. 
'`print(da.shape)`'.\n", 91 | "\n", 92 | "Based on what we see below, the data set being read here has 5735 rows, corresponding to 5735 people in this wave of the NHANES study, and 28 columns, corresponding to 28 variables in this particular data file. Note that NHANES collects thousands of variables on each study subject, but here we are working with a reduced file that contains a limited number of variables." 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 3, 98 | "metadata": {}, 99 | "outputs": [ 100 | { 101 | "data": { 102 | "text/plain": [ 103 | "(5735, 28)" 104 | ] 105 | }, 106 | "execution_count": 3, 107 | "metadata": {}, 108 | "output_type": "execute_result" 109 | } 110 | ], 111 | "source": [ 112 | "da.shape" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "### Exploring the contents of a data set\n", 120 | "\n", 121 | "Pandas has a number of basic ways to understand what is in a data set. For example, above we used the '`shape`' attribute to determine the numbers of rows and columns in a data set. 
The columns in a Pandas data frame have names. To see the names, use the '`columns`' attribute:" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 4, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | "data": { 131 | "text/plain": [ 132 | "Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR',\n", 133 | " 'RIDRETH1', 'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR',\n", 134 | " 'SDMVPSU', 'SDMVSTRA', 'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2',\n", 135 | " 'BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC',\n", 136 | " 'BMXWAIST', 'HIQ210'],\n", 137 | " dtype='object')" 138 | ] 139 | }, 140 | "execution_count": 4, 141 | "metadata": {}, 142 | "output_type": "execute_result" 143 | } 144 | ], 145 | "source": [ 146 | "da.columns" 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "These names correspond to variables in the NHANES study. For example, `SEQN` is a unique identifier for one person, and `BMXWT` is the subject's weight in kilograms (\"BMX\" is the NHANES prefix for body measurements). The variables in the NHANES data set are documented in a set of \"codebooks\" that are available online. 
The codebooks for the 2015-2016 wave of NHANES can be found by following the links at the following page:\n", 154 | "\n", 155 | "https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2015\n", 156 | "\n", 157 | "For convenience, direct links to some of the code books are included below:\n", 158 | "\n", 159 | "* [Demographics code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm)\n", 160 | "\n", 161 | "* [Body measures code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm)\n", 162 | "\n", 163 | "* [Blood pressure code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.htm)\n", 164 | "\n", 165 | "* [Alcohol questionnaire code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/ALQ_I.htm)\n", 166 | "\n", 167 | "* [Smoking questionnaire code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/SMQ_I.htm)" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "Every variable in a Pandas data frame has a data type. There are many different data types, but most commonly you will encounter floating point values (real numbers), integers, strings (text), and date/time values. When Pandas reads a text/csv file, it guesses the data types based on what it sees in the first few rows of the data file. Usually it selects an appropriate type, but occasionally it does not. To confirm that the data types are consistent with what the variables represent, inspect the '`dtypes`' attribute of the data frame." 
175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": 5, 180 | "metadata": {}, 181 | "outputs": [ 182 | { 183 | "data": { 184 | "text/plain": [ 185 | "SEQN int64\n", 186 | "ALQ101 float64\n", 187 | "ALQ110 float64\n", 188 | "ALQ130 float64\n", 189 | "SMQ020 int64\n", 190 | "RIAGENDR int64\n", 191 | "RIDAGEYR int64\n", 192 | "RIDRETH1 int64\n", 193 | "DMDCITZN float64\n", 194 | "DMDEDUC2 float64\n", 195 | "DMDMARTL float64\n", 196 | "DMDHHSIZ int64\n", 197 | "WTINT2YR float64\n", 198 | "SDMVPSU int64\n", 199 | "SDMVSTRA int64\n", 200 | "INDFMPIR float64\n", 201 | "BPXSY1 float64\n", 202 | "BPXDI1 float64\n", 203 | "BPXSY2 float64\n", 204 | "BPXDI2 float64\n", 205 | "BMXWT float64\n", 206 | "BMXHT float64\n", 207 | "BMXBMI float64\n", 208 | "BMXLEG float64\n", 209 | "BMXARML float64\n", 210 | "BMXARMC float64\n", 211 | "BMXWAIST float64\n", 212 | "HIQ210 float64\n", 213 | "dtype: object" 214 | ] 215 | }, 216 | "execution_count": 5, 217 | "metadata": {}, 218 | "output_type": "execute_result" 219 | } 220 | ], 221 | "source": [ 222 | "da.dtypes" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "As we see here, most of the variables have floating point or integer data type. Unlike many data sets, NHANES does not use any text values in its data. For example, while many datasets would use text labels like \"F\" or \"M\" to denote a subject's gender, this information is represented in NHANES with integer codes. The actual meanings of these codes can be determined from the codebooks. For example, the variable `RIAGENDR` contains each subject's gender, with male gender coded as `1` and female gender coded as `2`. The `RIAGENDR` variable is part of the demographics component of NHANES, so this coding can be found in the demographics codebook.\n", 230 | "\n", 231 | "Variables like `BMXWT` which represent a quantitative measurement will typically be stored as floating point data values." 
232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "### Slicing a data set\n", 239 | "\n", 240 | "As discussed above, a Pandas data frame is a rectangular data table, in which the rows represent cases and the columns represent variables. One common manipulation of a data frame is to extract the data for one case or for one variable. There are several ways to do this, as shown below.\n", 241 | "\n", 242 | "To extract all the values for one variable, the following four approaches are equivalent (\"DMDEDUC2\" here is an NHANES variable containing a person's educational attainment). In these four lines of code, we are assigning the data from one column of the data frame `da` into new variables `w`, `x`, `y`, and `z`. The first three approaches access the variable by name. The fourth approach accesses the variable by position (note that `DMDEDUC2` is in position 9 of the `da.columns` array shown above -- remember that Python counts starting at position zero)." 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": 6, 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [ 251 | "w = da[\"DMDEDUC2\"]\n", 252 | "x = da.loc[:, \"DMDEDUC2\"]\n", 253 | "y = da.DMDEDUC2\n", 254 | "z = da.iloc[:, 9] # DMDEDUC2 is in column 9" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "Another reason to slice a variable out of a data frame is so that we can then pass it into a function. 
For example, we can find the maximum value over all `DMDEDUC2` values using any one of the following four lines of code:" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 7, 267 | "metadata": {}, 268 | "outputs": [ 269 | { 270 | "name": "stdout", 271 | "output_type": "stream", 272 | "text": [ 273 | "9.0\n", 274 | "9.0\n", 275 | "9.0\n", 276 | "9.0\n" 277 | ] 278 | } 279 | ], 280 | "source": [ 281 | "print(da[\"DMDEDUC2\"].max())\n", 282 | "print(da.loc[:, \"DMDEDUC2\"].max())\n", 283 | "print(da.DMDEDUC2.max())\n", 284 | "print(da.iloc[:, 9].max())" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "Every value in a Python program has a type, and the type information can be obtained using Python's '`type`' function. This can be useful, for example, if you are looking for the documentation associated with some value, but you do not know what the value's type is. \n", 292 | "\n", 293 | "Here we see that the variable `da` has type 'DataFrame', while one column of `da` has type 'Series'. As noted above, a Series is a Pandas data structure for holding a single column (or row) of data." 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": 8, 299 | "metadata": {}, 300 | "outputs": [ 301 | { 302 | "name": "stdout", 303 | "output_type": "stream", 304 | "text": [ 305 | "<class 'pandas.core.frame.DataFrame'>\n", 306 | "<class 'pandas.core.series.Series'>\n", 307 | "<class 'pandas.core.series.Series'>\n" 308 | ] 309 | } 310 | ], 311 | "source": [ 312 | "print(type(da)) # The type of the variable\n", 313 | "print(type(da.DMDEDUC2)) # The type of one column of the data frame\n", 314 | "print(type(da.iloc[2,:])) # The type of one row of the data frame" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "metadata": {}, 320 | "source": [ 321 | "It may also be useful to slice a row (case) out of a data frame. Just as a data frame's columns have names, the rows also have names, which are called the \"index\". 
However, many data sets do not have meaningful row names, so it is more common to extract a row of a data frame using its position. The `iloc` indexer slices rows or columns from a data frame by position (counting from 0). The following line of code extracts row 3 from the data set (counting from zero, so this is the fourth row)." 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 9, 327 | "metadata": { 328 | "collapsed": true 329 | }, 330 | "outputs": [], 331 | "source": [ 332 | "x = da.iloc[3, :]" 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "Another important data frame manipulation is to extract a contiguous block of rows or columns from the data set. Below we slice by position, in the first case taking row positions 3 and 4 (counting from 0, which are rows 4 and 5 counting from 1), and in the second case taking columns 2, 3, and 4 (columns 3, 4, 5 if counting from 1)." 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": 10, 345 | "metadata": {}, 346 | "outputs": [], 347 | "source": [ 348 | "x = da.iloc[3:5, :]\n", 349 | "y = da.iloc[:, 2:5]" 350 | ] 351 | }, 352 | { 353 | "cell_type": "markdown", 354 | "metadata": {}, 355 | "source": [ 356 | "### Missing values\n" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "When reading a dataset using Pandas, there is a set of values including 'NA', 'NULL', and 'NaN' that are taken by default to represent a missing value. The full list of default missing value codes is in the '`read_csv`' documentation [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). 
This document also explains how to change the way that '`read_csv`' decides whether a variable's value is missing.\n", 364 | "\n", 365 | "Pandas has functions called `isnull` and `notnull` that can be used to identify where the missing and non-missing values are located in a data frame. Below we use these functions to count the number of missing and non-missing `DMDEDUC2` values." 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": 11, 371 | "metadata": {}, 372 | "outputs": [ 373 | { 374 | "name": "stdout", 375 | "output_type": "stream", 376 | "text": [ 377 | "261\n", 378 | "5474\n" 379 | ] 380 | } 381 | ], 382 | "source": [ 383 | "print(pd.isnull(da.DMDEDUC2).sum())\n", 384 | "print(pd.notnull(da.DMDEDUC2).sum())" 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": {}, 390 | "source": [ 391 | "As an aside, note that there may be a variety of distinct forms of missingness in a variable, and in some cases it is important to keep these values distinct. For example, in the case of the `DMDEDUC2` variable, in addition to the blank or NA values that Pandas considers to be missing, three people responded \"don't know\" (code value 9). In many analyses, the \"don't know\" values will also be treated as missing, but at this point we are considering \"don't know\" to be a distinct category of observed response." 392 | ] 393 | } 394 | ], 395 | "metadata": { 396 | "kernelspec": { 397 | "display_name": "Python 3", 398 | "language": "python", 399 | "name": "python3" 400 | }, 401 | "language_info": { 402 | "codemirror_mode": { 403 | "name": "ipython", 404 | "version": 3 405 | }, 406 | "file_extension": ".py", 407 | "mimetype": "text/x-python", 408 | "name": "python", 409 | "nbconvert_exporter": "python", 410 | "pygments_lexer": "ipython3", 411 | "version": "3.6.3" 412 | } 413 | }, 414 | "nbformat": 4, 415 | "nbformat_minor": 1 416 | } 417 | --------------------------------------------------------------------------------