├── Course 1 - Understanding and Visualizing Data with Python ├── Week 1 - Introduction to Data │ ├── data_types.ipynb │ ├── introduction_jupyter.ipynb │ ├── libraries_data_management.ipynb │ ├── nhanes_data_basics.ipynb │ └── week1_python_resources.ipynb ├── Week 2 - Univariate Data │ ├── Tables_Histograms_and_Boxplots_in_Python.ipynb │ ├── nhanes_univariate_analyses.ipynb │ ├── nhanes_univariate_practice.ipynb │ ├── python_libraries.ipynb │ └── w2_assessment.ipynb ├── Week 3 - Multivariate Data │ ├── Multivariate_Data_Selection.ipynb │ ├── Multivariate_Distributions.ipynb │ ├── Unit_Testing.ipynb │ ├── nhanes_multivariate_analyses.ipynb │ ├── nhanes_multivariate_practice.ipynb │ ├── pizza_study_design.md │ └── w3_assessment.ipynb └── Week 4 - Populations and Samples │ ├── Empirical_Distribution.ipynb │ ├── Randomness_and_Reproducibility.ipynb │ ├── Sampling_from_a_Biased_Population.ipynb │ ├── lecture_notes.md │ └── nhanes_sampling_distributions.ipynb ├── Course 2 - Inferential Statistics with Python ├── .DS_Store ├── Week 1 - Overview & Inference Procedures │ ├── course1review.ipynb │ ├── functions_lambdas_help.ipynb │ ├── listsvsarrays.ipynb │ └── week1_assessment.ipynb ├── Week 2 - Confidence Intervals │ ├── Confidence_Intervals_Differences_Population_Parameters.ipynb │ ├── intro_confidence_intervals.ipynb │ ├── nhanes_confidence_intervals.ipynb │ ├── nhanes_confidence_intervals_practice.ipynb │ ├── slides │ │ ├── calculating_sample_sizes.png │ │ ├── confidence_interval_approaches.png │ │ ├── confidence_interval_difference_means_paired_data.png │ │ ├── confidence_interval_difference_means_pooled.png │ │ ├── confidence_interval_difference_means_unpooled.png │ │ ├── confidence_interval_difference_population_proportions.png │ │ ├── confidence_interval_one_mean.png │ │ ├── confidence_interval_one_population_proportion.png │ │ ├── difference_in_proportion_confidence_interval.png │ │ ├── pooled_confidence_interval_calucaltions.png │ │ ├── understanding_confidence_level.png │ │ └── unpooled_confidence_interval_calculations.png │ └── week2_assessment.ipynb ├── Week 3 - Hypothesis Testing │ ├── .DS_Store │ ├── Introduction to Hypothesis Testing in Python.ipynb │ ├── NHANES Hypothesis Testing Walkthrough.ipynb │ ├── chocolate_cycling_experiment_analysis.txt │ ├── nhanes_hypothesis_test_practice.ipynb │ ├── nhanes_hypothesis_testing.ipynb │ ├── slides │ │ ├── .DS_Store │ │ ├── alternative_tests_difference_population_proportion.png │ │ ├── assumptions_one_sample_mean_t_test.png │ │ ├── checking_sample_sizes_two_population_proportions.png │ │ ├── ddof_difference_population_means_independent.png │ │ ├── population_proportion_hypothesis_test.png │ │ ├── sample_size_check_population_proportion.png │ │ ├── t_test_one_sample_mean.png │ │ ├── t_test_one_sample_mean_normality.png │ │ ├── test_statisitic_one_sample_mean_assume_normality.png │ │ ├── test_statistic_difference_means_independent_pooled.png │ │ ├── test_statistic_difference_means_independent_unpooled.png │ │ ├── test_statistic_difference_two_population_proportions.png │ │ └── test_statistic_one_population_proportion.png │ └── week3_assessment.ipynb └── Week 4 - Learner Application │ ├── .DS_Store │ ├── .ipynb_checkpoints │ └── Week 4 Quiz-checkpoint.ipynb │ ├── Week 4 Quiz.ipynb │ └── slides │ ├── pooled_unpooled_assumptions_difference_means_proportions.png │ ├── standard_error_one_mean.png │ └── standard_error_one_proportion.png ├── Course 3 - Fitting Statistical Models to Data with Python ├── .DS_Store ├── Week 1 - Overview & Considerations 
for Statistical Modeling │   ├── modeling.ipynb │   └── python_libraries.ipynb ├── Week 2 - Fitting Models to Independent Data │   ├── w2_assessment.ipynb │   ├── week2_nhanes.ipynb │   ├── week2_nhanes_condensed_tutorial.ipynb │   └── week2_nhanes_practice.ipynb ├── Week 3 - Fitting Models to Dependent Data │   ├── .DS_Store │   ├── Autism_Multilevel_Marginal_Models.ipynb │   ├── slides │   │   ├── Multi-Level Models Intro.png │   │   ├── Random Effects and MLM names.png │   │   └── Why fit Multi-level models?.png │   ├── w3_assessment.ipynb │   ├── week3_nhanes.ipynb │   └── week3_nhanes_practice.ipynb └── Week 4 - Special Topics │   ├── .DS_Store │   ├── bayesian.ipynb │   └── slides │   ├── Great Quote.png │   ├── How to use sruvey weights in practice.png │   └── Should we use survey weights?.png ├── README.md └── gitignore /Course 1 - Understanding and Visualizing Data with Python/Week 1 - Introduction to Data/introduction_jupyter.ipynb: --------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "### What are Jupyter Notebooks?\n",
8 | "\n",
9 | "Jupyter is a web-based interactive development environment that supports multiple programming languages; however, it is most commonly used with the Python programming language.\n",
10 | "\n",
11 | "The interactive environment that Jupyter provides enables students, scientists, and researchers to create reproducible analyses and formulate a story within a single document.\n",
12 | "\n",
13 | "Let's take a look at an example of a completed Jupyter Notebook: [Example Notebook](http://nbviewer.jupyter.org/github/cossatot/lanf_earthquake_likelihood/blob/master/notebooks/lanf_manuscript_notebook.ipynb)"
14 | ]
15 | },
16 | {
17 | "cell_type": "markdown",
18 | "metadata": {},
19 | "source": [
20 | "### Jupyter Notebook Features\n",
21 | "\n",
22 | "* File Browser\n",
23 | "* Markdown Cells & Syntax\n",
24 | "* Kernels, Variables, & Environment\n",
25 | "* Command vs. Edit Mode & Shortcuts"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "### What is Markdown?\n",
33 | "\n",
34 | "Markdown is a markup language that uses plain text formatting syntax. This means that we can modify the formatting of our text with the use of various symbols on our keyboard as indicators.\n",
35 | "\n",
36 | "Some examples include:\n",
37 | "\n",
38 | "* Headers\n",
39 | "* Text modifications such as italics and bold\n",
40 | "* Ordered and Unordered lists\n",
41 | "* Links\n",
42 | "* Tables\n",
43 | "* Images\n",
44 | "* Etc.\n",
45 | "\n",
46 | "Now I'll showcase some examples of how this formatting is done:"
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "Headers:\n",
54 | "\n",
55 | "# H1\n",
56 | "## H2\n",
57 | "### H3\n",
58 | "#### H4\n",
59 | "##### H5\n",
60 | "###### H6"
61 | ]
62 | },
63 | {
64 | "cell_type": "markdown",
65 | "metadata": {},
66 | "source": [
67 | "Text modifications:\n",
68 | "\n",
69 | "Emphasis, aka italics, with *asterisks* or _underscores_.\n",
70 | "\n",
71 | "Strong emphasis, aka bold, with **asterisks** or __underscores__.\n",
72 | "\n",
73 | "Combined emphasis with **asterisks and _underscores_**.\n",
74 | "\n",
75 | "Strikethrough uses two tildes. ~~Scratch this.~~"
76 | ]
77 | },
78 | {
79 | "cell_type": "markdown",
80 | "metadata": {},
81 | "source": [
82 | "Lists:\n",
83 | "\n",
84 | "1. First ordered list item\n",
85 | "2. Another item\n",
86 | " * Unordered sub-list. 
\n", 87 | "1. Actual numbers don't matter, just that it's a number\n", 88 | " 1. Ordered sub-list\n", 89 | "4. And another item.\n", 90 | "\n", 91 | "* Unordered list can use asterisks\n", 92 | "- Or minuses\n", 93 | "+ Or pluses" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "Links:\n", 101 | "\n", 102 | "http://www.umich.edu\n", 103 | "\n", 104 | "\n", 105 | "\n", 106 | "[The University of Michigan's Homepage](www.http://umich.edu/)\n", 107 | "\n", 108 | "To look into more examples of Markdown syntax and features such as tables, images, etc. head to the following link: [Markdown Reference](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "### Kernels, Variables, and Environment\n", 116 | "\n", 117 | "A notebook kernel is a “computational engine” that executes the code contained in a Notebook document. There are kernels for various programming languages, however we are solely using the python kernel which executes python code.\n", 118 | "\n", 119 | "When a notebook is opened, the associated kernel is automatically launched for our convenience." 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 1, 125 | "metadata": {}, 126 | "outputs": [ 127 | { 128 | "name": "stdout", 129 | "output_type": "stream", 130 | "text": [ 131 | "This is a python code cell\n" 132 | ] 133 | } 134 | ], 135 | "source": [ 136 | "### This is python\n", 137 | "print(\"This is a python code cell\")" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "A kernel is the back-end of our notebook which not only executes our python code, but stores our initialized variables." 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 2, 150 | "metadata": {}, 151 | "outputs": [ 152 | { 153 | "name": "stdout", 154 | "output_type": "stream", 155 | "text": [ 156 | "x has been set to 1738\n" 157 | ] 158 | } 159 | ], 160 | "source": [ 161 | "### For example, lets initialize variable x\n", 162 | "\n", 163 | "x = 1738\n", 164 | "\n", 165 | "print(\"x has been set to \" + str(x))" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": 3, 171 | "metadata": {}, 172 | "outputs": [ 173 | { 174 | "name": "stdout", 175 | "output_type": "stream", 176 | "text": [ 177 | "1738\n" 178 | ] 179 | } 180 | ], 181 | "source": [ 182 | "### Print x\n", 183 | "\n", 184 | "print(x)" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "Issues arrise when we restart our kernel and attempt to run code with variables that have not been reinitialized.\n", 192 | "\n", 193 | "If the kernel is reset, make sure to rerun code where variables are intialized." 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 4, 199 | "metadata": {}, 200 | "outputs": [ 201 | { 202 | "name": "stdout", 203 | "output_type": "stream", 204 | "text": [ 205 | "What is your name? \n", 206 | "The name you entered is \n" 207 | ] 208 | } 209 | ], 210 | "source": [ 211 | "## We can also run code that accepts input\n", 212 | "\n", 213 | "name = input(\"What is your name? \")\n", 214 | "\n", 215 | "print(\"The name you entered is \" + name)" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "It is important to note that Jupyter Notebooks have in-line cell execution. 
{
219 | "cell_type": "markdown",
220 | "metadata": {},
221 | "source": [
222 | "It is important to note that Jupyter Notebooks have in-line cell execution. This means that a currently executing cell must complete its operations before another cell can be executed. A cell that is still executing is indicated by the [*] on the left-hand side of the cell."
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": 5,
228 | "metadata": {},
229 | "outputs": [
230 | {
231 | "name": "stdout",
232 | "output_type": "stream",
233 | "text": [
234 | "This won't print until all prior cells have finished executing.\n"
235 | ]
236 | }
237 | ],
238 | "source": [
239 | "print(\"This won't print until all prior cells have finished executing.\")"
240 | ]
241 | },
242 | {
243 | "cell_type": "markdown",
244 | "metadata": {},
245 | "source": [
246 | "### Command vs. Edit Mode & Shortcuts\n",
247 | "\n",
248 | "There is an edit mode and a command mode for Jupyter notebooks. The current mode is easily identifiable by the color of the left border of the cell.\n",
249 | "\n",
250 | "Blue = Command Mode.\n",
251 | "\n",
252 | "Green = Edit Mode.\n",
253 | "\n",
254 | "Command Mode can be entered by pressing **esc** on your keyboard, and Edit Mode by pressing **enter**.\n",
255 | "\n",
256 | "Commands can be used to execute notebook functions, such as changing the format of a markdown cell or adding line numbers.\n",
257 | "\n",
258 | "Let's toggle line numbers while in Command Mode by pressing **L**.\n",
259 | "\n",
260 | "#### Additional Shortcuts\n",
261 | "\n",
262 | "There are a lot of shortcuts that can be used to improve productivity while using Jupyter Notebooks.\n",
263 | "\n",
264 | "Here is a list:\n",
265 | "\n",
266 | "![Jupyter Notebook Shortcuts](img/shortcuts.png)"
267 | ]
268 | },
269 | {
270 | "cell_type": "markdown",
271 | "metadata": {},
272 | "source": [
273 | "### How do you install Jupyter Notebooks?\n",
274 | "\n",
275 | "**Note:** *Coursera provides embedded Jupyter notebooks within the course, so installing Jupyter is not required unless you wish to explore it further on your own computer.*\n",
276 | "\n",
277 | "Official Installation Guide: https://jupyter.readthedocs.io/en/latest/install.html\n",
278 | "\n",
279 | "Jupyter recommends utilizing Anaconda, which is a platform compatible with Windows, macOS, and Linux systems. 
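\n",
"\n",
"If you would rather not install the full Anaconda distribution, a lighter-weight sketch of the install (assuming Python and pip are already on your system) is:\n",
"\n",
"```bash\n",
"# Install the classic Jupyter Notebook package\n",
"pip install notebook\n",
"\n",
"# Then launch the notebook server from a terminal\n",
"jupyter notebook\n",
"```\n",
"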
\n", 280 | "\n", 281 | "Anaconda Download: https://www.anaconda.com/download/#macos" 282 | ] 283 | } 284 | ], 285 | "metadata": { 286 | "kernelspec": { 287 | "display_name": "Python 3", 288 | "language": "python", 289 | "name": "python3" 290 | }, 291 | "language_info": { 292 | "codemirror_mode": { 293 | "name": "ipython", 294 | "version": 3 295 | }, 296 | "file_extension": ".py", 297 | "mimetype": "text/x-python", 298 | "name": "python", 299 | "nbconvert_exporter": "python", 300 | "pygments_lexer": "ipython3", 301 | "version": "3.6.3" 302 | } 303 | }, 304 | "nbformat": 4, 305 | "nbformat_minor": 2 306 | } 307 | -------------------------------------------------------------------------------- /Course 1 - Understanding and Visualizing Data with Python/Week 1 - Introduction to Data/libraries_data_management.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "GPScbsDWhjjS" 8 | }, 9 | "source": [ 10 | "# Python Libraries\n", 11 | "\n", 12 | "Python, like other programming languages, has an abundance of additional modules or libraries that augument the base framework and functionality of the language.\n", 13 | "\n", 14 | "Think of a library as a collection of functions that can be accessed to complete certain programming tasks without having to write your own algorithm.\n", 15 | "\n", 16 | "For this course, we will focus primarily on the following libraries:\n", 17 | "\n", 18 | "* **Numpy** is a library for working with arrays of data.\n", 19 | "\n", 20 | "* **Pandas** provides high-performance, easy-to-use data structures and data analysis tools.\n", 21 | "\n", 22 | "* **Scipy** is a library of techniques for numerical and scientific computing.\n", 23 | "\n", 24 | "* **Matplotlib** is a library for making graphs.\n", 25 | "\n", 26 | "* **Seaborn** is a higher-level interface to Matplotlib that can be used to simplify many graphing tasks.\n", 27 | "\n", 28 | "* **Statsmodels** is a library that implements many statistical techniques.\n", 29 | "\n", 30 | "# Documentation\n", 31 | "\n", 32 | "Reliable and accesible documentation is an absolute necessity when it comes to knowledge transfer of programming languages. Luckily, python provides a significant amount of detailed documentation that explains the ins and outs of the language syntax, libraries, and more. \n", 33 | "\n", 34 | "Understanding how to read documentation is crucial for any programmer as it will serve as a fantastic resource when learning the intricacies of python.\n", 35 | "\n", 36 | "Here is the link to the documentation of the python standard library: [Python Standard Library](https://docs.python.org/3/library/index.html#library-index)" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": { 42 | "colab_type": "text", 43 | "id": "b9FN-daXhjjT" 44 | }, 45 | "source": [ 46 | "### Importing Libraries\n", 47 | "\n", 48 | "When using Python, you must always begin your scripts by importing the libraries that you will be using. 
\n", 49 | "\n", 50 | "The following statement imports the numpy and pandas library, and gives them abbreviated names:" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": { 57 | "colab": {}, 58 | "colab_type": "code", 59 | "id": "uRwVhX-YhjjU" 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "import numpy as np\n", 64 | "import pandas as pd" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": { 70 | "colab_type": "text", 71 | "id": "9oG4uoyAhjjX" 72 | }, 73 | "source": [ 74 | "### Utilizing Library Functions\n", 75 | "\n", 76 | "After importing a library, its functions can then be called from your code by prepending the library name to the function name. For example, to use the '`dot`' function from the '`numpy`' library, you would enter '`numpy.dot`'. To avoid repeatedly having to type the libary name in your scripts, it is conventional to define a two or three letter abbreviation for each library, e.g. '`numpy`' is usually abbreviated as '`np`'. This allows us to use '`np.dot`' instead of '`numpy.dot`'. Similarly, the Pandas library is typically abbreviated as '`pd`'." 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": { 82 | "colab_type": "text", 83 | "id": "fip4VOLMhjjY" 84 | }, 85 | "source": [ 86 | "The next cell shows how to call functions within an imported library:" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": null, 92 | "metadata": { 93 | "colab": {}, 94 | "colab_type": "code", 95 | "id": "9S03bwnkhjja", 96 | "outputId": "f3a621eb-281b-4f96-b4a9-ca112e4f8604" 97 | }, 98 | "outputs": [], 99 | "source": [ 100 | "a = np.array([0,1,2,3,4,5,6,7,8,9,10]) \n", 101 | "np.mean(a)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": { 107 | "colab_type": "text", 108 | "id": "ljjQpFYfhjje" 109 | }, 110 | "source": [ 111 | "As you can see, we used the mean() function within the numpy library to calculate the mean of the numpy 1-dimensional array." 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": { 117 | "colab_type": "text", 118 | "id": "auNcbdxVhjjf" 119 | }, 120 | "source": [ 121 | "# Data Management\n", 122 | "\n", 123 | "Data management is a crucial component to statistical analysis and data science work. The following code will show how to import data via the pandas library, view your data, and transform your data.\n", 124 | "\n", 125 | "The main data structure that Pandas works with is called a **Data Frame**. This is a two-dimensional table of data in which the rows typically represent cases (e.g. Cartwheel Contest Participants), and the columns represent variables. Pandas also has a one-dimensional data structure called a **Series** that we will encounter when accesing a single column of a Data Frame.\n", 126 | "\n", 127 | "Pandas has a variety of functions named '`read_xxx`' for reading data in different formats. Right now we will focus on reading '`csv`' files, which stands for comma-separated values. However the other file formats include excel, json, and sql just to name a few.\n", 128 | "\n", 129 | "This is a link to the .csv that we will be exploring in this tutorial: [Cartwheel Data](https://www.coursera.org/learn/understanding-visualization-data/resources/0rVxx) (Link goes to the dataset section of the Resources for this course)\n", 130 | "\n", 131 | "There are many other options to '`read_csv`' that are very useful. 
See [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for the full documentation for '`read_csv`'."
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {
137 | "colab_type": "text",
138 | "id": "MbGKgakihjjg"
139 | },
140 | "source": [
141 | "### Importing Data"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": null,
147 | "metadata": {
148 | "colab": {},
149 | "colab_type": "code",
150 | "id": "DXHHhj2jhjjg",
151 | "outputId": "af99e411-9c09-44aa-d619-553b2d2a5aa8"
152 | },
153 | "outputs": [],
154 | "source": [
155 | "# Store the url string that hosts our .csv file (note that this is a different url than in the video)\n",
156 | "url = \"Cartwheeldata.csv\"\n",
157 | "\n",
158 | "# Read the .csv file and store it as a pandas Data Frame\n",
159 | "df = pd.read_csv(url)\n",
160 | "\n",
161 | "# Output object type\n",
162 | "type(df)"
163 | ]
164 | },
165 | {
166 | "cell_type": "markdown",
167 | "metadata": {
168 | "colab_type": "text",
169 | "id": "-TrO3YWShjjl"
170 | },
171 | "source": [
172 | "### Viewing Data"
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "execution_count": null,
178 | "metadata": {
179 | "colab": {},
180 | "colab_type": "code",
181 | "id": "IMgR30w4hjjl",
182 | "outputId": "9a269897-78a7-4380-9634-81d2d6165c2d"
183 | },
184 | "outputs": [],
185 | "source": [
186 | "# We can view our Data Frame by calling the head() function\n",
187 | "df.head()"
188 | ]
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {
193 | "colab_type": "text",
194 | "id": "rwcqqpCrhjjp"
195 | },
196 | "source": [
197 | "The head() function simply shows the first 5 rows of our Data Frame. If we wanted to show the entire Data Frame we would simply write the following:"
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "execution_count": null,
203 | "metadata": {
204 | "colab": {},
205 | "colab_type": "code",
206 | "id": "clB7ZnfOhjjq",
207 | "outputId": "32b34b9e-c481-4c46-b443-488ac7097c55"
208 | },
209 | "outputs": [],
210 | "source": [
211 | "# Output entire Data Frame\n",
212 | "df"
213 | ]
214 | },
215 | {
216 | "cell_type": "markdown",
217 | "metadata": {
218 | "colab_type": "text",
219 | "id": "-F1DVbu4hjju"
220 | },
221 | "source": [
222 | "As you can see, we have a 2-Dimensional object where each row is an independent observation of our cartwheel data.\n",
223 | "\n",
224 | "To gather more information regarding the data, we can view the column names and data types of each column with the following functions:"
225 | ]
226 | },
227 | {
228 | "cell_type": "code",
229 | "execution_count": null,
230 | "metadata": {
231 | "colab": {},
232 | "colab_type": "code",
233 | "id": "pEdgVYnDhjjv",
234 | "outputId": "3c4a5edc-e29d-4665-b58c-b2e9fa125442"
235 | },
236 | "outputs": [],
237 | "source": [
238 | "df.columns"
239 | ]
240 | },
241 | {
242 | "cell_type": "markdown",
243 | "metadata": {
244 | "colab_type": "text",
245 | "id": "oeR8lBkmhjjz"
246 | },
247 | "source": [
248 | "Let's say we would like to slice our data frame and select only specific portions of our data. There are three different ways of doing so.\n",
249 | "\n",
250 | "1. .loc()\n",
251 | "2. .iloc()\n",
252 | "3. 
.ix()\n",
253 | "\n",
254 | "We will cover the .loc() and .iloc() slicing functions (.ix() has been deprecated in recent versions of pandas and is best avoided).\n",
255 | "\n",
256 | "### .loc()\n",
257 | ".loc() takes two single/list/range operators separated by ','. The first one indicates the rows and the second one indicates the columns."
258 | ]
259 | },
260 | {
261 | "cell_type": "code",
262 | "execution_count": null,
263 | "metadata": {
264 | "colab": {},
265 | "colab_type": "code",
266 | "id": "HpUEXXovhjj0",
267 | "outputId": "d4f26af0-9769-4fc1-c7ae-679e3dbf7c69"
268 | },
269 | "outputs": [],
270 | "source": [
271 | "# Return all observations of CWDistance\n",
272 | "df.loc[:,\"CWDistance\"]"
273 | ]
274 | },
275 | {
276 | "cell_type": "code",
277 | "execution_count": null,
278 | "metadata": {
279 | "colab": {},
280 | "colab_type": "code",
281 | "id": "vmZyHBk_hjj4",
282 | "outputId": "3d76358a-cd47-43bb-fd69-9f858f03add1"
283 | },
284 | "outputs": [],
285 | "source": [
286 | "# Select all rows for multiple columns, [\"CWDistance\", \"Height\", \"Wingspan\"]\n",
287 | "df.loc[:,[\"CWDistance\", \"Height\", \"Wingspan\"]]"
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "execution_count": null,
293 | "metadata": {
294 | "colab": {},
295 | "colab_type": "code",
296 | "id": "xi9U34Kwhjj8",
297 | "outputId": "e6413dbf-1ffe-4af4-e6b9-95e6d7c56dc0"
298 | },
299 | "outputs": [],
300 | "source": [
301 | "# Select few rows for multiple columns, [\"CWDistance\", \"Height\", \"Wingspan\"]\n",
302 | "df.loc[:9, [\"CWDistance\", \"Height\", \"Wingspan\"]]"
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": null,
308 | "metadata": {
309 | "colab": {},
310 | "colab_type": "code",
311 | "id": "8SACpAZDhjkA",
312 | "outputId": "79f0f222-85ed-4224-8242-69a8ae2742ff"
313 | },
314 | "outputs": [],
315 | "source": [
316 | "# Select range of rows for all columns\n",
317 | "df.loc[10:15]"
318 | ]
319 | },
320 | {
321 | "cell_type": "markdown",
322 | "metadata": {
323 | "colab_type": "text",
324 | "id": "ALS06G3rhjkC"
325 | },
326 | "source": [
327 | "The .loc() function requires two arguments: the indices of the rows and the names of the columns you wish to observe.\n",
328 | "\n",
329 | "In the above case **:** specifies all rows, and our column is **CWDistance**. df.loc[**:**,**\"CWDistance\"**]"
330 | ]
331 | },
332 | {
333 | "cell_type": "markdown",
334 | "metadata": {
335 | "colab_type": "text",
336 | "id": "DG7GYn4nhjkE"
337 | },
338 | "source": [
339 | "Now, let's say we only want to return the first 10 observations:"
340 | ]
341 | },
342 | {
343 | "cell_type": "code",
344 | "execution_count": null,
345 | "metadata": {
346 | "colab": {},
347 | "colab_type": "code",
348 | "id": "rKfblg4KhjkE",
349 | "outputId": "b3f4df29-8b8e-4a58-e69c-403dcd5a4dc5"
350 | },
351 | "outputs": [],
352 | "source": [
353 | "df.loc[:9, \"CWDistance\"]"
354 | ]
355 | },
356 | {
357 | "cell_type": "markdown",
358 | "metadata": {
359 | "colab_type": "text",
360 | "id": "DhluNGL1hjkI"
361 | },
362 | "source": [
363 | "### .iloc()\n",
364 | ".iloc() is integer-based slicing, whereas .loc() uses labels/column names.
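\n",
"\n",
"One subtlety worth noting: `.loc()` slices are inclusive of their endpoint, while `.iloc()` slices exclude it, mirroring ordinary Python indexing. A minimal sketch (using the `df` loaded above):\n",
"\n",
"```python\n",
"# .loc is label-based and end-inclusive: rows with index labels 0 through 4 (5 rows)\n",
"df.loc[0:4]\n",
"\n",
"# .iloc is position-based and end-exclusive: rows at positions 0 through 3 (4 rows)\n",
"df.iloc[0:4]\n",
"```\n",
"\n",
"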
Here are some examples:"
365 | ]
366 | },
367 | {
368 | "cell_type": "code",
369 | "execution_count": null,
370 | "metadata": {
371 | "colab": {},
372 | "colab_type": "code",
373 | "id": "6u1A-2drhjkJ",
374 | "outputId": "1eaf5856-74d9-4b2e-d8a4-93e163973cae"
375 | },
376 | "outputs": [],
377 | "source": [
378 | "df.iloc[:4]"
379 | ]
380 | },
381 | {
382 | "cell_type": "code",
383 | "execution_count": null,
384 | "metadata": {
385 | "colab": {},
386 | "colab_type": "code",
387 | "id": "L7U4Db6WhjkM",
388 | "outputId": "4068bbfc-c246-4847-bcd6-bce52656a71b"
389 | },
390 | "outputs": [],
391 | "source": [
392 | "df.iloc[1:5, 2:4]"
393 | ]
394 | },
395 | {
396 | "cell_type": "code",
397 | "execution_count": null,
398 | "metadata": {
399 | "colab": {},
400 | "colab_type": "code",
401 | "id": "Cot4DwdUhjkQ",
402 | "outputId": "52430bbe-8bab-48ac-b43d-deba59750797"
403 | },
404 | "outputs": [],
405 | "source": [
406 | "df[[\"Gender\", \"GenderGroup\"]].iloc[1:5]  # .iloc is position-based, so select the columns by label first"
407 | ]
408 | },
409 | {
410 | "cell_type": "markdown",
411 | "metadata": {
412 | "colab_type": "text",
413 | "id": "FIekmzj6hjkT"
414 | },
415 | "source": [
416 | "We can view the data types of our data frame columns by calling .dtypes on our data frame:"
417 | ]
418 | },
419 | {
420 | "cell_type": "code",
421 | "execution_count": null,
422 | "metadata": {
423 | "colab": {},
424 | "colab_type": "code",
425 | "id": "VVvtjY7ghjkU",
426 | "outputId": "4cb2e4f1-19f7-46f8-b5f4-3d1d3b779c67"
427 | },
428 | "outputs": [],
429 | "source": [
430 | "df.dtypes"
431 | ]
432 | },
433 | {
434 | "cell_type": "markdown",
435 | "metadata": {
436 | "colab_type": "text",
437 | "id": "fxzHKNfKhjkX"
438 | },
439 | "source": [
440 | "The output indicates we have integers, floats, and objects in our Data Frame.\n",
441 | "\n",
442 | "We may also want to observe the different unique values within a specific column; let's do this for Gender:"
443 | ]
444 | },
445 | {
446 | "cell_type": "code",
447 | "execution_count": null,
448 | "metadata": {
449 | "colab": {},
450 | "colab_type": "code",
451 | "id": "brIC2kbKhjkZ",
452 | "outputId": "b3c7b6f1-3c6b-4145-ff19-a3437e298212"
453 | },
454 | "outputs": [],
455 | "source": [
456 | "# List unique values in the df['Gender'] column\n",
457 | "df.Gender.unique()"
458 | ]
459 | },
460 | {
461 | "cell_type": "code",
462 | "execution_count": null,
463 | "metadata": {
464 | "colab": {},
465 | "colab_type": "code",
466 | "id": "Js4ikCVWhjkc",
467 | "outputId": "4cfbdf05-044e-4c56-9430-d392a589e9ee"
468 | },
469 | "outputs": [],
470 | "source": [
471 | "# Let's explore df[\"GenderGroup\"] as well\n",
472 | "df.GenderGroup.unique()"
473 | ]
474 | },
475 | {
476 | "cell_type": "markdown",
477 | "metadata": {
478 | "colab_type": "text",
479 | "id": "a3pqgpifhjkf"
480 | },
481 | "source": [
482 | "It seems that these fields may serve the same purpose, which is to specify male vs. female. 
Let's check this quickly by observing only these two columns:"
483 | ]
484 | },
485 | {
486 | "cell_type": "code",
487 | "execution_count": null,
488 | "metadata": {
489 | "colab": {},
490 | "colab_type": "code",
491 | "id": "8Oqj-XOghjkf",
492 | "outputId": "a02d15e9-bd6c-4f41-aa26-10dc0d8697e4"
493 | },
494 | "outputs": [],
495 | "source": [
496 | "# Use .loc() to specify a list of multiple column names\n",
497 | "df.loc[:,[\"Gender\", \"GenderGroup\"]]"
498 | ]
499 | },
500 | {
501 | "cell_type": "markdown",
502 | "metadata": {
503 | "colab_type": "text",
504 | "id": "n0S7vCwphjkj"
505 | },
506 | "source": [
507 | "From eyeballing the output, it seems to check out. We can streamline this by utilizing the groupby() and size() functions."
508 | ]
509 | },
510 | {
511 | "cell_type": "code",
512 | "execution_count": null,
513 | "metadata": {
514 | "colab": {},
515 | "colab_type": "code",
516 | "id": "nNvUQetJhjkj",
517 | "outputId": "3eedb9e8-0d1a-4a1f-ff65-f7ee4a178c5e"
518 | },
519 | "outputs": [],
520 | "source": [
521 | "df.groupby(['Gender','GenderGroup']).size()"
522 | ]
523 | },
524 | {
525 | "cell_type": "markdown",
526 | "metadata": {
527 | "colab_type": "text",
528 | "id": "7bHLzOH2hjkn"
529 | },
530 | "source": [
531 | "This output indicates that we have two types of combinations. \n",
532 | "\n",
533 | "* Case 1: Gender = F & GenderGroup = 1 \n",
534 | "* Case 2: Gender = M & GenderGroup = 2 \n",
535 | "\n",
536 | "This validates our initial assumption that these two fields essentially portray the same information."
537 | ]
538 | }
539 | ],
540 | "metadata": {
541 | "colab": {
542 | "name": "Introduction to Libraries and Data Management.ipynb",
543 | "provenance": [],
544 | "version": "0.3.2"
545 | },
546 | "kernelspec": {
547 | "display_name": "Python 2",
548 | "language": "python",
549 | "name": "python2"
550 | },
551 | "language_info": {
552 | "codemirror_mode": {
553 | "name": "ipython",
554 | "version": 2
555 | },
556 | "file_extension": ".py",
557 | "mimetype": "text/x-python",
558 | "name": "python",
559 | "nbconvert_exporter": "python",
560 | "pygments_lexer": "ipython2",
561 | "version": "2.7.15"
562 | }
563 | },
564 | "nbformat": 4,
565 | "nbformat_minor": 1
566 | }
567 | 
-------------------------------------------------------------------------------- /Course 1 - Understanding and Visualizing Data with Python/Week 1 - Introduction to Data/nhanes_data_basics.ipynb: --------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Using Python to read data files and explore their contents\n",
8 | "\n",
9 | "This notebook demonstrates using the [Pandas](http://pandas.pydata.org) data processing library to read a dataset into Python, and obtain a basic understanding of its contents.\n",
10 | "\n",
11 | "Note that Python by itself is a general-purpose programming language and does not provide high-level data processing capabilities. The Pandas library was developed to meet this need. 
Pandas is the most popular Python library for data manipulation, and we will use it extensively in this course.\n", 12 | "\n", 13 | "In addition to Pandas, we will also make use of the following Python libraries\n", 14 | "\n", 15 | "* [Numpy](http://www.numpy.org) is a library for working with arrays of data\n", 16 | "\n", 17 | "* [Matplotlib](https://matplotlib.org) is a library for making graphs\n", 18 | "\n", 19 | "* [Seaborn](https://seaborn.pydata.org) is a higher-level interface to Matplotlib that can be used to simplify many graphing tasks\n", 20 | "\n", 21 | "* [Statsmodels](https://www.statsmodels.org/stable/index.html) is a library that implements many statistical techniques\n", 22 | "\n", 23 | "* [Scipy](https://www.scipy.org) is a library of techniques for numerical and scientific computing\n" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "### Importing libraries\n", 31 | "\n", 32 | "When using Python, you must always begin your scripts by importing the libraries that you will be using. After importing a library, its functions can then be called from your code by prepending the library name to the function name. For example, to use the '`dot`' function from the '`numpy`' library, you would enter '`numpy.dot`'. To avoid repeatedly having to type the libary name in your scripts, it is conventional to define a two or three letter abbreviation for each library, e.g. '`numpy`' is usually abbreviated as '`np`'. This allows us to use '`np.dot`' instead of '`numpy.dot`'. Similarly, the Pandas library is typically abbreviated as '`pd`'.\n", 33 | "\n", 34 | "The following statement imports the Pandas library, and gives it the abbreviated name 'pd'." 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 1, 40 | "metadata": { 41 | "collapsed": true 42 | }, 43 | "outputs": [], 44 | "source": [ 45 | "import pandas as pd" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "### Reading a data file\n", 53 | "\n", 54 | "We will be working with the NHANES (National Health and Nutrition Examination Survey) data from the 2015-2016 wave, which has been discussed earlier in this course. The raw data for this study are available here:\n", 55 | "\n", 56 | "https://wwwn.cdc.gov/nchs/nhanes/Default.aspx\n", 57 | "\n", 58 | "As in many large studies, the NHANES data are spread across multiple files. The NHANES files are stored in [SAS transport](https://v8doc.sas.com/sashtml/files/z0987199.htm) (Xport) format. This is a somewhat obscure format, and while Pandas is perfectly capable of reading the NHANES data directly from the xport files, accomplishing this task is a more advanced topic than we want to get into here. Therefore, for this course we have prepared some merged datasets in text/csv format." 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "Pandas is a large and powerful library. Here we will only use a few of its basic features. The main data structure that Pandas works with is called a \"data frame\". This is a two-dimensional table of data in which the rows typically represent cases (e.g. NHANES subjects), and the columns represent variables. Pandas also has a one-dimensional data structure called a `Series` that we will encounter occasionally.\n", 66 | "\n", 67 | "Pandas has a variety of functions named with the pattern '`read_xxx`' for reading data in different formats into Python. 
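\n",
"\n",
"For instance (a hypothetical sketch -- these file names are made up for illustration):\n",
"\n",
"```python\n",
"# Other read_xxx functions follow the same pattern and also return data frames\n",
"# da = pd.read_excel(\"my_file.xlsx\")  # spreadsheet workbooks\n",
"# da = pd.read_json(\"my_file.json\")   # JSON documents\n",
"```\n",
"\n",
"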
Right now we will focus on reading '`csv`' files, so we are using the '`read_csv`' function, which can read csv (and \"tsv\") format files that are exported from spreadsheet software like Excel. The '`read_csv`' function by default expects the first row of the data file to contain column names. \n",
68 | "\n",
69 | "Using '`read_csv`' in its default mode is fairly straightforward. There are many options to '`read_csv`' that are useful for handling less-common situations. For example, you would use the option `sep='\\t'` instead of the default `sep=','` if the fields of your data file are delimited by tabs instead of commas. See [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for the full documentation for '`read_csv`'.\n",
70 | "\n",
71 | "Pandas can read a data file over the internet when provided with a URL, which is what we will do below. In the Python script we will name the data set '`da`', i.e. this is the name of the Python variable that will hold the data frame after we have loaded it. \n",
72 | "\n",
73 | "The variable '`url`' holds a string (text) value, which is the internet URL where the data are located. If you have the data file in your local filesystem, you can also use '`read_csv`' to read the data from this file. In this case you would pass a file path instead of a URL, e.g. `pd.read_csv(\"my_file.csv\")` would read a file named `my_file.csv` that is located in your current working directory."
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 2,
79 | "metadata": {},
80 | "outputs": [],
81 | "source": [
82 | "url = \"nhanes_2015_2016.csv\"\n",
83 | "da = pd.read_csv(url)"
84 | ]
85 | },
86 | {
87 | "cell_type": "markdown",
88 | "metadata": {},
89 | "source": [
90 | "To confirm that we have actually obtained the data that we are expecting, we can display the shape (number of rows and columns) of the data frame in the notebook. Note that the final expression in any Jupyter notebook cell is automatically printed, but you can force other expressions to be printed by using the '`print`' function, e.g. '`print(da.shape)`'.\n",
91 | "\n",
92 | "Based on what we see below, the data set being read here has 5735 rows, corresponding to 5735 people in this wave of the NHANES study, and 28 columns, corresponding to 28 variables in this particular data file. Note that NHANES collects thousands of variables on each study subject, but here we are working with a reduced file that contains a limited number of variables."
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "execution_count": 3,
98 | "metadata": {},
99 | "outputs": [
100 | {
101 | "data": {
102 | "text/plain": [
103 | "(5735, 28)"
104 | ]
105 | },
106 | "execution_count": 3,
107 | "metadata": {},
108 | "output_type": "execute_result"
109 | }
110 | ],
111 | "source": [
112 | "da.shape"
113 | ]
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {},
118 | "source": [
119 | "### Exploring the contents of a data set\n",
120 | "\n",
121 | "Pandas has a number of basic ways to understand what is in a data set. For example, above we used the '`shape`' attribute to determine the numbers of rows and columns in a data set.
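\n",
"\n",
"Another quick sanity check (a small sketch using the `da` frame loaded above) is to preview the first few rows:\n",
"\n",
"```python\n",
"# Display the first five rows of the data frame\n",
"da.head()\n",
"```\n",
"\n",
"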
The columns in a Pandas data frame have names; to see the names, use the '`columns`' attribute:"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": 4,
127 | "metadata": {},
128 | "outputs": [
129 | {
130 | "data": {
131 | "text/plain": [
132 | "Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR',\n",
133 | "       'RIDRETH1', 'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR',\n",
134 | "       'SDMVPSU', 'SDMVSTRA', 'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2',\n",
135 | "       'BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC',\n",
136 | "       'BMXWAIST', 'HIQ210'],\n",
137 | "      dtype='object')"
138 | ]
139 | },
140 | "execution_count": 4,
141 | "metadata": {},
142 | "output_type": "execute_result"
143 | }
144 | ],
145 | "source": [
146 | "da.columns"
147 | ]
148 | },
149 | {
150 | "cell_type": "markdown",
151 | "metadata": {},
152 | "source": [
153 | "These names correspond to variables in the NHANES study. For example, `SEQN` is a unique identifier for one person, and `BMXWT` is the subject's weight in kilograms (\"BMX\" is the NHANES prefix for body measurements). The variables in the NHANES data set are documented in a set of \"codebooks\" that are available on-line. The codebooks for the 2015-2016 wave of NHANES can be found by following the links at the following page:\n",
154 | "\n",
155 | "https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2015\n",
156 | "\n",
157 | "For convenience, direct links to some of the code books are included below:\n",
158 | "\n",
159 | "* [Demographics code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm)\n",
160 | "\n",
161 | "* [Body measures code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm)\n",
162 | "\n",
163 | "* [Blood pressure code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.htm)\n",
164 | "\n",
165 | "* [Alcohol questionnaire code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/ALQ_I.htm)\n",
166 | "\n",
167 | "* [Smoking questionnaire code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/SMQ_I.htm)"
168 | ]
169 | },
170 | {
171 | "cell_type": "markdown",
172 | "metadata": {},
173 | "source": [
174 | "Every variable in a Pandas data frame has a data type. There are many different data types, but most commonly you will encounter floating point values (real numbers), integers, strings (text), and date/time values. When Pandas reads a text/csv file, it guesses the data types based on what it sees in the first few rows of the data file. Usually it selects an appropriate type, but occasionally it does not. To confirm that the data types are consistent with what the variables represent, inspect the '`dtypes`' attribute of the data frame."
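,
"\n",
"If you just want a quick tally of how many columns have each type, one option (a small added sketch) is:\n",
"\n",
"```python\n",
"# Count the number of columns of each dtype\n",
"da.dtypes.value_counts()\n",
"```"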
175 | ]
176 | },
177 | {
178 | "cell_type": "code",
179 | "execution_count": 5,
180 | "metadata": {},
181 | "outputs": [
182 | {
183 | "data": {
184 | "text/plain": [
185 | "SEQN          int64\n",
186 | "ALQ101      float64\n",
187 | "ALQ110      float64\n",
188 | "ALQ130      float64\n",
189 | "SMQ020        int64\n",
190 | "RIAGENDR      int64\n",
191 | "RIDAGEYR      int64\n",
192 | "RIDRETH1      int64\n",
193 | "DMDCITZN    float64\n",
194 | "DMDEDUC2    float64\n",
195 | "DMDMARTL    float64\n",
196 | "DMDHHSIZ      int64\n",
197 | "WTINT2YR    float64\n",
198 | "SDMVPSU       int64\n",
199 | "SDMVSTRA      int64\n",
200 | "INDFMPIR    float64\n",
201 | "BPXSY1      float64\n",
202 | "BPXDI1      float64\n",
203 | "BPXSY2      float64\n",
204 | "BPXDI2      float64\n",
205 | "BMXWT       float64\n",
206 | "BMXHT       float64\n",
207 | "BMXBMI      float64\n",
208 | "BMXLEG      float64\n",
209 | "BMXARML     float64\n",
210 | "BMXARMC     float64\n",
211 | "BMXWAIST    float64\n",
212 | "HIQ210      float64\n",
213 | "dtype: object"
214 | ]
215 | },
216 | "execution_count": 5,
217 | "metadata": {},
218 | "output_type": "execute_result"
219 | }
220 | ],
221 | "source": [
222 | "da.dtypes"
223 | ]
224 | },
225 | {
226 | "cell_type": "markdown",
227 | "metadata": {},
228 | "source": [
229 | "As we see here, most of the variables have floating point or integer data type. Unlike many data sets, NHANES does not use any text values in its data. For example, while many datasets would use text labels like \"F\" or \"M\" to denote a subject's gender, this information is represented in NHANES with integer codes. The actual meanings of these codes can be determined from the codebooks. For example, the variable `RIAGENDR` contains each subject's gender, with male gender coded as `1` and female gender coded as `2`. The `RIAGENDR` variable is part of the demographics component of NHANES, so this coding can be found in the demographics codebook.\n",
230 | "\n",
231 | "Variables like `BMXWT` which represent a quantitative measurement will typically be stored as floating point data values."
232 | ]
233 | },
234 | {
235 | "cell_type": "markdown",
236 | "metadata": {},
237 | "source": [
238 | "### Slicing a data set\n",
239 | "\n",
240 | "As discussed above, a Pandas data frame is a rectangular data table, in which the rows represent cases and the columns represent variables. One common manipulation of a data frame is to extract the data for one case or for one variable. There are several ways to do this, as shown below.\n",
241 | "\n",
242 | "To extract all the values for one variable, the following four approaches are equivalent (\"DMDEDUC2\" here is an NHANES variable containing a person's educational attainment). In these four lines of code, we are assigning the data from one column of the data frame `da` into new variables `w`, `x`, `y`, and `z`. The first three approaches access the variable by name. The fourth approach accesses the variable by position (note that `DMDEDUC2` is in position 9 of the `da.columns` array shown above -- remember that Python counts starting at position zero)."
243 | ]
244 | },
245 | {
246 | "cell_type": "code",
247 | "execution_count": 6,
248 | "metadata": {},
249 | "outputs": [],
250 | "source": [
251 | "w = da[\"DMDEDUC2\"]\n",
252 | "x = da.loc[:, \"DMDEDUC2\"]\n",
253 | "y = da.DMDEDUC2\n",
254 | "z = da.iloc[:, 9]  # DMDEDUC2 is in column 9"
255 | ]
256 | },
257 | {
258 | "cell_type": "markdown",
259 | "metadata": {},
260 | "source": [
261 | "Another reason to slice a variable out of a data frame is so that we can then pass it into a function. 
For example, we can find the maximum value over all `DMDEDUC2` values using any one of the following four lines of code:" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 7, 267 | "metadata": {}, 268 | "outputs": [ 269 | { 270 | "name": "stdout", 271 | "output_type": "stream", 272 | "text": [ 273 | "9.0\n", 274 | "9.0\n", 275 | "9.0\n", 276 | "9.0\n" 277 | ] 278 | } 279 | ], 280 | "source": [ 281 | "print(da[\"DMDEDUC2\"].max())\n", 282 | "print(da.loc[:, \"DMDEDUC2\"].max())\n", 283 | "print(da.DMDEDUC2.max())\n", 284 | "print(da.iloc[:, 9].max())" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "Every value in a Python program has a type, and the type information can be obtained using Python's '`type`' function. This can be useful, for example, if you are looking for the documentation associated with some value, but you do not know what the value's type is. \n", 292 | "\n", 293 | "Here we see that the variable `da` has type 'DataFrame', while one column of `da` has type 'Series'. As noted above, a Series is a Pandas data structure for holding a single column (or row) of data." 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": 8, 299 | "metadata": {}, 300 | "outputs": [ 301 | { 302 | "name": "stdout", 303 | "output_type": "stream", 304 | "text": [ 305 | "\n", 306 | "\n", 307 | "\n" 308 | ] 309 | } 310 | ], 311 | "source": [ 312 | "print(type(da)) # The type of the variable\n", 313 | "print(type(da.DMDEDUC2)) # The type of one column of the data frame\n", 314 | "print(type(da.iloc[2,:])) # The type of one row of the data frame" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "metadata": {}, 320 | "source": [ 321 | "It may also be useful to slice a row (case) out of a data frame. Just as a data frame's columns have names, the rows also have names, which are called the \"index\". However many data sets do not have meaningful row names, so it is more common to extract a row of a data frame using its position. The `iloc` method slices rows or columns from a data frame by position (counting from 0). The following line of code extracts row 3 from the data set (which is the fourth row, counting from zero)." 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 9, 327 | "metadata": { 328 | "collapsed": true 329 | }, 330 | "outputs": [], 331 | "source": [ 332 | "x = da.iloc[3, :]" 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "Another important data frame manipulation is to extract a contiguous block of rows or columns from the data set. Below we slice by position, in the first case taking row positions 3 and 4 (counting from 0, which are rows 4 and 5 counting from 1), and in the second case taking columns 2, 3, and 4 (columns 3, 4, 5 if counting from 1)." 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": 10, 345 | "metadata": {}, 346 | "outputs": [], 347 | "source": [ 348 | "x = da.iloc[3:5, :]\n", 349 | "y = da.iloc[:, 2:5]" 350 | ] 351 | }, 352 | { 353 | "cell_type": "markdown", 354 | "metadata": {}, 355 | "source": [ 356 | "### Missing values\n" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "When reading a dataset using Pandas, there is a set of values including 'NA', 'NULL', and 'NaN' that are taken by default to represent a missing value. 
The full list of default missing value codes is in the '`read_csv`' documentation [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). This document also explains how to change the way that '`read_csv`' decides whether a variable's value is missing.\n",
364 | "\n",
365 | "Pandas has functions called `isnull` and `notnull` that can be used to identify where the missing and non-missing values are located in a data frame. Below we use these functions to count the number of missing and non-missing `DMDEDUC2` values."
366 | ]
367 | },
368 | {
369 | "cell_type": "code",
370 | "execution_count": 11,
371 | "metadata": {},
372 | "outputs": [
373 | {
374 | "name": "stdout",
375 | "output_type": "stream",
376 | "text": [
377 | "261\n",
378 | "5474\n"
379 | ]
380 | }
381 | ],
382 | "source": [
383 | "print(pd.isnull(da.DMDEDUC2).sum())\n",
384 | "print(pd.notnull(da.DMDEDUC2).sum())"
385 | ]
386 | },
387 | {
388 | "cell_type": "markdown",
389 | "metadata": {},
390 | "source": [
391 | "As an aside, note that there may be a variety of distinct forms of missingness in a variable, and in some cases it is important to keep these values distinct. For example, in the case of the DMDEDUC2 variable, in addition to the blank or NA values that Pandas considers to be missing, three people responded \"don't know\" (code value 9). In many analyses, the \"don't know\" values will also be treated as missing, but at this point we are considering \"don't know\" to be a distinct category of observed response."
392 | ]
393 | }
394 | ],
395 | "metadata": {
396 | "kernelspec": {
397 | "display_name": "Python 3",
398 | "language": "python",
399 | "name": "python3"
400 | },
401 | "language_info": {
402 | "codemirror_mode": {
403 | "name": "ipython",
404 | "version": 3
405 | },
406 | "file_extension": ".py",
407 | "mimetype": "text/x-python",
408 | "name": "python",
409 | "nbconvert_exporter": "python",
410 | "pygments_lexer": "ipython3",
411 | "version": "3.6.3"
412 | }
413 | },
414 | "nbformat": 4,
415 | "nbformat_minor": 1
416 | }
417 | 
-------------------------------------------------------------------------------- /Course 1 - Understanding and Visualizing Data with Python/Week 1 - Introduction to Data/week1_python_resources.ipynb: --------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "colab_type": "text",
7 | "id": "tPxurgKK6mUZ"
8 | },
9 | "source": [
10 | "### Python Resources\n",
11 | "The purpose of this document is to direct you to resources that you may find useful if you decide to do a deeper dive into Python. This course is not meant to be an introduction to programming, nor an introduction to Python, but if you find yourself interested in exploring Python further, or feel that it is a useful skill, the resources below should help. If you have a background in Python or programming, style guides are included below to show how Python may differ from other programming languages or give you a launching point for diving deeper into more advanced packages. This course does not endorse the use or non-use of any particular resource, but the author has found these resources useful in their exploration of programming and Python in particular.\n",
12 | "\n",
13 | "### The Python Documentation\n",
14 | "No list of Python references would be complete without the official Python documentation. 
The authors of the language, as well as the community that supports it, have developed a great set of tutorials, documentation, and references around Python. When in doubt, this is often the first place that you should look if you run into a scary error or would like to learn more about a specific function. The documentation can be found here: [Python Documentation](https://docs.python.org/3/)\n",
15 | "\n",
16 | "\n"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {
22 | "colab_type": "text",
23 | "id": "8OH-7kq_3Ei4"
24 | },
25 | "source": [
26 | "### Python Programming Introductions\n",
27 | "\n",
28 | "Below are resources to help you along your way in learning Python. While it is great to consume material, in programming there is no substitute for actually writing code. For every hour that you spend learning, you should spend about twice that amount of time writing code for cool problems or working out examples. Coding is best learned through actually coding!\n",
29 | "\n",
30 | "* [Coursera](https://www.coursera.org/courses?query=python) has several offerings for Python that you can take in addition to this course. These courses will go into depth on Python programming and how to use it in an applied setting.\n",
31 | "* [Codecademy](https://www.codecademy.com/learn/learn-python) is another resource that is great for learning Python (and other programming languages). While not as focused as Coursera, this is a quick way to get up-and-running with Python\n",
32 | "* YouTube is another great resource for online learning and there are several \"courses\" for learning Python. We recommend trying several sets of videos to see which you like best and using multiple video series to learn since each will present the material in a slightly different way\n",
33 | "* There are dozens of books on programming in Python that are great if you prefer to read. More so than the other resources, be sure to code what you learn. It is easy to read about coding, but you really learn to code by coding!\n",
34 | "* If you have a background in coding, the authors have found the tutorial at [Tutorials Point](https://www.tutorialspoint.com/python/index.htm) to be useful in getting started with Python. This tutorial assumes that you have some background in coding in another language"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {
40 | "colab_type": "text",
41 | "id": "PdurMQjI2rQn"
42 | },
43 | "source": [
44 | "### Cheatsheets and References\n",
45 | "\n",
46 | "There are a variety of one-pagers and cheat-sheets available for Python that summarize the language in a few simple pages. These resources tend to be more aimed at someone who knows the language, or has experience in the language, but would like a refresher course in how the language works. 
\n", 47 | "\n", 48 | "* [Cheatsheet for Numpy](https://www.datacamp.com/community/blog/python-numpy-cheat-sheet#gs.AK5ZBgE)\n", 49 | "* [Cheatsheet for Datawrangling](https://www.datacamp.com/community/blog/pandas-cheat-sheet-python#gs.HPFoRIc)\n", 50 | "* [Cheatsheet for Pandas](https://www.datacamp.com/community/blog/python-pandas-cheat-sheet#gs.oundfxM)\n", 51 | "* [Cheatsheet for SciPy](https://www.datacamp.com/community/blog/python-scipy-cheat-sheet#gs.JDSg3OI)\n", 52 | "* [Cheatsheet for Matplotlib](https://www.datacamp.com/community/blog/python-matplotlib-cheat-sheet#gs.uEKySpY)\n", 53 | "\n", 54 | "\n" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": { 60 | "colab_type": "text", 61 | "id": "swjetdNA2_iq" 62 | }, 63 | "source": [ 64 | "### Python Style Guides\n", 65 | "\n", 66 | "As you learn to code, you will find that you will begin to develop your own style. Sometimes this is good. Most times, this can be detrimental to your code readability and, worse, can hinder you from finding bugs in your own code in extreme cases. \n", 67 | "\n", 68 | "It is best to learn good coding habits from the beginning and the [Google Style Guide](https://github.com/google/styleguide/blob/gh-pages/pyguide.md) is a great place to start. We will mention some of these best practices here." 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": { 74 | "colab_type": "text", 75 | "id": "iNXpVVq6zSfY" 76 | }, 77 | "source": [ 78 | "#### Consistent Indenting\n", 79 | "Python will generally 'yell' at you if your indenting is incorrect. It is good to use an editor that takes care of this for you. In general, four spaces are preferred for indenting and you should not mix tabs and spaces. " 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 0, 85 | "metadata": { 86 | "colab": {}, 87 | "colab_type": "code", 88 | "id": "xc6e9f0SCvbh" 89 | }, 90 | "outputs": [], 91 | "source": [ 92 | "# Good Indenting - four spaces are standard but consistiency is key\n", 93 | "result = []\n", 94 | "for x in range(10):\n", 95 | " for y in range(5):\n", 96 | " if x * y > 10:\n", 97 | " result.append((x, y))\n", 98 | "print (result)\n", 99 | "\n", 100 | "# Bad indenting\n", 101 | "result = []\n", 102 | "for x in range(10):\n", 103 | " for y in range(5):\n", 104 | " if x * y > 10:\n", 105 | " result.append((x, y))\n", 106 | "print (result)" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": { 112 | "colab_type": "text", 113 | "id": "ErsdjHdUDixX" 114 | }, 115 | "source": [ 116 | "#### Commenting\n", 117 | "\n", 118 | "Comments seem weird when you first begin programming - why would I include 'code' that doesn't run? Comments are probably some of the most important aspects of code. They help other read code that is difficult for them to understand, and they, more importantly, are helpful for yourself if you look at the code in a few weeks and need clarity on why you did something. 
Always comment and comment well.\n", 119 | "\n" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 0, 125 | "metadata": { 126 | "colab": {}, 127 | "colab_type": "code", 128 | "id": "qBTE3rwqEFMm" 129 | }, 130 | "outputs": [], 131 | "source": [ 132 | "################################################################################\n", 133 | "#                                                                              #\n", 134 | "#                               Good Commenting                                #\n", 135 | "#                                                                              #\n", 136 | "################################################################################\n", 137 | "\n", 138 | "################################ Bad Commenting ################################\n", 139 | "\n", 140 | "# My loop\n", 141 | "for x in range(10):\n", 142 | "    print(x)\n", 143 | " \n", 144 | "############################## Better Commenting ###############################\n", 145 | "\n", 146 | "# Looping from zero to nine\n", 147 | "for x in range(10):\n", 148 | "    print(x)\n", 149 | "\n", 150 | "############################# Preferred Commenting #############################\n", 151 | "\n", 152 | "# Print out the numbers from zero to nine\n", 153 | "for x in range(10):\n", 154 | "    print(x)" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 0, 160 | "metadata": { 161 | "colab": {}, 162 | "colab_type": "code", 163 | "id": "c0kgjakBF5X8" 164 | }, 165 | "outputs": [], 166 | "source": [ 167 | "################################################################################\n", 168 | "#                                                                              #\n", 169 | "#                         Mixing Commenting Strategies                         #\n", 170 | "#                                                                              #\n", 171 | "################################################################################\n", 172 | "\n", 173 | "# Try not to mix commenting styles in the same blocks - just be consistent\n", 174 | "\n", 175 | "########### Bad - mixing docstring comments and line comments ############\n", 176 | "\n", 177 | "''' Print one through six, then six through nine '''\n", 178 | "for x in range(10):\n", 179 | "    # If x > 5, then print the value\n", 180 | "    if x > 5: \n", 181 | "        print(x)\n", 182 | "    else:\n", 183 | "        print(x + 1)\n", 184 | " \n", 185 | "##################### Good - no mixing of comment types ########################\n", 186 | "\n", 187 | "# Print one through six, then six through nine\n", 188 | "for x in range(10):\n", 189 | "    # If x > 5, then print the value\n", 190 | "    if x > 5: \n", 191 | "        print(x)\n", 192 | "    else:\n", 193 | "        print(x + 1)" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": { 199 | "colab_type": "text", 200 | "id": "UDSKv5J3G9j-" 201 | }, 202 | "source": [ 203 | "#### Line Length\n", 204 | "\n", 205 | "Try to avoid excessively long lines. Standard practice is to keep lines no longer than 80 characters. 
While this is not a hard rule, it is a good practice to follow for readability." 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 0, 211 | "metadata": { 212 | "colab": {}, 213 | "colab_type": "code", 214 | "id": "QhSQsFKeHZae" 215 | }, 216 | "outputs": [], 217 | "source": [ 218 | "######################### Bad - this line is too long ##########################\n", 219 | "\n", 220 | "my_random_array = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n", 221 | "\n", 222 | "############ Good - this code is wrapped to avoid excessive length #############\n", 223 | "\n", 224 | "my_random_array = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, \n", 225 | "                   10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, \n", 226 | "                   9, 10]" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": { 232 | "colab_type": "text", 233 | "id": "-IVW9wL8JC9O" 234 | }, 235 | "source": [ 236 | "#### White Space\n", 237 | "Utilizing whitespace is a great way to improve how your code looks. In general, the following guidelines can help improve the look of your code:\n", 238 | "\n", 239 | "* Try to space out your code and introduce whitespace to improve readability\n", 240 | "* Use spacing to separate function arguments\n", 241 | "* Do not overdo spacing. Too many blank lines between code blocks makes it difficult to organize code well" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 0, 247 | "metadata": { 248 | "colab": {}, 249 | "colab_type": "code", 250 | "id": "tkzduWdeiwSX" 251 | }, 252 | "outputs": [], 253 | "source": [ 254 | "################ Bad - this code has bad whitespace management #################\n", 255 | "\n", 256 | "my_player = player()\n", 257 | "player_attributes = get_player_attributes(my_player,height,weight, birthday)\n", 258 | "\n", 259 | "\n", 260 | "player_attributes[0]*=12 # convert from feet to inches\n", 261 | "\n", 262 | "\n", 263 | "\n", 264 | "\n", 265 | "my_player.shoot_ball()\n", 266 | "\n", 267 | "\n", 268 | "########################## Good whitespace management ##########################\n", 269 | "\n", 270 | "my_player = player()\n", 271 | "player_attributes = get_player_attributes(my_player, height, weight, birthday)\n", 272 | "\n", 273 | "# convert from feet to inches\n", 274 | "player_attributes[0] *= 12\n", 275 | "\n", 276 | "my_player.shoot_ball()" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": { 282 | "colab_type": "text", 283 | "id": "5Wafm9FsuNoi" 284 | }, 285 | "source": [ 286 | "#### The tip of the iceberg\n", 287 | "\n", 288 | "Take a look at code out in the wild if you are really curious. How do other developers code specific things? How do they manage spacing in loops? How do they manage the whitespace in argument lists? \n", 289 | "\n", 290 | "You will learn to code by coding, and you will develop your own style, but starting out with good habits ensures that your code is easy to read by others and, most importantly, by yourself. Good luck!"
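To tie these style points together, here is one short sketch (an addition for this write-up, not part of the original notebook) that applies them all at once: four-space indents, a purposeful comment, lines under 80 characters, and single spaces after commas:

```python
def mean_of_positives(values):
    # Average only the positive values; return None if there are none
    positives = [v for v in values if v > 0]
    if not positives:
        return None
    return sum(positives) / len(positives)

print(mean_of_positives([3, -1, 4, -1, 5]))  # prints 4.0
```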
291 | ] 292 | } 293 | ], 294 | "metadata": { 295 | "colab": { 296 | "name": "Diving Deeper Into Python.ipynb", 297 | "provenance": [], 298 | "version": "0.3.2" 299 | }, 300 | "kernelspec": { 301 | "display_name": "Python 3", 302 | "language": "python", 303 | "name": "python3" 304 | }, 305 | "language_info": { 306 | "codemirror_mode": { 307 | "name": "ipython", 308 | "version": 3 309 | }, 310 | "file_extension": ".py", 311 | "mimetype": "text/x-python", 312 | "name": "python", 313 | "nbconvert_exporter": "python", 314 | "pygments_lexer": "ipython3", 315 | "version": "3.6.3" 316 | } 317 | }, 318 | "nbformat": 4, 319 | "nbformat_minor": 1 320 | } 321 | -------------------------------------------------------------------------------- /Course 1 - Understanding and Visualizing Data with Python/Week 3 - Multivariate Data/Multivariate_Distributions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Multivariate Distributions in Python\n", 8 | "\n", 9 | "Sometimes we can get a lot of information about how two variables (or more) relate if we plot them together. This tutorial aims to show how plotting two variables together can give us information that plotting each one separately may miss.\n", 10 | "\n" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "# import the packages we are going to be using\n", 20 | "import numpy as np # for getting our distribution\n", 21 | "import matplotlib.pyplot as plt # for plotting\n", 22 | "import seaborn as sns; sns.set() # For a different plotting theme\n", 23 | "\n", 24 | "# Don't worry so much about what rho is doing here\n", 25 | "# Just know if we have a rho of 1 then we will get a perfectly\n", 26 | "# upward sloping line, and if we have a rho of -1, we will get \n", 27 | "# a perfectly downward sloping line. 
A rho of 0 will \n", 28 | "# get us a 'cloud' of points\n", 29 | "rho = 1\n", 30 | "\n", 31 | "# Don't worry so much about the following three lines of code for now;\n", 32 | "# this is just getting the data for us to plot\n", 33 | "mean = [15, 5]\n", 34 | "cov = [[1, rho], [rho, 1]]\n", 35 | "x, y = np.random.multivariate_normal(mean, cov, 400).T\n", 36 | "\n", 37 | "# Adjust the figure size\n", 38 | "plt.figure(figsize=(10,5))\n", 39 | "\n", 40 | "# Plot the histograms of X and Y next to each other\n", 41 | "plt.subplot(1,2,1)\n", 42 | "plt.hist(x = x, bins = 15)\n", 43 | "plt.title(\"X\")\n", 44 | "\n", 45 | "plt.subplot(1,2,2)\n", 46 | "plt.hist(x = y, bins = 15)\n", 47 | "plt.title(\"Y\")\n", 48 | "\n", 49 | "plt.show()" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "# Plot the data\n", 59 | "plt.figure(figsize=(10,10))\n", 60 | "plt.subplot(2,2,2)\n", 61 | "plt.scatter(x = x, y = y)\n", 62 | "plt.title(\"Joint Distribution of X and Y\")\n", 63 | "\n", 64 | "# Plot the Marginal X Distribution\n", 65 | "plt.subplot(2,2,4)\n", 66 | "plt.hist(x = x, bins = 15)\n", 67 | "plt.title(\"Marginal Distribution of X\")\n", 68 | "\n", 69 | "\n", 70 | "# Plot the Marginal Y Distribution\n", 71 | "plt.subplot(2,2,1)\n", 72 | "plt.hist(x = y, orientation = \"horizontal\", bins = 15)\n", 73 | "plt.title(\"Marginal Distribution of Y\")\n", 74 | "\n", 75 | "# Show the plots\n", 76 | "plt.show()" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [] 85 | } 86 | ], 87 | "metadata": { 88 | "kernelspec": { 89 | "display_name": "Python 2", 90 | "language": "python", 91 | "name": "python2" 92 | }, 93 | "language_info": { 94 | "codemirror_mode": { 95 | "name": "ipython", 96 | "version": 2 97 | }, 98 | "file_extension": ".py", 99 | "mimetype": "text/x-python", 100 | "name": "python", 101 | "nbconvert_exporter": "python", 102 | "pygments_lexer": "ipython2", 103 | "version": "2.7.15" 104 | } 105 | }, 106 | "nbformat": 4, 107 | "nbformat_minor": 2 108 | }
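As a quick sanity check on the rho comment in the notebook above (this snippet is an addition, not part of the original file), you can draw samples at several values of rho and confirm the sample correlation with `np.corrcoef`:

```python
import numpy as np

# The sample correlation should land near each rho we feed in; rho = 1 and
# rho = -1 put the points on a line (numpy may warn that the covariance
# matrix is singular), while rho = 0 gives an uncorrelated cloud.
for rho in (-1, 0, 1):
    cov = [[1, rho], [rho, 1]]
    x, y = np.random.multivariate_normal([15, 5], cov, 400).T
    print(rho, round(np.corrcoef(x, y)[0, 1], 3))
```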
-------------------------------------------------------------------------------- /Course 1 - Understanding and Visualizing Data with Python/Week 3 - Multivariate Data/Unit_Testing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Unit Testing\n", 8 | "While we will not cover the [unit testing library](https://docs.python.org/3/library/unittest.html) that Python has, we wanted to introduce you to a simple way that you can test your code.\n", 9 | "\n", 10 | "Unit testing is important because it is the only way you can be sure that your code is doing what you think it is doing. \n", 11 | "\n", 12 | "Remember, just because there are no errors does not mean your code is correct."
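A minimal sketch of that idea (an illustration added here; `mean_of_first_n` is a hypothetical helper, not something defined in this notebook): run your code on a tiny input where you can work out the answer by hand, then `assert` on it:

```python
import pandas as pd

def mean_of_first_n(series, n):
    # Hypothetical helper: mean of the first n values by position
    return series.iloc[:n].mean()

test = pd.Series([3, 4, 5, 6, 7])
assert mean_of_first_n(test, 5) == 5  # (3 + 4 + 5 + 6 + 7) / 5 = 5
print("test passed")
```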
13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "metadata": { 19 | "collapsed": true 20 | }, 21 | "outputs": [], 22 | "source": [ 23 | "import numpy as np\n", 24 | "import pandas as pd\n", 25 | "import matplotlib.pyplot as plt\n", 26 | "pd.set_option('display.max_columns', 100) # Show all columns when looking at dataframe" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": null, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "# Load the NHANES 2015-2016 data\n", 36 | "df = pd.read_csv(\"nhanes_2015_2016.csv\")\n", 37 | "df.index = range(1,df.shape[0]+1)" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "df.head()" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "### Goal\n", 54 | "We want to find the mean of the first 100 rows of 'BPXSY1' when 'RIDAGEYR' > 60" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "# One possible way of doing this is:\n", 64 | "pd.Series.mean(df[df.RIDAGEYR > 60].loc[range(0,100), 'BPXSY1']) \n", 65 | "# Current versions of pandas will warn about this indexing; older versions will not" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "# test our code on only ten rows so we can easily check\n", 75 | "test = pd.DataFrame({'col1': np.repeat([3,1],5), 'col2': range(3,13)}, index=range(1,11))\n", 76 | "test" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "# pd.Series.mean(df[df.RIDAGEYR > 60].loc[range(0,5), 'BPXSY1'])\n", 86 | "# should return 5\n", 87 | "\n", 88 | "pd.Series.mean(test[test.col1 > 2].loc[range(0,5), 'col2'])" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "What went wrong?" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "test[test.col1 > 2].loc[range(0,5), 'col2']\n", 105 | "# 0 is not in the row index labels, because this dataframe's index starts at 1. 
For now, pandas defaults to filling this\n", 106 | "# with NaN" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "# Using the .iloc method instead, we are correctly choosing the first 5 rows, regardless of their row labels\n", 116 | "test[test.col1 > 2].iloc[range(0,5), 1]" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "pd.Series.mean(test[test.col1 > 2].iloc[range(0,5), 1])" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "# We can compare what our real dataframe looks like with the incorrect and correct methods\n", 135 | "df[df.RIDAGEYR > 60].loc[range(0,5), :] # Filled with NaN whenever a row label is missing from the filtered dataframe" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": null, 141 | "metadata": {}, 142 | "outputs": [], 143 | "source": [ 144 | "df[df.RIDAGEYR > 60].iloc[range(0,5), :] # Correct: picks the first five rows such that 'RIDAGEYR' > 60" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "# Applying the correct method to the original question about BPXSY1\n", 154 | "print(pd.Series.mean(df[df.RIDAGEYR > 60].iloc[range(0,100), 16]))\n", 155 | "\n", 156 | "# Another way to reference the BPXSY1 variable\n", 157 | "print(pd.Series.mean(df[df.RIDAGEYR > 60].iloc[range(0,100), df.columns.get_loc('BPXSY1')]))" 158 | ] 159 | } 160 | ], 161 | "metadata": { 162 | "kernelspec": { 163 | "display_name": "Python 3", 164 | "language": "python", 165 | "name": "python3" 166 | }, 167 | "language_info": { 168 | "codemirror_mode": { 169 | "name": "ipython", 170 | "version": 3 171 | }, 172 | "file_extension": ".py", 173 | "mimetype": "text/x-python", 174 | "name": "python", 175 | "nbconvert_exporter": "python", 176 | "pygments_lexer": "ipython3", 177 | "version": "3.6.3" 178 | } 179 | }, 180 | "nbformat": 4, 181 | "nbformat_minor": 2 182 | } -------------------------------------------------------------------------------- /Course 1 - Understanding and Visualizing Data with Python/Week 3 - Multivariate Data/pizza_study_design.md: -------------------------------------------------------------------------------- 1 | TO: Jane Doe, CEO 2 | 3 | FROM: John Doe, Data Scientist 4 | 5 | DATE: April 18, 2021 6 | 7 | SUBJECT: Hot Fire Pizza Study Design Proposal 8 | 9 | This proposal is designed to help collect and summarize data to be used in Hot Fire Pizza's upcoming marketing campaign. The marketing campaign will feature comparison against our main competitor Cold Water Pizza. This plan includes what information will be measured, how it will be measured, and how it will be communicated to form actionable insights. 10 | 11 | **What we will measure (Research Questions)** 12 | 13 | Within the past few years, Hot Fire Pizza and Cold Water Pizza have been locked in heated competition to own the growing Gotham City pizza market. In this research study we will look to answer three critical questions: 14 | 15 | - What objectively measurable advantages does Hot Fire Pizza's product have over Cold Water Pizza? 16 | - Does our target audience prefer Hot Fire Pizza or Cold Water Pizza in taste, service, and overall satisfaction? 17 | - What is our target audience's favorite attribute of both Hot Fire Pizza and Cold Water Pizza? 18 | 19 | **How we will measure it (Variables to Record)** 20 | 21 | The following data collection processes need to be conducted: 22 | 23 | - Record attributes of the pizzas themselves 24 | - Survey target audience on their opinions 25 | 26 | First, we will record the attributes of the pizzas from both Hot Fire Pizza and Cold Water Pizza. The attributes we will record include: 27 | 28 | - the prices of the pizzas 29 | - the amount of toppings you get 30 | - the size of the pizzas 31 | - the weight of the pizzas 32 | 33 | Next, we will conduct a survey of our target audience. This survey should include a random sample of one thousand people within our target audience and ask them the following questions about both Hot Fire Pizza and Cold Water Pizza: 34 | 35 | - Rate taste of pizza (1-10) 36 | - Rate customer service (1-10) 37 | - Rate overall satisfaction (1-10) 38 | - Favorite attribute of the pizza (multiple-choice, 5 answers) 39 | 40 | **How we will extract actionable insights (Graphical & Numerical Summaries)** 41 | 42 | First, we will highlight the measurable attributes where Hot Fire Pizza has an advantage over Cold Water Pizza. We will plot bar charts to communicate the differences in price & toppings. For the size & weight of the pizzas, we can use infographics, because people strongly associate these variables with the visual of the pizza itself. 43 | 44 | Next, we will take the survey responses we collected to communicate whether or not our target audience prefers Hot Fire Pizza or Cold Water Pizza in taste, service, and satisfaction. These three attributes are all on the same scale, so we can plot the distribution of answers using a side-by-side box-plot. This box plot will show the distribution of survey responses for taste, service, and satisfaction across Hot Fire Pizza and Cold Water Pizza. 45 | 46 | Finally, in order to communicate what the target audience's favorite attributes of Hot Fire Pizza and Cold Water Pizza are, we will plot a stacked bar chart that shows the relative frequency of each possible answer for both companies. 47 | 48 | 49 | 50 | -------------------------------------------------------------------------------- /Course 1 - Understanding and Visualizing Data with Python/Week 4 - Populations and Samples/Randomness_and_Reproducibility.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# The Empirical Rule and Distribution\n", 8 | "\n", 9 | "In week 2, we discussed the empirical rule, or the 68 - 95 - 99.7 rule, which describes how many observations fall within a certain distance from our mean. This distance from the mean is denoted as sigma, or standard deviation (the average distance an observation is from the mean).\n", 10 | "\n", 11 | "The following image may help refresh your memory:\n", 12 | "\n", 13 | "![Three Sigma Rule](img/three_sigma_rule.png)\n", 14 | "\n", 15 | "For this tutorial, we will be exploring the number of hours of sleep the average college student gets.\n", 16 | "\n", 17 | "The example used in lecture stated there was a mean of 7 and standard deviation of 1.7 for hours of sleep; we will use these same values."
18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": null, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "import warnings\n", 27 | "warnings.filterwarnings('ignore')\n", 28 | "import random\n", 29 | "import numpy as np\n", 30 | "import pandas as pd\n", 31 | "import matplotlib.pyplot as plt\n", 32 | "import seaborn as sns\n", 33 | "\n", 34 | "random.seed(1738)" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "mu = 7\n", 44 | "\n", 45 | "sigma = 1.7\n", 46 | "\n", 47 | "Observations = [random.normalvariate(mu, sigma) for _ in range(100000)]" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "sns.distplot(Observations)\n", 57 | "\n", 58 | "plt.axvline(np.mean(Observations) + np.std(Observations), color = \"g\")\n", 59 | "plt.axvline(np.mean(Observations) - np.std(Observations), color = \"g\")\n", 60 | "\n", 61 | "plt.axvline(np.mean(Observations) + (np.std(Observations) * 2), color = \"y\")\n", 62 | "plt.axvline(np.mean(Observations) - (np.std(Observations) * 2), color = \"y\")\n" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "pd.Series(Observations).describe()" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "SampleA = random.sample(Observations, 100)\n", 81 | "SampleB = random.sample(Observations, 100)\n", 82 | "SampleC = random.sample(Observations, 100)" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": null, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "fig, ax = plt.subplots()\n", 92 | "\n", 93 | "sns.distplot(SampleA, ax = ax)\n", 94 | "sns.distplot(SampleB, ax = ax)\n", 95 | "sns.distplot(SampleC, ax = ax)\n" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "Now that we have covered the 68 - 95 - 99.7 rule, we will take this a step further and discuss the empirical distribution.\n", 103 | "\n", 104 | "The empirical distribution is a cumulative distribution function that gives the proportion of observations that are less than or equal to a certain value.\n", 105 | "\n", 106 | "Let's use the initial image above as an example of this concept:\n", 107 | "\n", 108 | "\n", 109 | "\n", 110 | "Now, by using our observations for hours of sleep, we can create an empirical distribution in Python that gives the proportion of observations at or below a specific number of hours of sleep."
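One way to connect the rule to the empirical distribution (an added check, not part of the original notebook; it assumes the `Observations`, `mu`, and `sigma` defined in the cells above) is to compute the share of simulated observations within k standard deviations and compare it to 68 - 95 - 99.7:

```python
import numpy as np

obs = np.array(Observations)
for k in (1, 2, 3):
    share = np.mean(np.abs(obs - mu) <= k * sigma)
    print("within %d sigma: %.3f" % (k, share))
```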
111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "mu = 7\n", 120 | "\n", 121 | "sigma = 1.7\n", 122 | "\n", 123 | "Observations = [random.normalvariate(mu, sigma) for _ in range(100000)]\n", 124 | "\n", 125 | "sns.distplot(Observations)\n", 126 | "plt.axvline(np.mean(Observations) + np.std(Observations), 0, .59, color = 'g')\n", 127 | "plt.axvline(np.mean(Observations) - np.std(Observations), 0, .59, color = 'g')\n", 128 | "\n", 129 | "plt.axvline(np.mean(Observations) + (np.std(Observations) * 2), 0, .15, color = 'y')\n", 130 | "plt.axvline(np.mean(Observations) - (np.std(Observations) * 2), 0, .15, color = 'y')" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "metadata": {}, 137 | "outputs": [], 138 | "source": [ 139 | "from statsmodels.distributions.empirical_distribution import ECDF\n", 140 | "import matplotlib.pyplot as plt\n", 141 | "\n", 142 | "ecdf = ECDF(Observations)\n", 143 | "\n", 144 | "plt.plot(ecdf.x, ecdf.y)\n", 145 | "\n", 146 | "plt.axhline(y = 0.025, color = 'y', linestyle='-')\n", 147 | "plt.axvline(x = np.mean(Observations) - (2 * np.std(Observations)), color = 'y', linestyle='-')\n", 148 | "\n", 149 | "plt.axhline(y = 0.975, color = 'y', linestyle='-')\n", 150 | "plt.axvline(x = np.mean(Observations) + (2 * np.std(Observations)), color = 'y', linestyle='-')" 151 | ] 152 | } 153 | ], 154 | "metadata": { 155 | "kernelspec": { 156 | "display_name": "Python 3", 157 | "language": "python", 158 | "name": "python3" 159 | }, 160 | "language_info": { 161 | "codemirror_mode": { 162 | "name": "ipython", 163 | "version": 3 164 | }, 165 | "file_extension": ".py", 166 | "mimetype": "text/x-python", 167 | "name": "python", 168 | "nbconvert_exporter": "python", 169 | "pygments_lexer": "ipython3", 170 | "version": "3.6.3" 171 | } 172 | }, 173 | "nbformat": 4, 174 | "nbformat_minor": 2 175 | } 176 | -------------------------------------------------------------------------------- /Course 1 - Understanding and Visualizing Data with Python/Week 4 - Populations and Samples/Sampling_from_a_Biased_Population.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Sampling from a Biased Population\n", 8 | "In this tutorial we will go over some code that recreates the visualizations in the Interactive Sampling Distribution Demo. This demo looks at a hypothetical problem that illustrates what happens when we sample from a biased population and not the entire population we are interested in. This tutorial assumes that you have seen that demo, for context, and understand the statistics behind the graphs. 
" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": null, 14 | "metadata": {}, 15 | "outputs": [], 16 | "source": [ 17 | "# Import the packages that we will be using for the tutorial\n", 18 | "import numpy as np # for sampling for the distributions\n", 19 | "import matplotlib.pyplot as plt # for basic plotting \n", 20 | "import seaborn as sns; sns.set() # for plotting of the histograms\n", 21 | "\n", 22 | "# Recreate the simulations from the video \n", 23 | "mean_uofm = 155\n", 24 | "sd_uofm = 5\n", 25 | "mean_gym = 185 \n", 26 | "sd_gym = 5 \n", 27 | "gymperc = .3\n", 28 | "totalPopSize = 40000\n", 29 | "\n", 30 | "# Create the two subgroups\n", 31 | "uofm_students = np.random.normal(mean_uofm, sd_uofm, int(totalPopSize * (1 - gymperc)))\n", 32 | "students_at_gym = np.random.normal(mean_gym, sd_gym, int(totalPopSize * (gymperc)))\n", 33 | "\n", 34 | "# Create the population from the subgroups\n", 35 | "population = np.append(uofm_students, students_at_gym)\n", 36 | "\n", 37 | "# Set up the figure for plotting\n", 38 | "plt.figure(figsize=(10,12))\n", 39 | "\n", 40 | "# Plot the UofM students only\n", 41 | "plt.subplot(3,1,1)\n", 42 | "sns.distplot(uofm_students)\n", 43 | "plt.title(\"UofM Students Only\")\n", 44 | "plt.xlim([140,200])\n", 45 | "\n", 46 | "# Plot the Gym Goers only\n", 47 | "plt.subplot(3,1,2)\n", 48 | "sns.distplot(students_at_gym)\n", 49 | "plt.title(\"Gym Goers Only\")\n", 50 | "plt.xlim([140,200])\n", 51 | "\n", 52 | "# Plot both groups together\n", 53 | "plt.subplot(3,1,3)\n", 54 | "sns.distplot(population)\n", 55 | "plt.title(\"Full Population of UofM Students\")\n", 56 | "plt.axvline(x = np.mean(population))\n", 57 | "plt.xlim([140,200])\n", 58 | "\n", 59 | "plt.show()" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "## What Happens if We Sample from the Entire Population?\n", 67 | "We will sample randomly from all students at the University of Michigan." 
68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "metadata": {}, 74 | "outputs": [], 75 | "source": [ 76 | "# Simulation parameters\n", 77 | "numberSamps = 5000\n", 78 | "sampSize = 50\n", 79 | "\n", 80 | "# Get the sampling distribution of the mean from the entire population\n", 81 | "mean_distribution = np.empty(numberSamps)\n", 82 | "for i in xrange(numberSamps):\n", 83 | "    random_students = np.random.choice(population, sampSize)\n", 84 | "    mean_distribution[i] = np.mean(random_students)\n", 85 | " \n", 86 | "# Plot the population and the unbiased sampling distribution\n", 87 | "plt.figure(figsize = (10,8))\n", 88 | "\n", 89 | "# Plotting the population again\n", 90 | "plt.subplot(2,1,1)\n", 91 | "sns.distplot(population)\n", 92 | "plt.title(\"Full Population of UofM Students\")\n", 93 | "plt.axvline(x = np.mean(population))\n", 94 | "plt.xlim([140,200])\n", 95 | "\n", 96 | "# Plotting the sampling distribution\n", 97 | "plt.subplot(2,1,2)\n", 98 | "sns.distplot(mean_distribution)\n", 99 | "plt.title(\"Sampling Distribution of the Mean Weight of All UofM Students\")\n", 100 | "plt.axvline(x = np.mean(population))\n", 101 | "plt.axvline(x = np.mean(mean_distribution), color = \"black\")\n", 102 | "plt.xlim([140,200])\n", 103 | "\n", 104 | "plt.show()" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "## What Happens if We take a Non-Representative Sample?\n", 112 | "What happens if I only go to the gym to get the weight of individuals, and I don't sample randomly from all students at the University of Michigan?" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "metadata": {}, 119 | "outputs": [], 120 | "source": [ 121 | "# Simulation parameters\n", 122 | "numberSamps = 5000\n", 123 | "sampSize = 3\n", 124 | "\n", 125 | "# Get the sampling distribution of the mean from only the gym\n", 126 | "mean_distribution = np.empty(numberSamps)\n", 127 | "for i in xrange(numberSamps):\n", 128 | "    random_students = np.random.choice(students_at_gym, sampSize)\n", 129 | "    mean_distribution[i] = np.mean(random_students) \n", 130 | " \n", 131 | "\n", 132 | "# Plot the population and the biased sampling distribution\n", 133 | "plt.figure(figsize = (10,8))\n", 134 | "\n", 135 | "# Plotting the population again\n", 136 | "plt.subplot(2,1,1)\n", 137 | "sns.distplot(population)\n", 138 | "plt.title(\"Full Population of UofM Students\")\n", 139 | "plt.axvline(x = np.mean(population))\n", 140 | "plt.xlim([140,200])\n", 141 | "\n", 142 | "# Plotting the sampling distribution\n", 143 | "plt.subplot(2,1,2)\n", 144 | "sns.distplot(mean_distribution)\n", 145 | "plt.title(\"Sampling Distribution of the Mean Weight of Gym Goers\")\n", 146 | "plt.axvline(x = np.mean(population))\n", 147 | "plt.axvline(x = np.mean(students_at_gym), color = \"black\")\n", 148 | "plt.xlim([140,200])\n", 149 | "\n", 150 | "plt.show()" 151 | ] 152 | } 153 | ], 154 | "metadata": { 155 | "kernelspec": { 156 | "display_name": "Python 2", 157 | "language": "python", 158 | "name": "python2" 159 | }, 160 | "language_info": { 161 | "codemirror_mode": { 162 | "name": "ipython", 163 | "version": 2 164 | }, 165 | "file_extension": ".py", 166 | "mimetype": "text/x-python", 167 | "name": "python", 168 | "nbconvert_exporter": "python", 169 | "pygments_lexer": "ipython2", 170 | "version": "2.7.15" 171 | } 172 | }, 173 | "nbformat": 4, 174 | "nbformat_minor": 2 175 | } 176 | 
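To put a number on the bias shown in the second simulation (an added check, not part of the original notebook; it assumes the `population` and `students_at_gym` arrays from above), compare the population mean with the center of the gym-only sampling distribution:

```python
import numpy as np

gym_means = [np.random.choice(students_at_gym, 3).mean() for _ in range(5000)]
print("population mean:", population.mean())
print("gym-only sampling distribution centers near:", np.mean(gym_means))
```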
-------------------------------------------------------------------------------- /Course 1 - Understanding and Visualizing Data with Python/Week 4 - Populations and Samples/lecture_notes.md: -------------------------------------------------------------------------------- 1 | **University of Michigan Stats with Python Notes** 2 | 3 | 4 | 5 | **Course 1 - Week 4** 6 | 7 | 8 | 9 | **Sampling Distributions** 10 | 11 | Size of samples - affects variation in estimates 12 | 13 | \# of samples - affects the normality of the distribution 14 | 15 | Because we know that many samples result in a normal sampling distribution, we can use the size of the sample to estimate variation parameters in a hypothetical sampling distribution. 16 | 17 | One sample -> estimates of sampling distribution -> estimates of population 18 | 19 | 20 | 21 | **Population Inference - Single Probability Sample** 22 | 23 | 1. Compute point estimate 24 | 25 | - unbiased aka representative 26 | 27 | - if unequal probabilities of selection, use weights when computing point estimate 28 | 29 | 2. Estimate sampling distribution variance of point estimate 30 | 31 | - Standard Error = STD of the point estimate's sampling distribution; 2 x SE gives the 95% margin of error 32 | 33 | 3. Form a confidence interval 34 | 35 | - Point estimate +- 2 x Standard Error (95% confidence) 36 | 37 | 4. Test Hypothesis 38 | 39 | - Test stat = (estimate - null value) / standard error 40 | 41 | - p-value = probability of getting a test statistic at least this extreme, based on the sampling distribution of test statistics 42 | 43 | 44 | 45 | **Population Inference - Non-Probability Sample** 46 | 47 | \- Can’t use sampling theory 48 | 49 | 50 | 51 | *Methods:* 52 | 53 | 1. Quasi-Randomization 54 | 55 | - Combine probability sample & non-probability sample data 56 | 57 | - Build model to predict which source the data came from (binary classification) 58 | 59 | - Survey Weight = 1 / Predicted Prob 60 | 61 | - Variation of sampling distribution estimated by resampling the sample data 62 | 63 | 2.
Population Modeling 64 | 65 | - Build predictive models for survey variables using auxiliary data on people in the target population 66 | 67 | - Compute estimates of interest using the estimated totals 68 | 69 | - weighted mean = predicted total estimate / estimated population size 70 | 71 | 72 | 73 | **Complex Samples** 74 | 75 | \- Stratification 76 | 77 | \- Clustering 78 | 79 | \- Weighting samples to represent the population so the sample is not biased 80 | 81 | \- Non-response weights 82 | 83 | ![Screen Shot 2021-04-25 at 5.30.30 PM](/Users/miesner.jacob/Desktop/Screen Shot 2021-04-25 at 5.30.30 PM.png) -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/.DS_Store -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 1 - Overview & Inference Procedures/listsvsarrays.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import pandas as pd\n", 11 | "import matplotlib.pyplot as plt\n", 12 | "%matplotlib inline" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "# Lists vs. numpy arrays, dictionaries, functions, and lambda functions " 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "We are going to introduce a few more Python concepts. If you've been working through the NHANES example notebooks, you will have seen these in use already. There is a lot to say about these new concepts, but we will only be giving a brief introduction to each. For more information, follow the links provided or do your own search for the many great resources available on the web. " 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "### Lists vs numpy arrays" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "Lists can have multiple datatypes. For example, one element can be a string, another an int, and another a float. Lists are defined by using the square brackets: \[ \], with elements separated by commas, ','.\n", 41 | "Ex:\n", 42 | "```\n", 43 | "my_list = [1, 'Colorado', 4.7, 'rain']\n", 44 | "```\n", 45 | "Lists are indexed by position. Remember, in Python, the index starts at 0 and ends at length(list)-1. So to retrieve the first element of the list you call:\n", 46 | "```\n", 47 | "my_list[0]\n", 48 | "```\n", 49 | "\n", 50 | "Numpy arrays (np.array) differ from lists in that they contain only one datatype. For example, all the elements might be ints or strings or floats or objects. An array is defined by np.array(object), where the input 'object' can be, for example, a list or a tuple. \n", 51 | "Ex:\n", 52 | "```\n", 53 | "my_array = np.array([1, 4, 5, 2])\n", 54 | "```\n", 55 | "or \n", 56 | "```\n", 57 | "my_array = np.array((1, 4, 5, 2))\n", 58 | "```\n", 59 | "\n", 60 | "Lists and numpy arrays differ in their speed and memory efficiency. 
An intuitive reason for this is that Python lists have to store the value of each element and also the type of each element (since the types can differ), whereas numpy arrays only need to store the type once, because it is the same for all the elements in the array. \n", 61 | "\n", 62 | "You can do calculations with numpy arrays that can't be done on lists. \n", 63 | "Ex:\n", 64 | "```\n", 65 | "my_array/3\n", 66 | "```\n", 67 | "will return a numpy array, with each of the elements divided by 3. Whereas:\n", 68 | "```\n", 69 | "my_list/3\n", 70 | "```\n", 71 | "will throw an error.\n", 72 | "\n", 73 | "You can append items to the end of lists and numpy arrays, though they have slightly different commands. It is also of note that lists can append an item 'in place', but numpy arrays cannot.\n", 74 | "\n", 75 | "```\n", 76 | "my_list.append('new item')\n", 77 | "np.append(my_array, 5) # new element must be of the same type as all other elements\n", 78 | "```\n", 79 | "\n", 80 | "Links to python docs: \n", 81 | "[Lists](https://docs.python.org/3/tutorial/datastructures.html), [arrays](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.array.html), [more on arrays](https://docs.scipy.org/doc/numpy-1.15.0/user/basics.creation.html)" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 2, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "my_list = [1, 2, 3]\n", 91 | "my_array = np.array([1, 2, 3])" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 3, 97 | "metadata": {}, 98 | "outputs": [ 99 | { 100 | "data": { 101 | "text/plain": [ 102 | "1" 103 | ] 104 | }, 105 | "execution_count": 3, 106 | "metadata": {}, 107 | "output_type": "execute_result" 108 | } 109 | ], 110 | "source": [ 111 | "# Both indexed by position\n", 112 | "my_list[0]" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 4, 118 | "metadata": {}, 119 | "outputs": [ 120 | { 121 | "data": { 122 | "text/plain": [ 123 | "1" 124 | ] 125 | }, 126 | "execution_count": 4, 127 | "metadata": {}, 128 | "output_type": "execute_result" 129 | } 130 | ], 131 | "source": [ 132 | "my_array[0]" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 5, 138 | "metadata": {}, 139 | "outputs": [ 140 | { 141 | "data": { 142 | "text/plain": [ 143 | "array([0.33333333, 0.66666667, 1.
])" 144 | ] 145 | }, 146 | "execution_count": 5, 147 | "metadata": {}, 148 | "output_type": "execute_result" 149 | } 150 | ], 151 | "source": [ 152 | "my_array/3" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 6, 158 | "metadata": {}, 159 | "outputs": [ 160 | { 161 | "ename": "TypeError", 162 | "evalue": "unsupported operand type(s) for /: 'list' and 'int'", 163 | "output_type": "error", 164 | "traceback": [ 165 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 166 | "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", 167 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mmy_list\u001b[0m\u001b[0;34m/\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 168 | "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for /: 'list' and 'int'" 169 | ] 170 | } 171 | ], 172 | "source": [ 173 | "my_list/3" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 7, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "data": { 183 | "text/plain": [ 184 | "[1, 2, 3, 5]" 185 | ] 186 | }, 187 | "execution_count": 7, 188 | "metadata": {}, 189 | "output_type": "execute_result" 190 | } 191 | ], 192 | "source": [ 193 | "my_list.append(5) # inplace\n", 194 | "my_list" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 8, 200 | "metadata": {}, 201 | "outputs": [ 202 | { 203 | "data": { 204 | "text/plain": [ 205 | "array([1, 2, 3, 5])" 206 | ] 207 | }, 208 | "execution_count": 8, 209 | "metadata": {}, 210 | "output_type": "execute_result" 211 | } 212 | ], 213 | "source": [ 214 | "my_array = np.append(my_array, 5) # cannot do inplace because not always contiguous memory\n", 215 | "my_array" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "### Dictionaries \n", 223 | "Store key-values pairs and are indexed by the keys.\n", 224 | "denoted with \\{key1: value1, key2: value2\\}. The keys must be unique, the values do not need to be unique. \n", 225 | "Can be used for many tasks, for example, creating DataFrames and changing column names. 
\n", 226 | "\n", 227 | "[More about dictionaries in section 5.5](https://docs.python.org/3/tutorial/datastructures.html)" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": null, 233 | "metadata": {}, 234 | "outputs": [], 235 | "source": [ 236 | "dct = {'thing 1': 2, 'thing 2': 1}\n", 237 | "dct['thing 1']" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": null, 243 | "metadata": {}, 244 | "outputs": [], 245 | "source": [ 246 | "# adding to a dictionary\n", 247 | "dct['new thing'] = 'woooo'" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [ 256 | "dct['new thing']" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": null, 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "# Create DataFrame\n", 266 | "df = pd.DataFrame({'col1':range(3), 'col2':range(3,6)})\n", 267 | "df" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [ 276 | "# Change column names\n", 277 | "df.rename(columns={'col1': 'apples', 'col2':'oranges'})" 278 | ] 279 | } 280 | ], 281 | "metadata": { 282 | "kernelspec": { 283 | "display_name": "Python 3", 284 | "language": "python", 285 | "name": "python3" 286 | }, 287 | "language_info": { 288 | "codemirror_mode": { 289 | "name": "ipython", 290 | "version": 3 291 | }, 292 | "file_extension": ".py", 293 | "mimetype": "text/x-python", 294 | "name": "python", 295 | "nbconvert_exporter": "python", 296 | "pygments_lexer": "ipython3", 297 | "version": "3.6.3" 298 | } 299 | }, 300 | "nbformat": 4, 301 | "nbformat_minor": 2 302 | } 303 | -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 1 - Overview & Inference Procedures/week1_assessment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "J5p3JZwvGQEe" 8 | }, 9 | "source": [ 10 | "You will use the values of what you find in this assignment to answer questions in the quiz that follows. You may want to open this notebook to be displayed side-by-side on screen with this next quiz." 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "colab_type": "text", 17 | "id": "20Hp_V-eFzbI" 18 | }, 19 | "source": [ 20 | "1. Write a function that inputs an integers and returns the negative" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 1, 26 | "metadata": { 27 | "colab": {}, 28 | "colab_type": "code", 29 | "id": "tFPbRKR4FzbL" 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "# Write your function here\n", 34 | "\n", 35 | "def neg_integer(x):\n", 36 | " return x * -1" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 3, 42 | "metadata": { 43 | "colab": {}, 44 | "colab_type": "code", 45 | "id": "jW5MfUUnFzbQ" 46 | }, 47 | "outputs": [ 48 | { 49 | "data": { 50 | "text/plain": [ 51 | "-4" 52 | ] 53 | }, 54 | "execution_count": 3, 55 | "metadata": {}, 56 | "output_type": "execute_result" 57 | } 58 | ], 59 | "source": [ 60 | "# Test your function with input x\n", 61 | "x = 4\n", 62 | "neg_integer(4)" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": { 68 | "colab_type": "text", 69 | "id": "f6kLOf6_FzbU" 70 | }, 71 | "source": [ 72 | "2. 
Write a function that inputs a list of integers and returns the minimum value" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 16, 78 | "metadata": { 79 | "colab": {}, 80 | "colab_type": "code", 81 | "id": "IHV-wS_hFzbW" 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "# Write your function here\n", 86 | "def min_list_value(lst):\n", 87 | "    return min(lst)" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 18, 93 | "metadata": { 94 | "colab": {}, 95 | "colab_type": "code", 96 | "id": "EfvSeoaOFzba" 97 | }, 98 | "outputs": [ 99 | { 100 | "data": { 101 | "text/plain": [ 102 | "-12.5" 103 | ] 104 | }, 105 | "execution_count": 18, 106 | "metadata": {}, 107 | "output_type": "execute_result" 108 | } 109 | ], 110 | "source": [ 111 | "# Test your function with input lst\n", 112 | "lst = [-3, 0, 2, 100, -1, 2]\n", 113 | "min_list_value(lst)\n", 114 | "\n", 115 | "# Create your own input list to test with\n", 116 | "test_lst = [12, 346342, 78, 468, 4589, -12.5, 764]\n", 117 | "min_list_value(test_lst)" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": { 123 | "colab_type": "text", 124 | "id": "-yjvHCDuFzbd" 125 | }, 126 | "source": [ 127 | "#### Challenge problem: \n", 128 | "Write a function that takes in four arguments: lst1, lst2, str1, str2, and returns a pandas DataFrame that has the first column labeled str1 and the second column labeled str2, with values lst1 and lst2 scaled to be between 0 and 1.\n", 129 | "\n", 130 | "For example\n", 131 | "```\n", 132 | "lst1 = [1, 2, 3]\n", 133 | "lst2 = [2, 4, 5]\n", 134 | "str1 = 'one'\n", 135 | "str2 = 'two'\n", 136 | "\n", 137 | "my_function(lst1, lst2, str1, str2)\n", 138 | "``` \n", 139 | "should return a DataFrame that looks like:\n", 140 | "\n", 141 | "\n", 142 | "\n", 143 | "| | one | two |\n", 144 | "| --- | --- | --- |\n", 145 | "| 0 | 0 | 0 |\n", 146 | "| 1 | .5 | .666 |\n", 147 | "| 2 | 1 | 1 |\n", 148 | "\n" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": 53, 154 | "metadata": { 155 | "colab": {}, 156 | "colab_type": "code", 157 | "id": "UhTlOZX1Fzbf" 158 | }, 159 | "outputs": [], 160 | "source": [ 161 | "import pandas as pd\n", 162 | "import numpy as np\n", 163 | "\n", 164 | "def dataframe_creator(lst1, lst2, str1, str2):\n", 165 | "    lst1 = np.array(lst1)\n", 166 | "    lst1_normalized = (lst1-min(lst1))/(max(lst1)-min(lst1))\n", 167 | "    lst2 = np.array(lst2)\n", 168 | "    lst2_normalized = (lst2-min(lst2))/(max(lst2)-min(lst2))\n", 169 | "    return pd.DataFrame({str1:lst1_normalized, str2:lst2_normalized})" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": 54, 175 | "metadata": {}, 176 | "outputs": [], 177 | "source": [ 178 | "lst1 = [1, 2, 3]\n", 179 | "lst2 = [2, 4, 5]\n", 180 | "str1 = 'one'\n", 181 | "str2 = 'two'" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 55, 187 | "metadata": {}, 188 | "outputs": [ 189 | { 190 | "data": { 191 | "text/html": [ 192 | "
\n", 192 | "\n", 205 | "\n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | "
onetwo
00.00.000000
10.50.666667
21.01.000000
\n", 231 | "
" 232 | ], 233 | "text/plain": [ 234 | " one two\n", 235 | "0 0.0 0.000000\n", 236 | "1 0.5 0.666667\n", 237 | "2 1.0 1.000000" 238 | ] 239 | }, 240 | "execution_count": 55, 241 | "metadata": {}, 242 | "output_type": "execute_result" 243 | } 244 | ], 245 | "source": [ 246 | "dataframe_creator(lst1, lst2, str1, str2)" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": 56, 252 | "metadata": { 253 | "colab": {}, 254 | "colab_type": "code", 255 | "id": "0yABet-jFzbi" 256 | }, 257 | "outputs": [], 258 | "source": [ 259 | "# test your challenge problem function\n", 260 | "import numpy as np\n", 261 | "\n", 262 | "lst1 = np.random.randint(-234, 938, 100)\n", 263 | "lst2 = np.random.randint(-522, 123, 100)\n", 264 | "str1 = 'one'\n", 265 | "str2 = 'alpha'" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": 57, 271 | "metadata": { 272 | "scrolled": false 273 | }, 274 | "outputs": [ 275 | { 276 | "data": { 277 | "text/html": [ 278 | "
\n", 279 | "\n", 292 | "\n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 
564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | "
onealpha
00.0369420.888530
10.0609970.022617
20.0678690.187399
30.0575600.512116
40.8926120.634895
50.8427840.851373
60.4639180.075929
70.3505150.051696
80.9896910.930533
90.8857390.586430
100.4209620.898223
110.5678690.458805
120.6022340.835218
130.9286940.735057
140.4278350.957997
150.3152920.216478
160.6640890.017771
170.0893470.893376
180.8831620.000000
190.6597940.261712
200.0798970.135703
210.1254300.557351
220.2774910.907916
230.3745700.038772
240.1417530.185784
250.4750860.434572
260.7723370.145396
270.3101370.161551
280.2920960.405493
290.2491410.806139
.........
700.9123710.835218
710.4793810.151858
721.0000000.676898
730.5756010.891761
740.9201030.883683
750.0859110.216478
760.5824740.914378
770.1348800.702746
780.0034360.134087
790.1194160.232633
800.4175260.988691
810.6769760.891761
820.5094500.552504
830.6872850.179321
840.8556701.000000
850.8041240.416801
860.7731960.762520
870.1795530.546042
880.3264600.213247
890.7749140.762520
900.6314430.489499
910.6726800.794830
920.8006870.258481
930.7568730.943457
940.0506870.365105
950.4742270.268174
960.8685570.783522
970.4965640.050081
980.7010310.712439
990.0618560.460420
\n", 608 | "

100 rows × 2 columns

\n", 609 | "
" 610 | ], 611 | "text/plain": [ 612 | " one alpha\n", 613 | "0 0.036942 0.888530\n", 614 | "1 0.060997 0.022617\n", 615 | "2 0.067869 0.187399\n", 616 | "3 0.057560 0.512116\n", 617 | "4 0.892612 0.634895\n", 618 | "5 0.842784 0.851373\n", 619 | "6 0.463918 0.075929\n", 620 | "7 0.350515 0.051696\n", 621 | "8 0.989691 0.930533\n", 622 | "9 0.885739 0.586430\n", 623 | "10 0.420962 0.898223\n", 624 | "11 0.567869 0.458805\n", 625 | "12 0.602234 0.835218\n", 626 | "13 0.928694 0.735057\n", 627 | "14 0.427835 0.957997\n", 628 | "15 0.315292 0.216478\n", 629 | "16 0.664089 0.017771\n", 630 | "17 0.089347 0.893376\n", 631 | "18 0.883162 0.000000\n", 632 | "19 0.659794 0.261712\n", 633 | "20 0.079897 0.135703\n", 634 | "21 0.125430 0.557351\n", 635 | "22 0.277491 0.907916\n", 636 | "23 0.374570 0.038772\n", 637 | "24 0.141753 0.185784\n", 638 | "25 0.475086 0.434572\n", 639 | "26 0.772337 0.145396\n", 640 | "27 0.310137 0.161551\n", 641 | "28 0.292096 0.405493\n", 642 | "29 0.249141 0.806139\n", 643 | ".. ... ...\n", 644 | "70 0.912371 0.835218\n", 645 | "71 0.479381 0.151858\n", 646 | "72 1.000000 0.676898\n", 647 | "73 0.575601 0.891761\n", 648 | "74 0.920103 0.883683\n", 649 | "75 0.085911 0.216478\n", 650 | "76 0.582474 0.914378\n", 651 | "77 0.134880 0.702746\n", 652 | "78 0.003436 0.134087\n", 653 | "79 0.119416 0.232633\n", 654 | "80 0.417526 0.988691\n", 655 | "81 0.676976 0.891761\n", 656 | "82 0.509450 0.552504\n", 657 | "83 0.687285 0.179321\n", 658 | "84 0.855670 1.000000\n", 659 | "85 0.804124 0.416801\n", 660 | "86 0.773196 0.762520\n", 661 | "87 0.179553 0.546042\n", 662 | "88 0.326460 0.213247\n", 663 | "89 0.774914 0.762520\n", 664 | "90 0.631443 0.489499\n", 665 | "91 0.672680 0.794830\n", 666 | "92 0.800687 0.258481\n", 667 | "93 0.756873 0.943457\n", 668 | "94 0.050687 0.365105\n", 669 | "95 0.474227 0.268174\n", 670 | "96 0.868557 0.783522\n", 671 | "97 0.496564 0.050081\n", 672 | "98 0.701031 0.712439\n", 673 | "99 0.061856 0.460420\n", 674 | "\n", 675 | "[100 rows x 2 columns]" 676 | ] 677 | }, 678 | "execution_count": 57, 679 | "metadata": {}, 680 | "output_type": "execute_result" 681 | } 682 | ], 683 | "source": [ 684 | "dataframe_creator(lst1, lst2, str1, str2)" 685 | ] 686 | } 687 | ], 688 | "metadata": { 689 | "colab": { 690 | "collapsed_sections": [], 691 | "name": "week1_assessment.ipynb", 692 | "provenance": [], 693 | "version": "0.3.2" 694 | }, 695 | "kernelspec": { 696 | "display_name": "Python 3", 697 | "language": "python", 698 | "name": "python3" 699 | }, 700 | "language_info": { 701 | "codemirror_mode": { 702 | "name": "ipython", 703 | "version": 3 704 | }, 705 | "file_extension": ".py", 706 | "mimetype": "text/x-python", 707 | "name": "python", 708 | "nbconvert_exporter": "python", 709 | "pygments_lexer": "ipython3", 710 | "version": "3.6.3" 711 | } 712 | }, 713 | "nbformat": 4, 714 | "nbformat_minor": 1 715 | } 716 | -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/Confidence_Intervals_Differences_Population_Parameters.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Confidence Intervals\n", 8 | "\n", 9 | "\n", 10 | "This tutorial is going to demonstrate how to load data, clean/manipulate a dataset, and construct a confidence interval for the difference between two population proportions and means.\n", 11 | "\n", 
12 | "We will use the 2015-2016 wave of the NHANES data for our analysis.\n", 13 | "\n", 14 | "*Note: We have provided a notebook that includes more analysis, with examples of confidence intervals for one population proportions and means, in addition to the analysis I will show you in this tutorial. I highly recommend checking it out!\n", 15 | "\n", 16 | "For our population proportions, we will analyze the difference of proportion between female and male smokers. The column that specifies smoker and non-smoker is \"SMQ020\" in our dataset.\n", 17 | "\n", 18 | "For our population means, we will analyze the difference of mean of body mass index within our female and male populations. The column that includes the body mass index value is \"BMXBMI\".\n", 19 | "\n", 20 | "Additionally, the gender is specified in the column \"RIAGENDR\"." 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 1, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "import pandas as pd\n", 30 | "import numpy as np\n", 31 | "import matplotlib\n", 32 | "matplotlib.use('Agg')\n", 33 | "import seaborn as sns\n", 34 | "%matplotlib inline\n", 35 | "import matplotlib.pyplot as plt\n", 36 | "import statsmodels.api as sm" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 2, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "url = \"nhanes_2015_2016.csv\"\n", 46 | "da = pd.read_csv(url)" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "### Investigating and Cleaning Data" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 3, 59 | "metadata": {}, 60 | "outputs": [ 61 | { 62 | "data": { 63 | "text/plain": [ 64 | "0 Yes\n", 65 | "1 Yes\n", 66 | "2 Yes\n", 67 | "3 No\n", 68 | "4 No\n", 69 | "5 No\n", 70 | "6 Yes\n", 71 | "7 No\n", 72 | "8 No\n", 73 | "9 No\n", 74 | "10 Yes\n", 75 | "11 Yes\n", 76 | "12 Yes\n", 77 | "13 No\n", 78 | "14 No\n", 79 | "15 No\n", 80 | "16 No\n", 81 | "17 No\n", 82 | "18 Yes\n", 83 | "19 No\n", 84 | "20 No\n", 85 | "21 No\n", 86 | "22 Yes\n", 87 | "23 No\n", 88 | "24 No\n", 89 | "25 No\n", 90 | "26 Yes\n", 91 | "27 Yes\n", 92 | "28 No\n", 93 | "29 No\n", 94 | " ... 
\n", 95 | "5705 Yes\n", 96 | "5706 Yes\n", 97 | "5707 No\n", 98 | "5708 No\n", 99 | "5709 Yes\n", 100 | "5710 No\n", 101 | "5711 Yes\n", 102 | "5712 No\n", 103 | "5713 No\n", 104 | "5714 No\n", 105 | "5715 No\n", 106 | "5716 Yes\n", 107 | "5717 Yes\n", 108 | "5718 No\n", 109 | "5719 Yes\n", 110 | "5720 No\n", 111 | "5721 No\n", 112 | "5722 No\n", 113 | "5723 Yes\n", 114 | "5724 No\n", 115 | "5725 No\n", 116 | "5726 Yes\n", 117 | "5727 No\n", 118 | "5728 No\n", 119 | "5729 No\n", 120 | "5730 Yes\n", 121 | "5731 No\n", 122 | "5732 Yes\n", 123 | "5733 Yes\n", 124 | "5734 No\n", 125 | "Name: SMQ020x, Length: 5735, dtype: object" 126 | ] 127 | }, 128 | "execution_count": 3, 129 | "metadata": {}, 130 | "output_type": "execute_result" 131 | } 132 | ], 133 | "source": [ 134 | "# Recode SMQ020 from 1/2 to Yes/No into new variable SMQ020x\n", 135 | "da[\"SMQ020x\"] = da.SMQ020.replace({1: \"Yes\", 2: \"No\", 7: np.nan, 9: np.nan})\n", 136 | "da[\"SMQ020x\"]" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 4, 142 | "metadata": {}, 143 | "outputs": [ 144 | { 145 | "data": { 146 | "text/plain": [ 147 | "0 Male\n", 148 | "1 Male\n", 149 | "2 Male\n", 150 | "3 Female\n", 151 | "4 Female\n", 152 | "5 Female\n", 153 | "6 Male\n", 154 | "7 Female\n", 155 | "8 Male\n", 156 | "9 Male\n", 157 | "10 Male\n", 158 | "11 Male\n", 159 | "12 Female\n", 160 | "13 Female\n", 161 | "14 Male\n", 162 | "15 Female\n", 163 | "16 Female\n", 164 | "17 Female\n", 165 | "18 Female\n", 166 | "19 Female\n", 167 | "20 Male\n", 168 | "21 Female\n", 169 | "22 Female\n", 170 | "23 Female\n", 171 | "24 Male\n", 172 | "25 Female\n", 173 | "26 Male\n", 174 | "27 Female\n", 175 | "28 Male\n", 176 | "29 Female\n", 177 | " ... \n", 178 | "5705 Male\n", 179 | "5706 Male\n", 180 | "5707 Female\n", 181 | "5708 Female\n", 182 | "5709 Male\n", 183 | "5710 Female\n", 184 | "5711 Male\n", 185 | "5712 Female\n", 186 | "5713 Male\n", 187 | "5714 Male\n", 188 | "5715 Female\n", 189 | "5716 Female\n", 190 | "5717 Male\n", 191 | "5718 Male\n", 192 | "5719 Female\n", 193 | "5720 Male\n", 194 | "5721 Female\n", 195 | "5722 Female\n", 196 | "5723 Female\n", 197 | "5724 Female\n", 198 | "5725 Male\n", 199 | "5726 Male\n", 200 | "5727 Female\n", 201 | "5728 Male\n", 202 | "5729 Male\n", 203 | "5730 Female\n", 204 | "5731 Male\n", 205 | "5732 Female\n", 206 | "5733 Male\n", 207 | "5734 Female\n", 208 | "Name: RIAGENDRx, Length: 5735, dtype: object" 209 | ] 210 | }, 211 | "execution_count": 4, 212 | "metadata": {}, 213 | "output_type": "execute_result" 214 | } 215 | ], 216 | "source": [ 217 | "# Recode RIAGENDR from 1/2 to Male/Female into new variable RIAGENDRx\n", 218 | "da[\"RIAGENDRx\"] = da.RIAGENDR.replace({1: \"Male\", 2: \"Female\"})\n", 219 | "da[\"RIAGENDRx\"]" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": 5, 225 | "metadata": {}, 226 | "outputs": [ 227 | { 228 | "data": { 229 | "text/html": [ 230 | "
\n", 231 | "\n", 244 | "\n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | "
RIAGENDRxFemaleMale
SMQ020x
No20661340
Yes9061413
\n", 270 | "
" 271 | ], 272 | "text/plain": [ 273 | "RIAGENDRx Female Male\n", 274 | "SMQ020x \n", 275 | "No 2066 1340\n", 276 | "Yes 906 1413" 277 | ] 278 | }, 279 | "execution_count": 5, 280 | "metadata": {}, 281 | "output_type": "execute_result" 282 | } 283 | ], 284 | "source": [ 285 | "dx = da[[\"SMQ020x\", \"RIAGENDRx\"]].dropna()\n", 286 | "pd.crosstab(dx.SMQ020x, dx.RIAGENDRx)" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": 6, 292 | "metadata": {}, 293 | "outputs": [], 294 | "source": [ 295 | "# Recode SMQ020x from Yes/No to 1/0 into existing variable SMQ020x\n", 296 | "dx[\"SMQ020x\"] = dx.SMQ020x.replace({\"Yes\": 1, \"No\": 0})" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": 7, 302 | "metadata": {}, 303 | "outputs": [ 304 | { 305 | "data": { 306 | "text/html": [ 307 | "
\n", 308 | "\n", 321 | "\n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | "
ProportionTotal n
RIAGENDRx
Female0.3048452972
Male0.5132582753
\n", 347 | "
" 348 | ], 349 | "text/plain": [ 350 | " Proportion Total n\n", 351 | "RIAGENDRx \n", 352 | "Female 0.304845 2972\n", 353 | "Male 0.513258 2753" 354 | ] 355 | }, 356 | "execution_count": 7, 357 | "metadata": {}, 358 | "output_type": "execute_result" 359 | } 360 | ], 361 | "source": [ 362 | "dz = dx.groupby(\"RIAGENDRx\").agg({\"SMQ020x\": [np.mean, np.size]})\n", 363 | "dz.columns = [\"Proportion\", \"Total n\"]\n", 364 | "dz" 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "metadata": {}, 370 | "source": [ 371 | "### Constructing Confidence Intervals\n", 372 | "\n", 373 | "Now that we have the population proportions of male and female smokers, we can begin to calculate confidence intervals. From lecture, we know that the equation is as follows:\n", 374 | "\n", 375 | "$$Best\\ Estimate \\pm Margin\\ of\\ Error$$\n", 376 | "\n", 377 | "Where the *Best Estimate* is the **observed population proportion or mean** from the sample and the *Margin of Error* is the **t-multiplier**.\n", 378 | "\n", 379 | "The equation to create a 95% confidence interval can also be shown as:\n", 380 | "\n", 381 | "$$Population\\ Proportion\\ or\\ Mean\\ \\pm (t-multiplier *\\ Standard\\ Error)$$\n", 382 | "\n", 383 | "The Standard Error (SE) is calculated differenly for population proportion and mean:\n", 384 | "\n", 385 | "$$Standard\\ Error \\ for\\ Population\\ Proportion = \\sqrt{\\frac{Population\\ Proportion * (1 - Population\\ Proportion)}{Number\\ Of\\ Observations}}$$\n", 386 | "\n", 387 | "$$Standard\\ Error \\ for\\ Mean = \\frac{Standard\\ Deviation}{\\sqrt{Number\\ Of\\ Observations}}$$\n", 388 | "\n", 389 | "Lastly, the standard error for difference of population proportions and means is:\n", 390 | "\n", 391 | "$$Standard\\ Error\\ for\\ Difference\\ of\\ Two\\ Population\\ Proportions\\ Or\\ Means = \\sqrt{(SE_{\\ 1})^2 + (SE_{\\ 2})^2}$$" 392 | ] 393 | }, 394 | { 395 | "cell_type": "markdown", 396 | "metadata": {}, 397 | "source": [ 398 | "#### Difference of Two Population Proportions" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": 9, 404 | "metadata": {}, 405 | "outputs": [ 406 | { 407 | "data": { 408 | "text/plain": [ 409 | "0.00844415041930423" 410 | ] 411 | }, 412 | "execution_count": 9, 413 | "metadata": {}, 414 | "output_type": "execute_result" 415 | } 416 | ], 417 | "source": [ 418 | "p = .304845\n", 419 | "n = 2972\n", 420 | "se_female = np.sqrt(p * (1 - p)/n)\n", 421 | "se_female" 422 | ] 423 | }, 424 | { 425 | "cell_type": "code", 426 | "execution_count": 10, 427 | "metadata": {}, 428 | "outputs": [ 429 | { 430 | "data": { 431 | "text/plain": [ 432 | "0.009526078787008965" 433 | ] 434 | }, 435 | "execution_count": 10, 436 | "metadata": {}, 437 | "output_type": "execute_result" 438 | } 439 | ], 440 | "source": [ 441 | "p = .513258\n", 442 | "n = 2753\n", 443 | "se_male = np.sqrt(p * (1 - p)/ n)\n", 444 | "se_male" 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": 11, 450 | "metadata": {}, 451 | "outputs": [ 452 | { 453 | "data": { 454 | "text/plain": [ 455 | "0.012729880335656654" 456 | ] 457 | }, 458 | "execution_count": 11, 459 | "metadata": {}, 460 | "output_type": "execute_result" 461 | } 462 | ], 463 | "source": [ 464 | "se_diff = np.sqrt(se_female**2 + se_male**2)\n", 465 | "se_diff" 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": 12, 471 | "metadata": {}, 472 | "outputs": [ 473 | { 474 | "data": { 475 | "text/plain": [ 476 | "(-0.23336356545788706, -0.18346243454211297)" 477 | ] 478 | 
}, 479 | "execution_count": 12, 480 | "metadata": {}, 481 | "output_type": "execute_result" 482 | } 483 | ], 484 | "source": [ 485 | "d = .304845 - .513258\n", 486 | "lcb = d - 1.96 * se_diff\n", 487 | "ucb = d + 1.96 * se_diff\n", 488 | "(lcb, ucb)" 489 | ] 490 | }, 491 | { 492 | "cell_type": "markdown", 493 | "metadata": {}, 494 | "source": [ 495 | "#### Difference of Two Population Means" 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": 13, 501 | "metadata": {}, 502 | "outputs": [ 503 | { 504 | "data": { 505 | "text/plain": [ 506 | "0 27.8\n", 507 | "1 30.8\n", 508 | "2 28.8\n", 509 | "3 42.4\n", 510 | "4 20.3\n", 511 | "Name: BMXBMI, dtype: float64" 512 | ] 513 | }, 514 | "execution_count": 13, 515 | "metadata": {}, 516 | "output_type": "execute_result" 517 | } 518 | ], 519 | "source": [ 520 | "da[\"BMXBMI\"].head()" 521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "execution_count": 14, 526 | "metadata": {}, 527 | "outputs": [ 528 | { 529 | "data": { 530 | "text/html": [ 531 | "
\n", 532 | "\n", 549 | "\n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | "
BMXBMI
meanstdsize
RIAGENDRx
Female29.9399467.7533192976.0
Male28.7780726.2525682759.0
\n", 583 | "
" 584 | ], 585 | "text/plain": [ 586 | " BMXBMI \n", 587 | " mean std size\n", 588 | "RIAGENDRx \n", 589 | "Female 29.939946 7.753319 2976.0\n", 590 | "Male 28.778072 6.252568 2759.0" 591 | ] 592 | }, 593 | "execution_count": 14, 594 | "metadata": {}, 595 | "output_type": "execute_result" 596 | } 597 | ], 598 | "source": [ 599 | "da.groupby(\"RIAGENDRx\").agg({\"BMXBMI\": [np.mean, np.std, np.size]})" 600 | ] 601 | }, 602 | { 603 | "cell_type": "code", 604 | "execution_count": 15, 605 | "metadata": {}, 606 | "outputs": [ 607 | { 608 | "data": { 609 | "text/plain": [ 610 | "(0.14212523289878048, 0.11903716451870151)" 611 | ] 612 | }, 613 | "execution_count": 15, 614 | "metadata": {}, 615 | "output_type": "execute_result" 616 | } 617 | ], 618 | "source": [ 619 | "sem_female = 7.753319 / np.sqrt(2976)\n", 620 | "sem_male = 6.252568 / np.sqrt(2759)\n", 621 | "(sem_female, sem_male)" 622 | ] 623 | }, 624 | { 625 | "cell_type": "code", 626 | "execution_count": null, 627 | "metadata": {}, 628 | "outputs": [], 629 | "source": [ 630 | "sem_diff = np.sqrt(sem_female**2 + sem_male**2)\n", 631 | "sem_diff" 632 | ] 633 | }, 634 | { 635 | "cell_type": "code", 636 | "execution_count": null, 637 | "metadata": {}, 638 | "outputs": [], 639 | "source": [ 640 | "d = 29.939946 - 28.778072" 641 | ] 642 | }, 643 | { 644 | "cell_type": "code", 645 | "execution_count": null, 646 | "metadata": {}, 647 | "outputs": [], 648 | "source": [ 649 | "lcb = d - 1.96 * sem_diff\n", 650 | "ucb = d + 1.96 * sem_diff\n", 651 | "(lcb, ucb)" 652 | ] 653 | } 654 | ], 655 | "metadata": { 656 | "kernelspec": { 657 | "display_name": "Python 3", 658 | "language": "python", 659 | "name": "python3" 660 | }, 661 | "language_info": { 662 | "codemirror_mode": { 663 | "name": "ipython", 664 | "version": 3 665 | }, 666 | "file_extension": ".py", 667 | "mimetype": "text/x-python", 668 | "name": "python", 669 | "nbconvert_exporter": "python", 670 | "pygments_lexer": "ipython3", 671 | "version": "3.6.3" 672 | } 673 | }, 674 | "nbformat": 4, 675 | "nbformat_minor": 2 676 | } 677 | -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/intro_confidence_intervals.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Statistical Inference with Confidence Intervals\n", 8 | "\n", 9 | "Throughout week 2, we have explored the concept of confidence intervals, how to calculate them, interpret them, and what confidence really means. \n", 10 | "\n", 11 | "In this tutorial, we're going to review how to calculate confidence intervals of population proportions and means.\n", 12 | "\n", 13 | "To begin, let's go over some of the material from this week and why confidence intervals are useful tools when deriving insights from data.\n", 14 | "\n", 15 | "### Why Confidence Intervals?\n", 16 | "\n", 17 | "Confidence intervals are a calculated range or boundary around a parameter or a statistic that is supported mathematically with a certain level of confidence. 
For example, in the lecture, we estimated, with 95% confidence, that the population proportion of parents with a toddler that use a car seat for all travel with their toddler was somewhere between 82.2% and 87.7%.\n", 18 | "\n", 19 | "This is *__different__* than having a 95% probability that the true population proportion is within our confidence interval.\n", 20 | "\n", 21 | "Essentially, if we were to repeat this process, 95% of our calculated confidence intervals would contain the true proportion.\n", 22 | "\n", 23 | "### How are Confidence Intervals Calculated?\n", 24 | "\n", 25 | "Our equation for calculating confidence intervals is as follows:\n", 26 | "\n", 27 | "$$Best\\ Estimate \\pm Margin\\ of\\ Error$$\n", 28 | "\n", 29 | "Where the *Best Estimate* is the **observed population proportion or mean** and the *Margin of Error* is the **t-multiplier**.\n", 30 | "\n", 31 | "The t-multiplier is calculated based on the degrees of freedom and desired confidence level. For samples with more than 30 observations and a confidence level of 95%, the t-multiplier is 1.96\n", 32 | "\n", 33 | "The equation to create a 95% confidence interval can also be shown as:\n", 34 | "\n", 35 | "$$Population\\ Proportion\\ or\\ Mean\\ \\pm (t-multiplier *\\ Standard\\ Error)$$\n", 36 | "\n", 37 | "Lastly, the Standard Error is calculated differenly for population proportion and mean:\n", 38 | "\n", 39 | "We apply similar techniques when constructing a confidence interval for a mean, but now we are interested in estimating the population mean by using the sample statistic () and the multiplier is a value. Similar to the values that you used as the multiplier for constructing confidence intervals for population proportions, here you will use values as the multipliers. 
\n", 40 | "\n", 41 | "$$Standard\\ Error \\ for\\ Population\\ Proportion = \\sqrt{\\frac{Population\\ Proportion * (1 - Population\\ Proportion)}{Number\\ Of\\ Observations}}$$\n", 42 | "\n", 43 | "$$Standard\\ Error \\ for\\ Mean = \\frac{Standard\\ Deviation}{\\sqrt{Number\\ Of\\ Observations}}$$\n", 44 | "\n", 45 | "Let's replicate the car seat example from lecture:" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 1, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "import numpy as np" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 2, 60 | "metadata": {}, 61 | "outputs": [ 62 | { 63 | "data": { 64 | "text/plain": [ 65 | "0.01390952774409444" 66 | ] 67 | }, 68 | "execution_count": 2, 69 | "metadata": {}, 70 | "output_type": "execute_result" 71 | } 72 | ], 73 | "source": [ 74 | "tstar = 1.96\n", 75 | "p = .85\n", 76 | "n = 659\n", 77 | "\n", 78 | "se = np.sqrt((p * (1 - p))/n)\n", 79 | "se" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 3, 85 | "metadata": {}, 86 | "outputs": [ 87 | { 88 | "data": { 89 | "text/plain": [ 90 | "(0.8227373256215749, 0.8772626743784251)" 91 | ] 92 | }, 93 | "execution_count": 3, 94 | "metadata": {}, 95 | "output_type": "execute_result" 96 | } 97 | ], 98 | "source": [ 99 | "lcb = p - tstar * se\n", 100 | "ucb = p + tstar * se\n", 101 | "(lcb, ucb)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 4, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "import statsmodels.api as sm" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 5, 116 | "metadata": {}, 117 | "outputs": [ 118 | { 119 | "data": { 120 | "text/plain": [ 121 | "(0.8227378265796143, 0.8772621734203857)" 122 | ] 123 | }, 124 | "execution_count": 5, 125 | "metadata": {}, 126 | "output_type": "execute_result" 127 | } 128 | ], 129 | "source": [ 130 | "sm.stats.proportion_confint(n * p, n)" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "Now, lets take our Cartwheel dataset introduced in lecture and calculate a confidence interval for our mean cartwheel distance:" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 6, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "import pandas as pd\n", 147 | "\n", 148 | "df = pd.read_csv(\"Cartwheeldata.csv\")" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": 7, 154 | "metadata": {}, 155 | "outputs": [ 156 | { 157 | "data": { 158 | "text/html": [ 159 | "
\n", 160 | "\n", 173 | "\n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | "
IDAgeGenderGenderGroupGlassesGlassesGroupHeightWingspanCWDistanceCompleteCompleteGroupScore
0156F1Y162.061.079Y17
1226F1Y162.060.070Y18
2333F1Y166.064.085Y17
3439F1N064.063.087Y110
4527M2N073.075.072N04
\n", 269 | "
" 270 | ], 271 | "text/plain": [ 272 | " ID Age Gender GenderGroup Glasses GlassesGroup Height Wingspan \\\n", 273 | "0 1 56 F 1 Y 1 62.0 61.0 \n", 274 | "1 2 26 F 1 Y 1 62.0 60.0 \n", 275 | "2 3 33 F 1 Y 1 66.0 64.0 \n", 276 | "3 4 39 F 1 N 0 64.0 63.0 \n", 277 | "4 5 27 M 2 N 0 73.0 75.0 \n", 278 | "\n", 279 | " CWDistance Complete CompleteGroup Score \n", 280 | "0 79 Y 1 7 \n", 281 | "1 70 Y 1 8 \n", 282 | "2 85 Y 1 7 \n", 283 | "3 87 Y 1 10 \n", 284 | "4 72 N 0 4 " 285 | ] 286 | }, 287 | "execution_count": 7, 288 | "metadata": {}, 289 | "output_type": "execute_result" 290 | } 291 | ], 292 | "source": [ 293 | "df.head()" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": 8, 299 | "metadata": {}, 300 | "outputs": [ 301 | { 302 | "data": { 303 | "text/plain": [ 304 | "25" 305 | ] 306 | }, 307 | "execution_count": 8, 308 | "metadata": {}, 309 | "output_type": "execute_result" 310 | } 311 | ], 312 | "source": [ 313 | "mean = df[\"CWDistance\"].mean()\n", 314 | "sd = df[\"CWDistance\"].std()\n", 315 | "n = len(df)\n", 316 | "\n", 317 | "n" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 9, 323 | "metadata": {}, 324 | "outputs": [ 325 | { 326 | "data": { 327 | "text/plain": [ 328 | "3.0117104774529704" 329 | ] 330 | }, 331 | "execution_count": 9, 332 | "metadata": {}, 333 | "output_type": "execute_result" 334 | } 335 | ], 336 | "source": [ 337 | "tstar = 2.064\n", 338 | "\n", 339 | "se = sd/np.sqrt(n)\n", 340 | "\n", 341 | "se" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": 10, 347 | "metadata": {}, 348 | "outputs": [ 349 | { 350 | "data": { 351 | "text/plain": [ 352 | "(76.26382957453707, 88.69617042546294)" 353 | ] 354 | }, 355 | "execution_count": 10, 356 | "metadata": {}, 357 | "output_type": "execute_result" 358 | } 359 | ], 360 | "source": [ 361 | "lcb = mean - tstar * se\n", 362 | "ucb = mean + tstar * se\n", 363 | "(lcb, ucb)" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": 11, 369 | "metadata": {}, 370 | "outputs": [ 371 | { 372 | "data": { 373 | "text/plain": [ 374 | "(76.57715593233024, 88.38284406766977)" 375 | ] 376 | }, 377 | "execution_count": 11, 378 | "metadata": {}, 379 | "output_type": "execute_result" 380 | } 381 | ], 382 | "source": [ 383 | "sm.stats.DescrStatsW(df[\"CWDistance\"]).zconfint_mean()" 384 | ] 385 | } 386 | ], 387 | "metadata": { 388 | "kernelspec": { 389 | "display_name": "Python 3", 390 | "language": "python", 391 | "name": "python3" 392 | }, 393 | "language_info": { 394 | "codemirror_mode": { 395 | "name": "ipython", 396 | "version": 3 397 | }, 398 | "file_extension": ".py", 399 | "mimetype": "text/x-python", 400 | "name": "python", 401 | "nbconvert_exporter": "python", 402 | "pygments_lexer": "ipython3", 403 | "version": "3.6.3" 404 | } 405 | }, 406 | "nbformat": 4, 407 | "nbformat_minor": 2 408 | } 409 | -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/calculating_sample_sizes.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/calculating_sample_sizes.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with 
Python/Week 2 - Confidence Intervals/slides/confidence_interval_approaches.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/confidence_interval_approaches.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/confidence_interval_difference_means_paired_data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/confidence_interval_difference_means_paired_data.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/confidence_interval_difference_means_pooled.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/confidence_interval_difference_means_pooled.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/confidence_interval_difference_means_unpooled.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/confidence_interval_difference_means_unpooled.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/confidence_interval_difference_population_proportions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/confidence_interval_difference_population_proportions.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/confidence_interval_one_mean.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/confidence_interval_one_mean.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/confidence_interval_one_population_proportion.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/confidence_interval_one_population_proportion.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/difference_in_proportion_confidence_interval.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/difference_in_proportion_confidence_interval.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/pooled_confidence_interval_calucaltions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/pooled_confidence_interval_calucaltions.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/understanding_confidence_level.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/understanding_confidence_level.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/unpooled_confidence_interval_calculations.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 2 - Confidence Intervals/slides/unpooled_confidence_interval_calculations.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/.DS_Store -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/chocolate_cycling_experiment_analysis.txt: -------------------------------------------------------------------------------- 1 | TO: Jane Doe, US Bicycle Team Head Trainer 2 | FROM: John Doe, Data Scientist 3 | DATE: May 16, 2021 4 | SUBJECT: Chocolate Consumption/Bicycling Performance Experiment Analysis 5 | 6 | Overview 7 | This memorandum is designed to address the experiment conducted by the U.S. bicycle team to help them train for the Tour de France. 
The experiment was designed to test how the regular consumption of chocolate affects the total distance 8 | covered during an all-out sprint and if the type of chocolate consumed matters. 9 | 10 | Benefits of a Cross-over Experiment 11 | A cross-over experimental design was used for this experiment. Each of the experiment's nine male participants underwent baseline measurements before receiving the first treatment. Next, those same measurements were taken 12 | in two trials after participants consumed either dark chocolate (40 grams of Dove) or white chocolate (40 grams of Milkybar) for two weeks. The order of the treatment was randomized. Cross-over studies have increased statistical 13 | power because of the elimination of confounding variables that could arise from using different subjects to measure the effects of the treatments. 14 | 15 | Analysis of Test Results 16 | First, let’s look at the difference in mean distance covered during an all-out sprint (dark chocolate over baseline). The p-value for this difference is .001, lower than our significance level of 0.05, which enables us to state that the dark chocolate group’s increased all-out sprint distance over the baseline group is statistically significant! This test tells us that dark chocolate has a positive effect on all-out sprint distance, and we can say with 95% confidence the average performance gain is somewhere between 165m to 314m. 17 | Next, let’s look at the difference in mean distance covered during an all-out sprint (white chocolate over baseline). The p-value for the performance increase of white chocolate over baseline is 0.319 which is greater than our significance 18 | level of .05, which says we have insufficient evidence to reject the null hypothesis. This means we do not have the evidence to prove that white chocolate provides a performance gain over not eating chocolate at all, even though the 19 | point estimate for the white chocolate group was greater. 20 | Finally, let’s look at the difference in mean distance covered during an all-out sprint (dark chocolate over white chocolate). The p-value for this difference is .003, lower than our significance level of 0.05, which enables us to state that the dark chocolate group’s increased all-out sprint distance over the white chocolate group is statistically significant! This test tells us that dark chocolate has a more positive effect on all-out sprint distance than white chocolate, and we can say 21 | with 95% confidence, the average performance gain is somewhere between 82m to 292m. 22 | 23 | Conclusion 24 | The results of this statistical analysis support the conclusion that treating male cyclists with dark chocolate 2 weeks before measuring an all-out sprint distance is better than not treating with chocolate or treating with white chocolate. 25 | 26 | Sources: Patel, R. K.; Brouner, J.; Spendiff, O. Journal of the International Society of Sports Nutrition. 
2015 12:47 -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/.DS_Store -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/alternative_tests_difference_population_proportion.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/alternative_tests_difference_population_proportion.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/assumptions_one_sample_mean_t_test.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/assumptions_one_sample_mean_t_test.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/checking_sample_sizes_two_population_proportions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/checking_sample_sizes_two_population_proportions.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/ddof_difference_population_means_independent.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/ddof_difference_population_means_independent.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/population_proportion_hypothesis_test.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/population_proportion_hypothesis_test.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/sample_size_check_population_proportion.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/sample_size_check_population_proportion.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/t_test_one_sample_mean.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/t_test_one_sample_mean.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/t_test_one_sample_mean_normality.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/t_test_one_sample_mean_normality.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/test_statisitic_one_sample_mean_assume_normality.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/test_statisitic_one_sample_mean_assume_normality.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/test_statistic_difference_means_independent_pooled.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/test_statistic_difference_means_independent_pooled.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/test_statistic_difference_means_independent_unpooled.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/test_statistic_difference_means_independent_unpooled.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/test_statistic_difference_two_population_proportions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis 
Testing/slides/test_statistic_difference_two_population_proportions.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/test_statistic_one_population_proportion.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 3 - Hypothesis Testing/slides/test_statistic_one_population_proportion.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 4 - Learner Application/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 4 - Learner Application/.DS_Store -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 4 - Learner Application/.ipynb_checkpoints/Week 4 Quiz-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 4 6 | } 7 | -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 4 - Learner Application/Week 4 Quiz.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "*Question 1*\n", 8 | "\n", 9 | "**A simple random sample of 500 undergraduates at a large university self-administered a political knowledge test, where the maximum score is 100 and the minimum score is 0. The mean score was 62.5, and the standard deviation of the scores was 10. 
What is a 95% confidence interval for the overall undergraduate mean at the university?**" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 3, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import math" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 4, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "n = 500\n", 28 | "point_estimate = 62.5\n", 29 | "sample_standard_deviation = 10\n", 30 | "t_multiplier = 1.96" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 8, 36 | "metadata": {}, 37 | "outputs": [ 38 | { 39 | "name": "stdout", 40 | "output_type": "stream", 41 | "text": [ 42 | "standard error: 0.88\n" 43 | ] 44 | } 45 | ], 46 | "source": [ 47 | "standard_error = t_multiplier * (sample_standard_deviation/math.sqrt(n))\n", 48 | "print(\"standard error: \" +str(round(standard_error,2)))" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 12, 54 | "metadata": {}, 55 | "outputs": [ 56 | { 57 | "name": "stdout", 58 | "output_type": "stream", 59 | "text": [ 60 | "confidence interval: (63.37653864717992, 61.62346135282008)\n" 61 | ] 62 | } 63 | ], 64 | "source": [ 65 | "confidence_interval = (point_estimate + standard_error, point_estimate - standard_error)\n", 66 | "print(\"confidence interval: \" + str(confidence_interval))" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "*Question 2*\n", 74 | "\n", 75 | "**Given the result in Problem 1, what would we conclude about a hypothesized mean of 63?**\n", 76 | "\n", 77 | "We have evidence in support of this hypothesized mean." 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "*Question 3*\n", 85 | "\n", 86 | "**How do we interpret the confidence interval in Problem 1?**\n", 87 | "\n", 88 | "95% of all confidence intervals computed this way will cover the true population mean (in expectation)." 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "*Question 4*\n", 96 | "\n", 97 | "**We perform a two-tailed, one-sample t-test of the null hypothesis that the true population mean is 63, versus the alternative hypothesis that the mean is different from 63. We find a test statistic of t = -1.12 (df = 499), with a p-value of 0.264. What is our decision about the null hypothesis at an alpha=0.05?**" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 13, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "null_hypothesis = 63\n", 107 | "test_statistic= -1.12\n", 108 | "df = 499\n", 109 | "p_value = 0.05" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "We fail to reject it; the mean is not significantly different from this hypothesized mean." 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "*Question 5*\n", 124 | "\n", 125 | "**A new experimental drug for reducing the pain due to migraine headaches is being tested in a randomized controlled trial. A total of 100 participants with a history of migraine headaches are given the drug, and 100 participants with the same history are given a placebo pill. Each participant is asked about their pain one hour after taking the medication, and whether it has been reduced (yes or no). 
A biostatistician computes an exact 95% confidence interval for the difference in the proportions of people experiencing pain relief within one hour (treatment minus control). The 95% confidence interval for the difference in proportions is (-0.05, 0.09). What should the biostatistician conclude?**" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 14, 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "n = 100\n", 135 | "significance = 0.05\n", 136 | "resulting_interval = (-0.05, 0.09)" 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "There is no evidence at all of the experimental pill being effective; the proportions are statistically identical." 144 | ] 145 | } 146 | ], 147 | "metadata": { 148 | "kernelspec": { 149 | "display_name": "Python 3", 150 | "language": "python", 151 | "name": "python3" 152 | }, 153 | "language_info": { 154 | "codemirror_mode": { 155 | "name": "ipython", 156 | "version": 3 157 | }, 158 | "file_extension": ".py", 159 | "mimetype": "text/x-python", 160 | "name": "python", 161 | "nbconvert_exporter": "python", 162 | "pygments_lexer": "ipython3", 163 | "version": "3.8.6" 164 | }, 165 | "toc": { 166 | "base_numbering": 1, 167 | "nav_menu": {}, 168 | "number_sections": true, 169 | "sideBar": true, 170 | "skip_h1_title": false, 171 | "title_cell": "Table of Contents", 172 | "title_sidebar": "Contents", 173 | "toc_cell": false, 174 | "toc_position": {}, 175 | "toc_section_display": true, 176 | "toc_window_display": false 177 | } 178 | }, 179 | "nbformat": 4, 180 | "nbformat_minor": 4 181 | } 182 | -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 4 - Learner Application/slides/pooled_unpooled_assumptions_difference_means_proportions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 4 - Learner Application/slides/pooled_unpooled_assumptions_difference_means_proportions.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 4 - Learner Application/slides/standard_error_one_mean.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 4 - Learner Application/slides/standard_error_one_mean.png -------------------------------------------------------------------------------- /Course 2 - Inferential Statistics with Python/Week 4 - Learner Application/slides/standard_error_one_proportion.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 2 - Inferential Statistics with Python/Week 4 - Learner Application/slides/standard_error_one_proportion.png -------------------------------------------------------------------------------- /Course 3 - Fitting Statistical Models to Data with Python/.DS_Store: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 3 - Fitting Statistical Models to Data with Python/.DS_Store -------------------------------------------------------------------------------- /Course 3 - Fitting Statistical Models to Data with Python/Week 1 - Overview & Considerations for Statistical Modeling/python_libraries.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "Wp6u2WFB6sD3" 8 | }, 9 | "source": [ 10 | "# Python Libraries\n", 11 | "\n", 12 | "For this tutorial, we are going to explore the python libraries that include functionality that corresponds with the material discussed in the course.\n", 13 | "\n", 14 | "The primary package we will be using is:\n", 15 | "\n", 16 | "* **Statsmodels:** a library that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, exploring data, and constructing models. \n", 17 | "\n", 18 | "*__ATTN__: If you are not familiar with the following packages:* \n", 19 | "\n", 20 | "* **Numpy** is a library for working with arrays of data. \n", 21 | "\n", 22 | "* **Pandas** is a library for data management, manipulation, and analysis. \n", 23 | "\n", 24 | "* **Matplotlib** is a library for making visualizations. \n", 25 | "\n", 26 | "* **Seaborn** is a higher-level interface to Matplotlib that can be used to simplify many visualization tasks. \n", 27 | "\n", 28 | "We recommend you check out the first and second courses of the Statistics with Python specialization, **Understanding and Visualizing Data** and **Inferential Statistical Analysis with Python**.\n", 29 | "\n", 30 | "*__Important__: While this notebooks provides insight into the basics of these libraries, it is recommended that you dig into the documentation available online.*" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": { 36 | "colab_type": "text", 37 | "id": "bS09niUH6sD4" 38 | }, 39 | "source": [ 40 | "## StatsModels\n", 41 | "\n", 42 | "The StatsModels library is extremely extensive and includes functionality ranging from statistical methods to advanced topics such as regression, time-series analysis, and multivariate statistics.\n", 43 | "\n", 44 | "We will mainly be looking at the stats, OLS, and GLM sub-libraries. However, we will begin by reviewing some functionality that has been referenced in earlier course of the Statistics with Python specialization." 
45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": { 51 | "colab": {}, 52 | "colab_type": "code", 53 | "id": "zAmQwUrM6sD4" 54 | }, 55 | "outputs": [], 56 | "source": [ 57 | "import statsmodels.api as sm\n", 58 | "import numpy as np" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": { 64 | "colab_type": "text", 65 | "id": "yHW550yS6sD9" 66 | }, 67 | "source": [ 68 | "### Stats\n", 69 | "\n", 70 | "#### Descriptive Statistics" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "metadata": { 77 | "colab": {}, 78 | "colab_type": "code", 79 | "id": "T0Diw1p66sD-", 80 | "outputId": "38a8e936-209c-49de-d828-7dae305acaf8" 81 | }, 82 | "outputs": [], 83 | "source": [ 84 | "# Draw random variables from a normal distribution with numpy\n", 85 | "normalRandomVariables = np.random.normal(0, 1, 1000)\n", 86 | "\n", 87 | "# Create an object that has descriptive statistics as attributes\n", 88 | "x = sm.stats.DescrStatsW(normalRandomVariables)\n", 89 | "\n", 90 | "print(x)" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": { 96 | "colab_type": "text", 97 | "id": "O3VPrgBO6sEC" 98 | }, 99 | "source": [ 100 | "As you can see from the above output, we have created an object with type: \"statsmodels.stats.weightstats.DescrStatsW\". \n", 101 | "\n", 102 | "This object stores various descriptive statistics, such as the mean, standard deviation, and variance, that we can access." 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": { 109 | "colab": {}, 110 | "colab_type": "code", 111 | "id": "t080a6ZB6sEE", 112 | "outputId": "b275eed9-d036-4212-bfcc-0c8a3c72f89e" 113 | }, 114 | "outputs": [], 115 | "source": [ 116 | "# Mean\n", 117 | "print(x.mean)\n", 118 | "\n", 119 | "# Standard deviation\n", 120 | "print(x.std)\n", 121 | "\n", 122 | "# Variance\n", 123 | "print(x.var)" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": { 129 | "colab_type": "text", 130 | "id": "DvveuP5x6sEH" 131 | }, 132 | "source": [ 133 | "The output above shows the mean, standard deviation, and variance of the 1000 random variables we drew from the distribution we generated above.\n", 134 | "\n", 135 | "There are other interesting things you can do with this object, such as generating confidence intervals and hypothesis testing.\n", 136 | "\n", 137 | "#### Confidence Intervals" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": { 144 | "colab": {}, 145 | "colab_type": "code", 146 | "id": "B0pWYUHz6sEJ", 147 | "outputId": "eda277fe-cc28-44ba-c4d2-f667dd9846dd" 148 | }, 149 | "outputs": [], 150 | "source": [ 151 | "# Generate confidence interval for a population proportion\n", 152 | "\n", 153 | "tstar = 1.96  # z* multiplier for a 95% confidence interval\n", 154 | "\n", 155 | "# Observed sample proportion\n", 156 | "p = .85\n", 157 | "\n", 158 | "# Sample size\n", 159 | "n = 659\n", 160 | "\n", 161 | "# Construct confidence interval\n", 162 | "sm.stats.proportion_confint(n * p, n)" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": { 168 | "colab_type": "text", 169 | "id": "AuZGEYlG6sEM" 170 | }, 171 | "source": [ 172 | "The above output includes the lower and upper bounds of a 95% confidence interval of the population proportion.\n",
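"\n", "As a quick cross-check (an editorial sketch, not part of the original lecture material), the same interval can be computed by hand from the large-sample formula p ± tstar * sqrt(p * (1 - p) / n), reusing the p, n, and tstar values defined above:\n", "\n", "```python\n", "import numpy as np\n", "\n", "se = np.sqrt(p * (1 - p) / n)  # estimated standard error of the sample proportion\n", "lcb = p - tstar * se           # lower confidence bound\n", "ucb = p + tstar * se           # upper confidence bound\n", "print((lcb, ucb))              # nearly identical to sm.stats.proportion_confint(n * p, n)\n", "```"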
173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": { 179 | "colab": {}, 180 | "colab_type": "code", 181 | "id": "kwBfjY_y6sEN", 182 | "outputId": "475a8442-9cc0-4841-a8f3-0140583a8f6b" 183 | }, 184 | "outputs": [], 185 | "source": [ 186 | "import pandas as pd\n", 187 | "\n", 188 | "# Import data that will be used to construct confidence interval of population mean\n", 189 | "df = pd.read_csv(\"https://raw.githubusercontent.com/UMstatspy/UMStatsPy/master/Course_1/Cartwheeldata.csv\")\n", 190 | "\n", 191 | "# Generate confidence interval for a population mean\n", 192 | "sm.stats.DescrStatsW(df[\"CWDistance\"]).zconfint_mean()" 193 | ] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "metadata": { 198 | "colab_type": "text", 199 | "id": "pk-nzwLg6sEP" 200 | }, 201 | "source": [ 202 | "The output above shows the lower and upper bounds of a 95% confidence interval of the population mean.\n", 203 | "\n", 204 | "These functions should be familiar; if not, we recommend you take course 2 of our specialization.\n", 205 | "#### Hypothesis Testing" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": null, 211 | "metadata": { 212 | "colab": {}, 213 | "colab_type": "code", 214 | "id": "aC6ZmdvT6sEQ", 215 | "outputId": "5f1243c9-34e5-4bda-9bda-1dfc8bc052b9" 216 | }, 217 | "outputs": [], 218 | "source": [ 219 | "# One population proportion hypothesis testing\n", 220 | "\n", 221 | "# Sample size\n", 222 | "n = 1018\n", 223 | "\n", 224 | "# Null hypothesis population proportion\n", 225 | "pnull = .52\n", 226 | "\n", 227 | "# Observed sample proportion\n", 228 | "phat = .56\n", 229 | "\n", 230 | "# Calculate test statistic and p-value\n", 231 | "sm.stats.proportions_ztest(phat * n, n, pnull)" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": null, 237 | "metadata": { 238 | "colab": {}, 239 | "colab_type": "code", 240 | "id": "_tdgO3pL6sEU", 241 | "outputId": "37a8fa49-ba7e-4996-e1e6-85b80b22383b" 242 | }, 243 | "outputs": [], 244 | "source": [ 245 | "# Using the dataframe imported above, perform a hypothesis test for population mean\n", 246 | "sm.stats.ztest(df[\"CWDistance\"], value = 80, alternative = \"larger\")" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": { 252 | "colab_type": "text", 253 | "id": "Ko78IcAX6sEX" 254 | }, 255 | "source": [ 256 | "The outputs above are the test statistics and p-values from the respective hypothesis tests.\n", 257 | "\n", 258 | "If you'd like to review these functions on your own, the stats sub-library documentation can be found at the following url: https://www.statsmodels.org/stable/stats.html\n", 259 | "\n", 260 | "This concludes the review portion of this notebook; now we are going to introduce the OLS and GLM sub-libraries and the functions you will be seeing throughout this course.\n", 261 | "\n", 262 | "# OLS (Ordinary Least Squares), GLM (Generalized Linear Models), GEE (Generalized Estimating Equations), MIXEDLM (Multilevel Models)\n", 263 | "\n", 264 | "The OLS, GLM, GEE, and MIXEDLM sub-libraries are the primary libraries in statsmodels that we will be utilizing in this course to create various models.\n", 265 | "\n", 266 | "Below, we will give a brief description of each model and a skeleton of the functions you will see going forward in the course. This is simply for you to get familiar with these concepts and to prepare you for the coming weeks.
If their application seems a bit ambiguous at this time, have no fear: they will be discussed in detail throughout this course!\n", 267 | "\n", 268 | "Each of the following models follows the same basic structure of dependent and independent variables, with a few caveats that will be noted below.\n", 269 | "\n", 270 | "#### Ordinary Least Squares\n", 271 | "\n", 272 | "Ordinary Least Squares is a method for estimating the unknown parameters in a linear regression model. This is the function we will use when our target variable is continuous." 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": null, 278 | "metadata": { 279 | "colab": {}, 280 | "colab_type": "code", 281 | "id": "MhalCiu-6sEX" 282 | }, 283 | "outputs": [], 284 | "source": [ 285 | "da = pd.read_csv(\"nhanes_2015_2016.csv\")\n", 286 | "\n", 287 | "# Drop unused columns, drop rows with any missing values.\n", 288 | "vars = [\"BPXSY1\", \"RIDAGEYR\", \"RIAGENDR\", \"RIDRETH1\", \"DMDEDUC2\", \"BMXBMI\",\n", 289 | " \"SMQ020\", \"SDMVSTRA\", \"SDMVPSU\"]\n", 290 | "da = da[vars].dropna()\n", 291 | "\n", 292 | "da[\"RIAGENDRx\"] = da.RIAGENDR.replace({1: \"Male\", 2: \"Female\"})\n", 293 | "\n", 294 | "model = sm.OLS.from_formula(\"BPXSY1 ~ RIDAGEYR + RIAGENDRx\", data=da)\n", 295 | "res = model.fit()\n", 296 | "print(res.summary())" 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": { 302 | "colab_type": "text", 303 | "id": "ujGjtQwR6sEa" 304 | }, 305 | "source": [ 306 | "The above code is creating a multiple linear regression where the target variable is BPXSY1 and the two predictor variables are RIDAGEYR and RIAGENDRx.\n", 307 | "\n", 308 | "Note that the target variable, BPXSY1, is a continuous variable that represents systolic blood pressure." 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": { 314 | "colab_type": "text", 315 | "id": "t7mrqEjx6sEb" 316 | }, 317 | "source": [ 318 | "#### Generalized Linear Models\n", 319 | "\n", 320 | "While generalized linear models are a broad topic, in this course we will be using this suite of functions to carry out logistic regression. Logistic regression is used when our target variable is a binary outcome, or a classification of two groups, which can be denoted as group 0 and group 1." 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": null, 326 | "metadata": { 327 | "colab": {}, 328 | "colab_type": "code", 329 | "id": "66gX6Ivv6sEb" 330 | }, 331 | "outputs": [], 332 | "source": [ 333 | "da[\"smq\"] = da.SMQ020.replace({2: 0, 7: np.nan, 9: np.nan})\n", 334 | "model = sm.GLM.from_formula(\"smq ~ RIAGENDRx\", family=sm.families.Binomial(), data=da)\n", 335 | "res = model.fit()\n", 336 | "print(res.summary())" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": { 342 | "colab_type": "text", 343 | "id": "HAGWs0sA6sEg" 344 | }, 345 | "source": [ 346 | "Above is an example of fitting a logistic regression model where the target variable is smq (recoded from SMQ020), indicating whether or not a person is a smoker. The predictor is RIAGENDRx, which is gender.\n",
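"\n", "As a brief editorial aside (a sketch, assuming res above holds the fitted GLM results), the coefficients of a Binomial-family GLM are on the log-odds scale, so exponentiating them yields odds ratios, which are often easier to interpret:\n", "\n", "```python\n", "# Convert log-odds coefficients to odds ratios (np was imported earlier)\n", "odds_ratios = np.exp(res.params)\n", "print(odds_ratios)\n", "```"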
347 | ] 348 | }, 349 | { 350 | "cell_type": "markdown", 351 | "metadata": { 352 | "colab_type": "text", 353 | "id": "HPr2J3Xu6sEh" 354 | }, 355 | "source": [ 356 | "#### Generalized Estimating Equations\n", 357 | "\n", 358 | "Generalized Estimating Equations (GEE) estimate generalized linear models for panel, cluster, or repeated-measures data when observations may be correlated within a cluster but are uncorrelated across clusters. They are used primarily when the correlation structure between outcomes is uncertain: GEE fits marginal linear models and estimates the intraclass correlation (ICC)." 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": null, 364 | "metadata": { 365 | "colab": {}, 366 | "colab_type": "code", 367 | "id": "N2UPibR56sEi" 368 | }, 369 | "outputs": [], 370 | "source": [ 371 | "da[\"group\"] = 10*da.SDMVSTRA + da.SDMVPSU\n", 372 | "model = sm.GEE.from_formula(\"BPXSY1 ~ 1\", groups=\"group\", cov_struct=sm.cov_struct.Exchangeable(), data=da)\n", 373 | "res = model.fit()\n", 374 | "print(res.cov_struct.summary())\n" 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "metadata": { 380 | "colab_type": "text", 381 | "id": "GI-wvJZz6sEm" 382 | }, 383 | "source": [ 384 | "Here we are fitting a marginal linear model of BPXSY1 to determine the estimated ICC value, which indicates whether observations of BPXSY1 are correlated within clusters." 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": { 390 | "colab_type": "text", 391 | "id": "qTDJVIpQ6sEn" 392 | }, 393 | "source": [ 394 | "#### Multilevel Models\n", 395 | "\n", 396 | "Similar to GEE, we use multilevel models when outcomes may be grouped together, which is not uncommon when various sampling methods are used to collect data. (Note that the cell below reuses GEE to estimate the ICC for each variable; a minimal MIXEDLM sketch appears at the end of this notebook.)" 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": null, 402 | "metadata": { 403 | "colab": {}, 404 | "colab_type": "code", 405 | "id": "kx8UyI4b6sEo" 406 | }, 407 | "outputs": [], 408 | "source": [ 409 | "for v in [\"BPXSY1\", \"RIDAGEYR\", \"BMXBMI\", \"smq\", \"SDMVSTRA\"]:\n", 410 | " model = sm.GEE.from_formula(v + \" ~ 1\", groups=\"group\",\n", 411 | " cov_struct=sm.cov_struct.Exchangeable(), data=da)\n", 412 | " result = model.fit()\n", 413 | " print(v, result.cov_struct.summary())" 414 | ] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "metadata": { 419 | "colab_type": "text", 420 | "id": "oCAM_2cf6sEt" 421 | }, 422 | "source": [ 423 | "What's nice about the statsmodels library is that all of these models follow a similar structure and syntax. \n", 424 | "\n", 425 | "\n", 426 | "Documentation and examples of these models can be found at the following links:\n", 427 | "\n", 428 | "* OLS: https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html\n", 429 | "\n", 430 | "* GLM: https://www.statsmodels.org/stable/glm.html\n", 431 | "\n", 432 | "* GEE: https://www.statsmodels.org/stable/gee.html\n", 433 | "\n", 434 | "* MIXEDLM: https://www.statsmodels.org/stable/mixed_linear.html\n", 435 | "\n", 436 | "Feel free to read up on these sub-libraries and their use cases. In week 2 you will see examples of OLS and GLM, while in week 3 we will be implementing GEE and MIXEDLM.\n",
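"\n", "For reference, here is a minimal random-intercept multilevel model sketch (an editorial addition, not fit in the lectures; it assumes the da dataframe and group variable defined above):\n", "\n", "```python\n", "# Random-intercept model: each cluster gets its own intercept for BPXSY1\n", "model = sm.MixedLM.from_formula(\"BPXSY1 ~ RIDAGEYR + RIAGENDRx\", groups=\"group\", data=da)\n", "res = model.fit()\n", "print(res.summary())\n", "```"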
437 | ] 438 | } 439 | ], 440 | "metadata": { 441 | "colab": { 442 | "collapsed_sections": [], 443 | "name": "Important Python Libraries - Fitting Statistical Models to Data with Python.ipynb", 444 | "provenance": [], 445 | "version": "0.3.2" 446 | }, 447 | "kernelspec": { 448 | "display_name": "Python 3", 449 | "language": "python", 450 | "name": "python3" 451 | }, 452 | "language_info": { 453 | "codemirror_mode": { 454 | "name": "ipython", 455 | "version": 3 456 | }, 457 | "file_extension": ".py", 458 | "mimetype": "text/x-python", 459 | "name": "python", 460 | "nbconvert_exporter": "python", 461 | "pygments_lexer": "ipython3", 462 | "version": "3.6.3" 463 | } 464 | }, 465 | "nbformat": 4, 466 | "nbformat_minor": 1 467 | } 468 | -------------------------------------------------------------------------------- /Course 3 - Fitting Statistical Models to Data with Python/Week 3 - Fitting Models to Dependent Data/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 3 - Fitting Statistical Models to Data with Python/Week 3 - Fitting Models to Dependent Data/.DS_Store -------------------------------------------------------------------------------- /Course 3 - Fitting Statistical Models to Data with Python/Week 3 - Fitting Models to Dependent Data/slides/Multi-Level Models Intro.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 3 - Fitting Statistical Models to Data with Python/Week 3 - Fitting Models to Dependent Data/slides/Multi-Level Models Intro.png -------------------------------------------------------------------------------- /Course 3 - Fitting Statistical Models to Data with Python/Week 3 - Fitting Models to Dependent Data/slides/Random Effects and MLM names.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 3 - Fitting Statistical Models to Data with Python/Week 3 - Fitting Models to Dependent Data/slides/Random Effects and MLM names.png -------------------------------------------------------------------------------- /Course 3 - Fitting Statistical Models to Data with Python/Week 3 - Fitting Models to Dependent Data/slides/Why fit Multi-level models?.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 3 - Fitting Statistical Models to Data with Python/Week 3 - Fitting Models to Dependent Data/slides/Why fit Multi-level models?.png -------------------------------------------------------------------------------- /Course 3 - Fitting Statistical Models to Data with Python/Week 3 - Fitting Models to Dependent Data/w3_assessment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "QQdjVZNGikFv" 8 | }, 9 | "source": [ 10 | "# Week 3 Assessment\n", 11 | "\n", 12 | "This Jupyter Notebook is auxiliary to this week's assessment.
To complete the assessment, you will answer the five questions outlined in this document, using the output of the Python cells.\n", 13 | "\n", 14 | "Run the following cell to initialize your environment and begin the assessment." 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 2, 20 | "metadata": { 21 | "colab": {}, 22 | "colab_type": "code", 23 | "id": "pfrQgsCOikFw" 24 | }, 25 | "outputs": [], 26 | "source": [ 27 | "#### RUN THIS\n", 28 | "\n", 29 | "import warnings\n", 30 | "warnings.filterwarnings('ignore')\n", 31 | "\n", 32 | "import numpy as np\n", 33 | "import statsmodels.api as sm\n", 34 | "import pandas as pd \n", 35 | "\n", 36 | "url = \"nhanes_2015_2016.csv\"\n", 37 | "da = pd.read_csv(url)\n", 38 | "\n", 39 | "# Drop unused columns, drop rows with any missing values.\n", 40 | "vars = [\"BPXSY1\", \"RIDAGEYR\", \"RIAGENDR\", \"RIDRETH1\", \"DMDEDUC2\", \"BMXBMI\",\n", 41 | " \"SMQ020\", \"SDMVSTRA\", \"SDMVPSU\"]\n", 42 | "da = da[vars].dropna()\n", 43 | "\n", 44 | "da[\"group\"] = 10*da.SDMVSTRA + da.SDMVPSU\n", 45 | "\n", 46 | "da[\"smq\"] = da.SMQ020.replace({2: 0, 7: np.nan, 9: np.nan})\n", 47 | "\n", 48 | "np.random.seed(123)" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": { 54 | "colab_type": "text", 55 | "id": "eSA_KJQvikF1" 56 | }, 57 | "source": [ 58 | "#### Question 1: What is clustered data? (You'll answer this question within the quiz that follows this notebook)\n", 59 | "\n", 60 | "Data are considered clustered when observations are correlated within groups, often as a result of the study design.\n", 61 | "\n", 62 | "#### Question 2: (You'll answer this question within the quiz that follows this notebook)\n", 63 | "\n", 64 | "Utilize the following output for this question:" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": 3, 70 | "metadata": { 71 | "colab": {}, 72 | "colab_type": "code", 73 | "id": "2oU4HEnPikF2" 74 | }, 75 | "outputs": [ 76 | { 77 | "name": "stdout", 78 | "output_type": "stream", 79 | "text": [ 80 | "BPXSY1 The correlation between two observations in the same cluster is 0.030\n", 81 | "SDMVSTRA The correlation between two observations in the same cluster is 0.959\n", 82 | "RIDAGEYR The correlation between two observations in the same cluster is 0.035\n", 83 | "BMXBMI The correlation between two observations in the same cluster is 0.039\n", 84 | "smq The correlation between two observations in the same cluster is 0.026\n" 85 | ] 86 | } 87 | ], 88 | "source": [ 89 | "for v in [\"BPXSY1\", \"SDMVSTRA\", \"RIDAGEYR\", \"BMXBMI\", \"smq\"]:\n", 90 | " model = sm.GEE.from_formula(v + \" ~ 1\", groups=\"group\",\n", 91 | " cov_struct=sm.cov_struct.Exchangeable(), data=da)\n", 92 | " result = model.fit()\n", 93 | " print(v, result.cov_struct.summary())" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": { 99 | "colab_type": "text", 100 | "id": "4zRZc-tOikF8" 101 | }, 102 | "source": [ 103 | "Which of the listed features has the highest correlation between two observations in the same cluster?
\n", 104 | "\n", 105 | "SDMVSTRA" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": { 111 | "colab_type": "text", 112 | "id": "or7LyZNWikF9" 113 | }, 114 | "source": [ 115 | "#### Question 3: (You'll answer this question within the quiz that follows this notebook)\n", 116 | "\n", 117 | "What is true about multiple linear regression and marginal linear models when dependence is present in data?\n", 118 | "\n", 119 | "Marginal linear model estimates and standard errors are meaningful due to the dependence of data, but only when dependence is strictly between observations within the same group.\n", 120 | "\n", 121 | "#### Question 4: (You'll answer this question within the quiz that follows this notebook)\n", 122 | "\n", 123 | "Multilevel models are expressed in terms of _____.\n", 124 | "\n", 125 | "Random effects\n", 126 | "\n", 127 | "#### Question 5: (You'll answer this question within the quiz that follows this notebook)\n", 128 | "\n", 129 | "Which of the following is NOT true regarding reasons why we fit marginal models?\n", 130 | "\n", 131 | "All the above are true" 132 | ] 133 | } 134 | ], 135 | "metadata": { 136 | "colab": { 137 | "collapsed_sections": [], 138 | "name": "w3_assessment.ipynb", 139 | "provenance": [], 140 | "version": "0.3.2" 141 | }, 142 | "kernelspec": { 143 | "display_name": "Python 3", 144 | "language": "python", 145 | "name": "python3" 146 | }, 147 | "language_info": { 148 | "codemirror_mode": { 149 | "name": "ipython", 150 | "version": 3 151 | }, 152 | "file_extension": ".py", 153 | "mimetype": "text/x-python", 154 | "name": "python", 155 | "nbconvert_exporter": "python", 156 | "pygments_lexer": "ipython3", 157 | "version": "3.6.3" 158 | } 159 | }, 160 | "nbformat": 4, 161 | "nbformat_minor": 1 162 | } 163 | -------------------------------------------------------------------------------- /Course 3 - Fitting Statistical Models to Data with Python/Week 4 - Special Topics/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 3 - Fitting Statistical Models to Data with Python/Week 4 - Special Topics/.DS_Store -------------------------------------------------------------------------------- /Course 3 - Fitting Statistical Models to Data with Python/Week 4 - Special Topics/slides/Great Quote.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 3 - Fitting Statistical Models to Data with Python/Week 4 - Special Topics/slides/Great Quote.png -------------------------------------------------------------------------------- /Course 3 - Fitting Statistical Models to Data with Python/Week 4 - Special Topics/slides/How to use sruvey weights in practice.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 3 - Fitting Statistical Models to Data with Python/Week 4 - Special Topics/slides/How to use sruvey weights in practice.png -------------------------------------------------------------------------------- /Course 3 - Fitting Statistical Models to Data with Python/Week 4 - Special Topics/slides/Should we use survey weights?.png: 
https://raw.githubusercontent.com/MiesnerJacob/statistics-with-python-michigan-university/f7219dcd7a91bdd3818bf2704f54793a0623a8a5/Course 3 - Fitting Statistical Models to Data with Python/Week 4 - Special Topics/slides/Should we use survey weights?.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # statistics-with-python-michigan-university 2 | A three-course series offered by the University of Michigan covering statistics and its application through code. This specialization is designed to teach learners beginning and intermediate concepts of statistical analysis using the Python programming language. I am taking this course as a means of reaffirming and strengthening my knowledge base in statistics. 3 | 4 | ## Course 1 - Understanding and Visualizing Data with Python 5 | This course introduces the field of statistics, including where data come from, study design, data management, and exploring and visualizing data. Learners will identify different types of data, and learn how to visualize, analyze, and interpret summaries for both univariate and multivariate data. Learners will also be introduced to the differences between probability and non-probability sampling from larger populations, the idea of how sample estimates vary, and how inferences can be made about larger populations based on probability sampling. 6 | 7 | ## Course 2 - Inferential Statistical Analysis with Python 8 | This course explores basic principles behind using data for estimation and for assessing theories. We will analyze both categorical data and quantitative data, starting with one-population techniques and expanding to handle comparisons of two populations. We will learn how to construct confidence intervals. We will also use sample data to assess whether or not a theory about the value of a parameter is consistent with the data. A major focus will be on interpreting inferential results appropriately. 9 | 10 | ## Course 3 - Fitting Statistical Models to Data with Python 11 | This course expands our exploration of statistical inference techniques by focusing on the science and art of fitting statistical models to data. We will build on the concepts presented in the Statistical Inference course (Course 2) to emphasize the importance of connecting research questions to our data analysis methods. We will also focus on various modeling objectives, including making inference about relationships between variables and generating predictions for future observations. 12 | -------------------------------------------------------------------------------- /gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store/ 2 | .DS_Store 3 | build/ 4 | dist/ 5 | .idea/ 6 | .ipynb_checkpoints/ 7 | --------------------------------------------------------------------------------