├── .gitignore ├── 1. Intro to Art of Data Analysis.ipynb ├── 2. introduction-to-pandas.ipynb ├── LICENSE ├── README.md ├── imdb_movies.csv └── img ├── corr.svg ├── data_analysis.png ├── pivot.png ├── problems.png ├── splitapplycombine.png ├── subsetrows.png └── table.jpg /.gitignore: -------------------------------------------------------------------------------- 1 | # Folder view configuration files 2 | .DS_Store 3 | Desktop.ini 4 | 5 | # Thumbnail cache files 6 | ._* 7 | Thumbs.db 8 | 9 | # Files that might appear on external disks 10 | .Spotlight-V100 11 | .Trashes 12 | 13 | # Compiled Python files 14 | *.pyc 15 | 16 | # Compiled C++ files 17 | *.out 18 | 19 | # Application specific files 20 | venv 21 | node_modules 22 | .sass-cache 23 | 24 | 25 | *.DS_Store 26 | .AppleDouble 27 | .LSOverride 28 | 29 | # Icon must end with two \r 30 | Icon 31 | 32 | 33 | # Thumbnails 34 | ._* 35 | 36 | # Files that might appear in the root of a volume 37 | .DocumentRevisions-V100 38 | .fseventsd 39 | .Spotlight-V100 40 | .TemporaryItems 41 | .Trashes 42 | .VolumeIcon.icns 43 | .com.apple.timemachine.donotpresent 44 | 45 | # Directories potentially created on remote AFP share 46 | .AppleDB 47 | .AppleDesktop 48 | Network Trash Folder 49 | Temporary Items 50 | .apdisk 51 | 52 | .ipynb_checkpoints/ 53 | -------------------------------------------------------------------------------- /1. 
Intro to Art of Data Analysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction\n", 8 | "\n", 9 | "> “I think, therefore I am”\n", 10 | "\n", 11 | "- What is data analysis?\n", 12 | "- What type of questions can be answered?\n", 13 | "- Developing a hypothesis-driven approach.\n", 14 | "- Making the case.\n", 15 | "\n", 16 | "\n", 17 | "## Data Analysis as an Art\n", 18 | "\n", 19 | "> \"Science is knowledge which we understand so well that we can teach it to a computer. Everything else is art\" - Donald Knuth\n", 20 | "\n", 21 | "- We need to know the science, we need to learn the art.\n", 22 | "- Analogous examples - Creating a hit song, Diagnosing a medical problem.\n", 23 | "- Business problems are 'wicked in nature' - multiple stakeholders, differing problem definitions, different solutions, interdependence, constraints, amplifying loops\n", 24 | "\n", 25 | "\n", 26 | "![](img/problems.png)\n", 27 | "\n", 28 | "> \"Data analysis is hard, and part of the problem is that few people can explain how to do it. It’s not that there aren’t any people doing data analysis on a regular basis. It’s that the people who are really good at it have yet to enlighten us about the thought process that goes on in their heads.\" - Roger Peng\n", 29 | "\n", 30 | "![](img/data_analysis.png)\n", 31 | "\n", 32 | "\n", 33 | "## Hypothesis-driven Approach\n", 34 | "A hypothesis is an educated guess or hunch. \n", 35 | "\n", 36 | "Hypothesis generation asks the question \"what if\"; hypothesis testing follows it up by saying \"if x, then y\" with relevant data and analysis. If we keep doing this, then we can keep improving the hypothesis. It is a process of \"iteration and learning\". 
The definition of the problem and the solution are not separate; we keep refining, reshaping, and sharpening both of them. \n", 37 | "\n", 38 | "Hypothesis testing is based on abductive reasoning. With induction, you start with data and work backward to form a rule... you look at a set of data and notice that when prices increase, demand falls. With deduction, you start with a rule and make a prediction of what you will observe: when prices increase, demand falls. Abduction, however, reasons from effect to cause - if demand is down, it might be because prices are up. \n", 39 | "- Induction - shows that something is operative\n", 40 | "- Deduction - proves that something must be\n", 41 | "- Abduction - only suggests that something may be\n", 42 | "\n", 43 | "Why is abduction important? Because the possibilities for both problems and solutions are unbounded, good hypothesis generation is critical. And because the solution is an invented choice rather than a discovered truth, its contestability requires persuasive argumentation. \n", 44 | "\n", 45 | "\n", 46 | "## Making the Case\n", 47 | "\n", 48 | "\"Making the case\" is important, and a compelling case comes from data-based hypotheses. Explaining 'what is' is an essential step in building confidence in the recommendation. Learning and changing mental models is needed for implementation and acceptance.\n", 49 | "\n", 50 | "\n", 51 | "# 1. Frame\n", 52 | "\n", 53 | "## Types of Questions\n", 54 | "\n", 55 | "> \"Doing data analysis requires quite a bit of thinking and we believe that when you’ve completed a good data analysis, you’ve spent more time thinking than doing.\" - Roger Peng\n", 56 | "\n", 57 | "1. **Descriptive** - \"seeks to summarize a characteristic of a set of data\"\n", 58 | "2. **Exploratory** - \"analyze the data to see if there are patterns, trends, or relationships between variables\" (hypothesis generating) \n", 59 | "3. 
**Inferential** - \"a restatement of this proposed hypothesis as a question and would be answered by analyzing a different set of data\" (hypothesis testing)\n", 60 | "4. **Predictive** - \"determine the impact on one factor based on other factors in a population - to make a prediction\"\n", 61 | "5. **Causal** - \"asks whether changing one factor will change another factor in a population - to establish a causal link\" \n", 62 | "6. **Mechanistic** - \"establish *how* the change in one factor results in change in another factor in a population - to determine the exact mechanism\"\n", 63 | "\n", 64 | "# 2. Acquire\n", 65 | "\n", 66 | "> \"Data is the new oil\"\n", 67 | "\n", 68 | "**Ways to acquire data** (typical data sources)\n", 69 | "\n", 70 | "- Downloaded from an internal system\n", 71 | "- Obtained from a client, or other third party\n", 72 | "- Extracted from a web-based API\n", 73 | "- Scraped from a website\n", 74 | "- Extracted from a PDF file\n", 75 | "- Gathered manually and recorded\n", 76 | "\n", 77 | "**Data Formats**\n", 78 | "- Flat files (e.g. csv)\n", 79 | "- Excel files\n", 80 | "- Databases (e.g. MySQL)\n", 81 | "- JSON\n", 82 | "- HDFS (Hadoop)\n", 83 | "\n", 84 | "# 3. Refine the Data\n", 85 | " \n", 86 | "> \"Data is messy\"\n", 87 | "\n", 88 | "- **Remove** e.g. remove redundant data from the data frame\n", 89 | "- **Derive** e.g. State and City from the market field\n", 90 | "- **Parse** e.g. extract the date from the year and month columns\n", 91 | "\n", 92 | "Other steps you may need to take to refine the data:\n", 93 | "- **Missing** e.g. check for missing or incomplete data\n", 94 | "- **Quality** e.g. check for duplicates, accuracy, unusual data\n", 95 | "\n", 96 | "\n", 97 | "# 4. Transform the Data\n", 98 | "\n", 99 | "> \"A rough diamond is cut and shaped into a beautiful gem\"\n", 100 | "\n", 101 | "- **Convert** e.g. free text to coded values\n", 102 | "- **Calculate** e.g. percentages, proportions\n", 103 | "- **Merge** e.g. 
first name and surname into a full name\n", 104 | "- **Aggregate** e.g. rollup by year, cluster by area\n", 105 | "- **Filter** e.g. exclude based on location\n", 106 | "- **Sample** e.g. extract a representative sample\n", 107 | "- **Summarize** e.g. show summary stats like the mean\n", 108 | "\n", 109 | "# 5. Explore the Data\n", 110 | "\n", 111 | "> \"I don't know what I don't know\"\n", 112 | "\n", 113 | "- Why do **visual exploration**?\n", 114 | "- Understand Data Structure & Types\n", 115 | "- Explore **single variable graphs** - Quantitative, Categorical\n", 116 | "- Explore **dual variable graphs** - (Q & Q, Q & C, C & C)\n", 117 | "- Explore **multi variable graphs**\n", 118 | "\n", 119 | "We want to first **visually explore** the data to see if we can confirm some of our initial hypotheses, as well as make new hypotheses about the problem we are trying to solve.\n", 120 | "\n", 121 | "For this, we will start by loading the data and understanding the data structure of the dataframe we have.\n", 122 | "\n", 123 | "### PRINCIPLE: Subset a Dataframe\n", 124 | "\n", 125 | "![](img/subsetrows.png)\n", 126 | "\n", 127 | "How do you subset a dataframe on a given criterion?\n", 128 | "\n", 129 | "`newDataframe` = `df`[ <`subset condition`> ] \n", 130 | "\n", 131 | "### PRINCIPLE: Split Apply Combine\n", 132 | "\n", 133 | "How do we get the sum of quantity for each city?\n", 134 | "\n", 135 | "We need to **SPLIT** the data by each city, **APPLY** the sum to the quantity column, and then **COMBINE** the data again.\n", 136 | "\n", 137 | "\n", 138 | "![](img/splitapplycombine.png)\n", 139 | "\n", 140 | "\n", 141 | "In pandas, we use the `groupby` function to do this.\n", 142 | "\n", 143 | "### PRINCIPLE: Pivot Table\n", 144 | "\n", 145 | "A pivot table is a way to summarize dataframe data into an index (rows), columns, and values.\n", 146 | "\n", 147 | "![](img/pivot.png)\n", 148 | "\n", 149 | "# 6. 
Model\n", 150 | "\n", 151 | "> \"All models are wrong, but some are useful\"\n", 152 | "\n", 153 | "- The power and limits of models\n", 154 | "- Tradeoff between Prediction Accuracy and Model Interpretability\n", 155 | "- Assessing Model Accuracy\n", 156 | "- Regression models (Simple, Multiple)\n", 157 | "- Classification models\n", 158 | "\n", 159 | "### PRINCIPLE: Correlation\n", 160 | "\n", 161 | "Correlation refers to any of a broad class of statistical relationships involving dependence, though in common usage it most often refers to the extent to which two variables have a linear relationship with each other.\n", 162 | "\n", 163 | "![](img/corr.svg)\n", 164 | "\n", 165 | "# 7. Insight\n", 166 | "\n", 167 | "> “The goal is to turn data into insight”\n", 168 | " \n", 169 | "- Why do we need to communicate insight?\n", 170 | "- Types of communication - Exploration vs. Explanation\n", 171 | "- Explanation: Telling a story with data\n", 172 | "- Exploration: Building an interface for people to find stories" 173 | ] 174 | } 175 | ], 176 | "metadata": { 177 | "anaconda-cloud": {}, 178 | "kernelspec": { 179 | "display_name": "Python [conda root]", 180 | "language": "python", 181 | "name": "conda-root-py" 182 | }, 183 | "language_info": { 184 | "codemirror_mode": { 185 | "name": "ipython", 186 | "version": 3 187 | }, 188 | "file_extension": ".py", 189 | "mimetype": "text/x-python", 190 | "name": "python", 191 | "nbconvert_exporter": "python", 192 | "pygments_lexer": "ipython3", 193 | "version": "3.5.2" 194 | } 195 | }, 196 | "nbformat": 4, 197 | "nbformat_minor": 1 198 | } 199 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2016 Amit Kapoor 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to 
deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # pandas-workshop 2 | Introduction to data analysis using Pandas 3 | -------------------------------------------------------------------------------- /img/data_analysis.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/amitkaps/pandas-workshop/d6b0ef7cf5b897fe49762f41571aa89475f1c722/img/data_analysis.png -------------------------------------------------------------------------------- /img/pivot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/amitkaps/pandas-workshop/d6b0ef7cf5b897fe49762f41571aa89475f1c722/img/pivot.png -------------------------------------------------------------------------------- /img/problems.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/amitkaps/pandas-workshop/d6b0ef7cf5b897fe49762f41571aa89475f1c722/img/problems.png -------------------------------------------------------------------------------- /img/splitapplycombine.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/amitkaps/pandas-workshop/d6b0ef7cf5b897fe49762f41571aa89475f1c722/img/splitapplycombine.png -------------------------------------------------------------------------------- /img/subsetrows.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/amitkaps/pandas-workshop/d6b0ef7cf5b897fe49762f41571aa89475f1c722/img/subsetrows.png -------------------------------------------------------------------------------- /img/table.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/amitkaps/pandas-workshop/d6b0ef7cf5b897fe49762f41571aa89475f1c722/img/table.jpg --------------------------------------------------------------------------------
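The pandas principles named in the notebook above (subsetting a dataframe, split-apply-combine via `groupby`, pivot tables, and correlation) can be sketched with a small self-contained example. Note this is an illustrative toy: the `city`, `year`, and `quantity` columns are hypothetical stand-ins for the workshop's exercise data, not columns from `imdb_movies.csv`.

```python
import pandas as pd

# Toy data: sales quantities by city and year (hypothetical columns).
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi", "Mumbai"],
    "year": [2015, 2015, 2016, 2016, 2016],
    "quantity": [10, 20, 30, 40, 50],
})

# PRINCIPLE: Subset a Dataframe -- newDataframe = df[<subset condition>]
recent = df[df["year"] == 2016]

# PRINCIPLE: Split Apply Combine -- SPLIT by city, APPLY sum to the
# quantity column, COMBINE into one result, all via groupby.
per_city = df.groupby("city")["quantity"].sum()

# PRINCIPLE: Pivot Table -- summarize into index (rows), columns, values.
pivot = df.pivot_table(index="city", columns="year",
                       values="quantity", aggfunc="sum")

# PRINCIPLE: Correlation -- pairwise Pearson correlation of numeric columns.
corr = df[["year", "quantity"]].corr()

print(per_city)
```

Each one-liner maps directly onto a diagram in `img/`: the boolean mask mirrors `subsetrows.png`, the `groupby` chain mirrors `splitapplycombine.png`, and `pivot_table` mirrors `pivot.png`.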