├── .gitignore
├── .notes.txt
├── 00_introduction.ipynb
├── 01_exercise_pandas.ipynb
├── 01_intro_numpy_pandas.ipynb
├── 01_solutions_pandas.ipynb
├── 02_intro_seaborn.ipynb
├── 03_exercise_scikit_learn.ipynb
├── 03_intro_scikit_learn.ipynb
├── 03_solutions_scikit_learn.ipynb
├── README.md
├── colormaps.py
├── data
├── 2002FemPreg.csv.gz
├── AAPL.csv
├── iris.csv
└── titanic3.csv
├── figures
├── cumulative_snowfall.png
├── iris_setosa.jpg
├── iris_versicolor.jpg
├── iris_virginica.jpg
├── pandas-book.jpg
├── petal_sepal.jpg
├── seaborn_gallery.png
├── supervised_workflow.svg
└── train_test_split.svg
├── index.ipynb
└── requirements.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | *~
2 | \#*
3 | *.swp
4 | .ipynb_checkpoints
5 |
--------------------------------------------------------------------------------
/.notes.txt:
--------------------------------------------------------------------------------
1 | # An introduction to "Data Science"
2 |
3 | ## Python Survival Pack
4 |
5 | 14:00 -- 15:40 Session One
6 | 16:15 -- 17:55 Session Two
7 |
8 | ** Add mybinder links **
9 |
10 | ## Session One: Exploring the data (100 mins)
11 |
12 | DS: Asking & answering questions about data
13 | - Principles
14 | - Methods
15 | - Discipline
16 |
17 | Data Loading, Cleaning and Visualization
18 |
19 | + nature (big/small, stream/not, etc.)
20 |
21 | 1. Introduction + install + overview (20 mins)
22 |
23 | - Give page of resources including:
24 |
25 | - scipy lectures
26 | - docs for pandas, seaborn, scikit-learn, etc.
27 |
28 | - Mention git as prominent first step in tracking code / data
29 |
30 | 2. [X] (NumPy) + Pandas (+ basic hist / plotting) (40 mins)
31 | 3. [X] Assignment: "early births" (30 mins)
32 | 4. Solution of "early births": (10 mins)
33 |
34 | ## Session Two: (100 mins)
35 |
36 | - Visualization using Seaborn (20 mins)
37 | - Mention alternatives (Matplotlib, Bokeh, Vega, etc.)
38 |
39 | - [X] Scikit-learn (40 mins)
40 | - ML: Supervised classification
41 | - Training vs testing data
42 | - API overview
43 | - Example: Titanic
44 |
45 | - [X] Assignment (Iris dataset): (30 mins)
46 | - Solutions of Iris dataset: (10 mins)
47 |
48 |
49 | A. Data management, data exploration and visualization, and data processing
50 |
51 | Data management includes the versioning of material (e.g., using
52 | snapshots, check-ins, or labeled back-ups), sharing and distribution
53 | (e.g., revision control, databases, cloud storage, distributed
54 | networks, network file servers or physical media) and cleaning
55 | (converting the data into usable formats, interpreting elements, and
56 | scrubbing out invalid or blank records).
57 |
58 | Once the data is in a usable form, it can be explored to gain an
59 | intuitive understanding of what it contains (and whether there are
60 | any anomalies—such as sampling or encoding artifacts—to be aware
61 | of). This step can include reducing the amount of data through
62 | slicing or projection, calculating summary statistics, and plotting
63 | the resulting sets in various ways.
64 |
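The exploration step described above (slicing, summary statistics, spotting anomalies) can be sketched with pandas; the data frame and column names here are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data set; one entry is missing (an "anomaly")
df = pd.DataFrame({'week': [38, 39, 40, 41],
                   'births': [120, 310, None, 95]})

# Slicing / projection: keep only the rows and columns of interest
subset = df.loc[df['week'] >= 39, ['week', 'births']]

# Summary statistics expose the anomaly: the count is 2, not 3
print(subset['births'].describe())
```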
65 | After we’ve improved our understanding of the data, we process
66 | it by applying more sophisticated statistical models. From these
67 | models, we may draw inferences on newly obtained data, or use
68 | our results to frame questions for a next round of data gathering.
69 | After attending this part of the tutorial, attendees should have a
70 | broad, basic understanding of the data science landscape.
71 |
72 |
73 | B. Part 2: The data scientist’s Python toolbox
74 |
75 | (a) Create a numpy array and perform fundamental operations on
76 | it.
77 | (b) Plot one- and two-dimensional arrays.
78 | (c) Load/save arrays from/to disk.
79 | (d) Create common plotting items such as line plots, histograms,
80 | scatter plots, density plots, error bars and error margins (on
81 | line plots).
82 | (e) Create a network with nodes and links and run some common
83 | queries on it.
84 | (f) Be able to help themselves (from online sources and via
85 | docstrings), should they get stuck.
86 |
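Items (a) and (c) of the toolbox list might look like this in practice (the file name is hypothetical):

```python
import numpy as np

# (a) Create an array and perform fundamental operations on it
a = np.arange(10, dtype=float)
b = a * 2 + 1            # elementwise arithmetic: 1, 3, ..., 19

# (c) Save the array to disk and load it back
np.save('demo_array.npy', b)   # hypothetical file name
c = np.load('demo_array.npy')

print(b.mean(), bool(np.allclose(b, c)))
```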
87 | C. Data Exploration
88 |
89 | (a) Be aware of the most common ways of storing and fetching
90 | data, including revision control systems (such as Git) and
91 | SQL, as well as formats (such as CSV, JSON).
92 | (b) Be able to load a CSV file from disk.
93 | (c) Be able to remove or replace missing values from a data set.
94 | (d) Know how to perform exploratory data visualization, including
95 | slicing and displaying a data set.
96 |
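A minimal sketch of goals (b)–(d), assuming pandas; the CSV content is invented and read from an in-memory buffer rather than from disk:

```python
import io
import pandas as pd

# (b) Load a CSV; an in-memory buffer stands in for a file on disk
csv = io.StringIO("name,age\nalice,34\nbob,\ncarol,29\n")
df = pd.read_csv(csv)

# (c) Either remove rows with missing values, or replace them
cleaned = df.dropna()
filled = df.fillna({'age': df['age'].median()})

# (d) Slice and display part of the data set
print(cleaned.head())
```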
97 | D. Analysis
98 |
99 | (a) Understand what a classifier is.
100 | (b) Be familiar with the scikit-learn classifier API.
101 | (c) Be able to construct a random forest classifier based on known
102 | data.
103 | (d) Be able to evaluate its classification accuracy for a new,
104 | unknown set of data.
105 |
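Goals (a)–(d) of the analysis section can be sketched with scikit-learn as follows; the data set is synthetic, and the module paths follow current scikit-learn releases:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data: the label is 1 when the row sum is positive
rng = np.random.RandomState(0)
X = rng.randn(300, 4)
y = (X.sum(axis=1) > 0).astype(int)

# Hold out a test set so accuracy is measured on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# (b)/(c) The scikit-learn API: construct the classifier, then fit it
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# (d) Classification accuracy on the held-out set
print(clf.score(X_test, y_test))
```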
--------------------------------------------------------------------------------
/00_introduction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "slideshow": {
7 | "slide_type": "slide"
8 | }
9 | },
10 | "source": [
11 | "## About me\n",
12 | "\n",
13 | "### Stéfan van der Walt (@stefanvdwalt)\n",
14 | "\n",
15 | "\n",
16 | "\n",
17 | "\n",
18 | "
\n",
19 | "### https://github.com/stefanv/ds_intro"
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {
25 | "slideshow": {
26 | "slide_type": "slide"
27 | }
28 | },
29 | "source": [
30 | "## What is Data Science?\n",
31 | "\n",
32 | "That's an excellent question, thank you for asking."
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {
38 | "slideshow": {
39 | "slide_type": "subslide"
40 | }
41 | },
42 | "source": [
43 | "It means a lot of things to a lot of people. Including different things to those in:\n",
44 | " \n",
45 | " - Industry\n",
46 | " - Statistics\n",
47 | " - Engineering\n",
48 | " - Computer Science\n",
49 | " - \"Data Science\""
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {
55 | "slideshow": {
56 | "slide_type": "subslide"
57 | }
58 | },
59 | "source": [
60 | "Very simplistically, data science is about asking & answering questions around data, and making use of a wide variety of tools across the fields of computation, visualization and engineering to achieve that goal. DS sources from various fields:\n",
61 | "\n",
62 | " - Principles (mathematical)\n",
63 | " - Methods (statistical)\n",
64 | " - Implementation (engineering)\n",
65 | " - Discipline (software engineering & CS)"
66 | ]
67 | },
68 | {
69 | "cell_type": "markdown",
70 | "metadata": {
71 | "slideshow": {
72 | "slide_type": "subslide"
73 | }
74 | },
75 | "source": [
76 | "- We're not going to try and answer these questions today.\n",
77 | "- We're also not going to try and turn you into a so-called Data Scientist in 3 hours.\n",
78 | "- We will try and teach you how to:\n",
79 | "\n",
80 | " - Load & manipulate data using **Pandas**\n",
81 | " - How to generate beautiful plots using **Seaborn**\n",
82 | " - Do basic machine learning, i.e. how to fit & evaluate a model in **scikit-learn**"
83 | ]
84 | },
85 | {
86 | "cell_type": "markdown",
87 | "metadata": {
88 | "slideshow": {
89 | "slide_type": "subslide"
90 | }
91 | },
92 | "source": [
93 | "We should be saying things about:\n",
94 | "\n",
95 | " - Version control\n",
96 | " - Data sharing and distribution\n",
97 | " - Testing\n",
98 | " - Licensing and collaboration\n",
99 | " - & many other important topics...\n",
100 | " \n",
101 | "But we also won't do that today."
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {
107 | "slideshow": {
108 | "slide_type": "subslide"
109 | }
110 | },
111 | "source": [
112 | "## Who is the intended audience of this tutorial?\n",
113 | "\n",
114 | "This tutorial is intended for people who:\n",
115 | "\n",
116 | "- have programmed in Python before,\n",
117 | "- have worked with arrays (in some) languages,\n",
118 | "- are interested in learning about Pandas & scikit-learn.\n",
119 | "\n",
120 | "It is not intended to:\n",
121 | "\n",
122 | "- Explore the dark underbelly of NumPy (I have [another lecture for that](https://github.com/stefanv/teaching/tree/master/2014_assp_split_numpy)).\n",
123 | "- Further hone the skills of experts in machine learning (you want [this scikit-learn tutorial](https://github.com/amueller/scipy_2015_sklearn_tutorial)).\n",
124 | "\n",
125 | "\n"
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {
131 | "slideshow": {
132 | "slide_type": "slide"
133 | }
134 | },
135 | "source": [
136 | "## This tutorial is meant to be enjoyed interactively.\n",
137 | "\n",
138 | "Please ask if you have any questions along the way."
139 | ]
140 | },
141 | {
142 | "cell_type": "markdown",
143 | "metadata": {
144 | "collapsed": true,
145 | "slideshow": {
146 | "slide_type": "subslide"
147 | }
148 | },
149 | "source": [
150 | "## Using the IPython notebook"
151 | ]
152 | },
153 | {
154 | "cell_type": "markdown",
155 | "metadata": {},
156 | "source": [
157 | "* You can run a cell by pressing ``[shift] + [Enter]`` or by pressing the \"play\" button in the menu.\n",
158 | "* You can get help on a function or object by pressing ``[shift] + [tab]`` after the opening parenthesis ``function(``\n",
159 | "* You can also get help by executing: ``function?``"
160 | ]
161 | },
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {
165 | "slideshow": {
166 | "slide_type": "subslide"
167 | }
168 | },
169 | "source": [
170 | "## Finding help\n",
171 | "\n",
172 | "- Docstrings\n",
173 | "- Tab completion, shift-tab\n",
174 | "- Package documentation:\n",
175 | "\n",
176 | " + http://www.scipy-lectures.org/\n",
177 | " + http://pandas.pydata.org/pandas-docs/stable/\n",
178 | " + http://stanford.edu/~mwaskom/software/seaborn/\n",
179 | " + http://scikit-learn.org/stable/documentation.html"
180 | ]
181 | },
182 | {
183 | "cell_type": "markdown",
184 | "metadata": {
185 | "slideshow": {
186 | "slide_type": "subslide"
187 | }
188 | },
189 | "source": [
190 | "## Installation\n",
191 | "\n",
192 | "Please see the [index](index.ipynb)"
193 | ]
194 | }
195 | ],
196 | "metadata": {
197 | "celltoolbar": "Slideshow",
198 | "kernelspec": {
199 | "display_name": "Python 3",
200 | "language": "python",
201 | "name": "python3"
202 | },
203 | "language_info": {
204 | "codemirror_mode": {
205 | "name": "ipython",
206 | "version": 3
207 | },
208 | "file_extension": ".py",
209 | "mimetype": "text/x-python",
210 | "name": "python",
211 | "nbconvert_exporter": "python",
212 | "pygments_lexer": "ipython3",
213 | "version": "3.4.3"
214 | }
215 | },
216 | "nbformat": 4,
217 | "nbformat_minor": 0
218 | }
219 |
--------------------------------------------------------------------------------
/01_exercise_pandas.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## IPython Notebooks\n",
8 | "\n",
9 | "* You can run a cell by pressing ``[shift] + [Enter]`` or by pressing the \"play\" button in the menu.\n",
10 | "* You can get help on a function or object by pressing ``[shift] + [tab]`` after the opening parenthesis ``function(``\n",
11 | "* You can also get help by executing: ``function?``\n",
12 | "\n",
13 | "We'll use the following standard imports. Execute this cell first:"
14 | ]
15 | },
16 | {
17 | "cell_type": "markdown",
18 | "metadata": {},
19 | "source": [
20 | "# Exercise: are first-borns more likely to be late?\n",
21 | "\n",
22 | "This exercise is based on [lecture material by Allen Downey](https://github.com/AllenDowney/CompStats.git).\n",
23 | "\n",
24 | "License: [Creative Commons Attribution 4.0 International](http://creativecommons.org/licenses/by/4.0/)"
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": null,
30 | "metadata": {
31 | "collapsed": false
32 | },
33 | "outputs": [],
34 | "source": [
35 | "# this future import makes this code mostly compatible with Python 2 and 3\n",
36 | "from __future__ import print_function, division\n",
37 | "\n",
38 | "import random\n",
39 | "\n",
40 | "import numpy as np\n",
41 | "import pandas as pd\n",
42 | "import seaborn as sns\n",
43 | "\n",
44 | "%matplotlib inline\n",
45 | "import matplotlib.pyplot as plt"
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "Are first babies more likely to be late?\n",
53 | "----------------------------------------\n",
54 | "\n",
55 | "Allen Downey wrote a popular blog post about this topic:\n",
56 | "\n",
57 | "http://allendowney.blogspot.com/2011/02/are-first-babies-more-likely-to-be-late.html\n",
58 | "\n",
59 | "We are going to investigate the question for ourselves, based on data from the National Survey of Family Growth (NSFG)."
60 | ]
61 | },
62 | {
63 | "cell_type": "markdown",
64 | "metadata": {},
65 | "source": [
66 | "Use the Pandas ``read_csv`` command to load ``data/2002FemPreg.csv.gz``."
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": null,
72 | "metadata": {
73 | "collapsed": false
74 | },
75 | "outputs": [],
76 | "source": [
77 | "preg = ..."
78 | ]
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "metadata": {},
83 | "source": [
84 | "- The variable **`outcome`** encodes the outcome of the pregnancy. Outcome 1 is a live birth.\n",
85 | "- The variable **`pregordr`** encodes for first pregnancies (==1) and others (>1).\n",
86 | "- The variables **`prglngth`** encodes for the length of pregnancy up to birth."
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": null,
92 | "metadata": {
93 | "collapsed": false
94 | },
95 | "outputs": [],
96 | "source": [
97 | "preg.outcome.value_counts().sort_index()"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": null,
103 | "metadata": {
104 | "collapsed": false
105 | },
106 | "outputs": [],
107 | "source": [
108 | "preg.pregordr.value_counts().sort_index()"
109 | ]
110 | },
111 | {
112 | "cell_type": "markdown",
113 | "metadata": {},
114 | "source": [
115 | "Let's visualize the number of births over different weeks:"
116 | ]
117 | },
118 | {
119 | "cell_type": "code",
120 | "execution_count": null,
121 | "metadata": {
122 | "collapsed": false
123 | },
124 | "outputs": [],
125 | "source": [
126 | "preg.prglngth.value_counts().sort_index().plot(title='Number of births for week')"
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "And here is the total number of babies born up to a certain week:"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": null,
139 | "metadata": {
140 | "collapsed": false
141 | },
142 | "outputs": [],
143 | "source": [
144 | "preg.prglngth.value_counts().sort_index().cumsum().plot(title='Total nr of births up to week')"
145 | ]
146 | },
147 | {
148 | "cell_type": "markdown",
149 | "metadata": {},
150 | "source": [
151 | "Now, create a Pandas dataframe containing *only* the three columns of interest. **Hint:** ``loc``"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": null,
157 | "metadata": {
158 | "collapsed": false
159 | },
160 | "outputs": [],
161 | "source": [
162 | "pp = ...\n",
163 | "pp.head()"
164 | ]
165 | },
166 | {
167 | "cell_type": "markdown",
168 | "metadata": {},
169 | "source": [
170 | "Now, select only entries where ``outcome`` is 1 (i.e., live births)."
171 | ]
172 | },
173 | {
174 | "cell_type": "code",
175 | "execution_count": null,
176 | "metadata": {
177 | "collapsed": false
178 | },
179 | "outputs": [],
180 | "source": [
181 | "pp = ..."
182 | ]
183 | },
184 | {
185 | "cell_type": "markdown",
186 | "metadata": {},
187 | "source": [
188 | "Also, we are only interested in whether babies are first born or not, so set any entries for ``pregordr`` that are !=1 to the value 2. *Hint:* Use ``.loc`` with two indices."
189 | ]
190 | },
191 | {
192 | "cell_type": "code",
193 | "execution_count": null,
194 | "metadata": {
195 | "collapsed": false
196 | },
197 | "outputs": [],
198 | "source": [
199 | "pp.loc[...] = 2\n",
200 | "pp.head()"
201 | ]
202 | },
203 | {
204 | "cell_type": "code",
205 | "execution_count": null,
206 | "metadata": {
207 | "collapsed": true
208 | },
209 | "outputs": [],
210 | "source": [
211 | "pp.groupby('pregordr').describe()"
212 | ]
213 | },
214 | {
215 | "cell_type": "markdown",
216 | "metadata": {},
217 | "source": [
218 | "Create two dataframes. One with first pregnancies, and one with all the rest."
219 | ]
220 | },
221 | {
222 | "cell_type": "code",
223 | "execution_count": null,
224 | "metadata": {
225 | "collapsed": false
226 | },
227 | "outputs": [],
228 | "source": [
229 | "firsts = pp[...]\n",
230 | "others = pp[...]"
231 | ]
232 | },
233 | {
234 | "cell_type": "markdown",
235 | "metadata": {},
236 | "source": [
237 | "Computer the mean difference in weeks:"
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": null,
243 | "metadata": {
244 | "collapsed": false
245 | },
246 | "outputs": [],
247 | "source": [
248 | "firsts.prglngth.mean(), others.prglngth.mean()"
249 | ]
250 | },
251 | {
252 | "cell_type": "markdown",
253 | "metadata": {},
254 | "source": [
255 | "The difference is very small--a few hours!"
256 | ]
257 | },
258 | {
259 | "cell_type": "markdown",
260 | "metadata": {},
261 | "source": [
262 | "### Let's see if we can visualize the difference in the histograms.\n",
263 | "\n",
264 | "1. From the first pregnancy table, select column ``prglngth``, then call ``hist`` on it.\n",
265 | "2. You will get better results if you specify bins to ``hist``, with ``bins=range(50)``.\n",
266 | "3. Do the same for other births.\n",
267 | "4. To optimally compare the two histograms, set the x-axis to be the same with ``plt.xlim(30, 45)``"
268 | ]
269 | },
270 | {
271 | "cell_type": "code",
272 | "execution_count": null,
273 | "metadata": {
274 | "collapsed": false
275 | },
276 | "outputs": [],
277 | "source": [
278 | "# ...\n",
279 | "plt.xlim(30, 45)"
280 | ]
281 | },
282 | {
283 | "cell_type": "code",
284 | "execution_count": null,
285 | "metadata": {
286 | "collapsed": false
287 | },
288 | "outputs": [],
289 | "source": [
290 | "# ...\n",
291 | "plt.xlim(30, 45)"
292 | ]
293 | },
294 | {
295 | "cell_type": "markdown",
296 | "metadata": {},
297 | "source": [
298 | "I wrote a little utility function to help us plot the distributions side-by-side. See if you can read the code below and figure out what it does:"
299 | ]
300 | },
301 | {
302 | "cell_type": "code",
303 | "execution_count": null,
304 | "metadata": {
305 | "collapsed": true
306 | },
307 | "outputs": [],
308 | "source": [
309 | "LILAC = '#998ec3'\n",
310 | "ORANGE = '#f1a340'\n",
311 | "\n",
312 | "\n",
313 | "def hist_two(series_A, series_B,\n",
314 | " labels=['series_A', 'series_B'],\n",
315 | " normalize=False, cumulative=False, bar_or_line='bar'):\n",
316 | "\n",
317 | " fig, ax = plt.subplots(figsize=(10, 5))\n",
318 | " \n",
319 | " a_heights, a_bins = np.histogram(series_A, bins=range(45), normed=normalize)\n",
320 | " b_heights, b_bins = np.histogram(series_B, bins=a_bins, normed=normalize)\n",
321 | " \n",
322 | " if cumulative:\n",
323 | " a_heights = np.cumsum(a_heights)\n",
324 | " b_heights = np.cumsum(b_heights)\n",
325 | "\n",
326 | " width = (a_bins[1] - a_bins[0])/2.5\n",
327 | "\n",
328 | " if bar_or_line == 'bar':\n",
329 | " ax.bar(a_bins[:-1], a_heights, width=width, facecolor=LILAC, label=labels[0])\n",
330 | " ax.bar(b_bins[:-1] + width, b_heights, width=width, facecolor=ORANGE, label=labels[1])\n",
331 | " else:\n",
332 | " plt.plot(a_bins[:-1], a_heights, linewidth=4, color=LILAC, label=labels[0])\n",
333 | " plt.plot(b_bins[:-1], b_heights, linewidth=4, color=ORANGE, label=labels[1])\n",
334 | "\n",
335 | " plt.legend(loc='upper left')"
336 | ]
337 | },
338 | {
339 | "cell_type": "code",
340 | "execution_count": null,
341 | "metadata": {
342 | "collapsed": false
343 | },
344 | "outputs": [],
345 | "source": [
346 | "hist_two(firsts.prglngth, others.prglngth, labels=['firsts', 'others'])\n",
347 | "plt.xlim(33, 44);"
348 | ]
349 | },
350 | {
351 | "cell_type": "markdown",
352 | "metadata": {},
353 | "source": [
354 | "Remember that the vertical axis is counts. In this case, we are comparing counts with different totals, which might be misleading.\n",
355 | "\n",
356 | "An alternative is to compute a probability mass function (PMF), which divides the counts by the totals, yielding a map from each element to its probability.\n",
357 | "\n",
358 | "The probabilities are \"normalized\" to add up to 1.\n"
359 | ]
360 | },
361 | {
362 | "cell_type": "markdown",
363 | "metadata": {},
364 | "source": [
365 | "Now we can compare histograms fairly."
366 | ]
367 | },
368 | {
369 | "cell_type": "code",
370 | "execution_count": null,
371 | "metadata": {
372 | "collapsed": false
373 | },
374 | "outputs": [],
375 | "source": [
376 | "hist_two(firsts.prglngth, others.prglngth, labels=['firsts', 'others'], normalize=True)\n",
377 | "plt.xlim(33, 44);"
378 | ]
379 | },
380 | {
381 | "cell_type": "markdown",
382 | "metadata": {},
383 | "source": [
384 | "We see here that some of the difference at 39 weeks was an artifact of the different samples sizes."
385 | ]
386 | },
387 | {
388 | "cell_type": "markdown",
389 | "metadata": {},
390 | "source": [
391 | "Even so, it is not easy to compare histograms. One more alternative is the cumulative histogram, which shows, for each $t$, the total probability up to and including $t$."
392 | ]
393 | },
394 | {
395 | "cell_type": "code",
396 | "execution_count": null,
397 | "metadata": {
398 | "collapsed": false
399 | },
400 | "outputs": [],
401 | "source": [
402 | "pp = live.loc[:, ['birthord','prglngth']]\n",
403 | "not_firsts = pp['birthord'] != 1\n",
404 | "pp.loc[not_firsts, 'birthord'] = 2"
405 | ]
406 | },
407 | {
408 | "cell_type": "code",
409 | "execution_count": null,
410 | "metadata": {
411 | "collapsed": false
412 | },
413 | "outputs": [],
414 | "source": [
415 | "hist_two(firsts.prglngth, others.prglngth, labels=['firsts', 'others'], normalize=True, cumulative=True, bar_or_line='line')\n",
416 | "plt.xlim(33, 44);"
417 | ]
418 | },
419 | {
420 | "cell_type": "markdown",
421 | "metadata": {},
422 | "source": [
423 | "The cumulative histograms are similar up to week 38. After that, first babies are more likely to be born late. \n",
424 | "\n",
425 | "*Can you read this from the plot above?*"
426 | ]
427 | },
428 | {
429 | "cell_type": "markdown",
430 | "metadata": {},
431 | "source": [
432 | "One other thought: cumulative curves are often a good option for visualizing noisy series. For example, the graphic below works pretty well despite some questionable aesthetic choices. "
433 | ]
434 | },
435 | {
436 | "cell_type": "markdown",
437 | "metadata": {},
438 | "source": [
439 | ""
440 | ]
441 | },
442 | {
443 | "cell_type": "code",
444 | "execution_count": null,
445 | "metadata": {
446 | "collapsed": true
447 | },
448 | "outputs": [],
449 | "source": []
450 | }
451 | ],
452 | "metadata": {
453 | "kernelspec": {
454 | "display_name": "Python 3",
455 | "language": "python",
456 | "name": "python3"
457 | },
458 | "language_info": {
459 | "codemirror_mode": {
460 | "name": "ipython",
461 | "version": 3
462 | },
463 | "file_extension": ".py",
464 | "mimetype": "text/x-python",
465 | "name": "python",
466 | "nbconvert_exporter": "python",
467 | "pygments_lexer": "ipython3",
468 | "version": "3.4.3"
469 | }
470 | },
471 | "nbformat": 4,
472 | "nbformat_minor": 0
473 | }
474 |
--------------------------------------------------------------------------------
/03_exercise_scikit_learn.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "