"
340 | ],
341 | "text/plain": [
342 | " Acceleration Cylinders Displacement Horsepower Miles_per_Gallon \\\n",
343 | "0 12.0 8 307.0 130.0 18.0 \n",
344 | "1 11.5 8 350.0 165.0 15.0 \n",
345 | "2 11.0 8 318.0 150.0 18.0 \n",
346 | "3 12.0 8 304.0 150.0 16.0 \n",
347 | "4 10.5 8 302.0 140.0 17.0 \n",
348 | "\n",
349 | " Name Origin Weight_in_lbs Year \n",
350 | "0 chevrolet chevelle malibu USA 3504 1970-01-01 \n",
351 | "1 buick skylark 320 USA 3693 1970-01-01 \n",
352 | "2 plymouth satellite USA 3436 1970-01-01 \n",
353 | "3 amc rebel sst USA 3433 1970-01-01 \n",
354 | "4 ford torino USA 3449 1970-01-01 "
355 | ]
356 | },
357 | "execution_count": 19,
358 | "metadata": {},
359 | "output_type": "execute_result"
360 | }
361 | ],
362 | "source": [
363 | "table2.to_pandas().head()"
364 | ]
365 | },
366 | {
367 | "cell_type": "code",
368 | "execution_count": null,
369 | "metadata": {},
370 | "outputs": [],
371 | "source": []
372 | }
373 | ],
374 | "metadata": {
375 | "kernelspec": {
376 | "display_name": "Python 3",
377 | "language": "python",
378 | "name": "python3"
379 | },
380 | "language_info": {
381 | "codemirror_mode": {
382 | "name": "ipython",
383 | "version": 3
384 | },
385 | "file_extension": ".py",
386 | "mimetype": "text/x-python",
387 | "name": "python",
388 | "nbconvert_exporter": "python",
389 | "pygments_lexer": "ipython3",
390 | "version": "3.6.3"
391 | }
392 | },
393 | "nbformat": 4,
394 | "nbformat_minor": 2
395 | }
396 |
--------------------------------------------------------------------------------
/Content/Introduction/01-Introduction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Introduction"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Course materials\n",
15 | "\n",
16 | "- https://github.com/jakevdp/WhirlwindTourOfPython\n",
17 | "- https://github.com/jakevdp/PythonDataScienceHandbook\n",
18 | "- http://r4ds.had.co.nz/index.html\n",
19 | "- https://github.com/amueller/introduction_to_ml_with_python"
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {},
25 | "source": [
26 | "## Introduction to Data Science\n",
27 | "\n",
28 | "- [What is Data Science?](./02-WhatIsDataScience.ipynb)\n",
29 | "- [Theory of Data](./03-TheoryofData.ipynb)\n",
30 | "- [Programming](../Programming/01-Introduction.ipynb)\n",
31 | "- [Workflow](../Workflow/01-Introduction.ipynb)"
32 | ]
33 | },
34 | {
35 | "cell_type": "markdown",
36 | "metadata": {},
37 | "source": [
38 | "## Explore\n",
39 | "\n",
40 | "Explore = Visualize + Transform\n",
41 | "\n",
42 | "- [Data Visualisation](../Visualize/01-Introduction.ipynb)\n",
43 | "- [Basic Data Transformation](../Transform/02-BasicDataTransformation.ipynb)\n",
44 | "- [Exploratory Data Analysis (EDA)](../Explore/02-EDA.ipynb)"
45 | ]
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {},
50 | "source": [
51 | "## Wrangle\n",
52 | "\n",
53 | "Wrangle = Import + Tidy + Transform\n",
54 | "\n",
55 | "- [Import](../Import/01-Introduction.ipynb)\n",
56 | "- [Tidy](../Tidy/01-Introduction.ipynb)\n",
57 | "- [Transform](../Transform/01-Introduction.ipynb)"
58 | ]
59 | },
60 | {
61 | "cell_type": "markdown",
62 | "metadata": {},
63 | "source": [
64 | "## Model\n",
65 | "\n",
66 | "- [Model](../Model/01-Introduction.ipynb)"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "## Communicate\n",
74 | "\n",
75 | "- GitHub and Notebooks\n",
76 | "- nbconvert\n",
77 | "- nbviewer"
78 | ]
79 | }
80 | ],
81 | "metadata": {
82 | "kernelspec": {
83 | "display_name": "Python 3",
84 | "language": "python",
85 | "name": "python3"
86 | },
87 | "language_info": {
88 | "codemirror_mode": {
89 | "name": "ipython",
90 | "version": 3
91 | },
92 | "file_extension": ".py",
93 | "mimetype": "text/x-python",
94 | "name": "python",
95 | "nbconvert_exporter": "python",
96 | "pygments_lexer": "ipython3",
97 | "version": "3.6.3"
98 | }
99 | },
100 | "nbformat": 4,
101 | "nbformat_minor": 2
102 | }
103 |
--------------------------------------------------------------------------------
/Content/Introduction/02-WhatIsDataScience.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# What is Data Science?"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "**Learning Objective:** Understand different ways that Data Science can be defined."
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "## Data Science as a *skill set*"
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "Perhaps the most common definition of Data Science is to enumerate the skills and knowledge areas used in Data Science. The best known treatment of that approach is [Drew Conway's Data Science Venn diagram](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram), seen here:\n",
29 | "\n",
30 | ""
31 | ]
32 | },
33 | {
34 | "cell_type": "markdown",
35 | "metadata": {},
36 | "source": [
37 | "## Data Science as a *process*"
38 | ]
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "metadata": {},
43 | "source": [
44 | "An operational definition of Data Science answers the question, \"what do Data Scientists do?\". Over the last few years, the community of Data Scientists have been building a concensus answer to this question. The different activities involved in Data Science are linked together to form the Data Science **process** or **workflow**. By looking at the descriptions of the Data Science process by a few individuals, we can start to see a clear picture emerging.\n",
45 | "\n",
46 | "* [A Data Science Taxonomy](http://www.dataists.com/2010/09/a-taxonomy-of-data-science/), Hilary Mason (2012):\n",
47 | " - Obtain\n",
48 | " - Scrub\n",
49 | " - Explore\n",
50 | " - Model\n",
51 | " - Interpret\n",
52 | "* [The Data Science Process](http://columbiadatascience.com/2012/09/24/reflections-after-jakes-lecture/), Rachel Shutt (2012):\n",
53 | " - Observation and collection\n",
54 | " - Processing\n",
55 | " - Exploratory data analysis\n",
56 | " - Modeling: Stats, ML\n",
57 | " - Build data product\n",
58 | " - Communicate\n",
59 | " - Make decisions\n",
60 | "* [Introduction to Data Science 2.0](http://columbiadatascience.com/2013/09/16/introduction-to-data-science-version-2-0/), Rachel Shutt (2013):\n",
61 | " - Gather and observe\n",
62 | " - Process\n",
63 | " - Modeling: Stats, ML\n",
64 | " - Summarize, communicate, build\n",
65 | " - Decide, interact\n",
66 | "* [Data Science Workflow: Overview and Challenges ](http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext), Philip Guo (2014):\n",
67 | " - Preparation\n",
68 | " - Analysis\n",
69 | " - Reflection\n",
70 | " - Dissemination"
71 | ]
72 | },
73 | {
74 | "cell_type": "markdown",
75 | "metadata": {},
76 | "source": [
77 | "## Data Science as a *set of questions*"
78 | ]
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "metadata": {},
83 | "source": [
84 | "The *skills* and *process* approaches to defining Data Science do have some limitations. Another approach, for which I advocate, is to enumerate the underlying questions that the field is pursuing. Here are some possibilities:\n",
85 | "\n",
86 | "* How/where do we get data?\n",
87 | "* What is the raw format of the data?\n",
88 | "* How much data and how often?\n",
89 | "* What variables/fields are present in the data and what are their types?\n",
90 | "* What relevant variables/fields are not present in the data?\n",
91 | "* What relationships are present in the data and how are they expressed?\n",
92 | "* Is the data observational or collected in a controlled manner?\n",
93 | "* What practical questions can we, or would we like to answer with the data?\n",
94 | "* How is the data stored after collection and how does that relate to the\n",
95 | " practical questions we are interested in answering?\n",
96 | "* What in memory data structures are appropriate for answering those practical\n",
97 | " questions efficiently?\n",
98 | "* What can we predict with the data?\n",
99 | "* What can we understand with the data?\n",
100 | "* What hypotheses can be supported or rejected with the data?\n",
101 | "* What statistical or machine learning methods are needed to answer these questions?\n",
102 | "* What user interfaces are required for humans to work with the data efficiently and\n",
103 | " productively?\n",
104 | "* How can the data be visualized effectively?\n",
105 | "* How can code, data and visualizations be embedded into narratives used to\n",
106 | " communicate results?\n",
107 | "* What software is needed to support the activities around these questions?\n",
108 | "* What computational infrastructure is needed?\n",
109 | "* How can organizations leverage data to meet their goals?\n",
110 | "* What organizational structures are needed to best take advantage of data?\n",
111 | "* What are the economic benefits of pursuing these questions?\n",
112 | "* What are the social benefits of pursuing these questions?\n",
113 | "* Where do these questions and the activities in pursuit of them intersect important ethical issues."
114 | ]
115 | },
116 | {
117 | "cell_type": "markdown",
118 | "metadata": {},
119 | "source": [
120 | "## Data Science as *Science*"
121 | ]
122 | },
123 | {
124 | "cell_type": "markdown",
125 | "metadata": {},
126 | "source": [
127 | "If we take the name \"Data Science\" seriously, then we have to assume that it is somehow related to science. Here is my own take:\n",
128 | "\n",
129 | "> Data Science involves the application of scientific methods and approaches to data sets that *may* lie outside the traditional fields of science (Physics, Chemistry, Biology, etc.).\n",
130 | "\n",
131 | "In other words, Data Science involves a broad application of the scientific method."
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "## R for Data Science\n",
139 | "\n",
140 | "In [R for Data Science](http://r4ds.had.co.nz/index.html), Garrett Grolemund and Hadley Wickham organize the data science process into the following ideas and practices:\n",
141 | "\n",
142 | "1. Import\n",
143 | "2. Tidy\n",
144 | "3. Understand:\n",
145 | " - Transform\n",
146 | " - Visualize\n",
147 | " - Model\n",
148 | "4. Communicate\n",
149 | "\n",
150 | "Furthermore, he overlays a couple of other composite ideas and practices:\n",
151 | "\n",
152 | "* Wrangle = Import + Tidy + Transform\n",
153 | "* Explore = Transform + Visualize\n",
154 | "\n",
155 | "Lastly, he identifies two additional cross cutting areas:\n",
156 | "\n",
157 | "* Workflow\n",
158 | "* Programming\n",
159 | "\n",
160 | "**This course will follow this conceptual model of data science.**"
161 | ]
162 | },
163 | {
164 | "cell_type": "markdown",
165 | "metadata": {},
166 | "source": [
167 | "## Resources"
168 | ]
169 | },
170 | {
171 | "cell_type": "markdown",
172 | "metadata": {},
173 | "source": [
174 | "* [Scientific Method](https://en.wikipedia.org/wiki/Scientific_method), Wikipedia (2016).\n",
175 | "* [50 Years of Data Science](http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf), David Donoho (2015).\n",
176 | "* [Data Science Survey](https://www.oreilly.com/ideas/2015-data-science-salary-survey), O'Reilly Media (2015).\n",
177 | "* [The Emerging Role of Data Scientists on Software Development Teams](http://research.microsoft.com/apps/pubs/default.aspx?id=242286), Microsoft Research (2015)."
178 | ]
179 | }
180 | ],
181 | "metadata": {
182 | "kernelspec": {
183 | "display_name": "Python 3",
184 | "language": "python",
185 | "name": "python3"
186 | },
187 | "language_info": {
188 | "codemirror_mode": {
189 | "name": "ipython",
190 | "version": 3
191 | },
192 | "file_extension": ".py",
193 | "mimetype": "text/x-python",
194 | "name": "python",
195 | "nbconvert_exporter": "python",
196 | "pygments_lexer": "ipython3",
197 | "version": "3.6.3"
198 | }
199 | },
200 | "nbformat": 4,
201 | "nbformat_minor": 2
202 | }
203 |
--------------------------------------------------------------------------------
/Content/Introduction/03-TheoryofData.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# A Brief Theory of Data"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Types of data"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "In this section, when I talk about the \"type\" of the data, I am not talking about the `dtype` (`int`, `float`, `bool`, `str`) used to represent the data in a NumPy array or Pandas DataFrame. In this context the \"type\" of the data is used in a more abstract sense."
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "## Take 1: Data+Design"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "[Data+Design]() is an excellent online book about the theory of data. It is very well thought out and beautiful as well. I highly recommend spending time reading it. In Chapter 1 of Data+Design, the authors cover [Basic Data Types](https://infoactive.co/data-design/ch01.html). For further details, see also the Wikipedia pages on [Levels of measurement](https://en.wikipedia.org/wiki/Level_of_measurement). Here is a short summary of those basic data types:"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "### Nominal"
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "metadata": {},
48 | "source": [
49 | "* Non-numerical\n",
50 | "* Usually, but not always strings\n",
51 | "* Non-ordered\n",
52 | "* Cannot be averaged"
53 | ]
54 | },
55 | {
56 | "cell_type": "code",
57 | "execution_count": 18,
58 | "metadata": {
59 | "collapsed": false
60 | },
61 | "outputs": [
62 | {
63 | "data": {
64 | "text/plain": [
65 | "['Oregon', 'California', 'Texas', 'Colorado']"
66 | ]
67 | },
68 | "execution_count": 18,
69 | "metadata": {},
70 | "output_type": "execute_result"
71 | }
72 | ],
73 | "source": [
74 | "states = ['Oregon', 'California', 'Texas', 'Colorado']\n",
75 | "states"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": 19,
81 | "metadata": {
82 | "collapsed": false
83 | },
84 | "outputs": [
85 | {
86 | "data": {
87 | "text/plain": [
88 | "['produce', 'diary', 'frozen']"
89 | ]
90 | },
91 | "execution_count": 19,
92 | "metadata": {},
93 | "output_type": "execute_result"
94 | }
95 | ],
96 | "source": [
97 | "grocery_sections = [\"produce\", 'diary', 'frozen']\n",
98 | "grocery_sections"
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": 20,
104 | "metadata": {
105 | "collapsed": false
106 | },
107 | "outputs": [
108 | {
109 | "data": {
110 | "text/plain": [
111 | "['male', 'female']"
112 | ]
113 | },
114 | "execution_count": 20,
115 | "metadata": {},
116 | "output_type": "execute_result"
117 | }
118 | ],
119 | "source": [
120 | "gender = ['male', 'female']\n",
121 | "gender"
122 | ]
123 | },
124 | {
125 | "cell_type": "markdown",
126 | "metadata": {},
127 | "source": [
128 | "### Ordinal"
129 | ]
130 | },
131 | {
132 | "cell_type": "markdown",
133 | "metadata": {},
134 | "source": [
135 | "* Non-numerical\n",
136 | "* Usually, but not always strings\n",
137 | "* Natural ordering\n",
138 | "* Sometimes can be averaged\n",
139 | "* Can assign numerical scale, but it will be arbitrary"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": 4,
145 | "metadata": {
146 | "collapsed": false
147 | },
148 | "outputs": [
149 | {
150 | "data": {
151 | "text/plain": [
152 | "['strongly disagree', 'disagre', 'neutral', 'agree', 'strongly agree']"
153 | ]
154 | },
155 | "execution_count": 4,
156 | "metadata": {},
157 | "output_type": "execute_result"
158 | }
159 | ],
160 | "source": [
161 | "response = ['strongly disagree', 'disagre', 'neutral', 'agree', 'strongly agree']\n",
162 | "response"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": 5,
168 | "metadata": {
169 | "collapsed": false
170 | },
171 | "outputs": [
172 | {
173 | "data": {
174 | "text/plain": [
175 | "['cold', 'hot']"
176 | ]
177 | },
178 | "execution_count": 5,
179 | "metadata": {},
180 | "output_type": "execute_result"
181 | }
182 | ],
183 | "source": [
184 | "temp = ['cold', 'hot']\n",
185 | "temp"
186 | ]
187 | },
188 | {
189 | "cell_type": "code",
190 | "execution_count": 6,
191 | "metadata": {
192 | "collapsed": false
193 | },
194 | "outputs": [
195 | {
196 | "data": {
197 | "text/plain": [
198 | "['short', 'medium', 'tall']"
199 | ]
200 | },
201 | "execution_count": 6,
202 | "metadata": {},
203 | "output_type": "execute_result"
204 | }
205 | ],
206 | "source": [
207 | "height = ['short', 'medium', 'tall']\n",
208 | "height"
209 | ]
210 | },
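211 | {
212 | "cell_type": "markdown",
213 | "metadata": {},
214 | "source": [
215 | "Here is a minimal sketch of assigning a numerical scale to ordinal data; note that the particular numbers (1 through 5) are an arbitrary choice:"
216 | ]
217 | },
218 | {
219 | "cell_type": "code",
220 | "execution_count": null,
221 | "metadata": {},
222 | "outputs": [],
223 | "source": [
224 | "# One possible (arbitrary) numerical scale for the Likert responses above\n",
225 | "scale = {'strongly disagree': 1, 'disagree': 2, 'neutral': 3, 'agree': 4, 'strongly agree': 5}\n",
226 | "scores = [scale[r] for r in ['agree', 'neutral', 'agree', 'strongly agree']]\n",
227 | "sum(scores)/len(scores)  # a mean response, meaningful only relative to the chosen scale"
228 | ]
229 | },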
211 | {
212 | "cell_type": "markdown",
213 | "metadata": {},
214 | "source": [
215 | "### Interval"
216 | ]
217 | },
218 | {
219 | "cell_type": "markdown",
220 | "metadata": {},
221 | "source": [
222 | "* Equally spaced numerical data\n",
223 | "* Ordered\n",
224 | "* Can either be discrete (int) or continuous (float)\n",
225 | "* No meaningful zero point\n",
226 | "* Examples:\n",
227 | " - Temperature in F or C\n",
228 | " - Dates/Times"
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": 22,
234 | "metadata": {
235 | "collapsed": false
236 | },
237 | "outputs": [
238 | {
239 | "data": {
240 | "text/plain": [
241 | "[32.1, 99.4, 210.0, -76.4]"
242 | ]
243 | },
244 | "execution_count": 22,
245 | "metadata": {},
246 | "output_type": "execute_result"
247 | }
248 | ],
249 | "source": [
250 | "temps = [32.1, 99.4, 210.0, -76.4]\n",
251 | "temps"
252 | ]
253 | },
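254 | {
255 | "cell_type": "markdown",
256 | "metadata": {},
257 | "source": [
258 | "Dates and times are also interval data. Here is a minimal sketch using the standard library's `datetime` module:"
259 | ]
260 | },
261 | {
262 | "cell_type": "code",
263 | "execution_count": null,
264 | "metadata": {},
265 | "outputs": [],
266 | "source": [
267 | "from datetime import date\n",
268 | "\n",
269 | "# Differences between dates are meaningful (in days), but there is\n",
270 | "# no meaningful zero date, so dates are interval rather than ratio data\n",
271 | "(date(2016, 9, 1) - date(2016, 6, 15)).days"
272 | ]
273 | },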
254 | {
255 | "cell_type": "markdown",
256 | "metadata": {},
257 | "source": [
258 | "### Ratio"
259 | ]
260 | },
261 | {
262 | "cell_type": "markdown",
263 | "metadata": {},
264 | "source": [
265 | "* Equally spaced, ordered numerical data\n",
266 | "* Can either be discrete or continuous\n",
267 | "* Meaningful zero point that indicates an absence of the measured entity\n",
268 | "* Examples:\n",
269 | " - Age in years\n",
270 | " - Height in inches"
271 | ]
272 | },
273 | {
274 | "cell_type": "code",
275 | "execution_count": 12,
276 | "metadata": {
277 | "collapsed": false
278 | },
279 | "outputs": [
280 | {
281 | "data": {
282 | "text/plain": [
283 | "[47, 76, 17, 48, 99, 53, 86, 45, 56, 38]"
284 | ]
285 | },
286 | "execution_count": 12,
287 | "metadata": {},
288 | "output_type": "execute_result"
289 | }
290 | ],
291 | "source": [
292 | "ages = [random.randint(0,100) for i in range(10)]\n",
293 | "ages"
294 | ]
295 | },
296 | {
297 | "cell_type": "code",
298 | "execution_count": 16,
299 | "metadata": {
300 | "collapsed": false
301 | },
302 | "outputs": [
303 | {
304 | "data": {
305 | "text/plain": [
306 | "[3.6024466624950744,\n",
307 | " 72.57678124421476,\n",
308 | " 28.563897505990518,\n",
309 | " 47.537636077547305,\n",
310 | " 48.56497103639384,\n",
311 | " 29.140383493314705,\n",
312 | " 71.6319961862486,\n",
313 | " 59.37821139476524,\n",
314 | " 71.10757477132888,\n",
315 | " 12.436518123166024]"
316 | ]
317 | },
318 | "execution_count": 16,
319 | "metadata": {},
320 | "output_type": "execute_result"
321 | }
322 | ],
323 | "source": [
324 | "height = [76.0*random.random() for i in range(10)]\n",
325 | "height"
326 | ]
327 | },
328 | {
329 | "cell_type": "markdown",
330 | "metadata": {},
331 | "source": [
332 | "### Categorical\n",
333 | "\n",
334 | "* Data is labelled by well separated categories\n",
335 | "* Often used as an umbrella for nominal and ordinal, which are unordered and ordered categorical data types respectively."
336 | ]
337 | },
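338 | {
339 | "cell_type": "markdown",
340 | "metadata": {},
341 | "source": [
342 | "As a minimal sketch (assuming Pandas is available, since this notebook does not otherwise use it), both kinds of categorical data can be represented explicitly:"
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "execution_count": null,
348 | "metadata": {},
349 | "outputs": [],
350 | "source": [
351 | "import pandas as pd\n",
352 | "\n",
353 | "# Unordered categories (nominal)\n",
354 | "colors = pd.Categorical(['red', 'green', 'blue', 'green'])\n",
355 | "\n",
356 | "# Ordered categories (ordinal)\n",
357 | "sizes = pd.Categorical(['small', 'large', 'medium'],\n",
358 | "                       categories=['small', 'medium', 'large'], ordered=True)\n",
359 | "sizes"
360 | ]
361 | },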
338 | {
339 | "cell_type": "markdown",
340 | "metadata": {},
341 | "source": [
342 | "## Take 2: Polaris, Tableau, d3/vega"
343 | ]
344 | },
345 | {
346 | "cell_type": "markdown",
347 | "metadata": {},
348 | "source": [
349 | "The data visualization community has spent a lot of time thinking carefully about fundamental data types. There is a large body of research and software projects that encode the results of that research into usable forms. Good examples of this research and software are:\n",
350 | "\n",
351 | "* [Polaris: A System for...](http://graphics.stanford.edu/papers/polaris_extended/polaris.pdf), C. Stolte, D. Tank and P. Hanrahan (2002).\n",
352 | "* [Tableau](http://www.tableau.com/), Tableau Software, website (2016).\n",
353 | "* [d3](http://d3js.org/), Data Driven Documents, website (2016).\n",
354 | "* [Vega](http://vega.github.io/vega/), Vega: A Visualization Grammar, website (2016).\n",
355 | "* [Vega-Lite](http://vega.github.io/vega-lite/), Vega-Lite: A High-Level Visualization Grammar, website (2016).\n",
356 | "* [polestar](http://vega.github.io/polestar/), Polestar website (2016).\n",
357 | "\n",
358 | "Here is a rough union of the different data types found in this body of work:\n",
359 | "\n",
360 | "* Ordinal (same as above)\n",
361 | "* Nominal (same as above)\n",
362 | "* Quantitative (ratio, interval)\n",
363 | "* Date/time (calendar dates and/or times)\n",
364 | "* Geographic (states, latitude/longitude)\n",
365 | "\n",
366 | "Some of these sofware packages also have a `text` data type that is meant for textual data that is not categorical."
367 | ]
368 | },
369 | {
370 | "cell_type": "markdown",
371 | "metadata": {},
372 | "source": [
373 | "## Variables"
374 | ]
375 | },
376 | {
377 | "cell_type": "markdown",
378 | "metadata": {},
379 | "source": [
380 | "* A **variable** is some quantity that is measured, such as \"age\"\n",
381 | "* A single variable can be measured in different ways that give different data types:\n",
382 | " - \"young\" or \"old\" = ordinal\n",
383 | " - Age ranges (0-9, 10-19, ...) = ordinal\n",
384 | " - Age in years = ratio"
385 | ]
386 | },
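387 | {
388 | "cell_type": "markdown",
389 | "metadata": {},
390 | "source": [
391 | "Here is a minimal sketch of measuring the same underlying variable (age) in the three ways listed above:"
392 | ]
393 | },
394 | {
395 | "cell_type": "code",
396 | "execution_count": null,
397 | "metadata": {},
398 | "outputs": [],
399 | "source": [
400 | "age_years = random.randint(0, 100)                # ratio: age in years\n",
401 | "low = (age_years // 10) * 10\n",
402 | "age_range = '{}-{}'.format(low, low + 9)          # ordinal: age range\n",
403 | "age_label = 'young' if age_years < 40 else 'old'  # ordinal: coarse label (the cutoff is arbitrary)\n",
404 | "age_years, age_range, age_label"
405 | ]
406 | },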
387 | {
388 | "cell_type": "markdown",
389 | "metadata": {},
390 | "source": [
391 | "## Records and data sets\n",
392 | "\n",
393 | " * A **record** or **sample** is one measurement of a set of variables\n",
394 | " * A **data set** is a set of records that measure the same set of variables in the same way"
395 | ]
396 | },
397 | {
398 | "cell_type": "code",
399 | "execution_count": 28,
400 | "metadata": {
401 | "collapsed": false
402 | },
403 | "outputs": [
404 | {
405 | "data": {
406 | "text/plain": [
407 | "[{'age': 52, 'height': 0.45153746742233736},\n",
408 | " {'age': 76, 'height': 56.65900992198216},\n",
409 | " {'age': 36, 'height': 22.419785610573825},\n",
410 | " {'age': 12, 'height': 16.34630476175516},\n",
411 | " {'age': 34, 'height': 35.392522637659134},\n",
412 | " {'age': 81, 'height': 41.75690996668162},\n",
413 | " {'age': 26, 'height': 32.123243497319},\n",
414 | " {'age': 83, 'height': 11.127118329861124},\n",
415 | " {'age': 96, 'height': 4.556533241526422},\n",
416 | " {'age': 0, 'height': 30.41328942526455}]"
417 | ]
418 | },
419 | "execution_count": 28,
420 | "metadata": {},
421 | "output_type": "execute_result"
422 | }
423 | ],
424 | "source": [
425 | "ages = [random.randint(0,100) for i in range(10)]\n",
426 | "heights = [76.0*random.random() for i in range(10)]\n",
427 | "data_set = [{'age':a, 'height':h} for a, h in zip(ages, heights)]\n",
428 | "data_set"
429 | ]
430 | },
431 | {
432 | "cell_type": "code",
433 | "execution_count": 29,
434 | "metadata": {
435 | "collapsed": false
436 | },
437 | "outputs": [
438 | {
439 | "data": {
440 | "text/plain": [
441 | "{'age': 52, 'height': 0.45153746742233736}"
442 | ]
443 | },
444 | "execution_count": 29,
445 | "metadata": {},
446 | "output_type": "execute_result"
447 | }
448 | ],
449 | "source": [
450 | "sample0 = data_set[0]\n",
451 | "sample0"
452 | ]
453 | },
454 | {
455 | "cell_type": "markdown",
456 | "metadata": {},
457 | "source": [
458 | "## Resources"
459 | ]
460 | },
461 | {
462 | "cell_type": "markdown",
463 | "metadata": {},
464 | "source": [
465 | "* [Data+Design](https://infoactive.co/data-design) Trina Chiasson, Dyanna Gregory, et al (2016)."
466 | ]
467 | }
468 | ],
469 | "metadata": {
470 | "kernelspec": {
471 | "display_name": "Python 3",
472 | "language": "python",
473 | "name": "python3"
474 | },
475 | "language_info": {
476 | "codemirror_mode": {
477 | "name": "ipython",
478 | "version": 3
479 | },
480 | "file_extension": ".py",
481 | "mimetype": "text/x-python",
482 | "name": "python",
483 | "nbconvert_exporter": "python",
484 | "pygments_lexer": "ipython3",
485 | "version": "3.5.2"
486 | }
487 | },
488 | "nbformat": 4,
489 | "nbformat_minor": 0
490 | }
491 |
--------------------------------------------------------------------------------
/Content/Introduction/images/data_science_vd.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calpolydatascience/data301/5ca50db37afb0be83e9b0e294385f7bd60634179/Content/Introduction/images/data_science_vd.png
--------------------------------------------------------------------------------
/Content/Model/01-Introduction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Introduction"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "**Learning Objectives:** Learn about the theory and practice of statistical modelling."
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "## Ouline"
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "* [Probability](02-Probability.ipynb)\n",
29 | "* [Common Distributions](03-CommonDistributions.ipynb)\n",
30 | "* [Information Theory](04-InformationTheory.ipynb)\n",
31 | "* [Modelling Overview](05-ModellingOverview.ipynb)\n",
32 | "* [Estimators, Bias, Variance](06-EstimatorsBiasVariance.ipynb)\n",
33 | "* [Bootstrap Resampling](07-BootstrapResampling.ipynb)\n",
34 | "* [Maximum Likelihood Estomation](08-MLE.ipynb)\n",
35 | "* [Linear Regression](09-LinearRegression.ipynb)\n",
36 | "* [Specific Models](10-SpecificModels.ipynb)"
37 | ]
38 | }
39 | ],
40 | "metadata": {
41 | "kernelspec": {
42 | "display_name": "Python 3",
43 | "language": "python",
44 | "name": "python3"
45 | },
46 | "language_info": {
47 | "codemirror_mode": {
48 | "name": "ipython",
49 | "version": 3
50 | },
51 | "file_extension": ".py",
52 | "mimetype": "text/x-python",
53 | "name": "python",
54 | "nbconvert_exporter": "python",
55 | "pygments_lexer": "ipython3",
56 | "version": "3.6.3"
57 | }
58 | },
59 | "nbformat": 4,
60 | "nbformat_minor": 2
61 | }
62 |
--------------------------------------------------------------------------------
/Content/Model/04-InformationTheory.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Information Theory "
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "- Entropy\n",
15 | "- Kullback-Leibler divergence\n",
16 | "- Cross entropy"
17 | ]
18 | },
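19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "As a starting point, here is a minimal sketch of these three quantities for a pair of discrete distributions (using NumPy, which is assumed to be available):"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": null,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "import numpy as np\n",
33 | "\n",
34 | "p = np.array([0.5, 0.25, 0.25])  # true distribution\n",
35 | "q = np.array([0.4, 0.4, 0.2])    # model distribution\n",
36 | "\n",
37 | "entropy = -np.sum(p * np.log2(p))      # H(p)\n",
38 | "kl = np.sum(p * np.log2(p / q))        # D_KL(p || q)\n",
39 | "cross_entropy = entropy + kl           # H(p, q) = H(p) + D_KL(p || q)\n",
40 | "entropy, kl, cross_entropy"
41 | ]
42 | },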
19 | {
20 | "cell_type": "code",
21 | "execution_count": null,
22 | "metadata": {},
23 | "outputs": [],
24 | "source": []
25 | }
26 | ],
27 | "metadata": {
28 | "kernelspec": {
29 | "display_name": "Python 3",
30 | "language": "python",
31 | "name": "python3"
32 | },
33 | "language_info": {
34 | "codemirror_mode": {
35 | "name": "ipython",
36 | "version": 3
37 | },
38 | "file_extension": ".py",
39 | "mimetype": "text/x-python",
40 | "name": "python",
41 | "nbconvert_exporter": "python",
42 | "pygments_lexer": "ipython3",
43 | "version": "3.5.2"
44 | }
45 | },
46 | "nbformat": 4,
47 | "nbformat_minor": 2
48 | }
49 |
--------------------------------------------------------------------------------
/Content/Model/05-ModellingOverview.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Modeling Overview"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "**Learning Objectives:** Get a general conceptual understanding of statistical modeling and machine learning."
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "## Imports"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 1,
27 | "metadata": {},
28 | "outputs": [],
29 | "source": [
30 | "import numpy as np"
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": 2,
36 | "metadata": {},
37 | "outputs": [],
38 | "source": [
39 | "import matplotlib.pyplot as plt\n",
40 | "%matplotlib inline"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "## 1 Introduction"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "**Modeling**, or **statistical modeling** is a very general approach for using data in a variety of productive ways. In other circles these same ideas go under the name **machine learning** or more trendy phrases such as **machine intelligence**. Some of the slipery terminology comes from the fact that research in this field has been done across different academic disciplines such as statistics, computer science, mathematics and physics. Each field has developed its own emphases and terminologies.\n",
55 | "\n",
56 | "Some of the goals of modeling include:\n",
57 | "\n",
58 | "* Predict future events based on past data.\n",
59 | "* Provide intuitive understanding data.\n",
60 | "* Provide a mathematical model for data that lacks first principles theoretical models (as in Physics).\n",
61 | "* Quantify uncertainties.\n",
62 | "* Learn generalizable information from data.\n",
63 | "\n",
64 | "As pointed out by Goodfellow et al., Mitchell (1997) provided a nice general definition of this idea of \"learning from data\":\n",
65 | "\n",
66 | "> A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E\n",
67 | "\n",
68 | "In this course, we will focus on two different ways of thinking about models:\n",
69 | "\n",
70 | "1. Forward = Generative models\n",
71 | "2. Backwards = Inference with models"
72 | ]
73 | },
74 | {
75 | "cell_type": "markdown",
76 | "metadata": {},
77 | "source": [
78 | "## 2 Generative models"
79 | ]
80 | },
81 | {
82 | "cell_type": "markdown",
83 | "metadata": {},
84 | "source": [
85 | "The idea of a **generative model** is that we can use a model to generate data. Usually, our models will have parameters that we get to (and have to) choose. Here is a diagram that shows show this works:\n",
86 | "\n",
87 | "**Model** $+$ **Parameters** $\\rightarrow$ **Generated Data**\n",
88 | "\n",
89 | "Let use this process to model the time between soccer goals in a soccer game. The appropriate distribution for this would be the exponential distribution. Let's say that we know the average time between goals is 20 minutes. Using this parameter and the exponential distribution (our model), we can create a dataset of the time between specific goals (100 of them!) in soccer games:"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": 4,
95 | "metadata": {},
96 | "outputs": [
97 | {
98 | "data": {
99 | "text/plain": [
100 | "array([ 4.71507829e+01, 4.22468719e+01, 1.66022228e+01,\n",
101 | " 1.95064886e+01, 5.44656709e+00, 1.01567676e+01,\n",
102 | " 3.03634966e+01, 5.90058457e+01, 5.96845511e+01,\n",
103 | " 4.56862724e+01, 7.67989738e+00, 7.38203287e+00,\n",
104 | " 3.24785943e+01, 8.85298367e+00, 1.31063440e+01,\n",
105 | " 1.19161635e+01, 4.97332313e+01, 4.41001039e+00,\n",
106 | " 3.15167607e+01, 3.67762425e+01, 3.18983132e+01,\n",
107 | " 3.78041120e+01, 2.12051012e-02, 3.46656245e+01,\n",
108 | " 2.50736875e+00, 9.33988324e+00, 3.54828328e+00,\n",
109 | " 8.97400411e+00, 3.25434878e+01, 2.35341585e+01,\n",
110 | " 8.07792154e+00, 1.15779362e+01, 1.72659522e+01,\n",
111 | " 2.02563042e+01, 9.18558896e-01, 7.24242360e+00,\n",
112 | " 2.65958441e+01, 5.50392122e+01, 2.08735129e+01,\n",
113 | " 2.10050443e+00, 1.01509222e+00, 9.25728340e-01,\n",
114 | " 7.83342215e-01, 1.27210814e+01, 2.78212012e+00,\n",
115 | " 3.44151046e+01, 7.62980429e-01, 3.54275758e+00,\n",
116 | " 1.92608673e+01, 6.14986481e+00, 3.05824946e+00,\n",
117 | " 5.62379262e+00, 1.44342878e+01, 1.42093249e+00,\n",
118 | " 1.50380526e+01, 1.37009936e+01, 1.45319686e+01,\n",
119 | " 1.63783217e+01, 8.98506500e+00, 9.53802491e+00,\n",
120 | " 4.56794033e+01, 2.84668139e+01, 1.13010953e+01,\n",
121 | " 5.11700244e+00, 4.63142783e+01, 1.23100532e+00,\n",
122 | " 1.31443132e+01, 4.40701045e+01, 1.34850432e+01,\n",
123 | " 2.32386429e+01, 7.45366566e+00, 2.60398837e+01,\n",
124 | " 1.78306323e+00, 7.43059105e+00, 6.86534103e+00,\n",
125 | " 2.30494429e+01, 9.50409199e-01, 1.68655453e+01,\n",
126 | " 3.65952907e-01, 1.95535102e+01, 7.79056167e+00,\n",
127 | " 3.16661162e+00, 2.04141580e+01, 1.07834780e+02,\n",
128 | " 2.54883375e+01, 1.02814340e+01, 1.59914411e+01,\n",
129 | " 1.64449898e+02, 1.27400145e+00, 3.43275789e+00,\n",
130 | " 3.29461633e+01, 2.60505252e+01, 2.83368097e+01,\n",
131 | " 4.11928128e+01, 1.13617707e+01, 9.85044211e+00,\n",
132 | " 7.50329192e+00, 2.23366012e+01, 4.48704797e+01,\n",
133 | " 2.06561925e+00])"
134 | ]
135 | },
136 | "execution_count": 4,
137 | "metadata": {},
138 | "output_type": "execute_result"
139 | }
140 | ],
141 | "source": [
142 | "β = 20 # Parameter\n",
143 | "data = np.random.exponential(β, 100) # Model\n",
144 | "data # data"
145 | ]
146 | },
147 | {
148 | "cell_type": "markdown",
149 | "metadata": {},
150 | "source": [
151 | "We can then visualize this dataset:"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": 5,
157 | "metadata": {},
158 | "outputs": [
159 | {
160 | "data": {
161 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD8CAYAAABn919SAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4xLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvAOZPmwAADy5JREFUeJzt3X+sZHV9xvH30wXUKi1QrmSD0IuGGkkTF3JDSGiNxR/lRyuYtomksZuUZG0iiaS2cdU/ik2bQFslaWK0a6BuG9RalUAUWwnFGpOKveCy7HZFENcWWXevWgukDe3ip3/M2eS63tmZOz/u3PnyfiWTmfnOmZ2Hbw7PPffcc86kqpAkzb+fmnUASdJkWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRpy0kR925pln1uLi4kZ+pCTNvQceeOB7VbUwaLkNLfTFxUWWl5c38iMlae4l+fYwy7nLRZIaYaFLUiMsdElqhIUuSY2w0CWpEQMLPckLk3w1yUNJ9id5Xzf+0STfSrKnu22bflxJUj/DHLb4LHBZVT2T5GTgy0k+3732h1X1qenFkyQNa2ChV+876p7pnp7c3fzeOknaZIbah55kS5I9wBHgnqq6v3vpT5PsTXJLkhdMLaUkaaChzhStqueAbUlOA+5I8ovAu4HvAqcAu4B3AX98/HuT7AB2AJx77rkjB13c+bmR33vwpqtGfq8kzYt1HeVSVT8EvghcXlWHqudZ4K+Bi/u8Z1dVLVXV0sLCwEsRSJJGNMxRLgvdljlJXgS8Hvh6kq3dWIBrgH3TDCpJOrFhdrlsBXYn2ULvB8Anq+qzSf4pyQIQYA/we1PMKUkaYJijXPYCF64xftlUEkmSRuKZopLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNGFjoSV6Y5KtJHkqyP8n7uvHzktyf5NEkf5fklOnHlST1M8wW+rPAZVX1amAbcHmSS4CbgVuq6nzgP4HrphdTkjTIwEKvnme6pyd3twIuAz7Vje8GrplKQknSUIbah55kS5I9wBHgHuCbwA+r6mi3yBPA2X3euyPJcpLllZWVSWSWJK1hqEKvqueqahvwMuBi4FVrLdbnvbuqaqmqlhYWFkZPKkk6oXUd5VJVPwS+CFwCnJbkpO6llwFPTjaaJGk9hjnKZSHJad3jFwGvBw4A9wG/2S22HbhzWiElSYOdNHgRtgK7k2yh9wPgk1X12ST/BnwiyZ8AXwNunWJOSdIAAwu9qvYCF64x/ji9/emSpE3AM0UlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktSIgYWe5Jwk9yU5kGR/knd04zcm+U6SPd3tyunHlST1c9IQyxwF3llVDyY5FXggyT3da7dU1V9ML54kaVgDC72qDgGHusdPJzkAnD3tYJKk9VnXPvQki8CFwP3d0PVJ9ia5LcnpE84mSVqHoQs9yUuATwM3VNVTwIeAVwDb6G3Bv7/P+3YkWU6yvLKyMoHIkqS1DFXoSU6mV+a3V9VnAKrqcFU9V1U/Aj4CXLzWe6tqV1UtVdXSwsLCpHJLko4zzFEuAW4FDlTVB1aNb1212JuBfZOPJ0ka1jBHuVwKvBV4OMmebuw9wLVJtgEFHATeNpWEkqShDHOUy5eBrPHS3ZOPI0kalWeKSlIjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNWKYy+fOvcWdnxvr/QdvumpCSSRpetxCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpEQMLPck5Se5LciDJ/iTv6MbPSHJPkke7+9OnH1eS1M8wW+hHgXdW1auAS4C3J7kA2AncW1XnA/d2zyVJMzKw0KvqUFU92D1+GjgAnA1cDezuFtsNXDOtkJKkwda1Dz3JInAhcD9wVlUdgl7pAy/t854dSZaTLK+srIyXVpLU19CFnuQlwKeBG6rqqWHfV1W7qmqpqpYWFhZGyShJGsJQhZ7kZHplfntVfaYbPpxka/f6VuDIdCJKkoYxzFEuAW4FDlTVB1a9dBewvXu8Hbhz8vEkScMa5vK5lwJvBR5Osqcbew9wE/DJJNcB/w781nQiSpKGMbDQq+rLQPq8/LrJxpEkjcozRSWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1IiBhZ7ktiRHkuxbNXZjku8k2dPdrpxuTEnSIMNsoX8UuHyN8Vuqalt3u3uysSRJ6zWw0KvqS8APNiCLJGkM4+xDvz7J3m6XzOkTSyRJGsmohf4h4BXANuAQ8P5+CybZkWQ5yfLKysqIHydJGmSkQq+qw1X1XFX9CPgIcPEJlt1VVUtVtbSwsDBqTknSACMVepKtq56+GdjXb1lJ0sY4adACST4OvBY4M8kTwB8Br02yDSjgIPC2KWaUJA1hYKFX1bVrDN86hSySpDF4pqgkNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY0YeBy6YHHn50Z+78GbrppgEknqzy10SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpEV7LZcq8DoykjeIWuiQ1YmChJ7ktyZEk+1aNnZHkniSPdvenTzemJGmQYbbQPwpcftzYTuDeqjofuLd7LkmaoYGFXlVfAn5w3PDVwO7u8W7gmgnnkiSt06j70M+qqkMA3f1LJxdJkjSKqf9RNMmOJMtJlldWVqb9cZL0vDVqoR9OshWguz/Sb8Gq2lVVS1W1tLCwMOLHSZIGGbXQ7wK2d4+3A3dOJo4kaVTDHLb4ceBfgFcmeSLJdcBNwBuSPAq8oXsuSZqhgWeKVtW1fV563YSzSJLG4JmiktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqRF+wUWj/GIN6fnHLXRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1IixLp+b5CDwNPAccLSqliYRSpK0fpO4HvqvVNX3JvDvSJLG4C4XSWrEuFvoBXwhSQF/VVW7jl8gyQ5gB8C555475sc9v4zzrUOSnn/G3UK/tKouAq4A3p7kNccvUFW7qmqpqpYWFhbG/DhJUj9jFXpVPdndHwHuAC6eRChJ0vqNXOhJXpzk1GOPgTcC+yYVTJK0PuPsQz8LuCPJsX/nY1X1DxNJJUl
at5ELvaoeB149wSySpDF42KIkNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGjGJ7xRVY2b5TUkHb7pqZp8tzTu30CWpERa6JDXCQpekRljoktQIC12SGuFRLtIcG+eIJI8oWp9xj/7aiPl2C12SGjFWoSe5PMkjSR5LsnNSoSRJ6zdyoSfZAnwQuAK4ALg2yQWTCiZJWp9xttAvBh6rqser6n+BTwBXTyaWJGm9xin0s4H/WPX8iW5MkjQD4xzlkjXG6icWSnYAO7qnzyR5ZMTPOxP43ojvnaV5zQ0zyJ6bJ/LPzOucb2juCc31Mc75AGPO988Ps9A4hf4EcM6q5y8Dnjx+oaraBewa43MASLJcVUvj/jsbbV5zw/xmN/fGm9fs85q7n3F2ufwrcH6S85KcArwFuGsysSRJ6zXyFnpVHU1yPfCPwBbgtqraP7FkkqR1GetM0aq6G7h7QlkGGXu3zYzMa26Y3+zm3njzmn1ec68pVT/xd0xJ0hzy1H9JasRcFPq8XGIgyTlJ7ktyIMn+JO/oxm9M8p0ke7rblbPOerwkB5M83OVb7sbOSHJPkke7+9NnnXO1JK9cNad7kjyV5IbNOt9JbktyJMm+VWNrznF6/rJb5/cmuWiT5f7zJF/vst2R5LRufDHJ/6ya+w9vstx9140k7+7m+5Ekvzqb1GOqqk19o/cH128CLwdOAR4CLph1rj5ZtwIXdY9PBb5B77IINwJ/MOt8A7IfBM48buzPgJ3d453AzbPOOWA9+S6943U35XwDrwEuAvYNmmPgSuDz9M73uAS4f5PlfiNwUvf45lW5F1cvtwnne811o/v/9CHgBcB5XedsmfV/w3pv87CFPjeXGKiqQ1X1YPf4aeAA83327NXA7u7xbuCaGWYZ5HXAN6vq27MO0k9VfQn4wXHD/eb4auBvqucrwGlJtm5M0h+3Vu6q+kJVHe2efoXeeSibSp/57udq4BNV9WxVfQt4jF73zJV5KPS5vMRAkkXgQuD+buj67tfT2zbbrotOAV9I8kB3di/AWVV1CHo/rICXzizdYG8BPr7q+Waf72P6zfE8rfe/S++3iWPOS/K1JP+c5JdnFeoE1lo35mm++5qHQh/qEgObSZKXAJ8Gbqiqp4APAa8AtgGHgPfPMF4/l1bVRfSunvn2JK+ZdaBhdSe2vQn4+25oHuZ7kLlY75O8FzgK3N4NHQLOraoLgd8HPpbkZ2aVbw391o25mO9B5qHQh7rEwGaR5GR6ZX57VX0GoKoOV9VzVfUj4CNswl/lqurJ7v4IcAe9jIeP/Zrf3R+ZXcITugJ4sKoOw3zM9yr95njTr/dJtgO/Bvx2dTuiu10W3+8eP0BvX/QvzC7ljzvBurHp53sY81Doc3OJgSQBbgUOVNUHVo2v3vf5ZmDf8e+dpSQvTnLqscf0/uC1j948b+8W2w7cOZuEA13Lqt0tm32+j9Nvju8Cfqc72uUS4L+O7ZrZDJJcDrwLeFNV/feq8YX0viuBJC8Hzgcen03Kn3SCdeMu4C1JXpDkPHq5v7rR+cY267/KDnOj9xf/b9D7af/eWec5Qc5fovdr2l5gT3e7Evhb4OFu/C5g66yzHpf75fT+wv8QsP/YHAM/B9wLPNrdnzHrrGtk/2ng+8DPrhrblPNN74fOIeD/6G0RXtdvjuntAvhgt84/DCxtstyP0dvnfGw9/3C37G9069BDwIPAr2+y3H3XDeC93Xw/Alwx6/VllJtnikpSI+Zhl4skaQgWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5Jjfh/tB8YwL20I2QAAAAASUVORK5CYII=\n",
162 | "text/plain": [
163 | ""
164 | ]
165 | },
166 | "metadata": {},
167 | "output_type": "display_data"
168 | }
169 | ],
170 | "source": [
171 | "plt.hist(data, bins=20);"
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
178 | "This example clarifies the choices that you have to make when building a generative model:\n",
179 | "\n",
180 | "* You have to pick a model to use\n",
181 | "* You have to pick the parameters of the model\n",
182 | "\n",
183 | "To assess if you have made good choice, you will have to perform some sort of comparison of the generated data, with actual observations from the system you are intenting to model. In general, you would like to know that the parameters of your model are choosen in a way that makes your model useful. That is exactly what **inference** provides."
184 | ]
185 | },
186 | {
187 | "cell_type": "markdown",
188 | "metadata": {},
189 | "source": [
190 | "## 3 Inference with models"
191 | ]
192 | },
193 | {
194 | "cell_type": "markdown",
195 | "metadata": {},
196 | "source": [
197 | "**Inference** is a way of **learning from data**. In the context of generative models, inference allow you to go backwards from **observed data** to parameters that optimize how well the model works for that observed data. Here is a diagram of inference:\n",
198 | "\n",
199 | "**Model** $+$ **Observed Data** + **Training** $\\rightarrow$ **Best Parameters**\n",
200 | "\n",
201 | "Notice the similarities to generative modelling:\n",
202 | "\n",
203 | "* You still have to pick your model!!!\n",
204 | "\n",
205 | "However the differences are most important:\n",
206 | "\n",
207 | "* The data is not generated, it is observed\n",
208 | "* The parameters are learned, rather than guessed\n",
209 | "* A **training** step is required.\n",
210 | "\n",
211 | "The magic of inference is that once you have performed inference to find the best parameters, you can turn it around and generate predictions:\n",
212 | "\n",
213 | "**Model** $+$ **Best Parameters** $\\rightarrow$ **Predictions**\n",
214 | "\n",
215 | "If your model and parameters are good, you should be able to predict outcomes you haven't seen before.\n",
216 | "\n",
217 | "Let's see how this would work with the above soccer goal data. You have been handed a small dataset of the times (in minutes) between soccer goals. This is your observed data:"
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": 6,
223 | "metadata": {
224 | "collapsed": true
225 | },
226 | "outputs": [],
227 | "source": [
228 | "observed_data = np.array(\n",
229 | " [ 6.57946838, 16.66471659, 52.11420679, 25.64266511,\n",
230 | " 10.90558697, 17.74796824, 8.0075313 , 3.98989899,\n",
231 | " 13.46723746, 24.90308858]\n",
232 | ")"
233 | ]
234 | },
235 | {
236 | "cell_type": "markdown",
237 | "metadata": {},
238 | "source": [
239 | "We are again going to pick the exponential distribution, with a parameter $\\beta$. We need to perform some type of inference to find the best value of $\\beta$ to use. We will often denote the best parameter with a hat, so let's call the best value $\\hat\\beta$. There are much more sophisticated way of finding the best parameter, but for now let's find it by just taking the mean of the observed data:"
240 | ]
241 | },
242 | {
243 | "cell_type": "code",
244 | "execution_count": 7,
245 | "metadata": {},
246 | "outputs": [],
247 | "source": [
248 | "beta_hat = observed_data.mean()"
249 | ]
250 | },
251 | {
252 | "cell_type": "markdown",
253 | "metadata": {},
254 | "source": [
255 | "Now that we have the \"best\" value of beta, we can predict the times between goals of the the *next* 20 goals to happen:"
256 | ]
257 | },
258 | {
259 | "cell_type": "code",
260 | "execution_count": 8,
261 | "metadata": {},
262 | "outputs": [
263 | {
264 | "data": {
265 | "text/plain": [
266 | "array([ 31.42502418, 6.87980742, 1.64119901, 7.45192818,\n",
267 | " 10.65510412, 8.38029071, 5.497031 , 13.21566205,\n",
268 | " 32.94057228, 28.06311202, 4.74709422, 36.54064439,\n",
269 | " 12.65607335, 3.95839439, 12.18467277, 0.32111527,\n",
270 | " 5.43256726, 6.93929714, 9.08066106, 14.51748799])"
271 | ]
272 | },
273 | "execution_count": 8,
274 | "metadata": {},
275 | "output_type": "execute_result"
276 | }
277 | ],
278 | "source": [
279 | "new_data = np.random.exponential(beta_hat, 20)\n",
280 | "new_data"
281 | ]
282 | },
283 | {
284 | "cell_type": "markdown",
285 | "metadata": {},
286 | "source": [
287 | "The obvious question to ask it then this: how did we do. To determine that, we would need to actually observe the next 20 goals and see how their times compare to the generated values. This is a very simple, model so we wouldn't expect the goals to exactly match these predictions, but we might hope that in some aggregate sense our predictions are accurate. In future notebooks, we will go into great detail about assessing how well a model works."
288 | ]
289 | }
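290 | ,
291 | {
292 | "cell_type": "markdown",
293 | "metadata": {},
294 | "source": [
295 | "As a rough sketch of such an aggregate comparison, we can at least check that summary statistics of the predictions resemble those of the observed data (with only 10 and 20 samples, we expect them to be close but not identical):"
296 | ]
297 | },
298 | {
299 | "cell_type": "code",
300 | "execution_count": null,
301 | "metadata": {},
302 | "outputs": [],
303 | "source": [
304 | "# Compare aggregate statistics of the observed data and the model's predictions\n",
305 | "(observed_data.mean(), observed_data.std()), (new_data.mean(), new_data.std())"
306 | ]
307 | }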
290 | ],
291 | "metadata": {
292 | "kernelspec": {
293 | "display_name": "Python 3",
294 | "language": "python",
295 | "name": "python3"
296 | },
297 | "language_info": {
298 | "codemirror_mode": {
299 | "name": "ipython",
300 | "version": 3
301 | },
302 | "file_extension": ".py",
303 | "mimetype": "text/x-python",
304 | "name": "python",
305 | "nbconvert_exporter": "python",
306 | "pygments_lexer": "ipython3",
307 | "version": "3.6.3"
308 | }
309 | },
310 | "nbformat": 4,
311 | "nbformat_minor": 2
312 | }
313 |
--------------------------------------------------------------------------------
/Content/Model/10-SpecificModels.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Specific Models"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "- Feature engineering (PDSH Chapter 5.4)\n",
15 | "- Naive Bayes (PDSH Chapter 5.5)\n",
16 | "- Linear regression (5.6)\n",
17 | "- SVM (5.7)\n",
18 | "- Random forests (5.8)\n",
19 | "- PCA (5.9)\n",
20 | "- Manifold learning (5.10)\n",
21 | "- K-means (5.11)\n",
22 | "- Gaussian mixtures (5.12)\n",
23 | "- Kernel density estimation (5.13)\n",
24 | "- Neural networks (DL, Keras)"
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": null,
30 | "metadata": {},
31 | "outputs": [],
32 | "source": []
33 | }
34 | ],
35 | "metadata": {
36 | "kernelspec": {
37 | "display_name": "Python 3",
38 | "language": "python",
39 | "name": "python3"
40 | },
41 | "language_info": {
42 | "codemirror_mode": {
43 | "name": "ipython",
44 | "version": 3
45 | },
46 | "file_extension": ".py",
47 | "mimetype": "text/x-python",
48 | "name": "python",
49 | "nbconvert_exporter": "python",
50 | "pygments_lexer": "ipython3",
51 | "version": "3.5.2"
52 | }
53 | },
54 | "nbformat": 4,
55 | "nbformat_minor": 2
56 | }
57 |
--------------------------------------------------------------------------------
/Content/Model/Scipy.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# SciPy: Numerical Algorithms for Python"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "**Learning Objective:** Learn how to find and use numerical algorithms in the SciPy package."
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 1,
20 | "metadata": {
21 | "collapsed": false
22 | },
23 | "outputs": [],
24 | "source": [
25 | "%matplotlib inline\n",
26 | "from matplotlib import pyplot as plt\n",
27 | "import numpy as np"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "## Overview"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
41 | "The SciPy framework builds on top NumPy and provides a large number of numerical algorithms for working with data. Some of the topics that SciPy covers are:\n",
42 | "\n",
43 | "* Special functions ([scipy.special](http://docs.scipy.org/doc/scipy/reference/special.html))\n",
44 | "* Integration/ODEs ([scipy.integrate](http://docs.scipy.org/doc/scipy/reference/integrate.html))\n",
45 | "* Optimization ([scipy.optimize](http://docs.scipy.org/doc/scipy/reference/optimize.html))\n",
46 | "* Interpolation ([scipy.interpolate](http://docs.scipy.org/doc/scipy/reference/interpolate.html))\n",
47 | "* Fourier Transforms ([scipy.fftpack](http://docs.scipy.org/doc/scipy/reference/fftpack.html))\n",
48 | "* Signal Processing ([scipy.signal](http://docs.scipy.org/doc/scipy/reference/signal.html))\n",
49 | "* Linear Algebra ([scipy.linalg](http://docs.scipy.org/doc/scipy/reference/linalg.html))\n",
50 | "* Sparse Eigenvalue Problems ([scipy.sparse](http://docs.scipy.org/doc/scipy/reference/sparse.html))\n",
51 | "* Statistics ([scipy.stats](http://docs.scipy.org/doc/scipy/reference/stats.html))\n",
52 | "* Multi-dimensional image processing ([scipy.ndimage](http://docs.scipy.org/doc/scipy/reference/ndimage.html))\n",
53 | "* File IO ([scipy.io](http://docs.scipy.org/doc/scipy/reference/io.html))\n",
54 | "\n",
55 | "This notebook is not a complete tour of SciPy. Rather it focuses on the most important parts of the package for processing data.\n",
56 | "\n",
57 | "In many cases, you will want to import specific names from `scipy` subpackages. However, as a start, it is helpful to do the following import:"
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": 2,
63 | "metadata": {
64 | "collapsed": false
65 | },
66 | "outputs": [],
67 | "source": [
68 | "import scipy as sp"
69 | ]
70 | },
71 | {
72 | "cell_type": "markdown",
73 | "metadata": {},
74 | "source": [
75 | "## Approach"
76 | ]
77 | },
78 | {
79 | "cell_type": "markdown",
80 | "metadata": {},
81 | "source": [
82 | "One of the most important skills in data science is to be able to find Python functions and classes in a module and learn how to use them yourself. Here are some recommended steps on how to go about this:\n",
83 | "\n",
84 | "* Find the online documentation for the package you are using.\n",
85 | "* Try to find the subpackage or even the function that looks like will do the job.\n",
86 | "* Import the module, function or class and use tab completion and `?` to explore it.\n",
87 | "* Try using the function or class for an extremely simple case where you know the answer.\n",
88 | "* Then try using for your real problem."
89 | ]
90 | },
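91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "For example, here is a minimal sketch of the \"simple case\" step using `scipy.integrate.quad`, on an integral whose answer we know: $\\int_0^1 x^2\\,dx = 1/3$."
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": null,
101 | "metadata": {},
102 | "outputs": [],
103 | "source": [
104 | "from scipy.integrate import quad\n",
105 | "\n",
106 | "# Integrate x**2 from 0 to 1; quad returns (value, estimated absolute error)\n",
107 | "value, abs_error = quad(lambda x: x**2, 0, 1)\n",
108 | "value, abs_error"
109 | ]
110 | },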
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "## Resources"
96 | ]
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "metadata": {},
101 | "source": [
102 | "* [SciPy Website](http://www.scipy.org)\n",
103 | "* [SciPy Reference Documentation](http://docs.scipy.org/doc/scipy/reference/)\n",
104 | "* [Python Scientific Lecture Notes](http://scipy-lectures.github.io/index.html), Edited by Valentin Haenel,\n",
105 | "Emmanuelle Gouillart and Gaël Varoquaux.\n",
106 | "* [Lectures on Scientific Computing with Python](https://github.com/jrjohansson/scientific-python-lectures), J.R. Johansson.\n",
107 | "* [Introduction to Scientific Computing in Python](http://nbviewer.ipython.org/github/jakevdp/2014_fall_ASTR599/tree/master/), Jake Vanderplas."
108 | ]
109 | }
110 | ],
111 | "metadata": {
112 | "kernelspec": {
113 | "display_name": "Python 3",
114 | "language": "python",
115 | "name": "python3"
116 | },
117 | "language_info": {
118 | "codemirror_mode": {
119 | "name": "ipython",
120 | "version": 3
121 | },
122 | "file_extension": ".py",
123 | "mimetype": "text/x-python",
124 | "name": "python",
125 | "nbconvert_exporter": "python",
126 | "pygments_lexer": "ipython3",
127 | "version": "3.5.2"
128 | }
129 | },
130 | "nbformat": 4,
131 | "nbformat_minor": 1
132 | }
133 |
--------------------------------------------------------------------------------
/Content/Model/images/rectangles.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calpolydatascience/data301/5ca50db37afb0be83e9b0e294385f7bd60634179/Content/Model/images/rectangles.png
--------------------------------------------------------------------------------
/Content/Model/images/trapz.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calpolydatascience/data301/5ca50db37afb0be83e9b0e294385f7bd60634179/Content/Model/images/trapz.png
--------------------------------------------------------------------------------
/Content/Programming/01-Introduction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Introduction"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "**Learning Objective:** Learn how to use Python effectively for data science."
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "A prerequisite for this course is that you already know how to program, and hopefully already in Python. However, many of you may be rusty with Python and some of you might not have worked in Python yet. Lastly, many of you may have learned Python outside of the context of data science. To bring everyone up to the same level with using Python for data science, we begin with a brief tour of Python for data science. For this purpose we will use Jake VanderPlas's excellent [Whirlwind Tour of Python](../WhirlwindTourOfPython/00-Introduction.ipynb).\n",
22 | "\n",
23 | "This book is available on GitHub at https://github.com/jakevdp/WhirlwindTourOfPython and is included in this repository as a Git submodule."
24 | ]
25 | }
26 | ],
27 | "metadata": {
28 | "kernelspec": {
29 | "display_name": "Python 3",
30 | "language": "python",
31 | "name": "python3"
32 | },
33 | "language_info": {
34 | "codemirror_mode": {
35 | "name": "ipython",
36 | "version": 3
37 | },
38 | "file_extension": ".py",
39 | "mimetype": "text/x-python",
40 | "name": "python",
41 | "nbconvert_exporter": "python",
42 | "pygments_lexer": "ipython3",
43 | "version": "3.5.2"
44 | }
45 | },
46 | "nbformat": 4,
47 | "nbformat_minor": 2
48 | }
49 |
--------------------------------------------------------------------------------
/Content/Programming/PythonPackages.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Python Packages"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "**Learning Objective:** Learn what a Python package and how to import it."
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "## Built-ins"
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "The Python programming language offers a very minimal set of objects and functions. These are called **built-ins**"
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": null,
34 | "metadata": {
35 | "collapsed": false
36 | },
37 | "outputs": [],
38 | "source": [
39 | "dir(__builtins__)"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "All other capabilities are shipped as **packages** that have to be **imported** before you can use them."
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "## Packages"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {},
59 | "source": [
60 | "* A Python **package** is one of two things:\n",
61 | " - A single file with a `.py` (pure Python) extension or `.so` extension (compiled).\n",
62 | " - A directory of files with `.py`/`.so` extensions and `__init__.py` files.\n",
63 | "* To use a package you must first **import** it.\n",
64 | "* The files within packages are called **modules**.\n",
65 | "* Once you import a package, you can usually see what files it comes from using the `__file__` attribute.\n",
66 | "\n",
67 | "Let's import the `functools` package and see what file it is coming from:"
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "execution_count": null,
73 | "metadata": {
74 | "collapsed": false
75 | },
76 | "outputs": [],
77 | "source": [
78 | "import functools"
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": null,
84 | "metadata": {
85 | "collapsed": false
86 | },
87 | "outputs": [],
88 | "source": [
89 | "functools.__file__"
90 | ]
91 | },
92 | {
93 | "cell_type": "markdown",
94 | "metadata": {},
95 | "source": [
96 | "You can also use `from` to import specific names from a package:"
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": null,
102 | "metadata": {
103 | "collapsed": false
104 | },
105 | "outputs": [],
106 | "source": [
107 | "from math import cos"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": null,
113 | "metadata": {
114 | "collapsed": false
115 | },
116 | "outputs": [],
117 | "source": [
118 | "cos(1.0)"
119 | ]
120 | },
121 | {
122 | "cell_type": "markdown",
123 | "metadata": {},
124 | "source": [
125 | "Packages and import statements can also be nested:"
126 | ]
127 | },
128 | {
129 | "cell_type": "code",
130 | "execution_count": null,
131 | "metadata": {
132 | "collapsed": false
133 | },
134 | "outputs": [],
135 | "source": [
136 | "from numpy.random import rand"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": null,
142 | "metadata": {
143 | "collapsed": false
144 | },
145 | "outputs": [],
146 | "source": [
147 | "rand()"
148 | ]
149 | },
150 | {
151 | "cell_type": "markdown",
152 | "metadata": {},
153 | "source": [
154 | "You can also use the `as` keyword to change the name of a package. For example the `numpy` package is usually imported under the name `np`:"
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "execution_count": 3,
160 | "metadata": {
161 | "collapsed": true
162 | },
163 | "outputs": [],
164 | "source": [
165 | "import numpy as np"
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": 4,
171 | "metadata": {
172 | "collapsed": false
173 | },
174 | "outputs": [
175 | {
176 | "data": {
177 | "text/plain": [
178 | "0.0641057288069139"
179 | ]
180 | },
181 | "execution_count": 4,
182 | "metadata": {},
183 | "output_type": "execute_result"
184 | }
185 | ],
186 | "source": [
187 | "np.random.rand()"
188 | ]
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
194 | "You can also use tab completion in import statements!"
195 | ]
196 | },
197 | {
198 | "cell_type": "markdown",
199 | "metadata": {},
200 | "source": [
201 | "Python ships with a large set of packages, which together are often called the **standard library**. Any Python distribution has all of these packages, but you still have to import them before you can use them. A full list, along with documentation of the standard library can be found in the [Python Library Documentation](https://docs.python.org/2/library/index.html). All packages that are not in the standard library are called **external packages**. Examples of external packages that we will use in this course are NumPy, SciPy, Matplotlib and Pandas."
202 | ]
203 | },
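204 | {
205 | "cell_type": "markdown",
206 | "metadata": {},
207 | "source": [
208 | "As a small illustration (a sketch; `not_a_real_package` below is a deliberately fake name), you can use the standard library's `importlib.util.find_spec` to check whether a package is available before importing it:"
209 | ]
210 | },
211 | {
212 | "cell_type": "code",
213 | "execution_count": null,
214 | "metadata": {
215 | "collapsed": false
216 | },
217 | "outputs": [],
218 | "source": [
219 | "import importlib.util\n",
220 | "\n",
221 | "# find_spec returns None when a top-level package cannot be found\n",
222 | "for name in ['math', 'numpy', 'not_a_real_package']:\n",
223 | "    spec = importlib.util.find_spec(name)\n",
224 | "    print(name, 'is available' if spec is not None else 'is NOT available')"
225 | ]
226 | },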
204 | {
205 | "cell_type": "markdown",
206 | "metadata": {},
207 | "source": [
208 | "## Resources"
209 | ]
210 | },
211 | {
212 | "cell_type": "markdown",
213 | "metadata": {},
214 | "source": [
215 | "* [Python Library Documentation](https://docs.python.org/2/library/index.html)"
216 | ]
217 | }
218 | ],
219 | "metadata": {
220 | "kernelspec": {
221 | "display_name": "Python 3",
222 | "language": "python",
223 | "name": "python3"
224 | },
225 | "language_info": {
226 | "codemirror_mode": {
227 | "name": "ipython",
228 | "version": 3
229 | },
230 | "file_extension": ".py",
231 | "mimetype": "text/x-python",
232 | "name": "python",
233 | "nbconvert_exporter": "python",
234 | "pygments_lexer": "ipython3",
235 | "version": "3.4.3"
236 | }
237 | },
238 | "nbformat": 4,
239 | "nbformat_minor": 0
240 | }
241 |
--------------------------------------------------------------------------------
/Content/Programming/StandardLibrary.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Standard Library"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "The Python *Standard Library* refers to the set of Python models and packages that are included with Python, but not imported by default. The documentation for the Standard Library can be found here:\n",
15 | "\n",
16 | "https://docs.python.org/3.4/library/index.html\n",
17 | "\n",
18 | "This doesn't include external modules and packages that are developed, distributed and installed separate from Python itself, such as NumPy, SciPy, Pandas, Matplotlib, etc."
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "While there are many more packages in the Standard Library, here are the ones you will run into most often when using Python for Data Science:\n",
26 | "\n",
27 | "* [re](https://docs.python.org/3.4/library/re.html)\n",
28 | "* [datetime](https://docs.python.org/3.4/library/datetime.html)\n",
29 | "* [math](https://docs.python.org/3.4/library/math.html)\n",
30 | "* [random](https://docs.python.org/3.4/library/random.html)\n",
31 | "* [itertools](https://docs.python.org/3.4/library/itertools.html)\n",
32 | "* [functools](https://docs.python.org/3.4/library/functools.html)\n",
33 | "* [glob](https://docs.python.org/3.4/library/glob.html)\n",
34 | "* [os.path](https://docs.python.org/3.4/library/os.path.html)\n",
35 | "* [pickle](https://docs.python.org/3.4/library/pickle.html)\n",
36 | "* [multiprocessing](https://docs.python.org/3.4/library/multiprocessing.html)\n",
37 | "* [json](https://docs.python.org/3.4/library/json.html)"
38 | ]
39 | }
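40 | ,
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "As a quick taste (a minimal sketch, not tied to any course dataset), here are three of these packages in action:"
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": null,
51 | "metadata": {},
52 | "outputs": [],
53 | "source": [
54 | "import datetime, json, math\n",
55 | "\n",
56 | "# datetime: work with dates and times\n",
57 | "print(datetime.date(1970, 1, 1).isoformat())\n",
58 | "\n",
59 | "# json: serialize basic Python objects to JSON text\n",
60 | "print(json.dumps({'pi': math.pi}))"
61 | ]
62 | }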
40 | ],
41 | "metadata": {
42 | "kernelspec": {
43 | "display_name": "Python 3",
44 | "language": "python",
45 | "name": "python3"
46 | },
47 | "language_info": {
48 | "codemirror_mode": {
49 | "name": "ipython",
50 | "version": 3
51 | },
52 | "file_extension": ".py",
53 | "mimetype": "text/x-python",
54 | "name": "python",
55 | "nbconvert_exporter": "python",
56 | "pygments_lexer": "ipython3",
57 | "version": "3.4.3"
58 | }
59 | },
60 | "nbformat": 4,
61 | "nbformat_minor": 0
62 | }
63 |
--------------------------------------------------------------------------------
/Content/Tidy/01-Introduction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Introduction"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": []
16 | }
17 | ],
18 | "metadata": {
19 | "kernelspec": {
20 | "display_name": "Python 3",
21 | "language": "python",
22 | "name": "python3"
23 | },
24 | "language_info": {
25 | "codemirror_mode": {
26 | "name": "ipython",
27 | "version": 3
28 | },
29 | "file_extension": ".py",
30 | "mimetype": "text/x-python",
31 | "name": "python",
32 | "nbconvert_exporter": "python",
33 | "pygments_lexer": "ipython3",
34 | "version": "3.5.2"
35 | }
36 | },
37 | "nbformat": 4,
38 | "nbformat_minor": 2
39 | }
40 |
--------------------------------------------------------------------------------
/Content/Tidy/02-TidyData.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Tidy Data"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": []
16 | }
17 | ],
18 | "metadata": {
19 | "kernelspec": {
20 | "display_name": "Python 3",
21 | "language": "python",
22 | "name": "python3"
23 | },
24 | "language_info": {
25 | "codemirror_mode": {
26 | "name": "ipython",
27 | "version": 3
28 | },
29 | "file_extension": ".py",
30 | "mimetype": "text/x-python",
31 | "name": "python",
32 | "nbconvert_exporter": "python",
33 | "pygments_lexer": "ipython3",
34 | "version": "3.5.2"
35 | }
36 | },
37 | "nbformat": 4,
38 | "nbformat_minor": 2
39 | }
40 |
--------------------------------------------------------------------------------
/Content/Transform/01-Introduction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Data Transformation\n",
8 | "\n",
9 | "**Learning Objective**: Learn how to transform data during the different stages of the data science process, from tidying a messy dataset to transforming during the the visualization and modeling stages."
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "## Overview\n",
17 | "\n",
18 | "Data transformation is one of most important phases in the data science process. Data transformation is needed:\n",
19 | "\n",
20 | "1. To bring a raw dataset into the tidy format.\n",
21 | "2. During the visualization phase to compute groups, aggregations, statistical summaries, etc.\n",
22 | "3. During the iterative process of *Visualizing* and *Modeling*.\n",
23 | "\n",
24 | "One of the strength of Python and its various data science packages are their power, flexibility and convenience to transform data. Sometimes Python and its standard library will be sufficient. Other times, we will need more powerful packges such as [NumPy](http://www.numpy.org/), [Pandas](http://pandas.pydata.org/) or [SQL](https://en.wikipedia.org/wiki/SQL). In this section of the course, you will learn how to leverage these tools to transform data."
25 | ]
26 | },
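27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "As a small preview (a sketch using hand-made records rather than one of the course datasets), here is a typical transformation -- grouping observations and aggregating each group -- done with nothing but the standard library:"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": null,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "from collections import defaultdict\n",
41 | "\n",
42 | "# A few hypothetical observations: (origin, miles_per_gallon)\n",
43 | "records = [('USA', 18.0), ('USA', 15.0), ('Europe', 26.0), ('Japan', 31.0), ('Japan', 27.0)]\n",
44 | "\n",
45 | "# Group the values by origin, then aggregate each group with a mean\n",
46 | "groups = defaultdict(list)\n",
47 | "for origin, mpg in records:\n",
48 | "    groups[origin].append(mpg)\n",
49 | "{origin: sum(vals) / len(vals) for origin, vals in groups.items()}"
50 | ]
51 | },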
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "## Outline\n",
32 | "\n",
33 | "Here is an outline of this section of the course:\n",
34 | "\n",
35 | "* [Basic Data Transformation](02-BasicDataTransformation.ipynb)\n",
36 | "* [NumPy](03-Numpy.ipynb)\n",
37 | "* [Pandas](04-Pandas.ipynb)\n",
38 | "* [Relational Data](05-RelationalData.ipynb)"
39 | ]
40 | }
41 | ],
42 | "metadata": {
43 | "kernelspec": {
44 | "display_name": "Python 3",
45 | "language": "python",
46 | "name": "python3"
47 | },
48 | "language_info": {
49 | "codemirror_mode": {
50 | "name": "ipython",
51 | "version": 3
52 | },
53 | "file_extension": ".py",
54 | "mimetype": "text/x-python",
55 | "name": "python",
56 | "nbconvert_exporter": "python",
57 | "pygments_lexer": "ipython3",
58 | "version": "3.5.2"
59 | }
60 | },
61 | "nbformat": 4,
62 | "nbformat_minor": 2
63 | }
64 |
--------------------------------------------------------------------------------
/Content/Transform/04-Pandas.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Pandas"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": []
16 | }
17 | ],
18 | "metadata": {
19 | "kernelspec": {
20 | "display_name": "Python 3",
21 | "language": "python",
22 | "name": "python3"
23 | },
24 | "language_info": {
25 | "codemirror_mode": {
26 | "name": "ipython",
27 | "version": 3
28 | },
29 | "file_extension": ".py",
30 | "mimetype": "text/x-python",
31 | "name": "python",
32 | "nbconvert_exporter": "python",
33 | "pygments_lexer": "ipython3",
34 | "version": "3.5.2"
35 | }
36 | },
37 | "nbformat": 4,
38 | "nbformat_minor": 2
39 | }
40 |
--------------------------------------------------------------------------------
/Content/Transform/data/Chinook_Sqlite.sqlite:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calpolydatascience/data301/5ca50db37afb0be83e9b0e294385f7bd60634179/Content/Transform/data/Chinook_Sqlite.sqlite
--------------------------------------------------------------------------------
/Content/Visualize/01-Introduction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Introduction to Data Visualization"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "To do just about anything with data, you need to be able to look at it. In many, if not most cases, that will mean creating a visualization. This section of the course will cover the basics of data visualization. Our approach here tries to follow the Zen of Python:"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 1,
20 | "metadata": {
21 | "collapsed": false
22 | },
23 | "outputs": [
24 | {
25 | "name": "stdout",
26 | "output_type": "stream",
27 | "text": [
28 | "The Zen of Python, by Tim Peters\n",
29 | "\n",
30 | "Beautiful is better than ugly.\n",
31 | "Explicit is better than implicit.\n",
32 | "Simple is better than complex.\n",
33 | "Complex is better than complicated.\n",
34 | "Flat is better than nested.\n",
35 | "Sparse is better than dense.\n",
36 | "Readability counts.\n",
37 | "Special cases aren't special enough to break the rules.\n",
38 | "Although practicality beats purity.\n",
39 | "Errors should never pass silently.\n",
40 | "Unless explicitly silenced.\n",
41 | "In the face of ambiguity, refuse the temptation to guess.\n",
42 | "There should be one-- and preferably only one --obvious way to do it.\n",
43 | "Although that way may not be obvious at first unless you're Dutch.\n",
44 | "Now is better than never.\n",
45 | "Although never is often better than *right* now.\n",
46 | "If the implementation is hard to explain, it's a bad idea.\n",
47 | "If the implementation is easy to explain, it may be a good idea.\n",
48 | "Namespaces are one honking great idea -- let's do more of those!\n"
49 | ]
50 | }
51 | ],
52 | "source": [
53 | "import this"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {},
59 | "source": [
60 | "One one side there is a rich body of research literature that approaches data visualization from a formal and principled perspective through empirical studies. On the other side are numerous Python libraries which all offer slightly different ways of expressing visual concepts in code. These libraries are many (Matplotlib, ggplot, Bokeh, Altair, BQPlot, Seaborn, etc.) and they all cover some subset of data visualization quite well. Herein, I will briefly cover foundational results from data visualization research, and then turn quickly to their application by covering two very different visualization libraries: Altair and Matplotlib. These two Python libraries complement each other well, with Altair focusing narrowly on formal statistical visualization and Matplotlib covering everything else."
61 | ]
62 | },
63 | {
64 | "cell_type": "markdown",
65 | "metadata": {},
66 | "source": [
67 | "## Outline\n",
68 | "\n",
69 | "* [Visualization Grammar](02-VisualizationGrammar.ipynb)\n",
70 | "* [Tidy Data](03-TidyData.ipynb)\n",
71 | "* [Chart, Marks and Encodings](04-ChartMarksEncodings.ipynb)\n",
72 | "* [Transformation](05-Transformation.ipynb)\n",
73 | "* [Seattle Weather](06-SeattleWeather.ipynb)\n",
74 | "* [Configuration](07-Configuration.ipynb)\n",
75 | "* [Layers](08-Layers.ipynb)\n",
76 | "* [Theory and Practice](09-TheoryAndPractice.ipynb)\n",
77 | "* [Matplotlib](10-Matplotlib.ipynb)"
78 | ]
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "metadata": {},
83 | "source": [
84 | "## Resources\n",
85 | "\n",
86 | "The following resources are the main ones for this section:\n",
87 | "\n",
88 | "- [Jeff Heer's CSE512](http://courses.cs.washington.edu/courses/cse512/14wi/index.html)\n",
89 | "- [Altair](https://altair-viz.github.io/)\n",
90 | "- [Matplotlib](http://matplotlib.org/)\n",
91 | "- [The work of Edward Tufte](https://www.edwardtufte.com/tufte/)"
92 | ]
93 | }
94 | ],
95 | "metadata": {
96 | "kernelspec": {
97 | "display_name": "Python 3",
98 | "language": "python",
99 | "name": "python3"
100 | },
101 | "language_info": {
102 | "codemirror_mode": {
103 | "name": "ipython",
104 | "version": 3
105 | },
106 | "file_extension": ".py",
107 | "mimetype": "text/x-python",
108 | "name": "python",
109 | "nbconvert_exporter": "python",
110 | "pygments_lexer": "ipython3",
111 | "version": "3.5.2"
112 | }
113 | },
114 | "nbformat": 4,
115 | "nbformat_minor": 2
116 | }
117 |
--------------------------------------------------------------------------------
/Content/Visualize/03-TidyData.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Tidy Data"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "**Learning Objective:** Understand the basics of *tidy data*."
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "## Overview"
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "In our [Theory of Data](../Introduction/03-TheoryofData.ipynb) section, we covered some basic aspects of data:\n",
29 | "\n",
30 | "- **Data types:** ordinal, nominal, quantitative, date/time, goegraphic\n",
31 | "- **Variables:** a single thing that is measured\n",
32 | "- **Observations:** multiple variables that are measured for a single entity\n",
33 | "- **Dataset:** a set of records\n",
34 | "\n",
35 | "The idea of *tidy data* is this:\n",
36 | "\n",
37 | "1. There are many possible ways one can organize variables and observations into a dataset;\n",
38 | "2. However, not all ways are equal; and\n",
39 | "3. A particular way or organizing a dataset, called *tidy data* is particularly useful in working with data.\n",
40 | "\n",
41 | "The idea of tidy data was first formalized by Hadley Wickham in his [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) paper from 2010. Later in this course we will describe tidy data in more detail. However, it is useful to take a short tidy data detour before diving into data visualization. The reason for our pausing to describe *tidy data* at this point is that our first rule in data visualization is this:\n",
42 | "\n",
43 | "> Start all data visualizations with a tidy dataset.\n",
44 | "\n",
45 | "Thus, if you want to visualize a dataset, your first task will be to put it into a tidy form. For now, we will be working with datasets that are already tidy; the often painful process of tidying a dataset will be covered later."
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "## Defining tidy data\n",
53 | "\n",
54 | "A tidy dataset has the following properties:\n",
55 | "\n",
56 | "1. Each variable forms a column\n",
57 | "2. Each observation forma row\n",
58 | "3. Each type of observational unit forms a table\n",
59 | "\n",
60 | "*Messy data* is any other arrangement of the data."
61 | ]
62 | },
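63 | {
64 | "cell_type": "markdown",
65 | "metadata": {},
66 | "source": [
67 | "To make the definition concrete, here is a small sketch (using a hand-made table and the Pandas `melt` function, which we have not introduced yet) of turning a messy, wide dataset into a tidy one:"
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "execution_count": null,
73 | "metadata": {
74 | "collapsed": false
75 | },
76 | "outputs": [],
77 | "source": [
78 | "import pandas as pd\n",
79 | "\n",
80 | "# Messy: the year variable is hidden in the column headers\n",
81 | "messy = pd.DataFrame({'country': ['Afghanistan', 'Brazil'],\n",
82 | "                      '1999': [745, 37737],\n",
83 | "                      '2000': [2666, 80488]})\n",
84 | "\n",
85 | "# Tidy: each variable (country, year, cases) forms a column,\n",
86 | "# and each observation forms a row\n",
87 | "pd.melt(messy, id_vars='country', var_name='year', value_name='cases')"
88 | ]
89 | },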
63 | {
64 | "cell_type": "markdown",
65 | "metadata": {},
66 | "source": [
67 | "## Example: cars dataset"
68 | ]
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "metadata": {},
73 | "source": [
74 | "The cars dataset, which comes with the Altair visualization library, is an example of a tidy dataset. Let's load that dataset and look at it:"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": 2,
80 | "metadata": {
81 | "collapsed": false
82 | },
83 | "outputs": [],
84 | "source": [
85 | "import altair as alt\n",
86 | "alt.enable_mime_rendering()"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": 3,
92 | "metadata": {
93 | "collapsed": true
94 | },
95 | "outputs": [],
96 | "source": [
97 | "cars = alt.load_dataset('cars')"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 4,
103 | "metadata": {
104 | "collapsed": false
105 | },
106 | "outputs": [
107 | {
108 | "data": {
109 | "text/html": [
110 | "
\n",
111 | "\n",
124 | "
\n",
125 | " \n",
126 | "
\n",
127 | "
\n",
128 | "
Acceleration
\n",
129 | "
Cylinders
\n",
130 | "
Displacement
\n",
131 | "
Horsepower
\n",
132 | "
Miles_per_Gallon
\n",
133 | "
Name
\n",
134 | "
Origin
\n",
135 | "
Weight_in_lbs
\n",
136 | "
Year
\n",
137 | "
\n",
138 | " \n",
139 | " \n",
140 | "
\n",
141 | "
0
\n",
142 | "
12.0
\n",
143 | "
8
\n",
144 | "
307.0
\n",
145 | "
130.0
\n",
146 | "
18.0
\n",
147 | "
chevrolet chevelle malibu
\n",
148 | "
USA
\n",
149 | "
3504
\n",
150 | "
1970-01-01
\n",
151 | "
\n",
152 | "
\n",
153 | "
1
\n",
154 | "
11.5
\n",
155 | "
8
\n",
156 | "
350.0
\n",
157 | "
165.0
\n",
158 | "
15.0
\n",
159 | "
buick skylark 320
\n",
160 | "
USA
\n",
161 | "
3693
\n",
162 | "
1970-01-01
\n",
163 | "
\n",
164 | "
\n",
165 | "
2
\n",
166 | "
11.0
\n",
167 | "
8
\n",
168 | "
318.0
\n",
169 | "
150.0
\n",
170 | "
18.0
\n",
171 | "
plymouth satellite
\n",
172 | "
USA
\n",
173 | "
3436
\n",
174 | "
1970-01-01
\n",
175 | "
\n",
176 | "
\n",
177 | "
3
\n",
178 | "
12.0
\n",
179 | "
8
\n",
180 | "
304.0
\n",
181 | "
150.0
\n",
182 | "
16.0
\n",
183 | "
amc rebel sst
\n",
184 | "
USA
\n",
185 | "
3433
\n",
186 | "
1970-01-01
\n",
187 | "
\n",
188 | "
\n",
189 | "
4
\n",
190 | "
10.5
\n",
191 | "
8
\n",
192 | "
302.0
\n",
193 | "
140.0
\n",
194 | "
17.0
\n",
195 | "
ford torino
\n",
196 | "
USA
\n",
197 | "
3449
\n",
198 | "
1970-01-01
\n",
199 | "
\n",
200 | " \n",
201 | "
\n",
202 | "
"
203 | ],
204 | "text/plain": [
205 | " Acceleration Cylinders Displacement Horsepower Miles_per_Gallon \\\n",
206 | "0 12.0 8 307.0 130.0 18.0 \n",
207 | "1 11.5 8 350.0 165.0 15.0 \n",
208 | "2 11.0 8 318.0 150.0 18.0 \n",
209 | "3 12.0 8 304.0 150.0 16.0 \n",
210 | "4 10.5 8 302.0 140.0 17.0 \n",
211 | "\n",
212 | " Name Origin Weight_in_lbs Year \n",
213 | "0 chevrolet chevelle malibu USA 3504 1970-01-01 \n",
214 | "1 buick skylark 320 USA 3693 1970-01-01 \n",
215 | "2 plymouth satellite USA 3436 1970-01-01 \n",
216 | "3 amc rebel sst USA 3433 1970-01-01 \n",
217 | "4 ford torino USA 3449 1970-01-01 "
218 | ]
219 | },
220 | "execution_count": 4,
221 | "metadata": {},
222 | "output_type": "execute_result"
223 | }
224 | ],
225 | "source": [
226 | "cars.head()"
227 | ]
228 | },
229 | {
230 | "cell_type": "markdown",
231 | "metadata": {},
232 | "source": [
233 | "The cars dataset above is stored as a table-like object called a `DataFrame`. In Python, the [Pandas](http://pandas.pydata.org/) library provides this data structure:"
234 | ]
235 | },
236 | {
237 | "cell_type": "code",
238 | "execution_count": 5,
239 | "metadata": {
240 | "collapsed": false
241 | },
242 | "outputs": [
243 | {
244 | "data": {
245 | "text/plain": [
246 | "pandas.core.frame.DataFrame"
247 | ]
248 | },
249 | "execution_count": 5,
250 | "metadata": {},
251 | "output_type": "execute_result"
252 | }
253 | ],
254 | "source": [
255 | "type(cars)"
256 | ]
257 | },
258 | {
259 | "cell_type": "markdown",
260 | "metadata": {},
261 | "source": [
262 | "We will learn more about Pandas and `DataFrame`s later in this course. For now, we cover a few of their commonly used attributes and methods. The `.columns` attribute returns a one dimensional sequence of the column names. These are the *variables* in the dataset:"
263 | ]
264 | },
265 | {
266 | "cell_type": "code",
267 | "execution_count": 6,
268 | "metadata": {
269 | "collapsed": false
270 | },
271 | "outputs": [
272 | {
273 | "data": {
274 | "text/plain": [
275 | "Index(['Acceleration', 'Cylinders', 'Displacement', 'Horsepower',\n",
276 | " 'Miles_per_Gallon', 'Name', 'Origin', 'Weight_in_lbs', 'Year'],\n",
277 | " dtype='object')"
278 | ]
279 | },
280 | "execution_count": 6,
281 | "metadata": {},
282 | "output_type": "execute_result"
283 | }
284 | ],
285 | "source": [
286 | "cars.columns"
287 | ]
288 | },
289 | {
290 | "cell_type": "markdown",
291 | "metadata": {},
292 | "source": [
293 | "The rows (observations) are labeled by another one dimensional sequence called the index (`.index`):"
294 | ]
295 | },
296 | {
297 | "cell_type": "code",
298 | "execution_count": 7,
299 | "metadata": {
300 | "collapsed": false
301 | },
302 | "outputs": [
303 | {
304 | "data": {
305 | "text/plain": [
306 | "Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,\n",
307 | " ...\n",
308 | " 396, 397, 398, 399, 400, 401, 402, 403, 404, 405],\n",
309 | " dtype='int64', length=406)"
310 | ]
311 | },
312 | "execution_count": 7,
313 | "metadata": {},
314 | "output_type": "execute_result"
315 | }
316 | ],
317 | "source": [
318 | "cars.index"
319 | ]
320 | },
321 | {
322 | "cell_type": "markdown",
323 | "metadata": {},
324 | "source": [
325 | "The length of the dataset is the number of rows:"
326 | ]
327 | },
328 | {
329 | "cell_type": "code",
330 | "execution_count": 8,
331 | "metadata": {
332 | "collapsed": false
333 | },
334 | "outputs": [
335 | {
336 | "data": {
337 | "text/plain": [
338 | "406"
339 | ]
340 | },
341 | "execution_count": 8,
342 | "metadata": {},
343 | "output_type": "execute_result"
344 | }
345 | ],
346 | "source": [
347 | "len(cars)"
348 | ]
349 | },
350 | {
351 | "cell_type": "markdown",
352 | "metadata": {},
353 | "source": [
354 | "Lastly, the `DataFrame` acts like a specialized dictionary, where the keys are the column names and the values are the columns:"
355 | ]
356 | },
357 | {
358 | "cell_type": "code",
359 | "execution_count": 9,
360 | "metadata": {
361 | "collapsed": false
362 | },
363 | "outputs": [
364 | {
365 | "data": {
366 | "text/plain": [
367 | "0 12.0\n",
368 | "1 11.5\n",
369 | "2 11.0\n",
370 | "3 12.0\n",
371 | "4 10.5\n",
372 | "Name: Acceleration, dtype: float64"
373 | ]
374 | },
375 | "execution_count": 9,
376 | "metadata": {},
377 | "output_type": "execute_result"
378 | }
379 | ],
380 | "source": [
381 | "cars['Acceleration'].head()"
382 | ]
383 | },
384 | {
385 | "cell_type": "markdown",
386 | "metadata": {},
387 | "source": [
388 | "We will be using this cars dataset to cover the basics of data visualization with Altair."
389 | ]
390 | }
391 | ],
392 | "metadata": {
393 | "kernelspec": {
394 | "display_name": "Python 3",
395 | "language": "python",
396 | "name": "python3"
397 | },
398 | "language_info": {
399 | "codemirror_mode": {
400 | "name": "ipython",
401 | "version": 3
402 | },
403 | "file_extension": ".py",
404 | "mimetype": "text/x-python",
405 | "name": "python",
406 | "nbconvert_exporter": "python",
407 | "pygments_lexer": "ipython3",
408 | "version": "3.6.3"
409 | }
410 | },
411 | "nbformat": 4,
412 | "nbformat_minor": 2
413 | }
414 |
--------------------------------------------------------------------------------
/Content/Visualize/images/column_syntax1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calpolydatascience/data301/5ca50db37afb0be83e9b0e294385f7bd60634179/Content/Visualize/images/column_syntax1.png
--------------------------------------------------------------------------------
/Content/Visualize/images/column_syntax2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calpolydatascience/data301/5ca50db37afb0be83e9b0e294385f7bd60634179/Content/Visualize/images/column_syntax2.png
--------------------------------------------------------------------------------
/Content/Visualize/images/encodings1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calpolydatascience/data301/5ca50db37afb0be83e9b0e294385f7bd60634179/Content/Visualize/images/encodings1.png
--------------------------------------------------------------------------------
/Content/Visualize/images/encodings2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calpolydatascience/data301/5ca50db37afb0be83e9b0e294385f7bd60634179/Content/Visualize/images/encodings2.png
--------------------------------------------------------------------------------
/Content/Visualize/images/mackinlay1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calpolydatascience/data301/5ca50db37afb0be83e9b0e294385f7bd60634179/Content/Visualize/images/mackinlay1.png
--------------------------------------------------------------------------------
/Content/Visualize/images/mackinlay2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calpolydatascience/data301/5ca50db37afb0be83e9b0e294385f7bd60634179/Content/Visualize/images/mackinlay2.png
--------------------------------------------------------------------------------
/Content/Visualize/images/marks.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calpolydatascience/data301/5ca50db37afb0be83e9b0e294385f7bd60634179/Content/Visualize/images/marks.png
--------------------------------------------------------------------------------
/Content/Visualize/images/marks_encoding.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calpolydatascience/data301/5ca50db37afb0be83e9b0e294385f7bd60634179/Content/Visualize/images/marks_encoding.png
--------------------------------------------------------------------------------
/Content/Visualize/images/measles_wsj.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calpolydatascience/data301/5ca50db37afb0be83e9b0e294385f7bd60634179/Content/Visualize/images/measles_wsj.png
--------------------------------------------------------------------------------
/Content/Visualize/images/social_assistance_538.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calpolydatascience/data301/5ca50db37afb0be83e9b0e294385f7bd60634179/Content/Visualize/images/social_assistance_538.png
--------------------------------------------------------------------------------
/Content/Visualize/images/viz_grammar.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calpolydatascience/data301/5ca50db37afb0be83e9b0e294385f7bd60634179/Content/Visualize/images/viz_grammar.png
--------------------------------------------------------------------------------
/Content/Workflow/01-Introduction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Introduction"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": []
16 | }
17 | ],
18 | "metadata": {
19 | "kernelspec": {
20 | "display_name": "Python 3",
21 | "language": "python",
22 | "name": "python3"
23 | },
24 | "language_info": {
25 | "codemirror_mode": {
26 | "name": "ipython",
27 | "version": 3
28 | },
29 | "file_extension": ".py",
30 | "mimetype": "text/x-python",
31 | "name": "python",
32 | "nbconvert_exporter": "python",
33 | "pygments_lexer": "ipython3",
34 | "version": "3.5.2"
35 | }
36 | },
37 | "nbformat": 4,
38 | "nbformat_minor": 2
39 | }
40 |
--------------------------------------------------------------------------------
/Content/Workflow/02-TheJupyterNotebook.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "slideshow": {
7 | "slide_type": "slide"
8 | }
9 | },
10 | "source": [
11 | "# What is the Jupyter Notebook?"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "## Introduction"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "The Jupyter Notebook is an **interactive computing environment** that enables users to author notebook documents that include: \n",
26 | "- Live code\n",
27 | "- Interactive widgets\n",
28 | "- Plots\n",
29 | "- Narrative text\n",
30 | "- Equations\n",
31 | "- Images\n",
32 | "- Video\n",
33 | "\n",
34 | "These documents provide a **complete and self-contained record of a computation** that can be converted to various formats and shared with others using email, [Dropbox](http://dropbox.com), version control systems (like git/[GitHub](http://github.com)) or [nbviewer.jupyter.org](http://nbviewer.jupyter.org)."
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {
40 | "slideshow": {
41 | "slide_type": "slide"
42 | }
43 | },
44 | "source": [
45 | "### Components"
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "The Jupyter Notebook combines three components:\n",
53 | "\n",
54 | "* **The notebook web application**: An interactive web application for writing and running code interactively and authoring notebook documents.\n",
55 | "* **Kernels**: Separate processes started by the notebook web application that runs users' code in a given language and returns output back to the notebook web application. The kernel also handles things like computations for interactive widgets, tab completion and introspection. \n",
56 | "* **Notebook documents**: Self-contained documents that contain a representation of all content visible in the notebook web application, including inputs and outputs of the computations, narrative\n",
57 | "text, equations, images, and rich media representations of objects. Each notebook document has its own kernel."
58 | ]
59 | },
60 | {
61 | "cell_type": "markdown",
62 | "metadata": {
63 | "slideshow": {
64 | "slide_type": "slide"
65 | }
66 | },
67 | "source": [
68 | "## Notebook web application"
69 | ]
70 | },
71 | {
72 | "cell_type": "markdown",
73 | "metadata": {},
74 | "source": [
75 | "The notebook web application enables users to:\n",
76 | "\n",
77 | "* **Edit code in the browser**, with automatic syntax highlighting, indentation, and tab completion/introspection.\n",
78 | "* **Run code from the browser**, with the results of computations attached to the code which generated them.\n",
79 | "* See the results of computations with **rich media representations**, such as HTML, LaTeX, PNG, SVG, PDF, etc.\n",
80 | "* Create and use **interactive JavaScript widgets**, which bind interactive user interface controls and visualizations to reactive kernel side computations.\n",
81 | "* Author **narrative text** using the [Markdown](https://daringfireball.net/projects/markdown/) markup language.\n",
82 | "* Build **hierarchical documents** that are organized into sections with different levels of headings.\n",
83 | "* Include mathematical equations using **LaTeX syntax in Markdown**, which are rendered in-browser by [MathJax](http://www.mathjax.org/)."
84 | ]
85 | },
86 | {
87 | "cell_type": "markdown",
88 | "metadata": {
89 | "slideshow": {
90 | "slide_type": "slide"
91 | }
92 | },
93 | "source": [
94 | "## Kernels"
95 | ]
96 | },
97 | {
98 | "cell_type": "markdown",
99 | "metadata": {},
100 | "source": [
101 | "Through Jupyter's kernel and messaging architecture, the Notebook allows code to be run in a range of different programming languages. For each notebook document that a user opens, the web application starts a kernel that runs the code for that notebook. Each kernel is capable of running code in a single programming language and there are kernels available in the following languages:\n",
102 | "\n",
103 | "* Python(https://github.com/ipython/ipython)\n",
104 | "* Julia (https://github.com/JuliaLang/IJulia.jl)\n",
105 | "* R (https://github.com/takluyver/IRkernel)\n",
106 | "* Ruby (https://github.com/minrk/iruby)\n",
107 | "* Haskell (https://github.com/gibiansky/IHaskell)\n",
108 | "* Scala (https://github.com/Bridgewater/scala-notebook)\n",
109 | "* node.js (https://gist.github.com/Carreau/4279371)\n",
110 | "* Go (https://github.com/takluyver/igo)\n",
111 | "\n",
112 | "The default kernel runs Python code. The notebook provides a simple way for users to pick which of these kernels is used for a given notebook. \n",
113 | "\n",
114 | "Each of these kernels communicate with the notebook web application and web browser using a JSON over ZeroMQ/WebSockets message protocol that is described [here](http://ipython.org/ipython-doc/dev/development/messaging.html). Most users don't need to know about these details, but it helps to understand that \"kernels run code.\""
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {
120 | "slideshow": {
121 | "slide_type": "slide"
122 | }
123 | },
124 | "source": [
125 | "## Notebook documents"
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": [
132 | "Notebook documents contain the **inputs and outputs** of an interactive session as well as **narrative text** that accompanies the code but is not meant for execution. **Rich output** generated by running code, including HTML, images, video, and plots, is embeddeed in the notebook, which makes it a complete and self-contained record of a computation. "
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "metadata": {},
138 | "source": [
139 | "When you run the notebook web application on your computer, notebook documents are just **files on your local filesystem with a `.ipynb` extension**. This allows you to use familiar workflows for organizing your notebooks into folders and sharing them with others."
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "Notebooks consist of a **linear sequence of cells**. There are four basic cell types:\n",
147 | "\n",
148 | "* **Code cells:** Input and output of live code that is run in the kernel\n",
149 | "* **Markdown cells:** Narrative text with embedded LaTeX equations\n",
150 | "* **Heading cells:** 6 levels of hierarchical organization and formatting\n",
151 | "* **Raw cells:** Unformatted text that is included, without modification, when notebooks are converted to different formats using nbconvert\n",
152 | "\n",
153 | "Internally, notebook documents are **[JSON](http://en.wikipedia.org/wiki/JSON) data** with **binary values [base64](http://en.wikipedia.org/wiki/Base64)** encoded. This allows them to be **read and manipulated programmatically** by any programming language. Because JSON is a text format, notebook documents are version control friendly.\n",
154 | "\n",
155 | "**Notebooks can be exported** to different static formats including HTML, reStructeredText, LaTeX, PDF, and slide shows ([reveal.js](http://lab.hakim.se/reveal-js/#/)) using Jupyter's `nbconvert` utility.\n",
156 | "\n",
157 | "Furthermore, any notebook document available from a **public URL on or GitHub can be shared** via [nbviewer](http://nbviewer.ipython.org). This service loads the notebook document from the URL and renders it as a static web page. The resulting web page may thus be shared with others **without their needing to install the Jupyter Notebook**."
158 | ]
159 | }
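160 | ,
161 | {
162 | "cell_type": "markdown",
163 | "metadata": {},
164 | "source": [
165 | "Because notebook documents are plain JSON, a few lines of code are enough to inspect one programmatically. Here is a minimal sketch (assuming a notebook file named `Untitled.ipynb` exists in the current directory):"
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": null,
171 | "metadata": {},
172 | "outputs": [],
173 | "source": [
174 | "import json\n",
175 | "\n",
176 | "# Load a notebook document and look at its top-level structure\n",
177 | "with open('Untitled.ipynb') as f:\n",
178 | "    nb = json.load(f)\n",
179 | "\n",
180 | "print('nbformat:', nb['nbformat'], nb['nbformat_minor'])\n",
181 | "print('cell types:', [cell['cell_type'] for cell in nb['cells']])"
182 | ]
183 | }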
160 | ],
161 | "metadata": {
162 | "kernelspec": {
163 | "display_name": "Python 3",
164 | "language": "python",
165 | "name": "python3"
166 | },
167 | "language_info": {
168 | "codemirror_mode": {
169 | "name": "ipython",
170 | "version": 3
171 | },
172 | "file_extension": ".py",
173 | "mimetype": "text/x-python",
174 | "name": "python",
175 | "nbconvert_exporter": "python",
176 | "pygments_lexer": "ipython3",
177 | "version": "3.4.0"
178 | }
179 | },
180 | "nbformat": 4,
181 | "nbformat_minor": 0
182 | }
183 |
--------------------------------------------------------------------------------
/Content/Workflow/03-NotebookBasics.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Notebook Basics"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Running the Notebook Server"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "The Jupyter notebook server is a custom web server that runs the notebook web application. Most of the time, users run the notebook server on their local computer using the command line interface."
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "
\n",
29 | "This section is only relevant if you are *not* using JupyterHub or another pre-deployed version of the Jupyter Notebook.\n",
30 | "
"
31 | ]
32 | },
33 | {
34 | "cell_type": "markdown",
35 | "metadata": {},
36 | "source": [
37 | "### Starting the notebook server using the command line"
38 | ]
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "metadata": {},
43 | "source": [
44 | "You can start the notebook server from the command line (Terminal on Mac/Linux, CMD prompt on Windows) by running the following command: \n",
45 | "\n",
46 | " jupyter notebook\n",
47 | "\n",
48 | "This will print some information about the notebook server in your terminal, including the URL of the web application (by default, `http://127.0.0.1:8888`). It will then open your default web browser to this URL.\n",
49 | "\n",
50 | "When the notebook opens, you will see the **notebook dashboard**, which will show a list of the notebooks, files, and subdirectories in the directory where the notebook server was started (as seen in the next section, below). Most of the time, you will want to start a notebook server in the highest directory in your filesystem where notebooks can be found. Often this will be your home directory."
51 | ]
52 | },
53 | {
54 | "cell_type": "markdown",
55 | "metadata": {},
56 | "source": [
57 | "### Additional options"
58 | ]
59 | },
60 | {
61 | "cell_type": "markdown",
62 | "metadata": {},
63 | "source": [
64 | "By default, the notebook server starts on port 8888. If port 8888 is unavailable, the notebook server searchs the next available port.\n",
65 | "\n",
66 | "You can also specify the port manually:\n",
67 | "\n",
68 | " jupyter notebook --port 9999\n",
69 | "\n",
70 | "Or start notebook server without opening a web browser.\n",
71 | "\n",
72 | " jupyter notebook --no-browser\n",
73 | "\n",
74 | "The notebook server has a number of other command line arguments that can be displayed with the `--help` flag: \n",
75 | "\n",
76 | " jupyter notebook --help"
77 | ]
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "metadata": {},
82 | "source": [
83 | "## The Notebook dashboard"
84 | ]
85 | },
86 | {
87 | "cell_type": "markdown",
88 | "metadata": {},
89 | "source": [
90 | "When you first start the notebook server, your browser will open to the notebook dashboard. The dashboard serves as a home page for the notebook. Its main purpose is to display the notebooks and files in the current directory. For example, here is a screenshot of the dashboard page for the `examples` directory in the Jupyter repository:\n",
91 | "\n",
92 | ""
93 | ]
94 | },
95 | {
96 | "cell_type": "markdown",
97 | "metadata": {},
98 | "source": [
99 | "The top of the notebook list displays clickable breadcrumbs of the current directory. By clicking on these breadcrumbs or on sub-directories in the notebook list, you can navigate your file system.\n",
100 | "\n",
101 | "To create a new notebook, click on the \"New\" button at the top of the list and select a kernel from the dropdown (as seen below). Which kernels are listed depend on what's installed on the server. Some of the kernels in the screenshot below may not exist as an option to you.\n",
102 | "\n",
103 | ""
104 | ]
105 | },
106 | {
107 | "cell_type": "markdown",
108 | "metadata": {},
109 | "source": [
110 | "Notebooks and files can be uploaded to the current directory by dragging a notebook file onto the notebook list or by the \"click here\" text above the list.\n",
111 | "\n",
112 | "The notebook list shows green \"Running\" text and a green notebook icon next to running notebooks (as seen below). Notebooks remain running until you explicitly shut them down; closing the notebook's page is not sufficient.\n",
113 | "\n",
114 | ""
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "To shutdown, delete, duplicate, or rename a notebook check the checkbox next to it and an array of controls will appear at the top of the notebook list (as seen below). You can also use the same operations on directories and files when applicable.\n",
122 | "\n",
123 | ""
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {},
129 | "source": [
130 | "To see all of your running notebooks along with their directories, click on the \"Running\" tab:\n",
131 | "\n",
132 | "\n",
133 | "\n",
134 | "This view provides a convenient way to track notebooks that you start as you navigate the file system in a long running notebook server."
135 | ]
136 | },
137 | {
138 | "cell_type": "markdown",
139 | "metadata": {},
140 | "source": [
141 | "## Overview of the Notebook UI"
142 | ]
143 | },
144 | {
145 | "cell_type": "markdown",
146 | "metadata": {},
147 | "source": [
148 | "If you create a new notebook or open an existing one, you will be taken to the notebook user interface (UI). This UI allows you to run code and author notebook documents interactively. The notebook UI has the following main areas:\n",
149 | "\n",
150 | "* Menu\n",
151 | "* Toolbar\n",
152 | "* Notebook area and cells\n",
153 | "\n",
154 | "The notebook has an interactive tour of these elements that can be started in the \"Help:User Interface Tour\" menu item."
155 | ]
156 | },
157 | {
158 | "cell_type": "markdown",
159 | "metadata": {},
160 | "source": [
161 | "## Modal editor"
162 | ]
163 | },
164 | {
165 | "cell_type": "markdown",
166 | "metadata": {},
167 | "source": [
168 | "Starting with IPython 2.0, the Jupyter Notebook has a modal user interface. This means that the keyboard does different things depending on which mode the Notebook is in. There are two modes: edit mode and command mode."
169 | ]
170 | },
171 | {
172 | "cell_type": "markdown",
173 | "metadata": {},
174 | "source": [
175 | "### Edit mode"
176 | ]
177 | },
178 | {
179 | "cell_type": "markdown",
180 | "metadata": {},
181 | "source": [
182 | "Edit mode is indicated by a green cell border and a prompt showing in the editor area:\n",
183 | "\n",
184 | "\n",
185 | "\n",
186 | "When a cell is in edit mode, you can type into the cell, like a normal text editor."
187 | ]
188 | },
189 | {
190 | "cell_type": "markdown",
191 | "metadata": {},
192 | "source": [
193 | "
\n",
194 | "Enter edit mode by pressing `Enter` or using the mouse to click on a cell's editor area.\n",
195 | "
"
196 | ]
197 | },
198 | {
199 | "cell_type": "markdown",
200 | "metadata": {},
201 | "source": [
202 | "### Command mode"
203 | ]
204 | },
205 | {
206 | "cell_type": "markdown",
207 | "metadata": {},
208 | "source": [
209 | "Command mode is indicated by a grey cell border with a blue left margin:\n",
210 | "\n",
211 | "\n",
212 | "\n",
213 | "When you are in command mode, you are able to edit the notebook as a whole, but not type into individual cells. Most importantly, in command mode, the keyboard is mapped to a set of shortcuts that let you perform notebook and cell actions efficiently. For example, if you are in command mode and you press `c`, you will copy the current cell - no modifier is needed."
214 | ]
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 | "
\n",
221 | "Don't try to type into a cell in command mode; unexpected things will happen!\n",
222 | "