├── .gitignore
├── LICENSE
├── README.md
├── data_science_basics
│   └── BasicPythonSupport
│       ├── .ipynb_checkpoints
│       │   └── Python_basics-checkpoint.ipynb
│       ├── Boxplot.png
│       ├── Python_basics.ipynb
│       ├── titanic.csv
│       ├── uforeports.csv
│       └── uforeports_excel.xls
├── lecture_10
│   ├── .ipynb_checkpoints
│   │   └── Lecture10-checkpoint.ipynb
│   ├── Lecture10.ipynb
│   ├── adspy_shared_utilities.py
│   ├── fruit_data_with_colors.txt
│   ├── wine_data.csv
│   └── winequality-white.csv
├── lecture_slides
│   ├── FIT1043_IntroDS_Lecture 2-2018S2.pdf
│   ├── FIT1043_IntroDS_Lecture 3-2018S2.pdf
│   ├── FIT1043_IntroDS_Lecture 4-2018S2.pdf
│   ├── FIT1043_introDS_Lecture 1-2018S2.pdf
│   ├── FIT1043_introDS_Lecture 10-2018S2(2).pdf
│   ├── FIT1043_introDS_Lecture 10-2018S2.pdf
│   ├── FIT1043_introDS_Lecture 11-2018S2.pdf
│   ├── FIT1043_introDS_Lecture 12-2018S2.pdf
│   ├── FIT1043_introDS_Lecture 5-2018S2.pdf
│   ├── FIT1043_introDS_Lecture 6-2018S2.pdf
│   ├── FIT1043_introDS_Lecture 7-2018S2.pdf
│   ├── FIT1043_introDS_Lecture 8-2018S2.pdf
│   ├── FIT1043_introDS_Lecture 9-2018S2.pdf
│   ├── FIT1043_introDS_M6_L11-Unit review.pdf
│   ├── Intro_to_Predictive_Models.pdf
│   ├── Introduction-to-Data-Science.pdf
│   ├── Introduction_to_Python_for_Data_Science.pdf
│   ├── Introduction_to_Python_for_Data_Science_PART2.pdf
│   ├── Introduction_to_R_for_Data_Science.pdf
│   ├── Introduction_to_Shell_Commands_for_Data_Science.pdf
│   ├── PMML in R.png
│   ├── PandasPythonForDataScience.pdf
│   ├── Semi Structured Data.pdf
│   └── Ten Examples of Open Source Projects.pdf
├── motion-chart_activity
│   └── MotionChart Activity
│       ├── .ipynb_checkpoints
│       │   └── MotionChart-checkpoint.ipynb
│       ├── MotionChart.ipynb
│       ├── data.csv
│       ├── mc_temp.html
│       ├── motionchart.py
│       ├── motionchart
│       │   ├── __init__.py
│       │   ├── __pycache__
│       │   │   ├── __init__.cpython-36.pyc
│       │   │   └── motionchart.cpython-36.pyc
│       │   └── motionchart.py
│       └── setting.png
├── notes
│   ├── commands.pdf
│   └── complete_notes.pdf
└── past_exam
    ├── FIT1043_Sample_2016.pdf
    ├── FIT1043_Sample_Exam.pdf
    ├── FIT1043_Sample_Exam_Sols.pdf
    └── answers.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | # Mac files
2 | .DS_Store
3 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 Jun Qing Lim
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # fit1043-introduction-to-data-science
2 | FIT1043 Introduction to Data Science notes, batch of Semester 2, 2018.
3 |
4 | ## Notes
5 | The [notes](notes) were written by me. *Please watch out for any inaccuracies on my part.*
6 |
7 | ## Review of FIT1043
8 | I took this unit during my first semester. Much of what was taught in lectures was about scraping data from public online dataset sites like [Kaggle](https://www.kaggle.com/datasets). Speaking as someone with absolutely no background in the data field, I found this unit challenging, because much of the mathematics behind data modelling requires advanced skills in areas such as [Regression Analysis](https://en.wikipedia.org/wiki/Regression_analysis) and [Linear Regression](https://en.wikipedia.org/wiki/Linear_regression).
"
192 | ],
193 | "text/plain": [
194 | " Countries Time GDP Life Expectancy Population Region\n",
195 | "0 USA 1990 31744 76 250 North America\n",
196 | "1 USA 2000 38850 77 285 North America\n",
197 | "2 China 1990 1466 68 1440 Asia\n",
198 | "3 China 2000 2806 72 1270 Asia\n",
199 | "4 Japan 1990 25870 79 124 Asia\n",
200 | "5 Japan 2000 28569 81 127 Asia\n",
201 | "6 Brazil 1990 7247 66 151 South America\n",
202 | "7 Brazil 2000 8184 71 178 South America"
203 | ]
204 | },
205 | "execution_count": 31,
206 | "metadata": {},
207 | "output_type": "execute_result"
208 | }
209 | ],
210 | "source": [
211 | "# create a pandas dataframe with the sample data values\n",
212 | "sampleData = pd.DataFrame([\n",
213 | "['USA', '1990', '31744', '76', '250', 'North America'],\n",
214 | "['USA', '2000', '38850', '77', '285', 'North America'],\n",
215 | "['China', '1990', '1466', '68', '1440', 'Asia'],\n",
216 | "['China', '2000', '2806', '72', '1270', 'Asia'],\n",
217 | "['Japan', '1990', '25870', '79', '124', 'Asia'],\n",
218 | "['Japan', '2000', '28569', '81', '127', 'Asia'],\n",
219 | "['Brazil', '1990', '7247', '66', '151', 'South America'],\n",
220 | "['Brazil', '2000', '8184', '71', '178', 'South America']])\n",
221 | "\n",
222 | "sampleData.columns = ['Countries','Time', 'GDP', 'Life Expectancy', 'Population', 'Region']\n",
223 | "sampleData # we can then have a look at the dataframe"
224 | ]
225 | },
226 | {
227 | "cell_type": "markdown",
228 | "metadata": {},
229 | "source": [
230 | "### Reading data from csv file\n",
231 | "For a larger dataset, entering the values manually is not practical. \n",
232 | "Now assume we have the sample data in CSV format (data.csv; make sure that the data file is in the same directory as this Jupyter notebook). \n",
233 | "The following code reads the data directly from data.csv into a pandas dataframe. \n",
234 | "We do not even need to set the column names, because the header row is automatically recognized as the column names."
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": 32,
240 | "metadata": {},
241 | "outputs": [
242 | {
243 | "data": {
244 | "text/html": [
347 | ],
348 | "text/plain": [
349 | " Time GDP Life Expectancy Countries Population Region\n",
350 | "0 1990 31744 76 USA 250 North America\n",
351 | "1 2000 38850 77 USA 285 North America\n",
352 | "2 1990 11466 68 China 1440 Asia\n",
353 | "3 2000 12806 72 China 1270 Asia\n",
354 | "4 1990 25870 79 Japan 124 Asia\n",
355 | "5 2000 28569 81 Japan 127 Asia\n",
356 | "6 1990 7247 66 Brazil 151 South America\n",
357 | "7 2000 8184 71 Brazil 178 South America"
358 | ]
359 | },
360 | "execution_count": 32,
361 | "metadata": {},
362 | "output_type": "execute_result"
363 | }
364 | ],
365 | "source": [
366 | "# read in the sample data from data.csv\n",
367 | "sampleData = pd.read_csv('data.csv')\n",
368 | "# have a look at the dataframe; it should hold the same records as before, with columns in the CSV header's order\n",
369 | "sampleData"
370 | ]
371 | },
372 | {
373 | "cell_type": "markdown",
374 | "metadata": {},
375 | "source": [
376 | "## Step 4: Creating motion chart\n",
377 | "Next, we pass our data to MotionChart to animate it and show how the values change over time."
378 | ]
379 | },
380 | {
381 | "cell_type": "markdown",
382 | "metadata": {},
383 | "source": [
384 | "The following html code block is just to make sure that you will see the entire motion chart nicely in the output cell."
385 | ]
386 | },
387 | {
388 | "cell_type": "code",
389 | "execution_count": 33,
390 | "metadata": {},
391 | "outputs": [
392 | {
393 | "data": {
394 | "text/html": [
395 | ""
405 | ],
406 | "text/plain": [
407 | ""
408 | ]
409 | },
410 | "metadata": {},
411 | "output_type": "display_data"
412 | }
413 | ],
414 | "source": [
415 | "%%html\n",
416 | ""
426 | ]
427 | },
428 | {
429 | "cell_type": "code",
430 | "execution_count": 34,
431 | "metadata": {
432 | "scrolled": true
433 | },
434 | "outputs": [
435 | {
436 | "data": {
437 | "text/html": [
438 | "\n",
439 | " \n",
446 | " "
447 | ],
448 | "text/plain": [
449 | ""
450 | ]
451 | },
452 | "metadata": {},
453 | "output_type": "display_data"
454 | }
455 | ],
456 | "source": [
457 | "# now generate the motionchart and show it in the notebook\n",
458 | "mChart = MotionChart(df = sampleData)\n",
459 | "mChart.to_notebook()"
460 | ]
461 | },
462 | {
463 | "cell_type": "markdown",
464 | "metadata": {},
465 | "source": [
466 | "At first, we get a motion chart that doesn't look quite right. \n",
467 | "Some default settings may not be what we want, e.g., which variables are used for the x and y axes, which one is shown as size, etc.\n",
468 | "Therefore, we have to reset the options and parameters. \n",
469 | "You can do this in code when you create the motion chart, or change the settings in the active motion chart panel."
470 | ]
471 | },
472 | {
473 | "cell_type": "code",
474 | "execution_count": 35,
475 | "metadata": {},
476 | "outputs": [
477 | {
478 | "data": {
479 | "text/html": [
480 | "\n",
481 | " \n",
488 | " "
489 | ],
490 | "text/plain": [
491 | ""
492 | ]
493 | },
494 | "metadata": {},
495 | "output_type": "display_data"
496 | }
497 | ],
498 | "source": [
499 | "mChart = MotionChart(df = sampleData, key='Time', x='GDP', y='Life Expectancy', xscale='linear', yscale='linear',\n",
500 | " size='Population', color='Region', category='Countries')\n",
501 | "\n",
502 | "mChart.to_notebook()"
503 | ]
504 | },
505 | {
506 | "cell_type": "markdown",
507 | "metadata": {},
508 | "source": [
509 | "To change the settings directly in the motion chart, move your mouse towards the top right of the chart, where you'll see a long blue rectangle. Hover over the rectangle to show all the options. \n",
510 | "In this example, we choose the following settings:\n",
511 | "- Key: Time\n",
512 | "- X-Axis: GDP\n",
513 | "- Y-Axis: Life Expectancy\n",
514 | "- Size: Population\n",
515 | "- Color: Region\n",
516 | "- Category: Countries\n",
517 | "\n",
518 | "The correct setting would look like this:\n",
519 | "\n",
520 | "\n",
521 | "Once you've reset the options, you are free to play around with the motion chart. \n",
522 | "\n",
523 | "In fact, Python Motion Charts offers a lot of controls: two buttons on top and a settings panel on the right side. \n",
524 | "You can change the chart settings and generate any visualization you like. "
525 | ]
526 | },
527 | {
528 | "cell_type": "markdown",
529 | "metadata": {},
530 | "source": [
531 | "#### Question:\n",
532 | "Looking at the data, we've only got values for 1990 and 2000. However, the data points seem to change smoothly from 1990 to 2000 - do you know what is going on?"
533 | ]
534 | },
535 | {
536 | "cell_type": "markdown",
537 | "metadata": {},
538 | "source": [
539 | "## Step 5: Adding the Australian data\n",
540 | "\n",
541 | "In this step, research the data for Australia (GDP or an approximation, etc.) and add it to the dataset. You can then regenerate the motion chart and compare Australia with the other countries."
542 | ]
543 | },
544 | {
545 | "cell_type": "markdown",
546 | "metadata": {},
547 | "source": [
548 | "## Step 6: Create your own motion chart\n",
549 | "\n",
550 | "Other datasets are also suitable for visualisation as a motion chart. \n",
551 | "Now you can explore a dataset of your own interest and visualise it using a motion chart.\n",
552 | "\n",
553 | "You can also get a big dataset from GapMinder: http://www.gapminder.org/data/ (or from US labor stats, see:\n",
554 | "www.bls.gov/osmr/pdf/st110110.pdf).\n",
555 | "\n",
556 | "#### Explore and assess:\n",
557 | "How do you rate GapMinder as a tool, as used by Rosling, for social good, for education?\n",
558 | " GapMinder is nearly 10 years old (Google bought it circa 2008), what's available now?\n"
559 | ]
560 | }
561 | ],
562 | "metadata": {
563 | "kernelspec": {
564 | "display_name": "Python 3",
565 | "language": "python",
566 | "name": "python3"
567 | },
568 | "language_info": {
569 | "codemirror_mode": {
570 | "name": "ipython",
571 | "version": 3
572 | },
573 | "file_extension": ".py",
574 | "mimetype": "text/x-python",
575 | "name": "python",
576 | "nbconvert_exporter": "python",
577 | "pygments_lexer": "ipython3",
578 | "version": "3.6.1"
579 | }
580 | },
581 | "nbformat": 4,
582 | "nbformat_minor": 2
583 | }
584 |
--------------------------------------------------------------------------------
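Note on the manual-entry cell in the notebook above: every value there is typed as a quoted string, so pandas stores columns such as GDP as text rather than numbers, whereas `pd.read_csv` infers integers for the same columns. The following is a minimal sketch, not part of the original notebook, of one way to convert those columns with `pd.to_numeric`. The names `rows`, `sample`, and `numeric_cols` are illustrative, and whether MotionChart itself needs numeric columns is not stated in the notebook.

```python
import pandas as pd

# Same rows as the notebook's manual-entry cell (all values typed as strings there).
rows = [
    ['USA', '1990', '31744', '76', '250', 'North America'],
    ['USA', '2000', '38850', '77', '285', 'North America'],
    ['China', '1990', '1466', '68', '1440', 'Asia'],
    ['China', '2000', '2806', '72', '1270', 'Asia'],
    ['Japan', '1990', '25870', '79', '124', 'Asia'],
    ['Japan', '2000', '28569', '81', '127', 'Asia'],
    ['Brazil', '1990', '7247', '66', '151', 'South America'],
    ['Brazil', '2000', '8184', '71', '178', 'South America'],
]
columns = ['Countries', 'Time', 'GDP', 'Life Expectancy', 'Population', 'Region']
sample = pd.DataFrame(rows, columns=columns)

# Convert the columns that hold numbers from text to numeric dtypes.
numeric_cols = ['Time', 'GDP', 'Life Expectancy', 'Population']
sample[numeric_cols] = sample[numeric_cols].apply(pd.to_numeric)

# Inspect the resulting column types.
print(sample.dtypes)
```

After the conversion, arithmetic such as `sample['GDP'].mean()` works on numbers instead of failing or concatenating text.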
/motion-chart_activity/MotionChart Activity/MotionChart.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Motion Chart with Python\n",
8 | "In this activity, we are going to learn how to create motion charts with Python."
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "## Step 1: Importing libraries and packages\n",
16 | "Like other Python libraries, MotionChart has to be imported before use. \n",
17 | "Also, if we know which part of the library we need, it is tidier to import that specific part rather than the whole library. \n",
18 | "In this activity, we will only need MotionChart, so let's begin by importing MotionChart from its package. \n",
19 | "\n",
20 | "For this tutorial, we will put some sample data into a DataFrame and then show how it changes over time with MotionChart. So we will also need to import pandas.\n",
21 | "\n",
22 | "Note that if you are using your own machine, you will need to install the motionchart library in your Python environment before you import it; otherwise you will get an error when executing the following code. \n",
23 | "Please follow the instructions here for installing motionchart.\n",
24 | "In the simplest case, you only need to open your terminal (Mac) or command prompt (Windows) and enter the following:\n",
25 | "\n",
26 | " pip install motionchart \n",
27 | "\n",
28 | "You will probably also need to install its dependency, pyperclip:\n",
29 | "\n",
30 | " pip install pyperclip"
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": 2,
36 | "metadata": {},
37 | "outputs": [],
38 | "source": [
39 | "from motionchart.motionchart import MotionChart\n",
40 | "import pandas as pd"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "## Step 2: Sample data\n",
48 | "Now we need some sample data, and we will use the contents of the following table. \n",
49 | "This table shows the GDP, life expectancy and population of some countries for several years.\n",
50 | "\n",
51 | "#### Sample Data Table\n",
52 | "| Countries | Years | GDP | Life Expectancy | Population | Region\n",
53 | "|:------------:|:--------:|:------:|:-------:|:----------------------:|:--------------:\n",
54 | "| USA | 1990 | 31744 | 76 | 250 | North America\n",
55 | "| USA | 2000 | 38850 | 77 | 285 | North America\n",
56 | "| China | 1990 | 1466 | 68 | 1440 | Asia\n",
57 | "| China | 2000 | 2806 | 72 | 1270 | Asia\n",
58 | "| Japan | 1990 | 25870 | 79 | 124 | Asia\n",
59 | "| Japan | 2000 | 28569 | 81 | 127 | Asia\n",
60 | "| Brazil | 1990 | 7247 | 66 | 151 | South America\n",
61 | "| Brazil | 2000 | 8184 | 71 | 178 | South America\n"
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "metadata": {},
67 | "source": [
68 | "## Step 3: Storing sample data into pandas dataframe\n",
69 | "As previously mentioned, we will put the sample data into a DataFrame object."
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "### Enter sample data manually\n",
77 | "In a simple example, we can enter the sample data manually, row by row. The following code creates a dataframe with the sample data and sets its column names as listed previously."
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": 3,
83 | "metadata": {},
84 | "outputs": [
85 | {
86 | "data": {
87 | "text/html": [
190 | ],
191 | "text/plain": [
192 | " Countries Time GDP Life Expectancy Population Region\n",
193 | "0 USA 1990 31744 76 250 North America\n",
194 | "1 USA 2000 38850 77 285 North America\n",
195 | "2 China 1990 1466 68 1440 Asia\n",
196 | "3 China 2000 2806 72 1270 Asia\n",
197 | "4 Japan 1990 25870 79 124 Asia\n",
198 | "5 Japan 2000 28569 81 127 Asia\n",
199 | "6 Brazil 1990 7247 66 151 South America\n",
200 | "7 Brazil 2000 8184 71 178 South America"
201 | ]
202 | },
203 | "execution_count": 3,
204 | "metadata": {},
205 | "output_type": "execute_result"
206 | }
207 | ],
208 | "source": [
209 | "# create a pandas dataframe with the sample data values\n",
210 | "sampleData = pd.DataFrame([\n",
211 | "['USA', '1990', '31744', '76', '250', 'North America'],\n",
212 | "['USA', '2000', '38850', '77', '285', 'North America'],\n",
213 | "['China', '1990', '1466', '68', '1440', 'Asia'],\n",
214 | "['China', '2000', '2806', '72', '1270', 'Asia'],\n",
215 | "['Japan', '1990', '25870', '79', '124', 'Asia'],\n",
216 | "['Japan', '2000', '28569', '81', '127', 'Asia'],\n",
217 | "['Brazil', '1990', '7247', '66', '151', 'South America'],\n",
218 | "['Brazil', '2000', '8184', '71', '178', 'South America']])\n",
219 | "\n",
220 | "sampleData.columns = ['Countries','Time', 'GDP', 'Life Expectancy', 'Population', 'Region']\n",
221 | "sampleData # we can then have a look at the dataframe"
222 | ]
223 | },
224 | {
225 | "cell_type": "markdown",
226 | "metadata": {},
227 | "source": [
228 | "### Reading data from csv file\n",
229 | "For a larger dataset, entering the values manually is not practical. \n",
230 | "Now assume we have the sample data in CSV format (data.csv; make sure that the data file is in the same directory as this Jupyter notebook). \n",
231 | "The following code reads the data directly from data.csv into a pandas dataframe. \n",
232 | "We do not even need to set the column names, because the header row is automatically recognized as the column names."
233 | ]
234 | },
235 | {
236 | "cell_type": "code",
237 | "execution_count": 4,
238 | "metadata": {},
239 | "outputs": [
240 | {
241 | "data": {
242 | "text/html": [
345 | ],
346 | "text/plain": [
347 | " Time GDP Life Expectancy Countries Population Region\n",
348 | "0 1990 31744 76 USA 250 North America\n",
349 | "1 2000 38850 77 USA 285 North America\n",
350 | "2 1990 11466 68 China 1440 Asia\n",
351 | "3 2000 12806 72 China 1270 Asia\n",
352 | "4 1990 25870 79 Japan 124 Asia\n",
353 | "5 2000 28569 81 Japan 127 Asia\n",
354 | "6 1990 7247 66 Brazil 151 South America\n",
355 | "7 2000 8184 71 Brazil 178 South America"
356 | ]
357 | },
358 | "execution_count": 4,
359 | "metadata": {},
360 | "output_type": "execute_result"
361 | }
362 | ],
363 | "source": [
364 | "# read in the sample data from data.csv\n",
365 | "sampleData = pd.read_csv('data.csv')\n",
366 | "# have a look at the dataframe; it should hold the same records as before, with columns in the CSV header's order\n",
367 | "sampleData"
368 | ]
369 | },
370 | {
371 | "cell_type": "markdown",
372 | "metadata": {},
373 | "source": [
374 | "## Step 4: Creating motion chart\n",
375 | "Next, we pass our data to MotionChart to animate it and show how the values change over time."
376 | ]
377 | },
378 | {
379 | "cell_type": "markdown",
380 | "metadata": {},
381 | "source": [
382 | "The following html code block is just to make sure that you will see the entire motion chart nicely in the output cell."
383 | ]
384 | },
385 | {
386 | "cell_type": "code",
387 | "execution_count": 5,
388 | "metadata": {},
389 | "outputs": [
390 | {
391 | "data": {
392 | "text/html": [
393 | ""
403 | ],
404 | "text/plain": [
405 | ""
406 | ]
407 | },
408 | "metadata": {},
409 | "output_type": "display_data"
410 | }
411 | ],
412 | "source": [
413 | "%%html\n",
414 | ""
424 | ]
425 | },
426 | {
427 | "cell_type": "code",
428 | "execution_count": 6,
429 | "metadata": {
430 | "scrolled": true
431 | },
432 | "outputs": [
433 | {
434 | "data": {
435 | "text/html": [
436 | "\n",
437 | " \n",
444 | " "
445 | ],
446 | "text/plain": [
447 | ""
448 | ]
449 | },
450 | "metadata": {},
451 | "output_type": "display_data"
452 | }
453 | ],
454 | "source": [
455 | "# now generate the motionchart and show it in the notebook\n",
456 | "mChart = MotionChart(df = sampleData)\n",
457 | "mChart.to_notebook()"
458 | ]
459 | },
460 | {
461 | "cell_type": "markdown",
462 | "metadata": {},
463 | "source": [
464 | "At first, we get a motion chart that doesn't look quite right. \n",
465 | "Some default settings may not be what we want, e.g., which variables are used for the x and y axes, which one is shown as size, etc.\n",
466 | "Therefore, we have to reset the options and parameters. \n",
467 | "You can do this in code when you create the motion chart, or change the settings in the active motion chart panel."
468 | ]
469 | },
470 | {
471 | "cell_type": "code",
472 | "execution_count": 35,
473 | "metadata": {},
474 | "outputs": [
475 | {
476 | "data": {
477 | "text/html": [
478 | "\n",
479 | " \n",
486 | " "
487 | ],
488 | "text/plain": [
489 | ""
490 | ]
491 | },
492 | "metadata": {},
493 | "output_type": "display_data"
494 | }
495 | ],
496 | "source": [
497 | "mChart = MotionChart(df = sampleData, key='Time', x='GDP', y='Life Expectancy', xscale='linear', yscale='linear',\n",
498 | " size='Population', color='Region', category='Countries')\n",
499 | "\n",
500 | "mChart.to_notebook()"
501 | ]
502 | },
503 | {
504 | "cell_type": "markdown",
505 | "metadata": {},
506 | "source": [
507 | "To change the settings directly in the motion chart, move your mouse towards the top right of the chart, where you'll see a long blue rectangle. Hover over the rectangle to show all the options. \n",
508 | "In this example, we choose the following settings:\n",
509 | "- Key: Time\n",
510 | "- X-Axis: GDP\n",
511 | "- Y-Axis: Life Expectancy\n",
512 | "- Size: Population\n",
513 | "- Color: Region\n",
514 | "- Category: Countries\n",
515 | "\n",
516 | "The correct setting would look like this:\n",
517 | "\n",
518 | "\n",
519 | "Once you've reset the options, you are free to play around with the motion chart. \n",
520 | "\n",
521 | "In fact, Python Motion Charts offers a lot of controls: two buttons on top and a settings panel on the right side. \n",
522 | "You can change the chart settings and generate any visualization you like. "
523 | ]
524 | },
525 | {
526 | "cell_type": "markdown",
527 | "metadata": {},
528 | "source": [
529 | "#### Question:\n",
530 | "Looking at the data, we've only got values for 1990 and 2000. However, the data points seem to change smoothly from 1990 to 2000 - do you know what is going on?"
531 | ]
532 | },
533 | {
534 | "cell_type": "markdown",
535 | "metadata": {},
536 | "source": [
537 | "## Step 5: Adding the Australian data\n",
538 | "\n",
539 | "In this step, research the data for Australia (GDP or an approximation, etc.) and add it to the dataset. You can then regenerate the motion chart and compare Australia with the other countries."
540 | ]
541 | },
542 | {
543 | "cell_type": "markdown",
544 | "metadata": {},
545 | "source": [
546 | "## Step 6: Create your own motion chart\n",
547 | "\n",
548 | "Other datasets are also suitable for visualisation as a motion chart. \n",
549 | "Now you can explore a dataset of your own interest and visualise it using a motion chart.\n",
550 | "\n",
551 | "You can also get a big dataset from GapMinder: http://www.gapminder.org/data/ (or from US labor stats, see:\n",
552 | "www.bls.gov/osmr/pdf/st110110.pdf).\n",
553 | "\n",
554 | "#### Explore and assess:\n",
555 | "How do you rate GapMinder as a tool, as used by Rosling, for social good, for education?\n",
556 | " GapMinder is nearly 10 years old (Google bought it circa 2008), what's available now?\n"
557 | ]
558 | }
559 | ],
560 | "metadata": {
561 | "kernelspec": {
562 | "display_name": "Python 3",
563 | "language": "python",
564 | "name": "python3"
565 | },
566 | "language_info": {
567 | "codemirror_mode": {
568 | "name": "ipython",
569 | "version": 3
570 | },
571 | "file_extension": ".py",
572 | "mimetype": "text/x-python",
573 | "name": "python",
574 | "nbconvert_exporter": "python",
575 | "pygments_lexer": "ipython3",
576 | "version": "3.6.4"
577 | }
578 | },
579 | "nbformat": 4,
580 | "nbformat_minor": 2
581 | }
582 |
--------------------------------------------------------------------------------
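On the question in the notebook above about points moving smoothly between 1990 and 2000: one plausible explanation is that the chart fills in the intermediate frames by interpolating between the two observed years. The sketch below shows plain linear interpolation on the USA rows of the sample table. It illustrates the idea only; how the MotionChart JavaScript actually computes in-between frames is an assumption, and `interpolate` and `frame_for_year` are hypothetical helper names.

```python
def interpolate(start, end, fraction):
    """Linearly blend two values: fraction=0 gives start, fraction=1 gives end."""
    return start + (end - start) * fraction


# The two observed keyframes for the USA, taken from the sample data table.
usa_1990 = {'GDP': 31744, 'Life Expectancy': 76, 'Population': 250}
usa_2000 = {'GDP': 38850, 'Life Expectancy': 77, 'Population': 285}


def frame_for_year(year, first_year=1990, last_year=2000):
    """Estimate every variable for a year between the two observed years."""
    fraction = (year - first_year) / (last_year - first_year)
    return {
        name: interpolate(usa_1990[name], usa_2000[name], fraction)
        for name in usa_1990
    }


# Only 1990 and 2000 are real observations; 1995 is a computed in-between frame.
for year in (1990, 1995, 2000):
    print(year, frame_for_year(year))
```

Values for years such as 1995 produced this way are estimates drawn along a straight line, not measurements, which is worth keeping in mind when reading an animated chart.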
/motion-chart_activity/MotionChart Activity/data.csv:
--------------------------------------------------------------------------------
1 | Time,GDP,Life Expectancy,Countries,Population,Region
1990,31744,76,USA,250,North America
2000,38850,77,USA,285,North America
1990,11466,68,China,1440,Asia
2000,12806,72,China,1270,Asia
1990,25870,79,Japan,124,Asia
2000,28569,81,Japan,127,Asia
1990,7247,66,Brazil,151,South America
2000,8184,71,Brazil,178,South America
--------------------------------------------------------------------------------
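The notebook expects the frame read from data.csv to match the manually entered one, but the two sources differ for China's GDP (1466 and 2806 in the manual cell, 11466 and 12806 in data.csv above). A small check like the following, written here with only the standard library as a sketch, makes such differences visible. The names `CSV_TEXT`, `MANUAL_GDP`, and `mismatches` are illustrative, and the source does not say which set of values is the intended one.

```python
import csv
import io

# Contents of data.csv, copied from the repository listing.
CSV_TEXT = """\
Time,GDP,Life Expectancy,Countries,Population,Region
1990,31744,76,USA,250,North America
2000,38850,77,USA,285,North America
1990,11466,68,China,1440,Asia
2000,12806,72,China,1270,Asia
1990,25870,79,Japan,124,Asia
2000,28569,81,Japan,127,Asia
1990,7247,66,Brazil,151,South America
2000,8184,71,Brazil,178,South America
"""

# GDP values from the notebook's manual-entry cell, keyed by (country, year).
MANUAL_GDP = {
    ('USA', 1990): 31744, ('USA', 2000): 38850,
    ('China', 1990): 1466, ('China', 2000): 2806,
    ('Japan', 1990): 25870, ('Japan', 2000): 28569,
    ('Brazil', 1990): 7247, ('Brazil', 2000): 8184,
}

# Parse the CSV text and index its GDP column by the same key.
csv_gdp = {
    (row['Countries'], int(row['Time'])): int(row['GDP'])
    for row in csv.DictReader(io.StringIO(CSV_TEXT))
}

# Collect every key whose GDP differs between the two sources.
mismatches = {
    key: (manual, csv_gdp[key])
    for key, manual in MANUAL_GDP.items()
    if csv_gdp[key] != manual
}

for (country, year), (manual, from_csv) in sorted(mismatches.items()):
    print(f"{country} {year}: manual={manual}, csv={from_csv}")
```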
/motion-chart_activity/MotionChart Activity/mc_temp.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 |
24 |
25 |
26 |
39 |
40 |
41 |
42 |
--------------------------------------------------------------------------------
/motion-chart_activity/MotionChart Activity/motionchart.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Sun Dec 28 15:33:33 2014
4 |
5 | @author: Hans Olav Melberg
6 | """
7 | # This is a wrapper that makes it possible to create motion charts from a pandas dataframe
8 | #
9 | # Acknowledgements and more information
10 | # See https://github.com/RamyElkest/SocrMotionChartsHTML5 for more information about the javascript
11 | # See also https://github.com/psychemedia/dataviz4development/tree/master/SocrMotionCharts
12 | # For more background, and the Java version, see http://www.amstat.org/publications/jse/v18n3/dinov.pdf
13 |
14 | import os
15 | import webbrowser
16 | import pandas as pd
17 | import pyperclip
18 | from IPython.display import display, HTML, IFrame
19 |
20 | class MotionChart(object):
21 | ''' To create a Motion Chart object from a pandas dataframe:
22 | mc = MotionChart(df = dataframe)
23 | To send the object to the IPython Notebook, a browser, the clipboard, or a file, write:
24 | mc.to_notebook()
25 | mc.to_browser()
26 | mc.to_clipboard()
27 | mc.to_file()
28 |
29 | Options and defaults (specifying which variable you want to be x, y, etc):
30 | mc = MotionChart(
31 | df = df,
32 | title = "Motion Chart",
33 | url = "http://socr.ucla.edu/htmls/HTML5/MotionChart",
34 | key = 1,
35 | x = 2,
36 | y = 3,
37 | size = 4,
38 | color = 5,
39 | category = 1,
40 | xscale='linear',
41 | yscale='linear',
42 | play = 'true',
43 | loop = 'false',
44 | width = 800,
45 | height = 600,
46 | varLabels=None)
47 |
48 | Explained:
49 | df # pandas dataframe used to create the motion chart. Pass it explicitly: the placeholder default 'df' is a string, not a dataframe
50 | title # string. The title of the chart
51 | url # string. url to folder with js and css files;
52 | can be local, default is external which requires an internet connection
53 | key # string or integer. name or column number of the "motion" variable (does not have to be time)
54 | x # string or integer. number (integer) or name (text, string) of the x-variable in the chart.
55 | Can later be changed by clicking on the variable in the chart.
56 | Numbering starts from 0, which is the outer index of the dataframe
57 | y # string or integer. number (integer) or name (text, string) of the y-variable in the chart.
58 | size # name (text, string) or column number (integer)
59 | The variable used to determine the size of the circles
60 | color # name (text, string) or column number (integer)
61 | The variable used to determine the color of the circles
62 | category # name (text, string) or column number (integer)
63 | The variable used to describe the category the observation belongs to.
64 | For example Mid-West, South. Often the same variable as color.
65 | xscale # string. Scale for x-variable, default 'linear'.
66 | Possible values 'linear', 'log', 'sqrt', 'quadnomial', 'ordinal'
67 | yscale # string. Scale for y-variable, default 'linear'.
68 | Possible values 'linear', 'log', 'sqrt', 'quadnomial', 'ordinal'
69 | play # string. 'true' (default) or 'false'.
70 | Determines whether the motion starts right away or if you have to click play first.
71 | loop # string. 'true' or 'false' (default).
72 | Determines whether the motion keeps repeating after one loop over the series, or stops.
73 | width # integer. width of chart in pixels, default 800
74 | height # integer. height of chart in pixels, default 600
75 | varLabels # list. list of labels for columns (default is column headers of dataframe)
76 | Must be of same length as the number of columns in the dataframe, including the index
77 |
78 | '''
79 | # This defines the motion chart object.
80 | # Basically just holds the parameters used to create the chart: name of data source, which variables to use
81 | def __init__(self,
82 | df = 'df',
83 | title = "Motion Chart",
84 | url = "http://socr.ucla.edu/htmls/HTML5/MotionChart",
85 | key = 1,
86 | x = 2,
87 | y = 3,
88 | size = 4,
89 | color = 5,
90 | category = 5,
91 | xscale='linear',
92 | yscale='linear',
93 | play = 'true',
94 | loop = 'false',
95 | width = 800,
96 | height = 600,
97 | varLabels=None):
98 | self.df = df
99 | self.title = title
100 | self.url = url
101 | self.key = key
102 | self.x = x
103 | self.y = y
104 | self.size = size
105 | self.color = color
106 | self.category = category
107 | self.xscale= xscale
108 | self.yscale= yscale
109 | self.play = play
110 | self.loop = loop # string: 'true' or 'false' (default, false).
111 | self.width = width # width of chart in pixels, default 800
112 | self.height = height # height of chart in pixels, default 600
113 | self.varLabels = varLabels # list of labels for columns (default is column headers of dataframe)
114 |
115 | # The information from the object is used to generate the HTML string generating the chart
116 | # (inserting the specific information in the object into the template string)
117 | # Note 1: The string is generated in two steps, not one, because a future version might want to revise some properties
118 | # without redoing the reformatting and creation of the dataset from the dataframe
119 | # Note 2: Initially the string itself was saved in the object; although sometimes useful, that seems memory greedy
120 | # Note 3: The template string used here is just a revised version of a template somebody else has created
121 | # See Tony Hirst: https://github.com/psychemedia/dataviz4development/tree/master/SocrMotionCharts
122 | def htmlStringStart(self):
123 | socrTemplateStart='''
124 |
125 |
126 |
127 |
128 |
129 |
130 |
131 |
132 |
133 |
134 |
135 |
136 |
137 |
138 |
139 |
140 |
141 |
142 |
146 | '''
147 | # In order to make it easy to use information in the index of the dataframe, the index in the passed dataframe is reset
148 | # For instance: If the time variable is in the index of the dataframe, say the outer index, then one would write
149 | # mc = MotionChart(df, key = 0) when specifying the motion chart
150 | # Note that although the key often is time, it does not have to be so (unlike Google Motion Chart)
151 | # In MotionCharts it is basically whatever variable you want to use to define the change
152 |
153 | df = self.df.reset_index()
154 |
155 | # If variable labels are not specified, the column names of the dataframe are used
156 | # Note: if variable labels are specified, the list of labels must have the same number of elements
157 | # as the columns in the reset dataframe (i.e. original number of columns plus number of index levels)
158 | if self.varLabels is None:
159 | self.varLabels = df.columns.tolist()
160 |
161 | # Here the data is converted from a pandas dataframe to the format which is accepted by the SocrMotion Chart (javascript)
162 | # The starting point is a json string of all the values in the dataframe, which is then modified to fit SocrMotionChart
163 | dataValuesString = df.to_json(orient = 'values')
164 | varNamesString = ",".join(['"' + str(var) + '"' for var in self.varLabels])
165 | varNamesString = "[[" + varNamesString + "], ["
166 | dataValuesString = dataValuesString.lstrip("[")
167 | socrDataString = varNamesString + dataValuesString
168 |
169 | # The generated string containing the data in the right format, is inserted into the template string
170 | htmlString1 = socrTemplateStart.format(
171 | data = socrDataString,
172 | url = self.url
173 | )
174 | # Change the reference to the bootstrap js file depending on where the url points
175 | # SOCR's webpage lists the file as bootstrap.js, but the GitHub version, which many use
176 | # to install a local copy, refers to the same file as custom-bootstrap.js
177 | # The default is to keep it as 'custom-bootstrap.js', but if the url points to SOCR
178 | # (which is the default, since we want the chart to work on the web), the filename is changed to 'bootstrap.js'
179 | if self.url == "http://socr.ucla.edu/htmls/HTML5/MotionChart":
180 | htmlString1 = htmlString1.replace("custom-bootstrap.js", "bootstrap.js")
181 | return htmlString1
182 |
183 | # Generating the last half of the html string which produces the motion chart
184 | # The reason the string is generated in two halves is to allow for revisions in which some options are changed
185 | # without having to transform and generate the data from the dataframe again.
186 | def htmlStringEnd(self):
187 | socrTemplateEnd = '''
188 |
189 |
202 |
203 |
204 |
205 | '''
206 | # Rename variables to avoid changing the properties of the object when changing strings to numbers
207 | # (Numbers are required in the js script)
208 | kkey = self.key
209 | xx = self.x
210 | yy = self.y
211 | ssize = self.size
212 | ccolor = self.color
213 | ccategory = self.category
214 |
215 | # The user is free to specify many variables either by location (an integer representing the column number)
216 | # or by name (the column name in the dataframe)
217 | # This means we have to find and replace with column number if the variable is specified as a string since
218 | # the javascript wants integers (note: variable labels must be unique)
219 | # The code below finds and replaces the specified column name (text) with the column number (numeric)
220 | if type(kkey) is str:
221 | kkey=self.varLabels.index(kkey)
222 | if type(xx) is str:
223 | xx=self.varLabels.index(xx)
224 | if type(yy) is str:
225 | yy=self.varLabels.index(yy)
226 | if type(ssize) is str:
227 | ssize=self.varLabels.index(ssize)
228 | if type(ccolor) is str:
229 | ccolor=self.varLabels.index(ccolor)
230 | if type(ccategory) is str:
231 | ccategory=self.varLabels.index(ccategory)
232 |
233 | # The properties are inserted into the last half of the template string
234 | htmlString2 = socrTemplateEnd.format(
235 | title = self.title,
236 | key = kkey, x = xx, y = yy, size = ssize, color = ccolor, category = ccategory,
237 | xscale= self.xscale , yscale= self.yscale,
238 | play = self.play, loop = self.loop,
239 | width = self.width, height = self.height)
240 | return htmlString2
241 |
242 | # Display the motion chart in the browser (start the default browser)
243 | def to_browser(self):
244 | htmlString = self.htmlStringStart() + self.htmlStringEnd()
245 | path = os.path.abspath('temp.html')
246 | url = 'file://' + path
247 |
248 | with open(path, 'w') as f:
249 | f.write(htmlString)
250 | webbrowser.open(url)
251 |
252 | # Display the motion chart in the Ipython notebook
253 | # This is saved to a file because in Python 3 it was difficult to encode the string that could be used in HTML directly
254 | # TODO: Eliminate file (security risk to leave the file on disk, and overwrite danger?) and avoid name conflicts.
255 | # Also: What if multiple figures?
256 | def to_notebook(self, width = 900, height = 700):
257 | htmlString = self.htmlStringStart() + self.htmlStringEnd()
258 | path = os.path.abspath('mc_temp.html')
259 | with open(path, 'w') as f:
260 | f.write(htmlString)
261 | display(IFrame(src="mc_temp.html", width = width, height = height))
262 |
263 | # Copy the HTML string to the clipboard
264 | def to_clipboard(self):
265 | htmlString = self.htmlStringStart() + self.htmlStringEnd()
266 | pyperclip.copy(htmlString)
267 |
268 | # Save the motion chart as a file (include .html manually if desired)
269 | def to_file(self, path_and_name):
270 | htmlString = self.htmlStringStart() + self.htmlStringEnd()
271 | fileName = path_and_name
272 | try: # Python 2 only: Python 3 has no 'string-escape' codec and str is unicode already
273 | fileName = fileName.encode('string-escape')
274 | except (LookupError, AttributeError):
275 | pass # Python 3: use the name as given
276 | with open(fileName, 'w') as f:
277 | f.write(htmlString)
279 |
280 | # Include a demo option
281 | def MotionChartDemo():
282 | fruitdf = pd.DataFrame([
283 | ['Apples', '1988-0-1', 1000, 300, 44,'East'],
284 | ['Oranges', '1988-0-1', 1150, 200, 42, 'West'],
285 | ['Bananas', '1988-0-1', 300, 250, 35, 'West'],
286 | ['Apples', '1989-6-1', 1200, 400, 48, 'East'],
287 | ['Oranges', '1989-6-1', 750, 150, 47, 'West'],
288 | ['Bananas', '1989-6-1', 788, 617, 45, 'West']])
289 | fruitdf.columns = ['fruit', 'time', 'sales', 'price', 'temperature', 'location']
290 | fruitdf['time'] = pd.to_datetime(fruitdf['time'])
291 | mChart = MotionChart(
292 | df = fruitdf,
293 | url = "http://socr.ucla.edu/htmls/HTML5/MotionChart",
294 | key = 'time',
295 | x = 'price',
296 | y = 'sales',
297 | size = 'temperature',
298 | color = 'fruit',
299 | category = 'location')
300 | mChart.to_browser()
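
The data string assembled in htmlStringStart (a header row spliced in front of the JSON-encoded rows) can be sketched on its own. This is a minimal illustration, not part of the module: `build_socr_data_string` is a hypothetical helper, and `json.dumps` on a list of rows is used only as a stand-in for `df.to_json(orient='values')` so the sketch runs without pandas.

```python
import json


def build_socr_data_string(var_labels, rows):
    """Splice a header row in front of JSON-encoded data rows (hypothetical helper)."""
    # Stand-in for df.to_json(orient='values'): a JSON array of row arrays
    data_values = json.dumps(rows, separators=(",", ":"))
    # Quote each label by hand, as the module does
    header = ",".join('"' + str(label) + '"' for label in var_labels)
    # Strip the leading '[[' from the rows, then reopen the outer array with the header row
    return "[[" + header + "], [" + data_values.lstrip("[")


labels = ["index", "fruit", "price"]
rows = [[0, "Apples", 300], [1, "Oranges", 200]]
socr_data = build_socr_data_string(labels, rows)

parsed = json.loads(socr_data)  # the spliced string parses as one JSON array
print(parsed[0])   # header row
print(parsed[1:])  # data rows
```

Because the labels are quoted by hand, a label that itself contains a double quote would produce invalid output; encoding the labels with `json.dumps` would avoid that, at the cost of departing from the module's approach.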
--------------------------------------------------------------------------------
/motion-chart_activity/MotionChart Activity/motionchart/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/itsjunqing/fit1043-introduction-to-data-science/3d3667752be63cbc7b6dc63435a504d18e311651/motion-chart_activity/MotionChart Activity/motionchart/__init__.py
--------------------------------------------------------------------------------
/motion-chart_activity/MotionChart Activity/motionchart/__pycache__/__init__.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/itsjunqing/fit1043-introduction-to-data-science/3d3667752be63cbc7b6dc63435a504d18e311651/motion-chart_activity/MotionChart Activity/motionchart/__pycache__/__init__.cpython-36.pyc
--------------------------------------------------------------------------------
/motion-chart_activity/MotionChart Activity/motionchart/__pycache__/motionchart.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/itsjunqing/fit1043-introduction-to-data-science/3d3667752be63cbc7b6dc63435a504d18e311651/motion-chart_activity/MotionChart Activity/motionchart/__pycache__/motionchart.cpython-36.pyc
--------------------------------------------------------------------------------
/motion-chart_activity/MotionChart Activity/motionchart/motionchart.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Sun Dec 28 15:33:33 2014
4 |
5 | @author: Hans Olav Melberg
6 | """
7 | # This is a wrapper that makes it possible to create motion charts from a pandas dataframe
8 | #
9 | # Acknowledgements and more information
10 | # See https://github.com/RamyElkest/SocrMotionChartsHTML5 for more information about the javascript
11 | # See also https://github.com/psychemedia/dataviz4development/tree/master/SocrMotionCharts
12 | # For more background, and a Java version, see http://www.amstat.org/publications/jse/v18n3/dinov.pdf
13 |
14 | import os
15 | import webbrowser
16 | import pandas as pd
17 | import pyperclip
18 | from IPython.display import display, HTML, IFrame
19 |
20 | class MotionChart(object):
21 | ''' To create a Motion Chart object from a pandas dataframe:
22 | mc = MotionChart(df = dataframe)
23 | Send the object to the IPython Notebook, a browser, the clipboard or a file by writing:
24 | mc.to_notebook()
25 | mc.to_browser()
26 | mc.to_clipboard()
27 | mc.to_file(path_and_name)
28 |
29 | Options and defaults (specifying which variable you want to be x, y, etc):
30 | mc = MotionChart(
31 | df = df,
32 | title = "Motion Chart",
33 | url = "http://socr.ucla.edu/htmls/HTML5/MotionChart",
34 | key = 1,
35 | x = 2,
36 | y = 3,
37 | size = 4,
38 | color = 5,
39 | category = 5,
40 | xscale='linear',
41 | yscale='linear',
42 | play = 'true',
43 | loop = 'false',
44 | width = 800,
45 | height = 600,
46 | varLabels=None)
47 |
48 | Explained:
49 | df # pandas dataframe used to create the motion chart. Pass it explicitly: the placeholder default 'df' is a string, not a dataframe
50 | title # string. The title of the chart
51 | url # string. url to folder with js and css files;
52 | can be local, default is external which requires an internet connection
53 | key # string or integer. name or column number of the "motion" variable (does not have to be time)
54 | x # string or integer. number (integer) or name (text, string) of the x-variable in the chart.
55 | Can later be changed by clicking on the variable in the chart.
56 | Numbering starts from 0, which is the outer index of the dataframe
57 | y # string or integer. number (integer) or name (text, string) of the y-variable in the chart.
58 | size # name (text, string) or column number (integer)
59 | The variable used to determine the size of the circles
60 | color # name (text, string) or column number (integer)
61 | The variable used to determine the color of the circles
62 | category # name (text, string) or column number (integer)
63 | The variable used to describe the category the observation belongs to.
64 | For example Mid-West, South. Often the same variable as color.
65 | xscale # string. Scale for x-variable, default 'linear'.
66 | Possible values 'linear', 'log', 'sqrt', 'quadnomial', 'ordinal'
67 | yscale # string. Scale for y-variable, default 'linear'.
68 | Possible values 'linear', 'log', 'sqrt', 'quadnomial', 'ordinal'
69 | play # string. 'true' (default) or 'false'.
70 | Determines whether the motion starts right away or if you have to click play first.
71 | loop # string. 'true' or 'false' (default).
72 | Determines whether the motion keeps repeating after one loop over the series, or stops.
73 | width # integer. width of chart in pixels, default 800
74 | height # integer. height of chart in pixels, default 600
75 | varLabels # list. list of labels for columns (default is column headers of dataframe)
76 | Must be of same length as the number of columns in the dataframe, including the index
77 |
78 | '''
79 | # This defines the motion chart object.
80 | # Basically just holds the parameters used to create the chart: name of data source, which variables to use
81 | def __init__(self,
82 | df = 'df',
83 | title = "Motion Chart",
84 | url = "http://socr.ucla.edu/htmls/HTML5/MotionChart",
85 | key = 1,
86 | x = 2,
87 | y = 3,
88 | size = 4,
89 | color = 5,
90 | category = 5,
91 | xscale='linear',
92 | yscale='linear',
93 | play = 'true',
94 | loop = 'false',
95 | width = 800,
96 | height = 600,
97 | varLabels=None):
98 | self.df = df
99 | self.title = title
100 | self.url = url
101 | self.key = key
102 | self.x = x
103 | self.y = y
104 | self.size = size
105 | self.color = color
106 | self.category = category
107 | self.xscale= xscale
108 | self.yscale= yscale
109 | self.play = play
110 | self.loop = loop # string: 'true' or 'false' (default, false).
111 | self.width = width # width of chart in pixels, default 800
112 | self.height = height # height of chart in pixels, default 600
113 | self.varLabels = varLabels # list of labels for columns (default is column headers of dataframe)
114 |
115 | # The information from the object is used to generate the HTML string generating the chart
116 | # (inserting the specific information in the object into the template string)
117 | # Note 1: The string is generated in two steps, not one, because a future version might want to revise some properties
118 | # without redoing the reformatting and creation of the dataset from the dataframe
119 | # Note 2: Initially the string itself was saved in the object; although sometimes useful, that seems memory greedy
120 | # Note 3: The template string used here is just a revised version of a template somebody else has created
121 | # See Tony Hirst: https://github.com/psychemedia/dataviz4development/tree/master/SocrMotionCharts
122 | def htmlStringStart(self):
123 | socrTemplateStart='''
124 |
125 |
126 |
127 |
128 |
129 |
130 |
131 |
132 |
133 |
134 |
135 |
136 |
137 |
138 |
139 |
140 |
141 |
142 |
146 | '''
147 | # In order to make it easy to use information in the index of the dataframe, the index in the passed dataframe is reset
148 | # For instance: If the time variable is in the index of the dataframe, say the outer index, then one would write
149 | # mc = MotionChart(df, key = 0) when specifying the motion chart
150 | # Note that although the key often is time, it does not have to be so (unlike Google Motion Chart)
151 | # In MotionCharts it is basically whatever variable you want to use to define the change
152 |
153 | df = self.df.reset_index()
154 |
155 | # If variable labels are not specified, the column names of the dataframe are used
156 | # Note: if variable labels are specified, the list of labels must have the same number of elements
157 | # as the columns in the reset dataframe (i.e. original number of columns plus number of index levels)
158 | if self.varLabels is None:
159 | self.varLabels = df.columns.tolist()
160 |
161 | # Here the data is converted from a pandas dataframe to the format which is accepted by the SocrMotion Chart (javascript)
162 | # The starting point is a json string of all the values in the dataframe, which is then modified to fit SocrMotionChart
163 | dataValuesString = df.to_json(orient = 'values')
164 | varNamesString = ",".join(['"' + str(var) + '"' for var in self.varLabels])
165 | varNamesString = "[[" + varNamesString + "], ["
166 | dataValuesString = dataValuesString.lstrip("[")
167 | socrDataString = varNamesString + dataValuesString
168 |
169 | # The generated string containing the data in the right format, is inserted into the template string
170 | htmlString1 = socrTemplateStart.format(
171 | data = socrDataString,
172 | url = self.url
173 | )
174 | # Change the reference to the bootstrap js file depending on where the url points
175 | # SOCR's webpage lists the file as bootstrap.js, but the GitHub version, which many use
176 | # to install a local copy, refers to the same file as custom-bootstrap.js
177 | # The default is to keep it as 'custom-bootstrap.js', but if the url points to SOCR
178 | # (which is the default, since we want the chart to work on the web), the filename is changed to 'bootstrap.js'
179 | if self.url == "http://socr.ucla.edu/htmls/HTML5/MotionChart":
180 | htmlString1 = htmlString1.replace("custom-bootstrap.js", "bootstrap.js")
181 | return htmlString1
182 |
183 | # Generating the last half of the html string which produces the motion chart
184 | # The reason the string is generated in two halves is to allow for revisions in which some options are changed
185 | # without having to transform and generate the data from the dataframe again.
186 | def htmlStringEnd(self):
187 | socrTemplateEnd = '''
188 |
189 |
202 |
203 |
204 |
205 | '''
206 | # Rename variables to avoid changing the properties of the object when changing strings to numbers
207 | # (Numbers are required in the js script)
208 | kkey = self.key
209 | xx = self.x
210 | yy = self.y
211 | ssize = self.size
212 | ccolor = self.color
213 | ccategory = self.category
214 |
215 | # The user is free to specify many variables either by location (an integer representing the column number)
216 | # or by name (the column name in the dataframe)
217 | # This means we have to find and replace with column number if the variable is specified as a string since
218 | # the javascript wants integers (note: variable labels must be unique)
219 | # The code below finds and replaces the specified column name (text) with the column number (numeric)
220 | if type(kkey) is str:
221 | kkey=self.varLabels.index(kkey)
222 | if type(xx) is str:
223 | xx=self.varLabels.index(xx)
224 | if type(yy) is str:
225 | yy=self.varLabels.index(yy)
226 | if type(ssize) is str:
227 | ssize=self.varLabels.index(ssize)
228 | if type(ccolor) is str:
229 | ccolor=self.varLabels.index(ccolor)
230 | if type(ccategory) is str:
231 | ccategory=self.varLabels.index(ccategory)
232 |
233 | # The properties are inserted into the last half of the template string
234 | htmlString2 = socrTemplateEnd.format(
235 | title = self.title,
236 | key = kkey, x = xx, y = yy, size = ssize, color = ccolor, category = ccategory,
237 | xscale= self.xscale , yscale= self.yscale,
238 | play = self.play, loop = self.loop,
239 | width = self.width, height = self.height)
240 | return htmlString2
241 |
242 | # Display the motion chart in the browser (start the default browser)
243 | def to_browser(self):
244 | htmlString = self.htmlStringStart() + self.htmlStringEnd()
245 | path = os.path.abspath('temp.html')
246 | url = 'file://' + path
247 |
248 | with open(path, 'w') as f:
249 | f.write(htmlString)
250 | webbrowser.open(url)
251 |
252 | # Display the motion chart in the Ipython notebook
253 | # This is saved to a file because in Python 3 it was difficult to encode the string that could be used in HTML directly
254 | # TODO: Eliminate file (security risk to leave the file on disk, and overwrite danger?) and avoid name conflicts.
255 | # Also: What if multiple figures?
256 | def to_notebook(self, width = 900, height = 700):
257 | htmlString = self.htmlStringStart() + self.htmlStringEnd()
258 | path = os.path.abspath('mc_temp.html')
259 | with open(path, 'w') as f:
260 | f.write(htmlString)
261 | display(IFrame(src="mc_temp.html", width = width, height = height))
262 |
263 | # Copy the HTML string to the clipboard
264 | def to_clipboard(self):
265 | htmlString = self.htmlStringStart() + self.htmlStringEnd()
266 | pyperclip.copy(htmlString)
267 |
268 | # Save the motion chart as a file (include .html manually if desired)
269 | def to_file(self, path_and_name):
270 | htmlString = self.htmlStringStart() + self.htmlStringEnd()
271 | fileName = path_and_name
272 | try: # Python 2 only: Python 3 has no 'string-escape' codec and str is unicode already
273 | fileName = fileName.encode('string-escape')
274 | except (LookupError, AttributeError):
275 | pass # Python 3: use the name as given
276 | with open(fileName, 'w') as f:
277 | f.write(htmlString)
279 |
280 | # Include a demo option
281 | def MotionChartDemo():
282 | fruitdf = pd.DataFrame([
283 | ['Apples', '1988-0-1', 1000, 300, 44,'East'],
284 | ['Oranges', '1988-0-1', 1150, 200, 42, 'West'],
285 | ['Bananas', '1988-0-1', 300, 250, 35, 'West'],
286 | ['Apples', '1989-6-1', 1200, 400, 48, 'East'],
287 | ['Oranges', '1989-6-1', 750, 150, 47, 'West'],
288 | ['Bananas', '1989-6-1', 788, 617, 45, 'West']])
289 | fruitdf.columns = ['fruit', 'time', 'sales', 'price', 'temperature', 'location']
290 | fruitdf['time'] = pd.to_datetime(fruitdf['time'])
291 | mChart = MotionChart(
292 | df = fruitdf,
293 | url = "http://socr.ucla.edu/htmls/HTML5/MotionChart",
294 | key = 'time',
295 | x = 'price',
296 | y = 'sales',
297 | size = 'temperature',
298 | color = 'fruit',
299 | category = 'location')
300 | mChart.to_browser()
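
The six repeated `if type(...) is str` checks in htmlStringEnd all perform the same name-to-position lookup. A compact sketch of that lookup follows; `resolve_column` is a hypothetical helper introduced for illustration only, and the label list assumes the demo dataframe after `reset_index()`, which adds an `index` column in front.

```python
def resolve_column(spec, var_labels):
    """Return a column position for a name; pass an integer through unchanged."""
    if isinstance(spec, str):
        # list.index returns the first match and raises ValueError for an unknown name,
        # which is why the labels need to be unique
        return var_labels.index(spec)
    return spec


# Assumed labels: the demo columns preceded by the 'index' column from reset_index()
var_labels = ["index", "fruit", "time", "sales", "price", "temperature", "location"]

options = {
    "key": "time",
    "x": "price",
    "y": "sales",
    "size": 5,            # already a position, left as-is
    "color": "fruit",
    "category": "location",
}
resolved = {name: resolve_column(spec, var_labels) for name, spec in options.items()}
print(resolved)
```

Resolving into a separate dictionary keeps the original options untouched, which mirrors the module's use of local copies (kkey, xx, yy and so on) rather than overwriting the object's attributes.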
--------------------------------------------------------------------------------
/motion-chart_activity/MotionChart Activity/setting.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/itsjunqing/fit1043-introduction-to-data-science/3d3667752be63cbc7b6dc63435a504d18e311651/motion-chart_activity/MotionChart Activity/setting.png
--------------------------------------------------------------------------------
/notes/commands.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/itsjunqing/fit1043-introduction-to-data-science/3d3667752be63cbc7b6dc63435a504d18e311651/notes/commands.pdf
--------------------------------------------------------------------------------
/notes/complete_notes.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/itsjunqing/fit1043-introduction-to-data-science/3d3667752be63cbc7b6dc63435a504d18e311651/notes/complete_notes.pdf
--------------------------------------------------------------------------------
/past_exam/FIT1043_Sample_2016.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/itsjunqing/fit1043-introduction-to-data-science/3d3667752be63cbc7b6dc63435a504d18e311651/past_exam/FIT1043_Sample_2016.pdf
--------------------------------------------------------------------------------
/past_exam/FIT1043_Sample_Exam.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/itsjunqing/fit1043-introduction-to-data-science/3d3667752be63cbc7b6dc63435a504d18e311651/past_exam/FIT1043_Sample_Exam.pdf
--------------------------------------------------------------------------------
/past_exam/FIT1043_Sample_Exam_Sols.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/itsjunqing/fit1043-introduction-to-data-science/3d3667752be63cbc7b6dc63435a504d18e311651/past_exam/FIT1043_Sample_Exam_Sols.pdf
--------------------------------------------------------------------------------
/past_exam/answers.txt:
--------------------------------------------------------------------------------
1 | For multiple choice, you have roughly 1.5 minutes per answer. You
2 | have to quickly figure out which answers are clearly wrong and then
3 | try and rank the remaining ones. For short answer, you have roughly 3
4 | minutes per short answer question. The answers could be up to double
5 | the length of what is below, but not much more is expected. Please
6 | carefully consider your answer before writing it down. Perhaps write
7 | wee bullets outside the box or on the blank paper at the end to try
8 | and get things straight before committing to an answer.
9 |
10 |
11 | Q1.1 is entry D
12 | unless you had some problems understanding the meaning of the
13 | word "characteristics", this should have been obvious
14 | Q1.2 is entry C
15 | B isn't true, if the data are stored in different DBs, then
16 | it's hard to run single processing jobs on them because the
17 | DBs will have different record IDs, standards and naming conventions
18 | A is true initially, but probably not in the long run
19 | D is definitely not recommended
20 | Q1.3 is entry B
21 | Q1.4 is entry A
22 | they are very different systems, but A is common; C is not in
23 | RDBMSs, D is not an issue with in-memory DBs, commercial RDBMSs
24 | are not cheap
25 | Q1.5 is entry A
26 | "big" is relative to the task too, which discounts C and D
27 | Q1.6 is entry A
28 | C is wrong; B is an argument some make but not generally believed;
29 | A is the argument behind Zimmermann's law
30 | Q1.7 is entry A
31 | B and D are false; C is partly true, no doubt
32 | Q1.8 is entry C
33 | Q1.9 is entry D
34 | all answers are partially true, so which is best?
35 | B and C should be considered silly; A usually applies to XML, JSON etc.
36 | Q1.10 is entry B
37 | A and C equally apply to RDBMSs; D is a trick answer to try and fool you
38 | Q1.11 is entry B
39 | C and D are instances of p-hacking; both A and B are good, so which is
40 | best? in fact they are very similar ... not a good question,
41 | but B sounds safer
42 | Q1.12 is entry B
43 | one can "check" significance tests using visualisation, but not
44 | perform them
45 | Q1.13 is entry D
46 | A-C were all discussed and mentioned as examples, so that leaves
47 | D
48 | Q1.14 is entry A
49 | C and D partially true; B clearly wrong; A seems best; the word
50 | "primarily" says A is the answer
51 | Q1.15 is entry B
52 | A and C very important; D is handy; B is probably important in
53 | medicine or other sciences more actively using hypothesis testing,
54 | but the issue is that "standard" significance levels change ... some use 0.025;
55 | so B is least useful though not entirely useless
56 |
57 |
58 | Q2.1 linked open data?
59 | Linked Open Data is publicly available data that is "semantically" connected to other data through established relationships. It uses semantic web formats such as RDF and XML to store the data.
60 |
61 | Q2.2 metadata?
62 | Metadata describes the format, the characteristics and perhaps the source of the data. Without this contextual information, the data itself cannot be interpreted for analysis, nor can the quality of the data be properly considered or anomalous results explained.
63 |
64 | Q2.3 SAS, Datawrangler, Python?
65 | SAS -- general purpose, strange syntax, commercial.
66 | DataWrangler -- specialised tool, GUI interface, no coding.
67 | Python -- general purpose, open-source, libraries for tasks
68 |
69 | Q2.4 PMML?
70 | This is an XML standard for describing a predictive model, and can be used to pass models between different systems, which otherwise often have their own internal and proprietary formats.
71 |
72 |
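
To make the PMML answer concrete, here is a rough sketch of what such a document might look like for a small linear model. The function `linear_model_to_pmml`, the field names and the coefficients are all made up for illustration; the element and attribute names follow the PMML 4.4 regression model layout as commonly documented, and the output has not been validated against the official schema.

```python
import xml.etree.ElementTree as ET


def linear_model_to_pmml(intercept, coefficients, target="y"):
    """Serialise y = intercept + sum(coef * x) as a PMML-style XML string (sketch only)."""
    pmml = ET.Element("PMML", version="4.4", xmlns="http://www.dmg.org/PMML-4_4")
    ET.SubElement(pmml, "Header", description="Hypothetical linear model")

    # The data dictionary declares every field the model reads or predicts
    dictionary = ET.SubElement(pmml, "DataDictionary")
    for name in list(coefficients) + [target]:
        ET.SubElement(dictionary, "DataField", name=name,
                      optype="continuous", dataType="double")

    model = ET.SubElement(pmml, "RegressionModel", functionName="regression")
    schema = ET.SubElement(model, "MiningSchema")
    for name in coefficients:
        ET.SubElement(schema, "MiningField", name=name)
    ET.SubElement(schema, "MiningField", name=target, usageType="target")

    # The regression table carries the fitted parameters
    table = ET.SubElement(model, "RegressionTable", intercept=str(intercept))
    for name, coef in coefficients.items():
        ET.SubElement(table, "NumericPredictor", name=name, coefficient=str(coef))

    return ET.tostring(pmml, encoding="unicode")


pmml_text = linear_model_to_pmml(1.5, {"x1": 2.0, "x2": -0.5})
print(pmml_text)
```

A system that understands PMML could read the coefficients back from such a file without knowing which tool fitted the model, which is the exchange role described in the answer.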
--------------------------------------------------------------------------------