├── .gitignore
├── Challenges
│   ├── Anomaly_Detection
│   │   └── anomaly_detection_challenge.ipynb
│   ├── House_Pricing
│   │   ├── challenge_data
│   │   │   ├── Data description.rtf
│   │   │   ├── sample_submission.csv
│   │   │   ├── test.csv
│   │   │   └── train.csv
│   │   └── house_pricing_challenge.ipynb
│   └── Plankton
│       └── plankton_challenge.ipynb
├── Notebooks
│   ├── Intro-public.ipynb
│   └── RecSys-public.ipynb
└── README.md
/.gitignore:
--------------------------------------------------------------------------------
1 | Notebooks/.ipynb_checkpoints/
2 |
--------------------------------------------------------------------------------
/Challenges/Anomaly_Detection/anomaly_detection_challenge.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "AML2019"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Challenge 3\n",
15 | "Anomaly Detection (AD)\n",
16 | "\n",
17 | "3rd May 2019"
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "metadata": {},
23 | "source": [
24 | "Anomaly detection (AD) refers to the process of detecting data points that do not conform to the rest of the observations. Applications of anomaly detection include fraud and fault detection, surveillance, diagnosis, data cleanup, and predictive maintenance.\n",
25 | "\n",
26 | "When we talk about AD, we usually treat it as an unsupervised (or semi-supervised) task, where the concept of anomaly is often not well defined or, in the best case, only a few samples are labeled as anomalous. In this challenge, you will look at AD from a different perspective!\n",
27 | "\n",
28 | "The dataset you are going to work on consists of monitoring data generated by IT systems; such data is then processed by a monitoring system that executes some checks and detects a series of anomalies. This is a multi-label classification problem, where each check is a binary label corresponding to a specific type of anomaly. Your goal is to develop a machine learning model (or multiple ones) to accurately detect such anomalies.\n",
29 | "\n",
30 | "This will also involve a mixture of data exploration, pre-processing, model selection, and performance evaluation. You will also be asked to try one or more rule learning models, and compare them with other ML models in terms of both predictive performance and interpretability. Interpretability is indeed a strong requirement, especially in applications like AD where understanding the output of a model is as important as the output itself.\n",
31 | "\n",
32 | "Please bear in mind that the purpose of this challenge is not simply to find the best-performing model. Rather, you should make sure you understand the difficulties that come with this AD task."
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "# Overview\n",
40 | ""
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "Beyond simply producing a well-performing model for making predictions, in this challenge we would like you to start developing your skills as a machine learning scientist.\n",
48 | "In this regard, your notebook should be structured so as to explore the following five tasks, which are expected to be carried out whenever undertaking such a project.\n",
49 | "The description below each aspect should serve as a guide for your work, but you are strongly encouraged to also explore alternative options and directions. \n",
50 | "Thinking outside the box will always be rewarded in these challenges."
51 | ]
52 | },
53 | {
54 | "cell_type": "markdown",
55 | "metadata": {},
56 | "source": [
57 | "\n",
58 | "1. Data Exploration\n",
59 | ""
60 | ]
61 | },
62 | {
63 | "cell_type": "markdown",
64 | "metadata": {},
65 | "source": [
66 | "The first broad component of your notebook should enable you to familiarise yourselves with the given data, an outline of which is given at the end of this challenge specification.\n",
67 | "Among others, this section should investigate:\n",
68 | "\n",
69 | "- Data cleaning;\n",
70 | "- Data visualisation;\n",
71 | "- Computing descriptive statistics, e.g. correlation;\n",
72 | "- etc.\n",
73 | "\n",
74 | "Data exploration is also useful for identifying possible errors in the dataset: for example, some features may have values that fall outside the allowed range. Ranges are specified in the dataset description."
75 | ]
76 | },
77 | {
78 | "cell_type": "markdown",
79 | "metadata": {},
80 | "source": [
81 | "\n",
82 | "2. Data Pre-processing\n",
83 | ""
84 | ]
85 | },
86 | {
87 | "cell_type": "markdown",
88 | "metadata": {},
89 | "source": [
90 | "The previous step should give you a better understanding of which pre-processing is required for the data.\n",
91 | "This may include:\n",
92 | "\n",
93 | "- Normalising and standardising the given data;\n",
94 | "- Removing outliers;\n",
95 | "- Carrying out feature selection;\n",
96 | "- Handling missing information in the dataset;\n",
97 | "- Handling errors in the dataset;\n",
98 | "- Combining existing features."
99 | ]
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {},
104 | "source": [
105 | "\n",
106 | "3. Model Selection\n",
107 | ""
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "At this point, you should have a good understanding of the dataset, and have an idea about the possible candidate models. For example, you may try a multi-label classification model to predict all classes at once, or train different models, one for each label. In any case, it is important to justify your choices and compare the candidate models.\n",
115 | "\n",
116 | "You are free to choose any model you want, but you should be aware of some factors which may influence your decision:\n",
117 | "\n",
118 | "- What is the model's complexity?\n",
119 | "- Is the model interpretable?\n",
120 | "- Is the model able to handle imbalanced datasets?\n",
121 | "- Is the model capable of handling both numerical and categorical data?\n",
122 | "- Is the model able to handle missing values?\n",
123 | "- Does the model return uncertainty estimates along with predictions?\n",
124 | "\n",
125 | "An in-depth evaluation of competing models in view of this and other criteria will elevate the quality of your submission and earn you a higher grade. You may also try to build new labels by combining one or more labels (for example by doing an OR) and check if this impacts the performance of the model(s)."
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": [
132 | "\n",
133 | "3.1 Interpretable Models\n",
134 | ""
135 | ]
136 | },
137 | {
138 | "cell_type": "markdown",
139 | "metadata": {},
140 | "source": [
141 | "Being able to understand the output of a model is important in many fields, especially in anomaly detection. In linear regression, for example, the weights of the model can provide some hints about the importance of features, and this is a form of interpretability. Here, we focus on rule learning, a specific field of interpretable machine learning that provides interpretability through the use of rules. Examples of rule-based models are: \n",
142 | "\n",
143 | "- RIPPER\n",
144 | " - [Main Paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.107.2612&rep=rep1&type=pdf)\n",
145 | " - A fast and reliable implementation is JRIP by [WEKA](https://www.cs.waikato.ac.nz/~ml/weka/). You can also find unofficial Python implementations on GitHub.\n",
146 | "- Bayesian Rule Sets (BRS)\n",
147 | " - [Main Paper](http://jmlr.org/papers/volume18/16-003/16-003.pdf)\n",
148 | " - You can find a good implementation [here](https://pypi.org/project/ruleset/). You will probably need to install \"fim\" (pip install fim) before installing BRS.\n",
149 | "- Scalable Bayesian Rule Lists (SBRL)\n",
150 | " - [Main Paper](https://arxiv.org/pdf/1602.08610.pdf)\n",
151 | " - You can find a good implementation [here](https://github.com/myaooo/pysbrl). You will probably need to install \"fim\" (pip install fim) before installing SBRL.\n",
152 | "- and so on... \n",
153 | "\n",
154 | "Try to run at least one of the suggested models (you are free to try others as well) and comment:\n",
155 | "\n",
156 | "- Are rule-learning models able to provide the same predictive performance as previously tested models?\n",
157 | "- Are they faster or slower to train?\n",
158 | "- Do the learned rules look meaningful to you?\n",
159 | "- How many rules do these models learn?\n",
160 | "- How many conditions/atoms does each rule have on average?\n",
161 | "\n",
162 | "N.B. Since most of the rule-learning implementations deal with binary labels, you can train the model to predict one label of your choice."
163 | ]
164 | },
165 | {
166 | "cell_type": "markdown",
167 | "metadata": {},
168 | "source": [
169 | "\n",
170 | "4. Parameter Optimisation\n",
171 | ""
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
178 | "Irrespective of your choice, it is highly likely that your model will have one or more parameters that require tuning.\n",
179 | "There are several techniques for carrying out such a procedure, including cross-validation, Bayesian optimisation, and several others.\n",
180 | "As before, an analysis of which parameter tuning technique best suits your model is expected before proceeding with the optimisation of your model."
181 | ]
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {},
186 | "source": [
187 | "\n",
188 | "5. Model Evaluation\n",
189 | ""
190 | ]
191 | },
192 | {
193 | "cell_type": "markdown",
194 | "metadata": {},
195 | "source": [
196 | "Some form of pre-evaluation will inevitably be required in the preceding sections in order to both select an appropriate model and configure its parameters appropriately.\n",
197 | "In this final section, you may evaluate other aspects of the model such as:\n",
198 | "\n",
199 | "- Assessing the running time of your model;\n",
200 | "- Determining whether some aspects can be parallelised;\n",
201 | "- Training the model with smaller subsets of the data;\n",
202 | "- etc.\n",
203 | "\n",
204 | "For the evaluation of the classification results, you should compute the F1-score for each class and average the per-class scores (i.e. the macro-averaged F1-score).\n",
205 | "\n",
206 | "N.B. Please note that you are responsible for creating a sensible train/validation/test split. There is no predefined held-out test data."
207 | ]
208 | },
209 | {
210 | "cell_type": "markdown",
211 | "metadata": {},
212 | "source": [
213 | "\n",
214 | "*. Optional\n",
215 | ""
216 | ]
217 | },
218 | {
219 | "cell_type": "markdown",
220 | "metadata": {},
221 | "source": [
222 | "As you will see in the dataset description, the labels you are going to predict have no meaningful names. Try to understand which kinds of anomalies these labels refer to and give them sensible names. To do so, you could exploit the output of the interpretable models and/or use a statistical approach with the data you have."
223 | ]
224 | },
225 | {
226 | "cell_type": "markdown",
227 | "metadata": {},
228 | "source": [
229 | "\n",
230 | " N.B. Please note that the items listed under each heading are neither exhaustive, nor are you expected to explore every given suggestion.\n",
231 | " Nonetheless, these should serve as a guideline for your work in both this and upcoming challenges.\n",
232 | " As always, you should use your intuition and understanding in order to decide which analysis best suits the assigned task.\n",
233 | ""
234 | ]
235 | },
236 | {
237 | "cell_type": "markdown",
238 | "metadata": {},
239 | "source": [
240 | "\n",
241 | "Submission Instructions\n",
242 | "\n",
243 | ""
244 | ]
245 | },
246 | {
247 | "cell_type": "markdown",
248 | "metadata": {},
249 | "source": [
250 | "- The goal of this challenge is to construct one or more models to detect anomalies.\n",
251 | "- Your submission will be the HTML version of your notebook exploring the various modelling aspects described above."
252 | ]
253 | },
254 | {
255 | "cell_type": "markdown",
256 | "metadata": {},
257 | "source": [
258 | "\n",
259 | "Dataset Description\n",
260 | "\n",
261 | ""
262 | ]
263 | },
264 | {
265 | "cell_type": "markdown",
266 | "metadata": {},
267 | "source": [
268 | "#### * Location of the Dataset on zoe\n",
269 | "The data for this challenge is located at: `/mnt/datasets/anomaly`\n",
270 | "\n",
271 | "#### * Files\n",
272 | "\n",
273 | "You have a single CSV file with 36 features and 8 labels.\n",
274 | "Each record contains aggregate features computed over a given amount of time.\n",
275 | "\n",
276 | "#### * Attributes\n",
277 | "\n",
278 | "A brief outline of the available attributes is given below.\n",
279 | "\n",
280 | "* SessionNumber (INTEGER): it identifies the session on which data is collected;\n",
281 | "* SystemID (INTEGER): it identifies the system generating the data;\n",
282 | "* Date (DATE): collection date;\n",
283 | "* HighPriorityAlerts (INTEGER [0, N]): number of high priority alerts in the session;\n",
284 | "* Dumps (INTEGER [0, N]): number of memory dumps;\n",
285 | "* CleanupOOMDumps (INTEGER [0, N]): number of cleanup OOM dumps;\n",
286 | "* CompositeOOMDums (INTEGER [0, N]): number of composite OOM dumps;\n",
287 | "* IndexServerRestarts (INTEGER [0, N]): number of restarts of the index server;\n",
288 | "* NameServerRestarts (INTEGER [0, N]): number of restarts of the name server;\n",
289 | "* XSEngineRestarts (INTEGER [0, N]): number of restarts of the XSEngine;\n",
290 | "* PreprocessorRestarts (INTEGER [0, N]): number of restarts of the preprocessor;\n",
291 | "* DaemonRestarts (INTEGER [0, N]): number of restarts of the daemon process;\n",
292 | "* StatisticsServerRestarts (INTEGER [0, N]): number of restarts of the statistics server;\n",
293 | "* CPU (FLOAT [0, 100]): CPU usage;\n",
294 | "* PhysMEM (FLOAT [0, 100]): physical memory usage;\n",
295 | "* InstanceMEM (FLOAT [0, 100]): memory usage of one instance of the system;\n",
296 | "* TablesAllocation (FLOAT [0, 100]): memory allocated for tables;\n",
297 | "* IndexServerAllocationLimit (FLOAT [0, 100]): level of memory used by index server;\n",
298 | "* ColumnUnloads (INTEGER [0, N]): number of columns unloaded from the tables;\n",
299 | "* DeltaSize (INTEGER [0, N]): size of the delta store;\n",
300 | "* MergeErrors (BOOLEAN [0, 1]): 1 if there are merge errors;\n",
301 | "* BlockingPhaseSec (INTEGER [0, N]): blocking phase duration in seconds;\n",
302 | "* Disk (FLOAT [0, 100]): disk usage;\n",
303 | "* LargestTableSize (INTEGER [0, N]): size of the largest table;\n",
304 | "* LargestPartitionSize (INTEGER [0, N]): size of the largest partition of a table;\n",
305 | "* DiagnosisFiles (INTEGER [0, N]): number of diagnosis files;\n",
306 | "* DiagnosisFilesSize (INTEGER [0, N]): size of diagnosis files;\n",
307 | "* DaysWithSuccessfulDataBackups (INTEGER [0, N]): number of days with successful data backups;\n",
308 | "* DaysWithSuccessfulLogBackups (INTEGER [0, N]): number of days with successful log backups;\n",
309 | "* DaysWithFailedDataBackups (INTEGER [0, N]): number of days with failed data backups;\n",
310 | "* DaysWithFailedfulLogBackups (INTEGER [0, N]): number of days with failed log backups;\n",
311 | "* MinDailyNumberOfSuccessfulDataBackups (INTEGER [0, N]): minimum number of successful data backups per day;\n",
312 | "* MinDailyNumberOfSuccessfulLogBackups (INTEGER [0, N]): minimum number of successful log backups per day;\n",
313 | "* MaxDailyNumberOfFailedDataBackups (INTEGER [0, N]): maximum number of failed data backups per day;\n",
314 | "* MaxDailyNumberOfFailedLogBackups (INTEGER [0, N]): maximum number of failed log backups per day;\n",
315 | "* LogSegmentChange (INTEGER [0, N]): changes in the number of log segments.\n",
316 | "\n",
317 | "#### * Labels\n",
318 | "\n",
319 | "Labels are binary. Each label refers to a different anomaly.\n",
320 | "\n",
321 | "* Check1;\n",
322 | "* Check2;\n",
323 | "* Check3;\n",
324 | "* Check4;\n",
325 | "* Check5;\n",
326 | "* Check6;\n",
327 | "* Check7;\n",
328 | "* Check8;"
329 | ]
330 | },
331 | {
332 | "cell_type": "code",
333 | "execution_count": null,
334 | "metadata": {
335 | "collapsed": true
336 | },
337 | "outputs": [],
338 | "source": []
339 | }
340 | ],
341 | "metadata": {
342 | "kernelspec": {
343 | "display_name": "Python 3",
344 | "language": "python",
345 | "name": "python3"
346 | },
347 | "language_info": {
348 | "codemirror_mode": {
349 | "name": "ipython",
350 | "version": 3
351 | },
352 | "file_extension": ".py",
353 | "mimetype": "text/x-python",
354 | "name": "python",
355 | "nbconvert_exporter": "python",
356 | "pygments_lexer": "ipython3",
357 | "version": "3.5.2"
358 | }
359 | },
360 | "nbformat": 4,
361 | "nbformat_minor": 2
362 | }
363 |
--------------------------------------------------------------------------------
/Challenges/House_Pricing/challenge_data/Data description.rtf:
--------------------------------------------------------------------------------
1 | {\rtf1\ansi\ansicpg1252\cocoartf1561\cocoasubrtf200
2 | {\fonttbl\f0\fmodern\fcharset0 Courier;}
3 | {\colortbl;\red255\green255\blue255;\red0\green0\blue0;}
4 | {\*\expandedcolortbl;;\cssrgb\c0\c0\c0;}
5 | \paperw11900\paperh16840\margl1440\margr1440\vieww10800\viewh8400\viewkind0
6 | \deftab720
7 | \pard\pardeftab720\sl280\partightenfactor0
8 |
9 | \f0\fs24 \cf2 \expnd0\expndtw0\kerning0
10 | \outl0\strokewidth0 \strokec2 MSSubClass: Identifies the type of dwelling involved in the sale. \
11 | \
12 | 20 1-STORY 1946 & NEWER ALL STYLES\
13 | 30 1-STORY 1945 & OLDER\
14 | 40 1-STORY W/FINISHED ATTIC ALL AGES\
15 | 45 1-1/2 STORY - UNFINISHED ALL AGES\
16 | 50 1-1/2 STORY FINISHED ALL AGES\
17 | 60 2-STORY 1946 & NEWER\
18 | 70 2-STORY 1945 & OLDER\
19 | 75 2-1/2 STORY ALL AGES\
20 | 80 SPLIT OR MULTI-LEVEL\
21 | 85 SPLIT FOYER\
22 | 90 DUPLEX - ALL STYLES AND AGES\
23 | 120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER\
24 | 150 1-1/2 STORY PUD - ALL AGES\
25 | 160 2-STORY PUD - 1946 & NEWER\
26 | 180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER\
27 | 190 2 FAMILY CONVERSION - ALL STYLES AND AGES\
28 | \
29 | MSZoning: Identifies the general zoning classification of the sale.\
30 | \
31 | A Agriculture\
32 | C Commercial\
33 | FV Floating Village Residential\
34 | I Industrial\
35 | RH Residential High Density\
36 | RL Residential Low Density\
37 | RP Residential Low Density Park \
38 | RM Residential Medium Density\
39 | \
40 | LotFrontage: Linear feet of street connected to property\
41 | \
42 | LotArea: Lot size in square feet\
43 | \
44 | Street: Type of road access to property\
45 | \
46 | Grvl Gravel \
47 | Pave Paved\
48 | \
49 | Alley: Type of alley access to property\
50 | \
51 | Grvl Gravel\
52 | Pave Paved\
53 | NA No alley access\
54 | \
55 | LotShape: General shape of property\
56 | \
57 | Reg Regular \
58 | IR1 Slightly irregular\
59 | IR2 Moderately Irregular\
60 | IR3 Irregular\
61 | \
62 | LandContour: Flatness of the property\
63 | \
64 | Lvl Near Flat/Level \
65 | Bnk Banked - Quick and significant rise from street grade to building\
66 | HLS Hillside - Significant slope from side to side\
67 | Low Depression\
68 | \
69 | Utilities: Type of utilities available\
70 | \
71 | AllPub All public Utilities (E,G,W,& S) \
72 | NoSewr Electricity, Gas, and Water (Septic Tank)\
73 | NoSeWa Electricity and Gas Only\
74 | ELO Electricity only \
75 | \
76 | LotConfig: Lot configuration\
77 | \
78 | Inside Inside lot\
79 | Corner Corner lot\
80 | CulDSac Cul-de-sac\
81 | FR2 Frontage on 2 sides of property\
82 | FR3 Frontage on 3 sides of property\
83 | \
84 | LandSlope: Slope of property\
85 | \
86 | Gtl Gentle slope\
87 | Mod Moderate Slope \
88 | Sev Severe Slope\
89 | \
90 | Neighborhood: Physical locations within Ames city limits\
91 | \
92 | Blmngtn Bloomington Heights\
93 | Blueste Bluestem\
94 | BrDale Briardale\
95 | BrkSide Brookside\
96 | ClearCr Clear Creek\
97 | CollgCr College Creek\
98 | Crawfor Crawford\
99 | Edwards Edwards\
100 | Gilbert Gilbert\
101 | IDOTRR Iowa DOT and Rail Road\
102 | MeadowV Meadow Village\
103 | Mitchel Mitchell\
104 | Names North Ames\
105 | NoRidge Northridge\
106 | NPkVill Northpark Villa\
107 | NridgHt Northridge Heights\
108 | NWAmes Northwest Ames\
109 | OldTown Old Town\
110 | SWISU South & West of Iowa State University\
111 | Sawyer Sawyer\
112 | SawyerW Sawyer West\
113 | Somerst Somerset\
114 | StoneBr Stone Brook\
115 | Timber Timberland\
116 | Veenker Veenker\
117 | \
118 | Condition1: Proximity to various conditions\
119 | \
120 | Artery Adjacent to arterial street\
121 | Feedr Adjacent to feeder street \
122 | Norm Normal \
123 | RRNn Within 200' of North-South Railroad\
124 | RRAn Adjacent to North-South Railroad\
125 | PosN Near positive off-site feature--park, greenbelt, etc.\
126 | PosA Adjacent to positive off-site feature\
127 | RRNe Within 200' of East-West Railroad\
128 | RRAe Adjacent to East-West Railroad\
129 | \
130 | Condition2: Proximity to various conditions (if more than one is present)\
131 | \
132 | Artery Adjacent to arterial street\
133 | Feedr Adjacent to feeder street \
134 | Norm Normal \
135 | RRNn Within 200' of North-South Railroad\
136 | RRAn Adjacent to North-South Railroad\
137 | PosN Near positive off-site feature--park, greenbelt, etc.\
138 | PosA Adjacent to positive off-site feature\
139 | RRNe Within 200' of East-West Railroad\
140 | RRAe Adjacent to East-West Railroad\
141 | \
142 | BldgType: Type of dwelling\
143 | \
144 | 1Fam Single-family Detached \
145 | 2FmCon Two-family Conversion; originally built as one-family dwelling\
146 | Duplx Duplex\
147 | TwnhsE Townhouse End Unit\
148 | TwnhsI Townhouse Inside Unit\
149 | \
150 | HouseStyle: Style of dwelling\
151 | \
152 | 1Story One story\
153 | 1.5Fin One and one-half story: 2nd level finished\
154 | 1.5Unf One and one-half story: 2nd level unfinished\
155 | 2Story Two story\
156 | 2.5Fin Two and one-half story: 2nd level finished\
157 | 2.5Unf Two and one-half story: 2nd level unfinished\
158 | SFoyer Split Foyer\
159 | SLvl Split Level\
160 | \
161 | OverallQual: Rates the overall material and finish of the house\
162 | \
163 | 10 Very Excellent\
164 | 9 Excellent\
165 | 8 Very Good\
166 | 7 Good\
167 | 6 Above Average\
168 | 5 Average\
169 | 4 Below Average\
170 | 3 Fair\
171 | 2 Poor\
172 | 1 Very Poor\
173 | \
174 | OverallCond: Rates the overall condition of the house\
175 | \
176 | 10 Very Excellent\
177 | 9 Excellent\
178 | 8 Very Good\
179 | 7 Good\
180 | 6 Above Average \
181 | 5 Average\
182 | 4 Below Average \
183 | 3 Fair\
184 | 2 Poor\
185 | 1 Very Poor\
186 | \
187 | YearBuilt: Original construction date\
188 | \
189 | YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)\
190 | \
191 | RoofStyle: Type of roof\
192 | \
193 | Flat Flat\
194 | Gable Gable\
195 | Gambrel Gambrel (Barn)\
196 | Hip Hip\
197 | Mansard Mansard\
198 | Shed Shed\
199 | \
200 | RoofMatl: Roof material\
201 | \
202 | ClyTile Clay or Tile\
203 | CompShg Standard (Composite) Shingle\
204 | Membran Membrane\
205 | Metal Metal\
206 | Roll Roll\
207 | Tar&Grv Gravel & Tar\
208 | WdShake Wood Shakes\
209 | WdShngl Wood Shingles\
210 | \
211 | Exterior1st: Exterior covering on house\
212 | \
213 | AsbShng Asbestos Shingles\
214 | AsphShn Asphalt Shingles\
215 | BrkComm Brick Common\
216 | BrkFace Brick Face\
217 | CBlock Cinder Block\
218 | CemntBd Cement Board\
219 | HdBoard Hard Board\
220 | ImStucc Imitation Stucco\
221 | MetalSd Metal Siding\
222 | Other Other\
223 | Plywood Plywood\
224 | PreCast PreCast \
225 | Stone Stone\
226 | Stucco Stucco\
227 | VinylSd Vinyl Siding\
228 | Wd Sdng Wood Siding\
229 | WdShing Wood Shingles\
230 | \
231 | Exterior2nd: Exterior covering on house (if more than one material)\
232 | \
233 | AsbShng Asbestos Shingles\
234 | AsphShn Asphalt Shingles\
235 | BrkComm Brick Common\
236 | BrkFace Brick Face\
237 | CBlock Cinder Block\
238 | CemntBd Cement Board\
239 | HdBoard Hard Board\
240 | ImStucc Imitation Stucco\
241 | MetalSd Metal Siding\
242 | Other Other\
243 | Plywood Plywood\
244 | PreCast PreCast\
245 | Stone Stone\
246 | Stucco Stucco\
247 | VinylSd Vinyl Siding\
248 | Wd Sdng Wood Siding\
249 | WdShing Wood Shingles\
250 | \
251 | MasVnrType: Masonry veneer type\
252 | \
253 | BrkCmn Brick Common\
254 | BrkFace Brick Face\
255 | CBlock Cinder Block\
256 | None None\
257 | Stone Stone\
258 | \
259 | MasVnrArea: Masonry veneer area in square feet\
260 | \
261 | ExterQual: Evaluates the quality of the material on the exterior \
262 | \
263 | Ex Excellent\
264 | Gd Good\
265 | TA Average/Typical\
266 | Fa Fair\
267 | Po Poor\
268 | \
269 | ExterCond: Evaluates the present condition of the material on the exterior\
270 | \
271 | Ex Excellent\
272 | Gd Good\
273 | TA Average/Typical\
274 | Fa Fair\
275 | Po Poor\
276 | \
277 | Foundation: Type of foundation\
278 | \
279 | BrkTil Brick & Tile\
280 | CBlock Cinder Block\
281 | PConc Poured Concrete \
282 | Slab Slab\
283 | Stone Stone\
284 | Wood Wood\
285 | \
286 | BsmtQual: Evaluates the height of the basement\
287 | \
288 | Ex Excellent (100+ inches) \
289 | Gd Good (90-99 inches)\
290 | TA Typical (80-89 inches)\
291 | Fa Fair (70-79 inches)\
292 | Po Poor (<70 inches)\
293 | NA No Basement\
294 | \
295 | BsmtCond: Evaluates the general condition of the basement\
296 | \
297 | Ex Excellent\
298 | Gd Good\
299 | TA Typical - slight dampness allowed\
300 | Fa Fair - dampness or some cracking or settling\
301 | Po Poor - Severe cracking, settling, or wetness\
302 | NA No Basement\
303 | \
304 | BsmtExposure: Refers to walkout or garden level walls\
305 | \
306 | Gd Good Exposure\
307 | Av Average Exposure (split levels or foyers typically score average or above) \
308 | Mn Minimum Exposure\
309 | No No Exposure\
310 | NA No Basement\
311 | \
312 | BsmtFinType1: Rating of basement finished area\
313 | \
314 | GLQ Good Living Quarters\
315 | ALQ Average Living Quarters\
316 | BLQ Below Average Living Quarters \
317 | Rec Average Rec Room\
318 | LwQ Low Quality\
319 | Unf Unfinished\
320 | NA No Basement\
321 | \
322 | BsmtFinSF1: Type 1 finished square feet\
323 | \
324 | BsmtFinType2: Rating of basement finished area (if multiple types)\
325 | \
326 | GLQ Good Living Quarters\
327 | ALQ Average Living Quarters\
328 | BLQ Below Average Living Quarters \
329 | Rec Average Rec Room\
330 | LwQ Low Quality\
331 | Unf Unfinished\
332 | NA No Basement\
333 | \
334 | BsmtFinSF2: Type 2 finished square feet\
335 | \
336 | BsmtUnfSF: Unfinished square feet of basement area\
337 | \
338 | TotalBsmtSF: Total square feet of basement area\
339 | \
340 | Heating: Type of heating\
341 | \
342 | Floor Floor Furnace\
343 | GasA Gas forced warm air furnace\
344 | GasW Gas hot water or steam heat\
345 | Grav Gravity furnace \
346 | OthW Hot water or steam heat other than gas\
347 | Wall Wall furnace\
348 | \
349 | HeatingQC: Heating quality and condition\
350 | \
351 | Ex Excellent\
352 | Gd Good\
353 | TA Average/Typical\
354 | Fa Fair\
355 | Po Poor\
356 | \
357 | CentralAir: Central air conditioning\
358 | \
359 | N No\
360 | Y Yes\
361 | \
362 | Electrical: Electrical system\
363 | \
364 | SBrkr Standard Circuit Breakers & Romex\
365 | FuseA Fuse Box over 60 AMP and all Romex wiring (Average) \
366 | FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)\
367 | FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)\
368 | Mix Mixed\
369 | \
370 | 1stFlrSF: First Floor square feet\
371 | \
372 | 2ndFlrSF: Second floor square feet\
373 | \
374 | LowQualFinSF: Low quality finished square feet (all floors)\
375 | \
376 | GrLivArea: Above grade (ground) living area square feet\
377 | \
378 | BsmtFullBath: Basement full bathrooms\
379 | \
380 | BsmtHalfBath: Basement half bathrooms\
381 | \
382 | FullBath: Full bathrooms above grade\
383 | \
384 | HalfBath: Half baths above grade\
385 | \
386 | Bedroom: Bedrooms above grade (does NOT include basement bedrooms)\
387 | \
388 | Kitchen: Kitchens above grade\
389 | \
390 | KitchenQual: Kitchen quality\
391 | \
392 | Ex Excellent\
393 | Gd Good\
394 | TA Typical/Average\
395 | Fa Fair\
396 | Po Poor\
397 | \
398 | TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)\
399 | \
400 | Functional: Home functionality (Assume typical unless deductions are warranted)\
401 | \
402 | Typ Typical Functionality\
403 | Min1 Minor Deductions 1\
404 | Min2 Minor Deductions 2\
405 | Mod Moderate Deductions\
406 | Maj1 Major Deductions 1\
407 | Maj2 Major Deductions 2\
408 | Sev Severely Damaged\
409 | Sal Salvage only\
410 | \
411 | Fireplaces: Number of fireplaces\
412 | \
413 | FireplaceQu: Fireplace quality\
414 | \
415 | Ex Excellent - Exceptional Masonry Fireplace\
416 | Gd Good - Masonry Fireplace in main level\
417 | TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement\
418 | Fa Fair - Prefabricated Fireplace in basement\
419 | Po Poor - Ben Franklin Stove\
420 | NA No Fireplace\
421 | \
422 | GarageType: Garage location\
423 | \
424 | 2Types More than one type of garage\
425 | Attchd Attached to home\
426 | Basment Basement Garage\
427 | BuiltIn Built-In (Garage part of house - typically has room above garage)\
428 | CarPort Car Port\
429 | Detchd Detached from home\
430 | NA No Garage\
431 | \
432 | GarageYrBlt: Year garage was built\
433 | \
434 | GarageFinish: Interior finish of the garage\
435 | \
436 | Fin Finished\
437 | RFn Rough Finished \
438 | Unf Unfinished\
439 | NA No Garage\
440 | \
441 | GarageCars: Size of garage in car capacity\
442 | \
443 | GarageArea: Size of garage in square feet\
444 | \
445 | GarageQual: Garage quality\
446 | \
447 | Ex Excellent\
448 | Gd Good\
449 | TA Typical/Average\
450 | Fa Fair\
451 | Po Poor\
452 | NA No Garage\
453 | \
454 | GarageCond: Garage condition\
455 | \
456 | Ex Excellent\
457 | Gd Good\
458 | TA Typical/Average\
459 | Fa Fair\
460 | Po Poor\
461 | NA No Garage\
462 | \
463 | PavedDrive: Paved driveway\
464 | \
465 | Y Paved \
466 | P Partial Pavement\
467 | N Dirt/Gravel\
468 | \
469 | WoodDeckSF: Wood deck area in square feet\
470 | \
471 | OpenPorchSF: Open porch area in square feet\
472 | \
473 | EnclosedPorch: Enclosed porch area in square feet\
474 | \
475 | 3SsnPorch: Three season porch area in square feet\
476 | \
477 | ScreenPorch: Screen porch area in square feet\
478 | \
479 | PoolArea: Pool area in square feet\
480 | \
481 | PoolQC: Pool quality\
482 | \
483 | Ex Excellent\
484 | Gd Good\
485 | TA Average/Typical\
486 | Fa Fair\
487 | NA No Pool\
488 | \
489 | Fence: Fence quality\
490 | \
491 | GdPrv Good Privacy\
492 | MnPrv Minimum Privacy\
493 | GdWo Good Wood\
494 | MnWw Minimum Wood/Wire\
495 | NA No Fence\
496 | \
497 | MiscFeature: Miscellaneous feature not covered in other categories\
498 | \
499 | Elev Elevator\
500 | Gar2 2nd Garage (if not described in garage section)\
501 | Othr Other\
502 | Shed Shed (over 100 SF)\
503 | TenC Tennis Court\
504 | NA None\
505 | \
506 | MiscVal: $Value of miscellaneous feature\
507 | \
508 | MoSold: Month Sold (MM)\
509 | \
510 | YrSold: Year Sold (YYYY)\
511 | \
512 | SaleType: Type of sale\
513 | \
514 | WD Warranty Deed - Conventional\
515 | CWD Warranty Deed - Cash\
516 | VWD Warranty Deed - VA Loan\
517 | New Home just constructed and sold\
518 | COD Court Officer Deed/Estate\
519 | Con Contract 15% Down payment regular terms\
520 | ConLw Contract Low Down payment and low interest\
521 | ConLI Contract Low Interest\
522 | ConLD Contract Low Down\
523 | Oth Other\
524 | \
525 | SaleCondition: Condition of sale\
526 | \
527 | Normal Normal Sale\
528 | Abnorml Abnormal Sale - trade, foreclosure, short sale\
529 | AdjLand Adjoining Land Purchase\
530 | Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit \
531 | Family Sale between family members\
532 | Partial Home was not completed when last assessed (associated with New Homes)\
533 | Ch}
--------------------------------------------------------------------------------
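Several fields in the data description above (for example `PoolQC`, `Fence` and `MiscFeature`) use the code `NA` to mean "feature absent", not "value unknown". The sketch below shows one way to make that explicit when reading the data. It is an illustration under stated assumptions, not part of the challenge material: the helper name `decode_na`, the mapping `NA_MEANS_ABSENT` and the two sample rows are hypothetical, and only the column names and their `NA` meanings are taken from the description.

```python
import csv
import io

# Columns where the data description defines "NA" as a real category.
# The meanings are copied from the description; the dict itself is hypothetical.
NA_MEANS_ABSENT = {
    "PoolQC": "No Pool",
    "Fence": "No Fence",
    "MiscFeature": "None",
}


def decode_na(row):
    """Return a copy of row with documented 'NA' codes replaced by their meaning."""
    decoded = dict(row)
    for column, meaning in NA_MEANS_ABSENT.items():
        if decoded.get(column) == "NA":
            decoded[column] = meaning
    return decoded


# Invented two-row sample in a train.csv-like layout (values are illustrative only).
sample = io.StringIO(
    "Id,PoolQC,Fence,MiscFeature\n"
    "1,NA,MnPrv,NA\n"
    "2,Ex,NA,Shed\n"
)

rows = [decode_na(r) for r in csv.DictReader(sample)]
for r in rows:
    print(r)
```

The standard-library `csv` module keeps `NA` as a plain string, so the replacement is lossless. With pandas, `read_csv` treats `NA` as a missing value by default, so a similar step (or `keep_default_na=False`) would be needed to keep these categories distinct from truly missing entries.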
/Challenges/House_Pricing/challenge_data/sample_submission.csv:
--------------------------------------------------------------------------------
1 | Id,SalePrice
2 | 1461,169277.0524984
3 | 1462,187758.393988768
4 | 1463,183583.683569555
5 | 1464,179317.47751083
6 | 1465,150730.079976501
7 | 1466,177150.989247307
8 | 1467,172070.659229164
9 | 1468,175110.956519547
10 | 1469,162011.698831665
11 | 1470,160726.247831419
12 | 1471,157933.279456005
13 | 1472,145291.245020389
14 | 1473,159672.017631819
15 | 1474,164167.518301885
16 | 1475,150891.638244053
17 | 1476,179460.96518734
18 | 1477,185034.62891405
19 | 1478,182352.192644656
20 | 1479,183053.458213802
21 | 1480,187823.339254278
22 | 1481,186544.114327568
23 | 1482,158230.77520516
24 | 1483,190552.829321091
25 | 1484,147183.67487199
26 | 1485,185855.300905493
27 | 1486,174350.470676986
28 | 1487,201740.620690863
29 | 1488,162986.378895754
30 | 1489,162330.199085679
31 | 1490,165845.938616539
32 | 1491,180929.622876974
33 | 1492,163481.501519718
34 | 1493,187798.076714233
35 | 1494,198822.198942566
36 | 1495,194868.409899858
37 | 1496,152605.298564403
38 | 1497,147797.702836811
39 | 1498,150521.96899297
40 | 1499,146991.630153739
41 | 1500,150306.307814534
42 | 1501,151164.372534604
43 | 1502,151133.706960953
44 | 1503,156214.042540726
45 | 1504,171992.760735142
46 | 1505,173214.912549738
47 | 1506,192429.187345783
48 | 1507,190878.69508543
49 | 1508,194542.544135519
50 | 1509,191849.439072822
51 | 1510,176363.773907793
52 | 1511,176954.185412429
53 | 1512,176521.216975696
54 | 1513,179436.704810176
55 | 1514,220079.756777048
56 | 1515,175502.918109444
57 | 1516,188321.073833569
58 | 1517,163276.324450004
59 | 1518,185911.366293097
60 | 1519,171392.830997252
61 | 1520,174418.207020775
62 | 1521,179682.709603774
63 | 1522,179423.751581665
64 | 1523,171756.918091777
65 | 1524,166849.638174419
66 | 1525,181122.168676666
67 | 1526,170934.462746566
68 | 1527,159738.292580329
69 | 1528,174445.759557658
70 | 1529,174706.363659627
71 | 1530,164507.672539365
72 | 1531,163602.512172832
73 | 1532,154126.270249525
74 | 1533,171104.853481351
75 | 1534,167735.39270528
76 | 1535,183003.613338104
77 | 1536,172580.381161499
78 | 1537,165407.889104689
79 | 1538,176363.773907793
80 | 1539,175182.950898522
81 | 1540,190757.177789246
82 | 1541,167186.995771991
83 | 1542,167839.376779276
84 | 1543,173912.421165137
85 | 1544,154034.917445551
86 | 1545,156002.955794336
87 | 1546,168173.94329857
88 | 1547,168882.437104132
89 | 1548,168173.94329857
90 | 1549,157580.177551642
91 | 1550,181922.15256011
92 | 1551,155134.227842592
93 | 1552,188885.573319552
94 | 1553,183963.193012381
95 | 1554,161298.762306335
96 | 1555,188613.66763056
97 | 1556,175080.111822945
98 | 1557,174744.400305232
99 | 1558,168175.911336919
100 | 1559,182333.472575006
101 | 1560,158307.206742274
102 | 1561,193053.055502348
103 | 1562,175031.089987177
104 | 1563,160713.294602908
105 | 1564,173186.215014436
106 | 1565,191736.7598055
107 | 1566,170401.630997116
108 | 1567,164626.577880222
109 | 1568,205469.409444832
110 | 1569,209561.784211885
111 | 1570,182271.503072356
112 | 1571,178081.549427793
113 | 1572,178425.956138831
114 | 1573,162015.318511503
115 | 1574,181722.420373045
116 | 1575,156705.730169433
117 | 1576,182902.420342386
118 | 1577,157574.595395085
119 | 1578,184380.739100813
120 | 1579,169364.469225677
121 | 1580,175846.179822063
122 | 1581,189673.295302136
123 | 1582,174401.317715566
124 | 1583,179021.448718583
125 | 1584,189196.845337149
126 | 1585,139647.095720655
127 | 1586,161468.198288911
128 | 1587,171557.32317862
129 | 1588,179447.36804185
130 | 1589,169611.619017694
131 | 1590,172088.872655744
132 | 1591,171190.624128768
133 | 1592,154850.508361878
134 | 1593,158617.655719941
135 | 1594,209258.33693701
136 | 1595,177939.027626751
137 | 1596,194631.100299584
138 | 1597,213618.871562568
139 | 1598,198342.504228533
140 | 1599,138607.971472497
141 | 1600,150778.958976731
142 | 1601,146966.230339786
143 | 1602,162182.59620952
144 | 1603,176825.940961269
145 | 1604,152799.812402444
146 | 1605,180322.322067129
147 | 1606,177508.027228367
148 | 1607,208029.642652019
149 | 1608,181987.282510201
150 | 1609,160172.72797397
151 | 1610,176761.317654248
152 | 1611,176515.497545231
153 | 1612,176270.453065471
154 | 1613,183050.846258475
155 | 1614,150011.102062216
156 | 1615,159270.537808667
157 | 1616,163419.663729346
158 | 1617,163399.983345859
159 | 1618,173364.161505756
160 | 1619,169556.835902417
161 | 1620,183690.595995738
162 | 1621,176980.914909382
163 | 1622,204773.36222471
164 | 1623,174728.655998442
165 | 1624,181873.458244461
166 | 1625,177322.000823979
167 | 1626,193927.939041863
168 | 1627,181715.622732304
169 | 1628,199270.841200324
170 | 1629,177109.589956218
171 | 1630,153909.578271486
172 | 1631,162931.203336223
173 | 1632,166386.7567182
174 | 1633,173719.30379824
175 | 1634,179757.925656704
176 | 1635,179007.601964376
177 | 1636,180370.808623106
178 | 1637,185102.616730563
179 | 1638,198825.563452058
180 | 1639,184294.576009142
181 | 1640,200443.7920562
182 | 1641,181294.784484153
183 | 1642,174354.336267919
184 | 1643,172023.677781517
185 | 1644,181666.922855025
186 | 1645,179024.491269586
187 | 1646,178324.191575907
188 | 1647,184534.676687694
189 | 1648,159397.250378784
190 | 1649,178430.966728182
191 | 1650,177743.799385967
192 | 1651,179395.305519087
193 | 1652,151713.38474815
194 | 1653,151713.38474815
195 | 1654,168434.977996215
196 | 1655,153999.100311019
197 | 1656,164096.097354123
198 | 1657,166335.403036551
199 | 1658,163020.725375757
200 | 1659,155862.510668829
201 | 1660,182760.651095509
202 | 1661,201912.270622883
203 | 1662,185988.233987516
204 | 1663,183778.44888032
205 | 1664,170935.85921771
206 | 1665,184468.908382254
207 | 1666,191569.089663229
208 | 1667,232991.025583822
209 | 1668,180980.721388278
210 | 1669,164279.13048219
211 | 1670,183859.460411109
212 | 1671,185922.465682076
213 | 1672,191742.778119363
214 | 1673,199954.072465842
215 | 1674,180690.274752587
216 | 1675,163099.3096358
217 | 1676,140791.922472443
218 | 1677,166481.86647592
219 | 1678,172080.434496773
220 | 1679,191719.161659178
221 | 1680,160741.098612515
222 | 1681,157829.546854733
223 | 1682,196896.748596341
224 | 1683,159675.423990355
225 | 1684,182084.790901946
226 | 1685,179233.926374487
227 | 1686,155774.270901623
228 | 1687,181354.326716058
229 | 1688,179605.563663918
230 | 1689,181609.34866147
231 | 1690,178221.531623281
232 | 1691,175559.920735795
233 | 1692,200328.822792041
234 | 1693,178630.060559899
235 | 1694,177174.535221728
236 | 1695,172515.687368714
237 | 1696,204032.992922943
238 | 1697,176023.232787689
239 | 1698,202202.073341595
240 | 1699,181734.480075862
241 | 1700,183982.158993126
242 | 1701,188007.94241481
243 | 1702,185922.966763517
244 | 1703,183978.544874918
245 | 1704,177199.618638821
246 | 1705,181878.647956764
247 | 1706,173622.088728263
248 | 1707,180728.168562655
249 | 1708,176477.026606328
250 | 1709,184282.266697609
251 | 1710,162062.47538448
252 | 1711,182550.070992189
253 | 1712,180987.949624695
254 | 1713,178173.79762147
255 | 1714,179980.635948606
256 | 1715,173257.637826205
257 | 1716,177271.291059307
258 | 1717,175338.355442312
259 | 1718,177548.140549508
260 | 1719,175969.91662932
261 | 1720,175011.481953462
262 | 1721,185199.372568143
263 | 1722,188514.050228937
264 | 1723,185080.145268797
265 | 1724,157304.402574096
266 | 1725,194260.859481297
267 | 1726,181262.329995106
268 | 1727,157003.292706732
269 | 1728,182924.499359899
270 | 1729,181902.586375439
271 | 1730,188985.371708134
272 | 1731,185290.904495068
273 | 1732,177304.425752748
274 | 1733,166274.900490809
275 | 1734,177807.420530107
276 | 1735,180330.624816201
277 | 1736,179069.112234629
278 | 1737,175943.371816948
279 | 1738,185199.050609653
280 | 1739,167350.910824524
281 | 1740,149315.311876449
282 | 1741,139010.847766793
283 | 1742,155412.151845447
284 | 1743,171308.313985441
285 | 1744,176220.543265638
286 | 1745,177643.434991809
287 | 1746,187222.653264601
288 | 1747,185635.132083154
289 | 1748,206492.534215854
290 | 1749,181681.021081956
291 | 1750,180500.198072685
292 | 1751,206486.17086841
293 | 1752,161334.301195429
294 | 1753,176156.558313965
295 | 1754,191642.223478994
296 | 1755,191945.808027777
297 | 1756,164146.306037354
298 | 1757,179883.057071096
299 | 1758,178071.137668844
300 | 1759,188241.637896875
301 | 1760,174559.656173171
302 | 1761,182347.363042264
303 | 1762,191507.251872857
304 | 1763,199751.865597358
305 | 1764,162106.416145131
306 | 1765,164575.982314367
307 | 1766,179176.352180931
308 | 1767,177327.403857584
309 | 1768,177818.083761781
310 | 1769,186965.204048443
311 | 1770,178762.742169197
312 | 1771,183322.866146283
313 | 1772,178903.295931891
314 | 1773,186570.129421778
315 | 1774,199144.242829024
316 | 1775,172154.713310956
317 | 1776,177444.019201603
318 | 1777,166200.938073485
319 | 1778,158995.770555632
320 | 1779,168273.282454755
321 | 1780,189680.453052788
322 | 1781,181681.021081956
323 | 1782,160277.142643643
324 | 1783,197318.54715833
325 | 1784,162228.935604196
326 | 1785,187340.455456083
327 | 1786,181065.347037275
328 | 1787,190233.609102705
329 | 1788,157929.594852031
330 | 1789,168557.001935469
331 | 1790,160805.584645628
332 | 1791,221648.391978216
333 | 1792,180539.88079815
334 | 1793,182105.616283853
335 | 1794,166380.852603154
336 | 1795,178942.155617426
337 | 1796,162804.747800461
338 | 1797,183077.684392615
339 | 1798,171728.4720292
340 | 1799,164786.741540638
341 | 1800,177427.267170302
342 | 1801,197318.54715833
343 | 1802,178658.114178223
344 | 1803,185437.320523764
345 | 1804,169759.652489529
346 | 1805,173986.635055186
347 | 1806,168607.664289468
348 | 1807,194138.519145183
349 | 1808,192502.440921994
350 | 1809,176746.969818601
351 | 1810,177604.891703134
352 | 1811,193283.746584832
353 | 1812,181627.061006609
354 | 1813,169071.62025834
355 | 1814,167398.006470987
356 | 1815,150106.505141704
357 | 1816,159650.304285848
358 | 1817,179471.23597476
359 | 1818,177109.589956218
360 | 1819,166558.113328453
361 | 1820,153796.714319583
362 | 1821,174520.152570658
363 | 1822,196297.95829524
364 | 1823,169100.681601175
365 | 1824,176911.319164431
366 | 1825,169234.6454828
367 | 1826,172386.297919134
368 | 1827,156031.904802362
369 | 1828,168202.892306596
370 | 1829,166505.984017547
371 | 1830,176507.37022149
372 | 1831,180116.752553161
373 | 1832,183072.740591406
374 | 1833,189595.964677698
375 | 1834,167523.919076265
376 | 1835,210817.775863413
377 | 1836,172942.930813351
378 | 1837,145286.278144089
379 | 1838,176468.653371492
380 | 1839,159040.069562187
381 | 1840,178518.204332507
382 | 1841,169163.980786825
383 | 1842,189786.685274579
384 | 1843,181246.728523853
385 | 1844,176349.927153587
386 | 1845,205266.631009142
387 | 1846,187397.993362224
388 | 1847,208943.427726113
389 | 1848,165014.532907657
390 | 1849,182492.037566236
391 | 1850,161718.71259042
392 | 1851,180084.118941162
393 | 1852,178534.950802179
394 | 1853,151217.259961305
395 | 1854,156342.717587562
396 | 1855,188511.443835239
397 | 1856,183570.337896789
398 | 1857,225810.160292177
399 | 1858,214217.401131694
400 | 1859,187665.64101603
401 | 1860,161157.177744039
402 | 1861,187643.992594193
403 | 1862,228156.372839158
404 | 1863,220449.534665317
405 | 1864,220522.352084222
406 | 1865,156647.763531624
407 | 1866,187388.833374873
408 | 1867,178640.723791573
409 | 1868,180847.216739049
410 | 1869,159505.170529478
411 | 1870,164305.538020654
412 | 1871,180181.19673723
413 | 1872,184602.734989972
414 | 1873,193440.372174434
415 | 1874,184199.788209911
416 | 1875,196241.892907637
417 | 1876,175588.618271096
418 | 1877,179503.046546829
419 | 1878,183658.076582555
420 | 1879,193700.976276404
421 | 1880,165399.62450704
422 | 1881,186847.944787446
423 | 1882,198127.73287817
424 | 1883,183320.898107934
425 | 1884,181613.606696657
426 | 1885,178298.791761954
427 | 1886,185733.534000593
428 | 1887,180008.188485489
429 | 1888,175127.59621604
430 | 1889,183467.176862723
431 | 1890,182705.546021743
432 | 1891,152324.943593181
433 | 1892,169878.515981342
434 | 1893,183735.975076576
435 | 1894,224118.280105941
436 | 1895,169355.202465146
437 | 1896,180054.276407441
438 | 1897,174081.601977368
439 | 1898,168494.985022146
440 | 1899,181871.598843299
441 | 1900,173554.489658383
442 | 1901,169805.382165577
443 | 1902,176192.990728755
444 | 1903,204264.39284654
445 | 1904,169630.906956928
446 | 1905,185724.838807268
447 | 1906,195699.036281861
448 | 1907,189494.276162169
449 | 1908,149607.905673439
450 | 1909,154650.199045978
451 | 1910,151579.558140433
452 | 1911,185147.380531144
453 | 1912,196314.53120359
454 | 1913,210802.395364155
455 | 1914,166271.2863726
456 | 1915,154865.359142973
457 | 1916,173575.5052865
458 | 1917,179399.563554274
459 | 1918,164280.776562049
460 | 1919,171247.48948121
461 | 1920,166878.587182445
462 | 1921,188129.459710994
463 | 1922,183517.34369691
464 | 1923,175522.026925727
465 | 1924,190060.105331152
466 | 1925,174179.824771856
467 | 1926,171059.523675194
468 | 1927,183004.186769318
469 | 1928,183601.647387418
470 | 1929,163539.327185998
471 | 1930,164677.676391525
472 | 1931,162395.073865424
473 | 1932,182207.6323195
474 | 1933,192223.939790304
475 | 1934,176391.829390125
476 | 1935,181913.179121348
477 | 1936,179136.097888261
478 | 1937,196595.568243212
479 | 1938,194822.365690957
480 | 1939,148356.669440918
481 | 1940,160387.604263899
482 | 1941,181276.500571809
483 | 1942,192474.817899346
484 | 1943,157699.907796437
485 | 1944,215785.540813051
486 | 1945,181824.300998793
487 | 1946,221813.00948166
488 | 1947,165281.292597397
489 | 1948,255629.49047034
490 | 1949,173154.590990955
491 | 1950,183884.65246539
492 | 1951,200210.353608489
493 | 1952,186599.221265342
494 | 1953,192718.532696106
495 | 1954,178628.665952764
496 | 1955,180650.342418406
497 | 1956,206003.107947263
498 | 1957,166457.67844853
499 | 1958,202916.221653487
500 | 1959,192463.969983091
501 | 1960,171775.497189898
502 | 1961,175249.222149411
503 | 1962,147086.59893993
504 | 1963,149709.672100371
505 | 1964,171411.404533743
506 | 1965,178188.964799425
507 | 1966,156491.711373235
508 | 1967,180953.241201168
509 | 1968,203909.759061135
510 | 1969,175470.149087545
511 | 1970,205578.333622415
512 | 1971,199428.857699441
513 | 1972,187599.163869476
514 | 1973,192265.198109864
515 | 1974,196666.554897677
516 | 1975,155537.862252682
517 | 1976,169543.240620935
518 | 1977,202487.010170501
519 | 1978,208232.716273485
520 | 1979,173621.195202569
521 | 1980,172414.608571812
522 | 1981,164400.75641556
523 | 1982,160480.424024781
524 | 1983,156060.853810389
525 | 1984,157437.192820581
526 | 1985,158163.720929772
527 | 1986,154849.043268978
528 | 1987,152186.609341561
529 | 1988,180340.215399228
530 | 1989,178344.62451356
531 | 1990,190170.382266827
532 | 1991,168092.975480832
533 | 1992,178757.912566805
534 | 1993,174518.256882082
535 | 1994,198168.490116289
536 | 1995,176882.693978902
537 | 1996,183801.672896251
538 | 1997,196400.046680661
539 | 1998,172281.605004025
540 | 1999,196380.366297173
541 | 2000,198228.354306682
542 | 2001,195556.581268962
543 | 2002,186453.264469043
544 | 2003,181869.381196234
545 | 2004,175610.840124147
546 | 2005,183438.730800145
547 | 2006,179584.488673295
548 | 2007,182386.152242034
549 | 2008,160750.367237054
550 | 2009,182477.505046008
551 | 2010,187720.359207171
552 | 2011,187201.942081511
553 | 2012,176385.102235149
554 | 2013,175901.787841278
555 | 2014,182584.280198283
556 | 2015,195664.686104237
557 | 2016,181420.346494222
558 | 2017,176676.04995228
559 | 2018,181594.678867334
560 | 2019,178521.747964951
561 | 2020,175895.883726231
562 | 2021,168468.005916477
563 | 2022,200973.129447888
564 | 2023,197030.641992202
565 | 2024,192867.417844592
566 | 2025,196449.247639381
567 | 2026,141684.196398607
568 | 2027,153353.334123901
569 | 2028,151143.549016705
570 | 2029,163753.087114229
571 | 2030,158682.460013921
572 | 2031,144959.835250915
573 | 2032,160144.390548579
574 | 2033,156286.534303521
575 | 2034,165726.707619571
576 | 2035,182427.481047359
577 | 2036,173310.56154032
578 | 2037,173310.56154032
579 | 2038,151556.01403002
580 | 2039,158908.146068683
581 | 2040,209834.383092536
582 | 2041,192410.516550815
583 | 2042,174026.247294886
584 | 2043,195499.830115336
585 | 2044,200918.018812493
586 | 2045,207243.616023976
587 | 2046,196149.783851876
588 | 2047,192097.914850217
589 | 2048,178570.948923671
590 | 2049,228617.968325428
591 | 2050,199929.884438451
592 | 2051,160206.365612859
593 | 2052,179854.431885567
594 | 2053,185987.340461822
595 | 2054,161122.505607926
596 | 2055,175949.342720138
597 | 2056,183683.590595324
598 | 2057,176401.34762338
599 | 2058,205832.532527897
600 | 2059,177799.799849436
601 | 2060,167565.362080406
602 | 2061,186348.958436557
603 | 2062,179782.759465081
604 | 2063,169837.623333323
605 | 2064,178817.275675758
606 | 2065,174444.479149339
607 | 2066,192834.968917174
608 | 2067,196564.717984981
609 | 2068,206977.567039357
610 | 2069,157054.253944128
611 | 2070,175142.948078577
612 | 2071,159932.1643654
613 | 2072,182801.408333628
614 | 2073,181510.375176825
615 | 2074,181613.035129451
616 | 2075,186920.512597635
617 | 2076,157950.170625222
618 | 2077,176115.159022876
619 | 2078,182744.514344465
620 | 2079,180660.683691591
621 | 2080,160775.629777099
622 | 2081,186711.715848082
623 | 2082,223581.758190888
624 | 2083,172330.943236652
625 | 2084,163474.633393212
626 | 2085,175308.263299874
627 | 2086,187462.725306432
628 | 2087,180655.101535034
629 | 2088,152121.98603454
630 | 2089,159856.233909727
631 | 2090,186559.854936737
632 | 2091,183962.550959411
633 | 2092,162107.168699296
634 | 2093,162582.288981283
635 | 2094,154407.701597409
636 | 2095,181625.666399474
637 | 2096,164810.609473548
638 | 2097,176429.401241704
639 | 2098,179188.089925259
640 | 2099,145997.635377703
641 | 2100,218676.768270367
642 | 2101,188323.861214226
643 | 2102,168690.0722914
644 | 2103,165088.746797705
645 | 2104,191435.007885166
646 | 2105,168864.404664512
647 | 2106,176041.882371574
648 | 2107,215911.674390325
649 | 2108,167388.238629016
650 | 2109,163854.786753017
651 | 2110,163299.477980171
652 | 2111,178298.214633119
653 | 2112,176376.586164775
654 | 2113,170211.043976522
655 | 2114,170818.344786366
656 | 2115,174388.867432503
657 | 2116,161112.987374671
658 | 2117,172179.082325307
659 | 2118,157798.309713876
660 | 2119,169106.151422924
661 | 2120,170129.531364292
662 | 2121,157680.227412949
663 | 2122,162690.209131977
664 | 2123,146968.379365095
665 | 2124,181507.721372455
666 | 2125,191215.589752983
667 | 2126,189432.689844522
668 | 2127,207271.484957719
669 | 2128,170030.807488363
670 | 2129,148409.806476335
671 | 2130,193850.613979055
672 | 2131,193808.319298263
673 | 2132,166300.235380627
674 | 2133,163474.633393212
675 | 2134,177473.606564978
676 | 2135,157443.925537187
677 | 2136,180681.007992057
678 | 2137,183463.17030026
679 | 2138,182481.763081195
680 | 2139,193717.15117887
681 | 2140,182782.55099007
682 | 2141,175530.651633287
683 | 2142,177804.057884623
684 | 2143,159448.670848577
685 | 2144,181338.976717529
686 | 2145,178553.558537021
687 | 2146,162820.928264556
688 | 2147,188832.479997186
689 | 2148,164682.185899437
690 | 2149,181549.735943801
691 | 2150,199158.097008868
692 | 2151,152889.520990566
693 | 2152,181150.551679116
694 | 2153,181416.732376013
695 | 2154,164391.238182305
696 | 2155,185421.046498812
697 | 2156,193981.327550004
698 | 2157,178824.324789223
699 | 2158,209270.051606246
700 | 2159,177801.266806344
701 | 2160,179053.762236101
702 | 2161,178762.170601992
703 | 2162,184655.300458183
704 | 2163,191284.655779772
705 | 2164,179598.085818785
706 | 2165,167517.628078595
707 | 2166,182873.903794044
708 | 2167,177484.91371363
709 | 2168,188444.597319524
710 | 2169,179184.153848562
711 | 2170,184365.175780982
712 | 2171,184479.322005212
713 | 2172,182927.863869391
714 | 2173,178611.639373646
715 | 2174,181943.343613558
716 | 2175,175080.614768394
717 | 2176,190720.794649138
718 | 2177,198422.868144723
719 | 2178,184482.11308349
720 | 2179,139214.952187861
721 | 2180,169233.113601757
722 | 2181,180664.118686848
723 | 2182,178818.742632666
724 | 2183,180422.049969947
725 | 2184,178601.93645581
726 | 2185,183083.159775993
727 | 2186,173163.101499699
728 | 2187,185968.161159774
729 | 2188,171226.050683054
730 | 2189,281643.976116786
731 | 2190,160031.711281258
732 | 2191,162775.979779394
733 | 2192,160735.445970193
734 | 2193,166646.109048572
735 | 2194,188384.548444549
736 | 2195,165830.697255197
737 | 2196,182138.358533039
738 | 2197,171595.397975647
739 | 2198,160337.079183809
740 | 2199,191215.088671543
741 | 2200,166956.093232213
742 | 2201,186581.830878692
743 | 2202,176450.548582099
744 | 2203,193743.194909801
745 | 2204,198882.566078408
746 | 2205,176385.102235149
747 | 2206,162447.639333636
748 | 2207,193782.555676777
749 | 2208,183653.890897141
750 | 2209,210578.623546866
751 | 2210,158527.164107319
752 | 2211,163081.025723456
753 | 2212,174388.867432503
754 | 2213,191905.870131966
755 | 2214,174388.867432503
756 | 2215,161642.711648983
757 | 2216,186939.507215101
758 | 2217,172482.165792649
759 | 2218,159695.999763546
760 | 2219,157230.369671007
761 | 2220,179188.089925259
762 | 2221,157972.82120994
763 | 2222,156804.951429181
764 | 2223,211491.972463654
765 | 2224,186537.246201062
766 | 2225,200468.161070551
767 | 2226,182241.340444154
768 | 2227,157342.225898399
769 | 2228,182022.387105998
770 | 2229,181244.510876788
771 | 2230,178556.671573788
772 | 2231,189547.199876284
773 | 2232,187948.65165563
774 | 2233,194107.287565956
775 | 2234,183521.710369283
776 | 2235,183682.123638416
777 | 2236,178483.353073443
778 | 2237,184003.879764736
779 | 2238,171318.59033449
780 | 2239,162039.754313997
781 | 2240,154846.252190699
782 | 2241,194822.365690957
783 | 2242,169788.738771463
784 | 2243,178891.554489941
785 | 2244,152084.772428865
786 | 2245,139169.86642879
787 | 2246,192439.536044606
788 | 2247,161067.859766557
789 | 2248,158762.648504781
790 | 2249,175569.690441774
791 | 2250,183659.795012187
792 | 2251,280618.132617258
793 | 2252,180051.809151659
794 | 2253,176519.18031559
795 | 2254,179028.429210291
796 | 2255,177161.583857224
797 | 2256,180081.508849842
798 | 2257,205895.254584712
799 | 2258,183389.78131415
800 | 2259,178543.647859512
801 | 2260,194798.320499104
802 | 2261,162845.613675766
803 | 2262,148103.867006579
804 | 2263,201016.171121215
805 | 2264,277936.12694354
806 | 2265,249768.279823405
807 | 2266,161596.052159825
808 | 2267,158011.114889899
809 | 2268,194089.683858004
810 | 2269,181733.336941451
811 | 2270,182852.32772198
812 | 2271,189893.003058465
813 | 2272,194650.210979875
814 | 2273,187904.461286262
815 | 2274,171774.925622692
816 | 2275,177998.685921479
817 | 2276,175648.484325498
818 | 2277,196918.071362067
819 | 2278,184299.838071218
820 | 2279,182379.855682734
821 | 2280,184050.725802482
822 | 2281,158296.975970284
823 | 2282,175053.355553278
824 | 2283,162293.376090644
825 | 2284,186328.880047186
826 | 2285,151422.116936538
827 | 2286,181969.358707768
828 | 2287,189122.67702416
829 | 2288,185645.475220346
830 | 2289,182829.898109257
831 | 2290,195848.788183328
832 | 2291,198785.059550672
833 | 2292,181676.126555428
834 | 2293,194131.012663328
835 | 2294,201416.004864508
836 | 2295,185096.577205616
837 | 2296,195158.972598372
838 | 2297,184795.783735112
839 | 2298,189168.263864671
840 | 2299,216855.260149095
841 | 2300,184946.642483576
842 | 2301,189317.51282069
843 | 2302,180803.277842406
844 | 2303,175061.18585763
845 | 2304,179074.839090732
846 | 2305,145708.764336107
847 | 2306,142398.022752011
848 | 2307,161474.534863641
849 | 2308,157025.945155458
850 | 2309,163424.037827357
851 | 2310,164692.778645345
852 | 2311,152163.2443541
853 | 2312,192383.215486656
854 | 2313,182520.230322476
855 | 2314,187254.507549722
856 | 2315,176489.659740359
857 | 2316,181520.466841293
858 | 2317,186414.978214721
859 | 2318,185197.764639705
860 | 2319,178657.794083741
861 | 2320,179731.198023759
862 | 2321,161748.271317074
863 | 2322,158608.749069322
864 | 2323,178807.370559878
865 | 2324,184187.158803897
866 | 2325,181686.10402108
867 | 2326,190311.050228337
868 | 2327,192252.496354076
869 | 2328,193954.849525775
870 | 2329,181044.201560887
871 | 2330,180258.131219792
872 | 2331,199641.657313834
873 | 2332,197530.775205517
874 | 2333,191777.196949138
875 | 2334,195779.543033588
876 | 2335,202112.046522999
877 | 2336,192343.34807661
878 | 2337,185191.359443218
879 | 2338,186760.207965688
880 | 2339,177733.78193528
881 | 2340,164430.391189608
882 | 2341,185299.601552401
883 | 2342,186414.012339254
884 | 2343,176401.921054593
885 | 2344,182381.322639642
886 | 2345,176334.184710805
887 | 2346,184901.735847457
888 | 2347,180085.766885029
889 | 2348,184901.735847457
890 | 2349,183967.561548763
891 | 2350,193046.301574659
892 | 2351,168538.969495849
893 | 2352,170157.842016969
894 | 2353,196559.709259637
895 | 2354,177133.709361852
896 | 2355,181553.279576244
897 | 2356,185770.606634739
898 | 2357,177017.595099274
899 | 2358,184123.358536806
900 | 2359,165970.357492196
901 | 2360,158151.985049452
902 | 2361,177086.476441481
903 | 2362,196373.896176551
904 | 2363,172465.707083115
905 | 2364,168590.782409896
906 | 2365,158820.474171061
907 | 2366,151611.37057651
908 | 2367,152125.028585543
909 | 2368,158404.073081048
910 | 2369,160692.078640755
911 | 2370,170175.22684199
912 | 2371,169854.436591138
913 | 2372,183410.785819008
914 | 2373,180347.194026928
915 | 2374,178930.528374292
916 | 2375,153346.220086301
917 | 2376,182675.204270589
918 | 2377,180770.649792036
919 | 2378,188714.148087543
920 | 2379,191393.608594076
921 | 2380,174016.157494425
922 | 2381,183189.685319552
923 | 2382,183621.508757866
924 | 2383,168991.29635758
925 | 2384,185306.650665866
926 | 2385,189030.680303208
927 | 2386,179208.665698449
928 | 2387,174901.452792889
929 | 2388,168337.406544343
930 | 2389,158234.96461859
931 | 2390,179562.453368834
932 | 2391,174176.391640607
933 | 2392,173931.531845427
934 | 2393,184111.729429665
935 | 2394,179374.482001188
936 | 2395,207348.811884535
937 | 2396,186983.419339031
938 | 2397,206779.094049527
939 | 2398,177472.074683935
940 | 2399,156727.948324862
941 | 2400,157090.568462479
942 | 2401,160387.032696693
943 | 2402,172410.28005086
944 | 2403,191603.365657467
945 | 2404,182152.207151253
946 | 2405,180161.697340702
947 | 2406,169652.235284283
948 | 2407,182503.520140218
949 | 2408,179714.630677039
950 | 2409,180282.570719908
951 | 2410,192600.338060371
952 | 2411,166115.491248565
953 | 2412,186379.553524443
954 | 2413,184361.992258449
955 | 2414,186220.965458121
956 | 2415,198176.47090687
957 | 2416,168437.776500131
958 | 2417,178003.582312015
959 | 2418,179180.469244588
960 | 2419,191930.561104806
961 | 2420,175590.266214964
962 | 2421,176713.19307219
963 | 2422,180159.090947005
964 | 2423,188090.100808026
965 | 2424,186184.717727913
966 | 2425,223055.588672278
967 | 2426,158270.753116401
968 | 2427,184733.12846644
969 | 2428,199926.378957429
970 | 2429,175075.785166001
971 | 2430,180917.925148076
972 | 2431,182067.760625207
973 | 2432,178238.60191545
974 | 2433,173454.944606532
975 | 2434,176821.936262814
976 | 2435,183642.191304235
977 | 2436,177254.582741058
978 | 2437,168715.950111702
979 | 2438,180096.931198144
980 | 2439,160620.728178758
981 | 2440,175286.544392273
982 | 2441,153494.783276297
983 | 2442,156407.65915545
984 | 2443,162162.525245786
985 | 2444,166809.886827197
986 | 2445,172929.156408918
987 | 2446,193514.330894137
988 | 2447,181612.141603756
989 | 2448,191745.386377068
990 | 2449,171369.325038261
991 | 2450,184425.470567051
992 | 2451,170563.252355189
993 | 2452,184522.369240168
994 | 2453,164968.947931153
995 | 2454,157939.621592364
996 | 2455,151520.381580069
997 | 2456,176129.508722531
998 | 2457,171112.978971478
999 | 2458,169762.081624282
1000 | 2459,162246.828936295
1001 | 2460,171339.303381589
1002 | 2461,189034.753653813
1003 | 2462,175758.873595981
1004 | 2463,163351.721489893
1005 | 2464,189806.546645026
1006 | 2465,175370.990918319
1007 | 2466,196895.599900301
1008 | 2467,176905.917994834
1009 | 2468,176866.557227858
1010 | 2469,163590.677170026
1011 | 2470,212693.502958393
1012 | 2471,192686.931747717
1013 | 2472,181578.684951827
1014 | 2473,166475.457581812
1015 | 2474,185998.255166219
1016 | 2475,185527.714877908
1017 | 2476,159027.118197683
1018 | 2477,181169.654933769
1019 | 2478,176732.915304722
1020 | 2479,191619.294648838
1021 | 2480,189114.303789324
1022 | 2481,180934.635330334
1023 | 2482,164573.372223048
1024 | 2483,173902.011270196
1025 | 2484,165625.127741229
1026 | 2485,179555.219570787
1027 | 2486,196899.720661579
1028 | 2487,207566.12470446
1029 | 2488,163899.981149274
1030 | 2489,189179.428177786
1031 | 2490,193892.880023125
1032 | 2491,178980.874331431
1033 | 2492,179749.876244365
1034 | 2493,197999.674975598
1035 | 2494,203717.470295797
1036 | 2495,185249.261156892
1037 | 2496,201691.208274848
1038 | 2497,181956.548314794
1039 | 2498,171895.936275806
1040 | 2499,187245.168439419
1041 | 2500,157816.77461318
1042 | 2501,191702.912573325
1043 | 2502,198599.420028908
1044 | 2503,187193.313676329
1045 | 2504,220514.993999535
1046 | 2505,181814.527595192
1047 | 2506,183750.755371907
1048 | 2507,183000.431679579
1049 | 2508,185830.971906573
1050 | 2509,185497.872344187
1051 | 2510,179613.437681321
1052 | 2511,164454.967963631
1053 | 2512,185127.237217638
1054 | 2513,178750.613844623
1055 | 2514,160927.61044889
1056 | 2515,192562.808057836
1057 | 2516,180990.24148554
1058 | 2517,180064.941503122
1059 | 2518,196070.997393789
1060 | 2519,180352.919019023
1061 | 2520,183367.953769362
1062 | 2521,176734.841494027
1063 | 2522,180848.220765939
1064 | 2523,187806.059368823
1065 | 2524,180521.52640004
1066 | 2525,181502.754496154
1067 | 2526,174525.87942676
1068 | 2527,188927.984063168
1069 | 2528,184728.870431253
1070 | 2529,179857.975518011
1071 | 2530,180962.868071609
1072 | 2531,179194.066390078
1073 | 2532,179591.789259484
1074 | 2533,180638.463702549
1075 | 2534,185846.215131922
1076 | 2535,195174.031139141
1077 | 2536,192474.56829063
1078 | 2537,164200.595496827
1079 | 2538,178403.094096818
1080 | 2539,170774.84018302
1081 | 2540,179879.945898337
1082 | 2541,177668.192752792
1083 | 2542,180174.328610725
1084 | 2543,170643.303572141
1085 | 2544,165448.004289838
1086 | 2545,195531.754886222
1087 | 2546,165314.177682121
1088 | 2547,172532.757660882
1089 | 2548,203310.218069877
1090 | 2549,175090.062515883
1091 | 2550,230841.338626282
1092 | 2551,155225.19006632
1093 | 2552,168322.342441945
1094 | 2553,165956.259265265
1095 | 2554,193956.817564124
1096 | 2555,171070.367893827
1097 | 2556,166285.243628001
1098 | 2557,182875.801346628
1099 | 2558,218108.536769738
1100 | 2559,174378.777632042
1101 | 2560,164731.316372391
1102 | 2561,156969.695083273
1103 | 2562,173388.854342604
1104 | 2563,177559.628685119
1105 | 2564,194297.789279905
1106 | 2565,174894.588364005
1107 | 2566,196544.144075798
1108 | 2567,179036.158528149
1109 | 2568,211423.986511149
1110 | 2569,208156.398935188
1111 | 2570,159233.941347257
1112 | 2571,210820.115134931
1113 | 2572,140196.10979821
1114 | 2573,198678.469082978
1115 | 2574,186818.610760803
1116 | 2575,175044.797633861
1117 | 2576,180031.162892704
1118 | 2577,176889.171525162
1119 | 2578,159638.856165666
1120 | 2579,154287.264375509
1121 | 2580,191885.618181273
1122 | 2581,177503.378612934
1123 | 2582,166548.31684976
1124 | 2583,164475.14942856
1125 | 2584,167484.744857879
1126 | 2585,188683.160555403
1127 | 2586,162243.399502668
1128 | 2587,180807.213919103
1129 | 2588,176279.079637039
1130 | 2589,163438.959094218
1131 | 2590,161495.5393685
1132 | 2591,216032.303722443
1133 | 2592,176632.181541401
1134 | 2593,168743.001567144
1135 | 2594,183810.11848086
1136 | 2595,156794.36054728
1137 | 2596,169136.43011395
1138 | 2597,183203.318752456
1139 | 2598,213252.926930889
1140 | 2599,190550.327866959
1141 | 2600,234707.209860273
1142 | 2601,135751.318892816
1143 | 2602,164228.45886894
1144 | 2603,153219.437030419
1145 | 2604,164210.746523801
1146 | 2605,163883.229117973
1147 | 2606,154892.776269956
1148 | 2607,197092.08733832
1149 | 2608,228148.376399122
1150 | 2609,178680.587503997
1151 | 2610,165643.341167808
1152 | 2611,222406.642660249
1153 | 2612,184021.843582599
1154 | 2613,170871.094939159
1155 | 2614,189562.873697309
1156 | 2615,170591.884966356
1157 | 2616,172934.351682851
1158 | 2617,186425.069879189
1159 | 2618,218648.131133006
1160 | 2619,183035.606761141
1161 | 2620,178378.906069427
1162 | 2621,184516.716597846
1163 | 2622,181419.5253183
1164 | 2623,196858.923438425
1165 | 2624,189228.701486278
1166 | 2625,208973.380761028
1167 | 2626,180269.86896412
1168 | 2627,159488.713683953
1169 | 2628,191490.299507521
1170 | 2629,228684.245137946
1171 | 2630,201842.998700429
1172 | 2631,209242.82289186
1173 | 2632,202357.62258493
1174 | 2633,168238.61218265
1175 | 2634,202524.12465369
1176 | 2635,170588.771929588
1177 | 2636,198375.31512987
1178 | 2637,170636.827889889
1179 | 2638,181991.079479377
1180 | 2639,183994.54251844
1181 | 2640,182951.482193584
1182 | 2641,174126.297156192
1183 | 2642,170575.496742588
1184 | 2643,175332.239869971
1185 | 2644,167522.061539111
1186 | 2645,168095.583738538
1187 | 2646,154406.415627461
1188 | 2647,170996.973346087
1189 | 2648,159056.890245639
1190 | 2649,181373.6165193
1191 | 2650,152272.560975937
1192 | 2651,168664.346821336
1193 | 2652,211007.008292301
1194 | 2653,182909.515032911
1195 | 2654,203926.829353303
1196 | 2655,179082.825442944
1197 | 2656,206260.099795032
1198 | 2657,181732.443415757
1199 | 2658,189698.740693148
1200 | 2659,203074.34678979
1201 | 2660,201670.634365666
1202 | 2661,173756.812589691
1203 | 2662,181387.076390881
1204 | 2663,184859.155270535
1205 | 2664,158313.615666777
1206 | 2665,151951.955409666
1207 | 2666,162537.52704471
1208 | 2667,178998.337067854
1209 | 2668,186732.584943041
1210 | 2669,187323.318406165
1211 | 2670,199437.232798284
1212 | 2671,185546.680858653
1213 | 2672,161595.015798593
1214 | 2673,154672.422763036
1215 | 2674,159355.710116165
1216 | 2675,155919.014077746
1217 | 2676,182424.87095604
1218 | 2677,178100.589622319
1219 | 2678,202577.900044456
1220 | 2679,177862.778940605
1221 | 2680,182056.024744887
1222 | 2681,191403.199177104
1223 | 2682,196264.754980043
1224 | 2683,209375.003419718
1225 | 2684,196691.81930173
1226 | 2685,192458.431539585
1227 | 2686,182242.80926507
1228 | 2687,183259.503900506
1229 | 2688,188108.243748841
1230 | 2689,171418.640195797
1231 | 2690,194698.882220432
1232 | 2691,174841.84007522
1233 | 2692,172965.476488899
1234 | 2693,189386.323677132
1235 | 2694,185682.618340257
1236 | 2695,176412.012719061
1237 | 2696,174976.489722867
1238 | 2697,180718.581707643
1239 | 2698,186131.188248242
1240 | 2699,165220.786354033
1241 | 2700,164115.893800435
1242 | 2701,182125.729127024
1243 | 2702,182285.140233276
1244 | 2703,196325.442210366
1245 | 2704,164865.215329881
1246 | 2705,182694.492209823
1247 | 2706,185425.485520958
1248 | 2707,171414.7041191
1249 | 2708,183433.472466085
1250 | 2709,176844.981155794
1251 | 2710,180568.187753206
1252 | 2711,185948.625475832
1253 | 2712,189388.291715481
1254 | 2713,142754.489165865
1255 | 2714,156106.800760811
1256 | 2715,155895.397617561
1257 | 2716,159851.977738548
1258 | 2717,185157.832305524
1259 | 2718,180716.291710805
1260 | 2719,176901.093954071
1261 | 2720,181017.222455218
1262 | 2721,183269.159407668
1263 | 2722,193550.830097069
1264 | 2723,170625.842699726
1265 | 2724,182012.405942725
1266 | 2725,179162.507290733
1267 | 2726,183269.159407668
1268 | 2727,180589.836175042
1269 | 2728,181465.935198741
1270 | 2729,196053.029878304
1271 | 2730,183421.020319014
1272 | 2731,167926.839083612
1273 | 2732,168027.530997889
1274 | 2733,182164.26685407
1275 | 2734,172469.071592608
1276 | 2735,181059.374300472
1277 | 2736,182997.570115536
1278 | 2737,166140.504179894
1279 | 2738,198515.546934075
1280 | 2739,193789.648503294
1281 | 2740,173550.025727531
1282 | 2741,176487.943174734
1283 | 2742,188813.302559147
1284 | 2743,178531.911979192
1285 | 2744,182145.731469001
1286 | 2745,179196.465024103
1287 | 2746,169618.349900686
1288 | 2747,170010.168655046
1289 | 2748,181739.671652174
1290 | 2749,172846.934955574
1291 | 2750,195560.8830172
1292 | 2751,180358.114292956
1293 | 2752,211817.702818093
1294 | 2753,176170.128686742
1295 | 2754,234492.248263699
1296 | 2755,182450.956536015
1297 | 2756,174902.068073146
1298 | 2757,173684.174293738
1299 | 2758,147196.673677562
1300 | 2759,175231.189709791
1301 | 2760,193417.64740633
1302 | 2761,183313.601249761
1303 | 2762,180882.250849082
1304 | 2763,186735.697979808
1305 | 2764,172922.865411247
1306 | 2765,202551.677190573
1307 | 2766,190485.634074173
1308 | 2767,173439.49362151
1309 | 2768,196613.598849219
1310 | 2769,178152.259700828
1311 | 2770,174519.904825949
1312 | 2771,172627.796932837
1313 | 2772,173732.689486435
1314 | 2773,209219.844787023
1315 | 2774,181059.374300472
1316 | 2775,188515.443002459
1317 | 2776,182164.26685407
1318 | 2777,188137.901597981
1319 | 2778,158893.54306269
1320 | 2779,189579.65066771
1321 | 2780,165229.803505847
1322 | 2781,162186.071220207
1323 | 2782,166374.879866351
1324 | 2783,161665.184974757
1325 | 2784,175079.328798445
1326 | 2785,203840.874021305
1327 | 2786,152129.078861057
1328 | 2787,181012.141380101
1329 | 2788,161305.53503837
1330 | 2789,203326.392972343
1331 | 2790,168385.571141831
1332 | 2791,183564.365159986
1333 | 2792,163784.619440861
1334 | 2793,171989.192193993
1335 | 2794,180839.95616829
1336 | 2795,170895.923185907
1337 | 2796,174071.054808518
1338 | 2797,259423.859147546
1339 | 2798,188000.824679588
1340 | 2799,179171.703565498
1341 | 2800,171022.241447762
1342 | 2801,174126.297156192
1343 | 2802,187625.573271948
1344 | 2803,199567.946369234
1345 | 2804,205328.078219268
1346 | 2805,166231.535025379
1347 | 2806,154743.91606057
1348 | 2807,159714.537012622
1349 | 2808,185563.069082422
1350 | 2809,171500.796725006
1351 | 2810,180983.443844799
1352 | 2811,183141.236914997
1353 | 2812,178498.634450214
1354 | 2813,224323.710512388
1355 | 2814,218200.642127877
1356 | 2815,182283.177756557
1357 | 2816,190054.639237419
1358 | 2817,160192.453934518
1359 | 2818,171289.393581756
1360 | 2819,151131.098733642
1361 | 2820,181721.458225594
1362 | 2821,172725.053851858
1363 | 2822,222438.699143414
1364 | 2823,235419.373448928
1365 | 2824,185150.926027596
1366 | 2825,184772.239624699
1367 | 2826,180658.216435809
1368 | 2827,209673.316647174
1369 | 2828,205939.810625621
1370 | 2829,165633.573325837
1371 | 2830,186030.317211014
1372 | 2831,160312.319589212
1373 | 2832,190702.440251029
1374 | 2833,175122.810326699
1375 | 2834,183783.13937519
1376 | 2835,178290.666302221
1377 | 2836,181605.343963015
1378 | 2837,187992.451444752
1379 | 2838,188885.11781517
1380 | 2839,189959.344795118
1381 | 2840,179258.619211334
1382 | 2841,181518.750275669
1383 | 2842,193008.659237315
1384 | 2843,186313.89385619
1385 | 2844,181499.39185067
1386 | 2845,174126.297156192
1387 | 2846,183918.612062767
1388 | 2847,184114.270899227
1389 | 2848,158540.947801398
1390 | 2849,197034.759055859
1391 | 2850,185170.284452595
1392 | 2851,221134.533635148
1393 | 2852,184306.637575967
1394 | 2853,199792.302740996
1395 | 2854,143237.803559736
1396 | 2855,177294.838897736
1397 | 2856,182368.620883855
1398 | 2857,176487.943174734
1399 | 2858,183849.408762071
1400 | 2859,184964.141507413
1401 | 2860,196395.969632434
1402 | 2861,188374.936650438
1403 | 2862,176261.296806135
1404 | 2863,163628.142248426
1405 | 2864,180618.032628904
1406 | 2865,161647.329794081
1407 | 2866,167129.598867773
1408 | 2867,174750.988352687
1409 | 2868,177560.202116333
1410 | 2869,192577.796112839
1411 | 2870,199202.898960871
1412 | 2871,182818.156667308
1413 | 2872,148217.262540651
1414 | 2873,188997.797082492
1415 | 2874,185807.928877601
1416 | 2875,177030.477842021
1417 | 2876,175942.474593632
1418 | 2877,172912.518576433
1419 | 2878,198359.248864591
1420 | 2879,184379.133036383
1421 | 2880,194255.566948886
1422 | 2881,209449.651603064
1423 | 2882,169979.323958443
1424 | 2883,188206.281858748
1425 | 2884,186412.438609167
1426 | 2885,196761.386409959
1427 | 2886,208353.269558209
1428 | 2887,166548.067241044
1429 | 2888,175942.474593632
1430 | 2889,166790.457916434
1431 | 2890,160515.850579067
1432 | 2891,192167.621096362
1433 | 2892,178751.551083369
1434 | 2893,198678.894117024
1435 | 2894,164553.120272354
1436 | 2895,156887.932862327
1437 | 2896,164185.777305524
1438 | 2897,212992.120630876
1439 | 2898,197468.550532521
1440 | 2899,180106.84373966
1441 | 2900,183972.071056674
1442 | 2901,245283.198337927
1443 | 2902,170351.963410756
1444 | 2903,195596.307707478
1445 | 2904,189369.756330412
1446 | 2905,223667.404551664
1447 | 2906,169335.310624364
1448 | 2907,167411.02835165
1449 | 2908,187709.555003968
1450 | 2909,196526.002998991
1451 | 2910,137402.569855589
1452 | 2911,165086.775061735
1453 | 2912,188506.431412274
1454 | 2913,172917.456816012
1455 | 2914,166274.325225982
1456 | 2915,167081.220948984
1457 | 2916,164788.778231138
1458 | 2917,219222.423400059
1459 | 2918,184924.279658997
1460 | 2919,187741.866657478
1461 |
--------------------------------------------------------------------------------
/Challenges/House_Pricing/house_pricing_challenge.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "AML2019"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Challenge 1\n",
15 | "House Pricing Prediction\n",
16 | "\n",
17 | "22nd March 2019"
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "metadata": {},
23 | "source": [
24 | "The first AML challenge for this year is adapted from the well-known 'House Prices: Advanced Regression Techniques' competition on Kaggle.\n",
25 | "In particular, given a dataset containing descriptions of homes on the US property market, your task is to make predictions on the selling price of as-yet unlisted properties. \n",
26 | "Developing a model which accurately fits the available training data while also generalising to unseen data-points is a multi-faceted challenge that involves a mixture of data exploration, pre-processing, model selection, and performance evaluation."
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {},
32 | "source": [
33 | "# Overview\n",
34 | ""
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
41 | "Beyond simply producing a well-performing model for making predictions, in this challenge we would like you to start developing your skills as a machine learning scientist.\n",
42 | "In this regard, your notebook should be structured in such a way as to explore the five following tasks that are expected to be carried out whenever undertaking such a project.\n",
43 | "The description below each aspect should serve as a guide for your work, but you are strongly encouraged to also explore alternative options and directions. \n",
44 | "Thinking outside the box will always be rewarded in these challenges."
45 | ]
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {},
50 | "source": [
51 | "\n",
52 | "1. Data Exploration\n",
53 | ""
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {},
59 | "source": [
60 | "The first broad component of your notebook should enable you to familiarise yourselves with the given data, an outline of which is given at the end of this challenge specification.\n",
61 | "Among others, this section should investigate:\n",
62 | "\n",
63 | "- Data cleaning, e.g. treatment of categorical variables;\n",
64 | "- Data visualisation;\n",
65 | "- Computing descriptive statistics, e.g. correlation;\n",
66 | "- etc."
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "\n",
74 | "2. Data Pre-processing\n",
75 | ""
76 | ]
77 | },
78 | {
79 | "cell_type": "markdown",
80 | "metadata": {},
81 | "source": [
82 | "The previous step should give you a better understanding of which pre-processing is required for the data.\n",
83 | "This may include:\n",
84 | "\n",
85 | "- Normalising and standardising the given data;\n",
86 | "- Removing outliers;\n",
87 | "- Carrying out feature selection, possibly using metrics derived from information theory;\n",
88 | "- Handling missing information in the dataset;\n",
89 | "- Augmenting the dataset with external information;\n",
90 | "- Combining existing features."
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {},
96 | "source": [
97 | "\n",
98 | "3. Model Selection\n",
99 | ""
100 | ]
101 | },
102 | {
103 | "cell_type": "markdown",
104 | "metadata": {},
105 | "source": [
106 | "Perhaps the most important segment of this challenge involves the selection of a model that can successfully handle the given data and yield sensible predictions.\n",
107 | "Instead of focusing exclusively on your final chosen model, it is also important to share your thought process in this notebook by additionally describing alternative candidate models.\n",
108 | "There is a wealth of models to choose from, such as decision trees, random forests, (Bayesian) neural networks, Gaussian processes, LASSO regression, and so on.\n",
109 | "There are several factors which may influence your decision:\n",
110 | "\n",
111 | "- What is the model's complexity?\n",
112 | "- Is the model interpretable?\n",
113 | "- Is the model capable of handling different data-types?\n",
114 | "- Does the model return uncertainty estimates along with predictions?\n",
115 | "\n",
116 | "An in-depth evaluation of competing models in view of this and other criteria will elevate the quality of your submission and earn you a higher grade.\n"
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "metadata": {},
122 | "source": [
123 | "\n",
124 | "4. Parameter Optimisation\n",
125 | ""
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": [
132 | "Irrespective of your choice, it is highly likely that your model will have one or more parameters that require tuning.\n",
133 | "There are several techniques for carrying out such a procedure, including cross-validation, Bayesian optimisation, and several others.\n",
134 | "As before, an analysis of which parameter tuning technique best suits your model is expected before proceeding with the optimisation of your model."
135 | ]
136 | },
137 | {
138 | "cell_type": "markdown",
139 | "metadata": {},
140 | "source": [
141 | "\n",
142 | "5. Model Evaluation\n",
143 | ""
144 | ]
145 | },
146 | {
147 | "cell_type": "markdown",
148 | "metadata": {},
149 | "source": [
150 | "Some form of pre-evaluation will inevitably be required in the preceding sections in order to both select an appropriate model and configure its parameters appropriately.\n",
151 | "In this final section, you may evaluate other aspects of the model such as:\n",
152 | "\n",
153 | "- Assessing the running time of your model;\n",
154 | "- Determining whether some aspects can be parallelised;\n",
155 | "- Training the model with smaller subsets of the data.\n",
156 | "- etc."
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {},
162 | "source": [
163 | "\n",
164 | " N.B. Please note that the items listed under each heading are neither exhaustive, nor are you expected to explore every given suggestion.\n",
165 | " Nonetheless, these should serve as a guideline for your work in both this and upcoming challenges.\n",
166 | " As always, you should use your intuition and understanding in order to decide which analysis best suits the assigned task.\n",
167 | ""
168 | ]
169 | },
170 | {
171 | "cell_type": "markdown",
172 | "metadata": {},
173 | "source": [
174 | "\n",
175 | "Submission Instructions\n",
176 | "\n",
177 | ""
178 | ]
179 | },
180 | {
181 | "cell_type": "markdown",
182 | "metadata": {},
183 | "source": [
184 | "- The goal of this challenge is to construct a model for predicting house prices;\n",
185 | "\n",
186 | "\n",
187 | "- Your submission will have two components:\n",
188 | "\n",
189 | " 1. An HTML version of your notebook exploring the various modelling aspects described above;\n",
190 | " 2. A CSV file containing your final model's predictions on the given test data. \n",
191 | " This file should contain a header and have the following format:\n",
192 | " \n",
193 | " ```\n",
194 | " Id,SalePrice\n",
195 | " 1461,169000.1\n",
196 | " 1462,187724.1233\n",
197 | " 1463,175221\n",
198 | " etc.\n",
199 | " ```\n",
200 | " \n",
201 | " An example submission file has been provided in the data directory of the repository.\n",
202 | " A leaderboard for this challenge will be ranked using the root mean squared error between the logarithm of the predicted value and the logarithm of the observed sales price. \n",
203 | " Taking logs ensures that errors in predicting expensive houses and cheap houses will have a similar impact on the overall result;\n",
204 | "\n",
205 | "- This exercise is due on 04/04/2019."
206 | ]
207 | },
208 | {
209 | "cell_type": "markdown",
210 | "metadata": {},
211 | "source": [
212 | "\n",
213 | "Dataset Description\n",
214 | "\n",
215 | ""
216 | ]
217 | },
218 | {
219 | "cell_type": "markdown",
220 | "metadata": {},
221 | "source": [
222 | "#### * Files\n",
223 | "\n",
224 | "* train.csv - The training dataset;\n",
225 | "* test.csv - The test dataset;\n",
226 | "* Data description.rtf - Full description of each column.\n",
227 | "\n",
228 | "#### * Attributes\n",
229 | "\n",
230 | "A brief outline of the available attributes is given below:\n",
231 | "\n",
232 | "* SalePrice: The property's sale price in dollars. This is the target variable that your model is intended to predict;\n",
233 | "\n",
234 | "* MSSubClass: The building class;\n",
235 | "* MSZoning: The general zoning classification;\n",
236 | "* LotFrontage: Linear feet of street connected to property;\n",
237 | "* LotArea: Lot size in square feet;\n",
238 | "* Street: Type of road access;\n",
239 | "* Alley: Type of alley access;\n",
240 | "* LotShape: General shape of property;\n",
241 | "* LandContour: Flatness of the property;\n",
242 | "* Utilities: Type of utilities available;\n",
243 | "* LotConfig: Lot configuration;\n",
244 | "* LandSlope: Slope of property;\n",
245 | "* Neighborhood: Physical locations within Ames city limits;\n",
246 | "* Condition1: Proximity to main road or railroad;\n",
247 | "* Condition2: Proximity to main road or railroad (if a second is present);\n",
248 | "* BldgType: Type of dwelling;\n",
249 | "* HouseStyle: Style of dwelling;\n",
250 | "* OverallQual: Overall material and finish quality;\n",
251 | "* OverallCond: Overall condition rating;\n",
252 | "* YearBuilt: Original construction date;\n",
253 | "* YearRemodAdd: Remodel date;\n",
254 | "* RoofStyle: Type of roof;\n",
255 | "* RoofMatl: Roof material;\n",
256 | "* Exterior1st: Exterior covering on house;\n",
257 | "* Exterior2nd: Exterior covering on house (if more than one material);\n",
258 | "* MasVnrType: Masonry veneer type;\n",
259 | "* MasVnrArea: Masonry veneer area in square feet;\n",
260 | "* ExterQual: Exterior material quality;\n",
261 | "* ExterCond: Present condition of the material on the exterior;\n",
262 | "* Foundation: Type of foundation;\n",
263 | "* BsmtQual: Height of the basement;\n",
264 | "* BsmtCond: General condition of the basement;\n",
265 | "* BsmtExposure: Walkout or garden level basement walls;\n",
266 | "* BsmtFinType1: Quality of basement finished area;\n",
267 | "* BsmtFinSF1: Type 1 finished square feet;\n",
268 | "* BsmtFinType2: Quality of second finished area (if present);\n",
269 | "* BsmtFinSF2: Type 2 finished square feet;\n",
270 | "* BsmtUnfSF: Unfinished square feet of basement area;\n",
271 | "* TotalBsmtSF: Total square feet of basement area;\n",
272 | "* Heating: Type of heating;\n",
273 | "* HeatingQC: Heating quality and condition;\n",
274 | "* CentralAir: Central air conditioning;\n",
275 | "* Electrical: Electrical system;\n",
276 | "* 1stFlrSF: First Floor square feet;\n",
277 | "* 2ndFlrSF: Second floor square feet;\n",
278 | "* LowQualFinSF: Low quality finished square feet (all floors);\n",
279 | "* GrLivArea: Above grade (ground) living area square feet;\n",
280 | "* BsmtFullBath: Basement full bathrooms;\n",
281 | "* BsmtHalfBath: Basement half bathrooms;\n",
282 | "* FullBath: Full bathrooms above grade;\n",
283 | "* HalfBath: Half baths above grade;\n",
284 | "* BedroomAbvGr: Number of bedrooms above basement level;\n",
285 | "* KitchenAbvGr: Number of kitchens;\n",
286 | "* KitchenQual: Kitchen quality;\n",
287 | "* TotRmsAbvGrd: Total rooms above grade (does not include bathrooms);\n",
288 | "* Functional: Home functionality rating;\n",
289 | "* Fireplaces: Number of fireplaces;\n",
290 | "* FireplaceQu: Fireplace quality;\n",
291 | "* GarageType: Garage location;\n",
292 | "* GarageYrBlt: Year garage was built;\n",
293 | "* GarageFinish: Interior finish of the garage;\n",
294 | "* GarageCars: Size of garage in car capacity;\n",
295 | "* GarageArea: Size of garage in square feet;\n",
296 | "* GarageQual: Garage quality;\n",
297 | "* GarageCond: Garage condition;\n",
298 | "* PavedDrive: Paved driveway;\n",
299 | "* WoodDeckSF: Wood deck area in square feet;\n",
300 | "* OpenPorchSF: Open porch area in square feet;\n",
301 | "* EnclosedPorch: Enclosed porch area in square feet;\n",
302 | "* 3SsnPorch: Three season porch area in square feet;\n",
303 | "* ScreenPorch: Screen porch area in square feet;\n",
304 | "* PoolArea: Pool area in square feet;\n",
305 | "* PoolQC: Pool quality;\n",
306 | "* Fence: Fence quality;\n",
307 | "* MiscFeature: Miscellaneous feature not covered in other categories;\n",
308 | "* MiscVal: Value (in dollars) of miscellaneous feature;\n",
309 | "* MoSold: Month sold;\n",
310 | "* YrSold: Year sold;\n",
311 | "* SaleType: Type of sale;\n",
312 | "* SaleCondition: Condition of sale.\n"
313 | ]
314 | },
315 | {
316 | "cell_type": "code",
317 | "execution_count": null,
318 | "metadata": {},
319 | "outputs": [],
320 | "source": []
321 | }
322 | ],
323 | "metadata": {
324 | "kernelspec": {
325 | "display_name": "Python 3",
326 | "language": "python",
327 | "name": "python3"
328 | },
329 | "language_info": {
330 | "codemirror_mode": {
331 | "name": "ipython",
332 | "version": 3
333 | },
334 | "file_extension": ".py",
335 | "mimetype": "text/x-python",
336 | "name": "python",
337 | "nbconvert_exporter": "python",
338 | "pygments_lexer": "ipython3",
339 | "version": "3.7.0"
340 | }
341 | },
342 | "nbformat": 4,
343 | "nbformat_minor": 1
344 | }
345 |
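The leaderboard metric and the submission format described in the Submission Instructions above can be sketched in plain Python. This is a minimal illustration, not the organisers' scoring code: the function names `rmsle` and `write_submission` are hypothetical, and the natural logarithm is assumed because the notebook does not state the base.

```python
import csv
import math


def rmsle(predicted, observed):
    """Root mean squared error between log(predicted) and log(observed).

    Assumes strictly positive prices and the natural logarithm.
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(math.log(p) - math.log(o)) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def write_submission(path, ids, prices):
    """Write predictions as CSV with the header row Id,SalePrice."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Id", "SalePrice"])
        writer.writerows(zip(ids, prices))


# A 10% overestimate costs the same on a cheap and on an expensive house,
# because only the ratio predicted/observed enters the score.
cheap = rmsle([110_000.0], [100_000.0])
expensive = rmsle([1_100_000.0], [1_000_000.0])
```

Because the score depends only on price ratios, a common design choice is to train the regressor on the logarithm of `SalePrice` and exponentiate its predictions before writing the submission file.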
--------------------------------------------------------------------------------
/Challenges/Plankton/plankton_challenge.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "Algorithmic Machine Learning Challenge\n",
8 | "Plankton Image Classification\n",
9 | ""
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "Plankton comprises all the organisms freely drifting with ocean currents. These life forms are a critically important piece of oceanic ecosystems, accounting for more than half the primary production on earth and nearly half the total carbon fixed in the global carbon cycle. They also form the foundation of aquatic food webs, including those of large, commercially important fisheries. Loss of plankton populations could result in ecological upheaval as well as negative societal impacts, particularly in indigenous cultures and the developing world. Plankton’s global significance makes their population levels an ideal measure of the health of the world’s oceans and ecosystems.\n",
17 | "\n",
18 | "Traditional methods for measuring and monitoring plankton populations are time consuming and cannot scale to the granularity or scope necessary for large-scale studies. Improved approaches are needed. One such approach is through the use of underwater imagery sensors. \n",
19 | "\n",
20 | "In this challenge, prepared in cooperation with the Laboratoire d’Océanographie de Villefranche (jointly run by Sorbonne Université and CNRS), plankton images have been acquired weekly in the bay of Villefranche since 2013, and manually engineered features were computed for each imaged object. \n",
21 | "\n",
22 | "This challenge aims at developing solid approaches to plankton image classification. We will compare methods based on carefully (but manually) engineered features, with “Deep Learning” methods in which features will be learned from image data alone.\n",
23 | "\n",
24 | "The purpose of this challenge is for you to learn about the commonly used paradigms when working with computer vision problems. This means you can choose one of the following paths:\n",
25 | "\n",
26 | "- Work directly with the provided images, e.g. using a (convolutional) neural network\n",
27 | "- Work with the supplied features extracted from the images (*native* or *skimage* or both of them)\n",
28 | "- Extract your own features from the provided images using a technique of your choice\n",
29 | "\n",
30 | "You will find a detailed description of the image data and the features at the end of this text.\n",
31 | "In any case, the choice of the classifier that you decide to work with strongly depends on the choice of features.\n",
32 | "\n",
33 | "Please bear in mind that the purpose of this challenge is not simply to find the best-performing model that was released on e.g. Kaggle for a similar problem. You should rather make sure to understand the difficulties that come with this computer vision task. Moreover, you should be able to justify your choice of features/model and be able to explain its advantages and disadvantages for the task."
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "# Overview\n",
41 | ""
42 | ]
43 | },
44 | {
45 | "cell_type": "markdown",
46 | "metadata": {},
47 | "source": [
48 | "Beyond simply producing a well-performing model for making predictions, in this challenge we would like you to start developing your skills as a machine learning scientist.\n",
49 | "In this regard, your notebook should be structured in such a way as to explore the five following tasks that are expected to be carried out whenever undertaking such a project.\n",
50 | "The description below each aspect should serve as a guide for your work, but you are strongly encouraged to also explore alternative options and directions. \n",
51 | "Thinking outside the box will always be rewarded in these challenges."
52 | ]
53 | },
54 | {
55 | "cell_type": "markdown",
56 | "metadata": {},
57 | "source": [
58 | "\n",
59 | "1. Data Exploration\n",
60 | ""
61 | ]
62 | },
63 | {
64 | "cell_type": "markdown",
65 | "metadata": {},
66 | "source": [
67 | "The first broad component of your notebook should enable you to familiarise yourselves with the given data, an outline of which is given at the end of this challenge specification.\n",
68 | "\n",
69 | "What is new in this challenge is that you will be working with image data. Therefore, you should have a look at example images located in the *imgs.zip* file (see description below). If you decide to work with the native or the skimage features, make sure to understand them!\n",
70 | "\n",
71 | "Among others, this section should investigate:\n",
72 | "\n",
73 | "- Distribution of the different image dimensions (including the number of channels)\n",
74 | "- Distribution of the different labels that the images are assigned to\n",
75 | "\n",
76 | "The image labels are organized in a taxonomy. We will measure the final model performance for the classification into the *level2* categories. Make sure to understand the meaning of this label inside the taxonomy."
77 | ]
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "metadata": {},
82 | "source": [
83 | "\n",
84 | "2. Data Pre-processing\n",
85 | ""
86 | ]
87 | },
88 | {
89 | "cell_type": "markdown",
90 | "metadata": {},
91 | "source": [
92 | "The previous step should give you a better understanding of which pre-processing is required for the data based on your approach:\n",
93 | "\n",
94 | "- If you decide to work with the provided features, some data cleaning may be required to make full use of all the data.\n",
95 | "- If you decide to extract your own features from the images, you should explain your approach in this section.\n",
96 | "- If you decide to work directly with the images themselves, preprocessing the images may improve your classification results. In particular, if you work with a neural network the following should be of interest to you:\n",
97 | "\n",
98 | " - Due to the fully-connected layers (that usually come after the convolutional ones), the input needs to have a fixed dimension.\n",
99 | " - Data augmentation (image rotation, scaling, cropping, etc. of the existing images) can be used to increase the size of the training data set. This may improve performance especially when little data is available for a particular class.\n",
100 | " - Be aware of the computational cost! It might be worth rescaling the images to a smaller size!\n",
101 | "\n",
102 | " All of the operations above are usually realized using a dataloader. This means that you do not need to create a modified version of the dataset and save it to disk. Instead, the dataloader processes the data \"on the fly\" and in-memory before passing it to the network.\n",
103 | " \n",
104 | " NB: Although aligning image sizes is necessary to train CNNs, this will prevent your classifier from learning about different object sizes as a feature. Additional gains may be achieved when also taking object sizes into account."
105 | ]
106 | },
107 | {
108 | "cell_type": "markdown",
109 | "metadata": {},
110 | "source": [
111 | "\n",
112 | "3. Model Selection\n",
113 | ""
114 | ]
115 | },
116 | {
117 | "cell_type": "markdown",
118 | "metadata": {},
119 | "source": [
120 | "Perhaps the most important segment of this challenge involves the selection of a model that can successfully handle the given data and yield sensible predictions.\n",
121 | "Instead of focusing exclusively on your final chosen model, it is also important to share your thought process in this notebook by additionally describing alternative candidate models.\n",
122 | "\n",
123 | "The choice of your model is closely connected to the way you preprocessed the input data.\n",
124 | "\n",
125 | "Furthermore, there are other factors which may influence your decision:\n",
126 | "\n",
127 | "- What is the model's complexity?\n",
128 | "- Is the model interpretable?\n",
129 | "- Is the model capable of handling different data-types?\n",
130 | "- Does the model return uncertainty estimates along with predictions?"
131 | ]
132 | },
133 | {
134 | "cell_type": "markdown",
135 | "metadata": {},
136 | "source": [
137 | "\n",
138 | "4. Parameter Optimisation\n",
139 | ""
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "Irrespective of your choice, it is highly likely that your model will have one or more (hyper-)parameters that require tuning.\n",
147 | "There are several techniques for carrying out such a procedure, including cross-validation, Bayesian optimisation, and several others.\n",
148 | "As before, an analysis of which parameter tuning technique best suits your model is expected before proceeding with the optimisation of your model.\n",
149 | "\n",
150 | "If you use a neural network, the optimization of hyperparameters (learning rate, weight decay, etc.) can be a very time-consuming process. In this case, you may decide to carry out smaller experiments and to justify your choice based on these preliminary tests."
151 | ]
152 | },
153 | {
154 | "cell_type": "markdown",
155 | "metadata": {},
156 | "source": [
157 | "\n",
158 | "5. Model Evaluation\n",
159 | ""
160 | ]
161 | },
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {},
165 | "source": [
166 | "Some form of pre-evaluation will inevitably be required in the preceding sections in order to both select an appropriate model and configure its parameters appropriately.\n",
167 | "In this final section, you may evaluate other aspects of the model such as:\n",
168 | "\n",
169 | "- Assessing the running time of your model;\n",
170 | "- Determining whether some aspects can be parallelised;\n",
171 | "- Training the model with smaller subsets of the data.\n",
172 | "- etc.\n",
173 | "\n",
174 | "For the evaluation of the classification results, you should use the F1 measure (see Submission Instructions). Here the focus should be on level2 classification. A classification evaluation for other labels is optional.\n",
175 | "\n",
176 | "Please note that you are responsible for creating a sensible train/validation/test split. There is no predefined held-out test data."
177 | ]
178 | },
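  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of one way to build such a split, the code below calls scikit-learn's `train_test_split` twice and stratifies on the label, so that every class keeps the same proportion in each part. The toy arrays `X` and `y`, the 60/20/20 proportions and the random seed are illustrative assumptions, not requirements of the challenge.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Toy stand-ins for the real features and level2 labels (assumption).\n",
    "X = np.arange(200).reshape(100, 2)\n",
    "y = np.array([0] * 50 + [1] * 30 + [2] * 20)\n",
    "\n",
    "# First hold out 20% of the data as the test set, keeping class proportions.\n",
    "X_rest, X_test, y_rest, y_test = train_test_split(\n",
    "    X, y, test_size=0.2, stratify=y, random_state=0)\n",
    "\n",
    "# Then split the remainder 75/25, i.e. 60% / 20% of the full data.\n",
    "X_train, X_val, y_train, y_val = train_test_split(\n",
    "    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)\n",
    "\n",
    "print(len(y_train), len(y_val), len(y_test))\n",
    "```\n",
    "\n",
    "Fixing the seed keeps the split reproducible, so that scores reported for different models refer to the same held-out data."
   ]
  },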
179 | {
180 | "cell_type": "markdown",
181 | "metadata": {},
182 | "source": [
183 | "\n",
184 | " N.B. Please note that the items listed under each heading are neither exhaustive, nor are you expected to explore every given suggestion.\n",
185 | " Nonetheless, these should serve as a guideline for your work in both this and upcoming challenges.\n",
186 | " As always, you should use your intuition and understanding in order to decide which analysis best suits the assigned task.\n",
187 | ""
188 | ]
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
194 | "\n",
195 | "Submission Instructions\n",
196 | "\n",
197 | ""
198 | ]
199 | },
200 | {
201 | "cell_type": "markdown",
202 | "metadata": {},
203 | "source": [
204 | "- The goal of this challenge is to construct a model for predicting Plankton (taxonomy level 2) classes.\n",
205 | "\n",
206 | "- Your submission will be the HTML version of your notebook exploring the various modelling aspects described above.\n",
207 | "\n",
208 | "- At the end of the notebook you should indicate your final evaluation score on a held-out test set. As an evaluation metric you should use the F1 score with the *average=macro* option as it is provided by the scikit-learn library. See the following link for more information:\n",
209 | " \n",
210 | "https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html"
211 | ]
212 | },
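  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the metric concrete, here is a small sketch of the macro-averaged F1 score on made-up labels. The label values are purely illustrative; only `f1_score` and its `average` parameter come from scikit-learn. The macro average is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones.\n",
    "\n",
    "```python\n",
    "from sklearn.metrics import f1_score\n",
    "\n",
    "# Toy labels for three classes (illustrative values only).\n",
    "y_true = ['a', 'a', 'a', 'b', 'b', 'c']\n",
    "y_pred = ['a', 'a', 'b', 'b', 'b', 'c']\n",
    "\n",
    "# Per-class F1 scores, in the order given by `labels`.\n",
    "per_class = f1_score(y_true, y_pred, average=None, labels=['a', 'b', 'c'])\n",
    "\n",
    "# Macro F1: the unweighted mean of the per-class scores.\n",
    "macro = f1_score(y_true, y_pred, average='macro')\n",
    "\n",
    "print(per_class, macro)\n",
    "```"
   ]
  },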
213 | {
214 | "cell_type": "markdown",
215 | "metadata": {},
216 | "source": [
217 | "\n",
218 | "Dataset Description\n",
219 | "\n",
220 | ""
221 | ]
222 | },
223 | {
224 | "cell_type": "markdown",
225 | "metadata": {},
226 | "source": [
227 | "#### * Location of the Dataset on zoe\n",
228 | "The data for this challenge is located at: `/mnt/datasets/plankton/flowcam`\n",
229 | "\n",
230 | "#### * Hierarchical Taxonomy Tree for Labels \n",
231 | "\n",
232 | "Each object is represented by a single image and is identified by a unique integer number. It has a name associated with it, which is part of a hierarchical taxonomic tree. The identifications are gathered from different projects, classified by different people in different contexts, so they often target different taxonomic levels. For example, let us say we classify items of clothing along the following tree:\n",
233 | "\n",
234 | "    top\n",
235 | "      shirt\n",
236 | "        long sleeves\n",
237 | "        short sleeves\n",
238 | "      sweater\n",
239 | "        hooded\n",
240 | "        no hood\n",
241 | "    bottom\n",
242 | "      pants\n",
243 | "        jeans\n",
244 | "        other\n",
245 | "      shorts\n",
246 | " \n",
247 | "In a first project, images are classified to the finest level possible, but it may be the case that, on some pictures, it is impossible to determine whether a sweater has a hood or not, in which case it is simply classified as `sweater`. In the second project, the operator classified tops as `shirt` or `sweater` only, and bottoms to the finest level. In a third project, the operator only separated tops from bottoms. In such a context, the original names in the database cannot be used directly because, for example `sweater` will contain images that are impossible to determine as `hooded` or `no hood` *as well as* `hooded` and `no hood` images that were simply not classified further. If all three classes (`sweater`, `hooded`, and `no hood`) are included in the training set, it will likely confuse the classifier. For this reason, we define different target taxonomic levels:\n",
248 | "\n",
249 | "- `level1` is the finest taxonomic level possible. In the example above, we would include `hooded` and `no hood` but discard all images in `sweater` to avoid confusion; and proceed in the same manner for other classes.\n",
250 | "\n",
251 | "- `level2` is a grouping of underlying levels. In the example above, it would include `shirt` (which contains all images in `shirt`, `long sleeves`, and `short sleeves`), `sweater` (which, similarly would include this class and all its children), `pants` (including children), and `shorts`. So typically, `level2` contains more images (less discarding), sorted within fewer classes than `level1`, and may therefore be an easier classification problem.\n",
252 | "\n",
253 | "- `level3` is an even broader grouping. Here it would be `top` vs `bottom`\n",
254 | "\n",
255 | "- etc.\n",
256 | "\n",
257 | "In the Plankton Image dataset, the objects are categorised according to a pre-defined `level1` and `level2`. You can work on either of them, but we recommend `level2` because it is likely to be an easier classification problem. \n",
258 | "\n",
259 | "#### * Data Structure\n",
260 | "\n",
261 | "    /mnt/datasets/plankton/flowcam/\n",
262 | "        meta.csv\n",
263 | "        taxo.csv\n",
264 | "        features_native.csv.gz\n",
265 | "        features_skimage.csv.gz\n",
266 | "        imgs.zip\n",
267 | "\n",
268 | "* `meta.csv` contains the index of images and their corresponding labels\n",
269 | "* `taxo.csv` defines the taxonomic tree and its potential groupings at various levels. Note that this information is also available in `meta.csv`, so `taxo.csv` is mainly useful for getting a global view of the taxonomy tree\n",
270 | "* `features_native.csv.gz` contains the handcrafted morphological features computed by ZooProcess. ZooProcess generates regions of interest (ROIs) around each individual object in an original plankton image, and also computes a set of associated features measured on the object. These are the features contained in `features_native.csv.gz`\n",
271 | "* `features_skimage.csv.gz` contains the morphological features recomputed with skimage.measure.regionprops on the ROIs produced by ZooProcess.\n",
272 | "* `imgs.zip` contains a post-processed version of the original images. Images are named by `objid`.jpg\n",
273 | "\n",
274 | "#### * Attributes in meta.csv\n",
275 | "\n",
276 | "The file contains the image identifiers (objid) as well as the labels assigned to the images by human operators. Those are defined with various levels of precision:\n",
277 | "\n",
278 | "* unique_name: raw labels from operators\n",
279 | "* level1: cleaned, most detailed labels\n",
280 | "* level2: regrouped (coarser) labels\n",
281 | "* lineage: full taxonomic lineage of the class\n",
282 | "\n",
283 | "Some labels may be missing (coded ‘NA’) at a given level, meaning that the corresponding objects should be discarded for the classification at this level.\n",
284 | "\n",
285 | "#### * imgs.zip\n",
286 | "\n",
287 | "This zip archive contains an *imgs* folder that contains all the images in .jpg format. Do not extract this folder to disk! Instead you will be loading the images to memory. See the code below for a quick how-to:"
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "execution_count": null,
293 | "metadata": {},
294 | "outputs": [],
295 | "source": [
296 | "import zipfile\n",
297 | "from io import BytesIO\n",
298 | "from PIL import Image\n",
299 | "\n",
300 | "def extract_zip_to_memory(input_zip):\n",
301 | "    '''\n",
302 | "    This function extracts the images stored inside the given zip file.\n",
303 | "    It stores the result in a python dictionary.\n",
304 | "    \n",
305 | "    input_zip (string): path to the zip file\n",
306 | "    \n",
307 | "    returns (dict): {filename (string): image_file (io.BytesIO)}\n",
308 | "    '''\n",
309 | "    input_zip = zipfile.ZipFile(input_zip)\n",
310 | "    return {name: BytesIO(input_zip.read(name)) for name in input_zip.namelist() if name.endswith('.jpg')}\n",
311 | "\n",
312 | "\n",
313 | "# img_files = extract_zip_to_memory(\"imgs.zip\")\n",
314 | "\n",
315 | "# Display an example image \n",
316 | "# Image.open(img_files['imgs/32738710.jpg'])\n",
317 | "\n",
318 | "# Load the image as a numpy array (this needs `import numpy as np` first):\n",
319 | "# np_arr = np.array(Image.open(img_files['imgs/32738710.jpg']))\n",
320 | "\n",
321 | "# Be aware that the dictionary will occupy roughly 2GB of computer memory!\n",
322 | "# To free this memory again, run:\n",
323 | "# del img_files"
324 | ]
325 | },
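  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before pairing images with labels, the objects whose label is missing at the chosen level have to be dropped. Since pandas reads the string `NA` as a missing value by default, this can be done with `dropna`. The sketch below uses a tiny made-up table in place of `meta.csv`, with label names borrowed from the clothing example; only the column names `objid`, `level1` and `level2` come from the description above.\n",
    "\n",
    "```python\n",
    "import io\n",
    "import pandas as pd\n",
    "\n",
    "# Made-up rows standing in for meta.csv (assumption, not real data).\n",
    "csv_text = '''objid,level1,level2\n",
    "1,hooded,sweater\n",
    "2,NA,sweater\n",
    "3,NA,NA\n",
    "'''\n",
    "meta = pd.read_csv(io.StringIO(csv_text))\n",
    "\n",
    "# Keep only the objects that have a label at the chosen level.\n",
    "level2_rows = meta.dropna(subset=['level2'])\n",
    "\n",
    "print(len(meta), len(level2_rows))\n",
    "```\n",
    "\n",
    "With the real file, `pd.read_csv` would be given the path to `meta.csv` instead of the in-memory text."
   ]
  },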
326 | {
327 | "cell_type": "markdown",
328 | "metadata": {},
329 | "source": [
330 | "#### * Attributes in features_native.csv.gz\n",
331 | "A brief outline of the attributes available in `features_native.csv.gz` is given below:\n",
332 | "\n",
333 | "* objid: same as in `meta.csv`\n",
334 | "* area: area of ROI\n",
335 | "* meanimagegrey:\n",
336 | "* mean: mean grey\n",
337 | "* stddev: standard deviation of greys\n",
338 | "* min: minimum grey\n",
339 | "* perim.: perimeter of ROI\n",
340 | "* width, height: dimensions of ROI\n",
341 | "* major, minor: length of major,minor axis of the best fitting ellipse\n",
342 | "* angle: \n",
343 | "* circ.: circularity or shape factor which can be computed by 4pi(area/perim.^2)\n",
344 | "* feret: maximal feret diameter\n",
345 | "* intden: integrated density: mean*area\n",
346 | "* median: median grey\n",
347 | "* skew, kurt: skewness,kurtosis of the histogram of greys\n",
348 | "* %area: proportion of the image corresponding to the object\n",
349 | "* area_exc: area excluding holes\n",
350 | "* fractal: fractal dimension of the perimeter\n",
351 | "* skelarea: area of the one-pixel wide skeleton of the image ???\n",
352 | "* slope: slope of the cumulated histogram of greys\n",
353 | "* histcum1, 2, 3: grey level at quantiles 0.25, 0.5, 0.75 of the histogram of greys\n",
354 | "* nb1, 2, 3: number of objects after thresholding at the grey levels above\n",
355 | "* symetrieh, symetriev: index of horizontal,vertical symmetry\n",
356 | "* symetriehc, symetrievc: same but after thresholding at level histcum1\n",
357 | "* convperim, convarea: perimeter,area of the convex hull of the object\n",
358 | "* fcons: contrast\n",
359 | "* thickr: thickness ratio: maximum thickness/mean thickness\n",
360 | "* esd:\n",
361 | "* elongation: elongation index: major/minor\n",
362 | "* range: range of greys: max-min\n",
363 | "* meanpos: relative position of the mean grey: (max-mean)/range\n",
364 | "* centroids:\n",
365 | "* cv: coefficient of variation of greys: 100*(stddev/mean)\n",
366 | "* sr: index of variation of greys: 100*(stddev/range)\n",
367 | "* perimareaexc:\n",
368 | "* feretareaexc:\n",
369 | "* perimferet: index of the relative complexity of the perimeter: perim/feret\n",
370 | "* perimmajor: index of the relative complexity of the perimeter: perim/major\n",
371 | "* circex:\n",
372 | "* cdexc:\n",
373 | "* kurt_mean:\n",
374 | "* skew_mean:\n",
375 | "* convperim_perim:\n",
376 | "* convarea_area:\n",
377 | "* symetrieh_area:\n",
378 | "* symetriev_area:\n",
379 | "* nb1_area:\n",
380 | "* nb2_area:\n",
381 | "* nb3_area:\n",
382 | "* nb1_range:\n",
383 | "* nb2_range:\n",
384 | "* nb3_range:\n",
385 | "* median_mean:\n",
386 | "* median_mean_range:\n",
387 | "* skeleton_area:\n",
388 | "\n",
389 | "#### * Attributes in features_skimage.csv.gz\n",
390 | "Table of morphological features recomputed with skimage.measure.regionprops on the ROIs produced by ZooProcess. See http://scikit-image.org/docs/dev/api/skimage.measure.html#skimage.measure.regionprops for documentation."
391 | ]
392 | },
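  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Several of the derived attributes are simple ratios of the basic ones. As a sanity check of the circularity formula quoted above, 4*pi*area/perim^2, the sketch below evaluates it for two ideal shapes. The function name `circularity` and the shapes are illustrative assumptions; measurements of real ROIs on a pixel grid will deviate from these ideal values.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "def circularity(area, perim):\n",
    "    # 4*pi*area / perim^2, following the description of circ.\n",
    "    return 4 * math.pi * area / perim ** 2\n",
    "\n",
    "# A perfect circle of radius 3: area = pi*r^2, perimeter = 2*pi*r.\n",
    "r = 3.0\n",
    "circle = circularity(math.pi * r ** 2, 2 * math.pi * r)\n",
    "\n",
    "# A square of side 2: area = side^2, perimeter = 4*side.\n",
    "side = 2.0\n",
    "square = circularity(side ** 2, 4 * side)\n",
    "\n",
    "print(circle, square)\n",
    "```\n",
    "\n",
    "The circle reaches the maximum value of the index, and less compact shapes such as the square score lower."
   ]
  },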
393 | {
394 | "cell_type": "code",
395 | "execution_count": null,
396 | "metadata": {},
397 | "outputs": [],
398 | "source": []
399 | }
400 | ],
401 | "metadata": {
402 | "kernelspec": {
403 | "display_name": "Python 3",
404 | "language": "python",
405 | "name": "python3"
406 | },
407 | "language_info": {
408 | "codemirror_mode": {
409 | "name": "ipython",
410 | "version": 3
411 | },
412 | "file_extension": ".py",
413 | "mimetype": "text/x-python",
414 | "name": "python",
415 | "nbconvert_exporter": "python",
416 | "pygments_lexer": "ipython3",
417 | "version": "3.6.8"
418 | }
419 | },
420 | "nbformat": 4,
421 | "nbformat_minor": 2
422 | }
423 |
--------------------------------------------------------------------------------
/Notebooks/Intro-public.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "source": [
5 | "# Revision on JupyterLab, Python, Pandas and Matplotlib (Spring 2019)\n",
6 | "In this introductory laboratory, we expect students to:\n",
7 | "\n",
8 | "1. Acquire basic knowledge about Python and Matplotlib\n",
9 | "2. Gain familiarity with Jupyter Notebooks\n",
10 | "\n",
11 | "\n",
12 | "To achieve such goals, we will go through the following steps:\n",
13 | "\n",
14 | "1. In section 1, **IPython** and **Jupyter Notebooks** are introduced to help students understand the environment used to work on projects, including those that are part of the CLOUDS course.\n",
15 | "\n",
16 | "2. In section 2, we briefly overview **Python** and its syntax. In addition, we cover **Matplotlib**, a very powerful library to plot figures in Python. Finally, we introduce **Pandas**, a python library that is very helpful when manipulating data."
17 | ],
18 | "metadata": {},
19 | "cell_type": "markdown"
20 | },
21 | {
22 | "source": [
23 | "# 1. Python, IPython and Jupyter Notebooks\n",
24 | "\n",
25 | "**Python** is a high-level, dynamic, object-oriented programming language. It is a general purpose language, which is designed to be easy to use and easy to read.\n",
26 | "\n",
27 | "**IPython** (Interactive Python) was originally developed for Python. It is now a command shell for interactive computing that supports multiple programming languages. It offers rich media, shell syntax, tab completion, and history. IPython is based on an architecture that provides parallel and distributed computing, so parallel applications can be developed, executed, debugged and monitored interactively.\n",
28 | "\n",
29 | "**Jupyter Notebooks** are a web-based interactive computational environment for creating IPython notebooks. An IPython notebook is a JSON document containing an ordered list of input/output cells which can contain code, text, mathematics, plots and rich media. Notebooks make data analysis easier to perform, understand and reproduce. All laboratories in this course are prepared as Notebooks. As you can see in this Notebook, we can include text, images, hyperlinks and source code. Notebooks can be converted to a number of open standard output formats (HTML, HTML presentation slides, LaTeX, PDF, ReStructuredText, Markdown, Python) through `File` -> `Download As` in the web interface. In addition, Jupyter manages notebook versions through a `checkpoint` mechanism. You can create a checkpoint at any time via `File -> Save and Checkpoint`. \n",
30 | "\n",
31 | "**NOTE on Checkpointing:** in this course, we use a peculiar working environment. We don't have a Notebook server: instead, we create on-demand clusters with a Notebook front-end. Since your clusters are **ephemeral** (they are terminated after a predefined amount of time), checkpointing is of little use beyond saving your notebook in your ephemeral environment. It is far better to download your notebooks regularly and push them to your git repository."
32 | ],
33 | "metadata": {},
34 | "cell_type": "markdown"
35 | },
36 | {
37 | "source": [
38 | "## 1.1. Tab completion\n",
39 | "\n",
40 | "Tab completion is a convenient way to explore the structure of any object you're dealing with. Simply type `object_name.` and press Tab to view the suggestions for the object's attributes. Besides Python objects and keywords, tab completion also works on file and directory names."
41 | ],
42 | "metadata": {},
43 | "cell_type": "markdown"
44 | },
45 | {
46 | "source": [
47 | "s = \"test function of tab completion\"\n",
48 | "\n",
49 | "# type s. and press Tab to see the suggestions\n",
50 | "\n",
51 | "# Show your experiments working on a string. \n",
52 | "# Try splitting a string into its constituent words, and count the number of words.\n"
53 | ],
54 | "execution_count": null,
55 | "cell_type": "code",
56 | "metadata": {
57 | "collapsed": false
58 | },
59 | "outputs": []
60 | },
61 | {
62 | "source": [
63 | "## 1.2. System shell commands\n",
64 | "\n",
65 | "To run any command in the system shell, simply prefix it with `!`. For example:"
66 | ],
67 | "metadata": {},
68 | "cell_type": "markdown"
69 | },
70 | {
71 | "source": [
72 | "# list all files and directories in the current folder\n",
73 | "!ls"
74 | ],
75 | "execution_count": null,
76 | "cell_type": "code",
77 | "metadata": {
78 | "collapsed": false
79 | },
80 | "outputs": []
81 | },
82 | {
83 | "source": [
84 | "## 1.3. Magic functions\n",
85 | "\n",
86 | "IPython has a set of predefined `magic functions` that you can call with a command line style syntax. There are two types of magics, line-oriented and cell-oriented. \n",
87 | "\n",
88 | "**Line magics** are prefixed with the `%` character and work much like OS command-line calls: they get as an argument the rest of the line, *where arguments are passed without parentheses or quotes*. \n",
89 | "\n",
90 | "**Cell magics** are prefixed with a double `%%`, and they are functions that get as an argument not only the rest of the line, but also the lines below it in a separate argument."
91 | ],
92 | "metadata": {},
93 | "cell_type": "markdown"
94 | },
95 | {
96 | "source": [
97 | "%timeit range(1000)"
98 | ],
99 | "execution_count": null,
100 | "cell_type": "code",
101 | "metadata": {
102 | "collapsed": false
103 | },
104 | "outputs": []
105 | },
106 | {
107 | "source": [
108 | "%%timeit x = range(10000)\n",
109 | "max(x)"
110 | ],
111 | "execution_count": null,
112 | "cell_type": "code",
113 | "metadata": {
114 | "collapsed": false
115 | },
116 | "outputs": []
117 | },
118 | {
119 | "source": [
120 | "For more information, you can follow this [link](http://nbviewer.jupyter.org/github/ipython/ipython/blob/1.x/examples/notebooks/Cell%20Magics.ipynb)"
121 | ],
122 | "metadata": {},
123 | "cell_type": "markdown"
124 | },
125 | {
126 | "source": [
127 | "## 1.4. Debugging\n",
128 | "\n",
129 | "Whenever an exception occurs, the call stack is printed out to help you to track down the true source of the problem. It is important to gain familiarity with the call stack, especially when using the PySpark API."
130 | ],
131 | "metadata": {},
132 | "cell_type": "markdown"
133 | },
134 | {
135 | "source": [
136 | "for i in [4,3,2,0]:\n",
137 | "    print(5/i)"
138 | ],
139 | "execution_count": null,
140 | "cell_type": "code",
141 | "metadata": {
142 | "collapsed": false
143 | },
144 | "outputs": []
145 | },
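  {
   "source": [
    "When a failure is expected, the exception can also be caught so that the loop keeps running. A small sketch, where `safe_divide` is a hypothetical helper name introduced only for this illustration; `traceback.print_exc()` prints the call stack of the exception currently being handled.\n",
    "\n",
    "```python\n",
    "import traceback\n",
    "\n",
    "def safe_divide(a, b):\n",
    "    try:\n",
    "        return a / b\n",
    "    except ZeroDivisionError:\n",
    "        # Print the call stack for inspection, then carry on with None.\n",
    "        traceback.print_exc()\n",
    "        return None\n",
    "\n",
    "results = [safe_divide(5, i) for i in [4, 3, 2, 0]]\n",
    "print(results)\n",
    "```\n",
    "\n",
    "Reading the printed stack from the bottom up shows the failing line first and then the calls that led to it."
   ],
   "metadata": {},
   "cell_type": "markdown"
  },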
146 | {
147 | "source": [
148 | "# 2. Python + Pandas + Matplotlib: A great environment for Data Science\n",
149 | "\n",
150 | "This section aims to help students gain a basic understanding of the Python programming language and some of its libraries, including `Pandas` and `Matplotlib`. \n",
151 | "\n",
152 | "When working with a small dataset (one that can comfortably fit into a single machine), Pandas and Matplotlib, together with Python, are valid alternatives to other popular tools such as R and Matlab. Using these libraries lets you benefit from Python's simple and clear syntax, very good performance, memory management, error handling, and package management \\[[1](http://ajminich.com/2013/06/22/9-reasons-to-switch-from-matlab-to-python/)\\].\n",
153 | "\n",
154 | "\n",
155 | "## 2.1. Python syntax\n",
156 | "\n",
157 | "(This section is for students who have not programmed in Python before. If you're familiar with Python, please move to the next section: 1.2. Numpy)\n",
158 | "\n",
159 | "When working with Python, the code seems to be simpler than (many) other languages. In this laboratory, we compare the Python syntax to that of Java - another very common language.\n",
160 | "\n",
161 | "```java\n",
162 | "// java syntax\n",
163 | "int i = 10;\n",
164 | "String s = \"advanced machine learning\";\n",
165 | "System.out.println(i);\n",
166 | "System.out.println(s);\n",
167 | "// you must not forget the semicolon at the end of each statement\n",
168 | "```"
169 | ],
170 | "metadata": {},
171 | "cell_type": "markdown"
172 | },
173 | {
174 | "source": [
175 | "# python syntax\n",
176 | "i = 10\n",
177 | "s = \"advanced machine learning\"\n",
178 | "print(i)\n",
179 | "print(s)\n",
180 | "# no semicolon is needed at the end of a statement"
181 | ],
182 | "execution_count": null,
183 | "cell_type": "code",
184 | "metadata": {
185 | "collapsed": false
186 | },
187 | "outputs": []
188 | },
189 | {
190 | "source": [
191 | "### Indentation & If-else syntax\n",
192 | "In Python, we don't use `{` and `}` to define blocks of code: instead, we use indentation. **The code within the same block must have the same indentation**. For example, in Java, we write:\n",
193 | "```java\n",
194 | "String language = \"Python\";\n",
195 | "\n",
196 | "// the block is surrounded by { and }\n",
197 | "// the condition is in ( and )\n",
198 | "if (language.equals(\"Python\")) {\n",
199 | "    int x = 1;\n",
200 | "    x += 10;\n",
201 | "        int y = 5; // a wrong indentation isn't a problem\n",
202 | "    y = x + y;\n",
203 | "    System.out.println(x + y);\n",
204 | "    \n",
205 | "    // a statement is broken into two lines\n",
206 | "    x = y\n",
207 | "        + y;\n",
208 | "    \n",
209 | "    // do some stuff\n",
210 | "}\n",
211 | "else if (language.equals(\"Java\")) {\n",
212 | "    // another block\n",
213 | "}\n",
214 | "else {\n",
215 | "    // another block\n",
216 | "}\n",
217 | "```"
218 | ],
219 | "metadata": {},
220 | "cell_type": "markdown"
221 | },
222 | {
223 | "source": [
224 | "language = \"Python\"\n",
225 | "if language == \"Python\":\n",
226 | "    x = 10\n",
227 | "    x += 10\n",
228 | "    y = 5 # all statements in the same block must have the same indentation\n",
229 | "    y = (\n",
230 | "        x + y\n",
231 | "    ) # statements can be on multiple lines, using ( )\n",
232 | "    print (x \n",
233 | "        + y)\n",
234 | "    \n",
235 | "    # statements can also be split on multiple lines by using \\ at the END of each line\n",
236 | "    x = y \\\n",
237 | "        + y\n",
238 | "    \n",
239 | "    # do some other stuff\n",
240 | "elif language == \"Java\":\n",
241 | "    # another block\n",
242 | "    pass\n",
243 | "else:\n",
244 | "    # another block\n",
245 | "    pass"
246 | ],
247 | "execution_count": null,
248 | "cell_type": "code",
249 | "metadata": {
250 | "collapsed": false
251 | },
252 | "outputs": []
253 | },
254 | {
255 | "source": [
256 | "### Ternary conditional operator\n",
257 | "In Python, we often see the ternary conditional operator, which assigns a value to a variable based on some condition. For example, in Java, we write:\n",
258 | "\n",
259 | "```java\n",
260 | "int x = 10;\n",
261 | "// if x > 10, assign y = 5, otherwise, y = 15\n",
262 | "int y = (x > 10) ? 5 : 15;\n",
263 | "\n",
264 | "int z;\n",
265 | "if (x > 10)\n",
266 | "    z = 5; // it's not necessary to have { } when the block has only one statement\n",
267 | "else\n",
268 | "    z = 15;\n",
269 | "```\n",
270 | "\n",
271 | "Although an `if else` block gives the same result, the ternary conditional operator is often preferred for its brevity.\n",
272 | "\n",
273 | "In python, we write:"
274 | ],
275 | "metadata": {},
276 | "cell_type": "markdown"
277 | },
278 | {
279 | "source": [
280 | "x = 10\n",
281 | "# a very natural way\n",
282 | "y = 5 if x > 10 else 15\n",
283 | "print(y)\n",
284 | "\n",
285 | "# another way: an older idiom that gives the wrong result when the value for the true case is falsy (e.g. 0)\n",
286 | "y = x > 10 and 5 or 15\n",
287 | "print(y)"
288 | ],
289 | "execution_count": null,
290 | "cell_type": "code",
291 | "metadata": {
292 | "collapsed": false
293 | },
294 | "outputs": []
295 | },
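  {
   "source": [
    "The `and`/`or` form above predates the conditional expression and is not equivalent to it in general. A minimal sketch of the difference, using a made-up value `x = 20` and `0` as the result for the true case:\n",
    "\n",
    "```python\n",
    "x = 20\n",
    "\n",
    "# The condition holds, but 0 is falsy, so `or` falls through to 15.\n",
    "with_and_or = x > 10 and 0 or 15\n",
    "\n",
    "# The conditional expression returns the value for the true case as written.\n",
    "with_ternary = 0 if x > 10 else 15\n",
    "\n",
    "print(with_and_or, with_ternary)\n",
    "```\n",
    "\n",
    "For this reason the `value_if_true if condition else value_if_false` form is the safer choice."
   ],
   "metadata": {},
   "cell_type": "markdown"
  },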
296 | {
297 | "source": [
298 | "### Lists and For loops\n",
299 | "Another syntax that we should revisit is the `for loop`. In java, we can write:\n",
300 | "\n",
301 | "```java\n",
302 | "// init an array with 10 integer numbers\n",
303 | "int[] array = new int[]{1, 2, 3, 4, 5, 6, 7, 8, 9, 10};\n",
304 | "for (int i = 0; i < array.length; i++){\n",
305 | "    // print the i-th element of array\n",
306 | "    System.out.println(array[i]);\n",
307 | "}\n",
308 | "```\n",
309 | "\n",
310 | "In Python, instead of using an index to reach each element, we can iterate over the elements directly:"
311 | ],
312 | "metadata": {},
313 | "cell_type": "markdown"
314 | },
315 | {
316 | "source": [
317 | "array = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n",
318 | "# Python has no built-in array data structure\n",
319 | "# instead, it uses \"list\" which is much more general \n",
320 | "# and can be used as a multidimensional array quite easily.\n",
321 | "for element in array:\n",
322 | "    print(element)"
323 | ],
324 | "execution_count": null,
325 | "cell_type": "code",
326 | "metadata": {
327 | "collapsed": false
328 | },
329 | "outputs": []
330 | },
331 | {
332 | "source": [
333 | "As we can see, the code is very clean. If you need the index of each element, here's what you should do:"
334 | ],
335 | "metadata": {},
336 | "cell_type": "markdown"
337 | },
338 | {
339 | "source": [
340 | "for (index, element) in enumerate(array):\n",
341 | "    print(index, element)"
342 | ],
343 | "execution_count": null,
344 | "cell_type": "code",
345 | "metadata": {
346 | "collapsed": false
347 | },
348 | "outputs": []
349 | },
350 | {
351 | "source": [
352 | "Actually, Python has no built-in array data structure. It uses the `list` data structure, which is much more general and can be used as a multidimensional array quite easily. In addition, elements in a list can be retrieved in a very concise way. For example, we create a 2d-array with 4 rows. Each row has 3 elements."
353 | ],
354 | "metadata": {},
355 | "cell_type": "markdown"
356 | },
357 | {
358 | "source": [
359 | "# 2-dimensional array with 4 rows, 3 columns\n",
360 | "twod_array = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]\n",
361 | "for index, row in enumerate(twod_array):\n",
362 | "    print(\"row \", index, \":\", row)\n",
363 | "\n",
364 | "# rows 1 and 2: the end index of a slice is excluded\n",
365 | "print(\"rows 1 and 2: \", twod_array[1:3])\n",
366 | "\n",
367 | "# all rows from row 2 onwards\n",
368 | "print(\"all rows from row 2: \", twod_array[2:])\n",
369 | "\n",
370 | "# all rows before row 2 (rows 0 and 1)\n",
371 | "print(\"all rows before row 2:\", twod_array[:2])\n",
372 | "\n",
373 | "# every second row, starting from the beginning\n",
374 | "print(\"all rows from the beginning with step of 2:\", twod_array[::2])"
375 | ],
376 | "execution_count": null,
377 | "cell_type": "code",
378 | "metadata": {
379 | "collapsed": false
380 | },
381 | "outputs": []
382 | },
383 | {
384 | "source": [
385 | "### Dictionaries\n",
386 | "Another useful data structure in Python is a `dictionary`, which we use to store (key, value) pairs. Here's some example usage of dictionaries:"
387 | ],
388 | "metadata": {},
389 | "cell_type": "markdown"
390 | },
391 | {
392 | "source": [
393 | "d = {'key1': 'value1', 'key2': 'value2'} # Create a new dictionary with some data\n",
394 | "print(d['key1']) # Get an entry from a dictionary; prints \"value1\"\n",
395 | "print('key1' in d) # Check if a dictionary has a given key; prints \"True\"\n",
396 | "d['key3'] = 'value3' # Set an entry in a dictionary\n",
397 | "print(d['key3']) # Prints \"value3\"\n",
398 | "# print(d['key9']) # KeyError: 'key9' not a key of d\n",
399 | "print(d.get('key9', 'custom_default_value')) # Get an element with a default; prints \"custom_default_value\"\n",
400 | "print(d.get('key3', 'custom_default_value')) # Get an element with a default; prints \"value3\"\n",
401 | "del d['key3'] # Remove an element from a dictionary\n",
402 | "print(d.get('key3', 'custom_default_value')) # 'key3' is no longer a key; prints \"custom_default_value\"\n"
403 | ],
404 | "execution_count": null,
405 | "cell_type": "code",
406 | "metadata": {
407 | "collapsed": false
408 | },
409 | "outputs": []
410 | },
411 | {
412 | "source": [
413 | "### Functions\n",
414 | "In Python, we can define a function by using the `def` keyword."
415 | ],
416 | "metadata": {},
417 | "cell_type": "markdown"
418 | },
419 | {
420 | "source": [
421 | "def square(x):\n",
422 | "    return x*x\n",
423 | "\n",
424 | "print(square(5))"
425 | ],
426 | "execution_count": null,
427 | "cell_type": "code",
428 | "metadata": {
429 | "collapsed": false
430 | },
431 | "outputs": []
432 | },
433 | {
434 | "source": [
435 | "You can apply a function to each element of a list/array by using `map`, optionally together with a `lambda` function. For example, we want to square the elements in a list:"
436 | ],
437 | "metadata": {},
438 | "cell_type": "markdown"
439 | },
440 | {
441 | "source": [
442 | "array = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n",
443 | "\n",
444 | "# apply function \"square\" on each element of \"array\"\n",
445 | "print(list(map(lambda x: square(x), array)))\n",
446 | "\n",
447 | "# or using a list comprehension\n",
448 | "print([square(x) for x in array])\n",
449 | "\n",
450 | "print(\"original array:\", array)"
451 | ],
452 | "execution_count": null,
453 | "cell_type": "code",
454 | "metadata": {
455 | "collapsed": false
456 | },
457 | "outputs": []
458 | },
459 | {
460 | "source": [
461 | "Both of these forms are used very often. \n",
462 | "\n",
463 | "If you are not familiar with **list comprehensions**, follow this [link](http://python-3-patterns-idioms-test.readthedocs.io/en/latest/Comprehensions.html).\n",
464 | "\n",
465 | "We can also put a function `B` inside a function `A` (that is, we can have nested functions). In that case, function `B` can only be accessed inside function `A` (the scope in which it is declared). For example:"
466 | ],
467 | "metadata": {},
468 | "cell_type": "markdown"
469 | },
470 | {
471 | "source": [
472 | "# select only the prime numbers in the array\n",
473 | "# and square them\n",
474 | "def filterAndSquarePrime(arr):\n",
475 | "    \n",
476 | "    # a very simple function to check whether a number is prime\n",
477 | "    def checkPrime(number):\n",
478 | "        for i in range(2, int(number/2) + 1):\n",
479 | "            if number % i == 0:\n",
480 | "                return False\n",
481 | "        return number > 1\n",
482 | "    \n",
483 | "    primeNumbers = filter(lambda x: checkPrime(x), arr)\n",
484 | "    return map(lambda x: square(x), primeNumbers)\n",
485 | "\n",
486 | "# we cannot access checkPrime from here\n",
487 | "# checkPrime(5)\n",
488 | "\n",
489 | "result = filterAndSquarePrime(array)\n",
490 | "list(result)"
491 | ],
492 | "execution_count": null,
493 | "cell_type": "code",
494 | "metadata": {
495 | "collapsed": false
496 | },
497 | "outputs": []
498 | },
499 | {
500 | "source": [
501 | "### Importing modules, functions\n",
502 | "Modules in Python are packages of code. Putting code into modules helps increase reusability and maintainability.\n",
503 | "Modules can be nested.\n",
504 | "To import a module, we simply use the syntax `import module_name`. Once it is imported, we can use any function or class inside it."
505 | ],
506 | "metadata": {},
507 | "cell_type": "markdown"
508 | },
509 | {
510 | "source": [
511 | "# import the module 'math' to use its functions for calculations\n",
512 | "import math\n",
513 | "\n",
514 | "# print the square root of 16\n",
515 | "print(math.sqrt(16))\n",
516 | "\n",
517 | "# we can create an alias when importing a module\n",
518 | "import numpy as np\n",
519 | "\n",
520 | "print(np.sqrt(16))"
521 | ],
522 | "execution_count": null,
523 | "cell_type": "code",
524 | "metadata": {
525 | "collapsed": false
526 | },
527 | "outputs": []
528 | },
529 | {
530 | "source": [
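531 | "Sometimes you only need a few functions from a module. In that case, the syntax `from <module_name> import <function_name>` binds just those names in the current namespace, so they can be called without the module prefix. Note that Python still loads the whole module; only the names you import differ."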
532 | ],
533 | "metadata": {},
534 | "cell_type": "markdown"
535 | },
536 | {
537 | "source": [
538 | "# only import function 'sin' in package 'math'\n",
539 | "from math import sin\n",
540 | "\n",
541 | "# use the function (the argument is in radians)\n",
542 | "print(sin(60))"
543 | ],
544 | "execution_count": null,
545 | "cell_type": "code",
546 | "metadata": {
547 | "collapsed": false
548 | },
549 | "outputs": []
550 | },
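As a small illustration of the two import forms, the sketch below imports `sin` and `radians` directly and then checks that the `math` module is loaded in full either way. The variable name `angle` is an assumption; everything else is standard library.

```python
import sys
import math
from math import sin, radians

# from-import binds the names directly, so no "math." prefix is needed
angle = radians(30)          # sin expects radians, so convert from degrees first
print(round(sin(angle), 6))

# the full module is loaded either way and cached in sys.modules
print("math" in sys.modules)
print(sin is math.sin)       # both names refer to the same function object
```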
551 | {
552 | "source": [
553 | "That's quite enough for Python. Now, let's practice a little bit."
554 | ],
555 | "metadata": {},
556 | "cell_type": "markdown"
557 | },
558 | {
559 | "source": [
560 | "### Question 1\n",
561 | "#### Question 1.1\n",
562 | "\n",
563 | "Write a function `checkSquareNumber` to check whether an integer is a square number. For example, 16 and 9 are square numbers; 15 is not.\n",
564 | "Requirements:\n",
565 | "\n",
566 | "- Input: an integer number\n",
567 | "\n",
568 | "- Output: `True` or `False`\n",
569 | "\n",
570 | "HINT: If the square root of a number is an integer number, it is a square number.\n",
571 | ""
572 | ],
573 | "metadata": {},
574 | "cell_type": "markdown"
575 | },
576 | {
577 | "source": [
578 | "```python\n",
579 | "###################################################################\n",
580 | "#### TO COMPLETE #####\n",
581 | "###################################################################\n",
582 | "import math\n",
583 | "\n",
584 | "def checkSquareNumber(x):\n",
585 | " # calculate the square root of x\n",
586 | " # return True if square root is integer, \n",
587 | " # otherwise, return False\n",
588 | " return ...\n",
589 | "\n",
590 | "print(checkSquareNumber(16))\n",
591 | "print(checkSquareNumber(250))\n",
592 | "```"
593 | ],
594 | "metadata": {},
595 | "cell_type": "markdown"
596 | },
597 | {
598 | "source": [
599 | "#### Question 1.2\n",
600 | "\n",
601 | "A list `list_numbers` which contains the numbers from 0 to 9999 can be constructed with: \n",
602 | "\n",
603 | "```python\n",
604 | "list_numbers = range(0, 10000)\n",
605 | "```\n",
606 | "\n",
607 | "Extract the square numbers in `list_numbers` using the function `checkSquareNumber` from question 1.1. How many elements are in the extracted list?\n",
608 | ""
609 | ],
610 | "metadata": {},
611 | "cell_type": "markdown"
612 | },
613 | {
614 | "source": [
615 | "```python\n",
616 | "###################################################################\n",
617 | "#### TO COMPLETE #####\n",
618 | "###################################################################\n",
619 | "\n",
620 | "list_numbers = ...\n",
621 | "square_numbers = ...  # try to use filter; wrap the result in list() so that len() works\n",
622 | "print(square_numbers)\n",
623 | "print(len(square_numbers))\n",
624 | "```"
625 | ],
626 | "metadata": {},
627 | "cell_type": "markdown"
628 | },
629 | {
630 | "source": [
631 | "#### Question 1.3\n",
632 | "\n",
633 | "Using slicing, select the elements of the list `square_numbers` whose index is from 5 to 20 (zero-based index).\n",
634 | ""
635 | ],
636 | "metadata": {},
637 | "cell_type": "markdown"
638 | },
639 | {
640 | "source": [
641 | "```python\n",
642 | "###################################################################\n",
643 | "#### TO COMPLETE #####\n",
644 | "###################################################################\n",
645 | "\n",
646 | "print(square_numbers[...])\n",
647 | "```"
648 | ],
649 | "metadata": {},
650 | "cell_type": "markdown"
651 | },
652 | {
653 | "source": [
654 | "Next, we will take a quick look at Numpy, a powerful Python module."
655 | ],
656 | "metadata": {},
657 | "cell_type": "markdown"
658 | },
659 | {
660 | "source": [
661 | "## 2.2. Numpy\n",
662 | "Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays.\n",
663 | "### 2.2.1. Array\n",
664 | "A numpy array is a grid of values, all of **the same type**, and is indexed by a tuple of nonnegative integers. Because all elements share one type and are typically stored in one block of memory, Numpy benefits from [locality of reference](https://en.wikipedia.org/wiki/Locality_of_reference). In addition, many Numpy operations are implemented in C, avoiding the general cost of loops in Python, pointer indirection and per-element dynamic type checking. As a result, Numpy is often much faster than Python's built-in data structures. When working with massive data and computationally expensive tasks, you should consider using Numpy. \n",
665 | "\n",
666 | "The number of dimensions is the `rank` of the array; the `shape` of an array is a tuple of integers giving the size of the array along each dimension.\n",
667 | "\n",
668 | "We can initialize numpy arrays from nested Python lists, and access elements using square brackets:"
669 | ],
670 | "metadata": {},
671 | "cell_type": "markdown"
672 | },
673 | {
674 | "source": [
675 | "import numpy as np\n",
676 | "\n",
677 | "# Create a rank 1 array\n",
678 | "rank1_array = np.array([1, 2, 3])\n",
679 | "print(\"type of rank1_array:\", type(rank1_array))\n",
680 | "print(\"shape of rank1_array:\", rank1_array.shape)\n",
681 | "print(\"elements in rank1_array:\", rank1_array[0], rank1_array[1], rank1_array[2])\n",
682 | "\n",
683 | "# Create a rank 2 array\n",
684 | "rank2_array = np.array([[1,2,3],[4,5,6]])\n",
685 | "print(\"shape of rank2_array:\", rank2_array.shape)\n",
686 | "print(rank2_array[0, 0], rank2_array[0, 1], rank2_array[1, 0])"
687 | ],
688 | "execution_count": null,
689 | "cell_type": "code",
690 | "metadata": {
691 | "collapsed": false
692 | },
693 | "outputs": []
694 | },
695 | {
696 | "source": [
697 | "### 2.2.2. Array slicing\n",
698 | "Similar to Python lists, numpy arrays can be sliced. The difference is that arrays may be multidimensional, so you specify one slice per dimension (dimensions you leave out are taken in full)."
699 | ],
700 | "metadata": {},
701 | "cell_type": "markdown"
702 | },
703 | {
704 | "source": [
705 | "m_array = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])\n",
706 | "\n",
707 | "# Use slicing to pull out the subarray consisting of the first 2 rows\n",
708 | "# and columns 1 and 2\n",
709 | "b = m_array[:2, 1:3]\n",
710 | "print(b)\n",
711 | "\n",
712 | "# we can only use this syntax with numpy array, not python list\n",
713 | "print(\"value at row 0, column 1:\", m_array[0, 1])\n",
714 | "\n",
715 | "# Rank 1 view of the second row of m_array \n",
716 | "print(\"the second row of m_array:\", m_array[1, :])\n",
717 | "\n",
718 | "# print element at position (0,2) and (1,3)\n",
719 | "print(m_array[[0,1], [2,3]])"
720 | ],
721 | "execution_count": null,
722 | "cell_type": "code",
723 | "metadata": {
724 | "collapsed": false
725 | },
726 | "outputs": []
727 | },
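One consequence of slicing worth seeing in code: a basic slice is a view onto the same data, so writing through it changes the original array. The sketch below reuses `m_array` from the cell above; the names `view` and `independent` are assumptions for illustration.

```python
import numpy as np

m_array = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

view = m_array[:2, 1:3]      # basic slicing: shares memory with m_array
view[0, 0] = 100             # also changes m_array[0, 1]
print(m_array[0, 1])

independent = m_array[:2, 1:3].copy()   # an explicit copy owns its data
independent[0, 0] = -1                  # m_array is unaffected
print(m_array[0, 1])

print(np.shares_memory(m_array, view), np.shares_memory(m_array, independent))
```

Integer-list indexing such as `m_array[[0, 1], [2, 3]]` behaves differently: it returns a new array rather than a view.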
728 | {
729 | "source": [
730 | "### 2.2.3. Boolean array indexing\n",
731 | "We can use boolean array indexing to check whether each element in the array satisfies a condition or use it to do filtering."
732 | ],
733 | "metadata": {},
734 | "cell_type": "markdown"
735 | },
736 | {
737 | "source": [
738 | "m_array = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])\n",
739 | "\n",
740 | "# Find the elements of m_array that are bigger than 3\n",
741 | "# this returns a numpy array of Booleans of the same\n",
742 | "# shape as m_array, where each value of bool_idx tells\n",
743 | "# whether that element of m_array is > 3 or not\n",
744 | "bool_idx = (m_array > 3)\n",
745 | "print(bool_idx , \"\\n\")\n",
746 | "\n",
747 | "# We use boolean array indexing to construct a rank 1 array\n",
748 | "# consisting of the elements of m_array corresponding to the True values\n",
749 | "# of bool_idx\n",
750 | "print(m_array[bool_idx], \"\\n\")\n",
751 | "\n",
752 | "# We can combine two statements\n",
753 | "print(m_array[m_array > 3], \"\\n\")\n",
754 | "\n",
755 | "# select elements with multiple conditions\n",
756 | "print(m_array[(m_array > 3) & (m_array % 2 == 0)])\n"
757 | ],
758 | "execution_count": null,
759 | "cell_type": "code",
760 | "metadata": {
761 | "collapsed": false
762 | },
763 | "outputs": []
764 | },
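Boolean masks can also be used to assign values, count matches, or choose between two alternatives per element. This is a minimal sketch under the same `m_array` as above; the names `clipped`, `labels`, and `count` are assumptions.

```python
import numpy as np

m_array = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# assign through a mask: cap every value above 8 at 8 (on a copy)
clipped = m_array.copy()
clipped[clipped > 8] = 8
print(clipped)

# np.where picks one of two values per element depending on the condition
labels = np.where(m_array % 2 == 0, "even", "odd")
print(labels)

# count how many elements satisfy a condition
count = np.count_nonzero(m_array > 3)
print(count)
```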
765 | {
766 | "source": [
767 | "### 2.2.4. Datatypes\n",
768 | "Remember that the elements in a numpy array have the same type. When you create an array, Numpy tries to guess a suitable datatype. However, we can specify the datatype explicitly via an optional argument."
769 | ],
770 | "metadata": {},
771 | "cell_type": "markdown"
772 | },
773 | {
774 | "source": [
775 | "# let Numpy guess the datatype\n",
776 | "x1 = np.array([1, 2])\n",
777 | "print(x1.dtype)\n",
778 | "\n",
779 | "# force the datatype to be float64\n",
780 | "x2 = np.array([1, 2], dtype=np.float64)\n",
781 | "print(x2.dtype)"
782 | ],
783 | "execution_count": null,
784 | "cell_type": "code",
785 | "metadata": {
786 | "collapsed": false
787 | },
788 | "outputs": []
789 | },
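Two related behaviours are easy to check: mixed inputs are upcast to a common type, and `astype` converts an existing array to another type. The sketch below is illustrative only; the variable names are assumptions.

```python
import numpy as np

# mixing integers and floats upcasts everything to float64
mixed = np.array([1, 2.5])
print(mixed.dtype)

# astype returns a converted copy; float -> int truncates toward zero
as_int = np.array([1.7, -1.7]).astype(np.int64)
print(as_int)

# a smaller dtype uses less memory per element
small = np.array([1, 2, 3], dtype=np.int8)
print(small.itemsize, "byte per element")
```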
790 | {
791 | "source": [
792 | "### 2.2.5. Array math\n",
793 | "Similar to Matlab or R, in Numpy, basic mathematical functions operate elementwise on arrays, and are available both as operator overloads and as functions in the numpy module."
794 | ],
795 | "metadata": {},
796 | "cell_type": "markdown"
797 | },
798 | {
799 | "source": [
800 | "x = np.array([[1,2],[3,4]], dtype=np.float64)\n",
801 | "y = np.array([[5,6],[7,8]], dtype=np.float64)\n",
802 | "# mathematical function is used as operator\n",
803 | "print(\"x + y =\", x + y, \"\\n\")\n",
804 | "\n",
805 | "# mathematical function is used as function\n",
806 | "print(\"np.add(x, y)=\", np.add(x, y), \"\\n\")\n",
807 | "\n",
808 | "# Unlike MATLAB, * is elementwise multiplication\n",
809 | "# not matrix multiplication\n",
810 | "print(\"x * y =\", x * y , \"\\n\")\n",
811 | "print(\"np.multiply(x, y)=\", np.multiply(x, y), \"\\n\")\n",
812 | "print(\"x*2=\", x*2, \"\\n\")\n",
813 | "\n",
814 | "# to multiply two matrices, we use dot function\n",
815 | "print(\"x.dot(y)=\", x.dot(y), \"\\n\")\n",
816 | "print(\"np.dot(x, y)=\", np.dot(x, y), \"\\n\")\n",
817 | "\n",
818 | "# Elementwise square root\n",
819 | "print(\"np.sqrt(x)=\", np.sqrt(x), \"\\n\")"
820 | ],
821 | "execution_count": null,
822 | "cell_type": "code",
823 | "metadata": {
824 | "collapsed": false
825 | },
826 | "outputs": []
827 | },
828 | {
829 | "source": [
830 | "Note that unlike MATLAB, `*` is elementwise multiplication, not matrix multiplication. We instead use the `dot` function to compute inner products of vectors, to multiply a vector by a matrix, and to multiply matrices. In what follows, we work on a few more examples to reiterate the concept."
831 | ],
832 | "metadata": {},
833 | "cell_type": "markdown"
834 | },
835 | {
836 | "source": [
837 | "# declare two vectors\n",
838 | "v = np.array([9,10])\n",
839 | "w = np.array([11, 12])\n",
840 | "\n",
841 | "# Inner product of vectors\n",
842 | "print(\"v.dot(w)=\", v.dot(w))\n",
843 | "print(\"np.dot(v, w)=\", np.dot(v, w))\n",
844 | "\n",
845 | "# Matrix / vector product\n",
846 | "print(\"x.dot(v)=\", x.dot(v))\n",
847 | "print(\"np.dot(x, v)=\", np.dot(x, v))\n",
848 | "\n",
849 | "# Matrix / matrix product\n",
850 | "print(\"x.dot(y)=\", x.dot(y))\n",
851 | "print(\"np.dot(x, y)=\", np.dot(x, y))"
852 | ],
853 | "execution_count": null,
854 | "cell_type": "code",
855 | "metadata": {
856 | "collapsed": false
857 | },
858 | "outputs": []
859 | },
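In Python 3.5 and later, the `@` operator is an alternative spelling of these products for one- and two-dimensional arrays. A short sketch, reusing the values of `x`, `y`, and `v` from the cells above:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]], dtype=np.float64)
y = np.array([[5, 6], [7, 8]], dtype=np.float64)
v = np.array([9, 10])

# matrix / matrix product
print(x @ y)

# matrix / vector product
print(x @ v)

# for 1-D and 2-D arrays, @ and np.dot give the same result
print(np.array_equal(x @ y, np.dot(x, y)))
```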
860 | {
861 | "source": [
862 | "Additionally, we can compute aggregations over arrays, such as `sum` or `nansum`, and transpose an array with `T`."
863 | ],
864 | "metadata": {},
865 | "cell_type": "markdown"
866 | },
867 | {
868 | "source": [
869 | "x = np.array([[1,2], [3,4]])\n",
870 | "\n",
871 | "# Compute sum of all elements\n",
872 | "print(np.sum(x))\n",
873 | "\n",
874 | "# Compute sum of each column\n",
875 | "print(np.sum(x, axis=0))\n",
876 | "\n",
877 | "# Compute sum of each row\n",
878 | "print(np.sum(x, axis=1))\n",
879 | "\n",
880 | "# transpose the matrix\n",
881 | "print(x.T)\n",
882 | "\n",
883 | "# Note that taking the transpose of a rank 1 array does nothing:\n",
884 | "v = np.array([1,2,3])\n",
885 | "print(v.T) # Prints \"[1 2 3]\""
886 | ],
887 | "execution_count": null,
888 | "cell_type": "code",
889 | "metadata": {
890 | "collapsed": false
891 | },
892 | "outputs": []
893 | },
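Since transposing a rank 1 array does nothing, a column vector has to be created by reshaping. The sketch below also shows `nansum`, which the text mentions but the cell above does not use. Variable names are assumptions.

```python
import numpy as np

v = np.array([1, 2, 3])

row = v.reshape(1, 3)     # rank 2 array with one row
col = v.reshape(-1, 1)    # rank 2 array with one column (-1 lets numpy infer the size)

print(v.T.shape)          # unchanged: still rank 1
print(row.T.shape)        # transposing a 1x3 row gives a 3x1 column
print(col.shape)

# nansum treats NaN as zero, whereas sum would return NaN
with_gap = np.array([1.0, np.nan, 2.0])
print(np.sum(with_gap), np.nansum(with_gap))
```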
894 | {
895 | "source": [
896 | "### Question 2\n",
897 | "\n",
898 | "Given a 2D array:\n",
899 | "\n",
900 | "```\n",
901 | " 1 2 3 4\n",
902 | " 5 6 7 8 \n",
903 | " 9 10 11 12\n",
904 | "13 14 15 16\n",
905 | "```\n",
906 | "\n",
907 | "\n",
908 | "#### Question 2.1\n",
909 | "\n",
910 | "Print all the odd numbers in this array using `Boolean array indexing`.\n",
911 | ""
912 | ],
913 | "metadata": {},
914 | "cell_type": "markdown"
915 | },
916 | {
917 | "source": [
918 | "```python\n",
919 | "###################################################################\n",
920 | "#### TO COMPLETE #####\n",
921 | "###################################################################\n",
922 | "\n",
923 | "array_numbers = np.array([\n",
924 | " [1, 2, 3, 4],\n",
925 | " [5, 6, 7, 8],\n",
926 | " [9, 10, 11, 12],\n",
927 | " [13, 14, 15, 16]\n",
928 | " ])\n",
929 | "\n",
930 | "print(...)\n",
931 | "```"
932 | ],
933 | "metadata": {},
934 | "cell_type": "markdown"
935 | },
936 | {
937 | "source": [
938 | "#### Question 2.2\n",
939 | "\n",
940 | "Extract the second row and the third column in this array using `array slicing`.\n",
941 | ""
942 | ],
943 | "metadata": {},
944 | "cell_type": "markdown"
945 | },
946 | {
947 | "source": [
948 | "```python\n",
949 | "###################################################################\n",
950 | "#### TO COMPLETE #####\n",
951 | "###################################################################\n",
952 | "\n",
953 | "print(array_numbers[...])\n",
954 | "print(array_numbers[...])\n",
955 | "```"
956 | ],
957 | "metadata": {},
958 | "cell_type": "markdown"
959 | },
960 | {
961 | "source": [
962 | "#### Question 2.3\n",
963 | "\n",
964 | "Calculate the sum of diagonal elements.\n",
965 | ""
966 | ],
967 | "metadata": {},
968 | "cell_type": "markdown"
969 | },
970 | {
971 | "source": [
972 | "```python\n",
973 | "###################################################################\n",
974 | "#### TO COMPLETE #####\n",
975 | "###################################################################\n",
976 | "\n",
977 | "total = 0  # named total to avoid shadowing the built-in sum\n",
978 | "for i in range(0, ...):\n",
979 | "    total += array_numbers...\n",
980 | "    \n",
981 | "print(total)\n",
982 | "```"
983 | ],
984 | "metadata": {},
985 | "cell_type": "markdown"
986 | },
987 | {
988 | "source": [
989 | "#### Question 2.4\n",
990 | "\n",
991 | "Print the elementwise product of the first row and the last row using numpy's functions.\n",
992 | "\n",
993 | "Print the inner product of these two rows.\n",
994 | ""
995 | ],
996 | "metadata": {},
997 | "cell_type": "markdown"
998 | },
999 | {
1000 | "source": [
1001 | "```python\n",
1002 | "###################################################################\n",
1003 | "#### TO COMPLETE #####\n",
1004 | "###################################################################\n",
1005 | "\n",
1006 | "print(...)\n",
1007 | "print(...)\n",
1008 | "```"
1009 | ],
1010 | "metadata": {},
1011 | "cell_type": "markdown"
1012 | },
1013 | {
1014 | "source": [
1015 | "## 2.3. Matplotlib\n",
1016 | "\n",
1017 | "As its name indicates, Matplotlib is a plotting library. It provides both a very quick way to visualize data from Python and publication-quality figures in many formats. The most important function in matplotlib is `plot`, which allows you to plot 2D data."
1018 | ],
1019 | "metadata": {},
1020 | "cell_type": "markdown"
1021 | },
1022 | {
1023 | "source": [
1024 | "%matplotlib inline\n",
1025 | "import matplotlib.pyplot as plt\n",
1026 | "plt.plot([1,2,3,4])\n",
1027 | "plt.ylabel('custom y label')\n",
1028 | "plt.show()"
1029 | ],
1030 | "execution_count": null,
1031 | "cell_type": "code",
1032 | "metadata": {
1033 | "collapsed": false
1034 | },
1035 | "outputs": []
1036 | },
1037 | {
1038 | "source": [
1039 | "When we provide a single list or array to `plot()`, matplotlib assumes it is a sequence of y values and automatically generates the x values for us. Since Python ranges start at 0, the default x vector has the same length as y but starts at 0. Hence the x data are [0,1,2,3].\n",
1040 | "\n",
1041 | "In the next example, we plot a figure with both x and y data. We also draw a dashed line instead of the default solid one."
1042 | ],
1043 | "metadata": {},
1044 | "cell_type": "markdown"
1045 | },
1046 | {
1047 | "source": [
1048 | "plt.plot([1, 2, 3, 4], [1, 4, 9, 16], 'r--')\n",
1049 | "plt.show()\n",
1050 | "\n",
1051 | "plt.bar([1, 2, 3, 4], [1, 4, 9, 16], align='center')\n",
1052 | "# labels of each column bar\n",
1053 | "x_labels = [\"Type 1\", \"Type 2\", \"Type 3\", \"Type 4\"]\n",
1054 | "# assign labels to the plot\n",
1055 | "plt.xticks([1, 2, 3, 4], x_labels)\n",
1056 | "\n",
1057 | "plt.show()"
1058 | ],
1059 | "execution_count": null,
1060 | "cell_type": "code",
1061 | "metadata": {
1062 | "collapsed": false
1063 | },
1064 | "outputs": []
1065 | },
1066 | {
1067 | "source": [
1068 | "If we want to merge two figures into a single one, `subplot` is the best way to do that. For example, to stack two figures vertically, we define a grid of plots with 2 rows and 1 column. Then, in each row, a single figure is plotted."
1069 | ],
1070 | "metadata": {},
1071 | "cell_type": "markdown"
1072 | },
1073 | {
1074 | "source": [
1075 | "# Set up a subplot grid that has height 2 and width 1,\n",
1076 | "# and set the first such subplot as active.\n",
1077 | "plt.subplot(2, 1, 1)\n",
1078 | "plt.plot([1, 2, 3, 4], [1, 4, 9, 16], 'r--')\n",
1079 | "\n",
1080 | "# Set the second subplot as active, and make the second plot.\n",
1081 | "plt.subplot(2, 1, 2)\n",
1082 | "plt.bar([1, 2, 3, 4], [1, 4, 9, 16])\n",
1083 | "\n",
1084 | "plt.show()"
1085 | ],
1086 | "execution_count": null,
1087 | "cell_type": "code",
1088 | "metadata": {
1089 | "collapsed": false
1090 | },
1091 | "outputs": []
1092 | },
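The same stacked layout can be built with `plt.subplots`, which returns the figure and its axes as objects instead of relying on an "active" subplot. This is a sketch under assumptions: the `Agg` backend is selected only so the code also runs outside a notebook (inside a notebook, `%matplotlib inline` plays that role), and the names `ax_line` and `ax_bar` are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: running outside a notebook)
import matplotlib.pyplot as plt

xs = [1, 2, 3, 4]
ys = [1, 4, 9, 16]

# a grid with 2 rows and 1 column; both plots share the x axis
fig, (ax_line, ax_bar) = plt.subplots(2, 1, sharex=True)

ax_line.plot(xs, ys, "r--")          # dashed red line in the top plot
ax_line.set_ylabel("line")

ax_bar.bar(xs, ys, align="center")   # bar chart in the bottom plot
ax_bar.set_ylabel("bar")

fig.tight_layout()                   # avoid overlapping labels
```

Holding explicit axes objects tends to be easier to follow once a figure has more than two or three panels.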
1093 | {
1094 | "source": [
1095 | "For more examples, please visit the [examples gallery](http://matplotlib.org/1.5.1/examples/index.html) of Matplotlib."
1096 | ],
1097 | "metadata": {},
1098 | "cell_type": "markdown"
1099 | },
1100 | {
1101 | "source": [
1102 | "### Question 3\n",
1103 | "Given a list of numbers from 0 to 9999.\n",
1104 | "\n",
1105 | "\n",
1106 | "#### Question 3.1\n",
1107 | "\n",
1108 | "Calculate the histogram of the numbers divisible by 3, 7, and 11 in the list, respectively.\n",
1109 | "\n",
1110 | "(In other words, how many numbers in the list are divisible by 3, by 7, and by 11?)\n",
1111 | ""
1112 | ],
1113 | "metadata": {},
1114 | "cell_type": "markdown"
1115 | },
1116 | {
1117 | "source": [
1118 | "```python\n",
1119 | "###################################################################\n",
1120 | "#### TO COMPLETE #####\n",
1121 | "###################################################################\n",
1122 | "\n",
1123 | "arr = np.array(...)\n",
1124 | "divisors = [3, 7, 11]\n",
1125 | "histogram = list(...)\n",
1126 | "print(histogram)\n",
1127 | "```"
1128 | ],
1129 | "metadata": {},
1130 | "cell_type": "markdown"
1131 | },
1132 | {
1133 | "source": [
1134 | "#### Question 3.2\n",
1135 | "\n",
1136 | "Plot the histogram in a line chart.\n",
1137 | ""
1138 | ],
1139 | "metadata": {},
1140 | "cell_type": "markdown"
1141 | },
1142 | {
1143 | "source": [
1144 | "```python\n",
1145 | "###################################################################\n",
1146 | "#### TO COMPLETE #####\n",
1147 | "###################################################################\n",
1148 | "\n",
1149 | "%matplotlib inline\n",
1150 | "import matplotlib.pyplot as plt\n",
1151 | "\n",
1152 | "# simple line chart\n",
1153 | "plt.plot(histogram)\n",
1154 | "x_indexes = ...\n",
1155 | "x_names = list(...)\n",
1156 | "plt.xticks(x_indexes, x_names)\n",
1157 | "plt.show()\n",
1158 | "```"
1159 | ],
1160 | "metadata": {},
1161 | "cell_type": "markdown"
1162 | },
1163 | {
1164 | "source": [
1165 | "#### Question 3.3\n",
1166 | "\n",
1167 | "Plot the histogram in a bar chart.\n",
1168 | ""
1169 | ],
1170 | "metadata": {},
1171 | "cell_type": "markdown"
1172 | },
1173 | {
1174 | "source": [
1175 | "```python\n",
1176 | "###################################################################\n",
1177 | "#### TO COMPLETE #####\n",
1178 | "###################################################################\n",
1179 | "\n",
1180 | "# bar chart with x-labels\n",
1181 | "x_indexes = range(...)\n",
1182 | "x_names = list(...)\n",
1183 | "plt.bar( x_indexes, histogram, align='center')\n",
1184 | "plt.xticks(x_indexes, x_names)\n",
1185 | "plt.show()\n",
1186 | "```"
1187 | ],
1188 | "metadata": {},
1189 | "cell_type": "markdown"
1190 | },
1191 | {
1192 | "source": [
1193 | "## 2.4. Pandas\n",
1194 | "\n",
1195 | "Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Indeed, it is great for data manipulation, data analysis, and data visualization.\n",
1196 | "\n",
1197 | "### 2.4.1. Data structures\n",
1198 | "Pandas introduces two useful (and powerful) structures: `Series` and `DataFrame`, both of which are built on top of NumPy.\n",
1199 | "\n",
1200 | "#### Series\n",
1201 | "A `Series` is a one-dimensional object similar to an array, a list, or even a column in a table. It assigns a *labeled index* to each item in the Series. By default, each item receives an index label from `0` to `N-1`, where `N` is the number of items in the `Series`.\n",
1202 | "\n",
1203 | "We can create a Series by passing a list of values, and let pandas create a default integer index.\n"
1204 | ],
1205 | "metadata": {},
1206 | "cell_type": "markdown"
1207 | },
1208 | {
1209 | "source": [
1210 | "import pandas as pd\n",
1211 | "import numpy as np\n",
1212 | "\n",
1213 | "# create a Series with an arbitrary list\n",
1214 | "s = pd.Series([3, 'Machine learning', 1.414259, -65545, 'Happy coding!'])\n",
1215 | "print(s)"
1216 | ],
1217 | "execution_count": null,
1218 | "cell_type": "code",
1219 | "metadata": {
1220 | "collapsed": false
1221 | },
1222 | "outputs": []
1223 | },
1224 | {
1225 | "source": [
1226 | "Alternatively, an index can be specified explicitly when creating the `Series`."
1227 | ],
1228 | "metadata": {},
1229 | "cell_type": "markdown"
1230 | },
1231 | {
1232 | "source": [
1233 | "s = pd.Series([3, 'Machine learning', 1.414259, -65545, 'Happy coding!'],\n",
1234 | " index=['Col1', 'Col2', 'Col3', 4.1, 5])\n",
1235 | "print(s)"
1236 | ],
1237 | "execution_count": null,
1238 | "cell_type": "code",
1239 | "metadata": {
1240 | "collapsed": false
1241 | },
1242 | "outputs": []
1243 | },
1244 | {
1245 | "source": [
1246 | "A `Series` can be constructed from a dictionary too."
1247 | ],
1248 | "metadata": {},
1249 | "cell_type": "markdown"
1250 | },
1251 | {
1252 | "source": [
1253 | "s = pd.Series({\n",
1254 | " 'Col1': 3, 'Col2': 'Machine learning', \n",
1255 | " 'Col3': 1.414259, 4.1: -65545, \n",
1256 | " 5: 'Happy coding!'\n",
1257 | " })\n",
1258 | "print(s)"
1259 | ],
1260 | "execution_count": null,
1261 | "cell_type": "code",
1262 | "metadata": {
1263 | "collapsed": false
1264 | },
1265 | "outputs": []
1266 | },
1267 | {
1268 | "source": [
1269 | "We can access items in a `Series` in the same way as in a `Numpy` array."
1270 | ],
1271 | "metadata": {},
1272 | "cell_type": "markdown"
1273 | },
1274 | {
1275 | "source": [
1276 | "s = pd.Series({\n",
1277 | " 'Col1': 3, 'Col2': -10, \n",
1278 | " 'Col3': 1.414259, \n",
1279 | " 4.1: -65545, \n",
1280 | " 5: 8\n",
1281 | " })\n",
1282 | "\n",
1283 | "# get element which has index='Col1'\n",
1284 | "print(\"s['Col1']=\", s['Col1'], \"\\n\")\n",
1285 | "\n",
1286 | "# get elements whose index is in a given list (every label must exist in the index)\n",
1287 | "print(\"s[['Col1', 'Col3', 4.1]]=\", s[['Col1', 'Col3', 4.1]], \"\\n\")\n",
1288 | "\n",
1289 | "# use boolean indexing for selection\n",
1290 | "print(s[s > 0], \"\\n\")\n",
1291 | "\n",
1292 | "# modify elements on the fly using boolean indexing\n",
1293 | "s[s > 0] = 15\n",
1294 | "\n",
1295 | "print(s, \"\\n\")\n",
1296 | "\n",
1297 | "# mathematical operations can be done using operators and functions.\n",
1298 | "print(s*10, \"\\n\")\n",
1299 | "print(np.square(s), \"\\n\")"
1300 | ],
1301 | "execution_count": null,
1302 | "cell_type": "code",
1303 | "metadata": {
1304 | "collapsed": false
1305 | },
1306 | "outputs": []
1307 | },
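When a `Series` has a custom index, it helps to be explicit about whether a lookup is by label or by position. The sketch below uses `.loc` (labels) and `.iloc` (positions) on a small example `Series`; the values and labels are assumptions for illustration.

```python
import pandas as pd

s = pd.Series([3, -10, 8], index=["Col1", "Col2", "Col3"])

# .loc selects by index label
print(s.loc["Col2"])

# .iloc selects by integer position, like a list or numpy array
print(s.iloc[0])

# label slices include both endpoints ...
print(s.loc["Col1":"Col2"])

# ... while positional slices exclude the end, as in plain Python
print(s.iloc[0:1])
```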
1308 | {
1309 | "source": [
1310 | "#### DataFrame\n",
1311 | "A DataFrame is a tabular data structure made up of rows and columns, akin to a database table or R's data.frame object. Loosely speaking, we can also think of a DataFrame as a group of Series objects, one per named column, that share the same row index.\n",
1312 | "\n",
1313 | "We can create a DataFrame by passing a dict of objects that can be converted to Series-like objects."
1314 | ],
1315 | "metadata": {},
1316 | "cell_type": "markdown"
1317 | },
1318 | {
1319 | "source": [
1320 | "data = {'year': [2013, 2014, 2015, 2013, 2014, 2015, 2013, 2014],\n",
1321 | "        'team': ['Manchester United', 'Chelsea', 'Arsenal', 'Liverpool', 'West Ham', 'Newcastle', 'Manchester City', 'Tottenham'],\n",
1322 | " 'wins': [11, 8, 10, 15, 11, 6, 10, 4],\n",
1323 | " 'losses': [5, 8, 6, 1, 5, 10, 6, 12]}\n",
1324 | "football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])\n",
1325 | "football"
1326 | ],
1327 | "execution_count": null,
1328 | "cell_type": "code",
1329 | "metadata": {
1330 | "collapsed": false
1331 | },
1332 | "outputs": []
1333 | },
1334 | {
1335 | "source": [
1336 | "We can store data as a CSV file, or read data from a CSV file."
1337 | ],
1338 | "metadata": {},
1339 | "cell_type": "markdown"
1340 | },
1341 | {
1342 | "source": [
1343 | "# save data to a csv file without the index\n",
1344 | "football.to_csv('football.csv', index=False)\n",
1345 | "\n",
1346 | "from_csv = pd.read_csv('football.csv')\n",
1347 | "from_csv.head()"
1348 | ],
1349 | "execution_count": null,
1350 | "cell_type": "code",
1351 | "metadata": {
1352 | "collapsed": false
1353 | },
1354 | "outputs": []
1355 | },
1356 | {
1357 | "source": [
1358 | "To read a CSV file with a custom delimiter between values and custom column names, we can use the parameters `sep` and `names`, respectively.\n",
1359 | "Moreover, Pandas can also read from and write to an [Excel file](http://pandas.pydata.org/pandas-docs/stable/io.html#io-excel), an sqlite database file, a URL, or even the clipboard.\n",
1360 | "\n",
1361 | "We can get an overview of the data by using the functions `info` and `describe`."
1362 | ],
1363 | "metadata": {},
1364 | "cell_type": "markdown"
1365 | },
1366 | {
1367 | "source": [
1368 | "football.info()  # info() prints its summary itself and returns None\n",
1369 | "football.describe()"
1370 | ],
1371 | "execution_count": null,
1372 | "cell_type": "code",
1373 | "metadata": {
1374 | "collapsed": false
1375 | },
1376 | "outputs": []
1377 | },
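To make the `sep` and `names` parameters mentioned earlier concrete, the sketch below parses semicolon-separated text that has no header row. The sample rows and the use of `io.StringIO` in place of a real file are assumptions for illustration.

```python
import io
import pandas as pd

# two rows of semicolon-separated data without a header line (sample data)
raw = "2013;Liverpool;15;1\n2014;Chelsea;8;8\n"

df = pd.read_csv(
    io.StringIO(raw),      # any file path or file-like object works here
    sep=";",               # custom delimiter
    header=None,           # the data has no header row
    names=["year", "team", "wins", "losses"],  # custom column names
)
print(df)
```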
1378 | {
1379 | "source": [
1380 | "Numpy's regular slicing syntax works as well."
1381 | ],
1382 | "metadata": {},
1383 | "cell_type": "markdown"
1384 | },
1385 | {
1386 | "source": [
1387 | "print(football[0:2], \"\\n\")\n",
1388 | "\n",
1389 | "# show only the teams that have won at least 10 matches since 2014\n",
1390 | "print(football[(football.year >= 2014) & (football.wins >= 10)])"
1391 | ],
1392 | "execution_count": null,
1393 | "cell_type": "code",
1394 | "metadata": {
1395 | "collapsed": false
1396 | },
1397 | "outputs": []
1398 | },
1399 | {
1400 | "source": [
1401 | "An important feature that Pandas supports is `JOIN`. Very often, the data comes from multiple sources, in multiple files. For example, suppose we have 2 CSV files: one contains information about artists, the other contains information about songs. If we want to query each artist's name together with his/her songs, we have to join the two dataframes.\n",
1402 | "\n",
1403 | "Similar to SQL, in Pandas you can do an inner join, a left outer join, a right outer join and a full outer join. Let's see a small example. Assume that we have two datasets of singers and songs. The relationship between the two datasets is maintained by a constraint on `singer_code`."
1404 | ],
1405 | "metadata": {},
1406 | "cell_type": "markdown"
1407 | },
1408 | {
1409 | "source": [
1410 | "singers = pd.DataFrame({'singer_code': range(5), \n",
1411 | " 'singer_name': ['singer_a', 'singer_b', 'singer_c', 'singer_d', 'singer_e']})\n",
1412 | "songs = pd.DataFrame({'singer_code': [2, 2, 3, 4, 5], \n",
1413 | " 'song_name': ['song_f', 'song_g', 'song_h', 'song_i', 'song_j']})\n",
1414 | "print(singers)\n",
1415 | "print('\\n')\n",
1416 | "print(songs)"
1417 | ],
1418 | "execution_count": null,
1419 | "cell_type": "code",
1420 | "metadata": {
1421 | "collapsed": false
1422 | },
1423 | "outputs": []
1424 | },
1425 | {
1426 | "source": [
1427 | "# inner join\n",
1428 | "pd.merge(singers, songs, on='singer_code', how='inner')"
1429 | ],
1430 | "execution_count": null,
1431 | "cell_type": "code",
1432 | "metadata": {
1433 | "collapsed": false
1434 | },
1435 | "outputs": []
1436 | },
1437 | {
1438 | "source": [
1439 | "# left join\n",
1440 | "pd.merge(singers, songs, on='singer_code', how='left')"
1441 | ],
1442 | "execution_count": null,
1443 | "cell_type": "code",
1444 | "metadata": {
1445 | "collapsed": false
1446 | },
1447 | "outputs": []
1448 | },
1449 | {
1450 | "source": [
1451 | "# right join\n",
1452 | "pd.merge(singers, songs, on='singer_code', how='right')"
1453 | ],
1454 | "execution_count": null,
1455 | "cell_type": "code",
1456 | "metadata": {
1457 | "collapsed": false
1458 | },
1459 | "outputs": []
1460 | },
1461 | {
1462 | "source": [
1463 | "# outer join (full join)\n",
1464 | "pd.merge(singers, songs, on='singer_code', how='outer')"
1465 | ],
1466 | "execution_count": null,
1467 | "cell_type": "code",
1468 | "metadata": {
1469 | "collapsed": false
1470 | },
1471 | "outputs": []
1472 | },
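To see which side each row of a join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below rebuilds the `singers` and `songs` frames from above so that it runs on its own.

```python
import pandas as pd

singers = pd.DataFrame({'singer_code': range(5),
                        'singer_name': ['singer_a', 'singer_b', 'singer_c', 'singer_d', 'singer_e']})
songs = pd.DataFrame({'singer_code': [2, 2, 3, 4, 5],
                      'song_name': ['song_f', 'song_g', 'song_h', 'song_i', 'song_j']})

# _merge is 'left_only', 'right_only', or 'both' for every row
merged = pd.merge(singers, songs, on='singer_code', how='outer', indicator=True)
print(merged)

# singers without songs, and songs whose singer is unknown
print((merged['_merge'] == 'left_only').sum(), "singers without songs")
print((merged['_merge'] == 'right_only').sum(), "songs without a known singer")
```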
1473 | {
1474 | "source": [
1475 | "We can also concatenate two dataframes vertically or horizontally via the function `concat` and its parameter `axis`. This function is useful when we need to append two similar datasets or to put them side by side."
1476 | ],
1477 | "metadata": {},
1478 | "cell_type": "markdown"
1479 | },
1480 | {
1481 | "source": [
1482 | "# concat vertically\n",
1483 | "pd.concat([singers, songs])"
1484 | ],
1485 | "execution_count": null,
1486 | "cell_type": "code",
1487 | "metadata": {
1488 | "collapsed": false
1489 | },
1490 | "outputs": []
1491 | },
1492 | {
1493 | "source": [
1494 | "# concat horizontally\n",
1495 | "pd.concat([singers, songs], axis=1)"
1496 | ],
1497 | "execution_count": null,
1498 | "cell_type": "code",
1499 | "metadata": {
1500 | "collapsed": false
1501 | },
1502 | "outputs": []
1503 | },
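One detail of vertical concatenation is that each frame keeps its own row labels, so the result can contain duplicate index values. Passing `ignore_index=True` renumbers the rows. The two small frames below are assumptions for illustration.

```python
import pandas as pd

first = pd.DataFrame({'singer_code': [0, 1], 'singer_name': ['singer_a', 'singer_b']})
second = pd.DataFrame({'singer_code': [2, 3], 'singer_name': ['singer_c', 'singer_d']})

# default: the original row labels are kept, so 0 and 1 appear twice
stacked = pd.concat([first, second])
print(list(stacked.index))

# ignore_index=True assigns a fresh 0..N-1 index
renumbered = pd.concat([first, second], ignore_index=True)
print(list(renumbered.index))
```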
1504 | {
1505 | "source": [
1506 | "When computing descriptive statistics, we usually need to aggregate data by group. For example, to answer the question \"how many songs does each singer have?\", we have to group the data by singer, and then calculate the number of songs in each group. Note that the result must contain the statistics of all singers in the database (even if some of them have no songs)."
1507 | ],
1508 | "metadata": {},
1509 | "cell_type": "markdown"
1510 | },
1511 | {
1512 | "source": [
1513 | "data = pd.merge(singers, songs, on='singer_code', how='left')\n",
1514 | "\n",
1515 | "# count the values of each column in group\n",
1516 | "print(data.groupby('singer_code').count())\n",
1517 | "\n",
1518 | "print(\"\\n\")\n",
1519 | "\n",
1520 | "# count only song_name\n",
1521 | "print(data.groupby('singer_code').song_name.count())\n",
1522 | "\n",
1523 | "print(\"\\n\")\n",
1524 | "\n",
1525 | "# count song name but ignore duplication, and order the result\n",
1526 | "print(data.groupby('singer_code').song_name.nunique().sort_values(ascending=True))"
1527 | ],
1528 | "execution_count": null,
1529 | "cell_type": "code",
1530 | "metadata": {
1531 | "collapsed": false
1532 | },
1533 | "outputs": []
1534 | },
1535 | {
1536 | "source": [
1537 | "### Question 4\n",
1538 | "We have two datasets about music: [song](https://github.com/michiard/AML-COURSE/blob/master/data/song.tsv) and [album](https://github.com/michiard/AML-COURSE/blob/master/data/album.tsv).\n",
1539 | "\n",
1540 | "In the following questions, you **have to** use Pandas to load data and write code to answer these questions.\n",
1541 | "\n",
1542 | "\n",
1543 | "#### Question 4.1\n",
1544 | "\n",
1545 | "Load both datasets into two dataframes and print the information of each dataframe.\n",
1546 | "\n",
1547 | "**HINT**: \n",
1548 | "\n",
1549 | "- You can click the `Raw` button on the GitHub page of each dataset and copy the URL of the raw file.\n",
1550 | "- The dataset can be loaded with the function `read_table`. For example: `df = pd.read_table(raw_url, sep='\\t')`\n",
1551 | ""
1552 | ],
1553 | "metadata": {},
1554 | "cell_type": "markdown"
1555 | },
1556 | {
1557 | "source": [
1558 | "```python\n",
1559 | "###################################################################\n",
1560 | "#### TO COMPLETE #####\n",
1561 | "###################################################################\n",
1562 | "\n",
1563 | "import pandas as pd\n",
1564 | "\n",
1565 | "songdb_url = 'https://raw.githubusercontent.com/DistributedSystemsGroup/Algorithmic-Machine-Learning/master/data/song.tsv'\n",
1566 | "albumdb_url = 'https://raw.githubusercontent.com/DistributedSystemsGroup/Algorithmic-Machine-Learning/master/data/album.tsv'\n",
1567 | "song_df = pd...\n",
1568 | "album_df = pd...\n",
1569 | "\n",
1570 | "print(song_df...)\n",
1571 | "print(album_df...)\n",
1572 | "```"
1573 | ],
1574 | "metadata": {},
1575 | "cell_type": "markdown"
1576 | },
1577 | {
1578 | "source": [
1579 | "#### Question 4.2\n",
1580 | "\n",
1581 | "How many albums are in this dataset?\n",
1582 | "\n",
1583 | "How many songs are in this dataset?\n",
1584 | ""
1585 | ],
1586 | "metadata": {},
1587 | "cell_type": "markdown"
1588 | },
1589 | {
1590 | "source": [
1591 | "```python\n",
1592 | "###################################################################\n",
1593 | "#### TO COMPLETE #####\n",
1594 | "###################################################################\n",
1595 | "\n",
1596 | "print(\"number of albums:\", album_df....count())\n",
1597 | "print(\"number of songs:\", song_df.Song...)\n",
1598 | "```"
1599 | ],
1600 | "metadata": {},
1601 | "cell_type": "markdown"
1602 | },
1603 | {
1604 | "source": [
1605 | "#### Question 4.3\n",
1606 | "\n",
1607 | "How many distinct singers are in this dataset?\n",
1608 | ""
1609 | ],
1610 | "metadata": {},
1611 | "cell_type": "markdown"
1612 | },
1613 | {
1614 | "source": [
1615 | "```python\n",
1616 | "###################################################################\n",
1617 | "#### TO COMPLETE #####\n",
1618 | "###################################################################\n",
1619 | "\n",
1620 | "print(\"number distinct singers:\", len(...))\n",
1621 | "```"
1622 | ],
1623 | "metadata": {},
1624 | "cell_type": "markdown"
1625 | },
1626 | {
1627 | "source": [
1628 | "#### Question 4.4\n",
1629 | "\n",
1630 | "Is there any song that doesn't belong to any album?\n",
1631 | "\n",
1632 | "Is there any album that has no songs?\n",
1633 | "\n",
1634 | "**HINT**: \n",
1635 | "\n",
1636 | "- To join two datasets on different key names, we use `left_on=` and `right_on=` instead of `on=`.\n",
1637 | "- The functions `notnull` and `isnull` help determine whether the value of a column is missing. For example:\n",
1638 | "`df['song'].isnull()`.\n",
1639 | ""
1640 | ],
1641 | "metadata": {},
1642 | "cell_type": "markdown"
1643 | },
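The join hint in Question 4.4 can be sketched on toy data. This is only a minimal illustration, not a solution to the exercise: the `employees` and `departments` tables and all of their column names are hypothetical and were invented for this example.

```python
import pandas as pd

# Hypothetical toy tables (NOT the course datasets).
# The key columns deliberately have different names: 'dept' vs 'dept_code'.
employees = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "dept": ["D1", "D2", "D9"],        # D9 matches no department
})
departments = pd.DataFrame({
    "dept_code": ["D1", "D2", "D3"],   # D3 has no employee
    "dept_name": ["Sales", "Research", "Legal"],
})

# Outer join on differently named keys: both key columns are kept,
# and the side without a match is filled with missing values.
joined = pd.merge(employees, departments, how="outer",
                  left_on="dept", right_on="dept_code")

# Rows present only on the left: the right-hand key is missing.
unmatched_employees = joined[joined["name"].notnull() & joined["dept_code"].isnull()]

# Rows present only on the right: the left-hand columns are missing.
empty_departments = joined[joined["name"].isnull() & joined["dept_code"].notnull()]

print(unmatched_employees["name"].tolist())
print(empty_departments["dept_name"].tolist())
```

An outer join is used here so that unmatched rows from both sides survive the merge; the boolean masks built with `isnull` and `notnull` then pick out each kind of unmatched row.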
1644 | {
1645 | "source": [
1646 | "```python\n",
1647 | "###################################################################\n",
1648 | "#### TO COMPLETE #####\n",
1649 | "###################################################################\n",
1650 | "\n",
1651 | "fulldf = pd.merge(song_df, album_df, how='outer', left_on='Album', right_on='Album code')\n",
1652 | "fulldf[fulldf['Song'].... & fulldf['Album']....]\n",
1653 | "```"
1654 | ],
1655 | "metadata": {},
1656 | "cell_type": "markdown"
1657 | },
1658 | {
1659 | "source": [
1660 | "```python\n",
1661 | "###################################################################\n",
1662 | "#### TO COMPLETE #####\n",
1663 | "###################################################################\n",
1664 | "\n",
1665 | "fulldf[fulldf['Song'].... & fulldf['Album code']....]\n",
1666 | "```"
1667 | ],
1668 | "metadata": {},
1669 | "cell_type": "markdown"
1670 | },
1671 | {
1672 | "source": [
1673 | "#### Question 4.5\n",
1674 | "\n",
1675 | "How many songs are in each of Michael Jackson's albums?\n",
1676 | ""
1677 | ],
1678 | "metadata": {},
1679 | "cell_type": "markdown"
1680 | },
1681 | {
1682 | "source": [
1683 | "```python\n",
1684 | "###################################################################\n",
1685 | "#### TO COMPLETE #####\n",
1686 | "###################################################################\n",
1687 | "\n",
1688 | "\n",
1689 | "\n",
1690 | "fulldf[fulldf['Singer']=='Michael Jackson']....\n",
1691 | "```"
1692 | ],
1693 | "metadata": {},
1694 | "cell_type": "markdown"
1695 | },
1696 | {
1697 | "source": [
1698 | "# Summary\n",
1699 | "\n",
1700 | "In this lecture, we gained familiarity with the Jupyter Notebook environment, the Python programming language, and its modules. In particular, we covered Python syntax; NumPy, the core library for scientific computing; Matplotlib, a module for plotting graphs; and Pandas, a data analysis module.\n"
1701 | ],
1702 | "metadata": {},
1703 | "cell_type": "markdown"
1704 | },
1705 | {
1706 | "source": [
1707 | "# References\n",
1708 | "This notebook is inspired by:\n",
1709 | "\n",
1710 | "- [Python Numpy tutorial](http://cs231n.github.io/python-numpy-tutorial/)"
1711 | ],
1712 | "metadata": {},
1713 | "cell_type": "markdown"
1714 | },
1715 | {
1716 | "source": [],
1717 | "execution_count": null,
1718 | "cell_type": "code",
1719 | "metadata": {},
1720 | "outputs": []
1721 | },
1722 | {
1723 | "source": [],
1724 | "execution_count": null,
1725 | "cell_type": "code",
1726 | "metadata": {},
1727 | "outputs": []
1728 | }
1729 | ],
1730 | "metadata": {
1731 | "kernelspec": {
1732 | "display_name": "Python 3",
1733 | "language": "python",
1734 | "name": "python3"
1735 | },
1736 | "language_info": {
1737 | "name": "python",
1738 | "pygments_lexer": "ipython3",
1739 | "version": "3.5.2",
1740 | "mimetype": "text/x-python",
1741 | "file_extension": ".py",
1742 | "codemirror_mode": {
1743 | "name": "ipython",
1744 | "version": 3
1745 | },
1746 | "nbconvert_exporter": "python"
1747 | }
1748 | },
1749 | "nbformat_minor": 2,
1750 | "nbformat": 4
1751 | }
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # AML-COURSE
2 | This repository contains Jupyter Notebooks for the Algorithmic Machine Learning Course at Eurecom.
3 |
4 | ## Objectives of the course
5 | The main goal of this course is to offer students data science projects through which they gain hands-on experience. It merges the theoretical concepts students learn in our courses on machine learning and statistical inference with the systems concepts we teach in distributed systems.
6 |
7 | The Notebooks require students to address several challenges, which can be roughly classified as:
8 |
9 | * Data preparation and cleaning
10 | * Building descriptive statistics of the data
11 | * Working on a selected algorithm, e.g., for building a statistical model
12 | * Working on experimental validation
13 |
14 | ## Technical notes
15 | Students will use the EURECOM cloud computing platform to work on Notebooks. Our cluster is managed by [Zoe](http://zoe-analytics.eu/), a container-based analytics-as-a-service system we have built. Each Notebook front end runs in a user-facing container, whereas Notebook kernels run in clusters of containers.
16 |
17 | ## Sources and acknowledgments
18 | Some of the Notebooks we use in our lectures are based on use cases illustrated in the book [Advanced Analytics with Spark](http://shop.oreilly.com/product/0636920035091.do), by Sandy Ryza, Uri Laserson, Sean Owen & Josh Wills.
19 |
20 | Some Notebooks are instead based on publicly available data, for which we defined the tasks to complete.
21 |
22 | Finally, some Notebooks are private, and cannot be pushed to this repository. This is the case for industrial Notebooks, taking the form of use cases by Data Scientists from companies we are in contact with.
23 |
24 | None of this could have been achieved without the skills of several PhD students at Eurecom:
25 |
26 | * Duc-Trung Nguyen
27 | * Rosa Candela
28 | * Simone Rossi
29 | * Kurt Cutajar
30 | * Jonas Wacker
31 | * Gia-Lac Tran
32 | * Graziano Mita
33 |
--------------------------------------------------------------------------------