├── .gitignore
├── Challenges
│   ├── Anomaly_Detection
│   │   └── anomaly_detection_challenge.ipynb
│   ├── House_Pricing
│   │   ├── challenge_data
│   │   │   ├── Data description.rtf
│   │   │   ├── sample_submission.csv
│   │   │   ├── test.csv
│   │   │   └── train.csv
│   │   └── house_pricing_challenge.ipynb
│   └── Plankton
│       └── plankton_challenge.ipynb
├── Notebooks
│   ├── Intro-public.ipynb
│   └── RecSys-public.ipynb
└── README.md

/.gitignore:
--------------------------------------------------------------------------------
1 | Notebooks/.ipynb_checkpoints/
2 |
--------------------------------------------------------------------------------
/Challenges/Anomaly_Detection/anomaly_detection_challenge.ipynb:
--------------------------------------------------------------------------------
1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "AML2019" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "

Challenge 3

\n", 15 | "

Anomaly Detection (AD)

\n", 16 | "
\n", 17 | "3th May 2019" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "Anomaly detection (AD) refers to the process of detecting data points that do not conform with the rest of observations. Applications of anomaly detection include fraud and fault detection, surveillance, diagnosis, data cleanup, predictive maintenance.\n", 25 | "\n", 26 | "When we talk about AD, we usually look at it as an unsupervised (or semi-supervised) task, where the concept of anomaly is often not well defined or, in the best case, just few samples are labeled as anomalous. In this challenge, you will look at AD from a different perspective!\n", 27 | "\n", 28 | "The dataset you are going to work on consists of monitoring data generated by IT systems; such data is then processed by a monitoring system that executes some checks and detects a series of anomalies. This is a multi-label classification problem, where each check is a binary label corresponding to a specific type of anomaly. Your goal is to develop a machine learning model (or multiple ones) to accurately detect such anomalies.\n", 29 | "\n", 30 | "This will also involve a mixture of data exploration, pre-processing, model selection, and performance evaluation. You will also be asked to try one or more rule learning models, and compare them with other ML models both in terms of predictive performances and interpretability. Interpreatibility is indeed a strong requirement especially in applications like AD where understanding the output of a model is as important as the output itself.\n", 31 | "\n", 32 | "Please, bear in mind that the purpose of this challenge is not simply to find the best-performing model. You should rather make sure to understand the difficulties that come with this AD task." 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "# Overview\n", 40 | "
" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "Beyond simply producing a well-performing model for making predictions, in this challenge we would like you to start developing your skills as a machine learning scientist.\n", 48 | "In this regard, your notebook should be structured in such a way as to explore the five following tasks that are expected to be carried out whenever undertaking such a project.\n", 49 | "The description below each aspect should serve as a guide for your work, but you are strongly encouraged to also explore alternative options and directions. \n", 50 | "Thinking outside the box will always be rewarded in these challenges." 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "
\n", 58 | "

1. Data Exploration

\n", 59 | "
" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "The first broad component of your notebook should enable you to familiarise yourselves with the given data, an outline of which is given at the end of this challenge specification.\n", 67 | "Among others, this section should investigate:\n", 68 | "\n", 69 | "- Data cleaning\n", 70 | "- Data visualisation;\n", 71 | "- Computing descriptive statistics, e.g. correlation.\n", 72 | "- etc.\n", 73 | "\n", 74 | "Data exploration is also useful to identify eventual errors in the dataset: for example, some features may have values that are outside the allowed range of values. Ranges are specified in the dataset description." 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "
\n", 82 | "

2. Data Pre-processing

\n", 83 | "
" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "The previous step should give you a better understanding of which pre-processing is required for the data.\n", 91 | "This may include:\n", 92 | "\n", 93 | "- Normalising and standardising the given data;\n", 94 | "- Removing outliers;\n", 95 | "- Carrying out feature selection;\n", 96 | "- Handling missing information in the dataset;\n", 97 | "- Handling errors in the dataset;\n", 98 | "- Combining existing features." 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "
\n", 106 | "

3. Model Selection

\n", 107 | "
" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "At this point, you should have a good understanding of the dataset, and have an idea about the possible candidate models. For example, you may try a multi-label classification model to predict all classes at ones, or train different models, one for each label. In any case, it is important to justify your choices and make a comparison among the candidate models.\n", 115 | "\n", 116 | "You are free to choose any model you want, but you should be aware about some factors which may influence your decision:\n", 117 | "\n", 118 | "- What is the model's complexity?\n", 119 | "- Is the model interpretable?\n", 120 | "- Is the model able to handle imbalanced datasets?\n", 121 | "- Is the model capable of handling both numerical and categorical data?\n", 122 | "- Is the model able to handle missing values?\n", 123 | "- Does the model return uncertainty estimates along with predictions?\n", 124 | "\n", 125 | "An in-depth evaluation of competing models in view of this and other criteria will elevate the quality of your submission and earn you a higher grade. You may also try to build new labels by combining one or more labels (for example by doing an OR) and check if this impacts the performance of the model(s)." 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "
\n", 133 | "

3.1 Interpretable Models

\n", 134 | "
" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "Being able to understand the output of a model is important in many field, especially in anomaly detection. In linear regression, for example, the weights of the model can provide some hints on the importance of features, and this is a form of interpretability. Here, we focus on Rule learning, a specific field of interpretable machine learning that provides interpretability through the use of rules. Examples of rule-based models are: \n", 142 | "\n", 143 | "- RIPPER\n", 144 | " - [Main Paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.107.2612&rep=rep1&type=pdf)\n", 145 | " - A fast and reliable implementation is JRIP by [WEKA](https://www.cs.waikato.ac.nz/~ml/weka/). You can also find unofficial python implementations on GitHub.\n", 146 | "- Bayesian Rule Sets (BRS)\n", 147 | " - [Main Paper](http://jmlr.org/papers/volume18/16-003/16-003.pdf)\n", 148 | " - You can find a good implementation [here](https://pypi.org/project/ruleset/). You will probably need to install \"fim\" (pip install fim) before installing BRS.\n", 149 | "- Scalable Bayesian Rule Lists (SBRL)\n", 150 | " - [Main Paper](https://arxiv.org/pdf/1602.08610.pdf)\n", 151 | " - You can find a good implementation [here](https://github.com/myaooo/pysbrl). You will probably need to install \"fim\" (pip install fim) before installing SBRL.\n", 152 | "- and so on... \n", 153 | "\n", 154 | "Try to run at least one of the suggested models (you are free to try others as well) and comment:\n", 155 | "\n", 156 | "- Are rule-learning models able to provide the same predictive performances as previously tested models?\n", 157 | "- Are they faster or slower to train?\n", 158 | "- Do learned rules look meaningful to you?\n", 159 | "- How many rules do these models learn?\n", 160 | "- How many conditions/atoms have on average?\n", 161 | "\n", 162 | "N.B. Since most of the rule-learning implementations deal with binary labels, you can train the model to predict one label of your choice." 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "
\n", 170 | "

4. Parameter Optimisation

\n", 171 | "
" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "Irrespective of your choice, it is highly likely that your model will have one or more parameters that require tuning.\n", 179 | "There are several techniques for carrying out such a procedure, including cross-validation, Bayesian optimisation, and several others.\n", 180 | "As before, an analysis into which parameter tuning technique best suits your model is expected before proceeding with the optimisation of your model." 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "
\n", 188 | "

5. Model Evaluation

\n", 189 | "
" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "Some form of pre-evaluation will inevitably be required in the preceding sections in order to both select an appropriate model and configure its parameters appropriately.\n", 197 | "In this final section, you may evaluate other aspects of the model such as:\n", 198 | "\n", 199 | "- Assessing the running time of your model;\n", 200 | "- Determining whether some aspects can be parallelised;\n", 201 | "- Training the model with smaller subsets of the data.\n", 202 | "- etc.\n", 203 | "\n", 204 | "For the evaluation of the classification results, you should use F1-score for each class and do the average.\n", 205 | "\n", 206 | "N.B. Please note that you are responsible for creating a sensible train/validation/test split. There is no predefined held-out test data." 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "
\n", 214 | "

*. Optional

\n", 215 | "
" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "As you will see in the dataset description, the labels you are going to predict have no meaningful names. Try to understand which kind of anomalies these labels refer to and give sensible names. To do it, you could exploit the output of the interpretable models and/or use a statistical approach with the data you have." 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "
\n", 230 | " N.B. Please note that the items listed under each heading are neither exhaustive, nor are you expected to explore every given suggestion.\n", 231 | " Nonetheless, these should serve as a guideline for your work in both this and upcoming challenges.\n", 232 | " As always, you should use your intuition and understanding in order to decide which analysis best suits the assigned task.\n", 233 | "
" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "
\n", 241 | "

Submission Instructions

\n", 242 | "
\n", 243 | "
" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "- The goal of this challenge is to construct one or more models to detect anomalies.\n", 251 | "- Your submission will be the HTML version of your notebook exploring the various modelling aspects described above." 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "
\n", 259 | "

Dataset Description

\n", 260 | "
\n", 261 | "
" 262 | ] 263 | }, 264 | { 265 | "cell_type": "markdown", 266 | "metadata": {}, 267 | "source": [ 268 | "#### * Location of the Dataset on zoe\n", 269 | "The data for this challenge is located at: `/mnt/datasets/anomaly`\n", 270 | "\n", 271 | "#### * Files\n", 272 | "\n", 273 | "You have a unique csv file with 36 features and 8 labels.\n", 274 | "Each record contains aggregate features computed over a given amount of time.\n", 275 | "\n", 276 | "#### * Attributes\n", 277 | "\n", 278 | "A brief outline of the available attributes is given below.\n", 279 | "\n", 280 | "1. SessionNumber (INTEGER): it identifies the session on which data is collected;\n", 281 | "* SystemID (INTEGER): it identifies the system generating the data;\n", 282 | "* Date (DATE): collection date;\n", 283 | "* HighPriorityAlerts (INTEGER [0, N]): number of high priority alerts in the session;\n", 284 | "* Dumps (INTEGER [0, N]): number of memory dumps;\n", 285 | "* CleanupOOMDumps (INTEGER) [0, N]): number of cleanup OOM dumps;\n", 286 | "* CompositeOOMDums (INTEGER [0, N]): number of composite OOM dumps;\n", 287 | "* IndexServerRestarts (INTEGER [0, N]): number of restarts of the index server;\n", 288 | "* NameServerRestarts (INTEGER [0, N]): number of restarts of the name server;\n", 289 | "* XSEngineRestarts (INTEGER [0, N]): number of restarts of the XSEngine;\n", 290 | "* PreprocessorRestarts (INTEGER [0, N]): number of restarts of the preprocessor;\n", 291 | "* DaemonRestarts (INTEGER [0, N]): number of restarts of the daemon process;\n", 292 | "* StatisticsServerRestarts (INTEGER [0, N]): number of restarts of the statistics server;\n", 293 | "* CPU (FLOAT [0, 100]): cpu usage;\n", 294 | "* PhysMEM (FLOAT [0, 100]): physical memory;\n", 295 | "* InstanceMEM (FLOAT [0, 100]): memory usage of one instance of the system;\n", 296 | "* TablesAllocation (FLOAT [0, 100]): memory allocated for tables;\n", 297 | "* IndexServerAllocationLimit (FLOAT [0, 100]): level of memory used by index server;\n", 298 | "* ColumnUnloads (INTEGER [0, N]): number of columns unloaded from the tables;\n", 299 | "* DeltaSize (INTEGER [0, N]): size of the delta store;\n", 300 | "* MergeErrors BOOLEAN [0, 1]: 1 if there are merge errors;\n", 301 | "* BlockingPhaseSec (INTEGER [0, N]): blocking phase duration in seconds;\n", 302 | "* Disk (FLOAT [0, 100]): disk usage;\n", 303 | "* LargestTableSize (INTEGER [0, N]): size of the largest table;\n", 304 | "* LargestPartitionSize (INTEGER [0, N]): size of the largest partition of a table;\n", 305 | "* DiagnosisFiles (INTEGER [0, N]): number of diagnosis files;\n", 306 | "* DiagnosisFilesSize (INTEGER [0, N]): size of diagnosis files;\n", 307 | "* DaysWithSuccessfulDataBackups (INTEGER [0, N]): number of days with successful data backups;\n", 308 | "* DaysWithSuccessfulLogBackups (INTEGER [0, N]): number of days with successful log backups;\n", 309 | "* DaysWithFailedDataBackups (INTEGER [0, N]): number of days with failed data backups;\n", 310 | "* DaysWithFailedfulLogBackups (INTEGER [0, N]): number of days with failed log backups;\n", 311 | "* MinDailyNumberOfSuccessfulDataBackups (INTEGER [0, N]): minimum number of successful data backups per day;\n", 312 | "* MinDailyNumberOfSuccessfulLogBackups (INTEGER [0, N]): minimum number of successful log backups per day;\n", 313 | "* MaxDailyNumberOfFailedDataBackups (INTEGER [0, N]): maximum number of failed data backups per day;\n", 314 | "* MaxDailyNumberOfFailedLogBackups (INTEGER [0, N]): maximum number of failed log backups per day;\n", 315 | "* 
LogSegmentChange (INTEGER [0, N]): changes in the number of log segments.\n", 316 | "\n", 317 | "#### * Labels\n", 318 | "\n", 319 | "Labels are binary. Each label refers to a different anomaly.\n", 320 | "\n", 321 | "* Check1;\n", 322 | "* Check2;\n", 323 | "* Check3;\n", 324 | "* Check4;\n", 325 | "* Check5;\n", 326 | "* Check6;\n", 327 | "* Check7;\n", 328 | "* Check8;" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": null, 334 | "metadata": { 335 | "collapsed": true 336 | }, 337 | "outputs": [], 338 | "source": [] 339 | } 340 | ], 341 | "metadata": { 342 | "kernelspec": { 343 | "display_name": "Python 3", 344 | "language": "python", 345 | "name": "python3" 346 | }, 347 | "language_info": { 348 | "codemirror_mode": { 349 | "name": "ipython", 350 | "version": 3 351 | }, 352 | "file_extension": ".py", 353 | "mimetype": "text/x-python", 354 | "name": "python", 355 | "nbconvert_exporter": "python", 356 | "pygments_lexer": "ipython3", 357 | "version": "3.5.2" 358 | } 359 | }, 360 | "nbformat": 4, 361 | "nbformat_minor": 2 362 | } 363 | -------------------------------------------------------------------------------- /Challenges/House_Pricing/challenge_data/Data description.rtf: -------------------------------------------------------------------------------- 1 | {\rtf1\ansi\ansicpg1252\cocoartf1561\cocoasubrtf200 2 | {\fonttbl\f0\fmodern\fcharset0 Courier;} 3 | {\colortbl;\red255\green255\blue255;\red0\green0\blue0;} 4 | {\*\expandedcolortbl;;\cssrgb\c0\c0\c0;} 5 | \paperw11900\paperh16840\margl1440\margr1440\vieww10800\viewh8400\viewkind0 6 | \deftab720 7 | \pard\pardeftab720\sl280\partightenfactor0 8 | 9 | \f0\fs24 \cf2 \expnd0\expndtw0\kerning0 10 | \outl0\strokewidth0 \strokec2 MSSubClass: Identifies the type of dwelling involved in the sale. 
\ 11 | \ 12 | 20 1-STORY 1946 & NEWER ALL STYLES\ 13 | 30 1-STORY 1945 & OLDER\ 14 | 40 1-STORY W/FINISHED ATTIC ALL AGES\ 15 | 45 1-1/2 STORY - UNFINISHED ALL AGES\ 16 | 50 1-1/2 STORY FINISHED ALL AGES\ 17 | 60 2-STORY 1946 & NEWER\ 18 | 70 2-STORY 1945 & OLDER\ 19 | 75 2-1/2 STORY ALL AGES\ 20 | 80 SPLIT OR MULTI-LEVEL\ 21 | 85 SPLIT FOYER\ 22 | 90 DUPLEX - ALL STYLES AND AGES\ 23 | 120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER\ 24 | 150 1-1/2 STORY PUD - ALL AGES\ 25 | 160 2-STORY PUD - 1946 & NEWER\ 26 | 180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER\ 27 | 190 2 FAMILY CONVERSION - ALL STYLES AND AGES\ 28 | \ 29 | MSZoning: Identifies the general zoning classification of the sale.\ 30 | \ 31 | A Agriculture\ 32 | C Commercial\ 33 | FV Floating Village Residential\ 34 | I Industrial\ 35 | RH Residential High Density\ 36 | RL Residential Low Density\ 37 | RP Residential Low Density Park \ 38 | RM Residential Medium Density\ 39 | \ 40 | LotFrontage: Linear feet of street connected to property\ 41 | \ 42 | LotArea: Lot size in square feet\ 43 | \ 44 | Street: Type of road access to property\ 45 | \ 46 | Grvl Gravel \ 47 | Pave Paved\ 48 | \ 49 | Alley: Type of alley access to property\ 50 | \ 51 | Grvl Gravel\ 52 | Pave Paved\ 53 | NA No alley access\ 54 | \ 55 | LotShape: General shape of property\ 56 | \ 57 | Reg Regular \ 58 | IR1 Slightly irregular\ 59 | IR2 Moderately Irregular\ 60 | IR3 Irregular\ 61 | \ 62 | LandContour: Flatness of the property\ 63 | \ 64 | Lvl Near Flat/Level \ 65 | Bnk Banked - Quick and significant rise from street grade to building\ 66 | HLS Hillside - Significant slope from side to side\ 67 | Low Depression\ 68 | \ 69 | Utilities: Type of utilities available\ 70 | \ 71 | AllPub All public Utilities (E,G,W,& S) \ 72 | NoSewr Electricity, Gas, and Water (Septic Tank)\ 73 | NoSeWa Electricity and Gas Only\ 74 | ELO Electricity only \ 75 | \ 76 | LotConfig: Lot configuration\ 77 | \ 78 | Inside Inside lot\ 79 | Corner Corner lot\ 80 | CulDSac Cul-de-sac\ 81 | FR2 Frontage on 2 sides of property\ 82 | FR3 Frontage on 3 sides of property\ 83 | \ 84 | LandSlope: Slope of property\ 85 | \ 86 | Gtl Gentle slope\ 87 | Mod Moderate Slope \ 88 | Sev Severe Slope\ 89 | \ 90 | Neighborhood: Physical locations within Ames city limits\ 91 | \ 92 | Blmngtn Bloomington Heights\ 93 | Blueste Bluestem\ 94 | BrDale Briardale\ 95 | BrkSide Brookside\ 96 | ClearCr Clear Creek\ 97 | CollgCr College Creek\ 98 | Crawfor Crawford\ 99 | Edwards Edwards\ 100 | Gilbert Gilbert\ 101 | IDOTRR Iowa DOT and Rail Road\ 102 | MeadowV Meadow Village\ 103 | Mitchel Mitchell\ 104 | Names North Ames\ 105 | NoRidge Northridge\ 106 | NPkVill Northpark Villa\ 107 | NridgHt Northridge Heights\ 108 | NWAmes Northwest Ames\ 109 | OldTown Old Town\ 110 | SWISU South & West of Iowa State University\ 111 | Sawyer Sawyer\ 112 | SawyerW Sawyer West\ 113 | Somerst Somerset\ 114 | StoneBr Stone Brook\ 115 | Timber Timberland\ 116 | Veenker Veenker\ 117 | \ 118 | Condition1: Proximity to various conditions\ 119 | \ 120 | Artery Adjacent to arterial street\ 121 | Feedr Adjacent to feeder street \ 122 | Norm Normal \ 123 | RRNn Within 200' of North-South Railroad\ 124 | RRAn Adjacent to North-South Railroad\ 125 | PosN Near positive off-site feature--park, greenbelt, etc.\ 126 | PosA Adjacent to postive off-site feature\ 127 | RRNe Within 200' of East-West Railroad\ 128 | RRAe Adjacent to East-West Railroad\ 129 | \ 130 | Condition2: Proximity to various conditions (if more than one is present)\ 131 | \ 
132 | Artery Adjacent to arterial street\ 133 | Feedr Adjacent to feeder street \ 134 | Norm Normal \ 135 | RRNn Within 200' of North-South Railroad\ 136 | RRAn Adjacent to North-South Railroad\ 137 | PosN Near positive off-site feature--park, greenbelt, etc.\ 138 | PosA Adjacent to postive off-site feature\ 139 | RRNe Within 200' of East-West Railroad\ 140 | RRAe Adjacent to East-West Railroad\ 141 | \ 142 | BldgType: Type of dwelling\ 143 | \ 144 | 1Fam Single-family Detached \ 145 | 2FmCon Two-family Conversion; originally built as one-family dwelling\ 146 | Duplx Duplex\ 147 | TwnhsE Townhouse End Unit\ 148 | TwnhsI Townhouse Inside Unit\ 149 | \ 150 | HouseStyle: Style of dwelling\ 151 | \ 152 | 1Story One story\ 153 | 1.5Fin One and one-half story: 2nd level finished\ 154 | 1.5Unf One and one-half story: 2nd level unfinished\ 155 | 2Story Two story\ 156 | 2.5Fin Two and one-half story: 2nd level finished\ 157 | 2.5Unf Two and one-half story: 2nd level unfinished\ 158 | SFoyer Split Foyer\ 159 | SLvl Split Level\ 160 | \ 161 | OverallQual: Rates the overall material and finish of the house\ 162 | \ 163 | 10 Very Excellent\ 164 | 9 Excellent\ 165 | 8 Very Good\ 166 | 7 Good\ 167 | 6 Above Average\ 168 | 5 Average\ 169 | 4 Below Average\ 170 | 3 Fair\ 171 | 2 Poor\ 172 | 1 Very Poor\ 173 | \ 174 | OverallCond: Rates the overall condition of the house\ 175 | \ 176 | 10 Very Excellent\ 177 | 9 Excellent\ 178 | 8 Very Good\ 179 | 7 Good\ 180 | 6 Above Average \ 181 | 5 Average\ 182 | 4 Below Average \ 183 | 3 Fair\ 184 | 2 Poor\ 185 | 1 Very Poor\ 186 | \ 187 | YearBuilt: Original construction date\ 188 | \ 189 | YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)\ 190 | \ 191 | RoofStyle: Type of roof\ 192 | \ 193 | Flat Flat\ 194 | Gable Gable\ 195 | Gambrel Gabrel (Barn)\ 196 | Hip Hip\ 197 | Mansard Mansard\ 198 | Shed Shed\ 199 | \ 200 | RoofMatl: Roof material\ 201 | \ 202 | ClyTile Clay or Tile\ 203 | CompShg Standard (Composite) Shingle\ 204 | Membran Membrane\ 205 | Metal Metal\ 206 | Roll Roll\ 207 | Tar&Grv Gravel & Tar\ 208 | WdShake Wood Shakes\ 209 | WdShngl Wood Shingles\ 210 | \ 211 | Exterior1st: Exterior covering on house\ 212 | \ 213 | AsbShng Asbestos Shingles\ 214 | AsphShn Asphalt Shingles\ 215 | BrkComm Brick Common\ 216 | BrkFace Brick Face\ 217 | CBlock Cinder Block\ 218 | CemntBd Cement Board\ 219 | HdBoard Hard Board\ 220 | ImStucc Imitation Stucco\ 221 | MetalSd Metal Siding\ 222 | Other Other\ 223 | Plywood Plywood\ 224 | PreCast PreCast \ 225 | Stone Stone\ 226 | Stucco Stucco\ 227 | VinylSd Vinyl Siding\ 228 | Wd Sdng Wood Siding\ 229 | WdShing Wood Shingles\ 230 | \ 231 | Exterior2nd: Exterior covering on house (if more than one material)\ 232 | \ 233 | AsbShng Asbestos Shingles\ 234 | AsphShn Asphalt Shingles\ 235 | BrkComm Brick Common\ 236 | BrkFace Brick Face\ 237 | CBlock Cinder Block\ 238 | CemntBd Cement Board\ 239 | HdBoard Hard Board\ 240 | ImStucc Imitation Stucco\ 241 | MetalSd Metal Siding\ 242 | Other Other\ 243 | Plywood Plywood\ 244 | PreCast PreCast\ 245 | Stone Stone\ 246 | Stucco Stucco\ 247 | VinylSd Vinyl Siding\ 248 | Wd Sdng Wood Siding\ 249 | WdShing Wood Shingles\ 250 | \ 251 | MasVnrType: Masonry veneer type\ 252 | \ 253 | BrkCmn Brick Common\ 254 | BrkFace Brick Face\ 255 | CBlock Cinder Block\ 256 | None None\ 257 | Stone Stone\ 258 | \ 259 | MasVnrArea: Masonry veneer area in square feet\ 260 | \ 261 | ExterQual: Evaluates the quality of the material on the exterior \ 262 | \ 263 | Ex Excellent\ 
264 | Gd Good\ 265 | TA Average/Typical\ 266 | Fa Fair\ 267 | Po Poor\ 268 | \ 269 | ExterCond: Evaluates the present condition of the material on the exterior\ 270 | \ 271 | Ex Excellent\ 272 | Gd Good\ 273 | TA Average/Typical\ 274 | Fa Fair\ 275 | Po Poor\ 276 | \ 277 | Foundation: Type of foundation\ 278 | \ 279 | BrkTil Brick & Tile\ 280 | CBlock Cinder Block\ 281 | PConc Poured Contrete \ 282 | Slab Slab\ 283 | Stone Stone\ 284 | Wood Wood\ 285 | \ 286 | BsmtQual: Evaluates the height of the basement\ 287 | \ 288 | Ex Excellent (100+ inches) \ 289 | Gd Good (90-99 inches)\ 290 | TA Typical (80-89 inches)\ 291 | Fa Fair (70-79 inches)\ 292 | Po Poor (<70 inches\ 293 | NA No Basement\ 294 | \ 295 | BsmtCond: Evaluates the general condition of the basement\ 296 | \ 297 | Ex Excellent\ 298 | Gd Good\ 299 | TA Typical - slight dampness allowed\ 300 | Fa Fair - dampness or some cracking or settling\ 301 | Po Poor - Severe cracking, settling, or wetness\ 302 | NA No Basement\ 303 | \ 304 | BsmtExposure: Refers to walkout or garden level walls\ 305 | \ 306 | Gd Good Exposure\ 307 | Av Average Exposure (split levels or foyers typically score average or above) \ 308 | Mn Mimimum Exposure\ 309 | No No Exposure\ 310 | NA No Basement\ 311 | \ 312 | BsmtFinType1: Rating of basement finished area\ 313 | \ 314 | GLQ Good Living Quarters\ 315 | ALQ Average Living Quarters\ 316 | BLQ Below Average Living Quarters \ 317 | Rec Average Rec Room\ 318 | LwQ Low Quality\ 319 | Unf Unfinshed\ 320 | NA No Basement\ 321 | \ 322 | BsmtFinSF1: Type 1 finished square feet\ 323 | \ 324 | BsmtFinType2: Rating of basement finished area (if multiple types)\ 325 | \ 326 | GLQ Good Living Quarters\ 327 | ALQ Average Living Quarters\ 328 | BLQ Below Average Living Quarters \ 329 | Rec Average Rec Room\ 330 | LwQ Low Quality\ 331 | Unf Unfinshed\ 332 | NA No Basement\ 333 | \ 334 | BsmtFinSF2: Type 2 finished square feet\ 335 | \ 336 | BsmtUnfSF: Unfinished square feet of basement area\ 337 | \ 338 | TotalBsmtSF: Total square feet of basement area\ 339 | \ 340 | Heating: Type of heating\ 341 | \ 342 | Floor Floor Furnace\ 343 | GasA Gas forced warm air furnace\ 344 | GasW Gas hot water or steam heat\ 345 | Grav Gravity furnace \ 346 | OthW Hot water or steam heat other than gas\ 347 | Wall Wall furnace\ 348 | \ 349 | HeatingQC: Heating quality and condition\ 350 | \ 351 | Ex Excellent\ 352 | Gd Good\ 353 | TA Average/Typical\ 354 | Fa Fair\ 355 | Po Poor\ 356 | \ 357 | CentralAir: Central air conditioning\ 358 | \ 359 | N No\ 360 | Y Yes\ 361 | \ 362 | Electrical: Electrical system\ 363 | \ 364 | SBrkr Standard Circuit Breakers & Romex\ 365 | FuseA Fuse Box over 60 AMP and all Romex wiring (Average) \ 366 | FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)\ 367 | FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)\ 368 | Mix Mixed\ 369 | \ 370 | 1stFlrSF: First Floor square feet\ 371 | \ 372 | 2ndFlrSF: Second floor square feet\ 373 | \ 374 | LowQualFinSF: Low quality finished square feet (all floors)\ 375 | \ 376 | GrLivArea: Above grade (ground) living area square feet\ 377 | \ 378 | BsmtFullBath: Basement full bathrooms\ 379 | \ 380 | BsmtHalfBath: Basement half bathrooms\ 381 | \ 382 | FullBath: Full bathrooms above grade\ 383 | \ 384 | HalfBath: Half baths above grade\ 385 | \ 386 | Bedroom: Bedrooms above grade (does NOT include basement bedrooms)\ 387 | \ 388 | Kitchen: Kitchens above grade\ 389 | \ 390 | KitchenQual: Kitchen quality\ 391 | \ 392 | Ex Excellent\ 393 | Gd Good\ 394 | TA Typical/Average\ 
395 | Fa Fair\ 396 | Po Poor\ 397 | \ 398 | TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)\ 399 | \ 400 | Functional: Home functionality (Assume typical unless deductions are warranted)\ 401 | \ 402 | Typ Typical Functionality\ 403 | Min1 Minor Deductions 1\ 404 | Min2 Minor Deductions 2\ 405 | Mod Moderate Deductions\ 406 | Maj1 Major Deductions 1\ 407 | Maj2 Major Deductions 2\ 408 | Sev Severely Damaged\ 409 | Sal Salvage only\ 410 | \ 411 | Fireplaces: Number of fireplaces\ 412 | \ 413 | FireplaceQu: Fireplace quality\ 414 | \ 415 | Ex Excellent - Exceptional Masonry Fireplace\ 416 | Gd Good - Masonry Fireplace in main level\ 417 | TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement\ 418 | Fa Fair - Prefabricated Fireplace in basement\ 419 | Po Poor - Ben Franklin Stove\ 420 | NA No Fireplace\ 421 | \ 422 | GarageType: Garage location\ 423 | \ 424 | 2Types More than one type of garage\ 425 | Attchd Attached to home\ 426 | Basment Basement Garage\ 427 | BuiltIn Built-In (Garage part of house - typically has room above garage)\ 428 | CarPort Car Port\ 429 | Detchd Detached from home\ 430 | NA No Garage\ 431 | \ 432 | GarageYrBlt: Year garage was built\ 433 | \ 434 | GarageFinish: Interior finish of the garage\ 435 | \ 436 | Fin Finished\ 437 | RFn Rough Finished \ 438 | Unf Unfinished\ 439 | NA No Garage\ 440 | \ 441 | GarageCars: Size of garage in car capacity\ 442 | \ 443 | GarageArea: Size of garage in square feet\ 444 | \ 445 | GarageQual: Garage quality\ 446 | \ 447 | Ex Excellent\ 448 | Gd Good\ 449 | TA Typical/Average\ 450 | Fa Fair\ 451 | Po Poor\ 452 | NA No Garage\ 453 | \ 454 | GarageCond: Garage condition\ 455 | \ 456 | Ex Excellent\ 457 | Gd Good\ 458 | TA Typical/Average\ 459 | Fa Fair\ 460 | Po Poor\ 461 | NA No Garage\ 462 | \ 463 | PavedDrive: Paved driveway\ 464 | \ 465 | Y Paved \ 466 | P Partial Pavement\ 467 | N Dirt/Gravel\ 468 | \ 469 | WoodDeckSF: Wood deck area in square feet\ 470 | \ 471 | OpenPorchSF: Open porch area in square feet\ 472 | \ 473 | EnclosedPorch: Enclosed porch area in square feet\ 474 | \ 475 | 3SsnPorch: Three season porch area in square feet\ 476 | \ 477 | ScreenPorch: Screen porch area in square feet\ 478 | \ 479 | PoolArea: Pool area in square feet\ 480 | \ 481 | PoolQC: Pool quality\ 482 | \ 483 | Ex Excellent\ 484 | Gd Good\ 485 | TA Average/Typical\ 486 | Fa Fair\ 487 | NA No Pool\ 488 | \ 489 | Fence: Fence quality\ 490 | \ 491 | GdPrv Good Privacy\ 492 | MnPrv Minimum Privacy\ 493 | GdWo Good Wood\ 494 | MnWw Minimum Wood/Wire\ 495 | NA No Fence\ 496 | \ 497 | MiscFeature: Miscellaneous feature not covered in other categories\ 498 | \ 499 | Elev Elevator\ 500 | Gar2 2nd Garage (if not described in garage section)\ 501 | Othr Other\ 502 | Shed Shed (over 100 SF)\ 503 | TenC Tennis Court\ 504 | NA None\ 505 | \ 506 | MiscVal: $Value of miscellaneous feature\ 507 | \ 508 | MoSold: Month Sold (MM)\ 509 | \ 510 | YrSold: Year Sold (YYYY)\ 511 | \ 512 | SaleType: Type of sale\ 513 | \ 514 | WD Warranty Deed - Conventional\ 515 | CWD Warranty Deed - Cash\ 516 | VWD Warranty Deed - VA Loan\ 517 | New Home just constructed and sold\ 518 | COD Court Officer Deed/Estate\ 519 | Con Contract 15% Down payment regular terms\ 520 | ConLw Contract Low Down payment and low interest\ 521 | ConLI Contract Low Interest\ 522 | ConLD Contract Low Down\ 523 | Oth Other\ 524 | \ 525 | SaleCondition: Condition of sale\ 526 | \ 527 | Normal Normal Sale\ 528 | Abnorml Abnormal Sale - trade, foreclosure, 
short sale\ 529 | AdjLand Adjoining Land Purchase\ 530 | Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit \ 531 | Family Sale between family members\ 532 | Partial Home was not completed when last assessed (associated with New Homes)\ 533 | Ch} -------------------------------------------------------------------------------- /Challenges/House_Pricing/challenge_data/sample_submission.csv: -------------------------------------------------------------------------------- 1 | Id,SalePrice 2 | 1461,169277.0524984 3 | 1462,187758.393988768 4 | 1463,183583.683569555 5 | 1464,179317.47751083 6 | 1465,150730.079976501 7 | 1466,177150.989247307 8 | 1467,172070.659229164 9 | 1468,175110.956519547 10 | 1469,162011.698831665 11 | 1470,160726.247831419 12 | 1471,157933.279456005 13 | 1472,145291.245020389 14 | 1473,159672.017631819 15 | 1474,164167.518301885 16 | 1475,150891.638244053 17 | 1476,179460.96518734 18 | 1477,185034.62891405 19 | 1478,182352.192644656 20 | 1479,183053.458213802 21 | 1480,187823.339254278 22 | 1481,186544.114327568 23 | 1482,158230.77520516 24 | 1483,190552.829321091 25 | 1484,147183.67487199 26 | 1485,185855.300905493 27 | 1486,174350.470676986 28 | 1487,201740.620690863 29 | 1488,162986.378895754 30 | 1489,162330.199085679 31 | 1490,165845.938616539 32 | 1491,180929.622876974 33 | 1492,163481.501519718 34 | 1493,187798.076714233 35 | 1494,198822.198942566 36 | 1495,194868.409899858 37 | 1496,152605.298564403 38 | 1497,147797.702836811 39 | 1498,150521.96899297 40 | 1499,146991.630153739 41 | 1500,150306.307814534 42 | 1501,151164.372534604 43 | 1502,151133.706960953 44 | 1503,156214.042540726 45 | 1504,171992.760735142 46 | 1505,173214.912549738 47 | 1506,192429.187345783 48 | 1507,190878.69508543 49 | 1508,194542.544135519 50 | 1509,191849.439072822 51 | 1510,176363.773907793 52 | 1511,176954.185412429 53 | 1512,176521.216975696 54 | 1513,179436.704810176 55 | 1514,220079.756777048 56 | 1515,175502.918109444 57 | 1516,188321.073833569 58 | 1517,163276.324450004 59 | 1518,185911.366293097 60 | 1519,171392.830997252 61 | 1520,174418.207020775 62 | 1521,179682.709603774 63 | 1522,179423.751581665 64 | 1523,171756.918091777 65 | 1524,166849.638174419 66 | 1525,181122.168676666 67 | 1526,170934.462746566 68 | 1527,159738.292580329 69 | 1528,174445.759557658 70 | 1529,174706.363659627 71 | 1530,164507.672539365 72 | 1531,163602.512172832 73 | 1532,154126.270249525 74 | 1533,171104.853481351 75 | 1534,167735.39270528 76 | 1535,183003.613338104 77 | 1536,172580.381161499 78 | 1537,165407.889104689 79 | 1538,176363.773907793 80 | 1539,175182.950898522 81 | 1540,190757.177789246 82 | 1541,167186.995771991 83 | 1542,167839.376779276 84 | 1543,173912.421165137 85 | 1544,154034.917445551 86 | 1545,156002.955794336 87 | 1546,168173.94329857 88 | 1547,168882.437104132 89 | 1548,168173.94329857 90 | 1549,157580.177551642 91 | 1550,181922.15256011 92 | 1551,155134.227842592 93 | 1552,188885.573319552 94 | 1553,183963.193012381 95 | 1554,161298.762306335 96 | 1555,188613.66763056 97 | 1556,175080.111822945 98 | 1557,174744.400305232 99 | 1558,168175.911336919 100 | 1559,182333.472575006 101 | 1560,158307.206742274 102 | 1561,193053.055502348 103 | 1562,175031.089987177 104 | 1563,160713.294602908 105 | 1564,173186.215014436 106 | 1565,191736.7598055 107 | 1566,170401.630997116 108 | 1567,164626.577880222 109 | 1568,205469.409444832 110 | 1569,209561.784211885 111 | 1570,182271.503072356 112 | 1571,178081.549427793 113 | 1572,178425.956138831 114 
| 1573,162015.318511503 115 | 1574,181722.420373045 116 | 1575,156705.730169433 117 | 1576,182902.420342386 118 | 1577,157574.595395085 119 | 1578,184380.739100813 120 | 1579,169364.469225677 121 | 1580,175846.179822063 122 | 1581,189673.295302136 123 | 1582,174401.317715566 124 | 1583,179021.448718583 125 | 1584,189196.845337149 126 | 1585,139647.095720655 127 | 1586,161468.198288911 128 | 1587,171557.32317862 129 | 1588,179447.36804185 130 | 1589,169611.619017694 131 | 1590,172088.872655744 132 | 1591,171190.624128768 133 | 1592,154850.508361878 134 | 1593,158617.655719941 135 | 1594,209258.33693701 136 | 1595,177939.027626751 137 | 1596,194631.100299584 138 | 1597,213618.871562568 139 | 1598,198342.504228533 140 | 1599,138607.971472497 141 | 1600,150778.958976731 142 | 1601,146966.230339786 143 | 1602,162182.59620952 144 | 1603,176825.940961269 145 | 1604,152799.812402444 146 | 1605,180322.322067129 147 | 1606,177508.027228367 148 | 1607,208029.642652019 149 | 1608,181987.282510201 150 | 1609,160172.72797397 151 | 1610,176761.317654248 152 | 1611,176515.497545231 153 | 1612,176270.453065471 154 | 1613,183050.846258475 155 | 1614,150011.102062216 156 | 1615,159270.537808667 157 | 1616,163419.663729346 158 | 1617,163399.983345859 159 | 1618,173364.161505756 160 | 1619,169556.835902417 161 | 1620,183690.595995738 162 | 1621,176980.914909382 163 | 1622,204773.36222471 164 | 1623,174728.655998442 165 | 1624,181873.458244461 166 | 1625,177322.000823979 167 | 1626,193927.939041863 168 | 1627,181715.622732304 169 | 1628,199270.841200324 170 | 1629,177109.589956218 171 | 1630,153909.578271486 172 | 1631,162931.203336223 173 | 1632,166386.7567182 174 | 1633,173719.30379824 175 | 1634,179757.925656704 176 | 1635,179007.601964376 177 | 1636,180370.808623106 178 | 1637,185102.616730563 179 | 1638,198825.563452058 180 | 1639,184294.576009142 181 | 1640,200443.7920562 182 | 1641,181294.784484153 183 | 1642,174354.336267919 184 | 1643,172023.677781517 185 | 1644,181666.922855025 186 | 1645,179024.491269586 187 | 1646,178324.191575907 188 | 1647,184534.676687694 189 | 1648,159397.250378784 190 | 1649,178430.966728182 191 | 1650,177743.799385967 192 | 1651,179395.305519087 193 | 1652,151713.38474815 194 | 1653,151713.38474815 195 | 1654,168434.977996215 196 | 1655,153999.100311019 197 | 1656,164096.097354123 198 | 1657,166335.403036551 199 | 1658,163020.725375757 200 | 1659,155862.510668829 201 | 1660,182760.651095509 202 | 1661,201912.270622883 203 | 1662,185988.233987516 204 | 1663,183778.44888032 205 | 1664,170935.85921771 206 | 1665,184468.908382254 207 | 1666,191569.089663229 208 | 1667,232991.025583822 209 | 1668,180980.721388278 210 | 1669,164279.13048219 211 | 1670,183859.460411109 212 | 1671,185922.465682076 213 | 1672,191742.778119363 214 | 1673,199954.072465842 215 | 1674,180690.274752587 216 | 1675,163099.3096358 217 | 1676,140791.922472443 218 | 1677,166481.86647592 219 | 1678,172080.434496773 220 | 1679,191719.161659178 221 | 1680,160741.098612515 222 | 1681,157829.546854733 223 | 1682,196896.748596341 224 | 1683,159675.423990355 225 | 1684,182084.790901946 226 | 1685,179233.926374487 227 | 1686,155774.270901623 228 | 1687,181354.326716058 229 | 1688,179605.563663918 230 | 1689,181609.34866147 231 | 1690,178221.531623281 232 | 1691,175559.920735795 233 | 1692,200328.822792041 234 | 1693,178630.060559899 235 | 1694,177174.535221728 236 | 1695,172515.687368714 237 | 1696,204032.992922943 238 | 1697,176023.232787689 239 | 1698,202202.073341595 240 | 1699,181734.480075862 241 | 
1700,183982.158993126 242 | 1701,188007.94241481 243 | 1702,185922.966763517 244 | 1703,183978.544874918 245 | 1704,177199.618638821 246 | 1705,181878.647956764 247 | 1706,173622.088728263 248 | 1707,180728.168562655 249 | 1708,176477.026606328 250 | 1709,184282.266697609 251 | 1710,162062.47538448 252 | 1711,182550.070992189 253 | 1712,180987.949624695 254 | 1713,178173.79762147 255 | 1714,179980.635948606 256 | 1715,173257.637826205 257 | 1716,177271.291059307 258 | 1717,175338.355442312 259 | 1718,177548.140549508 260 | 1719,175969.91662932 261 | 1720,175011.481953462 262 | 1721,185199.372568143 263 | 1722,188514.050228937 264 | 1723,185080.145268797 265 | 1724,157304.402574096 266 | 1725,194260.859481297 267 | 1726,181262.329995106 268 | 1727,157003.292706732 269 | 1728,182924.499359899 270 | 1729,181902.586375439 271 | 1730,188985.371708134 272 | 1731,185290.904495068 273 | 1732,177304.425752748 274 | 1733,166274.900490809 275 | 1734,177807.420530107 276 | 1735,180330.624816201 277 | 1736,179069.112234629 278 | 1737,175943.371816948 279 | 1738,185199.050609653 280 | 1739,167350.910824524 281 | 1740,149315.311876449 282 | 1741,139010.847766793 283 | 1742,155412.151845447 284 | 1743,171308.313985441 285 | 1744,176220.543265638 286 | 1745,177643.434991809 287 | 1746,187222.653264601 288 | 1747,185635.132083154 289 | 1748,206492.534215854 290 | 1749,181681.021081956 291 | 1750,180500.198072685 292 | 1751,206486.17086841 293 | 1752,161334.301195429 294 | 1753,176156.558313965 295 | 1754,191642.223478994 296 | 1755,191945.808027777 297 | 1756,164146.306037354 298 | 1757,179883.057071096 299 | 1758,178071.137668844 300 | 1759,188241.637896875 301 | 1760,174559.656173171 302 | 1761,182347.363042264 303 | 1762,191507.251872857 304 | 1763,199751.865597358 305 | 1764,162106.416145131 306 | 1765,164575.982314367 307 | 1766,179176.352180931 308 | 1767,177327.403857584 309 | 1768,177818.083761781 310 | 1769,186965.204048443 311 | 1770,178762.742169197 312 | 1771,183322.866146283 313 | 1772,178903.295931891 314 | 1773,186570.129421778 315 | 1774,199144.242829024 316 | 1775,172154.713310956 317 | 1776,177444.019201603 318 | 1777,166200.938073485 319 | 1778,158995.770555632 320 | 1779,168273.282454755 321 | 1780,189680.453052788 322 | 1781,181681.021081956 323 | 1782,160277.142643643 324 | 1783,197318.54715833 325 | 1784,162228.935604196 326 | 1785,187340.455456083 327 | 1786,181065.347037275 328 | 1787,190233.609102705 329 | 1788,157929.594852031 330 | 1789,168557.001935469 331 | 1790,160805.584645628 332 | 1791,221648.391978216 333 | 1792,180539.88079815 334 | 1793,182105.616283853 335 | 1794,166380.852603154 336 | 1795,178942.155617426 337 | 1796,162804.747800461 338 | 1797,183077.684392615 339 | 1798,171728.4720292 340 | 1799,164786.741540638 341 | 1800,177427.267170302 342 | 1801,197318.54715833 343 | 1802,178658.114178223 344 | 1803,185437.320523764 345 | 1804,169759.652489529 346 | 1805,173986.635055186 347 | 1806,168607.664289468 348 | 1807,194138.519145183 349 | 1808,192502.440921994 350 | 1809,176746.969818601 351 | 1810,177604.891703134 352 | 1811,193283.746584832 353 | 1812,181627.061006609 354 | 1813,169071.62025834 355 | 1814,167398.006470987 356 | 1815,150106.505141704 357 | 1816,159650.304285848 358 | 1817,179471.23597476 359 | 1818,177109.589956218 360 | 1819,166558.113328453 361 | 1820,153796.714319583 362 | 1821,174520.152570658 363 | 1822,196297.95829524 364 | 1823,169100.681601175 365 | 1824,176911.319164431 366 | 1825,169234.6454828 367 | 1826,172386.297919134 368 | 
1827,156031.904802362 369 | 1828,168202.892306596 370 | 1829,166505.984017547 371 | 1830,176507.37022149 372 | 1831,180116.752553161 373 | 1832,183072.740591406 374 | 1833,189595.964677698 375 | 1834,167523.919076265 376 | 1835,210817.775863413 377 | 1836,172942.930813351 378 | 1837,145286.278144089 379 | 1838,176468.653371492 380 | 1839,159040.069562187 381 | 1840,178518.204332507 382 | 1841,169163.980786825 383 | 1842,189786.685274579 384 | 1843,181246.728523853 385 | 1844,176349.927153587 386 | 1845,205266.631009142 387 | 1846,187397.993362224 388 | 1847,208943.427726113 389 | 1848,165014.532907657 390 | 1849,182492.037566236 391 | 1850,161718.71259042 392 | 1851,180084.118941162 393 | 1852,178534.950802179 394 | 1853,151217.259961305 395 | 1854,156342.717587562 396 | 1855,188511.443835239 397 | 1856,183570.337896789 398 | 1857,225810.160292177 399 | 1858,214217.401131694 400 | 1859,187665.64101603 401 | 1860,161157.177744039 402 | 1861,187643.992594193 403 | 1862,228156.372839158 404 | 1863,220449.534665317 405 | 1864,220522.352084222 406 | 1865,156647.763531624 407 | 1866,187388.833374873 408 | 1867,178640.723791573 409 | 1868,180847.216739049 410 | 1869,159505.170529478 411 | 1870,164305.538020654 412 | 1871,180181.19673723 413 | 1872,184602.734989972 414 | 1873,193440.372174434 415 | 1874,184199.788209911 416 | 1875,196241.892907637 417 | 1876,175588.618271096 418 | 1877,179503.046546829 419 | 1878,183658.076582555 420 | 1879,193700.976276404 421 | 1880,165399.62450704 422 | 1881,186847.944787446 423 | 1882,198127.73287817 424 | 1883,183320.898107934 425 | 1884,181613.606696657 426 | 1885,178298.791761954 427 | 1886,185733.534000593 428 | 1887,180008.188485489 429 | 1888,175127.59621604 430 | 1889,183467.176862723 431 | 1890,182705.546021743 432 | 1891,152324.943593181 433 | 1892,169878.515981342 434 | 1893,183735.975076576 435 | 1894,224118.280105941 436 | 1895,169355.202465146 437 | 1896,180054.276407441 438 | 1897,174081.601977368 439 | 1898,168494.985022146 440 | 1899,181871.598843299 441 | 1900,173554.489658383 442 | 1901,169805.382165577 443 | 1902,176192.990728755 444 | 1903,204264.39284654 445 | 1904,169630.906956928 446 | 1905,185724.838807268 447 | 1906,195699.036281861 448 | 1907,189494.276162169 449 | 1908,149607.905673439 450 | 1909,154650.199045978 451 | 1910,151579.558140433 452 | 1911,185147.380531144 453 | 1912,196314.53120359 454 | 1913,210802.395364155 455 | 1914,166271.2863726 456 | 1915,154865.359142973 457 | 1916,173575.5052865 458 | 1917,179399.563554274 459 | 1918,164280.776562049 460 | 1919,171247.48948121 461 | 1920,166878.587182445 462 | 1921,188129.459710994 463 | 1922,183517.34369691 464 | 1923,175522.026925727 465 | 1924,190060.105331152 466 | 1925,174179.824771856 467 | 1926,171059.523675194 468 | 1927,183004.186769318 469 | 1928,183601.647387418 470 | 1929,163539.327185998 471 | 1930,164677.676391525 472 | 1931,162395.073865424 473 | 1932,182207.6323195 474 | 1933,192223.939790304 475 | 1934,176391.829390125 476 | 1935,181913.179121348 477 | 1936,179136.097888261 478 | 1937,196595.568243212 479 | 1938,194822.365690957 480 | 1939,148356.669440918 481 | 1940,160387.604263899 482 | 1941,181276.500571809 483 | 1942,192474.817899346 484 | 1943,157699.907796437 485 | 1944,215785.540813051 486 | 1945,181824.300998793 487 | 1946,221813.00948166 488 | 1947,165281.292597397 489 | 1948,255629.49047034 490 | 1949,173154.590990955 491 | 1950,183884.65246539 492 | 1951,200210.353608489 493 | 1952,186599.221265342 494 | 1953,192718.532696106 495 | 
1954,178628.665952764 496 | 1955,180650.342418406 497 | 1956,206003.107947263 498 | 1957,166457.67844853 499 | 1958,202916.221653487 500 | 1959,192463.969983091 501 | 1960,171775.497189898 502 | 1961,175249.222149411 503 | 1962,147086.59893993 504 | 1963,149709.672100371 505 | 1964,171411.404533743 506 | 1965,178188.964799425 507 | 1966,156491.711373235 508 | 1967,180953.241201168 509 | 1968,203909.759061135 510 | 1969,175470.149087545 511 | 1970,205578.333622415 512 | 1971,199428.857699441 513 | 1972,187599.163869476 514 | 1973,192265.198109864 515 | 1974,196666.554897677 516 | 1975,155537.862252682 517 | 1976,169543.240620935 518 | 1977,202487.010170501 519 | 1978,208232.716273485 520 | 1979,173621.195202569 521 | 1980,172414.608571812 522 | 1981,164400.75641556 523 | 1982,160480.424024781 524 | 1983,156060.853810389 525 | 1984,157437.192820581 526 | 1985,158163.720929772 527 | 1986,154849.043268978 528 | 1987,152186.609341561 529 | 1988,180340.215399228 530 | 1989,178344.62451356 531 | 1990,190170.382266827 532 | 1991,168092.975480832 533 | 1992,178757.912566805 534 | 1993,174518.256882082 535 | 1994,198168.490116289 536 | 1995,176882.693978902 537 | 1996,183801.672896251 538 | 1997,196400.046680661 539 | 1998,172281.605004025 540 | 1999,196380.366297173 541 | 2000,198228.354306682 542 | 2001,195556.581268962 543 | 2002,186453.264469043 544 | 2003,181869.381196234 545 | 2004,175610.840124147 546 | 2005,183438.730800145 547 | 2006,179584.488673295 548 | 2007,182386.152242034 549 | 2008,160750.367237054 550 | 2009,182477.505046008 551 | 2010,187720.359207171 552 | 2011,187201.942081511 553 | 2012,176385.102235149 554 | 2013,175901.787841278 555 | 2014,182584.280198283 556 | 2015,195664.686104237 557 | 2016,181420.346494222 558 | 2017,176676.04995228 559 | 2018,181594.678867334 560 | 2019,178521.747964951 561 | 2020,175895.883726231 562 | 2021,168468.005916477 563 | 2022,200973.129447888 564 | 2023,197030.641992202 565 | 2024,192867.417844592 566 | 2025,196449.247639381 567 | 2026,141684.196398607 568 | 2027,153353.334123901 569 | 2028,151143.549016705 570 | 2029,163753.087114229 571 | 2030,158682.460013921 572 | 2031,144959.835250915 573 | 2032,160144.390548579 574 | 2033,156286.534303521 575 | 2034,165726.707619571 576 | 2035,182427.481047359 577 | 2036,173310.56154032 578 | 2037,173310.56154032 579 | 2038,151556.01403002 580 | 2039,158908.146068683 581 | 2040,209834.383092536 582 | 2041,192410.516550815 583 | 2042,174026.247294886 584 | 2043,195499.830115336 585 | 2044,200918.018812493 586 | 2045,207243.616023976 587 | 2046,196149.783851876 588 | 2047,192097.914850217 589 | 2048,178570.948923671 590 | 2049,228617.968325428 591 | 2050,199929.884438451 592 | 2051,160206.365612859 593 | 2052,179854.431885567 594 | 2053,185987.340461822 595 | 2054,161122.505607926 596 | 2055,175949.342720138 597 | 2056,183683.590595324 598 | 2057,176401.34762338 599 | 2058,205832.532527897 600 | 2059,177799.799849436 601 | 2060,167565.362080406 602 | 2061,186348.958436557 603 | 2062,179782.759465081 604 | 2063,169837.623333323 605 | 2064,178817.275675758 606 | 2065,174444.479149339 607 | 2066,192834.968917174 608 | 2067,196564.717984981 609 | 2068,206977.567039357 610 | 2069,157054.253944128 611 | 2070,175142.948078577 612 | 2071,159932.1643654 613 | 2072,182801.408333628 614 | 2073,181510.375176825 615 | 2074,181613.035129451 616 | 2075,186920.512597635 617 | 2076,157950.170625222 618 | 2077,176115.159022876 619 | 2078,182744.514344465 620 | 2079,180660.683691591 621 | 2080,160775.629777099 622 | 
2081,186711.715848082 623 | 2082,223581.758190888 624 | 2083,172330.943236652 625 | 2084,163474.633393212 626 | 2085,175308.263299874 627 | 2086,187462.725306432 628 | 2087,180655.101535034 629 | 2088,152121.98603454 630 | 2089,159856.233909727 631 | 2090,186559.854936737 632 | 2091,183962.550959411 633 | 2092,162107.168699296 634 | 2093,162582.288981283 635 | 2094,154407.701597409 636 | 2095,181625.666399474 637 | 2096,164810.609473548 638 | 2097,176429.401241704 639 | 2098,179188.089925259 640 | 2099,145997.635377703 641 | 2100,218676.768270367 642 | 2101,188323.861214226 643 | 2102,168690.0722914 644 | 2103,165088.746797705 645 | 2104,191435.007885166 646 | 2105,168864.404664512 647 | 2106,176041.882371574 648 | 2107,215911.674390325 649 | 2108,167388.238629016 650 | 2109,163854.786753017 651 | 2110,163299.477980171 652 | 2111,178298.214633119 653 | 2112,176376.586164775 654 | 2113,170211.043976522 655 | 2114,170818.344786366 656 | 2115,174388.867432503 657 | 2116,161112.987374671 658 | 2117,172179.082325307 659 | 2118,157798.309713876 660 | 2119,169106.151422924 661 | 2120,170129.531364292 662 | 2121,157680.227412949 663 | 2122,162690.209131977 664 | 2123,146968.379365095 665 | 2124,181507.721372455 666 | 2125,191215.589752983 667 | 2126,189432.689844522 668 | 2127,207271.484957719 669 | 2128,170030.807488363 670 | 2129,148409.806476335 671 | 2130,193850.613979055 672 | 2131,193808.319298263 673 | 2132,166300.235380627 674 | 2133,163474.633393212 675 | 2134,177473.606564978 676 | 2135,157443.925537187 677 | 2136,180681.007992057 678 | 2137,183463.17030026 679 | 2138,182481.763081195 680 | 2139,193717.15117887 681 | 2140,182782.55099007 682 | 2141,175530.651633287 683 | 2142,177804.057884623 684 | 2143,159448.670848577 685 | 2144,181338.976717529 686 | 2145,178553.558537021 687 | 2146,162820.928264556 688 | 2147,188832.479997186 689 | 2148,164682.185899437 690 | 2149,181549.735943801 691 | 2150,199158.097008868 692 | 2151,152889.520990566 693 | 2152,181150.551679116 694 | 2153,181416.732376013 695 | 2154,164391.238182305 696 | 2155,185421.046498812 697 | 2156,193981.327550004 698 | 2157,178824.324789223 699 | 2158,209270.051606246 700 | 2159,177801.266806344 701 | 2160,179053.762236101 702 | 2161,178762.170601992 703 | 2162,184655.300458183 704 | 2163,191284.655779772 705 | 2164,179598.085818785 706 | 2165,167517.628078595 707 | 2166,182873.903794044 708 | 2167,177484.91371363 709 | 2168,188444.597319524 710 | 2169,179184.153848562 711 | 2170,184365.175780982 712 | 2171,184479.322005212 713 | 2172,182927.863869391 714 | 2173,178611.639373646 715 | 2174,181943.343613558 716 | 2175,175080.614768394 717 | 2176,190720.794649138 718 | 2177,198422.868144723 719 | 2178,184482.11308349 720 | 2179,139214.952187861 721 | 2180,169233.113601757 722 | 2181,180664.118686848 723 | 2182,178818.742632666 724 | 2183,180422.049969947 725 | 2184,178601.93645581 726 | 2185,183083.159775993 727 | 2186,173163.101499699 728 | 2187,185968.161159774 729 | 2188,171226.050683054 730 | 2189,281643.976116786 731 | 2190,160031.711281258 732 | 2191,162775.979779394 733 | 2192,160735.445970193 734 | 2193,166646.109048572 735 | 2194,188384.548444549 736 | 2195,165830.697255197 737 | 2196,182138.358533039 738 | 2197,171595.397975647 739 | 2198,160337.079183809 740 | 2199,191215.088671543 741 | 2200,166956.093232213 742 | 2201,186581.830878692 743 | 2202,176450.548582099 744 | 2203,193743.194909801 745 | 2204,198882.566078408 746 | 2205,176385.102235149 747 | 2206,162447.639333636 748 | 2207,193782.555676777 749 | 
2208,183653.890897141 750 | 2209,210578.623546866 751 | 2210,158527.164107319 752 | 2211,163081.025723456 753 | 2212,174388.867432503 754 | 2213,191905.870131966 755 | 2214,174388.867432503 756 | 2215,161642.711648983 757 | 2216,186939.507215101 758 | 2217,172482.165792649 759 | 2218,159695.999763546 760 | 2219,157230.369671007 761 | 2220,179188.089925259 762 | 2221,157972.82120994 763 | 2222,156804.951429181 764 | 2223,211491.972463654 765 | 2224,186537.246201062 766 | 2225,200468.161070551 767 | 2226,182241.340444154 768 | 2227,157342.225898399 769 | 2228,182022.387105998 770 | 2229,181244.510876788 771 | 2230,178556.671573788 772 | 2231,189547.199876284 773 | 2232,187948.65165563 774 | 2233,194107.287565956 775 | 2234,183521.710369283 776 | 2235,183682.123638416 777 | 2236,178483.353073443 778 | 2237,184003.879764736 779 | 2238,171318.59033449 780 | 2239,162039.754313997 781 | 2240,154846.252190699 782 | 2241,194822.365690957 783 | 2242,169788.738771463 784 | 2243,178891.554489941 785 | 2244,152084.772428865 786 | 2245,139169.86642879 787 | 2246,192439.536044606 788 | 2247,161067.859766557 789 | 2248,158762.648504781 790 | 2249,175569.690441774 791 | 2250,183659.795012187 792 | 2251,280618.132617258 793 | 2252,180051.809151659 794 | 2253,176519.18031559 795 | 2254,179028.429210291 796 | 2255,177161.583857224 797 | 2256,180081.508849842 798 | 2257,205895.254584712 799 | 2258,183389.78131415 800 | 2259,178543.647859512 801 | 2260,194798.320499104 802 | 2261,162845.613675766 803 | 2262,148103.867006579 804 | 2263,201016.171121215 805 | 2264,277936.12694354 806 | 2265,249768.279823405 807 | 2266,161596.052159825 808 | 2267,158011.114889899 809 | 2268,194089.683858004 810 | 2269,181733.336941451 811 | 2270,182852.32772198 812 | 2271,189893.003058465 813 | 2272,194650.210979875 814 | 2273,187904.461286262 815 | 2274,171774.925622692 816 | 2275,177998.685921479 817 | 2276,175648.484325498 818 | 2277,196918.071362067 819 | 2278,184299.838071218 820 | 2279,182379.855682734 821 | 2280,184050.725802482 822 | 2281,158296.975970284 823 | 2282,175053.355553278 824 | 2283,162293.376090644 825 | 2284,186328.880047186 826 | 2285,151422.116936538 827 | 2286,181969.358707768 828 | 2287,189122.67702416 829 | 2288,185645.475220346 830 | 2289,182829.898109257 831 | 2290,195848.788183328 832 | 2291,198785.059550672 833 | 2292,181676.126555428 834 | 2293,194131.012663328 835 | 2294,201416.004864508 836 | 2295,185096.577205616 837 | 2296,195158.972598372 838 | 2297,184795.783735112 839 | 2298,189168.263864671 840 | 2299,216855.260149095 841 | 2300,184946.642483576 842 | 2301,189317.51282069 843 | 2302,180803.277842406 844 | 2303,175061.18585763 845 | 2304,179074.839090732 846 | 2305,145708.764336107 847 | 2306,142398.022752011 848 | 2307,161474.534863641 849 | 2308,157025.945155458 850 | 2309,163424.037827357 851 | 2310,164692.778645345 852 | 2311,152163.2443541 853 | 2312,192383.215486656 854 | 2313,182520.230322476 855 | 2314,187254.507549722 856 | 2315,176489.659740359 857 | 2316,181520.466841293 858 | 2317,186414.978214721 859 | 2318,185197.764639705 860 | 2319,178657.794083741 861 | 2320,179731.198023759 862 | 2321,161748.271317074 863 | 2322,158608.749069322 864 | 2323,178807.370559878 865 | 2324,184187.158803897 866 | 2325,181686.10402108 867 | 2326,190311.050228337 868 | 2327,192252.496354076 869 | 2328,193954.849525775 870 | 2329,181044.201560887 871 | 2330,180258.131219792 872 | 2331,199641.657313834 873 | 2332,197530.775205517 874 | 2333,191777.196949138 875 | 2334,195779.543033588 876 | 
2335,202112.046522999 877 | 2336,192343.34807661 878 | 2337,185191.359443218 879 | 2338,186760.207965688 880 | 2339,177733.78193528 881 | 2340,164430.391189608 882 | 2341,185299.601552401 883 | 2342,186414.012339254 884 | 2343,176401.921054593 885 | 2344,182381.322639642 886 | 2345,176334.184710805 887 | 2346,184901.735847457 888 | 2347,180085.766885029 889 | 2348,184901.735847457 890 | 2349,183967.561548763 891 | 2350,193046.301574659 892 | 2351,168538.969495849 893 | 2352,170157.842016969 894 | 2353,196559.709259637 895 | 2354,177133.709361852 896 | 2355,181553.279576244 897 | 2356,185770.606634739 898 | 2357,177017.595099274 899 | 2358,184123.358536806 900 | 2359,165970.357492196 901 | 2360,158151.985049452 902 | 2361,177086.476441481 903 | 2362,196373.896176551 904 | 2363,172465.707083115 905 | 2364,168590.782409896 906 | 2365,158820.474171061 907 | 2366,151611.37057651 908 | 2367,152125.028585543 909 | 2368,158404.073081048 910 | 2369,160692.078640755 911 | 2370,170175.22684199 912 | 2371,169854.436591138 913 | 2372,183410.785819008 914 | 2373,180347.194026928 915 | 2374,178930.528374292 916 | 2375,153346.220086301 917 | 2376,182675.204270589 918 | 2377,180770.649792036 919 | 2378,188714.148087543 920 | 2379,191393.608594076 921 | 2380,174016.157494425 922 | 2381,183189.685319552 923 | 2382,183621.508757866 924 | 2383,168991.29635758 925 | 2384,185306.650665866 926 | 2385,189030.680303208 927 | 2386,179208.665698449 928 | 2387,174901.452792889 929 | 2388,168337.406544343 930 | 2389,158234.96461859 931 | 2390,179562.453368834 932 | 2391,174176.391640607 933 | 2392,173931.531845427 934 | 2393,184111.729429665 935 | 2394,179374.482001188 936 | 2395,207348.811884535 937 | 2396,186983.419339031 938 | 2397,206779.094049527 939 | 2398,177472.074683935 940 | 2399,156727.948324862 941 | 2400,157090.568462479 942 | 2401,160387.032696693 943 | 2402,172410.28005086 944 | 2403,191603.365657467 945 | 2404,182152.207151253 946 | 2405,180161.697340702 947 | 2406,169652.235284283 948 | 2407,182503.520140218 949 | 2408,179714.630677039 950 | 2409,180282.570719908 951 | 2410,192600.338060371 952 | 2411,166115.491248565 953 | 2412,186379.553524443 954 | 2413,184361.992258449 955 | 2414,186220.965458121 956 | 2415,198176.47090687 957 | 2416,168437.776500131 958 | 2417,178003.582312015 959 | 2418,179180.469244588 960 | 2419,191930.561104806 961 | 2420,175590.266214964 962 | 2421,176713.19307219 963 | 2422,180159.090947005 964 | 2423,188090.100808026 965 | 2424,186184.717727913 966 | 2425,223055.588672278 967 | 2426,158270.753116401 968 | 2427,184733.12846644 969 | 2428,199926.378957429 970 | 2429,175075.785166001 971 | 2430,180917.925148076 972 | 2431,182067.760625207 973 | 2432,178238.60191545 974 | 2433,173454.944606532 975 | 2434,176821.936262814 976 | 2435,183642.191304235 977 | 2436,177254.582741058 978 | 2437,168715.950111702 979 | 2438,180096.931198144 980 | 2439,160620.728178758 981 | 2440,175286.544392273 982 | 2441,153494.783276297 983 | 2442,156407.65915545 984 | 2443,162162.525245786 985 | 2444,166809.886827197 986 | 2445,172929.156408918 987 | 2446,193514.330894137 988 | 2447,181612.141603756 989 | 2448,191745.386377068 990 | 2449,171369.325038261 991 | 2450,184425.470567051 992 | 2451,170563.252355189 993 | 2452,184522.369240168 994 | 2453,164968.947931153 995 | 2454,157939.621592364 996 | 2455,151520.381580069 997 | 2456,176129.508722531 998 | 2457,171112.978971478 999 | 2458,169762.081624282 1000 | 2459,162246.828936295 1001 | 2460,171339.303381589 1002 | 2461,189034.753653813 1003 | 
2462,175758.873595981 1004 | 2463,163351.721489893 1005 | 2464,189806.546645026 1006 | 2465,175370.990918319 1007 | 2466,196895.599900301 1008 | 2467,176905.917994834 1009 | 2468,176866.557227858 1010 | 2469,163590.677170026 1011 | 2470,212693.502958393 1012 | 2471,192686.931747717 1013 | 2472,181578.684951827 1014 | 2473,166475.457581812 1015 | 2474,185998.255166219 1016 | 2475,185527.714877908 1017 | 2476,159027.118197683 1018 | 2477,181169.654933769 1019 | 2478,176732.915304722 1020 | 2479,191619.294648838 1021 | 2480,189114.303789324 1022 | 2481,180934.635330334 1023 | 2482,164573.372223048 1024 | 2483,173902.011270196 1025 | 2484,165625.127741229 1026 | 2485,179555.219570787 1027 | 2486,196899.720661579 1028 | 2487,207566.12470446 1029 | 2488,163899.981149274 1030 | 2489,189179.428177786 1031 | 2490,193892.880023125 1032 | 2491,178980.874331431 1033 | 2492,179749.876244365 1034 | 2493,197999.674975598 1035 | 2494,203717.470295797 1036 | 2495,185249.261156892 1037 | 2496,201691.208274848 1038 | 2497,181956.548314794 1039 | 2498,171895.936275806 1040 | 2499,187245.168439419 1041 | 2500,157816.77461318 1042 | 2501,191702.912573325 1043 | 2502,198599.420028908 1044 | 2503,187193.313676329 1045 | 2504,220514.993999535 1046 | 2505,181814.527595192 1047 | 2506,183750.755371907 1048 | 2507,183000.431679579 1049 | 2508,185830.971906573 1050 | 2509,185497.872344187 1051 | 2510,179613.437681321 1052 | 2511,164454.967963631 1053 | 2512,185127.237217638 1054 | 2513,178750.613844623 1055 | 2514,160927.61044889 1056 | 2515,192562.808057836 1057 | 2516,180990.24148554 1058 | 2517,180064.941503122 1059 | 2518,196070.997393789 1060 | 2519,180352.919019023 1061 | 2520,183367.953769362 1062 | 2521,176734.841494027 1063 | 2522,180848.220765939 1064 | 2523,187806.059368823 1065 | 2524,180521.52640004 1066 | 2525,181502.754496154 1067 | 2526,174525.87942676 1068 | 2527,188927.984063168 1069 | 2528,184728.870431253 1070 | 2529,179857.975518011 1071 | 2530,180962.868071609 1072 | 2531,179194.066390078 1073 | 2532,179591.789259484 1074 | 2533,180638.463702549 1075 | 2534,185846.215131922 1076 | 2535,195174.031139141 1077 | 2536,192474.56829063 1078 | 2537,164200.595496827 1079 | 2538,178403.094096818 1080 | 2539,170774.84018302 1081 | 2540,179879.945898337 1082 | 2541,177668.192752792 1083 | 2542,180174.328610725 1084 | 2543,170643.303572141 1085 | 2544,165448.004289838 1086 | 2545,195531.754886222 1087 | 2546,165314.177682121 1088 | 2547,172532.757660882 1089 | 2548,203310.218069877 1090 | 2549,175090.062515883 1091 | 2550,230841.338626282 1092 | 2551,155225.19006632 1093 | 2552,168322.342441945 1094 | 2553,165956.259265265 1095 | 2554,193956.817564124 1096 | 2555,171070.367893827 1097 | 2556,166285.243628001 1098 | 2557,182875.801346628 1099 | 2558,218108.536769738 1100 | 2559,174378.777632042 1101 | 2560,164731.316372391 1102 | 2561,156969.695083273 1103 | 2562,173388.854342604 1104 | 2563,177559.628685119 1105 | 2564,194297.789279905 1106 | 2565,174894.588364005 1107 | 2566,196544.144075798 1108 | 2567,179036.158528149 1109 | 2568,211423.986511149 1110 | 2569,208156.398935188 1111 | 2570,159233.941347257 1112 | 2571,210820.115134931 1113 | 2572,140196.10979821 1114 | 2573,198678.469082978 1115 | 2574,186818.610760803 1116 | 2575,175044.797633861 1117 | 2576,180031.162892704 1118 | 2577,176889.171525162 1119 | 2578,159638.856165666 1120 | 2579,154287.264375509 1121 | 2580,191885.618181273 1122 | 2581,177503.378612934 1123 | 2582,166548.31684976 1124 | 2583,164475.14942856 1125 | 2584,167484.744857879 1126 | 
2585,188683.160555403 1127 | 2586,162243.399502668 1128 | 2587,180807.213919103 1129 | 2588,176279.079637039 1130 | 2589,163438.959094218 1131 | 2590,161495.5393685 1132 | 2591,216032.303722443 1133 | 2592,176632.181541401 1134 | 2593,168743.001567144 1135 | 2594,183810.11848086 1136 | 2595,156794.36054728 1137 | 2596,169136.43011395 1138 | 2597,183203.318752456 1139 | 2598,213252.926930889 1140 | 2599,190550.327866959 1141 | 2600,234707.209860273 1142 | 2601,135751.318892816 1143 | 2602,164228.45886894 1144 | 2603,153219.437030419 1145 | 2604,164210.746523801 1146 | 2605,163883.229117973 1147 | 2606,154892.776269956 1148 | 2607,197092.08733832 1149 | 2608,228148.376399122 1150 | 2609,178680.587503997 1151 | 2610,165643.341167808 1152 | 2611,222406.642660249 1153 | 2612,184021.843582599 1154 | 2613,170871.094939159 1155 | 2614,189562.873697309 1156 | 2615,170591.884966356 1157 | 2616,172934.351682851 1158 | 2617,186425.069879189 1159 | 2618,218648.131133006 1160 | 2619,183035.606761141 1161 | 2620,178378.906069427 1162 | 2621,184516.716597846 1163 | 2622,181419.5253183 1164 | 2623,196858.923438425 1165 | 2624,189228.701486278 1166 | 2625,208973.380761028 1167 | 2626,180269.86896412 1168 | 2627,159488.713683953 1169 | 2628,191490.299507521 1170 | 2629,228684.245137946 1171 | 2630,201842.998700429 1172 | 2631,209242.82289186 1173 | 2632,202357.62258493 1174 | 2633,168238.61218265 1175 | 2634,202524.12465369 1176 | 2635,170588.771929588 1177 | 2636,198375.31512987 1178 | 2637,170636.827889889 1179 | 2638,181991.079479377 1180 | 2639,183994.54251844 1181 | 2640,182951.482193584 1182 | 2641,174126.297156192 1183 | 2642,170575.496742588 1184 | 2643,175332.239869971 1185 | 2644,167522.061539111 1186 | 2645,168095.583738538 1187 | 2646,154406.415627461 1188 | 2647,170996.973346087 1189 | 2648,159056.890245639 1190 | 2649,181373.6165193 1191 | 2650,152272.560975937 1192 | 2651,168664.346821336 1193 | 2652,211007.008292301 1194 | 2653,182909.515032911 1195 | 2654,203926.829353303 1196 | 2655,179082.825442944 1197 | 2656,206260.099795032 1198 | 2657,181732.443415757 1199 | 2658,189698.740693148 1200 | 2659,203074.34678979 1201 | 2660,201670.634365666 1202 | 2661,173756.812589691 1203 | 2662,181387.076390881 1204 | 2663,184859.155270535 1205 | 2664,158313.615666777 1206 | 2665,151951.955409666 1207 | 2666,162537.52704471 1208 | 2667,178998.337067854 1209 | 2668,186732.584943041 1210 | 2669,187323.318406165 1211 | 2670,199437.232798284 1212 | 2671,185546.680858653 1213 | 2672,161595.015798593 1214 | 2673,154672.422763036 1215 | 2674,159355.710116165 1216 | 2675,155919.014077746 1217 | 2676,182424.87095604 1218 | 2677,178100.589622319 1219 | 2678,202577.900044456 1220 | 2679,177862.778940605 1221 | 2680,182056.024744887 1222 | 2681,191403.199177104 1223 | 2682,196264.754980043 1224 | 2683,209375.003419718 1225 | 2684,196691.81930173 1226 | 2685,192458.431539585 1227 | 2686,182242.80926507 1228 | 2687,183259.503900506 1229 | 2688,188108.243748841 1230 | 2689,171418.640195797 1231 | 2690,194698.882220432 1232 | 2691,174841.84007522 1233 | 2692,172965.476488899 1234 | 2693,189386.323677132 1235 | 2694,185682.618340257 1236 | 2695,176412.012719061 1237 | 2696,174976.489722867 1238 | 2697,180718.581707643 1239 | 2698,186131.188248242 1240 | 2699,165220.786354033 1241 | 2700,164115.893800435 1242 | 2701,182125.729127024 1243 | 2702,182285.140233276 1244 | 2703,196325.442210366 1245 | 2704,164865.215329881 1246 | 2705,182694.492209823 1247 | 2706,185425.485520958 1248 | 2707,171414.7041191 1249 | 
2708,183433.472466085 1250 | 2709,176844.981155794 1251 | 2710,180568.187753206 1252 | 2711,185948.625475832 1253 | 2712,189388.291715481 1254 | 2713,142754.489165865 1255 | 2714,156106.800760811 1256 | 2715,155895.397617561 1257 | 2716,159851.977738548 1258 | 2717,185157.832305524 1259 | 2718,180716.291710805 1260 | 2719,176901.093954071 1261 | 2720,181017.222455218 1262 | 2721,183269.159407668 1263 | 2722,193550.830097069 1264 | 2723,170625.842699726 1265 | 2724,182012.405942725 1266 | 2725,179162.507290733 1267 | 2726,183269.159407668 1268 | 2727,180589.836175042 1269 | 2728,181465.935198741 1270 | 2729,196053.029878304 1271 | 2730,183421.020319014 1272 | 2731,167926.839083612 1273 | 2732,168027.530997889 1274 | 2733,182164.26685407 1275 | 2734,172469.071592608 1276 | 2735,181059.374300472 1277 | 2736,182997.570115536 1278 | 2737,166140.504179894 1279 | 2738,198515.546934075 1280 | 2739,193789.648503294 1281 | 2740,173550.025727531 1282 | 2741,176487.943174734 1283 | 2742,188813.302559147 1284 | 2743,178531.911979192 1285 | 2744,182145.731469001 1286 | 2745,179196.465024103 1287 | 2746,169618.349900686 1288 | 2747,170010.168655046 1289 | 2748,181739.671652174 1290 | 2749,172846.934955574 1291 | 2750,195560.8830172 1292 | 2751,180358.114292956 1293 | 2752,211817.702818093 1294 | 2753,176170.128686742 1295 | 2754,234492.248263699 1296 | 2755,182450.956536015 1297 | 2756,174902.068073146 1298 | 2757,173684.174293738 1299 | 2758,147196.673677562 1300 | 2759,175231.189709791 1301 | 2760,193417.64740633 1302 | 2761,183313.601249761 1303 | 2762,180882.250849082 1304 | 2763,186735.697979808 1305 | 2764,172922.865411247 1306 | 2765,202551.677190573 1307 | 2766,190485.634074173 1308 | 2767,173439.49362151 1309 | 2768,196613.598849219 1310 | 2769,178152.259700828 1311 | 2770,174519.904825949 1312 | 2771,172627.796932837 1313 | 2772,173732.689486435 1314 | 2773,209219.844787023 1315 | 2774,181059.374300472 1316 | 2775,188515.443002459 1317 | 2776,182164.26685407 1318 | 2777,188137.901597981 1319 | 2778,158893.54306269 1320 | 2779,189579.65066771 1321 | 2780,165229.803505847 1322 | 2781,162186.071220207 1323 | 2782,166374.879866351 1324 | 2783,161665.184974757 1325 | 2784,175079.328798445 1326 | 2785,203840.874021305 1327 | 2786,152129.078861057 1328 | 2787,181012.141380101 1329 | 2788,161305.53503837 1330 | 2789,203326.392972343 1331 | 2790,168385.571141831 1332 | 2791,183564.365159986 1333 | 2792,163784.619440861 1334 | 2793,171989.192193993 1335 | 2794,180839.95616829 1336 | 2795,170895.923185907 1337 | 2796,174071.054808518 1338 | 2797,259423.859147546 1339 | 2798,188000.824679588 1340 | 2799,179171.703565498 1341 | 2800,171022.241447762 1342 | 2801,174126.297156192 1343 | 2802,187625.573271948 1344 | 2803,199567.946369234 1345 | 2804,205328.078219268 1346 | 2805,166231.535025379 1347 | 2806,154743.91606057 1348 | 2807,159714.537012622 1349 | 2808,185563.069082422 1350 | 2809,171500.796725006 1351 | 2810,180983.443844799 1352 | 2811,183141.236914997 1353 | 2812,178498.634450214 1354 | 2813,224323.710512388 1355 | 2814,218200.642127877 1356 | 2815,182283.177756557 1357 | 2816,190054.639237419 1358 | 2817,160192.453934518 1359 | 2818,171289.393581756 1360 | 2819,151131.098733642 1361 | 2820,181721.458225594 1362 | 2821,172725.053851858 1363 | 2822,222438.699143414 1364 | 2823,235419.373448928 1365 | 2824,185150.926027596 1366 | 2825,184772.239624699 1367 | 2826,180658.216435809 1368 | 2827,209673.316647174 1369 | 2828,205939.810625621 1370 | 2829,165633.573325837 1371 | 2830,186030.317211014 1372 
| 2831,160312.319589212 1373 | 2832,190702.440251029 1374 | 2833,175122.810326699 1375 | 2834,183783.13937519 1376 | 2835,178290.666302221 1377 | 2836,181605.343963015 1378 | 2837,187992.451444752 1379 | 2838,188885.11781517 1380 | 2839,189959.344795118 1381 | 2840,179258.619211334 1382 | 2841,181518.750275669 1383 | 2842,193008.659237315 1384 | 2843,186313.89385619 1385 | 2844,181499.39185067 1386 | 2845,174126.297156192 1387 | 2846,183918.612062767 1388 | 2847,184114.270899227 1389 | 2848,158540.947801398 1390 | 2849,197034.759055859 1391 | 2850,185170.284452595 1392 | 2851,221134.533635148 1393 | 2852,184306.637575967 1394 | 2853,199792.302740996 1395 | 2854,143237.803559736 1396 | 2855,177294.838897736 1397 | 2856,182368.620883855 1398 | 2857,176487.943174734 1399 | 2858,183849.408762071 1400 | 2859,184964.141507413 1401 | 2860,196395.969632434 1402 | 2861,188374.936650438 1403 | 2862,176261.296806135 1404 | 2863,163628.142248426 1405 | 2864,180618.032628904 1406 | 2865,161647.329794081 1407 | 2866,167129.598867773 1408 | 2867,174750.988352687 1409 | 2868,177560.202116333 1410 | 2869,192577.796112839 1411 | 2870,199202.898960871 1412 | 2871,182818.156667308 1413 | 2872,148217.262540651 1414 | 2873,188997.797082492 1415 | 2874,185807.928877601 1416 | 2875,177030.477842021 1417 | 2876,175942.474593632 1418 | 2877,172912.518576433 1419 | 2878,198359.248864591 1420 | 2879,184379.133036383 1421 | 2880,194255.566948886 1422 | 2881,209449.651603064 1423 | 2882,169979.323958443 1424 | 2883,188206.281858748 1425 | 2884,186412.438609167 1426 | 2885,196761.386409959 1427 | 2886,208353.269558209 1428 | 2887,166548.067241044 1429 | 2888,175942.474593632 1430 | 2889,166790.457916434 1431 | 2890,160515.850579067 1432 | 2891,192167.621096362 1433 | 2892,178751.551083369 1434 | 2893,198678.894117024 1435 | 2894,164553.120272354 1436 | 2895,156887.932862327 1437 | 2896,164185.777305524 1438 | 2897,212992.120630876 1439 | 2898,197468.550532521 1440 | 2899,180106.84373966 1441 | 2900,183972.071056674 1442 | 2901,245283.198337927 1443 | 2902,170351.963410756 1444 | 2903,195596.307707478 1445 | 2904,189369.756330412 1446 | 2905,223667.404551664 1447 | 2906,169335.310624364 1448 | 2907,167411.02835165 1449 | 2908,187709.555003968 1450 | 2909,196526.002998991 1451 | 2910,137402.569855589 1452 | 2911,165086.775061735 1453 | 2912,188506.431412274 1454 | 2913,172917.456816012 1455 | 2914,166274.325225982 1456 | 2915,167081.220948984 1457 | 2916,164788.778231138 1458 | 2917,219222.423400059 1459 | 2918,184924.279658997 1460 | 2919,187741.866657478 1461 | -------------------------------------------------------------------------------- /Challenges/House_Pricing/house_pricing_challenge.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "AML2019" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "

Challenge 1

\n", 15 | "

House Pricing Prediction

\n", 16 | "
\n", 17 | "22th March 2019" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "The first AML challenge for this year is adapted from the well-known 'Zillow's Home Value Prediction' competition on Kaggle.\n", 25 | "In particular, given a dataset containing descriptions of homes on the US property market, your task is to make predictions on the selling price of as-yet unlisted properties. \n", 26 | "Developing a model which accurately fits the available training data while also generalising to unseen data-points is a multi-faceted challenge that involves a mixture of data exploration, pre-processing, model selection, and performance evaluation." 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "# Overview\n", 34 | "
" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "Beyond simply producing a well-performing model for making predictions, in this challenge we would like you to start developing your skills as a machine learning scientist.\n", 42 | "In this regard, your notebook should be structured in such a way as to explore the five following tasks that are expected to be carried out whenever undertaking such a project.\n", 43 | "The description below each aspect should serve as a guide for your work, but you are strongly encouraged to also explore alternative options and directions. \n", 44 | "Thinking outside the box will always be rewarded in these challenges." 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "
\n", 52 | "

1. Data Exploration

\n", 53 | "
" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "The first broad component of your notebook should enable you to familiarise yourselves with the given data, an outline of which is given at the end of this challenge specification.\n", 61 | "Among others, this section should investigate:\n", 62 | "\n", 63 | "- Data cleaning, e.g. treatment of categorial variables;\n", 64 | "- Data visualisation;\n", 65 | "- Computing descriptive statistics, e.g. correlation.\n", 66 | "- etc." 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "
\n", 74 | "

2. Data Pre-processing

\n", 75 | "
" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "The previous step should give you a better understanding of which pre-processing is required for the data.\n", 83 | "This may include:\n", 84 | "\n", 85 | "- Normalising and standardising the given data;\n", 86 | "- Removing outliers;\n", 87 | "- Carrying out feature selection, possibly using metrics derived from information theory;\n", 88 | "- Handling missing information in the dataset;\n", 89 | "- Augmenting the dataset with external information;\n", 90 | "- Combining existing features." 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "
\n", 98 | "

3. Model Selection

\n", 99 | "
" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "Perhaps the most important segment of this challenge involves the selection of a model that can successfully handle the given data and yield sensible predictions.\n", 107 | "Instead of focusing exclusively on your final chosen model, it is also important to share your thought process in this notebook by additionally describing alternative candidate models.\n", 108 | "There is a wealth of models to choose from, such as decision trees, random forests, (Bayesian) neural networks, Gaussian processes, LASSO regression, and so on.\n", 109 | "There are several factors which may influence your decision:\n", 110 | "\n", 111 | "- What is the model's complexity?\n", 112 | "- Is the model interpretable?\n", 113 | "- Is the model capable of handling different data-types?\n", 114 | "- Does the model return uncertainty estimates along with predictions?\n", 115 | "\n", 116 | "An in-depth evaluation of competing models in view of this and other criteria will elevate the quality of your submission and earn you a higher grade.\n" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "
\n", 124 | "

4. Parameter Optimisation

\n", 125 | "
" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "Irrespective of your choice, it is highly likely that your model will have one or more parameters that require tuning.\n", 133 | "There are several techniques for carrying out such a procedure, including cross-validation, Bayesian optimisation, and several others.\n", 134 | "As before, an analysis into which parameter tuning technique best suits your model is expected before proceeding with the optimisation of your model." 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "
\n", 142 | "

5. Model Evaluation

\n", 143 | "
" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "Some form of pre-evaluation will inevitably be required in the preceding sections in order to both select an appropriate model and configure its parameters appropriately.\n", 151 | "In this final section, you may evaluate other aspects of the model such as:\n", 152 | "\n", 153 | "- Assessing the running time of your model;\n", 154 | "- Determining whether some aspects can be parallelised;\n", 155 | "- Training the model with smaller subsets of the data.\n", 156 | "- etc." 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "
\n", 164 | " N.B. Please note that the items listed under each heading are neither exhaustive, nor are you expected to explore every given suggestion.\n", 165 | " Nonetheless, these should serve as a guideline for your work in both this and upcoming challenges.\n", 166 | " As always, you should use your intuition and understanding in order to decide which analysis best suits the assigned task.\n", 167 | "
" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "
\n", 175 | "

Submission Instructions

\n", 176 | "
\n", 177 | "
" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "- The goal of this challenge is to construct a model for predicting house prices;\n", 185 | "

\n", 186 | "\n", 187 | "- Your submission will have two components:\n", 188 | "\n", 189 | " 1. An HTML version of your notebook exploring the various modelling aspects described above;\n", 190 | " 2. A CSV file containing your final model's predictions on the given test data. \n", 191 | " This file should contain a header and have the following format:\n", 192 | " \n", 193 | " ```\n", 194 | " Id,SalePrice\n", 195 | " 1461,169000.1\n", 196 | " 1462,187724.1233\n", 197 | " 1463,175221\n", 198 | " etc.\n", 199 | " ```\n", 200 | " \n", 201 | " An example submission file has been provided in the data directory of the repository.\n", 202 | " A leaderboard for this challenge will be ranked using the root mean squared error between the logarithm of the predicted value and the logarithm of the observed sales price. \n", 203 | " Taking logs ensures that errors in predicting expensive houses and cheap houses will have a similar impact on the overall result;\n", 204 | "

\n", 205 | "- This exercise is due on 04/04/2019." 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "
\n", 213 | "

Dataset Description

\n", 214 | "
\n", 215 | "
" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "#### * Files\n", 223 | "\n", 224 | "* train.csv - The training dataset;\n", 225 | "* test.csv - The test dataset;\n", 226 | "* data_description.txt - Full description of each column.\n", 227 | "\n", 228 | "#### * Attributes\n", 229 | "\n", 230 | "A brief outline of the available attributes is given below:\n", 231 | "\n", 232 | "* SalePrice: The property's sale price in dollars. This is the target variable that your model is intended to predict;\n", 233 | "\n", 234 | "* MSSubClass: The building class;\n", 235 | "* MSZoning: The general zoning classification;\n", 236 | "* LotFrontage: Linear feet of street connected to property;\n", 237 | "* LotArea: Lot size in square feet;\n", 238 | "* Street: Type of road access;\n", 239 | "* Alley: Type of alley access;\n", 240 | "* LotShape: General shape of property;\n", 241 | "* LandContour: Flatness of the property;\n", 242 | "* Utilities: Type of utilities available;\n", 243 | "* LotConfig: Lot configuration;\n", 244 | "* LandSlope: Slope of property;\n", 245 | "* Neighborhood: Physical locations within Ames city limits;\n", 246 | "* Condition1: Proximity to main road or railroad;\n", 247 | "* Condition2: Proximity to main road or railroad (if a second is present);\n", 248 | "* BldgType: Type of dwelling;\n", 249 | "* HouseStyle: Style of dwelling;\n", 250 | "* OverallQual: Overall material and finish quality;\n", 251 | "* OverallCond: Overall condition rating;\n", 252 | "* YearBuilt: Original construction date;\n", 253 | "* YearRemodAdd: Remodel date;\n", 254 | "* RoofStyle: Type of roof;\n", 255 | "* RoofMatl: Roof material;\n", 256 | "* Exterior1st: Exterior covering on house;\n", 257 | "* Exterior2nd: Exterior covering on house (if more than one material);\n", 258 | "* MasVnrType: Masonry veneer type;\n", 259 | "* MasVnrArea: Masonry veneer area in square feet;\n", 260 | "* ExterQualv: Exterior material quality;\n", 261 | "* ExterCond: Present condition of the material on the exterior;\n", 262 | "* Foundation: Type of foundation;\n", 263 | "* BsmtQual: Height of the basement;\n", 264 | "* BsmtCond: General condition of the basement;\n", 265 | "* BsmtExposure: Walkout or garden level basement walls;\n", 266 | "* BsmtFinType1: Quality of basement finished area;\n", 267 | "* BsmtFinSF1: Type 1 finished square feet;\n", 268 | "* BsmtFinType2: Quality of second finished area (if present);\n", 269 | "* BsmtFinSF2: Type 2 finished square feet;\n", 270 | "* BsmtUnfSF: Unfinished square feet of basement area;\n", 271 | "* TotalBsmtSF: Total square feet of basement area;\n", 272 | "* Heating: Type of heating;\n", 273 | "* HeatingQC: Heating quality and condition;\n", 274 | "* CentralAir: Central air conditioning;\n", 275 | "* Electrical: Electrical system;\n", 276 | "* 1stFlrSF: First Floor square feet;\n", 277 | "* 2ndFlrSF: Second floor square feet;\n", 278 | "* LowQualFinSF: Low quality finished square feet (all floors);\n", 279 | "* GrLivArea: Above grade (ground) living area square feet;\n", 280 | "* BsmtFullBath: Basement full bathrooms;\n", 281 | "* BsmtHalfBath: Basement half bathrooms;\n", 282 | "* FullBath: Full bathrooms above grade;\n", 283 | "* HalfBath: Half baths above grade;\n", 284 | "* Bedroom: Number of bedrooms above basement level;\n", 285 | "* Kitchen: Number of kitchens;\n", 286 | "* KitchenQual: Kitchen quality;\n", 287 | "* TotRmsAbvGrd: Total rooms above grade (does not include bathrooms);\n", 288 | "* Functional: Home 
functionality rating;\n", 289 | "* Fireplaces: Number of fireplaces;\n", 290 | "* FireplaceQu: Fireplace quality;\n", 291 | "* GarageType: Garage location;\n", 292 | "* GarageYrBlt: Year garage was built;\n", 293 | "* GarageFinish: Interior finish of the garage;\n", 294 | "* GarageCars: Size of garage in car capacity;\n", 295 | "* GarageArea: Size of garage in square feet;\n", 296 | "* GarageQual: Garage quality;\n", 297 | "* GarageCond: Garage condition;\n", 298 | "* PavedDrive: Paved driveway;\n", 299 | "* WoodDeckSF: Wood deck area in square feet;\n", 300 | "* OpenPorchSF: Open porch area in square feet;\n", 301 | "* EnclosedPorch: Enclosed porch area in square feet;\n", 302 | "* 3SsnPorch: Three season porch area in square feet;\n", 303 | "* ScreenPorch: Screen porch area in square feet;\n", 304 | "* PoolArea: Pool area in square feet;\n", 305 | "* PoolQC: Pool quality;\n", 306 | "* Fence: Fence quality;\n", 307 | "* MiscFeature: Miscellaneous feature not covered in other categories;\n", 308 | "* MiscVal: Value (in dollars) of miscellaneous feature;\n", 309 | "* MoSold: Month sold;\n", 310 | "* YrSold: Year sold;\n", 311 | "* SaleType: Type of sale;\n", 312 | "* SaleCondition: Condition of sale.\n" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": null, 318 | "metadata": {}, 319 | "outputs": [], 320 | "source": [] 321 | } 322 | ], 323 | "metadata": { 324 | "kernelspec": { 325 | "display_name": "Python 3", 326 | "language": "python", 327 | "name": "python3" 328 | }, 329 | "language_info": { 330 | "codemirror_mode": { 331 | "name": "ipython", 332 | "version": 3 333 | }, 334 | "file_extension": ".py", 335 | "mimetype": "text/x-python", 336 | "name": "python", 337 | "nbconvert_exporter": "python", 338 | "pygments_lexer": "ipython3", 339 | "version": "3.7.0" 340 | } 341 | }, 342 | "nbformat": 4, 343 | "nbformat_minor": 1 344 | } 345 | -------------------------------------------------------------------------------- /Challenges/Plankton/plankton_challenge.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "

Algorithmic Machine Learning Challenge

\n", 8 | "

Plankton Image Classification

\n", 9 | "
" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "Plankton comprises all the organisms freely drifting with ocean currents. These life forms are a critically important piece of oceanic ecosystems, accounting for more than half the primary production on earth and nearly half the total carbon fixed in the global carbon cycle. They also form the foundation of aquatic food webs, including those of large, commercially important fisheries. Loss of plankton populations could result in ecological upheaval as well as negative societal impacts, particularly in indigenous cultures and the developing world. Plankton’s global significance makes their population levels an ideal measure of the health of the world’s oceans and ecosystems.\n", 17 | "\n", 18 | "Traditional methods for measuring and monitoring plankton populations are time consuming and cannot scale to the granularity or scope necessary for large-scale studies. Improved approaches are needed. One such approach is through the use of underwater imagery sensors. \n", 19 | "\n", 20 | "In this challenge, which was prepared in cooperation with the Laboratoire d’Océanographie de Villefranche, jointly run by Sorbonne Université and CNRS, plankton images were acquired in the bay of Villefranche, weekly since 2013 and manually engineered features were computed on each imaged object. \n", 21 | "\n", 22 | "This challenge aims at developing solid approaches to plankton image classification. We will compare methods based on carefully (but manually) engineered features, with “Deep Learning” methods in which features will be learned from image data alone.\n", 23 | "\n", 24 | "The purpose of this challenge is for you to learn about the commonly used paradigms when working with computer vision problems. This means you can choose one of the following paths:\n", 25 | "\n", 26 | "- Work directly with the provided images, e.g. using a (convolutional) neural network\n", 27 | "- Work with the supplied features extracted from the images (*native* or *skimage* or both of them)\n", 28 | "- Extract your own features from the provided images using a technique of your choice\n", 29 | "\n", 30 | "You will find a detailed description about the image data and the features at the end of this text.\n", 31 | "In any case, the choice of the classifier that you decide to work with strongly depends on the choice of features.\n", 32 | "\n", 33 | "Please bear in mind that the purpose of this challenge is not simply to find the best-performing model that was released on e.g. Kaggle for a similar problem. You should rather make sure to understand the dificulties that come with this computer vision task. Moreover, you should be able to justify your choice of features/model and be able to explain its advantages and disadvantages for the task." 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "# Overview\n", 41 | "
" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "Beyond simply producing a well-performing model for making predictions, in this challenge we would like you to start developing your skills as a machine learning scientist.\n", 49 | "In this regard, your notebook should be structured in such a way as to explore the five following tasks that are expected to be carried out whenever undertaking such a project.\n", 50 | "The description below each aspect should serve as a guide for your work, but you are strongly encouraged to also explore alternative options and directions. \n", 51 | "Thinking outside the box will always be rewarded in these challenges." 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "
\n", 59 | "

1. Data Exploration

\n", 60 | "
" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "The first broad component of your notebook should enable you to familiarise yourselves with the given data, an outline of which is given at the end of this challenge specification.\n", 68 | "\n", 69 | "What is new in this challenge is that you will be working with image data. Therefore, you should have a look at example images located in the *imgs.zip* file (see description below). If you decide to work with the native or the skimage features, make sure to understand them!\n", 70 | "\n", 71 | "Among others, this section should investigate:\n", 72 | "\n", 73 | "- Distribution of the different image dimensions (including the number of channels)\n", 74 | "- Distribution of the different labels that the images are assigned to\n", 75 | "\n", 76 | "The image labels are organized in a taxonomy. We will measure the final model performance for the classification into the *level2* categories. Make sure to understand the meaning of this label inside the taxonomy." 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "
\n", 84 | "

2. Data Pre-processing

\n", 85 | "
" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "The previous step should give you a better understanding of which pre-processing is required for the data based on your approach:\n", 93 | "\n", 94 | "- If you decide to work with the provided features, some data cleaning may be required to make full use of all the data.\n", 95 | "- If you decide to extract your own features from the images, you should explain your approach in this section.\n", 96 | "- If you decide to work directly with the images themselves, preprocessing the images may improve your classification results. In particular, if you work with a neural network the following should be of interest to you:\n", 97 | "\n", 98 | " - Due to the fully-connected layers (that usually come after the convolutional ones), the input needs to have a fixed dimension.\n", 99 | " - Data augmentation (image rotation, scaling, cropping, etc. of the existing images) can be used to increase the size of the training data set. This may improve performance especially when little data is available for a particular class.\n", 100 | " - Be aware of the computational cost! It might be worth rescaling the images to a smaller size!\n", 101 | "\n", 102 | " All of the operations above are usually realized using a dataloader. This means that you do not need to create a modified version of the dataset and save it to disk. Instead, the dataloader processes the data \"on the fly\" and in-memory before passing it to the network.\n", 103 | " \n", 104 | " NB: Although aligning image sizes is necessary to train CNNs, this will prevent your classifier from learning about different object sizes as a feature. Additional gains may be achieved when also taking object sizes into account." 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "
\n", 112 | "

3. Model Selection

\n", 113 | "
" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": {}, 119 | "source": [ 120 | "Perhaps the most important segment of this challenge involves the selection of a model that can successfully handle the given data and yield sensible predictions.\n", 121 | "Instead of focusing exclusively on your final chosen model, it is also important to share your thought process in this notebook by additionally describing alternative candidate models.\n", 122 | "\n", 123 | "The choice of your model is closely connected to the way you preprocessed the input data.\n", 124 | "\n", 125 | "Furthermore, there are other factors which may influence your decision:\n", 126 | "\n", 127 | "- What is the model's complexity?\n", 128 | "- Is the model interpretable?\n", 129 | "- Is the model capable of handling different data-types?\n", 130 | "- Does the model return uncertainty estimates along with predictions?" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "
\n", 138 | "

4. Parameter Optimisation

\n", 139 | "
" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "Irrespective of your choice, it is highly likely that your model will have one or more (hyper-)parameters that require tuning.\n", 147 | "There are several techniques for carrying out such a procedure, including cross-validation, Bayesian optimisation, and several others.\n", 148 | "As before, an analysis into which parameter tuning technique best suits your model is expected before proceeding with the optimisation of your model.\n", 149 | "\n", 150 | "If you use a neural network, the optimization of hyperparameters (learning rate, weight decay, etc.) can be a very time-consuming process. In this case, your may decide to carry out smaller experiments and to justify your choice on these preliminary tests." 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": {}, 156 | "source": [ 157 | "
\n", 158 | "

5. Model Evaluation

\n", 159 | "
" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "Some form of pre-evaluation will inevitably be required in the preceding sections in order to both select an appropriate model and configure its parameters appropriately.\n", 167 | "In this final section, you may evaluate other aspects of the model such as:\n", 168 | "\n", 169 | "- Assessing the running time of your model;\n", 170 | "- Determining whether some aspects can be parallelised;\n", 171 | "- Training the model with smaller subsets of the data.\n", 172 | "- etc.\n", 173 | "\n", 174 | "For the evaluation of the classification results, you should use the F1 measure (see Submission Instructions). Here the focus should be on level2 classification. A classification evaluation for other labels is optional.\n", 175 | "\n", 176 | "Please note that you are responsible for creating a sensible train/validation/test split. There is no predefined held-out test data." 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "
\n", 184 | " N.B. Please note that the items listed under each heading are neither exhaustive, nor are you expected to explore every given suggestion.\n", 185 | " Nonetheless, these should serve as a guideline for your work in both this and upcoming challenges.\n", 186 | " As always, you should use your intuition and understanding in order to decide which analysis best suits the assigned task.\n", 187 | "
" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "
\n", 195 | "

Submission Instructions

\n", 196 | "
\n", 197 | "
" 198 | ] 199 | }, 200 | { 201 | "cell_type": "markdown", 202 | "metadata": {}, 203 | "source": [ 204 | "- The goal of this challenge is to construct a model for predicting Plankton (taxonomy level 2) classes.\n", 205 | "\n", 206 | "- Your submission will be the HTML version of your notebook exploring the various modelling aspects described above.\n", 207 | "\n", 208 | "- At the end of the notebook you should indicate your final evaluation score on a held-out test set. As an evaluation metric you should use the F1 score with the *average=macro* option as it is provided by the scikit-learn library. See the following link for more information:\n", 209 | " \n", 210 | "https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html" 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "metadata": {}, 216 | "source": [ 217 | "
\n", 218 | "

Dataset Description

\n", 219 | "
\n", 220 | "
" 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "#### * Location of the Dataset on zoe\n", 228 | "The data for this challenge is located at: `/mnt/datasets/plankton/flowcam`\n", 229 | "\n", 230 | "#### * Hierachical Taxonomy Tree for Labels \n", 231 | "\n", 232 | "Each object is represented by a single image and is identified by a unique integer number. It has a name associated to it which is integrated in a hierarchical taxonomic tree. The identifications are gathered from different projects, classified by different people in different contexts, so they often target different taxonomic levels. For example, let us say we classify items of clothing along the following tree\n", 233 | "\n", 234 | " top\n", 235 | " shirt\n", 236 | " long sleeves\n", 237 | " short sleeves\n", 238 | " sweater\n", 239 | " hooded\n", 240 | " no hood\n", 241 | " bottom\n", 242 | " pants\n", 243 | " jeans\n", 244 | " other\n", 245 | " shorts\n", 246 | " \n", 247 | "In a first project, images are classified to the finest level possible, but it may be the case that, on some pictures, it is impossible to determine whether a sweater has a hood or not, in which case it is simply classified as `sweater`. In the second project, the operator classified tops as `shirt` or `sweater` only, and bottoms to the finest level. In a third project, the operator only separated tops from bottoms. In such a context, the original names in the database cannot be used directly because, for example `sweater` will contain images that are impossible to determine as `hooded` or `no hood` *as well as* `hooded` and `no hood` images that were simply not classified further. If all three classes (`sweater`, `hooded`, and `no hood`) are included in the training set, it will likely confuse the classifier. For this reason, we define different target taxonomic levels:\n", 248 | "\n", 249 | "- `level1` is the finest taxonomic level possible. In the example above, we would include `hooded` and `no hood` but discard all images in `sweater` to avoid confusion; and proceed in the same manner for other classes.\n", 250 | "\n", 251 | "- `level2` is a grouping of underlying levels. In the example above, it would include `shirt` (which contains all images in `shirt`, `long sleeves`, and `short sleeves`), `sweater` (which, similarly would include this class and all its children), `pants` (including children), and `shorts`. So typically, `level2` contains more images (less discarding), sorted within fewer classes than `level1`, and may therefore be an easier classification problem.\n", 252 | "\n", 253 | "- `level3` is an even broader grouping. Here it would be `top` vs `bottom`\n", 254 | "\n", 255 | "- etc.\n", 256 | "\n", 257 | "In the Plankton Image dataset, the objects will be categorised based on a pre-defined 'level1' and 'level2'. You can opt to work on one of them, but we recommend you to work on `level2` because it is an easier classification problem. \n", 258 | "\n", 259 | "#### * Data Structure\n", 260 | "\n", 261 | " /mnt/datasets/plankton/flowcam/\n", 262 | " meta.csv\n", 263 | " taxo.csv\n", 264 | " features_native.csv.gz\n", 265 | " features_skimage.csv.gz\n", 266 | " imgs.zip\n", 267 | "\n", 268 | "* `meta.csv` contains the index of images and their corresponding labels\n", 269 | "* `taxo.csv` defines the taxonomic tree and its potential groupings at various level. Note that, the information is also available in `meta.csv`. 
Therefore, the information in `taxo.csv` is probably useless, but at least it gives you a global view about taxonomy tree\n", 270 | "* `features_native.csv.gz` contain the morphological handcrafted features computed by ZooProcess. In fact, ZooProcess generates the region of interests (ROI) around each individual object from a original image of Plankton. In addition, it also computes a set of associated features measured on the object. These features are the ones contained in `features_native.csv.gz`\n", 271 | "* `features_skimage.csv.gz` contains the morphological features recomputed with skimage.measure.regionprops on the ROIs produced by ZooProcess.\n", 272 | "* `imgs.zip` contains a post-processed version of the original images. Images are named by `objid`.jpg\n", 273 | "\n", 274 | "#### * Attributes in meta.csv\n", 275 | "\n", 276 | "The file contains the image identifiers (objid) as well as the labels assigned to the images by human operators. Those are defined with various levels of precision:\n", 277 | "\n", 278 | "* unique_name: raw labels from operators\n", 279 | "* level1: cleaned, most detailed labels\n", 280 | "* level2: regrouped (coarser) labels\n", 281 | "* lineage: full taxonomic lineage of the class\n", 282 | "\n", 283 | "Some labels may be missing (coded ‘NA’) at a given level, meaning that the corresponding objects should be discarded for the classification at this level.\n", 284 | "\n", 285 | "#### * imgs.zip\n", 286 | "\n", 287 | "This zip archive contains an *imgs* folder that contains all the images in .jpg format. Do not extract this folder to disk! Instead you will be loading the images to memory. See the code below for a quick how-to:" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": null, 293 | "metadata": {}, 294 | "outputs": [], 295 | "source": [ 296 | "import zipfile\n", 297 | "from io import BytesIO\n", 298 | "from PIL import Image\n", 299 | "\n", 300 | "def extract_zip_to_memory(input_zip):\n", 301 | " '''\n", 302 | " This function extracts the images stored inside the given zip file.\n", 303 | " It stores the result in a python dictionary.\n", 304 | " \n", 305 | " input_zip (string): path to the zip file\n", 306 | " \n", 307 | " returns (dict): {filename (string): image_file (bytes)}\n", 308 | " '''\n", 309 | " input_zip=zipfile.ZipFile(input_zip)\n", 310 | " return {name: BytesIO(input_zip.read(name)) for name in input_zip.namelist() if name.endswith('.jpg')}\n", 311 | "\n", 312 | "\n", 313 | "# img_files = extract_zip_to_memory(\"imgs.zip\")\n", 314 | "\n", 315 | "# Display an example image \n", 316 | "# Image.open(img_files['imgs/32738710.jpg'])\n", 317 | "\n", 318 | "# Load the image as a numpy array:\n", 319 | "# np_arr = np.array(Image.open(img_files['imgs/32738710.jpg']))\n", 320 | "\n", 321 | "# Be aware that the dictionary will occupy roughly 2GB of computer memory!\n", 322 | "# To free this memory again, run:\n", 323 | "# del img_files" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [ 330 | "#### * Attributes in features_native.csv.gz\n", 331 | "A brief outline of the availabel attributes in `features_native.csv.gz` which you can use is given below:\n", 332 | "\n", 333 | "* objid: same as in `meta.csv`\n", 334 | "* area: area of ROI\n", 335 | "* meanimagegrey:\n", 336 | "* mean: mean grey\n", 337 | "* stddev: standard deviation of greys\n", 338 | "* min: minimum grey\n", 339 | "* perim.: perimeter of ROI\n", 340 | "* width, height: dimensions of ROI\n", 341 | "* major, 
minor: length of major,minor axis of the best fitting ellipse\n", 342 | "* angle: \n", 343 | "* circ.: circularity or shape factor which can be computed by 4pi(area/perim.^2)\n", 344 | "* feret: maximal feret diameter\n", 345 | "* intden: integrated density: mean*area\n", 346 | "* median: median grey\n", 347 | "* skew, kurt: skewness,kurtosis of the histogram of greys\n", 348 | "* %area: proportion of the image corresponding to the object\n", 349 | "* area_exc: area excluding holes\n", 350 | "* fractal: fractal dimension of the perimeter\n", 351 | "* skelarea: area of the one-pixel wide skeleton of the image ???\n", 352 | "* slope: slope of the cumulated histogram of greys\n", 353 | "* histcum1, 2, 3: grey level at quantiles 0.25, 0.5, 0.75 of the histogram of greys\n", 354 | "* nb1, 2, 3: number of objects after thresholding at the grey levels above\n", 355 | "* symetrieh, symetriev: index of horizontal,vertical symmetry\n", 356 | "* symetriehc, symetrievc: same but after thresholding at level histcum1\n", 357 | "* convperim, convarea: perimeter,area of the convex hull of the object\n", 358 | "* fcons: contrast\n", 359 | "* thickr: thickness ratio: maximum thickness/mean thickness\n", 360 | "* esd:\n", 361 | "* elongation: elongation index: major/minor\n", 362 | "* range: range of greys: max-min\n", 363 | "* meanpos: relative position of the mean grey: (max-mean)/range\n", 364 | "* centroids:\n", 365 | "* cv: coefficient of variation of greys: 100*(stddev/mean)\n", 366 | "* sr: index of variation of greys: 100*(stddev/range)\n", 367 | "* perimareaexc:\n", 368 | "* feretareaexc:\n", 369 | "* perimferet: index of the relative complexity of the perimeter: perim/feret\n", 370 | "* perimmajor: index of the relative complexity of the perimeter: perim/major\n", 371 | "* circex:\n", 372 | "* cdexc:\n", 373 | "* kurt_mean:\n", 374 | "* skew_mean:\n", 375 | "* convperim_perim:\n", 376 | "* convarea_area:\n", 377 | "* symetrieh_area:\n", 378 | "* symetriev_area:\n", 379 | "* nb1_area:\n", 380 | "* nb2_area:\n", 381 | "* nb3_area:\n", 382 | "* nb1_range:\n", 383 | "* nb2_range:\n", 384 | "* nb3_range:\n", 385 | "* median_mean:\n", 386 | "* median_mean_range:\n", 387 | "* skeleton_area:\n", 388 | "\n", 389 | "#### * Attributes in features_skimage.csv.gz\n", 390 | "Table of morphological features recomputed with skimage.measure.regionprops on the ROIs produced by ZooProcess. See http://scikit-image.org/docs/dev/api/skimage.measure.html#skimage.measure.regionprops for documentation." 
391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "execution_count": null, 396 | "metadata": {}, 397 | "outputs": [], 398 | "source": [] 399 | } 400 | ], 401 | "metadata": { 402 | "kernelspec": { 403 | "display_name": "Python 3", 404 | "language": "python", 405 | "name": "python3" 406 | }, 407 | "language_info": { 408 | "codemirror_mode": { 409 | "name": "ipython", 410 | "version": 3 411 | }, 412 | "file_extension": ".py", 413 | "mimetype": "text/x-python", 414 | "name": "python", 415 | "nbconvert_exporter": "python", 416 | "pygments_lexer": "ipython3", 417 | "version": "3.6.8" 418 | } 419 | }, 420 | "nbformat": 4, 421 | "nbformat_minor": 2 422 | } 423 | -------------------------------------------------------------------------------- /Notebooks/Intro-public.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "source": [ 5 | "# Revision on JupyterLab, Python, Pandas and Matplotlib (Spring 2019)\n", 6 | "In this introductory laboratory, we expect students to:\n", 7 | "\n", 8 | "1. Acquire basic knowledge about Python and Matplotlib\n", 9 | "2. Gain familiarity with Juypter Notebooks\n", 10 | "\n", 11 | "\n", 12 | "To achieve such goals, we will go through the following steps:\n", 13 | "\n", 14 | "1. In section 1, **IPython** and **Jupyter Notebooks** are introduced to help students understand the environment used to work on projects, including those that are part of the CLOUDS course.\n", 15 | "\n", 16 | "2. In section 2, we briefly overview **Python** and its syntax. In addition, we cover **Matplotlib**, a very powerful library to plot figures in Python. Finally, we introduce **Pandas**, a python library that is very helpful when manipulating data." 17 | ], 18 | "metadata": {}, 19 | "cell_type": "markdown" 20 | }, 21 | { 22 | "source": [ 23 | "# 1. Python, IPython and Jupyter Notebooks\n", 24 | "\n", 25 | "**Python** is a high-level, dynamic, object-oriented programming language. It is a general purpose language, which is designed to be easy to use and easy to read.\n", 26 | "\n", 27 | "**IPython** (Interactive Python) is originally developed for Python. Now, it is a command shell for interactive computing supporting multiple programming languages. It offers rich media, shell syntax, tab completion, and history. IPython is based on an architecture that provides parallel and distributed computing. IPython enables parallel applications to be developed, executed, debugged and monitored interactively.\n", 28 | "\n", 29 | "**Jupyter Notebooks** are a web-based interactive computational environment for creating IPython notebooks. An IPython notebook is a JSON document containing an ordered list of input/output cells which can contain code, text, mathematics, plots and rich media. Notebooks make data analysis easier to perform, understand and reproduce. All laboratories in this course are prepared as Notebooks. As you can see, in this Notebook, we can put text, images, hyperlinks, source code... The Notebooks can be converted to a number of open standard output formats (HTML, HTML presentation slides, LaTeX, PDF, ReStructuredText, Markdown, Python) through `File` -> `Download As` in the web interface. In addition, Jupyter manages the notebooks' versions through a `checkpoint` mechanism. You can create checkpoint anytime via `File -> Save and Checkpoint`. \n", 30 | "\n", 31 | "**NOTE on Checkpointing:** in this course, we use a peculiar environment to work. 
We don't have a Notebook server: instead, we create on demand clusters with a Notebook front-end. Since your clusters are **ephemeral** (they are terminated after a predefined amount of time), checkpointing is of little use, for anything else than saving your notebook in your ephemeral environment. It is far better to download regularly your notebooks, and to push them to your git repository." 32 | ], 33 | "metadata": {}, 34 | "cell_type": "markdown" 35 | }, 36 | { 37 | "source": [ 38 | "## 1.1. Tab completion\n", 39 | "\n", 40 | "Tab completion is a convenient way to explore the structure of any object you're dealing with. Simply type object_name. to view the suggestion for object's attributes. Besides Python objects and keywords, tab completion also works on file and directory names." 41 | ], 42 | "metadata": {}, 43 | "cell_type": "markdown" 44 | }, 45 | { 46 | "source": [ 47 | "s = \"test function of tab completion\"\n", 48 | "\n", 49 | "# type s. to see the suggestions\n", 50 | "\n", 51 | "# Show your experiments working on a string. \n", 52 | "# Try splitting a string into its constituent words, and count the number of words.\n" 53 | ], 54 | "execution_count": null, 55 | "cell_type": "code", 56 | "metadata": { 57 | "collapsed": false 58 | }, 59 | "outputs": [] 60 | }, 61 | { 62 | "source": [ 63 | "## 1.2. System shell commands\n", 64 | "\n", 65 | "To run any command in the system shell, simply prefix it with `!`. For example:" 66 | ], 67 | "metadata": {}, 68 | "cell_type": "markdown" 69 | }, 70 | { 71 | "source": [ 72 | "# list all file and directories in the current folder\n", 73 | "!ls" 74 | ], 75 | "execution_count": null, 76 | "cell_type": "code", 77 | "metadata": { 78 | "collapsed": false 79 | }, 80 | "outputs": [] 81 | }, 82 | { 83 | "source": [ 84 | "## 1.3. Magic functions\n", 85 | "\n", 86 | "IPython has a set of predefined `magic functions` that you can call with a command line style syntax. There are two types of magics, line-oriented and cell-oriented. \n", 87 | "\n", 88 | "**Line magics** are prefixed with the `%` character and work much like OS command-line calls: they get as an argument the rest of the line, *where arguments are passed without parentheses or quotes*. \n", 89 | "\n", 90 | "**Cell magics** are prefixed with a double `%%`, and they are functions that get as an argument not only the rest of the line, but also the lines below it in a separate argument." 91 | ], 92 | "metadata": {}, 93 | "cell_type": "markdown" 94 | }, 95 | { 96 | "source": [ 97 | "%timeit range(1000)" 98 | ], 99 | "execution_count": null, 100 | "cell_type": "code", 101 | "metadata": { 102 | "collapsed": false 103 | }, 104 | "outputs": [] 105 | }, 106 | { 107 | "source": [ 108 | "%%timeit x = range(10000)\n", 109 | "max(x)" 110 | ], 111 | "execution_count": null, 112 | "cell_type": "code", 113 | "metadata": { 114 | "collapsed": false 115 | }, 116 | "outputs": [] 117 | }, 118 | { 119 | "source": [ 120 | "For more information, you can follow this [link](http://nbviewer.jupyter.org/github/ipython/ipython/blob/1.x/examples/notebooks/Cell%20Magics.ipynb)" 121 | ], 122 | "metadata": {}, 123 | "cell_type": "markdown" 124 | }, 125 | { 126 | "source": [ 127 | "## 1.4. Debugging\n", 128 | "\n", 129 | "Whenever an exception occurs, the call stack is printed out to help you to track down the true source of the problem. It is important to gain familiarity with the call stack, especially when using the PySpark API." 
130 | ], 131 | "metadata": {}, 132 | "cell_type": "markdown" 133 | }, 134 | { 135 | "source": [ 136 | "for i in [4,3,2,0]:\n", 137 | " print(5/i)" 138 | ], 139 | "execution_count": null, 140 | "cell_type": "code", 141 | "metadata": { 142 | "collapsed": false 143 | }, 144 | "outputs": [] 145 | }, 146 | { 147 | "source": [ 148 | "# 2. Python + Pandas + Matplotlib: A great environment for Data Science\n", 149 | "\n", 150 | "This section aims to help students gain a basic understanding of the python programming language and some of its libraries, including `Pandas` or `Matplotlib`. \n", 151 | "\n", 152 | "When working with a small dataset (one that can comfortably fit into a single machine), Pandas and Matplotlib, together with Python are valid alternatives to other popular tools such as R and Matlab. Using such libraries allows to inherit from the simple and clear Python syntax, achieve very good performance, enjoy superior memory management, error handling, and good package management \\[[1](http://ajminich.com/2013/06/22/9-reasons-to-switch-from-matlab-to-python/)\\].\n", 153 | "\n", 154 | "\n", 155 | "## 2.1. Python syntax\n", 156 | "\n", 157 | "(This section is for students who did not program in Python before. If you're familiar with Python, please move to the next section: 1.2. Numpy)\n", 158 | "\n", 159 | "When working with Python, the code seems to be simpler than (many) other languages. In this laboratory, we compare the Python syntax to that of Java - another very common language.\n", 160 | "\n", 161 | "```java\n", 162 | "// java syntax\n", 163 | "int i = 10;\n", 164 | "string s = \"advanced machine learning\";\n", 165 | "System.out.println(i);\n", 166 | "System.out.println(s);\n", 167 | "// you must not forget the semicolon at the end of each sentence\n", 168 | "```" 169 | ], 170 | "metadata": {}, 171 | "cell_type": "markdown" 172 | }, 173 | { 174 | "source": [ 175 | "# python syntax\n", 176 | "i = 10\n", 177 | "s = \"advanced machine learning\"\n", 178 | "print(i)\n", 179 | "print(s)\n", 180 | "# forget about the obligation of commas" 181 | ], 182 | "execution_count": null, 183 | "cell_type": "code", 184 | "metadata": { 185 | "collapsed": false 186 | }, 187 | "outputs": [] 188 | }, 189 | { 190 | "source": [ 191 | "### Indentation & If-else syntax\n", 192 | "In python, we don't use `{` and `}` to define blocks of codes: instead, we use indentation to do that. **The code within the same block must have the same indentation**. 
For example, in java, we write:\n", 193 | "```java\n", 194 | "string language = \"Python\";\n", 195 | "\n", 196 | "// the block is surrounded by { and }\n", 197 | "// the condition is in ( and )\n", 198 | "if (language == \"Python\") {\n", 199 | " int x = 1;\n", 200 | " x += 10;\n", 201 | " int y = 5; // a wrong indentation isn't problem\n", 202 | " y = x + y;\n", 203 | " System.out.println(x + y);\n", 204 | " \n", 205 | " // a statement is broken into two line\n", 206 | " x = y\n", 207 | " + y;\n", 208 | " \n", 209 | " // do some stuffs\n", 210 | "}\n", 211 | "else if (language == \"Java\") {\n", 212 | " // another block\n", 213 | "}\n", 214 | "else {\n", 215 | " // another block\n", 216 | "}\n", 217 | "```" 218 | ], 219 | "metadata": {}, 220 | "cell_type": "markdown" 221 | }, 222 | { 223 | "source": [ 224 | "language = \"Python\"\n", 225 | "if language == \"Python\":\n", 226 | " x = 10\n", 227 | " x += 10\n", 228 | " y = 5 # all statements in the same block must have the same indentation\n", 229 | " y = (\n", 230 | " x + y\n", 231 | " ) # statements can be on multiple lines, using ( )\n", 232 | " print (x \n", 233 | " + y)\n", 234 | " \n", 235 | " # statements can also be split on multiple lines by using \\ at the END of each line\n", 236 | " x = y \\\n", 237 | " + y\n", 238 | " \n", 239 | " # do some other stuffs\n", 240 | "elif language == \"Java\":\n", 241 | " # another block\n", 242 | " pass\n", 243 | "else:\n", 244 | " # another block\n", 245 | " pass" 246 | ], 247 | "execution_count": null, 248 | "cell_type": "code", 249 | "metadata": { 250 | "collapsed": false 251 | }, 252 | "outputs": [] 253 | }, 254 | { 255 | "source": [ 256 | "### Ternary conditional operator\n", 257 | "In python, we often see ternary conditional operator, which is used to assign a value to a variable based on some condition. For example, in java, we write:\n", 258 | "\n", 259 | "```java\n", 260 | "int x = 10;\n", 261 | "// if x > 10, assign y = 5, otherwise, y = 15\n", 262 | "int y = (x > 10) ? 5 : 15;\n", 263 | "\n", 264 | "int z;\n", 265 | "if (x > 10)\n", 266 | " z = 5; // it's not necessary to have { } when the block has only one statement\n", 267 | "else\n", 268 | " z = 15;\n", 269 | "```\n", 270 | "\n", 271 | "Of course, although we can easily write these lines of code in an `if else` block to get the same result, people prefer ternary conditional operator because of simplicity.\n", 272 | "\n", 273 | "In python, we write:" 274 | ], 275 | "metadata": {}, 276 | "cell_type": "markdown" 277 | }, 278 | { 279 | "source": [ 280 | "x = 10\n", 281 | "# a very natural way\n", 282 | "y = 5 if x > 10 else 15\n", 283 | "print(y)\n", 284 | "\n", 285 | "# another way\n", 286 | "y = x > 10 and 5 or 15\n", 287 | "print(y)" 288 | ], 289 | "execution_count": null, 290 | "cell_type": "code", 291 | "metadata": { 292 | "collapsed": false 293 | }, 294 | "outputs": [] 295 | }, 296 | { 297 | "source": [ 298 | "### Lists and For loops\n", 299 | "Another syntax that we should revisit is the `for loop`. 
In java, we can write:\n", 300 | "\n", 301 | "```java\n", 302 | "// init an array with 10 integer numbers\n", 303 | "int[] array = new int[]{1, 2, 3, 4, 5, 6, 7, 8, 9, 10};\n", 304 | "for (int i = 0; i < array.length; i++){\n", 305 | " // print the i-th element of array\n", 306 | " System.out.println(array[i]);\n", 307 | "}\n", 308 | "```\n", 309 | "\n", 310 | "In Python, instead of using an index to help indicating an element, we can access the element directly:" 311 | ], 312 | "metadata": {}, 313 | "cell_type": "markdown" 314 | }, 315 | { 316 | "source": [ 317 | "array = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n", 318 | "# Python has no built-in array data structure\n", 319 | "# instead, it uses \"list\" which is much more general \n", 320 | "# and can be used as a multidimensional array quite easily.\n", 321 | "for element in array:\n", 322 | " print(element)" 323 | ], 324 | "execution_count": null, 325 | "cell_type": "code", 326 | "metadata": { 327 | "collapsed": false 328 | }, 329 | "outputs": [] 330 | }, 331 | { 332 | "source": [ 333 | "As we can see, the code is very clean. If you need the index of each element, here's what you should do:" 334 | ], 335 | "metadata": {}, 336 | "cell_type": "markdown" 337 | }, 338 | { 339 | "source": [ 340 | "for (index, element) in enumerate(array):\n", 341 | " print(index, element)" 342 | ], 343 | "execution_count": null, 344 | "cell_type": "code", 345 | "metadata": { 346 | "collapsed": false 347 | }, 348 | "outputs": [] 349 | }, 350 | { 351 | "source": [ 352 | "Actually, Python has no built-in array data structure. It uses the `list` data structure, which is much more general and can be used as a multidimensional array quite easily. In addition, elements in a list can be retrieved in a very concise way. For example, we create a 2d-array with 4 rows. Each row has 3 elements." 353 | ], 354 | "metadata": {}, 355 | "cell_type": "markdown" 356 | }, 357 | { 358 | "source": [ 359 | "# 2-dimentions array with 4 rows, 3 columns\n", 360 | "twod_array = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]\n", 361 | "for index, row in enumerate(twod_array):\n", 362 | " print(\"row \", index, \":\", row)\n", 363 | "\n", 364 | "# print row 1 until row 3\n", 365 | "print(\"row 1 until row 3: \", twod_array[1:3])\n", 366 | "\n", 367 | "# all rows from row 2\n", 368 | "print(\"all rows from row 2: \", twod_array[2:])\n", 369 | "\n", 370 | "# all rows until row 2\n", 371 | "print(\"all rows until row 2:\", twod_array[:2])\n", 372 | "\n", 373 | "# all rows from the beginning with step of 2. \n", 374 | "print(\"all rows from the beginning with step of 2:\", twod_array[::2])" 375 | ], 376 | "execution_count": null, 377 | "cell_type": "code", 378 | "metadata": { 379 | "collapsed": false 380 | }, 381 | "outputs": [] 382 | }, 383 | { 384 | "source": [ 385 | "### Dictionaries\n", 386 | "Another useful data structure in Python is a `dictionary`, which we use to store (key, value) pairs. 
Here's some example usage of dictionaries:" 387 | ], 388 | "metadata": {}, 389 | "cell_type": "markdown" 390 | }, 391 | { 392 | "source": [ 393 | "d = {'key1': 'value1', 'key2': 'value2'} # Create a new dictionary with some data\n", 394 | "print(d['key1']) # Get an entry from a dictionary; prints \"value1\"\n", 395 | "print('key1' in d) # Check if a dictionary has a given key; prints \"True\"\n", 396 | "d['key3'] = 'value3' # Set an entry in a dictionary\n", 397 | "print(d['key3']) # Prints \"value3\"\n", 398 | "# print(d['key9']) # KeyError: 'key9' not a key of d\n", 399 | "print(d.get('key9', 'custom_default_value')) # Get an element with a default; prints \"custom_default_value\"\n", 400 | "print(d.get('key3', 'custom_default_value')) # Get an element with a default; prints \"value3\"\n", 401 | "del d['key3'] # Remove an element from a dictionary\n", 402 | "print(d.get('key3', 'custom_default_value')) # \"fish\" is no longer a key; prints \"custom_default_value\"\n" 403 | ], 404 | "execution_count": null, 405 | "cell_type": "code", 406 | "metadata": { 407 | "collapsed": false 408 | }, 409 | "outputs": [] 410 | }, 411 | { 412 | "source": [ 413 | "### Functions\n", 414 | "In Python, we can define a function by using keyword `def`." 415 | ], 416 | "metadata": {}, 417 | "cell_type": "markdown" 418 | }, 419 | { 420 | "source": [ 421 | "def square(x):\n", 422 | " return x*x\n", 423 | "\n", 424 | "print(square(5))" 425 | ], 426 | "execution_count": null, 427 | "cell_type": "code", 428 | "metadata": { 429 | "collapsed": false 430 | }, 431 | "outputs": [] 432 | }, 433 | { 434 | "source": [ 435 | "You can apply a function to each element of a list/array by using `lambda` function. For example, we want to square elements in a list:" 436 | ], 437 | "metadata": {}, 438 | "cell_type": "markdown" 439 | }, 440 | { 441 | "source": [ 442 | "array = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n", 443 | "\n", 444 | "# apply function \"square\" on each element of \"array\"\n", 445 | "print(list(map(lambda x: square(x), array)))\n", 446 | "\n", 447 | "# or using a for loop, and a list comprehension\n", 448 | "print([square(x) for x in array])\n", 449 | "\n", 450 | "print(\"orignal array:\", array)" 451 | ], 452 | "execution_count": null, 453 | "cell_type": "code", 454 | "metadata": { 455 | "collapsed": false 456 | }, 457 | "outputs": [] 458 | }, 459 | { 460 | "source": [ 461 | "These two above syntaxes are used very often. \n", 462 | "\n", 463 | "If you are not familiar with **list comprehensions**, follow this [link](http://python-3-patterns-idioms-test.readthedocs.io/en/latest/Comprehensions.html]).\n", 464 | "\n", 465 | "We can also put a function `B` inside a function `A` (that is, we can have nested functions). In that case, function `B` is only accessed inside function `A` (the scope that it's declared). 
For example:" 466 | ], 467 | "metadata": {}, 468 | "cell_type": "markdown" 469 | }, 470 | { 471 | "source": [ 472 | "# select only the prime number in array\n", 473 | "# and square them\n", 474 | "def filterAndSquarePrime(arr):\n", 475 | " \n", 476 | " # a very simple function to check a number is prime or not\n", 477 | " def checkPrime(number):\n", 478 | " for i in range(2, int(number/2)):\n", 479 | " if number % i == 0:\n", 480 | " return False\n", 481 | " return True\n", 482 | " \n", 483 | " primeNumbers = filter(lambda x: checkPrime(x), arr)\n", 484 | " return map(lambda x: square(x), primeNumbers)\n", 485 | "\n", 486 | "# we can not access checkPrime from here\n", 487 | "# checkPrime(5)\n", 488 | "\n", 489 | "result = filterAndSquarePrime(array)\n", 490 | "list(result)" 491 | ], 492 | "execution_count": null, 493 | "cell_type": "code", 494 | "metadata": { 495 | "collapsed": false 496 | }, 497 | "outputs": [] 498 | }, 499 | { 500 | "source": [ 501 | "### Importing modules, functions\n", 502 | "Modules in Python are packages of code. Putting code into modules helps increasing the reusability and maintainability.\n", 503 | "The modules can be nested.\n", 504 | "To import a module, we simple use syntax: `import `. Once it is imported, we can use any functions, classes inside it." 505 | ], 506 | "metadata": {}, 507 | "cell_type": "markdown" 508 | }, 509 | { 510 | "source": [ 511 | "# import module 'math' to uses functions for calculating\n", 512 | "import math\n", 513 | "\n", 514 | "# print the square root of 16\n", 515 | "print(math.sqrt(16))\n", 516 | "\n", 517 | "# we can create alias when import a module\n", 518 | "import numpy as np\n", 519 | "\n", 520 | "print(np.sqrt(16))" 521 | ], 522 | "execution_count": null, 523 | "cell_type": "code", 524 | "metadata": { 525 | "collapsed": false 526 | }, 527 | "outputs": [] 528 | }, 529 | { 530 | "source": [ 531 | "Sometimes, you only need to import some functions inside a module to avoid loading the whole module into memory. To do that, we can use syntax: `from import `" 532 | ], 533 | "metadata": {}, 534 | "cell_type": "markdown" 535 | }, 536 | { 537 | "source": [ 538 | "# only import function 'sin' in package 'math'\n", 539 | "from math import sin\n", 540 | "\n", 541 | "# use the function\n", 542 | "print(sin(60))" 543 | ], 544 | "execution_count": null, 545 | "cell_type": "code", 546 | "metadata": { 547 | "collapsed": false 548 | }, 549 | "outputs": [] 550 | }, 551 | { 552 | "source": [ 553 | "That's quite enough for Python. Now, let's practice a little bit." 554 | ], 555 | "metadata": {}, 556 | "cell_type": "markdown" 557 | }, 558 | { 559 | "source": [ 560 | "### Question 1\n", 561 | "#### Question 1.1\n", 562 | "
\n", 563 | "Write a function `checkSquareNumber` to check if a integer number is a square number or not. For example, 16 and 9 are square numbers. 15 isn't square number.\n", 564 | "Requirements:\n", 565 | "\n", 566 | "- Input: an integer number\n", 567 | "\n", 568 | "- Output: `True` or `False`\n", 569 | "\n", 570 | "HINT: If the square root of a number is an integer number, it is a square number.\n", 571 | "
" 572 | ], 573 | "metadata": {}, 574 | "cell_type": "markdown" 575 | }, 576 | { 577 | "source": [ 578 | "```python\n", 579 | "###################################################################\n", 580 | "#### TO COMPLETE #####\n", 581 | "###################################################################\n", 582 | "import math\n", 583 | "\n", 584 | "def checkSquareNumber(x):\n", 585 | " # calculate the square root of x\n", 586 | " # return True if square root is integer, \n", 587 | " # otherwise, return False\n", 588 | " return ...\n", 589 | "\n", 590 | "print(checkSquareNumber(16))\n", 591 | "print(checkSquareNumber(250))\n", 592 | "```" 593 | ], 594 | "metadata": {}, 595 | "cell_type": "markdown" 596 | }, 597 | { 598 | "source": [ 599 | "#### Question 1.2\n", 600 | "
\n", 601 | "A list `list_numbers` which contains the numbers from 1 to 9999 can be constructed from: \n", 602 | "\n", 603 | "```python\n", 604 | "list_numbers = range(0, 10000)\n", 605 | "```\n", 606 | "\n", 607 | "Extract the square numbers in `list_numbers` using function `checkSquareNumber` from question 1.1. How many elements in the extracted list ?\n", 608 | "
" 609 | ], 610 | "metadata": {}, 611 | "cell_type": "markdown" 612 | }, 613 | { 614 | "source": [ 615 | "```python\n", 616 | "###################################################################\n", 617 | "#### TO COMPLETE #####\n", 618 | "###################################################################\n", 619 | "\n", 620 | "list_numbers = ...\n", 621 | "square_numbers = # try to use the filter method\n", 622 | "print(square_numbers)\n", 623 | "print(len(square_numbers))\n", 624 | "```" 625 | ], 626 | "metadata": {}, 627 | "cell_type": "markdown" 628 | }, 629 | { 630 | "source": [ 631 | "#### Question 1.3\n", 632 | "
\n", 633 | "Using array slicing, select the elements of the list square_numbers, whose index is from 5 to 20 (zero-based index).\n", 634 | "
" 635 | ], 636 | "metadata": {}, 637 | "cell_type": "markdown" 638 | }, 639 | { 640 | "source": [ 641 | "```python\n", 642 | "###################################################################\n", 643 | "#### TO COMPLETE #####\n", 644 | "###################################################################\n", 645 | "\n", 646 | "print(square_numbers[...])\n", 647 | "```" 648 | ], 649 | "metadata": {}, 650 | "cell_type": "markdown" 651 | }, 652 | { 653 | "source": [ 654 | "Next, we will take a quick look on Numpy - a powerful module of Python." 655 | ], 656 | "metadata": {}, 657 | "cell_type": "markdown" 658 | }, 659 | { 660 | "source": [ 661 | "## 2.2. Numpy\n", 662 | "Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays.\n", 663 | "### 2.2.1. Array\n", 664 | "A numpy array is a grid of values, all of **the same type**, and is indexed by a tuple of nonnegative integers. Thanks to the same type property, Numpy has the benefits of [locality of reference](https://en.wikipedia.org/wiki/Locality_of_reference). Besides, many other Numpy operations are implemented in C, avoiding the general cost of loops in Python, pointer indirection and per-element dynamic type checking. So, the speed of Numpy is often faster than using built-in datastructure of Python. When working with massive data with computationally expensive tasks, you should consider to use Numpy. \n", 665 | "\n", 666 | "The number of dimensions is the `rank` of the array; the `shape` of an array is a tuple of integers giving the size of the array along each dimension.\n", 667 | "\n", 668 | "We can initialize numpy arrays from nested Python lists, and access elements using square brackets:" 669 | ], 670 | "metadata": {}, 671 | "cell_type": "markdown" 672 | }, 673 | { 674 | "source": [ 675 | "import numpy as np\n", 676 | "\n", 677 | "# Create a rank 1 array\n", 678 | "rank1_array = np.array([1, 2, 3])\n", 679 | "print(\"type of rank1_array:\", type(rank1_array))\n", 680 | "print(\"shape of rank1_array:\", rank1_array.shape)\n", 681 | "print(\"elements in rank1_array:\", rank1_array[0], rank1_array[1], rank1_array[2])\n", 682 | "\n", 683 | "# Create a rank 2 array\n", 684 | "rank2_array = np.array([[1,2,3],[4,5,6]])\n", 685 | "print(\"shape of rank2_array:\", rank2_array.shape)\n", 686 | "print(rank2_array[0, 0], rank2_array[0, 1], rank2_array[1, 0])" 687 | ], 688 | "execution_count": null, 689 | "cell_type": "code", 690 | "metadata": { 691 | "collapsed": false 692 | }, 693 | "outputs": [] 694 | }, 695 | { 696 | "source": [ 697 | "### 2.2.2. Array slicing\n", 698 | "Similar to Python lists, numpy arrays can be sliced. The different thing is that you must specify a slice for each dimension of the array because arrays may be multidimensional." 
699 | ], 700 | "metadata": {}, 701 | "cell_type": "markdown" 702 | }, 703 | { 704 | "source": [ 705 | "m_array = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])\n", 706 | "\n", 707 | "# Use slicing to pull out the subarray consisting of the first 2 rows\n", 708 | "# and columns 1 and 2\n", 709 | "b = m_array[:2, 1:3]\n", 710 | "print(b)\n", 711 | "\n", 712 | "# we can only use this syntax with numpy array, not python list\n", 713 | "print(\"value at row 0, column 1:\", m_array[0, 1])\n", 714 | "\n", 715 | "# Rank 1 view of the second row of m_array \n", 716 | "print(\"the second row of m_array:\", m_array[1, :])\n", 717 | "\n", 718 | "# print element at position (0,2) and (1,3)\n", 719 | "print(m_array[[0,1], [2,3]])" 720 | ], 721 | "execution_count": null, 722 | "cell_type": "code", 723 | "metadata": { 724 | "collapsed": false 725 | }, 726 | "outputs": [] 727 | }, 728 | { 729 | "source": [ 730 | "### 2.2.3. Boolean array indexing\n", 731 | "We can use boolean array indexing to check whether each element in the array satisfies a condition or use it to do filtering." 732 | ], 733 | "metadata": {}, 734 | "cell_type": "markdown" 735 | }, 736 | { 737 | "source": [ 738 | "m_array = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])\n", 739 | "\n", 740 | "# Find the elements of a that are bigger than 2\n", 741 | "# this returns a numpy array of Booleans of the same\n", 742 | "# shape as m_array, where each value of bool_idx tells\n", 743 | "# whether that element of a is > 3 or not\n", 744 | "bool_idx = (m_array > 3)\n", 745 | "print(bool_idx , \"\\n\")\n", 746 | "\n", 747 | "# We use boolean array indexing to construct a rank 1 array\n", 748 | "# consisting of the elements of a corresponding to the True values\n", 749 | "# of bool_idx\n", 750 | "print(m_array[bool_idx], \"\\n\")\n", 751 | "\n", 752 | "# We can combine two statements\n", 753 | "print(m_array[m_array > 3], \"\\n\")\n", 754 | "\n", 755 | "# select elements with multiple conditions\n", 756 | "print(m_array[(m_array > 3) & (m_array % 2 == 0)])\n" 757 | ], 758 | "execution_count": null, 759 | "cell_type": "code", 760 | "metadata": { 761 | "collapsed": false 762 | }, 763 | "outputs": [] 764 | }, 765 | { 766 | "source": [ 767 | "### 2.2.4. Datatypes\n", 768 | "Remember that the elements in a numpy array have the same type. When constructing arrays, Numpy tries to guess a datatype when you create an array However, we can specify the datatype explicitly via an optional argument." 769 | ], 770 | "metadata": {}, 771 | "cell_type": "markdown" 772 | }, 773 | { 774 | "source": [ 775 | "# let Numpy guess the datatype\n", 776 | "x1 = np.array([1, 2])\n", 777 | "print(x1.dtype)\n", 778 | "\n", 779 | "# force the datatype be float64\n", 780 | "x2 = np.array([1, 2], dtype=np.float64)\n", 781 | "print(x2.dtype)" 782 | ], 783 | "execution_count": null, 784 | "cell_type": "code", 785 | "metadata": { 786 | "collapsed": false 787 | }, 788 | "outputs": [] 789 | }, 790 | { 791 | "source": [ 792 | "### 2.2.5. Array math\n", 793 | "Similar to Matlab or R, in Numpy, basic mathematical functions operate elementwise on arrays, and are available both as operator overloads and as functions in the numpy module." 
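"\n",
"A small side note (assuming Python 3.5+ and a reasonably recent NumPy): the `@` operator is a shorthand for matrix multiplication, equivalent to the `dot` calls used in the examples below.\n",
"```python\n",
"import numpy as np\n",
"\n",
"x = np.array([[1., 2.], [3., 4.]])\n",
"y = np.array([[5., 6.], [7., 8.]])\n",
"\n",
"print(x @ y)                         # matrix product\n",
"print(np.allclose(x @ y, x.dot(y)))  # True: same result as dot\n",
"```"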
794 | ], 795 | "metadata": {}, 796 | "cell_type": "markdown" 797 | }, 798 | { 799 | "source": [ 800 | "x = np.array([[1,2],[3,4]], dtype=np.float64)\n", 801 | "y = np.array([[5,6],[7,8]], dtype=np.float64)\n", 802 | "# mathematical function is used as operator\n", 803 | "print(\"x + y =\", x + y, \"\\n\")\n", 804 | "\n", 805 | "# mathematical function is used as function\n", 806 | "print(\"np.add(x, y)=\", np.add(x, y), \"\\n\")\n", 807 | "\n", 808 | "# Unlike MATLAB, * is elementwise multiplication\n", 809 | "# not matrix multiplication\n", 810 | "print(\"x * y =\", x * y , \"\\n\")\n", 811 | "print(\"np.multiply(x, y)=\", np.multiply(x, y), \"\\n\")\n", 812 | "print(\"x*2=\", x*2, \"\\n\")\n", 813 | "\n", 814 | "# to multiply two matrices, we use dot function\n", 815 | "print(\"x.dot(y)=\", x.dot(y), \"\\n\")\n", 816 | "print(\"np.dot(x, y)=\", np.dot(x, y), \"\\n\")\n", 817 | "\n", 818 | "# Elementwise square root\n", 819 | "print(\"np.sqrt(x)=\", np.sqrt(x), \"\\n\")" 820 | ], 821 | "execution_count": null, 822 | "cell_type": "code", 823 | "metadata": { 824 | "collapsed": false 825 | }, 826 | "outputs": [] 827 | }, 828 | { 829 | "source": [ 830 | "Note that unlike MATLAB, `*` is elementwise multiplication, not matrix multiplication. We instead use the `dot` function to compute inner products of vectors, to multiply a vector by a matrix, and to multiply matrices. In what follows, we work on a few more examples to reiterate the concept." 831 | ], 832 | "metadata": {}, 833 | "cell_type": "markdown" 834 | }, 835 | { 836 | "source": [ 837 | "# declare two vectors\n", 838 | "v = np.array([9,10])\n", 839 | "w = np.array([11, 12])\n", 840 | "\n", 841 | "# Inner product of vectors\n", 842 | "print(\"v.dot(w)=\", v.dot(w))\n", 843 | "print(\"np.dot(v, w)=\", np.dot(v, w))\n", 844 | "\n", 845 | "# Matrix / vector product\n", 846 | "print(\"x.dot(v)=\", x.dot(v))\n", 847 | "print(\"np.dot(x, v)=\", np.dot(x, v))\n", 848 | "\n", 849 | "# Matrix / matrix product\n", 850 | "print(\"x.dot(y)=\", x.dot(y))\n", 851 | "print(\"np.dot(x, y)=\", np.dot(x, y))" 852 | ], 853 | "execution_count": null, 854 | "cell_type": "code", 855 | "metadata": { 856 | "collapsed": false 857 | }, 858 | "outputs": [] 859 | }, 860 | { 861 | "source": [ 862 | "Additionally, we can do other aggregation computations on arrays such as `sum`, `nansum`, or `T`." 863 | ], 864 | "metadata": {}, 865 | "cell_type": "markdown" 866 | }, 867 | { 868 | "source": [ 869 | "x = np.array([[1,2], [3,4]])\n", 870 | "\n", 871 | "# Compute sum of all elements\n", 872 | "print(np.sum(x))\n", 873 | "\n", 874 | "# Compute sum of each column\n", 875 | "print(np.sum(x, axis=0))\n", 876 | "\n", 877 | "# Compute sum of each row\n", 878 | "print(np.sum(x, axis=1))\n", 879 | "\n", 880 | "# transpose the matrix\n", 881 | "print(x.T)\n", 882 | "\n", 883 | "# Note that taking the transpose of a rank 1 array does nothing:\n", 884 | "v = np.array([1,2,3])\n", 885 | "print(v.T) # Prints \"[1 2 3]\"" 886 | ], 887 | "execution_count": null, 888 | "cell_type": "code", 889 | "metadata": { 890 | "collapsed": false 891 | }, 892 | "outputs": [] 893 | }, 894 | { 895 | "source": [ 896 | "### Question 2\n", 897 | "\n", 898 | "Given a 2D array:\n", 899 | "\n", 900 | "```\n", 901 | " 1 2 3 4\n", 902 | " 5 6 7 8 \n", 903 | " 9 10 11 12\n", 904 | "13 14 15 16\n", 905 | "```\n", 906 | "\n", 907 | "\n", 908 | "#### Question 2.1\n", 909 | "
\n", 910 | "Print the all odd numbers in this array using `Boolean array indexing`.\n", 911 | "
" 912 | ], 913 | "metadata": {}, 914 | "cell_type": "markdown" 915 | }, 916 | { 917 | "source": [ 918 | "```python\n", 919 | "###################################################################\n", 920 | "#### TO COMPLETE #####\n", 921 | "###################################################################\n", 922 | "\n", 923 | "array_numbers = np.array([\n", 924 | " [1, 2, 3, 4],\n", 925 | " [5, 6, 7, 8],\n", 926 | " [9, 10, 11, 12],\n", 927 | " [13, 14, 15, 16]\n", 928 | " ])\n", 929 | "\n", 930 | "print(...)\n", 931 | "```" 932 | ], 933 | "metadata": {}, 934 | "cell_type": "markdown" 935 | }, 936 | { 937 | "source": [ 938 | "#### Question 2.2\n", 939 | "
\n", 940 | "Extract the second row and the third column in this array using `array slicing`.\n", 941 | "
" 942 | ], 943 | "metadata": {}, 944 | "cell_type": "markdown" 945 | }, 946 | { 947 | "source": [ 948 | "```python\n", 949 | "###################################################################\n", 950 | "#### TO COMPLETE #####\n", 951 | "###################################################################\n", 952 | "\n", 953 | "print(array_numbers[...])\n", 954 | "print(array_numbers[...])\n", 955 | "```" 956 | ], 957 | "metadata": {}, 958 | "cell_type": "markdown" 959 | }, 960 | { 961 | "source": [ 962 | "#### Question 2.3\n", 963 | "
\n", 964 | "Calculate the sum of diagonal elements.\n", 965 | "
" 966 | ], 967 | "metadata": {}, 968 | "cell_type": "markdown" 969 | }, 970 | { 971 | "source": [ 972 | "```python\n", 973 | "###################################################################\n", 974 | "#### TO COMPLETE #####\n", 975 | "###################################################################\n", 976 | "\n", 977 | "sum = 0\n", 978 | "for i in range(0, ...):\n", 979 | " sum += array_numbers...\n", 980 | " \n", 981 | "print(sum)\n", 982 | "```" 983 | ], 984 | "metadata": {}, 985 | "cell_type": "markdown" 986 | }, 987 | { 988 | "source": [ 989 | "#### Question 2.4\n", 990 | "
\n", 991 | "Print elementwise multiplication of the first row and the last row using numpy's functions.\n", 992 | "\n", 993 | "Print the inner product of these two rows.\n", 994 | "
" 995 | ], 996 | "metadata": {}, 997 | "cell_type": "markdown" 998 | }, 999 | { 1000 | "source": [ 1001 | "```python\n", 1002 | "###################################################################\n", 1003 | "#### TO COMPLETE #####\n", 1004 | "###################################################################\n", 1005 | "\n", 1006 | "print(...)\n", 1007 | "print(...)\n", 1008 | "```" 1009 | ], 1010 | "metadata": {}, 1011 | "cell_type": "markdown" 1012 | }, 1013 | { 1014 | "source": [ 1015 | "## 2.3. Matplotlib\n", 1016 | "\n", 1017 | "As its name indicates, Matplotlib is a plotting library. It provides both a very quick way to visualize data from Python and publication-quality figures in many formats. The most important function in matplotlib is `plot`, which allows you to plot 2D data." 1018 | ], 1019 | "metadata": {}, 1020 | "cell_type": "markdown" 1021 | }, 1022 | { 1023 | "source": [ 1024 | "%matplotlib inline\n", 1025 | "import matplotlib.pyplot as plt\n", 1026 | "plt.plot([1,2,3,4])\n", 1027 | "plt.ylabel('custom y label')\n", 1028 | "plt.show()" 1029 | ], 1030 | "execution_count": null, 1031 | "cell_type": "code", 1032 | "metadata": { 1033 | "collapsed": false 1034 | }, 1035 | "outputs": [] 1036 | }, 1037 | { 1038 | "source": [ 1039 | "In this case, we provide a single list or array to the `plot()` command, matplotlib assumes it is a sequence of y values, and automatically generates the x values for us. Since python ranges start with 0, the default x vector has the same length as y but starts with 0. Hence the x data are [0,1,2,3].\n", 1040 | "\n", 1041 | "In the next example, we plot figure with both x and y data. Besides, we want to draw dashed lines instead of the solid in default." 1042 | ], 1043 | "metadata": {}, 1044 | "cell_type": "markdown" 1045 | }, 1046 | { 1047 | "source": [ 1048 | "plt.plot([1, 2, 3, 4], [1, 4, 9, 16], 'r--')\n", 1049 | "plt.show()\n", 1050 | "\n", 1051 | "plt.bar([1, 2, 3, 4], [1, 4, 9, 16], align='center')\n", 1052 | "# labels of each column bar\n", 1053 | "x_labels = [\"Type 1\", \"Type 2\", \"Type 3\", \"Type 4\"]\n", 1054 | "# assign labels to the plot\n", 1055 | "plt.xticks([1, 2, 3, 4], x_labels)\n", 1056 | "\n", 1057 | "plt.show()" 1058 | ], 1059 | "execution_count": null, 1060 | "cell_type": "code", 1061 | "metadata": { 1062 | "collapsed": false 1063 | }, 1064 | "outputs": [] 1065 | }, 1066 | { 1067 | "source": [ 1068 | "If we want to merge two figures into a single one, subplot is the best way to do that. For example, we want to put two figures in a stack vertically, we should define a grid of plots with 2 rows and 1 column. Then, in each row, a single figure is plotted." 1069 | ], 1070 | "metadata": {}, 1071 | "cell_type": "markdown" 1072 | }, 1073 | { 1074 | "source": [ 1075 | "# Set up a subplot grid that has height 2 and width 1,\n", 1076 | "# and set the first such subplot as active.\n", 1077 | "plt.subplot(2, 1, 1)\n", 1078 | "plt.plot([1, 2, 3, 4], [1, 4, 9, 16], 'r--')\n", 1079 | "\n", 1080 | "# Set the second subplot as active, and make the second plot.\n", 1081 | "plt.subplot(2, 1, 2)\n", 1082 | "plt.bar([1, 2, 3, 4], [1, 4, 9, 16])\n", 1083 | "\n", 1084 | "plt.show()" 1085 | ], 1086 | "execution_count": null, 1087 | "cell_type": "code", 1088 | "metadata": { 1089 | "collapsed": false 1090 | }, 1091 | "outputs": [] 1092 | }, 1093 | { 1094 | "source": [ 1095 | "For more examples, please visit the [homepage](http://matplotlib.org/1.5.1/examples/index.html) of Matplotlib." 
1096 | ], 1097 | "metadata": {}, 1098 | "cell_type": "markdown" 1099 | }, 1100 | { 1101 | "source": [ 1102 | "### Question 3\n", 1103 | "Given a list of numbers from 0 to 9999.\n", 1104 | "\n", 1105 | "\n", 1106 | "#### Question 3.1\n", 1107 | "
\n", 1108 | "Calculate the histogram of numbers divisible by 3, 7, 11 in the list respectively.\n", 1109 | "\n", 1110 | "( Or in other words, how many numbers divisible by 3, 7, 11 in the list respectively ?)\n", 1111 | "
" 1112 | ], 1113 | "metadata": {}, 1114 | "cell_type": "markdown" 1115 | }, 1116 | { 1117 | "source": [ 1118 | "```python\n", 1119 | "###################################################################\n", 1120 | "#### TO COMPLETE #####\n", 1121 | "###################################################################\n", 1122 | "\n", 1123 | "arr = np.array(...)\n", 1124 | "divisors = [3, 7, 11]\n", 1125 | "histogram = list(...)\n", 1126 | "print(histogram)\n", 1127 | "```" 1128 | ], 1129 | "metadata": {}, 1130 | "cell_type": "markdown" 1131 | }, 1132 | { 1133 | "source": [ 1134 | "#### Question 3.2\n", 1135 | "
\n", 1136 | "Plot the histogram in a line chart.\n", 1137 | "
" 1138 | ], 1139 | "metadata": {}, 1140 | "cell_type": "markdown" 1141 | }, 1142 | { 1143 | "source": [ 1144 | "```python\n", 1145 | "###################################################################\n", 1146 | "#### TO COMPLETE #####\n", 1147 | "###################################################################\n", 1148 | "\n", 1149 | "%matplotlib inline\n", 1150 | "import matplotlib.pyplot as plt\n", 1151 | "\n", 1152 | "# simple line chart\n", 1153 | "plt.plot(histogram)\n", 1154 | "x_indexes = ...\n", 1155 | "x_names = list(...)\n", 1156 | "plt.xticks(x_indexes, x_names)\n", 1157 | "plt.show()\n", 1158 | "```" 1159 | ], 1160 | "metadata": {}, 1161 | "cell_type": "markdown" 1162 | }, 1163 | { 1164 | "source": [ 1165 | "#### Question 3.3\n", 1166 | "
\n", 1167 | "Plot the histogram in a bar chart.\n", 1168 | "
" 1169 | ], 1170 | "metadata": {}, 1171 | "cell_type": "markdown" 1172 | }, 1173 | { 1174 | "source": [ 1175 | "```python\n", 1176 | "###################################################################\n", 1177 | "#### TO COMPLETE #####\n", 1178 | "###################################################################\n", 1179 | "\n", 1180 | "# char chart with x-lables\n", 1181 | "x_indexes = range(...)\n", 1182 | "x_names = list(...)\n", 1183 | "plt.bar( x_indexes, histogram, align='center')\n", 1184 | "plt.xticks(x_indexes, x_names)\n", 1185 | "plt.show()\n", 1186 | "```" 1187 | ], 1188 | "metadata": {}, 1189 | "cell_type": "markdown" 1190 | }, 1191 | { 1192 | "source": [ 1193 | "## 2.4. Pandas\n", 1194 | "\n", 1195 | "Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Indeed, it is great for data manipulation, data analysis, and data visualization.\n", 1196 | "\n", 1197 | "### 2.4.1. Data structures\n", 1198 | "Pandas introduces two useful (and powerful) structures: `Series` and `DataFrame`, both of which are built on top of NumPy.\n", 1199 | "\n", 1200 | "#### Series\n", 1201 | "A `Series` is a one-dimensional object similar to an array, list, or even column in a table. It assigns a *labeled index* to each item in the Series. By default, each item will receive an index label from `0` to `N-1`, where `N` is the number items of `Series`.\n", 1202 | "\n", 1203 | "We can create a Series by passing a list of values, and let pandas create a default integer index.\n" 1204 | ], 1205 | "metadata": {}, 1206 | "cell_type": "markdown" 1207 | }, 1208 | { 1209 | "source": [ 1210 | "import pandas as pd\n", 1211 | "import numpy as np\n", 1212 | "\n", 1213 | "# create a Series with an arbitrary list\n", 1214 | "s = pd.Series([3, 'Machine learning', 1.414259, -65545, 'Happy coding!'])\n", 1215 | "print(s)" 1216 | ], 1217 | "execution_count": null, 1218 | "cell_type": "code", 1219 | "metadata": { 1220 | "collapsed": false 1221 | }, 1222 | "outputs": [] 1223 | }, 1224 | { 1225 | "source": [ 1226 | "Or, an index can be used explicitly when creating the `Series`." 1227 | ], 1228 | "metadata": {}, 1229 | "cell_type": "markdown" 1230 | }, 1231 | { 1232 | "source": [ 1233 | "s = pd.Series([3, 'Machine learning', 1.414259, -65545, 'Happy coding!'],\n", 1234 | " index=['Col1', 'Col2', 'Col3', 4.1, 5])\n", 1235 | "print(s)" 1236 | ], 1237 | "execution_count": null, 1238 | "cell_type": "code", 1239 | "metadata": { 1240 | "collapsed": false 1241 | }, 1242 | "outputs": [] 1243 | }, 1244 | { 1245 | "source": [ 1246 | "A `Series` can be constructed from a dictionary too." 1247 | ], 1248 | "metadata": {}, 1249 | "cell_type": "markdown" 1250 | }, 1251 | { 1252 | "source": [ 1253 | "s = pd.Series({\n", 1254 | " 'Col1': 3, 'Col2': 'Machine learning', \n", 1255 | " 'Col3': 1.414259, 4.1: -65545, \n", 1256 | " 5: 'Happy coding!'\n", 1257 | " })\n", 1258 | "print(s)" 1259 | ], 1260 | "execution_count": null, 1261 | "cell_type": "code", 1262 | "metadata": { 1263 | "collapsed": false 1264 | }, 1265 | "outputs": [] 1266 | }, 1267 | { 1268 | "source": [ 1269 | "We can access items in a `Series` in a same way as `Numpy`." 
1270 | ], 1271 | "metadata": {}, 1272 | "cell_type": "markdown" 1273 | }, 1274 | { 1275 | "source": [ 1276 | "s = pd.Series({\n", 1277 | " 'Col1': 3, 'Col2': -10, \n", 1278 | " 'Col3': 1.414259, \n", 1279 | " 4.1: -65545, \n", 1280 | " 5: 8\n", 1281 | " })\n", 1282 | "\n", 1283 | "# get element which has index='Col1'\n", 1284 | "print(\"s['Col1']=\", s['Col1'], \"\\n\")\n", 1285 | "\n", 1286 | "# get elements whose index is in a given list\n", 1287 | "print(\"s[['Col1', 'Col3', 4.5]]=\", s[['Col1', 'Col3', 4.5]], \"\\n\")\n", 1288 | "\n", 1289 | "# use boolean indexing for selection\n", 1290 | "print(s[s > 0], \"\\n\")\n", 1291 | "\n", 1292 | "# modify elements on the fly using boolean indexing\n", 1293 | "s[s > 0] = 15\n", 1294 | "\n", 1295 | "print(s, \"\\n\")\n", 1296 | "\n", 1297 | "# mathematical operations can be done using operators and functions.\n", 1298 | "print(s*10, \"\\n\")\n", 1299 | "print(np.square(s), \"\\n\")" 1300 | ], 1301 | "execution_count": null, 1302 | "cell_type": "code", 1303 | "metadata": { 1304 | "collapsed": false 1305 | }, 1306 | "outputs": [] 1307 | }, 1308 | { 1309 | "source": [ 1310 | "#### DataFrame\n", 1311 | "A DataFrame is a tabular data structure comprised of rows and columns, akin to database table, or R's data.frame object. In a loose way, we can also think of a DataFrame as a group of Series objects that share an index (the column names).\n", 1312 | "\n", 1313 | "We can create a DataFrame by passing a dict of objects that can be converted to series-like." 1314 | ], 1315 | "metadata": {}, 1316 | "cell_type": "markdown" 1317 | }, 1318 | { 1319 | "source": [ 1320 | "data = {'year': [2013, 2014, 2015, 2013, 2014, 2015, 2013, 2014],\n", 1321 | " 'team': ['Manchester United', 'Chelsea', 'Asernal', 'Liverpool', 'West Ham', 'Newcastle', 'Machester City', 'Tottenham'],\n", 1322 | " 'wins': [11, 8, 10, 15, 11, 6, 10, 4],\n", 1323 | " 'losses': [5, 8, 6, 1, 5, 10, 6, 12]}\n", 1324 | "football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])\n", 1325 | "football" 1326 | ], 1327 | "execution_count": null, 1328 | "cell_type": "code", 1329 | "metadata": { 1330 | "collapsed": false 1331 | }, 1332 | "outputs": [] 1333 | }, 1334 | { 1335 | "source": [ 1336 | "We can store data as a CSV file, or read data from a CSV file." 1337 | ], 1338 | "metadata": {}, 1339 | "cell_type": "markdown" 1340 | }, 1341 | { 1342 | "source": [ 1343 | "# save data to a csv file without the index\n", 1344 | "football.to_csv('football.csv', index=False)\n", 1345 | "\n", 1346 | "from_csv = pd.read_csv('football.csv')\n", 1347 | "from_csv.head()" 1348 | ], 1349 | "execution_count": null, 1350 | "cell_type": "code", 1351 | "metadata": { 1352 | "collapsed": false 1353 | }, 1354 | "outputs": [] 1355 | }, 1356 | { 1357 | "source": [ 1358 | "To read a CSV file with a custom delimiter between values and custom columns' names, we can use parameters `sep` and `names` relatively.\n", 1359 | "Moreover, Pandas also supports to read and write to [Excel file](http://pandas.pydata.org/pandas-docs/stable/io.html#io-excel) , sqlite database file, URL, or even clipboard.\n", 1360 | "\n", 1361 | "We can have an overview on the data by using functions `info` and `describe`." 
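"\n",
"For the custom-separator case mentioned above, a minimal self-contained sketch (the data here is made up, and `StringIO` simply stands in for a file on disk):\n",
"```python\n",
"import pandas as pd\n",
"from io import StringIO\n",
"\n",
"raw = StringIO('Chelsea;2014;8\\nLiverpool;2013;15')  # ';'-separated rows, no header\n",
"df = pd.read_csv(raw, sep=';', names=['team', 'year', 'wins'])\n",
"print(df)\n",
"```"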
1362 | ], 1363 | "metadata": {}, 1364 | "cell_type": "markdown" 1365 | }, 1366 | { 1367 | "source": [ 1368 | "print(football.info(), \"\\n\")\n", 1369 | "football.describe()" 1370 | ], 1371 | "execution_count": null, 1372 | "cell_type": "code", 1373 | "metadata": { 1374 | "collapsed": false 1375 | }, 1376 | "outputs": [] 1377 | }, 1378 | { 1379 | "source": [ 1380 | "Numpy's regular slicing syntax works as well." 1381 | ], 1382 | "metadata": {}, 1383 | "cell_type": "markdown" 1384 | }, 1385 | { 1386 | "source": [ 1387 | "print(football[0:2], \"\\n\")\n", 1388 | "\n", 1389 | "# show only the teams that have won more than 10 matches from 2014\n", 1390 | "print(football[(football.year >= 2014) & (football.wins >= 10)])" 1391 | ], 1392 | "execution_count": null, 1393 | "cell_type": "code", 1394 | "metadata": { 1395 | "collapsed": false 1396 | }, 1397 | "outputs": [] 1398 | }, 1399 | { 1400 | "source": [ 1401 | "An important feature that Pandas supports is `JOIN`. Very often, the data comes from multiple sources, in multiple files. For example, we have 2 CSV files, one contains the information of Artists, the other contains information of Songs. If we want to query the artist name and his/her corresponding songs, we have to do joining two dataframe.\n", 1402 | "\n", 1403 | "Similar to SQL, in Pandas, you can do inner join, left outer join, right outer join and full outer join. Let's see a small example. Assume that we have two dataset of singers and songs. The relationship between two datasets is maintained by a constrain on `singer_code`." 1404 | ], 1405 | "metadata": {}, 1406 | "cell_type": "markdown" 1407 | }, 1408 | { 1409 | "source": [ 1410 | "singers = pd.DataFrame({'singer_code': range(5), \n", 1411 | " 'singer_name': ['singer_a', 'singer_b', 'singer_c', 'singer_d', 'singer_e']})\n", 1412 | "songs = pd.DataFrame({'singer_code': [2, 2, 3, 4, 5], \n", 1413 | " 'song_name': ['song_f', 'song_g', 'song_h', 'song_i', 'song_j']})\n", 1414 | "print(singers)\n", 1415 | "print('\\n')\n", 1416 | "print(songs)" 1417 | ], 1418 | "execution_count": null, 1419 | "cell_type": "code", 1420 | "metadata": { 1421 | "collapsed": false 1422 | }, 1423 | "outputs": [] 1424 | }, 1425 | { 1426 | "source": [ 1427 | "# inner join\n", 1428 | "pd.merge(singers, songs, on='singer_code', how='inner')" 1429 | ], 1430 | "execution_count": null, 1431 | "cell_type": "code", 1432 | "metadata": { 1433 | "collapsed": false 1434 | }, 1435 | "outputs": [] 1436 | }, 1437 | { 1438 | "source": [ 1439 | "# left join\n", 1440 | "pd.merge(singers, songs, on='singer_code', how='left')" 1441 | ], 1442 | "execution_count": null, 1443 | "cell_type": "code", 1444 | "metadata": { 1445 | "collapsed": false 1446 | }, 1447 | "outputs": [] 1448 | }, 1449 | { 1450 | "source": [ 1451 | "# right join\n", 1452 | "pd.merge(singers, songs, on='singer_code', how='right')" 1453 | ], 1454 | "execution_count": null, 1455 | "cell_type": "code", 1456 | "metadata": { 1457 | "collapsed": false 1458 | }, 1459 | "outputs": [] 1460 | }, 1461 | { 1462 | "source": [ 1463 | "# outer join (full join)\n", 1464 | "pd.merge(singers, songs, on='singer_code', how='outer')" 1465 | ], 1466 | "execution_count": null, 1467 | "cell_type": "code", 1468 | "metadata": { 1469 | "collapsed": false 1470 | }, 1471 | "outputs": [] 1472 | }, 1473 | { 1474 | "source": [ 1475 | "We can also concatenate two dataframes vertically or horizontally via function `concat` and parameter `axis`. 
This function is useful when we need to append two similar datasets or to put them side by site" 1476 | ], 1477 | "metadata": {}, 1478 | "cell_type": "markdown" 1479 | }, 1480 | { 1481 | "source": [ 1482 | "# concat vertically\n", 1483 | "pd.concat([singers, songs])" 1484 | ], 1485 | "execution_count": null, 1486 | "cell_type": "code", 1487 | "metadata": { 1488 | "collapsed": false 1489 | }, 1490 | "outputs": [] 1491 | }, 1492 | { 1493 | "source": [ 1494 | "# concat horizontally\n", 1495 | "pd.concat([singers, songs], axis=1)" 1496 | ], 1497 | "execution_count": null, 1498 | "cell_type": "code", 1499 | "metadata": { 1500 | "collapsed": false 1501 | }, 1502 | "outputs": [] 1503 | }, 1504 | { 1505 | "source": [ 1506 | "When computing descriptive statistic, we usually need to aggregate data by each group. For example, to answer the question \"how many songs each singer has?\", we have to group data by each singer, and then calculate the number of songs in each group. Not that the result must contain the statistic of all singers in database (even if some of them have no song)" 1507 | ], 1508 | "metadata": {}, 1509 | "cell_type": "markdown" 1510 | }, 1511 | { 1512 | "source": [ 1513 | "data = pd.merge(singers, songs, on='singer_code', how='left')\n", 1514 | "\n", 1515 | "# count the values of each column in group\n", 1516 | "print(data.groupby('singer_code').count())\n", 1517 | "\n", 1518 | "print(\"\\n\")\n", 1519 | "\n", 1520 | "# count only song_name\n", 1521 | "print(data.groupby('singer_code').song_name.count())\n", 1522 | "\n", 1523 | "print(\"\\n\")\n", 1524 | "\n", 1525 | "# count song name but ignore duplication, and order the result\n", 1526 | "print(data.groupby('singer_code').song_name.nunique().sort_values(ascending=True))" 1527 | ], 1528 | "execution_count": null, 1529 | "cell_type": "code", 1530 | "metadata": { 1531 | "collapsed": false 1532 | }, 1533 | "outputs": [] 1534 | }, 1535 | { 1536 | "source": [ 1537 | "### Question 4\n", 1538 | "We have two datasets about music: [song](https://github.com/michiard/AML-COURSE/blob/master/data/song.tsv) and [album](https://github.com/michiard/AML-COURSE/blob/master/data/album.tsv).\n", 1539 | "\n", 1540 | "In the following questions, you **have to** use Pandas to load data and write code to answer these questions.\n", 1541 | "\n", 1542 | "\n", 1543 | "#### Question 4.1\n", 1544 | "
\n", 1545 | "Load both dataset into two dataframes and print the information of each dataframe\n", 1546 | "\n", 1547 | "**HINT**: \n", 1548 | "\n", 1549 | "- You can click button `Raw` on the github page of each dataset and copy the URL of the raw file.\n", 1550 | "- The dataset can be load by using function `read_table`. For example: `df = pd.read_table(raw_url, sep='\\t')`\n", 1551 | "
" 1552 | ], 1553 | "metadata": {}, 1554 | "cell_type": "markdown" 1555 | }, 1556 | { 1557 | "source": [ 1558 | "```python\n", 1559 | "###################################################################\n", 1560 | "#### TO COMPLETE #####\n", 1561 | "###################################################################\n", 1562 | "\n", 1563 | "import pandas as pd\n", 1564 | "\n", 1565 | "songdb_url = 'https://raw.githubusercontent.com/DistributedSystemsGroup/Algorithmic-Machine-Learning/master/data/song.tsv'\n", 1566 | "albumdb_url = 'https://raw.githubusercontent.com/DistributedSystemsGroup/Algorithmic-Machine-Learning/master/data/album.tsv'\n", 1567 | "song_df = pd...\n", 1568 | "album_df = pd...\n", 1569 | "\n", 1570 | "print(song_df...)\n", 1571 | "print(album_df...)\n", 1572 | "```" 1573 | ], 1574 | "metadata": {}, 1575 | "cell_type": "markdown" 1576 | }, 1577 | { 1578 | "source": [ 1579 | "#### Question 4.2\n", 1580 | "
\n", 1581 | "How many albums in this datasets ?\n", 1582 | "\n", 1583 | "How many songs in this datasets ?\n", 1584 | "
" 1585 | ], 1586 | "metadata": {}, 1587 | "cell_type": "markdown" 1588 | }, 1589 | { 1590 | "source": [ 1591 | "```python\n", 1592 | "###################################################################\n", 1593 | "#### TO COMPLETE #####\n", 1594 | "###################################################################\n", 1595 | "\n", 1596 | "print(\"number of albums:\", album_df....count())\n", 1597 | "print(\"number of songs:\", song_df.Song...)\n", 1598 | "```" 1599 | ], 1600 | "metadata": {}, 1601 | "cell_type": "markdown" 1602 | }, 1603 | { 1604 | "source": [ 1605 | "#### Question 4.3\n", 1606 | "
\n", 1607 | "How many distinct singers in this dataset ?\n", 1608 | "
" 1609 | ], 1610 | "metadata": {}, 1611 | "cell_type": "markdown" 1612 | }, 1613 | { 1614 | "source": [ 1615 | "```python\n", 1616 | "###################################################################\n", 1617 | "#### TO COMPLETE #####\n", 1618 | "###################################################################\n", 1619 | "\n", 1620 | "print(\"number distinct singers:\", len(...))\n", 1621 | "```" 1622 | ], 1623 | "metadata": {}, 1624 | "cell_type": "markdown" 1625 | }, 1626 | { 1627 | "source": [ 1628 | "#### Question 4.4\n", 1629 | "
\n", 1630 | "Is there any song that doesn't belong to any album ?\n", 1631 | "\n", 1632 | "Is there any album that has no song ?\n", 1633 | "\n", 1634 | "**HINT**: \n", 1635 | "\n", 1636 | "- To join two datasets on different key names, we use `left_on=` and `right_on=` instead of `on=`.\n", 1637 | "- Funtion `notnull` and `isnull` help determining the value of a column is missing or not. For example:\n", 1638 | "`df['song'].isnull()`.\n", 1639 | "
" 1640 | ], 1641 | "metadata": {}, 1642 | "cell_type": "markdown" 1643 | }, 1644 | { 1645 | "source": [ 1646 | "```python\n", 1647 | "###################################################################\n", 1648 | "#### TO COMPLETE #####\n", 1649 | "###################################################################\n", 1650 | "\n", 1651 | "fulldf = pd.merge(song_df, album_df, how='outer', left_on='Album', right_on='Album code')\n", 1652 | "fulldf[fulldf['Song'].... & fulldf['Album']....]\n", 1653 | "```" 1654 | ], 1655 | "metadata": {}, 1656 | "cell_type": "markdown" 1657 | }, 1658 | { 1659 | "source": [ 1660 | "```python\n", 1661 | "###################################################################\n", 1662 | "#### TO COMPLETE #####\n", 1663 | "###################################################################\n", 1664 | "\n", 1665 | "fulldf[fulldf['Song'].... & fulldf['Album code']....]\n", 1666 | "```" 1667 | ], 1668 | "metadata": {}, 1669 | "cell_type": "markdown" 1670 | }, 1671 | { 1672 | "source": [ 1673 | "#### Question 4.5\n", 1674 | "
\n", 1675 | "How many songs in each albums of Michael Jackson ?\n", 1676 | "
" 1677 | ], 1678 | "metadata": {}, 1679 | "cell_type": "markdown" 1680 | }, 1681 | { 1682 | "source": [ 1683 | "```python\n", 1684 | "###################################################################\n", 1685 | "#### TO COMPLETE #####\n", 1686 | "###################################################################\n", 1687 | "\n", 1688 | "\n", 1689 | "\n", 1690 | "fulldf[fulldf['Singer']=='Michael Jackson']....\n", 1691 | "```" 1692 | ], 1693 | "metadata": {}, 1694 | "cell_type": "markdown" 1695 | }, 1696 | { 1697 | "source": [ 1698 | "# Summary\n", 1699 | "\n", 1700 | "In this lecture, we gained familiarity with the Jupyter Notebook environment, the Python programming language and its modules. In particular, we covered the Python syntax, Numpy - the core library for scientific computing, Matplotlib - a module to plot graphs, Pandas - a data analysis module.\n" 1701 | ], 1702 | "metadata": {}, 1703 | "cell_type": "markdown" 1704 | }, 1705 | { 1706 | "source": [ 1707 | "# References\n", 1708 | "This notebook is inspired from:\n", 1709 | "\n", 1710 | "- [Python Numpy tutorial](http://cs231n.github.io/python-numpy-tutorial/)" 1711 | ], 1712 | "metadata": {}, 1713 | "cell_type": "markdown" 1714 | }, 1715 | { 1716 | "source": [], 1717 | "execution_count": null, 1718 | "cell_type": "code", 1719 | "metadata": {}, 1720 | "outputs": [] 1721 | }, 1722 | { 1723 | "source": [], 1724 | "execution_count": null, 1725 | "cell_type": "code", 1726 | "metadata": {}, 1727 | "outputs": [] 1728 | } 1729 | ], 1730 | "metadata": { 1731 | "kernelspec": { 1732 | "display_name": "Python 3", 1733 | "language": "python", 1734 | "name": "python3" 1735 | }, 1736 | "language_info": { 1737 | "name": "python", 1738 | "pygments_lexer": "ipython3", 1739 | "version": "3.5.2", 1740 | "mimetype": "text/x-python", 1741 | "file_extension": ".py", 1742 | "codemirror_mode": { 1743 | "name": "ipython", 1744 | "version": 3 1745 | }, 1746 | "nbconvert_exporter": "python" 1747 | } 1748 | }, 1749 | "nbformat_minor": 2, 1750 | "nbformat": 4 1751 | } -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # AML-COURSE 2 | This repository contains Jupyter Notebooks for the Algorithmic Machine Learning Course at Eurecom. 3 | 4 | ## Objectives of the course 5 | The goal of this course is mainly to offer data science projects to students to gain hands-on experience. It nicely merges the theoretical concepts students can learn in our courses on machine learning and statistical inference, and systems concepts we teach in distributed systems. 6 | 7 | Notebooks require to address several challenges, that can be roughly classified in: 8 | 9 | * Data preparation and cleaning 10 | * Building descriptive statistics of the data 11 | * Working on a selected algorithm, e.g., for building a statistical model 12 | * Working on experimental validation 13 | 14 | ## Technical notes 15 | Students will use the EURECOM cloud computing platform to work on Notebooks. Our cluster is managed by [Zoe](http://zoe-analytics.eu/), which is a container-based analytics-as-a-service system we have built. Notebooks front-end run in a user-facing container, whereas Notebooks kernel run in clusters of containers. 
16 | 17 | ## Sources and acknowledgments 18 | Some of the Notebooks we use in our lectures are based on use cases illustrated in the book [Advanced Analytics with Spark](http://shop.oreilly.com/product/0636920035091.do), by Sandy Ryza, Uri Laserson, Sean Owen & Josh Wills. 19 | 20 | Some Notebooks are instead based on publicly available data, for which we defined the tasks to complete. 21 | 22 | Other Notebooks are private, and cannot be pushed to this repository. This is the case for industrial Notebooks, which take the form of use cases contributed by Data Scientists from companies we are in contact with. 23 | 24 | Finally, all this could not have been achieved without the skills of several PhD students at Eurecom: 25 | 26 | * Duc-Trung Nguyen 27 | * Rosa Candela 28 | * Simone Rossi 29 | * Kurt Cutajar 30 | * Jonas Wacker 31 | * Gia-Lac Tran 32 | * Graziano Mita --------------------------------------------------------------------------------