├── .gitignore ├── 00_Introduction ├── 01_intro.html ├── 01_intro.qmd ├── README.md └── images │ ├── about_appsilon1.png │ ├── about_appsilon2.png │ ├── dataneversleeps2.png │ ├── datascience.png │ ├── datascienceworkflow2a.png │ ├── datascienceworkflow_extended.png │ ├── dominotools.png │ ├── meme2.jpeg │ └── the-data-science-workflow.jpeg ├── 01_Numpy ├── 01_numpy.html ├── 01_numpy.ipynb ├── 02_homework.ipynb ├── README.md └── images │ ├── behind_scenes.webp │ ├── cpp_numpy.jpg │ └── rly.gif ├── 02_Pandas ├── 01_types_of_data.html ├── 01_types_of_data.qmd ├── 02_pandas.html ├── 02_pandas.ipynb ├── 03_homework.ipynb ├── 04_loading_wiki_data.ipynb ├── README.md └── images │ ├── de_wiki_problem.png │ ├── one_does.jpg │ └── weird_line.png ├── 03_Plots ├── 01_matplotlib_plotly.html ├── 01_matplotlib_plotly.ipynb ├── 02_homework.ipynb ├── README.md └── images │ └── votes.png ├── 04_Scikit-learn ├── .gitignore ├── 01_machine_learning.html ├── 01_machine_learning.qmd ├── 02_linear_regression.html ├── 02_linear_regression.ipynb ├── 03_homework.ipynb ├── README.md └── images │ ├── NA_trick.png │ ├── Precisionrecall.png │ ├── na_example.svg │ ├── onehot.png │ ├── regression.png │ └── traintest.png ├── 05_SharingWork ├── 01_streamlit │ ├── 01_hello_world │ │ ├── README.md │ │ ├── main1.py │ │ ├── main2.py │ │ └── main3.py │ ├── 02_small_report │ │ ├── README.md │ │ └── main.py │ ├── 03_mc_simulation │ │ ├── README.md │ │ └── main.py │ ├── 04_dishes │ │ ├── README.md │ │ ├── dishes.json │ │ └── favorite_dish.py │ └── 05_checker │ │ ├── README.md │ │ ├── create_db.py │ │ ├── files │ │ └── .gitignore │ │ ├── main.py │ │ ├── results.db │ │ └── y_true.csv ├── 02_quarto │ ├── README.md │ ├── report.docx │ ├── report.html │ ├── report.ipynb │ └── report.pdf ├── 03_fastapi │ ├── README.md │ ├── boston_model_prediction │ │ ├── README.md │ │ ├── boston_api.py │ │ └── model_lgbm_regressor.pkl │ └── image_prediction │ │ ├── README.md │ │ └── image_api.py └── README.md ├── README.md ├── 
data ├── .gitignore ├── flights │ ├── .gitignore │ └── flights_Q1_JFK.csv ├── housing │ ├── .gitignore │ ├── housing_example_submission.csv │ ├── housing_train.csv │ └── housing_validation.csv ├── iris │ ├── iris.csv │ ├── iris.tsv │ ├── iris.xlsx │ └── iris_noheader.csv └── other │ ├── Life Expectancy Data.csv │ └── lotr_data.csv ├── homework_solutions ├── .gitignore ├── 01_hw_numpy.ipynb ├── 02_hw_pandas1.ipynb ├── 03_hw_pandas2_a.ipynb ├── 03_hw_pandas2_b.ipynb └── 04_hw_sklearn.ipynb ├── requirements.txt └── requirements_loose.txt /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints/ 2 | __pycache__ 3 | -------------------------------------------------------------------------------- /00_Introduction/01_intro.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Introduction to Data Science in Python by Appsilon" 3 | subtitle: "Course introduction" 4 | author: "Piotr Pasza Storożenko@Appsilon" 5 | lang: "en" 6 | format: 7 | revealjs: 8 | embed-resources: true 9 | # smaller: true 10 | theme: [dark] 11 | editor_options: 12 | markdown: 13 | wrap: 99 14 | --- 15 | 16 | # Introduction 17 | 18 | ## About me 19 | 20 | Piotr Pasza Storożenko, 21 | Machine Learning Engineer 22 | 23 | A bit of everything: 24 | 25 | - ML guy 26 | - Physicist 27 | - Computer scientist 28 | - Mathematician 29 | 30 | You can find more on my blog: [pstorozenko.github.io](https://pstorozenko.github.io/). 31 | 32 | ## {background-image="images/about_appsilon1.png"} 33 | 34 | ## {background-image="images/about_appsilon2.png"} 35 | 36 | ## Why this course? 37 | 38 | Over the years I have gained a lot of knowledge about `python`, `R`, `julia`, data science, machine learning, deep learning, and software development. 39 | 40 | Now I can share it with you!
41 | 42 | ## What will be interesting for whom during this course?{.smaller} 43 | 44 | :::: {.columns} 45 | 46 | :::{.column} 47 | - Computer scientists 48 | - Easy-to-use and **efficient** tools to work with data 49 | - Made by software developers 50 | - Mathematicians 51 | - Intuitive tools that support **reproducible** experiments 52 | - _Ridiculously easy_ work with plots 53 | - Electrical and mechanical engineers 54 | - Great alternative to MATLAB 55 | - Easy to use 56 | - Simple plots and animations 57 | ::: 58 | 59 | :::{.column} 60 | - Economics students 61 | - Great alternative to spreadsheets 62 | - Set of **free**, open source tools 63 | - Incomparably greater control over data compared to MS Office, Tableau, PowerBI etc. 64 | - Physicists, chemists, biologists 65 | - Substantial relief from spreadsheets 66 | - Open-source software 67 | - Much easier creation of reproducible plots 68 | ::: 69 | 70 | :::: 71 | 72 | # Data Science 73 | 74 | ## What's Data Science? 75 | 76 | ![](images/datascience.png){fig-align="center"} 77 | 78 | ::: footer 79 | Source: [https://medium.com/data-science-in-2019/what-is-data-science-87e9dc225cf9](https://medium.com/data-science-in-2019/what-is-data-science-87e9dc225cf9) 80 | ::: 81 | 82 | ## Why Data Science? 83 | 84 | ![](images/dataneversleeps2.png){fig-align="center"} 85 | 86 | ::: footer 87 | Source: [https://www.domo.com/learn/infographic/data-never-sleeps-9](https://www.domo.com/learn/infographic/data-never-sleeps-9) 88 | ::: 89 | 90 | ## Explanation of various terms{.smaller} 91 | 92 | - Artificial Intelligence, AI -- an [umbrella term](https://en.wiktionary.org/wiki/umbrella_term) covering everything where the computer/system makes decisions based on a set of rules, on an algorithm.
93 | - Data Science (DS) -- everything related to data, from collecting, through processing, up to displaying and using it 94 | - Machine Learning, ML -- everything related to creating/training models that are able to _learn_ rules based on provided data 95 | - [Deep] Neural Networks, [D]NN -- a subset of ML methods based on a special class of models, so-called neural networks. Their architecture (design) loosely resembles connections between biological neurons, hence the name. 96 | 97 | ## Who's a Data Scientist? 98 | 99 | Someone who simultaneously: 100 | 101 | 1. Discusses the required solutions with the so-called _business_. 102 | 2. Creates solutions from provided and collected data using programming skills. 103 | 3. Delivers the results to the _business_ in a clear and interesting way. 104 | 105 | Business talks about **AI**, experts prefer to say **ML**... 106 | 107 | ## Data Science Workflow 108 | 109 | ![](images/the-data-science-workflow.jpeg){fig-align="center"} 110 | 111 | ::: footer 112 | Source: [https://www.business-science.io/business/2019/06/27/data-science-workflow.html](https://www.business-science.io/business/2019/06/27/data-science-workflow.html) 113 | ::: 114 | 115 | 116 | ## Data Science Tools 117 | 118 | ![](images/dominotools.png){fig-align="center"} 119 | 120 | ::: footer 121 | Source: [https://blog.dominodatalab.com/data-science-tools](https://blog.dominodatalab.com/data-science-tools) 122 | ::: 123 | 124 | # Course plan 125 | 126 | 1. Introduction and `numpy` - working with numbers 127 | 2. `pandas` - working with data frames 128 | 3. `matplotlib` and `plotly` - plotting data 129 | 4. `scikit-learn` - introduction to machine learning 130 | 5. 
`streamlit`, `quarto`, `fastapi` - sharing your work 131 | 132 | ## Course plan 133 | 134 | ![](images/meme2.jpeg){fig-align="center"} 135 | 136 | 137 | ## Data Science Workflow x This course 138 | 139 | ![](images/datascienceworkflow2a.png){fig-align="center"} 140 | 141 | ::: footer 142 | Source: [https://www.business-science.io/business/2019/06/27/data-science-workflow.html](https://www.business-science.io/business/2019/06/27/data-science-workflow.html) 143 | ::: 144 | 145 | ## Data Science Workflow x This course ++ 146 | 147 | ![](images/datascienceworkflow_extended.png){fig-align="center"} 148 | 149 | ::: footer 150 | Source: [https://www.business-science.io/business/2019/06/27/data-science-workflow.html](https://www.business-science.io/business/2019/06/27/data-science-workflow.html) 151 | ::: 152 | 153 | # Tools used in the course 154 | 155 | ## Tools used in the course 156 | 157 | - Python 3.9/3.10 via [Anaconda](https://www.anaconda.com/products/distribution#Downloads) + many additional packages 158 | - Visual Studio Code aka [VS Code](https://code.visualstudio.com/download) 159 | 160 | ## Anaconda{.smaller} 161 | 162 | Anaconda is the standard when it comes to managing python environments in the data science/machine learning community. 163 | It lets you obtain a consistent environment across various systems. 164 | 165 | Why is it that important? 166 | 167 | Data scientists often work on many projects at the same time. 168 | Each project might require a different environment, with specific versions of python and other libraries. 169 | 170 | This can also be a relief when working on different projects during your studies! 171 | 172 | ## Anaconda - How to create an environment{.smaller} 173 | 174 | After installing anaconda, you have to [clone this course repo](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository) and navigate in the terminal into the course repo directory.
175 | Then run: 176 | ``` 177 | conda create -n appsilon-ds-course python=3.10 -y 178 | conda activate appsilon-ds-course 179 | pip install -r requirements.txt 180 | ``` 181 | 182 | If you receive the following error message: 183 | ``` 184 | ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt' 185 | ``` 186 | you're in the wrong directory. 187 | 188 | In case of problems, check out the [official conda tutorial](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html). 189 | 190 | ## VSCode - code editor for 2022 191 | 192 | Why [VS Code](https://code.visualstudio.com/#alt-downloads)? 193 | 194 | - Great support for both python scripts and jupyter notebooks. 195 | - Automatically detects `conda` environments 196 | - Great support for working with remote machines through SSH (although we will not use this feature) 197 | - One tool to work with `python`, `R`, `julia`, `javascript`, `typescript` etc. 198 | - Above all -- VS Code is free 199 | 200 | ## What to do after installing VS Code? 201 | 202 | Install extensions 203 | 204 | - [Pylance](https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance) 205 | - [Jupyter](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter) 206 | 207 | The environment for work and studying is ready! 208 | 209 | # `pip` vs `conda` vs `pipenv` vs ... 210 | 211 | We need multiple environments on a single machine. 212 | 213 | How to live, what to use? 214 | 215 | . . . 216 | 217 | **NEVER PLAY WITH DATA SCIENCE ON YOUR DEFAULT SYSTEM'S PYTHON** 218 | 219 | ## `pip` + `virtualenv` 220 | 221 | 1. A basic package manager included in python 222 | 2. Works only for **a single** version of python 223 | 3. Capable of installing **python packages only** 224 | 4. Basic package versioning with `pip freeze` 225 | 5. Pretty fast when it doesn't have to build packages 226 | 227 | ## `conda`{.smaller} 228 | 229 | 1. 
A package manager provided by Anaconda 230 | 2. Allows for creating different environments for different major (`3.9`/`3.10`) and minor (`3.10.3`/`3.10.4`) python versions 231 | 3. Is able to install **software other than python packages as well** (e.g. `R` or CUDA drivers) 232 | 4. Basic package versioning with `conda list --export` 233 | 5. Super slow for bigger environments 234 | 6. Packages installed with conda can be shared across environments -- lower disk usage (PyTorch alone is ~1.7GB) 235 | 236 | ## `pipenv` 237 | 238 | 1. Like `pip` + `virtualenv`, plus support for different python versions 239 | 2. Very big focus on environment reproducibility 240 | 3. Super slow for bigger environments 241 | 242 | ## How to live?{.smaller} 243 | 244 | The most reliable setup for experimenting is: 245 | 246 | ``` 247 | conda create -n my-env python==3.10.4 248 | conda activate my-env 249 | pip install ... 250 | ``` 251 | 252 | If you need to install CUDA drivers, do it during environment creation: `conda create -n my-env python cudatoolkit`. 253 | 254 | After you install all packages, save the **python version** in your README file, e.g., 255 | 256 | > Project created with python 3.10.4. 257 | 258 | and store the installed packages with `pip freeze > requirements.txt`. 259 | 260 | . . . 261 | 262 | Remember that not every package version is available for every python version. 263 | For example, TensorFlow 2.10 supports only python 3.7--3.10. 264 | -------------------------------------------------------------------------------- /00_Introduction/README.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | This folder contains an introduction to Data Science responsibilities and describes the content of the course, as well as how to set up an environment.
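The slides above recommend recording the exact python version in the README after setting up an environment. As a purely illustrative sketch (not part of the course materials), the README line can be generated from within Python itself, so it always matches the interpreter you actually used:

```python
import sys

# Compose the "Project created with python X.Y.Z." line that the slides
# suggest putting in the README (illustrative helper, not course code).
version = ".".join(str(part) for part in sys.version_info[:3])
readme_line = f"Project created with python {version}."
print(readme_line)
```

Running this inside the activated environment prints the line to copy into the README, next to the `pip freeze > requirements.txt` output.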
-------------------------------------------------------------------------------- /00_Introduction/images/about_appsilon1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/00_Introduction/images/about_appsilon1.png -------------------------------------------------------------------------------- /00_Introduction/images/about_appsilon2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/00_Introduction/images/about_appsilon2.png -------------------------------------------------------------------------------- /00_Introduction/images/dataneversleeps2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/00_Introduction/images/dataneversleeps2.png -------------------------------------------------------------------------------- /00_Introduction/images/datascience.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/00_Introduction/images/datascience.png -------------------------------------------------------------------------------- /00_Introduction/images/datascienceworkflow2a.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/00_Introduction/images/datascienceworkflow2a.png -------------------------------------------------------------------------------- /00_Introduction/images/datascienceworkflow_extended.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/00_Introduction/images/datascienceworkflow_extended.png -------------------------------------------------------------------------------- /00_Introduction/images/dominotools.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/00_Introduction/images/dominotools.png -------------------------------------------------------------------------------- /00_Introduction/images/meme2.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/00_Introduction/images/meme2.jpeg -------------------------------------------------------------------------------- /00_Introduction/images/the-data-science-workflow.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/00_Introduction/images/the-data-science-workflow.jpeg -------------------------------------------------------------------------------- /01_Numpy/02_homework.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Numpy Homework\n", 8 | "\n", 9 | "Example solutions can be found in `homework_solutions` directory.\n", 10 | "\n", 11 | "We will use the new and recommended random number generator `default_rng` instead of `np.random.rand`.\n", 12 | "\n", 13 | "## Task 1\n", 14 | "\n", 15 | "Given the following two vectors:" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 1, 21 | "metadata": {}, 22 | "outputs": [ 23 | { 24 | "data": { 25 | "text/plain": [ 26 | "(array([ 0.04, 0.47, 
-0.14, -1.39, 2.52, -1.01, 1.86, -2.5 , 0.15,\n", 27 | " -0.09, 2.12, 0.52, -0.53, 1.24, 0.21, 1.81, -0.22, 0.09,\n", 28 | " -0.13, -1.13, 0.85, 0.68, 0.87, -0.34, 1.02, 1.11, -0.04,\n", 29 | " -0.82, -0.16, -1.5 ]),\n", 30 | " array([-0.11, 0.54, -0.31, -1.58, 2.56, -1. , 1.67, -2.42, 0.15,\n", 31 | " 0.03, 2.09, 0.56, -0.69, 1.19, 0.32, 1.72, -0.14, 0.14,\n", 32 | " -0.22, -0.94, 0.91, 0.75, 0.95, -0.54, 1.03, 1.19, -0.04,\n", 33 | " -0.73, -0.15, -1.46]))" 34 | ] 35 | }, 36 | "execution_count": 1, 37 | "metadata": {}, 38 | "output_type": "execute_result" 39 | } 40 | ], 41 | "source": [ 42 | "import numpy as np\n", 43 | "from numpy.random import default_rng\n", 44 | "\n", 45 | "rng = default_rng(1337)\n", 46 | "x = np.round(rng.normal(size=30), 2)\n", 47 | "y = x + np.round(rng.normal(size=30) * 0.1, 2)\n", 48 | "x, y" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "Calculate:\n", 56 | "\n", 57 | "1. Mean of `x`\n", 58 | "2. Sum of `x`\n", 59 | "3. Mean of absolute values of `x`\n", 60 | "4. Element further from $0$ of `x`\n", 61 | "5. Element further from $2$ of `x`\n", 62 | "6. Array that will setup elements smaller than $-1$ to $-1$ and larger than $1$ to $1$\n", 63 | "7. Mean error (ERR) between `x` and `y`\n", 64 | "8. Mean absolute error (MAD) between `x` and `y`\n", 65 | "9. Mean squared error (MSE) between `x` and `y`\n", 66 | "10. Root mean squared error (RMSE) between `x` and `y`" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "## Task 2\n", 74 | "\n", 75 | "Write function `standardize(X)` that will norm every column of matrix `X` (each separately).\n", 76 | "The mean of every column should be equal $0$ and standard deviation to $1$.\n", 77 | "It's procedure very often used in ML." 
78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "## Task 3\n", 85 | "\n", 86 | "Calculate the value of $\pi$ using the Monte Carlo method.\n", 87 | " \n", 88 | "Useful links:\n", 89 | "\n", 90 | "- https://www.geeksforgeeks.org/estimating-value-pi-using-monte-carlo/\n", 91 | "- https://www.youtube.com/watch?v=WAf0rqwAvgg" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "## Task 4\n", 99 | "\n", 100 | "Calculate the area under the function $\exp(-x^2)$ from $-2$ to $2$.\n" 101 | ] 102 | } 103 | ], 104 | "metadata": { 105 | "kernelspec": { 106 | "display_name": "Python 3.9.12 ('rigplay-lighting')", 107 | "language": "python", 108 | "name": "python3" 109 | }, 110 | "language_info": { 111 | "codemirror_mode": { 112 | "name": "ipython", 113 | "version": 3 114 | }, 115 | "file_extension": ".py", 116 | "mimetype": "text/x-python", 117 | "name": "python", 118 | "nbconvert_exporter": "python", 119 | "pygments_lexer": "ipython3", 120 | "version": "3.9.12" 121 | }, 122 | "orig_nbformat": 4, 123 | "vscode": { 124 | "interpreter": { 125 | "hash": "acaec1dd4d4ad1413b15d1459179aaee505991b8d2edc661768082683fde5d51" 126 | } 127 | } 128 | }, 129 | "nbformat": 4, 130 | "nbformat_minor": 2 131 | } 132 | -------------------------------------------------------------------------------- /01_Numpy/README.md: -------------------------------------------------------------------------------- 1 | # Numpy 2 | 3 | This directory contains an introduction to the `numpy` library. 4 | Start with the [01_numpy.ipynb](01_numpy.ipynb) notebook and proceed with the [homework](02_homework.ipynb).
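As a quick taste of the vectorized style the notebook teaches (an illustrative sketch, not code taken from the notebook itself): whole-array operations replace explicit Python loops, which is both shorter and much faster.

```python
import numpy as np

# Vectorized operations act on the whole array at once -- no Python loops.
x = np.array([1.0, -2.0, 3.0, -4.0])

mean_abs = np.abs(x).mean()                # mean of absolute values
clipped = np.clip(x, -1.0, 1.0)            # force all values into [-1, 1]
standardized = (x - x.mean()) / x.std()    # shift to mean 0, scale to std 1
```

Each line here is one array expression; the same ideas recur throughout the homework tasks.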
5 | -------------------------------------------------------------------------------- /01_Numpy/images/behind_scenes.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/01_Numpy/images/behind_scenes.webp -------------------------------------------------------------------------------- /01_Numpy/images/cpp_numpy.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/01_Numpy/images/cpp_numpy.jpg -------------------------------------------------------------------------------- /01_Numpy/images/rly.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/01_Numpy/images/rly.gif -------------------------------------------------------------------------------- /02_Pandas/01_types_of_data.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Introduction to Data Science in Python by Appsilon" 3 | subtitle: "Types of data in Data Science" 4 | author: "Piotr Pasza Storożenko@Appsilon" 5 | lang: "en" 6 | format: 7 | revealjs: 8 | embed-resources: true 9 | theme: [dark] 10 | editor_options: 11 | markdown: 12 | wrap: 99 13 | --- 14 | 15 | # Dataframes 16 | 17 | The building block of data science. 18 | 19 | ## Dataframe example {.smaller} 20 | 21 | Dataframes are the most common data type in Data Science. 
22 | 23 | | Name | Surname | Race | Salary | Profession | Age of Death | 24 | |---------|---------|--------|--------|-------------|--------------| 25 | | Bilbo | Baggins | Hobbit | 10000 | Retired | 131 | 26 | | Frodo | Baggins | Hobbit | 70000 | Ring-bearer | 53 | 27 | | Sam | Gamgee | Hobbit | 60000 | Security | 102 | 28 | | Aragorn | NA | Human | 60000 | Security | 210 | 29 | 30 | This is a dataframe with **four** rows and six columns. 31 | 32 | - Dataframes may also be called _tables_. 33 | - Rows may be called _observations_. 34 | - Columns may be called _features_ or _variables_. 35 | 36 | Since this is a table, **every observation must have the same number of columns**. 37 | However, some of them might be _missing_ (NA - not available). 38 | 39 | ## Rows and columns{.smaller} 40 | 41 | Each row represents **a single entity**, e.g.: 42 | 43 | - A single student on the university students' list 44 | - A single part in a warehouse 45 | - Weekly sales for different shops 46 | 47 | . . . 48 | 49 | Values in a single column usually have the same type for every observation. 50 | 51 | - `Salary` is of type `float`. 52 | - `Age of Death` is of type `int`. 53 | - `Name` and `Surname` are of type `string`. 54 | - `Race` is of type **categorical**. 55 | 56 | ## Categorical datatype -- examples{.smaller} 57 | 58 | Values that belong to only a few distinct categories are called [categorical data](https://en.wikipedia.org/wiki/Categorical_variable) (also called factors). 59 | 60 | Examples of factors: 61 | 62 | - `Race` from the above example (let's exclude cases of being half-elf for now). 63 | - Color of eyes 64 | - Gender, although one should be careful as it might be a sensitive topic in some cases 65 | - Day of a week 66 | - Mark at the university (2, 3, 4, 5) 67 | - Country of birth 68 | 69 | ## Categorical datatype -- storing problem and solution{.smaller} 70 | 71 | Imagine having a database of everyone in Poland (around 40 million rows). 
72 | If we look at the column `Country of Birth`, its values will take only one of ~200 distinct values (since there are ~200 _available_ countries to choose from). 73 | 74 | . . . 75 | 76 | We might store the name of the country as a `string` in memory, but that would be inefficient -- think of the string array storage problem and country names like `United Kingdom of Great Britain and Northern Ireland`. 77 | 78 | . . . 79 | 80 | It's much more reasonable to store an in-memory mapping: 81 | 82 | 1. Republic of Poland 83 | 2. Ukraine 84 | 3. United Kingdom of Great Britain and Northern Ireland 85 | 86 | and then store only the values $1, 2, ..., 200$ in the `Country of Birth` column. 87 | This is what categorical values are for. 88 | 89 | ## Storing dataframes 90 | 91 | Depending on the volume of data, it is usually stored in the following formats: 92 | 93 | - [CSV, TSV file](https://en.wikipedia.org/wiki/Comma-separated_values) 94 | - [Arrow file](https://en.wikipedia.org/wiki/Apache_Arrow) 95 | - [Table in some SQL database](https://en.wikipedia.org/wiki/SQL) 96 | 97 | # Unstructured data 98 | 99 | ## JSON{.smaller} 100 | 101 | Sometimes it's not convenient to place data in a table in an SQL-friendly format. 102 | For example, we might be querying some APIs for different books by `isbn` code and getting responses as JSONs like: 103 | 104 | ```json 105 | { 106 | "isbn": "123-456-222", 107 | "authors": [ 108 | { 109 | "lastname": "Piotr", 110 | "middlename": "Pasza", 111 | "firstname": "Storozenko" 112 | } 113 | ], 114 | "title": "Introduction to Data Science in Python by Appsilon", 115 | "category": [ 116 | "Non-Fiction", 117 | "Pure Fiction" 118 | ] 119 | } 120 | ``` 121 | 122 | Note that `authors` is a list with a single element. 123 | 124 | We call this kind of data _unstructured_ even though there is some structure. 125 | 126 | ## XML{.smaller} 127 | 128 | The XML file format is similar to JSON, but has much more redundant markup, so it is _heavier_ in size on disk. 
129 | 130 | ```xml 131 | <?xml version="1.0" encoding="UTF-8"?> 132 | <book> 133 | <authors> 134 | <author> 135 | <lastname>Storozenko</lastname> 136 | <firstname>Piotr</firstname> 137 | <middlename>Pasza</middlename> 138 | </author> 139 | </authors> 140 | <categories> 141 | <category>Non-Fiction</category> 142 | <category>Pure Fiction</category> 143 | </categories> 144 | <isbn>123-456-222</isbn> 145 | <title>Introduction to Data Science in Python by Appsilon</title> 146 | </book> 147 | ``` 148 | 149 | # Images 150 | 151 | ## Images 152 | 153 | We all know what an image is. 154 | 155 | . . . 156 | 157 | ![An example of image](images/one_does.jpg) 158 | 159 | ## Images 160 | 161 | We all know what an image is. 162 | 163 | It's a tensor of dimensions `[C, H, W]`: 164 | 165 | - `C` -- channels, 1 for black and white images, 3 for RGB, 4 for RGB with transparency, more for satellite images 166 | - `H` -- height 167 | - `W` -- width 168 | 169 | `[3, 355, 355]` in Boromir's case 170 | 171 | ## How to store images 172 | 173 | On one hand, there are multiple image formats like jpg, png, bmp, **webp**. 174 | Here we would like to ask _how to store collections of images_. 175 | 176 | It must be a format that makes the images convenient to use in ML. 177 | 178 | ## Storing images and metadata{.smaller} 179 | 180 | We can store images in a `data` folder and keep an additional dataframe like: 181 | 182 | | Split | File | Class | 183 | |-------|---------------------------|-------| 184 | | train | 0013035.jpg | ants | 185 | | train | 1030023514_aad5c608f9.jpg | ants | 186 | | train | 1092977343_cb42b38d62.jpg | bees | 187 | | train | 1093831624_fb5fbe2308.jpg | bees | 188 | | val | 10308379_1b6c72e180.jpg | ants | 189 | | val | 1053149811_f62a3410d3.jpg | ants | 190 | | val | 1032546534_06907fe3b3.jpg | bees | 191 | | val | 10870992_eebeeb3a12.jpg | bees | 192 | | ... | ... | ... | 193 | 194 | 195 | ## Storing images folderwise{.smaller} 196 | 197 | We can just preserve the following directory structure: 198 | 199 | ``` 200 | data 201 | ├── train 202 | │ ├── ants 203 | │ │ ├── 0013035.jpg 204 | │ │ ├── 1030023514_aad5c608f9.jpg 205 | │ │ ... 206 | │ └── bees 207 | │ ├── 1092977343_cb42b38d62.jpg 208 | │ ├── 1093831624_fb5fbe2308.jpg 209 | │ ...
210 | └── val 211 | ├── ants 212 | │ ├── 10308379_1b6c72e180.jpg 213 | │ ├── 1053149811_f62a3410d3.jpg 214 | │ ... 215 | └── bees 216 | ├── 1032546534_06907fe3b3.jpg 217 | ├── 10870992_eebeeb3a12.jpg 218 | ... 219 | ``` 220 | 221 | Working with the data is then a bit harder, but the dataset is easy to extend. 222 | 223 | # Other kinds of data 224 | 225 | ## Other kinds of data 226 | 227 | Of course, there are many more kinds of data, like: 228 | 229 | - Raw text data 230 | - Audio data 231 | - Animation data 232 | 233 | But they tend to follow similar patterns. 234 | 235 | # Data versioning 236 | 237 | ## Data versioning 238 | 239 | If we have a **very** small amount of data, we can just store it in git. 240 | Very often, that quickly stops being enough. 241 | 242 | I recommend using [dvc](https://dvc.org/) as a convenient tool for data versioning. -------------------------------------------------------------------------------- /02_Pandas/03_homework.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Pandas Homework\n", 8 | "\n", 9 | "In this homework you will have to write some queries to answer questions.\n", 10 | "All queries will be based on the [flights dataset](https://www.kaggle.com/datasets/usdot/flight-delays).\n", 11 | "\n", 12 | "Example solutions can be found in the `homework_solutions` directory." 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "1. Find the maximum delay caused by weather (`WEATHER_DELAY`).\n", 20 | "2. Find the maximum arrival delay for planes from `LAX` to `ATL` in December.\n", 21 | "3. Find the tail number of the plane with the maximum arrival delay for planes from `LAX` to `ATL` in December.\n", 22 | "4. Find the minimum arrival delay for planes from `LAX` to `ATL` in December.\n", 23 | "5. 
Find the tail number of the plane with the minimum arrival delay for planes from `LAX` to `ATL` in December.\n", 24 | "6. For the flight with the maximum flight time (`AIR_TIME`) find its flight time (`AIR_TIME`).\n", 25 | "7. For the flight with the maximum flight time (`AIR_TIME`) find its destination airport.\n", 26 | "8. For the flight with the maximum flight time (`AIR_TIME`) find its airline.\n", 27 | "9. For all flights from `ORD` to `AUS` and from `AUS` to `ORD` find the minimum time spent in the air (`AIR_TIME`).\n", 28 | "10. For all flights from `ORD` to `AUS` and from `AUS` to `ORD` find the mean time spent in the air (`AIR_TIME`).\n", 29 | "11. For all flights from `ORD` to `AUS` and from `AUS` to `ORD` find the median time spent in the air (`AIR_TIME`).\n", 30 | "12. For all flights from `ORD` to `AUS` and from `AUS` to `ORD` find the maximum time spent in the air (`AIR_TIME`).\n", 31 | "13. Find the airplane that operates the most flights from Yuma (`YUM`).\n", 32 | "14. Find the number of flights made from Yuma (`YUM`) by the airplane that operates the most flights from Yuma (`YUM`).\n", 33 | "15. Find the route that takes the shortest time to fly (by `AIR_TIME`).\n", 34 | "16. Find the mean time spent in the air on routes from `HNL` to `EWR` and from `EWR` to `HNL`.\n", 35 | "17. Find the mean time spent in the air on the route from `HNL` to `EWR`.\n", 36 | "18. Find the mean time spent in the air on the route from `EWR` to `HNL`."
37 | ] 38 | } 39 | ], 40 | "metadata": { 41 | "kernelspec": { 42 | "display_name": "Python 3.9.12 ('rigplay-lighting')", 43 | "language": "python", 44 | "name": "python3" 45 | }, 46 | "language_info": { 47 | "name": "python", 48 | "version": "3.9.12" 49 | }, 50 | "orig_nbformat": 4, 51 | "vscode": { 52 | "interpreter": { 53 | "hash": "acaec1dd4d4ad1413b15d1459179aaee505991b8d2edc661768082683fde5d51" 54 | } 55 | } 56 | }, 57 | "nbformat": 4, 58 | "nbformat_minor": 2 59 | } 60 | -------------------------------------------------------------------------------- /02_Pandas/README.md: -------------------------------------------------------------------------------- 1 | # Pandas 2 | 3 | This directory contains an introduction to the types of data used in data science/machine learning, followed by an introductory tutorial to `pandas` - the go-to library for dataframes in python. 4 | 5 | As a bonus, I present ways to do the homework in `polars`. 6 | 7 | In this one notebook I covered only the very basics of data wrangling in `pandas`. 8 | If you want a more detailed course on this topic, I highly recommend the book [Minimalist Data Wrangling with Python](https://datawranglingpy.gagolewski.com/) by Marek Gągolewski! 9 | It is available for free to everyone.
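As a small illustrative sketch of the wrangling covered here (made-up values in the spirit of the Middle-earth table from the types-of-data slides, not code from the notebook), the example below also shows the categorical dtype described there, which stores each distinct label once plus small integer codes:

```python
import pandas as pd

# A tiny dataframe in the spirit of the slides' example (values are made up).
df = pd.DataFrame({
    "Name": ["Bilbo", "Frodo", "Aragorn"],
    "Race": ["Hobbit", "Hobbit", "Human"],
    "Salary": [10000, 70000, 60000],
})

# Convert Race to the categorical dtype: labels stored once, rows hold codes.
df["Race"] = df["Race"].astype("category")

# A typical split-apply-combine query: mean salary per race.
mean_salary = df.groupby("Race", observed=True)["Salary"].mean()
```

`groupby` plus an aggregation is the pattern most of the pandas homework tasks reduce to.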
10 | -------------------------------------------------------------------------------- /02_Pandas/images/de_wiki_problem.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/02_Pandas/images/de_wiki_problem.png -------------------------------------------------------------------------------- /02_Pandas/images/one_does.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/02_Pandas/images/one_does.jpg -------------------------------------------------------------------------------- /02_Pandas/images/weird_line.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/02_Pandas/images/weird_line.png -------------------------------------------------------------------------------- /03_Plots/02_homework.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# ~Plots~ Pandas Homework Part 2\n", 8 | "\n", 9 | "Surprise, this homework will also be about `pandas`, as it's crucial to master your `pandas` skills in the world of Data Science.\n", 10 | "\n", 11 | "This time you will use data from Travel Stack Exchange and Wikipedia.\n", 12 | "\n", 13 | "Example solutions can be found in the `homework_solutions` directory.\n", 14 | "This time two versions of the solutions are available:\n", 15 | "using pandas (1) and polars (2).\n", 16 | "\n", 17 | "## Wikipedia clickstream\n", 18 | "\n", 19 | "This [dataset](https://dumps.wikimedia.org/other/clickstream/readme.html) \n", 20 | "contains information on how Wikipedia users move around the website.\n", 21 | "You will 
work on [the data from March 2022](https://dumps.wikimedia.org/other/clickstream/2022-03/).\n", 22 | "The data format is [available here](https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream#Format).\n", 23 | "\n", 24 | "## Stack Exchange\n", 25 | "\n", 26 | "The second dataset contains information on posts (but not only) of [Stack users](https://archive.org/download/stackexchange).\n", 27 | "It may surprise you, but all the posts from Stack Overflow and related sites (like Math Stack Exchange) are available for analysis!\n", 28 | "We will focus on data coming from the [**travel**](https://archive.org/download/stackexchange/travel.stackexchange.com.7z) stack!\n", 29 | "\n", 30 | "A tip for the task on `UpVotes` and `DownVotes`:\n", 31 | "take a look at the column `VoteTypeId` in the dataframe `Votes`.\n", 32 | "It tells you the type of each vote.\n", 33 | "Each `VoteTypeId` corresponds to a different type of vote.\n", 34 | "Information on what is an upvote, what is a downvote, and which answer has been accepted is [available here](https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede).\n", 35 | "\n", 36 | "![](images/votes.png)" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "### Tasks\n", 44 | "\n", 45 | "1. Find the article with the most external entries on German Wikipedia in March 2022.\n", 46 | "2. Find the article with the most external entries on Polish Wikipedia in March 2022.\n", 47 | "3. Find the `DisplayName` of the user with the most badges.\n", 48 | "4. Find the `Location` of the user with the most badges.\n", 49 | "5. Find the number of entries on the article about the city from question number 4 on English Wikipedia in March 2022.\n", 50 | "6. Find the most common word with at least 8 letters in all posts.\n", 51 | "7. Find the number of occurrences of the most common word with at least 8 letters in all posts.\n", 52 | "8. 
For the post with the largest difference between upvotes and downvotes find its author's `DisplayName`.\n", 53 | "9. For the post with the largest difference between upvotes and downvotes find that difference.\n", 54 | "10. Find the month in which the most posts were created.\n", 55 | "11. Find the month in which there was the biggest decrease in the number of created posts.\n", 56 | "12. Find the most common tag among posts created by users from Poland (the column `Location` should contain `Poland` or `Polska`)." 57 | ] 58 | } 59 | ], 60 | "metadata": { 61 | "kernelspec": { 62 | "display_name": "Python 3.9.12 ('rigplay-lighting')", 63 | "language": "python", 64 | "name": "python3" 65 | }, 66 | "language_info": { 67 | "name": "python", 68 | "version": "3.9.12" 69 | }, 70 | "orig_nbformat": 4, 71 | "vscode": { 72 | "interpreter": { 73 | "hash": "acaec1dd4d4ad1413b15d1459179aaee505991b8d2edc661768082683fde5d51" 74 | } 75 | } 76 | }, 77 | "nbformat": 4, 78 | "nbformat_minor": 2 79 | } 80 | -------------------------------------------------------------------------------- /03_Plots/README.md: -------------------------------------------------------------------------------- 1 | # Plots 2 | 3 | This directory contains a notebook with examples of plotting in python. 4 | We cover two libraries: `matplotlib` and `plotly.express`. 5 | 6 | ## `matplotlib` 7 | 8 | This library is usually used if you need a quick, dead-simple plot or a plot that will be exported to pdf. 9 | Lots and lots of [tutorials are available for `matplotlib` online](https://matplotlib.org/stable/tutorials/index.html). 10 | 11 | ## `plotly.express` 12 | 13 | This is a [sugar-syntaxed version of the `plotly` library](https://plotly.com/python/plotly-express/). 14 | It lets you create interactive and visually appealing plots from data frames in a very simple way.
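For a taste of the `matplotlib` side, here is a minimal sketch (the data is synthetic; the `Agg` backend is selected so the script also runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data, just to have something to draw
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

fig, ax = plt.subplots()
ax.plot(x, y, label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A quick matplotlib plot")
ax.legend()
```

Calling `fig.savefig("plot.pdf")` would then export it, which is exactly the pdf use case mentioned above.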
15 | -------------------------------------------------------------------------------- /03_Plots/images/votes.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/03_Plots/images/votes.png -------------------------------------------------------------------------------- /04_Scikit-learn/.gitignore: -------------------------------------------------------------------------------- 1 | 04_prepare_hw.ipynb -------------------------------------------------------------------------------- /04_Scikit-learn/01_machine_learning.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Introduction to Data Science in Python by Appsilon" 3 | subtitle: "Machine Learning" 4 | author: "Piotr Pasza Storożenko@Appsilon" 5 | format: 6 | revealjs: 7 | theme: [dark] 8 | embed-resources: true 9 | 10 | editor_options: 11 | markdown: 12 | wrap: 99 13 | --- 14 | 15 | # Problems machine learning solves 16 | 17 | ## How do we divide ML 18 | 19 | - Supervised (pol. Nadzorowane) 20 | - Classification (pol. Klasyfikacja) 21 | - Regression (pol. Regresja) 22 | - Unsupervised (pol. Nienadzorowane) 23 | - Clustering (pol. Analiza skupień) 24 | 25 | ## Classification 26 | 27 | Our aim is to predict a **discrete** value. 28 | 29 | . . . 30 | 31 | Example classification problems: 32 | 33 | - Model that distinguishes between cats and dogs 34 | - Document classification into given classes 35 | - Deciding whether a tweet is toxic or not 36 | - Product categorization 37 | - Fraud detection 38 | 39 | ## Regression 40 | 41 | Our aim is to predict a **continuous** value. 42 | 43 | . . . 44 | 45 | Example regression problems: 46 | 47 | - Real estate price prediction 48 | - Sales revenue prediction 49 | - Age estimation 50 | 51 | ## Clustering 52 | 53 | Our aim is to **find similar** entities. 54 | 55 | . . . 
56 | 57 | Example clustering problems: 58 | 59 | - Find similar products 60 | - Suggest similar artists 61 | 62 | # Basic rules of approaching ML problems 63 | 64 | ## Metric 65 | 66 | To solve an ML problem we must be able to say which model is good and which is bad. 67 | 68 | The single number that decides whether the model is good or bad is called a **metric**. 69 | 70 | With a metric we can ask the computer to find _the best model_. 71 | 72 | We use different metrics in different tasks. 73 | 74 | ## Example regression metrics: 75 | 76 | - Mean square error 77 | - Mean absolute error 78 | 79 | ## Example classification metrics 80 | 81 | :::: {.columns} 82 | 83 | ::: {.column width="40%"} 84 | - Accuracy 85 | - Precision 86 | - Recall 87 | - F1 88 | 89 | ::: 90 | 91 | ::: {.column width="60%"} 92 | ![](images/Precisionrecall.png){height="600"} 93 | 94 | ::: 95 | 96 | :::: 97 | 98 | ## What is learned from what{.smaller} 99 | 100 | We start with the whole dataset and divide its columns into `X` and `y`. 101 | 102 | - `X` is a matrix whose columns are the features that will be used to train the model 103 | - `y` is the **target** vector 104 | 105 | Examples: 106 | 107 | In the case of flat price prediction, `X` will consist of columns like flat size, number of rooms, number of bathrooms, an indication of whether the flat has a balcony, and so on. `y` will be the price. 108 | 109 | In the case of cats vs dogs prediction, `X` contains the pixel values and `y` is the label `cat`/`dog`. 110 | 111 | ## Train test split 112 | 113 | We usually want to train the model on part of the data and check its performance on the rest. 114 | 115 | The basic approach to this is the train test split: 116 | 117 | ![](images/traintest.png) 118 | 119 | :::{.footer} 120 | Source: https://towardsdatascience.com/understanding-train-test-split-scikit-learn-python-ea676d5e3d1 121 | ::: 122 | 123 | 124 | ## Train test split{.smaller} 125 | 126 | ![](images/traintest.png) 127 | 128 | 1. Train model `m` on `X_train` and `y_train`. 129 | 2. 
Get predictions `y_pred` by evaluating model `m` on `X_test`. 130 | 3. Calculate the metric by comparing `y_pred` and `y_test`. 131 | 132 | :::{.footer} 133 | Source: https://towardsdatascience.com/understanding-train-test-split-scikit-learn-python-ea676d5e3d1 134 | ::: 135 | 136 | # The most common column transformations 137 | 138 | The majority of models expect the input to be in the form of numeric values. 139 | Often that's not the case... 140 | 141 | ## One-hot encoding 142 | 143 | The easiest way to convert categorical values to numeric ones is the so-called one-hot encoding. 144 | 145 | ![](images/onehot.png) 146 | 147 | :::{.footer} 148 | Source: https://towardsdatascience.com/building-a-one-hot-encoding-layer-with-tensorflow-f907d686bf39 149 | ::: 150 | 151 | ## Smarter `NA` trick{.smaller} 152 | 153 | We've learned how to fill `NA` with zeros in pandas. 154 | 155 | Sometimes filling with the median/mean value makes more sense. 156 | It's worth experimenting and trying different approaches yourself! 157 | 158 | A smart trick to get more information from `NA` is to, apart from filling it, add a new column with a boolean value, like so: 159 | 160 | ![](images/NA_trick.png) -------------------------------------------------------------------------------- /04_Scikit-learn/03_homework.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Sklearn Homework\n", 8 | "\n", 9 | "In this homework you will work on 3 files that can be found in the `data/housing/*` directory:\n", 10 | "\n", 11 | "1. `housing_train.csv` - training data\n", 12 | "2. `housing_validation.csv` - validation data, you will evaluate your model on this data\n", 13 | "3. 
`housing_example_submission.csv` - example submission of your model\n", 14 | "\n", 15 | "You will split your work into two parts.\n", 16 | "\n", 17 | "## Train a simple linear regression model\n", 18 | "\n", 19 | "First, apply the following preprocessing to the whole `housing_train.csv`:\n", 20 | "\n", 21 | "1. Change columns with more than 70% `NA` values into `NA_in_col_*` columns, following the instructions from the presentation. **Remove the original column**.\n", 22 | "2. Fill the rest of the `NA`s in other columns with the **median** of the particular column.\n", 23 | "\n", 24 | "With such a dataframe, train a linear regression model that will predict the `MEDV` column.\n", 25 | "Using this model, answer the following questions:\n", 26 | "\n", 27 | "1. Predict the price of the observation `rec` (defined below).\n", 28 | "2. How much would the prediction change if we increased the `RM` value by 2?" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 1, 34 | "metadata": {}, 35 | "outputs": [ 36 | { 37 | "data": { 38 | "text/html": [ 39 | "
\n", 40 | "\n", 53 | "\n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | "
CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTAT
03.6911.3711.150.070.876.2968.913.779.5410.9518.37354.470.79
\n", 91 | "
" 92 | ], 93 | "text/plain": [ 94 | " CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO \\\n", 95 | "0 3.69 11.37 11.15 0.07 0.87 6.29 68.91 3.77 9.5 410.95 18.37 \n", 96 | "\n", 97 | " B LSTAT \n", 98 | "0 354.47 0.79 " 99 | ] 100 | }, 101 | "execution_count": 1, 102 | "metadata": {}, 103 | "output_type": "execute_result" 104 | } 105 | ], 106 | "source": [ 107 | "import pandas as pd\n", 108 | "\n", 109 | "rec = pd.DataFrame({\n", 110 | " 'CRIM': [3.69],\n", 111 | " 'ZN': [11.37],\n", 112 | " 'INDUS': [11.15],\n", 113 | " 'CHAS': [0.07],\n", 114 | " 'NOX': [0.87],\n", 115 | " 'RM': [6.29],\n", 116 | " 'AGE': [68.91],\n", 117 | " 'DIS': [3.77],\n", 118 | " 'RAD': [9.50],\n", 119 | " 'TAX': [410.95],\n", 120 | " 'PTRATIO': [18.37],\n", 121 | " 'B': [354.47],\n", 122 | " 'LSTAT': [0.79],\n", 123 | "})\n", 124 | "rec" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "## Train the best model regression you can\n", 132 | "\n", 133 | "Now you are tasked with a real ML case.\n", 134 | "Using the acquired knowledge and the [(brilliant) `scikit-learn` documentation](https://scikit-learn.org/stable/) train the best model possible!\n", 135 | "\n", 136 | "Using the `housing_train.csv` file, train the best model you can by minimizing **Mean Square Error** metric, then make prediction on `housing_validation.csv`.\n", 137 | "Prediction should be saved into a file with the same format as `housing_example_submission.csv` file (i.e. ensure the correct column name and lack of index. 
`s.to_csv(filename, index=False)` should work).\n", 138 | "Of course, the `housing_validation.csv` file doesn't contain the `MEDV` column; you have to predict it.\n", 139 | "\n", 140 | "After saving the results, save the model as well.\n", 141 | "This can be done for example using [`pickle`](https://scikit-learn.org/stable/model_persistence.html).\n" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "### Tips\n", 149 | "\n", 150 | "When you work on filling `NA`s with the median, take a look at how I did it in the presentation.\n", 151 | "\n", 152 | "In the second part, experiment with both preprocessing and modeling.\n", 153 | "\n", 154 | "Remember that training on `housing_train.csv` and checking the model performance on the same dataset can be very misleading.\n", 155 | "\n", 156 | "You will probably want to split `housing_train.csv` into train and test (independently of `housing_validation.csv`).\n", 157 | "Create the best model using those two datasets and only then make predictions on `housing_validation.csv`.\n", 158 | "This might be a good moment to understand what [cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html) is.\n", 159 | "How good is your prediction? 
You will check it in the next lesson!\n", 160 | "\n", 161 | "Apart from `sklearn` you can try [xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.sklearn) and [lightgbm](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html); they should work as drop-in replacements.\n" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "### Context\n", 169 | "\n", 170 | "On websites like [kaggle](https://www.kaggle.com/) you can challenge yourself with ML tasks.\n", 171 | "Usually the competition looks similar to this task.\n", 172 | "You get one csv file with training data and have to evaluate the best model on another.\n", 173 | "\n", 174 | "## Additional information on the dataset\n", 175 | "\n", 176 | "This is a modified Boston housing dataset, a _classic_ dataset used for learning ML.\n", 177 | "\n", 178 | "(Not fully accurate) information on columns' content:\n", 179 | "\n", 180 | "```\n", 181 | "1. CRIM per capita crime rate by town\n", 182 | "2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.\n", 183 | "3. INDUS proportion of non-retail business acres per town\n", 184 | "4. CHAS Charles River dummy variable (= 1 if tract bounds \n", 185 | " river; 0 otherwise)\n", 186 | "5. NOX nitric oxides concentration (parts per 10 million)\n", 187 | "6. RM average number of rooms per dwelling\n", 188 | "7. AGE proportion of owner-occupied units built prior to 1940\n", 189 | "8. DIS weighted distances to five Boston employment centres\n", 190 | "9. RAD index of accessibility to radial highways\n", 191 | "10. TAX full-value property-tax rate per $10,000\n", 192 | "11. PTRATIO pupil-teacher ratio by town\n", 193 | "12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n", 194 | "13. LSTAT % lower status of the population\n", 195 | "14. 
MEDV Median value of owner-occupied homes in $1000's\n", 196 | "```\n", 197 | "\n" 198 | ] 199 | } 200 | ], 201 | "metadata": { 202 | "interpreter": { 203 | "hash": "306379f83cd7ad906f1941e4fa5d943b9cb847ce8f1ed425d8b7a18353685fed" 204 | }, 205 | "kernelspec": { 206 | "display_name": "Python 3.10.4 ('daftacademy-ds')", 207 | "language": "python", 208 | "name": "python3" 209 | }, 210 | "language_info": { 211 | "codemirror_mode": { 212 | "name": "ipython", 213 | "version": 3 214 | }, 215 | "file_extension": ".py", 216 | "mimetype": "text/x-python", 217 | "name": "python", 218 | "nbconvert_exporter": "python", 219 | "pygments_lexer": "ipython3", 220 | "version": "3.10.4" 221 | }, 222 | "orig_nbformat": 4 223 | }, 224 | "nbformat": 4, 225 | "nbformat_minor": 2 226 | } 227 | -------------------------------------------------------------------------------- /04_Scikit-learn/README.md: -------------------------------------------------------------------------------- 1 | # Scikit-learn 2 | 3 | We're done with data processing for some time; now we start doing machine learning! 4 | Start with the presentation introducing machine learning, proceed to the notebook on linear regression, and then do the homework! 5 | 6 | Good luck! 
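The core workflow from the lesson can be sketched in a few lines (the data below is synthetic, not the course dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data: y = 3*x + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + 1 + rng.normal(0, 0.1, size=200)

# Hold out part of the data to measure performance honestly
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
```

The homework follows the same train/predict/score loop, just on the housing data and with your own preprocessing.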
7 | -------------------------------------------------------------------------------- /04_Scikit-learn/images/NA_trick.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/04_Scikit-learn/images/NA_trick.png -------------------------------------------------------------------------------- /04_Scikit-learn/images/Precisionrecall.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/04_Scikit-learn/images/Precisionrecall.png -------------------------------------------------------------------------------- /04_Scikit-learn/images/na_example.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /04_Scikit-learn/images/onehot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/04_Scikit-learn/images/onehot.png -------------------------------------------------------------------------------- /04_Scikit-learn/images/regression.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/04_Scikit-learn/images/regression.png -------------------------------------------------------------------------------- /04_Scikit-learn/images/traintest.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/04_Scikit-learn/images/traintest.png 
-------------------------------------------------------------------------------- /05_SharingWork/01_streamlit/01_hello_world/README.md: -------------------------------------------------------------------------------- 1 | # Hello World in Streamlit 2 | 3 | Simple apps; run each with e.g. `streamlit run main1.py`. 4 | -------------------------------------------------------------------------------- /05_SharingWork/01_streamlit/01_hello_world/main1.py: -------------------------------------------------------------------------------- 1 | import streamlit as st 2 | 3 | st.title("My first streamlit app") 4 | st.write("Hello world!") 5 | -------------------------------------------------------------------------------- /05_SharingWork/01_streamlit/01_hello_world/main2.py: -------------------------------------------------------------------------------- 1 | import streamlit as st 2 | 3 | # This might look like magic, and it is, since we're running the app with 4 | # streamlit run main2.py 5 | 6 | "# My second streamlit app" 7 | "Hello world!" 
8 | -------------------------------------------------------------------------------- /05_SharingWork/01_streamlit/01_hello_world/main3.py: -------------------------------------------------------------------------------- 1 | import streamlit as st 2 | import numpy as np 3 | import pandas as pd 4 | import plotly.express as px 5 | import matplotlib.pyplot as plt 6 | 7 | 8 | "# My third streamlit app" 9 | r"Plot of $y = \sin(x)^2$" 10 | 11 | 12 | x = np.r_[0:2*np.pi:100j] 13 | y = np.sin(x) ** 2 14 | df = pd.DataFrame({ 15 | 'x': x, 16 | 'y': y, 17 | }) 18 | st.line_chart(y) 19 | 20 | 21 | fig, ax = plt.subplots() 22 | ax.plot(x, y) 23 | ax.set_title("The sine squared plot") 24 | st.pyplot(fig) 25 | 26 | p = px.line(df, 'x', 'y') 27 | st.plotly_chart(p) 28 | 29 | st.dataframe(df) 30 | 31 | st.table(df.head(20)) 32 | -------------------------------------------------------------------------------- /05_SharingWork/01_streamlit/02_small_report/README.md: -------------------------------------------------------------------------------- 1 | # Small report 2 | 3 | A simple app presenting the dataframe capabilities of `streamlit`. 
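The kind of aggregation the app performs can be sketched in plain `pandas` (a few made-up rows stand in for the real `flights.csv`):

```python
import pandas as pd

# Made-up flight records standing in for flights.csv
flights = pd.DataFrame({
    "ORIGIN_AIRPORT": ["SFO", "SFO", "JFK", "JFK", "JFK"],
    "DESTINATION_AIRPORT": ["LAX", "LAX", "LAX", "LAX", "ORD"],
    "ARRIVAL_DELAY": [5, 15, -3, 9, 30],
})

# Mean delay and number of flights per origin, for one destination
summary = (
    flights[flights["DESTINATION_AIRPORT"] == "LAX"]
    .groupby("ORIGIN_AIRPORT")
    .agg(mean_delay=("ARRIVAL_DELAY", "mean"), count=("ARRIVAL_DELAY", "count"))
    .reset_index()
)
```

In the app the destination comes from `st.text_input` and the resulting table is shown with `st.write`.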
-------------------------------------------------------------------------------- /05_SharingWork/01_streamlit/02_small_report/main.py: -------------------------------------------------------------------------------- 1 | # to verify 2 | 3 | import pandas as pd 4 | import streamlit as st 5 | import plotly.express as px 6 | 7 | st.config.dataFrameSerialization = "arrow" 8 | 9 | 10 | @st.cache_data 11 | def load_data(): 12 |     flights = pd.read_csv("../../../data/flights/flights.csv", engine="pyarrow") 13 |     flights["DATE"] = pd.to_datetime( 14 |         flights["YEAR"].astype(str) + "-" + flights["MONTH"].astype(str) + "-" + flights["DAY"].astype(str) 15 |     ) 16 |     airports = pd.read_csv("../../../data/flights/airports.csv", engine="pyarrow") 17 |     return flights, airports 18 | 19 | 20 | flights, airports = load_data() 21 | 22 | "# Small flights application" 23 | 24 | "Let's recall some analysis from earlier classes." 25 | 26 | airport_input = st.text_input("Airport") 27 | 28 | flights_lax = ( 29 |     flights.query("DESTINATION_AIRPORT == @airport_input") 30 |     .groupby("ORIGIN_AIRPORT") 31 |     .agg({"ARRIVAL_DELAY": ["mean", "count"]}) 32 |     .reset_index(col_level=1) 33 | ) 34 | flights_lax.columns = flights_lax.columns.droplevel(0) 35 | flights_lax = ( 36 |     flights_lax 37 |     .merge(airports, left_on="ORIGIN_AIRPORT", right_on="IATA_CODE") 38 |     # .loc[:, ["AIRPORT","ARRIVAL_DELAY", "Count"]] 39 | ) 40 | 41 | st.write(flights_lax) 42 | 43 | flights_d = flights.query("DESTINATION_AIRPORT == @airport_input").value_counts("DATE").reset_index().sort_values("DATE").rename(columns={0: "Count"}) 44 | 45 | p = px.line(flights_d, "DATE", "Count") 46 | 47 | st.plotly_chart(p) 48 | -------------------------------------------------------------------------------- /05_SharingWork/01_streamlit/03_mc_simulation/README.md: -------------------------------------------------------------------------------- 1 | # MC Simulations 2 | 3 | This application demonstrates an interactive dashboard 
presenting the simulation from the first homework. -------------------------------------------------------------------------------- /05_SharingWork/01_streamlit/03_mc_simulation/main.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | import numpy as np 3 | import streamlit as st 4 | 5 | 6 | def simulation(n): 7 |     x = np.random.rand(n) * 4 - 2 8 |     yr = np.random.rand(n) 9 |     yf = np.exp(-(x**2)) 10 |     A = (yr < yf).mean() * 4 11 | 12 |     fig, ax = plt.subplots() 13 |     ax.plot(x[yf <= yr], yr[yf <= yr], ".b") 14 |     ax.plot(x[yf > yr], yr[yf > yr], ".r") 15 | 16 |     return fig, A 17 | 18 | 19 | "# MC simulation" 20 | 21 | r"In this app we will calculate the area under the function $\exp(-x^2)$ from $-2$ to $2$." 22 | 23 | n = st.number_input("n", value=1_000, min_value=1, max_value=10_000_000, format="%d") 24 | 25 | if st.button("Simulate!"): 26 |     with st.spinner(text="Simulating!"): 27 |         fig, A = simulation(n) 28 |         st.pyplot(fig) 29 |         f"Calculated area under the function: {A.round(3)}" 30 | -------------------------------------------------------------------------------- /05_SharingWork/01_streamlit/04_dishes/README.md: -------------------------------------------------------------------------------- 1 | # Favorite dish application 2 | 3 | This app shows how to easily create an application used by multiple users. 4 | Everyone can pick their favorite food from the list or add their own. 5 | 6 | Note that the app is _ill-designed_, with a clear data race on the `dishes.json` file. 
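One way to remove the race would be to keep the counts in SQLite, which serializes writes, instead of a JSON file; a rough sketch using only the standard library (the table and file names are invented here):

```python
import sqlite3

# The app could use a file like "dishes.db"; ":memory:" keeps this sketch self-contained
con = sqlite3.connect(":memory:")
with con:
    con.execute("CREATE TABLE IF NOT EXISTS dishes (name TEXT PRIMARY KEY, votes INTEGER)")

def vote(con, dish):
    # A single atomic upsert: concurrent users cannot lose each other's votes
    with con:
        con.execute(
            "INSERT INTO dishes (name, votes) VALUES (?, 1) "
            "ON CONFLICT(name) DO UPDATE SET votes = votes + 1",
            (dish,),
        )

vote(con, "Pierogi")
vote(con, "Pierogi")
counts = dict(con.execute("SELECT name, votes FROM dishes"))
```

The `05_checker` app below takes exactly this route and stores its results in SQLite.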
7 | -------------------------------------------------------------------------------- /05_SharingWork/01_streamlit/04_dishes/dishes.json: -------------------------------------------------------------------------------- 1 | {} -------------------------------------------------------------------------------- /05_SharingWork/01_streamlit/04_dishes/favorite_dish.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | import pandas as pd 4 | import plotly.express as px 5 | import streamlit as st 6 | 7 | 8 | st.set_page_config(page_title="Favorite dish", page_icon="🥘") 9 | 10 | with open("dishes.json") as f: 11 | dishes = json.load(f) 12 | 13 | st.title("What is your favorite dish?") 14 | st.write("Pick from the list or add your own") 15 | form = st.form(key="dish") 16 | available_dishes = sorted(dishes.keys()) 17 | dish_radio = form.selectbox("Pick from the list", available_dishes) 18 | dish_text = form.text_input( 19 | "Or type if it's not on the list:", 20 | placeholder="Leave empty to choose from the list", 21 | ) 22 | submit = form.form_submit_button("Submit") 23 | 24 | if submit: 25 | if dish_text != "": 26 | key = dish_text.lower().capitalize() 27 | else: 28 | key = dish_radio 29 | 30 | if key not in dishes: 31 | dishes[key] = 1 32 | else: 33 | dishes[key] += 1 34 | 35 | with open("dishes.json", "w") as f: 36 | json.dump(dishes, f) 37 | 38 | if st.button("Refresh"): 39 | pass 40 | df = pd.DataFrame(dishes.items(), columns=["Dish", "Count"]) 41 | st.plotly_chart(px.bar(df, x="Dish", y="Count")) 42 | -------------------------------------------------------------------------------- /05_SharingWork/01_streamlit/05_checker/README.md: -------------------------------------------------------------------------------- 1 | # Homework checker 2 | 3 | This app has been designed to check the prediction part of homework from the previous week! 4 | Run it and upload your predictions. 
5 | 6 | The design allows running the app on a server with multiple users connecting and uploading their results. 7 | 8 | ## Running the checker 9 | 10 | The database already contains a LightGBM model result; if you want to create a clean database, run the `create_db.py` script. 11 | You can run the app with `streamlit run main.py`. 12 | -------------------------------------------------------------------------------- /05_SharingWork/01_streamlit/05_checker/create_db.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | 3 | con = sqlite3.connect("results.db") 4 | 5 | with con: 6 |     con.execute( 7 |         """CREATE TABLE results 8 |         (nickname text, email text, score real, filename text)""" 9 |     ) 10 | -------------------------------------------------------------------------------- /05_SharingWork/01_streamlit/05_checker/files/.gitignore: -------------------------------------------------------------------------------- 1 | *.csv -------------------------------------------------------------------------------- /05_SharingWork/01_streamlit/05_checker/main.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | import uuid 3 | 4 | import pandas as pd 5 | import streamlit as st 6 | from sklearn.metrics import mean_squared_error 7 | 8 | 9 | con = sqlite3.connect("results.db", check_same_thread=False) 10 | stmt = "INSERT INTO results (nickname, email, score, filename) values (?, ?, ?, ?)" 11 | 12 | y_true = pd.read_csv("y_true.csv").iloc[:, 0] 13 | 14 | "# Week 4 Homework Checker" 15 | 16 | with st.form("my_form"): 17 |     student_nickname = st.text_input("Your nickname") 18 |     student_email = st.text_input("Your e-mail", help="The one you've used for registration") 19 |     uploaded_file = st.file_uploader("Upload your homework") 20 |     submitted = st.form_submit_button("Submit") 21 | info_element = st.info("You must fill the name, mail and upload the file, then press `Submit`") 22 | if 
student_nickname != "" and student_email != "" and uploaded_file is not None and submitted: 23 | df = pd.read_csv(uploaded_file) 24 | filename = f"files/{uuid.uuid4()}.csv" 25 | df.to_csv(filename, index=False) 26 | y_pred = df["MEDV"] 27 | score = mean_squared_error(y_true, y_pred) 28 | fields = (student_nickname, student_email, score, filename) 29 | with con: 30 | con.execute(stmt, fields) 31 | info_element.success("Result recorded") 32 | elif submitted: 33 | info_element.error("You must fill the name, mail and upload the file, then press `Submit`") 34 | 35 | "## Top 10 Results:" 36 | 37 | results = pd.read_sql_query("SELECT * FROM results", con) 38 | 39 | if st.button("Refresh"): 40 | pass 41 | 42 | st.table( 43 | results.sort_values("score") 44 | .groupby("email") 45 | .first() 46 | .reset_index() 47 | .loc[:, ["nickname", "score"]] 48 | .sort_values("score") 49 | .head(10) 50 | ) 51 | -------------------------------------------------------------------------------- /05_SharingWork/01_streamlit/05_checker/results.db: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/05_SharingWork/01_streamlit/05_checker/results.db -------------------------------------------------------------------------------- /05_SharingWork/01_streamlit/05_checker/y_true.csv: -------------------------------------------------------------------------------- 1 | MEDV 2 | 101.21033698173383 3 | 138.91314641725276 4 | 58.23319106873045 5 | 97.65902371130434 6 | 68.92553585303733 7 | 85.65052981879218 8 | 76.29113025829184 9 | 59.98630034438092 10 | 83.92613343134433 11 | 72.01118151457328 12 | 92.16125152937637 13 | 80.94334883753203 14 | 29.98226780875327 15 | 90.88349143008415 16 | 79.19801777762241 17 | 127.72797082370907 18 | 80.61791173105397 19 | 43.67256295325555 20 | 214.14708398667176 21 | 60.48215847691291 22 | 108.01116698952946 23 | 124.59504671034674 24 
| 54.4307463870796 25 | 95.98739923815477 26 | 60.86837265398642 27 | 59.13618569199229 28 | 86.89390306707966 29 | 63.85046694579307 30 | 92.97001454452686 31 | 78.42034935253015 32 | 99.07810016202895 33 | 101.96230141956615 34 | 64.20260696852783 35 | 89.04957792817417 36 | 81.87304637855341 37 | 83.19084441481127 38 | 148.5403003078944 39 | 83.537348656387 40 | 104.51152581320432 41 | 100.36999194026664 42 | 84.38958140755855 43 | 120.97578159520806 44 | 214.27920633629162 45 | 74.55494176812948 46 | 96.8776373142841 47 | 64.73835525070551 48 | 56.146704596483936 49 | 103.78223083506998 50 | 85.26735746265928 51 | 102.81215133358074 52 | 80.90933117433606 53 | 151.62963937737572 54 | 65.19437154351274 55 | 113.66647394288508 56 | 186.43807621201492 57 | 90.76212320268449 58 | 78.86001935421702 59 | 122.02964510092482 60 | 102.52896260509968 61 | 79.30402130979557 62 | 107.22373036030697 63 | 151.6943977652074 64 | 134.88626011767445 65 | 86.62827000798177 66 | 103.29393469599242 67 | 85.71805278777714 68 | 56.15467383789862 69 | 106.17114827729353 70 | 132.08205032646725 71 | 54.416540325330224 72 | 85.79145405259666 73 | 101.53077753401259 74 | 46.293398204646294 75 | 88.31330943238713 76 | 89.02505708508217 77 | 21.40087579954311 78 | -------------------------------------------------------------------------------- /05_SharingWork/02_quarto/README.md: -------------------------------------------------------------------------------- 1 | # Quarto demo 2 | 3 | Quarto is a tool for converting your notebooks into reports and presentations. 4 | But it can do even more! 5 | 6 | You should check out their [official website](https://quarto.org/) to download the tool, install it, and play with the numerous tutorials. 7 | 8 | To create the HTML report, you just run `quarto render report.ipynb`. 9 | Exporting to PDF requires specifying the output format, like so: `quarto render report.ipynb -t pdf`. 
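This folder also contains `report.docx`; assuming the same format-flag pattern as the PDF export (the exact format name is an assumption here — `quarto render --help` lists the supported targets), the Word version can presumably be produced by switching the target format:

```
quarto render report.ipynb -t docx
```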
10 | 11 | If you want to work with the report in an interactive way, I recommend checking out `quarto preview report.ipynb`. 12 | -------------------------------------------------------------------------------- /05_SharingWork/02_quarto/report.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/05_SharingWork/02_quarto/report.docx -------------------------------------------------------------------------------- /05_SharingWork/02_quarto/report.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/05_SharingWork/02_quarto/report.pdf -------------------------------------------------------------------------------- /05_SharingWork/03_fastapi/README.md: -------------------------------------------------------------------------------- 1 | # FastAPI 2 | 3 | FastAPI is a brilliant library for creating REST APIs in Python. 4 | This is the main way different services communicate on the internet. 5 | 6 | If you don't know what a REST API is, go and run the examples! 7 | It will be much easier to understand then. 8 | -------------------------------------------------------------------------------- /05_SharingWork/03_fastapi/boston_model_prediction/README.md: -------------------------------------------------------------------------------- 1 | # Boston Model Prediction API 2 | 3 | Script `boston_api.py` shows a PoC example of how one can deploy an `sklearn` model. 4 | Note that this implementation lacks many key parts, such as: 5 | 6 | 1. Data validation 2. Data preprocessing, e.g. filling `NA` values as during training. 3. 
Error handling 9 | 10 | Run with: 11 | 12 | ``` 13 | uvicorn boston_api:app 14 | ``` 15 | -------------------------------------------------------------------------------- /05_SharingWork/03_fastapi/boston_model_prediction/boston_api.py: -------------------------------------------------------------------------------- 1 | from fastapi import FastAPI, Depends 2 | import pandas as pd 3 | import pickle 4 | from pydantic import BaseModel 5 | from starlette.responses import RedirectResponse 6 | 7 | 8 | app = FastAPI() 9 | state = {} 10 | 11 | 12 | @app.on_event("startup") 13 | async def load_model(): 14 | with open("model_lgbm_regressor.pkl", "rb") as f: 15 | state["model"] = pickle.load(f) 16 | 17 | 18 | class BostonData(BaseModel): 19 | CRIM: float 20 | ZN: float 21 | INDUS: float 22 | CHAS: float 23 | NOX: float 24 | RM: float 25 | AGE: float 26 | DIS: float 27 | RAD: float 28 | TAX: float 29 | PTRATIO: float 30 | B: float 31 | LSTAT: float 32 | 33 | 34 | @app.get("/", include_in_schema=False) 35 | async def index(): 36 | return RedirectResponse(url="/docs") 37 | 38 | 39 | @app.post("/boston_prediction") 40 | async def boston_prediction(boston_X: BostonData = Depends()): 41 | X = pd.DataFrame( 42 | { 43 | "CRIM": [boston_X.CRIM], 44 | "ZN": [boston_X.ZN], 45 | "INDUS": [boston_X.INDUS], 46 | "CHAS": [boston_X.CHAS], 47 | "NOX": [boston_X.NOX], 48 | "RM": [boston_X.RM], 49 | "AGE": [boston_X.AGE], 50 | "DIS": [boston_X.DIS], 51 | "RAD": [boston_X.RAD], 52 | "TAX": [boston_X.TAX], 53 | "PTRATIO": [boston_X.PTRATIO], 54 | "B": [boston_X.B], 55 | "LSTAT": [boston_X.LSTAT], 56 | } 57 | ) 58 | 59 | pred = state["model"].predict(X) 60 | 61 | return {"price": pred[0]} 62 | -------------------------------------------------------------------------------- /05_SharingWork/03_fastapi/boston_model_prediction/model_lgbm_regressor.pkl: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/05_SharingWork/03_fastapi/boston_model_prediction/model_lgbm_regressor.pkl -------------------------------------------------------------------------------- /05_SharingWork/03_fastapi/image_prediction/README.md: -------------------------------------------------------------------------------- 1 | # Image Prediction API 2 | 3 | Script `image_api.py` shows a PoC example of an _image classification_ API. 4 | In reality it just returns the brightness of the provided image, but one can extend it any way they want. 5 | 6 | Run with: 7 | 8 | ``` 9 | uvicorn image_api:app 10 | ``` 11 | -------------------------------------------------------------------------------- /05_SharingWork/03_fastapi/image_prediction/image_api.py: -------------------------------------------------------------------------------- 1 | import io 2 | 3 | from fastapi import FastAPI, File, UploadFile 4 | import numpy as np 5 | from PIL import Image 6 | from starlette.responses import RedirectResponse 7 | 8 | 9 | app = FastAPI() 10 | 11 | 12 | def read_imagefile(file) -> Image.Image: 13 | image = Image.open(io.BytesIO(file)) 14 | return image 15 | 16 | 17 | @app.get("/", include_in_schema=False) 18 | async def index(): 19 | return RedirectResponse(url="/docs") 20 | 21 | 22 | @app.post("/image_brightness") 23 | async def image_brightness(file: UploadFile = File(...)): 24 | image = read_imagefile(await file.read()).convert("L") 25 | x = np.array(image) 26 | return {"brightness": x.mean()} 27 | -------------------------------------------------------------------------------- /05_SharingWork/README.md: -------------------------------------------------------------------------------- 1 | # Streamlit, Quarto, FastAPI 2 | 3 | We've reached the final lecture 🎉 of this course. 4 | 5 | We've learned plenty of things, but it's high time we show our results to the world. 
6 | To do so, I present 3 libraries that take different approaches. 7 | 8 | ### Streamlit 9 | 10 | A new, yet already feature-rich library for creating interactive dashboards from your analyses. 11 | A dashboard may present plots, data from dataframes, simulation results or predictions from an ML model. 12 | What is remarkable about `streamlit` is the ease with which you can create those dashboards. 13 | Check out the examples. 14 | 15 | ### Quarto 16 | 17 | The library that created the `html` files with the course materials from the `qmd` and `ipynb` files. 18 | You can also create technical $\\LaTeX$-like reports using `quarto`, as shown here. 19 | Technically speaking, it's the language-agnostic successor of R Markdown. 20 | 21 | ### FastAPI 22 | 23 | The results of our analysis won't always be presented to other people; sometimes they'll be consumed by another service/machine. 24 | If that's your use case, there is no easier-to-use library for creating REST APIs in Python than `fastapi`. 25 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Introduction to Data Science in Python by Appsilon 2 | 3 | ## Introduction 4 | 5 | Welcome to the course _Introduction to Data Science in Python by Appsilon_! 6 | 7 | ## Target audience 8 | 9 | This course aims to introduce people who know how to code in Python to the Data Science world. 10 | In particular, I show tips and tricks useful for STEM/economics students. 11 | One of the secondary goals is to show students how to use **free** tools that are, at the same time, **industry standards**, instead of Matlab/Statistica/SAS and so on. 12 | 13 | ## Covered topics 14 | 15 | 0. The course starts by introducing what a Data Scientist does and why this job is so important in the 21st century. Then we start the technical part of the course. 16 | 1. 
`numpy` - numbers and vectors, the fundamentals of all calculations in Python 17 | 2. `pandas` - data frames - SQL-like, in-memory data, the fundamentals of data processing in Python 18 | 3. `matplotlib` and `plotly` - plots, the basics of data visualization 19 | 4. `scikit-learn` - an introduction to machine learning, with examples from the go-to library in Python 20 | 5. `streamlit`, `quarto`, `fastapi` - simple, useful and creative ways to share your work in Python and to generate beautiful reports 21 | 22 | Apart from those libraries, I present and benchmark the `polars` library - a high-performance replacement for `pandas` for when you work with datasets of 0.5GB - 5GB and `pandas` starts to be too slow. 23 | 24 | ## Course materials 25 | 26 | All course materials are located either here or on Google Drive. 27 | Code and small datasets are in the repo, while large datasets are located on Google Drive. 28 | 29 | I suggest using the `html` files, generated from `qmd` and `ipynb` with `quarto`. 30 | 31 | A guide to setting up an environment is included in the introduction presentation. 32 | 33 | tl;dr You can try 34 | ``` 35 | conda create -n ds-course python=3.10 36 | conda activate ds-course 37 | pip install -r requirements.txt 38 | ``` 39 | 40 | ### Homeworks 41 | 42 | Each lecture also has a homework assignment. 43 | For every homework, a solution is provided in a separate directory. 44 | Note that the solutions are not necessarily the best possible, but they may present an interesting approach. 45 | Very often there are multiple ways to approach the same problem. 46 | 47 | ## License 48 | 49 | The course has been prepared by [Piotr Pasza Storożenko](https://pstorozenko.github.io/) from [Appsilon](http://appsilon.com/). 50 | It is available under the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license. 51 | Feel free to use these materials; you just have to attribute the original author. 
52 | 53 | Some exercises have been inspired by exercises the author had to solve while studying. 54 | -------------------------------------------------------------------------------- /data/.gitignore: -------------------------------------------------------------------------------- 1 | largedf 2 | travel 3 | wikipedia -------------------------------------------------------------------------------- /data/flights/.gitignore: -------------------------------------------------------------------------------- 1 | airlines.csv 2 | airports.csv 3 | flights.csv 4 | -------------------------------------------------------------------------------- /data/housing/.gitignore: -------------------------------------------------------------------------------- 1 | housing_answers.csv 2 | -------------------------------------------------------------------------------- /data/housing/housing_example_submission.csv: -------------------------------------------------------------------------------- 1 | MEDV 2 | 0.3133704207 3 | 0.3133704207 4 | 0.3133704207 5 | 0.3133704207 6 | 0.3133704207 7 | 0.3133704207 8 | 0.3133704207 9 | 0.3133704207 10 | 0.3133704207 11 | 0.3133704207 12 | 0.3133704207 13 | 0.3133704207 14 | 0.3133704207 15 | 0.3133704207 16 | 0.3133704207 17 | 0.3133704207 18 | 0.3133704207 19 | 0.3133704207 20 | 0.3133704207 21 | 0.3133704207 22 | 0.3133704207 23 | 0.3133704207 24 | 0.3133704207 25 | 0.3133704207 26 | 0.3133704207 27 | 0.3133704207 28 | 0.3133704207 29 | 0.3133704207 30 | 0.3133704207 31 | 0.3133704207 32 | 0.3133704207 33 | 0.3133704207 34 | 0.3133704207 35 | 0.3133704207 36 | 0.3133704207 37 | 0.3133704207 38 | 0.3133704207 39 | 0.3133704207 40 | 0.3133704207 41 | 0.3133704207 42 | 0.3133704207 43 | 0.3133704207 44 | 0.3133704207 45 | 0.3133704207 46 | 0.3133704207 47 | 0.3133704207 48 | 0.3133704207 49 | 0.3133704207 50 | 0.3133704207 51 | 0.3133704207 52 | 0.3133704207 53 | 0.3133704207 54 | 0.3133704207 55 | 0.3133704207 56 | 0.3133704207 57 | 0.3133704207 58 | 
0.3133704207 59 | 0.3133704207 60 | 0.3133704207 61 | 0.3133704207 62 | 0.3133704207 63 | 0.3133704207 64 | 0.3133704207 65 | 0.3133704207 66 | 0.3133704207 67 | 0.3133704207 68 | 0.3133704207 69 | 0.3133704207 70 | 0.3133704207 71 | 0.3133704207 72 | 0.3133704207 73 | 0.3133704207 74 | 0.3133704207 75 | 0.3133704207 76 | 0.3133704207 77 | 0.3133704207 78 | -------------------------------------------------------------------------------- /data/housing/housing_train.csv: -------------------------------------------------------------------------------- 1 | CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV 2 | 0.1396,0.0,8.56,0,,6.167,90.0,2.421,5,384.0,,392.69,,86.0660335294968 3 | 0.0351,95.0,2.68,0,,7.853,33.2,5.118,4,224.0,,392.78,,207.70005687846765 4 | 15.8744,0.0,18.1,0,,6.545,99.1,1.5192,24,666.0,,396.9,21.08,46.68934923020518 5 | 0.18337,0.0,27.74,0,,5.414,98.3,1.7554,4,711.0,,344.05,23.97,29.980669480245425 6 | 0.12816,12.5,6.07,0,,5.885,33.0,6.498,4,345.0,,396.9,,89.46127472267845 7 | 7.40389,0.0,18.1,0,,5.617,97.9,1.4547,24,666.0,,314.64,26.4,73.75821628799845 8 | 0.03548,80.0,3.64,0,,5.876,19.1,9.2203,1,315.0,,395.18,,89.5826886739847 9 | 11.5779,0.0,18.1,0,,5.036,97.0,1.77,24,666.0,,396.9,25.68,41.53875244545348 10 | 0.26169,0.0,9.9,0,,6.023,90.4,2.834,4,304.0,,396.3,,83.19418362437355 11 | 0.44791,0.0,6.2,1,,6.726,66.5,3.6519,8,307.0,,360.2,,124.27493023973318 12 | 4.81213,0.0,18.1,0,0.713,6.701,90.0,2.5975,24,666.0,20.2,255.23,,70.30934050876428 13 | 0.34109,0.0,7.38,0,,6.415,40.1,4.7211,5,287.0,19.6,396.9,,107.1012551997701 14 | 0.02875,28.0,15.04,0,,6.211,28.9,3.6659,4,270.0,18.2,396.33,,107.18347747973844 15 | 0.35233,0.0,21.89,0,,6.454,98.4,1.8498,4,437.0,21.2,394.08,,73.21645056736543 16 | 0.07022,0.0,4.05,0,,6.02,47.2,3.5549,5,296.0,16.6,393.23,,99.39861331287238 17 | 25.9406,0.0,18.1,0,,5.304,89.1,1.6475,24,666.0,20.2,127.36,26.64,44.59025248731203 18 | 1.19294,0.0,21.89,0,,6.326,97.7,2.271,4,437.0,21.2,396.9,,83.92318147539109 19 | 
0.06162,0.0,4.39,0,,5.898,52.3,8.0136,3,352.0,18.8,364.61,,73.71711611183099 20 | 4.55587,0.0,18.1,0,0.718,3.561,87.9,1.6132,24,666.0,20.2,354.7,,117.87600081588016 21 | 0.59005,0.0,21.89,0,,6.372,97.9,2.3274,4,437.0,21.2,385.76,,98.61335073183163 22 | 9.2323,0.0,18.1,0,,6.216,100.0,1.1691,24,666.0,20.2,366.15,,214.34410393293334 23 | 18.811,0.0,18.1,0,,4.628,100.0,1.5539,24,666.0,20.2,28.79,34.37,76.64051245269846 24 | 14.4208,0.0,18.1,0,0.74,6.461,93.3,2.0026,24,666.0,20.2,27.49,18.05,41.12452029160635 25 | 14.0507,0.0,18.1,0,,6.657,100.0,1.5275,24,666.0,20.2,35.05,21.22,73.67595467076562 26 | 0.05188,0.0,4.49,0,,6.015,45.1,4.4272,3,247.0,18.5,395.99,,96.49865391055593 27 | 0.09512,0.0,12.83,0,,6.286,45.0,4.5026,5,398.0,18.7,383.23,,91.6522176655636 28 | 15.0234,0.0,18.1,0,,5.304,97.3,2.1007,24,666.0,20.2,349.48,24.91,51.42387122560069 29 | 0.62739,0.0,8.14,0,,5.834,56.5,4.4986,4,307.0,21.0,395.62,,85.17266592604656 30 | 0.03466,35.0,6.06,0,,6.031,23.3,6.6407,1,304.0,16.9,362.25,,83.198449333261 31 | 7.05042,0.0,18.1,0,,6.103,85.1,2.0218,24,666.0,20.2,2.52,23.29,57.47604042733803 32 | 0.7258,0.0,8.14,0,,5.727,69.5,3.7965,4,307.0,21.0,390.95,,78.02623070337263 33 | 0.19186,0.0,7.38,0,,6.431,14.7,5.4159,5,287.0,19.6,393.68,,105.43385792635864 34 | 0.03961,0.0,5.19,0,,6.037,34.5,5.9853,5,224.0,20.2,396.9,,90.43432864980113 35 | 0.02055,85.0,0.74,0,,6.383,35.7,9.1876,2,313.0,17.3,396.9,,105.83732302619457 36 | 15.1772,0.0,18.1,0,0.74,6.152,100.0,1.9142,24,666.0,20.2,9.32,26.45,37.28371947578368 37 | 14.4383,0.0,18.1,0,,6.852,100.0,1.4655,24,666.0,20.2,179.36,19.78,117.8834171671725 38 | 0.03738,0.0,5.19,,,6.31,38.5,6.4584,5,224.0,20.2,389.4,,88.7178014151015 39 | 0.06888,0.0,2.46,0,,6.144,62.2,2.5979,3,193.0,17.8,396.9,,155.00853000996466 40 | 0.41238,0.0,6.2,0,,7.163,79.9,3.2157,8,307.0,17.4,372.08,,135.25380417327324 41 | 13.9134,0.0,18.1,0,0.713,6.208,95.0,2.2222,24,666.0,20.2,100.63,,50.147350321243344 42 | 
0.06588,0.0,2.46,0,,7.765,83.3,2.741,3,193.0,17.8,395.56,,170.6408431307883 43 | 0.84054,0.0,8.14,0,,5.599,85.7,4.4546,4,307.0,21.0,303.42,,59.54643658953512 44 | 0.17331,0.0,9.69,0,,5.707,54.0,2.3817,6,391.0,19.2,396.9,,93.39654697869281 45 | 0.08244,30.0,4.93,0,,6.481,18.5,6.1899,6,300.0,16.6,379.41,,101.52079892017001 46 | 0.20608,22.0,5.86,0,,5.593,76.5,7.9549,7,330.0,19.1,372.49,,75.33524829516917 47 | 0.1403,22.0,5.86,0,,6.487,13.0,7.3967,7,330.0,19.1,396.28,,104.58245151797445 48 | 73.5341,0.0,18.1,0,,5.957,100.0,1.8026,24,666.0,20.2,16.45,20.62,37.67196238918527 49 | 0.15098,0.0,10.01,0,,6.021,82.6,2.7474,6,432.0,17.8,394.51,,82.21308151000076 50 | 0.1415,0.0,6.91,0,,6.169,6.6,5.7209,3,233.0,17.9,383.37,,108.31444570679487 51 | 0.35114,0.0,7.38,0,,6.041,49.9,4.7211,5,287.0,19.6,396.9,,87.40118173931863 52 | 0.0187,85.0,4.15,0,,6.516,27.7,8.5353,4,351.0,17.9,392.43,,99.0871227549707 53 | 0.09103,0.0,2.46,0,,7.155,92.2,2.7006,3,193.0,17.8,394.12,,162.5410185463129 54 | 3.53501,0.0,19.58,1,0.871,6.152,82.6,1.7455,5,403.0,14.7,88.01,,66.76816059885593 55 | 0.03578,20.0,3.33,0,,7.82,64.5,4.6947,5,216.0,14.9,387.31,,194.60858141073373 56 | 0.38735,0.0,25.65,0,,5.613,95.6,1.7572,2,188.0,19.1,359.29,27.26,67.24048484287769 57 | 0.06724,0.0,3.24,0,,6.333,17.2,5.2146,4,430.0,16.9,375.21,,96.87537985348989 58 | 1.35472,0.0,8.14,0,,6.072,100.0,4.175,4,307.0,21.0,376.73,,62.20243792468661 59 | 0.22212,0.0,10.01,0,,6.092,95.4,2.548,6,432.0,17.8,396.9,,80.03620946084678 60 | 2.33099,0.0,19.58,0,0.871,5.186,93.8,1.5296,5,403.0,14.7,356.99,28.32,76.28980246294269 61 | 6.44405,0.0,18.1,0,,6.425,74.8,2.2004,24,666.0,20.2,97.95,,68.9608693776895 62 | 0.03306,0.0,5.19,0,,6.059,37.3,4.8122,5,224.0,20.2,396.14,,88.2163428518819 63 | 0.01432,100.0,1.32,0,,6.816,40.5,8.3248,5,256.0,15.1,392.9,,135.56284210515494 64 | 0.01439,60.0,2.93,0,,6.604,18.8,6.2196,1,265.0,15.6,376.7,,124.77986816790845 65 | 0.75026,0.0,8.14,0,,5.924,94.1,4.3996,4,307.0,21.0,394.33,,66.78507694539394 66 | 
0.7842,0.0,8.14,0,,5.99,81.7,4.2579,4,307.0,21.0,386.75,,74.9519755033796 67 | 0.06466,70.0,2.24,0,,6.345,20.1,7.8278,5,358.0,14.8,368.24,,96.42224261775675 68 | 0.04379,80.0,3.37,0,,5.787,31.1,6.6115,4,337.0,16.1,396.9,,83.1078869346196 69 | 0.37578,0.0,10.59,1,,5.404,88.6,3.665,4,277.0,18.6,395.24,23.98,82.79397982162877 70 | 41.5292,0.0,18.1,0,,5.531,85.4,1.6074,24,666.0,20.2,329.46,27.38,36.4476183815942 71 | 0.04294,28.0,15.04,0,,6.249,77.3,3.615,4,270.0,18.2,396.9,,88.27537962430918 72 | 1.41385,0.0,19.58,1,0.871,6.129,96.0,1.7494,5,403.0,14.7,321.02,,72.80292648744702 73 | 9.72418,0.0,18.1,0,0.74,6.406,97.2,2.0651,24,666.0,20.2,385.96,19.52,73.19050806675492 74 | 0.98843,0.0,8.14,0,,5.813,100.0,4.0952,4,307.0,21.0,394.54,19.88,62.066539598196954 75 | 0.52693,0.0,6.2,0,,8.725,83.0,2.8944,8,307.0,17.4,382.0,,214.05139620224634 76 | 5.58107,0.0,18.1,0,0.713,6.436,87.9,2.3158,24,666.0,20.2,100.19,,61.294119851011075 77 | 9.92485,0.0,18.1,0,0.74,6.251,96.6,2.198,24,666.0,20.2,388.52,,53.99633782178745 78 | 0.02985,0.0,2.18,0,,6.43,58.7,6.0622,3,,18.7,394.12,,122.86141195884612 79 | 0.13158,0.0,10.01,0,,6.176,72.5,2.7301,6,432.0,17.8,393.3,,90.79084192610794 80 | 0.17142,0.0,6.91,0,,5.682,33.8,5.1004,3,233.0,17.9,396.9,,82.61108926392335 81 | 1.05393,0.0,8.14,0,,5.935,29.3,4.4986,4,307.0,21.0,386.85,,98.97594878707021 82 | 15.5757,0.0,18.1,0,,5.926,71.0,2.9084,24,666.0,20.2,368.74,18.13,81.82059423120702 83 | 4.54192,0.0,18.1,0,0.77,6.398,88.0,2.5182,24,666.0,20.2,374.56,,107.00854694148481 84 | 0.03237,0.0,2.18,0,,6.998,45.8,6.0622,3,222.0,18.7,394.63,,142.95516813483192 85 | 67.9208,0.0,18.1,0,,5.683,100.0,1.4254,24,666.0,20.2,384.97,22.98,21.445933889849677 86 | 0.06047,0.0,2.46,0,,6.153,68.8,3.2797,3,193.0,17.8,387.11,,126.82757360819036 87 | 0.14932,25.0,5.13,0,,5.741,66.2,7.2254,8,284.0,19.7,395.11,,80.15248929060887 88 | 0.10793,0.0,8.56,0,,6.195,54.4,2.7778,5,384.0,20.9,393.49,,93.05392848107756 89 | 
0.18159,0.0,7.38,,,6.376,54.3,4.5404,5,287.0,19.6,396.9,,98.88396967360913 90 | 0.76162,20.0,3.97,0,,5.56,62.8,1.9865,5,264.0,13.0,392.4,,97.62139188166562 91 | 1.00245,0.0,8.14,0,,6.674,87.3,4.239,4,307.0,21.0,380.23,,89.93438635683638 92 | 0.52014,20.0,3.97,0,,8.398,91.5,2.2885,5,264.0,13.0,386.86,,209.2465102860684 93 | 10.233,0.0,18.1,0,,6.185,96.7,2.1705,24,666.0,20.2,379.7,18.03,62.57281607977637 94 | 0.67191,0.0,8.14,0,,5.813,90.3,4.682,4,307.0,21.0,376.88,,71.0686566345113 95 | 0.14455,12.5,7.87,0,,6.172,96.1,5.9505,5,311.0,15.2,396.9,19.15,116.11802073957813 96 | 0.11132,0.0,27.74,0,,5.983,83.5,2.1099,4,711.0,20.1,396.9,,86.21246825032345 97 | 0.12802,0.0,8.56,0,,6.474,97.1,2.4329,5,384.0,20.9,395.24,,84.91805615308519 98 | 0.08014,0.0,5.96,0,,5.85,41.5,3.9342,5,279.0,19.2,396.9,,89.89364714599141 99 | 1.22358,0.0,19.58,0,,6.943,97.4,1.8773,5,403.0,14.7,363.43,,176.9575067791254 100 | 3.56868,0.0,18.1,0,,6.437,75.0,2.8965,24,666.0,20.2,393.37,,99.33458893299641 101 | 0.13058,0.0,10.01,0,,5.872,73.1,2.4775,6,432.0,17.8,338.63,,87.32915603779146 102 | 0.14231,0.0,10.01,0,,6.254,84.2,2.2565,6,432.0,17.8,388.74,,79.28289362098235 103 | 0.06664,0.0,4.05,0,,6.546,33.1,3.1323,5,296.0,16.6,390.96,,126.05546765906081 104 | 0.08664,45.0,3.44,0,,7.178,26.3,6.4798,5,,15.2,390.49,,155.9850489993171 105 | 0.1146,20.0,6.96,0,,6.538,58.7,3.9175,3,223.0,18.6,394.96,,104.51228025482678 106 | 2.77974,0.0,19.58,0,0.871,4.903,97.8,1.3459,5,403.0,14.7,396.9,29.29,50.62083750798458 107 | 11.1081,0.0,18.1,0,,4.906,100.0,1.1742,24,666.0,20.2,396.9,34.77,59.13022579725972 108 | 7.99248,0.0,18.1,0,,5.52,100.0,1.5331,24,666.0,20.2,396.9,24.56,52.65946620595223 109 | 8.98296,0.0,18.1,1,0.77,6.212,97.4,2.1222,24,666.0,20.2,377.73,,76.26345361907924 110 | 0.06127,40.0,6.41,1,,6.826,27.6,4.8628,4,254.0,17.6,393.45,,141.68245200870115 111 | 0.35809,0.0,6.2,1,,6.951,88.5,2.8617,8,307.0,17.4,391.7,,114.53035718635633 112 | 
6.71772,0.0,18.1,0,0.713,6.749,92.6,2.3236,24,666.0,20.2,0.32,,57.474507693137504 113 | 1.62864,0.0,21.89,0,,5.019,100.0,1.4394,4,437.0,21.2,396.9,34.41,61.640340546320466 114 | 5.66998,0.0,18.1,1,,6.683,96.8,1.3567,24,666.0,20.2,375.33,,214.051654478562 115 | 0.05789,12.5,6.07,0,,5.878,21.4,6.498,4,,18.9,396.21,,94.22005106246839 116 | 3.83684,0.0,18.1,0,0.77,6.251,91.1,2.2955,24,666.0,20.2,350.65,,85.36835190309107 117 | 2.3004,0.0,19.58,,,6.319,96.1,2.1,5,403.0,14.7,297.09,,102.07059692876268 118 | 0.17783,0.0,9.69,,,5.569,73.5,2.3999,6,391.0,19.2,395.77,,74.90045220259788 119 | 13.3598,0.0,18.1,0,,5.887,94.7,1.7821,24,666.0,20.2,396.9,,54.38842392864051 120 | 25.0461,0.0,18.1,0,,5.987,100.0,1.5888,24,666.0,20.2,396.9,26.77,24.014529122715135 121 | 0.02187,60.0,2.93,0,,6.8,9.9,6.2196,1,265.0,15.6,393.37,,133.11355504856078 122 | 0.19073,22.0,5.86,0,,6.718,17.5,7.8265,7,330.0,19.1,393.74,,112.29183147899278 123 | 0.26363,0.0,8.56,0,,6.229,91.2,2.5451,5,384.0,20.9,391.23,,83.03299816312641 124 | 11.0874,0.0,18.1,0,0.718,6.411,100.0,1.8589,24,666.0,20.2,318.75,,71.49422130850488 125 | 2.37934,0.0,19.58,0,0.871,6.13,100.0,1.4191,5,403.0,14.7,172.91,27.8,59.15729521149178 126 | 0.04203,28.0,15.04,0,,6.442,53.6,3.6659,4,270.0,18.2,395.01,,98.2099688695904 127 | 1.12658,0.0,19.58,1,0.871,5.012,88.0,1.6102,5,403.0,14.7,343.28,,65.54089052997352 128 | 0.62356,0.0,6.2,1,,6.879,77.7,3.2721,8,307.0,17.4,390.39,,117.89917845973477 129 | 0.05515,33.0,2.18,,,7.236,41.1,4.022,7,222.0,18.4,393.68,,154.72377820705736 130 | 0.03551,25.0,4.86,0,,6.167,46.7,5.4007,4,281.0,19.0,390.64,,98.21683590419168 131 | 0.16439,22.0,5.86,0,,6.433,49.1,7.8265,7,330.0,19.1,374.71,,105.07957421915785 132 | 2.924,0.0,19.58,0,,6.101,93.0,2.2834,5,403.0,14.7,240.16,,107.19320569500421 133 | 1.51902,0.0,19.58,1,,8.375,93.9,2.162,5,403.0,14.7,388.45,,214.14918721549503 134 | 0.0315,95.0,1.47,0,,6.975,15.3,7.6534,3,402.0,17.0,396.9,,149.53327195447721 135 | 
0.46296,0.0,6.2,0,,7.412,76.9,3.6715,8,307.0,17.4,376.14,,135.92193010124637 136 | 0.07896,0.0,12.83,0,,6.273,6.0,4.2515,5,398.0,18.7,394.92,,103.32518624974017 137 | 0.79041,0.0,9.9,0,,6.122,52.8,2.6403,4,304.0,18.4,396.9,,94.69094463035148 138 | 4.75237,0.0,18.1,0,0.713,6.525,86.5,2.4358,24,666.0,20.2,50.92,18.13,60.38269582189052 139 | 0.36894,22.0,5.86,0,,8.259,8.4,8.9067,7,330.0,19.1,396.9,,183.58426637694816 140 | 0.14476,0.0,10.01,0,,5.731,65.2,2.7592,6,432.0,17.8,391.5,,82.66700168026514 141 | 0.00906,90.0,2.97,0,,7.088,20.8,7.3073,1,285.0,15.3,394.72,,137.84225152963592 142 | 0.09266,34.0,6.09,0,,6.495,18.4,5.4917,7,329.0,16.1,383.61,,113.09641600484015 143 | 2.81838,0.0,18.1,0,,5.762,40.3,4.0983,24,666.0,20.2,392.92,,93.48345193539986 144 | 3.8497,0.0,18.1,1,0.77,6.395,91.0,2.5052,24,666.0,20.2,391.34,,93.06353491055529 145 | 24.8017,0.0,18.1,0,,5.349,96.0,1.7028,24,666.0,20.2,396.9,19.77,35.53338811498336 146 | 0.29819,0.0,6.2,0,,7.686,17.0,3.3751,8,307.0,17.4,377.51,,200.10126043018462 147 | 0.53412,20.0,3.97,0,,7.52,89.4,2.1398,5,,13.0,388.37,,184.72551209679438 148 | 0.51183,0.0,6.2,0,,7.358,71.6,4.148,8,307.0,17.4,390.07,,135.0797219247725 149 | 24.3938,0.0,18.1,0,,4.652,100.0,1.4672,24,666.0,20.2,396.9,28.28,45.03868107815232 150 | 4.87141,0.0,18.1,0,,6.484,93.6,2.3053,24,666.0,20.2,396.21,18.68,71.63124418152948 151 | 0.09744,0.0,5.96,0,,5.841,61.4,3.3779,5,279.0,19.2,377.56,,85.74867056457873 152 | 0.04011,80.0,1.52,0,,7.287,34.1,7.309,2,329.0,12.6,396.9,,142.82764139964164 153 | 0.54452,0.0,21.89,,,6.151,97.9,1.6687,4,437.0,21.2,396.9,18.46,76.31012581681928 154 | 4.89822,0.0,18.1,0,,4.97,100.0,1.3325,24,666.0,20.2,375.52,,214.04427779211161 155 | 0.19657,22.0,5.86,0,,6.226,79.2,8.0555,7,330.0,19.1,376.14,,87.76546509317926 156 | 0.03871,52.5,5.32,0,,6.209,31.3,7.3172,6,293.0,16.6,396.9,,99.3930687509709 157 | 23.6482,0.0,18.1,0,,6.38,96.2,1.3861,24,666.0,20.2,396.9,23.69,56.14399470209011 158 | 
0.10328,25.0,5.13,,,5.927,47.2,6.932,8,284.0,19.7,396.9,,83.97882262665027 159 | 0.10084,0.0,10.01,0,,6.715,81.6,2.6775,6,432.0,17.8,395.59,,97.63612653682769 160 | 0.05302,0.0,3.41,0,,7.079,63.1,3.4145,2,270.0,17.8,396.06,,122.99996853633093 161 | 0.7857,20.0,3.97,0,,7.014,84.6,2.1329,5,264.0,13.0,384.07,,131.5306524154623 162 | 0.08829,12.5,7.87,0,,6.012,66.6,5.5605,5,311.0,15.2,395.6,,98.02060966106423 163 | 3.47428,0.0,18.1,1,0.718,8.78,82.9,1.9047,24,666.0,20.2,354.55,,93.7433216073559 164 | 0.06076,0.0,11.93,0,,6.976,91.0,2.1675,1,273.0,21.0,396.9,,102.48170478024093 165 | 0.01301,35.0,1.52,0,,7.241,49.3,7.0379,1,284.0,15.5,394.74,,140.06420082169748 166 | 1.34284,0.0,19.58,0,,6.066,100.0,1.7573,5,403.0,14.7,353.89,,104.23346578329476 167 | 1.6566,0.0,19.58,0,0.871,6.122,97.3,1.618,5,403.0,14.7,372.8,,92.03410962108342 168 | 0.05425,0.0,4.05,0,,6.315,73.4,3.3175,5,296.0,16.6,395.6,,105.3647401896618 169 | 7.67202,0.0,18.1,0,,5.747,98.9,1.6334,24,666.0,20.2,393.1,19.92,36.41181522520689 170 | 0.08308,0.0,2.46,0,,5.604,89.8,2.9879,3,193.0,17.8,391.0,,113.14056607559466 171 | 0.40202,0.0,9.9,,,6.382,67.2,3.5325,4,304.0,18.4,395.21,,98.95412803519065 172 | 0.22489,12.5,7.87,0,,6.377,94.3,6.3467,5,311.0,15.2,392.52,20.45,64.25974310759635 173 | 20.0849,0.0,18.1,0,,4.368,91.2,1.4395,24,666.0,20.2,285.83,30.63,37.6756788895361 174 | 0.21161,0.0,8.56,0,,6.137,87.4,2.7147,5,384.0,20.9,394.47,,82.73890521244164 175 | 0.04462,25.0,4.86,0,,6.619,70.4,5.4007,4,281.0,19.0,395.63,,102.5219471514495 176 | 0.17505,0.0,5.96,0,,5.966,30.2,3.8473,5,279.0,19.2,393.43,,105.8151903183971 177 | 0.24522,0.0,9.9,0,,5.782,71.7,4.0317,4,304.0,18.4,396.9,,84.86924302557883 178 | 1.80028,0.0,19.58,0,,5.877,79.2,2.4259,5,403.0,14.7,227.61,,101.89525776008544 179 | 6.39312,0.0,18.1,0,,6.162,97.4,2.206,24,666.0,20.2,302.76,24.1,57.01312277260327 180 | 0.05561,70.0,2.24,0,,7.041,10.0,7.8278,5,358.0,14.8,371.58,,124.26905880166453 181 | 
0.05372,0.0,13.92,0,,6.549,51.0,5.9604,4,289.0,16.0,392.85,,116.1960531736661 182 | 0.03768,80.0,1.52,0,,7.274,38.3,7.309,2,329.0,12.6,392.2,,148.15216877063816 183 | 9.82349,0.0,18.1,0,,6.794,98.8,1.358,24,666.0,20.2,396.9,21.24,57.05342200049336 184 | 2.15505,0.0,19.58,0,0.871,5.628,100.0,1.5166,5,403.0,14.7,169.27,,66.8375694958936 185 | 5.87205,0.0,18.1,0,,6.405,96.0,1.6768,24,666.0,20.2,396.9,19.37,53.62232891092506 186 | 2.36862,0.0,19.58,0,0.871,4.926,95.7,1.4608,5,403.0,14.7,391.71,29.53,62.58254575193668 187 | 7.36711,0.0,18.1,0,,6.193,78.1,1.9356,24,666.0,20.2,96.73,21.52,47.17557906880946 188 | 0.04297,52.5,5.32,0,,6.565,22.9,7.3172,6,293.0,16.6,371.72,,106.3524321967567 189 | 0.15038,0.0,25.65,,,5.856,97.0,1.9444,2,188.0,19.1,370.31,25.41,74.04556952600493 190 | 0.20746,0.0,27.74,0,,5.093,98.0,1.8226,4,711.0,20.1,318.43,29.68,34.732265987909976 191 | 0.11504,0.0,2.89,,,6.163,69.6,3.4952,2,276.0,18.0,391.83,,91.64449110390264 192 | 4.0974,0.0,19.58,0,0.871,5.468,100.0,1.4118,5,403.0,14.7,396.9,26.42,66.90097668683498 193 | 0.09252,30.0,4.93,0,,6.606,42.2,6.1899,6,300.0,16.6,383.78,,99.79904806561713 194 | 0.09604,40.0,6.41,0,,6.854,42.8,4.2673,4,254.0,17.6,396.9,,137.2389864892714 195 | 0.12083,0.0,2.89,0,,8.069,76.0,3.4952,2,276.0,18.0,396.9,,165.69689083530488 196 | 0.01709,90.0,2.02,0,,6.728,36.1,12.1265,5,,17.0,384.46,,128.98858708951522 197 | 0.09299,0.0,25.65,0,,5.961,92.9,2.0869,2,188.0,19.1,378.09,17.93,87.8172225492073 198 | 0.10008,0.0,2.46,0,,6.563,95.6,2.847,3,193.0,17.8,396.9,,139.33235031306575 199 | 0.02177,82.5,2.03,0,,7.61,15.7,6.27,2,,14.7,395.38,,181.30089062761513 200 | 0.33983,22.0,5.86,0,,6.108,34.9,8.0555,7,330.0,19.1,390.18,,104.23522664376836 201 | 2.37857,0.0,18.1,0,,5.871,41.9,3.724,24,666.0,20.2,370.73,,88.27563109434632 202 | 0.03537,34.0,6.09,0,,6.59,40.4,5.4917,7,329.0,16.1,395.75,,94.16646677806642 203 | 0.04301,80.0,1.91,0,,5.663,21.9,10.5857,4,334.0,22.0,382.8,,78.0382858400318 204 | 
51.1358,0.0,18.1,0,,5.757,100.0,1.413,24,666.0,20.2,2.6,,64.28657778550676 205 | 9.91655,0.0,18.1,0,,5.852,77.8,1.5004,24,666.0,20.2,338.16,29.97,26.979381279296607 206 | 0.01965,80.0,1.76,0,,6.23,31.5,9.0892,1,241.0,18.2,341.6,,86.07233278610938 207 | 0.16902,0.0,25.65,0,,5.986,88.4,1.9929,2,188.0,19.1,385.02,,91.61948802567194 208 | 0.05479,33.0,2.18,0,,6.616,58.1,3.37,7,222.0,18.4,393.36,,121.777199530868 209 | 0.6147,0.0,6.2,0,,6.618,80.8,3.2721,8,307.0,17.4,396.9,,129.0534976533695 210 | 12.0482,0.0,18.1,0,,5.648,87.6,1.9512,24,666.0,20.2,291.55,,89.18315969982761 211 | 0.11425,0.0,13.89,1,,6.373,92.4,3.3633,5,276.0,16.4,393.74,,98.61412363155058 212 | 0.88125,0.0,21.89,0,,5.637,94.7,1.9799,4,437.0,21.2,396.9,18.34,61.324137374892 213 | 8.79212,0.0,18.1,0,,5.565,70.6,2.0635,24,666.0,20.2,3.65,,50.127266164530205 214 | 0.07886,80.0,4.95,0,,7.148,27.7,5.1167,4,245.0,19.2,396.9,,160.00237114570464 215 | 0.05023,35.0,6.06,0,,5.706,28.4,6.6407,1,304.0,16.9,394.02,,73.25623685678946 216 | 88.9762,0.0,18.1,0,,6.968,91.9,1.4165,24,666.0,20.2,396.9,,44.56625992485017 217 | 5.82401,0.0,18.1,0,,6.242,64.7,3.4242,24,666.0,20.2,396.9,,98.605056230759 218 | 5.20177,0.0,18.1,1,0.77,6.127,83.4,2.7227,24,666.0,20.2,395.43,,97.2826978366578 219 | 0.14103,0.0,13.92,0,,5.79,58.0,6.32,4,289.0,16.0,396.9,,86.98648351102429 220 | 0.08199,0.0,13.92,0,,6.009,42.3,5.5027,4,289.0,16.0,396.9,,93.01244196555069 221 | 6.53876,0.0,18.1,1,,7.016,97.5,1.2024,24,666.0,20.2,392.05,,214.2923353502213 222 | 13.6781,0.0,18.1,0,0.74,5.935,87.9,1.8206,24,666.0,20.2,68.95,34.02,36.00210477989878 223 | 0.12329,0.0,10.01,0,,5.913,92.9,2.3534,6,432.0,17.8,394.95,,80.63483584614265 224 | 0.0578,0.0,2.46,0,,6.98,58.4,2.829,3,193.0,17.8,396.9,,159.48548523421246 225 | 2.63548,0.0,9.9,0,,4.973,37.8,2.5194,4,304.0,18.4,350.45,,68.96979939145074 226 | 0.02498,0.0,1.89,,,6.54,59.7,6.2669,1,422.0,15.9,389.96,,70.69404829968218 227 | 0.05083,0.0,5.19,0,,6.316,38.1,6.4584,5,224.0,20.2,389.71,,95.01800474971677 
228 | 4.83567,0.0,18.1,0,,5.905,53.2,3.1523,24,666.0,20.2,388.22,,88.33275490565735 229 | 8.20058,0.0,18.1,0,0.713,5.936,80.3,2.7792,24,666.0,20.2,3.5,,57.83323931871548 230 | 0.33147,0.0,6.2,0,,8.247,70.4,3.6519,8,307.0,17.4,378.95,,206.7589770594467 231 | 0.3692,0.0,9.9,0,,6.567,87.3,3.6023,4,304.0,18.4,395.69,,102.04868804927443 232 | 2.24236,0.0,19.58,0,,5.854,91.8,2.422,5,403.0,14.7,395.11,,97.34666800638067 233 | 0.32264,0.0,21.89,,,5.942,93.5,1.9669,4,437.0,21.2,378.25,,74.6368298325566 234 | 0.04666,80.0,1.52,,,7.107,36.6,7.309,2,329.0,12.6,354.31,,129.97715591287496 235 | 0.66351,20.0,3.97,0,,7.333,100.0,1.8946,5,264.0,13.0,383.29,,154.25438344112666 236 | 0.57529,0.0,6.2,0,,8.337,73.3,3.8384,8,307.0,17.4,385.91,,178.86849603926493 237 | 0.17134,0.0,10.01,0,,5.928,88.2,2.4631,6,432.0,17.8,344.91,,78.46246435209265 238 | 0.06899,0.0,25.65,0,,5.87,69.7,2.2577,2,188.0,19.1,389.15,,94.28316574870794 239 | 0.07244,60.0,1.69,0,,5.884,18.5,10.7103,4,411.0,18.3,392.33,,79.67435216186544 240 | 0.31533,0.0,6.2,0,,8.266,78.3,2.8944,8,307.0,17.4,385.05,,191.80937308916776 241 | 20.7162,0.0,18.1,0,,4.138,100.0,1.1781,24,666.0,20.2,370.22,23.34,51.003472535563006 242 | 0.06151,0.0,5.19,0,,5.968,58.5,4.8122,5,224.0,20.2,396.9,,80.1186585578974 243 | 0.25915,0.0,21.89,0,,5.693,96.0,1.7883,4,437.0,21.2,392.11,,69.3389852904356 244 | 0.01096,55.0,2.25,0,,6.453,31.9,7.3073,1,300.0,15.3,394.72,,94.24871749944725 245 | 18.0846,0.0,18.1,0,,6.434,100.0,1.8347,24,666.0,20.2,27.25,29.05,30.878383584061993 246 | 0.13117,0.0,8.56,,,6.127,85.2,2.1224,5,384.0,20.9,387.69,,87.40695588178909 247 | 18.4982,0.0,18.1,0,,4.138,100.0,1.137,24,666.0,20.2,396.9,37.97,59.15866354756566 248 | 7.52601,0.0,18.1,0,0.713,6.417,98.3,2.185,24,666.0,20.2,304.21,19.31,55.75013175697844 249 | 0.32982,0.0,21.89,0,,5.822,95.4,2.4699,4,437.0,21.2,388.69,,78.89714048466422 250 | 13.5222,0.0,18.1,0,,3.863,100.0,1.5106,24,666.0,20.2,131.42,,98.9095824110562 251 | 
0.12269,0.0,6.91,0,,6.069,40.0,5.7209,3,233.0,17.9,389.39,,90.81742720963614 252 | 0.17899,0.0,9.69,0,,5.67,28.8,2.7986,6,391.0,19.2,393.29,,99.0564379366495 253 | 0.03584,80.0,3.37,0,,6.29,17.8,6.6115,4,337.0,16.1,396.9,,100.68749654297682 254 | 0.01501,90.0,1.21,1,,7.923,24.8,5.885,1,198.0,13.6,395.52,,214.24663842339191 255 | 0.05735,0.0,4.49,0,,6.63,56.1,4.4377,3,247.0,18.5,392.3,,113.86436352897965 256 | 0.1029,30.0,4.93,0,,6.358,52.9,7.0355,6,300.0,16.6,372.75,,95.16988905486264 257 | 0.05602,0.0,2.46,0,,7.831,53.6,3.1992,3,193.0,17.8,392.63,,214.271616402895 258 | 15.8603,0.0,18.1,0,,5.896,95.4,1.9096,24,666.0,20.2,7.68,24.39,35.598814178915845 259 | 1.42502,0.0,19.58,0,0.871,6.51,100.0,1.7659,5,,14.7,364.31,,99.74781111338865 260 | 0.09378,12.5,7.87,0,,5.889,39.0,5.4509,5,311.0,15.2,390.5,,92.94899239065076 261 | 0.06417,0.0,5.96,0,,5.933,68.2,3.3603,5,279.0,19.2,396.9,,81.07895594795473 262 | 0.77299,0.0,8.14,0,,6.495,94.4,4.4547,4,307.0,21.0,387.94,,78.89883003527437 263 | 1.20742,0.0,19.58,0,,5.875,94.6,2.4259,5,403.0,14.7,292.29,,74.47744660259207 264 | 3.32105,0.0,19.58,1,0.871,5.403,100.0,1.3216,5,403.0,14.7,396.9,26.82,57.362009051979555 265 | 9.59571,0.0,18.1,0,,6.404,100.0,1.639,24,666.0,20.2,376.11,20.31,51.880990843107384 266 | 0.02899,40.0,1.25,0,,6.939,34.5,8.7921,1,335.0,19.7,389.85,,113.94356069894572 267 | 0.40771,0.0,6.2,1,,6.164,91.3,3.048,8,307.0,17.4,395.24,21.46,93.07299664769766 268 | 0.12204,0.0,2.89,0,,6.625,57.8,3.4952,2,276.0,18.0,357.98,,121.79855766835065 269 | 0.04337,21.0,5.64,0,,6.115,63.0,6.8147,4,,16.8,393.97,,87.84624337516867 270 | 0.11329,30.0,4.93,0,,6.897,54.3,6.3361,6,300.0,16.6,391.25,,94.37504260498926 271 | 15.288,0.0,18.1,0,,6.649,93.3,1.3449,24,666.0,20.2,363.02,23.24,59.53691516133277 272 | 9.18702,0.0,18.1,0,,5.536,100.0,1.5804,24,666.0,20.2,396.9,23.6,48.42414368373738 273 | 0.06642,0.0,4.05,0,,6.86,74.4,2.9153,5,296.0,16.6,391.27,,128.067902079391 274 | 
0.12744,0.0,6.91,0,,6.77,2.9,5.7209,3,233.0,17.9,385.41,,114.06895090972169 275 | 22.0511,0.0,18.1,0,0.74,5.818,92.4,1.8662,24,666.0,20.2,391.45,22.11,44.98666404253888 276 | 5.29305,0.0,18.1,0,,6.051,82.5,2.1678,24,666.0,20.2,378.38,18.76,99.33223001199178 277 | 0.22969,0.0,10.59,,,6.326,52.5,4.3549,4,277.0,18.6,394.87,,104.4712414144551 278 | 0.06129,20.0,3.33,1,,7.645,49.7,5.2119,5,216.0,14.9,377.07,,197.1295401724309 279 | 0.04819,80.0,3.64,0,,6.108,32.0,9.2203,1,315.0,16.4,392.89,,93.9394697513162 280 | 10.8342,0.0,18.1,0,,6.782,90.8,1.8195,24,666.0,20.2,21.57,25.79,32.1745002703019 281 | 0.06905,0.0,2.18,0,,7.147,54.2,6.0622,3,222.0,18.7,396.9,,154.95507154862378 282 | 0.01538,90.0,3.75,0,,7.454,34.2,6.3361,3,244.0,15.9,386.34,,188.66470599941434 283 | 8.24809,0.0,18.1,0,0.713,7.393,99.3,2.4527,24,666.0,20.2,375.87,,76.21052055036272 284 | 0.14866,0.0,8.56,0,,6.727,79.9,2.7778,5,384.0,20.9,394.76,,117.76563975860779 285 | 0.38214,0.0,6.2,0,,8.04,86.5,3.2157,8,307.0,17.4,387.38,,161.14266009269343 286 | 10.0623,0.0,18.1,0,,6.833,94.3,2.0882,24,666.0,20.2,81.33,19.69,60.43951952689001 287 | 0.14052,0.0,10.59,0,,6.375,32.3,3.9454,4,277.0,18.6,385.81,,120.39855220565468 288 | 12.2472,0.0,18.1,0,,5.837,59.7,1.9976,24,666.0,20.2,24.65,,43.745544898222214 289 | 2.3139,0.0,19.58,0,,5.88,97.3,2.3887,5,403.0,14.7,348.13,,81.88767258279199 290 | 0.08187,0.0,2.89,0,,7.82,36.9,3.4952,2,276.0,18.0,393.53,,187.5342552461346 291 | 0.03615,80.0,4.95,0,,6.63,23.4,5.1167,4,245.0,19.2,396.9,,119.528038155155 292 | 0.19802,0.0,10.59,0,,6.182,42.4,3.9454,4,277.0,18.6,393.63,,107.09888627097 293 | 0.17171,25.0,5.13,0,,5.966,93.4,6.8185,8,284.0,19.7,378.08,,68.59965945665476 294 | 0.22927,0.0,6.91,0,,6.03,85.5,5.6894,3,233.0,17.9,392.74,18.8,71.18515014550181 295 | 1.38799,0.0,8.14,0,,5.95,82.0,3.99,4,307.0,21.0,232.6,27.71,56.61271426511336 296 | 0.57834,20.0,3.97,0,,8.297,67.0,2.4216,5,264.0,13.0,384.54,,214.36241327993707 297 | 
0.24103,0.0,7.38,0,,6.083,43.7,5.4159,5,287.0,19.6,396.9,,95.16310192393985 298 | 0.01778,95.0,1.47,0,,7.135,13.9,7.6534,3,402.0,17.0,384.3,,140.81240527376974 299 | 5.44114,0.0,18.1,0,0.713,6.655,98.2,2.3552,24,666.0,20.2,355.29,,65.17969247582458 300 | 0.95577,0.0,8.14,0,,6.047,88.8,4.4534,4,307.0,21.0,306.38,,63.4855495232388 301 | 8.64476,0.0,18.1,0,,6.193,92.6,1.7912,24,666.0,20.2,396.9,,59.19694186292922 302 | 0.537,0.0,6.2,0,,5.981,68.1,3.6715,8,307.0,17.4,378.35,,104.0880820259377 303 | 0.54011,20.0,3.97,,,7.203,81.8,2.1121,5,264.0,13.0,392.8,,144.85091449097607 304 | 0.0459,52.5,5.32,0,,6.315,45.6,7.3172,6,293.0,16.6,396.9,,95.48939795651633 305 | 1.83377,0.0,19.58,1,,7.802,98.2,2.0407,5,,14.7,389.61,,214.41296038289997 306 | 9.33889,0.0,18.1,0,,6.38,95.6,1.9682,24,666.0,20.2,60.72,24.08,40.695688863928765 307 | 0.2498,0.0,21.89,0,,5.857,98.2,1.6686,4,437.0,21.2,392.04,21.32,57.02654091090679 308 | 0.11027,25.0,5.13,0,,6.456,67.8,7.2255,8,284.0,19.7,396.9,,95.02374006557191 309 | 0.55778,0.0,21.89,0,,6.335,98.2,2.1107,4,437.0,21.2,394.67,,77.60752459364593 310 | 0.32543,0.0,21.89,0,,6.431,98.8,1.8125,4,437.0,21.2,396.9,,77.11780334110067 311 | 5.73116,0.0,18.1,0,,7.061,77.0,3.4106,24,666.0,20.2,395.28,,107.24418130966565 312 | 0.21124,12.5,7.87,0,,5.631,100.0,6.0821,5,311.0,15.2,386.63,29.93,70.71582493919252 313 | 0.30347,0.0,7.38,0,,6.312,28.9,5.4159,5,287.0,19.6,396.9,,98.57326542684902 314 | 13.0751,0.0,18.1,0,,5.713,56.7,2.8237,24,666.0,20.2,396.9,,86.06762797944116 315 | 0.01951,17.5,1.38,0,,7.104,59.5,9.2229,3,216.0,18.6,393.24,,141.52384184941792 316 | 0.04417,70.0,2.24,0,,6.871,47.4,7.8278,5,358.0,14.8,390.86,,106.28983847776634 317 | 0.63796,0.0,8.14,0,,6.096,84.5,4.4619,4,307.0,21.0,380.02,,78.02927346890762 318 | 2.44668,0.0,19.58,0,0.871,5.272,94.0,1.7364,5,403.0,14.7,88.63,,56.08882091510925 319 | 0.03359,75.0,2.95,0,,7.024,15.8,5.4011,3,252.0,18.3,395.62,,149.45246864933594 320 | 
17.8667,0.0,18.1,0,,6.223,100.0,1.3861,24,666.0,20.2,393.74,21.78,43.68193090714954 321 | 3.1636,0.0,18.1,0,,5.759,48.2,3.0665,24,666.0,20.2,334.4,,85.18618477374098 322 | 11.9511,0.0,18.1,0,,5.608,100.0,1.2852,24,666.0,20.2,332.09,,119.67450422441831 323 | 0.0456,0.0,13.89,1,,5.888,56.0,3.1121,5,276.0,16.4,392.8,,99.95110506548588 324 | 0.21038,20.0,3.33,0,,6.812,32.2,4.1007,5,216.0,14.9,396.9,,150.23532225989135 325 | 9.39063,0.0,18.1,0,0.74,5.627,93.9,1.8172,24,666.0,20.2,396.9,22.88,54.85384792172743 326 | 0.10959,0.0,11.93,0,,6.794,89.3,2.3889,1,273.0,21.0,393.45,,94.30225330095304 327 | 0.03041,0.0,5.19,,,5.895,59.6,5.615,5,,20.2,394.81,,79.22385015356505 328 | 0.52058,0.0,6.2,1,,6.631,76.5,4.148,8,307.0,17.4,388.45,,107.44870964214456 329 | 0.25199,0.0,10.59,0,,5.783,72.7,4.3549,4,277.0,18.6,389.43,18.06,96.44659188003051 330 | 0.21719,0.0,10.59,1,,5.807,53.8,3.6526,4,277.0,18.6,390.94,,95.90550606395755 331 | 0.12932,0.0,13.92,0,,6.678,31.1,5.9604,4,289.0,16.0,396.9,,122.61831018541842 332 | 6.65492,0.0,18.1,0,0.713,6.317,83.0,2.7344,24,666.0,20.2,396.9,,83.62036671666932 333 | 0.21409,22.0,5.86,0,,6.438,8.9,7.3967,7,330.0,19.1,377.07,,106.29601015536667 334 | 0.27957,0.0,9.69,0,,5.926,42.6,2.3817,6,391.0,19.2,396.9,,104.88132454583398 335 | 7.83932,0.0,18.1,0,,6.209,65.4,2.9634,24,666.0,20.2,396.9,,91.78389685802578 336 | 0.1,34.0,6.09,0,,6.982,17.7,5.4917,7,329.0,16.1,390.43,,141.86864846418877 337 | 0.06211,40.0,1.25,0,,6.49,44.4,8.7921,1,335.0,19.7,396.9,,98.09887311709224 338 | 0.09065,20.0,6.96,1,,5.92,61.5,3.9175,3,223.0,18.6,391.34,,88.73061089647581 339 | 0.03445,82.5,2.03,0,,6.162,38.4,6.27,2,348.0,14.7,393.77,,103.29219687089451 340 | 1.46336,0.0,19.58,0,,7.489,90.8,1.9709,5,403.0,14.7,374.43,,214.26351204810595 341 | 0.15936,0.0,6.91,0,,6.211,6.5,5.7209,3,233.0,17.9,394.46,,105.72759550321551 342 | 0.07013,0.0,13.89,0,,6.642,85.1,3.4211,5,276.0,16.4,392.78,,123.04318174921504 343 | 
14.2362,0.0,18.1,0,,6.343,100.0,1.5741,24,666.0,20.2,396.9,20.32,30.878251836188767 344 | 0.09068,45.0,3.44,0,,6.951,21.5,6.4798,5,398.0,15.2,377.68,,158.41685347411962 345 | 0.3494,0.0,9.9,,,5.972,76.7,3.1025,4,304.0,18.4,396.24,,86.95690765902225 346 | 0.65665,20.0,3.97,0,,6.842,100.0,2.0107,5,264.0,13.0,391.93,,129.0269068556081 347 | 0.13262,0.0,8.56,0,,5.851,96.7,2.1069,5,384.0,20.9,394.05,,83.63191924089021 348 | 0.04981,21.0,5.64,0,,5.998,21.4,6.8147,4,243.0,16.8,396.9,,100.27485737120129 349 | 8.15174,0.0,18.1,0,,5.39,98.9,1.7281,24,666.0,20.2,396.9,20.85,49.24166014463305 350 | 0.02731,0.0,7.07,0,,6.421,78.9,4.9671,2,242.0,17.8,396.9,,92.64943942043587 351 | 6.28807,0.0,18.1,0,0.74,6.341,96.4,2.072,24,666.0,20.2,318.01,,63.84656635106048 352 | 0.15086,0.0,27.74,0,,5.454,92.7,1.8209,4,711.0,20.1,395.09,18.06,65.08405693968963 353 | 0.21977,0.0,6.91,0,,5.602,62.0,6.0877,3,233.0,17.9,396.9,,83.15417691706125 354 | 11.8123,0.0,18.1,0,0.718,6.824,76.5,1.794,24,666.0,20.2,48.45,22.74,35.97624959002458 355 | 0.04113,25.0,4.86,0,,6.727,33.5,5.4007,4,281.0,19.0,396.9,,119.98912870543222 356 | 0.13642,0.0,10.59,0,,5.891,22.3,3.9454,4,277.0,18.6,396.9,,96.7789720721923 357 | 1.61282,0.0,8.14,0,,6.096,96.9,3.7598,4,307.0,21.0,248.31,20.34,57.902002735568665 358 | 8.49213,0.0,18.1,0,,6.348,86.1,2.0527,24,666.0,20.2,83.45,,62.10304604481125 359 | 0.82526,20.0,3.97,0,,7.327,94.5,2.0788,5,264.0,13.0,393.42,,132.8290801877492 360 | 37.6619,0.0,18.1,0,,6.202,78.7,1.8629,24,666.0,20.2,18.82,,46.68657433312743 361 | 3.69695,0.0,18.1,0,0.718,4.963,91.4,1.7523,24,666.0,20.2,316.03,,93.78120955775702 362 | 0.03932,0.0,3.41,0,,6.405,73.9,3.0921,2,270.0,17.8,393.55,,94.27963135975943 363 | 0.05497,0.0,5.19,0,,5.985,45.4,4.8122,5,224.0,20.2,396.9,,81.48790483723555 364 | 14.3337,0.0,18.1,0,,6.229,88.0,1.9512,24,666.0,20.2,383.32,,91.61118003657916 365 | 0.0536,21.0,5.64,0,,6.511,21.1,6.8147,4,243.0,16.8,396.9,,107.14931851810758 366 | 
0.03113,0.0,4.39,0,,6.014,48.5,8.0136,3,352.0,18.8,385.64,,74.97865092116241 367 | 0.55007,20.0,3.97,0,,7.206,91.6,1.9301,5,264.0,13.0,387.89,,156.23487524872377 368 | 0.10612,30.0,4.93,0,,6.095,65.1,6.3361,6,300.0,16.6,394.62,,86.06837267706234 369 | 0.62976,0.0,8.14,0,,5.949,61.8,4.7075,4,307.0,21.0,396.9,,87.41859391509344 370 | 0.25356,0.0,9.9,0,,5.705,77.7,3.945,4,304.0,18.4,396.42,,69.40166024965988 371 | 0.0566,0.0,3.41,,,7.007,86.3,3.4217,2,270.0,17.8,396.9,,101.22193723778894 372 | 22.5971,0.0,18.1,0,,5.0,89.5,1.5184,24,666.0,20.2,396.9,31.99,31.726234519957263 373 | 0.22188,20.0,6.96,1,,7.691,51.8,4.3665,3,223.0,18.6,390.77,,150.82150427262926 374 | 2.01019,0.0,19.58,0,,7.929,96.2,2.0459,5,403.0,14.7,369.3,,214.11502684151574 375 | 0.06617,0.0,3.24,0,,5.868,25.8,5.2146,4,430.0,16.9,382.44,,82.62238547103524 376 | 0.23912,0.0,9.69,,,6.019,65.3,2.4091,6,391.0,19.2,396.9,,90.75980852101517 377 | 0.97617,0.0,21.89,0,,5.757,98.4,2.346,4,437.0,21.2,262.76,,66.795105542148 378 | 0.07503,33.0,2.18,0,,7.42,71.9,3.0992,7,222.0,18.4,396.9,,143.17077311916992 379 | 5.69175,0.0,18.1,0,,6.114,79.8,3.5459,24,666.0,20.2,392.68,,81.9121484165166 380 | 0.47547,0.0,9.9,0,,6.113,58.8,4.0019,4,304.0,18.4,396.23,,89.98764919309322 381 | 0.12757,30.0,4.93,0,,6.393,7.8,7.0355,6,300.0,16.6,374.71,,101.63068145253833 382 | 0.0136,75.0,4.0,0,,5.888,47.6,7.3197,3,469.0,21.1,396.9,,80.94118410654285 383 | 4.22239,0.0,18.1,1,0.77,5.803,89.0,1.9047,24,666.0,20.2,353.04,,71.94282999328784 384 | 0.08873,21.0,5.64,0,,5.963,45.7,6.8147,4,243.0,16.8,395.56,,84.3586158325382 385 | 3.69311,0.0,18.1,0,0.713,6.376,88.4,2.5671,24,666.0,20.2,391.43,,75.8482584414499 386 | 0.08447,0.0,4.05,0,,5.859,68.7,2.7019,5,296.0,16.6,393.23,,96.94598162673307 387 | 10.6718,0.0,18.1,0,0.74,6.459,94.8,1.9879,24,666.0,20.2,43.06,23.98,50.5936026306949 388 | 0.0837,45.0,3.44,0,,7.185,38.9,4.5667,5,398.0,15.2,396.9,,149.40381640715177 389 | 
0.04527,0.0,11.93,0,,6.12,76.7,2.2875,1,273.0,21.0,396.9,,88.29026894523939 390 | 5.82115,0.0,18.1,0,0.713,6.513,89.9,2.8016,24,666.0,20.2,393.82,,86.54303414249068 391 | 0.07875,45.0,3.44,,,6.782,41.1,3.7886,5,398.0,15.2,393.87,,137.01095863624937 392 | 2.44953,0.0,19.58,0,,6.402,95.2,2.2625,5,403.0,14.7,330.04,,95.65880177371714 393 | 0.15445,25.0,5.13,0,,6.145,29.2,7.8148,8,284.0,19.7,390.68,,99.93310262249477 394 | 0.25387,0.0,6.91,0,,5.399,95.3,5.87,3,233.0,17.9,396.9,30.81,61.7473108349467 395 | 0.03049,55.0,3.78,0,,6.874,28.1,6.4654,5,370.0,17.6,387.97,,133.58147604929238 396 | 0.33045,0.0,6.2,0,,6.086,61.5,3.6519,8,307.0,17.4,376.75,,102.93180803659672 397 | 0.08221,22.0,5.86,0,,6.957,6.8,8.9067,7,330.0,19.1,386.09,,126.77645542235585 398 | 0.85204,0.0,8.14,0,,5.965,89.2,4.0123,4,307.0,21.0,392.53,,83.94234406655659 399 | 0.26938,0.0,9.9,0,,6.266,82.8,3.2628,4,304.0,18.4,393.39,,92.4989310922952 400 | 6.80117,0.0,18.1,0,0.713,6.081,84.4,2.7175,24,666.0,20.2,396.9,,85.74937514322716 401 | 1.27346,0.0,19.58,1,,6.25,92.6,1.7984,5,403.0,14.7,338.92,,115.79063663910462 402 | 0.10469,40.0,6.41,1,,7.267,49.0,4.7872,4,254.0,17.6,389.25,,142.23997357087353 403 | 9.96654,0.0,18.1,0,0.74,6.485,100.0,1.9784,24,666.0,20.2,386.73,18.85,65.93794043531199 404 | 0.06911,45.0,3.44,0,,6.739,30.8,6.4798,5,398.0,15.2,389.71,,130.67837655855462 405 | 16.8118,0.0,18.1,0,,5.277,98.1,1.4261,24,666.0,20.2,396.9,30.81,30.826335050066838 406 | 0.08265,0.0,13.92,0,,6.127,18.4,5.5027,4,289.0,16.0,396.9,,102.49425262479257 407 | 28.6558,0.0,18.1,0,,5.155,100.0,1.5894,24,666.0,20.2,210.97,20.08,69.86260748939857 408 | 0.02543,55.0,3.78,0,,6.696,56.4,5.7321,5,370.0,17.6,396.9,,102.45198430594878 409 | 0.61154,20.0,3.97,0,,8.704,86.9,1.801,5,264.0,13.0,389.7,,214.30612899294056 410 | 0.49298,0.0,9.9,0,,6.635,82.5,3.3175,4,304.0,18.4,396.9,,97.64871883252277 411 | 2.73397,0.0,19.58,0,0.871,5.597,94.9,1.5257,5,403.0,14.7,351.85,21.45,65.97107628810248 412 | 
0.34006,0.0,21.89,0,,6.458,98.9,2.1185,4,437.0,21.2,395.04,,82.25129303259749 413 | 1.49632,0.0,19.58,0,0.871,5.404,100.0,1.5916,5,403.0,14.7,341.6,,84.04901653769186 414 | 4.26131,0.0,18.1,0,0.77,6.112,81.3,2.5091,24,666.0,20.2,390.74,,96.8500187197275 415 | 0.0686,0.0,2.89,0,,7.416,62.5,3.4952,2,276.0,18.0,396.9,,142.20689094774346 416 | 8.26725,0.0,18.1,1,,5.875,89.6,1.1296,24,666.0,20.2,347.88,,214.1986894942322 417 | 0.07151,0.0,4.49,0,,6.121,56.8,3.7476,3,247.0,18.5,395.15,,95.13056954253848 418 | 7.75223,0.0,18.1,0,0.713,6.301,83.7,2.7831,24,666.0,20.2,272.21,,63.85386272152615 419 | 0.04544,0.0,3.24,0,,6.144,32.2,5.8736,4,430.0,16.9,368.57,,84.8058053152259 420 | 0.28955,0.0,10.59,0,,5.412,9.8,3.5875,4,277.0,18.6,348.93,29.55,101.59726822044716 421 | 3.77498,0.0,18.1,0,,5.952,84.7,2.8715,24,666.0,20.2,22.01,,81.50138540570634 422 | 0.07165,0.0,25.65,0,,6.004,84.1,2.1974,2,188.0,19.1,377.67,,86.90751630292641 423 | 0.04741,0.0,11.93,0,,6.03,80.8,2.505,1,273.0,21.0,396.9,,50.95279803969646 424 | 1.25179,0.0,8.14,0,,5.57,98.1,3.7979,4,307.0,21.0,376.57,21.02,58.28811876068804 425 | 0.12579,45.0,3.44,,,6.556,29.1,4.5667,5,398.0,15.2,382.84,,127.60470297228434 426 | 0.15876,0.0,10.81,0,,5.961,17.5,5.2873,4,305.0,19.2,376.94,,92.88368665394682 427 | 0.1712,0.0,8.56,0,,5.836,91.9,2.211,5,384.0,20.9,395.67,18.66,83.64745648119606 428 | 0.29916,20.0,6.96,0,,5.856,42.1,4.429,3,223.0,18.6,388.65,,90.40912785392962 429 | 0.01501,80.0,2.01,0,,6.635,29.7,8.344,4,280.0,17.0,390.94,,104.92186930660849 430 | 11.1604,0.0,18.1,0,0.74,6.629,94.6,2.1247,24,666.0,20.2,109.85,23.27,57.4275008162647 431 | 0.22876,0.0,8.56,0,,6.405,85.4,2.7147,5,384.0,20.9,70.8,,79.62768514145213 432 | -------------------------------------------------------------------------------- /data/housing/housing_validation.csv: -------------------------------------------------------------------------------- 1 | CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT 2 | 
0.09178,0.0,4.05,0,,6.416,84.1,2.6463,5,296.0,,395.5, 3 | 0.05644,40.0,6.41,1,,6.758,32.9,4.0776,4,254.0,,396.9, 4 | 0.10574,0.0,27.74,0,,5.983,98.8,1.8681,4,711.0,,390.11,18.07 5 | 0.09164,0.0,10.81,0,,6.065,7.8,5.2873,4,305.0,,390.91, 6 | 5.09017,0.0,18.1,0,0.713,6.297,91.8,2.3682,24,666.0,,385.09, 7 | 0.10153,0.0,12.83,,,6.279,74.5,4.0522,5,398.0,,373.66, 8 | 0.31827,0.0,9.9,0,,5.914,83.2,3.9986,4,304.0,,390.7,18.33 9 | 0.2909,0.0,21.89,0,,6.174,93.6,1.6119,4,437.0,,388.08,24.16 10 | 4.03841,0.0,18.1,0,,6.229,90.7,3.0993,24,666.0,,395.33, 11 | 0.22438,0.0,9.69,0,,6.027,79.7,2.4982,6,391.0,,396.9, 12 | 0.11069,0.0,13.89,1,,5.951,93.8,2.8893,5,276.0,,396.9,17.92 13 | 0.17004,12.5,7.87,0,,6.004,85.9,6.5921,5,311.0,,386.71, 14 | 45.7461,0.0,18.1,0,,4.519,100.0,1.6582,24,666.0,,88.27,36.98 15 | 0.05646,0.0,12.83,0,,6.232,53.7,5.0141,5,398.0,,386.4, 16 | 0.28392,0.0,7.38,0,,5.708,74.3,4.7211,5,287.0,,391.13, 17 | 4.64689,0.0,18.1,0,,6.98,67.6,2.5329,24,666.0,,374.68, 18 | 0.09849,0.0,25.65,0,,5.879,95.8,2.0063,2,188.0,,379.38, 19 | 14.3337,0.0,18.1,0,,4.88,100.0,1.5895,24,666.0,,372.92,30.62 20 | 0.01381,80.0,0.46,0,,7.875,32.0,5.6484,4,255.0,,394.23, 21 | 9.32909,0.0,18.1,0,0.713,6.185,98.7,2.2616,24,666.0,,396.9,18.13 22 | 0.16211,20.0,6.96,0,,6.24,16.3,4.429,3,223.0,,396.9, 23 | 0.07978,40.0,6.41,0,,6.482,32.1,4.1403,4,254.0,,396.9, 24 | 1.13081,0.0,8.14,0,,5.713,94.1,4.233,4,307.0,,360.17,22.6 25 | 0.06263,0.0,11.93,0,,6.593,69.1,2.4786,1,273.0,,391.99, 26 | 7.02259,0.0,18.1,0,0.718,6.006,95.3,1.8746,24,666.0,,319.98, 27 | 8.05579,0.0,18.1,0,,5.427,95.4,2.4298,24,666.0,,352.58,18.14 28 | 0.08387,0.0,12.83,0,,5.874,36.6,4.5026,5,398.0,,396.06, 29 | 9.51363,0.0,18.1,0,0.713,6.728,94.1,2.4961,24,666.0,,6.68,18.71 30 | 0.17446,0.0,10.59,1,,5.96,92.1,3.8771,4,277.0,,393.25, 31 | 0.26838,0.0,9.69,0,,5.794,70.6,2.8927,6,391.0,,396.9, 32 | 0.13914,0.0,4.05,0,,5.572,88.5,2.5961,5,296.0,,396.9, 33 | 0.1676,0.0,7.38,0,,6.426,52.3,4.5404,5,287.0,,396.9, 34 | 
19.6091,0.0,18.1,0,,7.313,97.9,1.3163,24,666.0,,396.9, 35 | 3.67822,0.0,18.1,0,0.77,5.362,96.2,2.1036,24,666.0,,380.79, 36 | 4.42228,0.0,18.1,0,,6.003,94.5,2.5403,24,666.0,,331.29,21.32 37 | 2.14918,0.0,19.58,0,0.871,5.709,98.5,1.6232,5,403.0,,261.95, 38 | 0.02729,0.0,7.07,0,,7.185,61.1,4.9671,2,242.0,,392.83, 39 | 0.03427,0.0,5.19,0,,5.869,46.3,5.2311,5,224.0,,396.9, 40 | 0.13587,0.0,10.59,1,,6.064,59.1,4.2392,4,277.0,,381.32, 41 | 0.19539,0.0,10.81,0,,6.245,6.2,5.2873,4,305.0,,377.17, 42 | 0.2896,0.0,9.69,0,,5.39,72.9,2.7986,6,391.0,,396.9,21.14 43 | 0.04932,33.0,2.18,0,,6.849,70.3,3.1827,7,222.0,,396.9, 44 | 0.02009,95.0,2.68,0,,8.034,31.9,5.118,4,224.0,,390.55, 45 | 0.13554,12.5,6.07,0,,5.594,36.8,6.498,4,345.0,,396.9, 46 | 0.04684,0.0,3.41,0,,6.417,66.1,3.0923,2,270.0,,392.18, 47 | 6.96215,0.0,18.1,0,,5.713,97.0,1.9265,24,666.0,,394.43, 48 | 1.15172,0.0,8.14,0,,5.701,95.0,3.7872,4,307.0,,358.77,18.35 49 | 0.08826,0.0,10.81,,,6.417,6.6,5.2873,4,305.0,,383.73, 50 | 4.34879,0.0,18.1,0,,6.167,84.0,3.0334,24,666.0,,396.9, 51 | 0.00632,18.0,2.31,0,,6.575,65.2,4.09,1,296.0,,396.9, 52 | 0.11747,12.5,7.87,0,,6.009,82.9,6.2267,5,311.0,,396.9, 53 | 0.03705,20.0,3.33,0,,6.968,37.2,5.2447,5,216.0,,392.23, 54 | 1.23247,0.0,8.14,0,,6.142,91.7,3.9769,4,307.0,,396.9,18.72 55 | 0.11432,0.0,8.56,0,,6.781,71.3,2.8561,5,384.0,,395.58, 56 | 0.5405,20.0,3.97,0,,7.47,52.6,2.872,5,264.0,,390.3, 57 | 3.67367,0.0,18.1,0,,6.312,51.9,3.9917,24,666.0,,388.62, 58 | 5.66637,0.0,18.1,0,0.74,6.219,100.0,2.0048,24,666.0,,395.69, 59 | 0.03502,80.0,4.95,0,,6.861,27.9,5.1167,4,245.0,,396.9, 60 | 0.05059,0.0,4.49,0,,6.389,48.0,4.7794,3,247.0,,396.9, 61 | 0.19133,22.0,5.86,0,,5.605,70.2,7.9549,7,330.0,,389.13,18.46 62 | 0.1265,25.0,5.13,0,,6.762,43.4,7.9809,8,284.0,,395.58, 63 | 0.01311,90.0,1.22,0,,7.249,21.9,8.6966,5,226.0,,395.93, 64 | 0.44178,0.0,6.2,0,,6.552,21.4,3.3751,8,307.0,,380.34, 65 | 0.80271,0.0,8.14,0,,5.456,36.6,3.7965,4,307.0,,288.99, 66 | 
0.0795,60.0,1.69,0,,6.579,35.9,10.7103,4,411.0,,370.78, 67 | 0.43571,0.0,10.59,1,,5.344,100.0,3.875,4,277.0,,396.9,23.09 68 | 8.71675,0.0,18.1,0,,6.471,98.8,1.7257,24,666.0,,391.98, 69 | 0.03659,25.0,4.86,0,,6.302,32.2,5.4007,4,281.0,,396.9, 70 | 0.02763,75.0,2.95,0,,6.595,21.8,5.4011,3,252.0,,395.63, 71 | 4.66883,0.0,18.1,0,0.713,5.976,87.9,2.5806,24,666.0,,10.48,19.01 72 | 0.18836,0.0,6.91,0,,5.786,33.3,5.1004,3,233.0,,396.9, 73 | 5.70818,0.0,18.1,0,,6.75,74.9,3.3317,24,666.0,,393.07, 74 | 12.8023,0.0,18.1,0,0.74,5.854,96.6,1.8956,24,666.0,,240.52,23.79 75 | 0.10659,80.0,1.91,0,,5.936,19.5,10.5857,4,334.0,,376.04, 76 | 0.08707,0.0,12.83,0,,6.14,45.8,4.0905,5,398.0,,386.96, 77 | 38.3518,0.0,18.1,,,5.453,100.0,1.4896,24,666.0,,396.9,30.59 78 | -------------------------------------------------------------------------------- /data/iris/iris.csv: -------------------------------------------------------------------------------- 1 | "sepal.length","sepal.width","petal.length","petal.width","variety" 2 | 5.1,3.5,1.4,.2,"Setosa" 3 | 4.9,3,1.4,.2,"Setosa" 4 | 4.7,3.2,1.3,.2,"Setosa" 5 | 4.6,3.1,1.5,.2,"Setosa" 6 | 5,3.6,1.4,.2,"Setosa" 7 | 5.4,3.9,1.7,.4,"Setosa" 8 | 4.6,3.4,1.4,.3,"Setosa" 9 | 5,3.4,1.5,.2,"Setosa" 10 | 4.4,2.9,1.4,.2,"Setosa" 11 | 4.9,3.1,1.5,.1,"Setosa" 12 | 5.4,3.7,1.5,.2,"Setosa" 13 | 4.8,3.4,1.6,.2,"Setosa" 14 | 4.8,3,1.4,.1,"Setosa" 15 | 4.3,3,1.1,.1,"Setosa" 16 | 5.8,4,1.2,.2,"Setosa" 17 | 5.7,4.4,1.5,.4,"Setosa" 18 | 5.4,3.9,1.3,.4,"Setosa" 19 | 5.1,3.5,1.4,.3,"Setosa" 20 | 5.7,3.8,1.7,.3,"Setosa" 21 | 5.1,3.8,1.5,.3,"Setosa" 22 | 5.4,3.4,1.7,.2,"Setosa" 23 | 5.1,3.7,1.5,.4,"Setosa" 24 | 4.6,3.6,1,.2,"Setosa" 25 | 5.1,3.3,1.7,.5,"Setosa" 26 | 4.8,3.4,1.9,.2,"Setosa" 27 | 5,3,1.6,.2,"Setosa" 28 | 5,3.4,1.6,.4,"Setosa" 29 | 5.2,3.5,1.5,.2,"Setosa" 30 | 5.2,3.4,1.4,.2,"Setosa" 31 | 4.7,3.2,1.6,.2,"Setosa" 32 | 4.8,3.1,1.6,.2,"Setosa" 33 | 5.4,3.4,1.5,.4,"Setosa" 34 | 5.2,4.1,1.5,.1,"Setosa" 35 | 5.5,4.2,1.4,.2,"Setosa" 36 | 4.9,3.1,1.5,.2,"Setosa" 37 
| 5,3.2,1.2,.2,"Setosa" 38 | 5.5,3.5,1.3,.2,"Setosa" 39 | 4.9,3.6,1.4,.1,"Setosa" 40 | 4.4,3,1.3,.2,"Setosa" 41 | 5.1,3.4,1.5,.2,"Setosa" 42 | 5,3.5,1.3,.3,"Setosa" 43 | 4.5,2.3,1.3,.3,"Setosa" 44 | 4.4,3.2,1.3,.2,"Setosa" 45 | 5,3.5,1.6,.6,"Setosa" 46 | 5.1,3.8,1.9,.4,"Setosa" 47 | 4.8,3,1.4,.3,"Setosa" 48 | 5.1,3.8,1.6,.2,"Setosa" 49 | 4.6,3.2,1.4,.2,"Setosa" 50 | 5.3,3.7,1.5,.2,"Setosa" 51 | 5,3.3,1.4,.2,"Setosa" 52 | 7,3.2,4.7,1.4,"Versicolor" 53 | 6.4,3.2,4.5,1.5,"Versicolor" 54 | 6.9,3.1,4.9,1.5,"Versicolor" 55 | 5.5,2.3,4,1.3,"Versicolor" 56 | 6.5,2.8,4.6,1.5,"Versicolor" 57 | 5.7,2.8,4.5,1.3,"Versicolor" 58 | 6.3,3.3,4.7,1.6,"Versicolor" 59 | 4.9,2.4,3.3,1,"Versicolor" 60 | 6.6,2.9,4.6,1.3,"Versicolor" 61 | 5.2,2.7,3.9,1.4,"Versicolor" 62 | 5,2,3.5,1,"Versicolor" 63 | 5.9,3,4.2,1.5,"Versicolor" 64 | 6,2.2,4,1,"Versicolor" 65 | 6.1,2.9,4.7,1.4,"Versicolor" 66 | 5.6,2.9,3.6,1.3,"Versicolor" 67 | 6.7,3.1,4.4,1.4,"Versicolor" 68 | 5.6,3,4.5,1.5,"Versicolor" 69 | 5.8,2.7,4.1,1,"Versicolor" 70 | 6.2,2.2,4.5,1.5,"Versicolor" 71 | 5.6,2.5,3.9,1.1,"Versicolor" 72 | 5.9,3.2,4.8,1.8,"Versicolor" 73 | 6.1,2.8,4,1.3,"Versicolor" 74 | 6.3,2.5,4.9,1.5,"Versicolor" 75 | 6.1,2.8,4.7,1.2,"Versicolor" 76 | 6.4,2.9,4.3,1.3,"Versicolor" 77 | 6.6,3,4.4,1.4,"Versicolor" 78 | 6.8,2.8,4.8,1.4,"Versicolor" 79 | 6.7,3,5,1.7,"Versicolor" 80 | 6,2.9,4.5,1.5,"Versicolor" 81 | 5.7,2.6,3.5,1,"Versicolor" 82 | 5.5,2.4,3.8,1.1,"Versicolor" 83 | 5.5,2.4,3.7,1,"Versicolor" 84 | 5.8,2.7,3.9,1.2,"Versicolor" 85 | 6,2.7,5.1,1.6,"Versicolor" 86 | 5.4,3,4.5,1.5,"Versicolor" 87 | 6,3.4,4.5,1.6,"Versicolor" 88 | 6.7,3.1,4.7,1.5,"Versicolor" 89 | 6.3,2.3,4.4,1.3,"Versicolor" 90 | 5.6,3,4.1,1.3,"Versicolor" 91 | 5.5,2.5,4,1.3,"Versicolor" 92 | 5.5,2.6,4.4,1.2,"Versicolor" 93 | 6.1,3,4.6,1.4,"Versicolor" 94 | 5.8,2.6,4,1.2,"Versicolor" 95 | 5,2.3,3.3,1,"Versicolor" 96 | 5.6,2.7,4.2,1.3,"Versicolor" 97 | 5.7,3,4.2,1.2,"Versicolor" 98 | 5.7,2.9,4.2,1.3,"Versicolor" 99 | 6.2,2.9,4.3,1.3,"Versicolor" 100 | 
5.1,2.5,3,1.1,"Versicolor" 101 | 5.7,2.8,4.1,1.3,"Versicolor" 102 | 6.3,3.3,6,2.5,"Virginica" 103 | 5.8,2.7,5.1,1.9,"Virginica" 104 | 7.1,3,5.9,2.1,"Virginica" 105 | 6.3,2.9,5.6,1.8,"Virginica" 106 | 6.5,3,5.8,2.2,"Virginica" 107 | 7.6,3,6.6,2.1,"Virginica" 108 | 4.9,2.5,4.5,1.7,"Virginica" 109 | 7.3,2.9,6.3,1.8,"Virginica" 110 | 6.7,2.5,5.8,1.8,"Virginica" 111 | 7.2,3.6,6.1,2.5,"Virginica" 112 | 6.5,3.2,5.1,2,"Virginica" 113 | 6.4,2.7,5.3,1.9,"Virginica" 114 | 6.8,3,5.5,2.1,"Virginica" 115 | 5.7,2.5,5,2,"Virginica" 116 | 5.8,2.8,5.1,2.4,"Virginica" 117 | 6.4,3.2,5.3,2.3,"Virginica" 118 | 6.5,3,5.5,1.8,"Virginica" 119 | 7.7,3.8,6.7,2.2,"Virginica" 120 | 7.7,2.6,6.9,2.3,"Virginica" 121 | 6,2.2,5,1.5,"Virginica" 122 | 6.9,3.2,5.7,2.3,"Virginica" 123 | 5.6,2.8,4.9,2,"Virginica" 124 | 7.7,2.8,6.7,2,"Virginica" 125 | 6.3,2.7,4.9,1.8,"Virginica" 126 | 6.7,3.3,5.7,2.1,"Virginica" 127 | 7.2,3.2,6,1.8,"Virginica" 128 | 6.2,2.8,4.8,1.8,"Virginica" 129 | 6.1,3,4.9,1.8,"Virginica" 130 | 6.4,2.8,5.6,2.1,"Virginica" 131 | 7.2,3,5.8,1.6,"Virginica" 132 | 7.4,2.8,6.1,1.9,"Virginica" 133 | 7.9,3.8,6.4,2,"Virginica" 134 | 6.4,2.8,5.6,2.2,"Virginica" 135 | 6.3,2.8,5.1,1.5,"Virginica" 136 | 6.1,2.6,5.6,1.4,"Virginica" 137 | 7.7,3,6.1,2.3,"Virginica" 138 | 6.3,3.4,5.6,2.4,"Virginica" 139 | 6.4,3.1,5.5,1.8,"Virginica" 140 | 6,3,4.8,1.8,"Virginica" 141 | 6.9,3.1,5.4,2.1,"Virginica" 142 | 6.7,3.1,5.6,2.4,"Virginica" 143 | 6.9,3.1,5.1,2.3,"Virginica" 144 | 5.8,2.7,5.1,1.9,"Virginica" 145 | 6.8,3.2,5.9,2.3,"Virginica" 146 | 6.7,3.3,5.7,2.5,"Virginica" 147 | 6.7,3,5.2,2.3,"Virginica" 148 | 6.3,2.5,5,1.9,"Virginica" 149 | 6.5,3,5.2,2,"Virginica" 150 | 6.2,3.4,5.4,2.3,"Virginica" 151 | 5.9,3,5.1,1.8,"Virginica" -------------------------------------------------------------------------------- /data/iris/iris.tsv: -------------------------------------------------------------------------------- 1 | sepal.length sepal.width petal.length petal.width variety 2 | 5.1 3.5 1.4 0.2 Setosa 3 | 4.9 3.0 1.4 
0.2 Setosa 4 | 4.7 3.2 1.3 0.2 Setosa 5 | 4.6 3.1 1.5 0.2 Setosa 6 | 5.0 3.6 1.4 0.2 Setosa 7 | 5.4 3.9 1.7 0.4 Setosa 8 | 4.6 3.4 1.4 0.3 Setosa 9 | 5.0 3.4 1.5 0.2 Setosa 10 | 4.4 2.9 1.4 0.2 Setosa 11 | 4.9 3.1 1.5 0.1 Setosa 12 | 5.4 3.7 1.5 0.2 Setosa 13 | 4.8 3.4 1.6 0.2 Setosa 14 | 4.8 3.0 1.4 0.1 Setosa 15 | 4.3 3.0 1.1 0.1 Setosa 16 | 5.8 4.0 1.2 0.2 Setosa 17 | 5.7 4.4 1.5 0.4 Setosa 18 | 5.4 3.9 1.3 0.4 Setosa 19 | 5.1 3.5 1.4 0.3 Setosa 20 | 5.7 3.8 1.7 0.3 Setosa 21 | 5.1 3.8 1.5 0.3 Setosa 22 | 5.4 3.4 1.7 0.2 Setosa 23 | 5.1 3.7 1.5 0.4 Setosa 24 | 4.6 3.6 1.0 0.2 Setosa 25 | 5.1 3.3 1.7 0.5 Setosa 26 | 4.8 3.4 1.9 0.2 Setosa 27 | 5.0 3.0 1.6 0.2 Setosa 28 | 5.0 3.4 1.6 0.4 Setosa 29 | 5.2 3.5 1.5 0.2 Setosa 30 | 5.2 3.4 1.4 0.2 Setosa 31 | 4.7 3.2 1.6 0.2 Setosa 32 | 4.8 3.1 1.6 0.2 Setosa 33 | 5.4 3.4 1.5 0.4 Setosa 34 | 5.2 4.1 1.5 0.1 Setosa 35 | 5.5 4.2 1.4 0.2 Setosa 36 | 4.9 3.1 1.5 0.2 Setosa 37 | 5.0 3.2 1.2 0.2 Setosa 38 | 5.5 3.5 1.3 0.2 Setosa 39 | 4.9 3.6 1.4 0.1 Setosa 40 | 4.4 3.0 1.3 0.2 Setosa 41 | 5.1 3.4 1.5 0.2 Setosa 42 | 5.0 3.5 1.3 0.3 Setosa 43 | 4.5 2.3 1.3 0.3 Setosa 44 | 4.4 3.2 1.3 0.2 Setosa 45 | 5.0 3.5 1.6 0.6 Setosa 46 | 5.1 3.8 1.9 0.4 Setosa 47 | 4.8 3.0 1.4 0.3 Setosa 48 | 5.1 3.8 1.6 0.2 Setosa 49 | 4.6 3.2 1.4 0.2 Setosa 50 | 5.3 3.7 1.5 0.2 Setosa 51 | 5.0 3.3 1.4 0.2 Setosa 52 | 7.0 3.2 4.7 1.4 Versicolor 53 | 6.4 3.2 4.5 1.5 Versicolor 54 | 6.9 3.1 4.9 1.5 Versicolor 55 | 5.5 2.3 4.0 1.3 Versicolor 56 | 6.5 2.8 4.6 1.5 Versicolor 57 | 5.7 2.8 4.5 1.3 Versicolor 58 | 6.3 3.3 4.7 1.6 Versicolor 59 | 4.9 2.4 3.3 1.0 Versicolor 60 | 6.6 2.9 4.6 1.3 Versicolor 61 | 5.2 2.7 3.9 1.4 Versicolor 62 | 5.0 2.0 3.5 1.0 Versicolor 63 | 5.9 3.0 4.2 1.5 Versicolor 64 | 6.0 2.2 4.0 1.0 Versicolor 65 | 6.1 2.9 4.7 1.4 Versicolor 66 | 5.6 2.9 3.6 1.3 Versicolor 67 | 6.7 3.1 4.4 1.4 Versicolor 68 | 5.6 3.0 4.5 1.5 Versicolor 69 | 5.8 2.7 4.1 1.0 Versicolor 70 | 6.2 2.2 4.5 1.5 Versicolor 71 | 5.6 2.5 3.9 1.1 Versicolor 72 | 5.9 
3.2 4.8 1.8 Versicolor 73 | 6.1 2.8 4.0 1.3 Versicolor 74 | 6.3 2.5 4.9 1.5 Versicolor 75 | 6.1 2.8 4.7 1.2 Versicolor 76 | 6.4 2.9 4.3 1.3 Versicolor 77 | 6.6 3.0 4.4 1.4 Versicolor 78 | 6.8 2.8 4.8 1.4 Versicolor 79 | 6.7 3.0 5.0 1.7 Versicolor 80 | 6.0 2.9 4.5 1.5 Versicolor 81 | 5.7 2.6 3.5 1.0 Versicolor 82 | 5.5 2.4 3.8 1.1 Versicolor 83 | 5.5 2.4 3.7 1.0 Versicolor 84 | 5.8 2.7 3.9 1.2 Versicolor 85 | 6.0 2.7 5.1 1.6 Versicolor 86 | 5.4 3.0 4.5 1.5 Versicolor 87 | 6.0 3.4 4.5 1.6 Versicolor 88 | 6.7 3.1 4.7 1.5 Versicolor 89 | 6.3 2.3 4.4 1.3 Versicolor 90 | 5.6 3.0 4.1 1.3 Versicolor 91 | 5.5 2.5 4.0 1.3 Versicolor 92 | 5.5 2.6 4.4 1.2 Versicolor 93 | 6.1 3.0 4.6 1.4 Versicolor 94 | 5.8 2.6 4.0 1.2 Versicolor 95 | 5.0 2.3 3.3 1.0 Versicolor 96 | 5.6 2.7 4.2 1.3 Versicolor 97 | 5.7 3.0 4.2 1.2 Versicolor 98 | 5.7 2.9 4.2 1.3 Versicolor 99 | 6.2 2.9 4.3 1.3 Versicolor 100 | 5.1 2.5 3.0 1.1 Versicolor 101 | 5.7 2.8 4.1 1.3 Versicolor 102 | 6.3 3.3 6.0 2.5 Virginica 103 | 5.8 2.7 5.1 1.9 Virginica 104 | 7.1 3.0 5.9 2.1 Virginica 105 | 6.3 2.9 5.6 1.8 Virginica 106 | 6.5 3.0 5.8 2.2 Virginica 107 | 7.6 3.0 6.6 2.1 Virginica 108 | 4.9 2.5 4.5 1.7 Virginica 109 | 7.3 2.9 6.3 1.8 Virginica 110 | 6.7 2.5 5.8 1.8 Virginica 111 | 7.2 3.6 6.1 2.5 Virginica 112 | 6.5 3.2 5.1 2.0 Virginica 113 | 6.4 2.7 5.3 1.9 Virginica 114 | 6.8 3.0 5.5 2.1 Virginica 115 | 5.7 2.5 5.0 2.0 Virginica 116 | 5.8 2.8 5.1 2.4 Virginica 117 | 6.4 3.2 5.3 2.3 Virginica 118 | 6.5 3.0 5.5 1.8 Virginica 119 | 7.7 3.8 6.7 2.2 Virginica 120 | 7.7 2.6 6.9 2.3 Virginica 121 | 6.0 2.2 5.0 1.5 Virginica 122 | 6.9 3.2 5.7 2.3 Virginica 123 | 5.6 2.8 4.9 2.0 Virginica 124 | 7.7 2.8 6.7 2.0 Virginica 125 | 6.3 2.7 4.9 1.8 Virginica 126 | 6.7 3.3 5.7 2.1 Virginica 127 | 7.2 3.2 6.0 1.8 Virginica 128 | 6.2 2.8 4.8 1.8 Virginica 129 | 6.1 3.0 4.9 1.8 Virginica 130 | 6.4 2.8 5.6 2.1 Virginica 131 | 7.2 3.0 5.8 1.6 Virginica 132 | 7.4 2.8 6.1 1.9 Virginica 133 | 7.9 3.8 6.4 2.0 Virginica 134 | 6.4 2.8 5.6 2.2 
Virginica 135 | 6.3 2.8 5.1 1.5 Virginica 136 | 6.1 2.6 5.6 1.4 Virginica 137 | 7.7 3.0 6.1 2.3 Virginica 138 | 6.3 3.4 5.6 2.4 Virginica 139 | 6.4 3.1 5.5 1.8 Virginica 140 | 6.0 3.0 4.8 1.8 Virginica 141 | 6.9 3.1 5.4 2.1 Virginica 142 | 6.7 3.1 5.6 2.4 Virginica 143 | 6.9 3.1 5.1 2.3 Virginica 144 | 5.8 2.7 5.1 1.9 Virginica 145 | 6.8 3.2 5.9 2.3 Virginica 146 | 6.7 3.3 5.7 2.5 Virginica 147 | 6.7 3.0 5.2 2.3 Virginica 148 | 6.3 2.5 5.0 1.9 Virginica 149 | 6.5 3.0 5.2 2.0 Virginica 150 | 6.2 3.4 5.4 2.3 Virginica 151 | 5.9 3.0 5.1 1.8 Virginica 152 | -------------------------------------------------------------------------------- /data/iris/iris.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/data/iris/iris.xlsx -------------------------------------------------------------------------------- /data/iris/iris_noheader.csv: -------------------------------------------------------------------------------- 1 | 5.1,3.5,1.4,.2,"Setosa" 2 | 4.9,3,1.4,.2,"Setosa" 3 | 4.7,3.2,1.3,.2,"Setosa" 4 | 4.6,3.1,1.5,.2,"Setosa" 5 | 5,3.6,1.4,.2,"Setosa" 6 | 5.4,3.9,1.7,.4,"Setosa" 7 | 4.6,3.4,1.4,.3,"Setosa" 8 | 5,3.4,1.5,.2,"Setosa" 9 | 4.4,2.9,1.4,.2,"Setosa" 10 | 4.9,3.1,1.5,.1,"Setosa" 11 | 5.4,3.7,1.5,.2,"Setosa" 12 | 4.8,3.4,1.6,.2,"Setosa" 13 | 4.8,3,1.4,.1,"Setosa" 14 | 4.3,3,1.1,.1,"Setosa" 15 | 5.8,4,1.2,.2,"Setosa" 16 | 5.7,4.4,1.5,.4,"Setosa" 17 | 5.4,3.9,1.3,.4,"Setosa" 18 | 5.1,3.5,1.4,.3,"Setosa" 19 | 5.7,3.8,1.7,.3,"Setosa" 20 | 5.1,3.8,1.5,.3,"Setosa" 21 | 5.4,3.4,1.7,.2,"Setosa" 22 | 5.1,3.7,1.5,.4,"Setosa" 23 | 4.6,3.6,1,.2,"Setosa" 24 | 5.1,3.3,1.7,.5,"Setosa" 25 | 4.8,3.4,1.9,.2,"Setosa" 26 | 5,3,1.6,.2,"Setosa" 27 | 5,3.4,1.6,.4,"Setosa" 28 | 5.2,3.5,1.5,.2,"Setosa" 29 | 5.2,3.4,1.4,.2,"Setosa" 30 | 4.7,3.2,1.6,.2,"Setosa" 31 | 4.8,3.1,1.6,.2,"Setosa" 32 | 5.4,3.4,1.5,.4,"Setosa" 33 | 5.2,4.1,1.5,.1,"Setosa" 34 | 
5.5,4.2,1.4,.2,"Setosa" 35 | 4.9,3.1,1.5,.2,"Setosa" 36 | 5,3.2,1.2,.2,"Setosa" 37 | 5.5,3.5,1.3,.2,"Setosa" 38 | 4.9,3.6,1.4,.1,"Setosa" 39 | 4.4,3,1.3,.2,"Setosa" 40 | 5.1,3.4,1.5,.2,"Setosa" 41 | 5,3.5,1.3,.3,"Setosa" 42 | 4.5,2.3,1.3,.3,"Setosa" 43 | 4.4,3.2,1.3,.2,"Setosa" 44 | 5,3.5,1.6,.6,"Setosa" 45 | 5.1,3.8,1.9,.4,"Setosa" 46 | 4.8,3,1.4,.3,"Setosa" 47 | 5.1,3.8,1.6,.2,"Setosa" 48 | 4.6,3.2,1.4,.2,"Setosa" 49 | 5.3,3.7,1.5,.2,"Setosa" 50 | 5,3.3,1.4,.2,"Setosa" 51 | 7,3.2,4.7,1.4,"Versicolor" 52 | 6.4,3.2,4.5,1.5,"Versicolor" 53 | 6.9,3.1,4.9,1.5,"Versicolor" 54 | 5.5,2.3,4,1.3,"Versicolor" 55 | 6.5,2.8,4.6,1.5,"Versicolor" 56 | 5.7,2.8,4.5,1.3,"Versicolor" 57 | 6.3,3.3,4.7,1.6,"Versicolor" 58 | 4.9,2.4,3.3,1,"Versicolor" 59 | 6.6,2.9,4.6,1.3,"Versicolor" 60 | 5.2,2.7,3.9,1.4,"Versicolor" 61 | 5,2,3.5,1,"Versicolor" 62 | 5.9,3,4.2,1.5,"Versicolor" 63 | 6,2.2,4,1,"Versicolor" 64 | 6.1,2.9,4.7,1.4,"Versicolor" 65 | 5.6,2.9,3.6,1.3,"Versicolor" 66 | 6.7,3.1,4.4,1.4,"Versicolor" 67 | 5.6,3,4.5,1.5,"Versicolor" 68 | 5.8,2.7,4.1,1,"Versicolor" 69 | 6.2,2.2,4.5,1.5,"Versicolor" 70 | 5.6,2.5,3.9,1.1,"Versicolor" 71 | 5.9,3.2,4.8,1.8,"Versicolor" 72 | 6.1,2.8,4,1.3,"Versicolor" 73 | 6.3,2.5,4.9,1.5,"Versicolor" 74 | 6.1,2.8,4.7,1.2,"Versicolor" 75 | 6.4,2.9,4.3,1.3,"Versicolor" 76 | 6.6,3,4.4,1.4,"Versicolor" 77 | 6.8,2.8,4.8,1.4,"Versicolor" 78 | 6.7,3,5,1.7,"Versicolor" 79 | 6,2.9,4.5,1.5,"Versicolor" 80 | 5.7,2.6,3.5,1,"Versicolor" 81 | 5.5,2.4,3.8,1.1,"Versicolor" 82 | 5.5,2.4,3.7,1,"Versicolor" 83 | 5.8,2.7,3.9,1.2,"Versicolor" 84 | 6,2.7,5.1,1.6,"Versicolor" 85 | 5.4,3,4.5,1.5,"Versicolor" 86 | 6,3.4,4.5,1.6,"Versicolor" 87 | 6.7,3.1,4.7,1.5,"Versicolor" 88 | 6.3,2.3,4.4,1.3,"Versicolor" 89 | 5.6,3,4.1,1.3,"Versicolor" 90 | 5.5,2.5,4,1.3,"Versicolor" 91 | 5.5,2.6,4.4,1.2,"Versicolor" 92 | 6.1,3,4.6,1.4,"Versicolor" 93 | 5.8,2.6,4,1.2,"Versicolor" 94 | 5,2.3,3.3,1,"Versicolor" 95 | 5.6,2.7,4.2,1.3,"Versicolor" 96 | 5.7,3,4.2,1.2,"Versicolor" 97 | 
5.7,2.9,4.2,1.3,"Versicolor" 98 | 6.2,2.9,4.3,1.3,"Versicolor" 99 | 5.1,2.5,3,1.1,"Versicolor" 100 | 5.7,2.8,4.1,1.3,"Versicolor" 101 | 6.3,3.3,6,2.5,"Virginica" 102 | 5.8,2.7,5.1,1.9,"Virginica" 103 | 7.1,3,5.9,2.1,"Virginica" 104 | 6.3,2.9,5.6,1.8,"Virginica" 105 | 6.5,3,5.8,2.2,"Virginica" 106 | 7.6,3,6.6,2.1,"Virginica" 107 | 4.9,2.5,4.5,1.7,"Virginica" 108 | 7.3,2.9,6.3,1.8,"Virginica" 109 | 6.7,2.5,5.8,1.8,"Virginica" 110 | 7.2,3.6,6.1,2.5,"Virginica" 111 | 6.5,3.2,5.1,2,"Virginica" 112 | 6.4,2.7,5.3,1.9,"Virginica" 113 | 6.8,3,5.5,2.1,"Virginica" 114 | 5.7,2.5,5,2,"Virginica" 115 | 5.8,2.8,5.1,2.4,"Virginica" 116 | 6.4,3.2,5.3,2.3,"Virginica" 117 | 6.5,3,5.5,1.8,"Virginica" 118 | 7.7,3.8,6.7,2.2,"Virginica" 119 | 7.7,2.6,6.9,2.3,"Virginica" 120 | 6,2.2,5,1.5,"Virginica" 121 | 6.9,3.2,5.7,2.3,"Virginica" 122 | 5.6,2.8,4.9,2,"Virginica" 123 | 7.7,2.8,6.7,2,"Virginica" 124 | 6.3,2.7,4.9,1.8,"Virginica" 125 | 6.7,3.3,5.7,2.1,"Virginica" 126 | 7.2,3.2,6,1.8,"Virginica" 127 | 6.2,2.8,4.8,1.8,"Virginica" 128 | 6.1,3,4.9,1.8,"Virginica" 129 | 6.4,2.8,5.6,2.1,"Virginica" 130 | 7.2,3,5.8,1.6,"Virginica" 131 | 7.4,2.8,6.1,1.9,"Virginica" 132 | 7.9,3.8,6.4,2,"Virginica" 133 | 6.4,2.8,5.6,2.2,"Virginica" 134 | 6.3,2.8,5.1,1.5,"Virginica" 135 | 6.1,2.6,5.6,1.4,"Virginica" 136 | 7.7,3,6.1,2.3,"Virginica" 137 | 6.3,3.4,5.6,2.4,"Virginica" 138 | 6.4,3.1,5.5,1.8,"Virginica" 139 | 6,3,4.8,1.8,"Virginica" 140 | 6.9,3.1,5.4,2.1,"Virginica" 141 | 6.7,3.1,5.6,2.4,"Virginica" 142 | 6.9,3.1,5.1,2.3,"Virginica" 143 | 5.8,2.7,5.1,1.9,"Virginica" 144 | 6.8,3.2,5.9,2.3,"Virginica" 145 | 6.7,3.3,5.7,2.5,"Virginica" 146 | 6.7,3,5.2,2.3,"Virginica" 147 | 6.3,2.5,5,1.9,"Virginica" 148 | 6.5,3,5.2,2,"Virginica" 149 | 6.2,3.4,5.4,2.3,"Virginica" 150 | 5.9,3,5.1,1.8,"Virginica" -------------------------------------------------------------------------------- /data/other/lotr_data.csv: -------------------------------------------------------------------------------- 1 | 
Name,Race,Salary,Profession,Age of Death 2 | Bilbo Baggins,Hobbit,10000,Retired,131 3 | Frodo Baggins,Hobbit,70000,Ring-bearer,53 4 | Sam Gamgee,Hobbit,60000,Security,102 5 | Aragorn,Human,60000,Security,210 6 | -------------------------------------------------------------------------------- /homework_solutions/.gitignore: -------------------------------------------------------------------------------- 1 | light_gbm.csv 2 | model_lgbm_regressor.pkl 3 | -------------------------------------------------------------------------------- /homework_solutions/01_hw_numpy.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Numpy Homework Solution\n", 8 | "\n", 9 | "## Task 1" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import numpy as np\n", 19 | "from numpy.random import default_rng\n", 20 | "\n", 21 | "rng = default_rng(1337)\n", 22 | "x = np.round(rng.normal(size=30), 2)\n", 23 | "y = x + np.round(rng.normal(size=30) * 0.1, 2)" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 2, 29 | "metadata": {}, 30 | "outputs": [ 31 | { 32 | "data": { 33 | "text/plain": [ 34 | "(0.18533333333333335, 5.5600000000000005, 0.8520000000000001)" 35 | ] 36 | }, 37 | "execution_count": 2, 38 | "metadata": {}, 39 | "output_type": "execute_result" 40 | } 41 | ], 42 | "source": [ 43 | "# 1, 2, 3\n", 44 | "x.mean(), x.sum(), np.abs(x).mean()" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 3, 50 | "metadata": {}, 51 | "outputs": [ 52 | { 53 | "data": { 54 | "text/plain": [ 55 | "2.52" 56 | ] 57 | }, 58 | "execution_count": 3, 59 | "metadata": {}, 60 | "output_type": "execute_result" 61 | } 62 | ], 63 | "source": [ 64 | "# 4\n", 65 | "x[np.abs(x).argmax()]" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | 
"execution_count": 4, 71 | "metadata": {}, 72 | "outputs": [ 73 | { 74 | "data": { 75 | "text/plain": [ 76 | "-2.5" 77 | ] 78 | }, 79 | "execution_count": 4, 80 | "metadata": {}, 81 | "output_type": "execute_result" 82 | } 83 | ], 84 | "source": [ 85 | "# 5\n", 86 | "x[np.abs(x - 2).argmax()]" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 5, 92 | "metadata": {}, 93 | "outputs": [ 94 | { 95 | "data": { 96 | "text/plain": [ 97 | "array([ 0.04, 0.47, -0.14, -1. , 1. , -1. , 1. , -1. , 0.15,\n", 98 | " -0.09, 1. , 0.52, -0.53, 1. , 0.21, 1. , -0.22, 0.09,\n", 99 | " -0.13, -1. , 0.85, 0.68, 0.87, -0.34, 1. , 1. , -0.04,\n", 100 | " -0.82, -0.16, -1. ])" 101 | ] 102 | }, 103 | "execution_count": 5, 104 | "metadata": {}, 105 | "output_type": "execute_result" 106 | } 107 | ], 108 | "source": [ 109 | "# 6\n", 110 | "np.where(x > 1, 1, np.where(x>-1, x , -1))" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 6, 116 | "metadata": {}, 117 | "outputs": [ 118 | { 119 | "data": { 120 | "text/plain": [ 121 | "-0.0029999999999999914" 122 | ] 123 | }, 124 | "execution_count": 6, 125 | "metadata": {}, 126 | "output_type": "execute_result" 127 | } 128 | ], 129 | "source": [ 130 | "# 7\n", 131 | "(y-x).mean()" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 7, 137 | "metadata": {}, 138 | "outputs": [ 139 | { 140 | "data": { 141 | "text/plain": [ 142 | "0.08499999999999999" 143 | ] 144 | }, 145 | "execution_count": 7, 146 | "metadata": {}, 147 | "output_type": "execute_result" 148 | } 149 | ], 150 | "source": [ 151 | "# 8\n", 152 | "np.abs(y - x).mean()" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 8, 158 | "metadata": {}, 159 | "outputs": [ 160 | { 161 | "data": { 162 | "text/plain": [ 163 | "0.010869999999999998" 164 | ] 165 | }, 166 | "execution_count": 8, 167 | "metadata": {}, 168 | "output_type": "execute_result" 169 | } 170 | ], 171 | "source": [ 172 | "# 9\n", 173 | 
"((x - y)**2).mean()" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 9, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "data": { 183 | "text/plain": [ 184 | "0.10425929215182692" 185 | ] 186 | }, 187 | "execution_count": 9, 188 | "metadata": {}, 189 | "output_type": "execute_result" 190 | } 191 | ], 192 | "source": [ 193 | "# 10\n", 194 | "np.sqrt(((x - y)**2).mean())" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "## Task 2" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 10, 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "def standardize(X):\n", 211 | " return (X - X.mean(axis=0)) / X.std(axis=0)" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 11, 217 | "metadata": {}, 218 | "outputs": [ 219 | { 220 | "data": { 221 | "text/plain": [ 222 | "array([[ 0, 1, 2],\n", 223 | " [ 3, 4, 5],\n", 224 | " [ 6, 7, 8],\n", 225 | " [ 9, 10, 11],\n", 226 | " [12, 13, 14],\n", 227 | " [15, 16, 17],\n", 228 | " [18, 19, 20],\n", 229 | " [21, 22, 23],\n", 230 | " [24, 25, 26],\n", 231 | " [27, 28, 29]])" 232 | ] 233 | }, 234 | "execution_count": 11, 235 | "metadata": {}, 236 | "output_type": "execute_result" 237 | } 238 | ], 239 | "source": [ 240 | "X = np.arange(30).reshape((10, -1))\n", 241 | "X" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 12, 247 | "metadata": {}, 248 | "outputs": [ 249 | { 250 | "data": { 251 | "text/plain": [ 252 | "(array([-1.11022302e-16, -1.11022302e-16, -1.11022302e-16]),\n", 253 | " array([1., 1., 1.]))" 254 | ] 255 | }, 256 | "execution_count": 12, 257 | "metadata": {}, 258 | "output_type": "execute_result" 259 | } 260 | ], 261 | "source": [ 262 | "Xs = standardize(X)\n", 263 | "Xs.mean(axis=0), Xs.std(axis=0)" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "## Task 3" 271 | ] 272 | }, 273 | { 274 
| "cell_type": "code", 275 | "execution_count": 13, 276 | "metadata": {}, 277 | "outputs": [], 278 | "source": [ 279 | "def simulation_pi(n, seed):\n", 280 | " # We will sample x, y points from [0, 1) and check\n", 281 | " # percentage of them landing inside a quater of a unit\n", 282 | " # circle. Unit square has area 1 and quater of unit circle\n", 283 | " # has area $\\pi / 4$, that's why we multiply by 4 to get $pi$.\n", 284 | " rng = default_rng(seed)\n", 285 | " x = rng.random(n)\n", 286 | " y = rng.random(n)\n", 287 | " return 4 * ((x**2 + y**2) <= 1).mean()" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": 14, 293 | "metadata": {}, 294 | "outputs": [ 295 | { 296 | "data": { 297 | "text/plain": [ 298 | "3.1409368" 299 | ] 300 | }, 301 | "execution_count": 14, 302 | "metadata": {}, 303 | "output_type": "execute_result" 304 | } 305 | ], 306 | "source": [ 307 | "simulation_pi(10_000_000, 2022)" 308 | ] 309 | }, 310 | { 311 | "cell_type": "markdown", 312 | "metadata": {}, 313 | "source": [ 314 | "## Task 4" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 15, 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [ 323 | "def simulation_exp2(n, seed):\n", 324 | " rng = default_rng(seed)\n", 325 | " x = rng.uniform(-2, 2, n)\n", 326 | " yr = rng.uniform(0, 1, n)\n", 327 | " yf = np.exp(-(x**2))\n", 328 | " # Here we multiply by 4 since we sample from \n", 329 | " # rectangle of area 4 \n", 330 | " # x in [-2, 2) and y from [0, 1)\n", 331 | " A = (yr < yf).mean() * 4\n", 332 | " return A" 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": 16, 338 | "metadata": {}, 339 | "outputs": [ 340 | { 341 | "data": { 342 | "text/plain": [ 343 | "1.7634636" 344 | ] 345 | }, 346 | "execution_count": 16, 347 | "metadata": {}, 348 | "output_type": "execute_result" 349 | } 350 | ], 351 | "source": [ 352 | "simulation_exp2(10_000_000, 2022)" 353 | ] 354 | } 355 | ], 356 | "metadata": { 357 | 
"kernelspec": { 358 | "display_name": "Python 3.10.4 ('daftacademy-ds')", 359 | "language": "python", 360 | "name": "python3" 361 | }, 362 | "language_info": { 363 | "codemirror_mode": { 364 | "name": "ipython", 365 | "version": 3 366 | }, 367 | "file_extension": ".py", 368 | "mimetype": "text/x-python", 369 | "name": "python", 370 | "nbconvert_exporter": "python", 371 | "pygments_lexer": "ipython3", 372 | "version": "3.10.4" 373 | }, 374 | "orig_nbformat": 4, 375 | "vscode": { 376 | "interpreter": { 377 | "hash": "306379f83cd7ad906f1941e4fa5d943b9cb847ce8f1ed425d8b7a18353685fed" 378 | } 379 | } 380 | }, 381 | "nbformat": 4, 382 | "nbformat_minor": 2 383 | } 384 | -------------------------------------------------------------------------------- /homework_solutions/03_hw_pandas2_a.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Pandas Homework Part 2\n", 8 | "\n", 9 | "`pandas` version" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "from collections import defaultdict" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "columns = ['prev', 'curr', 'type', 'n']" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 19, 34 | "metadata": {}, 35 | "outputs": [ 36 | { 37 | "name": "stdout", 38 | "output_type": "stream", 39 | "text": [ 40 | "de Ukraine\n", 41 | "pl Ukraina\n", 42 | "CPU times: user 13.3 s, sys: 304 ms, total: 13.6 s\n", 43 | "Wall time: 13.7 s\n" 44 | ] 45 | } 46 | ], 47 | "source": [ 48 | "%%time\n", 49 | "# 1, 2\n", 50 | "\n", 51 | "for country in [\"de\", \"pl\"]:\n", 52 | " df = pd.read_csv(f\"../data/wikipedia/clickstream-{country}wiki-2022-03.tsv.gz\", sep=\"\\t\", names=columns, 
on_bad_lines='warn', quoting=3)\n", 53 | " s = df.query(\"type == 'external'\").groupby(\"curr\")['n'].sum().sort_values(ascending=False).head().index[0]\n", 54 | " print(country, s)" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 5, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "badges = pd.read_xml(\"../data/travel/travel.stackexchange.com/Badges.xml\")\n", 64 | "posts = pd.read_xml(\"../data/travel/travel.stackexchange.com/Posts.xml\", parser='etree')\n", 65 | "tags = pd.read_xml(\"../data/travel/travel.stackexchange.com/Tags.xml\", parser='etree')\n", 66 | "users = pd.read_xml(\"../data/travel/travel.stackexchange.com/Users.xml\", parser='etree')\n", 67 | "votes = pd.read_xml(\"../data/travel/travel.stackexchange.com/Votes.xml\", parser='etree')" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": 6, 73 | "metadata": {}, 74 | "outputs": [], 75 | "source": [ 76 | "wiki = pd.read_csv(f\"../data/wikipedia/clickstream-enwiki-2022-03.tsv.gz\", sep=\"\\t\", names=columns, quoting=3)" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 7, 82 | "metadata": {}, 83 | "outputs": [ 84 | { 85 | "name": "stdout", 86 | "output_type": "stream", 87 | "text": [ 88 | "CPU times: user 166 ms, sys: 3.64 ms, total: 170 ms\n", 89 | "Wall time: 169 ms\n" 90 | ] 91 | }, 92 | { 93 | "data": { 94 | "text/plain": [ 95 | "DisplayName Mark Mayo\n", 96 | "Location Christchurch, New Zealand\n", 97 | "Name: 0, dtype: object" 98 | ] 99 | }, 100 | "execution_count": 7, 101 | "metadata": {}, 102 | "output_type": "execute_result" 103 | } 104 | ], 105 | "source": [ 106 | "%%time\n", 107 | "# 3, 4\n", 108 | "tid = badges.merge(users, left_on=\"UserId\", right_on=\"Id\").groupby(\"UserId\").size().sort_values().index[-1]\n", 109 | "top_user = users.loc[users[\"Id\"] == tid, :]\n", 110 | "top_user = top_user.reset_index(drop=True).loc[0, ['DisplayName', 'Location']]\n", 111 | "top_user" 112 | ] 113 | }, 114 | { 115 | 
"cell_type": "code", 116 | "execution_count": 8, 117 | "metadata": {}, 118 | "outputs": [ 119 | { 120 | "name": "stdout", 121 | "output_type": "stream", 122 | "text": [ 123 | "CPU times: user 1.4 s, sys: 67.7 ms, total: 1.46 s\n", 124 | "Wall time: 1.45 s\n" 125 | ] 126 | }, 127 | { 128 | "data": { 129 | "text/plain": [ 130 | "25804" 131 | ] 132 | }, 133 | "execution_count": 8, 134 | "metadata": {}, 135 | "output_type": "execute_result" 136 | } 137 | ], 138 | "source": [ 139 | "%%time\n", 140 | "# 5\n", 141 | "city = top_user['Location'].split(\", \")[0]\n", 142 | "wiki.loc[wiki['curr'] == city, :]['n'].sum()" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": 9, 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [ 151 | "# Not that this part can be done in many ways\n", 152 | "# this focuses on showing how to work with apply in non-standard way\n", 153 | "\n", 154 | "# https://stackoverflow.com/questions/9662346/python-code-to-remove-html-tags-from-a-string\n", 155 | "import re\n", 156 | "CLEANR = re.compile('<.*?>') \n", 157 | "\n", 158 | "def cleanhtml(raw_html):\n", 159 | " if isinstance(raw_html, str):\n", 160 | " cleantext = re.sub(CLEANR, '', raw_html)\n", 161 | " return cleantext\n", 162 | " else:\n", 163 | " return raw_html" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 10, 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [ 172 | "def aux(l):\n", 173 | " d = defaultdict(int)\n", 174 | " if isinstance(l, list):\n", 175 | " for w in l:\n", 176 | " d[w.lower()] += 1\n", 177 | " return d" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 11, 183 | "metadata": {}, 184 | "outputs": [ 185 | { 186 | "name": "stdout", 187 | "output_type": "stream", 188 | "text": [ 189 | "CPU times: user 9.94 s, sys: 722 ms, total: 10.7 s\n", 190 | "Wall time: 10.7 s\n" 191 | ] 192 | }, 193 | { 194 | "data": { 195 | "text/plain": [ 196 | "('passport', 31631)" 197 | ] 198 | }, 199 | 
"execution_count": 11, 200 | "metadata": {}, 201 | "output_type": "execute_result" 202 | } 203 | ], 204 | "source": [ 205 | "%%time\n", 206 | "# 6, 7\n", 207 | "dicts = posts['Body'].apply(cleanhtml).str.replace(\"\\n\", \" \").str.split(\" \").apply(aux)\n", 208 | "# Even better solution:\n", 209 | "# dicts = posts['Body'].str.replace('<.*?>', \"\", regex=True).str.replace(\"\\n\", \" \").str.split(\" \").apply(aux)\n", 210 | "big_d = defaultdict(int)\n", 211 | "for d in dicts:\n", 212 | " for k, v in d.items():\n", 213 | " big_d[k] += v\n", 214 | "\n", 215 | "s = pd.Series(big_d, name=\"Count\").reset_index()\n", 216 | "s.rename(columns={'index':'Word'}, inplace=True)\n", 217 | "\n", 218 | "words = s.loc[s.Word.str.len() > 7, :].sort_values(\"Count\", ascending=False).head()\n", 219 | "\n", 220 | "# 3 points\n", 221 | "theword = words['Word'].iloc[0]\n", 222 | "theword, wiki.query(\"curr == @theword.capitalize()\")['n'].sum()" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": 14, 228 | "metadata": {}, 229 | "outputs": [ 230 | { 231 | "name": "stdout", 232 | "output_type": "stream", 233 | "text": [ 234 | "CPU times: user 12.8 s, sys: 1.53 s, total: 14.4 s\n", 235 | "Wall time: 14.4 s\n" 236 | ] 237 | }, 238 | { 239 | "data": { 240 | "text/plain": [ 241 | "('passport', 31631)" 242 | ] 243 | }, 244 | "execution_count": 14, 245 | "metadata": {}, 246 | "output_type": "execute_result" 247 | } 248 | ], 249 | "source": [ 250 | "%%time\n", 251 | "# 6, 7\n", 252 | "# Just different approach\n", 253 | "words = (\n", 254 | " posts['Body']\n", 255 | " .str.replace('<.*?>', \"\", regex=True)\n", 256 | " .str.replace(\"\\n\", \" \")\n", 257 | " .str.split(\" \")\n", 258 | " .explode()\n", 259 | " .str.lower()\n", 260 | ")\n", 261 | "theword = words[words.str.len() > 7].value_counts().head(1).index[0]\n", 262 | "theword, wiki.query(\"curr == @theword.capitalize()\")['n'].sum()" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | 
"execution_count": 15, 268 | "metadata": {}, 269 | "outputs": [ 270 | { 271 | "name": "stdout", 272 | "output_type": "stream", 273 | "text": [ 274 | "CPU times: user 933 ms, sys: 11.9 ms, total: 945 ms\n", 275 | "Wall time: 943 ms\n" 276 | ] 277 | }, 278 | { 279 | "data": { 280 | "text/plain": [ 281 | "Score 547\n", 282 | "DisplayName Andrew Lazarus\n", 283 | "Name: 0, dtype: object" 284 | ] 285 | }, 286 | "execution_count": 15, 287 | "metadata": {}, 288 | "output_type": "execute_result" 289 | } 290 | ], 291 | "source": [ 292 | "%%time\n", 293 | "# 8, 9\n", 294 | "upvotes = (\n", 295 | " votes\n", 296 | " .query('VoteTypeId == 2')\n", 297 | " .groupby(\"PostId\")\n", 298 | " .size()\n", 299 | " .reset_index(name=\"UpVotes\")\n", 300 | ")\n", 301 | "downvotes = (\n", 302 | " votes\n", 303 | " .query('VoteTypeId == 3')\n", 304 | " .groupby(\"PostId\")\n", 305 | " .size()\n", 306 | " .reset_index(name=\"DownVotes\")\n", 307 | ")\n", 308 | "\n", 309 | "posts2 = (\n", 310 | " posts\n", 311 | " .merge(upvotes, left_on=\"Id\", right_on=\"PostId\", how='left')\n", 312 | " .merge(downvotes, left_on=\"Id\", right_on=\"PostId\", how='left')\n", 313 | ")\n", 314 | "posts2.loc[:, ['UpVotes', 'DownVotes']] = posts2.loc[:, ['UpVotes', 'DownVotes']].fillna(value=0)\n", 315 | "\n", 316 | "posts2['UpVoteRatio'] = posts2['UpVotes'] - posts2['DownVotes']\n", 317 | "\n", 318 | "(\n", 319 | " posts2\n", 320 | " .merge(users, left_on=\"OwnerUserId\", right_on=\"Id\")\n", 321 | " .sort_values(\"UpVoteRatio\", ascending=False)\n", 322 | " .reset_index(drop=True)\n", 323 | " .loc[0, ['Score', 'DisplayName']]\n", 324 | ")" 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": 16, 330 | "metadata": {}, 331 | "outputs": [ 332 | { 333 | "name": "stdout", 334 | "output_type": "stream", 335 | "text": [ 336 | "CPU times: user 212 ms, sys: 8.26 ms, total: 220 ms\n", 337 | "Wall time: 219 ms\n" 338 | ] 339 | }, 340 | { 341 | "data": { 342 | "text/plain": [ 343 | 
"Timestamp('2016-08-31 00:00:00')" 344 | ] 345 | }, 346 | "execution_count": 16, 347 | "metadata": {}, 348 | "output_type": "execute_result" 349 | } 350 | ], 351 | "source": [ 352 | "%%time\n", 353 | "# 10\n", 354 | "votes\n", 355 | "votes['CreationDateDT'] = pd.to_datetime(votes['CreationDate'])\n", 356 | "votes.set_index(\"CreationDateDT\", inplace=True)\n", 357 | "\n", 358 | "votesagg = votes.groupby(pd.Grouper(freq=\"M\")).size()\n", 359 | "\n", 360 | "votesagg.sort_values(ascending=False).index[0]\n" 361 | ] 362 | }, 363 | { 364 | "cell_type": "code", 365 | "execution_count": 17, 366 | "metadata": {}, 367 | "outputs": [ 368 | { 369 | "name": "stdout", 370 | "output_type": "stream", 371 | "text": [ 372 | "CPU times: user 0 ns, sys: 1.9 ms, total: 1.9 ms\n", 373 | "Wall time: 1.73 ms\n" 374 | ] 375 | }, 376 | { 377 | "data": { 378 | "text/plain": [ 379 | "Timestamp('2015-10-31 00:00:00')" 380 | ] 381 | }, 382 | "execution_count": 17, 383 | "metadata": {}, 384 | "output_type": "execute_result" 385 | } 386 | ], 387 | "source": [ 388 | "%%time\n", 389 | "# 11\n", 390 | "# votesagg is sorted by index (CreationDateDT) \n", 391 | "votesagg.diff().sort_values().index[0]" 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": 18, 397 | "metadata": {}, 398 | "outputs": [ 399 | { 400 | "name": "stdout", 401 | "output_type": "stream", 402 | "text": [ 403 | "CPU times: user 452 ms, sys: 37 µs, total: 452 ms\n", 404 | "Wall time: 451 ms\n" 405 | ] 406 | }, 407 | { 408 | "data": { 409 | "text/plain": [ 410 | "air-travel 34\n", 411 | "Name: Tags, dtype: int64" 412 | ] 413 | }, 414 | "execution_count": 18, 415 | "metadata": {}, 416 | "output_type": "execute_result" 417 | } 418 | ], 419 | "source": [ 420 | "%%time\n", 421 | "# 12\n", 422 | "\n", 423 | "posts3 = posts.merge(users, left_on=\"OwnerUserId\", right_on=\"Id\")\n", 424 | "tags = posts3.loc[\n", 425 | " posts3['Location'].str.contains(\"Poland\") | \n", 426 | " 
posts3['Location'].str.contains(\"Polska\"), \n", 427 | " 'Tags'\n", 428 | "]\n", 429 | "(\n", 430 | " tags\n", 431 | " .str.strip(\"<\")\n", 432 | " .str.strip(\">\")\n", 433 | " .str.split(\"><\")\n", 434 | " .dropna()\n", 435 | " .explode()\n", 436 | " .value_counts()\n", 437 | " .head(1)\n", 438 | ")" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": null, 444 | "metadata": {}, 445 | "outputs": [], 446 | "source": [] 447 | } 448 | ], 449 | "metadata": { 450 | "kernelspec": { 451 | "display_name": "Python 3.10.4 ('daftacademy-ds2')", 452 | "language": "python", 453 | "name": "python3" 454 | }, 455 | "language_info": { 456 | "codemirror_mode": { 457 | "name": "ipython", 458 | "version": 3 459 | }, 460 | "file_extension": ".py", 461 | "mimetype": "text/x-python", 462 | "name": "python", 463 | "nbconvert_exporter": "python", 464 | "pygments_lexer": "ipython3", 465 | "version": "3.10.4" 466 | }, 467 | "orig_nbformat": 4, 468 | "vscode": { 469 | "interpreter": { 470 | "hash": "8d8a772a312a89d7c091db0c8769ded3912bfec6f446bb9104da72914614d8d8" 471 | } 472 | } 473 | }, 474 | "nbformat": 4, 475 | "nbformat_minor": 2 476 | } 477 | -------------------------------------------------------------------------------- /homework_solutions/03_hw_pandas2_b.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Pandas Homework Part 2\n", 8 | "\n", 9 | "`polars` version" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "import polars as pl" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "columns = ['prev', 'curr', 'type', 'n']" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 4, 34 | "metadata": {}, 
35 | "outputs": [ 36 | { 37 | "name": "stdout", 38 | "output_type": "stream", 39 | "text": [ 40 | "de Ukraine\n", 41 | "pl Ukraina\n", 42 | "CPU times: user 6.5 s, sys: 1.83 s, total: 8.32 s\n", 43 | "Wall time: 2.54 s\n" 44 | ] 45 | } 46 | ], 47 | "source": [ 48 | "%%time\n", 49 | "for country in [\"de\", \"pl\"]:\n", 50 | " dfl = pl.read_csv(f\"../data/wikipedia/clickstream-{country}wiki-2022-03.tsv.gz\",sep=\"\\t\", has_header=False, new_columns=columns, quote_char=None)\n", 51 | " s = (\n", 52 | " dfl.lazy()\n", 53 | " .filter(pl.col(\"type\") ==\"external\")\n", 54 | " .groupby(\"curr\")\n", 55 | " .agg(pl.col(\"n\").sum().alias(\"total\"))\n", 56 | " .sort(\"total\",reverse=True)\n", 57 | " .collect()[0, 'curr']\n", 58 | " )\n", 59 | " print(country, s)" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 5, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "badges = pd.read_xml(\"../data/travel/travel.stackexchange.com/Badges.xml\")\n", 69 | "posts = pd.read_xml(\"../data/travel/travel.stackexchange.com/Posts.xml\", parser='etree')\n", 70 | "tags = pd.read_xml(\"../data/travel/travel.stackexchange.com/Tags.xml\", parser='etree')\n", 71 | "users = pd.read_xml(\"../data/travel/travel.stackexchange.com/Users.xml\", parser='etree')\n", 72 | "votes = pd.read_xml(\"../data/travel/travel.stackexchange.com/Votes.xml\", parser='etree')" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 6, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "badges_pl = pl.from_pandas(badges)\n", 82 | "posts_pl = pl.from_pandas(posts)\n", 83 | "tags_pl = pl.from_pandas(tags)\n", 84 | "votes_pl = pl.from_pandas(votes)\n", 85 | "users_pl = pl.from_pandas(users)\n", 86 | "posts_pl['OwnerUserId'] = posts_pl['OwnerUserId'].cast(int)" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 7, 92 | "metadata": {}, 93 | "outputs": [], 94 | "source": [ 95 | "wiki_pl = 
pl.read_csv(f\"../data/wikipedia/clickstream-enwiki-2022-03.tsv.gz\", sep=\"\\t\", has_header=False, new_columns=columns, quote_char=\"\")" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 8, 101 | "metadata": {}, 102 | "outputs": [ 103 | { 104 | "name": "stdout", 105 | "output_type": "stream", 106 | "text": [ 107 | "CPU times: user 393 ms, sys: 94.4 ms, total: 488 ms\n", 108 | "Wall time: 108 ms\n" 109 | ] 110 | }, 111 | { 112 | "data": { 113 | "text/html": [ 114 | "
\n", 115 | "\n", 128 | "\n", 129 | "\n", 130 | "\n", 131 | "\n", 134 | "\n", 137 | "\n", 138 | "\n", 139 | "\n", 142 | "\n", 145 | "\n", 146 | "\n", 147 | "\n", 148 | "\n", 149 | "\n", 152 | "\n", 155 | "\n", 156 | "\n", 157 | "
\n", 132 | "DisplayName\n", 133 | "\n", 135 | "Location\n", 136 | "
\n", 140 | "str\n", 141 | "\n", 143 | "str\n", 144 | "
\n", 150 | "\"Mark Mayo\"\n", 151 | "\n", 153 | "\"Christchurch, New Zealand\"\n", 154 | "
\n", 158 | "
" 159 | ], 160 | "text/plain": [ 161 | "shape: (1, 2)\n", 162 | "┌─────────────┬───────────────────────────┐\n", 163 | "│ DisplayName ┆ Location │\n", 164 | "│ --- ┆ --- │\n", 165 | "│ str ┆ str │\n", 166 | "╞═════════════╪═══════════════════════════╡\n", 167 | "│ Mark Mayo ┆ Christchurch, New Zealand │\n", 168 | "└─────────────┴───────────────────────────┘" 169 | ] 170 | }, 171 | "execution_count": 8, 172 | "metadata": {}, 173 | "output_type": "execute_result" 174 | } 175 | ], 176 | "source": [ 177 | "%%time \n", 178 | "# 3, 4\n", 179 | "tid = (\n", 180 | " badges_pl\n", 181 | " .join(users_pl, left_on=\"UserId\", right_on=\"Id\", how='left')\n", 182 | " .groupby([\"UserId\", \"DisplayName\"])\n", 183 | " .agg([pl.count().alias(\"NBadges\")])\n", 184 | " .sort(\"NBadges\", reverse=True)\n", 185 | " .head(1)\n", 186 | " [0, 'UserId']\n", 187 | ")\n", 188 | "top_user =(\n", 189 | " users_pl\n", 190 | " .filter(pl.col(\"Id\") == tid)\n", 191 | " .select(['DisplayName', 'Location'])\n", 192 | ")\n", 193 | "top_user" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 9, 199 | "metadata": {}, 200 | "outputs": [ 201 | { 202 | "name": "stdout", 203 | "output_type": "stream", 204 | "text": [ 205 | "CPU times: user 170 ms, sys: 38.5 ms, total: 208 ms\n", 206 | "Wall time: 44.6 ms\n" 207 | ] 208 | }, 209 | { 210 | "data": { 211 | "text/html": [ 212 | "
\n", 213 | "\n", 226 | "\n", 227 | "\n", 228 | "\n", 229 | "\n", 232 | "\n", 235 | "\n", 236 | "\n", 237 | "\n", 240 | "\n", 243 | "\n", 244 | "\n", 245 | "\n", 246 | "\n", 247 | "\n", 250 | "\n", 253 | "\n", 254 | "\n", 255 | "
\n", 230 | "DisplayName\n", 231 | "\n", 233 | "Location\n", 234 | "
\n", 238 | "str\n", 239 | "\n", 241 | "str\n", 242 | "
\n", 248 | "\"Mark Mayo\"\n", 249 | "\n", 251 | "\"Christchurch, New Zealand\"\n", 252 | "
\n", 256 | "
" 257 | ], 258 | "text/plain": [ 259 | "shape: (1, 2)\n", 260 | "┌─────────────┬───────────────────────────┐\n", 261 | "│ DisplayName ┆ Location │\n", 262 | "│ --- ┆ --- │\n", 263 | "│ str ┆ str │\n", 264 | "╞═════════════╪═══════════════════════════╡\n", 265 | "│ Mark Mayo ┆ Christchurch, New Zealand │\n", 266 | "└─────────────┴───────────────────────────┘" 267 | ] 268 | }, 269 | "execution_count": 9, 270 | "metadata": {}, 271 | "output_type": "execute_result" 272 | } 273 | ], 274 | "source": [ 275 | "%%time \n", 276 | "# 3, 4 lazy evaluation\n", 277 | "(\n", 278 | " badges_pl.lazy()\n", 279 | " .join(users_pl.lazy(), left_on=\"UserId\", right_on=\"Id\", how='left')\n", 280 | " .groupby([\"UserId\", \"DisplayName\"])\n", 281 | " .agg([pl.count().alias(\"NBadges\")])\n", 282 | " .sort(\"NBadges\", reverse=True)\n", 283 | " .head(1)\n", 284 | " .collect()[0, ['UserId']]\n", 285 | ")\n", 286 | "top_user = (\n", 287 | " users_pl\n", 288 | " .lazy()\n", 289 | " .filter(pl.col(\"Id\") == tid)\n", 290 | " .select(['DisplayName', 'Location'])\n", 291 | " .collect()\n", 292 | ")\n", 293 | "top_user" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": 10, 299 | "metadata": {}, 300 | "outputs": [ 301 | { 302 | "name": "stdout", 303 | "output_type": "stream", 304 | "text": [ 305 | "CPU times: user 93 ms, sys: 1.29 ms, total: 94.3 ms\n", 306 | "Wall time: 78.3 ms\n" 307 | ] 308 | }, 309 | { 310 | "data": { 311 | "text/html": [ 312 | "
\n", 313 | "\n", 326 | "\n", 327 | "\n", 328 | "\n", 329 | "\n", 332 | "\n", 333 | "\n", 334 | "\n", 337 | "\n", 338 | "\n", 339 | "\n", 340 | "\n", 341 | "\n", 344 | "\n", 345 | "\n", 346 | "
\n", 330 | "n\n", 331 | "
\n", 335 | "i64\n", 336 | "
\n", 342 | "25804\n", 343 | "
\n", 347 | "
" 348 | ], 349 | "text/plain": [ 350 | "shape: (1, 1)\n", 351 | "┌───────┐\n", 352 | "│ n │\n", 353 | "│ --- │\n", 354 | "│ i64 │\n", 355 | "╞═══════╡\n", 356 | "│ 25804 │\n", 357 | "└───────┘" 358 | ] 359 | }, 360 | "execution_count": 10, 361 | "metadata": {}, 362 | "output_type": "execute_result" 363 | } 364 | ], 365 | "source": [ 366 | "%%time\n", 367 | "# 5\n", 368 | "city = top_user['Location'][0].split(\", \")[0]\n", 369 | "(\n", 370 | " wiki_pl\n", 371 | " .filter(pl.col('curr') == city)\n", 372 | " .select(pl.col('n').sum())\n", 373 | ")" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": 11, 379 | "metadata": {}, 380 | "outputs": [ 381 | { 382 | "name": "stdout", 383 | "output_type": "stream", 384 | "text": [ 385 | "CPU times: user 2.91 s, sys: 427 ms, total: 3.34 s\n", 386 | "Wall time: 3.28 s\n" 387 | ] 388 | }, 389 | { 390 | "data": { 391 | "text/plain": [ 392 | "('passport', 31631)" 393 | ] 394 | }, 395 | "execution_count": 11, 396 | "metadata": {}, 397 | "output_type": "execute_result" 398 | } 399 | ], 400 | "source": [ 401 | "%%time\n", 402 | "# 6, 7\n", 403 | "res = (\n", 404 | " posts_pl\n", 405 | " .select(\n", 406 | " pl.col('Body')\n", 407 | " .str.replace_all(\"<.*?>\", \"\")\n", 408 | " .str.replace_all(\"\\n\", \" \")\n", 409 | " .str.split(\" \")\n", 410 | " .explode()\n", 411 | " .str.to_lowercase()\n", 412 | " .alias(\"Words\")\n", 413 | " )\n", 414 | " .select(\n", 415 | " pl.col(\"Words\")\n", 416 | " .filter(pl.col(\"Words\").str.lengths() > 7)\n", 417 | " .value_counts()\n", 418 | " ).unnest(\"Words\")\n", 419 | " .head(1)\n", 420 | " \n", 421 | ")\n", 422 | "res[0, 'Words'], wiki_pl.filter(pl.col('curr') == res[0, 'Words'].capitalize())['n'].sum()" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": 12, 428 | "metadata": {}, 429 | "outputs": [ 430 | { 431 | "name": "stdout", 432 | "output_type": "stream", 433 | "text": [ 434 | "CPU times: user 1.24 s, sys: 754 ms, total: 1.99 s\n", 
435 | "Wall time: 469 ms\n" 436 | ] 437 | }, 438 | { 439 | "data": { 440 | "text/html": [ 441 | "
\n", 442 | "\n", 455 | "\n", 456 | "\n", 457 | "\n", 458 | "\n", 461 | "\n", 464 | "\n", 465 | "\n", 466 | "\n", 469 | "\n", 472 | "\n", 473 | "\n", 474 | "\n", 475 | "\n", 476 | "\n", 479 | "\n", 482 | "\n", 483 | "\n", 484 | "
\n", 459 | "DisplayName\n", 460 | "\n", 462 | "UpVoteRatio\n", 463 | "
\n", 467 | "str\n", 468 | "\n", 470 | "i64\n", 471 | "
\n", 477 | "\"Andrew Lazarus\"\n", 478 | "\n", 480 | "547\n", 481 | "
\n", 485 | "
" 486 | ], 487 | "text/plain": [ 488 | "shape: (1, 2)\n", 489 | "┌────────────────┬─────────────┐\n", 490 | "│ DisplayName ┆ UpVoteRatio │\n", 491 | "│ --- ┆ --- │\n", 492 | "│ str ┆ i64 │\n", 493 | "╞════════════════╪═════════════╡\n", 494 | "│ Andrew Lazarus ┆ 547 │\n", 495 | "└────────────────┴─────────────┘" 496 | ] 497 | }, 498 | "execution_count": 12, 499 | "metadata": {}, 500 | "output_type": "execute_result" 501 | } 502 | ], 503 | "source": [ 504 | "%%time\n", 505 | "# 8, 9\n", 506 | "upvotes_pl = (\n", 507 | " votes_pl\n", 508 | " .lazy()\n", 509 | " .filter(pl.col(\"VoteTypeId\") == 2)\n", 510 | " .groupby(\"PostId\")\n", 511 | " .agg(pl.count().alias(\"UpVotes\"))\n", 512 | ")\n", 513 | "\n", 514 | "downvotes_pl = (\n", 515 | " votes_pl\n", 516 | " .lazy()\n", 517 | " .filter(pl.col(\"VoteTypeId\") == 3)\n", 518 | " .groupby(\"PostId\")\n", 519 | " .agg(pl.count().alias(\"DownVotes\"))\n", 520 | ")\n", 521 | "\n", 522 | "(\n", 523 | " posts_pl.lazy()\n", 524 | " .join(upvotes_pl, left_on=\"Id\", right_on=\"PostId\", how='left')\n", 525 | " .join(downvotes_pl, left_on=\"Id\", right_on=\"PostId\", how='left')\n", 526 | " .with_columns(\n", 527 | " [\n", 528 | " pl.col(\"UpVotes\").fill_null(0),\n", 529 | " pl.col(\"DownVotes\").fill_null(0),\n", 530 | " ]\n", 531 | " )\n", 532 | " .with_column(\n", 533 | " (pl.col('UpVotes') - pl.col('DownVotes')).alias('UpVoteRatio')\n", 534 | " )\n", 535 | " .join(users_pl.lazy(), left_on=\"OwnerUserId\", right_on=\"Id\")\n", 536 | " .sort('UpVoteRatio', reverse=True)\n", 537 | " .collect()[0, ['DisplayName', 'UpVoteRatio']]\n", 538 | ")\n" 539 | ] 540 | }, 541 | { 542 | "cell_type": "code", 543 | "execution_count": 13, 544 | "metadata": {}, 545 | "outputs": [ 546 | { 547 | "name": "stdout", 548 | "output_type": "stream", 549 | "text": [ 550 | "CPU times: user 591 ms, sys: 3.48 ms, total: 595 ms\n", 551 | "Wall time: 518 ms\n" 552 | ] 553 | }, 554 | { 555 | "data": { 556 | "text/html": [ 557 | "
\n", 558 | "\n", 571 | "\n", 572 | "\n", 573 | "\n", 574 | "\n", 577 | "\n", 580 | "\n", 583 | "\n", 584 | "\n", 585 | "\n", 588 | "\n", 591 | "\n", 594 | "\n", 595 | "\n", 596 | "\n", 597 | "\n", 598 | "\n", 601 | "\n", 604 | "\n", 607 | "\n", 608 | "\n", 609 | "
\n", 575 | "Year\n", 576 | "\n", 578 | "Month\n", 579 | "\n", 581 | "NVotes\n", 582 | "
\n", 586 | "i32\n", 587 | "\n", 589 | "u32\n", 590 | "\n", 592 | "u32\n", 593 | "
\n", 599 | "2016\n", 600 | "\n", 602 | "8\n", 603 | "\n", 605 | "19591\n", 606 | "
\n", 610 | "
" 611 | ], 612 | "text/plain": [ 613 | "shape: (1, 3)\n", 614 | "┌──────┬───────┬────────┐\n", 615 | "│ Year ┆ Month ┆ NVotes │\n", 616 | "│ --- ┆ --- ┆ --- │\n", 617 | "│ i32 ┆ u32 ┆ u32 │\n", 618 | "╞══════╪═══════╪════════╡\n", 619 | "│ 2016 ┆ 8 ┆ 19591 │\n", 620 | "└──────┴───────┴────────┘" 621 | ] 622 | }, 623 | "execution_count": 13, 624 | "metadata": {}, 625 | "output_type": "execute_result" 626 | } 627 | ], 628 | "source": [ 629 | "%%time \n", 630 | "# 10\n", 631 | "votes_agg = (\n", 632 | " votes_pl\n", 633 | " .with_column(\n", 634 | " pl.col('CreationDate').str.strptime(pl.Datetime)\n", 635 | " )\n", 636 | " .groupby([\n", 637 | " pl.col('CreationDate').dt.year().alias(\"Year\"),\n", 638 | " pl.col('CreationDate').dt.month().alias(\"Month\")\n", 639 | " ])\n", 640 | " .agg(pl.count().alias(\"NVotes\"))\n", 641 | ")\n", 642 | "\n", 643 | "votes_agg.filter(pl.col(\"NVotes\") == pl.col(\"NVotes\").max())" 644 | ] 645 | }, 646 | { 647 | "cell_type": "code", 648 | "execution_count": 14, 649 | "metadata": {}, 650 | "outputs": [ 651 | { 652 | "name": "stdout", 653 | "output_type": "stream", 654 | "text": [ 655 | "CPU times: user 8.34 ms, sys: 1.12 ms, total: 9.46 ms\n", 656 | "Wall time: 2.72 ms\n" 657 | ] 658 | }, 659 | { 660 | "data": { 661 | "text/html": [ 662 | "
\n", 663 | "\n", 676 | "\n", 677 | "\n", 678 | "\n", 679 | "\n", 682 | "\n", 685 | "\n", 688 | "\n", 689 | "\n", 690 | "\n", 693 | "\n", 696 | "\n", 699 | "\n", 700 | "\n", 701 | "\n", 702 | "\n", 703 | "\n", 706 | "\n", 709 | "\n", 712 | "\n", 713 | "\n", 714 | "
\n", 680 | "Year\n", 681 | "\n", 683 | "Month\n", 684 | "\n", 686 | "NVotesDiff\n", 687 | "
\n", 691 | "i32\n", 692 | "\n", 694 | "u32\n", 695 | "\n", 697 | "i64\n", 698 | "
\n", 704 | "2015\n", 705 | "\n", 707 | "10\n", 708 | "\n", 710 | "-6201\n", 711 | "
\n", 715 | "
" 716 | ], 717 | "text/plain": [ 718 | "shape: (1, 3)\n", 719 | "┌──────┬───────┬────────────┐\n", 720 | "│ Year ┆ Month ┆ NVotesDiff │\n", 721 | "│ --- ┆ --- ┆ --- │\n", 722 | "│ i32 ┆ u32 ┆ i64 │\n", 723 | "╞══════╪═══════╪════════════╡\n", 724 | "│ 2015 ┆ 10 ┆ -6201 │\n", 725 | "└──────┴───────┴────────────┘" 726 | ] 727 | }, 728 | "execution_count": 14, 729 | "metadata": {}, 730 | "output_type": "execute_result" 731 | } 732 | ], 733 | "source": [ 734 | "%%time\n", 735 | "# 11\n", 736 | "(\n", 737 | " votes_agg\n", 738 | " .sort([\"Year\", \"Month\"])\n", 739 | " .select([\n", 740 | " \"Year\",\n", 741 | " \"Month\",\n", 742 | " pl.col(\"NVotes\").cast(int).diff().alias(\"NVotesDiff\")\n", 743 | " ])\n", 744 | " .filter(pl.col(\"NVotesDiff\") == pl.col(\"NVotesDiff\").min())\n", 745 | ")" 746 | ] 747 | }, 748 | { 749 | "cell_type": "code", 750 | "execution_count": 15, 751 | "metadata": {}, 752 | "outputs": [ 753 | { 754 | "name": "stdout", 755 | "output_type": "stream", 756 | "text": [ 757 | "CPU times: user 42.3 ms, sys: 15.4 ms, total: 57.7 ms\n", 758 | "Wall time: 21.6 ms\n" 759 | ] 760 | }, 761 | { 762 | "data": { 763 | "text/html": [ 764 | "
\n", 765 | "\n", 778 | "\n", 779 | "\n", 780 | "\n", 781 | "\n", 784 | "\n", 787 | "\n", 788 | "\n", 789 | "\n", 792 | "\n", 795 | "\n", 796 | "\n", 797 | "\n", 798 | "\n", 799 | "\n", 802 | "\n", 805 | "\n", 806 | "\n", 807 | "
\n", 782 | "Tags\n", 783 | "\n", 785 | "counts\n", 786 | "
\n", 790 | "str\n", 791 | "\n", 793 | "u32\n", 794 | "
\n", 800 | "\"air-travel\"\n", 801 | "\n", 803 | "34\n", 804 | "
\n", 808 | "
" 809 | ], 810 | "text/plain": [ 811 | "shape: (1, 2)\n", 812 | "┌────────────┬────────┐\n", 813 | "│ Tags ┆ counts │\n", 814 | "│ --- ┆ --- │\n", 815 | "│ str ┆ u32 │\n", 816 | "╞════════════╪════════╡\n", 817 | "│ air-travel ┆ 34 │\n", 818 | "└────────────┴────────┘" 819 | ] 820 | }, 821 | "execution_count": 15, 822 | "metadata": {}, 823 | "output_type": "execute_result" 824 | } 825 | ], 826 | "source": [ 827 | "%%time\n", 828 | "# 12\n", 829 | "(\n", 830 | " posts_pl.lazy().join(users_pl.lazy(), left_on=\"OwnerUserId\", right_on=\"Id\", how='left')\n", 831 | " .filter(\n", 832 | " pl.col(\"Location\").str.contains(\"Poland\") | \n", 833 | " pl.col(\"Location\").str.contains(\"Polska\")\n", 834 | " )\n", 835 | " .select([\n", 836 | " pl.col('Tags')\n", 837 | " .str.replace(r\"^<\", \"\")\n", 838 | " .str.replace(r\">$\", \"\")\n", 839 | " .str.split(\"><\")\n", 840 | " .drop_nulls()\n", 841 | " .explode()\n", 842 | " .value_counts()\n", 843 | " ])\n", 844 | " .unnest(\"Tags\")\n", 845 | " .head(1)\n", 846 | " .collect()\n", 847 | ")\n" 848 | ] 849 | } 850 | ], 851 | "metadata": { 852 | "kernelspec": { 853 | "display_name": "Python 3.10.4 ('daftacademy-ds')", 854 | "language": "python", 855 | "name": "python3" 856 | }, 857 | "language_info": { 858 | "codemirror_mode": { 859 | "name": "ipython", 860 | "version": 3 861 | }, 862 | "file_extension": ".py", 863 | "mimetype": "text/x-python", 864 | "name": "python", 865 | "nbconvert_exporter": "python", 866 | "pygments_lexer": "ipython3", 867 | "version": "3.10.4" 868 | }, 869 | "orig_nbformat": 4, 870 | "vscode": { 871 | "interpreter": { 872 | "hash": "306379f83cd7ad906f1941e4fa5d943b9cb847ce8f1ed425d8b7a18353685fed" 873 | } 874 | } 875 | }, 876 | "nbformat": 4, 877 | "nbformat_minor": 2 878 | } 879 | -------------------------------------------------------------------------------- /homework_solutions/04_hw_sklearn.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": 
[ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 25, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd\n", 10 | "from sklearn.linear_model import LinearRegression\n", 11 | "from sklearn.metrics import mean_squared_error\n", 12 | "from sklearn.impute import SimpleImputer\n", 13 | "import numpy as np\n", 14 | "import matplotlib.pyplot as plt\n", 15 | "import pickle" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 26, 21 | "metadata": {}, 22 | "outputs": [ 23 | { 24 | "data": { 25 | "text/html": [ 26 | "
\n", 27 | "\n", 40 | "\n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | "
CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTAT
03.6911.3711.150.070.876.2968.913.779.5410.9518.37354.470.79
\n", 78 | "
" 79 | ], 80 | "text/plain": [ 81 | " CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO \\\n", 82 | "0 3.69 11.37 11.15 0.07 0.87 6.29 68.91 3.77 9.5 410.95 18.37 \n", 83 | "\n", 84 | " B LSTAT \n", 85 | "0 354.47 0.79 " 86 | ] 87 | }, 88 | "execution_count": 26, 89 | "metadata": {}, 90 | "output_type": "execute_result" 91 | } 92 | ], 93 | "source": [ 94 | "rec = pd.DataFrame({\n", 95 | " 'CRIM': [3.69],\n", 96 | " 'ZN': [11.37],\n", 97 | " 'INDUS': [11.15],\n", 98 | " 'CHAS': [0.07],\n", 99 | " 'NOX': [0.87],\n", 100 | " 'RM': [6.29],\n", 101 | " 'AGE': [68.91],\n", 102 | " 'DIS': [3.77],\n", 103 | " 'RAD': [9.50],\n", 104 | " 'TAX': [410.95],\n", 105 | " 'PTRATIO': [18.37],\n", 106 | " 'B': [354.47],\n", 107 | " 'LSTAT': [0.79],\n", 108 | "})\n", 109 | "rec" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 27, 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "train = pd.read_csv(\"../data/housing/housing_train.csv\")" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 28, 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "data": { 128 | "text/plain": [ 129 | "NOX 0.879070\n", 130 | "LSTAT 0.795349\n", 131 | "dtype: float64" 132 | ] 133 | }, 134 | "execution_count": 28, 135 | "metadata": {}, 136 | "output_type": "execute_result" 137 | } 138 | ], 139 | "source": [ 140 | "s = train.isna().mean()\n", 141 | "s[s > 0.7]" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 29, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "train['LSTAT'] = train['LSTAT'].isna()\n", 151 | "train['NOX'] = train['NOX'].isna()" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 30, 157 | "metadata": {}, 158 | "outputs": [], 159 | "source": [ 160 | "X = train.iloc[:, :-1]\n", 161 | "y = train['MEDV']" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 31, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 
| "imputer = SimpleImputer(strategy=\"median\")\n", 171 | "X = imputer.fit_transform(X)" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 32, 177 | "metadata": {}, 178 | "outputs": [ 179 | { 180 | "data": { 181 | "text/html": [ 182 | "
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" 183 | ], 184 | "text/plain": [ 185 | "LinearRegression()" 186 | ] 187 | }, 188 | "execution_count": 32, 189 | "metadata": {}, 190 | "output_type": "execute_result" 191 | } 192 | ], 193 | "source": [ 194 | "reg = LinearRegression()\n", 195 | "reg.fit(X, y)" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 33, 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [ 204 | "y_pred = reg.predict(X)" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 34, 210 | "metadata": {}, 211 | "outputs": [ 212 | { 213 | "name": "stderr", 214 | "output_type": "stream", 215 | "text": [ 216 | "/home/piotr/anaconda3/envs/daftacademy-ds/lib/python3.10/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but LinearRegression was fitted without feature names\n", 217 | " warnings.warn(\n" 218 | ] 219 | }, 220 | { 221 | "data": { 222 | "text/plain": [ 223 | "array([96.67])" 224 | ] 225 | }, 226 | "execution_count": 34, 227 | "metadata": {}, 228 | "output_type": "execute_result" 229 | } 230 | ], 231 | "source": [ 232 | "reg.predict(rec).round(2)" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 35, 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [ 241 | "rec2 = rec.copy()\n", 242 | "rec2.loc[0, 'RM'] += 2" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": 39, 248 | "metadata": {}, 249 | "outputs": [ 250 | { 251 | "data": { 252 | "text/plain": [ 253 | "CRIM -0.693008\n", 254 | "ZN 0.210142\n", 255 | "INDUS -0.159057\n", 256 | "CHAS 13.851840\n", 257 | "NOX 20.619870\n", 258 | "RM 25.280527\n", 259 | "AGE -0.160472\n", 260 | "DIS -5.301457\n", 261 | "RAD 1.421446\n", 262 | "TAX -0.057192\n", 263 | "PTRATIO -3.679619\n", 264 | "B 0.055114\n", 265 | "LSTAT 20.989393\n", 266 | "dtype: float64" 267 | ] 268 | }, 269 | "execution_count": 39, 270 | "metadata": {}, 271 | "output_type": "execute_result" 272 | } 273 | ], 274 | "source": [ 275 | "coef = 
pd.Series(reg.coef_, index=train.columns[:-1])\n", 276 | "coef" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": 40, 282 | "metadata": {}, 283 | "outputs": [ 284 | { 285 | "data": { 286 | "text/plain": [ 287 | "50.5610537224786" 288 | ] 289 | }, 290 | "execution_count": 40, 291 | "metadata": {}, 292 | "output_type": "execute_result" 293 | } 294 | ], 295 | "source": [ 296 | "coef['RM'] * 2" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": 41, 302 | "metadata": {}, 303 | "outputs": [ 304 | { 305 | "name": "stderr", 306 | "output_type": "stream", 307 | "text": [ 308 | "/home/piotr/anaconda3/envs/daftacademy-ds/lib/python3.10/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but LinearRegression was fitted without feature names\n", 309 | " warnings.warn(\n", 310 | "/home/piotr/anaconda3/envs/daftacademy-ds/lib/python3.10/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but LinearRegression was fitted without feature names\n", 311 | " warnings.warn(\n" 312 | ] 313 | }, 314 | { 315 | "data": { 316 | "text/plain": [ 317 | "array([50.56105372])" 318 | ] 319 | }, 320 | "execution_count": 41, 321 | "metadata": {}, 322 | "output_type": "execute_result" 323 | } 324 | ], 325 | "source": [ 326 | "reg.predict(rec2) - reg.predict(rec)" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "## Advanced model\n", 334 | "\n", 335 | "We will check a few models on a split dataset and then choose the best model for the final training before submission."
336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": 42, 341 | "metadata": {}, 342 | "outputs": [], 343 | "source": [ 344 | "from sklearn.ensemble import RandomForestRegressor\n", 345 | "from lightgbm import LGBMRegressor\n", 346 | "from xgboost import XGBRFRegressor\n", 347 | "from sklearn.model_selection import train_test_split" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": 43, 353 | "metadata": {}, 354 | "outputs": [], 355 | "source": [ 356 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2022)" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 44, 362 | "metadata": {}, 363 | "outputs": [], 364 | "source": [ 365 | "models = {\n", 366 | " 'linear_regression': LinearRegression(),\n", 367 | " 'rf': RandomForestRegressor(),\n", 368 | " 'lgbm': LGBMRegressor(),\n", 369 | " 'xgb': XGBRFRegressor(),\n", 370 | " \n", 371 | "}" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": 51, 377 | "metadata": {}, 378 | "outputs": [ 379 | { 380 | "name": "stdout", 381 | "output_type": "stream", 382 | "text": [ 383 | "Naive baseline y_train.mean() scores MSE 1421.11\n", 384 | "Naive baseline y_train.median() scores MSE 1435.55\n", 385 | "linear_regression scores MSE: 385.24\n", 386 | "rf scores MSE: 338.15\n", 387 | "lgbm scores MSE: 299.47\n", 388 | "xgb scores MSE: 340.21\n" 389 | ] 390 | } 391 | ], 392 | "source": [ 393 | "naive_mean_mse = mean_squared_error(y_test, np.tile(y_train.mean(), len(y_test))).round(2)\n", 394 | "naive_median_mse = mean_squared_error(y_test, np.tile(y_train.median(), len(y_test))).round(2)\n", 395 | "print(f\"Naive baseline y_train.mean() scores MSE {naive_mean_mse}\")\n", 396 | "print(f\"Naive baseline y_train.median() scores MSE {naive_median_mse}\")\n", 397 | "for name, model in models.items():\n", 398 | " model.fit(X_train, y_train)\n", 399 | " y_pred = model.predict(X_test)\n", 400 | " score = 
mean_squared_error(y_test, y_pred)\n", 401 | " print(f\"{name} scores MSE: {score.round(2)}\")\n" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": 53, 407 | "metadata": {}, 408 | "outputs": [ 409 | { 410 | "data": { 411 | "text/html": [ 412 | "
LGBMRegressor()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" 413 | ], 414 | "text/plain": [ 415 | "LGBMRegressor()" 416 | ] 417 | }, 418 | "execution_count": 53, 419 | "metadata": {}, 420 | "output_type": "execute_result" 421 | } 422 | ], 423 | "source": [ 424 | "train = pd.read_csv(\"../data/housing/housing_train.csv\")\n", "# apply the same NA-indicator transform to train as to val below\n", "train['LSTAT'] = train['LSTAT'].isna()\n", "train['NOX'] = train['NOX'].isna()\n", 425 | "X = train.iloc[:, :-1]\n", 426 | "y = train['MEDV']\n", 427 | "\n", 428 | "val = pd.read_csv(\"../data/housing/housing_validation.csv\")\n", 429 | "val['LSTAT'] = val['LSTAT'].isna()\n", 430 | "val['NOX'] = val['NOX'].isna()\n", 431 | "\n", 432 | "imputer2 = SimpleImputer(strategy=\"median\")\n", 433 | "\n", 434 | "X2 = imputer2.fit_transform(X)\n", 435 | "val_t = imputer2.transform(val)\n", 436 | "\n", 437 | "reg2 = LGBMRegressor()\n", 438 | "reg2.fit(X2, y)" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": 54, 444 | "metadata": {}, 445 | "outputs": [], 446 | "source": [ 447 | "y_pred = reg2.predict(val_t)\n", 448 | "res = pd.Series(y_pred, name='MEDV')\n", 449 | "res.to_csv(\"light_gbm.csv\")" 450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": 55, 455 | "metadata": {}, 456 | "outputs": [], 457 | "source": [ 458 | "with open(\"model_lgbm_regressor.pkl\", 'wb') as f:\n", 459 | " pickle.dump(reg2, f)" 460 | ] 461 | }, 462 | { 463 | "cell_type": "code", 464 | "execution_count": null, 465 | "metadata": {}, 466 | "outputs": [], 467 | "source": [] 468 | } 469 | ], 470 | "metadata": { 471 | "kernelspec": { 472 | "display_name": "Python 3.10.4 ('daftacademy-ds')", 473 | "language": "python", 474 | "name": "python3" 475 | }, 476 | "language_info": { 477 | "codemirror_mode": { 478 | "name": "ipython", 479 | "version": 3 480 | }, 481 | "file_extension": ".py", 482 | "mimetype": "text/x-python", 483 | "name": "python", 484 | "nbconvert_exporter": "python", 485 | "pygments_lexer": "ipython3", 486 | "version": "3.10.4" 487 | }, 488 | "orig_nbformat": 4, 489 | "vscode": { 490 | "interpreter": { 491 | "hash": 
"306379f83cd7ad906f1941e4fa5d943b9cb847ce8f1ed425d8b7a18353685fed" 492 | } 493 | } 494 | }, 495 | "nbformat": 4, 496 | "nbformat_minor": 2 497 | } 498 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | aiofiles==22.1.0 2 | aiosqlite==0.18.0 3 | altair==4.2.2 4 | anyio==3.6.2 5 | argon2-cffi==21.3.0 6 | argon2-cffi-bindings==21.2.0 7 | arrow==1.2.3 8 | asttokens==2.2.1 9 | attrs==22.2.0 10 | Babel==2.11.0 11 | backcall==0.2.0 12 | beautifulsoup4==4.11.2 13 | bleach==6.0.0 14 | blinker==1.5 15 | cachetools==5.3.0 16 | cffi==1.15.1 17 | charset-normalizer==3.0.1 18 | click==8.1.3 19 | comm==0.1.2 20 | contourpy==1.0.7 21 | cycler==0.11.0 22 | debugpy==1.6.6 23 | decorator==5.1.1 24 | defusedxml==0.7.1 25 | entrypoints==0.4 26 | et-xmlfile==1.1.0 27 | executing==1.2.0 28 | fastapi==0.92.0 29 | fastjsonschema==2.16.2 30 | fonttools==4.38.0 31 | fqdn==1.5.1 32 | gitdb==4.0.10 33 | GitPython==3.1.31 34 | h11==0.14.0 35 | httptools==0.5.0 36 | idna==3.4 37 | importlib-metadata==6.0.0 38 | ipykernel==6.21.2 39 | ipython==8.10.0 40 | ipython-genutils==0.2.0 41 | ipywidgets==8.0.4 42 | isoduration==20.11.0 43 | jedi==0.18.2 44 | Jinja2==3.1.2 45 | joblib==1.2.0 46 | json5==0.9.11 47 | jsonpointer==2.3 48 | jsonschema==4.17.3 49 | jupyter-events==0.6.3 50 | jupyter-ydoc==0.2.2 51 | jupyter_client==8.0.3 52 | jupyter_core==5.2.0 53 | jupyter_server==2.3.0 54 | jupyter_server_fileid==0.7.0 55 | jupyter_server_terminals==0.4.4 56 | jupyter_server_ydoc==0.6.1 57 | jupyterlab==3.6.1 58 | jupyterlab-pygments==0.2.2 59 | jupyterlab-widgets==3.0.5 60 | jupyterlab_server==2.19.0 61 | kiwisolver==1.4.4 62 | lightgbm==3.3.5 63 | llvmlite==0.39.1 64 | lxml==4.9.2 65 | markdown-it-py==2.1.0 66 | MarkupSafe==2.1.2 67 | matplotlib==3.7.0 68 | matplotlib-inline==0.1.6 69 | mdurl==0.1.2 70 | mistune==2.0.5 71 | nbclassic==0.5.2 72 | nbclient==0.7.2 73 | 
nbconvert==7.2.9 74 | nbformat==5.7.3 75 | nest-asyncio==1.5.6 76 | notebook==6.5.2 77 | notebook_shim==0.2.2 78 | numba==0.56.4 79 | numpy==1.23.5 80 | openpyxl==3.1.1 81 | packaging==23.0 82 | pandas==1.5.3 83 | pandocfilters==1.5.0 84 | parso==0.8.3 85 | pexpect==4.8.0 86 | pickleshare==0.7.5 87 | Pillow==9.4.0 88 | platformdirs==3.0.0 89 | plotly==5.13.0 90 | polars==0.16.7 91 | prometheus-client==0.16.0 92 | prompt-toolkit==3.0.36 93 | protobuf==3.20.3 94 | psutil==5.9.4 95 | ptyprocess==0.7.0 96 | pure-eval==0.2.2 97 | pyarrow==11.0.0 98 | pycparser==2.21 99 | pydantic==1.10.5 100 | pydeck==0.8.0 101 | Pygments==2.14.0 102 | Pympler==1.0.1 103 | pyparsing==3.0.9 104 | pyrsistent==0.19.3 105 | python-dateutil==2.8.2 106 | python-dotenv==0.21.1 107 | python-json-logger==2.0.6 108 | python-multipart==0.0.5 109 | pytz==2022.7.1 110 | pytz-deprecation-shim==0.1.0.post0 111 | PyYAML==6.0 112 | pyzmq==25.0.0 113 | requests==2.28.2 114 | rfc3339-validator==0.1.4 115 | rfc3986-validator==0.1.1 116 | rich==13.3.1 117 | scikit-learn==1.2.1 118 | scipy==1.10.1 119 | semver==2.13.0 120 | Send2Trash==1.8.0 121 | six==1.16.0 122 | smmap==5.0.0 123 | sniffio==1.3.0 124 | soupsieve==2.4 125 | stack-data==0.6.2 126 | starlette==0.25.0 127 | streamlit==1.18.1 128 | tenacity==8.2.1 129 | terminado==0.17.1 130 | threadpoolctl==3.1.0 131 | tinycss2==1.2.1 132 | toml==0.10.2 133 | tomli==2.0.1 134 | toolz==0.12.0 135 | tornado==6.2 136 | traitlets==5.9.0 137 | typing_extensions==4.5.0 138 | tzdata==2022.7 139 | tzlocal==4.2 140 | uri-template==1.2.0 141 | urllib3==1.26.14 142 | uvicorn==0.20.0 143 | uvloop==0.17.0 144 | validators==0.20.0 145 | watchdog==2.2.1 146 | watchfiles==0.18.1 147 | wcwidth==0.2.6 148 | webcolors==1.12 149 | webencodings==0.5.1 150 | websocket-client==1.5.1 151 | websockets==10.4 152 | widgetsnbextension==4.0.5 153 | xgboost==1.7.4 154 | y-py==0.5.4 155 | ypy-websocket==0.8.2 156 | zipp==3.14.0 157 | zstandard==0.19.0 158 | 
-------------------------------------------------------------------------------- /requirements_loose.txt: -------------------------------------------------------------------------------- 1 | fastapi 2 | ipykernel 3 | ipython 4 | ipywidgets 5 | jupyterlab 6 | lightgbm 7 | lxml 8 | matplotlib 9 | numba 10 | numpy 11 | openpyxl 12 | pandas 13 | plotly 14 | Pillow 15 | polars 16 | python-multipart 17 | scikit-learn 18 | streamlit 19 | uvicorn[standard] 20 | xgboost 21 | zstandard 22 | --------------------------------------------------------------------------------