├── .gitignore
├── 00_Introduction
│   ├── 01_intro.html
│   ├── 01_intro.qmd
│   ├── README.md
│   └── images
│       ├── about_appsilon1.png
│       ├── about_appsilon2.png
│       ├── dataneversleeps2.png
│       ├── datascience.png
│       ├── datascienceworkflow2a.png
│       ├── datascienceworkflow_extended.png
│       ├── dominotools.png
│       ├── meme2.jpeg
│       └── the-data-science-workflow.jpeg
├── 01_Numpy
│   ├── 01_numpy.html
│   ├── 01_numpy.ipynb
│   ├── 02_homework.ipynb
│   ├── README.md
│   └── images
│       ├── behind_scenes.webp
│       ├── cpp_numpy.jpg
│       └── rly.gif
├── 02_Pandas
│   ├── 01_types_of_data.html
│   ├── 01_types_of_data.qmd
│   ├── 02_pandas.html
│   ├── 02_pandas.ipynb
│   ├── 03_homework.ipynb
│   ├── 04_loading_wiki_data.ipynb
│   ├── README.md
│   └── images
│       ├── de_wiki_problem.png
│       ├── one_does.jpg
│       └── weird_line.png
├── 03_Plots
│   ├── 01_matplotlib_plotly.html
│   ├── 01_matplotlib_plotly.ipynb
│   ├── 02_homework.ipynb
│   ├── README.md
│   └── images
│       └── votes.png
├── 04_Scikit-learn
│   ├── .gitignore
│   ├── 01_machine_learning.html
│   ├── 01_machine_learning.qmd
│   ├── 02_linear_regression.html
│   ├── 02_linear_regression.ipynb
│   ├── 03_homework.ipynb
│   ├── README.md
│   └── images
│       ├── NA_trick.png
│       ├── Precisionrecall.png
│       ├── na_example.svg
│       ├── onehot.png
│       ├── regression.png
│       └── traintest.png
├── 05_SharingWork
│   ├── 01_streamlit
│   │   ├── 01_hello_world
│   │   │   ├── README.md
│   │   │   ├── main1.py
│   │   │   ├── main2.py
│   │   │   └── main3.py
│   │   ├── 02_small_report
│   │   │   ├── README.md
│   │   │   └── main.py
│   │   ├── 03_mc_simulation
│   │   │   ├── README.md
│   │   │   └── main.py
│   │   ├── 04_dishes
│   │   │   ├── README.md
│   │   │   ├── dishes.json
│   │   │   └── favorite_dish.py
│   │   └── 05_checker
│   │       ├── README.md
│   │       ├── create_db.py
│   │       ├── files
│   │       │   └── .gitignore
│   │       ├── main.py
│   │       ├── results.db
│   │       └── y_true.csv
│   ├── 02_quarto
│   │   ├── README.md
│   │   ├── report.docx
│   │   ├── report.html
│   │   ├── report.ipynb
│   │   └── report.pdf
│   ├── 03_fastapi
│   │   ├── README.md
│   │   ├── boston_model_prediction
│   │   │   ├── README.md
│   │   │   ├── boston_api.py
│   │   │   └── model_lgbm_regressor.pkl
│   │   └── image_prediction
│   │       ├── README.md
│   │       └── image_api.py
│   └── README.md
├── README.md
├── data
│   ├── .gitignore
│   ├── flights
│   │   ├── .gitignore
│   │   └── flights_Q1_JFK.csv
│   ├── housing
│   │   ├── .gitignore
│   │   ├── housing_example_submission.csv
│   │   ├── housing_train.csv
│   │   └── housing_validation.csv
│   ├── iris
│   │   ├── iris.csv
│   │   ├── iris.tsv
│   │   ├── iris.xlsx
│   │   └── iris_noheader.csv
│   └── other
│       ├── Life Expectancy Data.csv
│       └── lotr_data.csv
├── homework_solutions
│   ├── .gitignore
│   ├── 01_hw_numpy.ipynb
│   ├── 02_hw_pandas1.ipynb
│   ├── 03_hw_pandas2_a.ipynb
│   ├── 03_hw_pandas2_b.ipynb
│   └── 04_hw_sklearn.ipynb
├── requirements.txt
└── requirements_loose.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints/
2 | __pycache__
3 |
--------------------------------------------------------------------------------
/00_Introduction/01_intro.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Introduction to Data Science in Python by Appsilon"
3 | subtitle: "Course introduction"
4 | author: "Piotr Pasza Storożenko@Appsilon"
5 | lang: "en"
6 | format:
7 | revealjs:
8 | embed-resources: true
9 | # smaller: true
10 | theme: [dark]
11 | editor_options:
12 | markdown:
13 | wrap: 99
14 | ---
15 |
16 | # Introduction
17 |
18 | ## About me
19 |
20 | Piotr Pasza Storożenko,
21 | Machine Learning Engineer
22 |
23 | A bit of each:
24 |
25 | - ML guy
26 | - Physicist
27 | - Computer scientist
28 | - Mathematician
29 |
30 | You can find more on my blog: [pstorozenko.github.io](https://pstorozenko.github.io/).
31 |
32 | ## {background-image="images/about_appsilon1.png"}
33 |
34 | ## {background-image="images/about_appsilon2.png"}
35 |
36 | ## Why this course?
37 |
38 | Over the years I have gained a lot of knowledge about `python`, `R`, `julia`, data science, machine learning, deep learning, and software development.
39 |
40 | Now I can share it with you!
41 |
42 | ## What will be interesting, and for whom?{.smaller}
43 |
44 | :::: {.columns}
45 |
46 | :::{.column}
47 | - Computer scientists
48 | - Easy to use and **efficient** tools to work with data
49 | - Made by software developers
50 | - Mathematicians
51 | - Intuitive tools that support **reproducible** experiments
52 | - _Ridiculously easy_ work with plots
53 | - Electrical and mechanical engineers
54 | - Great alternative to MATLAB
55 | - Easy to use
56 | - Simple plots and animations
57 | :::
58 |
59 | :::{.column}
60 | - Economics students
61 | - Great alternative for spreadsheets
62 | - Set of **free**, open source tools
63 | - Incomparably greater control over data compared to MS Office, Tableau, PowerBI etc.
64 | - Physicists, chemists, biologists
65 | - Substantial relief from spreadsheets
66 | - Open-source software
67 | - Much easier creation of reproducible plots
68 | :::
69 |
70 | ::::
71 |
72 | # Data Science
73 |
74 | ## What's Data Science?
75 |
76 | {fig-align="center"}
77 |
78 | ::: footer
79 | Source: [https://medium.com/data-science-in-2019/what-is-data-science-87e9dc225cf9](https://medium.com/data-science-in-2019/what-is-data-science-87e9dc225cf9)
80 | :::
81 |
82 | ## Why Data Science?
83 |
84 | {fig-align="center"}
85 |
86 | ::: footer
87 | Source: [https://www.domo.com/learn/infographic/data-never-sleeps-9](https://www.domo.com/learn/infographic/data-never-sleeps-9)
88 | :::
89 |
90 | ## Explanation of various terms{.smaller}
91 |
92 | - Artificial Intelligence, AI -- an [umbrella term](https://en.wiktionary.org/wiki/umbrella_term) for everything where a computer/system makes decisions based on a set of rules, an algorithm.
93 | - Data Science (DS) -- everything related to data, from collecting, through processing, up to presenting and using it
94 | - Machine Learning, ML -- everything related to creating/training models that are able to _learn_ rules based on provided data
95 | - [Deep] Neural Networks, [D]NN -- a subset of ML methods based on a special class of models, so-called neural networks. Their architecture (design) loosely resembles connections between biological neurons, hence the name.
96 |
97 | ## Who's a Data Scientist?
98 |
99 | Someone who simultaneously:
100 |
101 | 1. Discusses the required solutions with the so-called _business_.
102 | 2. Creates solutions from provided and collected data using programming skills.
103 | 3. Delivers the results to the _business_ in a clear and interesting way.
104 |
105 | Business talks about **AI**, experts prefer to say **ML**...
106 |
107 | ## Data Science Workflow
108 |
109 | {fig-align="center"}
110 |
111 | ::: footer
112 | Source: [https://www.business-science.io/business/2019/06/27/data-science-workflow.html](https://www.business-science.io/business/2019/06/27/data-science-workflow.html)
113 | :::
114 |
115 |
116 | ## Data Science Tools
117 |
118 | {fig-align="center"}
119 |
120 | ::: footer
121 | Source: [https://blog.dominodatalab.com/data-science-tools](https://blog.dominodatalab.com/data-science-tools)
122 | :::
123 |
124 | # Course plan
125 |
126 | 1. Introduction and `numpy` - working with numbers
127 | 2. `pandas` - working with data frames
128 | 3. `matplotlib` and `plotly` - plotting data
129 | 4. `scikit-learn` - introduction to machine learning
130 | 5. `streamlit`, `quarto`, `fastapi` - sharing your work
131 |
132 | ## Course plan
133 |
134 | {fig-align="center"}
135 |
136 |
137 | ## Data Science Workflow x This course
138 |
139 | {fig-align="center"}
140 |
141 | ::: footer
142 | Source: [https://www.business-science.io/business/2019/06/27/data-science-workflow.html](https://www.business-science.io/business/2019/06/27/data-science-workflow.html)
143 | :::
144 |
145 | ## Data Science Workflow x This course ++
146 |
147 | {fig-align="center"}
148 |
149 | ::: footer
150 | Source: [https://www.business-science.io/business/2019/06/27/data-science-workflow.html](https://www.business-science.io/business/2019/06/27/data-science-workflow.html)
151 | :::
152 |
153 | # Tools used in the course
154 |
155 | ## Tools used in the course
156 |
157 | - Python 3.9/3.10 via [Anaconda](https://www.anaconda.com/products/distribution#Downloads) + many additional packages
158 | - Visual Studio Code aka [VS Code](https://code.visualstudio.com/download)
159 |
160 | ## Anaconda{.smaller}
161 |
162 | Anaconda is the standard for managing python environments in the data science/machine learning community.
163 | It lets you obtain a consistent environment across various systems.
164 |
165 | Why is it that important?
166 |
167 | Data scientists often work on many projects at the same time.
168 | Each project might require a different environment, with specific versions of python and other libraries.
169 |
170 | This can also be a relief when working on different projects during your studies!
171 |
172 | ## Anaconda - How to create an environment{.smaller}
173 |
174 | After installing anaconda, [clone this course repo](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository) and navigate in a terminal to the repo directory.
175 | Then run:
176 | ```
177 | conda create -n appsilon-ds-course python=3.10 -y
178 | conda activate appsilon-ds-course
179 | pip install -r requirements.txt
180 | ```
181 |
182 | If you get the following error message:
183 | ```
184 | ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'
185 | ```
186 | you're in the wrong directory.
187 |
188 | In case of problems, check out the [official conda tutorial](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html).
189 |
190 | ## VS Code - a code editor for 2022
191 |
192 | Why [VS Code](https://code.visualstudio.com/#alt-downloads)?
193 |
194 | - Great support for both python scripts and jupyter notebooks.
195 | - Automatically detects `conda` environments
196 | - Great support for working with remote machines through SSH (although we will not use this feature)
197 | - One tool to work with `python`, `R`, `julia`, `javascript`, `typescript` etc.
198 | - Above all -- VS Code is free
199 |
200 | ## What to do after installing VS Code?
201 |
202 | Install these extensions:
203 |
204 | - [Pylance](https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance)
205 | - [Jupyter](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter)
206 |
207 | The environment for work and studying is ready!
208 |
209 | # `pip` vs `conda` vs `pipenv` vs ...
210 |
211 | We need multiple environments on a single machine.
212 |
213 | How to live, what to use?
214 |
215 | . . .
216 |
217 | **NEVER PLAY WITH DATA SCIENCE ON YOUR DEFAULT SYSTEM'S PYTHON**
218 |
219 | ## `pip` + `virtualenv`
220 |
221 | 1. A basic package manager included in python
222 | 2. Works only for **a single** version of python
223 | 3. Capable of installing **python packages only**
224 | 4. Basic package versioning with `pip freeze`
225 | 5. Pretty fast when it doesn't have to build packages
226 |
227 | ## `conda`{.smaller}
228 |
229 | 1. A package manager provided by Anaconda
230 | 2. Allows for creating different environments for different minor (`3.9`/`3.10`) and patch (`3.10.3`/`3.10.4`) python versions
231 | 3. Is able to install **other software than python packages as well** (e.g. `R` or CUDA drivers)
232 | 4. Basic package versioning with `conda list --export`
233 | 5. Super slow for bigger environments
234 | 6. Packages installed with conda can be shared across environments -- lower disk usage (just PyTorch is ~1.7GB)
235 |
236 | ## `pipenv`
237 |
238 | 1. Like `pip` + `virtualenv`, plus support for different python versions
239 | 2. Very big focus on environment reproducibility
240 | 3. Super slow for bigger environments
241 |
242 | ## How to live?{.smaller}
243 |
244 | The most reliable setup for experimenting:
245 |
246 | ```
247 | conda create -n my-env python==3.10.4
248 | conda activate my-env
249 | pip install ...
250 | ```
251 |
252 | If you need to install CUDA drivers, do it during environment creation: `conda create -n my-env python cudatoolkit`.
253 |
254 | After you install all packages, save the **python version** in your README file e.g.,
255 |
256 | > Project created with python 3.10.4.
257 |
258 | and store installed packages with `pip freeze > requirements.txt`.
259 |
260 | . . .
261 |
262 | Remember that not every package version is available for every python version.
263 | For example, a given TensorFlow release supports only a specific range of python versions.
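A quick way to check which python version you are actually running, and thus what to record in the README, is a one-liner inside the environment (a minimal sketch):

```python
import sys

# The interpreter's version, e.g. to paste into your README
version = f"{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}"
print(f"Project created with python {version}")
```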
264 |
--------------------------------------------------------------------------------
/00_Introduction/README.md:
--------------------------------------------------------------------------------
1 | # Introduction
2 |
3 | This folder contains an introduction to Data Science responsibilities, describes the content of the course, and explains how to set up an environment.
--------------------------------------------------------------------------------
/00_Introduction/images/about_appsilon1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/00_Introduction/images/about_appsilon1.png
--------------------------------------------------------------------------------
/00_Introduction/images/about_appsilon2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/00_Introduction/images/about_appsilon2.png
--------------------------------------------------------------------------------
/00_Introduction/images/dataneversleeps2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/00_Introduction/images/dataneversleeps2.png
--------------------------------------------------------------------------------
/00_Introduction/images/datascience.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/00_Introduction/images/datascience.png
--------------------------------------------------------------------------------
/00_Introduction/images/datascienceworkflow2a.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/00_Introduction/images/datascienceworkflow2a.png
--------------------------------------------------------------------------------
/00_Introduction/images/datascienceworkflow_extended.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/00_Introduction/images/datascienceworkflow_extended.png
--------------------------------------------------------------------------------
/00_Introduction/images/dominotools.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/00_Introduction/images/dominotools.png
--------------------------------------------------------------------------------
/00_Introduction/images/meme2.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/00_Introduction/images/meme2.jpeg
--------------------------------------------------------------------------------
/00_Introduction/images/the-data-science-workflow.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/00_Introduction/images/the-data-science-workflow.jpeg
--------------------------------------------------------------------------------
/01_Numpy/02_homework.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Numpy Homework\n",
8 | "\n",
9 | "Example solutions can be found in `homework_solutions` directory.\n",
10 | "\n",
11 | "We will use the new and recommended random number generator `default_rng` instead of `np.random.rand`.\n",
12 | "\n",
13 | "## Task 1\n",
14 | "\n",
15 | "Given the following two vectors:"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "execution_count": 1,
21 | "metadata": {},
22 | "outputs": [
23 | {
24 | "data": {
25 | "text/plain": [
26 | "(array([ 0.04, 0.47, -0.14, -1.39, 2.52, -1.01, 1.86, -2.5 , 0.15,\n",
27 | " -0.09, 2.12, 0.52, -0.53, 1.24, 0.21, 1.81, -0.22, 0.09,\n",
28 | " -0.13, -1.13, 0.85, 0.68, 0.87, -0.34, 1.02, 1.11, -0.04,\n",
29 | " -0.82, -0.16, -1.5 ]),\n",
30 | " array([-0.11, 0.54, -0.31, -1.58, 2.56, -1. , 1.67, -2.42, 0.15,\n",
31 | " 0.03, 2.09, 0.56, -0.69, 1.19, 0.32, 1.72, -0.14, 0.14,\n",
32 | " -0.22, -0.94, 0.91, 0.75, 0.95, -0.54, 1.03, 1.19, -0.04,\n",
33 | " -0.73, -0.15, -1.46]))"
34 | ]
35 | },
36 | "execution_count": 1,
37 | "metadata": {},
38 | "output_type": "execute_result"
39 | }
40 | ],
41 | "source": [
42 | "import numpy as np\n",
43 | "from numpy.random import default_rng\n",
44 | "\n",
45 | "rng = default_rng(1337)\n",
46 | "x = np.round(rng.normal(size=30), 2)\n",
47 | "y = x + np.round(rng.normal(size=30) * 0.1, 2)\n",
48 | "x, y"
49 | ]
50 | },
51 | {
52 | "cell_type": "markdown",
53 | "metadata": {},
54 | "source": [
55 | "Calculate:\n",
56 | "\n",
57 | "1. Mean of `x`\n",
58 | "2. Sum of `x`\n",
59 | "3. Mean of absolute values of `x`\n",
60 | "4. The element of `x` furthest from $0$\n",
61 | "5. The element of `x` furthest from $2$\n",
62 | "6. An array with elements of `x` clipped: values smaller than $-1$ set to $-1$, values larger than $1$ set to $1$\n",
63 | "7. Mean error (ERR) between `x` and `y`\n",
64 | "8. Mean absolute error (MAE) between `x` and `y`\n",
65 | "9. Mean squared error (MSE) between `x` and `y`\n",
66 | "10. Root mean squared error (RMSE) between `x` and `y`"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "## Task 2\n",
74 | "\n",
75 | "Write a function `standardize(X)` that will normalize every column of matrix `X` (each separately).\n",
76 | "The mean of every column should equal $0$ and the standard deviation $1$.\n",
77 | "It's a procedure very often used in ML."
78 | ]
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "metadata": {},
83 | "source": [
84 | "## Task 3\n",
85 | "\n",
86 | "Calculate the value of $\\pi$ using the Monte Carlo method.\n",
87 | " \n",
88 | "Useful links:\n",
89 | "\n",
90 | "- https://www.geeksforgeeks.org/estimating-value-pi-using-monte-carlo/\n",
91 | "- https://www.youtube.com/watch?v=WAf0rqwAvgg"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "## Task 4\n",
99 | "\n",
100 | "Calculate the area under the function $\\exp(-x^2)$ from $-2$ to $2$.\n"
101 | ]
102 | }
103 | ],
104 | "metadata": {
105 | "kernelspec": {
106 | "display_name": "Python 3.9.12 ('rigplay-lighting')",
107 | "language": "python",
108 | "name": "python3"
109 | },
110 | "language_info": {
111 | "codemirror_mode": {
112 | "name": "ipython",
113 | "version": 3
114 | },
115 | "file_extension": ".py",
116 | "mimetype": "text/x-python",
117 | "name": "python",
118 | "nbconvert_exporter": "python",
119 | "pygments_lexer": "ipython3",
120 | "version": "3.9.12"
121 | },
122 | "orig_nbformat": 4,
123 | "vscode": {
124 | "interpreter": {
125 | "hash": "acaec1dd4d4ad1413b15d1459179aaee505991b8d2edc661768082683fde5d51"
126 | }
127 | }
128 | },
129 | "nbformat": 4,
130 | "nbformat_minor": 2
131 | }
132 |
--------------------------------------------------------------------------------
/01_Numpy/README.md:
--------------------------------------------------------------------------------
1 | # Numpy
2 |
3 | This directory contains an introduction to the `numpy` library.
4 | Start with the [01_numpy.ipynb](01_numpy.ipynb) notebook and proceed to the [homework](02_homework.ipynb).
5 |
--------------------------------------------------------------------------------
/01_Numpy/images/behind_scenes.webp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/01_Numpy/images/behind_scenes.webp
--------------------------------------------------------------------------------
/01_Numpy/images/cpp_numpy.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/01_Numpy/images/cpp_numpy.jpg
--------------------------------------------------------------------------------
/01_Numpy/images/rly.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/01_Numpy/images/rly.gif
--------------------------------------------------------------------------------
/02_Pandas/01_types_of_data.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Introduction to Data Science in Python by Appsilon"
3 | subtitle: "Types of data in Data Science"
4 | author: "Piotr Pasza Storożenko@Appsilon"
5 | lang: "en"
6 | format:
7 | revealjs:
8 | embed-resources: true
9 | theme: [dark]
10 | editor_options:
11 | markdown:
12 | wrap: 99
13 | ---
14 |
15 | # Dataframes
16 |
17 | The building block of data science.
18 |
19 | ## Dataframe example {.smaller}
20 |
21 | Dataframes are the most common data type in Data Science.
22 |
23 | | Name | Surname | Race | Salary | Profession | Age of Death |
24 | |---------|---------|--------|--------|-------------|--------------|
25 | | Bilbo | Baggins | Hobbit | 10000 | Retired | 131 |
26 | | Frodo | Baggins | Hobbit | 70000 | Ring-bearer | 53 |
27 | | Sam | Gamgee | Hobbit | 60000 | Security | 102 |
28 | | Aragorn | NA | Human | 60000 | Security | 210 |
29 |
30 | This is a dataframe with **four** rows and six columns.
31 |
32 | - Dataframes may also be called _tables_.
33 | - Rows may be called _observations_.
34 | - Columns may be called _features_ or _variables_.
35 |
36 | Since this is a table, **every observation must have the same number of columns**.
37 | However, some of them might be _missing_ (NA - not available).
38 |
39 | ## Rows and columns{.smaller}
40 |
41 | Each row represents **a single entity**, e.g.:
42 |
43 | - A single student on the university students' list
44 | - A single part in a warehouse
45 | - Weekly sales for different shops
46 |
47 | . . .
48 |
49 | Values in a single column usually have the same type for every observation.
50 |
51 | - `Salary` is of type `float`.
52 | - `Age of Death` is of type `int`.
53 | - `Name` and `Surname` are of type `string`.
54 | - `Race` is of type **categorical**.
55 |
56 | ## Categorical datatype -- examples{.smaller}
57 |
58 | Values that belong to only a few distinct categories are called [categorical data](https://en.wikipedia.org/wiki/Categorical_variable) (also called factors).
59 |
60 | Examples of factors:
61 |
62 | - `Race` from the above example (let's exclude cases of being half-elf for now).
63 | - Color of eyes
64 | - Gender, although one should be careful as it might be a sensitive topic in some cases
65 | - Day of a week
66 | - Mark at the university (2, 3, 4, 5)
67 | - Country of birth
68 |
69 | ## Categorical datatype -- storing problem and solution{.smaller}
70 |
71 | Imagine having a database of everyone in Poland (nearly 40 million rows).
72 | If we look at the column `Country of Birth`, it will contain only one of ~200 distinct values (since there are ~200 _available_ countries to choose from).
73 |
74 | . . .
75 |
76 | We might store the country name as a `string` in memory, but that would be inefficient: think of the string array storage problem and country names like `United Kingdom of Great Britain and Northern Ireland`.
77 |
78 | . . .
79 |
80 | It's much more reasonable to store a mapping in memory:
81 |
82 | 1. Republic of Poland
83 | 2. Ukraine
84 | 3. United Kingdom of Great Britain and Northern Ireland
85 |
86 | and then store only values $1,2, ..., 200$ in `Country of Birth` column.
87 | This is what categorical values are for.
88 |
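This mapping is what `pandas` implements as the `category` dtype. A minimal sketch (the country list and its length are illustrative):

```python
import pandas as pd

# A long column with only a few distinct values, stored as plain strings
countries = pd.Series(["Poland", "Ukraine", "United Kingdom"] * 100_000)

# The same column as a categorical: small integer codes + a tiny mapping
as_category = countries.astype("category")

# The categorical representation uses far less memory
print(countries.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))
```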
89 | ## Storing dataframes
90 |
91 | Depending on the volume of data, it is usually stored in one of the following formats:
92 |
93 | - [CSV, TSV file](https://en.wikipedia.org/wiki/Comma-separated_values)
94 | - [Arrow file](https://en.wikipedia.org/wiki/Apache_Arrow)
95 | - [Table in some SQL database](https://en.wikipedia.org/wiki/SQL)
96 |
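For example, a CSV round trip with `pandas` (the file name and toy data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Bilbo", "Frodo"],
    "Race": ["Hobbit", "Hobbit"],
    "Salary": [10000.0, 70000.0],
})

df.to_csv("fellowship.csv", index=False)  # write the dataframe to disk
df2 = pd.read_csv("fellowship.csv")       # read it back
print(df2.equals(df))
```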
97 | # Unstructured data
98 |
99 | ## JSON{.smaller}
100 |
101 | Sometimes it's not convenient to place data in a table in an SQL-friendly format.
102 | For example, we might be querying some API for different books by `isbn` code and getting JSON responses like:
103 |
104 | ```json
105 | {
106 | "isbn": "123-456-222",
107 | "authors": [
108 | {
109 | "lastname": "Piotr",
110 | "middlename": "Pasza",
111 | "firstname": "Storozenko"
112 | }
113 | ],
114 | "title": "Introduction to Data Science in Python by Appsilon",
115 | "category": [
116 | "Non-Fiction",
117 | "Pure Fiction"
118 | ]
119 | }
120 | ```
121 |
122 | Note that `authors` is a list with a single element.
123 |
124 | We call this kind of data _unstructured_ even though there is some structure.
125 |
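Such a response can be parsed with python's built-in `json` module:

```python
import json

# The book record from the slide, as a raw JSON string
response = """
{
  "isbn": "123-456-222",
  "authors": [{"lastname": "Piotr", "middlename": "Pasza", "firstname": "Storozenko"}],
  "title": "Introduction to Data Science in Python by Appsilon",
  "category": ["Non-Fiction", "Pure Fiction"]
}
"""

book = json.loads(response)  # parse the JSON string into a dict
author = book["authors"][0]  # the single-element authors list
print(book["isbn"], author["firstname"])
```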
126 | ## XML{.smaller}
127 |
128 | The XML file format is similar to JSON, but it contains much more redundant markup, so it is _heavier_ on disk.
129 |
130 | ```xml
131 | <book>
132 |   <authors>
133 |     <author>
134 |       <lastname>Storozenko</lastname>
135 |       <firstname>Piotr</firstname>
136 |       <middlename>Pasza</middlename>
137 |     </author>
138 |   </authors>
139 |   <category>
140 |     <item>Non-Fiction</item>
141 |     <item>Pure Fiction</item>
142 |   </category>
143 |   <isbn>123-456-222</isbn>
144 |   <title>Introduction to Data Science in Python by Appsilon</title>
145 | </book>
146 |
147 | ```
148 |
149 | # Images
150 |
151 | ## Images
152 |
153 | We all know what an image is.
154 |
155 | . . .
156 |
157 | 
158 |
159 | ## Images
160 |
161 | We all know what an image is.
162 |
163 | It's a tensor of dimensions `[C, H, W]`:
164 |
165 | - `C` -- channels: 1 for black and white images, 3 for RGB, 4 for RGB with transparency, more for satellite images
166 | - `H` -- height
167 | - `W` -- width
168 |
169 | `[3, 355, 355]` in Boromir's case
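A `numpy` sketch of the same idea; note that image libraries such as Pillow load images in `[H, W, C]` order, so a transpose is often needed:

```python
import numpy as np

rng = np.random.default_rng(0)

# A fake RGB image in [H, W, C] order, as PIL/imageio would load it
image_hwc = rng.integers(0, 256, size=(355, 355, 3), dtype=np.uint8)

# Many ML frameworks expect [C, H, W] instead
image_chw = image_hwc.transpose(2, 0, 1)
print(image_chw.shape)  # (3, 355, 355)
```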
170 |
171 | ## How to store images
172 |
173 | On the one hand, there are multiple image file formats like jpg, png, bmp, **webp**.
174 | Here, however, we ask how to store a whole _collection_ of images.
175 |
176 | It must be a format that makes the images convenient to use in ML.
177 |
178 | ## Storing images and metadata{.smaller}
179 |
180 | We can store images in a `data` folder and keep an additional dataframe like:
181 |
182 | | Split | File | Class |
183 | |-------|---------------------------|-------|
184 | | train | 0013035.jpg | ants |
185 | | train | 1030023514_aad5c608f9.jpg | ants |
186 | | train | 1092977343_cb42b38d62.jpg | bees |
187 | | train | 1093831624_fb5fbe2308.jpg | bees |
188 | | val | 10308379_1b6c72e180.jpg | ants |
189 | | val | 1053149811_f62a3410d3.jpg | ants |
190 | | val | 1032546534_06907fe3b3.jpg | bees |
191 | | val | 10870992_eebeeb3a12.jpg | bees |
192 | | ... | ... | ... |
193 |
194 |
195 | ## Storing image folderwise{.smaller}
196 |
197 | We can just preserve the following directory structure:
198 |
199 | ```
200 | data
201 | ├── train
202 | │ ├── ants
203 | │ │ ├── 0013035.jpg
204 | │ │ ├── 1030023514_aad5c608f9.jpg
205 | │ │ ...
206 | │ └── bees
207 | │ ├── 1092977343_cb42b38d62.jpg
208 | │ ├── 1093831624_fb5fbe2308.jpg
209 | │ ...
210 | └── val
211 | ├── ants
212 | │ ├── 10308379_1b6c72e180.jpg
213 | │ ├── 1053149811_f62a3410d3.jpg
214 | │ ...
215 | └── bees
216 | ├── 1032546534_06907fe3b3.jpg
217 | ├── 10870992_eebeeb3a12.jpg
218 | ...
219 | ```
220 |
221 | It's then a bit harder to work with the data, but the dataset is easy to extend.
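The metadata table can always be rebuilt from such a tree by walking it with `pathlib`; a sketch that recreates a tiny version of the layout in a temporary directory:

```python
import tempfile
from pathlib import Path

# Recreate a tiny version of the data/<split>/<class>/<file> layout
root = Path(tempfile.mkdtemp()) / "data"
for split, cls, fname in [
    ("train", "ants", "0013035.jpg"),
    ("train", "bees", "1092977343_cb42b38d62.jpg"),
    ("val", "ants", "10308379_1b6c72e180.jpg"),
]:
    (root / split / cls).mkdir(parents=True, exist_ok=True)
    (root / split / cls / fname).touch()

# Walk the tree back into (split, class, file) rows
rows = [
    (p.parts[-3], p.parts[-2], p.name)
    for p in sorted(root.glob("*/*/*.jpg"))
]
print(rows)
```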
222 |
223 | # Other kinds of data
224 |
225 | ## Other kinds of data
226 |
227 | Of course, there are many more kinds of data, like:
228 |
229 | - Raw text data
230 | - Audio data
231 | - Animation data
232 |
233 | But they tend to follow similar patterns.
234 |
235 | # Data versioning
236 |
237 | ## Data versioning
238 |
239 | If we have a **very** small amount of data, we can just store it in git.
240 | Very quickly, this stops being enough.
241 |
242 | I recommend [dvc](https://dvc.org/) as a convenient tool for data versioning.
--------------------------------------------------------------------------------
/02_Pandas/03_homework.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Pandas Homework\n",
8 | "\n",
9 | "In this homework you will have to make some queries to answer questions.\n",
10 | "All queries will be based on the [flights dataset](https://www.kaggle.com/datasets/usdot/flight-delays).\n",
11 | "\n",
12 | "Example solutions can be found in `homework_solutions` directory."
13 | ]
14 | },
15 | {
16 | "cell_type": "markdown",
17 | "metadata": {},
18 | "source": [
19 | "1. Find the maximum delay caused by weather (`WEATHER_DELAY`).\n",
20 | "2. Find the maximum arrival delay for planes from `LAX` to `ATL` in December.\n",
21 | "3. Find the tail number of the plane with the maximum arrival delay for planes from `LAX` to `ATL` in December.\n",
22 | "4. Find the minimum arrival delay for planes from `LAX` to `ATL` in December.\n",
23 | "5. Find the tail number of the plane with the minimum arrival delay for planes from `LAX` to `ATL` in December.\n",
24 | "6. For the flight with the maximum flight time (`AIR_TIME`) find its flight time (`AIR_TIME`).\n",
25 | "7. For the flight with the maximum flight time (`AIR_TIME`) find its destination airport.\n",
26 | "8. For the flight with the maximum flight time (`AIR_TIME`) find its airline.\n",
27 | "9. For all flights from `ORD` to `AUS` and from `AUS` to `ORD` find the minimum time spent in the air (`AIR_TIME`).\n",
28 | "10. For all flights from `ORD` to `AUS` and from `AUS` to `ORD` find the mean time spent in the air (`AIR_TIME`).\n",
29 | "11. For all flights from `ORD` to `AUS` and from `AUS` to `ORD` find the median time spent in the air (`AIR_TIME`).\n",
30 | "12. For all flights from `ORD` to `AUS` and from `AUS` to `ORD` find the maximum time spent in the air (`AIR_TIME`).\n",
31 | "13. Find the airplane that operates the most flights from Yuma (`YUM`).\n",
32 | "14. Find the number of flights made from Yuma (`YUM`) by the airplane that operates the most flights from Yuma (`YUM`).\n",
33 | "15. Find the route that takes the shortest time to fly (by `AIR_TIME`).\n",
34 | "16. Find the mean time spent in the air on routes from `HNL` to `EWR` and from `EWR` to `HNL`.\n",
35 | "17. Find the mean time spent in the air on the route from `HNL` to `EWR`.\n",
36 | "18. Find the mean time spent in the air on the route from `EWR` to `HNL`."
37 | ]
38 | }
39 | ],
40 | "metadata": {
41 | "kernelspec": {
42 | "display_name": "Python 3.9.12 ('rigplay-lighting')",
43 | "language": "python",
44 | "name": "python3"
45 | },
46 | "language_info": {
47 | "name": "python",
48 | "version": "3.9.12"
49 | },
50 | "orig_nbformat": 4,
51 | "vscode": {
52 | "interpreter": {
53 | "hash": "acaec1dd4d4ad1413b15d1459179aaee505991b8d2edc661768082683fde5d51"
54 | }
55 | }
56 | },
57 | "nbformat": 4,
58 | "nbformat_minor": 2
59 | }
60 |
--------------------------------------------------------------------------------
/02_Pandas/README.md:
--------------------------------------------------------------------------------
1 | # Pandas
2 |
3 | This directory contains an introduction to the types of data used in data science/machine learning, followed by an introductory tutorial to `pandas` - the go-to library for dataframes in python.
4 |
5 | As a bonus I present ways to do the homework in `polars`.
6 |
7 | In this one notebook I covered only the very basics of data wrangling in `pandas`.
8 | If you want to get a more detailed course regarding this manner I highly recommend [Minimalist Data Wrangling with Python](https://datawranglingpy.gagolewski.com/) book by Marek Gągolewski!
9 | It is available for free for everyone.
10 |
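As a minimal taste of the kind of data wrangling the notebook covers (the dataframe below is made up for illustration), a typical `pandas` snippet looks like this:

```python
import pandas as pd

# a tiny, made-up dataframe
df = pd.DataFrame({"city": ["Warsaw", "Kraków", "Warsaw"], "visitors": [10, 7, 5]})

# the bread and butter of data wrangling: filter, group, aggregate
warsaw_total = df[df["city"] == "Warsaw"]["visitors"].sum()
per_city = df.groupby("city")["visitors"].sum()
```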
--------------------------------------------------------------------------------
/02_Pandas/images/de_wiki_problem.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/02_Pandas/images/de_wiki_problem.png
--------------------------------------------------------------------------------
/02_Pandas/images/one_does.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/02_Pandas/images/one_does.jpg
--------------------------------------------------------------------------------
/02_Pandas/images/weird_line.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/02_Pandas/images/weird_line.png
--------------------------------------------------------------------------------
/03_Plots/02_homework.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# ~Plots~ Pandas Homework Part 2\n",
8 | "\n",
9 | "Surprise, this homework will also be about `pandas`, as it's crucial to master your `pandas` skills in the world of Data Science.\n",
10 | "\n",
11 | "This time you will use data from Travel Stack Exchange and Wikipedia.\n",
12 | "\n",
13 | "Example solutions can be found in the `homework_solutions` directory.\n",
14 | "This time two versions of the solutions are available:\n",
15 | "pandas (1) and polars (2).\n",
16 | "\n",
17 | "## Wikipedia clickstream\n",
18 | "\n",
19 | "This [dataset](https://dumps.wikimedia.org/other/clickstream/readme.html) \n",
20 | "contains information on how Wikipedia users move around the website.\n",
21 | "You will work on [the data from March 2022](https://dumps.wikimedia.org/other/clickstream/2022-03/).\n",
22 | "The data format is [available here](https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream#Format).\n",
23 | "\n",
24 | "## Stack Exchange\n",
25 | "\n",
26 | "The second dataset contains information on posts (and more) from [Stack Exchange users](https://archive.org/download/stackexchange).\n",
27 | "It may surprise you, but all the posts from Stack Overflow and related sites (like Math Stack Exchange) are available for analysis!\n",
28 | "We will focus on data coming from the site on [**travel**](https://archive.org/download/stackexchange/travel.stackexchange.com.7z)!\n",
29 | "\n",
30 | "A tip for the task on `UpVotes` and `DownVotes`:\n",
31 | "take a look at the column `VoteTypeId` in the dataframe `Votes`.\n",
32 | "It contains information on the vote type.\n",
33 | "Each `VoteTypeId` corresponds to a different type of vote.\n",
34 | "Information on what counts as an upvote, what counts as a downvote, and which answer has been accepted is [available here](https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede).\n",
35 | "\n",
36 | ""
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "### Tasks\n",
44 | "\n",
45 | "1. Find the article with the most external entries on German Wikipedia in March 2022.\n",
46 | "2. Find the article with the most external entries on Polish Wikipedia in March 2022.\n",
47 | "3. Find the `DisplayName` of the user with the most badges.\n",
48 | "4. Find the `Location` of the user with the most badges.\n",
49 | "5. Find the number of entries on the article about the city from question number 4 on English Wikipedia in March 2022.\n",
50 | "6. Find the most common word with at least 8 letters in all posts.\n",
51 | "7. Find the number of occurrences of the most common word with at least 8 letters in all posts.\n",
52 | "8. For the post with the highest difference between upvotes and downvotes, find its author's `DisplayName`.\n",
53 | "9. For the post with the highest difference between upvotes and downvotes, find that difference.\n",
54 | "10. Find the month in which the most posts were created.\n",
55 | "11. Find the month in which there was the biggest decrease in the number of created posts.\n",
56 | "12. Find the most common tag among posts created by users from Poland (column `Location` should contain `Poland` or `Polska`)."
57 | ]
58 | }
59 | ],
60 | "metadata": {
61 | "kernelspec": {
62 | "display_name": "Python 3.9.12 ('rigplay-lighting')",
63 | "language": "python",
64 | "name": "python3"
65 | },
66 | "language_info": {
67 | "name": "python",
68 | "version": "3.9.12"
69 | },
70 | "orig_nbformat": 4,
71 | "vscode": {
72 | "interpreter": {
73 | "hash": "acaec1dd4d4ad1413b15d1459179aaee505991b8d2edc661768082683fde5d51"
74 | }
75 | }
76 | },
77 | "nbformat": 4,
78 | "nbformat_minor": 2
79 | }
80 |
--------------------------------------------------------------------------------
/03_Plots/README.md:
--------------------------------------------------------------------------------
1 | # Plots
2 |
3 | This directory contains a notebook with examples of plotting in Python.
4 | We cover two libraries: `matplotlib` and `plotly.express`.
5 |
6 | ## `matplotlib`
7 |
8 | This library is usually used if you need a quick, dead-simple plot or a plot that will be exported to PDF.
9 | Lots and lots of [tutorials are available for `matplotlib` online](https://matplotlib.org/stable/tutorials/index.html).
10 |
11 | ## `plotly.express`
12 |
13 | This is a [syntactic-sugar version of the `plotly` library](https://plotly.com/python/plotly-express/).
14 | In a very simple way you can create interactive and visually appealing plots from data frames.
15 |
--------------------------------------------------------------------------------
/03_Plots/images/votes.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/03_Plots/images/votes.png
--------------------------------------------------------------------------------
/04_Scikit-learn/.gitignore:
--------------------------------------------------------------------------------
1 | 04_prepare_hw.ipynb
--------------------------------------------------------------------------------
/04_Scikit-learn/01_machine_learning.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Introduction to Data Science in Python by Appsilon"
3 | subtitle: "Machine Learning"
4 | author: "Piotr Pasza Storożenko@Appsilon"
5 | format:
6 | revealjs:
7 | theme: [dark]
8 |     embed-resources: true
10 | editor_options:
11 | markdown:
12 | wrap: 99
13 | ---
14 |
15 | # Problems machine learning solves
16 |
17 | ## How do we divide ML?
18 |
19 | - Supervised (pol. Nadzorowane)
20 | - Classification (pol. Klasyfikacja)
21 | - Regression (pol. Regresja)
22 | - Unsupervised (pol. Nienadzorowane)
23 | - Clustering (pol. Analiza skupień)
24 |
25 | ## Classification
26 |
27 | Our aim is to predict the **discrete** value.
28 |
29 | . . .
30 |
31 | Example classification problems:
32 |
33 | - Model that distinguishes between cats and dogs
34 | - Document classification into given classes
35 | - Deciding whether a tweet is toxic or not
36 | - Product categorization
37 | - Fraud detection
38 |
39 | ## Regression
40 |
41 | Our aim is to predict the **continuous** value.
42 |
43 | . . .
44 |
45 | Example regression problems:
46 |
47 | - Real estate price prediction
48 | - Sales revenue prediction
49 | - Age estimation
50 |
51 | ## Clustering
52 |
53 | Our aim is to **find similar** entities.
54 |
55 | . . .
56 |
57 | Example clustering problems:
58 |
59 | - Find similar products
60 | - Suggest similar artists
61 |
62 | # Basic rules of approaching ML problems
63 |
64 | ## Metric
65 |
66 | To solve an ML problem we must be able to say which model is good and which is bad.
67 |
68 | The single number that decides whether the model is good or bad is called a **metric**.
69 |
70 | With a metric we can ask the computer to find _the best model_.
71 |
72 | We use different metrics in different tasks.
73 |
74 | ## Example regression metrics
75 |
76 | - Mean square error
77 | - Mean absolute error
78 |
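As a sketch (assuming `scikit-learn` is available, with made-up numbers), both metrics can be computed in a couple of lines:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, 5.0, 2.5]  # made-up targets
y_pred = [2.5, 5.0, 4.0]  # made-up predictions

mse = mean_squared_error(y_true, y_pred)   # penalizes large errors more strongly
mae = mean_absolute_error(y_true, y_pred)  # treats all errors linearly
```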
79 | ## Example classification metrics
80 |
81 | :::: {.columns}
82 |
83 | ::: {.column width="40%"}
84 | - Accuracy
85 | - Precision
86 | - Recall
87 | - F1
88 |
89 | :::
90 |
91 | ::: {.column width="60%"}
92 | {height="600"}
93 |
94 | :::
95 |
96 | ::::
97 |
98 | ## What is learned from what{.smaller}
99 |
100 | We start with the whole dataset and divide its columns into `X` and `y`.
101 |
102 | - `X` is a matrix whose columns are the features used to train the model
103 | - `y` is a **target** vector
104 |
105 | Examples:
106 |
107 | In the case of flat price prediction, `X` will consist of columns like flat size, number of rooms, number of bathrooms, an indication whether the flat has a balcony, and so on. `y` will be the price.
108 |
109 | In the case of cats vs dogs prediction, `X` is the pixel values and `y` is the label `cat`/`dog`.
110 |
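In `pandas`, splitting a (made-up) flats dataset into `X` and `y` might look like this:

```python
import pandas as pd

# a made-up flats dataset
flats = pd.DataFrame({
    "size_m2": [45, 60, 80],
    "rooms": [2, 3, 4],
    "has_balcony": [0, 1, 1],
    "price": [300_000, 420_000, 610_000],  # the target column
})

X = flats.drop(columns=["price"])  # feature matrix
y = flats["price"]                 # target vector
```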
111 | ## Train test split
112 |
113 | We usually want to train the model on part of the data and check its performance on the rest.
114 |
115 | The basic approach to this is train test split:
116 |
117 | 
118 |
119 | :::{.footer}
120 | Source: https://towardsdatascience.com/understanding-train-test-split-scikit-learn-python-ea676d5e3d1
121 | :::
122 |
123 |
124 | ## Train test split{.smaller}
125 |
126 | 
127 |
128 | 1. Train model `m` on `X_train` and `y_train`.
129 | 2. Get predictions `y_pred` by evaluating model `m` on `X_test`.
130 | 3. Calculate metric by comparing `y_pred` and `y_test`.
131 |
132 | :::{.footer}
133 | Source: https://towardsdatascience.com/understanding-train-test-split-scikit-learn-python-ea676d5e3d1
134 | :::
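The steps above can be sketched with `scikit-learn` (on made-up numeric data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 observations, 2 features (made up)
y = np.arange(10)

# hold out 30% of the rows for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```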
135 |
136 | # The most common column transformations
137 |
138 | The majority of models expect the input to be in the form of numeric values.
139 | Often that's not the case...
140 |
141 | ## One-hot encoding
142 |
143 | The easiest way to convert categorical values to numeric ones is the so-called one-hot encoding.
144 |
145 | 
146 |
147 | :::{.footer}
148 | Source: https://towardsdatascience.com/building-a-one-hot-encoding-layer-with-tensorflow-f907d686bf39
149 | :::
150 |
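In `pandas`, one-hot encoding a (made-up) categorical column is a one-liner:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

# one indicator column per category: color_blue, color_green, color_red
encoded = pd.get_dummies(df, columns=["color"])
```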
151 | ## Smarter `NA` trick{.smaller}
152 |
153 | We've learned how to fill `NA` with zeros in pandas.
154 |
155 | Sometimes filling with the median/mean value makes more sense.
156 | It's worth experimenting and trying different approaches yourself!
157 |
158 | A smart trick to get more information out of `NA` is to, apart from filling it, add a new column with a boolean value, like so:
159 |
160 | 
--------------------------------------------------------------------------------
/04_Scikit-learn/03_homework.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Sklearn Homework\n",
8 | "\n",
9 | "In this homework you will work on 3 files that can be found in the `data/housing/*` directory:\n",
10 | "\n",
11 | "1. `housing_train.csv` - training data\n",
12 | "2. `housing_validation.csv` - validation data, you will evaluate your model on this data\n",
13 | "3. `housing_example_submission.csv` - example submission of your model\n",
14 | "\n",
15 | "You will split your work into two parts.\n",
16 | "\n",
17 | "## Train a simple linear regression model\n",
18 | "\n",
19 | "First make the following preprocessing on the whole `housing_train.csv`:\n",
20 | "\n",
21 | "1. Change columns with more than 70% `NA` values into `NA_in_col_*` columns, following the instructions in the presentation. **Remove the original column**.\n",
22 | "2. Fill the rest of the `NA`s in other columns with the **median** of the particular column.\n",
23 | "\n",
24 | "With such dataframe train the linear regression model that will predict the `MEDV` column.\n",
25 | "Using this model answer the following questions:\n",
26 | "\n",
27 | "1. Predict the price of the observation `rec` (defined below).\n",
28 | "2. How much would the prediction change if we increased the `RM` column by 2?"
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": 1,
34 | "metadata": {},
35 | "outputs": [
36 | {
37 | "data": {
93 | "text/plain": [
94 | " CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO \\\n",
95 | "0 3.69 11.37 11.15 0.07 0.87 6.29 68.91 3.77 9.5 410.95 18.37 \n",
96 | "\n",
97 | " B LSTAT \n",
98 | "0 354.47 0.79 "
99 | ]
100 | },
101 | "execution_count": 1,
102 | "metadata": {},
103 | "output_type": "execute_result"
104 | }
105 | ],
106 | "source": [
107 | "import pandas as pd\n",
108 | "\n",
109 | "rec = pd.DataFrame({\n",
110 | " 'CRIM': [3.69],\n",
111 | " 'ZN': [11.37],\n",
112 | " 'INDUS': [11.15],\n",
113 | " 'CHAS': [0.07],\n",
114 | " 'NOX': [0.87],\n",
115 | " 'RM': [6.29],\n",
116 | " 'AGE': [68.91],\n",
117 | " 'DIS': [3.77],\n",
118 | " 'RAD': [9.50],\n",
119 | " 'TAX': [410.95],\n",
120 | " 'PTRATIO': [18.37],\n",
121 | " 'B': [354.47],\n",
122 | " 'LSTAT': [0.79],\n",
123 | "})\n",
124 | "rec"
125 | ]
126 | },
127 | {
128 | "cell_type": "markdown",
129 | "metadata": {},
130 | "source": [
131 | "## Train the best regression model you can\n",
132 | "\n",
133 | "Now you are tasked with a real ML case.\n",
134 | "Using the acquired knowledge and the [(brilliant) `scikit-learn` documentation](https://scikit-learn.org/stable/) train the best model possible!\n",
135 | "\n",
136 | "Using the `housing_train.csv` file, train the best model you can by minimizing the **Mean Square Error** metric, then make predictions on `housing_validation.csv`.\n",
137 | "Predictions should be saved to a file with the same format as the `housing_example_submission.csv` file (i.e. ensure the correct column name and no index; `s.to_csv(filename, index=False)` should work).\n",
138 | "Of course, the `housing_validation.csv` file doesn't contain the `MEDV` column; you have to predict it.\n",
139 | "\n",
140 | "After saving the results, save the model as well.\n",
141 | "This can be done for example using [`pickle`](https://scikit-learn.org/stable/model_persistence.html).\n"
142 | ]
143 | },
144 | {
145 | "cell_type": "markdown",
146 | "metadata": {},
147 | "source": [
148 | "### Tips\n",
149 | "\n",
150 | "When you work on filling `NA`s with the median, take a look at how I did it in the presentation.\n",
151 | "\n",
152 | "In the second part experiment with both preprocessing and modeling.\n",
153 | "\n",
154 | "Remember that training on `housing_train.csv` and checking the model performance on the same dataset can be very misleading.\n",
155 | "\n",
156 | "You probably want to split `housing_train.csv` into train and test sets (independently of `housing_validation.csv`).\n",
157 | "Create the best model using those two datasets and only then make predictions on `housing_validation.csv`.\n",
158 | "Maybe this is the place to learn what [cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html) is?\n",
159 | "How good is your prediction? You will check it in the next lesson!\n",
160 | "\n",
161 | "Apart from `sklearn` you can try [xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.sklearn) and [lightgbm](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html), they should work as a drop-in replacement.\n"
162 | ]
163 | },
164 | {
165 | "cell_type": "markdown",
166 | "metadata": {},
167 | "source": [
168 | "### Context\n",
169 | "\n",
170 | "On websites like [kaggle](https://www.kaggle.com/) you can challenge yourself with ML tasks.\n",
171 | "Usually a competition looks similar to this task.\n",
172 | "You get one csv file with training data and have to evaluate the best model on another.\n",
173 | "\n",
174 | "## Additional information on the dataset\n",
175 | "\n",
176 | "This is a modified Boston housing dataset, a _classic_ dataset used for learning ML.\n",
177 | "\n",
178 | "(Not fully accurate) information on columns' content:\n",
179 | "\n",
180 | "```\n",
181 | "1. CRIM per capita crime rate by town\n",
182 | "2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.\n",
183 | "3. INDUS proportion of non-retail business acres per town\n",
184 | "4. CHAS Charles River dummy variable (= 1 if tract bounds \n",
185 | " river; 0 otherwise)\n",
186 | "5. NOX nitric oxides concentration (parts per 10 million)\n",
187 | "6. RM average number of rooms per dwelling\n",
188 | "7. AGE proportion of owner-occupied units built prior to 1940\n",
189 | "8. DIS weighted distances to five Boston employment centres\n",
190 | "9. RAD index of accessibility to radial highways\n",
191 | "10. TAX full-value property-tax rate per $10,000\n",
192 | "11. PTRATIO pupil-teacher ratio by town\n",
193 | "12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n",
194 | "13. LSTAT % lower status of the population\n",
195 | "14. MEDV Median value of owner-occupied homes in $1000's\n",
196 | "```\n",
197 | "\n"
198 | ]
199 | }
200 | ],
201 | "metadata": {
202 | "interpreter": {
203 | "hash": "306379f83cd7ad906f1941e4fa5d943b9cb847ce8f1ed425d8b7a18353685fed"
204 | },
205 | "kernelspec": {
206 | "display_name": "Python 3.10.4 ('daftacademy-ds')",
207 | "language": "python",
208 | "name": "python3"
209 | },
210 | "language_info": {
211 | "codemirror_mode": {
212 | "name": "ipython",
213 | "version": 3
214 | },
215 | "file_extension": ".py",
216 | "mimetype": "text/x-python",
217 | "name": "python",
218 | "nbconvert_exporter": "python",
219 | "pygments_lexer": "ipython3",
220 | "version": "3.10.4"
221 | },
222 | "orig_nbformat": 4
223 | },
224 | "nbformat": 4,
225 | "nbformat_minor": 2
226 | }
227 |
--------------------------------------------------------------------------------
/04_Scikit-learn/README.md:
--------------------------------------------------------------------------------
1 | # Scikit-learn
2 |
3 | We're done with data processing for a while; now we start doing machine learning!
4 | Start with the presentation introducing machine learning, proceed to the notebook on linear regression, and then do the homework!
5 |
6 | Good luck!
7 |
--------------------------------------------------------------------------------
/04_Scikit-learn/images/NA_trick.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/04_Scikit-learn/images/NA_trick.png
--------------------------------------------------------------------------------
/04_Scikit-learn/images/Precisionrecall.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/04_Scikit-learn/images/Precisionrecall.png
--------------------------------------------------------------------------------
/04_Scikit-learn/images/na_example.svg:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/04_Scikit-learn/images/onehot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/04_Scikit-learn/images/onehot.png
--------------------------------------------------------------------------------
/04_Scikit-learn/images/regression.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/04_Scikit-learn/images/regression.png
--------------------------------------------------------------------------------
/04_Scikit-learn/images/traintest.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/04_Scikit-learn/images/traintest.png
--------------------------------------------------------------------------------
/05_SharingWork/01_streamlit/01_hello_world/README.md:
--------------------------------------------------------------------------------
1 | # Hello World in Streamlit
2 |
3 | Simple apps; run each with `streamlit run main1.py` (likewise for `main2.py` and `main3.py`).
4 |
--------------------------------------------------------------------------------
/05_SharingWork/01_streamlit/01_hello_world/main1.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 |
3 | st.title("My first streamlit app")
4 | st.write("Hello world!")
5 |
--------------------------------------------------------------------------------
/05_SharingWork/01_streamlit/01_hello_world/main2.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 |
3 | # This might look like magic, and it is: Streamlit renders bare expressions
4 | # like the strings below when the app is run with `streamlit run main2.py`
5 |
6 | "# My second streamlit app"
7 | "Hello world!"
8 |
--------------------------------------------------------------------------------
/05_SharingWork/01_streamlit/01_hello_world/main3.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 | import numpy as np
3 | import pandas as pd
4 | import plotly.express as px
5 | import matplotlib.pyplot as plt
6 |
7 |
8 | "# My third streamlit app"
9 | r"Plot of $y = \sin(x)^2$"  # raw string so the backslash reaches the LaTeX renderer
10 |
11 |
12 | x = np.r_[0:2*np.pi:100j]
13 | y = np.sin(x) ** 2
14 | df = pd.DataFrame({
15 | 'x': x,
16 | 'y': y,
17 | })
18 | st.line_chart(y)
19 |
20 |
21 | fig, ax = plt.subplots()
22 | ax.plot(x, y)
23 | ax.set_title("The sine squared plot")
24 | st.pyplot(fig)
25 |
26 | p = px.line(df, 'x', 'y')
27 | st.plotly_chart(p)
28 |
29 | st.dataframe(df)
30 |
31 | st.table(df.head(20))
32 |
--------------------------------------------------------------------------------
/05_SharingWork/01_streamlit/02_small_report/README.md:
--------------------------------------------------------------------------------
1 | # Small report
2 |
3 | A simple app presenting the dataframe capabilities of `streamlit`.
--------------------------------------------------------------------------------
/05_SharingWork/01_streamlit/02_small_report/main.py:
--------------------------------------------------------------------------------
3 | import pandas as pd
4 | import streamlit as st
5 | import plotly.express as px
6 |
7 | st.config.dataFrameSerialization = "arrow"
8 |
9 |
10 | @st.cache_data
11 | def load_data():
12 | flights = pd.read_csv("../../../data/flights/flights.csv", engine="pyarrow")
13 | flights["DATE"] = pd.to_datetime(
14 | flights["YEAR"].astype(str) + "-" + flights["MONTH"].astype(str) + "-" + flights["DAY"].astype(str)
15 | )
16 | airports = pd.read_csv("../../../data/flights/airports.csv", engine="pyarrow")
17 | return flights, airports
18 |
19 |
20 | flights, airports = load_data()
21 |
22 | "# Small flights application"
23 |
24 | "Let's recall some analysis from earlier classes."
25 |
26 | airport_input = st.text_input("Airport")
27 |
28 | # aggregate arrival delays of flights arriving at the chosen airport
29 | flights_to_airport = (
30 |     flights.query("DESTINATION_AIRPORT == @airport_input")
31 |     .groupby("ORIGIN_AIRPORT")
32 |     .agg({"ARRIVAL_DELAY": ["mean", "count"]})
33 |     .reset_index(col_level=1)
34 | )
35 | # drop the outer level of the MultiIndex columns created by .agg
36 | flights_to_airport.columns = flights_to_airport.columns.droplevel(0)
37 | flights_to_airport = flights_to_airport.merge(
38 |     airports, left_on="ORIGIN_AIRPORT", right_on="IATA_CODE"
39 | )
40 |
41 | st.write(flights_to_airport)
42 |
43 | flights_d = flights.query("DESTINATION_AIRPORT == @airport_input").value_counts("DATE").reset_index(name="Count").sort_values("DATE")
44 |
45 | p = px.line(flights_d, "DATE", "Count")
46 |
47 | st.plotly_chart(p)
48 |
--------------------------------------------------------------------------------
/05_SharingWork/01_streamlit/03_mc_simulation/README.md:
--------------------------------------------------------------------------------
1 | # MC Simulations
2 |
3 | This application shows what an interactive dashboard presenting the simulation from the first homework might look like.
4 |
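The underlying estimate in the app can be sketched in plain `numpy` (seeded here so the result is reproducible):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.random(n) * 4 - 2     # x uniform on [-2, 2]
y = rng.random(n)             # y uniform on [0, 1]
hits = y < np.exp(-x**2)      # points that land under the curve
area = hits.mean() * 4        # hit fraction times bounding-box area (4 * 1)
```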
--------------------------------------------------------------------------------
/05_SharingWork/01_streamlit/03_mc_simulation/main.py:
--------------------------------------------------------------------------------
1 | import matplotlib.pyplot as plt
2 | import numpy as np
3 | import streamlit as st
4 |
5 |
6 | def simulation(n):
7 | x = np.random.rand(n) * 4 - 2
8 | yr = np.random.rand(n)
9 | yf = np.exp(-(x**2))
10 | A = (yr < yf).mean() * 4
11 |
12 | fig, ax = plt.subplots()
13 | ax.plot(x[yf <= yr], yr[yf <= yr], ".b")
14 | ax.plot(x[yf > yr], yr[yf > yr], ".r")
15 |
16 | return fig, A
17 |
18 |
19 | "# MC simulation"
20 |
21 | r"In this app we will calculate the area under the function $\exp(-x^2)$ from $-2$ to $2$."
22 |
23 | n = st.number_input("n", value=1_000, min_value=1, max_value=10_000_000, format="%d")
24 |
25 | if st.button("Simulate!"):
26 | with st.spinner(text="Simulating!"):
27 | fig, A = simulation(n)
28 | st.pyplot(fig)
29 | f"Calculated area under the function: {A.round(3)}"
30 |
--------------------------------------------------------------------------------
/05_SharingWork/01_streamlit/04_dishes/README.md:
--------------------------------------------------------------------------------
1 | # Favorite dish application
2 |
3 | This app shows how to easily create an application for use by multiple users.
4 | Everyone can pick their favorite food from the list or add their own.
5 |
6 | Note that the app is _ill-designed_, with a clear data race on the `dishes.json` file.
7 |
--------------------------------------------------------------------------------
/05_SharingWork/01_streamlit/04_dishes/dishes.json:
--------------------------------------------------------------------------------
1 | {}
--------------------------------------------------------------------------------
/05_SharingWork/01_streamlit/04_dishes/favorite_dish.py:
--------------------------------------------------------------------------------
1 | import json
2 |
3 | import pandas as pd
4 | import plotly.express as px
5 | import streamlit as st
6 |
7 |
8 | st.set_page_config(page_title="Favorite dish", page_icon="🥘")
9 |
10 | with open("dishes.json") as f:
11 | dishes = json.load(f)
12 |
13 | st.title("What is your favorite dish?")
14 | st.write("Pick from the list or add your own")
15 | form = st.form(key="dish")
16 | available_dishes = sorted(dishes.keys())
17 | dish_radio = form.selectbox("Pick from the list", available_dishes)
18 | dish_text = form.text_input(
19 | "Or type if it's not on the list:",
20 | placeholder="Leave empty to choose from the list",
21 | )
22 | submit = form.form_submit_button("Submit")
23 |
24 | if submit:
25 | if dish_text != "":
26 | key = dish_text.lower().capitalize()
27 | else:
28 | key = dish_radio
29 |
30 | if key not in dishes:
31 | dishes[key] = 1
32 | else:
33 | dishes[key] += 1
34 |
35 | with open("dishes.json", "w") as f:
36 | json.dump(dishes, f)
37 |
38 | if st.button("Refresh"):
39 | pass
40 | df = pd.DataFrame(dishes.items(), columns=["Dish", "Count"])
41 | st.plotly_chart(px.bar(df, x="Dish", y="Count"))
42 |
--------------------------------------------------------------------------------
/05_SharingWork/01_streamlit/05_checker/README.md:
--------------------------------------------------------------------------------
1 | # Homework checker
2 |
3 | This app has been designed to check the prediction part of the homework from the previous week!
4 | Run it and upload your predictions.
5 |
6 | The design allows running on a server with multiple users connecting and uploading their results.
7 |
8 | ## Running checker
9 |
10 | The database already contains a LightGBM model result; if you want to create a clean database, delete `results.db` and run the `create_db.py` script.
11 | You can run the app with `streamlit run main.py`.
12 |
--------------------------------------------------------------------------------
/05_SharingWork/01_streamlit/05_checker/create_db.py:
--------------------------------------------------------------------------------
1 | import sqlite3
2 |
3 | con = sqlite3.connect("results.db")
4 |
5 | with con:
6 | con.execute(
7 | """CREATE TABLE results
8 | (nickname text, email text, score real, filename text)"""
9 | )
10 |
--------------------------------------------------------------------------------
/05_SharingWork/01_streamlit/05_checker/files/.gitignore:
--------------------------------------------------------------------------------
1 | *.csv
--------------------------------------------------------------------------------
/05_SharingWork/01_streamlit/05_checker/main.py:
--------------------------------------------------------------------------------
1 | import sqlite3
2 | import uuid
3 |
4 | import pandas as pd
5 | import streamlit as st
6 | from sklearn.metrics import mean_squared_error
7 |
8 |
9 | con = sqlite3.connect("results.db", check_same_thread=False)
10 | stmt = "INSERT INTO results (nickname, email, score, filename) values (?, ?, ?, ?)"
11 |
12 | y_true = pd.read_csv("y_true.csv").iloc[:, 0]
13 |
14 | "# Week 4 Homework Checker"
15 |
16 | with st.form("my_form"):
17 | student_nickname = st.text_input("Your nickname")
18 | student_email = st.text_input("Your e-mail", help="The one you've used for registration")
19 | uploaded_file = st.file_uploader("Upload your homework")
20 | submitted = st.form_submit_button("Submit")
21 | info_element = st.info("You must fill the name, mail and upload the file, then press `Submit`")
22 | if student_nickname != "" and student_email != "" and uploaded_file is not None and submitted:
23 | df = pd.read_csv(uploaded_file)
24 | filename = f"files/{uuid.uuid4()}.csv"
25 | df.to_csv(filename, index=False)
26 | y_pred = df["MEDV"]
27 | score = mean_squared_error(y_true, y_pred)
28 | fields = (student_nickname, student_email, score, filename)
29 | with con:
30 | con.execute(stmt, fields)
31 | info_element.success("Result recorded")
32 | elif submitted:
33 | info_element.error("You must fill the name, mail and upload the file, then press `Submit`")
34 |
35 | "## Top 10 Results:"
36 |
37 | results = pd.read_sql_query("SELECT * FROM results", con)
38 |
39 | if st.button("Refresh"):
40 | pass
41 |
42 | st.table(
43 | results.sort_values("score")
44 | .groupby("email")
45 | .first()
46 | .reset_index()
47 | .loc[:, ["nickname", "score"]]
48 | .sort_values("score")
49 | .head(10)
50 | )
51 |
--------------------------------------------------------------------------------
/05_SharingWork/01_streamlit/05_checker/results.db:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/05_SharingWork/01_streamlit/05_checker/results.db
--------------------------------------------------------------------------------
/05_SharingWork/01_streamlit/05_checker/y_true.csv:
--------------------------------------------------------------------------------
1 | MEDV
2 | 101.21033698173383
3 | 138.91314641725276
4 | 58.23319106873045
5 | 97.65902371130434
6 | 68.92553585303733
7 | 85.65052981879218
8 | 76.29113025829184
9 | 59.98630034438092
10 | 83.92613343134433
11 | 72.01118151457328
12 | 92.16125152937637
13 | 80.94334883753203
14 | 29.98226780875327
15 | 90.88349143008415
16 | 79.19801777762241
17 | 127.72797082370907
18 | 80.61791173105397
19 | 43.67256295325555
20 | 214.14708398667176
21 | 60.48215847691291
22 | 108.01116698952946
23 | 124.59504671034674
24 | 54.4307463870796
25 | 95.98739923815477
26 | 60.86837265398642
27 | 59.13618569199229
28 | 86.89390306707966
29 | 63.85046694579307
30 | 92.97001454452686
31 | 78.42034935253015
32 | 99.07810016202895
33 | 101.96230141956615
34 | 64.20260696852783
35 | 89.04957792817417
36 | 81.87304637855341
37 | 83.19084441481127
38 | 148.5403003078944
39 | 83.537348656387
40 | 104.51152581320432
41 | 100.36999194026664
42 | 84.38958140755855
43 | 120.97578159520806
44 | 214.27920633629162
45 | 74.55494176812948
46 | 96.8776373142841
47 | 64.73835525070551
48 | 56.146704596483936
49 | 103.78223083506998
50 | 85.26735746265928
51 | 102.81215133358074
52 | 80.90933117433606
53 | 151.62963937737572
54 | 65.19437154351274
55 | 113.66647394288508
56 | 186.43807621201492
57 | 90.76212320268449
58 | 78.86001935421702
59 | 122.02964510092482
60 | 102.52896260509968
61 | 79.30402130979557
62 | 107.22373036030697
63 | 151.6943977652074
64 | 134.88626011767445
65 | 86.62827000798177
66 | 103.29393469599242
67 | 85.71805278777714
68 | 56.15467383789862
69 | 106.17114827729353
70 | 132.08205032646725
71 | 54.416540325330224
72 | 85.79145405259666
73 | 101.53077753401259
74 | 46.293398204646294
75 | 88.31330943238713
76 | 89.02505708508217
77 | 21.40087579954311
78 |
--------------------------------------------------------------------------------
/05_SharingWork/02_quarto/README.md:
--------------------------------------------------------------------------------
1 | # Quarto demo
2 |
3 | Quarto is a tool for converting your notebooks into reports and presentations.
4 | But it can do even more!
5 |
6 | You should check out their [official website](https://quarto.org/) to download the tool, install and play with numerous tutorials.
7 |
8 | To create the HTML report, you just run `quarto render report.ipynb`.
9 | Exporting to PDF requires specifying the output format, like so: `quarto render report.ipynb -t pdf`.
10 |
11 | If you want to work on the report interactively, I recommend checking out `quarto preview report.ipynb`.
12 |
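Output options can also be pinned inside the notebook itself, in a raw front-matter cell at the top. A minimal sketch using standard Quarto options (this exact header is illustrative, not taken from `report.ipynb`):

```yaml
---
title: "Report"
format:
  html:
    toc: true
  pdf: default
---
```

With several formats declared in the header, a plain `quarto render report.ipynb` should render all of them.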
--------------------------------------------------------------------------------
/05_SharingWork/02_quarto/report.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/05_SharingWork/02_quarto/report.docx
--------------------------------------------------------------------------------
/05_SharingWork/02_quarto/report.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/05_SharingWork/02_quarto/report.pdf
--------------------------------------------------------------------------------
/05_SharingWork/03_fastapi/README.md:
--------------------------------------------------------------------------------
1 | # FastAPI
2 |
3 | FastAPI is a brilliant library for creating REST APIs in Python.
4 | REST APIs are the main way different services communicate over the internet.
5 |
6 | If you don't know what a REST API is, go and run the examples!
7 | It will be much easier to understand afterwards.
8 |
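Once one of the example servers is running (by default `uvicorn` serves on `http://127.0.0.1:8000`), any HTTP client can talk to it. As a minimal sketch, here is how a request URL for the Boston endpoint could be assembled with the standard library (the feature values below are made up):

```python
from urllib.parse import urlencode

# Made-up feature values for the /boston_prediction endpoint
features = {
    "CRIM": 0.1, "ZN": 0.0, "INDUS": 8.0, "CHAS": 0.0, "NOX": 0.5,
    "RM": 6.2, "AGE": 65.0, "DIS": 4.0, "RAD": 4.0, "TAX": 300.0,
    "PTRATIO": 18.0, "B": 390.0, "LSTAT": 12.0,
}

# The endpoint takes its inputs as query parameters,
# so the request is just a POST to this URL
url = "http://127.0.0.1:8000/boston_prediction?" + urlencode(features)
print(url)
```

Sending it, e.g. with `requests.post(url)`, returns a JSON body with a `"price"` key.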
--------------------------------------------------------------------------------
/05_SharingWork/03_fastapi/boston_model_prediction/README.md:
--------------------------------------------------------------------------------
1 | # Boston Model Prediction API
2 |
3 | The script `boston_api.py` is a proof-of-concept example of how one can deploy an `sklearn` model.
4 | Note that this implementation lacks several key parts, such as:
5 |
6 | 1. Data validation
7 | 2. Data preprocessing, e.g. filling `NA` values the same way as during training
8 | 3. Error handling
9 |
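Point 2 matters because a model trained on imputed data expects the same imputation at prediction time. A minimal sketch of what such a step could look like (the toy data and the median-imputation strategy here are illustrative, not the ones used in training):

```python
import pandas as pd

# Imputation values are computed once, on the training data (toy values here)
train = pd.DataFrame({"NOX": [0.4, None, 0.8], "RM": [6.0, 7.0, None]})
train_medians = train.median()

def preprocess(X: pd.DataFrame) -> pd.DataFrame:
    """Fill NAs with the *training* medians, mirroring the training pipeline."""
    return X.fillna(train_medians)

# An incoming request with a missing NOX value gets the training median
incoming = pd.DataFrame({"NOX": [None], "RM": [6.1]})
print(preprocess(incoming))
```

The key point is that `train_medians` is frozen at training time and reused for every request, never recomputed from incoming data.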
10 | Run with:
11 |
12 | ```
13 | uvicorn boston_api:app
14 | ```
15 |
--------------------------------------------------------------------------------
/05_SharingWork/03_fastapi/boston_model_prediction/boston_api.py:
--------------------------------------------------------------------------------
1 | from fastapi import FastAPI, Depends
2 | import pandas as pd
3 | import pickle
4 | from pydantic import BaseModel
5 | from starlette.responses import RedirectResponse
6 |
7 |
8 | app = FastAPI()
9 | state = {}  # filled at startup with the unpickled model
10 |
11 |
12 | @app.on_event("startup")
13 | async def load_model():
14 | with open("model_lgbm_regressor.pkl", "rb") as f:
15 | state["model"] = pickle.load(f)
16 |
17 |
18 | class BostonData(BaseModel):
19 | CRIM: float
20 | ZN: float
21 | INDUS: float
22 | CHAS: float
23 | NOX: float
24 | RM: float
25 | AGE: float
26 | DIS: float
27 | RAD: float
28 | TAX: float
29 | PTRATIO: float
30 | B: float
31 | LSTAT: float
32 |
33 |
34 | @app.get("/", include_in_schema=False)
35 | async def index():
36 | return RedirectResponse(url="/docs")
37 |
38 |
39 | @app.post("/boston_prediction")
40 | async def boston_prediction(boston_X: BostonData = Depends()):  # Depends() exposes the model's fields as query parameters
41 | X = pd.DataFrame(
42 | {
43 | "CRIM": [boston_X.CRIM],
44 | "ZN": [boston_X.ZN],
45 | "INDUS": [boston_X.INDUS],
46 | "CHAS": [boston_X.CHAS],
47 | "NOX": [boston_X.NOX],
48 | "RM": [boston_X.RM],
49 | "AGE": [boston_X.AGE],
50 | "DIS": [boston_X.DIS],
51 | "RAD": [boston_X.RAD],
52 | "TAX": [boston_X.TAX],
53 | "PTRATIO": [boston_X.PTRATIO],
54 | "B": [boston_X.B],
55 | "LSTAT": [boston_X.LSTAT],
56 | }
57 | )
58 |
59 | pred = state["model"].predict(X)
60 |
61 | return {"price": pred[0]}
62 |
--------------------------------------------------------------------------------
/05_SharingWork/03_fastapi/boston_model_prediction/model_lgbm_regressor.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/05_SharingWork/03_fastapi/boston_model_prediction/model_lgbm_regressor.pkl
--------------------------------------------------------------------------------
/05_SharingWork/03_fastapi/image_prediction/README.md:
--------------------------------------------------------------------------------
1 | # Image Prediction API
2 |
3 | The script `image_api.py` is a proof-of-concept example of an _image classification_ API.
4 | In reality it just returns the brightness of the provided image, but one can extend it any way they want.
5 |
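The "prediction" itself is just the mean of the grayscale pixel values. A self-contained sketch of that computation on a synthetic image (no HTTP involved; a real request goes through the endpoint instead):

```python
import numpy as np

# A synthetic 2x2 grayscale "image": two black and two white pixels
pixels = np.array([[0, 255], [255, 0]], dtype=np.uint8)

# Brightness as the endpoint computes it: mean pixel intensity
brightness = pixels.mean()
print(brightness)  # 127.5
```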
6 | Run with:
7 |
8 | ```
9 | uvicorn image_api:app
10 | ```
11 |
--------------------------------------------------------------------------------
/05_SharingWork/03_fastapi/image_prediction/image_api.py:
--------------------------------------------------------------------------------
1 | import io
2 |
3 | from fastapi import FastAPI, File, UploadFile
4 | import numpy as np
5 | from PIL import Image
6 | from starlette.responses import RedirectResponse
7 |
8 |
9 | app = FastAPI()
10 |
11 |
12 | def read_imagefile(file) -> Image.Image:
13 | image = Image.open(io.BytesIO(file))
14 | return image
15 |
16 |
17 | @app.get("/", include_in_schema=False)
18 | async def index():
19 | return RedirectResponse(url="/docs")
20 |
21 |
22 | @app.post("/image_brightness")
23 | async def image_brightness(file: UploadFile = File(...)):
24 | image = read_imagefile(await file.read()).convert("L")  # "L" = 8-bit grayscale
25 | x = np.array(image)
26 | return {"brightness": x.mean()}
27 |
--------------------------------------------------------------------------------
/05_SharingWork/README.md:
--------------------------------------------------------------------------------
1 | # Streamlit, Quarto, FastAPI
2 |
3 | We've reached the final lecture 🎉 of this course.
4 |
5 | We've learned plenty of things, but it's high time we showed our results to the world.
6 | To do so, I present three libraries that take different approaches.
7 |
8 | ### Streamlit
9 |
10 | A new, yet already feature-rich library for creating interactive dashboards from your analyses.
11 | A dashboard may present plots, data from dataframes, simulation results or predictions from an ML model.
12 | What is remarkable about `streamlit` is the ease with which you can create those dashboards.
13 | Check out the examples.
14 |
15 | ### Quarto
16 |
17 | The library that generated the `html` course materials from the `qmd` and `ipynb` files.
18 | You can also create technical $\\LaTeX$-like reports using `quarto`, as shown here.
19 | Technically speaking, it's the language-agnostic successor to R Markdown.
20 |
21 | ### FastAPI
22 |
23 | The results of our analyses won't always be presented to other people; sometimes they'll be consumed by another service or machine.
24 | If that's your use case, there is no easier-to-use library for creating REST APIs in Python than `fastapi`.
25 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Introduction to Data Science in Python by Appsilon
2 |
3 | ## Introduction
4 |
5 | Welcome to the course _Introduction to Data Science in Python by Appsilon_!
6 |
7 | ## Target audience
8 |
9 | This course aims to introduce people who know how to code in Python to the Data Science world.
10 | In particular, I show tips and tricks useful for STEM/economics students.
11 | A secondary goal is to show students how to use **free** tools that are at the same time **industry standards**, instead of Matlab/Statistica/SAS and so on.
12 |
13 | ## Covered topics
14 |
15 | 0. The course starts by introducing what a Data Scientist does in their work and why this job is so important in the 21st century. Then we start the technical part of the course.
16 | 1. `numpy` - numbers and vectors, fundamentals of all calculations in Python
17 | 2. `pandas` - data frames - SQL-like, in-memory data, fundamentals of data processing in Python
18 | 3. `matplotlib` and `plotly` - plots, basics of data visualization
19 | 4. `scikit-learn` - introduction to machine learning, examples from the go-to library in Python
20 | 5. `streamlit`, `quarto`, `fastapi` - simple, useful and creative ways to share your work in Python and to generate beautiful reports
21 |
22 | Apart from those libraries, I present and benchmark the `polars` library - a high-performance replacement for `pandas` when you work with datasets of 0.5GB - 5GB and `pandas` starts to be too slow.
23 |
24 | ## Course materials
25 |
26 | All course materials are located either here or on Google Drive.
27 | Code and small datasets are in the repo, while large datasets are located on Google Drive.
28 |
29 | I suggest using `html` files, generated from `qmd` and `ipynb` with `quarto`.
30 |
31 | A guide to setting up an environment is included in the introduction presentation.
32 |
33 | tl;dr You can try
34 | ```
35 | conda create -n ds-course python=3.10
36 | conda activate ds-course
37 | pip install -r requirements.txt
38 | ```
39 |
40 | ### Homeworks
41 |
42 | Each lecture also comes with a homework assignment.
43 | For every homework, a solution is provided in a separate directory.
44 | Note that the solutions are not necessarily the best possible, but they may present an interesting approach.
45 | Very often there are multiple ways to approach the same problem.
46 |
47 | ## License
48 |
49 | The course has been prepared by [Piotr Pasza Storożenko](https://pstorozenko.github.io/) from [Appsilon](http://appsilon.com/).
50 | It is available under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.
51 | Feel free to use these materials for your own purposes; you just have to attribute the original author.
52 |
53 | Some exercises have been inspired by exercises the author had to solve while studying.
54 |
--------------------------------------------------------------------------------
/data/.gitignore:
--------------------------------------------------------------------------------
1 | largedf
2 | travel
3 | wikipedia
--------------------------------------------------------------------------------
/data/flights/.gitignore:
--------------------------------------------------------------------------------
1 | airlines.csv
2 | airports.csv
3 | flights.csv
4 |
--------------------------------------------------------------------------------
/data/housing/.gitignore:
--------------------------------------------------------------------------------
1 | housing_answers.csv
2 |
--------------------------------------------------------------------------------
/data/housing/housing_example_submission.csv:
--------------------------------------------------------------------------------
1 | MEDV
2 | 0.3133704207
3 | 0.3133704207
4 | 0.3133704207
5 | 0.3133704207
6 | 0.3133704207
7 | 0.3133704207
8 | 0.3133704207
9 | 0.3133704207
10 | 0.3133704207
11 | 0.3133704207
12 | 0.3133704207
13 | 0.3133704207
14 | 0.3133704207
15 | 0.3133704207
16 | 0.3133704207
17 | 0.3133704207
18 | 0.3133704207
19 | 0.3133704207
20 | 0.3133704207
21 | 0.3133704207
22 | 0.3133704207
23 | 0.3133704207
24 | 0.3133704207
25 | 0.3133704207
26 | 0.3133704207
27 | 0.3133704207
28 | 0.3133704207
29 | 0.3133704207
30 | 0.3133704207
31 | 0.3133704207
32 | 0.3133704207
33 | 0.3133704207
34 | 0.3133704207
35 | 0.3133704207
36 | 0.3133704207
37 | 0.3133704207
38 | 0.3133704207
39 | 0.3133704207
40 | 0.3133704207
41 | 0.3133704207
42 | 0.3133704207
43 | 0.3133704207
44 | 0.3133704207
45 | 0.3133704207
46 | 0.3133704207
47 | 0.3133704207
48 | 0.3133704207
49 | 0.3133704207
50 | 0.3133704207
51 | 0.3133704207
52 | 0.3133704207
53 | 0.3133704207
54 | 0.3133704207
55 | 0.3133704207
56 | 0.3133704207
57 | 0.3133704207
58 | 0.3133704207
59 | 0.3133704207
60 | 0.3133704207
61 | 0.3133704207
62 | 0.3133704207
63 | 0.3133704207
64 | 0.3133704207
65 | 0.3133704207
66 | 0.3133704207
67 | 0.3133704207
68 | 0.3133704207
69 | 0.3133704207
70 | 0.3133704207
71 | 0.3133704207
72 | 0.3133704207
73 | 0.3133704207
74 | 0.3133704207
75 | 0.3133704207
76 | 0.3133704207
77 | 0.3133704207
78 |
--------------------------------------------------------------------------------
/data/housing/housing_train.csv:
--------------------------------------------------------------------------------
1 | CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
2 | 0.1396,0.0,8.56,0,,6.167,90.0,2.421,5,384.0,,392.69,,86.0660335294968
3 | 0.0351,95.0,2.68,0,,7.853,33.2,5.118,4,224.0,,392.78,,207.70005687846765
4 | 15.8744,0.0,18.1,0,,6.545,99.1,1.5192,24,666.0,,396.9,21.08,46.68934923020518
5 | 0.18337,0.0,27.74,0,,5.414,98.3,1.7554,4,711.0,,344.05,23.97,29.980669480245425
6 | 0.12816,12.5,6.07,0,,5.885,33.0,6.498,4,345.0,,396.9,,89.46127472267845
7 | 7.40389,0.0,18.1,0,,5.617,97.9,1.4547,24,666.0,,314.64,26.4,73.75821628799845
8 | 0.03548,80.0,3.64,0,,5.876,19.1,9.2203,1,315.0,,395.18,,89.5826886739847
9 | 11.5779,0.0,18.1,0,,5.036,97.0,1.77,24,666.0,,396.9,25.68,41.53875244545348
10 | 0.26169,0.0,9.9,0,,6.023,90.4,2.834,4,304.0,,396.3,,83.19418362437355
11 | 0.44791,0.0,6.2,1,,6.726,66.5,3.6519,8,307.0,,360.2,,124.27493023973318
12 | 4.81213,0.0,18.1,0,0.713,6.701,90.0,2.5975,24,666.0,20.2,255.23,,70.30934050876428
13 | 0.34109,0.0,7.38,0,,6.415,40.1,4.7211,5,287.0,19.6,396.9,,107.1012551997701
14 | 0.02875,28.0,15.04,0,,6.211,28.9,3.6659,4,270.0,18.2,396.33,,107.18347747973844
15 | 0.35233,0.0,21.89,0,,6.454,98.4,1.8498,4,437.0,21.2,394.08,,73.21645056736543
16 | 0.07022,0.0,4.05,0,,6.02,47.2,3.5549,5,296.0,16.6,393.23,,99.39861331287238
17 | 25.9406,0.0,18.1,0,,5.304,89.1,1.6475,24,666.0,20.2,127.36,26.64,44.59025248731203
18 | 1.19294,0.0,21.89,0,,6.326,97.7,2.271,4,437.0,21.2,396.9,,83.92318147539109
19 | 0.06162,0.0,4.39,0,,5.898,52.3,8.0136,3,352.0,18.8,364.61,,73.71711611183099
20 | 4.55587,0.0,18.1,0,0.718,3.561,87.9,1.6132,24,666.0,20.2,354.7,,117.87600081588016
21 | 0.59005,0.0,21.89,0,,6.372,97.9,2.3274,4,437.0,21.2,385.76,,98.61335073183163
22 | 9.2323,0.0,18.1,0,,6.216,100.0,1.1691,24,666.0,20.2,366.15,,214.34410393293334
23 | 18.811,0.0,18.1,0,,4.628,100.0,1.5539,24,666.0,20.2,28.79,34.37,76.64051245269846
24 | 14.4208,0.0,18.1,0,0.74,6.461,93.3,2.0026,24,666.0,20.2,27.49,18.05,41.12452029160635
25 | 14.0507,0.0,18.1,0,,6.657,100.0,1.5275,24,666.0,20.2,35.05,21.22,73.67595467076562
26 | 0.05188,0.0,4.49,0,,6.015,45.1,4.4272,3,247.0,18.5,395.99,,96.49865391055593
27 | 0.09512,0.0,12.83,0,,6.286,45.0,4.5026,5,398.0,18.7,383.23,,91.6522176655636
28 | 15.0234,0.0,18.1,0,,5.304,97.3,2.1007,24,666.0,20.2,349.48,24.91,51.42387122560069
29 | 0.62739,0.0,8.14,0,,5.834,56.5,4.4986,4,307.0,21.0,395.62,,85.17266592604656
30 | 0.03466,35.0,6.06,0,,6.031,23.3,6.6407,1,304.0,16.9,362.25,,83.198449333261
31 | 7.05042,0.0,18.1,0,,6.103,85.1,2.0218,24,666.0,20.2,2.52,23.29,57.47604042733803
32 | 0.7258,0.0,8.14,0,,5.727,69.5,3.7965,4,307.0,21.0,390.95,,78.02623070337263
33 | 0.19186,0.0,7.38,0,,6.431,14.7,5.4159,5,287.0,19.6,393.68,,105.43385792635864
34 | 0.03961,0.0,5.19,0,,6.037,34.5,5.9853,5,224.0,20.2,396.9,,90.43432864980113
35 | 0.02055,85.0,0.74,0,,6.383,35.7,9.1876,2,313.0,17.3,396.9,,105.83732302619457
36 | 15.1772,0.0,18.1,0,0.74,6.152,100.0,1.9142,24,666.0,20.2,9.32,26.45,37.28371947578368
37 | 14.4383,0.0,18.1,0,,6.852,100.0,1.4655,24,666.0,20.2,179.36,19.78,117.8834171671725
38 | 0.03738,0.0,5.19,,,6.31,38.5,6.4584,5,224.0,20.2,389.4,,88.7178014151015
39 | 0.06888,0.0,2.46,0,,6.144,62.2,2.5979,3,193.0,17.8,396.9,,155.00853000996466
40 | 0.41238,0.0,6.2,0,,7.163,79.9,3.2157,8,307.0,17.4,372.08,,135.25380417327324
41 | 13.9134,0.0,18.1,0,0.713,6.208,95.0,2.2222,24,666.0,20.2,100.63,,50.147350321243344
42 | 0.06588,0.0,2.46,0,,7.765,83.3,2.741,3,193.0,17.8,395.56,,170.6408431307883
43 | 0.84054,0.0,8.14,0,,5.599,85.7,4.4546,4,307.0,21.0,303.42,,59.54643658953512
44 | 0.17331,0.0,9.69,0,,5.707,54.0,2.3817,6,391.0,19.2,396.9,,93.39654697869281
45 | 0.08244,30.0,4.93,0,,6.481,18.5,6.1899,6,300.0,16.6,379.41,,101.52079892017001
46 | 0.20608,22.0,5.86,0,,5.593,76.5,7.9549,7,330.0,19.1,372.49,,75.33524829516917
47 | 0.1403,22.0,5.86,0,,6.487,13.0,7.3967,7,330.0,19.1,396.28,,104.58245151797445
48 | 73.5341,0.0,18.1,0,,5.957,100.0,1.8026,24,666.0,20.2,16.45,20.62,37.67196238918527
49 | 0.15098,0.0,10.01,0,,6.021,82.6,2.7474,6,432.0,17.8,394.51,,82.21308151000076
50 | 0.1415,0.0,6.91,0,,6.169,6.6,5.7209,3,233.0,17.9,383.37,,108.31444570679487
51 | 0.35114,0.0,7.38,0,,6.041,49.9,4.7211,5,287.0,19.6,396.9,,87.40118173931863
52 | 0.0187,85.0,4.15,0,,6.516,27.7,8.5353,4,351.0,17.9,392.43,,99.0871227549707
53 | 0.09103,0.0,2.46,0,,7.155,92.2,2.7006,3,193.0,17.8,394.12,,162.5410185463129
54 | 3.53501,0.0,19.58,1,0.871,6.152,82.6,1.7455,5,403.0,14.7,88.01,,66.76816059885593
55 | 0.03578,20.0,3.33,0,,7.82,64.5,4.6947,5,216.0,14.9,387.31,,194.60858141073373
56 | 0.38735,0.0,25.65,0,,5.613,95.6,1.7572,2,188.0,19.1,359.29,27.26,67.24048484287769
57 | 0.06724,0.0,3.24,0,,6.333,17.2,5.2146,4,430.0,16.9,375.21,,96.87537985348989
58 | 1.35472,0.0,8.14,0,,6.072,100.0,4.175,4,307.0,21.0,376.73,,62.20243792468661
59 | 0.22212,0.0,10.01,0,,6.092,95.4,2.548,6,432.0,17.8,396.9,,80.03620946084678
60 | 2.33099,0.0,19.58,0,0.871,5.186,93.8,1.5296,5,403.0,14.7,356.99,28.32,76.28980246294269
61 | 6.44405,0.0,18.1,0,,6.425,74.8,2.2004,24,666.0,20.2,97.95,,68.9608693776895
62 | 0.03306,0.0,5.19,0,,6.059,37.3,4.8122,5,224.0,20.2,396.14,,88.2163428518819
63 | 0.01432,100.0,1.32,0,,6.816,40.5,8.3248,5,256.0,15.1,392.9,,135.56284210515494
64 | 0.01439,60.0,2.93,0,,6.604,18.8,6.2196,1,265.0,15.6,376.7,,124.77986816790845
65 | 0.75026,0.0,8.14,0,,5.924,94.1,4.3996,4,307.0,21.0,394.33,,66.78507694539394
66 | 0.7842,0.0,8.14,0,,5.99,81.7,4.2579,4,307.0,21.0,386.75,,74.9519755033796
67 | 0.06466,70.0,2.24,0,,6.345,20.1,7.8278,5,358.0,14.8,368.24,,96.42224261775675
68 | 0.04379,80.0,3.37,0,,5.787,31.1,6.6115,4,337.0,16.1,396.9,,83.1078869346196
69 | 0.37578,0.0,10.59,1,,5.404,88.6,3.665,4,277.0,18.6,395.24,23.98,82.79397982162877
70 | 41.5292,0.0,18.1,0,,5.531,85.4,1.6074,24,666.0,20.2,329.46,27.38,36.4476183815942
71 | 0.04294,28.0,15.04,0,,6.249,77.3,3.615,4,270.0,18.2,396.9,,88.27537962430918
72 | 1.41385,0.0,19.58,1,0.871,6.129,96.0,1.7494,5,403.0,14.7,321.02,,72.80292648744702
73 | 9.72418,0.0,18.1,0,0.74,6.406,97.2,2.0651,24,666.0,20.2,385.96,19.52,73.19050806675492
74 | 0.98843,0.0,8.14,0,,5.813,100.0,4.0952,4,307.0,21.0,394.54,19.88,62.066539598196954
75 | 0.52693,0.0,6.2,0,,8.725,83.0,2.8944,8,307.0,17.4,382.0,,214.05139620224634
76 | 5.58107,0.0,18.1,0,0.713,6.436,87.9,2.3158,24,666.0,20.2,100.19,,61.294119851011075
77 | 9.92485,0.0,18.1,0,0.74,6.251,96.6,2.198,24,666.0,20.2,388.52,,53.99633782178745
78 | 0.02985,0.0,2.18,0,,6.43,58.7,6.0622,3,,18.7,394.12,,122.86141195884612
79 | 0.13158,0.0,10.01,0,,6.176,72.5,2.7301,6,432.0,17.8,393.3,,90.79084192610794
80 | 0.17142,0.0,6.91,0,,5.682,33.8,5.1004,3,233.0,17.9,396.9,,82.61108926392335
81 | 1.05393,0.0,8.14,0,,5.935,29.3,4.4986,4,307.0,21.0,386.85,,98.97594878707021
82 | 15.5757,0.0,18.1,0,,5.926,71.0,2.9084,24,666.0,20.2,368.74,18.13,81.82059423120702
83 | 4.54192,0.0,18.1,0,0.77,6.398,88.0,2.5182,24,666.0,20.2,374.56,,107.00854694148481
84 | 0.03237,0.0,2.18,0,,6.998,45.8,6.0622,3,222.0,18.7,394.63,,142.95516813483192
85 | 67.9208,0.0,18.1,0,,5.683,100.0,1.4254,24,666.0,20.2,384.97,22.98,21.445933889849677
86 | 0.06047,0.0,2.46,0,,6.153,68.8,3.2797,3,193.0,17.8,387.11,,126.82757360819036
87 | 0.14932,25.0,5.13,0,,5.741,66.2,7.2254,8,284.0,19.7,395.11,,80.15248929060887
88 | 0.10793,0.0,8.56,0,,6.195,54.4,2.7778,5,384.0,20.9,393.49,,93.05392848107756
89 | 0.18159,0.0,7.38,,,6.376,54.3,4.5404,5,287.0,19.6,396.9,,98.88396967360913
90 | 0.76162,20.0,3.97,0,,5.56,62.8,1.9865,5,264.0,13.0,392.4,,97.62139188166562
91 | 1.00245,0.0,8.14,0,,6.674,87.3,4.239,4,307.0,21.0,380.23,,89.93438635683638
92 | 0.52014,20.0,3.97,0,,8.398,91.5,2.2885,5,264.0,13.0,386.86,,209.2465102860684
93 | 10.233,0.0,18.1,0,,6.185,96.7,2.1705,24,666.0,20.2,379.7,18.03,62.57281607977637
94 | 0.67191,0.0,8.14,0,,5.813,90.3,4.682,4,307.0,21.0,376.88,,71.0686566345113
95 | 0.14455,12.5,7.87,0,,6.172,96.1,5.9505,5,311.0,15.2,396.9,19.15,116.11802073957813
96 | 0.11132,0.0,27.74,0,,5.983,83.5,2.1099,4,711.0,20.1,396.9,,86.21246825032345
97 | 0.12802,0.0,8.56,0,,6.474,97.1,2.4329,5,384.0,20.9,395.24,,84.91805615308519
98 | 0.08014,0.0,5.96,0,,5.85,41.5,3.9342,5,279.0,19.2,396.9,,89.89364714599141
99 | 1.22358,0.0,19.58,0,,6.943,97.4,1.8773,5,403.0,14.7,363.43,,176.9575067791254
100 | 3.56868,0.0,18.1,0,,6.437,75.0,2.8965,24,666.0,20.2,393.37,,99.33458893299641
101 | 0.13058,0.0,10.01,0,,5.872,73.1,2.4775,6,432.0,17.8,338.63,,87.32915603779146
102 | 0.14231,0.0,10.01,0,,6.254,84.2,2.2565,6,432.0,17.8,388.74,,79.28289362098235
103 | 0.06664,0.0,4.05,0,,6.546,33.1,3.1323,5,296.0,16.6,390.96,,126.05546765906081
104 | 0.08664,45.0,3.44,0,,7.178,26.3,6.4798,5,,15.2,390.49,,155.9850489993171
105 | 0.1146,20.0,6.96,0,,6.538,58.7,3.9175,3,223.0,18.6,394.96,,104.51228025482678
106 | 2.77974,0.0,19.58,0,0.871,4.903,97.8,1.3459,5,403.0,14.7,396.9,29.29,50.62083750798458
107 | 11.1081,0.0,18.1,0,,4.906,100.0,1.1742,24,666.0,20.2,396.9,34.77,59.13022579725972
108 | 7.99248,0.0,18.1,0,,5.52,100.0,1.5331,24,666.0,20.2,396.9,24.56,52.65946620595223
109 | 8.98296,0.0,18.1,1,0.77,6.212,97.4,2.1222,24,666.0,20.2,377.73,,76.26345361907924
110 | 0.06127,40.0,6.41,1,,6.826,27.6,4.8628,4,254.0,17.6,393.45,,141.68245200870115
111 | 0.35809,0.0,6.2,1,,6.951,88.5,2.8617,8,307.0,17.4,391.7,,114.53035718635633
112 | 6.71772,0.0,18.1,0,0.713,6.749,92.6,2.3236,24,666.0,20.2,0.32,,57.474507693137504
113 | 1.62864,0.0,21.89,0,,5.019,100.0,1.4394,4,437.0,21.2,396.9,34.41,61.640340546320466
114 | 5.66998,0.0,18.1,1,,6.683,96.8,1.3567,24,666.0,20.2,375.33,,214.051654478562
115 | 0.05789,12.5,6.07,0,,5.878,21.4,6.498,4,,18.9,396.21,,94.22005106246839
116 | 3.83684,0.0,18.1,0,0.77,6.251,91.1,2.2955,24,666.0,20.2,350.65,,85.36835190309107
117 | 2.3004,0.0,19.58,,,6.319,96.1,2.1,5,403.0,14.7,297.09,,102.07059692876268
118 | 0.17783,0.0,9.69,,,5.569,73.5,2.3999,6,391.0,19.2,395.77,,74.90045220259788
119 | 13.3598,0.0,18.1,0,,5.887,94.7,1.7821,24,666.0,20.2,396.9,,54.38842392864051
120 | 25.0461,0.0,18.1,0,,5.987,100.0,1.5888,24,666.0,20.2,396.9,26.77,24.014529122715135
121 | 0.02187,60.0,2.93,0,,6.8,9.9,6.2196,1,265.0,15.6,393.37,,133.11355504856078
122 | 0.19073,22.0,5.86,0,,6.718,17.5,7.8265,7,330.0,19.1,393.74,,112.29183147899278
123 | 0.26363,0.0,8.56,0,,6.229,91.2,2.5451,5,384.0,20.9,391.23,,83.03299816312641
124 | 11.0874,0.0,18.1,0,0.718,6.411,100.0,1.8589,24,666.0,20.2,318.75,,71.49422130850488
125 | 2.37934,0.0,19.58,0,0.871,6.13,100.0,1.4191,5,403.0,14.7,172.91,27.8,59.15729521149178
126 | 0.04203,28.0,15.04,0,,6.442,53.6,3.6659,4,270.0,18.2,395.01,,98.2099688695904
127 | 1.12658,0.0,19.58,1,0.871,5.012,88.0,1.6102,5,403.0,14.7,343.28,,65.54089052997352
128 | 0.62356,0.0,6.2,1,,6.879,77.7,3.2721,8,307.0,17.4,390.39,,117.89917845973477
129 | 0.05515,33.0,2.18,,,7.236,41.1,4.022,7,222.0,18.4,393.68,,154.72377820705736
130 | 0.03551,25.0,4.86,0,,6.167,46.7,5.4007,4,281.0,19.0,390.64,,98.21683590419168
131 | 0.16439,22.0,5.86,0,,6.433,49.1,7.8265,7,330.0,19.1,374.71,,105.07957421915785
132 | 2.924,0.0,19.58,0,,6.101,93.0,2.2834,5,403.0,14.7,240.16,,107.19320569500421
133 | 1.51902,0.0,19.58,1,,8.375,93.9,2.162,5,403.0,14.7,388.45,,214.14918721549503
134 | 0.0315,95.0,1.47,0,,6.975,15.3,7.6534,3,402.0,17.0,396.9,,149.53327195447721
135 | 0.46296,0.0,6.2,0,,7.412,76.9,3.6715,8,307.0,17.4,376.14,,135.92193010124637
136 | 0.07896,0.0,12.83,0,,6.273,6.0,4.2515,5,398.0,18.7,394.92,,103.32518624974017
137 | 0.79041,0.0,9.9,0,,6.122,52.8,2.6403,4,304.0,18.4,396.9,,94.69094463035148
138 | 4.75237,0.0,18.1,0,0.713,6.525,86.5,2.4358,24,666.0,20.2,50.92,18.13,60.38269582189052
139 | 0.36894,22.0,5.86,0,,8.259,8.4,8.9067,7,330.0,19.1,396.9,,183.58426637694816
140 | 0.14476,0.0,10.01,0,,5.731,65.2,2.7592,6,432.0,17.8,391.5,,82.66700168026514
141 | 0.00906,90.0,2.97,0,,7.088,20.8,7.3073,1,285.0,15.3,394.72,,137.84225152963592
142 | 0.09266,34.0,6.09,0,,6.495,18.4,5.4917,7,329.0,16.1,383.61,,113.09641600484015
143 | 2.81838,0.0,18.1,0,,5.762,40.3,4.0983,24,666.0,20.2,392.92,,93.48345193539986
144 | 3.8497,0.0,18.1,1,0.77,6.395,91.0,2.5052,24,666.0,20.2,391.34,,93.06353491055529
145 | 24.8017,0.0,18.1,0,,5.349,96.0,1.7028,24,666.0,20.2,396.9,19.77,35.53338811498336
146 | 0.29819,0.0,6.2,0,,7.686,17.0,3.3751,8,307.0,17.4,377.51,,200.10126043018462
147 | 0.53412,20.0,3.97,0,,7.52,89.4,2.1398,5,,13.0,388.37,,184.72551209679438
148 | 0.51183,0.0,6.2,0,,7.358,71.6,4.148,8,307.0,17.4,390.07,,135.0797219247725
149 | 24.3938,0.0,18.1,0,,4.652,100.0,1.4672,24,666.0,20.2,396.9,28.28,45.03868107815232
150 | 4.87141,0.0,18.1,0,,6.484,93.6,2.3053,24,666.0,20.2,396.21,18.68,71.63124418152948
151 | 0.09744,0.0,5.96,0,,5.841,61.4,3.3779,5,279.0,19.2,377.56,,85.74867056457873
152 | 0.04011,80.0,1.52,0,,7.287,34.1,7.309,2,329.0,12.6,396.9,,142.82764139964164
153 | 0.54452,0.0,21.89,,,6.151,97.9,1.6687,4,437.0,21.2,396.9,18.46,76.31012581681928
154 | 4.89822,0.0,18.1,0,,4.97,100.0,1.3325,24,666.0,20.2,375.52,,214.04427779211161
155 | 0.19657,22.0,5.86,0,,6.226,79.2,8.0555,7,330.0,19.1,376.14,,87.76546509317926
156 | 0.03871,52.5,5.32,0,,6.209,31.3,7.3172,6,293.0,16.6,396.9,,99.3930687509709
157 | 23.6482,0.0,18.1,0,,6.38,96.2,1.3861,24,666.0,20.2,396.9,23.69,56.14399470209011
158 | 0.10328,25.0,5.13,,,5.927,47.2,6.932,8,284.0,19.7,396.9,,83.97882262665027
159 | 0.10084,0.0,10.01,0,,6.715,81.6,2.6775,6,432.0,17.8,395.59,,97.63612653682769
160 | 0.05302,0.0,3.41,0,,7.079,63.1,3.4145,2,270.0,17.8,396.06,,122.99996853633093
161 | 0.7857,20.0,3.97,0,,7.014,84.6,2.1329,5,264.0,13.0,384.07,,131.5306524154623
162 | 0.08829,12.5,7.87,0,,6.012,66.6,5.5605,5,311.0,15.2,395.6,,98.02060966106423
163 | 3.47428,0.0,18.1,1,0.718,8.78,82.9,1.9047,24,666.0,20.2,354.55,,93.7433216073559
164 | 0.06076,0.0,11.93,0,,6.976,91.0,2.1675,1,273.0,21.0,396.9,,102.48170478024093
165 | 0.01301,35.0,1.52,0,,7.241,49.3,7.0379,1,284.0,15.5,394.74,,140.06420082169748
166 | 1.34284,0.0,19.58,0,,6.066,100.0,1.7573,5,403.0,14.7,353.89,,104.23346578329476
167 | 1.6566,0.0,19.58,0,0.871,6.122,97.3,1.618,5,403.0,14.7,372.8,,92.03410962108342
168 | 0.05425,0.0,4.05,0,,6.315,73.4,3.3175,5,296.0,16.6,395.6,,105.3647401896618
169 | 7.67202,0.0,18.1,0,,5.747,98.9,1.6334,24,666.0,20.2,393.1,19.92,36.41181522520689
170 | 0.08308,0.0,2.46,0,,5.604,89.8,2.9879,3,193.0,17.8,391.0,,113.14056607559466
171 | 0.40202,0.0,9.9,,,6.382,67.2,3.5325,4,304.0,18.4,395.21,,98.95412803519065
172 | 0.22489,12.5,7.87,0,,6.377,94.3,6.3467,5,311.0,15.2,392.52,20.45,64.25974310759635
173 | 20.0849,0.0,18.1,0,,4.368,91.2,1.4395,24,666.0,20.2,285.83,30.63,37.6756788895361
174 | 0.21161,0.0,8.56,0,,6.137,87.4,2.7147,5,384.0,20.9,394.47,,82.73890521244164
175 | 0.04462,25.0,4.86,0,,6.619,70.4,5.4007,4,281.0,19.0,395.63,,102.5219471514495
176 | 0.17505,0.0,5.96,0,,5.966,30.2,3.8473,5,279.0,19.2,393.43,,105.8151903183971
177 | 0.24522,0.0,9.9,0,,5.782,71.7,4.0317,4,304.0,18.4,396.9,,84.86924302557883
178 | 1.80028,0.0,19.58,0,,5.877,79.2,2.4259,5,403.0,14.7,227.61,,101.89525776008544
179 | 6.39312,0.0,18.1,0,,6.162,97.4,2.206,24,666.0,20.2,302.76,24.1,57.01312277260327
180 | 0.05561,70.0,2.24,0,,7.041,10.0,7.8278,5,358.0,14.8,371.58,,124.26905880166453
181 | 0.05372,0.0,13.92,0,,6.549,51.0,5.9604,4,289.0,16.0,392.85,,116.1960531736661
182 | 0.03768,80.0,1.52,0,,7.274,38.3,7.309,2,329.0,12.6,392.2,,148.15216877063816
183 | 9.82349,0.0,18.1,0,,6.794,98.8,1.358,24,666.0,20.2,396.9,21.24,57.05342200049336
184 | 2.15505,0.0,19.58,0,0.871,5.628,100.0,1.5166,5,403.0,14.7,169.27,,66.8375694958936
185 | 5.87205,0.0,18.1,0,,6.405,96.0,1.6768,24,666.0,20.2,396.9,19.37,53.62232891092506
186 | 2.36862,0.0,19.58,0,0.871,4.926,95.7,1.4608,5,403.0,14.7,391.71,29.53,62.58254575193668
187 | 7.36711,0.0,18.1,0,,6.193,78.1,1.9356,24,666.0,20.2,96.73,21.52,47.17557906880946
188 | 0.04297,52.5,5.32,0,,6.565,22.9,7.3172,6,293.0,16.6,371.72,,106.3524321967567
189 | 0.15038,0.0,25.65,,,5.856,97.0,1.9444,2,188.0,19.1,370.31,25.41,74.04556952600493
190 | 0.20746,0.0,27.74,0,,5.093,98.0,1.8226,4,711.0,20.1,318.43,29.68,34.732265987909976
191 | 0.11504,0.0,2.89,,,6.163,69.6,3.4952,2,276.0,18.0,391.83,,91.64449110390264
192 | 4.0974,0.0,19.58,0,0.871,5.468,100.0,1.4118,5,403.0,14.7,396.9,26.42,66.90097668683498
193 | 0.09252,30.0,4.93,0,,6.606,42.2,6.1899,6,300.0,16.6,383.78,,99.79904806561713
194 | 0.09604,40.0,6.41,0,,6.854,42.8,4.2673,4,254.0,17.6,396.9,,137.2389864892714
195 | 0.12083,0.0,2.89,0,,8.069,76.0,3.4952,2,276.0,18.0,396.9,,165.69689083530488
196 | 0.01709,90.0,2.02,0,,6.728,36.1,12.1265,5,,17.0,384.46,,128.98858708951522
197 | 0.09299,0.0,25.65,0,,5.961,92.9,2.0869,2,188.0,19.1,378.09,17.93,87.8172225492073
198 | 0.10008,0.0,2.46,0,,6.563,95.6,2.847,3,193.0,17.8,396.9,,139.33235031306575
199 | 0.02177,82.5,2.03,0,,7.61,15.7,6.27,2,,14.7,395.38,,181.30089062761513
200 | 0.33983,22.0,5.86,0,,6.108,34.9,8.0555,7,330.0,19.1,390.18,,104.23522664376836
201 | 2.37857,0.0,18.1,0,,5.871,41.9,3.724,24,666.0,20.2,370.73,,88.27563109434632
202 | 0.03537,34.0,6.09,0,,6.59,40.4,5.4917,7,329.0,16.1,395.75,,94.16646677806642
203 | 0.04301,80.0,1.91,0,,5.663,21.9,10.5857,4,334.0,22.0,382.8,,78.0382858400318
204 | 51.1358,0.0,18.1,0,,5.757,100.0,1.413,24,666.0,20.2,2.6,,64.28657778550676
205 | 9.91655,0.0,18.1,0,,5.852,77.8,1.5004,24,666.0,20.2,338.16,29.97,26.979381279296607
206 | 0.01965,80.0,1.76,0,,6.23,31.5,9.0892,1,241.0,18.2,341.6,,86.07233278610938
207 | 0.16902,0.0,25.65,0,,5.986,88.4,1.9929,2,188.0,19.1,385.02,,91.61948802567194
208 | 0.05479,33.0,2.18,0,,6.616,58.1,3.37,7,222.0,18.4,393.36,,121.777199530868
209 | 0.6147,0.0,6.2,0,,6.618,80.8,3.2721,8,307.0,17.4,396.9,,129.0534976533695
210 | 12.0482,0.0,18.1,0,,5.648,87.6,1.9512,24,666.0,20.2,291.55,,89.18315969982761
211 | 0.11425,0.0,13.89,1,,6.373,92.4,3.3633,5,276.0,16.4,393.74,,98.61412363155058
212 | 0.88125,0.0,21.89,0,,5.637,94.7,1.9799,4,437.0,21.2,396.9,18.34,61.324137374892
213 | 8.79212,0.0,18.1,0,,5.565,70.6,2.0635,24,666.0,20.2,3.65,,50.127266164530205
214 | 0.07886,80.0,4.95,0,,7.148,27.7,5.1167,4,245.0,19.2,396.9,,160.00237114570464
215 | 0.05023,35.0,6.06,0,,5.706,28.4,6.6407,1,304.0,16.9,394.02,,73.25623685678946
216 | 88.9762,0.0,18.1,0,,6.968,91.9,1.4165,24,666.0,20.2,396.9,,44.56625992485017
217 | 5.82401,0.0,18.1,0,,6.242,64.7,3.4242,24,666.0,20.2,396.9,,98.605056230759
218 | 5.20177,0.0,18.1,1,0.77,6.127,83.4,2.7227,24,666.0,20.2,395.43,,97.2826978366578
219 | 0.14103,0.0,13.92,0,,5.79,58.0,6.32,4,289.0,16.0,396.9,,86.98648351102429
220 | 0.08199,0.0,13.92,0,,6.009,42.3,5.5027,4,289.0,16.0,396.9,,93.01244196555069
221 | 6.53876,0.0,18.1,1,,7.016,97.5,1.2024,24,666.0,20.2,392.05,,214.2923353502213
222 | 13.6781,0.0,18.1,0,0.74,5.935,87.9,1.8206,24,666.0,20.2,68.95,34.02,36.00210477989878
223 | 0.12329,0.0,10.01,0,,5.913,92.9,2.3534,6,432.0,17.8,394.95,,80.63483584614265
224 | 0.0578,0.0,2.46,0,,6.98,58.4,2.829,3,193.0,17.8,396.9,,159.48548523421246
225 | 2.63548,0.0,9.9,0,,4.973,37.8,2.5194,4,304.0,18.4,350.45,,68.96979939145074
226 | 0.02498,0.0,1.89,,,6.54,59.7,6.2669,1,422.0,15.9,389.96,,70.69404829968218
227 | 0.05083,0.0,5.19,0,,6.316,38.1,6.4584,5,224.0,20.2,389.71,,95.01800474971677
228 | 4.83567,0.0,18.1,0,,5.905,53.2,3.1523,24,666.0,20.2,388.22,,88.33275490565735
229 | 8.20058,0.0,18.1,0,0.713,5.936,80.3,2.7792,24,666.0,20.2,3.5,,57.83323931871548
230 | 0.33147,0.0,6.2,0,,8.247,70.4,3.6519,8,307.0,17.4,378.95,,206.7589770594467
231 | 0.3692,0.0,9.9,0,,6.567,87.3,3.6023,4,304.0,18.4,395.69,,102.04868804927443
232 | 2.24236,0.0,19.58,0,,5.854,91.8,2.422,5,403.0,14.7,395.11,,97.34666800638067
233 | 0.32264,0.0,21.89,,,5.942,93.5,1.9669,4,437.0,21.2,378.25,,74.6368298325566
234 | 0.04666,80.0,1.52,,,7.107,36.6,7.309,2,329.0,12.6,354.31,,129.97715591287496
235 | 0.66351,20.0,3.97,0,,7.333,100.0,1.8946,5,264.0,13.0,383.29,,154.25438344112666
236 | 0.57529,0.0,6.2,0,,8.337,73.3,3.8384,8,307.0,17.4,385.91,,178.86849603926493
237 | 0.17134,0.0,10.01,0,,5.928,88.2,2.4631,6,432.0,17.8,344.91,,78.46246435209265
238 | 0.06899,0.0,25.65,0,,5.87,69.7,2.2577,2,188.0,19.1,389.15,,94.28316574870794
239 | 0.07244,60.0,1.69,0,,5.884,18.5,10.7103,4,411.0,18.3,392.33,,79.67435216186544
240 | 0.31533,0.0,6.2,0,,8.266,78.3,2.8944,8,307.0,17.4,385.05,,191.80937308916776
241 | 20.7162,0.0,18.1,0,,4.138,100.0,1.1781,24,666.0,20.2,370.22,23.34,51.003472535563006
242 | 0.06151,0.0,5.19,0,,5.968,58.5,4.8122,5,224.0,20.2,396.9,,80.1186585578974
243 | 0.25915,0.0,21.89,0,,5.693,96.0,1.7883,4,437.0,21.2,392.11,,69.3389852904356
244 | 0.01096,55.0,2.25,0,,6.453,31.9,7.3073,1,300.0,15.3,394.72,,94.24871749944725
245 | 18.0846,0.0,18.1,0,,6.434,100.0,1.8347,24,666.0,20.2,27.25,29.05,30.878383584061993
246 | 0.13117,0.0,8.56,,,6.127,85.2,2.1224,5,384.0,20.9,387.69,,87.40695588178909
247 | 18.4982,0.0,18.1,0,,4.138,100.0,1.137,24,666.0,20.2,396.9,37.97,59.15866354756566
248 | 7.52601,0.0,18.1,0,0.713,6.417,98.3,2.185,24,666.0,20.2,304.21,19.31,55.75013175697844
249 | 0.32982,0.0,21.89,0,,5.822,95.4,2.4699,4,437.0,21.2,388.69,,78.89714048466422
250 | 13.5222,0.0,18.1,0,,3.863,100.0,1.5106,24,666.0,20.2,131.42,,98.9095824110562
251 | 0.12269,0.0,6.91,0,,6.069,40.0,5.7209,3,233.0,17.9,389.39,,90.81742720963614
252 | 0.17899,0.0,9.69,0,,5.67,28.8,2.7986,6,391.0,19.2,393.29,,99.0564379366495
253 | 0.03584,80.0,3.37,0,,6.29,17.8,6.6115,4,337.0,16.1,396.9,,100.68749654297682
254 | 0.01501,90.0,1.21,1,,7.923,24.8,5.885,1,198.0,13.6,395.52,,214.24663842339191
255 | 0.05735,0.0,4.49,0,,6.63,56.1,4.4377,3,247.0,18.5,392.3,,113.86436352897965
256 | 0.1029,30.0,4.93,0,,6.358,52.9,7.0355,6,300.0,16.6,372.75,,95.16988905486264
257 | 0.05602,0.0,2.46,0,,7.831,53.6,3.1992,3,193.0,17.8,392.63,,214.271616402895
258 | 15.8603,0.0,18.1,0,,5.896,95.4,1.9096,24,666.0,20.2,7.68,24.39,35.598814178915845
259 | 1.42502,0.0,19.58,0,0.871,6.51,100.0,1.7659,5,,14.7,364.31,,99.74781111338865
260 | 0.09378,12.5,7.87,0,,5.889,39.0,5.4509,5,311.0,15.2,390.5,,92.94899239065076
261 | 0.06417,0.0,5.96,0,,5.933,68.2,3.3603,5,279.0,19.2,396.9,,81.07895594795473
262 | 0.77299,0.0,8.14,0,,6.495,94.4,4.4547,4,307.0,21.0,387.94,,78.89883003527437
263 | 1.20742,0.0,19.58,0,,5.875,94.6,2.4259,5,403.0,14.7,292.29,,74.47744660259207
264 | 3.32105,0.0,19.58,1,0.871,5.403,100.0,1.3216,5,403.0,14.7,396.9,26.82,57.362009051979555
265 | 9.59571,0.0,18.1,0,,6.404,100.0,1.639,24,666.0,20.2,376.11,20.31,51.880990843107384
266 | 0.02899,40.0,1.25,0,,6.939,34.5,8.7921,1,335.0,19.7,389.85,,113.94356069894572
267 | 0.40771,0.0,6.2,1,,6.164,91.3,3.048,8,307.0,17.4,395.24,21.46,93.07299664769766
268 | 0.12204,0.0,2.89,0,,6.625,57.8,3.4952,2,276.0,18.0,357.98,,121.79855766835065
269 | 0.04337,21.0,5.64,0,,6.115,63.0,6.8147,4,,16.8,393.97,,87.84624337516867
270 | 0.11329,30.0,4.93,0,,6.897,54.3,6.3361,6,300.0,16.6,391.25,,94.37504260498926
271 | 15.288,0.0,18.1,0,,6.649,93.3,1.3449,24,666.0,20.2,363.02,23.24,59.53691516133277
272 | 9.18702,0.0,18.1,0,,5.536,100.0,1.5804,24,666.0,20.2,396.9,23.6,48.42414368373738
273 | 0.06642,0.0,4.05,0,,6.86,74.4,2.9153,5,296.0,16.6,391.27,,128.067902079391
274 | 0.12744,0.0,6.91,0,,6.77,2.9,5.7209,3,233.0,17.9,385.41,,114.06895090972169
275 | 22.0511,0.0,18.1,0,0.74,5.818,92.4,1.8662,24,666.0,20.2,391.45,22.11,44.98666404253888
276 | 5.29305,0.0,18.1,0,,6.051,82.5,2.1678,24,666.0,20.2,378.38,18.76,99.33223001199178
277 | 0.22969,0.0,10.59,,,6.326,52.5,4.3549,4,277.0,18.6,394.87,,104.4712414144551
278 | 0.06129,20.0,3.33,1,,7.645,49.7,5.2119,5,216.0,14.9,377.07,,197.1295401724309
279 | 0.04819,80.0,3.64,0,,6.108,32.0,9.2203,1,315.0,16.4,392.89,,93.9394697513162
280 | 10.8342,0.0,18.1,0,,6.782,90.8,1.8195,24,666.0,20.2,21.57,25.79,32.1745002703019
281 | 0.06905,0.0,2.18,0,,7.147,54.2,6.0622,3,222.0,18.7,396.9,,154.95507154862378
282 | 0.01538,90.0,3.75,0,,7.454,34.2,6.3361,3,244.0,15.9,386.34,,188.66470599941434
283 | 8.24809,0.0,18.1,0,0.713,7.393,99.3,2.4527,24,666.0,20.2,375.87,,76.21052055036272
284 | 0.14866,0.0,8.56,0,,6.727,79.9,2.7778,5,384.0,20.9,394.76,,117.76563975860779
285 | 0.38214,0.0,6.2,0,,8.04,86.5,3.2157,8,307.0,17.4,387.38,,161.14266009269343
286 | 10.0623,0.0,18.1,0,,6.833,94.3,2.0882,24,666.0,20.2,81.33,19.69,60.43951952689001
287 | 0.14052,0.0,10.59,0,,6.375,32.3,3.9454,4,277.0,18.6,385.81,,120.39855220565468
288 | 12.2472,0.0,18.1,0,,5.837,59.7,1.9976,24,666.0,20.2,24.65,,43.745544898222214
289 | 2.3139,0.0,19.58,0,,5.88,97.3,2.3887,5,403.0,14.7,348.13,,81.88767258279199
290 | 0.08187,0.0,2.89,0,,7.82,36.9,3.4952,2,276.0,18.0,393.53,,187.5342552461346
291 | 0.03615,80.0,4.95,0,,6.63,23.4,5.1167,4,245.0,19.2,396.9,,119.528038155155
292 | 0.19802,0.0,10.59,0,,6.182,42.4,3.9454,4,277.0,18.6,393.63,,107.09888627097
293 | 0.17171,25.0,5.13,0,,5.966,93.4,6.8185,8,284.0,19.7,378.08,,68.59965945665476
294 | 0.22927,0.0,6.91,0,,6.03,85.5,5.6894,3,233.0,17.9,392.74,18.8,71.18515014550181
295 | 1.38799,0.0,8.14,0,,5.95,82.0,3.99,4,307.0,21.0,232.6,27.71,56.61271426511336
296 | 0.57834,20.0,3.97,0,,8.297,67.0,2.4216,5,264.0,13.0,384.54,,214.36241327993707
297 | 0.24103,0.0,7.38,0,,6.083,43.7,5.4159,5,287.0,19.6,396.9,,95.16310192393985
298 | 0.01778,95.0,1.47,0,,7.135,13.9,7.6534,3,402.0,17.0,384.3,,140.81240527376974
299 | 5.44114,0.0,18.1,0,0.713,6.655,98.2,2.3552,24,666.0,20.2,355.29,,65.17969247582458
300 | 0.95577,0.0,8.14,0,,6.047,88.8,4.4534,4,307.0,21.0,306.38,,63.4855495232388
301 | 8.64476,0.0,18.1,0,,6.193,92.6,1.7912,24,666.0,20.2,396.9,,59.19694186292922
302 | 0.537,0.0,6.2,0,,5.981,68.1,3.6715,8,307.0,17.4,378.35,,104.0880820259377
303 | 0.54011,20.0,3.97,,,7.203,81.8,2.1121,5,264.0,13.0,392.8,,144.85091449097607
304 | 0.0459,52.5,5.32,0,,6.315,45.6,7.3172,6,293.0,16.6,396.9,,95.48939795651633
305 | 1.83377,0.0,19.58,1,,7.802,98.2,2.0407,5,,14.7,389.61,,214.41296038289997
306 | 9.33889,0.0,18.1,0,,6.38,95.6,1.9682,24,666.0,20.2,60.72,24.08,40.695688863928765
307 | 0.2498,0.0,21.89,0,,5.857,98.2,1.6686,4,437.0,21.2,392.04,21.32,57.02654091090679
308 | 0.11027,25.0,5.13,0,,6.456,67.8,7.2255,8,284.0,19.7,396.9,,95.02374006557191
309 | 0.55778,0.0,21.89,0,,6.335,98.2,2.1107,4,437.0,21.2,394.67,,77.60752459364593
310 | 0.32543,0.0,21.89,0,,6.431,98.8,1.8125,4,437.0,21.2,396.9,,77.11780334110067
311 | 5.73116,0.0,18.1,0,,7.061,77.0,3.4106,24,666.0,20.2,395.28,,107.24418130966565
312 | 0.21124,12.5,7.87,0,,5.631,100.0,6.0821,5,311.0,15.2,386.63,29.93,70.71582493919252
313 | 0.30347,0.0,7.38,0,,6.312,28.9,5.4159,5,287.0,19.6,396.9,,98.57326542684902
314 | 13.0751,0.0,18.1,0,,5.713,56.7,2.8237,24,666.0,20.2,396.9,,86.06762797944116
315 | 0.01951,17.5,1.38,0,,7.104,59.5,9.2229,3,216.0,18.6,393.24,,141.52384184941792
316 | 0.04417,70.0,2.24,0,,6.871,47.4,7.8278,5,358.0,14.8,390.86,,106.28983847776634
317 | 0.63796,0.0,8.14,0,,6.096,84.5,4.4619,4,307.0,21.0,380.02,,78.02927346890762
318 | 2.44668,0.0,19.58,0,0.871,5.272,94.0,1.7364,5,403.0,14.7,88.63,,56.08882091510925
319 | 0.03359,75.0,2.95,0,,7.024,15.8,5.4011,3,252.0,18.3,395.62,,149.45246864933594
320 | 17.8667,0.0,18.1,0,,6.223,100.0,1.3861,24,666.0,20.2,393.74,21.78,43.68193090714954
321 | 3.1636,0.0,18.1,0,,5.759,48.2,3.0665,24,666.0,20.2,334.4,,85.18618477374098
322 | 11.9511,0.0,18.1,0,,5.608,100.0,1.2852,24,666.0,20.2,332.09,,119.67450422441831
323 | 0.0456,0.0,13.89,1,,5.888,56.0,3.1121,5,276.0,16.4,392.8,,99.95110506548588
324 | 0.21038,20.0,3.33,0,,6.812,32.2,4.1007,5,216.0,14.9,396.9,,150.23532225989135
325 | 9.39063,0.0,18.1,0,0.74,5.627,93.9,1.8172,24,666.0,20.2,396.9,22.88,54.85384792172743
326 | 0.10959,0.0,11.93,0,,6.794,89.3,2.3889,1,273.0,21.0,393.45,,94.30225330095304
327 | 0.03041,0.0,5.19,,,5.895,59.6,5.615,5,,20.2,394.81,,79.22385015356505
328 | 0.52058,0.0,6.2,1,,6.631,76.5,4.148,8,307.0,17.4,388.45,,107.44870964214456
329 | 0.25199,0.0,10.59,0,,5.783,72.7,4.3549,4,277.0,18.6,389.43,18.06,96.44659188003051
330 | 0.21719,0.0,10.59,1,,5.807,53.8,3.6526,4,277.0,18.6,390.94,,95.90550606395755
331 | 0.12932,0.0,13.92,0,,6.678,31.1,5.9604,4,289.0,16.0,396.9,,122.61831018541842
332 | 6.65492,0.0,18.1,0,0.713,6.317,83.0,2.7344,24,666.0,20.2,396.9,,83.62036671666932
333 | 0.21409,22.0,5.86,0,,6.438,8.9,7.3967,7,330.0,19.1,377.07,,106.29601015536667
334 | 0.27957,0.0,9.69,0,,5.926,42.6,2.3817,6,391.0,19.2,396.9,,104.88132454583398
335 | 7.83932,0.0,18.1,0,,6.209,65.4,2.9634,24,666.0,20.2,396.9,,91.78389685802578
336 | 0.1,34.0,6.09,0,,6.982,17.7,5.4917,7,329.0,16.1,390.43,,141.86864846418877
337 | 0.06211,40.0,1.25,0,,6.49,44.4,8.7921,1,335.0,19.7,396.9,,98.09887311709224
338 | 0.09065,20.0,6.96,1,,5.92,61.5,3.9175,3,223.0,18.6,391.34,,88.73061089647581
339 | 0.03445,82.5,2.03,0,,6.162,38.4,6.27,2,348.0,14.7,393.77,,103.29219687089451
340 | 1.46336,0.0,19.58,0,,7.489,90.8,1.9709,5,403.0,14.7,374.43,,214.26351204810595
341 | 0.15936,0.0,6.91,0,,6.211,6.5,5.7209,3,233.0,17.9,394.46,,105.72759550321551
342 | 0.07013,0.0,13.89,0,,6.642,85.1,3.4211,5,276.0,16.4,392.78,,123.04318174921504
343 | 14.2362,0.0,18.1,0,,6.343,100.0,1.5741,24,666.0,20.2,396.9,20.32,30.878251836188767
344 | 0.09068,45.0,3.44,0,,6.951,21.5,6.4798,5,398.0,15.2,377.68,,158.41685347411962
345 | 0.3494,0.0,9.9,,,5.972,76.7,3.1025,4,304.0,18.4,396.24,,86.95690765902225
346 | 0.65665,20.0,3.97,0,,6.842,100.0,2.0107,5,264.0,13.0,391.93,,129.0269068556081
347 | 0.13262,0.0,8.56,0,,5.851,96.7,2.1069,5,384.0,20.9,394.05,,83.63191924089021
348 | 0.04981,21.0,5.64,0,,5.998,21.4,6.8147,4,243.0,16.8,396.9,,100.27485737120129
349 | 8.15174,0.0,18.1,0,,5.39,98.9,1.7281,24,666.0,20.2,396.9,20.85,49.24166014463305
350 | 0.02731,0.0,7.07,0,,6.421,78.9,4.9671,2,242.0,17.8,396.9,,92.64943942043587
351 | 6.28807,0.0,18.1,0,0.74,6.341,96.4,2.072,24,666.0,20.2,318.01,,63.84656635106048
352 | 0.15086,0.0,27.74,0,,5.454,92.7,1.8209,4,711.0,20.1,395.09,18.06,65.08405693968963
353 | 0.21977,0.0,6.91,0,,5.602,62.0,6.0877,3,233.0,17.9,396.9,,83.15417691706125
354 | 11.8123,0.0,18.1,0,0.718,6.824,76.5,1.794,24,666.0,20.2,48.45,22.74,35.97624959002458
355 | 0.04113,25.0,4.86,0,,6.727,33.5,5.4007,4,281.0,19.0,396.9,,119.98912870543222
356 | 0.13642,0.0,10.59,0,,5.891,22.3,3.9454,4,277.0,18.6,396.9,,96.7789720721923
357 | 1.61282,0.0,8.14,0,,6.096,96.9,3.7598,4,307.0,21.0,248.31,20.34,57.902002735568665
358 | 8.49213,0.0,18.1,0,,6.348,86.1,2.0527,24,666.0,20.2,83.45,,62.10304604481125
359 | 0.82526,20.0,3.97,0,,7.327,94.5,2.0788,5,264.0,13.0,393.42,,132.8290801877492
360 | 37.6619,0.0,18.1,0,,6.202,78.7,1.8629,24,666.0,20.2,18.82,,46.68657433312743
361 | 3.69695,0.0,18.1,0,0.718,4.963,91.4,1.7523,24,666.0,20.2,316.03,,93.78120955775702
362 | 0.03932,0.0,3.41,0,,6.405,73.9,3.0921,2,270.0,17.8,393.55,,94.27963135975943
363 | 0.05497,0.0,5.19,0,,5.985,45.4,4.8122,5,224.0,20.2,396.9,,81.48790483723555
364 | 14.3337,0.0,18.1,0,,6.229,88.0,1.9512,24,666.0,20.2,383.32,,91.61118003657916
365 | 0.0536,21.0,5.64,0,,6.511,21.1,6.8147,4,243.0,16.8,396.9,,107.14931851810758
366 | 0.03113,0.0,4.39,0,,6.014,48.5,8.0136,3,352.0,18.8,385.64,,74.97865092116241
367 | 0.55007,20.0,3.97,0,,7.206,91.6,1.9301,5,264.0,13.0,387.89,,156.23487524872377
368 | 0.10612,30.0,4.93,0,,6.095,65.1,6.3361,6,300.0,16.6,394.62,,86.06837267706234
369 | 0.62976,0.0,8.14,0,,5.949,61.8,4.7075,4,307.0,21.0,396.9,,87.41859391509344
370 | 0.25356,0.0,9.9,0,,5.705,77.7,3.945,4,304.0,18.4,396.42,,69.40166024965988
371 | 0.0566,0.0,3.41,,,7.007,86.3,3.4217,2,270.0,17.8,396.9,,101.22193723778894
372 | 22.5971,0.0,18.1,0,,5.0,89.5,1.5184,24,666.0,20.2,396.9,31.99,31.726234519957263
373 | 0.22188,20.0,6.96,1,,7.691,51.8,4.3665,3,223.0,18.6,390.77,,150.82150427262926
374 | 2.01019,0.0,19.58,0,,7.929,96.2,2.0459,5,403.0,14.7,369.3,,214.11502684151574
375 | 0.06617,0.0,3.24,0,,5.868,25.8,5.2146,4,430.0,16.9,382.44,,82.62238547103524
376 | 0.23912,0.0,9.69,,,6.019,65.3,2.4091,6,391.0,19.2,396.9,,90.75980852101517
377 | 0.97617,0.0,21.89,0,,5.757,98.4,2.346,4,437.0,21.2,262.76,,66.795105542148
378 | 0.07503,33.0,2.18,0,,7.42,71.9,3.0992,7,222.0,18.4,396.9,,143.17077311916992
379 | 5.69175,0.0,18.1,0,,6.114,79.8,3.5459,24,666.0,20.2,392.68,,81.9121484165166
380 | 0.47547,0.0,9.9,0,,6.113,58.8,4.0019,4,304.0,18.4,396.23,,89.98764919309322
381 | 0.12757,30.0,4.93,0,,6.393,7.8,7.0355,6,300.0,16.6,374.71,,101.63068145253833
382 | 0.0136,75.0,4.0,0,,5.888,47.6,7.3197,3,469.0,21.1,396.9,,80.94118410654285
383 | 4.22239,0.0,18.1,1,0.77,5.803,89.0,1.9047,24,666.0,20.2,353.04,,71.94282999328784
384 | 0.08873,21.0,5.64,0,,5.963,45.7,6.8147,4,243.0,16.8,395.56,,84.3586158325382
385 | 3.69311,0.0,18.1,0,0.713,6.376,88.4,2.5671,24,666.0,20.2,391.43,,75.8482584414499
386 | 0.08447,0.0,4.05,0,,5.859,68.7,2.7019,5,296.0,16.6,393.23,,96.94598162673307
387 | 10.6718,0.0,18.1,0,0.74,6.459,94.8,1.9879,24,666.0,20.2,43.06,23.98,50.5936026306949
388 | 0.0837,45.0,3.44,0,,7.185,38.9,4.5667,5,398.0,15.2,396.9,,149.40381640715177
389 | 0.04527,0.0,11.93,0,,6.12,76.7,2.2875,1,273.0,21.0,396.9,,88.29026894523939
390 | 5.82115,0.0,18.1,0,0.713,6.513,89.9,2.8016,24,666.0,20.2,393.82,,86.54303414249068
391 | 0.07875,45.0,3.44,,,6.782,41.1,3.7886,5,398.0,15.2,393.87,,137.01095863624937
392 | 2.44953,0.0,19.58,0,,6.402,95.2,2.2625,5,403.0,14.7,330.04,,95.65880177371714
393 | 0.15445,25.0,5.13,0,,6.145,29.2,7.8148,8,284.0,19.7,390.68,,99.93310262249477
394 | 0.25387,0.0,6.91,0,,5.399,95.3,5.87,3,233.0,17.9,396.9,30.81,61.7473108349467
395 | 0.03049,55.0,3.78,0,,6.874,28.1,6.4654,5,370.0,17.6,387.97,,133.58147604929238
396 | 0.33045,0.0,6.2,0,,6.086,61.5,3.6519,8,307.0,17.4,376.75,,102.93180803659672
397 | 0.08221,22.0,5.86,0,,6.957,6.8,8.9067,7,330.0,19.1,386.09,,126.77645542235585
398 | 0.85204,0.0,8.14,0,,5.965,89.2,4.0123,4,307.0,21.0,392.53,,83.94234406655659
399 | 0.26938,0.0,9.9,0,,6.266,82.8,3.2628,4,304.0,18.4,393.39,,92.4989310922952
400 | 6.80117,0.0,18.1,0,0.713,6.081,84.4,2.7175,24,666.0,20.2,396.9,,85.74937514322716
401 | 1.27346,0.0,19.58,1,,6.25,92.6,1.7984,5,403.0,14.7,338.92,,115.79063663910462
402 | 0.10469,40.0,6.41,1,,7.267,49.0,4.7872,4,254.0,17.6,389.25,,142.23997357087353
403 | 9.96654,0.0,18.1,0,0.74,6.485,100.0,1.9784,24,666.0,20.2,386.73,18.85,65.93794043531199
404 | 0.06911,45.0,3.44,0,,6.739,30.8,6.4798,5,398.0,15.2,389.71,,130.67837655855462
405 | 16.8118,0.0,18.1,0,,5.277,98.1,1.4261,24,666.0,20.2,396.9,30.81,30.826335050066838
406 | 0.08265,0.0,13.92,0,,6.127,18.4,5.5027,4,289.0,16.0,396.9,,102.49425262479257
407 | 28.6558,0.0,18.1,0,,5.155,100.0,1.5894,24,666.0,20.2,210.97,20.08,69.86260748939857
408 | 0.02543,55.0,3.78,0,,6.696,56.4,5.7321,5,370.0,17.6,396.9,,102.45198430594878
409 | 0.61154,20.0,3.97,0,,8.704,86.9,1.801,5,264.0,13.0,389.7,,214.30612899294056
410 | 0.49298,0.0,9.9,0,,6.635,82.5,3.3175,4,304.0,18.4,396.9,,97.64871883252277
411 | 2.73397,0.0,19.58,0,0.871,5.597,94.9,1.5257,5,403.0,14.7,351.85,21.45,65.97107628810248
412 | 0.34006,0.0,21.89,0,,6.458,98.9,2.1185,4,437.0,21.2,395.04,,82.25129303259749
413 | 1.49632,0.0,19.58,0,0.871,5.404,100.0,1.5916,5,403.0,14.7,341.6,,84.04901653769186
414 | 4.26131,0.0,18.1,0,0.77,6.112,81.3,2.5091,24,666.0,20.2,390.74,,96.8500187197275
415 | 0.0686,0.0,2.89,0,,7.416,62.5,3.4952,2,276.0,18.0,396.9,,142.20689094774346
416 | 8.26725,0.0,18.1,1,,5.875,89.6,1.1296,24,666.0,20.2,347.88,,214.1986894942322
417 | 0.07151,0.0,4.49,0,,6.121,56.8,3.7476,3,247.0,18.5,395.15,,95.13056954253848
418 | 7.75223,0.0,18.1,0,0.713,6.301,83.7,2.7831,24,666.0,20.2,272.21,,63.85386272152615
419 | 0.04544,0.0,3.24,0,,6.144,32.2,5.8736,4,430.0,16.9,368.57,,84.8058053152259
420 | 0.28955,0.0,10.59,0,,5.412,9.8,3.5875,4,277.0,18.6,348.93,29.55,101.59726822044716
421 | 3.77498,0.0,18.1,0,,5.952,84.7,2.8715,24,666.0,20.2,22.01,,81.50138540570634
422 | 0.07165,0.0,25.65,0,,6.004,84.1,2.1974,2,188.0,19.1,377.67,,86.90751630292641
423 | 0.04741,0.0,11.93,0,,6.03,80.8,2.505,1,273.0,21.0,396.9,,50.95279803969646
424 | 1.25179,0.0,8.14,0,,5.57,98.1,3.7979,4,307.0,21.0,376.57,21.02,58.28811876068804
425 | 0.12579,45.0,3.44,,,6.556,29.1,4.5667,5,398.0,15.2,382.84,,127.60470297228434
426 | 0.15876,0.0,10.81,0,,5.961,17.5,5.2873,4,305.0,19.2,376.94,,92.88368665394682
427 | 0.1712,0.0,8.56,0,,5.836,91.9,2.211,5,384.0,20.9,395.67,18.66,83.64745648119606
428 | 0.29916,20.0,6.96,0,,5.856,42.1,4.429,3,223.0,18.6,388.65,,90.40912785392962
429 | 0.01501,80.0,2.01,0,,6.635,29.7,8.344,4,280.0,17.0,390.94,,104.92186930660849
430 | 11.1604,0.0,18.1,0,0.74,6.629,94.6,2.1247,24,666.0,20.2,109.85,23.27,57.4275008162647
431 | 0.22876,0.0,8.56,0,,6.405,85.4,2.7147,5,384.0,20.9,70.8,,79.62768514145213
432 |
--------------------------------------------------------------------------------
/data/housing/housing_validation.csv:
--------------------------------------------------------------------------------
1 | CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
2 | 0.09178,0.0,4.05,0,,6.416,84.1,2.6463,5,296.0,,395.5,
3 | 0.05644,40.0,6.41,1,,6.758,32.9,4.0776,4,254.0,,396.9,
4 | 0.10574,0.0,27.74,0,,5.983,98.8,1.8681,4,711.0,,390.11,18.07
5 | 0.09164,0.0,10.81,0,,6.065,7.8,5.2873,4,305.0,,390.91,
6 | 5.09017,0.0,18.1,0,0.713,6.297,91.8,2.3682,24,666.0,,385.09,
7 | 0.10153,0.0,12.83,,,6.279,74.5,4.0522,5,398.0,,373.66,
8 | 0.31827,0.0,9.9,0,,5.914,83.2,3.9986,4,304.0,,390.7,18.33
9 | 0.2909,0.0,21.89,0,,6.174,93.6,1.6119,4,437.0,,388.08,24.16
10 | 4.03841,0.0,18.1,0,,6.229,90.7,3.0993,24,666.0,,395.33,
11 | 0.22438,0.0,9.69,0,,6.027,79.7,2.4982,6,391.0,,396.9,
12 | 0.11069,0.0,13.89,1,,5.951,93.8,2.8893,5,276.0,,396.9,17.92
13 | 0.17004,12.5,7.87,0,,6.004,85.9,6.5921,5,311.0,,386.71,
14 | 45.7461,0.0,18.1,0,,4.519,100.0,1.6582,24,666.0,,88.27,36.98
15 | 0.05646,0.0,12.83,0,,6.232,53.7,5.0141,5,398.0,,386.4,
16 | 0.28392,0.0,7.38,0,,5.708,74.3,4.7211,5,287.0,,391.13,
17 | 4.64689,0.0,18.1,0,,6.98,67.6,2.5329,24,666.0,,374.68,
18 | 0.09849,0.0,25.65,0,,5.879,95.8,2.0063,2,188.0,,379.38,
19 | 14.3337,0.0,18.1,0,,4.88,100.0,1.5895,24,666.0,,372.92,30.62
20 | 0.01381,80.0,0.46,0,,7.875,32.0,5.6484,4,255.0,,394.23,
21 | 9.32909,0.0,18.1,0,0.713,6.185,98.7,2.2616,24,666.0,,396.9,18.13
22 | 0.16211,20.0,6.96,0,,6.24,16.3,4.429,3,223.0,,396.9,
23 | 0.07978,40.0,6.41,0,,6.482,32.1,4.1403,4,254.0,,396.9,
24 | 1.13081,0.0,8.14,0,,5.713,94.1,4.233,4,307.0,,360.17,22.6
25 | 0.06263,0.0,11.93,0,,6.593,69.1,2.4786,1,273.0,,391.99,
26 | 7.02259,0.0,18.1,0,0.718,6.006,95.3,1.8746,24,666.0,,319.98,
27 | 8.05579,0.0,18.1,0,,5.427,95.4,2.4298,24,666.0,,352.58,18.14
28 | 0.08387,0.0,12.83,0,,5.874,36.6,4.5026,5,398.0,,396.06,
29 | 9.51363,0.0,18.1,0,0.713,6.728,94.1,2.4961,24,666.0,,6.68,18.71
30 | 0.17446,0.0,10.59,1,,5.96,92.1,3.8771,4,277.0,,393.25,
31 | 0.26838,0.0,9.69,0,,5.794,70.6,2.8927,6,391.0,,396.9,
32 | 0.13914,0.0,4.05,0,,5.572,88.5,2.5961,5,296.0,,396.9,
33 | 0.1676,0.0,7.38,0,,6.426,52.3,4.5404,5,287.0,,396.9,
34 | 19.6091,0.0,18.1,0,,7.313,97.9,1.3163,24,666.0,,396.9,
35 | 3.67822,0.0,18.1,0,0.77,5.362,96.2,2.1036,24,666.0,,380.79,
36 | 4.42228,0.0,18.1,0,,6.003,94.5,2.5403,24,666.0,,331.29,21.32
37 | 2.14918,0.0,19.58,0,0.871,5.709,98.5,1.6232,5,403.0,,261.95,
38 | 0.02729,0.0,7.07,0,,7.185,61.1,4.9671,2,242.0,,392.83,
39 | 0.03427,0.0,5.19,0,,5.869,46.3,5.2311,5,224.0,,396.9,
40 | 0.13587,0.0,10.59,1,,6.064,59.1,4.2392,4,277.0,,381.32,
41 | 0.19539,0.0,10.81,0,,6.245,6.2,5.2873,4,305.0,,377.17,
42 | 0.2896,0.0,9.69,0,,5.39,72.9,2.7986,6,391.0,,396.9,21.14
43 | 0.04932,33.0,2.18,0,,6.849,70.3,3.1827,7,222.0,,396.9,
44 | 0.02009,95.0,2.68,0,,8.034,31.9,5.118,4,224.0,,390.55,
45 | 0.13554,12.5,6.07,0,,5.594,36.8,6.498,4,345.0,,396.9,
46 | 0.04684,0.0,3.41,0,,6.417,66.1,3.0923,2,270.0,,392.18,
47 | 6.96215,0.0,18.1,0,,5.713,97.0,1.9265,24,666.0,,394.43,
48 | 1.15172,0.0,8.14,0,,5.701,95.0,3.7872,4,307.0,,358.77,18.35
49 | 0.08826,0.0,10.81,,,6.417,6.6,5.2873,4,305.0,,383.73,
50 | 4.34879,0.0,18.1,0,,6.167,84.0,3.0334,24,666.0,,396.9,
51 | 0.00632,18.0,2.31,0,,6.575,65.2,4.09,1,296.0,,396.9,
52 | 0.11747,12.5,7.87,0,,6.009,82.9,6.2267,5,311.0,,396.9,
53 | 0.03705,20.0,3.33,0,,6.968,37.2,5.2447,5,216.0,,392.23,
54 | 1.23247,0.0,8.14,0,,6.142,91.7,3.9769,4,307.0,,396.9,18.72
55 | 0.11432,0.0,8.56,0,,6.781,71.3,2.8561,5,384.0,,395.58,
56 | 0.5405,20.0,3.97,0,,7.47,52.6,2.872,5,264.0,,390.3,
57 | 3.67367,0.0,18.1,0,,6.312,51.9,3.9917,24,666.0,,388.62,
58 | 5.66637,0.0,18.1,0,0.74,6.219,100.0,2.0048,24,666.0,,395.69,
59 | 0.03502,80.0,4.95,0,,6.861,27.9,5.1167,4,245.0,,396.9,
60 | 0.05059,0.0,4.49,0,,6.389,48.0,4.7794,3,247.0,,396.9,
61 | 0.19133,22.0,5.86,0,,5.605,70.2,7.9549,7,330.0,,389.13,18.46
62 | 0.1265,25.0,5.13,0,,6.762,43.4,7.9809,8,284.0,,395.58,
63 | 0.01311,90.0,1.22,0,,7.249,21.9,8.6966,5,226.0,,395.93,
64 | 0.44178,0.0,6.2,0,,6.552,21.4,3.3751,8,307.0,,380.34,
65 | 0.80271,0.0,8.14,0,,5.456,36.6,3.7965,4,307.0,,288.99,
66 | 0.0795,60.0,1.69,0,,6.579,35.9,10.7103,4,411.0,,370.78,
67 | 0.43571,0.0,10.59,1,,5.344,100.0,3.875,4,277.0,,396.9,23.09
68 | 8.71675,0.0,18.1,0,,6.471,98.8,1.7257,24,666.0,,391.98,
69 | 0.03659,25.0,4.86,0,,6.302,32.2,5.4007,4,281.0,,396.9,
70 | 0.02763,75.0,2.95,0,,6.595,21.8,5.4011,3,252.0,,395.63,
71 | 4.66883,0.0,18.1,0,0.713,5.976,87.9,2.5806,24,666.0,,10.48,19.01
72 | 0.18836,0.0,6.91,0,,5.786,33.3,5.1004,3,233.0,,396.9,
73 | 5.70818,0.0,18.1,0,,6.75,74.9,3.3317,24,666.0,,393.07,
74 | 12.8023,0.0,18.1,0,0.74,5.854,96.6,1.8956,24,666.0,,240.52,23.79
75 | 0.10659,80.0,1.91,0,,5.936,19.5,10.5857,4,334.0,,376.04,
76 | 0.08707,0.0,12.83,0,,6.14,45.8,4.0905,5,398.0,,386.96,
77 | 38.3518,0.0,18.1,,,5.453,100.0,1.4896,24,666.0,,396.9,30.59
78 |
--------------------------------------------------------------------------------
/data/iris/iris.csv:
--------------------------------------------------------------------------------
1 | "sepal.length","sepal.width","petal.length","petal.width","variety"
2 | 5.1,3.5,1.4,.2,"Setosa"
3 | 4.9,3,1.4,.2,"Setosa"
4 | 4.7,3.2,1.3,.2,"Setosa"
5 | 4.6,3.1,1.5,.2,"Setosa"
6 | 5,3.6,1.4,.2,"Setosa"
7 | 5.4,3.9,1.7,.4,"Setosa"
8 | 4.6,3.4,1.4,.3,"Setosa"
9 | 5,3.4,1.5,.2,"Setosa"
10 | 4.4,2.9,1.4,.2,"Setosa"
11 | 4.9,3.1,1.5,.1,"Setosa"
12 | 5.4,3.7,1.5,.2,"Setosa"
13 | 4.8,3.4,1.6,.2,"Setosa"
14 | 4.8,3,1.4,.1,"Setosa"
15 | 4.3,3,1.1,.1,"Setosa"
16 | 5.8,4,1.2,.2,"Setosa"
17 | 5.7,4.4,1.5,.4,"Setosa"
18 | 5.4,3.9,1.3,.4,"Setosa"
19 | 5.1,3.5,1.4,.3,"Setosa"
20 | 5.7,3.8,1.7,.3,"Setosa"
21 | 5.1,3.8,1.5,.3,"Setosa"
22 | 5.4,3.4,1.7,.2,"Setosa"
23 | 5.1,3.7,1.5,.4,"Setosa"
24 | 4.6,3.6,1,.2,"Setosa"
25 | 5.1,3.3,1.7,.5,"Setosa"
26 | 4.8,3.4,1.9,.2,"Setosa"
27 | 5,3,1.6,.2,"Setosa"
28 | 5,3.4,1.6,.4,"Setosa"
29 | 5.2,3.5,1.5,.2,"Setosa"
30 | 5.2,3.4,1.4,.2,"Setosa"
31 | 4.7,3.2,1.6,.2,"Setosa"
32 | 4.8,3.1,1.6,.2,"Setosa"
33 | 5.4,3.4,1.5,.4,"Setosa"
34 | 5.2,4.1,1.5,.1,"Setosa"
35 | 5.5,4.2,1.4,.2,"Setosa"
36 | 4.9,3.1,1.5,.2,"Setosa"
37 | 5,3.2,1.2,.2,"Setosa"
38 | 5.5,3.5,1.3,.2,"Setosa"
39 | 4.9,3.6,1.4,.1,"Setosa"
40 | 4.4,3,1.3,.2,"Setosa"
41 | 5.1,3.4,1.5,.2,"Setosa"
42 | 5,3.5,1.3,.3,"Setosa"
43 | 4.5,2.3,1.3,.3,"Setosa"
44 | 4.4,3.2,1.3,.2,"Setosa"
45 | 5,3.5,1.6,.6,"Setosa"
46 | 5.1,3.8,1.9,.4,"Setosa"
47 | 4.8,3,1.4,.3,"Setosa"
48 | 5.1,3.8,1.6,.2,"Setosa"
49 | 4.6,3.2,1.4,.2,"Setosa"
50 | 5.3,3.7,1.5,.2,"Setosa"
51 | 5,3.3,1.4,.2,"Setosa"
52 | 7,3.2,4.7,1.4,"Versicolor"
53 | 6.4,3.2,4.5,1.5,"Versicolor"
54 | 6.9,3.1,4.9,1.5,"Versicolor"
55 | 5.5,2.3,4,1.3,"Versicolor"
56 | 6.5,2.8,4.6,1.5,"Versicolor"
57 | 5.7,2.8,4.5,1.3,"Versicolor"
58 | 6.3,3.3,4.7,1.6,"Versicolor"
59 | 4.9,2.4,3.3,1,"Versicolor"
60 | 6.6,2.9,4.6,1.3,"Versicolor"
61 | 5.2,2.7,3.9,1.4,"Versicolor"
62 | 5,2,3.5,1,"Versicolor"
63 | 5.9,3,4.2,1.5,"Versicolor"
64 | 6,2.2,4,1,"Versicolor"
65 | 6.1,2.9,4.7,1.4,"Versicolor"
66 | 5.6,2.9,3.6,1.3,"Versicolor"
67 | 6.7,3.1,4.4,1.4,"Versicolor"
68 | 5.6,3,4.5,1.5,"Versicolor"
69 | 5.8,2.7,4.1,1,"Versicolor"
70 | 6.2,2.2,4.5,1.5,"Versicolor"
71 | 5.6,2.5,3.9,1.1,"Versicolor"
72 | 5.9,3.2,4.8,1.8,"Versicolor"
73 | 6.1,2.8,4,1.3,"Versicolor"
74 | 6.3,2.5,4.9,1.5,"Versicolor"
75 | 6.1,2.8,4.7,1.2,"Versicolor"
76 | 6.4,2.9,4.3,1.3,"Versicolor"
77 | 6.6,3,4.4,1.4,"Versicolor"
78 | 6.8,2.8,4.8,1.4,"Versicolor"
79 | 6.7,3,5,1.7,"Versicolor"
80 | 6,2.9,4.5,1.5,"Versicolor"
81 | 5.7,2.6,3.5,1,"Versicolor"
82 | 5.5,2.4,3.8,1.1,"Versicolor"
83 | 5.5,2.4,3.7,1,"Versicolor"
84 | 5.8,2.7,3.9,1.2,"Versicolor"
85 | 6,2.7,5.1,1.6,"Versicolor"
86 | 5.4,3,4.5,1.5,"Versicolor"
87 | 6,3.4,4.5,1.6,"Versicolor"
88 | 6.7,3.1,4.7,1.5,"Versicolor"
89 | 6.3,2.3,4.4,1.3,"Versicolor"
90 | 5.6,3,4.1,1.3,"Versicolor"
91 | 5.5,2.5,4,1.3,"Versicolor"
92 | 5.5,2.6,4.4,1.2,"Versicolor"
93 | 6.1,3,4.6,1.4,"Versicolor"
94 | 5.8,2.6,4,1.2,"Versicolor"
95 | 5,2.3,3.3,1,"Versicolor"
96 | 5.6,2.7,4.2,1.3,"Versicolor"
97 | 5.7,3,4.2,1.2,"Versicolor"
98 | 5.7,2.9,4.2,1.3,"Versicolor"
99 | 6.2,2.9,4.3,1.3,"Versicolor"
100 | 5.1,2.5,3,1.1,"Versicolor"
101 | 5.7,2.8,4.1,1.3,"Versicolor"
102 | 6.3,3.3,6,2.5,"Virginica"
103 | 5.8,2.7,5.1,1.9,"Virginica"
104 | 7.1,3,5.9,2.1,"Virginica"
105 | 6.3,2.9,5.6,1.8,"Virginica"
106 | 6.5,3,5.8,2.2,"Virginica"
107 | 7.6,3,6.6,2.1,"Virginica"
108 | 4.9,2.5,4.5,1.7,"Virginica"
109 | 7.3,2.9,6.3,1.8,"Virginica"
110 | 6.7,2.5,5.8,1.8,"Virginica"
111 | 7.2,3.6,6.1,2.5,"Virginica"
112 | 6.5,3.2,5.1,2,"Virginica"
113 | 6.4,2.7,5.3,1.9,"Virginica"
114 | 6.8,3,5.5,2.1,"Virginica"
115 | 5.7,2.5,5,2,"Virginica"
116 | 5.8,2.8,5.1,2.4,"Virginica"
117 | 6.4,3.2,5.3,2.3,"Virginica"
118 | 6.5,3,5.5,1.8,"Virginica"
119 | 7.7,3.8,6.7,2.2,"Virginica"
120 | 7.7,2.6,6.9,2.3,"Virginica"
121 | 6,2.2,5,1.5,"Virginica"
122 | 6.9,3.2,5.7,2.3,"Virginica"
123 | 5.6,2.8,4.9,2,"Virginica"
124 | 7.7,2.8,6.7,2,"Virginica"
125 | 6.3,2.7,4.9,1.8,"Virginica"
126 | 6.7,3.3,5.7,2.1,"Virginica"
127 | 7.2,3.2,6,1.8,"Virginica"
128 | 6.2,2.8,4.8,1.8,"Virginica"
129 | 6.1,3,4.9,1.8,"Virginica"
130 | 6.4,2.8,5.6,2.1,"Virginica"
131 | 7.2,3,5.8,1.6,"Virginica"
132 | 7.4,2.8,6.1,1.9,"Virginica"
133 | 7.9,3.8,6.4,2,"Virginica"
134 | 6.4,2.8,5.6,2.2,"Virginica"
135 | 6.3,2.8,5.1,1.5,"Virginica"
136 | 6.1,2.6,5.6,1.4,"Virginica"
137 | 7.7,3,6.1,2.3,"Virginica"
138 | 6.3,3.4,5.6,2.4,"Virginica"
139 | 6.4,3.1,5.5,1.8,"Virginica"
140 | 6,3,4.8,1.8,"Virginica"
141 | 6.9,3.1,5.4,2.1,"Virginica"
142 | 6.7,3.1,5.6,2.4,"Virginica"
143 | 6.9,3.1,5.1,2.3,"Virginica"
144 | 5.8,2.7,5.1,1.9,"Virginica"
145 | 6.8,3.2,5.9,2.3,"Virginica"
146 | 6.7,3.3,5.7,2.5,"Virginica"
147 | 6.7,3,5.2,2.3,"Virginica"
148 | 6.3,2.5,5,1.9,"Virginica"
149 | 6.5,3,5.2,2,"Virginica"
150 | 6.2,3.4,5.4,2.3,"Virginica"
151 | 5.9,3,5.1,1.8,"Virginica"
--------------------------------------------------------------------------------
/data/iris/iris.tsv:
--------------------------------------------------------------------------------
1 | sepal.length sepal.width petal.length petal.width variety
2 | 5.1 3.5 1.4 0.2 Setosa
3 | 4.9 3.0 1.4 0.2 Setosa
4 | 4.7 3.2 1.3 0.2 Setosa
5 | 4.6 3.1 1.5 0.2 Setosa
6 | 5.0 3.6 1.4 0.2 Setosa
7 | 5.4 3.9 1.7 0.4 Setosa
8 | 4.6 3.4 1.4 0.3 Setosa
9 | 5.0 3.4 1.5 0.2 Setosa
10 | 4.4 2.9 1.4 0.2 Setosa
11 | 4.9 3.1 1.5 0.1 Setosa
12 | 5.4 3.7 1.5 0.2 Setosa
13 | 4.8 3.4 1.6 0.2 Setosa
14 | 4.8 3.0 1.4 0.1 Setosa
15 | 4.3 3.0 1.1 0.1 Setosa
16 | 5.8 4.0 1.2 0.2 Setosa
17 | 5.7 4.4 1.5 0.4 Setosa
18 | 5.4 3.9 1.3 0.4 Setosa
19 | 5.1 3.5 1.4 0.3 Setosa
20 | 5.7 3.8 1.7 0.3 Setosa
21 | 5.1 3.8 1.5 0.3 Setosa
22 | 5.4 3.4 1.7 0.2 Setosa
23 | 5.1 3.7 1.5 0.4 Setosa
24 | 4.6 3.6 1.0 0.2 Setosa
25 | 5.1 3.3 1.7 0.5 Setosa
26 | 4.8 3.4 1.9 0.2 Setosa
27 | 5.0 3.0 1.6 0.2 Setosa
28 | 5.0 3.4 1.6 0.4 Setosa
29 | 5.2 3.5 1.5 0.2 Setosa
30 | 5.2 3.4 1.4 0.2 Setosa
31 | 4.7 3.2 1.6 0.2 Setosa
32 | 4.8 3.1 1.6 0.2 Setosa
33 | 5.4 3.4 1.5 0.4 Setosa
34 | 5.2 4.1 1.5 0.1 Setosa
35 | 5.5 4.2 1.4 0.2 Setosa
36 | 4.9 3.1 1.5 0.2 Setosa
37 | 5.0 3.2 1.2 0.2 Setosa
38 | 5.5 3.5 1.3 0.2 Setosa
39 | 4.9 3.6 1.4 0.1 Setosa
40 | 4.4 3.0 1.3 0.2 Setosa
41 | 5.1 3.4 1.5 0.2 Setosa
42 | 5.0 3.5 1.3 0.3 Setosa
43 | 4.5 2.3 1.3 0.3 Setosa
44 | 4.4 3.2 1.3 0.2 Setosa
45 | 5.0 3.5 1.6 0.6 Setosa
46 | 5.1 3.8 1.9 0.4 Setosa
47 | 4.8 3.0 1.4 0.3 Setosa
48 | 5.1 3.8 1.6 0.2 Setosa
49 | 4.6 3.2 1.4 0.2 Setosa
50 | 5.3 3.7 1.5 0.2 Setosa
51 | 5.0 3.3 1.4 0.2 Setosa
52 | 7.0 3.2 4.7 1.4 Versicolor
53 | 6.4 3.2 4.5 1.5 Versicolor
54 | 6.9 3.1 4.9 1.5 Versicolor
55 | 5.5 2.3 4.0 1.3 Versicolor
56 | 6.5 2.8 4.6 1.5 Versicolor
57 | 5.7 2.8 4.5 1.3 Versicolor
58 | 6.3 3.3 4.7 1.6 Versicolor
59 | 4.9 2.4 3.3 1.0 Versicolor
60 | 6.6 2.9 4.6 1.3 Versicolor
61 | 5.2 2.7 3.9 1.4 Versicolor
62 | 5.0 2.0 3.5 1.0 Versicolor
63 | 5.9 3.0 4.2 1.5 Versicolor
64 | 6.0 2.2 4.0 1.0 Versicolor
65 | 6.1 2.9 4.7 1.4 Versicolor
66 | 5.6 2.9 3.6 1.3 Versicolor
67 | 6.7 3.1 4.4 1.4 Versicolor
68 | 5.6 3.0 4.5 1.5 Versicolor
69 | 5.8 2.7 4.1 1.0 Versicolor
70 | 6.2 2.2 4.5 1.5 Versicolor
71 | 5.6 2.5 3.9 1.1 Versicolor
72 | 5.9 3.2 4.8 1.8 Versicolor
73 | 6.1 2.8 4.0 1.3 Versicolor
74 | 6.3 2.5 4.9 1.5 Versicolor
75 | 6.1 2.8 4.7 1.2 Versicolor
76 | 6.4 2.9 4.3 1.3 Versicolor
77 | 6.6 3.0 4.4 1.4 Versicolor
78 | 6.8 2.8 4.8 1.4 Versicolor
79 | 6.7 3.0 5.0 1.7 Versicolor
80 | 6.0 2.9 4.5 1.5 Versicolor
81 | 5.7 2.6 3.5 1.0 Versicolor
82 | 5.5 2.4 3.8 1.1 Versicolor
83 | 5.5 2.4 3.7 1.0 Versicolor
84 | 5.8 2.7 3.9 1.2 Versicolor
85 | 6.0 2.7 5.1 1.6 Versicolor
86 | 5.4 3.0 4.5 1.5 Versicolor
87 | 6.0 3.4 4.5 1.6 Versicolor
88 | 6.7 3.1 4.7 1.5 Versicolor
89 | 6.3 2.3 4.4 1.3 Versicolor
90 | 5.6 3.0 4.1 1.3 Versicolor
91 | 5.5 2.5 4.0 1.3 Versicolor
92 | 5.5 2.6 4.4 1.2 Versicolor
93 | 6.1 3.0 4.6 1.4 Versicolor
94 | 5.8 2.6 4.0 1.2 Versicolor
95 | 5.0 2.3 3.3 1.0 Versicolor
96 | 5.6 2.7 4.2 1.3 Versicolor
97 | 5.7 3.0 4.2 1.2 Versicolor
98 | 5.7 2.9 4.2 1.3 Versicolor
99 | 6.2 2.9 4.3 1.3 Versicolor
100 | 5.1 2.5 3.0 1.1 Versicolor
101 | 5.7 2.8 4.1 1.3 Versicolor
102 | 6.3 3.3 6.0 2.5 Virginica
103 | 5.8 2.7 5.1 1.9 Virginica
104 | 7.1 3.0 5.9 2.1 Virginica
105 | 6.3 2.9 5.6 1.8 Virginica
106 | 6.5 3.0 5.8 2.2 Virginica
107 | 7.6 3.0 6.6 2.1 Virginica
108 | 4.9 2.5 4.5 1.7 Virginica
109 | 7.3 2.9 6.3 1.8 Virginica
110 | 6.7 2.5 5.8 1.8 Virginica
111 | 7.2 3.6 6.1 2.5 Virginica
112 | 6.5 3.2 5.1 2.0 Virginica
113 | 6.4 2.7 5.3 1.9 Virginica
114 | 6.8 3.0 5.5 2.1 Virginica
115 | 5.7 2.5 5.0 2.0 Virginica
116 | 5.8 2.8 5.1 2.4 Virginica
117 | 6.4 3.2 5.3 2.3 Virginica
118 | 6.5 3.0 5.5 1.8 Virginica
119 | 7.7 3.8 6.7 2.2 Virginica
120 | 7.7 2.6 6.9 2.3 Virginica
121 | 6.0 2.2 5.0 1.5 Virginica
122 | 6.9 3.2 5.7 2.3 Virginica
123 | 5.6 2.8 4.9 2.0 Virginica
124 | 7.7 2.8 6.7 2.0 Virginica
125 | 6.3 2.7 4.9 1.8 Virginica
126 | 6.7 3.3 5.7 2.1 Virginica
127 | 7.2 3.2 6.0 1.8 Virginica
128 | 6.2 2.8 4.8 1.8 Virginica
129 | 6.1 3.0 4.9 1.8 Virginica
130 | 6.4 2.8 5.6 2.1 Virginica
131 | 7.2 3.0 5.8 1.6 Virginica
132 | 7.4 2.8 6.1 1.9 Virginica
133 | 7.9 3.8 6.4 2.0 Virginica
134 | 6.4 2.8 5.6 2.2 Virginica
135 | 6.3 2.8 5.1 1.5 Virginica
136 | 6.1 2.6 5.6 1.4 Virginica
137 | 7.7 3.0 6.1 2.3 Virginica
138 | 6.3 3.4 5.6 2.4 Virginica
139 | 6.4 3.1 5.5 1.8 Virginica
140 | 6.0 3.0 4.8 1.8 Virginica
141 | 6.9 3.1 5.4 2.1 Virginica
142 | 6.7 3.1 5.6 2.4 Virginica
143 | 6.9 3.1 5.1 2.3 Virginica
144 | 5.8 2.7 5.1 1.9 Virginica
145 | 6.8 3.2 5.9 2.3 Virginica
146 | 6.7 3.3 5.7 2.5 Virginica
147 | 6.7 3.0 5.2 2.3 Virginica
148 | 6.3 2.5 5.0 1.9 Virginica
149 | 6.5 3.0 5.2 2.0 Virginica
150 | 6.2 3.4 5.4 2.3 Virginica
151 | 5.9 3.0 5.1 1.8 Virginica
152 |
--------------------------------------------------------------------------------
/data/iris/iris.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Appsilon/datascience-python/a1d299415544191f5e96502c5a064561ba3f59ce/data/iris/iris.xlsx
--------------------------------------------------------------------------------
/data/iris/iris_noheader.csv:
--------------------------------------------------------------------------------
1 | 5.1,3.5,1.4,.2,"Setosa"
2 | 4.9,3,1.4,.2,"Setosa"
3 | 4.7,3.2,1.3,.2,"Setosa"
4 | 4.6,3.1,1.5,.2,"Setosa"
5 | 5,3.6,1.4,.2,"Setosa"
6 | 5.4,3.9,1.7,.4,"Setosa"
7 | 4.6,3.4,1.4,.3,"Setosa"
8 | 5,3.4,1.5,.2,"Setosa"
9 | 4.4,2.9,1.4,.2,"Setosa"
10 | 4.9,3.1,1.5,.1,"Setosa"
11 | 5.4,3.7,1.5,.2,"Setosa"
12 | 4.8,3.4,1.6,.2,"Setosa"
13 | 4.8,3,1.4,.1,"Setosa"
14 | 4.3,3,1.1,.1,"Setosa"
15 | 5.8,4,1.2,.2,"Setosa"
16 | 5.7,4.4,1.5,.4,"Setosa"
17 | 5.4,3.9,1.3,.4,"Setosa"
18 | 5.1,3.5,1.4,.3,"Setosa"
19 | 5.7,3.8,1.7,.3,"Setosa"
20 | 5.1,3.8,1.5,.3,"Setosa"
21 | 5.4,3.4,1.7,.2,"Setosa"
22 | 5.1,3.7,1.5,.4,"Setosa"
23 | 4.6,3.6,1,.2,"Setosa"
24 | 5.1,3.3,1.7,.5,"Setosa"
25 | 4.8,3.4,1.9,.2,"Setosa"
26 | 5,3,1.6,.2,"Setosa"
27 | 5,3.4,1.6,.4,"Setosa"
28 | 5.2,3.5,1.5,.2,"Setosa"
29 | 5.2,3.4,1.4,.2,"Setosa"
30 | 4.7,3.2,1.6,.2,"Setosa"
31 | 4.8,3.1,1.6,.2,"Setosa"
32 | 5.4,3.4,1.5,.4,"Setosa"
33 | 5.2,4.1,1.5,.1,"Setosa"
34 | 5.5,4.2,1.4,.2,"Setosa"
35 | 4.9,3.1,1.5,.2,"Setosa"
36 | 5,3.2,1.2,.2,"Setosa"
37 | 5.5,3.5,1.3,.2,"Setosa"
38 | 4.9,3.6,1.4,.1,"Setosa"
39 | 4.4,3,1.3,.2,"Setosa"
40 | 5.1,3.4,1.5,.2,"Setosa"
41 | 5,3.5,1.3,.3,"Setosa"
42 | 4.5,2.3,1.3,.3,"Setosa"
43 | 4.4,3.2,1.3,.2,"Setosa"
44 | 5,3.5,1.6,.6,"Setosa"
45 | 5.1,3.8,1.9,.4,"Setosa"
46 | 4.8,3,1.4,.3,"Setosa"
47 | 5.1,3.8,1.6,.2,"Setosa"
48 | 4.6,3.2,1.4,.2,"Setosa"
49 | 5.3,3.7,1.5,.2,"Setosa"
50 | 5,3.3,1.4,.2,"Setosa"
51 | 7,3.2,4.7,1.4,"Versicolor"
52 | 6.4,3.2,4.5,1.5,"Versicolor"
53 | 6.9,3.1,4.9,1.5,"Versicolor"
54 | 5.5,2.3,4,1.3,"Versicolor"
55 | 6.5,2.8,4.6,1.5,"Versicolor"
56 | 5.7,2.8,4.5,1.3,"Versicolor"
57 | 6.3,3.3,4.7,1.6,"Versicolor"
58 | 4.9,2.4,3.3,1,"Versicolor"
59 | 6.6,2.9,4.6,1.3,"Versicolor"
60 | 5.2,2.7,3.9,1.4,"Versicolor"
61 | 5,2,3.5,1,"Versicolor"
62 | 5.9,3,4.2,1.5,"Versicolor"
63 | 6,2.2,4,1,"Versicolor"
64 | 6.1,2.9,4.7,1.4,"Versicolor"
65 | 5.6,2.9,3.6,1.3,"Versicolor"
66 | 6.7,3.1,4.4,1.4,"Versicolor"
67 | 5.6,3,4.5,1.5,"Versicolor"
68 | 5.8,2.7,4.1,1,"Versicolor"
69 | 6.2,2.2,4.5,1.5,"Versicolor"
70 | 5.6,2.5,3.9,1.1,"Versicolor"
71 | 5.9,3.2,4.8,1.8,"Versicolor"
72 | 6.1,2.8,4,1.3,"Versicolor"
73 | 6.3,2.5,4.9,1.5,"Versicolor"
74 | 6.1,2.8,4.7,1.2,"Versicolor"
75 | 6.4,2.9,4.3,1.3,"Versicolor"
76 | 6.6,3,4.4,1.4,"Versicolor"
77 | 6.8,2.8,4.8,1.4,"Versicolor"
78 | 6.7,3,5,1.7,"Versicolor"
79 | 6,2.9,4.5,1.5,"Versicolor"
80 | 5.7,2.6,3.5,1,"Versicolor"
81 | 5.5,2.4,3.8,1.1,"Versicolor"
82 | 5.5,2.4,3.7,1,"Versicolor"
83 | 5.8,2.7,3.9,1.2,"Versicolor"
84 | 6,2.7,5.1,1.6,"Versicolor"
85 | 5.4,3,4.5,1.5,"Versicolor"
86 | 6,3.4,4.5,1.6,"Versicolor"
87 | 6.7,3.1,4.7,1.5,"Versicolor"
88 | 6.3,2.3,4.4,1.3,"Versicolor"
89 | 5.6,3,4.1,1.3,"Versicolor"
90 | 5.5,2.5,4,1.3,"Versicolor"
91 | 5.5,2.6,4.4,1.2,"Versicolor"
92 | 6.1,3,4.6,1.4,"Versicolor"
93 | 5.8,2.6,4,1.2,"Versicolor"
94 | 5,2.3,3.3,1,"Versicolor"
95 | 5.6,2.7,4.2,1.3,"Versicolor"
96 | 5.7,3,4.2,1.2,"Versicolor"
97 | 5.7,2.9,4.2,1.3,"Versicolor"
98 | 6.2,2.9,4.3,1.3,"Versicolor"
99 | 5.1,2.5,3,1.1,"Versicolor"
100 | 5.7,2.8,4.1,1.3,"Versicolor"
101 | 6.3,3.3,6,2.5,"Virginica"
102 | 5.8,2.7,5.1,1.9,"Virginica"
103 | 7.1,3,5.9,2.1,"Virginica"
104 | 6.3,2.9,5.6,1.8,"Virginica"
105 | 6.5,3,5.8,2.2,"Virginica"
106 | 7.6,3,6.6,2.1,"Virginica"
107 | 4.9,2.5,4.5,1.7,"Virginica"
108 | 7.3,2.9,6.3,1.8,"Virginica"
109 | 6.7,2.5,5.8,1.8,"Virginica"
110 | 7.2,3.6,6.1,2.5,"Virginica"
111 | 6.5,3.2,5.1,2,"Virginica"
112 | 6.4,2.7,5.3,1.9,"Virginica"
113 | 6.8,3,5.5,2.1,"Virginica"
114 | 5.7,2.5,5,2,"Virginica"
115 | 5.8,2.8,5.1,2.4,"Virginica"
116 | 6.4,3.2,5.3,2.3,"Virginica"
117 | 6.5,3,5.5,1.8,"Virginica"
118 | 7.7,3.8,6.7,2.2,"Virginica"
119 | 7.7,2.6,6.9,2.3,"Virginica"
120 | 6,2.2,5,1.5,"Virginica"
121 | 6.9,3.2,5.7,2.3,"Virginica"
122 | 5.6,2.8,4.9,2,"Virginica"
123 | 7.7,2.8,6.7,2,"Virginica"
124 | 6.3,2.7,4.9,1.8,"Virginica"
125 | 6.7,3.3,5.7,2.1,"Virginica"
126 | 7.2,3.2,6,1.8,"Virginica"
127 | 6.2,2.8,4.8,1.8,"Virginica"
128 | 6.1,3,4.9,1.8,"Virginica"
129 | 6.4,2.8,5.6,2.1,"Virginica"
130 | 7.2,3,5.8,1.6,"Virginica"
131 | 7.4,2.8,6.1,1.9,"Virginica"
132 | 7.9,3.8,6.4,2,"Virginica"
133 | 6.4,2.8,5.6,2.2,"Virginica"
134 | 6.3,2.8,5.1,1.5,"Virginica"
135 | 6.1,2.6,5.6,1.4,"Virginica"
136 | 7.7,3,6.1,2.3,"Virginica"
137 | 6.3,3.4,5.6,2.4,"Virginica"
138 | 6.4,3.1,5.5,1.8,"Virginica"
139 | 6,3,4.8,1.8,"Virginica"
140 | 6.9,3.1,5.4,2.1,"Virginica"
141 | 6.7,3.1,5.6,2.4,"Virginica"
142 | 6.9,3.1,5.1,2.3,"Virginica"
143 | 5.8,2.7,5.1,1.9,"Virginica"
144 | 6.8,3.2,5.9,2.3,"Virginica"
145 | 6.7,3.3,5.7,2.5,"Virginica"
146 | 6.7,3,5.2,2.3,"Virginica"
147 | 6.3,2.5,5,1.9,"Virginica"
148 | 6.5,3,5.2,2,"Virginica"
149 | 6.2,3.4,5.4,2.3,"Virginica"
150 | 5.9,3,5.1,1.8,"Virginica"
--------------------------------------------------------------------------------
/data/other/lotr_data.csv:
--------------------------------------------------------------------------------
1 | Name,Race,Salary,Profession,Age of Death
2 | Bilbo Baggins,Hobbit,10000,Retired,131
3 | Frodo Baggins,Hobbit,70000,Ring-bearer,53
4 | Sam Gamgee,Hobbit,60000,Security,102
5 | Aragorn,Human,60000,Security,210
6 |
--------------------------------------------------------------------------------
/homework_solutions/.gitignore:
--------------------------------------------------------------------------------
1 | light_gbm.csv
2 | model_lgbm_regressor.pkl
3 |
--------------------------------------------------------------------------------
/homework_solutions/01_hw_numpy.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Numpy Homework Solution\n",
8 | "\n",
9 | "## Task 1"
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 1,
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "import numpy as np\n",
19 | "from numpy.random import default_rng\n",
20 | "\n",
21 | "rng = default_rng(1337)\n",
22 | "x = np.round(rng.normal(size=30), 2)\n",
23 | "y = x + np.round(rng.normal(size=30) * 0.1, 2)"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 2,
29 | "metadata": {},
30 | "outputs": [
31 | {
32 | "data": {
33 | "text/plain": [
34 | "(0.18533333333333335, 5.5600000000000005, 0.8520000000000001)"
35 | ]
36 | },
37 | "execution_count": 2,
38 | "metadata": {},
39 | "output_type": "execute_result"
40 | }
41 | ],
42 | "source": [
43 | "# 1, 2, 3\n",
44 | "x.mean(), x.sum(), np.abs(x).mean()"
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": 3,
50 | "metadata": {},
51 | "outputs": [
52 | {
53 | "data": {
54 | "text/plain": [
55 | "2.52"
56 | ]
57 | },
58 | "execution_count": 3,
59 | "metadata": {},
60 | "output_type": "execute_result"
61 | }
62 | ],
63 | "source": [
64 | "# 4\n",
65 | "x[np.abs(x).argmax()]"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 4,
71 | "metadata": {},
72 | "outputs": [
73 | {
74 | "data": {
75 | "text/plain": [
76 | "-2.5"
77 | ]
78 | },
79 | "execution_count": 4,
80 | "metadata": {},
81 | "output_type": "execute_result"
82 | }
83 | ],
84 | "source": [
85 | "# 5\n",
86 | "x[np.abs(x - 2).argmax()]"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": 5,
92 | "metadata": {},
93 | "outputs": [
94 | {
95 | "data": {
96 | "text/plain": [
97 | "array([ 0.04, 0.47, -0.14, -1. , 1. , -1. , 1. , -1. , 0.15,\n",
98 | " -0.09, 1. , 0.52, -0.53, 1. , 0.21, 1. , -0.22, 0.09,\n",
99 | " -0.13, -1. , 0.85, 0.68, 0.87, -0.34, 1. , 1. , -0.04,\n",
100 | " -0.82, -0.16, -1. ])"
101 | ]
102 | },
103 | "execution_count": 5,
104 | "metadata": {},
105 | "output_type": "execute_result"
106 | }
107 | ],
108 | "source": [
109 | "# 6\n",
110 | "np.where(x > 1, 1, np.where(x>-1, x , -1))"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": 6,
116 | "metadata": {},
117 | "outputs": [
118 | {
119 | "data": {
120 | "text/plain": [
121 | "-0.0029999999999999914"
122 | ]
123 | },
124 | "execution_count": 6,
125 | "metadata": {},
126 | "output_type": "execute_result"
127 | }
128 | ],
129 | "source": [
130 | "# 7\n",
131 | "(y-x).mean()"
132 | ]
133 | },
134 | {
135 | "cell_type": "code",
136 | "execution_count": 7,
137 | "metadata": {},
138 | "outputs": [
139 | {
140 | "data": {
141 | "text/plain": [
142 | "0.08499999999999999"
143 | ]
144 | },
145 | "execution_count": 7,
146 | "metadata": {},
147 | "output_type": "execute_result"
148 | }
149 | ],
150 | "source": [
151 | "# 8\n",
152 | "np.abs(y - x).mean()"
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": 8,
158 | "metadata": {},
159 | "outputs": [
160 | {
161 | "data": {
162 | "text/plain": [
163 | "0.010869999999999998"
164 | ]
165 | },
166 | "execution_count": 8,
167 | "metadata": {},
168 | "output_type": "execute_result"
169 | }
170 | ],
171 | "source": [
172 | "# 9\n",
173 | "((x - y)**2).mean()"
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": 9,
179 | "metadata": {},
180 | "outputs": [
181 | {
182 | "data": {
183 | "text/plain": [
184 | "0.10425929215182692"
185 | ]
186 | },
187 | "execution_count": 9,
188 | "metadata": {},
189 | "output_type": "execute_result"
190 | }
191 | ],
192 | "source": [
193 | "# 10\n",
194 | "np.sqrt(((x - y)**2).mean())"
195 | ]
196 | },
197 | {
198 | "cell_type": "markdown",
199 | "metadata": {},
200 | "source": [
201 | "## Task 2"
202 | ]
203 | },
204 | {
205 | "cell_type": "code",
206 | "execution_count": 10,
207 | "metadata": {},
208 | "outputs": [],
209 | "source": [
210 | "def standardize(X):\n",
211 | " return (X - X.mean(axis=0)) / X.std(axis=0)"
212 | ]
213 | },
214 | {
215 | "cell_type": "code",
216 | "execution_count": 11,
217 | "metadata": {},
218 | "outputs": [
219 | {
220 | "data": {
221 | "text/plain": [
222 | "array([[ 0, 1, 2],\n",
223 | " [ 3, 4, 5],\n",
224 | " [ 6, 7, 8],\n",
225 | " [ 9, 10, 11],\n",
226 | " [12, 13, 14],\n",
227 | " [15, 16, 17],\n",
228 | " [18, 19, 20],\n",
229 | " [21, 22, 23],\n",
230 | " [24, 25, 26],\n",
231 | " [27, 28, 29]])"
232 | ]
233 | },
234 | "execution_count": 11,
235 | "metadata": {},
236 | "output_type": "execute_result"
237 | }
238 | ],
239 | "source": [
240 | "X = np.arange(30).reshape((10, -1))\n",
241 | "X"
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": 12,
247 | "metadata": {},
248 | "outputs": [
249 | {
250 | "data": {
251 | "text/plain": [
252 | "(array([-1.11022302e-16, -1.11022302e-16, -1.11022302e-16]),\n",
253 | " array([1., 1., 1.]))"
254 | ]
255 | },
256 | "execution_count": 12,
257 | "metadata": {},
258 | "output_type": "execute_result"
259 | }
260 | ],
261 | "source": [
262 | "Xs = standardize(X)\n",
263 | "Xs.mean(axis=0), Xs.std(axis=0)"
264 | ]
265 | },
266 | {
267 | "cell_type": "markdown",
268 | "metadata": {},
269 | "source": [
270 | "## Task 3"
271 | ]
272 | },
273 | {
274 | "cell_type": "code",
275 | "execution_count": 13,
276 | "metadata": {},
277 | "outputs": [],
278 | "source": [
279 | "def simulation_pi(n, seed):\n",
   21 |     "# We will sample x, y points from [0, 1) and check the\n",
   22 |     "# percentage of them landing inside a quarter of a unit\n",
   23 |     "# circle. The unit square has area 1 and a quarter of the unit circle\n",
   24 |     "# has area $\\pi / 4$, which is why we multiply by 4 to get $\\pi$.\n",
284 | " rng = default_rng(seed)\n",
285 | " x = rng.random(n)\n",
286 | " y = rng.random(n)\n",
287 | " return 4 * ((x**2 + y**2) <= 1).mean()"
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "execution_count": 14,
293 | "metadata": {},
294 | "outputs": [
295 | {
296 | "data": {
297 | "text/plain": [
298 | "3.1409368"
299 | ]
300 | },
301 | "execution_count": 14,
302 | "metadata": {},
303 | "output_type": "execute_result"
304 | }
305 | ],
306 | "source": [
307 | "simulation_pi(10_000_000, 2022)"
308 | ]
309 | },
310 | {
311 | "cell_type": "markdown",
312 | "metadata": {},
313 | "source": [
314 | "## Task 4"
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": 15,
320 | "metadata": {},
321 | "outputs": [],
322 | "source": [
323 | "def simulation_exp2(n, seed):\n",
324 | " rng = default_rng(seed)\n",
325 | " x = rng.uniform(-2, 2, n)\n",
326 | " yr = rng.uniform(0, 1, n)\n",
327 | " yf = np.exp(-(x**2))\n",
  328 |     "# Here we multiply by 4 since we sample from\n",
  329 |     "# a rectangle of area 4:\n",
  330 |     "# x in [-2, 2) and y in [0, 1)\n",
331 | " A = (yr < yf).mean() * 4\n",
332 | " return A"
333 | ]
334 | },
335 | {
336 | "cell_type": "code",
337 | "execution_count": 16,
338 | "metadata": {},
339 | "outputs": [
340 | {
341 | "data": {
342 | "text/plain": [
343 | "1.7634636"
344 | ]
345 | },
346 | "execution_count": 16,
347 | "metadata": {},
348 | "output_type": "execute_result"
349 | }
350 | ],
351 | "source": [
352 | "simulation_exp2(10_000_000, 2022)"
353 | ]
354 | }
355 | ],
356 | "metadata": {
357 | "kernelspec": {
358 | "display_name": "Python 3.10.4 ('daftacademy-ds')",
359 | "language": "python",
360 | "name": "python3"
361 | },
362 | "language_info": {
363 | "codemirror_mode": {
364 | "name": "ipython",
365 | "version": 3
366 | },
367 | "file_extension": ".py",
368 | "mimetype": "text/x-python",
369 | "name": "python",
370 | "nbconvert_exporter": "python",
371 | "pygments_lexer": "ipython3",
372 | "version": "3.10.4"
373 | },
374 | "orig_nbformat": 4,
375 | "vscode": {
376 | "interpreter": {
377 | "hash": "306379f83cd7ad906f1941e4fa5d943b9cb847ce8f1ed425d8b7a18353685fed"
378 | }
379 | }
380 | },
381 | "nbformat": 4,
382 | "nbformat_minor": 2
383 | }
384 |
--------------------------------------------------------------------------------
/homework_solutions/03_hw_pandas2_a.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Pandas Homework Part 2\n",
8 | "\n",
9 | "`pandas` version"
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 1,
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "from collections import defaultdict"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {},
26 | "outputs": [],
27 | "source": [
28 | "columns = ['prev', 'curr', 'type', 'n']"
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": 19,
34 | "metadata": {},
35 | "outputs": [
36 | {
37 | "name": "stdout",
38 | "output_type": "stream",
39 | "text": [
40 | "de Ukraine\n",
41 | "pl Ukraina\n",
42 | "CPU times: user 13.3 s, sys: 304 ms, total: 13.6 s\n",
43 | "Wall time: 13.7 s\n"
44 | ]
45 | }
46 | ],
47 | "source": [
48 | "%%time\n",
49 | "# 1, 2\n",
50 | "\n",
51 | "for country in [\"de\", \"pl\"]:\n",
52 | " df = pd.read_csv(f\"../data/wikipedia/clickstream-{country}wiki-2022-03.tsv.gz\", sep=\"\\t\", names=columns, on_bad_lines='warn', quoting=3)\n",
53 | " s = df.query(\"type == 'external'\").groupby(\"curr\")['n'].sum().sort_values(ascending=False).head().index[0]\n",
54 | " print(country, s)"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": 5,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "badges = pd.read_xml(\"../data/travel/travel.stackexchange.com/Badges.xml\")\n",
64 | "posts = pd.read_xml(\"../data/travel/travel.stackexchange.com/Posts.xml\", parser='etree')\n",
65 | "tags = pd.read_xml(\"../data/travel/travel.stackexchange.com/Tags.xml\", parser='etree')\n",
66 | "users = pd.read_xml(\"../data/travel/travel.stackexchange.com/Users.xml\", parser='etree')\n",
67 | "votes = pd.read_xml(\"../data/travel/travel.stackexchange.com/Votes.xml\", parser='etree')"
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "execution_count": 6,
73 | "metadata": {},
74 | "outputs": [],
75 | "source": [
76 | "wiki = pd.read_csv(f\"../data/wikipedia/clickstream-enwiki-2022-03.tsv.gz\", sep=\"\\t\", names=columns, quoting=3)"
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": 7,
82 | "metadata": {},
83 | "outputs": [
84 | {
85 | "name": "stdout",
86 | "output_type": "stream",
87 | "text": [
88 | "CPU times: user 166 ms, sys: 3.64 ms, total: 170 ms\n",
89 | "Wall time: 169 ms\n"
90 | ]
91 | },
92 | {
93 | "data": {
94 | "text/plain": [
95 | "DisplayName Mark Mayo\n",
96 | "Location Christchurch, New Zealand\n",
97 | "Name: 0, dtype: object"
98 | ]
99 | },
100 | "execution_count": 7,
101 | "metadata": {},
102 | "output_type": "execute_result"
103 | }
104 | ],
105 | "source": [
106 | "%%time\n",
107 | "# 3, 4\n",
108 | "tid = badges.merge(users, left_on=\"UserId\", right_on=\"Id\").groupby(\"UserId\").size().sort_values().index[-1]\n",
109 | "top_user = users.loc[users[\"Id\"] == tid, :]\n",
110 | "top_user = top_user.reset_index(drop=True).loc[0, ['DisplayName', 'Location']]\n",
111 | "top_user"
112 | ]
113 | },
114 | {
115 | "cell_type": "code",
116 | "execution_count": 8,
117 | "metadata": {},
118 | "outputs": [
119 | {
120 | "name": "stdout",
121 | "output_type": "stream",
122 | "text": [
123 | "CPU times: user 1.4 s, sys: 67.7 ms, total: 1.46 s\n",
124 | "Wall time: 1.45 s\n"
125 | ]
126 | },
127 | {
128 | "data": {
129 | "text/plain": [
130 | "25804"
131 | ]
132 | },
133 | "execution_count": 8,
134 | "metadata": {},
135 | "output_type": "execute_result"
136 | }
137 | ],
138 | "source": [
139 | "%%time\n",
140 | "# 5\n",
141 | "city = top_user['Location'].split(\", \")[0]\n",
142 | "wiki.loc[wiki['curr'] == city, :]['n'].sum()"
143 | ]
144 | },
145 | {
146 | "cell_type": "code",
147 | "execution_count": 9,
148 | "metadata": {},
149 | "outputs": [],
150 | "source": [
  151 |     "# Note that this part can be done in many ways;\n",
  152 |     "# this solution focuses on showing how to use apply in a non-standard way\n",
153 | "\n",
154 | "# https://stackoverflow.com/questions/9662346/python-code-to-remove-html-tags-from-a-string\n",
155 | "import re\n",
156 | "CLEANR = re.compile('<.*?>') \n",
157 | "\n",
158 | "def cleanhtml(raw_html):\n",
159 | " if isinstance(raw_html, str):\n",
160 | " cleantext = re.sub(CLEANR, '', raw_html)\n",
161 | " return cleantext\n",
162 | " else:\n",
163 | " return raw_html"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": 10,
169 | "metadata": {},
170 | "outputs": [],
171 | "source": [
172 | "def aux(l):\n",
173 | " d = defaultdict(int)\n",
174 | " if isinstance(l, list):\n",
175 | " for w in l:\n",
176 | " d[w.lower()] += 1\n",
177 | " return d"
178 | ]
179 | },
180 | {
181 | "cell_type": "code",
182 | "execution_count": 11,
183 | "metadata": {},
184 | "outputs": [
185 | {
186 | "name": "stdout",
187 | "output_type": "stream",
188 | "text": [
189 | "CPU times: user 9.94 s, sys: 722 ms, total: 10.7 s\n",
190 | "Wall time: 10.7 s\n"
191 | ]
192 | },
193 | {
194 | "data": {
195 | "text/plain": [
196 | "('passport', 31631)"
197 | ]
198 | },
199 | "execution_count": 11,
200 | "metadata": {},
201 | "output_type": "execute_result"
202 | }
203 | ],
204 | "source": [
205 | "%%time\n",
206 | "# 6, 7\n",
207 | "dicts = posts['Body'].apply(cleanhtml).str.replace(\"\\n\", \" \").str.split(\" \").apply(aux)\n",
208 | "# Even better solution:\n",
209 | "# dicts = posts['Body'].str.replace('<.*?>', \"\", regex=True).str.replace(\"\\n\", \" \").str.split(\" \").apply(aux)\n",
210 | "big_d = defaultdict(int)\n",
211 | "for d in dicts:\n",
212 | " for k, v in d.items():\n",
213 | " big_d[k] += v\n",
214 | "\n",
215 | "s = pd.Series(big_d, name=\"Count\").reset_index()\n",
216 | "s.rename(columns={'index':'Word'}, inplace=True)\n",
217 | "\n",
218 | "words = s.loc[s.Word.str.len() > 7, :].sort_values(\"Count\", ascending=False).head()\n",
219 | "\n",
220 | "# 3 points\n",
221 | "theword = words['Word'].iloc[0]\n",
222 | "theword, wiki.query(\"curr == @theword.capitalize()\")['n'].sum()"
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": 14,
228 | "metadata": {},
229 | "outputs": [
230 | {
231 | "name": "stdout",
232 | "output_type": "stream",
233 | "text": [
234 | "CPU times: user 12.8 s, sys: 1.53 s, total: 14.4 s\n",
235 | "Wall time: 14.4 s\n"
236 | ]
237 | },
238 | {
239 | "data": {
240 | "text/plain": [
241 | "('passport', 31631)"
242 | ]
243 | },
244 | "execution_count": 14,
245 | "metadata": {},
246 | "output_type": "execute_result"
247 | }
248 | ],
249 | "source": [
250 | "%%time\n",
251 | "# 6, 7\n",
  252 |     "# Just a different approach\n",
253 | "words = (\n",
254 | " posts['Body']\n",
255 | " .str.replace('<.*?>', \"\", regex=True)\n",
256 | " .str.replace(\"\\n\", \" \")\n",
257 | " .str.split(\" \")\n",
258 | " .explode()\n",
259 | " .str.lower()\n",
260 | ")\n",
261 | "theword = words[words.str.len() > 7].value_counts().head(1).index[0]\n",
262 | "theword, wiki.query(\"curr == @theword.capitalize()\")['n'].sum()"
263 | ]
264 | },
265 | {
266 | "cell_type": "code",
267 | "execution_count": 15,
268 | "metadata": {},
269 | "outputs": [
270 | {
271 | "name": "stdout",
272 | "output_type": "stream",
273 | "text": [
274 | "CPU times: user 933 ms, sys: 11.9 ms, total: 945 ms\n",
275 | "Wall time: 943 ms\n"
276 | ]
277 | },
278 | {
279 | "data": {
280 | "text/plain": [
281 | "Score 547\n",
282 | "DisplayName Andrew Lazarus\n",
283 | "Name: 0, dtype: object"
284 | ]
285 | },
286 | "execution_count": 15,
287 | "metadata": {},
288 | "output_type": "execute_result"
289 | }
290 | ],
291 | "source": [
292 | "%%time\n",
293 | "# 8, 9\n",
294 | "upvotes = (\n",
295 | " votes\n",
296 | " .query('VoteTypeId == 2')\n",
297 | " .groupby(\"PostId\")\n",
298 | " .size()\n",
299 | " .reset_index(name=\"UpVotes\")\n",
300 | ")\n",
301 | "downvotes = (\n",
302 | " votes\n",
303 | " .query('VoteTypeId == 3')\n",
304 | " .groupby(\"PostId\")\n",
305 | " .size()\n",
306 | " .reset_index(name=\"DownVotes\")\n",
307 | ")\n",
308 | "\n",
309 | "posts2 = (\n",
310 | " posts\n",
311 | " .merge(upvotes, left_on=\"Id\", right_on=\"PostId\", how='left')\n",
312 | " .merge(downvotes, left_on=\"Id\", right_on=\"PostId\", how='left')\n",
313 | ")\n",
314 | "posts2.loc[:, ['UpVotes', 'DownVotes']] = posts2.loc[:, ['UpVotes', 'DownVotes']].fillna(value=0)\n",
315 | "\n",
316 | "posts2['UpVoteRatio'] = posts2['UpVotes'] - posts2['DownVotes']\n",
317 | "\n",
318 | "(\n",
319 | " posts2\n",
320 | " .merge(users, left_on=\"OwnerUserId\", right_on=\"Id\")\n",
321 | " .sort_values(\"UpVoteRatio\", ascending=False)\n",
322 | " .reset_index(drop=True)\n",
323 | " .loc[0, ['Score', 'DisplayName']]\n",
324 | ")"
325 | ]
326 | },
327 | {
328 | "cell_type": "code",
329 | "execution_count": 16,
330 | "metadata": {},
331 | "outputs": [
332 | {
333 | "name": "stdout",
334 | "output_type": "stream",
335 | "text": [
336 | "CPU times: user 212 ms, sys: 8.26 ms, total: 220 ms\n",
337 | "Wall time: 219 ms\n"
338 | ]
339 | },
340 | {
341 | "data": {
342 | "text/plain": [
343 | "Timestamp('2016-08-31 00:00:00')"
344 | ]
345 | },
346 | "execution_count": 16,
347 | "metadata": {},
348 | "output_type": "execute_result"
349 | }
350 | ],
351 | "source": [
352 | "%%time\n",
353 | "# 10\n",
355 | "votes['CreationDateDT'] = pd.to_datetime(votes['CreationDate'])\n",
356 | "votes.set_index(\"CreationDateDT\", inplace=True)\n",
357 | "\n",
358 | "votesagg = votes.groupby(pd.Grouper(freq=\"M\")).size()\n",
359 | "\n",
360 | "votesagg.sort_values(ascending=False).index[0]\n"
361 | ]
362 | },
363 | {
364 | "cell_type": "code",
365 | "execution_count": 17,
366 | "metadata": {},
367 | "outputs": [
368 | {
369 | "name": "stdout",
370 | "output_type": "stream",
371 | "text": [
372 | "CPU times: user 0 ns, sys: 1.9 ms, total: 1.9 ms\n",
373 | "Wall time: 1.73 ms\n"
374 | ]
375 | },
376 | {
377 | "data": {
378 | "text/plain": [
379 | "Timestamp('2015-10-31 00:00:00')"
380 | ]
381 | },
382 | "execution_count": 17,
383 | "metadata": {},
384 | "output_type": "execute_result"
385 | }
386 | ],
387 | "source": [
388 | "%%time\n",
389 | "# 11\n",
390 | "# votesagg is sorted by index (CreationDateDT) \n",
391 | "votesagg.diff().sort_values().index[0]"
392 | ]
393 | },
394 | {
395 | "cell_type": "code",
396 | "execution_count": 18,
397 | "metadata": {},
398 | "outputs": [
399 | {
400 | "name": "stdout",
401 | "output_type": "stream",
402 | "text": [
403 | "CPU times: user 452 ms, sys: 37 µs, total: 452 ms\n",
404 | "Wall time: 451 ms\n"
405 | ]
406 | },
407 | {
408 | "data": {
409 | "text/plain": [
410 | "air-travel 34\n",
411 | "Name: Tags, dtype: int64"
412 | ]
413 | },
414 | "execution_count": 18,
415 | "metadata": {},
416 | "output_type": "execute_result"
417 | }
418 | ],
419 | "source": [
420 | "%%time\n",
421 | "# 12\n",
422 | "\n",
423 | "posts3 = posts.merge(users, left_on=\"OwnerUserId\", right_on=\"Id\")\n",
424 | "tags = posts3.loc[\n",
425 | " posts3['Location'].str.contains(\"Poland\") | \n",
426 | " posts3['Location'].str.contains(\"Polska\"), \n",
427 | " 'Tags'\n",
428 | "]\n",
429 | "(\n",
430 | " tags\n",
431 | " .str.strip(\"<\")\n",
432 | " .str.strip(\">\")\n",
433 | " .str.split(\"><\")\n",
434 | " .dropna()\n",
435 | " .explode()\n",
436 | " .value_counts()\n",
437 | " .head(1)\n",
438 | ")"
439 | ]
440 | },
441 | {
442 | "cell_type": "code",
443 | "execution_count": null,
444 | "metadata": {},
445 | "outputs": [],
446 | "source": []
447 | }
448 | ],
449 | "metadata": {
450 | "kernelspec": {
451 | "display_name": "Python 3.10.4 ('daftacademy-ds2')",
452 | "language": "python",
453 | "name": "python3"
454 | },
455 | "language_info": {
456 | "codemirror_mode": {
457 | "name": "ipython",
458 | "version": 3
459 | },
460 | "file_extension": ".py",
461 | "mimetype": "text/x-python",
462 | "name": "python",
463 | "nbconvert_exporter": "python",
464 | "pygments_lexer": "ipython3",
465 | "version": "3.10.4"
466 | },
467 | "orig_nbformat": 4,
468 | "vscode": {
469 | "interpreter": {
470 | "hash": "8d8a772a312a89d7c091db0c8769ded3912bfec6f446bb9104da72914614d8d8"
471 | }
472 | }
473 | },
474 | "nbformat": 4,
475 | "nbformat_minor": 2
476 | }
477 |
--------------------------------------------------------------------------------
/homework_solutions/03_hw_pandas2_b.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Pandas Homework Part 2\n",
8 | "\n",
9 | "`polars` version"
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 1,
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "import polars as pl"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {},
26 | "outputs": [],
27 | "source": [
28 | "columns = ['prev', 'curr', 'type', 'n']"
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": 4,
34 | "metadata": {},
35 | "outputs": [
36 | {
37 | "name": "stdout",
38 | "output_type": "stream",
39 | "text": [
40 | "de Ukraine\n",
41 | "pl Ukraina\n",
42 | "CPU times: user 6.5 s, sys: 1.83 s, total: 8.32 s\n",
43 | "Wall time: 2.54 s\n"
44 | ]
45 | }
46 | ],
47 | "source": [
48 | "%%time\n",
49 | "for country in [\"de\", \"pl\"]:\n",
50 | " dfl = pl.read_csv(f\"../data/wikipedia/clickstream-{country}wiki-2022-03.tsv.gz\",sep=\"\\t\", has_header=False, new_columns=columns, quote_char=None)\n",
51 | " s = (\n",
52 | " dfl.lazy()\n",
53 | " .filter(pl.col(\"type\") ==\"external\")\n",
54 | " .groupby(\"curr\")\n",
55 | " .agg(pl.col(\"n\").sum().alias(\"total\"))\n",
56 | " .sort(\"total\",reverse=True)\n",
57 | " .collect()[0, 'curr']\n",
58 | " )\n",
59 | " print(country, s)"
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "execution_count": 5,
65 | "metadata": {},
66 | "outputs": [],
67 | "source": [
68 | "badges = pd.read_xml(\"../data/travel/travel.stackexchange.com/Badges.xml\")\n",
69 | "posts = pd.read_xml(\"../data/travel/travel.stackexchange.com/Posts.xml\", parser='etree')\n",
70 | "tags = pd.read_xml(\"../data/travel/travel.stackexchange.com/Tags.xml\", parser='etree')\n",
71 | "users = pd.read_xml(\"../data/travel/travel.stackexchange.com/Users.xml\", parser='etree')\n",
72 | "votes = pd.read_xml(\"../data/travel/travel.stackexchange.com/Votes.xml\", parser='etree')"
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": 6,
78 | "metadata": {},
79 | "outputs": [],
80 | "source": [
81 | "badges_pl = pl.from_pandas(badges)\n",
82 | "posts_pl = pl.from_pandas(posts)\n",
83 | "tags_pl = pl.from_pandas(tags)\n",
84 | "votes_pl = pl.from_pandas(votes)\n",
85 | "users_pl = pl.from_pandas(users)\n",
86 | "posts_pl['OwnerUserId'] = posts_pl['OwnerUserId'].cast(int)"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": 7,
92 | "metadata": {},
93 | "outputs": [],
94 | "source": [
   95 |     "wiki_pl = pl.read_csv(f\"../data/wikipedia/clickstream-enwiki-2022-03.tsv.gz\", sep=\"\\t\", has_header=False, new_columns=columns, quote_char=None)"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": 8,
101 | "metadata": {},
102 | "outputs": [
103 | {
104 | "name": "stdout",
105 | "output_type": "stream",
106 | "text": [
107 | "CPU times: user 393 ms, sys: 94.4 ms, total: 488 ms\n",
108 | "Wall time: 108 ms\n"
109 | ]
110 | },
111 | {
112 | "data": {
160 | "text/plain": [
161 | "shape: (1, 2)\n",
162 | "┌─────────────┬───────────────────────────┐\n",
163 | "│ DisplayName ┆ Location │\n",
164 | "│ --- ┆ --- │\n",
165 | "│ str ┆ str │\n",
166 | "╞═════════════╪═══════════════════════════╡\n",
167 | "│ Mark Mayo ┆ Christchurch, New Zealand │\n",
168 | "└─────────────┴───────────────────────────┘"
169 | ]
170 | },
171 | "execution_count": 8,
172 | "metadata": {},
173 | "output_type": "execute_result"
174 | }
175 | ],
176 | "source": [
177 | "%%time \n",
178 | "# 3, 4\n",
179 | "tid = (\n",
180 | " badges_pl\n",
181 | " .join(users_pl, left_on=\"UserId\", right_on=\"Id\", how='left')\n",
182 | " .groupby([\"UserId\", \"DisplayName\"])\n",
183 | " .agg([pl.count().alias(\"NBadges\")])\n",
184 | " .sort(\"NBadges\", reverse=True)\n",
185 | " .head(1)\n",
186 | " [0, 'UserId']\n",
187 | ")\n",
188 | "top_user = (\n",
189 | " users_pl\n",
190 | " .filter(pl.col(\"Id\") == tid)\n",
191 | " .select(['DisplayName', 'Location'])\n",
192 | ")\n",
193 | "top_user"
194 | ]
195 | },
196 | {
197 | "cell_type": "code",
198 | "execution_count": 9,
199 | "metadata": {},
200 | "outputs": [
201 | {
202 | "name": "stdout",
203 | "output_type": "stream",
204 | "text": [
205 | "CPU times: user 170 ms, sys: 38.5 ms, total: 208 ms\n",
206 | "Wall time: 44.6 ms\n"
207 | ]
208 | },
209 | {
210 | "data": {
211 | "text/html": [
212 | "(HTML table output omitted; identical to the text/plain rendering below)"
257 | ],
258 | "text/plain": [
259 | "shape: (1, 2)\n",
260 | "┌─────────────┬───────────────────────────┐\n",
261 | "│ DisplayName ┆ Location │\n",
262 | "│ --- ┆ --- │\n",
263 | "│ str ┆ str │\n",
264 | "╞═════════════╪═══════════════════════════╡\n",
265 | "│ Mark Mayo ┆ Christchurch, New Zealand │\n",
266 | "└─────────────┴───────────────────────────┘"
267 | ]
268 | },
269 | "execution_count": 9,
270 | "metadata": {},
271 | "output_type": "execute_result"
272 | }
273 | ],
274 | "source": [
275 | "%%time \n",
276 | "# 3, 4 lazy evaluation\n",
277 | "tid = (\n",
278 | " badges_pl.lazy()\n",
279 | " .join(users_pl.lazy(), left_on=\"UserId\", right_on=\"Id\", how='left')\n",
280 | " .groupby([\"UserId\", \"DisplayName\"])\n",
281 | " .agg([pl.count().alias(\"NBadges\")])\n",
282 | " .sort(\"NBadges\", reverse=True)\n",
283 | " .head(1)\n",
284 | " .collect()[0, 'UserId']\n",
285 | ")\n",
286 | "top_user = (\n",
287 | " users_pl\n",
288 | " .lazy()\n",
289 | " .filter(pl.col(\"Id\") == tid)\n",
290 | " .select(['DisplayName', 'Location'])\n",
291 | " .collect()\n",
292 | ")\n",
293 | "top_user"
294 | ]
295 | },
296 | {
297 | "cell_type": "code",
298 | "execution_count": 10,
299 | "metadata": {},
300 | "outputs": [
301 | {
302 | "name": "stdout",
303 | "output_type": "stream",
304 | "text": [
305 | "CPU times: user 93 ms, sys: 1.29 ms, total: 94.3 ms\n",
306 | "Wall time: 78.3 ms\n"
307 | ]
308 | },
309 | {
310 | "data": {
311 | "text/html": [
312 | "(HTML table output omitted; identical to the text/plain rendering below)"
348 | ],
349 | "text/plain": [
350 | "shape: (1, 1)\n",
351 | "┌───────┐\n",
352 | "│ n │\n",
353 | "│ --- │\n",
354 | "│ i64 │\n",
355 | "╞═══════╡\n",
356 | "│ 25804 │\n",
357 | "└───────┘"
358 | ]
359 | },
360 | "execution_count": 10,
361 | "metadata": {},
362 | "output_type": "execute_result"
363 | }
364 | ],
365 | "source": [
366 | "%%time\n",
367 | "# 5\n",
368 | "city = top_user['Location'][0].split(\", \")[0]\n",
369 | "(\n",
370 | " wiki_pl\n",
371 | " .filter(pl.col('curr') == city)\n",
372 | " .select(pl.col('n').sum())\n",
373 | ")"
374 | ]
375 | },
376 | {
377 | "cell_type": "code",
378 | "execution_count": 11,
379 | "metadata": {},
380 | "outputs": [
381 | {
382 | "name": "stdout",
383 | "output_type": "stream",
384 | "text": [
385 | "CPU times: user 2.91 s, sys: 427 ms, total: 3.34 s\n",
386 | "Wall time: 3.28 s\n"
387 | ]
388 | },
389 | {
390 | "data": {
391 | "text/plain": [
392 | "('passport', 31631)"
393 | ]
394 | },
395 | "execution_count": 11,
396 | "metadata": {},
397 | "output_type": "execute_result"
398 | }
399 | ],
400 | "source": [
401 | "%%time\n",
402 | "# 6, 7\n",
403 | "res = (\n",
404 | " posts_pl\n",
405 | " .select(\n",
406 | " pl.col('Body')\n",
407 | " .str.replace_all(\"<.*?>\", \"\")\n",
408 | " .str.replace_all(\"\\n\", \" \")\n",
409 | " .str.split(\" \")\n",
410 | " .explode()\n",
411 | " .str.to_lowercase()\n",
412 | " .alias(\"Words\")\n",
413 | " )\n",
414 | " .select(\n",
415 | " pl.col(\"Words\")\n",
416 | " .filter(pl.col(\"Words\").str.lengths() > 7)\n",
417 | " .value_counts()\n",
418 | " ).unnest(\"Words\")\n",
419 | " .head(1)\n",
420 | " \n",
421 | ")\n",
422 | "res[0, 'Words'], wiki_pl.filter(pl.col('curr') == res[0, 'Words'].capitalize())['n'].sum()"
423 | ]
424 | },
425 | {
426 | "cell_type": "code",
427 | "execution_count": 12,
428 | "metadata": {},
429 | "outputs": [
430 | {
431 | "name": "stdout",
432 | "output_type": "stream",
433 | "text": [
434 | "CPU times: user 1.24 s, sys: 754 ms, total: 1.99 s\n",
435 | "Wall time: 469 ms\n"
436 | ]
437 | },
438 | {
439 | "data": {
440 | "text/html": [
441 | "(HTML table output omitted; identical to the text/plain rendering below)"
486 | ],
487 | "text/plain": [
488 | "shape: (1, 2)\n",
489 | "┌────────────────┬─────────────┐\n",
490 | "│ DisplayName ┆ UpVoteRatio │\n",
491 | "│ --- ┆ --- │\n",
492 | "│ str ┆ i64 │\n",
493 | "╞════════════════╪═════════════╡\n",
494 | "│ Andrew Lazarus ┆ 547 │\n",
495 | "└────────────────┴─────────────┘"
496 | ]
497 | },
498 | "execution_count": 12,
499 | "metadata": {},
500 | "output_type": "execute_result"
501 | }
502 | ],
503 | "source": [
504 | "%%time\n",
505 | "# 8, 9\n",
506 | "upvotes_pl = (\n",
507 | " votes_pl\n",
508 | " .lazy()\n",
509 | " .filter(pl.col(\"VoteTypeId\") == 2)\n",
510 | " .groupby(\"PostId\")\n",
511 | " .agg(pl.count().alias(\"UpVotes\"))\n",
512 | ")\n",
513 | "\n",
514 | "downvotes_pl = (\n",
515 | " votes_pl\n",
516 | " .lazy()\n",
517 | " .filter(pl.col(\"VoteTypeId\") == 3)\n",
518 | " .groupby(\"PostId\")\n",
519 | " .agg(pl.count().alias(\"DownVotes\"))\n",
520 | ")\n",
521 | "\n",
522 | "(\n",
523 | " posts_pl.lazy()\n",
524 | " .join(upvotes_pl, left_on=\"Id\", right_on=\"PostId\", how='left')\n",
525 | " .join(downvotes_pl, left_on=\"Id\", right_on=\"PostId\", how='left')\n",
526 | " .with_columns(\n",
527 | " [\n",
528 | " pl.col(\"UpVotes\").fill_null(0),\n",
529 | " pl.col(\"DownVotes\").fill_null(0),\n",
530 | " ]\n",
531 | " )\n",
532 | " .with_column(\n",
533 | " (pl.col('UpVotes') - pl.col('DownVotes')).alias('UpVoteRatio')\n",
534 | " )\n",
535 | " .join(users_pl.lazy(), left_on=\"OwnerUserId\", right_on=\"Id\")\n",
536 | " .sort('UpVoteRatio', reverse=True)\n",
537 | " .collect()[0, ['DisplayName', 'UpVoteRatio']]\n",
538 | ")\n"
539 | ]
540 | },
541 | {
542 | "cell_type": "code",
543 | "execution_count": 13,
544 | "metadata": {},
545 | "outputs": [
546 | {
547 | "name": "stdout",
548 | "output_type": "stream",
549 | "text": [
550 | "CPU times: user 591 ms, sys: 3.48 ms, total: 595 ms\n",
551 | "Wall time: 518 ms\n"
552 | ]
553 | },
554 | {
555 | "data": {
556 | "text/html": [
557 | "(HTML table output omitted; identical to the text/plain rendering below)"
611 | ],
612 | "text/plain": [
613 | "shape: (1, 3)\n",
614 | "┌──────┬───────┬────────┐\n",
615 | "│ Year ┆ Month ┆ NVotes │\n",
616 | "│ --- ┆ --- ┆ --- │\n",
617 | "│ i32 ┆ u32 ┆ u32 │\n",
618 | "╞══════╪═══════╪════════╡\n",
619 | "│ 2016 ┆ 8 ┆ 19591 │\n",
620 | "└──────┴───────┴────────┘"
621 | ]
622 | },
623 | "execution_count": 13,
624 | "metadata": {},
625 | "output_type": "execute_result"
626 | }
627 | ],
628 | "source": [
629 | "%%time \n",
630 | "# 10\n",
631 | "votes_agg = (\n",
632 | " votes_pl\n",
633 | " .with_column(\n",
634 | " pl.col('CreationDate').str.strptime(pl.Datetime)\n",
635 | " )\n",
636 | " .groupby([\n",
637 | " pl.col('CreationDate').dt.year().alias(\"Year\"),\n",
638 | " pl.col('CreationDate').dt.month().alias(\"Month\")\n",
639 | " ])\n",
640 | " .agg(pl.count().alias(\"NVotes\"))\n",
641 | ")\n",
642 | "\n",
643 | "votes_agg.filter(pl.col(\"NVotes\") == pl.col(\"NVotes\").max())"
644 | ]
645 | },
646 | {
647 | "cell_type": "code",
648 | "execution_count": 14,
649 | "metadata": {},
650 | "outputs": [
651 | {
652 | "name": "stdout",
653 | "output_type": "stream",
654 | "text": [
655 | "CPU times: user 8.34 ms, sys: 1.12 ms, total: 9.46 ms\n",
656 | "Wall time: 2.72 ms\n"
657 | ]
658 | },
659 | {
660 | "data": {
661 | "text/html": [
662 | "(HTML table output omitted; identical to the text/plain rendering below)"
716 | ],
717 | "text/plain": [
718 | "shape: (1, 3)\n",
719 | "┌──────┬───────┬────────────┐\n",
720 | "│ Year ┆ Month ┆ NVotesDiff │\n",
721 | "│ --- ┆ --- ┆ --- │\n",
722 | "│ i32 ┆ u32 ┆ i64 │\n",
723 | "╞══════╪═══════╪════════════╡\n",
724 | "│ 2015 ┆ 10 ┆ -6201 │\n",
725 | "└──────┴───────┴────────────┘"
726 | ]
727 | },
728 | "execution_count": 14,
729 | "metadata": {},
730 | "output_type": "execute_result"
731 | }
732 | ],
733 | "source": [
734 | "%%time\n",
735 | "# 11\n",
736 | "(\n",
737 | " votes_agg\n",
738 | " .sort([\"Year\", \"Month\"])\n",
739 | " .select([\n",
740 | " \"Year\",\n",
741 | " \"Month\",\n",
742 | " pl.col(\"NVotes\").cast(int).diff().alias(\"NVotesDiff\")\n",
743 | " ])\n",
744 | " .filter(pl.col(\"NVotesDiff\") == pl.col(\"NVotesDiff\").min())\n",
745 | ")"
746 | ]
747 | },
748 | {
749 | "cell_type": "code",
750 | "execution_count": 15,
751 | "metadata": {},
752 | "outputs": [
753 | {
754 | "name": "stdout",
755 | "output_type": "stream",
756 | "text": [
757 | "CPU times: user 42.3 ms, sys: 15.4 ms, total: 57.7 ms\n",
758 | "Wall time: 21.6 ms\n"
759 | ]
760 | },
761 | {
762 | "data": {
763 | "text/html": [
764 | "(HTML table output omitted; identical to the text/plain rendering below)"
809 | ],
810 | "text/plain": [
811 | "shape: (1, 2)\n",
812 | "┌────────────┬────────┐\n",
813 | "│ Tags ┆ counts │\n",
814 | "│ --- ┆ --- │\n",
815 | "│ str ┆ u32 │\n",
816 | "╞════════════╪════════╡\n",
817 | "│ air-travel ┆ 34 │\n",
818 | "└────────────┴────────┘"
819 | ]
820 | },
821 | "execution_count": 15,
822 | "metadata": {},
823 | "output_type": "execute_result"
824 | }
825 | ],
826 | "source": [
827 | "%%time\n",
828 | "# 12\n",
829 | "(\n",
830 | " posts_pl.lazy().join(users_pl.lazy(), left_on=\"OwnerUserId\", right_on=\"Id\", how='left')\n",
831 | " .filter(\n",
832 | " pl.col(\"Location\").str.contains(\"Poland\") | \n",
833 | " pl.col(\"Location\").str.contains(\"Polska\")\n",
834 | " )\n",
835 | " .select([\n",
836 | " pl.col('Tags')\n",
837 | " .str.replace(r\"^<\", \"\")\n",
838 | " .str.replace(r\">$\", \"\")\n",
839 | " .str.split(\"><\")\n",
840 | " .drop_nulls()\n",
841 | " .explode()\n",
842 | " .value_counts()\n",
843 | " ])\n",
844 | " .unnest(\"Tags\")\n",
845 | " .head(1)\n",
846 | " .collect()\n",
847 | ")\n"
848 | ]
849 | }
850 | ],
851 | "metadata": {
852 | "kernelspec": {
853 | "display_name": "Python 3.10.4 ('daftacademy-ds')",
854 | "language": "python",
855 | "name": "python3"
856 | },
857 | "language_info": {
858 | "codemirror_mode": {
859 | "name": "ipython",
860 | "version": 3
861 | },
862 | "file_extension": ".py",
863 | "mimetype": "text/x-python",
864 | "name": "python",
865 | "nbconvert_exporter": "python",
866 | "pygments_lexer": "ipython3",
867 | "version": "3.10.4"
868 | },
869 | "orig_nbformat": 4,
870 | "vscode": {
871 | "interpreter": {
872 | "hash": "306379f83cd7ad906f1941e4fa5d943b9cb847ce8f1ed425d8b7a18353685fed"
873 | }
874 | }
875 | },
876 | "nbformat": 4,
877 | "nbformat_minor": 2
878 | }
879 |
--------------------------------------------------------------------------------
/homework_solutions/04_hw_sklearn.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 25,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import pandas as pd\n",
10 | "from sklearn.linear_model import LinearRegression\n",
11 | "from sklearn.metrics import mean_squared_error\n",
12 | "from sklearn.impute import SimpleImputer\n",
13 | "import numpy as np\n",
14 | "import matplotlib.pyplot as plt\n",
15 | "import pickle"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "execution_count": 26,
21 | "metadata": {},
22 | "outputs": [
23 | {
24 | "data": {
25 | "text/html": [
26 | "(HTML table output omitted; identical to the text/plain rendering below)"
79 | ],
80 | "text/plain": [
81 | " CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO \\\n",
82 | "0 3.69 11.37 11.15 0.07 0.87 6.29 68.91 3.77 9.5 410.95 18.37 \n",
83 | "\n",
84 | " B LSTAT \n",
85 | "0 354.47 0.79 "
86 | ]
87 | },
88 | "execution_count": 26,
89 | "metadata": {},
90 | "output_type": "execute_result"
91 | }
92 | ],
93 | "source": [
94 | "rec = pd.DataFrame({\n",
95 | " 'CRIM': [3.69],\n",
96 | " 'ZN': [11.37],\n",
97 | " 'INDUS': [11.15],\n",
98 | " 'CHAS': [0.07],\n",
99 | " 'NOX': [0.87],\n",
100 | " 'RM': [6.29],\n",
101 | " 'AGE': [68.91],\n",
102 | " 'DIS': [3.77],\n",
103 | " 'RAD': [9.50],\n",
104 | " 'TAX': [410.95],\n",
105 | " 'PTRATIO': [18.37],\n",
106 | " 'B': [354.47],\n",
107 | " 'LSTAT': [0.79],\n",
108 | "})\n",
109 | "rec"
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": 27,
115 | "metadata": {},
116 | "outputs": [],
117 | "source": [
118 | "train = pd.read_csv(\"../data/housing/housing_train.csv\")"
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "execution_count": 28,
124 | "metadata": {},
125 | "outputs": [
126 | {
127 | "data": {
128 | "text/plain": [
129 | "NOX 0.879070\n",
130 | "LSTAT 0.795349\n",
131 | "dtype: float64"
132 | ]
133 | },
134 | "execution_count": 28,
135 | "metadata": {},
136 | "output_type": "execute_result"
137 | }
138 | ],
139 | "source": [
140 | "s = train.isna().mean()\n",
141 | "s[s > 0.7]"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": 29,
147 | "metadata": {},
148 | "outputs": [],
149 | "source": [
150 | "train['LSTAT'] = train['LSTAT'].isna()\n",
151 | "train['NOX'] = train['NOX'].isna()"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": 30,
157 | "metadata": {},
158 | "outputs": [],
159 | "source": [
160 | "X = train.iloc[:, :-1]\n",
161 | "y = train['MEDV']"
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": 31,
167 | "metadata": {},
168 | "outputs": [],
169 | "source": [
170 | "imputer = SimpleImputer(strategy=\"median\")\n",
171 | "X = imputer.fit_transform(X)"
172 | ]
173 | },
174 | {
175 | "cell_type": "code",
176 | "execution_count": 32,
177 | "metadata": {},
178 | "outputs": [
179 | {
180 | "data": {
181 | "text/html": [
182 | "(HTML estimator repr omitted)"
183 | ],
184 | "text/plain": [
185 | "LinearRegression()"
186 | ]
187 | },
188 | "execution_count": 32,
189 | "metadata": {},
190 | "output_type": "execute_result"
191 | }
192 | ],
193 | "source": [
194 | "reg = LinearRegression()\n",
195 | "reg.fit(X, y)"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 33,
201 | "metadata": {},
202 | "outputs": [],
203 | "source": [
204 | "y_pred = reg.predict(X)"
205 | ]
206 | },
207 | {
208 | "cell_type": "code",
209 | "execution_count": 34,
210 | "metadata": {},
211 | "outputs": [
212 | {
213 | "name": "stderr",
214 | "output_type": "stream",
215 | "text": [
216 | "/home/piotr/anaconda3/envs/daftacademy-ds/lib/python3.10/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but LinearRegression was fitted without feature names\n",
217 | " warnings.warn(\n"
218 | ]
219 | },
220 | {
221 | "data": {
222 | "text/plain": [
223 | "array([96.67])"
224 | ]
225 | },
226 | "execution_count": 34,
227 | "metadata": {},
228 | "output_type": "execute_result"
229 | }
230 | ],
231 | "source": [
232 | "reg.predict(rec).round(2)"
233 | ]
234 | },
235 | {
236 | "cell_type": "code",
237 | "execution_count": 35,
238 | "metadata": {},
239 | "outputs": [],
240 | "source": [
241 | "rec2 = rec.copy()\n",
242 | "rec2.loc[0, 'RM'] += 2"
243 | ]
244 | },
245 | {
246 | "cell_type": "code",
247 | "execution_count": 39,
248 | "metadata": {},
249 | "outputs": [
250 | {
251 | "data": {
252 | "text/plain": [
253 | "CRIM -0.693008\n",
254 | "ZN 0.210142\n",
255 | "INDUS -0.159057\n",
256 | "CHAS 13.851840\n",
257 | "NOX 20.619870\n",
258 | "RM 25.280527\n",
259 | "AGE -0.160472\n",
260 | "DIS -5.301457\n",
261 | "RAD 1.421446\n",
262 | "TAX -0.057192\n",
263 | "PTRATIO -3.679619\n",
264 | "B 0.055114\n",
265 | "LSTAT 20.989393\n",
266 | "dtype: float64"
267 | ]
268 | },
269 | "execution_count": 39,
270 | "metadata": {},
271 | "output_type": "execute_result"
272 | }
273 | ],
274 | "source": [
275 | "coef = pd.Series(reg.coef_, index=train.columns[:-1])\n",
276 | "coef"
277 | ]
278 | },
279 | {
280 | "cell_type": "code",
281 | "execution_count": 40,
282 | "metadata": {},
283 | "outputs": [
284 | {
285 | "data": {
286 | "text/plain": [
287 | "50.5610537224786"
288 | ]
289 | },
290 | "execution_count": 40,
291 | "metadata": {},
292 | "output_type": "execute_result"
293 | }
294 | ],
295 | "source": [
296 | "coef['RM'] * 2"
297 | ]
298 | },
299 | {
300 | "cell_type": "code",
301 | "execution_count": 41,
302 | "metadata": {},
303 | "outputs": [
304 | {
305 | "name": "stderr",
306 | "output_type": "stream",
307 | "text": [
308 | "/home/piotr/anaconda3/envs/daftacademy-ds/lib/python3.10/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but LinearRegression was fitted without feature names\n",
309 | " warnings.warn(\n",
310 | "/home/piotr/anaconda3/envs/daftacademy-ds/lib/python3.10/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but LinearRegression was fitted without feature names\n",
311 | " warnings.warn(\n"
312 | ]
313 | },
314 | {
315 | "data": {
316 | "text/plain": [
317 | "array([50.56105372])"
318 | ]
319 | },
320 | "execution_count": 41,
321 | "metadata": {},
322 | "output_type": "execute_result"
323 | }
324 | ],
325 | "source": [
326 | "reg.predict(rec2) - reg.predict(rec)"
327 | ]
328 | },
329 | {
330 | "cell_type": "markdown",
331 | "metadata": {},
332 | "source": [
333 | "## Advanced model\n",
334 | "\n",
335 | "We will evaluate a few models on a train/test split of the dataset and then choose the best one for the final training before submission."
336 | ]
337 | },
338 | {
339 | "cell_type": "code",
340 | "execution_count": 42,
341 | "metadata": {},
342 | "outputs": [],
343 | "source": [
344 | "from sklearn.ensemble import RandomForestRegressor\n",
345 | "from lightgbm import LGBMRegressor\n",
346 | "from xgboost import XGBRFRegressor\n",
347 | "from sklearn.model_selection import train_test_split"
348 | ]
349 | },
350 | {
351 | "cell_type": "code",
352 | "execution_count": 43,
353 | "metadata": {},
354 | "outputs": [],
355 | "source": [
356 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2022)"
357 | ]
358 | },
359 | {
360 | "cell_type": "code",
361 | "execution_count": 44,
362 | "metadata": {},
363 | "outputs": [],
364 | "source": [
365 | "models = {\n",
366 | " 'linear_regression': LinearRegression(),\n",
367 | " 'rf': RandomForestRegressor(),\n",
368 | " 'lgbm': LGBMRegressor(),\n",
369 | " 'xgb': XGBRFRegressor(),\n",
370 | " \n",
371 | "}"
372 | ]
373 | },
374 | {
375 | "cell_type": "code",
376 | "execution_count": 51,
377 | "metadata": {},
378 | "outputs": [
379 | {
380 | "name": "stdout",
381 | "output_type": "stream",
382 | "text": [
383 | "Naive baseline y_train.mean() scores MSE 1421.11\n",
384 | "Naive baseline y_train.median() scores MSE 1435.55\n",
385 | "linear_regression scores MSE: 385.24\n",
386 | "rf scores MSE: 338.15\n",
387 | "lgbm scores MSE: 299.47\n",
388 | "xgb scores MSE: 340.21\n"
389 | ]
390 | }
391 | ],
392 | "source": [
393 | "naive_mean_mse = mean_squared_error(y_test, np.tile(y_train.mean(), len(y_test))).round(2)\n",
394 | "naive_median_mse = mean_squared_error(y_test, np.tile(y_train.median(), len(y_test))).round(2)\n",
395 | "print(f\"Naive baseline y_train.mean() scores MSE {naive_mean_mse}\")\n",
396 | "print(f\"Naive baseline y_train.median() scores MSE {naive_median_mse}\")\n",
397 | "for name, model in models.items():\n",
398 | " model.fit(X_train, y_train)\n",
399 | " y_pred = model.predict(X_test)\n",
400 | " score = mean_squared_error(y_test, y_pred)\n",
401 | " print(f\"{name} scores MSE: {score.round(2)}\")\n"
402 | ]
403 | },
404 | {
405 | "cell_type": "code",
406 | "execution_count": 53,
407 | "metadata": {},
408 | "outputs": [
409 | {
410 | "data": {
411 | "text/html": [
412 | "(HTML estimator repr omitted)"
413 | ],
414 | "text/plain": [
415 | "LGBMRegressor()"
416 | ]
417 | },
418 | "execution_count": 53,
419 | "metadata": {},
420 | "output_type": "execute_result"
421 | }
422 | ],
423 | "source": [
424 | "train = pd.read_csv(\"../data/housing/housing_train.csv\")\n",
424 | "# apply the same missing-value indicator trick as for the validation set\n",
424 | "train['LSTAT'] = train['LSTAT'].isna()\n",
424 | "train['NOX'] = train['NOX'].isna()\n",
425 | "X = train.iloc[:, :-1]\n",
426 | "y = train['MEDV']\n",
427 | "\n",
428 | "val = pd.read_csv(\"../data/housing/housing_validation.csv\")\n",
429 | "val['LSTAT'] = val['LSTAT'].isna()\n",
430 | "val['NOX'] = val['NOX'].isna()\n",
431 | "\n",
432 | "imputer2 = SimpleImputer(strategy=\"median\")\n",
433 | "\n",
434 | "X2 = imputer2.fit_transform(X)\n",
435 | "val_t = imputer2.transform(val)\n",
436 | "\n",
437 | "reg2 = LGBMRegressor()\n",
438 | "reg2.fit(X2, y)"
439 | ]
440 | },
441 | {
442 | "cell_type": "code",
443 | "execution_count": 54,
444 | "metadata": {},
445 | "outputs": [],
446 | "source": [
447 | "y_pred = reg2.predict(val_t)\n",
448 | "res = pd.Series(y_pred, name='MEDV')\n",
449 | "res.to_csv(\"light_gbm.csv\")"
450 | ]
451 | },
452 | {
453 | "cell_type": "code",
454 | "execution_count": 55,
455 | "metadata": {},
456 | "outputs": [],
457 | "source": [
458 | "with open(\"model_lgbm_regressor.pkl\", 'wb') as f:\n",
459 | " pickle.dump(reg2, f)"
460 | ]
461 | },
462 | {
463 | "cell_type": "code",
464 | "execution_count": null,
465 | "metadata": {},
466 | "outputs": [],
467 | "source": []
468 | }
469 | ],
470 | "metadata": {
471 | "kernelspec": {
472 | "display_name": "Python 3.10.4 ('daftacademy-ds')",
473 | "language": "python",
474 | "name": "python3"
475 | },
476 | "language_info": {
477 | "codemirror_mode": {
478 | "name": "ipython",
479 | "version": 3
480 | },
481 | "file_extension": ".py",
482 | "mimetype": "text/x-python",
483 | "name": "python",
484 | "nbconvert_exporter": "python",
485 | "pygments_lexer": "ipython3",
486 | "version": "3.10.4"
487 | },
488 | "orig_nbformat": 4,
489 | "vscode": {
490 | "interpreter": {
491 | "hash": "306379f83cd7ad906f1941e4fa5d943b9cb847ce8f1ed425d8b7a18353685fed"
492 | }
493 | }
494 | },
495 | "nbformat": 4,
496 | "nbformat_minor": 2
497 | }
498 |
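The notebook calls `SimpleImputer.fit_transform` on the training frame and `transform` on the validation frame by hand, which makes it easy to apply inconsistent preprocessing to the two sets. Bundling the imputer and the model in a scikit-learn pipeline removes that failure mode; a minimal sketch with toy data (hypothetical values, not the housing dataset):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy data with one missing feature value.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# fit() learns the median on the training data; predict() reuses it,
# so every dataset passed through the pipeline gets the same preprocessing.
pipe = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
pipe.fit(X, y)

# A NaN at prediction time is filled with the training median (2.0),
# so it predicts exactly what an input of 2.0 would.
pred_nan = pipe.predict(np.array([[np.nan]]))
pred_two = pipe.predict(np.array([[2.0]]))
```

The same pattern works with `LGBMRegressor` as the final step, and the whole pipeline can be pickled in one piece, so the imputer's learned medians travel with the model.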
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | aiofiles==22.1.0
2 | aiosqlite==0.18.0
3 | altair==4.2.2
4 | anyio==3.6.2
5 | argon2-cffi==21.3.0
6 | argon2-cffi-bindings==21.2.0
7 | arrow==1.2.3
8 | asttokens==2.2.1
9 | attrs==22.2.0
10 | Babel==2.11.0
11 | backcall==0.2.0
12 | beautifulsoup4==4.11.2
13 | bleach==6.0.0
14 | blinker==1.5
15 | cachetools==5.3.0
16 | cffi==1.15.1
17 | charset-normalizer==3.0.1
18 | click==8.1.3
19 | comm==0.1.2
20 | contourpy==1.0.7
21 | cycler==0.11.0
22 | debugpy==1.6.6
23 | decorator==5.1.1
24 | defusedxml==0.7.1
25 | entrypoints==0.4
26 | et-xmlfile==1.1.0
27 | executing==1.2.0
28 | fastapi==0.92.0
29 | fastjsonschema==2.16.2
30 | fonttools==4.38.0
31 | fqdn==1.5.1
32 | gitdb==4.0.10
33 | GitPython==3.1.31
34 | h11==0.14.0
35 | httptools==0.5.0
36 | idna==3.4
37 | importlib-metadata==6.0.0
38 | ipykernel==6.21.2
39 | ipython==8.10.0
40 | ipython-genutils==0.2.0
41 | ipywidgets==8.0.4
42 | isoduration==20.11.0
43 | jedi==0.18.2
44 | Jinja2==3.1.2
45 | joblib==1.2.0
46 | json5==0.9.11
47 | jsonpointer==2.3
48 | jsonschema==4.17.3
49 | jupyter-events==0.6.3
50 | jupyter-ydoc==0.2.2
51 | jupyter_client==8.0.3
52 | jupyter_core==5.2.0
53 | jupyter_server==2.3.0
54 | jupyter_server_fileid==0.7.0
55 | jupyter_server_terminals==0.4.4
56 | jupyter_server_ydoc==0.6.1
57 | jupyterlab==3.6.1
58 | jupyterlab-pygments==0.2.2
59 | jupyterlab-widgets==3.0.5
60 | jupyterlab_server==2.19.0
61 | kiwisolver==1.4.4
62 | lightgbm==3.3.5
63 | llvmlite==0.39.1
64 | lxml==4.9.2
65 | markdown-it-py==2.1.0
66 | MarkupSafe==2.1.2
67 | matplotlib==3.7.0
68 | matplotlib-inline==0.1.6
69 | mdurl==0.1.2
70 | mistune==2.0.5
71 | nbclassic==0.5.2
72 | nbclient==0.7.2
73 | nbconvert==7.2.9
74 | nbformat==5.7.3
75 | nest-asyncio==1.5.6
76 | notebook==6.5.2
77 | notebook_shim==0.2.2
78 | numba==0.56.4
79 | numpy==1.23.5
80 | openpyxl==3.1.1
81 | packaging==23.0
82 | pandas==1.5.3
83 | pandocfilters==1.5.0
84 | parso==0.8.3
85 | pexpect==4.8.0
86 | pickleshare==0.7.5
87 | Pillow==9.4.0
88 | platformdirs==3.0.0
89 | plotly==5.13.0
90 | polars==0.16.7
91 | prometheus-client==0.16.0
92 | prompt-toolkit==3.0.36
93 | protobuf==3.20.3
94 | psutil==5.9.4
95 | ptyprocess==0.7.0
96 | pure-eval==0.2.2
97 | pyarrow==11.0.0
98 | pycparser==2.21
99 | pydantic==1.10.5
100 | pydeck==0.8.0
101 | Pygments==2.14.0
102 | Pympler==1.0.1
103 | pyparsing==3.0.9
104 | pyrsistent==0.19.3
105 | python-dateutil==2.8.2
106 | python-dotenv==0.21.1
107 | python-json-logger==2.0.6
108 | python-multipart==0.0.5
109 | pytz==2022.7.1
110 | pytz-deprecation-shim==0.1.0.post0
111 | PyYAML==6.0
112 | pyzmq==25.0.0
113 | requests==2.28.2
114 | rfc3339-validator==0.1.4
115 | rfc3986-validator==0.1.1
116 | rich==13.3.1
117 | scikit-learn==1.2.1
118 | scipy==1.10.1
119 | semver==2.13.0
120 | Send2Trash==1.8.0
121 | six==1.16.0
122 | smmap==5.0.0
123 | sniffio==1.3.0
124 | soupsieve==2.4
125 | stack-data==0.6.2
126 | starlette==0.25.0
127 | streamlit==1.18.1
128 | tenacity==8.2.1
129 | terminado==0.17.1
130 | threadpoolctl==3.1.0
131 | tinycss2==1.2.1
132 | toml==0.10.2
133 | tomli==2.0.1
134 | toolz==0.12.0
135 | tornado==6.2
136 | traitlets==5.9.0
137 | typing_extensions==4.5.0
138 | tzdata==2022.7
139 | tzlocal==4.2
140 | uri-template==1.2.0
141 | urllib3==1.26.14
142 | uvicorn==0.20.0
143 | uvloop==0.17.0
144 | validators==0.20.0
145 | watchdog==2.2.1
146 | watchfiles==0.18.1
147 | wcwidth==0.2.6
148 | webcolors==1.12
149 | webencodings==0.5.1
150 | websocket-client==1.5.1
151 | websockets==10.4
152 | widgetsnbextension==4.0.5
153 | xgboost==1.7.4
154 | y-py==0.5.4
155 | ypy-websocket==0.8.2
156 | zipp==3.14.0
157 | zstandard==0.19.0
158 |
--------------------------------------------------------------------------------
/requirements_loose.txt:
--------------------------------------------------------------------------------
1 | fastapi
2 | ipykernel
3 | ipython
4 | ipywidgets
5 | jupyterlab
6 | lightgbm
7 | lxml
8 | matplotlib
9 | numba
10 | numpy
11 | openpyxl
12 | pandas
13 | plotly
14 | Pillow
15 | polars
16 | python-multipart
17 | scikit-learn
18 | streamlit
19 | uvicorn[standard]
20 | xgboost
21 | zstandard
22 |
--------------------------------------------------------------------------------