├── .gitignore ├── 0_prerequisites ├── Introduction to Machine Learning.pptx ├── README.md └── python_tasks.md ├── 1_data_manipulations ├── Pandas_data_manipulations_tasks.ipynb ├── README.md ├── Seminar_pandas.ipynb └── googleplaystore.csv ├── 2_data_exploration ├── README.md └── eda.ipynb ├── 3_linear_regression ├── README.md └── linear_regression.ipynb ├── 4_overfitting_regularization ├── README.md └── overfitting_regularization.ipynb ├── 5_classification_linear_knn ├── README.md ├── examples.ipynb └── hw_classification.ipynb ├── 6_feature_engineering_selection ├── Melbourne_housing_FULL.csv ├── README.md ├── feature_engineering_selection.ipynb └── homework.ipynb ├── 7_trees_and_ensembles ├── README.md ├── data │ └── sonar-all-data.csv ├── grid_random_search.png ├── rf_classifier.ipynb └── tests.py ├── 8_clustering ├── README.md ├── clustering.ipynb └── slides │ ├── clustering_examples_bad.png │ ├── silhouette.png │ ├── slides_clustering.pdf │ ├── slides_clustering.tex │ ├── slides_dimensionality_reduction.pdf │ └── slides_dimensionality_reduction.tex ├── 9_evaluation_selection ├── HOMEWORK.md └── README.md └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # Unit test / coverage reports 7 | htmlcov/ 8 | .tox/ 9 | .nox/ 10 | .coverage 11 | .coverage.* 12 | .cache 13 | nosetests.xml 14 | coverage.xml 15 | *.cover 16 | *.py,cover 17 | .hypothesis/ 18 | .pytest_cache/ 19 | 20 | # Jupyter Notebook 21 | .ipynb_checkpoints 22 | 23 | # IPython 24 | profile_default/ 25 | ipython_config.py 26 | 27 | # pyenv 28 | .python-version 29 | 30 | # Environments 31 | .env 32 | .venv 33 | env/ 34 | venv/ 35 | ENV/ 36 | env.bak/ 37 | venv.bak/ 38 | -------------------------------------------------------------------------------- /0_prerequisites/Introduction to Machine Learning.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rolling-scopes-school/ml-intro/e7fec69ba996da03354b440107a66f26defa95c9/0_prerequisites/Introduction to Machine Learning.pptx -------------------------------------------------------------------------------- /0_prerequisites/README.md: -------------------------------------------------------------------------------- 1 | # Prerequisites 2 | 3 | Although this is an introductory course we assume that you have some basic Python knowledge and you can figure out the environment settings yourself. We accept all graded assignments with a **Python 3.8+** code in **Jupyter Notebook**. 4 | 5 | ## Environment and additional libraries 6 | You need to install [Anaconda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) which means you are getting a version of Python 3.8 automatically and then follow the [Jupyter Notebook](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/) quick guide to make sure that everything works correctly. If you are a more advanced user with Python already installed and prefer to manage your packages manually, you can just use [pip](https://jupyter.org/install). Answers to questions like "What is Jupyter Notebook?" or "How to use it?" can be found in this [short intro guide](https://www.dataquest.io/blog/jupyter-notebook-tutorial/) (read first chapter up to "Example Analysis").

7 | If you don't have enough computer power to complete the steps above or something just went wrong and you are stuck, feel free to use [Google Colab](https://colab.research.google.com/)! It has mostly everything pre-installed and all you need for using it is your web browser. Check out a [quick start guide](https://medium.com/@dinaelhanan/an-absolute-beginners-guide-to-google-colaboratory-d55c0eb375de) or a [detailed overview](https://www.tutorialspoint.com/google_colab/google_colab_quick_guide.htm). 8 | 9 | ## Python 10 | 11 | We assume that you have solid programming skills and experience with any OO or functional language. Hence learning Python should not be a problem for you. 12 | 13 | If you are already familiar with the Python language, you can test whether your knowledge is enough to dive into the course. Don't worry if you cannot complete all of the tasks provided; fully completing the first part (Fundamentals, Strings, Arrays) is a good enough start. 14 | ### Tasks for self-assessment below 15 | [Python](https://github.com/rolling-scopes-school/ml-intro/blob/2022/0_prerequisites/python_tasks.md) 16 | 17 | If Python is a new language for you, please refer to a [quick start guide](https://www.stavros.io/tutorials/python/) and then try to do the first few tasks. Later on, when you need a deeper understanding of language concepts, please refer to [Learning Python](https://learning-python.com/about-lp.html) by Mark Lutz, chapters 4, 5, 7, 8, 12, 13, 14, 27, 29. 18 | 19 | You should also take a look at the [Numpy library](https://cs231n.github.io/python-numpy-tutorial/#numpy), which we are going to work with closely. 20 | 21 | ## Linear Algebra 22 | If you took a relevant course at university or studied Linear Algebra yourself, you may freshen up your knowledge by simply going through this [cheatsheet](https://stanford.edu/~shervine/teaching/cs-229/refresher-algebra-calculus). If those concepts are unfamiliar to you, you may take a look at the [Linear Algebra for ML blog posts](https://programmathically.com/linear-algebra-for-machine-learning/), reading all of the chapters in order. If you are a fan of video lectures, please refer to this excellent [Essence of linear algebra](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab) series. 23 | 24 | ## Calculus 25 | You don't need much Calculus to feel confident with the lectures and assignments in the course. However, if the concepts of _derivative_ and _gradient_ raise questions for you, please take a look at the [Calculus for ML blog posts](https://programmathically.com/calculus-for-machine-learning/). You need only the first seven posts, up to _The Multivariable Chain Rule_. Alternatively, you may watch the first four videos of [Essence of calculus](https://www.youtube.com/playlist?list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr), up to _Visualizing the chain rule and product rule_. 26 | 27 | ## Probability and statistics 28 | You may refresh your knowledge by going through this [Probability for ML cheatsheet](https://stanford.edu/~shervine/teaching/cme-106/cheatsheet-probability). If some concepts are not clear, please refer to [Probability and Statistics for ML](https://programmathically.com/probability-and-statistics-for-machine-learning-and-data-science/). You need only the first eight posts of the _Basic Statistics and Probability_ chapter, up to _Normal Distribution and Gaussian Random Variables_.
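A quick way to make sure the environment described above is ready for the course is to run a short sanity check in a fresh Jupyter Notebook cell. This snippet is illustrative only (it is not part of any graded assignment); it simply confirms that your interpreter version and the core libraries used throughout the course import correctly.

```python
# Environment sanity check: run this in a new Jupyter Notebook cell.
import sys

import numpy as np
import pandas as pd

print("Python:", sys.version.split()[0])  # expect 3.8 or newer
print("NumPy:", np.__version__)
print("pandas:", pd.__version__)

# Tiny NumPy check: vectorized arithmetic works without explicit loops.
a = np.arange(5)
assert (a * 2).sum() == 20
print("Environment looks good!")
```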
29 | -------------------------------------------------------------------------------- /0_prerequisites/python_tasks.md: -------------------------------------------------------------------------------- 1 | ## Python tasks for self-assessment 2 | 3 | ### Fundamentals, Strings, Arrays 4 | - [Descending Order](https://www.codewars.com/kata/5467e4d82edf8bbf40000155) 5 | - [Square every digit](https://www.codewars.com/kata/546e2562b03326a88e000020) 6 | - [Find the odd int](https://www.codewars.com/kata/54da5a58ea159efa38000836) 7 | - [Persistent Bugger](https://www.codewars.com/kata/55bf01e5a717a0d57e0000ec) 8 | - [Counting Duplicates](https://www.codewars.com/kata/54bf1c2cd5b56cc47f0007a1) 9 | - [Who likes it](https://www.codewars.com/kata/5266876b8f4bf2da9b000362) 10 | - [Snail](https://www.codewars.com/kata/521c2db8ddc89b9b7a0000c1) 11 | 12 | ### Regular expressions, Strings 13 | - [Disemvowel Trolls](https://www.codewars.com/kata/52fba66badcd10859f00097e) 14 | - [Validate PIN code](https://www.codewars.com/kata/55f8a9c06c018a0d6e000132) 15 | - [String to Camel Case](https://www.codewars.com/kata/517abf86da9663f1d2000003) 16 | - [Split Strings](https://www.codewars.com/kata/515de9ae9dcfc28eb6000001) 17 | - [Extract the domain name](https://www.codewars.com/kata/514a024011ea4fb54200004b) 18 | 19 | ### OOP 20 | - [Classy Extensions](https://www.codewars.com/kata/55a14aa4817efe41c20000bc) 21 | - [Classy Classes](https://www.codewars.com/kata/55a144eff5124e546400005a) 22 | - [Interactive Dictionary](https://www.codewars.com/kata/57a93f93bb9944516d0000c1) 23 | - [Default List](https://www.codewars.com/kata/5e882048999e6c0023412908) 24 | - [Who has the most money](https://www.codewars.com/kata/528d36d7cc451cd7e4000339) 25 | - [Vector class](https://www.codewars.com/kata/526dad7f8c0eb5c4640000a4) 26 | -------------------------------------------------------------------------------- /1_data_manipulations/Pandas_data_manipulations_tasks.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "6d6010da", 6 | "metadata": {}, 7 | "source": [ 8 | "# How to complete this assignment\n", 9 | "First, download [this Kaggle dataset](https://www.kaggle.com/hugomathien/soccer) and extract *sqlite* database. You may need to register at https://www.kaggle.com/ beforehand. Then complete 15 graded tasks below, the score is given in brackets. Finally submit the resulting `.ipynb` file to rs-app Auto-test.\n", 10 | "\n", 11 | "- Do not delete or rename the variables given before the inscription `#your code here`, they are needed for the correct verification.\n", 12 | "- Do not change the code in the last Notebook cell, it is required for the server check.\n", 13 | "- Your Notebook must run completely without errors to be graded! 
Please check everything before submission by going *Cell -> Run All*" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "id": "0dbd5f9a", 19 | "metadata": {}, 20 | "source": [ 21 | "## Some important notes\n", 22 | "- If you need to **calculate the number of \"something\"** that means we expect you to assign an Integer to the given variable\n", 23 | "- If you need to **make a list of \"something\"** we expect you to assign a Python list with appropriate values to the given variable\n", 24 | "- If you need to find a **specifiс player, day of the week, team, etc.** we expect you to assign a String with the full name of the entity to the given variable (`player_name`, day of week full name, `team_name`, etc.)" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "id": "f52b1bac", 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "import sqlite3\n", 35 | "import pandas as pd\n", 36 | "import os\n", 37 | "\n", 38 | "pd.set_option('display.max_column', None)" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "id": "8ebe6afd", 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "# Leave that code unchanged, it is required for the server check!\n", 49 | "db = sqlite3.connect(os.environ.get(\"DB_PATH\") or 'database.sqlite')" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "id": "9860d0d0", 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "# You may load the data from SQL table directly to the Pandas dataframe as\n", 60 | "player_data = pd.read_sql(\"SELECT * FROM Player;\", db)" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "id": "7e69a7af", 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [ 70 | "player_data.head()" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "id": "f8b23f3a", 76 | "metadata": {}, 77 | "source": [ 78 | "**Task 1 (0.25 point).** Calculate the number of players with a height between 180 and 190 inclusive" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "id": "7cd6f780", 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "players_180_190 = # Your code here" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "id": "9d058065", 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [ 98 | "assert(isinstance(players_180_190, int))" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "id": "5a39f3bc", 104 | "metadata": {}, 105 | "source": [ 106 | "**Task 2 (0.25 point).** Calculate the number of players born in 1980.
\n", 107 | "**Hint:** you may want to cast your 'birthday' column to DateTime type by [pandas.to_datetime](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html)" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "id": "ff21f7a2", 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "players_1980 = # Your code here" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "id": "e53cc066", 124 | "metadata": {}, 125 | "outputs": [], 126 | "source": [ 127 | "assert(isinstance(players_1980, int))" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "id": "70d1dea0", 133 | "metadata": {}, 134 | "source": [ 135 | "**Task 3 (0.25 point).** Make a list of the top 10 players with the highest weight sorted in descending order. If there are several players with the same weight put them in the lexicographic order by name." 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": null, 141 | "id": "b0dbdaf5", 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [ 145 | "highest_players = # Your code here" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "id": "40dabe0d", 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "assert(len(highest_players) == 10)\n", 156 | "assert(isinstance(highest_players, list))\n", 157 | "for i in range(10):\n", 158 | " assert(isinstance(highest_players[i], str))" 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "id": "ff30607f", 164 | "metadata": {}, 165 | "source": [ 166 | "**Task 4 (0.5 point).** Make a list of tuples containing years along with the number of players born in that year from 1980 up to 1990.
\n", 167 | "**Structure example**: [(1980, 123), (1981, 140) ..., (1990, 83)] -> There were born 123 players in 1980, there were born 140 players in 1981 and etc." 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "id": "9b609f1c", 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [ 177 | "years_born_players = # Your code here" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "id": "64cbf754", 184 | "metadata": {}, 185 | "outputs": [], 186 | "source": [ 187 | "assert(len(years_born_players) == 11)\n", 188 | "assert(isinstance(years_born_players, list))\n", 189 | "for i in range(10):\n", 190 | " assert(isinstance(years_born_players[i], tuple))\n", 191 | " assert(isinstance(years_born_players[i][0], int))\n", 192 | " assert(isinstance(years_born_players[i][1], int))" 193 | ] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "id": "33cbd931", 198 | "metadata": {}, 199 | "source": [ 200 | "**Task 5 (0.5 point).** Calculate the mean and the standard deviation of the players' **height** with the name **Adriano**.
\n", 201 | "**Note:** Name is represented by the first part of `player_name`." 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "id": "614fac31", 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "adriano_mean, adriano_std = # Your code here" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "id": "f508c49f", 218 | "metadata": {}, 219 | "outputs": [], 220 | "source": [ 221 | "assert(isinstance(adriano_mean, float))\n", 222 | "assert(isinstance(adriano_std, float))" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "id": "8a361dfd", 228 | "metadata": {}, 229 | "source": [ 230 | "**Task 6 (0.75 point).** How many players were born on each day of the week? Find the day of the week with the minimum number of players born." 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "id": "c140be4f", 237 | "metadata": {}, 238 | "outputs": [], 239 | "source": [ 240 | "dow_with_min_players_born = # Your code here" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": null, 246 | "id": "fc041623", 247 | "metadata": {}, 248 | "outputs": [], 249 | "source": [ 250 | "assert(isinstance(dow_with_min_players_born, str))" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "id": "dc7428be", 256 | "metadata": {}, 257 | "source": [ 258 | "**Task 7 (0.75 point).** Find a league with the most matches in total. If there are several leagues with the same amount of matches, take the first in the lexical order." 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": null, 264 | "id": "ff3113ac", 265 | "metadata": {}, 266 | "outputs": [], 267 | "source": [ 268 | "league_most_matches = # Your code here" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "id": "390a265b", 275 | "metadata": {}, 276 | "outputs": [], 277 | "source": [ 278 | "assert(isinstance(league_most_matches, str))" 279 | ] 280 | }, 281 | { 282 | "cell_type": "markdown", 283 | "id": "97199b7d", 284 | "metadata": {}, 285 | "source": [ 286 | "**Task 8 (1.25 point).** Find a player who participated in the largest number of matches during the whole match history. Assign a `player_name` to the given variable" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": null, 292 | "id": "ec31bc69", 293 | "metadata": {}, 294 | "outputs": [], 295 | "source": [ 296 | "max_matches_player = # Your code here" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": null, 302 | "id": "00ec2e89", 303 | "metadata": {}, 304 | "outputs": [], 305 | "source": [ 306 | "assert(isinstance(max_matches_player, str))" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "id": "dbc68bfe", 312 | "metadata": {}, 313 | "source": [ 314 | "**Task 9 (1.5 point).** List top-5 tuples of most correlated **player's characteristics** in the descending order of the absolute [Pearson's coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) value.\n", 315 | "\n", 316 | "**Note 1:** Players characteristics are all the columns in `Player_Attributes` table except `[id, player_fifa_api_id, player_api_id, date, preferred_foot, attacking_work_rate, defensive_work_rate]`).
\n", 317 | "**Note 2:** Exclude duplicated pairs from the list. E.g. ('gk_handling', 'gk_reflexes') and ('gk_reflexes', 'gk_handling') are duplicates, leave just one of them in the resulting list.\n", 318 | "\n", 319 | "**Hint:** You may use [dataframe.corr()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) for calculating pairwise Pearson correlation." 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": null, 325 | "id": "47c1412e", 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [ 329 | "top_correlated_features = # Your code here" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "id": "67acd6bf", 336 | "metadata": {}, 337 | "outputs": [], 338 | "source": [ 339 | "assert(len(top_correlated_features) == 5)\n", 340 | "assert(isinstance(top_correlated_features, list))\n", 341 | "for i in range(5):\n", 342 | " assert(isinstance(top_correlated_features[i], tuple))\n", 343 | " assert(isinstance(top_correlated_features[i][0], str))\n", 344 | " assert(isinstance(top_correlated_features[i][1], str))" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "id": "7d3d8fd3", 350 | "metadata": {}, 351 | "source": [ 352 | "**Task 10 (2 points).** Find top-5 most similar players to **Neymar** whose names are given. The similarity is measured as [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) between vectors of players' characteristics (described in the task above). Put their names in a vector in ascending order by Euclidean distance and sorted by `player_name` if the distance is the same
\n", 353 | "**Note 1:** There are many records for some players in the `Player_Attributes` table. You need to take the freshest data (characteristics with the most recent `date`).
\n", 354 | "**Note 2:** Use pure values of the characteristics even if you are aware of such preprocessing technics as normalization.
\n", 355 | "**Note 3:** Please avoid using any built-in methods for calculating the Euclidean distance between vectors, think about implementing your own." 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": null, 361 | "id": "fac5a571", 362 | "metadata": {}, 363 | "outputs": [], 364 | "source": [ 365 | "neymar_similarities = # Your code here" 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": null, 371 | "id": "ddb1876a", 372 | "metadata": {}, 373 | "outputs": [], 374 | "source": [ 375 | "assert(len(neymar_similarities) == 5)\n", 376 | "assert(isinstance(neymar_similarities, list))\n", 377 | "for i in range(5):\n", 378 | " assert(isinstance(neymar_similarities[i], str))" 379 | ] 380 | }, 381 | { 382 | "cell_type": "markdown", 383 | "id": "a3a7f878", 384 | "metadata": {}, 385 | "source": [ 386 | "**Task 11 (1 point).** Calculate the number of home matches played by the **Borussia Dortmund** team in **Germany 1. Bundesliga** in season **2008/2009**" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": null, 392 | "id": "bdf5a267", 393 | "metadata": {}, 394 | "outputs": [], 395 | "source": [ 396 | "borussia_bundesliga_2008_2009_matches = # Your code here" 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": null, 402 | "id": "488fdd4c", 403 | "metadata": {}, 404 | "outputs": [], 405 | "source": [ 406 | "assert(isinstance(borussia_bundesliga_2008_2009_matches, int))" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "id": "69dca9a5", 412 | "metadata": {}, 413 | "source": [ 414 | "**Task 12 (1 point).** Find a team having the most matches (both home and away!) in the **Germany 1. Bundesliga** in **2008/2009** season. Return number of matches." 415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": null, 420 | "id": "9969ba5c", 421 | "metadata": {}, 422 | "outputs": [], 423 | "source": [ 424 | "team_most_matches_bundesliga_2008_2009 = # Your code here" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": null, 430 | "id": "ef3b8fa2", 431 | "metadata": {}, 432 | "outputs": [], 433 | "source": [ 434 | "assert(isinstance(team_most_matches_bundesliga_2008_2009, int))" 435 | ] 436 | }, 437 | { 438 | "cell_type": "markdown", 439 | "id": "2f3c65aa", 440 | "metadata": {}, 441 | "source": [ 442 | "**Task 13 (1 point).** Count total number of **Arsenal** matches (both home and away!) in the **2015/2016** season which they have won.

\n", 443 | "**Note:** Winning a game means scoring **more** goals than an opponent." 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": null, 449 | "id": "52456f34", 450 | "metadata": {}, 451 | "outputs": [], 452 | "source": [ 453 | "arsenal_won_matches_2015_2016 = # Your code here" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": null, 459 | "id": "214d9689", 460 | "metadata": {}, 461 | "outputs": [], 462 | "source": [ 463 | "assert(isinstance(arsenal_won_matches_2015_2016, int))" 464 | ] 465 | }, 466 | { 467 | "cell_type": "markdown", 468 | "id": "d70d5b23", 469 | "metadata": {}, 470 | "source": [ 471 | "**Task 14 (2 points).** Find a team with the highest win rate in the **2015/2016** season. Win rate means won matches / all matches. If there are several teams with the highest win rate return the first by name in lexical order" 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": null, 477 | "id": "b1aa7db0", 478 | "metadata": {}, 479 | "outputs": [], 480 | "source": [ 481 | "team_highest_winrate_2015_2016 = # Your code here" 482 | ] 483 | }, 484 | { 485 | "cell_type": "code", 486 | "execution_count": null, 487 | "id": "b4cc8e46", 488 | "metadata": {}, 489 | "outputs": [], 490 | "source": [ 491 | "assert(isinstance(team_highest_winrate_2015_2016, str))" 492 | ] 493 | }, 494 | { 495 | "cell_type": "markdown", 496 | "id": "f7f3b4f5", 497 | "metadata": {}, 498 | "source": [ 499 | "**Task 15 (2 points).** Determine the team with the maximum days' gap between matches in **England Premier League 2010/2011 season**. Return number of days in that gap.
\n", 500 | "**Note**: a *gap* means the number of days between two consecutive matches of the same team." 501 | ] 502 | }, 503 | { 504 | "cell_type": "code", 505 | "execution_count": null, 506 | "id": "a4c33e5f", 507 | "metadata": {}, 508 | "outputs": [], 509 | "source": [ 510 | "highest_gap_england_2010_2011 = # Your code here" 511 | ] 512 | }, 513 | { 514 | "cell_type": "code", 515 | "execution_count": null, 516 | "id": "5f7aa84e", 517 | "metadata": {}, 518 | "outputs": [], 519 | "source": [ 520 | "assert(isinstance(highest_gap_england_2010_2011, int))" 521 | ] 522 | }, 523 | { 524 | "cell_type": "markdown", 525 | "id": "acecc77f", 526 | "metadata": {}, 527 | "source": [ 528 | "### Warning! Do not change anything in the area below" 529 | ] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "execution_count": null, 534 | "id": "94c3b9be", 535 | "metadata": {}, 536 | "outputs": [], 537 | "source": [ 538 | "with open('student_answers.txt', 'w') as file:\n", 539 | " file.write(f\"{players_180_190}\\n\")\n", 540 | " file.write(f\"{players_1980}\\n\")\n", 541 | " file.write(f\"{highest_players}\\n\")\n", 542 | " file.write(f\"{years_born_players}\\n\")\n", 543 | " file.write(f\"{round(adriano_mean, 3)} {round(adriano_std, 3)}\\n\")\n", 544 | " file.write(f\"{dow_with_min_players_born}\\n\")\n", 545 | " file.write(f\"{league_most_matches}\\n\")\n", 546 | " file.write(f\"{max_matches_player}\\n\")\n", 547 | " file.write(f\"{';'.join(['%s,%s' % tup for tup in top_correlated_features])};\\n\")\n", 548 | " file.write(f\"{neymar_similarities}\\n\")\n", 549 | " file.write(f\"{borussia_bundesliga_2008_2009_matches}\\n\")\n", 550 | " file.write(f\"{team_most_matches_bundesliga_2008_2009}\\n\")\n", 551 | " file.write(f\"{arsenal_won_matches_2015_2016}\\n\")\n", 552 | " file.write(f\"{team_highest_winrate_2015_2016}\\n\")\n", 553 | " file.write(f\"{highest_gap_england_2010_2011}\\n\")" 554 | ] 555 | } 556 | ], 557 | "metadata": { 558 | "kernelspec": { 559 | "display_name": "Python 3 (ipykernel)", 560 | "language": "python", 561 | "name": "python3" 562 | }, 563 | "language_info": { 564 | "codemirror_mode": { 565 | "name": "ipython", 566 | "version": 3 567 | }, 568 | "file_extension": ".py", 569 | "mimetype": "text/x-python", 570 | "name": "python", 571 | "nbconvert_exporter": "python", 572 | "pygments_lexer": "ipython3", 573 | "version": "3.8.12" 574 | } 575 | }, 576 | "nbformat": 4, 577 | "nbformat_minor": 5 578 | } 579 | -------------------------------------------------------------------------------- /1_data_manipulations/README.md: -------------------------------------------------------------------------------- 1 | # Title 2 | Introduction to the Pandas library. 3 | 4 | # Goal 5 | - Learn the main approaches to extract the necessary information from tables using Pandas API 6 | - Understand the concepts of the Dataframe & Series and learn how to use in-built SQL-like functions with them 7 | 8 | # Topics 9 | - Filtering, sorting 10 | - Grouping and calculating statistics 11 | - Joining, merging 12 | - Applying custom functions to Dataframe 13 | - Rolling window functions 14 | 15 | # Materials 16 | - [Topic 1 in Open Data Science course](https://habr.com/ru/company/ods/blog/322626/). 1 hour 17 | - [Official Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/index.html) and a short [user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html). 
1 hour 18 | - One more [tutorial](https://www.freecodecamp.org/news/the-ultimate-guide-to-the-pandas-library-for-data-science-in-python/). 2 часа 19 | 20 | # Exercises for self-study 21 | - https://www.kaggle.com/learn/pandas 22 | 23 | # Assignments 24 | [Pandas_data_manipulations.ipynb](./Pandas_data_manipulations_tasks.ipynb) 25 | -------------------------------------------------------------------------------- /1_data_manipulations/Seminar_pandas.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "slideshow": { 7 | "slide_type": "slide" 8 | } 9 | }, 10 | "source": [ 11 | "# NumPy" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": { 17 | "slideshow": { 18 | "slide_type": "subslide" 19 | } 20 | }, 21 | "source": [ 22 | "**Numeric Python** is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays." 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": { 28 | "slideshow": { 29 | "slide_type": "slide" 30 | } 31 | }, 32 | "source": [ 33 | "The First Rule of NumPy" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": { 40 | "slideshow": { 41 | "slide_type": "fragment" 42 | } 43 | }, 44 | "outputs": [], 45 | "source": [ 46 | "import numpy as np" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": { 52 | "slideshow": { 53 | "slide_type": "fragment" 54 | } 55 | }, 56 | "source": [ 57 | "The Second Rule of NumPy" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": { 64 | "slideshow": { 65 | "slide_type": "fragment" 66 | } 67 | }, 68 | "outputs": [], 69 | "source": [ 70 | "# You don't need cycles\n", 71 | "\n", 72 | "first_arr = np.array([1, 2, 3, 4, 5])\n", 73 | "second_arr = np.copy(first_arr)\n", 74 | "\n", 75 | "# Instead of\n", 76 | "for i in range(len(arr)):\n", 77 | " if first_arr[i] == 3 or first_arr[i] == 4:\n", 78 | " first_arr[i] = 0\n", 79 | "# Do\n", 80 | "second_arr[(second_arr == 4) | (second_arr == 3)] = 0\n", 81 | "\n", 82 | "assert((first_arr == second_arr).all())" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": { 88 | "slideshow": { 89 | "slide_type": "slide" 90 | } 91 | }, 92 | "source": [ 93 | "Array creation" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": { 100 | "slideshow": { 101 | "slide_type": "fragment" 102 | } 103 | }, 104 | "outputs": [], 105 | "source": [ 106 | "a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 8])\n", 107 | "print(type(a))\n", 108 | "print(a.shape)\n", 109 | "print(a.dtype)" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": { 115 | "slideshow": { 116 | "slide_type": "slide" 117 | } 118 | }, 119 | "source": [ 120 | "Assigning and appending an element to an array" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": { 127 | "slideshow": { 128 | "slide_type": "fragment" 129 | } 130 | }, 131 | "outputs": [], 132 | "source": [ 133 | "a[8] = 9\n", 134 | "a = np.append(a, 10)\n", 135 | "print(a)" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": { 141 | "slideshow": { 142 | "slide_type": "slide" 143 | } 144 | }, 145 | "source": [ 146 | "Standard Python slicing syntax" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | 
"metadata": { 153 | "slideshow": { 154 | "slide_type": "fragment" 155 | } 156 | }, 157 | "outputs": [], 158 | "source": [ 159 | "print(a)\n", 160 | "print('\\n')\n", 161 | "\n", 162 | "print(a[0:5])\n", 163 | "print(a[0:5:2])\n", 164 | "print(a[0:-1])\n", 165 | "print(a[4::-1])\n", 166 | "print(a[5:0:-2])" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": { 172 | "slideshow": { 173 | "slide_type": "slide" 174 | } 175 | }, 176 | "source": [ 177 | "In-built calculation of different statistics" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "metadata": { 184 | "slideshow": { 185 | "slide_type": "fragment" 186 | } 187 | }, 188 | "outputs": [], 189 | "source": [ 190 | "print('Vector max %d, min %d, mean %.2f, median %.2f, stardard deviation %.2f and total sum %d' %\n", 191 | " (a.max(), np.min(a), a.mean(), np.median(a), a.std(), a.sum()))" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": { 197 | "slideshow": { 198 | "slide_type": "slide" 199 | } 200 | }, 201 | "source": [ 202 | "Filtering on condition (masking)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": { 209 | "slideshow": { 210 | "slide_type": "fragment" 211 | } 212 | }, 213 | "outputs": [], 214 | "source": [ 215 | "a[a > a.mean()]" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": { 221 | "slideshow": { 222 | "slide_type": "slide" 223 | } 224 | }, 225 | "source": [ 226 | "Sorting" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": null, 232 | "metadata": { 233 | "slideshow": { 234 | "slide_type": "fragment" 235 | } 236 | }, 237 | "outputs": [], 238 | "source": [ 239 | "# Sorted array\n", 240 | "print(np.sort(a))\n", 241 | "\n", 242 | "# Order of indices in sorted array\n", 243 | "print(np.argsort(a))" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": { 249 | "slideshow": { 250 | "slide_type": "slide" 251 | } 252 | }, 253 | "source": [ 254 | "Vector operations" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "metadata": { 261 | "slideshow": { 262 | "slide_type": "fragment" 263 | } 264 | }, 265 | "outputs": [], 266 | "source": [ 267 | " a = np.array([1, 2, 3])\n", 268 | "b = np.array([2, 3, 4])\n", 269 | "a * b\n", 270 | "a - b\n", 271 | "a + b" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "metadata": { 277 | "slideshow": { 278 | "slide_type": "slide" 279 | } 280 | }, 281 | "source": [ 282 | "2D arrays (matrices)" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": null, 288 | "metadata": { 289 | "slideshow": { 290 | "slide_type": "fragment" 291 | } 292 | }, 293 | "outputs": [], 294 | "source": [ 295 | "m_a = np.array([[1, 2, 3, 4]\n", 296 | " ,[13, 3, 8, 2]\n", 297 | " ,[8, 7, 2, 3]])\n", 298 | "print(m_a.shape)" 299 | ] 300 | }, 301 | { 302 | "cell_type": "markdown", 303 | "metadata": { 304 | "slideshow": { 305 | "slide_type": "slide" 306 | } 307 | }, 308 | "source": [ 309 | "Statistics calculation" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": null, 315 | "metadata": { 316 | "slideshow": { 317 | "slide_type": "fragment" 318 | } 319 | }, 320 | "outputs": [], 321 | "source": [ 322 | "print(m_a.max())\n", 323 | "print(m_a.max(axis=0))\n", 324 | "print(m_a.max(axis=1))" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": { 330 | "slideshow": { 331 | "slide_type": "slide" 332 | } 333 | }, 
334 | "source": [ 335 | "Sorting" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": { 342 | "slideshow": { 343 | "slide_type": "fragment" 344 | } 345 | }, 346 | "outputs": [], 347 | "source": [ 348 | "print(np.sort(m_a, axis=0))\n", 349 | "print(np.sort(m_a, axis=1))" 350 | ] 351 | }, 352 | { 353 | "cell_type": "markdown", 354 | "metadata": { 355 | "slideshow": { 356 | "slide_type": "slide" 357 | } 358 | }, 359 | "source": [ 360 | "# Intro to Pandas data structures" 361 | ] 362 | }, 363 | { 364 | "cell_type": "code", 365 | "execution_count": null, 366 | "metadata": { 367 | "slideshow": { 368 | "slide_type": "subslide" 369 | } 370 | }, 371 | "outputs": [], 372 | "source": [ 373 | "import pandas as pd" 374 | ] 375 | }, 376 | { 377 | "cell_type": "markdown", 378 | "metadata": { 379 | "slideshow": { 380 | "slide_type": "slide" 381 | } 382 | }, 383 | "source": [ 384 | "## Series" 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": { 390 | "slideshow": { 391 | "slide_type": "subslide" 392 | } 393 | }, 394 | "source": [ 395 | "[Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": null, 401 | "metadata": { 402 | "slideshow": { 403 | "slide_type": "fragment" 404 | } 405 | }, 406 | "outputs": [], 407 | "source": [ 408 | "s = pd.Series(data=[39.4, 91.2, 80.5, 20.3, 4.2, -13.4]\n", 409 | " ,index=['first', 'second', 'second', 'third', 'forth', 'fifth'])\n", 410 | "print(type(s))\n", 411 | "print(s.shape)\n", 412 | "print(s.dtype)\n", 413 | "print(s['second'])" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": null, 419 | "metadata": { 420 | "slideshow": { 421 | "slide_type": "slide" 422 | } 423 | }, 424 | "outputs": [], 425 | "source": [ 426 | "s = pd.Series(data=[39.4, 91.2, 20.3, 4.2, -13.4])\n", 427 | "s" 428 | ] 429 | }, 430 | { 431 | "cell_type": "markdown", 432 | "metadata": { 433 | "slideshow": { 434 | "slide_type": "slide" 435 | } 436 | }, 437 | "source": [ 438 | "Series acts very similarly to a [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html), and is a valid argument to most [NumPy](https://numpy.org/doc/stable/user/whatisnumpy.html) functions. However, operations such as slicing will also slice the index." 
439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": null, 444 | "metadata": { 445 | "slideshow": { 446 | "slide_type": "fragment" 447 | } 448 | }, 449 | "outputs": [], 450 | "source": [ 451 | "s[1:4]" 452 | ] 453 | }, 454 | { 455 | "cell_type": "markdown", 456 | "metadata": { 457 | "slideshow": { 458 | "slide_type": "slide" 459 | } 460 | }, 461 | "source": [ 462 | "In-built statistics calculation is the same as in NumPy" 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": null, 468 | "metadata": { 469 | "slideshow": { 470 | "slide_type": "fragment" 471 | } 472 | }, 473 | "outputs": [], 474 | "source": [ 475 | "np.max(s)\n", 476 | "s.min()\n", 477 | "s.std()" 478 | ] 479 | }, 480 | { 481 | "cell_type": "markdown", 482 | "metadata": { 483 | "slideshow": { 484 | "slide_type": "slide" 485 | } 486 | }, 487 | "source": [ 488 | "Vector operations" 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": null, 494 | "metadata": { 495 | "slideshow": { 496 | "slide_type": "fragment" 497 | } 498 | }, 499 | "outputs": [], 500 | "source": [ 501 | "a = pd.Series([1, 2, 3])\n", 502 | "b = pd.Series([2, 3, 4])\n", 503 | "\n", 504 | "a * 2\n", 505 | "a + b\n", 506 | "a - b\n", 507 | "a * b" 508 | ] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "metadata": { 513 | "slideshow": { 514 | "slide_type": "slide" 515 | } 516 | }, 517 | "source": [ 518 | "## DataFrame" 519 | ] 520 | }, 521 | { 522 | "cell_type": "markdown", 523 | "metadata": { 524 | "slideshow": { 525 | "slide_type": "fragment" 526 | } 527 | }, 528 | "source": [ 529 | "DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object." 530 | ] 531 | }, 532 | { 533 | "cell_type": "markdown", 534 | "metadata": { 535 | "slideshow": { 536 | "slide_type": "slide" 537 | } 538 | }, 539 | "source": [ 540 | "DataFrame creation" 541 | ] 542 | }, 543 | { 544 | "cell_type": "code", 545 | "execution_count": null, 546 | "metadata": { 547 | "slideshow": { 548 | "slide_type": "fragment" 549 | } 550 | }, 551 | "outputs": [], 552 | "source": [ 553 | "d = {\"one\": pd.Series([1.0, 2.0, 3.0], index=[\"a\", \"b\", \"c\"]),\n", 554 | " \"two\": pd.Series([1.0, 2.0, 3.0, 4.0], index=[\"a\", \"b\", \"c\", \"d\"]),\n", 555 | " }\n", 556 | "df = pd.DataFrame(d)\n", 557 | "df" 558 | ] 559 | }, 560 | { 561 | "cell_type": "code", 562 | "execution_count": null, 563 | "metadata": { 564 | "slideshow": { 565 | "slide_type": "fragment" 566 | } 567 | }, 568 | "outputs": [], 569 | "source": [ 570 | "d = {\"one\": [1.0, 2.0, 3.0, 4.0], \"two\": [4.0, 3.0, 2.0, 1.0]}\n", 571 | "df = pd.DataFrame(d)\n", 572 | "df" 573 | ] 574 | }, 575 | { 576 | "cell_type": "markdown", 577 | "metadata": {}, 578 | "source": [ 579 | "### Basic transformations" 580 | ] 581 | }, 582 | { 583 | "cell_type": "markdown", 584 | "metadata": { 585 | "slideshow": { 586 | "slide_type": "slide" 587 | } 588 | }, 589 | "source": [ 590 | "Loading existing DataFrame from csv file. We'll use Google Play Store Apps dataset from here https://www.kaggle.com/lava18/google-play-store-apps." 
591 | ] 592 | }, 593 | { 594 | "cell_type": "code", 595 | "execution_count": null, 596 | "metadata": { 597 | "slideshow": { 598 | "slide_type": "fragment" 599 | } 600 | }, 601 | "outputs": [], 602 | "source": [ 603 | "data = pd.read_csv('googleplaystore.csv')\n", 604 | "data.head(3)" 605 | ] 606 | }, 607 | { 608 | "cell_type": "markdown", 609 | "metadata": { 610 | "slideshow": { 611 | "slide_type": "slide" 612 | } 613 | }, 614 | "source": [ 615 | "Column types" 616 | ] 617 | }, 618 | { 619 | "cell_type": "code", 620 | "execution_count": null, 621 | "metadata": { 622 | "slideshow": { 623 | "slide_type": "fragment" 624 | } 625 | }, 626 | "outputs": [], 627 | "source": [ 628 | "data.dtypes" 629 | ] 630 | }, 631 | { 632 | "cell_type": "markdown", 633 | "metadata": { 634 | "slideshow": { 635 | "slide_type": "slide" 636 | } 637 | }, 638 | "source": [ 639 | "Column selection as DataFrame" 640 | ] 641 | }, 642 | { 643 | "cell_type": "code", 644 | "execution_count": null, 645 | "metadata": { 646 | "slideshow": { 647 | "slide_type": "fragment" 648 | } 649 | }, 650 | "outputs": [], 651 | "source": [ 652 | "data[['App', 'Rating']].head()" 653 | ] 654 | }, 655 | { 656 | "cell_type": "markdown", 657 | "metadata": { 658 | "slideshow": { 659 | "slide_type": "slide" 660 | } 661 | }, 662 | "source": [ 663 | "Column selection as Series. You can treat a DataFrame semantically like a dict of like-indexed Series objects." 664 | ] 665 | }, 666 | { 667 | "cell_type": "code", 668 | "execution_count": null, 669 | "metadata": { 670 | "slideshow": { 671 | "slide_type": "fragment" 672 | } 673 | }, 674 | "outputs": [], 675 | "source": [ 676 | "data['Rating'].head()" 677 | ] 678 | }, 679 | { 680 | "cell_type": "code", 681 | "execution_count": null, 682 | "metadata": { 683 | "slideshow": { 684 | "slide_type": "fragment" 685 | } 686 | }, 687 | "outputs": [], 688 | "source": [ 689 | "type(data['Rating'])" 690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "execution_count": null, 695 | "metadata": { 696 | "slideshow": { 697 | "slide_type": "fragment" 698 | } 699 | }, 700 | "outputs": [], 701 | "source": [ 702 | "data['Rating'].mean()" 703 | ] 704 | }, 705 | { 706 | "cell_type": "markdown", 707 | "metadata": { 708 | "slideshow": { 709 | "slide_type": "slide" 710 | } 711 | }, 712 | "source": [ 713 | "Row selection by index" 714 | ] 715 | }, 716 | { 717 | "cell_type": "code", 718 | "execution_count": null, 719 | "metadata": {}, 720 | "outputs": [], 721 | "source": [ 722 | "data.iloc[1]" 723 | ] 724 | }, 725 | { 726 | "cell_type": "markdown", 727 | "metadata": { 728 | "slideshow": { 729 | "slide_type": "slide" 730 | } 731 | }, 732 | "source": [ 733 | "Filtering (row selection by condition)" 734 | ] 735 | }, 736 | { 737 | "cell_type": "code", 738 | "execution_count": null, 739 | "metadata": { 740 | "slideshow": { 741 | "slide_type": "fragment" 742 | } 743 | }, 744 | "outputs": [], 745 | "source": [ 746 | "data[data['Rating'] < 3].head()" 747 | ] 748 | }, 749 | { 750 | "cell_type": "markdown", 751 | "metadata": { 752 | "slideshow": { 753 | "slide_type": "slide" 754 | } 755 | }, 756 | "source": [ 757 | "Assigning value to column based on condition" 758 | ] 759 | }, 760 | { 761 | "cell_type": "code", 762 | "execution_count": null, 763 | "metadata": { 764 | "slideshow": { 765 | "slide_type": "fragment" 766 | } 767 | }, 768 | "outputs": [], 769 | "source": [ 770 | "# Not correct\n", 771 | "data[data['Rating'] < 3]['Rating'] = 0" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": null, 777 | 
"metadata": { 778 | "slideshow": { 779 | "slide_type": "fragment" 780 | } 781 | }, 782 | "outputs": [], 783 | "source": [ 784 | "data[data['Rating'] < 3].head(3)" 785 | ] 786 | }, 787 | { 788 | "cell_type": "code", 789 | "execution_count": null, 790 | "metadata": { 791 | "slideshow": { 792 | "slide_type": "subslide" 793 | } 794 | }, 795 | "outputs": [], 796 | "source": [ 797 | "# Correct\n", 798 | "data.loc[data['Rating'] < 3, 'Rating'] = 0" 799 | ] 800 | }, 801 | { 802 | "cell_type": "code", 803 | "execution_count": null, 804 | "metadata": { 805 | "slideshow": { 806 | "slide_type": "fragment" 807 | } 808 | }, 809 | "outputs": [], 810 | "source": [ 811 | "data[data['Rating'] < 3].head(3)" 812 | ] 813 | }, 814 | { 815 | "cell_type": "markdown", 816 | "metadata": { 817 | "slideshow": { 818 | "slide_type": "slide" 819 | } 820 | }, 821 | "source": [ 822 | "Sorting" 823 | ] 824 | }, 825 | { 826 | "cell_type": "code", 827 | "execution_count": null, 828 | "metadata": { 829 | "slideshow": { 830 | "slide_type": "fragment" 831 | } 832 | }, 833 | "outputs": [], 834 | "source": [ 835 | "data.sort_values(by='Rating').head(3)" 836 | ] 837 | }, 838 | { 839 | "cell_type": "markdown", 840 | "metadata": { 841 | "slideshow": { 842 | "slide_type": "slide" 843 | } 844 | }, 845 | "source": [ 846 | "Selecting values" 847 | ] 848 | }, 849 | { 850 | "cell_type": "code", 851 | "execution_count": null, 852 | "metadata": { 853 | "slideshow": { 854 | "slide_type": "subslide" 855 | } 856 | }, 857 | "outputs": [], 858 | "source": [ 859 | "data.head()['App'].values.tolist()" 860 | ] 861 | }, 862 | { 863 | "cell_type": "markdown", 864 | "metadata": { 865 | "slideshow": { 866 | "slide_type": "slide" 867 | } 868 | }, 869 | "source": [ 870 | "**All together**. Make list of top-5 Free Apps by Rating in Education Genres by alphabet order" 871 | ] 872 | }, 873 | { 874 | "cell_type": "code", 875 | "execution_count": null, 876 | "metadata": { 877 | "slideshow": { 878 | "slide_type": "fragment" 879 | } 880 | }, 881 | "outputs": [], 882 | "source": [ 883 | "data[(data['Type'] == 'Free') & (data['Genres'] == 'Education')]\\\n", 884 | " .sort_values(by=['Rating', 'App'], ascending=(False, True))\\\n", 885 | " .head(5)['App'].values.tolist()" 886 | ] 887 | }, 888 | { 889 | "cell_type": "markdown", 890 | "metadata": { 891 | "slideshow": { 892 | "slide_type": "slide" 893 | } 894 | }, 895 | "source": [ 896 | "### Concatenating" 897 | ] 898 | }, 899 | { 900 | "cell_type": "markdown", 901 | "metadata": { 902 | "slideshow": { 903 | "slide_type": "fragment" 904 | } 905 | }, 906 | "source": [ 907 | "Appending DataFrames" 908 | ] 909 | }, 910 | { 911 | "cell_type": "code", 912 | "execution_count": null, 913 | "metadata": { 914 | "slideshow": { 915 | "slide_type": "fragment" 916 | } 917 | }, 918 | "outputs": [], 919 | "source": [ 920 | "df1 = pd.DataFrame({\n", 921 | " \"A\": [\"A0\", \"A1\", \"A2\", \"A3\"],\n", 922 | " \"B\": [\"B0\", \"B1\", \"B2\", \"B3\"],\n", 923 | " \"C\": [\"C0\", \"C1\", \"C2\", \"C3\"],\n", 924 | " \"D\": [\"D0\", \"D1\", \"D2\", \"D3\"],},\n", 925 | " index=[0, 1, 2, 3],)\n", 926 | "\n", 927 | "df2 = pd.DataFrame({\n", 928 | " \"A\": [\"A4\", \"A5\", \"A6\", \"A7\"],\n", 929 | " \"B\": [\"B4\", \"B5\", \"B6\", \"B7\"],\n", 930 | " \"C\": [\"C4\", \"C5\", \"C6\", \"C7\"],\n", 931 | " \"E\": [\"E4\", \"E5\", \"E6\", \"E7\"],},\n", 932 | " index=[0, 1, 2, 3],)" 933 | ] 934 | }, 935 | { 936 | "cell_type": "code", 937 | "execution_count": null, 938 | "metadata": { 939 | "slideshow": { 940 | "slide_type": "slide" 941 | } 
942 | }, 943 | "outputs": [], 944 | "source": [ 945 | "df1" 946 | ] 947 | }, 948 | { 949 | "cell_type": "code", 950 | "execution_count": null, 951 | "metadata": { 952 | "slideshow": { 953 | "slide_type": "subslide" 954 | } 955 | }, 956 | "outputs": [], 957 | "source": [ 958 | "df2" 959 | ] 960 | }, 961 | { 962 | "cell_type": "code", 963 | "execution_count": null, 964 | "metadata": { 965 | "slideshow": { 966 | "slide_type": "slide" 967 | } 968 | }, 969 | "outputs": [], 970 | "source": [ 971 | "df1.append(df2, ignore_index=True)" 972 | ] 973 | }, 974 | { 975 | "cell_type": "markdown", 976 | "metadata": { 977 | "slideshow": { 978 | "slide_type": "slide" 979 | } 980 | }, 981 | "source": [ 982 | "**Join** methon works better with joining DataFrame by indices and is fine-tuned by default to do it." 983 | ] 984 | }, 985 | { 986 | "cell_type": "code", 987 | "execution_count": null, 988 | "metadata": { 989 | "slideshow": { 990 | "slide_type": "fragment" 991 | } 992 | }, 993 | "outputs": [], 994 | "source": [ 995 | "df3 = pd.DataFrame({\n", 996 | " \"A\": [\"A1\", \"A2\", \"A3\", \"A4\"],\n", 997 | " \"F\": [\"F0\", \"F1\", \"F2\", \"F3\"],},\n", 998 | " index=[0, 1, 2, 3],)" 999 | ] 1000 | }, 1001 | { 1002 | "cell_type": "code", 1003 | "execution_count": null, 1004 | "metadata": { 1005 | "slideshow": { 1006 | "slide_type": "slide" 1007 | } 1008 | }, 1009 | "outputs": [], 1010 | "source": [ 1011 | "df1" 1012 | ] 1013 | }, 1014 | { 1015 | "cell_type": "code", 1016 | "execution_count": null, 1017 | "metadata": { 1018 | "slideshow": { 1019 | "slide_type": "fragment" 1020 | } 1021 | }, 1022 | "outputs": [], 1023 | "source": [ 1024 | "df3" 1025 | ] 1026 | }, 1027 | { 1028 | "cell_type": "code", 1029 | "execution_count": null, 1030 | "metadata": { 1031 | "slideshow": { 1032 | "slide_type": "slide" 1033 | } 1034 | }, 1035 | "outputs": [], 1036 | "source": [ 1037 | "df1.join(df3, how='inner', lsuffix='_first', rsuffix='_third')" 1038 | ] 1039 | }, 1040 | { 1041 | "cell_type": "markdown", 1042 | "metadata": { 1043 | "slideshow": { 1044 | "slide_type": "slide" 1045 | } 1046 | }, 1047 | "source": [ 1048 | "**Merge** method is more versatile and allows us to specify columns besides the index to join on for both dataframes." 1049 | ] 1050 | }, 1051 | { 1052 | "cell_type": "code", 1053 | "execution_count": null, 1054 | "metadata": { 1055 | "slideshow": { 1056 | "slide_type": "fragment" 1057 | } 1058 | }, 1059 | "outputs": [], 1060 | "source": [ 1061 | "df1.merge(df3, how='left', on=['A'])" 1062 | ] 1063 | }, 1064 | { 1065 | "cell_type": "markdown", 1066 | "metadata": { 1067 | "slideshow": { 1068 | "slide_type": "slide" 1069 | } 1070 | }, 1071 | "source": [ 1072 | "### Grouping" 1073 | ] 1074 | }, 1075 | { 1076 | "cell_type": "markdown", 1077 | "metadata": { 1078 | "slideshow": { 1079 | "slide_type": "fragment" 1080 | } 1081 | }, 1082 | "source": [ 1083 | "A groupby operation involves some combination of splitting the object, applying a function, and combining the results. 
This can be used to group large amounts of data and compute operations on these groups" 1084 | ] 1085 | }, 1086 | { 1087 | "cell_type": "code", 1088 | "execution_count": null, 1089 | "metadata": { 1090 | "slideshow": { 1091 | "slide_type": "slide" 1092 | } 1093 | }, 1094 | "outputs": [], 1095 | "source": [ 1096 | "data.head(3)" 1097 | ] 1098 | }, 1099 | { 1100 | "cell_type": "markdown", 1101 | "metadata": { 1102 | "slideshow": { 1103 | "slide_type": "slide" 1104 | } 1105 | }, 1106 | "source": [ 1107 | "Find _Category_ having the highest average rating among it's applications. No cycles, I promise." 1108 | ] 1109 | }, 1110 | { 1111 | "cell_type": "code", 1112 | "execution_count": null, 1113 | "metadata": { 1114 | "slideshow": { 1115 | "slide_type": "fragment" 1116 | } 1117 | }, 1118 | "outputs": [], 1119 | "source": [ 1120 | "data.groupby('Category')['Rating'].mean().sort_values(ascending=False).index[1]" 1121 | ] 1122 | } 1123 | ], 1124 | "metadata": { 1125 | "celltoolbar": "Slideshow", 1126 | "kernelspec": { 1127 | "display_name": "Python 3", 1128 | "language": "python", 1129 | "name": "python3" 1130 | }, 1131 | "language_info": { 1132 | "codemirror_mode": { 1133 | "name": "ipython", 1134 | "version": 3 1135 | }, 1136 | "file_extension": ".py", 1137 | "mimetype": "text/x-python", 1138 | "name": "python", 1139 | "nbconvert_exporter": "python", 1140 | "pygments_lexer": "ipython3", 1141 | "version": "3.7.9" 1142 | } 1143 | }, 1144 | "nbformat": 4, 1145 | "nbformat_minor": 4 1146 | } 1147 | -------------------------------------------------------------------------------- /2_data_exploration/README.md: -------------------------------------------------------------------------------- 1 | # Materials 2 | Take a look into data quality related issues in real life: https://hbr.org/2018/04/if-your-data-is-bad-your-machine-learning-tools-are-useless. 3 | 4 | Study sections 12.1. Describing Single Variables and 12.2.2 Correlations Between Quantitative Variables from [Research Methods article](https://saylordotorg.github.io/text_research-methods-in-psychology/) (feel free to skip the Differences Between Groups or Conditions section). 5 | 6 | Familiarize yourself with the seaborn library: https://seaborn.pydata.org/introduction.html. 7 | Take this mini-course to get more hands-on: https://www.kaggle.com/learn/data-visualization. Don't miss the exercises! 8 | Get acquainted with boxplots: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51. 9 | 10 | Install and experiment with the [pandas_profiling library](https://github.com/ydataai/pandas-profiling) 11 | 12 | Bonus task: explore the https://www.tylervigen.com/spurious-correlations site (and if intrigued, read https://hbr.org/2015/06/beware-spurious-correlations). 13 | 14 | # Exercises for self-training 15 | Perform a comprehensive exploration of this dataset: https://archive.ics.uci.edu/ml/datasets/Adult _without_ using the pandas_profiling library first. Refer to https://github.com/cmawer/pycon-2017-eda-tutorial/blob/master/EDA-cheat-sheet.md for structuring your approach. Feel free to draw inspiration from this example: https://towardsdatascience.com/lets-learn-exploratory-data-analysis-practically-4a923499b779. 16 | Then apply pandas_profiling, check for new findings and fill the gaps in your earlier analysis. 17 | 18 | # Graded assignments 19 | Register on https://www.kaggle.com/. 20 | Join this competition: https://www.kaggle.com/c/tabular-playground-series-apr-2021. 
21 | Get train.csv from https://www.kaggle.com/c/tabular-playground-series-apr-2021/data. 22 | Perform data exploration and visualization without using the pandas_profiling library. 23 | -------------------------------------------------------------------------------- /2_data_exploration/eda.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "b489cb9e", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "d07912a0", 14 | "metadata": {}, 15 | "source": [ 16 | "# Context" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "id": "6b981d68", 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "id": "9b4320c6", 30 | "metadata": {}, 31 | "source": [ 32 | "# Data quality assessment" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "id": "e443ca82", 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "id": "5408eca5", 46 | "metadata": {}, 47 | "source": [ 48 | "# Data exploration" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "id": "2417286d", 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "id": "034945c8", 62 | "metadata": {}, 63 | "source": [ 64 | "# Summary" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": null, 70 | "id": "72e3fb13", 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [] 74 | } 75 | ], 76 | "metadata": { 77 | "kernelspec": { 78 | "display_name": "Python 3 (ipykernel)", 79 | "language": "python", 80 | "name": "python3" 81 | }, 82 | "language_info": { 83 | "codemirror_mode": { 84 | "name": "ipython", 85 | "version": 3 86 | }, 87 | "file_extension": ".py", 88 | "mimetype": "text/x-python", 89 | "name": "python", 90 | "nbconvert_exporter": "python", 91 | "pygments_lexer": "ipython3", 92 | "version": "3.8.12" 93 | }, 94 | "toc": { 95 | "base_numbering": 1, 96 | "nav_menu": {}, 97 | "number_sections": true, 98 | "sideBar": true, 99 | "skip_h1_title": false, 100 | "title_cell": "Table of Contents", 101 | "title_sidebar": "Contents", 102 | "toc_cell": false, 103 | "toc_position": {}, 104 | "toc_section_display": true, 105 | "toc_window_display": false 106 | } 107 | }, 108 | "nbformat": 4, 109 | "nbformat_minor": 5 110 | } 111 | -------------------------------------------------------------------------------- /3_linear_regression/README.md: -------------------------------------------------------------------------------- 1 | # Title 2 | Линейная регрессия и визуализация данных. 3 | 4 | # Goal 5 | - Познакомиться с основным разделением типов алгоритмов машинного обучения “с учителем” и “без учителя”. 6 | - Научиться проводить первичный анализ данных, их предобработку и визуализацию. 7 | - Реализовать классическую модель линейной регрессии как пример алгоритма обучения с учителем. 8 | - Решить регрессионную задачу с помощью обучения линейной модели из библиотеки sklearn. 
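As a concrete preview of the last goal above (solving a regression task with a linear model from sklearn), here is a minimal sketch on synthetic data. Everything in this snippet — the generated dataset, the seed, and the train/test split — is illustrative only; the assignment notebook provides its own data and requirements.

```python
# Minimal sketch: fitting scikit-learn's LinearRegression on synthetic 1-D data.
# The dataset is generated here purely for illustration; the assignment
# notebook (linear_regression.ipynb) defines its own data and tasks.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(-5, 5, size=(200, 1))              # a single feature
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, 200)  # noisy linear target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression().fit(X_train, y_train)

print("learned slope:", model.coef_[0])
print("learned intercept:", model.intercept_)
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```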
9 | 10 | # Materials 11 | ## Videos with detailed and simple math explanation: 12 | - [Supervised](https://www.youtube.com/watch?v=bQI5uDxrFfA&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=2&ab_channel=ArtificialIntelligence-AllinOne) VS [Unsupervised](https://www.youtube.com/watch?v=jAA2g9ItoAc&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=3&ab_channel=ArtificialIntelligence-AllinOne) learning 13 | - Linear regression with one variable in parts [1](https://www.youtube.com/watch?v=kHwlB_j7Hkc&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=4&ab_channel=ArtificialIntelligence-AllinOne), [2](https://www.youtube.com/watch?v=yuH4iRcggMw&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=5&ab_channel=ArtificialIntelligence-AllinOne), [3](https://www.youtube.com/watch?v=yR2ipCoFvNo&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=6&ab_channel=ArtificialIntelligence-AllinOne), [4](https://www.youtube.com/watch?v=0kns1gXLYg4&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=7&ab_channel=ArtificialIntelligence-AllinOne) 14 | - [Linear regression with many variables](https://www.youtube.com/watch?v=Q4GNLhRtZNc&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=18&ab_channel=ArtificialIntelligence-AllinOne) (Prerequisites : Linear Algebra knowledge) 15 | 16 | ## Other 17 | [Linear regression math explained](https://habr.com/ru/company/ods/blog/323890/) (Prerequisites : Linear Algebra knowledge) 18 | 19 | # Assignment 20 | [linear_regression.ipynb](./linear_regression.ipynb) 21 | -------------------------------------------------------------------------------- /4_overfitting_regularization/README.md: -------------------------------------------------------------------------------- 1 | # Materials 2 | 3 | If you were successful with [#3 Linear Regression](../3_linear_regression) assignment you may find that module very easy. You should be already familiar with most of the concepts covered here. If not - don't worry! We have a bunch of cool videos for you to study regularization and consolidate your knowledge on a practical assignment. 4 | 5 | Firstly, I want you to watch a series of excellent videos by StatQuest. 6 | 1. [Cross-Validation](https://www.youtube.com/watch?v=fSytzGwwBVw&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=3) 7 | 2. [Bias-variance tradeoff](https://www.youtube.com/watch?v=EuBBz3bI-aA&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=5&ab_channel=StatQuestwithJoshStarmer) 8 | 3. [Ridge Regression](https://www.youtube.com/watch?v=Q81RR3yKn30&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=24&ab_channel=StatQuestwithJoshStarmer) 9 | 4. [Lasso Regression](https://www.youtube.com/watch?v=NGf0voTMlcs&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=24&ab_channel=StatQuestwithJoshStarmer) 10 | 5. [Linear Regression: Ridge, Lasso, and Polynomial Regression](https://www.coursera.org/lecture/python-machine-learning/linear-regression-ridge-lasso-and-polynomial-regression-M7yUQ) 11 | 6. [Ridge VS Lasso visualized](https://www.youtube.com/watch?v=Xm2C_gTAl8c&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=26) 12 | 7. [A visual explanation for regularization of linear models](https://explained.ai/regularization/index.html) 13 | 8. 
[Elastic Net](https://www.youtube.com/watch?v=1dKRdX9bfIo&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=27) 14 | 15 | 16 | Another lectures on Overfitting and Regularization (optional) 17 | - https://youtu.be/EQWr3GGCdzw?t=223 18 | - https://www.youtube.com/watch?v=I-VfYXzC5ro&feature=youtu.be&t=278 19 | 20 | 21 | # Assignment 22 | See [overfitting_regularization.ipynb](./overfitting_regularization.ipynb) notebook. 23 | -------------------------------------------------------------------------------- /4_overfitting_regularization/overfitting_regularization.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Overfitting and Regularization" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Imports" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": { 21 | "ExecuteTime": { 22 | "end_time": "2022-02-16T07:39:25.890581Z", 23 | "start_time": "2022-02-16T07:39:23.835921Z" 24 | } 25 | }, 26 | "outputs": [], 27 | "source": [ 28 | "import numpy as np\n", 29 | "import pandas as pd\n", 30 | "import seaborn as sns\n", 31 | "import matplotlib.pyplot as plt" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": { 38 | "ExecuteTime": { 39 | "end_time": "2022-02-16T07:39:26.108031Z", 40 | "start_time": "2022-02-16T07:39:25.925893Z" 41 | } 42 | }, 43 | "outputs": [], 44 | "source": [ 45 | "from sklearn.pipeline import Pipeline\n", 46 | "from sklearn.model_selection import train_test_split, cross_validate\n", 47 | "from sklearn.preprocessing import PolynomialFeatures, StandardScaler\n", 48 | "from sklearn.linear_model import LinearRegression, Lasso, Ridge\n", 49 | "from sklearn.metrics import mean_squared_error\n", 50 | "from sklearn import set_config" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": { 57 | "ExecuteTime": { 58 | "end_time": "2022-02-16T07:39:26.155183Z", 59 | "start_time": "2022-02-16T07:39:26.141249Z" 60 | } 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "set_config(display='diagram')" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "## Settings" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": { 78 | "ExecuteTime": { 79 | "end_time": "2022-02-16T07:39:26.201661Z", 80 | "start_time": "2022-02-16T07:39:26.189181Z" 81 | } 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "SEED = 42\n", 86 | "RANGE = (-5, 5)\n", 87 | "N_SAMPLES = 50\n", 88 | "DEGREES = np.linspace(0, 15, 1 + 15, dtype=int)\n", 89 | "ALPHAS = np.linspace(0, 0.5, 1 + 40)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "## Part 1: Underfitting vs. overfitting" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "### Generate samples" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": { 109 | "ExecuteTime": { 110 | "end_time": "2020-11-09T09:04:09.904994Z", 111 | "start_time": "2020-11-09T09:04:09.896444Z" 112 | } 113 | }, 114 | "source": [ 115 | "Let's pick a target function $ f(x) = 2\\cdot x + 10\\cdot sin(x) $ and generate some noisy samples to learn from." 
116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": { 122 | "ExecuteTime": { 123 | "end_time": "2022-02-16T07:39:26.433239Z", 124 | "start_time": "2022-02-16T07:39:26.412501Z" 125 | } 126 | }, 127 | "outputs": [], 128 | "source": [ 129 | "def target_function(x):\n", 130 | " return 2 * x + 10 * np.sin(x)\n", 131 | "\n", 132 | "def generate_samples():\n", 133 | " \"\"\"Generate noisy samples.\"\"\"\n", 134 | " np.random.seed(SEED)\n", 135 | " x = np.random.uniform(*RANGE, size=N_SAMPLES)\n", 136 | " y = target_function(x) + np.random.normal(scale=4, size=N_SAMPLES)\n", 137 | " return x.reshape(-1, 1), y\n", 138 | "\n", 139 | "X, y = generate_samples()" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "### Plot samples" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": { 153 | "ExecuteTime": { 154 | "end_time": "2022-02-16T07:39:27.822431Z", 155 | "start_time": "2022-02-16T07:39:27.696426Z" 156 | } 157 | }, 158 | "outputs": [], 159 | "source": [ 160 | "def plot_scatter(x, y, title=None, label='Noisy samples'):\n", 161 | " plt.scatter(x, y, label=label)\n", 162 | " plt.xlabel('x')\n", 163 | " plt.ylabel('y')\n", 164 | " plt.grid(True)\n", 165 | " plt.title(title)\n", 166 | " plt.legend(loc='lower right')\n", 167 | "\n", 168 | "plot_scatter(X, y)" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "### Split" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": { 182 | "ExecuteTime": { 183 | "end_time": "2022-02-16T07:39:29.210072Z", 184 | "start_time": "2022-02-16T07:39:29.093175Z" 185 | } 186 | }, 187 | "outputs": [], 188 | "source": [ 189 | "X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=SEED)\n", 190 | "\n", 191 | "plot_scatter(X_train, y_train, label='Training set')\n", 192 | "plot_scatter(X_valid, y_valid, label='Validation set')" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": { 199 | "ExecuteTime": { 200 | "end_time": "2022-02-16T07:39:33.496625Z", 201 | "start_time": "2022-02-16T07:39:33.479851Z" 202 | } 203 | }, 204 | "outputs": [], 205 | "source": [ 206 | "y_train" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "### Model" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": {}, 219 | "source": [ 220 | "Let's try to approximate our target function $ f(x) = 2\\cdot x + 10\\cdot sin(x) $ with polynomials of different degree. \n", 221 | "\n", 222 | "A polynomial of degree $n$ has the form:\n", 223 | "$ h(x) = w_0 + w_1\\cdot x + w_2\\cdot x^2 +\\ldots + w_n\\cdot x^n $.\n", 224 | "\n", 225 | "$x^i$ values could easily be generated by `PolynomialFeatures`, while $w_i$ are the unknown paramaters to be estimated using `LinearRegression`." 
226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": { 232 | "ExecuteTime": { 233 | "end_time": "2022-02-16T07:39:34.647226Z", 234 | "start_time": "2022-02-16T07:39:34.639540Z" 235 | } 236 | }, 237 | "outputs": [], 238 | "source": [ 239 | "PolynomialFeatures(degree=4, include_bias=False).fit_transform(X=[\n", 240 | " [1],\n", 241 | " [3],\n", 242 | " [4],\n", 243 | "])" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": { 250 | "ExecuteTime": { 251 | "end_time": "2022-02-16T07:40:49.993908Z", 252 | "start_time": "2022-02-16T07:40:49.959340Z" 253 | } 254 | }, 255 | "outputs": [], 256 | "source": [ 257 | "def make_model(degree, alpha=0, penalty=None):\n", 258 | " # linear regression\n", 259 | " if alpha == 0:\n", 260 | " regressor = LinearRegression()\n", 261 | " # lasso regression\",\n", 262 | " elif penalty == 'L1':\n", 263 | " regressor = Lasso(alpha=alpha, random_state=SEED, max_iter=50000)\n", 264 | " # ridge regression\",\n", 265 | " elif penalty == 'L2':\n", 266 | " regressor = Ridge(alpha=alpha, random_state=SEED, max_iter=50000) \n", 267 | " \n", 268 | " \n", 269 | " return Pipeline([\n", 270 | " ('pol', PolynomialFeatures(degree, include_bias=(degree == 0))),\n", 271 | " ('sca', StandardScaler()),\n", 272 | " ('reg', regressor)\n", 273 | " ])\n", 274 | "\n", 275 | "display(make_model(2))\n", 276 | "display(make_model(2, penalty='L1', alpha=0.1))\n", 277 | "display(make_model(2, penalty='L2', alpha=0.1))" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": {}, 283 | "source": [ 284 | "### Fit" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "Let's fit a model and plot the hypothesis it learns:" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": null, 297 | "metadata": { 298 | "ExecuteTime": { 299 | "end_time": "2022-02-16T07:40:51.426785Z", 300 | "start_time": "2022-02-16T07:40:51.209830Z" 301 | } 302 | }, 303 | "outputs": [], 304 | "source": [ 305 | "def plot_fit(model):\n", 306 | " degree = model['pol'].degree\n", 307 | " X_range = np.linspace(*RANGE, 1000).reshape(-1, 1)\n", 308 | " y_pred = model.predict(X_range)\n", 309 | " plot_scatter(X_train, y_train, label='Training sample')\n", 310 | " plot_scatter(X_valid, y_valid, label='Validation sample')\n", 311 | " plt.plot(X_range, target_function(X_range), c='green', alpha=0.2, lw=5, label='Target function')\n", 312 | " plt.plot(X_range, y_pred, c='red', label='Hypothesis')\n", 313 | " plt.ylim((min(y) - 3, max(y) + 3))\n", 314 | " plt.legend(loc='best') \n", 315 | " plt.title(f'Polynomial approximation: degree={degree}')\n", 316 | " plt.show()\n", 317 | "\n", 318 | "plot_fit(make_model(degree=2).fit(X_train, y_train))" 319 | ] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "metadata": {}, 324 | "source": [ 325 | "### From underfitting to overfitting" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "metadata": { 331 | "ExecuteTime": { 332 | "end_time": "2020-11-09T11:15:24.323458Z", 333 | "start_time": "2020-11-09T11:15:24.318089Z" 334 | } 335 | }, 336 | "source": [ 337 | "We can investigate the shape of the fitted curve for different values of `degree`:" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": null, 343 | "metadata": { 344 | "ExecuteTime": { 345 | "end_time": "2022-02-16T07:40:54.236522Z", 346 | "start_time": "2022-02-16T07:40:52.748262Z" 347 | }, 348 | "scrolled": false 349 | 
}, 350 | "outputs": [], 351 | "source": [ 352 | "for degree in [0, 1, 2, 3, 4, 5, 10, 15, 20]:\n", 353 | " plot_fit(make_model(degree).fit(X_train, y_train))" 354 | ] 355 | }, 356 | { 357 | "cell_type": "markdown", 358 | "metadata": {}, 359 | "source": [ 360 | "### Fitting graph" 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "metadata": {}, 366 | "source": [ 367 | "In the next step we calculate the training and the validation error for each `degree` and plot them in a single graph. The resulting graph is called the fitting graph." 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": null, 373 | "metadata": { 374 | "ExecuteTime": { 375 | "end_time": "2021-12-15T20:32:47.308471Z", 376 | "start_time": "2021-12-15T20:32:47.036535Z" 377 | } 378 | }, 379 | "outputs": [], 380 | "source": [ 381 | "def rmse(y_true, y_pred):\n", 382 | " return np.sqrt(mean_squared_error(y_true, y_pred))\n", 383 | "\n", 384 | "def plot_fitting_graph(x, metric_train, metric_valid, xlabel, ylabel, \n", 385 | " custom_metric=None, custom_label='', custom_scale='log', title='Fitting graph'):\n", 386 | " plt.figure(figsize=(9, 4.5))\n", 387 | " plt.plot(x, metric_train, label='Training')\n", 388 | " plt.plot(x, metric_valid, color='C1', label='Validation')\n", 389 | " plt.axvline(x[np.argmin(metric_valid)], color='C1', lw=10, alpha=0.2)\n", 390 | " plt.title(title)\n", 391 | " plt.xlabel(xlabel)\n", 392 | " plt.ylabel(ylabel)\n", 393 | " plt.grid(True)\n", 394 | " plt.xticks(x, rotation='vertical')\n", 395 | " plt.legend(loc='center left') \n", 396 | " if custom_metric:\n", 397 | " plt.twinx()\n", 398 | " plt.yscale(custom_scale)\n", 399 | " plt.plot(x, custom_metric, alpha=0.2, lw=4, ls='dotted', color='black', label=custom_label) \n", 400 | " plt.legend(loc='center right') \n", 401 | " plt.show()\n", 402 | " \n", 403 | "rmse_train, rmse_valid = [], []\n", 404 | "for degree in DEGREES:\n", 405 | " reg = make_model(degree).fit(X_train, y_train)\n", 406 | " rmse_train.append(rmse(reg.predict(X_train), y_train))\n", 407 | " rmse_valid.append(rmse(reg.predict(X_valid), y_valid))\n", 408 | " \n", 409 | "plot_fitting_graph(DEGREES, rmse_train, rmse_valid, xlabel='Complexity (degree)', ylabel='Error (RMSE)', \n", 410 | " title='Least squares polynomial regression')" 411 | ] 412 | }, 413 | { 414 | "cell_type": "markdown", 415 | "metadata": {}, 416 | "source": [ 417 | "### Sweet spot" 418 | ] 419 | }, 420 | { 421 | "cell_type": "markdown", 422 | "metadata": {}, 423 | "source": [ 424 | "What is the optimal `degree` to go with?" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": null, 430 | "metadata": { 431 | "ExecuteTime": { 432 | "end_time": "2021-12-15T20:32:47.324492Z", 433 | "start_time": "2021-12-15T20:32:47.308471Z" 434 | } 435 | }, 436 | "outputs": [], 437 | "source": [ 438 | "DEGREES[np.argmin(rmse_valid)]" 439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "metadata": {}, 444 | "source": [ 445 | "### Cross-validation" 446 | ] 447 | }, 448 | { 449 | "cell_type": "markdown", 450 | "metadata": {}, 451 | "source": [ 452 | "Ideally, we would choose the the model parameters such that we have the best model performance. However, we want to make sure that we really have the best validation performance. When we do `train_test_split` we randomly split the data into two parts. What could happen is that we got lucky and split the data such that it favours the validation error. This is especially dangerous if we are dealing with small datasets. 
One way to check if that's the case is to run the experiment several times for different, random splits. However, there is an even more systematic way of doing this: [cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html)." 453 | ] 454 | }, 455 | { 456 | "cell_type": "markdown", 457 | "metadata": {}, 458 | "source": [ 459 | "" 460 | ] 461 | }, 462 | { 463 | "cell_type": "code", 464 | "execution_count": null, 465 | "metadata": { 466 | "ExecuteTime": { 467 | "end_time": "2021-12-15T20:32:47.788153Z", 468 | "start_time": "2021-12-15T20:32:47.324492Z" 469 | } 470 | }, 471 | "outputs": [], 472 | "source": [ 473 | "rmse_train, rmse_valid = [], []\n", 474 | "for degree in DEGREES:\n", 475 | " results = cross_validate(make_model(degree), \n", 476 | " X, y, cv=5,\n", 477 | " return_train_score=True,\n", 478 | " scoring='neg_root_mean_squared_error')\n", 479 | " rmse_train.append(-np.mean(results['train_score']))\n", 480 | " rmse_valid.append(-np.mean(results['test_score']))\n", 481 | " \n", 482 | "plot_fitting_graph(DEGREES, rmse_train, rmse_valid, xlabel='Complexity (degree)', ylabel='Error (RMSE)',\n", 483 | " title='Least squares polynomial regression')" 484 | ] 485 | }, 486 | { 487 | "cell_type": "markdown", 488 | "metadata": {}, 489 | "source": [ 490 | "### Model coefficients" 491 | ] 492 | }, 493 | { 494 | "cell_type": "markdown", 495 | "metadata": { 496 | "ExecuteTime": { 497 | "end_time": "2020-11-05T16:26:33.200639Z", 498 | "start_time": "2020-11-05T16:26:33.197656Z" 499 | } 500 | }, 501 | "source": [ 502 | "Let's inspect our regression model coefficients:" 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": null, 508 | "metadata": { 509 | "ExecuteTime": { 510 | "end_time": "2021-12-15T20:32:47.820142Z", 511 | "start_time": "2021-12-15T20:32:47.788153Z" 512 | } 513 | }, 514 | "outputs": [], 515 | "source": [ 516 | "(make_model(degree=1).fit(X_train, y_train)['reg'].coef_,\n", 517 | " make_model(degree=2).fit(X_train, y_train)['reg'].coef_,\n", 518 | " make_model(degree=5).fit(X_train, y_train)['reg'].coef_,\n", 519 | " make_model(degree=10).fit(X_train, y_train)['reg'].coef_)" 520 | ] 521 | }, 522 | { 523 | "cell_type": "markdown", 524 | "metadata": {}, 525 | "source": [ 526 | "Hmm... it looks like high degree polynomials are coming with much bigger regression coefficients. 
\n", 527 | "\n", 528 | "We are going to plot the mean absolute value of $w_i$ as a function of degree to reveal the relationship:" 529 | ] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "execution_count": null, 534 | "metadata": { 535 | "ExecuteTime": { 536 | "end_time": "2021-12-15T20:32:48.876099Z", 537 | "start_time": "2021-12-15T20:32:47.820142Z" 538 | } 539 | }, 540 | "outputs": [], 541 | "source": [ 542 | "rmse_train, rmse_valid, avg_coef = [], [], []\n", 543 | "for degree in DEGREES:\n", 544 | " results = cross_validate(make_model(degree),\n", 545 | " X, y, cv=5,\n", 546 | " return_train_score=True, return_estimator=True,\n", 547 | " scoring='neg_root_mean_squared_error')\n", 548 | " rmse_train.append(-np.mean(results['train_score']))\n", 549 | " rmse_valid.append(-np.mean(results['test_score'])) \n", 550 | " avg_coef.append( \n", 551 | " # average over CV folds\n", 552 | " np.mean([ \n", 553 | " # mean absolute value of weights\n", 554 | " np.mean(np.abs(model['reg'].coef_))\n", 555 | " for model in results['estimator']\n", 556 | " ]))\n", 557 | " \n", 558 | "plot_fitting_graph(DEGREES, rmse_train, rmse_valid,\n", 559 | " xlabel='Complexity (degree)', ylabel='Error (RMSE)',\n", 560 | " custom_metric=avg_coef, custom_label='avg(|$w_i$|)',\n", 561 | " title='Least squares polynomial regression')" 562 | ] 563 | }, 564 | { 565 | "cell_type": "markdown", 566 | "metadata": { 567 | "ExecuteTime": { 568 | "end_time": "2020-11-09T15:24:18.245064Z", 569 | "start_time": "2020-11-09T15:24:18.241548Z" 570 | } 571 | }, 572 | "source": [ 573 | "### Summary" 574 | ] 575 | }, 576 | { 577 | "cell_type": "markdown", 578 | "metadata": {}, 579 | "source": [ 580 | "We observe the following:\n", 581 | "\n", 582 | "1. **Underfitting** (degree < 5): The model is not able to fit the data properly. The fit is bad for both the training and the validation set.\n", 583 | "\n", 584 | "2. **Fit is just right** (degree = 5): The model is able to capture the underlying data distribution. The fit is good for both the training and the validation set.\n", 585 | "\n", 586 | "3. **Overfitting** (degree > 5): The model starts fitting the noise in the dataset. While the fit for the training data gets even better, the fit for the validation set gets worse.\n", 587 | "\n", 588 | "4. As the order of polynomial increases, the linear model coefficients become more likely to take on **large values**." 589 | ] 590 | }, 591 | { 592 | "cell_type": "markdown", 593 | "metadata": { 594 | "ExecuteTime": { 595 | "end_time": "2020-11-03T08:47:28.699189Z", 596 | "start_time": "2020-11-03T08:47:28.695466Z" 597 | } 598 | }, 599 | "source": [ 600 | "## Part 2: Regularization" 601 | ] 602 | }, 603 | { 604 | "cell_type": "markdown", 605 | "metadata": {}, 606 | "source": [ 607 | "There are two major ways to build a machine learning model with the ability to generalize well on unseen data:\n", 608 | "1. Train the simplest model possible for our purpose (according to Occam’s Razor).\n", 609 | "2. Train a complex or more expressive model on the data and perform regularization.\n", 610 | "\n", 611 | "Regularization is a method used to reduce the variance of a machine learning model. In other words, it is used to reduce overfitting. Regularization penalizes a model for being complex. For linear models, it means regularization forces model coefficients to be smaller in magnitude.\n", 612 | "\n", 613 | "Let's pick a polynomial model of degree **15** (which tends to overfit strongly) and try to regularize it using **L1** and **L2** penalties." 
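As a quick illustration of that claim, the sketch below compares the average coefficient magnitude of the unregularized degree-15 model with its L1- and L2-penalized counterparts. It is illustrative only: it assumes the `make_model` helper and the `X_train`/`y_train` split defined earlier in this notebook, and `alpha=0.1` is an arbitrary demo value, not a tuned choice.

```python
# Illustrative sketch (relies on make_model, X_train, y_train defined above;
# alpha=0.1 is an arbitrary demo value): compare mean |w_i| with and without a penalty.
for penalty, alpha in [(None, 0), ('L1', 0.1), ('L2', 0.1)]:
    fitted = make_model(degree=15, penalty=penalty, alpha=alpha).fit(X_train, y_train)
    label = penalty if penalty else 'no penalty'
    print(f"{label:>10}: mean |w_i| = {np.mean(np.abs(fitted['reg'].coef_)):.3f}")
```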
614 | ] 615 | }, 616 | { 617 | "cell_type": "markdown", 618 | "metadata": {}, 619 | "source": [ 620 | "### L1 - Lasso regression" 621 | ] 622 | }, 623 | { 624 | "cell_type": "code", 625 | "execution_count": null, 626 | "metadata": { 627 | "ExecuteTime": { 628 | "end_time": "2021-12-15T20:32:49.851006Z", 629 | "start_time": "2021-12-15T20:32:48.876099Z" 630 | }, 631 | "scrolled": false 632 | }, 633 | "outputs": [], 634 | "source": [ 635 | "rmse_train, rmse_valid = [], []\n", 636 | "for alpha in ALPHAS: \n", 637 | " results = cross_validate(make_model(degree=15, penalty='L1', alpha=alpha), \n", 638 | " X, y, cv=5,\n", 639 | " return_train_score=True,\n", 640 | " scoring='neg_root_mean_squared_error')\n", 641 | " rmse_train.append(-np.mean(results['train_score']))\n", 642 | " rmse_valid.append(-np.mean(results['test_score']))\n", 643 | " \n", 644 | "plot_fitting_graph(ALPHAS, rmse_train, rmse_valid,\n", 645 | " xlabel='Regularization strength (alpha)', ylabel='Error (RMSE)',\n", 646 | " title='Lasso polynomial regression (L1): degree=15')" 647 | ] 648 | }, 649 | { 650 | "cell_type": "markdown", 651 | "metadata": {}, 652 | "source": [ 653 | "### L2 - Ridge regression" 654 | ] 655 | }, 656 | { 657 | "cell_type": "code", 658 | "execution_count": null, 659 | "metadata": { 660 | "ExecuteTime": { 661 | "end_time": "2021-12-15T20:32:51.202745Z", 662 | "start_time": "2021-12-15T20:32:50.468099Z" 663 | } 664 | }, 665 | "outputs": [], 666 | "source": [ 667 | "rmse_train, rmse_valid = [], []\n", 668 | "for alpha in ALPHAS: \n", 669 | " results = cross_validate(make_model(degree=15, penalty='L2', alpha=alpha), \n", 670 | " X, y, cv=5,\n", 671 | " return_train_score=True,\n", 672 | " scoring='neg_root_mean_squared_error')\n", 673 | " rmse_train.append(-np.mean(results['train_score']))\n", 674 | " rmse_valid.append(-np.mean(results['test_score']))\n", 675 | " \n", 676 | "plot_fitting_graph(ALPHAS, rmse_train, rmse_valid, \n", 677 | " xlabel='Regularization strength (alpha)', ylabel='Error (RMSE)', \n", 678 | " title='Ridge polynomial regression (L2): degree=15')" 679 | ] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "metadata": { 684 | "ExecuteTime": { 685 | "end_time": "2020-11-09T13:47:39.048589Z", 686 | "start_time": "2020-11-09T13:47:39.044912Z" 687 | } 688 | }, 689 | "source": [ 690 | "### Summary" 691 | ] 692 | }, 693 | { 694 | "cell_type": "markdown", 695 | "metadata": { 696 | "ExecuteTime": { 697 | "end_time": "2020-11-09T13:49:03.993455Z", 698 | "start_time": "2020-11-09T13:49:03.987472Z" 699 | } 700 | }, 701 | "source": [ 702 | "1. We can control the regularization strength by changing the hyperparameter `alpha`.\n", 703 | "2. Regularized version of the model performs pretty well. Even in case the original original (unregularized) model is heavily overfitting due to excessive complexity." 704 | ] 705 | }, 706 | { 707 | "cell_type": "markdown", 708 | "metadata": { 709 | "ExecuteTime": { 710 | "end_time": "2020-11-09T12:14:17.962945Z", 711 | "start_time": "2020-11-09T12:14:17.959952Z" 712 | } 713 | }, 714 | "source": [ 715 | "## Part 3: Homework assignment (10 points)" 716 | ] 717 | }, 718 | { 719 | "cell_type": "markdown", 720 | "metadata": {}, 721 | "source": [ 722 | "**WARNING!**\n", 723 | "\n", 724 | "Due to the limited power of your machine, you may face some difficulties in generating polynomial features of a high degree. It's ok to take only a subsample of features for that purpose (even one feature is enough). 
Afterwards, you **must collect all features together** (both those used to generate polynomials and the rest)." 725 | ] 726 | }, 727 | { 728 | "cell_type": "markdown", 729 | "metadata": { 730 | "ExecuteTime": { 731 | "end_time": "2021-12-10T12:27:23.202301Z", 732 | "start_time": "2021-12-10T12:27:23.185315Z" 733 | } 734 | }, 735 | "source": [ 736 | "### Exercise 1 - Overfitting and Underfitting (2 points)" 737 | ] 738 | }, 739 | { 740 | "cell_type": "markdown", 741 | "metadata": { 742 | "ExecuteTime": { 743 | "end_time": "2021-12-10T07:35:07.485715Z", 744 | "start_time": "2021-12-10T07:35:07.461799Z" 745 | } 746 | }, 747 | "source": [ 748 | "Let's work with the diabetes dataset." 749 | ] 750 | }, 751 | { 752 | "cell_type": "code", 753 | "execution_count": null, 754 | "metadata": { 755 | "ExecuteTime": { 756 | "end_time": "2021-12-15T20:32:51.450770Z", 757 | "start_time": "2021-12-15T20:32:51.234725Z" 758 | } 759 | }, 760 | "outputs": [], 761 | "source": [ 762 | "from sklearn.datasets import load_diabetes\n", 763 | "data = load_diabetes()\n", 764 | "X_diabetes = pd.DataFrame(data['data'], columns=data['feature_names'])\n", 765 | "y_diabetes = pd.DataFrame(data['target'], columns=['target'])\n", 766 | "print(data['DESCR'])" 767 | ] 768 | }, 769 | { 770 | "cell_type": "markdown", 771 | "metadata": {}, 772 | "source": [ 773 | "Apply the model to the diabetes dataset with polynomial feature engineering of different degrees. Plot the dependence of the train and test errors on the polynomial degree. Highlight the degree with the best test error. Which degrees cause overfitting/underfitting? Why?" 774 | ] 775 | }, 776 | { 777 | "cell_type": "code", 778 | "execution_count": null, 779 | "metadata": { 780 | "ExecuteTime": { 781 | "end_time": "2021-12-15T20:38:03.456910Z", 782 | "start_time": "2021-12-15T20:38:03.448915Z" 783 | } 784 | }, 785 | "outputs": [], 786 | "source": [ 787 | "# your findings/conclusions" 788 | ] 789 | }, 790 | { 791 | "cell_type": "markdown", 792 | "metadata": { 793 | "ExecuteTime": { 794 | "end_time": "2021-12-10T12:46:46.756169Z", 795 | "start_time": "2021-12-10T12:44:13.217Z" 796 | } 797 | }, 798 | "source": [ 799 | "### Exercise 2 - Magnitude (3 points)" 800 | ] 801 | }, 802 | { 803 | "cell_type": "markdown", 804 | "metadata": {}, 805 | "source": [ 806 | "As discussed earlier, regularization methods are expected to constrain the weights (model coefficients). \n", 807 | "\n", 808 | "Is this indeed happening? \n", 809 | "\n", 810 | "Please investigate this on your own and verify it empirically (both for **L1** and **L2**). Let's use `degree=15` and `alpha` from `ALPHAS`."
811 | ] 812 | }, 813 | { 814 | "cell_type": "markdown", 815 | "metadata": {}, 816 | "source": [ 817 | "#### L1" 818 | ] 819 | }, 820 | { 821 | "cell_type": "code", 822 | "execution_count": null, 823 | "metadata": {}, 824 | "outputs": [], 825 | "source": [ 826 | "## your code" 827 | ] 828 | }, 829 | { 830 | "cell_type": "markdown", 831 | "metadata": {}, 832 | "source": [ 833 | "#### L2" 834 | ] 835 | }, 836 | { 837 | "cell_type": "code", 838 | "execution_count": null, 839 | "metadata": {}, 840 | "outputs": [], 841 | "source": [ 842 | "## your code" 843 | ] 844 | }, 845 | { 846 | "cell_type": "markdown", 847 | "metadata": {}, 848 | "source": [ 849 | "#### Summary" 850 | ] 851 | }, 852 | { 853 | "cell_type": "code", 854 | "execution_count": null, 855 | "metadata": {}, 856 | "outputs": [], 857 | "source": [ 858 | "## your observations/conclusions" 859 | ] 860 | }, 861 | { 862 | "cell_type": "markdown", 863 | "metadata": { 864 | "ExecuteTime": { 865 | "end_time": "2021-12-10T12:46:46.756169Z", 866 | "start_time": "2021-12-10T12:44:13.217Z" 867 | } 868 | }, 869 | "source": [ 870 | "### Exercise 3 - Sparsity (3 points)" 871 | ] 872 | }, 873 | { 874 | "cell_type": "markdown", 875 | "metadata": {}, 876 | "source": [ 877 | "Lasso can also be used for **feature selection** since L1 is [more likely to produce zero coefficients](https://explained.ai/regularization/).\n", 878 | "\n", 879 | "Is this indeed happening? \n", 880 | "\n", 881 | "Please investigate this on your own and verify it empirically (both for **L1** and **L2**). Let's use `degree=15` and `alpha` from `ALPHAS`." 882 | ] 883 | }, 884 | { 885 | "cell_type": "markdown", 886 | "metadata": {}, 887 | "source": [ 888 | "#### L1" 889 | ] 890 | }, 891 | { 892 | "cell_type": "code", 893 | "execution_count": null, 894 | "metadata": {}, 895 | "outputs": [], 896 | "source": [ 897 | "## your code" 898 | ] 899 | }, 900 | { 901 | "cell_type": "markdown", 902 | "metadata": {}, 903 | "source": [ 904 | "#### L2" 905 | ] 906 | }, 907 | { 908 | "cell_type": "code", 909 | "execution_count": null, 910 | "metadata": {}, 911 | "outputs": [], 912 | "source": [ 913 | "## your code" 914 | ] 915 | }, 916 | { 917 | "cell_type": "markdown", 918 | "metadata": {}, 919 | "source": [ 920 | "#### Summary" 921 | ] 922 | }, 923 | { 924 | "cell_type": "code", 925 | "execution_count": null, 926 | "metadata": { 927 | "ExecuteTime": { 928 | "end_time": "2021-12-10T13:40:56.063816Z", 929 | "start_time": "2021-12-10T13:40:56.057832Z" 930 | } 931 | }, 932 | "outputs": [], 933 | "source": [ 934 | "# your findings/conclusions" 935 | ] 936 | }, 937 | { 938 | "cell_type": "markdown", 939 | "metadata": {}, 940 | "source": [ 941 | "### Exercise 4 - Scaling (2 points)" 942 | ] 943 | }, 944 | { 945 | "cell_type": "markdown", 946 | "metadata": {}, 947 | "source": [ 948 | "As a general rule, it is recommended to scale input features before fitting a regularized model so that the features/inputs take values in similar ranges. One common way of doing so is to standardize the inputs, which is exactly what the second step of our pipeline (`StandardScaler`) is responsible for. \n", 949 | "\n", 950 | "Is scaling important? What are the underlying reasons?\n", 951 | "\n", 952 | "Please investigate this on your own and verify it empirically (both for **L1** and **L2**) on the dataset below. Check the coefficients."
953 | ] 954 | }, 955 | { 956 | "cell_type": "code", 957 | "execution_count": null, 958 | "metadata": {}, 959 | "outputs": [], 960 | "source": [ 961 | "def target_function_hw(x):\n", 962 | " return 2 * x\n", 963 | "\n", 964 | "def generate_samples_hw():\n", 965 | " np.random.seed(SEED)\n", 966 | " x = np.random.uniform(*RANGE, size=N_SAMPLES)\n", 967 | " \n", 968 | " np.random.seed(SEED+1)\n", 969 | " x_noise = np.random.uniform(*[x * 100 for x in RANGE], size=N_SAMPLES)\n", 970 | " x_noise2 = np.random.normal(100, 50, size=N_SAMPLES)\n", 971 | " \n", 972 | " y = target_function_hw(x) + np.random.normal(scale=4, size=N_SAMPLES)\n", 973 | " \n", 974 | " return np.concatenate([x.reshape(-1, 1) / 100, \n", 975 | " x_noise.reshape(-1, 1),\n", 976 | " x_noise2.reshape(-1, 1)], axis=1), y\n", 977 | "\n", 978 | "X_hw, y_hw = generate_samples_hw()\n", 979 | "\n", 980 | "for i in range(X_hw.shape[1]):\n", 981 | " print(f'Min of feature {i}: {min(X_hw[:, i]):.2f}, max: {max(X_hw[:, i]):.2f}')" 982 | ] 983 | }, 984 | { 985 | "cell_type": "markdown", 986 | "metadata": {}, 987 | "source": [ 988 | "#### L1" 989 | ] 990 | }, 991 | { 992 | "cell_type": "code", 993 | "execution_count": null, 994 | "metadata": {}, 995 | "outputs": [], 996 | "source": [ 997 | "## your code" 998 | ] 999 | }, 1000 | { 1001 | "cell_type": "markdown", 1002 | "metadata": {}, 1003 | "source": [ 1004 | "#### L2" 1005 | ] 1006 | }, 1007 | { 1008 | "cell_type": "code", 1009 | "execution_count": null, 1010 | "metadata": {}, 1011 | "outputs": [], 1012 | "source": [ 1013 | "## your code" 1014 | ] 1015 | }, 1016 | { 1017 | "cell_type": "markdown", 1018 | "metadata": {}, 1019 | "source": [ 1020 | "#### Summary\n", 1021 | "\n" 1022 | ] 1023 | }, 1024 | { 1025 | "cell_type": "code", 1026 | "execution_count": null, 1027 | "metadata": {}, 1028 | "outputs": [], 1029 | "source": [ 1030 | "## your observations/conclusions" 1031 | ] 1032 | } 1033 | ], 1034 | "metadata": { 1035 | "kernelspec": { 1036 | "display_name": "Python 3 (ipykernel)", 1037 | "language": "python", 1038 | "name": "python3" 1039 | }, 1040 | "language_info": { 1041 | "codemirror_mode": { 1042 | "name": "ipython", 1043 | "version": 3 1044 | }, 1045 | "file_extension": ".py", 1046 | "mimetype": "text/x-python", 1047 | "name": "python", 1048 | "nbconvert_exporter": "python", 1049 | "pygments_lexer": "ipython3", 1050 | "version": "3.8.12" 1051 | }, 1052 | "toc": { 1053 | "base_numbering": 1, 1054 | "nav_menu": {}, 1055 | "number_sections": true, 1056 | "sideBar": true, 1057 | "skip_h1_title": false, 1058 | "title_cell": "Table of Contents", 1059 | "title_sidebar": "Contents", 1060 | "toc_cell": false, 1061 | "toc_position": { 1062 | "height": "calc(100% - 180px)", 1063 | "left": "10px", 1064 | "top": "150px", 1065 | "width": "288.188px" 1066 | }, 1067 | "toc_section_display": true, 1068 | "toc_window_display": true 1069 | }, 1070 | "varInspector": { 1071 | "cols": { 1072 | "lenName": 16, 1073 | "lenType": 16, 1074 | "lenVar": 40 1075 | }, 1076 | "kernels_config": { 1077 | "python": { 1078 | "delete_cmd_postfix": "", 1079 | "delete_cmd_prefix": "del ", 1080 | "library": "var_list.py", 1081 | "varRefreshCmd": "print(var_dic_list())" 1082 | }, 1083 | "r": { 1084 | "delete_cmd_postfix": ") ", 1085 | "delete_cmd_prefix": "rm(", 1086 | "library": "var_list.r", 1087 | "varRefreshCmd": "cat(var_dic_list()) " 1088 | } 1089 | }, 1090 | "types_to_exclude": [ 1091 | "module", 1092 | "function", 1093 | "builtin_function_or_method", 1094 | "instance", 1095 | "_Feature" 1096 | ], 1097 | 
"window_display": false 1098 | } 1099 | }, 1100 | "nbformat": 4, 1101 | "nbformat_minor": 4 1102 | } 1103 | -------------------------------------------------------------------------------- /5_classification_linear_knn/README.md: -------------------------------------------------------------------------------- 1 | # Materials 2 | 3 | - In order to get a general understanding, read [An Introduction to Statistical Learning](https://trevorhastie.github.io/ISLR/) chapter 4 up to section 4.3 inclusive. Dive into section 4.4 if you dare (optional). 4 | - Review the related [sklearn facilities usage (logistic regression)](https://nbviewer.jupyter.org/github/justmarkham/DAT8/blob/master/notebooks/12_logistic_regression.ipynb). 5 | - Study the [Classification section of the Google Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/classification/video-lecture) (skip the Programming Exercise as it's not relevant for our course purposes). 6 | - Read about [kNN](https://www.unite.ai/what-is-k-nearest-neighbors/). 7 | - Review the [related sklearn facilities usage (KNN)](https://nbviewer.jupyter.org/github/justmarkham/DAT8/blob/master/notebooks/08_knn_sklearn.ipynb). 8 | - Read [this article](https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226) for an extended overview of the ways to evaluate classification models. 9 | - Read [Imbalanced classification approaches overview](https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/). 10 | - *Optional*. Read [KNN notes](https://sebastianraschka.com/pdf/lecture-notes/stat479fs18/02_knn_notes.pdf) - KNN explanation including computational performance analysis. 11 | 12 | 13 | # Assignment 14 | 15 | [hw_classification.ipynb](./hw_classification.ipynb) 16 | -------------------------------------------------------------------------------- /5_classification_linear_knn/hw_classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Classification. Linear models and KNN" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import os\n", 17 | "import numpy as np\n", 18 | "import pandas as pd\n", 19 | "import seaborn as sns\n", 20 | "import matplotlib.pyplot as plt" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "from sklearn.pipeline import Pipeline\n", 30 | "from sklearn.compose import ColumnTransformer\n", 31 | "from sklearn.model_selection import train_test_split, cross_validate\n", 32 | "from sklearn.metrics import plot_confusion_matrix, accuracy_score\n", 33 | "from sklearn.neighbors import KNeighborsClassifier\n", 34 | "from sklearn.preprocessing import StandardScaler, OneHotEncoder" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "## Part 1: Implementing Logistic Regression" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "In this task you need to implement Logistic Regression with l2 regularization using gradient descent algorithm." 
49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "Logistic Regression loss:\n", 56 | "$$ L(w) = \\dfrac{1}{N}\\sum_{i=1}^N \\log(1 + e^{-\\langle w, x_i \\rangle y_i}) + \\frac{1}{2C} \\lVert w \\rVert^2 \\to \\min_w$$\n", 57 | "$$\\langle w, x_i \\rangle = \\sum_{j=1}^n w_{j}x_{ij} + w_{0},$$ $$ y_{i} \\in \\{-1, 1\\}$$ where $n$ is the number of features and $N$ is the number of samples." 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "Gradient descent step:\n", 65 | "$$w^{(t+1)} := w^{(t)} + \\dfrac{\\eta}{N}\\sum_{i=1}^N y_ix_i \\Big(1 - \\dfrac{1}{1 + exp(-\\langle w^{(t)}, x_i \\rangle y_i)}\\Big) - \\eta \\frac{1}{C} w,$$\n", 66 | "where $\\eta$ is the learning rate." 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "**(2 points)** Implement the algorithm and use it to classify the digits (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) into \"even\" and \"odd\" categories. \"Even\" and \"Odd\" classes should correspond to {-1, 1} labels." 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "Stopping criteria: either the number of iterations exceeds *max_iter* or $||w^{(t+1)} - w^{(t)}||_2 < tol$." 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "from sklearn.exceptions import NotFittedError" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [ 98 | "class CustomLogisticRegression:\n", 99 | " _estimator_type = \"classifier\"\n", 100 | " \n", 101 | " def __init__(self, eta=0.001, max_iter=1000, C=1.0, tol=1e-5, random_state=42, zero_init=False):\n", 102 | " \"\"\"Logistic Regression classifier.\n", 103 | " \n", 104 | " Args:\n", 105 | " eta: float, default=0.001\n", 106 | " Learning rate.\n", 107 | " max_iter: int, default=1000\n", 108 | " Maximum number of iterations taken for the solvers to converge.\n", 109 | " C: float, default=1.0\n", 110 | " Inverse of regularization strength; must be a positive float.\n", 111 | " Smaller values specify stronger regularization.\n", 112 | " tol: float, default=1e-5\n", 113 | " Tolerance for stopping criteria.\n", 114 | " random_state: int, default=42\n", 115 | " Random state.\n", 116 | " zero_init: bool, default=False\n", 117 | " Zero weight initialization.\n", 118 | " \"\"\"\n", 119 | " self.eta = eta\n", 120 | " self.max_iter = max_iter\n", 121 | " self.C = C\n", 122 | " self.tol = tol\n", 123 | " self.random_state = np.random.RandomState(seed=random_state)\n", 124 | " self.zero_init = zero_init\n", 125 | " \n", 126 | " def get_sigmoid(self, X, weights):\n", 127 | " \"\"\"Compute the sigmoid value.\"\"\"\n", 128 | " # \n", 129 | " pass\n", 130 | " \n", 131 | " def get_loss(self, x, weights, y):\n", 132 | " \"\"\"Calculate the loss.\"\"\"\n", 133 | " # \n", 134 | " pass\n", 135 | " \n", 136 | " def fit(self, X, y):\n", 137 | " \"\"\"Fit the model.\n", 138 | " \n", 139 | " Args:\n", 140 | " X: numpy array of shape (n_samples, n_features)\n", 141 | " y: numpy array of shape (n_samples,)\n", 142 | " Target vector. 
\n", 143 | " \"\"\"\n", 144 | " X_ext = np.hstack([np.ones((X.shape[0], 1)), X]) # a constant feature is included to handle intercept\n", 145 | " num_features = X_ext.shape[1]\n", 146 | " if self.zero_init:\n", 147 | " self.weights_ = np.zeros(num_features) \n", 148 | " else:\n", 149 | " weight_threshold = 1.0 / (2 * num_features)\n", 150 | " self.weights_ = self.random_state.uniform(low=-weight_threshold,\n", 151 | " high=weight_threshold, size=num_features) # random weight initialization\n", 152 | " \n", 153 | " for i in range(self.max_iter):\n", 154 | " delta = \"\"\n", 155 | " self.weights_ -= self.eta * delta\n", 156 | " if \"\":\n", 157 | " break\n", 158 | " \n", 159 | " def predict_proba(self, X):\n", 160 | " \"\"\"Predict positive class probabilities.\n", 161 | " \n", 162 | " Args:\n", 163 | " X: numpy array of shape (n_samples, n_features)\n", 164 | " Returns:\n", 165 | " y: numpy array of shape (n_samples,)\n", 166 | " Vector containing positive class probabilities.\n", 167 | " \"\"\"\n", 168 | " X_ext = np.hstack([np.ones((X.shape[0], 1)), X])\n", 169 | " if hasattr(self, 'weights_'):\n", 170 | " return self.get_sigmoid(X_ext, self.weights_)\n", 171 | " else: \n", 172 | " raise NotFittedError(\"CustomLogisticRegression instance is not fitted yet\")\n", 173 | " \n", 174 | " def predict(self, X):\n", 175 | " \"\"\"Predict classes.\n", 176 | " \n", 177 | " Args:\n", 178 | " X: numpy array of shape (n_samples, n_features)\n", 179 | " Returns:\n", 180 | " y: numpy array of shape (n_samples,)\n", 181 | " Vector containing predicted class labels.\n", 182 | " \"\"\"\n", 183 | " # \n", 184 | " pass" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": null, 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "from sklearn import datasets\n", 194 | "from sklearn import metrics" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": null, 200 | "metadata": {}, 201 | "outputs": [], 202 | "source": [ 203 | "X, y = datasets.load_digits(n_class=10, return_X_y=True)\n", 204 | "\n", 205 | "_, axes = plt.subplots(nrows=3, ncols=7, figsize=(10, 5))\n", 206 | "for ax, image, label in zip(axes.flatten(), X, y):\n", 207 | " ax.set_axis_off()\n", 208 | " ax.imshow(image.reshape((8, 8)), cmap=plt.cm.gray_r if label % 2 else plt.cm.afmhot_r)\n", 209 | " ax.set_title(label)\n", 210 | "\n", 211 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)\n", 212 | "#y_train = \"\"\n", 213 | "#y_test = \"\"\n", 214 | "y_train = (y_train % 2) * 2 - 1\n", 215 | "y_test = (y_test % 2) * 2 - 1" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "metadata": {}, 222 | "outputs": [], 223 | "source": [ 224 | "assert (np.unique(y_train) == [-1, 1]).all()\n", 225 | "assert (np.unique(y_test) == [-1, 1]).all()" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [ 234 | "def fit_evaluate(clf, X_train, y_train, X_test, y_test):\n", 235 | " clf.fit(X_train, y_train)\n", 236 | " disp = metrics.plot_confusion_matrix(clf, X_test, y_test, normalize='true')\n", 237 | " disp.figure_.suptitle(\"Confusion Matrix\")\n", 238 | " plt.show()\n", 239 | " \n", 240 | " return metrics.accuracy_score(y_pred=clf.predict(X_train), y_true=y_train), \\\n", 241 | " metrics.accuracy_score(y_pred=clf.predict(X_test), y_true=y_test)" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 
null, 247 | "metadata": {}, 248 | "outputs": [], 249 | "source": [ 250 | "lr_clf = CustomLogisticRegression(max_iter=1, zero_init=True)" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": null, 256 | "metadata": {}, 257 | "outputs": [], 258 | "source": [ 259 | "assert np.allclose(lr_clf.get_sigmoid(np.array([[0.5, 0, 1.0], [0.3, 1.3, 1.0]]), np.array([0.5, -0.5, 0.1])),\n", 260 | " np.array([0.58662, 0.40131]))" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "lr_clf.fit(X_train, y_train)" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "metadata": {}, 276 | "outputs": [], 277 | "source": [ 278 | "assert np.allclose(lr_clf.weights_, np.array([ 3.1000e-06, 0.0000e+00, 4.1800e-05, 5.4770e-04, 2.2130e-04,\n", 279 | " 4.8750e-04, 1.3577e-03, 5.9780e-04, 5.6400e-05, -7.0000e-07,\n", 280 | " 1.6910e-04, 2.5190e-04, -4.3700e-04, 3.6190e-04, 1.0049e-03,\n", 281 | " 4.2280e-04, 2.5700e-05, 3.0000e-07, -1.1500e-05, -7.2440e-04,\n", 282 | " -2.6200e-04, 8.7540e-04, 4.1540e-04, -8.4200e-05, -5.2000e-06,\n", 283 | " 0.0000e+00, -2.2160e-04, -5.7130e-04, 9.8570e-04, 1.3507e-03,\n", 284 | " 5.0210e-04, -1.7050e-04, -1.0000e-06, 0.0000e+00, -6.7810e-04,\n", 285 | " -1.0515e-03, -4.4500e-05, 3.7160e-04, 4.2100e-04, -8.1800e-05,\n", 286 | " 0.0000e+00, -5.2000e-06, -5.3410e-04, -2.0393e-03, -8.4310e-04,\n", 287 | " 1.0400e-04, -1.2390e-04, -1.7880e-04, -1.3200e-05, -4.5000e-06,\n", 288 | " -9.4300e-05, -1.1127e-03, -5.0900e-04, -2.1850e-04, -5.6050e-04,\n", 289 | " -3.9560e-04, -1.7700e-05, -3.0000e-07, 2.6800e-05, 6.3920e-04,\n", 290 | " 1.8090e-04, -7.3660e-04, -5.3930e-04, -3.7060e-04, -2.8200e-05]), atol=1e-5)" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "metadata": {}, 297 | "outputs": [], 298 | "source": [ 299 | "model = CustomLogisticRegression()" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "metadata": {}, 306 | "outputs": [], 307 | "source": [ 308 | "train_acc, test_acc = fit_evaluate(model, X_train, y_train, X_test, y_test)" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": null, 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [ 317 | "train_acc, test_acc" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "metadata": {}, 324 | "outputs": [], 325 | "source": [ 326 | "assert min(train_acc, test_acc) > 0.9" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "**(0.5 points)** Visualize the loss history." 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "## your code" 343 | ] 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "metadata": {}, 348 | "source": [ 349 | "**(0.5 points)** Try different learning rates and compare the results. How does the learning rate influence the convergence?" 
350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": null, 355 | "metadata": {}, 356 | "outputs": [], 357 | "source": [ 358 | "## your code" 359 | ] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "< your thoughts >" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": {}, 371 | "source": [ 372 | "**(0.5 points)** Try different regularization parameter values and compare the model quality." 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "## your code" 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "< your thoughts >" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": {}, 394 | "source": [ 395 | "**(0.5 points)** Compare zero initialization and random initialization. " 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": null, 401 | "metadata": {}, 402 | "outputs": [], 403 | "source": [ 404 | "## your code" 405 | ] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "metadata": {}, 410 | "source": [ 411 | "< your thoughts >" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": {}, 417 | "source": [ 418 | "## Part 2: Implementing KNN Classifier" 419 | ] 420 | }, 421 | { 422 | "cell_type": "markdown", 423 | "metadata": {}, 424 | "source": [ 425 | "In this task you need to implement weighted K-Neighbors Classifier." 426 | ] 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "metadata": {}, 431 | "source": [ 432 | "Recall that training a KNN classifier is simply memorizing a training sample. \n", 433 | "\n", 434 | "The process of applying a classifier for one object is to find the distances from it to all objects in the training data, then select the k nearest objects (neighbors) and return the most common class among these objects." 435 | ] 436 | }, 437 | { 438 | "cell_type": "markdown", 439 | "metadata": {}, 440 | "source": [ 441 | "You can also give the nearest neighbors weights in accordance with the distance of the object to them. In the simplest case (as in your assignment), you can set the weights inversely proportional to that distance. \n", 442 | "\n", 443 | "$$w_{i} = \\frac{1}{d_{i} + eps},$$\n", 444 | "\n", 445 | "where $d_{i}$ is the distance between object and i-th nearest neighbor and $eps$ is the small value to prevent division by zero.\n", 446 | "\n", 447 | "In case of 'uniform' weights, all k nearest neighbors are equivalent (have equal weight, for example $w_{i} = 1, \\forall i \\in(1,k)$)." 448 | ] 449 | }, 450 | { 451 | "cell_type": "markdown", 452 | "metadata": {}, 453 | "source": [ 454 | "To predict the probability of classes, it is necessary to normalize the weights of each class, dividing them by the sum:\n", 455 | "\n", 456 | "$$p_{i} = \\frac{w_{i}}{\\sum_{j=1}^{c}w_{j}},$$\n", 457 | "\n", 458 | "where $p_i$ is probability of i-th class and $c$ is the number of classes." 459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "metadata": {}, 464 | "source": [ 465 | "**(2 points)** Implement the algorithm and use it to classify the digits. By implementing this algorithm, you will be able to classify numbers not only into \"even\" or \"odd\", but into their real representation." 
466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": null, 471 | "metadata": {}, 472 | "outputs": [], 473 | "source": [ 474 | "class CustomKNeighborsClassifier:\n", 475 | " _estimator_type = \"classifier\"\n", 476 | " \n", 477 | " def __init__(self, n_neighbors=5, weights='uniform', eps=1e-9):\n", 478 | " \"\"\"K-Nearest Neighbors classifier.\n", 479 | " \n", 480 | " Args:\n", 481 | " n_neighbors: int, default=5\n", 482 | " Number of neighbors to use by default for :meth:`kneighbors` queries.\n", 483 | " weights : {'uniform', 'distance'} or callable, default='uniform'\n", 484 | " Weight function used in prediction. Possible values:\n", 485 | " - 'uniform' : uniform weights. All points in each neighborhood\n", 486 | " are weighted equally.\n", 487 | " - 'distance' : weight points by the inverse of their distance.\n", 488 | " in this case, closer neighbors of a query point will have a\n", 489 | " greater influence than neighbors which are further away.\n", 490 | " eps : float, default=1e-5\n", 491 | " Epsilon to prevent division by 0 \n", 492 | " \"\"\"\n", 493 | " self.n_neighbors = n_neighbors\n", 494 | " self.weights = weights\n", 495 | " self.eps = eps\n", 496 | " \n", 497 | " \n", 498 | " def get_pairwise_distances(self, X, Y):\n", 499 | " \"\"\"\n", 500 | " Returnes matrix of the pairwise distances between the rows from both X and Y.\n", 501 | " Args:\n", 502 | " X: numpy array of shape (n_samples, n_features)\n", 503 | " Y: numpy array of shape (k_samples, n_features)\n", 504 | " Returns:\n", 505 | " P: numpy array of shape (n_samples, k_samples)\n", 506 | " Matrix in which (i, j) value is the distance \n", 507 | " between i'th row from the X and j'th row from the Y.\n", 508 | " \"\"\"\n", 509 | " # \n", 510 | " pass\n", 511 | " \n", 512 | " \n", 513 | " def get_class_weights(self, y, weights):\n", 514 | " \"\"\"\n", 515 | " Returns a vector with sum of weights for each class \n", 516 | " Args:\n", 517 | " y: numpy array of shape (n_samles,)\n", 518 | " weights: numpy array of shape (n_samples,)\n", 519 | " The weights of the corresponding points of y.\n", 520 | " Returns:\n", 521 | " p: numpy array of shape (n_classes)\n", 522 | " Array where the value at the i-th position \n", 523 | " corresponds to the weight of the i-th class.\n", 524 | " \"\"\"\n", 525 | " # \n", 526 | " pass\n", 527 | " \n", 528 | " \n", 529 | " def fit(self, X, y):\n", 530 | " \"\"\"Fit the model.\n", 531 | " \n", 532 | " Args:\n", 533 | " X: numpy array of shape (n_samples, n_features)\n", 534 | " y: numpy array of shape (n_samples,)\n", 535 | " Target vector. 
\n", 536 | " \"\"\"\n", 537 | " self.points = X\n", 538 | " self.y = y\n", 539 | " self.classes_ = np.unique(y)\n", 540 | " \n", 541 | " \n", 542 | " def predict_proba(self, X):\n", 543 | " \"\"\"Predict positive class probabilities.\n", 544 | " \n", 545 | " Args:\n", 546 | " X: numpy array of shape (n_samples, n_features)\n", 547 | " Returns:\n", 548 | " y: numpy array of shape (n_samples, n_classes)\n", 549 | " Vector containing positive class probabilities.\n", 550 | " \"\"\"\n", 551 | " if hasattr(self, 'points'):\n", 552 | " P = self.get_pairwise_distances(X, self.points)\n", 553 | " \n", 554 | " weights_of_points = np.ones(P.shape)\n", 555 | " if self.weights == 'distance':\n", 556 | " weights_of_points = 'your code'\n", 557 | " \n", 558 | " # \n", 559 | " pass\n", 560 | " \n", 561 | " else: \n", 562 | " raise NotFittedError(\"CustomKNeighborsClassifier instance is not fitted yet\")\n", 563 | " \n", 564 | " \n", 565 | " def predict(self, X):\n", 566 | " \"\"\"Predict classes.\n", 567 | " \n", 568 | " Args:\n", 569 | " X: numpy array of shape (n_samples, n_features)\n", 570 | " Returns:\n", 571 | " y: numpy array of shape (n_samples,)\n", 572 | " Vector containing predicted class labels.\n", 573 | " \"\"\"\n", 574 | " # \n", 575 | " pass" 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": null, 581 | "metadata": {}, 582 | "outputs": [], 583 | "source": [ 584 | "model = CustomKNeighborsClassifier(n_neighbors=5, weights='distance')\n", 585 | "knn = KNeighborsClassifier(n_neighbors=5, weights='distance')" 586 | ] 587 | }, 588 | { 589 | "cell_type": "code", 590 | "execution_count": null, 591 | "metadata": {}, 592 | "outputs": [], 593 | "source": [ 594 | "assert np.allclose(model.get_pairwise_distances(np.array([[0 , 1] , [1, 1]]), \n", 595 | " np.array([[0.5, 0.5], [1, 0]])),\n", 596 | " np.array([[0.70710678, 1.41421356],\n", 597 | " [0.70710678, 1. 
]]))" 598 | ] 599 | }, 600 | { 601 | "cell_type": "code", 602 | "execution_count": null, 603 | "metadata": {}, 604 | "outputs": [], 605 | "source": [ 606 | "model.classes_ = ['one', 'two', 'three']\n", 607 | "assert np.allclose(model.get_class_weights(np.array(['one', 'one', 'three', 'two']), np.array([1, 1, 0, 4])), \n", 608 | " np.array([2,4,0]))" 609 | ] 610 | }, 611 | { 612 | "cell_type": "code", 613 | "execution_count": null, 614 | "metadata": {}, 615 | "outputs": [], 616 | "source": [ 617 | "X, y = datasets.load_digits(n_class=10, return_X_y=True)\n", 618 | "\n", 619 | "_, axes = plt.subplots(nrows=3, ncols=7, figsize=(10, 5))\n", 620 | "for ax, image, label in zip(axes.flatten(), X, y):\n", 621 | " ax.set_axis_off()\n", 622 | " ax.imshow(image.reshape((8, 8)), cmap=plt.cm.gray_r if label % 2 else plt.cm.afmhot_r)\n", 623 | " ax.set_title(label)\n", 624 | "\n", 625 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)" 626 | ] 627 | }, 628 | { 629 | "cell_type": "code", 630 | "execution_count": null, 631 | "metadata": {}, 632 | "outputs": [], 633 | "source": [ 634 | "model.fit(X_train, y_train)\n", 635 | "knn.fit(X_train, list(map(str, y_train)));" 636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": null, 641 | "metadata": {}, 642 | "outputs": [], 643 | "source": [ 644 | "assert np.allclose(model.predict_proba(X_test), knn.predict_proba(X_test))" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": null, 650 | "metadata": {}, 651 | "outputs": [], 652 | "source": [ 653 | "train_acc, test_acc = fit_evaluate(model, X_train, y_train, X_test, y_test)" 654 | ] 655 | }, 656 | { 657 | "cell_type": "code", 658 | "execution_count": null, 659 | "metadata": {}, 660 | "outputs": [], 661 | "source": [ 662 | "assert train_acc == 1\n", 663 | "assert test_acc > 0.98" 664 | ] 665 | }, 666 | { 667 | "cell_type": "markdown", 668 | "metadata": {}, 669 | "source": [ 670 | "**(0.5 points)** Take a look at the confusion matrix and tell what numbers the model confuses and why this happens." 671 | ] 672 | }, 673 | { 674 | "cell_type": "markdown", 675 | "metadata": {}, 676 | "source": [ 677 | "< your thoughts >" 678 | ] 679 | }, 680 | { 681 | "cell_type": "markdown", 682 | "metadata": {}, 683 | "source": [ 684 | "**(0.5 points)** Try different n_neighbors parameters and compare the output probabilities of the model." 685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": null, 690 | "metadata": {}, 691 | "outputs": [], 692 | "source": [ 693 | "## your code" 694 | ] 695 | }, 696 | { 697 | "cell_type": "markdown", 698 | "metadata": {}, 699 | "source": [ 700 | "< your thoughts >" 701 | ] 702 | }, 703 | { 704 | "cell_type": "markdown", 705 | "metadata": {}, 706 | "source": [ 707 | "**(0.5 points)** Compare both 'uniform' and 'distance' weights and share your thoughts in what situations which parameter can be better." 708 | ] 709 | }, 710 | { 711 | "cell_type": "code", 712 | "execution_count": null, 713 | "metadata": {}, 714 | "outputs": [], 715 | "source": [ 716 | "## your code" 717 | ] 718 | }, 719 | { 720 | "cell_type": "markdown", 721 | "metadata": {}, 722 | "source": [ 723 | "< your thoughts >" 724 | ] 725 | }, 726 | { 727 | "cell_type": "markdown", 728 | "metadata": {}, 729 | "source": [ 730 | "**(0.5 points)** Suggest another distance measurement function that could improve the quality of the classification for this task. 
" 731 | ] 732 | }, 733 | { 734 | "cell_type": "markdown", 735 | "metadata": {}, 736 | "source": [ 737 | "< your thoughts >" 738 | ] 739 | }, 740 | { 741 | "cell_type": "markdown", 742 | "metadata": {}, 743 | "source": [ 744 | "**(0.5 points)** Suggest different task and distance function that you think would be suitable for it." 745 | ] 746 | }, 747 | { 748 | "cell_type": "markdown", 749 | "metadata": {}, 750 | "source": [ 751 | "< your thoughts >" 752 | ] 753 | }, 754 | { 755 | "cell_type": "markdown", 756 | "metadata": {}, 757 | "source": [ 758 | "## Part 3: Synthetic Titanic Survival Prediction" 759 | ] 760 | }, 761 | { 762 | "cell_type": "markdown", 763 | "metadata": {}, 764 | "source": [ 765 | "### Dataset\n", 766 | "\n", 767 | "Read the description here: https://www.kaggle.com/c/tabular-playground-series-apr-2021/data. Download the dataset and place it in the *data/titanic/* folder in your working directory.\n", 768 | "You will use train.csv for model training and validation. The test set is used for model testing: once the model is trained, you can predict whether a passenger survived or not for each passenger in the test set, and submit the predictions: https://www.kaggle.com/c/tabular-playground-series-apr-2021/overview/evaluation. \n" 769 | ] 770 | }, 771 | { 772 | "cell_type": "code", 773 | "execution_count": null, 774 | "metadata": {}, 775 | "outputs": [], 776 | "source": [ 777 | "PATH = \"./data/\"" 778 | ] 779 | }, 780 | { 781 | "cell_type": "code", 782 | "execution_count": null, 783 | "metadata": {}, 784 | "outputs": [], 785 | "source": [ 786 | "data = pd.read_csv(os.path.join(PATH, 'titanic', 'train.csv')).set_index('PassengerId')" 787 | ] 788 | }, 789 | { 790 | "cell_type": "code", 791 | "execution_count": null, 792 | "metadata": { 793 | "scrolled": true 794 | }, 795 | "outputs": [], 796 | "source": [ 797 | "data.head()" 798 | ] 799 | }, 800 | { 801 | "cell_type": "markdown", 802 | "metadata": {}, 803 | "source": [ 804 | "### EDA" 805 | ] 806 | }, 807 | { 808 | "cell_type": "markdown", 809 | "metadata": {}, 810 | "source": [ 811 | "**(0.5 points)** How many females and males are there in the dataset? What about the survived passengers? Is there any relationship between the gender and the survival?" 812 | ] 813 | }, 814 | { 815 | "cell_type": "code", 816 | "execution_count": null, 817 | "metadata": {}, 818 | "outputs": [], 819 | "source": [ 820 | "## your code" 821 | ] 822 | }, 823 | { 824 | "cell_type": "markdown", 825 | "metadata": {}, 826 | "source": [ 827 | "< your thoughts >" 828 | ] 829 | }, 830 | { 831 | "cell_type": "markdown", 832 | "metadata": {}, 833 | "source": [ 834 | "**(0.5 points)** Plot age distribution of the passengers. What is the average and the median age of survived and deceased passengers? Do age distributions differ for survived and deceased passengers? Why?" 835 | ] 836 | }, 837 | { 838 | "cell_type": "code", 839 | "execution_count": null, 840 | "metadata": {}, 841 | "outputs": [], 842 | "source": [ 843 | "## your code" 844 | ] 845 | }, 846 | { 847 | "cell_type": "markdown", 848 | "metadata": {}, 849 | "source": [ 850 | "< your thoughts >" 851 | ] 852 | }, 853 | { 854 | "cell_type": "markdown", 855 | "metadata": {}, 856 | "source": [ 857 | "**(1 point)** Explore \"passenger class\" and \"embarked\" features. What class was \"the safest\"? Is there any relationship between the embarkation port and the survival? Provide the corresponding visualizations." 
858 | ] 859 | }, 860 | { 861 | "cell_type": "code", 862 | "execution_count": null, 863 | "metadata": {}, 864 | "outputs": [], 865 | "source": [ 866 | "## your code" 867 | ] 868 | }, 869 | { 870 | "cell_type": "markdown", 871 | "metadata": {}, 872 | "source": [ 873 | "< your thoughts >" 874 | ] 875 | }, 876 | { 877 | "cell_type": "markdown", 878 | "metadata": {}, 879 | "source": [ 880 | "### Modelling" 881 | ] 882 | }, 883 | { 884 | "cell_type": "markdown", 885 | "metadata": {}, 886 | "source": [ 887 | "**(0.5 points)** Find the percentage of missing values for each feature. " 888 | ] 889 | }, 890 | { 891 | "cell_type": "code", 892 | "execution_count": null, 893 | "metadata": {}, 894 | "outputs": [], 895 | "source": [ 896 | "## your code" 897 | ] 898 | }, 899 | { 900 | "cell_type": "markdown", 901 | "metadata": {}, 902 | "source": [ 903 | "Think about the ways to handle these missing values for modelling and write your answer below. Which methods would you suggest? What are their advantages and disadvantages?\n", 904 | "\n", 905 | "< your thoughts >" 906 | ] 907 | }, 908 | { 909 | "cell_type": "markdown", 910 | "metadata": {}, 911 | "source": [ 912 | "**(1.5 points)** Prepare the features and train two models (KNN and Logistic Regression) to predict the survival. Compare the results. Use accuracy as a metric. Don't forget about cross-validation!" 913 | ] 914 | }, 915 | { 916 | "cell_type": "code", 917 | "execution_count": null, 918 | "metadata": {}, 919 | "outputs": [], 920 | "source": [ 921 | "## your code" 922 | ] 923 | }, 924 | { 925 | "cell_type": "markdown", 926 | "metadata": {}, 927 | "source": [ 928 | "**(0.5 + X points)** Try more feature engineering and hyperparameter tuning to improve the results. You may use either KNN or Logistic Regression (or both)." 929 | ] 930 | }, 931 | { 932 | "cell_type": "code", 933 | "execution_count": null, 934 | "metadata": {}, 935 | "outputs": [], 936 | "source": [ 937 | "## your code" 938 | ] 939 | }, 940 | { 941 | "cell_type": "markdown", 942 | "metadata": {}, 943 | "source": [ 944 | "Select the best model, load the test set and make the predictions. Submit them to kaggle and see the results :)\n", 945 | "\n", 946 | "**Note**. X points will depend on your kaggle public leaderboard score.\n", 947 | "$$ f(score) = 1.0, \\ \\ 0.79 \\leq score < 0.80,$$\n", 948 | "$$ f(score) = 2.5, \\ \\ 0.80 \\leq score < 0.81,$$ \n", 949 | "$$ f(score) = 4.0, \\ \\ 0.81 \\leq score $$ \n", 950 | "Your code should generate the output submitted to kaggle. Fix random seeds to make the results reproducible." 
951 | ] 952 | }, 953 | { 954 | "cell_type": "code", 955 | "execution_count": null, 956 | "metadata": {}, 957 | "outputs": [], 958 | "source": [] 959 | } 960 | ], 961 | "metadata": { 962 | "kernelspec": { 963 | "display_name": "Python 3 (ipykernel)", 964 | "language": "python", 965 | "name": "python3" 966 | }, 967 | "language_info": { 968 | "codemirror_mode": { 969 | "name": "ipython", 970 | "version": 3 971 | }, 972 | "file_extension": ".py", 973 | "mimetype": "text/x-python", 974 | "name": "python", 975 | "nbconvert_exporter": "python", 976 | "pygments_lexer": "ipython3", 977 | "version": "3.9.7" 978 | } 979 | }, 980 | "nbformat": 4, 981 | "nbformat_minor": 4 982 | } 983 | -------------------------------------------------------------------------------- /6_feature_engineering_selection/README.md: -------------------------------------------------------------------------------- 1 | # Materials 2 | 3 | * General article about feature engineering and selection (main reference): 4 | https://github.com/Yorko/mlcourse.ai/blob/master/jupyter_english/topic06_features_regression/topic6_feature_engineering_feature_selection.ipynb 5 | 6 | * A great set of videos about Principal Component Analysis: [PCA 1](https://www.youtube.com/watch?v=FgakZw6K1QQ&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=28&ab_channel=StatQuestwithJoshStarmer) and [PCA 2](https://www.youtube.com/watch?v=oRvgq966yZg&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=30&ab_channel=StatQuestwithJoshStarmer) 7 | 8 | * [Boruta feature selection article](https://towardsdatascience.com/boruta-explained-the-way-i-wish-someone-explained-it-to-me-4489d70e154a) 9 | 10 | * Feature engineering/preprocessing, using scikit-learn API (great code examples, but really brief explanation): 11 | https://scikit-learn.org/stable/modules/preprocessing 12 | 13 | * Feature scaling/normalization: 14 | https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35 15 | 16 | * Log Transform/power transform: 17 | https://medium.com/@kyawsawhtoon/log-transformation-purpose-and-interpretation-9444b4b049c9 18 | 19 | * Missing values preprocessing using scikit-learn API (great code examples, great explanation): 20 | https://scikit-learn.org/stable/modules/impute.html 21 | 22 | * Feature selection scikit-learn API (great code examples, great explanation): 23 | https://scikit-learn.org/stable/modules/feature_selection.html 24 | 25 | * Melbourne housing dataset source: 26 | https://www.kaggle.com/anthonypino/melbourne-housing-market 27 | 28 | # Assignment 29 | See [homework.ipynb](./homework.ipynb) notebook. 
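For a quick orientation before starting the homework, below is a minimal sketch (not part of the graded assignment) of how the preprocessing tools referenced in the materials above (imputation, scaling, one-hot encoding and univariate feature selection) can be chained into a single scikit-learn pipeline. The tiny synthetic dataset and its column names (`rooms`, `land_size`, `region`) are hypothetical stand-ins, not the real Melbourne housing schema.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# A tiny synthetic housing-style table (hypothetical columns, not the Melbourne schema)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "rooms": rng.integers(1, 6, size=n).astype(float),
    "land_size": rng.normal(500.0, 150.0, size=n),
    "region": rng.choice(["north", "south", "east"], size=n),
})
y = 50_000 * df["rooms"] + 100 * df["land_size"] + rng.normal(0.0, 10_000.0, size=n)

# Knock out some values so the imputers have something to do
df.loc[rng.choice(n, 20, replace=False), "land_size"] = np.nan
df.loc[rng.choice(n, 15, replace=False), "region"] = np.nan

numeric, categorical = ["rooms", "land_size"], ["region"]

preprocess = ColumnTransformer(transformers=[
    # numeric columns: fill missing values with the median, then standardize
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    # categorical columns: fill missing values with the mode, then one-hot encode
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("select", SelectKBest(score_func=f_regression, k=4)),  # keep the 4 strongest features
    ("regress", Ridge(alpha=1.0)),
])

print("CV R^2:", cross_val_score(model, df, y, cv=5, scoring="r2").mean())
```

Keeping every step inside the `Pipeline` means the imputers, the encoder and the selector are re-fitted on each training fold during cross-validation, which avoids leaking information from the validation folds.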
30 | -------------------------------------------------------------------------------- /7_trees_and_ensembles/README.md: -------------------------------------------------------------------------------- 1 | # Materials 2 | 3 | ## Decision Tree and Random Forest 4 | - [Decision Trees - Part 1 (17m)](https://www.youtube.com/watch?v=7VeUPuFGJHk&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=38) 5 | - [Decision Trees - Part 2 (22m)](https://www.youtube.com/watch?v=g9c66TUylZ4&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=40) 6 | - [Random Forest - Part 1 (10m)](https://www.youtube.com/watch?v=J4Wdy0Wc_xQ&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=42) 7 | - [Random Forest - Part 2 (12m)](https://www.youtube.com/watch?v=sQ870aTKqiM&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=43) 8 | 9 | ## Hyperparameter tuning 10 | - [Gentle introductiom to Grid Search and Random Search](https://medium.com/@senapati.dipak97/grid-search-vs-random-search-d34c92946318) 11 | - [Grid and Random search in details with code examples](https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/) 12 | 13 | # Assignment 14 | See [rf_classifier.ipynb](./rf_classifier.ipynb) notebook. 15 | -------------------------------------------------------------------------------- /7_trees_and_ensembles/grid_random_search.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rolling-scopes-school/ml-intro/e7fec69ba996da03354b440107a66f26defa95c9/7_trees_and_ensembles/grid_random_search.png -------------------------------------------------------------------------------- /7_trees_and_ensembles/tests.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.metrics import accuracy_score 4 | 5 | RANDOM_STATE = 2020 6 | 7 | 8 | def create_datasets(): 9 | """Create train and test datasets from the same distribution 10 | 11 | Returns: 12 | (np.array, np.array, np.array, np.array): 13 | A tuple (X_train, X_test, y_train, y_test) 14 | """ 15 | rng = np.random.default_rng(RANDOM_STATE) 16 | X = rng.normal(size=(160, 10)) 17 | y = (X[:, ::2].sum(axis=1) < 0.0).astype(int) 18 | return train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE) 19 | 20 | 21 | def sum_up_tree_values(tree, field): 22 | """Perform depth first search across the tree and sum up values of requested fields 23 | 24 | Args: 25 | tree: Tree 26 | Tree to traverse 27 | field: str 28 | The name of the class field which values should be summed up 29 | Returns: 30 | float: 31 | The sum of all values found across the tree. None is replaces with 0.0 32 | """ 33 | value = getattr(tree, field) 34 | if value is None: 35 | value = 0.0 36 | if tree.left_child is not None: 37 | value += sum_up_tree_values(tree.left_child, field) 38 | if tree.right_child is not None: 39 | value += sum_up_tree_values(tree.right_child, field) 40 | return value 41 | 42 | 43 | def sum_up_forest_values(forest, field): 44 | """Sums up values of requested field in all the trees of a firest 45 | 46 | Args: 47 | forest: RandomForestClassifier 48 | Random Forest to go through 49 | field: str 50 | The name of the class field which values should be summed up 51 | Returns: 52 | float: 53 | The sum of all values found across all the trees of the forest. 
54 | """ 55 | value = 0 56 | for tree in forest.trees: 57 | value += sum_up_tree_values(tree, field) 58 | return value 59 | 60 | 61 | def test_gini_index(func): 62 | def test(x, value): 63 | assert np.isclose(func(x), value), f'Should be {value} for {x}' 64 | test(np.array([]), 0.0) 65 | test(np.array([1]), 0.0) 66 | test(np.array([1, 0]), 0.5) 67 | test(np.array([1, 1, 0]), 4/9) 68 | trgt = np.random.default_rng(RANDOM_STATE).choice([0, 1], 20) 69 | test(trgt, 0.455) 70 | print('\033[92m All good!') 71 | 72 | 73 | def test_gini_gain(func): 74 | def test(x, xs, value): 75 | assert np.isclose(func(x, xs), value), f'Should be {value} for {x} and {xs}' 76 | test(np.array([1, 1, 0]), [np.array([1, 1]), np.array([0])], 4/9) 77 | test(np.array([1, 1, 0]), [np.array([1]), np.array([1, 0])], 1/9) 78 | trgt = np.random.default_rng(RANDOM_STATE).choice([0, 1], 20) 79 | test(trgt, [trgt[:10], trgt[10:]], 0.045) 80 | print('\033[92m All good!') 81 | 82 | 83 | def test_entropy(func): 84 | assert func(np.array([0])) == 0.0, 'Should be 0.0 if all elements are equal' 85 | assert func(np.array([1])) == 0.0, 'Should be 0.0 if all elements are equal' 86 | 87 | def test(x, value): 88 | assert np.isclose(func(x), value), f'Should be {value} for {x}' 89 | test(np.array([]), 0.0) 90 | test(np.array([1, 0]), 0.69314718) 91 | test(np.array([1, 1, 0]), 0.63651416) 92 | trgt = np.random.default_rng(RANDOM_STATE).choice([0, 1], 20) 93 | test(trgt, 0.64744663) 94 | print('\033[92m All good!') 95 | 96 | 97 | def test_information_gain(func): 98 | def test(x, xs, value): 99 | assert np.isclose(func(x, xs), value), f'Should be {value} for {x} and {xs}' 100 | test(np.array([1, 1, 0]), [np.array([1, 1]), np.array([0])], 0.63651416) 101 | test(np.array([1, 1, 0]), [np.array([1]), np.array([1, 0])], 0.17441604) 102 | trgt = np.random.default_rng(RANDOM_STATE).choice([0, 1], 20) 103 | test(trgt, [trgt[:10], trgt[10:]], 0.05067183) 104 | print('\033[92m All good!') 105 | 106 | 107 | def test_split_dataset(func): 108 | df = np.random.default_rng(RANDOM_STATE).normal(size=(5, 2)) 109 | trgt = np.random.default_rng(RANDOM_STATE).integers(2, size=5) 110 | l_X, r_X, l_y, r_y = func(df, trgt, 0, 0) 111 | assert l_X is not None and l_y is not None, "Should not return None values" 112 | assert np.allclose(l_X, [[-0.27279702, 0.06677798], [-1.76050488, 1.08797011]]) 113 | assert np.array_equal(l_y, [1, 1]) 114 | l_X, r_X, l_y, r_y = func(df, trgt, 1, 0.1) 115 | assert l_X is not None and l_y is not None, "Should not return None values" 116 | assert np.allclose(l_X, [[1.33254869, -1.41820464], [-0.27279702, 0.06677798]]) 117 | assert np.array_equal(l_y, [0, 1]) 118 | print('\033[92m All good!') 119 | 120 | 121 | def test_tree(Tree): 122 | rng = np.random.default_rng(RANDOM_STATE) 123 | df = rng.normal(size=(5, 3)) 124 | y = rng.integers(2, size=5) 125 | df_test = rng.normal(size=(5, 3)) 126 | 127 | tree = Tree(criterion='entropy', random_gen=np.random.default_rng(RANDOM_STATE)) 128 | assert np.allclose(np.sort(tree._find_splits(df[:, 0])), 129 | [-0.58359374, 0.31790047, 0.73637695, 1.17408836]), \ 130 | "Your _find_splits function doesn't work right" 131 | assert np.allclose(tree._find_best_split(df, y, df.shape[1]), 132 | [0, 0.73637695, 0.29110316]), \ 133 | "Your _find_best_splits function doesn't work right" 134 | tree.fit(df, y, max_depth=0) 135 | assert tree.left_child is None and tree.right_child is None, \ 136 | "Your tree grows more then allowed by max_depth." 
137 | tree.fit(df, np.array([1, 1, 1, 1, 1])) 138 | assert tree.left_child is None and tree.right_child is None, \ 139 | "Your tree grows even when it is pure" 140 | 141 | tree = Tree(criterion='entropy', random_gen=np.random.default_rng(RANDOM_STATE)) 142 | tree.fit(df, y, max_depth=1) 143 | assert np.isclose(tree.threshold, 0.73637695) and tree.column_index == 0, \ 144 | "The split values are not stored in the right way." 145 | assert tree.left_child is not None and tree.right_child is not None, \ 146 | "Your tree doesn't grow children when it should." 147 | assert np.array_equal([tree.predict_row(np.array([x, 0, 0])) for x in [0.73, 0.74]], 148 | [0., 1.]), "Your predict_row method may be erroneous" 149 | 150 | tree = Tree(criterion='entropy', random_gen=np.random.default_rng(RANDOM_STATE)) 151 | tree.fit(df, y) 152 | np.allclose([sum_up_tree_values(tree, field) for field in 153 | ['threshold', 'outcome_probs', 'column_index']], 154 | [-0.11048649, 2.0, 2.0]), "Your fit method seem to have an error" 155 | assert np.allclose(tree.predict(df_test), [0., 0., 1., 0., 0.]),\ 156 | "Your predict method may be erroneous" 157 | 158 | X_train, X_test, y_train, y_test = create_datasets() 159 | tree = Tree(criterion='gini', random_gen=np.random.default_rng(RANDOM_STATE)) 160 | tree.fit(X_train, y_train, max_depth=3, feature_frac=0.7) 161 | 162 | assert np.allclose([sum_up_tree_values(tree, field) for field in 163 | ['threshold', 'outcome_probs', 'column_index']], 164 | [0.28624208, 4.0, 28.0]), "Your fit method seems to have an error" 165 | 166 | y_pred = tree.predict(X_test) 167 | assert np.isclose(accuracy_score(y_test, y_pred), 0.78125), \ 168 | "Your tree is not accurate enough enough" 169 | 170 | print('\033[92m All good!') 171 | 172 | 173 | def test_random_forest(clf): 174 | X_train, X_test, y_train, y_test = create_datasets() 175 | rng = np.random.default_rng(RANDOM_STATE) 176 | model = clf(n_estimators=10, max_depth=4, feature_frac=None, 177 | criterion="entropy", random_gen=rng) 178 | model.fit(X_train, y_train) 179 | 180 | assert np.allclose([sum_up_forest_values(model, field) for field in 181 | ['threshold', 'outcome_probs', 'column_index']], 182 | [-9.41975128, 45.0, 346.0]), "Your fit method seems to have an error" 183 | 184 | y_pred = model.predict(X_test) 185 | assert np.isclose(accuracy_score(y_test, y_pred), 0.71875), \ 186 | "Your classifier is not accurate enough" 187 | print('\033[92m All good!') 188 | -------------------------------------------------------------------------------- /8_clustering/README.md: -------------------------------------------------------------------------------- 1 | # Materials 2 | 3 | **Lecture slides:** 4 | - [Clustering](./slides/slides_clustering.pdf) 5 | - [Dimensionality reduction](./slides/slides_dimensionality_reduction.pdf) 6 | 7 | **Reading:** 8 | 9 | - [An Introduction to Statistical Learning - Hastie et al. 
\[Sections 10.3, 10.2\]](https://www.statlearning.com/s/ISLR-Seventh-Printing.pdf) 10 | - [Relationship between SVD and PCA](https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca) 11 | 12 | 13 | **Videos:** 14 | - [Clustering | Unsupervised Learning | Introduction](https://www.youtube.com/watch?v=Ev8YbxPu_bQ) (3m) 15 | - [Clustering | KMeans Algorithm](https://www.youtube.com/watch?v=hDmNF9JG3lo) (12m) 16 | - [Clustering | Optimization Objective](https://www.youtube.com/watch?v=LvgcfMOyREE) (7m) 17 | - [Clustering | Random Initialization](https://www.youtube.com/watch?v=PpH_hv55GNQ) (8m) 18 | - [Clustering | Choosing The Number Of Clusters](https://www.youtube.com/watch?v=lbR5br5yvrY) (8m) 19 | - [K-means and Hierarchical Clustering](https://www.youtube.com/watch?v=QXOkPvFM6NU) (17m) 20 | - [DBSCAN](https://www.youtube.com/watch?v=6jl9KkmgDIw) (7m) 21 | - [Number of Clusters as a Hyperparameter The Elbow and Silhouette Method](https://www.youtube.com/watch?v=AtxQ0rvdQIA) (8m) 22 | - [Dimensionality Reduction Motivation I | Data Compression](https://www.youtube.com/watch?v=Zbr5hyJNGCs) (10m) 23 | - [Dimensionality Reduction Motivation II | Visualization](https://www.youtube.com/watch?v=cnCzY5M3txk) (5m) 24 | - [Dimensionality Reduction | Principal Component Analysis | Problem Formulation](https://www.youtube.com/watch?v=T-B8muDvzu0) (9m) 25 | - [Dimensionality Reduction | Principal Component Analysis Algorithm](https://www.youtube.com/watch?v=rng04VJxUt4) (15m) 26 | - [Dimensionality Reduction | Choosing The Number Of Principal Components](https://www.youtube.com/watch?v=5aHWplWElcc) (10m) 27 | - [Dimensionality Reduction | Reconstruction From Compressed Representation](https://www.youtube.com/watch?v=R8zHEyT2R4E) (4m) 28 | - [Dimensionality Reduction | Advice For Applying PCA](https://www.youtube.com/watch?v=xI9-I-gcwaw) (13m) 29 | - [Multidimensional Scaling](https://www.youtube.com/watch?v=Yt0o8ukIOKU) (20m) 30 | - [t-SNE, Clearly Explained](https://www.youtube.com/watch?v=NEaUSP4YerM) (12m) 31 | 32 | **Optional:** 33 | 34 | - \[Grus\], chapter 19 35 | - \[Coelho & Richert\], chapter 3 36 | - \[Mueller & Guido\], chapter 3 37 | - https://www.kaggle.com/kashnitsky/topic-7-unsupervised-learning-pca-and-clustering 38 | - https://habr.com/en/company/ods/blog/325654/ 39 | - http://www.machinelearning.ru/wiki/images/5/52/Voron-ML-Clustering-SSL-slides.pdf 40 | - https://www.kaggle.com/kashnitsky/a7-demo-unsupervised-learning 41 | 42 | # Knowledge Check 43 | 44 | [Quiz](https://docs.google.com/forms/d/1JXZpqij3vIMjcXooHnNJbkr_Zc3z_FqQztX-BalSBo4/edit?usp=sharing) 45 | 46 | # Assignment 47 | 48 | [clustering.ipynb](./clustering.ipynb) 49 | 50 | # References 51 | 52 | - \[Grus\] Grus, J. Data Science from Scratch: First Principles with Python. O'Reilly, 2015.
53 | _Русский перевод:_ Грас, Дж. Data Science. Наука о данных с нуля. – СПб.: БХВ-Петербург, 2017. 54 | - \[Coelho & Richert\] Coelho, L. P., and W. Richert. Building Machine Learning Systems with Python. 2nd ed. Packt Publishing, 2015.
55 | _Русский перевод:_ Коэльо, Л. П. Построение систем машинного обучения на языке Python // Л. П. Коэльо, В. Ричарт. – 2-е изд. – М.: ДМК Пресс, 2017. 56 | - \[Mueller & Guido\] Mueller, A. C., and S. Guido. Introduction to Machine Learning with Python. O'Reilly, 2016.
57 | _Русский перевод:_ Мюллер, А. Введение в машинное обучение с помощью Python // А. Мюллер, С. Гвидо. – М.: Диалектика, 2017. 58 | -------------------------------------------------------------------------------- /8_clustering/clustering.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Each task that is proposed to be completed as part of the homework has a declared \"price\" in points. The maximum possible amount is 10 points, and together with the bonus assignment - 12 points. It is not necessary to complete all the tasks, only a part can be done. Most of the points expect you to write working Python code; sometimes you will need to write comments - for example, to compare several approaches to solve the same problem. Also you can add more cells for your convenience if you need." 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "This homework focuses on clustering. We will work with images of handwritten digits, learn how to cluster them using two different methods (hierarchical clustering and the 𝐾-means algorithm), evaluate the quality of the partition and choose the optimal number of clusters, as well as visualize intermediate results." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "## 1. Loading data\n", 22 | "The data we will be working with is available in the scikit-learn library (`sklearn` module) in the `datasets` submodule via the `load_digits` function. The data contains 1,797 observations, each of which is 8×8 pixel image of a handwritten digit from 0 to 9. This is about the same amount of each digit (about 180).\n", 23 | "\n", 24 | "For convenience, every image expands to a 64 (8×8) row, so entire numpy array is 1797×64. The color intensity in each pixel is encoded with an integer from 0 to 16.\n", 25 | "\n", 26 | "In addition to images, their labels are also known. In this task, we will assume that the labels (as well as their amount) are unknown and try to group the data in such a way that the resulting clusters 'better match' the original ones. Possible options for determining the 'better match' are presented later." 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "**(0.25 points)** Load the images into `X` variable, and their labels into `y` variable." 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "assert X.shape == (1797, 64)\n", 50 | "assert y.shape == (1797,)" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "**(0.5 points)** Visualize the first 10 images.\n", 58 | "\n", 59 | "- Arrange images on a grid rather than in a row. You may need the `subplot` and `imshow` functions from the `pyplot` module in the `matplotlib` library.\n", 60 | "- You will also need to reshape the images to 8×8.\n", 61 | "- Remove ticks and labels from both axes. The `xticks` and `yticks` functions or the `tick_params` function from `pyplot` can help you with this.\n", 62 | "- Make the output good sized with the `figure` function from `pyplot`." 
63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "## 2. Clustering and quality evaluation" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "**(2 points)** Implement the the KMeans algorithm. Use objective function $L = \\sum_{i=1}^{n}|x_{i}-Z_{A(x_{i})}|^{2}$, where $Z_{A(x_{i})}$ is the center of the cluster corresponding to $x_{i}$ object." 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [ 92 | "from sklearn.exceptions import NotFittedError\n", 93 | "from numpy.random import RandomState" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 2, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "class CustomKMeans:\n", 103 | " def __init__(self, n_clusters=2, max_iter=30, n_init=10, random_state=42):\n", 104 | " '''K-Means clustering.\n", 105 | " \n", 106 | " Args:\n", 107 | " n_clusters: int, default=2\n", 108 | " The number of clusters to be formed is also \n", 109 | " the number of centroids to generate. \n", 110 | " max_iter: int, default=300\n", 111 | " Maximum number of iterations of the k-means algorithm for a\n", 112 | " single run.\n", 113 | " n_init: int, default=10\n", 114 | " Number of time the k-means algorithm will be run with different\n", 115 | " centroid seeds. The final results will be the best output of\n", 116 | " n_init consecutive runs in terms of objective function.\n", 117 | " random_state: int, default=42\n", 118 | " Random state.\n", 119 | " '''\n", 120 | " self.n_clusters = n_clusters\n", 121 | " self.n_init = n_init\n", 122 | " self.max_iter = max_iter\n", 123 | " self.random_state = RandomState(seed=random_state)\n", 124 | " \n", 125 | " def calculate_distances_to_centroids(self, X, cluster_centers):\n", 126 | " \"\"\"\n", 127 | " Returns (n, c) matrix where the element at position (i, j) \n", 128 | " is the distance from i-th object to j-th centroid.\"\"\"\n", 129 | " # \n", 130 | " pass\n", 131 | " \n", 132 | " def update_centroids(self, X, nearest_clusters):\n", 133 | " \"\"\"\n", 134 | " Returns numpy array of shape (n_clusters, n_features) - \n", 135 | " new clusters that are found by averaging objects belonging \n", 136 | " to the corresponding cluster.\"\"\"\n", 137 | " # \n", 138 | " pass\n", 139 | " \n", 140 | " def fit(self, X):\n", 141 | " \"\"\"Fit the model.\n", 142 | " \n", 143 | " Args:\n", 144 | " X: numpy array of shape (n_samples, n_features)\n", 145 | " \"\"\"\n", 146 | " assert X.shape[0] >= self.n_clusters\n", 147 | " # \n", 148 | " \n", 149 | " return self\n", 150 | " \n", 151 | " \n", 152 | " def predict(self, X):\n", 153 | " \"\"\"Predict classes.\n", 154 | " \n", 155 | " Args:\n", 156 | " X: numpy array of shape (n_samples, n_features)\n", 157 | " Returns:\n", 158 | " y: numpy array of shape (n_samples,)\n", 159 | " Vector containing predicted cluster labels.\n", 160 | " \"\"\"\n", 161 | " if hasattr(self, 'cluster_centers_'):\n", 162 | " # \n", 163 | " pass\n", 164 | " else: \n", 165 | " raise NotFittedError(\"CustomKMeans instance is not fitted yet\")" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "**(1 points)** Get the `X` array partition into 10 clusters. 
Visualize the centers of clusters.\n",
173 | "- We will assume that the center of the cluster is the average value of all observations belonging to the cluster.\n",
174 | "- The cluster centers should have the same shape as our observations (64). So you have to average the points across the rows."
175 | ]
176 | },
177 | {
178 | "cell_type": "code",
179 | "execution_count": null,
180 | "metadata": {},
181 | "outputs": [],
182 | "source": [
183 | "custom_kmeans_labels = ...\n",
184 | "assert custom_kmeans_labels.shape == (1797,)"
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": null,
190 | "metadata": {},
191 | "outputs": [],
192 | "source": []
193 | },
194 | {
195 | "cell_type": "markdown",
196 | "metadata": {},
197 | "source": [
198 | "**(0.5 points)** Experiment with the `max_iter` and `n_init` parameters. Look at the range of values of the objective function, its best values, and at which parameters and how often they are achieved."
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "metadata": {},
205 | "outputs": [],
206 | "source": []
207 | },
208 | {
209 | "cell_type": "markdown",
210 | "metadata": {},
211 | "source": [
212 | "Now we will use two popular algorithms: hierarchical clustering and $K$-means clustering. These and other algorithms are available in the `scikit-learn` module in the `cluster` submodule. Hierarchical clustering is called `AgglomerativeClustering`, and the $K$-means method is called `KMeans`.\n",
213 | "\n",
214 | "**(0.5 points)** Use each of the two methods: hierarchical clustering and KMeans. Get the `X` array partition into 10 clusters.\n",
215 | "\n",
216 | "- Note that `AgglomerativeClustering` does not have a `predict` method, so you can either use the `fit_predict` method or use the `fit` method and then look at the `labels_` attribute of the class instance.\n",
217 | "- KMeans performs multiple runs (10 by default) with random centers and then returns the best partition in terms of average distance within the clusters. You can increase the number of runs via the `n_init` parameter to improve the quality of predictions."
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": null,
223 | "metadata": {},
224 | "outputs": [],
225 | "source": [
226 | "hierarchical_labels = ...\n",
227 | "kmeans_labels = ..."
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": null,
233 | "metadata": {},
234 | "outputs": [],
235 | "source": [
236 | "assert hierarchical_labels.shape == (1797,)\n",
237 | "assert kmeans_labels.shape == (1797,)"
238 | ]
239 | },
240 | {
241 | "cell_type": "markdown",
242 | "metadata": {},
243 | "source": [
244 | "**(0.5 points)** Visualize the centers of clusters obtained by both methods."
245 | ]
246 | },
247 | {
248 | "cell_type": "code",
249 | "execution_count": null,
250 | "metadata": {},
251 | "outputs": [],
252 | "source": []
253 | },
254 | {
255 | "cell_type": "markdown",
256 | "metadata": {},
257 | "source": [
258 | "In a situation where the true number of classes is unknown, we can select it by maximizing some metric.\n",
259 | "\n",
260 | "When we can define a distance function between our observations, we can use the `silhouette` score as a measure of the quality of the clustering. 
Let's show how it is calculated:"
261 | ]
262 | },
263 | {
264 | "cell_type": "markdown",
265 | "metadata": {},
266 | "source": [
267 | "Let $X$ be the set of observations, $M \\subset X$ be one of the clusters obtained as a result of the clustering process, and $\\rho$ be some metric on $X$. Let's choose one observation $x \\in M$. Denote by $a(x)$ the average distance from $x$ to the other points $x'$ of the same cluster:\n",
268 | "$$\n",
269 | "a(x) = \\frac{1}{|M| - 1} \\sum_{x' \\in M,\\, x' \\ne x} \\rho(x,\\, x')\n",
270 | "$$\n",
271 | "\n",
272 | "Denote by $b(x)$ the minimum of the average distances from $x$ to the points $x''$ of some other cluster $N$:\n",
273 | "$$\n",
274 | "b(x) = \\min_{N \\ne M} \\frac{1}{|N|} \\sum_{x'' \\in N} \\rho(x,\\, x'')\n",
275 | "$$\n",
276 | "\n",
277 | "The silhouette is the difference between $b(x)$ and $a(x)$, normalized to $[-1, \\, 1]$ and averaged over all observations:\n",
278 | "$$\n",
279 | "\\frac{1}{|X|} \\sum_{x \\in X} \\frac{b(x) - a(x)}{\\max(a(x),\\, b(x))}\n",
280 | "$$\n",
281 | "\n",
282 | "The implementation of this metric in `scikit-learn` is the `silhouette_score` function from the `metrics` submodule."
283 | ]
284 | },
285 | {
286 | "cell_type": "markdown",
287 | "metadata": {
288 | "collapsed": true
289 | },
290 | "source": [
291 | "**(0.75 points)** For each $K$ between 2 and 20 inclusive, partition the array $X$ into $K$ clusters using both methods. Calculate the silhouette score and visualize it for both methods on the same plot ($K$ on the $x$ axis and silhouette score on the $y$ axis). Label the axes and add a legend."
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "execution_count": null,
297 | "metadata": {},
298 | "outputs": [],
299 | "source": []
300 | },
301 | {
302 | "cell_type": "markdown",
303 | "metadata": {},
304 | "source": [
305 | "When we know the true clustering labels, the clustering result can be compared to them using measures such as `homogeneity`, `completeness` and their harmonic mean - $V$-score. The definitions of these quantities are rather bulky and are based on the [entropy of the probability distribution](https://ru.wikipedia.org/wiki/Информационная_энтропия). Details are given in [this article](http://aclweb.org/anthology/D/D07/D07-1043.pdf). In practice, it's enough to know that `homogeneity`, `completeness` and $V$-score are in the range from 0 to 1, and the higher, the better.\n",
306 | "\n",
307 | "Since we know which digit each image represents (the `y` array), we can compare the clustering results to it using the measures listed above."
308 | ]
309 | },
310 | {
311 | "cell_type": "markdown",
312 | "metadata": {},
313 | "source": [
314 | "**(0.5 points)** Repeat the previous task using $V$-measure instead of silhouette."
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": null,
320 | "metadata": {},
321 | "outputs": [],
322 | "source": []
323 | },
324 | {
325 | "cell_type": "markdown",
326 | "metadata": {},
327 | "source": [
328 | "## 3. Feature space dimensionality reduction\n",
329 | "\n",
330 | "In some cases, especially when there are many features, not all of them are informative, and some of them are correlated, it can be useful to reduce the dimension of the feature space. This means that instead of $d$ original features, we will go to $d'\ll d$ new ones. 
And if earlier our data were presented in the form of an $n×d$ matrix, then now they will be presented as an $n×d'$ matrix.\n",
331 | "\n",
332 | "There are two popular dimensionality reduction approaches:\n",
333 | "- select new features from existing features;\n",
334 | "- extract the new features by transforming old ones, for example, by making $d'$ different linear combinations of columns of an $n×d$ matrix.\n",
335 | "\n",
336 | "One widely used dimensionality reduction technique is the Singular Value Decomposition (SVD). This method allows you to construct any number $d'\leq d$ of new features in such a way that they are the most informative (in some sense).\n",
337 | "\n",
338 | "The `scikit-learn` module has several implementations of singular value decomposition. We will use the `TruncatedSVD` class from the `decomposition` submodule.\n",
339 | "\n",
340 | "**Note:** The singular value decomposition of the matrix $M$ is usually written as $M=U \\Sigma V^{*}$. `TruncatedSVD`, in turn, returns only the $d'$ first columns of the matrix $U$."
341 | ]
342 | },
343 | {
344 | "cell_type": "markdown",
345 | "metadata": {},
346 | "source": [
347 | "**(0.75 points)** Perform a singular value decomposition of the $X$ matrix, leaving 2, 5, 10, 20 features. In each case, perform hierarchical clustering and $K$-Means clustering (take the number of clusters equal to 10). Calculate the silhouette and $V$-score and compare them to the corresponding values obtained from the original data.\n",
348 | "\n",
349 | "**Note**: It is not valid to compare silhouettes calculated with different metrics. Even if we use the same metric function when calculating the distance between points, after applying dimensionality reduction or other data transformations we will usually (though not always) get different silhouette scores. Therefore, after training the clustering algorithm, you need to calculate the silhouette on the original data in order to compare the clustering results."
350 | ]
351 | },
352 | {
353 | "cell_type": "code",
354 | "execution_count": null,
355 | "metadata": {},
356 | "outputs": [],
357 | "source": []
358 | },
359 | {
360 | "cell_type": "markdown",
361 | "metadata": {},
362 | "source": [
363 | "Another popular dimensionality reduction approach that is useful for working with images is t-distributed stochastic neighbor embedding, abbreviated `tSNE`. Unlike singular value decomposition, it is a non-linear transformation. Its main idea is to map points from a space of dimension `d` to another space of dimension 2 or 3 in such a way that the distances between points are mostly preserved. Mathematical details can be found, for example, [here](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding).\n",
364 | "\n",
365 | "The implementation of `tSNE` in the `scikit-learn` library is the `TSNE` class in the `manifold` submodule.\n",
366 | "\n",
367 | "**Note:** In recent years [UMAP](https://github.com/lmcinnes/umap) is often used instead of `tSNE`. It is a faster algorithm with similar properties. We don't ask you to use `UMAP` because it requires you to install another dependency, the `umap-learn` library. Those who wish can perform the following task using `UMAP`."
368 | ]
369 | },
370 | {
371 | "cell_type": "markdown",
372 | "metadata": {},
373 | "source": [
374 | "**(0.5 points)** Perform a tSNE-transform of the `X` matrix, leaving 2 features. Visualize the obtained data in the form of a scatter plot: the first feature on the horizontal axis, and the second one on the vertical axis. 
Color the points according to the digits they belong to.\n", 375 | "\n", 376 | "- The `c` parameter in the plt.scatter function is responsible for the color of the points. Pass the true labels to it." 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": null, 382 | "metadata": {}, 383 | "outputs": [], 384 | "source": [] 385 | }, 386 | { 387 | "cell_type": "markdown", 388 | "metadata": {}, 389 | "source": [ 390 | "**(0.5 points)** From the data transformed using the tSNE, perform hierarchical clustering and $K$-means clustering (take the number of clusters equal to 10). Calculate the silhouette and the $V$-score and compare them to corresponding values obtained from the original data." 391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "execution_count": null, 396 | "metadata": {}, 397 | "outputs": [], 398 | "source": [] 399 | }, 400 | { 401 | "cell_type": "markdown", 402 | "metadata": {}, 403 | "source": [ 404 | "**(0.25 points)** Choose the best partition (in terms of silhouette or $V$-score) and visualize the centers of clusters with images. Did you managed to make each digit correspond to one center of the cluster?" 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": null, 410 | "metadata": {}, 411 | "outputs": [], 412 | "source": [] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": {}, 417 | "source": [ 418 | "## 4. Results and bonus part" 419 | ] 420 | }, 421 | { 422 | "cell_type": "markdown", 423 | "metadata": {}, 424 | "source": [ 425 | "Write in free form what conclusions you made after completing this assignment. Answer the following questions:\n", 426 | "\n", 427 | "**(0.5 points)** Which algorithm gives more meaningful results - hierarchical clustering or $K$- means clustering. Does it depend on the algorithm settings or on the quality evaluation method?" 428 | ] 429 | }, 430 | { 431 | "cell_type": "code", 432 | "execution_count": null, 433 | "metadata": {}, 434 | "outputs": [], 435 | "source": [] 436 | }, 437 | { 438 | "cell_type": "markdown", 439 | "metadata": {}, 440 | "source": [ 441 | "**(0.5 points)** Imagine the situation where after hierarchical clustering, you need to cluster new data in the same way without retraining the model. Suggest a method how you will do it and how you will measure the quality of clustering of new data." 442 | ] 443 | }, 444 | { 445 | "cell_type": "code", 446 | "execution_count": null, 447 | "metadata": {}, 448 | "outputs": [], 449 | "source": [] 450 | }, 451 | { 452 | "cell_type": "markdown", 453 | "metadata": { 454 | "collapsed": true 455 | }, 456 | "source": [ 457 | "**(0.5 points)** Does dimensionality reduction improve clustering results?" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": null, 463 | "metadata": {}, 464 | "outputs": [], 465 | "source": [] 466 | }, 467 | { 468 | "cell_type": "markdown", 469 | "metadata": {}, 470 | "source": [ 471 | "**(0.5 points)** How to evaluate the quality of dimensional reduction? Suggest at least 2 options." 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": null, 477 | "metadata": {}, 478 | "outputs": [], 479 | "source": [] 480 | }, 481 | { 482 | "cell_type": "markdown", 483 | "metadata": {}, 484 | "source": [ 485 | "**(Bonus 2 points)** Load the [MNIST Handwritten Digits](http://yann.lecun.com/exdb/mnist) dataset. You can also do it with `scikit-learn` as explained [here](https://stackoverflow.com/a/60450028). Explore the data and try to cluster it using different approaches. 
Compare results of these approaches using the silhouette and the $V$-score." 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": null, 491 | "metadata": {}, 492 | "outputs": [], 493 | "source": [] 494 | } 495 | ], 496 | "metadata": { 497 | "kernelspec": { 498 | "display_name": "Python 3 (ipykernel)", 499 | "language": "python", 500 | "name": "python3" 501 | }, 502 | "language_info": { 503 | "codemirror_mode": { 504 | "name": "ipython", 505 | "version": 3 506 | }, 507 | "file_extension": ".py", 508 | "mimetype": "text/x-python", 509 | "name": "python", 510 | "nbconvert_exporter": "python", 511 | "pygments_lexer": "ipython3", 512 | "version": "3.9.7" 513 | } 514 | }, 515 | "nbformat": 4, 516 | "nbformat_minor": 2 517 | } 518 | -------------------------------------------------------------------------------- /8_clustering/slides/clustering_examples_bad.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rolling-scopes-school/ml-intro/e7fec69ba996da03354b440107a66f26defa95c9/8_clustering/slides/clustering_examples_bad.png -------------------------------------------------------------------------------- /8_clustering/slides/silhouette.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rolling-scopes-school/ml-intro/e7fec69ba996da03354b440107a66f26defa95c9/8_clustering/slides/silhouette.png -------------------------------------------------------------------------------- /8_clustering/slides/slides_clustering.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rolling-scopes-school/ml-intro/e7fec69ba996da03354b440107a66f26defa95c9/8_clustering/slides/slides_clustering.pdf -------------------------------------------------------------------------------- /8_clustering/slides/slides_clustering.tex: -------------------------------------------------------------------------------- 1 | \documentclass[unicode]{beamer} 2 | 3 | \mode 4 | { 5 | \usetheme{Madrid} 6 | \setbeamercovered{transparent} 7 | \setbeamersize{text margin left=5mm,text margin right=5mm} 8 | } 9 | 10 | \usepackage{fontspec} 11 | % Parameters borrowed from tex.stackexchange.com/questions/147038 12 | % with some additions from linux.org.ru/forum/desktop/8254794 13 | \setmainfont[ 14 | Ligatures=TeX, 15 | Extension=.otf, 16 | BoldFont=cmunbx, 17 | ItalicFont=cmunti, 18 | BoldItalicFont=cmunbi, 19 | SlantedFont=cmunsl, 20 | SmallCapsFont=cmunrm, 21 | SmallCapsFeatures={Letters=SmallCaps} 22 | ]{cmunrm} 23 | \setsansfont[ 24 | Ligatures=TeX, 25 | Extension=.otf, 26 | BoldFont=cmunsx, 27 | ItalicFont=cmunsi, 28 | BoldItalicFont=cmunso, 29 | SlantedFont=cmunsi, 30 | SmallCapsFont=cmunss, 31 | SmallCapsFeatures={Letters=SmallCaps} 32 | ]{cmunss} 33 | \setmonofont[ 34 | % Ligatures=TeX, % not needed as it turns straight quotes into curly ones 35 | Extension=.otf, 36 | BoldFont=cmuntb, 37 | ItalicFont=cmuntt, % instead of cmunit 38 | BoldItalicFont=cmuntx, 39 | SlantedFont=cmuntt, % instead of cmunst 40 | SmallCapsFont=cmuntt, 41 | SmallCapsFeatures={Letters=SmallCaps} 42 | ]{cmuntt} 43 | % Another monospaced font 44 | % \setmonofont[Ligatures=TeX]{FreeMono} 45 | 46 | \usepackage{polyglossia} 47 | \setdefaultlanguage{russian} 48 | \setotherlanguage{english} 49 | 50 | % \usepackage{cmap} 51 | % \usepackage[utf8]{inputenc} 52 | % \usepackage[T1,T2A,T4]{fontenc} 53 | % \usepackage[english,russian]{babel} 54 | 55 | 
\usepackage{graphicx} 56 | \usepackage{tikz} 57 | \usetikzlibrary{decorations.pathreplacing} 58 | % \usepackage{minted} 59 | % \definecolor{bg}{rgb}{0.95,0.95,0.95} 60 | % \setminted[python]{xleftmargin=1cm} 61 | % \setminted[python3]{xleftmargin=1cm} 62 | % \setminted[shell]{xleftmargin=1cm} 63 | % \newcommand{\pytwo}[1]{\mintinline{python}{#1}} 64 | % \newcommand{\py}[1]{\mintinline{python3}{#1}} 65 | % \newcommand{\bash}[1]{\mintinline{shell}{#1}} 66 | 67 | \usefonttheme[onlymath]{serif} 68 | \AtBeginDocument{\DeclareSymbolFont{pureletters}{T1}{lmr}{\mddefault}{it}} 69 | 70 | % NB: Be sure to specify [fragile] option for each frame with code listings. 71 | 72 | \title[Кластеризация]{Кластеризация} 73 | \author[]{RS School Machine Learning course} 74 | \institute[]{} 75 | \date[]{} 76 | 77 | \beamerdefaultoverlayspecification{} 78 | 79 | \begin{document} 80 | 81 | \begin{frame} 82 | \titlepage 83 | \end{frame} 84 | 85 | \begin{frame} 86 | \frametitle{План} 87 | \tableofcontents 88 | \end{frame} 89 | 90 | \section{Векторные и метрические пространства} 91 | 92 | \begin{frame}[t] 93 | \frametitle{Напоминание} 94 | 95 | \begin{itemize} 96 | \item Когда целевой переменной нет или её значения неизвестны, говорят, что перед нами задача машинного обучения <<без~учителя>> (unsupervised learning). 97 | \item Типичный пример -- кластеризация: разбить данные на несколько групп (классов, кластеров) наблюдений, похожих между собой. 98 | \item Количество кластеров иногда известно, иногда нет. 99 | \item Приложения кластеризации: 100 | \begin{itemize} 101 | \item[$\rhd$] Сегментация аудитории 102 | \item[$\rhd$] Опознавание аномалий в поведении пользователей 103 | \item[$\rhd$] Моделирование тематики текстов 104 | \item[$\rhd$] \dots 105 | \end{itemize} 106 | \end{itemize} 107 | 108 | \end{frame} 109 | 110 | \begin{frame}[t] 111 | \frametitle{Векторные пространства} 112 | 113 | \only<1>{ 114 | \begin{itemize} 115 | \item Данные для кластеризации (наблюдения, точки) обычно заданы в $\mathbb{R}^d$, вещественном $d$-мерном пространстве. 116 | \item Это одно из векторных пространств -- объектов с богатой алгебраической и геометрической структурой. 117 | \item Элементы векторного пространства обычно называют векторами или точками. 118 | \item Обычная евклидова плоскость $\mathbb{R}^2$ -- модельный пример, которым можно пользоваться, обсуждая свойства векторных пространств. 119 | \end{itemize} 120 | } 121 | \only<2>{ 122 | Алгебраические свойства: 123 | \begin{itemize} 124 | \item[$\rhd$] Векторы можно складывать. 125 | \item[$\rhd$] Можно умножать на число (будем считать, что из $\mathbb{R}$). 
126 | \item[$\rhd$] Поэтому линейная комбинация векторов из какого-нибудь векторного пространства $V$ принадлежит этому пространству: 127 | $$ 128 | \forall a,\, b \in \mathbb{R},\, u,\, v \in V: au + bv \in V 129 | $$ 130 | \end{itemize} 131 | } 132 | \only<3>{ 133 | Геометрические свойства: 134 | \begin{itemize} 135 | \item[$\rhd$] В некоторых векторных пространствах, в частности, в $\mathbb{R}^d$, определено скалярное произведение: 136 | $$ 137 | \langle u,\, v \rangle = \sum_{i=1}^{d}{u_i v_i} \qquad (u,\, v \in \mathbb{R}^d) 138 | $$ 139 | \item[$\rhd$] Скалярное произведение порождает норму: 140 | $$ 141 | \|u\| = \sqrt{\langle u,\, u \rangle} 142 | $$ 143 | \item[$\rhd$] Норма порождает метрику (берётся линейная комбинация векторов): 144 | $$ 145 | \rho(u,\, v) = \|u - v\| 146 | $$ 147 | \end{itemize} 148 | } 149 | 150 | \end{frame} 151 | 152 | \begin{frame}[t] 153 | \frametitle{Метрические пространства} 154 | 155 | \only<1>{ 156 | \begin{itemize} 157 | \item Метрическим пространством называется пара $(X,\, \rho)$, где\\ 158 | \hspace{1cm}$X$ -- произвольное множество;\\ 159 | \hspace{1cm}$\rho : X \times X \rightarrow \mathbb{R}$ -- функция с набором хороших свойств\\ 160 | \hspace{1cm}(далее о них). 161 | \item Элементы множества $X$ называют точками. 162 | \item Функцию $\rho$ называют метрикой. 163 | \item О величине $\rho(x,\, y)$ можно думать как о расстоянии или длине кратчайшего пути между точками $x$ и $y$. 164 | \end{itemize} 165 | } 166 | \only<2>{ 167 | Свойства метрики: 168 | \begin{enumerate} 169 | \item Неотрицательность: 170 | $$ 171 | \forall x,\, y \in X \quad \rho(x,\, y) \ge 0 172 | $$ 173 | {\footnotesize (У любого пути неотрицательная длина.)} 174 | \item Условие обращения в ноль: 175 | $$ 176 | \forall x,\, y \in X \quad \rho(x,\, y) = 0 \Leftrightarrow x = y 177 | $$ 178 | {\footnotesize (У пути нулевой длины совпадают начало и конец, и наоборот.)} 179 | \item Симметричность: 180 | $$ 181 | \forall x,\, y \in X \quad \rho(x,\, y) = \rho(y,\, x) 182 | $$ 183 | {\footnotesize (Путь между точками имеет одинаковую длину в обе стороны.)} 184 | \item Неравенство треугольника: 185 | $$ 186 | \forall x,\, y,\, z \in X \quad \rho(x,\, y) + \rho(y,\, z) \ge \rho(x,\, z) 187 | $$ 188 | {\footnotesize (Путь через промежуточный пункт не бывает короче, чем напрямик.)} 189 | \end{enumerate} 190 | } 191 | \only<3>{ 192 | \begin{itemize} 193 | \item Таким образом, векторное пространство со скалярным произведением (такое, как $\mathbb{R}^d$) сразу является метрическим пространством. 194 | \item Но не на любом метрическом пространстве можно задать структуру векторного пространства. 
195 | \item Есть метрические пространства, которые невозможно вложить в~$\mathbb{R}^d$ при любом $d$: 196 | \end{itemize} 197 | 198 | \begin{center} 199 | \begin{tikzpicture} 200 | \draw (1, 2) node[minimum size=1.5mm, inner sep=0mm, shape=circle, fill] {}; 201 | \draw (0, 0) node[minimum size=1.5mm, inner sep=0mm, shape=circle, fill] {}; 202 | \draw (1, 0) node[minimum size=1.5mm, inner sep=0mm, shape=circle, fill] {}; 203 | \draw (2, 0) node[minimum size=1.5mm, inner sep=0mm, shape=circle, fill] {}; 204 | \draw (1, 2) -- node[midway, above left] {{\footnotesize 2}} (0, 0); 205 | \draw (1, 2) -- node[midway, left] {{\footnotesize 2}} (1, 0); 206 | \draw (1, 2) -- node[midway, above right] {{\footnotesize 2}} (2, 0); 207 | \draw (0, 0) -- node[midway, above] {{\footnotesize 1}} (1, 0); 208 | \draw (1, 0) -- node[midway, above] {{\footnotesize 1}} (2, 0); 209 | \draw [decorate,decoration={brace,mirror,raise=2mm}] (0, 0) -- node[midway, below=3mm] {{\footnotesize 2}} (2, 0); 210 | \end{tikzpicture} 211 | \end{center} 212 | } 213 | 214 | \end{frame} 215 | 216 | \begin{frame}[t] 217 | \frametitle{Опять к кластеризации} 218 | 219 | \begin{itemize} 220 | \item Метрическая структура <<беднее>>, чем векторная\\ 221 | $\Rightarrow$ больше разных объектов обладают ею. 222 | \item Алгоритмы, о которых мы будем говорить: 223 | \begin{itemize} 224 | \item[--] работают в произвольных метрических пространствах 225 | \item[--] либо нуждаются для этого в незначительных изменениях 226 | \end{itemize} 227 | \item Поэтому мы можем считать, что наблюдения живут в~метрическом пространстве $(X,\, \rho)$. 228 | \item Совокупность данных обозначим $U = \{x_1,\, \dots,\, x_n\}$. 229 | \item У многих алгоритмов кластеризации один настраиваемый параметр $k$ -- количество кластеров, которое нужно получить. 230 | \end{itemize} 231 | 232 | \end{frame} 233 | 234 | \section{Методы кластеризации} 235 | 236 | \subsection{Иерархическая кластеризация} 237 | 238 | \begin{frame}[t] 239 | \frametitle{Иерархическая кластеризация} 240 | 241 | \begin{itemize} 242 | \item[$\rhd$] Инициализируем $n$ кластеров, в каждом по одному наблюдению. 243 | \item[$\rhd$] На каждом шаге будем брать два кластера, самых близких друг к другу, и объединять их. 244 | \item[$\rhd$] Таким образом, на каждом шаге число кластеров уменьшается на единицу. 245 | \item[$\rhd$] Когда останется ровно $k$ кластеров, остановимся. 246 | \item Это иерархическая (или агломеративная) кластеризация. 247 | \item \textbf{Проблема:} Что такое <<кластеры, самые близкие друг к другу>>? 248 | \end{itemize} 249 | 250 | \end{frame} 251 | 252 | \begin{frame}[t] 253 | \frametitle{Критерии близости кластеров} 254 | 255 | \begin{itemize} 256 | \item Как задать <<расстояние>> $D$ между двумя кластерами $M$ и $N$? 257 | \item[$\rhd$] Single-linkage clustering: 258 | $$ 259 | D(M,\, N) = \min\, \{\, \rho(x,\, y)\, |\, x \in M,\, y \in N \,\} 260 | $$ 261 | \item[$\rhd$] Complete-linkage clustering: 262 | $$ 263 | D(M,\, N) = \max\, \{\, \rho(x,\, y)\, |\, x \in M,\, y \in N \,\} 264 | $$ 265 | \item[$\rhd$] Average-linkage clustering: 266 | $$ 267 | D(M,\, N) = \frac{1}{|M||N|} \sum_{x \in M} \sum_{y \in N}{\rho(x,\, y)} 268 | $$ 269 | \item Эти и другие критерии близости кластеров можно представить как частные случаи формулы Ланса--Уильямса (мы не будем её обсуждать подробнее). 
270 | \end{itemize} 271 | 272 | \end{frame} 273 | 274 | \subsection{Алгоритм $k$ средних} 275 | 276 | \begin{frame}[t] 277 | \frametitle{Алгоритм $k$ средних} 278 | \begin{itemize} 279 | \only<1>{ 280 | \item Будем считать, что наблюдения даны в $\mathbb{R}^d$. 281 | \item[$\rhd$] Случайным образом выберем $k$ наблюдений $x_{i_1}$, \dots, $x_{i_k}$ в~качестве~центров. 282 | \item[$\rhd$] Сформируем $k$ кластеров: в кластер с центром $x_i$ входят наблюдения, которые ближе к $x_i$, чем к любому другому из~центров. 283 | \item[$\rhd$] На каждом шаге: 284 | \begin{itemize} 285 | \item[--] Пересчитаем центры, усреднив каждый кластер. 286 | \item[--] Переформируем кластеры относительно новых центров. 287 | \end{itemize} 288 | \item[$\rhd$] Когда кластеры перестают меняться, остановимся. 289 | } 290 | \only<2>{ 291 | \item Это алгоритм $k$ средних ($k$-means). 292 | \item Модификация для произвольного метрического пространства: центрами могут быть только точки, представленные в данных (алгоритм $k$-medoids). 293 | \item \textbf{Проблема:} При разном выборе начальных точек получаются разные результаты. 294 | \item Улучшенный способ первоначального выбора центров: 295 | \item[$\rhd$] Первый центр выберем случайно (равномерно по всем наблюдениям). 296 | \item[$\rhd$] Каждый следующий будем выбирать с вероятностью, прямо пропорциональной квадрату расстояния данной точки до ближайшего уже выбранного центра. 297 | \item Этот способ инициализации называется $k$-means++. 298 | } 299 | \end{itemize} 300 | \end{frame} 301 | 302 | \subsection{DBSCAN} 303 | 304 | \begin{frame}[t] 305 | \frametitle{DBSCAN} 306 | \begin{itemize} 307 | \only<1>{ 308 | \item В некоторых ситуациях (кластеры сложной формы, зашумлённые данные) иерархическая кластеризация и алгоритм $k$ средних работают плохо. 309 | \end{itemize} 310 | 311 | \begin{center} 312 | \includegraphics[width=0.6\textwidth]{clustering_examples_bad} 313 | \end{center} 314 | 315 | \begin{itemize} 316 | \item Не будем фиксировать число кластеров, посмотрим на локальную структуру данных. 317 | } 318 | \only<2>{ 319 | \item[$\rhd$] Введём два параметра: радиус $\varepsilon$ и минимальная плотность $m$. 320 | \item[$\rhd$] Для каждой точки $x_i$ рассмотрим её окрестность $U_{\varepsilon}(x_i) = \{x_j \in U\, |\, i \ne j,\, \rho(x_i, x_j) \le \varepsilon\}$. 321 | \item[$\rhd$] Подразделим все точки на: 322 | \begin{itemize} 323 | \item[--] центральные: $|U_{\varepsilon}(x_i)| \ge m$ 324 | \item[--] периферийные: $1 \le |U_{\varepsilon}(x_i)| < m$ 325 | \item[--] шумовые: $|U_{\varepsilon}(x_i)| = 0$ 326 | \end{itemize} 327 | \item[$\rhd$] На всех центральных точках построим граф: две точки в окрестности друг друга соединяются ребром. 328 | \item[$\rhd$] Одна связная компонента графа = один кластер. 329 | \item[$\rhd$] Каждую периферийную точку относим к тому же кластеру, что и ближайшая центральная. 330 | \item[$\rhd$] Шумовые точки не кластеризуем. 331 | } 332 | \only<3>{ 333 | \item Это алгоритм DBSCAN. 334 | \item[$\oplus$] На практике обычно даёт хорошие результаты. 335 | \item[$\circleddash$] Не параметризуется число кластеров. 336 | \item Можно использовать для фильтрации шума в данных,\linebreak т.~е. сами кластеры отбрасываются. 337 | } 338 | \end{itemize} 339 | \end{frame} 340 | 341 | \section{Оценивание качества кластеризации} 342 | 343 | \begin{frame}[t] 344 | \frametitle{Оценивание качества кластеризации} 345 | \begin{itemize} 346 | \item Для подбора параметров нужно уметь оценивать, насколько кластеризация хороша. 
\section{Evaluating clustering quality}

\begin{frame}[t]
\frametitle{Evaluating clustering quality}
\begin{itemize}
\item To tune parameters we need to be able to assess how good a clustering is.
\item Geometric characteristics: how dense the clusters are, how far apart they lie\dots
\item If labeled data are available, we can compare against a reference clustering.
\end{itemize}
\end{frame}

\subsection{Silhouette}

\begin{frame}[t]
\frametitle{Silhouette}
\only<1>{
\begin{itemize}
\item Suppose the sample $x_1,\, \dots,\, x_N$ has been partitioned into clusters.
\item For a point $x_i$ assigned to cluster $C$, consider two quantities:
\item[] \quad$a_i$ -- the average distance from $x_i$ to the other points in $C$;
\item[] \quad$b_i$ -- the minimum, over all other clusters $C^\prime$, of the average distance from $x_i$ to the points in $C^\prime$.
\end{itemize}

\begin{center}
\includegraphics[width=0.5\textwidth]{silhouette}
\end{center}
}
\only<2>{
\begin{itemize}
\item The silhouette of $x_i$ is the difference $b_i - a_i$ normalized to the interval $[-1,\, 1]$: \quad $s_i = (b_i - a_i)\, /\, \max(a_i,\, b_i)$.
\item Average it over the whole sample.
\item Intuitively, it is closer to 1
\begin{itemize}
\item[--] the closer together the points within each cluster are;
\item[--] the farther the clusters are from one another.
\end{itemize}
\end{itemize}
}
\end{frame}

\subsection{Adjusted Rand index}

\begin{frame}[t]
\frametitle{Adjusted Rand index}
\begin{itemize}
\only<1>{
\item Suppose the sample $x_1,\, \dots,\, x_N$ has a reference clustering $\mathcal{C}$, and we have built a clustering $\mathcal{D}$.
\item How close is our clustering to the reference?
\begin{itemize}
\item[$\rhd$] The number and the numbering of the clusters may differ.
\end{itemize}
\item Consider all possible pairs of points $x_i,\, x_j$; there are $\displaystyle{N \choose 2}$ of them in total.
\item Denote:
\begin{itemize}
\item[--] $a$ -- \# of pairs that fall into the same cluster in $\mathcal{C}$ and also in $\mathcal{D}$;
\item[--] $b$ -- \# of pairs that fall into different clusters in $\mathcal{C}$ and also in $\mathcal{D}$;
\item[--] $c$ -- \# of pairs that fall into the same cluster in $\mathcal{C}$ but into different clusters in $\mathcal{D}$;
\item[--] $d$ -- \# of pairs that fall into different clusters in $\mathcal{C}$ but into the same cluster in~$\mathcal{D}$.
\end{itemize}
}
\only<2>{
\item The Rand index is the quantity
$$
\frac{a + b}{a + b + c + d} = \frac{a + b}{{N \choose 2}}.
$$
\item It lies between 0 and 1 and equals 1 for partitions that differ only in the numbering of the clusters.
\item The ``adjusted'' Rand index subtracts the expected value of $a + b$ over all pairs of partitions with the same cluster sizes as $\mathcal{C}$ and $\mathcal{D}$ (and rescales the result). Both metrics are shown in code on the next slide.
}
\end{itemize}
\end{frame}
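\begin{frame}[fragile]
\frametitle{Quality metrics in code}

\begin{itemize}
\item A minimal scikit-learn sketch of both metrics (illustrative only; the synthetic data and the choice of $k$-means are assumptions made for this example):
\end{itemize}

\begin{minted}{python3}
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Toy data with known reference labels.
X, reference = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))             # needs only the clustering
print(adjusted_rand_score(reference, labels))  # needs a reference clustering
\end{minted}
\end{frame}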

\end{document}
--------------------------------------------------------------------------------
/8_clustering/slides/slides_dimensionality_reduction.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rolling-scopes-school/ml-intro/e7fec69ba996da03354b440107a66f26defa95c9/8_clustering/slides/slides_dimensionality_reduction.pdf
--------------------------------------------------------------------------------
/8_clustering/slides/slides_dimensionality_reduction.tex:
--------------------------------------------------------------------------------
\documentclass[unicode]{beamer}

\mode
{
\usetheme{Madrid}
\setbeamercovered{transparent}
\setbeamersize{text margin left=5mm,text margin right=5mm}
}

\usepackage{fontspec}
% Parameters borrowed from tex.stackexchange.com/questions/147038
% with some additions from linux.org.ru/forum/desktop/8254794
\setmainfont[
Ligatures=TeX,
Extension=.otf,
BoldFont=cmunbx,
ItalicFont=cmunti,
BoldItalicFont=cmunbi,
SlantedFont=cmunsl,
SmallCapsFont=cmunrm,
SmallCapsFeatures={Letters=SmallCaps}
]{cmunrm}
\setsansfont[
Ligatures=TeX,
Extension=.otf,
BoldFont=cmunsx,
ItalicFont=cmunsi,
BoldItalicFont=cmunso,
SlantedFont=cmunsi,
SmallCapsFont=cmunss,
SmallCapsFeatures={Letters=SmallCaps}
]{cmunss}
\setmonofont[
% Ligatures=TeX, % not needed as it turns straight quotes into curly ones
Extension=.otf,
BoldFont=cmuntb,
ItalicFont=cmuntt, % instead of cmunit
BoldItalicFont=cmuntx,
SlantedFont=cmuntt, % instead of cmunst
SmallCapsFont=cmuntt,
SmallCapsFeatures={Letters=SmallCaps}
]{cmuntt}
% Another monospaced font
% \setmonofont[Ligatures=TeX]{FreeMono}

\usepackage{polyglossia}
\setdefaultlanguage{russian}
\setotherlanguage{english}

% \usepackage{cmap}
% \usepackage[utf8]{inputenc}
% \usepackage[T1,T2A,T4]{fontenc}
% \usepackage[english,russian]{babel}

\usepackage{graphicx}
\usepackage{tikz}
\usetikzlibrary{decorations.pathreplacing}
\usepackage{minted}
% \definecolor{bg}{rgb}{0.95,0.95,0.95}
\setminted[python]{xleftmargin=1cm}
\setminted[python3]{xleftmargin=1cm}
\setminted[shell]{xleftmargin=1cm}
\newcommand{\pytwo}[1]{\mintinline{python}{#1}}
\newcommand{\py}[1]{\mintinline{python3}{#1}}
\newcommand{\bash}[1]{\mintinline{shell}{#1}}

\usefonttheme[onlymath]{serif}
\AtBeginDocument{\DeclareSymbolFont{pureletters}{T1}{lmr}{\mddefault}{it}}

% NB: Be sure to specify [fragile] option for each frame with code listings.

\title[Dimensionality reduction]{Dimensionality reduction}
\author[]{RS School Machine Learning course}
\institute[]{}
\date[]{}

\beamerdefaultoverlayspecification{}

\begin{document}

\begin{frame}
\titlepage
\end{frame}

\begin{frame}[t]
\frametitle{Why reduce dimensionality}
\begin{itemize}
\item Usually the observations are given in $\mathbb{R}^d$.
\item The ``curse of dimensionality'': for large $d$ the efficiency of algorithms drops and geometric intuition stops working.
\begin{itemize}
\item[--] For many classes of models the training time depends linearly on $d$
\item[--] Data sparsity: it is desirable to have at least several times more training observations than $d$
\item[--] Euclidean distance loses sensitivity as $d$ grows (see the quick check on the next slide)
\end{itemize}
\item To cope with the ``curse of dimensionality'', the data are mapped into $\mathbb{R}^{d^\prime}$, $d^\prime \ll d$.
\item When $d^\prime = 2$, the data can be visualized in the plane.
\end{itemize}
\end{frame}
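\begin{frame}[fragile]
\frametitle{Distance concentration: a quick check}

\begin{itemize}
\item A small numpy/scipy sketch (illustrative; uniform random data and the sample size are arbitrary assumptions) showing that the spread of pairwise Euclidean distances shrinks relative to their mean as $d$ grows:
\end{itemize}

\begin{minted}{python3}
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))      # 500 random points in [0, 1]^d
    dist = pdist(X)                     # all pairwise Euclidean distances
    print(d, dist.std() / dist.mean())  # the relative spread shrinks with d
\end{minted}
\end{frame}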
\begin{frame}[t]
\frametitle{Dimensionality reduction methods}
\begin{itemize}
\item Feature selection:
\begin{itemize}
\item[--] by statistical properties
\item[--] by mutual correlation
\item[--] by the predictive power of a model on a subset of features\dots
\end{itemize}
\item Feature extraction -- today's topic.
\item Linear methods: the desired mapping $\mathbb{R}^d \rightarrow \mathbb{R}^{d^\prime}$ can be represented by a $d \times d^\prime$ matrix.
\begin{itemize}
\item[$\rhd$] Principal component analysis
\end{itemize}
\item Nonlinear methods: a mapping from an arbitrary metric space that is not necessarily written in explicit form.
\begin{itemize}
\item[$\rhd$] Multidimensional scaling
\item[$\rhd$] t-SNE
\item[$\rhd$] UMAP
\end{itemize}
\end{itemize}
\end{frame}

\begin{frame}[t]
\frametitle{Principal component analysis}
\begin{itemize}
\item[$\rhd$] Center the data: subtract the mean of every feature.
\item[$\rhd$] Find the line through the origin onto which the projections of the points have the largest variance.
\item[$\rhd$] Among all lines perpendicular to it, again find the one onto which the projections have the largest variance.
\item[$\rhd$] \dots And so on, until $d^\prime$ directions have been collected.
\item[$\rhd$] We could continue up to dimension $d$ and obtain a new basis of the original space.
\item This is principal component analysis (PCA).
\begin{itemize}
\item In practice one usually computes a singular value decomposition of the data matrix or the eigenvectors of the covariance matrix.
\end{itemize}
\end{itemize}
\end{frame}

\begin{frame}[t]
\frametitle{Multidimensional scaling}
\begin{itemize}
\item What if we try to preserve the distances between points as accurately as possible?
\item[$\rhd$] Let $x_1,\, \dots,\, x_N$ be the original points in $(X,\, \rho)$.
\item[$\rhd$] Introduce projections of these points, $y_1,\, \dots,\, y_N$, in $\mathbb{R}^{d^\prime}$.
\item[$\rhd$] Compute all pairwise distances: $\rho_{ij}$ between the original points and $\delta_{ij}$ between the projections, where $\delta$ is the ordinary Euclidean distance.
\item[$\rhd$] Minimize
$$
\sum_i \sum_j \left(\rho_{ij} - \delta_{ij}\right)^2
$$
-- for example by gradient descent, moving the projections.
\item This is multidimensional scaling (MDS).
\end{itemize}
\end{frame}

\begin{frame}[t]
\frametitle{t-SNE}
\begin{itemize}
\item A further development of this idea is the t-SNE algorithm (t-distributed stochastic neighbor embedding).
\item The $\rho_{ij}$ between the original points and the $\delta_{ij}$ between the projections are expressed through probability distributions.
\item The distance between these distributions (the KL divergence) is minimized.
\end{itemize}
\end{frame}

\begin{frame}[t]
\frametitle{UMAP}
\begin{itemize}
\item[$\rhd$] A parameter $n$ -- the number of neighbors.
\item[$\rhd$] A directed weighted graph is built:
\begin{itemize}
\item[--] the vertices are the data points
\item[--] each vertex has $n$ outgoing edges to its nearest neighbors
\item[--] the sum of the edge weights is bounded by a constant
\end{itemize}
\item[$\rhd$] The edge weights are symmetrized $\Rightarrow$ the orientation disappears.
\item[$\rhd$] The graph is embedded into $\mathbb{R}^{d^\prime}$ using a force-directed algorithm: the vertices attract and repel each other depending on the edge weights.
\item Code sketches for PCA and t-SNE are shown on the next slide.
\end{itemize}
\end{frame}
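\begin{frame}[fragile]
\frametitle{PCA and t-SNE in code}

\begin{itemize}
\item A minimal scikit-learn sketch (illustrative only; the digits dataset and all parameter values are just an example):
\end{itemize}

\begin{minted}{python3}
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 64-dimensional observations

# Linear: project onto the first two principal components.
X_pca = PCA(n_components=2).fit_transform(X)

# Nonlinear: a t-SNE embedding into the plane (for visualization).
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

# UMAP is not part of scikit-learn; it lives in the separate
# umap-learn package (umap.UMAP) with a similar fit_transform API.
print(X.shape, X_pca.shape, X_tsne.shape)
\end{minted}
\end{frame}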

\end{document}
--------------------------------------------------------------------------------
/9_evaluation_selection/HOMEWORK.md:
--------------------------------------------------------------------------------
# Evaluation-selection homework

Welcome to the assignment of the evaluation and selection module. This time we encourage you to implement an ML project from scratch with a structure close to production ML projects. To simplify things a little bit, you will implement only the model training, evaluation, and selection steps. As a challenge, you can try to add deployment and inference steps later, after finishing the RS course.

## How to solve it?

Carefully read all the instructions before writing any code. Don't worry if it seems too complicated at first glance: it's meant to be. Decompose the big task into small and easy tasks and implement them one by one. We prepared a [demo solution](https://github.com/rolling-scopes-school/ml-project-demo) for another dataset that can help you while implementing yours. You don't have to follow it entirely (and shouldn't), but use it as a guide. For some tasks, you'll have to do things that are not implemented in the demo: that's an exercise in finding the necessary information on your own.

There are three kinds of tasks: necessary conditions, good to have, and optional. Everything that is not marked as optional or necessary is good to have. You'll need to pass the necessary conditions to get more than 0 points for this homework, but there are not that many of them. Concentrate on the good-to-have tasks and do them as well as you can. Do optional tasks only if you have some time and energy left, and only if you are confident that the other tasks are implemented with great quality.

If you're worried that someone may steal your work, start with a private GitHub repository. You'll need to make it public before submitting it for peer review; all your previous commits will become public after that. Submit a link to your public repository. Before submitting, check that everything is shown correctly to other people by opening the link you provide in a private window.

## How to review it?

If you're reviewing someone else's homework, you should receive a link to a public GitHub repository with this person's solution. Calculate points as stated in bold font after most of the tasks in the statement. If some tasks are only partially done, you can award a partial number of points for them. If nothing is specified in bold font, it means either that there are no points for the task or that it consists entirely of sub-tasks and has no value on its own.
To calculate the total result, sum up the points from all tasks that have points.

The maximum number of points for this homework is 100; 22 of them are for optional tasks.

Please leave feedback for each task: don't just write the total number of points a student received.

You can use [this spreadsheet](https://docs.google.com/spreadsheets/d/14QBY9aSRnKsx2mTYm-5OFBIV_NBg0Xh-QbcO6U3QlpY/edit?usp=sharing) as a template for your review. Make a copy of it and fill it out during the review: you'll see the total number of points at the bottom. Send the final version to the person you were reviewing.

## Homework statement

1. Use the [Forest train dataset](https://www.kaggle.com/competitions/forest-cover-type-prediction). You will solve the task of forest cover type prediction and compete with other participants. **(necessary condition, 0 points for the whole homework if not done)**
2. Format your homework as a Python package. Use an [src layout](https://blog.ionelmc.ro/2014/05/25/python-packaging/#the-structure) or choose some other layout that seems reasonable to you, and explain your choice in the README file. Don't use Jupyter Notebooks for this homework. Instead, write your code in .py files. **(necessary condition, 0 points for the whole homework if not done)**
3. Publish your code to GitHub. **(necessary condition, 0 points for the whole homework if not done)**
    1. Commits should be small and pushed while you're working on the project (not at the last moment, since storing unpublished code locally for a long time is not reliable: imagine something bad happens to your PC and you lose all your code). Your repository should have at least 30 commits if you do all non-optional parts of this homework. **(12 points)**
4. Use [Poetry](https://python-poetry.org/) to manage your package and dependencies. **(6 points)**
    1. When installing dependencies, think about whether they will be used to run scripts from your package or whether you'll need them only for development purposes (such as testing, formatting code, etc.). For development dependencies, use [the dev option of the add command](https://python-poetry.org/docs/cli/#add). If you decide not to use Poetry, list your dependencies in requirements.txt and requirements-dev.txt files. **(4 points)**
5. Create a data folder and place the dataset there. **(necessary condition, 0 points for the whole homework if not done. *Note for reviewers: the data folder won't be visible on GitHub if it is added to gitignore; that's OK, check gitignore*)**
    1. Don't forget to add your data to gitignore. **(5 points)**
    2. (optional) Write a script that generates an EDA report for you, e.g. with [pandas profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/)
6. Write a script that trains a model and saves it to a file. Your script should be runnable from the terminal and receive some arguments, such as the path to the data, model configuration, etc. To create a CLI, you can use argparse, click (as in the demo), hydra, or some other alternative. A minimal, illustrative sketch of such a script is shown at the end of this file. **(10 points)**
    1. (optional) Register your script in pyproject.toml. This way you can run it without specifying the full path to the script file. **(2 points)**
7. Choose some metrics to validate your model (at least 3) and calculate them after training. Use K-fold cross-validation. **(10 points maximum: 2 per metric + 4 for K-fold. *Note for reviewers: K-fold CV may be overwritten by nested CV if the 9th task is implemented; check the commit history in this case. If more than 3 metrics were chosen, only 3 are graded*)**
8. Conduct experiments with your model. Track each experiment with MLFlow. Take a screenshot of the results in the MLFlow UI and include it in the README. You can see an example screenshot below, but in your case it may be more complex than that. Choose the best configuration with respect to a single metric (the most important of all the metrics you calculate, in your opinion).
    1. Try at least three different sets of hyperparameters for each model. **(3 points)**
    2. Try at least two different feature engineering techniques for each model. **(4 points)**
    3. Try at least two different ML models. **(4 points)**

![MLFlow experiments example](https://user-images.githubusercontent.com/40484210/147333877-8acc8c51-00f6-4278-bf76-05abf51301ab.png)

9. Instead of tuning hyperparameters manually, use automatic hyperparameter search for each model (choose a single metric again). Estimate quality with nested cross-validation, e.g. as described [here](https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/). Although you used a single metric for model selection, the quality should be measured with all the metrics you chose in task 7. **(10 points)**
10. In your README, write instructions on how to run your code (the training script and, optionally, other scripts you created, such as EDA). If someone who cloned your repository correctly follows the steps you describe, the script should work for them and produce the same results as it produced on your PC (so don't forget about specifying random seeds). The instructions should be as unambiguous and easy to follow as possible. **(10 points)**
    1. (optional) If you do the optional tasks below, add a development guide to the README. You should specify what other developers should do to continue working on your code: what dependencies they should install, how they should run tests, formatting, etc. **(2 points)**
11. (optional) Test your code. Tests should be reproducible and depend only on your code, not on the data folder or external resources. To provide some data for a test, you should either [generate random data](https://faker.readthedocs.io/en/master/) during the test run or put a small sample of real data into the *tests/* folder, which will be excluded from gitignore. You should have at least one test that describes a situation when everything works fine (exit code 0, valid model produced), and one test that checks that it fails or returns an error message on invalid usage. Since you're working with files, you'll also need to create an isolated filesystem that will temporarily store test files and remove them after the tests are finished. Read how to do it [here](https://click.palletsprojects.com/en/8.0.x/testing/) for click, or find the solution yourself if you use another option for the CLI. Provide a screenshot showing that the tests pass.
    1. (optional) One or more tests for error cases without using fake/sample data and filesystem isolation, as in the demo. **(3 points)**
    2. (optional) A test for a valid input case with test data, filesystem isolation, and a check that the saved model is correct. **(5 points)**
12. (optional) Format your code with black and lint it with flake8. Provide a screenshot showing that linting and formatting pass. **(2 points)**
13. (optional) Type-annotate your code and run mypy to ensure the types are correct. It's not necessary to use strict mode as in the demo, but make sure all of the methods you implemented are type-annotated and used correctly throughout the code. Provide a screenshot of the mypy report; it should be successful. **(3 points)**
14. (optional) To combine the testing and linting steps into a single command, use a nox or tox session. Provide a single screenshot for all sessions, such as in the example below. **(2 points)**

![nox report](https://user-images.githubusercontent.com/40484210/147333990-86db2125-5aff-4bb7-9431-e92e4e8894cc.png)

15. (optional) Create a GitHub action that runs tests and linters against your code automatically each time you push to the main branch. Run it to see if it works. The status of your workflow should be 'completed successfully'. **(3 points *If you check someone else's work, you can verify the presence of passing GitHub actions as shown in the example below: if actions are present, you'll see the workflow runs after clicking on the button*)**

![GitHub actions button](https://user-images.githubusercontent.com/40484210/147334036-6915e696-f5ea-46a2-86ba-170c72ef578c.png)

![GitHub actions workflows](https://user-images.githubusercontent.com/40484210/147334079-6097c5db-762e-4f1c-ae3c-f01b7d98823f.png)
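
## Appendix: a minimal training script sketch

Below is a minimal, illustrative sketch of a training script in the spirit of tasks 6, 7, and 8. Every name in it (the file name train.py, the CLI options, the column name `Cover_Type`, the chosen model and metrics) is an assumption made only for this example, not a requirement; your own design and the demo solution take precedence.

```python
# train.py -- illustrative sketch only; adapt names and options to your project.
import click
import mlflow
import pandas as pd
from joblib import dump
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate


@click.command()
@click.option("--dataset-path", type=click.Path(exists=True), default="data/train.csv")
@click.option("--save-model-path", type=click.Path(), default="data/model.joblib")
@click.option("--n-estimators", type=int, default=100)
@click.option("--random-state", type=int, default=42)
def train(dataset_path: str, save_model_path: str, n_estimators: int, random_state: int) -> None:
    # Column names follow the Kaggle Forest dataset; adjust them if needed.
    data = pd.read_csv(dataset_path)
    X, y = data.drop(columns=["Cover_Type"]), data["Cover_Type"]

    model = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)

    with mlflow.start_run():
        # K-fold cross-validation with three example metrics (task 7).
        scores = cross_validate(
            model, X, y, cv=5, scoring=["accuracy", "f1_macro", "roc_auc_ovr"]
        )
        mlflow.log_param("n_estimators", n_estimators)
        for name in ("accuracy", "f1_macro", "roc_auc_ovr"):
            mlflow.log_metric(name, scores[f"test_{name}"].mean())

    # Fit on the full training data and save the model (task 6).
    model.fit(X, y)
    dump(model, save_model_path)
    click.echo(f"Model saved to {save_model_path}.")


if __name__ == "__main__":
    train()
```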
--------------------------------------------------------------------------------
/9_evaluation_selection/README.md:
--------------------------------------------------------------------------------
# Model evaluation and selection

> "Here is a system. I'm very sure that it's terrible." - *Yaser Abu-Mostafa*

Welcome to the evaluation and selection module of the RS School Machine Learning course. Hopefully, this module will help you learn:
- How to select the best model and evaluate its performance.
- How to track your model experiments.
- How to organize your ML project.
- How to write clean, reproducible, and flexible code.

This module has an assignment that is quite voluminous. Start working on it early to complete it before the deadline.

## Model evaluation and selection
1. A series of blog posts on model evaluation and selection from Sebastian Raschka (parts [1](https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html), [2](https://sebastianraschka.com/blog/2016/model-evaluation-selection-part2.html), [3](https://sebastianraschka.com/blog/2016/model-evaluation-selection-part3.html), and [4](https://sebastianraschka.com/blog/2018/model-evaluation-selection-part4.html)). It discusses what evaluation and selection are and how to estimate their uncertainty. Feel free to skip any complicated maths; rather, try to focus on what may be useful for you: holdout, k-fold, and nested cross-validation.
2. [Cross Validation in Time Series](https://medium.com/@soumyachess1496/cross-validation-in-time-series-566ae4981ce4). This article will tell you about validating time series models. However, the concept described there is very important and is often used with any data that has temporal features, even if it is not a time series. So study it carefully and try to understand why we can't use a simple random split in this case.
3. The scikit-learn [guide](https://scikit-learn.org/stable/model_selection.html) to cross-validation, with code examples.
4. Another [guide](https://weina.me/nested-cross-validation/) to nested cross-validation with a nice visualization. A minimal code sketch of k-fold and nested cross-validation is shown right below this list.
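
To make these terms concrete, here is a minimal scikit-learn sketch of k-fold and nested cross-validation. It is illustrative only: the dataset, the model, and the parameter grid are arbitrary assumptions for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Plain k-fold cross-validation: estimate the score of a fixed model.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
print(cross_val_score(model, X, y, cv=outer_cv).mean())

# Nested cross-validation: the inner loop tunes hyperparameters,
# the outer loop estimates the quality of the whole tuning procedure.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(model, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=inner_cv)
print(cross_val_score(search, X, y, cv=outer_cv).mean())
```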

## Tracking experiments
1. [ML Experiment Tracking: What It Is, Why It Matters, and How to Implement It](https://neptune.ai/blog/ml-experiment-tracking)
2. [How We Track Machine Learning Experiments with MLFlow](https://www.datarevenue.com/en-blog/how-we-track-machine-learning-experiments-with-mlflow)

## Organizing your projects
Usually, using Jupyter notebooks to write all or even some of your project code is not a good idea: it's difficult to manage versions with git, test, format, reproduce, and run such code. You can watch [this talk](https://www.youtube.com/watch?v=7jiPeIFXb6U) by Joel Grus for a more detailed argument. But if using notebooks is bad practice, what should we do instead?
There is no universal pattern for project organization. Python is a flexible language – it allows you to choose the layouts and tools that are best for your specific case. The best advice that you should follow in all cases is [The Zen of Python](https://www.python.org/dev/peps/pep-0020/), which works for project structures as well as for code.
However, you don't have to reinvent the wheel. Below are links to what has worked well for other people: study them, adapt them, and possibly combine the approaches.
Also, stay updated on new tools, since they emerge all the time.
In most cases, and with any structure, you should use a separate virtual environment for each of your projects. If you are unfamiliar with the concept, watch [this video](https://www.youtube.com/watch?v=KxvKCSwlUv8). There are alternatives to venv that are popular among data scientists, such as conda or poetry, but try to understand venv first, as it is the simplest one.
1. [Structuring Your Project](https://docs.python-guide.org/writing/structure/) from The Hitchhiker's Guide to Python. A simple, vanilla Python approach with no additional tools. Even if you prefer more modern approaches, read this guide to understand what's happening under the hood and to learn best practices.
2. Go through the chapters of [Hypermodern Python](https://cjolowicz.github.io/posts/hypermodern-python-01-setup/) by Claudio Jolowicz. Here you can look at a (hyper)modern approach to structuring a Python package. Well, at least for 2020, when this article was written. The approach is to use lots of fancy tools that make our lives much easier but sometimes introduce annoying bugs. It will teach you not only how to structure your project but also how to test, format, lint, and document it. It also has a [cookiecutter](https://github.com/cjolowicz/cookiecutter-hypermodern-python) that is being updated and may be even more hypermodern than the article itself.
3. [Cookiecutter Data Science](https://drivendata.github.io/cookiecutter-data-science/) offers you a structure developed specifically for ML projects.

## Code style, reproducibility, testing
It's a myth that Data Scientists shouldn't write high-quality code because they are more researchers than engineers. ML systems can be more complex than any others, and they carry even greater risks of [technical debt](https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf). That's why it's important to design solutions of high quality, follow style guides, and test your applications.
1. [PEP-8](https://www.python.org/dev/peps/pep-0008/) is a style guide to follow when writing Python code. You don't have to remember all the rules – there are tools, such as [flake8](https://flake8.pycqa.org/en/latest/), that will automatically detect when you break them.
2. You can make Python development more pleasurable and efficient if you use type annotations. Watch [this talk](https://www.youtube.com/watch?v=pMgmKJyWKn8) if you want to know why and how to do it. A tiny annotated example is sketched right after this list.
3. [Reproducibility, Replicability, and Data Science](https://www.kdnuggets.com/2019/11/reproducibility-replicability-data-science.html).
4. [Here](https://realpython.com/python-testing/) is a tutorial for beginners on testing Python code.
5. Testing machine learning models can be trickier than testing traditional software; see [this article](https://www.jeremyjordan.me/testing-ml/) for details.
6. There are several approaches to testing software. If you already know how to write tests in Python and are thinking about when to write them and how to find the balance between the reliability and complexity of your tests, you may find [this talk](https://www.youtube.com/watch?v=EZ05e7EMOLM) on Test-Driven Development helpful.
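
As a tiny illustration of type annotations and a matching pytest test (the function, the file names, and the values are made up for this example):

```python
# metrics.py -- a hypothetical helper with type annotations.
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of positions where the prediction matches the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# test_metrics.py -- run with `pytest`; mypy and flake8 check the same files.
import pytest

from metrics import accuracy


def test_accuracy_on_a_simple_case() -> None:
    assert accuracy([1, 0, 1, 1], [1, 0, 0, 1]) == 0.75


def test_accuracy_rejects_mismatched_lengths() -> None:
    with pytest.raises(ValueError):
        accuracy([1], [1, 0])
```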

## Assignment
You can see the assignment for this module in the HOMEWORK.md file. Good luck!
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# RS School Machine Learning course
CURRENT RUN
- This is the second run (2022) version.
- Materials and assignment information for each topic are in the corresponding folders.
--------------------------------------------------------------------------------