├── LICENSE.md ├── Notebooks [eng] ├── 08) Quality metrics and error analysis in ML.ipynb ├── 09) Optimization methods in ML & DL.ipynb ├── ML-algorithms from scratch │ ├── 01) Linear regression & its types.ipynb │ ├── 02) Logistic & Softmax-regressions.ipynb │ ├── 03) Linear Discriminant Analysis (LDA).ipynb │ ├── 04) Naive Bayes Classifier.ipynb │ ├── 05) Support Vector Machines (SVM).ipynb │ ├── 06) K-Nearest Neighbors (KNN).ipynb │ ├── 07) Decision Tree (CART).ipynb │ ├── 08) Bagging & Random Forest.ipynb │ ├── 09) AdaBoost algorithms (SAMME & R2).ipynb │ ├── 10) Gradient Boosting (GBM, XGBoost, CatBoost, LightGBM).ipynb │ ├── 11) Stacking & Blending.ipynb │ ├── 12) Principal Component Analysis (PCA).ipynb │ └── 13) Clustering in ML (K-Means, Agglomerative, Spectral, DBSCAN, Affinity Propagation).ipynb └── empty.txt ├── Notebooks [rus] ├── 08) Метрики качества и анализ ошибок в ML.ipynb ├── 09) Методы оптимизации в ML и DL.ipynb ├── ML-алгоритмы с нуля │ ├── 01) Линейная регрессия и её виды.ipynb │ ├── 02) Логистическая и Softmax-регрессии.ipynb │ ├── 03) Линейный дискриминантный анализ (LDA).ipynb │ ├── 04) Наивный байесовский классификатор.ipynb │ ├── 05) Метод опорных векторов (SVM).ipynb │ ├── 06) K-ближайших соседей (KNN).ipynb │ ├── 07) Дерево решений (CART).ipynb │ ├── 08) Бэггинг и случайный лес.ipynb │ ├── 09) Алгоритмы AdaBoost (SAMME & R2).ipynb │ ├── 10) Градиентный бустинг (GBM, XGBoost, CatBoost, LightGBM).ipynb │ ├── 11) Стекинг & Блендинг.ipynb │ ├── 12) Метод главных компонент (PCA).ipynb │ └── 13) Кластеризация в ML (K-Means, Agglomerative, Spectral, DBSCAN, Affinity Propagation).ipynb └── empty.txt └── README.md /LICENSE.md: -------------------------------------------------------------------------------- 1 | Attribution-NonCommercial-ShareAlike 4.0 International 2 | 3 | ======================================================================= 4 | 5 | Creative Commons Corporation ("Creative Commons") is not a law firm and 6 | does not provide 
legal services or legal advice. Distribution of 7 | Creative Commons public licenses does not create a lawyer-client or 8 | other relationship. Creative Commons makes its licenses and related 9 | information available on an "as-is" basis. Creative Commons gives no 10 | warranties regarding its licenses, any material licensed under their 11 | terms and conditions, or any related information. Creative Commons 12 | disclaims all liability for damages resulting from their use to the 13 | fullest extent possible. 14 | 15 | Using Creative Commons Public Licenses 16 | 17 | Creative Commons public licenses provide a standard set of terms and 18 | conditions that creators and other rights holders may use to share 19 | original works of authorship and other material subject to copyright 20 | and certain other rights specified in the public license below. The 21 | following considerations are for informational purposes only, are not 22 | exhaustive, and do not form part of our licenses. 23 | 24 | Considerations for licensors: Our public licenses are 25 | intended for use by those authorized to give the public 26 | permission to use material in ways otherwise restricted by 27 | copyright and certain other rights. Our licenses are 28 | irrevocable. Licensors should read and understand the terms 29 | and conditions of the license they choose before applying it. 30 | Licensors should also secure all rights necessary before 31 | applying our licenses so that the public can reuse the 32 | material as expected. Licensors should clearly mark any 33 | material not subject to the license. This includes other CC- 34 | licensed material, or material used under an exception or 35 | limitation to copyright. 
More considerations for licensors: 36 | wiki.creativecommons.org/Considerations_for_licensors 37 | 38 | Considerations for the public: By using one of our public 39 | licenses, a licensor grants the public permission to use the 40 | licensed material under specified terms and conditions. If 41 | the licensor's permission is not necessary for any reason--for 42 | example, because of any applicable exception or limitation to 43 | copyright--then that use is not regulated by the license. Our 44 | licenses grant only permissions under copyright and certain 45 | other rights that a licensor has authority to grant. Use of 46 | the licensed material may still be restricted for other 47 | reasons, including because others have copyright or other 48 | rights in the material. A licensor may make special requests, 49 | such as asking that all changes be marked or described. 50 | Although not required by our licenses, you are encouraged to 51 | respect those requests where reasonable. More considerations 52 | for the public: 53 | wiki.creativecommons.org/Considerations_for_licensees 54 | 55 | ======================================================================= 56 | 57 | Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International 58 | Public License 59 | 60 | By exercising the Licensed Rights (defined below), You accept and agree 61 | to be bound by the terms and conditions of this Creative Commons 62 | Attribution-NonCommercial-ShareAlike 4.0 International Public License 63 | ("Public License"). To the extent this Public License may be 64 | interpreted as a contract, You are granted the Licensed Rights in 65 | consideration of Your acceptance of these terms and conditions, and the 66 | Licensor grants You such rights in consideration of benefits the 67 | Licensor receives from making the Licensed Material available under 68 | these terms and conditions. 69 | 70 | 71 | Section 1 -- Definitions. 72 | 73 | a. 
Adapted Material means material subject to Copyright and Similar 74 | Rights that is derived from or based upon the Licensed Material 75 | and in which the Licensed Material is translated, altered, 76 | arranged, transformed, or otherwise modified in a manner requiring 77 | permission under the Copyright and Similar Rights held by the 78 | Licensor. For purposes of this Public License, where the Licensed 79 | Material is a musical work, performance, or sound recording, 80 | Adapted Material is always produced where the Licensed Material is 81 | synched in timed relation with a moving image. 82 | 83 | b. Adapter's License means the license You apply to Your Copyright 84 | and Similar Rights in Your contributions to Adapted Material in 85 | accordance with the terms and conditions of this Public License. 86 | 87 | c. BY-NC-SA Compatible License means a license listed at 88 | creativecommons.org/compatiblelicenses, approved by Creative 89 | Commons as essentially the equivalent of this Public License. 90 | 91 | d. Copyright and Similar Rights means copyright and/or similar rights 92 | closely related to copyright including, without limitation, 93 | performance, broadcast, sound recording, and Sui Generis Database 94 | Rights, without regard to how the rights are labeled or 95 | categorized. For purposes of this Public License, the rights 96 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 97 | Rights. 98 | 99 | e. Effective Technological Measures means those measures that, in the 100 | absence of proper authority, may not be circumvented under laws 101 | fulfilling obligations under Article 11 of the WIPO Copyright 102 | Treaty adopted on December 20, 1996, and/or similar international 103 | agreements. 104 | 105 | f. Exceptions and Limitations means fair use, fair dealing, and/or 106 | any other exception or limitation to Copyright and Similar Rights 107 | that applies to Your use of the Licensed Material. 108 | 109 | g. 
License Elements means the license attributes listed in the name 110 | of a Creative Commons Public License. The License Elements of this 111 | Public License are Attribution, NonCommercial, and ShareAlike. 112 | 113 | h. Licensed Material means the artistic or literary work, database, 114 | or other material to which the Licensor applied this Public 115 | License. 116 | 117 | i. Licensed Rights means the rights granted to You subject to the 118 | terms and conditions of this Public License, which are limited to 119 | all Copyright and Similar Rights that apply to Your use of the 120 | Licensed Material and that the Licensor has authority to license. 121 | 122 | j. Licensor means the individual(s) or entity(ies) granting rights 123 | under this Public License. 124 | 125 | k. NonCommercial means not primarily intended for or directed towards 126 | commercial advantage or monetary compensation. For purposes of 127 | this Public License, the exchange of the Licensed Material for 128 | other material subject to Copyright and Similar Rights by digital 129 | file-sharing or similar means is NonCommercial provided there is 130 | no payment of monetary compensation in connection with the 131 | exchange. 132 | 133 | l. Share means to provide material to the public by any means or 134 | process that requires permission under the Licensed Rights, such 135 | as reproduction, public display, public performance, distribution, 136 | dissemination, communication, or importation, and to make material 137 | available to the public including in ways that members of the 138 | public may access the material from a place and at a time 139 | individually chosen by them. 140 | 141 | m. 
Sui Generis Database Rights means rights other than copyright 142 | resulting from Directive 96/9/EC of the European Parliament and of 143 | the Council of 11 March 1996 on the legal protection of databases, 144 | as amended and/or succeeded, as well as other essentially 145 | equivalent rights anywhere in the world. 146 | 147 | n. You means the individual or entity exercising the Licensed Rights 148 | under this Public License. Your has a corresponding meaning. 149 | 150 | 151 | Section 2 -- Scope. 152 | 153 | a. License grant. 154 | 155 | 1. Subject to the terms and conditions of this Public License, 156 | the Licensor hereby grants You a worldwide, royalty-free, 157 | non-sublicensable, non-exclusive, irrevocable license to 158 | exercise the Licensed Rights in the Licensed Material to: 159 | 160 | a. reproduce and Share the Licensed Material, in whole or 161 | in part, for NonCommercial purposes only; and 162 | 163 | b. produce, reproduce, and Share Adapted Material for 164 | NonCommercial purposes only. 165 | 166 | 2. Exceptions and Limitations. For the avoidance of doubt, where 167 | Exceptions and Limitations apply to Your use, this Public 168 | License does not apply, and You do not need to comply with 169 | its terms and conditions. 170 | 171 | 3. Term. The term of this Public License is specified in Section 172 | 6(a). 173 | 174 | 4. Media and formats; technical modifications allowed. The 175 | Licensor authorizes You to exercise the Licensed Rights in 176 | all media and formats whether now known or hereafter created, 177 | and to make technical modifications necessary to do so. The 178 | Licensor waives and/or agrees not to assert any right or 179 | authority to forbid You from making technical modifications 180 | necessary to exercise the Licensed Rights, including 181 | technical modifications necessary to circumvent Effective 182 | Technological Measures. 
For purposes of this Public License, 183 | simply making modifications authorized by this Section 2(a) 184 | (4) never produces Adapted Material. 185 | 186 | 5. Downstream recipients. 187 | 188 | a. Offer from the Licensor -- Licensed Material. Every 189 | recipient of the Licensed Material automatically 190 | receives an offer from the Licensor to exercise the 191 | Licensed Rights under the terms and conditions of this 192 | Public License. 193 | 194 | b. Additional offer from the Licensor -- Adapted Material. 195 | Every recipient of Adapted Material from You 196 | automatically receives an offer from the Licensor to 197 | exercise the Licensed Rights in the Adapted Material 198 | under the conditions of the Adapter's License You apply. 199 | 200 | c. No downstream restrictions. You may not offer or impose 201 | any additional or different terms or conditions on, or 202 | apply any Effective Technological Measures to, the 203 | Licensed Material if doing so restricts exercise of the 204 | Licensed Rights by any recipient of the Licensed 205 | Material. 206 | 207 | 6. No endorsement. Nothing in this Public License constitutes or 208 | may be construed as permission to assert or imply that You 209 | are, or that Your use of the Licensed Material is, connected 210 | with, or sponsored, endorsed, or granted official status by, 211 | the Licensor or others designated to receive attribution as 212 | provided in Section 3(a)(1)(A)(i). 213 | 214 | b. Other rights. 215 | 216 | 1. Moral rights, such as the right of integrity, are not 217 | licensed under this Public License, nor are publicity, 218 | privacy, and/or other similar personality rights; however, to 219 | the extent possible, the Licensor waives and/or agrees not to 220 | assert any such rights held by the Licensor to the limited 221 | extent necessary to allow You to exercise the Licensed 222 | Rights, but not otherwise. 223 | 224 | 2. 
Patent and trademark rights are not licensed under this 225 | Public License. 226 | 227 | 3. To the extent possible, the Licensor waives any right to 228 | collect royalties from You for the exercise of the Licensed 229 | Rights, whether directly or through a collecting society 230 | under any voluntary or waivable statutory or compulsory 231 | licensing scheme. In all other cases the Licensor expressly 232 | reserves any right to collect such royalties, including when 233 | the Licensed Material is used other than for NonCommercial 234 | purposes. 235 | 236 | 237 | Section 3 -- License Conditions. 238 | 239 | Your exercise of the Licensed Rights is expressly made subject to the 240 | following conditions. 241 | 242 | a. Attribution. 243 | 244 | 1. If You Share the Licensed Material (including in modified 245 | form), You must: 246 | 247 | a. retain the following if it is supplied by the Licensor 248 | with the Licensed Material: 249 | 250 | i. identification of the creator(s) of the Licensed 251 | Material and any others designated to receive 252 | attribution, in any reasonable manner requested by 253 | the Licensor (including by pseudonym if 254 | designated); 255 | 256 | ii. a copyright notice; 257 | 258 | iii. a notice that refers to this Public License; 259 | 260 | iv. a notice that refers to the disclaimer of 261 | warranties; 262 | 263 | v. a URI or hyperlink to the Licensed Material to the 264 | extent reasonably practicable; 265 | 266 | b. indicate if You modified the Licensed Material and 267 | retain an indication of any previous modifications; and 268 | 269 | c. indicate the Licensed Material is licensed under this 270 | Public License, and include the text of, or the URI or 271 | hyperlink to, this Public License. 272 | 273 | 2. You may satisfy the conditions in Section 3(a)(1) in any 274 | reasonable manner based on the medium, means, and context in 275 | which You Share the Licensed Material. 
For example, it may be 276 | reasonable to satisfy the conditions by providing a URI or 277 | hyperlink to a resource that includes the required 278 | information. 279 | 3. If requested by the Licensor, You must remove any of the 280 | information required by Section 3(a)(1)(A) to the extent 281 | reasonably practicable. 282 | 283 | b. ShareAlike. 284 | 285 | In addition to the conditions in Section 3(a), if You Share 286 | Adapted Material You produce, the following conditions also apply. 287 | 288 | 1. The Adapter's License You apply must be a Creative Commons 289 | license with the same License Elements, this version or 290 | later, or a BY-NC-SA Compatible License. 291 | 292 | 2. You must include the text of, or the URI or hyperlink to, the 293 | Adapter's License You apply. You may satisfy this condition 294 | in any reasonable manner based on the medium, means, and 295 | context in which You Share Adapted Material. 296 | 297 | 3. You may not offer or impose any additional or different terms 298 | or conditions on, or apply any Effective Technological 299 | Measures to, Adapted Material that restrict exercise of the 300 | rights granted under the Adapter's License You apply. 301 | 302 | 303 | Section 4 -- Sui Generis Database Rights. 304 | 305 | Where the Licensed Rights include Sui Generis Database Rights that 306 | apply to Your use of the Licensed Material: 307 | 308 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 309 | to extract, reuse, reproduce, and Share all or a substantial 310 | portion of the contents of the database for NonCommercial purposes 311 | only; 312 | 313 | b. if You include all or a substantial portion of the database 314 | contents in a database in which You have Sui Generis Database 315 | Rights, then the database in which You have Sui Generis Database 316 | Rights (but not its individual contents) is Adapted Material, 317 | including for purposes of Section 3(b); and 318 | 319 | c. 
You must comply with the conditions in Section 3(a) if You Share 320 | all or a substantial portion of the contents of the database. 321 | 322 | For the avoidance of doubt, this Section 4 supplements and does not 323 | replace Your obligations under this Public License where the Licensed 324 | Rights include other Copyright and Similar Rights. 325 | 326 | 327 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 328 | 329 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 330 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 331 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 332 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 333 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 334 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 335 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 336 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 337 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 338 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 339 | 340 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 341 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 342 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 343 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 344 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 345 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 346 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 347 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 348 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 349 | 350 | c. The disclaimer of warranties and limitation of liability provided 351 | above shall be interpreted in a manner that, to the extent 352 | possible, most closely approximates an absolute disclaimer and 353 | waiver of all liability. 
354 | 355 | 356 | Section 6 -- Term and Termination. 357 | 358 | a. This Public License applies for the term of the Copyright and 359 | Similar Rights licensed here. However, if You fail to comply with 360 | this Public License, then Your rights under this Public License 361 | terminate automatically. 362 | 363 | b. Where Your right to use the Licensed Material has terminated under 364 | Section 6(a), it reinstates: 365 | 366 | 1. automatically as of the date the violation is cured, provided 367 | it is cured within 30 days of Your discovery of the 368 | violation; or 369 | 370 | 2. upon express reinstatement by the Licensor. 371 | 372 | For the avoidance of doubt, this Section 6(b) does not affect any 373 | right the Licensor may have to seek remedies for Your violations 374 | of this Public License. 375 | 376 | c. For the avoidance of doubt, the Licensor may also offer the 377 | Licensed Material under separate terms or conditions or stop 378 | distributing the Licensed Material at any time; however, doing so 379 | will not terminate this Public License. 380 | 381 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 382 | License. 383 | 384 | 385 | Section 7 -- Other Terms and Conditions. 386 | 387 | a. The Licensor shall not be bound by any additional or different 388 | terms or conditions communicated by You unless expressly agreed. 389 | 390 | b. Any arrangements, understandings, or agreements regarding the 391 | Licensed Material not stated herein are separate from and 392 | independent of the terms and conditions of this Public License. 393 | 394 | 395 | Section 8 -- Interpretation. 396 | 397 | a. For the avoidance of doubt, this Public License does not, and 398 | shall not be interpreted to, reduce, limit, restrict, or impose 399 | conditions on any use of the Licensed Material that could lawfully 400 | be made without permission under this Public License. 401 | 402 | b. 
To the extent possible, if any provision of this Public License is 403 | deemed unenforceable, it shall be automatically reformed to the 404 | minimum extent necessary to make it enforceable. If the provision 405 | cannot be reformed, it shall be severed from this Public License 406 | without affecting the enforceability of the remaining terms and 407 | conditions. 408 | 409 | c. No term or condition of this Public License will be waived and no 410 | failure to comply consented to unless expressly agreed to by the 411 | Licensor. 412 | 413 | d. Nothing in this Public License constitutes or may be interpreted 414 | as a limitation upon, or waiver of, any privileges and immunities 415 | that apply to the Licensor or You, including from the legal 416 | processes of any jurisdiction or authority. 417 | 418 | ======================================================================= 419 | 420 | Creative Commons is not a party to its public 421 | licenses. Notwithstanding, Creative Commons may elect to apply one of 422 | its public licenses to material it publishes and in those instances 423 | will be considered the “Licensor.” The text of the Creative Commons 424 | public licenses is dedicated to the public domain under the CC0 Public 425 | Domain Dedication. Except for the limited purpose of indicating that 426 | material is shared under a Creative Commons public license or as 427 | otherwise permitted by the Creative Commons policies published at 428 | creativecommons.org/policies, Creative Commons does not authorize the 429 | use of the trademark "Creative Commons" or any other trademark or logo 430 | of Creative Commons without its prior written consent including, 431 | without limitation, in connection with any unauthorized modifications 432 | to any of its public licenses or any other arrangements, 433 | understandings, or agreements concerning use of licensed material. For 434 | the avoidance of doubt, this paragraph does not form part of the 435 | public licenses. 
436 | 437 | Creative Commons may be contacted at creativecommons.org. 438 | -------------------------------------------------------------------------------- /Notebooks [eng]/ML-algorithms from scratch/04) Naive Bayes Classifier.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | } 15 | }, 16 | "cells": [ 17 | { 18 | "cell_type": "markdown", 19 | "source": [ 20 | "## **Naive Bayes Classifier**\n", 21 | "Naive Bayes classifier is a probabilistic classifier based on the Bayes formula with a strict (naive) assumption of the independence of features among themselves for a given class, which greatly simplifies the classification task due to the evaluation of one-dimensional probability densities instead of one multidimensional one." 22 | ], 23 | "metadata": { 24 | "id": "WLtcR3DvbaEl" 25 | } 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "source": [ 30 | "### **The main idea**\n", 31 | "\n", 32 | "In this case, a one-dimensional probability density is an estimate of the probability of each feature separately, provided they are independent, and a multidimensional one is an estimate of the probability of a combination of all features, which follows from the case of their dependence. It is for this reason that this classifier is called naive, since it greatly simplifies calculations and increases the efficiency of the algorithm. 
However, this assumption is not always true in practice and in some cases can lead to a significant deterioration in the quality of predictions.\n", 33 | "\n", 34 | "The Bayes formula itself looks like this:\n", 35 | "\n", 36 | "$$P(A|B) = \\frac{P(B|A) P(A)}{P(B)}$$\n", 37 | "\n", 38 | "where:\n", 39 | "- $P(A|B)$ is the posterior probability of event A given that event B has occurred;\n", 40 | "\n", 41 | "- $P(B|A)$ is the conditional probability of event B given that event A has occurred;\n", 42 | "\n", 43 | "- $P(A)$ and $P(B)$ are the prior probabilities of events A and B respectively.\n", 44 | "\n", 45 | "In the context of machine learning, the Bayes formula takes the following form:\n", 46 | "\n", 47 | "$$P(y_k|X) = \\frac{P(y_k)P(X|y_k)}{P(X)}$$\n", 48 | "\n", 49 | "where:\n", 50 | "- $P(y_k|X)$ is the posterior probability that the sample belongs to the class $y_k$ given its features $X$;\n", 51 | "- $P(X|y_k)$ is the likelihood, that is, the probability of the features $X$ for a given class $y_k$;\n", 52 | "- $P(y_k)$ is the prior probability that a randomly selected observation belongs to the class $y_k$;\n", 53 | "- $P(X)$ is the prior probability of the features $X$.\n", 54 | "\n", 55 | "If an object is described by not one, but several features $X_1, X_2, ..., X_n$, then, under the assumption that the features are independent for a given class, the formula takes the form:\n", 56 | "\n", 57 | "$$P(y_k|X_1, X_2, ..., X_n) = \\frac{P(y_k)\\prod_{i=1}^n P(X_i|y_k)}{P(X_1, X_2, ..., X_n)}$$\n", 58 | "\n", 59 | "In practice, the numerator of this formula is of the greatest interest, since the denominator depends only on the features, not on the class, and therefore it is often omitted when comparing the probabilities of different classes. 
Ultimately, the classification rule is to choose the class with the maximum posterior probability:\n", 60 | "\n", 61 | "$$\\hat{y} = \\arg\\max_{y_k} P(y_k)\\prod_{i=1}^n P(X_i|y_k)$$\n", 62 | "\n", 63 | "To estimate the parameters of the model, that is, the probabilities $P(y_k)$ and $P(X_i|y_k)$, the maximum likelihood method is usually used, which is based on the frequency of occurrence of classes and features in the training dataset.\n", 64 | "\n", 65 | "### **Naive Bayes varieties**\n", 66 | "The scikit-learn library has several implementations of the naive Bayes classifier, differing in assumptions about the distribution of features for a given class. These include the following:\n", 67 | "\n", 68 | "- **Gaussian naive Bayes classifier (GaussianNB)** is an option for working with continuous features that have a normal (Gaussian) distribution. The probability density of a feature for a given class is calculated using the formula: $$P(x_i|y) =\\frac{1}{\\sqrt{2\\pi\\sigma_y^2}}\\exp\\left(-\\frac{(x_i-\\mu_y)^2}{2\\sigma_y^2}\\right)$$ where $\\mu_y$ and $\\sigma_y$ are the mean and standard deviation of the feature in the class $y$. These parameters are estimated using the maximum likelihood method based on the training dataset.\n", 69 | "\n", 70 | "- **The Multinomial naive Bayes classifier (MultinomialNB)** is an option for working with discrete features that have a multinomial distribution. Such features are often found in text classification tasks, where they represent the number of occurrences of each word in the text. 
The probability of a feature for a given class is calculated using the formula: $$P(x_i|y) =\\frac{N_{yi} +\\alpha}{N_y + \\alpha n}$$ where $N_{yi}$ is the number of times the feature $i$ occurs in the class $y$; $N_y$ is the total count of all features in the class $y$; $n$ is the number of different features; and $\\alpha$ is a smoothing parameter that prevents the occurrence of zero probabilities.\n", 71 | "\n", 72 | "- **Complement Naive Bayes classifier (ComplementNB)** is an improved version of *MultinomialNB*, suitable for unbalanced datasets. Instead of estimating the probability of a feature for a given class, the algorithm evaluates the normalized weight of the feature $w_{ci}$ for the class $c$ from the probability of the feature in the complement of the class, that is, in all other classes. Thus, the algorithm takes into account not only the frequency of features in the class, but also their absence in other classes, which makes it less sensitive to sampling bias. The formula for calculating the probability of a feature when complementing a class is as follows: \\begin{align}\\begin{aligned}\\hat{\\theta}_{ci} = \\frac{\\alpha_i + \\sum_{j:y_j \\neq c} d_{ij}}\n", 73 | " {\\alpha + \\sum_{j:y_j \\neq c} \\sum_{k} d_{kj}}\\\\w_{ci} = \\log \\hat{\\theta}_{ci}\\\\w_{ci} = \\frac{w_{ci}}{\\sum_{j} |w_{cj}|}\\end{aligned}\\end{align}\n", 74 | "where $\\hat{\\theta}_{ci}$ is an estimate of the probability of the feature $i$ in the complement of the class $c$, which is calculated using the smoothing parameter $\\alpha_i$ (with $\\alpha = \\sum_i \\alpha_i$) and the frequency of the feature $i$ in all classes except $c$ (here $d_{ij}$ is the number of times the feature $i$ occurs in the sample $j$, and the sums run over the samples $j$ whose class $y_j$ is not $c$); $w_{ci}$ is the normalized weight of the feature $i$ for the class $c$. 
The predicted class $\\hat c$ for a given feature vector $t$ is then: $$\\hat{c} = \\arg\\min_c \\sum_{i} t_i w_{ci}$$\n", 75 | "\n", 76 | "- **Bernoulli Naive Bayes classifier (BernoulliNB)** is another option for working with discrete features, in this case ones that follow a Bernoulli distribution. Here the features are binary indicators of the presence or absence of certain properties in an object. For example, in a text classification task, this may be the presence or absence of certain words in the text. The probability of a feature for a given class is calculated using the formula: $$P(x_i|y) = P(x_i = 1|y)x_i + (1-P(x_i = 1|y))(1-x_i)$$ where $P(x_i = 1|y)$ is the probability that the feature $i$ takes the value 1 (true), provided that the object belongs to the class $y$; $x_i$ is the value of the feature $i$ (0 or 1).\n", 77 | "\n", 78 | "- **Categorical Naive Bayes classifier (CategoricalNB)** is an option for categorically distributed data based on the assumption that each feature, identified by its index $i$, has its own categorical distribution. 
The probability of a feature for a given class is calculated using the formula: $$P(x_i = t \\mid y = c \\: ;\\, \\alpha) = \\frac{ N_{tic} + \\alpha}{N_{c} + \\alpha n_i}$$ where $N_{tic} =|\\{j \\in J\\mid x_{ij} =t, y_j =c\\}|$ is the number of times the feature $x_i$ takes the value $t$ in the class $c$; $N_{c} = |\\{j \\in J\\mid y_j = c\\}|$ is the number of samples of the class $c$ in the training dataset; $\\alpha$ is a smoothing parameter; $n_i$ is the number of available values of the feature $i$.\n", 79 | "\n", 80 | "### **The principle of Naive Bayes classifier operation with a Gaussian distribution**\n", 81 | "The algorithm is constructed as follows:\n", 82 | "- 1) the prior probabilities of the classes are calculated first;\n", 83 | "- 2) after that, the means and standard deviations of the features are calculated for each class;\n", 84 | "- 3) based on the obtained means and standard deviations, the probability density of the test features is calculated for each class according to the Gaussian distribution;\n", 85 | "- 4) next, the posterior probabilities are calculated (up to a constant factor) as the product of the prior probabilities of the classes and the probability densities of the test features;\n", 86 | "- 5) the class with the maximum posterior probability is the final prediction." 87 | ], 88 | "metadata": { 89 | "id": "YSjH9spRGuQo" 90 | } 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "source": [ 95 | "### **Naive Bayes in Spam filtering tasks**\n", 96 | "In the context of spam filtering, the Naive Bayes classifier is based on the frequency of occurrence of words in spam and non-spam messages and on maximizing the product of their probabilities. The naivety in this case lies in the assumption that the words in the message are independent of order and context. 
Then the Bayes formula takes the following form:\n", 97 | "\n", 98 | "$$P(C|M) \\propto P(C) \\prod_{i=1}^n P(w_i|C), \\ \\ w_i \\in M$$\n", 99 | "\n", 100 | "where:\n", 101 | "- $C$ is the class (spam or not spam);\n", 102 | "- $M$ is the message;\n", 103 | "- $w_i$ is the i-th word in the message $M$;\n", 104 | "- $\\propto$ is the proportionality sign.\n", 105 | "\n", 106 | "For a better understanding, consider the following example. Suppose we want to classify the message **\"Hi, you won a discount and you can get the prize this evening.\"** and we have a training dataset consisting of the following messages:\n", 107 | "\n", 108 | "| Message | Class |\n", 109 | "| --- | --- |\n", 110 | "| Hi, how are you? | Not spam |\n", 111 | "| Congratulations, you won a prize! | Spam |\n", 112 | "| Buy the product now and get a discount! | Spam |\n", 113 | "| Let's walk this evening. | Not spam |\n", 114 | "\n", 115 | "The first step is to count how often each unique word occurs, and how many words there are in total, in spam and non-spam messages. Then the probability of encountering each word in spam and non-spam messages is calculated from these frequencies. When the message contains words that did not occur in the training dataset, smoothing is used. There are many different types of smoothing, but the simplest one adds $1$ to the frequency of every word. This technique avoids the zero probability problem. With add-one smoothing, each class total also grows by the size of the vocabulary (20 unique words in this example), which gives 8 + 20 = 28 for non-spam and 13 + 20 = 33 for spam. 
Below is a table with the calculation of probabilities for all words.\n", 116 | "\n", 117 | "| Word | Frequency in Not Spam | Frequency in Spam | Probability in Not Spam | Probability in Spam |\n", 118 | "| --- | --- | --- | --- | --- |\n", 119 | "| Hi | 1 + $1$ = 2 | 0 + $1$ = 1 | 2 / 28 = 0.0714 | 1 / 33 = 0.03 |\n", 120 | "| how | 1 + $1$ = 2 | 0 + $1$ = 1 | 2 / 28 = 0.0714 | 1 / 33 = 0.03 |\n", 121 | "| are | 1 + $1$ = 2 | 0 + $1$ = 1 | 2 / 28 = 0.0714 | 1 / 33 = 0.03 |\n", 122 | "| you | 1 + $1$ = 2 | 1 + $1$ = 2 | 2 / 28 = 0.0714 | 2 / 33 = 0.06 |\n", 123 | "| Congratulations | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 124 | "| won | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 125 | "| a | 0 + $1$ = 1 | 2 + $1$ = 3 | 1 / 28 = 0.0357 | 3 / 33 = 0.09 |\n", 126 | "| prize | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 127 | "| Buy | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 128 | "| the | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 129 | "| product | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 130 | "| now | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 131 | "| and | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 132 | "| get | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 133 | "| discount | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 134 | "| Let's | 1 + $1$ = 2 | 0 + $1$ = 1 | 2 / 28 = 0.0714 | 1 / 33 = 0.03 |\n", 135 | "| walk | 1 + $1$ = 2 | 0 + $1$ = 1 | 2 / 28 = 0.0714 | 1 / 33 = 0.03 |\n", 136 | "| this | 1 + $1$ = 2 | 0 + $1$ = 1 | 2 / 28 = 0.0714 | 1 / 33 = 0.03 |\n", 137 | "| evening | 1 + $1$ = 2 | 0 + $1$ = 1 | 2 / 28 = 0.0714 | 1 / 33 = 0.03 |\n", 138 | "| can | 0 + $1$ = 1 | 0 + $1$ = 1 | 1 / 28 = 0.0357 | 1 / 33 = 0.03 |\n", 139 | "| **Total amount of words** | **28** | **33** |\n", 140 | "\n", 141 | "At the end, the probabilities of 
the message being spam or not spam are calculated (up to a common normalizing constant), and the final prediction will be the class with the highest value.\n", 142 | "\n", 143 | "$P(C|M) \\propto P(C) \\cdot P('Hi'|C) \\cdot P('you'|C) \\cdot P('won'|C) \\cdot P('a'|C)\n", 144 | "\\cdot P('discount'|C) \\cdot P('and'|C) \\cdot P('you'|C) \\cdot P('can'|C) \\cdot P('get'|C)\n", 145 | "\\cdot P('the'|C) \\cdot P('prize'|C) \\cdot P('this'|C) \\cdot P('evening'|C)$\n", 146 | "\n", 147 | "where:\n", 148 | "- $C \\in \\{Spam, \\ \\ Not \\ \\ Spam\\}$;\n", 149 | "- $P(Spam) = P(Not \\ \\ Spam) = \\frac{2}{4} = 0.5$\n" 150 | ], 151 | "metadata": { 152 | "id": "Fs8ykp-9SXK1" 153 | } 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "source": [ 158 | "$P(Spam|M) \\propto 0.5 \\cdot 0.03 \\cdot 0.06 \\cdot 0.06 \\cdot 0.09 \\cdot 0.06 \\cdot 0.06 \\cdot 0.06 \\cdot 0.03 \\cdot 0.06 \\cdot 0.06 \\cdot 0.06 \\cdot 0.03 \\cdot 0.03 \\approx 6.12 \\cdot 10^{-18}$\n", 159 | "\n", 160 | "$P(Not \\ \\ Spam|M) \\propto 0.5 \\cdot 0.0714 \\cdot 0.0714 \\cdot 0.0357 \\cdot 0.0357 \\cdot 0.0357 \\cdot 0.0357 \\cdot 0.0714 \\cdot 0.0357 \\cdot 0.0357 \\cdot 0.0357 \\cdot 0.0357 \\cdot 0.0714 \\cdot 0.0714 \\approx 2.45 \\cdot 10^{-18}$\n", 161 | "\n", 162 | "$P(Spam|M) > P(Not \\ \\ Spam|M) \\rightarrow$ **the message is spam**\n", 163 | "\n", 164 | "It is worth adding that in practice the logarithm of the probability is often used instead of the probability itself, both for convenience and for numerical stability." 
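, "\n", "\n", "As a small worked sketch of this point, using the rounded probabilities from the example above and natural logarithms (the choice of base is an assumption made here for illustration), the product turns into a sum of log-probabilities:\n", "\n", "$$\\log P(C) + \\sum_{i=1}^{n} \\log P(w_i|C)$$\n", "\n", "$$\\log(6.12 \\cdot 10^{-18}) \\approx -39.63 \\ \\ (Spam), \\qquad \\log(2.45 \\cdot 10^{-18}) \\approx -40.55 \\ \\ (Not \\ \\ Spam)$$\n", "\n", "The logarithm is monotonic, so the ordering of the classes is preserved and the message is still classified as spam, while long messages no longer risk underflowing to zero."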
165 | ], 166 | "metadata": { 167 | "id": "GbXuXUE4UOrV" 168 | } 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "source": [ 173 | "### **Python implementation from scratch**" 174 | ], 175 | "metadata": { 176 | "id": "aUlIJ0VtU7Dc" 177 | } 178 | }, 179 | { 180 | "cell_type": "code", 181 | "source": [ 182 | "import numpy as np\n", 183 | "import pandas as pd\n", 184 | "import matplotlib.pyplot as plt\n", 185 | "from sklearn.datasets import load_iris\n", 186 | "from sklearn.naive_bayes import GaussianNB\n", 187 | "from sklearn.metrics import accuracy_score\n", 188 | "from sklearn.preprocessing import LabelEncoder\n", 189 | "from sklearn.model_selection import train_test_split\n", 190 | "from mlxtend.plotting import plot_decision_regions" 191 | ], 192 | "metadata": { 193 | "id": "vT1i1YbjVB1f" 194 | }, 195 | "execution_count": 30, 196 | "outputs": [] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "source": [ 201 | "class GaussianNaiveBayes:\n", 202 | "    def fit(self, X, y):\n", 203 | "        self.classes, cls_counts = np.unique(y, return_counts=True)\n", 204 | "        # prior probability of each class: its share of the training samples\n", 205 | "        self.priors = cls_counts / len(y)\n", 206 | "\n", 207 | "        # calculate the mean and standard deviation of each feature within each class\n", 208 | "        self.X_cls_mean = np.array([np.mean(X[y == c], axis=0) for c in self.classes])\n", 209 | "        self.X_stds = np.array([np.std(X[y == c], axis=0) for c in self.classes])\n", 210 | "\n", 211 | "    # calculate the probability density of the feature according to the Gaussian distribution\n", 212 | "    def pdf(self, x, mean, std):\n", 213 | "        return (1 / (np.sqrt(2 * np.pi) * std)) * np.exp(-0.5 * ((x - mean) / std) ** 2)\n", 214 | "\n", 215 | "    def predict(self, X):\n", 216 | "        pdfs = np.array([self.pdf(x, self.X_cls_mean, self.X_stds) for x in X])\n", 217 | "        posteriors = self.priors * np.prod(pdfs, axis=2)  # numerator of Bayes' formula (P(X) is omitted)\n", 218 | "\n", 219 | "        return self.classes[np.argmax(posteriors, axis=1)]" 220 | ], 221 | "metadata": { 222 | "id": "_6JPBgjGVGq8" 
223 | }, 224 | "execution_count": 31, 225 | "outputs": [] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "source": [ 230 | "def decision_boundary_plot(X, y, X_train, y_train, clf, feature_indexes, title=None):\n", 231 | "    feature1_name, feature2_name = X.columns[feature_indexes]\n", 232 | "    X_feature_columns = X.values[:, feature_indexes]\n", 233 | "    X_train_feature_columns = X_train[:, feature_indexes]\n", 234 | "    clf.fit(X_train_feature_columns, y_train)\n", 235 | "\n", 236 | "    plot_decision_regions(X=X_feature_columns, y=y.values, clf=clf)\n", 237 | "    plt.xlabel(feature1_name)\n", 238 | "    plt.ylabel(feature2_name)\n", 239 | "    plt.title(title)" 240 | ], 241 | "metadata": { 242 | "id": "lL23Df3qVJ0R" 243 | }, 244 | "execution_count": 32, 245 | "outputs": [] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "source": [ 250 | "### **Loading a dataset**\n", 251 | "The [Iris dataset](https://www.kaggle.com/datasets/himanshunakrani/iris-dataset) will be used to train the models; the task is to determine the species of each flower from its measurements." 
252 | ], 253 | "metadata": { 254 | "id": "OAJUvHTnVP-s" 255 | } 256 | }, 257 | { 258 | "cell_type": "code", 259 | "source": [ 260 | "df_path = \"/content/drive/MyDrive/iris.csv\"\n", 261 | "iris = pd.read_csv(df_path)\n", 262 | "X1, y1 = iris.iloc[:, :-1], iris.iloc[:, -1]\n", 263 | "y1 = pd.Series(LabelEncoder().fit_transform(y1))\n", 264 | "X1_train, X1_test, y1_train, y1_test = train_test_split(X1.values, y1.values, random_state=0)\n", 265 | "print(iris)\n" 266 | ], 267 | "metadata": { 268 | "colab": { 269 | "base_uri": "https://localhost:8080/" 270 | }, 271 | "id": "SyvzlVvhVQ2a", 272 | "outputId": "de395158-bc96-40cb-88a3-91d38d4d4817" 273 | }, 274 | "execution_count": 33, 275 | "outputs": [ 276 | { 277 | "output_type": "stream", 278 | "name": "stdout", 279 | "text": [ 280 | " sepal_length sepal_width petal_length petal_width species\n", 281 | "0 5.1 3.5 1.4 0.2 setosa\n", 282 | "1 4.9 3.0 1.4 0.2 setosa\n", 283 | "2 4.7 3.2 1.3 0.2 setosa\n", 284 | "3 4.6 3.1 1.5 0.2 setosa\n", 285 | "4 5.0 3.6 1.4 0.2 setosa\n", 286 | ".. ... ... ... ... ...\n", 287 | "145 6.7 3.0 5.2 2.3 virginica\n", 288 | "146 6.3 2.5 5.0 1.9 virginica\n", 289 | "147 6.5 3.0 5.2 2.0 virginica\n", 290 | "148 6.2 3.4 5.4 2.3 virginica\n", 291 | "149 5.9 3.0 5.1 1.8 virginica\n", 292 | "\n", 293 | "[150 rows x 5 columns]\n" 294 | ] 295 | } 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "source": [ 301 | "### **Model training and evaluation of the obtained results**\n", 302 | "Despite its simplicity, the algorithm shows an excellent result in this case, classifying all test samples correctly, which is possible thanks to its flexible decision boundary. An interesting conclusion follows: in some situations simple models can work better than complex ones, as later examples with other algorithms will show." 
303 | ], 304 | "metadata": { 305 | "id": "6mJoUO12ZYlY" 306 | } 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "source": [ 311 | "**Naive Bayes**" 312 | ], 313 | "metadata": { 314 | "id": "gi-kXxHhZyGZ" 315 | } 316 | }, 317 | { 318 | "cell_type": "code", 319 | "source": [ 320 | "nb_clf = GaussianNaiveBayes()\n", 321 | "nb_clf.fit(X1_train, y1_train)\n", 322 | "nb_clf_pred_res = nb_clf.predict(X1_test)\n", 323 | "nb_clf_accuracy = accuracy_score(y1_test, nb_clf_pred_res)\n", 324 | "\n", 325 | "print(f'Naive Bayes classifier accuracy: {nb_clf_accuracy}')\n", 326 | "print(nb_clf_pred_res)" 327 | ], 328 | "metadata": { 329 | "colab": { 330 | "base_uri": "https://localhost:8080/" 331 | }, 332 | "id": "HNy3mPlvYpq1", 333 | "outputId": "bfebbc89-a4f7-4d45-8b74-b378fff43177" 334 | }, 335 | "execution_count": 34, 336 | "outputs": [ 337 | { 338 | "output_type": "stream", 339 | "name": "stdout", 340 | "text": [ 341 | "Naive Bayes classifier accuracy: 1.0\n", 342 | "[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0\n", 343 | " 1]\n" 344 | ] 345 | } 346 | ] 347 | }, 348 | { 349 | "cell_type": "markdown", 350 | "source": [ 351 | "**Naive Bayes (scikit-learn)**" 352 | ], 353 | "metadata": { 354 | "id": "vcUJmnaaZ1dF" 355 | } 356 | }, 357 | { 358 | "cell_type": "code", 359 | "source": [ 360 | "sk_nb_clf = GaussianNB()\n", 361 | "sk_nb_clf.fit(X1_train, y1_train)\n", 362 | "sk_nb_clf_pred_res = sk_nb_clf.predict(X1_test)\n", 363 | "sk_nb_clf_accuracy = accuracy_score(y1_test, sk_nb_clf_pred_res)\n", 364 | "\n", 365 | "print(f'sk Naive Bayes classifier accuracy: {sk_nb_clf_accuracy}')\n", 366 | "print(sk_nb_clf_pred_res)\n", 367 | "\n", 368 | "feature_indexes = [2, 3]\n", 369 | "title1 = 'GaussianNB surface'\n", 370 | "decision_boundary_plot(X1, y1, X1_train, y1_train, sk_nb_clf, feature_indexes, title1)" 371 | ], 372 | "metadata": { 373 | "colab": { 374 | "base_uri": "https://localhost:8080/", 375 | "height": 524 376 | }, 377 | "id": "OFUE33eLZ2DV", 
378 | "outputId": "1d38e397-4c87-4088-ac35-932725c2ad5c" 379 | }, 380 | "execution_count": 35, 381 | "outputs": [ 382 | { 383 | "output_type": "stream", 384 | "name": "stdout", 385 | "text": [ 386 | "sk Naive Bayes classifier accuracy: 1.0\n", 387 | "[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0\n", 388 | " 1]\n" 389 | ] 390 | }, 391 | { 392 | "output_type": "display_data", 393 | "data": { 394 | "text/plain": [ 395 | "
" 396 | ], 397 | "image/png": "\n" 398 | }, 399 | "metadata": {} 400 | } 401 | ] 402 | }, 403 | { 404 | "cell_type": "markdown", 405 | "source": [ 406 | "### **Pros and cons of the Naive Bayes classifier**\n", 407 | "Pros:\n", 408 | "- easy to implement and interpret;\n", 409 | "- practically no parameter settings are required;\n", 410 | "- high speed and accuracy of predictions in many situations;\n", 411 | "- it has relatively good resistance to noise and outliers, since it is based on probability distributions and a naive assumption of the independence of features.\n", 412 | "\n", 413 | "Cons:\n", 414 | "- in a case of violation of the assumption of independence of features, the accuracy of predictions may significantly decrease;\n", 415 | "- may give preference to classes with a large number of samples in case of unbalanced data." 416 | ], 417 | "metadata": { 418 | "id": "01g1BCfwaFwn" 419 | } 420 | }, 421 | { 422 | "cell_type": "markdown", 423 | "source": [ 424 | "### **Additional sources**\n", 425 | "Paper «Bayes and Naive-Bayes Classifier», Rajiv Gandhi, Andhra Pradesh.\n", 426 | "\n", 427 | "Documentation:\n", 428 | "- [Naive Bayes description](https://scikit-learn.org/stable/modules/naive_bayes.html);\n", 429 | "- [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB);\n", 430 | "- [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB);\n", 431 | "- [ComplementNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.ComplementNB.html#sklearn.naive_bayes.ComplementNB);\n", 432 | "- [BernoulliNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB);\n", 433 | "- [CategoricalNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html#sklearn.naive_bayes.CategoricalNB).\n", 434 | "\n", 
435 | "Video: [one](https://www.youtube.com/watch?v=O2L2Uv9pdDA), [two](https://www.youtube.com/watch?v=H3EjCKtlVog), [three](https://www.youtube.com/watch?v=nt63k3bfXS0), [four](https://www.youtube.com/watch?v=ADj95edZc0w).\n", 436 | "\n" 437 | ], 438 | "metadata": { 439 | "id": "ovU3wJ2aaojO" 440 | } 441 | } 442 | ] 443 | } -------------------------------------------------------------------------------- /Notebooks [eng]/ML-algorithms from scratch/12) Principal Component Analysis (PCA).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | } 15 | }, 16 | "cells": [ 17 | { 18 | "cell_type": "markdown", 19 | "source": [ 20 | "## **Principal Component Analysis**\n", 21 | "Principal Component Analysis (PCA) is an unsupervised learning algorithm used to reduce dimensionality and identify the most informative directions in the data. Its essence lies in assuming linear relationships in the data and projecting the data onto a subspace of orthogonal vectors along which the variance is maximal. Such vectors are called principal components, and they determine the directions of greatest variability (informativeness) of the data. Alternatively, PCA can be defined as the linear projection that minimizes the mean squared distance between the original points and their projections." 22 | ], 23 | "metadata": { 24 | "id": "OjhUfMJ5v37E" 25 | } 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "source": [ 30 | "### **The principle of operation of PCA**\n", 31 | "First, the feature matrix must be centered so that the first principal component corresponds to the direction of maximum variance of the data rather than simply to their mean. 
Usually, finding the principal components comes down to two main methods:\n", 32 | " - **Calculation of eigenvectors and eigenvalues of the covariance matrix of data**. Since the covariance matrix reflects the degree of linear relationship between different variables, the eigenvectors of this matrix set the directions along which the data variance is maximal, and the eigenvalues set the magnitude of this variance. The eigenvalue corresponding to an eigenvector characterizes the contribution of this vector to explaining the variance of the data: the greater the eigenvalue, the more significant the principal component. Usually, only those principal components are selected that together explain a given level of variance, for example, 95%.\n", 33 | "\n", 34 | " - **Calculation of the singular value decomposition of the data matrix**. Singular value decomposition represents any matrix as a product of three other matrices: the matrix of left singular vectors U, the diagonal matrix of singular values S, and the matrix of right singular vectors V. For a centered data matrix (this is what the pre-centering is for in this case), the squared singular values divided by n_samples - 1 are the eigenvalues of the covariance matrix of the data, the columns of V are the eigenvectors of the covariance matrix, and the product of U and S is the projection of the original data onto the principal components defined by V. Thus, the singular value decomposition also makes it possible to extract the principal components, but without the need to calculate the covariance matrix. Besides being more efficient, this solution is considered more numerically stable, since it avoids explicitly forming the covariance matrix, which may be poorly conditioned when features are strongly correlated. This is the approach used in the scikit-learn implementation, with some additional details discussed below." 
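, "\n", "\n", "As a brief sketch of why the two methods agree, assume a centered data matrix $X$ with $n$ rows and its thin decomposition $X = U S V^T$, so that $S$ is square and $U^T U = I$ (the symbols $n$, $C$ and $\\lambda_i$ are notation introduced here for illustration). The covariance matrix can then be written as:\n", "\n", "$$C = \\frac{1}{n - 1} X^T X = \\frac{1}{n - 1} V S U^T U S V^T = V \\frac{S^2}{n - 1} V^T$$\n", "\n", "so the columns of $V$ are eigenvectors of $C$ with eigenvalues $\\lambda_i = s_i^2 / (n - 1)$, and the projected data is $X V = U S V^T V = U S$."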
35 | ], 36 | "metadata": { 37 | "id": "8OCN7le0v3vY" 38 | } 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "source": [ 43 | "**SVD-based PCA is constructed as follows:**\n", 44 | "- 1) the data is centered first and, if the number of components has not been specified, it is set to the minimum of the number of samples and the number of features;\n", 45 | "- 2) next, SVD is applied to the centered data matrix;\n", 46 | "- 3) the svd_flip_vector method is applied to the matrix U: it finds the element with the largest absolute value in each column of U, extracts the signs of these elements and multiplies the columns of U by them to guarantee a deterministic output;\n", 47 | "- 4) the explained variance of each principal component is calculated as the square of the corresponding singular value divided by n_samples - 1, and the transformed data is calculated for the chosen number of principal components according to the rule $X_{new} = X \\cdot V = U \\cdot S \\cdot V^T \\cdot V = U \\cdot S$." 48 | ], 49 | "metadata": { 50 | "id": "mppCbZCFw1fY" 51 | } 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "source": [ 56 | "### **Additional features of PCA**\n", 57 | "**The explained variance ratio** of each principal component, available through the attribute *explained_variance_ratio_*, indicates the proportion of the dataset's variance that lies along the axis of that principal component.\n", 58 | "\n", 59 | "**Data recovery** using the *inverse_transform()* method applies the inverse of the PCA projection, $X_{recovered} = X_{d-proj} W_d^T$, where the rows of $W_d^T$ are the first d principal components. 
It follows that the data is recovered with losses that grow with the amount of discarded variance of the original data; the mean squared distance between the original and the recovered data is called the *reconstruction error*.\n", 60 | "\n", 61 | "**Incremental PCA**, implemented as the [*IncrementalPCA*](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA) class, allows you to work more efficiently with large datasets by splitting them into mini-batches and loading them into memory piece by piece during training.\n", 62 | "\n", 63 | "**Randomized PCA**, set using the svd_solver='randomized' parameter, uses a stochastic algorithm to quickly compute an approximation of the first d principal components. It is based on the assumption that a random projection of the data onto a low-dimensional subspace preserves their structure and properties well; however, this approach may be less accurate.\n", 64 | "\n", 65 | "**Kernel PCA**, implemented using the [*KernelPCA*](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html) class, allows you to perform complex non-linear projections using kernel functions. As in the case of SVM, the idea is that a linear decision boundary in a high-dimensional feature space corresponds to a complex non-linear boundary in the original space." 
66 | ], 67 | "metadata": { 68 | "id": "y_dUBdo1yKEa" 69 | } 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "source": [ 74 | "### **Alternatives to PCA**\n", 75 | "Although PCA is one of the most popular dimensionality reduction algorithms, there are alternatives that may be preferable in a number of situations, depending on the type of data:\n", 76 | "- **LLE (Locally Linear Embedding)** is an algorithm that represents each point as a linear combination of its neighbors and then reproduces these combinations in a lower-dimensional space. This preserves the nonlinear geometry of the data and is useful for tasks where global properties are less important. On the other hand, this approach has high computational complexity and may be sensitive to noise.\n", 77 | "- **t-SNE (t-Distributed Stochastic Neighbor Embedding)** is an algorithm that converts similarities between data points into probabilities and then tries to minimize the discrepancy between the probability distributions in the high-dimensional and low-dimensional spaces. t-SNE is effective for visualizing high-dimensional data, but it can distort the global data structure because it takes into account only the proximity of points in the original space, not linear dependencies.\n", 78 | "- **UMAP (Uniform Manifold Approximation and Projection)** is another algorithm suitable for data visualization. It is based on the idea that the data lies on some uniform manifold that can be approximated using a neighbor graph. This approach takes the global data structure into account and allows for better adaptation to different types of data and better handling of noise and outliers than t-SNE.\n", 79 | "- **Autoencoders** are a type of neural network in which an encoder is trained to convert the input data into a low-dimensional representation and a decoder is trained to restore the original data from this representation. 
Autoencoders can also be used for data compression, noise removal, and many other purposes." 80 | ], 81 | "metadata": { 82 | "id": "kQQAKxouym9g" 83 | } 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "source": [ 88 | "### **Python implementation from scratch**" 89 | ], 90 | "metadata": { 91 | "id": "d0NqRbif1Ukt" 92 | } 93 | }, 94 | { 95 | "cell_type": "code", 96 | "source": [ 97 | "import numpy as np\n", 98 | "from sklearn.decomposition import PCA\n", 99 | "from sklearn.datasets import load_iris" 100 | ], 101 | "metadata": { 102 | "id": "m0OOS98L16no" 103 | }, 104 | "execution_count": 1, 105 | "outputs": [] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "source": [ 110 | "class SVDPCA:\n", 111 | "    def __init__(self, n_components=None):\n", 112 | "        self.n_components = n_components\n", 113 | "\n", 114 | "    @staticmethod\n", 115 | "    def svd_flip_vector(U):\n", 116 | "        max_abs_cols_U = np.argmax(np.abs(U), axis=0)\n", 117 | "        # extract the signs of the max absolute values\n", 118 | "        signs_U = np.sign(U[max_abs_cols_U, range(U.shape[1])])\n", 119 | "\n", 120 | "        return U * signs_U\n", 121 | "\n", 122 | "    def fit_transform(self, X):\n", 123 | "        n_samples, n_features = X.shape\n", 124 | "        X_centered = np.asarray(X - X.mean(axis=0), dtype=float)\n", 125 | "\n", 126 | "        if self.n_components is None:\n", 127 | "            self.n_components = min(n_samples, n_features)\n", 128 | "\n", 129 | "        U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)\n", 130 | "        # flip the signs of the singular vectors to enforce deterministic output\n", 131 | "        U_flipped = self.svd_flip_vector(U)\n", 132 | "\n", 133 | "        self.explained_variance = (S[:self.n_components] ** 2) / (n_samples - 1)\n", 134 | "        self.explained_variance_ratio = self.explained_variance / (np.sum(S ** 2) / (n_samples - 1))\n", 135 | "\n", 136 | "        # X_new = X * V = U * S * Vt * V = U * S\n", 137 | "        X_transformed = U_flipped[:, : self.n_components] * S[: self.n_components]\n", 138 | "\n", 139 | "        return X_transformed" 140 | ], 141 | "metadata": { 142 | "id": "DO5dvAD_16b3" 143 | }, 144 
| "execution_count": 2, 145 | "outputs": [] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "source": [ 150 | "### **Loading a dataset**" 151 | ], 152 | "metadata": { 153 | "id": "tSIQ9NJH2EGj" 154 | } 155 | }, 156 | { 157 | "cell_type": "code", 158 | "source": [ 159 | "X, y = load_iris(return_X_y=True, as_frame=True)\n", 160 | "print(X)" 161 | ], 162 | "metadata": { 163 | "colab": { 164 | "base_uri": "https://localhost:8080/" 165 | }, 166 | "id": "7m0fRrsi2FSD", 167 | "outputId": "1148a141-7959-49ef-eea3-614e2ffc8843" 168 | }, 169 | "execution_count": 3, 170 | "outputs": [ 171 | { 172 | "output_type": "stream", 173 | "name": "stdout", 174 | "text": [ 175 | " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", 176 | "0 5.1 3.5 1.4 0.2\n", 177 | "1 4.9 3.0 1.4 0.2\n", 178 | "2 4.7 3.2 1.3 0.2\n", 179 | "3 4.6 3.1 1.5 0.2\n", 180 | "4 5.0 3.6 1.4 0.2\n", 181 | ".. ... ... ... ...\n", 182 | "145 6.7 3.0 5.2 2.3\n", 183 | "146 6.3 2.5 5.0 1.9\n", 184 | "147 6.5 3.0 5.2 2.0\n", 185 | "148 6.2 3.4 5.4 2.3\n", 186 | "149 5.9 3.0 5.1 1.8\n", 187 | "\n", 188 | "[150 rows x 4 columns]\n" 189 | ] 190 | } 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "source": [ 196 | "### **Model training and evaluation of the obtained results**\n", 197 | "The manual implementation produces results identical to those of scikit-learn. As you can see, the first 2 principal components explain almost 98% of the variance in the data, which allows you to halve the number of features without much loss of information. If the number of features were not 4 but several thousand or millions, this would significantly reduce the training time of models without significant loss of accuracy (and sometimes with increased accuracy thanks to reduced multicollinearity between features), which makes PCA and its alternatives an excellent addition to other algorithms." 
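, "\n", "\n", "As a quick check of this figure, taking the first two explained variance ratios from the outputs below and rounding them to four decimals:\n", "\n", "$$0.9246 + 0.0531 = 0.9777$$\n", "\n", "so two components retain roughly 97.8% of the total variance, and the remaining two components account for about 2.2%."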
198 | ], 199 | "metadata": { 200 | "id": "73RONIPv2NUU" 201 | } 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "source": [ 206 | "**PCA**" 207 | ], 208 | "metadata": { 209 | "id": "haSyFD092g1j" 210 | } 211 | }, 212 | { 213 | "cell_type": "code", 214 | "source": [ 215 | "pca = SVDPCA()\n", 216 | "X_transformed = pca.fit_transform(X)\n", 217 | "\n", 218 | "print('transformed data', X_transformed[:10], '', sep='\\n')\n", 219 | "print('explained_variance', pca.explained_variance)\n", 220 | "print('explained_variance_ratio', pca.explained_variance_ratio)" 221 | ], 222 | "metadata": { 223 | "colab": { 224 | "base_uri": "https://localhost:8080/" 225 | }, 226 | "id": "-q44z3MS2gTK", 227 | "outputId": "a8bb811f-828f-49fa-cdc3-c8b5174139ee" 228 | }, 229 | "execution_count": 4, 230 | "outputs": [ 231 | { 232 | "output_type": "stream", 233 | "name": "stdout", 234 | "text": [ 235 | "transformed data\n", 236 | "[[-2.68412563e+00 3.19397247e-01 -2.79148276e-02 -2.26243707e-03]\n", 237 | " [-2.71414169e+00 -1.77001225e-01 -2.10464272e-01 -9.90265503e-02]\n", 238 | " [-2.88899057e+00 -1.44949426e-01 1.79002563e-02 -1.99683897e-02]\n", 239 | " [-2.74534286e+00 -3.18298979e-01 3.15593736e-02 7.55758166e-02]\n", 240 | " [-2.72871654e+00 3.26754513e-01 9.00792406e-02 6.12585926e-02]\n", 241 | " [-2.28085963e+00 7.41330449e-01 1.68677658e-01 2.42008576e-02]\n", 242 | " [-2.82053775e+00 -8.94613845e-02 2.57892158e-01 4.81431065e-02]\n", 243 | " [-2.62614497e+00 1.63384960e-01 -2.18793179e-02 4.52978706e-02]\n", 244 | " [-2.88638273e+00 -5.78311754e-01 2.07595703e-02 2.67447358e-02]\n", 245 | " [-2.67275580e+00 -1.13774246e-01 -1.97632725e-01 5.62954013e-02]]\n", 246 | "\n", 247 | "explained_variance [4.22824171 0.24267075 0.0782095 0.02383509]\n", 248 | "explained_variance_ratio [0.92461872 0.05306648 0.01710261 0.00521218]\n" 249 | ] 250 | } 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "source": [ 256 | "**PCA (scikit-learn)**" 257 | ], 258 | "metadata": { 259 | 
"id": "KPNSTfkk2noR" 260 | } 261 | }, 262 | { 263 | "cell_type": "code", 264 | "source": [ 265 | "sk_pca = PCA()\n", 266 | "sk_X_transformed = sk_pca.fit_transform(X)\n", 267 | "\n", 268 | "print('sk transformed data', sk_X_transformed[:10], '', sep='\\n')\n", 269 | "print('sk explained_variance', sk_pca.explained_variance_)\n", 270 | "print('sk explained_variance_ratio_', sk_pca.explained_variance_ratio_)" 271 | ], 272 | "metadata": { 273 | "colab": { 274 | "base_uri": "https://localhost:8080/" 275 | }, 276 | "id": "avObFcAI2oZh", 277 | "outputId": "c47941f6-231d-4969-a521-db26d264f8da" 278 | }, 279 | "execution_count": 5, 280 | "outputs": [ 281 | { 282 | "output_type": "stream", 283 | "name": "stdout", 284 | "text": [ 285 | "sk transformed data\n", 286 | "[[-2.68412563e+00 3.19397247e-01 -2.79148276e-02 -2.26243707e-03]\n", 287 | " [-2.71414169e+00 -1.77001225e-01 -2.10464272e-01 -9.90265503e-02]\n", 288 | " [-2.88899057e+00 -1.44949426e-01 1.79002563e-02 -1.99683897e-02]\n", 289 | " [-2.74534286e+00 -3.18298979e-01 3.15593736e-02 7.55758166e-02]\n", 290 | " [-2.72871654e+00 3.26754513e-01 9.00792406e-02 6.12585926e-02]\n", 291 | " [-2.28085963e+00 7.41330449e-01 1.68677658e-01 2.42008576e-02]\n", 292 | " [-2.82053775e+00 -8.94613845e-02 2.57892158e-01 4.81431065e-02]\n", 293 | " [-2.62614497e+00 1.63384960e-01 -2.18793179e-02 4.52978706e-02]\n", 294 | " [-2.88638273e+00 -5.78311754e-01 2.07595703e-02 2.67447358e-02]\n", 295 | " [-2.67275580e+00 -1.13774246e-01 -1.97632725e-01 5.62954013e-02]]\n", 296 | "\n", 297 | "sk explained_variance [4.22824171 0.24267075 0.0782095 0.02383509]\n", 298 | "sk explained_variance_ratio_ [0.92461872 0.05306648 0.01710261 0.00521218]\n" 299 | ] 300 | } 301 | ] 302 | }, 303 | { 304 | "cell_type": "markdown", 305 | "source": [ 306 | "### **Pros and cons**\n", 307 | "Pros:\n", 308 | "- reducing the dimension while preserving a large amount of information which also allows you to visualize high-dimensional data in two-dimensional or 
three-dimensional space;\n", 309 | "- not only allows you to significantly speed up training, but can also reduce the overfitting of models in some cases;\n", 310 | "- can be used to compress data.\n", 311 | "\n", 312 | "Cons:\n", 313 | "- the inevitable loss of some information in the data;\n", 314 | "- captures only linear dependencies in the data (in classic PCA);\n", 315 | "- the principal components lack semantic meaning because they are difficult to link with the real features." 316 | ], 317 | "metadata": { 318 | "id": "sXRg_DP12v-q" 319 | } 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "source": [ 324 | "### **Additional sources**\n", 325 | "Papers:\n", 326 | "- «A Tutorial on Principal Component Analysis», Jonathon Shlens;\n", 327 | "- «Locally Linear Embedding and its Variants: Tutorial and Survey», Benyamin Ghojogh, Ali Ghodsi, Fakhri Karray, Mark Crowley;\n", 328 | "- «Theoretical Foundations of t-SNE for Visualizing High-Dimensional Clustered Data», T. Tony Cai, Rong Ma;\n", 329 | "- «UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction», Leland McInnes, John Healy, James Melville;\n", 330 | "- «Deep Autoencoders for Dimensionality Reduction of High-Content Screening Data», Lee Zamparo, Zhaolei Zhang.\n", 331 | "\n", 332 | "Documentation:\n", 333 | "- [PCA description](https://scikit-learn.org/stable/modules/decomposition.html#pca);\n", 334 | "- [LLE description](https://scikit-learn.org/stable/modules/manifold.html#locally-linear-embedding);\n", 335 | "- [t-SNE description](https://scikit-learn.org/stable/modules/manifold.html#t-sne);\n", 336 | "- [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html);\n", 337 | "- [LLE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.LocallyLinearEmbedding.html);\n", 338 | "- [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html);\n", 339 | "- 
[UMAP](https://umap-learn.readthedocs.io/en/latest/index.html).\n", 340 | "\n", 341 | "Videos:\n", 342 | "- PCA: [one](https://www.youtube.com/watch?v=FgakZw6K1QQ), [two](https://www.youtube.com/watch?v=fkf4IBRSeEc), [three](https://www.youtube.com/watch?v=IwPzjlBXBlA), [four](https://www.youtube.com/watch?v=WW3ZJHPwvyg);\n", 343 | "- [LLE](https://www.youtube.com/watch?v=B6kzA1W_4pU);\n", 344 | "- [t-SNE](https://www.youtube.com/watch?v=NEaUSP4YerM);\n", 345 | "- [UMAP](https://www.youtube.com/watch?v=eN0wFzBA4Sc);\n", 346 | "- [autoencoders](https://www.youtube.com/watch?v=FhmpO73ythg)." 347 | ], 348 | "metadata": { 349 | "id": "MlLNPn1S34tc" 350 | } 351 | } 352 | ] 353 | } -------------------------------------------------------------------------------- /Notebooks [eng]/empty.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Notebooks [rus]/ML-алгоритмы с нуля/04) Наивный байесовский классификатор.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | } 15 | }, 16 | "cells": [ 17 | { 18 | "cell_type": "markdown", 19 | "source": [ 20 | "## **Naive Bayes classifier**\n", 21 | "The Naive Bayes classifier is a probabilistic classifier based on Bayes' formula with a strong (naive) assumption that the features are independent of one another given the class. This greatly simplifies classification, because one-dimensional probability densities are estimated instead of a single multidimensional one.\n", 22 | "\n", 23 | "Here, a one-dimensional probability density is an estimate of the probability of each feature separately, which is sufficient when the features are independent, whereas a multidimensional density is an estimate of the probability of the combination of all features, which is required when they are dependent. This is why the classifier is called naive: the independence assumption greatly simplifies the computations and makes the algorithm more efficient. In practice, however, the assumption does not always hold, and in some cases it can significantly degrade prediction quality.\n", 24 | "\n", 25 | "Bayes' formula itself looks as follows:\n", 26 | "\n", 27 | "$$P(A|B) = \\frac{P(B|A) P(A)}{P(B)}$$\n", 28 | "\n", 29 | "where:\n", 30 | "- $P(A|B)$ is the posterior probability of event A given that event B has occurred;\n", 31 | "\n", 32 | "- $P(B|A)$ is the conditional probability of event B given that event A has occurred;\n", 33 | "\n", 34 | "- $P(A)$ and $P(B)$ are the prior probabilities of events A and B, respectively.\n", 35 | "\n", 36 | "In the context of machine learning, Bayes' formula takes the following form:\n", 37 | "\n", 38 | "$$P(y_k|X) = \\frac{P(y_k)P(X|y_k)}{P(X)}$$\n", 39 | "\n", 40 | "where:\n", 41 | "- $P(y_k|X)$ is the posterior probability that the sample belongs to class $y_k$ given its features $X$;\n", 42 | "- $P(X|y_k)$ is the likelihood, i.e. the probability of the features $X$ given class $y_k$;\n", 43 | "- $P(y_k)$ is the prior probability that a randomly chosen observation belongs to class $y_k$;\n", 44 | "- $P(X)$ is the prior probability of the features $X$.\n", 45 | "\n", 46 | "If an object is described by several features $X_1, X_2, ..., X_n$ rather than one, then under the independence assumption the formula becomes:\n", 47 | "\n", 48 | "$$P(y_k|X_1, X_2, ..., X_n) = \\frac{P(y_k)\\prod_{i=1}^n P(X_i|y_k)}{P(X_1, X_2, ..., X_n)}$$\n", 49 | "\n", 50 | "In practice, the numerator of this formula is of most interest, because the denominator depends only on the features and not on the class, so it is often omitted when comparing the probabilities of different classes. 
As a result, the classification rule reduces to choosing the class with the maximum posterior probability:\n", 51 | "\n", 52 | "$$\\hat{y} = \\arg\\max_{y_k} P(y_k)\\prod_{i=1}^n P(X_i|y_k)$$\n", 53 | "\n", 54 | "The model parameters, i.e. the probabilities $P(y_k)$ and $P(X_i|y_k)$, are usually estimated by maximum likelihood, which in this case is based on the frequencies of classes and features in the training set.\n", 55 | "\n", 56 | "### **Types of Naive Bayes**\n", 57 | "The scikit-learn library provides several implementations of the Naive Bayes classifier that differ in their assumptions about the distribution of the features given the class:\n", 58 | "\n", 59 | "- **Gaussian Naive Bayes (GaussianNB)**: a variant for continuous features that follow a normal (Gaussian) distribution. The probability of a feature given the class is computed as: $$P(x_i|y) = \\frac{1}{\\sqrt{2\\pi\\sigma_y^2}}\\exp\\left(-\\frac{(x_i-\\mu_y)^2}{2\\sigma_y^2}\\right)$$ where $\\mu_y$ and $\\sigma_y$ are the mean and standard deviation of the feature in class $y$. These parameters are estimated from the training data by maximum likelihood.\n", 60 | "- **Multinomial Naive Bayes (MultinomialNB)**: a variant for discrete features that follow a multinomial distribution. Such features are common in text classification, where they represent the number of occurrences of each word in a text. 
The probability of a feature given the class is computed as: $$P(x_i|y) = \\frac{N_{yi} + \\alpha}{N_y + \\alpha n}$$ where $N_{yi}$ is the number of times feature $i$ occurs in class $y$; $N_y$ is the total count of all features in class $y$; $n$ is the number of distinct features; and $\\alpha$ is a smoothing parameter that prevents zero probabilities.\n", 61 | "- **Complement Naive Bayes (ComplementNB)**: an improved variant of *MultinomialNB* suited to imbalanced datasets. Instead of estimating the probability of a feature given the class, the algorithm estimates a normalized feature weight $w_{ci}$ for class $c$ from the probability of the feature in the complement of the class, i.e. in all other classes. The algorithm therefore accounts not only for how often features occur in a class but also for their absence in the other classes, which makes it less sensitive to class imbalance. The weights are computed as follows: \\begin{align}\\begin{aligned}\\hat{\\theta}_{ci} = \\frac{\\alpha_i + \\sum_{j:y_j \\neq c} d_{ij}}\n", 62 | " {\\alpha + \\sum_{j:y_j \\neq c} \\sum_{k} d_{kj}}\\\\w_{ci} = \\log \\hat{\\theta}_{ci}\\\\w_{ci} = \\frac{w_{ci}}{\\sum_{j} |w_{cj}|}\\end{aligned}\\end{align}\n", 63 | "where $\\hat{\\theta}_{ci}$ is the estimated probability of feature $i$ in the complement of class $c$, computed from the smoothing parameter $\\alpha_i$ (with $\\alpha = \\sum_i \\alpha_i$) and the counts of feature $i$ over all samples that do not belong to class $c$ (here $d_{ij}$ is the number of times feature $i$ occurs in sample $j$); $w_{ci}$ is the normalized weight of feature $i$ for class $c$. The predicted class $\\hat c$ for a given feature vector $t$ is: $$\\hat{c} = \\arg\\min_c \\sum_{i} t_i w_{ci}$$\n", 64 | "- **Bernoulli Naive Bayes (BernoulliNB)**: another variant for discrete features, in this case features that follow a Bernoulli distribution. 
Here the features are binary indicators of the presence or absence of certain properties in an object. In text classification, for example, this can be the presence or absence of particular words in a text. The probability of a feature given the class is computed as: $$P(x_i|y) = P(x_i = 1|y)x_i + (1-P(x_i = 1|y))(1-x_i)$$ where $P(x_i = 1|y)$ is the probability that feature $i$ takes the value 1 (true) given that the object belongs to class $y$; $x_i$ is the value of feature $i$ (0 or 1).\n", 65 | "- **Categorical Naive Bayes (CategoricalNB)**: a variant for categorically distributed data, based on the assumption that each feature, identified by its index, has its own categorical distribution. The probability of a feature given the class is computed as: $$P(x_i = t \\mid y = c \\: ;\\, \\alpha) = \\frac{ N_{tic} + \\alpha}{N_{c} + \\alpha n_i}$$ where $N_{tic} = |\\{j \\in J \\mid x_{ij} = t, y_j = c\\}|$ is the number of times feature $x_i$ takes the value $t$ in samples of class $c$; $N_{c} = |\\{ j \\in J\\mid y_j = c\\}|$ is the number of samples of class $c$ in the training data; $\\alpha$ is the smoothing parameter; $n_i$ is the number of available categories of feature $i$.\n", 66 | "\n", 67 | "### **How the Gaussian Naive Bayes classifier works**\n", 68 | "The algorithm is built as follows:\n", 69 | "- 1) first, the prior probabilities of the classes are computed;\n", 70 | "- 2) then the means and standard deviations of the features are computed for each class;\n", 71 | "- 3) using these means and standard deviations, the Gaussian probability density of each test feature is computed for each class;\n", 72 | "- 4) next, the posterior probabilities (up to a normalizing constant) are computed as the product of the class priors and the probability densities of the test features;\n", 73 | "- 5) for each sample, the class with the maximum posterior probability is the final prediction." 
74 | ], 75 | "metadata": { 76 | "id": "m9EN0b8o3R35" 77 | } 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "source": [ 82 | "### **Naive Bayes for spam filtering**\n", 83 | "In spam filtering, the Naive Bayes classifier relies on how often words occur in spam and non-spam messages and picks the class that maximizes the product of the word probabilities. The naivety here lies in assuming that the words in a message are independent of their order and context. Bayes' formula then takes the following form:\n", 84 | "\n", 85 | "$$P(C|M) \\propto P(C) \\prod_{i=1}^n P(w_i|C), \\ \\ w_i \\in M$$\n", 86 | "\n", 87 | "where:\n", 88 | "- $C$ is the class, spam or not spam;\n", 89 | "- $M$ is the message;\n", 90 | "- $w_i$ is the i-th word in message $M$;\n", 91 | "- $\\propto$ is the proportionality sign.\n", 92 | "\n", 93 | "For a better understanding, consider the following example. Suppose we want to classify the message **\"Hi, you won a discount and you can get the prize this evening.\"** and we have a training set consisting of the following messages:\n", 94 | "\n", 95 | "| Message | Class |\n", 96 | "| --- | --- |\n", 97 | "| Hi, how are you? | Not spam |\n", 98 | "| Congratulations, you won a prize! | Spam |\n", 99 | "| Buy the product now and get a discount! | Spam |\n", 100 | "| Let's walk this evening. | Not spam |\n", 101 | "\n", 102 | "First, count how often each unique word occurs, and the total number of words, in spam and non-spam messages. Then compute the probability of encountering each word in spam and non-spam messages from these frequencies. When a message contains words that did not occur in the training set (or in one of the classes), smoothing is used. There are many kinds of smoothing, but the simplest one adds $1$ to every word count. This avoids the zero-probability problem. 
The table below shows the probabilities computed for all words.\n", 103 | "\n", 104 | "| Word | Frequency in Not Spam | Frequency in Spam | Probability in Not Spam | Probability in Spam |\n", 105 | "| --- | --- | --- | --- | --- |\n", 106 | "| Hi | 1 + $1$ = 2 | 0 + $1$ = 1 | 2 / 28 = 0.0714 | 1 / 33 = 0.03 |\n", 107 | "| how | 1 + $1$ = 2 | 0 + $1$ = 1 | 2 / 28 = 0.0714 | 1 / 33 = 0.03 |\n", 108 | "| are | 1 + $1$ = 2 | 0 + $1$ = 1 | 2 / 28 = 0.0714 | 1 / 33 = 0.03 |\n", 109 | "| you | 1 + $1$ = 2 | 1 + $1$ = 2 | 2 / 28 = 0.0714 | 2 / 33 = 0.06 |\n", 110 | "| Congratulations | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 111 | "| won | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 112 | "| a | 0 + $1$ = 1 | 2 + $1$ = 3 | 1 / 28 = 0.0357 | 3 / 33 = 0.09 |\n", 113 | "| prize | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 114 | "| Buy | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 115 | "| the | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 116 | "| product | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 117 | "| now | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 118 | "| and | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 119 | "| get | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 120 | "| discount | 0 + $1$ = 1 | 1 + $1$ = 2 | 1 / 28 = 0.0357 | 2 / 33 = 0.06 |\n", 121 | "| Let's | 1 + $1$ = 2 | 0 + $1$ = 1 | 2 / 28 = 0.0714 | 1 / 33 = 0.03 |\n", 122 | "| walk | 1 + $1$ = 2 | 0 + $1$ = 1 | 2 / 28 = 0.0714 | 1 / 33 = 0.03 |\n", 123 | "| this | 1 + $1$ = 2 | 0 + $1$ = 1 | 2 / 28 = 0.0714 | 1 / 33 = 0.03 |\n", 124 | "| evening | 1 + $1$ = 2 | 0 + $1$ = 1 | 2 / 28 = 0.0714 | 1 / 33 = 0.03 |\n", 125 | "| can | 0 + $1$ = 1 | 0 + $1$ = 1 | 1 / 28 = 0.0357 | 1 / 33 = 0.03 |\n", 126 | "| **Total number of words** | **28** | **33** | | |\n", 127 | "\n", 128 | "Finally, the probabilities of the message being spam or not spam are computed, and the class with the higher probability is the final prediction.\n", 129 | "\n", 130 | "$P(C|M) \\propto P(C) \\cdot P('Hi'|C) \\cdot P('you'|C) \\cdot P('won'|C) \\cdot P('a'|C)\n", 131 | "\\cdot P('discount'|C) \\cdot P('and'|C) \\cdot P('you'|C) \\cdot P('can'|C) \\cdot P('get'|C)\n", 132 | "\\cdot P('the'|C) \\cdot P('prize'|C) \\cdot P('this'|C) \\cdot P('evening'|C)$\n", 133 | "\n", 134 | "where:\n", 135 | "- $C \\in \\{Spam, \\ \\ Not \\ \\ Spam\\}$;\n", 136 | "- $P(Spam) = P(Not \\ \\ Spam) = \\frac{2}{4} = 0.5$\n", 137 | "\n" 138 | ], 139 | "metadata": { 140 | "id": "9Yn2qe4-J54P" 141 | } 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "source": [ 146 | "$P(Spam|M) \\propto 0.5 \\cdot 0.03 \\cdot 0.06 \\cdot 0.06 \\cdot 0.09 \\cdot 0.06 \\cdot 0.06 \\cdot 0.06 \\cdot 0.03 \\cdot 0.06 \\cdot 0.06 \\cdot 0.06 \\cdot 0.03 \\cdot 0.03 \\approx 6.12 \\cdot 10^{-18}$\n", 147 | "\n", 148 | "$P(Not \\ \\ Spam|M) \\propto 0.5 \\cdot 0.0714 \\cdot 0.0714 \\cdot 0.0357 \\cdot 0.0357 \\cdot 0.0357 \\cdot 0.0357 \\cdot 0.0714 \\cdot 0.0357 \\cdot 0.0357 \\cdot 0.0357 \\cdot 0.0357 \\cdot 0.0714 \\cdot 0.0714 \\approx 2.45 \\cdot 10^{-18}$\n", 149 | "\n", 150 | "$P(Spam|M) > P(Not \\ \\ Spam|M) \\rightarrow$ **the message is spam**\n", 151 | "\n", 152 | "It is worth adding that in practice the logarithm of the probability is usually used instead of the probability itself: the product turns into a sum, which is more convenient and avoids numerical underflow." 
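
The worked spam example above can be reproduced with a short standard-library sketch. It is an illustration only: the tokenizer (lowercasing and a simple regular expression), the label strings and all variable names are assumptions made for this sketch, and the vocabulary is taken as the union of the training words and the words of the classified message, as in the table. Scores are accumulated as sums of logarithms, and exact fractions are used, so the magnitudes differ slightly from the rounded figures in the text while the ranking of the classes stays the same.

```python
import math
import re
from collections import Counter

train = [
    ("Hi, how are you?", "not_spam"),
    ("Congratulations, you won a prize!", "spam"),
    ("Buy the product now and get a discount!", "spam"),
    ("Let's walk this evening.", "not_spam"),
]
message = "Hi, you won a discount and you can get the prize this evening."


def tokenize(text):
    # lowercase and keep runs of letters/apostrophes (hypothetical tokenizer)
    return re.findall(r"[a-z']+", text.lower())


# word frequencies per class and number of messages per class
word_counts = {"spam": Counter(), "not_spam": Counter()}
class_counts = Counter()
for text, label in train:
    word_counts[label].update(tokenize(text))
    class_counts[label] += 1

msg_words = tokenize(message)

# vocabulary: every training word plus the words of the message itself
vocab = set(msg_words)
for counts in word_counts.values():
    vocab.update(counts)


def log_score(label):
    # denominator with add-one smoothing: words in the class + vocabulary size
    total = sum(word_counts[label].values()) + len(vocab)
    score = math.log(class_counts[label] / len(train))  # log prior
    for word in msg_words:
        score += math.log((word_counts[label][word] + 1) / total)
    return score


scores = {label: log_score(label) for label in word_counts}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Working in log space keeps the scores in a comfortable numeric range even for long messages, where the raw product of probabilities would quickly approach zero.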
153 | ], 154 | "metadata": { 155 | "id": "I--z9SYmUlBL" 156 | } 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "source": [ 161 | "### **Implementation in Python from scratch**" 162 | ], 163 | "metadata": { 164 | "id": "birEwBNZ3Nqw" 165 | } 166 | }, 167 | { 168 | "cell_type": "code", 169 | "source": [ 170 | "import numpy as np\n", 171 | "import pandas as pd\n", 172 | "import matplotlib.pyplot as plt\n", 173 | "from sklearn.datasets import load_iris\n", 174 | "from sklearn.naive_bayes import GaussianNB\n", 175 | "from sklearn.metrics import accuracy_score\n", 176 | "from sklearn.model_selection import train_test_split\n", 177 | "from mlxtend.plotting import plot_decision_regions" 178 | ], 179 | "metadata": { 180 | "id": "CPkMdMHLCDbQ" 181 | }, 182 | "execution_count": 7, 183 | "outputs": [] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "source": [ 188 | "class GaussianNaiveBayes:\n", 189 | "    def fit(self, X, y):\n", 190 | "        classes, cls_counts = np.unique(y, return_counts=True)\n", 191 | "        self.classes = classes  # keep the labels so predict can return them\n", 192 | "        self.priors = cls_counts / len(y)\n", 193 | "\n", 194 | "        # calculate the means and standard deviations of features by classes\n", 195 | "        self.X_cls_mean = np.array([np.mean(X[y == c], axis=0) for c in classes])\n", 196 | "        self.X_stds = np.array([np.std(X[y == c], axis=0) for c in classes])\n", 197 | "\n", 198 | "    # calculate the probability density of the feature according to the Gaussian distribution\n", 199 | "    def pdf(self, x, mean, std):\n", 200 | "        return (1 / (np.sqrt(2 * np.pi) * std)) * np.exp(-0.5 * ((x - mean) / std) ** 2)\n", 201 | "\n", 202 | "    def predict(self, X):\n", 203 | "        pdfs = np.array([self.pdf(x, self.X_cls_mean, self.X_stds) for x in X])\n", 204 | "        posteriors = self.priors * np.prod(pdfs, axis=2)  # Bayes' formula without the denominator\n", 205 | "\n", 206 | "        return self.classes[np.argmax(posteriors, axis=1)]" 207 | ], 208 | "metadata": { 209 | "id": "JkWEEsgTt67e" 210 | }, 211 | "execution_count": 8, 212 | "outputs": [] 213 | }, 
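
The `predict` method of the class above multiplies raw densities, which can underflow when there are many features. A minimal sketch of a log-space variant follows, assuming NumPy only. The function name `gaussian_nb_log_predict`, its arguments and the synthetic two-class data are hypothetical and introduced purely for illustration; they are not part of the notebook.

```python
import numpy as np


def gaussian_nb_log_predict(X, classes, priors, means, stds):
    """Predict labels by summing Gaussian log-densities instead of multiplying densities."""
    X = np.asarray(X, dtype=float)
    # broadcast to shape (n_samples, n_classes, n_features)
    z = (X[:, None, :] - means[None, :, :]) / stds[None, :, :]
    log_pdf = -0.5 * z ** 2 - np.log(stds[None, :, :]) - 0.5 * np.log(2 * np.pi)
    # log prior + sum of per-feature log-likelihoods (unnormalized log posterior)
    log_posterior = np.log(priors)[None, :] + log_pdf.sum(axis=2)
    return classes[np.argmax(log_posterior, axis=1)]


# synthetic, well-separated two-class data (illustrative only)
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=5.0, scale=1.0, size=(50, 2)),
])
y_toy = np.array([0] * 50 + [1] * 50)

# the same parameter estimates as in the fit step: priors, means, standard deviations
classes, counts = np.unique(y_toy, return_counts=True)
priors = counts / len(y_toy)
means = np.array([X_toy[y_toy == c].mean(axis=0) for c in classes])
stds = np.array([X_toy[y_toy == c].std(axis=0) for c in classes])

pred = gaussian_nb_log_predict(X_toy, classes, priors, means, stds)
print((pred == y_toy).mean())
```

Because the logarithm is monotonic, the argmax over log-posteriors selects the same class as the argmax over the products, as long as no standard deviation is zero.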
214 | { 215 | "cell_type": "code", 216 | "source": [ 217 | "def decision_boundary_plot(X, y, X_train, y_train, clf, feature_indexes, title=None):\n", 218 | "    feature1_name, feature2_name = X.columns[feature_indexes]\n", 219 | "    X_feature_columns = X.values[:, feature_indexes]\n", 220 | "    X_train_feature_columns = X_train[:, feature_indexes]\n", 221 | "    clf.fit(X_train_feature_columns, y_train)\n", 222 | "\n", 223 | "    plot_decision_regions(X=X_feature_columns, y=y.values, clf=clf)\n", 224 | "    plt.xlabel(feature1_name)\n", 225 | "    plt.ylabel(feature2_name)\n", 226 | "    plt.title(title)" 227 | ], 228 | "metadata": { 229 | "id": "Yx0jFHbCEvLz" 230 | }, 231 | "execution_count": 9, 232 | "outputs": [] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "source": [ 237 | "### **Loading the dataset**" 238 | ], 239 | "metadata": { 240 | "id": "4TBM0I-R_4XI" 241 | } 242 | }, 243 | { 244 | "cell_type": "code", 245 | "source": [ 246 | "X1, y1 = load_iris(return_X_y=True, as_frame=True)\n", 247 | "X1_train, X1_test, y1_train, y1_test = train_test_split(X1.values, y1.values, random_state=0)\n", 248 | "print(X1, y1, sep='\\n')" 249 | ], 250 | "metadata": { 251 | "colab": { 252 | "base_uri": "https://localhost:8080/" 253 | }, 254 | "id": "fRTiyfWHBS8r", 255 | "outputId": "e36bc59c-7936-4351-a39a-63a1b6533a2d" 256 | }, 257 | "execution_count": 10, 258 | "outputs": [ 259 | { 260 | "output_type": "stream", 261 | "name": "stdout", 262 | "text": [ 263 | " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", 264 | "0 5.1 3.5 1.4 0.2\n", 265 | "1 4.9 3.0 1.4 0.2\n", 266 | "2 4.7 3.2 1.3 0.2\n", 267 | "3 4.6 3.1 1.5 0.2\n", 268 | "4 5.0 3.6 1.4 0.2\n", 269 | ".. ... ... ... 
...\n", 270 | "145 6.7 3.0 5.2 2.3\n", 271 | "146 6.3 2.5 5.0 1.9\n", 272 | "147 6.5 3.0 5.2 2.0\n", 273 | "148 6.2 3.4 5.4 2.3\n", 274 | "149 5.9 3.0 5.1 1.8\n", 275 | "\n", 276 | "[150 rows x 4 columns]\n", 277 | "0 0\n", 278 | "1 0\n", 279 | "2 0\n", 280 | "3 0\n", 281 | "4 0\n", 282 | " ..\n", 283 | "145 2\n", 284 | "146 2\n", 285 | "147 2\n", 286 | "148 2\n", 287 | "149 2\n", 288 | "Name: target, Length: 150, dtype: int64\n" 289 | ] 290 | } 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "source": [ 296 | "### **Training the models and evaluating the results**\n", 297 | "Despite its simplicity, the algorithm performed very well here, classifying every test sample correctly, which is possible thanks to a flexible decision boundary that generalizes well. An interesting takeaway is that in some situations simpler models can work much better than complex ones, as will be seen later with other algorithms." 
298 | ], 299 | "metadata": { 300 | "id": "rG4hElWeCRj4" 301 | } 302 | }, 303 | { 304 | "cell_type": "markdown", 305 | "source": [ 306 | "**Naive Bayes**" 307 | ], 308 | "metadata": { 309 | "id": "rnN1p3zuCm1D" 310 | } 311 | }, 312 | { 313 | "cell_type": "code", 314 | "source": [ 315 | "nb_clf = GaussianNaiveBayes()\n", 316 | "nb_clf.fit(X1_train, y1_train)\n", 317 | "nb_clf_pred_res = nb_clf.predict(X1_test)\n", 318 | "nb_clf_accuracy = accuracy_score(y1_test, nb_clf_pred_res)\n", 319 | "\n", 320 | "print(f'Naive Bayes classifier accuracy: {nb_clf_accuracy}')\n", 321 | "print(nb_clf_pred_res)" 322 | ], 323 | "metadata": { 324 | "colab": { 325 | "base_uri": "https://localhost:8080/" 326 | }, 327 | "id": "U60Olss-Cfm3", 328 | "outputId": "43f8e955-9e8e-4876-b82e-ba5941028042" 329 | }, 330 | "execution_count": 11, 331 | "outputs": [ 332 | { 333 | "output_type": "stream", 334 | "name": "stdout", 335 | "text": [ 336 | "Naive Bayes classifier accuracy: 1.0\n", 337 | "[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0\n", 338 | " 1]\n" 339 | ] 340 | } 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "source": [ 346 | "**Naive Bayes (scikit-learn)**" 347 | ], 348 | "metadata": { 349 | "id": "1SPatZBoELDl" 350 | } 351 | }, 352 | { 353 | "cell_type": "code", 354 | "source": [ 355 | "sk_nb_clf = GaussianNB()\n", 356 | "sk_nb_clf.fit(X1_train, y1_train)\n", 357 | "sk_nb_clf_pred_res = sk_nb_clf.predict(X1_test)\n", 358 | "sk_nb_clf_accuracy = accuracy_score(y1_test, sk_nb_clf_pred_res)\n", 359 | "\n", 360 | "print(f'sk Naive Bayes classifier accuracy: {sk_nb_clf_accuracy}')\n", 361 | "print(sk_nb_clf_pred_res)\n", 362 | "\n", 363 | "feature_indexes = [2, 3]\n", 364 | "title1 = 'GaussianNB surface'\n", 365 | "decision_boundary_plot(X1, y1, X1_train, y1_train, sk_nb_clf, feature_indexes, title1)" 366 | ], 367 | "metadata": { 368 | "colab": { 369 | "base_uri": "https://localhost:8080/", 370 | "height": 524 371 | }, 372 | "id": "0bdNaO_FoYOW", 
373 | "outputId": "bcc84fb8-b6f4-4232-fadf-748c2505035c" 374 | }, 375 | "execution_count": 12, 376 | "outputs": [ 377 | { 378 | "output_type": "stream", 379 | "name": "stdout", 380 | "text": [ 381 | "sk Naive Bayes classifier accuracy: 1.0\n", 382 | "[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0\n", 383 | " 1]\n" 384 | ] 385 | }, 386 | { 387 | "output_type": "display_data", 388 | "data": { 389 | "text/plain": [ 390 | "
" 391 | ], 392 | "image/png": "\n" 393 | }, 394 | "metadata": {} 395 | } 396 | ] 397 | }, 398 | { 399 | "cell_type": "markdown", 400 | "source": [ 401 | "### **Pros and cons of the Naive Bayes classifier**\n", 402 | "Pros:\n", 403 | "- simple to implement and interpret;\n", 404 | "- requires almost no parameter tuning;\n", 405 | "- fast, and accurate in many situations;\n", 406 | "- relatively robust to noise and outliers, since it is based on probability distributions and the naive assumption of feature independence.\n", 407 | "\n", 408 | "Cons:\n", 409 | "- if the feature independence assumption is violated, prediction accuracy can drop significantly;\n", 410 | "- can favor classes with more samples when the data is imbalanced.\n" 411 | ], 412 | "metadata": { 413 | "id": "XxWeyVZhk5og" 414 | } 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "source": [ 419 | "### **Additional sources**\n", 420 | "Paper: «Bayes and Naive-Bayes Classifier», Rajiv Gandhi, Andhra Pradesh.\n", 421 | "\n", 422 | "Documentation:\n", 423 | "- [Naive Bayes description](https://scikit-learn.org/stable/modules/naive_bayes.html);\n", 424 | "- [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB);\n", 425 | "- [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB);\n", 426 | "- [ComplementNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.ComplementNB.html#sklearn.naive_bayes.ComplementNB);\n", 427 | "- [BernoulliNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB);\n", 428 | "- 
[CategoricalNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html#sklearn.naive_bayes.CategoricalNB).\n", 429 | "\n", 430 | "Videos: [one](https://www.youtube.com/watch?v=O2L2Uv9pdDA), [two](https://www.youtube.com/watch?v=H3EjCKtlVog), [three](https://www.youtube.com/watch?v=nt63k3bfXS0), [four](https://www.youtube.com/watch?v=ADj95edZc0w).\n", 431 | "\n" 432 | ], 433 | "metadata": { 434 | "id": "mtO8wzUrl5JN" 435 | } 436 | } 437 | ] 438 | } -------------------------------------------------------------------------------- /Notebooks [rus]/ML-алгоритмы с нуля/12) Метод главных компонент (PCA).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | } 15 | }, 16 | "cells": [ 17 | { 18 | "cell_type": "markdown", 19 | "source": [ 20 | "## **Principal Component Analysis**\n", 21 | "Principal Component Analysis (PCA) is an unsupervised learning algorithm used for dimensionality reduction and for identifying the most informative features in the data. It assumes linear relationships in the data and projects the data onto a subspace of orthogonal vectors along which the variance is maximal. These vectors are called principal components, and they define the directions of greatest variability (informativeness) in the data. Alternatively, PCA can be defined as the linear projection that minimizes the mean squared distance between the original points and their projections." 
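
The two views of PCA given in its definition (maximum retained variance and minimum mean squared distance to the projections) can be checked numerically. The following is a minimal sketch, assuming scikit-learn and NumPy are installed and using the Iris dataset; the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 150 samples, 4 features

pca = PCA(n_components=2)
X_proj = pca.fit_transform(X)             # coordinates in the 2-D principal subspace
X_back = pca.inverse_transform(X_proj)    # projections mapped back to the original space

# share of the total variance kept by the first two components
retained = pca.explained_variance_ratio_.sum()

# mean squared distance between the original points and their projections
mean_sq_dist = np.mean(np.sum((X - X_back) ** 2, axis=1))

# variance carried by the discarded components, rescaled from the
# (n - 1) denominator used by explained_variance_ to a plain mean over n
full = PCA().fit(X)
discarded = full.explained_variance_[2:].sum() * (len(X) - 1) / len(X)

print(retained, mean_sq_dist, discarded)
```

The mean squared distance matches the variance of the discarded components, which is why maximizing the retained variance and minimizing the projection error lead to the same subspace.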
22 | ], 23 | "metadata": { 24 | "id": "GEUI9KeXiFd1" 25 | } 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "source": [ 30 | "### **How PCA works**\n", 31 | "The feature matrix is always centered first, so that the first principal component corresponds to the direction of maximum variation in the data rather than simply to the data mean. Finding the principal components usually comes down to one of two main methods:\n", 32 | " - **Computing the eigenvectors and eigenvalues of the data covariance matrix**. Since the covariance matrix reflects the degree of linear relationship between the variables, its eigenvectors give the directions along which the variance of the data is maximal, and its eigenvalues give the magnitude of that variance. The eigenvalue associated with an eigenvector characterizes that vector's contribution to the explained variance: the larger the eigenvalue, the more significant the principal component. Usually only the principal components that together explain a given share of the variance, for example 95%, are kept.\n", 33 | "\n", 34 | " - **Computing the singular value decomposition (SVD) of the data matrix**. SVD represents any matrix as a product of three matrices: the matrix of left singular vectors U, the diagonal matrix of singular values S, and the transposed matrix of right singular vectors V. For a centered data matrix (which is why the data is centered first), the squared singular values divided by n_samples - 1 equal the eigenvalues of the data covariance matrix, the columns of V are the eigenvectors of the covariance matrix, and U scaled by the singular values (U · S) is the projection of the original data onto the principal components defined by V. SVD therefore also yields the principal components, but without the need to compute the covariance matrix. 
Besides being more efficient, this approach is considered more numerically stable, because it does not require computing the covariance matrix explicitly, which can be ill-conditioned when the features are strongly correlated. This is the approach used in the scikit-learn implementation, with some particularities discussed below.\n", 35 | "\n", 36 | "**SVD-based PCA is built as follows:**\n", 37 | "- 1) first, the data is centered, and, if the number of components was not specified, it is set to the minimum of the number of samples and the number of features;\n", 38 | "- 2) SVD is then applied to the centered data matrix;\n", 39 | "- 3) the svd_flip_vector method is applied to the matrix U: it finds the largest-magnitude element in each column of U, extracts the signs of these elements, and multiplies the columns of U by these signs to guarantee deterministic output;\n", 40 | "- 4) the explained variance of each principal component is computed as the square of the corresponding singular value divided by n_samples - 1, and the transformed data is computed for the chosen number of principal components by the rule $X_{new} = X \\cdot V = U \\cdot S \\cdot V^T \\cdot V = U \\cdot S$.\n" 41 | ], 42 | "metadata": { 43 | "id": "1eGqxmn_8A-R" 44 | } 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "source": [ 49 | "### **Additional PCA features**\n", 50 | "**The explained variance ratio** of each principal component, available through the *explained_variance_ratio_* attribute, indicates the share of the dataset's variance that lies along the axis of each principal component.\n", 51 | "\n", 52 | "**Data reconstruction** with the *inverse_transform()* method applies the inverse of the PCA projection, $X_{recovered} = X_{d-proj} W_d^T$, where $W_d$ is the matrix whose columns are the first d principal components (the mean removed during centering is added back afterwards). 
It follows that the data is reconstructed with losses proportional to the amount of discarded variance of the original data, and the mean squared distance between the original and the reconstructed data is the reconstruction error.\n", 53 | "\n", 54 | "**Incremental PCA**, implemented as the *IncrementalPCA* class, works more efficiently with large datasets by splitting them into mini-batches and keeping only one mini-batch in memory at a time during training.\n", 55 | "\n", 56 | "**Randomized PCA**, enabled with the parameter svd_solver='randomized', uses a stochastic algorithm to quickly compute an approximation of the first d principal components. It is based on the assumption that a random projection of the data onto a low-dimensional subspace can preserve its structure and properties well, although this approach can be less accurate.\n", 57 | "\n", 58 | "**Kernel PCA**, implemented as the *KernelPCA* class, performs complex nonlinear projections using kernel functions. The idea is the same as in SVM: a linear decision boundary in the high-dimensional feature space corresponds to a complex nonlinear boundary in the original space." 59 | ], 60 | "metadata": { 61 | "id": "bhdxhVqqiT_N" 62 | } 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "source": [ 67 | "### **Alternatives to PCA**\n", 68 | "Although PCA is one of the most popular dimensionality reduction algorithms, there are alternatives that may be preferable in some situations, depending also on the type of data:\n", 69 | "- **LLE (Locally Linear Embedding)**: an algorithm that expresses each point as a linear combination of its neighbors and then reproduces these combinations in a lower-dimensional space, which preserves the nonlinear geometry of the data and can be useful for tasks where global properties matter less. 
On the other hand, this approach has high computational complexity and can be sensitive to noise.\n", 70 | "- **t-SNE (t-Distributed Stochastic Neighbor Embedding)**: an algorithm that converts similarities between data points into probabilities and then tries to minimize the divergence between the probability distributions in the high-dimensional and low-dimensional spaces. t-SNE is effective for visualizing high-dimensional data, but it can distort the global structure of the data, because it takes into account only the proximity of points in the original space rather than linear dependencies.\n", 71 | "- **UMAP (Uniform Manifold Approximation and Projection)**: another algorithm suited to data visualization, based on the idea that the data lies on a uniform manifold that can be approximated with a neighbor graph. This approach takes the global structure of the data into account, adapts better to different types of data, and handles noise and outliers better than t-SNE.\n", 72 | "- **Autoencoders**: a type of neural network in which an encoder is trained to map the input data to a low-dimensional representation and a decoder is then trained to reconstruct the original data from that representation. 
Autoencoders can also be used for data compression, denoising, and many other purposes.\n" 73 | ], 74 | "metadata": { 75 | "id": "VMuABOU9iaAc" 76 | } 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "source": [ 81 | "### **Implementation from scratch in Python**\n" 82 | ], 83 | "metadata": { 84 | "id": "_lBgn_MwyM_F" 85 | } 86 | }, 87 | { 88 | "cell_type": "code", 89 | "source": [ 90 | "import numpy as np\n", 91 | "from sklearn.decomposition import PCA\n", 92 | "from sklearn.datasets import load_iris" 93 | ], 94 | "metadata": { 95 | "id": "IRx6PDrq7PSK" 96 | }, 97 | "execution_count": 6, 98 | "outputs": [] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "source": [ 103 | "class SVDPCA:\n", 104 | "    def __init__(self, n_components=None):\n", 105 | "        self.n_components = n_components\n", 106 | "\n", 107 | "    @staticmethod\n", 108 | "    def svd_flip_vector(U):\n", 109 | "        max_abs_cols_U = np.argmax(np.abs(U), axis=0)\n", 110 | "        # extract the signs of the max absolute values\n", 111 | "        signs_U = np.sign(U[max_abs_cols_U, range(U.shape[1])])\n", 112 | "\n", 113 | "        return U * signs_U\n", 114 | "\n", 115 | "    def fit_transform(self, X):\n", 116 | "        n_samples, n_features = X.shape\n", 117 | "        X_centered = X - X.mean(axis=0)\n", 118 | "\n", 119 | "        if self.n_components is None:\n", 120 | "            self.n_components = min(n_samples, n_features)\n", 121 | "\n", 122 | "        U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)\n", 123 | "        # flip the signs of the singular vectors to enforce deterministic output\n", 124 | "        U_flipped = self.svd_flip_vector(U)\n", 125 | "\n", 126 | "        self.explained_variance = (S[:self.n_components] ** 2) / (n_samples - 1)\n", 127 | "        self.explained_variance_ratio = (S[:self.n_components] ** 2) / np.sum(S ** 2)  # share of the total variance\n", 128 | "\n", 129 | "        # X_new = X * V = U * S * Vt * V = U * S\n", 130 | "        X_transformed = U_flipped[:, : self.n_components] * S[: self.n_components]\n", 131 | "\n", 132 | "        return X_transformed" 133 | ], 134 | "metadata": { 135 | "id": "lsf4c0787Nd-" 136 | }, 137 | 
"execution_count": 7, 138 | "outputs": [] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "source": [ 143 | "### **Loading the dataset**" 144 | ], 145 | "metadata": { 146 | "id": "Ap2qaOBsySp-" 147 | } 148 | }, 149 | { 150 | "cell_type": "code", 151 | "source": [ 152 | "X, y = load_iris(return_X_y=True, as_frame=True)\n", 153 | "print(X)" 154 | ], 155 | "metadata": { 156 | "id": "RwZpNYTACi7X", 157 | "colab": { 158 | "base_uri": "https://localhost:8080/" 159 | }, 160 | "outputId": "fe21d12d-7bbe-490b-c85f-463cb2a725b9" 161 | }, 162 | "execution_count": 8, 163 | "outputs": [ 164 | { 165 | "output_type": "stream", 166 | "name": "stdout", 167 | "text": [ 168 | " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", 169 | "0 5.1 3.5 1.4 0.2\n", 170 | "1 4.9 3.0 1.4 0.2\n", 171 | "2 4.7 3.2 1.3 0.2\n", 172 | "3 4.6 3.1 1.5 0.2\n", 173 | "4 5.0 3.6 1.4 0.2\n", 174 | ".. ... ... ... ...\n", 175 | "145 6.7 3.0 5.2 2.3\n", 176 | "146 6.3 2.5 5.0 1.9\n", 177 | "147 6.5 3.0 5.2 2.0\n", 178 | "148 6.2 3.4 5.4 2.3\n", 179 | "149 5.9 3.0 5.1 1.8\n", 180 | "\n", 181 | "[150 rows x 4 columns]\n" 182 | ] 183 | } 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "source": [ 189 | "### **Training the models and evaluating the results**\n", 190 | "The manual implementation produces results identical to scikit-learn. As can be seen, the first 2 principal components explain almost 98% of the variance in the data, which makes it possible to halve the number of features without much loss of information. If there were not 4 features but several thousand or several million, this would substantially reduce model training time without a significant loss of accuracy (and sometimes even with a gain in accuracy thanks to reduced multicollinearity between features), which makes PCA and its alternatives an excellent complement to other algorithms." 
191 | ], 192 | "metadata": { 193 | "id": "nc7ILPOKyW_b" 194 | } 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "source": [ 199 | "**PCA**" 200 | ], 201 | "metadata": { 202 | "id": "zAlR-QBEglAK" 203 | } 204 | }, 205 | { 206 | "cell_type": "code", 207 | "source": [ 208 | "pca = SVDPCA()\n", 209 | "X_transformed = pca.fit_transform(X)\n", 210 | "\n", 211 | "print('transformed data', X_transformed[:10], '', sep='\\n')\n", 212 | "print('explained_variance', pca.explained_variance)\n", 213 | "print('explained_variance_ratio', pca.explained_variance_ratio)" 214 | ], 215 | "metadata": { 216 | "colab": { 217 | "base_uri": "https://localhost:8080/" 218 | }, 219 | "id": "4llKYPNLCrys", 220 | "outputId": "fc82c1ca-5ef5-46d7-e0c4-0891a26446a1" 221 | }, 222 | "execution_count": 9, 223 | "outputs": [ 224 | { 225 | "output_type": "stream", 226 | "name": "stdout", 227 | "text": [ 228 | "transformed data\n", 229 | "[[-2.68412563e+00 3.19397247e-01 -2.79148276e-02 -2.26243707e-03]\n", 230 | " [-2.71414169e+00 -1.77001225e-01 -2.10464272e-01 -9.90265503e-02]\n", 231 | " [-2.88899057e+00 -1.44949426e-01 1.79002563e-02 -1.99683897e-02]\n", 232 | " [-2.74534286e+00 -3.18298979e-01 3.15593736e-02 7.55758166e-02]\n", 233 | " [-2.72871654e+00 3.26754513e-01 9.00792406e-02 6.12585926e-02]\n", 234 | " [-2.28085963e+00 7.41330449e-01 1.68677658e-01 2.42008576e-02]\n", 235 | " [-2.82053775e+00 -8.94613845e-02 2.57892158e-01 4.81431065e-02]\n", 236 | " [-2.62614497e+00 1.63384960e-01 -2.18793179e-02 4.52978706e-02]\n", 237 | " [-2.88638273e+00 -5.78311754e-01 2.07595703e-02 2.67447358e-02]\n", 238 | " [-2.67275580e+00 -1.13774246e-01 -1.97632725e-01 5.62954013e-02]]\n", 239 | "\n", 240 | "explained_variance [4.22824171 0.24267075 0.0782095 0.02383509]\n", 241 | "explained_variance_ratio [0.92461872 0.05306648 0.01710261 0.00521218]\n" 242 | ] 243 | } 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "source": [ 249 | "**PCA (scikit-learn)**" 250 | ], 251 | "metadata": { 252 | 
"id": "m5aG-Cl2goww" 253 | } 254 | }, 255 | { 256 | "cell_type": "code", 257 | "source": [ 258 | "sk_pca = PCA()\n", 259 | "sk_X_transformed = sk_pca.fit_transform(X)\n", 260 | "\n", 261 | "print('sk transformed data', sk_X_transformed[:10], '', sep='\\n')\n", 262 | "print('sk explained_variance', sk_pca.explained_variance_)\n", 263 | "print('sk explained_variance_ratio_', sk_pca.explained_variance_ratio_)" 264 | ], 265 | "metadata": { 266 | "colab": { 267 | "base_uri": "https://localhost:8080/" 268 | }, 269 | "id": "D91ifASgDTbX", 270 | "outputId": "afa161c5-b856-428d-c5a4-eba6b29204df" 271 | }, 272 | "execution_count": 10, 273 | "outputs": [ 274 | { 275 | "output_type": "stream", 276 | "name": "stdout", 277 | "text": [ 278 | "sk transformed data\n", 279 | "[[-2.68412563e+00 3.19397247e-01 -2.79148276e-02 -2.26243707e-03]\n", 280 | " [-2.71414169e+00 -1.77001225e-01 -2.10464272e-01 -9.90265503e-02]\n", 281 | " [-2.88899057e+00 -1.44949426e-01 1.79002563e-02 -1.99683897e-02]\n", 282 | " [-2.74534286e+00 -3.18298979e-01 3.15593736e-02 7.55758166e-02]\n", 283 | " [-2.72871654e+00 3.26754513e-01 9.00792406e-02 6.12585926e-02]\n", 284 | " [-2.28085963e+00 7.41330449e-01 1.68677658e-01 2.42008576e-02]\n", 285 | " [-2.82053775e+00 -8.94613845e-02 2.57892158e-01 4.81431065e-02]\n", 286 | " [-2.62614497e+00 1.63384960e-01 -2.18793179e-02 4.52978706e-02]\n", 287 | " [-2.88638273e+00 -5.78311754e-01 2.07595703e-02 2.67447358e-02]\n", 288 | " [-2.67275580e+00 -1.13774246e-01 -1.97632725e-01 5.62954013e-02]]\n", 289 | "\n", 290 | "sk explained_variance [4.22824171 0.24267075 0.0782095 0.02383509]\n", 291 | "sk explained_variance_ratio_ [0.92461872 0.05306648 0.01710261 0.00521218]\n" 292 | ] 293 | } 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "source": [ 299 | "### **Advantages and disadvantages**\n", 300 | "Advantages:\n", 301 | "- dimensionality reduction that retains most of the information, which also makes it possible to visualize high-dimensional data
in two or three dimensions;\n", 302 | "- it can not only speed up training considerably but in some cases also reduce model overfitting;\n", 303 | "- it can be used for data compression.\n", 304 | "\n", 305 | "Disadvantages:\n", 306 | "- some information in the data is inevitably lost;\n", 307 | "- only linear relationships in the data are captured (in standard PCA);\n", 308 | "- the principal components lack a clear meaning because they are hard to relate to the real features." 309 | ], 310 | "metadata": { 311 | "id": "hFEntYHAksy_" 312 | } 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "source": [ 317 | "### **Additional sources**\n", 318 | "Articles:\n", 319 | "- «A Tutorial on Principal Component Analysis», Jonathon Shlens;\n", 320 | "- «Locally Linear Embedding and its Variants: Tutorial and Survey», Benyamin Ghojogh, Ali Ghodsi, Fakhri Karray, Mark Crowley;\n", 321 | "- «Theoretical Foundations of t-SNE for Visualizing High-Dimensional Clustered Data», T. 
Tony Cai, Rong Ma;\n", 322 | "- «UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction», Leland McInnes, John Healy, James Melville;\n", 323 | "- «Deep Autoencoders for Dimensionality Reduction of High-Content Screening Data», Lee Zamparo, Zhaolei Zhang.\n", 324 | "\n", 325 | "Documentation:\n", 326 | "- [PCA description](https://scikit-learn.org/stable/modules/decomposition.html#pca);\n", 327 | "- [LLE description](https://scikit-learn.org/stable/modules/manifold.html#locally-linear-embedding);\n", 328 | "- [t-SNE description](https://scikit-learn.org/stable/modules/manifold.html#t-sne);\n", 329 | "- [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html);\n", 330 | "- [LLE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.LocallyLinearEmbedding.html);\n", 331 | "- [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html);\n", 332 | "- [UMAP](https://umap-learn.readthedocs.io/en/latest/index.html).\n", 333 | "\n", 334 | "Videos:\n", 335 | "- PCA: [one](https://www.youtube.com/watch?v=FgakZw6K1QQ), [two](https://www.youtube.com/watch?v=fkf4IBRSeEc), [three](https://www.youtube.com/watch?v=IwPzjlBXBlA), [four](https://www.youtube.com/watch?v=WW3ZJHPwvyg);\n", 336 | "- [LLE](https://www.youtube.com/watch?v=B6kzA1W_4pU);\n", 337 | "- [t-SNE](https://www.youtube.com/watch?v=NEaUSP4YerM);\n", 338 | "- [UMAP](https://www.youtube.com/watch?v=eN0wFzBA4Sc);\n", 339 | "- [autoencoders](https://www.youtube.com/watch?v=FhmpO73ythg)." 
340 | ], 341 | "metadata": { 342 | "id": "MEqVtot3k0Vs" 343 | } 344 | } 345 | ] 346 | } -------------------------------------------------------------------------------- /Notebooks [rus]/empty.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ![ML from scratch course logo](https://github.com/egaoharu-kensei/ML-algorithms-from-scratch.-Course-for-beginners/assets/162469942/db9333ab-6025-4961-b745-d45467ab71c2) 2 | 3 | [![CC BY-NC-SA 4.0][cc-by-nc-sa-shield]][cc-by-nc-sa] 4 | 5 | [cc-by-nc-sa]: http://creativecommons.org/licenses/by-nc-sa/4.0/ 6 | [cc-by-nc-sa-image]: https://licensebuttons.net/l/by-nc-sa/4.0/88x31.png 7 | [cc-by-nc-sa-shield]: https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-green.svg 8 | 9 | This repository contains implementations of all popular ML-algorithms from scratch using Python with their detailed theoretical description. In addition, this course is presented in two languages (eng, rus) in the form of jupyter notebooks and will feature tutorials on the necessary ML-libraries and also all the theory of concepts in ML, starting from methods of evaluating models and ending with optimization methods. 10 | 11 | #### 🔔***This course is also available on other platforms: [Habr (rus)](https://habr.com/ru/users/egaoharu_kensei/publications/articles/) and [Kaggle (eng)](https://www.kaggle.com/egazakharenko/code)*** 12 | 13 | ## **What you need to know before studying** 14 | 15 | - python (at the function and OOP level); 16 | - mathematics (linear algebra, calculus, probability theory and statistics). 
17 | 18 | ## **Current content** 19 | 20 | *At the moment, the course is under development and new notebooks will be gradually added* 21 | 22 | ***Initial theory*** 23 | - 8\) Quality metrics and error analysis 24 | - 9\) Optimization methods in ML and DL 25 | 26 | ***Supervised learning*** 27 | - 1\) Linear regression and its modifications 28 | - 2\) Logistic & Softmax-regressions 29 | - 3\) Linear Discriminant Analysis (LDA) 30 | - 4\) Naive Bayes Classifier 31 | - 5\) Support Vector Machines (SVM) 32 | - 6\) K-Nearest Neighbors (KNN) 33 | - 7\) Decision Tree (CART) 34 | - 8\) Bagging & Random Forest 35 | - 9\) AdaBoost (SAMME & R2) 36 | - 10\) Gradient Boosting algorithms (XGBoost, CatBoost, LightGBM) 37 | - 11\) Stacking & Blending 38 | 39 | ***Unsupervised learning*** 40 | - 12\) Principal Component Analysis (PCA) 41 | - 13\) Popular clustering algorithms (K-Means, DBSCAN et al.) 42 | 43 | ## **Best additional sources** 44 | 45 | ***Books*** 46 | - "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" (3rd edition) by Aurélien Géron 47 | - "An Introduction to Statistical Learning: with Applications in R" (2nd edition) by Gareth James et al. 48 | - "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" (2nd edition) by Trevor Hastie et al. 49 | - "Pattern Recognition and Machine Learning" by Christopher M. Bishop 50 | - "Deep Learning" by Ian Goodfellow et al. 
51 | 52 | ***Courses and playlists*** 53 | - [Machine Learning Specialization by Andrew Ng 2022](https://www.coursera.org/specializations/machine-learning-introduction) 54 | - [Machine Learning (StatQuest with Josh Starmer)](https://www.youtube.com/playlist?list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF) 55 | - [Machine Learning Concepts (Simply Explained) by Pedram Jahangiry](https://www.youtube.com/playlist?list=PL2GWo47BFyUPWL5fBZSn6FFHRr1bSkX_J) 56 | 57 | ## 58 | 59 | ### All success and until we meet again!🏆 60 | --------------------------------------------------------------------------------