├── README.md
├── seminar01
│   ├── numpy_indexing.png
│   ├── numpy_fancy_indexing.png
│   ├── 01_Main_SklearnFirstClassifiers.ipynb
│   ├── 03_Main_NaiveBayes.ipynb
│   └── 06_Reference_Numpy.ipynb
├── seminar02
│   ├── ml_bias_variance.png
│   ├── 02_Main_Bagging.ipynb
│   ├── 03_Main_Boosting.ipynb
│   └── 05_Reference_BiasVariance.ipynb
├── hw01
│   ├── DMIA_Base_2021_Spring_hw1.pdf
│   ├── README.md
│   ├── NumpyScipy.ipynb
│   ├── KNN.ipynb
│   ├── NaiveBayes.ipynb
│   └── Polynom.ipynb
├── lecture01
│   └── Lecture1_MathAndSimpleMethods-compressed.pdf
└── hw02
    ├── README.md
    └── SimpleGB.ipynb

/README.md:
--------------------------------------------------------------------------------
1 | # DMIA_Base_2021_Spring
--------------------------------------------------------------------------------
/seminar01/numpy_indexing.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-mining-in-action/DMIA_Base_2021_Spring/HEAD/seminar01/numpy_indexing.png
--------------------------------------------------------------------------------
/seminar02/ml_bias_variance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-mining-in-action/DMIA_Base_2021_Spring/HEAD/seminar02/ml_bias_variance.png
--------------------------------------------------------------------------------
/hw01/DMIA_Base_2021_Spring_hw1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-mining-in-action/DMIA_Base_2021_Spring/HEAD/hw01/DMIA_Base_2021_Spring_hw1.pdf
--------------------------------------------------------------------------------
/seminar01/numpy_fancy_indexing.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-mining-in-action/DMIA_Base_2021_Spring/HEAD/seminar01/numpy_fancy_indexing.png
--------------------------------------------------------------------------------
/lecture01/Lecture1_MathAndSimpleMethods-compressed.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-mining-in-action/DMIA_Base_2021_Spring/HEAD/lecture01/Lecture1_MathAndSimpleMethods-compressed.pdf -------------------------------------------------------------------------------- /hw01/README.md: -------------------------------------------------------------------------------- 1 | Выполните задачи в домашних ноутбуках, в каждой задаче будет контрольный код, вывод которого нужно ввести в [форму](https://forms.gle/LgGQgq2E1WauiGYT8). 2 | 3 | Дедлайн по сдаче: 10:00 15 мая 2021 4 | -------------------------------------------------------------------------------- /hw02/README.md: -------------------------------------------------------------------------------- 1 | Выполните задачи в домашних ноутбуках, в каждой задаче будет контрольный код, вывод которого нужно ввести в [форму](https://forms.gle/RpPCzACqkmW8rkvN6). 2 | 3 | Дедлайн по сдаче: 10:00 20 мая 2021 4 | -------------------------------------------------------------------------------- /hw01/NumpyScipy.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import scipy" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "### Задача 1\n", 18 | "\n", 19 | "Дан массив $arr$, требуется для каждой позиции $i$ найти номер элемента $arr_i$ в массиве $arr$, отсортированном по убыванию. Все значения массива $arr$ различны." 
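К задаче 1: номер элемента в массиве, отсортированном по убыванию, удобно получать двойным argsort. Ниже - иллюстративный набросок; имя descending_ranks выбрано для примера и не является частью задания:

```python
import numpy as np

def descending_ranks(arr):
    """Для каждой позиции i возвращает номер arr[i] в массиве,
    отсортированном по убыванию (все значения считаем различными)."""
    arr = np.asarray(arr)
    order = np.argsort(-arr)             # индексы элементов от большего к меньшему
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(arr))   # обратная перестановка: индекс -> ранг
    return ranks

print(descending_ranks([1, 2, 3]))       # [2 1 0]
print(descending_ranks([-2, 1, 0, -1]))  # [3 0 1 2]
```

Двойной argsort работает за O(n log n) и не требует явного построения отсортированной копии массива.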
20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": null, 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "def function_1(arr):\n", 29 | " return #TODO" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "(function_1([1, 2, 3]) == [2, 1, 0]).all()" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "(function_1([-2, 1, 0]) == [2, 0, 1]).all()" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "(function_1([-2, 1, 0, -1]) == [3, 0, 1, 2]).all()" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "**Значение для формы**" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "np.random.seed(42)\n", 73 | "arr = function_1(np.random.uniform(size=1000000))\n", 74 | "print(arr[7] + arr[42] + arr[445677] + arr[53422])" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "### Задача 2\n", 82 | "\n", 83 | "Дана матрица $X$, нужно найти след матрицы $X X^T$" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [ 92 | "def function_2(matrix):\n", 93 | " return #TODO" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "function_2(np.array([\n", 103 | " [1, 2],\n", 104 | " [3, 4]\n", 105 | "])) == 30" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "metadata": {}, 112 | "outputs": [], 113 | "source": [ 114 | "function_2(np.array([\n", 115 | " [1, 0],\n", 116 | " [0, 1]\n", 117 | "])) == 2" 118 | ] 
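К задаче 2: след матрицы $X X^T$ равен сумме квадратов всех элементов $X$, поэтому саму матрицу $X X^T$ строить не обязательно. Набросок; имя trace_xxt - иллюстративное:

```python
import numpy as np

def trace_xxt(X):
    # tr(X X^T) = sum_{i,j} X_{ij}^2: диагональ X X^T состоит
    # из скалярных квадратов строк X
    X = np.asarray(X, dtype=float)
    return np.einsum('ij,ij->', X, X)

print(trace_xxt(np.array([[1, 2], [3, 4]])))        # 30.0
print(trace_xxt(np.array([[2, 1, 1], [1, 2, 1]])))  # 12.0
```

Такой вариант требует O(nm) операций и памяти вместо O(n^2 m) на явное перемножение матриц.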
119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "function_2(np.array([\n", 127 | " [2, 0],\n", 128 | " [0, 2]\n", 129 | "])) == 8" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "function_2(np.array([\n", 139 | " [2, 1, 1],\n", 140 | " [1, 2, 1]\n", 141 | "])) == 12" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "**Значение для формы**" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": null, 154 | "metadata": {}, 155 | "outputs": [], 156 | "source": [ 157 | "np.random.seed(42)\n", 158 | "arr1 = np.random.uniform(size=(1, 100000))\n", 159 | "arr2 = np.random.uniform(size=(100000, 1))\n", 160 | "print(int(function_2(arr1) + function_2(arr2)))" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "### Задача 3\n", 168 | "\n", 169 | "Дан набор точек с координатами points_x и points_y. Нужно найти такую точку $p$ с нулевой координатой $y$ (то есть с координатами вида $(x, 0)$), что расстояние от неё до самой удалённой точки из исходного набора (расстояние евклидово) минимально. В качестве ответа верните это минимальное расстояние." 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": null, 175 | "metadata": {}, 176 | "outputs": [], 177 | "source": [ 178 | "def function_3(points_x, points_y):\n", 179 | " return #TODO" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [ 188 | "np.abs(function_3([0, 2], [1, 1]) - 1.) < 1e-3" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "metadata": {}, 195 | "outputs": [], 196 | "source": [ 197 | "np.abs(function_3([0, 2, 4], [1, 1, 1]) - 2.)
< 1e-3" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": null, 203 | "metadata": {}, 204 | "outputs": [], 205 | "source": [ 206 | "np.abs(function_3([0, 4, 4], [1, 1, 1]) - 2.) < 1e-3" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "**Значение для формы**" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "np.random.seed(42)\n", 223 | "arr1 = np.random.uniform(-56, 100, size=100000)\n", 224 | "arr2 = np.random.uniform(-100, 100, size=100000)\n", 225 | "print(int(round((function_3(arr1, arr2) * 100))))" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [] 234 | } 235 | ], 236 | "metadata": { 237 | "kernelspec": { 238 | "display_name": "dmia", 239 | "language": "python", 240 | "name": "dmia" 241 | }, 242 | "language_info": { 243 | "codemirror_mode": { 244 | "name": "ipython", 245 | "version": 3 246 | }, 247 | "file_extension": ".py", 248 | "mimetype": "text/x-python", 249 | "name": "python", 250 | "nbconvert_exporter": "python", 251 | "pygments_lexer": "ipython3", 252 | "version": "3.6.6" 253 | } 254 | }, 255 | "nbformat": 4, 256 | "nbformat_minor": 2 257 | } 258 | -------------------------------------------------------------------------------- /hw01/KNN.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Реализуем метод predict_proba для KNN" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Ниже реализован класс KNeighborsClassifier, который для поиска ближайших соседей использует sklearn.neighbors.NearestNeighbors\n", 15 | "\n", 16 | "Требуется реализовать метод predict_proba для вычисления ответа классификатора." 
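К ноутбуку KNN.ipynb: для реализации predict_proba понадобится метод kneighbors у sklearn.neighbors.NearestNeighbors. Небольшой пример его работы; данные придуманы для иллюстрации:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X_train = np.array([[0.], [1.], [2.], [10.]])
nn = NearestNeighbors(n_neighbors=2).fit(X_train)

# kneighbors возвращает расстояния до ближайших соседей и их индексы
# в обучающей выборке; по индексам можно взять метки y и
# просуммировать веса голосов по классам
distances, indices = nn.kneighbors(np.array([[0.2]]))
print(distances)  # [[0.2 0.8]]
print(indices)    # [[0 1]]
```

Соседи возвращаются в порядке возрастания расстояния, поэтому веса вида 1/(d + 1e-3) достаточно просуммировать по классам меток соседей и нормировать на их сумму.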
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 1, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "import numpy as np\n", 26 | "\n", 27 | "from sklearn.base import BaseEstimator, ClassifierMixin\n", 28 | "from sklearn.neighbors import NearestNeighbors\n", 29 | "\n", 30 | "\n", 31 | "class KNeighborsClassifier(BaseEstimator, ClassifierMixin):\n", 32 | " '''\n", 33 | " Класс, который позволит нам изучить KNN\n", 34 | " '''\n", 35 | " def __init__(self, n_neighbors=5, weights='uniform', \n", 36 | " metric='minkowski', p=2):\n", 37 | " '''\n", 38 | " Инициализируем KNN с несколькими стандартными параметрами\n", 39 | " '''\n", 40 | " assert weights in ('uniform', 'distance')\n", 41 | " \n", 42 | " self.n_neighbors = n_neighbors\n", 43 | " self.weights = weights\n", 44 | " self.metric = metric\n", 45 | " \n", 46 | " self.NearestNeighbors = NearestNeighbors(\n", 47 | " n_neighbors = n_neighbors,\n", 48 | " metric = self.metric)\n", 49 | " \n", 50 | " def fit(self, X, y):\n", 51 | " '''\n", 52 | " Используем sklearn.neighbors.NearestNeighbors \n", 53 | " для запоминания обучающей выборки\n", 54 | " и последующего поиска соседей\n", 55 | " '''\n", 56 | " self.NearestNeighbors.fit(X)\n", 57 | " self.n_classes = len(np.unique(y))\n", 58 | " self.y = y\n", 59 | " \n", 60 | " def predict_proba(self, X, use_first_zero_distant_sample=True):\n", 61 | " '''\n", 62 | " Чтобы реализовать этот метод, \n", 63 | " изучите работу sklearn.neighbors.NearestNeighbors'''\n", 64 | " \n", 65 | " # получим здесь расстояния до соседей distances и их метки\n", 66 | " \n", 67 | " if self.weights == 'uniform':\n", 68 | " w = np.ones(distances.shape)\n", 69 | " else:\n", 70 | " # чтобы не делить на 0, \n", 71 | " # добавим небольшую константу, например 1e-3\n", 72 | " w = 1/(distances + 1e-3)\n", 73 | "\n", 74 | " # реализуем вычисление предсказаний:\n", 75 | " # выбрав один объект, для каждого класса посчитаем\n", 76 | " # суммарный вес голосующих за него 
объектов\n", 77 | " # затем нормируем эти веса на их сумму\n", 78 | " # и вернем это как предсказание KNN\n", 79 | " return probs" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "# Загрузим данные и обучим классификатор" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 2, 92 | "metadata": {}, 93 | "outputs": [], 94 | "source": [ 95 | "from sklearn.datasets import load_iris\n", 96 | "X, y = load_iris(return_X_y=True)\n", 97 | "\n", 98 | "knn = KNeighborsClassifier(weights='distance')\n", 99 | "knn.fit(X, y)\n", 100 | "prediction = knn.predict_proba(X, )" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "Поскольку мы используем одну и ту же выборку для обучения и предсказания, ближайшим соседом любого объекта будет он же сам. В качестве упражнения предлагаю реализовать метод transform, который реализует получение предсказаний для обучающей выборки, но для каждого объекта не будет учитывать его самого.\n", 108 | "\n", 109 | "Посмотрим, в каких объектах max(prediction) != 1:" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 3, 115 | "metadata": {}, 116 | "outputs": [ 117 | { 118 | "name": "stdout", 119 | "output_type": "stream", 120 | "text": [ 121 | "[ 56 68 70 72 77 83 106 110 119 123 127 133 134 138 146]\n" 122 | ] 123 | } 124 | ], 125 | "source": [ 126 | "inds = np.arange(len(prediction))[prediction.max(1) != 1]\n", 127 | "print(inds)\n", 128 | "\n", 129 | "# [ 56 68 70 72 77 83 106 110 119 123 127 133 134 138 146]" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "Несколько примеров, на которых можно проверить правильность реализованного метода:" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 4, 142 | "metadata": {}, 143 | "outputs": [ 144 | { 145 | "name": "stdout", 146 | "output_type": "stream", 147 | "text": [ 148 | "68 [0. 
0.99816311 0.00183689]\n", 149 | "77 [0. 0.99527902 0.00472098]\n", 150 | "146 [0. 0.00239145 0.99760855]\n" 151 | ] 152 | } 153 | ], 154 | "source": [ 155 | "for i in 1, 4, -1:\n", 156 | " print(inds[i], prediction[inds[i]])\n", 157 | "\n", 158 | "# 68 [0. 0.99816311 0.00183689]\n", 159 | "# 77 [0. 0.99527902 0.00472098]\n", 160 | "# 146 [0. 0.00239145 0.99760855]" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "**Примечание:** отличие в третьем-четвертом знаке после запятой в тестах не должно повлиять на сдачу задания" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "# Ответы для формы" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "В форму требуется ввести max(prediction) для объекта. Если метод реализован верно, то ячейка ниже распечатает ответы, которые нужно ввести в форму" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": null, 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "for i in 56, 83, 127:\n", 191 | " print('{:.2f}'.format(max(prediction[i])))" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [] 200 | } 201 | ], 202 | "metadata": { 203 | "kernelspec": { 204 | "display_name": "Python 3", 205 | "language": "python", 206 | "name": "python3" 207 | }, 208 | "language_info": { 209 | "codemirror_mode": { 210 | "name": "ipython", 211 | "version": 3 212 | }, 213 | "file_extension": ".py", 214 | "mimetype": "text/x-python", 215 | "name": "python", 216 | "nbconvert_exporter": "python", 217 | "pygments_lexer": "ipython3", 218 | "version": "3.6.5" 219 | } 220 | }, 221 | "nbformat": 4, 222 | "nbformat_minor": 2 223 | } 224 | -------------------------------------------------------------------------------- /seminar02/02_Main_Bagging.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Бэггинг" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Пусть есть случайные одинаково распределённые величины $\\xi_1, \\xi_2, \\dots, \\xi_n$, скоррелированные с коэффициентом корреляции $\\rho$ и дисперсией $\\sigma^2$. Какова будет дисперсия величины $\\frac1n \\sum_{i=1}^n \\xi_i$?\n", 15 | "\n", 16 | "$$\\mathbf{D} \\frac1n \\sum_{i=1}^n \\xi_i = \\frac1{n^2}\\mathbf{cov} (\\sum_{i=1}^n \\xi_i, \\sum_{i=1}^n \\xi_i) = \\frac1{n^2} \\sum_{i=1, j=1}^n \\mathbf{cov}(\\xi_i, \\xi_j) = \\frac1{n^2} \\sum_{i=1}^n \\mathbf{cov}(\\xi_i, \\xi_i) + \\frac1{n^2} \\sum_{i=1, j=1, i\\neq j}^n \\mathbf{cov}(\\xi_i, \\xi_j) = \\frac1{n^2} \\sum_{i=1}^n \\sigma^2 + \\frac1{n^2} \\sum_{i=1, j=1, i\\neq j}^n \\rho \\sigma^2 =$$\n", 17 | "$$ = \\frac1{n^2} n \\sigma^2 + \\frac1{n^2} n(n-1) \\rho \\sigma^2 = \\frac{\\sigma^2(1 + \\rho(n-1))}{n}$$\n", 18 | "\n", 19 | "Таким образом, чем менее величины скоррелированы между собой, тем меньше будет дисперсия после их усреднения. Грубо говоря, в этом и состоит идея бэггинга: давайте сделаем много максимально независимых моделей, а потом их усредним, и тогда предсказания станут более устойчивыми!"
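Формулу $\frac{\sigma^2(1 + \rho(n-1))}{n}$ можно проверить короткой симуляцией. Набросок; значения n, rho и sigma2 выбраны произвольно для примера:

```python
import numpy as np

# Ковариационная матрица n одинаково распределённых величин
# с дисперсией sigma2 и попарной корреляцией rho
n, rho, sigma2 = 10, 0.3, 2.0
cov = sigma2 * ((1 - rho) * np.eye(n) + rho * np.ones((n, n)))

rng = np.random.default_rng(0)
xi = rng.multivariate_normal(np.zeros(n), cov, size=200_000)

empirical = xi.mean(axis=1).var()           # выборочная дисперсия среднего
theory = sigma2 * (1 + rho * (n - 1)) / n   # 0.74 при выбранных параметрах
print(empirical, theory)  # значения должны быть близки
```

При rho = 0 дисперсия падает как sigma2/n, а при rho = 1 усреднение не помогает вовсе, что и мотивирует делать базовые модели максимально независимыми.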
20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "# Бэггинг над решающими деревьями\n", 27 | "\n", 28 | "Посмотрим, какие модели можно получить из деревьев с помощью их рандомизации" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 1, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "import pandas as pd\n", 38 | "import numpy as np\n", 39 | "from sklearn.model_selection import cross_val_score, train_test_split\n", 40 | "from sklearn.ensemble import BaggingClassifier\n", 41 | "from sklearn.tree import DecisionTreeClassifier\n", 42 | "from sklearn.ensemble import RandomForestClassifier\n", 43 | "from sklearn.linear_model import LogisticRegression" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 2, 49 | "metadata": { 50 | "scrolled": true 51 | }, 52 | "outputs": [ 53 | { 54 | "name": "stdout", 55 | "output_type": "stream", 56 | "text": [ 57 | "['last_evaluation', 'number_project', 'average_montly_hours', 'time_spend_company', 'Work_accident', 'promotion_last_5years']\n" 58 | ] 59 | } 60 | ], 61 | "source": [ 62 | "data = pd.read_csv('HR.csv')\n", 63 | "\n", 64 | "target = 'left'\n", 65 | "features = [c for c in data if c != target]\n", 66 | "print(features)\n", 67 | "\n", 68 | "X, y = data[features], data[target]" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 3, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "rnd_d3 = DecisionTreeClassifier(max_features=int(len(features) ** 0.5)) # Решающее дерево с рандомизацией в сплитах\n", 78 | "d3 = DecisionTreeClassifier() # Обычное решающее дерево" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "Качество классификации решающим деревом с настройками по-умолчанию:" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 4, 91 | "metadata": {}, 92 | "outputs": [ 93 | { 94 | "name": "stdout", 95 | "output_type": "stream", 
96 | "text": [ 97 | "Decision tree: 0.6523099419883976\n" 98 | ] 99 | } 100 | ], 101 | "source": [ 102 | "print(\"Decision tree:\", cross_val_score(d3, X, y).mean())" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "Бэггинг над решающими деревьями:" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 5, 115 | "metadata": {}, 116 | "outputs": [ 117 | { 118 | "name": "stdout", 119 | "output_type": "stream", 120 | "text": [ 121 | "D3 bagging: 0.7174495299059812\n" 122 | ] 123 | } 124 | ], 125 | "source": [ 126 | "print(\"D3 bagging:\", cross_val_score(BaggingClassifier(d3, random_state=42), X, y).mean())" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "Усредненная модель оказалась намного лучше. Оказывается, у решающих деревьев есть существенный недостаток - нестабильность получаемого дерева при небольших изменениях в выборке. Но бэггинг обращает этот недостаток в достоинство, ведь усредненная модель работает лучше, когда базовые модели слабо скоррелированы." 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "Изучив параметры DecisionTreeClassifier, можно найти хороший способ сделать деревья еще более различными - при построении каждого узла отбирать случайные max_features признаков и искать информативное разбиение только по одному из них." 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 6, 146 | "metadata": {}, 147 | "outputs": [ 148 | { 149 | "name": "stdout", 150 | "output_type": "stream", 151 | "text": [ 152 | "Randomized D3 Bagging: 0.7194494632259785\n" 153 | ] 154 | } 155 | ], 156 | "source": [ 157 | "print(\"Randomized D3 Bagging:\", cross_val_score(BaggingClassifier(rnd_d3, random_state=42), X, y).mean())" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "В среднем, качество получается еще лучше. 
Для выбора числа признаков использовалась часто применяемая на практике эвристика - брать корень из общего числа признаков. Если бы мы решали задачу регрессии - брали бы треть от общего числа." 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 7, 170 | "metadata": {}, 171 | "outputs": [ 172 | { 173 | "name": "stdout", 174 | "output_type": "stream", 175 | "text": [ 176 | "Random Forest: 0.7232495965859839\n" 177 | ] 178 | } 179 | ], 180 | "source": [ 181 | "print(\"Random Forest:\", cross_val_score(RandomForestClassifier(random_state=42), X, y).mean())" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 8, 187 | "metadata": {}, 188 | "outputs": [ 189 | { 190 | "name": "stdout", 191 | "output_type": "stream", 192 | "text": [ 193 | "Logistic Regression: 0.6287053143962126\n" 194 | ] 195 | } 196 | ], 197 | "source": [ 198 | "print(\"Logistic Regression:\", cross_val_score(LogisticRegression(), X, y).mean())" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "## Опциональное задание\n", 206 | "Повторные запуски cross_val_score будут показывать различное качество модели.\n", 207 | "\n", 208 | "Это зависит от параметра рандомизации модели \"random_state\" в DecisionTreeClassifier, BaggingClassifier или RandomForestClassifier.\n", 209 | "\n", 210 | "Чтобы понять, действительно ли одна модель лучше другой, можно посмотреть на её качество в среднем, то есть усредняя запуски с разным random_state. Предлагаю сравнить качество и понять, действительно ли BaggingClassifier(d3) лучше BaggingClassifier(rnd_d3)?\n", 211 | "\n", 212 | "Также предлагаю ответить на вопрос, чем здесь отличается BaggingClassifier(rnd_d3) от RandomForestClassifier()?"
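К опциональному заданию: сравнение "в среднем" можно организовать примерно так. Набросок использует синтетические данные make_classification вместо HR.csv, чтобы быть самодостаточным; функция mean_quality - иллюстративная:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Синтетическая замена HR.csv, чтобы набросок запускался сам по себе
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

def mean_quality(base_tree, n_runs=5):
    # усредняем качество бэггинга по нескольким random_state
    scores = [
        cross_val_score(BaggingClassifier(base_tree, random_state=seed), X, y).mean()
        for seed in range(n_runs)
    ]
    return np.mean(scores)

d3 = DecisionTreeClassifier()
rnd_d3 = DecisionTreeClassifier(max_features='sqrt')
print(mean_quality(d3), mean_quality(rnd_d3))
```

Разница двух средних при небольшом n_runs всё ещё случайна, поэтому для аккуратного вывода стоит смотреть и на разброс оценок по запускам.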
213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": null, 218 | "metadata": { 219 | "collapsed": true 220 | }, 221 | "outputs": [], 222 | "source": [] 223 | } 224 | ], 225 | "metadata": { 226 | "anaconda-cloud": {}, 227 | "kernelspec": { 228 | "display_name": "Python 3", 229 | "language": "python", 230 | "name": "python3" 231 | }, 232 | "language_info": { 233 | "codemirror_mode": { 234 | "name": "ipython", 235 | "version": 3 236 | }, 237 | "file_extension": ".py", 238 | "mimetype": "text/x-python", 239 | "name": "python", 240 | "nbconvert_exporter": "python", 241 | "pygments_lexer": "ipython3", 242 | "version": "3.6.5" 243 | } 244 | }, 245 | "nbformat": 4, 246 | "nbformat_minor": 1 247 | } 248 | -------------------------------------------------------------------------------- /seminar02/03_Main_Boosting.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Градиентный бустинг" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Бустинг - это метод построения композиции алгоритмов, в котором базовые алгоритмы строятся последовательно один за другим, причем каждый следующий алгоритм строится таким образом, чтобы уменьшить ошибку предыдущего."
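Идею "каждое следующее дерево исправляет ошибку предыдущих" можно проиллюстрировать игрушечным бустингом для регрессии с квадратичной потерей (набросок, данные синтетические): антиградиент $L = (a(x) - y)^2 / 2$ по $a(x)$ - это остаток $y - a(x)$, и именно на него обучается очередное дерево.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

pred = np.zeros_like(y)
trees, lr = [], 0.1
for _ in range(100):
    residual = y - pred                       # r_i = -dL/da = y_i - a(x_i)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    pred += lr * tree.predict(X)              # a_{N+1} = a_N + lr * b_{N+1}

print(np.mean((y - pred) ** 2))  # MSE на обучении заметно меньше дисперсии y
```

Коэффициент lr (learning rate) сжимает вклад каждого дерева: так композиция приближается к минимуму медленнее, но обычно устойчивее.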
15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "Положим, что алгоритм - это линейная комбинация некоторых базовых алгоритмов:\n", 22 | " $$a_N(x) = \\sum_{n=1}^N b_n(x)$$" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "Пусть задана некоторая функция потерь, которую мы оптимизируем\n", 30 | "$$\\sum_{i=1}^l L(\\hat y_i, y_i) \\to \\min$$ \n", 31 | "\n", 32 | "\n", 33 | "Зададимся вопросом: а что если мы хотим добавить ещё один алгоритм в эту композицию, но не просто добавить, а как можно оптимальнее с точки зрения исходной оптимизационной задачи. То есть уже есть какой-то алгоритм $a_N(x)$ и мы хотим прибавить к нему базовый алгоритм $b_{N+1}(x)$:\n", 34 | "\n", 35 | "$$\\sum_{i=1}^l L(a_{N}(x_i) + b_{N+1}(x_i), y_i) \\to \\min_{b_{N+1}}$$\n", 36 | "\n", 37 | "Сначала имеет смысл решить более простую задачу: определить, какие значения $r_1, r_2, \\dots, r_l$ должен принимать алгоритм $b_{N+1}(x_i) = r_i$ на объектах обучающей выборки, чтобы ошибка на обучающей выборке была минимальной:\n", 38 | "\n", 39 | "$$F(r) = \\sum_{i=1}^l L(a_{N}(x_i) + r_i, y_i) \\to \\min_{r},$$\n", 40 | "\n", 41 | "где $r = (r_1, r_2, \\dots, r_l)$ - вектор сдвигов." 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "Поскольку направление наискорейшего убывания функции задается направлением антиградиента, его можно принять в качестве вектора $r$:\n", 49 | "$$r = -\\nabla F$$\n", 50 | "$$r_i = -\\frac{\\partial{L}(a_N(x_i), y_i)}{\\partial{a_N(x_i)}}, \\ \\ \\ i = \\overline{1,l}$$" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "Компоненты вектора $r$, фактически, являются теми значениями, которые на объектах обучающей выборки должен принимать новый алгоритм $b_{N+1}(x)$, чтобы минимизировать ошибку строящейся композиции.
\n", 58 | "Обучение $b_{N+1}(x)$, таким образом, представляет собой *задачу обучения на размеченных данных*, в которой ${(x_i , r_i )}_{i=1}^l$ — обучающая выборка, и используется, например, квадратичная функция ошибки:\n", 59 | "$$b_{N+1}(x) = arg \\min_{b}\\sum_{i=1}^l(b(x_i) - r_i)^2$$" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "Таким образом, можно подобрать неплохое улучшение текущего алгоритма $a_N(x)$, а потом ещё раз и ещё, в итоге получив комбинацию алгоритмов, которая будет минимизировать исходный функционал." 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "# Бустинг над решающими деревьями\n", 74 | "\n", 75 | "Наиболее популярное семейство алгоритмов для бустинга это деревья. Рассмотрим популярные библиотеки" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 1, 81 | "metadata": { 82 | "collapsed": true 83 | }, 84 | "outputs": [], 85 | "source": [ 86 | "import pandas as pd\n", 87 | "import numpy as np\n", 88 | "from sklearn.model_selection import cross_val_score, train_test_split\n", 89 | "\n", 90 | "from xgboost import XGBClassifier\n", 91 | "from catboost import CatBoostClassifier\n", 92 | "from lightgbm import LGBMClassifier\n", 93 | "\n", 94 | "import warnings\n", 95 | "warnings.filterwarnings('ignore')" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 2, 101 | "metadata": { 102 | "collapsed": true 103 | }, 104 | "outputs": [], 105 | "source": [ 106 | "data = pd.read_csv('HR.csv')" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 3, 112 | "metadata": {}, 113 | "outputs": [ 114 | { 115 | "data": { 116 | "text/html": [ 117 | "
\n", 118 | "\n", 131 | "\n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | "
last_evaluationnumber_projectaverage_montly_hourstime_spend_companyWork_accidentleftpromotion_last_5years
00.5321573010
10.8652626000
20.8872724010
30.8752235010
40.5221593010
\n", 197 | "
" 198 | ], 199 | "text/plain": [ 200 | " last_evaluation number_project average_montly_hours time_spend_company \\\n", 201 | "0 0.53 2 157 3 \n", 202 | "1 0.86 5 262 6 \n", 203 | "2 0.88 7 272 4 \n", 204 | "3 0.87 5 223 5 \n", 205 | "4 0.52 2 159 3 \n", 206 | "\n", 207 | " Work_accident left promotion_last_5years \n", 208 | "0 0 1 0 \n", 209 | "1 0 0 0 \n", 210 | "2 0 1 0 \n", 211 | "3 0 1 0 \n", 212 | "4 0 1 0 " 213 | ] 214 | }, 215 | "execution_count": 3, 216 | "metadata": {}, 217 | "output_type": "execute_result" 218 | } 219 | ], 220 | "source": [ 221 | "data.head()" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": 4, 227 | "metadata": { 228 | "collapsed": true, 229 | "scrolled": true 230 | }, 231 | "outputs": [], 232 | "source": [ 233 | "X, y = data.drop('left', axis=1).values, data['left'].values" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "Качество классификации решающим деревом с настройками по-умолчанию:" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": 5, 246 | "metadata": {}, 247 | "outputs": [ 248 | { 249 | "name": "stdout", 250 | "output_type": "stream", 251 | "text": [ 252 | "XGBClassifier: 0.7791\n", 253 | "CPU times: user 1.05 s, sys: 4.04 ms, total: 1.06 s\n", 254 | "Wall time: 1.06 s\n" 255 | ] 256 | } 257 | ], 258 | "source": [ 259 | "%%time\n", 260 | "print(\"XGBClassifier: {:.4f}\".format(cross_val_score(XGBClassifier(), X, y).mean()))" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": 6, 266 | "metadata": {}, 267 | "outputs": [ 268 | { 269 | "name": "stdout", 270 | "output_type": "stream", 271 | "text": [ 272 | "CatBoostClassifier: 0.7776\n", 273 | "CPU times: user 1min 45s, sys: 52.7 s, total: 2min 38s\n", 274 | "Wall time: 50.4 s\n" 275 | ] 276 | } 277 | ], 278 | "source": [ 279 | "%%time\n", 280 | "print(\"CatBoostClassifier: {:.4f}\".format(cross_val_score(CatBoostClassifier(verbose=False), X, 
y).mean()))" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": 7, 286 | "metadata": {}, 287 | "outputs": [ 288 | { 289 | "name": "stdout", 290 | "output_type": "stream", 291 | "text": [ 292 | "LGBMClassifier: 0.7790\n", 293 | "CPU times: user 562 ms, sys: 24.8 ms, total: 587 ms\n", 294 | "Wall time: 586 ms\n" 295 | ] 296 | } 297 | ], 298 | "source": [ 299 | "%%time\n", 300 | "print(\"LGBMClassifier: {:.4f}\".format(cross_val_score(LGBMClassifier(), X, y).mean()))" 301 | ] 302 | }, 303 | { 304 | "cell_type": "markdown", 305 | "metadata": {}, 306 | "source": [ 307 | "## Опциональное задание\n", 308 | "Поиграйтесь с основными параметрами алгоритмов, чтобы максимизировать качество" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": null, 314 | "metadata": { 315 | "collapsed": true 316 | }, 317 | "outputs": [], 318 | "source": [] 319 | } 320 | ], 321 | "metadata": { 322 | "anaconda-cloud": {}, 323 | "kernelspec": { 324 | "display_name": "Python 3", 325 | "language": "python", 326 | "name": "python3" 327 | }, 328 | "language_info": { 329 | "codemirror_mode": { 330 | "name": "ipython", 331 | "version": 3 332 | }, 333 | "file_extension": ".py", 334 | "mimetype": "text/x-python", 335 | "name": "python", 336 | "nbconvert_exporter": "python", 337 | "pygments_lexer": "ipython3", 338 | "version": "3.6.5" 339 | } 340 | }, 341 | "nbformat": 4, 342 | "nbformat_minor": 1 343 | } 344 | -------------------------------------------------------------------------------- /hw02/SimpleGB.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Домашнее задание 2\n", 8 | "\n", 9 | "### Реализация градиентного бустинга для классификации" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "В рамках этой задачи нужно написать градиентный бустинг над решающими деревьями в 
задаче классификации. В качестве функции потерь предлагается взять **log loss**. Про него можно прочитать подробнее здесь: https://scikit-learn.org/stable/modules/model_evaluation.html#log-loss" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "\n", 24 | "$y_i$ это правильный ответ (0 или 1), $\hat{y}_i$ это ваше предсказание\n", 25 | "\n", 26 | "Может показаться, что надо максимизировать функцию $L(\hat{y}, y) = \sum_{i=1}^n y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)$,\n", 27 | "\n", 28 | "Это так, но не совсем: лучше максимизировать функцию $L(\hat{y}, y) = \sum_{i=1}^n y_i \log(f(\hat{y}_i)) + (1 - y_i) \log(1 - f(\hat{y}_i))$, где $f(x) = \frac{1}{1 + e^{-x}}$. Благодаря этому у вас не будет ограничений на принимаемые значения для прогнозов $\hat{y}_i$" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "### Задание 1\n", 36 | "\n", 37 | "Функцию f(x), предложенную выше, обычно называют **сигмоида** или **сигмоидная функция**. Напишите функцию, вычисляющую значения производной функции f(x)." 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "import numpy as np" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "def sigmoid(x):\n", 56 | " return 1.
/ (1 + np.exp(-x))\n", 57 | "\n", 58 | "\n", 59 | "def der_sigmoid(x):\n", 60 | " return None # TODO" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "der_sigmoid(0) == 0.25" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "der_sigmoid(np.array([0, 0])) == np.array([0.25, 0.25])" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "der_sigmoid(np.log(3)) == 0.1875" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "**Значение для формы:**" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "print(round(der_sigmoid(np.array([-10, 4.1, -1, 2])).sum() + sigmoid(0.42), 4))" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "Хорошо, теперь мы умеем считать производную функции f, но надо найти производную log loss-а по $\\hat{y}$ в первом варианте записи потерь\n", 111 | "\n", 112 | "Напоминание, первый вариант это $y_i \\log(\\hat{y}_i) + (1 - y_i) \\log(1 - \\hat{y}_i)$" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "### Задание 2\n", 120 | "\n", 121 | "Напишите вычисление производной log loss-a" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": null, 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "def der_log_loss(y_hat, y_true):\n", 131 | " \"\"\"\n", 132 | " 0 < y_hat < 1\n", 133 | " \"\"\"\n", 134 | " return None # TODO" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "der_log_loss(0.5, 0) == -2" 144 | ] 
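For reference, both derivative tasks above have short closed forms: the sigmoid satisfies $f'(x) = f(x)(1 - f(x))$, and the log loss derivative with respect to $\hat{y}$ is $y/\hat{y} - (1-y)/(1-\hat{y})$. A possible vectorized sketch (one solution among several; the notebook expects you to derive it yourself):

```python
import numpy as np

def sigmoid(x):
    return 1. / (1 + np.exp(-x))

def der_sigmoid(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); works elementwise on arrays
    s = sigmoid(x)
    return s * (1 - s)

def der_log_loss(y_hat, y_true):
    # d/dy_hat [y*log(y_hat) + (1 - y)*log(1 - y_hat)], valid for 0 < y_hat < 1
    return y_true / y_hat - (1 - y_true) / (1 - y_hat)
```

This reproduces the checks in the cells above, e.g. `der_sigmoid(0) == 0.25` and `der_log_loss(0.5, 0) == -2`.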
145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "der_log_loss(0.5, 1) == 2" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "der_log_loss(np.array([0.8, 0.8]), np.array([1, 1])) == np.array([1.25, 1.25])" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "**Значение для формы**" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": null, 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [ 177 | "print(round(-sum(der_log_loss((x + 1) / 100., x % 2) for x in range(99)), 2))" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "Теперь мы можем воспользоваться формулой производной сложной функции (chain rule) и получить вычисление градиента формулы по второму варианту:" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": null, 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "def calc_gradient(y_hat, y_true):\n", 194 | " return der_log_loss(sigmoid(y_hat), y_true) * der_sigmoid(y_hat)" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "Теперь мы можем написать код градиентного бустинга для классификации" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "metadata": {}, 207 | "source": [ 208 | "### Задание 3\n", 209 | "\n", 210 | "Допишите класс" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [ 219 | "from sklearn.base import BaseEstimator # чтобы поддержать интерфейс sklearn\n", 220 | "from sklearn.tree import DecisionTreeRegressor # для обучения на каждой итерации" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": null, 
226 | "metadata": {}, 227 | "outputs": [], 228 | "source": [ 229 | "class SimpleGB(BaseEstimator):\n", 230 | " def __init__(self, tree_params_dict, iters=100, tau=1e-1):\n", 231 | " \"\"\"\n", 232 | " tree_params_dict - словарь параметров, которые надо использовать при обучении дерева на итерации\n", 233 | " iters - количество итераций\n", 234 | " tau - коэффициент перед предсказаниями деревьев на каждой итерации\n", 235 | " \"\"\"\n", 236 | " self.tree_params_dict = tree_params_dict\n", 237 | " self.iters = iters\n", 238 | " self.tau = tau\n", 239 | " \n", 240 | " def fit(self, X_data, y_data):\n", 241 | " self.estimators = []\n", 242 | " curr_pred = 0\n", 243 | " for iter_num in range(self.iters):\n", 244 | " # Нужно найти градиент функции потерь по предсказаниям в точке curr_pred\n", 245 | " grad = None # TODO\n", 246 | " # Мы максимизируем, поэтому надо обучить DecisionTreeRegressor с параметрами \n", 247 | " # tree_params_dict по X_data предсказывать grad\n", 248 | " algo = None # TODO\n", 249 | " self.estimators.append(algo)\n", 250 | " # все предсказания домножаются на tau и обновляется переменная curr_pred\n", 251 | " curr_pred += self.tau * algo.predict(X_data)\n", 252 | " \n", 253 | " def predict(self, X_data):\n", 254 | " # изначально все предсказания нули\n", 255 | " res = np.zeros(X_data.shape[0])\n", 256 | " for estimator in self.estimators:\n", 257 | " # нужно сложить все предсказания деревьев с весом self.tau\n", 258 | " pass # TODO\n", 259 | " \n", 260 | " return (res > 0).astype(int)" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "## Проверка качества полученного класса (в самом низу код для формы)\n", 268 | "\n", 269 | "Можете поиграться с параметрами, посмотрим, у кого получится самое лучшее качество" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "metadata": {}, 276 | "outputs": [], 277 | "source": [ 278 | "# для оценки качества\n", 279 | "from 
sklearn.model_selection import cross_val_score\n", 280 | "\n", 281 | "# для генерации датасетов\n", 282 | "from sklearn.datasets import make_classification\n", 283 | "\n", 284 | "# для сравнения\n", 285 | "from sklearn.tree import DecisionTreeClassifier\n", 286 | "from sklearn.linear_model import LogisticRegression\n", 287 | "from xgboost import XGBClassifier\n", 288 | "\n", 289 | "import warnings\n", 290 | "warnings.filterwarnings('ignore')" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "metadata": {}, 297 | "outputs": [], 298 | "source": [ 299 | "X_data, y_data = make_classification(n_samples=1000, n_features=10, random_state=42)" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "metadata": {}, 306 | "outputs": [], 307 | "source": [ 308 | "algo = SimpleGB(\n", 309 | " tree_params_dict={\n", 310 | " 'max_depth':4\n", 311 | " },\n", 312 | " iters=100,\n", 313 | " tau = 0.1\n", 314 | ")" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": null, 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [ 323 | "np.mean(cross_val_score(algo, X_data, y_data, cv=5, scoring='accuracy'))" 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": null, 329 | "metadata": { 330 | "scrolled": true 331 | }, 332 | "outputs": [], 333 | "source": [ 334 | "np.mean(cross_val_score(DecisionTreeClassifier(), X_data, y_data, cv=5, scoring='accuracy'))" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": null, 340 | "metadata": {}, 341 | "outputs": [], 342 | "source": [ 343 | "np.mean(cross_val_score(XGBClassifier(), X_data, y_data, cv=5, scoring='accuracy'))" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": null, 349 | "metadata": {}, 350 | "outputs": [], 351 | "source": [ 352 | "np.mean(cross_val_score(LogisticRegression(), X_data, y_data, cv=5, scoring='accuracy'))" 353 | ] 354 | }, 355 | { 356 | 
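For reference, the TODOs in `SimpleGB` can be completed along the lines below. Note that `der_log_loss(sigmoid(y_hat), y_true) * der_sigmoid(y_hat)` simplifies algebraically to `y_true - sigmoid(y_hat)`, which this sketch uses directly; it is one possible implementation under those definitions, not the only acceptable one:

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.tree import DecisionTreeRegressor

def sigmoid(x):
    return 1. / (1 + np.exp(-x))

def calc_gradient(y_hat, y_true):
    # der_log_loss(sigmoid(y_hat), y_true) * der_sigmoid(y_hat)
    # collapses to the familiar residual form:
    return y_true - sigmoid(y_hat)

class SimpleGB(BaseEstimator):
    def __init__(self, tree_params_dict=None, iters=100, tau=1e-1):
        self.tree_params_dict = tree_params_dict
        self.iters = iters
        self.tau = tau

    def fit(self, X_data, y_data):
        self.estimators = []
        curr_pred = np.zeros(X_data.shape[0])
        for iter_num in range(self.iters):
            # gradient of the (maximized) log-likelihood at curr_pred
            grad = calc_gradient(curr_pred, y_data)
            # fit a regression tree to the gradient (ascent direction)
            algo = DecisionTreeRegressor(**(self.tree_params_dict or {}))
            algo.fit(X_data, grad)
            self.estimators.append(algo)
            # accumulate tau-weighted tree predictions
            curr_pred = curr_pred + self.tau * algo.predict(X_data)
        return self

    def predict(self, X_data):
        res = np.zeros(X_data.shape[0])
        for estimator in self.estimators:
            # sum all tree predictions with weight self.tau
            res += self.tau * estimator.predict(X_data)
        return (res > 0).astype(int)
```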
"cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "**Значение для формы**" 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": null, 365 | "metadata": {}, 366 | "outputs": [], 367 | "source": [ 368 | "print(round(np.mean(cross_val_score(SimpleGB(\n", 369 | " tree_params_dict={\n", 370 | " 'max_depth': 4\n", 371 | " },\n", 372 | " iters=1000,\n", 373 | " tau = 0.01\n", 374 | "), X_data, y_data, cv=4, scoring='accuracy')), 3))" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "metadata": {}, 381 | "outputs": [], 382 | "source": [] 383 | } 384 | ], 385 | "metadata": { 386 | "kernelspec": { 387 | "display_name": "Python 3", 388 | "language": "python", 389 | "name": "python3" 390 | }, 391 | "language_info": { 392 | "codemirror_mode": { 393 | "name": "ipython", 394 | "version": 3 395 | }, 396 | "file_extension": ".py", 397 | "mimetype": "text/x-python", 398 | "name": "python", 399 | "nbconvert_exporter": "python", 400 | "pygments_lexer": "ipython3", 401 | "version": "3.6.5" 402 | } 403 | }, 404 | "nbformat": 4, 405 | "nbformat_minor": 1 406 | } 407 | -------------------------------------------------------------------------------- /hw01/NaiveBayes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Реализуем методы для наивного байеса" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Сгенерируем выборку, в которой каждый признак имеет некоторое своё распределение, параметры которого отличаются для каждого класса. 
Затем реализуем несколько методов для класса, который уже частично написан ниже:\n", 15 | "- метод predict\n", 16 | "- метод \_find\_expon\_params и \_get\_expon\_density для экспоненциального распределения\n", 17 | "- метод \_find\_norm\_params и \_get\_norm\_density для нормального распределения\n", 18 | "\n", 19 | "Для имплементации \_find\_something\_params изучите документацию функций для работы с этими распределениями в scipy.stats и используйте предоставленные там методы." 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 1, 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "import numpy as np\n", 29 | "import scipy\n", 30 | "import scipy.stats" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "Сформируем параметры генерации для трех датасетов" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 2, 43 | "metadata": {}, 44 | "outputs": [ 45 | { 46 | "data": { 47 | "text/plain": [ 48 | "((5000, 1), (5000,), ['bernoulli'])" 49 | ] 50 | }, 51 | "execution_count": 2, 52 | "metadata": {}, 53 | "output_type": "execute_result" 54 | } 55 | ], 56 | "source": [ 57 | "func_params_set0 = [(scipy.stats.bernoulli, [dict(p=0.1), dict(p=0.5)]),\n", 58 | " ]\n", 59 | "\n", 60 | "func_params_set1 = [(scipy.stats.bernoulli, [dict(p=0.1), dict(p=0.5)]),\n", 61 | " (scipy.stats.expon, [dict(scale=1), dict(scale=0.3)]),\n", 62 | " ]\n", 63 | "\n", 64 | "func_params_set2 = [(scipy.stats.bernoulli, [dict(p=0.1), dict(p=0.5)]),\n", 65 | " (scipy.stats.expon, [dict(scale=1), dict(scale=0.3)]),\n", 66 | " (scipy.stats.norm, [dict(loc=0, scale=1), dict(loc=1, scale=2)]),\n", 67 | " ]\n", 68 | "\n", 69 | "def generate_dataset_for_nb(func_params_set=[], size = 2500, random_seed=0):\n", 70 | " '''\n", 71 | " Генерирует выборку с заданными параметрами распределений P(x|y).\n", 72 | " Число классов задается длиной списка с параметрами.\n", 73 | " Возвращает X, y, список с
названиями распределений\n", 74 | " '''\n", 75 | " np.random.seed(random_seed)\n", 76 | "\n", 77 | " X = []\n", 78 | " names = []\n", 79 | " for func, params in func_params_set:\n", 80 | " names.append(func.name)\n", 81 | " f = []\n", 82 | " for i, param in enumerate(params):\n", 83 | " f.append(func.rvs(size=size, **param))\n", 84 | " f = np.concatenate(f).reshape(-1,1)\n", 85 | " X.append(f)\n", 86 | "\n", 87 | " X = np.concatenate(X, 1)\n", 88 | " y = np.array([0] * size + [1] * size)\n", 89 | "\n", 90 | " shuffle_inds = np.random.choice(range(len(X)), size=len(X), replace=False)\n", 91 | " X = X[shuffle_inds]\n", 92 | " y = y[shuffle_inds]\n", 93 | "\n", 94 | " return X, y, names \n", 95 | "\n", 96 | "X, y, distrubution_names = generate_dataset_for_nb(func_params_set0)\n", 97 | "X.shape, y.shape, distrubution_names" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 3, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "from collections import defaultdict\n", 107 | "from sklearn.base import BaseEstimator, ClassifierMixin\n", 108 | "\n", 109 | "class NaiveBayes(BaseEstimator, ClassifierMixin):\n", 110 | " '''\n", 111 | " Реализация наивного байеса, которая помимо X, y\n", 112 | " принимает на вход во время обучения \n", 113 | " виды распределений значений признаков\n", 114 | " '''\n", 115 | " def __init__(self):\n", 116 | " pass\n", 117 | " \n", 118 | " def _find_bernoulli_params(self, x):\n", 119 | " '''\n", 120 | " метод возвращает найденный параметр `p`\n", 121 | " распределения scipy.stats.bernoulli\n", 122 | " '''\n", 123 | " return dict(p=np.mean(x))\n", 124 | " \n", 125 | " def _get_bernoulli_probability(self, x, params):\n", 126 | " '''\n", 127 | " метод возвращает вероятность x для данных\n", 128 | " параметров распределения\n", 129 | " '''\n", 130 | " return scipy.stats.bernoulli.pmf(x, **params)\n", 131 | "\n", 132 | " def _find_expon_params(self, x):\n", 133 | " # нужно определить параметры распределения\n", 134
| " # и вернуть их\n", 135 | " pass\n", 136 | " \n", 137 | " def _get_expon_density(self, x, params):\n", 138 | " # нужно вернуть плотность распределения в x\n", 139 | " pass\n", 140 | "\n", 141 | " def _find_norm_params(self, x):\n", 142 | " # нужно определить параметры распределения\n", 143 | " # и вернуть их\n", 144 | " pass\n", 145 | " \n", 146 | " def _get_norm_density(self, x, params):\n", 147 | " # нужно вернуть плотность распределения в x\n", 148 | " pass\n", 149 | "\n", 150 | " def _get_params(self, x, distribution):\n", 151 | " '''\n", 152 | " x - значения из распределения,\n", 153 | " distribution - название распределения в scipy.stats\n", 154 | " '''\n", 155 | " if distribution == 'bernoulli':\n", 156 | " return self._find_bernoulli_params(x)\n", 157 | " elif distribution == 'expon':\n", 158 | " return self._find_expon_params(x)\n", 159 | " elif distribution == 'norm':\n", 160 | " return self._find_norm_params(x)\n", 161 | " else:\n", 162 | " raise NotImplementedError('Unknown distribution')\n", 163 | " \n", 164 | " def _get_probability_or_density(self, x, distribution, params):\n", 165 | " '''\n", 166 | " x - значения,\n", 167 | " distribution - название распределения в scipy.stats,\n", 168 | " params - параметры распределения\n", 169 | " '''\n", 170 | " if distribution == 'bernoulli':\n", 171 | " return self._get_bernoulli_probability(x, params)\n", 172 | " elif distribution == 'expon':\n", 173 | " return self._get_expon_density(x, params)\n", 174 | " elif distribution == 'norm':\n", 175 | " return self._get_norm_density(x, params)\n", 176 | " else:\n", 177 | " raise NotImplementedError('Unknown distribution')\n", 178 | "\n", 179 | " def fit(self, X, y, distrubution_names):\n", 180 | " '''\n", 181 | " X - обучающая выборка,\n", 182 | " y - целевая переменная,\n", 183 | " distrubution_names - список названий распределений, \n", 184 | " по которым предположительно распределены значения P(x|y)\n", 185 | " ''' \n", 186 | " assert X.shape[1] ==
len(distrubution_names)\n", 187 | " assert set(y) == {0, 1}\n", 188 | " self.n_classes = len(np.unique(y))\n", 189 | " self.distrubution_names = distrubution_names\n", 190 | " \n", 191 | " self.y_prior = [(y == j).mean() for j in range(self.n_classes)]\n", 192 | " \n", 193 | " self.distributions_params = defaultdict(dict)\n", 194 | " for i in range(X.shape[1]):\n", 195 | " distribution = self.distrubution_names[i]\n", 196 | " for j in range(self.n_classes):\n", 197 | " values = X[y == j, i]\n", 198 | " self.distributions_params[j][i] = \\\n", 199 | " self._get_params(values, distribution)\n", 200 | " \n", 201 | " return self.distributions_params\n", 202 | " \n", 203 | " def predict(self, X):\n", 204 | " '''\n", 205 | " X - тестовая выборка\n", 206 | " '''\n", 207 | " assert X.shape[1] == len(self.distrubution_names)\n", 208 | " \n", 209 | " # нужно реализовать подсчет аргмаксной формулы, по которой \n", 210 | " # наивный байес принимает решение о принадлежности объекта классу\n", 211 | " # и применить её для каждого объекта в X\n", 212 | " #\n", 213 | " # примечание: обычно подсчет этой формулы реализуют через \n", 214 | " # её логарифмирование, то есть, через сумму логарифмов вероятностей, \n", 215 | " # поскольку перемножение достаточно малых вероятностей будет вести\n", 216 | " # к вычислительным неточностям\n", 217 | " \n", 218 | " return preds" 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": {}, 224 | "source": [ 225 | "Проверим результат на примере первого распределения" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": 4, 231 | "metadata": {}, 232 | "outputs": [ 233 | { 234 | "data": { 235 | "text/plain": [ 236 | "defaultdict(dict, {0: {0: {'p': 0.1128}}, 1: {0: {'p': 0.482}}})" 237 | ] 238 | }, 239 | "execution_count": 4, 240 | "metadata": {}, 241 | "output_type": "execute_result" 242 | } 243 | ], 244 | "source": [ 245 | "nb = NaiveBayes()\n", 246 | "nb.fit(X, y, ['bernoulli'])" 247 | ] 248 | }, 249 | { 
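The remaining NaiveBayes pieces can be sketched as follows: scipy.stats provides `fit` for parameter estimation (`expon.fit` with `floc=0` gives the MLE scale, `norm.fit` gives the sample mean and std), and `predict` can compare per-class sums of log-densities plus log-priors, as the hint above suggests. The standalone helpers below mirror the class methods; the names `find_*`, `get_*` and `nb_predict` are illustrative, not part of the notebook's interface:

```python
import numpy as np
import scipy.stats

def find_expon_params(x):
    # MLE with location fixed at 0: scale equals the sample mean
    loc, scale = scipy.stats.expon.fit(x, floc=0)
    return dict(scale=scale)

def get_expon_density(x, params):
    return scipy.stats.expon.pdf(x, **params)

def find_norm_params(x):
    loc, scale = scipy.stats.norm.fit(x)  # MLE: sample mean and std
    return dict(loc=loc, scale=scale)

def get_norm_density(x, params):
    return scipy.stats.norm.pdf(x, **params)

def nb_predict(X, y_prior, feature_densities):
    """Argmax over classes of log P(y) + sum_i log P(x_i | y).

    feature_densities[i][j] is a callable returning the density (or
    probability) of feature i under class j -- an illustrative stand-in
    for the class's _get_probability_or_density dispatch.
    """
    log_probs = np.zeros((X.shape[0], len(y_prior)))
    for j, prior in enumerate(y_prior):
        log_probs[:, j] = np.log(prior)
        for i in range(X.shape[1]):
            # sum of log-densities instead of a product of small probabilities
            log_probs[:, j] += np.log(feature_densities[i][j](X[:, i]))
    return np.argmax(log_probs, axis=1)
```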
250 | "cell_type": "code", 251 | "execution_count": 5, 252 | "metadata": {}, 253 | "outputs": [ 254 | { 255 | "name": "stdout", 256 | "output_type": "stream", 257 | "text": [ 258 | "0.6045\n" 259 | ] 260 | } 261 | ], 262 | "source": [ 263 | "from sklearn.metrics import f1_score\n", 264 | "\n", 265 | "prediction = nb.predict(X)\n", 266 | "score = f1_score(y, prediction)\n", 267 | "print('{:.2f}'.format(score))" 268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "metadata": {}, 273 | "source": [ 274 | "# Ответы для формы" 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "Ответом для формы должны служить числа, которые будут выведены ниже. Все ответы проверены: в этих примерах получается одинаковый результат и через сумму логарифмов, и через произведение вероятностей." 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": null, 287 | "metadata": {}, 288 | "outputs": [], 289 | "source": [ 290 | "scipy.stats.bernoulli.name\n", 291 | "\n", 292 | "for fps in (func_params_set0 * 2,\n", 293 | " func_params_set1, \n", 294 | " func_params_set2):\n", 295 | " \n", 296 | "\n", 297 | " X, y, distrubution_names = generate_dataset_for_nb(fps)\n", 298 | " \n", 299 | " nb = NaiveBayes()\n", 300 | " nb.fit(X, y, distrubution_names)\n", 301 | " prediction = nb.predict(X)\n", 302 | " score = f1_score(y, prediction)\n", 303 | " print('{:.2f}'.format(score))" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": null, 309 | "metadata": {}, 310 | "outputs": [], 311 | "source": [] 312 | } 313 | ], 314 | "metadata": { 315 | "kernelspec": { 316 | "display_name": "Python 3", 317 | "language": "python", 318 | "name": "python3" 319 | }, 320 | "language_info": { 321 | "codemirror_mode": { 322 | "name": "ipython", 323 | "version": 3 324 | }, 325 | "file_extension": ".py", 326 | "mimetype": "text/x-python", 327 | "name": "python", 328 | "nbconvert_exporter": "python", 329 | 
"pygments_lexer": "ipython3", 330 | "version": "3.6.5" 331 | } 332 | }, 333 | "nbformat": 4, 334 | "nbformat_minor": 2 335 | } 336 | -------------------------------------------------------------------------------- /hw01/Polynom.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "\n", 11 | "import matplotlib.pyplot as plt\n", 12 | "%matplotlib inline" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "# Задание\n", 20 | "\n", 21 | "Допишите реализацию класса для обучения полиномиальной регрессии, то есть по точкам $x_1, x_2, \dots, x_n$ и $y_1, y_2, \dots, y_n$ и заданному числу $d$ решить оптимизационную задачу:\n", 22 | "\n", 23 | "$$ \sum_{i=1}^n (~f(x_i) - y_i~)^2 \to \min_f,$$ где f – полином степени не выше $d$." 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "**Примечание:** в этом задании оптимизационную задачу можно решать как с помощью scipy.optimize, так и сводя задачу к линейной регрессии и используя готовую формулу весов из нее. Предпочтительней второй путь, но первый вариант проще, и его можно использовать для проверки. Независимо от того, как вы решите эту задачу, сдавайте в форму ответ, в котором будете больше всего уверены.\n", 31 | "\n", 32 | "**Предупреждение:** проверка этого задания **не предполагает**, что вы решите его с помощью SGD, т.к. получить таким способом тот же ответ *очень* сложно."
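The reduction-to-linear-regression route mentioned in the note above can be sketched with a Vandermonde design matrix and closed-form least-squares weights. This is one possible implementation (np.polyfit or scipy.optimize are alternatives):

```python
import numpy as np

class PolynomialRegression(object):

    def __init__(self, max_degree=1):
        self.max_degree = max_degree

    def fit(self, points_x, points_y):
        # design matrix with columns 1, x, x^2, ..., x^d
        X = np.vander(np.asarray(points_x), N=self.max_degree + 1, increasing=True)
        # closed-form least-squares weights; lstsq is numerically safer
        # than explicitly inverting X^T X
        self.w, *_ = np.linalg.lstsq(X, np.asarray(points_y), rcond=None)
        return self

    def predict(self, points_x):
        X = np.vander(np.asarray(points_x, dtype=float),
                      N=self.max_degree + 1, increasing=True)
        return X @ self.w
```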
33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 2, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "class PolynomialRegression(object):\n", 42 | " \n", 43 | " def __init__(self, max_degree=1):\n", 44 | " self.max_degree = max_degree\n", 45 | " \n", 46 | " def fit(self, points_x, points_y):\n", 47 | " # insert your code here to fit the model\n", 48 | " \n", 49 | " return self\n", 50 | " \n", 51 | " def predict(self, points_x):\n", 52 | " # insert your code here to predict the values\n", 53 | " \n", 54 | " return values" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 3, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "np.random.seed(42)\n", 64 | "points_x = np.random.uniform(-10, 10, size=10)\n", 65 | "# we use list comprehension but think about how to write it using np.array operations\n", 66 | "points_y = np.array([4 - x + x ** 2 + 0.1 * x ** 3 + np.random.uniform(-20, 20) for x in points_x])" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 4, 72 | "metadata": {}, 73 | "outputs": [ 74 | { 75 | "data": { 76 | "image/png": 
"(base64-encoded PNG of the scatter plot of points_x vs points_y omitted)\n", 77 | "text/plain": [ 78 | "
" 79 | ] 80 | }, 81 | "metadata": { 82 | "needs_background": "light" 83 | }, 84 | "output_type": "display_data" 85 | } 86 | ], 87 | "source": [ 88 | "plt.figure(figsize=(10, 5))\n", 89 | "plt.scatter(points_x, points_y)\n", 90 | "plt.grid()\n", 91 | "plt.show()" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 19, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "def plot_model(max_degree):\n", 101 | " plt.figure(figsize=(10, 5))\n", 102 | " plt.scatter(points_x, points_y)\n", 103 | " model = PolynomialRegression(max_degree).fit(points_x, points_y)\n", 104 | " all_x = np.arange(-10, 10.1, 0.1)\n", 105 | " plt.plot(all_x, model.predict(all_x))\n", 106 | " plt.grid()\n", 107 | " plt.show()" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "for i in range(10):\n", 117 | " plot_model(i)" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "Объясните почему графики меняются таким образом" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": {}, 131 | "outputs": [], 132 | "source": [] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "**Значение для формы**" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": null, 144 | "metadata": {}, 145 | "outputs": [], 146 | "source": [ 147 | "print(int(\n", 148 | " PolynomialRegression(7).fit(points_x, points_y).predict([10])[0]\n", 149 | " + PolynomialRegression(1).fit(points_x, points_y).predict([-5])[0]\n", 150 | " + PolynomialRegression(4).fit(points_x, points_y).predict([-15])[0]\n", 151 | "))" 152 | ] 153 | } 154 | ], 155 | "metadata": { 156 | "kernelspec": { 157 | "display_name": "Python 3", 158 | "language": "python", 159 | "name": "python3" 160 | }, 161 | "language_info": { 162 | "codemirror_mode": { 163 | "name": 
"ipython", 164 | "version": 3 165 | }, 166 | "file_extension": ".py", 167 | "mimetype": "text/x-python", 168 | "name": "python", 169 | "nbconvert_exporter": "python", 170 | "pygments_lexer": "ipython3", 171 | "version": "3.6.5" 172 | } 173 | }, 174 | "nbformat": 4, 175 | "nbformat_minor": 2 176 | } 177 | -------------------------------------------------------------------------------- /seminar01/01_Main_SklearnFirstClassifiers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Обучаем первые классификаторы в sklearn" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Данные" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "\n", 22 | "По данным характеристикам молекулы требуется определить, будет ли дан биологический ответ (biological response).\n", 23 | "\n", 24 | "Для демонстрации используется обучающая выборка из исходных данных bioresponse.csv, файл с данными прилагается." 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "### Готовим обучающую и тестовую выборки" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 1, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "import pandas as pd\n", 41 | "\n", 42 | "bioresponce = pd.read_csv('bioresponse.csv', header=0, sep=',')" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 2, 48 | "metadata": {}, 49 | "outputs": [ 50 | { 51 | "data": { 52 | "text/html": [ 53 | "
\n", 54 | "\n", 67 | "\n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | "
ActivityD1D2D3D4D5D6D7D8D9...D1767D1768D1769D1770D1771D1772D1773D1774D1775D1776
010.0000000.4970090.100.00.1329560.6780310.2731660.5854450.743663...0000000000
110.3666670.6062910.050.00.1112090.8034550.1061050.4117540.836582...1111010010
210.0333000.4801240.000.00.2097910.6103500.3564530.5177200.679051...0000000000
310.0000000.5388250.000.50.1963440.7242300.2356060.2887640.805110...0000000000
400.1000000.5177940.000.00.4947340.7814220.1543610.3038090.812646...0000000000
\n", 217 | "

5 rows × 1777 columns

\n", 218 | "
" 219 | ], 220 | "text/plain": [ 221 | " Activity D1 D2 D3 D4 D5 D6 D7 \\\n", 222 | "0 1 0.000000 0.497009 0.10 0.0 0.132956 0.678031 0.273166 \n", 223 | "1 1 0.366667 0.606291 0.05 0.0 0.111209 0.803455 0.106105 \n", 224 | "2 1 0.033300 0.480124 0.00 0.0 0.209791 0.610350 0.356453 \n", 225 | "3 1 0.000000 0.538825 0.00 0.5 0.196344 0.724230 0.235606 \n", 226 | "4 0 0.100000 0.517794 0.00 0.0 0.494734 0.781422 0.154361 \n", 227 | "\n", 228 | " D8 D9 ... D1767 D1768 D1769 D1770 D1771 D1772 D1773 \\\n", 229 | "0 0.585445 0.743663 ... 0 0 0 0 0 0 0 \n", 230 | "1 0.411754 0.836582 ... 1 1 1 1 0 1 0 \n", 231 | "2 0.517720 0.679051 ... 0 0 0 0 0 0 0 \n", 232 | "3 0.288764 0.805110 ... 0 0 0 0 0 0 0 \n", 233 | "4 0.303809 0.812646 ... 0 0 0 0 0 0 0 \n", 234 | "\n", 235 | " D1774 D1775 D1776 \n", 236 | "0 0 0 0 \n", 237 | "1 0 1 0 \n", 238 | "2 0 0 0 \n", 239 | "3 0 0 0 \n", 240 | "4 0 0 0 \n", 241 | "\n", 242 | "[5 rows x 1777 columns]" 243 | ] 244 | }, 245 | "execution_count": 2, 246 | "metadata": {}, 247 | "output_type": "execute_result" 248 | } 249 | ], 250 | "source": [ 251 | "bioresponce.head(5)" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": 3, 257 | "metadata": {}, 258 | "outputs": [], 259 | "source": [ 260 | "y = bioresponce.Activity.values" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": 4, 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "X = bioresponce.iloc[:, 1:]" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 6, 275 | "metadata": {}, 276 | "outputs": [], 277 | "source": [ 278 | "from sklearn.model_selection import train_test_split\n", 279 | "\n", 280 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "### Строим модель и оцениваем качество" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | 
"execution_count": 7, 293 | "metadata": {}, 294 | "outputs": [], 295 | "source": [ 296 | "from sklearn.linear_model import LogisticRegression" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": 8, 302 | "metadata": {}, 303 | "outputs": [], 304 | "source": [ 305 | "model = LogisticRegression()\n", 306 | "model.fit(X_train, y_train)\n", 307 | "preds = model.predict(X_test)" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": 9, 313 | "metadata": {}, 314 | "outputs": [ 315 | { 316 | "data": { 317 | "text/plain": [ 318 | "numpy.ndarray" 319 | ] 320 | }, 321 | "execution_count": 9, 322 | "metadata": {}, 323 | "output_type": "execute_result" 324 | } 325 | ], 326 | "source": [ 327 | "type(preds)" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 10, 333 | "metadata": {}, 334 | "outputs": [ 335 | { 336 | "data": { 337 | "text/plain": [ 338 | "1" 339 | ] 340 | }, 341 | "execution_count": 10, 342 | "metadata": {}, 343 | "output_type": "execute_result" 344 | } 345 | ], 346 | "source": [ 347 | "10 // 9" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": 11, 353 | "metadata": {}, 354 | "outputs": [ 355 | { 356 | "name": "stdout", 357 | "output_type": "stream", 358 | "text": [ 359 | "0.7560581583198708\n" 360 | ] 361 | } 362 | ], 363 | "source": [ 364 | "print(sum(preds == y_test) / len(preds))" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": 12, 370 | "metadata": {}, 371 | "outputs": [ 372 | { 373 | "name": "stdout", 374 | "output_type": "stream", 375 | "text": [ 376 | "0.7560581583198708\n" 377 | ] 378 | } 379 | ], 380 | "source": [ 381 | "print(sum(preds == y_test) / float(len(preds)))" 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 13, 387 | "metadata": {}, 388 | "outputs": [ 389 | { 390 | "name": "stdout", 391 | "output_type": "stream", 392 | "text": [ 393 | "0.7560581583198708\n" 394 | ] 395 | } 396 | ], 397 | 
"source": [ 398 | "from sklearn.metrics import accuracy_score\n", 399 | "\n", 400 | "print(accuracy_score(preds, y_test))" 401 | ] 402 | }, 403 | { 404 | "cell_type": "markdown", 405 | "metadata": {}, 406 | "source": [ 407 | "### Качество на кросс-валидации" 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": 14, 413 | "metadata": {}, 414 | "outputs": [ 415 | { 416 | "name": "stdout", 417 | "output_type": "stream", 418 | "text": [ 419 | "[0.74404762 0.73956262 0.72310757 0.75099602 0.75896414]\n" 420 | ] 421 | } 422 | ], 423 | "source": [ 424 | "from sklearn.model_selection import cross_val_score\n", 425 | "\n", 426 | "print(cross_val_score(model, X_train, y_train, cv=5))" 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": 15, 432 | "metadata": {}, 433 | "outputs": [ 434 | { 435 | "name": "stdout", 436 | "output_type": "stream", 437 | "text": [ 438 | "0.7433355944771515\n" 439 | ] 440 | } 441 | ], 442 | "source": [ 443 | "print(cross_val_score(model, X_train, y_train, cv=5).mean())" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": {}, 449 | "source": [ 450 | "### Пробуем другие классификаторы" 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 16, 456 | "metadata": {}, 457 | "outputs": [], 458 | "source": [ 459 | "from sklearn.neighbors import KNeighborsClassifier\n", 460 | "from sklearn.tree import DecisionTreeClassifier\n", 461 | "from sklearn.svm import LinearSVC\n", 462 | "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n" 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": 17, 468 | "metadata": {}, 469 | "outputs": [ 470 | { 471 | "name": "stdout", 472 | "output_type": "stream", 473 | "text": [ 474 | "0.7189014539579968 KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", 475 | " metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n", 476 | " weights='uniform')\n", 477 | 
"0.7059773828756059 DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,\n", 478 | " max_features=None, max_leaf_nodes=None,\n", 479 | " min_impurity_decrease=0.0, min_impurity_split=None,\n", 480 | " min_samples_leaf=1, min_samples_split=2,\n", 481 | " min_weight_fraction_leaf=0.0, presort=False, random_state=None,\n", 482 | " splitter='best')\n", 483 | "0.7431340872374798 LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,\n", 484 | " intercept_scaling=1, loss='squared_hinge', max_iter=1000,\n", 485 | " multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,\n", 486 | " verbose=0)\n", 487 | "0.789983844911147 RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", 488 | " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", 489 | " min_impurity_decrease=0.0, min_impurity_split=None,\n", 490 | " min_samples_leaf=1, min_samples_split=2,\n", 491 | " min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,\n", 492 | " oob_score=False, random_state=None, verbose=0,\n", 493 | " warm_start=False)\n", 494 | "0.778675282714055 GradientBoostingClassifier(criterion='friedman_mse', init=None,\n", 495 | " learning_rate=0.1, loss='deviance', max_depth=3,\n", 496 | " max_features=None, max_leaf_nodes=None,\n", 497 | " min_impurity_decrease=0.0, min_impurity_split=None,\n", 498 | " min_samples_leaf=1, min_samples_split=2,\n", 499 | " min_weight_fraction_leaf=0.0, n_estimators=100,\n", 500 | " presort='auto', random_state=None, subsample=1.0, verbose=0,\n", 501 | " warm_start=False)\n", 502 | "CPU times: user 25.9 s, sys: 900 ms, total: 26.8 s\n", 503 | "Wall time: 25.9 s\n" 504 | ] 505 | } 506 | ], 507 | "source": [ 508 | "%%time\n", 509 | "\n", 510 | "models = [\n", 511 | " KNeighborsClassifier(),\n", 512 | " DecisionTreeClassifier(),\n", 513 | " LinearSVC(),\n", 514 | " RandomForestClassifier(n_estimators=100), \n", 515 | " GradientBoostingClassifier(n_estimators=100)\n", 516 | "]\n", 517 | "\n", 
518 | "for model in models:\n", 519 | " model.fit(X_train, y_train)\n", 520 | " preds = model.predict(X_test)\n", 521 | " print(accuracy_score(preds, y_test), model)" 522 | ] 523 | }, 524 | { 525 | "cell_type": "code", 526 | "execution_count": null, 527 | "metadata": {}, 528 | "outputs": [], 529 | "source": [] 530 | } 531 | ], 532 | "metadata": { 533 | "anaconda-cloud": {}, 534 | "kernelspec": { 535 | "display_name": "dmia", 536 | "language": "python", 537 | "name": "dmia" 538 | }, 539 | "language_info": { 540 | "codemirror_mode": { 541 | "name": "ipython", 542 | "version": 3 543 | }, 544 | "file_extension": ".py", 545 | "mimetype": "text/x-python", 546 | "name": "python", 547 | "nbconvert_exporter": "python", 548 | "pygments_lexer": "ipython3", 549 | "version": "3.6.6" 550 | } 551 | }, 552 | "nbformat": 4, 553 | "nbformat_minor": 1 554 | } 555 | -------------------------------------------------------------------------------- /seminar02/05_Reference_BiasVariance.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Bias Variance" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "![title](ml_bias_variance.png)" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "Мы не будем выписывать строгие формулы, но попробуем объяснить идею этих понятий.\n", 22 | "\n", 23 | "Пусть у нас есть алгоритм обучения, который по данным может создать модель.\n", 24 | "\n", 25 | "Ошибка этих моделей может быть разложена на три части:\n", 26 | "* **Noise** – шум данных, не предсказуем, теоретический минимум ошибки\n", 27 | "* **Bias** – смещение, на сколько хорошо работает средний алгоритм. 
Средний алгоритм это \"возьмём случайные данные, обучим алгоритм, сделаем предсказания\", **Bias** – это ошибка средних предсказаний.\n", 28 | "* **Variance** – разброс, на сколько устойчиво работает алгоритм. Опять же \"возьмём случайные данные, обучим алгоритм, сделаем предсказания\", **Variance** – это разрос этих предсказаний." 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "# Бустинг и Бэггинг в терминах Bias и Variance" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "Как вы думаете на какую составляющую Бустинг и Бэггинг влияют, а на какую нет?" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 1, 48 | "metadata": { 49 | "collapsed": true 50 | }, 51 | "outputs": [], 52 | "source": [ 53 | "import pandas as pd\n", 54 | "import numpy as np\n", 55 | "from sklearn.model_selection import cross_val_score, train_test_split\n", 56 | "from xgboost import XGBRegressor\n", 57 | "from catboost import CatBoostRegressor\n", 58 | "from lightgbm import LGBMRegressor\n", 59 | "from sklearn.ensemble import RandomForestRegressor\n", 60 | "from sklearn.tree import DecisionTreeRegressor\n", 61 | "from sklearn.linear_model import LinearRegression\n", 62 | "\n", 63 | "import warnings\n", 64 | "warnings.filterwarnings('ignore')" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": 2, 70 | "metadata": { 71 | "collapsed": true 72 | }, 73 | "outputs": [], 74 | "source": [ 75 | "data = pd.read_csv('HR.csv')" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 3, 81 | "metadata": {}, 82 | "outputs": [ 83 | { 84 | "data": { 85 | "text/html": [ 86 | "
\n", 87 | "\n", 100 | "\n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | "
last_evaluationnumber_projectaverage_montly_hourstime_spend_companyWork_accidentleftpromotion_last_5years
00.5321573010
10.8652626000
20.8872724010
30.8752235010
40.5221593010
\n", 166 | "
" 167 | ], 168 | "text/plain": [ 169 | " last_evaluation number_project average_montly_hours time_spend_company \\\n", 170 | "0 0.53 2 157 3 \n", 171 | "1 0.86 5 262 6 \n", 172 | "2 0.88 7 272 4 \n", 173 | "3 0.87 5 223 5 \n", 174 | "4 0.52 2 159 3 \n", 175 | "\n", 176 | " Work_accident left promotion_last_5years \n", 177 | "0 0 1 0 \n", 178 | "1 0 0 0 \n", 179 | "2 0 1 0 \n", 180 | "3 0 1 0 \n", 181 | "4 0 1 0 " 182 | ] 183 | }, 184 | "execution_count": 3, 185 | "metadata": {}, 186 | "output_type": "execute_result" 187 | } 188 | ], 189 | "source": [ 190 | "data.head()" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": 4, 196 | "metadata": { 197 | "scrolled": true 198 | }, 199 | "outputs": [], 200 | "source": [ 201 | "X, y = data.drop('left', axis=1).values, data['left'].values" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 5, 207 | "metadata": {}, 208 | "outputs": [ 209 | { 210 | "data": { 211 | "text/plain": [ 212 | "array([1, 0])" 213 | ] 214 | }, 215 | "execution_count": 5, 216 | "metadata": {}, 217 | "output_type": "execute_result" 218 | } 219 | ], 220 | "source": [ 221 | "data['left'].unique()" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": 6, 227 | "metadata": { 228 | "collapsed": true 229 | }, 230 | "outputs": [], 231 | "source": [ 232 | "X_train, X_test, y_train, y_test = train_test_split(X, y)" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 7, 238 | "metadata": { 239 | "collapsed": true 240 | }, 241 | "outputs": [], 242 | "source": [ 243 | "def sample_model(seed, model):\n", 244 | " random_gen = np.random.RandomState(seed)\n", 245 | " indices = random_gen.choice(len(y_train), size=len(y_train), replace=True)\n", 246 | " model.fit(X_train[indices, :], y_train[indices])\n", 247 | " return model\n", 248 | "\n", 249 | "def estimate_bias_variance(model, iters_count=100):\n", 250 | " y_preds = []\n", 251 | " for seed in range(iters_count):\n", 
252 | " model = sample_model(seed, model)\n", 253 | " y_preds.append(model.predict(X_test))\n", 254 | " y_preds = np.array(y_preds)\n", 255 | " \n", 256 | " print('Bias:', np.mean((y_test - y_preds.mean(axis=0)) ** 2))\n", 257 | " print('Variance:', y_preds.std(axis=0).mean())" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "**Линейная регрессия**" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 8, 270 | "metadata": {}, 271 | "outputs": [ 272 | { 273 | "name": "stdout", 274 | "output_type": "stream", 275 | "text": [ 276 | "Bias: 0.22539321164615467\n", 277 | "Variance: 0.010711666687293465\n" 278 | ] 279 | } 280 | ], 281 | "source": [ 282 | "estimate_bias_variance(LinearRegression())" 283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": {}, 288 | "source": [ 289 | "**Решающее дерево с max_depth=5**" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": 9, 295 | "metadata": {}, 296 | "outputs": [ 297 | { 298 | "name": "stdout", 299 | "output_type": "stream", 300 | "text": [ 301 | "Bias: 0.17343635344369013\n", 302 | "Variance: 0.04434523236701086\n" 303 | ] 304 | } 305 | ], 306 | "source": [ 307 | "estimate_bias_variance(DecisionTreeRegressor(max_depth=5))" 308 | ] 309 | }, 310 | { 311 | "cell_type": "markdown", 312 | "metadata": {}, 313 | "source": [ 314 | "**Решающее дерево с max_depth=10**" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 10, 320 | "metadata": {}, 321 | "outputs": [ 322 | { 323 | "name": "stdout", 324 | "output_type": "stream", 325 | "text": [ 326 | "Bias: 0.17175575739495175\n", 327 | "Variance: 0.11712092704487344\n" 328 | ] 329 | } 330 | ], 331 | "source": [ 332 | "estimate_bias_variance(DecisionTreeRegressor(max_depth=10))" 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "**Решающее дерево с max_depth=15**" 340 | ] 341 | }, 342 | { 343 
| "cell_type": "code", 344 | "execution_count": 11, 345 | "metadata": {}, 346 | "outputs": [ 347 | { 348 | "name": "stdout", 349 | "output_type": "stream", 350 | "text": [ 351 | "Bias: 0.17842598190450087\n", 352 | "Variance: 0.21661949936646008\n" 353 | ] 354 | } 355 | ], 356 | "source": [ 357 | "estimate_bias_variance(DecisionTreeRegressor(max_depth=15))" 358 | ] 359 | }, 360 | { 361 | "cell_type": "markdown", 362 | "metadata": {}, 363 | "source": [ 364 | "**Решающее дерево без ограничения глубины**" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": 12, 370 | "metadata": {}, 371 | "outputs": [ 372 | { 373 | "name": "stdout", 374 | "output_type": "stream", 375 | "text": [ 376 | "Bias: 0.2069107045423811\n", 377 | "Variance: 0.32457384418180296\n" 378 | ] 379 | } 380 | ], 381 | "source": [ 382 | "estimate_bias_variance(DecisionTreeRegressor(max_depth=None))" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "**Случайный лес n_estimators=1**" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 13, 395 | "metadata": {}, 396 | "outputs": [ 397 | { 398 | "name": "stdout", 399 | "output_type": "stream", 400 | "text": [ 401 | "Bias: 0.19463122486057333\n", 402 | "Variance: 0.35705073628637773\n" 403 | ] 404 | } 405 | ], 406 | "source": [ 407 | "estimate_bias_variance(RandomForestRegressor(n_estimators=1, random_state=42))" 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "metadata": {}, 413 | "source": [ 414 | "**Случайный лес n_estimators=10**" 415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": 14, 420 | "metadata": {}, 421 | "outputs": [ 422 | { 423 | "name": "stdout", 424 | "output_type": "stream", 425 | "text": [ 426 | "Bias: 0.19311294566535084\n", 427 | "Variance: 0.17229587181057013\n" 428 | ] 429 | } 430 | ], 431 | "source": [ 432 | "estimate_bias_variance(RandomForestRegressor(n_estimators=10, random_state=42))" 433 | ] 
434 | }, 435 | { 436 | "cell_type": "markdown", 437 | "metadata": {}, 438 | "source": [ 439 | "**Случайный лес n_estimators=50**" 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": 15, 445 | "metadata": {}, 446 | "outputs": [ 447 | { 448 | "name": "stdout", 449 | "output_type": "stream", 450 | "text": [ 451 | "Bias: 0.19315888675365975\n", 452 | "Variance: 0.14255099142514835\n" 453 | ] 454 | } 455 | ], 456 | "source": [ 457 | "estimate_bias_variance(RandomForestRegressor(n_estimators=50, random_state=42))" 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": {}, 463 | "source": [ 464 | "**XGBRegressor**" 465 | ] 466 | }, 467 | { 468 | "cell_type": "markdown", 469 | "metadata": {}, 470 | "source": [ 471 | "**Бустинг над деревьями max_depth=20**" 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": 16, 477 | "metadata": {}, 478 | "outputs": [ 479 | { 480 | "name": "stdout", 481 | "output_type": "stream", 482 | "text": [ 483 | "Bias: 0.23515768239943852\n", 484 | "Variance: 0.022880817\n" 485 | ] 486 | } 487 | ], 488 | "source": [ 489 | "estimate_bias_variance(XGBRegressor(n_estimators=1, max_depth=20))" 490 | ] 491 | }, 492 | { 493 | "cell_type": "markdown", 494 | "metadata": {}, 495 | "source": [ 496 | "**Бустинг над деревьями max_depth=10**" 497 | ] 498 | }, 499 | { 500 | "cell_type": "code", 501 | "execution_count": 17, 502 | "metadata": {}, 503 | "outputs": [ 504 | { 505 | "name": "stdout", 506 | "output_type": "stream", 507 | "text": [ 508 | "Bias: 0.23460312600664116\n", 509 | "Variance: 0.01066339\n" 510 | ] 511 | } 512 | ], 513 | "source": [ 514 | "estimate_bias_variance(XGBRegressor(n_estimators=1, max_depth=10))" 515 | ] 516 | }, 517 | { 518 | "cell_type": "markdown", 519 | "metadata": {}, 520 | "source": [ 521 | "**Бустинг над деревьями max_depth=5**" 522 | ] 523 | }, 524 | { 525 | "cell_type": "code", 526 | "execution_count": 18, 527 | "metadata": {}, 528 | "outputs": [ 529 | { 530 | 
"name": "stdout", 531 | "output_type": "stream", 532 | "text": [ 533 | "Bias: 0.23539367931741623\n", 534 | "Variance: 0.004351252\n" 535 | ] 536 | } 537 | ], 538 | "source": [ 539 | "estimate_bias_variance(XGBRegressor(n_estimators=1, max_depth=5))" 540 | ] 541 | }, 542 | { 543 | "cell_type": "markdown", 544 | "metadata": {}, 545 | "source": [ 546 | "**Бустинг над деревьями n_estimators=10**" 547 | ] 548 | }, 549 | { 550 | "cell_type": "code", 551 | "execution_count": 19, 552 | "metadata": {}, 553 | "outputs": [ 554 | { 555 | "name": "stdout", 556 | "output_type": "stream", 557 | "text": [ 558 | "Bias: 0.18128282857388248\n", 559 | "Variance: 0.019852929\n" 560 | ] 561 | } 562 | ], 563 | "source": [ 564 | "estimate_bias_variance(XGBRegressor(n_estimators=10, max_depth=5))" 565 | ] 566 | }, 567 | { 568 | "cell_type": "markdown", 569 | "metadata": {}, 570 | "source": [ 571 | "**Бустинг над деревьями n_estimators=100**" 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": 20, 577 | "metadata": {}, 578 | "outputs": [ 579 | { 580 | "name": "stdout", 581 | "output_type": "stream", 582 | "text": [ 583 | "Bias: 0.17182469278418883\n", 584 | "Variance: 0.05643562\n" 585 | ] 586 | } 587 | ], 588 | "source": [ 589 | "estimate_bias_variance(XGBRegressor(n_estimators=100, max_depth=5))" 590 | ] 591 | }, 592 | { 593 | "cell_type": "markdown", 594 | "metadata": {}, 595 | "source": [ 596 | "**CatBoostRegressor**" 597 | ] 598 | }, 599 | { 600 | "cell_type": "code", 601 | "execution_count": 21, 602 | "metadata": {}, 603 | "outputs": [ 604 | { 605 | "name": "stdout", 606 | "output_type": "stream", 607 | "text": [ 608 | "Bias: 0.3467385908579134\n", 609 | "Variance: 0.0006835754697775419\n" 610 | ] 611 | } 612 | ], 613 | "source": [ 614 | "estimate_bias_variance(CatBoostRegressor(n_estimators=1, max_depth=6, verbose=False))" 615 | ] 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": 22, 620 | "metadata": {}, 621 | "outputs": [ 622 | { 623 | 
"name": "stdout", 624 | "output_type": "stream", 625 | "text": [ 626 | "Bias: 0.2801365481999864\n", 627 | "Variance: 0.0058176648330751516\n" 628 | ] 629 | } 630 | ], 631 | "source": [ 632 | "estimate_bias_variance(CatBoostRegressor(n_estimators=10, max_depth=6, verbose=False))" 633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "execution_count": 23, 638 | "metadata": {}, 639 | "outputs": [ 640 | { 641 | "name": "stdout", 642 | "output_type": "stream", 643 | "text": [ 644 | "Bias: 0.17608858395709903\n", 645 | "Variance: 0.019651480967338052\n" 646 | ] 647 | } 648 | ], 649 | "source": [ 650 | "estimate_bias_variance(CatBoostRegressor(n_estimators=100, max_depth=6, verbose=False))" 651 | ] 652 | }, 653 | { 654 | "cell_type": "markdown", 655 | "metadata": {}, 656 | "source": [ 657 | "**LGBMRegressor**" 658 | ] 659 | }, 660 | { 661 | "cell_type": "code", 662 | "execution_count": 24, 663 | "metadata": {}, 664 | "outputs": [ 665 | { 666 | "name": "stdout", 667 | "output_type": "stream", 668 | "text": [ 669 | "Bias: 0.2193821065837484\n", 670 | "Variance: 0.006619492577068203\n" 671 | ] 672 | } 673 | ], 674 | "source": [ 675 | "estimate_bias_variance(LGBMRegressor(n_estimators=1, max_depth=5))" 676 | ] 677 | }, 678 | { 679 | "cell_type": "code", 680 | "execution_count": 25, 681 | "metadata": {}, 682 | "outputs": [ 683 | { 684 | "name": "stdout", 685 | "output_type": "stream", 686 | "text": [ 687 | "Bias: 0.18007513755165383\n", 688 | "Variance: 0.020490996273996646\n" 689 | ] 690 | } 691 | ], 692 | "source": [ 693 | "estimate_bias_variance(LGBMRegressor(n_estimators=10, max_depth=5))" 694 | ] 695 | }, 696 | { 697 | "cell_type": "code", 698 | "execution_count": 26, 699 | "metadata": {}, 700 | "outputs": [ 701 | { 702 | "name": "stdout", 703 | "output_type": "stream", 704 | "text": [ 705 | "Bias: 0.1717763977728146\n", 706 | "Variance: 0.054249435185578586\n" 707 | ] 708 | } 709 | ], 710 | "source": [ 711 | "estimate_bias_variance(LGBMRegressor(n_estimators=100, 
max_depth=5))" 712 | ] 713 | } 714 | ], 715 | "metadata": { 716 | "anaconda-cloud": {}, 717 | "kernelspec": { 718 | "display_name": "venv_DMIA", 719 | "language": "python", 720 | "name": "venv_dmia" 721 | }, 722 | "language_info": { 723 | "codemirror_mode": { 724 | "name": "ipython", 725 | "version": 3 726 | }, 727 | "file_extension": ".py", 728 | "mimetype": "text/x-python", 729 | "name": "python", 730 | "nbconvert_exporter": "python", 731 | "pygments_lexer": "ipython3", 732 | "version": "3.6.3" 733 | } 734 | }, 735 | "nbformat": 4, 736 | "nbformat_minor": 1 737 | } 738 | -------------------------------------------------------------------------------- /seminar01/03_Main_NaiveBayes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Naive Bayes" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "http://scikit-learn.org/stable/modules/naive_bayes.html\n", 15 | "\n", 16 | "Методы sklearn.naive_bayes - это набор методов, которые основаны на применении теоремы Байеса с \"наивным\" предположением о условной независимости любой пары признаков $(x_i, x_j)$ при условии выбранного значения целевой переменной $y$. 
\n", 17 | "\n", 18 | "Теорема Байеса утверждает, что:\n", 19 | "\n", 20 | "$$P(y \\mid x_1, \\dots, x_n) = \\frac{P(y) P(x_1, \\dots x_n \\mid y)}\n", 21 | " {P(x_1, \\dots, x_n)}$$\n", 22 | "\n", 23 | "Используя \"наивное\" предположение о условной независимости пар признаков\n", 24 | "\n", 25 | "$$P(x_i | y, x_1, \\dots, x_{i-1}, x_{i+1}, \\dots, x_n) = P(x_i | y),$$\n", 26 | "\n", 27 | "можно преобразовать теорему Байеса к\n", 28 | "\n", 29 | "$$P(y \\mid x_1, \\dots, x_n) = \\frac{P(y) \\prod_{i=1}^{n} P(x_i \\mid y)}\n", 30 | " {P(x_1, \\dots, x_n)}$$\n", 31 | "\n", 32 | "Поскольку $P(x_1, \\dots, x_n)$ - известные значения признаков для выбранного объекта, мы можем использовать следующее правило:\n", 33 | "\n", 34 | "$$\\hat{y} = \\arg\\max_y P(y) \\prod_{i=1}^{n} P(x_i \\mid y),$$" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 1, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "import scipy\n", 44 | "import numpy as np" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "Скачиваем данные, в sklearn есть модуль datasets, он предоставляет широкий спектр наборов данных" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 2, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "from sklearn.datasets import load_iris\n", 61 | "\n", 62 | "data = load_iris()" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "# BernoulliNB" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "Для оценки вероятности $P(x_i \\mid y)$ бинаризуем признаки, а затем рассчитаем её как\n", 77 | "\n", 78 | "$$P(x_i \\mid y) = P(x_i \\mid y) x_i + (1 - P(x_i \\mid y)) (1 - x_i)$$" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 3, 84 | "metadata": {}, 85 | "outputs": [ 86 | { 87 | "data": { 88 | "text/plain": [ 89 | "array([0.33333333, 0.33333333, 0.33333333])" 90 | ] 
91 | }, 92 | "execution_count": 3, 93 | "metadata": {}, 94 | "output_type": "execute_result" 95 | } 96 | ], 97 | "source": [ 98 | "from sklearn.naive_bayes import BernoulliNB\n", 99 | "from sklearn.model_selection import cross_val_score\n", 100 | "\n", 101 | "cross_val_score(BernoulliNB(), data.data, data.target)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 4, 107 | "metadata": {}, 108 | "outputs": [ 109 | { 110 | "data": { 111 | "text/plain": [ 112 | "array([0.39215686, 0.35294118, 0.375 ])" 113 | ] 114 | }, 115 | "execution_count": 4, 116 | "metadata": {}, 117 | "output_type": "execute_result" 118 | } 119 | ], 120 | "source": [ 121 | "cross_val_score(BernoulliNB(binarize=0.1), data.data, data.target)" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 5, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | "data": { 131 | "text/plain": [ 132 | "array([0.66666667, 0.66666667, 0.66666667])" 133 | ] 134 | }, 135 | "execution_count": 5, 136 | "metadata": {}, 137 | "output_type": "execute_result" 138 | } 139 | ], 140 | "source": [ 141 | "cross_val_score(BernoulliNB(binarize=1.), data.data, data.target)" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 6, 147 | "metadata": {}, 148 | "outputs": [ 149 | { 150 | "data": { 151 | "text/plain": [ 152 | "array([0.66666667, 0.66666667, 0.66666667])" 153 | ] 154 | }, 155 | "execution_count": 6, 156 | "metadata": {}, 157 | "output_type": "execute_result" 158 | } 159 | ], 160 | "source": [ 161 | "cross_val_score(BernoulliNB(binarize=1., alpha=0.1), data.data, data.target)" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "В sklearn есть стандартная функция для поиска наилучших значений параметров модели" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 7, 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [ 177 | "from sklearn.model_selection import 
GridSearchCV\n", 178 | "\n", 179 | "res = GridSearchCV(BernoulliNB(), param_grid={\n", 180 | " 'binarize': [0.,0.1, 0.5, 1., 2, 10, 100.],\n", 181 | " 'alpha': [0.1, 0.5, 1., 2, 10., 100.]\n", 182 | "}, cv=3).fit(data.data, data.target)" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 8, 188 | "metadata": {}, 189 | "outputs": [ 190 | { 191 | "data": { 192 | "text/plain": [ 193 | "({'alpha': 0.1, 'binarize': 2}, 0.82)" 194 | ] 195 | }, 196 | "execution_count": 8, 197 | "metadata": {}, 198 | "output_type": "execute_result" 199 | } 200 | ], 201 | "source": [ 202 | "res.best_params_, res.best_score_" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 9, 208 | "metadata": {}, 209 | "outputs": [ 210 | { 211 | "data": { 212 | "text/plain": [ 213 | "0.8206699346405228" 214 | ] 215 | }, 216 | "execution_count": 9, 217 | "metadata": {}, 218 | "output_type": "execute_result" 219 | } 220 | ], 221 | "source": [ 222 | "cross_val_score(BernoulliNB(binarize=2., alpha=0.1), data.data, data.target).mean()" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "Добавим параметры" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 10, 235 | "metadata": {}, 236 | "outputs": [ 237 | { 238 | "data": { 239 | "text/plain": [ 240 | "({'alpha': 0.1, 'binarize': 2, 'fit_prior': False}, 0.82)" 241 | ] 242 | }, 243 | "execution_count": 10, 244 | "metadata": {}, 245 | "output_type": "execute_result" 246 | } 247 | ], 248 | "source": [ 249 | "res = GridSearchCV(BernoulliNB(), param_grid={\n", 250 | " 'binarize': [0.,0.1, 0.5, 1., 2, 10, 100.],\n", 251 | " 'alpha': [0.1, 0.5, 1., 2, 10., 100.],\n", 252 | " 'fit_prior': [False, True]\n", 253 | "}, cv=3).fit(data.data, data.target)\n", 254 | "res.best_params_, res.best_score_" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "Иногда по сетке перебирать параметры слишком 
избыточно, поэтому имеет смысл использовать RandomizedSearchCV" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 11, 267 | "metadata": {}, 268 | "outputs": [ 269 | { 270 | "data": { 271 | "text/plain": [ 272 | "({'alpha': 2.4012459939700026,\n", 273 | " 'binarize': 1.6139166699914442,\n", 274 | " 'fit_prior': True},\n", 275 | " 0.92)" 276 | ] 277 | }, 278 | "execution_count": 11, 279 | "metadata": {}, 280 | "output_type": "execute_result" 281 | } 282 | ], 283 | "source": [ 284 | "from sklearn.model_selection import RandomizedSearchCV\n", 285 | "\n", 286 | "res = RandomizedSearchCV(BernoulliNB(), param_distributions={\n", 287 | " 'binarize': scipy.stats.uniform(0, 10),\n", 288 | " 'alpha': scipy.stats.uniform(0, 10),\n", 289 | " 'fit_prior': [False, True]\n", 290 | "}, cv=3).fit(data.data, data.target)\n", 291 | "res.best_params_, res.best_score_" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "Первый вариант - локализованный перебор вокруг найденных значений" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": 12, 304 | "metadata": {}, 305 | "outputs": [ 306 | { 307 | "data": { 308 | "text/plain": [ 309 | "({'alpha': 0.09516398060726261,\n", 310 | " 'binarize': 1.7114973962724238,\n", 311 | " 'fit_prior': False},\n", 312 | " 0.9466666666666667)" 313 | ] 314 | }, 315 | "execution_count": 12, 316 | "metadata": {}, 317 | "output_type": "execute_result" 318 | } 319 | ], 320 | "source": [ 321 | "res = RandomizedSearchCV(BernoulliNB(), param_distributions={\n", 322 | " 'binarize': scipy.stats.uniform(1.5, 2.5),\n", 323 | " 'alpha': scipy.stats.uniform(0.05, 0.15),\n", 324 | " 'fit_prior': [False, True]\n", 325 | "}, cv=3).fit(data.data, data.target)\n", 326 | "res.best_params_, res.best_score_" 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "execution_count": 13, 332 | "metadata": {}, 333 | "outputs": [ 334 | { 335 | "data": { 336 | "text/plain": [ 337 | "({'alpha':
0.07339917805043039,\n", 338 | " 'binarize': 1.6452090304204987,\n", 339 | " 'fit_prior': True},\n", 340 | " 0.92)" 341 | ] 342 | }, 343 | "execution_count": 13, 344 | "metadata": {}, 345 | "output_type": "execute_result" 346 | } 347 | ], 348 | "source": [ 349 | "res = RandomizedSearchCV(BernoulliNB(), param_distributions={\n", 350 | " 'binarize': scipy.stats.uniform(1.5, 2.5),\n", 351 | " 'alpha': scipy.stats.uniform(0.05, 0.15),\n", 352 | " 'fit_prior': [False, True]\n", 353 | "}, cv=3, random_state=42).fit(data.data, data.target)\n", 354 | "res.best_params_, res.best_score_" 355 | ] 356 | }, 357 | { 358 | "cell_type": "markdown", 359 | "metadata": {}, 360 | "source": [ 361 | "Второй вариант - более широкие диапазоны и больше итераций" 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": 14, 367 | "metadata": {}, 368 | "outputs": [ 369 | { 370 | "data": { 371 | "text/plain": [ 372 | "({'alpha': 6.075448519014383,\n", 373 | " 'binarize': 1.7052412368729153,\n", 374 | " 'fit_prior': False},\n", 375 | " 0.9466666666666667)" 376 | ] 377 | }, 378 | "execution_count": 14, 379 | "metadata": {}, 380 | "output_type": "execute_result" 381 | } 382 | ], 383 | "source": [ 384 | "res = RandomizedSearchCV(BernoulliNB(), param_distributions={\n", 385 | " 'binarize': scipy.stats.uniform(0, 10),\n", 386 | " 'alpha': scipy.stats.uniform(0, 10),\n", 387 | " 'fit_prior': [False, True]\n", 388 | "}, cv=3, random_state=42, n_iter=1000).fit(data.data, data.target)\n", 389 | "res.best_params_, res.best_score_" 390 | ] 391 | }, 392 | { 393 | "cell_type": "markdown", 394 | "metadata": {}, 395 | "source": [ 396 | "# MultinomialNB" 397 | ] 398 | }, 399 | { 400 | "cell_type": "markdown", 401 | "metadata": {}, 402 | "source": [ 403 | "MultinomialNB реализует алгоритм наивного Байеса для признаков, распределенных мультиномиально. Хорошо работает для задач определения категории текста или детектирования спама.
На практике также неплохо работает с признаками после TF-IDF.\n", 404 | "\n", 405 | "Распределение параметризовано векторами $\\theta_y = (\\theta_{y1},\\ldots,\\theta_{yn})$ для каждого класса $y$, где $n$ - количество признаков, $\\theta_{yi}$ - вероятность $P(x_i \\mid y)$ появления признака $i$ в объекте класса $y$.\n", 406 | "\n", 407 | "Параметры $\\theta_y$ оцениваются с помощью сглаженной версии максимального правдоподобия, то есть относительной частоты встречаемости значений признаков (проще думать об этом как о встречаемости слов в документах класса $y$):\n", 408 | "\n", 409 | "$$\\hat{\\theta}_{yi} = \\frac{ N_{yi} + \\alpha}{N_y + \\alpha n}$$\n", 410 | "\n", 411 | "где $N_{yi} = \\sum_{x \\in T} x_i$ - число раз, когда признак $i$ встречается в объектах класса $y$ среди обучающей выборки $T$,\n", 412 | "и $N_{y} = \\sum_{i=1}^{n} N_{yi}$ - суммарное число вхождений всех признаков в объектах класса $y$.\n", 413 | "\n", 414 | "Сглаживающая константа $\\alpha \\ge 0$ обрабатывает случаи, когда признак не встречается для класса в обучающей выборке, и предотвращает нулевые вероятности в дальнейших вычислениях." 415 | ] 416 | }, 417 | { 418 | "cell_type": "markdown", 419 | "metadata": {}, 420 | "source": [ 421 | "Оценим качество MultinomialNB на той же задаче" 422 | ] 423 | }, 424 | { 425 | "cell_type": "code", 426 | "execution_count": 15, 427 | "metadata": {}, 428 | "outputs": [ 429 | { 430 | "data": { 431 | "text/plain": [ 432 | "array([1. , 0.88235294, 1.
])" 433 | ] 434 | }, 435 | "execution_count": 15, 436 | "metadata": {}, 437 | "output_type": "execute_result" 438 | } 439 | ], 440 | "source": [ 441 | "from sklearn.naive_bayes import MultinomialNB\n", 442 | "\n", 443 | "cross_val_score(MultinomialNB(), data.data, data.target)" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": {}, 449 | "source": [ 450 | "Подберём оптимальные параметры" 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 16, 456 | "metadata": {}, 457 | "outputs": [ 458 | { 459 | "name": "stdout", 460 | "output_type": "stream", 461 | "text": [ 462 | "{'alpha': 2.7, 'fit_prior': False} 0.9666666666666667\n", 463 | "CPU times: user 635 ms, sys: 5.22 ms, total: 640 ms\n", 464 | "Wall time: 640 ms\n" 465 | ] 466 | } 467 | ], 468 | "source": [ 469 | "%%time\n", 470 | "res = GridSearchCV(MultinomialNB(), param_grid={\n", 471 | " 'alpha': np.arange(0.1, 10.1, 0.1),\n", 472 | " 'fit_prior': [False, True]\n", 473 | "}, cv=3).fit(data.data, data.target)\n", 474 | "print(res.best_params_, res.best_score_)" 475 | ] 476 | }, 477 | { 478 | "cell_type": "markdown", 479 | "metadata": {}, 480 | "source": [ 481 | "# GaussianNB" 482 | ] 483 | }, 484 | { 485 | "cell_type": "markdown", 486 | "metadata": {}, 487 | "source": [ 488 | "В GaussianNB вероятность значений признака предполагается распределенной по Гауссу:\n", 489 | "\n", 490 | "$$P(x_i \\mid y) = \\frac{1}{\\sqrt{2\\pi\\sigma^2_y}} \\exp\\left(-\\frac{(x_i - \\mu_y)^2}{2\\sigma^2_y}\\right)$$\n", 491 | "\n", 492 | "Параметры $\\sigma_y$ and $\\mu_y$ оцениваются с помощью метода максимального правдоподобия." 
493 | ] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "metadata": {}, 498 | "source": [ 499 | "Добавим в сравнение GaussianNB" 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": 17, 505 | "metadata": {}, 506 | "outputs": [ 507 | { 508 | "data": { 509 | "text/plain": [ 510 | "array([0.92156863, 0.90196078, 0.97916667])" 511 | ] 512 | }, 513 | "execution_count": 17, 514 | "metadata": {}, 515 | "output_type": "execute_result" 516 | } 517 | ], 518 | "source": [ 519 | "from sklearn.naive_bayes import GaussianNB\n", 520 | "\n", 521 | "cross_val_score(GaussianNB(), data.data, data.target)" 522 | ] 523 | }, 524 | { 525 | "cell_type": "code", 526 | "execution_count": 18, 527 | "metadata": {}, 528 | "outputs": [ 529 | { 530 | "data": { 531 | "text/plain": [ 532 | "0.9342320261437909" 533 | ] 534 | }, 535 | "execution_count": 18, 536 | "metadata": {}, 537 | "output_type": "execute_result" 538 | } 539 | ], 540 | "source": [ 541 | "cross_val_score(GaussianNB(), data.data, data.target).mean()" 542 | ] 543 | }, 544 | { 545 | "cell_type": "markdown", 546 | "metadata": {}, 547 | "source": [ 548 | "# 20 newsgroups dataset" 549 | ] 550 | }, 551 | { 552 | "cell_type": "markdown", 553 | "metadata": {}, 554 | "source": [ 555 | "Проверим работу методов на другом наборе данных" 556 | ] 557 | }, 558 | { 559 | "cell_type": "code", 560 | "execution_count": 19, 561 | "metadata": {}, 562 | "outputs": [], 563 | "source": [ 564 | "from sklearn.datasets import fetch_20newsgroups_vectorized\n", 565 | "\n", 566 | "data = fetch_20newsgroups_vectorized(subset='all', remove=('headers', 'footers', 'quotes'))" 567 | ] 568 | }, 569 | { 570 | "cell_type": "code", 571 | "execution_count": 20, 572 | "metadata": {}, 573 | "outputs": [ 574 | { 575 | "data": { 576 | "text/plain": [ 577 | "<18846x101631 sparse matrix of type ''\n", 578 | "\twith 1769365 stored elements in Compressed Sparse Row format>" 579 | ] 580 | }, 581 | "execution_count": 20, 582 | "metadata": {}, 583 | 
"output_type": "execute_result" 584 | } 585 | ], 586 | "source": [ 587 | "data.data" 588 | ] 589 | }, 590 | { 591 | "cell_type": "code", 592 | "execution_count": 21, 593 | "metadata": {}, 594 | "outputs": [ 595 | { 596 | "data": { 597 | "text/plain": [ 598 | "array([0.46581876, 0.50334501, 0.46670914])" 599 | ] 600 | }, 601 | "execution_count": 21, 602 | "metadata": {}, 603 | "output_type": "execute_result" 604 | } 605 | ], 606 | "source": [ 607 | "cross_val_score(BernoulliNB(), data.data, data.target)" 608 | ] 609 | }, 610 | { 611 | "cell_type": "code", 612 | "execution_count": 22, 613 | "metadata": {}, 614 | "outputs": [ 615 | { 616 | "name": "stdout", 617 | "output_type": "stream", 618 | "text": [ 619 | "{'alpha': 0.07066305219717406, 'binarize': 0.023062425041415757, 'fit_prior': False} 0.6818423007534755\n", 620 | "CPU times: user 6.28 s, sys: 845 ms, total: 7.12 s\n", 621 | "Wall time: 7.16 s\n" 622 | ] 623 | } 624 | ], 625 | "source": [ 626 | "%%time\n", 627 | "res = RandomizedSearchCV(BernoulliNB(), param_distributions={\n", 628 | " 'binarize': scipy.stats.uniform(0, 1),\n", 629 | " 'alpha': scipy.stats.uniform(0, 10),\n", 630 | " 'fit_prior': [False, True]\n", 631 | "}, cv=3, random_state=42).fit(data.data, data.target)\n", 632 | "print(res.best_params_, res.best_score_)" 633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "execution_count": 23, 638 | "metadata": {}, 639 | "outputs": [ 640 | { 641 | "data": { 642 | "text/plain": [ 643 | "array([0.55023847, 0.54858235, 0.51258363])" 644 | ] 645 | }, 646 | "execution_count": 23, 647 | "metadata": {}, 648 | "output_type": "execute_result" 649 | } 650 | ], 651 | "source": [ 652 | "cross_val_score(MultinomialNB(), data.data, data.target)" 653 | ] 654 | }, 655 | { 656 | "cell_type": "code", 657 | "execution_count": 24, 658 | "metadata": {}, 659 | "outputs": [ 660 | { 661 | "name": "stdout", 662 | "output_type": "stream", 663 | "text": [ 664 | "{'alpha': 0.10778765841014329, 'fit_prior': True} 
0.6743075453677173\n", 665 | "CPU times: user 5.27 s, sys: 440 ms, total: 5.71 s\n", 666 | "Wall time: 5.75 s\n" 667 | ] 668 | } 669 | ], 670 | "source": [ 671 | "%%time\n", 672 | "res = RandomizedSearchCV(MultinomialNB(), param_distributions={\n", 673 | " 'alpha': scipy.stats.uniform(0.1, 10),\n", 674 | " 'fit_prior': [False, True]\n", 675 | "}, cv=3, random_state=42).fit(data.data, data.target)\n", 676 | "print(res.best_params_, res.best_score_)" 677 | ] 678 | }, 679 | { 680 | "cell_type": "code", 681 | "execution_count": 25, 682 | "metadata": {}, 683 | "outputs": [ 684 | { 685 | "ename": "TypeError", 686 | "evalue": "A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.", 687 | "output_type": "error", 688 | "traceback": [ 689 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 690 | "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", 691 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mcross_val_score\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mGaussianNB\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtarget\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 692 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_validation.py\u001b[0m in \u001b[0;36mcross_val_score\u001b[0;34m(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)\u001b[0m\n\u001b[1;32m 340\u001b[0m \u001b[0mn_jobs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mn_jobs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mverbose\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mverbose\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 341\u001b[0m 
\u001b[0mfit_params\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mfit_params\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 342\u001b[0;31m pre_dispatch=pre_dispatch)\n\u001b[0m\u001b[1;32m 343\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mcv_results\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'test_score'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 344\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 693 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_validation.py\u001b[0m in \u001b[0;36mcross_validate\u001b[0;34m(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score)\u001b[0m\n\u001b[1;32m 204\u001b[0m \u001b[0mfit_params\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mreturn_train_score\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mreturn_train_score\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 205\u001b[0m return_times=True)\n\u001b[0;32m--> 206\u001b[0;31m for train, test in cv.split(X, y, groups))\n\u001b[0m\u001b[1;32m 207\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 208\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mreturn_train_score\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 694 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, iterable)\u001b[0m\n\u001b[1;32m 777\u001b[0m \u001b[0;31m# was dispatched. 
In particular this covers the edge\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 778\u001b[0m \u001b[0;31m# case of Parallel used with an exhausted iterator.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 779\u001b[0;31m \u001b[0;32mwhile\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdispatch_one_batch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0miterator\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 780\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_iterating\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 781\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 695 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py\u001b[0m in \u001b[0;36mdispatch_one_batch\u001b[0;34m(self, iterator)\u001b[0m\n\u001b[1;32m 623\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 624\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 625\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_dispatch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtasks\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 626\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 627\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 696 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py\u001b[0m in \u001b[0;36m_dispatch\u001b[0;34m(self, batch)\u001b[0m\n\u001b[1;32m 586\u001b[0m \u001b[0mdispatch_timestamp\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mtime\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtime\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 587\u001b[0m \u001b[0mcb\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mBatchCompletionCallBack\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdispatch_timestamp\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mbatch\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 588\u001b[0;31m \u001b[0mjob\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_backend\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapply_async\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mbatch\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcallback\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 589\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jobs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mjob\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 590\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 697 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py\u001b[0m in \u001b[0;36mapply_async\u001b[0;34m(self, func, callback)\u001b[0m\n\u001b[1;32m 109\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mapply_async\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcallback\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 110\u001b[0m \u001b[0;34m\"\"\"Schedule a func to be run\"\"\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 111\u001b[0;31m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mImmediateResult\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfunc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 112\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcallback\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 113\u001b[0m \u001b[0mcallback\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 698 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, batch)\u001b[0m\n\u001b[1;32m 330\u001b[0m \u001b[0;31m# Don't delay the application, to avoid keeping the input\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 331\u001b[0m \u001b[0;31m# arguments in memory\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 332\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mresults\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbatch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 333\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 334\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 699 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 129\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 130\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__call__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 131\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwargs\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 132\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 133\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__len__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 700 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py\u001b[0m in \u001b[0;36m\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 129\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 130\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__call__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 131\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwargs\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 132\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 133\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__len__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 701 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_validation.py\u001b[0m in 
\u001b[0;36m_fit_and_score\u001b[0;34m(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, error_score)\u001b[0m\n\u001b[1;32m 456\u001b[0m \u001b[0mestimator\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mfit_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 457\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 458\u001b[0;31m \u001b[0mestimator\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mfit_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 459\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 460\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 702 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/naive_bayes.py\u001b[0m in \u001b[0;36mfit\u001b[0;34m(self, X, y, sample_weight)\u001b[0m\n\u001b[1;32m 181\u001b[0m \u001b[0mReturns\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 182\u001b[0m \"\"\"\n\u001b[0;32m--> 183\u001b[0;31m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_X_y\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 184\u001b[0m return self._partial_fit(X, y, np.unique(y), _refit=True,\n\u001b[1;32m 185\u001b[0m sample_weight=sample_weight)\n", 703 | 
"\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36mcheck_X_y\u001b[0;34m(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)\u001b[0m\n\u001b[1;32m 571\u001b[0m X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,\n\u001b[1;32m 572\u001b[0m \u001b[0mensure_2d\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mallow_nd\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mensure_min_samples\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 573\u001b[0;31m ensure_min_features, warn_on_dtype, estimator)\n\u001b[0m\u001b[1;32m 574\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mmulti_output\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 575\u001b[0m y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,\n", 704 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36mcheck_array\u001b[0;34m(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)\u001b[0m\n\u001b[1;32m 429\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0msp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0missparse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 430\u001b[0m array = _ensure_sparse_format(array, accept_sparse, dtype, copy,\n\u001b[0;32m--> 431\u001b[0;31m force_all_finite)\n\u001b[0m\u001b[1;32m 432\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 433\u001b[0m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcopy\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 705 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36m_ensure_sparse_format\u001b[0;34m(spmatrix, accept_sparse, dtype, copy, force_all_finite)\u001b[0m\n\u001b[1;32m 273\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 274\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0maccept_sparse\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 275\u001b[0;31m raise TypeError('A sparse matrix was passed, but dense '\n\u001b[0m\u001b[1;32m 276\u001b[0m \u001b[0;34m'data is required. Use X.toarray() to '\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 277\u001b[0m 'convert to a dense numpy array.')\n", 706 | "\u001b[0;31mTypeError\u001b[0m: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array." 
707 | ] 708 | } 709 | ], 710 | "source": [ 711 | "cross_val_score(GaussianNB(), data.data, data.target)" 712 | ] 713 | }, 714 | { 715 | "cell_type": "markdown", 716 | "metadata": {}, 717 | "source": [ 718 | "# Wine dataset" 719 | ] 720 | }, 721 | { 722 | "cell_type": "markdown", 723 | "metadata": {}, 724 | "source": [ 725 | "Может сложиться ощущение, что GaussianNB работает хуже, однако, когда признаки вещественные, это не так" 726 | ] 727 | }, 728 | { 729 | "cell_type": "code", 730 | "execution_count": 26, 731 | "metadata": {}, 732 | "outputs": [], 733 | "source": [ 734 | "from sklearn.datasets import load_wine\n", 735 | "\n", 736 | "data = load_wine()" 737 | ] 738 | }, 739 | { 740 | "cell_type": "code", 741 | "execution_count": 27, 742 | "metadata": {}, 743 | "outputs": [ 744 | { 745 | "data": { 746 | "text/plain": [ 747 | "array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,\n", 748 | " 1.065e+03],\n", 749 | " [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,\n", 750 | " 1.050e+03],\n", 751 | " [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,\n", 752 | " 1.185e+03],\n", 753 | " ...,\n", 754 | " [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,\n", 755 | " 8.350e+02],\n", 756 | " [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,\n", 757 | " 8.400e+02],\n", 758 | " [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,\n", 759 | " 5.600e+02]])" 760 | ] 761 | }, 762 | "execution_count": 27, 763 | "metadata": {}, 764 | "output_type": "execute_result" 765 | } 766 | ], 767 | "source": [ 768 | "data.data" 769 | ] 770 | }, 771 | { 772 | "cell_type": "code", 773 | "execution_count": 28, 774 | "metadata": {}, 775 | "outputs": [ 776 | { 777 | "data": { 778 | "text/plain": [ 779 | "array([0.4 , 0.4 , 0.39655172])" 780 | ] 781 | }, 782 | "execution_count": 28, 783 | "metadata": {}, 784 | "output_type": "execute_result" 785 | } 786 | ], 787 | "source": [ 788 | "cross_val_score(BernoulliNB(), data.data, 
data.target)" 789 | ] 790 | }, 791 | { 792 | "cell_type": "code", 793 | "execution_count": 29, 794 | "metadata": {}, 795 | "outputs": [ 796 | { 797 | "name": "stdout", 798 | "output_type": "stream", 799 | "text": [ 800 | "{'alpha': 3.559726786512616, 'binarize': 757.8461104643691, 'fit_prior': False} 0.6966292134831461\n", 801 | "CPU times: user 1.95 s, sys: 14.6 ms, total: 1.96 s\n", 802 | "Wall time: 1.96 s\n" 803 | ] 804 | } 805 | ], 806 | "source": [ 807 | "%%time\n", 808 | "res = RandomizedSearchCV(BernoulliNB(), param_distributions={\n", 809 | " 'binarize': scipy.stats.uniform(0, 1000),\n", 810 | " 'alpha': scipy.stats.uniform(0, 10),\n", 811 | " 'fit_prior': [False, True]\n", 812 | "}, cv=3, random_state=42, n_iter=500).fit(data.data, data.target)\n", 813 | "print(res.best_params_, res.best_score_)" 814 | ] 815 | }, 816 | { 817 | "cell_type": "code", 818 | "execution_count": 30, 819 | "metadata": {}, 820 | "outputs": [ 821 | { 822 | "data": { 823 | "text/plain": [ 824 | "array([0.71666667, 0.81666667, 0.96551724])" 825 | ] 826 | }, 827 | "execution_count": 30, 828 | "metadata": {}, 829 | "output_type": "execute_result" 830 | } 831 | ], 832 | "source": [ 833 | "cross_val_score(MultinomialNB(), data.data, data.target)" 834 | ] 835 | }, 836 | { 837 | "cell_type": "code", 838 | "execution_count": 31, 839 | "metadata": {}, 840 | "outputs": [ 841 | { 842 | "name": "stdout", 843 | "output_type": "stream", 844 | "text": [ 845 | "{'alpha': 3.745401188473625, 'fit_prior': False} 0.8426966292134831\n", 846 | "CPU times: user 1.75 s, sys: 12.9 ms, total: 1.77 s\n", 847 | "Wall time: 1.77 s\n" 848 | ] 849 | } 850 | ], 851 | "source": [ 852 | "%%time\n", 853 | "res = RandomizedSearchCV(MultinomialNB(), param_distributions={\n", 854 | " 'alpha': scipy.stats.uniform(0, 10),\n", 855 | " 'fit_prior': [False, True]\n", 856 | "}, cv=3, random_state=42, n_iter=500).fit(data.data, data.target)\n", 857 | "print(res.best_params_, res.best_score_)" 858 | ] 859 | }, 860 | { 861 | 
"cell_type": "code", 862 | "execution_count": 32, 863 | "metadata": {}, 864 | "outputs": [ 865 | { 866 | "data": { 867 | "text/plain": [ 868 | "array([0.95 , 0.96666667, 0.96551724])" 869 | ] 870 | }, 871 | "execution_count": 32, 872 | "metadata": {}, 873 | "output_type": "execute_result" 874 | } 875 | ], 876 | "source": [ 877 | "cross_val_score(GaussianNB(), data.data, data.target)" 878 | ] 879 | }, 880 | { 881 | "cell_type": "code", 882 | "execution_count": 33, 883 | "metadata": {}, 884 | "outputs": [ 885 | { 886 | "data": { 887 | "text/plain": [ 888 | "0.960727969348659" 889 | ] 890 | }, 891 | "execution_count": 33, 892 | "metadata": {}, 893 | "output_type": "execute_result" 894 | } 895 | ], 896 | "source": [ 897 | "cross_val_score(GaussianNB(), data.data, data.target).mean()" 898 | ] 899 | } 900 | ], 901 | "metadata": { 902 | "kernelspec": { 903 | "display_name": "Python 3", 904 | "language": "python", 905 | "name": "python3" 906 | }, 907 | "language_info": { 908 | "codemirror_mode": { 909 | "name": "ipython", 910 | "version": 3 911 | }, 912 | "file_extension": ".py", 913 | "mimetype": "text/x-python", 914 | "name": "python", 915 | "nbconvert_exporter": "python", 916 | "pygments_lexer": "ipython3", 917 | "version": "3.6.5" 918 | } 919 | }, 920 | "nbformat": 4, 921 | "nbformat_minor": 2 922 | } 923 | -------------------------------------------------------------------------------- /seminar01/06_Reference_Numpy.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Numpy" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "В Python есть встроенные:\n", 15 | " 1. списки и словари\n", 16 | " 2. 
числовые объекты (целые числа, числа с плавающей точкой)\n", 17 | "\n", 18 | "Numpy это дополнительный модуль для Python для работы с многомерными массивами и эффективных вычислений над числами.\n", 19 | "Эта библиотека ближе к hardware (использует типы из C, которые существенно быстрее, чем Python-типы), за счёт чего более эффективна при вычислениях." 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 12, 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "import numpy as np" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "## Основные типы данных" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "В Python есть типы: bool, int, float, complex\n", 43 | "\n", 44 | "В numpy имеются эти типы, а также обёртки над этими типами, которые используют реализацию типов на C, например, int8, int16, int32, int64.\n", 45 | "\n", 46 | "Число означает, сколько бит используется для хранения числа.\n", 47 | "\n", 48 | "За счёт того, что используются типы данных из C, numpy получает ускорение операций."
49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "type(np.bool)" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "np.bool()" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "np.int - тип из Python\n", 81 | "\n", 82 | "np.int32 и np.int64 - 32-битный и 64-битный типы из C" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": null, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "type(np.int), type(np.int32), type(np.int64)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "np.int(), np.int32(), np.int64()" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": null, 106 | "metadata": {}, 107 | "outputs": [], 108 | "source": [ 109 | "type(np.int()), type(np.int32()), type(np.int64())" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "В Python есть длинная арифметика, поэтому можно хранить сколь угодно большие числа" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "print(np.int(1e18)) # обёртка питоновского типа" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "32-битный int в C хранит числа от −2147483648 до 2147483647, так что для хранения $10^{18}$ его не хватит" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "print(np.int32(1e18))" 142 | ] 143 | }, 144 | { 145 | "cell_type":
"markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "64-битный int в C хранит числа от -9223372036854775808 до 9223372036854775807, так что $10^{18}$ в него уже помещается" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": null, 154 | "metadata": {}, 155 | "outputs": [], 156 | "source": [ 157 | "print(np.int64(1e18))" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": null, 163 | "metadata": {}, 164 | "outputs": [], 165 | "source": [] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "Аналогичная градация есть и для float\n", 172 | "\n", 173 | "float - обёртка питоновского типа\n", 174 | "float32 и float64 - обёртки чисел соответствующей битности (в стиле С)" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": null, 180 | "metadata": {}, 181 | "outputs": [], 182 | "source": [ 183 | "type(np.float), type(np.float32), type(np.float64)" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "np.float(), np.float32(), np.float64()" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "type(np.float()), type(np.float32()), type(np.float64())" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "type(np.sqrt(np.float(2))) # np.sqrt возвращает максимально близкий тип, для питоновского float это float64" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [ 219 | "type(np.sqrt(np.float32(2)))" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": null, 225 | "metadata": {}, 226 | "outputs": [], 227 | "source": [ 228 | "type(np.sqrt(np.float64(2)))" 229 | ] 230 | }, 231 | {
232 | "cell_type": "code", 233 | "execution_count": null, 234 | "metadata": {}, 235 | "outputs": [], 236 | "source": [] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "специальные классы для хранения комплексных чисел - по сути это два float-а" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [ 251 | "type(np.complex), type(np.complex64), type(np.complex128)" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": null, 257 | "metadata": {}, 258 | "outputs": [], 259 | "source": [ 260 | "np.complex(), np.complex64(), np.complex128()" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "type(np.complex()), type(np.complex64()), type(np.complex128())" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "metadata": {}, 276 | "outputs": [], 277 | "source": [] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "по умолчанию корень из -1 не получится взять" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": null, 289 | "metadata": {}, 290 | "outputs": [], 291 | "source": [ 292 | "np.sqrt(-1.)" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "но если указать, что тип данных complex, то всё сработает" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "metadata": {}, 306 | "outputs": [], 307 | "source": [ 308 | "np.sqrt(-1 + 0j)" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": null, 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [ 317 | "type(np.sqrt(-1 + 0j))" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "metadata": {}, 324 | "outputs": [], 
325 | "source": [ 326 | "type(np.sqrt(np.complex(-1 + 0j)))" 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "execution_count": null, 332 | "metadata": {}, 333 | "outputs": [], 334 | "source": [ 335 | "type(np.sqrt(np.complex64(-1 + 0j)))" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": {}, 342 | "outputs": [], 343 | "source": [ 344 | "type(np.sqrt(np.complex128(-1 + 0j)))" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": null, 350 | "metadata": {}, 351 | "outputs": [], 352 | "source": [] 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "metadata": { 357 | "collapsed": true 358 | }, 359 | "source": [ 360 | "### Вывод:\n", 361 | "\n", 362 | "В numpy присутствуют обёртки всех типов из C, а также перенесены типы из Python" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": null, 368 | "metadata": {}, 369 | "outputs": [], 370 | "source": [] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": null, 375 | "metadata": {}, 376 | "outputs": [], 377 | "source": [] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": null, 382 | "metadata": {}, 383 | "outputs": [], 384 | "source": [] 385 | }, 386 | { 387 | "cell_type": "markdown", 388 | "metadata": {}, 389 | "source": [ 390 | "## Основные численные функции" 391 | ] 392 | }, 393 | { 394 | "cell_type": "markdown", 395 | "metadata": {}, 396 | "source": [ 397 | "numpy предоставляет широкий спектр математических функций\n", 398 | "\n", 399 | "опишем основные их виды" 400 | ] 401 | }, 402 | { 403 | "cell_type": "markdown", 404 | "metadata": {}, 405 | "source": [ 406 | "##### Округления чисел" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "np.round - округление до ближайшего целого (ровно половина округляется к ближайшему чётному)\n", 414 | "\n", 415 | "np.floor - округление вниз\n", 416 | "\n", 417 | "np.ceil - округление вверх\n", 418 | "\n", 419 | "np.int - округление к нулю"
420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": null, 425 | "metadata": {}, 426 | "outputs": [], 427 | "source": [ 428 | "np.round(4.1), np.floor(4.1), np.ceil(4.1), np.int(4.1)" 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": null, 434 | "metadata": {}, 435 | "outputs": [], 436 | "source": [ 437 | "np.round(-4.1), np.floor(-4.1), np.ceil(-4.1), np.int(-4.1)" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": null, 443 | "metadata": {}, 444 | "outputs": [], 445 | "source": [] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": null, 450 | "metadata": {}, 451 | "outputs": [], 452 | "source": [ 453 | "np.round(4.5), np.floor(4.5), np.ceil(4.5), np.int(4.5)" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": null, 459 | "metadata": {}, 460 | "outputs": [], 461 | "source": [ 462 | "np.round(-4.5), np.floor(-4.5), np.ceil(-4.5), np.int(-4.5)" 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": null, 468 | "metadata": {}, 469 | "outputs": [], 470 | "source": [] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": null, 475 | "metadata": {}, 476 | "outputs": [], 477 | "source": [ 478 | "np.round(4.7), np.floor(4.7), np.ceil(4.7), np.int(4.7)" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": null, 484 | "metadata": {}, 485 | "outputs": [], 486 | "source": [ 487 | "np.round(-4.7), np.floor(-4.7), np.ceil(-4.7), np.int(-4.7)" 488 | ] 489 | }, 490 | { 491 | "cell_type": "markdown", 492 | "metadata": {}, 493 | "source": [ 494 | "##### Математические операции" 495 | ] 496 | }, 497 | { 498 | "cell_type": "markdown", 499 | "metadata": {}, 500 | "source": [ 501 | "Подсчитаем логарифм" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": null, 507 | "metadata": {}, 508 | "outputs": [], 509 | "source": [ 510 | "np.log(1000.), type(np.log(1000.))" 511 | ] 512 | }, 513 
| { 514 | "cell_type": "code", 515 | "execution_count": null, 516 | "metadata": {}, 517 | "outputs": [], 518 | "source": [ 519 | "np.log(np.float32(1000.)), type(np.log(np.float32(1000.))) # меньше бит на хранение - меньше точность" 520 | ] 521 | }, 522 | { 523 | "cell_type": "code", 524 | "execution_count": null, 525 | "metadata": {}, 526 | "outputs": [], 527 | "source": [ 528 | "np.log(1000.) / np.log(10.), type(np.log(1000.) / np.log(10.))" 529 | ] 530 | }, 531 | { 532 | "cell_type": "markdown", 533 | "metadata": {}, 534 | "source": [ 535 | "Если брать значение не из области определения, то исключение не выкидывается, но будет warning и вернётся inf или nan" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": null, 541 | "metadata": {}, 542 | "outputs": [], 543 | "source": [ 544 | "np.log(0.)" 545 | ] 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": null, 550 | "metadata": {}, 551 | "outputs": [], 552 | "source": [ 553 | "np.log(-1.)" 554 | ] 555 | }, 556 | { 557 | "cell_type": "markdown", 558 | "metadata": {}, 559 | "source": [ 560 | "Функции работают и с комплексными числами" 561 | ] 562 | }, 563 | { 564 | "cell_type": "code", 565 | "execution_count": null, 566 | "metadata": {}, 567 | "outputs": [], 568 | "source": [ 569 | "np.log(-1 + 0j)" 570 | ] 571 | }, 572 | { 573 | "cell_type": "code", 574 | "execution_count": null, 575 | "metadata": {}, 576 | "outputs": [], 577 | "source": [ 578 | "np.log(1j)" 579 | ] 580 | }, 581 | { 582 | "cell_type": "code", 583 | "execution_count": null, 584 | "metadata": {}, 585 | "outputs": [], 586 | "source": [] 587 | }, 588 | { 589 | "cell_type": "markdown", 590 | "metadata": {}, 591 | "source": [ 592 | "Есть специальные функции для двоичного и десятичного логарифмов" 593 | ] 594 | }, 595 | { 596 | "cell_type": "code", 597 | "execution_count": null, 598 | "metadata": {}, 599 | "outputs": [], 600 | "source": [ 601 | "print(np.log10(10))\n", 602 | "print(np.log10(100))\n", 603 | 
"print(np.log10(1000))\n", 604 | "print(np.log10(1e8))\n", 605 | "print(np.log10(1e30))\n", 606 | "print(np.log10(1e100))\n", 607 | "print(np.log10(1e1000))" 608 | ] 609 | }, 610 | { 611 | "cell_type": "markdown", 612 | "metadata": {}, 613 | "source": [ 614 | "у больших int-ов уже не получается взять логарифм, так как np.log2 приводит к сишному типу" 615 | ] 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": null, 620 | "metadata": {}, 621 | "outputs": [], 622 | "source": [ 623 | "print(np.log2(2))\n", 624 | "print(np.log2(2 ** 2))\n", 625 | "print(np.log2(2 ** 3))\n", 626 | "print(np.log2(2 ** 8))\n", 627 | "print(np.log2(2 ** 30))\n", 628 | "print(np.log2(2 ** 100))\n", 629 | "print(np.log2(2 ** 1000))" 630 | ] 631 | }, 632 | { 633 | "cell_type": "markdown", 634 | "metadata": {}, 635 | "source": [ 636 | "функции работают с типами С, поэтому может быть переполнение" 637 | ] 638 | }, 639 | { 640 | "cell_type": "code", 641 | "execution_count": null, 642 | "metadata": {}, 643 | "outputs": [], 644 | "source": [ 645 | "np.exp(10.), type(np.exp(10.))" 646 | ] 647 | }, 648 | { 649 | "cell_type": "code", 650 | "execution_count": null, 651 | "metadata": {}, 652 | "outputs": [], 653 | "source": [ 654 | "np.exp(100.), type(np.exp(100.))" 655 | ] 656 | }, 657 | { 658 | "cell_type": "code", 659 | "execution_count": null, 660 | "metadata": {}, 661 | "outputs": [], 662 | "source": [ 663 | "np.exp(1000.), type(np.exp(1000.))" 664 | ] 665 | }, 666 | { 667 | "cell_type": "code", 668 | "execution_count": null, 669 | "metadata": {}, 670 | "outputs": [], 671 | "source": [] 672 | }, 673 | { 674 | "cell_type": "markdown", 675 | "metadata": {}, 676 | "source": [ 677 | "##### Константы" 678 | ] 679 | }, 680 | { 681 | "cell_type": "markdown", 682 | "metadata": {}, 683 | "source": [ 684 | "в numpy есть математические константы " 685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": null, 690 | "metadata": {}, 691 | "outputs": [], 692 | "source": [ 
693 | "np.pi, type(np.pi)" 694 | ] 695 | }, 696 | { 697 | "cell_type": "code", 698 | "execution_count": null, 699 | "metadata": {}, 700 | "outputs": [], 701 | "source": [ 702 | "np.e, type(np.e)" 703 | ] 704 | }, 705 | { 706 | "cell_type": "code", 707 | "execution_count": null, 708 | "metadata": {}, 709 | "outputs": [], 710 | "source": [ 711 | "np.exp(np.pi * 1j)" 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "execution_count": null, 717 | "metadata": {}, 718 | "outputs": [], 719 | "source": [ 720 | "np.exp(np.pi * 1j).astype(np.float64)" 721 | ] 722 | }, 723 | { 724 | "cell_type": "code", 725 | "execution_count": null, 726 | "metadata": {}, 727 | "outputs": [], 728 | "source": [] 729 | }, 730 | { 731 | "cell_type": "markdown", 732 | "metadata": {}, 733 | "source": [ 734 | "##### Ещё примеры переполнения типов данных" 735 | ] 736 | }, 737 | { 738 | "cell_type": "markdown", 739 | "metadata": {}, 740 | "source": [ 741 | "Использование чисел определённой битности накладывает ограничения на их максимальные значения" 742 | ] 743 | }, 744 | { 745 | "cell_type": "code", 746 | "execution_count": null, 747 | "metadata": {}, 748 | "outputs": [], 749 | "source": [ 750 | "2 ** 60, type(2 ** 60) # питоновское умножение" 751 | ] 752 | }, 753 | { 754 | "cell_type": "code", 755 | "execution_count": null, 756 | "metadata": {}, 757 | "outputs": [], 758 | "source": [ 759 | "2 ** 1000, type(2 ** 1000) # питоновское умножение" 760 | ] 761 | }, 762 | { 763 | "cell_type": "code", 764 | "execution_count": null, 765 | "metadata": {}, 766 | "outputs": [], 767 | "source": [ 768 | "np.power(2, 60), type(np.power(2, 60))" 769 | ] 770 | }, 771 | { 772 | "cell_type": "code", 773 | "execution_count": null, 774 | "metadata": {}, 775 | "outputs": [], 776 | "source": [ 777 | "np.power(np.int64(2), 60), type(np.power(np.int64(2), 60))" 778 | ] 779 | }, 780 | { 781 | "cell_type": "code", 782 | "execution_count": null, 783 | "metadata": {}, 784 | "outputs": [], 785 | "source": [ 786 | 
"np.power(2, 1000), type(np.power(2, 1000))" 787 | ] 788 | }, 789 | { 790 | "cell_type": "code", 791 | "execution_count": null, 792 | "metadata": {}, 793 | "outputs": [], 794 | "source": [] 795 | }, 796 | { 797 | "cell_type": "markdown", 798 | "metadata": {}, 799 | "source": [ 800 | "##### Функция модуль" 801 | ] 802 | }, 803 | { 804 | "cell_type": "code", 805 | "execution_count": 13, 806 | "metadata": {}, 807 | "outputs": [ 808 | { 809 | "data": { 810 | "text/plain": [ 811 | "10000" 812 | ] 813 | }, 814 | "execution_count": 13, 815 | "metadata": {}, 816 | "output_type": "execute_result" 817 | } 818 | ], 819 | "source": [ 820 | "np.abs(-10000)" 821 | ] 822 | }, 823 | { 824 | "cell_type": "code", 825 | "execution_count": 14, 826 | "metadata": {}, 827 | "outputs": [ 828 | { 829 | "data": { 830 | "text/plain": [ 831 | "1.0" 832 | ] 833 | }, 834 | "execution_count": 14, 835 | "metadata": {}, 836 | "output_type": "execute_result" 837 | } 838 | ], 839 | "source": [ 840 | "np.abs(1j) # возвращает модуль комплексного числа" 841 | ] 842 | }, 843 | { 844 | "cell_type": "code", 845 | "execution_count": 15, 846 | "metadata": {}, 847 | "outputs": [ 848 | { 849 | "data": { 850 | "text/plain": [ 851 | "1.4142135623730951" 852 | ] 853 | }, 854 | "execution_count": 15, 855 | "metadata": {}, 856 | "output_type": "execute_result" 857 | } 858 | ], 859 | "source": [ 860 | "np.abs(1 + 1j)" 861 | ] 862 | }, 863 | { 864 | "cell_type": "code", 865 | "execution_count": null, 866 | "metadata": {}, 867 | "outputs": [], 868 | "source": [] 869 | }, 870 | { 871 | "cell_type": "markdown", 872 | "metadata": {}, 873 | "source": [ 874 | "##### Тригонометрические функции" 875 | ] 876 | }, 877 | { 878 | "cell_type": "code", 879 | "execution_count": 16, 880 | "metadata": {}, 881 | "outputs": [ 882 | { 883 | "data": { 884 | "text/plain": [ 885 | "-1.0" 886 | ] 887 | }, 888 | "execution_count": 16, 889 | "metadata": {}, 890 | "output_type": "execute_result" 891 | } 892 | ], 893 | "source": [ 894 | 
"np.cos(np.pi)" 895 | ] 896 | }, 897 | { 898 | "cell_type": "code", 899 | "execution_count": 17, 900 | "metadata": {}, 901 | "outputs": [ 902 | { 903 | "data": { 904 | "text/plain": [ 905 | "1.0" 906 | ] 907 | }, 908 | "execution_count": 17, 909 | "metadata": {}, 910 | "output_type": "execute_result" 911 | } 912 | ], 913 | "source": [ 914 | "np.log(np.e)" 915 | ] 916 | }, 917 | { 918 | "cell_type": "code", 919 | "execution_count": 18, 920 | "metadata": {}, 921 | "outputs": [ 922 | { 923 | "data": { 924 | "text/plain": [ 925 | "1.0" 926 | ] 927 | }, 928 | "execution_count": 18, 929 | "metadata": {}, 930 | "output_type": "execute_result" 931 | } 932 | ], 933 | "source": [ 934 | "np.sin(np.pi / 2)" 935 | ] 936 | }, 937 | { 938 | "cell_type": "code", 939 | "execution_count": 19, 940 | "metadata": {}, 941 | "outputs": [ 942 | { 943 | "data": { 944 | "text/plain": [ 945 | "1.5707963267948966" 946 | ] 947 | }, 948 | "execution_count": 19, 949 | "metadata": {}, 950 | "output_type": "execute_result" 951 | } 952 | ], 953 | "source": [ 954 | "np.arccos(0.)" 955 | ] 956 | }, 957 | { 958 | "cell_type": "code", 959 | "execution_count": 20, 960 | "metadata": {}, 961 | "outputs": [ 962 | { 963 | "data": { 964 | "text/plain": [ 965 | "57.29577951308232" 966 | ] 967 | }, 968 | "execution_count": 20, 969 | "metadata": {}, 970 | "output_type": "execute_result" 971 | } 972 | ], 973 | "source": [ 974 | "np.rad2deg(1.)" 975 | ] 976 | }, 977 | { 978 | "cell_type": "code", 979 | "execution_count": 21, 980 | "metadata": {}, 981 | "outputs": [ 982 | { 983 | "data": { 984 | "text/plain": [ 985 | "3.141592653589793" 986 | ] 987 | }, 988 | "execution_count": 21, 989 | "metadata": {}, 990 | "output_type": "execute_result" 991 | } 992 | ], 993 | "source": [ 994 | "np.deg2rad(180.)" 995 | ] 996 | }, 997 | { 998 | "cell_type": "markdown", 999 | "metadata": {}, 1000 | "source": [ 1001 | "Более подробно можно посмотреть здесь: https://docs.scipy.org/doc/numpy-1.9.2/reference/routines.math.html" 1002 
| ] 1003 | }, 1004 | { 1005 | "cell_type": "markdown", 1006 | "metadata": { 1007 | "collapsed": true 1008 | }, 1009 | "source": [ 1010 | "### Вывод:\n", 1011 | "В numpy реализовано огромное число математических функций" 1012 | ] 1013 | }, 1014 | { 1015 | "cell_type": "code", 1016 | "execution_count": null, 1017 | "metadata": {}, 1018 | "outputs": [], 1019 | "source": [] 1020 | }, 1021 | { 1022 | "cell_type": "markdown", 1023 | "metadata": {}, 1024 | "source": [ 1025 | "### Чем это лучше модуля math?" 1026 | ] 1027 | }, 1028 | { 1029 | "cell_type": "code", 1030 | "execution_count": null, 1031 | "metadata": {}, 1032 | "outputs": [], 1033 | "source": [ 1034 | "import math" 1035 | ] 1036 | }, 1037 | { 1038 | "cell_type": "code", 1039 | "execution_count": null, 1040 | "metadata": {}, 1041 | "outputs": [], 1042 | "source": [ 1043 | "%timeit math.exp(10.)\n", 1044 | "%timeit np.exp(10.)" 1045 | ] 1046 | }, 1047 | { 1048 | "cell_type": "code", 1049 | "execution_count": null, 1050 | "metadata": {}, 1051 | "outputs": [], 1052 | "source": [ 1053 | "%timeit math.sqrt(10.)\n", 1054 | "%timeit np.sqrt(10.)" 1055 | ] 1056 | }, 1057 | { 1058 | "cell_type": "code", 1059 | "execution_count": null, 1060 | "metadata": {}, 1061 | "outputs": [], 1062 | "source": [ 1063 | "%timeit math.log(10.)\n", 1064 | "%timeit np.log(10.)" 1065 | ] 1066 | }, 1067 | { 1068 | "cell_type": "code", 1069 | "execution_count": null, 1070 | "metadata": {}, 1071 | "outputs": [], 1072 | "source": [ 1073 | "%timeit math.cos(10.)\n", 1074 | "%timeit np.cos(10.)" 1075 | ] 1076 | }, 1077 | { 1078 | "cell_type": "markdown", 1079 | "metadata": {}, 1080 | "source": [ 1081 | "### Вывод:\n", 1082 | "Арифметические функции из numpy не работают быстрее, чем функции из math, если вычислять их для одного значения\n", 1083 | "\n", 1084 | "Если вам нужно вычислить значение некоторой математической функции, то, скорее всего, она уже реализована в numpy" 1085 | ] 1086 | }, 1087 | { 1088 | "cell_type": "code", 1089 |
"execution_count": null, 1090 | "metadata": {}, 1091 | "outputs": [], 1092 | "source": [] 1093 | }, 1094 | { 1095 | "cell_type": "code", 1096 | "execution_count": null, 1097 | "metadata": {}, 1098 | "outputs": [], 1099 | "source": [] 1100 | }, 1101 | { 1102 | "cell_type": "markdown", 1103 | "metadata": {}, 1104 | "source": [ 1105 | "### Арифметические функции хороши, но, тем не менее, основным объектом NumPy является однородный многомерный массив" 1106 | ] 1107 | }, 1108 | { 1109 | "cell_type": "code", 1110 | "execution_count": null, 1111 | "metadata": {}, 1112 | "outputs": [], 1113 | "source": [ 1114 | "type(np.array([]))" 1115 | ] 1116 | }, 1117 | { 1118 | "cell_type": "markdown", 1119 | "metadata": {}, 1120 | "source": [ 1121 | "Наиболее важные атрибуты объектов ndarray:\n", 1122 | "\n", 1123 | " 1. ndarray.ndim - число измерений (чаще их называют \"оси\") массива.\n", 1124 | "\n", 1125 | " 2. ndarray.shape - размеры массива, его форма. Это кортеж натуральных чисел, показывающий длину массива по каждой оси. Для матрицы из n строк и m столбцов, shape будет (n,m). Число элементов кортежа shape равно ndim.\n", 1126 | "\n", 1127 | " 3. ndarray.size - количество элементов массива. Очевидно, равно произведению всех элементов атрибута shape.\n", 1128 | "\n", 1129 | " 4. ndarray.dtype - объект, описывающий тип элементов массива. Можно определить dtype, используя стандартные типы данных Python. Можно хранить и numpy-типы, например: bool, int16, int32, int64, float16, float32, float64, complex64\n", 1130 | "\n", 1131 | " 5. ndarray.itemsize - размер каждого элемента массива в байтах.\n", 1132 | "\n", 1133 | " 6. ndarray.data - буфер, содержащий фактические элементы массива. Обычно не нужно использовать этот атрибут, так как обращаться к элементам массива проще всего с помощью индексов."
1134 | ] 1135 | }, 1136 | { 1137 | "cell_type": "markdown", 1138 | "metadata": {}, 1139 | "source": [ 1140 | "##### Обычные одномерные массивы" 1141 | ] 1142 | }, 1143 | { 1144 | "cell_type": "code", 1145 | "execution_count": null, 1146 | "metadata": {}, 1147 | "outputs": [], 1148 | "source": [ 1149 | "arr = np.array([1, 2, 4, 8, 16, 32])\n", 1150 | "\n", 1151 | "print(arr.ndim)\n", 1152 | "print(arr.shape)\n", 1153 | "print(arr.size)\n", 1154 | "print(arr.dtype)\n", 1155 | "print(arr.itemsize)\n", 1156 | "print(arr.data)" 1157 | ] 1158 | }, 1159 | { 1160 | "cell_type": "code", 1161 | "execution_count": null, 1162 | "metadata": {}, 1163 | "outputs": [], 1164 | "source": [ 1165 | "arr = np.array([1, 2, 4, 8, 16, 32], dtype=int)\n", 1166 | "\n", 1167 | "print(arr.ndim)\n", 1168 | "print(arr.shape)\n", 1169 | "print(arr.size)\n", 1170 | "print(arr.dtype)\n", 1171 | "print(arr.itemsize)\n", 1172 | "print(arr.data)" 1173 | ] 1174 | }, 1175 | { 1176 | "cell_type": "code", 1177 | "execution_count": null, 1178 | "metadata": {}, 1179 | "outputs": [], 1180 | "source": [ 1181 | "arr = np.array([1, 2, 4, 8, 16, 32], dtype=object)\n", 1182 | "\n", 1183 | "print(arr.ndim)\n", 1184 | "print(arr.shape)\n", 1185 | "print(arr.size)\n", 1186 | "print(arr.dtype)\n", 1187 | "print(arr.itemsize)\n", 1188 | "print(arr.data)" 1189 | ] 1190 | }, 1191 | { 1192 | "cell_type": "code", 1193 | "execution_count": null, 1194 | "metadata": {}, 1195 | "outputs": [], 1196 | "source": [ 1197 | "arr = np.array([1, 2, 4, 8, 16, 32], dtype=np.int64)\n", 1198 | "\n", 1199 | "print(arr.ndim)\n", 1200 | "print(arr.shape)\n", 1201 | "print(arr.size)\n", 1202 | "print(arr.dtype)\n", 1203 | "print(arr.itemsize)\n", 1204 | "print(arr.data)" 1205 | ] 1206 | }, 1207 | { 1208 | "cell_type": "code", 1209 | "execution_count": null, 1210 | "metadata": {}, 1211 | "outputs": [], 1212 | "source": [ 1213 | "arr = np.array([1, 2, 4, 8, 16, 32], dtype=np.complex128)\n", 1214 | "\n", 1215 | "print(arr.ndim)\n", 1216 | 
"print(arr.shape)\n", 1217 | "print(arr.size)\n", 1218 | "print(arr.dtype)\n", 1219 | "print(arr.itemsize)\n", 1220 | "print(arr.data)" 1221 | ] 1222 | }, 1223 | { 1224 | "cell_type": "markdown", 1225 | "metadata": {}, 1226 | "source": [ 1227 | "##### Обычные двухмерные массивы" 1228 | ] 1229 | }, 1230 | { 1231 | "cell_type": "code", 1232 | "execution_count": null, 1233 | "metadata": {}, 1234 | "outputs": [], 1235 | "source": [ 1236 | "arr = np.array([[1], [2], [4], [8], [16], [32]], dtype=np.complex128)\n", 1237 | "\n", 1238 | "print(arr.ndim)\n", 1239 | "print(arr.shape)\n", 1240 | "print(arr.size)\n", 1241 | "print(arr.dtype)\n", 1242 | "print(arr.itemsize)\n", 1243 | "print(arr.data)" 1244 | ] 1245 | }, 1246 | { 1247 | "cell_type": "code", 1248 | "execution_count": null, 1249 | "metadata": {}, 1250 | "outputs": [], 1251 | "source": [ 1252 | "arr = np.array([[1, 0], [2, 0], [4, 0], [8, 0], [16, 0], [32, 0]], dtype=np.complex128)\n", 1253 | "\n", 1254 | "print(arr.ndim)\n", 1255 | "print(arr.shape)\n", 1256 | "print(arr.size)\n", 1257 | "print(arr.dtype)\n", 1258 | "print(arr.itemsize)\n", 1259 | "print(arr.data)" 1260 | ] 1261 | }, 1262 | { 1263 | "cell_type": "code", 1264 | "execution_count": null, 1265 | "metadata": {}, 1266 | "outputs": [], 1267 | "source": [ 1268 | "# указываем строчки с разным числом элементов\n", 1269 | "arr = np.array([[1, 0], [2, 0], [4, 0], [8, 0], [16, 0], [32]], dtype=np.complex128)\n", 1270 | "\n", 1271 | "print(arr.ndim)\n", 1272 | "print(arr.shape)\n", 1273 | "print(arr.size)\n", 1274 | "print(arr.dtype)\n", 1275 | "print(arr.itemsize)\n", 1276 | "print(arr.data)" 1277 | ] 1278 | }, 1279 | { 1280 | "cell_type": "markdown", 1281 | "metadata": {}, 1282 | "source": [ 1283 | "##### Индексация одномерных массивов" 1284 | ] 1285 | }, 1286 | { 1287 | "cell_type": "code", 1288 | "execution_count": null, 1289 | "metadata": {}, 1290 | "outputs": [], 1291 | "source": [ 1292 | "arr = np.array([1, 2, 4, 8, 16, 32], dtype=np.int64)" 1293 | ] 
1294 | }, 1295 | { 1296 | "cell_type": "code", 1297 | "execution_count": null, 1298 | "metadata": {}, 1299 | "outputs": [], 1300 | "source": [ 1301 | "arr[0], arr[1], arr[4], arr[-1]" 1302 | ] 1303 | }, 1304 | { 1305 | "cell_type": "code", 1306 | "execution_count": null, 1307 | "metadata": {}, 1308 | "outputs": [], 1309 | "source": [ 1310 | "arr[0:4]" 1311 | ] 1312 | }, 1313 | { 1314 | "cell_type": "code", 1315 | "execution_count": null, 1316 | "metadata": {}, 1317 | "outputs": [], 1318 | "source": [ 1319 | "arr[[0, 3, 5]]" 1320 | ] 1321 | }, 1322 | { 1323 | "cell_type": "markdown", 1324 | "metadata": {}, 1325 | "source": [ 1326 | "##### Indexing two-dimensional arrays" 1327 | ] 1328 | }, 1329 | { 1330 | "cell_type": "code", 1331 | "execution_count": null, 1332 | "metadata": {}, 1333 | "outputs": [], 1334 | "source": [ 1335 | "arr = np.array(\n", 1336 | " [\n", 1337 | " [1, 0, 4], \n", 1338 | " [2, 0, 4], \n", 1339 | " [4, 0, 4], \n", 1340 | " [8, 0, 4], \n", 1341 | " [16, 0, 4], \n", 1342 | " [32, 0, 4]\n", 1343 | " ],\n", 1344 | " dtype=np.int64\n", 1345 | ")" 1346 | ] 1347 | }, 1348 | { 1349 | "cell_type": "code", 1350 | "execution_count": null, 1351 | "metadata": {}, 1352 | "outputs": [], 1353 | "source": [ 1354 | "print(arr[0])\n", 1355 | "print(arr[1])\n", 1356 | "print(arr[4])\n", 1357 | "print(arr[-1])" 1358 | ] 1359 | }, 1360 | { 1361 | "cell_type": "code", 1362 | "execution_count": null, 1363 | "metadata": {}, 1364 | "outputs": [], 1365 | "source": [ 1366 | "arr[0, 0], arr[1, 0], arr[4, 0], arr[-1, 0]" 1367 | ] 1368 | }, 1369 | { 1370 | "cell_type": "code", 1371 | "execution_count": null, 1372 | "metadata": {}, 1373 | "outputs": [], 1374 | "source": [ 1375 | "arr[0][0], arr[1][0], arr[4][0], arr[-1][0]" 1376 | ] 1377 | }, 1378 | { 1379 | "cell_type": "markdown", 1380 | "metadata": {}, 1381 | "source": [ 1382 | "The first way is faster" 1383 | ] 1384 | }, 1385 | { 1386 | "cell_type": "code", 1387 | "execution_count": null, 1388 | "metadata": {}, 1389 | "outputs": [], 1390 | "source": [ 1391 | "%timeit arr[0, 0], arr[1, 0], arr[4, 0], arr[-1, 0]" 1392 | ] 1393 | }, 1394 | { 1395 | "cell_type": "code", 1396 | "execution_count": null, 1397 | "metadata": {}, 1398 | "outputs": [], 1399 | "source": [ 1400 | "%timeit arr[0][0], arr[1][0], arr[4][0], arr[-1][0]" 1401 | ] 1402 | }, 1403 | { 1404 | "cell_type": "markdown", 1405 | "metadata": {}, 1406 | "source": [ 1407 | "##### More complex indexing" 1408 | ] 1409 | }, 1410 | { 1411 | "cell_type": "markdown", 1412 | "metadata": {}, 1413 | "source": [ 1414 | "We can take a row or a column" 1415 | ] 1416 | }, 1417 | { 1418 | "cell_type": "code", 1419 | "execution_count": null, 1420 | "metadata": {}, 1421 | "outputs": [], 1422 | "source": [ 1423 | "arr[0, :], arr[0, :].shape" 1424 | ] 1425 | }, 1426 | { 1427 | "cell_type": "code", 1428 | "execution_count": null, 1429 | "metadata": {}, 1430 | "outputs": [], 1431 | "source": [ 1432 | "arr[:, 0], arr[:, 0].shape" 1433 | ] 1434 | }, 1435 | { 1436 | "cell_type": "code", 1437 | "execution_count": null, 1438 | "metadata": {}, 1439 | "outputs": [], 1440 | "source": [ 1441 | "arr[[1, 3, 5], :], arr[[1, 3, 5], :].shape" 1442 | ] 1443 | }, 1444 | { 1445 | "cell_type": "code", 1446 | "execution_count": null, 1447 | "metadata": {}, 1448 | "outputs": [], 1449 | "source": [ 1450 | "arr[[1, 3, 5], 0]" 1451 | ] 1452 | }, 1453 | { 1454 | "cell_type": "code", 1455 | "execution_count": null, 1456 | "metadata": {}, 1457 | "outputs": [], 1458 | "source": [ 1459 | "arr[1::2, 0]" 1460 | ] 1461 | }, 1462 | { 1463 | "cell_type": "code", 1464 | "execution_count": null, 1465 | "metadata": {}, 1466 | "outputs": [], 1467 | "source": [ 1468 | "arr[[1, 3, 5], :2]" 1469 | ] 1470 | }, 1471 | { 1472 | "cell_type": "code", 1473 | "execution_count": null, 1474 | "metadata": {}, 1475 | "outputs": [], 1476 | "source": [ 1477 | "arr[[1, 3, 5], 1:]" 1478 | ] 1479 | }, 1480 | { 1481 | "cell_type": "code", 1482 | "execution_count": null, 1483 | "metadata": {}, 1484 | "outputs": [], 1485 | "source": [ 1486 | "arr[[1, 3, 5], [0, 2]] # index arrays of shapes (3,) and (2,) do not broadcast: raises an IndexError" 1487 | ] 1488 | }, 1489 | { 1490 | "cell_type": "code", 1491 | "execution_count": null, 1492 | "metadata": {}, 1493 | "outputs": [], 1494 | "source": [ 1495 | "arr[[1, 3], [0, 2]] # picks elements arr[1, 0] and arr[3, 2]" 1496 | ] 1497 | }, 1498 | { 1499 | "cell_type": "code", 1500 | "execution_count": null, 1501 | "metadata": {}, 1502 | "outputs": [], 1503 | "source": [ 1504 | "arr[np.ix_([1, 3, 5], [0, 2])]" 1505 | ] 1506 | }, 1507 | { 1508 | "cell_type": "code", 1509 | "execution_count": null, 1510 | "metadata": {}, 1511 | "outputs": [], 1512 | "source": [ 1513 | "np.ix_([1, 3, 5], [0, 2])" 1514 | ] 1515 | }, 1516 | { 1517 | "cell_type": "markdown", 1518 | "metadata": {}, 1519 | "source": [ 1520 | "### Conclusions" 1521 | ] 1522 | }, 1523 | { 1524 | "cell_type": "markdown", 1525 | "metadata": {}, 1526 | "source": [ 1527 | "Images taken from http://www.scipy-lectures.org/intro/numpy/numpy.html" 1528 | ] 1529 | }, 1530 | { 1531 | "cell_type": "markdown", 1532 | "metadata": {}, 1533 | "source": [ 1534 | "![title](numpy_indexing.png)" 1535 | ] 1536 | }, 1537 | { 1538 | "cell_type": "markdown", 1539 | "metadata": {}, 1540 | "source": [ 1541 | "![title](numpy_fancy_indexing.png)" 1542 | ] 1543 | }, 1544 | { 1545 | "cell_type": "code", 1546 | "execution_count": null, 1547 | "metadata": {}, 1548 | "outputs": [], 1549 | "source": [] 1550 | }, 1551 | { 1552 | "cell_type": "markdown", 1553 | "metadata": {}, 1554 | "source": [ 1555 | "##### Operations on arrays" 1556 | ] 1557 | }, 1558 | { 1559 | "cell_type": "markdown", 1560 | "metadata": {}, 1561 | "source": [ 1562 | "Arrays support arithmetic operations. There is no need to touch every element individually: operations are applied to whole arrays at once."
1563 | ] 1564 | }, 1565 | { 1566 | "cell_type": "code", 1567 | "execution_count": 22, 1568 | "metadata": {}, 1569 | "outputs": [], 1570 | "source": [ 1571 | "a = np.array([1, 2, 4, 8, 16])\n", 1572 | "b = np.array([1, 3, 9, 27, 81])" 1573 | ] 1574 | }, 1575 | { 1576 | "cell_type": "code", 1577 | "execution_count": 23, 1578 | "metadata": {}, 1579 | "outputs": [ 1580 | { 1581 | "data": { 1582 | "text/plain": [ 1583 | "array([ 0, 1, 3, 7, 15])" 1584 | ] 1585 | }, 1586 | "execution_count": 23, 1587 | "metadata": {}, 1588 | "output_type": "execute_result" 1589 | } 1590 | ], 1591 | "source": [ 1592 | "a - 1" 1593 | ] 1594 | }, 1595 | { 1596 | "cell_type": "code", 1597 | "execution_count": 24, 1598 | "metadata": {}, 1599 | "outputs": [ 1600 | { 1601 | "data": { 1602 | "text/plain": [ 1603 | "array([ 2, 5, 13, 35, 97])" 1604 | ] 1605 | }, 1606 | "execution_count": 24, 1607 | "metadata": {}, 1608 | "output_type": "execute_result" 1609 | } 1610 | ], 1611 | "source": [ 1612 | "a + b" 1613 | ] 1614 | }, 1615 | { 1616 | "cell_type": "code", 1617 | "execution_count": 25, 1618 | "metadata": {}, 1619 | "outputs": [ 1620 | { 1621 | "data": { 1622 | "text/plain": [ 1623 | "array([ 1, 6, 36, 216, 1296])" 1624 | ] 1625 | }, 1626 | "execution_count": 25, 1627 | "metadata": {}, 1628 | "output_type": "execute_result" 1629 | } 1630 | ], 1631 | "source": [ 1632 | "a * b" 1633 | ] 1634 | }, 1635 | { 1636 | "cell_type": "code", 1637 | "execution_count": 26, 1638 | "metadata": {}, 1639 | "outputs": [ 1640 | { 1641 | "data": { 1642 | "text/plain": [ 1643 | "array([1. 
, 1.5 , 2.25 , 3.375 , 5.0625])" 1644 | ] 1645 | }, 1646 | "execution_count": 26, 1647 | "metadata": {}, 1648 | "output_type": "execute_result" 1649 | } 1650 | ], 1651 | "source": [ 1652 | "b / a" 1653 | ] 1654 | }, 1655 | { 1656 | "cell_type": "code", 1657 | "execution_count": 27, 1658 | "metadata": {}, 1659 | "outputs": [ 1660 | { 1661 | "data": { 1662 | "text/plain": [ 1663 | "array([1, 1, 2, 3, 5])" 1664 | ] 1665 | }, 1666 | "execution_count": 27, 1667 | "metadata": {}, 1668 | "output_type": "execute_result" 1669 | } 1670 | ], 1671 | "source": [ 1672 | "b // a" 1673 | ] 1674 | }, 1675 | { 1676 | "cell_type": "code", 1677 | "execution_count": 28, 1678 | "metadata": {}, 1679 | "outputs": [ 1680 | { 1681 | "data": { 1682 | "text/plain": [ 1683 | "array([0., 1., 2., 3., 4.])" 1684 | ] 1685 | }, 1686 | "execution_count": 28, 1687 | "metadata": {}, 1688 | "output_type": "execute_result" 1689 | } 1690 | ], 1691 | "source": [ 1692 | "np.log2(a)" 1693 | ] 1694 | }, 1695 | { 1696 | "cell_type": "code", 1697 | "execution_count": 29, 1698 | "metadata": {}, 1699 | "outputs": [ 1700 | { 1701 | "data": { 1702 | "text/plain": [ 1703 | "array([0., 1., 2., 3., 4.])" 1704 | ] 1705 | }, 1706 | "execution_count": 29, 1707 | "metadata": {}, 1708 | "output_type": "execute_result" 1709 | } 1710 | ], 1711 | "source": [ 1712 | "np.log(a) / np.log(2)" 1713 | ] 1714 | }, 1715 | { 1716 | "cell_type": "code", 1717 | "execution_count": 30, 1718 | "metadata": {}, 1719 | "outputs": [ 1720 | { 1721 | "data": { 1722 | "text/plain": [ 1723 | "array([0., 1., 2., 3., 4.])" 1724 | ] 1725 | }, 1726 | "execution_count": 30, 1727 | "metadata": {}, 1728 | "output_type": "execute_result" 1729 | } 1730 | ], 1731 | "source": [ 1732 | "np.log(b) / np.log(3)" 1733 | ] 1734 | }, 1735 | { 1736 | "cell_type": "markdown", 1737 | "metadata": {}, 1738 | "source": [ 1739 | "##### The speed advantage" 1740 | ] 1741 | }, 1742 | { 1743 | "cell_type": "code", 1744 | "execution_count": null, 1745 | "metadata": {}, 1746 | "outputs": [], 1747 | "source": [ 1748 | "a = list(range(10000))\n", 1749 | "b = list(range(10000))" 1750 | ] 1751 | }, 1752 | { 1753 | "cell_type": "code", 1754 | "execution_count": null, 1755 | "metadata": {}, 1756 | "outputs": [], 1757 | "source": [ 1758 | "%%timeit\n", 1759 | "c = [\n", 1760 | " x * y\n", 1761 | " for x, y in zip(a, b)\n", 1762 | "]" 1763 | ] 1764 | }, 1765 | { 1766 | "cell_type": "code", 1767 | "execution_count": null, 1768 | "metadata": {}, 1769 | "outputs": [], 1770 | "source": [ 1771 | "a = np.array(a)\n", 1772 | "b = np.array(b)" 1773 | ] 1774 | }, 1775 | { 1776 | "cell_type": "code", 1777 | "execution_count": null, 1778 | "metadata": {}, 1779 | "outputs": [], 1780 | "source": [ 1781 | "%%timeit\n", 1782 | "c = a * b" 1783 | ] 1784 | }, 1785 | { 1786 | "cell_type": "code", 1787 | "execution_count": null, 1788 | "metadata": {}, 1789 | "outputs": [], 1790 | "source": [ 1791 | "%%timeit\n", 1792 | "c = [\n", 1793 | " x * y\n", 1794 | " for x, y in zip(a, b)\n", 1795 | "]" 1796 | ] 1797 | }, 1798 | { 1799 | "cell_type": "markdown", 1800 | "metadata": {}, 1801 | "source": [ 1802 | "Whole-array operations are about 100 times faster. Note that running ordinary element-by-element Python code over numpy arrays is even slower than over plain lists." 1803 | ] 1804 | }, 1805 | { 1806 | "cell_type": "markdown", 1807 | "metadata": {}, 1808 | "source": [ 1809 | "### Conclusions:\n", 1810 | "\n", 1811 | "For better performance, prefer whole-array arithmetic operations" 1812 | ] 1813 | }, 1814 | { 1815 | "cell_type": "code", 1816 | "execution_count": null, 1817 | "metadata": {}, 1818 | "outputs": [], 1819 | "source": [] 1820 | }, 1821 | { 1822 | "cell_type": "code", 1823 | "execution_count": null, 1824 | "metadata": {}, 1825 | "outputs": [], 1826 | "source": [] 1827 | }, 1828 | { 1829 | "cell_type": "markdown", 1830 | "metadata": {}, 1831 | "source": [ 1832 | "##### random" 1833 | ] 1834 | }, 1835 | { 1836 | "cell_type": "markdown", 1837 | "metadata": {}, 1838 | "source": [ 1839 | "numpy has an analogue of the random module: numpy.random. Like its standard-library counterpart it generates random data, but it fills C-typed numpy arrays." 1840 | ] 1841 | }, 1842 | { 1843 | "cell_type": "code", 1844 | "execution_count": 31, 1845 | "metadata": {}, 1846 | "outputs": [ 1847 | { 1848 | "data": { 1849 | "text/plain": [ 1850 | "array([[[0.8175136 , 0.77078567, 0.87454103, 0.17336117],\n", 1851 | " [0.37306559, 0.3334027 , 0.63796893, 0.42849584],\n", 1852 | " [0.04700558, 0.51279351, 0.22267211, 0.91020539]],\n", 1853 | "\n", 1854 | " [[0.64515575, 0.65825143, 0.90880479, 0.88388794],\n", 1855 | " [0.82751777, 0.46026817, 0.67696989, 0.53016121],\n", 1856 | " [0.06275625, 0.61376869, 0.14391625, 0.30392825]]])" 1857 | ] 1858 | }, 1859 | "execution_count": 31, 1860 | "metadata": {}, 1861 | "output_type": "execute_result" 1862 | } 1863 | ], 1864 | "source": [ 1865 | "np.random.rand(2, 3, 4) # uniform distribution on [0, 1) with the given shape" 1866 | ] 1867 | }, 1868 | { 1869 | "cell_type": "code", 1870 | "execution_count": 32, 1871 | "metadata": {}, 1872 | "outputs": [ 1873 | { 1874 | "data": { 1875 | "text/plain": [ 1876 | "(2, 3, 4)" 1877 | ] 1878 | }, 1879 | "execution_count": 32, 1880 | "metadata": {}, 1881 | "output_type": "execute_result" 1882 | } 1883 | ], 1884 | "source": [ 1885 | "np.random.rand(2, 3, 4).shape" 1886 | ] 1887 | }, 1888 | { 1889 | "cell_type": "code", 1890 | "execution_count": 33, 1891 | "metadata": {}, 1892 | "outputs": [ 1893 | { 1894 | "data": { 1895 | "text/plain": [ 1896 | "array([[[ 0.51807114, 1.21877741, 0.53473039, 1.25560827],\n", 1897 | " [ 1.95685262, 0.26716197, 1.09282955, -0.71969846],\n", 1898 | " [ 2.2309445 , 0.74894436, -0.07109792, 0.35245353]],\n", 1899 | "\n", 1900 | " [[-1.71500229, 0.3727462 , -0.86423839, 0.95929217],\n", 1901 | " [ 1.38904054, -2.07292949, -0.41625269, 1.74899741],\n", 1902 | " [ 0.75667197, -0.40825183, 0.16802865, 1.73164801]]])" 1903 | ] 1904 | }, 1905 | 
"execution_count": 33, 1906 | "metadata": {}, 1907 | "output_type": "execute_result" 1908 | } 1909 | ], 1910 | "source": [ 1911 | "np.random.randn(2, 3, 4) # standard normal distribution with the given shape" 1912 | ] 1913 | }, 1914 | { 1915 | "cell_type": "code", 1916 | "execution_count": 34, 1917 | "metadata": {}, 1918 | "outputs": [ 1919 | { 1920 | "data": { 1921 | "text/plain": [ 1922 | "b'\\x8ca\\xba&\\x96\\xb7\\xbc5Z\\xfa'" 1923 | ] 1924 | }, 1925 | "execution_count": 34, 1926 | "metadata": {}, 1927 | "output_type": "execute_result" 1928 | } 1929 | ], 1930 | "source": [ 1931 | "np.random.bytes(10) # random bytes" 1932 | ] 1933 | }, 1934 | { 1935 | "cell_type": "markdown", 1936 | "metadata": {}, 1937 | "source": [ 1938 | "Other distributions can be generated as well; details here:\n", 1939 | "\n", 1940 | "https://docs.scipy.org/doc/numpy-1.12.0/reference/routines.random.html" 1941 | ] 1942 | }, 1943 | { 1944 | "cell_type": "code", 1945 | "execution_count": null, 1946 | "metadata": {}, 1947 | "outputs": [], 1948 | "source": [] 1949 | }, 1950 | { 1951 | "cell_type": "markdown", 1952 | "metadata": {}, 1953 | "source": [ 1954 | "#### One more example of efficient computation" 1955 | ] 1956 | }, 1957 | { 1958 | "cell_type": "markdown", 1959 | "metadata": {}, 1960 | "source": [ 1961 | "To conclude, here is one more example where using numpy speeds the code up substantially" 1962 | ] 1963 | }, 1964 | { 1965 | "cell_type": "markdown", 1966 | "metadata": {}, 1967 | "source": [ 1968 | "In mathematics, multiplication of matrices (two-dimensional arrays) is defined as\n", 1969 | "\n", 1970 | "$A \\times B = C$\n", 1971 | "\n", 1972 | "$C_{ij} = \\sum_k A_{ik} B_{kj}$" 1973 | ] 1974 | }, 1975 | { 1976 | "cell_type": "markdown", 1977 | "metadata": {}, 1978 | "source": [ 1979 | "let's generate random matrices" 1980 | ] 1981 | }, 1982 | { 1983 | "cell_type": "code", 1984 | "execution_count": 35, 1985 | "metadata": {}, 1986 | "outputs": [], 1987 | "source": [ 1988 | "A = np.random.randint(1000, size=(200, 100))\n", 1989 | "B = np.random.randint(1000, size=(100, 300))" 1990 | ] 1991 | }, 1992 | { 1993 | "cell_type": "markdown", 1994 | "metadata": {}, 1995 | "source": [ 1996 | "multiplication using numpy" 1997 | ] 1998 | }, 1999 | { 2000 | "cell_type": "code", 2001 | "execution_count": 36, 2002 | "metadata": {}, 2003 | "outputs": [], 2004 | "source": [ 2005 | "def np_multiply():\n", 2006 | " return np.dot(A, B)" 2007 | ] 2008 | }, 2009 | { 2010 | "cell_type": "code", 2011 | "execution_count": 37, 2012 | "metadata": {}, 2013 | "outputs": [ 2014 | { 2015 | "name": "stdout", 2016 | "output_type": "stream", 2017 | "text": [ 2018 | "4.41 ms ± 639 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" 2019 | ] 2020 | } 2021 | ], 2022 | "source": [ 2023 | "%timeit np_multiply()" 2024 | ] 2025 | }, 2026 | { 2027 | "cell_type": "markdown", 2028 | "metadata": { 2029 | "collapsed": true 2030 | }, 2031 | "source": [ 2032 | "if the matrices are stored not as two-dimensional arrays but as lists of lists, the same call takes longer, since np.dot first has to convert them back into arrays" 2033 | ] 2034 | }, 2035 | { 2036 | "cell_type": "code", 2037 | "execution_count": 38, 2038 | "metadata": {}, 2039 | "outputs": [], 2040 | "source": [ 2041 | "A = [list(x) for x in A]\n", 2042 | "B = [list(x) for x in B]" 2043 | ] 2044 | }, 2045 | { 2046 | "cell_type": "code", 2047 | "execution_count": 39, 2048 | "metadata": {}, 2049 | "outputs": [ 2050 | { 2051 | "name": "stdout", 2052 | "output_type": "stream", 2053 | "text": [ 2054 | "10.1 ms ± 768 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" 2055 | ] 2056 | } 2057 | ], 2058 | "source": [ 2059 | "%timeit np_multiply()" 2060 | ] 2061 | }, 2062 | { 2063 | "cell_type": "markdown", 2064 | "metadata": {}, 2065 | "source": [ 2066 | "and this is multiplication in pure Python code" 2067 | ] 2068 | }, 2069 | { 2070 | "cell_type": "code", 2071 | "execution_count": 40, 2072 | "metadata": {}, 2073 | "outputs": [], 2074 | "source": [ 2075 | "def python_multiply():\n", 2076 | " res = []\n", 2077 | " for i in range(200):\n", 2078 | " row = []\n", 2079 | " for j in range(300):\n", 2080 | " val = 0\n", 2081 | " for k in range(100):\n", 2082 | " val += A[i][k] * B[k][j]\n", 2083 | " row.append(val)\n", 2084 | " res.append(row)\n", 2085 | " return res" 2086 | ] 2087 | }, 2088 | { 2089 | "cell_type": "code", 2090 | "execution_count": 41, 2091 | "metadata": {}, 2092 | "outputs": [ 2093 | { 2094 | "name": "stdout", 2095 | "output_type": "stream", 2096 | "text": [ 2097 | "2.06 s ± 235 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" 2098 | ] 2099 | } 2100 | ], 2101 | "source": [ 2102 | "%timeit python_multiply()" 2103 | ] 2104 | }, 2105 | { 2106 | "cell_type": "markdown", 2107 | "metadata": {}, 2108 | "source": [ 2109 | "The speedup is more than 100 times" 2110 | ] 2111 | }, 2112 | { 2113 | "cell_type": "code", 2114 | "execution_count": null, 2115 | "metadata": {}, 2116 | "outputs": [], 2117 | "source": [] 2118 | } 2119 | ], 2120 | "metadata": { 2121 | "kernelspec": { 2122 | "display_name": "Python 3", 2123 | "language": "python", 2124 | "name": "python3" 2125 | }, 2126 | "language_info": { 2127 | "codemirror_mode": { 2128 | "name": "ipython", 2129 | "version": 3 2130 | }, 2131 | "file_extension": ".py", 2132 | "mimetype": "text/x-python", 2133 | "name": "python", 2134 | "nbconvert_exporter": "python", 2135 | "pygments_lexer": "ipython3", 2136 | "version": "3.6.5" 2137 | } 2138 | }, 2139 | "nbformat": 4, 2140 | "nbformat_minor": 2 2141 | } 2142 | 
--------------------------------------------------------------------------------