├── README.md
├── seminar01
│   ├── numpy_indexing.png
│   ├── numpy_fancy_indexing.png
│   ├── 01_Main_SklearnFirstClassifiers.ipynb
│   ├── 03_Main_NaiveBayes.ipynb
│   └── 06_Reference_Numpy.ipynb
├── seminar02
│   ├── ml_bias_variance.png
│   ├── 02_Main_Bagging.ipynb
│   ├── 03_Main_Boosting.ipynb
│   └── 05_Reference_BiasVariance.ipynb
├── hw01
│   ├── DMIA_Base_2021_Spring_hw1.pdf
│   ├── README.md
│   ├── NumpyScipy.ipynb
│   ├── KNN.ipynb
│   ├── NaiveBayes.ipynb
│   └── Polynom.ipynb
├── lecture01
│   └── Lecture1_MathAndSimpleMethods-compressed.pdf
└── hw02
    ├── README.md
    └── SimpleGB.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | # DMIA_Base_2021_Spring
--------------------------------------------------------------------------------
/seminar01/numpy_indexing.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-mining-in-action/DMIA_Base_2021_Spring/HEAD/seminar01/numpy_indexing.png
--------------------------------------------------------------------------------
/seminar02/ml_bias_variance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-mining-in-action/DMIA_Base_2021_Spring/HEAD/seminar02/ml_bias_variance.png
--------------------------------------------------------------------------------
/hw01/DMIA_Base_2021_Spring_hw1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-mining-in-action/DMIA_Base_2021_Spring/HEAD/hw01/DMIA_Base_2021_Spring_hw1.pdf
--------------------------------------------------------------------------------
/seminar01/numpy_fancy_indexing.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-mining-in-action/DMIA_Base_2021_Spring/HEAD/seminar01/numpy_fancy_indexing.png
--------------------------------------------------------------------------------
/lecture01/Lecture1_MathAndSimpleMethods-compressed.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-mining-in-action/DMIA_Base_2021_Spring/HEAD/lecture01/Lecture1_MathAndSimpleMethods-compressed.pdf
--------------------------------------------------------------------------------
/hw01/README.md:
--------------------------------------------------------------------------------
1 | Complete the tasks in the homework notebooks; each task contains a verification snippet whose output must be entered into the [form](https://forms.gle/LgGQgq2E1WauiGYT8).
2 |
3 | Submission deadline: 10:00, May 15, 2021
4 |
--------------------------------------------------------------------------------
/hw02/README.md:
--------------------------------------------------------------------------------
1 | Complete the tasks in the homework notebooks; each task contains a verification snippet whose output must be entered into the [form](https://forms.gle/RpPCzACqkmW8rkvN6).
2 |
3 | Submission deadline: 10:00, May 20, 2021
4 |
--------------------------------------------------------------------------------
/hw01/NumpyScipy.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import numpy as np\n",
10 | "import scipy"
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {},
16 | "source": [
17 | "### Task 1\n",
18 | "\n",
19 | "Given an array $arr$, for each position $i$ find the index of the element $arr_i$ in the array $arr$ sorted in descending order. All values of $arr$ are distinct."
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": null,
25 | "metadata": {},
26 | "outputs": [],
27 | "source": [
28 | "def function_1(arr):\n",
29 | " return #TODO"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": null,
35 | "metadata": {},
36 | "outputs": [],
37 | "source": [
38 | "(function_1([1, 2, 3]) == [2, 1, 0]).all()"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": null,
44 | "metadata": {},
45 | "outputs": [],
46 | "source": [
47 | "(function_1([-2, 1, 0]) == [2, 0, 1]).all()"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": null,
53 | "metadata": {},
54 | "outputs": [],
55 | "source": [
56 | "(function_1([-2, 1, 0, -1]) == [3, 0, 1, 2]).all()"
57 | ]
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "metadata": {},
62 | "source": [
63 | "**Value for the form**"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": null,
69 | "metadata": {},
70 | "outputs": [],
71 | "source": [
72 | "np.random.seed(42)\n",
73 | "arr = function_1(np.random.uniform(size=1000000))\n",
74 | "print(arr[7] + arr[42] + arr[445677] + arr[53422])"
75 | ]
76 | },
77 | {
78 | "cell_type": "markdown",
79 | "metadata": {},
80 | "source": [
81 | "### Task 2\n",
82 | "\n",
83 | "Given a matrix $X$, find the trace of the matrix $X X^T$"
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "execution_count": null,
89 | "metadata": {},
90 | "outputs": [],
91 | "source": [
92 | "def function_2(matrix):\n",
93 | " return #TODO"
94 | ]
95 | },
96 | {
97 | "cell_type": "code",
98 | "execution_count": null,
99 | "metadata": {},
100 | "outputs": [],
101 | "source": [
102 | "function_2(np.array([\n",
103 | " [1, 2],\n",
104 | " [3, 4]\n",
105 | "])) == 30"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": null,
111 | "metadata": {},
112 | "outputs": [],
113 | "source": [
114 | "function_2(np.array([\n",
115 | " [1, 0],\n",
116 | " [0, 1]\n",
117 | "])) == 2"
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": null,
123 | "metadata": {},
124 | "outputs": [],
125 | "source": [
126 | "function_2(np.array([\n",
127 | " [2, 0],\n",
128 | " [0, 2]\n",
129 | "])) == 8"
130 | ]
131 | },
132 | {
133 | "cell_type": "code",
134 | "execution_count": null,
135 | "metadata": {},
136 | "outputs": [],
137 | "source": [
138 | "function_2(np.array([\n",
139 | " [2, 1, 1],\n",
140 | " [1, 2, 1]\n",
141 | "])) == 12"
142 | ]
143 | },
144 | {
145 | "cell_type": "markdown",
146 | "metadata": {},
147 | "source": [
148 | "**Value for the form**"
149 | ]
150 | },
151 | {
152 | "cell_type": "code",
153 | "execution_count": null,
154 | "metadata": {},
155 | "outputs": [],
156 | "source": [
157 | "np.random.seed(42)\n",
158 | "arr1 = np.random.uniform(size=(1, 100000))\n",
159 | "arr2 = np.random.uniform(size=(100000, 1))\n",
160 | "print(int(function_2(arr1) + function_2(arr2)))"
161 | ]
162 | },
163 | {
164 | "cell_type": "markdown",
165 | "metadata": {},
166 | "source": [
167 | "### Task 3\n",
168 | "\n",
169 | "Given a set of points with coordinates points_x and points_y, find a point $p$ with zero $y$ coordinate (i.e., of the form $(x, 0)$) such that the Euclidean distance from $p$ to the farthest point of the given set is minimal. As the tests below show, the function should return the $x$ coordinate of $p$."
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": null,
175 | "metadata": {},
176 | "outputs": [],
177 | "source": [
178 | "def function_3(points_x, points_y):\n",
179 | " return #TODO"
180 | ]
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": null,
185 | "metadata": {},
186 | "outputs": [],
187 | "source": [
188 | "np.abs(function_3([0, 2], [1, 1]) - 1.) < 1e-3"
189 | ]
190 | },
191 | {
192 | "cell_type": "code",
193 | "execution_count": null,
194 | "metadata": {},
195 | "outputs": [],
196 | "source": [
197 | "np.abs(function_3([0, 2, 4], [1, 1, 1]) - 2.) < 1e-3"
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "execution_count": null,
203 | "metadata": {},
204 | "outputs": [],
205 | "source": [
206 | "np.abs(function_3([0, 4, 4], [1, 1, 1]) - 2.) < 1e-3"
207 | ]
208 | },
209 | {
210 | "cell_type": "markdown",
211 | "metadata": {},
212 | "source": [
213 | "**Value for the form**"
214 | ]
215 | },
216 | {
217 | "cell_type": "code",
218 | "execution_count": null,
219 | "metadata": {},
220 | "outputs": [],
221 | "source": [
222 | "np.random.seed(42)\n",
223 | "arr1 = np.random.uniform(-56, 100, size=100000)\n",
224 | "arr2 = np.random.uniform(-100, 100, size=100000)\n",
225 | "print(int(round(function_3(arr1, arr2) * 100)))"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": null,
231 | "metadata": {},
232 | "outputs": [],
233 | "source": []
234 | }
235 | ],
236 | "metadata": {
237 | "kernelspec": {
238 | "display_name": "dmia",
239 | "language": "python",
240 | "name": "dmia"
241 | },
242 | "language_info": {
243 | "codemirror_mode": {
244 | "name": "ipython",
245 | "version": 3
246 | },
247 | "file_extension": ".py",
248 | "mimetype": "text/x-python",
249 | "name": "python",
250 | "nbconvert_exporter": "python",
251 | "pygments_lexer": "ipython3",
252 | "version": "3.6.6"
253 | }
254 | },
255 | "nbformat": 4,
256 | "nbformat_minor": 2
257 | }
258 |
--------------------------------------------------------------------------------
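One possible sketch of the three TODO functions from `NumpyScipy.ipynb` above (not the official solutions — the homework expects your own implementation; the function names mirror the notebook's). Task 1 is a double `argsort`, Task 2 uses the identity $\mathrm{tr}(XX^T) = \sum_{ij} X_{ij}^2$, and Task 3 exploits that the worst-case squared distance is convex in $x$, so a ternary search finds the minimizer:

```python
import numpy as np

def function_1(arr):
    # rank of each element in the descending-sorted array:
    # argsort of argsort of the negated values
    arr = np.asarray(arr)
    return np.argsort(np.argsort(-arr))

def function_2(matrix):
    # trace(X X^T) equals the sum of squares of all entries of X
    return (np.asarray(matrix) ** 2).sum()

def function_3(points_x, points_y):
    # f(x) = max_i ((x - x_i)^2 + y_i^2) is convex in x (max of parabolas),
    # so a ternary search over [min(x_i), max(x_i)] finds the minimizer
    px = np.asarray(points_x, dtype=float)
    py = np.asarray(points_y, dtype=float)

    def worst(x):
        return np.max((x - px) ** 2 + py ** 2)

    lo, hi = px.min(), px.max()
    for _ in range(100):                      # interval shrinks by 1/3 per step
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if worst(m1) < worst(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2
```

These pass the notebook's small checks; the large "value for the form" cells can then be run as-is.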
/hw01/KNN.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Implementing the predict_proba method for KNN"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Below is a KNeighborsClassifier class that uses sklearn.neighbors.NearestNeighbors to find the nearest neighbors.\n",
15 | "\n",
16 | "You need to implement the predict_proba method that computes the classifier's output."
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "execution_count": 1,
22 | "metadata": {},
23 | "outputs": [],
24 | "source": [
25 | "import numpy as np\n",
26 | "\n",
27 | "from sklearn.base import BaseEstimator, ClassifierMixin\n",
28 | "from sklearn.neighbors import NearestNeighbors\n",
29 | "\n",
30 | "\n",
31 | "class KNeighborsClassifier(BaseEstimator, ClassifierMixin):\n",
32 | " '''\n",
33 | "    A class that lets us study KNN\n",
34 | " '''\n",
35 | " def __init__(self, n_neighbors=5, weights='uniform', \n",
36 | " metric='minkowski', p=2):\n",
37 | " '''\n",
38 | "    Initialize KNN with a few standard parameters\n",
39 | " '''\n",
40 | " assert weights in ('uniform', 'distance')\n",
41 | " \n",
42 | " self.n_neighbors = n_neighbors\n",
43 | " self.weights = weights\n",
44 | " self.metric = metric\n",
45 | " \n",
46 | " self.NearestNeighbors = NearestNeighbors(\n",
47 | "            n_neighbors=n_neighbors,\n",
48 | "            metric=self.metric, p=p)\n",
49 | " \n",
50 | " def fit(self, X, y):\n",
51 | " '''\n",
52 | "        Use sklearn.neighbors.NearestNeighbors\n",
53 | "        to memorize the training set\n",
54 | "        and to search for neighbors later\n",
55 | " '''\n",
56 | " self.NearestNeighbors.fit(X)\n",
57 | " self.n_classes = len(np.unique(y))\n",
58 | " self.y = y\n",
59 | " \n",
60 | " def predict_proba(self, X, use_first_zero_distant_sample=True):\n",
61 | " '''\n",
62 | "        To implement this method,\n",
63 | "        study how sklearn.neighbors.NearestNeighbors works'''\n",
64 | " \n",
65 | "        # obtain the distances to the neighbors and their labels here\n",
66 | " \n",
67 | " if self.weights == 'uniform':\n",
68 | " w = np.ones(distances.shape)\n",
69 | " else:\n",
70 | "            # to avoid division by zero,\n",
71 | "            # add a small constant, e.g. 1e-3\n",
72 | " w = 1/(distances + 1e-3)\n",
73 | "\n",
74 | "        # implement the prediction computation:\n",
75 | "        # for each object, compute for every class\n",
76 | "        # the total weight of the objects voting for it,\n",
77 | "        # then normalize these weights by their sum\n",
78 | "        # and return that as the KNN prediction\n",
79 | " return probs"
80 | ]
81 | },
82 | {
83 | "cell_type": "markdown",
84 | "metadata": {},
85 | "source": [
86 | "# Загрузим данные и обучим классификатор"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": 2,
92 | "metadata": {},
93 | "outputs": [],
94 | "source": [
95 | "from sklearn.datasets import load_iris\n",
96 | "X, y = load_iris(return_X_y=True)\n",
97 | "\n",
98 | "knn = KNeighborsClassifier(weights='distance')\n",
99 | "knn.fit(X, y)\n",
100 | "prediction = knn.predict_proba(X)"
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {},
106 | "source": [
107 | "Since we use the same sample for training and prediction, the nearest neighbor of every object is the object itself. As an exercise, try implementing a transform method that produces predictions for the training set while excluding each object from its own neighbor list.\n",
108 | "\n",
109 | "Let's look at the objects where max(prediction) != 1:"
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": 3,
115 | "metadata": {},
116 | "outputs": [
117 | {
118 | "name": "stdout",
119 | "output_type": "stream",
120 | "text": [
121 | "[ 56 68 70 72 77 83 106 110 119 123 127 133 134 138 146]\n"
122 | ]
123 | }
124 | ],
125 | "source": [
126 | "inds = np.arange(len(prediction))[prediction.max(1) != 1]\n",
127 | "print(inds)\n",
128 | "\n",
129 | "# [ 56 68 70 72 77 83 106 110 119 123 127 133 134 138 146]"
130 | ]
131 | },
132 | {
133 | "cell_type": "markdown",
134 | "metadata": {},
135 | "source": [
136 | "A few examples against which you can check the correctness of your implementation:"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": 4,
142 | "metadata": {},
143 | "outputs": [
144 | {
145 | "name": "stdout",
146 | "output_type": "stream",
147 | "text": [
148 | "68 [0. 0.99816311 0.00183689]\n",
149 | "77 [0. 0.99527902 0.00472098]\n",
150 | "146 [0. 0.00239145 0.99760855]\n"
151 | ]
152 | }
153 | ],
154 | "source": [
155 | "for i in 1, 4, -1:\n",
156 | " print(inds[i], prediction[inds[i]])\n",
157 | "\n",
158 | "# 68 [0. 0.99816311 0.00183689]\n",
159 | "# 77 [0. 0.99527902 0.00472098]\n",
160 | "# 146 [0. 0.00239145 0.99760855]"
161 | ]
162 | },
163 | {
164 | "cell_type": "markdown",
165 | "metadata": {},
166 | "source": [
167 | "**Note:** a difference in the third or fourth decimal place in these tests should not affect your submission"
168 | ]
169 | },
170 | {
171 | "cell_type": "markdown",
172 | "metadata": {},
173 | "source": [
174 | "# Answers for the form"
175 | ]
176 | },
177 | {
178 | "cell_type": "markdown",
179 | "metadata": {},
180 | "source": [
181 | "Enter max(prediction) for each object into the form. If the method is implemented correctly, the cell below prints the answers to enter"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": null,
187 | "metadata": {},
188 | "outputs": [],
189 | "source": [
190 | "for i in 56, 83, 127:\n",
191 | " print('{:.2f}'.format(max(prediction[i])))"
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "execution_count": null,
197 | "metadata": {},
198 | "outputs": [],
199 | "source": []
200 | }
201 | ],
202 | "metadata": {
203 | "kernelspec": {
204 | "display_name": "Python 3",
205 | "language": "python",
206 | "name": "python3"
207 | },
208 | "language_info": {
209 | "codemirror_mode": {
210 | "name": "ipython",
211 | "version": 3
212 | },
213 | "file_extension": ".py",
214 | "mimetype": "text/x-python",
215 | "name": "python",
216 | "nbconvert_exporter": "python",
217 | "pygments_lexer": "ipython3",
218 | "version": "3.6.5"
219 | }
220 | },
221 | "nbformat": 4,
222 | "nbformat_minor": 2
223 | }
224 |
--------------------------------------------------------------------------------
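The `predict_proba` logic described in the comments above (get neighbor distances and labels, weight the votes, normalize) can be cross-checked against a standalone sketch. The helper name `knn_predict_proba` is hypothetical, not part of the notebook; it takes an already-fitted `NearestNeighbors` and the training labels:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_predict_proba(nn, y_train, n_classes, X, weights='distance'):
    # distances and indices of the k nearest training objects, shape (n_samples, k)
    distances, neighbor_idx = nn.kneighbors(X)
    labels = y_train[neighbor_idx]
    if weights == 'uniform':
        w = np.ones_like(distances)
    else:
        # inverse-distance votes; 1e-3 guards against division by zero
        w = 1.0 / (distances + 1e-3)
    probs = np.zeros((len(X), n_classes))
    for c in range(n_classes):
        # total weight of neighbors voting for class c
        probs[:, c] = (w * (labels == c)).sum(axis=1)
    # normalize so each row sums to 1
    return probs / probs.sum(axis=1, keepdims=True)
```

On two well-separated clusters this returns the obvious one-hot probabilities, which makes it easy to sanity-check a classroom implementation against it.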
/seminar02/02_Main_Bagging.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Bagging"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Suppose we have identically distributed random variables $\\xi_1, \\xi_2, \\dots, \\xi_n$, pairwise correlated with coefficient $\\rho$ and with variance $\\sigma^2$. What is the variance of $\\frac1n \\sum_{i=1}^n \\xi_i$?\n",
15 | "\n",
16 | "$$\\mathbf{D} \\frac1n \\sum_{i=1}^n \\xi_i = \\frac1{n^2}\\mathbf{cov} (\\sum_{i=1}^n \\xi_i, \\sum_{i=1}^n \\xi_i) = \\frac1{n^2} \\sum_{i=1, j=1}^n \\mathbf{cov}(\\xi_i, \\xi_j) = \\frac1{n^2} \\sum_{i=1}^n \\mathbf{cov}(\\xi_i, \\xi_i) + \\frac1{n^2} \\sum_{i=1, j=1, i\\neq j}^n \\mathbf{cov}(\\xi_i, \\xi_j) = \\frac1{n^2} \\sum_{i=1}^n \\sigma^2+ \\frac1{n^2} \\sum_{i=1, j=1, i\\neq j}^n \\rho \\sigma^2 =$$\n",
17 | "$$ = \\frac1{n^2} n \\sigma^2 + \\frac1{n^2} n(n-1) \\rho \\sigma^2 = \\frac{\\sigma^2( 1 + \\rho(n-1))}{n}$$\n",
18 | "\n",
19 | "Thus, the less correlated the variables are, the smaller the variance after averaging. Roughly speaking, this is the whole idea of bagging: build many models that are as independent as possible, then average them, and the prediction becomes more stable!"
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {},
25 | "source": [
26 | "# Bagging over decision trees\n",
27 | "\n",
28 | "Let's see what models we can get from decision trees by randomizing them"
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": 1,
34 | "metadata": {},
35 | "outputs": [],
36 | "source": [
37 | "import pandas as pd\n",
38 | "import numpy as np\n",
39 | "from sklearn.model_selection import cross_val_score, train_test_split\n",
40 | "from sklearn.ensemble import BaggingClassifier\n",
41 | "from sklearn.tree import DecisionTreeClassifier\n",
42 | "from sklearn.ensemble import RandomForestClassifier\n",
43 | "from sklearn.linear_model import LogisticRegression"
44 | ]
45 | },
46 | {
47 | "cell_type": "code",
48 | "execution_count": 2,
49 | "metadata": {
50 | "scrolled": true
51 | },
52 | "outputs": [
53 | {
54 | "name": "stdout",
55 | "output_type": "stream",
56 | "text": [
57 | "['last_evaluation', 'number_project', 'average_montly_hours', 'time_spend_company', 'Work_accident', 'promotion_last_5years']\n"
58 | ]
59 | }
60 | ],
61 | "source": [
62 | "data = pd.read_csv('HR.csv')\n",
63 | "\n",
64 | "target = 'left'\n",
65 | "features = [c for c in data if c != target]\n",
66 | "print(features)\n",
67 | "\n",
68 | "X, y = data[features], data[target]"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 3,
74 | "metadata": {},
75 | "outputs": [],
76 | "source": [
77 | "rnd_d3 = DecisionTreeClassifier(max_features=int(len(features) ** 0.5)) # decision tree with randomized splits\n",
78 | "d3 = DecisionTreeClassifier() # ordinary decision tree"
79 | ]
80 | },
81 | {
82 | "cell_type": "markdown",
83 | "metadata": {},
84 | "source": [
85 | "Classification quality of a decision tree with default settings:"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": 4,
91 | "metadata": {},
92 | "outputs": [
93 | {
94 | "name": "stdout",
95 | "output_type": "stream",
96 | "text": [
97 | "Decision tree: 0.6523099419883976\n"
98 | ]
99 | }
100 | ],
101 | "source": [
102 | "print(\"Decision tree:\", cross_val_score(d3, X, y).mean())"
103 | ]
104 | },
105 | {
106 | "cell_type": "markdown",
107 | "metadata": {},
108 | "source": [
109 | "Bagging over decision trees:"
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": 5,
115 | "metadata": {},
116 | "outputs": [
117 | {
118 | "name": "stdout",
119 | "output_type": "stream",
120 | "text": [
121 | "D3 bagging: 0.7174495299059812\n"
122 | ]
123 | }
124 | ],
125 | "source": [
126 | "print(\"D3 bagging:\", cross_val_score(BaggingClassifier(d3, random_state=42), X, y).mean())"
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "The averaged model turned out much better. It turns out decision trees have a significant drawback: the resulting tree is unstable under small changes in the training set. But bagging turns this drawback into an advantage, since the averaged model works better when the base models are weakly correlated."
134 | ]
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {},
139 | "source": [
140 | "Studying the parameters of DecisionTreeClassifier, we can find a good way to make the trees even more diverse: when building each node, sample max_features random features and search for an informative split only among them."
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": 6,
146 | "metadata": {},
147 | "outputs": [
148 | {
149 | "name": "stdout",
150 | "output_type": "stream",
151 | "text": [
152 | "Randomized D3 Bagging: 0.7194494632259785\n"
153 | ]
154 | }
155 | ],
156 | "source": [
157 | "print(\"Randomized D3 Bagging:\", cross_val_score(BaggingClassifier(rnd_d3, random_state=42), X, y).mean())"
158 | ]
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {},
163 | "source": [
164 | "On average, the quality gets even better. For the number of features we used a heuristic common in practice: take the square root of the total number of features. For a regression task we would take a third of the total number."
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": 7,
170 | "metadata": {},
171 | "outputs": [
172 | {
173 | "name": "stdout",
174 | "output_type": "stream",
175 | "text": [
176 | "Random Forest: 0.7232495965859839\n"
177 | ]
178 | }
179 | ],
180 | "source": [
181 | "print(\"Random Forest:\", cross_val_score(RandomForestClassifier(random_state=42), X, y).mean())"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": 8,
187 | "metadata": {},
188 | "outputs": [
189 | {
190 | "name": "stdout",
191 | "output_type": "stream",
192 | "text": [
193 | "Logistic Regression: 0.6287053143962126\n"
194 | ]
195 | }
196 | ],
197 | "source": [
198 | "print(\"Logistic Regression:\", cross_val_score(LogisticRegression(), X, y).mean())"
199 | ]
200 | },
201 | {
202 | "cell_type": "markdown",
203 | "metadata": {},
204 | "source": [
205 | "## Optional exercise\n",
206 | "Repeated runs of cross_val_score will show different model quality.\n",
207 | "\n",
208 | "This depends on the randomization parameter \"random_state\" in DecisionTreeClassifier, BaggingClassifier, or RandomForest.\n",
209 | "\n",
210 | "To understand whether one model is really better than another, look at its quality on average, i.e. averaging runs with different random_state. Compare the quality and decide: is BaggingClassifier(d3) really better than BaggingClassifier(rnd_d3)?\n",
211 | "\n",
212 | "Also try answering: what is the difference here between BaggingClassifier(rnd_d3) and RandomForestClassifier()?"
213 | ]
214 | },
215 | {
216 | "cell_type": "code",
217 | "execution_count": null,
218 | "metadata": {
219 | "collapsed": true
220 | },
221 | "outputs": [],
222 | "source": []
223 | }
224 | ],
225 | "metadata": {
226 | "anaconda-cloud": {},
227 | "kernelspec": {
228 | "display_name": "Python 3",
229 | "language": "python",
230 | "name": "python3"
231 | },
232 | "language_info": {
233 | "codemirror_mode": {
234 | "name": "ipython",
235 | "version": 3
236 | },
237 | "file_extension": ".py",
238 | "mimetype": "text/x-python",
239 | "name": "python",
240 | "nbconvert_exporter": "python",
241 | "pygments_lexer": "ipython3",
242 | "version": "3.6.5"
243 | }
244 | },
245 | "nbformat": 4,
246 | "nbformat_minor": 1
247 | }
248 |
--------------------------------------------------------------------------------
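The variance formula $\frac{\sigma^2(1 + \rho(n-1))}{n}$ derived at the top of the bagging notebook can be verified empirically with a quick simulation. The parameters below ($n=10$, $\rho=0.3$, $\sigma=2$) are chosen arbitrarily for illustration; the covariance matrix encodes constant pairwise correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, sigma = 10, 0.3, 2.0

# covariance matrix with variance sigma^2 on the diagonal
# and sigma^2 * rho off-diagonal (constant pairwise correlation)
cov = sigma ** 2 * ((1 - rho) * np.eye(n) + rho * np.ones((n, n)))

# draw many n-dimensional samples and look at the variance of their mean
samples = rng.multivariate_normal(np.zeros(n), cov, size=200_000)
var_of_mean = samples.mean(axis=1).var()

theory = sigma ** 2 * (1 + rho * (n - 1)) / n
print(var_of_mean, theory)  # the two values should be close
```

Note that as $n \to \infty$ the variance tends to $\sigma^2 \rho$, not to zero — averaging cannot remove the correlated part of the error, which is why bagging tries to decorrelate the base models.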
/seminar02/03_Main_Boosting.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Gradient Boosting"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Boosting is a method of building a composition of algorithms in which the base algorithms are built sequentially, one after another, each successive algorithm constructed so as to reduce the error of the previous ones."
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "Assume that the algorithm is a linear combination of some base algorithms:\n",
22 | " $$a_N(x) = \\sum_{n=1}^N b_n(x)$$"
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "Suppose we are given some loss function to optimize\n",
30 | "$$\\sum_{i=1}^l L(\\hat y_i, y_i) \\to \\min$$ \n",
31 | "\n",
32 | "\n",
33 | "Now ask: what if we want to add one more algorithm to this composition, and not just add it, but do so as optimally as possible with respect to the original optimization problem. That is, we already have some algorithm $a_N(x)$ and we want to add a base algorithm $b_{N+1}(x)$ to it:\n",
34 | "\n",
35 | "$$\\sum_{i=1}^l L(a_{N}(x_i) + b_{N+1}(x_i), y_i) \\to \\min_{b_{N+1}}$$\n",
36 | "\n",
37 | "It makes sense to first solve a simpler problem: determine which values $r_1, r_2, \\dots, r_l$ the algorithm $b_{N+1}(x_i) = r_i$ should take on the training objects so that the error on the training set is minimal:\n",
38 | "\n",
39 | "$$F(r) = \\sum_{i=1}^l L(a_{N}(x_i) + r_i, y_i) \\to \\min_{r},$$\n",
40 | "\n",
41 | "where $r = (r_1, r_2, \\dots, r_l)$ is the vector of shifts."
42 | ]
43 | },
44 | {
45 | "cell_type": "markdown",
46 | "metadata": {},
47 | "source": [
48 | "Since the direction of steepest descent of a function is given by the direction of the antigradient, we can take it as the vector $r$:\n",
49 | "$$r = -\\nabla F \\\\$$\n",
50 | "$$r_i = -\\frac{\\partial{L}(a_N(x_i), y_i)}{\\partial{a_N(x_i)}}, \\ \\ \\ i = \\overline{1,l}$$"
51 | ]
52 | },
53 | {
54 | "cell_type": "markdown",
55 | "metadata": {},
56 | "source": [
57 | "The components of the vector $r$ are, in effect, the values the new algorithm $b_{N+1}(x)$ should take on the training objects in order to minimize the error of the composition being built. \n",
58 | "Training $b_{N+1}(x)$ is therefore a *supervised learning problem* in which $\\{(x_i, r_i)\\}_{i=1}^l$ is the training set, using, for example, the squared error:\n",
59 | "$$b_{N+1}(x) = \\arg \\min_{b}\\sum_{i=1}^l(b(x_i) - r_i)^2$$"
60 | ]
61 | },
62 | {
63 | "cell_type": "markdown",
64 | "metadata": {},
65 | "source": [
66 | "In this way we can find a decent improvement of the current algorithm $a_N(x)$, then repeat again and again, finally obtaining a combination of algorithms that minimizes the original objective."
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "# Boosting over decision trees\n",
74 | "\n",
75 | "The most popular family of base algorithms for boosting is decision trees. Let's look at the popular libraries"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": 1,
81 | "metadata": {
82 | "collapsed": true
83 | },
84 | "outputs": [],
85 | "source": [
86 | "import pandas as pd\n",
87 | "import numpy as np\n",
88 | "from sklearn.model_selection import cross_val_score, train_test_split\n",
89 | "\n",
90 | "from xgboost import XGBClassifier\n",
91 | "from catboost import CatBoostClassifier\n",
92 | "from lightgbm import LGBMClassifier\n",
93 | "\n",
94 | "import warnings\n",
95 | "warnings.filterwarnings('ignore')"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": 2,
101 | "metadata": {
102 | "collapsed": true
103 | },
104 | "outputs": [],
105 | "source": [
106 | "data = pd.read_csv('HR.csv')"
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "execution_count": 3,
112 | "metadata": {},
113 | "outputs": [
114 | {
115 | "data": {
199 | "text/plain": [
200 | " last_evaluation number_project average_montly_hours time_spend_company \\\n",
201 | "0 0.53 2 157 3 \n",
202 | "1 0.86 5 262 6 \n",
203 | "2 0.88 7 272 4 \n",
204 | "3 0.87 5 223 5 \n",
205 | "4 0.52 2 159 3 \n",
206 | "\n",
207 | " Work_accident left promotion_last_5years \n",
208 | "0 0 1 0 \n",
209 | "1 0 0 0 \n",
210 | "2 0 1 0 \n",
211 | "3 0 1 0 \n",
212 | "4 0 1 0 "
213 | ]
214 | },
215 | "execution_count": 3,
216 | "metadata": {},
217 | "output_type": "execute_result"
218 | }
219 | ],
220 | "source": [
221 | "data.head()"
222 | ]
223 | },
224 | {
225 | "cell_type": "code",
226 | "execution_count": 4,
227 | "metadata": {
228 | "collapsed": true,
229 | "scrolled": true
230 | },
231 | "outputs": [],
232 | "source": [
233 | "X, y = data.drop('left', axis=1).values, data['left'].values"
234 | ]
235 | },
236 | {
237 | "cell_type": "markdown",
238 | "metadata": {},
239 | "source": [
240 | "Classification quality of the gradient boosting libraries with default settings:"
241 | ]
242 | },
243 | {
244 | "cell_type": "code",
245 | "execution_count": 5,
246 | "metadata": {},
247 | "outputs": [
248 | {
249 | "name": "stdout",
250 | "output_type": "stream",
251 | "text": [
252 | "XGBClassifier: 0.7791\n",
253 | "CPU times: user 1.05 s, sys: 4.04 ms, total: 1.06 s\n",
254 | "Wall time: 1.06 s\n"
255 | ]
256 | }
257 | ],
258 | "source": [
259 | "%%time\n",
260 | "print(\"XGBClassifier: {:.4f}\".format(cross_val_score(XGBClassifier(), X, y).mean()))"
261 | ]
262 | },
263 | {
264 | "cell_type": "code",
265 | "execution_count": 6,
266 | "metadata": {},
267 | "outputs": [
268 | {
269 | "name": "stdout",
270 | "output_type": "stream",
271 | "text": [
272 | "CatBoostClassifier: 0.7776\n",
273 | "CPU times: user 1min 45s, sys: 52.7 s, total: 2min 38s\n",
274 | "Wall time: 50.4 s\n"
275 | ]
276 | }
277 | ],
278 | "source": [
279 | "%%time\n",
280 | "print(\"CatBoostClassifier: {:.4f}\".format(cross_val_score(CatBoostClassifier(verbose=False), X, y).mean()))"
281 | ]
282 | },
283 | {
284 | "cell_type": "code",
285 | "execution_count": 7,
286 | "metadata": {},
287 | "outputs": [
288 | {
289 | "name": "stdout",
290 | "output_type": "stream",
291 | "text": [
292 | "LGBMClassifier: 0.7790\n",
293 | "CPU times: user 562 ms, sys: 24.8 ms, total: 587 ms\n",
294 | "Wall time: 586 ms\n"
295 | ]
296 | }
297 | ],
298 | "source": [
299 | "%%time\n",
300 | "print(\"LGBMClassifier: {:.4f}\".format(cross_val_score(LGBMClassifier(), X, y).mean()))"
301 | ]
302 | },
303 | {
304 | "cell_type": "markdown",
305 | "metadata": {},
306 | "source": [
307 | "## Опциональное задание\n",
308 | "Поиграйтесь с основными параметрами алгоритмов, чтобы максимизировать качество"
309 | ]
310 | },
311 | {
312 | "cell_type": "code",
313 | "execution_count": null,
314 | "metadata": {
315 | "collapsed": true
316 | },
317 | "outputs": [],
318 | "source": []
319 | }
320 | ],
321 | "metadata": {
322 | "anaconda-cloud": {},
323 | "kernelspec": {
324 | "display_name": "Python 3",
325 | "language": "python",
326 | "name": "python3"
327 | },
328 | "language_info": {
329 | "codemirror_mode": {
330 | "name": "ipython",
331 | "version": 3
332 | },
333 | "file_extension": ".py",
334 | "mimetype": "text/x-python",
335 | "name": "python",
336 | "nbconvert_exporter": "python",
337 | "pygments_lexer": "ipython3",
338 | "version": "3.6.5"
339 | }
340 | },
341 | "nbformat": 4,
342 | "nbformat_minor": 1
343 | }
344 |
--------------------------------------------------------------------------------
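The notebook above compares XGBClassifier, CatBoostClassifier and LGBMClassifier with default settings, and the optional task asks you to tune their main parameters. As an illustrative sketch (not part of the original notebook), the usual pattern is a small grid search over the number of trees, the learning rate and the tree depth; sklearn's own GradientBoostingClassifier stands in here for the external libraries, whose estimators plug into GridSearchCV the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# The parameters that usually matter most for boosting:
# number of trees, learning rate (shrinkage) and tree depth.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid, cv=3, scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

The same `param_grid` keys exist (with the same meaning) in XGBClassifier and LGBMClassifier, so swapping the estimator is enough to tune those as well.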
/hw02/SimpleGB.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Homework 2\n",
8 | "\n",
9 | "### Implementing gradient boosting for classification"
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "In this assignment you need to implement gradient boosting over decision trees for a classification task. As the loss function, take **log loss**. You can read more about it here: https://scikit-learn.org/stable/modules/model_evaluation.html#log-loss"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "\n",
24 | "$y_i$ is the true label (0 or 1), and $\\hat{y}_i$ is your prediction\n",
25 | "\n",
26 | "It may seem that you should maximize the function $L(\\hat{y}, y) = \\sum_{i=1}^n y_i \\log(\\hat{y}_i) + (1 - y_i) \\log(1 - \\hat{y}_i)$,\n",
27 | "\n",
28 | "and that is true, but not quite: it is better to maximize $L(\\hat{y}, y) = \\sum_{i=1}^n y_i \\log(f(\\hat{y}_i)) + (1 - y_i) \\log(1 - f(\\hat{y}_i))$, where $f(x) = \\frac{1}{1 + e^{-x}}$. This way there are no constraints on the values the predictions $\\hat{y}_i$ can take."
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "### Task 1\n",
36 | "\n",
37 | "The function f(x) proposed above is usually called the **sigmoid** or **sigmoid function**. Write a function that computes the derivative of f(x)."
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": null,
43 | "metadata": {},
44 | "outputs": [],
45 | "source": [
46 | "import numpy as np"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": null,
52 | "metadata": {},
53 | "outputs": [],
54 | "source": [
55 | "def sigmoid(x):\n",
56 | " return 1. / (1 + np.exp(-x))\n",
57 | "\n",
58 | "\n",
59 | "def der_sigmoid(x):\n",
60 | " return None # TODO"
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "execution_count": null,
66 | "metadata": {},
67 | "outputs": [],
68 | "source": [
69 | "der_sigmoid(0) == 0.25"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": null,
75 | "metadata": {},
76 | "outputs": [],
77 | "source": [
78 | "der_sigmoid(np.array([0, 0])) == np.array([0.25, 0.25])"
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": null,
84 | "metadata": {},
85 | "outputs": [],
86 | "source": [
87 | "der_sigmoid(np.log(3)) == 0.1875"
88 | ]
89 | },
90 | {
91 | "cell_type": "markdown",
92 | "metadata": {},
93 | "source": [
94 | "**Value for the form:**"
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": null,
100 | "metadata": {},
101 | "outputs": [],
102 | "source": [
103 | "print(round(der_sigmoid(np.array([-10, 4.1, -1, 2])).sum() + sigmoid(0.42), 4))"
104 | ]
105 | },
106 | {
107 | "cell_type": "markdown",
108 | "metadata": {},
109 | "source": [
110 | "Good, now we can compute the derivative of f, but we still need the derivative of the log loss with respect to $\hat{y}$ in the first formulation of the loss\n",
111 | "\n",
112 | "Reminder: the first formulation is $y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)$"
113 | ]
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {},
118 | "source": [
119 | "### Task 2\n",
120 | "\n",
121 | "Write the computation of the derivative of the log loss"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": null,
127 | "metadata": {},
128 | "outputs": [],
129 | "source": [
130 | "def der_log_loss(y_hat, y_true):\n",
131 | " \"\"\"\n",
132 | " 0 < y_hat < 1\n",
133 | " \"\"\"\n",
134 | " return None # TODO"
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": null,
140 | "metadata": {},
141 | "outputs": [],
142 | "source": [
143 | "der_log_loss(0.5, 0) == -2"
144 | ]
145 | },
146 | {
147 | "cell_type": "code",
148 | "execution_count": null,
149 | "metadata": {},
150 | "outputs": [],
151 | "source": [
152 | "der_log_loss(0.5, 1) == 2"
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": null,
158 | "metadata": {},
159 | "outputs": [],
160 | "source": [
161 | "der_log_loss(np.array([0.8, 0.8]), np.array([1, 1])) == np.array([1.25, 1.25])"
162 | ]
163 | },
164 | {
165 | "cell_type": "markdown",
166 | "metadata": {},
167 | "source": [
168 | "**Value for the form**"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": null,
174 | "metadata": {},
175 | "outputs": [],
176 | "source": [
177 | "print(round(-sum(der_log_loss((x + 1) / 100., x % 2) for x in range(99)), 2))"
178 | ]
179 | },
180 | {
181 | "cell_type": "markdown",
182 | "metadata": {},
183 | "source": [
184 | "Now we can apply the chain rule and obtain the gradient of the loss in the second formulation:"
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": null,
190 | "metadata": {},
191 | "outputs": [],
192 | "source": [
193 | "def calc_gradient(y_hat, y_true):\n",
194 | " return der_log_loss(sigmoid(y_hat), y_true) * der_sigmoid(y_hat)"
195 | ]
196 | },
197 | {
198 | "cell_type": "markdown",
199 | "metadata": {},
200 | "source": [
201 | "Now we can write the gradient boosting code for classification"
202 | ]
203 | },
204 | {
205 | "cell_type": "markdown",
206 | "metadata": {},
207 | "source": [
208 | "### Task 3\n",
209 | "\n",
210 | "Complete the class"
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": null,
216 | "metadata": {},
217 | "outputs": [],
218 | "source": [
219 | "from sklearn.base import BaseEstimator # to support the sklearn interface\n",
220 | "from sklearn.tree import DecisionTreeRegressor # the base learner trained on each iteration"
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": null,
226 | "metadata": {},
227 | "outputs": [],
228 | "source": [
229 | "class SimpleGB(BaseEstimator):\n",
230 | " def __init__(self, tree_params_dict, iters=100, tau=1e-1):\n",
231 | " \"\"\"\n",
232 | " tree_params_dict - dict of tree parameters used when fitting the tree on each iteration\n",
233 | " iters - number of boosting iterations\n",
234 | " tau - coefficient applied to the tree predictions on each iteration\n",
235 | " \"\"\"\n",
236 | " self.tree_params_dict = tree_params_dict\n",
237 | " self.iters = iters\n",
238 | " self.tau = tau\n",
239 | " \n",
240 | " def fit(self, X_data, y_data):\n",
241 | " self.estimators = []\n",
242 | " curr_pred = 0\n",
243 | " for iter_num in range(self.iters):\n",
244 | " # Compute the gradient of the loss with respect to the predictions at curr_pred\n",
245 | " grad = None # TODO\n",
246 | " # We are maximizing, so fit a DecisionTreeRegressor with the parameters \n",
247 | " # tree_params_dict on X_data to predict grad\n",
248 | " algo = None # TODO\n",
249 | " self.estimators.append(algo)\n",
250 | " # all predictions are multiplied by tau when curr_pred is updated\n",
251 | " curr_pred += self.tau * algo.predict(X_data)\n",
252 | " \n",
253 | " def predict(self, X_data):\n",
254 | " # initially all predictions are zeros\n",
255 | " res = np.zeros(X_data.shape[0])\n",
256 | " for estimator in self.estimators:\n",
257 | " # add up the predictions of all trees, each weighted by self.tau\n",
258 | " pass # TODO\n",
259 | " \n",
260 | " return (res > 0).astype(int)"
261 | ]
262 | },
263 | {
264 | "cell_type": "markdown",
265 | "metadata": {},
266 | "source": [
267 | "## Checking the quality of the resulting class (the code for the form is at the very bottom)\n",
268 | "\n",
269 | "Feel free to play with the parameters; let's see who gets the best quality"
270 | ]
271 | },
272 | {
273 | "cell_type": "code",
274 | "execution_count": null,
275 | "metadata": {},
276 | "outputs": [],
277 | "source": [
278 | "# for quality evaluation\n",
279 | "from sklearn.model_selection import cross_val_score\n",
280 | "\n",
281 | "# for dataset generation\n",
282 | "from sklearn.datasets import make_classification\n",
283 | "\n",
284 | "# for comparison\n",
285 | "from sklearn.tree import DecisionTreeClassifier\n",
286 | "from sklearn.linear_model import LogisticRegression\n",
287 | "from xgboost import XGBClassifier\n",
288 | "\n",
289 | "import warnings\n",
290 | "warnings.filterwarnings('ignore')"
291 | ]
292 | },
293 | {
294 | "cell_type": "code",
295 | "execution_count": null,
296 | "metadata": {},
297 | "outputs": [],
298 | "source": [
299 | "X_data, y_data = make_classification(n_samples=1000, n_features=10, random_state=42)"
300 | ]
301 | },
302 | {
303 | "cell_type": "code",
304 | "execution_count": null,
305 | "metadata": {},
306 | "outputs": [],
307 | "source": [
308 | "algo = SimpleGB(\n",
309 | " tree_params_dict={\n",
310 | " 'max_depth':4\n",
311 | " },\n",
312 | " iters=100,\n",
313 | " tau = 0.1\n",
314 | ")"
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": null,
320 | "metadata": {},
321 | "outputs": [],
322 | "source": [
323 | "np.mean(cross_val_score(algo, X_data, y_data, cv=5, scoring='accuracy'))"
324 | ]
325 | },
326 | {
327 | "cell_type": "code",
328 | "execution_count": null,
329 | "metadata": {
330 | "scrolled": true
331 | },
332 | "outputs": [],
333 | "source": [
334 | "np.mean(cross_val_score(DecisionTreeClassifier(), X_data, y_data, cv=5, scoring='accuracy'))"
335 | ]
336 | },
337 | {
338 | "cell_type": "code",
339 | "execution_count": null,
340 | "metadata": {},
341 | "outputs": [],
342 | "source": [
343 | "np.mean(cross_val_score(XGBClassifier(), X_data, y_data, cv=5, scoring='accuracy'))"
344 | ]
345 | },
346 | {
347 | "cell_type": "code",
348 | "execution_count": null,
349 | "metadata": {},
350 | "outputs": [],
351 | "source": [
352 | "np.mean(cross_val_score(LogisticRegression(), X_data, y_data, cv=5, scoring='accuracy'))"
353 | ]
354 | },
355 | {
356 | "cell_type": "markdown",
357 | "metadata": {},
358 | "source": [
359 | "**Value for the form**"
360 | ]
361 | },
362 | {
363 | "cell_type": "code",
364 | "execution_count": null,
365 | "metadata": {},
366 | "outputs": [],
367 | "source": [
368 | "print(round(np.mean(cross_val_score(SimpleGB(\n",
369 | " tree_params_dict={\n",
370 | " 'max_depth': 4\n",
371 | " },\n",
372 | " iters=1000,\n",
373 | " tau = 0.01\n",
374 | "), X_data, y_data, cv=4, scoring='accuracy')), 3))"
375 | ]
376 | },
377 | {
378 | "cell_type": "code",
379 | "execution_count": null,
380 | "metadata": {},
381 | "outputs": [],
382 | "source": []
383 | }
384 | ],
385 | "metadata": {
386 | "kernelspec": {
387 | "display_name": "Python 3",
388 | "language": "python",
389 | "name": "python3"
390 | },
391 | "language_info": {
392 | "codemirror_mode": {
393 | "name": "ipython",
394 | "version": 3
395 | },
396 | "file_extension": ".py",
397 | "mimetype": "text/x-python",
398 | "name": "python",
399 | "nbconvert_exporter": "python",
400 | "pygments_lexer": "ipython3",
401 | "version": "3.6.5"
402 | }
403 | },
404 | "nbformat": 4,
405 | "nbformat_minor": 1
406 | }
407 |
--------------------------------------------------------------------------------
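For reference, the derivative that Task 1 in the notebook above asks for follows from the standard identity $f'(x) = f(x)\,(1 - f(x))$. A minimal sketch (an illustration with a numerical cross-check, not the official solution) looks like this:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Analytic derivative of the sigmoid: f'(x) = f(x) * (1 - f(x)).
def der_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Sanity checks matching the ones in the notebook.
assert np.isclose(der_sigmoid(0), 0.25)
assert np.isclose(der_sigmoid(np.log(3)), 0.1875)

# Cross-check against a central-difference numerical derivative.
eps = 1e-6
x = np.array([-2.0, 0.0, 1.5])
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
assert np.allclose(der_sigmoid(x), numeric, atol=1e-6)
print("derivative checks passed")
```

The same numerical cross-check trick applies to `der_log_loss` in Task 2 and is a cheap way to validate the hand-derived gradient before plugging it into the boosting loop.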
/hw01/NaiveBayes.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Implementing methods for Naive Bayes"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Let's generate a sample in which each feature has its own distribution whose parameters differ between the classes. Then we implement several methods for the class that is already partially written below:\n",
15 | "- the predict method\n",
16 | "- the \_find\_expon\_params and \_get\_expon\_density methods for the exponential distribution\n",
17 | "- the \_find\_norm\_params and \_get\_norm\_density methods for the normal distribution\n",
18 | "\n",
19 | "To implement \_find\_something\_params, study the documentation of the functions for working with these distributions in scipy.stats and use the methods provided there."
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 1,
25 | "metadata": {},
26 | "outputs": [],
27 | "source": [
28 | "import numpy as np\n",
29 | "import scipy\n",
30 | "import scipy.stats"
31 | ]
32 | },
33 | {
34 | "cell_type": "markdown",
35 | "metadata": {},
36 | "source": [
37 | "Let's define the generation parameters for three datasets"
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 2,
43 | "metadata": {},
44 | "outputs": [
45 | {
46 | "data": {
47 | "text/plain": [
48 | "((5000, 1), (5000,), ['bernoulli'])"
49 | ]
50 | },
51 | "execution_count": 2,
52 | "metadata": {},
53 | "output_type": "execute_result"
54 | }
55 | ],
56 | "source": [
57 | "func_params_set0 = [(scipy.stats.bernoulli, [dict(p=0.1), dict(p=0.5)]),\n",
58 | " ]\n",
59 | "\n",
60 | "func_params_set1 = [(scipy.stats.bernoulli, [dict(p=0.1), dict(p=0.5)]),\n",
61 | " (scipy.stats.expon, [dict(scale=1), dict(scale=0.3)]),\n",
62 | " ]\n",
63 | "\n",
64 | "func_params_set2 = [(scipy.stats.bernoulli, [dict(p=0.1), dict(p=0.5)]),\n",
65 | " (scipy.stats.expon, [dict(scale=1), dict(scale=0.3)]),\n",
66 | " (scipy.stats.norm, [dict(loc=0, scale=1), dict(loc=1, scale=2)]),\n",
67 | " ]\n",
68 | "\n",
69 | "def generate_dataset_for_nb(func_params_set=[], size = 2500, random_seed=0):\n",
70 | " '''\n",
71 | " Generates a sample with the given parameters of the distributions P(x|y).\n",
72 | " The number of classes is set by the length of the parameter list.\n",
73 | " Returns X, y, and a list of the distribution names\n",
74 | " '''\n",
75 | " np.random.seed(random_seed)\n",
76 | "\n",
77 | " X = []\n",
78 | " names = []\n",
79 | " for func, params in func_params_set:\n",
80 | " names.append(func.name)\n",
81 | " f = []\n",
82 | " for i, param in enumerate(params):\n",
83 | " f.append(func.rvs(size=size, **param))\n",
84 | " f = np.concatenate(f).reshape(-1,1)\n",
85 | " X.append(f)\n",
86 | "\n",
87 | " X = np.concatenate(X, 1)\n",
88 | " y = np.array([0] * size + [1] * size)\n",
89 | "\n",
90 | " shuffle_inds = np.random.choice(range(len(X)), size=len(X), replace=False)\n",
91 | " X = X[shuffle_inds]\n",
92 | " y = y[shuffle_inds]\n",
93 | "\n",
94 | " return X, y, names \n",
95 | "\n",
96 | "X, y, distrubution_names = generate_dataset_for_nb(func_params_set0)\n",
97 | "X.shape, y.shape, distrubution_names"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 3,
103 | "metadata": {},
104 | "outputs": [],
105 | "source": [
106 | "from collections import defaultdict\n",
107 | "from sklearn.base import BaseEstimator, ClassifierMixin\n",
108 | "\n",
109 | "class NaiveBayes(BaseEstimator, ClassifierMixin):\n",
110 | " '''\n",
111 | " A Naive Bayes implementation that, in addition to X and y,\n",
112 | " takes as input during training \n",
113 | " the types of the distributions of the feature values\n",
114 | " '''\n",
115 | " def __init__(self):\n",
116 | " pass\n",
117 | " \n",
118 | " def _find_bernoulli_params(self, x):\n",
119 | " '''\n",
120 | " returns the estimated parameter `p`\n",
121 | " of the scipy.stats.bernoulli distribution\n",
122 | " '''\n",
123 | " return dict(p=np.mean(x))\n",
124 | " \n",
125 | " def _get_bernoulli_probability(self, x, params):\n",
126 | " '''\n",
127 | " returns the probability of x for the given\n",
128 | " distribution parameters\n",
129 | " '''\n",
130 | " return scipy.stats.bernoulli.pmf(x, **params)\n",
131 | "\n",
132 | " def _find_expon_params(self, x):\n",
133 | " # estimate the distribution parameters\n",
134 | " # and return them\n",
135 | " pass\n",
136 | " \n",
137 | " def _get_expon_density(self, x, params):\n",
138 | " # return the density of the distribution at x\n",
139 | " pass\n",
140 | "\n",
141 | " def _find_norm_params(self, x):\n",
142 | " # estimate the distribution parameters\n",
143 | " # and return them\n",
144 | " pass\n",
145 | " \n",
146 | " def _get_norm_density(self, x, params):\n",
147 | " # return the density of the distribution at x\n",
148 | " pass\n",
149 | "\n",
150 | " def _get_params(self, x, distribution):\n",
151 | " '''\n",
152 | " x - values drawn from the distribution,\n",
153 | " distribution - the distribution name in scipy.stats\n",
154 | " '''\n",
155 | " if distribution == 'bernoulli':\n",
156 | " return self._find_bernoulli_params(x)\n",
157 | " elif distribution == 'expon':\n",
158 | " return self._find_expon_params(x)\n",
159 | " elif distribution == 'norm':\n",
160 | " return self._find_norm_params(x)\n",
161 | " else:\n",
162 | " raise NotImplementedError('Unknown distribution')\n",
163 | " \n",
164 | " def _get_probability_or_density(self, x, distribution, params):\n",
165 | " '''\n",
166 | " x - values,\n",
167 | " distribution - the distribution name in scipy.stats,\n",
168 | " params - the distribution parameters\n",
169 | " '''\n",
170 | " if distribution == 'bernoulli':\n",
171 | " return self._get_bernoulli_probability(x, params)\n",
172 | " elif distribution == 'expon':\n",
173 | " return self._get_expon_density(x, params)\n",
174 | " elif distribution == 'norm':\n",
175 | " return self._get_norm_density(x, params)\n",
176 | " else:\n",
177 | " raise NotImplementedError('Unknown distribution')\n",
178 | "\n",
179 | " def fit(self, X, y, distrubution_names):\n",
180 | " '''\n",
181 | " X - training sample,\n",
182 | " y - target variable,\n",
183 | " distrubution_names - list of the distribution names \n",
184 | " that the values P(x|y) are assumed to follow\n",
185 | " ''' \n",
186 | " assert X.shape[1] == len(distrubution_names)\n",
187 | " assert set(y) == {0, 1}\n",
188 | " self.n_classes = len(np.unique(y))\n",
189 | " self.distrubution_names = distrubution_names\n",
190 | " \n",
191 | " self.y_prior = [(y == j).mean() for j in range(self.n_classes)]\n",
192 | " \n",
193 | " self.distributions_params = defaultdict(dict)\n",
194 | " for i in range(X.shape[1]):\n",
195 | " distribution = self.distrubution_names[i]\n",
196 | " for j in range(self.n_classes):\n",
197 | " values = X[y == j, i]\n",
198 | " self.distributions_params[j][i] = \\\n",
199 | " self._get_params(values, distribution)\n",
200 | " \n",
201 | " return self.distributions_params\n",
202 | " \n",
203 | " def predict(self, X):\n",
204 | " '''\n",
205 | " X - test sample\n",
206 | " '''\n",
207 | " assert X.shape[1] == len(self.distrubution_names)\n",
208 | " \n",
209 | " # implement the argmax formula by which \n",
210 | " # Naive Bayes decides which class an object belongs to,\n",
211 | " # and apply it to every object in X\n",
212 | " #\n",
213 | " # note: this formula is usually computed via \n",
214 | " # its logarithm, i.e. as a sum of log-probabilities, \n",
215 | " # because multiplying sufficiently small probabilities leads\n",
216 | " # to numerical inaccuracies\n",
217 | " \n",
218 | " return preds"
219 | ]
220 | },
221 | {
222 | "cell_type": "markdown",
223 | "metadata": {},
224 | "source": [
225 | "Let's check the result on the example of the first distribution"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": 4,
231 | "metadata": {},
232 | "outputs": [
233 | {
234 | "data": {
235 | "text/plain": [
236 | "defaultdict(dict, {0: {0: {'p': 0.1128}}, 1: {0: {'p': 0.482}}})"
237 | ]
238 | },
239 | "execution_count": 4,
240 | "metadata": {},
241 | "output_type": "execute_result"
242 | }
243 | ],
244 | "source": [
245 | "nb = NaiveBayes()\n",
246 | "nb.fit(X, y, ['bernoulli'])"
247 | ]
248 | },
249 | {
250 | "cell_type": "code",
251 | "execution_count": 5,
252 | "metadata": {},
253 | "outputs": [
254 | {
255 | "name": "stdout",
256 | "output_type": "stream",
257 | "text": [
258 | "0.6045\n"
259 | ]
260 | }
261 | ],
262 | "source": [
263 | "from sklearn.metrics import f1_score\n",
264 | "\n",
265 | "prediction = nb.predict(X)\n",
266 | "score = f1_score(y, prediction)\n",
267 | "print('{:.2f}'.format(score))"
268 | ]
269 | },
270 | {
271 | "cell_type": "markdown",
272 | "metadata": {},
273 | "source": [
274 | "# Answers for the form"
275 | ]
276 | },
277 | {
278 | "cell_type": "markdown",
279 | "metadata": {},
280 | "source": [
281 | "The answers for the form are the numbers that will be printed below. All answers have been verified: in these examples the result is the same whether computed via a sum of logarithms or via a product of probabilities."
282 | ]
283 | },
284 | {
285 | "cell_type": "code",
286 | "execution_count": null,
287 | "metadata": {},
288 | "outputs": [],
289 | "source": [
290 | "scipy.stats.bernoulli.name\n",
291 | "\n",
292 | "for fps in (func_params_set0 * 2,\n",
293 | " func_params_set1, \n",
294 | " func_params_set2):\n",
295 | " \n",
296 | "\n",
297 | " X, y, distrubution_names = generate_dataset_for_nb(fps)\n",
298 | " \n",
299 | " nb = NaiveBayes()\n",
300 | " nb.fit(X, y, distrubution_names)\n",
301 | " prediction = nb.predict(X)\n",
302 | " score = f1_score(y, prediction)\n",
303 | " print('{:.2f}'.format(score))"
304 | ]
305 | },
306 | {
307 | "cell_type": "code",
308 | "execution_count": null,
309 | "metadata": {},
310 | "outputs": [],
311 | "source": []
312 | }
313 | ],
314 | "metadata": {
315 | "kernelspec": {
316 | "display_name": "Python 3",
317 | "language": "python",
318 | "name": "python3"
319 | },
320 | "language_info": {
321 | "codemirror_mode": {
322 | "name": "ipython",
323 | "version": 3
324 | },
325 | "file_extension": ".py",
326 | "mimetype": "text/x-python",
327 | "name": "python",
328 | "nbconvert_exporter": "python",
329 | "pygments_lexer": "ipython3",
330 | "version": "3.6.5"
331 | }
332 | },
333 | "nbformat": 4,
334 | "nbformat_minor": 2
335 | }
336 |
--------------------------------------------------------------------------------
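The `_find_expon_params` and `_find_norm_params` stubs in the notebook above are meant to be backed by scipy.stats, whose distributions expose a generic maximum-likelihood `fit` method alongside `pdf`. A minimal sketch (an illustration under the assumption that the exponential's location is pinned at 0, matching the `scale`-only generation in the notebook; not the official solution):

```python
import numpy as np
import scipy.stats

rng = np.random.RandomState(0)

# Exponential: fix loc=0 so fit() only estimates the scale parameter.
x_expon = scipy.stats.expon.rvs(scale=0.3, size=5000, random_state=rng)
loc, scale = scipy.stats.expon.fit(x_expon, floc=0)
assert abs(scale - 0.3) < 0.05

# Normal: fit() returns the MLE of (loc, scale), i.e. mean and std.
x_norm = scipy.stats.norm.rvs(loc=1, scale=2, size=5000, random_state=rng)
mu, sigma = scipy.stats.norm.fit(x_norm)
assert abs(mu - 1) < 0.1 and abs(sigma - 2) < 0.1

# Density at a point, given the fitted parameters - this is what the
# _get_*_density methods return.
print(scipy.stats.norm.pdf(1.0, loc=mu, scale=sigma))
```

Returning the fitted values as a dict (e.g. `dict(scale=scale)`, `dict(loc=mu, scale=sigma)`) keeps them compatible with the `**params` unpacking used by `_get_bernoulli_probability`.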
/hw01/Polynom.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import numpy as np\n",
10 | "\n",
11 | "import matplotlib.pyplot as plt\n",
12 | "%matplotlib inline"
13 | ]
14 | },
15 | {
16 | "cell_type": "markdown",
17 | "metadata": {},
18 | "source": [
19 | "# Task\n",
20 | "\n",
21 | "Complete the implementation of the class for fitting a polynomial regression, i.e. given points $x_1, x_2, \dots, x_n$ and $y_1, y_2, \dots, y_n$ and a number $d$, solve the optimization problem:\n",
22 | "\n",
23 | "$$ \sum_{i=1}^n (f(x_i) - y_i)^2 \to \min_f,$$ where f is a polynomial of degree at most $d$."
24 | ]
25 | },
26 | {
27 | "cell_type": "markdown",
28 | "metadata": {},
29 | "source": [
30 | "**Note:** in this task the optimization problem can be solved either with scipy.optimize or by reducing it to linear regression and using its closed-form weight formula. The second way is preferable, but the first is simpler and can be used as a check. However you solve the problem, submit to the form the answer you are most confident in.\n",
31 | "\n",
32 | "**Warning:** grading of this task does **not** assume that you solve it with SGD, since getting the same answer that way is *very* hard."
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 2,
38 | "metadata": {},
39 | "outputs": [],
40 | "source": [
41 | "class PolynomialRegression(object):\n",
42 | " \n",
43 | " def __init__(self, max_degree=1):\n",
44 | " self.max_degree = max_degree\n",
45 | " \n",
46 | " def fit(self, points_x, points_y):\n",
47 | " # insert your code here to fit the model\n",
48 | " \n",
49 | " return self\n",
50 | " \n",
51 | " def predict(self, points_x):\n",
52 | " # insert your code here to predict the values\n",
53 | " \n",
54 | " return values"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": 3,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "np.random.seed(42)\n",
64 | "points_x = np.random.uniform(-10, 10, size=10)\n",
65 | "# we use a list comprehension, but think about how to write it using np.array operations\n",
66 | "points_y = np.array([4 - x + x ** 2 + 0.1 * x ** 3 + np.random.uniform(-20, 20) for x in points_x])"
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": 4,
72 | "metadata": {},
73 | "outputs": [
74 | {
75 | "data": {
76 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAloAAAEyCAYAAAAiFH5AAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAGWhJREFUeJzt3X+MXXd55/H3s2ODRhm00zZ0GjthDVrX0ibZtetR6IoumtlAHSJETFSFWBVNgF2TVVm1WuQWQ1Sipgha80OqaGnNJiIsNBNEHOONQk02dDZUWrOx42ycX9M6aVI8jpKSxAkDI2qbZ/+YM2HGzPGM58537px73y9pNPd+z7nnPPPoXOeT8z3n3shMJEmStPT+RbsLkCRJ6lQGLUmSpEIMWpIkSYUYtCRJkgoxaEmSJBVi0JIkSSrEoCVJklSIQUuSJKkQg5YkSVIhq+ZbISJuBd4JPJ+Zl1RjdwAbqlX6gROZuTEi1gGPA2PVsgOZecN8+zj//PNz3bp151x8p/jhD3/Ieeed1+4yViR7U8/e1LM39exNPXtTz97MdujQoe9n5usXsu68QQv4EvB54MvTA5n5nunHEfEZ4OUZ6z+ZmRsXVuqUdevWcfDgwXN5SUcZHR1laGio3WWsSPamnr2pZ2/q2Zt69qaevZktIp5Z6LrzBq3MvL86UzXXjgK4BviPC92hJElSt4iFfKl0FbTunp46nDH+VuCzmTk4Y71Hgb8DXgFuzMzv1GxzO7AdYGBgYPPIyMhi/4bGm5iYoK+vr91lrEj2pp69qWdv6tmbevamnr2ZbXh4+NB09pnPQqYOz2YbcPuM588Cb8jMFyJiM7A3Ii7OzFfOfGFm7gZ2AwwODmY3n5L0lGw9e1PP3tSzN/XsTT17U8/eLN6i7zqMiFXA1cAd02OZ+ePMfKF6fAh4EvjlVouUJElqolY+3uFtwBOZeWx6ICJeHxE91eM3AeuBp1orUZIkqZnmDVoRcTvwf4ANEXEsIj5QLbqW2dOGAG8FHo6Ih4CvAzdk5otLWbAkSVJTLOSuw20149fPMXYncGfrZUmSJDVfqxfDS5Iktd3ew+Ps2j/G8ROTrOnvZceWDWzdtLbdZRm0JElSs+09PM7OPUeYPHkagPETk+zccwSg7WHL7zqUJEmNtmv/2Ksha9rkydPs2j9W84rlY9CSJEmNdvzE5DmNLyeDliRJarQ1/b3nNL6cDFqSJKnRdmzZQO/qnlljvat72LFlQ5sq+ikvhpckSY02fcG7dx1KkiQVsHXT2hURrM7k1KEkSVIhBi1JkqRCDFqSJEmFGLQkSZIKMWhJkiQVYtCSJEkqxKAlSZJUiEFLkiSpEIOWJElSIQYtSZKkQgxakiRJhRi0JEmSCjFoSZIkFWLQkiRJKsSgJUmSVIhBS5IkqRCDliRJUiEGLUmSpELmDVoRcWtEPB8Rj8wYuykixiPioernyhnLdkbE0YgYi4gtpQqXJEla6RZyRutLwBVzjH8uMzdWP/cARMS/Aa4FLq5e8+cR0bNUxUqSJDXJvEErM+8HXlzg9q4CRjLzx5n5D8BR4LIW6pMkSWqsyMz5V4pYB9ydmZdUz28CrgdeAQ4CH87MlyLi88CBzPxKtd4twDcz8+tzbHM7sB1gYGBg88jIyBL8Oc00MTFBX19fu8tYkexNPXtTz97Uszf17E09ezPb8PDwocwcXMi6qxa5jy8ANwNZ/f4M8P5z2UBm7gZ2AwwODubQ0NAiS2m+0dFRuvnvPxt7U8/e1LM39exNPXtTz94s3qLuOszM5zLzdGb+BPgiP50eHAcumrHqhdWYJElS11lU0IqIC2Y8fTcwfUfiPuDaiHhtRLwRWA/839ZKlCRJaqZ5pw4j4nZgCDg/Io4BHweGImIjU1OHTwMfBMjMRyPia8BjwCngtzPzdJnSJUmSVrZ5g1Zmbptj+JazrP8J4BOtFCVJktQJ/
GR4SZKkQgxakiRJhRi0JEmSCjFoSZIkFWLQkiRJKsSgJUmSVIhBS5IkqRCDliRJUiEGLUmSpEIMWpIkSYUYtCRJkgoxaEmSJBVi0JIkSSrEoCVJklSIQUuSJKkQg5YkSVIhBi1JkqRCDFqSJEmFGLQkSZIKMWhJkiQVYtCSJEkqxKAlSZJUiEFLkiSpEIOWJElSIQYtSZKkQgxakiRJhcwbtCLi1oh4PiIemTG2KyKeiIiHI+KuiOivxtdFxGREPFT9/EXJ4iVJklayhZzR+hJwxRlj9wKXZOa/Bf4O2Dlj2ZOZubH6uWFpypQkSWqeeYNWZt4PvHjG2Lcy81T19ABwYYHaJEmSGi0yc/6VItYBd2fmJXMs+5/AHZn5lWq9R5k6y/UKcGNmfqdmm9uB7QADAwObR0ZGFvcXdICJiQn6+vraXcaKZG/q2Zt69qaevalnb+rZm9mGh4cPZebgQtZd1cqOIuJjwCngq9XQs8AbMvOFiNgM7I2IizPzlTNfm5m7gd0Ag4ODOTQ01EopjTY6Oko3//1nY2/q2Zt69qaevalnb+rZm8Vb9F2HEXE98E7gN7M6LZaZP87MF6rHh4AngV9egjolSZIaZ1FBKyKuAH4PeFdm/mjG+Osjoqd6/CZgPfDUUhQqSZLUNPNOHUbE7cAQcH5EHAM+ztRdhq8F7o0IgAPVHYZvBf4wIk4CPwFuyMwX59ywJElSh5s3aGXmtjmGb6lZ907gzlaLkiRJ6gR+MrwkSVIhBi1JkqRCDFqSJEmFGLQkSZIKMWhJkiQVYtCSJEkqxKAlSZJUiEFLkiSpEIOWJElSIQYtSZKkQgxakiRJhRi0JEmSCjFoSZIkFWLQkiRJKsSgJUmSVIhBS5IkqRCDliRJUiEGLUmSpEIMWpIkSYUYtCRJkgoxaEmSJBVi0JIkSSrEoCVJklSIQUuSJKkQg5YkSVIhBi1JkqRCFhS0IuLWiHg+Ih6ZMfbzEXFvRPx99fvnqvGIiD+NiKMR8XBE/Eqp4iVJklayhZ7R+hJwxRljHwHuy8z1wH3Vc4B3AOurn+3AF1ovU5IkqXkWFLQy837gxTOGrwJuqx7fBmydMf7lnHIA6I+IC5aiWEmSpCaJzFzYihHrgLsz85Lq+YnM7K8eB/BSZvZHxN3ApzLzb6tl9wG/n5kHz9jedqbOeDEwMLB5ZGRkaf6iBpqYmKCvr6/dZaxI9qaevalnb+rZm3r2pp69mW14ePhQZg4uZN1VS7HDzMyIWFhi++lrdgO7AQYHB3NoaGgpSmmk0dFRuvnvPxt7U8/e1LM39exNPXtTz94sXit3HT43PSVY/X6+Gh8HLpqx3oXVmCRJUldpJWjtA66rHl8HfGPG+G9Vdx/+KvByZj7bwn4kSZIaaUFThxFxOzAEnB8Rx4CPA58CvhYRHwCeAa6pVr8HuBI4CvwIeN8S1yxJktQICwpambmtZtHlc6ybwG+3UpQkSVIn8JPhJUmSCjFoSZIkFWLQkiRJKsSgJUmSVMiSfGCpJElqtr2Hx9m1f4zjJyZZ09/Lji0b2LppbbvLajyDliRJXW7v4XF27jnC5MnTAIyfmGTnniMAhq0WOXUoSVKX27V/7NWQNW3y5Gl27R9rU0Wdw6AlSVKXO35i8pzGtXAGLUmSutya/t5zGtfCGbQkSepyO7ZsoHd1z6yx3tU97NiyoU0VdQ4vhpckqctNX/DuXYdLz6AlSZLYummtwaoApw4lSZIKMWhJkiQVYtCSJEkqxKAlSZJUiEFLkiSpEIOWJElSIQYtSZKkQgxakiRJhRi0JEmSCjFoSZIkFWLQkiRJKsSgJUmSVIhBS5IkqRCDliRJUiGrFvvCiNgA3DFj6E3AHwD9wH8G/qka/2hm3rPoCiVJkhpq0UErM8eAjQAR0QOMA3cB7wM+l5mfXpIKJUmSGmqppg4vB57MzGeWaHuSJEmNF5nZ+kYibgUezMzPR8RNwPXAK8BB4MOZ+dIcr9kObAcYGBjYP
DIy0nIdTTUxMUFfX1+7y1iR7E09e1PP3tSzN/XsTT17M9vw8PChzBxcyLotB62IeA1wHLg4M5+LiAHg+0ACNwMXZOb7z7aNwcHBPHjwYEt1NNno6ChDQ0PtLmNFsjf17E09e1PP3tSzN/XszWwRseCgtRRTh+9g6mzWcwCZ+Vxmns7MnwBfBC5bgn1IkiQ1zlIErW3A7dNPIuKCGcveDTyyBPuQJElqnEXfdQgQEecBbwc+OGP4TyJiI1NTh0+fsUySJKlrtBS0MvOHwC+cMfbeliqSJEnqEH4yvCRJUiEGLUmSpEIMWpIkSYUYtCRJkgoxaEmSJBVi0JIkSSrEoCVJklSIQUuSJKkQg5YkSVIhBi1JkqRCDFqSJEmFGLQkSZIKMWhJkiQVYtCSJEkqxKAlSZJUiEFLkiSpEIOWJElSIQYtSZKkQgxakiRJhRi0JEmSCjFoSZIkFWLQkiRJKsSgJUmSVIhBS5IkqRCDliRJUiEGLUmSpEJWtbqBiHga+AFwGjiVmYMR8fPAHcA64Gngmsx8qdV9SZIkNclSndEazsyNmTlYPf8IcF9mrgfuq55LkiR1lVJTh1cBt1WPbwO2FtqPJEnSihWZ2doGIv4BeAlI4C8zc3dEnMjM/mp5AC9NP5/xuu3AdoCBgYHNIyMjLdXRZBMTE/T19bW7jBXJ3tSzN/XsTT17U8/e1LM3sw0PDx+aMYt3Vi1fowX8WmaOR8QvAvdGxBMzF2ZmRsTPpLnM3A3sBhgcHMyhoaElKKWZRkdH6ea//2zsTT17U8/e1LM39exNPXuzeC1PHWbmePX7eeAu4DLguYi4AKD6/Xyr+5EkSWqaloJWRJwXEa+bfgz8OvAIsA+4rlrtOuAbrexHkiSpiVqdOhwA7pq6DItVwF9l5l9HxAPA1yLiA8AzwDUt7keSJKlxWgpamfkU8O/mGH8BuLyVbUuS1Iq9h8fZtX+M4ycmWdPfy44tG9i6aW27y1KXWYqL4SVJWlH2Hh5n554jTJ48DcD4iUl27jkCYNjSsvIreCRJHWfX/rFXQ9a0yZOn2bV/rE0VqVsZtCRJHef4iclzGpdKMWhJkjrOmv7ecxqXSjFoSZI6zo4tG+hd3TNrrHd1Dzu2bGhTRepWXgwvSeo40xe8e9eh2s2gJUnqSFs3rTVYqe2cOpQkSSrEoCVJklSIQUuSJKkQg5YkSVIhBi1JkqRCDFqSJEmFGLQkSZIKMWhJkiQVYtCSJEkqxKAlSZJUiEFLkiSpEIOWJElSIQYtSZKkQgxakiRJhRi0JEmSCjFoSZIkFWLQkiRJKsSgJUmSVIhBS5IkqZBFB62IuCgi/iYiHouIRyPid6rxmyJiPCIeqn6uXLpyJUmSmmNVC689BXw4Mx+MiNcBhyLi3mrZ5zLz062XJ0mS1FyLDlqZ+SzwbPX4BxHxOLB2qQqTJElqusjM1jcSsQ64H7gE+G/A9cArwEGmznq9NMdrtgPbAQYGBjaPjIy0XEdTTUxM0NfX1+4yViR7U8/e1LM39exNPXtTz97MNjw8fCgzBxeybstBKyL6gP8NfCIz90TEAPB9IIGbgQsy8/1n28bg4GAePHiwpTqabHR0lKGhoXaXsSLZm3r2pp69qWdv6tmbevZmtohYcNBq6a7DiFgN3Al8NTP3AGTmc5l5OjN/AnwRuKyVfUiSJDVVK3cdBnAL8HhmfnbG+AUzVns38Mjiy5MkSWquVu46fAvwXuBIRDxUjX0U2BYRG5maOnwa+GBLFUqSJDVUK3cd/i0Qcyy6Z/HlSJIkdQ4/GV6SJKkQg5YkSVIhBi1JkqRCDFqSJEmFtHLXoSRJxe09PM6u/WMcPzHJmv5edmzZwNZNfuObmsGgJUlasfYeHmfnniNMnjwNwPiJSXbuOQJg2FIjOHUoSVqxdu0fezVkTZs8eZpd+8faVJF0brrijJannSWpmY6fmDyncWml6fig1emnnW/ce4Tbv/s9TmfSE8G2N1/EH229tN1lS
dKSWNPfy/gcoWpNf28bqpHOXcdPHXbyaecb9x7hKwf+kdOZAJzO5CsH/pEb9x5pc2WStDR2bNlA7+qeWWO9q3vYsWVDmyqSzk3HB61OPu18+3e/d07jktQ0Wzet5ZNXX8ra/l4CWNvfyyevvrQjZiTUHTp+6rCTTztPn8la6LgkNdHWTWsNVmqsjj+j1cmnnXtiru/0rh+XJEnLq+ODViefdt725ovOaVySJC2vjp86hM497Tx9d6F3HUqStDJ1RdDqZH+09VKDlSRJK1THTx1KkiS1i0FLkiSpEIOWJElSIQYtSZKkQrwYXpIaZO/hcXbtH+P4iUnW9PeyY8uGjryrWuoUBi1Jaoi9h8fZuefIq9/fOn5ikp17pr7b1LAlrUxOHUpSQ+zaP/ZqyJo2efI0u/aPtakiSfMxaElSQxyf43tbzzYuqf0MWpLUEGv6e89pXFL7GbQkqSF2bNlA7+qeWWO9q3vYsWVDmyqSNJ9iQSsiroiIsYg4GhEfKbUfSeoWWzet5ZNXX8ra/l4CWNvfyyevvtQL4aUVrMhdhxHRA/wZ8HbgGPBAROzLzMdK7E+SusXWTWsNVlKDlDqjdRlwNDOfysx/BkaAqwrtS5IkaUWKzFz6jUb8BnBFZv6n6vl7gTdn5odmrLMd2A4wMDCweWRkZMnraIqJiQn6+vraXcaKZG/q2Zt69qaevalnb+rZm9mGh4cPZebgQtZt2weWZuZuYDfA4OBgDg0NtauUthsdHaWb//6zsTf17E09e1PP3tSzN/XszeKVmjocBy6a8fzCakySJKlrlApaDwDrI+KNEfEa4FpgX6F9SZIkrUhFpg4z81REfAjYD/QAt2bmoyX2JUndwi+Ulpqn2DVamXkPcE+p7UtSN/ELpaVm8pPhJakB/EJpqZkMWpLUAH6htNRMBi2pYfYeHuctn/o2R8Zf5i2f+jZ7D3tDbzfwC6WlZjJoSQ0yfZ3OeHUWY/o6HcNW5/MLpaVmMmhJDeJ1Ot3LL5SWmqltnwwv6dx5nU538wulpebxjJbUIF6nI0nNYtCSGsTrdCSpWZw6lBpketpo6pqsH7DWTweXpBXNoCU1zPR1OqOjo/zX3xxqdzmSpLNw6lCSJKkQg5YkSVIhBi1JkqRCDFqSJEmFGLQkSZIKMWhJkiQVYtCSJEkqxKAlSZJUSGRmu2sgIv4JeKbddbTR+cD3213ECmVv6tmbevamnr2pZ2/q2ZvZ/lVmvn4hK66IoNXtIuJgZg62u46VyN7Uszf17E09e1PP3tSzN4vn1KEkSVIhBi1JkqRCDForw+52F7CC2Zt69qaevalnb+rZm3r2ZpG8RkuSJKkQz2hJkiQVYtCSJEkqxKDVBhFxR0Q8VP08HREP1az3dEQcqdY7uNx1tkNE3BQR4zP6c2XNeldExFhEHI2Ijyx3ne0QEbsi4omIeDgi7oqI/pr1uua4me84iIjXVu+3oxHx3YhYt/xVLr+IuCgi/iYiHouIRyPid+ZYZygiXp7xXvuDdtTaDvO9R2LKn1bHzcMR8SvtqHO5RcSGGcfDQxHxSkT87hnrdO1xs1ir2l1AN8rM90w/jojPAC+fZfXhzOy2D4n7XGZ+um5hRPQAfwa8HTgGPBAR+zLzseUqsE3uBXZm5qmI+GNgJ/D7Net2/HGzwOPgA8BLmfmvI+Ja4I+B9/zs1jrOKeDDmflgRLwOOBQR987xHvlOZr6zDfWtBGd7j7wDWF/9vBn4QvW7o2XmGLARXn1/jQN3zbFqNx8358wzWm0UEQFcA9ze7loa5jLgaGY+lZn/DIwAV7W5puIy81uZeap6egC4sJ31rAALOQ6uAm6rHn8duLx633W0zHw2Mx+sHv8AeBxY296qGuUq4Ms55QDQHxEXtLuoZXY58GRmdvO3tiwJg1Z7/Qfgucz8+5rlCXwrIg5FxPZlrKvdPlSdrr81In5ujuVrge/NeH6M7vuPyPuBb9Ys65bjZiHHwavrVCH1ZeAXlqW6F
aKaLt0EfHeOxf8+Iv5fRHwzIi5e1sLaa773iP/GwLXUnwTo1uNmUZw6LCQi/hfwS3Ms+lhmfqN6vI2zn836tcwcj4hfBO6NiCcy8/6lrnW5na03TJ2iv5mpfwhvBj7DVKjoCgs5biLiY0xNDX21ZjMdedzo3EVEH3An8LuZ+coZix9k6vvaJqprIfcyNVXWDXyPnEVEvAZ4F1OXJ5ypm4+bRTFoFZKZbzvb8ohYBVwNbD7LNsar389HxF1MTZU0/h+D+XozLSK+CNw9x6Jx4KIZzy+sxhpvAcfN9cA7gcuz5kPwOvW4mcNCjoPpdY5V77l/CbywPOW1V0SsZipkfTUz95y5fGbwysx7IuLPI+L8Tr+2Dxb0HunYf2MW6B3Ag5n53JkLuvm4WSynDtvnbcATmXlsroURcV51ESsRcR7w68Ajy1hfW5xxHcS7mftvfgBYHxFvrP7P61pg33LU104RcQXwe8C7MvNHNet003GzkONgH3Bd9fg3gG/XBdROUl2HdgvweGZ+tmadX5q+Xi0iLmPqvwcdH0IX+B7ZB/xWdffhrwIvZ+azy1xqO9XOtnTrcdMKz2i1z8/Mf0fEGuC/Z+aVwABwV3U8rwL+KjP/etmrXH5/EhEbmZo6fBr4IMzuTXXX3YeA/UAPcGtmPtqugpfR54HXMjXVAXAgM2/o1uOm7jiIiD8EDmbmPqbCxv+IiKPAi0y977rBW4D3Akfipx8f81HgDQCZ+RdMBc//EhGngEng2m4IodS8RyLiBni1N/cAVwJHgR8B72tTrcuuCp9vp/q3txqb2ZtuPW4Wza/gkSRJKsSpQ0mSpEIMWpIkSYUYtCRJkgoxaEmSJBVi0JIkSSrEoCVJklSIQUuSJKmQ/w+dgKdwolc3OgAAAABJRU5ErkJggg==\n",
77 | "text/plain": [
78 | ""
79 | ]
80 | },
81 | "metadata": {
82 | "needs_background": "light"
83 | },
84 | "output_type": "display_data"
85 | }
86 | ],
87 | "source": [
88 | "plt.figure(figsize=(10, 5))\n",
89 | "plt.scatter(points_x, points_y)\n",
90 | "plt.grid()\n",
91 | "plt.show()"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": 19,
97 | "metadata": {},
98 | "outputs": [],
99 | "source": [
100 | "def plot_model(max_degree):\n",
101 | " plt.figure(figsize=(10, 5))\n",
102 | " plt.scatter(points_x, points_y)\n",
103 | " model = PolynomialRegression(max_degree).fit(points_x, points_y)\n",
104 | " all_x = np.arange(-10, 10.1, 0.1)\n",
105 | " plt.plot(all_x, model.predict(all_x))\n",
106 | " plt.grid()\n",
107 | " plt.show()"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": null,
113 | "metadata": {},
114 | "outputs": [],
115 | "source": [
116 | "for i in range(10):\n",
117 | " plot_model(i)"
118 | ]
119 | },
120 | {
121 | "cell_type": "markdown",
122 | "metadata": {},
123 | "source": [
124 | "Explain why the plots change in this way as max_degree grows"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": null,
130 | "metadata": {},
131 | "outputs": [],
132 | "source": []
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "**Value for the submission form**"
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": null,
144 | "metadata": {},
145 | "outputs": [],
146 | "source": [
147 | "print(int(\n",
148 | " PolynomialRegression(7).fit(points_x, points_y).predict([10])[0]\n",
149 | " + PolynomialRegression(1).fit(points_x, points_y).predict([-5])[0]\n",
150 | " + PolynomialRegression(4).fit(points_x, points_y).predict([-15])[0]\n",
151 | "))"
152 | ]
153 | }
154 | ],
155 | "metadata": {
156 | "kernelspec": {
157 | "display_name": "Python 3",
158 | "language": "python",
159 | "name": "python3"
160 | },
161 | "language_info": {
162 | "codemirror_mode": {
163 | "name": "ipython",
164 | "version": 3
165 | },
166 | "file_extension": ".py",
167 | "mimetype": "text/x-python",
168 | "name": "python",
169 | "nbconvert_exporter": "python",
170 | "pygments_lexer": "ipython3",
171 | "version": "3.6.5"
172 | }
173 | },
174 | "nbformat": 4,
175 | "nbformat_minor": 2
176 | }
177 |
--------------------------------------------------------------------------------
/seminar01/01_Main_SklearnFirstClassifiers.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Training our first classifiers in sklearn"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "### Data"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "\n",
22 | "Given the characteristics of a molecule, the task is to predict whether it will produce a biological response.\n",
23 | "\n",
24 | "For the demonstration we use the training set from the original bioresponse.csv data; the data file is provided alongside this notebook."
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "### Preparing the train and test sets"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 1,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "import pandas as pd\n",
41 | "\n",
42 | "bioresponce = pd.read_csv('bioresponse.csv', header=0, sep=',')"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 2,
48 | "metadata": {},
49 | "outputs": [
50 | {
51 | "data": {
52 | "text/html": [
53 | "\n",
54 | "\n",
67 | "
\n",
68 | " \n",
69 | " \n",
70 | " | \n",
71 | " Activity | \n",
72 | " D1 | \n",
73 | " D2 | \n",
74 | " D3 | \n",
75 | " D4 | \n",
76 | " D5 | \n",
77 | " D6 | \n",
78 | " D7 | \n",
79 | " D8 | \n",
80 | " D9 | \n",
81 | " ... | \n",
82 | " D1767 | \n",
83 | " D1768 | \n",
84 | " D1769 | \n",
85 | " D1770 | \n",
86 | " D1771 | \n",
87 | " D1772 | \n",
88 | " D1773 | \n",
89 | " D1774 | \n",
90 | " D1775 | \n",
91 | " D1776 | \n",
92 | "
\n",
93 | " \n",
94 | " \n",
95 | " \n",
96 | " | 0 | \n",
97 | " 1 | \n",
98 | " 0.000000 | \n",
99 | " 0.497009 | \n",
100 | " 0.10 | \n",
101 | " 0.0 | \n",
102 | " 0.132956 | \n",
103 | " 0.678031 | \n",
104 | " 0.273166 | \n",
105 | " 0.585445 | \n",
106 | " 0.743663 | \n",
107 | " ... | \n",
108 | " 0 | \n",
109 | " 0 | \n",
110 | " 0 | \n",
111 | " 0 | \n",
112 | " 0 | \n",
113 | " 0 | \n",
114 | " 0 | \n",
115 | " 0 | \n",
116 | " 0 | \n",
117 | " 0 | \n",
118 | "
\n",
119 | " \n",
120 | " | 1 | \n",
121 | " 1 | \n",
122 | " 0.366667 | \n",
123 | " 0.606291 | \n",
124 | " 0.05 | \n",
125 | " 0.0 | \n",
126 | " 0.111209 | \n",
127 | " 0.803455 | \n",
128 | " 0.106105 | \n",
129 | " 0.411754 | \n",
130 | " 0.836582 | \n",
131 | " ... | \n",
132 | " 1 | \n",
133 | " 1 | \n",
134 | " 1 | \n",
135 | " 1 | \n",
136 | " 0 | \n",
137 | " 1 | \n",
138 | " 0 | \n",
139 | " 0 | \n",
140 | " 1 | \n",
141 | " 0 | \n",
142 | "
\n",
143 | " \n",
144 | " | 2 | \n",
145 | " 1 | \n",
146 | " 0.033300 | \n",
147 | " 0.480124 | \n",
148 | " 0.00 | \n",
149 | " 0.0 | \n",
150 | " 0.209791 | \n",
151 | " 0.610350 | \n",
152 | " 0.356453 | \n",
153 | " 0.517720 | \n",
154 | " 0.679051 | \n",
155 | " ... | \n",
156 | " 0 | \n",
157 | " 0 | \n",
158 | " 0 | \n",
159 | " 0 | \n",
160 | " 0 | \n",
161 | " 0 | \n",
162 | " 0 | \n",
163 | " 0 | \n",
164 | " 0 | \n",
165 | " 0 | \n",
166 | "
\n",
167 | " \n",
168 | " | 3 | \n",
169 | " 1 | \n",
170 | " 0.000000 | \n",
171 | " 0.538825 | \n",
172 | " 0.00 | \n",
173 | " 0.5 | \n",
174 | " 0.196344 | \n",
175 | " 0.724230 | \n",
176 | " 0.235606 | \n",
177 | " 0.288764 | \n",
178 | " 0.805110 | \n",
179 | " ... | \n",
180 | " 0 | \n",
181 | " 0 | \n",
182 | " 0 | \n",
183 | " 0 | \n",
184 | " 0 | \n",
185 | " 0 | \n",
186 | " 0 | \n",
187 | " 0 | \n",
188 | " 0 | \n",
189 | " 0 | \n",
190 | "
\n",
191 | " \n",
192 | " | 4 | \n",
193 | " 0 | \n",
194 | " 0.100000 | \n",
195 | " 0.517794 | \n",
196 | " 0.00 | \n",
197 | " 0.0 | \n",
198 | " 0.494734 | \n",
199 | " 0.781422 | \n",
200 | " 0.154361 | \n",
201 | " 0.303809 | \n",
202 | " 0.812646 | \n",
203 | " ... | \n",
204 | " 0 | \n",
205 | " 0 | \n",
206 | " 0 | \n",
207 | " 0 | \n",
208 | " 0 | \n",
209 | " 0 | \n",
210 | " 0 | \n",
211 | " 0 | \n",
212 | " 0 | \n",
213 | " 0 | \n",
214 | "
\n",
215 | " \n",
216 | "
\n",
217 | "
5 rows × 1777 columns
\n",
218 | "
"
219 | ],
220 | "text/plain": [
221 | " Activity D1 D2 D3 D4 D5 D6 D7 \\\n",
222 | "0 1 0.000000 0.497009 0.10 0.0 0.132956 0.678031 0.273166 \n",
223 | "1 1 0.366667 0.606291 0.05 0.0 0.111209 0.803455 0.106105 \n",
224 | "2 1 0.033300 0.480124 0.00 0.0 0.209791 0.610350 0.356453 \n",
225 | "3 1 0.000000 0.538825 0.00 0.5 0.196344 0.724230 0.235606 \n",
226 | "4 0 0.100000 0.517794 0.00 0.0 0.494734 0.781422 0.154361 \n",
227 | "\n",
228 | " D8 D9 ... D1767 D1768 D1769 D1770 D1771 D1772 D1773 \\\n",
229 | "0 0.585445 0.743663 ... 0 0 0 0 0 0 0 \n",
230 | "1 0.411754 0.836582 ... 1 1 1 1 0 1 0 \n",
231 | "2 0.517720 0.679051 ... 0 0 0 0 0 0 0 \n",
232 | "3 0.288764 0.805110 ... 0 0 0 0 0 0 0 \n",
233 | "4 0.303809 0.812646 ... 0 0 0 0 0 0 0 \n",
234 | "\n",
235 | " D1774 D1775 D1776 \n",
236 | "0 0 0 0 \n",
237 | "1 0 1 0 \n",
238 | "2 0 0 0 \n",
239 | "3 0 0 0 \n",
240 | "4 0 0 0 \n",
241 | "\n",
242 | "[5 rows x 1777 columns]"
243 | ]
244 | },
245 | "execution_count": 2,
246 | "metadata": {},
247 | "output_type": "execute_result"
248 | }
249 | ],
250 | "source": [
251 | "bioresponce.head(5)"
252 | ]
253 | },
254 | {
255 | "cell_type": "code",
256 | "execution_count": 3,
257 | "metadata": {},
258 | "outputs": [],
259 | "source": [
260 | "y = bioresponce.Activity.values"
261 | ]
262 | },
263 | {
264 | "cell_type": "code",
265 | "execution_count": 4,
266 | "metadata": {},
267 | "outputs": [],
268 | "source": [
269 | "X = bioresponce.iloc[:, 1:]"
270 | ]
271 | },
272 | {
273 | "cell_type": "code",
274 | "execution_count": 6,
275 | "metadata": {},
276 | "outputs": [],
277 | "source": [
278 | "from sklearn.model_selection import train_test_split\n",
279 | "\n",
280 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)"
281 | ]
282 | },
283 | {
284 | "cell_type": "markdown",
285 | "metadata": {},
286 | "source": [
287 | "### Building a model and evaluating its quality"
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "execution_count": 7,
293 | "metadata": {},
294 | "outputs": [],
295 | "source": [
296 | "from sklearn.linear_model import LogisticRegression"
297 | ]
298 | },
299 | {
300 | "cell_type": "code",
301 | "execution_count": 8,
302 | "metadata": {},
303 | "outputs": [],
304 | "source": [
305 | "model = LogisticRegression()\n",
306 | "model.fit(X_train, y_train)\n",
307 | "preds = model.predict(X_test)"
308 | ]
309 | },
310 | {
311 | "cell_type": "code",
312 | "execution_count": 9,
313 | "metadata": {},
314 | "outputs": [
315 | {
316 | "data": {
317 | "text/plain": [
318 | "numpy.ndarray"
319 | ]
320 | },
321 | "execution_count": 9,
322 | "metadata": {},
323 | "output_type": "execute_result"
324 | }
325 | ],
326 | "source": [
327 | "type(preds)"
328 | ]
329 | },
330 | {
331 | "cell_type": "code",
332 | "execution_count": 10,
333 | "metadata": {},
334 | "outputs": [
335 | {
336 | "data": {
337 | "text/plain": [
338 | "1"
339 | ]
340 | },
341 | "execution_count": 10,
342 | "metadata": {},
343 | "output_type": "execute_result"
344 | }
345 | ],
346 | "source": [
347 | "10 // 9"
348 | ]
349 | },
350 | {
351 | "cell_type": "code",
352 | "execution_count": 11,
353 | "metadata": {},
354 | "outputs": [
355 | {
356 | "name": "stdout",
357 | "output_type": "stream",
358 | "text": [
359 | "0.7560581583198708\n"
360 | ]
361 | }
362 | ],
363 | "source": [
364 | "print(sum(preds == y_test) / len(preds))"
365 | ]
366 | },
367 | {
368 | "cell_type": "code",
369 | "execution_count": 12,
370 | "metadata": {},
371 | "outputs": [
372 | {
373 | "name": "stdout",
374 | "output_type": "stream",
375 | "text": [
376 | "0.7560581583198708\n"
377 | ]
378 | }
379 | ],
380 | "source": [
381 | "print(sum(preds == y_test) / float(len(preds)))"
382 | ]
383 | },
384 | {
385 | "cell_type": "code",
386 | "execution_count": 13,
387 | "metadata": {},
388 | "outputs": [
389 | {
390 | "name": "stdout",
391 | "output_type": "stream",
392 | "text": [
393 | "0.7560581583198708\n"
394 | ]
395 | }
396 | ],
397 | "source": [
398 | "from sklearn.metrics import accuracy_score\n",
399 | "\n",
400 | "print(accuracy_score(preds, y_test))"
401 | ]
402 | },
403 | {
404 | "cell_type": "markdown",
405 | "metadata": {},
406 | "source": [
407 | "### Quality under cross-validation"
408 | ]
409 | },
410 | {
411 | "cell_type": "code",
412 | "execution_count": 14,
413 | "metadata": {},
414 | "outputs": [
415 | {
416 | "name": "stdout",
417 | "output_type": "stream",
418 | "text": [
419 | "[0.74404762 0.73956262 0.72310757 0.75099602 0.75896414]\n"
420 | ]
421 | }
422 | ],
423 | "source": [
424 | "from sklearn.model_selection import cross_val_score\n",
425 | "\n",
426 | "print(cross_val_score(model, X_train, y_train, cv=5))"
427 | ]
428 | },
429 | {
430 | "cell_type": "code",
431 | "execution_count": 15,
432 | "metadata": {},
433 | "outputs": [
434 | {
435 | "name": "stdout",
436 | "output_type": "stream",
437 | "text": [
438 | "0.7433355944771515\n"
439 | ]
440 | }
441 | ],
442 | "source": [
443 | "print(cross_val_score(model, X_train, y_train, cv=5).mean())"
444 | ]
445 | },
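`cross_val_score` is a convenience wrapper: the same numbers can be obtained with an explicit fold loop. A minimal self-contained sketch on synthetic data (`make_classification` and the explicit `KFold` splitter are illustration choices here, not part of the notebook above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5)

# Manual loop: fit on each train fold, score accuracy on the held-out fold
manual = []
for train_idx, test_idx in kf.split(X):
    model.fit(X[train_idx], y[train_idx])
    manual.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Passing the same splitter to cross_val_score reproduces the scores
# (the default scorer for a classifier is accuracy)
auto = cross_val_score(model, X, y, cv=kf)
print(np.allclose(manual, auto))
```

Note that when an integer `cv` is given for a classifier, sklearn uses stratified folds, so the explicit splitter is passed to both calls to make them comparable.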
446 | {
447 | "cell_type": "markdown",
448 | "metadata": {},
449 | "source": [
450 | "### Trying other classifiers"
451 | ]
452 | },
453 | {
454 | "cell_type": "code",
455 | "execution_count": 16,
456 | "metadata": {},
457 | "outputs": [],
458 | "source": [
459 | "from sklearn.neighbors import KNeighborsClassifier\n",
460 | "from sklearn.tree import DecisionTreeClassifier\n",
461 | "from sklearn.svm import LinearSVC\n",
462 | "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n"
463 | ]
464 | },
465 | {
466 | "cell_type": "code",
467 | "execution_count": 17,
468 | "metadata": {},
469 | "outputs": [
470 | {
471 | "name": "stdout",
472 | "output_type": "stream",
473 | "text": [
474 | "0.7189014539579968 KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
475 | " metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n",
476 | " weights='uniform')\n",
477 | "0.7059773828756059 DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,\n",
478 | " max_features=None, max_leaf_nodes=None,\n",
479 | " min_impurity_decrease=0.0, min_impurity_split=None,\n",
480 | " min_samples_leaf=1, min_samples_split=2,\n",
481 | " min_weight_fraction_leaf=0.0, presort=False, random_state=None,\n",
482 | " splitter='best')\n",
483 | "0.7431340872374798 LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,\n",
484 | " intercept_scaling=1, loss='squared_hinge', max_iter=1000,\n",
485 | " multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,\n",
486 | " verbose=0)\n",
487 | "0.789983844911147 RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
488 | " max_depth=None, max_features='auto', max_leaf_nodes=None,\n",
489 | " min_impurity_decrease=0.0, min_impurity_split=None,\n",
490 | " min_samples_leaf=1, min_samples_split=2,\n",
491 | " min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,\n",
492 | " oob_score=False, random_state=None, verbose=0,\n",
493 | " warm_start=False)\n",
494 | "0.778675282714055 GradientBoostingClassifier(criterion='friedman_mse', init=None,\n",
495 | " learning_rate=0.1, loss='deviance', max_depth=3,\n",
496 | " max_features=None, max_leaf_nodes=None,\n",
497 | " min_impurity_decrease=0.0, min_impurity_split=None,\n",
498 | " min_samples_leaf=1, min_samples_split=2,\n",
499 | " min_weight_fraction_leaf=0.0, n_estimators=100,\n",
500 | " presort='auto', random_state=None, subsample=1.0, verbose=0,\n",
501 | " warm_start=False)\n",
502 | "CPU times: user 25.9 s, sys: 900 ms, total: 26.8 s\n",
503 | "Wall time: 25.9 s\n"
504 | ]
505 | }
506 | ],
507 | "source": [
508 | "%%time\n",
509 | "\n",
510 | "models = [\n",
511 | " KNeighborsClassifier(),\n",
512 | " DecisionTreeClassifier(),\n",
513 | " LinearSVC(),\n",
514 | " RandomForestClassifier(n_estimators=100), \n",
515 | " GradientBoostingClassifier(n_estimators=100)\n",
516 | "]\n",
517 | "\n",
518 | "for model in models:\n",
519 | " model.fit(X_train, y_train)\n",
520 | " preds = model.predict(X_test)\n",
521 | " print(accuracy_score(preds, y_test), model)"
522 | ]
523 | },
524 | {
525 | "cell_type": "code",
526 | "execution_count": null,
527 | "metadata": {},
528 | "outputs": [],
529 | "source": []
530 | }
531 | ],
532 | "metadata": {
533 | "anaconda-cloud": {},
534 | "kernelspec": {
535 | "display_name": "dmia",
536 | "language": "python",
537 | "name": "dmia"
538 | },
539 | "language_info": {
540 | "codemirror_mode": {
541 | "name": "ipython",
542 | "version": 3
543 | },
544 | "file_extension": ".py",
545 | "mimetype": "text/x-python",
546 | "name": "python",
547 | "nbconvert_exporter": "python",
548 | "pygments_lexer": "ipython3",
549 | "version": "3.6.6"
550 | }
551 | },
552 | "nbformat": 4,
553 | "nbformat_minor": 1
554 | }
555 |
--------------------------------------------------------------------------------
/seminar02/05_Reference_BiasVariance.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Bias Variance"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "![bias-variance](ml_bias_variance.png)"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "We will not write out rigorous formulas here, but we will try to explain the idea behind these notions.\n",
22 | "\n",
23 | "Suppose we have a learning algorithm that can build a model from data.\n",
24 | "\n",
25 | "The error of these models can be decomposed into three parts:\n",
26 | "* **Noise** – the noise in the data; it cannot be predicted and gives the theoretical minimum of the error.\n",
27 | "* **Bias** – how well the average algorithm performs. The average algorithm is \"take a random dataset, train the algorithm, make predictions\"; **Bias** is the error of the averaged predictions.\n",
28 | "* **Variance** – how stable the algorithm is. Again, \"take a random dataset, train the algorithm, make predictions\"; **Variance** is the spread of those predictions."
29 | ]
30 | },
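The "average algorithm" procedure described above can be sketched end-to-end on synthetic data (a minimal self-contained illustration, not tied to the HR dataset used in this notebook: an underfitting degree-1 fit of a quadratic target keeps bias high and variance low):

```python
import numpy as np

rng = np.random.RandomState(0)

def sample_dataset(n=50):
    # True function y = x^2 plus noise
    x = rng.uniform(-1, 1, size=n)
    y = x ** 2 + rng.normal(scale=0.1, size=n)
    return x, y

x_test = np.linspace(-1, 1, 100)

# "Average algorithm": fit a degree-1 polynomial on many random datasets
preds = []
for _ in range(200):
    x, y = sample_dataset()
    coefs = np.polyfit(x, y, deg=1)
    preds.append(np.polyval(coefs, x_test))
preds = np.array(preds)

bias2 = np.mean((x_test ** 2 - preds.mean(axis=0)) ** 2)  # error of the mean prediction
variance = preds.var(axis=0).mean()                       # spread of the predictions
print(bias2, variance)
```

A linear model cannot express the quadratic target, so the error of the averaged predictions (bias) dominates the fold-to-fold spread (variance), mirroring the underfitting regime in the picture above.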
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "# Boosting and Bagging in terms of Bias and Variance"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "Which of these components do you think Boosting and Bagging affect, and which do they leave unchanged?"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 1,
48 | "metadata": {
49 | "collapsed": true
50 | },
51 | "outputs": [],
52 | "source": [
53 | "import pandas as pd\n",
54 | "import numpy as np\n",
55 | "from sklearn.model_selection import cross_val_score, train_test_split\n",
56 | "from xgboost import XGBRegressor\n",
57 | "from catboost import CatBoostRegressor\n",
58 | "from lightgbm import LGBMRegressor\n",
59 | "from sklearn.ensemble import RandomForestRegressor\n",
60 | "from sklearn.tree import DecisionTreeRegressor\n",
61 | "from sklearn.linear_model import LinearRegression\n",
62 | "\n",
63 | "import warnings\n",
64 | "warnings.filterwarnings('ignore')"
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": 2,
70 | "metadata": {
71 | "collapsed": true
72 | },
73 | "outputs": [],
74 | "source": [
75 | "data = pd.read_csv('HR.csv')"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": 3,
81 | "metadata": {},
82 | "outputs": [
83 | {
84 | "data": {
85 | "text/html": [
86 | "\n",
87 | "\n",
100 | "
\n",
101 | " \n",
102 | " \n",
103 | " | \n",
104 | " last_evaluation | \n",
105 | " number_project | \n",
106 | " average_montly_hours | \n",
107 | " time_spend_company | \n",
108 | " Work_accident | \n",
109 | " left | \n",
110 | " promotion_last_5years | \n",
111 | "
\n",
112 | " \n",
113 | " \n",
114 | " \n",
115 | " | 0 | \n",
116 | " 0.53 | \n",
117 | " 2 | \n",
118 | " 157 | \n",
119 | " 3 | \n",
120 | " 0 | \n",
121 | " 1 | \n",
122 | " 0 | \n",
123 | "
\n",
124 | " \n",
125 | " | 1 | \n",
126 | " 0.86 | \n",
127 | " 5 | \n",
128 | " 262 | \n",
129 | " 6 | \n",
130 | " 0 | \n",
131 | " 0 | \n",
132 | " 0 | \n",
133 | "
\n",
134 | " \n",
135 | " | 2 | \n",
136 | " 0.88 | \n",
137 | " 7 | \n",
138 | " 272 | \n",
139 | " 4 | \n",
140 | " 0 | \n",
141 | " 1 | \n",
142 | " 0 | \n",
143 | "
\n",
144 | " \n",
145 | " | 3 | \n",
146 | " 0.87 | \n",
147 | " 5 | \n",
148 | " 223 | \n",
149 | " 5 | \n",
150 | " 0 | \n",
151 | " 1 | \n",
152 | " 0 | \n",
153 | "
\n",
154 | " \n",
155 | " | 4 | \n",
156 | " 0.52 | \n",
157 | " 2 | \n",
158 | " 159 | \n",
159 | " 3 | \n",
160 | " 0 | \n",
161 | " 1 | \n",
162 | " 0 | \n",
163 | "
\n",
164 | " \n",
165 | "
\n",
166 | "
"
167 | ],
168 | "text/plain": [
169 | " last_evaluation number_project average_montly_hours time_spend_company \\\n",
170 | "0 0.53 2 157 3 \n",
171 | "1 0.86 5 262 6 \n",
172 | "2 0.88 7 272 4 \n",
173 | "3 0.87 5 223 5 \n",
174 | "4 0.52 2 159 3 \n",
175 | "\n",
176 | " Work_accident left promotion_last_5years \n",
177 | "0 0 1 0 \n",
178 | "1 0 0 0 \n",
179 | "2 0 1 0 \n",
180 | "3 0 1 0 \n",
181 | "4 0 1 0 "
182 | ]
183 | },
184 | "execution_count": 3,
185 | "metadata": {},
186 | "output_type": "execute_result"
187 | }
188 | ],
189 | "source": [
190 | "data.head()"
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": 4,
196 | "metadata": {
197 | "scrolled": true
198 | },
199 | "outputs": [],
200 | "source": [
201 | "X, y = data.drop('left', axis=1).values, data['left'].values"
202 | ]
203 | },
204 | {
205 | "cell_type": "code",
206 | "execution_count": 5,
207 | "metadata": {},
208 | "outputs": [
209 | {
210 | "data": {
211 | "text/plain": [
212 | "array([1, 0])"
213 | ]
214 | },
215 | "execution_count": 5,
216 | "metadata": {},
217 | "output_type": "execute_result"
218 | }
219 | ],
220 | "source": [
221 | "data['left'].unique()"
222 | ]
223 | },
224 | {
225 | "cell_type": "code",
226 | "execution_count": 6,
227 | "metadata": {
228 | "collapsed": true
229 | },
230 | "outputs": [],
231 | "source": [
232 | "X_train, X_test, y_train, y_test = train_test_split(X, y)"
233 | ]
234 | },
235 | {
236 | "cell_type": "code",
237 | "execution_count": 7,
238 | "metadata": {
239 | "collapsed": true
240 | },
241 | "outputs": [],
242 | "source": [
243 | "def sample_model(seed, model):\n",
244 | " random_gen = np.random.RandomState(seed)\n",
245 | " indices = random_gen.choice(len(y_train), size=len(y_train), replace=True)\n",
246 | " model.fit(X_train[indices, :], y_train[indices])\n",
247 | " return model\n",
248 | "\n",
249 | "def estimate_bias_variance(model, iters_count=100):\n",
250 | " y_preds = []\n",
251 | " for seed in range(iters_count):\n",
252 | " model = sample_model(seed, model)\n",
253 | " y_preds.append(model.predict(X_test))\n",
254 | " y_preds = np.array(y_preds)\n",
255 | " \n",
256 | " print('Bias:', np.mean((y_test - y_preds.mean(axis=0)) ** 2))\n",
257 | " print('Variance:', y_preds.std(axis=0).mean())"
258 | ]
259 | },
260 | {
261 | "cell_type": "markdown",
262 | "metadata": {},
263 | "source": [
264 | "**Linear regression**"
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "execution_count": 8,
270 | "metadata": {},
271 | "outputs": [
272 | {
273 | "name": "stdout",
274 | "output_type": "stream",
275 | "text": [
276 | "Bias: 0.22539321164615467\n",
277 | "Variance: 0.010711666687293465\n"
278 | ]
279 | }
280 | ],
281 | "source": [
282 | "estimate_bias_variance(LinearRegression())"
283 | ]
284 | },
285 | {
286 | "cell_type": "markdown",
287 | "metadata": {},
288 | "source": [
289 | "**Decision tree with max_depth=5**"
290 | ]
291 | },
292 | {
293 | "cell_type": "code",
294 | "execution_count": 9,
295 | "metadata": {},
296 | "outputs": [
297 | {
298 | "name": "stdout",
299 | "output_type": "stream",
300 | "text": [
301 | "Bias: 0.17343635344369013\n",
302 | "Variance: 0.04434523236701086\n"
303 | ]
304 | }
305 | ],
306 | "source": [
307 | "estimate_bias_variance(DecisionTreeRegressor(max_depth=5))"
308 | ]
309 | },
310 | {
311 | "cell_type": "markdown",
312 | "metadata": {},
313 | "source": [
314 | "**Decision tree with max_depth=10**"
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": 10,
320 | "metadata": {},
321 | "outputs": [
322 | {
323 | "name": "stdout",
324 | "output_type": "stream",
325 | "text": [
326 | "Bias: 0.17175575739495175\n",
327 | "Variance: 0.11712092704487344\n"
328 | ]
329 | }
330 | ],
331 | "source": [
332 | "estimate_bias_variance(DecisionTreeRegressor(max_depth=10))"
333 | ]
334 | },
335 | {
336 | "cell_type": "markdown",
337 | "metadata": {},
338 | "source": [
339 | "**Decision tree with max_depth=15**"
340 | ]
341 | },
342 | {
343 | "cell_type": "code",
344 | "execution_count": 11,
345 | "metadata": {},
346 | "outputs": [
347 | {
348 | "name": "stdout",
349 | "output_type": "stream",
350 | "text": [
351 | "Bias: 0.17842598190450087\n",
352 | "Variance: 0.21661949936646008\n"
353 | ]
354 | }
355 | ],
356 | "source": [
357 | "estimate_bias_variance(DecisionTreeRegressor(max_depth=15))"
358 | ]
359 | },
360 | {
361 | "cell_type": "markdown",
362 | "metadata": {},
363 | "source": [
364 | "**Decision tree with no depth limit**"
365 | ]
366 | },
367 | {
368 | "cell_type": "code",
369 | "execution_count": 12,
370 | "metadata": {},
371 | "outputs": [
372 | {
373 | "name": "stdout",
374 | "output_type": "stream",
375 | "text": [
376 | "Bias: 0.2069107045423811\n",
377 | "Variance: 0.32457384418180296\n"
378 | ]
379 | }
380 | ],
381 | "source": [
382 | "estimate_bias_variance(DecisionTreeRegressor(max_depth=None))"
383 | ]
384 | },
385 | {
386 | "cell_type": "markdown",
387 | "metadata": {},
388 | "source": [
389 | "**Random forest with n_estimators=1**"
390 | ]
391 | },
392 | {
393 | "cell_type": "code",
394 | "execution_count": 13,
395 | "metadata": {},
396 | "outputs": [
397 | {
398 | "name": "stdout",
399 | "output_type": "stream",
400 | "text": [
401 | "Bias: 0.19463122486057333\n",
402 | "Variance: 0.35705073628637773\n"
403 | ]
404 | }
405 | ],
406 | "source": [
407 | "estimate_bias_variance(RandomForestRegressor(n_estimators=1, random_state=42))"
408 | ]
409 | },
410 | {
411 | "cell_type": "markdown",
412 | "metadata": {},
413 | "source": [
414 | "**Random forest with n_estimators=10**"
415 | ]
416 | },
417 | {
418 | "cell_type": "code",
419 | "execution_count": 14,
420 | "metadata": {},
421 | "outputs": [
422 | {
423 | "name": "stdout",
424 | "output_type": "stream",
425 | "text": [
426 | "Bias: 0.19311294566535084\n",
427 | "Variance: 0.17229587181057013\n"
428 | ]
429 | }
430 | ],
431 | "source": [
432 | "estimate_bias_variance(RandomForestRegressor(n_estimators=10, random_state=42))"
433 | ]
434 | },
435 | {
436 | "cell_type": "markdown",
437 | "metadata": {},
438 | "source": [
439 | "**Random forest with n_estimators=50**"
440 | ]
441 | },
442 | {
443 | "cell_type": "code",
444 | "execution_count": 15,
445 | "metadata": {},
446 | "outputs": [
447 | {
448 | "name": "stdout",
449 | "output_type": "stream",
450 | "text": [
451 | "Bias: 0.19315888675365975\n",
452 | "Variance: 0.14255099142514835\n"
453 | ]
454 | }
455 | ],
456 | "source": [
457 | "estimate_bias_variance(RandomForestRegressor(n_estimators=50, random_state=42))"
458 | ]
459 | },
460 | {
461 | "cell_type": "markdown",
462 | "metadata": {},
463 | "source": [
464 | "**XGBRegressor**"
465 | ]
466 | },
467 | {
468 | "cell_type": "markdown",
469 | "metadata": {},
470 | "source": [
471 | "**Boosting over trees with max_depth=20**"
472 | ]
473 | },
474 | {
475 | "cell_type": "code",
476 | "execution_count": 16,
477 | "metadata": {},
478 | "outputs": [
479 | {
480 | "name": "stdout",
481 | "output_type": "stream",
482 | "text": [
483 | "Bias: 0.23515768239943852\n",
484 | "Variance: 0.022880817\n"
485 | ]
486 | }
487 | ],
488 | "source": [
489 | "estimate_bias_variance(XGBRegressor(n_estimators=1, max_depth=20))"
490 | ]
491 | },
492 | {
493 | "cell_type": "markdown",
494 | "metadata": {},
495 | "source": [
496 | "**Boosting over trees with max_depth=10**"
497 | ]
498 | },
499 | {
500 | "cell_type": "code",
501 | "execution_count": 17,
502 | "metadata": {},
503 | "outputs": [
504 | {
505 | "name": "stdout",
506 | "output_type": "stream",
507 | "text": [
508 | "Bias: 0.23460312600664116\n",
509 | "Variance: 0.01066339\n"
510 | ]
511 | }
512 | ],
513 | "source": [
514 | "estimate_bias_variance(XGBRegressor(n_estimators=1, max_depth=10))"
515 | ]
516 | },
517 | {
518 | "cell_type": "markdown",
519 | "metadata": {},
520 | "source": [
521 | "**Boosting over trees with max_depth=5**"
522 | ]
523 | },
524 | {
525 | "cell_type": "code",
526 | "execution_count": 18,
527 | "metadata": {},
528 | "outputs": [
529 | {
530 | "name": "stdout",
531 | "output_type": "stream",
532 | "text": [
533 | "Bias: 0.23539367931741623\n",
534 | "Variance: 0.004351252\n"
535 | ]
536 | }
537 | ],
538 | "source": [
539 | "estimate_bias_variance(XGBRegressor(n_estimators=1, max_depth=5))"
540 | ]
541 | },
542 | {
543 | "cell_type": "markdown",
544 | "metadata": {},
545 | "source": [
546 | "**Boosting over trees with n_estimators=10**"
547 | ]
548 | },
549 | {
550 | "cell_type": "code",
551 | "execution_count": 19,
552 | "metadata": {},
553 | "outputs": [
554 | {
555 | "name": "stdout",
556 | "output_type": "stream",
557 | "text": [
558 | "Bias: 0.18128282857388248\n",
559 | "Variance: 0.019852929\n"
560 | ]
561 | }
562 | ],
563 | "source": [
564 | "estimate_bias_variance(XGBRegressor(n_estimators=10, max_depth=5))"
565 | ]
566 | },
567 | {
568 | "cell_type": "markdown",
569 | "metadata": {},
570 | "source": [
571 | "**Boosting over trees with n_estimators=100**"
572 | ]
573 | },
574 | {
575 | "cell_type": "code",
576 | "execution_count": 20,
577 | "metadata": {},
578 | "outputs": [
579 | {
580 | "name": "stdout",
581 | "output_type": "stream",
582 | "text": [
583 | "Bias: 0.17182469278418883\n",
584 | "Variance: 0.05643562\n"
585 | ]
586 | }
587 | ],
588 | "source": [
589 | "estimate_bias_variance(XGBRegressor(n_estimators=100, max_depth=5))"
590 | ]
591 | },
592 | {
593 | "cell_type": "markdown",
594 | "metadata": {},
595 | "source": [
596 | "**CatBoostRegressor**"
597 | ]
598 | },
599 | {
600 | "cell_type": "code",
601 | "execution_count": 21,
602 | "metadata": {},
603 | "outputs": [
604 | {
605 | "name": "stdout",
606 | "output_type": "stream",
607 | "text": [
608 | "Bias: 0.3467385908579134\n",
609 | "Variance: 0.0006835754697775419\n"
610 | ]
611 | }
612 | ],
613 | "source": [
614 | "estimate_bias_variance(CatBoostRegressor(n_estimators=1, max_depth=6, verbose=False))"
615 | ]
616 | },
617 | {
618 | "cell_type": "code",
619 | "execution_count": 22,
620 | "metadata": {},
621 | "outputs": [
622 | {
623 | "name": "stdout",
624 | "output_type": "stream",
625 | "text": [
626 | "Bias: 0.2801365481999864\n",
627 | "Variance: 0.0058176648330751516\n"
628 | ]
629 | }
630 | ],
631 | "source": [
632 | "estimate_bias_variance(CatBoostRegressor(n_estimators=10, max_depth=6, verbose=False))"
633 | ]
634 | },
635 | {
636 | "cell_type": "code",
637 | "execution_count": 23,
638 | "metadata": {},
639 | "outputs": [
640 | {
641 | "name": "stdout",
642 | "output_type": "stream",
643 | "text": [
644 | "Bias: 0.17608858395709903\n",
645 | "Variance: 0.019651480967338052\n"
646 | ]
647 | }
648 | ],
649 | "source": [
650 | "estimate_bias_variance(CatBoostRegressor(n_estimators=100, max_depth=6, verbose=False))"
651 | ]
652 | },
653 | {
654 | "cell_type": "markdown",
655 | "metadata": {},
656 | "source": [
657 | "**LGBMRegressor**"
658 | ]
659 | },
660 | {
661 | "cell_type": "code",
662 | "execution_count": 24,
663 | "metadata": {},
664 | "outputs": [
665 | {
666 | "name": "stdout",
667 | "output_type": "stream",
668 | "text": [
669 | "Bias: 0.2193821065837484\n",
670 | "Variance: 0.006619492577068203\n"
671 | ]
672 | }
673 | ],
674 | "source": [
675 | "estimate_bias_variance(LGBMRegressor(n_estimators=1, max_depth=5))"
676 | ]
677 | },
678 | {
679 | "cell_type": "code",
680 | "execution_count": 25,
681 | "metadata": {},
682 | "outputs": [
683 | {
684 | "name": "stdout",
685 | "output_type": "stream",
686 | "text": [
687 | "Bias: 0.18007513755165383\n",
688 | "Variance: 0.020490996273996646\n"
689 | ]
690 | }
691 | ],
692 | "source": [
693 | "estimate_bias_variance(LGBMRegressor(n_estimators=10, max_depth=5))"
694 | ]
695 | },
696 | {
697 | "cell_type": "code",
698 | "execution_count": 26,
699 | "metadata": {},
700 | "outputs": [
701 | {
702 | "name": "stdout",
703 | "output_type": "stream",
704 | "text": [
705 | "Bias: 0.1717763977728146\n",
706 | "Variance: 0.054249435185578586\n"
707 | ]
708 | }
709 | ],
710 | "source": [
711 | "estimate_bias_variance(LGBMRegressor(n_estimators=100, max_depth=5))"
712 | ]
713 | }
714 | ],
715 | "metadata": {
716 | "anaconda-cloud": {},
717 | "kernelspec": {
718 | "display_name": "venv_DMIA",
719 | "language": "python",
720 | "name": "venv_dmia"
721 | },
722 | "language_info": {
723 | "codemirror_mode": {
724 | "name": "ipython",
725 | "version": 3
726 | },
727 | "file_extension": ".py",
728 | "mimetype": "text/x-python",
729 | "name": "python",
730 | "nbconvert_exporter": "python",
731 | "pygments_lexer": "ipython3",
732 | "version": "3.6.3"
733 | }
734 | },
735 | "nbformat": 4,
736 | "nbformat_minor": 1
737 | }
738 |
--------------------------------------------------------------------------------
/seminar01/03_Main_NaiveBayes.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Naive Bayes"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "http://scikit-learn.org/stable/modules/naive_bayes.html\n",
15 | "\n",
16 | "The sklearn.naive_bayes module provides methods based on applying Bayes' theorem with the \"naive\" assumption of conditional independence between every pair of features $(x_i, x_j)$ given the value of the target variable $y$.\n",
17 | "\n",
18 | "Bayes' theorem states that:\n",
19 | "\n",
20 | "$$P(y \\mid x_1, \\dots, x_n) = \\frac{P(y) P(x_1, \\dots x_n \\mid y)}\n",
21 | " {P(x_1, \\dots, x_n)}$$\n",
22 | "\n",
23 | "Using the \"naive\" assumption of conditional independence between features,\n",
24 | "\n",
25 | "$$P(x_i | y, x_1, \\dots, x_{i-1}, x_{i+1}, \\dots, x_n) = P(x_i | y),$$\n",
26 | "\n",
27 | "Bayes' theorem can be rewritten as\n",
28 | "\n",
29 | "$$P(y \\mid x_1, \\dots, x_n) = \\frac{P(y) \\prod_{i=1}^{n} P(x_i \\mid y)}\n",
30 | " {P(x_1, \\dots, x_n)}$$\n",
31 | "\n",
32 | "Since $P(x_1, \\dots, x_n)$ is constant given the input (the feature values of the object are fixed), we can use the following classification rule:\n",
33 | "\n",
34 | "$$\\hat{y} = \\arg\\max_y P(y) \\prod_{i=1}^{n} P(x_i \\mid y)$$"
35 | ]
36 | },
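  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This rule can be sketched by hand on a tiny made-up binary dataset (the data and the helper `predict_nb` below are purely illustrative): estimate $P(y)$ and $P(x_i \\mid y)$ as relative frequencies, then take the argmax."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Toy data: 6 objects, 2 binary features, 2 classes\n",
    "X_toy = np.array([[1, 0], [1, 1], [1, 0], [0, 1], [0, 1], [0, 0]])\n",
    "y_toy = np.array([0, 0, 0, 1, 1, 1])\n",
    "\n",
    "classes = np.unique(y_toy)\n",
    "priors = np.array([np.mean(y_toy == c) for c in classes])           # P(y)\n",
    "cond = np.array([X_toy[y_toy == c].mean(axis=0) for c in classes])  # P(x_i = 1 | y)\n",
    "\n",
    "def predict_nb(x):\n",
    "    # P(y) * prod_i P(x_i | y), argmax over classes\n",
    "    likelihoods = np.prod(cond * x + (1 - cond) * (1 - x), axis=1)\n",
    "    return classes[np.argmax(priors * likelihoods)]\n",
    "\n",
    "predict_nb(np.array([1, 0])), predict_nb(np.array([0, 1]))  # -> (0, 1)"
   ]
  },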
37 | {
38 | "cell_type": "code",
39 | "execution_count": 1,
40 | "metadata": {},
41 | "outputs": [],
42 | "source": [
43 | "import scipy\n",
44 | "import numpy as np"
45 | ]
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {},
50 | "source": [
51 | "Load the data: the sklearn datasets module provides a wide range of built-in datasets"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 2,
57 | "metadata": {},
58 | "outputs": [],
59 | "source": [
60 | "from sklearn.datasets import load_iris\n",
61 | "\n",
62 | "data = load_iris()"
63 | ]
64 | },
65 | {
66 | "cell_type": "markdown",
67 | "metadata": {},
68 | "source": [
69 | "# BernoulliNB"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "BernoulliNB assumes binary features, so the inputs are binarized first. With $P(i \\mid y)$ denoting the probability that feature $i$ is present in class $y$, the likelihood of an observed value $x_i$ is\n",
77 | "\n",
78 | "$$P(x_i \\mid y) = P(i \\mid y) x_i + (1 - P(i \\mid y)) (1 - x_i)$$"
79 | ]
80 | },
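  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For intuition, the per-class feature probabilities in this formula can be estimated by hand on binarized iris data. This is a rough sketch using plain relative frequencies, without the additive smoothing that sklearn applies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.datasets import load_iris\n",
    "\n",
    "iris = load_iris()\n",
    "Xb = (iris.data > 1.0).astype(int)   # binarize, like BernoulliNB(binarize=1.)\n",
    "\n",
    "# Frequency of each binarized feature being 1 in each class (no smoothing)\n",
    "p_on = np.array([Xb[iris.target == c].mean(axis=0) for c in np.unique(iris.target)])\n",
    "\n",
    "x = Xb[0]                            # a class-0 object\n",
    "lik = p_on * x + (1 - p_on) * (1 - x)\n",
    "lik.prod(axis=1)                     # per-class likelihoods; largest for class 0"
   ]
  },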
81 | {
82 | "cell_type": "code",
83 | "execution_count": 3,
84 | "metadata": {},
85 | "outputs": [
86 | {
87 | "data": {
88 | "text/plain": [
89 | "array([0.33333333, 0.33333333, 0.33333333])"
90 | ]
91 | },
92 | "execution_count": 3,
93 | "metadata": {},
94 | "output_type": "execute_result"
95 | }
96 | ],
97 | "source": [
98 | "from sklearn.naive_bayes import BernoulliNB\n",
99 | "from sklearn.model_selection import cross_val_score\n",
100 | "\n",
101 | "cross_val_score(BernoulliNB(), data.data, data.target)"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": 4,
107 | "metadata": {},
108 | "outputs": [
109 | {
110 | "data": {
111 | "text/plain": [
112 | "array([0.39215686, 0.35294118, 0.375 ])"
113 | ]
114 | },
115 | "execution_count": 4,
116 | "metadata": {},
117 | "output_type": "execute_result"
118 | }
119 | ],
120 | "source": [
121 | "cross_val_score(BernoulliNB(binarize=0.1), data.data, data.target)"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": 5,
127 | "metadata": {},
128 | "outputs": [
129 | {
130 | "data": {
131 | "text/plain": [
132 | "array([0.66666667, 0.66666667, 0.66666667])"
133 | ]
134 | },
135 | "execution_count": 5,
136 | "metadata": {},
137 | "output_type": "execute_result"
138 | }
139 | ],
140 | "source": [
141 | "cross_val_score(BernoulliNB(binarize=1.), data.data, data.target)"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": 6,
147 | "metadata": {},
148 | "outputs": [
149 | {
150 | "data": {
151 | "text/plain": [
152 | "array([0.66666667, 0.66666667, 0.66666667])"
153 | ]
154 | },
155 | "execution_count": 6,
156 | "metadata": {},
157 | "output_type": "execute_result"
158 | }
159 | ],
160 | "source": [
161 | "cross_val_score(BernoulliNB(binarize=1., alpha=0.1), data.data, data.target)"
162 | ]
163 | },
164 | {
165 | "cell_type": "markdown",
166 | "metadata": {},
167 | "source": [
168 | "sklearn has a standard utility for finding the best model hyperparameter values"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": 7,
174 | "metadata": {},
175 | "outputs": [],
176 | "source": [
177 | "from sklearn.model_selection import GridSearchCV\n",
178 | "\n",
179 | "res = GridSearchCV(BernoulliNB(), param_grid={\n",
180 | " 'binarize': [0.,0.1, 0.5, 1., 2, 10, 100.],\n",
181 | " 'alpha': [0.1, 0.5, 1., 2, 10., 100.]\n",
182 | "}, cv=3).fit(data.data, data.target)"
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "execution_count": 8,
188 | "metadata": {},
189 | "outputs": [
190 | {
191 | "data": {
192 | "text/plain": [
193 | "({'alpha': 0.1, 'binarize': 2}, 0.82)"
194 | ]
195 | },
196 | "execution_count": 8,
197 | "metadata": {},
198 | "output_type": "execute_result"
199 | }
200 | ],
201 | "source": [
202 | "res.best_params_, res.best_score_"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": 9,
208 | "metadata": {},
209 | "outputs": [
210 | {
211 | "data": {
212 | "text/plain": [
213 | "0.8206699346405228"
214 | ]
215 | },
216 | "execution_count": 9,
217 | "metadata": {},
218 | "output_type": "execute_result"
219 | }
220 | ],
221 | "source": [
222 | "cross_val_score(BernoulliNB(binarize=2., alpha=0.1), data.data, data.target).mean()"
223 | ]
224 | },
225 | {
226 | "cell_type": "markdown",
227 | "metadata": {},
228 | "source": [
229 | "Add more parameters to the grid"
230 | ]
231 | },
232 | {
233 | "cell_type": "code",
234 | "execution_count": 10,
235 | "metadata": {},
236 | "outputs": [
237 | {
238 | "data": {
239 | "text/plain": [
240 | "({'alpha': 0.1, 'binarize': 2, 'fit_prior': False}, 0.82)"
241 | ]
242 | },
243 | "execution_count": 10,
244 | "metadata": {},
245 | "output_type": "execute_result"
246 | }
247 | ],
248 | "source": [
249 | "res = GridSearchCV(BernoulliNB(), param_grid={\n",
250 | " 'binarize': [0.,0.1, 0.5, 1., 2, 10, 100.],\n",
251 | " 'alpha': [0.1, 0.5, 1., 2, 10., 100.],\n",
252 | " 'fit_prior': [False, True]\n",
253 | "}, cv=3).fit(data.data, data.target)\n",
254 | "res.best_params_, res.best_score_"
255 | ]
256 | },
257 | {
258 | "cell_type": "markdown",
259 | "metadata": {},
260 | "source": [
261 | "An exhaustive grid search is sometimes too expensive, so it makes sense to use RandomizedSearchCV instead"
262 | ]
263 | },
264 | {
265 | "cell_type": "code",
266 | "execution_count": 11,
267 | "metadata": {},
268 | "outputs": [
269 | {
270 | "data": {
271 | "text/plain": [
272 | "({'alpha': 2.4012459939700026,\n",
273 | " 'binarize': 1.6139166699914442,\n",
274 | " 'fit_prior': True},\n",
275 | " 0.92)"
276 | ]
277 | },
278 | "execution_count": 11,
279 | "metadata": {},
280 | "output_type": "execute_result"
281 | }
282 | ],
283 | "source": [
284 | "from sklearn.model_selection import RandomizedSearchCV\n",
285 | "\n",
286 | "res = RandomizedSearchCV(BernoulliNB(), param_distributions={\n",
287 | " 'binarize': scipy.stats.uniform(0, 10),\n",
288 | " 'alpha': scipy.stats.uniform(0, 10),\n",
289 | " 'fit_prior': [False, True]\n",
290 | "}, cv=3).fit(data.data, data.target)\n",
291 | "res.best_params_, res.best_score_"
292 | ]
293 | },
294 | {
295 | "cell_type": "markdown",
296 | "metadata": {},
297 | "source": [
298 | "Option one: a localized search over narrower ranges"
299 | ]
300 | },
301 | {
302 | "cell_type": "code",
303 | "execution_count": 12,
304 | "metadata": {},
305 | "outputs": [
306 | {
307 | "data": {
308 | "text/plain": [
309 | "({'alpha': 0.09516398060726261,\n",
310 | " 'binarize': 1.7114973962724238,\n",
311 | " 'fit_prior': False},\n",
312 | " 0.9466666666666667)"
313 | ]
314 | },
315 | "execution_count": 12,
316 | "metadata": {},
317 | "output_type": "execute_result"
318 | }
319 | ],
320 | "source": [
321 | "res = RandomizedSearchCV(BernoulliNB(), param_distributions={\n",
322 | " 'binarize': scipy.stats.uniform(1.5, 2.5),\n",
323 | " 'alpha': scipy.stats.uniform(0.05, 0.15),\n",
324 | " 'fit_prior': [False, True]\n",
325 | "}, cv=3).fit(data.data, data.target)\n",
326 | "res.best_params_, res.best_score_"
327 | ]
328 | },
329 | {
330 | "cell_type": "code",
331 | "execution_count": 13,
332 | "metadata": {},
333 | "outputs": [
334 | {
335 | "data": {
336 | "text/plain": [
337 | "({'alpha': 0.07339917805043039,\n",
338 | " 'binarize': 1.6452090304204987,\n",
339 | " 'fit_prior': True},\n",
340 | " 0.92)"
341 | ]
342 | },
343 | "execution_count": 13,
344 | "metadata": {},
345 | "output_type": "execute_result"
346 | }
347 | ],
348 | "source": [
349 | "res = RandomizedSearchCV(BernoulliNB(), param_distributions={\n",
350 | " 'binarize': scipy.stats.uniform(1.5, 2.5),\n",
351 | " 'alpha': scipy.stats.uniform(0.05, 0.15),\n",
352 | " 'fit_prior': [False, True]\n",
353 | "}, cv=3, random_state=42).fit(data.data, data.target)\n",
354 | "res.best_params_, res.best_score_"
355 | ]
356 | },
357 | {
358 | "cell_type": "markdown",
359 | "metadata": {},
360 | "source": [
361 | "Option two: wider ranges and more iterations"
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "execution_count": 14,
367 | "metadata": {},
368 | "outputs": [
369 | {
370 | "data": {
371 | "text/plain": [
372 | "({'alpha': 6.075448519014383,\n",
373 | " 'binarize': 1.7052412368729153,\n",
374 | " 'fit_prior': False},\n",
375 | " 0.9466666666666667)"
376 | ]
377 | },
378 | "execution_count": 14,
379 | "metadata": {},
380 | "output_type": "execute_result"
381 | }
382 | ],
383 | "source": [
384 | "res = RandomizedSearchCV(BernoulliNB(), param_distributions={\n",
385 | " 'binarize': scipy.stats.uniform(0, 10),\n",
386 | " 'alpha': scipy.stats.uniform(0, 10),\n",
387 | " 'fit_prior': [False, True]\n",
388 | "}, cv=3, random_state=42, n_iter=1000).fit(data.data, data.target)\n",
389 | "res.best_params_, res.best_score_"
390 | ]
391 | },
392 | {
393 | "cell_type": "markdown",
394 | "metadata": {},
395 | "source": [
396 | "# MultinomialNB"
397 | ]
398 | },
399 | {
400 | "cell_type": "markdown",
401 | "metadata": {},
402 | "source": [
403 | "MultinomialNB implements the naive Bayes algorithm for multinomially distributed features. It works well for text categorization and spam detection; in practice it also performs reasonably well on TF-IDF features.\n",
404 | "\n",
405 | "The distribution is parametrized by a vector $\\theta_y = (\\theta_{y1},\\ldots,\\theta_{yn})$ for each class $y$, where $n$ is the number of features and $\\theta_{yi}$ is the probability $P(x_i \\mid y)$ of feature $i$ appearing in an object of class $y$.\n",
406 | "\n",
407 | "The parameters $\\theta_y$ are estimated with a smoothed version of maximum likelihood, i.e. relative frequency counting (it is easiest to think of this as counting word occurrences in documents of class $y$):\n",
408 | "\n",
409 | "$$\\hat{\\theta}_{yi} = \\frac{ N_{yi} + \\alpha}{N_y + \\alpha n}$$\n",
410 | "\n",
411 | "where $N_{yi} = \\sum_{x \\in T} x_i$ is the number of times feature $i$ appears in objects of class $y$ in the training set $T$,\n",
412 | "and $N_{y} = \\sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$.\n",
413 | "\n",
414 | "The smoothing constant $\\alpha \\ge 0$ accounts for features that never appear with a given class in the training set and prevents zero probabilities in subsequent computations."
415 | ]
416 | },
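  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The smoothed estimate $\\hat{\\theta}_{yi}$ can be computed directly from word-count style data. The counts below are made up for illustration; note that the word that never occurs still gets a nonzero probability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "alpha = 1.0\n",
    "# Made-up word counts: 3 documents of one class, vocabulary of 4 words\n",
    "T_y = np.array([[2, 1, 0, 0],\n",
    "                [1, 0, 3, 0],\n",
    "                [0, 1, 1, 0]])\n",
    "\n",
    "N_yi = T_y.sum(axis=0)     # per-word counts within the class\n",
    "N_y = N_yi.sum()           # total count of all words for the class\n",
    "theta_y = (N_yi + alpha) / (N_y + alpha * len(N_yi))\n",
    "theta_y                    # probabilities sum to 1, none are zero"
   ]
  },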
417 | {
418 | "cell_type": "markdown",
419 | "metadata": {},
420 | "source": [
421 | "Evaluate MultinomialNB on the same task"
422 | ]
423 | },
424 | {
425 | "cell_type": "code",
426 | "execution_count": 15,
427 | "metadata": {},
428 | "outputs": [
429 | {
430 | "data": {
431 | "text/plain": [
432 | "array([1. , 0.88235294, 1. ])"
433 | ]
434 | },
435 | "execution_count": 15,
436 | "metadata": {},
437 | "output_type": "execute_result"
438 | }
439 | ],
440 | "source": [
441 | "from sklearn.naive_bayes import MultinomialNB\n",
442 | "\n",
443 | "cross_val_score(MultinomialNB(), data.data, data.target)"
444 | ]
445 | },
446 | {
447 | "cell_type": "markdown",
448 | "metadata": {},
449 | "source": [
450 | "Tune the hyperparameters"
451 | ]
452 | },
453 | {
454 | "cell_type": "code",
455 | "execution_count": 16,
456 | "metadata": {},
457 | "outputs": [
458 | {
459 | "name": "stdout",
460 | "output_type": "stream",
461 | "text": [
462 | "{'alpha': 2.7, 'fit_prior': False} 0.9666666666666667\n",
463 | "CPU times: user 635 ms, sys: 5.22 ms, total: 640 ms\n",
464 | "Wall time: 640 ms\n"
465 | ]
466 | }
467 | ],
468 | "source": [
469 | "%%time\n",
470 | "res = GridSearchCV(MultinomialNB(), param_grid={\n",
471 | " 'alpha': np.arange(0.1, 10.1, 0.1),\n",
472 | " 'fit_prior': [False, True]\n",
473 | "}, cv=3).fit(data.data, data.target)\n",
474 | "print(res.best_params_, res.best_score_)"
475 | ]
476 | },
477 | {
478 | "cell_type": "markdown",
479 | "metadata": {},
480 | "source": [
481 | "# GaussianNB"
482 | ]
483 | },
484 | {
485 | "cell_type": "markdown",
486 | "metadata": {},
487 | "source": [
488 | "In GaussianNB the likelihood of a feature value is assumed to be Gaussian:\n",
489 | "\n",
490 | "$$P(x_i \\mid y) = \\frac{1}{\\sqrt{2\\pi\\sigma^2_y}} \\exp\\left(-\\frac{(x_i - \\mu_y)^2}{2\\sigma^2_y}\\right)$$\n",
491 | "\n",
492 | "The parameters $\\sigma_y$ and $\\mu_y$ are estimated by maximum likelihood."
493 | ]
494 | },
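  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick sketch of these estimates on the first iris feature: the maximum-likelihood parameters are just the per-class mean and (biased) variance, and the density of a chosen value, here $x_i = 5.0$, follows from the formula above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.datasets import load_iris\n",
    "\n",
    "iris = load_iris()\n",
    "x0 = iris.data[:, 0]                 # sepal length\n",
    "\n",
    "for c in np.unique(iris.target):\n",
    "    mu = x0[iris.target == c].mean()                 # MLE mean\n",
    "    var = x0[iris.target == c].var()                 # MLE (biased) variance\n",
    "    p = np.exp(-(5.0 - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)\n",
    "    print(c, round(p, 3))            # density is highest for class 0"
   ]
  },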
495 | {
496 | "cell_type": "markdown",
497 | "metadata": {},
498 | "source": [
499 | "Add GaussianNB to the comparison"
500 | ]
501 | },
502 | {
503 | "cell_type": "code",
504 | "execution_count": 17,
505 | "metadata": {},
506 | "outputs": [
507 | {
508 | "data": {
509 | "text/plain": [
510 | "array([0.92156863, 0.90196078, 0.97916667])"
511 | ]
512 | },
513 | "execution_count": 17,
514 | "metadata": {},
515 | "output_type": "execute_result"
516 | }
517 | ],
518 | "source": [
519 | "from sklearn.naive_bayes import GaussianNB\n",
520 | "\n",
521 | "cross_val_score(GaussianNB(), data.data, data.target)"
522 | ]
523 | },
524 | {
525 | "cell_type": "code",
526 | "execution_count": 18,
527 | "metadata": {},
528 | "outputs": [
529 | {
530 | "data": {
531 | "text/plain": [
532 | "0.9342320261437909"
533 | ]
534 | },
535 | "execution_count": 18,
536 | "metadata": {},
537 | "output_type": "execute_result"
538 | }
539 | ],
540 | "source": [
541 | "cross_val_score(GaussianNB(), data.data, data.target).mean()"
542 | ]
543 | },
544 | {
545 | "cell_type": "markdown",
546 | "metadata": {},
547 | "source": [
548 | "# 20 newsgroups dataset"
549 | ]
550 | },
551 | {
552 | "cell_type": "markdown",
553 | "metadata": {},
554 | "source": [
555 | "Test these methods on a different dataset"
556 | ]
557 | },
558 | {
559 | "cell_type": "code",
560 | "execution_count": 19,
561 | "metadata": {},
562 | "outputs": [],
563 | "source": [
564 | "from sklearn.datasets import fetch_20newsgroups_vectorized\n",
565 | "\n",
566 | "data = fetch_20newsgroups_vectorized(subset='all', remove=('headers', 'footers', 'quotes'))"
567 | ]
568 | },
569 | {
570 | "cell_type": "code",
571 | "execution_count": 20,
572 | "metadata": {},
573 | "outputs": [
574 | {
575 | "data": {
576 | "text/plain": [
577 | "<18846x101631 sparse matrix of type ''\n",
578 | "\twith 1769365 stored elements in Compressed Sparse Row format>"
579 | ]
580 | },
581 | "execution_count": 20,
582 | "metadata": {},
583 | "output_type": "execute_result"
584 | }
585 | ],
586 | "source": [
587 | "data.data"
588 | ]
589 | },
590 | {
591 | "cell_type": "code",
592 | "execution_count": 21,
593 | "metadata": {},
594 | "outputs": [
595 | {
596 | "data": {
597 | "text/plain": [
598 | "array([0.46581876, 0.50334501, 0.46670914])"
599 | ]
600 | },
601 | "execution_count": 21,
602 | "metadata": {},
603 | "output_type": "execute_result"
604 | }
605 | ],
606 | "source": [
607 | "cross_val_score(BernoulliNB(), data.data, data.target)"
608 | ]
609 | },
610 | {
611 | "cell_type": "code",
612 | "execution_count": 22,
613 | "metadata": {},
614 | "outputs": [
615 | {
616 | "name": "stdout",
617 | "output_type": "stream",
618 | "text": [
619 | "{'alpha': 0.07066305219717406, 'binarize': 0.023062425041415757, 'fit_prior': False} 0.6818423007534755\n",
620 | "CPU times: user 6.28 s, sys: 845 ms, total: 7.12 s\n",
621 | "Wall time: 7.16 s\n"
622 | ]
623 | }
624 | ],
625 | "source": [
626 | "%%time\n",
627 | "res = RandomizedSearchCV(BernoulliNB(), param_distributions={\n",
628 | " 'binarize': scipy.stats.uniform(0, 1),\n",
629 | " 'alpha': scipy.stats.uniform(0, 10),\n",
630 | " 'fit_prior': [False, True]\n",
631 | "}, cv=3, random_state=42).fit(data.data, data.target)\n",
632 | "print(res.best_params_, res.best_score_)"
633 | ]
634 | },
635 | {
636 | "cell_type": "code",
637 | "execution_count": 23,
638 | "metadata": {},
639 | "outputs": [
640 | {
641 | "data": {
642 | "text/plain": [
643 | "array([0.55023847, 0.54858235, 0.51258363])"
644 | ]
645 | },
646 | "execution_count": 23,
647 | "metadata": {},
648 | "output_type": "execute_result"
649 | }
650 | ],
651 | "source": [
652 | "cross_val_score(MultinomialNB(), data.data, data.target)"
653 | ]
654 | },
655 | {
656 | "cell_type": "code",
657 | "execution_count": 24,
658 | "metadata": {},
659 | "outputs": [
660 | {
661 | "name": "stdout",
662 | "output_type": "stream",
663 | "text": [
664 | "{'alpha': 0.10778765841014329, 'fit_prior': True} 0.6743075453677173\n",
665 | "CPU times: user 5.27 s, sys: 440 ms, total: 5.71 s\n",
666 | "Wall time: 5.75 s\n"
667 | ]
668 | }
669 | ],
670 | "source": [
671 | "%%time\n",
672 | "res = RandomizedSearchCV(MultinomialNB(), param_distributions={\n",
673 | " 'alpha': scipy.stats.uniform(0.1, 10),\n",
674 | " 'fit_prior': [False, True]\n",
675 | "}, cv=3, random_state=42).fit(data.data, data.target)\n",
676 | "print(res.best_params_, res.best_score_)"
677 | ]
678 | },
679 | {
680 | "cell_type": "code",
681 | "execution_count": 25,
682 | "metadata": {},
683 | "outputs": [
684 | {
685 | "ename": "TypeError",
686 | "evalue": "A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.",
687 | "output_type": "error",
688 | "traceback": [
689 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
690 | "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
691 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mcross_val_score\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mGaussianNB\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtarget\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
692 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_validation.py\u001b[0m in \u001b[0;36mcross_val_score\u001b[0;34m(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)\u001b[0m\n\u001b[1;32m 340\u001b[0m \u001b[0mn_jobs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mn_jobs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mverbose\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mverbose\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 341\u001b[0m \u001b[0mfit_params\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mfit_params\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 342\u001b[0;31m pre_dispatch=pre_dispatch)\n\u001b[0m\u001b[1;32m 343\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mcv_results\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'test_score'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 344\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
693 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_validation.py\u001b[0m in \u001b[0;36mcross_validate\u001b[0;34m(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score)\u001b[0m\n\u001b[1;32m 204\u001b[0m \u001b[0mfit_params\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mreturn_train_score\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mreturn_train_score\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 205\u001b[0m return_times=True)\n\u001b[0;32m--> 206\u001b[0;31m for train, test in cv.split(X, y, groups))\n\u001b[0m\u001b[1;32m 207\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 208\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mreturn_train_score\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
694 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, iterable)\u001b[0m\n\u001b[1;32m 777\u001b[0m \u001b[0;31m# was dispatched. In particular this covers the edge\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 778\u001b[0m \u001b[0;31m# case of Parallel used with an exhausted iterator.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 779\u001b[0;31m \u001b[0;32mwhile\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdispatch_one_batch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0miterator\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 780\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_iterating\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 781\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
695 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py\u001b[0m in \u001b[0;36mdispatch_one_batch\u001b[0;34m(self, iterator)\u001b[0m\n\u001b[1;32m 623\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 624\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 625\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_dispatch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtasks\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 626\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 627\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
696 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py\u001b[0m in \u001b[0;36m_dispatch\u001b[0;34m(self, batch)\u001b[0m\n\u001b[1;32m 586\u001b[0m \u001b[0mdispatch_timestamp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtime\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtime\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 587\u001b[0m \u001b[0mcb\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mBatchCompletionCallBack\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdispatch_timestamp\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mbatch\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 588\u001b[0;31m \u001b[0mjob\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_backend\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapply_async\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mbatch\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcallback\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 589\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jobs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mjob\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 590\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
697 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py\u001b[0m in \u001b[0;36mapply_async\u001b[0;34m(self, func, callback)\u001b[0m\n\u001b[1;32m 109\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mapply_async\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcallback\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 110\u001b[0m \u001b[0;34m\"\"\"Schedule a func to be run\"\"\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 111\u001b[0;31m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mImmediateResult\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfunc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 112\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcallback\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 113\u001b[0m \u001b[0mcallback\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
698 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, batch)\u001b[0m\n\u001b[1;32m 330\u001b[0m \u001b[0;31m# Don't delay the application, to avoid keeping the input\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 331\u001b[0m \u001b[0;31m# arguments in memory\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 332\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mresults\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbatch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 333\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 334\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
699 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 129\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 130\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__call__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 131\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwargs\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 132\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 133\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__len__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
700 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py\u001b[0m in \u001b[0;36m\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 129\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 130\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__call__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 131\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwargs\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 132\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 133\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__len__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
701 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_validation.py\u001b[0m in \u001b[0;36m_fit_and_score\u001b[0;34m(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, error_score)\u001b[0m\n\u001b[1;32m 456\u001b[0m \u001b[0mestimator\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mfit_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 457\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 458\u001b[0;31m \u001b[0mestimator\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mfit_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 459\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 460\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
702 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/naive_bayes.py\u001b[0m in \u001b[0;36mfit\u001b[0;34m(self, X, y, sample_weight)\u001b[0m\n\u001b[1;32m 181\u001b[0m \u001b[0mReturns\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 182\u001b[0m \"\"\"\n\u001b[0;32m--> 183\u001b[0;31m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_X_y\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 184\u001b[0m return self._partial_fit(X, y, np.unique(y), _refit=True,\n\u001b[1;32m 185\u001b[0m sample_weight=sample_weight)\n",
703 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36mcheck_X_y\u001b[0;34m(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)\u001b[0m\n\u001b[1;32m 571\u001b[0m X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,\n\u001b[1;32m 572\u001b[0m \u001b[0mensure_2d\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mallow_nd\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mensure_min_samples\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 573\u001b[0;31m ensure_min_features, warn_on_dtype, estimator)\n\u001b[0m\u001b[1;32m 574\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mmulti_output\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 575\u001b[0m y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,\n",
704 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36mcheck_array\u001b[0;34m(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)\u001b[0m\n\u001b[1;32m 429\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0msp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0missparse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 430\u001b[0m array = _ensure_sparse_format(array, accept_sparse, dtype, copy,\n\u001b[0;32m--> 431\u001b[0;31m force_all_finite)\n\u001b[0m\u001b[1;32m 432\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 433\u001b[0m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcopy\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
705 | "\u001b[0;32m~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36m_ensure_sparse_format\u001b[0;34m(spmatrix, accept_sparse, dtype, copy, force_all_finite)\u001b[0m\n\u001b[1;32m 273\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 274\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0maccept_sparse\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 275\u001b[0;31m raise TypeError('A sparse matrix was passed, but dense '\n\u001b[0m\u001b[1;32m 276\u001b[0m \u001b[0;34m'data is required. Use X.toarray() to '\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 277\u001b[0m 'convert to a dense numpy array.')\n",
706 | "\u001b[0;31mTypeError\u001b[0m: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array."
707 | ]
708 | }
709 | ],
710 | "source": [
711 | "cross_val_score(GaussianNB(), data.data, data.target)"
712 | ]
713 | },
714 | {
715 | "cell_type": "markdown",
716 | "metadata": {},
717 | "source": [
718 | "# Wine dataset"
719 | ]
720 | },
721 | {
722 | "cell_type": "markdown",
723 | "metadata": {},
724 | "source": [
725 | "You might get the impression that GaussianNB performs worse; however, when the features are real-valued, that is not the case"
726 | ]
727 | },
728 | {
729 | "cell_type": "code",
730 | "execution_count": 26,
731 | "metadata": {},
732 | "outputs": [],
733 | "source": [
734 | "from sklearn.datasets import load_wine\n",
735 | "\n",
736 | "data = load_wine()"
737 | ]
738 | },
739 | {
740 | "cell_type": "code",
741 | "execution_count": 27,
742 | "metadata": {},
743 | "outputs": [
744 | {
745 | "data": {
746 | "text/plain": [
747 | "array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,\n",
748 | " 1.065e+03],\n",
749 | " [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,\n",
750 | " 1.050e+03],\n",
751 | " [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,\n",
752 | " 1.185e+03],\n",
753 | " ...,\n",
754 | " [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,\n",
755 | " 8.350e+02],\n",
756 | " [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,\n",
757 | " 8.400e+02],\n",
758 | " [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,\n",
759 | " 5.600e+02]])"
760 | ]
761 | },
762 | "execution_count": 27,
763 | "metadata": {},
764 | "output_type": "execute_result"
765 | }
766 | ],
767 | "source": [
768 | "data.data"
769 | ]
770 | },
771 | {
772 | "cell_type": "code",
773 | "execution_count": 28,
774 | "metadata": {},
775 | "outputs": [
776 | {
777 | "data": {
778 | "text/plain": [
779 | "array([0.4 , 0.4 , 0.39655172])"
780 | ]
781 | },
782 | "execution_count": 28,
783 | "metadata": {},
784 | "output_type": "execute_result"
785 | }
786 | ],
787 | "source": [
788 | "cross_val_score(BernoulliNB(), data.data, data.target)"
789 | ]
790 | },
791 | {
792 | "cell_type": "code",
793 | "execution_count": 29,
794 | "metadata": {},
795 | "outputs": [
796 | {
797 | "name": "stdout",
798 | "output_type": "stream",
799 | "text": [
800 | "{'alpha': 3.559726786512616, 'binarize': 757.8461104643691, 'fit_prior': False} 0.6966292134831461\n",
801 | "CPU times: user 1.95 s, sys: 14.6 ms, total: 1.96 s\n",
802 | "Wall time: 1.96 s\n"
803 | ]
804 | }
805 | ],
806 | "source": [
807 | "%%time\n",
808 | "res = RandomizedSearchCV(BernoulliNB(), param_distributions={\n",
809 | " 'binarize': scipy.stats.uniform(0, 1000),\n",
810 | " 'alpha': scipy.stats.uniform(0, 10),\n",
811 | " 'fit_prior': [False, True]\n",
812 | "}, cv=3, random_state=42, n_iter=500).fit(data.data, data.target)\n",
813 | "print(res.best_params_, res.best_score_)"
814 | ]
815 | },
816 | {
817 | "cell_type": "code",
818 | "execution_count": 30,
819 | "metadata": {},
820 | "outputs": [
821 | {
822 | "data": {
823 | "text/plain": [
824 | "array([0.71666667, 0.81666667, 0.96551724])"
825 | ]
826 | },
827 | "execution_count": 30,
828 | "metadata": {},
829 | "output_type": "execute_result"
830 | }
831 | ],
832 | "source": [
833 | "cross_val_score(MultinomialNB(), data.data, data.target)"
834 | ]
835 | },
836 | {
837 | "cell_type": "code",
838 | "execution_count": 31,
839 | "metadata": {},
840 | "outputs": [
841 | {
842 | "name": "stdout",
843 | "output_type": "stream",
844 | "text": [
845 | "{'alpha': 3.745401188473625, 'fit_prior': False} 0.8426966292134831\n",
846 | "CPU times: user 1.75 s, sys: 12.9 ms, total: 1.77 s\n",
847 | "Wall time: 1.77 s\n"
848 | ]
849 | }
850 | ],
851 | "source": [
852 | "%%time\n",
853 | "res = RandomizedSearchCV(MultinomialNB(), param_distributions={\n",
854 | " 'alpha': scipy.stats.uniform(0, 10),\n",
855 | " 'fit_prior': [False, True]\n",
856 | "}, cv=3, random_state=42, n_iter=500).fit(data.data, data.target)\n",
857 | "print(res.best_params_, res.best_score_)"
858 | ]
859 | },
860 | {
861 | "cell_type": "code",
862 | "execution_count": 32,
863 | "metadata": {},
864 | "outputs": [
865 | {
866 | "data": {
867 | "text/plain": [
868 | "array([0.95 , 0.96666667, 0.96551724])"
869 | ]
870 | },
871 | "execution_count": 32,
872 | "metadata": {},
873 | "output_type": "execute_result"
874 | }
875 | ],
876 | "source": [
877 | "cross_val_score(GaussianNB(), data.data, data.target)"
878 | ]
879 | },
880 | {
881 | "cell_type": "code",
882 | "execution_count": 33,
883 | "metadata": {},
884 | "outputs": [
885 | {
886 | "data": {
887 | "text/plain": [
888 | "0.960727969348659"
889 | ]
890 | },
891 | "execution_count": 33,
892 | "metadata": {},
893 | "output_type": "execute_result"
894 | }
895 | ],
896 | "source": [
897 | "cross_val_score(GaussianNB(), data.data, data.target).mean()"
898 | ]
899 | }
900 | ],
901 | "metadata": {
902 | "kernelspec": {
903 | "display_name": "Python 3",
904 | "language": "python",
905 | "name": "python3"
906 | },
907 | "language_info": {
908 | "codemirror_mode": {
909 | "name": "ipython",
910 | "version": 3
911 | },
912 | "file_extension": ".py",
913 | "mimetype": "text/x-python",
914 | "name": "python",
915 | "nbconvert_exporter": "python",
916 | "pygments_lexer": "ipython3",
917 | "version": "3.6.5"
918 | }
919 | },
920 | "nbformat": 4,
921 | "nbformat_minor": 2
922 | }
923 |
--------------------------------------------------------------------------------
/seminar01/06_Reference_Numpy.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Numpy"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Python has built-in:\n",
15 | " 1. lists and dictionaries\n",
16 | " 2. numeric objects (integers, floating-point numbers)\n",
17 | "\n",
18 | "NumPy is an add-on Python module for multidimensional arrays and efficient numeric computation.\n",
19 | "The library sits closer to the hardware (it uses C types, which are substantially faster than Python types), which makes it more efficient for computation."
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 12,
25 | "metadata": {},
26 | "outputs": [],
27 | "source": [
28 | "import numpy as np"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "## Основные типы данных"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "Python has the types: bool, int, float, complex\n",
43 | "\n",
44 | "NumPy carries these types over and also provides wrappers that use the C implementations, e.g. int8, int16, int32, int64.\n",
45 | "\n",
46 | "The number says how many bits are used to store the value.\n",
47 | "\n",
48 | "Because it uses C data types, NumPy gets faster operations."
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": null,
54 | "metadata": {},
55 | "outputs": [],
56 | "source": [
57 | "type(np.bool)"
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": null,
63 | "metadata": {},
64 | "outputs": [],
65 | "source": [
66 | "np.bool()"
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": null,
72 | "metadata": {},
73 | "outputs": [],
74 | "source": []
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "np.int is an alias for the built-in Python int (deprecated since NumPy 1.20)\n",
81 | "\n",
82 | "np.int32 and np.int64 are the C-style 32-bit and 64-bit types"
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "execution_count": null,
88 | "metadata": {},
89 | "outputs": [],
90 | "source": [
91 | "type(np.int), type(np.int32), type(np.int64)"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": null,
97 | "metadata": {},
98 | "outputs": [],
99 | "source": [
100 | "np.int(), np.int32(), np.int64()"
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": null,
106 | "metadata": {},
107 | "outputs": [],
108 | "source": [
109 | "type(np.int()), type(np.int32()), type(np.int64())"
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {},
115 | "source": [
116 | "Python integers use arbitrary-precision arithmetic, so they can store numbers of any size"
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": null,
122 | "metadata": {},
123 | "outputs": [],
124 | "source": [
125 | "print(np.int(1e18)) # wrapper around the Python type"
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": [
132 | "A 32-bit C int stores values from -2147483648 to 2147483647, which is not enough to hold $10^{18}$"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": null,
138 | "metadata": {},
139 | "outputs": [],
140 | "source": [
141 | "print(np.int32(1e18))"
142 | ]
143 | },
144 | {
145 | "cell_type": "markdown",
146 | "metadata": {},
147 | "source": [
148 | "A 64-bit C int stores values from -9223372036854775808 to 9223372036854775807, which is enough for $10^{18}$"
149 | ]
150 | },
151 | {
152 | "cell_type": "code",
153 | "execution_count": null,
154 | "metadata": {},
155 | "outputs": [],
156 | "source": [
157 | "print(np.int64(1e18))"
158 | ]
159 | },
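  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an aside (an illustrative addition, not part of the original seminar): instead of memorizing these ranges, `np.iinfo` reports the exact limits of any integer dtype."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# query the limits of fixed-width integer types\n",
    "print(np.iinfo(np.int32).min, np.iinfo(np.int32).max)\n",
    "print(np.iinfo(np.int64).min, np.iinfo(np.int64).max)"
   ]
  },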
160 | {
161 | "cell_type": "code",
162 | "execution_count": null,
163 | "metadata": {},
164 | "outputs": [],
165 | "source": []
166 | },
167 | {
168 | "cell_type": "markdown",
169 | "metadata": {},
170 | "source": [
171 | "The same gradation exists for float\n",
172 | "\n",
173 | "float is a wrapper around the Python type\n",
174 | "float32 and float64 wrap numbers of the corresponding bit width (C-style)"
175 | ]
176 | },
177 | {
178 | "cell_type": "code",
179 | "execution_count": null,
180 | "metadata": {},
181 | "outputs": [],
182 | "source": [
183 | "type(np.float), type(np.float32), type(np.float64)"
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": null,
189 | "metadata": {},
190 | "outputs": [],
191 | "source": [
192 | "np.float(), np.float32(), np.float64()"
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "execution_count": null,
198 | "metadata": {},
199 | "outputs": [],
200 | "source": [
201 | "type(np.float()), type(np.float32()), type(np.float64())"
202 | ]
203 | },
204 | {
205 | "cell_type": "code",
206 | "execution_count": null,
207 | "metadata": {},
208 | "outputs": [],
209 | "source": [
210 | "type(np.sqrt(np.float(2))) # np.sqrt returns the closest matching type; for a Python float that is float64"
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": null,
216 | "metadata": {},
217 | "outputs": [],
218 | "source": [
219 | "type(np.sqrt(np.float32(2)))"
220 | ]
221 | },
222 | {
223 | "cell_type": "code",
224 | "execution_count": null,
225 | "metadata": {},
226 | "outputs": [],
227 | "source": [
228 | "type(np.sqrt(np.float64(2)))"
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": null,
234 | "metadata": {},
235 | "outputs": [],
236 | "source": []
237 | },
238 | {
239 | "cell_type": "markdown",
240 | "metadata": {},
241 | "source": [
242 | "Special classes for storing complex numbers - under the hood each is essentially two floats"
243 | ]
244 | },
245 | {
246 | "cell_type": "code",
247 | "execution_count": null,
248 | "metadata": {},
249 | "outputs": [],
250 | "source": [
251 | "type(np.complex), type(np.complex64), type(np.complex128)"
252 | ]
253 | },
254 | {
255 | "cell_type": "code",
256 | "execution_count": null,
257 | "metadata": {},
258 | "outputs": [],
259 | "source": [
260 | "np.complex(), np.complex64(), np.complex128()"
261 | ]
262 | },
263 | {
264 | "cell_type": "code",
265 | "execution_count": null,
266 | "metadata": {},
267 | "outputs": [],
268 | "source": [
269 | "type(np.complex()), type(np.complex64()), type(np.complex128())"
270 | ]
271 | },
272 | {
273 | "cell_type": "code",
274 | "execution_count": null,
275 | "metadata": {},
276 | "outputs": [],
277 | "source": []
278 | },
279 | {
280 | "cell_type": "markdown",
281 | "metadata": {},
282 | "source": [
283 | "By default, taking the square root of -1 does not work"
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": null,
289 | "metadata": {},
290 | "outputs": [],
291 | "source": [
292 | "np.sqrt(-1.)"
293 | ]
294 | },
295 | {
296 | "cell_type": "markdown",
297 | "metadata": {},
298 | "source": [
299 | "but if the data type is complex, everything works"
300 | ]
301 | },
302 | {
303 | "cell_type": "code",
304 | "execution_count": null,
305 | "metadata": {},
306 | "outputs": [],
307 | "source": [
308 | "np.sqrt(-1 + 0j)"
309 | ]
310 | },
311 | {
312 | "cell_type": "code",
313 | "execution_count": null,
314 | "metadata": {},
315 | "outputs": [],
316 | "source": [
317 | "type(np.sqrt(-1 + 0j))"
318 | ]
319 | },
320 | {
321 | "cell_type": "code",
322 | "execution_count": null,
323 | "metadata": {},
324 | "outputs": [],
325 | "source": [
326 | "type(np.sqrt(np.complex(-1 + 0j)))"
327 | ]
328 | },
329 | {
330 | "cell_type": "code",
331 | "execution_count": null,
332 | "metadata": {},
333 | "outputs": [],
334 | "source": [
335 | "type(np.sqrt(np.complex64(-1 + 0j)))"
336 | ]
337 | },
338 | {
339 | "cell_type": "code",
340 | "execution_count": null,
341 | "metadata": {},
342 | "outputs": [],
343 | "source": [
344 | "type(np.sqrt(np.complex128(-1 + 0j)))"
345 | ]
346 | },
347 | {
348 | "cell_type": "code",
349 | "execution_count": null,
350 | "metadata": {},
351 | "outputs": [],
352 | "source": []
353 | },
354 | {
355 | "cell_type": "markdown",
356 | "metadata": {
357 | "collapsed": true
358 | },
359 | "source": [
360 | "### Conclusion:\n",
361 | "\n",
362 | "NumPy provides wrappers for all the C types and also carries over the Python types"
363 | ]
364 | },
365 | {
366 | "cell_type": "code",
367 | "execution_count": null,
368 | "metadata": {},
369 | "outputs": [],
370 | "source": []
371 | },
372 | {
373 | "cell_type": "code",
374 | "execution_count": null,
375 | "metadata": {},
376 | "outputs": [],
377 | "source": []
378 | },
379 | {
380 | "cell_type": "code",
381 | "execution_count": null,
382 | "metadata": {},
383 | "outputs": [],
384 | "source": []
385 | },
386 | {
387 | "cell_type": "markdown",
388 | "metadata": {},
389 | "source": [
390 | "## Основные численные функции"
391 | ]
392 | },
393 | {
394 | "cell_type": "markdown",
395 | "metadata": {},
396 | "source": [
397 | "NumPy provides a wide range of mathematical functions.\n",
398 | "\n",
399 | "Let's go over the main kinds."
400 | ]
401 | },
402 | {
403 | "cell_type": "markdown",
404 | "metadata": {},
405 | "source": [
406 | "##### Rounding numbers"
407 | ]
408 | },
409 | {
410 | "cell_type": "markdown",
411 | "metadata": {},
412 | "source": [
413 | "np.round - rounds to the nearest value, with ties going to the even neighbour (so np.round(4.5) gives 4.0)\n",
414 | "\n",
415 | "np.floor - rounds down\n",
416 | "\n",
417 | "np.ceil - rounds up\n",
418 | "\n",
419 | "np.int - truncates toward zero"
420 | ]
421 | },
422 | {
423 | "cell_type": "code",
424 | "execution_count": null,
425 | "metadata": {},
426 | "outputs": [],
427 | "source": [
428 | "np.round(4.1), np.floor(4.1), np.ceil(4.1), np.int(4.1)"
429 | ]
430 | },
431 | {
432 | "cell_type": "code",
433 | "execution_count": null,
434 | "metadata": {},
435 | "outputs": [],
436 | "source": [
437 | "np.round(-4.1), np.floor(-4.1), np.ceil(-4.1), np.int(-4.1)"
438 | ]
439 | },
440 | {
441 | "cell_type": "code",
442 | "execution_count": null,
443 | "metadata": {},
444 | "outputs": [],
445 | "source": []
446 | },
447 | {
448 | "cell_type": "code",
449 | "execution_count": null,
450 | "metadata": {},
451 | "outputs": [],
452 | "source": [
453 | "np.round(4.5), np.floor(4.5), np.ceil(4.5), np.int(4.5)"
454 | ]
455 | },
456 | {
457 | "cell_type": "code",
458 | "execution_count": null,
459 | "metadata": {},
460 | "outputs": [],
461 | "source": [
462 | "np.round(-4.5), np.floor(-4.5), np.ceil(-4.5), np.int(-4.5)"
463 | ]
464 | },
465 | {
466 | "cell_type": "code",
467 | "execution_count": null,
468 | "metadata": {},
469 | "outputs": [],
470 | "source": []
471 | },
472 | {
473 | "cell_type": "code",
474 | "execution_count": null,
475 | "metadata": {},
476 | "outputs": [],
477 | "source": [
478 | "np.round(4.7), np.floor(4.7), np.ceil(4.7), np.int(4.7)"
479 | ]
480 | },
481 | {
482 | "cell_type": "code",
483 | "execution_count": null,
484 | "metadata": {},
485 | "outputs": [],
486 | "source": [
487 | "np.round(-4.7), np.floor(-4.7), np.ceil(-4.7), np.int(-4.7)"
488 | ]
489 | },
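  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small added illustration: np.round uses banker's rounding (ties go to the nearest even integer, not always away from zero), and np.trunc is the function counterpart of truncation toward zero."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ties go to the nearest even integer\n",
    "print(np.round(0.5), np.round(1.5), np.round(2.5))  # 0.0 2.0 2.0\n",
    "# truncation toward zero\n",
    "print(np.trunc(4.7), np.trunc(-4.7))  # 4.0 -4.0"
   ]
  },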
490 | {
491 | "cell_type": "markdown",
492 | "metadata": {},
493 | "source": [
494 | "##### Mathematical operations"
495 | ]
496 | },
497 | {
498 | "cell_type": "markdown",
499 | "metadata": {},
500 | "source": [
501 | "Let's compute a logarithm"
502 | ]
503 | },
504 | {
505 | "cell_type": "code",
506 | "execution_count": null,
507 | "metadata": {},
508 | "outputs": [],
509 | "source": [
510 | "np.log(1000.), type(np.log(1000.))"
511 | ]
512 | },
513 | {
514 | "cell_type": "code",
515 | "execution_count": null,
516 | "metadata": {},
517 | "outputs": [],
518 | "source": [
519 | "np.log(np.float32(1000.)), type(np.log(np.float32(1000.))) # fewer bits of storage - less precision"
520 | ]
521 | },
522 | {
523 | "cell_type": "code",
524 | "execution_count": null,
525 | "metadata": {},
526 | "outputs": [],
527 | "source": [
528 | "np.log(1000.) / np.log(10.), type(np.log(1000.) / np.log(10.))"
529 | ]
530 | },
531 | {
532 | "cell_type": "markdown",
533 | "metadata": {},
534 | "source": [
535 | "For arguments outside the domain no exception is raised; instead you get a warning and the result is inf or nan"
536 | ]
537 | },
538 | {
539 | "cell_type": "code",
540 | "execution_count": null,
541 | "metadata": {},
542 | "outputs": [],
543 | "source": [
544 | "np.log(0.)"
545 | ]
546 | },
547 | {
548 | "cell_type": "code",
549 | "execution_count": null,
550 | "metadata": {},
551 | "outputs": [],
552 | "source": [
553 | "np.log(-1.)"
554 | ]
555 | },
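  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If those warnings are unwanted (or should become hard errors), `np.errstate` controls floating-point error handling locally. A sketch added for illustration, not part of the original seminar:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# suppress the divide-by-zero and invalid-value warnings inside the block\n",
    "with np.errstate(invalid='ignore', divide='ignore'):\n",
    "    print(np.log(0.), np.log(-1.))  # -inf nan"
   ]
  },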
556 | {
557 | "cell_type": "markdown",
558 | "metadata": {},
559 | "source": [
560 | "The functions also work with complex numbers"
561 | ]
562 | },
563 | {
564 | "cell_type": "code",
565 | "execution_count": null,
566 | "metadata": {},
567 | "outputs": [],
568 | "source": [
569 | "np.log(-1 + 0j)"
570 | ]
571 | },
572 | {
573 | "cell_type": "code",
574 | "execution_count": null,
575 | "metadata": {},
576 | "outputs": [],
577 | "source": [
578 | "np.log(1j)"
579 | ]
580 | },
581 | {
582 | "cell_type": "code",
583 | "execution_count": null,
584 | "metadata": {},
585 | "outputs": [],
586 | "source": []
587 | },
588 | {
589 | "cell_type": "markdown",
590 | "metadata": {},
591 | "source": [
592 | "There are dedicated functions for base-2 and base-10 logarithms"
593 | ]
594 | },
595 | {
596 | "cell_type": "code",
597 | "execution_count": null,
598 | "metadata": {},
599 | "outputs": [],
600 | "source": [
601 | "print(np.log10(10))\n",
602 | "print(np.log10(100))\n",
603 | "print(np.log10(1000))\n",
604 | "print(np.log10(1e8))\n",
605 | "print(np.log10(1e30))\n",
606 | "print(np.log10(1e100))\n",
607 | "print(np.log10(1e1000))"
608 | ]
609 | },
610 | {
611 | "cell_type": "markdown",
612 | "metadata": {},
613 | "source": [
614 | "The logarithm can no longer be taken for large ints, because np.log2 casts them to a C type"
615 | ]
616 | },
617 | {
618 | "cell_type": "code",
619 | "execution_count": null,
620 | "metadata": {},
621 | "outputs": [],
622 | "source": [
623 | "print(np.log2(2))\n",
624 | "print(np.log2(2 ** 2))\n",
625 | "print(np.log2(2 ** 3))\n",
626 | "print(np.log2(2 ** 8))\n",
627 | "print(np.log2(2 ** 30))\n",
628 | "print(np.log2(2 ** 100))\n",
629 | "print(np.log2(2 ** 1000))"
630 | ]
631 | },
632 | {
633 | "cell_type": "markdown",
634 | "metadata": {},
635 | "source": [
636 | "The functions work with C types, so overflow is possible"
637 | ]
638 | },
639 | {
640 | "cell_type": "code",
641 | "execution_count": null,
642 | "metadata": {},
643 | "outputs": [],
644 | "source": [
645 | "np.exp(10.), type(np.exp(10.))"
646 | ]
647 | },
648 | {
649 | "cell_type": "code",
650 | "execution_count": null,
651 | "metadata": {},
652 | "outputs": [],
653 | "source": [
654 | "np.exp(100.), type(np.exp(100.))"
655 | ]
656 | },
657 | {
658 | "cell_type": "code",
659 | "execution_count": null,
660 | "metadata": {},
661 | "outputs": [],
662 | "source": [
663 | "np.exp(1000.), type(np.exp(1000.))"
664 | ]
665 | },
666 | {
667 | "cell_type": "code",
668 | "execution_count": null,
669 | "metadata": {},
670 | "outputs": [],
671 | "source": []
672 | },
673 | {
674 | "cell_type": "markdown",
675 | "metadata": {},
676 | "source": [
677 | "##### Constants"
678 | ]
679 | },
680 | {
681 | "cell_type": "markdown",
682 | "metadata": {},
683 | "source": [
684 | "NumPy includes mathematical constants"
685 | ]
686 | },
687 | {
688 | "cell_type": "code",
689 | "execution_count": null,
690 | "metadata": {},
691 | "outputs": [],
692 | "source": [
693 | "np.pi, type(np.pi)"
694 | ]
695 | },
696 | {
697 | "cell_type": "code",
698 | "execution_count": null,
699 | "metadata": {},
700 | "outputs": [],
701 | "source": [
702 | "np.e, type(np.e)"
703 | ]
704 | },
705 | {
706 | "cell_type": "code",
707 | "execution_count": null,
708 | "metadata": {},
709 | "outputs": [],
710 | "source": [
711 | "np.exp(np.pi * 1j)"
712 | ]
713 | },
714 | {
715 | "cell_type": "code",
716 | "execution_count": null,
717 | "metadata": {},
718 | "outputs": [],
719 | "source": [
720 | "np.exp(np.pi * 1j).astype(np.float64)"
721 | ]
722 | },
723 | {
724 | "cell_type": "code",
725 | "execution_count": null,
726 | "metadata": {},
727 | "outputs": [],
728 | "source": []
729 | },
730 | {
731 | "cell_type": "markdown",
732 | "metadata": {},
733 | "source": [
734 | "##### More examples of data-type overflow"
735 | ]
736 | },
737 | {
738 | "cell_type": "markdown",
739 | "metadata": {},
740 | "source": [
741 | "Using numbers of a fixed bit width caps the maximum values they can hold"
742 | ]
743 | },
744 | {
745 | "cell_type": "code",
746 | "execution_count": null,
747 | "metadata": {},
748 | "outputs": [],
749 | "source": [
750 | "2 ** 60, type(2 ** 60) # Python's built-in exponentiation"
751 | ]
752 | },
753 | {
754 | "cell_type": "code",
755 | "execution_count": null,
756 | "metadata": {},
757 | "outputs": [],
758 | "source": [
759 | "2 ** 1000, type(2 ** 1000) # Python's built-in exponentiation"
760 | ]
761 | },
762 | {
763 | "cell_type": "code",
764 | "execution_count": null,
765 | "metadata": {},
766 | "outputs": [],
767 | "source": [
768 | "np.power(2, 60), type(np.power(2, 60))"
769 | ]
770 | },
771 | {
772 | "cell_type": "code",
773 | "execution_count": null,
774 | "metadata": {},
775 | "outputs": [],
776 | "source": [
777 | "np.power(np.int64(2), 60), type(np.power(np.int64(2), 60))"
778 | ]
779 | },
780 | {
781 | "cell_type": "code",
782 | "execution_count": null,
783 | "metadata": {},
784 | "outputs": [],
785 | "source": [
786 | "np.power(2, 1000), type(np.power(2, 1000))"
787 | ]
788 | },
789 | {
790 | "cell_type": "code",
791 | "execution_count": null,
792 | "metadata": {},
793 | "outputs": [],
794 | "source": []
795 | },
796 | {
797 | "cell_type": "markdown",
798 | "metadata": {},
799 | "source": [
800 | "##### The absolute value function"
801 | ]
802 | },
803 | {
804 | "cell_type": "code",
805 | "execution_count": 13,
806 | "metadata": {},
807 | "outputs": [
808 | {
809 | "data": {
810 | "text/plain": [
811 | "10000"
812 | ]
813 | },
814 | "execution_count": 13,
815 | "metadata": {},
816 | "output_type": "execute_result"
817 | }
818 | ],
819 | "source": [
820 | "np.abs(-10000)"
821 | ]
822 | },
823 | {
824 | "cell_type": "code",
825 | "execution_count": 14,
826 | "metadata": {},
827 | "outputs": [
828 | {
829 | "data": {
830 | "text/plain": [
831 | "1.0"
832 | ]
833 | },
834 | "execution_count": 14,
835 | "metadata": {},
836 | "output_type": "execute_result"
837 | }
838 | ],
839 | "source": [
840 | "np.abs(1j) # returns the magnitude of a complex number"
841 | ]
842 | },
843 | {
844 | "cell_type": "code",
845 | "execution_count": 15,
846 | "metadata": {},
847 | "outputs": [
848 | {
849 | "data": {
850 | "text/plain": [
851 | "1.4142135623730951"
852 | ]
853 | },
854 | "execution_count": 15,
855 | "metadata": {},
856 | "output_type": "execute_result"
857 | }
858 | ],
859 | "source": [
860 | "np.abs(1 + 1j)"
861 | ]
862 | },
863 | {
864 | "cell_type": "code",
865 | "execution_count": null,
866 | "metadata": {},
867 | "outputs": [],
868 | "source": []
869 | },
870 | {
871 | "cell_type": "markdown",
872 | "metadata": {},
873 | "source": [
874 | "##### Trigonometric functions"
875 | ]
876 | },
877 | {
878 | "cell_type": "code",
879 | "execution_count": 16,
880 | "metadata": {},
881 | "outputs": [
882 | {
883 | "data": {
884 | "text/plain": [
885 | "-1.0"
886 | ]
887 | },
888 | "execution_count": 16,
889 | "metadata": {},
890 | "output_type": "execute_result"
891 | }
892 | ],
893 | "source": [
894 | "np.cos(np.pi)"
895 | ]
896 | },
897 | {
898 | "cell_type": "code",
899 | "execution_count": 17,
900 | "metadata": {},
901 | "outputs": [
902 | {
903 | "data": {
904 | "text/plain": [
905 | "1.0"
906 | ]
907 | },
908 | "execution_count": 17,
909 | "metadata": {},
910 | "output_type": "execute_result"
911 | }
912 | ],
913 | "source": [
914 | "np.log(np.e)"
915 | ]
916 | },
917 | {
918 | "cell_type": "code",
919 | "execution_count": 18,
920 | "metadata": {},
921 | "outputs": [
922 | {
923 | "data": {
924 | "text/plain": [
925 | "1.0"
926 | ]
927 | },
928 | "execution_count": 18,
929 | "metadata": {},
930 | "output_type": "execute_result"
931 | }
932 | ],
933 | "source": [
934 | "np.sin(np.pi / 2)"
935 | ]
936 | },
937 | {
938 | "cell_type": "code",
939 | "execution_count": 19,
940 | "metadata": {},
941 | "outputs": [
942 | {
943 | "data": {
944 | "text/plain": [
945 | "1.5707963267948966"
946 | ]
947 | },
948 | "execution_count": 19,
949 | "metadata": {},
950 | "output_type": "execute_result"
951 | }
952 | ],
953 | "source": [
954 | "np.arccos(0.)"
955 | ]
956 | },
957 | {
958 | "cell_type": "code",
959 | "execution_count": 20,
960 | "metadata": {},
961 | "outputs": [
962 | {
963 | "data": {
964 | "text/plain": [
965 | "57.29577951308232"
966 | ]
967 | },
968 | "execution_count": 20,
969 | "metadata": {},
970 | "output_type": "execute_result"
971 | }
972 | ],
973 | "source": [
974 | "np.rad2deg(1.)"
975 | ]
976 | },
977 | {
978 | "cell_type": "code",
979 | "execution_count": 21,
980 | "metadata": {},
981 | "outputs": [
982 | {
983 | "data": {
984 | "text/plain": [
985 | "3.141592653589793"
986 | ]
987 | },
988 | "execution_count": 21,
989 | "metadata": {},
990 | "output_type": "execute_result"
991 | }
992 | ],
993 | "source": [
994 | "np.deg2rad(180.)"
995 | ]
996 | },
997 | {
998 | "cell_type": "markdown",
999 | "metadata": {},
1000 | "source": [
1001 | "More details here: https://docs.scipy.org/doc/numpy-1.9.2/reference/routines.math.html"
1002 | ]
1003 | },
1004 | {
1005 | "cell_type": "markdown",
1006 | "metadata": {
1007 | "collapsed": true
1008 | },
1009 | "source": [
1010 | "### Conclusion:\n",
1011 | "NumPy implements a huge number of mathematical functions"
1012 | ]
1013 | },
1014 | {
1015 | "cell_type": "code",
1016 | "execution_count": null,
1017 | "metadata": {},
1018 | "outputs": [],
1019 | "source": []
1020 | },
1021 | {
1022 | "cell_type": "markdown",
1023 | "metadata": {},
1024 | "source": [
1025 | "### How is this better than the math module?"
1026 | ]
1027 | },
1028 | {
1029 | "cell_type": "code",
1030 | "execution_count": null,
1031 | "metadata": {},
1032 | "outputs": [],
1033 | "source": [
1034 | "import math"
1035 | ]
1036 | },
1037 | {
1038 | "cell_type": "code",
1039 | "execution_count": null,
1040 | "metadata": {},
1041 | "outputs": [],
1042 | "source": [
1043 | "%timeit math.exp(10.)\n",
1044 | "%timeit np.exp(10.)"
1045 | ]
1046 | },
1047 | {
1048 | "cell_type": "code",
1049 | "execution_count": null,
1050 | "metadata": {},
1051 | "outputs": [],
1052 | "source": [
1053 | "%timeit math.sqrt(10.)\n",
1054 | "%timeit np.sqrt(10.)"
1055 | ]
1056 | },
1057 | {
1058 | "cell_type": "code",
1059 | "execution_count": null,
1060 | "metadata": {},
1061 | "outputs": [],
1062 | "source": [
1063 | "%timeit math.log(10.)\n",
1064 | "%timeit np.log(10.)"
1065 | ]
1066 | },
1067 | {
1068 | "cell_type": "code",
1069 | "execution_count": null,
1070 | "metadata": {},
1071 | "outputs": [],
1072 | "source": [
1073 | "%timeit math.cos(10.)\n",
1074 | "%timeit np.cos(10.)"
1075 | ]
1076 | },
1077 | {
1078 | "cell_type": "markdown",
1079 | "metadata": {},
1080 | "source": [
1081 | "### Conclusion:\n",
1082 | "NumPy's arithmetic functions are not substantially faster than those from math when you compute a single value\n",
1083 | "\n",
1084 | "If you need to evaluate some mathematical function, chances are it is already implemented in NumPy"
1085 | ]
1086 | },
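  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Where NumPy really pays off is whole arrays: one vectorized call replaces a Python-level loop. An added illustration (the exact timings depend on your machine):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "xs = np.linspace(0., 10., 1_000_000)\n",
    "\n",
    "%timeit np.sqrt(xs)                 # one call, the loop runs in C\n",
    "%timeit [math.sqrt(x) for x in xs]  # element-by-element in Python"
   ]
  },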
1087 | {
1088 | "cell_type": "code",
1089 | "execution_count": null,
1090 | "metadata": {},
1091 | "outputs": [],
1092 | "source": []
1093 | },
1094 | {
1095 | "cell_type": "code",
1096 | "execution_count": null,
1097 | "metadata": {},
1098 | "outputs": [],
1099 | "source": []
1100 | },
1101 | {
1102 | "cell_type": "markdown",
1103 | "metadata": {},
1104 | "source": [
1105 | "### Arithmetic functions are nice, but the core NumPy object is nevertheless the homogeneous multidimensional array"
1106 | ]
1107 | },
1108 | {
1109 | "cell_type": "code",
1110 | "execution_count": null,
1111 | "metadata": {},
1112 | "outputs": [],
1113 | "source": [
1114 | "type(np.array([]))"
1115 | ]
1116 | },
1117 | {
1118 | "cell_type": "markdown",
1119 | "metadata": {},
1120 | "source": [
1121 | "The most important attributes of ndarray objects:\n",
1122 | "\n",
1123 | " 1. ndarray.ndim - the number of dimensions (more often called \"axes\") of the array.\n",
1124 | "\n",
1125 | " 2. ndarray.shape - the dimensions of the array, its form. A tuple of positive integers giving the length of the array along each axis. For a matrix with n rows and m columns, shape is (n, m). The length of the shape tuple equals ndim.\n",
1126 | "\n",
1127 | " 3. ndarray.size - the number of elements in the array. Naturally, it equals the product of the elements of shape.\n",
1128 | "\n",
1129 | " 4. ndarray.dtype - an object describing the type of the array's elements. It can be defined using standard Python data types; NumPy types can also be stored, e.g.: bool, int16, int32, int64, float16, float32, float64, complex64\n",
1130 | "\n",
1131 | " 5. ndarray.itemsize - the size of each array element in bytes.\n",
1132 | "\n",
1133 | " 6. ndarray.data - the buffer holding the actual array elements. You normally don't need this attribute, since it's easiest to access elements via indexing."
1134 | ]
1135 | },
1136 | {
1137 | "cell_type": "markdown",
1138 | "metadata": {},
1139 | "source": [
1140 |     "##### Plain one-dimensional arrays"
1141 | ]
1142 | },
1143 | {
1144 | "cell_type": "code",
1145 | "execution_count": null,
1146 | "metadata": {},
1147 | "outputs": [],
1148 | "source": [
1149 | "arr = np.array([1, 2, 4, 8, 16, 32])\n",
1150 | "\n",
1151 | "print(arr.ndim)\n",
1152 | "print(arr.shape)\n",
1153 | "print(arr.size)\n",
1154 | "print(arr.dtype)\n",
1155 | "print(arr.itemsize)\n",
1156 | "print(arr.data)"
1157 | ]
1158 | },
1159 | {
1160 | "cell_type": "code",
1161 | "execution_count": null,
1162 | "metadata": {},
1163 | "outputs": [],
1164 | "source": [
1165 | "arr = np.array([1, 2, 4, 8, 16, 32], dtype=int)\n",
1166 | "\n",
1167 | "print(arr.ndim)\n",
1168 | "print(arr.shape)\n",
1169 | "print(arr.size)\n",
1170 | "print(arr.dtype)\n",
1171 | "print(arr.itemsize)\n",
1172 | "print(arr.data)"
1173 | ]
1174 | },
1175 | {
1176 | "cell_type": "code",
1177 | "execution_count": null,
1178 | "metadata": {},
1179 | "outputs": [],
1180 | "source": [
1181 | "arr = np.array([1, 2, 4, 8, 16, 32], dtype=object)\n",
1182 | "\n",
1183 | "print(arr.ndim)\n",
1184 | "print(arr.shape)\n",
1185 | "print(arr.size)\n",
1186 | "print(arr.dtype)\n",
1187 | "print(arr.itemsize)\n",
1188 | "print(arr.data)"
1189 | ]
1190 | },
1191 | {
1192 | "cell_type": "code",
1193 | "execution_count": null,
1194 | "metadata": {},
1195 | "outputs": [],
1196 | "source": [
1197 | "arr = np.array([1, 2, 4, 8, 16, 32], dtype=np.int64)\n",
1198 | "\n",
1199 | "print(arr.ndim)\n",
1200 | "print(arr.shape)\n",
1201 | "print(arr.size)\n",
1202 | "print(arr.dtype)\n",
1203 | "print(arr.itemsize)\n",
1204 | "print(arr.data)"
1205 | ]
1206 | },
1207 | {
1208 | "cell_type": "code",
1209 | "execution_count": null,
1210 | "metadata": {},
1211 | "outputs": [],
1212 | "source": [
1213 | "arr = np.array([1, 2, 4, 8, 16, 32], dtype=np.complex128)\n",
1214 | "\n",
1215 | "print(arr.ndim)\n",
1216 | "print(arr.shape)\n",
1217 | "print(arr.size)\n",
1218 | "print(arr.dtype)\n",
1219 | "print(arr.itemsize)\n",
1220 | "print(arr.data)"
1221 | ]
1222 | },
1223 | {
1224 | "cell_type": "markdown",
1225 | "metadata": {},
1226 | "source": [
1227 |     "##### Plain two-dimensional arrays"
1228 | ]
1229 | },
1230 | {
1231 | "cell_type": "code",
1232 | "execution_count": null,
1233 | "metadata": {},
1234 | "outputs": [],
1235 | "source": [
1236 | "arr = np.array([[1], [2], [4], [8], [16], [32]], dtype=np.complex128)\n",
1237 | "\n",
1238 | "print(arr.ndim)\n",
1239 | "print(arr.shape)\n",
1240 | "print(arr.size)\n",
1241 | "print(arr.dtype)\n",
1242 | "print(arr.itemsize)\n",
1243 | "print(arr.data)"
1244 | ]
1245 | },
1246 | {
1247 | "cell_type": "code",
1248 | "execution_count": null,
1249 | "metadata": {},
1250 | "outputs": [],
1251 | "source": [
1252 | "arr = np.array([[1, 0], [2, 0], [4, 0], [8, 0], [16, 0], [32, 0]], dtype=np.complex128)\n",
1253 | "\n",
1254 | "print(arr.ndim)\n",
1255 | "print(arr.shape)\n",
1256 | "print(arr.size)\n",
1257 | "print(arr.dtype)\n",
1258 | "print(arr.itemsize)\n",
1259 | "print(arr.data)"
1260 | ]
1261 | },
1262 | {
1263 | "cell_type": "code",
1264 | "execution_count": null,
1265 | "metadata": {},
1266 | "outputs": [],
1267 | "source": [
1268 |     "# specifying rows with different numbers of elements\n",
1269 | "arr = np.array([[1, 0], [2, 0], [4, 0], [8, 0], [16, 0], [32]], dtype=np.complex128)\n",
1270 | "\n",
1271 | "print(arr.ndim)\n",
1272 | "print(arr.shape)\n",
1273 | "print(arr.size)\n",
1274 | "print(arr.dtype)\n",
1275 | "print(arr.itemsize)\n",
1276 | "print(arr.data)"
1277 | ]
1278 | },
1279 | {
1280 | "cell_type": "markdown",
1281 | "metadata": {},
1282 | "source": [
1283 |     "##### Indexing one-dimensional arrays"
1284 | ]
1285 | },
1286 | {
1287 | "cell_type": "code",
1288 | "execution_count": null,
1289 | "metadata": {},
1290 | "outputs": [],
1291 | "source": [
1292 | "arr = np.array([1, 2, 4, 8, 16, 32], dtype=np.int64)"
1293 | ]
1294 | },
1295 | {
1296 | "cell_type": "code",
1297 | "execution_count": null,
1298 | "metadata": {},
1299 | "outputs": [],
1300 | "source": [
1301 | "arr[0], arr[1], arr[4], arr[-1]"
1302 | ]
1303 | },
1304 | {
1305 | "cell_type": "code",
1306 | "execution_count": null,
1307 | "metadata": {},
1308 | "outputs": [],
1309 | "source": [
1310 | "arr[0:4]"
1311 | ]
1312 | },
1313 | {
1314 | "cell_type": "code",
1315 | "execution_count": null,
1316 | "metadata": {},
1317 | "outputs": [],
1318 | "source": [
1319 | "arr[[0, 3, 5]]"
1320 | ]
1321 | },
1322 | {
1323 | "cell_type": "markdown",
1324 | "metadata": {},
1325 | "source": [
1326 |     "##### Indexing two-dimensional arrays"
1327 | ]
1328 | },
1329 | {
1330 | "cell_type": "code",
1331 | "execution_count": null,
1332 | "metadata": {},
1333 | "outputs": [],
1334 | "source": [
1335 | "arr = np.array(\n",
1336 | " [\n",
1337 | " [1, 0, 4], \n",
1338 | " [2, 0, 4], \n",
1339 | " [4, 0, 4], \n",
1340 | " [8, 0, 4], \n",
1341 | " [16, 0, 4], \n",
1342 | " [32, 0, 4]\n",
1343 | " ],\n",
1344 | " dtype=np.int64\n",
1345 | ")"
1346 | ]
1347 | },
1348 | {
1349 | "cell_type": "code",
1350 | "execution_count": null,
1351 | "metadata": {},
1352 | "outputs": [],
1353 | "source": [
1354 | "print(arr[0])\n",
1355 | "print(arr[1])\n",
1356 | "print(arr[4])\n",
1357 | "print(arr[-1])"
1358 | ]
1359 | },
1360 | {
1361 | "cell_type": "code",
1362 | "execution_count": null,
1363 | "metadata": {},
1364 | "outputs": [],
1365 | "source": [
1366 | "arr[0, 0], arr[1, 0], arr[4, 0], arr[-1, 0]"
1367 | ]
1368 | },
1369 | {
1370 | "cell_type": "code",
1371 | "execution_count": null,
1372 | "metadata": {},
1373 | "outputs": [],
1374 | "source": [
1375 | "arr[0][0], arr[1][0], arr[4][0], arr[-1][0]"
1376 | ]
1377 | },
1378 | {
1379 | "cell_type": "markdown",
1380 | "metadata": {},
1381 | "source": [
1382 |     "The first way, arr[i, j], is faster: arr[i][j] first builds an intermediate array for arr[i] and then indexes it"
1383 | ]
1384 | },
1385 | {
1386 | "cell_type": "code",
1387 | "execution_count": null,
1388 | "metadata": {},
1389 | "outputs": [],
1390 | "source": [
1391 | "%timeit arr[0, 0], arr[1, 0], arr[4, 0], arr[-1, 0]"
1392 | ]
1393 | },
1394 | {
1395 | "cell_type": "code",
1396 | "execution_count": null,
1397 | "metadata": {},
1398 | "outputs": [],
1399 | "source": [
1400 | "%timeit arr[0][0], arr[1][0], arr[4][0], arr[-1][0]"
1401 | ]
1402 | },
1403 | {
1404 | "cell_type": "markdown",
1405 | "metadata": {},
1406 | "source": [
1407 |     "##### More advanced indexing"
1408 | ]
1409 | },
1410 | {
1411 | "cell_type": "markdown",
1412 | "metadata": {},
1413 | "source": [
1414 |     "We can take a row or a column"
1415 | ]
1416 | },
1417 | {
1418 | "cell_type": "code",
1419 | "execution_count": null,
1420 | "metadata": {},
1421 | "outputs": [],
1422 | "source": [
1423 | "arr[0, :], arr[0, :].shape"
1424 | ]
1425 | },
1426 | {
1427 | "cell_type": "code",
1428 | "execution_count": null,
1429 | "metadata": {},
1430 | "outputs": [],
1431 | "source": [
1432 | "arr[:, 0], arr[:, 0].shape"
1433 | ]
1434 | },
1435 | {
1436 | "cell_type": "code",
1437 | "execution_count": null,
1438 | "metadata": {},
1439 | "outputs": [],
1440 | "source": [
1441 | "arr[[1, 3, 5], :], arr[[1, 3, 5], :].shape"
1442 | ]
1443 | },
1444 | {
1445 | "cell_type": "code",
1446 | "execution_count": null,
1447 | "metadata": {},
1448 | "outputs": [],
1449 | "source": [
1450 | "arr[[1, 3, 5], 0]"
1451 | ]
1452 | },
1453 | {
1454 | "cell_type": "code",
1455 | "execution_count": null,
1456 | "metadata": {},
1457 | "outputs": [],
1458 | "source": [
1459 | "arr[1::2, 0]"
1460 | ]
1461 | },
1462 | {
1463 | "cell_type": "code",
1464 | "execution_count": null,
1465 | "metadata": {},
1466 | "outputs": [],
1467 | "source": [
1468 | "arr[[1, 3, 5], :2]"
1469 | ]
1470 | },
1471 | {
1472 | "cell_type": "code",
1473 | "execution_count": null,
1474 | "metadata": {},
1475 | "outputs": [],
1476 | "source": [
1477 | "arr[[1, 3, 5], 1:]"
1478 | ]
1479 | },
1480 | {
1481 | "cell_type": "code",
1482 | "execution_count": null,
1483 | "metadata": {},
1484 | "outputs": [],
1485 | "source": [
1486 |     "arr[[1, 3, 5], [0, 2]]  # index arrays of lengths 3 and 2 do not broadcast, so this raises IndexError"
1487 | ]
1488 | },
1489 | {
1490 | "cell_type": "code",
1491 | "execution_count": null,
1492 | "metadata": {},
1493 | "outputs": [],
1494 | "source": [
1495 |     "arr[[1, 3], [0, 2]]  # picks the elements arr[1, 0] and arr[3, 2]"
1496 | ]
1497 | },
1498 | {
1499 | "cell_type": "code",
1500 | "execution_count": null,
1501 | "metadata": {},
1502 | "outputs": [],
1503 | "source": [
1504 |     "arr[np.ix_([1, 3, 5], [0, 2])]  # cross-product indexing: rows 1, 3, 5 and columns 0, 2 (a 3x2 submatrix)"
1505 | ]
1506 | },
1507 | {
1508 | "cell_type": "code",
1509 | "execution_count": null,
1510 | "metadata": {},
1511 | "outputs": [],
1512 | "source": [
1513 | "np.ix_([1, 3, 5], [0, 2])"
1514 | ]
1515 | },
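{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides integer lists, an array can be indexed with a boolean mask; this keeps the rows (or elements) where the mask is True:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mask = arr[:, 0] > 4  # True for rows whose first element is greater than 4\n",
"arr[mask]"
]
},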
1516 | {
1517 | "cell_type": "markdown",
1518 | "metadata": {},
1519 | "source": [
1520 |     "### Conclusions"
1521 | ]
1522 | },
1523 | {
1524 | "cell_type": "markdown",
1525 | "metadata": {},
1526 | "source": [
1527 |     "Images taken from http://www.scipy-lectures.org/intro/numpy/numpy.html"
1528 | ]
1529 | },
1544 | {
1545 | "cell_type": "code",
1546 | "execution_count": null,
1547 | "metadata": {},
1548 | "outputs": [],
1549 | "source": []
1550 | },
1551 | {
1552 | "cell_type": "markdown",
1553 | "metadata": {},
1554 | "source": [
1555 |     "##### Operations on arrays"
1556 | ]
1557 | },
1558 | {
1559 | "cell_type": "markdown",
1560 | "metadata": {},
1561 | "source": [
1562 |     "Arithmetic operations can be applied to arrays as a whole: there is no need to touch each element individually, the operations are applied elementwise."
1563 | ]
1564 | },
1565 | {
1566 | "cell_type": "code",
1567 | "execution_count": 22,
1568 | "metadata": {},
1569 | "outputs": [],
1570 | "source": [
1571 | "a = np.array([1, 2, 4, 8, 16])\n",
1572 | "b = np.array([1, 3, 9, 27, 81])"
1573 | ]
1574 | },
1575 | {
1576 | "cell_type": "code",
1577 | "execution_count": 23,
1578 | "metadata": {},
1579 | "outputs": [
1580 | {
1581 | "data": {
1582 | "text/plain": [
1583 | "array([ 0, 1, 3, 7, 15])"
1584 | ]
1585 | },
1586 | "execution_count": 23,
1587 | "metadata": {},
1588 | "output_type": "execute_result"
1589 | }
1590 | ],
1591 | "source": [
1592 | "a - 1"
1593 | ]
1594 | },
1595 | {
1596 | "cell_type": "code",
1597 | "execution_count": 24,
1598 | "metadata": {},
1599 | "outputs": [
1600 | {
1601 | "data": {
1602 | "text/plain": [
1603 | "array([ 2, 5, 13, 35, 97])"
1604 | ]
1605 | },
1606 | "execution_count": 24,
1607 | "metadata": {},
1608 | "output_type": "execute_result"
1609 | }
1610 | ],
1611 | "source": [
1612 | "a + b"
1613 | ]
1614 | },
1615 | {
1616 | "cell_type": "code",
1617 | "execution_count": 25,
1618 | "metadata": {},
1619 | "outputs": [
1620 | {
1621 | "data": {
1622 | "text/plain": [
1623 | "array([ 1, 6, 36, 216, 1296])"
1624 | ]
1625 | },
1626 | "execution_count": 25,
1627 | "metadata": {},
1628 | "output_type": "execute_result"
1629 | }
1630 | ],
1631 | "source": [
1632 | "a * b"
1633 | ]
1634 | },
1635 | {
1636 | "cell_type": "code",
1637 | "execution_count": 26,
1638 | "metadata": {},
1639 | "outputs": [
1640 | {
1641 | "data": {
1642 | "text/plain": [
1643 | "array([1. , 1.5 , 2.25 , 3.375 , 5.0625])"
1644 | ]
1645 | },
1646 | "execution_count": 26,
1647 | "metadata": {},
1648 | "output_type": "execute_result"
1649 | }
1650 | ],
1651 | "source": [
1652 | "b / a"
1653 | ]
1654 | },
1655 | {
1656 | "cell_type": "code",
1657 | "execution_count": 27,
1658 | "metadata": {},
1659 | "outputs": [
1660 | {
1661 | "data": {
1662 | "text/plain": [
1663 | "array([1, 1, 2, 3, 5])"
1664 | ]
1665 | },
1666 | "execution_count": 27,
1667 | "metadata": {},
1668 | "output_type": "execute_result"
1669 | }
1670 | ],
1671 | "source": [
1672 | "b // a"
1673 | ]
1674 | },
1675 | {
1676 | "cell_type": "code",
1677 | "execution_count": 28,
1678 | "metadata": {},
1679 | "outputs": [
1680 | {
1681 | "data": {
1682 | "text/plain": [
1683 | "array([0., 1., 2., 3., 4.])"
1684 | ]
1685 | },
1686 | "execution_count": 28,
1687 | "metadata": {},
1688 | "output_type": "execute_result"
1689 | }
1690 | ],
1691 | "source": [
1692 | "np.log2(a)"
1693 | ]
1694 | },
1695 | {
1696 | "cell_type": "code",
1697 | "execution_count": 29,
1698 | "metadata": {},
1699 | "outputs": [
1700 | {
1701 | "data": {
1702 | "text/plain": [
1703 | "array([0., 1., 2., 3., 4.])"
1704 | ]
1705 | },
1706 | "execution_count": 29,
1707 | "metadata": {},
1708 | "output_type": "execute_result"
1709 | }
1710 | ],
1711 | "source": [
1712 | "np.log(a) / np.log(2)"
1713 | ]
1714 | },
1715 | {
1716 | "cell_type": "code",
1717 | "execution_count": 30,
1718 | "metadata": {},
1719 | "outputs": [
1720 | {
1721 | "data": {
1722 | "text/plain": [
1723 | "array([0., 1., 2., 3., 4.])"
1724 | ]
1725 | },
1726 | "execution_count": 30,
1727 | "metadata": {},
1728 | "output_type": "execute_result"
1729 | }
1730 | ],
1731 | "source": [
1732 | "np.log(b) / np.log(3)"
1733 | ]
1734 | },
1735 | {
1736 | "cell_type": "markdown",
1737 | "metadata": {},
1738 | "source": [
1739 |     "##### The speed advantage"
1740 | ]
1741 | },
1742 | {
1743 | "cell_type": "code",
1744 | "execution_count": null,
1745 | "metadata": {},
1746 | "outputs": [],
1747 | "source": [
1748 | "a = list(range(10000))\n",
1749 | "b = list(range(10000))"
1750 | ]
1751 | },
1752 | {
1753 | "cell_type": "code",
1754 | "execution_count": null,
1755 | "metadata": {},
1756 | "outputs": [],
1757 | "source": [
1758 | "%%timeit\n",
1759 | "c = [\n",
1760 | " x * y\n",
1761 | " for x, y in zip(a, b)\n",
1762 | "]"
1763 | ]
1764 | },
1765 | {
1766 | "cell_type": "code",
1767 | "execution_count": null,
1768 | "metadata": {},
1769 | "outputs": [],
1770 | "source": [
1771 | "a = np.array(a)\n",
1772 | "b = np.array(b)"
1773 | ]
1774 | },
1775 | {
1776 | "cell_type": "code",
1777 | "execution_count": null,
1778 | "metadata": {},
1779 | "outputs": [],
1780 | "source": [
1781 | "%%timeit\n",
1782 | "c = a * b"
1783 | ]
1784 | },
1785 | {
1786 | "cell_type": "code",
1787 | "execution_count": null,
1788 | "metadata": {},
1789 | "outputs": [],
1790 | "source": [
1791 | "%%timeit\n",
1792 | "c = [\n",
1793 | " x * y\n",
1794 | " for x, y in zip(a, b)\n",
1795 | "]"
1796 | ]
1797 | },
1798 | {
1799 | "cell_type": "markdown",
1800 | "metadata": {},
1801 | "source": [
1802 |     "Vectorized array operations are about 100 times faster; but if we run ordinary Python loop code over numpy arrays, it turns out substantially slower, even slower than looping over plain lists"
1803 | ]
1804 | },
1805 | {
1806 | "cell_type": "markdown",
1807 | "metadata": {},
1808 | "source": [
1809 |     "### Conclusions\n",
1810 |     "\n",
1811 |     "For best performance, use whole-array arithmetic operations rather than element-by-element Python loops"
1812 | ]
1813 | },
1814 | {
1815 | "cell_type": "code",
1816 | "execution_count": null,
1817 | "metadata": {},
1818 | "outputs": [],
1819 | "source": []
1820 | },
1821 | {
1822 | "cell_type": "code",
1823 | "execution_count": null,
1824 | "metadata": {},
1825 | "outputs": [],
1826 | "source": []
1827 | },
1828 | {
1829 | "cell_type": "markdown",
1830 | "metadata": {},
1831 | "source": [
1832 |     "##### numpy.random"
1833 | ]
1834 | },
1835 | {
1836 | "cell_type": "markdown",
1837 | "metadata": {},
1838 | "source": [
1839 |     "NumPy has its own analogue of the random module: numpy.random. Like its standard-library counterpart it generates random data, but it uses C-level types and can fill whole arrays at once."
1840 | ]
1841 | },
1842 | {
1843 | "cell_type": "code",
1844 | "execution_count": 31,
1845 | "metadata": {},
1846 | "outputs": [
1847 | {
1848 | "data": {
1849 | "text/plain": [
1850 | "array([[[0.8175136 , 0.77078567, 0.87454103, 0.17336117],\n",
1851 | " [0.37306559, 0.3334027 , 0.63796893, 0.42849584],\n",
1852 | " [0.04700558, 0.51279351, 0.22267211, 0.91020539]],\n",
1853 | "\n",
1854 | " [[0.64515575, 0.65825143, 0.90880479, 0.88388794],\n",
1855 | " [0.82751777, 0.46026817, 0.67696989, 0.53016121],\n",
1856 | " [0.06275625, 0.61376869, 0.14391625, 0.30392825]]])"
1857 | ]
1858 | },
1859 | "execution_count": 31,
1860 | "metadata": {},
1861 | "output_type": "execute_result"
1862 | }
1863 | ],
1864 | "source": [
1865 |     "np.random.rand(2, 3, 4)  # uniform distribution on [0, 1) with the given shape"
1866 | ]
1867 | },
1868 | {
1869 | "cell_type": "code",
1870 | "execution_count": 32,
1871 | "metadata": {},
1872 | "outputs": [
1873 | {
1874 | "data": {
1875 | "text/plain": [
1876 | "(2, 3, 4)"
1877 | ]
1878 | },
1879 | "execution_count": 32,
1880 | "metadata": {},
1881 | "output_type": "execute_result"
1882 | }
1883 | ],
1884 | "source": [
1885 | "np.random.rand(2, 3, 4).shape"
1886 | ]
1887 | },
1888 | {
1889 | "cell_type": "code",
1890 | "execution_count": 33,
1891 | "metadata": {},
1892 | "outputs": [
1893 | {
1894 | "data": {
1895 | "text/plain": [
1896 | "array([[[ 0.51807114, 1.21877741, 0.53473039, 1.25560827],\n",
1897 | " [ 1.95685262, 0.26716197, 1.09282955, -0.71969846],\n",
1898 | " [ 2.2309445 , 0.74894436, -0.07109792, 0.35245353]],\n",
1899 | "\n",
1900 | " [[-1.71500229, 0.3727462 , -0.86423839, 0.95929217],\n",
1901 | " [ 1.38904054, -2.07292949, -0.41625269, 1.74899741],\n",
1902 | " [ 0.75667197, -0.40825183, 0.16802865, 1.73164801]]])"
1903 | ]
1904 | },
1905 | "execution_count": 33,
1906 | "metadata": {},
1907 | "output_type": "execute_result"
1908 | }
1909 | ],
1910 | "source": [
1911 |     "np.random.randn(2, 3, 4)  # standard normal distribution with the given shape"
1912 | ]
1913 | },
1914 | {
1915 | "cell_type": "code",
1916 | "execution_count": 34,
1917 | "metadata": {},
1918 | "outputs": [
1919 | {
1920 | "data": {
1921 | "text/plain": [
1922 | "b'\\x8ca\\xba&\\x96\\xb7\\xbc5Z\\xfa'"
1923 | ]
1924 | },
1925 | "execution_count": 34,
1926 | "metadata": {},
1927 | "output_type": "execute_result"
1928 | }
1929 | ],
1930 | "source": [
1931 |     "np.random.bytes(10)  # random bytes"
1932 | ]
1933 | },
1934 | {
1935 | "cell_type": "markdown",
1936 | "metadata": {},
1937 | "source": [
1938 |     "Other distributions can be generated as well; details here:\n",
1939 | "\n",
1940 | "https://docs.scipy.org/doc/numpy-1.12.0/reference/routines.random.html"
1941 | ]
1942 | },
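{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the generated values reproducible, fix the seed before generating (the seed value 42 here is arbitrary):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.random.seed(42)\n",
"print(np.random.randint(10, size=5))\n",
"np.random.seed(42)\n",
"print(np.random.randint(10, size=5))  # the same values again"
]
},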
1943 | {
1944 | "cell_type": "code",
1945 | "execution_count": null,
1946 | "metadata": {},
1947 | "outputs": [],
1948 | "source": []
1949 | },
1950 | {
1951 | "cell_type": "markdown",
1952 | "metadata": {},
1953 | "source": [
1954 |     "#### One more example of efficient computation"
1955 | ]
1956 | },
1957 | {
1958 | "cell_type": "markdown",
1959 | "metadata": {},
1960 | "source": [
1961 |     "To conclude, here is one more example where using numpy speeds up the code substantially"
1962 | ]
1963 | },
1964 | {
1965 | "cell_type": "markdown",
1966 | "metadata": {},
1967 | "source": [
1968 |     "Mathematics defines the operation of multiplying matrices (two-dimensional arrays):\n",
1969 | "\n",
1970 | "$A \\times B = C$\n",
1971 | "\n",
1972 | "$C_{ij} = \\sum_k A_{ik} B_{kj}$"
1973 | ]
1974 | },
1975 | {
1976 | "cell_type": "markdown",
1977 | "metadata": {},
1978 | "source": [
1979 |     "Let us generate random matrices"
1980 | ]
1981 | },
1982 | {
1983 | "cell_type": "code",
1984 | "execution_count": 35,
1985 | "metadata": {},
1986 | "outputs": [],
1987 | "source": [
1988 | "A = np.random.randint(1000, size=(200, 100))\n",
1989 | "B = np.random.randint(1000, size=(100, 300))"
1990 | ]
1991 | },
1992 | {
1993 | "cell_type": "markdown",
1994 | "metadata": {},
1995 | "source": [
1996 |     "Multiplication using numpy"
1997 | ]
1998 | },
1999 | {
2000 | "cell_type": "code",
2001 | "execution_count": 36,
2002 | "metadata": {},
2003 | "outputs": [],
2004 | "source": [
2005 | "def np_multiply():\n",
2006 | " return np.dot(A, B)"
2007 | ]
2008 | },
2009 | {
2010 | "cell_type": "code",
2011 | "execution_count": 37,
2012 | "metadata": {},
2013 | "outputs": [
2014 | {
2015 | "name": "stdout",
2016 | "output_type": "stream",
2017 | "text": [
2018 | "4.41 ms ± 639 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
2019 | ]
2020 | }
2021 | ],
2022 | "source": [
2023 | "%timeit np_multiply()"
2024 | ]
2025 | },
2026 | {
2027 | "cell_type": "markdown",
2028 | "metadata": {
2029 | "collapsed": true
2030 | },
2031 | "source": [
2032 |     "If the matrices are stored as lists of lists rather than two-dimensional arrays, np.dot first has to convert them to arrays, so it takes longer"
2033 | ]
2034 | },
2035 | {
2036 | "cell_type": "code",
2037 | "execution_count": 38,
2038 | "metadata": {},
2039 | "outputs": [],
2040 | "source": [
2041 | "A = [list(x) for x in A]\n",
2042 | "B = [list(x) for x in B]"
2043 | ]
2044 | },
2045 | {
2046 | "cell_type": "code",
2047 | "execution_count": 39,
2048 | "metadata": {},
2049 | "outputs": [
2050 | {
2051 | "name": "stdout",
2052 | "output_type": "stream",
2053 | "text": [
2054 | "10.1 ms ± 768 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
2055 | ]
2056 | }
2057 | ],
2058 | "source": [
2059 | "%timeit np_multiply()"
2060 | ]
2061 | },
2062 | {
2063 | "cell_type": "markdown",
2064 | "metadata": {},
2065 | "source": [
2066 |     "And this is multiplication in pure Python code"
2067 | ]
2068 | },
2069 | {
2070 | "cell_type": "code",
2071 | "execution_count": 40,
2072 | "metadata": {},
2073 | "outputs": [],
2074 | "source": [
2075 | "def python_multiply():\n",
2076 | " res = []\n",
2077 | " for i in range(200):\n",
2078 | " row = []\n",
2079 | " for j in range(300):\n",
2080 | " val = 0\n",
2081 | " for k in range(100):\n",
2082 | " val += A[i][k] * B[k][j]\n",
2083 | " row.append(val)\n",
2084 | " res.append(row)\n",
2085 | " return res"
2086 | ]
2087 | },
2088 | {
2089 | "cell_type": "code",
2090 | "execution_count": 41,
2091 | "metadata": {},
2092 | "outputs": [
2093 | {
2094 | "name": "stdout",
2095 | "output_type": "stream",
2096 | "text": [
2097 | "2.06 s ± 235 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
2098 | ]
2099 | }
2100 | ],
2101 | "source": [
2102 | "%timeit python_multiply()"
2103 | ]
2104 | },
2105 | {
2106 | "cell_type": "markdown",
2107 | "metadata": {},
2108 | "source": [
2109 |     "A speedup of more than 100 times"
2110 | ]
2111 | },
2112 | {
2113 | "cell_type": "code",
2114 | "execution_count": null,
2115 | "metadata": {},
2116 | "outputs": [],
2117 | "source": []
2118 | }
2119 | ],
2120 | "metadata": {
2121 | "kernelspec": {
2122 | "display_name": "Python 3",
2123 | "language": "python",
2124 | "name": "python3"
2125 | },
2126 | "language_info": {
2127 | "codemirror_mode": {
2128 | "name": "ipython",
2129 | "version": 3
2130 | },
2131 | "file_extension": ".py",
2132 | "mimetype": "text/x-python",
2133 | "name": "python",
2134 | "nbconvert_exporter": "python",
2135 | "pygments_lexer": "ipython3",
2136 | "version": "3.6.5"
2137 | }
2138 | },
2139 | "nbformat": 4,
2140 | "nbformat_minor": 2
2141 | }
2142 |
--------------------------------------------------------------------------------