├── O2O优惠券预测
    ├── O2O优惠券预测-数据探索.ipynb
    ├── O2O优惠券预测-模型训练.ipynb
    ├── O2O优惠券预测-模型验证及优化.ipynb
    ├── O2O优惠券预测-特征工程.ipynb
    └── O2O优惠券预测-赛题实践.ipynb
├── README.md
├── 天猫重复购买预测
    ├── 天猫重复购买预测 02 数据探索.ipynb
    ├── 天猫重复购买预测 03 特征工程.ipynb
    ├── 天猫重复购买预测 04 模型训练、验证和评测.ipynb
    └── 天猫重复购买预测 05 特征优化和特征选择.ipynb
├── 工业蒸汽
    ├── zhengqi_test.txt
    ├── zhengqi_train.txt
    ├── 工业蒸汽 02数据探索.ipynb
    ├── 工业蒸汽 03 特征工程.ipynb
    ├── 工业蒸汽 04 模型训练.ipynb
    ├── 工业蒸汽 05 模型验证.ipynb
    ├── 工业蒸汽 06 特征优化.ipynb
    └── 工业蒸汽 07 模型融合.ipynb
└── 阿里云安全恶意程序检测
    ├── 阿里云安全恶意程序检测-优化技巧与解决方案升级.ipynb
    ├── 阿里云安全恶意程序检测-数据探索.ipynb
    ├── 阿里云安全恶意程序检测-特征工程与基线模型.ipynb
    ├── 阿里云安全恶意程序检测-特征工程进阶与方案优化.ipynb
    └── 阿里云安全恶意程序检测-高阶数据探索.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | # alibaba_tianchi_book
2 | Code for the book 阿里云天池大赛赛题解析 (Alibaba Cloud Tianchi competition problem analysis). Link to the original book code: https://tianchi.aliyun.com/specials/promotion/bookcode?spm=5176.14154004.J_3941670930.15.31fe5699wu5Gtm
3 |
4 | For easier browsing, I have reorganized the code; it contains four projects in total.
5 |
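6 | The code quality is high and the material has depth; in my opinion this book should be enough for moving on to intermediate machine learning.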
7 |
8 |
--------------------------------------------------------------------------------
/天猫重复购买预测/天猫重复购买预测 05 特征优化和特征选择.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Import packages"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "import pandas as pd\n",
17 | "import numpy as np\n",
18 | "\n",
19 | "import warnings\n",
20 | "warnings.filterwarnings(\"ignore\") "
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "## Load data (first 10,000 training rows, first 100 test rows)"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": 2,
33 | "metadata": {},
34 | "outputs": [],
35 | "source": [
36 | "train_data = pd.read_csv('train_all.csv',nrows=10000)\n",
37 | "test_data = pd.read_csv('test_all.csv',nrows=100)"
38 | ]
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "metadata": {},
43 | "source": [
44 | "## Load the full dataset"
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": 3,
50 | "metadata": {},
51 | "outputs": [],
52 | "source": [
53 | "# train_data = pd.read_csv('train_all.csv',nrows=None)\n",
54 | "# test_data = pd.read_csv('test_all.csv',nrows=None)"
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "metadata": {},
60 | "source": [
61 | "## Build the training and test arrays"
62 | ]
63 | },
64 | {
65 | "cell_type": "code",
66 | "execution_count": 4,
67 | "metadata": {},
68 | "outputs": [],
69 | "source": [
70 | "features_columns = [col for col in train_data.columns if col not in ['user_id','label']]\n",
71 | "train = train_data[features_columns].values\n",
72 | "test = test_data[features_columns].values\n",
73 | "target =train_data['label'].values"
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "## Missing-value imputation"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {},
86 | "source": [
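87 | "There are many ways to handle missing values; the most common are:\n",
88 | "1. Deletion. Suitable when the dataset is large or only a small share of the data is missing.\n",
89 | "2. Imputation. The usual approach fills with the mean or the median; interpolation or model-based prediction can also be used to fill the gaps.\n",
90 | "3. No handling. Tree-based models are largely insensitive to missing values."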
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {},
96 | "source": [
97 | "#### Fill with the median"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 5,
103 | "metadata": {},
104 | "outputs": [],
105 | "source": [
106 | "# from sklearn.preprocessing import Imputer\n",
107 | "# imputer = Imputer(strategy=\"median\")\n",
108 | "\n",
109 | "from sklearn.impute import SimpleImputer\n",
110 | "\n",
111 | "imputer = SimpleImputer(missing_values=np.nan, strategy='median')\n",
112 | "imputer = imputer.fit(train)\n",
113 | "train_imputer = imputer.transform(train)\n",
114 | "test_imputer = imputer.transform(test)"
115 | ]
116 | },
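The difference between mean and median filling is easiest to see on a skewed column. The following is a minimal sketch, assuming a toy array with hypothetical values (not the competition data); the variable names are illustrative only. It also shows that the fill statistics are learned from the training split and then reused on the test split.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrices (hypothetical values); NaN marks a missing entry.
X_train = np.array([[1.0, 10.0],
                    [2.0, np.nan],
                    [np.nan, 30.0],
                    [100.0, 20.0]])   # 100.0 is an outlier in column 0
X_test = np.array([[np.nan, np.nan]])

median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Fit on the training split only, then apply the learned statistics to the test split.
X_test_median = median_imputer.fit(X_train).transform(X_test)
X_test_mean = mean_imputer.fit(X_train).transform(X_test)

print("median fill:", X_test_median)
print("mean fill:  ", X_test_mean)
```

In column 0 the outlier pulls the mean far above the typical values, while the median stays at 2.0, which is why the median is often the safer default for skewed features.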
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "## Feature selection concepts"
122 | ]
123 | },
124 | {
125 | "cell_type": "markdown",
126 | "metadata": {},
127 | "source": [
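128 | "In machine learning and statistics, feature selection (also called variable selection, attribute selection, or variable subset selection) is the process of choosing a subset of relevant features (attributes, indicators) for building a model. There are three reasons to use feature selection techniques:\n",
129 | "\n",
130 | "1. simplify the model so that researchers or users can understand it more easily,\n",
131 | "2. shorten training time,\n",
132 | "3. improve generalization and reduce overfitting (i.e. reduce variance).\n",
133 | "\n",
134 | "The key assumption behind feature selection is that the training data contains many redundant or irrelevant features, so removing them does not lose information. Redundant and irrelevant are two distinct notions: a feature that is useful on its own can still be redundant if it is strongly correlated with another useful feature that is also present in the data.\n",
135 | "Feature selection differs from feature extraction. Feature extraction creates new features as functions of the original ones, whereas feature selection returns only a subset of the original features. Feature selection is often used in domains with many features but relatively few samples (data points). Typical use cases include the analysis of written text and of microarray data, where there are thousands of features but only tens to hundreds of samples."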
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": 6,
141 | "metadata": {},
142 | "outputs": [],
143 | "source": [
144 | "from sklearn.model_selection import cross_val_score\n",
145 | "from sklearn.ensemble import RandomForestClassifier\n",
146 | "\n",
147 | "def feature_selection(train, train_sel, target):\n",
148 | " clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)\n",
149 | " \n",
150 | " scores = cross_val_score(clf, train, target, cv=5)\n",
151 | " scores_sel = cross_val_score(clf, train_sel, target, cv=5)\n",
152 | " \n",
153 | " print(\"No Select Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2)) \n",
154 | " print(\"Features Select Accuracy: %0.2f (+/- %0.2f)\" % (scores_sel.mean(), scores_sel.std() * 2))"
155 | ]
156 | },
157 | {
158 | "cell_type": "markdown",
159 | "metadata": {},
160 | "source": [
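161 | "### Remove low-variance features (method 1)\n",
162 | "VarianceThreshold is a simple baseline feature selection method. It removes every feature whose variance does not meet a given threshold. By default it removes all zero-variance features, i.e. features that take the same value in every sample."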
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": 7,
168 | "metadata": {},
169 | "outputs": [
170 | {
171 | "name": "stdout",
172 | "output_type": "stream",
173 | "text": [
174 | "Training data shape before feature selection (2000, 229)\n",
175 | "Training data shape after feature selection (2000, 29)\n"
176 | ]
177 | }
178 | ],
179 | "source": [
180 | "from sklearn.feature_selection import VarianceThreshold\n",
181 | "\n",
182 | "sel = VarianceThreshold(threshold=(.8 * (1 - .8)))\n",
183 | "sel = sel.fit(train)\n",
184 | "train_sel = sel.transform(train)\n",
185 | "test_sel = sel.transform(test)\n",
186 | "print('Training data shape before feature selection', train.shape)\n",
187 | "print('Training data shape after feature selection', train_sel.shape)"
188 | ]
189 | },
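The threshold `.8 * (1 - .8)` equals 0.16, the variance `p * (1 - p)` of a Bernoulli variable with `p = 0.8`. For boolean features this removes columns that take the same value in more than 80% of the samples. A minimal sketch on a hypothetical toy matrix (the three columns are invented for illustration):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical boolean features, 10 samples each.
col_rare = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])      # 10% ones -> variance 0.09
col_balanced = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])  # 50% ones -> variance 0.25
col_constant = np.ones(10)                               # constant -> variance 0.0
X = np.column_stack([col_rare, col_balanced, col_constant]).astype(float)

sel = VarianceThreshold(threshold=0.8 * (1 - 0.8))
X_sel = sel.fit_transform(X)

kept = sel.get_support(indices=True)  # indices of the columns that survive
print("variances:", sel.variances_)
print("kept columns:", kept)
```

Only the balanced column exceeds 0.16, so it is the only one kept. Note that variance depends on feature scale, so a single threshold is only meaningful when the features share comparable units.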
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
194 | "### Comparison before and after feature selection"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": 8,
200 | "metadata": {},
201 | "outputs": [
202 | {
203 | "name": "stdout",
204 | "output_type": "stream",
205 | "text": [
206 | "No Select Accuracy: 0.93 (+/- 0.00)\n",
207 | "Features Select Accuracy: 0.93 (+/- 0.00)\n"
208 | ]
209 | }
210 | ],
211 | "source": [
212 | "feature_selection(train, train_sel, target)"
213 | ]
214 | },
215 | {
216 | "cell_type": "markdown",
217 | "metadata": {},
218 | "source": [
219 | "### Univariate feature selection (method 2)\n",
220 | "Selects the best features based on univariate statistical tests."
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": 9,
226 | "metadata": {},
227 | "outputs": [
228 | {
229 | "name": "stdout",
230 | "output_type": "stream",
231 | "text": [
232 | "Training data shape before feature selection (2000, 229)\n",
233 | "Training data shape after feature selection (2000, 2)\n"
234 | ]
235 | }
236 | ],
237 | "source": [
238 | "from sklearn.feature_selection import SelectKBest\n",
239 | "# from sklearn.feature_selection import chi2\n",
240 | "from sklearn.feature_selection import mutual_info_classif\n",
241 | "\n",
242 | "sel = SelectKBest(mutual_info_classif, k=2)\n",
243 | "sel = sel.fit(train, target)\n",
244 | "train_sel = sel.transform(train)\n",
245 | "test_sel = sel.transform(test)\n",
246 | "print('Training data shape before feature selection', train.shape)\n",
247 | "print('Training data shape after feature selection', train_sel.shape)"
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": 10,
253 | "metadata": {},
254 | "outputs": [
255 | {
256 | "name": "stdout",
257 | "output_type": "stream",
258 | "text": [
259 | "Training data shape before feature selection (2000, 229)\n",
260 | "Training data shape after feature selection (2000, 10)\n"
261 | ]
262 | }
263 | ],
264 | "source": [
265 | "sel = SelectKBest(mutual_info_classif, k=10)\n",
266 | "sel = sel.fit(train, target)\n",
267 | "train_sel = sel.transform(train)\n",
268 | "test_sel = sel.transform(test)\n",
269 | "print('Training data shape before feature selection', train.shape)\n",
270 | "print('Training data shape after feature selection', train_sel.shape)"
271 | ]
272 | },
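`mutual_info_classif` uses a randomized nearest-neighbour estimator, so the selected columns can vary slightly between runs unless a seed is fixed. A minimal sketch, assuming synthetic data in which only column 0 depends on the label (the data and the `score_func` name are hypothetical), showing one way to pin the seed and to read back which columns were kept:

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for the notebook's arrays (hypothetical data).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 5))
X[:, 0] += 3.0 * y  # only column 0 carries information about the label

# Fix the estimator's seed so the scores are reproducible.
score_func = partial(mutual_info_classif, random_state=0)
sel = SelectKBest(score_func, k=1).fit(X, y)

selected = sel.get_support(indices=True)  # indices of the k best columns
X_sel = sel.transform(X)
print("scores:", sel.scores_)
print("selected columns:", selected)
```

`get_support(indices=True)` is useful in practice because `transform` returns a bare array, and the indices let you map the kept columns back to their feature names.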
273 | {
274 | "cell_type": "markdown",
275 | "metadata": {},
276 | "source": [
277 | "### Comparison before and after feature selection"
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": 11,
283 | "metadata": {},
284 | "outputs": [
285 | {
286 | "name": "stdout",
287 | "output_type": "stream",
288 | "text": [
289 | "No Select Accuracy: 0.93 (+/- 0.00)\n",
290 | "Features Select Accuracy: 0.93 (+/- 0.00)\n"
291 | ]
292 | }
293 | ],
294 | "source": [
295 | "feature_selection(train, train_sel, target)"
296 | ]
297 | },
298 | {
299 | "cell_type": "markdown",
300 | "metadata": {},
301 | "source": [
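302 | "### Recursive feature elimination (method 3)\n",
303 | "Fit a chosen model and refit it recursively, removing the lowest-ranked features in each round and repeating the loop."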
304 | ]
305 | },
306 | {
307 | "cell_type": "code",
308 | "execution_count": 12,
309 | "metadata": {},
310 | "outputs": [
311 | {
312 | "name": "stdout",
313 | "output_type": "stream",
314 | "text": [
315 | "[False False False False False False False False False False False False\n",
316 | " False False False False False False False False False False False False\n",
317 | " False False False False False False False False False False False False\n",
318 | " False False False False False False False False False False False False\n",
319 | " False False False False False False False False False False False False\n",
320 | " False False False False False False False False False False False False\n",
321 | " False False False False False False False False False False False False\n",
322 | " False False False False False False False False False False False False\n",
323 | " False False False False False False False False False False False False\n",
324 | " False False False False False False False False False False False False\n",
325 | " False False False False False False False False False False False False\n",
326 | " False False False False False False False False False False False False\n",
327 | " False False False False False False False False False False False False\n",
328 | " False False False False False False False False False False False False\n",
329 | " False False False False False False False False False False False True\n",
330 | " True False False False False False False False False False False True\n",
331 | " False True False False False False True False False True False False\n",
332 | " False True False True False True False True False False False False\n",
333 | " False False False False False False False False False False False False\n",
334 | " False]\n",
335 | "[220 219 218 217 216 215 213 212 211 210 209 208 207 206 205 204 203 202\n",
336 | " 201 200 197 195 192 187 186 185 184 183 182 181 180 179 178 177 176 175\n",
337 | " 174 173 172 171 170 169 168 167 166 165 164 163 162 161 160 158 157 155\n",
338 | " 154 153 152 151 150 149 148 147 146 145 144 143 142 141 140 139 137 136\n",
339 | " 134 133 132 131 130 129 128 127 126 125 124 123 122 121 120 117 116 115\n",
340 | " 114 113 112 111 110 109 108 107 106 105 104 103 102 101 100 99 98 97\n",
341 | " 95 94 93 92 91 90 89 88 87 214 86 85 84 83 189 193 199 198\n",
342 | " 196 194 191 190 188 81 80 79 77 74 73 72 71 69 68 67 66 65\n",
343 | " 64 156 63 61 60 59 58 57 55 54 138 135 50 48 46 45 44 42\n",
344 | " 39 119 118 38 35 31 28 25 24 22 17 15 14 96 13 12 43 1\n",
345 | " 1 4 82 75 78 76 26 30 70 20 7 1 62 1 51 53 49 47\n",
346 | " 1 27 41 1 23 21 18 1 11 1 19 1 6 1 8 29 2 5\n",
347 | " 10 3 9 16 32 33 34 36 37 40 52 56 159]\n"
348 | ]
349 | }
350 | ],
351 | "source": [
352 | "from sklearn.feature_selection import RFECV\n",
353 | "from sklearn.ensemble import RandomForestClassifier\n",
354 | "\n",
355 | "clf = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=0, n_jobs=-1)\n",
356 | "selector = RFECV(clf, step=1, cv=2)\n",
357 | "selector = selector.fit(train, target)\n",
358 | "print(selector.support_)\n",
359 | "print(selector.ranking_)"
360 | ]
361 | },
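The two printed arrays are easier to interpret on a small problem: `support_` is a boolean mask of the kept columns, and `ranking_` assigns rank 1 to every kept column and larger ranks to columns eliminated earlier. A minimal sketch, assuming synthetic data from `make_classification` with hypothetical sizes rather than the competition data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for the notebook's data (hypothetical sizes).
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

clf = RandomForestClassifier(n_estimators=20, max_depth=2, random_state=0)
# Drop one feature per round; cross-validation picks how many features to keep.
selector = RFECV(clf, step=1, cv=3).fit(X, y)

selected = selector.get_support(indices=True)  # columns with rank 1
X_sel = selector.transform(X)                  # equivalent to X[:, selector.support_]
print("number of features kept:", selector.n_features_)
print("kept columns:", selected)
print("ranking:", selector.ranking_)
```

Unlike the other methods in this notebook, RFECV chooses the number of features itself, so `transform` should be used to obtain the reduced matrix before comparing scores.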
362 | {
363 | "cell_type": "markdown",
364 | "metadata": {},
365 | "source": [
366 | "### Model-based feature selection (method 4)"
367 | ]
368 | },
369 | {
370 | "cell_type": "markdown",
371 | "metadata": {},
372 | "source": [
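373 | "#### Variable selection using fitted LR coefficients (L2 penalty)\n",
374 | "The logistic regression (LR) model selects variables through its fitted coefficients, keeping those with the largest influence on the target."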
375 | ]
376 | },
377 | {
378 | "cell_type": "code",
379 | "execution_count": 13,
380 | "metadata": {},
381 | "outputs": [
382 | {
383 | "name": "stdout",
384 | "output_type": "stream",
385 | "text": [
386 | "Training data shape before feature selection (2000, 229)\n",
387 | "Training data shape after feature selection (2000, 19)\n"
388 | ]
389 | }
390 | ],
391 | "source": [
392 | "from sklearn.feature_selection import SelectFromModel\n",
393 | "from sklearn.linear_model import LogisticRegression\n",
394 | "from sklearn.preprocessing import Normalizer\n",
395 | "\n",
396 | "normalizer = Normalizer()\n",
397 | "normalizer = normalizer.fit(train) \n",
398 | "\n",
399 | "train_norm = normalizer.transform(train) \n",
400 | "test_norm = normalizer.transform(test)\n",
401 | "\n",
402 | "LR = LogisticRegression(penalty='l2',C=5)\n",
403 | "LR = LR.fit(train_norm, target)\n",
404 | "model = SelectFromModel(LR, prefit=True)\n",
405 | "train_sel = model.transform(train)\n",
406 | "test_sel = model.transform(test)\n",
407 | "print('Training data shape before feature selection', train.shape)\n",
408 | "print('Training data shape after feature selection', train_sel.shape)"
409 | ]
410 | },
411 | {
412 | "cell_type": "markdown",
413 | "metadata": {},
414 | "source": [
415 | "##### Coefficients under the L2 penalty"
416 | ]
417 | },
418 | {
419 | "cell_type": "code",
420 | "execution_count": 14,
421 | "metadata": {},
422 | "outputs": [
423 | {
424 | "data": {
425 | "text/plain": [
426 | "array([ 0.27519508, -0.02736226, -0.00522652, 0.90644126, -0.4310027 ,\n",
427 | " -0.25110925, -0.4058899 , 0.29059019, 0.10568508, -0.02731211])"
428 | ]
429 | },
430 | "execution_count": 14,
431 | "metadata": {},
432 | "output_type": "execute_result"
433 | }
434 | ],
435 | "source": [
436 | "LR.coef_[0][:10]"
437 | ]
438 | },
439 | {
440 | "cell_type": "markdown",
441 | "metadata": {},
442 | "source": [
443 | "### Comparison before and after feature selection"
444 | ]
445 | },
446 | {
447 | "cell_type": "code",
448 | "execution_count": 15,
449 | "metadata": {},
450 | "outputs": [
451 | {
452 | "name": "stdout",
453 | "output_type": "stream",
454 | "text": [
455 | "No Select Accuracy: 0.93 (+/- 0.00)\n",
456 | "Features Select Accuracy: 0.93 (+/- 0.00)\n"
457 | ]
458 | }
459 | ],
460 | "source": [
461 | "feature_selection(train, train_sel, target)"
462 | ]
463 | },
464 | {
465 | "cell_type": "markdown",
466 | "metadata": {},
467 | "source": [
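468 | "#### Variable selection using fitted LR coefficients (L1 penalty)\n",
469 | "The logistic regression (LR) model selects variables through its fitted coefficients, keeping those with the largest influence on the target."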
470 | ]
471 | },
472 | {
473 | "cell_type": "code",
474 | "execution_count": 16,
475 | "metadata": {},
476 | "outputs": [],
477 | "source": [
478 | "# from sklearn.feature_selection import SelectFromModel\n",
479 | "# from sklearn.linear_model import LogisticRegression\n",
480 | "# from sklearn.preprocessing import Normalizer\n",
481 | "\n",
482 | "# normalizer = Normalizer()\n",
483 | "# normalizer = normalizer.fit(train) \n",
484 | "\n",
485 | "# train_norm = normalizer.transform(train) \n",
486 | "# test_norm = normalizer.transform(test)\n",
487 | "\n",
488 | "# LR = LogisticRegression(penalty='l1',C=5)\n",
489 | "# LR = LR.fit(train_norm, target)\n",
490 | "# model = SelectFromModel(LR, prefit=True)\n",
491 | "# train_sel = model.transform(train)\n",
492 | "# test_sel = model.transform(test)\n",
493 | "# print('Training data shape before feature selection', train.shape)\n",
494 | "# print('Training data shape after feature selection', train_sel.shape)"
495 | ]
496 | },
497 | {
498 | "cell_type": "markdown",
499 | "metadata": {},
500 | "source": [
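501 | "##### Coefficients under the L1 penalty\n",
502 | "With a well-chosen α, and provided certain conditions are met, Lasso can recover the exact set of non-zero variables from only a few observations."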
503 | ]
504 | },
505 | {
506 | "cell_type": "code",
507 | "execution_count": 17,
508 | "metadata": {},
509 | "outputs": [],
510 | "source": [
511 | "# LR.coef_[0][:10]"
512 | ]
513 | },
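In recent scikit-learn versions the default `lbfgs` solver of `LogisticRegression` does not support the L1 penalty, so an L1 fit needs a solver such as `liblinear` or `saga`. The following is a minimal sketch of L1-based selection under stated assumptions: the data is synthetic (only columns 0 and 1 drive the label), `StandardScaler` is swapped in for the notebook's `Normalizer`, and `C=0.05` is an illustrative value rather than a tuned one.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=300) > 0).astype(int)

# Put all features on the same scale so the penalty treats them alike.
X_std = StandardScaler().fit_transform(X)

# liblinear supports the L1 penalty; a small C means strong regularization.
lr_l1 = LogisticRegression(penalty="l1", C=0.05, solver="liblinear")
model = SelectFromModel(lr_l1).fit(X_std, y)

selected = model.get_support(indices=True)  # columns with non-zero coefficients
X_sel = model.transform(X_std)
coef = model.estimator_.coef_[0]
print("coefficients:", np.round(coef, 3))
print("selected columns:", selected)
```

The L1 penalty drives the coefficients of uninformative columns to exactly zero, so `SelectFromModel` keeps only the columns with non-zero weights. With an L2 penalty the coefficients merely shrink, and the selector falls back to a mean-based threshold.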
514 | {
515 | "cell_type": "markdown",
516 | "metadata": {},
517 | "source": [
518 | "### Comparison before and after feature selection"
519 | ]
520 | },
521 | {
522 | "cell_type": "code",
523 | "execution_count": 18,
524 | "metadata": {},
525 | "outputs": [
526 | {
527 | "name": "stdout",
528 | "output_type": "stream",
529 | "text": [
530 | "No Select Accuracy: 0.93 (+/- 0.00)\n",
531 | "Features Select Accuracy: 0.93 (+/- 0.00)\n"
532 | ]
533 | }
534 | ],
535 | "source": [
536 | "feature_selection(train, train_sel, target)"
537 | ]
538 | },
539 | {
540 | "cell_type": "markdown",
541 | "metadata": {},
542 | "source": [
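543 | "### Tree-based feature selection\n",
544 | "Tree models rank features by the total score accumulated from the split criterion, and features are then filtered according to that ranking."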
545 | ]
546 | },
547 | {
548 | "cell_type": "code",
549 | "execution_count": 19,
550 | "metadata": {},
551 | "outputs": [
552 | {
553 | "name": "stdout",
554 | "output_type": "stream",
555 | "text": [
556 | "Training data shape before feature selection (2000, 229)\n",
557 | "Training data shape after feature selection (2000, 71)\n"
558 | ]
559 | }
560 | ],
561 | "source": [
562 | "from sklearn.ensemble import ExtraTreesClassifier\n",
563 | "from sklearn.feature_selection import SelectFromModel\n",
564 | "\n",
565 | "clf = ExtraTreesClassifier(n_estimators=50)\n",
566 | "clf = clf.fit(train, target)\n",
567 | "\n",
568 | "model = SelectFromModel(clf, prefit=True)\n",
569 | "train_sel = model.transform(train)\n",
570 | "test_sel = model.transform(test)\n",
571 | "print('Training data shape before feature selection', train.shape)\n",
572 | "print('Training data shape after feature selection', train_sel.shape)"
573 | ]
574 | },
575 | {
576 | "cell_type": "markdown",
577 | "metadata": {},
578 | "source": [
579 | "#### Tree feature importances"
580 | ]
581 | },
582 | {
583 | "cell_type": "code",
584 | "execution_count": 20,
585 | "metadata": {},
586 | "outputs": [
587 | {
588 | "data": {
589 | "text/plain": [
590 | "array([0.09210871, 0.00578114, 0.00388741, 0.0047027 , 0.00324662,\n",
591 | " 0.00409547, 0.00560588, 0.00399393, 0.00499705, 0.00233944])"
592 | ]
593 | },
594 | "execution_count": 20,
595 | "metadata": {},
596 | "output_type": "execute_result"
597 | }
598 | ],
599 | "source": [
600 | "clf.feature_importances_[:10]"
601 | ]
602 | },
603 | {
604 | "cell_type": "code",
605 | "execution_count": 21,
606 | "metadata": {},
607 | "outputs": [],
608 | "source": [
609 | "df_features_import = pd.DataFrame()\n",
610 | "df_features_import['features_import'] = clf.feature_importances_\n",
611 | "df_features_import['features_name'] = features_columns"
612 | ]
613 | },
614 | {
615 | "cell_type": "code",
616 | "execution_count": 22,
617 | "metadata": {},
618 | "outputs": [
619 | {
620 | "data": {
799 | "text/plain": [
800 | " features_import features_name\n",
801 | "0 0.092109 merchant_id\n",
802 | "228 0.085244 xgb_clf\n",
803 | "227 0.056583 lgb_clf\n",
804 | "199 0.007003 embeeding_72\n",
805 | "179 0.006930 embeeding_52\n",
806 | "18 0.006444 seller_most_1_cnt\n",
807 | "207 0.006367 embeeding_80\n",
808 | "193 0.006110 embeeding_66\n",
809 | "190 0.006107 embeeding_63\n",
810 | "132 0.006077 embeeding_5\n",
811 | "144 0.005996 embeeding_17\n",
812 | "146 0.005913 embeeding_19\n",
813 | "1 0.005781 age_range\n",
814 | "158 0.005715 embeeding_31\n",
815 | "191 0.005701 embeeding_64\n",
816 | "165 0.005673 embeeding_38\n",
817 | "15 0.005648 cat_most_1\n",
818 | "6 0.005606 brand_nunique\n",
819 | "22 0.005488 user_cnt_0\n",
820 | "220 0.005485 embeeding_93\n",
821 | "166 0.005473 embeeding_39\n",
822 | "87 0.005472 tfidf_60\n",
823 | "127 0.005463 embeeding_0\n",
824 | "196 0.005427 embeeding_69\n",
825 | "205 0.005407 embeeding_78\n",
826 | "147 0.005402 embeeding_20\n",
827 | "163 0.005347 embeeding_36\n",
828 | "192 0.005328 embeeding_65\n",
829 | "169 0.005280 embeeding_42\n",
830 | "50 0.005277 tfidf_23"
831 | ]
832 | },
833 | "execution_count": 22,
834 | "metadata": {},
835 | "output_type": "execute_result"
836 | }
837 | ],
838 | "source": [
839 | "df_features_import.sort_values(['features_import'],ascending=False).head(30)"
840 | ]
841 | },
842 | {
843 | "cell_type": "code",
844 | "execution_count": 23,
845 | "metadata": {},
846 | "outputs": [],
847 | "source": [
848 | "# features_columns"
849 | ]
850 | },
851 | {
852 | "cell_type": "markdown",
853 | "metadata": {},
854 | "source": [
855 | "### Comparison before and after feature selection"
856 | ]
857 | },
858 | {
859 | "cell_type": "code",
860 | "execution_count": 24,
861 | "metadata": {},
862 | "outputs": [
863 | {
864 | "name": "stdout",
865 | "output_type": "stream",
866 | "text": [
867 | "No Select Accuracy: 0.93 (+/- 0.00)\n",
868 | "Features Select Accuracy: 0.93 (+/- 0.00)\n"
869 | ]
870 | }
871 | ],
872 | "source": [
873 | "feature_selection(train, train_sel, target)"
874 | ]
875 | },
876 | {
877 | "cell_type": "markdown",
878 | "metadata": {},
879 | "source": [
880 | "### LightGBM feature importances"
881 | ]
882 | },
883 | {
884 | "cell_type": "code",
885 | "execution_count": 25,
886 | "metadata": {},
887 | "outputs": [
888 | {
889 | "name": "stdout",
890 | "output_type": "stream",
891 | "text": [
892 | "[LightGBM] [Warning] Unknown parameter: colsample_bylevel\n",
893 | "[LightGBM] [Warning] Unknown parameter: tree_method\n",
894 | "[LightGBM] [Warning] Unknown parameter: silent\n",
895 | "[LightGBM] [Warning] Unknown parameter: colsample_bylevel\n",
896 | "[LightGBM] [Warning] Unknown parameter: tree_method\n",
897 | "[LightGBM] [Warning] Unknown parameter: silent\n",
898 | "[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006242 seconds.\n",
899 | "You can set `force_row_wise=true` to remove the overhead.\n",
900 | "And if memory is not enough, you can set `force_col_wise=true`.\n",
901 | "[LightGBM] [Info] Total Bins 32114\n",
902 | "[LightGBM] [Info] Number of data points in the train set: 1200, number of used features: 224\n",
903 | "[LightGBM] [Warning] Unknown parameter: colsample_bylevel\n",
904 | "[LightGBM] [Warning] Unknown parameter: tree_method\n",
905 | "[LightGBM] [Warning] Unknown parameter: silent\n",
906 | "[LightGBM] [Info] Start training from score -0.068100\n",
907 | "[LightGBM] [Info] Start training from score -2.720629\n",
908 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
909 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
910 | "[1]\tvalid_0's multi_logloss: 0.256738\n",
911 | "Training until validation scores don't improve for 100 rounds\n",
912 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
913 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
914 | "[2]\tvalid_0's multi_logloss: 0.256574\n",
915 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
916 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
917 | "[3]\tvalid_0's multi_logloss: 0.256518\n",
918 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
919 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
920 | "[4]\tvalid_0's multi_logloss: 0.25657\n",
921 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
922 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
923 | "[5]\tvalid_0's multi_logloss: 0.256756\n",
924 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
925 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
926 | "[6]\tvalid_0's multi_logloss: 0.25682\n",
927 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
928 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
929 | "[7]\tvalid_0's multi_logloss: 0.256989\n",
930 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
931 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
932 | "[8]\tvalid_0's multi_logloss: 0.257236\n",
933 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
934 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
935 | "[9]\tvalid_0's multi_logloss: 0.25712\n",
936 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
937 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
938 | "[10]\tvalid_0's multi_logloss: 0.257011\n",
939 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
940 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
941 | "[11]\tvalid_0's multi_logloss: 0.257042\n",
942 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
943 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
944 | "[12]\tvalid_0's multi_logloss: 0.257402\n",
945 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
946 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
947 | "[13]\tvalid_0's multi_logloss: 0.257419\n",
948 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
949 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
950 | "[14]\tvalid_0's multi_logloss: 0.257646\n",
951 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
952 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
953 | "[15]\tvalid_0's multi_logloss: 0.257552\n",
954 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
955 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
956 | "[16]\tvalid_0's multi_logloss: 0.257604\n",
957 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
958 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
959 | "[17]\tvalid_0's multi_logloss: 0.257797\n",
960 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
961 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
962 | "[18]\tvalid_0's multi_logloss: 0.257928\n",
963 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
964 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
965 | "[19]\tvalid_0's multi_logloss: 0.258142\n",
966 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
967 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
968 | "[20]\tvalid_0's multi_logloss: 0.25847\n",
969 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
970 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
971 | "[21]\tvalid_0's multi_logloss: 0.258654\n",
972 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
973 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
974 | "[22]\tvalid_0's multi_logloss: 0.258846\n",
975 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
976 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
977 | "[23]\tvalid_0's multi_logloss: 0.258962\n",
978 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
979 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
980 | "[24]\tvalid_0's multi_logloss: 0.258991\n",
981 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
982 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
983 | "[25]\tvalid_0's multi_logloss: 0.259334\n",
984 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
985 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
986 | "[26]\tvalid_0's multi_logloss: 0.259433\n",
987 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
988 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
989 | "[27]\tvalid_0's multi_logloss: 0.259912\n",
990 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
991 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
992 | "[28]\tvalid_0's multi_logloss: 0.260153\n",
993 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
994 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
995 | "[29]\tvalid_0's multi_logloss: 0.260576\n",
996 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
997 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
998 | "[30]\tvalid_0's multi_logloss: 0.26094\n",
999 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1000 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1001 | "[31]\tvalid_0's multi_logloss: 0.261198\n",
1002 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1003 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1004 | "[32]\tvalid_0's multi_logloss: 0.26141\n",
1005 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1006 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1007 | "[33]\tvalid_0's multi_logloss: 0.261614\n",
1008 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1009 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1010 | "[34]\tvalid_0's multi_logloss: 0.261801\n",
1011 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1012 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1013 | "[35]\tvalid_0's multi_logloss: 0.261931\n",
1014 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1015 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1016 | "[36]\tvalid_0's multi_logloss: 0.262242\n",
1017 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1018 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1019 | "[37]\tvalid_0's multi_logloss: 0.262492\n",
1020 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1021 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1022 | "[38]\tvalid_0's multi_logloss: 0.26273\n",
1023 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1024 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1025 | "[39]\tvalid_0's multi_logloss: 0.262855\n",
1026 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1027 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1028 | "[40]\tvalid_0's multi_logloss: 0.263225\n",
1029 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1030 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1031 | "[41]\tvalid_0's multi_logloss: 0.263311\n",
1032 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1033 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1034 | "[42]\tvalid_0's multi_logloss: 0.263612\n",
1035 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n"
1036 | ]
1037 | },
1038 | {
1039 | "name": "stdout",
1040 | "output_type": "stream",
1041 | "text": [
1042 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1043 | "[43]\tvalid_0's multi_logloss: 0.263937\n",
1044 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1045 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1046 | "[44]\tvalid_0's multi_logloss: 0.264398\n",
1047 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1048 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1049 | "[45]\tvalid_0's multi_logloss: 0.264822\n",
1050 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1051 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1052 | "[46]\tvalid_0's multi_logloss: 0.264977\n",
1053 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1054 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1055 | "[47]\tvalid_0's multi_logloss: 0.265401\n",
1056 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1057 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1058 | "[48]\tvalid_0's multi_logloss: 0.265718\n",
1059 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1060 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1061 | "[49]\tvalid_0's multi_logloss: 0.265859\n",
1062 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1063 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1064 | "[50]\tvalid_0's multi_logloss: 0.266173\n",
1065 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1066 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1067 | "[51]\tvalid_0's multi_logloss: 0.266544\n",
1068 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1069 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1070 | "[52]\tvalid_0's multi_logloss: 0.266719\n",
1071 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1072 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1073 | "[53]\tvalid_0's multi_logloss: 0.266817\n",
1074 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1075 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1076 | "[54]\tvalid_0's multi_logloss: 0.267013\n",
1077 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1078 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1079 | "[55]\tvalid_0's multi_logloss: 0.267385\n",
1080 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1081 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1082 | "[56]\tvalid_0's multi_logloss: 0.267389\n",
1083 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1084 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1085 | "[57]\tvalid_0's multi_logloss: 0.267662\n",
1086 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1087 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1088 | "[58]\tvalid_0's multi_logloss: 0.267792\n",
1089 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1090 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1091 | "[59]\tvalid_0's multi_logloss: 0.268017\n",
1092 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1093 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1094 | "[60]\tvalid_0's multi_logloss: 0.268158\n",
1095 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1096 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1097 | "[61]\tvalid_0's multi_logloss: 0.268437\n",
1098 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1099 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1100 | "[62]\tvalid_0's multi_logloss: 0.268773\n",
1101 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1102 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1103 | "[63]\tvalid_0's multi_logloss: 0.268824\n",
1104 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1105 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1106 | "[64]\tvalid_0's multi_logloss: 0.269138\n",
1107 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1108 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1109 | "[65]\tvalid_0's multi_logloss: 0.269357\n",
1110 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1111 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1112 | "[66]\tvalid_0's multi_logloss: 0.269572\n",
1113 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1114 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1115 | "[67]\tvalid_0's multi_logloss: 0.269786\n",
1116 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1117 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1118 | "[68]\tvalid_0's multi_logloss: 0.270102\n",
1119 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1120 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1121 | "[69]\tvalid_0's multi_logloss: 0.270435\n",
1122 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1123 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1124 | "[70]\tvalid_0's multi_logloss: 0.270566\n",
1125 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1126 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1127 | "[71]\tvalid_0's multi_logloss: 0.270679\n",
1128 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1129 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1130 | "[72]\tvalid_0's multi_logloss: 0.271056\n",
1131 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1132 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1133 | "[73]\tvalid_0's multi_logloss: 0.271474\n",
1134 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1135 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1136 | "[74]\tvalid_0's multi_logloss: 0.27168\n",
1137 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1138 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1139 | "[75]\tvalid_0's multi_logloss: 0.271918\n",
1140 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1141 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1142 | "[76]\tvalid_0's multi_logloss: 0.271937\n",
1143 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1144 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1145 | "[77]\tvalid_0's multi_logloss: 0.272113\n",
1146 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1147 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1148 | "[78]\tvalid_0's multi_logloss: 0.27242\n",
1149 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1150 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1151 | "[79]\tvalid_0's multi_logloss: 0.272712\n",
1152 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1153 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1154 | "[80]\tvalid_0's multi_logloss: 0.27267\n",
1155 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1156 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1157 | "[81]\tvalid_0's multi_logloss: 0.273019\n",
1158 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1159 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1160 | "[82]\tvalid_0's multi_logloss: 0.272981\n",
1161 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1162 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1163 | "[83]\tvalid_0's multi_logloss: 0.273218\n",
1164 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1165 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1166 | "[84]\tvalid_0's multi_logloss: 0.27353\n",
1167 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1168 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1169 | "[85]\tvalid_0's multi_logloss: 0.273649\n",
1170 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1171 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1172 | "[86]\tvalid_0's multi_logloss: 0.273775\n",
1173 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1174 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1175 | "[87]\tvalid_0's multi_logloss: 0.273835\n",
1176 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1177 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1178 | "[88]\tvalid_0's multi_logloss: 0.274091\n",
1179 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1180 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1181 | "[89]\tvalid_0's multi_logloss: 0.274422\n",
1182 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1183 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1184 | "[90]\tvalid_0's multi_logloss: 0.274716\n",
1185 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1186 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1187 | "[91]\tvalid_0's multi_logloss: 0.275082\n",
1188 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n"
1189 | ]
1190 | },
1191 | {
1192 | "name": "stdout",
1193 | "output_type": "stream",
1194 | "text": [
1195 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1196 | "[92]\tvalid_0's multi_logloss: 0.275278\n",
1197 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1198 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1199 | "[93]\tvalid_0's multi_logloss: 0.275447\n",
1200 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1201 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1202 | "[94]\tvalid_0's multi_logloss: 0.275438\n",
1203 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1204 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1205 | "[95]\tvalid_0's multi_logloss: 0.275778\n",
1206 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1207 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1208 | "[96]\tvalid_0's multi_logloss: 0.27591\n",
1209 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1210 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1211 | "[97]\tvalid_0's multi_logloss: 0.276129\n",
1212 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1213 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1214 | "[98]\tvalid_0's multi_logloss: 0.276326\n",
1215 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1216 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1217 | "[99]\tvalid_0's multi_logloss: 0.276449\n",
1218 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1219 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1220 | "[100]\tvalid_0's multi_logloss: 0.276745\n",
1221 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1222 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1223 | "[101]\tvalid_0's multi_logloss: 0.276895\n",
1224 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1225 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1226 | "[102]\tvalid_0's multi_logloss: 0.276914\n",
1227 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1228 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1229 | "[103]\tvalid_0's multi_logloss: 0.277281\n",
1230 | "Early stopping, best iteration is:\n",
1231 | "[3]\tvalid_0's multi_logloss: 0.256518\n"
1232 | ]
1233 | }
1234 | ],
1235 | "source": [
1236 | "import lightgbm\n",
1237 | "from sklearn.model_selection import train_test_split\n",
1238 | "\n",
1239 | "X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.4, random_state=0)\n",
1240 | "\n",
1241 | "clf = lightgbm\n",
1242 | "\n",
1243 | "train_matrix = clf.Dataset(X_train, label=y_train)\n",
1244 | "test_matrix = clf.Dataset(X_test, label=y_test)\n",
1245 | "params = {\n",
1246 | "    'boosting_type': 'gbdt',\n",
1247 | "    #'boosting_type': 'dart',\n",
1248 | "    'objective': 'multiclass',\n",
1249 | "    'metric': 'multi_logloss',\n",
1250 | "    'min_child_weight': 1.5,\n",
1251 | "    'num_leaves': 2**5,\n",
1252 | "    'lambda_l2': 10,\n",
1253 | "    'subsample': 0.7,\n",
1254 | "    'colsample_bytree': 0.7,\n",
1255 | "    'colsample_bylevel': 0.7,\n",
1256 | "    'learning_rate': 0.03,\n",
1257 | "    'tree_method': 'exact',  # XGBoost-style key; LightGBM does not define tree_method\n",
1258 | "    'seed': 2017,\n",
1259 | "    \"num_class\": 2,\n",
1260 | "    'silent': True,\n",
1261 | "}\n",
1262 | "num_round = 10000\n",
1263 | "early_stopping_rounds = 100\n",
1264 | "model = clf.train(params,\n",
1265 | "                  train_matrix,\n",
1266 | "                  num_round,\n",
1267 | "                  valid_sets=test_matrix,\n",
1268 | "                  early_stopping_rounds=early_stopping_rounds)"
1269 | ]
1270 | },
1271 | {
1272 | "cell_type": "code",
1273 | "execution_count": 26,
1274 | "metadata": {},
1275 | "outputs": [],
1276 | "source": [
1277 | "def lgb_transform(train, test, model, topK):\n",
1278 | "    train_df = pd.DataFrame(train)\n",
1279 | "    train_df.columns = range(train.shape[1])\n",
1280 | "\n",
1281 | "    test_df = pd.DataFrame(test)\n",
1282 | "    test_df.columns = range(test.shape[1])\n",
1283 | "\n",
1284 | "    features_import = pd.DataFrame()\n",
1285 | "    features_import['importance'] = model.feature_importance()\n",
1286 | "    features_import['col'] = range(train.shape[1])\n",
1287 | "\n",
1288 | "    features_import = features_import.sort_values(['importance'], ascending=False).head(topK)\n",
1289 | "    sel_col = list(features_import.col)\n",
1290 | "\n",
1291 | "    train_sel = train_df[sel_col]\n",
1292 | "    test_sel = test_df[sel_col]\n",
1293 | "    return train_sel, test_sel"
1294 | ]
1295 | },
1296 | {
1297 | "cell_type": "code",
1298 | "execution_count": 27,
1299 | "metadata": {},
1300 | "outputs": [
1301 | {
1302 | "name": "stdout",
1303 | "output_type": "stream",
1304 | "text": [
1305 | "Training data shape before feature selection (2000, 229)\n",
1306 | "Training data shape after feature selection (2000, 20)\n"
1307 | ]
1308 | }
1309 | ],
1310 | "source": [
1311 | "train_sel, test_sel = lgb_transform(train, test, model, 20)\n",
1312 | "print('Training data shape before feature selection', train.shape)\n",
1313 | "print('Training data shape after feature selection', train_sel.shape)"
1314 | ]
1315 | },
1316 | {
1317 | "cell_type": "markdown",
1318 | "metadata": {},
1319 | "source": [
1320 | "### LightGBM feature importance"
1321 | ]
1322 | },
1323 | {
1324 | "cell_type": "code",
1325 | "execution_count": 28,
1326 | "metadata": {},
1327 | "outputs": [
1328 | {
1329 | "data": {
1330 | "text/plain": [
1331 | "array([2, 3, 0, 0, 0, 1, 1, 0, 1, 0])"
1332 | ]
1333 | },
1334 | "execution_count": 28,
1335 | "metadata": {},
1336 | "output_type": "execute_result"
1337 | }
1338 | ],
1339 | "source": [
1340 | "model.feature_importance()[:10]"
1341 | ]
1342 | },
1343 | {
1344 | "cell_type": "code",
1345 | "execution_count": 29,
1346 | "metadata": {},
1347 | "outputs": [],
1348 | "source": [
1349 | "#sorted(model.feature_importance(),reverse=True)[:10]"
1350 | ]
1351 | },
1352 | {
1353 | "cell_type": "markdown",
1354 | "metadata": {},
1355 | "source": [
1356 | "### Comparison before and after feature selection"
1357 | ]
1358 | },
1359 | {
1360 | "cell_type": "code",
1361 | "execution_count": 30,
1362 | "metadata": {},
1363 | "outputs": [
1364 | {
1365 | "name": "stdout",
1366 | "output_type": "stream",
1367 | "text": [
1368 | "No Select Accuracy: 0.93 (+/- 0.00)\n",
1369 | "Features Select Accuracy: 0.93 (+/- 0.00)\n"
1370 | ]
1371 | }
1372 | ],
1373 | "source": [
1374 | "feature_selection(train, train_sel, target)"
1375 | ]
1376 | },
1377 | {
1378 | "cell_type": "code",
1379 | "execution_count": null,
1380 | "metadata": {},
1381 | "outputs": [],
1382 | "source": []
1383 | }
1384 | ],
1385 | "metadata": {
1386 | "kernelspec": {
1387 | "display_name": "Python 3",
1388 | "language": "python",
1389 | "name": "python3"
1390 | },
1391 | "language_info": {
1392 | "codemirror_mode": {
1393 | "name": "ipython",
1394 | "version": 3
1395 | },
1396 | "file_extension": ".py",
1397 | "mimetype": "text/x-python",
1398 | "name": "python",
1399 | "nbconvert_exporter": "python",
1400 | "pygments_lexer": "ipython3",
1401 | "version": "3.7.1"
1402 | },
1403 | "latex_envs": {
1404 | "LaTeX_envs_menu_present": true,
1405 | "autoclose": false,
1406 | "autocomplete": true,
1407 | "bibliofile": "biblio.bib",
1408 | "cite_by": "apalike",
1409 | "current_citInitial": 1,
1410 | "eqLabelWithNumbers": true,
1411 | "eqNumInitial": 1,
1412 | "hotkeys": {
1413 | "equation": "Ctrl-E",
1414 | "itemize": "Ctrl-I"
1415 | },
1416 | "labels_anchors": false,
1417 | "latex_user_defs": false,
1418 | "report_style_numbering": false,
1419 | "user_envs_cfg": false
1420 | }
1421 | },
1422 | "nbformat": 4,
1423 | "nbformat_minor": 2
1424 | }
1425 |
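The `lgb_transform` cell in the notebook above keeps the `topK` columns with the largest `model.feature_importance()` values. The ranking step on its own can be sketched without LightGBM or pandas. This is only an illustration: the helper name `top_k_columns` is hypothetical and not part of the notebook, and ties are broken by the lower column index here, which pandas' default sort does not guarantee.

```python
def top_k_columns(importances, k):
    """Return the indices of the k largest importance scores.

    Hypothetical helper for illustration only. Python's sort is stable,
    so equal scores keep their original (lower index first) order.
    """
    order = sorted(range(len(importances)), key=lambda i: -importances[i])
    return order[:k]


# The first ten importances shown by the notebook's feature_importance() cell.
scores = [2, 3, 0, 0, 0, 1, 1, 0, 1, 0]
print(top_k_columns(scores, 3))  # -> [1, 0, 5]
```

Selecting the same index list from both the training and the test frame, as `lgb_transform` does, keeps the two column sets aligned.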
--------------------------------------------------------------------------------
/工业蒸汽/工业蒸汽 06 特征优化.ipynb:
--------------------------------------------------------------------------------
1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.7.0"},"colab":{"name":"工业蒸汽 06 特征优化.ipynb","provenance":[],"collapsed_sections":[]}},"cells":[{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"wo_kOHZTLhVZ","executionInfo":{"status":"ok","timestamp":1623400030401,"user_tz":-480,"elapsed":22582,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}},"outputId":"e5f54b16-a3a0-4165-f896-bdd2384af00c"},"source":["from google.colab import drive\n","drive.mount('/content/drive')"],"execution_count":1,"outputs":[{"output_type":"stream","text":["Mounted at /content/drive\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"G6wonEdoLjSN","executionInfo":{"status":"ok","timestamp":1623400032113,"user_tz":-480,"elapsed":1729,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}},"outputId":"6ffe00a1-5ad1-41e6-8b04-5594798cdb7d"},"source":["%cd /content/drive/MyDrive/Colab Notebooks/天池/工业蒸汽"],"execution_count":2,"outputs":[{"output_type":"stream","text":["/content/drive/MyDrive/Colab Notebooks/天池/工业蒸汽\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"q_gDa5rQLdXM"},"source":["## Feature optimization\n","\n","### 
Import data"]},{"cell_type":"code","metadata":{"id":"qYQ3-RsyLdXW","executionInfo":{"status":"ok","timestamp":1623401062097,"user_tz":-480,"elapsed":556,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}}},"source":["import pandas as pd\n","\n","train_data_file = \"./zhengqi_train.txt\"\n","test_data_file = \"./zhengqi_test.txt\"\n","\n","train_data = pd.read_csv(train_data_file, sep='\\t', encoding='utf-8')\n","test_data = pd.read_csv(test_data_file, sep='\\t', encoding='utf-8')"],"execution_count":13,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"1d20kkI_LdXX"},"source":["### Define the feature construction operations"]},{"cell_type":"code","metadata":{"id":"ayULiWwdLdXY","executionInfo":{"status":"ok","timestamp":1623401065107,"user_tz":-480,"elapsed":505,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}}},"source":["epsilon=1e-5\n","\n","# Pairwise cross features; define your own as needed, e.g. 
x*x/y, log(x)/y, and so on\n","func_dict = {\n","    'add': lambda x,y: x+y,\n","    'mins': lambda x,y: x-y,\n","    'div': lambda x,y: x/(y+epsilon),\n","    'multi': lambda x,y: x*y\n","}"],"execution_count":14,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"ysPQ-6EQLdXY"},"source":["### Define the feature construction function"]},{"cell_type":"code","metadata":{"id":"dPZ-p8xSLdXZ","executionInfo":{"status":"ok","timestamp":1623401066461,"user_tz":-480,"elapsed":3,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}}},"source":["def auto_features_make(train_data,test_data,func_dict,col_list):\n","    train_data, test_data = train_data.copy(), test_data.copy()\n","    for col_i in col_list:\n","        for col_j in col_list:\n","            for func_name, func in func_dict.items():\n","                for data in [train_data,test_data]:\n","                    func_features = func(data[col_i],data[col_j])\n","                    
col_func_features = '-'.join([col_i,func_name,col_j])\n","                    data[col_func_features] = func_features\n","    return train_data,test_data"],"execution_count":15,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"rd49LO4ZLdXZ"},"source":["### Construct features for the training and test sets"]},{"cell_type":"code","metadata":{"id":"Yw1NxFGPLdXa","executionInfo":{"status":"ok","timestamp":1623401082922,"user_tz":-480,"elapsed":14522,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}}},"source":["train_data2, test_data2 = auto_features_make(train_data,test_data,func_dict,col_list=test_data.columns)"],"execution_count":16,"outputs":[]},{"cell_type":"code","metadata":{"id":"5aDoibXvLdXa","executionInfo":{"status":"ok","timestamp":1623401101042,"user_tz":-480,"elapsed":9223,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}}},"source":["from sklearn.decomposition import PCA  # principal component analysis\n","\n","# Reduce dimensionality with PCA\n","pca = PCA(n_components=500)\n","# Fit on the feature columns shared with the test set; iloc[:,0:-1] would keep 'target' and drop the last constructed feature\n","train_data2_pca = pca.fit_transform(train_data2[test_data2.columns])\n","test_data2_pca = pca.transform(test_data2)\n","train_data2_pca = pd.DataFrame(train_data2_pca)\n","test_data2_pca = pd.DataFrame(test_data2_pca)\n","train_data2_pca['target'] = 
train_data2['target']"],"execution_count":17,"outputs":[]},{"cell_type":"code","metadata":{"id":"gWeLa6r8LdXb","executionInfo":{"status":"ok","timestamp":1623401103248,"user_tz":-480,"elapsed":496,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}}},"source":["X_train2 = train_data2[test_data2.columns].values\n","y_train = train_data2['target']"],"execution_count":18,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"kDqTOUhmLdXb"},"source":["### 
Train and evaluate a LightGBM model on the newly constructed features"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":833},"id":"wMfN5jqTLdXb","executionInfo":{"status":"error","timestamp":1623402038351,"user_tz":-480,"elapsed":6450,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}},"outputId":"f5f30543-7f11-4076-8560-e1ecf4ca9efa"},"source":["# ls_validation i\n","from sklearn.model_selection import KFold\n","from sklearn.metrics import mean_squared_error\n","import lightgbm as lgb\n","import numpy as np\n","\n","# 5-fold cross-validation\n","Folds=5\n","# KFold takes n_splits as its first argument; passing len(X_train2) there would create one fold per sample\n","kf = KFold(n_splits=Folds, random_state=2019, shuffle=True)\n","# Record training and validation MSE\n","MSE_DICT = {\n","    'train_mse':[],\n","    'test_mse':[]\n","}\n","\n","# Offline training and prediction\n","for i, (train_index, test_index) in enumerate(kf.split(X_train2)):\n","    # LightGBM tree model\n","    lgb_reg = lgb.LGBMRegressor(\n","        learning_rate=0.01,\n","        max_depth=-1,\n","        n_estimators=5000,\n","        boosting_type='gbdt',\n","        random_state=2019,\n","        objective='regression',\n","    )\n","\n","    # Split into training and validation folds\n","    X_train_KFold, X_test_KFold = X_train2[train_index], X_train2[test_index]\n","    y_train_KFold, y_test_KFold = y_train[train_index], y_train[test_index]\n","\n","    # Train the model\n","    lgb_reg.fit(\n","        X=X_train_KFold,y=y_train_KFold,\n","        eval_set=[(X_train_KFold, y_train_KFold),(X_test_KFold, y_test_KFold)],\n","        eval_names=['Train','Test'],\n","        early_stopping_rounds=100,\n","        eval_metric='MSE',\n","        verbose=50\n","    )\n","\n","    # Predict on the training fold and the validation fold\n","    
y_train_KFold_predict = lgb_reg.predict(X_train_KFold,num_iteration=lgb_reg.best_iteration_)\n","    y_test_KFold_predict = lgb_reg.predict(X_test_KFold,num_iteration=lgb_reg.best_iteration_)\n","\n","    print('Fold {}: train MSE and test MSE'.format(i))\n","    train_mse = mean_squared_error(y_train_KFold_predict, y_train_KFold)\n","    print('------\\n', 'Train MSE\\n', 
train_mse, '\\n------')\n","    test_mse = mean_squared_error(y_test_KFold_predict, y_test_KFold)\n","    print('------\\n', 'Test MSE\\n', test_mse, '\\n------\\n')\n","\n","    MSE_DICT['train_mse'].append(train_mse)\n","    MSE_DICT['test_mse'].append(test_mse)\n","print('------\\n', 'Train MSE\\n', MSE_DICT['train_mse'], '\\n', np.mean(MSE_DICT['train_mse']), '\\n------')\n","print('------\\n', 'Test MSE\\n', MSE_DICT['test_mse'], '\\n', np.mean(MSE_DICT['test_mse']), '\\n------')"],"execution_count":20,"outputs":[{"output_type":"stream","text":["Training until validation scores don't improve for 100 rounds.\n","[50]\tTrain's l2: 0.418978\tTrain's l2: 0.418978\tTest's l2: 0.106374\tTest's l2: 0.106374\n","[100]\tTrain's l2: 0.203693\tTrain's l2: 0.203693\tTest's l2: 0.02276\tTest's l2: 0.02276\n","[150]\tTrain's l2: 0.114486\tTrain's l2: 0.114486\tTest's l2: 0.00527795\tTest's l2: 0.00527795\n","[200]\tTrain's l2: 0.0741934\tTrain's l2: 0.0741934\tTest's l2: 5.99266e-05\tTest's l2: 5.99266e-05\n","[250]\tTrain's l2: 0.0535396\tTrain's l2: 0.0535396\tTest's l2: 0.00036171\tTest's l2: 0.00036171\n","[300]\tTrain's l2: 0.041529\tTrain's l2: 0.041529\tTest's l2: 0.00267813\tTest's l2: 0.00267813\n","Early stopping, best iteration is:\n","[221]\tTrain's l2: 0.0640274\tTrain's l2: 0.0640274\tTest's l2: 6.95547e-08\tTest's l2: 6.95547e-08\n","Fold 0: train MSE and test MSE\n","------\n"," Train MSE\n"," 0.0640273654375399 \n","------\n","------\n"," Test MSE\n"," 6.954770450692039e-08 \n","------\n","\n","Training until validation scores don't improve for 100 rounds.\n","[50]\tTrain's l2: 0.419128\tTrain's l2: 0.419128\tTest's l2: 0.142103\tTest's l2: 0.142103\n","[100]\tTrain's l2: 0.203838\tTrain's l2: 0.203838\tTest's l2: 0.0997735\tTest's l2: 0.0997735\n","[150]\tTrain's l2: 0.114537\tTrain's l2: 0.114537\tTest's l2: 0.0765469\tTest's l2: 0.0765469\n","[200]\tTrain's l2: 0.0742645\tTrain's l2: 0.0742645\tTest's l2: 0.0612271\tTest's l2: 0.0612271\n","[250]\tTrain's l2: 
0.0536164\tTrain's l2: 
0.0536164\tTest's l2: 0.0471729\tTest's l2: 0.0471729\n","[300]\tTrain's l2: 0.0415117\tTrain's l2: 0.0415117\tTest's l2: 0.0427081\tTest's l2: 0.0427081\n"],"name":"stdout"},{"output_type":"error","ename":"KeyboardInterrupt","evalue":"ignored","traceback":["\u001b[0;31m---------------------------------------------------------------------------\u001b[0m","\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)","\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 38\u001b[0m \u001b[0mearly_stopping_rounds\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m100\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 39\u001b[0m \u001b[0meval_metric\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'MSE'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 40\u001b[0;31m \u001b[0mverbose\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m50\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 41\u001b[0m )\n\u001b[1;32m 42\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n","\u001b[0;32m/usr/local/lib/python3.7/dist-packages/lightgbm/sklearn.py\u001b[0m in \u001b[0;36mfit\u001b[0;34m(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_init_score, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)\u001b[0m\n\u001b[1;32m 683\u001b[0m \u001b[0mverbose\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mverbose\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeature_name\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mfeature_name\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 684\u001b[0m \u001b[0mcategorical_feature\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcategorical_feature\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 685\u001b[0;31m callbacks=callbacks)\n\u001b[0m\u001b[1;32m 686\u001b[0m 
\u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 687\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n","\u001b[0;32m/usr/local/lib/python3.7/dist-packages/lightgbm/sklearn.py\u001b[0m in \u001b[0;36mfit\u001b[0;34m(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)\u001b[0m\n\u001b[1;32m 542\u001b[0m \u001b[0mverbose_eval\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mverbose\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeature_name\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mfeature_name\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 543\u001b[0m \u001b[0mcategorical_feature\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcategorical_feature\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 544\u001b[0;31m callbacks=callbacks)\n\u001b[0m\u001b[1;32m 545\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 546\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mevals_result\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n","\u001b[0;32m/usr/local/lib/python3.7/dist-packages/lightgbm/engine.py\u001b[0m in \u001b[0;36mtrain\u001b[0;34m(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)\u001b[0m\n\u001b[1;32m 216\u001b[0m evaluation_result_list=None))\n\u001b[1;32m 217\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 218\u001b[0;31m 
\u001b[0mbooster\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mupdate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfobj\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mfobj\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 219\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 220\u001b[0m \u001b[0mevaluation_result_list\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n","\u001b[0;32m/usr/local/lib/python3.7/dist-packages/lightgbm/basic.py\u001b[0m in \u001b[0;36mupdate\u001b[0;34m(self, train_set, fobj)\u001b[0m\n\u001b[1;32m 1800\u001b[0m _safe_call(_LIB.LGBM_BoosterUpdateOneIter(\n\u001b[1;32m 1801\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhandle\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1802\u001b[0;31m ctypes.byref(is_finished)))\n\u001b[0m\u001b[1;32m 1803\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__is_predicted_cur_iter\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;32mFalse\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0m_\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrange_\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__num_dataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1804\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mis_finished\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalue\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n","\u001b[0;31mKeyboardInterrupt\u001b[0m: "]}]},{"cell_type":"code","metadata":{"id":"aB3dbDkaPV8W"},"source":[""],"execution_count":null,"outputs":[]}]}
--------------------------------------------------------------------------------
/阿里云安全恶意程序检测/阿里云安全恶意程序检测-优化技巧与解决方案升级.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Section 6: Optimization Techniques and Solution Upgrades"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## 6.2 Deep Learning Solution: TextCNN Modeling"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "### 6.2.2 Reading the Data"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": null,
27 | "metadata": {},
28 | "outputs": [],
29 | "source": [
30 | "import pandas as pd\n",
31 | "import numpy as np\n",
32 | "import seaborn as sns\n",
33 | "import matplotlib.pyplot as plt\n",
34 | "\n",
35 | "import lightgbm as lgb\n",
36 | "from sklearn.model_selection import train_test_split\n",
37 | "from sklearn.preprocessing import OneHotEncoder\n",
38 | "\n",
39 | "from tqdm import tqdm_notebook\n",
40 | "from sklearn.preprocessing import LabelBinarizer,LabelEncoder\n",
41 | "\n",
42 | "import warnings\n",
43 | "warnings.filterwarnings('ignore')\n",
44 | "%matplotlib inline"
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": 4,
50 | "metadata": {},
51 | "outputs": [],
52 | "source": [
53 | "path = '../security_data/'\n",
54 | "train = pd.read_csv(path + 'security_train.csv')\n",
55 | "test = pd.read_csv(path + 'security_test.csv')"
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": 2,
61 | "metadata": {},
62 | "outputs": [],
63 | "source": [
64 | "import numpy as np\n",
65 | "import pandas as pd\n",
66 | "from tqdm import tqdm \n",
67 | "\n",
68 | "class _Data_Preprocess:\n",
69 | " def __init__(self):\n",
70 | " self.int8_max = np.iinfo(np.int8).max\n",
71 | " self.int8_min = np.iinfo(np.int8).min\n",
72 | "\n",
73 | " self.int16_max = np.iinfo(np.int16).max\n",
74 | " self.int16_min = np.iinfo(np.int16).min\n",
75 | "\n",
76 | " self.int32_max = np.iinfo(np.int32).max\n",
77 | " self.int32_min = np.iinfo(np.int32).min\n",
78 | "\n",
79 | " self.int64_max = np.iinfo(np.int64).max\n",
80 | " self.int64_min = np.iinfo(np.int64).min\n",
81 | "\n",
82 | " self.float16_max = np.finfo(np.float16).max\n",
83 | " self.float16_min = np.finfo(np.float16).min\n",
84 | "\n",
85 | " self.float32_max = np.finfo(np.float32).max\n",
86 | " self.float32_min = np.finfo(np.float32).min\n",
87 | "\n",
88 | " self.float64_max = np.finfo(np.float64).max\n",
89 | " self.float64_min = np.finfo(np.float64).min\n",
90 | "\n",
91 | " def _get_type(self, min_val, max_val, types):\n",
92 | " if types == 'int':\n",
93 | " if max_val <= self.int8_max and min_val >= self.int8_min:\n",
94 | " return np.int8\n",
95 | " elif max_val <= self.int16_max and min_val >= self.int16_min:\n",
96 | " return np.int16\n",
97 | " elif max_val <= self.int32_max and min_val >= self.int32_min:\n",
98 | " return np.int32\n",
99 | " return None\n",
100 | "\n",
101 | " elif types == 'float':\n",
102 | " if max_val <= self.float16_max and min_val >= self.float16_min:\n",
103 | " return np.float16\n",
104 | " if max_val <= self.float32_max and min_val >= self.float32_min:\n",
105 | " return np.float32\n",
106 | " if max_val <= self.float64_max and min_val >= self.float64_min:\n",
107 | " return np.float64\n",
108 | " return None\n",
109 | "\n",
110 | " def _memory_process(self, df):\n",
111 | " init_memory = df.memory_usage().sum() / 1024 ** 2 / 1024\n",
112 | " print('Original data occupies {} GB memory.'.format(init_memory))\n",
113 | " df_cols = df.columns\n",
114 | "\n",
115 | " \n",
116 | " for col in tqdm(df_cols):\n",
117 | " try:\n",
118 | " if 'float' in str(df[col].dtypes):\n",
119 | " max_val = df[col].max()\n",
120 | " min_val = df[col].min()\n",
121 | " trans_types = self._get_type(min_val, max_val, 'float')\n",
122 | " if trans_types is not None:\n",
123 | " df[col] = df[col].astype(trans_types)\n",
124 | " elif 'int' in str(df[col].dtypes):\n",
125 | " max_val = df[col].max()\n",
126 | " min_val = df[col].min()\n",
127 | " trans_types = self._get_type(min_val, max_val, 'int')\n",
128 | " if trans_types is not None:\n",
129 | " df[col] = df[col].astype(trans_types)\n",
130 | " except Exception:\n",
131 | " print(' Can not do any process for column, {}.'.format(col)) \n",
132 | " afterprocess_memory = df.memory_usage().sum() / 1024 ** 2 / 1024\n",
133 | " print('After processing, the data occupies {} GB memory.'.format(afterprocess_memory))\n",
134 | " return df\n",
135 | "\n",
136 | "memory_process = _Data_Preprocess()"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": 5,
142 | "metadata": {
143 | "scrolled": true
144 | },
145 | "outputs": [
146 | {
147 | "data": {
219 | "text/plain": [
220 | " file_id label api tid index\n",
221 | "0 1 5 LdrLoadDll 2488 0\n",
222 | "1 1 5 LdrGetProcedureAddress 2488 1\n",
223 | "2 1 5 LdrGetProcedureAddress 2488 2\n",
224 | "3 1 5 LdrGetProcedureAddress 2488 3\n",
225 | "4 1 5 LdrGetProcedureAddress 2488 4"
226 | ]
227 | },
228 | "execution_count": 5,
229 | "metadata": {},
230 | "output_type": "execute_result"
231 | }
232 | ],
233 | "source": [
234 | "train.head()"
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {},
240 | "source": [
241 | "### 6.2.3 Data Preprocessing"
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": 6,
247 | "metadata": {},
248 | "outputs": [],
249 | "source": [
250 | "# Convert the API name strings to integers\n",
251 | "unique_api = train['api'].unique()"
252 | ]
253 | },
254 | {
255 | "cell_type": "code",
256 | "execution_count": 8,
257 | "metadata": {},
258 | "outputs": [],
259 | "source": [
260 | "api2index = {item:(i+1) for i,item in enumerate(unique_api)}\n",
261 | "index2api = {(i+1):item for i,item in enumerate(unique_api)}"
262 | ]
263 | },
264 | {
265 | "cell_type": "code",
266 | "execution_count": 9,
267 | "metadata": {},
268 | "outputs": [],
269 | "source": [
270 | "train['api_idx'] = train['api'].map(api2index)\n",
271 | "test['api_idx'] = test['api'].map(api2index)"
272 | ]
273 | },
274 | {
275 | "cell_type": "code",
276 | "execution_count": 10,
277 | "metadata": {},
278 | "outputs": [],
279 | "source": [
280 | "# Get the sequence of API indices for each file\n",
281 | "def get_sequence(df,period_idx):\n",
282 | " seq_list = []\n",
283 | " for _id,begin in enumerate(period_idx[:-1]):\n",
284 | " seq_list.append(df.iloc[begin:period_idx[_id+1]]['api_idx'].values)\n",
285 | " seq_list.append(df.iloc[period_idx[-1]:]['api_idx'].values)\n",
286 | " return seq_list"
287 | ]
288 | },
289 | {
290 | "cell_type": "code",
291 | "execution_count": 11,
292 | "metadata": {},
293 | "outputs": [],
294 | "source": [
295 | "train_period_idx = train.file_id.drop_duplicates(keep='first').index.values\n",
296 | "test_period_idx = test.file_id.drop_duplicates(keep='first').index.values"
297 | ]
298 | },
299 | {
300 | "cell_type": "code",
301 | "execution_count": 13,
302 | "metadata": {},
303 | "outputs": [],
304 | "source": [
305 | "train_df = train[['file_id','label']].drop_duplicates(keep='first')\n",
306 | "test_df = test[['file_id']].drop_duplicates(keep='first')"
307 | ]
308 | },
309 | {
310 | "cell_type": "code",
311 | "execution_count": 14,
312 | "metadata": {},
313 | "outputs": [],
314 | "source": [
315 | "train_df['seq'] = get_sequence(train,train_period_idx)\n",
316 | "test_df['seq'] = get_sequence(test,test_period_idx)"
317 | ]
318 | },
319 | {
320 | "cell_type": "markdown",
321 | "metadata": {},
322 | "source": [
323 | "### 6.2.4 TextCNN Network Architecture"
324 | ]
325 | },
326 | {
327 | "cell_type": "code",
328 | "execution_count": 16,
329 | "metadata": {},
330 | "outputs": [
331 | {
332 | "name": "stderr",
333 | "output_type": "stream",
334 | "text": [
335 | "Using TensorFlow backend.\n"
336 | ]
337 | }
338 | ],
339 | "source": [
340 | "from keras.preprocessing.text import Tokenizer\n",
341 | "from keras.preprocessing.sequence import pad_sequences\n",
342 | "from keras.layers import Dense, Input, LSTM, Lambda, Embedding, Dropout, Activation,GRU,Bidirectional\n",
343 | "from keras.layers import Conv1D,Conv2D,MaxPooling2D,GlobalAveragePooling1D,GlobalMaxPooling1D, MaxPooling1D, Flatten\n",
344 | "from keras.layers import CuDNNGRU, CuDNNLSTM, SpatialDropout1D\n",
345 | "from keras.layers.merge import concatenate, Concatenate, Average, Dot, Maximum, Multiply, Subtract, average\n",
346 | "from keras.models import Model\n",
347 | "from keras.optimizers import RMSprop,Adam\n",
348 | "from keras.layers.normalization import BatchNormalization\n",
349 | "from keras.callbacks import EarlyStopping, ModelCheckpoint\n",
350 | "from keras.optimizers import SGD\n",
351 | "from keras import backend as K\n",
352 | "from sklearn.decomposition import TruncatedSVD, NMF, LatentDirichletAllocation\n",
353 | "from keras.layers import SpatialDropout1D\n",
354 | "from keras.layers.wrappers import Bidirectional"
355 | ]
356 | },
357 | {
358 | "cell_type": "code",
359 | "execution_count": 17,
360 | "metadata": {},
361 | "outputs": [],
362 | "source": [
363 | "def TextCNN(max_len,max_cnt,embed_size, num_filters,kernel_size,conv_action, mask_zero):\n",
364 | " _input = Input(shape=(max_len,), dtype='int32')\n",
365 | " _embed = Embedding(max_cnt, embed_size, input_length=max_len, mask_zero=mask_zero)(_input)\n",
366 | " _embed = SpatialDropout1D(0.15)(_embed)\n",
367 | " warppers = []\n",
368 | " \n",
369 | " for _kernel_size in kernel_size:\n",
370 | " conv1d = Conv1D(filters=num_filters, kernel_size=_kernel_size, activation=conv_action)(_embed)\n",
371 | " warppers.append(GlobalMaxPooling1D()(conv1d))\n",
372 | " \n",
373 | " fc = concatenate(warppers)\n",
374 | " fc = Dropout(0.5)(fc)\n",
375 | " #fc = BatchNormalization()(fc)\n",
376 | " fc = Dense(256, activation='relu')(fc)\n",
377 | " fc = Dropout(0.25)(fc)\n",
378 | " #fc = BatchNormalization()(fc) \n",
379 | " preds = Dense(8, activation = 'softmax')(fc)\n",
380 | " \n",
381 | " model = Model(inputs=_input, outputs=preds)\n",
382 | " \n",
383 | " model.compile(loss='categorical_crossentropy',\n",
384 | " optimizer='adam',\n",
385 | " metrics=['accuracy'])\n",
386 | " return model"
387 | ]
388 | },
389 | {
390 | "cell_type": "code",
391 | "execution_count": 18,
392 | "metadata": {},
393 | "outputs": [],
394 | "source": [
395 | "train_labels = pd.get_dummies(train_df.label).values\n",
396 | "train_seq = pad_sequences(train_df.seq.values, maxlen = 6000)\n",
397 | "test_seq = pad_sequences(test_df.seq.values, maxlen = 6000)"
398 | ]
399 | },
400 | {
401 | "cell_type": "markdown",
402 | "metadata": {},
403 | "source": [
404 | "### 6.2.5 TextCNN Training and Prediction"
405 | ]
406 | },
407 | {
408 | "cell_type": "code",
409 | "execution_count": 19,
410 | "metadata": {},
411 | "outputs": [],
412 | "source": [
413 | "from sklearn.model_selection import StratifiedKFold,KFold \n",
414 | "skf = KFold(n_splits=5, shuffle=True)"
415 | ]
416 | },
417 | {
418 | "cell_type": "code",
419 | "execution_count": 20,
420 | "metadata": {},
421 | "outputs": [],
422 | "source": [
423 | "max_len = 6000\n",
424 | "max_cnt = 295\n",
425 | "embed_size = 256\n",
426 | "num_filters = 64\n",
427 | "kernel_size = [2,4,6,8,10,12,14]\n",
428 | "conv_action = 'relu'\n",
429 | "mask_zero = False\n",
430 | "TRAIN = True"
431 | ]
432 | },
433 | {
434 | "cell_type": "code",
435 | "execution_count": 21,
436 | "metadata": {
437 | "scrolled": true
438 | },
439 | "outputs": [
440 | {
441 | "name": "stdout",
442 | "output_type": "stream",
443 | "text": [
444 | "FOLD: \n",
445 | "2778 11109\n",
446 | "Train on 11109 samples, validate on 2778 samples\n",
447 | "Epoch 1/100\n",
448 | "11109/11109 [==============================] - 75s 7ms/step - loss: 0.8165 - acc: 0.7370 - val_loss: 0.4825 - val_acc: 0.8485\n",
449 | "Epoch 2/100\n",
450 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.4772 - acc: 0.8499 - val_loss: 0.4141 - val_acc: 0.8625\n",
451 | "Epoch 3/100\n",
452 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.4172 - acc: 0.8673 - val_loss: 0.3785 - val_acc: 0.8780\n",
453 | "Epoch 4/100\n",
454 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3768 - acc: 0.8769 - val_loss: 0.3821 - val_acc: 0.8726\n",
455 | "Epoch 5/100\n",
456 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3568 - acc: 0.8831 - val_loss: 0.3932 - val_acc: 0.8783\n",
457 | "Epoch 6/100\n",
458 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3388 - acc: 0.8893 - val_loss: 0.3566 - val_acc: 0.8902\n",
459 | "Epoch 7/100\n",
460 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3179 - acc: 0.8968 - val_loss: 0.3553 - val_acc: 0.8902\n",
461 | "Epoch 8/100\n",
462 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3050 - acc: 0.8991 - val_loss: 0.3590 - val_acc: 0.8870\n",
463 | "Epoch 9/100\n",
464 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2913 - acc: 0.9006 - val_loss: 0.3593 - val_acc: 0.8909\n",
465 | "Epoch 10/100\n",
466 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2812 - acc: 0.9047 - val_loss: 0.3528 - val_acc: 0.8906\n",
467 | "Epoch 11/100\n",
468 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2633 - acc: 0.9054 - val_loss: 0.3608 - val_acc: 0.8823\n",
469 | "Epoch 12/100\n",
470 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2665 - acc: 0.9103 - val_loss: 0.3589 - val_acc: 0.8859\n",
471 | "Epoch 13/100\n",
472 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2495 - acc: 0.9097 - val_loss: 0.3570 - val_acc: 0.8909\n",
473 | "2778/2778 [==============================] - 4s 1ms/step\n",
474 | "12955/12955 [==============================] - 13s 980us/step\n",
475 | "FOLD: \n",
476 | "2778 11109\n",
477 | "Train on 11109 samples, validate on 2778 samples\n",
478 | "Epoch 1/100\n",
479 | "11109/11109 [==============================] - 65s 6ms/step - loss: 0.8297 - acc: 0.7290 - val_loss: 0.4925 - val_acc: 0.8463\n",
480 | "Epoch 2/100\n",
481 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.4808 - acc: 0.8442 - val_loss: 0.4115 - val_acc: 0.8690\n",
482 | "Epoch 3/100\n",
483 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.4149 - acc: 0.8643 - val_loss: 0.4037 - val_acc: 0.8715\n",
484 | "Epoch 4/100\n",
485 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3799 - acc: 0.8774 - val_loss: 0.3798 - val_acc: 0.8841\n",
486 | "Epoch 5/100\n",
487 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3530 - acc: 0.8850 - val_loss: 0.3773 - val_acc: 0.8870\n",
488 | "Epoch 6/100\n",
489 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3291 - acc: 0.8924 - val_loss: 0.3676 - val_acc: 0.8855\n",
490 | "Epoch 7/100\n",
491 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3115 - acc: 0.8959 - val_loss: 0.3773 - val_acc: 0.8888\n",
492 | "Epoch 8/100\n",
493 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3032 - acc: 0.8998 - val_loss: 0.3518 - val_acc: 0.8891\n",
494 | "Epoch 9/100\n",
495 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2892 - acc: 0.9027 - val_loss: 0.3655 - val_acc: 0.8920\n",
496 | "Epoch 10/100\n",
497 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2792 - acc: 0.9056 - val_loss: 0.3615 - val_acc: 0.8906\n",
498 | "Epoch 11/100\n",
499 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2736 - acc: 0.9085 - val_loss: 0.3719 - val_acc: 0.8924\n",
500 | "2778/2778 [==============================] - 3s 1ms/step\n",
501 | "12955/12955 [==============================] - 12s 951us/step\n",
502 | "FOLD: \n",
503 | "2777 11110\n",
504 | "Train on 11110 samples, validate on 2777 samples\n",
505 | "Epoch 1/100\n",
506 | "11110/11110 [==============================] - 67s 6ms/step - loss: 0.8388 - acc: 0.7239 - val_loss: 0.4323 - val_acc: 0.8657\n",
507 | "Epoch 2/100\n",
508 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.4969 - acc: 0.8418 - val_loss: 0.3881 - val_acc: 0.8783\n",
509 | "Epoch 3/100\n",
510 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.4276 - acc: 0.8631 - val_loss: 0.3587 - val_acc: 0.8855\n",
511 | "Epoch 4/100\n",
512 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3882 - acc: 0.8770 - val_loss: 0.3542 - val_acc: 0.8898\n",
513 | "Epoch 5/100\n",
514 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3590 - acc: 0.8875 - val_loss: 0.3640 - val_acc: 0.8920\n",
515 | "Epoch 6/100\n",
516 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3437 - acc: 0.8851 - val_loss: 0.3445 - val_acc: 0.8992\n",
517 | "Epoch 7/100\n",
518 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3235 - acc: 0.8912 - val_loss: 0.3552 - val_acc: 0.8923\n",
519 | "Epoch 8/100\n",
520 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3079 - acc: 0.8986 - val_loss: 0.3491 - val_acc: 0.8927\n",
521 | "Epoch 9/100\n",
522 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2937 - acc: 0.9035 - val_loss: 0.3370 - val_acc: 0.9003\n",
523 | "Epoch 10/100\n",
524 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2897 - acc: 0.9018 - val_loss: 0.3523 - val_acc: 0.8959\n",
525 | "Epoch 11/100\n",
526 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2788 - acc: 0.9066 - val_loss: 0.3519 - val_acc: 0.8967\n",
527 | "Epoch 12/100\n",
528 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2653 - acc: 0.9131 - val_loss: 0.3571 - val_acc: 0.8952\n",
529 | "2777/2777 [==============================] - 3s 1ms/step\n",
530 | "12955/12955 [==============================] - 12s 955us/step\n",
531 | "FOLD: \n",
532 | "2777 11110\n",
533 | "Train on 11110 samples, validate on 2777 samples\n",
534 | "Epoch 1/100\n",
535 | "11110/11110 [==============================] - 66s 6ms/step - loss: 0.8286 - acc: 0.7326 - val_loss: 0.4647 - val_acc: 0.8596\n",
536 | "Epoch 2/100\n",
537 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.4807 - acc: 0.8524 - val_loss: 0.4081 - val_acc: 0.8704\n",
538 | "Epoch 3/100\n",
539 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.4233 - acc: 0.8656 - val_loss: 0.3920 - val_acc: 0.8808\n",
540 | "Epoch 4/100\n",
541 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3771 - acc: 0.8819 - val_loss: 0.3767 - val_acc: 0.8804\n",
542 | "Epoch 5/100\n",
543 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3529 - acc: 0.8861 - val_loss: 0.3990 - val_acc: 0.8725\n",
544 | "Epoch 6/100\n",
545 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3353 - acc: 0.8900 - val_loss: 0.3776 - val_acc: 0.8822\n",
546 | "Epoch 7/100\n",
547 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3234 - acc: 0.8935 - val_loss: 0.3717 - val_acc: 0.8855\n",
548 | "Epoch 8/100\n",
549 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3012 - acc: 0.9010 - val_loss: 0.3758 - val_acc: 0.8848\n",
550 | "Epoch 9/100\n",
551 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2907 - acc: 0.9011 - val_loss: 0.3656 - val_acc: 0.8862\n",
552 | "Epoch 10/100\n",
553 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2852 - acc: 0.9022 - val_loss: 0.3676 - val_acc: 0.8858\n",
554 | "Epoch 11/100\n",
555 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2683 - acc: 0.9085 - val_loss: 0.3630 - val_acc: 0.8862\n",
556 | "Epoch 12/100\n",
557 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2595 - acc: 0.9091 - val_loss: 0.3768 - val_acc: 0.8884\n",
558 | "Epoch 13/100\n",
559 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2533 - acc: 0.9140 - val_loss: 0.3817 - val_acc: 0.8822\n",
560 | "Epoch 14/100\n",
561 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2464 - acc: 0.9155 - val_loss: 0.3757 - val_acc: 0.8862\n",
562 | "2777/2777 [==============================] - 3s 1ms/step\n",
563 | "12955/12955 [==============================] - 12s 949us/step\n",
564 | "FOLD: \n",
565 | "2777 11110\n",
566 | "Train on 11110 samples, validate on 2777 samples\n",
567 | "Epoch 1/100\n",
568 | "11110/11110 [==============================] - 65s 6ms/step - loss: 0.8168 - acc: 0.7315 - val_loss: 0.4718 - val_acc: 0.8567\n",
569 | "Epoch 2/100\n",
570 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.4880 - acc: 0.8459 - val_loss: 0.4047 - val_acc: 0.8711\n",
571 | "Epoch 3/100\n",
572 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.4224 - acc: 0.8674 - val_loss: 0.3871 - val_acc: 0.8732\n",
573 | "Epoch 4/100\n",
574 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3900 - acc: 0.8728 - val_loss: 0.3676 - val_acc: 0.8812\n",
575 | "Epoch 5/100\n",
576 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3581 - acc: 0.8846 - val_loss: 0.3713 - val_acc: 0.8819\n",
577 | "Epoch 6/100\n",
578 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3391 - acc: 0.8890 - val_loss: 0.3542 - val_acc: 0.8905\n",
579 | "Epoch 7/100\n",
580 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3158 - acc: 0.8975 - val_loss: 0.3610 - val_acc: 0.8902\n",
581 | "Epoch 8/100\n",
582 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3074 - acc: 0.8986 - val_loss: 0.3520 - val_acc: 0.8887\n",
583 | "Epoch 9/100\n",
584 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2905 - acc: 0.9026 - val_loss: 0.3588 - val_acc: 0.8941\n",
585 | "Epoch 10/100\n",
586 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2795 - acc: 0.9032 - val_loss: 0.3417 - val_acc: 0.8923\n",
587 | "Epoch 11/100\n",
588 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2747 - acc: 0.9044 - val_loss: 0.3456 - val_acc: 0.8912\n",
589 | "Epoch 12/100\n",
590 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2546 - acc: 0.9131 - val_loss: 0.3517 - val_acc: 0.8902\n",
591 | "Epoch 13/100\n",
592 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2483 - acc: 0.9144 - val_loss: 0.3785 - val_acc: 0.8909\n",
593 | "2777/2777 [==============================] - 3s 1ms/step\n",
594 | "12955/12955 [==============================] - 12s 949us/step\n"
595 | ]
596 | }
597 | ],
598 | "source": [
599 | "import os\n",
600 | "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0,1\"\n",
601 | "meta_train = np.zeros(shape = (len(train_seq),8))\n",
602 | "meta_test = np.zeros(shape = (len(test_seq),8))\n",
603 | "FLAG = True\n",
604 | "i = 0\n",
605 | "for tr_ind,te_ind in skf.split(train_labels):\n",
606 | " i +=1\n",
607 | " print('FOLD: {}'.format(i))\n",
608 | " print(len(te_ind),len(tr_ind)) \n",
609 | " model_name = 'benchmark_textcnn_fold_'+str(i)\n",
610 | " X_train,X_train_label = train_seq[tr_ind],train_labels[tr_ind]\n",
611 | " X_val,X_val_label = train_seq[te_ind],train_labels[te_ind]\n",
612 | " \n",
613 | " model = TextCNN(max_len,max_cnt,embed_size,num_filters,kernel_size,conv_action,mask_zero)\n",
614 | " \n",
615 | " model_save_path = './NN/%s_%s.hdf5'%(model_name,embed_size)\n",
616 | " early_stopping =EarlyStopping(monitor='val_loss', patience=3)\n",
617 | " model_checkpoint = ModelCheckpoint(model_save_path, save_best_only=True, save_weights_only=True)\n",
618 | " if TRAIN and FLAG:\n",
619 | " model.fit(X_train,X_train_label,validation_data=(X_val,X_val_label),epochs=100,batch_size=64,shuffle=True,callbacks=[early_stopping,model_checkpoint] )\n",
620 | " \n",
621 | " model.load_weights(model_save_path)\n",
622 | " pred_val = model.predict(X_val,batch_size=128,verbose=1)\n",
623 | " pred_test = model.predict(test_seq,batch_size=128,verbose=1)\n",
624 | " \n",
625 | " meta_train[te_ind] = pred_val\n",
626 | " meta_test += pred_test\n",
627 | " K.clear_session()\n",
628 | "meta_test /= 5.0 "
629 | ]
630 | },
631 | {
632 | "cell_type": "markdown",
633 | "metadata": {},
634 | "source": [
635 | "### 6.2.6 Submitting the Results"
636 | ]
637 | },
638 | {
639 | "cell_type": "code",
640 | "execution_count": 22,
641 | "metadata": {},
642 | "outputs": [],
643 | "source": [
644 | "test_df['prob0'] = 0\n",
645 | "test_df['prob1'] = 0\n",
646 | "test_df['prob2'] = 0\n",
647 | "test_df['prob3'] = 0\n",
648 | "test_df['prob4'] = 0\n",
649 | "test_df['prob5'] = 0\n",
650 | "test_df['prob6'] = 0\n",
651 | "test_df['prob7'] = 0\n",
652 | "\n",
653 | "test_df[['prob0','prob1','prob2','prob3','prob4','prob5','prob6','prob7']] = meta_test\n",
654 | "test_df[['file_id','prob0','prob1','prob2','prob3','prob4','prob5','prob6','prob7']].to_csv('nn_baseline_5fold.csv',index = None)"
655 | ]
656 | },
657 | {
658 | "cell_type": "code",
659 | "execution_count": null,
660 | "metadata": {},
661 | "outputs": [],
662 | "source": []
663 | }
664 | ],
665 | "metadata": {
666 | "kernelspec": {
667 | "display_name": "Python 3",
668 | "language": "python",
669 | "name": "python3"
670 | },
671 | "language_info": {
672 | "codemirror_mode": {
673 | "name": "ipython",
674 | "version": 3
675 | },
676 | "file_extension": ".py",
677 | "mimetype": "text/x-python",
678 | "name": "python",
679 | "nbconvert_exporter": "python",
680 | "pygments_lexer": "ipython3",
681 | "version": "3.7.1"
682 | },
683 | "toc": {
684 | "nav_menu": {},
685 | "number_sections": true,
686 | "sideBar": true,
687 | "skip_h1_title": false,
688 | "title_cell": "Table of Contents",
689 | "title_sidebar": "Contents",
690 | "toc_cell": true,
691 | "toc_position": {},
692 | "toc_section_display": true,
693 | "toc_window_display": true
694 | }
695 | },
696 | "nbformat": 4,
697 | "nbformat_minor": 2
698 | }
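A note on the sequence preparation in the notebook above: `api2index` numbers the API names from 1, which leaves index 0 free for padding, and `pad_sequences(..., maxlen=6000)` is called with the Keras defaults, which pad and truncate at the front of each sequence. The sketch below mimics that behaviour in plain Python so it can be checked without Keras. `build_vocab` and `pad_pre` are hypothetical helper names used only for illustration; they are not part of the notebook or of Keras.

```python
def build_vocab(tokens):
    """Map each distinct token to an index starting at 1 (0 is left for padding)."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab) + 1
    return vocab


def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad with `value` up to `maxlen`."""
    seq = list(seq)
    seq = seq[max(len(seq) - maxlen, 0):]          # drop items from the front
    return [value] * (maxlen - len(seq)) + seq     # pad at the front


# Toy API-call trace for a single file (names taken from train.head()).
calls = ["LdrLoadDll", "LdrGetProcedureAddress", "LdrGetProcedureAddress"]
vocab = build_vocab(calls)
encoded = [vocab[c] for c in calls]

padded = pad_pre(encoded, 5)      # shorter than maxlen: zeros are added in front
truncated = pad_pre(encoded, 2)   # longer than maxlen: the earliest calls are dropped
print(vocab, padded, truncated)
```

One consequence worth keeping in mind: with front truncation, a file with more than 6000 API calls keeps only its last 6000, so the earliest calls never reach the network.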
699 |
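The training loop in the notebook above follows the usual out-of-fold pattern: each fold's validation predictions fill the matching rows of `meta_train`, and the test predictions are summed over the folds and then divided by the number of folds. Below is a minimal plain-Python sketch of that bookkeeping, assuming a stand-in "model" that simply predicts the mean of its training targets. `kfold_indices` and `fit_mean` are illustrative names, and the split here is unshuffled, unlike the notebook's `KFold(n_splits=5, shuffle=True)`.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) for contiguous, unshuffled folds.

    The first n_samples % n_splits folds receive one extra sample.
    """
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, val_idx
        start = stop


def fit_mean(targets):
    """Stand-in for model.fit: returns a predictor that outputs the training mean."""
    mean = sum(targets) / len(targets)
    return lambda n_rows: [mean] * n_rows


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
n_test = 2
n_splits = 3

oof = [None] * len(y)        # plays the role of meta_train
test_pred = [0.0] * n_test   # plays the role of meta_test

for train_idx, val_idx in kfold_indices(len(y), n_splits):
    predict = fit_mean([y[i] for i in train_idx])
    # Each sample is predicted exactly once, by the model that never saw it.
    for i, p in zip(val_idx, predict(len(val_idx))):
        oof[i] = p
    # Accumulate test predictions; they are averaged after the loop.
    test_pred = [a + b for a, b in zip(test_pred, predict(n_test))]

test_pred = [p / n_splits for p in test_pred]
print(oof, test_pred)
```

Dividing by `n_splits` rather than a hard-coded `5.0` keeps the average correct if the fold count changes. With 13887 samples and 5 folds this size rule gives validation folds of 2778, 2778, 2777, 2777 and 2777 rows, the same sizes that appear in the training log.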
--------------------------------------------------------------------------------
/阿里云安全恶意程序检测/阿里云安全恶意程序检测-数据探索.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Section 2: Data Exploration"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 4,
13 | "metadata": {},
14 | "outputs": [
15 | {
16 | "data": {
17 | "text/html": [
18 | "\n",
43 | "!!The above is a layout tweak the author made for typesetting; check whether you actually need it!!\n"
44 | ],
45 | "text/plain": [
46 | ""
47 | ]
48 | },
49 | "metadata": {},
50 | "output_type": "display_data"
51 | }
52 | ],
53 | "source": [
54 | "%%html\n",
55 | "\n",
80 | "!!The above is a layout tweak the author made for typesetting; check whether you actually need it!!"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {},
86 | "source": [
87 | "## 2.1 Exploring the Training Set"
88 | ]
89 | },
90 | {
91 | "cell_type": "markdown",
92 | "metadata": {},
93 | "source": [
94 | "### 2.1.1 Feature Data Types"
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": 7,
100 | "metadata": {},
101 | "outputs": [],
102 | "source": [
103 | "# Import the required packages\n",
104 | "import pandas as pd\n",
105 | "import numpy as np\n",
106 | "import seaborn as sns\n",
107 | "import matplotlib.pyplot as plt\n",
108 | "\n",
109 | "# Suppress warning messages\n",
110 | "import warnings\n",
111 | "warnings.filterwarnings(\"ignore\")\n",
112 | "\n",
113 | "%matplotlib inline\n",
114 | "\n",
115 | "# Load the data\n",
116 | "path = './dataset/'\n",
117 | "train = pd.read_csv(path + 'security_train.csv') # training set\n",
118 | "test = pd.read_csv(path + 'security_test.csv') # test set"
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "execution_count": 8,
124 | "metadata": {},
125 | "outputs": [
126 | {
127 | "data": {
128 | "text/html": [
129 | "<div>\n",
130 | "\n",
143 | "<table border=\"1\" class=\"dataframe\">\n",
144 | "  <thead>\n",
145 | "    <tr style=\"text-align: right;\">\n",
146 | "      <th></th>\n",
147 | "      <th>file_id</th>\n",
148 | "      <th>label</th>\n",
149 | "      <th>api</th>\n",
150 | "      <th>tid</th>\n",
151 | "      <th>index</th>\n",
152 | "    </tr>\n",
153 | "  </thead>\n",
154 | "  <tbody>\n",
155 | "    <tr>\n",
156 | "      <th>0</th>\n",
157 | "      <td>1</td>\n",
158 | "      <td>5</td>\n",
159 | "      <td>LdrLoadDll</td>\n",
160 | "      <td>2488</td>\n",
161 | "      <td>0.0</td>\n",
162 | "    </tr>\n",
163 | "    <tr>\n",
164 | "      <th>1</th>\n",
165 | "      <td>1</td>\n",
166 | "      <td>5</td>\n",
167 | "      <td>LdrGetProcedureAddress</td>\n",
168 | "      <td>2488</td>\n",
169 | "      <td>1.0</td>\n",
170 | "    </tr>\n",
171 | "    <tr>\n",
172 | "      <th>2</th>\n",
173 | "      <td>1</td>\n",
174 | "      <td>5</td>\n",
175 | "      <td>LdrGetProcedureAddress</td>\n",
176 | "      <td>2488</td>\n",
177 | "      <td>2.0</td>\n",
178 | "    </tr>\n",
179 | "    <tr>\n",
180 | "      <th>3</th>\n",
181 | "      <td>1</td>\n",
182 | "      <td>5</td>\n",
183 | "      <td>LdrGetProcedureAddress</td>\n",
184 | "      <td>2488</td>\n",
185 | "      <td>3.0</td>\n",
186 | "    </tr>\n",
187 | "    <tr>\n",
188 | "      <th>4</th>\n",
189 | "      <td>1</td>\n",
190 | "      <td>5</td>\n",
191 | "      <td>LdrGetProcedureAddress</td>\n",
192 | "      <td>2488</td>\n",
193 | "      <td>4.0</td>\n",
194 | "    </tr>\n",
195 | "  </tbody>\n",
196 | "</table>\n",
197 | "</div>"
198 | ],
199 | "text/plain": [
200 | " file_id label api tid index\n",
201 | "0 1 5 LdrLoadDll 2488 0.0\n",
202 | "1 1 5 LdrGetProcedureAddress 2488 1.0\n",
203 | "2 1 5 LdrGetProcedureAddress 2488 2.0\n",
204 | "3 1 5 LdrGetProcedureAddress 2488 3.0\n",
205 | "4 1 5 LdrGetProcedureAddress 2488 4.0"
206 | ]
207 | },
208 | "execution_count": 8,
209 | "metadata": {},
210 | "output_type": "execute_result"
211 | }
212 | ],
213 | "source": [
214 | "train.head()"
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "execution_count": 9,
220 | "metadata": {},
221 | "outputs": [
222 | {
223 | "data": {
224 | "text/html": [
225 | "<div>\n",
226 | "\n",
239 | "<table border=\"1\" class=\"dataframe\">\n",
240 | "  <thead>\n",
241 | "    <tr style=\"text-align: right;\">\n",
242 | "      <th></th>\n",
243 | "      <th>file_id</th>\n",
244 | "      <th>label</th>\n",
245 | "      <th>tid</th>\n",
246 | "      <th>index</th>\n",
247 | "    </tr>\n",
248 | "  </thead>\n",
249 | "  <tbody>\n",
250 | "    <tr>\n",
251 | "      <th>count</th>\n",
252 | "      <td>35952.000000</td>\n",
253 | "      <td>35952.000000</td>\n",
254 | "      <td>35952.000000</td>\n",
255 | "      <td>35951.000000</td>\n",
256 | "    </tr>\n",
257 | "    <tr>\n",
258 | "      <th>mean</th>\n",
259 | "      <td>5.142051</td>\n",
260 | "      <td>0.989152</td>\n",
261 | "      <td>2494.964564</td>\n",
262 | "      <td>2153.216267</td>\n",
263 | "    </tr>\n",
264 | "    <tr>\n",
265 | "      <th>std</th>\n",
266 | "      <td>2.547382</td>\n",
267 | "      <td>1.957361</td>\n",
268 | "      <td>129.979938</td>\n",
269 | "      <td>1537.349809</td>\n",
270 | "    </tr>\n",
271 | "    <tr>\n",
272 | "      <th>min</th>\n",
273 | "      <td>1.000000</td>\n",
274 | "      <td>0.000000</td>\n",
275 | "      <td>282.000000</td>\n",
276 | "      <td>0.000000</td>\n",
277 | "    </tr>\n",
278 | "    <tr>\n",
279 | "      <th>25%</th>\n",
280 | "      <td>4.000000</td>\n",
281 | "      <td>0.000000</td>\n",
282 | "      <td>2456.000000</td>\n",
283 | "      <td>722.000000</td>\n",
284 | "    </tr>\n",
285 | "    <tr>\n",
286 | "      <th>50%</th>\n",
287 | "      <td>5.000000</td>\n",
288 | "      <td>0.000000</td>\n",
289 | "      <td>2500.000000</td>\n",
290 | "      <td>2004.000000</td>\n",
291 | "    </tr>\n",
292 | "    <tr>\n",
293 | "      <th>75%</th>\n",
294 | "      <td>7.000000</td>\n",
295 | "      <td>0.000000</td>\n",
296 | "      <td>2596.000000</td>\n",
297 | "      <td>3502.000000</td>\n",
298 | "    </tr>\n",
299 | "    <tr>\n",
300 | "      <th>max</th>\n",
301 | "      <td>9.000000</td>\n",
302 | "      <td>5.000000</td>\n",
303 | "      <td>2980.000000</td>\n",
304 | "      <td>5000.000000</td>\n",
305 | "    </tr>\n",
306 | "  </tbody>\n",
307 | "</table>\n",
308 | "</div>"
309 | ],
310 | "text/plain": [
311 | " file_id label tid index\n",
312 | "count 35952.000000 35952.000000 35952.000000 35951.000000\n",
313 | "mean 5.142051 0.989152 2494.964564 2153.216267\n",
314 | "std 2.547382 1.957361 129.979938 1537.349809\n",
315 | "min 1.000000 0.000000 282.000000 0.000000\n",
316 | "25% 4.000000 0.000000 2456.000000 722.000000\n",
317 | "50% 5.000000 0.000000 2500.000000 2004.000000\n",
318 | "75% 7.000000 0.000000 2596.000000 3502.000000\n",
319 | "max 9.000000 5.000000 2980.000000 5000.000000"
320 | ]
321 | },
322 | "execution_count": 9,
323 | "metadata": {},
324 | "output_type": "execute_result"
325 | }
326 | ],
327 | "source": [
328 | "train.describe()"
329 | ]
330 | },
331 | {
332 | "cell_type": "markdown",
333 | "metadata": {},
334 | "source": [
335 | "### 2.1.2 Data Distribution"
336 | ]
337 | },
338 | {
339 | "cell_type": "code",
340 | "execution_count": 10,
341 | "metadata": {},
342 | "outputs": [
343 | {
344 | "data": {
345 | "text/plain": [
346 | ""
347 | ]
348 | },
349 | "execution_count": 10,
350 | "metadata": {},
351 | "output_type": "execute_result"
352 | },
353 | {
354 | "data": {
355 | "image/png": "[base64-encoded PNG omitted: box plot of tid for the first 10,000 training rows]",
356 | "text/plain": [
357 | ""
358 | ]
359 | },
360 | "metadata": {
361 | "needs_background": "light"
362 | },
363 | "output_type": "display_data"
364 | }
365 | ],
366 | "source": [
367 | "sns.boxplot(x=train.iloc[:10000][\"tid\"])"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": 11,
373 | "metadata": {},
374 | "outputs": [
375 | {
376 | "data": {
377 | "text/plain": [
378 | "file_id 9\n",
379 | "label 3\n",
380 | "api 166\n",
381 | "tid 56\n",
382 | "index 5001\n",
383 | "dtype: int64"
384 | ]
385 | },
386 | "execution_count": 11,
387 | "metadata": {},
388 | "output_type": "execute_result"
389 | }
390 | ],
391 | "source": [
392 | "train.nunique()"
393 | ]
394 | },
395 | {
396 | "cell_type": "markdown",
397 | "metadata": {},
398 | "source": [
399 | "### 2.1.3 Missing Values"
400 | ]
401 | },
402 | {
403 | "cell_type": "code",
404 | "execution_count": 12,
405 | "metadata": {},
406 | "outputs": [
407 | {
408 | "data": {
409 | "text/plain": [
410 | "count 35951.000000\n",
411 | "mean 2153.216267\n",
412 | "std 1537.349809\n",
413 | "min 0.000000\n",
414 | "25% 722.000000\n",
415 | "50% 2004.000000\n",
416 | "75% 3502.000000\n",
417 | "max 5000.000000\n",
418 | "Name: index, dtype: float64"
419 | ]
420 | },
421 | "execution_count": 12,
422 | "metadata": {},
423 | "output_type": "execute_result"
424 | }
425 | ],
426 | "source": [
427 | "train['index'].describe()"
428 | ]
429 | },
430 | {
431 | "cell_type": "markdown",
432 | "metadata": {},
433 | "source": [
434 | "### 2.1.4 Outliers"
435 | ]
436 | },
437 | {
438 | "cell_type": "code",
439 | "execution_count": 13,
440 | "metadata": {
441 | "scrolled": true
442 | },
443 | "outputs": [
444 | {
445 | "data": {
446 | "text/plain": [
447 | "count 35951.000000\n",
448 | "mean 2153.216267\n",
449 | "std 1537.349809\n",
450 | "min 0.000000\n",
451 | "25% 722.000000\n",
452 | "50% 2004.000000\n",
453 | "75% 3502.000000\n",
454 | "max 5000.000000\n",
455 | "Name: index, dtype: float64"
456 | ]
457 | },
458 | "execution_count": 13,
459 | "metadata": {},
460 | "output_type": "execute_result"
461 | }
462 | ],
463 | "source": [
464 | "train['index'].describe()"
465 | ]
466 | },
467 | {
468 | "cell_type": "code",
469 | "execution_count": 14,
470 | "metadata": {},
471 | "outputs": [
472 | {
473 | "data": {
474 | "text/plain": [
475 | "count 35952.000000\n",
476 | "mean 2494.964564\n",
477 | "std 129.979938\n",
478 | "min 282.000000\n",
479 | "25% 2456.000000\n",
480 | "50% 2500.000000\n",
481 | "75% 2596.000000\n",
482 | "max 2980.000000\n",
483 | "Name: tid, dtype: float64"
484 | ]
485 | },
486 | "execution_count": 14,
487 | "metadata": {},
488 | "output_type": "execute_result"
489 | }
490 | ],
491 | "source": [
492 | "train['tid'].describe()"
493 | ]
494 | },
495 | {
496 | "cell_type": "markdown",
497 | "metadata": {},
498 | "source": [
499 | "### 2.1.5 Label Distribution"
500 | ]
501 | },
502 | {
503 | "cell_type": "code",
504 | "execution_count": 15,
505 | "metadata": {},
506 | "outputs": [
507 | {
508 | "data": {
509 | "text/plain": [
510 | "0 28350\n",
511 | "5 6786\n",
512 | "2 816\n",
513 | "Name: label, dtype: int64"
514 | ]
515 | },
516 | "execution_count": 15,
517 | "metadata": {},
518 | "output_type": "execute_result"
519 | }
520 | ],
521 | "source": [
522 | "train['label'].value_counts()"
523 | ]
524 | },
525 | {
526 | "cell_type": "code",
527 | "execution_count": 16,
528 | "metadata": {},
529 | "outputs": [
530 | {
531 | "data": {
532 | "text/plain": [
533 | ""
534 | ]
535 | },
536 | "execution_count": 16,
537 | "metadata": {},
538 | "output_type": "execute_result"
539 | },
540 | {
541 | "data": {
542 | "image/png": "[base64-encoded PNG omitted: bar chart of training label counts, sorted by label]",
543 | "text/plain": [
544 | ""
545 | ]
546 | },
547 | "metadata": {
548 | "needs_background": "light"
549 | },
550 | "output_type": "display_data"
551 | }
552 | ],
553 | "source": [
554 | "plt.figure(figsize=(12,4),dpi=150)\n",
555 | "train['label'].value_counts().sort_index().plot(kind = 'bar')"
556 | ]
557 | },
558 | {
559 | "cell_type": "code",
560 | "execution_count": 17,
561 | "metadata": {},
562 | "outputs": [
563 | {
564 | "data": {
565 | "text/plain": [
566 | ""
567 | ]
568 | },
569 | "execution_count": 17,
570 | "metadata": {},
571 | "output_type": "execute_result"
572 | },
573 | {
574 | "data": {
575 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAf8AAAHjCAYAAAAt5RZeAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAABcSAAAXEgFnn9JSAAA0s0lEQVR4nO3deZhcVYH38d/pTtIJIWSBACGQXAhZQAgJARISFHx1FKgRBHFFLARlwNFR2byvyrzozDuW+row7uPo6OiMOqiE5SKLYRUUZEm4IBACFLsgSXens3S6u+q8f9yKNKGT9HJvnbt8P8/TTyddVbd+5NH+1bn33HOMtVYAAKA4WlwHAAAAzUX5AwBQMJQ/AAAFQ/kDAFAwlD8AAAVD+QMAUDCUPwAABUP5AwBQMJQ/AAAFQ/kDAFAwlD8AAAVD+QMAUDCUPwAABUP5AwBQMJQ/AAAFQ/kDAFAwlD8AAAVD+QMAUDCUPwAABUP5AwBQMJQ/AAAFQ/kDAFAwlD8AAAVD+QMAUDCUPwAABUP5AwBQMJQ/AAAFQ/kDAFAwlD8AAAVD+QMAUDCUPwAABUP5AwBQMJQ/AAAFQ/kDAFAwlD8AAAVD+QMAUDCUPwAABUP5AwBQMJQ/AAAFQ/kDAFAwlD8AAAVD+QMAUDCUPwAABUP5AwBQMJQ/AAAFQ/kDAFAwlD8AAAVD+QMAUDCUPwAABUP5AwBQMJQ/AAAFQ/kDAFAwlD8AAAVD+QMAUDCUPwAABUP5AwBQMJQ/AAAFQ/kDAFAwlD8AAAUzynUAAM3n+cEukiZK2m0HX/0fnyBptKSapHrje/8/90javJ2vdZKe3/pVrZQ2NOO/EcD2GWut6wwAYuT5gZE0TdIB23zNkrS/pKly+8F/g6IPAi/olQ8F2/75uWqltNFZQiDnKH8ggzw/GCNpjl5b8AdI8iSNcxYuPi9IeqDfVyjp4Wql1OM0FZADlD+Qco2R/GxJR0la3Pi+QNIYh7Fc6ZX0qF79oeCBaqX0nNNUQMZQ/kDKeH4wVa+U/GJJR0qa7DRU+q1VdGZglaTbJN1crZTa3UYC0ovyBxzy/GC0onJf3O/Lc5kpJ+qSVkpa0fi6vVopbXKaCEgRyh9oMs8PZko6vvH1vxTNpkeyeiTdpeiDwE2S/lCtlHrdRgLcofyBhDUm5x0r6URFhT/PbSJI2ijpdkUfBFZIWlmtlOpuIwHNQ/kDCfD8YKKisj9Z0glidJ92L0r6taRfSrq1WinVHOcBEkX5AzHx/GAfSacqKvxjFS2Kg+x5SdIVij4I3MwHAeQR5Q+MQOOU/kmSzpL0FkmtbhMhZi8r+hDw02qldIfrMEBcKH9gGDw/mK+o8E+XtIfjOGiOJyT9t6IPAo+6DgOMBOUPDJLnB5MlvU9R6R/uOA7cukfSjyX9uFopdbkOAwwV5Q/sgOcHLZLeLOmDkt4uaazTQEibLkn/Ielfq5XS467DAINF+QMD8PxghqQPSSpLmuE4DtKvLimQdFm1UlrhOgywM5Q/0I/nB6+T5Et6j9jyGsPzoKR/VTQ3YLPrMMBAKH9AkucHiyV9WtLbJBnHcZAPayX9m6RvsfEQ0obyR6F5fvA3kv63pDe6zoLc6pP0K0WXBH7vOgwgUf4ooMYkvlMUnd4/wnEcFMudkj5drZRudR0ExUb5ozAaO+i9X9LFYn19uHW9og8B97kOgmKi/JF7nh+MlXSOpAsl7ec4DrCVlXS5pEuqldJq12FQLJQ/cs3zg3dJ+qIkz3EUYHv6FK0V8DkmBqJZKH/kkucHiyR9XdIxjqMAg9Ut6ZuSvlCtlNa5DoN8o/yRK54fTJP0L4oW5+GWPWRRp6T/J+lr1Uppo+swyCfKH7nQuK5/vqLb9nZ1HAeIw4uS/lnSd9hWGHGj/JF5nh+8U9KXxHV95NP9kj5crZTudR0E+UH5I7M8Pzhc0XX91zuOAiStpmjJ4Eu4FIA4UP7IHM8P9pRUUXRd
v8VxHKCZnpJ0XrVS+o3rIMg2yh+Z4vnB6ZIuk7S76yyAQ7+Q9PFqpfSi6yDIJsofmdCYxf9dSSe5zgKkRLui1Sp/UK2U+EWOIaH8kXqeH5QlfU3SZNdZgBS6VdLfVSulR10HQXZQ/kitxmj/3yWd6DoLkHJbJP1fSZVqpdTrOgzSj/JHKnl+cJqk70ma4joLkCEPSnpvtVJ60HUQpBvlj1Tx/GA3RUucnuE6C5BR3ZIuqFZK33YdBOlF+SM1PD94g6T/lDTTdRYgB5ZLOpt9AjAQyh/OeX4wStEypheJ+/aBOD0j6f3VSuk210GQLpQ/nPL8YKqk/5F0nOMoQF7VJH1e0j9XK6W66zBIB8ofzjS23f21pBmuswAFcL2k06uV0lrXQeAep1jhhOcHZ0j6nSh+oFneKul+zw8Wuw4C9xj5o6ka1/e/KuljrrMABdUj6cJqpfQN10HgDuWPpmlc379c0rGuswDQzyWdVa2UNrsOguaj/NEUjev7V0jaz3UWAH91l6S3VSulv7gOgubimj8S11ib/3ei+IG0WSzp954fzHYdBM3FyB+J4fo+kBlrJZ1UrZTudB0EzUH5IxGeH0xUtMLYcW6TABikbkULAv3KdRAkj9P+iJ3nB3tKulkUP5AlYyX9j+cHn3QdBMlj5I9YeX4wQ9KNkua4zgJg2C6TdD4rAuYX5Y/YeH4wV1HxM7EPyL4rFK0IyK2AOUT5IxaeHyxUtHzoVNdZAMTmD4puBXzZdRDEi2v+KWaMGWeM+bwxZrUxptsY87wx5ofGmOmus/Xn+cExiq7xU/xAvixRdCvgga6DIF6M/FPKGDNWUaEukfSCpNsleZKOkvQXSUustU84C9jg+cHxkn4laRfXWQAk5nlJx1YrpTWugyAejPzT67NqfOqWNMda+25r7WJJFygaYf/QZThJ8vzgXZKuEsUP5N0+km7y/GB/10EQD0b+KWSMGSPpJUkTJR1urb1/m8dXSZov6Qhr7b0OIsrzgw9J+p74AAkUyVOS3lCtlJ52HQQjwy/udFqmqPgf37b4G37Z+P625kV6hecHF0r6vvjfD1A0MyXd7PlBquYdYej45Z1OhzW+37edx7f+fH4TsrxKYwGQLzf7fQGkxgGKLgFMcx0Ew0f5p9OMxvdnt/P41p/PbEKWv/L84AxJX2nmewJIpTmSVjRW80QGUf7ptGvj+6btPL6x8X1CE7JIkjw/KCmaZGia9Z4AUu0gRR8A9nAdBENH+WOnPD9YJulySaNcZwGQKodI+q3nB1NcB8HQUP7ptKHxfXu30I1vfO9KOojnB4dIulrSuKTfC0AmHSbpBs8PJrkOgsGj/NNp6200+27n8a0/fyrJEJ4feIqW7J2c5PsAyLxFkq73/GA310EwOJR/Oq1qfD98O49v/fkDSQVoTOS5QdHiHgCwM0dJ+oXnB62ug2DnKP90ukNSp6RZxpgFAzx+WuP71Um8uecHEyT9RtLsJI4PILeOl/RV1yGwc5R/CllreyR9s/HXbxljtl7jlzHmfEX399+axOp+nh+0SbpS2z/rAAA78g+eH5zrOgR2jOV9U6qxsc8tkhbrlY19Zjb+nsjGPp4ftCia1X9qnMcFUDh9ko6vVkorXAfBwBj5p5S1tlvSGyX9k6L7/d+uqPx/pGi9/yR29LtMFD+AkRsl6XLPD7h0mFKM/CFJ8vzgbEn/7joHgFxZLWlJtVJqdx0Er0b5Q54fLFF0iaHNcRQA+bNC0SWAPtdB8ApO+xdcY3OOX4niB5CMN0n6husQeDXKv8A8PxijqPi5lx9Aks71/OBjrkPgFZR/sX1T0tGuQwAohK95fvBW1yEQ4Zp/QXl+8EFFu/QBQLN0SlpcrZQedR2k6Cj/AvL8YL6kP4jNegA03ypFdwB0uw5SZJz2L5jG0r2Xi+IH4MZhkr7iOkTRUf7F8wNJc1yHAFBoH/H84BTXIYqM0/4F0pht+6+ucwCApHZJC6qV0tM7fSZiR/kXhOcH
CxVd5x/jOgsANPxO0nHVSqnmOkjRcNq/ADw/GK1oTwCKH0CaHCPpEtchiojyL4ZLFG0DDABp8xnPD45yHaJoOO2fc43T/Xcr2mULANJotaSF1Uppk+sgRcHIP8cap/v/QxQ/gHSbI+nLrkMUCeWfb59RdE8tAKTdRzw/ON51iKLgtH9OeX6wQNHp/tGOowDAYL0g6ZBqpbTOdZC8Y+SfQ/1m91P8ALJkmqSvug5RBJR/Pn1anO4HkE0f8PzgGNch8o7T/jnj+cFhkv4oRv0AsusBSYez+E9yGPnniOcHo8TpfgDZN1/SR12HyDPKP18+LWmB6xAAEIPPeX6wt+sQeUX554TnBwcourUPAPJgorj3PzGUf358UazdDyBf3u/5wetdh8gjJvzlgOcHyxTtjgUAeRMqmvzX5zpInjDyzzjPD4ykr7jOAQAJOVTSx1yHyBvKP/veLWmx6xAAkKDPeX4wzXWIPKH8M8zzgzZJX3CdAwASNkGc4YwV5Z9tH5fkuQ4BAE3wXs8PjnMdIi+Y8JdRnh/sIWmNotthAKAI7pe0qFopUVwjxMg/uy4VxQ+gWBZKOtl1iDxg5J9Bnh/MU3T7yyjXWQCgyVYquvWP8hoBRv7Z9CVR/ACKaYEY/Y8YI/+M8fzgjZJucp0DABxaKUb/I8LIP3sqrgMAgGMLJL3dcYZMY+SfIYz6AeCvVklayOh/eBj5Z8vFrgMAQEocJukU1yGyipF/Rnh+cKikB1znAIAUYfQ/TIz8s4NRPwC8GqP/YWLknwGeH8yQ9Li4vQ8AtvWApAWM/oeGkX82nC+KHwAGMl/Sqa5DZA3ln3KeH0yR9CHXOQAgxf7RdYCsofzT7yOSxrsOAQApNt/zg2Ndh8gSyj/FPD8YK+ljrnMAQAac6zpAllD+6XampD1dhwCADDjV8wN+Xw4S5Z9Snh+0SLrAdQ4AyIgxks5yHSIrKP/0OlXSga5DAECGnNMYOGEn+EdKr4+6DgAAGbO/pLe6DpEFlH8KeX5woCRmrgLA0DHxbxAo/3T6oOsAAJBRJc8P9nMdIu0o/5Tx/KBV0Sx/AMDQtUr6sOsQaUf5p8/xkvZxHQIAMuxszw9YEn0HKP/04VYVABiZfSSd5DpEmlH+KeL5wVRJb3OdAwBy4DzXAdKM8k+Xd0sa7ToEAOTAmzw/mOU6RFpR/unyPtcBACAnjKT3uA6RVpR/Snh+cICko13nAIAceafrAGlF+acHo34AiNdhnh/McR0ijSj/9DjddQAAyCFG/wOg/FPA84PDJc1znQMAcuhdrgOkEeWfDnwyBYBkzOfU/2tR/ulwousAAJBjp7kOkDaUv2OeH+wjab7rHACQYye7DpA2lL97x7sOAAA5d6TnB3u7DpEmlL97J7gOAAA5Z8TS6a9C+TvU2HXqb1znAIAC4NR/P5S/W0dLmug6BAAUwJs8P9jFdYi0oPzd4pQ/ADTHWElvcR0iLSh/t5jsBwDNQ/k3UP6ONGaeLnCdAwAK5BjXAdKC8nfneEUzUAEAzXGI5weTXIdIA8rfHa73A0BzGUnLXIdIA8rfAc8PWsUtfgDgwutdB0gDyt+NoyRNdh0CAAqI6/6i/F1Z6joAABTUkZ4fjHUdwjXK341FrgMAQEGNkXSk6xCuUf5uHOE6AAAUWOGv+1P+Teb5wURJB7rOAQAFVvjr/pR/8y0S9/cDgEtLPT8odP8V+j/eEU75A4BbEyUd6jqES5R/8zHZDwDcK/R1f8q/+Rj5A4B7hV7pj/JvIs8PJks6wHUOAIAOcR3AJcq/uRj1A0A6zC7ypL/C/oc7QvkDQDq0SdrfdQhXKP/movwBID3muQ7gCuXfXMz0B4D0oPyRrMZkv5mucwAA/oryR+IKe20JAFKK8kfiPNcBAACvQvkjcZzyB4B02cPzg91dh3CB8m8ez3UAAMBrFHL0T/k3DyN/AEgfyh+J8lwHAAC8BuWPRDHyB4D0ofyRDM8PJkqa5DoHAOA1ZrsO4MKo
kR7AGPPDEbzcWmvPHmmGDGDUDwDptKfrAC4Ya+3IDmBMfQQvt9ba1hEFyADPD06SdKXrHACA17CSRlcrpZrrIM004pG/pDfGcIy881wHAAAMyEjaXdJLroM004jL31p7axxBco7T/gCQXnuoYOXPhL/m8FwHAABs1x6uAzRbHKf9B2SMGSWpJOkoRf+wd1lrf9h4bJ/Gz/5kre1LKkOKTHcdAACwXZR/HIwxx0j6qaT9FF1PsZJGS9p6Z8DRkv5H0jsl/TqJDCkzyXUAAMB2Fa78Yz/tb4w5WNJ1kqZJ+oakdyn6ANDf1ZI2SXpH3O+fUhNdBwAAbFfhyj+Jkf8lksZKOtFae4MkGfPq7rfW9hhj7pO0MIH3TyPKHwDSq3Dln8SEvzdKuntr8e/Ac5L2SeD9U8Xzg9GSxrnOAQDYrsJt65tE+U+S9Mwgnjde0TyAvGPUDwDpxsg/Bi9JOnAQzztIg/uQkHW7uQ4AANghyj8GN0laYIzZ7sp/xphTFH1AuDGB908byh8A0o3yj0FFUo+k5caY84wxe299wBgz2RhzlqQfSNoo6asJvH/acL0fANJtsusAzRZ7+VtrH5H03saxv6loYp+VVJb0sqTvS2qTdLq19sm43z+FxroOAADYoTGuAzRbIsv7WmuXSzpE0X3+j0jqVnQ24AlJ35M031p7VRLvnUKUPwCkW+53l91WYsv7WmufkvSJpI6fIZz2B4B0S6wL04qNfZLHyB8A0q1wI//Eyt8Y02aMeZ8x5jvGmCsbX98xxpxujClSIRbpvxUAssh4frDtMvS5ltTGPm+W9CNF6/tv+w96jqQvGWPOtNYW4Va/wp1OAoAMGiWp13WIZom9mIwxiyVdo2j25F2Sfiap2nh4pqI7AZZIutoYc6y19q64M6RMj+sAQJxaTXfX6H1+da9e+8EeyKz6lj1bol3oiyGJUek/KVq29zxr7fcGePwbxphzJH1X0uclvTWBDGnS7ToAEKeabdt1yvgHJ3e32sNcZwFiVHcdoJmSuOa/WNI92yl+SZK19t8k/VHRGYC82+w6ABAvYxZ2TC7C0twolprrAM2URPnXJa0ZxPPWKFr8J+8Y+SN3Dm2fPt5Yu851DiAmNiyHjPxH6G5J8wfxvPmN5+Yd5Y/c+W3f4umLu7tD1zmAmBRq1C8lU/6XSJptjPmcMeY1xzeRz0ma3Xhu3nHaH7mzys6a/Ym166fI2iKcvUP+Fa78RzzhzxjzgQF+/GNJn5V0hjHmV5Keavx8pqRTJXmK1vifq+iOgDxj5I8cMmZiz4TNk+v1le2trQtdpwFGiPIfhh9p4Gv3RlHJX9Dv8f63Bp0j6cOS/jOGDGlG+SOXbqot3HxOx+9rX9x9iusowEgV7vd0HOX/eRVj4t5wcdofubS8tmyvX6y/cdaXpkz+izVmqus8wAisdR2g2UZc/tbaS2PIkWeF+0SJYrjfHjhnlFXXGzZv/tOtu+xyrOs8wAgUrvzZ2Cd5jPyRS1YtLS9o99UXr+2YJWsLdZsUcofyR+wY+SO3bqkdtmlGX9++U2u1+1xnAUaA8o+LMeYYY8yXjTHLjTErjDE3DfC1Iqn3T4tqpdSnAs4kRTEsry2bKkl/397JyB9ZVrjyT2JjHyPpB5LKemV2v9WrZ/pv/XtRJgp2SZrkOgQQt3vs3LnWav3bN2xc9Pk9prxQN2aa60zAMBSu/JMY+Z8r6UxJ90r6G0m/bvx8rqQTFN0aWJf0ZUkHJPD+afS86wBAEupqaX1Rk1e3Sq1v3rR5tes8wDC97DpAsyVR/mdK2ijpBGvtCkWjXllrH7PWXm+tPUvRtr4XSlqQwPun0bOuAwBJubU2f4MkXbi2fY6s5RIXsoiRfwwOknSntXbrP6aVJGNM69YnWGt/qejMwIUJvH8aUf7IrSvr0XX/abXatGl9tXtc5wGGgfKP6Zj9/yE3Nb5P3uZ5j0k6NIH3TyO2P0Vu3VU/aK612iBJ/9De0bqz5wMpRPnH
4DlJ+/T7+9Z1/bdd/3uOpL4E3j+NGPkjt2pqHfUXTXxUkk7cuOnwUdbyYRdZQ/nH4D5JB/c7zX+Dopn9XzLGzDPGTDDGXCRpkaT7E3j/NKL8kWu31+d3SVKL1HLCho2Pu84DDBHlH4OrJO0hqSRJ1tpVkn4u6TBJD0nqkFRRNOr/TALvn0aMhJBry2vLdt/650+2d7xO1va6zAMMwcawHBZuMbbYy99a+zNJ4yQF/X5clvRpSX+UtEbStZLeZK29O+73TylG/si139cPnmdtNL9naq0+dUZfHxP/kBVPuA7gQiIr/Flrt9h+t/xYa3uttRVr7RJr7Vxr7dustbcn8d5pVK2UOtW45RHIoz6NGr1Wuz269e+fXNcx1mUeYAgKuT4Fa/s3z3OuAwBJuqN+SOfWP79p0+YFo6190mUeYJAofySK6/7IteW1ZX+9nddI5u1dG552mQcYpEd3/pT8GfHa/saYkVwvsdbaWSPNkBFc90eu3VE/ZJ616jZGYyXpY+2dh14+YdduGcMlAKRZIUf+cWzs48VwjCKg/JFrPRrd1q4JK6eoa4EkTa7Xp8zq7b3j8TFjljmOBuxIIct/xKf9rbUtI/mK4z8iIwo5oxTFcmf94I7+f79gXcdujqIAg7E2LIeFu8df4pp/M61yHQBI2pW1ZZP6//31m7sPbavXH3MUB9iZQo76Jcq/mR6SxMInyLXb6vPnWast/X/2zq4NbGmNtKL8kaxqpdQj6U+ucwBJ2qIxYzs0/lWzp8/r6Fwgazdt7zWAQ5Q/mmKl6wBA0u6qH7yu/993q9uJB/X03ucqD7ADhbzNT6L8m60oGxmhwJbXlr5mkt9F69qnuMgC7AQjfzTFStcBgKTdWl8wz1r19P/Zkd1bDh5Xrz/sKhMwACupsJNRKf/mWqnof3BAbm1W2y7rtcsj2/789PVdL7vIA2zHI0XczW8ryr+JGhv8VF3nAJJ2d33eum1/9qGO9QtlLRtcIS3udB3AJcq/+bjuj9y7srZswrY/G2/trvO39Kx0EAcYCOWPplrpOgCQtJvqC+dZq75tf+6vbd/LRR5gAJQ/moqRP3Jvk8aO79K411z3P7SnZ86u9fqDLjIB/axVgW/zkyh/F1a6DgA0wz31uQNO8Duzc31Hk6MA2/pDWA4LPfma8m+yaqX0rCRmPSP3rqot3XWgn5c7uxYZazuaHAfor9Cn/CXK35U/uA4AJO3G+qK51qq27c/HWjvuiO4tD7jIBDRQ/q4DFNTNrgMASduocRM2auyA11U/tbZ932bnARr6JN3tOoRrlL8blD8K4b767L8M9PO5vb0HTKzV2OYaLqwKy2HhN5qi/N1YKek1i6AAeXNVfem47T324Y71G5uZBWgo/Cl/ifJ3olopWUm3uM4BJO362hFzrVV9oMfeu77rCGMtk1/RbJS/KH+XbnIdAEhal8ZP3KS2Aa/7j5HGLN3c/VCzM6HwKH9R/i6tcB0AaIaV9QNf2t5jn1rX7snaQt9vjaZaE5bDp12HSAPK35FqpfSIpGdd5wCSdnX96LHbe2z/3r6Zu9fq9zUzDwrtStcB0oLyd+s3rgMASbuuduRca7e/lfW5HZ29zcyDQrvKdYC0oPzdovyRex2aMGmzxqze3uOndW04osXaF5uZCYW0VtIdrkOkBeXv1m8lMepB7j1gZ/15e4+NkkYdt2nzw83Mg0K6NiyHr1lxsqgof4eqlVKX+CSKArimtqRtR49ftK59tqwd8JZAICac8u+H8nePU//IvWtrR83e0XX/fftq0/eq1e5tZiYUyhZJ17kOkSaUv3vXug4AJG2dJu6+RaPX7Og5H23vbFYcFM/NYTnc4DpEmlD+jlUrpQcl/cl1DiBpod3/hR09ftKGjYtarX2uWXlQKJzy3wblnw7/5ToAkLRraktG7+jxFqnlLRs37fDsADBMlP82KP90+G9p+9dDgTwIaksO3NlzLljXMU/W9jUjDwrjvrAcckZpG5R/ClQr
paqY9Y+ce1mTpm6xox/f0XP2qtX22revdk+zMqEQWNVvAJR/enDqH7n3kJ250xHYx9s7xjQjCwqD8h8A5Z8el4sFf5BzQW3JqJ09560bNy0cZe1TzciD3HswLIerXIdII8o/JaqV0lpxHypyLqgtmbWz5xjJvG3DxiebkQe59wPXAdKK8k8XTv0j1/6sKXv12FE7LfaPr+s4RNb2NCMTcqtH0k9ch0gryj9drpLU5ToEkKSH7YydbmW9e72+h9fbx8Q/jMTysByudR0irSj/FKlWSpslXeE6B5Cka2uLB/V754L2jl2SzoJc45T/DlD+6cOpf+TaNbUlBwzmecdt2rxgjLVPJJ0HufSUpBtdh0gzyj99Vkja7vanQNY9p6nTem3roGbzn9q1YaeXCIAB/EdYDlk4bQco/5SpVko1ST93nQNI0qN2v6cH87yPtnfOl7XdSedBrtQl/dB1iLSj/NPpu2K5X+TYb2pHmcE8b2K9Pml2by9b/WIobgzL4TOuQ6Qd5Z9C1UrpUbHVL3Ls6vrR3mCfe9HajokJRkH+MNFvECj/9PqK6wBAUp62e+3ba1sHdT3/6O7uQ8bW648mnQm58LJYzndQKP+UqlZKN0u633UOIClr7PTqYJ/7nvUbXkwwCvLjJ2E5ZHGoQaD80+1rrgMASbmuduSgn/t3HZ0LZe2GBOMg++qSvuc6RFZQ/un2c0nPuw4BJOGq+tKZg33urtZOOKSnhzNh2JErwnLI5aFBovxTrFop9Ur6huscQBKetNP267Mtg/5we9Ha9j2SzIPM+xfXAbKE8k+/70na6DoEkITH7T6D3r3v8C09B+1Sr/8pyTzIrOvDcnif6xBZQvmnXLVSapf0I9c5gCTcUD+iPpTnf6Cza11SWZBpjPqHiPLPhq8rmswC5MqVtaUzhvL8szrXHy5rO5PKg0y6IyyHt7kOkTWUfwZUK6U1kq52nQOI2xq778yaNYPey2Kctbss3LJlVZKZkDlfcB0giyj/7GDRH+TSk3bakHbu+9TajmlJZUHmrArLYeA6RBZR/hlRrZRul3S36xxA3G6sL+obyvNf19Mze0Kt/kBSeZApjPqHifLPls+4DgDE7crasv2G+pqzOtd3JZEFmfKYpMtdh8gqyj9DqpXSbyXd4DoHEKdH7Iz9a9a8NJTXnLF+/SJjLTP/i+2LYTlkIvQwUf7Z8ymx3S9y5im71+NDeX6b1djF3VvCpPIg9Z6R9J+uQ2QZ5Z8x1UpppaT/dp0DiNNv64f3DvU1n1rbPqTbBJErXwzL4ZD/N4NXUP7Z9FlJW1yHAOJyZW3Z9KG+5sDe3v0n12orE4iDdHtYbOAzYpR/BlUrpaqkb7vOAcTlIbv/rLo1Lw/1ded0dG5OIg9S7YKwHA7pDhG8FuWfXf9XEiudITeesVPXDPU1716/4Qhj7V+SyINUui4sh79xHSIPKP+MqlZKayV90XUOIC4r6ocP+VLWaGn06zd3s9lPMfRJOt91iLyg/LPt65Kecx0CiMOVtWX7DOd1F69tP0DWcstX/n03LIcPuw6RF5R/hlUrpc2S/o/rHEAcVtkDDqxbM+R792f29e03tVZjO9d8axe/62JF+WffjyRx2hM5YMxzdvfHhvPKj3R0MvLPt8+F5ZBFnWJE+WdctVKqSfJd5wDicHN9YfdwXndK18ZFLda+EHcepMIjkr7lOkTeUP45UK2UrpZ0nescwEgtry3beziva5Va37Rp86Nx50EqXMitffGj/PPjPEmbXIcARuJ+e+Bsa4d3C+uF69rnytpa3Jng1PVs2ZsMyj8nGgv/XOo4BjAiVi0tz2v3YY3g9+mrTZtWq90bdyY4w619CaL88+Vrkla5DgGMxK21w4a9at8/rOvgd1p+XBaWQyYzJ4T/o+RItVLqk/RhScx8RmYtry3bc7ivPXHjpsNbrX02zjxw4jFJl7gOkWeUf85UK6U/ipmxyLB77Nw51mr9cF7bIrWcuGHTkLYHRupY
SWeH5ZB9GxJE+efTpyVVXYcAhqOultYXNXnYM/c/2d5+sKxlu9fs+nZYDm93HSLvKP8cqlZKGySdregTNJA5t9YOG/adK1Nr9akz+vruiTMPmqYq1i1pCso/p6qV0k1iz2tk1BX1ZXuM5PWfWNcxNq4saKpzwnK4wXWIIqD88+0iSU+5DgEM1R/r8+Zaq2GXwJs3bV4w2ton48yExH07LIc3xnEgY8wtxhi7g6/j43ifLKP8c6zf6X8gU2pqHfUXTXpkuK83kjm5a8PTcWZColYrGqzE7VeSfjzAV+F3QzXWclk47zw/+I6kc13nAIbiK6O/c+s7Wm8/drivb29pWfeGGdPHy5i2OHMhdjVJy8JyeFdcBzTG3CLpWEn7W2urcR03Txj5F8P5kkLXIYChuKJ2zJSRvH5yvT5lVm8vK/6l37/EWfwYHMq/AKqV0mZJp0nqcp0FGKw/1A+aZ+3I9qu4YF3HrnHlQSLulfR51yGKiPIviGqltFpc/0eG9GnU6LXabdjX/SXp9Zu757fV62viyoRYdUk6PeEd+842xnzbGPNNY8w/GGNmJPhemUL5F0i1Urpc0jdc5wAG63f1Q0Z8tuqdXRuejyMLYndmWA6T3ob5s4p2PP17SZdJWmOMYdlgUf5FdKGku12HAAZjee2YySM9xnkdnYfJWra7TpcvhuXw1wke/zZJZ0iaJWkXSXMlfUbRToGfN8Z8PMH3zgRm+xeQ5wczJN0vaUQTqoCkjVZfz+q2D9SM0biRHOed++z9u0faxhwTVy6MyI2STgjLYa3Zb2yMeYuk6yV1SNrHWlvY/QMY+RdQtVJ6WtGnYj75IdV6NWrMOk0Y0XV/SbpoXTsfdNPhKUnvdVH8kmStvUHSPZImSVrsIkNaUP4FVa2UrpX0Bdc5gJ35ff11w9rhr7+jurccPK5eH/GHCIxIt6RTw3K41nGOxxrfpzlN4RjlX2z/KOlm1yGAHbmitmxiHMd53/quv8RxHAzbeWE5vM91CElb55FsdJrCMa75F5znB3spuv5f6E/BSK829XQ/0namMUYjWqlvozEblszc18qYCXFlw6B9NyyH57kOYYyZKulJSeMl7WetfdZxJGcY+RdctVJ6UdJ7FC2xCaTOFo0Z26FdR3zKfry1u87f0rMyhkgYmt9LatrsemPMUmPM240xrdv83JN0haLiv6rIxS9R/pBUrZRuU3QvLJBKf6gf1BHHcT61rn3POI6DQXtR0mlhOexp4nvOUVTyzxpjAmPMfxljfifpYUnLJD0k6cNNzJNKlD8kSdVK6ftimU2k1PLast3iOM78LT1zd63XH4rjWNipPknvCsthsxdZukvSdyQ9L+lISe+SdIiklZIukHSktfalJmdKHa7541U8P/i+pA+5zgH0N1ZbNj/c9sFWYzRmpMf67qTd7vjW5EnL4siF7bKSPhCWw5+6DoKBMfLHts6VdI3rEEB/3Wobt167xHKr3pmdXYfL2s44joXtOp/iTzfKH69SrZRqkt6t6NQZkBp31w9qj+M4Y60dd0T3llVxHAsD+kJYDr/uOgR2jPLHa1QrpU2S/lbSatdZgK2W15bFtj3vp9a1T4/rWHiVfw/L4addh8DOUf4YULVSelnS8ZL+7DoLIEk31xfMs1axbP86r6d31sRajdF/vK5QdNkQGUD5Y7uqldKTkk5UtO824NQmjR3fpXGxLdF7duf6Qq/wFrNb5HDNfgwd5Y8dqlZK90s6VVKv6yzAPfW5sa0Lf3pn1xHG2pfjOl6B3S/p5LAcbnEdBINH+WOnqpXSbyWdJXYBhGNX1paNj+tYY6QxSzd3PxjX8QpqjaTjw3I44s2X0FyUPwalWin9VNInXedAsf22fvhca+Nbivride37i8VOhusFSW8Jy2HhF8zJIsofg1atlC6T9BFxBgCObNS4CRs19tG4jndAb9/M3ev1++M6XoG0S3prWA6fdB0Ew0P5Y0iqldJ3JJ0pNgKCI/fW58S6Ne+5
7Z3NXHc+D16UdFxYDkPXQTB8lD+GrFop/aek94pJgHDgqtrSXeI83mldG45osfbFOI+ZY09Len1YDh9wHQQjQ/ljWKqV0uWK7gJghi+a6ob6ojnWqh7X8UZJo47dtDm2WwhzbLWi4n/MdRCMHOWPYatWStdIepukTa6zoDi6NH7iJrXFdt1fki5e136grI3tA0UOrVJU/E+7DoJ4UP4YkWqldKOilQBZCAhNs7J+YKwzzPftq03fq1a7N85j5sjvFV3jZ1Z/jlD+GLFqpXS7pDcpmgEMJO7K+tJxcR/z79vZ6G8AKyT9TVgOO1wHQbwof8SiWin9UdIbJcU6ExsYyPW1I2O97i9JJ2/YuKjV2ufiPGbGXSmpFJZDlkHOIcofsalWSqskvUHS866zIN86teukzRoT68SzFqnlLRs3MZkt8lNJp7Fkb35R/ohVtVJ6RNJSSSybikQ9YGfFfnve+es65snaWHYOzLDvSPpAWA6L/u+Qa5Q/YletlJ5S9AHgWtdZkF9X1Y4eE/cx967V9p7eV9iJfzVJF4bl8CNhOWQVz5yj/JGIaqXUJekkSZe5zoJ8+k3tqDnWxr/U9MfbO0bFfcwMaJd0YlgOv+I6CJrDsKcFkub5wd9J+qakIv5SRYIeaSs/Ntb0zo7zmFayh3v7PdNnzIw4j5tiDynakvdx10HQPIz8kbhqpfQ9RWsBcCsgYhXaA16I+5hGMn+7YWNRNqxZLmkJxV88lD+aoloprZB0hCTWBEdsrq4tGZ3EcT+xruN1sjbPG/5YSZ+XdGpYDje4DoPmo/zRNNVK6QlJR0v6hessyIdra4sPTOK4u9fre3i9fXmd+LdB0jvCcvh/mNhXXJQ/mqpaKW2qVkrvkXSx2BYYI/SyJk3dYkcncsr6/PaOWHcPTIknJB0dlsMrXAeBW5Q/nKhWSl+WdIKkda6zINsesjMTWVTqjZs2HzbG2ieSOLYjKyQdGZZD1uAA5Q93GpsCLZB0k+MoyLBrake3JnXsU7o2PJPUsZuoLumLkt4alkM+bEMSt/ohBTw/MJI+KelfJLU5joOM2UvrXrpr7Ef3TOLYnS0tHcfMmD5WxoxN4vhNUFW0Wt/troMgXRj5w7lqpWSrldJXFd0NsMp1HmTLi5qyZ48dlciteRPr9Umze3uzOvHvR5LmU/wYCOWP1KhWSg9KOkrSl6R4d2xDvj1sZzyb1LEvXNuxW1LHTsjLim7h+2BYDrtch0E6Uf5IlWql1FOtlD6laHvgp1znQTYEtSWJ/S5b2t196Nh6fXVSx49ZIOkQZvNjZyh/pFK1UrpN0nxJP3GdBel3TW3JAUke/91dG/6c5PFjsFHSuWE5/NuwHMa+2yHyhwl/SD3PD06T9D1JU1xnQXo91nbGU6NNbWYSx+4yZv3Smfu2ypjxSRx/hP4g6YywHK5xHQTZwcgfqVetlH4p6VBJ17vOgvR61O6X2G15E6zd7XU9Pfcldfxh6pP0j5KOofgxVJQ/MqFaKT1frZSOl3SGpNg3c0H2XVtbbJI8/sVr2/dI8vhDdKukw8Ny+E9hOWSlTAwZp/2ROZ4fTJB0iaRPSEpkYxdkz37mpedub/vE9CTfY/HMff+0qaXl4CTfYyeelXRhWA7ZHwMjwsgfmVOtlLqqldLFii4FXOc6D9LhGbvn9F7bmtgtf5L0/s4uVyvkbVG0CNY8ih9xYOSPzPP84CRJX5OU6IxvpN+1Y/w7Dm55ellSx99kzMbFM/ftkzETk3qPAQSSPsF1fcSJkT8yr1opXSXpYEmflbTJcRw4dF3tqERHM7tYO37hli3NWoVyjaS/bdy+R/EjVoz8kSueH+wn6f9JepfrLGg+z7zwzC1tF+yX5Hs8OGbMY++dvvfsBN9io6JT/F8Jy+GWBN8HBUb5I5c8PzhO0r8qmheAAlnT9v7nR5n6Pkm+x9IZ+4ZdrS1J/G/rF4om9CU6dwHgtD9yqVop3SJpoaQPSnrMbRo00+N2n0Q2+env
rM7162M8nJV0haSFYTl8D8WPZmDkj9zz/KBV0rslfUbR3ADk2CdHXX77x0dd8fok32OLUfeRM/fbbI2ZPILDWEm/kvRPYTl8IKZowKBQ/igMzw9aJL1D0cTA+Y7jICGzzHNPrWi7KJFlfvv70N573nrXuLHHDuOldUmXKyr9h2KOBQwK5Y/C8fzASDpJ0UJBixzHQQIeb3v/C62mPi3J91gzevSTp+w7bf8hvKQu6eeS/jkshw8nFAsYFMofheb5wQmKPgQc7ToL4nPjmAvvmN3yfGL3+2/1hhnT729vbV24k6fVJP1MUek/mnQmYDCY8IdCq1ZKv6lWSkslvVnReunIgRvqR9Sb8T4f7ljfvYOHeyX9WNJBYTk8g+JHmjDyB/rx/OAYSR+TdIrYNyCz5pqnn7y+zR/KKflh6ZV6F3n7dVhjpvb78fOS/k3Sv4XlkE2okEqUPzAAzw/2lnS2pHMkzXAcB8PweNvpL7Uau2fS7/P3e0295bZdxh2n6MzRtyRdEZbDvqTfFxgJyh/YgcZtgiVJ50l6q6REt41FfG4ac8GdB7S8sDTht+l4dMzo7542fdp/heXwwYTfC4gN5Q8MkucHMyWd2fjyXGbBzv3vUf9129+NCt6Q0OFvl/R9Sb/UpZ2bE3oPIDGUPzBEjVsF36ho9cB3SBrnNhEGcrCpPn5t26dnxXjIP0v6iaR/16Wdq2M8LtB0lD8wAp4fTFS0idA7FH0gGOM2Efp7ou30l1uM3WMEh3hJ0Sp8/yPpNl3a2ZS7CICkUf5ATDw/2E3SCZJOlnSipGbu+Y4B3DLmk7/3Wl4c6hoOLytaa/8Xkm7RpZ21+JMBblH+QAI8Pxgt6VhFHwROlpToNrMY2CWjfnLr2aN+M5gleJ+QdKWk5ZLuoPCRd5Q/0ASeHyzUKx8EFrhNUxzzzeOPXdV2yewBHqpLulfS1ZKW69LOsLnJALcof6DJGncNnCzpLYqWFZ7iNlGeWftE2/vbW4ydIulhSSsk3aTodH6722yAO5Q/4FDjzoG5kpb2+5on1hMYKSvpT5J+99+j//nGpa1/ulOXdrLaHtBA+QMp4/nBZElL9MqHgaMk7eo0VPptkfRHSb9rfN1ZrZQY2QPbQfkDKddYZXC+og8CR0s6WNJsFfMDQa+k1YpG9Q/1+/5YtVLqdRkMyBLKH8gozw+mKfoQMKfxfeufZ0ka6zBaHHokPaao2LctedbNB0aI8gdyxvODFkn76tUfCg6UtLekqY2v8c4CSu2SXlS0gM5L/f78oqJV9FYrxSVvjNlF0WTNt0k6RtJMSTVJaxQtCPRVa+0GdwmBnaP8gQLy/GCcpD0UfRDYQ9IkSbsN8LWrosmHNUW3xw30faCf9Ulap9cW/EtZPz1vjPmQonX9pegOggcV/VstlTRB0iOSjrXWvuQmIbBzlD8ADIExpqyo6L9urX2438+nSQokLZT0M2vt+xxFBHaK8geAmBhjjpZ0p6K7D3az1vY4jgQMqMV1AADIkVWN722SdncZBNgRyh8A4nNA43uvojkPQCpR/gAQn483vl9nrd3iNAmwA1zzB4AYGGNOlHSNojsdjrTWrtrJSwBnGPkDwAgZY+ZJ+qmi2yIvoviRdpQ/AIyAMWa6pOskTVa0wM9ljiMBO8VpfwAYJmPMFEm3K9pv4T8knW35pYoMoPwBYBiMMbtKWqFo18VfS3qXtbbmNhUwOJz2B4AhMsa0SbpSUfFfL+m9FD+yhPIHgCEwxrRK+pmk/6XolP+prOSHrBnlOgAAZMxHJZ3S+PPLkr5tjBnoeRdaa19uWipgCCh/ABiayf3+fMp2nyVdqujDAZA6TPgDAKBguOYPAEDBUP4AABQM5Q8AQMFQ/gAAFAzlDwBAwVD+AAAUDOUPAEDBUP4AABQM5Q8AQMFQ/gAAFAzlDwBAwVD+AAAUDOUPAEDBUP4AABQM5Q8AQMFQ/gAAFAzlDwBAwVD+AAAUDOUPAEDBUP4AABQM5Q8AQMFQ
/gAAFAzlDwBAwVD+AAAUDOUPAEDBUP4AABQM5Q8AQMFQ/gAAFAzlDwBAwVD+AAAUDOUPAEDBUP4AABQM5Q8AQMFQ/gAAFAzlDwBAwVD+AAAUDOUPAEDBUP4AABQM5Q8AQMFQ/gAAFAzlDwBAwVD+AAAUDOUPAEDBUP4AABQM5Q8AQMH8f8TljZcbqcRxAAAAAElFTkSuQmCC\n",
576 | "text/plain": [
577 | ""
578 | ]
579 | },
580 | "metadata": {},
581 | "output_type": "display_data"
582 | }
583 | ],
584 | "source": [
585 | "plt.figure(figsize=(4,4),dpi=150)\n",
586 | "train['label'].value_counts().sort_index().plot(kind = 'pie')"
587 | ]
588 | },
589 | {
590 | "cell_type": "markdown",
591 | "metadata": {},
592 | "source": [
593 | "## 2.2 Test set exploration"
594 | ]
595 | },
596 | {
597 | "cell_type": "markdown",
598 | "metadata": {},
599 | "source": [
600 | "### 2.2.1 Data overview"
601 | ]
602 | },
603 | {
604 | "cell_type": "code",
605 | "execution_count": 18,
606 | "metadata": {},
607 | "outputs": [
608 | {
609 | "data": {
610 | "text/html": [
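611 | "<div>\n",
612 | "<style scoped>\n",
613 | "    .dataframe tbody tr th:only-of-type {\n",
614 | "        vertical-align: middle;\n",
615 | "    }\n",
616 | "\n",
617 | "    .dataframe tbody tr th {\n",
618 | "        vertical-align: top;\n",
619 | "    }\n",
620 | "\n",
621 | "    .dataframe thead th {\n",
622 | "        text-align: right;\n",
623 | "    }\n",
624 | "</style>\n",
625 | "<table border=\"1\" class=\"dataframe\">\n",
626 | "  <thead>\n",
627 | "    <tr style=\"text-align: right;\">\n",
628 | "      <th></th>\n",
629 | "      <th>file_id</th>\n",
630 | "      <th>api</th>\n",
631 | "      <th>tid</th>\n",
632 | "      <th>index</th>\n",
633 | "    </tr>\n",
634 | "  </thead>\n",
635 | "  <tbody>\n",
636 | "    <tr>\n",
637 | "      <th>0</th>\n",
638 | "      <td>1</td>\n",
639 | "      <td>RegOpenKeyExA</td>\n",
640 | "      <td>2332.0</td>\n",
641 | "      <td>0.0</td>\n",
642 | "    </tr>\n",
643 | "    <tr>\n",
644 | "      <th>1</th>\n",
645 | "      <td>1</td>\n",
646 | "      <td>CopyFileA</td>\n",
647 | "      <td>2332.0</td>\n",
648 | "      <td>1.0</td>\n",
649 | "    </tr>\n",
650 | "    <tr>\n",
651 | "      <th>2</th>\n",
652 | "      <td>1</td>\n",
653 | "      <td>OpenSCManagerA</td>\n",
654 | "      <td>2332.0</td>\n",
655 | "      <td>2.0</td>\n",
656 | "    </tr>\n",
657 | "    <tr>\n",
658 | "      <th>3</th>\n",
659 | "      <td>1</td>\n",
660 | "      <td>CreateServiceA</td>\n",
661 | "      <td>2332.0</td>\n",
662 | "      <td>3.0</td>\n",
663 | "    </tr>\n",
664 | "    <tr>\n",
665 | "      <th>4</th>\n",
666 | "      <td>1</td>\n",
667 | "      <td>RegOpenKeyExA</td>\n",
668 | "      <td>2468.0</td>\n",
669 | "      <td>0.0</td>\n",
670 | "    </tr>\n",
671 | "  </tbody>\n",
672 | "</table>\n",
673 | "</div>"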
674 | ],
675 | "text/plain": [
676 | "   file_id             api     tid  index\n",
677 | "0        1   RegOpenKeyExA  2332.0    0.0\n",
678 | "1        1       CopyFileA  2332.0    1.0\n",
679 | "2        1  OpenSCManagerA  2332.0    2.0\n",
680 | "3        1  CreateServiceA  2332.0    3.0\n",
681 | "4        1   RegOpenKeyExA  2468.0    0.0"
682 | ]
683 | },
684 | "execution_count": 18,
685 | "metadata": {},
686 | "output_type": "execute_result"
687 | }
688 | ],
689 | "source": [
690 | "test.head()"
691 | ]
692 | },
693 | {
694 | "cell_type": "code",
695 | "execution_count": 19,
696 | "metadata": {},
697 | "outputs": [
698 | {
699 | "name": "stdout",
700 | "output_type": "stream",
701 | "text": [
702 | "<class 'pandas.core.frame.DataFrame'>\n",
703 | "RangeIndex: 39173 entries, 0 to 39172\n",
704 | "Data columns (total 4 columns):\n",
705 | "file_id    39173 non-null int64\n",
706 | "api        39173 non-null object\n",
707 | "tid        39172 non-null float64\n",
708 | "index      39172 non-null float64\n",
709 | "dtypes: float64(2), int64(1), object(1)\n",
710 | "memory usage: 1.2+ MB\n"
711 | ]
712 | }
713 | ],
714 | "source": [
715 | "test.info()"
716 | ]
717 | },
718 | {
719 | "cell_type": "markdown",
720 | "metadata": {},
721 | "source": [
722 | "### 2.2.2 Missing values"
723 | ]
724 | },
725 | {
726 | "cell_type": "code",
727 | "execution_count": 20,
728 | "metadata": {},
729 | "outputs": [
730 | {
731 | "data": {
732 | "text/plain": [
733 | "file_id    0\n",
734 | "api        0\n",
735 | "tid        1\n",
736 | "index      1\n",
737 | "dtype: int64"
738 | ]
739 | },
740 | "execution_count": 20,
741 | "metadata": {},
742 | "output_type": "execute_result"
743 | }
744 | ],
745 | "source": [
746 | "test.isnull().sum()"
747 | ]
748 | },
749 | {
750 | "cell_type": "markdown",
751 | "metadata": {},
752 | "source": [
753 | "### 2.2.3 Data distribution"
754 | ]
755 | },
756 | {
757 | "cell_type": "code",
758 | "execution_count": 21,
759 | "metadata": {},
760 | "outputs": [
761 | {
762 | "data": {
763 | "text/plain": [
764 | "file_id      10\n",
765 | "api         146\n",
766 | "tid         125\n",
767 | "index      5001\n",
768 | "dtype: int64"
769 | ]
770 | },
771 | "execution_count": 21,
772 | "metadata": {},
773 | "output_type": "execute_result"
774 | }
775 | ],
776 | "source": [
777 | "test.nunique()"
778 | ]
779 | },
780 | {
781 | "cell_type": "markdown",
782 | "metadata": {},
783 | "source": [
784 | "### 2.2.4 Outliers"
785 | ]
786 | },
787 | {
788 | "cell_type": "code",
789 | "execution_count": 22,
790 | "metadata": {},
791 | "outputs": [
792 | {
793 | "data": {
794 | "text/plain": [
795 | "count    39172.000000\n",
796 | "mean      1729.569284\n",
797 | "std       1486.018402\n",
798 | "min          0.000000\n",
799 | "25%        405.750000\n",
800 | "50%       1342.000000\n",
801 | "75%       2876.000000\n",
802 | "max       5000.000000\n",
803 | "Name: index, dtype: float64"
804 | ]
805 | },
806 | "execution_count": 22,
807 | "metadata": {},
808 | "output_type": "execute_result"
809 | }
810 | ],
811 | "source": [
812 | "test['index'].describe()"
813 | ]
814 | },
815 | {
816 | "cell_type": "code",
817 | "execution_count": 23,
818 | "metadata": {
819 | "scrolled": true
820 | },
821 | "outputs": [
822 | {
823 | "data": {
824 | "text/plain": [
825 | "count    39172.000000\n",
826 | "mean      2158.769938\n",
827 | "std        464.152821\n",
828 | "min        504.000000\n",
829 | "25%       2092.000000\n",
830 | "50%       2224.000000\n",
831 | "75%       2500.000000\n",
832 | "max       2920.000000\n",
833 | "Name: tid, dtype: float64"
834 | ]
835 | },
836 | "execution_count": 23,
837 | "metadata": {},
838 | "output_type": "execute_result"
839 | }
840 | ],
841 | "source": [
842 | "test['tid'].describe()"
843 | ]
844 | },
845 | {
846 | "cell_type": "markdown",
847 | "metadata": {},
848 | "source": [
849 | "## 2.3 Joint analysis of the training and test sets"
850 | ]
851 | },
852 | {
853 | "cell_type": "markdown",
854 | "metadata": {},
855 | "source": [
856 | "### 2.3.1 file_id analysis"
857 | ]
858 | },
859 | {
860 | "cell_type": "code",
861 | "execution_count": 24,
862 | "metadata": {},
863 | "outputs": [],
864 | "source": [
865 | "train_fileids = train['file_id'].unique()\n",
866 | "test_fileids = test['file_id'].unique()"
867 | ]
868 | },
869 | {
870 | "cell_type": "code",
871 | "execution_count": 25,
872 | "metadata": {},
873 | "outputs": [
874 | {
875 | "data": {
876 | "text/plain": [
877 | "0"
878 | ]
879 | },
880 | "execution_count": 25,
881 | "metadata": {},
882 | "output_type": "execute_result"
883 | }
884 | ],
885 | "source": [
886 | "len(set(train_fileids)-set(test_fileids)) "
887 | ]
888 | },
889 | {
890 | "cell_type": "code",
891 | "execution_count": 26,
892 | "metadata": {},
893 | "outputs": [
894 | {
895 | "data": {
896 | "text/plain": [
897 | "1"
898 | ]
899 | },
900 | "execution_count": 26,
901 | "metadata": {},
902 | "output_type": "execute_result"
903 | }
904 | ],
905 | "source": [
906 | "len(set(test_fileids)-set(train_fileids)) "
907 | ]
908 | },
909 | {
910 | "cell_type": "markdown",
911 | "metadata": {},
912 | "source": [
913 | "### 2.3.2 API analysis"
914 | ]
915 | },
916 | {
917 | "cell_type": "code",
918 | "execution_count": 27,
919 | "metadata": {},
920 | "outputs": [],
921 | "source": [
922 | "train_apis = train['api'].unique()\n",
923 | "test_apis = test['api'].unique()"
924 | ]
925 | },
926 | {
927 | "cell_type": "code",
928 | "execution_count": 28,
929 | "metadata": {
930 | "scrolled": true
931 | },
932 | "outputs": [
933 | {
934 | "data": {
935 | "text/plain": [
936 | "{'CertCreateCertificateContext',\n",
937 | " 'CertOpenSystemStoreA',\n",
938 | " 'CoInitializeSecurity',\n",
939 | " 'CreateServiceA',\n",
940 | " 'CryptAcquireContextW',\n",
941 | " 'FindWindowA',\n",
942 | " 'FindWindowExW',\n",
943 | " 'GetComputerNameA',\n",
944 | " 'GetFileVersionInfoSizeW',\n",
945 | " 'GetFileVersionInfoW',\n",
946 | " 'IWbemServices_ExecQuery',\n",
947 | " 'LookupAccountSidW',\n",
948 | " 'LookupPrivilegeValueW',\n",
949 | " 'OpenServiceW',\n",
950 | " 'OutputDebugStringA',\n",
951 | " 'R',\n",
952 | " 'SendNotifyMessageW',\n",
953 | " 'SetStdHandle',\n",
954 | " 'StartServiceA',\n",
955 | " 'StartServiceW',\n",
956 | " 'UnhookWindowsHookEx',\n",
957 | " 'connect',\n",
958 | " 'timeGetTime'}"
959 | ]
960 | },
961 | "execution_count": 28,
962 | "metadata": {},
963 | "output_type": "execute_result"
964 | }
965 | ],
966 | "source": [
967 | "set(test_apis)-set(train_apis)"
968 | ]
969 | },
970 | {
971 | "cell_type": "code",
972 | "execution_count": 29,
973 | "metadata": {},
974 | "outputs": [
975 | {
976 | "data": {
977 | "text/plain": [
978 | "{'CertControlStore',\n",
979 | " 'CryptAcquireContextA',\n",
980 | " 'CryptCreateHash',\n",
981 | " 'CryptExportKey',\n",
982 | " 'CryptHashData',\n",
983 | " 'DeviceIoControl',\n",
984 | " 'DrawTextExA',\n",
985 | " 'EncryptMessage',\n",
986 | " 'EnumServicesStatusW',\n",
987 | " 'FindResourceExA',\n",
988 | " 'GetAdaptersAddresses',\n",
989 | " 'GetAddrInfoW',\n",
990 | " 'GetAsyncKeyState',\n",
991 | " 'GetBestInterfaceEx',\n",
992 | " 'GetFileInformationByHandle',\n",
993 | " 'GetFileVersionInfoExW',\n",
994 | " 'GetFileVersionInfoSizeExW',\n",
995 | " 'GetUserNameA',\n",
996 | " 'GetVolumePathNameW',\n",
997 | " 'GlobalMemoryStatus',\n",
998 | " 'HttpOpenRequestA',\n",
999 | " 'InternetConnectA',\n",
1000 | " 'InternetOpenA',\n",
1001 | " 'IsDebuggerPresent',\n",
1002 | " 'Module32FirstW',\n",
1003 | " 'Module32NextW',\n",
1004 | " 'NtDeleteValueKey',\n",
1005 | " 'NtReadVirtualMemory',\n",
1006 | " 'OpenServiceA',\n",
1007 | " 'ReadProcessMemory',\n",
1008 | " 'RegEnumKeyExA',\n",
1009 | " 'RegEnumValueA',\n",
1010 | " 'RtlAddVectoredContinueHandler',\n",
1011 | " 'RtlAddVectoredExceptionHandler',\n",
1012 | " 'RtlRemoveVectoredExceptionHandler',\n",
1013 | " 'SetFileAttributesW',\n",
1014 | " 'SetFileTime',\n",
1015 | " 'SetWindowsHookExA',\n",
1016 | " 'Thread32First',\n",
1017 | " 'Thread32Next',\n",
1018 | " 'WriteConsoleA',\n",
1019 | " 'bind',\n",
1020 | " 'listen'}"
1021 | ]
1022 | },
1023 | "execution_count": 29,
1024 | "metadata": {},
1025 | "output_type": "execute_result"
1026 | }
1027 | ],
1028 | "source": [
1029 | "set(train_apis) - set(test_apis)"
1030 | ]
1031 | }
1032 | ],
1033 | "metadata": {
1034 | "kernelspec": {
1035 | "display_name": "Python 3",
1036 | "language": "python",
1037 | "name": "python3"
1038 | },
1039 | "language_info": {
1040 | "codemirror_mode": {
1041 | "name": "ipython",
1042 | "version": 3
1043 | },
1044 | "file_extension": ".py",
1045 | "mimetype": "text/x-python",
1046 | "name": "python",
1047 | "nbconvert_exporter": "python",
1048 | "pygments_lexer": "ipython3",
1049 | "version": "3.6.2"
1050 | },
1051 | "latex_envs": {
1052 | "LaTeX_envs_menu_present": true,
1053 | "autoclose": false,
1054 | "autocomplete": true,
1055 | "bibliofile": "biblio.bib",
1056 | "cite_by": "apalike",
1057 | "current_citInitial": 1,
1058 | "eqLabelWithNumbers": true,
1059 | "eqNumInitial": 1,
1060 | "hotkeys": {
1061 | "equation": "Ctrl-E",
1062 | "itemize": "Ctrl-I"
1063 | },
1064 | "labels_anchors": false,
1065 | "latex_user_defs": false,
1066 | "report_style_numbering": false,
1067 | "user_envs_cfg": false
1068 | },
1069 | "tianchi_metadata": {
1070 | "competitions": [],
1071 | "datasets": [],
1072 | "description": "",
1073 | "notebookId": "116127",
1074 | "source": "dsw"
1075 | },
1076 | "toc": {
1077 | "nav_menu": {},
1078 | "number_sections": true,
1079 | "sideBar": true,
1080 | "skip_h1_title": false,
1081 | "title_cell": "Table of Contents",
1082 | "title_sidebar": "Contents",
1083 | "toc_cell": true,
1084 | "toc_position": {
1085 | "height": "calc(100% - 180px)",
1086 | "left": "10px",
1087 | "top": "150px",
1088 | "width": "384px"
1089 | },
1090 | "toc_section_display": true,
1091 | "toc_window_display": true
1092 | }
1093 | },
1094 | "nbformat": 4,
1095 | "nbformat_minor": 4
1096 | }
1097 |
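
The joint analysis in section 2.3 of the notebook above compares which `file_id` and `api` values appear in only one of the two datasets by taking set differences. A minimal sketch of that comparison follows. The two small DataFrames and the helper name `compare_values` are hypothetical stand-ins introduced here for illustration; they are not the competition data and not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the train/test tables (not the real data).
train = pd.DataFrame({
    "file_id": [1, 1, 2, 3],
    "api": ["LdrLoadDll", "bind", "listen", "LdrLoadDll"],
})
test = pd.DataFrame({
    "file_id": [1, 2, 4],
    "api": ["LdrLoadDll", "connect", "bind"],
})


def compare_values(train_df, test_df, column):
    """Report which distinct values of `column` are unique to each set or shared."""
    train_values = set(train_df[column].unique().tolist())
    test_values = set(test_df[column].unique().tolist())
    return {
        "train_only": sorted(train_values - test_values),
        "test_only": sorted(test_values - train_values),
        "shared": sorted(train_values & test_values),
    }


api_report = compare_values(train, test, "api")
file_report = compare_values(train, test, "file_id")
print(api_report)
print(file_report)
```

Values listed under `test_only` for `api` are calls the model never sees during training, so any feature built from them (counts, one-hot columns) carries no learned signal. That is the main reason to run this check before feature engineering.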
--------------------------------------------------------------------------------
/阿里云安全恶意程序检测/阿里云安全恶意程序检测-特征工程与基线模型.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Section 3: Feature engineering and baseline model"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 2,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "import numpy as np\n",
17 | "import pandas as pd\n",
18 | "from tqdm import tqdm\n",
19 | "\n",
20 | "class _Data_Preprocess:\n",
21 | "    def __init__(self):\n",
22 | "        self.int8_max = np.iinfo(np.int8).max\n",
23 | "        self.int8_min = np.iinfo(np.int8).min\n",
24 | "\n",
25 | "        self.int16_max = np.iinfo(np.int16).max\n",
26 | "        self.int16_min = np.iinfo(np.int16).min\n",
27 | "\n",
28 | "        self.int32_max = np.iinfo(np.int32).max\n",
29 | "        self.int32_min = np.iinfo(np.int32).min\n",
30 | "\n",
31 | "        self.int64_max = np.iinfo(np.int64).max\n",
32 | "        self.int64_min = np.iinfo(np.int64).min\n",
33 | "\n",
34 | "        self.float16_max = np.finfo(np.float16).max\n",
35 | "        self.float16_min = np.finfo(np.float16).min\n",
36 | "\n",
37 | "        self.float32_max = np.finfo(np.float32).max\n",
38 | "        self.float32_min = np.finfo(np.float32).min\n",
39 | "\n",
40 | "        self.float64_max = np.finfo(np.float64).max\n",
41 | "        self.float64_min = np.finfo(np.float64).min\n",
42 | "\n",
43 | "    def _get_type(self, min_val, max_val, types):\n",
44 | "        if types == 'int':\n",
45 | "            if max_val <= self.int8_max and min_val >= self.int8_min:\n",
46 | "                return np.int8\n",
47 | "            elif max_val <= self.int16_max and min_val >= self.int16_min:\n",
48 | "                return np.int16\n",
49 | "            elif max_val <= self.int32_max and min_val >= self.int32_min:\n",
50 | "                return np.int32\n",
51 | "            return None  # no smaller integer type fits; keep the column as is\n",
52 | "\n",
53 | "        elif types == 'float':\n",
54 | "            if max_val <= self.float16_max and min_val >= self.float16_min:\n",
55 | "                return np.float16\n",
56 | "            if max_val <= self.float32_max and min_val >= self.float32_min:\n",
57 | "                return np.float32\n",
58 | "            if max_val <= self.float64_max and min_val >= self.float64_min:\n",
59 | "                return np.float64\n",
60 | "            return None\n",
61 | "\n",
62 | "    def _memory_process(self, df):\n",
63 | "        init_memory = df.memory_usage().sum() / 1024 ** 2 / 1024\n",
64 | "        print('Original data occupies {} GB memory.'.format(init_memory))\n",
65 | "        df_cols = df.columns\n",
66 | "\n",
67 | "        # Downcast each numeric column to the smallest dtype that covers its value range.\n",
68 | "        for col in tqdm(df_cols):\n",
69 | "            try:\n",
70 | "                if 'float' in str(df[col].dtypes):\n",
71 | "                    max_val = df[col].max()\n",
72 | "                    min_val = df[col].min()\n",
73 | "                    trans_types = self._get_type(min_val, max_val, 'float')\n",
74 | "                    if trans_types is not None:\n",
75 | "                        df[col] = df[col].astype(trans_types)\n",
76 | "                elif 'int' in str(df[col].dtypes):\n",
77 | "                    max_val = df[col].max()\n",
78 | "                    min_val = df[col].min()\n",
79 | "                    trans_types = self._get_type(min_val, max_val, 'int')\n",
80 | "                    if trans_types is not None:\n",
81 | "                        df[col] = df[col].astype(trans_types)\n",
82 | "            except Exception:\n",
83 | "                print('Could not downcast column {}.'.format(col))\n",
84 | "        afterprocess_memory = df.memory_usage().sum() / 1024 ** 2 / 1024\n",
85 | "        print('After processing, the data occupies {} GB memory.'.format(afterprocess_memory))\n",
86 | "        return df"
87 | ]
88 | },
89 | {
90 | "cell_type": "markdown",
91 | "metadata": {},
92 | "source": [
93 | "## 3.3 Baseline model"
94 | ]
95 | },
96 | {
97 | "cell_type": "markdown",
98 | "metadata": {},
99 | "source": [
100 | "### 3.3.1 Data loading"
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": 1,
106 | "metadata": {},
107 | "outputs": [],
108 | "source": [
109 | "import pandas as pd\n",
110 | "import numpy as np\n",
111 | "import seaborn as sns\n",
112 | "import matplotlib.pyplot as plt\n",
113 | "\n",
114 | "import lightgbm as lgb\n",
115 | "from sklearn.model_selection import train_test_split\n",
116 | "from sklearn.preprocessing import OneHotEncoder\n",
117 | "\n",
118 | "import warnings\n",
119 | "warnings.filterwarnings('ignore')\n",
120 | "%matplotlib inline"
121 | ]
122 | },
123 | {
124 | "cell_type": "markdown",
125 | "metadata": {},
126 | "source": [
127 | "### Data loading"
128 | ]
129 | },
130 | {
131 | "cell_type": "code",
132 | "execution_count": 3,
133 | "metadata": {},
134 | "outputs": [],
135 | "source": [
136 | "path = '../security_data/'\n",
137 | "train = pd.read_csv(path + 'security_train.csv')\n",
138 | "test = pd.read_csv(path + 'security_test.csv')"
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": 4,
144 | "metadata": {},
145 | "outputs": [
146 | {
147 | "data": {
148 | "text/html": [
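149 | "<div>\n",
150 | "<style scoped>\n",
151 | "    .dataframe tbody tr th:only-of-type {\n",
152 | "        vertical-align: middle;\n",
153 | "    }\n",
154 | "\n",
155 | "    .dataframe tbody tr th {\n",
156 | "        vertical-align: top;\n",
157 | "    }\n",
158 | "\n",
159 | "    .dataframe thead th {\n",
160 | "        text-align: right;\n",
161 | "    }\n",
162 | "</style>\n",
163 | "<table border=\"1\" class=\"dataframe\">\n",
164 | "  <thead>\n",
165 | "    <tr style=\"text-align: right;\">\n",
166 | "      <th></th>\n",
167 | "      <th>file_id</th>\n",
168 | "      <th>label</th>\n",
169 | "      <th>api</th>\n",
170 | "      <th>tid</th>\n",
171 | "      <th>index</th>\n",
172 | "    </tr>\n",
173 | "  </thead>\n",
174 | "  <tbody>\n",
175 | "    <tr>\n",
176 | "      <th>0</th>\n",
177 | "      <td>1</td>\n",
178 | "      <td>5</td>\n",
179 | "      <td>LdrLoadDll</td>\n",
180 | "      <td>2488</td>\n",
181 | "      <td>0</td>\n",
182 | "    </tr>\n",
183 | "    <tr>\n",
184 | "      <th>1</th>\n",
185 | "      <td>1</td>\n",
186 | "      <td>5</td>\n",
187 | "      <td>LdrGetProcedureAddress</td>\n",
188 | "      <td>2488</td>\n",
189 | "      <td>1</td>\n",
190 | "    </tr>\n",
191 | "    <tr>\n",
192 | "      <th>2</th>\n",
193 | "      <td>1</td>\n",
194 | "      <td>5</td>\n",
195 | "      <td>LdrGetProcedureAddress</td>\n",
196 | "      <td>2488</td>\n",
197 | "      <td>2</td>\n",
198 | "    </tr>\n",
199 | "    <tr>\n",
200 | "      <th>3</th>\n",
201 | "      <td>1</td>\n",
202 | "      <td>5</td>\n",
203 | "      <td>LdrGetProcedureAddress</td>\n",
204 | "      <td>2488</td>\n",
205 | "      <td>3</td>\n",
206 | "    </tr>\n",
207 | "    <tr>\n",
208 | "      <th>4</th>\n",
209 | "      <td>1</td>\n",
210 | "      <td>5</td>\n",
211 | "      <td>LdrGetProcedureAddress</td>\n",
212 | "      <td>2488</td>\n",
213 | "      <td>4</td>\n",
214 | "    </tr>\n",
215 | "  </tbody>\n",
216 | "</table>\n",
217 | "</div>"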
218 | ],
219 | "text/plain": [
220 | "   file_id  label                     api   tid  index\n",
221 | "0        1      5              LdrLoadDll  2488      0\n",
222 | "1        1      5  LdrGetProcedureAddress  2488      1\n",
223 | "2        1      5  LdrGetProcedureAddress  2488      2\n",
224 | "3        1      5  LdrGetProcedureAddress  2488      3\n",
225 | "4        1      5  LdrGetProcedureAddress  2488      4"
226 | ]
227 | },
228 | "execution_count": 4,
229 | "metadata": {},
230 | "output_type": "execute_result"
231 | }
232 | ],
233 | "source": [
234 | "train.head()"
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {},
240 | "source": [
241 | "### 3.3.2 Feature engineering"
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": 5,
247 | "metadata": {},
248 | "outputs": [],
249 | "source": [
250 | "def simple_sts_features(df):\n",
251 | "    simple_fea = pd.DataFrame()\n",
252 | "    simple_fea['file_id'] = df['file_id'].unique()\n",
253 | "    simple_fea = simple_fea.sort_values('file_id')\n",
254 | "\n",
255 | "    df_grp = df.groupby('file_id')\n",
256 | "    simple_fea['file_id_api_count'] = df_grp['api'].count().values\n",
257 | "    simple_fea['file_id_api_nunique'] = df_grp['api'].nunique().values\n",
258 | "\n",
259 | "    simple_fea['file_id_tid_count'] = df_grp['tid'].count().values\n",
260 | "    simple_fea['file_id_tid_nunique'] = df_grp['tid'].nunique().values\n",
261 | "\n",
262 | "    simple_fea['file_id_index_count'] = df_grp['index'].count().values\n",
263 | "    simple_fea['file_id_index_nunique'] = df_grp['index'].nunique().values\n",
264 | "\n",
265 | "    return simple_fea"
266 | ]
267 | },
268 | {
269 | "cell_type": "code",
270 | "execution_count": 6,
271 | "metadata": {},
272 | "outputs": [
273 | {
274 | "name": "stdout",
275 | "output_type": "stream",
276 | "text": [
277 | "Wall time: 1.4 s\n"
278 | ]
279 | }
280 | ],
281 | "source": [
282 | "%%time\n",
283 | "simple_train_fea1 = simple_sts_features(train)"
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": 7,
289 | "metadata": {},
290 | "outputs": [
291 | {
292 | "name": "stdout",
293 | "output_type": "stream",
294 | "text": [
295 | "Wall time: 23.9 ms\n"
296 | ]
297 | }
298 | ],
299 | "source": [
300 | "%%time\n",
301 | "simple_test_fea1 = simple_sts_features(test)"
302 | ]
303 | },
304 | {
305 | "cell_type": "code",
306 | "execution_count": 8,
307 | "metadata": {},
308 | "outputs": [],
309 | "source": [
310 | "def simple_numerical_sts_features(df):\n",
311 | "    simple_numerical_fea = pd.DataFrame()\n",
312 | "    simple_numerical_fea['file_id'] = df['file_id'].unique()\n",
313 | "    simple_numerical_fea = simple_numerical_fea.sort_values('file_id')\n",
314 | "\n",
315 | "    df_grp = df.groupby('file_id')\n",
316 | "    # Per-file mean/min/std/max of the tid and index columns\n",
317 | "    simple_numerical_fea['file_id_tid_mean'] = df_grp['tid'].mean().values\n",
318 | "    simple_numerical_fea['file_id_tid_min'] = df_grp['tid'].min().values\n",
319 | "    simple_numerical_fea['file_id_tid_std'] = df_grp['tid'].std().values\n",
320 | "    simple_numerical_fea['file_id_tid_max'] = df_grp['tid'].max().values\n",
321 | "\n",
322 | "\n",
323 | "    simple_numerical_fea['file_id_index_mean'] = df_grp['index'].mean().values\n",
324 | "    simple_numerical_fea['file_id_index_min'] = df_grp['index'].min().values\n",
325 | "    simple_numerical_fea['file_id_index_std'] = df_grp['index'].std().values\n",
326 | "    simple_numerical_fea['file_id_index_max'] = df_grp['index'].max().values\n",
327 | "\n",
328 | "    return simple_numerical_fea"
329 | ]
330 | },
331 | {
332 | "cell_type": "code",
333 | "execution_count": 9,
334 | "metadata": {
335 | "scrolled": true
336 | },
337 | "outputs": [
338 | {
339 | "name": "stdout",
340 | "output_type": "stream",
341 | "text": [
342 | "Wall time: 172 ms\n"
343 | ]
344 | }
345 | ],
346 | "source": [
347 | "%%time\n",
348 | "simple_train_fea2 = simple_numerical_sts_features(train)"
349 | ]
350 | },
351 | {
352 | "cell_type": "code",
353 | "execution_count": 10,
354 | "metadata": {},
355 | "outputs": [
356 | {
357 | "name": "stdout",
358 | "output_type": "stream",
359 | "text": [
360 | "Wall time: 18 ms\n"
361 | ]
362 | }
363 | ],
364 | "source": [
365 | "%%time\n",
366 | "simple_test_fea2 = simple_numerical_sts_features(test)"
367 | ]
368 | },
369 | {
370 | "cell_type": "markdown",
371 | "metadata": {},
372 | "source": [
373 | "### 3.3.3 基线构建"
374 | ]
375 | },
376 | {
377 | "cell_type": "code",
378 | "execution_count": 11,
379 | "metadata": {},
380 | "outputs": [],
381 | "source": [
382 | "train_label = train[['file_id','label']].drop_duplicates(subset = ['file_id','label'], keep = 'first')\n",
383 | "test_submit = test[['file_id']].drop_duplicates(subset = ['file_id'], keep = 'first')"
384 | ]
385 | },
386 | {
387 | "cell_type": "code",
388 | "execution_count": 12,
389 | "metadata": {},
390 | "outputs": [],
391 | "source": [
392 | "# Build the training and test sets\n",
393 | "train_data = train_label.merge(simple_train_fea1, on ='file_id', how='left')\n",
394 | "train_data = train_data.merge(simple_train_fea2, on ='file_id', how='left')\n",
395 | "\n",
396 | "test_submit = test_submit.merge(simple_test_fea1, on ='file_id', how='left')\n",
397 | "test_submit = test_submit.merge(simple_test_fea2, on ='file_id', how='left')"
398 | ]
399 | },
400 | {
401 | "cell_type": "code",
402 | "execution_count": 14,
403 | "metadata": {},
404 | "outputs": [],
405 | "source": [
406 | "def lgb_logloss(preds,data):\n",
407 | "    labels_ = data.get_label()\n",
408 | "    classes_ = np.unique(labels_)\n",
409 | "    preds_prob = []\n",
410 | "    for i in range(len(classes_)):\n",
411 | "        preds_prob.append(preds[i*len(labels_):(i+1) * len(labels_)])\n",
412 | "    # The slicing above assumes preds is flat and class-major, and that every class occurs in labels_\n",
413 | "    preds_prob_ = np.vstack(preds_prob)\n",
414 | "\n",
415 | "    loss = []\n",
416 | "    for i in range(preds_prob_.shape[1]):  # number of samples\n",
417 | "        sum_ = 0\n",
418 | "        for j in range(preds_prob_.shape[0]):  # number of classes\n",
419 | "            pred = preds_prob_[j,i]  # predicted probability that sample i belongs to class j\n",
420 | "            if j == labels_[i]:\n",
421 | "                sum_ += np.log(pred)\n",
422 | "            else:\n",
423 | "                sum_ += np.log(1 - pred)\n",
424 | "        loss.append(sum_)\n",
425 | "    return 'loss is: ',-1 * (np.sum(loss) / preds_prob_.shape[1]),False"
426 | ]
427 | },
433 | {
434 | "cell_type": "code",
435 | "execution_count": 15,
436 | "metadata": {},
437 | "outputs": [],
438 | "source": [
439 | "# Model validation\n",
440 | "train_features = [col for col in train_data.columns if col not in ['label','file_id']]\n",
441 | "train_label = 'label'"
442 | ]
443 | },
444 | {
445 | "cell_type": "code",
446 | "execution_count": 16,
447 | "metadata": {
448 | "scrolled": true
449 | },
450 | "outputs": [
451 | {
452 | "name": "stdout",
453 | "output_type": "stream",
454 | "text": [
455 | "fold n°0\n",
456 | "Training until validation scores don't improve for 100 rounds\n",
457 | "[50]\ttraining's multi_logloss: 1.83717\ttraining's loss is: : 2.41456\tvalid_1's multi_logloss: 1.28536\tvalid_1's loss is: : 0.941228\n",
458 | "[100]\ttraining's multi_logloss: 1.83717\ttraining's loss is: : 2.41456\tvalid_1's multi_logloss: 1.28536\tvalid_1's loss is: : 0.941228\n",
459 | "Early stopping, best iteration is:\n",
460 | "[1]\ttraining's multi_logloss: 1.83717\ttraining's loss is: : 2.41456\tvalid_1's multi_logloss: 1.28536\tvalid_1's loss is: : 0.941228\n",
461 | "fold n°1\n",
462 | "Training until validation scores don't improve for 100 rounds\n",
463 | "[50]\ttraining's multi_logloss: 1.77226\ttraining's loss is: : 2.32838\tvalid_1's multi_logloss: 2.10695\tvalid_1's loss is: : 1.86109\n",
464 | "[100]\ttraining's multi_logloss: 1.77226\ttraining's loss is: : 2.32838\tvalid_1's multi_logloss: 2.10695\tvalid_1's loss is: : 1.86109\n",
465 | "Early stopping, best iteration is:\n",
466 | "[1]\ttraining's multi_logloss: 1.77226\ttraining's loss is: : 2.32838\tvalid_1's multi_logloss: 2.10695\tvalid_1's loss is: : 1.86109\n",
467 | "fold n°2\n",
468 | "Training until validation scores don't improve for 100 rounds\n",
469 | "[50]\ttraining's multi_logloss: 1.79063\ttraining's loss is: : 2.32093\tvalid_1's multi_logloss: 1.67573\tvalid_1's loss is: : 1.91268\n",
470 | "[100]\ttraining's multi_logloss: 1.79063\ttraining's loss is: : 2.32093\tvalid_1's multi_logloss: 1.67573\tvalid_1's loss is: : 1.91268\n",
471 | "Early stopping, best iteration is:\n",
472 | "[1]\ttraining's multi_logloss: 1.79063\ttraining's loss is: : 2.32093\tvalid_1's multi_logloss: 1.67573\tvalid_1's loss is: : 1.91268\n",
473 | "fold n°3\n",
474 | "Training until validation scores don't improve for 100 rounds\n",
475 | "[50]\ttraining's multi_logloss: 1.79651\ttraining's loss is: : 2.36572\tvalid_1's multi_logloss: 1.92355\tvalid_1's loss is: : 1.42824\n",
476 | "[100]\ttraining's multi_logloss: 1.79651\ttraining's loss is: : 2.36572\tvalid_1's multi_logloss: 1.92355\tvalid_1's loss is: : 1.42824\n",
477 | "Early stopping, best iteration is:\n",
478 | "[1]\ttraining's multi_logloss: 1.79651\ttraining's loss is: : 2.36572\tvalid_1's multi_logloss: 1.92355\tvalid_1's loss is: : 1.42824\n",
479 | "fold n°4\n",
480 | "Training until validation scores don't improve for 100 rounds\n",
481 | "[50]\ttraining's multi_logloss: 1.70379\ttraining's loss is: : 2.27265\tvalid_1's multi_logloss: 2.91788\tvalid_1's loss is: : 3.32694\n",
482 | "[100]\ttraining's multi_logloss: 1.70379\ttraining's loss is: : 2.27265\tvalid_1's multi_logloss: 2.91788\tvalid_1's loss is: : 3.32694\n",
483 | "Early stopping, best iteration is:\n",
484 | "[1]\ttraining's multi_logloss: 1.70379\ttraining's loss is: : 2.27265\tvalid_1's multi_logloss: 2.91788\tvalid_1's loss is: : 3.32694\n",
485 | "Wall time: 9.94 s\n"
486 | ]
487 | }
488 | ],
489 | "source": [
490 | "%%time\n",
491 | "from sklearn.model_selection import KFold\n",
492 | "params = {\n",
493 | "    'task':'train',\n",
494 | "    'num_leaves': 255,\n",
495 | "    'objective': 'multiclass',\n",
496 | "    'num_class': 8,\n",
497 | "    'min_data_in_leaf': 50,\n",
498 | "    'learning_rate': 0.05,\n",
499 | "    'feature_fraction': 0.85,\n",
500 | "    'bagging_fraction': 0.85,\n",
501 | "    'bagging_freq': 5,\n",
502 | "    'max_bin':128,\n",
503 | "    'random_state':100\n",
504 | "}\n",
505 | "\n",
506 | "folds = KFold(n_splits=5, shuffle=True, random_state=15)\n",
507 | "oof = np.zeros(len(train_data))\n",
508 | "\n",
509 | "predict_res = 0\n",
510 | "models = []\n",
511 | "for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_data)):\n",
512 | "    print(\"fold n°{}\".format(fold_))\n",
513 | "    trn_data = lgb.Dataset(train_data.iloc[trn_idx][train_features], label=train_data.iloc[trn_idx][train_label].values)\n",
514 | "    val_data = lgb.Dataset(train_data.iloc[val_idx][train_features], label=train_data.iloc[val_idx][train_label].values)\n",
515 | "\n",
516 | "    clf = lgb.train(params, trn_data, num_boost_round=2000,valid_sets=[trn_data,val_data], verbose_eval=50, early_stopping_rounds=100, feval=lgb_logloss)\n",
517 | "    models.append(clf)"
518 | ]
519 | },
596 | {
597 | "cell_type": "markdown",
598 | "metadata": {},
599 | "source": [
600 | "### 3.3.4 Feature Importance Analysis"
601 | ]
602 | },
603 | {
604 | "cell_type": "code",
605 | "execution_count": 18,
606 | "metadata": {},
607 | "outputs": [],
608 | "source": [
609 | "feature_importance = pd.DataFrame()\n",
610 | "feature_importance['fea_name'] = train_features\n",
611 | "feature_importance['fea_imp'] = clf.feature_importance()\n",
612 | "feature_importance = feature_importance.sort_values('fea_imp',ascending = False)"
613 | ]
614 | },
615 | {
616 | "cell_type": "code",
617 | "execution_count": 19,
618 | "metadata": {},
619 | "outputs": [
620 | {
621 | "data": {
622 | "text/plain": [
623 | ""
624 | ]
625 | },
626 | "execution_count": 19,
627 | "metadata": {},
628 | "output_type": "execute_result"
629 | },
630 | {
631 | "data": {
633 | "text/plain": [
634 | ""
635 | ]
636 | },
637 | "metadata": {
638 | "needs_background": "light"
639 | },
640 | "output_type": "display_data"
641 | }
642 | ],
643 | "source": [
644 | "plt.figure(figsize=[20, 10,])\n",
645 | "sns.barplot(x = feature_importance['fea_name'], y = feature_importance['fea_imp'])\n",
646 | "#sns.barplot(x=\"fea_name\",y=\"fea_imp\",data=feature_importance)"
647 | ]
648 | },
649 | {
650 | "cell_type": "markdown",
651 | "metadata": {},
652 | "source": [
653 | "### 3.3.5 Model Testing"
654 | ]
655 | },
656 | {
657 | "cell_type": "code",
658 | "execution_count": 20,
659 | "metadata": {},
660 | "outputs": [],
661 | "source": [
662 | "# Average the predicted class probabilities over all fold models\n",
663 | "pred_res = 0\n",
664 | "for model in models:\n",
665 | "    pred_res += model.predict(test_submit[train_features]) / len(models)"
666 | ]
667 | },
668 | {
669 | "cell_type": "code",
670 | "execution_count": 21,
671 | "metadata": {},
672 | "outputs": [],
673 | "source": [
674 | "test_submit['prob0'] = 0\n",
675 | "test_submit['prob1'] = 0\n",
676 | "test_submit['prob2'] = 0\n",
677 | "test_submit['prob3'] = 0\n",
678 | "test_submit['prob4'] = 0\n",
679 | "test_submit['prob5'] = 0\n",
680 | "test_submit['prob6'] = 0\n",
681 | "test_submit['prob7'] = 0"
682 | ]
683 | },
684 | {
685 | "cell_type": "code",
686 | "execution_count": 22,
687 | "metadata": {
688 | "scrolled": true
689 | },
690 | "outputs": [],
691 | "source": [
692 | "test_submit[['prob0','prob1','prob2','prob3','prob4','prob5','prob6','prob7']] = pred_res\n",
693 | "test_submit[['file_id','prob0','prob1','prob2','prob3','prob4','prob5','prob6','prob7']].to_csv('baseline.csv', index=False)"
694 | ]
695 | }
696 | ],
697 | "metadata": {
698 | "kernelspec": {
699 | "display_name": "Python 3",
700 | "language": "python",
701 | "name": "python3"
702 | },
703 | "language_info": {
704 | "codemirror_mode": {
705 | "name": "ipython",
706 | "version": 3
707 | },
708 | "file_extension": ".py",
709 | "mimetype": "text/x-python",
710 | "name": "python",
711 | "nbconvert_exporter": "python",
712 | "pygments_lexer": "ipython3",
713 | "version": "3.7.1"
714 | },
715 | "latex_envs": {
716 | "LaTeX_envs_menu_present": true,
717 | "autoclose": false,
718 | "autocomplete": true,
719 | "bibliofile": "biblio.bib",
720 | "cite_by": "apalike",
721 | "current_citInitial": 1,
722 | "eqLabelWithNumbers": true,
723 | "eqNumInitial": 1,
724 | "hotkeys": {
725 | "equation": "Ctrl-E",
726 | "itemize": "Ctrl-I"
727 | },
728 | "labels_anchors": false,
729 | "latex_user_defs": false,
730 | "report_style_numbering": false,
731 | "user_envs_cfg": false
732 | },
733 | "toc": {
734 | "nav_menu": {},
735 | "number_sections": true,
736 | "sideBar": true,
737 | "skip_h1_title": false,
738 | "title_cell": "Table of Contents",
739 | "title_sidebar": "Contents",
740 | "toc_cell": true,
741 | "toc_position": {
742 | "height": "calc(100% - 180px)",
743 | "left": "10px",
744 | "top": "150px",
745 | "width": "384px"
746 | },
747 | "toc_section_display": true,
748 | "toc_window_display": true
749 | }
750 | },
751 | "nbformat": 4,
752 | "nbformat_minor": 2
753 | }
754 |
--------------------------------------------------------------------------------