├── O2O优惠券预测
│   ├── O2O优惠券预测-数据探索.ipynb
│   ├── O2O优惠券预测-模型训练.ipynb
│   ├── O2O优惠券预测-模型验证及优化.ipynb
│   ├── O2O优惠券预测-特征工程.ipynb
│   └── O2O优惠券预测-赛题实践.ipynb
├── README.md
├── 天猫重复购买预测
│   ├── 天猫重复购买预测 02 数据探索.ipynb
│   ├── 天猫重复购买预测 03 特征工程.ipynb
│   ├── 天猫重复购买预测 04 模型训练、验证和评测.ipynb
│   └── 天猫重复购买预测 05 特征优化和特征选择.ipynb
├── 工业蒸汽
│   ├── zhengqi_test.txt
│   ├── zhengqi_train.txt
│   ├── 工业蒸汽 02数据探索.ipynb
│   ├── 工业蒸汽 03 特征工程.ipynb
│   ├── 工业蒸汽 04 模型训练.ipynb
│   ├── 工业蒸汽 05 模型验证.ipynb
│   ├── 工业蒸汽 06 特征优化.ipynb
│   └── 工业蒸汽 07 模型融合.ipynb
└── 阿里云安全恶意程序检测
    ├── 阿里云安全恶意程序检测-优化技巧与解决方案升级.ipynb
    ├── 阿里云安全恶意程序检测-数据探索.ipynb
    ├── 阿里云安全恶意程序检测-特征工程与基线模型.ipynb
    ├── 阿里云安全恶意程序检测-特征工程进阶与方案优化.ipynb
    └── 阿里云安全恶意程序检测-高阶数据探索.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | # alibaba_tianchi_book
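2 | Code for the book "阿里云天池大赛赛题解析" (Alibaba Cloud Tianchi Competition Problem Analysis). Link to the original book code: https://tianchi.aliyun.com/specials/promotion/bookcode?spm=5176.14154004.J_3941670930.15.31fe5699wu5Gtm
3 |
4 | For easier browsing, I have reorganized the code into four projects.
5 |
6 | The code is high quality and goes into real depth. In my opinion, this book should be sufficient for intermediate-level machine learning study.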
7 |
8 |
--------------------------------------------------------------------------------
/天猫重复购买预测/天猫重复购买预测 05 特征优化和特征选择.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Import packages"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "import pandas as pd\n",
17 | "import numpy as np\n",
18 | "\n",
19 | "import warnings\n",
20 | "warnings.filterwarnings(\"ignore\") "
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "## Read data (first 10,000 rows of the training data, first 100 rows of the test data)"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": 2,
33 | "metadata": {},
34 | "outputs": [],
35 | "source": [
36 | "train_data = pd.read_csv('train_all.csv',nrows=10000)\n",
37 | "test_data = pd.read_csv('test_all.csv',nrows=100)"
38 | ]
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "metadata": {},
43 | "source": [
44 | "## Read the full dataset"
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": 3,
50 | "metadata": {},
51 | "outputs": [],
52 | "source": [
53 | "# train_data = pd.read_csv('train_all.csv',nrows=None)\n",
54 | "# test_data = pd.read_csv('test_all.csv',nrows=None)"
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "metadata": {},
60 | "source": [
61 | "## Build the training and test arrays"
62 | ]
63 | },
64 | {
65 | "cell_type": "code",
66 | "execution_count": 4,
67 | "metadata": {},
68 | "outputs": [],
69 | "source": [
70 | "features_columns = [col for col in train_data.columns if col not in ['user_id','label']]\n",
71 | "train = train_data[features_columns].values\n",
72 | "test = test_data[features_columns].values\n",
73 | "target =train_data['label'].values"
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "## Missing value imputation"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {},
86 | "source": [
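87 | "There are many ways to handle missing values; the most common are:\n",
88 | "1. Deletion. This works when the dataset is large or when only a small fraction of the data is missing.\n",
89 | "2. Imputation. The usual approach is to fill with the mean or median; interpolation or model-based prediction can also be used.\n",
90 | "3. No handling. Tree-based models are relatively insensitive to missing values."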
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {},
96 | "source": [
97 | "#### Fill with the median"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 5,
103 | "metadata": {},
104 | "outputs": [],
105 | "source": [
106 | "# from sklearn.preprocessing import Imputer\n",
107 | "# imputer = Imputer(strategy=\"median\")\n",
108 | "\n",
109 | "from sklearn.impute import SimpleImputer\n",
110 | "\n",
111 | "imputer = SimpleImputer(missing_values=np.nan, strategy='median')\n",
112 | "imputer = imputer.fit(train)\n",
113 | "train_imputer = imputer.transform(train)\n",
114 | "test_imputer = imputer.transform(test)"
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "## Feature selection concepts"
122 | ]
123 | },
124 | {
125 | "cell_type": "markdown",
126 | "metadata": {},
127 | "source": [
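128 | "In machine learning and statistics, feature selection (also called variable selection, attribute selection, or variable subset selection) is the process of selecting a subset of relevant features (attributes, indicators) for building a model. There are three reasons to use feature selection techniques:\n",
129 | "\n",
130 | "- simplify the model so that researchers or users can interpret it more easily,\n",
131 | "- shorten training time,\n",
132 | "- improve generalization and reduce overfitting (i.e. reduce variance).\n",
133 | "\n",
134 | "The key assumption behind feature selection is that the training data contain many redundant or irrelevant features, so removing them loses little information. Redundant and irrelevant are two distinct notions: a feature that is useful on its own may still be redundant if it is strongly correlated with another useful feature that is also present in the data.\n",
135 | "Feature selection differs from feature extraction. Feature extraction creates new features from functions of the original features, whereas feature selection returns only a subset of the original features. Feature selection is often used in domains with many features but comparatively few samples (data points). Typical use cases include the analysis of written text and microarray data, where there are thousands of features but only tens to hundreds of samples."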
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": 6,
141 | "metadata": {},
142 | "outputs": [],
143 | "source": [
144 | "from sklearn.model_selection import cross_val_score\n",
145 | "from sklearn.ensemble import RandomForestClassifier\n",
146 | "\n",
147 | "def feature_selection(train, train_sel, target):\n",
148 | " clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)\n",
149 | " \n",
150 | " scores = cross_val_score(clf, train, target, cv=5)\n",
151 | " scores_sel = cross_val_score(clf, train_sel, target, cv=5)\n",
152 | " \n",
153 | " print(\"No Select Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2)) \n",
154 | " print(\"Features Select Accuracy: %0.2f (+/- %0.2f)\" % (scores_sel.mean(), scores_sel.std() * 2))"
155 | ]
156 | },
157 | {
158 | "cell_type": "markdown",
159 | "metadata": {},
160 | "source": [
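161 | "### Removing low-variance features (Method 1)\n",
162 | "VarianceThreshold is a simple baseline feature selection method. It removes every feature whose variance does not meet a given threshold. By default it removes all zero-variance features, i.e. features that have the same value in every sample."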
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": 7,
168 | "metadata": {},
169 | "outputs": [
170 | {
171 | "name": "stdout",
172 | "output_type": "stream",
173 | "text": [
174 | "训练数据未特征筛选维度 (2000, 229)\n",
175 | "训练数据特征筛选维度后 (2000, 29)\n"
176 | ]
177 | }
178 | ],
179 | "source": [
180 | "from sklearn.feature_selection import VarianceThreshold\n",
181 | "\n",
182 | "sel = VarianceThreshold(threshold=(.8 * (1 - .8)))\n",
183 | "sel = sel.fit(train)\n",
184 | "train_sel = sel.transform(train)\n",
185 | "test_sel = sel.transform(test)\n",
186 | "print('训练数据未特征筛选维度', train.shape)\n",
187 | "print('训练数据特征筛选维度后', train_sel.shape)"
188 | ]
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
194 | "### Comparison before and after feature selection"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": 8,
200 | "metadata": {},
201 | "outputs": [
202 | {
203 | "name": "stdout",
204 | "output_type": "stream",
205 | "text": [
206 | "No Select Accuracy: 0.93 (+/- 0.00)\n",
207 | "Features Select Accuracy: 0.93 (+/- 0.00)\n"
208 | ]
209 | }
210 | ],
211 | "source": [
212 | "feature_selection(train, train_sel, target)"
213 | ]
214 | },
215 | {
216 | "cell_type": "markdown",
217 | "metadata": {},
218 | "source": [
219 | "### Univariate feature selection (Method 2)\n",
220 | "Selects the best features based on univariate statistical tests."
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": 9,
226 | "metadata": {},
227 | "outputs": [
228 | {
229 | "name": "stdout",
230 | "output_type": "stream",
231 | "text": [
232 | "训练数据未特征筛选维度 (2000, 229)\n",
233 | "训练数据特征筛选维度后 (2000, 2)\n"
234 | ]
235 | }
236 | ],
237 | "source": [
238 | "from sklearn.feature_selection import SelectKBest\n",
239 | "# from sklearn.feature_selection import chi2\n",
240 | "from sklearn.feature_selection import mutual_info_classif\n",
241 | "\n",
242 | "sel = SelectKBest(mutual_info_classif, k=2)\n",
243 | "sel = sel.fit(train, target)\n",
244 | "train_sel = sel.transform(train)\n",
245 | "test_sel = sel.transform(test)\n",
246 | "print('训练数据未特征筛选维度', train.shape)\n",
247 | "print('训练数据特征筛选维度后', train_sel.shape)"
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": 10,
253 | "metadata": {},
254 | "outputs": [
255 | {
256 | "name": "stdout",
257 | "output_type": "stream",
258 | "text": [
259 | "训练数据未特征筛选维度 (2000, 229)\n",
260 | "训练数据特征筛选维度后 (2000, 10)\n"
261 | ]
262 | }
263 | ],
264 | "source": [
265 | "sel = SelectKBest(mutual_info_classif, k=10)\n",
266 | "sel = sel.fit(train, target)\n",
267 | "train_sel = sel.transform(train)\n",
268 | "test_sel = sel.transform(test)\n",
269 | "print('训练数据未特征筛选维度', train.shape)\n",
270 | "print('训练数据特征筛选维度后', train_sel.shape)"
271 | ]
272 | },
273 | {
274 | "cell_type": "markdown",
275 | "metadata": {},
276 | "source": [
277 | "### Comparison before and after feature selection"
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": 11,
283 | "metadata": {},
284 | "outputs": [
285 | {
286 | "name": "stdout",
287 | "output_type": "stream",
288 | "text": [
289 | "No Select Accuracy: 0.93 (+/- 0.00)\n",
290 | "Features Select Accuracy: 0.93 (+/- 0.00)\n"
291 | ]
292 | }
293 | ],
294 | "source": [
295 | "feature_selection(train, train_sel, target)"
296 | ]
297 | },
298 | {
299 | "cell_type": "markdown",
300 | "metadata": {},
301 | "source": [
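302 | "### Recursive feature elimination (Method 3)\n",
303 | "Fit the chosen model, remove the lowest-scoring features, and repeat this fit-and-remove loop on the remaining features."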
304 | ]
305 | },
306 | {
307 | "cell_type": "code",
308 | "execution_count": 12,
309 | "metadata": {},
310 | "outputs": [
311 | {
312 | "name": "stdout",
313 | "output_type": "stream",
314 | "text": [
315 | "[False False False False False False False False False False False False\n",
316 | " False False False False False False False False False False False False\n",
317 | " False False False False False False False False False False False False\n",
318 | " False False False False False False False False False False False False\n",
319 | " False False False False False False False False False False False False\n",
320 | " False False False False False False False False False False False False\n",
321 | " False False False False False False False False False False False False\n",
322 | " False False False False False False False False False False False False\n",
323 | " False False False False False False False False False False False False\n",
324 | " False False False False False False False False False False False False\n",
325 | " False False False False False False False False False False False False\n",
326 | " False False False False False False False False False False False False\n",
327 | " False False False False False False False False False False False False\n",
328 | " False False False False False False False False False False False False\n",
329 | " False False False False False False False False False False False True\n",
330 | " True False False False False False False False False False False True\n",
331 | " False True False False False False True False False True False False\n",
332 | " False True False True False True False True False False False False\n",
333 | " False False False False False False False False False False False False\n",
334 | " False]\n",
335 | "[220 219 218 217 216 215 213 212 211 210 209 208 207 206 205 204 203 202\n",
336 | " 201 200 197 195 192 187 186 185 184 183 182 181 180 179 178 177 176 175\n",
337 | " 174 173 172 171 170 169 168 167 166 165 164 163 162 161 160 158 157 155\n",
338 | " 154 153 152 151 150 149 148 147 146 145 144 143 142 141 140 139 137 136\n",
339 | " 134 133 132 131 130 129 128 127 126 125 124 123 122 121 120 117 116 115\n",
340 | " 114 113 112 111 110 109 108 107 106 105 104 103 102 101 100 99 98 97\n",
341 | " 95 94 93 92 91 90 89 88 87 214 86 85 84 83 189 193 199 198\n",
342 | " 196 194 191 190 188 81 80 79 77 74 73 72 71 69 68 67 66 65\n",
343 | " 64 156 63 61 60 59 58 57 55 54 138 135 50 48 46 45 44 42\n",
344 | " 39 119 118 38 35 31 28 25 24 22 17 15 14 96 13 12 43 1\n",
345 | " 1 4 82 75 78 76 26 30 70 20 7 1 62 1 51 53 49 47\n",
346 | " 1 27 41 1 23 21 18 1 11 1 19 1 6 1 8 29 2 5\n",
347 | " 10 3 9 16 32 33 34 36 37 40 52 56 159]\n"
348 | ]
349 | }
350 | ],
351 | "source": [
352 | "from sklearn.feature_selection import RFECV\n",
353 | "from sklearn.ensemble import RandomForestClassifier\n",
354 | "\n",
355 | "clf = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=0, n_jobs=-1)\n",
356 | "selector = RFECV(clf, step=1, cv=2)\n",
357 | "selector = selector.fit(train, target)\n",
358 | "print(selector.support_)\n",
359 | "print(selector.ranking_)"
360 | ]
361 | },
362 | {
363 | "cell_type": "markdown",
364 | "metadata": {},
365 | "source": [
366 | "### Feature selection using a model (Method 4)"
367 | ]
368 | },
369 | {
370 | "cell_type": "markdown",
371 | "metadata": {},
372 | "source": [
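373 | "#### Variable selection using fitted LR coefficients (L2-norm penalty)\n",
374 | "Fit a logistic regression (LR) model and use its fitted coefficients to select the variables with the largest influence on the target."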
375 | ]
376 | },
377 | {
378 | "cell_type": "code",
379 | "execution_count": 13,
380 | "metadata": {},
381 | "outputs": [
382 | {
383 | "name": "stdout",
384 | "output_type": "stream",
385 | "text": [
386 | "训练数据未特征筛选维度 (2000, 229)\n",
387 | "训练数据特征筛选维度后 (2000, 19)\n"
388 | ]
389 | }
390 | ],
391 | "source": [
392 | "from sklearn.feature_selection import SelectFromModel\n",
393 | "from sklearn.linear_model import LogisticRegression\n",
394 | "from sklearn.preprocessing import Normalizer\n",
395 | "\n",
396 | "normalizer = Normalizer()\n",
397 | "normalizer = normalizer.fit(train) \n",
398 | "\n",
399 | "train_norm = normalizer.transform(train) \n",
400 | "test_norm = normalizer.transform(test)\n",
401 | "\n",
402 | "LR = LogisticRegression(penalty='l2',C=5)\n",
403 | "LR = LR.fit(train_norm, target)\n",
404 | "model = SelectFromModel(LR, prefit=True)\n",
405 | "train_sel = model.transform(train)\n",
406 | "test_sel = model.transform(test)\n",
407 | "print('训练数据未特征筛选维度', train.shape)\n",
408 | "print('训练数据特征筛选维度后', train_sel.shape)"
409 | ]
410 | },
411 | {
412 | "cell_type": "markdown",
413 | "metadata": {},
414 | "source": [
415 | "##### Coefficients under the L2 penalty"
416 | ]
417 | },
418 | {
419 | "cell_type": "code",
420 | "execution_count": 14,
421 | "metadata": {},
422 | "outputs": [
423 | {
424 | "data": {
425 | "text/plain": [
426 | "array([ 0.27519508, -0.02736226, -0.00522652, 0.90644126, -0.4310027 ,\n",
427 | " -0.25110925, -0.4058899 , 0.29059019, 0.10568508, -0.02731211])"
428 | ]
429 | },
430 | "execution_count": 14,
431 | "metadata": {},
432 | "output_type": "execute_result"
433 | }
434 | ],
435 | "source": [
436 | "LR.coef_[0][:10]"
437 | ]
438 | },
439 | {
440 | "cell_type": "markdown",
441 | "metadata": {},
442 | "source": [
443 | "### Comparison before and after feature selection"
444 | ]
445 | },
446 | {
447 | "cell_type": "code",
448 | "execution_count": 15,
449 | "metadata": {},
450 | "outputs": [
451 | {
452 | "name": "stdout",
453 | "output_type": "stream",
454 | "text": [
455 | "No Select Accuracy: 0.93 (+/- 0.00)\n",
456 | "Features Select Accuracy: 0.93 (+/- 0.00)\n"
457 | ]
458 | }
459 | ],
460 | "source": [
461 | "feature_selection(train, train_sel, target)"
462 | ]
463 | },
464 | {
465 | "cell_type": "markdown",
466 | "metadata": {},
467 | "source": [
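468 | "#### Variable selection using fitted LR coefficients (L1-norm penalty)\n",
469 | "Fit a logistic regression (LR) model and use its fitted coefficients to select the variables with the largest influence on the target."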
470 | ]
471 | },
472 | {
473 | "cell_type": "code",
474 | "execution_count": 16,
475 | "metadata": {},
476 | "outputs": [],
477 | "source": [
478 | "# from sklearn.feature_selection import SelectFromModel\n",
479 | "# from sklearn.linear_model import LogisticRegression\n",
480 | "# from sklearn.preprocessing import Normalizer\n",
481 | "\n",
482 | "# normalizer = Normalizer()\n",
483 | "# normalizer = normalizer.fit(train) \n",
484 | "\n",
485 | "# train_norm = normalizer.transform(train) \n",
486 | "# test_norm = normalizer.transform(test)\n",
487 | "\n",
488 | "# LR = LogisticRegression(penalty='l1', C=5, solver='liblinear')\n",
489 | "# LR = LR.fit(train_norm, target)\n",
490 | "# model = SelectFromModel(LR, prefit=True)\n",
491 | "# train_sel = model.transform(train)\n",
492 | "# test_sel = model.transform(test)\n",
493 | "# print('训练数据未特征筛选维度', train.shape)\n",
494 | "# print('训练数据特征筛选维度后', train_sel.shape)"
495 | ]
496 | },
497 | {
498 | "cell_type": "markdown",
499 | "metadata": {},
500 | "source": [
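501 | "##### Coefficients under the L1 penalty\n",
502 | "With a well-chosen α, and provided certain conditions are met, the Lasso can fully recover the exact set of non-zero variables from only a few observations."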
503 | ]
504 | },
505 | {
506 | "cell_type": "code",
507 | "execution_count": 17,
508 | "metadata": {},
509 | "outputs": [],
510 | "source": [
511 | "# LR.coef_[0][:10]"
512 | ]
513 | },
514 | {
515 | "cell_type": "markdown",
516 | "metadata": {},
517 | "source": [
518 | "### Comparison before and after feature selection"
519 | ]
520 | },
521 | {
522 | "cell_type": "code",
523 | "execution_count": 18,
524 | "metadata": {},
525 | "outputs": [
526 | {
527 | "name": "stdout",
528 | "output_type": "stream",
529 | "text": [
530 | "No Select Accuracy: 0.93 (+/- 0.00)\n",
531 | "Features Select Accuracy: 0.93 (+/- 0.00)\n"
532 | ]
533 | }
534 | ],
535 | "source": [
536 | "feature_selection(train, train_sel, target)"
537 | ]
538 | },
539 | {
540 | "cell_type": "markdown",
541 | "metadata": {},
542 | "source": [
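543 | "### Tree-based feature selection\n",
544 | "Tree models rank features by the total score each feature accumulates from the split criterion, and features are then selected according to that ranking."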
545 | ]
546 | },
547 | {
548 | "cell_type": "code",
549 | "execution_count": 19,
550 | "metadata": {},
551 | "outputs": [
552 | {
553 | "name": "stdout",
554 | "output_type": "stream",
555 | "text": [
556 | "训练数据未特征筛选维度 (2000, 229)\n",
557 | "训练数据特征筛选维度后 (2000, 71)\n"
558 | ]
559 | }
560 | ],
561 | "source": [
562 | "from sklearn.ensemble import ExtraTreesClassifier\n",
563 | "from sklearn.feature_selection import SelectFromModel\n",
564 | "\n",
565 | "clf = ExtraTreesClassifier(n_estimators=50)\n",
566 | "clf = clf.fit(train, target)\n",
567 | "\n",
568 | "model = SelectFromModel(clf, prefit=True)\n",
569 | "train_sel = model.transform(train)\n",
570 | "test_sel = model.transform(test)\n",
571 | "print('训练数据未特征筛选维度', train.shape)\n",
572 | "print('训练数据特征筛选维度后', train_sel.shape)"
573 | ]
574 | },
575 | {
576 | "cell_type": "markdown",
577 | "metadata": {},
578 | "source": [
579 | "#### Tree feature importances"
580 | ]
581 | },
582 | {
583 | "cell_type": "code",
584 | "execution_count": 20,
585 | "metadata": {},
586 | "outputs": [
587 | {
588 | "data": {
589 | "text/plain": [
590 | "array([0.09210871, 0.00578114, 0.00388741, 0.0047027 , 0.00324662,\n",
591 | " 0.00409547, 0.00560588, 0.00399393, 0.00499705, 0.00233944])"
592 | ]
593 | },
594 | "execution_count": 20,
595 | "metadata": {},
596 | "output_type": "execute_result"
597 | }
598 | ],
599 | "source": [
600 | "clf.feature_importances_[:10]"
601 | ]
602 | },
603 | {
604 | "cell_type": "code",
605 | "execution_count": 21,
606 | "metadata": {},
607 | "outputs": [],
608 | "source": [
609 | "df_features_import = pd.DataFrame()\n",
610 | "df_features_import['features_import'] = clf.feature_importances_\n",
611 | "df_features_import['features_name'] = features_columns"
612 | ]
613 | },
614 | {
615 | "cell_type": "code",
616 | "execution_count": 22,
617 | "metadata": {},
618 | "outputs": [
619 | {
620 | "data": {
799 | "text/plain": [
800 | " features_import features_name\n",
801 | "0 0.092109 merchant_id\n",
802 | "228 0.085244 xgb_clf\n",
803 | "227 0.056583 lgb_clf\n",
804 | "199 0.007003 embeeding_72\n",
805 | "179 0.006930 embeeding_52\n",
806 | "18 0.006444 seller_most_1_cnt\n",
807 | "207 0.006367 embeeding_80\n",
808 | "193 0.006110 embeeding_66\n",
809 | "190 0.006107 embeeding_63\n",
810 | "132 0.006077 embeeding_5\n",
811 | "144 0.005996 embeeding_17\n",
812 | "146 0.005913 embeeding_19\n",
813 | "1 0.005781 age_range\n",
814 | "158 0.005715 embeeding_31\n",
815 | "191 0.005701 embeeding_64\n",
816 | "165 0.005673 embeeding_38\n",
817 | "15 0.005648 cat_most_1\n",
818 | "6 0.005606 brand_nunique\n",
819 | "22 0.005488 user_cnt_0\n",
820 | "220 0.005485 embeeding_93\n",
821 | "166 0.005473 embeeding_39\n",
822 | "87 0.005472 tfidf_60\n",
823 | "127 0.005463 embeeding_0\n",
824 | "196 0.005427 embeeding_69\n",
825 | "205 0.005407 embeeding_78\n",
826 | "147 0.005402 embeeding_20\n",
827 | "163 0.005347 embeeding_36\n",
828 | "192 0.005328 embeeding_65\n",
829 | "169 0.005280 embeeding_42\n",
830 | "50 0.005277 tfidf_23"
831 | ]
832 | },
833 | "execution_count": 22,
834 | "metadata": {},
835 | "output_type": "execute_result"
836 | }
837 | ],
838 | "source": [
839 | "df_features_import.sort_values(['features_import'], ascending=False).head(30)"
840 | ]
841 | },
842 | {
843 | "cell_type": "code",
844 | "execution_count": 23,
845 | "metadata": {},
846 | "outputs": [],
847 | "source": [
848 | "# features_columns"
849 | ]
850 | },
851 | {
852 | "cell_type": "markdown",
853 | "metadata": {},
854 | "source": [
855 | "### Comparison before and after feature selection"
856 | ]
857 | },
858 | {
859 | "cell_type": "code",
860 | "execution_count": 24,
861 | "metadata": {},
862 | "outputs": [
863 | {
864 | "name": "stdout",
865 | "output_type": "stream",
866 | "text": [
867 | "No Select Accuracy: 0.93 (+/- 0.00)\n",
868 | "Features Select Accuracy: 0.93 (+/- 0.00)\n"
869 | ]
870 | }
871 | ],
872 | "source": [
873 | "feature_selection(train, train_sel, target)"
874 | ]
875 | },
876 | {
877 | "cell_type": "markdown",
878 | "metadata": {},
879 | "source": [
880 | "### LightGBM feature importances"
881 | ]
882 | },
883 | {
884 | "cell_type": "code",
885 | "execution_count": 25,
886 | "metadata": {},
887 | "outputs": [
888 | {
889 | "name": "stdout",
890 | "output_type": "stream",
891 | "text": [
892 | "[LightGBM] [Warning] Unknown parameter: colsample_bylevel\n",
893 | "[LightGBM] [Warning] Unknown parameter: tree_method\n",
894 | "[LightGBM] [Warning] Unknown parameter: silent\n",
895 | "[LightGBM] [Warning] Unknown parameter: colsample_bylevel\n",
896 | "[LightGBM] [Warning] Unknown parameter: tree_method\n",
897 | "[LightGBM] [Warning] Unknown parameter: silent\n",
898 | "[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006242 seconds.\n",
899 | "You can set `force_row_wise=true` to remove the overhead.\n",
900 | "And if memory is not enough, you can set `force_col_wise=true`.\n",
901 | "[LightGBM] [Info] Total Bins 32114\n",
902 | "[LightGBM] [Info] Number of data points in the train set: 1200, number of used features: 224\n",
903 | "[LightGBM] [Warning] Unknown parameter: colsample_bylevel\n",
904 | "[LightGBM] [Warning] Unknown parameter: tree_method\n",
905 | "[LightGBM] [Warning] Unknown parameter: silent\n",
906 | "[LightGBM] [Info] Start training from score -0.068100\n",
907 | "[LightGBM] [Info] Start training from score -2.720629\n",
908 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
909 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
910 | "[1]\tvalid_0's multi_logloss: 0.256738\n",
911 | "Training until validation scores don't improve for 100 rounds\n",
912 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
913 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
914 | "[2]\tvalid_0's multi_logloss: 0.256574\n",
915 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
916 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
917 | "[3]\tvalid_0's multi_logloss: 0.256518\n",
918 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
919 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
920 | "[4]\tvalid_0's multi_logloss: 0.25657\n",
921 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
922 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
923 | "[5]\tvalid_0's multi_logloss: 0.256756\n",
924 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
925 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
926 | "[6]\tvalid_0's multi_logloss: 0.25682\n",
927 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
928 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
929 | "[7]\tvalid_0's multi_logloss: 0.256989\n",
930 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
931 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
932 | "[8]\tvalid_0's multi_logloss: 0.257236\n",
933 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
934 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
935 | "[9]\tvalid_0's multi_logloss: 0.25712\n",
936 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
937 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
938 | "[10]\tvalid_0's multi_logloss: 0.257011\n",
939 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
940 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
941 | "[11]\tvalid_0's multi_logloss: 0.257042\n",
942 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
943 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
944 | "[12]\tvalid_0's multi_logloss: 0.257402\n",
945 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
946 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
947 | "[13]\tvalid_0's multi_logloss: 0.257419\n",
948 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
949 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
950 | "[14]\tvalid_0's multi_logloss: 0.257646\n",
951 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
952 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
953 | "[15]\tvalid_0's multi_logloss: 0.257552\n",
954 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
955 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
956 | "[16]\tvalid_0's multi_logloss: 0.257604\n",
957 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
958 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
959 | "[17]\tvalid_0's multi_logloss: 0.257797\n",
960 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
961 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
962 | "[18]\tvalid_0's multi_logloss: 0.257928\n",
963 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
964 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
965 | "[19]\tvalid_0's multi_logloss: 0.258142\n",
966 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
967 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
968 | "[20]\tvalid_0's multi_logloss: 0.25847\n",
969 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
970 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
971 | "[21]\tvalid_0's multi_logloss: 0.258654\n",
972 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
973 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
974 | "[22]\tvalid_0's multi_logloss: 0.258846\n",
975 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
976 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
977 | "[23]\tvalid_0's multi_logloss: 0.258962\n",
978 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
979 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
980 | "[24]\tvalid_0's multi_logloss: 0.258991\n",
981 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
982 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
983 | "[25]\tvalid_0's multi_logloss: 0.259334\n",
984 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
985 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
986 | "[26]\tvalid_0's multi_logloss: 0.259433\n",
987 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
988 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
989 | "[27]\tvalid_0's multi_logloss: 0.259912\n",
990 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
991 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
992 | "[28]\tvalid_0's multi_logloss: 0.260153\n",
993 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
994 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
995 | "[29]\tvalid_0's multi_logloss: 0.260576\n",
996 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
997 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
998 | "[30]\tvalid_0's multi_logloss: 0.26094\n",
999 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1000 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1001 | "[31]\tvalid_0's multi_logloss: 0.261198\n",
1002 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1003 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1004 | "[32]\tvalid_0's multi_logloss: 0.26141\n",
1005 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1006 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1007 | "[33]\tvalid_0's multi_logloss: 0.261614\n",
1008 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1009 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1010 | "[34]\tvalid_0's multi_logloss: 0.261801\n",
1011 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1012 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1013 | "[35]\tvalid_0's multi_logloss: 0.261931\n",
1014 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1015 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1016 | "[36]\tvalid_0's multi_logloss: 0.262242\n",
1017 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1018 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1019 | "[37]\tvalid_0's multi_logloss: 0.262492\n",
1020 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1021 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1022 | "[38]\tvalid_0's multi_logloss: 0.26273\n",
1023 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1024 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1025 | "[39]\tvalid_0's multi_logloss: 0.262855\n",
1026 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1027 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1028 | "[40]\tvalid_0's multi_logloss: 0.263225\n",
1029 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1030 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1031 | "[41]\tvalid_0's multi_logloss: 0.263311\n",
1032 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1033 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1034 | "[42]\tvalid_0's multi_logloss: 0.263612\n",
1035 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n"
1036 | ]
1037 | },
1038 | {
1039 | "name": "stdout",
1040 | "output_type": "stream",
1041 | "text": [
1042 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1043 | "[43]\tvalid_0's multi_logloss: 0.263937\n",
1044 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1045 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1046 | "[44]\tvalid_0's multi_logloss: 0.264398\n",
1047 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1048 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1049 | "[45]\tvalid_0's multi_logloss: 0.264822\n",
1050 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1051 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1052 | "[46]\tvalid_0's multi_logloss: 0.264977\n",
1053 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1054 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1055 | "[47]\tvalid_0's multi_logloss: 0.265401\n",
1056 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1057 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1058 | "[48]\tvalid_0's multi_logloss: 0.265718\n",
1059 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1060 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1061 | "[49]\tvalid_0's multi_logloss: 0.265859\n",
1062 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1063 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1064 | "[50]\tvalid_0's multi_logloss: 0.266173\n",
1065 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1066 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1067 | "[51]\tvalid_0's multi_logloss: 0.266544\n",
1068 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1069 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1070 | "[52]\tvalid_0's multi_logloss: 0.266719\n",
1071 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1072 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1073 | "[53]\tvalid_0's multi_logloss: 0.266817\n",
1074 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1075 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1076 | "[54]\tvalid_0's multi_logloss: 0.267013\n",
1077 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1078 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1079 | "[55]\tvalid_0's multi_logloss: 0.267385\n",
1080 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1081 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1082 | "[56]\tvalid_0's multi_logloss: 0.267389\n",
1083 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1084 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1085 | "[57]\tvalid_0's multi_logloss: 0.267662\n",
1086 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1087 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1088 | "[58]\tvalid_0's multi_logloss: 0.267792\n",
1089 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1090 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1091 | "[59]\tvalid_0's multi_logloss: 0.268017\n",
1092 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1093 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1094 | "[60]\tvalid_0's multi_logloss: 0.268158\n",
1095 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1096 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1097 | "[61]\tvalid_0's multi_logloss: 0.268437\n",
1098 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1099 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1100 | "[62]\tvalid_0's multi_logloss: 0.268773\n",
1101 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1102 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1103 | "[63]\tvalid_0's multi_logloss: 0.268824\n",
1104 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1105 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1106 | "[64]\tvalid_0's multi_logloss: 0.269138\n",
1107 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1108 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1109 | "[65]\tvalid_0's multi_logloss: 0.269357\n",
1110 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1111 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1112 | "[66]\tvalid_0's multi_logloss: 0.269572\n",
1113 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1114 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1115 | "[67]\tvalid_0's multi_logloss: 0.269786\n",
1116 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1117 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1118 | "[68]\tvalid_0's multi_logloss: 0.270102\n",
1119 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1120 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1121 | "[69]\tvalid_0's multi_logloss: 0.270435\n",
1122 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1123 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1124 | "[70]\tvalid_0's multi_logloss: 0.270566\n",
1125 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1126 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1127 | "[71]\tvalid_0's multi_logloss: 0.270679\n",
1128 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1129 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1130 | "[72]\tvalid_0's multi_logloss: 0.271056\n",
1131 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1132 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1133 | "[73]\tvalid_0's multi_logloss: 0.271474\n",
1134 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1135 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1136 | "[74]\tvalid_0's multi_logloss: 0.27168\n",
1137 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1138 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1139 | "[75]\tvalid_0's multi_logloss: 0.271918\n",
1140 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1141 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1142 | "[76]\tvalid_0's multi_logloss: 0.271937\n",
1143 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1144 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1145 | "[77]\tvalid_0's multi_logloss: 0.272113\n",
1146 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1147 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1148 | "[78]\tvalid_0's multi_logloss: 0.27242\n",
1149 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1150 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1151 | "[79]\tvalid_0's multi_logloss: 0.272712\n",
1152 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1153 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1154 | "[80]\tvalid_0's multi_logloss: 0.27267\n",
1155 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1156 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1157 | "[81]\tvalid_0's multi_logloss: 0.273019\n",
1158 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1159 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1160 | "[82]\tvalid_0's multi_logloss: 0.272981\n",
1161 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1162 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1163 | "[83]\tvalid_0's multi_logloss: 0.273218\n",
1164 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1165 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1166 | "[84]\tvalid_0's multi_logloss: 0.27353\n",
1167 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1168 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1169 | "[85]\tvalid_0's multi_logloss: 0.273649\n",
1170 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1171 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1172 | "[86]\tvalid_0's multi_logloss: 0.273775\n",
1173 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1174 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1175 | "[87]\tvalid_0's multi_logloss: 0.273835\n",
1176 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1177 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1178 | "[88]\tvalid_0's multi_logloss: 0.274091\n",
1179 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1180 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1181 | "[89]\tvalid_0's multi_logloss: 0.274422\n",
1182 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1183 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1184 | "[90]\tvalid_0's multi_logloss: 0.274716\n",
1185 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1186 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1187 | "[91]\tvalid_0's multi_logloss: 0.275082\n",
1188 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n"
1189 | ]
1190 | },
1191 | {
1192 | "name": "stdout",
1193 | "output_type": "stream",
1194 | "text": [
1195 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1196 | "[92]\tvalid_0's multi_logloss: 0.275278\n",
1197 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1198 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1199 | "[93]\tvalid_0's multi_logloss: 0.275447\n",
1200 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1201 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1202 | "[94]\tvalid_0's multi_logloss: 0.275438\n",
1203 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1204 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1205 | "[95]\tvalid_0's multi_logloss: 0.275778\n",
1206 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1207 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1208 | "[96]\tvalid_0's multi_logloss: 0.27591\n",
1209 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1210 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1211 | "[97]\tvalid_0's multi_logloss: 0.276129\n",
1212 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1213 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1214 | "[98]\tvalid_0's multi_logloss: 0.276326\n",
1215 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1216 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1217 | "[99]\tvalid_0's multi_logloss: 0.276449\n",
1218 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1219 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1220 | "[100]\tvalid_0's multi_logloss: 0.276745\n",
1221 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1222 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1223 | "[101]\tvalid_0's multi_logloss: 0.276895\n",
1224 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1225 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1226 | "[102]\tvalid_0's multi_logloss: 0.276914\n",
1227 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1228 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
1229 | "[103]\tvalid_0's multi_logloss: 0.277281\n",
1230 | "Early stopping, best iteration is:\n",
1231 | "[3]\tvalid_0's multi_logloss: 0.256518\n"
1232 | ]
1233 | }
1234 | ],
1235 | "source": [
1236 | "import lightgbm\n",
1237 | "from sklearn.model_selection import train_test_split\n",
1238 | "\n",
1239 | "X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.4, random_state=0)\n",
1240 | "\n",
1241 | "clf = lightgbm\n",
1242 | "\n",
1243 | "train_matrix = clf.Dataset(X_train, label=y_train)\n",
1244 | "test_matrix = clf.Dataset(X_test, label=y_test)\n",
1245 | "params = {\n",
1246 | " 'boosting_type': 'gbdt',\n",
1247 | " #'boosting_type': 'dart',\n",
1248 | " 'objective': 'multiclass',\n",
1249 | " 'metric': 'multi_logloss',\n",
1250 | " 'min_child_weight': 1.5,\n",
1251 | " 'num_leaves': 2**5,\n",
1252 | " 'lambda_l2': 10,\n",
1253 | " 'subsample': 0.7,\n",
1254 | " 'colsample_bytree': 0.7,\n",
1255 | " #'colsample_bylevel': 0.7, # XGBoost parameter, not recognized by LightGBM\n",
1256 | " 'learning_rate': 0.03,\n",
1257 | " #'tree_method': 'exact', # XGBoost parameter, not recognized by LightGBM\n",
1258 | " 'seed': 2017,\n",
1259 | " \"num_class\": 2,\n",
1260 | " 'silent': True,\n",
1261 | " }\n",
1262 | "num_round = 10000\n",
1263 | "early_stopping_rounds = 100\n",
1264 | "model = clf.train(params, \n",
1265 | " train_matrix,\n",
1266 | " num_round,\n",
1267 | " valid_sets=test_matrix,\n",
1268 | " early_stopping_rounds=early_stopping_rounds)"
1269 | ]
1270 | },
1271 | {
1272 | "cell_type": "code",
1273 | "execution_count": 26,
1274 | "metadata": {},
1275 | "outputs": [],
1276 | "source": [
1277 | "def lgb_transform(train, test, model, topK):\n",
1278 | " train_df = pd.DataFrame(train)\n",
1279 | " train_df.columns = range(train.shape[1])\n",
1280 | " \n",
1281 | " test_df = pd.DataFrame(test)\n",
1282 | " test_df.columns = range(test.shape[1])\n",
1283 | " \n",
1284 | " features_import = pd.DataFrame()\n",
1285 | " features_import['importance'] = model.feature_importance()\n",
1286 | " features_import['col'] = range(train.shape[1])\n",
1287 | " \n",
1288 | " features_import = features_import.sort_values(['importance'], ascending=False).head(topK)\n",
1289 | " sel_col = list(features_import.col)\n",
1290 | " \n",
1291 | " train_sel = train_df[sel_col]\n",
1292 | " test_sel = test_df[sel_col]\n",
1293 | " return train_sel, test_sel"
1294 | ]
1295 | },
1296 | {
1297 | "cell_type": "code",
1298 | "execution_count": 27,
1299 | "metadata": {},
1300 | "outputs": [
1301 | {
1302 | "name": "stdout",
1303 | "output_type": "stream",
1304 | "text": [
1305 | "Training data shape before feature selection (2000, 229)\n",
1306 | "Training data shape after feature selection (2000, 20)\n"
1307 | ]
1308 | }
1309 | ],
1310 | "source": [
1311 | "train_sel, test_sel = lgb_transform(train, test, model, 20)\n",
1312 | "print('Training data shape before feature selection', train.shape)\n",
1313 | "print('Training data shape after feature selection', train_sel.shape)"
1314 | ]
1315 | },
1316 | {
1317 | "cell_type": "markdown",
1318 | "metadata": {},
1319 | "source": [
1320 | "### LightGBM feature importance"
1321 | ]
1322 | },
1323 | {
1324 | "cell_type": "code",
1325 | "execution_count": 28,
1326 | "metadata": {},
1327 | "outputs": [
1328 | {
1329 | "data": {
1330 | "text/plain": [
1331 | "array([2, 3, 0, 0, 0, 1, 1, 0, 1, 0])"
1332 | ]
1333 | },
1334 | "execution_count": 28,
1335 | "metadata": {},
1336 | "output_type": "execute_result"
1337 | }
1338 | ],
1339 | "source": [
1340 | "model.feature_importance()[:10]"
1341 | ]
1342 | },
1343 | {
1344 | "cell_type": "code",
1345 | "execution_count": 29,
1346 | "metadata": {},
1347 | "outputs": [],
1348 | "source": [
1349 | "#sorted(model.feature_importance(),reverse=True)[:10]"
1350 | ]
1351 | },
1352 | {
1353 | "cell_type": "markdown",
1354 | "metadata": {},
1355 | "source": [
1356 | "### Comparison before and after feature selection"
1357 | ]
1358 | },
1359 | {
1360 | "cell_type": "code",
1361 | "execution_count": 30,
1362 | "metadata": {},
1363 | "outputs": [
1364 | {
1365 | "name": "stdout",
1366 | "output_type": "stream",
1367 | "text": [
1368 | "No Select Accuracy: 0.93 (+/- 0.00)\n",
1369 | "Features Select Accuracy: 0.93 (+/- 0.00)\n"
1370 | ]
1371 | }
1372 | ],
1373 | "source": [
1374 | "feature_selection(train, train_sel, target)"
1375 | ]
1376 | },
1377 | {
1378 | "cell_type": "code",
1379 | "execution_count": null,
1380 | "metadata": {},
1381 | "outputs": [],
1382 | "source": []
1383 | }
1384 | ],
1385 | "metadata": {
1386 | "kernelspec": {
1387 | "display_name": "Python 3",
1388 | "language": "python",
1389 | "name": "python3"
1390 | },
1391 | "language_info": {
1392 | "codemirror_mode": {
1393 | "name": "ipython",
1394 | "version": 3
1395 | },
1396 | "file_extension": ".py",
1397 | "mimetype": "text/x-python",
1398 | "name": "python",
1399 | "nbconvert_exporter": "python",
1400 | "pygments_lexer": "ipython3",
1401 | "version": "3.7.1"
1402 | },
1403 | "latex_envs": {
1404 | "LaTeX_envs_menu_present": true,
1405 | "autoclose": false,
1406 | "autocomplete": true,
1407 | "bibliofile": "biblio.bib",
1408 | "cite_by": "apalike",
1409 | "current_citInitial": 1,
1410 | "eqLabelWithNumbers": true,
1411 | "eqNumInitial": 1,
1412 | "hotkeys": {
1413 | "equation": "Ctrl-E",
1414 | "itemize": "Ctrl-I"
1415 | },
1416 | "labels_anchors": false,
1417 | "latex_user_defs": false,
1418 | "report_style_numbering": false,
1419 | "user_envs_cfg": false
1420 | }
1421 | },
1422 | "nbformat": 4,
1423 | "nbformat_minor": 2
1424 | }
1425 |
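The `lgb_transform` helper in this notebook keeps the `topK` columns with the largest LightGBM importance scores. A minimal standalone sketch of that ranking step follows. It assumes a plain list of scores stands in for `model.feature_importance()`; the function name `top_k_feature_indices` is hypothetical, and the sample scores are the first ten values printed by the notebook.

```python
def top_k_feature_indices(importances, top_k):
    """Return the column indices of the top_k largest importance scores.

    Ties keep their original column order because sorted() is stable.
    """
    if top_k < 0:
        raise ValueError("top_k must be non-negative")
    # Sort column indices by descending importance.
    ranked = sorted(range(len(importances)), key=lambda i: -importances[i])
    return ranked[:top_k]


# First ten importance values shown in the notebook output.
sample_importances = [2, 3, 0, 0, 0, 1, 1, 0, 1, 0]
selected = top_k_feature_indices(sample_importances, 4)
print(selected)
```

Many columns here have an importance of zero, so which ones are kept among tied scores depends on the tie-breaking rule. The sketch breaks ties by column order, which may differ from what `sort_values` does with its default sort algorithm.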
--------------------------------------------------------------------------------
/工业蒸汽/工业蒸汽 06 特征优化.ipynb:
--------------------------------------------------------------------------------
1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.7.0"},"colab":{"name":"工业蒸汽 06 特征优化.ipynb","provenance":[],"collapsed_sections":[]}},"cells":[{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"wo_kOHZTLhVZ","executionInfo":{"status":"ok","timestamp":1623400030401,"user_tz":-480,"elapsed":22582,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}},"outputId":"e5f54b16-a3a0-4165-f896-bdd2384af00c"},"source":["from google.colab import drive\n","drive.mount('/content/drive')"],"execution_count":1,"outputs":[{"output_type":"stream","text":["Mounted at /content/drive\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"G6wonEdoLjSN","executionInfo":{"status":"ok","timestamp":1623400032113,"user_tz":-480,"elapsed":1729,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}},"outputId":"6ffe00a1-5ad1-41e6-8b04-5594798cdb7d"},"source":["%cd /content/drive/MyDrive/Colab Notebooks/天池/工业蒸汽"],"execution_count":2,"outputs":[{"output_type":"stream","text":["/content/drive/MyDrive/Colab Notebooks/天池/工业蒸汽\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"q_gDa5rQLdXM"},"source":["## Feature optimization\n","\n","### 
Import data"]},{"cell_type":"code","metadata":{"id":"qYQ3-RsyLdXW","executionInfo":{"status":"ok","timestamp":1623401062097,"user_tz":-480,"elapsed":556,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}}},"source":["import pandas as pd\n","\n","train_data_file = \"./zhengqi_train.txt\"\n","test_data_file = \"./zhengqi_test.txt\"\n","\n","train_data = pd.read_csv(train_data_file, sep='\\t', encoding='utf-8')\n","test_data = pd.read_csv(test_data_file, sep='\\t', encoding='utf-8')"],"execution_count":13,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"1d20kkI_LdXX"},"source":["### Define the feature-construction operations"]},{"cell_type":"code","metadata":{"id":"ayULiWwdLdXY","executionInfo":{"status":"ok","timestamp":1623401065107,"user_tz":-480,"elapsed":505,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}}},"source":["epsilon=1e-5\n","\n","# Pairwise cross features; you can define your own, e.g. x*x/y, log(x)/y, etc.\n","func_dict = {\n"," 'add': lambda x,y: x+y,\n"," 'mins': lambda x,y: x-y,\n"," 'div': lambda x,y: x/(y+epsilon),\n"," 'multi': lambda x,y: x*y\n"," }"],"execution_count":14,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"ysPQ-6EQLdXY"},"source":["### Define the feature-construction function"]},{"cell_type":"code","metadata":{"id":"dPZ-p8xSLdXZ","executionInfo":{"status":"ok","timestamp":1623401066461,"user_tz":-480,"elapsed":3,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}}},"source":["def auto_features_make(train_data,test_data,func_dict,col_list):\n"," train_data, test_data = train_data.copy(), test_data.copy()\n"," for col_i in col_list:\n"," for col_j in col_list:\n"," for func_name, func in func_dict.items():\n"," for data in [train_data,test_data]:\n"," func_features = func(data[col_i],data[col_j])\n"," 
col_func_features = '-'.join([col_i,func_name,col_j])\n"," data[col_func_features] = func_features\n"," return train_data,test_data"],"execution_count":15,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"rd49LO4ZLdXZ"},"source":["### Construct features for the training and test sets"]},{"cell_type":"code","metadata":{"id":"Yw1NxFGPLdXa","executionInfo":{"status":"ok","timestamp":1623401082922,"user_tz":-480,"elapsed":14522,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}}},"source":["train_data2, test_data2 = auto_features_make(train_data,test_data,func_dict,col_list=test_data.columns)"],"execution_count":16,"outputs":[]},{"cell_type":"code","metadata":{"id":"5aDoibXvLdXa","executionInfo":{"status":"ok","timestamp":1623401101042,"user_tz":-480,"elapsed":9223,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}}},"source":["from sklearn.decomposition import PCA # principal component analysis\n","\n","# Reduce dimensionality with PCA\n","pca = PCA(n_components=500)\n","# Fit on the feature columns shared with the test set: after feature construction\n","# 'target' is no longer the last column, so iloc[:, 0:-1] would keep it as a feature\n","train_data2_pca = pca.fit_transform(train_data2[test_data2.columns])\n","test_data2_pca = pca.transform(test_data2)\n","train_data2_pca = pd.DataFrame(train_data2_pca)\n","test_data2_pca = pd.DataFrame(test_data2_pca)\n","train_data2_pca['target'] = train_data2['target']"],"execution_count":17,"outputs":[]},{"cell_type":"code","metadata":{"id":"gWeLa6r8LdXb","executionInfo":{"status":"ok","timestamp":1623401103248,"user_tz":-480,"elapsed":496,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}}},"source":["X_train2 = train_data2[test_data2.columns].values\n","y_train = train_data2['target']"],"execution_count":18,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"kDqTOUhmLdXb"},"source":["### 
Train and evaluate a LightGBM model on the newly constructed features"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":833},"id":"wMfN5jqTLdXb","executionInfo":{"status":"error","timestamp":1623402038351,"user_tz":-480,"elapsed":6450,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}},"outputId":"f5f30543-7f11-4076-8560-e1ecf4ca9efa"},"source":["# ls_validation i\n","from sklearn.model_selection import KFold\n","from sklearn.metrics import mean_squared_error\n","import lightgbm as lgb\n","import numpy as np\n","\n","# 5-fold cross-validation\n","Folds=5\n","# n_splits is the first argument of KFold; passing len(X_train2) there would create one fold per sample\n","kf = KFold(n_splits=Folds, random_state=2019, shuffle=True)\n","# Record the training and prediction MSE\n","MSE_DICT = {\n"," 'train_mse':[],\n"," 'test_mse':[]\n","}\n","\n","# Offline training and prediction\n","for i, (train_index, test_index) in enumerate(kf.split(X_train2)):\n"," # LightGBM tree model\n"," lgb_reg = lgb.LGBMRegressor(\n"," learning_rate=0.01,\n"," max_depth=-1,\n"," n_estimators=5000,\n"," boosting_type='gbdt',\n"," random_state=2019,\n"," objective='regression',\n"," )\n"," \n"," # Split into training and validation folds\n"," X_train_KFold, X_test_KFold = X_train2[train_index], X_train2[test_index]\n"," y_train_KFold, y_test_KFold = y_train[train_index], y_train[test_index]\n"," \n"," # Train the model\n"," lgb_reg.fit(\n"," X=X_train_KFold,y=y_train_KFold,\n"," eval_set=[(X_train_KFold, y_train_KFold),(X_test_KFold, y_test_KFold)],\n"," eval_names=['Train','Test'],\n"," early_stopping_rounds=100,\n"," eval_metric='MSE',\n"," verbose=50\n"," )\n","\n"," # Predict on the training fold and the validation fold\n"," y_train_KFold_predict = lgb_reg.predict(X_train_KFold,num_iteration=lgb_reg.best_iteration_)\n"," y_test_KFold_predict = lgb_reg.predict(X_test_KFold,num_iteration=lgb_reg.best_iteration_) \n"," \n"," print('Fold {}: training and prediction, train MSE, test MSE'.format(i))\n"," train_mse = mean_squared_error(y_train_KFold_predict, y_train_KFold)\n"," print('------\\n', 'Train MSE\\n', 
train_mse, '\\n------')\n"," test_mse = mean_squared_error(y_test_KFold_predict, y_test_KFold)\n"," print('------\\n', 'Test MSE\\n', test_mse, '\\n------\\n')\n"," \n"," MSE_DICT['train_mse'].append(train_mse)\n"," MSE_DICT['test_mse'].append(test_mse)\n","print('------\\n', 'Train MSE\\n', MSE_DICT['train_mse'], '\\n', np.mean(MSE_DICT['train_mse']), '\\n------')\n","print('------\\n', 'Test MSE\\n', MSE_DICT['test_mse'], '\\n', np.mean(MSE_DICT['test_mse']), '\\n------')"],"execution_count":20,"outputs":[{"output_type":"stream","text":["Training until validation scores don't improve for 100 rounds.\n","[50]\tTrain's l2: 0.418978\tTrain's l2: 0.418978\tTest's l2: 0.106374\tTest's l2: 0.106374\n","[100]\tTrain's l2: 0.203693\tTrain's l2: 0.203693\tTest's l2: 0.02276\tTest's l2: 0.02276\n","[150]\tTrain's l2: 0.114486\tTrain's l2: 0.114486\tTest's l2: 0.00527795\tTest's l2: 0.00527795\n","[200]\tTrain's l2: 0.0741934\tTrain's l2: 0.0741934\tTest's l2: 5.99266e-05\tTest's l2: 5.99266e-05\n","[250]\tTrain's l2: 0.0535396\tTrain's l2: 0.0535396\tTest's l2: 0.00036171\tTest's l2: 0.00036171\n","[300]\tTrain's l2: 0.041529\tTrain's l2: 0.041529\tTest's l2: 0.00267813\tTest's l2: 0.00267813\n","Early stopping, best iteration is:\n","[221]\tTrain's l2: 0.0640274\tTrain's l2: 0.0640274\tTest's l2: 6.95547e-08\tTest's l2: 6.95547e-08\n","Fold 0: training and prediction, train MSE, test MSE\n","------\n"," Train MSE\n"," 0.0640273654375399 \n","------\n","------\n"," Test MSE\n"," 6.954770450692039e-08 \n","------\n","\n","Training until validation scores don't improve for 100 rounds.\n","[50]\tTrain's l2: 0.419128\tTrain's l2: 0.419128\tTest's l2: 0.142103\tTest's l2: 0.142103\n","[100]\tTrain's l2: 0.203838\tTrain's l2: 0.203838\tTest's l2: 0.0997735\tTest's l2: 0.0997735\n","[150]\tTrain's l2: 0.114537\tTrain's l2: 0.114537\tTest's l2: 0.0765469\tTest's l2: 0.0765469\n","[200]\tTrain's l2: 0.0742645\tTrain's l2: 0.0742645\tTest's l2: 0.0612271\tTest's l2: 0.0612271\n","[250]\tTrain's l2: 0.0536164\tTrain's l2: 
0.0536164\tTest's l2: 0.0471729\tTest's l2: 0.0471729\n","[300]\tTrain's l2: 0.0415117\tTrain's l2: 0.0415117\tTest's l2: 0.0427081\tTest's l2: 0.0427081\n"],"name":"stdout"},{"output_type":"error","ename":"KeyboardInterrupt","evalue":"ignored","traceback":["KeyboardInterrupt: "]}]},{"cell_type":"code","metadata":{"id":"aB3dbDkaPV8W"},"source":[""],"execution_count":null,"outputs":[]}]}
--------------------------------------------------------------------------------
/阿里云安全恶意程序检测/阿里云安全恶意程序检测-优化技巧与解决方案升级.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Section 6: Optimization Techniques and Solution Upgrades"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## 6.2 Deep Learning Solution: TextCNN Modeling"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "### 6.2.2 Reading the Data"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": null,
27 | "metadata": {},
28 | "outputs": [],
29 | "source": [
30 | "import pandas as pd\n",
31 | "import numpy as np\n",
32 | "import seaborn as sns\n",
33 | "import matplotlib.pyplot as plt\n",
34 | "\n",
35 | "import lightgbm as lgb\n",
36 | "from sklearn.model_selection import train_test_split\n",
37 | "from sklearn.preprocessing import OneHotEncoder\n",
38 | "\n",
39 | "from tqdm import tqdm_notebook\n",
40 | "from sklearn.preprocessing import LabelBinarizer,LabelEncoder\n",
41 | "\n",
42 | "import warnings\n",
43 | "warnings.filterwarnings('ignore')\n",
44 | "%matplotlib inline"
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": 4,
50 | "metadata": {},
51 | "outputs": [],
52 | "source": [
53 | "path = '../security_data/'\n",
54 | "train = pd.read_csv(path + 'security_train.csv')\n",
55 | "test = pd.read_csv(path + 'security_test.csv')"
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": 2,
61 | "metadata": {},
62 | "outputs": [],
63 | "source": [
64 | "import numpy as np\n",
65 | "import pandas as pd\n",
66 | "from tqdm import tqdm \n",
67 | "\n",
68 | "class _Data_Preprocess:\n",
69 | "    def __init__(self):\n",
70 | "        self.int8_max = np.iinfo(np.int8).max\n",
71 | "        self.int8_min = np.iinfo(np.int8).min\n",
72 | "\n",
73 | "        self.int16_max = np.iinfo(np.int16).max\n",
74 | "        self.int16_min = np.iinfo(np.int16).min\n",
75 | "\n",
76 | "        self.int32_max = np.iinfo(np.int32).max\n",
77 | "        self.int32_min = np.iinfo(np.int32).min\n",
78 | "\n",
79 | "        self.int64_max = np.iinfo(np.int64).max\n",
80 | "        self.int64_min = np.iinfo(np.int64).min\n",
81 | "\n",
82 | "        self.float16_max = np.finfo(np.float16).max\n",
83 | "        self.float16_min = np.finfo(np.float16).min\n",
84 | "\n",
85 | "        self.float32_max = np.finfo(np.float32).max\n",
86 | "        self.float32_min = np.finfo(np.float32).min\n",
87 | "\n",
88 | "        self.float64_max = np.finfo(np.float64).max\n",
89 | "        self.float64_min = np.finfo(np.float64).min\n",
90 | "\n",
91 | "    def _get_type(self, min_val, max_val, types):\n",
92 | "        if types == 'int':\n",
93 | "            if max_val <= self.int8_max and min_val >= self.int8_min:\n",
94 | "                return np.int8\n",
95 | "            elif max_val <= self.int16_max and min_val >= self.int16_min:\n",
96 | "                return np.int16\n",
97 | "            elif max_val <= self.int32_max and min_val >= self.int32_min:\n",
98 | "                return np.int32\n",
99 | "            return None\n",
100 | "\n",
101 | "        elif types == 'float':\n",
102 | "            if max_val <= self.float16_max and min_val >= self.float16_min:\n",
103 | "                return np.float16\n",
104 | "            if max_val <= self.float32_max and min_val >= self.float32_min:\n",
105 | "                return np.float32\n",
106 | "            if max_val <= self.float64_max and min_val >= self.float64_min:\n",
107 | "                return np.float64\n",
108 | "            return None\n",
109 | "\n",
110 | "    def _memory_process(self, df):\n",
111 | "        init_memory = df.memory_usage().sum() / 1024 ** 2 / 1024\n",
112 | "        print('Original data occupies {} GB memory.'.format(init_memory))\n",
113 | "        df_cols = df.columns\n",
114 | "\n",
115 | "\n",
116 | "        for col in tqdm(df_cols):\n",
117 | "            try:\n",
118 | "                if 'float' in str(df[col].dtypes):\n",
119 | "                    max_val = df[col].max()\n",
120 | "                    min_val = df[col].min()\n",
121 | "                    trans_types = self._get_type(min_val, max_val, 'float')\n",
122 | "                    if trans_types is not None:\n",
123 | "                        df[col] = df[col].astype(trans_types)\n",
124 | "                elif 'int' in str(df[col].dtypes):\n",
125 | "                    max_val = df[col].max()\n",
126 | "                    min_val = df[col].min()\n",
127 | "                    trans_types = self._get_type(min_val, max_val, 'int')\n",
128 | "                    if trans_types is not None:\n",
129 | "                        df[col] = df[col].astype(trans_types)\n",
130 | "            except Exception:\n",
131 | "                print('Cannot process column {}.'.format(col))\n",
132 | "        afterprocess_memory = df.memory_usage().sum() / 1024 ** 2 / 1024\n",
133 | "        print('After processing, the data occupies {} GB memory.'.format(afterprocess_memory))\n",
134 | "        return df\n",
135 | "\n",
136 | "memory_process = _Data_Preprocess()"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": 5,
142 | "metadata": {
143 | "scrolled": true
144 | },
145 | "outputs": [
146 | {
147 | "data": {
148 | "text/html": [
149 | "<div>\n",
163 | "<table border=\"1\" class=\"dataframe\">\n",
164 | "  <thead>\n",
165 | "    <tr style=\"text-align: right;\">\n",
166 | "      <th></th>\n",
167 | "      <th>file_id</th>\n",
168 | "      <th>label</th>\n",
169 | "      <th>api</th>\n",
170 | "      <th>tid</th>\n",
171 | "      <th>index</th>\n",
172 | "    </tr>\n",
173 | "  </thead>\n",
174 | "  <tbody>\n",
175 | "    <tr>\n",
176 | "      <th>0</th>\n",
177 | "      <td>1</td>\n",
178 | "      <td>5</td>\n",
179 | "      <td>LdrLoadDll</td>\n",
180 | "      <td>2488</td>\n",
181 | "      <td>0</td>\n",
182 | "    </tr>\n",
183 | "    <tr>\n",
184 | "      <th>1</th>\n",
185 | "      <td>1</td>\n",
186 | "      <td>5</td>\n",
187 | "      <td>LdrGetProcedureAddress</td>\n",
188 | "      <td>2488</td>\n",
189 | "      <td>1</td>\n",
190 | "    </tr>\n",
191 | "    <tr>\n",
192 | "      <th>2</th>\n",
193 | "      <td>1</td>\n",
194 | "      <td>5</td>\n",
195 | "      <td>LdrGetProcedureAddress</td>\n",
196 | "      <td>2488</td>\n",
197 | "      <td>2</td>\n",
198 | "    </tr>\n",
199 | "    <tr>\n",
200 | "      <th>3</th>\n",
201 | "      <td>1</td>\n",
202 | "      <td>5</td>\n",
203 | "      <td>LdrGetProcedureAddress</td>\n",
204 | "      <td>2488</td>\n",
205 | "      <td>3</td>\n",
206 | "    </tr>\n",
207 | "    <tr>\n",
208 | "      <th>4</th>\n",
209 | "      <td>1</td>\n",
210 | "      <td>5</td>\n",
211 | "      <td>LdrGetProcedureAddress</td>\n",
212 | "      <td>2488</td>\n",
213 | "      <td>4</td>\n",
214 | "    </tr>\n",
215 | "  </tbody>\n",
216 | "</table>\n",
217 | "</div>"
218 | ],
219 | "text/plain": [
220 | "   file_id  label                     api   tid  index\n",
221 | "0        1      5              LdrLoadDll  2488      0\n",
222 | "1        1      5  LdrGetProcedureAddress  2488      1\n",
223 | "2        1      5  LdrGetProcedureAddress  2488      2\n",
224 | "3        1      5  LdrGetProcedureAddress  2488      3\n",
225 | "4        1      5  LdrGetProcedureAddress  2488      4"
226 | ]
227 | },
228 | "execution_count": 5,
229 | "metadata": {},
230 | "output_type": "execute_result"
231 | }
232 | ],
233 | "source": [
234 | "train.head()"
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {},
240 | "source": [
241 | "### 6.2.3 Data Preprocessing"
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": 6,
247 | "metadata": {},
248 | "outputs": [],
249 | "source": [
250 | "# Collect the unique API names (strings) so they can be mapped to integer indices\n",
251 | "unique_api = train['api'].unique()"
252 | ]
253 | },
254 | {
255 | "cell_type": "code",
256 | "execution_count": 8,
257 | "metadata": {},
258 | "outputs": [],
259 | "source": [
260 | "api2index = {item:(i+1) for i,item in enumerate(unique_api)}\n",
261 | "index2api = {(i+1):item for i,item in enumerate(unique_api)}"
262 | ]
263 | },
264 | {
265 | "cell_type": "code",
266 | "execution_count": 9,
267 | "metadata": {},
268 | "outputs": [],
269 | "source": [
270 | "train['api_idx'] = train['api'].map(api2index)\n",
271 | "test['api_idx'] = test['api'].map(api2index)"
272 | ]
273 | },
274 | {
275 | "cell_type": "code",
276 | "execution_count": 10,
277 | "metadata": {},
278 | "outputs": [],
279 | "source": [
280 | "# Build the API-index sequence for each file\n",
281 | "def get_sequence(df,period_idx):\n",
282 | "    seq_list = []\n",
283 | "    for _id,begin in enumerate(period_idx[:-1]):\n",
284 | "        seq_list.append(df.iloc[begin:period_idx[_id+1]]['api_idx'].values)\n",
285 | "    seq_list.append(df.iloc[period_idx[-1]:]['api_idx'].values)\n",
286 | "    return seq_list"
287 | ]
288 | },
289 | {
290 | "cell_type": "code",
291 | "execution_count": 11,
292 | "metadata": {},
293 | "outputs": [],
294 | "source": [
295 | "train_period_idx = train.file_id.drop_duplicates(keep='first').index.values\n",
296 | "test_period_idx = test.file_id.drop_duplicates(keep='first').index.values"
297 | ]
298 | },
299 | {
300 | "cell_type": "code",
301 | "execution_count": 13,
302 | "metadata": {},
303 | "outputs": [],
304 | "source": [
305 | "train_df = train[['file_id','label']].drop_duplicates(keep='first')\n",
306 | "test_df = test[['file_id']].drop_duplicates(keep='first')"
307 | ]
308 | },
309 | {
310 | "cell_type": "code",
311 | "execution_count": 14,
312 | "metadata": {},
313 | "outputs": [],
314 | "source": [
315 | "train_df['seq'] = get_sequence(train,train_period_idx)\n",
316 | "test_df['seq'] = get_sequence(test,test_period_idx)"
317 | ]
318 | },
319 | {
320 | "cell_type": "markdown",
321 | "metadata": {},
322 | "source": [
323 | "### 6.2.4 TextCNN Network Architecture"
324 | ]
325 | },
326 | {
327 | "cell_type": "code",
328 | "execution_count": 16,
329 | "metadata": {},
330 | "outputs": [
331 | {
332 | "name": "stderr",
333 | "output_type": "stream",
334 | "text": [
335 | "Using TensorFlow backend.\n"
336 | ]
337 | }
338 | ],
339 | "source": [
340 | "from keras.preprocessing.text import Tokenizer\n",
341 | "from keras.preprocessing.sequence import pad_sequences\n",
342 | "from keras.layers import Dense, Input, LSTM, Lambda, Embedding, Dropout, Activation,GRU,Bidirectional\n",
343 | "from keras.layers import Conv1D,Conv2D,MaxPooling2D,GlobalAveragePooling1D,GlobalMaxPooling1D, MaxPooling1D, Flatten\n",
344 | "from keras.layers import CuDNNGRU, CuDNNLSTM, SpatialDropout1D\n",
345 | "from keras.layers.merge import concatenate, Concatenate, Average, Dot, Maximum, Multiply, Subtract, average\n",
346 | "from keras.models import Model\n",
347 | "from keras.optimizers import RMSprop,Adam\n",
348 | "from keras.layers.normalization import BatchNormalization\n",
349 | "from keras.callbacks import EarlyStopping, ModelCheckpoint\n",
350 | "from keras.optimizers import SGD\n",
351 | "from keras import backend as K\n",
352 | "from sklearn.decomposition import TruncatedSVD, NMF, LatentDirichletAllocation\n",
353 | "from keras.layers import SpatialDropout1D\n",
354 | "from keras.layers.wrappers import Bidirectional"
355 | ]
356 | },
357 | {
358 | "cell_type": "code",
359 | "execution_count": 17,
360 | "metadata": {},
361 | "outputs": [],
362 | "source": [
363 | "def TextCNN(max_len,max_cnt,embed_size, num_filters,kernel_size,conv_action, mask_zero):\n",
364 | "    _input = Input(shape=(max_len,), dtype='int32')\n",
365 | "    _embed = Embedding(max_cnt, embed_size, input_length=max_len, mask_zero=mask_zero)(_input)\n",
366 | "    _embed = SpatialDropout1D(0.15)(_embed)\n",
367 | "    wrappers = []\n",
368 | "\n",
369 | "    for _kernel_size in kernel_size:\n",
370 | "        conv1d = Conv1D(filters=num_filters, kernel_size=_kernel_size, activation=conv_action)(_embed)\n",
371 | "        wrappers.append(GlobalMaxPooling1D()(conv1d))\n",
372 | "\n",
373 | "    fc = concatenate(wrappers)\n",
374 | "    fc = Dropout(0.5)(fc)\n",
375 | "    #fc = BatchNormalization()(fc)\n",
376 | "    fc = Dense(256, activation='relu')(fc)\n",
377 | "    fc = Dropout(0.25)(fc)\n",
378 | "    #fc = BatchNormalization()(fc)\n",
379 | "    preds = Dense(8, activation = 'softmax')(fc)\n",
380 | "\n",
381 | "    model = Model(inputs=_input, outputs=preds)\n",
382 | "\n",
383 | "    model.compile(loss='categorical_crossentropy',\n",
384 | "                  optimizer='adam',\n",
385 | "                  metrics=['accuracy'])\n",
386 | "    return model"
387 | ]
388 | },
389 | {
390 | "cell_type": "code",
391 | "execution_count": 18,
392 | "metadata": {},
393 | "outputs": [],
394 | "source": [
395 | "train_labels = pd.get_dummies(train_df.label).values\n",
396 | "train_seq = pad_sequences(train_df.seq.values, maxlen = 6000)\n",
397 | "test_seq = pad_sequences(test_df.seq.values, maxlen = 6000)"
398 | ]
399 | },
400 | {
401 | "cell_type": "markdown",
402 | "metadata": {},
403 | "source": [
404 | "### 6.2.5 TextCNN Training and Prediction"
405 | ]
406 | },
407 | {
408 | "cell_type": "code",
409 | "execution_count": 19,
410 | "metadata": {},
411 | "outputs": [],
412 | "source": [
413 | "from sklearn.model_selection import StratifiedKFold,KFold \n",
414 | "skf = KFold(n_splits=5, shuffle=True)"
415 | ]
416 | },
417 | {
418 | "cell_type": "code",
419 | "execution_count": 20,
420 | "metadata": {},
421 | "outputs": [],
422 | "source": [
423 | "max_len = 6000\n",
424 | "max_cnt = 295\n",
425 | "embed_size = 256\n",
426 | "num_filters = 64\n",
427 | "kernel_size = [2,4,6,8,10,12,14]\n",
428 | "conv_action = 'relu'\n",
429 | "mask_zero = False\n",
430 | "TRAIN = True"
431 | ]
432 | },
433 | {
434 | "cell_type": "code",
435 | "execution_count": 21,
436 | "metadata": {
437 | "scrolled": true
438 | },
439 | "outputs": [
440 | {
441 | "name": "stdout",
442 | "output_type": "stream",
443 | "text": [
444 | "FOLD: \n",
445 | "2778 11109\n",
446 | "Train on 11109 samples, validate on 2778 samples\n",
447 | "Epoch 1/100\n",
448 | "11109/11109 [==============================] - 75s 7ms/step - loss: 0.8165 - acc: 0.7370 - val_loss: 0.4825 - val_acc: 0.8485\n",
449 | "Epoch 2/100\n",
450 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.4772 - acc: 0.8499 - val_loss: 0.4141 - val_acc: 0.8625\n",
451 | "Epoch 3/100\n",
452 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.4172 - acc: 0.8673 - val_loss: 0.3785 - val_acc: 0.8780\n",
453 | "Epoch 4/100\n",
454 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3768 - acc: 0.8769 - val_loss: 0.3821 - val_acc: 0.8726\n",
455 | "Epoch 5/100\n",
456 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3568 - acc: 0.8831 - val_loss: 0.3932 - val_acc: 0.8783\n",
457 | "Epoch 6/100\n",
458 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3388 - acc: 0.8893 - val_loss: 0.3566 - val_acc: 0.8902\n",
459 | "Epoch 7/100\n",
460 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3179 - acc: 0.8968 - val_loss: 0.3553 - val_acc: 0.8902\n",
461 | "Epoch 8/100\n",
462 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3050 - acc: 0.8991 - val_loss: 0.3590 - val_acc: 0.8870\n",
463 | "Epoch 9/100\n",
464 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2913 - acc: 0.9006 - val_loss: 0.3593 - val_acc: 0.8909\n",
465 | "Epoch 10/100\n",
466 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2812 - acc: 0.9047 - val_loss: 0.3528 - val_acc: 0.8906\n",
467 | "Epoch 11/100\n",
468 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2633 - acc: 0.9054 - val_loss: 0.3608 - val_acc: 0.8823\n",
469 | "Epoch 12/100\n",
470 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2665 - acc: 0.9103 - val_loss: 0.3589 - val_acc: 0.8859\n",
471 | "Epoch 13/100\n",
472 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2495 - acc: 0.9097 - val_loss: 0.3570 - val_acc: 0.8909\n",
473 | "2778/2778 [==============================] - 4s 1ms/step\n",
474 | "12955/12955 [==============================] - 13s 980us/step\n",
475 | "FOLD: \n",
476 | "2778 11109\n",
477 | "Train on 11109 samples, validate on 2778 samples\n",
478 | "Epoch 1/100\n",
479 | "11109/11109 [==============================] - 65s 6ms/step - loss: 0.8297 - acc: 0.7290 - val_loss: 0.4925 - val_acc: 0.8463\n",
480 | "Epoch 2/100\n",
481 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.4808 - acc: 0.8442 - val_loss: 0.4115 - val_acc: 0.8690\n",
482 | "Epoch 3/100\n",
483 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.4149 - acc: 0.8643 - val_loss: 0.4037 - val_acc: 0.8715\n",
484 | "Epoch 4/100\n",
485 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3799 - acc: 0.8774 - val_loss: 0.3798 - val_acc: 0.8841\n",
486 | "Epoch 5/100\n",
487 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3530 - acc: 0.8850 - val_loss: 0.3773 - val_acc: 0.8870\n",
488 | "Epoch 6/100\n",
489 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3291 - acc: 0.8924 - val_loss: 0.3676 - val_acc: 0.8855\n",
490 | "Epoch 7/100\n",
491 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3115 - acc: 0.8959 - val_loss: 0.3773 - val_acc: 0.8888\n",
492 | "Epoch 8/100\n",
493 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3032 - acc: 0.8998 - val_loss: 0.3518 - val_acc: 0.8891\n",
494 | "Epoch 9/100\n",
495 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2892 - acc: 0.9027 - val_loss: 0.3655 - val_acc: 0.8920\n",
496 | "Epoch 10/100\n",
497 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2792 - acc: 0.9056 - val_loss: 0.3615 - val_acc: 0.8906\n",
498 | "Epoch 11/100\n",
499 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2736 - acc: 0.9085 - val_loss: 0.3719 - val_acc: 0.8924\n",
500 | "2778/2778 [==============================] - 3s 1ms/step\n",
501 | "12955/12955 [==============================] - 12s 951us/step\n",
502 | "FOLD: \n",
503 | "2777 11110\n",
504 | "Train on 11110 samples, validate on 2777 samples\n",
505 | "Epoch 1/100\n",
506 | "11110/11110 [==============================] - 67s 6ms/step - loss: 0.8388 - acc: 0.7239 - val_loss: 0.4323 - val_acc: 0.8657\n",
507 | "Epoch 2/100\n",
508 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.4969 - acc: 0.8418 - val_loss: 0.3881 - val_acc: 0.8783\n",
509 | "Epoch 3/100\n",
510 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.4276 - acc: 0.8631 - val_loss: 0.3587 - val_acc: 0.8855\n",
511 | "Epoch 4/100\n",
512 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3882 - acc: 0.8770 - val_loss: 0.3542 - val_acc: 0.8898\n",
513 | "Epoch 5/100\n",
514 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3590 - acc: 0.8875 - val_loss: 0.3640 - val_acc: 0.8920\n",
515 | "Epoch 6/100\n",
516 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3437 - acc: 0.8851 - val_loss: 0.3445 - val_acc: 0.8992\n",
517 | "Epoch 7/100\n",
518 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3235 - acc: 0.8912 - val_loss: 0.3552 - val_acc: 0.8923\n",
519 | "Epoch 8/100\n",
520 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3079 - acc: 0.8986 - val_loss: 0.3491 - val_acc: 0.8927\n",
521 | "Epoch 9/100\n",
522 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2937 - acc: 0.9035 - val_loss: 0.3370 - val_acc: 0.9003\n",
523 | "Epoch 10/100\n",
524 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2897 - acc: 0.9018 - val_loss: 0.3523 - val_acc: 0.8959\n",
525 | "Epoch 11/100\n",
526 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2788 - acc: 0.9066 - val_loss: 0.3519 - val_acc: 0.8967\n",
527 | "Epoch 12/100\n",
528 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2653 - acc: 0.9131 - val_loss: 0.3571 - val_acc: 0.8952\n",
529 | "2777/2777 [==============================] - 3s 1ms/step\n",
530 | "12955/12955 [==============================] - 12s 955us/step\n",
531 | "FOLD: \n",
532 | "2777 11110\n",
533 | "Train on 11110 samples, validate on 2777 samples\n",
534 | "Epoch 1/100\n",
535 | "11110/11110 [==============================] - 66s 6ms/step - loss: 0.8286 - acc: 0.7326 - val_loss: 0.4647 - val_acc: 0.8596\n",
536 | "Epoch 2/100\n",
537 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.4807 - acc: 0.8524 - val_loss: 0.4081 - val_acc: 0.8704\n",
538 | "Epoch 3/100\n",
539 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.4233 - acc: 0.8656 - val_loss: 0.3920 - val_acc: 0.8808\n",
540 | "Epoch 4/100\n",
541 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3771 - acc: 0.8819 - val_loss: 0.3767 - val_acc: 0.8804\n",
542 | "Epoch 5/100\n",
543 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3529 - acc: 0.8861 - val_loss: 0.3990 - val_acc: 0.8725\n",
544 | "Epoch 6/100\n",
545 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3353 - acc: 0.8900 - val_loss: 0.3776 - val_acc: 0.8822\n",
546 | "Epoch 7/100\n",
547 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3234 - acc: 0.8935 - val_loss: 0.3717 - val_acc: 0.8855\n",
548 | "Epoch 8/100\n",
549 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3012 - acc: 0.9010 - val_loss: 0.3758 - val_acc: 0.8848\n",
550 | "Epoch 9/100\n",
551 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2907 - acc: 0.9011 - val_loss: 0.3656 - val_acc: 0.8862\n",
552 | "Epoch 10/100\n",
553 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2852 - acc: 0.9022 - val_loss: 0.3676 - val_acc: 0.8858\n",
554 | "Epoch 11/100\n",
555 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2683 - acc: 0.9085 - val_loss: 0.3630 - val_acc: 0.8862\n",
556 | "Epoch 12/100\n",
557 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2595 - acc: 0.9091 - val_loss: 0.3768 - val_acc: 0.8884\n",
558 | "Epoch 13/100\n",
559 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2533 - acc: 0.9140 - val_loss: 0.3817 - val_acc: 0.8822\n",
560 | "Epoch 14/100\n",
561 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2464 - acc: 0.9155 - val_loss: 0.3757 - val_acc: 0.8862\n",
562 | "2777/2777 [==============================] - 3s 1ms/step\n",
563 | "12955/12955 [==============================] - 12s 949us/step\n",
564 | "FOLD: \n",
565 | "2777 11110\n",
566 | "Train on 11110 samples, validate on 2777 samples\n",
567 | "Epoch 1/100\n",
568 | "11110/11110 [==============================] - 65s 6ms/step - loss: 0.8168 - acc: 0.7315 - val_loss: 0.4718 - val_acc: 0.8567\n",
569 | "Epoch 2/100\n",
570 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.4880 - acc: 0.8459 - val_loss: 0.4047 - val_acc: 0.8711\n",
571 | "Epoch 3/100\n",
572 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.4224 - acc: 0.8674 - val_loss: 0.3871 - val_acc: 0.8732\n",
573 | "Epoch 4/100\n",
574 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3900 - acc: 0.8728 - val_loss: 0.3676 - val_acc: 0.8812\n",
575 | "Epoch 5/100\n",
576 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3581 - acc: 0.8846 - val_loss: 0.3713 - val_acc: 0.8819\n",
577 | "Epoch 6/100\n",
578 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3391 - acc: 0.8890 - val_loss: 0.3542 - val_acc: 0.8905\n",
579 | "Epoch 7/100\n",
580 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3158 - acc: 0.8975 - val_loss: 0.3610 - val_acc: 0.8902\n",
581 | "Epoch 8/100\n",
582 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3074 - acc: 0.8986 - val_loss: 0.3520 - val_acc: 0.8887\n",
583 | "Epoch 9/100\n",
584 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2905 - acc: 0.9026 - val_loss: 0.3588 - val_acc: 0.8941\n",
585 | "Epoch 10/100\n",
586 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2795 - acc: 0.9032 - val_loss: 0.3417 - val_acc: 0.8923\n",
587 | "Epoch 11/100\n",
588 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2747 - acc: 0.9044 - val_loss: 0.3456 - val_acc: 0.8912\n",
589 | "Epoch 12/100\n",
590 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2546 - acc: 0.9131 - val_loss: 0.3517 - val_acc: 0.8902\n",
591 | "Epoch 13/100\n",
592 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2483 - acc: 0.9144 - val_loss: 0.3785 - val_acc: 0.8909\n",
593 | "2777/2777 [==============================] - 3s 1ms/step\n",
594 | "12955/12955 [==============================] - 12s 949us/step\n"
595 | ]
596 | }
597 | ],
598 | "source": [
599 | "import os\n",
600 | "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0,1\"\n",
601 | "meta_train = np.zeros(shape = (len(train_seq),8))\n",
602 | "meta_test = np.zeros(shape = (len(test_seq),8))\n",
603 | "FLAG = True\n",
604 | "i = 0\n",
605 | "for tr_ind,te_ind in skf.split(train_labels):\n",
606 | "    i +=1\n",
607 | "    print('FOLD: {}'.format(i))\n",
608 | "    print(len(te_ind),len(tr_ind))\n",
609 | "    model_name = 'benchmark_textcnn_fold_'+str(i)\n",
610 | "    X_train,X_train_label = train_seq[tr_ind],train_labels[tr_ind]\n",
611 | "    X_val,X_val_label = train_seq[te_ind],train_labels[te_ind]\n",
612 | "\n",
613 | "    model = TextCNN(max_len,max_cnt,embed_size,num_filters,kernel_size,conv_action,mask_zero)\n",
614 | "\n",
615 | "    model_save_path = './NN/%s_%s.hdf5'%(model_name,embed_size)\n",
616 | "    early_stopping =EarlyStopping(monitor='val_loss', patience=3)\n",
617 | "    model_checkpoint = ModelCheckpoint(model_save_path, save_best_only=True, save_weights_only=True)\n",
618 | "    if TRAIN and FLAG:\n",
619 | "        model.fit(X_train,X_train_label,validation_data=(X_val,X_val_label),epochs=100,batch_size=64,shuffle=True,callbacks=[early_stopping,model_checkpoint] )\n",
620 | "\n",
621 | "    model.load_weights(model_save_path)\n",
622 | "    pred_val = model.predict(X_val,batch_size=128,verbose=1)\n",
623 | "    pred_test = model.predict(test_seq,batch_size=128,verbose=1)\n",
624 | "\n",
625 | "    meta_train[te_ind] = pred_val\n",
626 | "    meta_test += pred_test\n",
627 | "    K.clear_session()\n",
628 | "meta_test /= 5.0 "
629 | ]
630 | },
631 | {
632 | "cell_type": "markdown",
633 | "metadata": {},
634 | "source": [
635 | "### 6.2.6 Submitting the Results"
636 | ]
637 | },
638 | {
639 | "cell_type": "code",
640 | "execution_count": 22,
641 | "metadata": {},
642 | "outputs": [],
643 | "source": [
644 | "test_df['prob0'] = 0\n",
645 | "test_df['prob1'] = 0\n",
646 | "test_df['prob2'] = 0\n",
647 | "test_df['prob3'] = 0\n",
648 | "test_df['prob4'] = 0\n",
649 | "test_df['prob5'] = 0\n",
650 | "test_df['prob6'] = 0\n",
651 | "test_df['prob7'] = 0\n",
652 | "\n",
653 | "test_df[['prob0','prob1','prob2','prob3','prob4','prob5','prob6','prob7']] = meta_test\n",
654 | "test_df[['file_id','prob0','prob1','prob2','prob3','prob4','prob5','prob6','prob7']].to_csv('nn_baseline_5fold.csv',index = None)"
655 | ]
656 | },
657 | {
658 | "cell_type": "code",
659 | "execution_count": null,
660 | "metadata": {},
661 | "outputs": [],
662 | "source": []
663 | }
664 | ],
665 | "metadata": {
666 | "kernelspec": {
667 | "display_name": "Python 3",
668 | "language": "python",
669 | "name": "python3"
670 | },
671 | "language_info": {
672 | "codemirror_mode": {
673 | "name": "ipython",
674 | "version": 3
675 | },
676 | "file_extension": ".py",
677 | "mimetype": "text/x-python",
678 | "name": "python",
679 | "nbconvert_exporter": "python",
680 | "pygments_lexer": "ipython3",
681 | "version": "3.7.1"
682 | },
683 | "toc": {
684 | "nav_menu": {},
685 | "number_sections": true,
686 | "sideBar": true,
687 | "skip_h1_title": false,
688 | "title_cell": "Table of Contents",
689 | "title_sidebar": "Contents",
690 | "toc_cell": true,
691 | "toc_position": {},
692 | "toc_section_display": true,
693 | "toc_window_display": true
694 | }
695 | },
696 | "nbformat": 4,
697 | "nbformat_minor": 2
698 | }
699 |
--------------------------------------------------------------------------------
/阿里云安全恶意程序检测/阿里云安全恶意程序检测-数据探索.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Section 2: Data Exploration"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 4,
13 | "metadata": {},
14 | "outputs": [
15 | {
16 | "data": {
17 | "text/html": [
18 | "\n",
43 | "!!The style above was modified by the author for layout purposes; consider whether you need it!!\n"
44 | ],
45 | "text/plain": [
46 | ""
47 | ]
48 | },
49 | "metadata": {},
50 | "output_type": "display_data"
51 | }
52 | ],
53 | "source": [
54 | "%%html\n",
55 | "\n",
80 | "!!The style above was modified by the author for layout purposes; consider whether you need it!!"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {},
86 | "source": [
87 | "## 2.1 Exploring the Training Set"
88 | ]
89 | },
90 | {
91 | "cell_type": "markdown",
92 | "metadata": {},
93 | "source": [
94 | "### 2.1.1 Feature Types"
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": 7,
100 | "metadata": {},
101 | "outputs": [],
102 | "source": [
103 | "# 导入相关应用包\n",
104 | "import pandas as pd\n",
105 | "import numpy as np\n",
106 | "import seaborn as sns\n",
107 | "import matplotlib.pyplot as plt\n",
108 | "\n",
109 | "# Suppress warning messages\n",
110 | "import warnings\n",
111 | "warnings.filterwarnings(\"ignore\")\n",
112 | "\n",
113 | "%matplotlib inline\n",
114 | "\n",
115 | "# Load the data\n",
116 | "path = './dataset/'\n",
117 | "train = pd.read_csv(path + 'security_train.csv') # training set\n",
118 | "test = pd.read_csv(path + 'security_test.csv') # test set"
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "execution_count": 8,
124 | "metadata": {},
125 | "outputs": [
126 | {
127 | "data": {
128 | "text/html": [
129 | "<div>\n",
143 | "<table border=\"1\" class=\"dataframe\">\n",
144 | "  <thead>\n",
145 | "    <tr style=\"text-align: right;\">\n",
146 | "      <th></th>\n",
147 | "      <th>file_id</th>\n",
148 | "      <th>label</th>\n",
149 | "      <th>api</th>\n",
150 | "      <th>tid</th>\n",
151 | "      <th>index</th>\n",
152 | "    </tr>\n",
153 | "  </thead>\n",
154 | "  <tbody>\n",
155 | "    <tr>\n",
156 | "      <th>0</th>\n",
157 | "      <td>1</td>\n",
158 | "      <td>5</td>\n",
159 | "      <td>LdrLoadDll</td>\n",
160 | "      <td>2488</td>\n",
161 | "      <td>0.0</td>\n",
162 | "    </tr>\n",
163 | "    <tr>\n",
164 | "      <th>1</th>\n",
165 | "      <td>1</td>\n",
166 | "      <td>5</td>\n",
167 | "      <td>LdrGetProcedureAddress</td>\n",
168 | "      <td>2488</td>\n",
169 | "      <td>1.0</td>\n",
170 | "    </tr>\n",
171 | "    <tr>\n",
172 | "      <th>2</th>\n",
173 | "      <td>1</td>\n",
174 | "      <td>5</td>\n",
175 | "      <td>LdrGetProcedureAddress</td>\n",
176 | "      <td>2488</td>\n",
177 | "      <td>2.0</td>\n",
178 | "    </tr>\n",
179 | "    <tr>\n",
180 | "      <th>3</th>\n",
181 | "      <td>1</td>\n",
182 | "      <td>5</td>\n",
183 | "      <td>LdrGetProcedureAddress</td>\n",
184 | "      <td>2488</td>\n",
185 | "      <td>3.0</td>\n",
186 | "    </tr>\n",
187 | "    <tr>\n",
188 | "      <th>4</th>\n",
189 | "      <td>1</td>\n",
190 | "      <td>5</td>\n",
191 | "      <td>LdrGetProcedureAddress</td>\n",
192 | "      <td>2488</td>\n",
193 | "      <td>4.0</td>\n",
194 | "    </tr>\n",
195 | "  </tbody>\n",
196 | "</table>\n",
197 | "</div>"
198 | ],
199 | "text/plain": [
200 | " file_id label api tid index\n",
201 | "0 1 5 LdrLoadDll 2488 0.0\n",
202 | "1 1 5 LdrGetProcedureAddress 2488 1.0\n",
203 | "2 1 5 LdrGetProcedureAddress 2488 2.0\n",
204 | "3 1 5 LdrGetProcedureAddress 2488 3.0\n",
205 | "4 1 5 LdrGetProcedureAddress 2488 4.0"
206 | ]
207 | },
208 | "execution_count": 8,
209 | "metadata": {},
210 | "output_type": "execute_result"
211 | }
212 | ],
213 | "source": [
214 | "train.head()"
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "execution_count": 9,
220 | "metadata": {},
221 | "outputs": [
222 | {
223 | "data": {
224 | "text/html": [
225 | "<div>\n",
239 | "<table border=\"1\" class=\"dataframe\">\n",
240 | "  <thead>\n",
241 | "    <tr style=\"text-align: right;\">\n",
242 | "      <th></th>\n",
243 | "      <th>file_id</th>\n",
244 | "      <th>label</th>\n",
245 | "      <th>tid</th>\n",
246 | "      <th>index</th>\n",
247 | "    </tr>\n",
248 | "  </thead>\n",
249 | "  <tbody>\n",
250 | "    <tr>\n",
251 | "      <th>count</th>\n",
252 | "      <td>35952.000000</td>\n",
253 | "      <td>35952.000000</td>\n",
254 | "      <td>35952.000000</td>\n",
255 | "      <td>35951.000000</td>\n",
256 | "    </tr>\n",
257 | "    <tr>\n",
258 | "      <th>mean</th>\n",
259 | "      <td>5.142051</td>\n",
260 | "      <td>0.989152</td>\n",
261 | "      <td>2494.964564</td>\n",
262 | "      <td>2153.216267</td>\n",
263 | "    </tr>\n",
264 | "    <tr>\n",
265 | "      <th>std</th>\n",
266 | "      <td>2.547382</td>\n",
267 | "      <td>1.957361</td>\n",
268 | "      <td>129.979938</td>\n",
269 | "      <td>1537.349809</td>\n",
270 | "    </tr>\n",
271 | "    <tr>\n",
272 | "      <th>min</th>\n",
273 | "      <td>1.000000</td>\n",
274 | "      <td>0.000000</td>\n",
275 | "      <td>282.000000</td>\n",
276 | "      <td>0.000000</td>\n",
277 | "    </tr>\n",
278 | "    <tr>\n",
279 | "      <th>25%</th>\n",
280 | "      <td>4.000000</td>\n",
281 | "      <td>0.000000</td>\n",
282 | "      <td>2456.000000</td>\n",
283 | "      <td>722.000000</td>\n",
284 | "    </tr>\n",
285 | "    <tr>\n",
286 | "      <th>50%</th>\n",
287 | "      <td>5.000000</td>\n",
288 | "      <td>0.000000</td>\n",
289 | "      <td>2500.000000</td>\n",
290 | "      <td>2004.000000</td>\n",
291 | "    </tr>\n",
292 | "    <tr>\n",
293 | "      <th>75%</th>\n",
294 | "      <td>7.000000</td>\n",
295 | "      <td>0.000000</td>\n",
296 | "      <td>2596.000000</td>\n",
297 | "      <td>3502.000000</td>\n",
298 | "    </tr>\n",
299 | "    <tr>\n",
300 | "      <th>max</th>\n",
301 | "      <td>9.000000</td>\n",
302 | "      <td>5.000000</td>\n",
303 | "      <td>2980.000000</td>\n",
304 | "      <td>5000.000000</td>\n",
305 | "    </tr>\n",
306 | "  </tbody>\n",
307 | "</table>\n",
308 | "</div>"
309 | ],
310 | "text/plain": [
311 | " file_id label tid index\n",
312 | "count 35952.000000 35952.000000 35952.000000 35951.000000\n",
313 | "mean 5.142051 0.989152 2494.964564 2153.216267\n",
314 | "std 2.547382 1.957361 129.979938 1537.349809\n",
315 | "min 1.000000 0.000000 282.000000 0.000000\n",
316 | "25% 4.000000 0.000000 2456.000000 722.000000\n",
317 | "50% 5.000000 0.000000 2500.000000 2004.000000\n",
318 | "75% 7.000000 0.000000 2596.000000 3502.000000\n",
319 | "max 9.000000 5.000000 2980.000000 5000.000000"
320 | ]
321 | },
322 | "execution_count": 9,
323 | "metadata": {},
324 | "output_type": "execute_result"
325 | }
326 | ],
327 | "source": [
328 | "train.describe()"
329 | ]
330 | },
331 | {
332 | "cell_type": "markdown",
333 | "metadata": {},
334 | "source": [
335 | "### 2.1.2 Data Distribution"
336 | ]
337 | },
338 | {
339 | "cell_type": "code",
340 | "execution_count": 10,
341 | "metadata": {},
342 | "outputs": [
343 | {
344 | "data": {
345 | "text/plain": [
346 | ""
347 | ]
348 | },
349 | "execution_count": 10,
350 | "metadata": {},
351 | "output_type": "execute_result"
352 | },
353 | {
354 | "data": {
355 | "image/png": "[base64-encoded PNG omitted: box plot of tid for the first 10,000 training rows]",
356 | "text/plain": [
357 | ""
358 | ]
359 | },
360 | "metadata": {
361 | "needs_background": "light"
362 | },
363 | "output_type": "display_data"
364 | }
365 | ],
366 | "source": [
367 | "sns.boxplot(x=train.iloc[:10000][\"tid\"])"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": 11,
373 | "metadata": {},
374 | "outputs": [
375 | {
376 | "data": {
377 | "text/plain": [
378 | "file_id 9\n",
379 | "label 3\n",
380 | "api 166\n",
381 | "tid 56\n",
382 | "index 5001\n",
383 | "dtype: int64"
384 | ]
385 | },
386 | "execution_count": 11,
387 | "metadata": {},
388 | "output_type": "execute_result"
389 | }
390 | ],
391 | "source": [
392 | "train.nunique()"
393 | ]
394 | },
395 | {
396 | "cell_type": "markdown",
397 | "metadata": {},
398 | "source": [
399 | "### 2.1.3 Missing Values"
400 | ]
401 | },
402 | {
403 | "cell_type": "code",
404 | "execution_count": 12,
405 | "metadata": {},
406 | "outputs": [
407 | {
408 | "data": {
409 | "text/plain": [
410 | "count 35951.000000\n",
411 | "mean 2153.216267\n",
412 | "std 1537.349809\n",
413 | "min 0.000000\n",
414 | "25% 722.000000\n",
415 | "50% 2004.000000\n",
416 | "75% 3502.000000\n",
417 | "max 5000.000000\n",
418 | "Name: index, dtype: float64"
419 | ]
420 | },
421 | "execution_count": 12,
422 | "metadata": {},
423 | "output_type": "execute_result"
424 | }
425 | ],
426 | "source": [
427 | "train['index'].describe()"
428 | ]
429 | },
430 | {
431 | "cell_type": "markdown",
432 | "metadata": {},
433 | "source": [
434 | "### 2.1.4 Outliers"
435 | ]
436 | },
437 | {
438 | "cell_type": "code",
439 | "execution_count": 13,
440 | "metadata": {
441 | "scrolled": true
442 | },
443 | "outputs": [
444 | {
445 | "data": {
446 | "text/plain": [
447 | "count 35951.000000\n",
448 | "mean 2153.216267\n",
449 | "std 1537.349809\n",
450 | "min 0.000000\n",
451 | "25% 722.000000\n",
452 | "50% 2004.000000\n",
453 | "75% 3502.000000\n",
454 | "max 5000.000000\n",
455 | "Name: index, dtype: float64"
456 | ]
457 | },
458 | "execution_count": 13,
459 | "metadata": {},
460 | "output_type": "execute_result"
461 | }
462 | ],
463 | "source": [
464 | "train['index'].describe()"
465 | ]
466 | },
467 | {
468 | "cell_type": "code",
469 | "execution_count": 14,
470 | "metadata": {},
471 | "outputs": [
472 | {
473 | "data": {
474 | "text/plain": [
475 | "count 35952.000000\n",
476 | "mean 2494.964564\n",
477 | "std 129.979938\n",
478 | "min 282.000000\n",
479 | "25% 2456.000000\n",
480 | "50% 2500.000000\n",
481 | "75% 2596.000000\n",
482 | "max 2980.000000\n",
483 | "Name: tid, dtype: float64"
484 | ]
485 | },
486 | "execution_count": 14,
487 | "metadata": {},
488 | "output_type": "execute_result"
489 | }
490 | ],
491 | "source": [
492 | "train['tid'].describe()"
493 | ]
494 | },
495 | {
496 | "cell_type": "markdown",
497 | "metadata": {},
498 | "source": [
499 | "### 2.1.5 Label Distribution"
500 | ]
501 | },
502 | {
503 | "cell_type": "code",
504 | "execution_count": 15,
505 | "metadata": {},
506 | "outputs": [
507 | {
508 | "data": {
509 | "text/plain": [
510 | "0 28350\n",
511 | "5 6786\n",
512 | "2 816\n",
513 | "Name: label, dtype: int64"
514 | ]
515 | },
516 | "execution_count": 15,
517 | "metadata": {},
518 | "output_type": "execute_result"
519 | }
520 | ],
521 | "source": [
522 | "train['label'].value_counts()"
523 | ]
524 | },
525 | {
526 | "cell_type": "code",
527 | "execution_count": 16,
528 | "metadata": {},
529 | "outputs": [
530 | {
531 | "data": {
532 | "text/plain": [
533 | ""
534 | ]
535 | },
536 | "execution_count": 16,
537 | "metadata": {},
538 | "output_type": "execute_result"
539 | },
540 | {
541 | "data": {
542 | "image/png": "[base64-encoded PNG omitted: bar chart of training label counts, sorted by label]",
543 | "text/plain": [
544 | ""
545 | ]
546 | },
547 | "metadata": {
548 | "needs_background": "light"
549 | },
550 | "output_type": "display_data"
551 | }
552 | ],
553 | "source": [
554 | "plt.figure(figsize=(12,4),dpi=150)\n",
555 | "train['label'].value_counts().sort_index().plot(kind = 'bar')"
556 | ]
557 | },
558 | {
559 | "cell_type": "code",
560 | "execution_count": 17,
561 | "metadata": {},
562 | "outputs": [
563 | {
564 | "data": {
565 | "text/plain": [
566 | ""
567 | ]
568 | },
569 | "execution_count": 17,
570 | "metadata": {},
571 | "output_type": "execute_result"
572 | },
573 | {
574 | "data": {
575 | "image/png": "\n",
576 | "text/plain": [
577 | ""
578 | ]
579 | },
580 | "metadata": {},
581 | "output_type": "display_data"
582 | }
583 | ],
584 | "source": [
585 | "plt.figure(figsize=(4,4),dpi=150)\n",
586 | "train['label'].value_counts().sort_index().plot(kind = 'pie')"
587 | ]
588 | },
589 | {
590 | "cell_type": "markdown",
591 | "metadata": {},
592 | "source": [
593 | "## 2.2 Test Set Exploration"
594 | ]
595 | },
596 | {
597 | "cell_type": "markdown",
598 | "metadata": {},
599 | "source": [
600 | "### 2.2.1 Data Overview"
601 | ]
602 | },
603 | {
604 | "cell_type": "code",
605 | "execution_count": 18,
606 | "metadata": {},
607 | "outputs": [
608 | {
609 | "data": {
610 | "text/html": [
611 | "<div>\n",
625 | "<table border=\"1\" class=\"dataframe\">\n",
626 | "  <thead>\n",
627 | "    <tr style=\"text-align: right;\">\n",
628 | "      <th></th>\n",
629 | "      <th>file_id</th>\n",
630 | "      <th>api</th>\n",
631 | "      <th>tid</th>\n",
632 | "      <th>index</th>\n",
633 | "    </tr>\n",
634 | "  </thead>\n",
635 | "  <tbody>\n",
636 | "    <tr>\n",
637 | "      <th>0</th>\n",
638 | "      <td>1</td>\n",
639 | "      <td>RegOpenKeyExA</td>\n",
640 | "      <td>2332.0</td>\n",
641 | "      <td>0.0</td>\n",
642 | "    </tr>\n",
643 | "    <tr>\n",
644 | "      <th>1</th>\n",
645 | "      <td>1</td>\n",
646 | "      <td>CopyFileA</td>\n",
647 | "      <td>2332.0</td>\n",
648 | "      <td>1.0</td>\n",
649 | "    </tr>\n",
650 | "    <tr>\n",
651 | "      <th>2</th>\n",
652 | "      <td>1</td>\n",
653 | "      <td>OpenSCManagerA</td>\n",
654 | "      <td>2332.0</td>\n",
655 | "      <td>2.0</td>\n",
656 | "    </tr>\n",
657 | "    <tr>\n",
658 | "      <th>3</th>\n",
659 | "      <td>1</td>\n",
660 | "      <td>CreateServiceA</td>\n",
661 | "      <td>2332.0</td>\n",
662 | "      <td>3.0</td>\n",
663 | "    </tr>\n",
664 | "    <tr>\n",
665 | "      <th>4</th>\n",
666 | "      <td>1</td>\n",
667 | "      <td>RegOpenKeyExA</td>\n",
668 | "      <td>2468.0</td>\n",
669 | "      <td>0.0</td>\n",
670 | "    </tr>\n",
671 | "  </tbody>\n",
672 | "</table>\n",
673 | "</div>"
674 | ],
675 | "text/plain": [
676 | " file_id api tid index\n",
677 | "0 1 RegOpenKeyExA 2332.0 0.0\n",
678 | "1 1 CopyFileA 2332.0 1.0\n",
679 | "2 1 OpenSCManagerA 2332.0 2.0\n",
680 | "3 1 CreateServiceA 2332.0 3.0\n",
681 | "4 1 RegOpenKeyExA 2468.0 0.0"
682 | ]
683 | },
684 | "execution_count": 18,
685 | "metadata": {},
686 | "output_type": "execute_result"
687 | }
688 | ],
689 | "source": [
690 | "test.head()"
691 | ]
692 | },
693 | {
694 | "cell_type": "code",
695 | "execution_count": 19,
696 | "metadata": {},
697 | "outputs": [
698 | {
699 | "name": "stdout",
700 | "output_type": "stream",
701 | "text": [
702 | "<class 'pandas.core.frame.DataFrame'>\n",
703 | "RangeIndex: 39173 entries, 0 to 39172\n",
704 | "Data columns (total 4 columns):\n",
705 | "file_id 39173 non-null int64\n",
706 | "api 39173 non-null object\n",
707 | "tid 39172 non-null float64\n",
708 | "index 39172 non-null float64\n",
709 | "dtypes: float64(2), int64(1), object(1)\n",
710 | "memory usage: 1.2+ MB\n"
711 | ]
712 | }
713 | ],
714 | "source": [
715 | "test.info()"
716 | ]
717 | },
718 | {
719 | "cell_type": "markdown",
720 | "metadata": {},
721 | "source": [
722 | "### 2.2.2 Missing value exploration"
723 | ]
724 | },
725 | {
726 | "cell_type": "code",
727 | "execution_count": 20,
728 | "metadata": {},
729 | "outputs": [
730 | {
731 | "data": {
732 | "text/plain": [
733 | "file_id 0\n",
734 | "api 0\n",
735 | "tid 1\n",
736 | "index 1\n",
737 | "dtype: int64"
738 | ]
739 | },
740 | "execution_count": 20,
741 | "metadata": {},
742 | "output_type": "execute_result"
743 | }
744 | ],
745 | "source": [
746 | "test.isnull().sum()"
747 | ]
748 | },
749 | {
750 | "cell_type": "markdown",
751 | "metadata": {},
752 | "source": [
753 | "### 2.2.3 Data distribution exploration"
754 | ]
755 | },
756 | {
757 | "cell_type": "code",
758 | "execution_count": 21,
759 | "metadata": {},
760 | "outputs": [
761 | {
762 | "data": {
763 | "text/plain": [
764 | "file_id 10\n",
765 | "api 146\n",
766 | "tid 125\n",
767 | "index 5001\n",
768 | "dtype: int64"
769 | ]
770 | },
771 | "execution_count": 21,
772 | "metadata": {},
773 | "output_type": "execute_result"
774 | }
775 | ],
776 | "source": [
777 | "test.nunique()"
778 | ]
779 | },
780 | {
781 | "cell_type": "markdown",
782 | "metadata": {},
783 | "source": [
784 | "### 2.2.4 Outlier exploration"
785 | ]
786 | },
787 | {
788 | "cell_type": "code",
789 | "execution_count": 22,
790 | "metadata": {},
791 | "outputs": [
792 | {
793 | "data": {
794 | "text/plain": [
795 | "count 39172.000000\n",
796 | "mean 1729.569284\n",
797 | "std 1486.018402\n",
798 | "min 0.000000\n",
799 | "25% 405.750000\n",
800 | "50% 1342.000000\n",
801 | "75% 2876.000000\n",
802 | "max 5000.000000\n",
803 | "Name: index, dtype: float64"
804 | ]
805 | },
806 | "execution_count": 22,
807 | "metadata": {},
808 | "output_type": "execute_result"
809 | }
810 | ],
811 | "source": [
812 | "test['index'].describe()"
813 | ]
814 | },
815 | {
816 | "cell_type": "code",
817 | "execution_count": 23,
818 | "metadata": {
819 | "scrolled": true
820 | },
821 | "outputs": [
822 | {
823 | "data": {
824 | "text/plain": [
825 | "count 39172.000000\n",
826 | "mean 2158.769938\n",
827 | "std 464.152821\n",
828 | "min 504.000000\n",
829 | "25% 2092.000000\n",
830 | "50% 2224.000000\n",
831 | "75% 2500.000000\n",
832 | "max 2920.000000\n",
833 | "Name: tid, dtype: float64"
834 | ]
835 | },
836 | "execution_count": 23,
837 | "metadata": {},
838 | "output_type": "execute_result"
839 | }
840 | ],
841 | "source": [
842 | "test['tid'].describe()"
843 | ]
844 | },
845 | {
846 | "cell_type": "markdown",
847 | "metadata": {},
848 | "source": [
849 | "## 2.3 Joint analysis of the training and test sets"
850 | ]
851 | },
852 | {
853 | "cell_type": "markdown",
854 | "metadata": {},
855 | "source": [
856 | "### 2.3.1 file_id analysis"
857 | ]
858 | },
859 | {
860 | "cell_type": "code",
861 | "execution_count": 24,
862 | "metadata": {},
863 | "outputs": [],
864 | "source": [
865 | "train_fileids = train['file_id'].unique()\n",
866 | "test_fileids = test['file_id'].unique()"
867 | ]
868 | },
869 | {
870 | "cell_type": "code",
871 | "execution_count": 25,
872 | "metadata": {},
873 | "outputs": [
874 | {
875 | "data": {
876 | "text/plain": [
877 | "0"
878 | ]
879 | },
880 | "execution_count": 25,
881 | "metadata": {},
882 | "output_type": "execute_result"
883 | }
884 | ],
885 | "source": [
886 | "len(set(train_fileids)-set(test_fileids)) "
887 | ]
888 | },
889 | {
890 | "cell_type": "code",
891 | "execution_count": 26,
892 | "metadata": {},
893 | "outputs": [
894 | {
895 | "data": {
896 | "text/plain": [
897 | "1"
898 | ]
899 | },
900 | "execution_count": 26,
901 | "metadata": {},
902 | "output_type": "execute_result"
903 | }
904 | ],
905 | "source": [
906 | "len(set(test_fileids)-set(train_fileids)) "
907 | ]
908 | },
909 | {
910 | "cell_type": "markdown",
911 | "metadata": {},
912 | "source": [
913 | "### 2.3.2 API analysis"
914 | ]
915 | },
916 | {
917 | "cell_type": "code",
918 | "execution_count": 27,
919 | "metadata": {},
920 | "outputs": [],
921 | "source": [
922 | "train_apis = train['api'].unique()\n",
923 | "test_apis = test['api'].unique()"
924 | ]
925 | },
926 | {
927 | "cell_type": "code",
928 | "execution_count": 28,
929 | "metadata": {
930 | "scrolled": true
931 | },
932 | "outputs": [
933 | {
934 | "data": {
935 | "text/plain": [
936 | "{'CertCreateCertificateContext',\n",
937 | " 'CertOpenSystemStoreA',\n",
938 | " 'CoInitializeSecurity',\n",
939 | " 'CreateServiceA',\n",
940 | " 'CryptAcquireContextW',\n",
941 | " 'FindWindowA',\n",
942 | " 'FindWindowExW',\n",
943 | " 'GetComputerNameA',\n",
944 | " 'GetFileVersionInfoSizeW',\n",
945 | " 'GetFileVersionInfoW',\n",
946 | " 'IWbemServices_ExecQuery',\n",
947 | " 'LookupAccountSidW',\n",
948 | " 'LookupPrivilegeValueW',\n",
949 | " 'OpenServiceW',\n",
950 | " 'OutputDebugStringA',\n",
951 | " 'R',\n",
952 | " 'SendNotifyMessageW',\n",
953 | " 'SetStdHandle',\n",
954 | " 'StartServiceA',\n",
955 | " 'StartServiceW',\n",
956 | " 'UnhookWindowsHookEx',\n",
957 | " 'connect',\n",
958 | " 'timeGetTime'}"
959 | ]
960 | },
961 | "execution_count": 28,
962 | "metadata": {},
963 | "output_type": "execute_result"
964 | }
965 | ],
966 | "source": [
967 | "set(test_apis)-set(train_apis)"
968 | ]
969 | },
970 | {
971 | "cell_type": "code",
972 | "execution_count": 29,
973 | "metadata": {},
974 | "outputs": [
975 | {
976 | "data": {
977 | "text/plain": [
978 | "{'CertControlStore',\n",
979 | " 'CryptAcquireContextA',\n",
980 | " 'CryptCreateHash',\n",
981 | " 'CryptExportKey',\n",
982 | " 'CryptHashData',\n",
983 | " 'DeviceIoControl',\n",
984 | " 'DrawTextExA',\n",
985 | " 'EncryptMessage',\n",
986 | " 'EnumServicesStatusW',\n",
987 | " 'FindResourceExA',\n",
988 | " 'GetAdaptersAddresses',\n",
989 | " 'GetAddrInfoW',\n",
990 | " 'GetAsyncKeyState',\n",
991 | " 'GetBestInterfaceEx',\n",
992 | " 'GetFileInformationByHandle',\n",
993 | " 'GetFileVersionInfoExW',\n",
994 | " 'GetFileVersionInfoSizeExW',\n",
995 | " 'GetUserNameA',\n",
996 | " 'GetVolumePathNameW',\n",
997 | " 'GlobalMemoryStatus',\n",
998 | " 'HttpOpenRequestA',\n",
999 | " 'InternetConnectA',\n",
1000 | " 'InternetOpenA',\n",
1001 | " 'IsDebuggerPresent',\n",
1002 | " 'Module32FirstW',\n",
1003 | " 'Module32NextW',\n",
1004 | " 'NtDeleteValueKey',\n",
1005 | " 'NtReadVirtualMemory',\n",
1006 | " 'OpenServiceA',\n",
1007 | " 'ReadProcessMemory',\n",
1008 | " 'RegEnumKeyExA',\n",
1009 | " 'RegEnumValueA',\n",
1010 | " 'RtlAddVectoredContinueHandler',\n",
1011 | " 'RtlAddVectoredExceptionHandler',\n",
1012 | " 'RtlRemoveVectoredExceptionHandler',\n",
1013 | " 'SetFileAttributesW',\n",
1014 | " 'SetFileTime',\n",
1015 | " 'SetWindowsHookExA',\n",
1016 | " 'Thread32First',\n",
1017 | " 'Thread32Next',\n",
1018 | " 'WriteConsoleA',\n",
1019 | " 'bind',\n",
1020 | " 'listen'}"
1021 | ]
1022 | },
1023 | "execution_count": 29,
1024 | "metadata": {},
1025 | "output_type": "execute_result"
1026 | }
1027 | ],
1028 | "source": [
1029 | "set(train_apis) - set(test_apis)"
1030 | ]
1031 | }
1032 | ],
1033 | "metadata": {
1034 | "kernelspec": {
1035 | "display_name": "Python 3",
1036 | "language": "python",
1037 | "name": "python3"
1038 | },
1039 | "language_info": {
1040 | "codemirror_mode": {
1041 | "name": "ipython",
1042 | "version": 3
1043 | },
1044 | "file_extension": ".py",
1045 | "mimetype": "text/x-python",
1046 | "name": "python",
1047 | "nbconvert_exporter": "python",
1048 | "pygments_lexer": "ipython3",
1049 | "version": "3.6.2"
1050 | },
1051 | "latex_envs": {
1052 | "LaTeX_envs_menu_present": true,
1053 | "autoclose": false,
1054 | "autocomplete": true,
1055 | "bibliofile": "biblio.bib",
1056 | "cite_by": "apalike",
1057 | "current_citInitial": 1,
1058 | "eqLabelWithNumbers": true,
1059 | "eqNumInitial": 1,
1060 | "hotkeys": {
1061 | "equation": "Ctrl-E",
1062 | "itemize": "Ctrl-I"
1063 | },
1064 | "labels_anchors": false,
1065 | "latex_user_defs": false,
1066 | "report_style_numbering": false,
1067 | "user_envs_cfg": false
1068 | },
1069 | "tianchi_metadata": {
1070 | "competitions": [],
1071 | "datasets": [],
1072 | "description": "",
1073 | "notebookId": "116127",
1074 | "source": "dsw"
1075 | },
1076 | "toc": {
1077 | "nav_menu": {},
1078 | "number_sections": true,
1079 | "sideBar": true,
1080 | "skip_h1_title": false,
1081 | "title_cell": "Table of Contents",
1082 | "title_sidebar": "Contents",
1083 | "toc_cell": true,
1084 | "toc_position": {
1085 | "height": "calc(100% - 180px)",
1086 | "left": "10px",
1087 | "top": "150px",
1088 | "width": "384px"
1089 | },
1090 | "toc_section_display": true,
1091 | "toc_window_display": true
1092 | }
1093 | },
1094 | "nbformat": 4,
1095 | "nbformat_minor": 4
1096 | }
1097 |
--------------------------------------------------------------------------------
/阿里云安全恶意程序检测/阿里云安全恶意程序检测-特征工程与基线模型.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Section 3: Feature engineering and baseline model"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 2,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "import numpy as np\n",
17 | "import pandas as pd\n",
18 | "from tqdm import tqdm\n",
19 | "\n",
20 | "class _Data_Preprocess:\n",
21 | "    def __init__(self):\n",
22 | "        self.int8_max = np.iinfo(np.int8).max\n",
23 | "        self.int8_min = np.iinfo(np.int8).min\n",
24 | "\n",
25 | "        self.int16_max = np.iinfo(np.int16).max\n",
26 | "        self.int16_min = np.iinfo(np.int16).min\n",
27 | "\n",
28 | "        self.int32_max = np.iinfo(np.int32).max\n",
29 | "        self.int32_min = np.iinfo(np.int32).min\n",
30 | "\n",
31 | "        self.int64_max = np.iinfo(np.int64).max\n",
32 | "        self.int64_min = np.iinfo(np.int64).min\n",
33 | "\n",
34 | "        self.float16_max = np.finfo(np.float16).max\n",
35 | "        self.float16_min = np.finfo(np.float16).min\n",
36 | "\n",
37 | "        self.float32_max = np.finfo(np.float32).max\n",
38 | "        self.float32_min = np.finfo(np.float32).min\n",
39 | "\n",
40 | "        self.float64_max = np.finfo(np.float64).max\n",
41 | "        self.float64_min = np.finfo(np.float64).min\n",
42 | "\n",
43 | "    def _get_type(self, min_val, max_val, types):\n",
44 | "        if types == 'int':\n",
45 | "            if max_val <= self.int8_max and min_val >= self.int8_min:\n",
46 | "                return np.int8\n",
47 | "            elif max_val <= self.int16_max and min_val >= self.int16_min:\n",
48 | "                return np.int16\n",
49 | "            elif max_val <= self.int32_max and min_val >= self.int32_min:\n",
50 | "                return np.int32\n",
51 | "            return None\n",
52 | "\n",
53 | "        elif types == 'float':\n",
54 | "            if max_val <= self.float16_max and min_val >= self.float16_min:\n",
55 | "                return np.float16\n",
56 | "            if max_val <= self.float32_max and min_val >= self.float32_min:\n",
57 | "                return np.float32\n",
58 | "            if max_val <= self.float64_max and min_val >= self.float64_min:\n",
59 | "                return np.float64\n",
60 | "            return None\n",
61 | "\n",
62 | "    def _memory_process(self, df):\n",
63 | "        init_memory = df.memory_usage().sum() / 1024 ** 2 / 1024\n",
64 | "        print('Original data occupies {} GB memory.'.format(init_memory))\n",
65 | "        df_cols = df.columns\n",
66 | "\n",
67 | "        # Downcast each numeric column to the smallest dtype that covers its value range.\n",
68 | "        for col in tqdm(df_cols):\n",
69 | "            try:\n",
70 | "                if 'float' in str(df[col].dtypes):\n",
71 | "                    max_val = df[col].max()\n",
72 | "                    min_val = df[col].min()\n",
73 | "                    trans_types = self._get_type(min_val, max_val, 'float')\n",
74 | "                    if trans_types is not None:\n",
75 | "                        df[col] = df[col].astype(trans_types)\n",
76 | "                elif 'int' in str(df[col].dtypes):\n",
77 | "                    max_val = df[col].max()\n",
78 | "                    min_val = df[col].min()\n",
79 | "                    trans_types = self._get_type(min_val, max_val, 'int')\n",
80 | "                    if trans_types is not None:\n",
81 | "                        df[col] = df[col].astype(trans_types)\n",
82 | "            except Exception:\n",
83 | "                print(' Can not do any process for column, {}.'.format(col))\n",
84 | "        afterprocess_memory = df.memory_usage().sum() / 1024 ** 2 / 1024\n",
85 | "        print('After processing, the data occupies {} GB memory.'.format(afterprocess_memory))\n",
86 | "        return df"
87 | ]
88 | },
89 | {
90 | "cell_type": "markdown",
91 | "metadata": {},
92 | "source": [
93 | "## 3.3 Baseline model"
94 | ]
95 | },
96 | {
97 | "cell_type": "markdown",
98 | "metadata": {},
99 | "source": [
100 | "### 3.3.1 Data loading"
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": 1,
106 | "metadata": {},
107 | "outputs": [],
108 | "source": [
109 | "import pandas as pd\n",
110 | "import numpy as np\n",
111 | "import seaborn as sns\n",
112 | "import matplotlib.pyplot as plt\n",
113 | "\n",
114 | "import lightgbm as lgb\n",
115 | "from sklearn.model_selection import train_test_split\n",
116 | "from sklearn.preprocessing import OneHotEncoder\n",
117 | "\n",
118 | "import warnings\n",
119 | "warnings.filterwarnings('ignore')\n",
120 | "%matplotlib inline"
121 | ]
122 | },
123 | {
124 | "cell_type": "markdown",
125 | "metadata": {},
126 | "source": [
127 | "### Data loading"
128 | ]
129 | },
130 | {
131 | "cell_type": "code",
132 | "execution_count": 3,
133 | "metadata": {},
134 | "outputs": [],
135 | "source": [
136 | "path = '../security_data/'\n",
137 | "train = pd.read_csv(path + 'security_train.csv')\n",
138 | "test = pd.read_csv(path + 'security_test.csv')"
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": 4,
144 | "metadata": {},
145 | "outputs": [
146 | {
147 | "data": {
219 | "text/plain": [
220 | "   file_id  label                     api   tid  index\n",
221 | "0        1      5              LdrLoadDll  2488      0\n",
222 | "1        1      5  LdrGetProcedureAddress  2488      1\n",
223 | "2        1      5  LdrGetProcedureAddress  2488      2\n",
224 | "3        1      5  LdrGetProcedureAddress  2488      3\n",
225 | "4        1      5  LdrGetProcedureAddress  2488      4"
226 | ]
227 | },
228 | "execution_count": 4,
229 | "metadata": {},
230 | "output_type": "execute_result"
231 | }
232 | ],
233 | "source": [
234 | "train.head()"
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {},
240 | "source": [
241 | "### 3.3.2 Feature engineering"
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": 5,
247 | "metadata": {},
248 | "outputs": [],
249 | "source": [
250 | "def simple_sts_features(df):\n",
251 | "    simple_fea = pd.DataFrame()\n",
252 | "    simple_fea['file_id'] = df['file_id'].unique()\n",
253 | "    simple_fea = simple_fea.sort_values('file_id')\n",
254 | "\n",
255 | "    df_grp = df.groupby('file_id')\n",
256 | "    simple_fea['file_id_api_count'] = df_grp['api'].count().values\n",
257 | "    simple_fea['file_id_api_nunique'] = df_grp['api'].nunique().values\n",
258 | "\n",
259 | "    simple_fea['file_id_tid_count'] = df_grp['tid'].count().values\n",
260 | "    simple_fea['file_id_tid_nunique'] = df_grp['tid'].nunique().values\n",
261 | "\n",
262 | "    simple_fea['file_id_index_count'] = df_grp['index'].count().values\n",
263 | "    simple_fea['file_id_index_nunique'] = df_grp['index'].nunique().values\n",
264 | "\n",
265 | "    return simple_fea"
266 | ]
267 | },
268 | {
269 | "cell_type": "code",
270 | "execution_count": 6,
271 | "metadata": {},
272 | "outputs": [
273 | {
274 | "name": "stdout",
275 | "output_type": "stream",
276 | "text": [
277 | "Wall time: 1.4 s\n"
278 | ]
279 | }
280 | ],
281 | "source": [
282 | "%%time\n",
283 | "simple_train_fea1 = simple_sts_features(train)"
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": 7,
289 | "metadata": {},
290 | "outputs": [
291 | {
292 | "name": "stdout",
293 | "output_type": "stream",
294 | "text": [
295 | "Wall time: 23.9 ms\n"
296 | ]
297 | }
298 | ],
299 | "source": [
300 | "%%time\n",
301 | "simple_test_fea1 = simple_sts_features(test)"
302 | ]
303 | },
304 | {
305 | "cell_type": "code",
306 | "execution_count": 8,
307 | "metadata": {},
308 | "outputs": [],
309 | "source": [
310 | "def simple_numerical_sts_features(df):\n",
311 | "    simple_numerical_fea = pd.DataFrame()\n",
312 | "    simple_numerical_fea['file_id'] = df['file_id'].unique()\n",
313 | "    simple_numerical_fea = simple_numerical_fea.sort_values('file_id')\n",
314 | "\n",
315 | "    df_grp = df.groupby('file_id')\n",
316 | "\n",
317 | "    simple_numerical_fea['file_id_tid_mean'] = df_grp['tid'].mean().values\n",
318 | "    simple_numerical_fea['file_id_tid_min'] = df_grp['tid'].min().values\n",
319 | "    simple_numerical_fea['file_id_tid_std'] = df_grp['tid'].std().values\n",
320 | "    simple_numerical_fea['file_id_tid_max'] = df_grp['tid'].max().values\n",
321 | "\n",
322 | "\n",
323 | "    simple_numerical_fea['file_id_index_mean'] = df_grp['index'].mean().values\n",
324 | "    simple_numerical_fea['file_id_index_min'] = df_grp['index'].min().values\n",
325 | "    simple_numerical_fea['file_id_index_std'] = df_grp['index'].std().values\n",
326 | "    simple_numerical_fea['file_id_index_max'] = df_grp['index'].max().values\n",
327 | "\n",
328 | "    return simple_numerical_fea"
329 | ]
330 | },
331 | {
332 | "cell_type": "code",
333 | "execution_count": 9,
334 | "metadata": {
335 | "scrolled": true
336 | },
337 | "outputs": [
338 | {
339 | "name": "stdout",
340 | "output_type": "stream",
341 | "text": [
342 | "Wall time: 172 ms\n"
343 | ]
344 | }
345 | ],
346 | "source": [
347 | "%%time\n",
348 | "simple_train_fea2 = simple_numerical_sts_features(train)"
349 | ]
350 | },
351 | {
352 | "cell_type": "code",
353 | "execution_count": 10,
354 | "metadata": {},
355 | "outputs": [
356 | {
357 | "name": "stdout",
358 | "output_type": "stream",
359 | "text": [
360 | "Wall time: 18 ms\n"
361 | ]
362 | }
363 | ],
364 | "source": [
365 | "%%time\n",
366 | "simple_test_fea2 = simple_numerical_sts_features(test)"
367 | ]
368 | },
369 | {
370 | "cell_type": "markdown",
371 | "metadata": {},
372 | "source": [
373 | "### 3.3.3 Building the baseline"
374 | ]
375 | },
376 | {
377 | "cell_type": "code",
378 | "execution_count": 11,
379 | "metadata": {},
380 | "outputs": [],
381 | "source": [
382 | "train_label = train[['file_id','label']].drop_duplicates(subset = ['file_id','label'], keep = 'first')\n",
383 | "test_submit = test[['file_id']].drop_duplicates(subset = ['file_id'], keep = 'first')"
384 | ]
385 | },
386 | {
387 | "cell_type": "code",
388 | "execution_count": 12,
389 | "metadata": {},
390 | "outputs": [],
391 | "source": [
392 | "### Build the training and test sets\n",
393 | "train_data = train_label.merge(simple_train_fea1, on ='file_id', how='left')\n",
394 | "train_data = train_data.merge(simple_train_fea2, on ='file_id', how='left')\n",
395 | "\n",
396 | "test_submit = test_submit.merge(simple_test_fea1, on ='file_id', how='left')\n",
397 | "test_submit = test_submit.merge(simple_test_fea2, on ='file_id', how='left')"
398 | ]
399 | },
400 | {
401 | "cell_type": "code",
402 | "execution_count": 14,
403 | "metadata": {},
404 | "outputs": [],
405 | "source": [
406 | "def lgb_logloss(preds, data):\n",
407 | "    labels_ = data.get_label()\n",
408 | "    n_classes = len(preds) // len(labels_)  # assumes preds is the flat, class-major array; also works when a fold lacks a class\n",
409 | "    preds_prob = []\n",
410 | "    for i in range(n_classes):\n",
411 | "        preds_prob.append(preds[i*len(labels_):(i+1) * len(labels_)])\n",
412 | "\n",
413 | "    preds_prob_ = np.vstack(preds_prob)  # shape: (n_classes, n_samples)\n",
414 | "\n",
415 | "    loss = []\n",
416 | "    for i in range(preds_prob_.shape[1]):  # number of samples\n",
417 | "        sum_ = 0\n",
418 | "        for j in range(preds_prob_.shape[0]):  # number of classes\n",
419 | "            pred = preds_prob_[j,i]  # predicted probability that sample i belongs to class j\n",
420 | "            if j == labels_[i]:\n",
421 | "                sum_ += np.log(pred)\n",
422 | "            else:\n",
423 | "                sum_ += np.log(1 - pred)\n",
424 | "        loss.append(sum_)\n",
425 | "    return 'loss is: ',-1 * (np.sum(loss) / preds_prob_.shape[1]),False"
426 | ]
427 | },
428 | {
429 | "cell_type": "markdown",
430 | "metadata": {},
431 | "source": []
432 | },
433 | {
434 | "cell_type": "code",
435 | "execution_count": 15,
436 | "metadata": {},
437 | "outputs": [],
438 | "source": [
439 | "### Model validation\n",
440 | "train_features = [col for col in train_data.columns if col not in ['label','file_id']]\n",
441 | "train_label = 'label'"
442 | ]
443 | },
444 | {
445 | "cell_type": "code",
446 | "execution_count": 16,
447 | "metadata": {
448 | "scrolled": true
449 | },
450 | "outputs": [
451 | {
452 | "name": "stdout",
453 | "output_type": "stream",
454 | "text": [
455 | "fold n°0\n",
456 | "Training until validation scores don't improve for 100 rounds\n",
457 | "[50]\ttraining's multi_logloss: 1.83717\ttraining's loss is: : 2.41456\tvalid_1's multi_logloss: 1.28536\tvalid_1's loss is: : 0.941228\n",
458 | "[100]\ttraining's multi_logloss: 1.83717\ttraining's loss is: : 2.41456\tvalid_1's multi_logloss: 1.28536\tvalid_1's loss is: : 0.941228\n",
459 | "Early stopping, best iteration is:\n",
460 | "[1]\ttraining's multi_logloss: 1.83717\ttraining's loss is: : 2.41456\tvalid_1's multi_logloss: 1.28536\tvalid_1's loss is: : 0.941228\n",
461 | "fold n°1\n",
462 | "Training until validation scores don't improve for 100 rounds\n",
463 | "[50]\ttraining's multi_logloss: 1.77226\ttraining's loss is: : 2.32838\tvalid_1's multi_logloss: 2.10695\tvalid_1's loss is: : 1.86109\n",
464 | "[100]\ttraining's multi_logloss: 1.77226\ttraining's loss is: : 2.32838\tvalid_1's multi_logloss: 2.10695\tvalid_1's loss is: : 1.86109\n",
465 | "Early stopping, best iteration is:\n",
466 | "[1]\ttraining's multi_logloss: 1.77226\ttraining's loss is: : 2.32838\tvalid_1's multi_logloss: 2.10695\tvalid_1's loss is: : 1.86109\n",
467 | "fold n°2\n",
468 | "Training until validation scores don't improve for 100 rounds\n",
469 | "[50]\ttraining's multi_logloss: 1.79063\ttraining's loss is: : 2.32093\tvalid_1's multi_logloss: 1.67573\tvalid_1's loss is: : 1.91268\n",
470 | "[100]\ttraining's multi_logloss: 1.79063\ttraining's loss is: : 2.32093\tvalid_1's multi_logloss: 1.67573\tvalid_1's loss is: : 1.91268\n",
471 | "Early stopping, best iteration is:\n",
472 | "[1]\ttraining's multi_logloss: 1.79063\ttraining's loss is: : 2.32093\tvalid_1's multi_logloss: 1.67573\tvalid_1's loss is: : 1.91268\n",
473 | "fold n°3\n",
474 | "Training until validation scores don't improve for 100 rounds\n",
475 | "[50]\ttraining's multi_logloss: 1.79651\ttraining's loss is: : 2.36572\tvalid_1's multi_logloss: 1.92355\tvalid_1's loss is: : 1.42824\n",
476 | "[100]\ttraining's multi_logloss: 1.79651\ttraining's loss is: : 2.36572\tvalid_1's multi_logloss: 1.92355\tvalid_1's loss is: : 1.42824\n",
477 | "Early stopping, best iteration is:\n",
478 | "[1]\ttraining's multi_logloss: 1.79651\ttraining's loss is: : 2.36572\tvalid_1's multi_logloss: 1.92355\tvalid_1's loss is: : 1.42824\n",
479 | "fold n°4\n",
480 | "Training until validation scores don't improve for 100 rounds\n",
481 | "[50]\ttraining's multi_logloss: 1.70379\ttraining's loss is: : 2.27265\tvalid_1's multi_logloss: 2.91788\tvalid_1's loss is: : 3.32694\n",
482 | "[100]\ttraining's multi_logloss: 1.70379\ttraining's loss is: : 2.27265\tvalid_1's multi_logloss: 2.91788\tvalid_1's loss is: : 3.32694\n",
483 | "Early stopping, best iteration is:\n",
484 | "[1]\ttraining's multi_logloss: 1.70379\ttraining's loss is: : 2.27265\tvalid_1's multi_logloss: 2.91788\tvalid_1's loss is: : 3.32694\n",
485 | "Wall time: 9.94 s\n"
486 | ]
487 | }
488 | ],
489 | "source": [
490 | "%%time\n",
491 | "from sklearn.model_selection import StratifiedKFold,KFold\n",
492 | "params = {\n",
493 | "    'task':'train',\n",
494 | "    'num_leaves': 255,\n",
495 | "    'objective': 'multiclass',\n",
496 | "    'num_class': 8,\n",
497 | "    'min_data_in_leaf': 50,\n",
498 | "    'learning_rate': 0.05,\n",
499 | "    'feature_fraction': 0.85,\n",
500 | "    'bagging_fraction': 0.85,\n",
501 | "    'bagging_freq': 5,\n",
502 | "    'max_bin':128,\n",
503 | "    'random_state':100\n",
504 | "}\n",
505 | "\n",
506 | "folds = KFold(n_splits=5, shuffle=True, random_state=15)\n",
507 | "oof = np.zeros(len(train_data))  # one slot per file, not per API-call row\n",
508 | "\n",
509 | "predict_res = 0\n",
510 | "models = []\n",
511 | "for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_data)):\n",
512 | "    print(\"fold n°{}\".format(fold_))\n",
513 | "    trn_data = lgb.Dataset(train_data.iloc[trn_idx][train_features], label=train_data.iloc[trn_idx][train_label].values)\n",
514 | "    val_data = lgb.Dataset(train_data.iloc[val_idx][train_features], label=train_data.iloc[val_idx][train_label].values)\n",
515 | "\n",
516 | "    clf = lgb.train(params, trn_data, num_boost_round=2000,valid_sets=[trn_data,val_data], verbose_eval=50, early_stopping_rounds=100, feval=lgb_logloss)\n",
517 | "    models.append(clf)"
518 | ]
519 | },
596 | {
597 | "cell_type": "markdown",
598 | "metadata": {},
599 | "source": [
600 | "### 3.3.4 Feature importance analysis"
601 | ]
602 | },
603 | {
604 | "cell_type": "code",
605 | "execution_count": 18,
606 | "metadata": {},
607 | "outputs": [],
608 | "source": [
609 | "feature_importance = pd.DataFrame()\n",
610 | "feature_importance['fea_name'] = train_features\n",
611 | "feature_importance['fea_imp'] = clf.feature_importance()\n",
612 | "feature_importance = feature_importance.sort_values('fea_imp',ascending = False)"
613 | ]
614 | },
615 | {
616 | "cell_type": "code",
617 | "execution_count": 19,
618 | "metadata": {},
619 | "outputs": [
620 | {
621 | "data": {
622 | "text/plain": [
623 | ""
624 | ]
625 | },
626 | "execution_count": 19,
627 | "metadata": {},
628 | "output_type": "execute_result"
629 | },
630 | {
631 | "data": {
632 | "image/png": "\n",
633 | "text/plain": [
634 | ""
635 | ]
636 | },
637 | "metadata": {
638 | "needs_background": "light"
639 | },
640 | "output_type": "display_data"
641 | }
642 | ],
643 | "source": [
644 | "plt.figure(figsize=[20, 10,])\n",
645 | "sns.barplot(x = feature_importance['fea_name'], y = feature_importance['fea_imp'])\n",
646 | "#sns.barplot(x=\"fea_name\",y=\"fea_imp\",data=feature_importance)"
647 | ]
648 | },
649 | {
650 | "cell_type": "markdown",
651 | "metadata": {},
652 | "source": [
653 | "### 3.3.5 Model testing"
654 | ]
655 | },
656 | {
657 | "cell_type": "code",
658 | "execution_count": 20,
659 | "metadata": {},
660 | "outputs": [],
661 | "source": [
662 | "pred_res = 0\n",
663 | "fold = 5\n",
664 | "for model in models:\n",
665 | "    pred_res += model.predict(test_submit[train_features]) * 1.0 / fold"
666 | ]
667 | },
668 | {
669 | "cell_type": "code",
670 | "execution_count": 21,
671 | "metadata": {},
672 | "outputs": [],
673 | "source": [
674 | "test_submit['prob0'] = 0\n",
675 | "test_submit['prob1'] = 0\n",
676 | "test_submit['prob2'] = 0\n",
677 | "test_submit['prob3'] = 0\n",
678 | "test_submit['prob4'] = 0\n",
679 | "test_submit['prob5'] = 0\n",
680 | "test_submit['prob6'] = 0\n",
681 | "test_submit['prob7'] = 0"
682 | ]
683 | },
684 | {
685 | "cell_type": "code",
686 | "execution_count": 22,
687 | "metadata": {
688 | "scrolled": true
689 | },
690 | "outputs": [],
691 | "source": [
692 | "test_submit[['prob0','prob1','prob2','prob3','prob4','prob5','prob6','prob7']] = pred_res\n",
693 | "test_submit[['file_id','prob0','prob1','prob2','prob3','prob4','prob5','prob6','prob7']].to_csv('baseline.csv', index=False)"
694 | ]
695 | }
696 | ],
697 | "metadata": {
698 | "kernelspec": {
699 | "display_name": "Python 3",
700 | "language": "python",
701 | "name": "python3"
702 | },
703 | "language_info": {
704 | "codemirror_mode": {
705 | "name": "ipython",
706 | "version": 3
707 | },
708 | "file_extension": ".py",
709 | "mimetype": "text/x-python",
710 | "name": "python",
711 | "nbconvert_exporter": "python",
712 | "pygments_lexer": "ipython3",
713 | "version": "3.7.1"
714 | },
715 | "latex_envs": {
716 | "LaTeX_envs_menu_present": true,
717 | "autoclose": false,
718 | "autocomplete": true,
719 | "bibliofile": "biblio.bib",
720 | "cite_by": "apalike",
721 | "current_citInitial": 1,
722 | "eqLabelWithNumbers": true,
723 | "eqNumInitial": 1,
724 | "hotkeys": {
725 | "equation": "Ctrl-E",
726 | "itemize": "Ctrl-I"
727 | },
728 | "labels_anchors": false,
729 | "latex_user_defs": false,
730 | "report_style_numbering": false,
731 | "user_envs_cfg": false
732 | },
733 | "toc": {
734 | "nav_menu": {},
735 | "number_sections": true,
736 | "sideBar": true,
737 | "skip_h1_title": false,
738 | "title_cell": "Table of Contents",
739 | "title_sidebar": "Contents",
740 | "toc_cell": true,
741 | "toc_position": {
742 | "height": "calc(100% - 180px)",
743 | "left": "10px",
744 | "top": "150px",
745 | "width": "384px"
746 | },
747 | "toc_section_display": true,
748 | "toc_window_display": true
749 | }
750 | },
751 | "nbformat": 4,
752 | "nbformat_minor": 2
753 | }
754 |
--------------------------------------------------------------------------------