├── O2O优惠券预测
    ├── O2O优惠券预测-数据探索.ipynb
    ├── O2O优惠券预测-模型训练.ipynb
    ├── O2O优惠券预测-模型验证及优化.ipynb
    ├── O2O优惠券预测-特征工程.ipynb
    └── O2O优惠券预测-赛题实践.ipynb
├── README.md
├── 天猫重复购买预测
    ├── 天猫重复购买预测 02 数据探索.ipynb
    ├── 天猫重复购买预测 03 特征工程.ipynb
    ├── 天猫重复购买预测 04 模型训练、验证和评测.ipynb
    └── 天猫重复购买预测 05 特征优化和特征选择.ipynb
├── 工业蒸汽
    ├── zhengqi_test.txt
    ├── zhengqi_train.txt
    ├── 工业蒸汽 02数据探索.ipynb
    ├── 工业蒸汽 03 特征工程.ipynb
    ├── 工业蒸汽 04 模型训练.ipynb
    ├── 工业蒸汽 05 模型验证.ipynb
    ├── 工业蒸汽 06 特征优化.ipynb
    └── 工业蒸汽 07 模型融合.ipynb
└── 阿里云安全恶意程序检测
    ├── 阿里云安全恶意程序检测-优化技巧与解决方案升级.ipynb
    ├── 阿里云安全恶意程序检测-数据探索.ipynb
    ├── 阿里云安全恶意程序检测-特征工程与基线模型.ipynb
    ├── 阿里云安全恶意程序检测-特征工程进阶与方案优化.ipynb
    └── 阿里云安全恶意程序检测-高阶数据探索.ipynb

/README.md:
--------------------------------------------------------------------------------
1 | # alibaba_tianchi_book
2 | Code for the book on Alibaba Cloud Tianchi competition walkthroughs (阿里云天池大赛赛题解析). Original book code: https://tianchi.aliyun.com/specials/promotion/bookcode?spm=5176.14154004.J_3941670930.15.31fe5699wu5Gtm
3 |
4 | For easier browsing I have reorganized the code; it contains four projects in total.
5 |
6 | The code is of high quality and goes into real depth; in my opinion this book should be enough for anyone moving on to intermediate machine learning.
7 |
8 |

--------------------------------------------------------------------------------
/天猫重复购买预测/天猫重复购买预测 05 特征优化和特征选择.ipynb:
--------------------------------------------------------------------------------
1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Import packages" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import pandas as pd\n", 17 | "import numpy as np\n", 18 | "\n", 19 | "import warnings\n", 20 | "warnings.filterwarnings(\"ignore\") " 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "## Read the data (first 10,000 training rows, first 100 test rows)" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 2, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "train_data = pd.read_csv('train_all.csv',nrows=10000)\n", 37 | "test_data = pd.read_csv('test_all.csv',nrows=100)" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 |
"metadata": {}, 43 | "source": [ 44 | "## Read the full data" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 3, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "# train_data = pd.read_csv('train_all.csv',nrows=None)\n", 54 | "# test_data = pd.read_csv('test_all.csv',nrows=None)" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "## Build the training and test sets" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 4, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [ 70 | "features_columns = [col for col in train_data.columns if col not in ['user_id','label']]\n", 71 | "train = train_data[features_columns].values\n", 72 | "test = test_data[features_columns].values\n", 73 | "target =train_data['label'].values" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "## Imputing missing values" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "There are many ways to handle missing values; the most common are:\n", 88 | "1. Deletion. Suitable when the dataset is large or when the proportion of missing data is small.\n", 89 | "2. Imputation. The usual approach is to fill with the mean or the median; interpolation or model-based prediction can also be used.\n", 90 | "3. 
No handling. Tree-based models are insensitive to missing values." 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "#### Imputing with the median" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 5, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "# from sklearn.preprocessing import Imputer\n", 107 | "# imputer = Imputer(strategy=\"median\")\n", 108 | "\n", 109 | "from sklearn.impute import SimpleImputer\n", 110 | "\n", 111 | "imputer = SimpleImputer(missing_values=np.nan, strategy='median')\n", 112 | "imputer = imputer.fit(train)\n", 113 | "train_imputer = imputer.transform(train)\n", 114 | "test_imputer = imputer.transform(test)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "## Feature selection concepts" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "In machine learning and statistics, feature selection (also called variable selection, attribute selection, or variable subset selection) is the process of choosing a subset of relevant features (i.e. attributes or metrics) for building a model. Feature selection techniques are used for three reasons:\n", 129 | "\n", 130 | " to simplify the model so that researchers or users can understand it more easily,\n", 131 | " to shorten training time,\n", 132 | " to improve generalization and reduce overfitting (i.e. reduce variance).\n", 133 | "\n", 134 | "The key assumption behind feature selection is that the training data contains many redundant or irrelevant features, so removing them does not lose information. Redundant and irrelevant are two different notions: a feature that is useful on its own may become redundant if it is strongly correlated with another useful feature that is also present in the data.\n", 135 | "Feature selection differs from feature extraction. Feature extraction creates new features from functions of the original features, whereas feature selection returns only a subset of the original features. Feature selection is often used in domains with many features but comparatively few samples (data points). Typical use cases include the analysis of written text and of microarray data, where there are thousands of features but only a few dozen to a few hundred samples." 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 6, 141 | "metadata": {}, 142 | "outputs": [], 143 | "source": [ 144 | "from sklearn.model_selection import cross_val_score\n", 145 | "from sklearn.ensemble import RandomForestClassifier\n", 146 | "\n", 147 | "def feature_selection(train, train_sel, target):\n", 148 | " clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)\n", 149 | " \n", 150 | " scores = cross_val_score(clf, train, target, cv=5)\n", 151 | " scores_sel = cross_val_score(clf, train_sel, target, cv=5)\n", 152 | " \n", 153 | " 
print(\"No Select Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2)) \n", 154 | " print(\"Features Select Accuracy: %0.2f (+/- %0.2f)\" % (scores_sel.mean(), scores_sel.std() * 2))" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "### Removing low-variance features (method 1)\n", 162 | "VarianceThreshold is a simple baseline approach to feature selection. It removes every feature whose variance does not meet a given threshold. By default it removes all zero-variance features, i.e. features that have the same value in every sample." 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 7, 168 | "metadata": {}, 169 | "outputs": [ 170 | { 171 | "name": "stdout", 172 | "output_type": "stream", 173 | "text": [ 174 | "训练数据未特征筛选维度 (2000, 229)\n", 175 | "训练数据特征筛选维度后 (2000, 29)\n" 176 | ] 177 | } 178 | ], 179 | "source": [ 180 | "from sklearn.feature_selection import VarianceThreshold\n", 181 | "\n", 182 | "sel = VarianceThreshold(threshold=(.8 * (1 - .8)))\n", 183 | "sel = sel.fit(train)\n", 184 | "train_sel = sel.transform(train)\n", 185 | "test_sel = sel.transform(test)\n", 186 | "print('训练数据未特征筛选维度', train.shape)\n", 187 | "print('训练数据特征筛选维度后', train_sel.shape)" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "### Before vs. after feature selection" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 8, 200 | "metadata": {}, 201 | "outputs": [ 202 | { 203 | "name": "stdout", 204 | "output_type": "stream", 205 | "text": [ 206 | "No Select Accuracy: 0.93 (+/- 0.00)\n", 207 | "Features Select Accuracy: 0.93 (+/- 0.00)\n" 208 | ] 209 | } 210 | ], 211 | "source": [ 212 | "feature_selection(train, train_sel, target)" 213 | ] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "metadata": {}, 218 | "source": [ 219 | "### Univariate feature selection (method 2)\n", 220 | "Selects the best features based on univariate statistical tests." 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 9, 226 | "metadata": {}, 227 | "outputs": [ 228 | { 229 | "name": "stdout", 230 | "output_type": "stream", 231 | "text": [ 232 | "训练数据未特征筛选维度 (2000, 229)\n", 233 | "训练数据特征筛选维度后 (2000, 2)\n" 
234 | ] 235 | } 236 | ], 237 | "source": [ 238 | "from sklearn.feature_selection import SelectKBest\n", 239 | "# from sklearn.feature_selection import chi2\n", 240 | "from sklearn.feature_selection import mutual_info_classif\n", 241 | "\n", 242 | "sel = SelectKBest(mutual_info_classif, k=2)\n", 243 | "sel = sel.fit(train, target)\n", 244 | "train_sel = sel.transform(train)\n", 245 | "test_sel = sel.transform(test)\n", 246 | "print('训练数据未特征筛选维度', train.shape)\n", 247 | "print('训练数据特征筛选维度后', train_sel.shape)" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": 10, 253 | "metadata": {}, 254 | "outputs": [ 255 | { 256 | "name": "stdout", 257 | "output_type": "stream", 258 | "text": [ 259 | "训练数据未特征筛选维度 (2000, 229)\n", 260 | "训练数据特征筛选维度后 (2000, 10)\n" 261 | ] 262 | } 263 | ], 264 | "source": [ 265 | "sel = SelectKBest(mutual_info_classif, k=10)\n", 266 | "sel = sel.fit(train, target)\n", 267 | "train_sel = sel.transform(train)\n", 268 | "test_sel = sel.transform(test)\n", 269 | "print('训练数据未特征筛选维度', train.shape)\n", 270 | "print('训练数据特征筛选维度后', train_sel.shape)" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "metadata": {}, 276 | "source": [ 277 | "### Before vs. after feature selection" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 11, 283 | "metadata": {}, 284 | "outputs": [ 285 | { 286 | "name": "stdout", 287 | "output_type": "stream", 288 | "text": [ 289 | "No Select Accuracy: 0.93 (+/- 0.00)\n", 290 | "Features Select Accuracy: 0.93 (+/- 0.00)\n" 291 | ] 292 | } 293 | ], 294 | "source": [ 295 | "feature_selection(train, train_sel, target)" 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "metadata": {}, 301 | "source": [ 302 | "### Recursive feature elimination (method 3)\n", 303 | "Fit a chosen model, then refit recursively: each round removes the lowest-scoring features, and the loop repeats." 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 12, 309 | "metadata": {}, 310 | "outputs": [ 311 | { 312 | "name": "stdout", 313 | "output_type": "stream", 314 | "text": [ 315 | "[False False 
False False False False False False False False False False\n", 316 | " False False False False False False False False False False False False\n", 317 | " False False False False False False False False False False False False\n", 318 | " False False False False False False False False False False False False\n", 319 | " False False False False False False False False False False False False\n", 320 | " False False False False False False False False False False False False\n", 321 | " False False False False False False False False False False False False\n", 322 | " False False False False False False False False False False False False\n", 323 | " False False False False False False False False False False False False\n", 324 | " False False False False False False False False False False False False\n", 325 | " False False False False False False False False False False False False\n", 326 | " False False False False False False False False False False False False\n", 327 | " False False False False False False False False False False False False\n", 328 | " False False False False False False False False False False False False\n", 329 | " False False False False False False False False False False False True\n", 330 | " True False False False False False False False False False False True\n", 331 | " False True False False False False True False False True False False\n", 332 | " False True False True False True False True False False False False\n", 333 | " False False False False False False False False False False False False\n", 334 | " False]\n", 335 | "[220 219 218 217 216 215 213 212 211 210 209 208 207 206 205 204 203 202\n", 336 | " 201 200 197 195 192 187 186 185 184 183 182 181 180 179 178 177 176 175\n", 337 | " 174 173 172 171 170 169 168 167 166 165 164 163 162 161 160 158 157 155\n", 338 | " 154 153 152 151 150 149 148 147 146 145 144 143 142 141 140 139 137 136\n", 339 | " 134 133 132 131 130 129 128 127 126 125 124 123 122 121 120 117 116 
115\n", 340 | " 114 113 112 111 110 109 108 107 106 105 104 103 102 101 100 99 98 97\n", 341 | " 95 94 93 92 91 90 89 88 87 214 86 85 84 83 189 193 199 198\n", 342 | " 196 194 191 190 188 81 80 79 77 74 73 72 71 69 68 67 66 65\n", 343 | " 64 156 63 61 60 59 58 57 55 54 138 135 50 48 46 45 44 42\n", 344 | " 39 119 118 38 35 31 28 25 24 22 17 15 14 96 13 12 43 1\n", 345 | " 1 4 82 75 78 76 26 30 70 20 7 1 62 1 51 53 49 47\n", 346 | " 1 27 41 1 23 21 18 1 11 1 19 1 6 1 8 29 2 5\n", 347 | " 10 3 9 16 32 33 34 36 37 40 52 56 159]\n" 348 | ] 349 | } 350 | ], 351 | "source": [ 352 | "from sklearn.feature_selection import RFECV\n", 353 | "from sklearn.ensemble import RandomForestClassifier\n", 354 | "\n", 355 | "clf = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=0, n_jobs=-1)\n", 356 | "selector = RFECV(clf, step=1, cv=2)\n", 357 | "selector = selector.fit(train, target)\n", 358 | "print(selector.support_)\n", 359 | "print(selector.ranking_)" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "### Model-based feature selection (method 4)" 367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "metadata": {}, 372 | "source": [ 373 | "#### Variable selection using fitted LR coefficients (L2 penalty)\n", 374 | "A logistic regression (LR) model selects variables through its fitted coefficients, keeping the features with the largest effect on the target." 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": 13, 380 | "metadata": {}, 381 | "outputs": [ 382 | { 383 | "name": "stdout", 384 | "output_type": "stream", 385 | "text": [ 386 | "训练数据未特征筛选维度 (2000, 229)\n", 387 | "训练数据特征筛选维度后 (2000, 19)\n" 388 | ] 389 | } 390 | ], 391 | "source": [ 392 | "from sklearn.feature_selection import SelectFromModel\n", 393 | "from sklearn.linear_model import LogisticRegression\n", 394 | "from sklearn.preprocessing import Normalizer\n", 395 | "\n", 396 | "normalizer = Normalizer()\n", 397 | "normalizer = normalizer.fit(train) \n", 398 | "\n", 399 | "train_norm = normalizer.transform(train) \n", 400 | "test_norm = normalizer.transform(test)\n", 401 | "\n", 402 | "LR 
= LogisticRegression(penalty='l2',C=5)\n", 403 | "LR = LR.fit(train_norm, target)\n", 404 | "model = SelectFromModel(LR, prefit=True)\n", 405 | "train_sel = model.transform(train)\n", 406 | "test_sel = model.transform(test)\n", 407 | "print('训练数据未特征筛选维度', train.shape)\n", 408 | "print('训练数据特征筛选维度后', train_sel.shape)" 409 | ] 410 | }, 411 | { 412 | "cell_type": "markdown", 413 | "metadata": {}, 414 | "source": [ 415 | "##### Coefficients fitted with the L2 penalty" 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": 14, 421 | "metadata": {}, 422 | "outputs": [ 423 | { 424 | "data": { 425 | "text/plain": [ 426 | "array([ 0.27519508, -0.02736226, -0.00522652, 0.90644126, -0.4310027 ,\n", 427 | " -0.25110925, -0.4058899 , 0.29059019, 0.10568508, -0.02731211])" 428 | ] 429 | }, 430 | "execution_count": 14, 431 | "metadata": {}, 432 | "output_type": "execute_result" 433 | } 434 | ], 435 | "source": [ 436 | "LR.coef_[0][:10]" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "metadata": {}, 442 | "source": [ 443 | "### Before vs. after feature selection" 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": 15, 449 | "metadata": {}, 450 | "outputs": [ 451 | { 452 | "name": "stdout", 453 | "output_type": "stream", 454 | "text": [ 455 | "No Select Accuracy: 0.93 (+/- 0.00)\n", 456 | "Features Select Accuracy: 0.93 (+/- 0.00)\n" 457 | ] 458 | } 459 | ], 460 | "source": [ 461 | "feature_selection(train, train_sel, target)" 462 | ] 463 | }, 464 | { 465 | "cell_type": "markdown", 466 | "metadata": {}, 467 | "source": [ 468 | "#### Variable selection using fitted LR coefficients (L1 penalty)\n", 469 | "A logistic regression (LR) model selects variables through its fitted coefficients, keeping the features with the largest effect on the target." 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": 16, 475 | "metadata": {}, 476 | "outputs": [], 477 | "source": [ 478 | "# from sklearn.feature_selection import SelectFromModel\n", 479 | "# from sklearn.linear_model import LogisticRegression\n", 480 | "# from sklearn.preprocessing import Normalizer\n", 481 | "\n", 482 | "# normalizer = 
Normalizer()\n", 483 | "# normalizer = normalizer.fit(train) \n", 484 | "\n", 485 | "# train_norm = normalizer.transform(train) \n", 486 | "# test_norm = normalizer.transform(test)\n", 487 | "\n", 488 | "# LR = LogisticRegression(penalty='l1',C=5,solver='liblinear')\n", 489 | "# LR = LR.fit(train_norm, target)\n", 490 | "# model = SelectFromModel(LR, prefit=True)\n", 491 | "# train_sel = model.transform(train)\n", 492 | "# test_sel = model.transform(test)\n", 493 | "# print('训练数据未特征筛选维度', train.shape)\n", 494 | "# print('训练数据特征筛选维度后', train_sel.shape)" 495 | ] 496 | }, 497 | { 498 | "cell_type": "markdown", 499 | "metadata": {}, 500 | "source": [ 501 | "##### Coefficients fitted with the L1 penalty\n", 502 | "With a good choice of α, and provided certain conditions are met, the Lasso can fully recover the exact set of non-zero variables from only a few observations." 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": 17, 508 | "metadata": {}, 509 | "outputs": [], 510 | "source": [ 511 | "# LR.coef_[0][:10]" 512 | ] 513 | }, 514 | { 515 | "cell_type": "markdown", 516 | "metadata": {}, 517 | "source": [ 518 | "### Before vs. after feature selection" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": 18, 524 | "metadata": {}, 525 | "outputs": [ 526 | { 527 | "name": "stdout", 528 | "output_type": "stream", 529 | "text": [ 530 | "No Select Accuracy: 0.93 (+/- 0.00)\n", 531 | "Features Select Accuracy: 0.93 (+/- 0.00)\n" 532 | ] 533 | } 534 | ], 535 | "source": [ 536 | "feature_selection(train, train_sel, target)" 537 | ] 538 | }, 539 | { 540 | "cell_type": "markdown", 541 | "metadata": {}, 542 | "source": [ 543 | "### Tree-based feature selection\n", 544 | "Tree models rank features by the total score accumulated from the split criterion, and features are then filtered according to that ranking." 545 | ] 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": 19, 550 | "metadata": {}, 551 | "outputs": [ 552 | { 553 | "name": "stdout", 554 | "output_type": "stream", 555 | "text": [ 556 | "训练数据未特征筛选维度 (2000, 229)\n", 557 | "训练数据特征筛选维度后 (2000, 71)\n" 558 | ] 559 | } 560 | ], 561 | "source": [ 562 | "from sklearn.ensemble import ExtraTreesClassifier\n", 563 | "from sklearn.feature_selection import 
SelectFromModel\n", 564 | "\n", 565 | "clf = ExtraTreesClassifier(n_estimators=50)\n", 566 | "clf = clf.fit(train, target)\n", 567 | "\n", 568 | "model = SelectFromModel(clf, prefit=True)\n", 569 | "train_sel = model.transform(train)\n", 570 | "test_sel = model.transform(test)\n", 571 | "print('训练数据未特征筛选维度', train.shape)\n", 572 | "print('训练数据特征筛选维度后', train_sel.shape)" 573 | ] 574 | }, 575 | { 576 | "cell_type": "markdown", 577 | "metadata": {}, 578 | "source": [ 579 | "#### Tree feature importances" 580 | ] 581 | }, 582 | { 583 | "cell_type": "code", 584 | "execution_count": 20, 585 | "metadata": {}, 586 | "outputs": [ 587 | { 588 | "data": { 589 | "text/plain": [ 590 | "array([0.09210871, 0.00578114, 0.00388741, 0.0047027 , 0.00324662,\n", 591 | " 0.00409547, 0.00560588, 0.00399393, 0.00499705, 0.00233944])" 592 | ] 593 | }, 594 | "execution_count": 20, 595 | "metadata": {}, 596 | "output_type": "execute_result" 597 | } 598 | ], 599 | "source": [ 600 | "clf.feature_importances_[:10]" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": 21, 606 | "metadata": {}, 607 | "outputs": [], 608 | "source": [ 609 | "df_features_import = pd.DataFrame()\n", 610 | "df_features_import['features_import'] = clf.feature_importances_\n", 611 | "df_features_import['features_name'] = features_columns" 612 | ] 613 | }, 614 | { 615 | "cell_type": "code", 616 | "execution_count": 22, 617 | "metadata": {}, 618 | "outputs": [ 619 | { 620 | "data": { 621 | "text/html": [ 622 | "
\n", 623 | "\n", 636 | "\n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " 
\n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | "
     features_import  features_name
0    0.092109  merchant_id
228  0.085244  xgb_clf
227  0.056583  lgb_clf
199  0.007003  embeeding_72
179  0.006930  embeeding_52
18   0.006444  seller_most_1_cnt
207  0.006367  embeeding_80
193  0.006110  embeeding_66
190  0.006107  embeeding_63
132  0.006077  embeeding_5
144  0.005996  embeeding_17
146  0.005913  embeeding_19
1    0.005781  age_range
158  0.005715  embeeding_31
191  0.005701  embeeding_64
165  0.005673  embeeding_38
15   0.005648  cat_most_1
6    0.005606  brand_nunique
22   0.005488  user_cnt_0
220  0.005485  embeeding_93
166  0.005473  embeeding_39
87   0.005472  tfidf_60
127  0.005463  embeeding_0
196  0.005427  embeeding_69
205  0.005407  embeeding_78
147  0.005402  embeeding_20
163  0.005347  embeeding_36
192  0.005328  embeeding_65
169  0.005280  embeeding_42
50   0.005277  tfidf_23
\n", 797 | "
" 798 | ], 799 | "text/plain": [ 800 | " features_import features_name\n", 801 | "0 0.092109 merchant_id\n", 802 | "228 0.085244 xgb_clf\n", 803 | "227 0.056583 lgb_clf\n", 804 | "199 0.007003 embeeding_72\n", 805 | "179 0.006930 embeeding_52\n", 806 | "18 0.006444 seller_most_1_cnt\n", 807 | "207 0.006367 embeeding_80\n", 808 | "193 0.006110 embeeding_66\n", 809 | "190 0.006107 embeeding_63\n", 810 | "132 0.006077 embeeding_5\n", 811 | "144 0.005996 embeeding_17\n", 812 | "146 0.005913 embeeding_19\n", 813 | "1 0.005781 age_range\n", 814 | "158 0.005715 embeeding_31\n", 815 | "191 0.005701 embeeding_64\n", 816 | "165 0.005673 embeeding_38\n", 817 | "15 0.005648 cat_most_1\n", 818 | "6 0.005606 brand_nunique\n", 819 | "22 0.005488 user_cnt_0\n", 820 | "220 0.005485 embeeding_93\n", 821 | "166 0.005473 embeeding_39\n", 822 | "87 0.005472 tfidf_60\n", 823 | "127 0.005463 embeeding_0\n", 824 | "196 0.005427 embeeding_69\n", 825 | "205 0.005407 embeeding_78\n", 826 | "147 0.005402 embeeding_20\n", 827 | "163 0.005347 embeeding_36\n", 828 | "192 0.005328 embeeding_65\n", 829 | "169 0.005280 embeeding_42\n", 830 | "50 0.005277 tfidf_23" 831 | ] 832 | }, 833 | "execution_count": 22, 834 | "metadata": {}, 835 | "output_type": "execute_result" 836 | } 837 | ], 838 | "source": [ 839 | "df_features_import.sort_values(['features_import'],ascending=False).head(30)" 840 | ] 841 | }, 842 | { 843 | "cell_type": "code", 844 | "execution_count": 23, 845 | "metadata": {}, 846 | "outputs": [], 847 | "source": [ 848 | "# features_columns" 849 | ] 850 | }, 851 | { 852 | "cell_type": "markdown", 853 | "metadata": {}, 854 | "source": [ 855 | "### Before vs. after feature selection" 856 | ] 857 | }, 858 | { 859 | "cell_type": "code", 860 | "execution_count": 24, 861 | "metadata": {}, 862 | "outputs": [ 863 | { 864 | "name": "stdout", 865 | "output_type": "stream", 866 | "text": [ 867 | "No Select Accuracy: 0.93 (+/- 0.00)\n", 868 | "Features Select Accuracy: 0.93 (+/- 0.00)\n" 869 | ] 870 | } 871 | ], 872 | "source": [ 
873 | "feature_selection(train, train_sel, target)" 874 | ] 875 | }, 876 | { 877 | "cell_type": "markdown", 878 | "metadata": {}, 879 | "source": [ 880 | "### LightGBM feature importance" 881 | ] 882 | }, 883 | { 884 | "cell_type": "code", 885 | "execution_count": 25, 886 | "metadata": {}, 887 | "outputs": [ 888 | { 889 | "name": "stdout", 890 | "output_type": "stream", 891 | "text": [ 892 | "[LightGBM] [Warning] Unknown parameter: colsample_bylevel\n", 893 | "[LightGBM] [Warning] Unknown parameter: tree_method\n", 894 | "[LightGBM] [Warning] Unknown parameter: silent\n", 895 | "[LightGBM] [Warning] Unknown parameter: colsample_bylevel\n", 896 | "[LightGBM] [Warning] Unknown parameter: tree_method\n", 897 | "[LightGBM] [Warning] Unknown parameter: silent\n", 898 | "[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006242 seconds.\n", 899 | "You can set `force_row_wise=true` to remove the overhead.\n", 900 | "And if memory is not enough, you can set `force_col_wise=true`.\n", 901 | "[LightGBM] [Info] Total Bins 32114\n", 902 | "[LightGBM] [Info] Number of data points in the train set: 1200, number of used features: 224\n", 903 | "[LightGBM] [Warning] Unknown parameter: colsample_bylevel\n", 904 | "[LightGBM] [Warning] Unknown parameter: tree_method\n", 905 | "[LightGBM] [Warning] Unknown parameter: silent\n", 906 | "[LightGBM] [Info] Start training from score -0.068100\n", 907 | "[LightGBM] [Info] Start training from score -2.720629\n", 908 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 909 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 910 | "[1]\tvalid_0's multi_logloss: 0.256738\n", 911 | "Training until validation scores don't improve for 100 rounds\n", 912 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 913 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 914 | "[2]\tvalid_0's multi_logloss: 0.256574\n", 
915 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 916 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 917 | "[3]\tvalid_0's multi_logloss: 0.256518\n", 918 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 919 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 920 | "[4]\tvalid_0's multi_logloss: 0.25657\n", 921 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 922 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 923 | "[5]\tvalid_0's multi_logloss: 0.256756\n", 924 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 925 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 926 | "[6]\tvalid_0's multi_logloss: 0.25682\n", 927 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 928 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 929 | "[7]\tvalid_0's multi_logloss: 0.256989\n", 930 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 931 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 932 | "[8]\tvalid_0's multi_logloss: 0.257236\n", 933 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 934 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 935 | "[9]\tvalid_0's multi_logloss: 0.25712\n", 936 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 937 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 938 | "[10]\tvalid_0's multi_logloss: 0.257011\n", 939 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 940 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 941 | "[11]\tvalid_0's multi_logloss: 0.257042\n", 
942 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 943 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 944 | "[12]\tvalid_0's multi_logloss: 0.257402\n", 945 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 946 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 947 | "[13]\tvalid_0's multi_logloss: 0.257419\n", 948 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 949 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 950 | "[14]\tvalid_0's multi_logloss: 0.257646\n", 951 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 952 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 953 | "[15]\tvalid_0's multi_logloss: 0.257552\n", 954 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 955 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 956 | "[16]\tvalid_0's multi_logloss: 0.257604\n", 957 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 958 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 959 | "[17]\tvalid_0's multi_logloss: 0.257797\n", 960 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 961 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 962 | "[18]\tvalid_0's multi_logloss: 0.257928\n", 963 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 964 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 965 | "[19]\tvalid_0's multi_logloss: 0.258142\n", 966 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 967 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 968 | "[20]\tvalid_0's multi_logloss: 
0.25847\n", 969 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 970 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 971 | "[21]\tvalid_0's multi_logloss: 0.258654\n", 972 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 973 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 974 | "[22]\tvalid_0's multi_logloss: 0.258846\n", 975 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 976 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 977 | "[23]\tvalid_0's multi_logloss: 0.258962\n", 978 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 979 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 980 | "[24]\tvalid_0's multi_logloss: 0.258991\n", 981 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 982 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 983 | "[25]\tvalid_0's multi_logloss: 0.259334\n", 984 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 985 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 986 | "[26]\tvalid_0's multi_logloss: 0.259433\n", 987 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 988 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 989 | "[27]\tvalid_0's multi_logloss: 0.259912\n", 990 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 991 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 992 | "[28]\tvalid_0's multi_logloss: 0.260153\n", 993 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 994 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 995 | "[29]\tvalid_0's 
multi_logloss: 0.260576\n", 996 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 997 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 998 | "[30]\tvalid_0's multi_logloss: 0.26094\n", 999 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1000 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1001 | "[31]\tvalid_0's multi_logloss: 0.261198\n", 1002 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1003 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1004 | "[32]\tvalid_0's multi_logloss: 0.26141\n", 1005 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1006 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1007 | "[33]\tvalid_0's multi_logloss: 0.261614\n", 1008 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1009 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1010 | "[34]\tvalid_0's multi_logloss: 0.261801\n", 1011 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1012 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1013 | "[35]\tvalid_0's multi_logloss: 0.261931\n", 1014 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1015 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1016 | "[36]\tvalid_0's multi_logloss: 0.262242\n", 1017 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1018 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1019 | "[37]\tvalid_0's multi_logloss: 0.262492\n", 1020 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1021 | "[LightGBM] [Warning] No further splits with positive gain, best gain: 
-inf\n", 1022 | "[38]\tvalid_0's multi_logloss: 0.26273\n", 1023 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1024 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1025 | "[39]\tvalid_0's multi_logloss: 0.262855\n", 1026 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1027 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1028 | "[40]\tvalid_0's multi_logloss: 0.263225\n", 1029 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1030 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1031 | "[41]\tvalid_0's multi_logloss: 0.263311\n", 1032 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1033 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1034 | "[42]\tvalid_0's multi_logloss: 0.263612\n", 1035 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n" 1036 | ] 1037 | }, 1038 | { 1039 | "name": "stdout", 1040 | "output_type": "stream", 1041 | "text": [ 1042 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1043 | "[43]\tvalid_0's multi_logloss: 0.263937\n", 1044 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1045 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1046 | "[44]\tvalid_0's multi_logloss: 0.264398\n", 1047 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1048 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1049 | "[45]\tvalid_0's multi_logloss: 0.264822\n", 1050 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1051 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1052 | "[46]\tvalid_0's multi_logloss: 0.264977\n", 1053 | "[LightGBM] [Warning] 
No further splits with positive gain, best gain: -inf\n", 1054 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1055 | "[47]\tvalid_0's multi_logloss: 0.265401\n", 1056 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1057 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1058 | "[48]\tvalid_0's multi_logloss: 0.265718\n", 1059 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1060 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1061 | "[49]\tvalid_0's multi_logloss: 0.265859\n", 1062 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1063 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1064 | "[50]\tvalid_0's multi_logloss: 0.266173\n", 1065 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1066 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1067 | "[51]\tvalid_0's multi_logloss: 0.266544\n", 1068 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1069 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1070 | "[52]\tvalid_0's multi_logloss: 0.266719\n", 1071 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1072 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1073 | "[53]\tvalid_0's multi_logloss: 0.266817\n", 1074 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1075 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1076 | "[54]\tvalid_0's multi_logloss: 0.267013\n", 1077 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1078 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1079 | "[55]\tvalid_0's multi_logloss: 
0.267385\n", 1080 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1081 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1082 | "[56]\tvalid_0's multi_logloss: 0.267389\n", 1083 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1084 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1085 | "[57]\tvalid_0's multi_logloss: 0.267662\n", 1086 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1087 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1088 | "[58]\tvalid_0's multi_logloss: 0.267792\n", 1089 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1090 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1091 | "[59]\tvalid_0's multi_logloss: 0.268017\n", 1092 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1093 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1094 | "[60]\tvalid_0's multi_logloss: 0.268158\n", 1095 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1096 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1097 | "[61]\tvalid_0's multi_logloss: 0.268437\n", 1098 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1099 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1100 | "[62]\tvalid_0's multi_logloss: 0.268773\n", 1101 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1102 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1103 | "[63]\tvalid_0's multi_logloss: 0.268824\n", 1104 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1105 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1106 
| "[64]\tvalid_0's multi_logloss: 0.269138\n", 1107 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1108 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1109 | "[65]\tvalid_0's multi_logloss: 0.269357\n", 1110 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1111 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1112 | "[66]\tvalid_0's multi_logloss: 0.269572\n", 1113 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1114 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1115 | "[67]\tvalid_0's multi_logloss: 0.269786\n", 1116 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1117 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1118 | "[68]\tvalid_0's multi_logloss: 0.270102\n", 1119 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1120 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1121 | "[69]\tvalid_0's multi_logloss: 0.270435\n", 1122 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1123 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1124 | "[70]\tvalid_0's multi_logloss: 0.270566\n", 1125 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1126 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1127 | "[71]\tvalid_0's multi_logloss: 0.270679\n", 1128 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1129 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1130 | "[72]\tvalid_0's multi_logloss: 0.271056\n", 1131 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1132 | "[LightGBM] [Warning] No further splits with 
positive gain, best gain: -inf\n", 1133 | "[73]\tvalid_0's multi_logloss: 0.271474\n", 1134 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1135 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1136 | "[74]\tvalid_0's multi_logloss: 0.27168\n", 1137 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1138 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1139 | "[75]\tvalid_0's multi_logloss: 0.271918\n", 1140 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1141 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1142 | "[76]\tvalid_0's multi_logloss: 0.271937\n", 1143 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1144 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1145 | "[77]\tvalid_0's multi_logloss: 0.272113\n", 1146 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1147 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1148 | "[78]\tvalid_0's multi_logloss: 0.27242\n", 1149 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1150 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1151 | "[79]\tvalid_0's multi_logloss: 0.272712\n", 1152 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1153 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1154 | "[80]\tvalid_0's multi_logloss: 0.27267\n", 1155 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1156 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1157 | "[81]\tvalid_0's multi_logloss: 0.273019\n", 1158 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1159 | "[LightGBM] 
[Warning] No further splits with positive gain, best gain: -inf\n", 1160 | "[82]\tvalid_0's multi_logloss: 0.272981\n", 1161 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1162 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1163 | "[83]\tvalid_0's multi_logloss: 0.273218\n", 1164 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1165 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1166 | "[84]\tvalid_0's multi_logloss: 0.27353\n", 1167 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1168 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1169 | "[85]\tvalid_0's multi_logloss: 0.273649\n", 1170 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1171 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1172 | "[86]\tvalid_0's multi_logloss: 0.273775\n", 1173 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1174 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1175 | "[87]\tvalid_0's multi_logloss: 0.273835\n", 1176 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1177 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1178 | "[88]\tvalid_0's multi_logloss: 0.274091\n", 1179 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1180 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1181 | "[89]\tvalid_0's multi_logloss: 0.274422\n", 1182 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1183 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1184 | "[90]\tvalid_0's multi_logloss: 0.274716\n", 1185 | "[LightGBM] [Warning] No further splits with positive gain, best 
gain: -inf\n", 1186 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1187 | "[91]\tvalid_0's multi_logloss: 0.275082\n", 1188 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n" 1189 | ] 1190 | }, 1191 | { 1192 | "name": "stdout", 1193 | "output_type": "stream", 1194 | "text": [ 1195 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1196 | "[92]\tvalid_0's multi_logloss: 0.275278\n", 1197 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1198 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1199 | "[93]\tvalid_0's multi_logloss: 0.275447\n", 1200 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1201 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1202 | "[94]\tvalid_0's multi_logloss: 0.275438\n", 1203 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1204 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1205 | "[95]\tvalid_0's multi_logloss: 0.275778\n", 1206 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1207 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1208 | "[96]\tvalid_0's multi_logloss: 0.27591\n", 1209 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1210 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1211 | "[97]\tvalid_0's multi_logloss: 0.276129\n", 1212 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1213 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1214 | "[98]\tvalid_0's multi_logloss: 0.276326\n", 1215 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1216 | "[LightGBM] [Warning] No further splits with positive gain, best 
gain: -inf\n", 1217 | "[99]\tvalid_0's multi_logloss: 0.276449\n", 1218 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1219 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1220 | "[100]\tvalid_0's multi_logloss: 0.276745\n", 1221 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1222 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1223 | "[101]\tvalid_0's multi_logloss: 0.276895\n", 1224 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1225 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1226 | "[102]\tvalid_0's multi_logloss: 0.276914\n", 1227 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1228 | "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n", 1229 | "[103]\tvalid_0's multi_logloss: 0.277281\n", 1230 | "Early stopping, best iteration is:\n", 1231 | "[3]\tvalid_0's multi_logloss: 0.256518\n" 1232 | ] 1233 | } 1234 | ], 1235 | "source": [ 1236 | "import lightgbm\n", 1237 | "from sklearn.model_selection import train_test_split\n", 1238 | "\n", 1239 | "X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.4, random_state=0)\n", 1240 | "\n", 1241 | "clf = lightgbm\n", 1242 | "\n", 1243 | "train_matrix = clf.Dataset(X_train, label=y_train)\n", 1244 | "test_matrix = clf.Dataset(X_test, label=y_test)\n", 1245 | "params = {\n", 1246 | " 'boosting_type': 'gbdt',\n", 1247 | " #'boosting_type': 'dart',\n", 1248 | " 'objective': 'multiclass',\n", 1249 | " 'metric': 'multi_logloss',\n", 1250 | " 'min_child_weight': 1.5,\n", 1251 | " 'num_leaves': 2**5,\n", 1252 | " 'lambda_l2': 10,\n", 1253 | " 'subsample': 0.7,\n", 1254 | " 'colsample_bytree': 0.7,\n", 1255 | " 'colsample_bylevel': 0.7,\n", 1256 | " 'learning_rate': 0.03,\n", 1257 | " 'tree_method': 'exact',\n", 1258 | " 'seed': 2017,\n", 
1259 | " \"num_class\": 2,\n", 1260 | " 'silent': True,\n", 1261 | " }\n", 1262 | "num_round = 10000\n", 1263 | "early_stopping_rounds = 100\n", 1264 | "model = clf.train(params, \n", 1265 | " train_matrix,\n", 1266 | " num_round,\n", 1267 | " valid_sets=test_matrix,\n", 1268 | " early_stopping_rounds=early_stopping_rounds)" 1269 | ] 1270 | }, 1271 | { 1272 | "cell_type": "code", 1273 | "execution_count": 26, 1274 | "metadata": {}, 1275 | "outputs": [], 1276 | "source": [ 1277 | "def lgb_transform(train, test, model, topK):\n", 1278 | " train_df = pd.DataFrame(train)\n", 1279 | " train_df.columns = range(train.shape[1])\n", 1280 | " \n", 1281 | " test_df = pd.DataFrame(test)\n", 1282 | " test_df.columns = range(test.shape[1])\n", 1283 | " \n", 1284 | " features_import = pd.DataFrame()\n", 1285 | " features_import['importance'] = model.feature_importance()\n", 1286 | " features_import['col'] = range(train.shape[1])\n", 1287 | " \n", 1288 | " features_import = features_import.sort_values(['importance'],ascending=False).head(topK)\n", 1289 | " sel_col = list(features_import.col)\n", 1290 | " \n", 1291 | " train_sel = train_df[sel_col]\n", 1292 | " test_sel = test_df[sel_col]\n", 1293 | " return train_sel, test_sel" 1294 | ] 1295 | }, 1296 | { 1297 | "cell_type": "code", 1298 | "execution_count": 27, 1299 | "metadata": {}, 1300 | "outputs": [ 1301 | { 1302 | "name": "stdout", 1303 | "output_type": "stream", 1304 | "text": [ 1305 | "训练数据未特征筛选维度 (2000, 229)\n", 1306 | "训练数据特征筛选维度后 (2000, 20)\n" 1307 | ] 1308 | } 1309 | ], 1310 | "source": [ 1311 | "train_sel, test_sel = lgb_transform(train, test, model, 20)\n", 1312 | "print('训练数据未特征筛选维度', train.shape)\n", 1313 | "print('训练数据特征筛选维度后', train_sel.shape)" 1314 | ] 1315 | }, 1316 | { 1317 | "cell_type": "markdown", 1318 | "metadata": {}, 1319 | "source": [ 1320 | "### LightGBM feature importance" 1321 | ] 1322 | }, 1323 | { 1324 | "cell_type": "code", 1325 | "execution_count": 28, 1326 | "metadata": {}, 1327 | "outputs": [ 1328 | { 1329 | "data":
{ 1330 | "text/plain": [ 1331 | "array([2, 3, 0, 0, 0, 1, 1, 0, 1, 0])" 1332 | ] 1333 | }, 1334 | "execution_count": 28, 1335 | "metadata": {}, 1336 | "output_type": "execute_result" 1337 | } 1338 | ], 1339 | "source": [ 1340 | "model.feature_importance()[:10]" 1341 | ] 1342 | }, 1343 | { 1344 | "cell_type": "code", 1345 | "execution_count": 29, 1346 | "metadata": {}, 1347 | "outputs": [], 1348 | "source": [ 1349 | "#sorted(model.feature_importance(),reverse=True)[:10]" 1350 | ] 1351 | }, 1352 | { 1353 | "cell_type": "markdown", 1354 | "metadata": {}, 1355 | "source": [ 1356 | "### Comparison before and after feature selection" 1357 | ] 1358 | }, 1359 | { 1360 | "cell_type": "code", 1361 | "execution_count": 30, 1362 | "metadata": {}, 1363 | "outputs": [ 1364 | { 1365 | "name": "stdout", 1366 | "output_type": "stream", 1367 | "text": [ 1368 | "No Select Accuracy: 0.93 (+/- 0.00)\n", 1369 | "Features Select Accuracy: 0.93 (+/- 0.00)\n" 1370 | ] 1371 | } 1372 | ], 1373 | "source": [ 1374 | "feature_selection(train, train_sel, target)" 1375 | ] 1376 | }, 1377 | { 1378 | "cell_type": "code", 1379 | "execution_count": null, 1380 | "metadata": {}, 1381 | "outputs": [], 1382 | "source": [] 1383 | } 1384 | ], 1385 | "metadata": { 1386 | "kernelspec": { 1387 | "display_name": "Python 3", 1388 | "language": "python", 1389 | "name": "python3" 1390 | }, 1391 | "language_info": { 1392 | "codemirror_mode": { 1393 | "name": "ipython", 1394 | "version": 3 1395 | }, 1396 | "file_extension": ".py", 1397 | "mimetype": "text/x-python", 1398 | "name": "python", 1399 | "nbconvert_exporter": "python", 1400 | "pygments_lexer": "ipython3", 1401 | "version": "3.7.1" 1402 | }, 1403 | "latex_envs": { 1404 | "LaTeX_envs_menu_present": true, 1405 | "autoclose": false, 1406 | "autocomplete": true, 1407 | "bibliofile": "biblio.bib", 1408 | "cite_by": "apalike", 1409 | "current_citInitial": 1, 1410 | "eqLabelWithNumbers": true, 1411 | "eqNumInitial": 1, 1412 | "hotkeys": { 1413 | "equation": "Ctrl-E", 1414 | "itemize": "Ctrl-I" 1415
| }, 1416 | "labels_anchors": false, 1417 | "latex_user_defs": false, 1418 | "report_style_numbering": false, 1419 | "user_envs_cfg": false 1420 | } 1421 | }, 1422 | "nbformat": 4, 1423 | "nbformat_minor": 2 1424 | } 1425 | -------------------------------------------------------------------------------- /工业蒸汽/工业蒸汽 06 特征优化.ipynb: -------------------------------------------------------------------------------- 1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.7.0"},"colab":{"name":"工业蒸汽 06 特征优化.ipynb","provenance":[],"collapsed_sections":[]}},"cells":[{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"wo_kOHZTLhVZ","executionInfo":{"status":"ok","timestamp":1623400030401,"user_tz":-480,"elapsed":22582,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}},"outputId":"e5f54b16-a3a0-4165-f896-bdd2384af00c"},"source":["from google.colab import drive\n","drive.mount('/content/drive')"],"execution_count":1,"outputs":[{"output_type":"stream","text":["Mounted at /content/drive\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"G6wonEdoLjSN","executionInfo":{"status":"ok","timestamp":1623400032113,"user_tz":-480,"elapsed":1729,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}},"outputId":"6ffe00a1-5ad1-41e6-8b04-5594798cdb7d"},"source":["%cd /content/drive/MyDrive/Colab Notebooks/天池/工业蒸汽"],"execution_count":2,"outputs":[{"output_type":"stream","text":["/content/drive/MyDrive/Colab 
Notebooks/天池/工业蒸汽\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"q_gDa5rQLdXM"},"source":["## Feature optimization\n","\n","### Load the data"]},{"cell_type":"code","metadata":{"id":"qYQ3-RsyLdXW","executionInfo":{"status":"ok","timestamp":1623401062097,"user_tz":-480,"elapsed":556,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}}},"source":["import pandas as pd\n","\n","train_data_file = \"./zhengqi_train.txt\"\n","test_data_file = \"./zhengqi_test.txt\"\n","\n","train_data = pd.read_csv(train_data_file, sep='\\t', encoding='utf-8')\n","test_data = pd.read_csv(test_data_file, sep='\\t', encoding='utf-8')"],"execution_count":13,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"1d20kkI_LdXX"},"source":["### Define the feature-construction operations"]},{"cell_type":"code","metadata":{"id":"ayULiWwdLdXY","executionInfo":{"status":"ok","timestamp":1623401065107,"user_tz":-480,"elapsed":505,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}}},"source":["epsilon=1e-5\n","\n","# Pairwise cross features; extend with your own, e.g. x*x/y, log(x)/y, and so on\n","func_dict = {\n"," 'add': lambda x,y: x+y,\n"," 'mins': lambda x,y: x-y,\n"," 'div': lambda x,y: x/(y+epsilon),\n"," 'multi': lambda x,y: x*y\n"," }"],"execution_count":14,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"ysPQ-6EQLdXY"},"source":["### Define the feature-construction function"]},{"cell_type":"code","metadata":{"id":"dPZ-p8xSLdXZ","executionInfo":{"status":"ok","timestamp":1623401066461,"user_tz":-480,"elapsed":3,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}}},"source":["def auto_features_make(train_data,test_data,func_dict,col_list):\n"," train_data, test_data = train_data.copy(), test_data.copy()\n"," for col_i in col_list:\n"," for col_j in col_list:\n"," for
func_name, func in func_dict.items():\n"," for data in [train_data,test_data]:\n"," func_features = func(data[col_i],data[col_j])\n"," col_func_features = '-'.join([col_i,func_name,col_j])\n"," data[col_func_features] = func_features\n"," return train_data,test_data"],"execution_count":15,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"rd49LO4ZLdXZ"},"source":["### Construct features for the training and test sets"]},{"cell_type":"code","metadata":{"id":"Yw1NxFGPLdXa","executionInfo":{"status":"ok","timestamp":1623401082922,"user_tz":-480,"elapsed":14522,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}}},"source":["train_data2, test_data2 = auto_features_make(train_data,test_data,func_dict,col_list=test_data.columns)"],"execution_count":16,"outputs":[]},{"cell_type":"code","metadata":{"id":"5aDoibXvLdXa","executionInfo":{"status":"ok","timestamp":1623401101042,"user_tz":-480,"elapsed":9223,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}}},"source":["from sklearn.decomposition import PCA # principal component analysis\n","\n","# Dimensionality reduction with PCA\n","pca = PCA(n_components=500)\n","# Fit on the feature columns only: 'target' is not the last column of train_data2,\n","# so iloc[:,0:-1] would keep it in the PCA input and drop a constructed feature instead\n","train_data2_pca = pca.fit_transform(train_data2[test_data2.columns])\n","test_data2_pca = pca.transform(test_data2)\n","train_data2_pca = pd.DataFrame(train_data2_pca)\n","test_data2_pca = pd.DataFrame(test_data2_pca)\n","train_data2_pca['target'] = train_data2['target']"],"execution_count":17,"outputs":[]},{"cell_type":"code","metadata":{"id":"gWeLa6r8LdXb","executionInfo":{"status":"ok","timestamp":1623401103248,"user_tz":-480,"elapsed":496,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}}},"source":["X_train2 = train_data2[test_data2.columns].values\n","y_train =
train_data2['target']"],"execution_count":18,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"kDqTOUhmLdXb"},"source":["### Train and evaluate a LightGBM model on the newly constructed features"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":833},"id":"wMfN5jqTLdXb","executionInfo":{"status":"error","timestamp":1623402038351,"user_tz":-480,"elapsed":6450,"user":{"displayName":"陆轩韬","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgCF1eSJvFwyQhlJNSqYJSHyWwDFm88zduORKCQ=s64","userId":"09875294307280000133"}},"outputId":"f5f30543-7f11-4076-8560-e1ecf4ca9efa"},"source":["# ls_validation i\n","from sklearn.model_selection import KFold\n","from sklearn.metrics import mean_squared_error\n","import lightgbm as lgb\n","import numpy as np\n","\n","# 5-fold cross-validation\n","Folds=5\n","# Pass n_splits by keyword: the first positional argument of KFold is n_splits,\n","# so KFold(len(X_train2), ...) would request one fold per sample\n","kf = KFold(n_splits=Folds, random_state=2019, shuffle=True)\n","# Record the training and prediction MSE of each fold\n","MSE_DICT = {\n"," 'train_mse':[],\n"," 'test_mse':[]\n","}\n","\n","# Offline training and prediction\n","for i, (train_index, test_index) in enumerate(kf.split(X_train2)):\n"," # LightGBM tree model\n"," lgb_reg = lgb.LGBMRegressor(\n"," learning_rate=0.01,\n"," max_depth=-1,\n"," n_estimators=5000,\n"," boosting_type='gbdt',\n"," random_state=2019,\n"," objective='regression',\n"," )\n"," \n"," # Split into training and validation folds\n"," X_train_KFold, X_test_KFold = X_train2[train_index], X_train2[test_index]\n"," y_train_KFold, y_test_KFold = y_train[train_index], y_train[test_index]\n"," \n"," # Train the model\n"," lgb_reg.fit(\n"," X=X_train_KFold,y=y_train_KFold,\n"," eval_set=[(X_train_KFold, y_train_KFold),(X_test_KFold, y_test_KFold)],\n"," eval_names=['Train','Test'],\n"," early_stopping_rounds=100,\n"," eval_metric='MSE',\n"," verbose=50\n"," )\n","\n"," # Predict on the training fold and on the validation fold\n"," y_train_KFold_predict = lgb_reg.predict(X_train_KFold,num_iteration=lgb_reg.best_iteration_)\n"," y_test_KFold_predict = lgb_reg.predict(X_test_KFold,num_iteration=lgb_reg.best_iteration_) \n"," \n"," print('第{}折 训练和预测 训练MSE
预测MSE'.format(i))\n"," train_mse = mean_squared_error(y_train_KFold_predict, y_train_KFold)\n"," print('------\\n', '训练MSE\\n', train_mse, '\\n------')\n"," test_mse = mean_squared_error(y_test_KFold_predict, y_test_KFold)\n"," print('------\\n', '预测MSE\\n', test_mse, '\\n------\\n')\n"," \n"," MSE_DICT['train_mse'].append(train_mse)\n"," MSE_DICT['test_mse'].append(test_mse)\n","print('------\\n', '训练MSE\\n', MSE_DICT['train_mse'], '\\n', np.mean(MSE_DICT['train_mse']), '\\n------')\n","print('------\\n', '预测MSE\\n', MSE_DICT['test_mse'], '\\n', np.mean(MSE_DICT['test_mse']), '\\n------')"],"execution_count":20,"outputs":[{"output_type":"stream","text":["Training until validation scores don't improve for 100 rounds.\n","[50]\tTrain's l2: 0.418978\tTrain's l2: 0.418978\tTest's l2: 0.106374\tTest's l2: 0.106374\n","[100]\tTrain's l2: 0.203693\tTrain's l2: 0.203693\tTest's l2: 0.02276\tTest's l2: 0.02276\n","[150]\tTrain's l2: 0.114486\tTrain's l2: 0.114486\tTest's l2: 0.00527795\tTest's l2: 0.00527795\n","[200]\tTrain's l2: 0.0741934\tTrain's l2: 0.0741934\tTest's l2: 5.99266e-05\tTest's l2: 5.99266e-05\n","[250]\tTrain's l2: 0.0535396\tTrain's l2: 0.0535396\tTest's l2: 0.00036171\tTest's l2: 0.00036171\n","[300]\tTrain's l2: 0.041529\tTrain's l2: 0.041529\tTest's l2: 0.00267813\tTest's l2: 0.00267813\n","Early stopping, best iteration is:\n","[221]\tTrain's l2: 0.0640274\tTrain's l2: 0.0640274\tTest's l2: 6.95547e-08\tTest's l2: 6.95547e-08\n","第0折 训练和预测 训练MSE 预测MSE\n","------\n"," 训练MSE\n"," 0.0640273654375399 \n","------\n","------\n"," 预测MSE\n"," 6.954770450692039e-08 \n","------\n","\n","Training until validation scores don't improve for 100 rounds.\n","[50]\tTrain's l2: 0.419128\tTrain's l2: 0.419128\tTest's l2: 0.142103\tTest's l2: 0.142103\n","[100]\tTrain's l2: 0.203838\tTrain's l2: 0.203838\tTest's l2: 0.0997735\tTest's l2: 0.0997735\n","[150]\tTrain's l2: 0.114537\tTrain's l2: 0.114537\tTest's l2: 0.0765469\tTest's l2: 0.0765469\n","[200]\tTrain's l2: 
0.0742645\tTrain's l2: 0.0742645\tTest's l2: 0.0612271\tTest's l2: 0.0612271\n","[250]\tTrain's l2: 0.0536164\tTrain's l2: 0.0536164\tTest's l2: 0.0471729\tTest's l2: 0.0471729\n","[300]\tTrain's l2: 0.0415117\tTrain's l2: 0.0415117\tTest's l2: 0.0427081\tTest's l2: 0.0427081\n"],"name":"stdout"},{"output_type":"error","ename":"KeyboardInterrupt","evalue":"ignored","traceback":["---------------------------------------------------------------------------","KeyboardInterrupt Traceback (most recent call last)"," in ()\n 38 early_stopping_rounds=100,\n 39 eval_metric='MSE',\n---> 40 verbose=50\n 41 )\n 42 \n","/usr/local/lib/python3.7/dist-packages/lightgbm/sklearn.py in fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_init_score, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)\n 683 verbose=verbose, feature_name=feature_name,\n 684 categorical_feature=categorical_feature,\n--> 685 callbacks=callbacks)\n 686 return self\n 687 \n",
"/usr/local/lib/python3.7/dist-packages/lightgbm/sklearn.py in fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)\n 542 verbose_eval=verbose, feature_name=feature_name,\n 543 categorical_feature=categorical_feature,\n--> 544 callbacks=callbacks)\n 545 \n 546 if evals_result:\n","/usr/local/lib/python3.7/dist-packages/lightgbm/engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)\n 216 evaluation_result_list=None))\n 217 \n--> 218 booster.update(fobj=fobj)\n 219 \n 220 evaluation_result_list = []\n",
"/usr/local/lib/python3.7/dist-packages/lightgbm/basic.py in update(self, train_set, fobj)\n 1800 _safe_call(_LIB.LGBM_BoosterUpdateOneIter(\n 1801 self.handle,\n-> 1802 ctypes.byref(is_finished)))\n 1803 self.__is_predicted_cur_iter = [False for _ in range_(self.__num_dataset)]\n 1804 return is_finished.value == 1\n","KeyboardInterrupt: "]}]},{"cell_type":"code","metadata":{"id":"aB3dbDkaPV8W"},"source":[""],"execution_count":null,"outputs":[]}]} -------------------------------------------------------------------------------- /阿里云安全恶意程序检测/阿里云安全恶意程序检测-优化技巧与解决方案升级.ipynb:
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Section 6: Optimization Techniques and Solution Upgrades" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## 6.2 Deep Learning Solution: TextCNN Modeling" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "### 6.2.2 Reading the Data" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "import pandas as pd\n", 31 | "import numpy as np\n", 32 | "import seaborn as sns\n", 33 | "import matplotlib.pyplot as plt\n", 34 | "\n", 35 | "import lightgbm as lgb\n", 36 | "from sklearn.model_selection import train_test_split\n", 37 | "from sklearn.preprocessing import OneHotEncoder\n", 38 | "\n", 39 | "from tqdm import tqdm_notebook\n", 40 | "from sklearn.preprocessing import LabelBinarizer,LabelEncoder\n", 41 | "\n", 42 | "import warnings\n", 43 | "warnings.filterwarnings('ignore')\n", 44 | "%matplotlib inline" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 4, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "path = '../security_data/'\n", 54 | "train = pd.read_csv(path + 'security_train.csv')\n", 55 | "test = pd.read_csv(path + 'security_test.csv')" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 2, 61 | "metadata": {}, 62 | "outputs": [], 63 | "source": [ 64 | "import numpy as np\n", 65 | "import pandas as pd\n", 66 | "from tqdm import tqdm \n", 67 | "\n", 68 | "class _Data_Preprocess:\n", 69 | " def __init__(self):\n", 70 | " self.int8_max = np.iinfo(np.int8).max\n", 71 | " self.int8_min = np.iinfo(np.int8).min\n", 72 | "\n", 73 | " self.int16_max = np.iinfo(np.int16).max\n", 74 | " self.int16_min = np.iinfo(np.int16).min\n", 75 | "\n", 76 | " self.int32_max = np.iinfo(np.int32).max\n", 77 | " self.int32_min = 
np.iinfo(np.int32).min\n", 78 | "\n", 79 | " self.int64_max = np.iinfo(np.int64).max\n", 80 | " self.int64_min = np.iinfo(np.int64).min\n", 81 | "\n", 82 | " self.float16_max = np.finfo(np.float16).max\n", 83 | " self.float16_min = np.finfo(np.float16).min\n", 84 | "\n", 85 | " self.float32_max = np.finfo(np.float32).max\n", 86 | " self.float32_min = np.finfo(np.float32).min\n", 87 | "\n", 88 | " self.float64_max = np.finfo(np.float64).max\n", 89 | " self.float64_min = np.finfo(np.float64).min\n", 90 | "\n", 91 | " def _get_type(self, min_val, max_val, types):\n", 92 | " if types == 'int':\n", 93 | " if max_val <= self.int8_max and min_val >= self.int8_min:\n", 94 | " return np.int8\n", 95 | " elif max_val <= self.int16_max and min_val >= self.int16_min:\n", 96 | " return np.int16\n", 97 | " elif max_val <= self.int32_max and min_val >= self.int32_min:\n", 98 | " return np.int32\n", 99 | " return None\n", 100 | "\n", 101 | " elif types == 'float':\n", 102 | " if max_val <= self.float16_max and min_val >= self.float16_min:\n", 103 | " return np.float16\n", 104 | " if max_val <= self.float32_max and min_val >= self.float32_min:\n", 105 | " return np.float32\n", 106 | " if max_val <= self.float64_max and min_val >= self.float64_min:\n", 107 | " return np.float64\n", 108 | " return None\n", 109 | "\n", 110 | " def _memory_process(self, df):\n", 111 | " init_memory = df.memory_usage().sum() / 1024 ** 2 / 1024\n", 112 | " print('Original data occupies {} GB memory.'.format(init_memory))\n", 113 | " df_cols = df.columns\n", 114 | "\n", 115 | " \n", 116 | " for col in tqdm_notebook(df_cols):\n", 117 | " try:\n", 118 | " if 'float' in str(df[col].dtypes):\n", 119 | " max_val = df[col].max()\n", 120 | " min_val = df[col].min()\n", 121 | " trans_types = self._get_type(min_val, max_val, 'float')\n", 122 | " if trans_types is not None:\n", 123 | " df[col] = df[col].astype(trans_types)\n", 124 | " elif 'int' in str(df[col].dtypes):\n", 125 | " max_val = 
df[col].max()\n", 126 | " min_val = df[col].min()\n", 127 | " trans_types = self._get_type(min_val, max_val, 'int')\n", 128 | " if trans_types is not None:\n", 129 | " df[col] = df[col].astype(trans_types)\n", 130 | " except:\n", 131 | " print(' Can not do any process for column, {}.'.format(col)) \n", 132 | " afterprocess_memory = df.memory_usage().sum() / 1024 ** 2 / 1024\n", 133 | " print('After processing, the data occupies {} GB memory.'.format(afterprocess_memory))\n", 134 | " return df\n", 135 | "\n", 136 | "memory_process = _Data_Preprocess()" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 5, 142 | "metadata": { 143 | "scrolled": true 144 | }, 145 | "outputs": [ 146 | { 147 | "data": { 148 | "text/html": [ 149 | "
\n", 150 | "\n", 163 | "\n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | "
   file_id  label                     api   tid  index
0        1      5              LdrLoadDll  2488      0
1        1      5  LdrGetProcedureAddress  2488      1
2        1      5  LdrGetProcedureAddress  2488      2
3        1      5  LdrGetProcedureAddress  2488      3
4        1      5  LdrGetProcedureAddress  2488      4
 \n", 217 | "
" 218 | ], 219 | "text/plain": [ 220 | " file_id label api tid index\n", 221 | "0 1 5 LdrLoadDll 2488 0\n", 222 | "1 1 5 LdrGetProcedureAddress 2488 1\n", 223 | "2 1 5 LdrGetProcedureAddress 2488 2\n", 224 | "3 1 5 LdrGetProcedureAddress 2488 3\n", 225 | "4 1 5 LdrGetProcedureAddress 2488 4" 226 | ] 227 | }, 228 | "execution_count": 5, 229 | "metadata": {}, 230 | "output_type": "execute_result" 231 | } 232 | ], 233 | "source": [ 234 | "train.head()" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "### 6.2.3 Data Preprocessing" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 6, 247 | "metadata": {}, 248 | "outputs": [], 249 | "source": [ 250 | "# (Convert strings to numbers)\n", 251 | "unique_api = train['api'].unique()" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": 8, 257 | "metadata": {}, 258 | "outputs": [], 259 | "source": [ 260 | "api2index = {item:(i+1) for i,item in enumerate(unique_api)}\n", 261 | "index2api = {(i+1):item for i,item in enumerate(unique_api)}" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 9, 267 | "metadata": {}, 268 | "outputs": [], 269 | "source": [ 270 | "train['api_idx'] = train['api'].map(api2index)\n", 271 | "test['api_idx'] = test['api'].map(api2index)" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": 10, 277 | "metadata": {}, 278 | "outputs": [], 279 | "source": [ 280 | "# Get the string sequence corresponding to each file\n", 281 | "def get_sequence(df,period_idx):\n", 282 | " seq_list = []\n", 283 | " for _id,begin in enumerate(period_idx[:-1]):\n", 284 | " seq_list.append(df.iloc[begin:period_idx[_id+1]]['api_idx'].values)\n", 285 | " seq_list.append(df.iloc[period_idx[-1]:]['api_idx'].values)\n", 286 | " return seq_list" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": 11, 292 | "metadata": {}, 293 | "outputs": [], 294 | "source": [ 295 | "train_period_idx = 
train.file_id.drop_duplicates(keep='first').index.values\n", 296 | "test_period_idx = test.file_id.drop_duplicates(keep='first').index.values" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": 13, 302 | "metadata": {}, 303 | "outputs": [], 304 | "source": [ 305 | "train_df = train[['file_id','label']].drop_duplicates(keep='first')\n", 306 | "test_df = test[['file_id']].drop_duplicates(keep='first')" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": 14, 312 | "metadata": {}, 313 | "outputs": [], 314 | "source": [ 315 | "train_df['seq'] = get_sequence(train,train_period_idx)\n", 316 | "test_df['seq'] = get_sequence(test,test_period_idx)" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "### 6.2.4 TextCNN Network Architecture" 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": 16, 329 | "metadata": {}, 330 | "outputs": [ 331 | { 332 | "name": "stderr", 333 | "output_type": "stream", 334 | "text": [ 335 | "Using TensorFlow backend.\n" 336 | ] 337 | } 338 | ], 339 | "source": [ 340 | "from keras.preprocessing.text import Tokenizer\n", 341 | "from keras.preprocessing.sequence import pad_sequences\n", 342 | "from keras.layers import Dense, Input, LSTM, Lambda, Embedding, Dropout, Activation,GRU,Bidirectional\n", 343 | "from keras.layers import Conv1D,Conv2D,MaxPooling2D,GlobalAveragePooling1D,GlobalMaxPooling1D, MaxPooling1D, Flatten\n", 344 | "from keras.layers import CuDNNGRU, CuDNNLSTM, SpatialDropout1D\n", 345 | "from keras.layers.merge import concatenate, Concatenate, Average, Dot, Maximum, Multiply, Subtract, average\n", 346 | "from keras.models import Model\n", 347 | "from keras.optimizers import RMSprop,Adam\n", 348 | "from keras.layers.normalization import BatchNormalization\n", 349 | "from keras.callbacks import EarlyStopping, ModelCheckpoint\n", 350 | "from keras.optimizers import SGD\n", 351 | "from keras import backend as K\n", 352 | "from 
sklearn.decomposition import TruncatedSVD, NMF, LatentDirichletAllocation\n", 353 | "from keras.layers import SpatialDropout1D\n", 354 | "from keras.layers.wrappers import Bidirectional" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": 17, 360 | "metadata": {}, 361 | "outputs": [], 362 | "source": [ 363 | "def TextCNN(max_len,max_cnt,embed_size, num_filters,kernel_size,conv_action, mask_zero):\n", 364 | " _input = Input(shape=(max_len,), dtype='int32')\n", 365 | " _embed = Embedding(max_cnt, embed_size, input_length=max_len, mask_zero=mask_zero)(_input)\n", 366 | " _embed = SpatialDropout1D(0.15)(_embed)\n", 367 | " warppers = []\n", 368 | " \n", 369 | " for _kernel_size in kernel_size:\n", 370 | " conv1d = Conv1D(filters=num_filters, kernel_size=_kernel_size, activation=conv_action)(_embed)\n", 371 | " warppers.append(GlobalMaxPooling1D()(conv1d))\n", 372 | " \n", 373 | " fc = concatenate(warppers)\n", 374 | " fc = Dropout(0.5)(fc)\n", 375 | " #fc = BatchNormalization()(fc)\n", 376 | " fc = Dense(256, activation='relu')(fc)\n", 377 | " fc = Dropout(0.25)(fc)\n", 378 | " #fc = BatchNormalization()(fc) \n", 379 | " preds = Dense(8, activation = 'softmax')(fc)\n", 380 | " \n", 381 | " model = Model(inputs=_input, outputs=preds)\n", 382 | " \n", 383 | " model.compile(loss='categorical_crossentropy',\n", 384 | " optimizer='adam',\n", 385 | " metrics=['accuracy'])\n", 386 | " return model" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": 18, 392 | "metadata": {}, 393 | "outputs": [], 394 | "source": [ 395 | "train_labels = pd.get_dummies(train_df.label).values\n", 396 | "train_seq = pad_sequences(train_df.seq.values, maxlen = 6000)\n", 397 | "test_seq = pad_sequences(test_df.seq.values, maxlen = 6000)" 398 | ] 399 | }, 400 | { 401 | "cell_type": "markdown", 402 | "metadata": {}, 403 | "source": [ 404 | "### 6.2.5 TextCNN Training and Prediction" 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": 19, 410 | 
"metadata": {}, 411 | "outputs": [], 412 | "source": [ 413 | "from sklearn.model_selection import StratifiedKFold,KFold \n", 414 | "skf = KFold(n_splits=5, shuffle=True)" 415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": 20, 420 | "metadata": {}, 421 | "outputs": [], 422 | "source": [ 423 | "max_len = 6000\n", 424 | "max_cnt = 295\n", 425 | "embed_size = 256\n", 426 | "num_filters = 64\n", 427 | "kernel_size = [2,4,6,8,10,12,14]\n", 428 | "conv_action = 'relu'\n", 429 | "mask_zero = False\n", 430 | "TRAIN = True" 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": 21, 436 | "metadata": { 437 | "scrolled": true 438 | }, 439 | "outputs": [ 440 | { 441 | "name": "stdout", 442 | "output_type": "stream", 443 | "text": [ 444 | "FOLD: \n", 445 | "2778 11109\n", 446 | "Train on 11109 samples, validate on 2778 samples\n", 447 | "Epoch 1/100\n", 448 | "11109/11109 [==============================] - 75s 7ms/step - loss: 0.8165 - acc: 0.7370 - val_loss: 0.4825 - val_acc: 0.8485\n", 449 | "Epoch 2/100\n", 450 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.4772 - acc: 0.8499 - val_loss: 0.4141 - val_acc: 0.8625\n", 451 | "Epoch 3/100\n", 452 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.4172 - acc: 0.8673 - val_loss: 0.3785 - val_acc: 0.8780\n", 453 | "Epoch 4/100\n", 454 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3768 - acc: 0.8769 - val_loss: 0.3821 - val_acc: 0.8726\n", 455 | "Epoch 5/100\n", 456 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3568 - acc: 0.8831 - val_loss: 0.3932 - val_acc: 0.8783\n", 457 | "Epoch 6/100\n", 458 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3388 - acc: 0.8893 - val_loss: 0.3566 - val_acc: 0.8902\n", 459 | "Epoch 7/100\n", 460 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3179 - acc: 0.8968 - val_loss: 0.3553 - val_acc: 0.8902\n", 
461 | "Epoch 8/100\n", 462 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3050 - acc: 0.8991 - val_loss: 0.3590 - val_acc: 0.8870\n", 463 | "Epoch 9/100\n", 464 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2913 - acc: 0.9006 - val_loss: 0.3593 - val_acc: 0.8909\n", 465 | "Epoch 10/100\n", 466 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2812 - acc: 0.9047 - val_loss: 0.3528 - val_acc: 0.8906\n", 467 | "Epoch 11/100\n", 468 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2633 - acc: 0.9054 - val_loss: 0.3608 - val_acc: 0.8823\n", 469 | "Epoch 12/100\n", 470 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2665 - acc: 0.9103 - val_loss: 0.3589 - val_acc: 0.8859\n", 471 | "Epoch 13/100\n", 472 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2495 - acc: 0.9097 - val_loss: 0.3570 - val_acc: 0.8909\n", 473 | "2778/2778 [==============================] - 4s 1ms/step\n", 474 | "12955/12955 [==============================] - 13s 980us/step\n", 475 | "FOLD: \n", 476 | "2778 11109\n", 477 | "Train on 11109 samples, validate on 2778 samples\n", 478 | "Epoch 1/100\n", 479 | "11109/11109 [==============================] - 65s 6ms/step - loss: 0.8297 - acc: 0.7290 - val_loss: 0.4925 - val_acc: 0.8463\n", 480 | "Epoch 2/100\n", 481 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.4808 - acc: 0.8442 - val_loss: 0.4115 - val_acc: 0.8690\n", 482 | "Epoch 3/100\n", 483 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.4149 - acc: 0.8643 - val_loss: 0.4037 - val_acc: 0.8715\n", 484 | "Epoch 4/100\n", 485 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3799 - acc: 0.8774 - val_loss: 0.3798 - val_acc: 0.8841\n", 486 | "Epoch 5/100\n", 487 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3530 - acc: 0.8850 - val_loss: 0.3773 - val_acc: 
0.8870\n", 488 | "Epoch 6/100\n", 489 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3291 - acc: 0.8924 - val_loss: 0.3676 - val_acc: 0.8855\n", 490 | "Epoch 7/100\n", 491 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3115 - acc: 0.8959 - val_loss: 0.3773 - val_acc: 0.8888\n", 492 | "Epoch 8/100\n", 493 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.3032 - acc: 0.8998 - val_loss: 0.3518 - val_acc: 0.8891\n", 494 | "Epoch 9/100\n", 495 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2892 - acc: 0.9027 - val_loss: 0.3655 - val_acc: 0.8920\n", 496 | "Epoch 10/100\n", 497 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2792 - acc: 0.9056 - val_loss: 0.3615 - val_acc: 0.8906\n", 498 | "Epoch 11/100\n", 499 | "11109/11109 [==============================] - 64s 6ms/step - loss: 0.2736 - acc: 0.9085 - val_loss: 0.3719 - val_acc: 0.8924\n", 500 | "2778/2778 [==============================] - 3s 1ms/step\n", 501 | "12955/12955 [==============================] - 12s 951us/step\n", 502 | "FOLD: \n", 503 | "2777 11110\n", 504 | "Train on 11110 samples, validate on 2777 samples\n", 505 | "Epoch 1/100\n", 506 | "11110/11110 [==============================] - 67s 6ms/step - loss: 0.8388 - acc: 0.7239 - val_loss: 0.4323 - val_acc: 0.8657\n", 507 | "Epoch 2/100\n", 508 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.4969 - acc: 0.8418 - val_loss: 0.3881 - val_acc: 0.8783\n", 509 | "Epoch 3/100\n", 510 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.4276 - acc: 0.8631 - val_loss: 0.3587 - val_acc: 0.8855\n", 511 | "Epoch 4/100\n", 512 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3882 - acc: 0.8770 - val_loss: 0.3542 - val_acc: 0.8898\n", 513 | "Epoch 5/100\n", 514 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3590 - acc: 0.8875 - val_loss: 0.3640 - 
val_acc: 0.8920\n", 515 | "Epoch 6/100\n", 516 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3437 - acc: 0.8851 - val_loss: 0.3445 - val_acc: 0.8992\n", 517 | "Epoch 7/100\n", 518 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3235 - acc: 0.8912 - val_loss: 0.3552 - val_acc: 0.8923\n", 519 | "Epoch 8/100\n", 520 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3079 - acc: 0.8986 - val_loss: 0.3491 - val_acc: 0.8927\n", 521 | "Epoch 9/100\n", 522 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2937 - acc: 0.9035 - val_loss: 0.3370 - val_acc: 0.9003\n", 523 | "Epoch 10/100\n", 524 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2897 - acc: 0.9018 - val_loss: 0.3523 - val_acc: 0.8959\n", 525 | "Epoch 11/100\n", 526 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2788 - acc: 0.9066 - val_loss: 0.3519 - val_acc: 0.8967\n", 527 | "Epoch 12/100\n", 528 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2653 - acc: 0.9131 - val_loss: 0.3571 - val_acc: 0.8952\n", 529 | "2777/2777 [==============================] - 3s 1ms/step\n", 530 | "12955/12955 [==============================] - 12s 955us/step\n", 531 | "FOLD: \n", 532 | "2777 11110\n", 533 | "Train on 11110 samples, validate on 2777 samples\n", 534 | "Epoch 1/100\n", 535 | "11110/11110 [==============================] - 66s 6ms/step - loss: 0.8286 - acc: 0.7326 - val_loss: 0.4647 - val_acc: 0.8596\n", 536 | "Epoch 2/100\n", 537 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.4807 - acc: 0.8524 - val_loss: 0.4081 - val_acc: 0.8704\n", 538 | "Epoch 3/100\n", 539 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.4233 - acc: 0.8656 - val_loss: 0.3920 - val_acc: 0.8808\n", 540 | "Epoch 4/100\n", 541 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3771 - acc: 0.8819 - val_loss: 
0.3767 - val_acc: 0.8804\n", 542 | "Epoch 5/100\n", 543 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3529 - acc: 0.8861 - val_loss: 0.3990 - val_acc: 0.8725\n", 544 | "Epoch 6/100\n", 545 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3353 - acc: 0.8900 - val_loss: 0.3776 - val_acc: 0.8822\n", 546 | "Epoch 7/100\n", 547 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3234 - acc: 0.8935 - val_loss: 0.3717 - val_acc: 0.8855\n", 548 | "Epoch 8/100\n", 549 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3012 - acc: 0.9010 - val_loss: 0.3758 - val_acc: 0.8848\n", 550 | "Epoch 9/100\n", 551 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2907 - acc: 0.9011 - val_loss: 0.3656 - val_acc: 0.8862\n", 552 | "Epoch 10/100\n", 553 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2852 - acc: 0.9022 - val_loss: 0.3676 - val_acc: 0.8858\n", 554 | "Epoch 11/100\n", 555 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2683 - acc: 0.9085 - val_loss: 0.3630 - val_acc: 0.8862\n", 556 | "Epoch 12/100\n", 557 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2595 - acc: 0.9091 - val_loss: 0.3768 - val_acc: 0.8884\n", 558 | "Epoch 13/100\n", 559 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2533 - acc: 0.9140 - val_loss: 0.3817 - val_acc: 0.8822\n", 560 | "Epoch 14/100\n", 561 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2464 - acc: 0.9155 - val_loss: 0.3757 - val_acc: 0.8862\n", 562 | "2777/2777 [==============================] - 3s 1ms/step\n", 563 | "12955/12955 [==============================] - 12s 949us/step\n", 564 | "FOLD: \n", 565 | "2777 11110\n", 566 | "Train on 11110 samples, validate on 2777 samples\n", 567 | "Epoch 1/100\n", 568 | "11110/11110 [==============================] - 65s 6ms/step - loss: 0.8168 - acc: 0.7315 - 
val_loss: 0.4718 - val_acc: 0.8567\n", 569 | "Epoch 2/100\n", 570 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.4880 - acc: 0.8459 - val_loss: 0.4047 - val_acc: 0.8711\n", 571 | "Epoch 3/100\n", 572 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.4224 - acc: 0.8674 - val_loss: 0.3871 - val_acc: 0.8732\n", 573 | "Epoch 4/100\n", 574 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3900 - acc: 0.8728 - val_loss: 0.3676 - val_acc: 0.8812\n", 575 | "Epoch 5/100\n", 576 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3581 - acc: 0.8846 - val_loss: 0.3713 - val_acc: 0.8819\n", 577 | "Epoch 6/100\n", 578 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3391 - acc: 0.8890 - val_loss: 0.3542 - val_acc: 0.8905\n", 579 | "Epoch 7/100\n", 580 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3158 - acc: 0.8975 - val_loss: 0.3610 - val_acc: 0.8902\n", 581 | "Epoch 8/100\n", 582 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.3074 - acc: 0.8986 - val_loss: 0.3520 - val_acc: 0.8887\n", 583 | "Epoch 9/100\n", 584 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2905 - acc: 0.9026 - val_loss: 0.3588 - val_acc: 0.8941\n", 585 | "Epoch 10/100\n", 586 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2795 - acc: 0.9032 - val_loss: 0.3417 - val_acc: 0.8923\n", 587 | "Epoch 11/100\n", 588 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2747 - acc: 0.9044 - val_loss: 0.3456 - val_acc: 0.8912\n", 589 | "Epoch 12/100\n", 590 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2546 - acc: 0.9131 - val_loss: 0.3517 - val_acc: 0.8902\n", 591 | "Epoch 13/100\n", 592 | "11110/11110 [==============================] - 64s 6ms/step - loss: 0.2483 - acc: 0.9144 - val_loss: 0.3785 - val_acc: 0.8909\n", 593 | "2777/2777 
[==============================] - 3s 1ms/step\n", 594 | "12955/12955 [==============================] - 12s 949us/step\n" 595 | ] 596 | } 597 | ], 598 | "source": [ 599 | "import os\n", 600 | "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0,1\"\n", 601 | "meta_train = np.zeros(shape = (len(train_seq),8))\n", 602 | "meta_test = np.zeros(shape = (len(test_seq),8))\n", 603 | "FLAG = True\n", 604 | "i = 0\n", 605 | "for tr_ind,te_ind in skf.split(train_labels):\n", 606 | " i +=1\n", 607 | " print('FOLD: {}'.format(i))\n", 608 | " print(len(te_ind),len(tr_ind)) \n", 609 | " model_name = 'benchmark_textcnn_fold_'+str(i)\n", 610 | " X_train,X_train_label = train_seq[tr_ind],train_labels[tr_ind]\n", 611 | " X_val,X_val_label = train_seq[te_ind],train_labels[te_ind]\n", 612 | " \n", 613 | " model = TextCNN(max_len,max_cnt,embed_size,num_filters,kernel_size,conv_action,mask_zero)\n", 614 | " \n", 615 | " model_save_path = './NN/%s_%s.hdf5'%(model_name,embed_size)\n", 616 | " early_stopping =EarlyStopping(monitor='val_loss', patience=3)\n", 617 | " model_checkpoint = ModelCheckpoint(model_save_path, save_best_only=True, save_weights_only=True)\n", 618 | " if TRAIN and FLAG:\n", 619 | " model.fit(X_train,X_train_label,validation_data=(X_val,X_val_label),epochs=100,batch_size=64,shuffle=True,callbacks=[early_stopping,model_checkpoint] )\n", 620 | " \n", 621 | " model.load_weights(model_save_path)\n", 622 | " pred_val = model.predict(X_val,batch_size=128,verbose=1)\n", 623 | " pred_test = model.predict(test_seq,batch_size=128,verbose=1)\n", 624 | " \n", 625 | " meta_train[te_ind] = pred_val\n", 626 | " meta_test += pred_test\n", 627 | " K.clear_session()\n", 628 | "meta_test /= 5.0 " 629 | ] 630 | }, 631 | { 632 | "cell_type": "markdown", 633 | "metadata": {}, 634 | "source": [ 635 | "### 6.2.6 Submitting the Results" 636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": 22, 641 | "metadata": {}, 642 | "outputs": [], 643 | "source": [ 644 | "test_df['prob0'] = 0\n", 645 | 
"test_df['prob1'] = 0\n", 646 | "test_df['prob2'] = 0\n", 647 | "test_df['prob3'] = 0\n", 648 | "test_df['prob4'] = 0\n", 649 | "test_df['prob5'] = 0\n", 650 | "test_df['prob6'] = 0\n", 651 | "test_df['prob7'] = 0\n", 652 | "\n", 653 | "test_df[['prob0','prob1','prob2','prob3','prob4','prob5','prob6','prob7']] = meta_test\n", 654 | "test_df[['file_id','prob0','prob1','prob2','prob3','prob4','prob5','prob6','prob7']].to_csv('nn_baseline_5fold.csv',index = None)" 655 | ] 656 | }, 657 | { 658 | "cell_type": "code", 659 | "execution_count": null, 660 | "metadata": {}, 661 | "outputs": [], 662 | "source": [] 663 | } 664 | ], 665 | "metadata": { 666 | "kernelspec": { 667 | "display_name": "Python 3", 668 | "language": "python", 669 | "name": "python3" 670 | }, 671 | "language_info": { 672 | "codemirror_mode": { 673 | "name": "ipython", 674 | "version": 3 675 | }, 676 | "file_extension": ".py", 677 | "mimetype": "text/x-python", 678 | "name": "python", 679 | "nbconvert_exporter": "python", 680 | "pygments_lexer": "ipython3", 681 | "version": "3.7.1" 682 | }, 683 | "toc": { 684 | "nav_menu": {}, 685 | "number_sections": true, 686 | "sideBar": true, 687 | "skip_h1_title": false, 688 | "title_cell": "Table of Contents", 689 | "title_sidebar": "Contents", 690 | "toc_cell": true, 691 | "toc_position": {}, 692 | "toc_section_display": true, 693 | "toc_window_display": true 694 | } 695 | }, 696 | "nbformat": 4, 697 | "nbformat_minor": 2 698 | } 699 | -------------------------------------------------------------------------------- /阿里云安全恶意程序检测/阿里云安全恶意程序检测-数据探索.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Section 2: Data Exploration" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 4, 13 | "metadata": {}, 14 | "outputs": [ 15 | { 16 | "data": { 17 | "text/html": [ 18 | "\n", 43 | "
!! The styling above was modified by the author for layout purposes; decide whether you need it !!
\n" 44 | ], 45 | "text/plain": [ 46 | "" 47 | ] 48 | }, 49 | "metadata": {}, 50 | "output_type": "display_data" 51 | } 52 | ], 53 | "source": [ 54 | "%%html\n", 55 | "\n", 80 | "
!! The styling above was modified by the author for layout purposes; decide whether you need it !!
" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "## 2.1 Exploring the Training Set" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "### 2.1.1 Feature Data Types" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 7, 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "# Import the required packages\n", 104 | "import pandas as pd\n", 105 | "import numpy as np\n", 106 | "import seaborn as sns\n", 107 | "import matplotlib.pyplot as plt\n", 108 | "\n", 109 | "# Suppress warning messages\n", 110 | "import warnings\n", 111 | "warnings.filterwarnings(\"ignore\")\n", 112 | "\n", 113 | "%matplotlib inline\n", 114 | "\n", 115 | "# Read the data\n", 116 | "path = './dataset/'\n", 117 | "train = pd.read_csv(path + 'security_train.csv') # training set\n", 118 | "test = pd.read_csv(path + 'security_test.csv') # test set" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 8, 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "data": { 128 | "text/html": [ 129 | "
\n", 130 | "\n", 143 | "\n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | "
   file_id  label                     api   tid  index
0        1      5              LdrLoadDll  2488    0.0
1        1      5  LdrGetProcedureAddress  2488    1.0
2        1      5  LdrGetProcedureAddress  2488    2.0
3        1      5  LdrGetProcedureAddress  2488    3.0
4        1      5  LdrGetProcedureAddress  2488    4.0
\n", 197 | "
" 198 | ], 199 | "text/plain": [ 200 | " file_id label api tid index\n", 201 | "0 1 5 LdrLoadDll 2488 0.0\n", 202 | "1 1 5 LdrGetProcedureAddress 2488 1.0\n", 203 | "2 1 5 LdrGetProcedureAddress 2488 2.0\n", 204 | "3 1 5 LdrGetProcedureAddress 2488 3.0\n", 205 | "4 1 5 LdrGetProcedureAddress 2488 4.0" 206 | ] 207 | }, 208 | "execution_count": 8, 209 | "metadata": {}, 210 | "output_type": "execute_result" 211 | } 212 | ], 213 | "source": [ 214 | "train.head()" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 9, 220 | "metadata": {}, 221 | "outputs": [ 222 | { 223 | "data": { 224 | "text/html": [ 225 | "
\n", 226 | "\n", 239 | "\n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | "
file_idlabeltidindex
count35952.00000035952.00000035952.00000035951.000000
mean5.1420510.9891522494.9645642153.216267
std2.5473821.957361129.9799381537.349809
min1.0000000.000000282.0000000.000000
25%4.0000000.0000002456.000000722.000000
50%5.0000000.0000002500.0000002004.000000
75%7.0000000.0000002596.0000003502.000000
max9.0000005.0000002980.0000005000.000000
\n", 308 | "
" 309 | ], 310 | "text/plain": [ 311 | " file_id label tid index\n", 312 | "count 35952.000000 35952.000000 35952.000000 35951.000000\n", 313 | "mean 5.142051 0.989152 2494.964564 2153.216267\n", 314 | "std 2.547382 1.957361 129.979938 1537.349809\n", 315 | "min 1.000000 0.000000 282.000000 0.000000\n", 316 | "25% 4.000000 0.000000 2456.000000 722.000000\n", 317 | "50% 5.000000 0.000000 2500.000000 2004.000000\n", 318 | "75% 7.000000 0.000000 2596.000000 3502.000000\n", 319 | "max 9.000000 5.000000 2980.000000 5000.000000" 320 | ] 321 | }, 322 | "execution_count": 9, 323 | "metadata": {}, 324 | "output_type": "execute_result" 325 | } 326 | ], 327 | "source": [ 328 | "train.describe()" 329 | ] 330 | }, 331 | { 332 | "cell_type": "markdown", 333 | "metadata": {}, 334 | "source": [ 335 | "### 2.1.2 Data Distribution Exploration" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": 10, 341 | "metadata": {}, 342 | "outputs": [ 343 | { 344 | "data": { 345 | "text/plain": [ 346 | "" 347 | ] 348 | }, 349 | "execution_count": 10, 350 | "metadata": {}, 351 | "output_type": "execute_result" 352 | }, 353 | { 354 | "data": { 355 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAWAAAAEGCAYAAABbzE8LAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAANb0lEQVR4nO3df2yc913A8fenMRtRTWnXlJJ5FbfVTKxrRmnMtD8YKkVlIUGqBAhpm2hgqtA6LWkifmij0eKCkTYGoiVCm0phNFCxwTZUqLqsQSJDQmqnc9Um+5HRa+epy8LoUqLVbdIpyZc/7nFzcc9nO/Xd5/Hl/ZJOuzz35Pr9+O55+/Hj2YlSCpKkwbsoewGSdKEywJKUxABLUhIDLElJDLAkJRlZzs7r1q0rjUajT0uRpOE0PT39vVLKFfO3LyvAjUaDZrO5cquSpAtARHyr23YvQUhSEgMsSUkMsCQlMcCSlMQAS1ISAyxJSQywJCUxwJKUxABLUhIDLElJDLAkJTHAkpTEAEtSEgMsSUkMsCQlMcCSlMQAS1ISAyxJSQywJCVZ1r8JJw3KrbfeyvHjxxkbG8teysvGx8fZtm1b9jI0RAywauno0aPMvvAi//NSPd6ia158LnsJGkL1eHdL3awZ4cRPbc5eBQBrDz+UvQQNIa8BS1ISAyxJSQywJCUxwJKUxABLUhIDLElJDLAkJTHAkpTEAEtSEgMsSUkMsCQlMcCSlMQAS1ISAyxJSQywJCUxwJKUxABLUhIDLElJDLAkJTHAkpTEAEtSEgMsSUkMsCQlMcCSlMQAS1ISAyxJSQywJCUxwJKUxABLUhIDLElJDLAkJTHAkpTEAEtSEgMsSUkMsCQlMcCSlMQAS1ISAyxJSQywJCUxwJKUxABLUhIDrIHYs2cPe/bsyV7GBcWPef2NZC9AF4ZWq5W9hAuOH/P68wxYkpIYYElKYoAlKYkBlqQkBliSkhhgSUpigCUpiQGWpCQGWJKSGGBJSmKAJSmJAZakJAZYkpIYYElKYoAlKYkBlqQkBliSkhhgSUpigCUpiQGWpCQGWJKSGGBJSmKAJSmJAZakJAZYkpIYYElKYoAlKYkBlqQkBliSkhhgSUpigCUpiQGWpCQGWJKSGGBJSmKAJSmJAZakJAZYkpIYYElKMpAA33DDDS/f6mSl1rXQ8zSbTW688Uamp6dX9L8nDYNWq8WWLVt44IEHzjlO5j/earUAOHbsGNu3b6fVarF9+3aOHTvWl3V1HqfzbyvNM+A+mpyc5MyZM+zevTt7KVLtTE1N8cILL3DXXXd1PU7mHp+amgLgvvvu49ChQ0xNTXHo0CH27t2bsewV1fcAz/+sUZezv5Va10LP02w2mZ2dBWB2dra2HwcpQ6vVYmZmBoBSCtA+TubOgjsfn5mZodlssm/fPkopzMzMUEph3759K34WvNhxudLH7ciKPpteNjk5mb2EWjly5AgnTpzg9ttvX9L+J06cgNLnRS3DRSe/T6v1/JLXXwetVou1a9dmL6OrubPa+Xbv3s2DDz74isfnvprsdPr0afbu3cvOnTv7ts5+W/QMOCJ+JyKaEdF89tlnB7GmoTB39ivplebObuebO27mPz47O8upU6fO2Xbq1Cn279/fj+UNzKJnwKWUe4B7ACYmJmp0TlJvo6OjRrjD2NgYAHffffeS9t+yZQuzJ3/QzyUty5kfvoTxN1255PXXQZ3P1huNRtcIj46Odn18dHSUkydPnhPhkZERbrrppn4vta/8JlyfeAlCWtiuXbu6br/zzju7Pj45OclFF52bqzVr1nDLLbf0Z4ED0vcAHzhwoOefs6zUuhZ6nomJiZc/m4+Ojtb24yBlGB8fp9FoABARQPs42bhx4ysebzQaTExMsGnTJiKCRqNBRLBp0yYuv/zyFV3XYsflSh+3ngH30dxn7bnP6pLO2rVrFxdffDE7duzoepzMPT53Nrx161Y2bNjArl272LBhw6o/+wWIuf8LyFJ
MTEyUZrPZx+VoWM1dj1zuNeDZ63+zn8tasrWHH2LjKr0GvJrWPKwiYrqUMjF/u2fAkpTEAEtSEgMsSUkMsCQlMcCSlMQAS1ISAyxJSQywJCUxwJKUxABLUhIDLElJDLAkJTHAkpTEAEtSEgMsSUkMsCQlMcCSlMQAS1ISAyxJSQywJCUxwJKUxABLUhIDLElJDLAkJTHAkpTEAEtSEgMsSUkMsCQlMcCSlMQAS1ISAyxJSQywJCUxwJKUxABLUhIDLElJDLAkJTHAkpTEAEtSkpHsBejCMD4+nr2EC44f8/ozwBqIbdu2ZS/hguPHvP68BCFJSQywJCUxwJKUxABLUhIDLElJDLAkJTHAkpTEAEtSEgMsSUkMsCQlMcCSlMQAS1ISAyxJSQywJCUxwJKUxABLUhIDLElJDLAkJTHAkpTEAEtSEgMsSUkMsCQlMcCSlMQAS1ISAyxJSQywJCUxwJKUxABLUhIDLElJDLAkJTHAkpTEAEtSEgMsSUkMsCQlMcCSlMQAS1ISAyxJSQywJCUxwJKUxABLUpKR7AVICzp9irWHH8peBQBrXnwOuDJ7GRoyBli1tH79eo4fP87YWF2idyXj4+PZi9CQMcCqpXvvvTd7CVLfeQ1YkpIYYElKYoAlKYkBlqQkBliSkhhgSUpigCUpiQGWpCQGWJKSGGBJSmKAJSmJAZakJAZYkpIYYElKYoAlKYkBlqQkBliSkhhgSUpigCUpiQGWpCRRSln6zhHPAt9a4u7rgO+dz6JqZljmgOGZxTnqZ1hm6dccP1FKuWL+xmUFeDkiollKmejLkw/QsMwBwzOLc9TPsMwy6Dm8BCFJSQywJCXpZ4Dv6eNzD9KwzAHDM4tz1M+wzDLQOfp2DViS1JuXICQpiQGWpCRLDnBEXBUR/xERX4uIr0bE7dX2j0fE4Yg4GBH/EhGXdvydD0dEKyK+ERHv6ti+qdrWiogPrehE5z/HH1czPB4RD0fE66vtERF/Wa31YERc3/FcWyPiyeq2dZBz9Jql4/HfjYgSEevqPEuP12QyIo5Ur8njEbG54+/U7r3Va5bqsW3VsfLViPjTOs/S4zX5TMfrMRMRj6/SOa6LiEeqOZoR8fZq+2CPkVLKkm7AeuD66v6PAP8NXAP8EjBSbf8Y8LHq/jXAE8BrgTcCTwFrqttTwJuA11T7XLPUdbzaW485LunYZzvwyer+ZuALQADvAB6ttr8OeLr638uq+5cNao5es1R/vgr4Iu0fnFlX51l6vCaTwO912b+W761FZvkF4N+B11aP/VidZ+n13urY58+Bj6zGOYCHgV/uOC4OZBwjSz4DLqUcLaU8Vt1/Hvg6MFZKebiUcqra7RHgDdX9m4FPl1JeKqV8E2gBb69urVLK06WUHwCfrvYdiB5zfL9jt4uBue9O3gzsLW2PAJdGxHrgXcD+UspzpZT/A/YDmwY1Byw8S/XwXwB/0DEH1HSWReboppbvLeg5y23AR0spL1WP/W+dZ1nsNYmIAH4D+MdVOkcBLql2+1HgOx1zDOwYOa9rwBHRAH4GeHTeQ++j/dkD2kM+0/HYt6ttC20fuPlzRMSfRMQzwHuBj1S71X4OOHeWiLgZOFJKeWLebrWfpct764PVl4J/GxGXVdtqPwe8YpY3A++MiEcj4ksR8bPVbrWfZYHj/Z3Ad0spT1Z/Xm1z7AA+Xh3vfwZ8uNptoHMsO8ARMQp8DtjRedYYEXcAp4D7X+2iBqHbHKWUO0opV9Ge4YOZ61uOzllovwZ/yNlPIKtGl9fkE8DVwHXAUdpf8q4KXWYZof3l6zuA3wf+qTqLrLWFjnfg3Zw9+629LnPcBuysjvedwN9krGtZAY6IH6I9xP2llM93bP8t4FeA95bqgglwhPZ1yDlvqLYttH1gFpqjw/3Ar1X3azsHdJ3latrX4J6IiJlqXY9FxI9T41m6vSallO+WUk6XUs4Af037y1l6rDd9Dljw/fVt4PPVl7ZfBs7Q/sUvtZ2
lx/E+Avwq8JmO3VfbHFuBufv/TNZ7axkXswPYC9w1b/sm4GvAFfO2v5VzL8o/TfuC/Eh1/42cvSj/1ld7MXsF5vjJjvvbgM9W97dw7kX5L5ezF+W/SfuC/GXV/dcNao5es8zbZ4az34Sr5Sw9XpP1Hfd30r7GWNv31iKzvB/4o+r+m2l/ORt1naXXe6s65r80b9uqmoP2teAbqvu/CExnHCPLGeTnaF+4Pgg8Xt02077Y/kzHtk92/J07aH8H9BtU33Gstm+m/d3Ip4A7BnyALDTH54CvVNv/jfY35uZewL+q1noImOh4rvdV87eA3x7kHL1mmbfPDGcDXMtZerwmf1+t8yDwr5wb5Nq9txaZ5TXAP1TvsceAG+s8S6/3FvB3wPu7/J1VM0e1fZr2J4RHgY0Zx4g/iixJSfxJOElKYoAlKYkBlqQkBliSkhhgSUpigLWqRMSlEfGB6v7rI+KzC+x3ICJW/T8SqeFmgLXaXAp8AKCU8p1Syq/nLkc6fyPZC5CW6aPA1dXvoX0SeEsp5dqIWAt8Cvhp4DCwNm+J0tIYYK02HwKuLaVcV/12qwer7bcBL5ZS3hIRb6P902ZSrXkJQsPi52n/qC+llIO0f/RUqjUDLElJDLBWm+dp/9My8/0n8B6AiLgWeNsgFyWdD68Ba1UppRyLiP+KiK/Q/pWCcz4BfCoivl5tn05ZoLQM/jY0SUriJQhJSmKAJSmJAZakJAZYkpIYYElKYoAlKYkBlqQk/w8fN12SFSqSDQAAAABJRU5ErkJggg==\n", 356 | "text/plain": [ 357 | "
" 358 | ] 359 | }, 360 | "metadata": { 361 | "needs_background": "light" 362 | }, 363 | "output_type": "display_data" 364 | } 365 | ], 366 | "source": [ 367 | "sns.boxplot(x=train.iloc[:10000][\"tid\"])" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": 11, 373 | "metadata": {}, 374 | "outputs": [ 375 | { 376 | "data": { 377 | "text/plain": [ 378 | "file_id 9\n", 379 | "label 3\n", 380 | "api 166\n", 381 | "tid 56\n", 382 | "index 5001\n", 383 | "dtype: int64" 384 | ] 385 | }, 386 | "execution_count": 11, 387 | "metadata": {}, 388 | "output_type": "execute_result" 389 | } 390 | ], 391 | "source": [ 392 | "train.nunique()" 393 | ] 394 | }, 395 | { 396 | "cell_type": "markdown", 397 | "metadata": {}, 398 | "source": [ 399 | "### 2.1.3 数据缺失值探索" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": 12, 405 | "metadata": {}, 406 | "outputs": [ 407 | { 408 | "data": { 409 | "text/plain": [ 410 | "count 35951.000000\n", 411 | "mean 2153.216267\n", 412 | "std 1537.349809\n", 413 | "min 0.000000\n", 414 | "25% 722.000000\n", 415 | "50% 2004.000000\n", 416 | "75% 3502.000000\n", 417 | "max 5000.000000\n", 418 | "Name: index, dtype: float64" 419 | ] 420 | }, 421 | "execution_count": 12, 422 | "metadata": {}, 423 | "output_type": "execute_result" 424 | } 425 | ], 426 | "source": [ 427 | "train['index'].describe()" 428 | ] 429 | }, 430 | { 431 | "cell_type": "markdown", 432 | "metadata": {}, 433 | "source": [ 434 | "### 2.1.4 奇异值探索" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": 13, 440 | "metadata": { 441 | "scrolled": true 442 | }, 443 | "outputs": [ 444 | { 445 | "data": { 446 | "text/plain": [ 447 | "count 35951.000000\n", 448 | "mean 2153.216267\n", 449 | "std 1537.349809\n", 450 | "min 0.000000\n", 451 | "25% 722.000000\n", 452 | "50% 2004.000000\n", 453 | "75% 3502.000000\n", 454 | "max 5000.000000\n", 455 | "Name: index, dtype: float64" 456 | ] 457 | }, 458 | "execution_count": 13, 459 | 
"metadata": {}, 460 | "output_type": "execute_result" 461 | } 462 | ], 463 | "source": [ 464 | "train['index'].describe()" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": 14, 470 | "metadata": {}, 471 | "outputs": [ 472 | { 473 | "data": { 474 | "text/plain": [ 475 | "count 35952.000000\n", 476 | "mean 2494.964564\n", 477 | "std 129.979938\n", 478 | "min 282.000000\n", 479 | "25% 2456.000000\n", 480 | "50% 2500.000000\n", 481 | "75% 2596.000000\n", 482 | "max 2980.000000\n", 483 | "Name: tid, dtype: float64" 484 | ] 485 | }, 486 | "execution_count": 14, 487 | "metadata": {}, 488 | "output_type": "execute_result" 489 | } 490 | ], 491 | "source": [ 492 | "train['tid'].describe()" 493 | ] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "metadata": {}, 498 | "source": [ 499 | "### 2.1.5 标签分布探索" 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": 15, 505 | "metadata": {}, 506 | "outputs": [ 507 | { 508 | "data": { 509 | "text/plain": [ 510 | "0 28350\n", 511 | "5 6786\n", 512 | "2 816\n", 513 | "Name: label, dtype: int64" 514 | ] 515 | }, 516 | "execution_count": 15, 517 | "metadata": {}, 518 | "output_type": "execute_result" 519 | } 520 | ], 521 | "source": [ 522 | "train['label'].value_counts()" 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": 16, 528 | "metadata": {}, 529 | "outputs": [ 530 | { 531 | "data": { 532 | "text/plain": [ 533 | "" 534 | ] 535 | }, 536 | "execution_count": 16, 537 | "metadata": {}, 538 | "output_type": "execute_result" 539 | }, 540 | { 541 | "data": { 542 | "image/png": "\n", 543 | "text/plain": [ 544 | "
" 545 | ] 546 | }, 547 | "metadata": { 548 | "needs_background": "light" 549 | }, 550 | "output_type": "display_data" 551 | } 552 | ], 553 | "source": [ 554 | "plt.figure(figsize=(12,4),dpi=150)\n", 555 | "train['label'].value_counts().sort_index().plot(kind = 'bar')" 556 | ] 557 | }, 558 | { 559 | "cell_type": "code", 560 | "execution_count": 17, 561 | "metadata": {}, 562 | "outputs": [ 563 | { 564 | "data": { 565 | "text/plain": [ 566 | "" 567 | ] 568 | }, 569 | "execution_count": 17, 570 | "metadata": {}, 571 | "output_type": "execute_result" 572 | }, 573 | { 574 | "data": { 575 | "image/png": "\n", 576 | "text/plain": [ 577 | "
" 578 | ] 579 | }, 580 | "metadata": {}, 581 | "output_type": "display_data" 582 | } 583 | ], 584 | "source": [ 585 | "plt.figure(figsize=(4,4),dpi=150)\n", 586 | "train['label'].value_counts().sort_index().plot(kind = 'pie')" 587 | ] 588 | }, 589 | { 590 | "cell_type": "markdown", 591 | "metadata": {}, 592 | "source": [ 593 | "## 2.2 测试集探索" 594 | ] 595 | }, 596 | { 597 | "cell_type": "markdown", 598 | "metadata": {}, 599 | "source": [ 600 | "### 2.2.1 数据信息" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": 18, 606 | "metadata": {}, 607 | "outputs": [ 608 | { 609 | "data": { 610 | "text/html": [ 611 | "
\n", 612 | "\n", 625 | "\n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | "
file_idapitidindex
01RegOpenKeyExA2332.00.0
11CopyFileA2332.01.0
21OpenSCManagerA2332.02.0
31CreateServiceA2332.03.0
41RegOpenKeyExA2468.00.0
\n", 673 | "
" 674 | ], 675 | "text/plain": [ 676 | " file_id api tid index\n", 677 | "0 1 RegOpenKeyExA 2332.0 0.0\n", 678 | "1 1 CopyFileA 2332.0 1.0\n", 679 | "2 1 OpenSCManagerA 2332.0 2.0\n", 680 | "3 1 CreateServiceA 2332.0 3.0\n", 681 | "4 1 RegOpenKeyExA 2468.0 0.0" 682 | ] 683 | }, 684 | "execution_count": 18, 685 | "metadata": {}, 686 | "output_type": "execute_result" 687 | } 688 | ], 689 | "source": [ 690 | "test.head()" 691 | ] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "execution_count": 19, 696 | "metadata": {}, 697 | "outputs": [ 698 | { 699 | "name": "stdout", 700 | "output_type": "stream", 701 | "text": [ 702 | "\n", 703 | "RangeIndex: 39173 entries, 0 to 39172\n", 704 | "Data columns (total 4 columns):\n", 705 | "file_id 39173 non-null int64\n", 706 | "api 39173 non-null object\n", 707 | "tid 39172 non-null float64\n", 708 | "index 39172 non-null float64\n", 709 | "dtypes: float64(2), int64(1), object(1)\n", 710 | "memory usage: 1.2+ MB\n" 711 | ] 712 | } 713 | ], 714 | "source": [ 715 | "test.info()" 716 | ] 717 | }, 718 | { 719 | "cell_type": "markdown", 720 | "metadata": {}, 721 | "source": [ 722 | "### 2.2.2 Missing Value Exploration" 723 | ] 724 | }, 725 | { 726 | "cell_type": "code", 727 | "execution_count": 20, 728 | "metadata": {}, 729 | "outputs": [ 730 | { 731 | "data": { 732 | "text/plain": [ 733 | "file_id 0\n", 734 | "api 0\n", 735 | "tid 1\n", 736 | "index 1\n", 737 | "dtype: int64" 738 | ] 739 | }, 740 | "execution_count": 20, 741 | "metadata": {}, 742 | "output_type": "execute_result" 743 | } 744 | ], 745 | "source": [ 746 | "test.isnull().sum()" 747 | ] 748 | }, 749 | { 750 | "cell_type": "markdown", 751 | "metadata": {}, 752 | "source": [ 753 | "### 2.2.3 Data Distribution Exploration" 754 | ] 755 | }, 756 | { 757 | "cell_type": "code", 758 | "execution_count": 21, 759 | "metadata": {}, 760 | "outputs": [ 761 | { 762 | "data": { 763 | "text/plain": [ 764 | "file_id 10\n", 765 | "api 146\n", 766 | "tid 125\n", 767 | "index 5001\n", 768 | "dtype: int64" 769 | ] 770 | }, 771 | 
"execution_count": 21, 772 | "metadata": {}, 773 | "output_type": "execute_result" 774 | } 775 | ], 776 | "source": [ 777 | "test.nunique()" 778 | ] 779 | }, 780 | { 781 | "cell_type": "markdown", 782 | "metadata": {}, 783 | "source": [ 784 | "### 2.2.4 奇异值探索" 785 | ] 786 | }, 787 | { 788 | "cell_type": "code", 789 | "execution_count": 22, 790 | "metadata": {}, 791 | "outputs": [ 792 | { 793 | "data": { 794 | "text/plain": [ 795 | "count 39172.000000\n", 796 | "mean 1729.569284\n", 797 | "std 1486.018402\n", 798 | "min 0.000000\n", 799 | "25% 405.750000\n", 800 | "50% 1342.000000\n", 801 | "75% 2876.000000\n", 802 | "max 5000.000000\n", 803 | "Name: index, dtype: float64" 804 | ] 805 | }, 806 | "execution_count": 22, 807 | "metadata": {}, 808 | "output_type": "execute_result" 809 | } 810 | ], 811 | "source": [ 812 | "test['index'].describe()" 813 | ] 814 | }, 815 | { 816 | "cell_type": "code", 817 | "execution_count": 23, 818 | "metadata": { 819 | "scrolled": true 820 | }, 821 | "outputs": [ 822 | { 823 | "data": { 824 | "text/plain": [ 825 | "count 39172.000000\n", 826 | "mean 2158.769938\n", 827 | "std 464.152821\n", 828 | "min 504.000000\n", 829 | "25% 2092.000000\n", 830 | "50% 2224.000000\n", 831 | "75% 2500.000000\n", 832 | "max 2920.000000\n", 833 | "Name: tid, dtype: float64" 834 | ] 835 | }, 836 | "execution_count": 23, 837 | "metadata": {}, 838 | "output_type": "execute_result" 839 | } 840 | ], 841 | "source": [ 842 | "test['tid'].describe()" 843 | ] 844 | }, 845 | { 846 | "cell_type": "markdown", 847 | "metadata": {}, 848 | "source": [ 849 | "## 2.3 数据集联合分析" 850 | ] 851 | }, 852 | { 853 | "cell_type": "markdown", 854 | "metadata": {}, 855 | "source": [ 856 | "### 2.3.1 file_id分析" 857 | ] 858 | }, 859 | { 860 | "cell_type": "code", 861 | "execution_count": 24, 862 | "metadata": {}, 863 | "outputs": [], 864 | "source": [ 865 | "train_fileids = train['file_id'].unique()\n", 866 | "test_fileids = test['file_id'].unique()" 867 | ] 868 | }, 869 | { 870 | 
"cell_type": "code", 871 | "execution_count": 25, 872 | "metadata": {}, 873 | "outputs": [ 874 | { 875 | "data": { 876 | "text/plain": [ 877 | "0" 878 | ] 879 | }, 880 | "execution_count": 25, 881 | "metadata": {}, 882 | "output_type": "execute_result" 883 | } 884 | ], 885 | "source": [ 886 | "len(set(train_fileids)-set(test_fileids)) " 887 | ] 888 | }, 889 | { 890 | "cell_type": "code", 891 | "execution_count": 26, 892 | "metadata": {}, 893 | "outputs": [ 894 | { 895 | "data": { 896 | "text/plain": [ 897 | "1" 898 | ] 899 | }, 900 | "execution_count": 26, 901 | "metadata": {}, 902 | "output_type": "execute_result" 903 | } 904 | ], 905 | "source": [ 906 | "len(set(test_fileids)-set(train_fileids)) " 907 | ] 908 | }, 909 | { 910 | "cell_type": "markdown", 911 | "metadata": {}, 912 | "source": [ 913 | "### 2.3.2 API分析" 914 | ] 915 | }, 916 | { 917 | "cell_type": "code", 918 | "execution_count": 27, 919 | "metadata": {}, 920 | "outputs": [], 921 | "source": [ 922 | "train_apis = train['api'].unique()\n", 923 | "test_apis = test['api'].unique()" 924 | ] 925 | }, 926 | { 927 | "cell_type": "code", 928 | "execution_count": 28, 929 | "metadata": { 930 | "scrolled": true 931 | }, 932 | "outputs": [ 933 | { 934 | "data": { 935 | "text/plain": [ 936 | "{'CertCreateCertificateContext',\n", 937 | " 'CertOpenSystemStoreA',\n", 938 | " 'CoInitializeSecurity',\n", 939 | " 'CreateServiceA',\n", 940 | " 'CryptAcquireContextW',\n", 941 | " 'FindWindowA',\n", 942 | " 'FindWindowExW',\n", 943 | " 'GetComputerNameA',\n", 944 | " 'GetFileVersionInfoSizeW',\n", 945 | " 'GetFileVersionInfoW',\n", 946 | " 'IWbemServices_ExecQuery',\n", 947 | " 'LookupAccountSidW',\n", 948 | " 'LookupPrivilegeValueW',\n", 949 | " 'OpenServiceW',\n", 950 | " 'OutputDebugStringA',\n", 951 | " 'R',\n", 952 | " 'SendNotifyMessageW',\n", 953 | " 'SetStdHandle',\n", 954 | " 'StartServiceA',\n", 955 | " 'StartServiceW',\n", 956 | " 'UnhookWindowsHookEx',\n", 957 | " 'connect',\n", 958 | " 'timeGetTime'}" 959 | ] 
960 | }, 961 | "execution_count": 28, 962 | "metadata": {}, 963 | "output_type": "execute_result" 964 | } 965 | ], 966 | "source": [ 967 | "set(test_apis)-set(train_apis)" 968 | ] 969 | }, 970 | { 971 | "cell_type": "code", 972 | "execution_count": 29, 973 | "metadata": {}, 974 | "outputs": [ 975 | { 976 | "data": { 977 | "text/plain": [ 978 | "{'CertControlStore',\n", 979 | " 'CryptAcquireContextA',\n", 980 | " 'CryptCreateHash',\n", 981 | " 'CryptExportKey',\n", 982 | " 'CryptHashData',\n", 983 | " 'DeviceIoControl',\n", 984 | " 'DrawTextExA',\n", 985 | " 'EncryptMessage',\n", 986 | " 'EnumServicesStatusW',\n", 987 | " 'FindResourceExA',\n", 988 | " 'GetAdaptersAddresses',\n", 989 | " 'GetAddrInfoW',\n", 990 | " 'GetAsyncKeyState',\n", 991 | " 'GetBestInterfaceEx',\n", 992 | " 'GetFileInformationByHandle',\n", 993 | " 'GetFileVersionInfoExW',\n", 994 | " 'GetFileVersionInfoSizeExW',\n", 995 | " 'GetUserNameA',\n", 996 | " 'GetVolumePathNameW',\n", 997 | " 'GlobalMemoryStatus',\n", 998 | " 'HttpOpenRequestA',\n", 999 | " 'InternetConnectA',\n", 1000 | " 'InternetOpenA',\n", 1001 | " 'IsDebuggerPresent',\n", 1002 | " 'Module32FirstW',\n", 1003 | " 'Module32NextW',\n", 1004 | " 'NtDeleteValueKey',\n", 1005 | " 'NtReadVirtualMemory',\n", 1006 | " 'OpenServiceA',\n", 1007 | " 'ReadProcessMemory',\n", 1008 | " 'RegEnumKeyExA',\n", 1009 | " 'RegEnumValueA',\n", 1010 | " 'RtlAddVectoredContinueHandler',\n", 1011 | " 'RtlAddVectoredExceptionHandler',\n", 1012 | " 'RtlRemoveVectoredExceptionHandler',\n", 1013 | " 'SetFileAttributesW',\n", 1014 | " 'SetFileTime',\n", 1015 | " 'SetWindowsHookExA',\n", 1016 | " 'Thread32First',\n", 1017 | " 'Thread32Next',\n", 1018 | " 'WriteConsoleA',\n", 1019 | " 'bind',\n", 1020 | " 'listen'}" 1021 | ] 1022 | }, 1023 | "execution_count": 29, 1024 | "metadata": {}, 1025 | "output_type": "execute_result" 1026 | } 1027 | ], 1028 | "source": [ 1029 | "set(train_apis) - set(test_apis)" 1030 | ] 1031 | } 1032 | ], 1033 | "metadata": { 1034 | 
"kernelspec": { 1035 | "display_name": "Python 3", 1036 | "language": "python", 1037 | "name": "python3" 1038 | }, 1039 | "language_info": { 1040 | "codemirror_mode": { 1041 | "name": "ipython", 1042 | "version": 3 1043 | }, 1044 | "file_extension": ".py", 1045 | "mimetype": "text/x-python", 1046 | "name": "python", 1047 | "nbconvert_exporter": "python", 1048 | "pygments_lexer": "ipython3", 1049 | "version": "3.6.2" 1050 | }, 1051 | "latex_envs": { 1052 | "LaTeX_envs_menu_present": true, 1053 | "autoclose": false, 1054 | "autocomplete": true, 1055 | "bibliofile": "biblio.bib", 1056 | "cite_by": "apalike", 1057 | "current_citInitial": 1, 1058 | "eqLabelWithNumbers": true, 1059 | "eqNumInitial": 1, 1060 | "hotkeys": { 1061 | "equation": "Ctrl-E", 1062 | "itemize": "Ctrl-I" 1063 | }, 1064 | "labels_anchors": false, 1065 | "latex_user_defs": false, 1066 | "report_style_numbering": false, 1067 | "user_envs_cfg": false 1068 | }, 1069 | "tianchi_metadata": { 1070 | "competitions": [], 1071 | "datasets": [], 1072 | "description": "", 1073 | "notebookId": "116127", 1074 | "source": "dsw" 1075 | }, 1076 | "toc": { 1077 | "nav_menu": {}, 1078 | "number_sections": true, 1079 | "sideBar": true, 1080 | "skip_h1_title": false, 1081 | "title_cell": "Table of Contents", 1082 | "title_sidebar": "Contents", 1083 | "toc_cell": true, 1084 | "toc_position": { 1085 | "height": "calc(100% - 180px)", 1086 | "left": "10px", 1087 | "top": "150px", 1088 | "width": "384px" 1089 | }, 1090 | "toc_section_display": true, 1091 | "toc_window_display": true 1092 | } 1093 | }, 1094 | "nbformat": 4, 1095 | "nbformat_minor": 4 1096 | } 1097 | -------------------------------------------------------------------------------- /阿里云安全恶意程序检测/阿里云安全恶意程序检测-特征工程与基线模型.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## 第三节:特征工程与基线模型" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 
12 | "execution_count": 2, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import numpy as np\n", 17 | "import pandas as pd\n", 18 | "from tqdm import tqdm \n", 19 | "\n", 20 | "class _Data_Preprocess:\n", 21 | " def __init__(self):\n", 22 | " self.int8_max = np.iinfo(np.int8).max\n", 23 | " self.int8_min = np.iinfo(np.int8).min\n", 24 | "\n", 25 | " self.int16_max = np.iinfo(np.int16).max\n", 26 | " self.int16_min = np.iinfo(np.int16).min\n", 27 | "\n", 28 | " self.int32_max = np.iinfo(np.int32).max\n", 29 | " self.int32_min = np.iinfo(np.int32).min\n", 30 | "\n", 31 | " self.int64_max = np.iinfo(np.int64).max\n", 32 | " self.int64_min = np.iinfo(np.int64).min\n", 33 | "\n", 34 | " self.float16_max = np.finfo(np.float16).max\n", 35 | " self.float16_min = np.finfo(np.float16).min\n", 36 | "\n", 37 | " self.float32_max = np.finfo(np.float32).max\n", 38 | " self.float32_min = np.finfo(np.float32).min\n", 39 | "\n", 40 | " self.float64_max = np.finfo(np.float64).max\n", 41 | " self.float64_min = np.finfo(np.float64).min\n", 42 | "\n", 43 | " def _get_type(self, min_val, max_val, types):\n", 44 | " if types == 'int':\n", 45 | " if max_val <= self.int8_max and min_val >= self.int8_min:\n", 46 | " return np.int8\n", 47 | " elif max_val <= self.int16_max <= max_val and min_val >= self.int16_min:\n", 48 | " return np.int16\n", 49 | " elif max_val <= self.int32_max and min_val >= self.int32_min:\n", 50 | " return np.int32\n", 51 | " return None\n", 52 | "\n", 53 | " elif types == 'float':\n", 54 | " if max_val <= self.float16_max and min_val >= self.float16_min:\n", 55 | " return np.float16\n", 56 | " if max_val <= self.float32_max and min_val >= self.float32_min:\n", 57 | " return np.float32\n", 58 | " if max_val <= self.float64_max and min_val >= self.float64_min:\n", 59 | " return np.float64\n", 60 | " return None\n", 61 | "\n", 62 | " def _memory_process(self, df):\n", 63 | " init_memory = df.memory_usage().sum() / 1024 ** 2 / 1024\n", 64 | " 
print('Original data occupies {} GB memory.'.format(init_memory))\n", 65 | " df_cols = df.columns\n", 66 | "\n", 67 | " \n", 68 | " for col in tqdm(df_cols):\n", 69 | " try:\n", 70 | " if 'float' in str(df[col].dtypes):\n", 71 | " max_val = df[col].max()\n", 72 | " min_val = df[col].min()\n", 73 | " trans_types = self._get_type(min_val, max_val, 'float')\n", 74 | " if trans_types is not None:\n", 75 | " df[col] = df[col].astype(trans_types)\n", 76 | " elif 'int' in str(df[col].dtypes):\n", 77 | " max_val = df[col].max()\n", 78 | " min_val = df[col].min()\n", 79 | " trans_types = self._get_type(min_val, max_val, 'int')\n", 80 | " if trans_types is not None:\n", 81 | " df[col] = df[col].astype(trans_types)\n", 82 | " except Exception:\n", 83 | " print('Cannot process column {}.'.format(col)) \n", 84 | " afterprocess_memory = df.memory_usage().sum() / 1024 ** 2 / 1024\n", 85 | " print('After processing, the data occupies {} GB memory.'.format(afterprocess_memory))\n", 86 | " return df" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "## 3.3 Baseline Model" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "### 3.3.1 Data Loading" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 1, 106 | "metadata": {}, 107 | "outputs": [], 108 | "source": [ 109 | "import pandas as pd\n", 110 | "import numpy as np\n", 111 | "import seaborn as sns\n", 112 | "import matplotlib.pyplot as plt\n", 113 | "\n", 114 | "import lightgbm as lgb\n", 115 | "from sklearn.model_selection import train_test_split\n", 116 | "from sklearn.preprocessing import OneHotEncoder\n", 117 | "\n", 118 | "import warnings\n", 119 | "warnings.filterwarnings('ignore')\n", 120 | "%matplotlib inline" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "### Data Loading" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 3, 133 | 
"metadata": {}, 134 | "outputs": [], 135 | "source": [ 136 | "path = '../security_data/'\n", 137 | "train = pd.read_csv(path + 'security_train.csv')\n", 138 | "test = pd.read_csv(path + 'security_test.csv')" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 4, 144 | "metadata": {}, 145 | "outputs": [ 146 | { 147 | "data": { 148 | "text/html": [ 149 | "
\n", 150 | "\n", 163 | "\n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | "
file_idlabelapitidindex
015LdrLoadDll24880
115LdrGetProcedureAddress24881
215LdrGetProcedureAddress24882
315LdrGetProcedureAddress24883
415LdrGetProcedureAddress24884
\n", 217 | "
" 218 | ], 219 | "text/plain": [ 220 | " file_id label api tid index\n", 221 | "0 1 5 LdrLoadDll 2488 0\n", 222 | "1 1 5 LdrGetProcedureAddress 2488 1\n", 223 | "2 1 5 LdrGetProcedureAddress 2488 2\n", 224 | "3 1 5 LdrGetProcedureAddress 2488 3\n", 225 | "4 1 5 LdrGetProcedureAddress 2488 4" 226 | ] 227 | }, 228 | "execution_count": 4, 229 | "metadata": {}, 230 | "output_type": "execute_result" 231 | } 232 | ], 233 | "source": [ 234 | "train.head()" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "### 3.3.2 Feature Engineering" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 5, 247 | "metadata": {}, 248 | "outputs": [], 249 | "source": [ 250 | "def simple_sts_features(df):\n", 251 | " simple_fea = pd.DataFrame()\n", 252 | " simple_fea['file_id'] = df['file_id'].unique()\n", 253 | " simple_fea = simple_fea.sort_values('file_id')\n", 254 | " \n", 255 | " df_grp = df.groupby('file_id')\n", 256 | " simple_fea['file_id_api_count'] = df_grp['api'].count().values\n", 257 | " simple_fea['file_id_api_nunique'] = df_grp['api'].nunique().values\n", 258 | " \n", 259 | " simple_fea['file_id_tid_count'] = df_grp['tid'].count().values\n", 260 | " simple_fea['file_id_tid_nunique'] = df_grp['tid'].nunique().values\n", 261 | " \n", 262 | " simple_fea['file_id_index_count'] = df_grp['index'].count().values\n", 263 | " simple_fea['file_id_index_nunique'] = df_grp['index'].nunique().values\n", 264 | " \n", 265 | " return simple_fea" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": 6, 271 | "metadata": {}, 272 | "outputs": [ 273 | { 274 | "name": "stdout", 275 | "output_type": "stream", 276 | "text": [ 277 | "Wall time: 1.4 s\n" 278 | ] 279 | } 280 | ], 281 | "source": [ 282 | "%%time\n", 283 | "simple_train_fea1 = simple_sts_features(train)" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": 7, 289 | "metadata": {}, 290 | "outputs": [ 291 | { 292 | "name": 
"stdout", 293 | "output_type": "stream", 294 | "text": [ 295 | "Wall time: 23.9 ms\n" 296 | ] 297 | } 298 | ], 299 | "source": [ 300 | "%%time\n", 301 | "simple_test_fea1 = simple_sts_features(test)" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": 8, 307 | "metadata": {}, 308 | "outputs": [], 309 | "source": [ 310 | "def simple_numerical_sts_features(df):\n", 311 | " simple_numerical_fea = pd.DataFrame()\n", 312 | " simple_numerical_fea['file_id'] = df['file_id'].unique()\n", 313 | " simple_numerical_fea = simple_numerical_fea.sort_values('file_id')\n", 314 | " \n", 315 | " df_grp = df.groupby('file_id')\n", 316 | " \n", 317 | " simple_numerical_fea['file_id_tid_mean'] = df_grp['tid'].mean().values\n", 318 | " simple_numerical_fea['file_id_tid_min'] = df_grp['tid'].min().values\n", 319 | " simple_numerical_fea['file_id_tid_std'] = df_grp['tid'].std().values\n", 320 | " simple_numerical_fea['file_id_tid_max'] = df_grp['tid'].max().values\n", 321 | " \n", 322 | " \n", 323 | " simple_numerical_fea['file_id_index_mean']= df_grp['index'].mean().values\n", 324 | " simple_numerical_fea['file_id_index_min'] = df_grp['index'].min().values\n", 325 | " simple_numerical_fea['file_id_index_std'] = df_grp['index'].std().values\n", 326 | " simple_numerical_fea['file_id_index_max'] = df_grp['index'].max().values\n", 327 | " \n", 328 | " return simple_numerical_fea" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": 9, 334 | "metadata": { 335 | "scrolled": true 336 | }, 337 | "outputs": [ 338 | { 339 | "name": "stdout", 340 | "output_type": "stream", 341 | "text": [ 342 | "Wall time: 172 ms\n" 343 | ] 344 | } 345 | ], 346 | "source": [ 347 | "%%time\n", 348 | "simple_train_fea2 = simple_numerical_sts_features(train)" 349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": 10, 354 | "metadata": {}, 355 | "outputs": [ 356 | { 357 | "name": "stdout", 358 | "output_type": "stream", 359 | "text": [ 360 | "Wall 
time: 18 ms\n" 361 | ] 362 | } 363 | ], 364 | "source": [ 365 | "%%time\n", 366 | "simple_test_fea2 = simple_numerical_sts_features(test)" 367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "metadata": {}, 372 | "source": [ 373 | "### 3.3.3 Building the Baseline" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": 11, 379 | "metadata": {}, 380 | "outputs": [], 381 | "source": [ 382 | "train_label = train[['file_id','label']].drop_duplicates(subset = ['file_id','label'], keep = 'first')\n", 383 | "test_submit = test[['file_id']].drop_duplicates(subset = ['file_id'], keep = 'first')" 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": 12, 389 | "metadata": {}, 390 | "outputs": [], 391 | "source": [ 392 | "### Build the training and test sets\n", 393 | "train_data = train_label.merge(simple_train_fea1, on ='file_id', how='left')\n", 394 | "train_data = train_data.merge(simple_train_fea2, on ='file_id', how='left')\n", 395 | "\n", 396 | "test_submit = test_submit.merge(simple_test_fea1, on ='file_id', how='left')\n", 397 | "test_submit = test_submit.merge(simple_test_fea2, on ='file_id', how='left')" 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": 14, 403 | "metadata": {}, 404 | "outputs": [], 405 | "source": [ 406 | "def lgb_logloss(preds,data):\n", 407 | " labels_ = data.get_label() \n", 408 | " classes_ = np.unique(labels_) \n", 409 | " preds_prob = []\n", 410 | " for i in range(len(classes_)):\n", 411 | " preds_prob.append(preds[i*len(labels_):(i+1) * len(labels_)] )\n", 412 | " \n", 413 | " preds_prob_ = np.vstack(preds_prob) \n", 414 | " \n", 415 | " loss = []\n", 416 | " for i in range(preds_prob_.shape[1]): # number of samples\n", 417 | " sum_ = 0\n", 418 | " for j in range(preds_prob_.shape[0]): # number of classes\n", 419 | " pred = preds_prob_[j,i] # predicted probability that sample i belongs to class j\n", 420 | " if j == labels_[i]:\n", 421 | " sum_ += np.log(pred)\n", 422 | " else:\n", 423 | " sum_ += np.log(1 - pred)\n", 424 | " loss.append(sum_) \n", 425 | " return
'loss is: ',-1 * (np.sum(loss) / preds_prob_.shape[1]),False" 426 | ] 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "metadata": {}, 431 | "source": [] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": 15, 436 | "metadata": {}, 437 | "outputs": [], 438 | "source": [ 439 | "### Model validation\n", 440 | "train_features = [col for col in train_data.columns if col not in ['label','file_id']]\n", 441 | "train_label = 'label'" 442 | ] 443 | }, 444 | { 445 | "cell_type": "code", 446 | "execution_count": 16, 447 | "metadata": { 448 | "scrolled": true 449 | }, 450 | "outputs": [ 451 | { 452 | "name": "stdout", 453 | "output_type": "stream", 454 | "text": [ 455 | "fold n°0\n", 456 | "Training until validation scores don't improve for 100 rounds\n", 457 | "[50]\ttraining's multi_logloss: 1.83717\ttraining's loss is: : 2.41456\tvalid_1's multi_logloss: 1.28536\tvalid_1's loss is: : 0.941228\n", 458 | "[100]\ttraining's multi_logloss: 1.83717\ttraining's loss is: : 2.41456\tvalid_1's multi_logloss: 1.28536\tvalid_1's loss is: : 0.941228\n", 459 | "Early stopping, best iteration is:\n", 460 | "[1]\ttraining's multi_logloss: 1.83717\ttraining's loss is: : 2.41456\tvalid_1's multi_logloss: 1.28536\tvalid_1's loss is: : 0.941228\n", 461 | "fold n°1\n", 462 | "Training until validation scores don't improve for 100 rounds\n", 463 | "[50]\ttraining's multi_logloss: 1.77226\ttraining's loss is: : 2.32838\tvalid_1's multi_logloss: 2.10695\tvalid_1's loss is: : 1.86109\n", 464 | "[100]\ttraining's multi_logloss: 1.77226\ttraining's loss is: : 2.32838\tvalid_1's multi_logloss: 2.10695\tvalid_1's loss is: : 1.86109\n", 465 | "Early stopping, best iteration is:\n", 466 | "[1]\ttraining's multi_logloss: 1.77226\ttraining's loss is: : 2.32838\tvalid_1's multi_logloss: 2.10695\tvalid_1's loss is: : 1.86109\n", 467 | "fold n°2\n", 468 | "Training until validation scores don't improve for 100 rounds\n", 469 | "[50]\ttraining's multi_logloss: 1.79063\ttraining's loss is: : 
2.32093\tvalid_1's multi_logloss: 1.67573\tvalid_1's loss is: : 1.91268\n", 470 | "[100]\ttraining's multi_logloss: 1.79063\ttraining's loss is: : 2.32093\tvalid_1's multi_logloss: 1.67573\tvalid_1's loss is: : 1.91268\n", 471 | "Early stopping, best iteration is:\n", 472 | "[1]\ttraining's multi_logloss: 1.79063\ttraining's loss is: : 2.32093\tvalid_1's multi_logloss: 1.67573\tvalid_1's loss is: : 1.91268\n", 473 | "fold n°3\n", 474 | "Training until validation scores don't improve for 100 rounds\n", 475 | "[50]\ttraining's multi_logloss: 1.79651\ttraining's loss is: : 2.36572\tvalid_1's multi_logloss: 1.92355\tvalid_1's loss is: : 1.42824\n", 476 | "[100]\ttraining's multi_logloss: 1.79651\ttraining's loss is: : 2.36572\tvalid_1's multi_logloss: 1.92355\tvalid_1's loss is: : 1.42824\n", 477 | "Early stopping, best iteration is:\n", 478 | "[1]\ttraining's multi_logloss: 1.79651\ttraining's loss is: : 2.36572\tvalid_1's multi_logloss: 1.92355\tvalid_1's loss is: : 1.42824\n", 479 | "fold n°4\n", 480 | "Training until validation scores don't improve for 100 rounds\n", 481 | "[50]\ttraining's multi_logloss: 1.70379\ttraining's loss is: : 2.27265\tvalid_1's multi_logloss: 2.91788\tvalid_1's loss is: : 3.32694\n", 482 | "[100]\ttraining's multi_logloss: 1.70379\ttraining's loss is: : 2.27265\tvalid_1's multi_logloss: 2.91788\tvalid_1's loss is: : 3.32694\n", 483 | "Early stopping, best iteration is:\n", 484 | "[1]\ttraining's multi_logloss: 1.70379\ttraining's loss is: : 2.27265\tvalid_1's multi_logloss: 2.91788\tvalid_1's loss is: : 3.32694\n", 485 | "Wall time: 9.94 s\n" 486 | ] 487 | } 488 | ], 489 | "source": [ 490 | "%%time\n", 491 | "from sklearn.model_selection import StratifiedKFold,KFold\n", 492 | "params = {\n", 493 | " 'task':'train', \n", 494 | " 'num_leaves': 255,\n", 495 | " 'objective': 'multiclass',\n", 496 | " 'num_class': 8,\n", 497 | " 'min_data_in_leaf': 50,\n", 498 | " 'learning_rate': 0.05,\n", 499 | " 'feature_fraction': 0.85,\n", 500 | " 
'bagging_fraction': 0.85,\n", 501 | " 'bagging_freq': 5, \n", 502 | " 'max_bin':128,\n", 503 | " 'random_state':100\n", 504 | " } \n", 505 | "\n", 506 | "folds = KFold(n_splits=5, shuffle=True, random_state=15)\n", 507 | "oof = np.zeros(len(train_data))\n", 508 | "\n", 509 | "predict_res = 0\n", 510 | "models = []\n", 511 | "for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_data)):\n", 512 | " print(\"fold n°{}\".format(fold_))\n", 513 | " trn_data = lgb.Dataset(train_data.iloc[trn_idx][train_features], label=train_data.iloc[trn_idx][train_label].values)\n", 514 | " val_data = lgb.Dataset(train_data.iloc[val_idx][train_features], label=train_data.iloc[val_idx][train_label].values) \n", 515 | " \n", 516 | " clf = lgb.train(params, trn_data, num_boost_round=2000,valid_sets=[trn_data,val_data], verbose_eval=50, early_stopping_rounds=100, feval=lgb_logloss) \n", 517 | " models.append(clf)" 518 | ] 519 | },
596 | { 597 | "cell_type": "markdown", 598 | "metadata": {}, 599 | "source": [ 600 | "### 3.3.4 Feature Importance Analysis" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": 18, 606 | "metadata": {}, 607 | "outputs": [], 608 | "source": [ 609 | "feature_importance = pd.DataFrame()\n", 610 | "feature_importance['fea_name'] = train_features\n", 611 | "feature_importance['fea_imp'] = clf.feature_importance()\n", 612 | "feature_importance = feature_importance.sort_values('fea_imp',ascending = False)" 613 | ] 614 | }, 615 | { 616 | "cell_type": "code", 617 | "execution_count": 19, 618 | "metadata": {},
619 | "outputs": [ 620 | { 621 | "data": { 622 | "text/plain": [ 623 | "" 624 | ] 625 | }, 626 | "execution_count": 19, 627 | "metadata": {}, 628 | "output_type": "execute_result" 629 | }, 630 | { 631 | "data": { 632 | "image/png": "\n", 633 | "text/plain": [ 634 | "
" 635 | ] 636 | }, 637 | "metadata": { 638 | "needs_background": "light" 639 | }, 640 | "output_type": "display_data" 641 | } 642 | ], 643 | "source": [ 644 | "plt.figure(figsize=[20, 10,])\n", 645 | "sns.barplot(x = feature_importance['fea_name'], y = feature_importance['fea_imp'])\n", 646 | "#sns.barplot(x=\"fea_name\",y=\"fea_imp\",data=feature_importance)" 647 | ] 648 | }, 649 | { 650 | "cell_type": "markdown", 651 | "metadata": {}, 652 | "source": [ 653 | "### 3.3.5 Model Testing" 654 | ] 655 | }, 656 | { 657 | "cell_type": "code", 658 | "execution_count": 20, 659 | "metadata": {}, 660 | "outputs": [], 661 | "source": [ 662 | "pred_res = 0\n", 663 | "fold = 5\n", 664 | "for model in models:\n", 665 | " pred_res +=model.predict(test_submit[train_features]) * 1.0 / fold " 666 | ] 667 | }, 668 | { 669 | "cell_type": "code", 670 | "execution_count": 21, 671 | "metadata": {}, 672 | "outputs": [], 673 | "source": [ 674 | "test_submit['prob0'] = 0\n", 675 | "test_submit['prob1'] = 0\n", 676 | "test_submit['prob2'] = 0\n", 677 | "test_submit['prob3'] = 0\n", 678 | "test_submit['prob4'] = 0\n", 679 | "test_submit['prob5'] = 0\n", 680 | "test_submit['prob6'] = 0\n", 681 | "test_submit['prob7'] = 0" 682 | ] 683 | }, 684 | { 685 | "cell_type": "code", 686 | "execution_count": 22, 687 | "metadata": { 688 | "scrolled": true 689 | }, 690 | "outputs": [], 691 | "source": [ 692 | "test_submit[['prob0','prob1','prob2','prob3','prob4','prob5','prob6','prob7']] = pred_res\n", 693 | "test_submit[['file_id','prob0','prob1','prob2','prob3','prob4','prob5','prob6','prob7']].to_csv('baseline.csv',index = None)" 694 | ] 695 | } 696 | ], 697 | "metadata": { 698 | "kernelspec": { 699 | "display_name": "Python 3", 700 | "language": "python", 701 | "name": "python3" 702 | }, 703 | "language_info": { 704 | "codemirror_mode": { 705 | "name": "ipython", 706 | "version": 3 707 | }, 708 | "file_extension": ".py", 709 | "mimetype": "text/x-python", 710 | "name": "python", 711 | "nbconvert_exporter": 
"python", 712 | "pygments_lexer": "ipython3", 713 | "version": "3.7.1" 714 | }, 715 | "latex_envs": { 716 | "LaTeX_envs_menu_present": true, 717 | "autoclose": false, 718 | "autocomplete": true, 719 | "bibliofile": "biblio.bib", 720 | "cite_by": "apalike", 721 | "current_citInitial": 1, 722 | "eqLabelWithNumbers": true, 723 | "eqNumInitial": 1, 724 | "hotkeys": { 725 | "equation": "Ctrl-E", 726 | "itemize": "Ctrl-I" 727 | }, 728 | "labels_anchors": false, 729 | "latex_user_defs": false, 730 | "report_style_numbering": false, 731 | "user_envs_cfg": false 732 | }, 733 | "toc": { 734 | "nav_menu": {}, 735 | "number_sections": true, 736 | "sideBar": true, 737 | "skip_h1_title": false, 738 | "title_cell": "Table of Contents", 739 | "title_sidebar": "Contents", 740 | "toc_cell": true, 741 | "toc_position": { 742 | "height": "calc(100% - 180px)", 743 | "left": "10px", 744 | "top": "150px", 745 | "width": "384px" 746 | }, 747 | "toc_section_display": true, 748 | "toc_window_display": true 749 | } 750 | }, 751 | "nbformat": 4, 752 | "nbformat_minor": 2 753 | } 754 | --------------------------------------------------------------------------------