├── README.md ├── Data_Explore.ipynb ├── Model_Design_v1.7.ipynb ├── Feature_Engineering-v1.4.ipynb └── data_cleaning_wp.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # JData 2 | Team `HelloKitty`'s solution to the JD.com JData 2017 competition; final online ranking: 139/4240 on leaderboard A, 97/4240 on leaderboard B 3 | * Background 4 | > The competition is based on real (anonymized) user, product, and behavior data from the JD.com mall. Teams are asked to apply data-mining techniques and machine-learning algorithms to build a model that predicts which products users will buy, outputting matches between high-potential users and target products so as to supply high-quality target audiences for precision marketing. The organizers also hope that, through the competition, teams will uncover the meaning hidden behind the data and give e-commerce users a simpler, faster, and more worry-free shopping experience. 5 | 6 | * [Competition page](http://www.datafountain.cn/#/competitions/247/intro) 7 | * Code files 8 | * Data analysis (Data_Explore, data_analysis_wp, data_cleaning_wp) 9 | * Feature engineering (Feature_Engineering-v1.4) 10 | * Model design (Model_Design_v1.7) 11 | A detailed personal write-up of the competition is available on my [blog](http://izhaoyi.top/2017/06/25/JData/) 12 | 13 | 14 | -------------------------------------------------------------------------------- /Data_Explore.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Checking for Anomalous Values" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "ExecuteTime": { 15 | "end_time": "2017-05-10T21:17:05.070380Z", 16 | "start_time": "2017-05-10T21:17:04.293943Z" 17 | }, 18 | "collapsed": true 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "%matplotlib inline\n", 23 | "import matplotlib\n", 24 | "import matplotlib.pyplot as plt\n", 25 | "import numpy as np\n", 26 | "import pandas as pd" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 2, 32 | "metadata": { 33 | "ExecuteTime": { 34 | "end_time": "2017-05-10T21:17:05.074920Z", 35 | "start_time": "2017-05-10T21:17:05.071410Z" 36 | }, 37 | "collapsed": true 38 | }, 39 | "outputs": [], 40 | "source": [ 41 | "# define data file paths\n", 42 | "ACTION_201602_FILE = \"data/JData_Action_201602.csv\"\n", 43 | "ACTION_201603_FILE = \"data/JData_Action_201603.csv\"\n", 44 | "ACTION_201604_FILE = \"data/JData_Action_201604.csv\"\n", 45 | 
"COMMENT_FILE = \"data/JData_Comment.csv\"\n", 46 | "PRODUCT_FILE = \"data/JData_Product.csv\"\n", 47 | "USER_FILE = \"data/JData_User.csv\"\n", 48 | "USER_TABLE_FILE = \"data/User_table.csv\"\n", 49 | "ITEM_TABLE_FILE = \"data/Item_table.csv\"" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "### Dataset Background\n", 57 | "From the official data description, we already know which anomalies the data may contain:\n", 58 | "* User file\n", 59 | " * a user's age may be unknown, marked as -1\n", 60 | " * a user's sex may be withheld, marked as 2\n", 61 | " * later analysis shows that, due to a system glitch, some registration dates fall after the prediction date; we have no plan for this feature yet, so it is left untouched\n", 62 | "* Product file\n", 63 | " * the attributes a1, a2, and a3 may each be unknown, marked as -1\n", 64 | "* Action file\n", 65 | " * model_id is the ID of the clicked module and may be null when the action type is 6" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "### Checking for Missing Values" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 3, 78 | "metadata": { 79 | "ExecuteTime": { 80 | "end_time": "2017-05-10T21:17:35.805241Z", 81 | "start_time": "2017-05-10T21:17:05.075917Z" 82 | }, 83 | "collapsed": false 84 | }, 85 | "outputs": [ 86 | { 87 | "name": "stdout", 88 | "output_type": "stream", 89 | "text": [ 90 | "Is there any missing value in User? True\n", 91 | "Is there any missing value in Action 2? True\n", 92 | "Is there any missing value in Action 3? True\n", 93 | "Is there any missing value in Action 4? True\n", 94 | "Is there any missing value in Comment? False\n", 95 | "Is there any missing value in Product? False\n" 96 | ] 97 | } 98 | ], 99 | "source": [ 100 | "def check_empty(file_path, file_name):\n", 101 | " df_file = pd.read_csv(file_path)\n", 102 | " print 'Is there any missing value in {0}? 
{1}'.format(file_name, df_file.isnull().any().any()) \n", 103 | "\n", 104 | "check_empty(USER_FILE, 'User')\n", 105 | "check_empty(ACTION_201602_FILE, 'Action 2')\n", 106 | "check_empty(ACTION_201603_FILE, 'Action 3')\n", 107 | "check_empty(ACTION_201604_FILE, 'Action 4')\n", 108 | "check_empty(COMMENT_FILE, 'Comment')\n", 109 | "check_empty(PRODUCT_FILE, 'Product')" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "This quick check shows that the user table and the action tables contain missing values, while the comment and product tables do not. Still, as noted in the background analysis, the product table contains unknown attribute values that will need attention later. Next, let's see exactly where the missing values are in the user and action tables." 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 4, 122 | "metadata": { 123 | "ExecuteTime": { 124 | "end_time": "2017-05-10T21:18:04.373068Z", 125 | "start_time": "2017-05-10T21:17:35.806261Z" 126 | }, 127 | "collapsed": false 128 | }, 129 | "outputs": [ 130 | { 131 | "name": "stdout", 132 | "output_type": "stream", 133 | "text": [ 134 | "empty info in detail of User:\n", 135 | "user_id False\n", 136 | "age True\n", 137 | "sex True\n", 138 | "user_lv_cd False\n", 139 | "user_reg_tm True\n", 140 | "dtype: bool\n", 141 | "empty info in detail of Action 2:\n", 142 | "user_id False\n", 143 | "sku_id False\n", 144 | "time False\n", 145 | "model_id True\n", 146 | "type False\n", 147 | "cate False\n", 148 | "brand False\n", 149 | "dtype: bool\n", 150 | "empty info in detail of Action 3:\n", 151 | "user_id False\n", 152 | "sku_id False\n", 153 | "time False\n", 154 | "model_id True\n", 155 | "type False\n", 156 | "cate False\n", 157 | "brand False\n", 158 | "dtype: bool\n", 159 | "empty info in detail of Action 4:\n", 160 | "user_id False\n", 161 | "sku_id False\n", 162 | "time False\n", 163 | "model_id True\n", 164 | "type False\n", 165 | "cate False\n", 166 | "brand False\n", 167 | "dtype: bool\n" 168 | ] 169 | } 170 | ], 171 | "source": [ 172 | "def empty_detail(f_path, f_name):\n", 173 | " df_file = pd.read_csv(f_path)\n", 174 | " print 'empty info in detail of 
{0}:'.format(f_name)\n", 175 | " print pd.isnull(df_file).any()\n", 176 | "\n", 177 | "empty_detail(USER_FILE, 'User')\n", 178 | "empty_detail(ACTION_201602_FILE, 'Action 2')\n", 179 | "empty_detail(ACTION_201603_FILE, 'Action 3')\n", 180 | "empty_detail(ACTION_201604_FILE, 'Action 4')" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "The output above shows which columns contain missing values (True) in each affected file:\n", 188 | "* User\n", 189 | " * age\n", 190 | " * sex\n", 191 | " * user_reg_tm\n", 192 | "* Action\n", 193 | " * model_id\n", 194 | " \n", 195 | "Next, let's count the missing values in each file:" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 5, 201 | "metadata": { 202 | "ExecuteTime": { 203 | "end_time": "2017-05-10T21:18:27.415981Z", 204 | "start_time": "2017-05-10T21:18:04.373995Z" 205 | }, 206 | "collapsed": false 207 | }, 208 | "outputs": [ 209 | { 210 | "name": "stdout", 211 | "output_type": "stream", 212 | "text": [ 213 | "No. of missing age in User is 3\n", 214 | "percent: 2.84843478509e-05\n", 215 | "No. of missing sex in User is 3\n", 216 | "percent: 2.84843478509e-05\n", 217 | "No. of missing user_reg_tm in User is 3\n", 218 | "percent: 2.84843478509e-05\n", 219 | "No. of missing model_id in Action 2 is 4959617\n", 220 | "percent: 0.431818363867\n", 221 | "No. of missing model_id in Action 3 is 10553261\n", 222 | "percent: 0.4072043169\n", 223 | "No. of missing model_id in Action 4 is 5143018\n", 224 | "percent: 0.38962452388\n" 225 | ] 226 | } 227 | ], 228 | "source": [ 229 | "def empty_records(f_path, f_name, col_name):\n", 230 | " df_file = pd.read_csv(f_path)\n", 231 | " missing = df_file[col_name].isnull().sum().sum()\n", 232 | " print 'No. 
of missing {0} in {1} is {2}'.format(col_name, f_name, missing) \n", 233 | " print 'percent: ', missing * 1.0 / df_file.shape[0]\n", 234 | "\n", 235 | "empty_records(USER_FILE, 'User', 'age')\n", 236 | "empty_records(USER_FILE, 'User', 'sex')\n", 237 | "empty_records(USER_FILE, 'User', 'user_reg_tm')\n", 238 | "empty_records(ACTION_201602_FILE, 'Action 2', 'model_id')\n", 239 | "empty_records(ACTION_201603_FILE, 'Action 3', 'model_id')\n", 240 | "empty_records(ACTION_201604_FILE, 'Action 4', 'model_id')" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "Compare this against the total record counts of the datasets:\n", 248 | "\n", 249 | "File|Description|Records\n", 250 | "---|---|---\n", 251 | "1. JData_User.csv | user data | 105,321 users\n", 252 | "2. JData_Comment.csv | product comments | 558,552 records\n", 253 | "3. JData_Product.csv | candidate product set | 24,187 records\n", 254 | "4. JData_Action_201602.csv | February action records | 11,485,424 records\n", 255 | "5. JData_Action_201603.csv | March action records | 25,916,378 records\n", 256 | "6. JData_Action_201604.csv | April action records | 13,199,934 records\n", 257 | "\n", 258 | "Combining these counts with the output above, each dataset is handled differently:\n", 259 | "* User file \n", 260 | " * age, sex: first fill with the corresponding unknown markers (-1 | 2), then analyze and handle them later together with the other unknown values\n", 261 | " * user_reg_tm: leave as-is for now\n", 262 | "* Action file\n", 263 | " * model_id is missing in nearly half the records, and we have no good way to handle this feature yet, so its treatment is left undecided" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 6, 269 | "metadata": { 270 | "ExecuteTime": { 271 | "end_time": "2017-05-10T21:18:27.455370Z", 272 | "start_time": "2017-05-10T21:18:27.416911Z" 273 | }, 274 | "collapsed": true 275 | }, 276 | "outputs": [], 277 | "source": [ 278 | "user = pd.read_csv(USER_FILE)\n", 279 | "user['age'].fillna('-1', inplace=True)\n", 280 | "user['sex'].fillna(2, inplace=True)" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": 7, 286 | "metadata": { 287 | "ExecuteTime": { 288 | "end_time": "2017-05-10T21:18:27.478275Z", 289 | "start_time": "2017-05-10T21:18:27.456423Z" 290 | }, 291 | "collapsed": false 292 | }, 293 | "outputs": [ 294 | { 
295 | "name": "stdout", 296 | "output_type": "stream", 297 | "text": [ 298 | "user_id False\n", 299 | "age False\n", 300 | "sex False\n", 301 | "user_lv_cd False\n", 302 | "user_reg_tm True\n", 303 | "dtype: bool\n" 304 | ] 305 | } 306 | ], 307 | "source": [ 308 | "print pd.isnull(user).any()" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": 10, 314 | "metadata": { 315 | "ExecuteTime": { 316 | "end_time": "2017-05-10T21:26:59.218638Z", 317 | "start_time": "2017-05-10T21:26:59.207494Z" 318 | }, 319 | "collapsed": false 320 | }, 321 | "outputs": [ 322 | { 323 | "name": "stdout", 324 | "output_type": "stream", 325 | "text": [ 326 | " user_id age sex user_lv_cd user_reg_tm\n", 327 | "34072 234073 -1 2.0 1 NaN\n", 328 | "38905 238906 -1 2.0 1 NaN\n", 329 | "67704 267705 -1 2.0 1 NaN\n" 330 | ] 331 | } 332 | ], 333 | "source": [ 334 | "nan_reg_tm = user[user['user_reg_tm'].isnull()]\n", 335 | "print nan_reg_tm" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": 8, 341 | "metadata": { 342 | "ExecuteTime": { 343 | "end_time": "2017-05-07T16:36:49.338047Z", 344 | "start_time": "2017-05-07T16:36:49.268713Z" 345 | }, 346 | "collapsed": false 347 | }, 348 | "outputs": [ 349 | { 350 | "name": "stdout", 351 | "output_type": "stream", 352 | "text": [ 353 | "7\n", 354 | "3\n", 355 | "5\n" 356 | ] 357 | } 358 | ], 359 | "source": [ 360 | "print len(user['age'].unique())\n", 361 | "print len(user['sex'].unique())\n", 362 | "print len(user['user_lv_cd'].unique())" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": 9, 368 | "metadata": { 369 | "ExecuteTime": { 370 | "end_time": "2017-05-07T16:36:49.376052Z", 371 | "start_time": "2017-05-07T16:36:49.340659Z" 372 | }, 373 | "collapsed": true 374 | }, 375 | "outputs": [], 376 | "source": [ 377 | "prod = pd.read_csv(PRODUCT_FILE)" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": 10, 383 | "metadata": { 384 | "ExecuteTime": { 385 | 
"end_time": "2017-05-07T16:36:49.455004Z", 386 | "start_time": "2017-05-07T16:36:49.377236Z" 387 | }, 388 | "collapsed": false 389 | }, 390 | "outputs": [ 391 | { 392 | "name": "stdout", 393 | "output_type": "stream", 394 | "text": [ 395 | "4\n", 396 | "3\n", 397 | "3\n", 398 | "102\n" 399 | ] 400 | } 401 | ], 402 | "source": [ 403 | "print len(prod['a1'].unique())\n", 404 | "print len(prod['a2'].unique())\n", 405 | "print len(prod['a3'].unique())\n", 406 | "# print len(prod['a2'].unique())\n", 407 | "print len(prod['brand'].unique())" 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "metadata": {}, 413 | "source": [ 414 | "### Unknown Values\n", 415 | "Next, let's look at what share of each file the unknown values account for" 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": 11, 421 | "metadata": { 422 | "ExecuteTime": { 423 | "end_time": "2017-05-07T16:36:49.573716Z", 424 | "start_time": "2017-05-07T16:36:49.456369Z" 425 | }, 426 | "collapsed": false 427 | }, 428 | "outputs": [ 429 | { 430 | "name": "stdout", 431 | "output_type": "stream", 432 | "text": [ 433 | "No. of unknown age user: 14415 and the percent: 0.136867291423 \n", 434 | "No. of unknown sex user: 54738 and the percent: 0.519725410887 \n" 435 | ] 436 | } 437 | ], 438 | "source": [ 439 | "print 'No. of unknown age user: {0} and the percent: {1} '.format(user[user['age']=='-1'].shape[0],\n", 440 | " user[user['age']=='-1'].shape[0]*1.0/user.shape[0])\n", 441 | "print 'No. of unknown sex user: {0} and the percent: {1} '.format(user[user['sex']==2].shape[0],\n", 442 | " user[user['sex']==2].shape[0]*1.0/user.shape[0])" 443 | ] 444 | }, 445 | { 446 | "cell_type": "code", 447 | "execution_count": 12, 448 | "metadata": { 449 | "ExecuteTime": { 450 | "end_time": "2017-05-07T16:36:49.639049Z", 451 | "start_time": "2017-05-07T16:36:49.575491Z" 452 | }, 453 | "collapsed": false 454 | }, 455 | "outputs": [ 456 | { 457 | "name": "stdout", 458 | "output_type": "stream", 459 | "text": [ 460 | "No. 
of unknown a1 in Product is 1701\n", 461 | "percent: 0.0703270351842\n", 462 | "No. of unknown a2 in Product is 4050\n", 463 | "percent: 0.167445321867\n", 464 | "No. of unknown a3 in Product is 3815\n", 465 | "percent: 0.157729358746\n" 466 | ] 467 | } 468 | ], 469 | "source": [ 470 | "def unknown_records(f_path, f_name, col_name):\n", 471 | " df_file = pd.read_csv(f_path)\n", 472 | " missing = df_file[df_file[col_name]==-1].shape[0]\n", 473 | " print 'No. of unknown {0} in {1} is {2}'.format(col_name, f_name, missing) \n", 474 | " print 'percent: ', missing * 1.0 / df_file.shape[0]\n", 475 | " \n", 476 | "unknown_records(PRODUCT_FILE, 'Product', 'a1')\n", 477 | "unknown_records(PRODUCT_FILE, 'Product', 'a2')\n", 478 | "unknown_records(PRODUCT_FILE, 'Product', 'a3')" 479 | ] 480 | }, 481 | { 482 | "cell_type": "markdown", 483 | "metadata": {}, 484 | "source": [ 485 | "To summarize:\n", 486 | "* Missing values: for the 3 affected users, sex and age are filled with the unknown markers while the registration time is left alone; in the action data, handling of model_id (missing in 43.2%, 40.7%, and 39.0% of the records) is still undecided\n", 487 | "* Unknown values: some users have an unknown age: 13.7%, and sex is withheld for more than half of them: 52.0%\n", 488 | "* each product attribute also has some unknown values, a1= Q1 - step) & (log_data[feature] <= Q3 + step))])" 523 | ] 524 | } 525 | ], 526 | "metadata": { 527 | "kernelspec": { 528 | "display_name": "Python 2", 529 | "language": "python", 530 | "name": "python2" 531 | }, 532 | "language_info": { 533 | "codemirror_mode": { 534 | "name": "ipython", 535 | "version": 2 536 | }, 537 | "file_extension": ".py", 538 | "mimetype": "text/x-python", 539 | "name": "python", 540 | "nbconvert_exporter": "python", 541 | "pygments_lexer": "ipython2", 542 | "version": "2.7.13" 543 | } 544 | }, 545 | "nbformat": 4, 546 | "nbformat_minor": 2 547 | } 548 | -------------------------------------------------------------------------------- /Model_Design_v1.7.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Model Design" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 
| "metadata": {}, 13 | "source": [ 14 | "A dedicated model for quick evaluation\n", 15 | "\n", 16 | "Since GridSearchCV takes too long, we compromise: the existing split of the original training set is used as the yardstick for parameter tuning" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 208, 22 | "metadata": { 23 | "ExecuteTime": { 24 | "end_time": "2017-05-25T15:16:30.859241Z", 25 | "start_time": "2017-05-25T15:16:30.854262Z" 26 | }, 27 | "collapsed": false 28 | }, 29 | "outputs": [], 30 | "source": [ 31 | "#!/usr/bin/env python\n", 32 | "# -*- coding: UTF-8 -*-\n", 33 | "import sys\n", 34 | "import pandas as pd\n", 35 | "import numpy as np\n", 36 | "import xgboost as xgb\n", 37 | "from sklearn.model_selection import train_test_split\n", 38 | "import operator\n", 39 | "from matplotlib import pylab as plt\n", 40 | "from datetime import datetime\n", 41 | "import time\n", 42 | "from sklearn.model_selection import GridSearchCV" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 209, 48 | "metadata": { 49 | "ExecuteTime": { 50 | "end_time": "2017-05-25T15:16:30.928993Z", 51 | "start_time": "2017-05-25T15:16:30.863330Z" 52 | }, 53 | "collapsed": false 54 | }, 55 | "outputs": [], 56 | "source": [ 57 | "# import gc\n", 58 | "def show_record():\n", 59 | " train = pd.read_csv('train_set.csv')\n", 60 | "# valid = pd.read_csv('val_set.csv')\n", 61 | "# label_val = pd.read_csv('label_val_set.csv')\n", 62 | " valid1 = pd.read_csv('val_1.csv')\n", 63 | " valid2 = pd.read_csv('val_2.csv')\n", 64 | " valid3 = pd.read_csv('val_3.csv')\n", 65 | "# test = pd.read_csv('test_set.csv')\n", 66 | " print train.shape\n", 67 | "# print valid.shape\n", 68 | "# print label_val.shape\n", 69 | "# print test.shape\n", 70 | " print valid1.shape\n", 71 | " print valid2.shape\n", 72 | " print valid3.shape\n", 73 | "\n", 74 | "# show_record()\n", 75 | "# del train, valid, test\n", 76 | "# gc.collect()\n" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "### Training Data\n", 84 | "* returns the trained model\n", 85 | "* generates a feature map file for later feature-importance analysis" 86 | ] 87 | 
}, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 210, 91 | "metadata": { 92 | "ExecuteTime": { 93 | "end_time": "2017-05-25T15:17:20.950524Z", 94 | "start_time": "2017-05-25T15:16:30.930219Z" 95 | }, 96 | "collapsed": false 97 | }, 98 | "outputs": [ 99 | { 100 | "name": "stdout", 101 | "output_type": "stream", 102 | "text": [ 103 | "total features: 301\n", 104 | "[0]\ttrain-auc:0.913108\teval-auc:0.911621\n", 105 | "Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.\n", 106 | "\n", 107 | "Will train until eval-auc hasn't improved in 10 rounds.\n", 108 | "[1]\ttrain-auc:0.932872\teval-auc:0.930423\n", 109 | "[2]\ttrain-auc:0.936241\teval-auc:0.93338\n", 110 | "[3]\ttrain-auc:0.938389\teval-auc:0.936325\n", 111 | "[4]\ttrain-auc:0.938821\teval-auc:0.937227\n", 112 | "[5]\ttrain-auc:0.942359\teval-auc:0.941066\n", 113 | "[6]\ttrain-auc:0.943082\teval-auc:0.941563\n", 114 | "[7]\ttrain-auc:0.943831\teval-auc:0.942126\n", 115 | "[8]\ttrain-auc:0.945939\teval-auc:0.944321\n", 116 | "[9]\ttrain-auc:0.946849\teval-auc:0.94524\n", 117 | "[10]\ttrain-auc:0.948741\teval-auc:0.946958\n", 118 | "[11]\ttrain-auc:0.949038\teval-auc:0.947272\n", 119 | "[12]\ttrain-auc:0.950141\teval-auc:0.948423\n", 120 | "[13]\ttrain-auc:0.951377\teval-auc:0.949643\n", 121 | "[14]\ttrain-auc:0.951717\teval-auc:0.950112\n", 122 | "[15]\ttrain-auc:0.952721\teval-auc:0.951315\n", 123 | "[16]\ttrain-auc:0.953227\teval-auc:0.952028\n", 124 | "[17]\ttrain-auc:0.953574\teval-auc:0.952416\n", 125 | "[18]\ttrain-auc:0.954116\teval-auc:0.952997\n", 126 | "[19]\ttrain-auc:0.954471\teval-auc:0.953389\n", 127 | "[20]\ttrain-auc:0.954607\teval-auc:0.953544\n", 128 | "[21]\ttrain-auc:0.954891\teval-auc:0.953925\n", 129 | "[22]\ttrain-auc:0.955298\teval-auc:0.954452\n", 130 | "[23]\ttrain-auc:0.955599\teval-auc:0.954829\n", 131 | "[24]\ttrain-auc:0.956163\teval-auc:0.955341\n", 132 | "[25]\ttrain-auc:0.956823\teval-auc:0.95607\n", 133 | 
"[26]\ttrain-auc:0.957225\teval-auc:0.956492\n", 134 | "[27]\ttrain-auc:0.958301\teval-auc:0.957585\n", 135 | "[28]\ttrain-auc:0.958674\teval-auc:0.957999\n", 136 | "[29]\ttrain-auc:0.959072\teval-auc:0.958424\n", 137 | "[30]\ttrain-auc:0.95949\teval-auc:0.95886\n", 138 | "[31]\ttrain-auc:0.960144\teval-auc:0.959574\n", 139 | "[32]\ttrain-auc:0.960481\teval-auc:0.959766\n", 140 | "[33]\ttrain-auc:0.960753\teval-auc:0.959948\n", 141 | "[34]\ttrain-auc:0.961088\teval-auc:0.960227\n", 142 | "[35]\ttrain-auc:0.961579\teval-auc:0.960637\n", 143 | "[36]\ttrain-auc:0.961899\teval-auc:0.961006\n", 144 | "[37]\ttrain-auc:0.96222\teval-auc:0.961291\n", 145 | "[38]\ttrain-auc:0.96256\teval-auc:0.961603\n", 146 | "[39]\ttrain-auc:0.962937\teval-auc:0.96199\n", 147 | "[40]\ttrain-auc:0.963402\teval-auc:0.96247\n", 148 | "[41]\ttrain-auc:0.963608\teval-auc:0.962675\n", 149 | "[42]\ttrain-auc:0.963719\teval-auc:0.962718\n", 150 | "[43]\ttrain-auc:0.964028\teval-auc:0.963035\n", 151 | "[44]\ttrain-auc:0.964363\teval-auc:0.963354\n", 152 | "[45]\ttrain-auc:0.964619\teval-auc:0.963669\n", 153 | "[46]\ttrain-auc:0.964959\teval-auc:0.964009\n", 154 | "[47]\ttrain-auc:0.965308\teval-auc:0.964401\n", 155 | "[48]\ttrain-auc:0.965511\teval-auc:0.964586\n", 156 | "[49]\ttrain-auc:0.965684\teval-auc:0.964769\n", 157 | "[50]\ttrain-auc:0.965868\teval-auc:0.964956\n", 158 | "[51]\ttrain-auc:0.966077\teval-auc:0.965119\n", 159 | "[52]\ttrain-auc:0.966301\teval-auc:0.965328\n", 160 | "[53]\ttrain-auc:0.966489\teval-auc:0.965482\n", 161 | "[54]\ttrain-auc:0.966734\teval-auc:0.965689\n", 162 | "[55]\ttrain-auc:0.966895\teval-auc:0.965911\n", 163 | "[56]\ttrain-auc:0.967153\teval-auc:0.966191\n", 164 | "[57]\ttrain-auc:0.967307\teval-auc:0.966304\n", 165 | "[58]\ttrain-auc:0.967423\teval-auc:0.966448\n", 166 | "[59]\ttrain-auc:0.967507\teval-auc:0.9665\n", 167 | "[60]\ttrain-auc:0.96769\teval-auc:0.966634\n", 168 | "[61]\ttrain-auc:0.967913\teval-auc:0.966795\n", 169 | 
"[62]\ttrain-auc:0.968068\teval-auc:0.966968\n", 170 | "[63]\ttrain-auc:0.968196\teval-auc:0.967114\n", 171 | "[64]\ttrain-auc:0.968262\teval-auc:0.96715\n", 172 | "[65]\ttrain-auc:0.968397\teval-auc:0.96732\n", 173 | "[66]\ttrain-auc:0.96851\teval-auc:0.967424\n", 174 | "[67]\ttrain-auc:0.968621\teval-auc:0.967548\n", 175 | "[68]\ttrain-auc:0.968755\teval-auc:0.967664\n", 176 | "[69]\ttrain-auc:0.968887\teval-auc:0.967772\n", 177 | "[70]\ttrain-auc:0.968964\teval-auc:0.967863\n", 178 | "[71]\ttrain-auc:0.969091\teval-auc:0.96799\n", 179 | "[72]\ttrain-auc:0.969186\teval-auc:0.96807\n", 180 | "[73]\ttrain-auc:0.969338\teval-auc:0.96821\n", 181 | "[74]\ttrain-auc:0.969443\teval-auc:0.968308\n", 182 | "[75]\ttrain-auc:0.969527\teval-auc:0.968395\n", 183 | "[76]\ttrain-auc:0.969607\teval-auc:0.968476\n", 184 | "[77]\ttrain-auc:0.969698\teval-auc:0.968581\n", 185 | "[78]\ttrain-auc:0.969798\teval-auc:0.968621\n", 186 | "[79]\ttrain-auc:0.96986\teval-auc:0.968656\n", 187 | "[80]\ttrain-auc:0.969931\teval-auc:0.968705\n", 188 | "[81]\ttrain-auc:0.97008\teval-auc:0.968845\n", 189 | "[82]\ttrain-auc:0.970107\teval-auc:0.968868\n", 190 | "[83]\ttrain-auc:0.970225\teval-auc:0.968986\n", 191 | "[84]\ttrain-auc:0.970319\teval-auc:0.969047\n", 192 | "[85]\ttrain-auc:0.97046\teval-auc:0.969204\n", 193 | "[86]\ttrain-auc:0.970525\teval-auc:0.96928\n", 194 | "[87]\ttrain-auc:0.970585\teval-auc:0.969315\n", 195 | "[88]\ttrain-auc:0.970624\teval-auc:0.969353\n", 196 | "[89]\ttrain-auc:0.970689\teval-auc:0.969405\n", 197 | "[90]\ttrain-auc:0.97083\teval-auc:0.969556\n", 198 | "[91]\ttrain-auc:0.970917\teval-auc:0.969625\n", 199 | "[92]\ttrain-auc:0.970938\teval-auc:0.969653\n", 200 | "[93]\ttrain-auc:0.971022\teval-auc:0.969732\n", 201 | "[94]\ttrain-auc:0.971079\teval-auc:0.969774\n", 202 | "[95]\ttrain-auc:0.971156\teval-auc:0.969873\n", 203 | "[96]\ttrain-auc:0.971247\teval-auc:0.96995\n", 204 | "[97]\ttrain-auc:0.97132\teval-auc:0.970017\n", 205 | 
"[98]\ttrain-auc:0.971355\teval-auc:0.970063\n", 206 | "[99]\ttrain-auc:0.971424\teval-auc:0.970123\n", 207 | "[100]\ttrain-auc:0.971516\teval-auc:0.970181\n", 208 | "[101]\ttrain-auc:0.971591\teval-auc:0.97024\n", 209 | "[102]\ttrain-auc:0.971706\teval-auc:0.970358\n", 210 | "[103]\ttrain-auc:0.971833\teval-auc:0.970458\n", 211 | "[104]\ttrain-auc:0.971895\teval-auc:0.970494\n", 212 | "[105]\ttrain-auc:0.97195\teval-auc:0.970536\n", 213 | "[106]\ttrain-auc:0.971994\teval-auc:0.970567\n", 214 | "[107]\ttrain-auc:0.972044\teval-auc:0.970608\n", 215 | "[108]\ttrain-auc:0.97209\teval-auc:0.970646\n", 216 | "[109]\ttrain-auc:0.97218\teval-auc:0.970723\n", 217 | "[110]\ttrain-auc:0.972293\teval-auc:0.970827\n", 218 | "[111]\ttrain-auc:0.972311\teval-auc:0.970847\n", 219 | "[112]\ttrain-auc:0.972366\teval-auc:0.970898\n", 220 | "[113]\ttrain-auc:0.972435\teval-auc:0.97097\n", 221 | "[114]\ttrain-auc:0.972496\teval-auc:0.97102\n", 222 | "[115]\ttrain-auc:0.972532\teval-auc:0.971041\n", 223 | "[116]\ttrain-auc:0.972576\teval-auc:0.971063\n", 224 | "[117]\ttrain-auc:0.972691\teval-auc:0.971179\n", 225 | "[118]\ttrain-auc:0.97274\teval-auc:0.971195\n", 226 | "[119]\ttrain-auc:0.972806\teval-auc:0.971237\n", 227 | "[120]\ttrain-auc:0.972906\teval-auc:0.971305\n", 228 | "[121]\ttrain-auc:0.972952\teval-auc:0.971337\n", 229 | "[122]\ttrain-auc:0.972985\teval-auc:0.971359\n", 230 | "[123]\ttrain-auc:0.973057\teval-auc:0.971405\n", 231 | "[124]\ttrain-auc:0.973137\teval-auc:0.971469\n", 232 | "[125]\ttrain-auc:0.973214\teval-auc:0.971547\n", 233 | "[126]\ttrain-auc:0.973253\teval-auc:0.971568\n", 234 | "[127]\ttrain-auc:0.973292\teval-auc:0.971589\n", 235 | "[128]\ttrain-auc:0.973339\teval-auc:0.971617\n", 236 | "[129]\ttrain-auc:0.973381\teval-auc:0.971678\n", 237 | "[130]\ttrain-auc:0.973479\teval-auc:0.971781\n", 238 | "[131]\ttrain-auc:0.973517\teval-auc:0.971821\n", 239 | "[132]\ttrain-auc:0.973577\teval-auc:0.971874\n", 240 | 
"[133]\ttrain-auc:0.973626\teval-auc:0.971904\n", 241 | "[134]\ttrain-auc:0.973689\teval-auc:0.971956\n", 242 | "[135]\ttrain-auc:0.973712\teval-auc:0.971973\n", 243 | "[136]\ttrain-auc:0.973737\teval-auc:0.971986\n", 244 | "[137]\ttrain-auc:0.973787\teval-auc:0.972016\n", 245 | "[138]\ttrain-auc:0.973829\teval-auc:0.972047\n", 246 | "[139]\ttrain-auc:0.97386\teval-auc:0.97207\n", 247 | "[140]\ttrain-auc:0.973892\teval-auc:0.972098\n", 248 | "[141]\ttrain-auc:0.973934\teval-auc:0.972123\n", 249 | "[142]\ttrain-auc:0.974017\teval-auc:0.972172\n", 250 | "[143]\ttrain-auc:0.974054\teval-auc:0.972186\n", 251 | "[144]\ttrain-auc:0.974074\teval-auc:0.972196\n", 252 | "[145]\ttrain-auc:0.974146\teval-auc:0.97224\n", 253 | "[146]\ttrain-auc:0.97422\teval-auc:0.972268\n", 254 | "[147]\ttrain-auc:0.974302\teval-auc:0.972333\n", 255 | "[148]\ttrain-auc:0.974342\teval-auc:0.972365\n", 256 | "[149]\ttrain-auc:0.9744\teval-auc:0.972402\n", 257 | "[150]\ttrain-auc:0.974414\teval-auc:0.972401\n", 258 | "[151]\ttrain-auc:0.974465\teval-auc:0.972457\n", 259 | "[152]\ttrain-auc:0.974484\teval-auc:0.972464\n", 260 | "[153]\ttrain-auc:0.974518\teval-auc:0.972481\n", 261 | "[154]\ttrain-auc:0.974543\teval-auc:0.972486\n", 262 | "[155]\ttrain-auc:0.974571\teval-auc:0.972516\n", 263 | "[156]\ttrain-auc:0.974615\teval-auc:0.972535\n", 264 | "[157]\ttrain-auc:0.974697\teval-auc:0.972602\n", 265 | "[158]\ttrain-auc:0.974711\teval-auc:0.972621\n", 266 | "[159]\ttrain-auc:0.974749\teval-auc:0.972668\n", 267 | "[160]\ttrain-auc:0.974835\teval-auc:0.972706\n", 268 | "[161]\ttrain-auc:0.974902\teval-auc:0.972744\n", 269 | "[162]\ttrain-auc:0.974951\teval-auc:0.972775\n", 270 | "[163]\ttrain-auc:0.975003\teval-auc:0.972796\n", 271 | "[164]\ttrain-auc:0.975017\teval-auc:0.972807\n", 272 | "[165]\ttrain-auc:0.975039\teval-auc:0.972834\n", 273 | "[166]\ttrain-auc:0.975067\teval-auc:0.972862\n", 274 | "[167]\ttrain-auc:0.975098\teval-auc:0.97288\n", 275 | 
"[168]\ttrain-auc:0.975117\teval-auc:0.972898\n", 276 | "[169]\ttrain-auc:0.975179\teval-auc:0.972938\n", 277 | "[170]\ttrain-auc:0.975194\teval-auc:0.972923\n", 278 | "[171]\ttrain-auc:0.975231\teval-auc:0.972927\n", 279 | "[172]\ttrain-auc:0.97529\teval-auc:0.972968\n", 280 | "[173]\ttrain-auc:0.975333\teval-auc:0.972991\n", 281 | "[174]\ttrain-auc:0.975344\teval-auc:0.972996\n", 282 | "[175]\ttrain-auc:0.975396\teval-auc:0.973037\n", 283 | "[176]\ttrain-auc:0.975407\teval-auc:0.973061\n", 284 | "[177]\ttrain-auc:0.975475\teval-auc:0.973096\n", 285 | "[178]\ttrain-auc:0.97554\teval-auc:0.973126\n", 286 | "[179]\ttrain-auc:0.975574\teval-auc:0.973153\n", 287 | "[180]\ttrain-auc:0.975602\teval-auc:0.973176\n", 288 | "[181]\ttrain-auc:0.975673\teval-auc:0.973216\n", 289 | "[182]\ttrain-auc:0.9757\teval-auc:0.973239\n", 290 | "[183]\ttrain-auc:0.975749\teval-auc:0.973277\n", 291 | "[184]\ttrain-auc:0.975785\teval-auc:0.973293\n", 292 | "[185]\ttrain-auc:0.975804\teval-auc:0.973307\n", 293 | "[186]\ttrain-auc:0.975845\teval-auc:0.973352\n", 294 | "[187]\ttrain-auc:0.975861\teval-auc:0.973368\n", 295 | "[188]\ttrain-auc:0.975877\teval-auc:0.973377\n", 296 | "[189]\ttrain-auc:0.975923\teval-auc:0.973386\n", 297 | "[190]\ttrain-auc:0.975947\teval-auc:0.973379\n", 298 | "[191]\ttrain-auc:0.976003\teval-auc:0.973426\n", 299 | "[192]\ttrain-auc:0.97606\teval-auc:0.973478\n", 300 | "[193]\ttrain-auc:0.976122\teval-auc:0.973517\n", 301 | "[194]\ttrain-auc:0.976136\teval-auc:0.973525\n", 302 | "[195]\ttrain-auc:0.976159\teval-auc:0.973543\n", 303 | "[196]\ttrain-auc:0.976184\teval-auc:0.973557\n", 304 | "[197]\ttrain-auc:0.9762\teval-auc:0.973575\n", 305 | "[198]\ttrain-auc:0.976241\teval-auc:0.973598\n", 306 | "[199]\ttrain-auc:0.976268\teval-auc:0.973611\n", 307 | "[200]\ttrain-auc:0.97629\teval-auc:0.973603\n", 308 | "[201]\ttrain-auc:0.97635\teval-auc:0.973653\n", 309 | "[202]\ttrain-auc:0.976386\teval-auc:0.973675\n", 310 | 
"[203]\ttrain-auc:0.976424\teval-auc:0.973702\n", 311 | "[204]\ttrain-auc:0.976454\teval-auc:0.973711\n", 312 | "[205]\ttrain-auc:0.976462\teval-auc:0.97372\n", 313 | "[206]\ttrain-auc:0.976477\teval-auc:0.973726\n", 314 | "[207]\ttrain-auc:0.976495\teval-auc:0.973728\n", 315 | "[208]\ttrain-auc:0.976579\teval-auc:0.973798\n", 316 | "[209]\ttrain-auc:0.976655\teval-auc:0.973876\n", 317 | "[210]\ttrain-auc:0.976672\teval-auc:0.973882\n", 318 | "[211]\ttrain-auc:0.976726\teval-auc:0.973922\n", 319 | "[212]\ttrain-auc:0.976811\teval-auc:0.973967\n", 320 | "[213]\ttrain-auc:0.976845\teval-auc:0.973996\n", 321 | "[214]\ttrain-auc:0.976899\teval-auc:0.97401\n", 322 | "[215]\ttrain-auc:0.976926\teval-auc:0.974021\n", 323 | "[216]\ttrain-auc:0.976939\teval-auc:0.974023\n", 324 | "[217]\ttrain-auc:0.976969\teval-auc:0.974021\n", 325 | "[218]\ttrain-auc:0.976991\teval-auc:0.974018\n", 326 | "[219]\ttrain-auc:0.977021\teval-auc:0.974029\n", 327 | "[220]\ttrain-auc:0.97705\teval-auc:0.974053\n", 328 | "[221]\ttrain-auc:0.977064\teval-auc:0.974069\n", 329 | "[222]\ttrain-auc:0.97708\teval-auc:0.974078\n", 330 | "[223]\ttrain-auc:0.977096\teval-auc:0.974088\n", 331 | "[224]\ttrain-auc:0.97711\teval-auc:0.9741\n", 332 | "[225]\ttrain-auc:0.977132\teval-auc:0.974098\n", 333 | "[226]\ttrain-auc:0.977167\teval-auc:0.97409\n", 334 | "[227]\ttrain-auc:0.977197\teval-auc:0.974117\n", 335 | "[228]\ttrain-auc:0.977257\teval-auc:0.974155\n", 336 | "[229]\ttrain-auc:0.977282\teval-auc:0.974168\n", 337 | "[230]\ttrain-auc:0.977315\teval-auc:0.97418\n", 338 | "[231]\ttrain-auc:0.9774\teval-auc:0.974234\n", 339 | "[232]\ttrain-auc:0.977426\teval-auc:0.974248\n", 340 | "[233]\ttrain-auc:0.977435\teval-auc:0.974243\n", 341 | "[234]\ttrain-auc:0.977444\teval-auc:0.97425\n", 342 | "[235]\ttrain-auc:0.977501\teval-auc:0.974306\n", 343 | "[236]\ttrain-auc:0.977522\teval-auc:0.974321\n", 344 | "[237]\ttrain-auc:0.977528\teval-auc:0.974311\n", 345 | "[238]\ttrain-auc:0.977535\teval-auc:0.974316\n", 
346 | "[239]\ttrain-auc:0.977581\teval-auc:0.97436\n", 347 | "[240]\ttrain-auc:0.977636\teval-auc:0.974411\n", 348 | "[241]\ttrain-auc:0.977643\teval-auc:0.974413\n", 349 | "[242]\ttrain-auc:0.977702\teval-auc:0.974434\n", 350 | "[243]\ttrain-auc:0.977712\teval-auc:0.974438\n", 351 | "[244]\ttrain-auc:0.977738\teval-auc:0.974443\n", 352 | "[245]\ttrain-auc:0.977757\teval-auc:0.974467\n", 353 | "[246]\ttrain-auc:0.977775\teval-auc:0.974465\n", 354 | "[247]\ttrain-auc:0.977804\teval-auc:0.974468\n", 355 | "[248]\ttrain-auc:0.977839\teval-auc:0.974488\n", 356 | "[249]\ttrain-auc:0.977852\teval-auc:0.974497\n", 357 | "[250]\ttrain-auc:0.977885\teval-auc:0.974505\n", 358 | "[251]\ttrain-auc:0.977908\teval-auc:0.974498\n", 359 | "[252]\ttrain-auc:0.977951\teval-auc:0.974538\n", 360 | "[253]\ttrain-auc:0.977984\teval-auc:0.974515\n", 361 | "[254]\ttrain-auc:0.978015\teval-auc:0.974532\n", 362 | "[255]\ttrain-auc:0.978064\teval-auc:0.97458\n", 363 | "[256]\ttrain-auc:0.978121\teval-auc:0.974593\n", 364 | "[257]\ttrain-auc:0.978146\teval-auc:0.974613\n", 365 | "[258]\ttrain-auc:0.978171\teval-auc:0.974605\n", 366 | "[259]\ttrain-auc:0.978204\teval-auc:0.974623\n", 367 | "[260]\ttrain-auc:0.978244\teval-auc:0.974644\n", 368 | "[261]\ttrain-auc:0.978264\teval-auc:0.97466\n", 369 | "[262]\ttrain-auc:0.978277\teval-auc:0.974679\n", 370 | "[263]\ttrain-auc:0.978284\teval-auc:0.974679\n", 371 | "[264]\ttrain-auc:0.978351\teval-auc:0.974717\n", 372 | "[265]\ttrain-auc:0.978383\teval-auc:0.974731\n", 373 | "[266]\ttrain-auc:0.978402\teval-auc:0.974751\n", 374 | "[267]\ttrain-auc:0.978414\teval-auc:0.97475\n", 375 | "[268]\ttrain-auc:0.978433\teval-auc:0.974759\n", 376 | "[269]\ttrain-auc:0.978467\teval-auc:0.97477\n", 377 | "[270]\ttrain-auc:0.978501\teval-auc:0.974782\n", 378 | "[271]\ttrain-auc:0.978552\teval-auc:0.974822\n", 379 | "[272]\ttrain-auc:0.978575\teval-auc:0.974838\n", 380 | "[273]\ttrain-auc:0.978587\teval-auc:0.974845\n", 381 | 
"[274]\ttrain-auc:0.978595\teval-auc:0.974843\n", 382 | "[275]\ttrain-auc:0.978607\teval-auc:0.974851\n", 383 | "[276]\ttrain-auc:0.978661\teval-auc:0.974882\n", 384 | "[277]\ttrain-auc:0.978726\teval-auc:0.97493\n", 385 | "[278]\ttrain-auc:0.978749\teval-auc:0.974941\n", 386 | "[279]\ttrain-auc:0.978778\teval-auc:0.974958\n", 387 | "[280]\ttrain-auc:0.978804\teval-auc:0.974971\n", 388 | "[281]\ttrain-auc:0.978833\teval-auc:0.974998\n", 389 | "[282]\ttrain-auc:0.978837\teval-auc:0.974999\n", 390 | "[283]\ttrain-auc:0.978851\teval-auc:0.974991\n", 391 | "[284]\ttrain-auc:0.978876\teval-auc:0.974986\n", 392 | "[285]\ttrain-auc:0.978919\teval-auc:0.975025\n", 393 | "[286]\ttrain-auc:0.978936\teval-auc:0.975018\n", 394 | "[287]\ttrain-auc:0.978943\teval-auc:0.975042\n", 395 | "[288]\ttrain-auc:0.978956\teval-auc:0.975056\n", 396 | "[289]\ttrain-auc:0.978968\teval-auc:0.975069\n", 397 | "[290]\ttrain-auc:0.979003\teval-auc:0.975099\n", 398 | "[291]\ttrain-auc:0.979014\teval-auc:0.975107\n", 399 | "[292]\ttrain-auc:0.979055\teval-auc:0.975141\n", 400 | "[293]\ttrain-auc:0.979068\teval-auc:0.975145\n", 401 | "[294]\ttrain-auc:0.979106\teval-auc:0.975182\n", 402 | "[295]\ttrain-auc:0.979123\teval-auc:0.975187\n", 403 | "[296]\ttrain-auc:0.979158\teval-auc:0.975205\n", 404 | "[297]\ttrain-auc:0.979189\teval-auc:0.975221\n", 405 | "[298]\ttrain-auc:0.979238\teval-auc:0.975272\n", 406 | "[299]\ttrain-auc:0.979276\teval-auc:0.975295\n", 407 | "[300]\ttrain-auc:0.979331\teval-auc:0.975331\n", 408 | "[301]\ttrain-auc:0.979346\teval-auc:0.975349\n", 409 | "[302]\ttrain-auc:0.979402\teval-auc:0.975374\n", 410 | "[303]\ttrain-auc:0.979424\teval-auc:0.97539\n", 411 | "[304]\ttrain-auc:0.979441\teval-auc:0.975395\n", 412 | "[305]\ttrain-auc:0.979457\teval-auc:0.975398\n", 413 | "[306]\ttrain-auc:0.979478\teval-auc:0.975397\n", 414 | "[307]\ttrain-auc:0.979487\teval-auc:0.975406\n", 415 | "[308]\ttrain-auc:0.97954\teval-auc:0.975441\n", 416 | 
"[309]\ttrain-auc:0.979567\teval-auc:0.975446\n", 417 | "[310]\ttrain-auc:0.979579\teval-auc:0.975445\n", 418 | "[311]\ttrain-auc:0.979581\teval-auc:0.975446\n", 419 | "[312]\ttrain-auc:0.979591\teval-auc:0.97546\n", 420 | "[313]\ttrain-auc:0.979596\teval-auc:0.97545\n", 421 | "[314]\ttrain-auc:0.979661\teval-auc:0.975512\n", 422 | "[315]\ttrain-auc:0.979701\teval-auc:0.975553\n", 423 | "[316]\ttrain-auc:0.979726\teval-auc:0.975551\n", 424 | "[317]\ttrain-auc:0.979751\teval-auc:0.975557\n", 425 | "[318]\ttrain-auc:0.979766\teval-auc:0.975567\n", 426 | "[319]\ttrain-auc:0.97982\teval-auc:0.975622\n", 427 | "[320]\ttrain-auc:0.979872\teval-auc:0.975648\n", 428 | "[321]\ttrain-auc:0.979919\teval-auc:0.975681\n", 429 | "[322]\ttrain-auc:0.979969\teval-auc:0.975713\n", 430 | "[323]\ttrain-auc:0.979979\teval-auc:0.975718\n", 431 | "[324]\ttrain-auc:0.979999\teval-auc:0.975726\n", 432 | "[325]\ttrain-auc:0.980037\teval-auc:0.975734\n", 433 | "[326]\ttrain-auc:0.980058\teval-auc:0.975762\n", 434 | "[327]\ttrain-auc:0.980087\teval-auc:0.975757\n", 435 | "[328]\ttrain-auc:0.980087\teval-auc:0.975754\n", 436 | "[329]\ttrain-auc:0.980115\teval-auc:0.975785\n", 437 | "[330]\ttrain-auc:0.98012\teval-auc:0.975779\n", 438 | "[331]\ttrain-auc:0.980128\teval-auc:0.975783\n", 439 | "[332]\ttrain-auc:0.980142\teval-auc:0.975803\n", 440 | "[333]\ttrain-auc:0.980179\teval-auc:0.975823\n", 441 | "[334]\ttrain-auc:0.980199\teval-auc:0.975838\n", 442 | "[335]\ttrain-auc:0.98021\teval-auc:0.975844\n", 443 | "[336]\ttrain-auc:0.980235\teval-auc:0.975879\n", 444 | "[337]\ttrain-auc:0.980258\teval-auc:0.975896\n", 445 | "[338]\ttrain-auc:0.980307\teval-auc:0.975942\n", 446 | "[339]\ttrain-auc:0.980337\teval-auc:0.975959\n", 447 | "[340]\ttrain-auc:0.980371\teval-auc:0.975987\n", 448 | "[341]\ttrain-auc:0.980387\teval-auc:0.975978\n", 449 | "[342]\ttrain-auc:0.980406\teval-auc:0.975988\n", 450 | "[343]\ttrain-auc:0.980431\teval-auc:0.975996\n", 451 | 
"[344]\ttrain-auc:0.980451\teval-auc:0.975996\n", 452 | "[345]\ttrain-auc:0.980491\teval-auc:0.97601\n", 453 | "[346]\ttrain-auc:0.980515\teval-auc:0.976017\n", 454 | "[347]\ttrain-auc:0.980551\teval-auc:0.97605\n", 455 | "[348]\ttrain-auc:0.980552\teval-auc:0.976053\n", 456 | "[349]\ttrain-auc:0.980599\teval-auc:0.976069\n", 457 | "[350]\ttrain-auc:0.98063\teval-auc:0.976085\n", 458 | "[351]\ttrain-auc:0.98066\teval-auc:0.97609\n", 459 | "[352]\ttrain-auc:0.980691\teval-auc:0.976109\n", 460 | "[353]\ttrain-auc:0.980705\teval-auc:0.976119\n", 461 | "[354]\ttrain-auc:0.980706\teval-auc:0.976126\n", 462 | "[355]\ttrain-auc:0.980733\teval-auc:0.976135\n", 463 | "[356]\ttrain-auc:0.98078\teval-auc:0.976186\n", 464 | "[357]\ttrain-auc:0.980799\teval-auc:0.976203\n", 465 | "[358]\ttrain-auc:0.980817\teval-auc:0.976207\n", 466 | "[359]\ttrain-auc:0.980826\teval-auc:0.976207\n", 467 | "[360]\ttrain-auc:0.980842\teval-auc:0.976208\n", 468 | "[361]\ttrain-auc:0.980859\teval-auc:0.976204\n", 469 | "[362]\ttrain-auc:0.980898\teval-auc:0.976226\n", 470 | "[363]\ttrain-auc:0.980906\teval-auc:0.976229\n", 471 | "[364]\ttrain-auc:0.980925\teval-auc:0.976238\n", 472 | "[365]\ttrain-auc:0.980974\teval-auc:0.976281\n", 473 | "[366]\ttrain-auc:0.980999\teval-auc:0.976293\n", 474 | "[367]\ttrain-auc:0.981008\teval-auc:0.976284\n", 475 | "[368]\ttrain-auc:0.981034\teval-auc:0.976293\n", 476 | "[369]\ttrain-auc:0.981042\teval-auc:0.976297\n", 477 | "[370]\ttrain-auc:0.981069\teval-auc:0.976308\n", 478 | "[371]\ttrain-auc:0.981096\teval-auc:0.97632\n", 479 | "[372]\ttrain-auc:0.981112\teval-auc:0.976322\n", 480 | "[373]\ttrain-auc:0.981122\teval-auc:0.976323\n", 481 | "[374]\ttrain-auc:0.981138\teval-auc:0.976339\n", 482 | "[375]\ttrain-auc:0.981142\teval-auc:0.976338\n", 483 | "[376]\ttrain-auc:0.981171\teval-auc:0.976343\n", 484 | "[377]\ttrain-auc:0.981205\teval-auc:0.976368\n", 485 | "[378]\ttrain-auc:0.981233\teval-auc:0.97637\n", 486 | 
"[379]\ttrain-auc:0.981247\teval-auc:0.97637\n", 487 | "[380]\ttrain-auc:0.981276\teval-auc:0.976381\n", 488 | "[381]\ttrain-auc:0.981315\teval-auc:0.976407\n", 489 | "[382]\ttrain-auc:0.981344\teval-auc:0.976415\n", 490 | "[383]\ttrain-auc:0.981352\teval-auc:0.976412\n", 491 | "[384]\ttrain-auc:0.981382\teval-auc:0.976427\n", 492 | "[385]\ttrain-auc:0.981418\teval-auc:0.976445\n", 493 | "[386]\ttrain-auc:0.981449\teval-auc:0.97645\n", 494 | "[387]\ttrain-auc:0.981475\teval-auc:0.97647\n", 495 | "[388]\ttrain-auc:0.981508\teval-auc:0.976505\n", 496 | "[389]\ttrain-auc:0.981524\teval-auc:0.976507\n", 497 | "[390]\ttrain-auc:0.981539\teval-auc:0.97651\n", 498 | "[391]\ttrain-auc:0.981552\teval-auc:0.976512\n", 499 | "[392]\ttrain-auc:0.981565\teval-auc:0.976526\n", 500 | "[393]\ttrain-auc:0.981586\teval-auc:0.976536\n", 501 | "[394]\ttrain-auc:0.981602\teval-auc:0.976557\n", 502 | "[395]\ttrain-auc:0.98164\teval-auc:0.976583\n", 503 | "[396]\ttrain-auc:0.981662\teval-auc:0.976594\n", 504 | "[397]\ttrain-auc:0.981681\teval-auc:0.976605\n", 505 | "[398]\ttrain-auc:0.981717\teval-auc:0.976614\n", 506 | "[399]\ttrain-auc:0.981752\teval-auc:0.97664\n", 507 | "[400]\ttrain-auc:0.981776\teval-auc:0.976648\n", 508 | "[401]\ttrain-auc:0.981799\teval-auc:0.976666\n", 509 | "[402]\ttrain-auc:0.981806\teval-auc:0.976667\n", 510 | "[403]\ttrain-auc:0.981809\teval-auc:0.976668\n", 511 | "[404]\ttrain-auc:0.981842\teval-auc:0.976694\n", 512 | "[405]\ttrain-auc:0.981865\teval-auc:0.976711\n", 513 | "[406]\ttrain-auc:0.981898\teval-auc:0.976723\n", 514 | "[407]\ttrain-auc:0.981931\teval-auc:0.976751\n", 515 | "[408]\ttrain-auc:0.981964\teval-auc:0.976777\n", 516 | "[409]\ttrain-auc:0.981986\teval-auc:0.976768\n", 517 | "[410]\ttrain-auc:0.982\teval-auc:0.976758\n", 518 | "[411]\ttrain-auc:0.982013\teval-auc:0.976757\n", 519 | "[412]\ttrain-auc:0.982024\teval-auc:0.976756\n", 520 | "[413]\ttrain-auc:0.982038\teval-auc:0.976749\n", 521 | 
"[414]\ttrain-auc:0.982042\teval-auc:0.976752\n", 522 | "[415]\ttrain-auc:0.982062\teval-auc:0.976763\n", 523 | "[416]\ttrain-auc:0.982069\teval-auc:0.976755\n", 524 | "[417]\ttrain-auc:0.982076\teval-auc:0.976751\n", 525 | "[418]\ttrain-auc:0.982087\teval-auc:0.976753\n", 526 | "Stopping. Best iteration:\n", 527 | "[408]\ttrain-auc:0.981964\teval-auc:0.976777\n", 528 | "\n" 529 | ] 530 | } 531 | ], 532 | "source": [ 533 | "def create_feature_map(features):\n", 534 | " outfile = open(r'xgb.fmap', 'w')\n", 535 | " i = 0\n", 536 | " for feat in features:\n", 537 | " outfile.write('{0}\\t{1}\\tq\\n'.format(i, feat))\n", 538 | " i = i + 1\n", 539 | " outfile.close()\n", 540 | "\n", 541 | "def xgb_model(train_set):\n", 542 | " actions = pd.read_csv(train_set) #read train_set\n", 543 | " # simply drop the features the previous training round judged useless (absent from the feature-importance list)\n", 544 | " lst_useless = ['brand']\n", 545 | " \n", 546 | " actions.drop(lst_useless, inplace=True, axis=1)\n", 547 | " \n", 548 | " users = actions[['user_id', 'sku_id']].copy()\n", 549 | " labels = actions['label'].copy()\n", 550 | " del actions['user_id']\n", 551 | " del actions['sku_id']\n", 552 | " del actions['label']\n", 553 | " # tried scale_pos_weight to correct the positive/negative class imbalance, but with the sampled pos:neg ratio of 1:10 training scored worse than leaving it at 1\n", 554 | "# ratio = float(np.sum(labels==0)) / np.sum(labels==1)\n", 555 | "# print ratio\n", 556 | " \n", 557 | " # write to feature map\n", 558 | " features = list(actions.columns[:])\n", 559 | " print 'total features: ', len(features)\n", 560 | " create_feature_map(features)\n", 561 | " # pass feature names in at training time\n", 562 | "# features = list(actions.columns.values)\n", 563 | " \n", 564 | " user_index=users\n", 565 | " training_data=actions\n", 566 | " label=labels\n", 567 | " X_train, X_valid, y_train, y_valid = train_test_split(training_data.values, label.values, test_size=0.2, \n", 568 | " random_state=0)\n", 569 | " \n", 570 | " # tried pre-assigned per-sample weights for positive/negative examples to mitigate the class imbalance\n", 571 | "# weights = np.zeros(len(y_train))\n", 572 | "# weights[y_train==0] = 1\n", 573 | "# 
weights[y_train==1] = 10\n", 574 | " \n", 575 | "# dtrain = xgb.DMatrix(X_train, label=y_train, weight=weights)\n", 576 | " dtrain = xgb.DMatrix(X_train, label=y_train)\n", 577 | " dvalid = xgb.DMatrix(X_valid, label=y_valid)\n", 578 | "# dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)\n", 579 | "# dvalid = xgb.DMatrix(X_valid, label=y_valid, feature_names=features)\n", 580 | "# dtrain = xgb.DMatrix(training_data.values, label.values)\n", 581 | " param = {'n_estimators': 4000, 'max_depth': 3, 'min_child_weight': 5, 'gamma': 0, 'subsample': 1.0, \n", 582 | " 'colsample_bytree': 0.8, 'scale_pos_weight':10, 'eta': 0.1, 'silent': 1, 'objective': 'binary:logistic',\n", 583 | " 'eval_metric':'auc'}\n", 584 | "# param = {'n_estimators': 4000, 'max_depth': 6, 'seed': 7, 'min_child_weight': 5, 'gamma': 0, 'subsample': 1.0, \n", 585 | "# 'colsample_bytree': 0.8, 'scale_pos_weight': 1, 'eta': 0.09, 'silent': 1, 'objective': 'binary:logistic',\n", 586 | "# 'eval_metric':'auc'}\n", 587 | " \n", 588 | " num_round = param['n_estimators']\n", 589 | "# param['nthread'] = 4\n", 590 | " # param['eval_metric'] = \"auc\"\n", 591 | " plst = param.items()\n", 592 | " evallist = [(dtrain, 'train'), (dvalid, 'eval')]\n", 593 | " # with dtrain watched last, early stopping never triggers\n", 594 | "# evallist = [(dvalid, 'eval'), (dtrain, 'train')]\n", 595 | "# evallist = [(dtrain, 'train')]\n", 596 | " bst = xgb.train(plst, dtrain, num_round, evallist, early_stopping_rounds=10)\n", 597 | " bst.save_model('bst.model')\n", 598 | " return bst, features\n", 599 | "\n", 600 | "bst_xgb, features = xgb_model('train_set.csv')" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": 211, 606 | "metadata": { 607 | "ExecuteTime": { 608 | "end_time": "2017-05-25T15:17:20.955471Z", 609 | "start_time": "2017-05-25T15:17:20.952599Z" 610 | }, 611 | "collapsed": false 612 | }, 613 | "outputs": [ 614 | { 615 | "name": "stdout", 616 | "output_type": "stream", 617 | "text": [ 618 | "{'best_iteration': '408', 
'best_msg': '[408]\\ttrain-auc:0.981964\\teval-auc:0.976777', 'best_score': '0.976777'}\n" 619 | ] 620 | } 621 | ], 622 | "source": [ 623 | "print bst_xgb.attributes()" 624 | ] 625 | }, 626 | { 627 | "cell_type": "markdown", 628 | "metadata": {}, 629 | "source": [ 630 | "### Offline evaluation on the validation set" 631 | ] 632 | }, 633 | { 634 | "cell_type": "code", 635 | "execution_count": 212, 636 | "metadata": { 637 | "ExecuteTime": { 638 | "end_time": "2017-05-25T15:17:21.018473Z", 639 | "start_time": "2017-05-25T15:17:20.957406Z" 640 | }, 641 | "collapsed": false 642 | }, 643 | "outputs": [], 644 | "source": [ 645 | "def report(pred, label):\n", 646 | "\n", 647 | " actions = label\n", 648 | " result = pred\n", 649 | "\n", 650 | " # all actual (bought) user-item pairs\n", 651 | " all_user_item_pair = actions['user_id'].map(str) + '-' + actions['sku_id'].map(str)\n", 652 | " all_user_item_pair = np.array(all_user_item_pair)\n", 653 | " # all users who actually bought\n", 654 | " all_user_set = actions['user_id'].unique()\n", 655 | "\n", 656 | " # all users predicted to buy\n", 657 | " all_user_test_set = result['user_id'].unique()\n", 658 | " all_user_test_item_pair = result['user_id'].map(str) + '-' + result['sku_id'].map(str)\n", 659 | " all_user_test_item_pair = np.array(all_user_test_item_pair)\n", 660 | "\n", 661 | " # compute the user-level purchase metrics\n", 662 | " pos, neg = 0,0\n", 663 | " for user_id in all_user_test_set:\n", 664 | " if user_id in all_user_set:\n", 665 | " pos += 1\n", 666 | " else:\n", 667 | " neg += 1\n", 668 | " all_user_acc = 1.0 * pos / ( pos + neg)\n", 669 | " all_user_recall = 1.0 * pos / len(all_user_set)\n", 670 | " print 'Precision of predicted buyers: ' + str(all_user_acc)\n", 671 | " print 'Recall of predicted buyers: ' + str(all_user_recall)\n", 672 | "\n", 673 | " pos, neg = 0, 0\n", 674 | " for user_item_pair in all_user_test_item_pair:\n", 675 | " if user_item_pair in all_user_item_pair:\n", 676 | " pos += 1\n", 677 | " else:\n", 678 | " neg += 1\n", 679 | " all_item_acc = 1.0 * pos / ( pos + neg)\n", 680 | " all_item_recall = 1.0 * pos / 
len(all_user_item_pair)\n", 681 | " print 'Precision of predicted user-item pairs: ' + str(all_item_acc)\n", 682 | " print 'Recall of predicted user-item pairs: ' + str(all_item_recall)\n", 683 | " F11 = 6.0 * all_user_recall * all_user_acc / (5.0 * all_user_recall + all_user_acc)\n", 684 | " F12 = 5.0 * all_item_acc * all_item_recall / (2.0 * all_item_recall + 3 * all_item_acc)\n", 685 | " score = 0.4 * F11 + 0.6 * F12\n", 686 | " print 'F11=' + str(F11)\n", 687 | " print 'F12=' + str(F12)\n", 688 | " print 'score=' + str(score)\n", 689 | " \n", 690 | " return all_user_acc, all_user_recall, F11, all_item_acc, all_item_recall, F12, score" 691 | ] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "execution_count": 213, 696 | "metadata": { 697 | "ExecuteTime": { 698 | "end_time": "2017-05-25T15:17:21.095577Z", 699 | "start_time": "2017-05-25T15:17:21.020882Z" 700 | }, 701 | "collapsed": false 702 | }, 703 | "outputs": [], 704 | "source": [ 705 | "def validate(valid_set, val_label, model):\n", 706 | " actions = pd.read_csv(valid_set) #read test_set \n", 707 | "# users = actions[['user_id', 'sku_id']].copy()\n", 708 | " # keep cate so that non-cate-8 items can still be filtered out at the end \n", 709 | " users = actions[['user_id', 'sku_id', 'cate']].copy()\n", 710 | " \n", 711 | " actions['user_id'] = actions['user_id'].astype(np.int64)\n", 712 | "# test_label= actions[actions['label'] == 1]\n", 713 | "\n", 714 | "# test_label= actions[(actions['label']==1) & (actions['cate']==8)]\n", 715 | " test_label = pd.read_csv(val_label)\n", 716 | " \n", 717 | " lst_useless = ['brand']\n", 718 | " \n", 719 | " actions.drop(lst_useless, inplace=True, axis=1)\n", 720 | " \n", 721 | "# test_label = test_label[['user_id','sku_id','label']]\n", 722 | " del actions['user_id']\n", 723 | " del actions['sku_id']\n", 724 | " \n", 725 | "# features = list(actions.columns.values)\n", 726 | " \n", 727 | "# del actions['label']\n", 728 | " sub_user_index = users\n", 729 | "# sub_trainning_data = xgb.DMatrix(actions.values, feature_names=features)\n", 730 | " sub_trainning_data = 
xgb.DMatrix(actions.values)\n", 731 | "# y = model.predict(sub_trainning_data,ntree_limit=model.best_iteration)\n", 732 | " y = model.predict(sub_trainning_data, ntree_limit=model.best_ntree_limit)\n", 733 | " sub_user_index['label'] = y\n", 734 | " \n", 735 | " sub_user_index.to_csv('result_' + valid_set, index=False)\n", 736 | " \n", 737 | "# sub_user_index = sub_user_index[sub_user_index['cate']==8]\n", 738 | "# del sub_user_index['cate']\n", 739 | " rank = 1000\n", 740 | " pred = sub_user_index.sort_values(by='label', ascending=False)[:rank]\n", 741 | "# pred = sub_user_index[sub_user_index['label'] >= 0.05]\n", 742 | " \n", 743 | " print 'No. of raw pred users: ', len(pred['user_id'].unique())\n", 744 | " pred = pred[pred['cate']==8]\n", 745 | " print 'No. of pred users bought cate 8: ', len(pred['user_id'].unique())\n", 746 | " \n", 747 | "# pred = pred[['user_id', 'sku_id']]\n", 748 | " pred = pred[['user_id', 'sku_id', 'label']]\n", 749 | " pred = pred.groupby('user_id').first().reset_index()\n", 750 | " pred['user_id'] = pred['user_id'].astype(int)\n", 751 | " pred['sku_id'] = pred['sku_id'].astype(int)\n", 752 | " \n", 753 | "# print 'No. 
of pred users after deduplicates: ', len(pred['user_id'].unique())\n", 754 | " true_user = len(test_label['user_id']) \n", 755 | " pred_ui = len(pred['user_id'].unique())\n", 756 | " print 'pred item: ', len(pred['sku_id'].unique())\n", 757 | " print 'true users: ', true_user\n", 758 | " print 'pred users: ', pred_ui\n", 759 | " test_label['user_id'] = test_label['user_id'].astype(int)\n", 760 | " test_label['sku_id'] = test_label['sku_id'].astype(int)\n", 761 | "\n", 762 | " all_user_acc, all_user_recall, F11, all_item_acc, all_item_recall, F12, score = report(pred, test_label) \n", 763 | " \n", 764 | " f_name = 'pred_' + str(rank) + '_' + valid_set\n", 765 | " pred.to_csv(f_name, index=False)\n", 766 | " \n", 767 | " return rank, true_user, pred_ui, all_user_acc, all_user_recall, F11, all_item_acc, all_item_recall, F12, score\n", 768 | "\n", 769 | "# validate('val_set.csv', bst_xgb)" 770 | ] 771 | }, 772 | { 773 | "cell_type": "code", 774 | "execution_count": 214, 775 | "metadata": { 776 | "ExecuteTime": { 777 | "end_time": "2017-05-25T15:18:02.727408Z", 778 | "start_time": "2017-05-25T15:17:21.098027Z" 779 | }, 780 | "collapsed": false, 781 | "scrolled": true 782 | }, 783 | "outputs": [ 784 | { 785 | "name": "stdout", 786 | "output_type": "stream", 787 | "text": [ 788 | "No. of raw pred users: 950\n", 789 | "No. of pred users bought cate 8: 950\n", 790 | "pred item: 220\n", 791 | "true users: 1211\n", 792 | "pred users: 950\n", 793 | "Precision of predicted buyers: 0.147368421053\n", 794 | "Recall of predicted buyers: 0.116569525396\n", 795 | "Precision of predicted user-item pairs: 0.108421052632\n", 796 | "Recall of predicted user-item pairs: 0.0850536746491\n", 797 | "F11=0.141152747437\n", 798 | "F12=0.0930778962588\n", 799 | "score=0.11230783673\n", 800 | "-------------------------------------------\n", 801 | "No. of raw pred users: 950\n", 802 | "No. 
of pred users bought cate 8: 950\n", 803 | "pred item: 203\n", 804 | "true users: 1259\n", 805 | "pred users: 950\n", 806 | "Precision of predicted buyers: 0.163157894737\n", 807 | "Recall of predicted buyers: 0.12360446571\n", 808 | "Precision of predicted user-item pairs: 0.116842105263\n", 809 | "Recall of predicted user-item pairs: 0.0881652104845\n", 810 | "F11=0.15489673551\n", 811 | "F12=0.0977629029417\n", 812 | "score=0.120616435969\n", 813 | "-------------------------------------------\n", 814 | "No. of raw pred users: 960\n", 815 | "No. of pred users bought cate 8: 960\n", 816 | "pred item: 219\n", 817 | "true users: 1385\n", 818 | "pred users: 960\n", 819 | "Precision of predicted buyers: 0.161458333333\n", 820 | "Recall of predicted buyers: 0.11231884058\n", 821 | "Precision of predicted user-item pairs: 0.120833333333\n", 822 | "Recall of predicted user-item pairs: 0.0837545126354\n", 823 | "F11=0.150485436893\n", 824 | "F12=0.0954732510288\n", 825 | "score=0.117478125375\n", 826 | "===========================================\n", 827 | "avg user acc: 0.157328216374\n", 828 | "avg user recall: 0.117497610562\n", 829 | "avg item acc: 0.115365497076\n", 830 | "avg item recall: 0.0856577992563\n", 831 | "avg F11: 0.14884497328\n", 832 | "avg F12: 0.0954380167431\n", 833 | "avg score: 0.116800799358\n" 834 | ] 835 | } 836 | ], 837 | "source": [ 838 | "def avg_score():\n", 839 | " rank1, true_user1, pred_ui1, user_acc1, user_recall1, F11_1, item_acc1, item_recall1, F12_1, score1 = validate('val_1.csv', 'label_val_1.csv', bst_xgb)\n", 840 | " print '-------------------------------------------'\n", 841 | " rank2, true_user2, pred_ui2, user_acc2, user_recall2, F11_2, item_acc2, item_recall2, F12_2, score2 = validate('val_2.csv', 'label_val_2.csv', bst_xgb)\n", 842 | " print '-------------------------------------------'\n", 843 | " rank3, true_user3, pred_ui3, user_acc3, user_recall3, F11_3, item_acc3, item_recall3, F12_3, score3 = validate('val_3.csv', 'label_val_3.csv', bst_xgb)\n", 844 | " print '==========================================='\n", 845 | " print 'avg user acc: ', 
(user_acc1+user_acc2+user_acc3)/3\n", 846 | " print 'avg user recall: ', (user_recall1+user_recall2+user_recall3)/3\n", 847 | " print 'avg item acc: ', (item_acc1+item_acc2+item_acc3)/3\n", 848 | " print 'avg item recall: ', (item_recall1+item_recall2+item_recall3)/3\n", 849 | " print 'avg F11: ', (F11_1+F11_2+F11_3)/3\n", 850 | " print 'avg F12: ', (F12_1+F12_2+F12_3)/3\n", 851 | " print 'avg score: ', (score1+score2+score3)/3\n", 852 | " # make the csv file\n", 853 | " dct_score = {}\n", 854 | " dct_score['rank'] = [rank1, rank2, rank3]\n", 855 | " dct_score['true_user'] = [true_user1, true_user2, true_user3]\n", 856 | " dct_score['pred_ui'] = [pred_ui1, pred_ui2, pred_ui3]\n", 857 | " dct_score['user_acc'] = [user_acc1, user_acc2, user_acc3]\n", 858 | " dct_score['user_recall'] = [user_recall1, user_recall2, user_recall3]\n", 859 | " dct_score['F11'] = [F11_1, F11_2, F11_3]\n", 860 | " dct_score['item_acc'] = [item_acc1, item_acc2, item_acc3]\n", 861 | " dct_score['item_recall'] = [item_recall1, item_recall2, item_recall3]\n", 862 | " dct_score['F12'] = [F12_1, F12_2, F12_3]\n", 863 | " dct_score['score'] = [score1, score2, score3]\n", 864 | " column_order = ['rank', 'true_user', 'pred_ui', 'user_acc', 'user_recall', 'item_acc', 'item_recall', 'F11', 'F12', \n", 865 | " 'score']\n", 866 | " df_score = pd.DataFrame(dct_score)\n", 867 | " file_name = 'score_' + str(datetime.now().date())[5:] +'_'+ str(rank1) + '.csv'\n", 868 | " df_score[column_order].to_csv(file_name, index=False)\n", 869 | "avg_score()" 870 | ] 871 | }, 872 | { 873 | "cell_type": "markdown", 874 | "metadata": {}, 875 | "source": [ 876 | "### Output feature importance" 877 | ] 878 | }, 879 | { 880 | "cell_type": "code", 881 | "execution_count": 27, 882 | "metadata": { 883 | "ExecuteTime": { 884 | "end_time": "2017-05-24T19:56:47.018369Z", 885 | "start_time": "2017-05-24T19:56:47.003330Z" 886 | }, 887 | "collapsed": false 888 | }, 889 | "outputs": [], 890 | "source": [ 891 | "def feature_importance(bst_xgb):\n", 
892 | " importance = bst_xgb.get_fscore(fmap=r'xgb.fmap')\n", 893 | " importance = sorted(importance.items(), key=operator.itemgetter(1), reverse=True)\n", 894 | "\n", 895 | " df = pd.DataFrame(importance, columns=['feature', 'fscore'])\n", 896 | " df['fscore'] = df['fscore'] / df['fscore'].sum()\n", 897 | " file_name = 'feature_importance_' + str(datetime.now().date())[5:] + '.csv'\n", 898 | " df.to_csv(file_name)\n", 899 | "\n", 900 | "feature_importance(bst_xgb)" 901 | ] 902 | }, 903 | { 904 | "cell_type": "markdown", 905 | "metadata": {}, 906 | "source": [ 907 | "### Generate the submission file" 908 | ] 909 | }, 910 | { 911 | "cell_type": "code", 912 | "execution_count": 185, 913 | "metadata": { 914 | "ExecuteTime": { 915 | "end_time": "2017-05-25T15:00:26.300556Z", 916 | "start_time": "2017-05-25T15:00:08.688328Z" 917 | }, 918 | "collapsed": false 919 | }, 920 | "outputs": [ 921 | { 922 | "name": "stdout", 923 | "output_type": "stream", 924 | "text": [ 925 | "No. of raw pred users: 1142\n", 926 | "No. of pred users bought cate 8: 1142\n" 927 | ] 928 | } 929 | ], 930 | "source": [ 931 | "# sub_file\n", 932 | "\n", 933 | "def submit(pred_set, model):\n", 934 | " actions = pd.read_csv(pred_set) #read test_set\n", 935 | " \n", 936 | "# print 'total user before: ', len(actions['user_id'].unique())\n", 937 | "# potential = pd.read_csv('potential_user_04-28.csv')\n", 938 | "# lst_user = potential['user_id'].unique().tolist()\n", 939 | "# actions = actions[actions['user_id'].isin(lst_user)]\n", 940 | "# print 'total user after: ', len(actions['user_id'].unique())\n", 941 | " # drop some features up front\n", 942 | " lst_useless = ['brand']\n", 943 | " \n", 944 | " actions.drop(lst_useless, inplace=True, axis=1)\n", 945 | "\n", 946 | " \n", 947 | " users = actions[['user_id', 'sku_id', 'cate']].copy()\n", 948 | "# users = actions[['user_id', 'sku_id']].copy()\n", 949 | " \n", 950 | " actions['user_id'] = actions['user_id'].astype(np.int64)\n", 951 | " del actions['user_id']\n", 952 | " del 
actions['sku_id']\n", 953 | " sub_user_index = users\n", 954 | " sub_trainning_data = xgb.DMatrix(actions.values)\n", 955 | " y = model.predict(sub_trainning_data, ntree_limit=model.best_ntree_limit)\n", 956 | " sub_user_index['label'] = y\n", 957 | " \n", 958 | "# sub_user_index = sub_user_index[sub_user_index['cate']==8]\n", 959 | "# del sub_user_index['cate']\n", 960 | " rank = 1200\n", 961 | " pred = sub_user_index.sort_values(by='label', ascending=False)[:rank]\n", 962 | "# pred = sub_user_index[sub_user_index['label'] >= 0.05]\n", 963 | "# pred = pred[['user_id', 'sku_id', 'label']]\n", 964 | "# pred = pred[pred['label']>0.45]\n", 965 | "\n", 966 | " print 'No. of raw pred users: ', len(pred['user_id'].unique())\n", 967 | " pred = pred[pred['cate']==8]\n", 968 | " print 'No. of pred users bought cate 8: ', len(pred['user_id'].unique())\n", 969 | "\n", 970 | " pred = pred[['user_id', 'sku_id']]\n", 971 | "# print \n", 972 | " pred = pred.groupby('user_id').first().reset_index()\n", 973 | " pred['user_id'] = pred['user_id'].astype(int)\n", 974 | " pred['sku_id'] = pred['sku_id'].astype(int)\n", 975 | " sub_file = 'submission_' + str(rank) + '_' + str(datetime.now().date())[5:] + '.csv'\n", 976 | "# sub_file = 'submission_detail_' + str(datetime.now().date())[5:] + '.csv'\n", 977 | " pred.to_csv(sub_file, index=False, index_label=False) \n", 978 | "\n", 979 | "submit('test_set.csv', bst_xgb)" 980 | ] 981 | }, 982 | { 983 | "cell_type": "markdown", 984 | "metadata": {}, 985 | "source": [ 986 | "### Deduplicate the submission\n", 987 | "Remove user-item pairs that already had a purchase action in the last few days (the filter covers 2016-04-09 through 2016-04-15)" 988 | ] 989 | }, 990 | { 991 | "cell_type": "code", 992 | "execution_count": 186, 993 | "metadata": { 994 | "ExecuteTime": { 995 | "end_time": "2017-05-25T15:01:09.432698Z", 996 | "start_time": "2017-05-25T15:00:32.113207Z" 997 | }, 998 | "collapsed": false 999 | }, 1000 | "outputs": [ 1001 | { 1002 | "name": "stdout", 1003 | "output_type": "stream", 1004 | "text": [ 1005 | "No. of records after remove dup: 1142\n", 1006 | "No. 
of dup: 0\n" 1007 | ] 1008 | } 1009 | ], 1010 | "source": [ 1011 | "from datetime import datetime\n", 1012 | "\n", 1013 | "def sub_improv(action, sub_file):\n", 1014 | " # get the target user-item pairs from the last days of April\n", 1015 | " action_4 = pd.read_csv(action)\n", 1016 | " action_4['time'] = pd.to_datetime(action_4['time']).apply(lambda x: x.date())\n", 1017 | " aim_date = [datetime.strptime(s, '%Y-%m-%d').date() for s in ['2016-04-09', '2016-04-10', '2016-04-11' ,\n", 1018 | " '2016-04-12', '2016-04-13', '2016-04-14', \n", 1019 | " '2016-04-15']]\n", 1020 | " aim_action = action_4[(action_4['type']==4) & (action_4['cate']==8) & (action_4['time'].isin(aim_date))]\n", 1021 | " aim_ui = aim_action['user_id'].map(int).map(str) + '-' + aim_action['sku_id'].map(str)\n", 1022 | " # build user-item keys from the submission\n", 1023 | " sub = pd.read_csv(sub_file)\n", 1024 | " before = sub.shape[0]\n", 1025 | " sub_ui = sub['user_id'].map(str) + '-' + sub['sku_id'].map(str)\n", 1026 | " # intersection\n", 1027 | " lst_aim = aim_ui.unique().tolist()\n", 1028 | " lst_sub = sub_ui.unique().tolist()\n", 1029 | " lst_common = [i for i in lst_aim if i in lst_sub]\n", 1030 | " dct_ui = {i.split('-')[0]: i.split('-')[1] for i in lst_common}\n", 1031 | " # remove the intersection from the submission\n", 1032 | " for k in dct_ui:\n", 1033 | " sub.drop(sub[(sub['user_id']==int(k)) & (sub['sku_id']==int(dct_ui[k]))].index, inplace=True)\n", 1034 | " print 'No. of records after remove dup: ', sub.shape[0]\n", 1035 | " print 'No. 
of dup: ', before - sub.shape[0]\n", 1036 | " if (before - sub.shape[0])!=0:\n", 1037 | " file_name = 'submission_' + str(datetime.now().date())[5:] + '_improv.csv'\n", 1038 | " sub.to_csv(file_name, index=False, index_label=False)\n", 1039 | " \n", 1040 | "sub_improv('data/JData_Action_201604.csv', 'submission_1200_05-25.csv')" 1041 | ] 1042 | }, 1043 | { 1044 | "cell_type": "code", 1045 | "execution_count": null, 1046 | "metadata": { 1047 | "collapsed": true 1048 | }, 1049 | "outputs": [], 1050 | "source": [] 1051 | } 1052 | ], 1053 | "metadata": { 1054 | "kernelspec": { 1055 | "display_name": "Python 2", 1056 | "language": "python", 1057 | "name": "python2" 1058 | }, 1059 | "language_info": { 1060 | "codemirror_mode": { 1061 | "name": "ipython", 1062 | "version": 2 1063 | }, 1064 | "file_extension": ".py", 1065 | "mimetype": "text/x-python", 1066 | "name": "python", 1067 | "nbconvert_exporter": "python", 1068 | "pygments_lexer": "ipython2", 1069 | "version": "2.7.13" 1070 | } 1071 | }, 1072 | "nbformat": 4, 1073 | "nbformat_minor": 2 1074 | } 1075 | -------------------------------------------------------------------------------- /Feature_Engineering-v1.4.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Feature Engineering" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Version 1\n", 15 | "\n", 16 | "* get_action_feat(): keep the category attribute at the end; when concatenating different days, take care to merge on=['user_id', 'sku_id', 'cate']\n", 17 | "* get_accumulate_user_feat(): smooth the conversion-rate computation; fill NaNs before computing the standard deviation; smooth the recent features (what is the most reasonable way?); days_interval -> day\n", 18 | "* user_cate: computing the share of each behavior per cate involves division that yields NaNs; a log transform can be applied to these ratios\n", 19 | "* product_basic: keep this feature; if used, NaN filling needs care. After merging, fill NaNs with -1 so that a1, a2, a3 read as unknown, and check whether brand already uses that code (it does not). One catch: the original a1, a2, a3 are one-hot encoded, so instead of plain -1 we actually fill with 0 (conveniently brand has no such value either); also take care to merge on=['sku_id', 'cate']\n", 20 | "* 
get_accumulate_product_feat(): fix the conversion-rate computation; take care to fix the date-span issue in make_action; keep counts of all behavior types; fill NaNs before computing the standard deviation\n", 21 | "* get_accumulate_cate_feat(): fix the conversion-rate computation; fix the date-span issue; keep all counts; fill NaNs before computing the standard deviation\n", 22 | "* get_comment_product_date(): the comment one-hot encoding misses the zero-comment case (some time windows may not contain a zero-comment state, so this must be handled); fill NaNs after merging; converting the comment date is redundant, a plain equality test suffices; start_date is redundant; also check whether every item in the comment data comes from product (it does not)\n", 23 | "* For the final labeling, start by labeling the user-item pairs where the user bought a cate-8 item; this needs experimentation and further thought" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "## Setup\n", 31 | "* import the required libraries\n", 32 | "* set file paths, etc." 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 1, 38 | "metadata": { 39 | "ExecuteTime": { 40 | "end_time": "2017-05-16T21:16:17.345239Z", 41 | "start_time": "2017-05-16T21:16:15.334700Z" 42 | }, 43 | "collapsed": true 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "#!/usr/bin/env python\n", 48 | "# -*- coding: UTF-8 -*-\n", 49 | "import time\n", 50 | "from datetime import datetime\n", 51 | "from datetime import timedelta\n", 52 | "import pandas as pd\n", 53 | "import pickle\n", 54 | "import os\n", 55 | "import math\n", 56 | "import numpy as np" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 2, 62 | "metadata": { 63 | "ExecuteTime": { 64 | "end_time": "2017-05-16T21:16:17.352419Z", 65 | "start_time": "2017-05-16T21:16:17.346954Z" 66 | }, 67 | "collapsed": true 68 | }, 69 | "outputs": [], 70 | "source": [ 71 | "# action_1_path = r'data/JData_Action_201602.csv'\n", 72 | "# action_2_path = r'data/JData_Action_201603.csv'\n", 73 | "# action_3_path = r'data/JData_Action_201604.csv'\n", 74 | "action_1_path = r'data/actions1.csv'\n", 75 | "action_2_path = r'data/actions2.csv'\n", 76 | "action_3_path = r'data/actions3.csv'\n", 77 | "comment_path = r'data/JData_Comment.csv'\n", 78 | "product_path = r'data/JData_Product.csv'\n", 79 | "# user_path = r'data/JData_User.csv'\n", 80 | "user_path = r'data/user.csv'\n", 81 | "\n", 82 | "comment_date = [\n", 83 | " \"2016-02-01\", \"2016-02-08\", \"2016-02-15\", 
\"2016-02-22\", \"2016-02-29\",\n", 84 | " \"2016-03-07\", \"2016-03-14\", \"2016-03-21\", \"2016-03-28\", \"2016-04-04\",\n", 85 | " \"2016-04-11\", \"2016-04-15\"\n", 86 | "]" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 3, 92 | "metadata": { 93 | "ExecuteTime": { 94 | "end_time": "2017-05-16T21:16:17.463427Z", 95 | "start_time": "2017-05-16T21:16:17.353703Z" 96 | }, 97 | "collapsed": true 98 | }, 99 | "outputs": [], 100 | "source": [ 101 | "def get_actions_1():\n", 102 | " action = pd.read_csv(action_1_path)\n", 103 | " return action\n", 104 | "\n", 105 | "def get_actions_2():\n", 106 | " action2 = pd.read_csv(action_2_path)\n", 107 | " return action2\n", 108 | "\n", 109 | "def get_actions_3():\n", 110 | " action3 = pd.read_csv(action_3_path)\n", 111 | " return action3\n", 112 | "# 读取并拼接所有行为记录文件\n", 113 | "def get_all_action():\n", 114 | " action_1 = get_actions_1()\n", 115 | " action_2 = get_actions_2()\n", 116 | " action_3 = get_actions_3()\n", 117 | " actions = pd.concat([action_1, action_2, action_3]) # type: pd.DataFrame\n", 118 | "# actions = pd.read_csv(action_path)\n", 119 | " return actions\n", 120 | "\n", 121 | "# 获取某个时间段的行为记录\n", 122 | "def get_actions(start_date, end_date, all_actions):\n", 123 | " \"\"\"\n", 124 | " :param start_date:\n", 125 | " :param end_date:\n", 126 | " :return: actions: pd.Dataframe\n", 127 | " \"\"\"\n", 128 | " actions = all_actions[(all_actions.time >= start_date) & (all_actions.time < end_date)].copy()\n", 129 | " return actions" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "## 用户特征\n", 137 | "### 用户基本特征\n", 138 | "获取基本的用户特征,基于用户本身属性多为类别特征的特点,对age,sex,usr_lv_cd进行独热编码操作,对于用户注册时间暂时不处理" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 4, 144 | "metadata": { 145 | "ExecuteTime": { 146 | "end_time": "2017-05-16T21:16:18.813768Z", 147 | "start_time": "2017-05-16T21:16:17.467022Z" 148 | }, 149 | "collapsed": true 150 | 
}, 151 | "outputs": [], 152 | "source": [ 153 | "from sklearn import preprocessing\n", 154 | "\n", 155 | "def get_basic_user_feat():\n", 156 | " # Handle the Chinese characters in the age field: set the encoding when reading, fill nulls, convert age to numeric labels, then one-hot encode; sex is also cast to a numeric type\n", 157 | " user = pd.read_csv(user_path, encoding='gbk')\n", 158 | "# user['age'].fillna('-1', inplace=True)\n", 159 | "# user['sex'].fillna(2, inplace=True)\n", 160 | " user['sex'] = user['sex'].astype(int) \n", 161 | " user['age'] = user['age'].astype(unicode)\n", 162 | " le = preprocessing.LabelEncoder() \n", 163 | " age_df = le.fit_transform(user['age'])\n", 164 | "# print list(le.classes_)\n", 165 | "\n", 166 | " age_df = pd.get_dummies(age_df, prefix='age')\n", 167 | " sex_df = pd.get_dummies(user['sex'], prefix='sex')\n", 168 | " user_lv_df = pd.get_dummies(user['user_lv_cd'], prefix='user_lv_cd')\n", 169 | " user = pd.concat([user['user_id'], age_df, sex_df, user_lv_df], axis=1)\n", 170 | " return user" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "## Product features\n", 178 | "### Basic product features\n", 179 | "Extract basic features from the product file: one-hot encode attributes a1, a2, a3; use the category and brand directly as features" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 5, 185 | "metadata": { 186 | "ExecuteTime": { 187 | "end_time": "2017-05-16T21:16:18.826294Z", 188 | "start_time": "2017-05-16T21:16:18.818260Z" 189 | }, 190 | "collapsed": true 191 | }, 192 | "outputs": [], 193 | "source": [ 194 | "def get_basic_product_feat():\n", 195 | " product = pd.read_csv(product_path)\n", 196 | " attr1_df = pd.get_dummies(product[\"a1\"], prefix=\"a1\")\n", 197 | " attr2_df = pd.get_dummies(product[\"a2\"], prefix=\"a2\")\n", 198 | " attr3_df = pd.get_dummies(product[\"a3\"], prefix=\"a3\")\n", 199 | " product = pd.concat([product[['sku_id', 'cate', 'brand']], attr1_df, attr2_df, attr3_df], axis=1)\n", 200 | " return product" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "## Comment features\n", 208 | "\n", 209 | "* 
Split by time window\n", 210 | "* One-hot encode the comment count" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 6, 216 | "metadata": { 217 | "ExecuteTime": { 218 | "end_time": "2017-05-16T21:16:18.880657Z", 219 | "start_time": "2017-05-16T21:16:18.827667Z" 220 | }, 221 | "collapsed": true 222 | }, 223 | "outputs": [], 224 | "source": [ 225 | "def get_comments_product_feat(end_date):\n", 226 | " comments = pd.read_csv(comment_path)\n", 227 | " comment_date_end = end_date\n", 228 | " comment_date_begin = comment_date[0]\n", 229 | " for date in reversed(comment_date):\n", 230 | " if date < comment_date_end:\n", 231 | " comment_date_begin = date\n", 232 | " break\n", 233 | " comments = comments[comments.dt==comment_date_begin]\n", 234 | " df = pd.get_dummies(comments['comment_num'], prefix='comment_num')\n", 235 | " # Guard against windows that have no rows with comment_num == 0 (this occurred in the test set)\n", 236 | " for i in range(0, 5):\n", 237 | " if 'comment_num_' + str(i) not in df.columns:\n", 238 | " df['comment_num_' + str(i)] = 0\n", 239 | " df = df[['comment_num_0', 'comment_num_1', 'comment_num_2', 'comment_num_3', 'comment_num_4']]\n", 240 | " \n", 241 | " comments = pd.concat([comments, df], axis=1) # type: pd.DataFrame\n", 242 | " #del comments['dt']\n", 243 | " #del comments['comment_num']\n", 244 | " comments = comments[['sku_id', 'has_bad_comment', 'bad_comment_rate','comment_num_0', 'comment_num_1', \n", 245 | " 'comment_num_2', 'comment_num_3', 'comment_num_4']]\n", 246 | " return comments" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "## Action features\n", 254 | "\n", 255 | "* Split by time window\n", 256 | "* One-hot encode the action type\n", 257 | "* Aggregate by user-category and by user-category-sku, then compute\n", 258 | " * the user's action counts on other products in the same category\n", 259 | " * the difference between the user's action counts on the target product in that category and the per-day mean over the window" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 7, 265 | "metadata": { 266 | "ExecuteTime": { 267 | "end_time": "2017-05-16T21:16:18.950287Z", 268 | "start_time": "2017-05-16T21:16:18.882812Z" 269 | }, 270 | 
"collapsed": true 271 | }, 272 | "outputs": [], 273 | "source": [ 274 | "def get_action_feat(start_date, end_date, all_actions, i):\n", 275 | " actions = get_actions(start_date, end_date, all_actions)\n", 276 | " actions = actions[['user_id', 'sku_id', 'cate','type']]\n", 277 | " # Cumulative action counts over different windows (3, 5, 7, 10, 15, 21, 30 days)\n", 278 | " df = pd.get_dummies(actions['type'], prefix='action_before_%s' %i)\n", 279 | " before_date = 'action_before_%s' %i\n", 280 | " actions = pd.concat([actions, df], axis=1) # type: pd.DataFrame\n", 281 | " # Group by user-category-sku: each user's action counts on each product within a category\n", 282 | " actions = actions.groupby(['user_id', 'sku_id','cate'], as_index=False).sum()\n", 283 | " # Group by user-category: each user's action counts on each product category\n", 284 | " user_cate = actions.groupby(['user_id','cate'], as_index=False).sum()\n", 285 | " del user_cate['sku_id']\n", 286 | " del user_cate['type']\n", 287 | " actions = pd.merge(actions, user_cate, how='left', on=['user_id','cate'])\n", 288 | " # Action counts on other products in the same category\n", 289 | " # The two groupings share column names for the same action counts, so pandas suffixes them with _x and _y; the subtraction therefore yields the action counts on other products in the same category\n", 290 | " actions[before_date+'_1_y'] = actions[before_date+'_1_y'] - actions[before_date+'_1_x']\n", 291 | " actions[before_date+'_2_y'] = actions[before_date+'_2_y'] - actions[before_date+'_2_x']\n", 292 | " actions[before_date+'_3_y'] = actions[before_date+'_3_y'] - actions[before_date+'_3_x']\n", 293 | " actions[before_date+'_4_y'] = actions[before_date+'_4_y'] - actions[before_date+'_4_x']\n", 294 | " actions[before_date+'_5_y'] = actions[before_date+'_5_y'] - actions[before_date+'_5_x']\n", 295 | " actions[before_date+'_6_y'] = actions[before_date+'_6_y'] - actions[before_date+'_6_x']\n", 296 | " # Difference between the user's per-product counts in a category and the per-day mean of those counts over the window\n", 297 | " actions[before_date+'minus_mean_1'] = actions[before_date+'_1_x'] - (actions[before_date+'_1_x']/i)\n", 298 | " actions[before_date+'minus_mean_2'] = actions[before_date+'_2_x'] - (actions[before_date+'_2_x']/i)\n", 299 | " actions[before_date+'minus_mean_3'] = actions[before_date+'_3_x'] - 
(actions[before_date+'_3_x']/i)\n", 300 | " actions[before_date+'minus_mean_4'] = actions[before_date+'_4_x'] - (actions[before_date+'_4_x']/i)\n", 301 | " actions[before_date+'minus_mean_5'] = actions[before_date+'_5_x'] - (actions[before_date+'_5_x']/i)\n", 302 | " actions[before_date+'minus_mean_6'] = actions[before_date+'_6_x'] - (actions[before_date+'_6_x']/i)\n", 303 | " del actions['type']\n", 304 | " # Keep the cate feature\n", 305 | "# del actions['cate']\n", 306 | "\n", 307 | " return actions" 308 | ] 309 | }, 310 | { 311 | "cell_type": "markdown", 312 | "metadata": {}, 313 | "source": [ 314 | "### User-action\n", 315 | "#### Cumulative user features\n", 316 | "\n", 317 | "* Split by time window\n", 318 | "* For each user action type:\n", 319 | " * purchase conversion rate\n", 320 | " * mean\n", 321 | " * standard deviation" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 8, 327 | "metadata": { 328 | "ExecuteTime": { 329 | "end_time": "2017-05-16T21:16:19.036710Z", 330 | "start_time": "2017-05-16T21:16:18.952454Z" 331 | }, 332 | "collapsed": true 333 | }, 334 | "outputs": [], 335 | "source": [ 336 | "def get_accumulate_user_feat(end_date, all_actions, day):\n", 337 | " start_date = datetime.strptime(end_date, '%Y-%m-%d') - timedelta(days=day)\n", 338 | " start_date = start_date.strftime('%Y-%m-%d')\n", 339 | " before_date = 'user_action_%s' % day\n", 340 | "\n", 341 | " feature = [\n", 342 | " 'user_id', before_date + '_1', before_date + '_2', before_date + '_3',\n", 343 | " before_date + '_4', before_date + '_5', before_date + '_6',\n", 344 | " before_date + '_1_ratio', before_date + '_2_ratio',\n", 345 | " before_date + '_3_ratio', before_date + '_5_ratio',\n", 346 | " before_date + '_6_ratio', before_date + '_1_mean',\n", 347 | " before_date + '_2_mean', before_date + '_3_mean',\n", 348 | " before_date + '_4_mean', before_date + '_5_mean',\n", 349 | " before_date + '_6_mean', before_date + '_1_std',\n", 350 | " before_date + '_2_std', before_date + '_3_std', before_date + '_4_std',\n", 351 | " before_date + '_5_std', before_date 
+ '_6_std'\n", 352 | " ]\n", 353 | "\n", 354 | " actions = get_actions(start_date, end_date, all_actions)\n", 355 | " df = pd.get_dummies(actions['type'], prefix=before_date)\n", 356 | "\n", 357 | "# actions['date'] = pd.to_datetime(actions['time']).apply(lambda x: x.date())\n", 358 | "\n", 359 | " actions = pd.concat([actions[['user_id', 'date']], df], axis=1)\n", 360 | " # Group by user and date to compute the per-day standard deviation of each action\n", 361 | " actions_date = actions.groupby(['user_id', 'date']).sum()\n", 362 | " actions_date = actions_date.unstack()\n", 363 | " actions_date.fillna(0, inplace=True)\n", 364 | " action_1 = np.std(actions_date[before_date + '_1'], axis=1)\n", 365 | " action_1 = action_1.to_frame()\n", 366 | " action_1.columns = [before_date + '_1_std']\n", 367 | " action_2 = np.std(actions_date[before_date + '_2'], axis=1)\n", 368 | " action_2 = action_2.to_frame()\n", 369 | " action_2.columns = [before_date + '_2_std']\n", 370 | " action_3 = np.std(actions_date[before_date + '_3'], axis=1)\n", 371 | " action_3 = action_3.to_frame()\n", 372 | " action_3.columns = [before_date + '_3_std']\n", 373 | " action_4 = np.std(actions_date[before_date + '_4'], axis=1)\n", 374 | " action_4 = action_4.to_frame()\n", 375 | " action_4.columns = [before_date + '_4_std']\n", 376 | " action_5 = np.std(actions_date[before_date + '_5'], axis=1)\n", 377 | " action_5 = action_5.to_frame()\n", 378 | " action_5.columns = [before_date + '_5_std']\n", 379 | " action_6 = np.std(actions_date[before_date + '_6'], axis=1)\n", 380 | " action_6 = action_6.to_frame()\n", 381 | " action_6.columns = [before_date + '_6_std']\n", 382 | " actions_date = pd.concat(\n", 383 | " [action_1, action_2, action_3, action_4, action_5, action_6], axis=1)\n", 384 | " actions_date['user_id'] = actions_date.index\n", 385 | " # Group by user: conversion rate and mean of each action type\n", 386 | " actions = actions.groupby(['user_id'], as_index=False).sum()\n", 387 | "# days_interal = (datetime.strptime(end_date, '%Y-%m-%d') -\n", 388 | "# datetime.strptime(start_date, 
'%Y-%m-%d')).days\n", 389 | " # Conversion rate (log ratio)\n", 390 | "# actions[before_date + '_1_ratio'] = actions[before_date +\n", 391 | "# '_4'] / actions[before_date +\n", 392 | "# '_1']\n", 393 | "# actions[before_date + '_2_ratio'] = actions[before_date +\n", 394 | "# '_4'] / actions[before_date +\n", 395 | "# '_2']\n", 396 | "# actions[before_date + '_3_ratio'] = actions[before_date +\n", 397 | "# '_4'] / actions[before_date +\n", 398 | "# '_3']\n", 399 | "# actions[before_date + '_5_ratio'] = actions[before_date +\n", 400 | "# '_4'] / actions[before_date +\n", 401 | "# '_5']\n", 402 | "# actions[before_date + '_6_ratio'] = actions[before_date +\n", 403 | "# '_4'] / actions[before_date +\n", 404 | "# '_6']\n", 405 | " actions[before_date + '_1_ratio'] = np.log(1 + actions[before_date + '_4']) - np.log(1 + actions[before_date +'_1'])\n", 406 | " actions[before_date + '_2_ratio'] = np.log(1 + actions[before_date + '_4']) - np.log(1 + actions[before_date +'_2'])\n", 407 | " actions[before_date + '_3_ratio'] = np.log(1 + actions[before_date + '_4']) - np.log(1 + actions[before_date +'_3'])\n", 408 | " actions[before_date + '_5_ratio'] = np.log(1 + actions[before_date + '_4']) - np.log(1 + actions[before_date +'_5'])\n", 409 | " actions[before_date + '_6_ratio'] = np.log(1 + actions[before_date + '_4']) - np.log(1 + actions[before_date +'_6'])\n", 410 | " # Mean per day\n", 411 | " actions[before_date + '_1_mean'] = actions[before_date + '_1'] / day\n", 412 | " actions[before_date + '_2_mean'] = actions[before_date + '_2'] / day\n", 413 | " actions[before_date + '_3_mean'] = actions[before_date + '_3'] / day\n", 414 | " actions[before_date + '_4_mean'] = actions[before_date + '_4'] / day\n", 415 | " actions[before_date + '_5_mean'] = actions[before_date + '_5'] / day\n", 416 | " actions[before_date + '_6_mean'] = actions[before_date + '_6'] / day\n", 417 | " actions = pd.merge(actions, actions_date, how='left', on='user_id')\n", 418 | " actions = actions[feature]\n", 419 | " return actions" 420 
| ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": {}, 425 | "source": [ 426 | "#### Recent user-action features\n", 427 | "\n", 428 | "On top of the cumulative user features above, extract features over the last month and the last three days, then compute the share of the month's actions that fall outside the last three days" 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": 9, 434 | "metadata": { 435 | "ExecuteTime": { 436 | "end_time": "2017-05-16T21:16:19.108485Z", 437 | "start_time": "2017-05-16T21:16:19.037711Z" 438 | }, 439 | "collapsed": true 440 | }, 441 | "outputs": [], 442 | "source": [ 443 | "def get_recent_user_feat(end_date, all_actions):\n", 444 | " actions_3 = get_accumulate_user_feat(end_date, all_actions, 3)\n", 445 | " actions_30 = get_accumulate_user_feat(end_date, all_actions, 30)\n", 446 | " actions = pd.merge(actions_3, actions_30, how ='left', on='user_id')\n", 447 | " del actions_3\n", 448 | " del actions_30\n", 449 | " \n", 450 | " actions['recent_action1'] = np.log(1 + actions['user_action_30_1']-actions['user_action_3_1']) - np.log(1 + actions['user_action_30_1'])\n", 451 | " actions['recent_action2'] = np.log(1 + actions['user_action_30_2']-actions['user_action_3_2']) - np.log(1 + actions['user_action_30_2'])\n", 452 | " actions['recent_action3'] = np.log(1 + actions['user_action_30_3']-actions['user_action_3_3']) - np.log(1 + actions['user_action_30_3'])\n", 453 | " actions['recent_action4'] = np.log(1 + actions['user_action_30_4']-actions['user_action_3_4']) - np.log(1 + actions['user_action_30_4'])\n", 454 | " actions['recent_action5'] = np.log(1 + actions['user_action_30_5']-actions['user_action_3_5']) - np.log(1 + actions['user_action_30_5'])\n", 455 | " actions['recent_action6'] = np.log(1 + actions['user_action_30_6']-actions['user_action_3_6']) - np.log(1 + actions['user_action_30_6'])\n", 456 | " \n", 457 | "# actions['recent_action1'] = (actions['user_action_30_1']-actions['user_action_3_1'])/actions['user_action_30_1']\n", 458 | "# actions['recent_action2'] = 
(actions['user_action_30_2']-actions['user_action_3_2'])/actions['user_action_30_2']\n", 459 | "# actions['recent_action3'] = (actions['user_action_30_3']-actions['user_action_3_3'])/actions['user_action_30_3']\n", 460 | "# actions['recent_action4'] = (actions['user_action_30_4']-actions['user_action_3_4'])/actions['user_action_30_4']\n", 461 | "# actions['recent_action5'] = (actions['user_action_30_5']-actions['user_action_3_5'])/actions['user_action_30_5']\n", 462 | "# actions['recent_action6'] = (actions['user_action_30_6']-actions['user_action_3_6'])/actions['user_action_30_6']\n", 463 | " \n", 464 | " return actions" 465 | ] 466 | }, 467 | { 468 | "cell_type": "markdown", 469 | "metadata": {}, 470 | "source": [ 471 | "#### User actions across products in the same category\n", 472 | "* Counts of each user action per category\n", 473 | "* Each category's share of the user's actions over all categories" 474 | ] 475 | }, 476 | { 477 | "cell_type": "code", 478 | "execution_count": 10, 479 | "metadata": { 480 | "ExecuteTime": { 481 | "end_time": "2017-05-16T21:16:19.184116Z", 482 | "start_time": "2017-05-16T21:16:19.110701Z" 483 | }, 484 | "collapsed": true 485 | }, 486 | "outputs": [], 487 | "source": [ 488 | "# Adds user-category interaction features\n", 489 | "def get_user_cate_feature(start_date, end_date, all_actions):\n", 490 | " actions = get_actions(start_date, end_date, all_actions)\n", 491 | " actions = actions[['user_id', 'cate', 'type']]\n", 492 | " df = pd.get_dummies(actions['type'], prefix='type')\n", 493 | " actions = pd.concat([actions[['user_id', 'cate']], df], axis=1)\n", 494 | " actions = actions.groupby(['user_id', 'cate']).sum()\n", 495 | " actions = actions.unstack()\n", 496 | " actions.columns = actions.columns.swaplevel(0, 1)\n", 497 | " actions.columns = actions.columns.droplevel()\n", 498 | " actions.columns = [\n", 499 | " 'cate_4_type1', 'cate_5_type1', 'cate_6_type1', 'cate_7_type1',\n", 500 | " 'cate_8_type1', 'cate_9_type1', 'cate_10_type1', 'cate_11_type1',\n", 501 | " 'cate_4_type2', 'cate_5_type2', 'cate_6_type2', 'cate_7_type2',\n", 502 | 
'cate_8_type2', 'cate_9_type2', 'cate_10_type2', 'cate_11_type2',\n", 503 | " 'cate_4_type3', 'cate_5_type3', 'cate_6_type3', 'cate_7_type3',\n", 504 | " 'cate_8_type3', 'cate_9_type3', 'cate_10_type3', 'cate_11_type3',\n", 505 | " 'cate_4_type4', 'cate_5_type4', 'cate_6_type4', 'cate_7_type4',\n", 506 | " 'cate_8_type4', 'cate_9_type4', 'cate_10_type4', 'cate_11_type4',\n", 507 | " 'cate_4_type5', 'cate_5_type5', 'cate_6_type5', 'cate_7_type5',\n", 508 | " 'cate_8_type5', 'cate_9_type5', 'cate_10_type5', 'cate_11_type5',\n", 509 | " 'cate_4_type6', 'cate_5_type6', 'cate_6_type6', 'cate_7_type6',\n", 510 | " 'cate_8_type6', 'cate_9_type6', 'cate_10_type6', 'cate_11_type6'\n", 511 | " ]\n", 512 | " actions = actions.fillna(0)\n", 513 | " actions['cate_action_sum'] = actions.sum(axis=1)\n", 514 | " actions['cate8_percentage'] = (\n", 515 | " actions['cate_8_type1'] + actions['cate_8_type2'] +\n", 516 | " actions['cate_8_type3'] + actions['cate_8_type4'] +\n", 517 | " actions['cate_8_type5'] + actions['cate_8_type6']\n", 518 | " ) / actions['cate_action_sum']\n", 519 | " actions['cate4_percentage'] = (\n", 520 | " actions['cate_4_type1'] + actions['cate_4_type2'] +\n", 521 | " actions['cate_4_type3'] + actions['cate_4_type4'] +\n", 522 | " actions['cate_4_type5'] + actions['cate_4_type6']\n", 523 | " ) / actions['cate_action_sum']\n", 524 | " actions['cate5_percentage'] = (\n", 525 | " actions['cate_5_type1'] + actions['cate_5_type2'] +\n", 526 | " actions['cate_5_type3'] + actions['cate_5_type4'] +\n", 527 | " actions['cate_5_type5'] + actions['cate_5_type6']\n", 528 | " ) / actions['cate_action_sum']\n", 529 | " actions['cate6_percentage'] = (\n", 530 | " actions['cate_6_type1'] + actions['cate_6_type2'] +\n", 531 | " actions['cate_6_type3'] + actions['cate_6_type4'] +\n", 532 | " actions['cate_6_type5'] + actions['cate_6_type6']\n", 533 | " ) / actions['cate_action_sum']\n", 534 | " actions['cate7_percentage'] = (\n", 535 | " actions['cate_7_type1'] + 
actions['cate_7_type2'] +\n", 536 | " actions['cate_7_type3'] + actions['cate_7_type4'] +\n", 537 | " actions['cate_7_type5'] + actions['cate_7_type6']\n", 538 | " ) / actions['cate_action_sum']\n", 539 | " actions['cate9_percentage'] = (\n", 540 | " actions['cate_9_type1'] + actions['cate_9_type2'] +\n", 541 | " actions['cate_9_type3'] + actions['cate_9_type4'] +\n", 542 | " actions['cate_9_type5'] + actions['cate_9_type6']\n", 543 | " ) / actions['cate_action_sum']\n", 544 | " actions['cate10_percentage'] = (\n", 545 | " actions['cate_10_type1'] + actions['cate_10_type2'] +\n", 546 | " actions['cate_10_type3'] + actions['cate_10_type4'] +\n", 547 | " actions['cate_10_type5'] + actions['cate_10_type6']\n", 548 | " ) / actions['cate_action_sum']\n", 549 | " actions['cate11_percentage'] = (\n", 550 | " actions['cate_11_type1'] + actions['cate_11_type2'] +\n", 551 | " actions['cate_11_type3'] + actions['cate_11_type4'] +\n", 552 | " actions['cate_11_type5'] + actions['cate_11_type6']\n", 553 | " ) / actions['cate_action_sum']\n", 554 | "\n", 555 | " actions['cate8_type1_percentage'] = np.log(\n", 556 | " 1 + actions['cate_8_type1']) - np.log(\n", 557 | " 1 + actions['cate_8_type1'] + actions['cate_4_type1'] +\n", 558 | " actions['cate_5_type1'] + actions['cate_6_type1'] +\n", 559 | " actions['cate_7_type1'] + actions['cate_9_type1'] +\n", 560 | " actions['cate_10_type1'] + actions['cate_11_type1'])\n", 561 | "\n", 562 | " actions['cate8_type2_percentage'] = np.log(\n", 563 | " 1 + actions['cate_8_type2']) - np.log(\n", 564 | " 1 + actions['cate_8_type2'] + actions['cate_4_type2'] +\n", 565 | " actions['cate_5_type2'] + actions['cate_6_type2'] +\n", 566 | " actions['cate_7_type2'] + actions['cate_9_type2'] +\n", 567 | " actions['cate_10_type2'] + actions['cate_11_type2'])\n", 568 | " actions['cate8_type3_percentage'] = np.log(\n", 569 | " 1 + actions['cate_8_type3']) - np.log(\n", 570 | " 1 + actions['cate_8_type3'] + actions['cate_4_type3'] +\n", 571 | " 
actions['cate_5_type3'] + actions['cate_6_type3'] +\n", 572 | " actions['cate_7_type3'] + actions['cate_9_type3'] +\n", 573 | " actions['cate_10_type3'] + actions['cate_11_type3'])\n", 574 | " actions['cate8_type4_percentage'] = np.log(\n", 575 | " 1 + actions['cate_8_type4']) - np.log(\n", 576 | " 1 + actions['cate_8_type4'] + actions['cate_4_type4'] +\n", 577 | " actions['cate_5_type4'] + actions['cate_6_type4'] +\n", 578 | " actions['cate_7_type4'] + actions['cate_9_type4'] +\n", 579 | " actions['cate_10_type4'] + actions['cate_11_type4'])\n", 580 | " actions['cate8_type5_percentage'] = np.log(\n", 581 | " 1 + actions['cate_8_type5']) - np.log(\n", 582 | " 1 + actions['cate_8_type5'] + actions['cate_4_type5'] +\n", 583 | " actions['cate_5_type5'] + actions['cate_6_type5'] +\n", 584 | " actions['cate_7_type5'] + actions['cate_9_type5'] +\n", 585 | " actions['cate_10_type5'] + actions['cate_11_type5'])\n", 586 | " actions['cate8_type6_percentage'] = np.log(\n", 587 | " 1 + actions['cate_8_type6']) - np.log(\n", 588 | " 1 + actions['cate_8_type6'] + actions['cate_4_type6'] +\n", 589 | " actions['cate_5_type6'] + actions['cate_6_type6'] +\n", 590 | " actions['cate_7_type6'] + actions['cate_9_type6'] +\n", 591 | " actions['cate_10_type6'] + actions['cate_11_type6'])\n", 592 | " actions['user_id'] = actions.index\n", 593 | " actions = actions[[\n", 594 | " 'user_id', 'cate8_percentage', 'cate4_percentage', 'cate5_percentage',\n", 595 | " 'cate6_percentage', 'cate7_percentage', 'cate9_percentage',\n", 596 | " 'cate10_percentage', 'cate11_percentage', 'cate8_type1_percentage',\n", 597 | " 'cate8_type2_percentage', 'cate8_type3_percentage',\n", 598 | " 'cate8_type4_percentage', 'cate8_type5_percentage',\n", 599 | " 'cate8_type6_percentage'\n", 600 | " ]]\n", 601 | " return actions" 602 | ] 603 | }, 604 | { 605 | "cell_type": "markdown", 606 | "metadata": {}, 607 | "source": [ 608 | "### Product-action\n", 609 | "#### Cumulative product features\n", 610 | "* Split by time window\n", 611 | "* For each product action type:\n", 612 | " * 
purchase conversion rate\n", 613 | " * mean\n", 614 | " * standard deviation" 615 | ] 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": 11, 620 | "metadata": { 621 | "ExecuteTime": { 622 | "end_time": "2017-05-16T21:16:19.251417Z", 623 | "start_time": "2017-05-16T21:16:19.185064Z" 624 | }, 625 | "collapsed": true 626 | }, 627 | "outputs": [], 628 | "source": [ 629 | "def get_accumulate_product_feat(start_date, end_date, all_actions):\n", 630 | " feature = [\n", 631 | " 'sku_id', 'product_action_1', 'product_action_2',\n", 632 | " 'product_action_3', 'product_action_4',\n", 633 | " 'product_action_5', 'product_action_6',\n", 634 | " 'product_action_1_ratio', 'product_action_2_ratio',\n", 635 | " 'product_action_3_ratio', 'product_action_5_ratio',\n", 636 | " 'product_action_6_ratio', 'product_action_1_mean',\n", 637 | " 'product_action_2_mean', 'product_action_3_mean',\n", 638 | " 'product_action_4_mean', 'product_action_5_mean',\n", 639 | " 'product_action_6_mean', 'product_action_1_std',\n", 640 | " 'product_action_2_std', 'product_action_3_std', 'product_action_4_std',\n", 641 | " 'product_action_5_std', 'product_action_6_std'\n", 642 | " ]\n", 643 | "\n", 644 | " actions = get_actions(start_date, end_date, all_actions)\n", 645 | " df = pd.get_dummies(actions['type'], prefix='product_action')\n", 646 | " # Group by sku and date to compute the per-day standard deviation of each action over the window\n", 647 | "# actions['date'] = pd.to_datetime(actions['time']).apply(lambda x: x.date())\n", 648 | " actions = pd.concat([actions[['sku_id', 'date']], df], axis=1)\n", 649 | " actions_date = actions.groupby(['sku_id', 'date']).sum()\n", 650 | " actions_date = actions_date.unstack()\n", 651 | " actions_date.fillna(0, inplace=True)\n", 652 | " action_1 = np.std(actions_date['product_action_1'], axis=1)\n", 653 | " action_1 = action_1.to_frame()\n", 654 | " action_1.columns = ['product_action_1_std']\n", 655 | " action_2 = np.std(actions_date['product_action_2'], axis=1)\n", 656 | " action_2 = action_2.to_frame()\n", 657 | " action_2.columns = 
['product_action_2_std']\n", 658 | " action_3 = np.std(actions_date['product_action_3'], axis=1)\n", 659 | " action_3 = action_3.to_frame()\n", 660 | " action_3.columns = ['product_action_3_std']\n", 661 | " action_4 = np.std(actions_date['product_action_4'], axis=1)\n", 662 | " action_4 = action_4.to_frame()\n", 663 | " action_4.columns = ['product_action_4_std']\n", 664 | " action_5 = np.std(actions_date['product_action_5'], axis=1)\n", 665 | " action_5 = action_5.to_frame()\n", 666 | " action_5.columns = ['product_action_5_std']\n", 667 | " action_6 = np.std(actions_date['product_action_6'], axis=1)\n", 668 | " action_6 = action_6.to_frame()\n", 669 | " action_6.columns = ['product_action_6_std']\n", 670 | " actions_date = pd.concat(\n", 671 | " [action_1, action_2, action_3, action_4, action_5, action_6], axis=1)\n", 672 | " actions_date['sku_id'] = actions_date.index\n", 673 | "\n", 674 | " actions = actions.groupby(['sku_id'], as_index=False).sum()\n", 675 | " days_interal = (datetime.strptime(end_date, '%Y-%m-%d') - datetime.strptime(start_date, '%Y-%m-%d')).days\n", 676 | " # Group by sku and compute the purchase conversion rate\n", 677 | "# actions['product_action_1_ratio'] = actions['product_action_4'] / actions[\n", 678 | "# 'product_action_1']\n", 679 | "# actions['product_action_2_ratio'] = actions['product_action_4'] / actions[\n", 680 | "# 'product_action_2']\n", 681 | "# actions['product_action_3_ratio'] = actions['product_action_4'] / actions[\n", 682 | "# 'product_action_3']\n", 683 | "# actions['product_action_5_ratio'] = actions['product_action_4'] / actions[\n", 684 | "# 'product_action_5']\n", 685 | "# actions['product_action_6_ratio'] = actions['product_action_4'] / actions[\n", 686 | "# 'product_action_6']\n", 687 | " actions['product_action_1_ratio'] = np.log(1 + actions['product_action_4']) - np.log(1 + actions['product_action_1'])\n", 688 | " actions['product_action_2_ratio'] = np.log(1 + actions['product_action_4']) - np.log(1 + actions['product_action_2'])\n", 689 | 
actions['product_action_3_ratio'] = np.log(1 + actions['product_action_4']) - np.log(1 + actions['product_action_3'])\n", 690 | " actions['product_action_5_ratio'] = np.log(1 + actions['product_action_4']) - np.log(1 + actions['product_action_5'])\n", 691 | " actions['product_action_6_ratio'] = np.log(1 + actions['product_action_4']) - np.log(1 + actions['product_action_6'])\n", 692 | " # Mean of each action type\n", 693 | " actions['product_action_1_mean'] = actions[\n", 694 | " 'product_action_1'] / days_interal\n", 695 | " actions['product_action_2_mean'] = actions[\n", 696 | " 'product_action_2'] / days_interal\n", 697 | " actions['product_action_3_mean'] = actions[\n", 698 | " 'product_action_3'] / days_interal\n", 699 | " actions['product_action_4_mean'] = actions[\n", 700 | " 'product_action_4'] / days_interal\n", 701 | " actions['product_action_5_mean'] = actions[\n", 702 | " 'product_action_5'] / days_interal\n", 703 | " actions['product_action_6_mean'] = actions[\n", 704 | " 'product_action_6'] / days_interal\n", 705 | " actions = pd.merge(actions, actions_date, how='left', on='sku_id')\n", 706 | " actions = actions[feature]\n", 707 | " return actions" 708 | ] 709 | }, 710 | { 711 | "cell_type": "markdown", 712 | "metadata": {}, 713 | "source": [ 714 | "### Category features\n", 715 | "For each product category, per time window:\n", 716 | "* purchase conversion rate\n", 717 | "* standard deviation\n", 718 | "* mean" 719 | ] 720 | }, 721 | { 722 | "cell_type": "code", 723 | "execution_count": 12, 724 | "metadata": { 725 | "ExecuteTime": { 726 | "end_time": "2017-05-16T21:16:19.317213Z", 727 | "start_time": "2017-05-16T21:16:19.252514Z" 728 | }, 729 | "collapsed": true 730 | }, 731 | "outputs": [], 732 | "source": [ 733 | "def get_accumulate_cate_feat(start_date, end_date, all_actions):\n", 734 | " feature = ['cate','cate_action_1', 'cate_action_2', 'cate_action_3', 'cate_action_4', 'cate_action_5', \n", 735 | " 'cate_action_6', 'cate_action_1_ratio', 'cate_action_2_ratio', \n", 736 | " 'cate_action_3_ratio', 'cate_action_5_ratio', 
'cate_action_6_ratio', 'cate_action_1_mean',\n", 737 | " 'cate_action_2_mean', 'cate_action_3_mean', 'cate_action_4_mean', 'cate_action_5_mean',\n", 738 | " 'cate_action_6_mean', 'cate_action_1_std', 'cate_action_2_std', 'cate_action_3_std',\n", 739 | " 'cate_action_4_std', 'cate_action_5_std', 'cate_action_6_std']\n", 740 | " actions = get_actions(start_date, end_date, all_actions)\n", 741 | "# actions['date'] = pd.to_datetime(actions['time']).apply(lambda x: x.date())\n", 742 | " df = pd.get_dummies(actions['type'], prefix='cate_action')\n", 743 | " actions = pd.concat([actions[['cate','date']], df], axis=1)\n", 744 | " # Group by cate and date to compute the per-day standard deviation of each action over the window\n", 745 | " actions_date = actions.groupby(['cate','date']).sum()\n", 746 | " actions_date = actions_date.unstack()\n", 747 | " actions_date.fillna(0, inplace=True)\n", 748 | " action_1 = np.std(actions_date['cate_action_1'], axis=1)\n", 749 | " action_1 = action_1.to_frame()\n", 750 | " action_1.columns = ['cate_action_1_std']\n", 751 | " action_2 = np.std(actions_date['cate_action_2'], axis=1)\n", 752 | " action_2 = action_2.to_frame()\n", 753 | " action_2.columns = ['cate_action_2_std']\n", 754 | " action_3 = np.std(actions_date['cate_action_3'], axis=1)\n", 755 | " action_3 = action_3.to_frame()\n", 756 | " action_3.columns = ['cate_action_3_std']\n", 757 | " action_4 = np.std(actions_date['cate_action_4'], axis=1)\n", 758 | " action_4 = action_4.to_frame()\n", 759 | " action_4.columns = ['cate_action_4_std']\n", 760 | " action_5 = np.std(actions_date['cate_action_5'], axis=1)\n", 761 | " action_5 = action_5.to_frame()\n", 762 | " action_5.columns = ['cate_action_5_std']\n", 763 | " action_6 = np.std(actions_date['cate_action_6'], axis=1)\n", 764 | " action_6 = action_6.to_frame()\n", 765 | " action_6.columns = ['cate_action_6_std']\n", 766 | " actions_date = pd.concat([action_1, action_2, action_3, action_4, action_5, action_6], axis=1)\n", 767 | " actions_date['cate'] = actions_date.index\n", 768 | " # 
Group by cate to compute the conversion rate of each action per category\n", 769 | " actions = actions.groupby(['cate'], as_index=False).sum()\n", 770 | " days_interal = (datetime.strptime(end_date, '%Y-%m-%d')-datetime.strptime(start_date, '%Y-%m-%d')).days\n", 771 | " \n", 772 | "# actions['cate_action_1_ratio'] = actions['cate_action_4'] / actions['cate_action_1']\n", 773 | "# actions['cate_action_2_ratio'] = actions['cate_action_4'] / actions['cate_action_2']\n", 774 | "# actions['cate_action_3_ratio'] = actions['cate_action_4'] / actions['cate_action_3']\n", 775 | "# actions['cate_action_5_ratio'] = actions['cate_action_4'] / actions['cate_action_5']\n", 776 | "# actions['cate_action_6_ratio'] = actions['cate_action_4'] / actions['cate_action_6']\n", 777 | " actions['cate_action_1_ratio'] =(np.log(1 + actions['cate_action_4']) - np.log(1 + actions['cate_action_1']))\n", 778 | " actions['cate_action_2_ratio'] =(np.log(1 + actions['cate_action_4']) - np.log(1 + actions['cate_action_2']))\n", 779 | " actions['cate_action_3_ratio'] =(np.log(1 + actions['cate_action_4']) - np.log(1 + actions['cate_action_3']))\n", 780 | " actions['cate_action_5_ratio'] =(np.log(1 + actions['cate_action_4']) - np.log(1 + actions['cate_action_5']))\n", 781 | " actions['cate_action_6_ratio'] =(np.log(1 + actions['cate_action_4']) - np.log(1 + actions['cate_action_6']))\n", 782 | " # Group by cate to compute the mean of each action per category over the window\n", 783 | " actions['cate_action_1_mean'] = actions['cate_action_1'] / days_interal\n", 784 | " actions['cate_action_2_mean'] = actions['cate_action_2'] / days_interal\n", 785 | " actions['cate_action_3_mean'] = actions['cate_action_3'] / days_interal\n", 786 | " actions['cate_action_4_mean'] = actions['cate_action_4'] / days_interal\n", 787 | " actions['cate_action_5_mean'] = actions['cate_action_5'] / days_interal\n", 788 | " actions['cate_action_6_mean'] = actions['cate_action_6'] / days_interal\n", 789 | " actions = pd.merge(actions, actions_date, how ='left',on='cate')\n", 790 | " actions = actions[feature]\n", 791 | 
return actions" 792 | ] 793 | }, 794 | { 795 | "cell_type": "markdown", 796 | "metadata": {}, 797 | "source": [ 798 | "## Building the training/test sets" 799 | ] 800 | }, 801 | { 802 | "cell_type": "markdown", 803 | "metadata": {}, 804 | "source": [ 805 | "### Building the training/validation sets\n", 806 | "* Labels: using a sliding window, rows with a purchase action are labelled 1 when building the training set\n", 807 | "* Merge the features" 808 | ] 809 | }, 810 | { 811 | "cell_type": "code", 812 | "execution_count": 13, 813 | "metadata": { 814 | "ExecuteTime": { 815 | "end_time": "2017-05-16T21:16:19.383390Z", 816 | "start_time": "2017-05-16T21:16:19.318110Z" 817 | }, 818 | "collapsed": true 819 | }, 820 | "outputs": [], 821 | "source": [ 822 | "def get_labels(start_date, end_date, all_actions):\n", 823 | " actions = get_actions(start_date, end_date, all_actions)\n", 824 | "# actions = actions[actions['type'] == 4]\n", 825 | " # changed: only predict users purchasing category-8 items\n", 826 | " actions = actions[(actions['type'] == 4) & (actions['cate']==8)]\n", 827 | " \n", 828 | " actions = actions.groupby(['user_id', 'sku_id'], as_index=False).sum()\n", 829 | " actions['label'] = 1\n", 830 | " actions = actions[['user_id', 'sku_id', 'label']]\n", 831 | " return actions" 832 | ] 833 | }, 834 | { 835 | "cell_type": "markdown", 836 | "metadata": {}, 837 | "source": [ 838 | "Build the training set" 839 | ] 840 | }, 841 | { 842 | "cell_type": "code", 843 | "execution_count": 14, 844 | "metadata": { 845 | "ExecuteTime": { 846 | "end_time": "2017-05-16T21:16:19.439236Z", 847 | "start_time": "2017-05-16T21:16:19.384369Z" 848 | }, 849 | "collapsed": true 850 | }, 851 | "outputs": [], 852 | "source": [ 853 | "def make_actions(user, product, all_actions, train_start_date):\n", 854 | " train_end_date = datetime.strptime(train_start_date, '%Y-%m-%d') + timedelta(days=3)\n", 855 | " train_end_date = train_end_date.strftime('%Y-%m-%d')\n", 856 | " # fix the time span used for prod_acc and cate_acc\n", 857 | " start_days = datetime.strptime(train_end_date, '%Y-%m-%d') - timedelta(days=30)\n", 858 | " start_days = start_days.strftime('%Y-%m-%d')\n", 859 | " print train_end_date\n", 
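The sliding-window scheme here uses a 3-day feature window, with labels taken from the purchases in the following 5 days. The date arithmetic can be sketched standalone; the helper name `window_bounds` is hypothetical, introduced only to illustrate the windowing:

```python
from datetime import datetime, timedelta

def window_bounds(train_start_date, feat_days=3, label_days=5):
    # Feature window: [train_start_date, train_end_date)
    train_end = datetime.strptime(train_start_date, '%Y-%m-%d') + timedelta(days=feat_days)
    # Label window starts where the feature window ends
    label_end = train_end + timedelta(days=label_days)
    return train_end.strftime('%Y-%m-%d'), label_end.strftime('%Y-%m-%d')

print(window_bounds('2016-03-01'))  # ('2016-03-04', '2016-03-09')
```

These bounds match the pairs of dates printed per round in the training log.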
860 | " user_acc = get_recent_user_feat(train_end_date, all_actions)\n", 861 | " print 'get_recent_user_feat finsihed'\n", 862 | " \n", 863 | " user_cate = get_user_cate_feature(train_start_date, train_end_date, all_actions)\n", 864 | " print 'get_user_cate_feature finished'\n", 865 | " \n", 866 | " product_acc = get_accumulate_product_feat(start_days, train_end_date, all_actions)\n", 867 | " print 'get_accumulate_product_feat finsihed'\n", 868 | " cate_acc = get_accumulate_cate_feat(start_days, train_end_date, all_actions)\n", 869 | " print 'get_accumulate_cate_feat finsihed'\n", 870 | " comment_acc = get_comments_product_feat(train_end_date)\n", 871 | " print 'get_comments_product_feat finished'\n", 872 | " # labels\n", 873 | " test_start_date = train_end_date\n", 874 | " test_end_date = datetime.strptime(test_start_date, '%Y-%m-%d') + timedelta(days=5)\n", 875 | " test_end_date = test_end_date.strftime('%Y-%m-%d')\n", 876 | " labels = get_labels(test_start_date, test_end_date, all_actions)\n", 877 | " print \"get labels\"\n", 878 | " \n", 879 | " actions = None\n", 880 | " for i in (3, 5, 7, 10, 15, 21, 30):\n", 881 | " start_days = datetime.strptime(train_end_date, '%Y-%m-%d') - timedelta(days=i)\n", 882 | " start_days = start_days.strftime('%Y-%m-%d')\n", 883 | " if actions is None:\n", 884 | " actions = get_action_feat(start_days, train_end_date, all_actions, i)\n", 885 | " else:\n", 886 | " # note the merge keys here\n", 887 | " actions = pd.merge(actions, get_action_feat(start_days, train_end_date, all_actions, i), how='left',\n", 888 | " on=['user_id', 'sku_id', 'cate'])\n", 889 | "\n", 890 | " actions = pd.merge(actions, user, how='left', on='user_id')\n", 891 | " actions = pd.merge(actions, user_acc, how='left', on='user_id')\n", 892 | " actions = pd.merge(actions, user_cate, how='left', on='user_id')\n", 893 | " # note the merge keys here\n", 894 | " actions = pd.merge(actions, product, how='left', on=['sku_id', 'cate'])\n", 895 | " actions = pd.merge(actions, product_acc, how='left', 
on='sku_id')\n", 896 | " actions = pd.merge(actions, cate_acc, how='left', on='cate')\n", 897 | " actions = pd.merge(actions, comment_acc, how='left', on='sku_id')\n", 898 | " actions = pd.merge(actions, labels, how='left', on=['user_id', 'sku_id'])\n", 899 | " # mainly fills the NaNs introduced by merging basic product features, comment features and labels\n", 900 | " actions = actions.fillna(0)\n", 901 | "# return actions\n", 902 | " # sampling\n", 903 | " action_positive = actions[actions['label'] == 1]\n", 904 | " action_negative = actions[actions['label'] == 0]\n", 905 | " del actions\n", 906 | " neg_len = len(action_positive) * 10\n", 907 | " action_negative = action_negative.sample(n=neg_len)\n", 908 | " action_sample = pd.concat([action_positive, action_negative], ignore_index=True) \n", 909 | " \n", 910 | " return action_sample" 911 | ] 912 | }, 913 | { 914 | "cell_type": "code", 915 | "execution_count": 15, 916 | "metadata": { 917 | "ExecuteTime": { 918 | "end_time": "2017-05-16T21:16:19.509168Z", 919 | "start_time": "2017-05-16T21:16:19.440133Z" 920 | }, 921 | "collapsed": true 922 | }, 923 | "outputs": [], 924 | "source": [ 925 | "def make_train_set(train_start_date, setNums, f_path):\n", 926 | " train_actions = None\n", 927 | " all_actions = get_all_action()\n", 928 | " print \"get all actions!\"\n", 929 | " user = get_basic_user_feat()\n", 930 | " print 'get_basic_user_feat finsihed'\n", 931 | " product = get_basic_product_feat()\n", 932 | " print 'get_basic_product_feat finsihed'\n", 933 | " # sliding window: build several training/validation sets\n", 934 | " for i in range(setNums):\n", 935 | " print train_start_date\n", 936 | " if train_actions is None:\n", 937 | " train_actions = make_actions(user, product, all_actions, train_start_date)\n", 938 | " else:\n", 939 | " train_actions = pd.concat([train_actions, make_actions(user, product, all_actions, train_start_date)],\n", 940 | " ignore_index=True)\n", 941 | " # shift the window by one day for the next round\n", 942 | " train_start_date = datetime.strptime(train_start_date, '%Y-%m-%d') + timedelta(days=1)\n", 943 | " train_start_date = 
train_start_date.strftime('%Y-%m-%d')\n", 944 | " print \"round {0}/{1} over!\".format(i+1, setNums)\n", 945 | "\n", 946 | " train_actions.to_csv(f_path, index=False)" 947 | ] 948 | }, 949 | { 950 | "cell_type": "code", 951 | "execution_count": 16, 952 | "metadata": { 953 | "ExecuteTime": { 954 | "end_time": "2017-05-16T22:34:29.544825Z", 955 | "start_time": "2017-05-16T21:16:19.510386Z" 956 | }, 957 | "collapsed": false 958 | }, 959 | "outputs": [ 960 | { 961 | "name": "stdout", 962 | "output_type": "stream", 963 | "text": [ 964 | "get all actions!\n", 965 | "get_basic_user_feat finsihed\n", 966 | "get_basic_product_feat finsihed\n", 967 | "2016-03-01\n", 968 | "2016-03-04\n", 969 | "get_recent_user_feat finsihed\n", 970 | "get_user_cate_feature finished\n", 971 | "get_accumulate_product_feat finsihed\n", 972 | "get_accumulate_cate_feat finsihed\n", 973 | "get_comments_product_feat finished\n", 974 | "get labels\n", 975 | "round 1/34 over!\n", 976 | "2016-03-02\n", 977 | "2016-03-05\n", 978 | "get_recent_user_feat finsihed\n", 979 | "get_user_cate_feature finished\n", 980 | "get_accumulate_product_feat finsihed\n", 981 | "get_accumulate_cate_feat finsihed\n", 982 | "get_comments_product_feat finished\n", 983 | "get labels\n", 984 | "round 2/34 over!\n", 985 | "2016-03-03\n", 986 | "2016-03-06\n", 987 | "get_recent_user_feat finsihed\n", 988 | "get_user_cate_feature finished\n", 989 | "get_accumulate_product_feat finsihed\n", 990 | "get_accumulate_cate_feat finsihed\n", 991 | "get_comments_product_feat finished\n", 992 | "get labels\n", 993 | "round 3/34 over!\n", 994 | "2016-03-04\n", 995 | "2016-03-07\n", 996 | "get_recent_user_feat finsihed\n", 997 | "get_user_cate_feature finished\n", 998 | "get_accumulate_product_feat finsihed\n", 999 | "get_accumulate_cate_feat finsihed\n", 1000 | "get_comments_product_feat finished\n", 1001 | "get labels\n", 1002 | "round 4/34 over!\n", 1003 | "2016-03-05\n", 1004 | "2016-03-08\n", 1005 | "get_recent_user_feat finsihed\n", 
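Each round logged here ends with `make_actions` keeping every positive row and sampling negatives at roughly 10:1. That sampling step can be sketched in isolation; the `downsample` helper name, the `min()` guard, and the fixed `random_state` are illustrative additions, not part of the notebook:

```python
import pandas as pd

def downsample(df, ratio=10, seed=0):
    pos = df[df['label'] == 1]
    neg = df[df['label'] == 0]
    # Keep at most ratio x as many negatives as positives
    n = min(len(neg), len(pos) * ratio)
    neg = neg.sample(n=n, random_state=seed)
    return pd.concat([pos, neg], ignore_index=True)

toy = pd.DataFrame({'label': [1] * 2 + [0] * 50})
print(len(downsample(toy)))  # 22
```

With 2 positives and 50 negatives, 20 negatives are kept, giving 22 rows in total.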
1006 | "get_user_cate_feature finished\n", 1007 | "get_accumulate_product_feat finsihed\n", 1008 | "get_accumulate_cate_feat finsihed\n", 1009 | "get_comments_product_feat finished\n", 1010 | "get labels\n", 1011 | "round 5/34 over!\n", 1012 | "2016-03-06\n", 1013 | "2016-03-09\n", 1014 | "get_recent_user_feat finsihed\n", 1015 | "get_user_cate_feature finished\n", 1016 | "get_accumulate_product_feat finsihed\n", 1017 | "get_accumulate_cate_feat finsihed\n", 1018 | "get_comments_product_feat finished\n", 1019 | "get labels\n", 1020 | "round 6/34 over!\n", 1021 | "2016-03-07\n", 1022 | "2016-03-10\n", 1023 | "get_recent_user_feat finsihed\n", 1024 | "get_user_cate_feature finished\n", 1025 | "get_accumulate_product_feat finsihed\n", 1026 | "get_accumulate_cate_feat finsihed\n", 1027 | "get_comments_product_feat finished\n", 1028 | "get labels\n", 1029 | "round 7/34 over!\n", 1030 | "2016-03-08\n", 1031 | "2016-03-11\n", 1032 | "get_recent_user_feat finsihed\n", 1033 | "get_user_cate_feature finished\n", 1034 | "get_accumulate_product_feat finsihed\n", 1035 | "get_accumulate_cate_feat finsihed\n", 1036 | "get_comments_product_feat finished\n", 1037 | "get labels\n", 1038 | "round 8/34 over!\n", 1039 | "2016-03-09\n", 1040 | "2016-03-12\n", 1041 | "get_recent_user_feat finsihed\n", 1042 | "get_user_cate_feature finished\n", 1043 | "get_accumulate_product_feat finsihed\n", 1044 | "get_accumulate_cate_feat finsihed\n", 1045 | "get_comments_product_feat finished\n", 1046 | "get labels\n", 1047 | "round 9/34 over!\n", 1048 | "2016-03-10\n", 1049 | "2016-03-13\n", 1050 | "get_recent_user_feat finsihed\n", 1051 | "get_user_cate_feature finished\n", 1052 | "get_accumulate_product_feat finsihed\n", 1053 | "get_accumulate_cate_feat finsihed\n", 1054 | "get_comments_product_feat finished\n", 1055 | "get labels\n", 1056 | "round 10/34 over!\n", 1057 | "2016-03-11\n", 1058 | "2016-03-14\n", 1059 | "get_recent_user_feat finsihed\n", 1060 | "get_user_cate_feature finished\n", 1061 
| "get_accumulate_product_feat finsihed\n", 1062 | "get_accumulate_cate_feat finsihed\n", 1063 | "get_comments_product_feat finished\n", 1064 | "get labels\n", 1065 | "round 11/34 over!\n", 1066 | "2016-03-12\n", 1067 | "2016-03-15\n", 1068 | "get_recent_user_feat finsihed\n", 1069 | "get_user_cate_feature finished\n", 1070 | "get_accumulate_product_feat finsihed\n", 1071 | "get_accumulate_cate_feat finsihed\n", 1072 | "get_comments_product_feat finished\n", 1073 | "get labels\n", 1074 | "round 12/34 over!\n", 1075 | "2016-03-13\n", 1076 | "2016-03-16\n", 1077 | "get_recent_user_feat finsihed\n", 1078 | "get_user_cate_feature finished\n", 1079 | "get_accumulate_product_feat finsihed\n", 1080 | "get_accumulate_cate_feat finsihed\n", 1081 | "get_comments_product_feat finished\n", 1082 | "get labels\n", 1083 | "round 13/34 over!\n", 1084 | "2016-03-14\n", 1085 | "2016-03-17\n", 1086 | "get_recent_user_feat finsihed\n", 1087 | "get_user_cate_feature finished\n", 1088 | "get_accumulate_product_feat finsihed\n", 1089 | "get_accumulate_cate_feat finsihed\n", 1090 | "get_comments_product_feat finished\n", 1091 | "get labels\n", 1092 | "round 14/34 over!\n", 1093 | "2016-03-15\n", 1094 | "2016-03-18\n", 1095 | "get_recent_user_feat finsihed\n", 1096 | "get_user_cate_feature finished\n", 1097 | "get_accumulate_product_feat finsihed\n", 1098 | "get_accumulate_cate_feat finsihed\n", 1099 | "get_comments_product_feat finished\n", 1100 | "get labels\n", 1101 | "round 15/34 over!\n", 1102 | "2016-03-16\n", 1103 | "2016-03-19\n", 1104 | "get_recent_user_feat finsihed\n", 1105 | "get_user_cate_feature finished\n", 1106 | "get_accumulate_product_feat finsihed\n", 1107 | "get_accumulate_cate_feat finsihed\n", 1108 | "get_comments_product_feat finished\n", 1109 | "get labels\n", 1110 | "round 16/34 over!\n", 1111 | "2016-03-17\n", 1112 | "2016-03-20\n", 1113 | "get_recent_user_feat finsihed\n", 1114 | "get_user_cate_feature finished\n", 1115 | "get_accumulate_product_feat finsihed\n", 
1116 | "get_accumulate_cate_feat finsihed\n", 1117 | "get_comments_product_feat finished\n", 1118 | "get labels\n", 1119 | "round 17/34 over!\n", 1120 | "2016-03-18\n", 1121 | "2016-03-21\n", 1122 | "get_recent_user_feat finsihed\n", 1123 | "get_user_cate_feature finished\n", 1124 | "get_accumulate_product_feat finsihed\n", 1125 | "get_accumulate_cate_feat finsihed\n", 1126 | "get_comments_product_feat finished\n", 1127 | "get labels\n", 1128 | "round 18/34 over!\n", 1129 | "2016-03-19\n", 1130 | "2016-03-22\n", 1131 | "get_recent_user_feat finsihed\n", 1132 | "get_user_cate_feature finished\n", 1133 | "get_accumulate_product_feat finsihed\n", 1134 | "get_accumulate_cate_feat finsihed\n", 1135 | "get_comments_product_feat finished\n", 1136 | "get labels\n", 1137 | "round 19/34 over!\n", 1138 | "2016-03-20\n", 1139 | "2016-03-23\n", 1140 | "get_recent_user_feat finsihed\n", 1141 | "get_user_cate_feature finished\n", 1142 | "get_accumulate_product_feat finsihed\n", 1143 | "get_accumulate_cate_feat finsihed\n", 1144 | "get_comments_product_feat finished\n", 1145 | "get labels\n", 1146 | "round 20/34 over!\n", 1147 | "2016-03-21\n", 1148 | "2016-03-24\n", 1149 | "get_recent_user_feat finsihed\n", 1150 | "get_user_cate_feature finished\n", 1151 | "get_accumulate_product_feat finsihed\n", 1152 | "get_accumulate_cate_feat finsihed\n", 1153 | "get_comments_product_feat finished\n", 1154 | "get labels\n", 1155 | "round 21/34 over!\n", 1156 | "2016-03-22\n", 1157 | "2016-03-25\n", 1158 | "get_recent_user_feat finsihed\n", 1159 | "get_user_cate_feature finished\n", 1160 | "get_accumulate_product_feat finsihed\n", 1161 | "get_accumulate_cate_feat finsihed\n", 1162 | "get_comments_product_feat finished\n", 1163 | "get labels\n", 1164 | "round 22/34 over!\n", 1165 | "2016-03-23\n", 1166 | "2016-03-26\n", 1167 | "get_recent_user_feat finsihed\n", 1168 | "get_user_cate_feature finished\n", 1169 | "get_accumulate_product_feat finsihed\n", 1170 | "get_accumulate_cate_feat 
finsihed\n", 1171 | "get_comments_product_feat finished\n", 1172 | "get labels\n", 1173 | "round 23/34 over!\n", 1174 | "2016-03-24\n", 1175 | "2016-03-27\n", 1176 | "get_recent_user_feat finsihed\n", 1177 | "get_user_cate_feature finished\n", 1178 | "get_accumulate_product_feat finsihed\n", 1179 | "get_accumulate_cate_feat finsihed\n", 1180 | "get_comments_product_feat finished\n", 1181 | "get labels\n", 1182 | "round 24/34 over!\n", 1183 | "2016-03-25\n", 1184 | "2016-03-28\n", 1185 | "get_recent_user_feat finsihed\n", 1186 | "get_user_cate_feature finished\n", 1187 | "get_accumulate_product_feat finsihed\n", 1188 | "get_accumulate_cate_feat finsihed\n", 1189 | "get_comments_product_feat finished\n", 1190 | "get labels\n", 1191 | "round 25/34 over!\n", 1192 | "2016-03-26\n", 1193 | "2016-03-29\n", 1194 | "get_recent_user_feat finsihed\n", 1195 | "get_user_cate_feature finished\n", 1196 | "get_accumulate_product_feat finsihed\n", 1197 | "get_accumulate_cate_feat finsihed\n", 1198 | "get_comments_product_feat finished\n", 1199 | "get labels\n", 1200 | "round 26/34 over!\n", 1201 | "2016-03-27\n", 1202 | "2016-03-30\n", 1203 | "get_recent_user_feat finsihed\n", 1204 | "get_user_cate_feature finished\n", 1205 | "get_accumulate_product_feat finsihed\n", 1206 | "get_accumulate_cate_feat finsihed\n", 1207 | "get_comments_product_feat finished\n", 1208 | "get labels\n", 1209 | "round 27/34 over!\n", 1210 | "2016-03-28\n", 1211 | "2016-03-31\n", 1212 | "get_recent_user_feat finsihed\n", 1213 | "get_user_cate_feature finished\n", 1214 | "get_accumulate_product_feat finsihed\n", 1215 | "get_accumulate_cate_feat finsihed\n", 1216 | "get_comments_product_feat finished\n", 1217 | "get labels\n", 1218 | "round 28/34 over!\n", 1219 | "2016-03-29\n", 1220 | "2016-04-01\n", 1221 | "get_recent_user_feat finsihed\n", 1222 | "get_user_cate_feature finished\n", 1223 | "get_accumulate_product_feat finsihed\n", 1224 | "get_accumulate_cate_feat finsihed\n", 1225 | 
"get_comments_product_feat finished\n", 1226 | "get labels\n", 1227 | "round 29/34 over!\n", 1228 | "2016-03-30\n", 1229 | "2016-04-02\n", 1230 | "get_recent_user_feat finsihed\n", 1231 | "get_user_cate_feature finished\n", 1232 | "get_accumulate_product_feat finsihed\n", 1233 | "get_accumulate_cate_feat finsihed\n", 1234 | "get_comments_product_feat finished\n", 1235 | "get labels\n", 1236 | "round 30/34 over!\n", 1237 | "2016-03-31\n", 1238 | "2016-04-03\n", 1239 | "get_recent_user_feat finsihed\n", 1240 | "get_user_cate_feature finished\n", 1241 | "get_accumulate_product_feat finsihed\n", 1242 | "get_accumulate_cate_feat finsihed\n", 1243 | "get_comments_product_feat finished\n", 1244 | "get labels\n", 1245 | "round 31/34 over!\n", 1246 | "2016-04-01\n", 1247 | "2016-04-04\n", 1248 | "get_recent_user_feat finsihed\n", 1249 | "get_user_cate_feature finished\n", 1250 | "get_accumulate_product_feat finsihed\n", 1251 | "get_accumulate_cate_feat finsihed\n", 1252 | "get_comments_product_feat finished\n", 1253 | "get labels\n", 1254 | "round 32/34 over!\n", 1255 | "2016-04-02\n", 1256 | "2016-04-05\n", 1257 | "get_recent_user_feat finsihed\n", 1258 | "get_user_cate_feature finished\n", 1259 | "get_accumulate_product_feat finsihed\n", 1260 | "get_accumulate_cate_feat finsihed\n", 1261 | "get_comments_product_feat finished\n", 1262 | "get labels\n", 1263 | "round 33/34 over!\n", 1264 | "2016-04-03\n", 1265 | "2016-04-06\n", 1266 | "get_recent_user_feat finsihed\n", 1267 | "get_user_cate_feature finished\n", 1268 | "get_accumulate_product_feat finsihed\n", 1269 | "get_accumulate_cate_feat finsihed\n", 1270 | "get_comments_product_feat finished\n", 1271 | "get labels\n", 1272 | "round 34/34 over!\n" 1273 | ] 1274 | } 1275 | ], 1276 | "source": [ 1277 | "# training set\n", 1278 | "train_start_date = '2016-03-01'\n", 1279 | "make_train_set(train_start_date, 34, 'train_set.csv')" 1280 | ] 1281 | }, 1282 | { 1283 | "cell_type": "markdown", 1284 | "metadata": {}, 1285 | "source": [ 
1286 | "Build the validation sets (offline test sets)" 1287 | ] 1288 | }, 1289 | { 1290 | "cell_type": "code", 1291 | "execution_count": 17, 1292 | "metadata": { 1293 | "ExecuteTime": { 1294 | "end_time": "2017-05-16T22:34:29.585821Z", 1295 | "start_time": "2017-05-16T22:34:29.546019Z" 1296 | }, 1297 | "collapsed": true 1298 | }, 1299 | "outputs": [], 1300 | "source": [ 1301 | "def make_val_answer(val_start_date, val_end_date, all_actions, label_val_s1_path):\n", 1302 | " actions = get_actions(val_start_date, val_end_date,all_actions)\n", 1303 | " actions = actions[(actions['type'] == 4) & (actions['cate'] == 8)]\n", 1304 | " actions = actions[['user_id', 'sku_id']]\n", 1305 | " actions = actions.drop_duplicates()\n", 1306 | " actions.to_csv(label_val_s1_path, index=False)\n", 1307 | "\n", 1308 | "def make_val_set(train_start_date, train_end_date, val_s1_path):\n", 1309 | " # adjust the time span\n", 1310 | " start_days = datetime.strptime(train_end_date, '%Y-%m-%d') - timedelta(days=30)\n", 1311 | " start_days = start_days.strftime('%Y-%m-%d')\n", 1312 | " all_actions = get_all_action()\n", 1313 | " print \"get all actions!\"\n", 1314 | " user = get_basic_user_feat()\n", 1315 | " print 'get_basic_user_feat finsihed'\n", 1316 | " \n", 1317 | " product = get_basic_product_feat()\n", 1318 | " print 'get_basic_product_feat finsihed'\n", 1319 | "# user_acc = get_accumulate_user_feat(train_end_date,all_actions,30)\n", 1320 | "# print 'get_accumulate_user_feat finished'\n", 1321 | " user_acc = get_recent_user_feat(train_end_date, all_actions)\n", 1322 | " print 'get_recent_user_feat finsihed'\n", 1323 | " user_cate = get_user_cate_feature(train_start_date, train_end_date, all_actions)\n", 1324 | " print 'get_user_cate_feature finished'\n", 1325 | " \n", 1326 | " product_acc = get_accumulate_product_feat(start_days, train_end_date, all_actions)\n", 1327 | " print 'get_accumulate_product_feat finsihed'\n", 1328 | " cate_acc = get_accumulate_cate_feat(start_days, train_end_date, all_actions)\n", 1329 | " print 
'get_accumulate_cate_feat finsihed'\n", 1330 | " comment_acc = get_comments_product_feat(train_end_date)\n", 1331 | " print 'get_comments_product_feat finished'\n", 1332 | " \n", 1333 | " actions = None\n", 1334 | " for i in (3, 5, 7, 10, 15, 21, 30):\n", 1335 | " start_days = datetime.strptime(train_end_date, '%Y-%m-%d') - timedelta(days=i)\n", 1336 | " start_days = start_days.strftime('%Y-%m-%d')\n", 1337 | " if actions is None:\n", 1338 | " actions = get_action_feat(start_days, train_end_date, all_actions,i)\n", 1339 | " else:\n", 1340 | " actions = pd.merge(actions, get_action_feat(start_days, train_end_date,all_actions,i), how='left',\n", 1341 | " on=['user_id', 'sku_id', 'cate'])\n", 1342 | "\n", 1343 | " actions = pd.merge(actions, user, how='left', on='user_id')\n", 1344 | " actions = pd.merge(actions, user_acc, how='left', on='user_id')\n", 1345 | " actions = pd.merge(actions, user_cate, how='left', on='user_id')\n", 1346 | " # note the merge keys here\n", 1347 | " actions = pd.merge(actions, product, how='left', on=['sku_id', 'cate'])\n", 1348 | " actions = pd.merge(actions, product_acc, how='left', on='sku_id')\n", 1349 | " actions = pd.merge(actions, cate_acc, how='left', on='cate')\n", 1350 | " actions = pd.merge(actions, comment_acc, how='left', on='sku_id')\n", 1351 | " actions = actions.fillna(0)\n", 1352 | " \n", 1353 | " \n", 1354 | "# print actions\n", 1355 | " # build the ground-truth purchases for later validation\n", 1356 | " val_start_date = train_end_date\n", 1357 | " val_end_date = datetime.strptime(val_start_date, '%Y-%m-%d') + timedelta(days=5)\n", 1358 | " val_end_date = val_end_date.strftime('%Y-%m-%d')\n", 1359 | " make_val_answer(val_start_date, val_end_date, all_actions, 'label_'+val_s1_path)\n", 1360 | " \n", 1361 | " actions.to_csv(val_s1_path, index=False)\n" 1362 | ] 1363 | }, 1364 | { 1365 | "cell_type": "code", 1366 | "execution_count": 18, 1367 | "metadata": { 1368 | "ExecuteTime": { 1369 | "end_time": "2017-05-16T22:44:56.918322Z", 1370 | "start_time": 
"2017-05-16T22:34:29.587106Z" 1371 | }, 1372 | "collapsed": false 1373 | }, 1374 | "outputs": [ 1375 | { 1376 | "name": "stdout", 1377 | "output_type": "stream", 1378 | "text": [ 1379 | "get all actions!\n", 1380 | "get_basic_user_feat finsihed\n", 1381 | "get_basic_product_feat finsihed\n", 1382 | "get_recent_user_feat finsihed\n", 1383 | "get_user_cate_feature finished\n", 1384 | "get_accumulate_product_feat finsihed\n", 1385 | "get_accumulate_cate_feat finsihed\n", 1386 | "get_comments_product_feat finished\n", 1387 | "get all actions!\n", 1388 | "get_basic_user_feat finsihed\n", 1389 | "get_basic_product_feat finsihed\n", 1390 | "get_recent_user_feat finsihed\n", 1391 | "get_user_cate_feature finished\n", 1392 | "get_accumulate_product_feat finsihed\n", 1393 | "get_accumulate_cate_feat finsihed\n", 1394 | "get_comments_product_feat finished\n", 1395 | "get all actions!\n", 1396 | "get_basic_user_feat finsihed\n", 1397 | "get_basic_product_feat finsihed\n", 1398 | "get_recent_user_feat finsihed\n", 1399 | "get_user_cate_feature finished\n", 1400 | "get_accumulate_product_feat finsihed\n", 1401 | "get_accumulate_cate_feat finsihed\n", 1402 | "get_comments_product_feat finished\n" 1403 | ] 1404 | } 1405 | ], 1406 | "source": [ 1407 | "# validation sets\n", 1408 | "# train_start_date = '2016-04-06'\n", 1409 | "# make_train_set(train_start_date, 3, 'val_set.csv')\n", 1410 | "make_val_set('2016-04-06', '2016-04-09', 'val_1.csv')\n", 1411 | "make_val_set('2016-04-07', '2016-04-10', 'val_2.csv')\n", 1412 | "make_val_set('2016-04-08', '2016-04-11', 'val_3.csv')" 1413 | ] 1414 | }, 1415 | { 1416 | "cell_type": "markdown", 1417 | "metadata": {}, 1418 | "source": [ 1419 | "### Building the test set" 1420 | ] 1421 | }, 1422 | { 1423 | "cell_type": "code", 1424 | "execution_count": 19, 1425 | "metadata": { 1426 | "ExecuteTime": { 1427 | "end_time": "2017-05-16T22:44:57.248609Z", 1428 | "start_time": "2017-05-16T22:44:57.023515Z" 1429 | }, 1430 | "collapsed": true 1431 | }, 1432 | "outputs": [], 1433 | 
"source": [ 1434 | "def make_test_set(train_start_date, train_end_date):\n", 1435 | " start_days = datetime.strptime(train_end_date, '%Y-%m-%d') - timedelta(days=30)\n", 1436 | " start_days = start_days.strftime('%Y-%m-%d')\n", 1437 | " all_actions = get_all_action()\n", 1438 | " print \"get all actions!\"\n", 1439 | " user = get_basic_user_feat()\n", 1440 | " print 'get_basic_user_feat finsihed'\n", 1441 | " product = get_basic_product_feat()\n", 1442 | " print 'get_basic_product_feat finsihed'\n", 1443 | " \n", 1444 | " user_acc = get_recent_user_feat(train_end_date, all_actions)\n", 1445 | " print 'get_accumulate_user_feat finsihed'\n", 1446 | " \n", 1447 | " user_cate = get_user_cate_feature(train_start_date, train_end_date, all_actions)\n", 1448 | " print 'get_user_cate_feature finished'\n", 1449 | " \n", 1450 | " product_acc = get_accumulate_product_feat(start_days, train_end_date, all_actions)\n", 1451 | " print 'get_accumulate_product_feat finsihed'\n", 1452 | " cate_acc = get_accumulate_cate_feat(start_days, train_end_date, all_actions)\n", 1453 | " print 'get_accumulate_cate_feat finsihed'\n", 1454 | " comment_acc = get_comments_product_feat(train_end_date)\n", 1455 | "\n", 1456 | " actions = None\n", 1457 | " for i in (3, 5, 7, 10, 15, 21, 30):\n", 1458 | " start_days = datetime.strptime(train_end_date, '%Y-%m-%d') - timedelta(days=i)\n", 1459 | " start_days = start_days.strftime('%Y-%m-%d')\n", 1460 | " if actions is None:\n", 1461 | " actions = get_action_feat(start_days, train_end_date, all_actions,i)\n", 1462 | " else:\n", 1463 | " actions = pd.merge(actions, get_action_feat(start_days, train_end_date,all_actions,i), how='left',\n", 1464 | " on=['user_id', 'sku_id', 'cate'])\n", 1465 | "\n", 1466 | " actions = pd.merge(actions, user, how='left', on='user_id')\n", 1467 | " actions = pd.merge(actions, user_acc, how='left', on='user_id')\n", 1468 | " actions = pd.merge(actions, user_cate, how='left', on='user_id')\n", 1469 | " # note the merge keys here\n", 1470 | " 
actions = pd.merge(actions, product, how='left', on=['sku_id', 'cate'])\n", 1471 | " actions = pd.merge(actions, product_acc, how='left', on='sku_id')\n", 1472 | " actions = pd.merge(actions, cate_acc, how='left', on='cate')\n", 1473 | " actions = pd.merge(actions, comment_acc, how='left', on='sku_id')\n", 1474 | "\n", 1475 | " actions = actions.fillna(0)\n", 1476 | " \n", 1477 | "\n", 1478 | " actions.to_csv(\"test_set.csv\", index=False)" 1479 | ] 1480 | }, 1481 | { 1482 | "cell_type": "markdown", 1483 | "metadata": {}, 1484 | "source": [ 1485 | "For the three days 4.13~4.16 the comment records apparently never contain a zero count, which caused an error when building the test set\n", 1486 | "\n", 1487 | "`KeyError: \"['comment_num_0'] not in index\"`" 1488 | ] 1489 | }, 1490 | { 1491 | "cell_type": "code", 1492 | "execution_count": 20, 1493 | "metadata": { 1494 | "ExecuteTime": { 1495 | "end_time": "2017-05-16T22:48:22.214337Z", 1496 | "start_time": "2017-05-16T22:44:57.250749Z" 1497 | }, 1498 | "collapsed": false 1499 | }, 1500 | "outputs": [ 1501 | { 1502 | "name": "stdout", 1503 | "output_type": "stream", 1504 | "text": [ 1505 | "get all actions!\n", 1506 | "get_basic_user_feat finsihed\n", 1507 | "get_basic_product_feat finsihed\n", 1508 | "get_accumulate_user_feat finsihed\n", 1509 | "get_user_cate_feature finished\n", 1510 | "get_accumulate_product_feat finsihed\n", 1511 | "get_accumulate_cate_feat finsihed\n" 1512 | ] 1513 | } 1514 | ], 1515 | "source": [ 1516 | "# build the test (submission) set\n", 1517 | "sub_start_date = '2016-04-13'\n", 1518 | "sub_end_date = '2016-04-16'\n", 1519 | "make_test_set(sub_start_date, sub_end_date)" 1520 | ] 1521 | } 1522 | ], 1523 | "metadata": { 1524 | "kernelspec": { 1525 | "display_name": "Python 2", 1526 | "language": "python", 1527 | "name": "python2" 1528 | }, 1529 | "language_info": { 1530 | "codemirror_mode": { 1531 | "name": "ipython", 1532 | "version": 2 1533 | }, 1534 | "file_extension": ".py", 1535 | "mimetype": "text/x-python", 1536 | "name": "python", 1537 | "nbconvert_exporter": "python", 1538 | "pygments_lexer": "ipython2", 
1539 | "version": "2.7.13" 1540 | } 1541 | }, 1542 | "nbformat": 4, 1543 | "nbformat_minor": 2 1544 | } 1545 | -------------------------------------------------------------------------------- /data_cleaning_wp.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# JD JData Big Data Competition (1) - Data Cleaning" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "The competition task is to predict the purchase intent of high-potential users. From a machine-learning perspective this is a binary classification problem, so we need to construct our own positive and negative samples.\n", 15 | "Since the raw data contains a lot of noise, the first step is data cleaning, for example:\n", 16 | "\n", 17 | "* check for and remove duplicate records\n", 18 | "* users and items with no interactions\n", 19 | "* drop users with very high browse counts but very few purchases (window shoppers or crawlers)\n", 20 | "* ......" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "To support the cleaning above, we first build simple user and item behaviour features, stored in two tables, user_table and item_table\n", 28 | "* user_table features include:\n", 29 | " user_id, age, sex,\n", 30 | " user_lv_cd (user level), browse_num (view count),\n", 31 | " addcart_num (add-to-cart count), delcart_num (cart-removal count),\n", 32 | " buy_num (purchase count), favor_num (favourite count),\n", 33 | " click_num (click count), buy_addcart_ratio (cart-to-purchase conversion),\n", 34 | " buy_browse_ratio (view-to-purchase conversion),\n", 35 | " buy_click_ratio (click-to-purchase conversion),\n", 36 | " buy_favor_ratio (favourite-to-purchase conversion)\n", 37 | " \n", 38 | "* item_table features include:\n", 39 | " sku_id (item id), attr1, attr2,\n", 40 | " attr3, cate, brand, browse_num,\n", 41 | " addcart_num, delcart_num,\n", 42 | " buy_num, favor_num, click_num,\n", 43 | " buy_addcart_ratio, buy_browse_ratio,\n", 44 | " buy_click_ratio, buy_favor_ratio,\n", 45 | " comment_num (comment count),\n", 46 | " has_bad_comment (whether there are negative comments),\n", 47 | " bad_comment_rate (negative-comment rate)" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "## Dataset notes" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "The datasets involved are JD's latest release:\n", 62 | "1. JData_User.csv - user data, 105,321 users\n", 63 | "2. JData_Comment.csv - product comments, 558,552 records\n", 64 | "3. 
JData_Product.csv - candidate products to predict, 24,187 records\n", 65 | "4. JData_Action_201602.csv - February interaction log, 11,485,424 records\n", 66 | "5. JData_Action_201603.csv - March interaction log, 25,916,378 records\n", 67 | "6. JData_Action_201604.csv - April interaction log, 13,199,934 records" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "## Dataset validation" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "### First, check that the users in JData_User and JData_Action are consistent\n", 82 | "Make sure every action in the behaviour data was produced by a user in the user table (though a user may have no actions at all)" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "Idea: use pd.merge to join the ids with those in the Action table, and check whether the Action row count shrinks\n", 90 | "Example:" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 1, 96 | "metadata": { 97 | "collapsed": false 98 | }, 99 | "outputs": [], 100 | "source": [ 101 | "import pandas as pd\n", 102 | "# test sample\n", 103 | "# df1 = pd.DataFrame({'sku':['a','a','b','c'],'data':[1,1,2,3]})\n", 104 | "# df2 = pd.DataFrame({'sku':['a','b','c']})\n", 105 | "# df3 = pd.DataFrame({'sku':['a','b','d']})\n", 106 | "# df4 = pd.DataFrame({'sku':['a','b','c','d']})\n", 107 | "# print pd.merge(df2,df1)\n", 108 | "# # print pd.merge(df1,df2)\n", 109 | "# print pd.merge(df3,df1)\n", 110 | "# print pd.merge(df4,df1)\n", 111 | "# # print pd.merge(df1,df3)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 2, 117 | "metadata": { 118 | "collapsed": false 119 | }, 120 | "outputs": [ 121 | { 122 | "name": "stdout", 123 | "output_type": "stream", 124 | "text": [ 125 | "Is action of Feb. from User file? True\n", 126 | "Is action of Mar. from User file? True\n", 127 | "Is action of Apr. from User file? 
True\n" 128 | ] 129 | } 130 | ], 131 | "source": [ 132 | "def user_action_check():\n", 133 | " df_user = pd.read_csv('data/JData_User.csv')\n", 134 | " df_sku = df_user.loc[:,'user_id'].to_frame()\n", 135 | " df_month2 = pd.read_csv('data/JData_Action_201602.csv')\n", 136 | " print 'Is action of Feb. from User file? ', len(df_month2) == len(pd.merge(df_sku,df_month2))\n", 137 | " df_month3 = pd.read_csv('data/JData_Action_201603.csv')\n", 138 | " print 'Is action of Mar. from User file? ', len(df_month3) == len(pd.merge(df_sku,df_month3))\n", 139 | " df_month4 = pd.read_csv('data/JData_Action_201604.csv')\n", 140 | " print 'Is action of Apr. from User file? ', len(df_month4) == len(pd.merge(df_sku,df_month4))\n", 141 | "\n", 142 | "user_action_check()" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "Conclusion: the users in the User dataset and those in the interaction data are fully consistent\n", 150 | "\n", 151 | "*Comparing row counts before and after the merge guarantees that the user IDs in Action are a subset of those in User*" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "### Check for duplicate records\n", 159 | "Removing fully duplicated records from each data file actually caused a large drop in the online score; a likely explanation is that the duplicates are meaningful, e.g. a user buying several items at once, or adding several units of an item to the cart..." 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 8, 165 | "metadata": { 166 | "collapsed": false 167 | }, 168 | "outputs": [], 169 | "source": [ 170 | "def deduplicate(filepath, filename, newpath):\n", 171 | " df_file = pd.read_csv(filepath) \n", 172 | " before = df_file.shape[0]\n", 173 | " df_file.drop_duplicates(inplace=True)\n", 174 | " after = df_file.shape[0]\n", 175 | " n_dup = before-after\n", 176 | " print 'No. 
of duplicate records for ' + filename + ' is: ' + str(n_dup)\n", 177 | " if n_dup != 0:\n", 178 | " df_file.to_csv(newpath, index=None)\n", 179 | " else:\n", 180 | " print 'no duplicate records in ' + filename" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 9, 186 | "metadata": { 187 | "collapsed": false 188 | }, 189 | "outputs": [ 190 | { 191 | "name": "stdout", 192 | "output_type": "stream", 193 | "text": [ 194 | "No. of duplicate records for Mar. action is: 7085038\n", 195 | "No. of duplicate records for Feb. action is: 3672710\n", 196 | "No. of duplicate records for Comment is: 0\n", 197 | "no duplicate records in Comment\n", 198 | "No. of duplicate records for Product is: 0\n", 199 | "no duplicate records in Product\n", 200 | "No. of duplicate records for User is: 0\n", 201 | "no duplicate records in User\n" 202 | ] 203 | } 204 | ], 205 | "source": [ 206 | "# deduplicate('data/JData_Action_201602.csv', 'Feb. action', 'data/JData_Action_201602_dedup.csv')\n", 207 | "deduplicate('data/JData_Action_201603.csv', 'Mar. action', 'data/JData_Action_201603_dedup.csv')\n", 208 | "deduplicate('data/JData_Action_201604.csv', 'Feb. action', 'data/JData_Action_201604_dedup.csv')\n", 209 | "deduplicate('data/JData_Comment.csv', 'Comment', 'data/JData_Comment_dedup.csv')\n", 210 | "deduplicate('data/JData_Product.csv', 'Product', 'data/JData_Product_dedup.csv')\n", 211 | "deduplicate('data/JData_User.csv', 'User', 'data/JData_User_dedup.csv')" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 31, 217 | "metadata": { 218 | "collapsed": false 219 | }, 220 | "outputs": [ 221 | { 222 | "data": { 223 | "text/html": [ 224 | "
\n", 225 | "\n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | "
user_idsku_idtimemodel_idcatebrand
type
1217637821763782176378021763782176378
26366366360636636
3146414641464014641464
437373703737
5198119811981019811981
6575597575597575597545054575597575597
\n", 303 | "
" 304 | ], 305 | "text/plain": [ 306 | " user_id sku_id time model_id cate brand\n", 307 | "type \n", 308 | "1 2176378 2176378 2176378 0 2176378 2176378\n", 309 | "2 636 636 636 0 636 636\n", 310 | "3 1464 1464 1464 0 1464 1464\n", 311 | "4 37 37 37 0 37 37\n", 312 | "5 1981 1981 1981 0 1981 1981\n", 313 | "6 575597 575597 575597 545054 575597 575597" 314 | ] 315 | }, 316 | "execution_count": 31, 317 | "metadata": {}, 318 | "output_type": "execute_result" 319 | } 320 | ], 321 | "source": [ 322 | "IsDuplicated = df_month.duplicated()\n", 323 | "df_d = df_month[IsDuplicated]\n", 324 | "df_d.groupby('type').count()  # most duplicates come from browse (type 1) and click (type 6) actions" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": {}, 330 | "source": [ 331 | "### Check for users registered after 2016-04-15" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": 6, 337 | "metadata": { 338 | "collapsed": false 339 | }, 340 | "outputs": [ 341 | { 342 | "data": { 343 | "text/html": [ 344 | "
\n", 345 | "\n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | 
" \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | "
user_idagesexuser_lv_cduser_reg_tm
7457207458-12.012016-04-15
746320746426-35岁2.022016-04-15
746720746836-45岁2.032016-04-15
7472207473-12.012016-04-15
748220748326-35岁2.032016-04-15
749220749316-25岁2.032016-04-15
749320749416-25岁2.032016-04-15
750320750416-25岁2.042016-04-15
751020751146-55岁2.052016-04-15
7512207513-12.012016-04-15
751820751926-35岁2.022016-04-15
752120752226-35岁0.032016-04-15
7525207526-12.032016-04-15
7533207534-12.012016-04-15
754320754426-35岁2.032016-04-15
7544207545-12.012016-04-15
755120755226-35岁2.032016-04-15
755320755416-25岁2.042016-04-15
854520854616-25岁0.022016-04-29
939420939516-25岁1.022016-05-11
1036221036356岁以上2.022016-05-24
10367210368-12.012016-05-24
1101921102036-45岁2.032016-06-06
1201421201536-45岁2.022016-07-05
1385021385126-35岁2.032016-09-11
14542214543-12.012016-10-05
1674621674716-25岁2.012016-11-25
\n", 575 | "
" 576 | ], 577 | "text/plain": [ 578 | " user_id age sex user_lv_cd user_reg_tm\n", 579 | "7457 207458 -1 2.0 1 2016-04-15\n", 580 | "7463 207464 26-35岁 2.0 2 2016-04-15\n", 581 | "7467 207468 36-45岁 2.0 3 2016-04-15\n", 582 | "7472 207473 -1 2.0 1 2016-04-15\n", 583 | "7482 207483 26-35岁 2.0 3 2016-04-15\n", 584 | "7492 207493 16-25岁 2.0 3 2016-04-15\n", 585 | "7493 207494 16-25岁 2.0 3 2016-04-15\n", 586 | "7503 207504 16-25岁 2.0 4 2016-04-15\n", 587 | "7510 207511 46-55岁 2.0 5 2016-04-15\n", 588 | "7512 207513 -1 2.0 1 2016-04-15\n", 589 | "7518 207519 26-35岁 2.0 2 2016-04-15\n", 590 | "7521 207522 26-35岁 0.0 3 2016-04-15\n", 591 | "7525 207526 -1 2.0 3 2016-04-15\n", 592 | "7533 207534 -1 2.0 1 2016-04-15\n", 593 | "7543 207544 26-35岁 2.0 3 2016-04-15\n", 594 | "7544 207545 -1 2.0 1 2016-04-15\n", 595 | "7551 207552 26-35岁 2.0 3 2016-04-15\n", 596 | "7553 207554 16-25岁 2.0 4 2016-04-15\n", 597 | "8545 208546 16-25岁 0.0 2 2016-04-29\n", 598 | "9394 209395 16-25岁 1.0 2 2016-05-11\n", 599 | "10362 210363 56岁以上 2.0 2 2016-05-24\n", 600 | "10367 210368 -1 2.0 1 2016-05-24\n", 601 | "11019 211020 36-45岁 2.0 3 2016-06-06\n", 602 | "12014 212015 36-45岁 2.0 2 2016-07-05\n", 603 | "13850 213851 26-35岁 2.0 3 2016-09-11\n", 604 | "14542 214543 -1 2.0 1 2016-10-05\n", 605 | "16746 216747 16-25岁 2.0 1 2016-11-25" 606 | ] 607 | }, 608 | "execution_count": 6, 609 | "metadata": {}, 610 | "output_type": "execute_result" 611 | } 612 | ], 613 | "source": [ 614 | "import pandas as pd\n", 615 | "df_user = pd.read_csv('data/JData_User.csv', encoding='gbk')\n", 616 | "df_user['user_reg_tm'] = pd.to_datetime(df_user['user_reg_tm'])\n", 617 | "df_user.loc[df_user.user_reg_tm >= '2016-4-15']" 618 | ] 619 | }, 620 | { 621 | "cell_type": "markdown", 622 | "metadata": {}, 623 | "source": [ 624 | "These registration dates are the result of a JD system error; if the action data contain no records after April 15, these are still normal users and do not need to be deleted." 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": 9, 630 | "metadata": { 631 | "collapsed": false 632 | }, 633
| "outputs": [ 634 | { 635 | "data": { 636 | "text/html": [ 637 | "
\n", 638 | "\n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | "
user_idsku_idtimemodel_idtypecatebrand
\n", 654 | "
" 655 | ], 656 | "text/plain": [ 657 | "Empty DataFrame\n", 658 | "Columns: [user_id, sku_id, time, model_id, type, cate, brand]\n", 659 | "Index: []" 660 | ] 661 | }, 662 | "execution_count": 9, 663 | "metadata": {}, 664 | "output_type": "execute_result" 665 | } 666 | ], 667 | "source": [ 668 | "df_month = pd.read_csv('data/JData_Action_201604.csv')\n", 669 | "df_month['time'] = pd.to_datetime(df_month['time'])\n", 670 | "df_month.loc[df_month.time >= '2016-4-16']" 671 | ] 672 | }, 673 | { 674 | "cell_type": "markdown", 675 | "metadata": {}, 676 | "source": [ 677 | "Conclusion: these users have no anomalous action data, so this batch of users is kept" 678 | ] 679 | }, 680 | { 681 | "cell_type": "markdown", 682 | "metadata": {}, 683 | "source": [ 684 | "### user_id in the action data is stored as float; convert it to int" 685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": 47, 690 | "metadata": { 691 | "collapsed": false 692 | }, 693 | "outputs": [ 694 | { 695 | "name": "stdout", 696 | "output_type": "stream", 697 | "text": [ 698 | "int64\n", 699 | "int64\n", 700 | "int64\n" 701 | ] 702 | } 703 | ], 704 | "source": [ 705 | "import pandas as pd\n", 706 | "df_month = pd.read_csv('data/JData_Action_201602.csv')\n", 707 | "df_month['user_id'] = df_month['user_id'].astype(int)\n", 708 | "print(df_month['user_id'].dtype)\n", 709 | "df_month.to_csv('data/JData_Action_201602.csv', index=None)\n", 710 | "df_month = pd.read_csv('data/JData_Action_201603.csv')\n", 711 | "df_month['user_id'] = df_month['user_id'].astype(int)\n", 712 | "print(df_month['user_id'].dtype)\n", 713 | "df_month.to_csv('data/JData_Action_201603.csv', index=None)\n", 714 | "df_month = pd.read_csv('data/JData_Action_201604.csv')\n", 715 | "df_month['user_id'] = df_month['user_id'].astype(int)\n", 716 | "print(df_month['user_id'].dtype)\n", 717 | "df_month.to_csv('data/JData_Action_201604.csv', index=None)" 718 | ] 719 | }, 720 | { 721 | "cell_type": "markdown", 722 | "metadata": {}, 723 | "source": [ 724 | "### Handling the age ranges" 725 | ] 726 | }, 727
| { 728 | "cell_type": "code", 729 | "execution_count": 35, 730 | "metadata": { 731 | "collapsed": false 732 | }, 733 | "outputs": [ 734 | { 735 | "name": "stdout", 736 | "output_type": "stream", 737 | "text": [ 738 | " user_id sex user_lv_cd user_reg_tm\n", 739 | "age \n", 740 | "-1 14412 14412 14412 14412\n", 741 | "1 7 7 7 7\n", 742 | "2 8797 8797 8797 8797\n", 743 | "3 46570 46570 46570 46570\n", 744 | "4 30336 30336 30336 30336\n", 745 | "5 3325 3325 3325 3325\n", 746 | "6 1871 1871 1871 1871\n" 747 | ] 748 | } 749 | ], 750 | "source": [ 751 | "import pandas as pd\n", 752 | "df_user = pd.read_csv('data/JData_User.csv', encoding='gbk')\n", 753 | "\n", 754 | "def tranAge(x):\n", 755 | "    if x == u'15岁以下':\n", 756 | "        x = '1'\n", 757 | "    elif x == u'16-25岁':\n", 758 | "        x = '2'\n", 759 | "    elif x == u'26-35岁':\n", 760 | "        x = '3'\n", 761 | "    elif x == u'36-45岁':\n", 762 | "        x = '4'\n", 763 | "    elif x == u'46-55岁':\n", 764 | "        x = '5'\n", 765 | "    elif x == u'56岁以上':\n", 766 | "        x = '6'\n", 767 | "    return x\n", 768 | "df_user['age'] = df_user['age'].apply(tranAge)\n", 769 | "print(df_user.groupby(df_user['age']).count())\n", 770 | "df_user.to_csv('data/JData_User.csv', index=None)" 771 | ] 772 | }, 773 | { 774 | "cell_type": "markdown", 775 | "metadata": {}, 776 | "source": [ 777 | "### Build the User_table" 778 | ] 779 | }, 780 | { 781 | "cell_type": "code", 782 | "execution_count": 6, 783 | "metadata": { 784 | "collapsed": true 785 | }, 786 | "outputs": [], 787 | "source": [ 788 | "# define file names\n", 789 | "ACTION_201602_FILE = \"data/JData_Action_201602.csv\"\n", 790 | "ACTION_201603_FILE = \"data/JData_Action_201603.csv\"\n", 791 | "ACTION_201604_FILE = \"data/JData_Action_201604.csv\"\n", 792 | "COMMENT_FILE = \"data/JData_Comment.csv\"\n", 793 | "PRODUCT_FILE = \"data/JData_Product.csv\"\n", 794 | "USER_FILE = \"data/JData_User.csv\"\n", 795 | "USER_TABLE_FILE = \"data/User_table.csv\"\n", 796 | "ITEM_TABLE_FILE = \"data/Item_table.csv\"" 797 | ] 798 | }, 799 | { 800 | "cell_type": "code", 801 | 
"execution_count": 7, 802 | "metadata": { 803 | "collapsed": true 804 | }, 805 | "outputs": [], 806 | "source": [ 807 | "# import the required packages\n", 808 | "import pandas as pd\n", 809 | "import numpy as np\n", 810 | "from collections import Counter" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": 8, 816 | "metadata": { 817 | "collapsed": true 818 | }, 819 | "outputs": [], 820 | "source": [ 821 | "# helper: compute statistics for each per-user group\n", 822 | "def add_type_count(group):\n", 823 | "    behavior_type = group.type.astype(int)\n", 824 | "    # counts per user action type\n", 825 | "    type_cnt = Counter(behavior_type)\n", 826 | "    # 1: browse 2: add-to-cart 3: delete-from-cart\n", 827 | "    # 4: buy 5: favorite 6: click\n", 828 | "    group['browse_num'] = type_cnt[1]\n", 829 | "    group['addcart_num'] = type_cnt[2]\n", 830 | "    group['delcart_num'] = type_cnt[3]\n", 831 | "    group['buy_num'] = type_cnt[4]\n", 832 | "    group['favor_num'] = type_cnt[5]\n", 833 | "    group['click_num'] = type_cnt[6]\n", 834 | "\n", 835 | "    return group[['user_id', 'browse_num', 'addcart_num',\n", 836 | "                  'delcart_num', 'buy_num', 'favor_num',\n", 837 | "                  'click_num']]" 838 | ] 839 | }, 840 | { 841 | "cell_type": "markdown", 842 | "metadata": {}, 843 | "source": [ 844 | "The user action data are large, so reading them in at once may cause a MemoryError; therefore use pandas chunked reading."
845 | ] 846 | }, 847 | { 848 | "cell_type": "code", 849 | "execution_count": 9, 850 | "metadata": { 851 | "collapsed": true 852 | }, 853 | "outputs": [], 854 | "source": [ 855 | "# aggregate statistics over the action data\n", 856 | "# tune chunk_size to your available memory\n", 857 | "def get_from_action_data(fname, chunk_size=100000):\n", 858 | "    reader = pd.read_csv(fname, header=0, iterator=True)\n", 859 | "    chunks = []\n", 860 | "    loop = True\n", 861 | "    while loop:\n", 862 | "        try:\n", 863 | "            # read only the user_id and type columns\n", 864 | "            chunk = reader.get_chunk(chunk_size)[[\"user_id\", \"type\"]]\n", 865 | "            chunks.append(chunk)\n", 866 | "        except StopIteration:\n", 867 | "            loop = False\n", 868 | "            print(\"Iteration is stopped\")\n", 869 | "    # concatenate the chunks into a single pandas DataFrame\n", 870 | "    df_ac = pd.concat(chunks, ignore_index=True)\n", 871 | "    # group by user_id and aggregate each group; as_index=False returns the data without a group index\n", 872 | "    df_ac = df_ac.groupby(['user_id'], as_index=False).apply(add_type_count)\n", 873 | "    # drop the duplicated rows\n", 874 | "    df_ac = df_ac.drop_duplicates('user_id')\n", 875 | "\n", 876 | "    return df_ac" 877 | ] 878 | }, 879 | { 880 | "cell_type": "code", 881 | "execution_count": 10, 882 | "metadata": { 883 | "collapsed": true 884 | }, 885 | "outputs": [], 886 | "source": [ 887 | "# merge the statistics from the individual action tables\n", 888 | "def merge_action_data():\n", 889 | "    df_ac = []\n", 890 | "    df_ac.append(get_from_action_data(fname=ACTION_201602_FILE))\n", 891 | "    df_ac.append(get_from_action_data(fname=ACTION_201603_FILE))\n", 892 | "    df_ac.append(get_from_action_data(fname=ACTION_201604_FILE))\n", 893 | "\n", 894 | "    df_ac = pd.concat(df_ac, ignore_index=True)\n", 895 | "    # sum each user's statistics across the action tables\n", 896 | "    df_ac = df_ac.groupby(['user_id'], as_index=False).sum()\n", 897 | "    # build the conversion-rate fields\n", 898 | "    df_ac['buy_addcart_ratio'] = df_ac['buy_num'] / df_ac['addcart_num']\n", 899 | "    df_ac['buy_browse_ratio'] = df_ac['buy_num'] / df_ac['browse_num']\n", 900 | "    df_ac['buy_click_ratio'] = df_ac['buy_num'] / df_ac['click_num']\n", 901 | "    df_ac['buy_favor_ratio'] = df_ac['buy_num'] 
/ df_ac['favor_num']\n", 902 | "    \n", 903 | "    # cap conversion rates greater than 1 at 1 (100%)\n", 904 | "    df_ac.loc[df_ac['buy_addcart_ratio'] > 1., 'buy_addcart_ratio'] = 1.\n", 905 | "    df_ac.loc[df_ac['buy_browse_ratio'] > 1., 'buy_browse_ratio'] = 1.\n", 906 | "    df_ac.loc[df_ac['buy_click_ratio'] > 1., 'buy_click_ratio'] = 1.\n", 907 | "    df_ac.loc[df_ac['buy_favor_ratio'] > 1., 'buy_favor_ratio'] = 1.\n", 908 | "\n", 909 | "    return df_ac" 910 | ] 911 | }, 912 | { 913 | "cell_type": "code", 914 | "execution_count": 11, 915 | "metadata": { 916 | "collapsed": true 917 | }, 918 | "outputs": [], 919 | "source": [ 920 | "# extract the needed columns from the JData_User table\n", 921 | "def get_from_jdata_user():\n", 922 | "    df_usr = pd.read_csv(USER_FILE, header=0)\n", 923 | "    df_usr = df_usr[[\"user_id\", \"age\", \"sex\", \"user_lv_cd\"]]\n", 924 | "    return df_usr" 925 | ] 926 | }, 927 | { 928 | "cell_type": "code", 929 | "execution_count": 12, 930 | "metadata": { 931 | "collapsed": false 932 | }, 933 | "outputs": [ 934 | { 935 | "name": "stdout", 936 | "output_type": "stream", 937 | "text": [ 938 | "Iteration is stopped\n", 939 | "Iteration is stopped\n", 940 | "Iteration is stopped\n" 941 | ] 942 | } 943 | ], 944 | "source": [ 945 | "user_base = get_from_jdata_user()\n", 946 | "user_behavior = merge_action_data()\n", 947 | "\n", 948 | "# join into one table, like a SQL left join\n", 949 | "user_behavior = pd.merge(user_base, user_behavior, on=['user_id'], how='left')\n", 950 | "# save as User_table.csv\n", 951 | "user_behavior.to_csv(USER_TABLE_FILE, index=False)" 952 | ] 953 | }, 954 | { 955 | "cell_type": "markdown", 956 | "metadata": {}, 957 | "source": [ 958 | "### Build the Item_table" 959 | ] 960 | }, 961 | { 962 | "cell_type": "code", 963 | "execution_count": 21, 964 | "metadata": { 965 | "collapsed": true 966 | }, 967 | "outputs": [], 968 | "source": [ 969 | "# define file names\n", 970 | "ACTION_201602_FILE = \"data/JData_Action_201602.csv\"\n", 971 | "ACTION_201603_FILE = \"data/JData_Action_201603.csv\"\n", 972 | "ACTION_201604_FILE = 
\"data/JData_Action_201604.csv\"\n", 973 | "COMMENT_FILE = \"data/JData_Comment.csv\"\n", 974 | "PRODUCT_FILE = \"data/JData_Product.csv\"\n", 975 | "USER_FILE = \"data/JData_User.csv\"\n", 976 | "USER_TABLE_FILE = \"data/User_table.csv\"\n", 977 | "ITEM_TABLE_FILE = \"data/Item_table.csv\"" 978 | ] 979 | }, 980 | { 981 | "cell_type": "code", 982 | "execution_count": 14, 983 | "metadata": { 984 | "collapsed": true 985 | }, 986 | "outputs": [], 987 | "source": [ 988 | "# import the required packages\n", 989 | "import pandas as pd\n", 990 | "import numpy as np\n", 991 | "from collections import Counter" 992 | ] 993 | }, 994 | { 995 | "cell_type": "code", 996 | "execution_count": 15, 997 | "metadata": { 998 | "collapsed": true 999 | }, 1000 | "outputs": [], 1001 | "source": [ 1002 | "# read the products from the Product file\n", 1003 | "def get_from_jdata_product():\n", 1004 | "    df_item = pd.read_csv(PRODUCT_FILE, header=0)\n", 1005 | "    return df_item" 1006 | ] 1007 | }, 1008 | { 1009 | "cell_type": "code", 1010 | "execution_count": 16, 1011 | "metadata": { 1012 | "collapsed": true 1013 | }, 1014 | "outputs": [], 1015 | "source": [ 1016 | "# compute statistics for each per-product group\n", 1017 | "def add_type_count(group):\n", 1018 | "    behavior_type = group.type.astype(int)\n", 1019 | "    type_cnt = Counter(behavior_type)\n", 1020 | "\n", 1021 | "    group['browse_num'] = type_cnt[1]\n", 1022 | "    group['addcart_num'] = type_cnt[2]\n", 1023 | "    group['delcart_num'] = type_cnt[3]\n", 1024 | "    group['buy_num'] = type_cnt[4]\n", 1025 | "    group['favor_num'] = type_cnt[5]\n", 1026 | "    group['click_num'] = type_cnt[6]\n", 1027 | "\n", 1028 | "    return group[['sku_id', 'browse_num', 'addcart_num',\n", 1029 | "                  'delcart_num', 'buy_num', 'favor_num',\n", 1030 | "                  'click_num']]\n" 1031 | ] 1032 | }, 1033 | { 1034 | "cell_type": "code", 1035 | "execution_count": 17, 1036 | "metadata": { 1037 | "collapsed": true 1038 | }, 1039 | "outputs": [], 1040 | "source": [ 1041 | "# aggregate statistics over the action data\n", 1042 | "def get_from_action_data(fname, chunk_size=100000):\n", 1043 | "    reader = pd.read_csv(fname, header=0, iterator=True)\n", 1044 | "    chunks = []\n", 1045 | "    loop = True\n", 1046 | "    while loop:\n", 1047 | "        try:\n", 1048 | "            chunk = reader.get_chunk(chunk_size)[[\"sku_id\", \"type\"]]\n", 1049 | "            chunks.append(chunk)\n", 1050 | "        except StopIteration:\n", 1051 | "            loop = False\n", 1052 | "            print(\"Iteration is stopped\")\n", 1053 | "\n", 1054 | "    df_ac = pd.concat(chunks, ignore_index=True)\n", 1055 | "\n", 1056 | "    df_ac = df_ac.groupby(['sku_id'], as_index=False).apply(add_type_count)\n", 1057 | "    # keep one row per sku_id\n", 1058 | "    df_ac = df_ac.drop_duplicates('sku_id')\n", 1059 | "\n", 1060 | "    return df_ac" 1061 | ] 1062 | }, 1063 | { 1064 | "cell_type": "code", 1065 | "execution_count": 18, 1066 | "metadata": { 1067 | "collapsed": true 1068 | }, 1069 | "outputs": [], 1070 | "source": [ 1071 | "# get the comment data per product; if a product has comments on two dates, keep the latest\n", 1072 | "def get_from_jdata_comment():\n", 1073 | "    df_cmt = pd.read_csv(COMMENT_FILE, header=0)\n", 1074 | "    df_cmt['dt'] = pd.to_datetime(df_cmt['dt'])\n", 1075 | "    # find latest comment index\n", 1076 | "    idx = df_cmt.groupby(['sku_id'])['dt'].transform('max') == df_cmt['dt']\n", 1077 | "    df_cmt = df_cmt[idx]\n", 1078 | "\n", 1079 | "    return df_cmt[['sku_id', 'comment_num',\n", 1080 | "                   'has_bad_comment', 'bad_comment_rate']]" 1081 | ] 1082 | }, 1083 | { 1084 | "cell_type": "code", 1085 | "execution_count": 19, 1086 | "metadata": { 1087 | "collapsed": true 1088 | }, 1089 | "outputs": [], 1090 | "source": [ 1091 | "def merge_action_data():\n", 1092 | "    df_ac = []\n", 1093 | "    df_ac.append(get_from_action_data(fname=ACTION_201602_FILE))\n", 1094 | "    df_ac.append(get_from_action_data(fname=ACTION_201603_FILE))\n", 1095 | "    df_ac.append(get_from_action_data(fname=ACTION_201604_FILE))\n", 1096 | "\n", 1097 | "    df_ac = pd.concat(df_ac, ignore_index=True)\n", 1098 | "    df_ac = df_ac.groupby(['sku_id'], as_index=False).sum()\n", 1099 | "\n", 1100 | "    df_ac['buy_addcart_ratio'] = df_ac['buy_num'] / 
df_ac['addcart_num']\n", 1101 | "    df_ac['buy_browse_ratio'] = df_ac['buy_num'] / df_ac['browse_num']\n", 1102 | "    df_ac['buy_click_ratio'] = df_ac['buy_num'] / df_ac['click_num']\n", 1103 | "    df_ac['buy_favor_ratio'] = df_ac['buy_num'] / df_ac['favor_num']\n", 1104 | "\n", 1105 | "    df_ac.loc[df_ac['buy_addcart_ratio'] > 1., 'buy_addcart_ratio'] = 1.\n", 1106 | "    df_ac.loc[df_ac['buy_browse_ratio'] > 1., 'buy_browse_ratio'] = 1.\n", 1107 | "    df_ac.loc[df_ac['buy_click_ratio'] > 1., 'buy_click_ratio'] = 1.\n", 1108 | "    df_ac.loc[df_ac['buy_favor_ratio'] > 1., 'buy_favor_ratio'] = 1.\n", 1109 | "\n", 1110 | "    return df_ac" 1111 | ] 1112 | }, 1113 | { 1114 | "cell_type": "code", 1115 | "execution_count": 22, 1116 | "metadata": { 1117 | "collapsed": false 1118 | }, 1119 | "outputs": [ 1120 | { 1121 | "name": "stdout", 1122 | "output_type": "stream", 1123 | "text": [ 1124 | "Iteration is stopped\n", 1125 | "Iteration is stopped\n", 1126 | "Iteration is stopped\n" 1127 | ] 1128 | } 1129 | ], 1130 | "source": [ 1131 | "# keep only the items that appear in the P (candidate product) set\n", 1132 | "item_base = get_from_jdata_product()\n", 1133 | "item_behavior = merge_action_data()\n", 1134 | "item_comment = get_from_jdata_comment()\n", 1135 | "\n", 1136 | "# SQL: left join\n", 1137 | "item_behavior = pd.merge(\n", 1138 | "    item_base, item_behavior, on=['sku_id'], how='left')\n", 1139 | "item_behavior = pd.merge(\n", 1140 | "    item_behavior, item_comment, on=['sku_id'], how='left')\n", 1141 | "\n", 1142 | "item_behavior.to_csv(ITEM_TABLE_FILE, index=False)" 1143 | ] 1144 | }, 1145 | { 1146 | "cell_type": "markdown", 1147 | "metadata": {}, 1148 | "source": [ 1149 | "### Data cleaning" 1150 | ] 1151 | }, 1152 | { 1153 | "cell_type": "markdown", 1154 | "metadata": {}, 1155 | "source": [ 1156 | "#### User cleaning" 1157 | ] 1158 | }, 1159 | { 1160 | "cell_type": "code", 1161 | "execution_count": 1, 1162 | "metadata": { 1163 | "collapsed": false 1164 | }, 1165 | "outputs": [ 1166 | { 1167 | "data": { 1168 | "text/html": [ 1169 | "
\n", 1170 | "\n", 1171 | " \n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " \n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | " \n", 1199 | " \n", 1200 | " \n", 1201 | " \n", 1202 | " \n", 1203 | " \n", 1204 | " \n", 1205 | " \n", 1206 | " \n", 1207 | " \n", 1208 | " \n", 1209 | " \n", 1210 | " \n", 1211 | " \n", 1212 | " \n", 1213 | " \n", 1214 | " \n", 1215 | " \n", 1216 | " \n", 1217 | " \n", 1218 | " \n", 1219 | " \n", 1220 | " \n", 1221 | " \n", 1222 | " \n", 1223 | " \n", 1224 | " \n", 1225 | " \n", 1226 | " \n", 1227 | " \n", 1228 | " \n", 1229 | " \n", 1230 | " \n", 1231 | " \n", 1232 | " \n", 1233 | " \n", 1234 | " \n", 1235 | " \n", 1236 | " \n", 1237 | " \n", 1238 | " \n", 1239 | " \n", 1240 | " \n", 1241 | " \n", 1242 | " \n", 1243 | " \n", 1244 | " \n", 1245 | " \n", 1246 | " \n", 1247 | " \n", 1248 | " \n", 1249 | " \n", 1250 | " \n", 1251 | " \n", 1252 | " \n", 1253 | " \n", 1254 | " \n", 1255 | " \n", 1256 | " \n", 1257 | " \n", 1258 | " \n", 1259 | " \n", 1260 | " \n", 1261 | " \n", 1262 | " \n", 1263 | " \n", 1264 | " \n", 1265 | " \n", 1266 | " \n", 1267 | " \n", 1268 | " \n", 1269 | " \n", 1270 | " \n", 1271 | " \n", 1272 | " \n", 1273 | " \n", 1274 | " \n", 1275 | " \n", 1276 | " \n", 1277 | " \n", 1278 | " \n", 1279 | " \n", 1280 | " \n", 1281 | " \n", 1282 | " \n", 1283 | " \n", 1284 | " \n", 1285 | " \n", 1286 | " \n", 1287 | " \n", 1288 | " \n", 1289 | " \n", 1290 | " \n", 1291 | " \n", 1292 | " \n", 1293 | " \n", 1294 | " \n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | " \n", 1310 | " \n", 1311 | " \n", 1312 | 
" \n", 1313 | " \n", 1314 | " \n", 1315 | " \n", 1316 | " \n", 1317 | " \n", 1318 | " \n", 1319 | " \n", 1320 | " \n", 1321 | " \n", 1322 | " \n", 1323 | " \n", 1324 | " \n", 1325 | " \n", 1326 | " \n", 1327 | " \n", 1328 | "
user_idagesexuser_lv_cdbrowse_numaddcart_numdelcart_numbuy_numfavor_numclick_numbuy_addcart_ratiobuy_browse_ratiobuy_click_ratiobuy_favor_ratio
count105,321.000105,318.000105,318.000105,321.000105,180.000105,180.000105,180.000105,180.000105,180.000105,180.00072,129.000105,172.000103,197.00045,986.000
mean252,661.0002.7731.1133.850180.4665.4712.4340.4591.045291.2220.1470.0050.0090.552
std30,403.6981.6720.9561.072273.43710.6185.6001.0483.442460.0310.2700.0220.0740.473
min200,001.000-1.0000.0001.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.000
25%226,331.0003.0000.0003.00040.0000.0000.0000.0000.00059.0000.0000.0000.0000.000
50%252,661.0003.0002.0004.00094.0002.0000.0000.0000.000148.0000.0000.0000.0001.000
75%278,991.0004.0002.0005.000212.0006.0003.0001.0000.000342.0000.1670.0020.0011.000
max305,321.0006.0002.0005.0007,605.000369.000231.00050.00099.00015,302.0001.0001.0001.0001.000
\n", 1329 | "
" 1330 | ], 1331 | "text/plain": [ 1332 | " user_id age sex user_lv_cd browse_num \\\n", 1333 | "count 105,321.000 105,318.000 105,318.000 105,321.000 105,180.000 \n", 1334 | "mean 252,661.000 2.773 1.113 3.850 180.466 \n", 1335 | "std 30,403.698 1.672 0.956 1.072 273.437 \n", 1336 | "min 200,001.000 -1.000 0.000 1.000 0.000 \n", 1337 | "25% 226,331.000 3.000 0.000 3.000 40.000 \n", 1338 | "50% 252,661.000 3.000 2.000 4.000 94.000 \n", 1339 | "75% 278,991.000 4.000 2.000 5.000 212.000 \n", 1340 | "max 305,321.000 6.000 2.000 5.000 7,605.000 \n", 1341 | "\n", 1342 | " addcart_num delcart_num buy_num favor_num click_num \\\n", 1343 | "count 105,180.000 105,180.000 105,180.000 105,180.000 105,180.000 \n", 1344 | "mean 5.471 2.434 0.459 1.045 291.222 \n", 1345 | "std 10.618 5.600 1.048 3.442 460.031 \n", 1346 | "min 0.000 0.000 0.000 0.000 0.000 \n", 1347 | "25% 0.000 0.000 0.000 0.000 59.000 \n", 1348 | "50% 2.000 0.000 0.000 0.000 148.000 \n", 1349 | "75% 6.000 3.000 1.000 0.000 342.000 \n", 1350 | "max 369.000 231.000 50.000 99.000 15,302.000 \n", 1351 | "\n", 1352 | " buy_addcart_ratio buy_browse_ratio buy_click_ratio buy_favor_ratio \n", 1353 | "count 72,129.000 105,172.000 103,197.000 45,986.000 \n", 1354 | "mean 0.147 0.005 0.009 0.552 \n", 1355 | "std 0.270 0.022 0.074 0.473 \n", 1356 | "min 0.000 0.000 0.000 0.000 \n", 1357 | "25% 0.000 0.000 0.000 0.000 \n", 1358 | "50% 0.000 0.000 0.000 1.000 \n", 1359 | "75% 0.167 0.002 0.001 1.000 \n", 1360 | "max 1.000 1.000 1.000 1.000 " 1361 | ] 1362 | }, 1363 | "execution_count": 1, 1364 | "metadata": {}, 1365 | "output_type": "execute_result" 1366 | } 1367 | ], 1368 | "source": [ 1369 | "import pandas as pd\n", 1370 | "df_user = pd.read_csv('data/User_table.csv', header=0)\n", 1371 | "pd.options.display.float_format = '{:,.3f}'.format  # display format: keep three decimal places\n", 1372 | "df_user.describe()" 1373 | ] 1374 | }, 1375 | { 1376 | "cell_type": "markdown", 1377 | "metadata": {}, 1378 | "source": [ 1379 | "From the statistics above: the user_id count shows 105,321 users, 3 of whom lack the age and sex fields, while the browse/add-to-cart/delete-from-cart/buy statistics cover only 105,180 records, i.e. some users have no interaction records at all; both kinds of users can therefore be deleted." 1380 | ] 1381 | }, 1382 | { 1383 | "cell_type": "markdown", 1384 | "metadata": {}, 1385 | "source": [ 1386 | "Delete the users missing the age and sex fields" 1387 | ] 1388 | }, 1389 | { 1390 | "cell_type": "code", 1391 | "execution_count": 3, 1392 | "metadata": { 1393 | "collapsed": false 1394 | }, 1395 | "outputs": [ 1396 | { 1397 | "data": { 1398 | "text/html": [ 1399 | "
" 1475 | ], 1476 | "text/plain": [ 1477 | " user_id age sex user_lv_cd browse_num addcart_num delcart_num \\\n", 1478 | "34072 234073 nan nan 1 32.000 6.000 4.000 \n", 1479 | "38905 238906 nan nan 1 171.000 3.000 2.000 \n", 1480 | "67704 267705 nan nan 1 342.000 18.000 8.000 \n", 1481 | "\n", 1482 | " buy_num favor_num click_num buy_addcart_ratio buy_browse_ratio \\\n", 1483 | "34072 1.000 0.000 41.000 0.167 0.031 \n", 1484 | "38905 2.000 3.000 464.000 0.667 0.012 \n", 1485 | "67704 0.000 0.000 743.000 0.000 0.000 \n", 1486 | "\n", 1487 | " buy_click_ratio buy_favor_ratio \n", 1488 | "34072 0.024 1.000 \n", 1489 | "38905 0.004 0.667 \n", 1490 | "67704 0.000 nan " 1491 | ] 1492 | }, 1493 | "execution_count": 3, 1494 | "metadata": {}, 1495 | "output_type": "execute_result" 1496 | } 1497 | ], 1498 | "source": [ 1499 | "df_user[df_user['age'].isnull()]" 1500 | ] 1501 | }, 1502 | { 1503 | "cell_type": "code", 1504 | "execution_count": 4, 1505 | "metadata": { 1506 | "collapsed": true 1507 | }, 1508 | "outputs": [], 1509 | "source": [ 1510 | "delete_list = df_user[df_user['age'].isnull()].index\n", 1511 | "df_user.drop(delete_list,axis=0,inplace=True)" 1512 | ] 1513 | }, 1514 | { 1515 | "cell_type": "markdown", 1516 | "metadata": {}, 1517 | "source": [ 1518 | "删除无交互记录的用户" 1519 | ] 1520 | }, 1521 | { 1522 | "cell_type": "code", 1523 | "execution_count": 5, 1524 | "metadata": { 1525 | "collapsed": false 1526 | }, 1527 | "outputs": [ 1528 | { 1529 | "name": "stdout", 1530 | "output_type": "stream", 1531 | "text": [ 1532 | "105177\n" 1533 | ] 1534 | } 1535 | ], 1536 | "source": [ 1537 | "#删除无交互记录的用户\n", 1538 | "df_naction = df_user[(df_user['browse_num'].isnull()) & (df_user['addcart_num'].isnull()) & (df_user['delcart_num'].isnull()) & (df_user['buy_num'].isnull()) & (df_user['favor_num'].isnull()) & (df_user['click_num'].isnull())]\n", 1539 | "df_user.drop(df_naction.index,axis=0,inplace=True)\n", 1540 | "print len(df_user)" 1541 | ] 1542 | }, 1543 | { 1544 | "cell_type": 
"markdown", 1545 | "metadata": {}, 1546 | "source": [ 1547 | "Count and drop the users with no purchase records" 1548 | ] 1549 | }, 1550 | { 1551 | "cell_type": "code", 1552 | "execution_count": 6, 1553 | "metadata": { 1554 | "collapsed": false 1555 | }, 1556 | "outputs": [ 1557 | { 1558 | "name": "stdout", 1559 | "output_type": "stream", 1560 | "text": [ 1561 | "75694\n" 1562 | ] 1563 | } 1564 | ], 1565 | "source": [ 1566 | "# count the users with no purchase records\n", 1567 | "df_bzero = df_user[df_user['buy_num']==0]\n", 1568 | "# print how many users have zero purchases\n", 1569 | "print len(df_bzero)" 1570 | ] 1571 | }, 1572 | { 1573 | "cell_type": "code", 1574 | "execution_count": 7, 1575 | "metadata": { 1576 | "collapsed": false 1577 | }, 1578 | "outputs": [], 1579 | "source": [ 1580 | "# drop the users with no purchase records\n", 1581 | "df_user = df_user[df_user['buy_num']!=0]" 1582 | ] 1583 | }, 1584 | { 1585 | "cell_type": "code", 1586 | "execution_count": 8, 1587 | "metadata": { 1588 | "collapsed": false 1589 | }, 1590 | "outputs": [ 1591 | { 1592 | "data": { 1593 | "text/html": [ 1594 | "
" 1755 | ], 1756 | "text/plain": [ 1757 | " user_id age sex user_lv_cd browse_num addcart_num \\\n", 1758 | "count 29,483.000 29,483.000 29,483.000 29,483.000 29,483.000 29,483.000 \n", 1759 | "mean 250,746.445 2.914 1.025 4.272 302.488 10.525 \n", 1760 | "std 29,979.676 1.490 0.959 0.808 391.535 14.301 \n", 1761 | "min 200,001.000 -1.000 0.000 2.000 1.000 0.000 \n", 1762 | "25% 225,058.500 3.000 0.000 4.000 76.000 3.000 \n", 1763 | "50% 249,144.000 3.000 1.000 4.000 178.000 6.000 \n", 1764 | "75% 276,252.500 4.000 2.000 5.000 381.000 13.000 \n", 1765 | "max 305,318.000 6.000 2.000 5.000 7,605.000 288.000 \n", 1766 | "\n", 1767 | " delcart_num buy_num favor_num click_num buy_addcart_ratio \\\n", 1768 | "count 29,483.000 29,483.000 29,483.000 29,483.000 29,483.000 \n", 1769 | "mean 4.673 1.637 1.677 486.653 0.360 \n", 1770 | "std 7.568 1.412 4.584 658.671 0.320 \n", 1771 | "min 0.000 1.000 0.000 0.000 0.004 \n", 1772 | "25% 0.000 1.000 0.000 116.000 0.118 \n", 1773 | "50% 2.000 1.000 0.000 282.000 0.250 \n", 1774 | "75% 6.000 2.000 1.000 604.000 0.500 \n", 1775 | "max 178.000 50.000 96.000 15,302.000 1.000 \n", 1776 | "\n", 1777 | " buy_browse_ratio buy_click_ratio buy_favor_ratio \n", 1778 | "count 29,483.000 29,483.000 29,483.000 \n", 1779 | "mean 0.018 0.030 0.862 \n", 1780 | "std 0.038 0.136 0.287 \n", 1781 | "min 0.000 0.000 0.010 \n", 1782 | "25% 0.004 0.002 1.000 \n", 1783 | "50% 0.008 0.005 1.000 \n", 1784 | "75% 0.018 0.012 1.000 \n", 1785 | "max 1.000 1.000 1.000 " 1786 | ] 1787 | }, 1788 | "execution_count": 8, 1789 | "metadata": {}, 1790 | "output_type": "execute_result" 1791 | } 1792 | ], 1793 | "source": [ 1794 | "df_user.describe()" 1795 | ] 1796 | }, 1797 | { 1798 | "cell_type": "markdown", 1799 | "metadata": {}, 1800 | "source": [ 1801 | "删除爬虫及惰性用户" 1802 | ] 1803 | }, 1804 | { 1805 | "cell_type": "markdown", 1806 | "metadata": {}, 1807 | "source": [ 1808 | "由上表所知,浏览购买转换比和点击购买转换比均值为0.018,0.030,因此这里认为浏览购买转换比和点击购买转换比小于0.0005的用户为惰性用户" 1809 | ] 1810 | }, 
1811 | { 1812 | "cell_type": "code", 1813 | "execution_count": 9, 1814 | "metadata": { 1815 | "collapsed": false 1816 | }, 1817 | "outputs": [ 1818 | { 1819 | "name": "stdout", 1820 | "output_type": "stream", 1821 | "text": [ 1822 | "90\n" 1823 | ] 1824 | } 1825 | ], 1826 | "source": [ 1827 | "bindex = df_user[df_user['buy_browse_ratio']<0.0005].index\n", 1828 | "print len(bindex)\n", 1829 | "df_user.drop(bindex,axis=0,inplace=True)" 1830 | ] 1831 | }, 1832 | { 1833 | "cell_type": "code", 1834 | "execution_count": 10, 1835 | "metadata": { 1836 | "collapsed": false, 1837 | "scrolled": true 1838 | }, 1839 | "outputs": [ 1840 | { 1841 | "name": "stdout", 1842 | "output_type": "stream", 1843 | "text": [ 1844 | "323\n" 1845 | ] 1846 | } 1847 | ], 1848 | "source": [ 1849 | "cindex = df_user[df_user['buy_click_ratio']<0.0005].index\n", 1850 | "print len(cindex)\n", 1851 | "df_user.drop(cindex,axis=0,inplace=True)" 1852 | ] 1853 | }, 1854 | { 1855 | "cell_type": "code", 1856 | "execution_count": 11, 1857 | "metadata": { 1858 | "collapsed": false 1859 | }, 1860 | "outputs": [ 1861 | { 1862 | "data": { 1863 | "text/html": [ 1864 | "
" 2025 | ], 2026 | "text/plain": [ 2027 | " user_id age sex user_lv_cd browse_num addcart_num \\\n", 2028 | "count 29,070.000 29,070.000 29,070.000 29,070.000 29,070.000 29,070.000 \n", 2029 | "mean 250,767.099 2.910 1.028 4.268 280.260 10.145 \n", 2030 | "std 29,998.870 1.492 0.959 0.809 325.129 13.443 \n", 2031 | "min 200,001.000 -1.000 0.000 2.000 1.000 0.000 \n", 2032 | "25% 225,036.000 3.000 0.000 4.000 75.000 3.000 \n", 2033 | "50% 249,200.500 3.000 1.000 4.000 174.000 6.000 \n", 2034 | "75% 276,284.000 4.000 2.000 5.000 366.000 13.000 \n", 2035 | "max 305,318.000 6.000 2.000 5.000 5,007.000 288.000 \n", 2036 | "\n", 2037 | " delcart_num buy_num favor_num click_num buy_addcart_ratio \\\n", 2038 | "count 29,070.000 29,070.000 29,070.000 29,070.000 29,070.000 \n", 2039 | "mean 4.457 1.644 1.589 447.113 0.364 \n", 2040 | "std 6.998 1.420 4.294 530.994 0.320 \n", 2041 | "min 0.000 1.000 0.000 0.000 0.004 \n", 2042 | "25% 0.000 1.000 0.000 114.000 0.125 \n", 2043 | "50% 2.000 1.000 0.000 275.000 0.250 \n", 2044 | "75% 6.000 2.000 1.000 585.000 0.500 \n", 2045 | "max 158.000 50.000 69.000 8,156.000 1.000 \n", 2046 | "\n", 2047 | " buy_browse_ratio buy_click_ratio buy_favor_ratio \n", 2048 | "count 29,070.000 29,070.000 29,070.000 \n", 2049 | "mean 0.019 0.031 0.866 \n", 2050 | "std 0.038 0.137 0.282 \n", 2051 | "min 0.001 0.001 0.018 \n", 2052 | "25% 0.004 0.002 1.000 \n", 2053 | "50% 0.008 0.005 1.000 \n", 2054 | "75% 0.018 0.012 1.000 \n", 2055 | "max 1.000 1.000 1.000 " 2056 | ] 2057 | }, 2058 | "execution_count": 11, 2059 | "metadata": {}, 2060 | "output_type": "execute_result" 2061 | } 2062 | ], 2063 | "source": [ 2064 | "df_user.describe()" 2065 | ] 2066 | }, 2067 | { 2068 | "cell_type": "markdown", 2069 | "metadata": {}, 2070 | "source": [ 2071 | "最后这29070个用户为最终预测用户数据集" 2072 | ] 2073 | }, 2074 | { 2075 | "cell_type": "code", 2076 | "execution_count": 12, 2077 | "metadata": { 2078 | "collapsed": true 2079 | }, 2080 | "outputs": [], 2081 | "source": [ 2082 | 
"df_user.to_csv(\"data/JData_FUser.csv\",index=False)" 2083 | ] 2084 | }, 2085 | { 2086 | "cell_type": "markdown", 2087 | "metadata": {}, 2088 | "source": [ 2089 | "#### Item cleaning" 2090 | ] 2091 | }, 2092 | { 2093 | "cell_type": "code", 2094 | "execution_count": 13, 2095 | "metadata": { 2096 | "collapsed": false 2097 | }, 2098 | "outputs": [ 2099 | { 2100 | "data": { 2101 | "text/html": [ 2102 | "
" 2308 | ], 2309 | "text/plain": [ 2310 | " sku_id a1 a2 a3 cate brand \\\n", 2311 | "count 24,187.000 24,187.000 24,187.000 24,187.000 24,187.000 24,187.000 \n", 2312 | "mean 85,398.737 2.177 0.939 1.180 8.000 435.864 \n", 2313 | "std 49,238.799 1.176 0.970 1.046 0.000 225.749 \n", 2314 | "min 6.000 -1.000 -1.000 -1.000 8.000 3.000 \n", 2315 | "25% 42,476.000 1.000 1.000 1.000 8.000 214.000 \n", 2316 | "50% 85,616.000 3.000 1.000 1.000 8.000 489.000 \n", 2317 | "75% 127,774.000 3.000 2.000 2.000 8.000 571.000 \n", 2318 | "max 171,224.000 3.000 2.000 2.000 8.000 922.000 \n", 2319 | "\n", 2320 | " browse_num addcart_num delcart_num buy_num favor_num click_num \\\n", 2321 | "count 3,938.000 3,938.000 3,938.000 3,938.000 3,938.000 3,938.000 \n", 2322 | "mean 1,723.711 54.212 21.284 3.373 10.655 2,790.132 \n", 2323 | "std 7,957.661 285.723 106.100 21.695 49.956 12,647.534 \n", 2324 | "min 0.000 0.000 0.000 0.000 0.000 0.000 \n", 2325 | "25% 19.000 0.000 0.000 0.000 0.000 30.000 \n", 2326 | "50% 148.000 1.000 1.000 0.000 1.000 246.500 \n", 2327 | "75% 655.750 12.000 5.000 0.000 4.000 1,071.750 \n", 2328 | "max 194,920.000 6,296.000 2,258.000 691.000 1,205.000 312,005.000 \n", 2329 | "\n", 2330 | " buy_addcart_ratio buy_browse_ratio buy_click_ratio buy_favor_ratio \\\n", 2331 | "count 2,225.000 3,909.000 3,692.000 2,016.000 \n", 2332 | "mean 0.036 0.001 0.000 0.162 \n", 2333 | "std 0.093 0.002 0.001 0.264 \n", 2334 | "min 0.000 0.000 0.000 0.000 \n", 2335 | "25% 0.000 0.000 0.000 0.000 \n", 2336 | "50% 0.000 0.000 0.000 0.000 \n", 2337 | "75% 0.048 0.000 0.000 0.250 \n", 2338 | "max 1.000 0.114 0.063 1.000 \n", 2339 | "\n", 2340 | " comment_num has_bad_comment bad_comment_rate \n", 2341 | "count 6,830.000 6,830.000 6,830.000 \n", 2342 | "mean 2.732 0.486 0.044 \n", 2343 | "std 1.037 0.500 0.110 \n", 2344 | "min 1.000 0.000 0.000 \n", 2345 | "25% 2.000 0.000 0.000 \n", 2346 | "50% 3.000 0.000 0.000 \n", 2347 | "75% 4.000 1.000 0.044 \n", 2348 | "max 4.000 1.000 1.000 " 
2349 | ] 2350 | }, 2351 | "execution_count": 13, 2352 | "metadata": {}, 2353 | "output_type": "execute_result" 2354 | } 2355 | ], 2356 | "source": [ 2357 | "import pandas as pd\n", 2358 | "df_product = pd.read_csv('data/Item_table.csv',header=0)\n", 2359 | "pd.options.display.float_format = '{:,.3f}'.format # display floats with three decimal places\n", 2360 | "df_product.describe()" 2361 | ] 2362 | }, 2363 | { 2364 | "cell_type": "code", 2365 | "execution_count": null, 2366 | "metadata": { 2367 | "collapsed": true 2368 | }, 2369 | "outputs": [], 2370 | "source": [] 2371 | } 2372 | ], 2373 | "metadata": { 2374 | "kernelspec": { 2375 | "display_name": "Python 2", 2376 | "language": "python", 2377 | "name": "python2" 2378 | }, 2379 | "language_info": { 2380 | "codemirror_mode": { 2381 | "name": "ipython", 2382 | "version": 2 2383 | }, 2384 | "file_extension": ".py", 2385 | "mimetype": "text/x-python", 2386 | "name": "python", 2387 | "nbconvert_exporter": "python", 2388 | "pygments_lexer": "ipython2", 2389 | "version": "2.7.13" 2390 | } 2391 | }, 2392 | "nbformat": 4, 2393 | "nbformat_minor": 0 2394 | } 2395 | --------------------------------------------------------------------------------