├── README.md ├── Data_Explore.ipynb ├── Model_Design_v1.7.ipynb ├── Feature_Engineering-v1.4.ipynb └── data_cleaning_wp.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # JData 2 | Team `HelloKitty`'s solution to the JD.com JData 2017 competition; final online ranking: 139/4240 on leaderboard A, 97/4240 on leaderboard B 3 | * Background 4 | > The competition is based on real (anonymized) user, product, and behavior data from the JD.com mall. Teams are asked to apply data-mining techniques and machine-learning algorithms to build a model that predicts which products users will buy, outputting matches between high-potential users and target products so as to supply high-quality target audiences for precision marketing. The organizers also hope that, through the competition, teams will uncover the meaning hidden behind the data and give e-commerce users a simpler, faster, and more worry-free shopping experience. 5 | 6 | * [Competition page](http://www.datafountain.cn/#/competitions/247/intro) 7 | * Code files 8 | * Data analysis (Data_Explore, data_analysis_wp, data_cleaning_wp) 9 | * Feature engineering (Feature_Engineering-v1.4) 10 | * Model design (Model_Design_v1.7) 11 | A detailed personal write-up of the competition is available on my [blog](http://izhaoyi.top/2017/06/25/JData/) 12 | 13 | 14 | -------------------------------------------------------------------------------- /Data_Explore.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Checking for Anomalous Values" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "ExecuteTime": { 15 | "end_time": "2017-05-10T21:17:05.070380Z", 16 | "start_time": "2017-05-10T21:17:04.293943Z" 17 | }, 18 | "collapsed": true 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "%matplotlib inline\n", 23 | "import matplotlib\n", 24 | "import matplotlib.pyplot as plt\n", 25 | "import numpy as np\n", 26 | "import pandas as pd" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 2, 32 | "metadata": { 33 | "ExecuteTime": { 34 | "end_time": "2017-05-10T21:17:05.074920Z", 35 | "start_time": "2017-05-10T21:17:05.071410Z" 36 | }, 37 | "collapsed": true 38 | }, 39 | "outputs": [], 40 | "source": [ 41 | "# define data file paths\n", 42 | "ACTION_201602_FILE = \"data/JData_Action_201602.csv\"\n", 43 | "ACTION_201603_FILE = \"data/JData_Action_201603.csv\"\n", 44 | "ACTION_201604_FILE = \"data/JData_Action_201604.csv\"\n", 45 | 
"COMMENT_FILE = \"data/JData_Comment.csv\"\n", 46 | "PRODUCT_FILE = \"data/JData_Product.csv\"\n", 47 | "USER_FILE = \"data/JData_User.csv\"\n", 48 | "USER_TABLE_FILE = \"data/User_table.csv\"\n", 49 | "ITEM_TABLE_FILE = \"data/Item_table.csv\"" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "### Dataset Background\n", 57 | "From the official data description, we already know which anomalies the data may contain:\n", 58 | "* User file\n", 59 | " * a user's age may be unknown, marked as -1\n", 60 | " * a user's sex may be withheld, marked as 2\n", 61 | " * later analysis shows that, due to a system glitch, some registration dates fall after the prediction date; we have no plan for this feature yet, so it is left untouched\n", 62 | "* Product file\n", 63 | " * the attributes a1, a2, and a3 may each be unknown, marked as -1\n", 64 | "* Action file\n", 65 | " * model_id is the ID of the clicked module and may be null when the action type is 6" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "### Checking for Missing Values" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 3, 78 | "metadata": { 79 | "ExecuteTime": { 80 | "end_time": "2017-05-10T21:17:35.805241Z", 81 | "start_time": "2017-05-10T21:17:05.075917Z" 82 | }, 83 | "collapsed": false 84 | }, 85 | "outputs": [ 86 | { 87 | "name": "stdout", 88 | "output_type": "stream", 89 | "text": [ 90 | "Is there any missing value in User? True\n", 91 | "Is there any missing value in Action 2? True\n", 92 | "Is there any missing value in Action 3? True\n", 93 | "Is there any missing value in Action 4? True\n", 94 | "Is there any missing value in Comment? False\n", 95 | "Is there any missing value in Product? False\n" 96 | ] 97 | } 98 | ], 99 | "source": [ 100 | "def check_empty(file_path, file_name):\n", 101 | " df_file = pd.read_csv(file_path)\n", 102 | " print 'Is there any missing value in {0}? 
{1}'.format(file_name, df_file.isnull().any().any()) \n", 103 | "\n", 104 | "check_empty(USER_FILE, 'User')\n", 105 | "check_empty(ACTION_201602_FILE, 'Action 2')\n", 106 | "check_empty(ACTION_201603_FILE, 'Action 3')\n", 107 | "check_empty(ACTION_201604_FILE, 'Action 4')\n", 108 | "check_empty(COMMENT_FILE, 'Comment')\n", 109 | "check_empty(PRODUCT_FILE, 'Product')" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "This quick check shows that the user table and the action tables contain missing values, while the comment and product tables do not. Still, as noted in the background analysis, the product table contains unknown attribute values that will need attention later. Next, let's see exactly where the missing values are in the user and action tables." 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 4, 122 | "metadata": { 123 | "ExecuteTime": { 124 | "end_time": "2017-05-10T21:18:04.373068Z", 125 | "start_time": "2017-05-10T21:17:35.806261Z" 126 | }, 127 | "collapsed": false 128 | }, 129 | "outputs": [ 130 | { 131 | "name": "stdout", 132 | "output_type": "stream", 133 | "text": [ 134 | "empty info in detail of User:\n", 135 | "user_id False\n", 136 | "age True\n", 137 | "sex True\n", 138 | "user_lv_cd False\n", 139 | "user_reg_tm True\n", 140 | "dtype: bool\n", 141 | "empty info in detail of Action 2:\n", 142 | "user_id False\n", 143 | "sku_id False\n", 144 | "time False\n", 145 | "model_id True\n", 146 | "type False\n", 147 | "cate False\n", 148 | "brand False\n", 149 | "dtype: bool\n", 150 | "empty info in detail of Action 3:\n", 151 | "user_id False\n", 152 | "sku_id False\n", 153 | "time False\n", 154 | "model_id True\n", 155 | "type False\n", 156 | "cate False\n", 157 | "brand False\n", 158 | "dtype: bool\n", 159 | "empty info in detail of Action 4:\n", 160 | "user_id False\n", 161 | "sku_id False\n", 162 | "time False\n", 163 | "model_id True\n", 164 | "type False\n", 165 | "cate False\n", 166 | "brand False\n", 167 | "dtype: bool\n" 168 | ] 169 | } 170 | ], 171 | "source": [ 172 | "def empty_detail(f_path, f_name):\n", 173 | " df_file = pd.read_csv(f_path)\n", 174 | " print 'empty info in detail of 
{0}:'.format(f_name)\n", 175 | " print pd.isnull(df_file).any()\n", 176 | "\n", 177 | "empty_detail(USER_FILE, 'User')\n", 178 | "empty_detail(ACTION_201602_FILE, 'Action 2')\n", 179 | "empty_detail(ACTION_201603_FILE, 'Action 3')\n", 180 | "empty_detail(ACTION_201604_FILE, 'Action 4')" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "The output above shows which columns contain missing values (True) in each affected file:\n", 188 | "* User\n", 189 | " * age\n", 190 | " * sex\n", 191 | " * user_reg_tm\n", 192 | "* Action\n", 193 | " * model_id\n", 194 | " \n", 195 | "Next, let's count the missing values in each file:" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 5, 201 | "metadata": { 202 | "ExecuteTime": { 203 | "end_time": "2017-05-10T21:18:27.415981Z", 204 | "start_time": "2017-05-10T21:18:04.373995Z" 205 | }, 206 | "collapsed": false 207 | }, 208 | "outputs": [ 209 | { 210 | "name": "stdout", 211 | "output_type": "stream", 212 | "text": [ 213 | "No. of missing age in User is 3\n", 214 | "percent: 2.84843478509e-05\n", 215 | "No. of missing sex in User is 3\n", 216 | "percent: 2.84843478509e-05\n", 217 | "No. of missing user_reg_tm in User is 3\n", 218 | "percent: 2.84843478509e-05\n", 219 | "No. of missing model_id in Action 2 is 4959617\n", 220 | "percent: 0.431818363867\n", 221 | "No. of missing model_id in Action 3 is 10553261\n", 222 | "percent: 0.4072043169\n", 223 | "No. of missing model_id in Action 4 is 5143018\n", 224 | "percent: 0.38962452388\n" 225 | ] 226 | } 227 | ], 228 | "source": [ 229 | "def empty_records(f_path, f_name, col_name):\n", 230 | " df_file = pd.read_csv(f_path)\n", 231 | " missing = df_file[col_name].isnull().sum().sum()\n", 232 | " print 'No. 
of missing {0} in {1} is {2}'.format(col_name, f_name, missing) \n", 233 | " print 'percent: ', missing * 1.0 / df_file.shape[0]\n", 234 | "\n", 235 | "empty_records(USER_FILE, 'User', 'age')\n", 236 | "empty_records(USER_FILE, 'User', 'sex')\n", 237 | "empty_records(USER_FILE, 'User', 'user_reg_tm')\n", 238 | "empty_records(ACTION_201602_FILE, 'Action 2', 'model_id')\n", 239 | "empty_records(ACTION_201603_FILE, 'Action 3', 'model_id')\n", 240 | "empty_records(ACTION_201604_FILE, 'Action 4', 'model_id')" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "Compare this against the total record counts of the datasets:\n", 248 | "\n", 249 | "File|Description|Records\n", 250 | "---|---|---\n", 251 | "1. JData_User.csv | user data | 105,321 users\n", 252 | "2. JData_Comment.csv | product comments | 558,552 records\n", 253 | "3. JData_Product.csv | candidate product set | 24,187 records\n", 254 | "4. JData_Action_201602.csv | February action records | 11,485,424 records\n", 255 | "5. JData_Action_201603.csv | March action records | 25,916,378 records\n", 256 | "6. JData_Action_201604.csv | April action records | 13,199,934 records\n", 257 | "\n", 258 | "Combining these counts with the output above, each dataset is handled differently:\n", 259 | "* User file \n", 260 | " * age, sex: first fill with the corresponding unknown markers (-1 | 2), then analyze and handle them later together with the other unknown values\n", 261 | " * user_reg_tm: leave as-is for now\n", 262 | "* Action file\n", 263 | " * model_id is missing in nearly half the records, and we have no good way to handle this feature yet, so its treatment is left undecided" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 6, 269 | "metadata": { 270 | "ExecuteTime": { 271 | "end_time": "2017-05-10T21:18:27.455370Z", 272 | "start_time": "2017-05-10T21:18:27.416911Z" 273 | }, 274 | "collapsed": true 275 | }, 276 | "outputs": [], 277 | "source": [ 278 | "user = pd.read_csv(USER_FILE)\n", 279 | "user['age'].fillna('-1', inplace=True)\n", 280 | "user['sex'].fillna(2, inplace=True)" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": 7, 286 | "metadata": { 287 | "ExecuteTime": { 288 | "end_time": "2017-05-10T21:18:27.478275Z", 289 | "start_time": "2017-05-10T21:18:27.456423Z" 290 | }, 291 | "collapsed": false 292 | }, 293 | "outputs": [ 294 | { 
295 | "name": "stdout", 296 | "output_type": "stream", 297 | "text": [ 298 | "user_id False\n", 299 | "age False\n", 300 | "sex False\n", 301 | "user_lv_cd False\n", 302 | "user_reg_tm True\n", 303 | "dtype: bool\n" 304 | ] 305 | } 306 | ], 307 | "source": [ 308 | "print pd.isnull(user).any()" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": 10, 314 | "metadata": { 315 | "ExecuteTime": { 316 | "end_time": "2017-05-10T21:26:59.218638Z", 317 | "start_time": "2017-05-10T21:26:59.207494Z" 318 | }, 319 | "collapsed": false 320 | }, 321 | "outputs": [ 322 | { 323 | "name": "stdout", 324 | "output_type": "stream", 325 | "text": [ 326 | " user_id age sex user_lv_cd user_reg_tm\n", 327 | "34072 234073 -1 2.0 1 NaN\n", 328 | "38905 238906 -1 2.0 1 NaN\n", 329 | "67704 267705 -1 2.0 1 NaN\n" 330 | ] 331 | } 332 | ], 333 | "source": [ 334 | "nan_reg_tm = user[user['user_reg_tm'].isnull()]\n", 335 | "print nan_reg_tm" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": 8, 341 | "metadata": { 342 | "ExecuteTime": { 343 | "end_time": "2017-05-07T16:36:49.338047Z", 344 | "start_time": "2017-05-07T16:36:49.268713Z" 345 | }, 346 | "collapsed": false 347 | }, 348 | "outputs": [ 349 | { 350 | "name": "stdout", 351 | "output_type": "stream", 352 | "text": [ 353 | "7\n", 354 | "3\n", 355 | "5\n" 356 | ] 357 | } 358 | ], 359 | "source": [ 360 | "print len(user['age'].unique())\n", 361 | "print len(user['sex'].unique())\n", 362 | "print len(user['user_lv_cd'].unique())" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": 9, 368 | "metadata": { 369 | "ExecuteTime": { 370 | "end_time": "2017-05-07T16:36:49.376052Z", 371 | "start_time": "2017-05-07T16:36:49.340659Z" 372 | }, 373 | "collapsed": true 374 | }, 375 | "outputs": [], 376 | "source": [ 377 | "prod = pd.read_csv(PRODUCT_FILE)" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": 10, 383 | "metadata": { 384 | "ExecuteTime": { 385 | 
"end_time": "2017-05-07T16:36:49.455004Z", 386 | "start_time": "2017-05-07T16:36:49.377236Z" 387 | }, 388 | "collapsed": false 389 | }, 390 | "outputs": [ 391 | { 392 | "name": "stdout", 393 | "output_type": "stream", 394 | "text": [ 395 | "4\n", 396 | "3\n", 397 | "3\n", 398 | "102\n" 399 | ] 400 | } 401 | ], 402 | "source": [ 403 | "print len(prod['a1'].unique())\n", 404 | "print len(prod['a2'].unique())\n", 405 | "print len(prod['a3'].unique())\n", 406 | "# print len(prod['a2'].unique())\n", 407 | "print len(prod['brand'].unique())" 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "metadata": {}, 413 | "source": [ 414 | "### Unknown Values\n", 415 | "Next, let's look at what share of each file the unknown values account for" 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": 11, 421 | "metadata": { 422 | "ExecuteTime": { 423 | "end_time": "2017-05-07T16:36:49.573716Z", 424 | "start_time": "2017-05-07T16:36:49.456369Z" 425 | }, 426 | "collapsed": false 427 | }, 428 | "outputs": [ 429 | { 430 | "name": "stdout", 431 | "output_type": "stream", 432 | "text": [ 433 | "No. of unknown age user: 14415 and the percent: 0.136867291423 \n", 434 | "No. of unknown sex user: 54738 and the percent: 0.519725410887 \n" 435 | ] 436 | } 437 | ], 438 | "source": [ 439 | "print 'No. of unknown age user: {0} and the percent: {1} '.format(user[user['age']=='-1'].shape[0],\n", 440 | " user[user['age']=='-1'].shape[0]*1.0/user.shape[0])\n", 441 | "print 'No. of unknown sex user: {0} and the percent: {1} '.format(user[user['sex']==2].shape[0],\n", 442 | " user[user['sex']==2].shape[0]*1.0/user.shape[0])" 443 | ] 444 | }, 445 | { 446 | "cell_type": "code", 447 | "execution_count": 12, 448 | "metadata": { 449 | "ExecuteTime": { 450 | "end_time": "2017-05-07T16:36:49.639049Z", 451 | "start_time": "2017-05-07T16:36:49.575491Z" 452 | }, 453 | "collapsed": false 454 | }, 455 | "outputs": [ 456 | { 457 | "name": "stdout", 458 | "output_type": "stream", 459 | "text": [ 460 | "No. 
of unknown a1 in Product is 1701\n", 461 | "percent: 0.0703270351842\n", 462 | "No. of unknown a2 in Product is 4050\n", 463 | "percent: 0.167445321867\n", 464 | "No. of unknown a3 in Product is 3815\n", 465 | "percent: 0.157729358746\n" 466 | ] 467 | } 468 | ], 469 | "source": [ 470 | "def unknown_records(f_path, f_name, col_name):\n", 471 | " df_file = pd.read_csv(f_path)\n", 472 | " missing = df_file[df_file[col_name]==-1].shape[0]\n", 473 | " print 'No. of unknown {0} in {1} is {2}'.format(col_name, f_name, missing) \n", 474 | " print 'percent: ', missing * 1.0 / df_file.shape[0]\n", 475 | " \n", 476 | "unknown_records(PRODUCT_FILE, 'Product', 'a1')\n", 477 | "unknown_records(PRODUCT_FILE, 'Product', 'a2')\n", 478 | "unknown_records(PRODUCT_FILE, 'Product', 'a3')" 479 | ] 480 | }, 481 | { 482 | "cell_type": "markdown", 483 | "metadata": {}, 484 | "source": [ 485 | "To summarize:\n", 486 | "* Missing values: for the 3 affected users, sex and age are filled with the unknown markers while the registration time is left alone; in the action data, handling of model_id (missing in 43.2%, 40.7%, and 39.0% of the records) is still undecided\n", 487 | "* Unknown values: some users have an unknown age: 13.7%, and sex is withheld for more than half of them: 52.0%\n", 488 | "* each product attribute also has some unknown values, a1= Q1 - step) & (log_data[feature] <= Q3 + step))])" 523 | ] 524 | } 525 | ], 526 | "metadata": { 527 | "kernelspec": { 528 | "display_name": "Python 2", 529 | "language": "python", 530 | "name": "python2" 531 | }, 532 | "language_info": { 533 | "codemirror_mode": { 534 | "name": "ipython", 535 | "version": 2 536 | }, 537 | "file_extension": ".py", 538 | "mimetype": "text/x-python", 539 | "name": "python", 540 | "nbconvert_exporter": "python", 541 | "pygments_lexer": "ipython2", 542 | "version": "2.7.13" 543 | } 544 | }, 545 | "nbformat": 4, 546 | "nbformat_minor": 2 547 | } 548 | -------------------------------------------------------------------------------- /Model_Design_v1.7.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Model Design" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 
| "metadata": {}, 13 | "source": [ 14 | "A dedicated model for quick evaluation\n", 15 | "\n", 16 | "Since GridSearchCV takes too long, we compromise: the existing split of the original training set is used as the yardstick for parameter tuning" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 208, 22 | "metadata": { 23 | "ExecuteTime": { 24 | "end_time": "2017-05-25T15:16:30.859241Z", 25 | "start_time": "2017-05-25T15:16:30.854262Z" 26 | }, 27 | "collapsed": false 28 | }, 29 | "outputs": [], 30 | "source": [ 31 | "#!/usr/bin/env python\n", 32 | "# -*- coding: UTF-8 -*-\n", 33 | "import sys\n", 34 | "import pandas as pd\n", 35 | "import numpy as np\n", 36 | "import xgboost as xgb\n", 37 | "from sklearn.model_selection import train_test_split\n", 38 | "import operator\n", 39 | "from matplotlib import pylab as plt\n", 40 | "from datetime import datetime\n", 41 | "import time\n", 42 | "from sklearn.model_selection import GridSearchCV" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 209, 48 | "metadata": { 49 | "ExecuteTime": { 50 | "end_time": "2017-05-25T15:16:30.928993Z", 51 | "start_time": "2017-05-25T15:16:30.863330Z" 52 | }, 53 | "collapsed": false 54 | }, 55 | "outputs": [], 56 | "source": [ 57 | "# import gc\n", 58 | "def show_record():\n", 59 | " train = pd.read_csv('train_set.csv')\n", 60 | "# valid = pd.read_csv('val_set.csv')\n", 61 | "# label_val = pd.read_csv('label_val_set.csv')\n", 62 | " valid1 = pd.read_csv('val_1.csv')\n", 63 | " valid2 = pd.read_csv('val_2.csv')\n", 64 | " valid3 = pd.read_csv('val_3.csv')\n", 65 | "# test = pd.read_csv('test_set.csv')\n", 66 | " print train.shape\n", 67 | "# print valid.shape\n", 68 | "# print label_val.shape\n", 69 | "# print test.shape\n", 70 | " print valid1.shape\n", 71 | " print valid2.shape\n", 72 | " print valid3.shape\n", 73 | "\n", 74 | "# show_record()\n", 75 | "# del train, valid, test\n", 76 | "# gc.collect()\n" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "### Training Data\n", 84 | "* returns the trained model\n", 85 | "* generates a feature map file for later feature-importance analysis" 86 | ] 87 | 
}, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 210, 91 | "metadata": { 92 | "ExecuteTime": { 93 | "end_time": "2017-05-25T15:17:20.950524Z", 94 | "start_time": "2017-05-25T15:16:30.930219Z" 95 | }, 96 | "collapsed": false 97 | }, 98 | "outputs": [ 99 | { 100 | "name": "stdout", 101 | "output_type": "stream", 102 | "text": [ 103 | "total features: 301\n", 104 | "[0]\ttrain-auc:0.913108\teval-auc:0.911621\n", 105 | "Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.\n", 106 | "\n", 107 | "Will train until eval-auc hasn't improved in 10 rounds.\n", 108 | "[1]\ttrain-auc:0.932872\teval-auc:0.930423\n", 109 | "[2]\ttrain-auc:0.936241\teval-auc:0.93338\n", 110 | "[3]\ttrain-auc:0.938389\teval-auc:0.936325\n", 111 | "[4]\ttrain-auc:0.938821\teval-auc:0.937227\n", 112 | "[5]\ttrain-auc:0.942359\teval-auc:0.941066\n", 113 | "[6]\ttrain-auc:0.943082\teval-auc:0.941563\n", 114 | "[7]\ttrain-auc:0.943831\teval-auc:0.942126\n", 115 | "[8]\ttrain-auc:0.945939\teval-auc:0.944321\n", 116 | "[9]\ttrain-auc:0.946849\teval-auc:0.94524\n", 117 | "[10]\ttrain-auc:0.948741\teval-auc:0.946958\n", 118 | "[11]\ttrain-auc:0.949038\teval-auc:0.947272\n", 119 | "[12]\ttrain-auc:0.950141\teval-auc:0.948423\n", 120 | "[13]\ttrain-auc:0.951377\teval-auc:0.949643\n", 121 | "[14]\ttrain-auc:0.951717\teval-auc:0.950112\n", 122 | "[15]\ttrain-auc:0.952721\teval-auc:0.951315\n", 123 | "[16]\ttrain-auc:0.953227\teval-auc:0.952028\n", 124 | "[17]\ttrain-auc:0.953574\teval-auc:0.952416\n", 125 | "[18]\ttrain-auc:0.954116\teval-auc:0.952997\n", 126 | "[19]\ttrain-auc:0.954471\teval-auc:0.953389\n", 127 | "[20]\ttrain-auc:0.954607\teval-auc:0.953544\n", 128 | "[21]\ttrain-auc:0.954891\teval-auc:0.953925\n", 129 | "[22]\ttrain-auc:0.955298\teval-auc:0.954452\n", 130 | "[23]\ttrain-auc:0.955599\teval-auc:0.954829\n", 131 | "[24]\ttrain-auc:0.956163\teval-auc:0.955341\n", 132 | "[25]\ttrain-auc:0.956823\teval-auc:0.95607\n", 133 | 
"[26]\ttrain-auc:0.957225\teval-auc:0.956492\n", 134 | "[27]\ttrain-auc:0.958301\teval-auc:0.957585\n", 135 | "[28]\ttrain-auc:0.958674\teval-auc:0.957999\n", 136 | "[29]\ttrain-auc:0.959072\teval-auc:0.958424\n", 137 | "[30]\ttrain-auc:0.95949\teval-auc:0.95886\n", 138 | "[31]\ttrain-auc:0.960144\teval-auc:0.959574\n", 139 | "[32]\ttrain-auc:0.960481\teval-auc:0.959766\n", 140 | "[33]\ttrain-auc:0.960753\teval-auc:0.959948\n", 141 | "[34]\ttrain-auc:0.961088\teval-auc:0.960227\n", 142 | "[35]\ttrain-auc:0.961579\teval-auc:0.960637\n", 143 | "[36]\ttrain-auc:0.961899\teval-auc:0.961006\n", 144 | "[37]\ttrain-auc:0.96222\teval-auc:0.961291\n", 145 | "[38]\ttrain-auc:0.96256\teval-auc:0.961603\n", 146 | "[39]\ttrain-auc:0.962937\teval-auc:0.96199\n", 147 | "[40]\ttrain-auc:0.963402\teval-auc:0.96247\n", 148 | "[41]\ttrain-auc:0.963608\teval-auc:0.962675\n", 149 | "[42]\ttrain-auc:0.963719\teval-auc:0.962718\n", 150 | "[43]\ttrain-auc:0.964028\teval-auc:0.963035\n", 151 | "[44]\ttrain-auc:0.964363\teval-auc:0.963354\n", 152 | "[45]\ttrain-auc:0.964619\teval-auc:0.963669\n", 153 | "[46]\ttrain-auc:0.964959\teval-auc:0.964009\n", 154 | "[47]\ttrain-auc:0.965308\teval-auc:0.964401\n", 155 | "[48]\ttrain-auc:0.965511\teval-auc:0.964586\n", 156 | "[49]\ttrain-auc:0.965684\teval-auc:0.964769\n", 157 | "[50]\ttrain-auc:0.965868\teval-auc:0.964956\n", 158 | "[51]\ttrain-auc:0.966077\teval-auc:0.965119\n", 159 | "[52]\ttrain-auc:0.966301\teval-auc:0.965328\n", 160 | "[53]\ttrain-auc:0.966489\teval-auc:0.965482\n", 161 | "[54]\ttrain-auc:0.966734\teval-auc:0.965689\n", 162 | "[55]\ttrain-auc:0.966895\teval-auc:0.965911\n", 163 | "[56]\ttrain-auc:0.967153\teval-auc:0.966191\n", 164 | "[57]\ttrain-auc:0.967307\teval-auc:0.966304\n", 165 | "[58]\ttrain-auc:0.967423\teval-auc:0.966448\n", 166 | "[59]\ttrain-auc:0.967507\teval-auc:0.9665\n", 167 | "[60]\ttrain-auc:0.96769\teval-auc:0.966634\n", 168 | "[61]\ttrain-auc:0.967913\teval-auc:0.966795\n", 169 | 
"[62]\ttrain-auc:0.968068\teval-auc:0.966968\n", 170 | "[63]\ttrain-auc:0.968196\teval-auc:0.967114\n", 171 | "[64]\ttrain-auc:0.968262\teval-auc:0.96715\n", 172 | "[65]\ttrain-auc:0.968397\teval-auc:0.96732\n", 173 | "[66]\ttrain-auc:0.96851\teval-auc:0.967424\n", 174 | "[67]\ttrain-auc:0.968621\teval-auc:0.967548\n", 175 | "[68]\ttrain-auc:0.968755\teval-auc:0.967664\n", 176 | "[69]\ttrain-auc:0.968887\teval-auc:0.967772\n", 177 | "[70]\ttrain-auc:0.968964\teval-auc:0.967863\n", 178 | "[71]\ttrain-auc:0.969091\teval-auc:0.96799\n", 179 | "[72]\ttrain-auc:0.969186\teval-auc:0.96807\n", 180 | "[73]\ttrain-auc:0.969338\teval-auc:0.96821\n", 181 | "[74]\ttrain-auc:0.969443\teval-auc:0.968308\n", 182 | "[75]\ttrain-auc:0.969527\teval-auc:0.968395\n", 183 | "[76]\ttrain-auc:0.969607\teval-auc:0.968476\n", 184 | "[77]\ttrain-auc:0.969698\teval-auc:0.968581\n", 185 | "[78]\ttrain-auc:0.969798\teval-auc:0.968621\n", 186 | "[79]\ttrain-auc:0.96986\teval-auc:0.968656\n", 187 | "[80]\ttrain-auc:0.969931\teval-auc:0.968705\n", 188 | "[81]\ttrain-auc:0.97008\teval-auc:0.968845\n", 189 | "[82]\ttrain-auc:0.970107\teval-auc:0.968868\n", 190 | "[83]\ttrain-auc:0.970225\teval-auc:0.968986\n", 191 | "[84]\ttrain-auc:0.970319\teval-auc:0.969047\n", 192 | "[85]\ttrain-auc:0.97046\teval-auc:0.969204\n", 193 | "[86]\ttrain-auc:0.970525\teval-auc:0.96928\n", 194 | "[87]\ttrain-auc:0.970585\teval-auc:0.969315\n", 195 | "[88]\ttrain-auc:0.970624\teval-auc:0.969353\n", 196 | "[89]\ttrain-auc:0.970689\teval-auc:0.969405\n", 197 | "[90]\ttrain-auc:0.97083\teval-auc:0.969556\n", 198 | "[91]\ttrain-auc:0.970917\teval-auc:0.969625\n", 199 | "[92]\ttrain-auc:0.970938\teval-auc:0.969653\n", 200 | "[93]\ttrain-auc:0.971022\teval-auc:0.969732\n", 201 | "[94]\ttrain-auc:0.971079\teval-auc:0.969774\n", 202 | "[95]\ttrain-auc:0.971156\teval-auc:0.969873\n", 203 | "[96]\ttrain-auc:0.971247\teval-auc:0.96995\n", 204 | "[97]\ttrain-auc:0.97132\teval-auc:0.970017\n", 205 | 
"[98]\ttrain-auc:0.971355\teval-auc:0.970063\n", 206 | "[99]\ttrain-auc:0.971424\teval-auc:0.970123\n", 207 | "[100]\ttrain-auc:0.971516\teval-auc:0.970181\n", 208 | "[101]\ttrain-auc:0.971591\teval-auc:0.97024\n", 209 | "[102]\ttrain-auc:0.971706\teval-auc:0.970358\n", 210 | "[103]\ttrain-auc:0.971833\teval-auc:0.970458\n", 211 | "[104]\ttrain-auc:0.971895\teval-auc:0.970494\n", 212 | "[105]\ttrain-auc:0.97195\teval-auc:0.970536\n", 213 | "[106]\ttrain-auc:0.971994\teval-auc:0.970567\n", 214 | "[107]\ttrain-auc:0.972044\teval-auc:0.970608\n", 215 | "[108]\ttrain-auc:0.97209\teval-auc:0.970646\n", 216 | "[109]\ttrain-auc:0.97218\teval-auc:0.970723\n", 217 | "[110]\ttrain-auc:0.972293\teval-auc:0.970827\n", 218 | "[111]\ttrain-auc:0.972311\teval-auc:0.970847\n", 219 | "[112]\ttrain-auc:0.972366\teval-auc:0.970898\n", 220 | "[113]\ttrain-auc:0.972435\teval-auc:0.97097\n", 221 | "[114]\ttrain-auc:0.972496\teval-auc:0.97102\n", 222 | "[115]\ttrain-auc:0.972532\teval-auc:0.971041\n", 223 | "[116]\ttrain-auc:0.972576\teval-auc:0.971063\n", 224 | "[117]\ttrain-auc:0.972691\teval-auc:0.971179\n", 225 | "[118]\ttrain-auc:0.97274\teval-auc:0.971195\n", 226 | "[119]\ttrain-auc:0.972806\teval-auc:0.971237\n", 227 | "[120]\ttrain-auc:0.972906\teval-auc:0.971305\n", 228 | "[121]\ttrain-auc:0.972952\teval-auc:0.971337\n", 229 | "[122]\ttrain-auc:0.972985\teval-auc:0.971359\n", 230 | "[123]\ttrain-auc:0.973057\teval-auc:0.971405\n", 231 | "[124]\ttrain-auc:0.973137\teval-auc:0.971469\n", 232 | "[125]\ttrain-auc:0.973214\teval-auc:0.971547\n", 233 | "[126]\ttrain-auc:0.973253\teval-auc:0.971568\n", 234 | "[127]\ttrain-auc:0.973292\teval-auc:0.971589\n", 235 | "[128]\ttrain-auc:0.973339\teval-auc:0.971617\n", 236 | "[129]\ttrain-auc:0.973381\teval-auc:0.971678\n", 237 | "[130]\ttrain-auc:0.973479\teval-auc:0.971781\n", 238 | "[131]\ttrain-auc:0.973517\teval-auc:0.971821\n", 239 | "[132]\ttrain-auc:0.973577\teval-auc:0.971874\n", 240 | 
"[133]\ttrain-auc:0.973626\teval-auc:0.971904\n", 241 | "[134]\ttrain-auc:0.973689\teval-auc:0.971956\n", 242 | "[135]\ttrain-auc:0.973712\teval-auc:0.971973\n", 243 | "[136]\ttrain-auc:0.973737\teval-auc:0.971986\n", 244 | "[137]\ttrain-auc:0.973787\teval-auc:0.972016\n", 245 | "[138]\ttrain-auc:0.973829\teval-auc:0.972047\n", 246 | "[139]\ttrain-auc:0.97386\teval-auc:0.97207\n", 247 | "[140]\ttrain-auc:0.973892\teval-auc:0.972098\n", 248 | "[141]\ttrain-auc:0.973934\teval-auc:0.972123\n", 249 | "[142]\ttrain-auc:0.974017\teval-auc:0.972172\n", 250 | "[143]\ttrain-auc:0.974054\teval-auc:0.972186\n", 251 | "[144]\ttrain-auc:0.974074\teval-auc:0.972196\n", 252 | "[145]\ttrain-auc:0.974146\teval-auc:0.97224\n", 253 | "[146]\ttrain-auc:0.97422\teval-auc:0.972268\n", 254 | "[147]\ttrain-auc:0.974302\teval-auc:0.972333\n", 255 | "[148]\ttrain-auc:0.974342\teval-auc:0.972365\n", 256 | "[149]\ttrain-auc:0.9744\teval-auc:0.972402\n", 257 | "[150]\ttrain-auc:0.974414\teval-auc:0.972401\n", 258 | "[151]\ttrain-auc:0.974465\teval-auc:0.972457\n", 259 | "[152]\ttrain-auc:0.974484\teval-auc:0.972464\n", 260 | "[153]\ttrain-auc:0.974518\teval-auc:0.972481\n", 261 | "[154]\ttrain-auc:0.974543\teval-auc:0.972486\n", 262 | "[155]\ttrain-auc:0.974571\teval-auc:0.972516\n", 263 | "[156]\ttrain-auc:0.974615\teval-auc:0.972535\n", 264 | "[157]\ttrain-auc:0.974697\teval-auc:0.972602\n", 265 | "[158]\ttrain-auc:0.974711\teval-auc:0.972621\n", 266 | "[159]\ttrain-auc:0.974749\teval-auc:0.972668\n", 267 | "[160]\ttrain-auc:0.974835\teval-auc:0.972706\n", 268 | "[161]\ttrain-auc:0.974902\teval-auc:0.972744\n", 269 | "[162]\ttrain-auc:0.974951\teval-auc:0.972775\n", 270 | "[163]\ttrain-auc:0.975003\teval-auc:0.972796\n", 271 | "[164]\ttrain-auc:0.975017\teval-auc:0.972807\n", 272 | "[165]\ttrain-auc:0.975039\teval-auc:0.972834\n", 273 | "[166]\ttrain-auc:0.975067\teval-auc:0.972862\n", 274 | "[167]\ttrain-auc:0.975098\teval-auc:0.97288\n", 275 | 
"[168]\ttrain-auc:0.975117\teval-auc:0.972898\n", 276 | "[169]\ttrain-auc:0.975179\teval-auc:0.972938\n", 277 | "[170]\ttrain-auc:0.975194\teval-auc:0.972923\n", 278 | "[171]\ttrain-auc:0.975231\teval-auc:0.972927\n", 279 | "[172]\ttrain-auc:0.97529\teval-auc:0.972968\n", 280 | "[173]\ttrain-auc:0.975333\teval-auc:0.972991\n", 281 | "[174]\ttrain-auc:0.975344\teval-auc:0.972996\n", 282 | "[175]\ttrain-auc:0.975396\teval-auc:0.973037\n", 283 | "[176]\ttrain-auc:0.975407\teval-auc:0.973061\n", 284 | "[177]\ttrain-auc:0.975475\teval-auc:0.973096\n", 285 | "[178]\ttrain-auc:0.97554\teval-auc:0.973126\n", 286 | "[179]\ttrain-auc:0.975574\teval-auc:0.973153\n", 287 | "[180]\ttrain-auc:0.975602\teval-auc:0.973176\n", 288 | "[181]\ttrain-auc:0.975673\teval-auc:0.973216\n", 289 | "[182]\ttrain-auc:0.9757\teval-auc:0.973239\n", 290 | "[183]\ttrain-auc:0.975749\teval-auc:0.973277\n", 291 | "[184]\ttrain-auc:0.975785\teval-auc:0.973293\n", 292 | "[185]\ttrain-auc:0.975804\teval-auc:0.973307\n", 293 | "[186]\ttrain-auc:0.975845\teval-auc:0.973352\n", 294 | "[187]\ttrain-auc:0.975861\teval-auc:0.973368\n", 295 | "[188]\ttrain-auc:0.975877\teval-auc:0.973377\n", 296 | "[189]\ttrain-auc:0.975923\teval-auc:0.973386\n", 297 | "[190]\ttrain-auc:0.975947\teval-auc:0.973379\n", 298 | "[191]\ttrain-auc:0.976003\teval-auc:0.973426\n", 299 | "[192]\ttrain-auc:0.97606\teval-auc:0.973478\n", 300 | "[193]\ttrain-auc:0.976122\teval-auc:0.973517\n", 301 | "[194]\ttrain-auc:0.976136\teval-auc:0.973525\n", 302 | "[195]\ttrain-auc:0.976159\teval-auc:0.973543\n", 303 | "[196]\ttrain-auc:0.976184\teval-auc:0.973557\n", 304 | "[197]\ttrain-auc:0.9762\teval-auc:0.973575\n", 305 | "[198]\ttrain-auc:0.976241\teval-auc:0.973598\n", 306 | "[199]\ttrain-auc:0.976268\teval-auc:0.973611\n", 307 | "[200]\ttrain-auc:0.97629\teval-auc:0.973603\n", 308 | "[201]\ttrain-auc:0.97635\teval-auc:0.973653\n", 309 | "[202]\ttrain-auc:0.976386\teval-auc:0.973675\n", 310 | 
"[203]\ttrain-auc:0.976424\teval-auc:0.973702\n", 311 | "[204]\ttrain-auc:0.976454\teval-auc:0.973711\n", 312 | "[205]\ttrain-auc:0.976462\teval-auc:0.97372\n", 313 | "[206]\ttrain-auc:0.976477\teval-auc:0.973726\n", 314 | "[207]\ttrain-auc:0.976495\teval-auc:0.973728\n", 315 | "[208]\ttrain-auc:0.976579\teval-auc:0.973798\n", 316 | "[209]\ttrain-auc:0.976655\teval-auc:0.973876\n", 317 | "[210]\ttrain-auc:0.976672\teval-auc:0.973882\n", 318 | "[211]\ttrain-auc:0.976726\teval-auc:0.973922\n", 319 | "[212]\ttrain-auc:0.976811\teval-auc:0.973967\n", 320 | "[213]\ttrain-auc:0.976845\teval-auc:0.973996\n", 321 | "[214]\ttrain-auc:0.976899\teval-auc:0.97401\n", 322 | "[215]\ttrain-auc:0.976926\teval-auc:0.974021\n", 323 | "[216]\ttrain-auc:0.976939\teval-auc:0.974023\n", 324 | "[217]\ttrain-auc:0.976969\teval-auc:0.974021\n", 325 | "[218]\ttrain-auc:0.976991\teval-auc:0.974018\n", 326 | "[219]\ttrain-auc:0.977021\teval-auc:0.974029\n", 327 | "[220]\ttrain-auc:0.97705\teval-auc:0.974053\n", 328 | "[221]\ttrain-auc:0.977064\teval-auc:0.974069\n", 329 | "[222]\ttrain-auc:0.97708\teval-auc:0.974078\n", 330 | "[223]\ttrain-auc:0.977096\teval-auc:0.974088\n", 331 | "[224]\ttrain-auc:0.97711\teval-auc:0.9741\n", 332 | "[225]\ttrain-auc:0.977132\teval-auc:0.974098\n", 333 | "[226]\ttrain-auc:0.977167\teval-auc:0.97409\n", 334 | "[227]\ttrain-auc:0.977197\teval-auc:0.974117\n", 335 | "[228]\ttrain-auc:0.977257\teval-auc:0.974155\n", 336 | "[229]\ttrain-auc:0.977282\teval-auc:0.974168\n", 337 | "[230]\ttrain-auc:0.977315\teval-auc:0.97418\n", 338 | "[231]\ttrain-auc:0.9774\teval-auc:0.974234\n", 339 | "[232]\ttrain-auc:0.977426\teval-auc:0.974248\n", 340 | "[233]\ttrain-auc:0.977435\teval-auc:0.974243\n", 341 | "[234]\ttrain-auc:0.977444\teval-auc:0.97425\n", 342 | "[235]\ttrain-auc:0.977501\teval-auc:0.974306\n", 343 | "[236]\ttrain-auc:0.977522\teval-auc:0.974321\n", 344 | "[237]\ttrain-auc:0.977528\teval-auc:0.974311\n", 345 | "[238]\ttrain-auc:0.977535\teval-auc:0.974316\n", 
346 | "[239]\ttrain-auc:0.977581\teval-auc:0.97436\n", 347 | "[240]\ttrain-auc:0.977636\teval-auc:0.974411\n", 348 | "[241]\ttrain-auc:0.977643\teval-auc:0.974413\n", 349 | "[242]\ttrain-auc:0.977702\teval-auc:0.974434\n", 350 | "[243]\ttrain-auc:0.977712\teval-auc:0.974438\n", 351 | "[244]\ttrain-auc:0.977738\teval-auc:0.974443\n", 352 | "[245]\ttrain-auc:0.977757\teval-auc:0.974467\n", 353 | "[246]\ttrain-auc:0.977775\teval-auc:0.974465\n", 354 | "[247]\ttrain-auc:0.977804\teval-auc:0.974468\n", 355 | "[248]\ttrain-auc:0.977839\teval-auc:0.974488\n", 356 | "[249]\ttrain-auc:0.977852\teval-auc:0.974497\n", 357 | "[250]\ttrain-auc:0.977885\teval-auc:0.974505\n", 358 | "[251]\ttrain-auc:0.977908\teval-auc:0.974498\n", 359 | "[252]\ttrain-auc:0.977951\teval-auc:0.974538\n", 360 | "[253]\ttrain-auc:0.977984\teval-auc:0.974515\n", 361 | "[254]\ttrain-auc:0.978015\teval-auc:0.974532\n", 362 | "[255]\ttrain-auc:0.978064\teval-auc:0.97458\n", 363 | "[256]\ttrain-auc:0.978121\teval-auc:0.974593\n", 364 | "[257]\ttrain-auc:0.978146\teval-auc:0.974613\n", 365 | "[258]\ttrain-auc:0.978171\teval-auc:0.974605\n", 366 | "[259]\ttrain-auc:0.978204\teval-auc:0.974623\n", 367 | "[260]\ttrain-auc:0.978244\teval-auc:0.974644\n", 368 | "[261]\ttrain-auc:0.978264\teval-auc:0.97466\n", 369 | "[262]\ttrain-auc:0.978277\teval-auc:0.974679\n", 370 | "[263]\ttrain-auc:0.978284\teval-auc:0.974679\n", 371 | "[264]\ttrain-auc:0.978351\teval-auc:0.974717\n", 372 | "[265]\ttrain-auc:0.978383\teval-auc:0.974731\n", 373 | "[266]\ttrain-auc:0.978402\teval-auc:0.974751\n", 374 | "[267]\ttrain-auc:0.978414\teval-auc:0.97475\n", 375 | "[268]\ttrain-auc:0.978433\teval-auc:0.974759\n", 376 | "[269]\ttrain-auc:0.978467\teval-auc:0.97477\n", 377 | "[270]\ttrain-auc:0.978501\teval-auc:0.974782\n", 378 | "[271]\ttrain-auc:0.978552\teval-auc:0.974822\n", 379 | "[272]\ttrain-auc:0.978575\teval-auc:0.974838\n", 380 | "[273]\ttrain-auc:0.978587\teval-auc:0.974845\n", 381 | 
"[274]\ttrain-auc:0.978595\teval-auc:0.974843\n", 382 | "[275]\ttrain-auc:0.978607\teval-auc:0.974851\n", 383 | "[276]\ttrain-auc:0.978661\teval-auc:0.974882\n", 384 | "[277]\ttrain-auc:0.978726\teval-auc:0.97493\n", 385 | "[278]\ttrain-auc:0.978749\teval-auc:0.974941\n", 386 | "[279]\ttrain-auc:0.978778\teval-auc:0.974958\n", 387 | "[280]\ttrain-auc:0.978804\teval-auc:0.974971\n", 388 | "[281]\ttrain-auc:0.978833\teval-auc:0.974998\n", 389 | "[282]\ttrain-auc:0.978837\teval-auc:0.974999\n", 390 | "[283]\ttrain-auc:0.978851\teval-auc:0.974991\n", 391 | "[284]\ttrain-auc:0.978876\teval-auc:0.974986\n", 392 | "[285]\ttrain-auc:0.978919\teval-auc:0.975025\n", 393 | "[286]\ttrain-auc:0.978936\teval-auc:0.975018\n", 394 | "[287]\ttrain-auc:0.978943\teval-auc:0.975042\n", 395 | "[288]\ttrain-auc:0.978956\teval-auc:0.975056\n", 396 | "[289]\ttrain-auc:0.978968\teval-auc:0.975069\n", 397 | "[290]\ttrain-auc:0.979003\teval-auc:0.975099\n", 398 | "[291]\ttrain-auc:0.979014\teval-auc:0.975107\n", 399 | "[292]\ttrain-auc:0.979055\teval-auc:0.975141\n", 400 | "[293]\ttrain-auc:0.979068\teval-auc:0.975145\n", 401 | "[294]\ttrain-auc:0.979106\teval-auc:0.975182\n", 402 | "[295]\ttrain-auc:0.979123\teval-auc:0.975187\n", 403 | "[296]\ttrain-auc:0.979158\teval-auc:0.975205\n", 404 | "[297]\ttrain-auc:0.979189\teval-auc:0.975221\n", 405 | "[298]\ttrain-auc:0.979238\teval-auc:0.975272\n", 406 | "[299]\ttrain-auc:0.979276\teval-auc:0.975295\n", 407 | "[300]\ttrain-auc:0.979331\teval-auc:0.975331\n", 408 | "[301]\ttrain-auc:0.979346\teval-auc:0.975349\n", 409 | "[302]\ttrain-auc:0.979402\teval-auc:0.975374\n", 410 | "[303]\ttrain-auc:0.979424\teval-auc:0.97539\n", 411 | "[304]\ttrain-auc:0.979441\teval-auc:0.975395\n", 412 | "[305]\ttrain-auc:0.979457\teval-auc:0.975398\n", 413 | "[306]\ttrain-auc:0.979478\teval-auc:0.975397\n", 414 | "[307]\ttrain-auc:0.979487\teval-auc:0.975406\n", 415 | "[308]\ttrain-auc:0.97954\teval-auc:0.975441\n", 416 | 
"[309]\ttrain-auc:0.979567\teval-auc:0.975446\n", 417 | "[310]\ttrain-auc:0.979579\teval-auc:0.975445\n", 418 | "[311]\ttrain-auc:0.979581\teval-auc:0.975446\n", 419 | "[312]\ttrain-auc:0.979591\teval-auc:0.97546\n", 420 | "[313]\ttrain-auc:0.979596\teval-auc:0.97545\n", 421 | "[314]\ttrain-auc:0.979661\teval-auc:0.975512\n", 422 | "[315]\ttrain-auc:0.979701\teval-auc:0.975553\n", 423 | "[316]\ttrain-auc:0.979726\teval-auc:0.975551\n", 424 | "[317]\ttrain-auc:0.979751\teval-auc:0.975557\n", 425 | "[318]\ttrain-auc:0.979766\teval-auc:0.975567\n", 426 | "[319]\ttrain-auc:0.97982\teval-auc:0.975622\n", 427 | "[320]\ttrain-auc:0.979872\teval-auc:0.975648\n", 428 | "[321]\ttrain-auc:0.979919\teval-auc:0.975681\n", 429 | "[322]\ttrain-auc:0.979969\teval-auc:0.975713\n", 430 | "[323]\ttrain-auc:0.979979\teval-auc:0.975718\n", 431 | "[324]\ttrain-auc:0.979999\teval-auc:0.975726\n", 432 | "[325]\ttrain-auc:0.980037\teval-auc:0.975734\n", 433 | "[326]\ttrain-auc:0.980058\teval-auc:0.975762\n", 434 | "[327]\ttrain-auc:0.980087\teval-auc:0.975757\n", 435 | "[328]\ttrain-auc:0.980087\teval-auc:0.975754\n", 436 | "[329]\ttrain-auc:0.980115\teval-auc:0.975785\n", 437 | "[330]\ttrain-auc:0.98012\teval-auc:0.975779\n", 438 | "[331]\ttrain-auc:0.980128\teval-auc:0.975783\n", 439 | "[332]\ttrain-auc:0.980142\teval-auc:0.975803\n", 440 | "[333]\ttrain-auc:0.980179\teval-auc:0.975823\n", 441 | "[334]\ttrain-auc:0.980199\teval-auc:0.975838\n", 442 | "[335]\ttrain-auc:0.98021\teval-auc:0.975844\n", 443 | "[336]\ttrain-auc:0.980235\teval-auc:0.975879\n", 444 | "[337]\ttrain-auc:0.980258\teval-auc:0.975896\n", 445 | "[338]\ttrain-auc:0.980307\teval-auc:0.975942\n", 446 | "[339]\ttrain-auc:0.980337\teval-auc:0.975959\n", 447 | "[340]\ttrain-auc:0.980371\teval-auc:0.975987\n", 448 | "[341]\ttrain-auc:0.980387\teval-auc:0.975978\n", 449 | "[342]\ttrain-auc:0.980406\teval-auc:0.975988\n", 450 | "[343]\ttrain-auc:0.980431\teval-auc:0.975996\n", 451 | 
"[344]\ttrain-auc:0.980451\teval-auc:0.975996\n", 452 | "[345]\ttrain-auc:0.980491\teval-auc:0.97601\n", 453 | "[346]\ttrain-auc:0.980515\teval-auc:0.976017\n", 454 | "[347]\ttrain-auc:0.980551\teval-auc:0.97605\n", 455 | "[348]\ttrain-auc:0.980552\teval-auc:0.976053\n", 456 | "[349]\ttrain-auc:0.980599\teval-auc:0.976069\n", 457 | "[350]\ttrain-auc:0.98063\teval-auc:0.976085\n", 458 | "[351]\ttrain-auc:0.98066\teval-auc:0.97609\n", 459 | "[352]\ttrain-auc:0.980691\teval-auc:0.976109\n", 460 | "[353]\ttrain-auc:0.980705\teval-auc:0.976119\n", 461 | "[354]\ttrain-auc:0.980706\teval-auc:0.976126\n", 462 | "[355]\ttrain-auc:0.980733\teval-auc:0.976135\n", 463 | "[356]\ttrain-auc:0.98078\teval-auc:0.976186\n", 464 | "[357]\ttrain-auc:0.980799\teval-auc:0.976203\n", 465 | "[358]\ttrain-auc:0.980817\teval-auc:0.976207\n", 466 | "[359]\ttrain-auc:0.980826\teval-auc:0.976207\n", 467 | "[360]\ttrain-auc:0.980842\teval-auc:0.976208\n", 468 | "[361]\ttrain-auc:0.980859\teval-auc:0.976204\n", 469 | "[362]\ttrain-auc:0.980898\teval-auc:0.976226\n", 470 | "[363]\ttrain-auc:0.980906\teval-auc:0.976229\n", 471 | "[364]\ttrain-auc:0.980925\teval-auc:0.976238\n", 472 | "[365]\ttrain-auc:0.980974\teval-auc:0.976281\n", 473 | "[366]\ttrain-auc:0.980999\teval-auc:0.976293\n", 474 | "[367]\ttrain-auc:0.981008\teval-auc:0.976284\n", 475 | "[368]\ttrain-auc:0.981034\teval-auc:0.976293\n", 476 | "[369]\ttrain-auc:0.981042\teval-auc:0.976297\n", 477 | "[370]\ttrain-auc:0.981069\teval-auc:0.976308\n", 478 | "[371]\ttrain-auc:0.981096\teval-auc:0.97632\n", 479 | "[372]\ttrain-auc:0.981112\teval-auc:0.976322\n", 480 | "[373]\ttrain-auc:0.981122\teval-auc:0.976323\n", 481 | "[374]\ttrain-auc:0.981138\teval-auc:0.976339\n", 482 | "[375]\ttrain-auc:0.981142\teval-auc:0.976338\n", 483 | "[376]\ttrain-auc:0.981171\teval-auc:0.976343\n", 484 | "[377]\ttrain-auc:0.981205\teval-auc:0.976368\n", 485 | "[378]\ttrain-auc:0.981233\teval-auc:0.97637\n", 486 | 
"[379]\ttrain-auc:0.981247\teval-auc:0.97637\n", 487 | "[380]\ttrain-auc:0.981276\teval-auc:0.976381\n", 488 | "[381]\ttrain-auc:0.981315\teval-auc:0.976407\n", 489 | "[382]\ttrain-auc:0.981344\teval-auc:0.976415\n", 490 | "[383]\ttrain-auc:0.981352\teval-auc:0.976412\n", 491 | "[384]\ttrain-auc:0.981382\teval-auc:0.976427\n", 492 | "[385]\ttrain-auc:0.981418\teval-auc:0.976445\n", 493 | "[386]\ttrain-auc:0.981449\teval-auc:0.97645\n", 494 | "[387]\ttrain-auc:0.981475\teval-auc:0.97647\n", 495 | "[388]\ttrain-auc:0.981508\teval-auc:0.976505\n", 496 | "[389]\ttrain-auc:0.981524\teval-auc:0.976507\n", 497 | "[390]\ttrain-auc:0.981539\teval-auc:0.97651\n", 498 | "[391]\ttrain-auc:0.981552\teval-auc:0.976512\n", 499 | "[392]\ttrain-auc:0.981565\teval-auc:0.976526\n", 500 | "[393]\ttrain-auc:0.981586\teval-auc:0.976536\n", 501 | "[394]\ttrain-auc:0.981602\teval-auc:0.976557\n", 502 | "[395]\ttrain-auc:0.98164\teval-auc:0.976583\n", 503 | "[396]\ttrain-auc:0.981662\teval-auc:0.976594\n", 504 | "[397]\ttrain-auc:0.981681\teval-auc:0.976605\n", 505 | "[398]\ttrain-auc:0.981717\teval-auc:0.976614\n", 506 | "[399]\ttrain-auc:0.981752\teval-auc:0.97664\n", 507 | "[400]\ttrain-auc:0.981776\teval-auc:0.976648\n", 508 | "[401]\ttrain-auc:0.981799\teval-auc:0.976666\n", 509 | "[402]\ttrain-auc:0.981806\teval-auc:0.976667\n", 510 | "[403]\ttrain-auc:0.981809\teval-auc:0.976668\n", 511 | "[404]\ttrain-auc:0.981842\teval-auc:0.976694\n", 512 | "[405]\ttrain-auc:0.981865\teval-auc:0.976711\n", 513 | "[406]\ttrain-auc:0.981898\teval-auc:0.976723\n", 514 | "[407]\ttrain-auc:0.981931\teval-auc:0.976751\n", 515 | "[408]\ttrain-auc:0.981964\teval-auc:0.976777\n", 516 | "[409]\ttrain-auc:0.981986\teval-auc:0.976768\n", 517 | "[410]\ttrain-auc:0.982\teval-auc:0.976758\n", 518 | "[411]\ttrain-auc:0.982013\teval-auc:0.976757\n", 519 | "[412]\ttrain-auc:0.982024\teval-auc:0.976756\n", 520 | "[413]\ttrain-auc:0.982038\teval-auc:0.976749\n", 521 | 
"[414]\ttrain-auc:0.982042\teval-auc:0.976752\n", 522 | "[415]\ttrain-auc:0.982062\teval-auc:0.976763\n", 523 | "[416]\ttrain-auc:0.982069\teval-auc:0.976755\n", 524 | "[417]\ttrain-auc:0.982076\teval-auc:0.976751\n", 525 | "[418]\ttrain-auc:0.982087\teval-auc:0.976753\n", 526 | "Stopping. Best iteration:\n", 527 | "[408]\ttrain-auc:0.981964\teval-auc:0.976777\n", 528 | "\n" 529 | ] 530 | } 531 | ], 532 | "source": [ 533 | "def create_feature_map(features):\n", 534 | " outfile = open(r'xgb.fmap', 'w')\n", 535 | " i = 0\n", 536 | " for feat in features:\n", 537 | " outfile.write('{0}\\t{1}\\tq\\n'.format(i, feat))\n", 538 | " i = i + 1\n", 539 | " outfile.close()\n", 540 | "\n", 541 | "def xgb_model(train_set):\n", 542 | " actions = pd.read_csv(train_set) #read train_set\n", 543 | " # simply drop the features the previous training round judged useless (absent from the feature-importance list)\n", 544 | " lst_useless = ['brand']\n", 545 | " \n", 546 | " actions.drop(lst_useless, inplace=True, axis=1)\n", 547 | " \n", 548 | " users = actions[['user_id', 'sku_id']].copy()\n", 549 | " labels = actions['label'].copy()\n", 550 | " del actions['user_id']\n", 551 | " del actions['sku_id']\n", 552 | " del actions['label']\n", 553 | " # tried scale_pos_weight to correct the positive/negative class imbalance, but with the sampled pos:neg ratio of 1:10 training scored worse than leaving it at 1\n", 554 | "# ratio = float(np.sum(labels==0)) / np.sum(labels==1)\n", 555 | "# print ratio\n", 556 | " \n", 557 | " # write to feature map\n", 558 | " features = list(actions.columns[:])\n", 559 | " print 'total features: ', len(features)\n", 560 | " create_feature_map(features)\n", 561 | " # pass feature names in at training time\n", 562 | "# features = list(actions.columns.values)\n", 563 | " \n", 564 | " user_index=users\n", 565 | " training_data=actions\n", 566 | " label=labels\n", 567 | " X_train, X_valid, y_train, y_valid = train_test_split(training_data.values, label.values, test_size=0.2, \n", 568 | " random_state=0)\n", 569 | " \n", 570 | " # tried pre-assigned per-sample weights for positive/negative examples to mitigate the class imbalance\n", 571 | "# weights = np.zeros(len(y_train))\n", 572 | "# weights[y_train==0] = 1\n", 573 | "# 
weights[y_train==1] = 10\n", 574 | " \n", 575 | "# dtrain = xgb.DMatrix(X_train, label=y_train, weight=weights)\n", 576 | " dtrain = xgb.DMatrix(X_train, label=y_train)\n", 577 | " dvalid = xgb.DMatrix(X_valid, label=y_valid)\n", 578 | "# dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)\n", 579 | "# dvalid = xgb.DMatrix(X_valid, label=y_valid, feature_names=features)\n", 580 | "# dtrain = xgb.DMatrix(training_data.values, label.values)\n", 581 | " param = {'n_estimators': 4000, 'max_depth': 3, 'min_child_weight': 5, 'gamma': 0, 'subsample': 1.0, \n", 582 | " 'colsample_bytree': 0.8, 'scale_pos_weight':10, 'eta': 0.1, 'silent': 1, 'objective': 'binary:logistic',\n", 583 | " 'eval_metric':'auc'}\n", 584 | "# param = {'n_estimators': 4000, 'max_depth': 6, 'seed': 7, 'min_child_weight': 5, 'gamma': 0, 'subsample': 1.0, \n", 585 | "# 'colsample_bytree': 0.8, 'scale_pos_weight': 1, 'eta': 0.09, 'silent': 1, 'objective': 'binary:logistic',\n", 586 | "# 'eval_metric':'auc'}\n", 587 | " \n", 588 | " num_round = param['n_estimators']\n", 589 | "# param['nthread'] = 4\n", 590 | " # param['eval_metric'] = \"auc\"\n", 591 | " plst = param.items()\n", 592 | " evallist = [(dtrain, 'train'), (dvalid, 'eval')]\n", 593 | " # with dtrain watched last, early stopping never triggers\n", 594 | "# evallist = [(dvalid, 'eval'), (dtrain, 'train')]\n", 595 | "# evallist = [(dtrain, 'train')]\n", 596 | " bst = xgb.train(plst, dtrain, num_round, evallist, early_stopping_rounds=10)\n", 597 | " bst.save_model('bst.model')\n", 598 | " return bst, features\n", 599 | "\n", 600 | "bst_xgb, features = xgb_model('train_set.csv')" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": 211, 606 | "metadata": { 607 | "ExecuteTime": { 608 | "end_time": "2017-05-25T15:17:20.955471Z", 609 | "start_time": "2017-05-25T15:17:20.952599Z" 610 | }, 611 | "collapsed": false 612 | }, 613 | "outputs": [ 614 | { 615 | "name": "stdout", 616 | "output_type": "stream", 617 | "text": [ 618 | "{'best_iteration': '408', 
'best_msg': '[408]\\ttrain-auc:0.981964\\teval-auc:0.976777', 'best_score': '0.976777'}\n" 619 | ] 620 | } 621 | ], 622 | "source": [ 623 | "print bst_xgb.attributes()" 624 | ] 625 | }, 626 | { 627 | "cell_type": "markdown", 628 | "metadata": {}, 629 | "source": [ 630 | "### Offline evaluation on the validation set" 631 | ] 632 | }, 633 | { 634 | "cell_type": "code", 635 | "execution_count": 212, 636 | "metadata": { 637 | "ExecuteTime": { 638 | "end_time": "2017-05-25T15:17:21.018473Z", 639 | "start_time": "2017-05-25T15:17:20.957406Z" 640 | }, 641 | "collapsed": false 642 | }, 643 | "outputs": [], 644 | "source": [ 645 | "def report(pred, label):\n", 646 | "\n", 647 | " actions = label\n", 648 | " result = pred\n", 649 | "\n", 650 | " # all actual (bought) user-item pairs\n", 651 | " all_user_item_pair = actions['user_id'].map(str) + '-' + actions['sku_id'].map(str)\n", 652 | " all_user_item_pair = np.array(all_user_item_pair)\n", 653 | " # all users who actually bought\n", 654 | " all_user_set = actions['user_id'].unique()\n", 655 | "\n", 656 | " # all users predicted to buy\n", 657 | " all_user_test_set = result['user_id'].unique()\n", 658 | " all_user_test_item_pair = result['user_id'].map(str) + '-' + result['sku_id'].map(str)\n", 659 | " all_user_test_item_pair = np.array(all_user_test_item_pair)\n", 660 | "\n", 661 | " # compute the user-level purchase metrics\n", 662 | " pos, neg = 0,0\n", 663 | " for user_id in all_user_test_set:\n", 664 | " if user_id in all_user_set:\n", 665 | " pos += 1\n", 666 | " else:\n", 667 | " neg += 1\n", 668 | " all_user_acc = 1.0 * pos / ( pos + neg)\n", 669 | " all_user_recall = 1.0 * pos / len(all_user_set)\n", 670 | " print 'Precision of predicted buyers: ' + str(all_user_acc)\n", 671 | " print 'Recall of predicted buyers: ' + str(all_user_recall)\n", 672 | "\n", 673 | " pos, neg = 0, 0\n", 674 | " for user_item_pair in all_user_test_item_pair:\n", 675 | " if user_item_pair in all_user_item_pair:\n", 676 | " pos += 1\n", 677 | " else:\n", 678 | " neg += 1\n", 679 | " all_item_acc = 1.0 * pos / ( pos + neg)\n", 680 | " all_item_recall = 1.0 * pos / 
len(all_user_item_pair)\n", 681 | " print 'Precision of predicted user-item pairs: ' + str(all_item_acc)\n", 682 | " print 'Recall of predicted user-item pairs: ' + str(all_item_recall)\n", 683 | " F11 = 6.0 * all_user_recall * all_user_acc / (5.0 * all_user_recall + all_user_acc)\n", 684 | " F12 = 5.0 * all_item_acc * all_item_recall / (2.0 * all_item_recall + 3 * all_item_acc)\n", 685 | " score = 0.4 * F11 + 0.6 * F12\n", 686 | " print 'F11=' + str(F11)\n", 687 | " print 'F12=' + str(F12)\n", 688 | " print 'score=' + str(score)\n", 689 | " \n", 690 | " return all_user_acc, all_user_recall, F11, all_item_acc, all_item_recall, F12, score" 691 | ] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "execution_count": 213, 696 | "metadata": { 697 | "ExecuteTime": { 698 | "end_time": "2017-05-25T15:17:21.095577Z", 699 | "start_time": "2017-05-25T15:17:21.020882Z" 700 | }, 701 | "collapsed": false 702 | }, 703 | "outputs": [], 704 | "source": [ 705 | "def validate(valid_set, val_label, model):\n", 706 | " actions = pd.read_csv(valid_set) #read test_set \n", 707 | "# users = actions[['user_id', 'sku_id']].copy()\n", 708 | " # keep cate so that non-cate-8 items can still be filtered out at the end \n", 709 | " users = actions[['user_id', 'sku_id', 'cate']].copy()\n", 710 | " \n", 711 | " actions['user_id'] = actions['user_id'].astype(np.int64)\n", 712 | "# test_label= actions[actions['label'] == 1]\n", 713 | "\n", 714 | "# test_label= actions[(actions['label']==1) & (actions['cate']==8)]\n", 715 | " test_label = pd.read_csv(val_label)\n", 716 | " \n", 717 | " lst_useless = ['brand']\n", 718 | " \n", 719 | " actions.drop(lst_useless, inplace=True, axis=1)\n", 720 | " \n", 721 | "# test_label = test_label[['user_id','sku_id','label']]\n", 722 | " del actions['user_id']\n", 723 | " del actions['sku_id']\n", 724 | " \n", 725 | "# features = list(actions.columns.values)\n", 726 | " \n", 727 | "# del actions['label']\n", 728 | " sub_user_index = users\n", 729 | "# sub_trainning_data = xgb.DMatrix(actions.values, feature_names=features)\n", 730 | " sub_trainning_data = 
xgb.DMatrix(actions.values)\n", 731 | "# y = model.predict(sub_trainning_data,ntree_limit=model.best_iteration)\n", 732 | " y = model.predict(sub_trainning_data, ntree_limit=model.best_ntree_limit)\n", 733 | " sub_user_index['label'] = y\n", 734 | " \n", 735 | " sub_user_index.to_csv('result_' + valid_set, index=False)\n", 736 | " \n", 737 | "# sub_user_index = sub_user_index[sub_user_index['cate']==8]\n", 738 | "# del sub_user_index['cate']\n", 739 | " rank = 1000\n", 740 | " pred = sub_user_index.sort_values(by='label', ascending=False)[:rank]\n", 741 | "# pred = sub_user_index[sub_user_index['label'] >= 0.05]\n", 742 | " \n", 743 | " print 'No. of raw pred users: ', len(pred['user_id'].unique())\n", 744 | " pred = pred[pred['cate']==8]\n", 745 | " print 'No. of pred users bought cate 8: ', len(pred['user_id'].unique())\n", 746 | " \n", 747 | "# pred = pred[['user_id', 'sku_id']]\n", 748 | " pred = pred[['user_id', 'sku_id', 'label']]\n", 749 | " pred = pred.groupby('user_id').first().reset_index()\n", 750 | " pred['user_id'] = pred['user_id'].astype(int)\n", 751 | " pred['sku_id'] = pred['sku_id'].astype(int)\n", 752 | " \n", 753 | "# print 'No. 
of pred users after deduplicates: ', len(pred['user_id'].unique())\n", 754 | " true_user = len(test_label['user_id']) \n", 755 | " pred_ui = len(pred['user_id'].unique())\n", 756 | " print 'pred item: ', len(pred['sku_id'].unique())\n", 757 | " print 'true users: ', true_user\n", 758 | " print 'pred users: ', pred_ui\n", 759 | " test_label['user_id'] = test_label['user_id'].astype(int)\n", 760 | " test_label['sku_id'] = test_label['sku_id'].astype(int)\n", 761 | "\n", 762 | " all_user_acc, all_user_recall, F11, all_item_acc, all_item_recall, F12, score = report(pred, test_label) \n", 763 | " \n", 764 | " f_name = 'pred_' + str(rank) + '_' + valid_set\n", 765 | " pred.to_csv(f_name, index=False)\n", 766 | " \n", 767 | " return rank, true_user, pred_ui, all_user_acc, all_user_recall, F11, all_item_acc, all_item_recall, F12, score\n", 768 | "\n", 769 | "# validate('val_set.csv', bst_xgb)" 770 | ] 771 | }, 772 | { 773 | "cell_type": "code", 774 | "execution_count": 214, 775 | "metadata": { 776 | "ExecuteTime": { 777 | "end_time": "2017-05-25T15:18:02.727408Z", 778 | "start_time": "2017-05-25T15:17:21.098027Z" 779 | }, 780 | "collapsed": false, 781 | "scrolled": true 782 | }, 783 | "outputs": [ 784 | { 785 | "name": "stdout", 786 | "output_type": "stream", 787 | "text": [ 788 | "No. of raw pred users: 950\n", 789 | "No. of pred users bought cate 8: 950\n", 790 | "pred item: 220\n", 791 | "true users: 1211\n", 792 | "pred users: 950\n", 793 | "Precision of predicted buyers: 0.147368421053\n", 794 | "Recall of predicted buyers: 0.116569525396\n", 795 | "Precision of predicted user-item pairs: 0.108421052632\n", 796 | "Recall of predicted user-item pairs: 0.0850536746491\n", 797 | "F11=0.141152747437\n", 798 | "F12=0.0930778962588\n", 799 | "score=0.11230783673\n", 800 | "-------------------------------------------\n", 801 | "No. of raw pred users: 950\n", 802 | "No. 
of pred users bought cate 8: 950\n", 803 | "pred item: 203\n", 804 | "true users: 1259\n", 805 | "pred users: 950\n", 806 | "Precision of predicted buyers: 0.163157894737\n", 807 | "Recall of predicted buyers: 0.12360446571\n", 808 | "Precision of predicted user-item pairs: 0.116842105263\n", 809 | "Recall of predicted user-item pairs: 0.0881652104845\n", 810 | "F11=0.15489673551\n", 811 | "F12=0.0977629029417\n", 812 | "score=0.120616435969\n", 813 | "-------------------------------------------\n", 814 | "No. of raw pred users: 960\n", 815 | "No. of pred users bought cate 8: 960\n", 816 | "pred item: 219\n", 817 | "true users: 1385\n", 818 | "pred users: 960\n", 819 | "Precision of predicted buyers: 0.161458333333\n", 820 | "Recall of predicted buyers: 0.11231884058\n", 821 | "Precision of predicted user-item pairs: 0.120833333333\n", 822 | "Recall of predicted user-item pairs: 0.0837545126354\n", 823 | "F11=0.150485436893\n", 824 | "F12=0.0954732510288\n", 825 | "score=0.117478125375\n", 826 | "===========================================\n", 827 | "avg user acc: 0.157328216374\n", 828 | "avg user recall: 0.117497610562\n", 829 | "avg item acc: 0.115365497076\n", 830 | "avg item recall: 0.0856577992563\n", 831 | "avg F11: 0.14884497328\n", 832 | "avg F12: 0.0954380167431\n", 833 | "avg score: 0.116800799358\n" 834 | ] 835 | } 836 | ], 837 | "source": [ 838 | "def avg_score():\n", 839 | " rank1, true_user1, pred_ui1, user_acc1, user_recall1, F11_1, item_acc1, item_recall1, F12_1, score1 = validate('val_1.csv', 'label_val_1.csv', bst_xgb)\n", 840 | " print '-------------------------------------------'\n", 841 | " rank2, true_user2, pred_ui2, user_acc2, user_recall2, F11_2, item_acc2, item_recall2, F12_2, score2 = validate('val_2.csv', 'label_val_2.csv', bst_xgb)\n", 842 | " print '-------------------------------------------'\n", 843 | " rank3, true_user3, pred_ui3, user_acc3, user_recall3, F11_3, item_acc3, item_recall3, F12_3, score3 = validate('val_3.csv', 'label_val_3.csv', bst_xgb)\n", 844 | " print '==========================================='\n", 845 | " print 'avg user acc: ', 
(user_acc1+user_acc2+user_acc3)/3\n", 846 | " print 'avg user recall: ', (user_recall1+user_recall2+user_recall3)/3\n", 847 | " print 'avg item acc: ', (item_acc1+item_acc2+item_acc3)/3\n", 848 | " print 'avg item recall: ', (item_recall1+item_recall2+item_recall3)/3\n", 849 | " print 'avg F11: ', (F11_1+F11_2+F11_3)/3\n", 850 | " print 'avg F12: ', (F12_1+F12_2+F12_3)/3\n", 851 | " print 'avg score: ', (score1+score2+score3)/3\n", 852 | " # make the csv file\n", 853 | " dct_score = {}\n", 854 | " dct_score['rank'] = [rank1, rank2, rank3]\n", 855 | " dct_score['true_user'] = [true_user1, true_user2, true_user3]\n", 856 | " dct_score['pred_ui'] = [pred_ui1, pred_ui2, pred_ui3]\n", 857 | " dct_score['user_acc'] = [user_acc1, user_acc2, user_acc3]\n", 858 | " dct_score['user_recall'] = [user_recall1, user_recall2, user_recall3]\n", 859 | " dct_score['F11'] = [F11_1, F11_2, F11_3]\n", 860 | " dct_score['item_acc'] = [item_acc1, item_acc2, item_acc3]\n", 861 | " dct_score['item_recall'] = [item_recall1, item_recall2, item_recall3]\n", 862 | " dct_score['F12'] = [F12_1, F12_2, F12_3]\n", 863 | " dct_score['score'] = [score1, score2, score3]\n", 864 | " column_order = ['rank', 'true_user', 'pred_ui', 'user_acc', 'user_recall', 'item_acc', 'item_recall', 'F11', 'F12', \n", 865 | " 'score']\n", 866 | " df_score = pd.DataFrame(dct_score)\n", 867 | " file_name = 'score_' + str(datetime.now().date())[5:] +'_'+ str(rank1) + '.csv'\n", 868 | " df_score[column_order].to_csv(file_name, index=False)\n", 869 | "avg_score()" 870 | ] 871 | }, 872 | { 873 | "cell_type": "markdown", 874 | "metadata": {}, 875 | "source": [ 876 | "### Output feature importance" 877 | ] 878 | }, 879 | { 880 | "cell_type": "code", 881 | "execution_count": 27, 882 | "metadata": { 883 | "ExecuteTime": { 884 | "end_time": "2017-05-24T19:56:47.018369Z", 885 | "start_time": "2017-05-24T19:56:47.003330Z" 886 | }, 887 | "collapsed": false 888 | }, 889 | "outputs": [], 890 | "source": [ 891 | "def feature_importance(bst_xgb):\n", 
892 | " importance = bst_xgb.get_fscore(fmap=r'xgb.fmap')\n", 893 | " importance = sorted(importance.items(), key=operator.itemgetter(1), reverse=True)\n", 894 | "\n", 895 | " df = pd.DataFrame(importance, columns=['feature', 'fscore'])\n", 896 | " df['fscore'] = df['fscore'] / df['fscore'].sum()\n", 897 | " file_name = 'feature_importance_' + str(datetime.now().date())[5:] + '.csv'\n", 898 | " df.to_csv(file_name)\n", 899 | "\n", 900 | "feature_importance(bst_xgb)" 901 | ] 902 | }, 903 | { 904 | "cell_type": "markdown", 905 | "metadata": {}, 906 | "source": [ 907 | "### Generate the submission file" 908 | ] 909 | }, 910 | { 911 | "cell_type": "code", 912 | "execution_count": 185, 913 | "metadata": { 914 | "ExecuteTime": { 915 | "end_time": "2017-05-25T15:00:26.300556Z", 916 | "start_time": "2017-05-25T15:00:08.688328Z" 917 | }, 918 | "collapsed": false 919 | }, 920 | "outputs": [ 921 | { 922 | "name": "stdout", 923 | "output_type": "stream", 924 | "text": [ 925 | "No. of raw pred users: 1142\n", 926 | "No. of pred users bought cate 8: 1142\n" 927 | ] 928 | } 929 | ], 930 | "source": [ 931 | "# sub_file\n", 932 | "\n", 933 | "def submit(pred_set, model):\n", 934 | " actions = pd.read_csv(pred_set) #read test_set\n", 935 | " \n", 936 | "# print 'total user before: ', len(actions['user_id'].unique())\n", 937 | "# potential = pd.read_csv('potential_user_04-28.csv')\n", 938 | "# lst_user = potential['user_id'].unique().tolist()\n", 939 | "# actions = actions[actions['user_id'].isin(lst_user)]\n", 940 | "# print 'total user after: ', len(actions['user_id'].unique())\n", 941 | " # drop some features up front\n", 942 | " lst_useless = ['brand']\n", 943 | " \n", 944 | " actions.drop(lst_useless, inplace=True, axis=1)\n", 945 | "\n", 946 | " \n", 947 | " users = actions[['user_id', 'sku_id', 'cate']].copy()\n", 948 | "# users = actions[['user_id', 'sku_id']].copy()\n", 949 | " \n", 950 | " actions['user_id'] = actions['user_id'].astype(np.int64)\n", 951 | " del actions['user_id']\n", 952 | " del 
actions['sku_id']\n", 953 | " sub_user_index = users\n", 954 | " sub_trainning_data = xgb.DMatrix(actions.values)\n", 955 | " y = model.predict(sub_trainning_data, ntree_limit=model.best_ntree_limit)\n", 956 | " sub_user_index['label'] = y\n", 957 | " \n", 958 | "# sub_user_index = sub_user_index[sub_user_index['cate']==8]\n", 959 | "# del sub_user_index['cate']\n", 960 | " rank = 1200\n", 961 | " pred = sub_user_index.sort_values(by='label', ascending=False)[:rank]\n", 962 | "# pred = sub_user_index[sub_user_index['label'] >= 0.05]\n", 963 | "# pred = pred[['user_id', 'sku_id', 'label']]\n", 964 | "# pred = pred[pred['label']>0.45]\n", 965 | "\n", 966 | " print 'No. of raw pred users: ', len(pred['user_id'].unique())\n", 967 | " pred = pred[pred['cate']==8]\n", 968 | " print 'No. of pred users bought cate 8: ', len(pred['user_id'].unique())\n", 969 | "\n", 970 | " pred = pred[['user_id', 'sku_id']]\n", 971 | "# print \n", 972 | " pred = pred.groupby('user_id').first().reset_index()\n", 973 | " pred['user_id'] = pred['user_id'].astype(int)\n", 974 | " pred['sku_id'] = pred['sku_id'].astype(int)\n", 975 | " sub_file = 'submission_' + str(rank) + '_' + str(datetime.now().date())[5:] + '.csv'\n", 976 | "# sub_file = 'submission_detail_' + str(datetime.now().date())[5:] + '.csv'\n", 977 | " pred.to_csv(sub_file, index=False, index_label=False) \n", 978 | "\n", 979 | "submit('test_set.csv', bst_xgb)" 980 | ] 981 | }, 982 | { 983 | "cell_type": "markdown", 984 | "metadata": {}, 985 | "source": [ 986 | "### Deduplicate the submission\n", 987 | "Remove user-item pairs that already had a purchase action in the last few days (the filter covers 2016-04-09 through 2016-04-15)" 988 | ] 989 | }, 990 | { 991 | "cell_type": "code", 992 | "execution_count": 186, 993 | "metadata": { 994 | "ExecuteTime": { 995 | "end_time": "2017-05-25T15:01:09.432698Z", 996 | "start_time": "2017-05-25T15:00:32.113207Z" 997 | }, 998 | "collapsed": false 999 | }, 1000 | "outputs": [ 1001 | { 1002 | "name": "stdout", 1003 | "output_type": "stream", 1004 | "text": [ 1005 | "No. of records after remove dup: 1142\n", 1006 | "No. 
of dup: 0\n" 1007 | ] 1008 | } 1009 | ], 1010 | "source": [ 1011 | "from datetime import datetime\n", 1012 | "\n", 1013 | "def sub_improv(action, sub_file):\n", 1014 | " # get the target user-item pairs from the last days of April\n", 1015 | " action_4 = pd.read_csv(action)\n", 1016 | " action_4['time'] = pd.to_datetime(action_4['time']).apply(lambda x: x.date())\n", 1017 | " aim_date = [datetime.strptime(s, '%Y-%m-%d').date() for s in ['2016-04-09', '2016-04-10', '2016-04-11' ,\n", 1018 | " '2016-04-12', '2016-04-13', '2016-04-14', \n", 1019 | " '2016-04-15']]\n", 1020 | " aim_action = action_4[(action_4['type']==4) & (action_4['cate']==8) & (action_4['time'].isin(aim_date))]\n", 1021 | " aim_ui = aim_action['user_id'].map(int).map(str) + '-' + aim_action['sku_id'].map(str)\n", 1022 | " # build user-item keys from the submission\n", 1023 | " sub = pd.read_csv(sub_file)\n", 1024 | " before = sub.shape[0]\n", 1025 | " sub_ui = sub['user_id'].map(str) + '-' + sub['sku_id'].map(str)\n", 1026 | " # intersection\n", 1027 | " lst_aim = aim_ui.unique().tolist()\n", 1028 | " lst_sub = sub_ui.unique().tolist()\n", 1029 | " lst_common = [i for i in lst_aim if i in lst_sub]\n", 1030 | " dct_ui = {i.split('-')[0]: i.split('-')[1] for i in lst_common}\n", 1031 | " # remove the intersection from the submission\n", 1032 | " for k in dct_ui:\n", 1033 | " sub.drop(sub[(sub['user_id']==int(k)) & (sub['sku_id']==int(dct_ui[k]))].index, inplace=True)\n", 1034 | " print 'No. of records after remove dup: ', sub.shape[0]\n", 1035 | " print 'No. 
of dup: ', before - sub.shape[0]\n", 1036 | " if (before - sub.shape[0])!=0:\n", 1037 | " file_name = 'submission_' + str(datetime.now().date())[5:] + '_improv.csv'\n", 1038 | " sub.to_csv(file_name, index=False, index_label=False)\n", 1039 | " \n", 1040 | "sub_improv('data/JData_Action_201604.csv', 'submission_1200_05-25.csv')" 1041 | ] 1042 | }, 1043 | { 1044 | "cell_type": "code", 1045 | "execution_count": null, 1046 | "metadata": { 1047 | "collapsed": true 1048 | }, 1049 | "outputs": [], 1050 | "source": [] 1051 | } 1052 | ], 1053 | "metadata": { 1054 | "kernelspec": { 1055 | "display_name": "Python 2", 1056 | "language": "python", 1057 | "name": "python2" 1058 | }, 1059 | "language_info": { 1060 | "codemirror_mode": { 1061 | "name": "ipython", 1062 | "version": 2 1063 | }, 1064 | "file_extension": ".py", 1065 | "mimetype": "text/x-python", 1066 | "name": "python", 1067 | "nbconvert_exporter": "python", 1068 | "pygments_lexer": "ipython2", 1069 | "version": "2.7.13" 1070 | } 1071 | }, 1072 | "nbformat": 4, 1073 | "nbformat_minor": 2 1074 | } 1075 | -------------------------------------------------------------------------------- /Feature_Engineering-v1.4.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Feature Engineering" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Version 1\n", 15 | "\n", 16 | "* get_action_feat(): keep the category attribute at the end; when concatenating different days, take care to merge on=['user_id', 'sku_id', 'cate']\n", 17 | "* get_accumulate_user_feat(): smooth the conversion-rate computation; fill NaNs before computing the standard deviation; smooth the recent features (what is the most reasonable way?); days_interval -> day\n", 18 | "* user_cate: computing the share of each behavior per cate involves division that yields NaNs; a log transform can be applied to these ratios\n", 19 | "* product_basic: keep this feature; if used, NaN filling needs care. After merging, fill NaNs with -1 so that a1, a2, a3 read as unknown, and check whether brand already uses that code (it does not). One catch: the original a1, a2, a3 are one-hot encoded, so instead of plain -1 we actually fill with 0 (conveniently brand has no such value either); also take care to merge on=['sku_id', 'cate']\n", 20 | "* 
get_accumulate_product_feat(): fix the conversion-rate computation; take care to fix the date-span issue in make_action; keep counts of all behavior types; fill NaNs before computing the standard deviation\n", 21 | "* get_accumulate_cate_feat(): fix the conversion-rate computation; fix the date-span issue; keep all counts; fill NaNs before computing the standard deviation\n", 22 | "* get_comment_product_date(): the comment one-hot encoding misses the zero-comment case (some time windows may not contain a zero-comment state, so this must be handled); fill NaNs after merging; converting the comment date is redundant, a plain equality test suffices; start_date is redundant; also check whether every item in the comment data comes from product (it does not)\n", 23 | "* For the final labeling, start by labeling the user-item pairs where the user bought a cate-8 item; this needs experimentation and further thought" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "## Setup\n", 31 | "* import the required libraries\n", 32 | "* set file paths, etc." 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 1, 38 | "metadata": { 39 | "ExecuteTime": { 40 | "end_time": "2017-05-16T21:16:17.345239Z", 41 | "start_time": "2017-05-16T21:16:15.334700Z" 42 | }, 43 | "collapsed": true 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "#!/usr/bin/env python\n", 48 | "# -*- coding: UTF-8 -*-\n", 49 | "import time\n", 50 | "from datetime import datetime\n", 51 | "from datetime import timedelta\n", 52 | "import pandas as pd\n", 53 | "import pickle\n", 54 | "import os\n", 55 | "import math\n", 56 | "import numpy as np" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 2, 62 | "metadata": { 63 | "ExecuteTime": { 64 | "end_time": "2017-05-16T21:16:17.352419Z", 65 | "start_time": "2017-05-16T21:16:17.346954Z" 66 | }, 67 | "collapsed": true 68 | }, 69 | "outputs": [], 70 | "source": [ 71 | "# action_1_path = r'data/JData_Action_201602.csv'\n", 72 | "# action_2_path = r'data/JData_Action_201603.csv'\n", 73 | "# action_3_path = r'data/JData_Action_201604.csv'\n", 74 | "action_1_path = r'data/actions1.csv'\n", 75 | "action_2_path = r'data/actions2.csv'\n", 76 | "action_3_path = r'data/actions3.csv'\n", 77 | "comment_path = r'data/JData_Comment.csv'\n", 78 | "product_path = r'data/JData_Product.csv'\n", 79 | "# user_path = r'data/JData_User.csv'\n", 80 | "user_path = r'data/user.csv'\n", 81 | "\n", 82 | "comment_date = [\n", 83 | " \"2016-02-01\", \"2016-02-08\", \"2016-02-15\", 
\"2016-02-22\", \"2016-02-29\",\n", 84 | " \"2016-03-07\", \"2016-03-14\", \"2016-03-21\", \"2016-03-28\", \"2016-04-04\",\n", 85 | " \"2016-04-11\", \"2016-04-15\"\n", 86 | "]" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 3, 92 | "metadata": { 93 | "ExecuteTime": { 94 | "end_time": "2017-05-16T21:16:17.463427Z", 95 | "start_time": "2017-05-16T21:16:17.353703Z" 96 | }, 97 | "collapsed": true 98 | }, 99 | "outputs": [], 100 | "source": [ 101 | "def get_actions_1():\n", 102 | " action = pd.read_csv(action_1_path)\n", 103 | " return action\n", 104 | "\n", 105 | "def get_actions_2():\n", 106 | " action2 = pd.read_csv(action_2_path)\n", 107 | " return action2\n", 108 | "\n", 109 | "def get_actions_3():\n", 110 | " action3 = pd.read_csv(action_3_path)\n", 111 | " return action3\n", 112 | "# 读取并拼接所有行为记录文件\n", 113 | "def get_all_action():\n", 114 | " action_1 = get_actions_1()\n", 115 | " action_2 = get_actions_2()\n", 116 | " action_3 = get_actions_3()\n", 117 | " actions = pd.concat([action_1, action_2, action_3]) # type: pd.DataFrame\n", 118 | "# actions = pd.read_csv(action_path)\n", 119 | " return actions\n", 120 | "\n", 121 | "# 获取某个时间段的行为记录\n", 122 | "def get_actions(start_date, end_date, all_actions):\n", 123 | " \"\"\"\n", 124 | " :param start_date:\n", 125 | " :param end_date:\n", 126 | " :return: actions: pd.Dataframe\n", 127 | " \"\"\"\n", 128 | " actions = all_actions[(all_actions.time >= start_date) & (all_actions.time < end_date)].copy()\n", 129 | " return actions" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "## 用户特征\n", 137 | "### 用户基本特征\n", 138 | "获取基本的用户特征,基于用户本身属性多为类别特征的特点,对age,sex,usr_lv_cd进行独热编码操作,对于用户注册时间暂时不处理" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 4, 144 | "metadata": { 145 | "ExecuteTime": { 146 | "end_time": "2017-05-16T21:16:18.813768Z", 147 | "start_time": "2017-05-16T21:16:17.467022Z" 148 | }, 149 | "collapsed": true 150 | 
}, 151 | "outputs": [], 152 | "source": [ 153 | "from sklearn import preprocessing\n", 154 | "\n", 155 | "def get_basic_user_feat():\n", 156 | " # Handle the Chinese characters in the age field: set the encoding when reading, fill nulls, convert age to numeric labels, then one-hot encode; sex is also cast to a numeric type\n", 157 | " user = pd.read_csv(user_path, encoding='gbk')\n", 158 | "# user['age'].fillna('-1', inplace=True)\n", 159 | "# user['sex'].fillna(2, inplace=True)\n", 160 | " user['sex'] = user['sex'].astype(int) \n", 161 | " user['age'] = user['age'].astype(unicode)\n", 162 | " le = preprocessing.LabelEncoder() \n", 163 | " age_df = le.fit_transform(user['age'])\n", 164 | "# print list(le.classes_)\n", 165 | "\n", 166 | " age_df = pd.get_dummies(age_df, prefix='age')\n", 167 | " sex_df = pd.get_dummies(user['sex'], prefix='sex')\n", 168 | " user_lv_df = pd.get_dummies(user['user_lv_cd'], prefix='user_lv_cd')\n", 169 | " user = pd.concat([user['user_id'], age_df, sex_df, user_lv_df], axis=1)\n", 170 | " return user" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "## Product features\n", 178 | "### Basic product features\n", 179 | "Extract basic features from the product file: one-hot encode attributes a1, a2, a3; use the category and brand directly as features" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 5, 185 | "metadata": { 186 | "ExecuteTime": { 187 | "end_time": "2017-05-16T21:16:18.826294Z", 188 | "start_time": "2017-05-16T21:16:18.818260Z" 189 | }, 190 | "collapsed": true 191 | }, 192 | "outputs": [], 193 | "source": [ 194 | "def get_basic_product_feat():\n", 195 | " product = pd.read_csv(product_path)\n", 196 | " attr1_df = pd.get_dummies(product[\"a1\"], prefix=\"a1\")\n", 197 | " attr2_df = pd.get_dummies(product[\"a2\"], prefix=\"a2\")\n", 198 | " attr3_df = pd.get_dummies(product[\"a3\"], prefix=\"a3\")\n", 199 | " product = pd.concat([product[['sku_id', 'cate', 'brand']], attr1_df, attr2_df, attr3_df], axis=1)\n", 200 | " return product" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "## Comment features\n", 208 | "\n", 209 | "* 
Split by time window\n", 210 | "* One-hot encode the comment count" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 6, 216 | "metadata": { 217 | "ExecuteTime": { 218 | "end_time": "2017-05-16T21:16:18.880657Z", 219 | "start_time": "2017-05-16T21:16:18.827667Z" 220 | }, 221 | "collapsed": true 222 | }, 223 | "outputs": [], 224 | "source": [ 225 | "def get_comments_product_feat(end_date):\n", 226 | " comments = pd.read_csv(comment_path)\n", 227 | " comment_date_end = end_date\n", 228 | " comment_date_begin = comment_date[0]\n", 229 | " for date in reversed(comment_date):\n", 230 | " if date < comment_date_end:\n", 231 | " comment_date_begin = date\n", 232 | " break\n", 233 | " comments = comments[comments.dt==comment_date_begin]\n", 234 | " df = pd.get_dummies(comments['comment_num'], prefix='comment_num')\n", 235 | " # Guard against windows that have no rows with comment_num == 0 (this occurred in the test set)\n", 236 | " for i in range(0, 5):\n", 237 | " if 'comment_num_' + str(i) not in df.columns:\n", 238 | " df['comment_num_' + str(i)] = 0\n", 239 | " df = df[['comment_num_0', 'comment_num_1', 'comment_num_2', 'comment_num_3', 'comment_num_4']]\n", 240 | " \n", 241 | " comments = pd.concat([comments, df], axis=1) # type: pd.DataFrame\n", 242 | " #del comments['dt']\n", 243 | " #del comments['comment_num']\n", 244 | " comments = comments[['sku_id', 'has_bad_comment', 'bad_comment_rate','comment_num_0', 'comment_num_1', \n", 245 | " 'comment_num_2', 'comment_num_3', 'comment_num_4']]\n", 246 | " return comments" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "## Action features\n", 254 | "\n", 255 | "* Split by time window\n", 256 | "* One-hot encode the action type\n", 257 | "* Aggregate by user-category and by user-category-sku, then compute\n", 258 | " * the user's action counts on other products in the same category\n", 259 | " * the difference between the user's action counts on the target product in that category and the per-day mean over the window" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 7, 265 | "metadata": { 266 | "ExecuteTime": { 267 | "end_time": "2017-05-16T21:16:18.950287Z", 268 | "start_time": "2017-05-16T21:16:18.882812Z" 269 | }, 270 | 
"collapsed": true 271 | }, 272 | "outputs": [], 273 | "source": [ 274 | "def get_action_feat(start_date, end_date, all_actions, i):\n", 275 | " actions = get_actions(start_date, end_date, all_actions)\n", 276 | " actions = actions[['user_id', 'sku_id', 'cate','type']]\n", 277 | " # Cumulative action counts over different windows (3, 5, 7, 10, 15, 21, 30 days)\n", 278 | " df = pd.get_dummies(actions['type'], prefix='action_before_%s' %i)\n", 279 | " before_date = 'action_before_%s' %i\n", 280 | " actions = pd.concat([actions, df], axis=1) # type: pd.DataFrame\n", 281 | " # Group by user-category-sku: each user's action counts on each product within a category\n", 282 | " actions = actions.groupby(['user_id', 'sku_id','cate'], as_index=False).sum()\n", 283 | " # Group by user-category: each user's action counts on each product category\n", 284 | " user_cate = actions.groupby(['user_id','cate'], as_index=False).sum()\n", 285 | " del user_cate['sku_id']\n", 286 | " del user_cate['type']\n", 287 | " actions = pd.merge(actions, user_cate, how='left', on=['user_id','cate'])\n", 288 | " # Action counts on other products in the same category\n", 289 | " # The two groupings share column names for the same action counts, so pandas suffixes them with _x and _y; the subtraction therefore yields the action counts on other products in the same category\n", 290 | " actions[before_date+'_1_y'] = actions[before_date+'_1_y'] - actions[before_date+'_1_x']\n", 291 | " actions[before_date+'_2_y'] = actions[before_date+'_2_y'] - actions[before_date+'_2_x']\n", 292 | " actions[before_date+'_3_y'] = actions[before_date+'_3_y'] - actions[before_date+'_3_x']\n", 293 | " actions[before_date+'_4_y'] = actions[before_date+'_4_y'] - actions[before_date+'_4_x']\n", 294 | " actions[before_date+'_5_y'] = actions[before_date+'_5_y'] - actions[before_date+'_5_x']\n", 295 | " actions[before_date+'_6_y'] = actions[before_date+'_6_y'] - actions[before_date+'_6_x']\n", 296 | " # Difference between the user's per-product counts in a category and the per-day mean of those counts over the window\n", 297 | " actions[before_date+'minus_mean_1'] = actions[before_date+'_1_x'] - (actions[before_date+'_1_x']/i)\n", 298 | " actions[before_date+'minus_mean_2'] = actions[before_date+'_2_x'] - (actions[before_date+'_2_x']/i)\n", 299 | " actions[before_date+'minus_mean_3'] = actions[before_date+'_3_x'] - 
(actions[before_date+'_3_x']/i)\n", 300 | " actions[before_date+'minus_mean_4'] = actions[before_date+'_4_x'] - (actions[before_date+'_4_x']/i)\n", 301 | " actions[before_date+'minus_mean_5'] = actions[before_date+'_5_x'] - (actions[before_date+'_5_x']/i)\n", 302 | " actions[before_date+'minus_mean_6'] = actions[before_date+'_6_x'] - (actions[before_date+'_6_x']/i)\n", 303 | " del actions['type']\n", 304 | " # Keep the cate feature\n", 305 | "# del actions['cate']\n", 306 | "\n", 307 | " return actions" 308 | ] 309 | }, 310 | { 311 | "cell_type": "markdown", 312 | "metadata": {}, 313 | "source": [ 314 | "### User-action\n", 315 | "#### Cumulative user features\n", 316 | "\n", 317 | "* Split by time window\n", 318 | "* For each user action type:\n", 319 | " * purchase conversion rate\n", 320 | " * mean\n", 321 | " * standard deviation" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 8, 327 | "metadata": { 328 | "ExecuteTime": { 329 | "end_time": "2017-05-16T21:16:19.036710Z", 330 | "start_time": "2017-05-16T21:16:18.952454Z" 331 | }, 332 | "collapsed": true 333 | }, 334 | "outputs": [], 335 | "source": [ 336 | "def get_accumulate_user_feat(end_date, all_actions, day):\n", 337 | " start_date = datetime.strptime(end_date, '%Y-%m-%d') - timedelta(days=day)\n", 338 | " start_date = start_date.strftime('%Y-%m-%d')\n", 339 | " before_date = 'user_action_%s' % day\n", 340 | "\n", 341 | " feature = [\n", 342 | " 'user_id', before_date + '_1', before_date + '_2', before_date + '_3',\n", 343 | " before_date + '_4', before_date + '_5', before_date + '_6',\n", 344 | " before_date + '_1_ratio', before_date + '_2_ratio',\n", 345 | " before_date + '_3_ratio', before_date + '_5_ratio',\n", 346 | " before_date + '_6_ratio', before_date + '_1_mean',\n", 347 | " before_date + '_2_mean', before_date + '_3_mean',\n", 348 | " before_date + '_4_mean', before_date + '_5_mean',\n", 349 | " before_date + '_6_mean', before_date + '_1_std',\n", 350 | " before_date + '_2_std', before_date + '_3_std', before_date + '_4_std',\n", 351 | " before_date + '_5_std', before_date 
+ '_6_std'\n", 352 | " ]\n", 353 | "\n", 354 | " actions = get_actions(start_date, end_date, all_actions)\n", 355 | " df = pd.get_dummies(actions['type'], prefix=before_date)\n", 356 | "\n", 357 | "# actions['date'] = pd.to_datetime(actions['time']).apply(lambda x: x.date())\n", 358 | "\n", 359 | " actions = pd.concat([actions[['user_id', 'date']], df], axis=1)\n", 360 | " # Group by user and date to compute the per-day standard deviation of each action\n", 361 | " actions_date = actions.groupby(['user_id', 'date']).sum()\n", 362 | " actions_date = actions_date.unstack()\n", 363 | " actions_date.fillna(0, inplace=True)\n", 364 | " action_1 = np.std(actions_date[before_date + '_1'], axis=1)\n", 365 | " action_1 = action_1.to_frame()\n", 366 | " action_1.columns = [before_date + '_1_std']\n", 367 | " action_2 = np.std(actions_date[before_date + '_2'], axis=1)\n", 368 | " action_2 = action_2.to_frame()\n", 369 | " action_2.columns = [before_date + '_2_std']\n", 370 | " action_3 = np.std(actions_date[before_date + '_3'], axis=1)\n", 371 | " action_3 = action_3.to_frame()\n", 372 | " action_3.columns = [before_date + '_3_std']\n", 373 | " action_4 = np.std(actions_date[before_date + '_4'], axis=1)\n", 374 | " action_4 = action_4.to_frame()\n", 375 | " action_4.columns = [before_date + '_4_std']\n", 376 | " action_5 = np.std(actions_date[before_date + '_5'], axis=1)\n", 377 | " action_5 = action_5.to_frame()\n", 378 | " action_5.columns = [before_date + '_5_std']\n", 379 | " action_6 = np.std(actions_date[before_date + '_6'], axis=1)\n", 380 | " action_6 = action_6.to_frame()\n", 381 | " action_6.columns = [before_date + '_6_std']\n", 382 | " actions_date = pd.concat(\n", 383 | " [action_1, action_2, action_3, action_4, action_5, action_6], axis=1)\n", 384 | " actions_date['user_id'] = actions_date.index\n", 385 | " # Group by user: conversion rate and mean of each action type\n", 386 | " actions = actions.groupby(['user_id'], as_index=False).sum()\n", 387 | "# days_interal = (datetime.strptime(end_date, '%Y-%m-%d') -\n", 388 | "# datetime.strptime(start_date, 
'%Y-%m-%d')).days\n", 389 | " # Conversion rate (log ratio)\n", 390 | "# actions[before_date + '_1_ratio'] = actions[before_date +\n", 391 | "# '_4'] / actions[before_date +\n", 392 | "# '_1']\n", 393 | "# actions[before_date + '_2_ratio'] = actions[before_date +\n", 394 | "# '_4'] / actions[before_date +\n", 395 | "# '_2']\n", 396 | "# actions[before_date + '_3_ratio'] = actions[before_date +\n", 397 | "# '_4'] / actions[before_date +\n", 398 | "# '_3']\n", 399 | "# actions[before_date + '_5_ratio'] = actions[before_date +\n", 400 | "# '_4'] / actions[before_date +\n", 401 | "# '_5']\n", 402 | "# actions[before_date + '_6_ratio'] = actions[before_date +\n", 403 | "# '_4'] / actions[before_date +\n", 404 | "# '_6']\n", 405 | " actions[before_date + '_1_ratio'] = np.log(1 + actions[before_date + '_4']) - np.log(1 + actions[before_date +'_1'])\n", 406 | " actions[before_date + '_2_ratio'] = np.log(1 + actions[before_date + '_4']) - np.log(1 + actions[before_date +'_2'])\n", 407 | " actions[before_date + '_3_ratio'] = np.log(1 + actions[before_date + '_4']) - np.log(1 + actions[before_date +'_3'])\n", 408 | " actions[before_date + '_5_ratio'] = np.log(1 + actions[before_date + '_4']) - np.log(1 + actions[before_date +'_5'])\n", 409 | " actions[before_date + '_6_ratio'] = np.log(1 + actions[before_date + '_4']) - np.log(1 + actions[before_date +'_6'])\n", 410 | " # Mean per day\n", 411 | " actions[before_date + '_1_mean'] = actions[before_date + '_1'] / day\n", 412 | " actions[before_date + '_2_mean'] = actions[before_date + '_2'] / day\n", 413 | " actions[before_date + '_3_mean'] = actions[before_date + '_3'] / day\n", 414 | " actions[before_date + '_4_mean'] = actions[before_date + '_4'] / day\n", 415 | " actions[before_date + '_5_mean'] = actions[before_date + '_5'] / day\n", 416 | " actions[before_date + '_6_mean'] = actions[before_date + '_6'] / day\n", 417 | " actions = pd.merge(actions, actions_date, how='left', on='user_id')\n", 418 | " actions = actions[feature]\n", 419 | " return actions" 420 
| ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": {}, 425 | "source": [ 426 | "#### Recent user-action features\n", 427 | "\n", 428 | "On top of the cumulative user features above, extract features over the last month and the last three days, then compute the share of the month's actions that fall outside the last three days" 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": 9, 434 | "metadata": { 435 | "ExecuteTime": { 436 | "end_time": "2017-05-16T21:16:19.108485Z", 437 | "start_time": "2017-05-16T21:16:19.037711Z" 438 | }, 439 | "collapsed": true 440 | }, 441 | "outputs": [], 442 | "source": [ 443 | "def get_recent_user_feat(end_date, all_actions):\n", 444 | " actions_3 = get_accumulate_user_feat(end_date, all_actions, 3)\n", 445 | " actions_30 = get_accumulate_user_feat(end_date, all_actions, 30)\n", 446 | " actions = pd.merge(actions_3, actions_30, how ='left', on='user_id')\n", 447 | " del actions_3\n", 448 | " del actions_30\n", 449 | " \n", 450 | " actions['recent_action1'] = np.log(1 + actions['user_action_30_1']-actions['user_action_3_1']) - np.log(1 + actions['user_action_30_1'])\n", 451 | " actions['recent_action2'] = np.log(1 + actions['user_action_30_2']-actions['user_action_3_2']) - np.log(1 + actions['user_action_30_2'])\n", 452 | " actions['recent_action3'] = np.log(1 + actions['user_action_30_3']-actions['user_action_3_3']) - np.log(1 + actions['user_action_30_3'])\n", 453 | " actions['recent_action4'] = np.log(1 + actions['user_action_30_4']-actions['user_action_3_4']) - np.log(1 + actions['user_action_30_4'])\n", 454 | " actions['recent_action5'] = np.log(1 + actions['user_action_30_5']-actions['user_action_3_5']) - np.log(1 + actions['user_action_30_5'])\n", 455 | " actions['recent_action6'] = np.log(1 + actions['user_action_30_6']-actions['user_action_3_6']) - np.log(1 + actions['user_action_30_6'])\n", 456 | " \n", 457 | "# actions['recent_action1'] = (actions['user_action_30_1']-actions['user_action_3_1'])/actions['user_action_30_1']\n", 458 | "# actions['recent_action2'] = 
(actions['user_action_30_2']-actions['user_action_3_2'])/actions['user_action_30_2']\n", 459 | "# actions['recent_action3'] = (actions['user_action_30_3']-actions['user_action_3_3'])/actions['user_action_30_3']\n", 460 | "# actions['recent_action4'] = (actions['user_action_30_4']-actions['user_action_3_4'])/actions['user_action_30_4']\n", 461 | "# actions['recent_action5'] = (actions['user_action_30_5']-actions['user_action_3_5'])/actions['user_action_30_5']\n", 462 | "# actions['recent_action6'] = (actions['user_action_30_6']-actions['user_action_3_6'])/actions['user_action_30_6']\n", 463 | " \n", 464 | " return actions" 465 | ] 466 | }, 467 | { 468 | "cell_type": "markdown", 469 | "metadata": {}, 470 | "source": [ 471 | "#### User actions across products in the same category\n", 472 | "* Counts of each user action per category\n", 473 | "* Each category's share of the user's actions over all categories" 474 | ] 475 | }, 476 | { 477 | "cell_type": "code", 478 | "execution_count": 10, 479 | "metadata": { 480 | "ExecuteTime": { 481 | "end_time": "2017-05-16T21:16:19.184116Z", 482 | "start_time": "2017-05-16T21:16:19.110701Z" 483 | }, 484 | "collapsed": true 485 | }, 486 | "outputs": [], 487 | "source": [ 488 | "# Adds user-category interaction features\n", 489 | "def get_user_cate_feature(start_date, end_date, all_actions):\n", 490 | " actions = get_actions(start_date, end_date, all_actions)\n", 491 | " actions = actions[['user_id', 'cate', 'type']]\n", 492 | " df = pd.get_dummies(actions['type'], prefix='type')\n", 493 | " actions = pd.concat([actions[['user_id', 'cate']], df], axis=1)\n", 494 | " actions = actions.groupby(['user_id', 'cate']).sum()\n", 495 | " actions = actions.unstack()\n", 496 | " actions.columns = actions.columns.swaplevel(0, 1)\n", 497 | " actions.columns = actions.columns.droplevel()\n", 498 | " actions.columns = [\n", 499 | " 'cate_4_type1', 'cate_5_type1', 'cate_6_type1', 'cate_7_type1',\n", 500 | " 'cate_8_type1', 'cate_9_type1', 'cate_10_type1', 'cate_11_type1',\n", 501 | " 'cate_4_type2', 'cate_5_type2', 'cate_6_type2', 'cate_7_type2',\n", 502 | 
'cate_8_type2', 'cate_9_type2', 'cate_10_type2', 'cate_11_type2',\n", 503 | " 'cate_4_type3', 'cate_5_type3', 'cate_6_type3', 'cate_7_type3',\n", 504 | " 'cate_8_type3', 'cate_9_type3', 'cate_10_type3', 'cate_11_type3',\n", 505 | " 'cate_4_type4', 'cate_5_type4', 'cate_6_type4', 'cate_7_type4',\n", 506 | " 'cate_8_type4', 'cate_9_type4', 'cate_10_type4', 'cate_11_type4',\n", 507 | " 'cate_4_type5', 'cate_5_type5', 'cate_6_type5', 'cate_7_type5',\n", 508 | " 'cate_8_type5', 'cate_9_type5', 'cate_10_type5', 'cate_11_type5',\n", 509 | " 'cate_4_type6', 'cate_5_type6', 'cate_6_type6', 'cate_7_type6',\n", 510 | " 'cate_8_type6', 'cate_9_type6', 'cate_10_type6', 'cate_11_type6'\n", 511 | " ]\n", 512 | " actions = actions.fillna(0)\n", 513 | " actions['cate_action_sum'] = actions.sum(axis=1)\n", 514 | " actions['cate8_percentage'] = (\n", 515 | " actions['cate_8_type1'] + actions['cate_8_type2'] +\n", 516 | " actions['cate_8_type3'] + actions['cate_8_type4'] +\n", 517 | " actions['cate_8_type5'] + actions['cate_8_type6']\n", 518 | " ) / actions['cate_action_sum']\n", 519 | " actions['cate4_percentage'] = (\n", 520 | " actions['cate_4_type1'] + actions['cate_4_type2'] +\n", 521 | " actions['cate_4_type3'] + actions['cate_4_type4'] +\n", 522 | " actions['cate_4_type5'] + actions['cate_4_type6']\n", 523 | " ) / actions['cate_action_sum']\n", 524 | " actions['cate5_percentage'] = (\n", 525 | " actions['cate_5_type1'] + actions['cate_5_type2'] +\n", 526 | " actions['cate_5_type3'] + actions['cate_5_type4'] +\n", 527 | " actions['cate_5_type5'] + actions['cate_5_type6']\n", 528 | " ) / actions['cate_action_sum']\n", 529 | " actions['cate6_percentage'] = (\n", 530 | " actions['cate_6_type1'] + actions['cate_6_type2'] +\n", 531 | " actions['cate_6_type3'] + actions['cate_6_type4'] +\n", 532 | " actions['cate_6_type5'] + actions['cate_6_type6']\n", 533 | " ) / actions['cate_action_sum']\n", 534 | " actions['cate7_percentage'] = (\n", 535 | " actions['cate_7_type1'] + 
actions['cate_7_type2'] +\n", 536 | " actions['cate_7_type3'] + actions['cate_7_type4'] +\n", 537 | " actions['cate_7_type5'] + actions['cate_7_type6']\n", 538 | " ) / actions['cate_action_sum']\n", 539 | " actions['cate9_percentage'] = (\n", 540 | " actions['cate_9_type1'] + actions['cate_9_type2'] +\n", 541 | " actions['cate_9_type3'] + actions['cate_9_type4'] +\n", 542 | " actions['cate_9_type5'] + actions['cate_9_type6']\n", 543 | " ) / actions['cate_action_sum']\n", 544 | " actions['cate10_percentage'] = (\n", 545 | " actions['cate_10_type1'] + actions['cate_10_type2'] +\n", 546 | " actions['cate_10_type3'] + actions['cate_10_type4'] +\n", 547 | " actions['cate_10_type5'] + actions['cate_10_type6']\n", 548 | " ) / actions['cate_action_sum']\n", 549 | " actions['cate11_percentage'] = (\n", 550 | " actions['cate_11_type1'] + actions['cate_11_type2'] +\n", 551 | " actions['cate_11_type3'] + actions['cate_11_type4'] +\n", 552 | " actions['cate_11_type5'] + actions['cate_11_type6']\n", 553 | " ) / actions['cate_action_sum']\n", 554 | "\n", 555 | " actions['cate8_type1_percentage'] = np.log(\n", 556 | " 1 + actions['cate_8_type1']) - np.log(\n", 557 | " 1 + actions['cate_8_type1'] + actions['cate_4_type1'] +\n", 558 | " actions['cate_5_type1'] + actions['cate_6_type1'] +\n", 559 | " actions['cate_7_type1'] + actions['cate_9_type1'] +\n", 560 | " actions['cate_10_type1'] + actions['cate_11_type1'])\n", 561 | "\n", 562 | " actions['cate8_type2_percentage'] = np.log(\n", 563 | " 1 + actions['cate_8_type2']) - np.log(\n", 564 | " 1 + actions['cate_8_type2'] + actions['cate_4_type2'] +\n", 565 | " actions['cate_5_type2'] + actions['cate_6_type2'] +\n", 566 | " actions['cate_7_type2'] + actions['cate_9_type2'] +\n", 567 | " actions['cate_10_type2'] + actions['cate_11_type2'])\n", 568 | " actions['cate8_type3_percentage'] = np.log(\n", 569 | " 1 + actions['cate_8_type3']) - np.log(\n", 570 | " 1 + actions['cate_8_type3'] + actions['cate_4_type3'] +\n", 571 | " 
actions['cate_5_type3'] + actions['cate_6_type3'] +\n", 572 | " actions['cate_7_type3'] + actions['cate_9_type3'] +\n", 573 | " actions['cate_10_type3'] + actions['cate_11_type3'])\n", 574 | " actions['cate8_type4_percentage'] = np.log(\n", 575 | " 1 + actions['cate_8_type4']) - np.log(\n", 576 | " 1 + actions['cate_8_type4'] + actions['cate_4_type4'] +\n", 577 | " actions['cate_5_type4'] + actions['cate_6_type4'] +\n", 578 | " actions['cate_7_type4'] + actions['cate_9_type4'] +\n", 579 | " actions['cate_10_type4'] + actions['cate_11_type4'])\n", 580 | " actions['cate8_type5_percentage'] = np.log(\n", 581 | " 1 + actions['cate_8_type5']) - np.log(\n", 582 | " 1 + actions['cate_8_type5'] + actions['cate_4_type5'] +\n", 583 | " actions['cate_5_type5'] + actions['cate_6_type5'] +\n", 584 | " actions['cate_7_type5'] + actions['cate_9_type5'] +\n", 585 | " actions['cate_10_type5'] + actions['cate_11_type5'])\n", 586 | " actions['cate8_type6_percentage'] = np.log(\n", 587 | " 1 + actions['cate_8_type6']) - np.log(\n", 588 | " 1 + actions['cate_8_type6'] + actions['cate_4_type6'] +\n", 589 | " actions['cate_5_type6'] + actions['cate_6_type6'] +\n", 590 | " actions['cate_7_type6'] + actions['cate_9_type6'] +\n", 591 | " actions['cate_10_type6'] + actions['cate_11_type6'])\n", 592 | " actions['user_id'] = actions.index\n", 593 | " actions = actions[[\n", 594 | " 'user_id', 'cate8_percentage', 'cate4_percentage', 'cate5_percentage',\n", 595 | " 'cate6_percentage', 'cate7_percentage', 'cate9_percentage',\n", 596 | " 'cate10_percentage', 'cate11_percentage', 'cate8_type1_percentage',\n", 597 | " 'cate8_type2_percentage', 'cate8_type3_percentage',\n", 598 | " 'cate8_type4_percentage', 'cate8_type5_percentage',\n", 599 | " 'cate8_type6_percentage'\n", 600 | " ]]\n", 601 | " return actions" 602 | ] 603 | }, 604 | { 605 | "cell_type": "markdown", 606 | "metadata": {}, 607 | "source": [ 608 | "### Product-action\n", 609 | "#### Cumulative product features\n", 610 | "* Split by time window\n", 611 | "* For each product action type:\n", 612 | " * 
purchase conversion rate\n", 613 | " * mean\n", 614 | " * standard deviation" 615 | ] 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": 11, 620 | "metadata": { 621 | "ExecuteTime": { 622 | "end_time": "2017-05-16T21:16:19.251417Z", 623 | "start_time": "2017-05-16T21:16:19.185064Z" 624 | }, 625 | "collapsed": true 626 | }, 627 | "outputs": [], 628 | "source": [ 629 | "def get_accumulate_product_feat(start_date, end_date, all_actions):\n", 630 | " feature = [\n", 631 | " 'sku_id', 'product_action_1', 'product_action_2',\n", 632 | " 'product_action_3', 'product_action_4',\n", 633 | " 'product_action_5', 'product_action_6',\n", 634 | " 'product_action_1_ratio', 'product_action_2_ratio',\n", 635 | " 'product_action_3_ratio', 'product_action_5_ratio',\n", 636 | " 'product_action_6_ratio', 'product_action_1_mean',\n", 637 | " 'product_action_2_mean', 'product_action_3_mean',\n", 638 | " 'product_action_4_mean', 'product_action_5_mean',\n", 639 | " 'product_action_6_mean', 'product_action_1_std',\n", 640 | " 'product_action_2_std', 'product_action_3_std', 'product_action_4_std',\n", 641 | " 'product_action_5_std', 'product_action_6_std'\n", 642 | " ]\n", 643 | "\n", 644 | " actions = get_actions(start_date, end_date, all_actions)\n", 645 | " df = pd.get_dummies(actions['type'], prefix='product_action')\n", 646 | " # Group by sku and date to compute the per-day standard deviation of each action over the window\n", 647 | "# actions['date'] = pd.to_datetime(actions['time']).apply(lambda x: x.date())\n", 648 | " actions = pd.concat([actions[['sku_id', 'date']], df], axis=1)\n", 649 | " actions_date = actions.groupby(['sku_id', 'date']).sum()\n", 650 | " actions_date = actions_date.unstack()\n", 651 | " actions_date.fillna(0, inplace=True)\n", 652 | " action_1 = np.std(actions_date['product_action_1'], axis=1)\n", 653 | " action_1 = action_1.to_frame()\n", 654 | " action_1.columns = ['product_action_1_std']\n", 655 | " action_2 = np.std(actions_date['product_action_2'], axis=1)\n", 656 | " action_2 = action_2.to_frame()\n", 657 | " action_2.columns = 
['product_action_2_std']\n", 658 | " action_3 = np.std(actions_date['product_action_3'], axis=1)\n", 659 | " action_3 = action_3.to_frame()\n", 660 | " action_3.columns = ['product_action_3_std']\n", 661 | " action_4 = np.std(actions_date['product_action_4'], axis=1)\n", 662 | " action_4 = action_4.to_frame()\n", 663 | " action_4.columns = ['product_action_4_std']\n", 664 | " action_5 = np.std(actions_date['product_action_5'], axis=1)\n", 665 | " action_5 = action_5.to_frame()\n", 666 | " action_5.columns = ['product_action_5_std']\n", 667 | " action_6 = np.std(actions_date['product_action_6'], axis=1)\n", 668 | " action_6 = action_6.to_frame()\n", 669 | " action_6.columns = ['product_action_6_std']\n", 670 | " actions_date = pd.concat(\n", 671 | " [action_1, action_2, action_3, action_4, action_5, action_6], axis=1)\n", 672 | " actions_date['sku_id'] = actions_date.index\n", 673 | "\n", 674 | " actions = actions.groupby(['sku_id'], as_index=False).sum()\n", 675 | " days_interal = (datetime.strptime(end_date, '%Y-%m-%d') - datetime.strptime(start_date, '%Y-%m-%d')).days\n", 676 | " # Group by sku and compute the purchase conversion rate\n", 677 | "# actions['product_action_1_ratio'] = actions['product_action_4'] / actions[\n", 678 | "# 'product_action_1']\n", 679 | "# actions['product_action_2_ratio'] = actions['product_action_4'] / actions[\n", 680 | "# 'product_action_2']\n", 681 | "# actions['product_action_3_ratio'] = actions['product_action_4'] / actions[\n", 682 | "# 'product_action_3']\n", 683 | "# actions['product_action_5_ratio'] = actions['product_action_4'] / actions[\n", 684 | "# 'product_action_5']\n", 685 | "# actions['product_action_6_ratio'] = actions['product_action_4'] / actions[\n", 686 | "# 'product_action_6']\n", 687 | " actions['product_action_1_ratio'] = np.log(1 + actions['product_action_4']) - np.log(1 + actions['product_action_1'])\n", 688 | " actions['product_action_2_ratio'] = np.log(1 + actions['product_action_4']) - np.log(1 + actions['product_action_2'])\n", 689 | 
actions['product_action_3_ratio'] = np.log(1 + actions['product_action_4']) - np.log(1 + actions['product_action_3'])\n", 690 | " actions['product_action_5_ratio'] = np.log(1 + actions['product_action_4']) - np.log(1 + actions['product_action_5'])\n", 691 | " actions['product_action_6_ratio'] = np.log(1 + actions['product_action_4']) - np.log(1 + actions['product_action_6'])\n", 692 | " # Mean of each action type\n", 693 | " actions['product_action_1_mean'] = actions[\n", 694 | " 'product_action_1'] / days_interal\n", 695 | " actions['product_action_2_mean'] = actions[\n", 696 | " 'product_action_2'] / days_interal\n", 697 | " actions['product_action_3_mean'] = actions[\n", 698 | " 'product_action_3'] / days_interal\n", 699 | " actions['product_action_4_mean'] = actions[\n", 700 | " 'product_action_4'] / days_interal\n", 701 | " actions['product_action_5_mean'] = actions[\n", 702 | " 'product_action_5'] / days_interal\n", 703 | " actions['product_action_6_mean'] = actions[\n", 704 | " 'product_action_6'] / days_interal\n", 705 | " actions = pd.merge(actions, actions_date, how='left', on='sku_id')\n", 706 | " actions = actions[feature]\n", 707 | " return actions" 708 | ] 709 | }, 710 | { 711 | "cell_type": "markdown", 712 | "metadata": {}, 713 | "source": [ 714 | "### Category features\n", 715 | "For each product category, per time window:\n", 716 | "* purchase conversion rate\n", 717 | "* standard deviation\n", 718 | "* mean" 719 | ] 720 | }, 721 | { 722 | "cell_type": "code", 723 | "execution_count": 12, 724 | "metadata": { 725 | "ExecuteTime": { 726 | "end_time": "2017-05-16T21:16:19.317213Z", 727 | "start_time": "2017-05-16T21:16:19.252514Z" 728 | }, 729 | "collapsed": true 730 | }, 731 | "outputs": [], 732 | "source": [ 733 | "def get_accumulate_cate_feat(start_date, end_date, all_actions):\n", 734 | " feature = ['cate','cate_action_1', 'cate_action_2', 'cate_action_3', 'cate_action_4', 'cate_action_5', \n", 735 | " 'cate_action_6', 'cate_action_1_ratio', 'cate_action_2_ratio', \n", 736 | " 'cate_action_3_ratio', 'cate_action_5_ratio', 
'cate_action_6_ratio', 'cate_action_1_mean',\n", 737 | " 'cate_action_2_mean', 'cate_action_3_mean', 'cate_action_4_mean', 'cate_action_5_mean',\n", 738 | " 'cate_action_6_mean', 'cate_action_1_std', 'cate_action_2_std', 'cate_action_3_std',\n", 739 | " 'cate_action_4_std', 'cate_action_5_std', 'cate_action_6_std']\n", 740 | " actions = get_actions(start_date, end_date, all_actions)\n", 741 | "# actions['date'] = pd.to_datetime(actions['time']).apply(lambda x: x.date())\n", 742 | " df = pd.get_dummies(actions['type'], prefix='cate_action')\n", 743 | " actions = pd.concat([actions[['cate','date']], df], axis=1)\n", 744 | " # Group by cate and date to compute the per-day standard deviation of each action over the window\n", 745 | " actions_date = actions.groupby(['cate','date']).sum()\n", 746 | " actions_date = actions_date.unstack()\n", 747 | " actions_date.fillna(0, inplace=True)\n", 748 | " action_1 = np.std(actions_date['cate_action_1'], axis=1)\n", 749 | " action_1 = action_1.to_frame()\n", 750 | " action_1.columns = ['cate_action_1_std']\n", 751 | " action_2 = np.std(actions_date['cate_action_2'], axis=1)\n", 752 | " action_2 = action_2.to_frame()\n", 753 | " action_2.columns = ['cate_action_2_std']\n", 754 | " action_3 = np.std(actions_date['cate_action_3'], axis=1)\n", 755 | " action_3 = action_3.to_frame()\n", 756 | " action_3.columns = ['cate_action_3_std']\n", 757 | " action_4 = np.std(actions_date['cate_action_4'], axis=1)\n", 758 | " action_4 = action_4.to_frame()\n", 759 | " action_4.columns = ['cate_action_4_std']\n", 760 | " action_5 = np.std(actions_date['cate_action_5'], axis=1)\n", 761 | " action_5 = action_5.to_frame()\n", 762 | " action_5.columns = ['cate_action_5_std']\n", 763 | " action_6 = np.std(actions_date['cate_action_6'], axis=1)\n", 764 | " action_6 = action_6.to_frame()\n", 765 | " action_6.columns = ['cate_action_6_std']\n", 766 | " actions_date = pd.concat([action_1, action_2, action_3, action_4, action_5, action_6], axis=1)\n", 767 | " actions_date['cate'] = actions_date.index\n", 768 | " # 
Group by cate to compute the conversion rate of each action per category\n", 769 | " actions = actions.groupby(['cate'], as_index=False).sum()\n", 770 | " days_interal = (datetime.strptime(end_date, '%Y-%m-%d')-datetime.strptime(start_date, '%Y-%m-%d')).days\n", 771 | " \n", 772 | "# actions['cate_action_1_ratio'] = actions['cate_action_4'] / actions['cate_action_1']\n", 773 | "# actions['cate_action_2_ratio'] = actions['cate_action_4'] / actions['cate_action_2']\n", 774 | "# actions['cate_action_3_ratio'] = actions['cate_action_4'] / actions['cate_action_3']\n", 775 | "# actions['cate_action_5_ratio'] = actions['cate_action_4'] / actions['cate_action_5']\n", 776 | "# actions['cate_action_6_ratio'] = actions['cate_action_4'] / actions['cate_action_6']\n", 777 | " actions['cate_action_1_ratio'] =(np.log(1 + actions['cate_action_4']) - np.log(1 + actions['cate_action_1']))\n", 778 | " actions['cate_action_2_ratio'] =(np.log(1 + actions['cate_action_4']) - np.log(1 + actions['cate_action_2']))\n", 779 | " actions['cate_action_3_ratio'] =(np.log(1 + actions['cate_action_4']) - np.log(1 + actions['cate_action_3']))\n", 780 | " actions['cate_action_5_ratio'] =(np.log(1 + actions['cate_action_4']) - np.log(1 + actions['cate_action_5']))\n", 781 | " actions['cate_action_6_ratio'] =(np.log(1 + actions['cate_action_4']) - np.log(1 + actions['cate_action_6']))\n", 782 | " # Group by cate to compute the mean of each action per category over the window\n", 783 | " actions['cate_action_1_mean'] = actions['cate_action_1'] / days_interal\n", 784 | " actions['cate_action_2_mean'] = actions['cate_action_2'] / days_interal\n", 785 | " actions['cate_action_3_mean'] = actions['cate_action_3'] / days_interal\n", 786 | " actions['cate_action_4_mean'] = actions['cate_action_4'] / days_interal\n", 787 | " actions['cate_action_5_mean'] = actions['cate_action_5'] / days_interal\n", 788 | " actions['cate_action_6_mean'] = actions['cate_action_6'] / days_interal\n", 789 | " actions = pd.merge(actions, actions_date, how ='left',on='cate')\n", 790 | " actions = actions[feature]\n", 791 | 
return actions" 792 | ] 793 | }, 794 | { 795 | "cell_type": "markdown", 796 | "metadata": {}, 797 | "source": [ 798 | "## Building the training/test sets" 799 | ] 800 | }, 801 | { 802 | "cell_type": "markdown", 803 | "metadata": {}, 804 | "source": [ 805 | "### Building the training/validation sets\n", 806 | "* Labels: using a sliding window, rows with a purchase action are labelled 1 when building the training set\n", 807 | "* Merge the features" 808 | ] 809 | }, 810 | { 811 | "cell_type": "code", 812 | "execution_count": 13, 813 | "metadata": { 814 | "ExecuteTime": { 815 | "end_time": "2017-05-16T21:16:19.383390Z", 816 | "start_time": "2017-05-16T21:16:19.318110Z" 817 | }, 818 | "collapsed": true 819 | }, 820 | "outputs": [], 821 | "source": [ 822 | "def get_labels(start_date, end_date, all_actions):\n", 823 | " actions = get_actions(start_date, end_date, all_actions)\n", 824 | "# actions = actions[actions['type'] == 4]\n", 825 | " # changed: only predict users purchasing category-8 items\n", 826 | " actions = actions[(actions['type'] == 4) & (actions['cate']==8)]\n", 827 | " \n", 828 | " actions = actions.groupby(['user_id', 'sku_id'], as_index=False).sum()\n", 829 | " actions['label'] = 1\n", 830 | " actions = actions[['user_id', 'sku_id', 'label']]\n", 831 | " return actions" 832 | ] 833 | }, 834 | { 835 | "cell_type": "markdown", 836 | "metadata": {}, 837 | "source": [ 838 | "Build the training set" 839 | ] 840 | }, 841 | { 842 | "cell_type": "code", 843 | "execution_count": 14, 844 | "metadata": { 845 | "ExecuteTime": { 846 | "end_time": "2017-05-16T21:16:19.439236Z", 847 | "start_time": "2017-05-16T21:16:19.384369Z" 848 | }, 849 | "collapsed": true 850 | }, 851 | "outputs": [], 852 | "source": [ 853 | "def make_actions(user, product, all_actions, train_start_date):\n", 854 | " train_end_date = datetime.strptime(train_start_date, '%Y-%m-%d') + timedelta(days=3)\n", 855 | " train_end_date = train_end_date.strftime('%Y-%m-%d')\n", 856 | " # fix the time span used for prod_acc and cate_acc\n", 857 | " start_days = datetime.strptime(train_end_date, '%Y-%m-%d') - timedelta(days=30)\n", 858 | " start_days = start_days.strftime('%Y-%m-%d')\n", 859 | " print train_end_date\n", 
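The sliding-window scheme here uses a 3-day feature window, with labels taken from the purchases in the following 5 days. The date arithmetic can be sketched standalone; the helper name `window_bounds` is hypothetical, introduced only to illustrate the windowing:

```python
from datetime import datetime, timedelta

def window_bounds(train_start_date, feat_days=3, label_days=5):
    # Feature window: [train_start_date, train_end_date)
    train_end = datetime.strptime(train_start_date, '%Y-%m-%d') + timedelta(days=feat_days)
    # Label window starts where the feature window ends
    label_end = train_end + timedelta(days=label_days)
    return train_end.strftime('%Y-%m-%d'), label_end.strftime('%Y-%m-%d')

print(window_bounds('2016-03-01'))  # ('2016-03-04', '2016-03-09')
```

These bounds match the pairs of dates printed per round in the training log.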
860 | " user_acc = get_recent_user_feat(train_end_date, all_actions)\n", 861 | " print 'get_recent_user_feat finsihed'\n", 862 | " \n", 863 | " user_cate = get_user_cate_feature(train_start_date, train_end_date, all_actions)\n", 864 | " print 'get_user_cate_feature finished'\n", 865 | " \n", 866 | " product_acc = get_accumulate_product_feat(start_days, train_end_date, all_actions)\n", 867 | " print 'get_accumulate_product_feat finsihed'\n", 868 | " cate_acc = get_accumulate_cate_feat(start_days, train_end_date, all_actions)\n", 869 | " print 'get_accumulate_cate_feat finsihed'\n", 870 | " comment_acc = get_comments_product_feat(train_end_date)\n", 871 | " print 'get_comments_product_feat finished'\n", 872 | " # labels\n", 873 | " test_start_date = train_end_date\n", 874 | " test_end_date = datetime.strptime(test_start_date, '%Y-%m-%d') + timedelta(days=5)\n", 875 | " test_end_date = test_end_date.strftime('%Y-%m-%d')\n", 876 | " labels = get_labels(test_start_date, test_end_date, all_actions)\n", 877 | " print \"get labels\"\n", 878 | " \n", 879 | " actions = None\n", 880 | " for i in (3, 5, 7, 10, 15, 21, 30):\n", 881 | " start_days = datetime.strptime(train_end_date, '%Y-%m-%d') - timedelta(days=i)\n", 882 | " start_days = start_days.strftime('%Y-%m-%d')\n", 883 | " if actions is None:\n", 884 | " actions = get_action_feat(start_days, train_end_date, all_actions, i)\n", 885 | " else:\n", 886 | " # note the merge keys here\n", 887 | " actions = pd.merge(actions, get_action_feat(start_days, train_end_date, all_actions, i), how='left',\n", 888 | " on=['user_id', 'sku_id', 'cate'])\n", 889 | "\n", 890 | " actions = pd.merge(actions, user, how='left', on='user_id')\n", 891 | " actions = pd.merge(actions, user_acc, how='left', on='user_id')\n", 892 | " actions = pd.merge(actions, user_cate, how='left', on='user_id')\n", 893 | " # note the merge keys here\n", 894 | " actions = pd.merge(actions, product, how='left', on=['sku_id', 'cate'])\n", 895 | " actions = pd.merge(actions, product_acc, how='left', 
on='sku_id')\n", 896 | " actions = pd.merge(actions, cate_acc, how='left', on='cate')\n", 897 | " actions = pd.merge(actions, comment_acc, how='left', on='sku_id')\n", 898 | " actions = pd.merge(actions, labels, how='left', on=['user_id', 'sku_id'])\n", 899 | " # mainly fills the NaNs introduced by merging basic product features, comment features and labels\n", 900 | " actions = actions.fillna(0)\n", 901 | "# return actions\n", 902 | " # sampling\n", 903 | " action_positive = actions[actions['label'] == 1]\n", 904 | " action_negative = actions[actions['label'] == 0]\n", 905 | " del actions\n", 906 | " neg_len = len(action_positive) * 10\n", 907 | " action_negative = action_negative.sample(n=neg_len)\n", 908 | " action_sample = pd.concat([action_positive, action_negative], ignore_index=True) \n", 909 | " \n", 910 | " return action_sample" 911 | ] 912 | }, 913 | { 914 | "cell_type": "code", 915 | "execution_count": 15, 916 | "metadata": { 917 | "ExecuteTime": { 918 | "end_time": "2017-05-16T21:16:19.509168Z", 919 | "start_time": "2017-05-16T21:16:19.440133Z" 920 | }, 921 | "collapsed": true 922 | }, 923 | "outputs": [], 924 | "source": [ 925 | "def make_train_set(train_start_date, setNums, f_path):\n", 926 | " train_actions = None\n", 927 | " all_actions = get_all_action()\n", 928 | " print \"get all actions!\"\n", 929 | " user = get_basic_user_feat()\n", 930 | " print 'get_basic_user_feat finsihed'\n", 931 | " product = get_basic_product_feat()\n", 932 | " print 'get_basic_product_feat finsihed'\n", 933 | " # sliding window: build several training/validation sets\n", 934 | " for i in range(setNums):\n", 935 | " print train_start_date\n", 936 | " if train_actions is None:\n", 937 | " train_actions = make_actions(user, product, all_actions, train_start_date)\n", 938 | " else:\n", 939 | " train_actions = pd.concat([train_actions, make_actions(user, product, all_actions, train_start_date)],\n", 940 | " ignore_index=True)\n", 941 | " # shift the window by one day for the next round\n", 942 | " train_start_date = datetime.strptime(train_start_date, '%Y-%m-%d') + timedelta(days=1)\n", 943 | " train_start_date = 
train_start_date.strftime('%Y-%m-%d')\n", 944 | " print \"round {0}/{1} over!\".format(i+1, setNums)\n", 945 | "\n", 946 | " train_actions.to_csv(f_path, index=False)" 947 | ] 948 | }, 949 | { 950 | "cell_type": "code", 951 | "execution_count": 16, 952 | "metadata": { 953 | "ExecuteTime": { 954 | "end_time": "2017-05-16T22:34:29.544825Z", 955 | "start_time": "2017-05-16T21:16:19.510386Z" 956 | }, 957 | "collapsed": false 958 | }, 959 | "outputs": [ 960 | { 961 | "name": "stdout", 962 | "output_type": "stream", 963 | "text": [ 964 | "get all actions!\n", 965 | "get_basic_user_feat finsihed\n", 966 | "get_basic_product_feat finsihed\n", 967 | "2016-03-01\n", 968 | "2016-03-04\n", 969 | "get_recent_user_feat finsihed\n", 970 | "get_user_cate_feature finished\n", 971 | "get_accumulate_product_feat finsihed\n", 972 | "get_accumulate_cate_feat finsihed\n", 973 | "get_comments_product_feat finished\n", 974 | "get labels\n", 975 | "round 1/34 over!\n", 976 | "2016-03-02\n", 977 | "2016-03-05\n", 978 | "get_recent_user_feat finsihed\n", 979 | "get_user_cate_feature finished\n", 980 | "get_accumulate_product_feat finsihed\n", 981 | "get_accumulate_cate_feat finsihed\n", 982 | "get_comments_product_feat finished\n", 983 | "get labels\n", 984 | "round 2/34 over!\n", 985 | "2016-03-03\n", 986 | "2016-03-06\n", 987 | "get_recent_user_feat finsihed\n", 988 | "get_user_cate_feature finished\n", 989 | "get_accumulate_product_feat finsihed\n", 990 | "get_accumulate_cate_feat finsihed\n", 991 | "get_comments_product_feat finished\n", 992 | "get labels\n", 993 | "round 3/34 over!\n", 994 | "2016-03-04\n", 995 | "2016-03-07\n", 996 | "get_recent_user_feat finsihed\n", 997 | "get_user_cate_feature finished\n", 998 | "get_accumulate_product_feat finsihed\n", 999 | "get_accumulate_cate_feat finsihed\n", 1000 | "get_comments_product_feat finished\n", 1001 | "get labels\n", 1002 | "round 4/34 over!\n", 1003 | "2016-03-05\n", 1004 | "2016-03-08\n", 1005 | "get_recent_user_feat finsihed\n", 
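Each round logged here ends with `make_actions` keeping every positive row and sampling negatives at roughly 10:1. That sampling step can be sketched in isolation; the `downsample` helper name, the `min()` guard, and the fixed `random_state` are illustrative additions, not part of the notebook:

```python
import pandas as pd

def downsample(df, ratio=10, seed=0):
    pos = df[df['label'] == 1]
    neg = df[df['label'] == 0]
    # Keep at most ratio x as many negatives as positives
    n = min(len(neg), len(pos) * ratio)
    neg = neg.sample(n=n, random_state=seed)
    return pd.concat([pos, neg], ignore_index=True)

toy = pd.DataFrame({'label': [1] * 2 + [0] * 50})
print(len(downsample(toy)))  # 22
```

With 2 positives and 50 negatives, 20 negatives are kept, giving 22 rows in total.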
1006 | "get_user_cate_feature finished\n", 1007 | "get_accumulate_product_feat finsihed\n", 1008 | "get_accumulate_cate_feat finsihed\n", 1009 | "get_comments_product_feat finished\n", 1010 | "get labels\n", 1011 | "round 5/34 over!\n", 1012 | "2016-03-06\n", 1013 | "2016-03-09\n", 1014 | "get_recent_user_feat finsihed\n", 1015 | "get_user_cate_feature finished\n", 1016 | "get_accumulate_product_feat finsihed\n", 1017 | "get_accumulate_cate_feat finsihed\n", 1018 | "get_comments_product_feat finished\n", 1019 | "get labels\n", 1020 | "round 6/34 over!\n", 1021 | "2016-03-07\n", 1022 | "2016-03-10\n", 1023 | "get_recent_user_feat finsihed\n", 1024 | "get_user_cate_feature finished\n", 1025 | "get_accumulate_product_feat finsihed\n", 1026 | "get_accumulate_cate_feat finsihed\n", 1027 | "get_comments_product_feat finished\n", 1028 | "get labels\n", 1029 | "round 7/34 over!\n", 1030 | "2016-03-08\n", 1031 | "2016-03-11\n", 1032 | "get_recent_user_feat finsihed\n", 1033 | "get_user_cate_feature finished\n", 1034 | "get_accumulate_product_feat finsihed\n", 1035 | "get_accumulate_cate_feat finsihed\n", 1036 | "get_comments_product_feat finished\n", 1037 | "get labels\n", 1038 | "round 8/34 over!\n", 1039 | "2016-03-09\n", 1040 | "2016-03-12\n", 1041 | "get_recent_user_feat finsihed\n", 1042 | "get_user_cate_feature finished\n", 1043 | "get_accumulate_product_feat finsihed\n", 1044 | "get_accumulate_cate_feat finsihed\n", 1045 | "get_comments_product_feat finished\n", 1046 | "get labels\n", 1047 | "round 9/34 over!\n", 1048 | "2016-03-10\n", 1049 | "2016-03-13\n", 1050 | "get_recent_user_feat finsihed\n", 1051 | "get_user_cate_feature finished\n", 1052 | "get_accumulate_product_feat finsihed\n", 1053 | "get_accumulate_cate_feat finsihed\n", 1054 | "get_comments_product_feat finished\n", 1055 | "get labels\n", 1056 | "round 10/34 over!\n", 1057 | "2016-03-11\n", 1058 | "2016-03-14\n", 1059 | "get_recent_user_feat finsihed\n", 1060 | "get_user_cate_feature finished\n", 1061 
| "get_accumulate_product_feat finsihed\n", 1062 | "get_accumulate_cate_feat finsihed\n", 1063 | "get_comments_product_feat finished\n", 1064 | "get labels\n", 1065 | "round 11/34 over!\n", 1066 | "2016-03-12\n", 1067 | "2016-03-15\n", 1068 | "get_recent_user_feat finsihed\n", 1069 | "get_user_cate_feature finished\n", 1070 | "get_accumulate_product_feat finsihed\n", 1071 | "get_accumulate_cate_feat finsihed\n", 1072 | "get_comments_product_feat finished\n", 1073 | "get labels\n", 1074 | "round 12/34 over!\n", 1075 | "2016-03-13\n", 1076 | "2016-03-16\n", 1077 | "get_recent_user_feat finsihed\n", 1078 | "get_user_cate_feature finished\n", 1079 | "get_accumulate_product_feat finsihed\n", 1080 | "get_accumulate_cate_feat finsihed\n", 1081 | "get_comments_product_feat finished\n", 1082 | "get labels\n", 1083 | "round 13/34 over!\n", 1084 | "2016-03-14\n", 1085 | "2016-03-17\n", 1086 | "get_recent_user_feat finsihed\n", 1087 | "get_user_cate_feature finished\n", 1088 | "get_accumulate_product_feat finsihed\n", 1089 | "get_accumulate_cate_feat finsihed\n", 1090 | "get_comments_product_feat finished\n", 1091 | "get labels\n", 1092 | "round 14/34 over!\n", 1093 | "2016-03-15\n", 1094 | "2016-03-18\n", 1095 | "get_recent_user_feat finsihed\n", 1096 | "get_user_cate_feature finished\n", 1097 | "get_accumulate_product_feat finsihed\n", 1098 | "get_accumulate_cate_feat finsihed\n", 1099 | "get_comments_product_feat finished\n", 1100 | "get labels\n", 1101 | "round 15/34 over!\n", 1102 | "2016-03-16\n", 1103 | "2016-03-19\n", 1104 | "get_recent_user_feat finsihed\n", 1105 | "get_user_cate_feature finished\n", 1106 | "get_accumulate_product_feat finsihed\n", 1107 | "get_accumulate_cate_feat finsihed\n", 1108 | "get_comments_product_feat finished\n", 1109 | "get labels\n", 1110 | "round 16/34 over!\n", 1111 | "2016-03-17\n", 1112 | "2016-03-20\n", 1113 | "get_recent_user_feat finsihed\n", 1114 | "get_user_cate_feature finished\n", 1115 | "get_accumulate_product_feat finsihed\n", 
1116 | "get_accumulate_cate_feat finsihed\n", 1117 | "get_comments_product_feat finished\n", 1118 | "get labels\n", 1119 | "round 17/34 over!\n", 1120 | "2016-03-18\n", 1121 | "2016-03-21\n", 1122 | "get_recent_user_feat finsihed\n", 1123 | "get_user_cate_feature finished\n", 1124 | "get_accumulate_product_feat finsihed\n", 1125 | "get_accumulate_cate_feat finsihed\n", 1126 | "get_comments_product_feat finished\n", 1127 | "get labels\n", 1128 | "round 18/34 over!\n", 1129 | "2016-03-19\n", 1130 | "2016-03-22\n", 1131 | "get_recent_user_feat finsihed\n", 1132 | "get_user_cate_feature finished\n", 1133 | "get_accumulate_product_feat finsihed\n", 1134 | "get_accumulate_cate_feat finsihed\n", 1135 | "get_comments_product_feat finished\n", 1136 | "get labels\n", 1137 | "round 19/34 over!\n", 1138 | "2016-03-20\n", 1139 | "2016-03-23\n", 1140 | "get_recent_user_feat finsihed\n", 1141 | "get_user_cate_feature finished\n", 1142 | "get_accumulate_product_feat finsihed\n", 1143 | "get_accumulate_cate_feat finsihed\n", 1144 | "get_comments_product_feat finished\n", 1145 | "get labels\n", 1146 | "round 20/34 over!\n", 1147 | "2016-03-21\n", 1148 | "2016-03-24\n", 1149 | "get_recent_user_feat finsihed\n", 1150 | "get_user_cate_feature finished\n", 1151 | "get_accumulate_product_feat finsihed\n", 1152 | "get_accumulate_cate_feat finsihed\n", 1153 | "get_comments_product_feat finished\n", 1154 | "get labels\n", 1155 | "round 21/34 over!\n", 1156 | "2016-03-22\n", 1157 | "2016-03-25\n", 1158 | "get_recent_user_feat finsihed\n", 1159 | "get_user_cate_feature finished\n", 1160 | "get_accumulate_product_feat finsihed\n", 1161 | "get_accumulate_cate_feat finsihed\n", 1162 | "get_comments_product_feat finished\n", 1163 | "get labels\n", 1164 | "round 22/34 over!\n", 1165 | "2016-03-23\n", 1166 | "2016-03-26\n", 1167 | "get_recent_user_feat finsihed\n", 1168 | "get_user_cate_feature finished\n", 1169 | "get_accumulate_product_feat finsihed\n", 1170 | "get_accumulate_cate_feat 
finsihed\n", 1171 | "get_comments_product_feat finished\n", 1172 | "get labels\n", 1173 | "round 23/34 over!\n", 1174 | "2016-03-24\n", 1175 | "2016-03-27\n", 1176 | "get_recent_user_feat finsihed\n", 1177 | "get_user_cate_feature finished\n", 1178 | "get_accumulate_product_feat finsihed\n", 1179 | "get_accumulate_cate_feat finsihed\n", 1180 | "get_comments_product_feat finished\n", 1181 | "get labels\n", 1182 | "round 24/34 over!\n", 1183 | "2016-03-25\n", 1184 | "2016-03-28\n", 1185 | "get_recent_user_feat finsihed\n", 1186 | "get_user_cate_feature finished\n", 1187 | "get_accumulate_product_feat finsihed\n", 1188 | "get_accumulate_cate_feat finsihed\n", 1189 | "get_comments_product_feat finished\n", 1190 | "get labels\n", 1191 | "round 25/34 over!\n", 1192 | "2016-03-26\n", 1193 | "2016-03-29\n", 1194 | "get_recent_user_feat finsihed\n", 1195 | "get_user_cate_feature finished\n", 1196 | "get_accumulate_product_feat finsihed\n", 1197 | "get_accumulate_cate_feat finsihed\n", 1198 | "get_comments_product_feat finished\n", 1199 | "get labels\n", 1200 | "round 26/34 over!\n", 1201 | "2016-03-27\n", 1202 | "2016-03-30\n", 1203 | "get_recent_user_feat finsihed\n", 1204 | "get_user_cate_feature finished\n", 1205 | "get_accumulate_product_feat finsihed\n", 1206 | "get_accumulate_cate_feat finsihed\n", 1207 | "get_comments_product_feat finished\n", 1208 | "get labels\n", 1209 | "round 27/34 over!\n", 1210 | "2016-03-28\n", 1211 | "2016-03-31\n", 1212 | "get_recent_user_feat finsihed\n", 1213 | "get_user_cate_feature finished\n", 1214 | "get_accumulate_product_feat finsihed\n", 1215 | "get_accumulate_cate_feat finsihed\n", 1216 | "get_comments_product_feat finished\n", 1217 | "get labels\n", 1218 | "round 28/34 over!\n", 1219 | "2016-03-29\n", 1220 | "2016-04-01\n", 1221 | "get_recent_user_feat finsihed\n", 1222 | "get_user_cate_feature finished\n", 1223 | "get_accumulate_product_feat finsihed\n", 1224 | "get_accumulate_cate_feat finsihed\n", 1225 | 
"get_comments_product_feat finished\n", 1226 | "get labels\n", 1227 | "round 29/34 over!\n", 1228 | "2016-03-30\n", 1229 | "2016-04-02\n", 1230 | "get_recent_user_feat finsihed\n", 1231 | "get_user_cate_feature finished\n", 1232 | "get_accumulate_product_feat finsihed\n", 1233 | "get_accumulate_cate_feat finsihed\n", 1234 | "get_comments_product_feat finished\n", 1235 | "get labels\n", 1236 | "round 30/34 over!\n", 1237 | "2016-03-31\n", 1238 | "2016-04-03\n", 1239 | "get_recent_user_feat finsihed\n", 1240 | "get_user_cate_feature finished\n", 1241 | "get_accumulate_product_feat finsihed\n", 1242 | "get_accumulate_cate_feat finsihed\n", 1243 | "get_comments_product_feat finished\n", 1244 | "get labels\n", 1245 | "round 31/34 over!\n", 1246 | "2016-04-01\n", 1247 | "2016-04-04\n", 1248 | "get_recent_user_feat finsihed\n", 1249 | "get_user_cate_feature finished\n", 1250 | "get_accumulate_product_feat finsihed\n", 1251 | "get_accumulate_cate_feat finsihed\n", 1252 | "get_comments_product_feat finished\n", 1253 | "get labels\n", 1254 | "round 32/34 over!\n", 1255 | "2016-04-02\n", 1256 | "2016-04-05\n", 1257 | "get_recent_user_feat finsihed\n", 1258 | "get_user_cate_feature finished\n", 1259 | "get_accumulate_product_feat finsihed\n", 1260 | "get_accumulate_cate_feat finsihed\n", 1261 | "get_comments_product_feat finished\n", 1262 | "get labels\n", 1263 | "round 33/34 over!\n", 1264 | "2016-04-03\n", 1265 | "2016-04-06\n", 1266 | "get_recent_user_feat finsihed\n", 1267 | "get_user_cate_feature finished\n", 1268 | "get_accumulate_product_feat finsihed\n", 1269 | "get_accumulate_cate_feat finsihed\n", 1270 | "get_comments_product_feat finished\n", 1271 | "get labels\n", 1272 | "round 34/34 over!\n" 1273 | ] 1274 | } 1275 | ], 1276 | "source": [ 1277 | "# training set\n", 1278 | "train_start_date = '2016-03-01'\n", 1279 | "make_train_set(train_start_date, 34, 'train_set.csv')" 1280 | ] 1281 | }, 1282 | { 1283 | "cell_type": "markdown", 1284 | "metadata": {}, 1285 | "source": [ 
1286 | "Build the validation sets (offline test sets)" 1287 | ] 1288 | }, 1289 | { 1290 | "cell_type": "code", 1291 | "execution_count": 17, 1292 | "metadata": { 1293 | "ExecuteTime": { 1294 | "end_time": "2017-05-16T22:34:29.585821Z", 1295 | "start_time": "2017-05-16T22:34:29.546019Z" 1296 | }, 1297 | "collapsed": true 1298 | }, 1299 | "outputs": [], 1300 | "source": [ 1301 | "def make_val_answer(val_start_date, val_end_date, all_actions, label_val_s1_path):\n", 1302 | " actions = get_actions(val_start_date, val_end_date,all_actions)\n", 1303 | " actions = actions[(actions['type'] == 4) & (actions['cate'] == 8)]\n", 1304 | " actions = actions[['user_id', 'sku_id']]\n", 1305 | " actions = actions.drop_duplicates()\n", 1306 | " actions.to_csv(label_val_s1_path, index=False)\n", 1307 | "\n", 1308 | "def make_val_set(train_start_date, train_end_date, val_s1_path):\n", 1309 | " # adjust the time span\n", 1310 | " start_days = datetime.strptime(train_end_date, '%Y-%m-%d') - timedelta(days=30)\n", 1311 | " start_days = start_days.strftime('%Y-%m-%d')\n", 1312 | " all_actions = get_all_action()\n", 1313 | " print \"get all actions!\"\n", 1314 | " user = get_basic_user_feat()\n", 1315 | " print 'get_basic_user_feat finsihed'\n", 1316 | " \n", 1317 | " product = get_basic_product_feat()\n", 1318 | " print 'get_basic_product_feat finsihed'\n", 1319 | "# user_acc = get_accumulate_user_feat(train_end_date,all_actions,30)\n", 1320 | "# print 'get_accumulate_user_feat finished'\n", 1321 | " user_acc = get_recent_user_feat(train_end_date, all_actions)\n", 1322 | " print 'get_recent_user_feat finsihed'\n", 1323 | " user_cate = get_user_cate_feature(train_start_date, train_end_date, all_actions)\n", 1324 | " print 'get_user_cate_feature finished'\n", 1325 | " \n", 1326 | " product_acc = get_accumulate_product_feat(start_days, train_end_date, all_actions)\n", 1327 | " print 'get_accumulate_product_feat finsihed'\n", 1328 | " cate_acc = get_accumulate_cate_feat(start_days, train_end_date, all_actions)\n", 1329 | " print 
'get_accumulate_cate_feat finsihed'\n", 1330 | " comment_acc = get_comments_product_feat(train_end_date)\n", 1331 | " print 'get_comments_product_feat finished'\n", 1332 | " \n", 1333 | " actions = None\n", 1334 | " for i in (3, 5, 7, 10, 15, 21, 30):\n", 1335 | " start_days = datetime.strptime(train_end_date, '%Y-%m-%d') - timedelta(days=i)\n", 1336 | " start_days = start_days.strftime('%Y-%m-%d')\n", 1337 | " if actions is None:\n", 1338 | " actions = get_action_feat(start_days, train_end_date, all_actions,i)\n", 1339 | " else:\n", 1340 | " actions = pd.merge(actions, get_action_feat(start_days, train_end_date,all_actions,i), how='left',\n", 1341 | " on=['user_id', 'sku_id', 'cate'])\n", 1342 | "\n", 1343 | " actions = pd.merge(actions, user, how='left', on='user_id')\n", 1344 | " actions = pd.merge(actions, user_acc, how='left', on='user_id')\n", 1345 | " actions = pd.merge(actions, user_cate, how='left', on='user_id')\n", 1346 | " # note the merge keys here\n", 1347 | " actions = pd.merge(actions, product, how='left', on=['sku_id', 'cate'])\n", 1348 | " actions = pd.merge(actions, product_acc, how='left', on='sku_id')\n", 1349 | " actions = pd.merge(actions, cate_acc, how='left', on='cate')\n", 1350 | " actions = pd.merge(actions, comment_acc, how='left', on='sku_id')\n", 1351 | " actions = actions.fillna(0)\n", 1352 | " \n", 1353 | " \n", 1354 | "# print actions\n", 1355 | " # build the ground-truth purchases for later validation\n", 1356 | " val_start_date = train_end_date\n", 1357 | " val_end_date = datetime.strptime(val_start_date, '%Y-%m-%d') + timedelta(days=5)\n", 1358 | " val_end_date = val_end_date.strftime('%Y-%m-%d')\n", 1359 | " make_val_answer(val_start_date, val_end_date, all_actions, 'label_'+val_s1_path)\n", 1360 | " \n", 1361 | " actions.to_csv(val_s1_path, index=False)\n" 1362 | ] 1363 | }, 1364 | { 1365 | "cell_type": "code", 1366 | "execution_count": 18, 1367 | "metadata": { 1368 | "ExecuteTime": { 1369 | "end_time": "2017-05-16T22:44:56.918322Z", 1370 | "start_time": 
"2017-05-16T22:34:29.587106Z" 1371 | }, 1372 | "collapsed": false 1373 | }, 1374 | "outputs": [ 1375 | { 1376 | "name": "stdout", 1377 | "output_type": "stream", 1378 | "text": [ 1379 | "get all actions!\n", 1380 | "get_basic_user_feat finsihed\n", 1381 | "get_basic_product_feat finsihed\n", 1382 | "get_recent_user_feat finsihed\n", 1383 | "get_user_cate_feature finished\n", 1384 | "get_accumulate_product_feat finsihed\n", 1385 | "get_accumulate_cate_feat finsihed\n", 1386 | "get_comments_product_feat finished\n", 1387 | "get all actions!\n", 1388 | "get_basic_user_feat finsihed\n", 1389 | "get_basic_product_feat finsihed\n", 1390 | "get_recent_user_feat finsihed\n", 1391 | "get_user_cate_feature finished\n", 1392 | "get_accumulate_product_feat finsihed\n", 1393 | "get_accumulate_cate_feat finsihed\n", 1394 | "get_comments_product_feat finished\n", 1395 | "get all actions!\n", 1396 | "get_basic_user_feat finsihed\n", 1397 | "get_basic_product_feat finsihed\n", 1398 | "get_recent_user_feat finsihed\n", 1399 | "get_user_cate_feature finished\n", 1400 | "get_accumulate_product_feat finsihed\n", 1401 | "get_accumulate_cate_feat finsihed\n", 1402 | "get_comments_product_feat finished\n" 1403 | ] 1404 | } 1405 | ], 1406 | "source": [ 1407 | "# validation sets\n", 1408 | "# train_start_date = '2016-04-06'\n", 1409 | "# make_train_set(train_start_date, 3, 'val_set.csv')\n", 1410 | "make_val_set('2016-04-06', '2016-04-09', 'val_1.csv')\n", 1411 | "make_val_set('2016-04-07', '2016-04-10', 'val_2.csv')\n", 1412 | "make_val_set('2016-04-08', '2016-04-11', 'val_3.csv')" 1413 | ] 1414 | }, 1415 | { 1416 | "cell_type": "markdown", 1417 | "metadata": {}, 1418 | "source": [ 1419 | "### Building the test set" 1420 | ] 1421 | }, 1422 | { 1423 | "cell_type": "code", 1424 | "execution_count": 19, 1425 | "metadata": { 1426 | "ExecuteTime": { 1427 | "end_time": "2017-05-16T22:44:57.248609Z", 1428 | "start_time": "2017-05-16T22:44:57.023515Z" 1429 | }, 1430 | "collapsed": true 1431 | }, 1432 | "outputs": [], 1433 | 
"source": [ 1434 | "def make_test_set(train_start_date, train_end_date):\n", 1435 | " start_days = datetime.strptime(train_end_date, '%Y-%m-%d') - timedelta(days=30)\n", 1436 | " start_days = start_days.strftime('%Y-%m-%d')\n", 1437 | " all_actions = get_all_action()\n", 1438 | " print \"get all actions!\"\n", 1439 | " user = get_basic_user_feat()\n", 1440 | " print 'get_basic_user_feat finsihed'\n", 1441 | " product = get_basic_product_feat()\n", 1442 | " print 'get_basic_product_feat finsihed'\n", 1443 | " \n", 1444 | " user_acc = get_recent_user_feat(train_end_date, all_actions)\n", 1445 | " print 'get_accumulate_user_feat finsihed'\n", 1446 | " \n", 1447 | " user_cate = get_user_cate_feature(train_start_date, train_end_date, all_actions)\n", 1448 | " print 'get_user_cate_feature finished'\n", 1449 | " \n", 1450 | " product_acc = get_accumulate_product_feat(start_days, train_end_date, all_actions)\n", 1451 | " print 'get_accumulate_product_feat finsihed'\n", 1452 | " cate_acc = get_accumulate_cate_feat(start_days, train_end_date, all_actions)\n", 1453 | " print 'get_accumulate_cate_feat finsihed'\n", 1454 | " comment_acc = get_comments_product_feat(train_end_date)\n", 1455 | "\n", 1456 | " actions = None\n", 1457 | " for i in (3, 5, 7, 10, 15, 21, 30):\n", 1458 | " start_days = datetime.strptime(train_end_date, '%Y-%m-%d') - timedelta(days=i)\n", 1459 | " start_days = start_days.strftime('%Y-%m-%d')\n", 1460 | " if actions is None:\n", 1461 | " actions = get_action_feat(start_days, train_end_date, all_actions,i)\n", 1462 | " else:\n", 1463 | " actions = pd.merge(actions, get_action_feat(start_days, train_end_date,all_actions,i), how='left',\n", 1464 | " on=['user_id', 'sku_id', 'cate'])\n", 1465 | "\n", 1466 | " actions = pd.merge(actions, user, how='left', on='user_id')\n", 1467 | " actions = pd.merge(actions, user_acc, how='left', on='user_id')\n", 1468 | " actions = pd.merge(actions, user_cate, how='left', on='user_id')\n", 1469 | " # note the merge keys here\n", 1470 | " 
actions = pd.merge(actions, product, how='left', on=['sku_id', 'cate'])\n", 1471 | " actions = pd.merge(actions, product_acc, how='left', on='sku_id')\n", 1472 | " actions = pd.merge(actions, cate_acc, how='left', on='cate')\n", 1473 | " actions = pd.merge(actions, comment_acc, how='left', on='sku_id')\n", 1474 | "\n", 1475 | " actions = actions.fillna(0)\n", 1476 | " \n", 1477 | "\n", 1478 | " actions.to_csv(\"test_set.csv\", index=False)" 1479 | ] 1480 | }, 1481 | { 1482 | "cell_type": "markdown", 1483 | "metadata": {}, 1484 | "source": [ 1485 | "For the three days 4.13~4.16 the comment records apparently never contain a zero count, which caused an error when building the test set\n", 1486 | "\n", 1487 | "`KeyError: \"['comment_num_0'] not in index\"`" 1488 | ] 1489 | }, 1490 | { 1491 | "cell_type": "code", 1492 | "execution_count": 20, 1493 | "metadata": { 1494 | "ExecuteTime": { 1495 | "end_time": "2017-05-16T22:48:22.214337Z", 1496 | "start_time": "2017-05-16T22:44:57.250749Z" 1497 | }, 1498 | "collapsed": false 1499 | }, 1500 | "outputs": [ 1501 | { 1502 | "name": "stdout", 1503 | "output_type": "stream", 1504 | "text": [ 1505 | "get all actions!\n", 1506 | "get_basic_user_feat finsihed\n", 1507 | "get_basic_product_feat finsihed\n", 1508 | "get_accumulate_user_feat finsihed\n", 1509 | "get_user_cate_feature finished\n", 1510 | "get_accumulate_product_feat finsihed\n", 1511 | "get_accumulate_cate_feat finsihed\n" 1512 | ] 1513 | } 1514 | ], 1515 | "source": [ 1516 | "# build the test (submission) set\n", 1517 | "sub_start_date = '2016-04-13'\n", 1518 | "sub_end_date = '2016-04-16'\n", 1519 | "make_test_set(sub_start_date, sub_end_date)" 1520 | ] 1521 | } 1522 | ], 1523 | "metadata": { 1524 | "kernelspec": { 1525 | "display_name": "Python 2", 1526 | "language": "python", 1527 | "name": "python2" 1528 | }, 1529 | "language_info": { 1530 | "codemirror_mode": { 1531 | "name": "ipython", 1532 | "version": 2 1533 | }, 1534 | "file_extension": ".py", 1535 | "mimetype": "text/x-python", 1536 | "name": "python", 1537 | "nbconvert_exporter": "python", 1538 | "pygments_lexer": "ipython2", 
1539 | "version": "2.7.13" 1540 | } 1541 | }, 1542 | "nbformat": 4, 1543 | "nbformat_minor": 2 1544 | } 1545 | -------------------------------------------------------------------------------- /data_cleaning_wp.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# JD JData Big Data Competition (1) - Data Cleaning" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "The competition task is to predict the purchase intent of high-potential users. From a machine-learning perspective this is a binary classification problem, so we need to construct our own positive and negative samples.\n", 15 | "Since the raw data contains a lot of noise, the first step is data cleaning, for example:\n", 16 | "\n", 17 | "* check for and remove duplicate records\n", 18 | "* users and items with no interactions\n", 19 | "* drop users with very high browse counts but very few purchases (window shoppers or crawlers)\n", 20 | "* ......" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "To support the cleaning above, we first build simple user and item behaviour features, stored in two tables, user_table and item_table\n", 28 | "* user_table features include:\n", 29 | " user_id, age, sex,\n", 30 | " user_lv_cd (user level), browse_num (view count),\n", 31 | " addcart_num (add-to-cart count), delcart_num (cart-removal count),\n", 32 | " buy_num (purchase count), favor_num (favourite count),\n", 33 | " click_num (click count), buy_addcart_ratio (cart-to-purchase conversion),\n", 34 | " buy_browse_ratio (view-to-purchase conversion),\n", 35 | " buy_click_ratio (click-to-purchase conversion),\n", 36 | " buy_favor_ratio (favourite-to-purchase conversion)\n", 37 | " \n", 38 | "* item_table features include:\n", 39 | " sku_id (item id), attr1, attr2,\n", 40 | " attr3, cate, brand, browse_num,\n", 41 | " addcart_num, delcart_num,\n", 42 | " buy_num, favor_num, click_num,\n", 43 | " buy_addcart_ratio, buy_browse_ratio,\n", 44 | " buy_click_ratio, buy_favor_ratio,\n", 45 | " comment_num (comment count),\n", 46 | " has_bad_comment (whether there are negative comments),\n", 47 | " bad_comment_rate (negative-comment rate)" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "## Dataset notes" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "The datasets involved are JD's latest release:\n", 62 | "1. JData_User.csv - user data, 105,321 users\n", 63 | "2. JData_Comment.csv - product comments, 558,552 records\n", 64 | "3. 
JData_Product.csv - candidate products to predict, 24,187 records\n", 65 | "4. JData_Action_201602.csv - February interaction log, 11,485,424 records\n", 66 | "5. JData_Action_201603.csv - March interaction log, 25,916,378 records\n", 67 | "6. JData_Action_201604.csv - April interaction log, 13,199,934 records" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "## Dataset validation" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "### First, check that the users in JData_User and JData_Action are consistent\n", 82 | "Make sure every action in the behaviour data was produced by a user in the user table (though a user may have no actions at all)" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "Idea: use pd.merge to join the ids with those in the Action table, and check whether the Action row count shrinks\n", 90 | "Example:" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 1, 96 | "metadata": { 97 | "collapsed": false 98 | }, 99 | "outputs": [], 100 | "source": [ 101 | "import pandas as pd\n", 102 | "# test sample\n", 103 | "# df1 = pd.DataFrame({'sku':['a','a','b','c'],'data':[1,1,2,3]})\n", 104 | "# df2 = pd.DataFrame({'sku':['a','b','c']})\n", 105 | "# df3 = pd.DataFrame({'sku':['a','b','d']})\n", 106 | "# df4 = pd.DataFrame({'sku':['a','b','c','d']})\n", 107 | "# print pd.merge(df2,df1)\n", 108 | "# # print pd.merge(df1,df2)\n", 109 | "# print pd.merge(df3,df1)\n", 110 | "# print pd.merge(df4,df1)\n", 111 | "# # print pd.merge(df1,df3)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 2, 117 | "metadata": { 118 | "collapsed": false 119 | }, 120 | "outputs": [ 121 | { 122 | "name": "stdout", 123 | "output_type": "stream", 124 | "text": [ 125 | "Is action of Feb. from User file? True\n", 126 | "Is action of Mar. from User file? True\n", 127 | "Is action of Apr. from User file? 
True\n" 128 | ] 129 | } 130 | ], 131 | "source": [ 132 | "def user_action_check():\n", 133 | " df_user = pd.read_csv('data/JData_User.csv')\n", 134 | " df_sku = df_user.loc[:,'user_id'].to_frame()\n", 135 | " df_month2 = pd.read_csv('data/JData_Action_201602.csv')\n", 136 | " print 'Is action of Feb. from User file? ', len(df_month2) == len(pd.merge(df_sku,df_month2))\n", 137 | " df_month3 = pd.read_csv('data/JData_Action_201603.csv')\n", 138 | " print 'Is action of Mar. from User file? ', len(df_month3) == len(pd.merge(df_sku,df_month3))\n", 139 | " df_month4 = pd.read_csv('data/JData_Action_201604.csv')\n", 140 | " print 'Is action of Apr. from User file? ', len(df_month4) == len(pd.merge(df_sku,df_month4))\n", 141 | "\n", 142 | "user_action_check()" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "Conclusion: the users in the User dataset and those in the interaction data are fully consistent\n", 150 | "\n", 151 | "*Comparing row counts before and after the merge guarantees that the user IDs in Action are a subset of those in User*" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "### Check for duplicate records\n", 159 | "Removing fully duplicated records from each data file actually caused a large drop in the online score; a likely explanation is that the duplicates are meaningful, e.g. a user buying several items at once, or adding several units of an item to the cart..." 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 8, 165 | "metadata": { 166 | "collapsed": false 167 | }, 168 | "outputs": [], 169 | "source": [ 170 | "def deduplicate(filepath, filename, newpath):\n", 171 | " df_file = pd.read_csv(filepath) \n", 172 | " before = df_file.shape[0]\n", 173 | " df_file.drop_duplicates(inplace=True)\n", 174 | " after = df_file.shape[0]\n", 175 | " n_dup = before-after\n", 176 | " print 'No. 
of duplicate records for ' + filename + ' is: ' + str(n_dup)\n", 177 | " if n_dup != 0:\n", 178 | " df_file.to_csv(newpath, index=None)\n", 179 | " else:\n", 180 | " print 'no duplicate records in ' + filename" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 9, 186 | "metadata": { 187 | "collapsed": false 188 | }, 189 | "outputs": [ 190 | { 191 | "name": "stdout", 192 | "output_type": "stream", 193 | "text": [ 194 | "No. of duplicate records for Mar. action is: 7085038\n", 195 | "No. of duplicate records for Feb. action is: 3672710\n", 196 | "No. of duplicate records for Comment is: 0\n", 197 | "no duplicate records in Comment\n", 198 | "No. of duplicate records for Product is: 0\n", 199 | "no duplicate records in Product\n", 200 | "No. of duplicate records for User is: 0\n", 201 | "no duplicate records in User\n" 202 | ] 203 | } 204 | ], 205 | "source": [ 206 | "# deduplicate('data/JData_Action_201602.csv', 'Feb. action', 'data/JData_Action_201602_dedup.csv')\n", 207 | "deduplicate('data/JData_Action_201603.csv', 'Mar. action', 'data/JData_Action_201603_dedup.csv')\n", 208 | "deduplicate('data/JData_Action_201604.csv', 'Feb. action', 'data/JData_Action_201604_dedup.csv')\n", 209 | "deduplicate('data/JData_Comment.csv', 'Comment', 'data/JData_Comment_dedup.csv')\n", 210 | "deduplicate('data/JData_Product.csv', 'Product', 'data/JData_Product_dedup.csv')\n", 211 | "deduplicate('data/JData_User.csv', 'User', 'data/JData_User_dedup.csv')" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 31, 217 | "metadata": { 218 | "collapsed": false 219 | }, 220 | "outputs": [ 221 | { 222 | "data": { 223 | "text/html": [ 224 | "
\n", 225 | "\n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | "
user_idsku_idtimemodel_idcatebrand
type
1217637821763782176378021763782176378
26366366360636636
3146414641464014641464
437373703737
5198119811981019811981
6575597575597575597545054575597575597
\n", 303 | "
" 304 | ], 305 | "text/plain": [ 306 | " user_id sku_id time model_id cate brand\n", 307 | "type \n", 308 | "1 2176378 2176378 2176378 0 2176378 2176378\n", 309 | "2 636 636 636 0 636 636\n", 310 | "3 1464 1464 1464 0 1464 1464\n", 311 | "4 37 37 37 0 37 37\n", 312 | "5 1981 1981 1981 0 1981 1981\n", 313 | "6 575597 575597 575597 545054 575597 575597" 314 | ] 315 | }, 316 | "execution_count": 31, 317 | "metadata": {}, 318 | "output_type": "execute_result" 319 | } 320 | ], 321 | "source": [ 322 | "IsDuplicated = df_month.duplicated()\n", 323 | "df_d = df_month[IsDuplicated]\n", 324 | "df_d.groupby('type').count()  # most duplicates come from browse (type 1) and click (type 6) actions" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": {}, 330 | "source": [ 331 | "### Check for users registered after 2016-04-15" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": 6, 337 | "metadata": { 338 | "collapsed": false 339 | }, 340 | "outputs": [ 341 | { 342 | "data": { 343 | "text/html": [ 344 | "
\n", 345 | "\n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | 
" \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | "
user_idagesexuser_lv_cduser_reg_tm
7457207458-12.012016-04-15
746320746426-35岁2.022016-04-15
746720746836-45岁2.032016-04-15
7472207473-12.012016-04-15
748220748326-35岁2.032016-04-15
749220749316-25岁2.032016-04-15
749320749416-25岁2.032016-04-15
750320750416-25岁2.042016-04-15
751020751146-55岁2.052016-04-15
7512207513-12.012016-04-15
751820751926-35岁2.022016-04-15
752120752226-35岁0.032016-04-15
7525207526-12.032016-04-15
7533207534-12.012016-04-15
754320754426-35岁2.032016-04-15
7544207545-12.012016-04-15
755120755226-35岁2.032016-04-15
755320755416-25岁2.042016-04-15
854520854616-25岁0.022016-04-29
939420939516-25岁1.022016-05-11
1036221036356岁以上2.022016-05-24
10367210368-12.012016-05-24
1101921102036-45岁2.032016-06-06
1201421201536-45岁2.022016-07-05
1385021385126-35岁2.032016-09-11
14542214543-12.012016-10-05
1674621674716-25岁2.012016-11-25
\n", 575 | "
" 576 | ], 577 | "text/plain": [ 578 | " user_id age sex user_lv_cd user_reg_tm\n", 579 | "7457 207458 -1 2.0 1 2016-04-15\n", 580 | "7463 207464 26-35岁 2.0 2 2016-04-15\n", 581 | "7467 207468 36-45岁 2.0 3 2016-04-15\n", 582 | "7472 207473 -1 2.0 1 2016-04-15\n", 583 | "7482 207483 26-35岁 2.0 3 2016-04-15\n", 584 | "7492 207493 16-25岁 2.0 3 2016-04-15\n", 585 | "7493 207494 16-25岁 2.0 3 2016-04-15\n", 586 | "7503 207504 16-25岁 2.0 4 2016-04-15\n", 587 | "7510 207511 46-55岁 2.0 5 2016-04-15\n", 588 | "7512 207513 -1 2.0 1 2016-04-15\n", 589 | "7518 207519 26-35岁 2.0 2 2016-04-15\n", 590 | "7521 207522 26-35岁 0.0 3 2016-04-15\n", 591 | "7525 207526 -1 2.0 3 2016-04-15\n", 592 | "7533 207534 -1 2.0 1 2016-04-15\n", 593 | "7543 207544 26-35岁 2.0 3 2016-04-15\n", 594 | "7544 207545 -1 2.0 1 2016-04-15\n", 595 | "7551 207552 26-35岁 2.0 3 2016-04-15\n", 596 | "7553 207554 16-25岁 2.0 4 2016-04-15\n", 597 | "8545 208546 16-25岁 0.0 2 2016-04-29\n", 598 | "9394 209395 16-25岁 1.0 2 2016-05-11\n", 599 | "10362 210363 56岁以上 2.0 2 2016-05-24\n", 600 | "10367 210368 -1 2.0 1 2016-05-24\n", 601 | "11019 211020 36-45岁 2.0 3 2016-06-06\n", 602 | "12014 212015 36-45岁 2.0 2 2016-07-05\n", 603 | "13850 213851 26-35岁 2.0 3 2016-09-11\n", 604 | "14542 214543 -1 2.0 1 2016-10-05\n", 605 | "16746 216747 16-25岁 2.0 1 2016-11-25" 606 | ] 607 | }, 608 | "execution_count": 6, 609 | "metadata": {}, 610 | "output_type": "execute_result" 611 | } 612 | ], 613 | "source": [ 614 | "import pandas as pd\n", 615 | "df_user = pd.read_csv('data/JData_User.csv', encoding='gbk')\n", 616 | "df_user['user_reg_tm'] = pd.to_datetime(df_user['user_reg_tm'])\n", 617 | "df_user.loc[df_user.user_reg_tm >= '2016-4-15']" 618 | ] 619 | }, 620 | { 621 | "cell_type": "markdown", 622 | "metadata": {}, 623 | "source": [ 624 | "These registration dates are the result of a JD system error; if the action data contain no records after April 15, these are still normal users and do not need to be deleted." 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": 9, 630 | "metadata": { 631 | "collapsed": false 632 | }, 633
| "outputs": [ 634 | { 635 | "data": { 636 | "text/html": [ 637 | "
\n", 638 | "\n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | "
user_idsku_idtimemodel_idtypecatebrand
\n", 654 | "
" 655 | ], 656 | "text/plain": [ 657 | "Empty DataFrame\n", 658 | "Columns: [user_id, sku_id, time, model_id, type, cate, brand]\n", 659 | "Index: []" 660 | ] 661 | }, 662 | "execution_count": 9, 663 | "metadata": {}, 664 | "output_type": "execute_result" 665 | } 666 | ], 667 | "source": [ 668 | "df_month = pd.read_csv('data/JData_Action_201604.csv')\n", 669 | "df_month['time'] = pd.to_datetime(df_month['time'])\n", 670 | "df_month.loc[df_month.time >= '2016-4-16']" 671 | ] 672 | }, 673 | { 674 | "cell_type": "markdown", 675 | "metadata": {}, 676 | "source": [ 677 | "Conclusion: these users have no anomalous action data, so this batch of users is kept" 678 | ] 679 | }, 680 | { 681 | "cell_type": "markdown", 682 | "metadata": {}, 683 | "source": [ 684 | "### user_id in the action data is stored as float; convert it to int" 685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": 47, 690 | "metadata": { 691 | "collapsed": false 692 | }, 693 | "outputs": [ 694 | { 695 | "name": "stdout", 696 | "output_type": "stream", 697 | "text": [ 698 | "int64\n", 699 | "int64\n", 700 | "int64\n" 701 | ] 702 | } 703 | ], 704 | "source": [ 705 | "import pandas as pd\n", 706 | "df_month = pd.read_csv('data/JData_Action_201602.csv')\n", 707 | "df_month['user_id'] = df_month['user_id'].astype(int)\n", 708 | "print(df_month['user_id'].dtype)\n", 709 | "df_month.to_csv('data/JData_Action_201602.csv', index=None)\n", 710 | "df_month = pd.read_csv('data/JData_Action_201603.csv')\n", 711 | "df_month['user_id'] = df_month['user_id'].astype(int)\n", 712 | "print(df_month['user_id'].dtype)\n", 713 | "df_month.to_csv('data/JData_Action_201603.csv', index=None)\n", 714 | "df_month = pd.read_csv('data/JData_Action_201604.csv')\n", 715 | "df_month['user_id'] = df_month['user_id'].astype(int)\n", 716 | "print(df_month['user_id'].dtype)\n", 717 | "df_month.to_csv('data/JData_Action_201604.csv', index=None)" 718 | ] 719 | }, 720 | { 721 | "cell_type": "markdown", 722 | "metadata": {}, 723 | "source": [ 724 | "### Handling the age ranges" 725 | ] 726 | }, 727
| { 728 | "cell_type": "code", 729 | "execution_count": 35, 730 | "metadata": { 731 | "collapsed": false 732 | }, 733 | "outputs": [ 734 | { 735 | "name": "stdout", 736 | "output_type": "stream", 737 | "text": [ 738 | " user_id sex user_lv_cd user_reg_tm\n", 739 | "age \n", 740 | "-1 14412 14412 14412 14412\n", 741 | "1 7 7 7 7\n", 742 | "2 8797 8797 8797 8797\n", 743 | "3 46570 46570 46570 46570\n", 744 | "4 30336 30336 30336 30336\n", 745 | "5 3325 3325 3325 3325\n", 746 | "6 1871 1871 1871 1871\n" 747 | ] 748 | } 749 | ], 750 | "source": [ 751 | "import pandas as pd\n", 752 | "df_user = pd.read_csv('data/JData_User.csv', encoding='gbk')\n", 753 | "\n", 754 | "def tranAge(x):\n", 755 | "    if x == u'15岁以下':\n", 756 | "        x = '1'\n", 757 | "    elif x == u'16-25岁':\n", 758 | "        x = '2'\n", 759 | "    elif x == u'26-35岁':\n", 760 | "        x = '3'\n", 761 | "    elif x == u'36-45岁':\n", 762 | "        x = '4'\n", 763 | "    elif x == u'46-55岁':\n", 764 | "        x = '5'\n", 765 | "    elif x == u'56岁以上':\n", 766 | "        x = '6'\n", 767 | "    return x\n", 768 | "df_user['age'] = df_user['age'].apply(tranAge)\n", 769 | "print(df_user.groupby(df_user['age']).count())\n", 770 | "df_user.to_csv('data/JData_User.csv', index=None)" 771 | ] 772 | }, 773 | { 774 | "cell_type": "markdown", 775 | "metadata": {}, 776 | "source": [ 777 | "### Build the User_table" 778 | ] 779 | }, 780 | { 781 | "cell_type": "code", 782 | "execution_count": 6, 783 | "metadata": { 784 | "collapsed": true 785 | }, 786 | "outputs": [], 787 | "source": [ 788 | "# define file names\n", 789 | "ACTION_201602_FILE = \"data/JData_Action_201602.csv\"\n", 790 | "ACTION_201603_FILE = \"data/JData_Action_201603.csv\"\n", 791 | "ACTION_201604_FILE = \"data/JData_Action_201604.csv\"\n", 792 | "COMMENT_FILE = \"data/JData_Comment.csv\"\n", 793 | "PRODUCT_FILE = \"data/JData_Product.csv\"\n", 794 | "USER_FILE = \"data/JData_User.csv\"\n", 795 | "USER_TABLE_FILE = \"data/User_table.csv\"\n", 796 | "ITEM_TABLE_FILE = \"data/Item_table.csv\"" 797 | ] 798 | }, 799 | { 800 | "cell_type": "code", 801 | 
"execution_count": 7, 802 | "metadata": { 803 | "collapsed": true 804 | }, 805 | "outputs": [], 806 | "source": [ 807 | "# import the required packages\n", 808 | "import pandas as pd\n", 809 | "import numpy as np\n", 810 | "from collections import Counter" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": 8, 816 | "metadata": { 817 | "collapsed": true 818 | }, 819 | "outputs": [], 820 | "source": [ 821 | "# helper: compute statistics for each per-user group\n", 822 | "def add_type_count(group):\n", 823 | "    behavior_type = group.type.astype(int)\n", 824 | "    # counts per user action type\n", 825 | "    type_cnt = Counter(behavior_type)\n", 826 | "    # 1: browse 2: add-to-cart 3: delete-from-cart\n", 827 | "    # 4: buy 5: favorite 6: click\n", 828 | "    group['browse_num'] = type_cnt[1]\n", 829 | "    group['addcart_num'] = type_cnt[2]\n", 830 | "    group['delcart_num'] = type_cnt[3]\n", 831 | "    group['buy_num'] = type_cnt[4]\n", 832 | "    group['favor_num'] = type_cnt[5]\n", 833 | "    group['click_num'] = type_cnt[6]\n", 834 | "\n", 835 | "    return group[['user_id', 'browse_num', 'addcart_num',\n", 836 | "                  'delcart_num', 'buy_num', 'favor_num',\n", 837 | "                  'click_num']]" 838 | ] 839 | }, 840 | { 841 | "cell_type": "markdown", 842 | "metadata": {}, 843 | "source": [ 844 | "The user action data are large, so reading them in at once may cause a MemoryError; therefore use pandas chunked reading."
845 | ] 846 | }, 847 | { 848 | "cell_type": "code", 849 | "execution_count": 9, 850 | "metadata": { 851 | "collapsed": true 852 | }, 853 | "outputs": [], 854 | "source": [ 855 | "# aggregate statistics over the action data\n", 856 | "# tune chunk_size to your available memory\n", 857 | "def get_from_action_data(fname, chunk_size=100000):\n", 858 | "    reader = pd.read_csv(fname, header=0, iterator=True)\n", 859 | "    chunks = []\n", 860 | "    loop = True\n", 861 | "    while loop:\n", 862 | "        try:\n", 863 | "            # read only the user_id and type columns\n", 864 | "            chunk = reader.get_chunk(chunk_size)[[\"user_id\", \"type\"]]\n", 865 | "            chunks.append(chunk)\n", 866 | "        except StopIteration:\n", 867 | "            loop = False\n", 868 | "            print(\"Iteration is stopped\")\n", 869 | "    # concatenate the chunks into a single pandas DataFrame\n", 870 | "    df_ac = pd.concat(chunks, ignore_index=True)\n", 871 | "    # group by user_id and aggregate each group; as_index=False returns the data without a group index\n", 872 | "    df_ac = df_ac.groupby(['user_id'], as_index=False).apply(add_type_count)\n", 873 | "    # drop the duplicated rows\n", 874 | "    df_ac = df_ac.drop_duplicates('user_id')\n", 875 | "\n", 876 | "    return df_ac" 877 | ] 878 | }, 879 | { 880 | "cell_type": "code", 881 | "execution_count": 10, 882 | "metadata": { 883 | "collapsed": true 884 | }, 885 | "outputs": [], 886 | "source": [ 887 | "# merge the statistics from the individual action tables\n", 888 | "def merge_action_data():\n", 889 | "    df_ac = []\n", 890 | "    df_ac.append(get_from_action_data(fname=ACTION_201602_FILE))\n", 891 | "    df_ac.append(get_from_action_data(fname=ACTION_201603_FILE))\n", 892 | "    df_ac.append(get_from_action_data(fname=ACTION_201604_FILE))\n", 893 | "\n", 894 | "    df_ac = pd.concat(df_ac, ignore_index=True)\n", 895 | "    # sum each user's statistics across the action tables\n", 896 | "    df_ac = df_ac.groupby(['user_id'], as_index=False).sum()\n", 897 | "    # build the conversion-rate fields\n", 898 | "    df_ac['buy_addcart_ratio'] = df_ac['buy_num'] / df_ac['addcart_num']\n", 899 | "    df_ac['buy_browse_ratio'] = df_ac['buy_num'] / df_ac['browse_num']\n", 900 | "    df_ac['buy_click_ratio'] = df_ac['buy_num'] / df_ac['click_num']\n", 901 | "    df_ac['buy_favor_ratio'] = df_ac['buy_num'] 
/ df_ac['favor_num']\n", 902 | "    \n", 903 | "    # cap conversion rates greater than 1 at 1 (100%)\n", 904 | "    df_ac.loc[df_ac['buy_addcart_ratio'] > 1., 'buy_addcart_ratio'] = 1.\n", 905 | "    df_ac.loc[df_ac['buy_browse_ratio'] > 1., 'buy_browse_ratio'] = 1.\n", 906 | "    df_ac.loc[df_ac['buy_click_ratio'] > 1., 'buy_click_ratio'] = 1.\n", 907 | "    df_ac.loc[df_ac['buy_favor_ratio'] > 1., 'buy_favor_ratio'] = 1.\n", 908 | "\n", 909 | "    return df_ac" 910 | ] 911 | }, 912 | { 913 | "cell_type": "code", 914 | "execution_count": 11, 915 | "metadata": { 916 | "collapsed": true 917 | }, 918 | "outputs": [], 919 | "source": [ 920 | "# extract the needed columns from the JData_User table\n", 921 | "def get_from_jdata_user():\n", 922 | "    df_usr = pd.read_csv(USER_FILE, header=0)\n", 923 | "    df_usr = df_usr[[\"user_id\", \"age\", \"sex\", \"user_lv_cd\"]]\n", 924 | "    return df_usr" 925 | ] 926 | }, 927 | { 928 | "cell_type": "code", 929 | "execution_count": 12, 930 | "metadata": { 931 | "collapsed": false 932 | }, 933 | "outputs": [ 934 | { 935 | "name": "stdout", 936 | "output_type": "stream", 937 | "text": [ 938 | "Iteration is stopped\n", 939 | "Iteration is stopped\n", 940 | "Iteration is stopped\n" 941 | ] 942 | } 943 | ], 944 | "source": [ 945 | "user_base = get_from_jdata_user()\n", 946 | "user_behavior = merge_action_data()\n", 947 | "\n", 948 | "# join into one table, like a SQL left join\n", 949 | "user_behavior = pd.merge(user_base, user_behavior, on=['user_id'], how='left')\n", 950 | "# save as User_table.csv\n", 951 | "user_behavior.to_csv(USER_TABLE_FILE, index=False)" 952 | ] 953 | }, 954 | { 955 | "cell_type": "markdown", 956 | "metadata": {}, 957 | "source": [ 958 | "### Build the Item_table" 959 | ] 960 | }, 961 | { 962 | "cell_type": "code", 963 | "execution_count": 21, 964 | "metadata": { 965 | "collapsed": true 966 | }, 967 | "outputs": [], 968 | "source": [ 969 | "# define file names\n", 970 | "ACTION_201602_FILE = \"data/JData_Action_201602.csv\"\n", 971 | "ACTION_201603_FILE = \"data/JData_Action_201603.csv\"\n", 972 | "ACTION_201604_FILE = 
\"data/JData_Action_201604.csv\"\n", 973 | "COMMENT_FILE = \"data/JData_Comment.csv\"\n", 974 | "PRODUCT_FILE = \"data/JData_Product.csv\"\n", 975 | "USER_FILE = \"data/JData_User.csv\"\n", 976 | "USER_TABLE_FILE = \"data/User_table.csv\"\n", 977 | "ITEM_TABLE_FILE = \"data/Item_table.csv\"" 978 | ] 979 | }, 980 | { 981 | "cell_type": "code", 982 | "execution_count": 14, 983 | "metadata": { 984 | "collapsed": true 985 | }, 986 | "outputs": [], 987 | "source": [ 988 | "# import the required packages\n", 989 | "import pandas as pd\n", 990 | "import numpy as np\n", 991 | "from collections import Counter" 992 | ] 993 | }, 994 | { 995 | "cell_type": "code", 996 | "execution_count": 15, 997 | "metadata": { 998 | "collapsed": true 999 | }, 1000 | "outputs": [], 1001 | "source": [ 1002 | "# read the products from the Product file\n", 1003 | "def get_from_jdata_product():\n", 1004 | "    df_item = pd.read_csv(PRODUCT_FILE, header=0)\n", 1005 | "    return df_item" 1006 | ] 1007 | }, 1008 | { 1009 | "cell_type": "code", 1010 | "execution_count": 16, 1011 | "metadata": { 1012 | "collapsed": true 1013 | }, 1014 | "outputs": [], 1015 | "source": [ 1016 | "# compute statistics for each per-product group\n", 1017 | "def add_type_count(group):\n", 1018 | "    behavior_type = group.type.astype(int)\n", 1019 | "    type_cnt = Counter(behavior_type)\n", 1020 | "\n", 1021 | "    group['browse_num'] = type_cnt[1]\n", 1022 | "    group['addcart_num'] = type_cnt[2]\n", 1023 | "    group['delcart_num'] = type_cnt[3]\n", 1024 | "    group['buy_num'] = type_cnt[4]\n", 1025 | "    group['favor_num'] = type_cnt[5]\n", 1026 | "    group['click_num'] = type_cnt[6]\n", 1027 | "\n", 1028 | "    return group[['sku_id', 'browse_num', 'addcart_num',\n", 1029 | "                  'delcart_num', 'buy_num', 'favor_num',\n", 1030 | "                  'click_num']]\n" 1031 | ] 1032 | }, 1033 | { 1034 | "cell_type": "code", 1035 | "execution_count": 17, 1036 | "metadata": { 1037 | "collapsed": true 1038 | }, 1039 | "outputs": [], 1040 | "source": [ 1041 | "# aggregate statistics over the action data\n", 1042 | "def get_from_action_data(fname, chunk_size=100000):\n", 1043 | "    reader = pd.read_csv(fname, header=0, iterator=True)\n", 1044 | "    chunks = []\n", 1045 | "    loop = True\n", 1046 | "    while loop:\n", 1047 | "        try:\n", 1048 | "            chunk = reader.get_chunk(chunk_size)[[\"sku_id\", \"type\"]]\n", 1049 | "            chunks.append(chunk)\n", 1050 | "        except StopIteration:\n", 1051 | "            loop = False\n", 1052 | "            print(\"Iteration is stopped\")\n", 1053 | "\n", 1054 | "    df_ac = pd.concat(chunks, ignore_index=True)\n", 1055 | "\n", 1056 | "    df_ac = df_ac.groupby(['sku_id'], as_index=False).apply(add_type_count)\n", 1057 | "    # keep one row per sku_id\n", 1058 | "    df_ac = df_ac.drop_duplicates('sku_id')\n", 1059 | "\n", 1060 | "    return df_ac" 1061 | ] 1062 | }, 1063 | { 1064 | "cell_type": "code", 1065 | "execution_count": 18, 1066 | "metadata": { 1067 | "collapsed": true 1068 | }, 1069 | "outputs": [], 1070 | "source": [ 1071 | "# get the comment data per product; if a product has comments on two dates, keep the latest\n", 1072 | "def get_from_jdata_comment():\n", 1073 | "    df_cmt = pd.read_csv(COMMENT_FILE, header=0)\n", 1074 | "    df_cmt['dt'] = pd.to_datetime(df_cmt['dt'])\n", 1075 | "    # find latest comment index\n", 1076 | "    idx = df_cmt.groupby(['sku_id'])['dt'].transform('max') == df_cmt['dt']\n", 1077 | "    df_cmt = df_cmt[idx]\n", 1078 | "\n", 1079 | "    return df_cmt[['sku_id', 'comment_num',\n", 1080 | "                   'has_bad_comment', 'bad_comment_rate']]" 1081 | ] 1082 | }, 1083 | { 1084 | "cell_type": "code", 1085 | "execution_count": 19, 1086 | "metadata": { 1087 | "collapsed": true 1088 | }, 1089 | "outputs": [], 1090 | "source": [ 1091 | "def merge_action_data():\n", 1092 | "    df_ac = []\n", 1093 | "    df_ac.append(get_from_action_data(fname=ACTION_201602_FILE))\n", 1094 | "    df_ac.append(get_from_action_data(fname=ACTION_201603_FILE))\n", 1095 | "    df_ac.append(get_from_action_data(fname=ACTION_201604_FILE))\n", 1096 | "\n", 1097 | "    df_ac = pd.concat(df_ac, ignore_index=True)\n", 1098 | "    df_ac = df_ac.groupby(['sku_id'], as_index=False).sum()\n", 1099 | "\n", 1100 | "    df_ac['buy_addcart_ratio'] = df_ac['buy_num'] / 
df_ac['addcart_num']\n", 1101 | "    df_ac['buy_browse_ratio'] = df_ac['buy_num'] / df_ac['browse_num']\n", 1102 | "    df_ac['buy_click_ratio'] = df_ac['buy_num'] / df_ac['click_num']\n", 1103 | "    df_ac['buy_favor_ratio'] = df_ac['buy_num'] / df_ac['favor_num']\n", 1104 | "\n", 1105 | "    df_ac.loc[df_ac['buy_addcart_ratio'] > 1., 'buy_addcart_ratio'] = 1.\n", 1106 | "    df_ac.loc[df_ac['buy_browse_ratio'] > 1., 'buy_browse_ratio'] = 1.\n", 1107 | "    df_ac.loc[df_ac['buy_click_ratio'] > 1., 'buy_click_ratio'] = 1.\n", 1108 | "    df_ac.loc[df_ac['buy_favor_ratio'] > 1., 'buy_favor_ratio'] = 1.\n", 1109 | "\n", 1110 | "    return df_ac" 1111 | ] 1112 | }, 1113 | { 1114 | "cell_type": "code", 1115 | "execution_count": 22, 1116 | "metadata": { 1117 | "collapsed": false 1118 | }, 1119 | "outputs": [ 1120 | { 1121 | "name": "stdout", 1122 | "output_type": "stream", 1123 | "text": [ 1124 | "Iteration is stopped\n", 1125 | "Iteration is stopped\n", 1126 | "Iteration is stopped\n" 1127 | ] 1128 | } 1129 | ], 1130 | "source": [ 1131 | "# keep only the items that appear in the P (candidate product) set\n", 1132 | "item_base = get_from_jdata_product()\n", 1133 | "item_behavior = merge_action_data()\n", 1134 | "item_comment = get_from_jdata_comment()\n", 1135 | "\n", 1136 | "# SQL: left join\n", 1137 | "item_behavior = pd.merge(\n", 1138 | "    item_base, item_behavior, on=['sku_id'], how='left')\n", 1139 | "item_behavior = pd.merge(\n", 1140 | "    item_behavior, item_comment, on=['sku_id'], how='left')\n", 1141 | "\n", 1142 | "item_behavior.to_csv(ITEM_TABLE_FILE, index=False)" 1143 | ] 1144 | }, 1145 | { 1146 | "cell_type": "markdown", 1147 | "metadata": {}, 1148 | "source": [ 1149 | "### Data cleaning" 1150 | ] 1151 | }, 1152 | { 1153 | "cell_type": "markdown", 1154 | "metadata": {}, 1155 | "source": [ 1156 | "#### User cleaning" 1157 | ] 1158 | }, 1159 | { 1160 | "cell_type": "code", 1161 | "execution_count": 1, 1162 | "metadata": { 1163 | "collapsed": false 1164 | }, 1165 | "outputs": [ 1166 | { 1167 | "data": { 1168 | "text/html": [ 1169 | "
\n", 1170 | "\n", 1171 | " \n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " \n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | " \n", 1199 | " \n", 1200 | " \n", 1201 | " \n", 1202 | " \n", 1203 | " \n", 1204 | " \n", 1205 | " \n", 1206 | " \n", 1207 | " \n", 1208 | " \n", 1209 | " \n", 1210 | " \n", 1211 | " \n", 1212 | " \n", 1213 | " \n", 1214 | " \n", 1215 | " \n", 1216 | " \n", 1217 | " \n", 1218 | " \n", 1219 | " \n", 1220 | " \n", 1221 | " \n", 1222 | " \n", 1223 | " \n", 1224 | " \n", 1225 | " \n", 1226 | " \n", 1227 | " \n", 1228 | " \n", 1229 | " \n", 1230 | " \n", 1231 | " \n", 1232 | " \n", 1233 | " \n", 1234 | " \n", 1235 | " \n", 1236 | " \n", 1237 | " \n", 1238 | " \n", 1239 | " \n", 1240 | " \n", 1241 | " \n", 1242 | " \n", 1243 | " \n", 1244 | " \n", 1245 | " \n", 1246 | " \n", 1247 | " \n", 1248 | " \n", 1249 | " \n", 1250 | " \n", 1251 | " \n", 1252 | " \n", 1253 | " \n", 1254 | " \n", 1255 | " \n", 1256 | " \n", 1257 | " \n", 1258 | " \n", 1259 | " \n", 1260 | " \n", 1261 | " \n", 1262 | " \n", 1263 | " \n", 1264 | " \n", 1265 | " \n", 1266 | " \n", 1267 | " \n", 1268 | " \n", 1269 | " \n", 1270 | " \n", 1271 | " \n", 1272 | " \n", 1273 | " \n", 1274 | " \n", 1275 | " \n", 1276 | " \n", 1277 | " \n", 1278 | " \n", 1279 | " \n", 1280 | " \n", 1281 | " \n", 1282 | " \n", 1283 | " \n", 1284 | " \n", 1285 | " \n", 1286 | " \n", 1287 | " \n", 1288 | " \n", 1289 | " \n", 1290 | " \n", 1291 | " \n", 1292 | " \n", 1293 | " \n", 1294 | " \n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | " \n", 1310 | " \n", 1311 | " \n", 1312 | 
" \n", 1313 | " \n", 1314 | " \n", 1315 | " \n", 1316 | " \n", 1317 | " \n", 1318 | " \n", 1319 | " \n", 1320 | " \n", 1321 | " \n", 1322 | " \n", 1323 | " \n", 1324 | " \n", 1325 | " \n", 1326 | " \n", 1327 | " \n", 1328 | "
user_idagesexuser_lv_cdbrowse_numaddcart_numdelcart_numbuy_numfavor_numclick_numbuy_addcart_ratiobuy_browse_ratiobuy_click_ratiobuy_favor_ratio
count105,321.000105,318.000105,318.000105,321.000105,180.000105,180.000105,180.000105,180.000105,180.000105,180.00072,129.000105,172.000103,197.00045,986.000
mean252,661.0002.7731.1133.850180.4665.4712.4340.4591.045291.2220.1470.0050.0090.552
std30,403.6981.6720.9561.072273.43710.6185.6001.0483.442460.0310.2700.0220.0740.473
min200,001.000-1.0000.0001.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.000
25%226,331.0003.0000.0003.00040.0000.0000.0000.0000.00059.0000.0000.0000.0000.000
50%252,661.0003.0002.0004.00094.0002.0000.0000.0000.000148.0000.0000.0000.0001.000
75%278,991.0004.0002.0005.000212.0006.0003.0001.0000.000342.0000.1670.0020.0011.000
max305,321.0006.0002.0005.0007,605.000369.000231.00050.00099.00015,302.0001.0001.0001.0001.000
\n", 1329 | "
" 1330 | ], 1331 | "text/plain": [ 1332 | " user_id age sex user_lv_cd browse_num \\\n", 1333 | "count 105,321.000 105,318.000 105,318.000 105,321.000 105,180.000 \n", 1334 | "mean 252,661.000 2.773 1.113 3.850 180.466 \n", 1335 | "std 30,403.698 1.672 0.956 1.072 273.437 \n", 1336 | "min 200,001.000 -1.000 0.000 1.000 0.000 \n", 1337 | "25% 226,331.000 3.000 0.000 3.000 40.000 \n", 1338 | "50% 252,661.000 3.000 2.000 4.000 94.000 \n", 1339 | "75% 278,991.000 4.000 2.000 5.000 212.000 \n", 1340 | "max 305,321.000 6.000 2.000 5.000 7,605.000 \n", 1341 | "\n", 1342 | " addcart_num delcart_num buy_num favor_num click_num \\\n", 1343 | "count 105,180.000 105,180.000 105,180.000 105,180.000 105,180.000 \n", 1344 | "mean 5.471 2.434 0.459 1.045 291.222 \n", 1345 | "std 10.618 5.600 1.048 3.442 460.031 \n", 1346 | "min 0.000 0.000 0.000 0.000 0.000 \n", 1347 | "25% 0.000 0.000 0.000 0.000 59.000 \n", 1348 | "50% 2.000 0.000 0.000 0.000 148.000 \n", 1349 | "75% 6.000 3.000 1.000 0.000 342.000 \n", 1350 | "max 369.000 231.000 50.000 99.000 15,302.000 \n", 1351 | "\n", 1352 | " buy_addcart_ratio buy_browse_ratio buy_click_ratio buy_favor_ratio \n", 1353 | "count 72,129.000 105,172.000 103,197.000 45,986.000 \n", 1354 | "mean 0.147 0.005 0.009 0.552 \n", 1355 | "std 0.270 0.022 0.074 0.473 \n", 1356 | "min 0.000 0.000 0.000 0.000 \n", 1357 | "25% 0.000 0.000 0.000 0.000 \n", 1358 | "50% 0.000 0.000 0.000 1.000 \n", 1359 | "75% 0.167 0.002 0.001 1.000 \n", 1360 | "max 1.000 1.000 1.000 1.000 " 1361 | ] 1362 | }, 1363 | "execution_count": 1, 1364 | "metadata": {}, 1365 | "output_type": "execute_result" 1366 | } 1367 | ], 1368 | "source": [ 1369 | "import pandas as pd\n", 1370 | "df_user = pd.read_csv('data/User_table.csv', header=0)\n", 1371 | "pd.options.display.float_format = '{:,.3f}'.format  # display format: keep three decimal places\n", 1372 | "df_user.describe()" 1373 | ] 1374 | }, 1375 | { 1376 | "cell_type": "markdown", 1377 | "metadata": {}, 1378 | "source": [ 1379 | "From the statistics above: the user_id count shows 105,321 users, 3 of whom lack the age and sex fields, while the browse/add-to-cart/delete-from-cart/buy statistics cover only 105,180 records, i.e. some users have no interaction records at all; both kinds of users can therefore be deleted." 1380 | ] 1381 | }, 1382 | { 1383 | "cell_type": "markdown", 1384 | "metadata": {}, 1385 | "source": [ 1386 | "Delete the users missing the age and sex fields" 1387 | ] 1388 | }, 1389 | { 1390 | "cell_type": "code", 1391 | "execution_count": 3, 1392 | "metadata": { 1393 | "collapsed": false 1394 | }, 1395 | "outputs": [ 1396 | { 1397 | "data": { 1398 | "text/html": [ 1399 | "
" 1475 | ], 1476 | "text/plain": [ 1477 | " user_id age sex user_lv_cd browse_num addcart_num delcart_num \\\n", 1478 | "34072 234073 nan nan 1 32.000 6.000 4.000 \n", 1479 | "38905 238906 nan nan 1 171.000 3.000 2.000 \n", 1480 | "67704 267705 nan nan 1 342.000 18.000 8.000 \n", 1481 | "\n", 1482 | " buy_num favor_num click_num buy_addcart_ratio buy_browse_ratio \\\n", 1483 | "34072 1.000 0.000 41.000 0.167 0.031 \n", 1484 | "38905 2.000 3.000 464.000 0.667 0.012 \n", 1485 | "67704 0.000 0.000 743.000 0.000 0.000 \n", 1486 | "\n", 1487 | " buy_click_ratio buy_favor_ratio \n", 1488 | "34072 0.024 1.000 \n", 1489 | "38905 0.004 0.667 \n", 1490 | "67704 0.000 nan " 1491 | ] 1492 | }, 1493 | "execution_count": 3, 1494 | "metadata": {}, 1495 | "output_type": "execute_result" 1496 | } 1497 | ], 1498 | "source": [ 1499 | "df_user[df_user['age'].isnull()]" 1500 | ] 1501 | }, 1502 | { 1503 | "cell_type": "code", 1504 | "execution_count": 4, 1505 | "metadata": { 1506 | "collapsed": true 1507 | }, 1508 | "outputs": [], 1509 | "source": [ 1510 | "delete_list = df_user[df_user['age'].isnull()].index\n", 1511 | "df_user.drop(delete_list,axis=0,inplace=True)" 1512 | ] 1513 | }, 1514 | { 1515 | "cell_type": "markdown", 1516 | "metadata": {}, 1517 | "source": [ 1518 | "删除无交互记录的用户" 1519 | ] 1520 | }, 1521 | { 1522 | "cell_type": "code", 1523 | "execution_count": 5, 1524 | "metadata": { 1525 | "collapsed": false 1526 | }, 1527 | "outputs": [ 1528 | { 1529 | "name": "stdout", 1530 | "output_type": "stream", 1531 | "text": [ 1532 | "105177\n" 1533 | ] 1534 | } 1535 | ], 1536 | "source": [ 1537 | "#删除无交互记录的用户\n", 1538 | "df_naction = df_user[(df_user['browse_num'].isnull()) & (df_user['addcart_num'].isnull()) & (df_user['delcart_num'].isnull()) & (df_user['buy_num'].isnull()) & (df_user['favor_num'].isnull()) & (df_user['click_num'].isnull())]\n", 1539 | "df_user.drop(df_naction.index,axis=0,inplace=True)\n", 1540 | "print len(df_user)" 1541 | ] 1542 | }, 1543 | { 1544 | "cell_type": 
"markdown", 1545 | "metadata": {}, 1546 | "source": [ 1547 | "Count and drop the users with no purchase records" 1548 | ] 1549 | }, 1550 | { 1551 | "cell_type": "code", 1552 | "execution_count": 6, 1553 | "metadata": { 1554 | "collapsed": false 1555 | }, 1556 | "outputs": [ 1557 | { 1558 | "name": "stdout", 1559 | "output_type": "stream", 1560 | "text": [ 1561 | "75694\n" 1562 | ] 1563 | } 1564 | ], 1565 | "source": [ 1566 | "# count the users with no purchase records\n", 1567 | "df_bzero = df_user[df_user['buy_num']==0]\n", 1568 | "# print how many users have zero purchases\n", 1569 | "print len(df_bzero)" 1570 | ] 1571 | }, 1572 | { 1573 | "cell_type": "code", 1574 | "execution_count": 7, 1575 | "metadata": { 1576 | "collapsed": false 1577 | }, 1578 | "outputs": [], 1579 | "source": [ 1580 | "# drop the users with no purchase records\n", 1581 | "df_user = df_user[df_user['buy_num']!=0]" 1582 | ] 1583 | }, 1584 | { 1585 | "cell_type": "code", 1586 | "execution_count": 8, 1587 | "metadata": { 1588 | "collapsed": false 1589 | }, 1590 | "outputs": [ 1591 | { 1592 | "data": { 1593 | "text/html": [ 1594 | "
" 1755 | ], 1756 | "text/plain": [ 1757 | " user_id age sex user_lv_cd browse_num addcart_num \\\n", 1758 | "count 29,483.000 29,483.000 29,483.000 29,483.000 29,483.000 29,483.000 \n", 1759 | "mean 250,746.445 2.914 1.025 4.272 302.488 10.525 \n", 1760 | "std 29,979.676 1.490 0.959 0.808 391.535 14.301 \n", 1761 | "min 200,001.000 -1.000 0.000 2.000 1.000 0.000 \n", 1762 | "25% 225,058.500 3.000 0.000 4.000 76.000 3.000 \n", 1763 | "50% 249,144.000 3.000 1.000 4.000 178.000 6.000 \n", 1764 | "75% 276,252.500 4.000 2.000 5.000 381.000 13.000 \n", 1765 | "max 305,318.000 6.000 2.000 5.000 7,605.000 288.000 \n", 1766 | "\n", 1767 | " delcart_num buy_num favor_num click_num buy_addcart_ratio \\\n", 1768 | "count 29,483.000 29,483.000 29,483.000 29,483.000 29,483.000 \n", 1769 | "mean 4.673 1.637 1.677 486.653 0.360 \n", 1770 | "std 7.568 1.412 4.584 658.671 0.320 \n", 1771 | "min 0.000 1.000 0.000 0.000 0.004 \n", 1772 | "25% 0.000 1.000 0.000 116.000 0.118 \n", 1773 | "50% 2.000 1.000 0.000 282.000 0.250 \n", 1774 | "75% 6.000 2.000 1.000 604.000 0.500 \n", 1775 | "max 178.000 50.000 96.000 15,302.000 1.000 \n", 1776 | "\n", 1777 | " buy_browse_ratio buy_click_ratio buy_favor_ratio \n", 1778 | "count 29,483.000 29,483.000 29,483.000 \n", 1779 | "mean 0.018 0.030 0.862 \n", 1780 | "std 0.038 0.136 0.287 \n", 1781 | "min 0.000 0.000 0.010 \n", 1782 | "25% 0.004 0.002 1.000 \n", 1783 | "50% 0.008 0.005 1.000 \n", 1784 | "75% 0.018 0.012 1.000 \n", 1785 | "max 1.000 1.000 1.000 " 1786 | ] 1787 | }, 1788 | "execution_count": 8, 1789 | "metadata": {}, 1790 | "output_type": "execute_result" 1791 | } 1792 | ], 1793 | "source": [ 1794 | "df_user.describe()" 1795 | ] 1796 | }, 1797 | { 1798 | "cell_type": "markdown", 1799 | "metadata": {}, 1800 | "source": [ 1801 | "删除爬虫及惰性用户" 1802 | ] 1803 | }, 1804 | { 1805 | "cell_type": "markdown", 1806 | "metadata": {}, 1807 | "source": [ 1808 | "由上表所知,浏览购买转换比和点击购买转换比均值为0.018,0.030,因此这里认为浏览购买转换比和点击购买转换比小于0.0005的用户为惰性用户" 1809 | ] 1810 | }, 
1811 | { 1812 | "cell_type": "code", 1813 | "execution_count": 9, 1814 | "metadata": { 1815 | "collapsed": false 1816 | }, 1817 | "outputs": [ 1818 | { 1819 | "name": "stdout", 1820 | "output_type": "stream", 1821 | "text": [ 1822 | "90\n" 1823 | ] 1824 | } 1825 | ], 1826 | "source": [ 1827 | "bindex = df_user[df_user['buy_browse_ratio']<0.0005].index\n", 1828 | "print len(bindex)\n", 1829 | "df_user.drop(bindex,axis=0,inplace=True)" 1830 | ] 1831 | }, 1832 | { 1833 | "cell_type": "code", 1834 | "execution_count": 10, 1835 | "metadata": { 1836 | "collapsed": false, 1837 | "scrolled": true 1838 | }, 1839 | "outputs": [ 1840 | { 1841 | "name": "stdout", 1842 | "output_type": "stream", 1843 | "text": [ 1844 | "323\n" 1845 | ] 1846 | } 1847 | ], 1848 | "source": [ 1849 | "cindex = df_user[df_user['buy_click_ratio']<0.0005].index\n", 1850 | "print len(cindex)\n", 1851 | "df_user.drop(cindex,axis=0,inplace=True)" 1852 | ] 1853 | }, 1854 | { 1855 | "cell_type": "code", 1856 | "execution_count": 11, 1857 | "metadata": { 1858 | "collapsed": false 1859 | }, 1860 | "outputs": [ 1861 | { 1862 | "data": { 1863 | "text/html": [ 1864 | "
" 2025 | ], 2026 | "text/plain": [ 2027 | " user_id age sex user_lv_cd browse_num addcart_num \\\n", 2028 | "count 29,070.000 29,070.000 29,070.000 29,070.000 29,070.000 29,070.000 \n", 2029 | "mean 250,767.099 2.910 1.028 4.268 280.260 10.145 \n", 2030 | "std 29,998.870 1.492 0.959 0.809 325.129 13.443 \n", 2031 | "min 200,001.000 -1.000 0.000 2.000 1.000 0.000 \n", 2032 | "25% 225,036.000 3.000 0.000 4.000 75.000 3.000 \n", 2033 | "50% 249,200.500 3.000 1.000 4.000 174.000 6.000 \n", 2034 | "75% 276,284.000 4.000 2.000 5.000 366.000 13.000 \n", 2035 | "max 305,318.000 6.000 2.000 5.000 5,007.000 288.000 \n", 2036 | "\n", 2037 | " delcart_num buy_num favor_num click_num buy_addcart_ratio \\\n", 2038 | "count 29,070.000 29,070.000 29,070.000 29,070.000 29,070.000 \n", 2039 | "mean 4.457 1.644 1.589 447.113 0.364 \n", 2040 | "std 6.998 1.420 4.294 530.994 0.320 \n", 2041 | "min 0.000 1.000 0.000 0.000 0.004 \n", 2042 | "25% 0.000 1.000 0.000 114.000 0.125 \n", 2043 | "50% 2.000 1.000 0.000 275.000 0.250 \n", 2044 | "75% 6.000 2.000 1.000 585.000 0.500 \n", 2045 | "max 158.000 50.000 69.000 8,156.000 1.000 \n", 2046 | "\n", 2047 | " buy_browse_ratio buy_click_ratio buy_favor_ratio \n", 2048 | "count 29,070.000 29,070.000 29,070.000 \n", 2049 | "mean 0.019 0.031 0.866 \n", 2050 | "std 0.038 0.137 0.282 \n", 2051 | "min 0.001 0.001 0.018 \n", 2052 | "25% 0.004 0.002 1.000 \n", 2053 | "50% 0.008 0.005 1.000 \n", 2054 | "75% 0.018 0.012 1.000 \n", 2055 | "max 1.000 1.000 1.000 " 2056 | ] 2057 | }, 2058 | "execution_count": 11, 2059 | "metadata": {}, 2060 | "output_type": "execute_result" 2061 | } 2062 | ], 2063 | "source": [ 2064 | "df_user.describe()" 2065 | ] 2066 | }, 2067 | { 2068 | "cell_type": "markdown", 2069 | "metadata": {}, 2070 | "source": [ 2071 | "最后这29070个用户为最终预测用户数据集" 2072 | ] 2073 | }, 2074 | { 2075 | "cell_type": "code", 2076 | "execution_count": 12, 2077 | "metadata": { 2078 | "collapsed": true 2079 | }, 2080 | "outputs": [], 2081 | "source": [ 2082 | 
"df_user.to_csv(\"data/JData_FUser.csv\",index=False)" 2083 | ] 2084 | }, 2085 | { 2086 | "cell_type": "markdown", 2087 | "metadata": {}, 2088 | "source": [ 2089 | "#### Item cleaning" 2090 | ] 2091 | }, 2092 | { 2093 | "cell_type": "code", 2094 | "execution_count": 13, 2095 | "metadata": { 2096 | "collapsed": false 2097 | }, 2098 | "outputs": [ 2099 | { 2100 | "data": { 2101 | "text/html": [ 2102 | "
" 2308 | ], 2309 | "text/plain": [ 2310 | " sku_id a1 a2 a3 cate brand \\\n", 2311 | "count 24,187.000 24,187.000 24,187.000 24,187.000 24,187.000 24,187.000 \n", 2312 | "mean 85,398.737 2.177 0.939 1.180 8.000 435.864 \n", 2313 | "std 49,238.799 1.176 0.970 1.046 0.000 225.749 \n", 2314 | "min 6.000 -1.000 -1.000 -1.000 8.000 3.000 \n", 2315 | "25% 42,476.000 1.000 1.000 1.000 8.000 214.000 \n", 2316 | "50% 85,616.000 3.000 1.000 1.000 8.000 489.000 \n", 2317 | "75% 127,774.000 3.000 2.000 2.000 8.000 571.000 \n", 2318 | "max 171,224.000 3.000 2.000 2.000 8.000 922.000 \n", 2319 | "\n", 2320 | " browse_num addcart_num delcart_num buy_num favor_num click_num \\\n", 2321 | "count 3,938.000 3,938.000 3,938.000 3,938.000 3,938.000 3,938.000 \n", 2322 | "mean 1,723.711 54.212 21.284 3.373 10.655 2,790.132 \n", 2323 | "std 7,957.661 285.723 106.100 21.695 49.956 12,647.534 \n", 2324 | "min 0.000 0.000 0.000 0.000 0.000 0.000 \n", 2325 | "25% 19.000 0.000 0.000 0.000 0.000 30.000 \n", 2326 | "50% 148.000 1.000 1.000 0.000 1.000 246.500 \n", 2327 | "75% 655.750 12.000 5.000 0.000 4.000 1,071.750 \n", 2328 | "max 194,920.000 6,296.000 2,258.000 691.000 1,205.000 312,005.000 \n", 2329 | "\n", 2330 | " buy_addcart_ratio buy_browse_ratio buy_click_ratio buy_favor_ratio \\\n", 2331 | "count 2,225.000 3,909.000 3,692.000 2,016.000 \n", 2332 | "mean 0.036 0.001 0.000 0.162 \n", 2333 | "std 0.093 0.002 0.001 0.264 \n", 2334 | "min 0.000 0.000 0.000 0.000 \n", 2335 | "25% 0.000 0.000 0.000 0.000 \n", 2336 | "50% 0.000 0.000 0.000 0.000 \n", 2337 | "75% 0.048 0.000 0.000 0.250 \n", 2338 | "max 1.000 0.114 0.063 1.000 \n", 2339 | "\n", 2340 | " comment_num has_bad_comment bad_comment_rate \n", 2341 | "count 6,830.000 6,830.000 6,830.000 \n", 2342 | "mean 2.732 0.486 0.044 \n", 2343 | "std 1.037 0.500 0.110 \n", 2344 | "min 1.000 0.000 0.000 \n", 2345 | "25% 2.000 0.000 0.000 \n", 2346 | "50% 3.000 0.000 0.000 \n", 2347 | "75% 4.000 1.000 0.044 \n", 2348 | "max 4.000 1.000 1.000 " 
2349 | ] 2350 | }, 2351 | "execution_count": 13, 2352 | "metadata": {}, 2353 | "output_type": "execute_result" 2354 | } 2355 | ], 2356 | "source": [ 2357 | "import pandas as pd\n", 2358 | "df_product = pd.read_csv('data/Item_table.csv',header=0)\n", 2359 | "pd.options.display.float_format = '{:,.3f}'.format # display floats with three decimal places\n", 2360 | "df_product.describe()" 2361 | ] 2362 | }, 2363 | { 2364 | "cell_type": "code", 2365 | "execution_count": null, 2366 | "metadata": { 2367 | "collapsed": true 2368 | }, 2369 | "outputs": [], 2370 | "source": [] 2371 | } 2372 | ], 2373 | "metadata": { 2374 | "kernelspec": { 2375 | "display_name": "Python 2", 2376 | "language": "python", 2377 | "name": "python2" 2378 | }, 2379 | "language_info": { 2380 | "codemirror_mode": { 2381 | "name": "ipython", 2382 | "version": 2 2383 | }, 2384 | "file_extension": ".py", 2385 | "mimetype": "text/x-python", 2386 | "name": "python", 2387 | "nbconvert_exporter": "python", 2388 | "pygments_lexer": "ipython2", 2389 | "version": "2.7.13" 2390 | } 2391 | }, 2392 | "nbformat": 4, 2393 | "nbformat_minor": 0 2394 | } 2395 | --------------------------------------------------------------------------------