├── README.md
├── data
│   └── put dataset here
├── extract_feature.py
└── xgb.py
/README.md:
--------------------------------------------------------------------------------
1 | Competition page: [https://tianchi.shuju.aliyun.com/competition/introduction.htm?spm=5176.100065.200879.2.6r6s4g&raceId=231587](https://tianchi.shuju.aliyun.com/competition/introduction.htm?spm=5176.100065.200879.2.6r6s4g&raceId=231587 "O2O Coupon Usage Prediction competition page")
2 |
3 | [Season 1 data](https://pan.baidu.com/s/1c1NkUn2)
4 |
5 | -------------------
6 | **Table of Contents**
7 |
8 |
9 | [TOC]
10 |
11 | -------------------
12 | We started working on this in earnest at the end of October. We had entered the practice round before, but this official round was the first one we really took seriously. My teammate and I learned a lot along the way and made some modest gains; whatever the final result, there will be more chances later.
13 |
14 | 
15 |
16 | -------------------
17 | # **Data and Evaluation**
18 |
19 | The competition provides users' real online and offline consumption behavior from 2016-01-01 to 2016-06-30; the task is to predict whether coupons received in July 2016 are used within 15 days of receipt. The metric is the average AUC (area under the ROC curve) of the redemption predictions: an AUC is computed separately for each coupon_id, and the mean over all coupons is the final score.
20 |
21 | -------------------
22 | # **Solution Overview**
23 | The data cover 2016-01-01~2016-06-30 and the task is to predict whether coupons received in July are used, i.e. used or not, so it reduces to a **binary classification problem** solved with standard classifiers. The first step is feature engineering, which involves splitting the data into a feature-extraction window and a training window, and then extracting features from the feature window: user features, merchant features, coupon features, user-merchant features, and user-coupon features. Later we added, for each receive date in the label window, counts of coupons received in the surrounding 7/3/1 days (the "following 7/3/1 days" counts could never be used in a production system, since at prediction time future receive events are unknown), which gave a large boost. Finally, GBDT, RandomForest, and LR were combined with rank-based model fusion.
24 |
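Concretely, the 15-day rule that defines the positive class is what `get_label` in `extract_feature.py` computes; in essence:

```python
from datetime import date

def get_label(s):
    """s is 'date:date_received' (yyyymmdd strings, 'null' = never used).
    1  -> coupon used within 15 days of receipt (positive class)
    0  -> received but never used
    -1 -> used, but only after the 15-day window"""
    used, received = s.split(':')
    if used == 'null':
        return 0
    gap = (date(int(used[:4]), int(used[4:6]), int(used[6:8]))
           - date(int(received[:4]), int(received[4:6]), int(received[6:8]))).days
    return 1 if gap <= 15 else -1
```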
25 | -------------------
26 | # **Data Splits**
27 | At first we used no split at all, so leakage crept into the features: results looked great on the training data and decent in offline tests, but online performance was disappointing. After introducing the split below, scores improved noticeably.
28 |
29 | | Set | Prediction window | Feature window |
30 | | :------- | :-------- | :-- |
31 | | test set | received: 20160701~20160731 | received & consumed: 20160101~20160630 |
32 | | training set | received: 20160515~20160615<br>consumed: 20160515~20160630 | received: 20160101~20160501<br>consumed: 20160101~20160515 |
33 |
34 | We did not build multiple training splits; that is one thing to improve (a code sketch of the split follows).
35 |
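In code, a split is plain date filtering on the raw table. This mirrors the `dataset2`/`feature2` slices in `extract_feature.py` (whose exact windows differ slightly from the table above); string comparison works because the `'null'` markers force the date columns to string dtype:

```python
import pandas as pd

cols = ['user_id', 'merchant_id', 'coupon_id', 'discount_rate',
        'distance', 'date_received', 'date']
off_train = pd.read_csv('data/ccf_offline_stage1_train.csv', header=None, names=cols)

# label window: coupons received 20160515~20160615
dataset2 = off_train[(off_train.date_received >= '20160515')
                     & (off_train.date_received <= '20160615')]
# feature window: purchases in 20160201~20160514, plus coupons received in
# that window but never used ('null' date), so unused coupons still count
feature2 = off_train[((off_train.date >= '20160201') & (off_train.date <= '20160514'))
                     | ((off_train.date == 'null')
                        & (off_train.date_received >= '20160201')
                        & (off_train.date_received <= '20160514'))]
```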
36 | -------------------
37 | # **Feature Engineering**
38 | There are five feature families: user, merchant, coupon, user-merchant, and user-coupon. The competition ships both online and offline data; only some users overlap between the two, and merchants and coupons do not overlap at all. My guess is that the online data are Taobao/Tmall purchase records: related, but only weakly, so from the online data we extracted user features only, while all five families were extracted from the offline data. The features are listed below:
39 |
40 | - User features: u
41 | 	- coupons received offline but never used u1
42 | 	- normal offline purchases u2
43 | 	- offline purchases made with a coupon u3
44 | 	- average gap between normal offline purchases u4
45 | 	- average gap between offline coupon purchases u5
46 | 	- u3/u1, coupon purchases over coupons left unused u6
47 | 	- u3/(u2+u3), share of the user's purchases made with a coupon u7
48 | 	- u4/15, the normal-purchase gap measured against the 15-day window; the smaller it is, the more likely the user makes a normal purchase within 15 days u8
49 | 	- u5/15, the coupon-purchase gap measured against the 15-day window; the smaller it is, the more likely the user redeems a coupon within 15 days u9
50 | 	- average gap between receiving a coupon and using it u10
51 | 	- u10/15, the redemption gap relative to the 15-day window; the smaller, the more likely a redemption inside 15 days (0 is best) u11
52 | 	- times a coupon was used within 15 days of receipt u12
53 | 	- u12/u3, 15-day redemptions over all coupon purchases; likelihood of redeeming within 15 days, the larger the better u13
54 | 	- u12/u1, 15-day redemptions over coupons left unused; likelihood of redeeming within 15 days, the larger the better u14
55 | 	- u1+u3, total coupons received u15
56 | 	- u12/u15, 15-day redemptions over total coupons received; likelihood of redeeming within 15 days, the larger the better u16
57 | 	- u1+u2, total purchases u17
58 | 	- gap from the most recent purchase to the current coupon receipt u18
59 | 	- gap from the most recent coupon purchase to the current coupon receipt u19
60 | 	- coupons the user received that day u20
61 | 	- coupons the user received i days before u20si
62 | 	- coupons the user received i days after u20ai
63 | 	- coupons the user received in the previous 7 days u21
64 | 	- coupons the user received in the previous 3 days u22
65 | 	- u22/u21 u23
66 | 	- u20/u22 u24
67 | 	- coupons the user received in the following 7 days u25
68 | 	- coupons the user received in the following 3 days u26
69 | 	- u26/u25 u27
70 | 	- u20/u26 u28
71 | 	- coupons the user received over the whole label window u29
72 | 	- distinct coupons the user received that day u30
73 | 	- distinct coupons the user received i days before u30si
74 | 	- distinct coupons the user received i days after u30ai
75 | 	- distinct coupons the user received over the whole label window u31
76 | 	- split the label window into 7/4/2-day sub-windows and extract features per window (a code sketch of these window counts follows this feature list):
77 | 		- coupons the user received in the 7/4/2-day window u32_i
78 | 		- rank of the r1/r2/r3/r4 discount fields among coupons the user received in the 7/4/2-day window u_ri_ranki
79 | 		- dense rank of the r1/r2/r3/r4 discount fields among coupons the user received in the 7/4/2-day window u_ri_dense_ranki
80 | 		- u32_4/u32_7 u33
81 | 		- u32_2/u32_4 u34
82 | 		- u32_2/u32_7 u35
83 | 		- u20/u32_2 u36
84 |
85 | 	- coupons received online but never used (action=2) uo1
86 | 	- online fixed-price purchases (action=1 and cid=0 and drate="fixed") uo2
87 | 	- online purchases made with a coupon uo3
88 | 	- normal online purchases (action=1 and cid=0 and drate="null") uo4
89 | 	- coupons received online, uo1+uo3 uo5
90 | 	- uo3/uo5, online redemptions over online receipts; positively correlated uo6
91 | 	- uo3/uo4, online redemptions over normal online purchases; positively correlated uo7
92 | 	- uo2/uo4, online fixed-price purchases over normal online purchases uo8
93 |
94 | 	- window features over the month immediately preceding the label window:
95 |
96 |
97 | 	- coupons received offline but never used uw1
98 | 	- normal offline purchases uw2
99 | 	- offline purchases made with a coupon uw3
100 | 	- average gap between normal offline purchases uw4
101 | 	- average gap between offline coupon purchases uw5
102 | 	- uw3/uw1, coupon purchases over coupons left unused uw6
103 | 	- uw3/(uw2+uw3), share of purchases made with a coupon uw7
104 | 	- uw4/15, the normal-purchase gap measured against the 15-day window; the smaller, the more likely a normal purchase within 15 days uw8
105 | 	- uw5/15, the coupon-purchase gap measured against the 15-day window; the smaller, the more likely a redemption within 15 days uw9
106 | 	- average gap between receiving a coupon and using it uw10
107 | 	- uw10/15, the redemption gap relative to the 15-day window; the smaller, the more likely (0 is best) uw11
108 | 	- times a coupon was used within 15 days of receipt uw12
109 | 	- uw12/uw3, 15-day redemptions over all coupon purchases; the larger the better uw13
110 | 	- uw12/uw1, 15-day redemptions over coupons left unused; the larger the better uw14
111 | 	- uw1+uw3, total coupons received uw15
112 | 	- uw12/uw15, 15-day redemptions over total coupons received; the larger the better uw16
113 | 	- uw1+uw2, total purchases uw17
114 |
115 |
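As referenced above, here is a minimal sketch of the before-window counts (a hypothetical `coupons_received_before` helper, not code from this repo). It computes, per row, how many coupons the same user received in the `days` days strictly before that row's receive date; the "after" variants mirror it with the gap sign flipped:

```python
import pandas as pd

def coupons_received_before(df, days=7):
    """Per row: coupons the same user received in the `days` days strictly
    before this row's date_received. df needs 'user_id' and 'date_received'
    (yyyymmdd). The per-user self-join is fine for a sketch, not for scale."""
    t = df[['user_id', 'date_received']].copy().reset_index(drop=True)
    t['date_received'] = pd.to_datetime(t['date_received'].astype(str), format='%Y%m%d')
    t['row'] = t.index
    pairs = t.merge(t[['user_id', 'date_received']]
                      .rename(columns={'date_received': 'other'}), on='user_id')
    gap = (pairs['date_received'] - pairs['other']).dt.days
    hits = pairs[(gap > 0) & (gap <= days)].groupby('row').size()
    return hits.reindex(t.index, fill_value=0)
```

`extract_feature.py` avoids the pairwise join: it concatenates each user's receive dates into a ':'-joined string and parses them per row (see `get_day_gap_before`/`get_day_gap_after`), the same idea with a smaller memory footprint.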
116 | ----------
117 |
118 |
119 | - Merchant features: m
120 | 	- total purchases at the merchant: m0
121 | 	- purchases at the merchant made with a coupon: m1
122 | 	- normal purchases at the merchant: m2
123 | 	- coupons issued but never used: m3
124 | 	- coupons issued, m3+m1: m4
125 | 	- coupon redemption rate, m1/m4: m5
126 | 	- coupons the merchant issued during the label window m6
127 | 	- coupons the merchant issued that day m7
128 | 	- people who received a coupon at this merchant during the label window m8
129 | 	- people who received a coupon at this merchant that day m9
130 | 	- split the label window into 7/4/2-day sub-windows:
131 | 		- coupons this merchant issued in the 7/4/2-day window m10_i
132 | 		- m9 / m10_7 m11
133 | 		- m9 / m10_4 m12
134 | 		- m9 / m10_2 m13
135 | 		- m10_2 / m10_4 m14
136 |
137 |
138 | ----------
139 |
140 |
141 | - Coupon features: c
142 | 	- discount rate of rate-type coupons r1
143 | 	- spend threshold of full-reduction coupons r2
144 | 	- reduction amount of full-reduction coupons r3
145 | 	- effective discount rate of full-reduction coupons, (r2-r3)/r2 r4 (see the parsing sketch after this list)
146 | 	- c1+c2, total copies of this coupon issued c0
147 | 	- copies of this coupon redeemed c1
148 | 	- copies issued but never used c2
149 | 	- c1/c0, redemption rate of this coupon c3
150 | 	- discount strength c5
151 | 	- rank of the discount strength among coupons received that day c5_rank
152 | 	- dense rank of the discount strength among coupons received that day c5_denserank
153 | 	- rank of the discount strength among coupons received at the same merchant that day c5_rankm
154 | 	- ~~percentile rank of the discount strength among coupons received that day c5_rankp~~
155 | 	- ~~percentile rank of the discount strength among coupons received at the same merchant that day c5_rankmp~~
156 | 	- copies of this coupon issued during the label window c6
157 | 	- copies of this coupon issued that day c7
158 | 	- ~~day of week the coupon was received c8~~
159 | 	- ~~whether the coupon was received on a weekend c9~~ dropping c8 and c9 actually improved the score...
160 | 	- people who received this coupon that day c10
161 | 	- people who received this coupon during the label window c11
162 | 	- gap from this coupon's most recent previous receipt to the current receipt c12
163 | 	- gap from this coupon's most recent consumption to the current receipt c13
164 | 	- split the label window into 7/4/2-day sub-windows:
165 | 		- copies of this coupon issued in the 7/4/2-day window c14_i
166 | 		- c10 / c14_7 AS c15
167 | 		- c10 / c14_4 AS c16
168 | 		- c14_2 / c14_4 AS c17
169 | 		- c10 / c14_2 AS
170 |
171 |
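The r1–r4 fields come from parsing the raw `discount_rate` string, which is either a plain rate like `0.9` or a full-reduction pair like `200:20` (spend 200, save 20). A compact version of the `calc_discount_rate`/`get_discount_man`/`get_discount_jian`/`is_man_jian` helpers in `extract_feature.py`:

```python
def parse_discount(s):
    """'200:20' -> (0.9, 200, 20, 1); '0.9' -> (0.9, None, None, 0)."""
    s = str(s)
    if ':' in s:
        man, jian = (int(x) for x in s.split(':'))    # spend `man`, save `jian`
        return 1.0 - float(jian) / man, man, jian, 1  # rate = (r2 - r3) / r2
    return float(s), None, None, 0                    # already a plain rate
```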
172 | ----------
173 |
174 |
175 | - User-merchant features: um
176 | 	- times the user purchased at this merchant um0
177 | 	- times the user redeemed a coupon at this merchant um1
178 | 	- coupons the user received at this merchant but never used um2
179 | 	- normal purchases the user made at this merchant um3
180 | 	- um1/(um1+um2), the user's redemption rate at this merchant um4
181 | 	- um0/(u2+u3); large values mark merchants the user visits often um5
182 | 	- um1/u3; large values mark merchants where the user prefers to redeem coupons um6
183 | 	- coupons the user received at this merchant during the label window um7
184 | 	- coupons the user received at this merchant that day um8
185 | 	- split the label window into 7/4/2-day sub-windows:
186 | 		- coupons the user received at this merchant in the 7/4/2-day window um9_i
187 | 		- um8 / um9_7 um10
188 | 		- um8 / um9_4 um11
189 | 		- um8 / um9_2 um12
190 | 		- um9_2 / um9_4 um13
191 |
192 |
193 | ----------
194 |
195 |
196 | - User-coupon features: uc
197 | 	- coupons the user received uc0
198 | 	- coupons the user received but never used uc1
199 | 	- copies of this coupon the user consumed uc2
200 | 	- uc2/uc0 uc3
201 | 	- copies of this coupon the user received in the window uc4 (partition by uid, cid)
202 | 	- copies of this coupon the user received that day uc5
203 | 	- receive time minus the user's last coupon-consumption time uc6
204 | 	- uc6/u5 uc7, positively correlated
205 |
206 | 	- copies of this coupon the user received i days before uc5si
207 | 	- copies of this coupon the user received i days after uc5ai
208 | 	- copies of this coupon the user received in the previous 7 days uc8
209 | 	- copies of this coupon the user received in the previous 3 days uc9
210 | 	- uc9/uc8 uc10 (set to 1 when uc8 is 0)
211 | 	- uc4/uc9 uc11
212 | 	- copies of this coupon the user received in the following 7 days uc12
213 | 	- copies of this coupon the user received in the following 3 days uc13
214 | 	- uc13/uc12 uc14
215 | 	- uc4/uc13 uc15
216 | 	- split the label window into 7/4/2-day sub-windows:
217 | 		- copies of this coupon the user received in the 7/4/2-day window uc16_i
218 | 		- rank of the discount rate among coupons the user received in the surrounding 2/4/7 days uc17_i
219 |
220 | -------------------
221 | # **Algorithms and Model Fusion**
222 | We started with GBDT and RF; GBDT clearly outperformed RF. Later we trained several GBDTs with different parameters and different positive/negative sample ratios and fused them by rank, which gave a small further gain, but compute limits kept us from pushing this further.
223 | ## **Model Fusion**
224 |
225 | Because the metric computes an AUC per coupon_id and then averages them, and rank-based fusion is particularly effective for AUC-like metrics, we fuse by rank. The formula is:
226 | $\sum\limits_{i=1}^{n}\frac{Weight_i}{Rank_i}$
227 |
228 | where $n$ is the number of models and $Weight_i$ is model $i$'s weight (equal weights give a plain average). $Rank_i$ is the sample's rank within model $i$, with rank 1 assigned to the highest predicted probability. Rank fusion exploits the ordering differences between models directly and quickly, with no need to calibrate and weight raw probabilities.
229 | ## **Application**
230 | From different parameters, samples (sampling rates), and feature sets we obtain several models and each model's probability output. Grouping by coupon_id, we convert the probabilities into descending ranks, which yields each model's $Rank_i$; here we use plain averaging, $Weight_i=1$, and the sum above becomes the single fused score per sample.
231 |
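A minimal sketch of this fusion step (assuming each model's output is a DataFrame with the same row order and a `prob` column; the column names here are illustrative, not from this repo):

```python
import pandas as pd

def rank_fuse(preds, weights=None):
    """preds: list of DataFrames sharing row order, with columns
    ['user_id', 'coupon_id', 'date_received', 'prob'].
    Returns sum of Weight_i / Rank_i, rank 1 = highest prob per coupon_id."""
    weights = weights or [1.0] * len(preds)
    fused = preds[0][['user_id', 'coupon_id', 'date_received']].copy()
    fused['score'] = 0.0
    for df, w in zip(preds, weights):
        rank = df.groupby('coupon_id')['prob'].rank(ascending=False, method='first')
        fused['score'] += w / rank.values
    return fused
```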
232 | ----------
233 | # **Offline Evaluation**
234 | Although the contest allowed four submissions per day, a local evaluation was invaluable early on while scores were still low: after adding new features, the offline score generally moved in the same direction as the online one (for example, adding the label-window receive-count features lifted the offline score by more than ten percentage points, and the online score followed), so it was a useful reference for judging features. Once gains shrank to the 0.1% level it stopped being informative.
235 |
236 | For offline evaluation we sample 1/3, 1/4, or 1/5 of the training set as a hold-out, train the model on the rest, and drop from the hold-out any coupon ID whose labels are all 0 or all 1. We then predict on the hold-out, compare predictions against the true labels (XNOR: 1 when equal, 0 otherwise), compute an AUC for each coupon ID, and average the per-coupon AUCs into the final score.
237 |
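The per-coupon averaged AUC itself is straightforward with scikit-learn (a sketch operating on the model's raw probabilities; coupon IDs with a single label class are skipped, as described above):

```python
from sklearn.metrics import roc_auc_score

def coupon_avg_auc(df):
    """df has columns ['coupon_id', 'label', 'prob'];
    returns the mean ROC AUC over coupon_ids with both classes present."""
    aucs = []
    for _, g in df.groupby('coupon_id'):
        if g.label.nunique() == 2:  # AUC needs both 0s and 1s
            aucs.append(roc_auc_score(g.label, g.prob))
    return sum(aucs) / len(aucs) if aucs else 0.0
```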
238 | -------------------
239 | # **Retrospective**
240 | I learned a great deal from this competition: using the distributed platform ODPS and its machine-learning platform for data cleaning, feature extraction, feature selection, classification modeling, parameter tuning, and model fusion. Working out a methodology built my confidence, and it also made clear how much is left to learn. I used to treat algorithms as black boxes, only familiar with their inputs and outputs; understanding them in depth is what it will take to break through the current plateau.
241 | I also think discussing and exchanging ideas with others matters a great deal; working alone, it is easy to drift off course. Just my personal view.
242 |
243 | CSDN blog post: [http://blog.csdn.net/shine19930820/article/details/53995369](http://blog.csdn.net/shine19930820/article/details/53995369)
244 |
245 | My own code was too messy, so this version is adapted from the first-place solution ([GitHub repo](https://github.com/wepe/O2O-Coupon-Usage-Forecast)).
246 |
--------------------------------------------------------------------------------
/data/put dataset here:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsaneLife/O2O-Predict-Coupon-Usage/adc71ce5198873f9961f2ea01a17fb19633131c6/data/put dataset here
--------------------------------------------------------------------------------
/extract_feature.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from datetime import date
4 |
5 |
6 | """
7 | dataset split:
8 | (date_received)
9 | dataset3: 20160701~20160731 (113640), features3 from 20160315~20160630 (off_test)
10 | dataset2: 20160515~20160615 (258446), features2 from 20160201~20160514
11 | dataset1: 20160414~20160514 (138303), features1 from 20160101~20160413
12 |
13 |
14 |
15 | 1.merchant related:
16 | sales_use_coupon. total_coupon
17 | transfer_rate = sales_use_coupon/total_coupon.
18 | merchant_avg_distance,merchant_min_distance,merchant_max_distance of those use coupon
19 | total_sales. coupon_rate = sales_use_coupon/total_sales.
20 |
21 | 2.coupon related:
22 | discount_rate. discount_man. discount_jian. is_man_jian
23 | day_of_week,day_of_month. (date_received)
24 |
25 | 3.user related:
26 | distance.
27 | user_avg_distance, user_min_distance,user_max_distance.
28 | buy_use_coupon. buy_total. coupon_received.
29 | buy_use_coupon/coupon_received.
30 | avg_diff_date_datereceived. min_diff_date_datereceived. max_diff_date_datereceived.
31 | count_merchant.
32 |
33 | 4.user_merchant:
34 | times_user_buy_merchant_before.
35 |
36 |
37 | 5. other feature:
38 | this_month_user_receive_all_coupon_count
39 | this_month_user_receive_same_coupon_count
40 | this_month_user_receive_same_coupon_lastone
41 | this_month_user_receive_same_coupon_firstone
42 | this_day_user_receive_all_coupon_count
43 | this_day_user_receive_same_coupon_count
44 | day_gap_before, day_gap_after (receive the same coupon)
45 | """
46 |
47 |
48 | #1754884 record,1053282 with coupon_id,9738 coupon. date_received:20160101~20160615,date:20160101~20160630, 539438 users, 8415 merchants
49 | off_train = pd.read_csv('data/ccf_offline_stage1_train.csv',header=None)
50 | off_train.columns = ['user_id','merchant_id','coupon_id','discount_rate','distance','date_received','date']
51 | #2050 coupon_id. date_received:20160701~20160731, 76309 users(76307 in trainset, 35965 in online_trainset), 1559 merchants(1558 in trainset)
52 | off_test = pd.read_csv('data/ccf_offline_stage1_test_revised.csv',header=None)
53 | off_test.columns = ['user_id','merchant_id','coupon_id','discount_rate','distance','date_received']
54 | #11429826 record(872357 with coupon_id),762858 user(267448 in off_train)
55 | on_train = pd.read_csv('data/ccf_online_stage1_train.csv',header=None)
56 | on_train.columns = ['user_id','merchant_id','action','coupon_id','discount_rate','date_received','date']
57 |
58 |
59 | dataset3 = off_test
60 | feature3 = off_train[((off_train.date>='20160315')&(off_train.date<='20160630'))|((off_train.date=='null')&(off_train.date_received>='20160315')&(off_train.date_received<='20160630'))]
61 | dataset2 = off_train[(off_train.date_received>='20160515')&(off_train.date_received<='20160615')]
62 | feature2 = off_train[(off_train.date>='20160201')&(off_train.date<='20160514')|((off_train.date=='null')&(off_train.date_received>='20160201')&(off_train.date_received<='20160514'))]
63 | dataset1 = off_train[(off_train.date_received>='20160414')&(off_train.date_received<='20160514')]
64 | feature1 = off_train[(off_train.date>='20160101')&(off_train.date<='20160413')|((off_train.date=='null')&(off_train.date_received>='20160101')&(off_train.date_received<='20160413'))]
65 |
66 |
67 | ############# other feature ##################
68 | """
69 | 5. other feature:
70 | this_month_user_receive_all_coupon_count
71 | this_month_user_receive_same_coupon_count
72 | this_month_user_receive_same_coupon_lastone
73 | this_month_user_receive_same_coupon_firstone
74 | this_day_user_receive_all_coupon_count
75 | this_day_user_receive_same_coupon_count
76 | day_gap_before, day_gap_after (receive the same coupon)
77 | """
78 |
79 | #for dataset3
80 | t = dataset3[['user_id']]
81 | t['this_month_user_receive_all_coupon_count'] = 1
82 | t = t.groupby('user_id').agg('sum').reset_index()
83 |
84 | t1 = dataset3[['user_id','coupon_id']]
85 | t1['this_month_user_receive_same_coupon_count'] = 1
86 | t1 = t1.groupby(['user_id','coupon_id']).agg('sum').reset_index()
87 |
88 | t2 = dataset3[['user_id','coupon_id','date_received']]
89 | t2.date_received = t2.date_received.astype('str')
90 | t2 = t2.groupby(['user_id','coupon_id'])['date_received'].agg(lambda x:':'.join(x)).reset_index()
91 | t2['receive_number'] = t2.date_received.apply(lambda s:len(s.split(':')))
92 | t2 = t2[t2.receive_number>1]
93 | t2['max_date_received'] = t2.date_received.apply(lambda s:max([int(d) for d in s.split(':')]))
94 | t2['min_date_received'] = t2.date_received.apply(lambda s:min([int(d) for d in s.split(':')]))
95 | t2 = t2[['user_id','coupon_id','max_date_received','min_date_received']]
96 |
97 | t3 = dataset3[['user_id','coupon_id','date_received']]
98 | t3 = pd.merge(t3,t2,on=['user_id','coupon_id'],how='left')
99 | t3['this_month_user_receive_same_coupon_lastone'] = t3.max_date_received - t3.date_received.astype('int')
100 | t3['this_month_user_receive_same_coupon_firstone'] = t3.date_received.astype('int') - t3.min_date_received
101 | def is_firstlastone(x):
102 |     if x==0:
103 |         return 1
104 |     elif x>0:
105 |         return 0
106 |     else:
107 |         return -1  # those only received once (gap is NaN after the left merge)
108 |
109 | t3.this_month_user_receive_same_coupon_lastone = t3.this_month_user_receive_same_coupon_lastone.apply(is_firstlastone)
110 | t3.this_month_user_receive_same_coupon_firstone = t3.this_month_user_receive_same_coupon_firstone.apply(is_firstlastone)
111 | t3 = t3[['user_id','coupon_id','date_received','this_month_user_receive_same_coupon_lastone','this_month_user_receive_same_coupon_firstone']]
112 |
113 | t4 = dataset3[['user_id','date_received']]
114 | t4['this_day_user_receive_all_coupon_count'] = 1
115 | t4 = t4.groupby(['user_id','date_received']).agg('sum').reset_index()
116 |
117 | t5 = dataset3[['user_id','coupon_id','date_received']]
118 | t5['this_day_user_receive_same_coupon_count'] = 1
119 | t5 = t5.groupby(['user_id','coupon_id','date_received']).agg('sum').reset_index()
120 |
121 | t6 = dataset3[['user_id','coupon_id','date_received']]
122 | t6.date_received = t6.date_received.astype('str')
123 | t6 = t6.groupby(['user_id','coupon_id'])['date_received'].agg(lambda x:':'.join(x)).reset_index()
124 | t6.rename(columns={'date_received':'dates'},inplace=True)
125 |
126 | def get_day_gap_before(s):
127 |     date_received,dates = s.split('-')
128 |     dates = dates.split(':')
129 |     gaps = []
130 |     for d in dates:
131 |         this_gap = (date(int(date_received[0:4]),int(date_received[4:6]),int(date_received[6:8]))-date(int(d[0:4]),int(d[4:6]),int(d[6:8]))).days
132 |         if this_gap>0:
133 |             gaps.append(this_gap)
134 |     if len(gaps)==0:
135 |         return -1
136 |     else:
137 |         return min(gaps)
138 |
139 | def get_day_gap_after(s):
140 |     date_received,dates = s.split('-')
141 |     dates = dates.split(':')
142 |     gaps = []
143 |     for d in dates:
144 |         this_gap = (date(int(d[0:4]),int(d[4:6]),int(d[6:8]))-date(int(date_received[0:4]),int(date_received[4:6]),int(date_received[6:8]))).days
145 |         if this_gap>0:
146 |             gaps.append(this_gap)
147 |     if len(gaps)==0:
148 |         return -1
149 |     else:
150 |         return min(gaps)
151 |
152 |
153 | t7 = dataset3[['user_id','coupon_id','date_received']]
154 | t7 = pd.merge(t7,t6,on=['user_id','coupon_id'],how='left')
155 | t7['date_received_date'] = t7.date_received.astype('str') + '-' + t7.dates
156 | t7['day_gap_before'] = t7.date_received_date.apply(get_day_gap_before)
157 | t7['day_gap_after'] = t7.date_received_date.apply(get_day_gap_after)
158 | t7 = t7[['user_id','coupon_id','date_received','day_gap_before','day_gap_after']]
159 |
160 | other_feature3 = pd.merge(t1,t,on='user_id')
161 | other_feature3 = pd.merge(other_feature3,t3,on=['user_id','coupon_id'])
162 | other_feature3 = pd.merge(other_feature3,t4,on=['user_id','date_received'])
163 | other_feature3 = pd.merge(other_feature3,t5,on=['user_id','coupon_id','date_received'])
164 | other_feature3 = pd.merge(other_feature3,t7,on=['user_id','coupon_id','date_received'])
165 | other_feature3.to_csv('data/other_feature3.csv',index=None)
166 | print(other_feature3.shape)
167 |
168 |
169 |
170 | #for dataset2
171 | t = dataset2[['user_id']]
172 | t['this_month_user_receive_all_coupon_count'] = 1
173 | t = t.groupby('user_id').agg('sum').reset_index()
174 |
175 | t1 = dataset2[['user_id','coupon_id']]
176 | t1['this_month_user_receive_same_coupon_count'] = 1
177 | t1 = t1.groupby(['user_id','coupon_id']).agg('sum').reset_index()
178 |
179 | t2 = dataset2[['user_id','coupon_id','date_received']]
180 | t2.date_received = t2.date_received.astype('str')
181 | t2 = t2.groupby(['user_id','coupon_id'])['date_received'].agg(lambda x:':'.join(x)).reset_index()
182 | t2['receive_number'] = t2.date_received.apply(lambda s:len(s.split(':')))
183 | t2 = t2[t2.receive_number>1]
184 | t2['max_date_received'] = t2.date_received.apply(lambda s:max([int(d) for d in s.split(':')]))
185 | t2['min_date_received'] = t2.date_received.apply(lambda s:min([int(d) for d in s.split(':')]))
186 | t2 = t2[['user_id','coupon_id','max_date_received','min_date_received']]
187 |
188 | t3 = dataset2[['user_id','coupon_id','date_received']]
189 | t3 = pd.merge(t3,t2,on=['user_id','coupon_id'],how='left')
190 | t3['this_month_user_receive_same_coupon_lastone'] = t3.max_date_received - t3.date_received.astype('int')
191 | t3['this_month_user_receive_same_coupon_firstone'] = t3.date_received.astype('int') - t3.min_date_received
192 | def is_firstlastone(x):
193 |     if x==0:
194 |         return 1
195 |     elif x>0:
196 |         return 0
197 |     else:
198 |         return -1  # those only received once
199 |
200 | t3.this_month_user_receive_same_coupon_lastone = t3.this_month_user_receive_same_coupon_lastone.apply(is_firstlastone)
201 | t3.this_month_user_receive_same_coupon_firstone = t3.this_month_user_receive_same_coupon_firstone.apply(is_firstlastone)
202 | t3 = t3[['user_id','coupon_id','date_received','this_month_user_receive_same_coupon_lastone','this_month_user_receive_same_coupon_firstone']]
203 |
204 | t4 = dataset2[['user_id','date_received']]
205 | t4['this_day_user_receive_all_coupon_count'] = 1
206 | t4 = t4.groupby(['user_id','date_received']).agg('sum').reset_index()
207 |
208 | t5 = dataset2[['user_id','coupon_id','date_received']]
209 | t5['this_day_user_receive_same_coupon_count'] = 1
210 | t5 = t5.groupby(['user_id','coupon_id','date_received']).agg('sum').reset_index()
211 |
212 | t6 = dataset2[['user_id','coupon_id','date_received']]
213 | t6.date_received = t6.date_received.astype('str')
214 | t6 = t6.groupby(['user_id','coupon_id'])['date_received'].agg(lambda x:':'.join(x)).reset_index()
215 | t6.rename(columns={'date_received':'dates'},inplace=True)
216 |
217 | def get_day_gap_before(s):
218 |     date_received,dates = s.split('-')
219 |     dates = dates.split(':')
220 |     gaps = []
221 |     for d in dates:
222 |         this_gap = (date(int(date_received[0:4]),int(date_received[4:6]),int(date_received[6:8]))-date(int(d[0:4]),int(d[4:6]),int(d[6:8]))).days
223 |         if this_gap>0:
224 |             gaps.append(this_gap)
225 |     if len(gaps)==0:
226 |         return -1
227 |     else:
228 |         return min(gaps)
229 |
230 | def get_day_gap_after(s):
231 |     date_received,dates = s.split('-')
232 |     dates = dates.split(':')
233 |     gaps = []
234 |     for d in dates:
235 |         this_gap = (date(int(d[0:4]),int(d[4:6]),int(d[6:8]))-date(int(date_received[0:4]),int(date_received[4:6]),int(date_received[6:8]))).days
236 |         if this_gap>0:
237 |             gaps.append(this_gap)
238 |     if len(gaps)==0:
239 |         return -1
240 |     else:
241 |         return min(gaps)
242 |
243 |
244 | t7 = dataset2[['user_id','coupon_id','date_received']]
245 | t7 = pd.merge(t7,t6,on=['user_id','coupon_id'],how='left')
246 | t7['date_received_date'] = t7.date_received.astype('str') + '-' + t7.dates
247 | t7['day_gap_before'] = t7.date_received_date.apply(get_day_gap_before)
248 | t7['day_gap_after'] = t7.date_received_date.apply(get_day_gap_after)
249 | t7 = t7[['user_id','coupon_id','date_received','day_gap_before','day_gap_after']]
250 |
251 | other_feature2 = pd.merge(t1,t,on='user_id')
252 | other_feature2 = pd.merge(other_feature2,t3,on=['user_id','coupon_id'])
253 | other_feature2 = pd.merge(other_feature2,t4,on=['user_id','date_received'])
254 | other_feature2 = pd.merge(other_feature2,t5,on=['user_id','coupon_id','date_received'])
255 | other_feature2 = pd.merge(other_feature2,t7,on=['user_id','coupon_id','date_received'])
256 | other_feature2.to_csv('data/other_feature2.csv',index=None)
257 | print(other_feature2.shape)
258 |
259 |
260 |
261 | #for dataset1
262 | t = dataset1[['user_id']]
263 | t['this_month_user_receive_all_coupon_count'] = 1
264 | t = t.groupby('user_id').agg('sum').reset_index()
265 |
266 | t1 = dataset1[['user_id','coupon_id']]
267 | t1['this_month_user_receive_same_coupon_count'] = 1
268 | t1 = t1.groupby(['user_id','coupon_id']).agg('sum').reset_index()
269 |
270 | t2 = dataset1[['user_id','coupon_id','date_received']]
271 | t2.date_received = t2.date_received.astype('str')
272 | t2 = t2.groupby(['user_id','coupon_id'])['date_received'].agg(lambda x:':'.join(x)).reset_index()
273 | t2['receive_number'] = t2.date_received.apply(lambda s:len(s.split(':')))
274 | t2 = t2[t2.receive_number>1]
275 | t2['max_date_received'] = t2.date_received.apply(lambda s:max([int(d) for d in s.split(':')]))
276 | t2['min_date_received'] = t2.date_received.apply(lambda s:min([int(d) for d in s.split(':')]))
277 | t2 = t2[['user_id','coupon_id','max_date_received','min_date_received']]
278 |
279 | t3 = dataset1[['user_id','coupon_id','date_received']]
280 | t3 = pd.merge(t3,t2,on=['user_id','coupon_id'],how='left')
281 | t3['this_month_user_receive_same_coupon_lastone'] = t3.max_date_received - t3.date_received.astype('int')
282 | t3['this_month_user_receive_same_coupon_firstone'] = t3.date_received.astype('int') - t3.min_date_received
283 | def is_firstlastone(x):
284 |     if x==0:
285 |         return 1
286 |     elif x>0:
287 |         return 0
288 |     else:
289 |         return -1  # those only received once
290 |
291 | t3.this_month_user_receive_same_coupon_lastone = t3.this_month_user_receive_same_coupon_lastone.apply(is_firstlastone)
292 | t3.this_month_user_receive_same_coupon_firstone = t3.this_month_user_receive_same_coupon_firstone.apply(is_firstlastone)
293 | t3 = t3[['user_id','coupon_id','date_received','this_month_user_receive_same_coupon_lastone','this_month_user_receive_same_coupon_firstone']]
294 |
295 | t4 = dataset1[['user_id','date_received']]
296 | t4['this_day_user_receive_all_coupon_count'] = 1
297 | t4 = t4.groupby(['user_id','date_received']).agg('sum').reset_index()
298 |
299 | t5 = dataset1[['user_id','coupon_id','date_received']]
300 | t5['this_day_user_receive_same_coupon_count'] = 1
301 | t5 = t5.groupby(['user_id','coupon_id','date_received']).agg('sum').reset_index()
302 |
303 | t6 = dataset1[['user_id','coupon_id','date_received']]
304 | t6.date_received = t6.date_received.astype('str')
305 | t6 = t6.groupby(['user_id','coupon_id'])['date_received'].agg(lambda x:':'.join(x)).reset_index()
306 | t6.rename(columns={'date_received':'dates'},inplace=True)
307 |
308 | def get_day_gap_before(s):
309 |     date_received,dates = s.split('-')
310 |     dates = dates.split(':')
311 |     gaps = []
312 |     for d in dates:
313 |         this_gap = (date(int(date_received[0:4]),int(date_received[4:6]),int(date_received[6:8]))-date(int(d[0:4]),int(d[4:6]),int(d[6:8]))).days
314 |         if this_gap>0:
315 |             gaps.append(this_gap)
316 |     if len(gaps)==0:
317 |         return -1
318 |     else:
319 |         return min(gaps)
320 |
321 | def get_day_gap_after(s):
322 |     date_received,dates = s.split('-')
323 |     dates = dates.split(':')
324 |     gaps = []
325 |     for d in dates:
326 |         this_gap = (date(int(d[0:4]),int(d[4:6]),int(d[6:8]))-date(int(date_received[0:4]),int(date_received[4:6]),int(date_received[6:8]))).days
327 |         if this_gap>0:
328 |             gaps.append(this_gap)
329 |     if len(gaps)==0:
330 |         return -1
331 |     else:
332 |         return min(gaps)
333 |
334 |
335 | t7 = dataset1[['user_id','coupon_id','date_received']]
336 | t7 = pd.merge(t7,t6,on=['user_id','coupon_id'],how='left')
337 | t7['date_received_date'] = t7.date_received.astype('str') + '-' + t7.dates
338 | t7['day_gap_before'] = t7.date_received_date.apply(get_day_gap_before)
339 | t7['day_gap_after'] = t7.date_received_date.apply(get_day_gap_after)
340 | t7 = t7[['user_id','coupon_id','date_received','day_gap_before','day_gap_after']]
341 |
342 | other_feature1 = pd.merge(t1,t,on='user_id')
343 | other_feature1 = pd.merge(other_feature1,t3,on=['user_id','coupon_id'])
344 | other_feature1 = pd.merge(other_feature1,t4,on=['user_id','date_received'])
345 | other_feature1 = pd.merge(other_feature1,t5,on=['user_id','coupon_id','date_received'])
346 | other_feature1 = pd.merge(other_feature1,t7,on=['user_id','coupon_id','date_received'])
347 | other_feature1.to_csv('data/other_feature1.csv',index=None)
348 | print(other_feature1.shape)
349 |
350 |
351 |
352 |
353 |
354 |
355 | ############# coupon related feature #############
356 | """
357 | 2.coupon related:
358 | discount_rate. discount_man. discount_jian. is_man_jian
359 | day_of_week,day_of_month. (date_received)
360 | """
361 | def calc_discount_rate(s):
362 |     s = str(s)
363 |     s = s.split(':')
364 |     if len(s)==1:
365 |         return float(s[0])  # already a plain rate, e.g. '0.9'
366 |     else:
367 |         return 1.0-float(s[1])/float(s[0])  # e.g. '200:20' -> 0.9
368 |
369 | def get_discount_man(s):
370 |     s = str(s)
371 |     s = s.split(':')
372 |     if len(s)==1:
373 |         return 'null'
374 |     else:
375 |         return int(s[0])  # spend threshold of a full-reduction coupon
376 |
377 | def get_discount_jian(s):
378 |     s = str(s)
379 |     s = s.split(':')
380 |     if len(s)==1:
381 |         return 'null'
382 |     else:
383 |         return int(s[1])  # reduction amount of a full-reduction coupon
384 |
385 | def is_man_jian(s):
386 |     s = str(s)
387 |     s = s.split(':')
388 |     if len(s)==1:
389 |         return 0
390 |     else:
391 |         return 1  # 1 marks a full-reduction ('man:jian') coupon
392 |
393 | #dataset3
394 | dataset3['day_of_week'] = dataset3.date_received.astype('str').apply(lambda x:date(int(x[0:4]),int(x[4:6]),int(x[6:8])).weekday()+1)
395 | dataset3['day_of_month'] = dataset3.date_received.astype('str').apply(lambda x:int(x[6:8]))
396 | dataset3['days_distance'] = dataset3.date_received.astype('str').apply(lambda x:(date(int(x[0:4]),int(x[4:6]),int(x[6:8]))-date(2016,6,30)).days)
397 | dataset3['discount_man'] = dataset3.discount_rate.apply(get_discount_man)
398 | dataset3['discount_jian'] = dataset3.discount_rate.apply(get_discount_jian)
399 | dataset3['is_man_jian'] = dataset3.discount_rate.apply(is_man_jian)
400 | dataset3['discount_rate'] = dataset3.discount_rate.apply(calc_discount_rate)
401 | d = dataset3[['coupon_id']]
402 | d['coupon_count'] = 1
403 | d = d.groupby('coupon_id').agg('sum').reset_index()
404 | dataset3 = pd.merge(dataset3,d,on='coupon_id',how='left')
405 | dataset3.to_csv('data/coupon3_feature.csv',index=None)
406 | #dataset2
407 | dataset2['day_of_week'] = dataset2.date_received.astype('str').apply(lambda x:date(int(x[0:4]),int(x[4:6]),int(x[6:8])).weekday()+1)
408 | dataset2['day_of_month'] = dataset2.date_received.astype('str').apply(lambda x:int(x[6:8]))
409 | dataset2['days_distance'] = dataset2.date_received.astype('str').apply(lambda x:(date(int(x[0:4]),int(x[4:6]),int(x[6:8]))-date(2016,5,14)).days)
410 | dataset2['discount_man'] = dataset2.discount_rate.apply(get_discount_man)
411 | dataset2['discount_jian'] = dataset2.discount_rate.apply(get_discount_jian)
412 | dataset2['is_man_jian'] = dataset2.discount_rate.apply(is_man_jian)
413 | dataset2['discount_rate'] = dataset2.discount_rate.apply(calc_discount_rate)
414 | d = dataset2[['coupon_id']]
415 | d['coupon_count'] = 1
416 | d = d.groupby('coupon_id').agg('sum').reset_index()
417 | dataset2 = pd.merge(dataset2,d,on='coupon_id',how='left')
418 | dataset2.to_csv('data/coupon2_feature.csv',index=None)
419 | #dataset1
420 | dataset1['day_of_week'] = dataset1.date_received.astype('str').apply(lambda x:date(int(x[0:4]),int(x[4:6]),int(x[6:8])).weekday()+1)
421 | dataset1['day_of_month'] = dataset1.date_received.astype('str').apply(lambda x:int(x[6:8]))
422 | dataset1['days_distance'] = dataset1.date_received.astype('str').apply(lambda x:(date(int(x[0:4]),int(x[4:6]),int(x[6:8]))-date(2016,4,13)).days)
423 | dataset1['discount_man'] = dataset1.discount_rate.apply(get_discount_man)
424 | dataset1['discount_jian'] = dataset1.discount_rate.apply(get_discount_jian)
425 | dataset1['is_man_jian'] = dataset1.discount_rate.apply(is_man_jian)
426 | dataset1['discount_rate'] = dataset1.discount_rate.apply(calc_discount_rate)
427 | d = dataset1[['coupon_id']]
428 | d['coupon_count'] = 1
429 | d = d.groupby('coupon_id').agg('sum').reset_index()
430 | dataset1 = pd.merge(dataset1,d,on='coupon_id',how='left')
431 | dataset1.to_csv('data/coupon1_feature.csv',index=None)
432 |
433 |
434 |
435 | ############# merchant related feature #############
436 | """
437 | 1.merchant related:
438 | total_sales. sales_use_coupon. total_coupon
439 | coupon_rate = sales_use_coupon/total_sales.
440 | transfer_rate = sales_use_coupon/total_coupon.
441 | merchant_avg_distance,merchant_min_distance,merchant_max_distance of those use coupon
442 |
443 | """
444 |
445 | #for dataset3
446 | merchant3 = feature3[['merchant_id','coupon_id','distance','date_received','date']]
447 |
448 | t = merchant3[['merchant_id']]
449 | t.drop_duplicates(inplace=True)
450 |
451 | t1 = merchant3[merchant3.date!='null'][['merchant_id']]
452 | t1['total_sales'] = 1
453 | t1 = t1.groupby('merchant_id').agg('sum').reset_index()
454 |
455 | t2 = merchant3[(merchant3.date!='null')&(merchant3.coupon_id!='null')][['merchant_id']]
456 | t2['sales_use_coupon'] = 1
457 | t2 = t2.groupby('merchant_id').agg('sum').reset_index()
458 |
459 | t3 = merchant3[merchant3.coupon_id!='null'][['merchant_id']]
460 | t3['total_coupon'] = 1
461 | t3 = t3.groupby('merchant_id').agg('sum').reset_index()
462 |
463 | t4 = merchant3[(merchant3.date!='null')&(merchant3.coupon_id!='null')][['merchant_id','distance']]
464 | t4.replace('null',-1,inplace=True)
465 | t4.distance = t4.distance.astype('int')
466 | t4.replace(-1,np.nan,inplace=True)
467 | t5 = t4.groupby('merchant_id').agg('min').reset_index()
468 | t5.rename(columns={'distance':'merchant_min_distance'},inplace=True)
469 |
470 | t6 = t4.groupby('merchant_id').agg('max').reset_index()
471 | t6.rename(columns={'distance':'merchant_max_distance'},inplace=True)
472 |
473 | t7 = t4.groupby('merchant_id').agg('mean').reset_index()
474 | t7.rename(columns={'distance':'merchant_mean_distance'},inplace=True)
475 |
476 | t8 = t4.groupby('merchant_id').agg('median').reset_index()
477 | t8.rename(columns={'distance':'merchant_median_distance'},inplace=True)
478 |
479 | merchant3_feature = pd.merge(t,t1,on='merchant_id',how='left')
480 | merchant3_feature = pd.merge(merchant3_feature,t2,on='merchant_id',how='left')
481 | merchant3_feature = pd.merge(merchant3_feature,t3,on='merchant_id',how='left')
482 | merchant3_feature = pd.merge(merchant3_feature,t5,on='merchant_id',how='left')
483 | merchant3_feature = pd.merge(merchant3_feature,t6,on='merchant_id',how='left')
484 | merchant3_feature = pd.merge(merchant3_feature,t7,on='merchant_id',how='left')
485 | merchant3_feature = pd.merge(merchant3_feature,t8,on='merchant_id',how='left')
486 | merchant3_feature.sales_use_coupon = merchant3_feature.sales_use_coupon.replace(np.nan,0) #fillna with 0
487 | merchant3_feature['merchant_coupon_transfer_rate'] = merchant3_feature.sales_use_coupon.astype('float') / merchant3_feature.total_coupon
488 | merchant3_feature['coupon_rate'] = merchant3_feature.sales_use_coupon.astype('float') / merchant3_feature.total_sales
489 | merchant3_feature.total_coupon = merchant3_feature.total_coupon.replace(np.nan,0) #fillna with 0
490 | merchant3_feature.to_csv('data/merchant3_feature.csv',index=None)
491 |
492 |
493 | #for dataset2
494 | merchant2 = feature2[['merchant_id','coupon_id','distance','date_received','date']]
495 |
496 | t = merchant2[['merchant_id']]
497 | t.drop_duplicates(inplace=True)
498 |
499 | t1 = merchant2[merchant2.date!='null'][['merchant_id']]
500 | t1['total_sales'] = 1
501 | t1 = t1.groupby('merchant_id').agg('sum').reset_index()
502 |
503 | t2 = merchant2[(merchant2.date!='null')&(merchant2.coupon_id!='null')][['merchant_id']]
504 | t2['sales_use_coupon'] = 1
505 | t2 = t2.groupby('merchant_id').agg('sum').reset_index()
506 |
507 | t3 = merchant2[merchant2.coupon_id!='null'][['merchant_id']]
508 | t3['total_coupon'] = 1
509 | t3 = t3.groupby('merchant_id').agg('sum').reset_index()
510 |
511 | t4 = merchant2[(merchant2.date!='null')&(merchant2.coupon_id!='null')][['merchant_id','distance']]
512 | t4.replace('null',-1,inplace=True)
513 | t4.distance = t4.distance.astype('int')
514 | t4.replace(-1,np.nan,inplace=True)
515 | t5 = t4.groupby('merchant_id').agg('min').reset_index()
516 | t5.rename(columns={'distance':'merchant_min_distance'},inplace=True)
517 |
518 | t6 = t4.groupby('merchant_id').agg('max').reset_index()
519 | t6.rename(columns={'distance':'merchant_max_distance'},inplace=True)
520 |
521 | t7 = t4.groupby('merchant_id').agg('mean').reset_index()
522 | t7.rename(columns={'distance':'merchant_mean_distance'},inplace=True)
523 |
524 | t8 = t4.groupby('merchant_id').agg('median').reset_index()
525 | t8.rename(columns={'distance':'merchant_median_distance'},inplace=True)
526 |
527 | merchant2_feature = pd.merge(t,t1,on='merchant_id',how='left')
528 | merchant2_feature = pd.merge(merchant2_feature,t2,on='merchant_id',how='left')
529 | merchant2_feature = pd.merge(merchant2_feature,t3,on='merchant_id',how='left')
530 | merchant2_feature = pd.merge(merchant2_feature,t5,on='merchant_id',how='left')
531 | merchant2_feature = pd.merge(merchant2_feature,t6,on='merchant_id',how='left')
532 | merchant2_feature = pd.merge(merchant2_feature,t7,on='merchant_id',how='left')
533 | merchant2_feature = pd.merge(merchant2_feature,t8,on='merchant_id',how='left')
534 | merchant2_feature.sales_use_coupon = merchant2_feature.sales_use_coupon.replace(np.nan,0) #fillna with 0
535 | merchant2_feature['merchant_coupon_transfer_rate'] = merchant2_feature.sales_use_coupon.astype('float') / merchant2_feature.total_coupon
536 | merchant2_feature['coupon_rate'] = merchant2_feature.sales_use_coupon.astype('float') / merchant2_feature.total_sales
537 | merchant2_feature.total_coupon = merchant2_feature.total_coupon.replace(np.nan,0) #fillna with 0
538 | merchant2_feature.to_csv('data/merchant2_feature.csv',index=None)
539 |
540 | #for dataset1
541 | merchant1 = feature1[['merchant_id','coupon_id','distance','date_received','date']]
542 |
543 | t = merchant1[['merchant_id']]
544 | t.drop_duplicates(inplace=True)
545 |
546 | t1 = merchant1[merchant1.date!='null'][['merchant_id']]
547 | t1['total_sales'] = 1
548 | t1 = t1.groupby('merchant_id').agg('sum').reset_index()
549 |
550 | t2 = merchant1[(merchant1.date!='null')&(merchant1.coupon_id!='null')][['merchant_id']]
551 | t2['sales_use_coupon'] = 1
552 | t2 = t2.groupby('merchant_id').agg('sum').reset_index()
553 |
554 | t3 = merchant1[merchant1.coupon_id!='null'][['merchant_id']]
555 | t3['total_coupon'] = 1
556 | t3 = t3.groupby('merchant_id').agg('sum').reset_index()
557 |
558 | t4 = merchant1[(merchant1.date!='null')&(merchant1.coupon_id!='null')][['merchant_id','distance']]
559 | t4.replace('null',-1,inplace=True)
560 | t4.distance = t4.distance.astype('int')
561 | t4.replace(-1,np.nan,inplace=True)
562 | t5 = t4.groupby('merchant_id').agg('min').reset_index()
563 | t5.rename(columns={'distance':'merchant_min_distance'},inplace=True)
564 |
565 | t6 = t4.groupby('merchant_id').agg('max').reset_index()
566 | t6.rename(columns={'distance':'merchant_max_distance'},inplace=True)
567 |
568 | t7 = t4.groupby('merchant_id').agg('mean').reset_index()
569 | t7.rename(columns={'distance':'merchant_mean_distance'},inplace=True)
570 |
571 | t8 = t4.groupby('merchant_id').agg('median').reset_index()
572 | t8.rename(columns={'distance':'merchant_median_distance'},inplace=True)
573 |
574 |
575 | merchant1_feature = pd.merge(t,t1,on='merchant_id',how='left')
576 | merchant1_feature = pd.merge(merchant1_feature,t2,on='merchant_id',how='left')
577 | merchant1_feature = pd.merge(merchant1_feature,t3,on='merchant_id',how='left')
578 | merchant1_feature = pd.merge(merchant1_feature,t5,on='merchant_id',how='left')
579 | merchant1_feature = pd.merge(merchant1_feature,t6,on='merchant_id',how='left')
580 | merchant1_feature = pd.merge(merchant1_feature,t7,on='merchant_id',how='left')
581 | merchant1_feature = pd.merge(merchant1_feature,t8,on='merchant_id',how='left')
582 | merchant1_feature.sales_use_coupon = merchant1_feature.sales_use_coupon.replace(np.nan,0) #fillna with 0
583 | merchant1_feature['merchant_coupon_transfer_rate'] = merchant1_feature.sales_use_coupon.astype('float') / merchant1_feature.total_coupon
584 | merchant1_feature['coupon_rate'] = merchant1_feature.sales_use_coupon.astype('float') / merchant1_feature.total_sales
585 | merchant1_feature.total_coupon = merchant1_feature.total_coupon.replace(np.nan,0) #fillna with 0
586 | merchant1_feature.to_csv('data/merchant1_feature.csv',index=None)
587 |
588 |
589 |
590 |
591 | ############# user related feature #############
592 | """
593 | 3.user related:
594 | count_merchant.
595 | user_avg_distance, user_min_distance,user_max_distance.
596 | buy_use_coupon. buy_total. coupon_received.
597 | buy_use_coupon/coupon_received.
598 | buy_use_coupon/buy_total
599 | user_date_datereceived_gap
600 |
601 |
602 | """
603 |
604 | def get_user_date_datereceived_gap(s):
605 |     s = s.split(':')
606 |     return (date(int(s[0][0:4]),int(s[0][4:6]),int(s[0][6:8])) - date(int(s[1][0:4]),int(s[1][4:6]),int(s[1][6:8]))).days
607 |
608 | #for dataset3
609 | user3 = feature3[['user_id','merchant_id','coupon_id','discount_rate','distance','date_received','date']]
610 |
611 | t = user3[['user_id']]
612 | t.drop_duplicates(inplace=True)
613 |
614 | t1 = user3[user3.date!='null'][['user_id','merchant_id']]
615 | t1.drop_duplicates(inplace=True)
616 | t1.merchant_id = 1
617 | t1 = t1.groupby('user_id').agg('sum').reset_index()
618 | t1.rename(columns={'merchant_id':'count_merchant'},inplace=True)
619 |
620 | t2 = user3[(user3.date!='null')&(user3.coupon_id!='null')][['user_id','distance']]
621 | t2.replace('null',-1,inplace=True)
622 | t2.distance = t2.distance.astype('int')
623 | t2.replace(-1,np.nan,inplace=True)
624 | t3 = t2.groupby('user_id').agg('min').reset_index()
625 | t3.rename(columns={'distance':'user_min_distance'},inplace=True)
626 |
627 | t4 = t2.groupby('user_id').agg('max').reset_index()
628 | t4.rename(columns={'distance':'user_max_distance'},inplace=True)
629 |
630 | t5 = t2.groupby('user_id').agg('mean').reset_index()
631 | t5.rename(columns={'distance':'user_mean_distance'},inplace=True)
632 |
633 | t6 = t2.groupby('user_id').agg('median').reset_index()
634 | t6.rename(columns={'distance':'user_median_distance'},inplace=True)
635 |
636 | t7 = user3[(user3.date!='null')&(user3.coupon_id!='null')][['user_id']]
637 | t7['buy_use_coupon'] = 1
638 | t7 = t7.groupby('user_id').agg('sum').reset_index()
639 |
640 | t8 = user3[user3.date!='null'][['user_id']]
641 | t8['buy_total'] = 1
642 | t8 = t8.groupby('user_id').agg('sum').reset_index()
643 |
644 | t9 = user3[user3.coupon_id!='null'][['user_id']]
645 | t9['coupon_received'] = 1
646 | t9 = t9.groupby('user_id').agg('sum').reset_index()
647 |
648 | t10 = user3[(user3.date_received!='null')&(user3.date!='null')][['user_id','date_received','date']]
649 | t10['user_date_datereceived_gap'] = t10.date + ':' + t10.date_received
650 | t10.user_date_datereceived_gap = t10.user_date_datereceived_gap.apply(get_user_date_datereceived_gap)
651 | t10 = t10[['user_id','user_date_datereceived_gap']]
652 |
653 | t11 = t10.groupby('user_id').agg('mean').reset_index()
654 | t11.rename(columns={'user_date_datereceived_gap':'avg_user_date_datereceived_gap'},inplace=True)
655 | t12 = t10.groupby('user_id').agg('min').reset_index()
656 | t12.rename(columns={'user_date_datereceived_gap':'min_user_date_datereceived_gap'},inplace=True)
657 | t13 = t10.groupby('user_id').agg('max').reset_index()
658 | t13.rename(columns={'user_date_datereceived_gap':'max_user_date_datereceived_gap'},inplace=True)
659 |
660 |
661 | user3_feature = pd.merge(t,t1,on='user_id',how='left')
662 | user3_feature = pd.merge(user3_feature,t3,on='user_id',how='left')
663 | user3_feature = pd.merge(user3_feature,t4,on='user_id',how='left')
664 | user3_feature = pd.merge(user3_feature,t5,on='user_id',how='left')
665 | user3_feature = pd.merge(user3_feature,t6,on='user_id',how='left')
666 | user3_feature = pd.merge(user3_feature,t7,on='user_id',how='left')
667 | user3_feature = pd.merge(user3_feature,t8,on='user_id',how='left')
668 | user3_feature = pd.merge(user3_feature,t9,on='user_id',how='left')
669 | user3_feature = pd.merge(user3_feature,t11,on='user_id',how='left')
670 | user3_feature = pd.merge(user3_feature,t12,on='user_id',how='left')
671 | user3_feature = pd.merge(user3_feature,t13,on='user_id',how='left')
672 | user3_feature.count_merchant = user3_feature.count_merchant.replace(np.nan,0)
673 | user3_feature.buy_use_coupon = user3_feature.buy_use_coupon.replace(np.nan,0)
674 | user3_feature['buy_use_coupon_rate'] = user3_feature.buy_use_coupon.astype('float') / user3_feature.buy_total.astype('float')
675 | user3_feature['user_coupon_transfer_rate'] = user3_feature.buy_use_coupon.astype('float') / user3_feature.coupon_received.astype('float')
676 | user3_feature.buy_total = user3_feature.buy_total.replace(np.nan,0)
677 | user3_feature.coupon_received = user3_feature.coupon_received.replace(np.nan,0)
678 | user3_feature.to_csv('data/user3_feature.csv',index=None)
679 |
680 |
681 | #for dataset2
682 | user2 = feature2[['user_id','merchant_id','coupon_id','discount_rate','distance','date_received','date']]
683 |
684 | t = user2[['user_id']]
685 | t.drop_duplicates(inplace=True)
686 |
687 | t1 = user2[user2.date!='null'][['user_id','merchant_id']]
688 | t1.drop_duplicates(inplace=True)
689 | t1.merchant_id = 1
690 | t1 = t1.groupby('user_id').agg('sum').reset_index()
691 | t1.rename(columns={'merchant_id':'count_merchant'},inplace=True)
692 |
693 | t2 = user2[(user2.date!='null')&(user2.coupon_id!='null')][['user_id','distance']]
694 | t2.replace('null',-1,inplace=True)
695 | t2.distance = t2.distance.astype('int')
696 | t2.replace(-1,np.nan,inplace=True)
697 | t3 = t2.groupby('user_id').agg('min').reset_index()
698 | t3.rename(columns={'distance':'user_min_distance'},inplace=True)
699 |
700 | t4 = t2.groupby('user_id').agg('max').reset_index()
701 | t4.rename(columns={'distance':'user_max_distance'},inplace=True)
702 |
703 | t5 = t2.groupby('user_id').agg('mean').reset_index()
704 | t5.rename(columns={'distance':'user_mean_distance'},inplace=True)
705 |
706 | t6 = t2.groupby('user_id').agg('median').reset_index()
707 | t6.rename(columns={'distance':'user_median_distance'},inplace=True)
708 |
709 | t7 = user2[(user2.date!='null')&(user2.coupon_id!='null')][['user_id']]
710 | t7['buy_use_coupon'] = 1
711 | t7 = t7.groupby('user_id').agg('sum').reset_index()
712 |
713 | t8 = user2[user2.date!='null'][['user_id']]
714 | t8['buy_total'] = 1
715 | t8 = t8.groupby('user_id').agg('sum').reset_index()
716 |
717 | t9 = user2[user2.coupon_id!='null'][['user_id']]
718 | t9['coupon_received'] = 1
719 | t9 = t9.groupby('user_id').agg('sum').reset_index()
720 |
721 | t10 = user2[(user2.date_received!='null')&(user2.date!='null')][['user_id','date_received','date']]
722 | t10['user_date_datereceived_gap'] = t10.date + ':' + t10.date_received
723 | t10.user_date_datereceived_gap = t10.user_date_datereceived_gap.apply(get_user_date_datereceived_gap)
724 | t10 = t10[['user_id','user_date_datereceived_gap']]
725 |
726 | t11 = t10.groupby('user_id').agg('mean').reset_index()
727 | t11.rename(columns={'user_date_datereceived_gap':'avg_user_date_datereceived_gap'},inplace=True)
728 | t12 = t10.groupby('user_id').agg('min').reset_index()
729 | t12.rename(columns={'user_date_datereceived_gap':'min_user_date_datereceived_gap'},inplace=True)
730 | t13 = t10.groupby('user_id').agg('max').reset_index()
731 | t13.rename(columns={'user_date_datereceived_gap':'max_user_date_datereceived_gap'},inplace=True)
732 |
733 | user2_feature = pd.merge(t,t1,on='user_id',how='left')
734 | user2_feature = pd.merge(user2_feature,t3,on='user_id',how='left')
735 | user2_feature = pd.merge(user2_feature,t4,on='user_id',how='left')
736 | user2_feature = pd.merge(user2_feature,t5,on='user_id',how='left')
737 | user2_feature = pd.merge(user2_feature,t6,on='user_id',how='left')
738 | user2_feature = pd.merge(user2_feature,t7,on='user_id',how='left')
739 | user2_feature = pd.merge(user2_feature,t8,on='user_id',how='left')
740 | user2_feature = pd.merge(user2_feature,t9,on='user_id',how='left')
741 | user2_feature = pd.merge(user2_feature,t11,on='user_id',how='left')
742 | user2_feature = pd.merge(user2_feature,t12,on='user_id',how='left')
743 | user2_feature = pd.merge(user2_feature,t13,on='user_id',how='left')
744 | user2_feature.count_merchant = user2_feature.count_merchant.replace(np.nan,0)
745 | user2_feature.buy_use_coupon = user2_feature.buy_use_coupon.replace(np.nan,0)
746 | user2_feature['buy_use_coupon_rate'] = user2_feature.buy_use_coupon.astype('float') / user2_feature.buy_total.astype('float')
747 | user2_feature['user_coupon_transfer_rate'] = user2_feature.buy_use_coupon.astype('float') / user2_feature.coupon_received.astype('float')
748 | user2_feature.buy_total = user2_feature.buy_total.replace(np.nan,0)
749 | user2_feature.coupon_received = user2_feature.coupon_received.replace(np.nan,0)
750 | user2_feature.to_csv('data/user2_feature.csv',index=None)
751 |
752 |
753 | #for dataset1
754 | user1 = feature1[['user_id','merchant_id','coupon_id','discount_rate','distance','date_received','date']]
755 |
756 | t = user1[['user_id']]
757 | t.drop_duplicates(inplace=True)
758 |
759 | t1 = user1[user1.date!='null'][['user_id','merchant_id']]
760 | t1.drop_duplicates(inplace=True)
761 | t1.merchant_id = 1
762 | t1 = t1.groupby('user_id').agg('sum').reset_index()
763 | t1.rename(columns={'merchant_id':'count_merchant'},inplace=True)
764 |
765 | t2 = user1[(user1.date!='null')&(user1.coupon_id!='null')][['user_id','distance']]
766 | t2.replace('null',-1,inplace=True)
767 | t2.distance = t2.distance.astype('int')
768 | t2.replace(-1,np.nan,inplace=True)
769 | t3 = t2.groupby('user_id').agg('min').reset_index()
770 | t3.rename(columns={'distance':'user_min_distance'},inplace=True)
771 |
772 | t4 = t2.groupby('user_id').agg('max').reset_index()
773 | t4.rename(columns={'distance':'user_max_distance'},inplace=True)
774 |
775 | t5 = t2.groupby('user_id').agg('mean').reset_index()
776 | t5.rename(columns={'distance':'user_mean_distance'},inplace=True)
777 |
778 | t6 = t2.groupby('user_id').agg('median').reset_index()
779 | t6.rename(columns={'distance':'user_median_distance'},inplace=True)
780 |
781 | t7 = user1[(user1.date!='null')&(user1.coupon_id!='null')][['user_id']]
782 | t7['buy_use_coupon'] = 1
783 | t7 = t7.groupby('user_id').agg('sum').reset_index()
784 |
785 | t8 = user1[user1.date!='null'][['user_id']]
786 | t8['buy_total'] = 1
787 | t8 = t8.groupby('user_id').agg('sum').reset_index()
788 |
789 | t9 = user1[user1.coupon_id!='null'][['user_id']]
790 | t9['coupon_received'] = 1
791 | t9 = t9.groupby('user_id').agg('sum').reset_index()
792 |
793 | t10 = user1[(user1.date_received!='null')&(user1.date!='null')][['user_id','date_received','date']]
794 | t10['user_date_datereceived_gap'] = t10.date + ':' + t10.date_received
795 | t10.user_date_datereceived_gap = t10.user_date_datereceived_gap.apply(get_user_date_datereceived_gap)
796 | t10 = t10[['user_id','user_date_datereceived_gap']]
797 |
798 | t11 = t10.groupby('user_id').agg('mean').reset_index()
799 | t11.rename(columns={'user_date_datereceived_gap':'avg_user_date_datereceived_gap'},inplace=True)
800 | t12 = t10.groupby('user_id').agg('min').reset_index()
801 | t12.rename(columns={'user_date_datereceived_gap':'min_user_date_datereceived_gap'},inplace=True)
802 | t13 = t10.groupby('user_id').agg('max').reset_index()
803 | t13.rename(columns={'user_date_datereceived_gap':'max_user_date_datereceived_gap'},inplace=True)
804 |
805 | user1_feature = pd.merge(t,t1,on='user_id',how='left')
806 | user1_feature = pd.merge(user1_feature,t3,on='user_id',how='left')
807 | user1_feature = pd.merge(user1_feature,t4,on='user_id',how='left')
808 | user1_feature = pd.merge(user1_feature,t5,on='user_id',how='left')
809 | user1_feature = pd.merge(user1_feature,t6,on='user_id',how='left')
810 | user1_feature = pd.merge(user1_feature,t7,on='user_id',how='left')
811 | user1_feature = pd.merge(user1_feature,t8,on='user_id',how='left')
812 | user1_feature = pd.merge(user1_feature,t9,on='user_id',how='left')
813 | user1_feature = pd.merge(user1_feature,t11,on='user_id',how='left')
814 | user1_feature = pd.merge(user1_feature,t12,on='user_id',how='left')
815 | user1_feature = pd.merge(user1_feature,t13,on='user_id',how='left')
816 | user1_feature.count_merchant = user1_feature.count_merchant.replace(np.nan,0)
817 | user1_feature.buy_use_coupon = user1_feature.buy_use_coupon.replace(np.nan,0)
818 | user1_feature['buy_use_coupon_rate'] = user1_feature.buy_use_coupon.astype('float') / user1_feature.buy_total.astype('float')
819 | user1_feature['user_coupon_transfer_rate'] = user1_feature.buy_use_coupon.astype('float') / user1_feature.coupon_received.astype('float')
820 | user1_feature.buy_total = user1_feature.buy_total.replace(np.nan,0)
821 | user1_feature.coupon_received = user1_feature.coupon_received.replace(np.nan,0)
822 | user1_feature.to_csv('data/user1_feature.csv',index=None)
823 |
824 |
825 |
826 | ################## user_merchant related feature #########################
827 |
828 | """
829 | 4.user_merchant:
830 | times_user_buy_merchant_before.
831 | """
832 | #for dataset3
833 | all_user_merchant = feature3[['user_id','merchant_id']]
834 | all_user_merchant.drop_duplicates(inplace=True)
835 |
836 | t = feature3[['user_id','merchant_id','date']]
837 | t = t[t.date!='null'][['user_id','merchant_id']]
838 | t['user_merchant_buy_total'] = 1
839 | t = t.groupby(['user_id','merchant_id']).agg('sum').reset_index()
840 | t.drop_duplicates(inplace=True)
841 |
842 | t1 = feature3[['user_id','merchant_id','coupon_id']]
843 | t1 = t1[t1.coupon_id!='null'][['user_id','merchant_id']]
844 | t1['user_merchant_received'] = 1
845 | t1 = t1.groupby(['user_id','merchant_id']).agg('sum').reset_index()
846 | t1.drop_duplicates(inplace=True)
847 |
848 | t2 = feature3[['user_id','merchant_id','date','date_received']]
849 | t2 = t2[(t2.date!='null')&(t2.date_received!='null')][['user_id','merchant_id']]
850 | t2['user_merchant_buy_use_coupon'] = 1
851 | t2 = t2.groupby(['user_id','merchant_id']).agg('sum').reset_index()
852 | t2.drop_duplicates(inplace=True)
853 |
854 | t3 = feature3[['user_id','merchant_id']]
855 | t3['user_merchant_any'] = 1
856 | t3 = t3.groupby(['user_id','merchant_id']).agg('sum').reset_index()
857 | t3.drop_duplicates(inplace=True)
858 |
859 | t4 = feature3[['user_id','merchant_id','date','coupon_id']]
860 | t4 = t4[(t4.date!='null')&(t4.coupon_id=='null')][['user_id','merchant_id']]
861 | t4['user_merchant_buy_common'] = 1
862 | t4 = t4.groupby(['user_id','merchant_id']).agg('sum').reset_index()
863 | t4.drop_duplicates(inplace=True)
864 |
865 | user_merchant3 = pd.merge(all_user_merchant,t,on=['user_id','merchant_id'],how='left')
866 | user_merchant3 = pd.merge(user_merchant3,t1,on=['user_id','merchant_id'],how='left')
867 | user_merchant3 = pd.merge(user_merchant3,t2,on=['user_id','merchant_id'],how='left')
868 | user_merchant3 = pd.merge(user_merchant3,t3,on=['user_id','merchant_id'],how='left')
869 | user_merchant3 = pd.merge(user_merchant3,t4,on=['user_id','merchant_id'],how='left')
870 | user_merchant3.user_merchant_buy_use_coupon = user_merchant3.user_merchant_buy_use_coupon.replace(np.nan,0)
871 | user_merchant3.user_merchant_buy_common = user_merchant3.user_merchant_buy_common.replace(np.nan,0)
872 | user_merchant3['user_merchant_coupon_transfer_rate'] = user_merchant3.user_merchant_buy_use_coupon.astype('float') / user_merchant3.user_merchant_received.astype('float')
873 | user_merchant3['user_merchant_coupon_buy_rate'] = user_merchant3.user_merchant_buy_use_coupon.astype('float') / user_merchant3.user_merchant_buy_total.astype('float')
874 | user_merchant3['user_merchant_rate'] = user_merchant3.user_merchant_buy_total.astype('float') / user_merchant3.user_merchant_any.astype('float')
875 | user_merchant3['user_merchant_common_buy_rate'] = user_merchant3.user_merchant_buy_common.astype('float') / user_merchant3.user_merchant_buy_total.astype('float')
876 | user_merchant3.to_csv('data/user_merchant3.csv',index=None)
877 |
878 | #for dataset2
879 | all_user_merchant = feature2[['user_id','merchant_id']]
880 | all_user_merchant.drop_duplicates(inplace=True)
881 |
882 | t = feature2[['user_id','merchant_id','date']]
883 | t = t[t.date!='null'][['user_id','merchant_id']]
884 | t['user_merchant_buy_total'] = 1
885 | t = t.groupby(['user_id','merchant_id']).agg('sum').reset_index()
886 | t.drop_duplicates(inplace=True)
887 |
888 | t1 = feature2[['user_id','merchant_id','coupon_id']]
889 | t1 = t1[t1.coupon_id!='null'][['user_id','merchant_id']]
890 | t1['user_merchant_received'] = 1
891 | t1 = t1.groupby(['user_id','merchant_id']).agg('sum').reset_index()
892 | t1.drop_duplicates(inplace=True)
893 |
894 | t2 = feature2[['user_id','merchant_id','date','date_received']]
895 | t2 = t2[(t2.date!='null')&(t2.date_received!='null')][['user_id','merchant_id']]
896 | t2['user_merchant_buy_use_coupon'] = 1
897 | t2 = t2.groupby(['user_id','merchant_id']).agg('sum').reset_index()
898 | t2.drop_duplicates(inplace=True)
899 |
900 | t3 = feature2[['user_id','merchant_id']]
901 | t3['user_merchant_any'] = 1
902 | t3 = t3.groupby(['user_id','merchant_id']).agg('sum').reset_index()
903 | t3.drop_duplicates(inplace=True)
904 |
905 | t4 = feature2[['user_id','merchant_id','date','coupon_id']]
906 | t4 = t4[(t4.date!='null')&(t4.coupon_id=='null')][['user_id','merchant_id']]
907 | t4['user_merchant_buy_common'] = 1
908 | t4 = t4.groupby(['user_id','merchant_id']).agg('sum').reset_index()
909 | t4.drop_duplicates(inplace=True)
910 |
911 | user_merchant2 = pd.merge(all_user_merchant,t,on=['user_id','merchant_id'],how='left')
912 | user_merchant2 = pd.merge(user_merchant2,t1,on=['user_id','merchant_id'],how='left')
913 | user_merchant2 = pd.merge(user_merchant2,t2,on=['user_id','merchant_id'],how='left')
914 | user_merchant2 = pd.merge(user_merchant2,t3,on=['user_id','merchant_id'],how='left')
915 | user_merchant2 = pd.merge(user_merchant2,t4,on=['user_id','merchant_id'],how='left')
916 | user_merchant2.user_merchant_buy_use_coupon = user_merchant2.user_merchant_buy_use_coupon.replace(np.nan,0)
917 | user_merchant2.user_merchant_buy_common = user_merchant2.user_merchant_buy_common.replace(np.nan,0)
918 | user_merchant2['user_merchant_coupon_transfer_rate'] = user_merchant2.user_merchant_buy_use_coupon.astype('float') / user_merchant2.user_merchant_received.astype('float')
919 | user_merchant2['user_merchant_coupon_buy_rate'] = user_merchant2.user_merchant_buy_use_coupon.astype('float') / user_merchant2.user_merchant_buy_total.astype('float')
920 | user_merchant2['user_merchant_rate'] = user_merchant2.user_merchant_buy_total.astype('float') / user_merchant2.user_merchant_any.astype('float')
921 | user_merchant2['user_merchant_common_buy_rate'] = user_merchant2.user_merchant_buy_common.astype('float') / user_merchant2.user_merchant_buy_total.astype('float')
922 | user_merchant2.to_csv('data/user_merchant2.csv',index=None)
923 |
924 | #for dataset1
925 | all_user_merchant = feature1[['user_id','merchant_id']]
926 | all_user_merchant.drop_duplicates(inplace=True)
927 |
928 | t = feature1[['user_id','merchant_id','date']]
929 | t = t[t.date!='null'][['user_id','merchant_id']]
930 | t['user_merchant_buy_total'] = 1
931 | t = t.groupby(['user_id','merchant_id']).agg('sum').reset_index()
932 | t.drop_duplicates(inplace=True)
933 |
934 | t1 = feature1[['user_id','merchant_id','coupon_id']]
935 | t1 = t1[t1.coupon_id!='null'][['user_id','merchant_id']].copy()
936 | t1['user_merchant_received'] = 1
937 | t1 = t1.groupby(['user_id','merchant_id']).agg('sum').reset_index()
938 | t1.drop_duplicates(inplace=True)
939 |
940 | t2 = feature1[['user_id','merchant_id','date','date_received']]
941 | t2 = t2[(t2.date!='null')&(t2.date_received!='null')][['user_id','merchant_id']].copy()
942 | t2['user_merchant_buy_use_coupon'] = 1
943 | t2 = t2.groupby(['user_id','merchant_id']).agg('sum').reset_index()
944 | t2.drop_duplicates(inplace=True)
945 |
946 | t3 = feature1[['user_id','merchant_id']].copy()
947 | t3['user_merchant_any'] = 1
948 | t3 = t3.groupby(['user_id','merchant_id']).agg('sum').reset_index()
949 | t3.drop_duplicates(inplace=True)
950 |
951 | t4 = feature1[['user_id','merchant_id','date','coupon_id']]
952 | t4 = t4[(t4.date!='null')&(t4.coupon_id=='null')][['user_id','merchant_id']].copy()
953 | t4['user_merchant_buy_common'] = 1
954 | t4 = t4.groupby(['user_id','merchant_id']).agg('sum').reset_index()
955 | t4.drop_duplicates(inplace=True)
956 |
957 | user_merchant1 = pd.merge(all_user_merchant,t,on=['user_id','merchant_id'],how='left')
958 | user_merchant1 = pd.merge(user_merchant1,t1,on=['user_id','merchant_id'],how='left')
959 | user_merchant1 = pd.merge(user_merchant1,t2,on=['user_id','merchant_id'],how='left')
960 | user_merchant1 = pd.merge(user_merchant1,t3,on=['user_id','merchant_id'],how='left')
961 | user_merchant1 = pd.merge(user_merchant1,t4,on=['user_id','merchant_id'],how='left')
962 | user_merchant1.user_merchant_buy_use_coupon = user_merchant1.user_merchant_buy_use_coupon.replace(np.nan,0)
963 | user_merchant1.user_merchant_buy_common = user_merchant1.user_merchant_buy_common.replace(np.nan,0)
964 | user_merchant1['user_merchant_coupon_transfer_rate'] = user_merchant1.user_merchant_buy_use_coupon.astype('float') / user_merchant1.user_merchant_received.astype('float')
965 | user_merchant1['user_merchant_coupon_buy_rate'] = user_merchant1.user_merchant_buy_use_coupon.astype('float') / user_merchant1.user_merchant_buy_total.astype('float')
966 | user_merchant1['user_merchant_rate'] = user_merchant1.user_merchant_buy_total.astype('float') / user_merchant1.user_merchant_any.astype('float')
967 | user_merchant1['user_merchant_common_buy_rate'] = user_merchant1.user_merchant_buy_common.astype('float') / user_merchant1.user_merchant_buy_total.astype('float')
968 | user_merchant1.to_csv('data/user_merchant1.csv',index=None)
969 |
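# The three blocks above repeat the same aggregation for feature3, feature2
# and feature1. A minimal refactoring sketch: the helper below (the name
# get_user_merchant_features is hypothetical, not part of the original
# pipeline) would build the identical feature frame for any feature window.
def get_user_merchant_features(feature):
    base = feature[['user_id','merchant_id']].drop_duplicates()
    def count(mask,name):
        t = feature[mask][['user_id','merchant_id']].copy()
        t[name] = 1
        return t.groupby(['user_id','merchant_id']).agg('sum').reset_index()
    parts = [count(feature.date!='null','user_merchant_buy_total'),
             count(feature.coupon_id!='null','user_merchant_received'),
             count((feature.date!='null')&(feature.date_received!='null'),'user_merchant_buy_use_coupon'),
             count(feature.date==feature.date,'user_merchant_any'),  # all-True mask: date holds 'null' strings, not NaN
             count((feature.date!='null')&(feature.coupon_id=='null'),'user_merchant_buy_common')]
    for p in parts:
        base = pd.merge(base,p,on=['user_id','merchant_id'],how='left')
    base.user_merchant_buy_use_coupon = base.user_merchant_buy_use_coupon.replace(np.nan,0)
    base.user_merchant_buy_common = base.user_merchant_buy_common.replace(np.nan,0)
    base['user_merchant_coupon_transfer_rate'] = base.user_merchant_buy_use_coupon.astype('float') / base.user_merchant_received.astype('float')
    base['user_merchant_coupon_buy_rate'] = base.user_merchant_buy_use_coupon.astype('float') / base.user_merchant_buy_total.astype('float')
    base['user_merchant_rate'] = base.user_merchant_buy_total.astype('float') / base.user_merchant_any.astype('float')
    base['user_merchant_common_buy_rate'] = base.user_merchant_buy_common.astype('float') / base.user_merchant_buy_total.astype('float')
    return base
# e.g. get_user_merchant_features(feature1).to_csv('data/user_merchant1.csv',index=None)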
970 |
971 |
972 |
973 |
974 |
975 |
976 | ################## generate training and testing set ################
977 | def get_label(s):
978 |     s = s.split(':')  # s is 'date:date_received'
979 |     if s[0]=='null':
980 |         return 0  # coupon received but never used
981 |     elif (date(int(s[0][0:4]),int(s[0][4:6]),int(s[0][6:8]))-date(int(s[1][0:4]),int(s[1][4:6]),int(s[1][6:8]))).days<=15:
982 |         return 1  # used within 15 days of receipt
983 |     else:
984 |         return -1  # used, but after the 15-day window
985 |
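# Quick sanity check of the labeling rule (illustrative dates only):
#   get_label('null:20160516')     -> 0   received, never used
#   get_label('20160525:20160516') -> 1   used 9 days after receipt (<=15)
#   get_label('20160616:20160516') -> -1  used 31 days after receipt (>15)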
986 |
987 | coupon3 = pd.read_csv('data/coupon3_feature.csv')
988 | merchant3 = pd.read_csv('data/merchant3_feature.csv')
989 | user3 = pd.read_csv('data/user3_feature.csv')
990 | user_merchant3 = pd.read_csv('data/user_merchant3.csv')
991 | other_feature3 = pd.read_csv('data/other_feature3.csv')
992 | dataset3 = pd.merge(coupon3,merchant3,on='merchant_id',how='left')
993 | dataset3 = pd.merge(dataset3,user3,on='user_id',how='left')
994 | dataset3 = pd.merge(dataset3,user_merchant3,on=['user_id','merchant_id'],how='left')
995 | dataset3 = pd.merge(dataset3,other_feature3,on=['user_id','coupon_id','date_received'],how='left')
996 | dataset3.drop_duplicates(inplace=True)
997 | print(dataset3.shape)
998 |
999 | dataset3.user_merchant_buy_total = dataset3.user_merchant_buy_total.replace(np.nan,0)
1000 | dataset3.user_merchant_any = dataset3.user_merchant_any.replace(np.nan,0)
1001 | dataset3.user_merchant_received = dataset3.user_merchant_received.replace(np.nan,0)
1002 | dataset3['is_weekend'] = dataset3.day_of_week.apply(lambda x:1 if x in (6,7) else 0)
1003 | weekday_dummies = pd.get_dummies(dataset3.day_of_week)
1004 | weekday_dummies.columns = ['weekday'+str(i+1) for i in range(weekday_dummies.shape[1])]
1005 | dataset3 = pd.concat([dataset3,weekday_dummies],axis=1)
1006 | dataset3.drop(['merchant_id','day_of_week','coupon_count'],axis=1,inplace=True)
1007 | dataset3 = dataset3.replace('null',np.nan)
1008 | dataset3.to_csv('data/dataset3.csv',index=None)
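# dataset3 is the prediction window: no label can be built for it, and
# coupon_id/date_received are kept because the submission file needs them;
# the two training sets below get a label via get_label instead.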
1009 |
1010 |
1011 | coupon2 = pd.read_csv('data/coupon2_feature.csv')
1012 | merchant2 = pd.read_csv('data/merchant2_feature.csv')
1013 | user2 = pd.read_csv('data/user2_feature.csv')
1014 | user_merchant2 = pd.read_csv('data/user_merchant2.csv')
1015 | other_feature2 = pd.read_csv('data/other_feature2.csv')
1016 | dataset2 = pd.merge(coupon2,merchant2,on='merchant_id',how='left')
1017 | dataset2 = pd.merge(dataset2,user2,on='user_id',how='left')
1018 | dataset2 = pd.merge(dataset2,user_merchant2,on=['user_id','merchant_id'],how='left')
1019 | dataset2 = pd.merge(dataset2,other_feature2,on=['user_id','coupon_id','date_received'],how='left')
1020 | dataset2.drop_duplicates(inplace=True)
1021 | print(dataset2.shape)
1022 |
1023 | dataset2.user_merchant_buy_total = dataset2.user_merchant_buy_total.replace(np.nan,0)
1024 | dataset2.user_merchant_any = dataset2.user_merchant_any.replace(np.nan,0)
1025 | dataset2.user_merchant_received = dataset2.user_merchant_received.replace(np.nan,0)
1026 | dataset2['is_weekend'] = dataset2.day_of_week.apply(lambda x:1 if x in (6,7) else 0)
1027 | weekday_dummies = pd.get_dummies(dataset2.day_of_week)
1028 | weekday_dummies.columns = ['weekday'+str(i+1) for i in range(weekday_dummies.shape[1])]
1029 | dataset2 = pd.concat([dataset2,weekday_dummies],axis=1)
1030 | dataset2['label'] = dataset2.date.astype('str') + ':' + dataset2.date_received.astype('str')
1031 | dataset2.label = dataset2.label.apply(get_label)
1032 | dataset2.drop(['merchant_id','day_of_week','date','date_received','coupon_id','coupon_count'],axis=1,inplace=True)
1033 | dataset2 = dataset2.replace('null',np.nan)
1034 | dataset2.to_csv('data/dataset2.csv',index=None)
1035 |
1036 |
1037 | coupon1 = pd.read_csv('data/coupon1_feature.csv')
1038 | merchant1 = pd.read_csv('data/merchant1_feature.csv')
1039 | user1 = pd.read_csv('data/user1_feature.csv')
1040 | user_merchant1 = pd.read_csv('data/user_merchant1.csv')
1041 | other_feature1 = pd.read_csv('data/other_feature1.csv')
1042 | dataset1 = pd.merge(coupon1,merchant1,on='merchant_id',how='left')
1043 | dataset1 = pd.merge(dataset1,user1,on='user_id',how='left')
1044 | dataset1 = pd.merge(dataset1,user_merchant1,on=['user_id','merchant_id'],how='left')
1045 | dataset1 = pd.merge(dataset1,other_feature1,on=['user_id','coupon_id','date_received'],how='left')
1046 | dataset1.drop_duplicates(inplace=True)
1047 | print(dataset1.shape)
1048 |
1049 | dataset1.user_merchant_buy_total = dataset1.user_merchant_buy_total.replace(np.nan,0)
1050 | dataset1.user_merchant_any = dataset1.user_merchant_any.replace(np.nan,0)
1051 | dataset1.user_merchant_received = dataset1.user_merchant_received.replace(np.nan,0)
1052 | dataset1['is_weekend'] = dataset1.day_of_week.apply(lambda x:1 if x in (6,7) else 0)
1053 | weekday_dummies = pd.get_dummies(dataset1.day_of_week)
1054 | weekday_dummies.columns = ['weekday'+str(i+1) for i in range(weekday_dummies.shape[1])]
1055 | dataset1 = pd.concat([dataset1,weekday_dummies],axis=1)
1056 | dataset1['label'] = dataset1.date.astype('str') + ':' + dataset1.date_received.astype('str')
1057 | dataset1.label = dataset1.label.apply(get_label)
1058 | dataset1.drop(['merchant_id','day_of_week','date','date_received','coupon_id','coupon_count'],axis=1,inplace=True)
1059 | dataset1 = dataset1.replace('null',np.nan)
1060 | dataset1.to_csv('data/dataset1.csv',index=None)
1061 |
--------------------------------------------------------------------------------
/xgb.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import xgboost as xgb
3 | from sklearn.preprocessing import MinMaxScaler
4 |
5 | dataset1 = pd.read_csv('data/dataset1.csv')
6 | dataset1['label'] = dataset1.label.replace(-1,0)  # -1 (used after 15 days) counts as negative
7 | dataset2 = pd.read_csv('data/dataset2.csv')
8 | dataset2['label'] = dataset2.label.replace(-1,0)
9 | dataset3 = pd.read_csv('data/dataset3.csv')
10 |
11 | dataset1.drop_duplicates(inplace=True)
12 | dataset2.drop_duplicates(inplace=True)
13 | dataset3.drop_duplicates(inplace=True)
14 |
15 | dataset12 = pd.concat([dataset1,dataset2],axis=0)
16 |
17 | dataset1_y = dataset1.label
18 | dataset1_x = dataset1.drop(['user_id','label','day_gap_before','day_gap_after'],axis=1) # 'day_gap_before','day_gap_after' cause overfitting, 0.77
19 | dataset2_y = dataset2.label
20 | dataset2_x = dataset2.drop(['user_id','label','day_gap_before','day_gap_after'],axis=1)
21 | dataset12_y = dataset12.label
22 | dataset12_x = dataset12.drop(['user_id','label','day_gap_before','day_gap_after'],axis=1)
23 | dataset3_preds = dataset3[['user_id','coupon_id','date_received']].copy()  # .copy() so the label assignment below is safe
24 | dataset3_x = dataset3.drop(['user_id','coupon_id','date_received','day_gap_before','day_gap_after'],axis=1)
25 |
26 | print(dataset1_x.shape,dataset2_x.shape,dataset3_x.shape)
27 |
28 | dataset1 = xgb.DMatrix(dataset1_x,label=dataset1_y)
29 | dataset2 = xgb.DMatrix(dataset2_x,label=dataset2_y)
30 | dataset12 = xgb.DMatrix(dataset12_x,label=dataset12_y)
31 | dataset3 = xgb.DMatrix(dataset3_x)
32 |
33 | params={'booster':'gbtree',
34 |         'objective': 'rank:pairwise',
35 |         'eval_metric':'auc',
36 |         'gamma':0.1,
37 |         'min_child_weight':1.1,
38 |         'max_depth':5,
39 |         'lambda':10,
40 |         'subsample':0.7,
41 |         'colsample_bytree':0.7,
42 |         'colsample_bylevel':0.7,
43 |         'eta': 0.01,
44 |         'tree_method':'exact',
45 |         'seed':0,
46 |         'nthread':12
47 |         }
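# 'rank:pairwise' optimizes the relative ordering of positives over
# negatives, which suits an AUC-style metric better than calibrated
# probabilities; 'objective':'binary:logistic' is the natural baseline
# to compare against. A sketch (an option, not part of the original run)
# for picking num_boost_round with xgboost's built-in CV instead of a
# fixed 3500:
#   cv = xgb.cv(params,dataset12,num_boost_round=5000,nfold=5,
#               metrics='auc',early_stopping_rounds=300,seed=0)
#   print(cv.shape[0])  # rounds kept by early stopping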
48 |
49 | #train on dataset1, evaluate on dataset2
50 | #watchlist = [(dataset1,'train'),(dataset2,'val')]
51 | #model = xgb.train(params,dataset1,num_boost_round=3000,evals=watchlist,early_stopping_rounds=300)
52 |
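# The competition metric is the mean of per-coupon AUCs, so a faithful
# offline check groups by coupon_id. Hypothetical sketch (assumes a frame
# val_df holding dataset2's coupon_id, true label and predicted score,
# which this script no longer keeps after the column drops in extract_feature.py):
#   from sklearn.metrics import roc_auc_score
#   aucs = [roc_auc_score(g.label,g.pred) for _,g in val_df.groupby('coupon_id')
#           if g.label.nunique()==2]  # AUC needs both classes present
#   print(sum(aucs)/len(aucs))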
53 | watchlist = [(dataset12,'train')]
54 | model = xgb.train(params,dataset12,num_boost_round=3500,evals=watchlist)
55 |
56 | #predict test set
57 | dataset3_preds['label'] = model.predict(dataset3)
58 | dataset3_preds.label = MinMaxScaler().fit_transform(dataset3_preds.label.values.reshape(-1,1)).ravel()  # scaling is monotonic, so per-coupon AUC is unchanged
59 | dataset3_preds.sort_values(by=['coupon_id','label'],inplace=True)
60 | dataset3_preds.to_csv("xgb_preds.csv",index=None,header=None)
61 | print(dataset3_preds.describe())
62 |
63 | #save feature score
64 | feature_score = model.get_fscore()
65 | feature_score = sorted(feature_score.items(), key=lambda x:x[1],reverse=True)
66 | fs = []
67 | for (key,value) in feature_score:
68 |     fs.append("{0},{1}\n".format(key,value))
69 |
70 | with open('xgb_feature_score.csv','w') as f:
71 |     f.write("feature,score\n")
72 |     f.writelines(fs)
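# get_fscore() reports split counts ('weight'); gain-based importance is
# often more informative: model.get_score(importance_type='gain')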
73 |
74 |
--------------------------------------------------------------------------------