├── README.md ├── data └── put dataset here ├── extract_feature.py └── xgb.py /README.md: -------------------------------------------------------------------------------- 1 | 赛题地址:[https://tianchi.shuju.aliyun.com/competition/introduction.htm?spm=5176.100065.200879.2.6r6s4g&raceId=231587](https://tianchi.shuju.aliyun.com/competition/introduction.htm?spm=5176.100065.200879.2.6r6s4g&raceId=231587 "O2O优惠券使用预测赛题地址") 2 | 3 | [第一赛季数据](https://pan.baidu.com/s/1c1NkUn2) 4 | 5 | ------------------- 6 | **目录** 7 | 8 | 9 | [TOC] 10 | 11 | ------------------- 12 | 正式开始做是从十月底开始的,之前参加了新手赛,而这一次正式赛可以说是真正认真做的一次,中间和队友一起学习了很多,也有小小的收获,不管这次成绩如何,以后还有机会。 13 | 14 | ![我的成绩](http://img.blog.csdn.net/20170103160645251?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvc2hpbmUxOTkzMDgyMA==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast) 15 | 16 | ------------------- 17 | # **数据与评价方式** 18 | 19 | 赛题提供用户在2016年1月1日至2016年6月30日之间真实线上线下消费行为,预测用户在2016年7月领取优惠券后15天以内的使用情况。 使用优惠券核销预测的平均AUC(ROC曲线下面积)作为评价标准。 即对每个优惠券coupon_id单独计算核销预测的AUC值,再对所有优惠券的AUC值求平均作为最终的评价标准。 20 | 21 | ------------------- 22 | # **解决方案** 23 | 提供数据的区间是2016-01-01~2016-06-30,预测七月份用户领券使用情况,即用或者不用,转化为**二分类问题**,然后通过分类算法预测结果。首先就是特征工程,其中涉及对数据集合的划分,包括提取特征的区间和训练数据区间。接着就是从特征区间中提取特征,包括用户特征、商户特征、优惠券特征、用户商户组合特征、用户优惠券组合特征。后期在测试区间提取了当天的前后7/3/1天的领券信息(这里面后七天的特征其实是不能应用于工业应用的,因为实际预测中你无法知道后7/3/1天的领券信息),提升较大。最后使用GBDT、RandomForest、LR进行基于rank的分类模型融合 24 | 25 | ------------------- 26 | # **数据划分** 27 | 最初没有使用数据划分,导致特征中产生数据泄露,以至于在训练数据上效果很好,线下测试也还不错,在线上表现确差强人意,后来划分了之后有明显提升。 28 | 29 | | 集合 | 预测区间 |特征区间| 30 | | :------- | :--------| :-- | 31 | | 测试集 |领券:20160701~20160731|领券&消费:20160101~20160630| 32 | | 训练集 | 领券:20160515~20160615
消费:20160515~20160630 |领券:20160101~20160501
消费:20160101~20160515| 33 | 34 | 并没有划分多个训练集,这一点是要改进之处。 35 | 36 | ------------------- 37 | # **特征工程** 38 | 主要有五大特征类:用户特征、商户特征、优惠券特征、用户商户组合特征、用户优惠券组合特征,赛题包括online和offline的数据,由于里面只有部分用户重合,商户优惠券等并未有重合,个人臆测线上应该是淘宝天猫的购买消费数据,有一定关联,但关系微弱,因此只向其中提取了用户特征。而offline数据集就提取了所有五个特征类。一下是各部分特征: 39 | 40 | - 用户特征:u 41 | - 线下领取优惠券但没有使用的次数 u1 42 | - 线下普通消费次数 u2 43 | - 线下使用优惠券消费的次数 u3 44 | - 线下平均正常消费间隔 u4 45 | - 线下平均优惠券消费间隔 u5 46 | - u3/u1 使用优惠券次数与没使用优惠券次数比值 u6 47 | - u3/(u2+u3) 表示用户使用优惠券消费占比 u7 48 | - u4/15 代表15除以用户普通消费间隔,可以看成用户15天内平均会普通消费几次,值越小代表用户在15天内普通消费概率越大 u8 49 | - u5/15 代表15除以用户优惠券消费间隔,可以看成用户15天内平均会普通消费几次,值越大代表用户 在15天内普通消费概率越大 u9 50 | - 领取优惠券到使用优惠券的平均间隔时间 u10 51 | - u10/15 表示在15天内使用掉优惠券的值大小,值越小越有可能,值为0表示可能性最大 u11 52 | - 领取优惠券到使用优惠券间隔小于15天的次数 u12 53 | - u12/u3 表示用户15天使用掉优惠券的次数除以使用优惠券的次数,表示在15天使用掉优惠券的可能,值越大越好。 u13 54 | - u12/u1 F014 表示用户15天使用掉优惠券的次数除以领取优惠券未消费的次数,表示在15天使用掉优惠券的可能,值越大越好。 u14 55 | - u1+u3 领取优惠券的总次数 u15 56 | - u12/u15 F016 表示用户15天使用掉优惠券的次数除以领取优惠券的总次数,表示在15天使用掉优惠券的可能,值越大越好。 u16 57 | - u1+u2 一共消费多少次 u17 58 | - 最近一次消费到当前领券的时间间隔 u18 59 | - 最近一次优惠券消费到当前领券的时间间隔 u19 60 | - 用户当天领取的优惠券数目 u20 61 | - 用户前第i天领取的优惠券数目 u20si 62 | - 用户后第i天领取的优惠券数目 u20ai 63 | - 用户前7天领取的优惠券数目 u21 64 | - 用户前3天领取的优惠券数目 u22 65 | - u22/u21 u23 66 | - u20/u22 u24 67 | - 用户后7天领取的优惠券数目 u25 68 | - 用户后3天领取的优惠券数目 u26 69 | - u26/u25 u27 70 | - u20/u26 u28 71 | - 用户训练、预测时间领取的优惠券数目 u29 72 | - 用户当天领取的不同优惠券数目 u30 73 | - 用户前第i天领取的不同优惠券数目 u30si 74 | - 用户后第i天领取的不同优惠券数目 u30ai 75 | - 用户训练、预测时间领取的不同优惠券数目 u31 76 | - 按照7/4/2分解训练、预测时间,提取此段窗口时间的特征 77 | - 用户7/4/2天领取的优惠券数目 u32_i 78 | - 用户7/4/2天所领取的优惠券优惠率r1/r2/r3/r4排名 u_ri_ranki 79 | - 用户7/4/2天所领取的优惠券优惠率r1/r2/r3/r4排名 u_ri_dense _ranki 80 | - u32_4/u32_7 u33 81 | - u32_2/u32_4 u34 82 | - u32_2/u32_7 u35 83 | - u20/u32_2 u36 84 | 85 | - 线上领取优惠券未使用的次数 action=2 uo1 86 | - 线上特价消费次数 action=1 and cid=0 and drate="fixed" uo2 87 | - 线上使用优惠券消费的次数 uo3 88 | - 线上普通消费次数 action=1 and cid=0 and drate="null" uo4 89 | - 线上领取优惠券的次数 uo1+uo3 uo5 90 | - uo3/uo5 线上使用优惠券次数除以线上领取优惠券次数,正比 uo6 91 | - uo3/uo4 线上使用优惠券次数除以线上普通消费次数,正比 uo7 92 | - uo2/uo4线上特价消费次数除以线上普通消费次数 uo8 93 | 94 | - 加入训练预测时间前一个月的窗口特征 95 | 96 | 97 | - 线下领取优惠券但没有使用的次数 uw1 98 | - 线下普通消费次数 uw2 99 | - 线下使用优惠券消费的次数 uw3 100 | - 线下平均正常消费间隔 uw4 101 | - 线下平均优惠券消费间隔 uw5 102 | - uw3/uw1 使用优惠券次数与没使用优惠券次数比值 uw6 103 | - uw3/(uw2+uw3) 表示用户使用优惠券消费占比 uw7 104 | - uw4/15 代表15除以用户普通消费间隔,可以看成用户15天内平均会普通消费几次,值越小代表用户在15天内普通消费概率越大 uw8 105 | - uw5/15 代表15除以用户优惠券消费间隔,可以看成用户15天内平均会普通消费几次,值越大代表用户在15天内普通消费概率越大 uw9 106 | - 领取优惠券到使用优惠券的平均间隔时间 uw10 107 | - uw10/15 表示在15天内使用掉优惠券的值大小,值越小越有可能,值为0表示可能性最大 uw11 108 | - 领取优惠券到使用优惠券间隔小于15天的次数 uw12 109 | - uw12/uw3 表示用户15天使用掉优惠券的次数除以使用优惠券的次数,表示在15天使用掉优惠券的可能,值越大越好。 uw13 110 | - uw12/uw1 F014 表示用户15天使用掉优惠券的次数除以领取优惠券未消费的次数,表示在15天使用掉优惠券的可能,值越大越好。 uw14 111 | - uw1+uw3 领取优惠券的总次数 uw15 112 | - uw12/uw15 F016 表示用户15天使用掉优惠券的次数除以领取优惠券的总次数,表示在15天使用掉优惠券的可能,值越大越好。 uw16 113 | - F01+F02 一共消费多少次 uw17 114 | 115 | 116 | ---------- 117 | 118 | 119 | - 商户特征:m 120 | - 商户一共的消费笔数:m0 121 | - 商户优惠券消费笔数:m1 122 | - 商户正常的消费笔数:m2 123 | - 没有被使用的优惠券: m3 124 | - 商户发放优惠券数目:m3+m1 m4 125 | - 商户优惠券使用率:m1/m4 m5 126 | - 商户在训练、预测时间发行的优惠券数目 m6 127 | - 商户当天发行的优惠券数目 m7 128 | - 商户在训练、预测时间有多少人在此店领券 m8 129 | - 商户在当天有多少人在此店领券 m9 130 | - 按照7/4/2分解训练、预测时间,提取此段窗口时间的特征 131 | - 7/4/2天此商店优惠券发放数目 m10_i 132 | - m9 / m10_7 m11 133 | - m9 / m10_4 m12 134 | - m9 / m10_2 m13 135 | - m10_2 / m10_4 m14 136 | 137 | 138 | ---------- 139 | 140 | 141 | - 优惠券特征:c 142 | - 折扣类的优惠券折扣率 r1 143 | - 满减类优惠券满减金额 r2 144 | - 满减类优惠券减的金额 r3 145 | - 满减类优惠券优惠率 (r2-r3)/r2 r4 146 | - c1+c2 此优惠券一共发行多少张 c0 147 | - 此优惠券一共被使用多少张 c1 148 | - 没有使用的数目 c2 149 | - c1/c0 优惠券使用率 c3 150 | - 优惠力度 c5 151 | - 优惠力度在当天所领取优惠券里面排名 c5_rank 152 | - 优惠力度在当天所领取优惠券里面排名 c5_denserank 153 | - 优惠力度在当天同一店家所领取优惠券里面排名 c5_rankm 154 | - ~~优惠力度在当天所领取优惠券里面百分比排名 c5_rankp~~ 155 | - ~~优惠力度在当天同一店家所领取优惠券里面百分比排名 c5_rankmp~~ 156 | - 此优惠券在训练、预测时间发行了多少张 c6 157 | - 此优惠券在当天发行了多少张 c7 158 | - ~~领券当天周几 c8~~ 159 | - ~~领券当天是否周末 c9~~ c8,c9去掉效果更好了。。。。 160 | - 此优惠券在当天被多少人领过 c10 161 | - 此优惠券在训练、预测时间被多少个人领过 c11 162 | - 此优惠券最后一次领券时间到此领券时间的间隔 c12 163 | - 此优惠券最后一次消费时间到此领券时间的间隔 c13 164 | - 按照7/4/2分解训练、预测时间,提取此段窗口时间的特征 165 | - 7/4/2天此优惠券发放数目 c14_i 166 | - c10 / c14_7 AS c15 167 | - c10 / c14_4 AS c16 168 | - c14_2 / c14_4 AS c17 169 | - c10 / c14_2 AS 170 | 171 | 172 | ---------- 173 | 174 | 175 | - 用户和商户组合特征:um 176 | - 用户在商店总共消费过几次 um0 177 | - 用户在商店使用优惠券几次 um1 178 | - 用户在商店领取优惠券未消费次数 um2 179 | - 用户在商店普通消费次数 um3 180 | - um1/(um1+um2) 用户在此商户优惠券使用率 um4 181 | - um0/(u2+u3) 值大表示用户比较常去的商户 um5 182 | - um1/u3 值大表示用户比较喜欢在那个商户使用优惠券 um6 183 | - 用户在训练、预测时间在此商店领取的优惠券数目 um7 184 | - 用户当天在此商店领取的优惠券数目 um8 185 | - 按照7/4/2分解训练、预测时间,提取此段窗口时间的特征 186 | - 7/4/2天此用户在此商店领取的优惠券发放数目 um9_i 187 | - um8 / um9_7 um10 188 | - um8 / um9_4 um11 189 | - um8 / um9_2 um12 190 | - um9_2 / um9_4 um13 191 | 192 | 193 | ---------- 194 | 195 | 196 | - 用户和优惠券组合特征:uc 197 | - 用户领取的优惠券数目 uc0 198 | - 用户领取未消费的优惠券数目 uc1 199 | - 用户消费了此优惠券的数目 uc2 200 | - uc02/uc0 uc3 201 | - 用户在此期间领取了多少张此优惠券 uc4 partiton by uid, cid 202 | - 用户在当天领取了多少张此优惠券 uc5 203 | - 领取优惠券时间-最后一次使用优惠券时间 uc6 204 | - uc6/ u5 uc7 正比 205 | 206 | - 用户前第i天领取的此优惠券数目 uc5si 207 | - 用户后第i天领取的此优惠券数目 uc5ai 208 | - 用户前7天领取的此优惠券数目 uc8 209 | - 用户前3天领取的此优惠券数目 uc9 210 | - uc9/uc8 uc10(若u21为0,则为1) 211 | - uc4/uc9 uc11 212 | - 用户后7天领取的此优惠券数目 uc12 213 | - 用户后3天领取的此优惠券数目 uc13 214 | - uc13/uc12 uc14 215 | - uc4/uc13 uc15 216 | - 按照7/4/2分解训练、预测时间,提取此段窗口时间的特征 217 | - 7/4/2天此用户在此商店领取的优惠券发放数目 uc16_i 218 | - 用户前后2/4/7领取的优惠券优惠率排名 uc17_i 219 | 220 | ------------------- 221 | # **算法及模型融合** 222 | 最初使用GBDT和RF两种模型,GBDT效果优于RF,后期使用了多个GBDT,分别使用不同的参数、不同的正负样本比例以rank的方式进行多模型的融合,效果有微小提升,但是由于计算量的限制没有进一步展开。 223 | ## **模型融合** 224 | 225 | 由于评估指标是计算每个coupon_id核销预测的AUC值,然后所有优惠券的AUC值平均作为最终的评估指标,而rank融合方式对AUC之类的评估指标特别有效,所以采用此方法,公式为: 226 |
$\sum\limits_{i=1}^{n}\frac{Weight_i}{Rank_i} $
227 | 228 | 其中$n$表示模型的个数, $Weight_i$表示该模型权重,所有权重相同表示平均融合。$Rank_i$表示样本在第i个模型中的升序排名。它可以较快的利用排名融合多个模型之间的差异,而不需要加权融合概率。 229 | ## **应用** 230 | 基于参数,样本(采样率),特征获得多个模型,得到每个模型的概率值输出,然后以coupon_id分组,把概率转换为降序排名,这样就获得了每个模型的$Rank_i$,然后这里我们使用的是平均融合,$Weight_i=1$,这样就获得了最终的一个值作为输出。 231 | 232 | ---------- 233 | # **线下评估** 234 | 虽然这次比赛每天有四次评测机会,但是构建线下评估在早期成绩比较差的时候用处很大,早期添加特征之后线下评估基本和线上的趋势保持一致(例如在添加了Label区间的领券特征之后,线下提升十多个百分点,线上也是一致),对于新特征衡量还是有参照性的。后期差距在0.1%级别的时候,就没有参照性了。 235 | 236 | 线下评估在训练集中采样1/3||1/4||1/5做线下评估集合,剩下的做为训练集训练模型,并将评估集合中全0或者全1的优惠券ID去掉,然后使用训练的模型对评估集合预测,将预测结果和实际标签作异或取反(相同为1,不同为0),然后算出每个优惠券ID的AUC,最后将每个ID的优惠券AUC取均值就得到最终的AUC。 237 | 238 | ------------------- 239 | # **回顾** 240 | 这一次比赛学习了很多,包括分布式平台ODPS和机器学习平台实现数据清洗,特征提取,特征选择,分类建模、调参及模型融合等,学习摸索了一套方法,使自己建立了信心,明白还有很多需要学习的地方,之前一直对于算法都是当做一个黑匣子,只会熟悉输入输出直接调用,要深入了解算法,才能突破目前的瓶颈有所提高。 241 | 同时我觉得大家一起探讨交流也很重要,一个人做着做着就容易走偏,纯属个人看法。 242 | 243 | CSDN博客链接:[http://blog.csdn.net/shine19930820/article/details/53995369](http://blog.csdn.net/shine19930820/article/details/53995369) 244 | 245 | 自己的代码太乱,改自第一名[GitHub地址](https://github.com/wepe/O2O-Coupon-Usage-Forecast) 246 | -------------------------------------------------------------------------------- /data/put dataset here: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/InsaneLife/O2O-Predict-Coupon-Usage/adc71ce5198873f9961f2ea01a17fb19633131c6/data/put dataset here -------------------------------------------------------------------------------- /extract_feature.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from datetime import date 4 | 5 | 6 | """ 7 | dataset split: 8 | (date_received) 9 | dateset3: 20160701~20160731 (113640),features3 from 20160315~20160630 (off_test) 10 | dateset2: 20160515~20160615 (258446),features2 from 20160201~20160514 11 | dateset1: 20160414~20160514 (138303),features1 from 20160101~20160413 12 | 13 | 14 | 15 | 1.merchant related: 16 | sales_use_coupon. total_coupon 17 | transfer_rate = sales_use_coupon/total_coupon. 18 | merchant_avg_distance,merchant_min_distance,merchant_max_distance of those use coupon 19 | total_sales. coupon_rate = sales_use_coupon/total_sales. 20 | 21 | 2.coupon related: 22 | discount_rate. discount_man. discount_jian. is_man_jian 23 | day_of_week,day_of_month. (date_received) 24 | 25 | 3.user related: 26 | distance. 27 | user_avg_distance, user_min_distance,user_max_distance. 28 | buy_use_coupon. buy_total. coupon_received. 29 | buy_use_coupon/coupon_received. 30 | avg_diff_date_datereceived. min_diff_date_datereceived. max_diff_date_datereceived. 31 | count_merchant. 32 | 33 | 4.user_merchant: 34 | times_user_buy_merchant_before. 35 | 36 | 37 | 5. other feature: 38 | this_month_user_receive_all_coupon_count 39 | this_month_user_receive_same_coupon_count 40 | this_month_user_receive_same_coupon_lastone 41 | this_month_user_receive_same_coupon_firstone 42 | this_day_user_receive_all_coupon_count 43 | this_day_user_receive_same_coupon_count 44 | day_gap_before, day_gap_after (receive the same coupon) 45 | """ 46 | 47 | 48 | #1754884 record,1053282 with coupon_id,9738 coupon. date_received:20160101~20160615,date:20160101~20160630, 539438 users, 8415 merchants 49 | off_train = pd.read_csv('data/ccf_offline_stage1_train.csv',header=None) 50 | off_train.columns = ['user_id','merchant_id','coupon_id','discount_rate','distance','date_received','date'] 51 | #2050 coupon_id. date_received:20160701~20160731, 76309 users(76307 in trainset, 35965 in online_trainset), 1559 merchants(1558 in trainset) 52 | off_test = pd.read_csv('data/ccf_offline_stage1_test_revised.csv',header=None) 53 | off_test.columns = ['user_id','merchant_id','coupon_id','discount_rate','distance','date_received'] 54 | #11429826 record(872357 with coupon_id),762858 user(267448 in off_train) 55 | on_train = pd.read_csv('data/ccf_online_stage1_train.csv',header=None) 56 | on_train.columns = ['user_id','merchant_id','action','coupon_id','discount_rate','date_received','date'] 57 | 58 | 59 | dataset3 = off_test 60 | feature3 = off_train[((off_train.date>='20160315')&(off_train.date<='20160630'))|((off_train.date=='null')&(off_train.date_received>='20160315')&(off_train.date_received<='20160630'))] 61 | dataset2 = off_train[(off_train.date_received>='20160515')&(off_train.date_received<='20160615')] 62 | feature2 = off_train[(off_train.date>='20160201')&(off_train.date<='20160514')|((off_train.date=='null')&(off_train.date_received>='20160201')&(off_train.date_received<='20160514'))] 63 | dataset1 = off_train[(off_train.date_received>='20160414')&(off_train.date_received<='20160514')] 64 | feature1 = off_train[(off_train.date>='20160101')&(off_train.date<='20160413')|((off_train.date=='null')&(off_train.date_received>='20160101')&(off_train.date_received<='20160413'))] 65 | 66 | 67 | ############# other feature ##################3 68 | """ 69 | 5. other feature: 70 | this_month_user_receive_all_coupon_count 71 | this_month_user_receive_same_coupon_count 72 | this_month_user_receive_same_coupon_lastone 73 | this_month_user_receive_same_coupon_firstone 74 | this_day_user_receive_all_coupon_count 75 | this_day_user_receive_same_coupon_count 76 | day_gap_before, day_gap_after (receive the same coupon) 77 | """ 78 | 79 | #for dataset3 80 | t = dataset3[['user_id']] 81 | t['this_month_user_receive_all_coupon_count'] = 1 82 | t = t.groupby('user_id').agg('sum').reset_index() 83 | 84 | t1 = dataset3[['user_id','coupon_id']] 85 | t1['this_month_user_receive_same_coupon_count'] = 1 86 | t1 = t1.groupby(['user_id','coupon_id']).agg('sum').reset_index() 87 | 88 | t2 = dataset3[['user_id','coupon_id','date_received']] 89 | t2.date_received = t2.date_received.astype('str') 90 | t2 = t2.groupby(['user_id','coupon_id'])['date_received'].agg(lambda x:':'.join(x)).reset_index() 91 | t2['receive_number'] = t2.date_received.apply(lambda s:len(s.split(':'))) 92 | t2 = t2[t2.receive_number>1] 93 | t2['max_date_received'] = t2.date_received.apply(lambda s:max([int(d) for d in s.split(':')])) 94 | t2['min_date_received'] = t2.date_received.apply(lambda s:min([int(d) for d in s.split(':')])) 95 | t2 = t2[['user_id','coupon_id','max_date_received','min_date_received']] 96 | 97 | t3 = dataset3[['user_id','coupon_id','date_received']] 98 | t3 = pd.merge(t3,t2,on=['user_id','coupon_id'],how='left') 99 | t3['this_month_user_receive_same_coupon_lastone'] = t3.max_date_received - t3.date_received 100 | t3['this_month_user_receive_same_coupon_firstone'] = t3.date_received - t3.min_date_received 101 | def is_firstlastone(x): 102 | if x==0: 103 | return 1 104 | elif x>0: 105 | return 0 106 | else: 107 | return -1 #those only receive once 108 | 109 | t3.this_month_user_receive_same_coupon_lastone = t3.this_month_user_receive_same_coupon_lastone.apply(is_firstlastone) 110 | t3.this_month_user_receive_same_coupon_firstone = t3.this_month_user_receive_same_coupon_firstone.apply(is_firstlastone) 111 | t3 = t3[['user_id','coupon_id','date_received','this_month_user_receive_same_coupon_lastone','this_month_user_receive_same_coupon_firstone']] 112 | 113 | t4 = dataset3[['user_id','date_received']] 114 | t4['this_day_user_receive_all_coupon_count'] = 1 115 | t4 = t4.groupby(['user_id','date_received']).agg('sum').reset_index() 116 | 117 | t5 = dataset3[['user_id','coupon_id','date_received']] 118 | t5['this_day_user_receive_same_coupon_count'] = 1 119 | t5 = t5.groupby(['user_id','coupon_id','date_received']).agg('sum').reset_index() 120 | 121 | t6 = dataset3[['user_id','coupon_id','date_received']] 122 | t6.date_received = t6.date_received.astype('str') 123 | t6 = t6.groupby(['user_id','coupon_id'])['date_received'].agg(lambda x:':'.join(x)).reset_index() 124 | t6.rename(columns={'date_received':'dates'},inplace=True) 125 | 126 | def get_day_gap_before(s): 127 | date_received,dates = s.split('-') 128 | dates = dates.split(':') 129 | gaps = [] 130 | for d in dates: 131 | this_gap = (date(int(date_received[0:4]),int(date_received[4:6]),int(date_received[6:8]))-date(int(d[0:4]),int(d[4:6]),int(d[6:8]))).days 132 | if this_gap>0: 133 | gaps.append(this_gap) 134 | if len(gaps)==0: 135 | return -1 136 | else: 137 | return min(gaps) 138 | 139 | def get_day_gap_after(s): 140 | date_received,dates = s.split('-') 141 | dates = dates.split(':') 142 | gaps = [] 143 | for d in dates: 144 | this_gap = (date(int(d[0:4]),int(d[4:6]),int(d[6:8]))-date(int(date_received[0:4]),int(date_received[4:6]),int(date_received[6:8]))).days 145 | if this_gap>0: 146 | gaps.append(this_gap) 147 | if len(gaps)==0: 148 | return -1 149 | else: 150 | return min(gaps) 151 | 152 | 153 | t7 = dataset3[['user_id','coupon_id','date_received']] 154 | t7 = pd.merge(t7,t6,on=['user_id','coupon_id'],how='left') 155 | t7['date_received_date'] = t7.date_received.astype('str') + '-' + t7.dates 156 | t7['day_gap_before'] = t7.date_received_date.apply(get_day_gap_before) 157 | t7['day_gap_after'] = t7.date_received_date.apply(get_day_gap_after) 158 | t7 = t7[['user_id','coupon_id','date_received','day_gap_before','day_gap_after']] 159 | 160 | other_feature3 = pd.merge(t1,t,on='user_id') 161 | other_feature3 = pd.merge(other_feature3,t3,on=['user_id','coupon_id']) 162 | other_feature3 = pd.merge(other_feature3,t4,on=['user_id','date_received']) 163 | other_feature3 = pd.merge(other_feature3,t5,on=['user_id','coupon_id','date_received']) 164 | other_feature3 = pd.merge(other_feature3,t7,on=['user_id','coupon_id','date_received']) 165 | other_feature3.to_csv('data/other_feature3.csv',index=None) 166 | print other_feature3.shape 167 | 168 | 169 | 170 | #for dataset2 171 | t = dataset2[['user_id']] 172 | t['this_month_user_receive_all_coupon_count'] = 1 173 | t = t.groupby('user_id').agg('sum').reset_index() 174 | 175 | t1 = dataset2[['user_id','coupon_id']] 176 | t1['this_month_user_receive_same_coupon_count'] = 1 177 | t1 = t1.groupby(['user_id','coupon_id']).agg('sum').reset_index() 178 | 179 | t2 = dataset2[['user_id','coupon_id','date_received']] 180 | t2.date_received = t2.date_received.astype('str') 181 | t2 = t2.groupby(['user_id','coupon_id'])['date_received'].agg(lambda x:':'.join(x)).reset_index() 182 | t2['receive_number'] = t2.date_received.apply(lambda s:len(s.split(':'))) 183 | t2 = t2[t2.receive_number>1] 184 | t2['max_date_received'] = t2.date_received.apply(lambda s:max([int(d) for d in s.split(':')])) 185 | t2['min_date_received'] = t2.date_received.apply(lambda s:min([int(d) for d in s.split(':')])) 186 | t2 = t2[['user_id','coupon_id','max_date_received','min_date_received']] 187 | 188 | t3 = dataset2[['user_id','coupon_id','date_received']] 189 | t3 = pd.merge(t3,t2,on=['user_id','coupon_id'],how='left') 190 | t3['this_month_user_receive_same_coupon_lastone'] = t3.max_date_received - t3.date_received.astype('int') 191 | t3['this_month_user_receive_same_coupon_firstone'] = t3.date_received.astype('int') - t3.min_date_received 192 | def is_firstlastone(x): 193 | if x==0: 194 | return 1 195 | elif x>0: 196 | return 0 197 | else: 198 | return -1 #those only receive once 199 | 200 | t3.this_month_user_receive_same_coupon_lastone = t3.this_month_user_receive_same_coupon_lastone.apply(is_firstlastone) 201 | t3.this_month_user_receive_same_coupon_firstone = t3.this_month_user_receive_same_coupon_firstone.apply(is_firstlastone) 202 | t3 = t3[['user_id','coupon_id','date_received','this_month_user_receive_same_coupon_lastone','this_month_user_receive_same_coupon_firstone']] 203 | 204 | t4 = dataset2[['user_id','date_received']] 205 | t4['this_day_user_receive_all_coupon_count'] = 1 206 | t4 = t4.groupby(['user_id','date_received']).agg('sum').reset_index() 207 | 208 | t5 = dataset2[['user_id','coupon_id','date_received']] 209 | t5['this_day_user_receive_same_coupon_count'] = 1 210 | t5 = t5.groupby(['user_id','coupon_id','date_received']).agg('sum').reset_index() 211 | 212 | t6 = dataset2[['user_id','coupon_id','date_received']] 213 | t6.date_received = t6.date_received.astype('str') 214 | t6 = t6.groupby(['user_id','coupon_id'])['date_received'].agg(lambda x:':'.join(x)).reset_index() 215 | t6.rename(columns={'date_received':'dates'},inplace=True) 216 | 217 | def get_day_gap_before(s): 218 | date_received,dates = s.split('-') 219 | dates = dates.split(':') 220 | gaps = [] 221 | for d in dates: 222 | this_gap = (date(int(date_received[0:4]),int(date_received[4:6]),int(date_received[6:8]))-date(int(d[0:4]),int(d[4:6]),int(d[6:8]))).days 223 | if this_gap>0: 224 | gaps.append(this_gap) 225 | if len(gaps)==0: 226 | return -1 227 | else: 228 | return min(gaps) 229 | 230 | def get_day_gap_after(s): 231 | date_received,dates = s.split('-') 232 | dates = dates.split(':') 233 | gaps = [] 234 | for d in dates: 235 | this_gap = (date(int(d[0:4]),int(d[4:6]),int(d[6:8]))-date(int(date_received[0:4]),int(date_received[4:6]),int(date_received[6:8]))).days 236 | if this_gap>0: 237 | gaps.append(this_gap) 238 | if len(gaps)==0: 239 | return -1 240 | else: 241 | return min(gaps) 242 | 243 | 244 | t7 = dataset2[['user_id','coupon_id','date_received']] 245 | t7 = pd.merge(t7,t6,on=['user_id','coupon_id'],how='left') 246 | t7['date_received_date'] = t7.date_received.astype('str') + '-' + t7.dates 247 | t7['day_gap_before'] = t7.date_received_date.apply(get_day_gap_before) 248 | t7['day_gap_after'] = t7.date_received_date.apply(get_day_gap_after) 249 | t7 = t7[['user_id','coupon_id','date_received','day_gap_before','day_gap_after']] 250 | 251 | other_feature2 = pd.merge(t1,t,on='user_id') 252 | other_feature2 = pd.merge(other_feature2,t3,on=['user_id','coupon_id']) 253 | other_feature2 = pd.merge(other_feature2,t4,on=['user_id','date_received']) 254 | other_feature2 = pd.merge(other_feature2,t5,on=['user_id','coupon_id','date_received']) 255 | other_feature2 = pd.merge(other_feature2,t7,on=['user_id','coupon_id','date_received']) 256 | other_feature2.to_csv('data/other_feature2.csv',index=None) 257 | print other_feature2.shape 258 | 259 | 260 | 261 | #for dataset1 262 | t = dataset1[['user_id']] 263 | t['this_month_user_receive_all_coupon_count'] = 1 264 | t = t.groupby('user_id').agg('sum').reset_index() 265 | 266 | t1 = dataset1[['user_id','coupon_id']] 267 | t1['this_month_user_receive_same_coupon_count'] = 1 268 | t1 = t1.groupby(['user_id','coupon_id']).agg('sum').reset_index() 269 | 270 | t2 = dataset1[['user_id','coupon_id','date_received']] 271 | t2.date_received = t2.date_received.astype('str') 272 | t2 = t2.groupby(['user_id','coupon_id'])['date_received'].agg(lambda x:':'.join(x)).reset_index() 273 | t2['receive_number'] = t2.date_received.apply(lambda s:len(s.split(':'))) 274 | t2 = t2[t2.receive_number>1] 275 | t2['max_date_received'] = t2.date_received.apply(lambda s:max([int(d) for d in s.split(':')])) 276 | t2['min_date_received'] = t2.date_received.apply(lambda s:min([int(d) for d in s.split(':')])) 277 | t2 = t2[['user_id','coupon_id','max_date_received','min_date_received']] 278 | 279 | t3 = dataset1[['user_id','coupon_id','date_received']] 280 | t3 = pd.merge(t3,t2,on=['user_id','coupon_id'],how='left') 281 | t3['this_month_user_receive_same_coupon_lastone'] = t3.max_date_received - t3.date_received.astype('int') 282 | t3['this_month_user_receive_same_coupon_firstone'] = t3.date_received.astype('int') - t3.min_date_received 283 | def is_firstlastone(x): 284 | if x==0: 285 | return 1 286 | elif x>0: 287 | return 0 288 | else: 289 | return -1 #those only receive once 290 | 291 | t3.this_month_user_receive_same_coupon_lastone = t3.this_month_user_receive_same_coupon_lastone.apply(is_firstlastone) 292 | t3.this_month_user_receive_same_coupon_firstone = t3.this_month_user_receive_same_coupon_firstone.apply(is_firstlastone) 293 | t3 = t3[['user_id','coupon_id','date_received','this_month_user_receive_same_coupon_lastone','this_month_user_receive_same_coupon_firstone']] 294 | 295 | t4 = dataset1[['user_id','date_received']] 296 | t4['this_day_user_receive_all_coupon_count'] = 1 297 | t4 = t4.groupby(['user_id','date_received']).agg('sum').reset_index() 298 | 299 | t5 = dataset1[['user_id','coupon_id','date_received']] 300 | t5['this_day_user_receive_same_coupon_count'] = 1 301 | t5 = t5.groupby(['user_id','coupon_id','date_received']).agg('sum').reset_index() 302 | 303 | t6 = dataset1[['user_id','coupon_id','date_received']] 304 | t6.date_received = t6.date_received.astype('str') 305 | t6 = t6.groupby(['user_id','coupon_id'])['date_received'].agg(lambda x:':'.join(x)).reset_index() 306 | t6.rename(columns={'date_received':'dates'},inplace=True) 307 | 308 | def get_day_gap_before(s): 309 | date_received,dates = s.split('-') 310 | dates = dates.split(':') 311 | gaps = [] 312 | for d in dates: 313 | this_gap = (date(int(date_received[0:4]),int(date_received[4:6]),int(date_received[6:8]))-date(int(d[0:4]),int(d[4:6]),int(d[6:8]))).days 314 | if this_gap>0: 315 | gaps.append(this_gap) 316 | if len(gaps)==0: 317 | return -1 318 | else: 319 | return min(gaps) 320 | 321 | def get_day_gap_after(s): 322 | date_received,dates = s.split('-') 323 | dates = dates.split(':') 324 | gaps = [] 325 | for d in dates: 326 | this_gap = (date(int(d[0:4]),int(d[4:6]),int(d[6:8]))-date(int(date_received[0:4]),int(date_received[4:6]),int(date_received[6:8]))).days 327 | if this_gap>0: 328 | gaps.append(this_gap) 329 | if len(gaps)==0: 330 | return -1 331 | else: 332 | return min(gaps) 333 | 334 | 335 | t7 = dataset1[['user_id','coupon_id','date_received']] 336 | t7 = pd.merge(t7,t6,on=['user_id','coupon_id'],how='left') 337 | t7['date_received_date'] = t7.date_received.astype('str') + '-' + t7.dates 338 | t7['day_gap_before'] = t7.date_received_date.apply(get_day_gap_before) 339 | t7['day_gap_after'] = t7.date_received_date.apply(get_day_gap_after) 340 | t7 = t7[['user_id','coupon_id','date_received','day_gap_before','day_gap_after']] 341 | 342 | other_feature1 = pd.merge(t1,t,on='user_id') 343 | other_feature1 = pd.merge(other_feature1,t3,on=['user_id','coupon_id']) 344 | other_feature1 = pd.merge(other_feature1,t4,on=['user_id','date_received']) 345 | other_feature1 = pd.merge(other_feature1,t5,on=['user_id','coupon_id','date_received']) 346 | other_feature1 = pd.merge(other_feature1,t7,on=['user_id','coupon_id','date_received']) 347 | other_feature1.to_csv('data/other_feature1.csv',index=None) 348 | print other_feature1.shape 349 | 350 | 351 | 352 | 353 | 354 | 355 | ############# coupon related feature ############# 356 | """ 357 | 2.coupon related: 358 | discount_rate. discount_man. discount_jian. is_man_jian 359 | day_of_week,day_of_month. (date_received) 360 | """ 361 | def calc_discount_rate(s): 362 | s =str(s) 363 | s = s.split(':') 364 | if len(s)==1: 365 | return float(s[0]) 366 | else: 367 | return 1.0-float(s[1])/float(s[0]) 368 | 369 | def get_discount_man(s): 370 | s =str(s) 371 | s = s.split(':') 372 | if len(s)==1: 373 | return 'null' 374 | else: 375 | return int(s[0]) 376 | 377 | def get_discount_jian(s): 378 | s =str(s) 379 | s = s.split(':') 380 | if len(s)==1: 381 | return 'null' 382 | else: 383 | return int(s[1]) 384 | 385 | def is_man_jian(s): 386 | s =str(s) 387 | s = s.split(':') 388 | if len(s)==1: 389 | return 0 390 | else: 391 | return 1 392 | 393 | #dataset3 394 | dataset3['day_of_week'] = dataset3.date_received.astype('str').apply(lambda x:date(int(x[0:4]),int(x[4:6]),int(x[6:8])).weekday()+1) 395 | dataset3['day_of_month'] = dataset3.date_received.astype('str').apply(lambda x:int(x[6:8])) 396 | dataset3['days_distance'] = dataset3.date_received.astype('str').apply(lambda x:(date(int(x[0:4]),int(x[4:6]),int(x[6:8]))-date(2016,6,30)).days) 397 | dataset3['discount_man'] = dataset3.discount_rate.apply(get_discount_man) 398 | dataset3['discount_jian'] = dataset3.discount_rate.apply(get_discount_jian) 399 | dataset3['is_man_jian'] = dataset3.discount_rate.apply(is_man_jian) 400 | dataset3['discount_rate'] = dataset3.discount_rate.apply(calc_discount_rate) 401 | d = dataset3[['coupon_id']] 402 | d['coupon_count'] = 1 403 | d = d.groupby('coupon_id').agg('sum').reset_index() 404 | dataset3 = pd.merge(dataset3,d,on='coupon_id',how='left') 405 | dataset3.to_csv('data/coupon3_feature.csv',index=None) 406 | #dataset2 407 | dataset2['day_of_week'] = dataset2.date_received.astype('str').apply(lambda x:date(int(x[0:4]),int(x[4:6]),int(x[6:8])).weekday()+1) 408 | dataset2['day_of_month'] = dataset2.date_received.astype('str').apply(lambda x:int(x[6:8])) 409 | dataset2['days_distance'] = dataset2.date_received.astype('str').apply(lambda x:(date(int(x[0:4]),int(x[4:6]),int(x[6:8]))-date(2016,5,14)).days) 410 | dataset2['discount_man'] = dataset2.discount_rate.apply(get_discount_man) 411 | dataset2['discount_jian'] = dataset2.discount_rate.apply(get_discount_jian) 412 | dataset2['is_man_jian'] = dataset2.discount_rate.apply(is_man_jian) 413 | dataset2['discount_rate'] = dataset2.discount_rate.apply(calc_discount_rate) 414 | d = dataset2[['coupon_id']] 415 | d['coupon_count'] = 1 416 | d = d.groupby('coupon_id').agg('sum').reset_index() 417 | dataset2 = pd.merge(dataset2,d,on='coupon_id',how='left') 418 | dataset2.to_csv('data/coupon2_feature.csv',index=None) 419 | #dataset1 420 | dataset1['day_of_week'] = dataset1.date_received.astype('str').apply(lambda x:date(int(x[0:4]),int(x[4:6]),int(x[6:8])).weekday()+1) 421 | dataset1['day_of_month'] = dataset1.date_received.astype('str').apply(lambda x:int(x[6:8])) 422 | dataset1['days_distance'] = dataset1.date_received.astype('str').apply(lambda x:(date(int(x[0:4]),int(x[4:6]),int(x[6:8]))-date(2016,4,13)).days) 423 | dataset1['discount_man'] = dataset1.discount_rate.apply(get_discount_man) 424 | dataset1['discount_jian'] = dataset1.discount_rate.apply(get_discount_jian) 425 | dataset1['is_man_jian'] = dataset1.discount_rate.apply(is_man_jian) 426 | dataset1['discount_rate'] = dataset1.discount_rate.apply(calc_discount_rate) 427 | d = dataset1[['coupon_id']] 428 | d['coupon_count'] = 1 429 | d = d.groupby('coupon_id').agg('sum').reset_index() 430 | dataset1 = pd.merge(dataset1,d,on='coupon_id',how='left') 431 | dataset1.to_csv('data/coupon1_feature.csv',index=None) 432 | 433 | 434 | 435 | ############# merchant related feature ############# 436 | """ 437 | 1.merchant related: 438 | total_sales. sales_use_coupon. total_coupon 439 | coupon_rate = sales_use_coupon/total_sales. 440 | transfer_rate = sales_use_coupon/total_coupon. 441 | merchant_avg_distance,merchant_min_distance,merchant_max_distance of those use coupon 442 | 443 | """ 444 | 445 | #for dataset3 446 | merchant3 = feature3[['merchant_id','coupon_id','distance','date_received','date']] 447 | 448 | t = merchant3[['merchant_id']] 449 | t.drop_duplicates(inplace=True) 450 | 451 | t1 = merchant3[merchant3.date!='null'][['merchant_id']] 452 | t1['total_sales'] = 1 453 | t1 = t1.groupby('merchant_id').agg('sum').reset_index() 454 | 455 | t2 = merchant3[(merchant3.date!='null')&(merchant3.coupon_id!='null')][['merchant_id']] 456 | t2['sales_use_coupon'] = 1 457 | t2 = t2.groupby('merchant_id').agg('sum').reset_index() 458 | 459 | t3 = merchant3[merchant3.coupon_id!='null'][['merchant_id']] 460 | t3['total_coupon'] = 1 461 | t3 = t3.groupby('merchant_id').agg('sum').reset_index() 462 | 463 | t4 = merchant3[(merchant3.date!='null')&(merchant3.coupon_id!='null')][['merchant_id','distance']] 464 | t4.replace('null',-1,inplace=True) 465 | t4.distance = t4.distance.astype('int') 466 | t4.replace(-1,np.nan,inplace=True) 467 | t5 = t4.groupby('merchant_id').agg('min').reset_index() 468 | t5.rename(columns={'distance':'merchant_min_distance'},inplace=True) 469 | 470 | t6 = t4.groupby('merchant_id').agg('max').reset_index() 471 | t6.rename(columns={'distance':'merchant_max_distance'},inplace=True) 472 | 473 | t7 = t4.groupby('merchant_id').agg('mean').reset_index() 474 | t7.rename(columns={'distance':'merchant_mean_distance'},inplace=True) 475 | 476 | t8 = t4.groupby('merchant_id').agg('median').reset_index() 477 | t8.rename(columns={'distance':'merchant_median_distance'},inplace=True) 478 | 479 | merchant3_feature = pd.merge(t,t1,on='merchant_id',how='left') 480 | merchant3_feature = pd.merge(merchant3_feature,t2,on='merchant_id',how='left') 481 | merchant3_feature = pd.merge(merchant3_feature,t3,on='merchant_id',how='left') 482 | merchant3_feature = pd.merge(merchant3_feature,t5,on='merchant_id',how='left') 483 | merchant3_feature = pd.merge(merchant3_feature,t6,on='merchant_id',how='left') 484 | merchant3_feature = pd.merge(merchant3_feature,t7,on='merchant_id',how='left') 485 | merchant3_feature = pd.merge(merchant3_feature,t8,on='merchant_id',how='left') 486 | merchant3_feature.sales_use_coupon = merchant3_feature.sales_use_coupon.replace(np.nan,0) #fillna with 0 487 | merchant3_feature['merchant_coupon_transfer_rate'] = merchant3_feature.sales_use_coupon.astype('float') / merchant3_feature.total_coupon 488 | merchant3_feature['coupon_rate'] = merchant3_feature.sales_use_coupon.astype('float') / merchant3_feature.total_sales 489 | merchant3_feature.total_coupon = merchant3_feature.total_coupon.replace(np.nan,0) #fillna with 0 490 | merchant3_feature.to_csv('data/merchant3_feature.csv',index=None) 491 | 492 | 493 | #for dataset2 494 | merchant2 = feature2[['merchant_id','coupon_id','distance','date_received','date']] 495 | 496 | t = merchant2[['merchant_id']] 497 | t.drop_duplicates(inplace=True) 498 | 499 | t1 = merchant2[merchant2.date!='null'][['merchant_id']] 500 | t1['total_sales'] = 1 501 | t1 = t1.groupby('merchant_id').agg('sum').reset_index() 502 | 503 | t2 = merchant2[(merchant2.date!='null')&(merchant2.coupon_id!='null')][['merchant_id']] 504 | t2['sales_use_coupon'] = 1 505 | t2 = t2.groupby('merchant_id').agg('sum').reset_index() 506 | 507 | t3 = merchant2[merchant2.coupon_id!='null'][['merchant_id']] 508 | t3['total_coupon'] = 1 509 | t3 = t3.groupby('merchant_id').agg('sum').reset_index() 510 | 511 | t4 = merchant2[(merchant2.date!='null')&(merchant2.coupon_id!='null')][['merchant_id','distance']] 512 | t4.replace('null',-1,inplace=True) 513 | t4.distance = t4.distance.astype('int') 514 | t4.replace(-1,np.nan,inplace=True) 515 | t5 = t4.groupby('merchant_id').agg('min').reset_index() 516 | t5.rename(columns={'distance':'merchant_min_distance'},inplace=True) 517 | 518 | t6 = t4.groupby('merchant_id').agg('max').reset_index() 519 | t6.rename(columns={'distance':'merchant_max_distance'},inplace=True) 520 | 521 | t7 = t4.groupby('merchant_id').agg('mean').reset_index() 522 | t7.rename(columns={'distance':'merchant_mean_distance'},inplace=True) 523 | 524 | t8 = t4.groupby('merchant_id').agg('median').reset_index() 525 | t8.rename(columns={'distance':'merchant_median_distance'},inplace=True) 526 | 527 | merchant2_feature = pd.merge(t,t1,on='merchant_id',how='left') 528 | merchant2_feature = pd.merge(merchant2_feature,t2,on='merchant_id',how='left') 529 | merchant2_feature = pd.merge(merchant2_feature,t3,on='merchant_id',how='left') 530 | merchant2_feature = pd.merge(merchant2_feature,t5,on='merchant_id',how='left') 531 | merchant2_feature = pd.merge(merchant2_feature,t6,on='merchant_id',how='left') 532 | merchant2_feature = pd.merge(merchant2_feature,t7,on='merchant_id',how='left') 533 | merchant2_feature = pd.merge(merchant2_feature,t8,on='merchant_id',how='left') 534 | merchant2_feature.sales_use_coupon = merchant2_feature.sales_use_coupon.replace(np.nan,0) #fillna with 0 535 | merchant2_feature['merchant_coupon_transfer_rate'] = merchant2_feature.sales_use_coupon.astype('float') / merchant2_feature.total_coupon 536 | merchant2_feature['coupon_rate'] = merchant2_feature.sales_use_coupon.astype('float') / merchant2_feature.total_sales 537 | merchant2_feature.total_coupon = merchant2_feature.total_coupon.replace(np.nan,0) #fillna with 0 538 | merchant2_feature.to_csv('data/merchant2_feature.csv',index=None) 539 | 540 | #for dataset1 541 | merchant1 = feature1[['merchant_id','coupon_id','distance','date_received','date']] 542 | 543 | t = merchant1[['merchant_id']] 544 | t.drop_duplicates(inplace=True) 545 | 546 | t1 = merchant1[merchant1.date!='null'][['merchant_id']] 547 | t1['total_sales'] = 1 548 | t1 = t1.groupby('merchant_id').agg('sum').reset_index() 549 | 550 | t2 = merchant1[(merchant1.date!='null')&(merchant1.coupon_id!='null')][['merchant_id']] 551 | t2['sales_use_coupon'] = 1 552 | t2 = t2.groupby('merchant_id').agg('sum').reset_index() 553 | 554 | t3 = merchant1[merchant1.coupon_id!='null'][['merchant_id']] 555 | t3['total_coupon'] = 1 556 | t3 = t3.groupby('merchant_id').agg('sum').reset_index() 557 | 558 | t4 = merchant1[(merchant1.date!='null')&(merchant1.coupon_id!='null')][['merchant_id','distance']] 559 | t4.replace('null',-1,inplace=True) 560 | t4.distance = t4.distance.astype('int') 561 | t4.replace(-1,np.nan,inplace=True) 562 | t5 = t4.groupby('merchant_id').agg('min').reset_index() 563 | t5.rename(columns={'distance':'merchant_min_distance'},inplace=True) 564 | 565 | t6 = t4.groupby('merchant_id').agg('max').reset_index() 566 | t6.rename(columns={'distance':'merchant_max_distance'},inplace=True) 567 | 568 | t7 = t4.groupby('merchant_id').agg('mean').reset_index() 569 | t7.rename(columns={'distance':'merchant_mean_distance'},inplace=True) 570 | 571 | t8 = t4.groupby('merchant_id').agg('median').reset_index() 572 | t8.rename(columns={'distance':'merchant_median_distance'},inplace=True) 573 | 574 | 575 | merchant1_feature = pd.merge(t,t1,on='merchant_id',how='left') 576 | merchant1_feature = pd.merge(merchant1_feature,t2,on='merchant_id',how='left') 577 | merchant1_feature = pd.merge(merchant1_feature,t3,on='merchant_id',how='left') 578 | merchant1_feature = pd.merge(merchant1_feature,t5,on='merchant_id',how='left') 579 | merchant1_feature = pd.merge(merchant1_feature,t6,on='merchant_id',how='left') 580 | merchant1_feature = pd.merge(merchant1_feature,t7,on='merchant_id',how='left') 581 | merchant1_feature = pd.merge(merchant1_feature,t8,on='merchant_id',how='left') 582 | merchant1_feature.sales_use_coupon = merchant1_feature.sales_use_coupon.replace(np.nan,0) #fillna with 0 583 | merchant1_feature['merchant_coupon_transfer_rate'] = merchant1_feature.sales_use_coupon.astype('float') / merchant1_feature.total_coupon 584 | merchant1_feature['coupon_rate'] = merchant1_feature.sales_use_coupon.astype('float') / merchant1_feature.total_sales 585 | merchant1_feature.total_coupon = merchant1_feature.total_coupon.replace(np.nan,0) #fillna with 0 586 | merchant1_feature.to_csv('data/merchant1_feature.csv',index=None) 587 | 588 | 589 | 590 | 591 | ############# user related feature ############# 592 | """ 593 | 3.user related: 594 | count_merchant. 595 | user_avg_distance, user_min_distance,user_max_distance. 596 | buy_use_coupon. buy_total. coupon_received. 597 | buy_use_coupon/coupon_received. 598 | buy_use_coupon/buy_total 599 | user_date_datereceived_gap 600 | 601 | 602 | """ 603 | 604 | def get_user_date_datereceived_gap(s): 605 | s = s.split(':') 606 | return (date(int(s[0][0:4]),int(s[0][4:6]),int(s[0][6:8])) - date(int(s[1][0:4]),int(s[1][4:6]),int(s[1][6:8]))).days 607 | 608 | #for dataset3 609 | user3 = feature3[['user_id','merchant_id','coupon_id','discount_rate','distance','date_received','date']] 610 | 611 | t = user3[['user_id']] 612 | t.drop_duplicates(inplace=True) 613 | 614 | t1 = user3[user3.date!='null'][['user_id','merchant_id']] 615 | t1.drop_duplicates(inplace=True) 616 | t1.merchant_id = 1 617 | t1 = t1.groupby('user_id').agg('sum').reset_index() 618 | t1.rename(columns={'merchant_id':'count_merchant'},inplace=True) 619 | 620 | t2 = user3[(user3.date!='null')&(user3.coupon_id!='null')][['user_id','distance']] 621 | t2.replace('null',-1,inplace=True) 622 | t2.distance = t2.distance.astype('int') 623 | t2.replace(-1,np.nan,inplace=True) 624 | t3 = t2.groupby('user_id').agg('min').reset_index() 625 | t3.rename(columns={'distance':'user_min_distance'},inplace=True) 626 | 627 | t4 = t2.groupby('user_id').agg('max').reset_index() 628 | t4.rename(columns={'distance':'user_max_distance'},inplace=True) 629 | 630 | t5 = t2.groupby('user_id').agg('mean').reset_index() 631 | t5.rename(columns={'distance':'user_mean_distance'},inplace=True) 632 | 633 | t6 = t2.groupby('user_id').agg('median').reset_index() 634 | t6.rename(columns={'distance':'user_median_distance'},inplace=True) 635 | 636 | t7 = user3[(user3.date!='null')&(user3.coupon_id!='null')][['user_id']] 637 | t7['buy_use_coupon'] = 1 638 | t7 = t7.groupby('user_id').agg('sum').reset_index() 639 | 640 | t8 = user3[user3.date!='null'][['user_id']] 641 | t8['buy_total'] = 1 642 | t8 = t8.groupby('user_id').agg('sum').reset_index() 643 | 644 | t9 = user3[user3.coupon_id!='null'][['user_id']] 645 | t9['coupon_received'] = 1 646 | t9 = t9.groupby('user_id').agg('sum').reset_index() 647 | 648 | t10 = user3[(user3.date_received!='null')&(user3.date!='null')][['user_id','date_received','date']] 649 | t10['user_date_datereceived_gap'] = t10.date + ':' + t10.date_received 650 | t10.user_date_datereceived_gap = t10.user_date_datereceived_gap.apply(get_user_date_datereceived_gap) 651 | t10 = t10[['user_id','user_date_datereceived_gap']] 652 | 653 | t11 = t10.groupby('user_id').agg('mean').reset_index() 654 | t11.rename(columns={'user_date_datereceived_gap':'avg_user_date_datereceived_gap'},inplace=True) 655 | t12 = t10.groupby('user_id').agg('min').reset_index() 656 | t12.rename(columns={'user_date_datereceived_gap':'min_user_date_datereceived_gap'},inplace=True) 657 | t13 = t10.groupby('user_id').agg('max').reset_index() 658 | t13.rename(columns={'user_date_datereceived_gap':'max_user_date_datereceived_gap'},inplace=True) 659 | 660 | 661 | user3_feature = pd.merge(t,t1,on='user_id',how='left') 662 | user3_feature = pd.merge(user3_feature,t3,on='user_id',how='left') 663 | user3_feature = pd.merge(user3_feature,t4,on='user_id',how='left') 664 | user3_feature = pd.merge(user3_feature,t5,on='user_id',how='left') 665 | user3_feature = pd.merge(user3_feature,t6,on='user_id',how='left') 666 | user3_feature = pd.merge(user3_feature,t7,on='user_id',how='left') 667 | user3_feature = pd.merge(user3_feature,t8,on='user_id',how='left') 668 | user3_feature = pd.merge(user3_feature,t9,on='user_id',how='left') 669 | user3_feature = pd.merge(user3_feature,t11,on='user_id',how='left') 670 | user3_feature = pd.merge(user3_feature,t12,on='user_id',how='left') 671 | user3_feature = pd.merge(user3_feature,t13,on='user_id',how='left') 672 | user3_feature.count_merchant = user3_feature.count_merchant.replace(np.nan,0) 673 | user3_feature.buy_use_coupon = user3_feature.buy_use_coupon.replace(np.nan,0) 674 | user3_feature['buy_use_coupon_rate'] = user3_feature.buy_use_coupon.astype('float') / user3_feature.buy_total.astype('float') 675 | user3_feature['user_coupon_transfer_rate'] = user3_feature.buy_use_coupon.astype('float') / user3_feature.coupon_received.astype('float') 676 | user3_feature.buy_total = user3_feature.buy_total.replace(np.nan,0) 677 | user3_feature.coupon_received = user3_feature.coupon_received.replace(np.nan,0) 678 | user3_feature.to_csv('data/user3_feature.csv',index=None) 679 | 680 | 681 | #for dataset2 682 | user2 = feature2[['user_id','merchant_id','coupon_id','discount_rate','distance','date_received','date']] 683 | 684 | t = user2[['user_id']] 685 | t.drop_duplicates(inplace=True) 686 | 687 | t1 = user2[user2.date!='null'][['user_id','merchant_id']] 688 | t1.drop_duplicates(inplace=True) 689 | t1.merchant_id = 1 690 | t1 = t1.groupby('user_id').agg('sum').reset_index() 691 | t1.rename(columns={'merchant_id':'count_merchant'},inplace=True) 692 | 693 | t2 = user2[(user2.date!='null')&(user2.coupon_id!='null')][['user_id','distance']] 694 | t2.replace('null',-1,inplace=True) 695 | t2.distance = t2.distance.astype('int') 696 | t2.replace(-1,np.nan,inplace=True) 697 | t3 = t2.groupby('user_id').agg('min').reset_index() 698 | t3.rename(columns={'distance':'user_min_distance'},inplace=True) 699 | 700 | t4 = t2.groupby('user_id').agg('max').reset_index() 701 | t4.rename(columns={'distance':'user_max_distance'},inplace=True) 702 | 703 | t5 = t2.groupby('user_id').agg('mean').reset_index() 704 | t5.rename(columns={'distance':'user_mean_distance'},inplace=True) 705 | 706 | t6 = t2.groupby('user_id').agg('median').reset_index() 707 | t6.rename(columns={'distance':'user_median_distance'},inplace=True) 708 | 709 | t7 = user2[(user2.date!='null')&(user2.coupon_id!='null')][['user_id']] 710 | t7['buy_use_coupon'] = 1 711 | t7 = t7.groupby('user_id').agg('sum').reset_index() 712 | 713 | t8 = user2[user2.date!='null'][['user_id']] 714 | t8['buy_total'] = 1 715 | t8 = t8.groupby('user_id').agg('sum').reset_index() 716 | 717 | t9 = user2[user2.coupon_id!='null'][['user_id']] 718 | t9['coupon_received'] = 1 719 | t9 = t9.groupby('user_id').agg('sum').reset_index() 720 | 721 | t10 = user2[(user2.date_received!='null')&(user2.date!='null')][['user_id','date_received','date']] 722 | t10['user_date_datereceived_gap'] = t10.date + ':' + t10.date_received 723 | t10.user_date_datereceived_gap = t10.user_date_datereceived_gap.apply(get_user_date_datereceived_gap) 724 | t10 = t10[['user_id','user_date_datereceived_gap']] 725 | 726 | t11 = t10.groupby('user_id').agg('mean').reset_index() 727 | t11.rename(columns={'user_date_datereceived_gap':'avg_user_date_datereceived_gap'},inplace=True) 728 | t12 = t10.groupby('user_id').agg('min').reset_index() 729 | t12.rename(columns={'user_date_datereceived_gap':'min_user_date_datereceived_gap'},inplace=True) 730 | t13 = t10.groupby('user_id').agg('max').reset_index() 731 | t13.rename(columns={'user_date_datereceived_gap':'max_user_date_datereceived_gap'},inplace=True) 732 | 733 | user2_feature = pd.merge(t,t1,on='user_id',how='left') 734 | user2_feature = pd.merge(user2_feature,t3,on='user_id',how='left') 735 | user2_feature = pd.merge(user2_feature,t4,on='user_id',how='left') 736 | user2_feature = pd.merge(user2_feature,t5,on='user_id',how='left') 737 | user2_feature = pd.merge(user2_feature,t6,on='user_id',how='left') 738 | user2_feature = pd.merge(user2_feature,t7,on='user_id',how='left') 739 | user2_feature = pd.merge(user2_feature,t8,on='user_id',how='left') 740 | user2_feature = pd.merge(user2_feature,t9,on='user_id',how='left') 741 | user2_feature = pd.merge(user2_feature,t11,on='user_id',how='left') 742 | user2_feature = pd.merge(user2_feature,t12,on='user_id',how='left') 743 | user2_feature = pd.merge(user2_feature,t13,on='user_id',how='left') 744 | user2_feature.count_merchant = user2_feature.count_merchant.replace(np.nan,0) 745 | user2_feature.buy_use_coupon = user2_feature.buy_use_coupon.replace(np.nan,0) 746 | user2_feature['buy_use_coupon_rate'] = user2_feature.buy_use_coupon.astype('float') / user2_feature.buy_total.astype('float') 747 | user2_feature['user_coupon_transfer_rate'] = user2_feature.buy_use_coupon.astype('float') / user2_feature.coupon_received.astype('float') 748 | user2_feature.buy_total = user2_feature.buy_total.replace(np.nan,0) 749 | user2_feature.coupon_received = user2_feature.coupon_received.replace(np.nan,0) 750 | user2_feature.to_csv('data/user2_feature.csv',index=None) 751 | 752 | 753 | #for dataset1 754 | user1 = feature1[['user_id','merchant_id','coupon_id','discount_rate','distance','date_received','date']] 755 | 756 | t = user1[['user_id']] 757 | t.drop_duplicates(inplace=True) 758 | 759 | t1 = user1[user1.date!='null'][['user_id','merchant_id']] 760 | t1.drop_duplicates(inplace=True) 761 | t1.merchant_id = 1 762 | t1 = t1.groupby('user_id').agg('sum').reset_index() 763 | t1.rename(columns={'merchant_id':'count_merchant'},inplace=True) 764 | 765 | t2 = user1[(user1.date!='null')&(user1.coupon_id!='null')][['user_id','distance']] 766 | t2.replace('null',-1,inplace=True) 767 | t2.distance = t2.distance.astype('int') 768 | t2.replace(-1,np.nan,inplace=True) 769 | t3 = t2.groupby('user_id').agg('min').reset_index() 770 | t3.rename(columns={'distance':'user_min_distance'},inplace=True) 771 | 772 | t4 = t2.groupby('user_id').agg('max').reset_index() 773 | t4.rename(columns={'distance':'user_max_distance'},inplace=True) 774 | 775 | t5 = t2.groupby('user_id').agg('mean').reset_index() 776 | t5.rename(columns={'distance':'user_mean_distance'},inplace=True) 777 | 778 | t6 = t2.groupby('user_id').agg('median').reset_index() 779 | t6.rename(columns={'distance':'user_median_distance'},inplace=True) 780 | 781 | t7 = user1[(user1.date!='null')&(user1.coupon_id!='null')][['user_id']] 782 | t7['buy_use_coupon'] = 1 783 | t7 = t7.groupby('user_id').agg('sum').reset_index() 784 | 785 | t8 = user1[user1.date!='null'][['user_id']] 786 | t8['buy_total'] = 1 787 | t8 = t8.groupby('user_id').agg('sum').reset_index() 788 | 789 | t9 = user1[user1.coupon_id!='null'][['user_id']] 790 | t9['coupon_received'] = 1 791 | t9 = t9.groupby('user_id').agg('sum').reset_index() 792 | 793 | t10 = user1[(user1.date_received!='null')&(user1.date!='null')][['user_id','date_received','date']] 794 | t10['user_date_datereceived_gap'] = t10.date + ':' + t10.date_received 795 | t10.user_date_datereceived_gap = t10.user_date_datereceived_gap.apply(get_user_date_datereceived_gap) 796 | t10 = t10[['user_id','user_date_datereceived_gap']] 797 | 798 | t11 = t10.groupby('user_id').agg('mean').reset_index() 799 | t11.rename(columns={'user_date_datereceived_gap':'avg_user_date_datereceived_gap'},inplace=True) 800 | t12 = t10.groupby('user_id').agg('min').reset_index() 801 | t12.rename(columns={'user_date_datereceived_gap':'min_user_date_datereceived_gap'},inplace=True) 802 | t13 = t10.groupby('user_id').agg('max').reset_index() 803 | t13.rename(columns={'user_date_datereceived_gap':'max_user_date_datereceived_gap'},inplace=True) 804 | 805 | user1_feature = pd.merge(t,t1,on='user_id',how='left') 806 | user1_feature = pd.merge(user1_feature,t3,on='user_id',how='left') 807 | user1_feature = pd.merge(user1_feature,t4,on='user_id',how='left') 808 | user1_feature = pd.merge(user1_feature,t5,on='user_id',how='left') 809 | user1_feature = pd.merge(user1_feature,t6,on='user_id',how='left') 810 | user1_feature = pd.merge(user1_feature,t7,on='user_id',how='left') 811 | user1_feature = pd.merge(user1_feature,t8,on='user_id',how='left') 812 | user1_feature = pd.merge(user1_feature,t9,on='user_id',how='left') 813 | user1_feature = pd.merge(user1_feature,t11,on='user_id',how='left') 814 | user1_feature = pd.merge(user1_feature,t12,on='user_id',how='left') 815 | user1_feature = pd.merge(user1_feature,t13,on='user_id',how='left') 816 | user1_feature.count_merchant = user1_feature.count_merchant.replace(np.nan,0) 817 | user1_feature.buy_use_coupon = user1_feature.buy_use_coupon.replace(np.nan,0) 818 | user1_feature['buy_use_coupon_rate'] = user1_feature.buy_use_coupon.astype('float') / user1_feature.buy_total.astype('float') 819 | user1_feature['user_coupon_transfer_rate'] = user1_feature.buy_use_coupon.astype('float') / user1_feature.coupon_received.astype('float') 820 | user1_feature.buy_total = user1_feature.buy_total.replace(np.nan,0) 821 | user1_feature.coupon_received = user1_feature.coupon_received.replace(np.nan,0) 822 | user1_feature.to_csv('data/user1_feature.csv',index=None) 823 | 824 | 825 | 826 | ################## user_merchant related feature ######################### 827 | 828 | """ 829 | 4.user_merchant: 830 | times_user_buy_merchant_before. 831 | """ 832 | #for dataset3 833 | all_user_merchant = feature3[['user_id','merchant_id']] 834 | all_user_merchant.drop_duplicates(inplace=True) 835 | 836 | t = feature3[['user_id','merchant_id','date']] 837 | t = t[t.date!='null'][['user_id','merchant_id']] 838 | t['user_merchant_buy_total'] = 1 839 | t = t.groupby(['user_id','merchant_id']).agg('sum').reset_index() 840 | t.drop_duplicates(inplace=True) 841 | 842 | t1 = feature3[['user_id','merchant_id','coupon_id']] 843 | t1 = t1[t1.coupon_id!='null'][['user_id','merchant_id']] 844 | t1['user_merchant_received'] = 1 845 | t1 = t1.groupby(['user_id','merchant_id']).agg('sum').reset_index() 846 | t1.drop_duplicates(inplace=True) 847 | 848 | t2 = feature3[['user_id','merchant_id','date','date_received']] 849 | t2 = t2[(t2.date!='null')&(t2.date_received!='null')][['user_id','merchant_id']] 850 | t2['user_merchant_buy_use_coupon'] = 1 851 | t2 = t2.groupby(['user_id','merchant_id']).agg('sum').reset_index() 852 | t2.drop_duplicates(inplace=True) 853 | 854 | t3 = feature3[['user_id','merchant_id']] 855 | t3['user_merchant_any'] = 1 856 | t3 = t3.groupby(['user_id','merchant_id']).agg('sum').reset_index() 857 | t3.drop_duplicates(inplace=True) 858 | 859 | t4 = feature3[['user_id','merchant_id','date','coupon_id']] 860 | t4 = t4[(t4.date!='null')&(t4.coupon_id=='null')][['user_id','merchant_id']] 861 | t4['user_merchant_buy_common'] = 1 862 | t4 = t4.groupby(['user_id','merchant_id']).agg('sum').reset_index() 863 | t4.drop_duplicates(inplace=True) 864 | 865 | user_merchant3 = pd.merge(all_user_merchant,t,on=['user_id','merchant_id'],how='left') 866 | user_merchant3 = pd.merge(user_merchant3,t1,on=['user_id','merchant_id'],how='left') 867 | user_merchant3 = pd.merge(user_merchant3,t2,on=['user_id','merchant_id'],how='left') 868 | user_merchant3 = pd.merge(user_merchant3,t3,on=['user_id','merchant_id'],how='left') 869 | user_merchant3 = pd.merge(user_merchant3,t4,on=['user_id','merchant_id'],how='left') 870 | user_merchant3.user_merchant_buy_use_coupon = user_merchant3.user_merchant_buy_use_coupon.replace(np.nan,0) 871 | user_merchant3.user_merchant_buy_common = user_merchant3.user_merchant_buy_common.replace(np.nan,0) 872 | user_merchant3['user_merchant_coupon_transfer_rate'] = user_merchant3.user_merchant_buy_use_coupon.astype('float') / user_merchant3.user_merchant_received.astype('float') 873 | user_merchant3['user_merchant_coupon_buy_rate'] = user_merchant3.user_merchant_buy_use_coupon.astype('float') / user_merchant3.user_merchant_buy_total.astype('float') 874 | user_merchant3['user_merchant_rate'] = user_merchant3.user_merchant_buy_total.astype('float') / user_merchant3.user_merchant_any.astype('float') 875 | user_merchant3['user_merchant_common_buy_rate'] = user_merchant3.user_merchant_buy_common.astype('float') / user_merchant3.user_merchant_buy_total.astype('float') 876 | user_merchant3.to_csv('data/user_merchant3.csv',index=None) 877 | 878 | #for dataset2 879 | all_user_merchant = feature2[['user_id','merchant_id']] 880 | all_user_merchant.drop_duplicates(inplace=True) 881 | 882 | t = feature2[['user_id','merchant_id','date']] 883 | t = t[t.date!='null'][['user_id','merchant_id']] 884 | t['user_merchant_buy_total'] = 1 885 | t = t.groupby(['user_id','merchant_id']).agg('sum').reset_index() 886 | t.drop_duplicates(inplace=True) 887 | 888 | t1 = feature2[['user_id','merchant_id','coupon_id']] 889 | t1 = t1[t1.coupon_id!='null'][['user_id','merchant_id']] 890 | t1['user_merchant_received'] = 1 891 | t1 = t1.groupby(['user_id','merchant_id']).agg('sum').reset_index() 892 | t1.drop_duplicates(inplace=True) 893 | 894 | t2 = feature2[['user_id','merchant_id','date','date_received']] 895 | t2 = t2[(t2.date!='null')&(t2.date_received!='null')][['user_id','merchant_id']] 896 | t2['user_merchant_buy_use_coupon'] = 1 897 | t2 = t2.groupby(['user_id','merchant_id']).agg('sum').reset_index() 898 | t2.drop_duplicates(inplace=True) 899 | 900 | t3 = feature2[['user_id','merchant_id']] 901 | t3['user_merchant_any'] = 1 902 | t3 = t3.groupby(['user_id','merchant_id']).agg('sum').reset_index() 903 | t3.drop_duplicates(inplace=True) 904 | 905 | t4 = feature2[['user_id','merchant_id','date','coupon_id']] 906 | t4 = t4[(t4.date!='null')&(t4.coupon_id=='null')][['user_id','merchant_id']] 907 | t4['user_merchant_buy_common'] = 1 908 | t4 = t4.groupby(['user_id','merchant_id']).agg('sum').reset_index() 909 | t4.drop_duplicates(inplace=True) 910 | 911 | user_merchant2 = pd.merge(all_user_merchant,t,on=['user_id','merchant_id'],how='left') 912 | user_merchant2 = pd.merge(user_merchant2,t1,on=['user_id','merchant_id'],how='left') 913 | user_merchant2 = pd.merge(user_merchant2,t2,on=['user_id','merchant_id'],how='left') 914 | user_merchant2 = pd.merge(user_merchant2,t3,on=['user_id','merchant_id'],how='left') 915 | user_merchant2 = pd.merge(user_merchant2,t4,on=['user_id','merchant_id'],how='left') 916 | user_merchant2.user_merchant_buy_use_coupon = user_merchant2.user_merchant_buy_use_coupon.replace(np.nan,0) 917 | user_merchant2.user_merchant_buy_common = user_merchant2.user_merchant_buy_common.replace(np.nan,0) 918 | user_merchant2['user_merchant_coupon_transfer_rate'] = user_merchant2.user_merchant_buy_use_coupon.astype('float') / user_merchant2.user_merchant_received.astype('float') 919 | user_merchant2['user_merchant_coupon_buy_rate'] = user_merchant2.user_merchant_buy_use_coupon.astype('float') / user_merchant2.user_merchant_buy_total.astype('float') 920 | user_merchant2['user_merchant_rate'] = user_merchant2.user_merchant_buy_total.astype('float') / user_merchant2.user_merchant_any.astype('float') 921 | user_merchant2['user_merchant_common_buy_rate'] = user_merchant2.user_merchant_buy_common.astype('float') / user_merchant2.user_merchant_buy_total.astype('float') 922 | user_merchant2.to_csv('data/user_merchant2.csv',index=None) 923 | 924 | #for dataset2 925 | all_user_merchant = feature1[['user_id','merchant_id']] 926 | all_user_merchant.drop_duplicates(inplace=True) 927 | 928 | t = feature1[['user_id','merchant_id','date']] 929 | t = t[t.date!='null'][['user_id','merchant_id']] 930 | t['user_merchant_buy_total'] = 1 931 | t = t.groupby(['user_id','merchant_id']).agg('sum').reset_index() 932 | t.drop_duplicates(inplace=True) 933 | 934 | t1 = feature1[['user_id','merchant_id','coupon_id']] 935 | t1 = t1[t1.coupon_id!='null'][['user_id','merchant_id']] 936 | t1['user_merchant_received'] = 1 937 | t1 = t1.groupby(['user_id','merchant_id']).agg('sum').reset_index() 938 | t1.drop_duplicates(inplace=True) 939 | 940 | t2 = feature1[['user_id','merchant_id','date','date_received']] 941 | t2 = t2[(t2.date!='null')&(t2.date_received!='null')][['user_id','merchant_id']] 942 | t2['user_merchant_buy_use_coupon'] = 1 943 | t2 = t2.groupby(['user_id','merchant_id']).agg('sum').reset_index() 944 | t2.drop_duplicates(inplace=True) 945 | 946 | t3 = feature1[['user_id','merchant_id']] 947 | t3['user_merchant_any'] = 1 948 | t3 = t3.groupby(['user_id','merchant_id']).agg('sum').reset_index() 949 | t3.drop_duplicates(inplace=True) 950 | 951 | t4 = feature1[['user_id','merchant_id','date','coupon_id']] 952 | t4 = t4[(t4.date!='null')&(t4.coupon_id=='null')][['user_id','merchant_id']] 953 | t4['user_merchant_buy_common'] = 1 954 | t4 = t4.groupby(['user_id','merchant_id']).agg('sum').reset_index() 955 | t4.drop_duplicates(inplace=True) 956 | 957 | user_merchant1 = pd.merge(all_user_merchant,t,on=['user_id','merchant_id'],how='left') 958 | user_merchant1 = pd.merge(user_merchant1,t1,on=['user_id','merchant_id'],how='left') 959 | user_merchant1 = pd.merge(user_merchant1,t2,on=['user_id','merchant_id'],how='left') 960 | user_merchant1 = pd.merge(user_merchant1,t3,on=['user_id','merchant_id'],how='left') 961 | user_merchant1 = pd.merge(user_merchant1,t4,on=['user_id','merchant_id'],how='left') 962 | user_merchant1.user_merchant_buy_use_coupon = user_merchant1.user_merchant_buy_use_coupon.replace(np.nan,0) 963 | user_merchant1.user_merchant_buy_common = user_merchant1.user_merchant_buy_common.replace(np.nan,0) 964 | user_merchant1['user_merchant_coupon_transfer_rate'] = user_merchant1.user_merchant_buy_use_coupon.astype('float') / user_merchant1.user_merchant_received.astype('float') 965 | user_merchant1['user_merchant_coupon_buy_rate'] = user_merchant1.user_merchant_buy_use_coupon.astype('float') / user_merchant1.user_merchant_buy_total.astype('float') 966 | user_merchant1['user_merchant_rate'] = user_merchant1.user_merchant_buy_total.astype('float') / user_merchant1.user_merchant_any.astype('float') 967 | user_merchant1['user_merchant_common_buy_rate'] = user_merchant1.user_merchant_buy_common.astype('float') / user_merchant1.user_merchant_buy_total.astype('float') 968 | user_merchant1.to_csv('data/user_merchant1.csv',index=None) 969 | 970 | 971 | 972 | 973 | 974 | 975 | 976 | ################## generate training and testing set ################ 977 | def get_label(s): 978 | s = s.split(':') 979 | if s[0]=='null': 980 | return 0 981 | elif (date(int(s[0][0:4]),int(s[0][4:6]),int(s[0][6:8]))-date(int(s[1][0:4]),int(s[1][4:6]),int(s[1][6:8]))).days<=15: 982 | return 1 983 | else: 984 | return -1 985 | 986 | 987 | coupon3 = pd.read_csv('data/coupon3_feature.csv') 988 | merchant3 = pd.read_csv('data/merchant3_feature.csv') 989 | user3 = pd.read_csv('data/user3_feature.csv') 990 | user_merchant3 = pd.read_csv('data/user_merchant3.csv') 991 | other_feature3 = pd.read_csv('data/other_feature3.csv') 992 | dataset3 = pd.merge(coupon3,merchant3,on='merchant_id',how='left') 993 | dataset3 = pd.merge(dataset3,user3,on='user_id',how='left') 994 | dataset3 = pd.merge(dataset3,user_merchant3,on=['user_id','merchant_id'],how='left') 995 | dataset3 = pd.merge(dataset3,other_feature3,on=['user_id','coupon_id','date_received'],how='left') 996 | dataset3.drop_duplicates(inplace=True) 997 | print dataset3.shape 998 | 999 | dataset3.user_merchant_buy_total = dataset3.user_merchant_buy_total.replace(np.nan,0) 1000 | dataset3.user_merchant_any = dataset3.user_merchant_any.replace(np.nan,0) 1001 | dataset3.user_merchant_received = dataset3.user_merchant_received.replace(np.nan,0) 1002 | dataset3['is_weekend'] = dataset3.day_of_week.apply(lambda x:1 if x in (6,7) else 0) 1003 | weekday_dummies = pd.get_dummies(dataset3.day_of_week) 1004 | weekday_dummies.columns = ['weekday'+str(i+1) for i in range(weekday_dummies.shape[1])] 1005 | dataset3 = pd.concat([dataset3,weekday_dummies],axis=1) 1006 | dataset3.drop(['merchant_id','day_of_week','coupon_count'],axis=1,inplace=True) 1007 | dataset3 = dataset3.replace('null',np.nan) 1008 | dataset3.to_csv('data/dataset3.csv',index=None) 1009 | 1010 | 1011 | coupon2 = pd.read_csv('data/coupon2_feature.csv') 1012 | merchant2 = pd.read_csv('data/merchant2_feature.csv') 1013 | user2 = pd.read_csv('data/user2_feature.csv') 1014 | user_merchant2 = pd.read_csv('data/user_merchant2.csv') 1015 | other_feature2 = pd.read_csv('data/other_feature2.csv') 1016 | dataset2 = pd.merge(coupon2,merchant2,on='merchant_id',how='left') 1017 | dataset2 = pd.merge(dataset2,user2,on='user_id',how='left') 1018 | dataset2 = pd.merge(dataset2,user_merchant2,on=['user_id','merchant_id'],how='left') 1019 | dataset2 = pd.merge(dataset2,other_feature2,on=['user_id','coupon_id','date_received'],how='left') 1020 | dataset2.drop_duplicates(inplace=True) 1021 | print dataset2.shape 1022 | 1023 | dataset2.user_merchant_buy_total = dataset2.user_merchant_buy_total.replace(np.nan,0) 1024 | dataset2.user_merchant_any = dataset2.user_merchant_any.replace(np.nan,0) 1025 | dataset2.user_merchant_received = dataset2.user_merchant_received.replace(np.nan,0) 1026 | dataset2['is_weekend'] = dataset2.day_of_week.apply(lambda x:1 if x in (6,7) else 0) 1027 | weekday_dummies = pd.get_dummies(dataset2.day_of_week) 1028 | weekday_dummies.columns = ['weekday'+str(i+1) for i in range(weekday_dummies.shape[1])] 1029 | dataset2 = pd.concat([dataset2,weekday_dummies],axis=1) 1030 | dataset2['label'] = dataset2.date.astype('str') + ':' + dataset2.date_received.astype('str') 1031 | dataset2.label = dataset2.label.apply(get_label) 1032 | dataset2.drop(['merchant_id','day_of_week','date','date_received','coupon_id','coupon_count'],axis=1,inplace=True) 1033 | dataset2 = dataset2.replace('null',np.nan) 1034 | dataset2.to_csv('data/dataset2.csv',index=None) 1035 | 1036 | 1037 | coupon1 = pd.read_csv('data/coupon1_feature.csv') 1038 | merchant1 = pd.read_csv('data/merchant1_feature.csv') 1039 | user1 = pd.read_csv('data/user1_feature.csv') 1040 | user_merchant1 = pd.read_csv('data/user_merchant1.csv') 1041 | other_feature1 = pd.read_csv('data/other_feature1.csv') 1042 | dataset1 = pd.merge(coupon1,merchant1,on='merchant_id',how='left') 1043 | dataset1 = pd.merge(dataset1,user1,on='user_id',how='left') 1044 | dataset1 = pd.merge(dataset1,user_merchant1,on=['user_id','merchant_id'],how='left') 1045 | dataset1 = pd.merge(dataset1,other_feature1,on=['user_id','coupon_id','date_received'],how='left') 1046 | dataset1.drop_duplicates(inplace=True) 1047 | print dataset1.shape 1048 | 1049 | dataset1.user_merchant_buy_total = dataset1.user_merchant_buy_total.replace(np.nan,0) 1050 | dataset1.user_merchant_any = dataset1.user_merchant_any.replace(np.nan,0) 1051 | dataset1.user_merchant_received = dataset1.user_merchant_received.replace(np.nan,0) 1052 | dataset1['is_weekend'] = dataset1.day_of_week.apply(lambda x:1 if x in (6,7) else 0) 1053 | weekday_dummies = pd.get_dummies(dataset1.day_of_week) 1054 | weekday_dummies.columns = ['weekday'+str(i+1) for i in range(weekday_dummies.shape[1])] 1055 | dataset1 = pd.concat([dataset1,weekday_dummies],axis=1) 1056 | dataset1['label'] = dataset1.date.astype('str') + ':' + dataset1.date_received.astype('str') 1057 | dataset1.label = dataset1.label.apply(get_label) 1058 | dataset1.drop(['merchant_id','day_of_week','date','date_received','coupon_id','coupon_count'],axis=1,inplace=True) 1059 | dataset1 = dataset1.replace('null',np.nan) 1060 | dataset1.to_csv('data/dataset1.csv',index=None) 1061 | -------------------------------------------------------------------------------- /xgb.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import xgboost as xgb 3 | from sklearn.preprocessing import MinMaxScaler 4 | 5 | dataset1 = pd.read_csv('data/dataset1.csv') 6 | dataset1.label.replace(-1,0,inplace=True) 7 | dataset2 = pd.read_csv('data/dataset2.csv') 8 | dataset2.label.replace(-1,0,inplace=True) 9 | dataset3 = pd.read_csv('data/dataset3.csv') 10 | 11 | dataset1.drop_duplicates(inplace=True) 12 | dataset2.drop_duplicates(inplace=True) 13 | dataset3.drop_duplicates(inplace=True) 14 | 15 | dataset12 = pd.concat([dataset1,dataset2],axis=0) 16 | 17 | dataset1_y = dataset1.label 18 | dataset1_x = dataset1.drop(['user_id','label','day_gap_before','day_gap_after'],axis=1) # 'day_gap_before','day_gap_after' cause overfitting, 0.77 19 | dataset2_y = dataset2.label 20 | dataset2_x = dataset2.drop(['user_id','label','day_gap_before','day_gap_after'],axis=1) 21 | dataset12_y = dataset12.label 22 | dataset12_x = dataset12.drop(['user_id','label','day_gap_before','day_gap_after'],axis=1) 23 | dataset3_preds = dataset3[['user_id','coupon_id','date_received']] 24 | dataset3_x = dataset3.drop(['user_id','coupon_id','date_received','day_gap_before','day_gap_after'],axis=1) 25 | 26 | print dataset1_x.shape,dataset2_x.shape,dataset3_x.shape 27 | 28 | dataset1 = xgb.DMatrix(dataset1_x,label=dataset1_y) 29 | dataset2 = xgb.DMatrix(dataset2_x,label=dataset2_y) 30 | dataset12 = xgb.DMatrix(dataset12_x,label=dataset12_y) 31 | dataset3 = xgb.DMatrix(dataset3_x) 32 | 33 | params={'booster':'gbtree', 34 | 'objective': 'rank:pairwise', 35 | 'eval_metric':'auc', 36 | 'gamma':0.1, 37 | 'min_child_weight':1.1, 38 | 'max_depth':5, 39 | 'lambda':10, 40 | 'subsample':0.7, 41 | 'colsample_bytree':0.7, 42 | 'colsample_bylevel':0.7, 43 | 'eta': 0.01, 44 | 'tree_method':'exact', 45 | 'seed':0, 46 | 'nthread':12 47 | } 48 | 49 | #train on dataset1, evaluate on dataset2 50 | #watchlist = [(dataset1,'train'),(dataset2,'val')] 51 | #model = xgb_3500.train(params,dataset1,num_boost_round=3000,evals=watchlist,early_stopping_rounds=300) 52 | 53 | watchlist = [(dataset12,'train')] 54 | model = xgb.train(params,dataset12,num_boost_round=3500,evals=watchlist) 55 | 56 | #predict test set 57 | dataset3_preds['label'] = model.predict(dataset3) 58 | dataset3_preds.label = MinMaxScaler().fit_transform(dataset3_preds.label) 59 | dataset3_preds.sort_values(by=['coupon_id','label'],inplace=True) 60 | dataset3_preds.to_csv("xgb_preds.csv",index=None,header=None) 61 | print dataset3_preds.describe() 62 | 63 | #save feature score 64 | feature_score = model.get_fscore() 65 | feature_score = sorted(feature_score.items(), key=lambda x:x[1],reverse=True) 66 | fs = [] 67 | for (key,value) in feature_score: 68 | fs.append("{0},{1}\n".format(key,value)) 69 | 70 | with open('xgb_feature_score.csv','w') as f: 71 | f.writelines("feature,score\n") 72 | f.writelines(fs) 73 | 74 | --------------------------------------------------------------------------------