├── README.md
├── Readme.txt
├── code
│   ├── EDA13.py
│   ├── EDA16-fourWeek.py
│   ├── EDA16-fourWeek_rightTime.py
│   ├── EDA16-threeWeek.py
│   ├── EDA16-threeWeek_rightTime.py
│   ├── EDA16-twoWeek.py
│   ├── cat_model
│   │   └── para.py
│   ├── df_train_test.py
│   ├── gen_result.py
│   ├── gen_result2.py
│   ├── lgb_model
│   │   ├── lgb_train1.py
│   │   ├── lgb_train2.py
│   │   └── lgb_train3.py
│   ├── run.sh
│   ├── sbb2_train1.py
│   ├── sbb2_train2.py
│   ├── sbb2_train3.py
│   ├── sbb4_train1.py
│   ├── sbb4_train2 .py
│   ├── sbb4_train3.py
│   ├── sbb_train1.py
│   ├── sbb_train2.py
│   ├── sbb_train3.py
│   └── xgb_model
│       ├── xgb_train1.py
│       ├── xgb_train2.py
│       └── xgb_train3.py
└── picture
    ├── huachuang.PNG
    └── time_series.PNG
/README.md:
--------------------------------------------------------------------------------
1 | # JD JDATA Algorithm Competition 2019 - Predicting User Purchases from Shops under a Category
2 |
3 | ## Competition Overview
4 | Competition page: [JDATA 2019 - Predicting user purchases from shops under a category](https://jdata.jd.com/html/detail.html?id=8)
5 |
6 |
7 | ### Task
8 | The competition provides data on users, shops, and products, including shop and product content, comment data, and rich user interaction behavior. Teams use data mining and machine learning to build a model that predicts which users will buy from which shops under the relevant categories, outputting user-shop-category matches that supply high-quality target audiences for precision marketing. The organizers also hope teams will uncover the latent meaning behind the data and provide intelligent, mutually beneficial solutions for the shops and users of the e-commerce platform.
9 | In short: for every user appearing in the training set, the model must predict that user's intention to buy from `a specific shop` under `a specific target category` within the next 7 days.
10 |
11 | ### Data
12 | 1. Training data
13 | Behavior, comment, and user data from `2018-02-01` to `2018-04-15`, covering users in set U acting on a subset of the products in set S.
14 | 2. Prediction data
15 | Predict which categories and shops each user in U buys from between `2018-04-16` and `2018-04-22`; a user buys from a shop under a category at most once.
16 | 3. Data tables
17 | 
18 | ### Evaluation
19 | (1) Whether the user buys from the category between `2018-04-16` and `2018-04-22`. The submission file contains only the users and categories predicted to order (pairs predicted not to order must not appear). Duplicate "user-category" rows are de-duplicated during evaluation; a correct prediction is scored with label=1, an incorrect one with label=0.
20 | (2) If the user does buy from the category, the model must also predict which shop under that category the purchase comes from; a correct shop prediction is scored with pred=1, an incorrect one with pred=0.
21 | A submission is scored as `score = 0.4*F11 + 0.6*F12`, where the F1 value is defined as:
22 | 
23 | where Precise is the precision and Recall is the recall, i.e. `F1 = 2*Precise*Recall/(Precise+Recall)`; F11 is the F1 over label=1/0 and F12 is the F1 over pred=1/0.
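
A rough, unofficial sketch of how this set-based score can be computed locally (the pair sets and their names below are hypothetical, not the official evaluator):

```python
def set_f1(pred_pairs, true_pairs):
    """F1 between a predicted set and a ground-truth set of tuples."""
    hits = len(pred_pairs & true_pairs)
    if hits == 0:
        return 0.0
    precise = hits / len(pred_pairs)
    recall = hits / len(true_pairs)
    return 2 * precise * recall / (precise + recall)

# pred_uc / true_uc: sets of (user_id, cate); pred_ucs / true_ucs: sets of (user_id, cate, shop_id)
score = 0.4 * set_f1(pred_uc, true_uc) + 0.6 * set_f1(pred_ucs, true_ucs)
```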
24 |
25 | ## Approach
26 | This competition is a classic class-imbalance problem: purchase vs. non-purchase behavior occurs at a ratio of roughly 1:100.
27 | Time sliding windows + LightGBM + XGBoost + CatBoost.
28 | Because type==5 (add-to-cart) only exists for the single week of April 8 to April 15, it suffers from missing data, so type==5 records are dropped when building both the training and the test sets.
29 | Following the sliding-window scheme, one week serves as the prediction (label) window, users active in the preceding 1/2/3 weeks form the training candidates, and features are extracted from 2/4/6 weeks of user-shop-product interactions.
30 | The figure below illustrates the case where 3 weeks of users form the training candidates and 6 weeks of user-shop-product interactions are used for feature extraction; a code sketch of the split follows the figures.
31 | 
32 | 
33 | * A leaderboard (online): 0.0614, Rank 7
34 | * B leaderboard (online): 0.0605, Rank 16
35 |
36 | ## Environment
37 | run.sh
38 |
39 | ## References
40 | * Official approach discussion: [Zhihu: unofficial Q&A thread for the 3rd JData competition](https://zhuanlan.zhihu.com/p/64503113)
41 | * Reference code for the competition: [JDATA3 purchase prediction - a modeling primer](https://mp.weixin.qq.com/s?__biz=Mzg2MTEwNDQxNQ==&mid=2247483702&idx=1&sn=df621247b4790471063ddbeb15ad81c3&chksm=ce1d7146f96af85001e47999cb447d86820b082570c39de0c4ddc18dcba0b233697d5ef2e0ae&mpshare=1&scene=23&srcid=#rd)
42 | * Introduction to sliding windows: [The "sliding window" method in data-mining competitions](https://blog.csdn.net/oXiaoBuDianEr123/article/details/79309022)
43 |
--------------------------------------------------------------------------------
/Readme.txt:
--------------------------------------------------------------------------------
1 | Training
2 | EDA13 aggregates each week's behavior into summary features, producing df_train and df_test; results go to output
3 | Files starting with EDA16 use the first feature set and train with different sliding windows over different time ranges; results go to output
4 | Files starting with sbb use the second feature set and train with different windows over the same time range; results go to feature
5 |
6 | Prediction
7 | Run run.sh to obtain the result CSVs for the different feature sets; the final submission is produced by voting fusion over the CSVs from different features and different sliding windows (see the sketch below)
8 |
9 | How to run:
10 | Put the raw data in data, run run.sh end to end, and the final result appears in submit
11 |
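
A minimal sketch of the voting fusion, mirroring gen_result.py (the input file names here are hypothetical):

    import pandas as pd

    keys = ['user_id', 'cate', 'shop_id']
    subs = [pd.read_csv(f) for f in ['a.csv', 'b.csv', 'c.csv']]  # per-model result CSVs

    # Union of all predicted triples, then one indicator column per model
    all_item = pd.concat(subs).drop_duplicates()
    for i, s in enumerate(subs):
        s = s.copy()
        s['label%d' % i] = 1
        all_item = all_item.merge(s, on=keys, how='left')
    all_item = all_item.fillna(0)
    votes = all_item.drop(columns=keys).sum(axis=1)

    # Keep the triples predicted by at least two models
    all_item[votes >= 2][keys].to_csv('final.csv', index=False)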
--------------------------------------------------------------------------------
/code/EDA13.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | from sklearn.model_selection import train_test_split
3 | import numpy as np
4 | from tqdm import tqdm
5 | import lightgbm as lgb
6 | from joblib import dump
7 |
8 |
9 | df_action=pd.read_csv("../data/jdata_action.csv")
10 | df_product=pd.read_csv("../data/jdata_product.csv")
11 |
12 | df_action=pd.merge(df_action,df_product,how='left',on='sku_id')
13 | df_action=df_action.groupby(['user_id','shop_id','cate'], as_index=False).sum()
14 |
15 | df_action=df_action[['user_id','shop_id','cate']]
16 | df_action_head=df_action.copy()
17 |
18 | df_action=pd.read_csv("../data/jdata_action.csv")
19 |
20 | def makeActionData(startDate,endDate):
21 | df=df_action[(df_action['action_time']>startDate)&(df_action['action_time']<=endDate)]
--------------------------------------------------------------------------------
/code/EDA16-fourWeek.py:
--------------------------------------------------------------------------------
23 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-09') \
24 | & (jdata_data['action_time']<='2018-04-15') \
25 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
26 | train_buy['label'] = 1
27 | # Candidate set: (user, cate, shop) triples active in the last four weeks, '2018-03-12' - '2018-04-08'
28 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-12') \
29 | & (jdata_data['action_time']<='2018-04-08')][['user_id','cate','shop_id']].drop_duplicates()
30 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
31 |
32 |
33 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
34 |
35 | def mapper(x):
36 | if x is not np.nan:
37 | year=int(x[:4])
38 | return 2018-year
39 |
40 |
41 | df_user['user_reg_tm']=df_user['user_reg_tm'].apply(lambda x:mapper(x))
42 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].apply(lambda x:mapper(x))
43 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].fillna(df_shop['shop_reg_tm'].mean())
44 | df_user['age']=df_user['age'].fillna(df_user['age'].mean())
45 | df_comment=pd.read_csv('../data/jdata_comment.csv')
46 | df_comment=df_comment.groupby(['sku_id'],as_index=False).sum()
47 | df_product=pd.read_csv('../data/jdata_product.csv')
48 | df_product_comment=pd.merge(df_product,df_comment,on='sku_id',how='left')
49 | df_product_comment=df_product_comment.fillna(0)
50 | df_product_comment=df_product_comment.groupby(['shop_id'],as_index=False).sum()
51 | df_product_comment=df_product_comment.drop(['sku_id','brand','cate'],axis=1)
52 | df_shop_product_comment=pd.merge(df_shop,df_product_comment,how='left',on='shop_id')
53 |
54 |
55 | train_set=pd.merge(train_set,df_user,how='left',on='user_id')
56 | train_set=pd.merge(train_set,df_shop_product_comment,on='shop_id',how='left')
57 |
58 | test_set = jdata_data[(jdata_data['action_time']>='2018-03-19') \
59 | & (jdata_data['action_time']<='2018-04-15')][['user_id','cate','shop_id']].drop_duplicates()
60 |
61 | test_set = test_set.merge(df_test,on=['user_id','cate','shop_id'],how='left')
62 |
63 | del df_train
64 | del df_test
65 |
66 | test_set=pd.merge(test_set,df_user,how='left',on='user_id')
67 | test_set=pd.merge(test_set,df_shop_product_comment,on='shop_id',how='left')
68 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
69 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
70 |
71 | test_head=test_set[['user_id','cate','shop_id']]
72 | train_head=train_set[['user_id','cate','shop_id']]
73 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
74 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
75 |
76 | # Prepare the data
77 | X_train = train_set.drop(['label'],axis=1).values
78 | y_train = train_set['label'].values
79 | X_test = test_set.values
80 |
81 | del test_set
82 | del train_set
83 |
84 | # Model utility
85 | class SBBTree():
86 | """Stacking,Bootstap,Bagging----SBBTree"""
87 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
88 | """
89 | Initializes the SBBTree.
90 | Args:
91 | params : lgb params.
92 | stacking_num : k-fold stacking.
93 | bagging_num : bootstrap num.
94 | bagging_test_size : bootstrap sample rate.
95 | num_boost_round : boost num.
96 | early_stopping_rounds : early_stopping_rounds.
97 | """
98 | self.params = params
99 | self.stacking_num = stacking_num
100 | self.bagging_num = bagging_num
101 | self.bagging_test_size = bagging_test_size
102 | self.num_boost_round = num_boost_round
103 | self.early_stopping_rounds = early_stopping_rounds
104 |
105 | self.model = lgb
106 | self.stacking_model = []
107 | self.bagging_model = []
108 |
109 | def fit(self, X, y):
110 | """ fit model. """
111 | if self.stacking_num > 1:
112 | layer_train = np.zeros((X.shape[0], 2))
113 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
114 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
115 | X_train = X[train_index]
116 | y_train = y[train_index]
117 | X_test = X[test_index]
118 | y_test = y[test_index]
119 |
120 | lgb_train = lgb.Dataset(X_train, y_train)
121 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
122 |
123 | gbm = lgb.train(self.params,
124 | lgb_train,
125 | num_boost_round=self.num_boost_round,
126 | valid_sets=lgb_eval,
127 | early_stopping_rounds=self.early_stopping_rounds,
128 | verbose_eval=300)
129 |
130 | self.stacking_model.append(gbm)
131 |
132 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
133 | layer_train[test_index, 1] = pred_y
134 |
135 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
136 | else:
137 | pass
138 | for bn in range(self.bagging_num):
139 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
140 |
141 | lgb_train = lgb.Dataset(X_train, y_train)
142 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
143 |
144 | gbm = lgb.train(self.params,
145 | lgb_train,
146 | num_boost_round=10000,
147 | valid_sets=lgb_eval,
148 | early_stopping_rounds=200,
149 | verbose_eval=300)
150 |
151 | self.bagging_model.append(gbm)
152 |
153 | def predict(self, X_pred):
154 | """ predict test data. """
155 | if self.stacking_num > 1:
156 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
157 | for sn,gbm in enumerate(self.stacking_model):
158 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
159 | test_pred[:, sn] = pred
160 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
161 | else:
162 | pass
163 | for bn,gbm in enumerate(self.bagging_model):
164 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
165 | if bn == 0:
166 | pred_out=pred
167 | else:
168 | pred_out+=pred
169 | return pred_out/self.bagging_num
170 |
171 | # Model parameters
172 | params = {
173 | 'boosting_type': 'gbdt',
174 | 'objective': 'binary',
175 | 'metric': 'auc',
176 | 'learning_rate': 0.01,
177 | 'num_leaves': 2 ** 5 - 1,
178 | 'min_child_samples': 100,
179 | 'max_bin': 100,
180 | 'subsample': .7,
181 | 'subsample_freq': 1,
182 | 'colsample_bytree': 0.7,
183 | 'min_child_weight': 0,
184 | 'scale_pos_weight': 25,
185 | 'seed': 2018,
186 | 'nthread': 16,
187 | 'verbose': 0,
188 | }
189 |
190 | # Fit the model and predict
191 | model = SBBTree(params=params,\
192 | stacking_num=5,\
193 | bagging_num=5,\
194 | bagging_test_size=0.33,\
195 | num_boost_round=10000,\
196 | early_stopping_rounds=200)
197 | model.fit(X_train, y_train)
198 | y_predict = model.predict(X_test)
199 | #y_train_predict = model.predict(X_train)
200 |
201 |
202 | test_head['pred_prob'] = y_predict
203 | test_head.to_csv('../output/EDA16-fourWeek.csv',index=False)
204 |
205 |
206 | fourOld = test_head[test_head['pred_prob'] >= 0.60][['user_id', 'cate', 'shop_id']]
207 | fourOld.to_csv('../output/res_fourWeekOld60.csv', index=False)
208 |
--------------------------------------------------------------------------------
/code/EDA16-fourWeek_rightTime.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | import lightgbm as lgb
5 | from sklearn.metrics import f1_score
6 | from sklearn.model_selection import train_test_split
7 | from sklearn.model_selection import KFold
8 | from sklearn.model_selection import StratifiedKFold
9 | pd.set_option('display.max_columns', None)
10 |
11 | df_train=pd.read_csv('../output/df_train.csv')
12 | df_test=pd.read_csv('../output/df_test.csv')
13 | df_user=pd.read_csv('../data/jdata_user.csv')
14 | df_comment=pd.read_csv('../data/jdata_comment.csv')
15 | df_shop=pd.read_csv('../data/jdata_shop.csv')
16 |
17 | # 1) Action data (jdata_action)
18 | jdata_action = pd.read_csv('../data/jdata_action.csv')
19 | # 3) Product data (jdata_product)
20 | jdata_product = pd.read_csv('../data/jdata_product.csv')
21 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
22 |
23 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-09') \
24 | & (jdata_data['action_time']<'2018-04-16') \
25 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
26 | train_buy['label'] = 1
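# Note on the "_rightTime" variants: action_time strings carry a time component, so these
# comparisons are lexicographic. '2018-04-15 10:00:00' <= '2018-04-15' is False, while
# '2018-04-15 10:00:00' < '2018-04-16' is True; using < '2018-04-16' therefore keeps the
# whole final day, which appears to be the fix these variants make over the older scripts.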
27 | # Candidate set: (user, cate, shop) triples active in the last four weeks, '2018-03-12' - '2018-04-08'
28 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-12') \
29 | & (jdata_data['action_time']<'2018-04-09')][['user_id','cate','shop_id']].drop_duplicates()
30 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
31 |
32 |
33 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
34 |
35 | def mapper(x):
36 | if x is not np.nan:
37 | year=int(x[:4])
38 | return 2018-year
39 |
40 |
41 | df_user['user_reg_tm']=df_user['user_reg_tm'].apply(lambda x:mapper(x))
42 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].apply(lambda x:mapper(x))
43 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].fillna(df_shop['shop_reg_tm'].mean())
44 | df_user['age']=df_user['age'].fillna(df_user['age'].mean())
45 | df_comment=pd.read_csv('../data/jdata_comment.csv')
46 | df_comment=df_comment.groupby(['sku_id'],as_index=False).sum()
47 | df_product=pd.read_csv('../data/jdata_product.csv')
48 | df_product_comment=pd.merge(df_product,df_comment,on='sku_id',how='left')
49 | df_product_comment=df_product_comment.fillna(0)
50 | df_product_comment=df_product_comment.groupby(['shop_id'],as_index=False).sum()
51 | df_product_comment=df_product_comment.drop(['sku_id','brand','cate'],axis=1)
52 | df_shop_product_comment=pd.merge(df_shop,df_product_comment,how='left',on='shop_id')
53 |
54 |
55 | train_set=pd.merge(train_set,df_user,how='left',on='user_id')
56 | train_set=pd.merge(train_set,df_shop_product_comment,on='shop_id',how='left')
57 |
58 | test_set = jdata_data[(jdata_data['action_time']>='2018-03-19') \
59 | & (jdata_data['action_time']<'2018-04-16')][['user_id','cate','shop_id']].drop_duplicates()
60 |
61 | test_set = test_set.merge(df_test,on=['user_id','cate','shop_id'],how='left')
62 |
63 | del df_train
64 | del df_test
65 |
66 | test_set=pd.merge(test_set,df_user,how='left',on='user_id')
67 | test_set=pd.merge(test_set,df_shop_product_comment,on='shop_id',how='left')
68 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
69 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
70 |
71 | test_head=test_set[['user_id','cate','shop_id']]
72 | train_head=train_set[['user_id','cate','shop_id']]
73 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
74 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
75 |
76 | # Prepare the data
77 | X_train = train_set.drop(['label'],axis=1).values
78 | y_train = train_set['label'].values
79 | X_test = test_set.values
80 |
81 | del test_set
82 | del train_set
83 |
84 | # Model utility
85 | class SBBTree():
86 | """Stacking,Bootstap,Bagging----SBBTree"""
87 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
88 | """
89 | Initializes the SBBTree.
90 | Args:
91 | params : lgb params.
92 | stacking_num : k-fold stacking.
93 | bagging_num : bootstrap num.
94 | bagging_test_size : bootstrap sample rate.
95 | num_boost_round : boost num.
96 | early_stopping_rounds : early_stopping_rounds.
97 | """
98 | self.params = params
99 | self.stacking_num = stacking_num
100 | self.bagging_num = bagging_num
101 | self.bagging_test_size = bagging_test_size
102 | self.num_boost_round = num_boost_round
103 | self.early_stopping_rounds = early_stopping_rounds
104 |
105 | self.model = lgb
106 | self.stacking_model = []
107 | self.bagging_model = []
108 |
109 | def fit(self, X, y):
110 | """ fit model. """
111 | if self.stacking_num > 1:
112 | layer_train = np.zeros((X.shape[0], 2))
113 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
114 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
115 | X_train = X[train_index]
116 | y_train = y[train_index]
117 | X_test = X[test_index]
118 | y_test = y[test_index]
119 |
120 | lgb_train = lgb.Dataset(X_train, y_train)
121 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
122 |
123 | gbm = lgb.train(self.params,
124 | lgb_train,
125 | num_boost_round=self.num_boost_round,
126 | valid_sets=lgb_eval,
127 | early_stopping_rounds=self.early_stopping_rounds,
128 | verbose_eval=300)
129 |
130 | self.stacking_model.append(gbm)
131 |
132 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
133 | layer_train[test_index, 1] = pred_y
134 |
135 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
136 | else:
137 | pass
138 | for bn in range(self.bagging_num):
139 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
140 |
141 | lgb_train = lgb.Dataset(X_train, y_train)
142 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
143 |
144 | gbm = lgb.train(self.params,
145 | lgb_train,
146 | num_boost_round=10000,
147 | valid_sets=lgb_eval,
148 | early_stopping_rounds=200,
149 | verbose_eval=300)
150 |
151 | self.bagging_model.append(gbm)
152 |
153 | def predict(self, X_pred):
154 | """ predict test data. """
155 | if self.stacking_num > 1:
156 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
157 | for sn,gbm in enumerate(self.stacking_model):
158 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
159 | test_pred[:, sn] = pred
160 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
161 | else:
162 | pass
163 | for bn,gbm in enumerate(self.bagging_model):
164 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
165 | if bn == 0:
166 | pred_out=pred
167 | else:
168 | pred_out+=pred
169 | return pred_out/self.bagging_num
170 |
171 | # Model parameters
172 | params = {
173 | 'boosting_type': 'gbdt',
174 | 'objective': 'binary',
175 | 'metric': 'auc',
176 | 'learning_rate': 0.01,
177 | 'num_leaves': 2 ** 5 - 1,
178 | 'min_child_samples': 100,
179 | 'max_bin': 100,
180 | 'subsample': .7,
181 | 'subsample_freq': 1,
182 | 'colsample_bytree': 0.7,
183 | 'min_child_weight': 0,
184 | 'scale_pos_weight': 25,
185 | 'seed': 2018,
186 | 'nthread': 16,
187 | 'verbose': 0,
188 | }
189 |
190 | # Fit the model and predict
191 | model = SBBTree(params=params,\
192 | stacking_num=5,\
193 | bagging_num=5,\
194 | bagging_test_size=0.33,\
195 | num_boost_round=10000,\
196 | early_stopping_rounds=200)
197 | model.fit(X_train, y_train)
198 | y_predict = model.predict(X_test)
199 | #y_train_predict = model.predict(X_train)
200 |
201 |
202 | test_head['pred_prob'] = y_predict
203 | test_head.to_csv('../output/EDA16-fourWeek_rightTime.csv',index=False)
204 |
205 |
206 | fourNew = test_head[test_head['pred_prob'] >= 0.675][['user_id', 'cate', 'shop_id']]
207 | fourNew.to_csv('../output/res_fourWeekNew675.csv', index=False)
208 |
--------------------------------------------------------------------------------
/code/EDA16-threeWeek.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | import lightgbm as lgb
5 | from sklearn.metrics import f1_score
6 | from sklearn.model_selection import train_test_split
7 | from sklearn.model_selection import KFold
8 | from sklearn.model_selection import StratifiedKFold
9 | pd.set_option('display.max_columns', None)
10 |
11 | df_train=pd.read_csv('../output/df_train.csv')
12 | df_test=pd.read_csv('../output/df_test.csv')
13 | df_user=pd.read_csv('../data/jdata_user.csv')
14 | df_comment=pd.read_csv('../data/jdata_comment.csv')
15 | df_shop=pd.read_csv('../data/jdata_shop.csv')
16 |
17 | # 1) Action data (jdata_action)
18 | jdata_action = pd.read_csv('../data/jdata_action.csv')
19 | # 3) Product data (jdata_product)
20 | jdata_product = pd.read_csv('../data/jdata_product.csv')
21 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
22 |
23 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-09') \
24 | & (jdata_data['action_time']<='2018-04-15') \
25 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
26 | train_buy['label'] = 1
27 | # Candidate set: (user, cate, shop) triples active in the last three weeks, '2018-03-19' - '2018-04-08'
28 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-19') \
29 | & (jdata_data['action_time']<='2018-04-08')][['user_id','cate','shop_id']].drop_duplicates()
30 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
31 |
32 |
33 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
34 |
35 | def mapper(x):
36 | if x is not np.nan:
37 | year=int(x[:4])
38 | return 2018-year
39 |
40 |
41 | df_user['user_reg_tm']=df_user['user_reg_tm'].apply(lambda x:mapper(x))
42 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].apply(lambda x:mapper(x))
43 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].fillna(df_shop['shop_reg_tm'].mean())
44 | df_user['age']=df_user['age'].fillna(df_user['age'].mean())
45 | df_comment=pd.read_csv('../data/jdata_comment.csv')
46 | df_comment=df_comment.groupby(['sku_id'],as_index=False).sum()
47 | df_product=pd.read_csv('../data/jdata_product.csv')
48 | df_product_comment=pd.merge(df_product,df_comment,on='sku_id',how='left')
49 | df_product_comment=df_product_comment.fillna(0)
50 | df_product_comment=df_product_comment.groupby(['shop_id'],as_index=False).sum()
51 | df_product_comment=df_product_comment.drop(['sku_id','brand','cate'],axis=1)
52 | df_shop_product_comment=pd.merge(df_shop,df_product_comment,how='left',on='shop_id')
53 |
54 |
55 | train_set=pd.merge(train_set,df_user,how='left',on='user_id')
56 | train_set=pd.merge(train_set,df_shop_product_comment,on='shop_id',how='left')
57 |
58 | test_set = jdata_data[(jdata_data['action_time']>='2018-03-26') \
59 | & (jdata_data['action_time']<='2018-04-15')][['user_id','cate','shop_id']].drop_duplicates()
60 |
61 | test_set = test_set.merge(df_test,on=['user_id','cate','shop_id'],how='left')
62 |
63 | del df_train
64 | del df_test
65 |
66 | test_set=pd.merge(test_set,df_user,how='left',on='user_id')
67 | test_set=pd.merge(test_set,df_shop_product_comment,on='shop_id',how='left')
68 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
69 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
70 |
71 | test_head=test_set[['user_id','cate','shop_id']]
72 | train_head=train_set[['user_id','cate','shop_id']]
73 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
74 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
75 |
76 | # Prepare the data
77 | X_train = train_set.drop(['label'],axis=1).values
78 | y_train = train_set['label'].values
79 | X_test = test_set.values
80 |
81 | del test_set
82 | del train_set
83 |
84 | # Model utility
85 | class SBBTree():
86 | """Stacking,Bootstap,Bagging----SBBTree"""
87 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
88 | """
89 | Initializes the SBBTree.
90 | Args:
91 | params : lgb params.
92 | stacking_num : k-fold stacking.
93 | bagging_num : bootstrap num.
94 | bagging_test_size : bootstrap sample rate.
95 | num_boost_round : boost num.
96 | early_stopping_rounds : early_stopping_rounds.
97 | """
98 | self.params = params
99 | self.stacking_num = stacking_num
100 | self.bagging_num = bagging_num
101 | self.bagging_test_size = bagging_test_size
102 | self.num_boost_round = num_boost_round
103 | self.early_stopping_rounds = early_stopping_rounds
104 |
105 | self.model = lgb
106 | self.stacking_model = []
107 | self.bagging_model = []
108 |
109 | def fit(self, X, y):
110 | """ fit model. """
111 | if self.stacking_num > 1:
112 | layer_train = np.zeros((X.shape[0], 2))
113 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
114 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
115 | X_train = X[train_index]
116 | y_train = y[train_index]
117 | X_test = X[test_index]
118 | y_test = y[test_index]
119 |
120 | lgb_train = lgb.Dataset(X_train, y_train)
121 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
122 |
123 | gbm = lgb.train(self.params,
124 | lgb_train,
125 | num_boost_round=self.num_boost_round,
126 | valid_sets=lgb_eval,
127 | early_stopping_rounds=self.early_stopping_rounds,
128 | verbose_eval=300)
129 |
130 | self.stacking_model.append(gbm)
131 |
132 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
133 | layer_train[test_index, 1] = pred_y
134 |
135 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
136 | else:
137 | pass
138 | for bn in range(self.bagging_num):
139 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
140 |
141 | lgb_train = lgb.Dataset(X_train, y_train)
142 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
143 |
144 | gbm = lgb.train(self.params,
145 | lgb_train,
146 | num_boost_round=10000,
147 | valid_sets=lgb_eval,
148 | early_stopping_rounds=200,
149 | verbose_eval=300)
150 |
151 | self.bagging_model.append(gbm)
152 |
153 | def predict(self, X_pred):
154 | """ predict test data. """
155 | if self.stacking_num > 1:
156 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
157 | for sn,gbm in enumerate(self.stacking_model):
158 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
159 | test_pred[:, sn] = pred
160 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
161 | else:
162 | pass
163 | for bn,gbm in enumerate(self.bagging_model):
164 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
165 | if bn == 0:
166 | pred_out=pred
167 | else:
168 | pred_out+=pred
169 | return pred_out/self.bagging_num
170 |
171 | # Model parameters
172 | params = {
173 | 'boosting_type': 'gbdt',
174 | 'objective': 'binary',
175 | 'metric': 'auc',
176 | 'learning_rate': 0.01,
177 | 'num_leaves': 2 ** 5 - 1,
178 | 'min_child_samples': 100,
179 | 'max_bin': 100,
180 | 'subsample': .7,
181 | 'subsample_freq': 1,
182 | 'colsample_bytree': 0.7,
183 | 'min_child_weight': 0,
184 | 'scale_pos_weight': 25,
185 | 'seed': 2018,
186 | 'nthread': 16,
187 | 'verbose': 0,
188 | }
189 |
190 | # Fit the model and predict
191 | model = SBBTree(params=params,\
192 | stacking_num=5,\
193 | bagging_num=5,\
194 | bagging_test_size=0.33,\
195 | num_boost_round=10000,\
196 | early_stopping_rounds=200)
197 | model.fit(X_train, y_train)
198 | y_predict = model.predict(X_test)
199 | #y_train_predict = model.predict(X_train)
200 |
201 |
202 | test_head['pred_prob'] = y_predict
203 | test_head.to_csv('../output/EDA16-threeWeek.csv',index=False)
204 |
205 |
206 | threeOld = test_head[test_head['pred_prob'] >= 0.595][['user_id', 'cate', 'shop_id']]
207 | threeOld.to_csv('../output/res_threeWeekOld595.csv', index=False)
208 |
--------------------------------------------------------------------------------
/code/EDA16-threeWeek_rightTime.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | import lightgbm as lgb
5 | from sklearn.metrics import f1_score
6 | from sklearn.model_selection import train_test_split
7 | from sklearn.model_selection import KFold
8 | from sklearn.model_selection import StratifiedKFold
9 | pd.set_option('display.max_columns', None)
10 |
11 | df_train=pd.read_csv('../output/df_train.csv')
12 | df_test=pd.read_csv('../output/df_test.csv')
13 | df_user=pd.read_csv('../data/jdata_user.csv')
14 | df_comment=pd.read_csv('../data/jdata_comment.csv')
15 | df_shop=pd.read_csv('../data/jdata_shop.csv')
16 |
17 | # 1) Action data (jdata_action)
18 | jdata_action = pd.read_csv('../data/jdata_action.csv')
19 | # 3) Product data (jdata_product)
20 | jdata_product = pd.read_csv('../data/jdata_product.csv')
21 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
22 |
23 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-09') \
24 | & (jdata_data['action_time']<'2018-04-16') \
25 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
26 | train_buy['label'] = 1
27 | # Candidate set: (user, cate, shop) triples active in the last three weeks, '2018-03-19' - '2018-04-08'
28 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-19') \
29 | & (jdata_data['action_time']<'2018-04-09')][['user_id','cate','shop_id']].drop_duplicates()
30 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
31 |
32 |
33 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
34 |
35 | def mapper(x):
36 | if x is not np.nan:
37 | year=int(x[:4])
38 | return 2018-year
39 |
40 |
41 | df_user['user_reg_tm']=df_user['user_reg_tm'].apply(lambda x:mapper(x))
42 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].apply(lambda x:mapper(x))
43 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].fillna(df_shop['shop_reg_tm'].mean())
44 | df_user['age']=df_user['age'].fillna(df_user['age'].mean())
45 | df_comment=pd.read_csv('../data/jdata_comment.csv')
46 | df_comment=df_comment.groupby(['sku_id'],as_index=False).sum()
47 | df_product=pd.read_csv('../data/jdata_product.csv')
48 | df_product_comment=pd.merge(df_product,df_comment,on='sku_id',how='left')
49 | df_product_comment=df_product_comment.fillna(0)
50 | df_product_comment=df_product_comment.groupby(['shop_id'],as_index=False).sum()
51 | df_product_comment=df_product_comment.drop(['sku_id','brand','cate'],axis=1)
52 | df_shop_product_comment=pd.merge(df_shop,df_product_comment,how='left',on='shop_id')
53 |
54 |
55 | train_set=pd.merge(train_set,df_user,how='left',on='user_id')
56 | train_set=pd.merge(train_set,df_shop_product_comment,on='shop_id',how='left')
57 |
58 | test_set = jdata_data[(jdata_data['action_time']>='2018-03-26') \
59 | & (jdata_data['action_time']<'2018-04-16')][['user_id','cate','shop_id']].drop_duplicates()
60 |
61 | test_set = test_set.merge(df_test,on=['user_id','cate','shop_id'],how='left')
62 |
63 | del df_train
64 | del df_test
65 |
66 | test_set=pd.merge(test_set,df_user,how='left',on='user_id')
67 | test_set=pd.merge(test_set,df_shop_product_comment,on='shop_id',how='left')
68 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
69 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
70 |
71 | test_head=test_set[['user_id','cate','shop_id']]
72 | train_head=train_set[['user_id','cate','shop_id']]
73 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
74 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
75 |
76 | # Prepare the data
77 | X_train = train_set.drop(['label'],axis=1).values
78 | y_train = train_set['label'].values
79 | X_test = test_set.values
80 |
81 | del test_set
82 | del train_set
83 |
84 | # Model utility
85 | class SBBTree():
86 | """Stacking,Bootstap,Bagging----SBBTree"""
87 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
88 | """
89 | Initializes the SBBTree.
90 | Args:
91 | params : lgb params.
92 | stacking_num : k-fold stacking.
93 | bagging_num : bootstrap num.
94 | bagging_test_size : bootstrap sample rate.
95 | num_boost_round : boost num.
96 | early_stopping_rounds : early_stopping_rounds.
97 | """
98 | self.params = params
99 | self.stacking_num = stacking_num
100 | self.bagging_num = bagging_num
101 | self.bagging_test_size = bagging_test_size
102 | self.num_boost_round = num_boost_round
103 | self.early_stopping_rounds = early_stopping_rounds
104 |
105 | self.model = lgb
106 | self.stacking_model = []
107 | self.bagging_model = []
108 |
109 | def fit(self, X, y):
110 | """ fit model. """
111 | if self.stacking_num > 1:
112 | layer_train = np.zeros((X.shape[0], 2))
113 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
114 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
115 | X_train = X[train_index]
116 | y_train = y[train_index]
117 | X_test = X[test_index]
118 | y_test = y[test_index]
119 |
120 | lgb_train = lgb.Dataset(X_train, y_train)
121 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
122 |
123 | gbm = lgb.train(self.params,
124 | lgb_train,
125 | num_boost_round=self.num_boost_round,
126 | valid_sets=lgb_eval,
127 | early_stopping_rounds=self.early_stopping_rounds,
128 | verbose_eval=300)
129 |
130 | self.stacking_model.append(gbm)
131 |
132 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
133 | layer_train[test_index, 1] = pred_y
134 |
135 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
136 | else:
137 | pass
138 | for bn in range(self.bagging_num):
139 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
140 |
141 | lgb_train = lgb.Dataset(X_train, y_train)
142 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
143 |
144 | gbm = lgb.train(self.params,
145 | lgb_train,
146 | num_boost_round=10000,
147 | valid_sets=lgb_eval,
148 | early_stopping_rounds=200,
149 | verbose_eval=300)
150 |
151 | self.bagging_model.append(gbm)
152 |
153 | def predict(self, X_pred):
154 | """ predict test data. """
155 | if self.stacking_num > 1:
156 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
157 | for sn,gbm in enumerate(self.stacking_model):
158 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
159 | test_pred[:, sn] = pred
160 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
161 | else:
162 | pass
163 | for bn,gbm in enumerate(self.bagging_model):
164 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
165 | if bn == 0:
166 | pred_out=pred
167 | else:
168 | pred_out+=pred
169 | return pred_out/self.bagging_num
170 |
171 | # Model parameters
172 | params = {
173 | 'boosting_type': 'gbdt',
174 | 'objective': 'binary',
175 | 'metric': 'auc',
176 | 'learning_rate': 0.01,
177 | 'num_leaves': 2 ** 5 - 1,
178 | 'min_child_samples': 100,
179 | 'max_bin': 100,
180 | 'subsample': .7,
181 | 'subsample_freq': 1,
182 | 'colsample_bytree': 0.7,
183 | 'min_child_weight': 0,
184 | 'scale_pos_weight': 25,
185 | 'seed': 2018,
186 | 'nthread': 16,
187 | 'verbose': 0,
188 | }
189 |
190 | # Fit the model and predict
191 | model = SBBTree(params=params,\
192 | stacking_num=5,\
193 | bagging_num=5,\
194 | bagging_test_size=0.33,\
195 | num_boost_round=10000,\
196 | early_stopping_rounds=200)
197 | model.fit(X_train, y_train)
198 | y_predict = model.predict(X_test)
199 | #y_train_predict = model.predict(X_train)
200 |
201 |
202 | test_head['pred_prob'] = y_predict
203 | test_head.to_csv('../output/EDA16-threeWeek_rightTime.csv',index=False)
204 |
205 | threeNew = test_head[test_head['pred_prob'] >= 0.65][['user_id', 'cate', 'shop_id']]
206 | threeNew.to_csv('../output/res_threeWeekNew65.csv', index=False)
207 |
--------------------------------------------------------------------------------
/code/EDA16-twoWeek.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | import lightgbm as lgb
5 | from sklearn.metrics import f1_score
6 | from sklearn.model_selection import train_test_split
7 | from sklearn.model_selection import KFold
8 | from sklearn.model_selection import StratifiedKFold
9 | pd.set_option('display.max_columns', None)
10 |
11 | df_train=pd.read_csv('../output/df_train.csv')
12 | df_test=pd.read_csv('../output/df_test.csv')
13 | df_user=pd.read_csv('../data/jdata_user.csv')
14 | df_comment=pd.read_csv('../data/jdata_comment.csv')
15 | df_shop=pd.read_csv('../data/jdata_shop.csv')
16 |
17 | # 1) Action data (jdata_action)
18 | jdata_action = pd.read_csv('../data/jdata_action.csv')
19 | # 3) Product data (jdata_product)
20 | jdata_product = pd.read_csv('../data/jdata_product.csv')
21 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
22 |
23 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-09') \
24 | & (jdata_data['action_time']<='2018-04-15') \
25 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
26 | train_buy['label'] = 1
27 | # Candidate set: (user, cate, shop) triples active in the last two weeks, '2018-03-26' - '2018-04-08'
28 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-26') \
29 | & (jdata_data['action_time']<='2018-04-08')][['user_id','cate','shop_id']].drop_duplicates()
30 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
31 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
32 | def mapper(x):
33 | if x is not np.nan:
34 | year=int(x[:4])
35 | return 2018-year
36 |
37 | df_user['user_reg_tm']=df_user['user_reg_tm'].apply(lambda x:mapper(x))
38 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].apply(lambda x:mapper(x))
39 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].fillna(df_shop['shop_reg_tm'].mean())
40 | df_user['age']=df_user['age'].fillna(df_user['age'].mean())
41 | df_comment=pd.read_csv('../data/jdata_comment.csv')
42 | df_comment=df_comment.groupby(['sku_id'],as_index=False).sum()
43 | df_product=pd.read_csv('../data/jdata_product.csv')
44 | df_product_comment=pd.merge(df_product,df_comment,on='sku_id',how='left')
45 | df_product_comment=df_product_comment.fillna(0)
46 | df_product_comment=df_product_comment.groupby(['shop_id'],as_index=False).sum()
47 | df_product_comment=df_product_comment.drop(['sku_id','brand','cate'],axis=1)
48 | df_shop_product_comment=pd.merge(df_shop,df_product_comment,how='left',on='shop_id')
49 |
50 | train_set=pd.merge(train_set,df_user,how='left',on='user_id')
51 | train_set=pd.merge(train_set,df_shop_product_comment,on='shop_id',how='left')
52 | test_set = jdata_data[(jdata_data['action_time']>='2018-04-02') \
53 | & (jdata_data['action_time']<='2018-04-15')][['user_id','cate','shop_id']].drop_duplicates()
54 | test_set = test_set.merge(df_test,on=['user_id','cate','shop_id'],how='left')
55 |
56 | del df_train
57 | del df_test
58 | test_set=pd.merge(test_set,df_user,how='left',on='user_id')
59 | test_set=pd.merge(test_set,df_shop_product_comment,on='shop_id',how='left')
60 | test_set=test_set.sort_values('user_id')
61 | train_set=train_set.sort_values('user_id')
62 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
63 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
64 |
65 | test_head=test_set[['user_id','cate','shop_id']]
66 | train_head=train_set[['user_id','cate','shop_id']]
67 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
68 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
69 |
70 | # Prepare the data
71 | X_train = train_set.drop(['label'],axis=1).values
72 | y_train = train_set['label'].values
73 | X_test = test_set.values
74 | del test_set
75 | del train_set
76 |
77 | # Model utility
78 | class SBBTree():
79 | """Stacking,Bootstap,Bagging----SBBTree"""
80 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
81 | """
82 | Initializes the SBBTree.
83 | Args:
84 | params : lgb params.
85 | stacking_num : k-fold stacking.
86 | bagging_num : bootstrap num.
87 | bagging_test_size : bootstrap sample rate.
88 | num_boost_round : boost num.
89 | early_stopping_rounds : early_stopping_rounds.
90 | """
91 | self.params = params
92 | self.stacking_num = stacking_num
93 | self.bagging_num = bagging_num
94 | self.bagging_test_size = bagging_test_size
95 | self.num_boost_round = num_boost_round
96 | self.early_stopping_rounds = early_stopping_rounds
97 |
98 | self.model = lgb
99 | self.stacking_model = []
100 | self.bagging_model = []
101 |
102 | def fit(self, X, y):
103 | """ fit model. """
104 | if self.stacking_num > 1:
105 | layer_train = np.zeros((X.shape[0], 2))
106 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
107 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
108 | X_train = X[train_index]
109 | y_train = y[train_index]
110 | X_test = X[test_index]
111 | y_test = y[test_index]
112 |
113 | lgb_train = lgb.Dataset(X_train, y_train)
114 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
115 |
116 | gbm = lgb.train(self.params,
117 | lgb_train,
118 | num_boost_round=self.num_boost_round,
119 | valid_sets=lgb_eval,
120 | early_stopping_rounds=self.early_stopping_rounds,
121 | verbose_eval=300)
122 |
123 | self.stacking_model.append(gbm)
124 |
125 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
126 | layer_train[test_index, 1] = pred_y
127 |
128 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
129 | else:
130 | pass
131 | for bn in range(self.bagging_num):
132 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
133 |
134 | lgb_train = lgb.Dataset(X_train, y_train)
135 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
136 |
137 | gbm = lgb.train(self.params,
138 | lgb_train,
139 | num_boost_round=10000,
140 | valid_sets=lgb_eval,
141 | early_stopping_rounds=200,
142 | verbose_eval=300)
143 |
144 | self.bagging_model.append(gbm)
145 |
146 | def predict(self, X_pred):
147 | """ predict test data. """
148 | if self.stacking_num > 1:
149 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
150 | for sn,gbm in enumerate(self.stacking_model):
151 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
152 | test_pred[:, sn] = pred
153 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
154 | else:
155 | pass
156 | for bn,gbm in enumerate(self.bagging_model):
157 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
158 | if bn == 0:
159 | pred_out=pred
160 | else:
161 | pred_out+=pred
162 | return pred_out/self.bagging_num
163 |
164 | # Model parameters
165 | params = {
166 | 'boosting_type': 'gbdt',
167 | 'objective': 'binary',
168 | 'metric': 'auc',
169 | 'learning_rate': 0.01,
170 | 'num_leaves': 2 ** 5 - 1,
171 | 'min_child_samples': 100,
172 | 'max_bin': 100,
173 | 'subsample': .7,
174 | 'subsample_freq': 1,
175 | 'colsample_bytree': 0.7,
176 | 'min_child_weight': 0,
177 | 'scale_pos_weight': 25,
178 | 'seed': 2018,
179 | 'nthread': 16,
180 | 'verbose': 0,
181 | }
182 |
183 | # Fit the model and predict
184 | model = SBBTree(params=params,\
185 | stacking_num=5,\
186 | bagging_num=5,\
187 | bagging_test_size=0.33,\
188 | num_boost_round=10000,\
189 | early_stopping_rounds=200)
190 | model.fit(X_train, y_train)
191 | y_predict = model.predict(X_test)
192 | y_train_predict = model.predict(X_train)
193 |
194 | test_head['pred_prob'] = y_predict
195 |
196 |
197 | test_head.to_csv('../output/EDA16-twoWeek.csv', index=False)
198 |
199 | twoOld = test_head[test_head['pred_prob'] >= 0.5205][['user_id', 'cate', 'shop_id']]
200 | twoOld.to_csv('../output/res_twoWeekOld5205.csv', index=False)
201 |
202 |
203 |
204 |
205 |
--------------------------------------------------------------------------------
/code/cat_model/para.py:
--------------------------------------------------------------------------------
1 | # Parameters shared by a fellow competitor -- thanks!
2 | import catboost as ctb
3 | ctb_params = {
4 | 'n_estimators': 10000,
5 | 'learning_rate': 0.02,
6 | 'random_seed': 4590,
7 | 'reg_lambda': 0.08,
8 | 'subsample': 0.7,
9 | 'bootstrap_type': 'Bernoulli',
10 | 'boosting_type': 'Plain',
11 | 'one_hot_max_size': 10,
12 | 'rsm': 0.5,
13 | 'leaf_estimation_iterations': 5,
14 | 'use_best_model': True,
15 | 'max_depth': 6,
16 | 'verbose': -1,
17 | 'thread_count': 4
18 | }
19 | ctb_model = ctb.CatBoostRegressor(**ctb_params)
20 |
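# A minimal usage sketch: use_best_model=True requires a validation set at fit time.
# X and y below are hypothetical (the feature matrix and target from the training scripts).
#   from sklearn.model_selection import train_test_split
#   X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=4590)
#   ctb_model.fit(X_tr, y_tr, eval_set=(X_val, y_val), early_stopping_rounds=200)
#   y_pred = ctb_model.predict(X_val)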
--------------------------------------------------------------------------------
/code/df_train_test.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | from sklearn.model_selection import train_test_split
3 | import numpy as np
4 | from tqdm import tqdm
5 | import lightgbm as lgb
6 | from joblib import dump
7 | import time
8 |
9 | time_0 = time.process_time()
10 | print('>> Loading data')
11 | df_action=pd.read_csv("./jdata_action.csv")
12 | df_product=pd.read_csv("./jdata_product.csv")
13 |
14 | df_action=pd.merge(df_action,df_product,how='left',on='sku_id')
15 | df_action=df_action.groupby(['user_id','shop_id','cate'], as_index=False).sum()
16 | time_1 = time.process_time()
17 | print('<< Data loaded! Elapsed', time_1 - time_0, 's')
18 |
19 | df_action=df_action[['user_id','shop_id','cate']]
20 | df_action_head=df_action.copy()
21 |
22 | df_action=pd.read_csv("./jdata_action.csv")
23 |
24 | def makeActionData(startDate,endDate):
25 | df=df_action[(df_action['action_time']>startDate)&(df_action['action_time']<=endDate)]
--------------------------------------------------------------------------------
/code/gen_result.py:
--------------------------------------------------------------------------------
14 | # print('sbb3_3 best_len',len(sbb3_3[sbb3_3['pred_prob']>=best_u]))
15 | sbb3_3[sbb3_3['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb3_3.csv',index=False)
16 |
17 | # sbb3_2['pred_prob'] = y_predict
18 | best_u = 0.602
19 | # Set the threshold and count the rows kept
20 | # print('sbb3_2 best_len',len(sbb3_2[sbb3_2['pred_prob']>=best_u]))
21 | sbb3_2[sbb3_2['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb3_2.csv',index=False)
22 | # sbb3_1['pred_prob'] = y_predict
23 | best_u = 0.521
24 | # Set the threshold and count the rows kept
25 | # print('sbb3_1 best_len',len(sbb3_1[sbb3_1['pred_prob']>=best_u]))
26 | sbb3_1[sbb3_1['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb3_1.csv',index=False)
27 |
28 |
29 | n_3_593 = pd.read_csv('../output/res_threeWeekNew65.csv')
30 | n_4_590 = pd.read_csv('../output/res_fourWeekNew675.csv')
31 | o_2_573 = pd.read_csv('../output/res_twoWeekOld5205.csv')
32 | o_3_583 = pd.read_csv('../output/res_threeWeekOld595.csv')
33 | o_4_578 = pd.read_csv('../output/res_fourWeekOld60.csv')
34 | sbb_1 = pd.read_csv('../output/sbb3_1.csv')
35 | sbb_2 = pd.read_csv('../output/sbb3_2.csv')
36 | sbb_3 = pd.read_csv('../output/sbb3_3.csv')
37 |
38 |
39 |
40 | all_item = pd.concat([n_3_593,n_4_590,o_2_573,o_3_583,o_4_578,sbb_1,sbb_2,sbb_3],axis=0)
41 | all_item = all_item.drop_duplicates()
42 |
43 |
44 | n_3_593['label1'] = 1
45 | n_4_590['label2'] = 1
46 | o_2_573['label3'] = 1
47 | o_3_583['label4'] = 1
48 | o_4_578['label5'] = 1
49 | sbb_1['label6'] = 1
50 | sbb_2['label7'] = 1
51 | sbb_3['label8'] = 1
52 |
53 |
54 |
55 | all_item = all_item.merge(n_3_593,on=['user_id','cate','shop_id'],how='left')
56 | all_item = all_item.merge(n_4_590,on=['user_id','cate','shop_id'],how='left')
57 | all_item = all_item.merge(o_2_573,on=['user_id','cate','shop_id'],how='left')
58 | all_item = all_item.merge(o_3_583,on=['user_id','cate','shop_id'],how='left')
59 | all_item = all_item.merge(o_4_578,on=['user_id','cate','shop_id'],how='left')
60 | all_item = all_item.merge(sbb_1,on=['user_id','cate','shop_id'],how='left')
61 | all_item = all_item.merge(sbb_2,on=['user_id','cate','shop_id'],how='left')
62 | all_item = all_item.merge(sbb_3,on=['user_id','cate','shop_id'],how='left')
63 |
64 |
65 | all_item = all_item.fillna(0)
66 |
67 |
68 | all_item['sum'] = all_item['label1']+all_item['label2']+all_item['label3']+all_item['label4']+all_item['label5']+all_item['label6']+all_item['label7']+all_item['label8']
69 |
70 | all_item[all_item['sum']>=2][['user_id',
71 | 'cate','shop_id']].to_csv('../submit/8_model_2.csv',index=False)
72 |
73 |
74 |
--------------------------------------------------------------------------------
/code/gen_result2.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | ### Intersect within the same feature set; vote across feature sets
4 | sbb4_3 = pd.read_csv('../feature/4_sbb_get_3_test.csv')
5 | sbb4_2 = pd.read_csv('../feature/4_sbb_get_2_test.csv')
6 | sbb4_1 = pd.read_csv('../feature/4_sbb_get_1_test.csv')
7 |
8 | from tqdm import tqdm
9 | # sbb4_1['pred_prob'] = y_predict
10 | best_u = 0.662
11 | # Set the threshold and count the rows kept
12 | # print('sbb4_1 best_len',len(sbb4_1[sbb4_1['pred_prob']>=best_u]))
13 | sbb4_1[sbb4_1['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb4_1.csv',index=False)
14 |
15 | from tqdm import tqdm
16 | # sbb4_1['pred_prob'] = y_predict
17 | best_u = 0.500
18 | # Set the threshold and count the rows kept
19 | # print('sbb4_2 best_len',len(sbb4_2[sbb4_2['pred_prob']>=best_u]))
20 | sbb4_2[sbb4_2['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb4_2.csv',index=False)
21 |
22 | from tqdm import tqdm
23 | # sbb4_3['pred_prob'] = y_predict
24 | best_u = 0.685
25 | # Set the threshold and count the rows kept
26 | # print('sbb4_3 best_len',len(sbb4_3[sbb4_3['pred_prob']>=best_u]))
27 | sbb4_3[sbb4_3['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb4_3.csv',index=False)
28 |
29 | sbb3_3 = pd.read_csv('../feature/3_sbb_get_3_test.csv')
30 | sbb3_2 = pd.read_csv('../feature/3_sbb_get_2_test.csv')
31 | sbb3_1 = pd.read_csv('../feature/3_sbb_get_1_test.csv')
32 |
33 | from tqdm import tqdm
34 | # sbb3_3['pred_prob'] = y_predict
35 | best_u = 0.686
36 | # Set the threshold and count the rows kept
37 | # print('sbb3_3 best_len',len(sbb3_3[sbb3_3['pred_prob']>=best_u]))
38 | sbb3_3[sbb3_3['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb3_3.csv',index=False)
39 |
40 | from tqdm import tqdm
41 | # sbb3_2['pred_prob'] = y_predict
42 | best_u = 0.602
43 | # Set the threshold and count the rows kept
44 | # print('sbb3_2 best_len',len(sbb3_2[sbb3_2['pred_prob']>=best_u]))
45 | sbb3_2[sbb3_2['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb3_2.csv',index=False)
46 |
47 | from tqdm import tqdm
48 | # sbb3_1['pred_prob'] = y_predict
49 | best_u = 0.521
50 | # Set the threshold and count the rows kept
51 | # print('sbb3_1 best_len',len(sbb3_1[sbb3_1['pred_prob']>=best_u]))
52 | sbb3_1[sbb3_1['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb3_1.csv',index=False)
53 |
54 | sbb2_3 = pd.read_csv('../feature/2_sbb_get_3_test.csv')
55 | sbb2_2 = pd.read_csv('../feature/2_sbb_get_2_test.csv')
56 | sbb2_1 = pd.read_csv('../feature/2_sbb_get_1_test.csv')
57 |
58 | from tqdm import tqdm
59 | # sbb2_1['pred_prob'] = y_predict
60 | best_u = 0.495
61 | # Set the threshold and count the rows kept
62 | # print('sbb2_1 best_len',len(sbb2_1[sbb2_1['pred_prob']>=best_u]))
63 | sbb2_1[sbb2_1['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb2_1.csv',index=False)
64 |
65 | from tqdm import tqdm
66 | # sbb2_2['pred_prob'] = y_predict
67 | best_u = 0.310
68 | # Set the threshold and count the rows kept
69 | # print('sbb2_2 best_len',len(sbb2_2[sbb2_2['pred_prob']>=best_u]))
70 | sbb2_2[sbb2_2['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb2_2.csv',index=False)
71 |
72 | from tqdm import tqdm
73 | # sbb2_3['pred_prob'] = y_predict
74 | best_u = 0.480
75 | # Set the threshold and count the rows kept
76 | # print('sbb2_3 best_len',len(sbb2_3[sbb2_3['pred_prob']>=best_u]))
77 | sbb2_3[sbb2_3['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb2_3.csv',index=False)
78 |
79 | ## Intersect within the same feature set
80 | ## Vote across feature sets
81 | sbb21 = pd.read_csv('../output/sbb2_1.csv')
82 | sbb31 = pd.read_csv('../output/sbb3_1.csv')
83 | sbb41 = pd.read_csv('../output/sbb4_1.csv')
84 | all_data = pd.concat([sbb21,sbb31,sbb41],axis=0).drop_duplicates()
85 |
86 | sbb21['label2']=1
87 | sbb31['label3']=1
88 | sbb41['label4']=1
89 |
90 | all_data = all_data.merge(sbb21,on=['user_id','cate','shop_id'],how='left')
91 | all_data = all_data.merge(sbb31,on=['user_id','cate','shop_id'],how='left')
92 | all_data = all_data.merge(sbb41,on=['user_id','cate','shop_id'],how='left')
93 | all_data= all_data.fillna(0)
94 | all_data['sum'] = all_data['label2']+all_data['label3']+all_data['label4']
95 |
96 | all_data['sum'].value_counts()
97 |
98 | all_data[all_data['sum']>=3][['user_id','cate','shop_id']].to_csv('../output/sbb*1_u3.csv',index=False)
99 |
100 | sbb22 = pd.read_csv('../output/sbb2_2.csv')
101 | sbb32 = pd.read_csv('../output/sbb3_2.csv')
102 | sbb42 = pd.read_csv('../output/sbb4_2.csv')
103 | all_data = pd.concat([sbb22,sbb32,sbb42],axis=0).drop_duplicates()
104 |
105 | sbb22['label2']=1
106 | sbb32['label3']=1
107 | sbb42['label4']=1
108 |
109 | all_data = all_data.merge(sbb22,on=['user_id','cate','shop_id'],how='left')
110 | all_data = all_data.merge(sbb32,on=['user_id','cate','shop_id'],how='left')
111 | all_data = all_data.merge(sbb42,on=['user_id','cate','shop_id'],how='left')
112 | all_data= all_data.fillna(0)
113 | all_data['sum'] = all_data['label2']+all_data['label3']+all_data['label4']
114 |
115 | all_data['sum'].value_counts()
116 | all_data[all_data['sum']>=3][['user_id','cate','shop_id']].to_csv('../output/sbb*2_u3.csv',index=False)
117 |
118 | sbb23 = pd.read_csv('../output/sbb2_3.csv')
119 | sbb33 = pd.read_csv('../output/sbb3_3.csv')
120 | sbb43 = pd.read_csv('../output/sbb4_3.csv')
121 | all_data = pd.concat([sbb23,sbb33,sbb43],axis=0).drop_duplicates()
122 |
123 | sbb23['label2']=1
124 | sbb33['label3']=1
125 | sbb43['label4']=1
126 |
127 | all_data = all_data.merge(sbb23,on=['user_id','cate','shop_id'],how='left')
128 | all_data = all_data.merge(sbb33,on=['user_id','cate','shop_id'],how='left')
129 | all_data = all_data.merge(sbb43,on=['user_id','cate','shop_id'],how='left')
130 | all_data= all_data.fillna(0)
131 | all_data['sum'] = all_data['label2']+all_data['label3']+all_data['label4']
132 |
133 | all_data['sum'].value_counts()
134 | all_data[all_data['sum']>=3][['user_id','cate','shop_id']].to_csv('../output/sbb*3_u3.csv',index=False)
135 |
136 | sbb1_vote = pd.read_csv('../output/sbb*1_u3.csv')
137 | sbb2_vote = pd.read_csv('../output/sbb*2_u3.csv')
138 | sbb3_vote = pd.read_csv('../output/sbb*3_u3.csv')
139 | a_result = pd.read_csv('../submit/8_model_2.csv')
140 | all_data = pd.concat([sbb1_vote,sbb2_vote,sbb3_vote,a_result],axis=0).drop_duplicates()
141 |
142 |
143 | sbb1_vote['label2']=1
144 | sbb2_vote['label3']=1
145 | sbb3_vote['label4']=1
146 | a_result['label5'] = 1
147 |
148 | all_data = all_data.merge(sbb1_vote,on=['user_id','cate','shop_id'],how='left')
149 | all_data = all_data.merge(sbb2_vote,on=['user_id','cate','shop_id'],how='left')
150 | all_data = all_data.merge(sbb3_vote,on=['user_id','cate','shop_id'],how='left')
151 | all_data = all_data.merge(a_result,on=['user_id','cate','shop_id'],how='left')
152 | all_data= all_data.fillna(0)
153 | all_data['sum'] = all_data['label2']+all_data['label3']+all_data['label4']+all_data['label5']
154 |
155 | all_data['sum'].value_counts()
156 |
157 | all_data[all_data['sum']>=2][['user_id','cate','shop_id']].to_csv('../submit/b_final.csv',index=False)
--------------------------------------------------------------------------------
/code/lgb_model/lgb_train1.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from datetime import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 |
12 | ## Load files with reduced memory usage; adapted from 鱼佬's Tencent competition code
13 | def reduce_mem_usage(df, verbose=True):
14 | numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
15 | start_mem = df.memory_usage().sum() / 1024**2
16 | for col in df.columns:
17 | col_type = df[col].dtypes
18 | if col_type in numerics:
19 | c_min = df[col].min()
20 | c_max = df[col].max()
21 | if str(col_type)[:3] == 'int':
22 | if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
23 | df[col] = df[col].astype(np.int8)
24 | elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
25 | df[col] = df[col].astype(np.int16)
26 | elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
27 | df[col] = df[col].astype(np.int32)
28 | elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
29 | df[col] = df[col].astype(np.int64)
30 | else:
31 | if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
32 | df[col] = df[col].astype(np.float16)
33 | elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
34 | df[col] = df[col].astype(np.float32)
35 | else:
36 | df[col] = df[col].astype(np.float64)
37 | end_mem = df.memory_usage().sum() / 1024**2
38 | if verbose:
39 | print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
40 | return df
41 |
42 | df_train = reduce_mem_usage(pd.read_csv('./df_train.csv'))
43 | df_test = reduce_mem_usage(pd.read_csv('./df_test.csv'))
44 | ## Optionally load several feature files here and merge them; if df_train changes, note in the output file name which feature files were used (see the commented sketch below)
45 | ### Feature flag: 1 = one-week features only, 12 = plus two-week features, 123 = plus three-week features as well, 2 = two-week features only
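# For example (hypothetical file names), extra feature tables could be merged on the key:
#   df_week2 = reduce_mem_usage(pd.read_csv('./df_train_week2.csv'))
#   df_train = df_train.merge(df_week2, on=['user_id', 'cate', 'shop_id'], how='left')
#   label_flag = 12  # and record the flag in the output file name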
46 |
47 | df_user=reduce_mem_usage(pd.read_csv('./jdata_user.csv'))
48 | df_comment=reduce_mem_usage(pd.read_csv('./jdata_comment.csv'))
49 | df_shop=reduce_mem_usage(pd.read_csv('./jdata_shop.csv'))
50 |
51 | # 1) Action data (jdata_action)
52 | jdata_action = reduce_mem_usage(pd.read_csv('./jdata_action.csv'))
53 |
54 | # 3) Product data (jdata_product)
55 | jdata_product = reduce_mem_usage(pd.read_csv('./jdata_product.csv'))
56 |
57 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
58 |
59 | print('<< Data loaded')
60 |
61 |
62 | label_flag = 1
63 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-09')&(jdata_data['action_time']<'2018-04-15')&(jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
64 | train_buy['label'] = 1
65 | # Candidate set: (user, cate, shop) triples active in the last three weeks, '2018-03-19' - '2018-04-08'
66 | win_size = 3  # 2 for a two-week candidate window, 3 for three weeks
67 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-19')&(jdata_data['action_time']<'2018-04-09')][['user_id','cate','shop_id']].drop_duplicates()
68 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
69 |
70 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
71 |
72 |
73 |
74 | def mapper_year(x):
75 | if pd.notna(x):
76 | year = int(x[:4])
77 | return 2018 - year
78 |
79 |
80 | def mapper_month(x):
81 | if pd.notna(x):
82 | year = int(x[:4])
83 | month = int(x[5:7])
84 | return (2018 - year) * 12 + month
85 |
86 |
87 | def mapper_day(x):
88 | if pd.notna(x):
89 | year = int(x[:4])
90 | month = int(x[5:7])
91 | day = int(x[8:10])
92 | return (2018 - year) * 365 + month * 30 + day  # rough day index is fine for tree models
93 |
94 |
95 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
96 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
97 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
98 |
99 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
100 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
101 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
102 |
103 |
104 |
105 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
106 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
107 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
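# The constants 1 / 21 / 101 above appear to act as sentinel fills for shops
# with a missing shop_reg_tm, sitting outside the usual value ranges.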
108 |
109 | df_user['age'] = df_user['age'].fillna(5)
110 |
111 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
112 | print('check point ...')
113 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
114 |
115 | df_product_comment = df_product_comment.fillna(0)
116 |
117 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
118 |
119 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
120 |
121 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
122 |
123 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
124 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
125 |
126 |
127 |
128 |
129 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
130 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
131 |
132 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
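# Caution: fans_num or comments can be 0, so vip_prob/goods_prob may contain
# inf; one safe option is train_set.replace([np.inf, -np.inf], np.nan,
# inplace=True) so such cells are treated as missing.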
133 |
134 |
135 |
136 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-26') & (jdata_data['action_time'] < '2018-04-16')][['user_id', 'cate', 'shop_id']].drop_duplicates()
137 |
138 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
139 |
140 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
141 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
142 |
143 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
144 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
145 |
146 |
147 |
148 |
149 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
150 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
151 |
152 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
153 |
154 |
155 |
156 | ### Keep six weeks of train features: 2018-02-26 to 2018-04-09
157 | train_set = train_set.drop([
158 | '2018-02-19-2018-02-26-action_1', '2018-02-19-2018-02-26-action_2',
159 | '2018-02-19-2018-02-26-action_3', '2018-02-19-2018-02-26-action_4',
160 | '2018-02-12-2018-02-19-action_1', '2018-02-12-2018-02-19-action_2',
161 | '2018-02-12-2018-02-19-action_3', '2018-02-12-2018-02-19-action_4',
162 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
163 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
164 |
165 |
166 | ### Keep six weeks of test features: 2018-03-05 to 2018-04-15
167 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
168 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
169 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
170 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
171 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
172 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
173 | '2018-02-12-2018-02-19-action_4'],axis=1)
174 |
175 |
176 |
177 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
178 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
179 |
180 |
181 |
182 | test_head=test_set[['user_id','cate','shop_id']].copy()  # .copy() so pred_prob can be assigned later without a chained-assignment warning
183 | train_head=train_set[['user_id','cate','shop_id']].copy()
184 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
185 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
186 |
187 |
188 | # Prepare the data matrices
189 | X_train = train_set.drop(['label'],axis=1).values
190 | y_train = train_set['label'].values
191 | X_test = test_set.values
192 |
193 | del train_set
194 | del test_set
195 |
196 |
197 | print('------------------start modelling----------------')
198 | # Model utilities
199 | class SBBTree():
200 | """Stacking, Bootstrap, Bagging ---- SBBTree."""
201 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
202 | """
203 | Initializes the SBBTree.
204 | Args:
205 | params : lgb params.
206 | stacking_num : number of stacking folds (k-fold).
207 | bagging_num : bootstrap num.
208 | bagging_test_size : bootstrap sample rate.
209 | num_boost_round : boost num.
210 | early_stopping_rounds : early_stopping_rounds.
211 | """
212 | self.params = params
213 | self.stacking_num = stacking_num
214 | self.bagging_num = bagging_num
215 | self.bagging_test_size = bagging_test_size
216 | self.num_boost_round = num_boost_round
217 | self.early_stopping_rounds = early_stopping_rounds
218 |
219 | self.model = lgb
220 | self.stacking_model = []
221 | self.bagging_model = []
222 |
223 | def fit(self, X, y):
224 | """ fit model. """
225 | if self.stacking_num > 1:
226 | layer_train = np.zeros((X.shape[0], 2))
227 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
228 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
229 | X_train = X[train_index]
230 | y_train = y[train_index]
231 | X_test = X[test_index]
232 | y_test = y[test_index]
233 |
234 | lgb_train = lgb.Dataset(X_train, y_train)
235 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
236 |
237 | gbm = lgb.train(self.params,
238 | lgb_train,
239 | num_boost_round=self.num_boost_round,
240 | valid_sets=lgb_eval,
241 | early_stopping_rounds=self.early_stopping_rounds)
242 |
243 | self.stacking_model.append(gbm)
244 |
245 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
246 | layer_train[test_index, 1] = pred_y
247 |
248 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
249 | else:
250 | pass
251 | for bn in range(self.bagging_num):
252 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
253 |
254 | lgb_train = lgb.Dataset(X_train, y_train)
255 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
256 |
257 | gbm = lgb.train(self.params,
258 | lgb_train,
259 | num_boost_round=10000,
260 | valid_sets=lgb_eval,
261 | early_stopping_rounds=200)
262 |
263 | self.bagging_model.append(gbm)
264 |
265 | def predict(self, X_pred):
266 | """ predict test data. """
267 | if self.stacking_num > 1:
268 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
269 | for sn,gbm in enumerate(self.stacking_model):
270 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
271 | test_pred[:, sn] = pred
272 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
273 | else:
274 | pass
275 | for bn,gbm in enumerate(self.bagging_model):
276 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
277 | if bn == 0:
278 | pred_out=pred
279 | else:
280 | pred_out+=pred
281 | return pred_out/self.bagging_num
282 |
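# How SBBTree fits together: fit() first runs stacking_num-fold out-of-fold
# training and appends the OOF prediction to X as one extra feature, then
# trains bagging_num LightGBM models on random splits of the augmented
# matrix; predict() rebuilds that stacking feature with the saved fold
# models and averages the bagged predictions.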
283 |
284 | # Model parameters
285 | params = {
286 | 'boosting_type': 'gbdt',
287 | 'objective': 'binary',
288 | 'metric': 'auc',
289 | 'learning_rate': 0.01,
290 | 'num_leaves': 2 ** 5 - 1,
291 | 'min_child_samples': 100,
292 | 'max_bin': 100,
293 | 'subsample': .7,
294 | 'subsample_freq': 1,
295 | 'colsample_bytree': 0.7,
296 | 'min_child_weight': 0,
297 | 'scale_pos_weight': 25,
298 | 'seed': 42,
299 | 'nthread': 20,
300 | 'verbose': 0,
301 | }
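# scale_pos_weight up-weights the rare positive (purchase) class, and the
# small learning_rate is balanced by early stopping on the AUC metric.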
302 | # Fit the model
303 | model = SBBTree(params=params, \
304 | stacking_num=5, \
305 | bagging_num=5, \
306 | bagging_test_size=0.33, \
307 | num_boost_round=10000, \
308 | early_stopping_rounds=200)
309 | model.fit(X_train, y_train)
310 |
311 |
312 | print('train is ok')
313 | y_predict = model.predict(X_test)
314 | print('pred test is ok')
315 | # y_train_predict = model.predict(X_train)
316 |
317 |
318 |
320 | test_head['pred_prob'] = y_predict
321 | test_head.to_csv('feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
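# Predictions land in feature/<win_size>_sbb_get_<label_flag>_test.csv,
# i.e. feature/3_sbb_get_1_test.csv for this script.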
322 |
--------------------------------------------------------------------------------
/code/lgb_model/lgb_train2.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from datetime import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 |
12 | ## Read files with reduced memory; adapted from 鱼佬's Tencent competition code
13 | def reduce_mem_usage(df, verbose=True):
14 | numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
15 | start_mem = df.memory_usage().sum() / 1024**2
16 | for col in df.columns:
17 | col_type = df[col].dtypes
18 | if col_type in numerics:
19 | c_min = df[col].min()
20 | c_max = df[col].max()
21 | if str(col_type)[:3] == 'int':
22 | if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
23 | df[col] = df[col].astype(np.int8)
24 | elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
25 | df[col] = df[col].astype(np.int16)
26 | elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
27 | df[col] = df[col].astype(np.int32)
28 | elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
29 | df[col] = df[col].astype(np.int64)
30 | else:
31 | if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
32 | df[col] = df[col].astype(np.float16)
33 | elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
34 | df[col] = df[col].astype(np.float32)
35 | else:
36 | df[col] = df[col].astype(np.float64)
37 | end_mem = df.memory_usage().sum() / 1024**2
38 | if verbose:
39 | print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
40 | return df
41 |
42 | df_train = reduce_mem_usage(pd.read_csv('./df_train.csv'))
43 | df_test = reduce_mem_usage(pd.read_csv('./df_test.csv'))
44 | ## Extra feature files can be merged in here; if df_train changes, note in the output filename which feature files were used
45 | ### Feature flag: 1 = one-week features, 12 = one- plus two-week, 123 = plus three-week, 2 = two-week features only
46 |
47 |
48 | df_user=reduce_mem_usage(pd.read_csv('./jdata_user.csv'))
49 | df_comment=reduce_mem_usage(pd.read_csv('./jdata_comment.csv'))
50 | df_shop=reduce_mem_usage(pd.read_csv('./jdata_shop.csv'))
51 |
52 | # 1) Action data (jdata_action)
53 | jdata_action = reduce_mem_usage(pd.read_csv('./jdata_action.csv'))
54 |
55 | # 3) Product data (jdata_product)
56 | jdata_product = reduce_mem_usage(pd.read_csv('./jdata_product.csv'))
57 |
58 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
59 | print('<< data loading finished')
61 |
62 |
63 | label_flag = 2
64 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-02')&(jdata_data['action_time']<'2018-04-09') & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
65 | train_buy['label'] = 1
66 | # Candidate set: (user, cate, shop) triples with any action in the three weeks 2018-03-12 to 2018-04-01
67 | win_size = 3  # 2 for a two-week behavior window, 3 for three weeks
68 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-12')&(jdata_data['action_time']<'2018-04-02')][['user_id','cate','shop_id']].drop_duplicates()
69 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
70 |
71 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
72 |
73 |
74 |
75 | def mapper_year(x):
76 | if pd.notna(x):
77 | year = int(x[:4])
78 | return 2018 - year
79 |
80 |
81 | def mapper_month(x):
82 | if pd.notna(x):
83 | year = int(x[:4])
84 | month = int(x[5:7])
85 | return (2018 - year) * 12 + month
86 |
87 |
88 | def mapper_day(x):
89 | if pd.notna(x):
90 | year = int(x[:4])
91 | month = int(x[5:7])
92 | day = int(x[8:10])
93 | return (2018 - year) * 365 + month * 30 + day
94 |
95 |
96 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
97 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
98 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
99 |
100 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
101 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
102 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
103 |
104 |
105 |
106 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
107 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
108 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
109 |
110 | df_user['age'] = df_user['age'].fillna(5)
111 |
112 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
113 | print('check point ...')
114 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
115 |
116 | df_product_comment = df_product_comment.fillna(0)
117 |
118 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
119 |
120 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
121 |
122 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
123 |
124 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
125 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
126 |
127 |
128 |
129 |
130 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
131 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
132 |
133 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
134 |
135 |
136 |
137 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-26') & (jdata_data['action_time'] < '2018-04-16')][['user_id', 'cate', 'shop_id']].drop_duplicates()
138 |
139 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
140 |
141 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
142 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
143 |
144 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
145 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
146 |
147 |
148 |
149 |
150 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
151 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
152 |
153 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
154 |
155 |
156 |
157 | ### Keep six weeks of train features: 2018-02-19 to 2018-04-02
158 | train_set = train_set.drop([
159 | '2018-04-02-2018-04-09-action_1', '2018-04-02-2018-04-09-action_2',
160 | '2018-04-02-2018-04-09-action_3', '2018-04-02-2018-04-09-action_4',
161 | '2018-02-12-2018-02-19-action_1', '2018-02-12-2018-02-19-action_2',
162 | '2018-02-12-2018-02-19-action_3', '2018-02-12-2018-02-19-action_4',
163 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
164 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
165 |
166 |
167 | ### Keep six weeks of test features: 2018-03-05 to 2018-04-15
168 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
169 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
170 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
171 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
172 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
173 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
174 | '2018-02-12-2018-02-19-action_4'],axis=1)
175 |
176 |
177 |
178 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
179 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
180 |
181 |
182 |
183 | test_head=test_set[['user_id','cate','shop_id']].copy()
184 | train_head=train_set[['user_id','cate','shop_id']].copy()
185 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
186 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
187 |
188 |
189 | # Prepare the data matrices
190 | X_train = train_set.drop(['label'],axis=1).values
191 | y_train = train_set['label'].values
192 | X_test = test_set.values
193 |
194 | del train_set
195 | del test_set
196 |
197 | print('------------------start modelling----------------')
198 | # Model utilities
199 | class SBBTree():
200 | """Stacking, Bootstrap, Bagging ---- SBBTree."""
201 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
202 | """
203 | Initializes the SBBTree.
204 | Args:
205 | params : lgb params.
206 | stacking_num : number of stacking folds (k-fold).
207 | bagging_num : bootstrap num.
208 | bagging_test_size : bootstrap sample rate.
209 | num_boost_round : boost num.
210 | early_stopping_rounds : early_stopping_rounds.
211 | """
212 | self.params = params
213 | self.stacking_num = stacking_num
214 | self.bagging_num = bagging_num
215 | self.bagging_test_size = bagging_test_size
216 | self.num_boost_round = num_boost_round
217 | self.early_stopping_rounds = early_stopping_rounds
218 |
219 | self.model = lgb
220 | self.stacking_model = []
221 | self.bagging_model = []
222 |
223 | def fit(self, X, y):
224 | """ fit model. """
225 | if self.stacking_num > 1:
226 | layer_train = np.zeros((X.shape[0], 2))
227 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
228 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
229 | X_train = X[train_index]
230 | y_train = y[train_index]
231 | X_test = X[test_index]
232 | y_test = y[test_index]
233 |
234 | lgb_train = lgb.Dataset(X_train, y_train)
235 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
236 |
237 | gbm = lgb.train(self.params,
238 | lgb_train,
239 | num_boost_round=self.num_boost_round,
240 | valid_sets=lgb_eval,
241 | early_stopping_rounds=self.early_stopping_rounds)
242 |
243 | self.stacking_model.append(gbm)
244 |
245 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
246 | layer_train[test_index, 1] = pred_y
247 |
248 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
249 | else:
250 | pass
251 | for bn in range(self.bagging_num):
252 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
253 |
254 | lgb_train = lgb.Dataset(X_train, y_train)
255 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
256 |
257 | gbm = lgb.train(self.params,
258 | lgb_train,
259 | num_boost_round=10000,
260 | valid_sets=lgb_eval,
261 | early_stopping_rounds=200)
262 |
263 | self.bagging_model.append(gbm)
264 |
265 | def predict(self, X_pred):
266 | """ predict test data. """
267 | if self.stacking_num > 1:
268 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
269 | for sn,gbm in enumerate(self.stacking_model):
270 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
271 | test_pred[:, sn] = pred
272 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
273 | else:
274 | pass
275 | for bn,gbm in enumerate(self.bagging_model):
276 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
277 | if bn == 0:
278 | pred_out=pred
279 | else:
280 | pred_out+=pred
281 | return pred_out/self.bagging_num
282 |
283 |
284 | # Model parameters
285 | params = {
286 | 'boosting_type': 'gbdt',
287 | 'objective': 'binary',
288 | 'metric': 'auc',
289 | 'learning_rate': 0.01,
290 | 'num_leaves': 2 ** 5 - 1,
291 | 'min_child_samples': 100,
292 | 'max_bin': 100,
293 | 'subsample': .7,
294 | 'subsample_freq': 1,
295 | 'colsample_bytree': 0.7,
296 | 'min_child_weight': 0,
297 | 'scale_pos_weight': 25,
298 | 'seed': 42,
299 | 'nthread': 20,
300 | 'verbose': 0,
301 | }
302 | # Fit the model
303 | model = SBBTree(params=params, \
304 | stacking_num=5, \
305 | bagging_num=5, \
306 | bagging_test_size=0.33, \
307 | num_boost_round=10000, \
308 | early_stopping_rounds=200)
309 | model.fit(X_train, y_train)
310 |
311 |
312 | print('train is ok')
313 | y_predict = model.predict(X_test)
314 | print('pred test is ok')
315 | # y_train_predict = model.predict(X_train)
316 |
317 |
318 |
320 | test_head['pred_prob'] = y_predict
321 | test_head.to_csv('feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
322 |
--------------------------------------------------------------------------------
/code/lgb_model/lgb_train3.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from datetime import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 |
12 | ## Read files with reduced memory; adapted from 鱼佬's Tencent competition code
13 | def reduce_mem_usage(df, verbose=True):
14 | numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
15 | start_mem = df.memory_usage().sum() / 1024**2
16 | for col in df.columns:
17 | col_type = df[col].dtypes
18 | if col_type in numerics:
19 | c_min = df[col].min()
20 | c_max = df[col].max()
21 | if str(col_type)[:3] == 'int':
22 | if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
23 | df[col] = df[col].astype(np.int8)
24 | elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
25 | df[col] = df[col].astype(np.int16)
26 | elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
27 | df[col] = df[col].astype(np.int32)
28 | elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
29 | df[col] = df[col].astype(np.int64)
30 | else:
31 | if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
32 | df[col] = df[col].astype(np.float16)
33 | elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
34 | df[col] = df[col].astype(np.float32)
35 | else:
36 | df[col] = df[col].astype(np.float64)
37 | end_mem = df.memory_usage().sum() / 1024**2
38 | if verbose:
39 | print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
40 | return df
41 |
42 | df_train = reduce_mem_usage(pd.read_csv('./df_train.csv'))
43 | df_test = reduce_mem_usage(pd.read_csv('./df_test.csv'))
44 | ## Extra feature files can be merged in here; if df_train changes, note in the output filename which feature files were used
45 | ### Feature flag: 1 = one-week features, 12 = one- plus two-week, 123 = plus three-week, 2 = two-week features only
46 |
47 |
48 | df_user=reduce_mem_usage(pd.read_csv('./jdata_user.csv'))
49 | df_comment=reduce_mem_usage(pd.read_csv('./jdata_comment.csv'))
50 | df_shop=reduce_mem_usage(pd.read_csv('./jdata_shop.csv'))
51 |
52 | # 1) Action data (jdata_action)
53 | jdata_action = reduce_mem_usage(pd.read_csv('./jdata_action.csv'))
54 |
55 | # 3) Product data (jdata_product)
56 | jdata_product = reduce_mem_usage(pd.read_csv('./jdata_product.csv'))
57 |
58 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
59 | label_flag = 3
60 | train_buy = jdata_data[(jdata_data['action_time']>='2018-03-26') & (jdata_data['action_time']<'2018-04-02') & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
61 | train_buy['label'] = 1
62 | # Candidate set: (user, cate, shop) triples with any action in the three weeks 2018-03-05 to 2018-03-25
63 | win_size = 3  # 2 for a two-week behavior window, 3 for three weeks
64 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-05') & (jdata_data['action_time']<'2018-03-26')][['user_id','cate','shop_id']].drop_duplicates()
65 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
66 |
67 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
68 |
69 |
70 | def mapper_year(x):
71 | if pd.notna(x):
72 | year = int(x[:4])
73 | return 2018 - year
74 |
75 |
76 | def mapper_month(x):
77 | if pd.notna(x):
78 | year = int(x[:4])
79 | month = int(x[5:7])
80 | return (2018 - year) * 12 + month
81 |
82 |
83 | def mapper_day(x):
84 | if pd.notna(x):
85 | year = int(x[:4])
86 | month = int(x[5:7])
87 | day = int(x[8:10])
88 | return (2018 - year) * 365 + month * 30 + day
89 |
90 |
91 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
92 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
93 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
94 |
95 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
96 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
97 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
98 |
99 |
100 |
101 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
102 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
103 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
104 |
105 | df_user['age'] = df_user['age'].fillna(5)
106 |
107 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
108 | print('check point ...')
109 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
110 |
111 | df_product_comment = df_product_comment.fillna(0)
112 |
113 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
114 |
115 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
116 |
117 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
118 |
119 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
120 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
121 |
122 |
123 |
124 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
125 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
126 |
127 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
128 |
129 |
130 |
131 |
132 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-26') & (jdata_data['action_time'] < '2018-04-16')][
133 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
134 |
135 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
136 |
137 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
138 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
139 |
140 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
141 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
142 |
143 |
144 |
145 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
146 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
147 |
148 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
149 |
150 |
151 |
152 | ### Keep six weeks of train features: 2018-02-12 to 2018-03-26
153 | train_set = train_set.drop(['2018-04-02-2018-04-09-action_1', '2018-04-02-2018-04-09-action_2',
154 | '2018-04-02-2018-04-09-action_3', '2018-04-02-2018-04-09-action_4',
155 | '2018-03-26-2018-04-02-action_1', '2018-03-26-2018-04-02-action_2',
156 | '2018-03-26-2018-04-02-action_3', '2018-03-26-2018-04-02-action_4',
157 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
158 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
159 |
160 |
161 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
162 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
163 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
164 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
165 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
166 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
167 | '2018-02-12-2018-02-19-action_4'],axis=1)
168 |
169 |
170 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
171 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
172 |
173 |
174 |
175 | test_head=test_set[['user_id','cate','shop_id']].copy()
176 | train_head=train_set[['user_id','cate','shop_id']].copy()
177 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
178 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
179 |
180 |
181 | # Prepare the data matrices
182 | X_train = train_set.drop(['label'],axis=1).values
183 | y_train = train_set['label'].values
184 | X_test = test_set.values
185 |
186 |
187 | del train_set
188 | del test_set
189 |
190 | print('------------------start modelling----------------')
191 | # Model utilities
192 | class SBBTree():
193 | """Stacking, Bootstrap, Bagging ---- SBBTree."""
194 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
195 | """
196 | Initializes the SBBTree.
197 | Args:
198 | params : lgb params.
199 | stacking_num : number of stacking folds (k-fold).
200 | bagging_num : bootstrap num.
201 | bagging_test_size : bootstrap sample rate.
202 | num_boost_round : boost num.
203 | early_stopping_rounds : early_stopping_rounds.
204 | """
205 | self.params = params
206 | self.stacking_num = stacking_num
207 | self.bagging_num = bagging_num
208 | self.bagging_test_size = bagging_test_size
209 | self.num_boost_round = num_boost_round
210 | self.early_stopping_rounds = early_stopping_rounds
211 |
212 | self.model = lgb
213 | self.stacking_model = []
214 | self.bagging_model = []
215 |
216 | def fit(self, X, y):
217 | """ fit model. """
218 | if self.stacking_num > 1:
219 | layer_train = np.zeros((X.shape[0], 2))
220 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
221 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
222 | X_train = X[train_index]
223 | y_train = y[train_index]
224 | X_test = X[test_index]
225 | y_test = y[test_index]
226 |
227 | lgb_train = lgb.Dataset(X_train, y_train)
228 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
229 |
230 | gbm = lgb.train(self.params,
231 | lgb_train,
232 | num_boost_round=self.num_boost_round,
233 | valid_sets=lgb_eval,
234 | early_stopping_rounds=self.early_stopping_rounds)
235 |
236 | self.stacking_model.append(gbm)
237 |
238 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
239 | layer_train[test_index, 1] = pred_y
240 |
241 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
242 | else:
243 | pass
244 | for bn in range(self.bagging_num):
245 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
246 |
247 | lgb_train = lgb.Dataset(X_train, y_train)
248 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
249 |
250 | gbm = lgb.train(self.params,
251 | lgb_train,
252 | num_boost_round=10000,
253 | valid_sets=lgb_eval,
254 | early_stopping_rounds=200)
255 |
256 | self.bagging_model.append(gbm)
257 |
258 | def predict(self, X_pred):
259 | """ predict test data. """
260 | if self.stacking_num > 1:
261 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
262 | for sn,gbm in enumerate(self.stacking_model):
263 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
264 | test_pred[:, sn] = pred
265 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
266 | else:
267 | pass
268 | for bn,gbm in enumerate(self.bagging_model):
269 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
270 | if bn == 0:
271 | pred_out=pred
272 | else:
273 | pred_out+=pred
274 | return pred_out/self.bagging_num
275 |
276 |
277 | # Model parameters
278 | params = {
279 | 'boosting_type': 'gbdt',
280 | 'objective': 'binary',
281 | 'metric': 'auc',
282 | 'learning_rate': 0.01,
283 | 'num_leaves': 2 ** 5 - 1,
284 | 'min_child_samples': 100,
285 | 'max_bin': 100,
286 | 'subsample': .7,
287 | 'subsample_freq': 1,
288 | 'colsample_bytree': 0.7,
289 | 'min_child_weight': 0,
290 | 'scale_pos_weight': 25,
291 | 'seed': 42,
292 | 'nthread': 20,
293 | 'verbose': 0,
294 | }
295 | # Fit the model
296 | model = SBBTree(params=params, \
297 | stacking_num=5, \
298 | bagging_num=5, \
299 | bagging_test_size=0.33, \
300 | num_boost_round=10000, \
301 | early_stopping_rounds=200)
302 | model.fit(X_train, y_train)
303 |
304 |
305 | print('train is ok')
306 | y_predict = model.predict(X_test)
307 | print('pred test is ok')
308 | # y_train_predict = model.predict(X_train)
309 |
310 |
311 |
313 | test_head['pred_prob'] = y_predict
314 | test_head.to_csv('feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
315 |
--------------------------------------------------------------------------------
/code/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | python EDA13.py
3 |
4 | echo "base feature is ok"
5 |
6 | python EDA16-fourWeek.py
7 | python EDA16-fourWeek_rightTime.py
8 | python EDA16-threeWeek.py
9 | python EDA16-threeWeek_rightTime.py
10 | python EDA16-twoWeek.py
11 |
12 | echo "A result is ok"
13 |
14 | python sbb_train1.py
15 | python sbb_train2.py
16 | python sbb_train3.py
17 | echo "B win size 3 is ok"
18 | python sbb2_train1.py
19 | python sbb2_train2.py
20 | python sbb2_train3.py
21 | echo "B win size 2 is ok"
22 | python sbb4_train1.py
23 | python sbb4_train2.py
24 | python sbb4_train3.py
25 | echo "B win size 4 is ok"
26 | python gen_result.py
27 |
28 | echo "finish,,,,,"
29 |
--------------------------------------------------------------------------------
/code/sbb2_train1.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 |
12 | df_train=pd.read_csv('../output/df_train.csv')
13 | df_test=pd.read_csv('../output/df_test.csv')
14 | ## Extra feature files can be merged in here; if df_train changes, note in the output filename which feature files were used
15 | ### Feature flag: 1 = one-week features, 12 = one- plus two-week, 123 = plus three-week, 2 = two-week features only
16 |
17 | # df_train_two=pd.read_csv('../output/df_train_two.csv')
18 | # df_test_two=pd.read_csv('../output/df_test_two.csv')
19 | # df_train = df_train.merge(df_train_two,on=['user_id','cate','shop_id'],how='left')
20 | # df_test = df_test.merge(df_test_two,on=['user_id','cate','shop_id'],how='left')
21 |
22 |
23 | df_user=pd.read_csv('../data/jdata_user.csv')
24 | df_comment=pd.read_csv('../data/jdata_comment.csv')
25 | df_shop=pd.read_csv('../data/jdata_shop.csv')
26 |
27 | # 1) Action data (jdata_action)
28 | jdata_action = pd.read_csv('../data/jdata_action.csv')
29 |
30 | # 3) Product data (jdata_product)
31 | jdata_product = pd.read_csv('../data/jdata_product.csv')
32 |
33 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
34 | label_flag = 1
35 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-09')
36 | & (jdata_data['action_time']<'2018-04-16')
37 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
38 | train_buy['label'] = 1
39 | # Candidate set: (user, cate, shop) triples with any action in the two weeks 2018-03-26 to 2018-04-08
40 | win_size = 2  # 2 for a two-week behavior window, 3 for three weeks
41 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-26')
42 | & (jdata_data['action_time']<'2018-04-09')][['user_id','cate','shop_id']].drop_duplicates()
43 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
44 |
45 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
46 |
47 | def mapper_year(x):
48 | if pd.notna(x):
49 | year = int(x[:4])
50 | return 2018 - year
51 |
52 |
53 | def mapper_month(x):
54 | if pd.notna(x):
55 | year = int(x[:4])
56 | month = int(x[5:7])
57 | return (2018 - year) * 12 + month
58 |
59 |
60 | def mapper_day(x):
61 | if pd.notna(x):
62 | year = int(x[:4])
63 | month = int(x[5:7])
64 | day = int(x[8:10])
65 | return (2018 - year) * 365 + month * 30 + day
66 |
67 |
68 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
69 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
70 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
71 |
72 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
73 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
74 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
75 |
76 |
77 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
78 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
79 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
80 |
81 | df_user['age'] = df_user['age'].fillna(5)
82 |
83 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
84 | print('check point ...')
85 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
86 |
87 | df_product_comment = df_product_comment.fillna(0)
88 |
89 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
90 |
91 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
92 |
93 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
94 |
95 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
96 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
97 |
98 |
99 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
100 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
101 |
102 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
103 |
104 |
105 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-04-02') & (jdata_data['action_time'] < '2018-04-16')][
106 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
107 |
108 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
109 |
110 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
111 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
112 |
113 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
114 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
115 |
116 |
117 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
118 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
119 |
120 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
121 |
122 |
123 | ### Keep six weeks of train features: 2018-02-26 to 2018-04-09
124 | train_set = train_set.drop(['2018-02-19-2018-02-26-action_1', '2018-02-19-2018-02-26-action_2',
125 | '2018-02-19-2018-02-26-action_3', '2018-02-19-2018-02-26-action_4',
126 | '2018-02-12-2018-02-19-action_1', '2018-02-12-2018-02-19-action_2',
127 | '2018-02-12-2018-02-19-action_3', '2018-02-12-2018-02-19-action_4',
128 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
129 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
130 |
131 |
132 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
133 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
134 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
135 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
136 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
137 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
138 | '2018-02-12-2018-02-19-action_4'],axis=1)
139 |
140 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
141 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
142 |
143 |
144 |
145 |
146 | test_head=test_set[['user_id','cate','shop_id']].copy()
147 | train_head=train_set[['user_id','cate','shop_id']].copy()
148 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
149 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
150 |
151 |
152 | # Prepare the data matrices
153 | X_train = train_set.drop(['label'],axis=1).values
154 | y_train = train_set['label'].values
155 | X_test = test_set.values
156 |
157 | del train_set
158 | del test_set
159 |
160 | import gc
161 | gc.collect()
162 |
163 |
164 |
165 | # Model utilities
166 | class SBBTree():
167 | """Stacking, Bootstrap, Bagging ---- SBBTree."""
168 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
169 | """
170 | Initializes the SBBTree.
171 | Args:
172 | params : lgb params.
173 | stacking_num : number of stacking folds (k-fold).
174 | bagging_num : bootstrap num.
175 | bagging_test_size : bootstrap sample rate.
176 | num_boost_round : boost num.
177 | early_stopping_rounds : early_stopping_rounds.
178 | """
179 | self.params = params
180 | self.stacking_num = stacking_num
181 | self.bagging_num = bagging_num
182 | self.bagging_test_size = bagging_test_size
183 | self.num_boost_round = num_boost_round
184 | self.early_stopping_rounds = early_stopping_rounds
185 |
186 | self.model = lgb
187 | self.stacking_model = []
188 | self.bagging_model = []
189 |
190 | def fit(self, X, y):
191 | """ fit model. """
192 | if self.stacking_num > 1:
193 | layer_train = np.zeros((X.shape[0], 2))
194 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
195 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
196 | X_train = X[train_index]
197 | y_train = y[train_index]
198 | X_test = X[test_index]
199 | y_test = y[test_index]
200 |
201 | lgb_train = lgb.Dataset(X_train, y_train)
202 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
203 |
204 | gbm = lgb.train(self.params,
205 | lgb_train,
206 | num_boost_round=self.num_boost_round,
207 | valid_sets=lgb_eval,
208 | early_stopping_rounds=self.early_stopping_rounds,
209 | verbose_eval=300)
210 |
211 | self.stacking_model.append(gbm)
212 |
213 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
214 | layer_train[test_index, 1] = pred_y
215 |
216 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
217 | else:
218 | pass
219 | for bn in range(self.bagging_num):
220 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
221 |
222 | lgb_train = lgb.Dataset(X_train, y_train)
223 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
224 |
225 | gbm = lgb.train(self.params,
226 | lgb_train,
227 | num_boost_round=10000,
228 | valid_sets=lgb_eval,
229 | early_stopping_rounds=200,
230 | verbose_eval=300)
231 |
232 | self.bagging_model.append(gbm)
233 |
234 | def predict(self, X_pred):
235 | """ predict test data. """
236 | if self.stacking_num > 1:
237 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
238 | for sn,gbm in enumerate(self.stacking_model):
239 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
240 | test_pred[:, sn] = pred
241 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
242 | else:
243 | pass
244 | for bn,gbm in enumerate(self.bagging_model):
245 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
246 | if bn == 0:
247 | pred_out=pred
248 | else:
249 | pred_out+=pred
250 | return pred_out/self.bagging_num
251 |
252 | # Model parameters
253 | params = {
254 | 'boosting_type': 'gbdt',
255 | 'objective': 'binary',
256 | 'metric': 'auc',
257 | 'learning_rate': 0.01,
258 | 'num_leaves': 2 ** 5 - 1,
259 | 'min_child_samples': 100,
260 | 'max_bin': 100,
261 | 'subsample': 0.8,
262 | 'subsample_freq': 1,
263 | 'colsample_bytree': 0.8,
264 | 'min_child_weight': 0,
265 | 'scale_pos_weight': 25,
266 | 'seed': 2019,
267 | 'nthread': 4,
268 | 'verbose': 0,
269 | }
270 |
271 | # Fit the model
272 | model = SBBTree(params=params,\
273 | stacking_num=5,\
274 | bagging_num=5,\
275 | bagging_test_size=0.33,\
276 | num_boost_round=10000,\
277 | early_stopping_rounds=200)
278 | model.fit(X_train, y_train)
279 | print('train is ok')
280 | y_predict = model.predict(X_test)
281 | print('pred test is ok')
282 | # y_train_predict = model.predict(X_train)
283 |
284 |
286 | test_head['pred_prob'] = y_predict
287 | test_head.to_csv('../feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
288 |
289 |
290 |
--------------------------------------------------------------------------------
/code/sbb2_train2.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 | df_train=pd.read_csv('../output/df_train.csv')
12 | df_test=pd.read_csv('../output/df_test.csv')
13 | ## Extra feature files can be merged in here; if df_train changes, note in the output filename which feature files were used
14 | ### Feature flag: 1 = one-week features, 12 = one- plus two-week, 123 = plus three-week, 2 = two-week features only
15 |
16 | # df_train_two=pd.read_csv('../output/df_train_two.csv')
17 | # df_test_two=pd.read_csv('../output/df_test_two.csv')
18 | # df_train = df_train.merge(df_train_two,on=['user_id','cate','shop_id'],how='left')
19 | # df_test = df_test.merge(df_test_two,on=['user_id','cate','shop_id'],how='left')
20 |
21 | df_user=pd.read_csv('../data/jdata_user.csv')
22 | df_comment=pd.read_csv('../data/jdata_comment.csv')
23 | df_shop=pd.read_csv('../data/jdata_shop.csv')
24 |
25 | # 1) Action data (jdata_action)
26 | jdata_action = pd.read_csv('../data/jdata_action.csv')
27 |
28 | # 3) Product data (jdata_product)
29 | jdata_product = pd.read_csv('../data/jdata_product.csv')
30 |
31 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
32 | label_flag = 2
33 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-02')
34 | & (jdata_data['action_time']<'2018-04-09')
35 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
36 | train_buy['label'] = 1
37 | # Candidate set: (user, cate, shop) triples with any action in the two weeks 2018-03-19 to 2018-04-01
38 | win_size = 2  # 2 for a two-week behavior window, 3 for three weeks
39 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-19')
40 | & (jdata_data['action_time']<'2018-04-02')][['user_id','cate','shop_id']].drop_duplicates()
41 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
42 |
43 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
44 |
45 | def mapper_year(x):
46 | if pd.notna(x):
47 | year = int(x[:4])
48 | return 2018 - year
49 |
50 |
51 | def mapper_month(x):
52 | if pd.notna(x):
53 | year = int(x[:4])
54 | month = int(x[5:7])
55 | return (2018 - year) * 12 + month
56 |
57 |
58 | def mapper_day(x):
59 | if pd.notna(x):
60 | year = int(x[:4])
61 | month = int(x[5:7])
62 | day = int(x[8:10])
63 | return (2018 - year) * 365 + month * 30 + day
64 |
65 |
66 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
67 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
68 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
69 |
70 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
71 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
72 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
73 |
74 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
75 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
76 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
77 |
78 | df_user['age'] = df_user['age'].fillna(5)
79 |
80 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
81 | print('check point ...')
82 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
83 |
84 | df_product_comment = df_product_comment.fillna(0)
85 |
86 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
87 |
88 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
89 |
90 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
91 |
92 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
93 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
94 |
95 |
96 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
97 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
98 |
99 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
100 |
101 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-04-02') & (jdata_data['action_time'] < '2018-04-16')][
102 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
103 |
104 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
105 |
106 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
107 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
108 |
109 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
110 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
111 |
112 |
113 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
114 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
115 |
116 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
117 |
118 | ### Keep six weeks of train features: 2018-02-19 to 2018-04-02
119 | train_set = train_set.drop([
120 | '2018-04-02-2018-04-09-action_1', '2018-04-02-2018-04-09-action_2',
121 | '2018-04-02-2018-04-09-action_3', '2018-04-02-2018-04-09-action_4',
122 | '2018-02-12-2018-02-19-action_1', '2018-02-12-2018-02-19-action_2',
123 | '2018-02-12-2018-02-19-action_3', '2018-02-12-2018-02-19-action_4',
124 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
125 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
126 |
127 |
128 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
129 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
130 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
131 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
132 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
133 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
134 | '2018-02-12-2018-02-19-action_4'],axis=1)
135 |
136 |
137 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
138 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
139 |
140 | test_head=test_set[['user_id','cate','shop_id']].copy()
141 | train_head=train_set[['user_id','cate','shop_id']].copy()
142 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
143 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
144 |
145 |
146 | # Prepare the data matrices
147 | X_train = train_set.drop(['label'],axis=1).values
148 | y_train = train_set['label'].values
149 | X_test = test_set.values
150 |
151 | del train_set
152 | del test_set
153 |
154 |
155 | import gc
156 | gc.collect()
157 |
158 |
159 | # Model utilities
160 | class SBBTree():
161 | """Stacking, Bootstrap, Bagging ---- SBBTree."""
162 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
163 | """
164 | Initializes the SBBTree.
165 | Args:
166 | params : lgb params.
167 | stacking_num : number of stacking folds (k-fold).
168 | bagging_num : bootstrap num.
169 | bagging_test_size : bootstrap sample rate.
170 | num_boost_round : boost num.
171 | early_stopping_rounds : early_stopping_rounds.
172 | """
173 | self.params = params
174 | self.stacking_num = stacking_num
175 | self.bagging_num = bagging_num
176 | self.bagging_test_size = bagging_test_size
177 | self.num_boost_round = num_boost_round
178 | self.early_stopping_rounds = early_stopping_rounds
179 |
180 | self.model = lgb
181 | self.stacking_model = []
182 | self.bagging_model = []
183 |
184 | def fit(self, X, y):
185 | """ fit model. """
186 | if self.stacking_num > 1:
187 | layer_train = np.zeros((X.shape[0], 2))
188 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
189 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
190 | X_train = X[train_index]
191 | y_train = y[train_index]
192 | X_test = X[test_index]
193 | y_test = y[test_index]
194 |
195 | lgb_train = lgb.Dataset(X_train, y_train)
196 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
197 |
198 | gbm = lgb.train(self.params,
199 | lgb_train,
200 | num_boost_round=self.num_boost_round,
201 | valid_sets=lgb_eval,
202 | early_stopping_rounds=self.early_stopping_rounds,
203 | verbose_eval=300)
204 |
205 | self.stacking_model.append(gbm)
206 |
207 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
208 | layer_train[test_index, 1] = pred_y
209 |
210 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
211 | else:
212 | pass
213 | for bn in range(self.bagging_num):
214 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
215 |
216 | lgb_train = lgb.Dataset(X_train, y_train)
217 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
218 |
219 | gbm = lgb.train(self.params,
220 | lgb_train,
221 | num_boost_round=10000,
222 | valid_sets=lgb_eval,
223 | early_stopping_rounds=200,
224 | verbose_eval=300)
225 |
226 | self.bagging_model.append(gbm)
227 |
228 | def predict(self, X_pred):
229 | """ predict test data. """
230 | if self.stacking_num > 1:
231 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
232 | for sn,gbm in enumerate(self.stacking_model):
233 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
234 | test_pred[:, sn] = pred
235 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
236 | else:
237 | pass
238 | for bn,gbm in enumerate(self.bagging_model):
239 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
240 | if bn == 0:
241 | pred_out=pred
242 | else:
243 | pred_out+=pred
244 | return pred_out/self.bagging_num
245 |
246 | # Model parameters
247 | params = {
248 | 'boosting_type': 'gbdt',
249 | 'objective': 'binary',
250 | 'metric': 'auc',
251 | 'learning_rate': 0.01,
252 | 'num_leaves': 2 ** 5 - 1,
253 | 'min_child_samples': 100,
254 | 'max_bin': 100,
255 | 'subsample': 0.8,
256 | 'subsample_freq': 1,
257 | 'colsample_bytree': 0.8,
258 | 'min_child_weight': 0,
259 | 'scale_pos_weight': 25,
260 | 'seed': 2019,
261 | 'nthread': 4,
262 | 'verbose': 0,
263 | }
264 |
265 | # Fit the model
266 | model = SBBTree(params=params,\
267 | stacking_num=5,\
268 | bagging_num=5,\
269 | bagging_test_size=0.33,\
270 | num_boost_round=10000,\
271 | early_stopping_rounds=200)
272 | model.fit(X_train, y_train)
273 | print('train is ok')
274 | y_predict = model.predict(X_test)
275 | print('pred test is ok')
276 | # y_train_predict = model.predict(X_train)
277 |
278 |
280 | test_head['pred_prob'] = y_predict
281 | test_head.to_csv('../feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
282 |
--------------------------------------------------------------------------------
/code/sbb2_train3.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 | df_train=pd.read_csv('../output/df_train.csv')
12 | df_test=pd.read_csv('../output/df_test.csv')
13 | ## Extra feature files can be merged in here; if df_train changes, note in the output filename which feature files were used
14 | ### Feature flag: 1 = one-week features, 12 = one- plus two-week, 123 = plus three-week, 2 = two-week features only
15 |
16 | # df_train_two=pd.read_csv('../output/df_train_two.csv')
17 | # df_test_two=pd.read_csv('../output/df_test_two.csv')
18 | # df_train = df_train.merge(df_train_two,on=['user_id','cate','shop_id'],how='left')
19 | # df_test = df_test.merge(df_test_two,on=['user_id','cate','shop_id'],how='left')
20 |
21 | df_user=pd.read_csv('../data/jdata_user.csv')
22 | df_comment=pd.read_csv('../data/jdata_comment.csv')
23 | df_shop=pd.read_csv('../data/jdata_shop.csv')
24 |
25 | # 1) Behavior data (jdata_action)
26 | jdata_action = pd.read_csv('../data/jdata_action.csv')
27 |
29 | # 3) Product data (jdata_product)
29 | jdata_product = pd.read_csv('../data/jdata_product.csv')
30 |
31 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
32 | label_flag = 3
33 | train_buy = jdata_data[(jdata_data['action_time']>='2018-03-26')
34 | & (jdata_data['action_time']<'2018-04-02')
35 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
36 | train_buy['label'] = 1
37 | # Candidate set: (user, cate, shop) triples with any action during '2018-03-12'-'2018-03-25', the two weeks before the label week
38 | win_size = 2  # candidate window in weeks: 2 = two weeks, 3 = three weeks
39 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-12')
40 | & (jdata_data['action_time']<'2018-03-26')][['user_id','cate','shop_id']].drop_duplicates()
41 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
42 |
43 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
44 |
45 |
46 | def mapper_year(x):
47 | if x is not np.nan:
48 | year = int(x[:4])
49 | return 2018 - year
50 |
51 |
52 | def mapper_month(x):
53 | if x is not np.nan:
54 | year = int(x[:4])
55 | month = int(x[5:7])
56 | return (2018 - year) * 12 + month
57 |
58 |
59 | def mapper_day(x):
60 | if x is not np.nan:
61 | year = int(x[:4])
62 | month = int(x[5:7])
63 | day = int(x[8:10])
64 | return (2018 - year) * 365 + month * 30 + day
65 |
66 |
67 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
68 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
69 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
70 |
71 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
72 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
73 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
74 |
75 |
76 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
77 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
78 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
79 |
80 | df_user['age'] = df_user['age'].fillna(5)
81 |
82 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
83 | print('check point ...')
84 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
85 |
86 | df_product_comment = df_product_comment.fillna(0)
87 |
88 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
89 |
90 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
91 |
92 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
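# Shop-level comment aggregates: sku-level comment counts are summed per shop, then joined onto the shop table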
93 |
94 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
95 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
96 |
97 |
98 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
99 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
100 |
101 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
102 |
103 |
104 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-04-02') & (jdata_data['action_time'] < '2018-04-16')][
105 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
106 |
107 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
108 |
109 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
110 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
111 |
112 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
113 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
114 |
115 |
116 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
117 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
118 |
119 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
120 |
121 |
122 | ### Keep six weeks of weekly action features: drop the weekly action columns outside this fold's feature window
123 | train_set = train_set.drop(['2018-04-02-2018-04-09-action_1', '2018-04-02-2018-04-09-action_2',
124 | '2018-04-02-2018-04-09-action_3', '2018-04-02-2018-04-09-action_4',
125 | '2018-03-26-2018-04-02-action_1', '2018-03-26-2018-04-02-action_2',
126 | '2018-03-26-2018-04-02-action_3', '2018-03-26-2018-04-02-action_4',
127 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
128 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
129 |
130 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
131 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
132 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
133 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
134 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
135 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
136 | '2018-02-12-2018-02-19-action_4'],axis=1)
137 |
138 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
139 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
140 |
141 | test_head=test_set[['user_id','cate','shop_id']]
142 | train_head=train_set[['user_id','cate','shop_id']]
143 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
144 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
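# Sanity check: train_set should now have exactly one more column (the label) than test_set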
145 | if(train_set.shape[1]-1==test_set.shape[1]):
146 | print('ok',train_set.shape[1])
147 | else:
148 | exit()
149 |
150 |
151 | # Prepare training/test arrays
152 | X_train = train_set.drop(['label'],axis=1).values
153 | y_train = train_set['label'].values
154 | X_test = test_set.values
155 |
156 | del train_set
157 | del test_set
158 |
159 | import gc
160 | gc.collect()
161 |
162 |
163 | # Model utility: stacking + bagging wrapper around LightGBM
164 | class SBBTree():
165 | """Stacking,Bootstap,Bagging----SBBTree"""
166 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
167 | """
168 | Initializes the SBBTree.
169 | Args:
170 | params : lgb params.
171 | stacking_num : number of stacking (k-fold) splits.
172 | bagging_num : number of bagging rounds.
173 | bagging_test_size : holdout fraction per bagging round.
174 | num_boost_round : max boosting rounds per model.
175 | early_stopping_rounds : early stopping patience.
176 | """
177 | self.params = params
178 | self.stacking_num = stacking_num
179 | self.bagging_num = bagging_num
180 | self.bagging_test_size = bagging_test_size
181 | self.num_boost_round = num_boost_round
182 | self.early_stopping_rounds = early_stopping_rounds
183 |
184 | self.model = lgb
185 | self.stacking_model = []
186 | self.bagging_model = []
187 |
188 | def fit(self, X, y):
189 | """ fit model. """
190 | if self.stacking_num > 1:
191 | layer_train = np.zeros((X.shape[0], 2))
192 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
193 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
194 | X_train = X[train_index]
195 | y_train = y[train_index]
196 | X_test = X[test_index]
197 | y_test = y[test_index]
198 |
199 | lgb_train = lgb.Dataset(X_train, y_train)
200 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
201 |
202 | gbm = lgb.train(self.params,
203 | lgb_train,
204 | num_boost_round=self.num_boost_round,
205 | valid_sets=lgb_eval,
206 | early_stopping_rounds=self.early_stopping_rounds,
207 | verbose_eval=300)
208 |
209 | self.stacking_model.append(gbm)
210 |
211 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
212 | layer_train[test_index, 1] = pred_y
213 |
214 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
215 | else:
216 | pass
217 | for bn in range(self.bagging_num):
218 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
219 |
220 | lgb_train = lgb.Dataset(X_train, y_train)
221 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
222 |
223 | gbm = lgb.train(self.params,
224 | lgb_train,
225 | num_boost_round=10000,
226 | valid_sets=lgb_eval,
227 | early_stopping_rounds=200,
228 | verbose_eval=300)
229 |
230 | self.bagging_model.append(gbm)
231 |
232 | def predict(self, X_pred):
233 | """ predict test data. """
234 | if self.stacking_num > 1:
235 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
236 | for sn,gbm in enumerate(self.stacking_model):
237 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
238 | test_pred[:, sn] = pred
239 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
240 | else:
241 | pass
242 | for bn,gbm in enumerate(self.bagging_model):
243 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
244 | if bn == 0:
245 | pred_out=pred
246 | else:
247 | pred_out+=pred
248 | return pred_out/self.bagging_num
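# SBBTree flow: fit() trains stacking_num out-of-fold LightGBM models and appends their
# out-of-fold predictions to X as one extra meta-feature, then trains bagging_num more
# models on random splits of the augmented X; predict() rebuilds the meta-feature from the
# mean of the stacking models and returns the average of the bagging models' predictions.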
249 |
250 | # LightGBM parameters
251 | params = {
252 | 'boosting_type': 'gbdt',
253 | 'objective': 'binary',
254 | 'metric': 'auc',
255 | 'learning_rate': 0.01,
256 | 'num_leaves': 2 ** 5 - 1,
257 | 'min_child_samples': 100,
258 | 'max_bin': 100,
259 | 'subsample': 0.8,
260 | 'subsample_freq': 1,
261 | 'colsample_bytree': 0.8,
262 | 'min_child_weight': 0,
263 | 'scale_pos_weight': 25,
264 | 'seed': 2019,
265 | 'nthread': 4,
266 | 'verbose': 0,
267 | }
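# Note: 'scale_pos_weight': 25 up-weights the rare purchase class in this heavily
# imbalanced task, and with 'learning_rate': 0.01 the effective number of trees is
# chosen by early stopping rather than by num_boost_round.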
268 |
269 | # Train and apply the model
270 | model = SBBTree(params=params,\
271 | stacking_num=5,\
272 | bagging_num=5,\
273 | bagging_test_size=0.33,\
274 | num_boost_round=10000,\
275 | early_stopping_rounds=200)
276 | model.fit(X_train, y_train)
277 | print('train is ok')
278 | y_predict = model.predict(X_test)
279 | print('pred test is ok')
280 | # y_train_predict = model.predict(X_train)
281 |
282 |
284 | test_head['pred_prob'] = y_predict
285 | test_head.to_csv('../feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
286 |
--------------------------------------------------------------------------------
/code/sbb4_train1.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 | df_train=pd.read_csv('../output/df_train.csv')
12 | df_test=pd.read_csv('../output/df_test.csv')
13 | ## Extra feature files can be loaded and merged here; if df_train changes, note in the output filename which feature files were used
14 | ### Feature flag: 1 = week-1 features only, 12 = weeks 1+2, 123 = weeks 1-3, 2 = week-2 features only
15 |
16 | # df_train_two=pd.read_csv('../output/df_train_two.csv')
17 | # df_test_two=pd.read_csv('../output/df_test_two.csv')
18 | # df_train = df_train.merge(df_train_two,on=['user_id','cate','shop_id'],how='left')
19 | # df_test = df_test.merge(df_test_two,on=['user_id','cate','shop_id'],how='left')
20 |
21 |
22 | df_user=pd.read_csv('../data/jdata_user.csv')
23 | df_comment=pd.read_csv('../data/jdata_comment.csv')
24 | df_shop=pd.read_csv('../data/jdata_shop.csv')
25 |
26 | # 1) Behavior data (jdata_action)
27 | jdata_action = pd.read_csv('../data/jdata_action.csv')
28 |
30 | # 3) Product data (jdata_product)
30 | jdata_product = pd.read_csv('../data/jdata_product.csv')
31 |
32 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
33 | label_flag = 1
34 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-09')
35 | & (jdata_data['action_time']<'2018-04-16')
36 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
37 | train_buy['label'] = 1
38 | # Candidate set: (user, cate, shop) triples with any action during '2018-03-12'-'2018-04-08', the four weeks before the label week
39 | win_size = 4  # candidate window in weeks: 2 = two weeks, 3 = three, 4 = four
40 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-12')
41 | & (jdata_data['action_time']<'2018-04-09')][['user_id','cate','shop_id']].drop_duplicates()
42 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
43 |
44 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
45 |
46 |
47 | def mapper_year(x):
48 | if x is not np.nan:
49 | year = int(x[:4])
50 | return 2018 - year
51 |
52 |
53 | def mapper_month(x):
54 | if x is not np.nan:
55 | year = int(x[:4])
56 | month = int(x[5:7])
57 | return (2018 - year) * 12 + month
58 |
59 |
60 | def mapper_day(x):
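    # Rough day index: every month is treated as 30 days, which keeps the feature monotone in time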
61 | if x is not np.nan:
62 | year = int(x[:4])
63 | month = int(x[5:7])
64 | day = int(x[8:10])
65 | return (2018 - year) * 365 + month * 30 + day
66 |
67 |
68 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
69 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
70 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
71 |
72 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
73 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
74 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
75 |
76 |
77 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
78 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
79 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
80 |
81 | df_user['age'] = df_user['age'].fillna(5)
82 |
83 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
84 | print('check point ...')
85 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
86 |
87 | df_product_comment = df_product_comment.fillna(0)
88 |
89 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
90 |
91 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
92 |
93 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
94 |
95 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
96 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
97 |
98 |
99 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
100 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
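# Note: rows with fans_num == 0 or comments == 0 produce inf/NaN in these ratio features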
101 |
102 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
103 |
104 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-19') & (jdata_data['action_time'] < '2018-04-16')][
105 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
106 |
107 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
108 |
109 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
110 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
111 |
112 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
113 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
114 |
115 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
116 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
117 |
118 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
119 |
120 | ### Keep six weeks of weekly action features: drop the weekly action columns outside this fold's feature window
121 | train_set = train_set.drop(['2018-02-19-2018-02-26-action_1', '2018-02-19-2018-02-26-action_2',
122 | '2018-02-19-2018-02-26-action_3', '2018-02-19-2018-02-26-action_4',
123 | '2018-02-12-2018-02-19-action_1', '2018-02-12-2018-02-19-action_2',
124 | '2018-02-12-2018-02-19-action_3', '2018-02-12-2018-02-19-action_4',
125 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
126 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
127 |
128 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
129 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
130 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
131 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
132 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
133 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
134 | '2018-02-12-2018-02-19-action_4'],axis=1)
135 |
136 |
137 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
138 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
139 |
140 | test_head=test_set[['user_id','cate','shop_id']]
141 | train_head=train_set[['user_id','cate','shop_id']]
142 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
143 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
144 |
145 |
146 | # Prepare training/test arrays
147 | X_train = train_set.drop(['label'],axis=1).values
148 | y_train = train_set['label'].values
149 | X_test = test_set.values
150 |
151 | del train_set
152 | del test_set
153 |
154 | import gc
155 | gc.collect()
156 |
157 |
158 |
159 | # Model utility: stacking + bagging wrapper around LightGBM
160 | class SBBTree():
161 | """Stacking,Bootstap,Bagging----SBBTree"""
162 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
163 | """
164 | Initializes the SBBTree.
165 | Args:
166 | params : lgb params.
167 | stacking_num : number of stacking (k-fold) splits.
168 | bagging_num : number of bagging rounds.
169 | bagging_test_size : holdout fraction per bagging round.
170 | num_boost_round : max boosting rounds per model.
171 | early_stopping_rounds : early stopping patience.
172 | """
173 | self.params = params
174 | self.stacking_num = stacking_num
175 | self.bagging_num = bagging_num
176 | self.bagging_test_size = bagging_test_size
177 | self.num_boost_round = num_boost_round
178 | self.early_stopping_rounds = early_stopping_rounds
179 |
180 | self.model = lgb
181 | self.stacking_model = []
182 | self.bagging_model = []
183 |
184 | def fit(self, X, y):
185 | """ fit model. """
186 | if self.stacking_num > 1:
187 | layer_train = np.zeros((X.shape[0], 2))
188 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
189 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
190 | X_train = X[train_index]
191 | y_train = y[train_index]
192 | X_test = X[test_index]
193 | y_test = y[test_index]
194 |
195 | lgb_train = lgb.Dataset(X_train, y_train)
196 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
197 |
198 | gbm = lgb.train(self.params,
199 | lgb_train,
200 | num_boost_round=self.num_boost_round,
201 | valid_sets=lgb_eval,
202 | early_stopping_rounds=self.early_stopping_rounds,
203 | verbose_eval=300)
204 |
205 | self.stacking_model.append(gbm)
206 |
207 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
208 | layer_train[test_index, 1] = pred_y
209 |
210 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
211 | else:
212 | pass
213 | for bn in range(self.bagging_num):
214 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
215 |
216 | lgb_train = lgb.Dataset(X_train, y_train)
217 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
218 |
219 | gbm = lgb.train(self.params,
220 | lgb_train,
221 | num_boost_round=10000,
222 | valid_sets=lgb_eval,
223 | early_stopping_rounds=200,
224 | verbose_eval=300)
225 |
226 | self.bagging_model.append(gbm)
227 |
228 | def predict(self, X_pred):
229 | """ predict test data. """
230 | if self.stacking_num > 1:
231 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
232 | for sn,gbm in enumerate(self.stacking_model):
233 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
234 | test_pred[:, sn] = pred
235 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
236 | else:
237 | pass
238 | for bn,gbm in enumerate(self.bagging_model):
239 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
240 | if bn == 0:
241 | pred_out=pred
242 | else:
243 | pred_out+=pred
244 | return pred_out/self.bagging_num
245 |
246 | # LightGBM parameters
247 | params = {
248 | 'boosting_type': 'gbdt',
249 | 'objective': 'binary',
250 | 'metric': 'auc',
251 | 'learning_rate': 0.01,
252 | 'num_leaves': 2 ** 5 - 1,
253 | 'min_child_samples': 100,
254 | 'max_bin': 100,
255 | 'subsample': 0.8,
256 | 'subsample_freq': 1,
257 | 'colsample_bytree': 0.8,
258 | 'min_child_weight': 0,
259 | 'scale_pos_weight': 25,
260 | 'seed': 2019,
261 | 'nthread': 4,
262 | 'verbose': 0,
263 | }
264 |
265 | # Train and apply the model
266 | model = SBBTree(params=params,\
267 | stacking_num=5,\
268 | bagging_num=5,\
269 | bagging_test_size=0.33,\
270 | num_boost_round=10000,\
271 | early_stopping_rounds=200)
272 | model.fit(X_train, y_train)
273 | print('train is ok')
274 | y_predict = model.predict(X_test)
275 | print('pred test is ok')
276 | # y_train_predict = model.predict(X_train)
277 |
278 |
280 | test_head['pred_prob'] = y_predict
281 | test_head.to_csv('../feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
282 |
--------------------------------------------------------------------------------
/code/sbb4_train2 .py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 | df_train=pd.read_csv('../output/df_train.csv')
12 | df_test=pd.read_csv('../output/df_test.csv')
13 | ## Extra feature files can be loaded and merged here; if df_train changes, note in the output filename which feature files were used
14 | ### Feature flag: 1 = week-1 features only, 12 = weeks 1+2, 123 = weeks 1-3, 2 = week-2 features only
15 |
16 | # df_train_two=pd.read_csv('../output/df_train_two.csv')
17 | # df_test_two=pd.read_csv('../output/df_test_two.csv')
18 | # df_train = df_train.merge(df_train_two,on=['user_id','cate','shop_id'],how='left')
19 | # df_test = df_test.merge(df_test_two,on=['user_id','cate','shop_id'],how='left')
20 |
21 |
22 | df_user=pd.read_csv('../data/jdata_user.csv')
23 | df_comment=pd.read_csv('../data/jdata_comment.csv')
24 | df_shop=pd.read_csv('../data/jdata_shop.csv')
25 |
26 | # 1) Behavior data (jdata_action)
27 | jdata_action = pd.read_csv('../data/jdata_action.csv')
28 |
29 | # 3) Product data (jdata_product)
30 | jdata_product = pd.read_csv('../data/jdata_product.csv')
31 |
32 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
33 | label_flag = 2
34 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-02')
35 | & (jdata_data['action_time']<'2018-04-09')
36 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
37 | train_buy['label'] = 1
38 | # Candidate set: (user, cate, shop) triples with any action during '2018-03-05'-'2018-04-01', the four weeks before the label week
39 | win_size = 4  # candidate window in weeks: 2 = two weeks, 3 = three, 4 = four
40 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-05')
41 | & (jdata_data['action_time']<'2018-04-02')][['user_id','cate','shop_id']].drop_duplicates()
42 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
43 |
44 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
45 |
46 |
47 | def mapper_year(x):
48 | if x is not np.nan:
49 | year = int(x[:4])
50 | return 2018 - year
51 |
52 |
53 | def mapper_month(x):
54 | if x is not np.nan:
55 | year = int(x[:4])
56 | month = int(x[5:7])
57 | return (2018 - year) * 12 + month
58 |
59 |
60 | def mapper_day(x):
61 | if x is not np.nan:
62 | year = int(x[:4])
63 | month = int(x[5:7])
64 | day = int(x[8:10])
65 | return (2018 - year) * 365 + month * 30 + day
66 |
67 |
68 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
69 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
70 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
71 |
72 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
73 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
74 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
75 |
76 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
77 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
78 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
79 |
80 | df_user['age'] = df_user['age'].fillna(5)
81 |
82 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
83 | print('check point ...')
84 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
85 |
86 | df_product_comment = df_product_comment.fillna(0)
87 |
88 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
89 |
90 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
91 |
92 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
93 |
94 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
95 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
96 |
97 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
98 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
99 |
100 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
101 |
102 |
103 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-19') & (jdata_data['action_time'] < '2018-04-16')][
104 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
105 |
106 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
107 |
108 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
109 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
110 |
111 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
112 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
113 |
114 |
115 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
116 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
117 |
118 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
119 |
120 |
121 | ### Keep six weeks of weekly action features: drop the weekly action columns outside this fold's feature window
122 | train_set = train_set.drop([
123 | '2018-04-02-2018-04-09-action_1', '2018-04-02-2018-04-09-action_2',
124 | '2018-04-02-2018-04-09-action_3', '2018-04-02-2018-04-09-action_4',
125 | '2018-02-12-2018-02-19-action_1', '2018-02-12-2018-02-19-action_2',
126 | '2018-02-12-2018-02-19-action_3', '2018-02-12-2018-02-19-action_4',
127 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
128 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
129 |
130 |
131 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
132 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
133 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
134 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
135 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
136 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
137 | '2018-02-12-2018-02-19-action_4'],axis=1)
138 |
139 |
140 |
141 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
142 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
143 |
144 | test_head=test_set[['user_id','cate','shop_id']]
145 | train_head=train_set[['user_id','cate','shop_id']]
146 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
147 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
148 |
149 |
150 | # Prepare training/test arrays
151 | X_train = train_set.drop(['label'],axis=1).values
152 | y_train = train_set['label'].values
153 | X_test = test_set.values
154 |
155 | del train_set
156 | del test_set
157 |
158 | import gc
159 | gc.collect()
160 |
161 | # Model utility: stacking + bagging wrapper around LightGBM
162 | class SBBTree():
163 | """Stacking,Bootstap,Bagging----SBBTree"""
164 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
165 | """
166 | Initializes the SBBTree.
167 | Args:
168 | params : lgb params.
169 | stacking_num : number of stacking (k-fold) splits.
170 | bagging_num : number of bagging rounds.
171 | bagging_test_size : holdout fraction per bagging round.
172 | num_boost_round : max boosting rounds per model.
173 | early_stopping_rounds : early stopping patience.
174 | """
175 | self.params = params
176 | self.stacking_num = stacking_num
177 | self.bagging_num = bagging_num
178 | self.bagging_test_size = bagging_test_size
179 | self.num_boost_round = num_boost_round
180 | self.early_stopping_rounds = early_stopping_rounds
181 |
182 | self.model = lgb
183 | self.stacking_model = []
184 | self.bagging_model = []
185 |
186 | def fit(self, X, y):
187 | """ fit model. """
188 | if self.stacking_num > 1:
189 | layer_train = np.zeros((X.shape[0], 2))
190 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
191 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
192 | X_train = X[train_index]
193 | y_train = y[train_index]
194 | X_test = X[test_index]
195 | y_test = y[test_index]
196 |
197 | lgb_train = lgb.Dataset(X_train, y_train)
198 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
199 |
200 | gbm = lgb.train(self.params,
201 | lgb_train,
202 | num_boost_round=self.num_boost_round,
203 | valid_sets=lgb_eval,
204 | early_stopping_rounds=self.early_stopping_rounds,
205 | verbose_eval=300)
206 |
207 | self.stacking_model.append(gbm)
208 |
209 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
210 | layer_train[test_index, 1] = pred_y
211 |
212 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
213 | else:
214 | pass
215 | for bn in range(self.bagging_num):
216 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
217 |
218 | lgb_train = lgb.Dataset(X_train, y_train)
219 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
220 |
221 | gbm = lgb.train(self.params,
222 | lgb_train,
223 | num_boost_round=10000,
224 | valid_sets=lgb_eval,
225 | early_stopping_rounds=200,
226 | verbose_eval=300)
227 |
228 | self.bagging_model.append(gbm)
229 |
230 | def predict(self, X_pred):
231 | """ predict test data. """
232 | if self.stacking_num > 1:
233 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
234 | for sn,gbm in enumerate(self.stacking_model):
235 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
236 | test_pred[:, sn] = pred
237 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
238 | else:
239 | pass
240 | for bn,gbm in enumerate(self.bagging_model):
241 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
242 | if bn == 0:
243 | pred_out=pred
244 | else:
245 | pred_out+=pred
246 | return pred_out/self.bagging_num
247 |
248 | # LightGBM parameters
249 | params = {
250 | 'boosting_type': 'gbdt',
251 | 'objective': 'binary',
252 | 'metric': 'auc',
253 | 'learning_rate': 0.01,
254 | 'num_leaves': 2 ** 5 - 1,
255 | 'min_child_samples': 100,
256 | 'max_bin': 100,
257 | 'subsample': 0.8,
258 | 'subsample_freq': 1,
259 | 'colsample_bytree': 0.8,
260 | 'min_child_weight': 0,
261 | 'scale_pos_weight': 25,
262 | 'seed': 2019,
263 | 'nthread': 4,
264 | 'verbose': 0,
265 | }
266 |
267 | # Train and apply the model
268 | model = SBBTree(params=params,\
269 | stacking_num=5,\
270 | bagging_num=5,\
271 | bagging_test_size=0.33,\
272 | num_boost_round=10000,\
273 | early_stopping_rounds=200)
274 | model.fit(X_train, y_train)
275 | print('train is ok')
276 | y_predict = model.predict(X_test)
277 | print('pred test is ok')
278 | # y_train_predict = model.predict(X_train)
279 |
281 | test_head['pred_prob'] = y_predict
282 | test_head.to_csv('../feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
283 |
--------------------------------------------------------------------------------
/code/sbb4_train3.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 |
12 | df_train=pd.read_csv('../output/df_train.csv')
13 | df_test=pd.read_csv('../output/df_test.csv')
14 | ## Extra feature files can be loaded and merged here; if df_train changes, note in the output filename which feature files were used
15 | ### Feature flag: 1 = week-1 features only, 12 = weeks 1+2, 123 = weeks 1-3, 2 = week-2 features only
16 |
17 | # df_train_two=pd.read_csv('../output/df_train_two.csv')
18 | # df_test_two=pd.read_csv('../output/df_test_two.csv')
19 | # df_train = df_train.merge(df_train_two,on=['user_id','cate','shop_id'],how='left')
20 | # df_test = df_test.merge(df_test_two,on=['user_id','cate','shop_id'],how='left')
21 |
22 |
23 | df_user=pd.read_csv('../data/jdata_user.csv')
24 | df_comment=pd.read_csv('../data/jdata_comment.csv')
25 | df_shop=pd.read_csv('../data/jdata_shop.csv')
26 |
27 | # 1) Behavior data (jdata_action)
28 | jdata_action = pd.read_csv('../data/jdata_action.csv')
29 |
30 | # 3) Product data (jdata_product)
31 | jdata_product = pd.read_csv('../data/jdata_product.csv')
32 |
33 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
34 | label_flag = 3
35 | train_buy = jdata_data[(jdata_data['action_time']>='2018-03-26')
36 | & (jdata_data['action_time']<'2018-04-02')
37 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
38 | train_buy['label'] = 1
39 | # Candidate set: (user, cate, shop) triples with any action during '2018-02-26'-'2018-03-25', the four weeks before the label week
40 | win_size = 4  # candidate window in weeks: 2 = two weeks, 3 = three, 4 = four
41 | train_set = jdata_data[(jdata_data['action_time']>='2018-02-26')
42 | & (jdata_data['action_time']<'2018-03-26')][['user_id','cate','shop_id']].drop_duplicates()
43 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
44 |
45 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
46 |
47 |
48 | def mapper_year(x):
49 | if x is not np.nan:
50 | year = int(x[:4])
51 | return 2018 - year
52 |
53 |
54 | def mapper_month(x):
55 | if x is not np.nan:
56 | year = int(x[:4])
57 | month = int(x[5:7])
58 | return (2018 - year) * 12 + month
59 |
60 |
61 | def mapper_day(x):
62 | if x is not np.nan:
63 | year = int(x[:4])
64 | month = int(x[5:7])
65 | day = int(x[8:10])
66 | return (2018 - year) * 365 + month * 30 + day
67 |
68 |
69 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
70 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
71 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
72 |
73 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
74 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
75 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
76 |
77 |
78 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
79 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
80 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
81 |
82 | df_user['age'] = df_user['age'].fillna(5)
83 |
84 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
85 | print('check point ...')
86 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
87 |
88 | df_product_comment = df_product_comment.fillna(0)
89 |
90 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
91 |
92 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
93 |
94 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
95 |
96 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
97 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
98 |
99 |
100 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
101 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
102 |
103 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
104 |
105 |
106 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-19') & (jdata_data['action_time'] < '2018-04-16')][
107 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
108 |
109 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
110 |
111 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
112 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
113 |
114 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
115 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
116 |
117 |
118 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
119 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
120 |
121 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
122 |
123 |
124 |
125 | ### Keep six weeks of weekly action features: drop the weekly action columns outside this fold's feature window
126 | train_set = train_set.drop(['2018-04-02-2018-04-09-action_1', '2018-04-02-2018-04-09-action_2',
127 | '2018-04-02-2018-04-09-action_3', '2018-04-02-2018-04-09-action_4',
128 | '2018-03-26-2018-04-02-action_1', '2018-03-26-2018-04-02-action_2',
129 | '2018-03-26-2018-04-02-action_3', '2018-03-26-2018-04-02-action_4',
130 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
131 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
132 |
133 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
134 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
135 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
136 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
137 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
138 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
139 | '2018-02-12-2018-02-19-action_4'],axis=1)
140 |
141 |
142 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
143 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
144 |
145 |
146 |
147 | test_head=test_set[['user_id','cate','shop_id']]
148 | train_head=train_set[['user_id','cate','shop_id']]
149 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
150 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
151 | if(train_set.shape[1]-1==test_set.shape[1]):
152 | print('ok',train_set.shape[1])
153 | else:
154 | exit()
155 |
156 | # Prepare training/test arrays
157 | X_train = train_set.drop(['label'],axis=1).values
158 | y_train = train_set['label'].values
159 | X_test = test_set.values
160 |
161 | del train_set
162 | del test_set
163 |
164 | import gc
165 | gc.collect()
166 |
167 |
168 | # Model utility: stacking + bagging wrapper around LightGBM
169 | class SBBTree():
170 | """Stacking,Bootstap,Bagging----SBBTree"""
171 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
172 | """
173 | Initializes the SBBTree.
174 | Args:
175 | params : lgb params.
176 | stacking_num : number of stacking (k-fold) splits.
177 | bagging_num : number of bagging rounds.
178 | bagging_test_size : holdout fraction per bagging round.
179 | num_boost_round : max boosting rounds per model.
180 | early_stopping_rounds : early stopping patience.
181 | """
182 | self.params = params
183 | self.stacking_num = stacking_num
184 | self.bagging_num = bagging_num
185 | self.bagging_test_size = bagging_test_size
186 | self.num_boost_round = num_boost_round
187 | self.early_stopping_rounds = early_stopping_rounds
188 |
189 | self.model = lgb
190 | self.stacking_model = []
191 | self.bagging_model = []
192 |
193 | def fit(self, X, y):
194 | """ fit model. """
195 | if self.stacking_num > 1:
196 | layer_train = np.zeros((X.shape[0], 2))
197 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
198 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
199 | X_train = X[train_index]
200 | y_train = y[train_index]
201 | X_test = X[test_index]
202 | y_test = y[test_index]
203 |
204 | lgb_train = lgb.Dataset(X_train, y_train)
205 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
206 |
207 | gbm = lgb.train(self.params,
208 | lgb_train,
209 | num_boost_round=self.num_boost_round,
210 | valid_sets=lgb_eval,
211 | early_stopping_rounds=self.early_stopping_rounds,
212 | verbose_eval=300)
213 |
214 | self.stacking_model.append(gbm)
215 |
216 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
217 | layer_train[test_index, 1] = pred_y
218 |
219 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
220 | else:
221 | pass
222 | for bn in range(self.bagging_num):
223 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
224 |
225 | lgb_train = lgb.Dataset(X_train, y_train)
226 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
227 |
228 | gbm = lgb.train(self.params,
229 | lgb_train,
230 | num_boost_round=10000,
231 | valid_sets=lgb_eval,
232 | early_stopping_rounds=200,
233 | verbose_eval=300)
234 |
235 | self.bagging_model.append(gbm)
236 |
237 | def predict(self, X_pred):
238 | """ predict test data. """
239 | if self.stacking_num > 1:
240 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
241 | for sn,gbm in enumerate(self.stacking_model):
242 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
243 | test_pred[:, sn] = pred
244 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
245 | else:
246 | pass
247 | for bn,gbm in enumerate(self.bagging_model):
248 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
249 | if bn == 0:
250 | pred_out=pred
251 | else:
252 | pred_out+=pred
253 | return pred_out/self.bagging_num
254 |
255 | # LightGBM parameters
256 | params = {
257 | 'boosting_type': 'gbdt',
258 | 'objective': 'binary',
259 | 'metric': 'auc',
260 | 'learning_rate': 0.01,
261 | 'num_leaves': 2 ** 5 - 1,
262 | 'min_child_samples': 100,
263 | 'max_bin': 100,
264 | 'subsample': 0.8,
265 | 'subsample_freq': 1,
266 | 'colsample_bytree': 0.8,
267 | 'min_child_weight': 0,
268 | 'scale_pos_weight': 25,
269 | 'seed': 2019,
270 | 'nthread': 4,
271 | 'verbose': 0,
272 | }
273 |
274 | # Train and apply the model
275 | model = SBBTree(params=params,\
276 | stacking_num=5,\
277 | bagging_num=5,\
278 | bagging_test_size=0.33,\
279 | num_boost_round=10000,\
280 | early_stopping_rounds=200)
281 | model.fit(X_train, y_train)
282 | print('train is ok')
283 | y_predict = model.predict(X_test)
284 | print('pred test is ok')
285 | # y_train_predict = model.predict(X_train)
286 |
287 |
289 | test_head['pred_prob'] = y_predict
290 | test_head.to_csv('../feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
291 |
--------------------------------------------------------------------------------
/code/sbb_train1.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 |
12 | df_train=pd.read_csv('../output/df_train.csv')
13 | df_test=pd.read_csv('../output/df_test.csv')
14 | ## Extra feature files can be loaded and merged here; if df_train changes, note in the output filename which feature files were used
15 | ### Feature flag: 1 = week-1 features only, 12 = weeks 1+2, 123 = weeks 1-3, 2 = week-2 features only
16 |
17 | # df_train_two=pd.read_csv('../output/df_train_two.csv')
18 | # df_test_two=pd.read_csv('../output/df_test_two.csv')
19 | # df_train = df_train.merge(df_train_two,on=['user_id','cate','shop_id'],how='left')
20 | # df_test = df_test.merge(df_test_two,on=['user_id','cate','shop_id'],how='left')
21 |
22 |
23 | df_user=pd.read_csv('../data/jdata_user.csv')
24 | df_comment=pd.read_csv('../data/jdata_comment.csv')
25 | df_shop=pd.read_csv('../data/jdata_shop.csv')
26 |
27 | # 1) Behavior data (jdata_action)
28 | jdata_action = pd.read_csv('../data/jdata_action.csv')
29 |
30 | # 3) Product data (jdata_product)
31 | jdata_product = pd.read_csv('../data/jdata_product.csv')
32 |
33 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
34 | label_flag = 1
35 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-09')&
36 | (jdata_data['action_time']<'2018-04-16')& (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
37 | train_buy['label'] = 1
38 | # Candidate set: (user, cate, shop) triples with any action during '2018-03-19'-'2018-04-08', the three weeks before the label week
39 | win_size = 3  # candidate window in weeks: 2 = two weeks, 3 = three, 4 = four
40 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-19')&
41 | (jdata_data['action_time']<'2018-04-09')][['user_id','cate','shop_id']].drop_duplicates()
42 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
43 |
44 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
45 |
46 |
47 | def mapper_year(x):
48 | if x is not np.nan:
49 | year = int(x[:4])
50 | return 2018 - year
51 |
52 |
53 | def mapper_month(x):
54 | if x is not np.nan:
55 | year = int(x[:4])
56 | month = int(x[5:7])
57 | return (2018 - year) * 12 + month
58 |
59 |
60 | def mapper_day(x):
61 | if x is not np.nan:
62 | year = int(x[:4])
63 | month = int(x[5:7])
64 | day = int(x[8:10])
65 | return (2018 - year) * 365 + month * 30 + day
66 |
67 |
68 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
69 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
70 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
71 |
72 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
73 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
74 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
75 |
76 |
77 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
78 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
79 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
80 |
81 | df_user['age'] = df_user['age'].fillna(5)
82 |
83 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
84 | print('check point ...')
85 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
86 |
87 | df_product_comment = df_product_comment.fillna(0)
88 |
89 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
90 |
91 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
92 |
93 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
94 |
95 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
96 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
97 |
98 |
99 |
100 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
101 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
102 |
103 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
104 |
105 |
106 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-26') & (jdata_data['action_time'] < '2018-04-16')][
107 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
108 |
109 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
110 |
111 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
112 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
113 |
114 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
115 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
116 |
117 |
118 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
119 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
120 |
121 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
122 |
123 |
124 |
125 | ### Keep six weeks of weekly action features: drop the weekly action columns outside this fold's feature window
126 | train_set = train_set.drop(['2018-02-19-2018-02-26-action_1', '2018-02-19-2018-02-26-action_2',
127 | '2018-02-19-2018-02-26-action_3', '2018-02-19-2018-02-26-action_4',
128 | '2018-02-12-2018-02-19-action_1', '2018-02-12-2018-02-19-action_2',
129 | '2018-02-12-2018-02-19-action_3', '2018-02-12-2018-02-19-action_4',
130 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
131 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
132 |
133 |
134 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
135 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
136 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
137 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
138 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
139 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
140 | '2018-02-12-2018-02-19-action_4'],axis=1)
141 |
142 |
143 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
144 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
145 |
146 |
147 | test_head=test_set[['user_id','cate','shop_id']]
148 | train_head=train_set[['user_id','cate','shop_id']]
149 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
150 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
151 |
152 |
153 | # Prepare training/test arrays
154 | X_train = train_set.drop(['label'],axis=1).values
155 | y_train = train_set['label'].values
156 | X_test = test_set.values
157 |
158 | ### Optional min-max normalization, left over from an RNN experiment; LightGBM does not need it
159 | def max_min_scaler(data):
160 |     for x in data.columns:
161 |         data[x] = (data[x] - data[x].min()) / (data[x].max() - data[x].min())
162 |     return data
163 | # train_X = max_min_scaler(train_X)  # train_X / test_X are never defined in this script
164 | # test_X = max_min_scaler(test_X)
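# If scaling were actually wanted at this point, X_train / X_test are already numpy
# arrays, so an array version along these lines (illustrative sketch) would apply:
# mins = X_train.min(axis=0)
# rng = X_train.max(axis=0) - mins
# rng[rng == 0] = 1.0
# X_train = (X_train - mins) / rng
# X_test = (X_test - mins) / rng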
165 |
166 |
167 | # Model utility: stacking + bagging wrapper around LightGBM
168 | class SBBTree():
169 | """Stacking,Bootstap,Bagging----SBBTree"""
170 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
171 | """
172 | Initializes the SBBTree.
173 | Args:
174 | params : lgb params.
175 | stacking_num : number of stacking (k-fold) splits.
176 | bagging_num : number of bagging rounds.
177 | bagging_test_size : holdout fraction per bagging round.
178 | num_boost_round : max boosting rounds per model.
179 | early_stopping_rounds : early stopping patience.
180 | """
181 | self.params = params
182 | self.stacking_num = stacking_num
183 | self.bagging_num = bagging_num
184 | self.bagging_test_size = bagging_test_size
185 | self.num_boost_round = num_boost_round
186 | self.early_stopping_rounds = early_stopping_rounds
187 |
188 | self.model = lgb
189 | self.stacking_model = []
190 | self.bagging_model = []
191 |
192 | def fit(self, X, y):
193 | """ fit model. """
194 | if self.stacking_num > 1:
195 | layer_train = np.zeros((X.shape[0], 2))
196 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
197 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
198 | X_train = X[train_index]
199 | y_train = y[train_index]
200 | X_test = X[test_index]
201 | y_test = y[test_index]
202 |
203 | lgb_train = lgb.Dataset(X_train, y_train)
204 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
205 |
206 | gbm = lgb.train(self.params,
207 | lgb_train,
208 | num_boost_round=self.num_boost_round,
209 | valid_sets=lgb_eval,
210 | early_stopping_rounds=self.early_stopping_rounds,
211 | verbose_eval=300)
212 |
213 | self.stacking_model.append(gbm)
214 |
215 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
216 | layer_train[test_index, 1] = pred_y
217 |
218 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
219 | else:
220 | pass
221 | for bn in range(self.bagging_num):
222 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
223 |
224 | lgb_train = lgb.Dataset(X_train, y_train)
225 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
226 |
227 | gbm = lgb.train(self.params,
228 | lgb_train,
229 | num_boost_round=10000,
230 | valid_sets=lgb_eval,
231 | early_stopping_rounds=200,
232 | verbose_eval=300)
233 |
234 | self.bagging_model.append(gbm)
235 |
236 | def predict(self, X_pred):
237 | """ predict test data. """
238 | if self.stacking_num > 1:
239 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
240 | for sn,gbm in enumerate(self.stacking_model):
241 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
242 | test_pred[:, sn] = pred
243 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
244 | else:
245 | pass
246 | for bn,gbm in enumerate(self.bagging_model):
247 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
248 | if bn == 0:
249 | pred_out=pred
250 | else:
251 | pred_out+=pred
252 | return pred_out/self.bagging_num
253 |
254 | # LightGBM parameters
255 | params = {
256 | 'boosting_type': 'gbdt',
257 | 'objective': 'binary',
258 | 'metric': 'auc',
259 | 'learning_rate': 0.01,
260 | 'num_leaves': 2 ** 5 - 1,
261 | 'min_child_samples': 100,
262 | 'max_bin': 100,
263 | 'subsample': 0.8,
264 | 'subsample_freq': 1,
265 | 'colsample_bytree': 0.8,
266 | 'min_child_weight': 0,
267 | 'scale_pos_weight': 25,
268 | 'seed': 2019,
269 | 'nthread': 4,
270 | 'verbose': 0,
271 | }
272 |
273 | # Train and apply the model
275 | model = SBBTree(params=params,\
276 | stacking_num=5,\
277 | bagging_num=5,\
278 | bagging_test_size=0.33,\
279 | num_boost_round=10000,\
280 | early_stopping_rounds=200)
281 | model.fit(X_train, y_train)
282 | print('train is ok')
283 | y_predict = model.predict(X_test)
284 | print('pred test is ok')
285 | # y_train_predict = model.predict(X_train)
286 |
287 |
289 | test_head['pred_prob'] = y_predict
290 | test_head.to_csv('../feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
291 |
--------------------------------------------------------------------------------
/code/sbb_train2.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 |
12 | df_train=pd.read_csv('../output/df_train.csv')
13 | df_test=pd.read_csv('../output/df_test.csv')
14 | ## Extra feature files can be loaded and merged here; if df_train changes, note in the output filename which feature files were used
15 | ### Feature flag: 1 = week-1 features only, 12 = weeks 1+2, 123 = weeks 1-3, 2 = week-2 features only
16 |
17 | # df_train_two=pd.read_csv('../output/df_train_two.csv')
18 | # df_test_two=pd.read_csv('../output/df_test_two.csv')
19 | # df_train = df_train.merge(df_train_two,on=['user_id','cate','shop_id'],how='left')
20 | # df_test = df_test.merge(df_test_two,on=['user_id','cate','shop_id'],how='left')
21 |
22 | df_user=pd.read_csv('../data/jdata_user.csv')
23 | df_comment=pd.read_csv('../data/jdata_comment.csv')
24 | df_shop=pd.read_csv('../data/jdata_shop.csv')
25 |
26 | # 1) Behavior data (jdata_action)
27 | jdata_action = pd.read_csv('../data/jdata_action.csv')
28 |
29 | # 3) Product data (jdata_product)
30 | jdata_product = pd.read_csv('../data/jdata_product.csv')
31 |
32 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
33 | label_flag = 2
34 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-02')
35 | & (jdata_data['action_time']<'2018-04-09')
36 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
37 | train_buy['label'] = 1
38 | # Candidate set: (user, cate, shop) triples with any action during '2018-03-12'-'2018-04-01', the three weeks before the label week
39 | win_size = 3  # candidate window in weeks: 2 = two weeks, 3 = three, 4 = four
40 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-12')
41 | & (jdata_data['action_time']<'2018-04-02')][['user_id','cate','shop_id']].drop_duplicates()
42 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
43 |
44 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
45 |
46 |
47 | def mapper_year(x):
48 | if x is not np.nan:
49 | year = int(x[:4])
50 | return 2018 - year
51 |
52 |
53 | def mapper_month(x):
54 | if x is not np.nan:
55 | year = int(x[:4])
56 | month = int(x[5:7])
57 | return (2018 - year) * 12 + month
58 |
59 |
60 | def mapper_day(x):
61 | if x is not np.nan:
62 | year = int(x[:4])
63 | month = int(x[5:7])
64 | day = int(x[8:10])
65 | return (2018 - year) * 365 + month * 30 + day
66 |
67 |
68 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
69 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
70 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
71 |
72 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
73 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
74 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
75 |
76 |
77 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
78 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
79 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
80 |
81 | df_user['age'] = df_user['age'].fillna(5)
82 |
83 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
84 | print('check point ...')
85 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
86 |
87 | df_product_comment = df_product_comment.fillna(0)
88 |
89 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
90 |
91 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
92 |
93 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
94 |
95 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
96 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
97 |
98 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
99 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
100 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
101 |
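# Editor's note: fans_num and comments can be 0, so the two ratios above may
# come out as inf rather than NaN. LightGBM handles NaN natively, but inf can
# cause trouble downstream; one possible guard (a sketch, not in the original):
#
# train_set[['vip_prob', 'goods_prob']] = (
#     train_set[['vip_prob', 'goods_prob']].replace([np.inf, -np.inf], np.nan))
# (and likewise for test_set further below)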
102 |
103 |
104 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-26') & (jdata_data['action_time'] < '2018-04-16')][
105 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
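# Test candidates mirror the training candidates: triples with any action in
# the three weeks '2018-03-26'-'2018-04-15' immediately before the prediction
# week '2018-04-16'-'2018-04-22'.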
106 |
107 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
108 |
109 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
110 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
111 |
112 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
113 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
114 |
115 |
116 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
117 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
118 |
119 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
120 |
121 |
122 | ### Keep six weeks of features: 2018-02-19 to 2018-04-01 for train, 2018-03-05 to 2018-04-15 for test
123 | train_set = train_set.drop([
124 | '2018-04-02-2018-04-09-action_1', '2018-04-02-2018-04-09-action_2',
125 | '2018-04-02-2018-04-09-action_3', '2018-04-02-2018-04-09-action_4',
126 | '2018-02-12-2018-02-19-action_1', '2018-02-12-2018-02-19-action_2',
127 | '2018-02-12-2018-02-19-action_3', '2018-02-12-2018-02-19-action_4',
128 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
129 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
130 |
131 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
132 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
133 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
134 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
135 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
136 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
137 | '2018-02-12-2018-02-19-action_4'],axis=1)
138 |
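# Editor's sketch: the hard-coded drops above follow the weekly-window column
# pattern '<start>-<end>-action_<type>', so they could also be generated, e.g.:
#
# def weekly_cols(start, end):
#     # all four action-count columns for the window [start, end)
#     return ['{}-{}-action_{}'.format(start, end, t) for t in range(1, 5)]
#
# drop_cols = (weekly_cols('2018-04-02', '2018-04-09')
#              + weekly_cols('2018-02-12', '2018-02-19')
#              + weekly_cols('2018-02-05', '2018-02-12'))
# train_set = train_set.drop(drop_cols, axis=1)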
139 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
140 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
141 |
142 | test_head=test_set[['user_id','cate','shop_id']]
143 | train_head=train_set[['user_id','cate','shop_id']]
144 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
145 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
146 |
147 |
148 | # Prepare the data
149 | X_train = train_set.drop(['label'],axis=1).values
150 | y_train = train_set['label'].values
151 | X_test = test_set.values
152 |
153 |
154 | del train_set
155 | del test_set
156 |
157 |
158 | # Model utility
159 | class SBBTree():
160 | """Stacking,Bootstap,Bagging----SBBTree"""
161 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
162 | """
163 | Initializes the SBBTree.
164 | Args:
165 | params : lgb params.
166 | stacking_num : k-fold stacking folds.
167 | bagging_num : bootstrap num.
168 | bagging_test_size : bootstrap sample rate.
169 | num_boost_round : boost num.
170 | early_stopping_rounds : early_stopping_rounds.
171 | """
172 | self.params = params
173 | self.stacking_num = stacking_num
174 | self.bagging_num = bagging_num
175 | self.bagging_test_size = bagging_test_size
176 | self.num_boost_round = num_boost_round
177 | self.early_stopping_rounds = early_stopping_rounds
178 |
179 | self.model = lgb
180 | self.stacking_model = []
181 | self.bagging_model = []
182 |
183 | def fit(self, X, y):
184 | """ fit model. """
185 | if self.stacking_num > 1:
186 | layer_train = np.zeros((X.shape[0], 2))
187 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
188 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
189 | X_train = X[train_index]
190 | y_train = y[train_index]
191 | X_test = X[test_index]
192 | y_test = y[test_index]
193 |
194 | lgb_train = lgb.Dataset(X_train, y_train)
195 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
196 |
197 | gbm = lgb.train(self.params,
198 | lgb_train,
199 | num_boost_round=self.num_boost_round,
200 | valid_sets=lgb_eval,
201 | early_stopping_rounds=self.early_stopping_rounds,
202 | verbose_eval=300)
203 |
204 | self.stacking_model.append(gbm)
205 |
206 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
207 | layer_train[test_index, 1] = pred_y
208 |
209 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))  # append out-of-fold predictions as a stacking feature
210 | else:
211 | pass
212 | for bn in range(self.bagging_num):
213 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
214 |
215 | lgb_train = lgb.Dataset(X_train, y_train)
216 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
217 |
218 | gbm = lgb.train(self.params,
219 | lgb_train,
220 | num_boost_round=10000,
221 | valid_sets=lgb_eval,
222 | early_stopping_rounds=200,
223 | verbose_eval=300)
224 |
225 | self.bagging_model.append(gbm)
226 |
227 | def predict(self, X_pred):
228 | """ predict test data. """
229 | if self.stacking_num > 1:
230 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
231 | for sn,gbm in enumerate(self.stacking_model):
232 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
233 | test_pred[:, sn] = pred
234 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
235 | else:
236 | pass
237 | for bn,gbm in enumerate(self.bagging_model):
238 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
239 | if bn == 0:
240 | pred_out=pred
241 | else:
242 | pred_out+=pred
243 | return pred_out/self.bagging_num
244 |
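# What fit()/predict() do, step by step: fit() first runs stacking_num-fold
# stacking, appending the out-of-fold predictions to X as one extra feature
# column; it then trains bagging_num LightGBM models on different random
# train/valid splits of the augmented matrix. predict() rebuilds the stacking
# column for the test matrix by averaging the fold models, then averages the
# bagging models' outputs for the final probability.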
245 | # Model parameters
246 | params = {
247 | 'boosting_type': 'gbdt',
248 | 'objective': 'binary',
249 | 'metric': 'auc',
250 | 'learning_rate': 0.01,
251 | 'num_leaves': 2 ** 5 - 1,
252 | 'min_child_samples': 100,
253 | 'max_bin': 100,
254 | 'subsample': 0.8,
255 | 'subsample_freq': 1,
256 | 'colsample_bytree': 0.8,
257 | 'min_child_weight': 0,
258 | 'scale_pos_weight': 25,  # upweight the rare positive (purchase) class
259 | 'seed': 2019,
260 | 'nthread': 4,
261 | 'verbose': 0,
262 | }
263 |
264 | # Fit the model and predict
265 | model = SBBTree(params=params,\
266 | stacking_num=5,\
267 | bagging_num=5,\
268 | bagging_test_size=0.33,\
269 | num_boost_round=10000,\
270 | early_stopping_rounds=200)
271 | model.fit(X_train, y_train)
272 | print('train is ok')
273 | y_predict = model.predict(X_test)
274 | print('pred test is ok')
275 | # y_train_predict = model.predict(X_train)
276 |
277 |
278 | from tqdm import tqdm
279 | test_head['pred_prob'] = y_predict
280 | test_head.to_csv('../feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
281 |
--------------------------------------------------------------------------------
/code/sbb_train3.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | import pandas as pd
8 | import numpy as np
9 | import datetime
10 | from sklearn.metrics import f1_score
11 | from sklearn.model_selection import train_test_split
12 | from sklearn.model_selection import KFold
13 | from sklearn.model_selection import StratifiedKFold
14 | import lightgbm as lgb
15 | pd.set_option('display.max_columns', None)
16 |
17 |
18 | # In[2]:
19 |
20 |
21 | df_train=pd.read_csv('../output/df_train.csv')
22 | df_test=pd.read_csv('../output/df_test.csv')
23 | ## Optionally load and merge extra feature files here; if df_train changes, note in the output file name which feature files were used
24 | ### Feature flag: 1 = one-week features only, 12 = one- plus two-week features, 123 = adds three-week features, 2 = two-week features only
25 |
26 | # df_train_two=pd.read_csv('../output/df_train_two.csv')
27 | # df_test_two=pd.read_csv('../output/df_test_two.csv')
28 | # df_train = df_train.merge(df_train_two,on=['user_id','cate','shop_id'],how='left')
29 | # df_test = df_test.merge(df_test_two,on=['user_id','cate','shop_id'],how='left')
30 |
31 |
32 | # In[15]:
33 |
34 |
35 | df_user=pd.read_csv('../data/jdata_user.csv')
36 | df_comment=pd.read_csv('../data/jdata_comment.csv')
37 | df_shop=pd.read_csv('../data/jdata_shop.csv')
38 |
39 | # 1) Action data (jdata_action)
40 | jdata_action = pd.read_csv('../data/jdata_action.csv')
41 |
42 | # 3) Product data (jdata_product)
43 | jdata_product = pd.read_csv('../data/jdata_product.csv')
44 |
45 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
46 | label_flag = 3
47 | train_buy = jdata_data[(jdata_data['action_time']>='2018-03-26') & (jdata_data['action_time']<'2018-04-02') & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
48 | train_buy['label'] = 1
49 | # Candidate set: (user, cate, shop) triples with any action during '2018-03-05'-'2018-03-25', the three weeks before the label week '2018-03-26'-'2018-04-01'
50 | win_size = 3  # 2 for a two-week candidate window, 3 for three weeks
51 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-05') & (jdata_data['action_time']<'2018-03-26')][['user_id','cate','shop_id']].drop_duplicates()
52 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
53 |
54 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
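# The merges above build the supervised sample: a candidate (user, cate,
# shop) triple gets label=1 if it placed an order (type==2) during the label
# week '2018-03-26'-'2018-04-01' and label=0 otherwise, after which the
# df_train merge attaches the precomputed weekly behaviour features.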
55 |
56 |
57 | # In[17]:
58 |
59 |
60 | def mapper_year(x):
61 | if x is not np.nan:
62 | year = int(x[:4])
63 | return 2018 - year
64 |
65 |
66 | def mapper_month(x):
67 | if x is not np.nan:
68 | year = int(x[:4])
69 | month = int(x[5:7])
70 | return (2018 - year) * 12 + month
71 |
72 |
73 | def mapper_day(x):
74 | if x is not np.nan:
75 | year = int(x[:4])
76 | month = int(x[5:7])
77 | day = int(x[8:10])
78 | return (2018 - year) * 365 + month * 30 + day
79 |
80 |
81 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
82 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
83 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
84 |
85 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
86 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
87 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
88 |
89 |
90 | # In[25]:
91 |
92 |
93 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
94 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
95 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
96 |
97 | df_user['age'] = df_user['age'].fillna(5)
98 |
99 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
100 | print('check point ...')
101 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
102 |
103 | df_product_comment = df_product_comment.fillna(0)
104 |
105 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
106 |
107 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
108 |
109 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
110 |
111 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
112 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
113 |
114 |
115 | # In[30]:
116 |
117 |
118 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
119 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
120 |
121 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
122 |
123 |
124 | # In[35]:
125 |
126 |
127 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-26') & (jdata_data['action_time'] < '2018-04-16')][
128 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
129 |
130 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
131 |
132 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
133 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
134 |
135 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
136 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
137 |
138 |
139 | # In[36]:
140 |
141 |
142 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
143 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
144 |
145 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
146 |
147 |
148 | # In[40]:
149 |
150 |
151 | ### Keep six weeks of features: 2018-02-12 to 2018-03-25 for train, 2018-03-05 to 2018-04-15 for test
152 | train_set = train_set.drop(['2018-04-02-2018-04-09-action_1', '2018-04-02-2018-04-09-action_2',
153 | '2018-04-02-2018-04-09-action_3', '2018-04-02-2018-04-09-action_4',
154 | '2018-03-26-2018-04-02-action_1', '2018-03-26-2018-04-02-action_2',
155 | '2018-03-26-2018-04-02-action_3', '2018-03-26-2018-04-02-action_4',
156 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
157 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
158 |
159 |
160 | # In[41]:
161 |
162 |
163 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
164 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
165 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
166 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
167 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
168 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
169 | '2018-02-12-2018-02-19-action_4'],axis=1)
170 |
171 |
172 | # In[44]:
173 |
174 |
175 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
176 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
177 |
178 |
179 | # In[45]:
180 |
181 |
182 | test_head=test_set[['user_id','cate','shop_id']]
183 | train_head=train_set[['user_id','cate','shop_id']]
184 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
185 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
186 | if train_set.shape[1] - 1 == test_set.shape[1]:  # train should have exactly one extra column: the label
187 |     print('ok', train_set.shape[1])
188 | else:
189 |     exit()
190 | # In[46]:
191 |
192 |
193 |
194 | # Prepare the data
195 | X_train = train_set.drop(['label'],axis=1).values
196 | y_train = train_set['label'].values
197 | X_test = test_set.values
198 |
199 |
200 | # In[ ]:
201 |
202 |
203 | del train_set
204 | del test_set
205 |
206 |
207 | # In[48]:
208 |
209 |
210 | import gc
211 | gc.collect()
212 |
213 |
214 | # In[50]:
215 |
216 |
217 | # Model utility
218 | class SBBTree():
219 | """Stacking,Bootstap,Bagging----SBBTree"""
220 | """ author:Cookly 洪鹏飞 """
221 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
222 | """
223 | Initializes the SBBTree.
224 | Args:
225 | params : lgb params.
226 | stacking_num : k-fold stacking folds.
227 | bagging_num : bootstrap num.
228 | bagging_test_size : bootstrap sample rate.
229 | num_boost_round : boost num.
230 | early_stopping_rounds : early_stopping_rounds.
231 | """
232 | self.params = params
233 | self.stacking_num = stacking_num
234 | self.bagging_num = bagging_num
235 | self.bagging_test_size = bagging_test_size
236 | self.num_boost_round = num_boost_round
237 | self.early_stopping_rounds = early_stopping_rounds
238 |
239 | self.model = lgb
240 | self.stacking_model = []
241 | self.bagging_model = []
242 |
243 | def fit(self, X, y):
244 | """ fit model. """
245 | if self.stacking_num > 1:
246 | layer_train = np.zeros((X.shape[0], 2))
247 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
248 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
249 | X_train = X[train_index]
250 | y_train = y[train_index]
251 | X_test = X[test_index]
252 | y_test = y[test_index]
253 |
254 | lgb_train = lgb.Dataset(X_train, y_train)
255 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
256 |
257 | gbm = lgb.train(self.params,
258 | lgb_train,
259 | num_boost_round=self.num_boost_round,
260 | valid_sets=lgb_eval,
261 | early_stopping_rounds=self.early_stopping_rounds,
262 | verbose_eval=300)
263 |
264 | self.stacking_model.append(gbm)
265 |
266 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
267 | layer_train[test_index, 1] = pred_y
268 |
269 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
270 | else:
271 | pass
272 | for bn in range(self.bagging_num):
273 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
274 |
275 | lgb_train = lgb.Dataset(X_train, y_train)
276 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
277 |
278 | gbm = lgb.train(self.params,
279 | lgb_train,
280 | num_boost_round=10000,
281 | valid_sets=lgb_eval,
282 | early_stopping_rounds=200,
283 | verbose_eval=300)
284 |
285 | self.bagging_model.append(gbm)
286 |
287 | def predict(self, X_pred):
288 | """ predict test data. """
289 | if self.stacking_num > 1:
290 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
291 | for sn,gbm in enumerate(self.stacking_model):
292 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
293 | test_pred[:, sn] = pred
294 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
295 | else:
296 | pass
297 | for bn,gbm in enumerate(self.bagging_model):
298 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
299 | if bn == 0:
300 | pred_out=pred
301 | else:
302 | pred_out+=pred
303 | return pred_out/self.bagging_num
304 |
305 | # Model parameters
306 | params = {
307 | 'boosting_type': 'gbdt',
308 | 'objective': 'binary',
309 | 'metric': 'auc',
310 | 'learning_rate': 0.01,
311 | 'num_leaves': 2 ** 5 - 1,
312 | 'min_child_samples': 100,
313 | 'max_bin': 100,
314 | 'subsample': 0.8,
315 | 'subsample_freq': 1,
316 | 'colsample_bytree': 0.8,
317 | 'min_child_weight': 0,
318 | 'scale_pos_weight': 25,
319 | 'seed': 2019,
320 | 'nthread': 4,
321 | 'verbose': 0,
322 | }
323 |
324 | # Fit the model and predict
325 | model = SBBTree(params=params, stacking_num=5, bagging_num=5, bagging_test_size=0.33, num_boost_round=10000, early_stopping_rounds=200)
326 | model.fit(X_train, y_train)
327 | print('train is ok')
328 | y_predict = model.predict(X_test)
329 | print('pred test is ok')
330 | # y_train_predict = model.predict(X_train)
331 |
332 |
333 | # In[ ]:
334 |
335 |
336 | from tqdm import tqdm
337 | test_head['pred_prob'] = y_predict
338 | test_head.to_csv('../feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
339 |
340 |
341 |
--------------------------------------------------------------------------------
/code/xgb_model/xgb_train3.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from datetime import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import xgboost as xgb
9 | pd.set_option('display.max_columns', None)
10 |
11 |
12 | ## Downcast columns to reduce memory when reading files; adapted from 鱼佬's Tencent-competition code
13 | def reduce_mem_usage(df, verbose=True):
14 | numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
15 | start_mem = df.memory_usage().sum() / 1024**2
16 | for col in df.columns:
17 | col_type = df[col].dtypes
18 | if col_type in numerics:
19 | c_min = df[col].min()
20 | c_max = df[col].max()
21 | if str(col_type)[:3] == 'int':
22 | if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
23 | df[col] = df[col].astype(np.int8)
24 | elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
25 | df[col] = df[col].astype(np.int16)
26 | elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
27 | df[col] = df[col].astype(np.int32)
28 | elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
29 | df[col] = df[col].astype(np.int64)
30 | else:
31 | if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
32 | df[col] = df[col].astype(np.float16)
33 | elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
34 | df[col] = df[col].astype(np.float32)
35 | else:
36 | df[col] = df[col].astype(np.float64)
37 | end_mem = df.memory_usage().sum() / 1024**2
38 | if verbose:
39 | print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
40 | return df
41 |
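# Editor's note: the float16 branch keeps only about 3 significant decimal
# digits and overflows above ~65504, which can distort ratio features and
# large IDs. A more conservative variant skips float16 (a sketch, not in the
# original):
#
# if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
#     df[col] = df[col].astype(np.float32)
# else:
#     df[col] = df[col].astype(np.float64)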
42 | df_train = reduce_mem_usage(pd.read_csv('./df_train.csv'))
43 | df_test = reduce_mem_usage(pd.read_csv('./df_test.csv'))
44 | ## Optionally load and merge extra feature files here; if df_train changes, note in the output file name which feature files were used
45 | ### Feature flag: 1 = one-week features only, 12 = one- plus two-week features, 123 = adds three-week features, 2 = two-week features only
46 |
47 |
48 | df_user=reduce_mem_usage(pd.read_csv('./jdata_user.csv'))
49 | df_comment=reduce_mem_usage(pd.read_csv('./jdata_comment.csv'))
50 | df_shop=reduce_mem_usage(pd.read_csv('./jdata_shop.csv'))
51 |
52 | # 1) Action data (jdata_action)
53 | jdata_action = reduce_mem_usage(pd.read_csv('./jdata_action.csv'))
54 |
55 | # 3) Product data (jdata_product)
56 | jdata_product = reduce_mem_usage(pd.read_csv('./jdata_product.csv'))
57 |
58 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
59 | label_flag = 3
60 | train_buy = jdata_data[(jdata_data['action_time']>='2018-03-26') & (jdata_data['action_time']<'2018-04-02') & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
61 | train_buy['label'] = 1
62 | # Candidate set: (user, cate, shop) triples with any action during '2018-03-05'-'2018-03-25', the three weeks before the label week '2018-03-26'-'2018-04-01'
63 | win_size = 3  # 2 for a two-week candidate window, 3 for three weeks
64 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-05') & (jdata_data['action_time']<'2018-03-26')][['user_id','cate','shop_id']].drop_duplicates()
65 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
66 |
67 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
68 |
69 |
70 | def mapper_year(x):
71 | if x is not np.nan:
72 | year = int(x[:4])
73 | return 2018 - year
74 |
75 |
76 | def mapper_month(x):
77 | if x is not np.nan:
78 | year = int(x[:4])
79 | month = int(x[5:7])
80 | return (2018 - year) * 12 + month
81 |
82 |
83 | def mapper_day(x):
84 | if x is not np.nan:
85 | year = int(x[:4])
86 | month = int(x[5:7])
87 | day = int(x[8:10])
88 | return (2018 - year) * 365 + month * 30 + day
89 |
90 |
91 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
92 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
93 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
94 |
95 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
96 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
97 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
98 |
99 |
100 |
101 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
102 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
103 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
104 |
105 | df_user['age'] = df_user['age'].fillna(5)
106 |
107 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
108 | print('check point ...')
109 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
110 |
111 | df_product_comment = df_product_comment.fillna(0)
112 |
113 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
114 |
115 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
116 |
117 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
118 |
119 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
120 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
121 |
122 |
123 |
124 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
125 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
126 |
127 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
128 |
129 |
130 |
131 |
132 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-26') & (jdata_data['action_time'] < '2018-04-16')][
133 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
134 |
135 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
136 |
137 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
138 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
139 |
140 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
141 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
142 |
143 |
144 |
145 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
146 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
147 |
148 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
149 |
150 |
151 |
152 | ### Keep six weeks of features: 2018-02-12 to 2018-03-25 for train, 2018-03-05 to 2018-04-15 for test
153 | train_set = train_set.drop(['2018-04-02-2018-04-09-action_1', '2018-04-02-2018-04-09-action_2',
154 | '2018-04-02-2018-04-09-action_3', '2018-04-02-2018-04-09-action_4',
155 | '2018-03-26-2018-04-02-action_1', '2018-03-26-2018-04-02-action_2',
156 | '2018-03-26-2018-04-02-action_3', '2018-03-26-2018-04-02-action_4',
157 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
158 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
159 |
160 |
161 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
162 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
163 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
164 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
165 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
166 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
167 | '2018-02-12-2018-02-19-action_4'],axis=1)
168 |
169 |
170 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
171 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
172 |
173 |
174 |
175 | test_head=test_set[['user_id','cate','shop_id']]
176 | train_head=train_set[['user_id','cate','shop_id']]
177 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
178 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
179 |
180 |
181 | # Prepare the data
182 | X_train = train_set.drop(['label'],axis=1).values
183 | y_train = train_set['label'].values
184 | X_test = test_set.values
185 |
186 |
187 | del train_set
188 | del test_set
189 |
190 |
191 |
192 |
193 | # Model utility
194 | class SBBTree():
195 | """Stacking,Bootstap,Bagging----SBBTree"""
196 |
197 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
198 | """
199 | Initializes the SBBTree.
200 | Args:
201 | params : xgb params.
202 | stacking_num : k-fold stacking folds.
203 | bagging_num : bootstrap num.
204 | bagging_test_size : bootstrap sample rate.
205 | num_boost_round : boost num.
206 | early_stopping_rounds : early_stopping_rounds.
207 | """
208 | self.params = params
209 | self.stacking_num = stacking_num
210 | self.bagging_num = bagging_num
211 | self.bagging_test_size = bagging_test_size
212 | self.num_boost_round = num_boost_round
213 | self.early_stopping_rounds = early_stopping_rounds
214 |
215 | self.model = xgb
216 | self.stacking_model = []
217 | self.bagging_model = []
218 |
219 | def fit(self, X, y):
220 | """ fit model. """
221 | if self.stacking_num > 1:
222 | layer_train = np.zeros((X.shape[0], 2))
223 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
224 | for k, (train_index, test_index) in enumerate(self.SK.split(X, y)):
225 | print('fold_{}'.format(k))
226 | X_train = X[train_index]
227 | y_train = y[train_index]
228 | X_test = X[test_index]
229 | y_test = y[test_index]
230 |
231 | xgb_train = xgb.DMatrix(X_train, y_train)
232 | xgb_eval = xgb.DMatrix(X_test, y_test)
233 | watchlist = [(xgb_train, 'train'), (xgb_eval, 'valid')]
234 |
235 | xgb_model = xgb.train(dtrain=xgb_train,
236 | num_boost_round=self.num_boost_round,
237 | evals=watchlist,
238 | early_stopping_rounds=self.early_stopping_rounds,
239 | verbose_eval=300,
240 | params=self.params)
241 | self.stacking_model.append(xgb_model)
242 |
243 | pred_y = xgb_model.predict(xgb.DMatrix(X_test), ntree_limit=xgb_model.best_ntree_limit)
244 | layer_train[test_index, 1] = pred_y
245 |
246 | X = np.hstack((X, layer_train[:, 1].reshape((-1, 1))))
247 | else:
248 | pass
249 | for bn in range(self.bagging_num):
250 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
251 |
252 | xgb_train = xgb.DMatrix(X_train, y_train)
253 | xgb_eval = xgb.DMatrix(X_test, y_test)
254 | watchlist = [(xgb_train, 'train'), (xgb_eval, 'valid')]
255 |
256 | xgb_model = xgb.train(dtrain=xgb_train,
257 | num_boost_round=10000,
258 | evals=watchlist,
259 | early_stopping_rounds=200,
260 | verbose_eval=300,
261 | params=self.params)
262 |
263 | self.bagging_model.append(xgb_model)
264 |
265 | def predict(self, X_pred):
266 | """ predict test data. """
267 | if self.stacking_num > 1:
268 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
269 | for sn, gbm in enumerate(self.stacking_model):
270 | pred = gbm.predict(xgb.DMatrix(X_pred), ntree_limit=gbm.best_ntree_limit)
271 | test_pred[:, sn] = pred
272 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1, 1))))
273 | else:
274 | pass
275 | for bn, gbm in enumerate(self.bagging_model):
276 | pred = gbm.predict(xgb.DMatrix(X_pred), ntree_limit=gbm.best_ntree_limit)
277 | if bn == 0:
278 | pred_out = pred
279 | else:
280 | pred_out += pred
281 | return pred_out / self.bagging_num
282 |
283 |
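# Same stacking + bagging scheme as the LightGBM scripts, ported to the native
# xgboost API: DMatrix wraps the data, the watchlist drives early stopping,
# and best_ntree_limit truncates predictions at the early-stopped round.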
284 | # Model parameters
285 | params = {
286 | 'booster': 'gbtree',
287 | 'tree_method': 'exact',
288 | 'eta': 0.01,
289 | 'max_depth': 7,
290 | 'gamma': 0.1,
291 | "min_child_weight": 1.1, # 6 0.06339878
292 | 'subsample': 0.7,
293 | 'colsample_bytree': 0.7, # 0.06349307
294 | 'colsample_bylevel': 0.7,
295 | 'objective': 'binary:logistic',
296 | 'eval_metric': 'auc',
297 | 'silent': True,
298 | 'lambda': 3, # 0.06365710
299 | 'nthread': 24,
300 | 'seed': 42}
301 |
302 | # Fit the model and predict
303 | model = SBBTree(params=params, \
304 | stacking_num=5, \
305 | bagging_num=5, \
306 | bagging_test_size=0.33, \
307 | num_boost_round=10000, \
308 | early_stopping_rounds=200)
309 | model.fit(X_train, y_train)
310 |
311 |
312 | print('train is ok')
313 | y_predict = model.predict(X_test)
314 | print('pred test is ok')
315 | # y_train_predict = model.predict(X_train)
316 |
317 |
318 | # In[ ]:
319 |
320 |
321 | from tqdm import tqdm
322 | test_head['pred_prob'] = y_predict
323 | test_head.to_csv('feature/'+str(win_size)+'_xgb_get_'+str(label_flag)+'_test.csv',index=False)
324 |
--------------------------------------------------------------------------------
/picture/huachuang.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcxanhui/JDATA-2019/7019cf545a88cb14c55d5f4198fa5251a291648c/picture/huachuang.PNG
--------------------------------------------------------------------------------
/picture/time_series.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcxanhui/JDATA-2019/7019cf545a88cb14c55d5f4198fa5251a291648c/picture/time_series.PNG
--------------------------------------------------------------------------------