├── README.md
├── Readme.txt
├── code
│   ├── EDA13.py
│   ├── EDA16-fourWeek.py
│   ├── EDA16-fourWeek_rightTime.py
│   ├── EDA16-threeWeek.py
│   ├── EDA16-threeWeek_rightTime.py
│   ├── EDA16-twoWeek.py
│   ├── cat_model
│   │   └── para.py
│   ├── df_train_test.py
│   ├── gen_result.py
│   ├── gen_result2.py
│   ├── lgb_model
│   │   ├── lgb_train1.py
│   │   ├── lgb_train2.py
│   │   └── lgb_train3.py
│   ├── run.sh
│   ├── sbb2_train1.py
│   ├── sbb2_train2.py
│   ├── sbb2_train3.py
│   ├── sbb4_train1.py
│   ├── sbb4_train2 .py
│   ├── sbb4_train3.py
│   ├── sbb_train1.py
│   ├── sbb_train2.py
│   ├── sbb_train3.py
│   └── xgb_model
│       ├── xgb_train1.py
│       ├── xgb_train2.py
│       └── xgb_train3.py
└── picture
    ├── huachuang.PNG
    └── time_series.PNG
/README.md:
--------------------------------------------------------------------------------
1 | # JD JDATA Algorithm Competition 2019 - Predicting User Purchases from Shops under a Category
2 |
3 | ## Competition Overview
4 | Competition page: [JDATA 2019 - Predicting user purchases from shops under a category](https://jdata.jd.com/html/detail.html?id=8)
5 |
6 |
7 | ### Task
8 | The competition provides data on users, shops, and products, including shop and product content, comment data, and rich user interaction behavior. Teams use data mining and machine learning to build a model that predicts which users will buy from which shops under the relevant categories, outputting user-shop-category matches that supply high-quality target audiences for precision marketing. The organizers also hope teams will uncover the latent meaning behind the data and provide intelligent, mutually beneficial solutions for the shops and users of the e-commerce platform.
9 | In short: for every user appearing in the training set, the model must predict that user's intention to buy from `a specific shop` under `a specific target category` within the next 7 days.
10 |
11 | ### Data
12 | 1. Training data
13 | Behavior, comment, and user data from `2018-02-01` to `2018-04-15`, covering users in set U acting on a subset of the products in set S.
14 | 2. Prediction data
15 | Predict which categories and shops each user in U buys from between `2018-04-16` and `2018-04-22`; a user buys from a shop under a category at most once.
16 | 3. Data tables
17 | 
18 | ### Evaluation
19 | (1) Whether the user buys from the category between `2018-04-16` and `2018-04-22`. The submission file contains only the users and categories predicted to order (pairs predicted not to order must not appear). Duplicate "user-category" rows are de-duplicated during evaluation; a correct prediction is scored with label=1, an incorrect one with label=0.
20 | (2) If the user does buy from the category, the model must also predict which shop under that category the purchase comes from; a correct shop prediction is scored with pred=1, an incorrect one with pred=0.
21 | A submission is scored as `score = 0.4*F11 + 0.6*F12`, where the F1 value is defined as:
22 | 
23 | where Precise is the precision and Recall is the recall, i.e. `F1 = 2*Precise*Recall/(Precise+Recall)`; F11 is the F1 over label=1/0 and F12 is the F1 over pred=1/0.
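
A rough, unofficial sketch of how this set-based score can be computed locally (the pair sets and their names below are hypothetical, not the official evaluator):

```python
def set_f1(pred_pairs, true_pairs):
    """F1 between a predicted set and a ground-truth set of tuples."""
    hits = len(pred_pairs & true_pairs)
    if hits == 0:
        return 0.0
    precise = hits / len(pred_pairs)
    recall = hits / len(true_pairs)
    return 2 * precise * recall / (precise + recall)

# pred_uc / true_uc: sets of (user_id, cate); pred_ucs / true_ucs: sets of (user_id, cate, shop_id)
score = 0.4 * set_f1(pred_uc, true_uc) + 0.6 * set_f1(pred_ucs, true_ucs)
```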
24 |
25 | ## Approach
26 | This competition is a classic class-imbalance problem: purchase vs. non-purchase behavior occurs at a ratio of roughly 1:100.
27 | Time sliding windows + LightGBM + XGBoost + CatBoost.
28 | Because type==5 (add-to-cart) only exists for the single week of April 8 to April 15, it suffers from missing data, so type==5 records are dropped when building both the training and the test sets.
29 | Following the sliding-window scheme, one week serves as the prediction (label) window, users active in the preceding 1/2/3 weeks form the training candidates, and features are extracted from 2/4/6 weeks of user-shop-product interactions.
30 | The figure below illustrates the case where 3 weeks of users form the training candidates and 6 weeks of user-shop-product interactions are used for feature extraction; a code sketch of the split follows the figures.
31 | 
32 | 
33 | * A leaderboard (online): 0.0614, Rank 7
34 | * B leaderboard (online): 0.0605, Rank 16
35 |
36 | ## Environment
37 | run.sh
38 |
39 | ## References
40 | * Official approach discussion: [Zhihu: unofficial Q&A thread for the 3rd JData competition](https://zhuanlan.zhihu.com/p/64503113)
41 | * Reference code for the competition: [JDATA3 purchase prediction - a modeling primer](https://mp.weixin.qq.com/s?__biz=Mzg2MTEwNDQxNQ==&mid=2247483702&idx=1&sn=df621247b4790471063ddbeb15ad81c3&chksm=ce1d7146f96af85001e47999cb447d86820b082570c39de0c4ddc18dcba0b233697d5ef2e0ae&mpshare=1&scene=23&srcid=#rd)
42 | * Introduction to sliding windows: [The "sliding window" method in data-mining competitions](https://blog.csdn.net/oXiaoBuDianEr123/article/details/79309022)
43 |
--------------------------------------------------------------------------------
/Readme.txt:
--------------------------------------------------------------------------------
1 | Training
2 | EDA13 aggregates each week's behavior into summary features, producing df_train and df_test; results go to output
3 | Files starting with EDA16 use the first feature set and train with different sliding windows over different time ranges; results go to output
4 | Files starting with sbb use the second feature set and train with different windows over the same time range; results go to feature
5 |
6 | Prediction
7 | Run run.sh to obtain the result CSVs for the different feature sets; the final submission is produced by voting fusion over the CSVs from different features and different sliding windows (see the sketch below)
8 |
9 | How to run:
10 | Put the raw data in data, run run.sh end to end, and the final result appears in submit
11 |
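
A minimal sketch of the voting fusion, mirroring gen_result.py (the input file names here are hypothetical):

    import pandas as pd

    keys = ['user_id', 'cate', 'shop_id']
    subs = [pd.read_csv(f) for f in ['a.csv', 'b.csv', 'c.csv']]  # per-model result CSVs

    # Union of all predicted triples, then one indicator column per model
    all_item = pd.concat(subs).drop_duplicates()
    for i, s in enumerate(subs):
        s = s.copy()
        s['label%d' % i] = 1
        all_item = all_item.merge(s, on=keys, how='left')
    all_item = all_item.fillna(0)
    votes = all_item.drop(columns=keys).sum(axis=1)

    # Keep the triples predicted by at least two models
    all_item[votes >= 2][keys].to_csv('final.csv', index=False)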
--------------------------------------------------------------------------------
/code/EDA13.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | from sklearn.model_selection import train_test_split
3 | import numpy as np
4 | from tqdm import tqdm
5 | import lightgbm as lgb
6 | from joblib import dump
7 |
8 |
9 | df_action=pd.read_csv("../data/jdata_action.csv")
10 | df_product=pd.read_csv("../data/jdata_product.csv")
11 |
12 | df_action=pd.merge(df_action,df_product,how='left',on='sku_id')
13 | df_action=df_action.groupby(['user_id','shop_id','cate'], as_index=False).sum()
14 |
15 | df_action=df_action[['user_id','shop_id','cate']]
16 | df_action_head=df_action.copy()
17 |
18 | df_action=pd.read_csv("../data/jdata_action.csv")
19 |
20 | def makeActionData(startDate,endDate):
21 | df=df_action[(df_action['action_time']>startDate)&(df_action['action_time']<=endDate)]
--------------------------------------------------------------------------------
/code/EDA16-fourWeek.py:
--------------------------------------------------------------------------------
23 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-09') \
24 | & (jdata_data['action_time']<='2018-04-15') \
25 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
26 | train_buy['label'] = 1
27 | # Candidate set: (user, cate, shop) triples active in the last four weeks, '2018-03-12' - '2018-04-08'
28 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-12') \
29 | & (jdata_data['action_time']<='2018-04-08')][['user_id','cate','shop_id']].drop_duplicates()
30 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
31 |
32 |
33 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
34 |
35 | def mapper(x):
36 | if x is not np.nan:
37 | year=int(x[:4])
38 | return 2018-year
39 |
40 |
41 | df_user['user_reg_tm']=df_user['user_reg_tm'].apply(lambda x:mapper(x))
42 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].apply(lambda x:mapper(x))
43 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].fillna(df_shop['shop_reg_tm'].mean())
44 | df_user['age']=df_user['age'].fillna(df_user['age'].mean())
45 | df_comment=pd.read_csv('../data/jdata_comment.csv')
46 | df_comment=df_comment.groupby(['sku_id'],as_index=False).sum()
47 | df_product=pd.read_csv('../data/jdata_product.csv')
48 | df_product_comment=pd.merge(df_product,df_comment,on='sku_id',how='left')
49 | df_product_comment=df_product_comment.fillna(0)
50 | df_product_comment=df_product_comment.groupby(['shop_id'],as_index=False).sum()
51 | df_product_comment=df_product_comment.drop(['sku_id','brand','cate'],axis=1)
52 | df_shop_product_comment=pd.merge(df_shop,df_product_comment,how='left',on='shop_id')
53 |
54 |
55 | train_set=pd.merge(train_set,df_user,how='left',on='user_id')
56 | train_set=pd.merge(train_set,df_shop_product_comment,on='shop_id',how='left')
57 |
58 | test_set = jdata_data[(jdata_data['action_time']>='2018-03-19') \
59 | & (jdata_data['action_time']<='2018-04-15')][['user_id','cate','shop_id']].drop_duplicates()
60 |
61 | test_set = test_set.merge(df_test,on=['user_id','cate','shop_id'],how='left')
62 |
63 | del df_train
64 | del df_test
65 |
66 | test_set=pd.merge(test_set,df_user,how='left',on='user_id')
67 | test_set=pd.merge(test_set,df_shop_product_comment,on='shop_id',how='left')
68 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
69 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
70 |
71 | test_head=test_set[['user_id','cate','shop_id']]
72 | train_head=train_set[['user_id','cate','shop_id']]
73 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
74 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
75 |
76 | # Prepare the data
77 | X_train = train_set.drop(['label'],axis=1).values
78 | y_train = train_set['label'].values
79 | X_test = test_set.values
80 |
81 | del test_set
82 | del train_set
83 |
84 | # Model utility
85 | class SBBTree():
86 | """Stacking,Bootstap,Bagging----SBBTree"""
87 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
88 | """
89 | Initializes the SBBTree.
90 | Args:
91 | params : lgb params.
92 | stacking_num : k-fold stacking.
93 | bagging_num : bootstrap num.
94 | bagging_test_size : bootstrap sample rate.
95 | num_boost_round : boost num.
96 | early_stopping_rounds : early_stopping_rounds.
97 | """
98 | self.params = params
99 | self.stacking_num = stacking_num
100 | self.bagging_num = bagging_num
101 | self.bagging_test_size = bagging_test_size
102 | self.num_boost_round = num_boost_round
103 | self.early_stopping_rounds = early_stopping_rounds
104 |
105 | self.model = lgb
106 | self.stacking_model = []
107 | self.bagging_model = []
108 |
109 | def fit(self, X, y):
110 | """ fit model. """
111 | if self.stacking_num > 1:
112 | layer_train = np.zeros((X.shape[0], 2))
113 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
114 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
115 | X_train = X[train_index]
116 | y_train = y[train_index]
117 | X_test = X[test_index]
118 | y_test = y[test_index]
119 |
120 | lgb_train = lgb.Dataset(X_train, y_train)
121 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
122 |
123 | gbm = lgb.train(self.params,
124 | lgb_train,
125 | num_boost_round=self.num_boost_round,
126 | valid_sets=lgb_eval,
127 | early_stopping_rounds=self.early_stopping_rounds,
128 | verbose_eval=300)
129 |
130 | self.stacking_model.append(gbm)
131 |
132 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
133 | layer_train[test_index, 1] = pred_y
134 |
135 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
136 | else:
137 | pass
138 | for bn in range(self.bagging_num):
139 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
140 |
141 | lgb_train = lgb.Dataset(X_train, y_train)
142 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
143 |
144 | gbm = lgb.train(self.params,
145 | lgb_train,
146 | num_boost_round=10000,
147 | valid_sets=lgb_eval,
148 | early_stopping_rounds=200,
149 | verbose_eval=300)
150 |
151 | self.bagging_model.append(gbm)
152 |
153 | def predict(self, X_pred):
154 | """ predict test data. """
155 | if self.stacking_num > 1:
156 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
157 | for sn,gbm in enumerate(self.stacking_model):
158 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
159 | test_pred[:, sn] = pred
160 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
161 | else:
162 | pass
163 | for bn,gbm in enumerate(self.bagging_model):
164 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
165 | if bn == 0:
166 | pred_out=pred
167 | else:
168 | pred_out+=pred
169 | return pred_out/self.bagging_num
170 |
171 | # Model parameters
172 | params = {
173 | 'boosting_type': 'gbdt',
174 | 'objective': 'binary',
175 | 'metric': 'auc',
176 | 'learning_rate': 0.01,
177 | 'num_leaves': 2 ** 5 - 1,
178 | 'min_child_samples': 100,
179 | 'max_bin': 100,
180 | 'subsample': .7,
181 | 'subsample_freq': 1,
182 | 'colsample_bytree': 0.7,
183 | 'min_child_weight': 0,
184 | 'scale_pos_weight': 25,
185 | 'seed': 2018,
186 | 'nthread': 16,
187 | 'verbose': 0,
188 | }
189 |
190 | # Fit the model and predict
191 | model = SBBTree(params=params,\
192 | stacking_num=5,\
193 | bagging_num=5,\
194 | bagging_test_size=0.33,\
195 | num_boost_round=10000,\
196 | early_stopping_rounds=200)
197 | model.fit(X_train, y_train)
198 | y_predict = model.predict(X_test)
199 | #y_train_predict = model.predict(X_train)
200 |
201 |
202 | test_head['pred_prob'] = y_predict
203 | test_head.to_csv('../output/EDA16-fourWeek.csv',index=False)
204 |
205 |
206 | fourOld = test_head[test_head['pred_prob'] >= 0.60][['user_id', 'cate', 'shop_id']]
207 | fourOld.to_csv('../output/res_fourWeekOld60.csv', index=False)
208 |
--------------------------------------------------------------------------------
/code/EDA16-fourWeek_rightTime.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | import lightgbm as lgb
5 | from sklearn.metrics import f1_score
6 | from sklearn.model_selection import train_test_split
7 | from sklearn.model_selection import KFold
8 | from sklearn.model_selection import StratifiedKFold
9 | pd.set_option('display.max_columns', None)
10 |
11 | df_train=pd.read_csv('../output/df_train.csv')
12 | df_test=pd.read_csv('../output/df_test.csv')
13 | df_user=pd.read_csv('../data/jdata_user.csv')
14 | df_comment=pd.read_csv('../data/jdata_comment.csv')
15 | df_shop=pd.read_csv('../data/jdata_shop.csv')
16 |
17 | # 1) Action data (jdata_action)
18 | jdata_action = pd.read_csv('../data/jdata_action.csv')
19 | # 3) Product data (jdata_product)
20 | jdata_product = pd.read_csv('../data/jdata_product.csv')
21 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
22 |
23 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-09') \
24 | & (jdata_data['action_time']<'2018-04-16') \
25 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
26 | train_buy['label'] = 1
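# Note on the "_rightTime" variants: action_time strings carry a time component, so these
# comparisons are lexicographic. '2018-04-15 10:00:00' <= '2018-04-15' is False, while
# '2018-04-15 10:00:00' < '2018-04-16' is True; using < '2018-04-16' therefore keeps the
# whole final day, which appears to be the fix these variants make over the older scripts.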
27 | # Candidate set: (user, cate, shop) triples active in the last four weeks, '2018-03-12' - '2018-04-08'
28 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-12') \
29 | & (jdata_data['action_time']<'2018-04-09')][['user_id','cate','shop_id']].drop_duplicates()
30 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
31 |
32 |
33 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
34 |
35 | def mapper(x):
36 | if x is not np.nan:
37 | year=int(x[:4])
38 | return 2018-year
39 |
40 |
41 | df_user['user_reg_tm']=df_user['user_reg_tm'].apply(lambda x:mapper(x))
42 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].apply(lambda x:mapper(x))
43 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].fillna(df_shop['shop_reg_tm'].mean())
44 | df_user['age']=df_user['age'].fillna(df_user['age'].mean())
45 | df_comment=pd.read_csv('../data/jdata_comment.csv')
46 | df_comment=df_comment.groupby(['sku_id'],as_index=False).sum()
47 | df_product=pd.read_csv('../data/jdata_product.csv')
48 | df_product_comment=pd.merge(df_product,df_comment,on='sku_id',how='left')
49 | df_product_comment=df_product_comment.fillna(0)
50 | df_product_comment=df_product_comment.groupby(['shop_id'],as_index=False).sum()
51 | df_product_comment=df_product_comment.drop(['sku_id','brand','cate'],axis=1)
52 | df_shop_product_comment=pd.merge(df_shop,df_product_comment,how='left',on='shop_id')
53 |
54 |
55 | train_set=pd.merge(train_set,df_user,how='left',on='user_id')
56 | train_set=pd.merge(train_set,df_shop_product_comment,on='shop_id',how='left')
57 |
58 | test_set = jdata_data[(jdata_data['action_time']>='2018-03-19') \
59 | & (jdata_data['action_time']<'2018-04-16')][['user_id','cate','shop_id']].drop_duplicates()
60 |
61 | test_set = test_set.merge(df_test,on=['user_id','cate','shop_id'],how='left')
62 |
63 | del df_train
64 | del df_test
65 |
66 | test_set=pd.merge(test_set,df_user,how='left',on='user_id')
67 | test_set=pd.merge(test_set,df_shop_product_comment,on='shop_id',how='left')
68 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
69 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
70 |
71 | test_head=test_set[['user_id','cate','shop_id']]
72 | train_head=train_set[['user_id','cate','shop_id']]
73 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
74 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
75 |
76 | # Prepare the data
77 | X_train = train_set.drop(['label'],axis=1).values
78 | y_train = train_set['label'].values
79 | X_test = test_set.values
80 |
81 | del test_set
82 | del train_set
83 |
84 | # Model utility
85 | class SBBTree():
86 | """Stacking,Bootstap,Bagging----SBBTree"""
87 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
88 | """
89 | Initializes the SBBTree.
90 | Args:
91 | params : lgb params.
92 | stacking_num : k-fold stacking.
93 | bagging_num : bootstrap num.
94 | bagging_test_size : bootstrap sample rate.
95 | num_boost_round : boost num.
96 | early_stopping_rounds : early_stopping_rounds.
97 | """
98 | self.params = params
99 | self.stacking_num = stacking_num
100 | self.bagging_num = bagging_num
101 | self.bagging_test_size = bagging_test_size
102 | self.num_boost_round = num_boost_round
103 | self.early_stopping_rounds = early_stopping_rounds
104 |
105 | self.model = lgb
106 | self.stacking_model = []
107 | self.bagging_model = []
108 |
109 | def fit(self, X, y):
110 | """ fit model. """
111 | if self.stacking_num > 1:
112 | layer_train = np.zeros((X.shape[0], 2))
113 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
114 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
115 | X_train = X[train_index]
116 | y_train = y[train_index]
117 | X_test = X[test_index]
118 | y_test = y[test_index]
119 |
120 | lgb_train = lgb.Dataset(X_train, y_train)
121 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
122 |
123 | gbm = lgb.train(self.params,
124 | lgb_train,
125 | num_boost_round=self.num_boost_round,
126 | valid_sets=lgb_eval,
127 | early_stopping_rounds=self.early_stopping_rounds,
128 | verbose_eval=300)
129 |
130 | self.stacking_model.append(gbm)
131 |
132 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
133 | layer_train[test_index, 1] = pred_y
134 |
135 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
136 | else:
137 | pass
138 | for bn in range(self.bagging_num):
139 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
140 |
141 | lgb_train = lgb.Dataset(X_train, y_train)
142 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
143 |
144 | gbm = lgb.train(self.params,
145 | lgb_train,
146 | num_boost_round=10000,
147 | valid_sets=lgb_eval,
148 | early_stopping_rounds=200,
149 | verbose_eval=300)
150 |
151 | self.bagging_model.append(gbm)
152 |
153 | def predict(self, X_pred):
154 | """ predict test data. """
155 | if self.stacking_num > 1:
156 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
157 | for sn,gbm in enumerate(self.stacking_model):
158 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
159 | test_pred[:, sn] = pred
160 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
161 | else:
162 | pass
163 | for bn,gbm in enumerate(self.bagging_model):
164 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
165 | if bn == 0:
166 | pred_out=pred
167 | else:
168 | pred_out+=pred
169 | return pred_out/self.bagging_num
170 |
171 | # Model parameters
172 | params = {
173 | 'boosting_type': 'gbdt',
174 | 'objective': 'binary',
175 | 'metric': 'auc',
176 | 'learning_rate': 0.01,
177 | 'num_leaves': 2 ** 5 - 1,
178 | 'min_child_samples': 100,
179 | 'max_bin': 100,
180 | 'subsample': .7,
181 | 'subsample_freq': 1,
182 | 'colsample_bytree': 0.7,
183 | 'min_child_weight': 0,
184 | 'scale_pos_weight': 25,
185 | 'seed': 2018,
186 | 'nthread': 16,
187 | 'verbose': 0,
188 | }
189 |
190 | # Fit the model and predict
191 | model = SBBTree(params=params,\
192 | stacking_num=5,\
193 | bagging_num=5,\
194 | bagging_test_size=0.33,\
195 | num_boost_round=10000,\
196 | early_stopping_rounds=200)
197 | model.fit(X_train, y_train)
198 | y_predict = model.predict(X_test)
199 | #y_train_predict = model.predict(X_train)
200 |
201 |
202 | test_head['pred_prob'] = y_predict
203 | test_head.to_csv('../output/EDA16-fourWeek_rightTime.csv',index=False)
204 |
205 |
206 | fourNew = test_head[test_head['pred_prob'] >= 0.675][['user_id', 'cate', 'shop_id']]
207 | fourNew.to_csv('../output/res_fourWeekNew675.csv', index=False)
208 |
--------------------------------------------------------------------------------
/code/EDA16-threeWeek.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | import lightgbm as lgb
5 | from sklearn.metrics import f1_score
6 | from sklearn.model_selection import train_test_split
7 | from sklearn.model_selection import KFold
8 | from sklearn.model_selection import StratifiedKFold
9 | pd.set_option('display.max_columns', None)
10 |
11 | df_train=pd.read_csv('../output/df_train.csv')
12 | df_test=pd.read_csv('../output/df_test.csv')
13 | df_user=pd.read_csv('../data/jdata_user.csv')
14 | df_comment=pd.read_csv('../data/jdata_comment.csv')
15 | df_shop=pd.read_csv('../data/jdata_shop.csv')
16 |
17 | # 1) Action data (jdata_action)
18 | jdata_action = pd.read_csv('../data/jdata_action.csv')
19 | # 3) Product data (jdata_product)
20 | jdata_product = pd.read_csv('../data/jdata_product.csv')
21 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
22 |
23 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-09') \
24 | & (jdata_data['action_time']<='2018-04-15') \
25 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
26 | train_buy['label'] = 1
27 | # Candidate set: (user, cate, shop) triples active in the last three weeks, '2018-03-19' - '2018-04-08'
28 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-19') \
29 | & (jdata_data['action_time']<='2018-04-08')][['user_id','cate','shop_id']].drop_duplicates()
30 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
31 |
32 |
33 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
34 |
35 | def mapper(x):
36 | if x is not np.nan:
37 | year=int(x[:4])
38 | return 2018-year
39 |
40 |
41 | df_user['user_reg_tm']=df_user['user_reg_tm'].apply(lambda x:mapper(x))
42 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].apply(lambda x:mapper(x))
43 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].fillna(df_shop['shop_reg_tm'].mean())
44 | df_user['age']=df_user['age'].fillna(df_user['age'].mean())
45 | df_comment=pd.read_csv('../data/jdata_comment.csv')
46 | df_comment=df_comment.groupby(['sku_id'],as_index=False).sum()
47 | df_product=pd.read_csv('../data/jdata_product.csv')
48 | df_product_comment=pd.merge(df_product,df_comment,on='sku_id',how='left')
49 | df_product_comment=df_product_comment.fillna(0)
50 | df_product_comment=df_product_comment.groupby(['shop_id'],as_index=False).sum()
51 | df_product_comment=df_product_comment.drop(['sku_id','brand','cate'],axis=1)
52 | df_shop_product_comment=pd.merge(df_shop,df_product_comment,how='left',on='shop_id')
53 |
54 |
55 | train_set=pd.merge(train_set,df_user,how='left',on='user_id')
56 | train_set=pd.merge(train_set,df_shop_product_comment,on='shop_id',how='left')
57 |
58 | test_set = jdata_data[(jdata_data['action_time']>='2018-03-26') \
59 | & (jdata_data['action_time']<='2018-04-15')][['user_id','cate','shop_id']].drop_duplicates()
60 |
61 | test_set = test_set.merge(df_test,on=['user_id','cate','shop_id'],how='left')
62 |
63 | del df_train
64 | del df_test
65 |
66 | test_set=pd.merge(test_set,df_user,how='left',on='user_id')
67 | test_set=pd.merge(test_set,df_shop_product_comment,on='shop_id',how='left')
68 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
69 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
70 |
71 | test_head=test_set[['user_id','cate','shop_id']]
72 | train_head=train_set[['user_id','cate','shop_id']]
73 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
74 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
75 |
76 | # Prepare the data
77 | X_train = train_set.drop(['label'],axis=1).values
78 | y_train = train_set['label'].values
79 | X_test = test_set.values
80 |
81 | del test_set
82 | del train_set
83 |
84 | # Model utility
85 | class SBBTree():
86 | """Stacking,Bootstap,Bagging----SBBTree"""
87 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
88 | """
89 | Initializes the SBBTree.
90 | Args:
91 | params : lgb params.
92 | stacking_num : k-fold stacking.
93 | bagging_num : bootstrap num.
94 | bagging_test_size : bootstrap sample rate.
95 | num_boost_round : boost num.
96 | early_stopping_rounds : early_stopping_rounds.
97 | """
98 | self.params = params
99 | self.stacking_num = stacking_num
100 | self.bagging_num = bagging_num
101 | self.bagging_test_size = bagging_test_size
102 | self.num_boost_round = num_boost_round
103 | self.early_stopping_rounds = early_stopping_rounds
104 |
105 | self.model = lgb
106 | self.stacking_model = []
107 | self.bagging_model = []
108 |
109 | def fit(self, X, y):
110 | """ fit model. """
111 | if self.stacking_num > 1:
112 | layer_train = np.zeros((X.shape[0], 2))
113 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
114 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
115 | X_train = X[train_index]
116 | y_train = y[train_index]
117 | X_test = X[test_index]
118 | y_test = y[test_index]
119 |
120 | lgb_train = lgb.Dataset(X_train, y_train)
121 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
122 |
123 | gbm = lgb.train(self.params,
124 | lgb_train,
125 | num_boost_round=self.num_boost_round,
126 | valid_sets=lgb_eval,
127 | early_stopping_rounds=self.early_stopping_rounds,
128 | verbose_eval=300)
129 |
130 | self.stacking_model.append(gbm)
131 |
132 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
133 | layer_train[test_index, 1] = pred_y
134 |
135 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
136 | else:
137 | pass
138 | for bn in range(self.bagging_num):
139 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
140 |
141 | lgb_train = lgb.Dataset(X_train, y_train)
142 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
143 |
144 | gbm = lgb.train(self.params,
145 | lgb_train,
146 | num_boost_round=10000,
147 | valid_sets=lgb_eval,
148 | early_stopping_rounds=200,
149 | verbose_eval=300)
150 |
151 | self.bagging_model.append(gbm)
152 |
153 | def predict(self, X_pred):
154 | """ predict test data. """
155 | if self.stacking_num > 1:
156 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
157 | for sn,gbm in enumerate(self.stacking_model):
158 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
159 | test_pred[:, sn] = pred
160 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
161 | else:
162 | pass
163 | for bn,gbm in enumerate(self.bagging_model):
164 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
165 | if bn == 0:
166 | pred_out=pred
167 | else:
168 | pred_out+=pred
169 | return pred_out/self.bagging_num
170 |
171 | # Model parameters
172 | params = {
173 | 'boosting_type': 'gbdt',
174 | 'objective': 'binary',
175 | 'metric': 'auc',
176 | 'learning_rate': 0.01,
177 | 'num_leaves': 2 ** 5 - 1,
178 | 'min_child_samples': 100,
179 | 'max_bin': 100,
180 | 'subsample': .7,
181 | 'subsample_freq': 1,
182 | 'colsample_bytree': 0.7,
183 | 'min_child_weight': 0,
184 | 'scale_pos_weight': 25,
185 | 'seed': 2018,
186 | 'nthread': 16,
187 | 'verbose': 0,
188 | }
189 |
190 | # Fit the model and predict
191 | model = SBBTree(params=params,\
192 | stacking_num=5,\
193 | bagging_num=5,\
194 | bagging_test_size=0.33,\
195 | num_boost_round=10000,\
196 | early_stopping_rounds=200)
197 | model.fit(X_train, y_train)
198 | y_predict = model.predict(X_test)
199 | #y_train_predict = model.predict(X_train)
200 |
201 |
202 | test_head['pred_prob'] = y_predict
203 | test_head.to_csv('../output/EDA16-threeWeek.csv',index=False)
204 |
205 |
206 | threeOld = test_head[test_head['pred_prob'] >= 0.595][['user_id', 'cate', 'shop_id']]
207 | threeOld.to_csv('../output/res_threeWeekOld595.csv', index=False)
208 |
--------------------------------------------------------------------------------
/code/EDA16-threeWeek_rightTime.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | import lightgbm as lgb
5 | from sklearn.metrics import f1_score
6 | from sklearn.model_selection import train_test_split
7 | from sklearn.model_selection import KFold
8 | from sklearn.model_selection import StratifiedKFold
9 | pd.set_option('display.max_columns', None)
10 |
11 | df_train=pd.read_csv('../output/df_train.csv')
12 | df_test=pd.read_csv('../output/df_test.csv')
13 | df_user=pd.read_csv('../data/jdata_user.csv')
14 | df_comment=pd.read_csv('../data/jdata_comment.csv')
15 | df_shop=pd.read_csv('../data/jdata_shop.csv')
16 |
17 | # 1) Action data (jdata_action)
18 | jdata_action = pd.read_csv('../data/jdata_action.csv')
19 | # 3) Product data (jdata_product)
20 | jdata_product = pd.read_csv('../data/jdata_product.csv')
21 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
22 |
23 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-09') \
24 | & (jdata_data['action_time']<'2018-04-16') \
25 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
26 | train_buy['label'] = 1
27 | # Candidate set: (user, cate, shop) triples active in the last three weeks, '2018-03-19' - '2018-04-08'
28 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-19') \
29 | & (jdata_data['action_time']<'2018-04-09')][['user_id','cate','shop_id']].drop_duplicates()
30 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
31 |
32 |
33 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
34 |
35 | def mapper(x):
36 | if x is not np.nan:
37 | year=int(x[:4])
38 | return 2018-year
39 |
40 |
41 | df_user['user_reg_tm']=df_user['user_reg_tm'].apply(lambda x:mapper(x))
42 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].apply(lambda x:mapper(x))
43 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].fillna(df_shop['shop_reg_tm'].mean())
44 | df_user['age']=df_user['age'].fillna(df_user['age'].mean())
45 | df_comment=pd.read_csv('../data/jdata_comment.csv')
46 | df_comment=df_comment.groupby(['sku_id'],as_index=False).sum()
47 | df_product=pd.read_csv('../data/jdata_product.csv')
48 | df_product_comment=pd.merge(df_product,df_comment,on='sku_id',how='left')
49 | df_product_comment=df_product_comment.fillna(0)
50 | df_product_comment=df_product_comment.groupby(['shop_id'],as_index=False).sum()
51 | df_product_comment=df_product_comment.drop(['sku_id','brand','cate'],axis=1)
52 | df_shop_product_comment=pd.merge(df_shop,df_product_comment,how='left',on='shop_id')
53 |
54 |
55 | train_set=pd.merge(train_set,df_user,how='left',on='user_id')
56 | train_set=pd.merge(train_set,df_shop_product_comment,on='shop_id',how='left')
57 |
58 | test_set = jdata_data[(jdata_data['action_time']>='2018-03-26') \
59 | & (jdata_data['action_time']<'2018-04-16')][['user_id','cate','shop_id']].drop_duplicates()
60 |
61 | test_set = test_set.merge(df_test,on=['user_id','cate','shop_id'],how='left')
62 |
63 | del df_train
64 | del df_test
65 |
66 | test_set=pd.merge(test_set,df_user,how='left',on='user_id')
67 | test_set=pd.merge(test_set,df_shop_product_comment,on='shop_id',how='left')
68 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
69 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
70 |
71 | test_head=test_set[['user_id','cate','shop_id']]
72 | train_head=train_set[['user_id','cate','shop_id']]
73 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
74 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
75 |
76 | # Prepare the data
77 | X_train = train_set.drop(['label'],axis=1).values
78 | y_train = train_set['label'].values
79 | X_test = test_set.values
80 |
81 | del test_set
82 | del train_set
83 |
84 | # Model utility
85 | class SBBTree():
86 | """Stacking,Bootstap,Bagging----SBBTree"""
87 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
88 | """
89 | Initializes the SBBTree.
90 | Args:
91 | params : lgb params.
92 | stacking_num : k-fold stacking.
93 | bagging_num : bootstrap num.
94 | bagging_test_size : bootstrap sample rate.
95 | num_boost_round : boost num.
96 | early_stopping_rounds : early_stopping_rounds.
97 | """
98 | self.params = params
99 | self.stacking_num = stacking_num
100 | self.bagging_num = bagging_num
101 | self.bagging_test_size = bagging_test_size
102 | self.num_boost_round = num_boost_round
103 | self.early_stopping_rounds = early_stopping_rounds
104 |
105 | self.model = lgb
106 | self.stacking_model = []
107 | self.bagging_model = []
108 |
109 | def fit(self, X, y):
110 | """ fit model. """
111 | if self.stacking_num > 1:
112 | layer_train = np.zeros((X.shape[0], 2))
113 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
114 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
115 | X_train = X[train_index]
116 | y_train = y[train_index]
117 | X_test = X[test_index]
118 | y_test = y[test_index]
119 |
120 | lgb_train = lgb.Dataset(X_train, y_train)
121 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
122 |
123 | gbm = lgb.train(self.params,
124 | lgb_train,
125 | num_boost_round=self.num_boost_round,
126 | valid_sets=lgb_eval,
127 | early_stopping_rounds=self.early_stopping_rounds,
128 | verbose_eval=300)
129 |
130 | self.stacking_model.append(gbm)
131 |
132 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
133 | layer_train[test_index, 1] = pred_y
134 |
135 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
136 | else:
137 | pass
138 | for bn in range(self.bagging_num):
139 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
140 |
141 | lgb_train = lgb.Dataset(X_train, y_train)
142 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
143 |
144 | gbm = lgb.train(self.params,
145 | lgb_train,
146 | num_boost_round=10000,
147 | valid_sets=lgb_eval,
148 | early_stopping_rounds=200,
149 | verbose_eval=300)
150 |
151 | self.bagging_model.append(gbm)
152 |
153 | def predict(self, X_pred):
154 | """ predict test data. """
155 | if self.stacking_num > 1:
156 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
157 | for sn,gbm in enumerate(self.stacking_model):
158 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
159 | test_pred[:, sn] = pred
160 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
161 | else:
162 | pass
163 | for bn,gbm in enumerate(self.bagging_model):
164 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
165 | if bn == 0:
166 | pred_out=pred
167 | else:
168 | pred_out+=pred
169 | return pred_out/self.bagging_num
170 |
171 | # Model parameters
172 | params = {
173 | 'boosting_type': 'gbdt',
174 | 'objective': 'binary',
175 | 'metric': 'auc',
176 | 'learning_rate': 0.01,
177 | 'num_leaves': 2 ** 5 - 1,
178 | 'min_child_samples': 100,
179 | 'max_bin': 100,
180 | 'subsample': .7,
181 | 'subsample_freq': 1,
182 | 'colsample_bytree': 0.7,
183 | 'min_child_weight': 0,
184 | 'scale_pos_weight': 25,
185 | 'seed': 2018,
186 | 'nthread': 16,
187 | 'verbose': 0,
188 | }
189 |
190 | # Fit the model and predict
191 | model = SBBTree(params=params,\
192 | stacking_num=5,\
193 | bagging_num=5,\
194 | bagging_test_size=0.33,\
195 | num_boost_round=10000,\
196 | early_stopping_rounds=200)
197 | model.fit(X_train, y_train)
198 | y_predict = model.predict(X_test)
199 | #y_train_predict = model.predict(X_train)
200 |
201 |
202 | test_head['pred_prob'] = y_predict
203 | test_head.to_csv('../output/EDA16-threeWeek_rightTime.csv',index=False)
204 |
205 | threeNew = test_head[test_head['pred_prob'] >= 0.65][['user_id', 'cate', 'shop_id']]
206 | threeNew.to_csv('../output/res_threeWeekNew65.csv', index=False)
207 |
--------------------------------------------------------------------------------
/code/EDA16-twoWeek.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | import lightgbm as lgb
5 | from sklearn.metrics import f1_score
6 | from sklearn.model_selection import train_test_split
7 | from sklearn.model_selection import KFold
8 | from sklearn.model_selection import StratifiedKFold
9 | pd.set_option('display.max_columns', None)
10 |
11 | df_train=pd.read_csv('../output/df_train.csv')
12 | df_test=pd.read_csv('../output/df_test.csv')
13 | df_user=pd.read_csv('../data/jdata_user.csv')
14 | df_comment=pd.read_csv('../data/jdata_comment.csv')
15 | df_shop=pd.read_csv('../data/jdata_shop.csv')
16 |
17 | # 1) Action data (jdata_action)
18 | jdata_action = pd.read_csv('../data/jdata_action.csv')
19 | # 3) Product data (jdata_product)
20 | jdata_product = pd.read_csv('../data/jdata_product.csv')
21 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
22 |
23 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-09') \
24 | & (jdata_data['action_time']<='2018-04-15') \
25 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
26 | train_buy['label'] = 1
27 | # Candidate set: (user, cate, shop) triples active in the last two weeks, '2018-03-26' - '2018-04-08'
28 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-26') \
29 | & (jdata_data['action_time']<='2018-04-08')][['user_id','cate','shop_id']].drop_duplicates()
30 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
31 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
32 | def mapper(x):
33 | if x is not np.nan:
34 | year=int(x[:4])
35 | return 2018-year
36 |
37 | df_user['user_reg_tm']=df_user['user_reg_tm'].apply(lambda x:mapper(x))
38 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].apply(lambda x:mapper(x))
39 | df_shop['shop_reg_tm']=df_shop['shop_reg_tm'].fillna(df_shop['shop_reg_tm'].mean())
40 | df_user['age']=df_user['age'].fillna(df_user['age'].mean())
41 | df_comment=pd.read_csv('../data/jdata_comment.csv')
42 | df_comment=df_comment.groupby(['sku_id'],as_index=False).sum()
43 | df_product=pd.read_csv('../data/jdata_product.csv')
44 | df_product_comment=pd.merge(df_product,df_comment,on='sku_id',how='left')
45 | df_product_comment=df_product_comment.fillna(0)
46 | df_product_comment=df_product_comment.groupby(['shop_id'],as_index=False).sum()
47 | df_product_comment=df_product_comment.drop(['sku_id','brand','cate'],axis=1)
48 | df_shop_product_comment=pd.merge(df_shop,df_product_comment,how='left',on='shop_id')
49 |
50 | train_set=pd.merge(train_set,df_user,how='left',on='user_id')
51 | train_set=pd.merge(train_set,df_shop_product_comment,on='shop_id',how='left')
52 | test_set = jdata_data[(jdata_data['action_time']>='2018-04-02') \
53 | & (jdata_data['action_time']<='2018-04-15')][['user_id','cate','shop_id']].drop_duplicates()
54 | test_set = test_set.merge(df_test,on=['user_id','cate','shop_id'],how='left')
55 |
56 | del df_train
57 | del df_test
58 | test_set=pd.merge(test_set,df_user,how='left',on='user_id')
59 | test_set=pd.merge(test_set,df_shop_product_comment,on='shop_id',how='left')
60 | test_set=test_set.sort_values('user_id')
61 | train_set=train_set.sort_values('user_id')
62 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
63 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
64 |
65 | test_head=test_set[['user_id','cate','shop_id']]
66 | train_head=train_set[['user_id','cate','shop_id']]
67 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
68 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
69 |
70 | # Prepare the data
71 | X_train = train_set.drop(['label'],axis=1).values
72 | y_train = train_set['label'].values
73 | X_test = test_set.values
74 | del test_set
75 | del train_set
76 |
77 | # Model utility
78 | class SBBTree():
79 | """Stacking,Bootstap,Bagging----SBBTree"""
80 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
81 | """
82 | Initializes the SBBTree.
83 | Args:
84 | params : lgb params.
85 | stacking_num : k-fold stacking.
86 | bagging_num : bootstrap num.
87 | bagging_test_size : bootstrap sample rate.
88 | num_boost_round : boost num.
89 | early_stopping_rounds : early_stopping_rounds.
90 | """
91 | self.params = params
92 | self.stacking_num = stacking_num
93 | self.bagging_num = bagging_num
94 | self.bagging_test_size = bagging_test_size
95 | self.num_boost_round = num_boost_round
96 | self.early_stopping_rounds = early_stopping_rounds
97 |
98 | self.model = lgb
99 | self.stacking_model = []
100 | self.bagging_model = []
101 |
102 | def fit(self, X, y):
103 | """ fit model. """
104 | if self.stacking_num > 1:
105 | layer_train = np.zeros((X.shape[0], 2))
106 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
107 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
108 | X_train = X[train_index]
109 | y_train = y[train_index]
110 | X_test = X[test_index]
111 | y_test = y[test_index]
112 |
113 | lgb_train = lgb.Dataset(X_train, y_train)
114 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
115 |
116 | gbm = lgb.train(self.params,
117 | lgb_train,
118 | num_boost_round=self.num_boost_round,
119 | valid_sets=lgb_eval,
120 | early_stopping_rounds=self.early_stopping_rounds,
121 | verbose_eval=300)
122 |
123 | self.stacking_model.append(gbm)
124 |
125 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
126 | layer_train[test_index, 1] = pred_y
127 |
128 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
129 | else:
130 | pass
131 | for bn in range(self.bagging_num):
132 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
133 |
134 | lgb_train = lgb.Dataset(X_train, y_train)
135 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
136 |
137 | gbm = lgb.train(self.params,
138 | lgb_train,
139 | num_boost_round=10000,
140 | valid_sets=lgb_eval,
141 | early_stopping_rounds=200,
142 | verbose_eval=300)
143 |
144 | self.bagging_model.append(gbm)
145 |
146 | def predict(self, X_pred):
147 | """ predict test data. """
148 | if self.stacking_num > 1:
149 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
150 | for sn,gbm in enumerate(self.stacking_model):
151 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
152 | test_pred[:, sn] = pred
153 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
154 | else:
155 | pass
156 | for bn,gbm in enumerate(self.bagging_model):
157 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
158 | if bn == 0:
159 | pred_out=pred
160 | else:
161 | pred_out+=pred
162 | return pred_out/self.bagging_num
163 |
164 | # Model parameters
165 | params = {
166 | 'boosting_type': 'gbdt',
167 | 'objective': 'binary',
168 | 'metric': 'auc',
169 | 'learning_rate': 0.01,
170 | 'num_leaves': 2 ** 5 - 1,
171 | 'min_child_samples': 100,
172 | 'max_bin': 100,
173 | 'subsample': .7,
174 | 'subsample_freq': 1,
175 | 'colsample_bytree': 0.7,
176 | 'min_child_weight': 0,
177 | 'scale_pos_weight': 25,
178 | 'seed': 2018,
179 | 'nthread': 16,
180 | 'verbose': 0,
181 | }
182 |
183 | # Fit the model and predict
184 | model = SBBTree(params=params,\
185 | stacking_num=5,\
186 | bagging_num=5,\
187 | bagging_test_size=0.33,\
188 | num_boost_round=10000,\
189 | early_stopping_rounds=200)
190 | model.fit(X_train, y_train)
191 | y_predict = model.predict(X_test)
192 | y_train_predict = model.predict(X_train)
193 |
194 | test_head['pred_prob'] = y_predict
195 |
196 |
197 | test_head.to_csv('../output/EDA16-twoWeek.csv', index=False)
198 |
199 | twoOld = test_head[test_head['pred_prob'] >= 0.5205][['user_id', 'cate', 'shop_id']]
200 | twoOld.to_csv('../output/res_twoWeekOld5205.csv', index=False)
201 |
202 |
203 |
204 |
205 |
--------------------------------------------------------------------------------
/code/cat_model/para.py:
--------------------------------------------------------------------------------
1 | # Parameters shared by a fellow competitor -- thanks!
2 | import catboost as ctb
3 | ctb_params = {
4 | 'n_estimators': 10000,
5 | 'learning_rate': 0.02,
6 | 'random_seed': 4590,
7 | 'reg_lambda': 0.08,
8 | 'subsample': 0.7,
9 | 'bootstrap_type': 'Bernoulli',
10 | 'boosting_type': 'Plain',
11 | 'one_hot_max_size': 10,
12 | 'rsm': 0.5,
13 | 'leaf_estimation_iterations': 5,
14 | 'use_best_model': True,
15 | 'max_depth': 6,
16 | 'verbose': -1,
17 | 'thread_count': 4
18 | }
19 | ctb_model = ctb.CatBoostRegressor(**ctb_params)
20 |
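# A minimal usage sketch: use_best_model=True requires a validation set at fit time.
# X and y below are hypothetical (the feature matrix and target from the training scripts).
#   from sklearn.model_selection import train_test_split
#   X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=4590)
#   ctb_model.fit(X_tr, y_tr, eval_set=(X_val, y_val), early_stopping_rounds=200)
#   y_pred = ctb_model.predict(X_val)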
--------------------------------------------------------------------------------
/code/df_train_test.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | from sklearn.model_selection import train_test_split
3 | import numpy as np
4 | from tqdm import tqdm
5 | import lightgbm as lgb
6 | from joblib import dump
7 | import time
8 |
9 | time_0 = time.process_time()
10 | print('>> Loading data')
11 | df_action=pd.read_csv("./jdata_action.csv")
12 | df_product=pd.read_csv("./jdata_product.csv")
13 |
14 | df_action=pd.merge(df_action,df_product,how='left',on='sku_id')
15 | df_action=df_action.groupby(['user_id','shop_id','cate'], as_index=False).sum()
16 | time_1 = time.process_time()
17 | print('<< Data loaded! Elapsed', time_1 - time_0, 's')
18 |
19 | df_action=df_action[['user_id','shop_id','cate']]
20 | df_action_head=df_action.copy()
21 |
22 | df_action=pd.read_csv("./jdata_action.csv")
23 |
24 | def makeActionData(startDate,endDate):
25 | df=df_action[(df_action['action_time']>startDate)&(df_action['action_time']<=endDate)]
--------------------------------------------------------------------------------
/code/gen_result.py:
--------------------------------------------------------------------------------
14 | # print('sbb3_3 best_len',len(sbb3_3[sbb3_3['pred_prob']>=best_u]))
15 | sbb3_3[sbb3_3['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb3_3.csv',index=False)
16 |
17 | # sbb3_2['pred_prob'] = y_predict
18 | best_u = 0.602
19 | # Set the threshold and count the rows kept
20 | # print('sbb3_2 best_len',len(sbb3_2[sbb3_2['pred_prob']>=best_u]))
21 | sbb3_2[sbb3_2['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb3_2.csv',index=False)
22 | # sbb3_1['pred_prob'] = y_predict
23 | best_u = 0.521
24 | # Set the threshold and count the rows kept
25 | # print('sbb3_1 best_len',len(sbb3_1[sbb3_1['pred_prob']>=best_u]))
26 | sbb3_1[sbb3_1['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb3_1.csv',index=False)
27 |
28 |
29 | n_3_593 = pd.read_csv('../output/res_threeWeekNew65.csv')
30 | n_4_590 = pd.read_csv('../output/res_fourWeekNew675.csv')
31 | o_2_573 = pd.read_csv('../output/res_twoWeekOld5205.csv')
32 | o_3_583 = pd.read_csv('../output/res_threeWeekOld595.csv')
33 | o_4_578 = pd.read_csv('../output/res_fourWeekOld60.csv')
34 | sbb_1 = pd.read_csv('../output/sbb3_1.csv')
35 | sbb_2 = pd.read_csv('../output/sbb3_2.csv')
36 | sbb_3 = pd.read_csv('../output/sbb3_3.csv')
37 |
38 |
39 |
40 | all_item = pd.concat([n_3_593,n_4_590,o_2_573,o_3_583,o_4_578,sbb_1,sbb_2,sbb_3],axis=0)
41 | all_item = all_item.drop_duplicates()
42 |
43 |
44 | n_3_593['label1'] = 1
45 | n_4_590['label2'] = 1
46 | o_2_573['label3'] = 1
47 | o_3_583['label4'] = 1
48 | o_4_578['label5'] = 1
49 | sbb_1['label6'] = 1
50 | sbb_2['label7'] = 1
51 | sbb_3['label8'] = 1
52 |
53 |
54 |
55 | all_item = all_item.merge(n_3_593,on=['user_id','cate','shop_id'],how='left')
56 | all_item = all_item.merge(n_4_590,on=['user_id','cate','shop_id'],how='left')
57 | all_item = all_item.merge(o_2_573,on=['user_id','cate','shop_id'],how='left')
58 | all_item = all_item.merge(o_3_583,on=['user_id','cate','shop_id'],how='left')
59 | all_item = all_item.merge(o_4_578,on=['user_id','cate','shop_id'],how='left')
60 | all_item = all_item.merge(sbb_1,on=['user_id','cate','shop_id'],how='left')
61 | all_item = all_item.merge(sbb_2,on=['user_id','cate','shop_id'],how='left')
62 | all_item = all_item.merge(sbb_3,on=['user_id','cate','shop_id'],how='left')
63 |
64 |
65 | all_item = all_item.fillna(0)
66 |
67 |
68 | all_item['sum'] = all_item['label1']+all_item['label2']+all_item['label3']+all_item['label4']+all_item['label5']+all_item['label6']+all_item['label7']+all_item['label8']
69 |
70 | all_item[all_item['sum']>=2][['user_id',
71 | 'cate','shop_id']].to_csv('../submit/8_model_2.csv',index=False)
72 |
73 |
74 |
--------------------------------------------------------------------------------
/code/gen_result2.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | ### Intersect within the same feature set; vote across feature sets
4 | sbb4_3 = pd.read_csv('../feature/4_sbb_get_3_test.csv')
5 | sbb4_2 = pd.read_csv('../feature/4_sbb_get_2_test.csv')
6 | sbb4_1 = pd.read_csv('../feature/4_sbb_get_1_test.csv')
7 |
8 | from tqdm import tqdm
9 | # sbb4_1['pred_prob'] = y_predict
10 | best_u = 0.662
11 | # Set the threshold and count the rows kept
12 | # print('sbb4_1 best_len',len(sbb4_1[sbb4_1['pred_prob']>=best_u]))
13 | sbb4_1[sbb4_1['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb4_1.csv',index=False)
14 |
15 | from tqdm import tqdm
16 | # sbb4_1['pred_prob'] = y_predict
17 | best_u = 0.500
18 | # Set the threshold and count the rows kept
19 | # print('sbb4_2 best_len',len(sbb4_2[sbb4_2['pred_prob']>=best_u]))
20 | sbb4_2[sbb4_2['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb4_2.csv',index=False)
21 |
22 | from tqdm import tqdm
23 | # sbb4_3['pred_prob'] = y_predict
24 | best_u = 0.685
25 | # Set the threshold and count the rows kept
26 | # print('sbb4_3 best_len',len(sbb4_3[sbb4_3['pred_prob']>=best_u]))
27 | sbb4_3[sbb4_3['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb4_3.csv',index=False)
28 |
29 | sbb3_3 = pd.read_csv('../feature/3_sbb_get_3_test.csv')
30 | sbb3_2 = pd.read_csv('../feature/3_sbb_get_2_test.csv')
31 | sbb3_1 = pd.read_csv('../feature/3_sbb_get_1_test.csv')
32 |
33 | from tqdm import tqdm
34 | # sbb3_3['pred_prob'] = y_predict
35 | best_u = 0.686
36 | # Set the threshold and count the rows kept
37 | # print('sbb3_3 best_len',len(sbb3_3[sbb3_3['pred_prob']>=best_u]))
38 | sbb3_3[sbb3_3['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb3_3.csv',index=False)
39 |
40 | from tqdm import tqdm
41 | # sbb3_2['pred_prob'] = y_predict
42 | best_u = 0.602
43 | # Set the threshold and count the rows kept
44 | # print('sbb3_2 best_len',len(sbb3_2[sbb3_2['pred_prob']>=best_u]))
45 | sbb3_2[sbb3_2['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb3_2.csv',index=False)
46 |
47 | from tqdm import tqdm
48 | # sbb3_1['pred_prob'] = y_predict
49 | best_u = 0.521
50 | # Set the threshold and count the rows kept
51 | # print('sbb3_1 best_len',len(sbb3_1[sbb3_1['pred_prob']>=best_u]))
52 | sbb3_1[sbb3_1['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb3_1.csv',index=False)
53 |
54 | sbb2_3 = pd.read_csv('../feature/2_sbb_get_3_test.csv')
55 | sbb2_2 = pd.read_csv('../feature/2_sbb_get_2_test.csv')
56 | sbb2_1 = pd.read_csv('../feature/2_sbb_get_1_test.csv')
57 |
58 | from tqdm import tqdm
59 | # sbb2_1['pred_prob'] = y_predict
60 | best_u = 0.495
61 | # Set the threshold and count the rows kept
62 | # print('sbb2_1 best_len',len(sbb2_1[sbb2_1['pred_prob']>=best_u]))
63 | sbb2_1[sbb2_1['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb2_1.csv',index=False)
64 |
65 | from tqdm import tqdm
66 | # sbb2_2['pred_prob'] = y_predict
67 | best_u = 0.310
68 | # Set the threshold and count the rows kept
69 | # print('sbb2_2 best_len',len(sbb2_2[sbb2_2['pred_prob']>=best_u]))
70 | sbb2_2[sbb2_2['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb2_2.csv',index=False)
71 |
72 | from tqdm import tqdm
73 | # sbb2_3['pred_prob'] = y_predict
74 | best_u = 0.480
75 | # Set the threshold and count the rows kept
76 | # print('sbb2_3 best_len',len(sbb2_3[sbb2_3['pred_prob']>=best_u]))
77 | sbb2_3[sbb2_3['pred_prob']>=best_u][['user_id','cate','shop_id']].to_csv('../output/sbb2_3.csv',index=False)
78 |
79 | ## Intersect within the same feature set
80 | ## Vote across feature sets
81 | sbb21 = pd.read_csv('../output/sbb2_1.csv')
82 | sbb31 = pd.read_csv('../output/sbb3_1.csv')
83 | sbb41 = pd.read_csv('../output/sbb4_1.csv')
84 | all_data = pd.concat([sbb21,sbb31,sbb41],axis=0).drop_duplicates()
85 |
86 | sbb21['label2']=1
87 | sbb31['label3']=1
88 | sbb41['label4']=1
89 |
90 | all_data = all_data.merge(sbb21,on=['user_id','cate','shop_id'],how='left')
91 | all_data = all_data.merge(sbb31,on=['user_id','cate','shop_id'],how='left')
92 | all_data = all_data.merge(sbb41,on=['user_id','cate','shop_id'],how='left')
93 | all_data= all_data.fillna(0)
94 | all_data['sum'] = all_data['label2']+all_data['label3']+all_data['label4']
95 |
96 | all_data['sum'].value_counts()
97 |
98 | all_data[all_data['sum']>=3][['user_id','cate','shop_id']].to_csv('../output/sbb*1_u3.csv',index=False)
99 |
100 | sbb22 = pd.read_csv('../output/sbb2_2.csv')
101 | sbb32 = pd.read_csv('../output/sbb3_2.csv')
102 | sbb42 = pd.read_csv('../output/sbb4_2.csv')
103 | all_data = pd.concat([sbb22,sbb32,sbb42],axis=0).drop_duplicates()
104 |
105 | sbb22['label2']=1
106 | sbb32['label3']=1
107 | sbb42['label4']=1
108 |
109 | all_data = all_data.merge(sbb22,on=['user_id','cate','shop_id'],how='left')
110 | all_data = all_data.merge(sbb32,on=['user_id','cate','shop_id'],how='left')
111 | all_data = all_data.merge(sbb42,on=['user_id','cate','shop_id'],how='left')
112 | all_data= all_data.fillna(0)
113 | all_data['sum'] = all_data['label2']+all_data['label3']+all_data['label4']
114 |
115 | all_data['sum'].value_counts()
116 | all_data[all_data['sum']>=3][['user_id','cate','shop_id']].to_csv('../output/sbb*2_u3.csv',index=False)
117 |
118 | sbb23 = pd.read_csv('../output/sbb2_3.csv')
119 | sbb33 = pd.read_csv('../output/sbb3_3.csv')
120 | sbb43 = pd.read_csv('../output/sbb4_3.csv')
121 | all_data = pd.concat([sbb23,sbb33,sbb43],axis=0).drop_duplicates()
122 |
123 | sbb23['label2']=1
124 | sbb33['label3']=1
125 | sbb43['label4']=1
126 |
127 | all_data = all_data.merge(sbb23,on=['user_id','cate','shop_id'],how='left')
128 | all_data = all_data.merge(sbb33,on=['user_id','cate','shop_id'],how='left')
129 | all_data = all_data.merge(sbb43,on=['user_id','cate','shop_id'],how='left')
130 | all_data= all_data.fillna(0)
131 | all_data['sum'] = all_data['label2']+all_data['label3']+all_data['label4']
132 |
133 | all_data['sum'].value_counts()
134 | all_data[all_data['sum']>=3][['user_id','cate','shop_id']].to_csv('../output/sbb*3_u3.csv',index=False)
135 |
136 | sbb1_vote = pd.read_csv('../output/sbb*1_u3.csv')
137 | sbb2_vote = pd.read_csv('../output/sbb*2_u3.csv')
138 | sbb3_vote = pd.read_csv('../output/sbb*3_u3.csv')
139 | a_result = pd.read_csv('../submit/8_model_2.csv')
140 | all_data = pd.concat([sbb1_vote,sbb2_vote,sbb3_vote,a_result],axis=0).drop_duplicates()
141 |
142 |
143 | sbb1_vote['label2']=1
144 | sbb2_vote['label3']=1
145 | sbb3_vote['label4']=1
146 | a_result['label5'] = 1
147 |
148 | all_data = all_data.merge(sbb1_vote,on=['user_id','cate','shop_id'],how='left')
149 | all_data = all_data.merge(sbb2_vote,on=['user_id','cate','shop_id'],how='left')
150 | all_data = all_data.merge(sbb3_vote,on=['user_id','cate','shop_id'],how='left')
151 | all_data = all_data.merge(a_result,on=['user_id','cate','shop_id'],how='left')
152 | all_data= all_data.fillna(0)
153 | all_data['sum'] = all_data['label2']+all_data['label3']+all_data['label4']+all_data['label5']
154 |
155 | all_data['sum'].value_counts()
156 |
157 | all_data[all_data['sum']>=2][['user_id','cate','shop_id']].to_csv('../submit/b_final.csv',index=False)
--------------------------------------------------------------------------------
/code/lgb_model/lgb_train1.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from datetime import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 |
12 | ## Load files with reduced memory usage; adapted from 鱼佬's Tencent competition code
13 | def reduce_mem_usage(df, verbose=True):
14 | numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
15 | start_mem = df.memory_usage().sum() / 1024**2
16 | for col in df.columns:
17 | col_type = df[col].dtypes
18 | if col_type in numerics:
19 | c_min = df[col].min()
20 | c_max = df[col].max()
21 | if str(col_type)[:3] == 'int':
22 | if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
23 | df[col] = df[col].astype(np.int8)
24 | elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
25 | df[col] = df[col].astype(np.int16)
26 | elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
27 | df[col] = df[col].astype(np.int32)
28 | elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
29 | df[col] = df[col].astype(np.int64)
30 | else:
31 | if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
32 | df[col] = df[col].astype(np.float16)
33 | elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
34 | df[col] = df[col].astype(np.float32)
35 | else:
36 | df[col] = df[col].astype(np.float64)
37 | end_mem = df.memory_usage().sum() / 1024**2
38 | if verbose:
39 | print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
40 | return df
41 |
42 | df_train = reduce_mem_usage(pd.read_csv('./df_train.csv'))
43 | df_test = reduce_mem_usage(pd.read_csv('./df_test.csv'))
44 | ## Optionally load several feature files here and merge them; if df_train changes, note in the output file name which feature files were used (see the commented sketch below)
45 | ### Feature flag: 1 = one-week features only, 12 = plus two-week features, 123 = plus three-week features as well, 2 = two-week features only
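# For example (hypothetical file names), extra feature tables could be merged on the key:
#   df_week2 = reduce_mem_usage(pd.read_csv('./df_train_week2.csv'))
#   df_train = df_train.merge(df_week2, on=['user_id', 'cate', 'shop_id'], how='left')
#   label_flag = 12  # and record the flag in the output file name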
46 |
47 | df_user=reduce_mem_usage(pd.read_csv('./jdata_user.csv'))
48 | df_comment=reduce_mem_usage(pd.read_csv('./jdata_comment.csv'))
49 | df_shop=reduce_mem_usage(pd.read_csv('./jdata_shop.csv'))
50 |
51 | # 1) Action data (jdata_action)
52 | jdata_action = reduce_mem_usage(pd.read_csv('./jdata_action.csv'))
53 |
54 | # 3) Product data (jdata_product)
55 | jdata_product = reduce_mem_usage(pd.read_csv('./jdata_product.csv'))
56 |
57 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
58 |
59 | print('<< Data loaded')
60 |
61 |
62 | label_flag = 1
63 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-09')&(jdata_data['action_time']<'2018-04-15')&(jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
64 | train_buy['label'] = 1
65 | # Candidate set: (user, cate, shop) triples active in the last three weeks, '2018-03-19' - '2018-04-08'
66 | win_size = 3  # 2 for a two-week candidate window, 3 for three weeks
67 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-19')&(jdata_data['action_time']<'2018-04-09')][['user_id','cate','shop_id']].drop_duplicates()
68 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
69 |
70 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
71 |
72 |
73 |
74 | def mapper_year(x):
75 | if pd.notna(x):
76 | year = int(x[:4])
77 | return 2018 - year
78 |
79 |
80 | def mapper_month(x):
81 | if pd.notna(x):
82 | year = int(x[:4])
83 | month = int(x[5:7])
84 | return (2018 - year) * 12 + month
85 |
86 |
87 | def mapper_day(x):
88 | if pd.notna(x):
89 | year = int(x[:4])
90 | month = int(x[5:7])
91 | day = int(x[8:10])
92 | return (2018 - year) * 365 + month * 30 + day  # rough day index is fine for tree models
93 |
94 |
95 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
96 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
97 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
98 |
99 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
100 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
101 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
102 |
103 |
104 |
105 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
106 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
107 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
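# The constants 1 / 21 / 101 above appear to act as sentinel fills for shops
# with a missing shop_reg_tm, sitting outside the usual value ranges.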
108 |
109 | df_user['age'] = df_user['age'].fillna(5)
110 |
111 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
112 | print('check point ...')
113 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
114 |
115 | df_product_comment = df_product_comment.fillna(0)
116 |
117 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
118 |
119 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
120 |
121 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
122 |
123 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
124 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
125 |
126 |
127 |
128 |
129 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
130 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
131 |
132 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
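# Caution: fans_num or comments can be 0, so vip_prob/goods_prob may contain
# inf; one safe option is train_set.replace([np.inf, -np.inf], np.nan,
# inplace=True) so such cells are treated as missing.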
133 |
134 |
135 |
136 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-26') & (jdata_data['action_time'] < '2018-04-16')][['user_id', 'cate', 'shop_id']].drop_duplicates()
137 |
138 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
139 |
140 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
141 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
142 |
143 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
144 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
145 |
146 |
147 |
148 |
149 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
150 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
151 |
152 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
153 |
154 |
155 |
156 | ### Keep six weeks of train features: 2018-02-26 to 2018-04-09
157 | train_set = train_set.drop([
158 | '2018-02-19-2018-02-26-action_1', '2018-02-19-2018-02-26-action_2',
159 | '2018-02-19-2018-02-26-action_3', '2018-02-19-2018-02-26-action_4',
160 | '2018-02-12-2018-02-19-action_1', '2018-02-12-2018-02-19-action_2',
161 | '2018-02-12-2018-02-19-action_3', '2018-02-12-2018-02-19-action_4',
162 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
163 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
164 |
165 |
166 | ### Keep six weeks of test features: 2018-03-05 to 2018-04-15
167 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
168 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
169 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
170 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
171 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
172 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
173 | '2018-02-12-2018-02-19-action_4'],axis=1)
174 |
175 |
176 |
177 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
178 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
179 |
180 |
181 |
182 | test_head=test_set[['user_id','cate','shop_id']].copy()  # .copy() so pred_prob can be assigned later without a chained-assignment warning
183 | train_head=train_set[['user_id','cate','shop_id']].copy()
184 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
185 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
186 |
187 |
188 | # Prepare the data matrices
189 | X_train = train_set.drop(['label'],axis=1).values
190 | y_train = train_set['label'].values
191 | X_test = test_set.values
192 |
193 | del train_set
194 | del test_set
195 |
196 |
197 | print('------------------start modelling----------------')
198 | # Model utilities
199 | class SBBTree():
200 | """Stacking, Bootstrap, Bagging ---- SBBTree."""
201 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
202 | """
203 | Initializes the SBBTree.
204 | Args:
205 | params : lgb params.
206 | stacking_num : number of stacking folds (k-fold).
207 | bagging_num : bootstrap num.
208 | bagging_test_size : bootstrap sample rate.
209 | num_boost_round : boost num.
210 | early_stopping_rounds : early_stopping_rounds.
211 | """
212 | self.params = params
213 | self.stacking_num = stacking_num
214 | self.bagging_num = bagging_num
215 | self.bagging_test_size = bagging_test_size
216 | self.num_boost_round = num_boost_round
217 | self.early_stopping_rounds = early_stopping_rounds
218 |
219 | self.model = lgb
220 | self.stacking_model = []
221 | self.bagging_model = []
222 |
223 | def fit(self, X, y):
224 | """ fit model. """
225 | if self.stacking_num > 1:
226 | layer_train = np.zeros((X.shape[0], 2))
227 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
228 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
229 | X_train = X[train_index]
230 | y_train = y[train_index]
231 | X_test = X[test_index]
232 | y_test = y[test_index]
233 |
234 | lgb_train = lgb.Dataset(X_train, y_train)
235 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
236 |
237 | gbm = lgb.train(self.params,
238 | lgb_train,
239 | num_boost_round=self.num_boost_round,
240 | valid_sets=lgb_eval,
241 | early_stopping_rounds=self.early_stopping_rounds)
242 |
243 | self.stacking_model.append(gbm)
244 |
245 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
246 | layer_train[test_index, 1] = pred_y
247 |
248 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
249 | else:
250 | pass
251 | for bn in range(self.bagging_num):
252 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
253 |
254 | lgb_train = lgb.Dataset(X_train, y_train)
255 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
256 |
257 | gbm = lgb.train(self.params,
258 | lgb_train,
259 | num_boost_round=10000,
260 | valid_sets=lgb_eval,
261 | early_stopping_rounds=200)
262 |
263 | self.bagging_model.append(gbm)
264 |
265 | def predict(self, X_pred):
266 | """ predict test data. """
267 | if self.stacking_num > 1:
268 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
269 | for sn,gbm in enumerate(self.stacking_model):
270 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
271 | test_pred[:, sn] = pred
272 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
273 | else:
274 | pass
275 | for bn,gbm in enumerate(self.bagging_model):
276 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
277 | if bn == 0:
278 | pred_out=pred
279 | else:
280 | pred_out+=pred
281 | return pred_out/self.bagging_num
282 |
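# How SBBTree fits together: fit() first runs stacking_num-fold out-of-fold
# training and appends the OOF prediction to X as one extra feature, then
# trains bagging_num LightGBM models on random splits of the augmented
# matrix; predict() rebuilds that stacking feature with the saved fold
# models and averages the bagged predictions.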
283 |
284 | # Model parameters
285 | params = {
286 | 'boosting_type': 'gbdt',
287 | 'objective': 'binary',
288 | 'metric': 'auc',
289 | 'learning_rate': 0.01,
290 | 'num_leaves': 2 ** 5 - 1,
291 | 'min_child_samples': 100,
292 | 'max_bin': 100,
293 | 'subsample': .7,
294 | 'subsample_freq': 1,
295 | 'colsample_bytree': 0.7,
296 | 'min_child_weight': 0,
297 | 'scale_pos_weight': 25,
298 | 'seed': 42,
299 | 'nthread': 20,
300 | 'verbose': 0,
301 | }
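# scale_pos_weight up-weights the rare positive (purchase) class, and the
# small learning_rate is balanced by early stopping on the AUC metric.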
302 | # Fit the model
303 | model = SBBTree(params=params, \
304 | stacking_num=5, \
305 | bagging_num=5, \
306 | bagging_test_size=0.33, \
307 | num_boost_round=10000, \
308 | early_stopping_rounds=200)
309 | model.fit(X_train, y_train)
310 |
311 |
312 | print('train is ok')
313 | y_predict = model.predict(X_test)
314 | print('pred test is ok')
315 | # y_train_predict = model.predict(X_train)
316 |
317 |
318 |
320 | test_head['pred_prob'] = y_predict
321 | test_head.to_csv('feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
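# Predictions land in feature/<win_size>_sbb_get_<label_flag>_test.csv,
# i.e. feature/3_sbb_get_1_test.csv for this script.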
322 |
--------------------------------------------------------------------------------
/code/lgb_model/lgb_train2.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from datetime import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 |
12 | ## Read files with reduced memory; adapted from 鱼佬's Tencent competition code
13 | def reduce_mem_usage(df, verbose=True):
14 | numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
15 | start_mem = df.memory_usage().sum() / 1024**2
16 | for col in df.columns:
17 | col_type = df[col].dtypes
18 | if col_type in numerics:
19 | c_min = df[col].min()
20 | c_max = df[col].max()
21 | if str(col_type)[:3] == 'int':
22 | if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
23 | df[col] = df[col].astype(np.int8)
24 | elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
25 | df[col] = df[col].astype(np.int16)
26 | elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
27 | df[col] = df[col].astype(np.int32)
28 | elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
29 | df[col] = df[col].astype(np.int64)
30 | else:
31 | if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
32 | df[col] = df[col].astype(np.float16)
33 | elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
34 | df[col] = df[col].astype(np.float32)
35 | else:
36 | df[col] = df[col].astype(np.float64)
37 | end_mem = df.memory_usage().sum() / 1024**2
38 | if verbose:
39 | print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
40 | return df
41 |
42 | df_train = reduce_mem_usage(pd.read_csv('./df_train.csv'))
43 | df_test = reduce_mem_usage(pd.read_csv('./df_test.csv'))
44 | ## Extra feature files can be merged in here; if df_train changes, note in the output filename which feature files were used
45 | ### Feature flag: 1 = one-week features, 12 = one- plus two-week, 123 = plus three-week, 2 = two-week features only
46 |
47 |
48 | df_user=reduce_mem_usage(pd.read_csv('./jdata_user.csv'))
49 | df_comment=reduce_mem_usage(pd.read_csv('./jdata_comment.csv'))
50 | df_shop=reduce_mem_usage(pd.read_csv('./jdata_shop.csv'))
51 |
52 | # 1) Action data (jdata_action)
53 | jdata_action = reduce_mem_usage(pd.read_csv('./jdata_action.csv'))
54 |
55 | # 3) Product data (jdata_product)
56 | jdata_product = reduce_mem_usage(pd.read_csv('./jdata_product.csv'))
57 |
58 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
59 | print('<< data loading finished')
61 |
62 |
63 | label_flag = 2
64 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-02')&(jdata_data['action_time']<'2018-04-09') & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
65 | train_buy['label'] = 1
66 | # Candidate set: (user, cate, shop) triples with any action in the three weeks 2018-03-12 to 2018-04-01
67 | win_size = 3  # 2 for a two-week behavior window, 3 for three weeks
68 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-12')&(jdata_data['action_time']<'2018-04-02')][['user_id','cate','shop_id']].drop_duplicates()
69 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
70 |
71 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
72 |
73 |
74 |
75 | def mapper_year(x):
76 | if pd.notna(x):
77 | year = int(x[:4])
78 | return 2018 - year
79 |
80 |
81 | def mapper_month(x):
82 | if pd.notna(x):
83 | year = int(x[:4])
84 | month = int(x[5:7])
85 | return (2018 - year) * 12 + month
86 |
87 |
88 | def mapper_day(x):
89 | if pd.notna(x):
90 | year = int(x[:4])
91 | month = int(x[5:7])
92 | day = int(x[8:10])
93 | return (2018 - year) * 365 + month * 30 + day
94 |
95 |
96 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
97 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
98 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
99 |
100 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
101 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
102 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
103 |
104 |
105 |
106 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
107 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
108 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
109 |
110 | df_user['age'] = df_user['age'].fillna(5)
111 |
112 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
113 | print('check point ...')
114 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
115 |
116 | df_product_comment = df_product_comment.fillna(0)
117 |
118 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
119 |
120 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
121 |
122 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
123 |
124 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
125 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
126 |
127 |
128 |
129 |
130 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
131 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
132 |
133 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
134 |
135 |
136 |
137 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-26') & (jdata_data['action_time'] < '2018-04-16')][['user_id', 'cate', 'shop_id']].drop_duplicates()
138 |
139 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
140 |
141 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
142 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
143 |
144 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
145 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
146 |
147 |
148 |
149 |
150 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
151 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
152 |
153 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
154 |
155 |
156 |
157 | ### Keep six weeks of train features: 2018-02-19 to 2018-04-02
158 | train_set = train_set.drop([
159 | '2018-04-02-2018-04-09-action_1', '2018-04-02-2018-04-09-action_2',
160 | '2018-04-02-2018-04-09-action_3', '2018-04-02-2018-04-09-action_4',
161 | '2018-02-12-2018-02-19-action_1', '2018-02-12-2018-02-19-action_2',
162 | '2018-02-12-2018-02-19-action_3', '2018-02-12-2018-02-19-action_4',
163 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
164 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
165 |
166 |
167 | ### Keep six weeks of test features: 2018-03-05 to 2018-04-15
168 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
169 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
170 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
171 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
172 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
173 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
174 | '2018-02-12-2018-02-19-action_4'],axis=1)
175 |
176 |
177 |
178 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
179 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
180 |
181 |
182 |
183 | test_head=test_set[['user_id','cate','shop_id']].copy()
184 | train_head=train_set[['user_id','cate','shop_id']].copy()
185 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
186 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
187 |
188 |
189 | # Prepare the data matrices
190 | X_train = train_set.drop(['label'],axis=1).values
191 | y_train = train_set['label'].values
192 | X_test = test_set.values
193 |
194 | del train_set
195 | del test_set
196 |
197 | print('------------------start modelling----------------')
198 | # Model utilities
199 | class SBBTree():
200 | """Stacking, Bootstrap, Bagging ---- SBBTree."""
201 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
202 | """
203 | Initializes the SBBTree.
204 | Args:
205 | params : lgb params.
206 | stacking_num : number of stacking folds (k-fold).
207 | bagging_num : bootstrap num.
208 | bagging_test_size : bootstrap sample rate.
209 | num_boost_round : boost num.
210 | early_stopping_rounds : early_stopping_rounds.
211 | """
212 | self.params = params
213 | self.stacking_num = stacking_num
214 | self.bagging_num = bagging_num
215 | self.bagging_test_size = bagging_test_size
216 | self.num_boost_round = num_boost_round
217 | self.early_stopping_rounds = early_stopping_rounds
218 |
219 | self.model = lgb
220 | self.stacking_model = []
221 | self.bagging_model = []
222 |
223 | def fit(self, X, y):
224 | """ fit model. """
225 | if self.stacking_num > 1:
226 | layer_train = np.zeros((X.shape[0], 2))
227 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
228 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
229 | X_train = X[train_index]
230 | y_train = y[train_index]
231 | X_test = X[test_index]
232 | y_test = y[test_index]
233 |
234 | lgb_train = lgb.Dataset(X_train, y_train)
235 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
236 |
237 | gbm = lgb.train(self.params,
238 | lgb_train,
239 | num_boost_round=self.num_boost_round,
240 | valid_sets=lgb_eval,
241 | early_stopping_rounds=self.early_stopping_rounds)
242 |
243 | self.stacking_model.append(gbm)
244 |
245 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
246 | layer_train[test_index, 1] = pred_y
247 |
248 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
249 | else:
250 | pass
251 | for bn in range(self.bagging_num):
252 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
253 |
254 | lgb_train = lgb.Dataset(X_train, y_train)
255 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
256 |
257 | gbm = lgb.train(self.params,
258 | lgb_train,
259 | num_boost_round=10000,
260 | valid_sets=lgb_eval,
261 | early_stopping_rounds=200)
262 |
263 | self.bagging_model.append(gbm)
264 |
265 | def predict(self, X_pred):
266 | """ predict test data. """
267 | if self.stacking_num > 1:
268 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
269 | for sn,gbm in enumerate(self.stacking_model):
270 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
271 | test_pred[:, sn] = pred
272 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
273 | else:
274 | pass
275 | for bn,gbm in enumerate(self.bagging_model):
276 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
277 | if bn == 0:
278 | pred_out=pred
279 | else:
280 | pred_out+=pred
281 | return pred_out/self.bagging_num
282 |
283 |
284 | # Model parameters
285 | params = {
286 | 'boosting_type': 'gbdt',
287 | 'objective': 'binary',
288 | 'metric': 'auc',
289 | 'learning_rate': 0.01,
290 | 'num_leaves': 2 ** 5 - 1,
291 | 'min_child_samples': 100,
292 | 'max_bin': 100,
293 | 'subsample': .7,
294 | 'subsample_freq': 1,
295 | 'colsample_bytree': 0.7,
296 | 'min_child_weight': 0,
297 | 'scale_pos_weight': 25,
298 | 'seed': 42,
299 | 'nthread': 20,
300 | 'verbose': 0,
301 | }
302 | # Fit the model
303 | model = SBBTree(params=params, \
304 | stacking_num=5, \
305 | bagging_num=5, \
306 | bagging_test_size=0.33, \
307 | num_boost_round=10000, \
308 | early_stopping_rounds=200)
309 | model.fit(X_train, y_train)
310 |
311 |
312 | print('train is ok')
313 | y_predict = model.predict(X_test)
314 | print('pred test is ok')
315 | # y_train_predict = model.predict(X_train)
316 |
317 |
318 |
320 | test_head['pred_prob'] = y_predict
321 | test_head.to_csv('feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
322 |
--------------------------------------------------------------------------------
/code/lgb_model/lgb_train3.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from datetime import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 |
12 | ## Read files with reduced memory; adapted from 鱼佬's Tencent competition code
13 | def reduce_mem_usage(df, verbose=True):
14 | numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
15 | start_mem = df.memory_usage().sum() / 1024**2
16 | for col in df.columns:
17 | col_type = df[col].dtypes
18 | if col_type in numerics:
19 | c_min = df[col].min()
20 | c_max = df[col].max()
21 | if str(col_type)[:3] == 'int':
22 | if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
23 | df[col] = df[col].astype(np.int8)
24 | elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
25 | df[col] = df[col].astype(np.int16)
26 | elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
27 | df[col] = df[col].astype(np.int32)
28 | elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
29 | df[col] = df[col].astype(np.int64)
30 | else:
31 | if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
32 | df[col] = df[col].astype(np.float16)
33 | elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
34 | df[col] = df[col].astype(np.float32)
35 | else:
36 | df[col] = df[col].astype(np.float64)
37 | end_mem = df.memory_usage().sum() / 1024**2
38 | if verbose:
39 | print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
40 | return df
41 |
42 | df_train = reduce_mem_usage(pd.read_csv('./df_train.csv'))
43 | df_test = reduce_mem_usage(pd.read_csv('./df_test.csv'))
44 | ## Extra feature files can be merged in here; if df_train changes, note in the output filename which feature files were used
45 | ### Feature flag: 1 = one-week features, 12 = one- plus two-week, 123 = plus three-week, 2 = two-week features only
46 |
47 |
48 | df_user=reduce_mem_usage(pd.read_csv('./jdata_user.csv'))
49 | df_comment=reduce_mem_usage(pd.read_csv('./jdata_comment.csv'))
50 | df_shop=reduce_mem_usage(pd.read_csv('./jdata_shop.csv'))
51 |
52 | # 1) Action data (jdata_action)
53 | jdata_action = reduce_mem_usage(pd.read_csv('./jdata_action.csv'))
54 |
55 | # 3) Product data (jdata_product)
56 | jdata_product = reduce_mem_usage(pd.read_csv('./jdata_product.csv'))
57 |
58 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
59 | label_flag = 3
60 | train_buy = jdata_data[(jdata_data['action_time']>='2018-03-26') & (jdata_data['action_time']<'2018-04-02') & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
61 | train_buy['label'] = 1
62 | # Candidate set: (user, cate, shop) triples with any action in the three weeks 2018-03-05 to 2018-03-25
63 | win_size = 3  # 2 for a two-week behavior window, 3 for three weeks
64 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-05') & (jdata_data['action_time']<'2018-03-26')][['user_id','cate','shop_id']].drop_duplicates()
65 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
66 |
67 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
68 |
69 |
70 | def mapper_year(x):
71 | if pd.notna(x):
72 | year = int(x[:4])
73 | return 2018 - year
74 |
75 |
76 | def mapper_month(x):
77 | if pd.notna(x):
78 | year = int(x[:4])
79 | month = int(x[5:7])
80 | return (2018 - year) * 12 + month
81 |
82 |
83 | def mapper_day(x):
84 | if pd.notna(x):
85 | year = int(x[:4])
86 | month = int(x[5:7])
87 | day = int(x[8:10])
88 | return (2018 - year) * 365 + month * 30 + day
89 |
90 |
91 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
92 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
93 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
94 |
95 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
96 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
97 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
98 |
99 |
100 |
101 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
102 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
103 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
104 |
105 | df_user['age'] = df_user['age'].fillna(5)
106 |
107 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
108 | print('check point ...')
109 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
110 |
111 | df_product_comment = df_product_comment.fillna(0)
112 |
113 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
114 |
115 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
116 |
117 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
118 |
119 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
120 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
121 |
122 |
123 |
124 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
125 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
126 |
127 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
128 |
129 |
130 |
131 |
132 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-26') & (jdata_data['action_time'] < '2018-04-16')][
133 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
134 |
135 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
136 |
137 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
138 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
139 |
140 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
141 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
142 |
143 |
144 |
145 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
146 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
147 |
148 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
149 |
150 |
151 |
152 | ### Keep six weeks of train features: 2018-02-12 to 2018-03-26
153 | train_set = train_set.drop(['2018-04-02-2018-04-09-action_1', '2018-04-02-2018-04-09-action_2',
154 | '2018-04-02-2018-04-09-action_3', '2018-04-02-2018-04-09-action_4',
155 | '2018-03-26-2018-04-02-action_1', '2018-03-26-2018-04-02-action_2',
156 | '2018-03-26-2018-04-02-action_3', '2018-03-26-2018-04-02-action_4',
157 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
158 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
159 |
160 |
161 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
162 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
163 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
164 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
165 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
166 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
167 | '2018-02-12-2018-02-19-action_4'],axis=1)
168 |
169 |
170 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
171 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
172 |
173 |
174 |
175 | test_head=test_set[['user_id','cate','shop_id']].copy()
176 | train_head=train_set[['user_id','cate','shop_id']].copy()
177 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
178 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
179 |
180 |
181 | # Prepare the data matrices
182 | X_train = train_set.drop(['label'],axis=1).values
183 | y_train = train_set['label'].values
184 | X_test = test_set.values
185 |
186 |
187 | del train_set
188 | del test_set
189 |
190 | print('------------------start modelling----------------')
191 | # Model utilities
192 | class SBBTree():
193 | """Stacking, Bootstrap, Bagging ---- SBBTree."""
194 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
195 | """
196 | Initializes the SBBTree.
197 | Args:
198 | params : lgb params.
199 | stacking_num : number of stacking folds (k-fold).
200 | bagging_num : bootstrap num.
201 | bagging_test_size : bootstrap sample rate.
202 | num_boost_round : boost num.
203 | early_stopping_rounds : early_stopping_rounds.
204 | """
205 | self.params = params
206 | self.stacking_num = stacking_num
207 | self.bagging_num = bagging_num
208 | self.bagging_test_size = bagging_test_size
209 | self.num_boost_round = num_boost_round
210 | self.early_stopping_rounds = early_stopping_rounds
211 |
212 | self.model = lgb
213 | self.stacking_model = []
214 | self.bagging_model = []
215 |
216 | def fit(self, X, y):
217 | """ fit model. """
218 | if self.stacking_num > 1:
219 | layer_train = np.zeros((X.shape[0], 2))
220 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
221 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
222 | X_train = X[train_index]
223 | y_train = y[train_index]
224 | X_test = X[test_index]
225 | y_test = y[test_index]
226 |
227 | lgb_train = lgb.Dataset(X_train, y_train)
228 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
229 |
230 | gbm = lgb.train(self.params,
231 | lgb_train,
232 | num_boost_round=self.num_boost_round,
233 | valid_sets=lgb_eval,
234 | early_stopping_rounds=self.early_stopping_rounds)
235 |
236 | self.stacking_model.append(gbm)
237 |
238 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
239 | layer_train[test_index, 1] = pred_y
240 |
241 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
242 | else:
243 | pass
244 | for bn in range(self.bagging_num):
245 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
246 |
247 | lgb_train = lgb.Dataset(X_train, y_train)
248 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
249 |
250 | gbm = lgb.train(self.params,
251 | lgb_train,
252 | num_boost_round=10000,
253 | valid_sets=lgb_eval,
254 | early_stopping_rounds=200)
255 |
256 | self.bagging_model.append(gbm)
257 |
258 | def predict(self, X_pred):
259 | """ predict test data. """
260 | if self.stacking_num > 1:
261 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
262 | for sn,gbm in enumerate(self.stacking_model):
263 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
264 | test_pred[:, sn] = pred
265 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
266 | else:
267 | pass
268 | for bn,gbm in enumerate(self.bagging_model):
269 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
270 | if bn == 0:
271 | pred_out=pred
272 | else:
273 | pred_out+=pred
274 | return pred_out/self.bagging_num
275 |
276 |
277 | # Model parameters
278 | params = {
279 | 'boosting_type': 'gbdt',
280 | 'objective': 'binary',
281 | 'metric': 'auc',
282 | 'learning_rate': 0.01,
283 | 'num_leaves': 2 ** 5 - 1,
284 | 'min_child_samples': 100,
285 | 'max_bin': 100,
286 | 'subsample': .7,
287 | 'subsample_freq': 1,
288 | 'colsample_bytree': 0.7,
289 | 'min_child_weight': 0,
290 | 'scale_pos_weight': 25,
291 | 'seed': 42,
292 | 'nthread': 20,
293 | 'verbose': 0,
294 | }
295 | # Fit the model
296 | model = SBBTree(params=params, \
297 | stacking_num=5, \
298 | bagging_num=5, \
299 | bagging_test_size=0.33, \
300 | num_boost_round=10000, \
301 | early_stopping_rounds=200)
302 | model.fit(X_train, y_train)
303 |
304 |
305 | print('train is ok')
306 | y_predict = model.predict(X_test)
307 | print('pred test is ok')
308 | # y_train_predict = model.predict(X_train)
309 |
310 |
311 |
313 | test_head['pred_prob'] = y_predict
314 | test_head.to_csv('feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
315 |
--------------------------------------------------------------------------------
/code/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | python EDA13.py
3 |
4 | echo "base feature is ok"
5 |
6 | python EDA16-fourWeek.py
7 | python EDA16-fourWeek_rightTime.py
8 | python EDA16-threeWeek.py
9 | python EDA16-threeWeek_rightTime.py
10 | python EDA16-twoWeek.py
11 |
12 | echo "A result is ok"
13 |
14 | python sbb_train1.py
15 | python sbb_train2.py
16 | python sbb_train3.py
17 | echo "B win size 3 is ok"
18 | python sbb2_train1.py
19 | python sbb2_train2.py
20 | python sbb2_train3.py
21 | echo "B win size 2 is ok"
22 | python sbb4_train1.py
23 | python sbb4_train2.py
24 | python sbb4_train3.py
25 | echo "B win size 4 is ok"
26 | python gen_result.py
27 |
28 | echo "finish,,,,,"
29 |
--------------------------------------------------------------------------------
/code/sbb2_train1.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 |
12 | df_train=pd.read_csv('../output/df_train.csv')
13 | df_test=pd.read_csv('../output/df_test.csv')
14 | ## Extra feature files can be merged in here; if df_train changes, note in the output filename which feature files were used
15 | ### Feature flag: 1 = one-week features, 12 = one- plus two-week, 123 = plus three-week, 2 = two-week features only
16 |
17 | # df_train_two=pd.read_csv('../output/df_train_two.csv')
18 | # df_test_two=pd.read_csv('../output/df_test_two.csv')
19 | # df_train = df_train.merge(df_train_two,on=['user_id','cate','shop_id'],how='left')
20 | # df_test = df_test.merge(df_test_two,on=['user_id','cate','shop_id'],how='left')
21 |
22 |
23 | df_user=pd.read_csv('../data/jdata_user.csv')
24 | df_comment=pd.read_csv('../data/jdata_comment.csv')
25 | df_shop=pd.read_csv('../data/jdata_shop.csv')
26 |
27 | # 1) Action data (jdata_action)
28 | jdata_action = pd.read_csv('../data/jdata_action.csv')
29 |
30 | # 3) Product data (jdata_product)
31 | jdata_product = pd.read_csv('../data/jdata_product.csv')
32 |
33 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
34 | label_flag = 1
35 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-09')
36 | & (jdata_data['action_time']<'2018-04-16')
37 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
38 | train_buy['label'] = 1
39 | # Candidate set: (user, cate, shop) triples with any action in the two weeks 2018-03-26 to 2018-04-08
40 | win_size = 2  # 2 for a two-week behavior window, 3 for three weeks
41 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-26')
42 | & (jdata_data['action_time']<'2018-04-09')][['user_id','cate','shop_id']].drop_duplicates()
43 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
44 |
45 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
46 |
47 | def mapper_year(x):
48 | if pd.notna(x):
49 | year = int(x[:4])
50 | return 2018 - year
51 |
52 |
53 | def mapper_month(x):
54 | if pd.notna(x):
55 | year = int(x[:4])
56 | month = int(x[5:7])
57 | return (2018 - year) * 12 + month
58 |
59 |
60 | def mapper_day(x):
61 | if pd.notna(x):
62 | year = int(x[:4])
63 | month = int(x[5:7])
64 | day = int(x[8:10])
65 | return (2018 - year) * 365 + month * 30 + day
66 |
67 |
68 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
69 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
70 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
71 |
72 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
73 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
74 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
75 |
76 |
77 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
78 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
79 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
80 |
81 | df_user['age'] = df_user['age'].fillna(5)
82 |
83 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
84 | print('check point ...')
85 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
86 |
87 | df_product_comment = df_product_comment.fillna(0)
88 |
89 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
90 |
91 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
92 |
93 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
94 |
95 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
96 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
97 |
98 |
99 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
100 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
101 |
102 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
103 |
104 |
105 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-04-02') & (jdata_data['action_time'] < '2018-04-16')][
106 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
107 |
108 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
109 |
110 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
111 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
112 |
113 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
114 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
115 |
116 |
117 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
118 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
119 |
120 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
121 |
122 |
123 | ### Keep six weeks of train features: 2018-02-26 to 2018-04-09
124 | train_set = train_set.drop(['2018-02-19-2018-02-26-action_1', '2018-02-19-2018-02-26-action_2',
125 | '2018-02-19-2018-02-26-action_3', '2018-02-19-2018-02-26-action_4',
126 | '2018-02-12-2018-02-19-action_1', '2018-02-12-2018-02-19-action_2',
127 | '2018-02-12-2018-02-19-action_3', '2018-02-12-2018-02-19-action_4',
128 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
129 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
130 |
131 |
132 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
133 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
134 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
135 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
136 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
137 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
138 | '2018-02-12-2018-02-19-action_4'],axis=1)
139 |
140 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
141 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
142 |
143 |
144 |
145 |
146 | test_head=test_set[['user_id','cate','shop_id']].copy()
147 | train_head=train_set[['user_id','cate','shop_id']].copy()
148 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
149 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
150 |
151 |
152 | # Prepare the data matrices
153 | X_train = train_set.drop(['label'],axis=1).values
154 | y_train = train_set['label'].values
155 | X_test = test_set.values
156 |
157 | del train_set
158 | del test_set
159 |
160 | import gc
161 | gc.collect()
162 |
163 |
164 |
165 | # Model utilities
166 | class SBBTree():
167 | """Stacking, Bootstrap, Bagging ---- SBBTree."""
168 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
169 | """
170 | Initializes the SBBTree.
171 | Args:
172 | params : lgb params.
173 | stacking_num : number of stacking folds (k-fold).
174 | bagging_num : bootstrap num.
175 | bagging_test_size : bootstrap sample rate.
176 | num_boost_round : boost num.
177 | early_stopping_rounds : early_stopping_rounds.
178 | """
179 | self.params = params
180 | self.stacking_num = stacking_num
181 | self.bagging_num = bagging_num
182 | self.bagging_test_size = bagging_test_size
183 | self.num_boost_round = num_boost_round
184 | self.early_stopping_rounds = early_stopping_rounds
185 |
186 | self.model = lgb
187 | self.stacking_model = []
188 | self.bagging_model = []
189 |
190 | def fit(self, X, y):
191 | """ fit model. """
192 | if self.stacking_num > 1:
193 | layer_train = np.zeros((X.shape[0], 2))
194 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
195 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
196 | X_train = X[train_index]
197 | y_train = y[train_index]
198 | X_test = X[test_index]
199 | y_test = y[test_index]
200 |
201 | lgb_train = lgb.Dataset(X_train, y_train)
202 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
203 |
204 | gbm = lgb.train(self.params,
205 | lgb_train,
206 | num_boost_round=self.num_boost_round,
207 | valid_sets=lgb_eval,
208 | early_stopping_rounds=self.early_stopping_rounds,
209 | verbose_eval=300)
210 |
211 | self.stacking_model.append(gbm)
212 |
213 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
214 | layer_train[test_index, 1] = pred_y
215 |
216 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
217 | else:
218 | pass
219 | for bn in range(self.bagging_num):
220 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
221 |
222 | lgb_train = lgb.Dataset(X_train, y_train)
223 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
224 |
225 | gbm = lgb.train(self.params,
226 | lgb_train,
227 | num_boost_round=10000,
228 | valid_sets=lgb_eval,
229 | early_stopping_rounds=200,
230 | verbose_eval=300)
231 |
232 | self.bagging_model.append(gbm)
233 |
234 | def predict(self, X_pred):
235 | """ predict test data. """
236 | if self.stacking_num > 1:
237 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
238 | for sn,gbm in enumerate(self.stacking_model):
239 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
240 | test_pred[:, sn] = pred
241 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
242 | else:
243 | pass
244 | for bn,gbm in enumerate(self.bagging_model):
245 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
246 | if bn == 0:
247 | pred_out=pred
248 | else:
249 | pred_out+=pred
250 | return pred_out/self.bagging_num
251 |
252 | # Model parameters
253 | params = {
254 | 'boosting_type': 'gbdt',
255 | 'objective': 'binary',
256 | 'metric': 'auc',
257 | 'learning_rate': 0.01,
258 | 'num_leaves': 2 ** 5 - 1,
259 | 'min_child_samples': 100,
260 | 'max_bin': 100,
261 | 'subsample': 0.8,
262 | 'subsample_freq': 1,
263 | 'colsample_bytree': 0.8,
264 | 'min_child_weight': 0,
265 | 'scale_pos_weight': 25,
266 | 'seed': 2019,
267 | 'nthread': 4,
268 | 'verbose': 0,
269 | }
270 |
271 | # Fit the model
272 | model = SBBTree(params=params,\
273 | stacking_num=5,\
274 | bagging_num=5,\
275 | bagging_test_size=0.33,\
276 | num_boost_round=10000,\
277 | early_stopping_rounds=200)
278 | model.fit(X_train, y_train)
279 | print('train is ok')
280 | y_predict = model.predict(X_test)
281 | print('pred test is ok')
282 | # y_train_predict = model.predict(X_train)
283 |
284 |
286 | test_head['pred_prob'] = y_predict
287 | test_head.to_csv('../feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
288 |
289 |
290 |
--------------------------------------------------------------------------------
/code/sbb2_train2.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 | df_train=pd.read_csv('../output/df_train.csv')
12 | df_test=pd.read_csv('../output/df_test.csv')
13 | ## Extra feature files can be merged in here; if df_train changes, note in the output filename which feature files were used
14 | ### Feature flag: 1 = one-week features, 12 = one- plus two-week, 123 = plus three-week, 2 = two-week features only
15 |
16 | # df_train_two=pd.read_csv('../output/df_train_two.csv')
17 | # df_test_two=pd.read_csv('../output/df_test_two.csv')
18 | # df_train = df_train.merge(df_train_two,on=['user_id','cate','shop_id'],how='left')
19 | # df_test = df_test.merge(df_test_two,on=['user_id','cate','shop_id'],how='left')
20 |
21 | df_user=pd.read_csv('../data/jdata_user.csv')
22 | df_comment=pd.read_csv('../data/jdata_comment.csv')
23 | df_shop=pd.read_csv('../data/jdata_shop.csv')
24 |
25 | # 1) Action data (jdata_action)
26 | jdata_action = pd.read_csv('../data/jdata_action.csv')
27 |
28 | # 3) Product data (jdata_product)
29 | jdata_product = pd.read_csv('../data/jdata_product.csv')
30 |
31 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
32 | label_flag = 2
33 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-02')
34 | & (jdata_data['action_time']<'2018-04-09')
35 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
36 | train_buy['label'] = 1
37 | # Candidate set: (user, cate, shop) triples with any action in the two weeks 2018-03-19 to 2018-04-01
38 | win_size = 2  # 2 for a two-week behavior window, 3 for three weeks
39 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-19')
40 | & (jdata_data['action_time']<'2018-04-02')][['user_id','cate','shop_id']].drop_duplicates()
41 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
42 |
43 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
44 |
45 | def mapper_year(x):
46 | if pd.notna(x):
47 | year = int(x[:4])
48 | return 2018 - year
49 |
50 |
51 | def mapper_month(x):
52 | if pd.notna(x):
53 | year = int(x[:4])
54 | month = int(x[5:7])
55 | return (2018 - year) * 12 + month
56 |
57 |
58 | def mapper_day(x):
59 | if pd.notna(x):
60 | year = int(x[:4])
61 | month = int(x[5:7])
62 | day = int(x[8:10])
63 | return (2018 - year) * 365 + month * 30 + day
64 |
65 |
66 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
67 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
68 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
69 |
70 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
71 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
72 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
73 |
74 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
75 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
76 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
77 |
78 | df_user['age'] = df_user['age'].fillna(5)
79 |
80 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
81 | print('check point ...')
82 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
83 |
84 | df_product_comment = df_product_comment.fillna(0)
85 |
86 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
87 |
88 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
89 |
90 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
91 |
92 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
93 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
94 |
95 |
96 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
97 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
98 |
99 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
100 |
101 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-04-02') & (jdata_data['action_time'] < '2018-04-16')][
102 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
103 |
104 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
105 |
106 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
107 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
108 |
109 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
110 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
111 |
112 |
113 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
114 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
115 |
116 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
117 |
118 | ### Keep six weeks of train features: 2018-02-19 to 2018-04-02
119 | train_set = train_set.drop([
120 | '2018-04-02-2018-04-09-action_1', '2018-04-02-2018-04-09-action_2',
121 | '2018-04-02-2018-04-09-action_3', '2018-04-02-2018-04-09-action_4',
122 | '2018-02-12-2018-02-19-action_1', '2018-02-12-2018-02-19-action_2',
123 | '2018-02-12-2018-02-19-action_3', '2018-02-12-2018-02-19-action_4',
124 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
125 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
126 |
127 |
128 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
129 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
130 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
131 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
132 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
133 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
134 | '2018-02-12-2018-02-19-action_4'],axis=1)
135 |
136 |
137 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
138 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
139 |
140 | test_head=test_set[['user_id','cate','shop_id']].copy()
141 | train_head=train_set[['user_id','cate','shop_id']].copy()
142 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
143 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
144 |
145 |
146 | # Prepare the data matrices
147 | X_train = train_set.drop(['label'],axis=1).values
148 | y_train = train_set['label'].values
149 | X_test = test_set.values
150 |
151 | del train_set
152 | del test_set
153 |
154 |
155 | import gc
156 | gc.collect()
157 |
158 |
159 | # Model utilities
160 | class SBBTree():
161 | """Stacking, Bootstrap, Bagging ---- SBBTree."""
162 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
163 | """
164 | Initializes the SBBTree.
165 | Args:
166 | params : lgb params.
167 | stacking_num : number of stacking folds (k-fold).
168 | bagging_num : bootstrap num.
169 | bagging_test_size : bootstrap sample rate.
170 | num_boost_round : boost num.
171 | early_stopping_rounds : early_stopping_rounds.
172 | """
173 | self.params = params
174 | self.stacking_num = stacking_num
175 | self.bagging_num = bagging_num
176 | self.bagging_test_size = bagging_test_size
177 | self.num_boost_round = num_boost_round
178 | self.early_stopping_rounds = early_stopping_rounds
179 |
180 | self.model = lgb
181 | self.stacking_model = []
182 | self.bagging_model = []
183 |
184 | def fit(self, X, y):
185 | """ fit model. """
186 | if self.stacking_num > 1:
187 | layer_train = np.zeros((X.shape[0], 2))
188 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
189 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
190 | X_train = X[train_index]
191 | y_train = y[train_index]
192 | X_test = X[test_index]
193 | y_test = y[test_index]
194 |
195 | lgb_train = lgb.Dataset(X_train, y_train)
196 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
197 |
198 | gbm = lgb.train(self.params,
199 | lgb_train,
200 | num_boost_round=self.num_boost_round,
201 | valid_sets=lgb_eval,
202 | early_stopping_rounds=self.early_stopping_rounds,
203 | verbose_eval=300)
204 |
205 | self.stacking_model.append(gbm)
206 |
207 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
208 | layer_train[test_index, 1] = pred_y
209 |
210 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
211 | else:
212 | pass
213 | for bn in range(self.bagging_num):
214 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
215 |
216 | lgb_train = lgb.Dataset(X_train, y_train)
217 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
218 |
219 | gbm = lgb.train(self.params,
220 | lgb_train,
221 | num_boost_round=10000,
222 | valid_sets=lgb_eval,
223 | early_stopping_rounds=200,
224 | verbose_eval=300)
225 |
226 | self.bagging_model.append(gbm)
227 |
228 | def predict(self, X_pred):
229 | """ predict test data. """
230 | if self.stacking_num > 1:
231 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
232 | for sn,gbm in enumerate(self.stacking_model):
233 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
234 | test_pred[:, sn] = pred
235 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
236 | else:
237 | pass
238 | for bn,gbm in enumerate(self.bagging_model):
239 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
240 | if bn == 0:
241 | pred_out=pred
242 | else:
243 | pred_out+=pred
244 | return pred_out/self.bagging_num
245 |
246 | # Model parameters
247 | params = {
248 | 'boosting_type': 'gbdt',
249 | 'objective': 'binary',
250 | 'metric': 'auc',
251 | 'learning_rate': 0.01,
252 | 'num_leaves': 2 ** 5 - 1,
253 | 'min_child_samples': 100,
254 | 'max_bin': 100,
255 | 'subsample': 0.8,
256 | 'subsample_freq': 1,
257 | 'colsample_bytree': 0.8,
258 | 'min_child_weight': 0,
259 | 'scale_pos_weight': 25,
260 | 'seed': 2019,
261 | 'nthread': 4,
262 | 'verbose': 0,
263 | }
264 |
265 | # Fit the model
266 | model = SBBTree(params=params,\
267 | stacking_num=5,\
268 | bagging_num=5,\
269 | bagging_test_size=0.33,\
270 | num_boost_round=10000,\
271 | early_stopping_rounds=200)
272 | model.fit(X_train, y_train)
273 | print('train is ok')
274 | y_predict = model.predict(X_test)
275 | print('pred test is ok')
276 | # y_train_predict = model.predict(X_train)
277 |
278 |
280 | test_head['pred_prob'] = y_predict
281 | test_head.to_csv('../feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
282 |
--------------------------------------------------------------------------------
/code/sbb2_train3.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 | df_train=pd.read_csv('../output/df_train.csv')
12 | df_test=pd.read_csv('../output/df_test.csv')
13 | ## Extra feature files can be merged in here; if df_train changes, note in the output filename which feature files were used
14 | ### Feature flag: 1 = one-week features, 12 = one- plus two-week, 123 = plus three-week, 2 = two-week features only
15 |
16 | # df_train_two=pd.read_csv('../output/df_train_two.csv')
17 | # df_test_two=pd.read_csv('../output/df_test_two.csv')
18 | # df_train = df_train.merge(df_train_two,on=['user_id','cate','shop_id'],how='left')
19 | # df_test = df_test.merge(df_test_two,on=['user_id','cate','shop_id'],how='left')
20 |
21 | df_user=pd.read_csv('../data/jdata_user.csv')
22 | df_comment=pd.read_csv('../data/jdata_comment.csv')
23 | df_shop=pd.read_csv('../data/jdata_shop.csv')
24 |
25 | # 1) Behavior data (jdata_action)
26 | jdata_action = pd.read_csv('../data/jdata_action.csv')
27 |
29 | # 3) Product data (jdata_product)
29 | jdata_product = pd.read_csv('../data/jdata_product.csv')
30 |
31 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
32 | label_flag = 3
33 | train_buy = jdata_data[(jdata_data['action_time']>='2018-03-26')
34 | & (jdata_data['action_time']<'2018-04-02')
35 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
36 | train_buy['label'] = 1
37 | # Candidate set: (user, cate, shop) triples with any action during '2018-03-12'-'2018-03-25', the two weeks before the label week
38 | win_size = 2  # candidate window in weeks: 2 = two weeks, 3 = three weeks
39 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-12')
40 | & (jdata_data['action_time']<'2018-03-26')][['user_id','cate','shop_id']].drop_duplicates()
41 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
42 |
43 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
44 |
45 |
46 | def mapper_year(x):
47 | if x is not np.nan:
48 | year = int(x[:4])
49 | return 2018 - year
50 |
51 |
52 | def mapper_month(x):
53 | if x is not np.nan:
54 | year = int(x[:4])
55 | month = int(x[5:7])
56 | return (2018 - year) * 12 + month
57 |
58 |
59 | def mapper_day(x):
60 | if x is not np.nan:
61 | year = int(x[:4])
62 | month = int(x[5:7])
63 | day = int(x[8:10])
64 | return (2018 - year) * 365 + month * 30 + day
65 |
66 |
67 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
68 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
69 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
70 |
71 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
72 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
73 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
74 |
75 |
76 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
77 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
78 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
79 |
80 | df_user['age'] = df_user['age'].fillna(5)
81 |
82 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
83 | print('check point ...')
84 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
85 |
86 | df_product_comment = df_product_comment.fillna(0)
87 |
88 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
89 |
90 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
91 |
92 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
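# Shop-level comment aggregates: sku-level comment counts are summed per shop, then joined onto the shop table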
93 |
94 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
95 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
96 |
97 |
98 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
99 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
100 |
101 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
102 |
103 |
104 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-04-02') & (jdata_data['action_time'] < '2018-04-16')][
105 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
106 |
107 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
108 |
109 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
110 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
111 |
112 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
113 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
114 |
115 |
116 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
117 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
118 |
119 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
120 |
121 |
122 | ### Keep six weeks of weekly action features: drop the weekly action columns outside this fold's feature window
123 | train_set = train_set.drop(['2018-04-02-2018-04-09-action_1', '2018-04-02-2018-04-09-action_2',
124 | '2018-04-02-2018-04-09-action_3', '2018-04-02-2018-04-09-action_4',
125 | '2018-03-26-2018-04-02-action_1', '2018-03-26-2018-04-02-action_2',
126 | '2018-03-26-2018-04-02-action_3', '2018-03-26-2018-04-02-action_4',
127 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
128 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
129 |
130 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
131 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
132 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
133 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
134 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
135 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
136 | '2018-02-12-2018-02-19-action_4'],axis=1)
137 |
138 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
139 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
140 |
141 | test_head=test_set[['user_id','cate','shop_id']]
142 | train_head=train_set[['user_id','cate','shop_id']]
143 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
144 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
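# Sanity check: train_set should now have exactly one more column (the label) than test_set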
145 | if(train_set.shape[1]-1==test_set.shape[1]):
146 | print('ok',train_set.shape[1])
147 | else:
148 | exit()
149 |
150 |
151 | # Prepare training/test arrays
152 | X_train = train_set.drop(['label'],axis=1).values
153 | y_train = train_set['label'].values
154 | X_test = test_set.values
155 |
156 | del train_set
157 | del test_set
158 |
159 | import gc
160 | gc.collect()
161 |
162 |
163 | # Model utility: stacking + bagging wrapper around LightGBM
164 | class SBBTree():
165 | """Stacking,Bootstap,Bagging----SBBTree"""
166 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
167 | """
168 | Initializes the SBBTree.
169 | Args:
170 | params : lgb params.
171 | stacking_num : number of stacking (k-fold) splits.
172 | bagging_num : number of bagging rounds.
173 | bagging_test_size : holdout fraction per bagging round.
174 | num_boost_round : max boosting rounds per model.
175 | early_stopping_rounds : early stopping patience.
176 | """
177 | self.params = params
178 | self.stacking_num = stacking_num
179 | self.bagging_num = bagging_num
180 | self.bagging_test_size = bagging_test_size
181 | self.num_boost_round = num_boost_round
182 | self.early_stopping_rounds = early_stopping_rounds
183 |
184 | self.model = lgb
185 | self.stacking_model = []
186 | self.bagging_model = []
187 |
188 | def fit(self, X, y):
189 | """ fit model. """
190 | if self.stacking_num > 1:
191 | layer_train = np.zeros((X.shape[0], 2))
192 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
193 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
194 | X_train = X[train_index]
195 | y_train = y[train_index]
196 | X_test = X[test_index]
197 | y_test = y[test_index]
198 |
199 | lgb_train = lgb.Dataset(X_train, y_train)
200 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
201 |
202 | gbm = lgb.train(self.params,
203 | lgb_train,
204 | num_boost_round=self.num_boost_round,
205 | valid_sets=lgb_eval,
206 | early_stopping_rounds=self.early_stopping_rounds,
207 | verbose_eval=300)
208 |
209 | self.stacking_model.append(gbm)
210 |
211 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
212 | layer_train[test_index, 1] = pred_y
213 |
214 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
215 | else:
216 | pass
217 | for bn in range(self.bagging_num):
218 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
219 |
220 | lgb_train = lgb.Dataset(X_train, y_train)
221 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
222 |
223 | gbm = lgb.train(self.params,
224 | lgb_train,
225 | num_boost_round=10000,
226 | valid_sets=lgb_eval,
227 | early_stopping_rounds=200,
228 | verbose_eval=300)
229 |
230 | self.bagging_model.append(gbm)
231 |
232 | def predict(self, X_pred):
233 | """ predict test data. """
234 | if self.stacking_num > 1:
235 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
236 | for sn,gbm in enumerate(self.stacking_model):
237 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
238 | test_pred[:, sn] = pred
239 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
240 | else:
241 | pass
242 | for bn,gbm in enumerate(self.bagging_model):
243 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
244 | if bn == 0:
245 | pred_out=pred
246 | else:
247 | pred_out+=pred
248 | return pred_out/self.bagging_num
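# SBBTree flow: fit() trains stacking_num out-of-fold LightGBM models and appends their
# out-of-fold predictions to X as one extra meta-feature, then trains bagging_num more
# models on random splits of the augmented X; predict() rebuilds the meta-feature from the
# mean of the stacking models and returns the average of the bagging models' predictions.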
249 |
250 | # LightGBM parameters
251 | params = {
252 | 'boosting_type': 'gbdt',
253 | 'objective': 'binary',
254 | 'metric': 'auc',
255 | 'learning_rate': 0.01,
256 | 'num_leaves': 2 ** 5 - 1,
257 | 'min_child_samples': 100,
258 | 'max_bin': 100,
259 | 'subsample': 0.8,
260 | 'subsample_freq': 1,
261 | 'colsample_bytree': 0.8,
262 | 'min_child_weight': 0,
263 | 'scale_pos_weight': 25,
264 | 'seed': 2019,
265 | 'nthread': 4,
266 | 'verbose': 0,
267 | }
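# Note: 'scale_pos_weight': 25 up-weights the rare purchase class in this heavily
# imbalanced task, and with 'learning_rate': 0.01 the effective number of trees is
# chosen by early stopping rather than by num_boost_round.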
268 |
269 | # Train and apply the model
270 | model = SBBTree(params=params,\
271 | stacking_num=5,\
272 | bagging_num=5,\
273 | bagging_test_size=0.33,\
274 | num_boost_round=10000,\
275 | early_stopping_rounds=200)
276 | model.fit(X_train, y_train)
277 | print('train is ok')
278 | y_predict = model.predict(X_test)
279 | print('pred test is ok')
280 | # y_train_predict = model.predict(X_train)
281 |
282 |
284 | test_head['pred_prob'] = y_predict
285 | test_head.to_csv('../feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
286 |
--------------------------------------------------------------------------------
/code/sbb4_train1.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 | df_train=pd.read_csv('../output/df_train.csv')
12 | df_test=pd.read_csv('../output/df_test.csv')
13 | ## Extra feature files can be loaded and merged here; if df_train changes, note in the output filename which feature files were used
14 | ### Feature flag: 1 = week-1 features only, 12 = weeks 1+2, 123 = weeks 1-3, 2 = week-2 features only
15 |
16 | # df_train_two=pd.read_csv('../output/df_train_two.csv')
17 | # df_test_two=pd.read_csv('../output/df_test_two.csv')
18 | # df_train = df_train.merge(df_train_two,on=['user_id','cate','shop_id'],how='left')
19 | # df_test = df_test.merge(df_test_two,on=['user_id','cate','shop_id'],how='left')
20 |
21 |
22 | df_user=pd.read_csv('../data/jdata_user.csv')
23 | df_comment=pd.read_csv('../data/jdata_comment.csv')
24 | df_shop=pd.read_csv('../data/jdata_shop.csv')
25 |
26 | # 1) Behavior data (jdata_action)
27 | jdata_action = pd.read_csv('../data/jdata_action.csv')
28 |
30 | # 3) Product data (jdata_product)
30 | jdata_product = pd.read_csv('../data/jdata_product.csv')
31 |
32 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
33 | label_flag = 1
34 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-09')
35 | & (jdata_data['action_time']<'2018-04-16')
36 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
37 | train_buy['label'] = 1
38 | # Candidate set: (user, cate, shop) triples with any action during '2018-03-12'-'2018-04-08', the four weeks before the label week
39 | win_size = 4  # candidate window in weeks: 2 = two weeks, 3 = three, 4 = four
40 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-12')
41 | & (jdata_data['action_time']<'2018-04-09')][['user_id','cate','shop_id']].drop_duplicates()
42 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
43 |
44 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
45 |
46 |
47 | def mapper_year(x):
48 | if x is not np.nan:
49 | year = int(x[:4])
50 | return 2018 - year
51 |
52 |
53 | def mapper_month(x):
54 | if x is not np.nan:
55 | year = int(x[:4])
56 | month = int(x[5:7])
57 | return (2018 - year) * 12 + month
58 |
59 |
60 | def mapper_day(x):
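    # Rough day index: every month is treated as 30 days, which keeps the feature monotone in time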
61 | if x is not np.nan:
62 | year = int(x[:4])
63 | month = int(x[5:7])
64 | day = int(x[8:10])
65 | return (2018 - year) * 365 + month * 30 + day
66 |
67 |
68 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
69 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
70 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
71 |
72 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
73 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
74 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
75 |
76 |
77 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
78 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
79 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
80 |
81 | df_user['age'] = df_user['age'].fillna(5)
82 |
83 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
84 | print('check point ...')
85 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
86 |
87 | df_product_comment = df_product_comment.fillna(0)
88 |
89 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
90 |
91 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
92 |
93 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
94 |
95 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
96 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
97 |
98 |
99 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
100 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
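# Note: rows with fans_num == 0 or comments == 0 produce inf/NaN in these ratio features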
101 |
102 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
103 |
104 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-19') & (jdata_data['action_time'] < '2018-04-16')][
105 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
106 |
107 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
108 |
109 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
110 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
111 |
112 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
113 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
114 |
115 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
116 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
117 |
118 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
119 |
120 | ### Keep six weeks of weekly action features: drop the weekly action columns outside this fold's feature window
121 | train_set = train_set.drop(['2018-02-19-2018-02-26-action_1', '2018-02-19-2018-02-26-action_2',
122 | '2018-02-19-2018-02-26-action_3', '2018-02-19-2018-02-26-action_4',
123 | '2018-02-12-2018-02-19-action_1', '2018-02-12-2018-02-19-action_2',
124 | '2018-02-12-2018-02-19-action_3', '2018-02-12-2018-02-19-action_4',
125 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
126 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
127 |
128 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
129 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
130 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
131 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
132 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
133 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
134 | '2018-02-12-2018-02-19-action_4'],axis=1)
135 |
136 |
137 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
138 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
139 |
140 | test_head=test_set[['user_id','cate','shop_id']]
141 | train_head=train_set[['user_id','cate','shop_id']]
142 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
143 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
144 |
145 |
146 | # Prepare training/test arrays
147 | X_train = train_set.drop(['label'],axis=1).values
148 | y_train = train_set['label'].values
149 | X_test = test_set.values
150 |
151 | del train_set
152 | del test_set
153 |
154 | import gc
155 | gc.collect()
156 |
157 |
158 |
159 | # Model utility: stacking + bagging wrapper around LightGBM
160 | class SBBTree():
161 | """Stacking,Bootstap,Bagging----SBBTree"""
162 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
163 | """
164 | Initializes the SBBTree.
165 | Args:
166 | params : lgb params.
167 | stacking_num : number of stacking (k-fold) splits.
168 | bagging_num : number of bagging rounds.
169 | bagging_test_size : holdout fraction per bagging round.
170 | num_boost_round : max boosting rounds per model.
171 | early_stopping_rounds : early stopping patience.
172 | """
173 | self.params = params
174 | self.stacking_num = stacking_num
175 | self.bagging_num = bagging_num
176 | self.bagging_test_size = bagging_test_size
177 | self.num_boost_round = num_boost_round
178 | self.early_stopping_rounds = early_stopping_rounds
179 |
180 | self.model = lgb
181 | self.stacking_model = []
182 | self.bagging_model = []
183 |
184 | def fit(self, X, y):
185 | """ fit model. """
186 | if self.stacking_num > 1:
187 | layer_train = np.zeros((X.shape[0], 2))
188 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
189 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
190 | X_train = X[train_index]
191 | y_train = y[train_index]
192 | X_test = X[test_index]
193 | y_test = y[test_index]
194 |
195 | lgb_train = lgb.Dataset(X_train, y_train)
196 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
197 |
198 | gbm = lgb.train(self.params,
199 | lgb_train,
200 | num_boost_round=self.num_boost_round,
201 | valid_sets=lgb_eval,
202 | early_stopping_rounds=self.early_stopping_rounds,
203 | verbose_eval=300)
204 |
205 | self.stacking_model.append(gbm)
206 |
207 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
208 | layer_train[test_index, 1] = pred_y
209 |
210 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
211 | else:
212 | pass
213 | for bn in range(self.bagging_num):
214 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
215 |
216 | lgb_train = lgb.Dataset(X_train, y_train)
217 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
218 |
219 | gbm = lgb.train(self.params,
220 | lgb_train,
221 | num_boost_round=10000,
222 | valid_sets=lgb_eval,
223 | early_stopping_rounds=200,
224 | verbose_eval=300)
225 |
226 | self.bagging_model.append(gbm)
227 |
228 | def predict(self, X_pred):
229 | """ predict test data. """
230 | if self.stacking_num > 1:
231 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
232 | for sn,gbm in enumerate(self.stacking_model):
233 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
234 | test_pred[:, sn] = pred
235 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
236 | else:
237 | pass
238 | for bn,gbm in enumerate(self.bagging_model):
239 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
240 | if bn == 0:
241 | pred_out=pred
242 | else:
243 | pred_out+=pred
244 | return pred_out/self.bagging_num
245 |
246 | # LightGBM parameters
247 | params = {
248 | 'boosting_type': 'gbdt',
249 | 'objective': 'binary',
250 | 'metric': 'auc',
251 | 'learning_rate': 0.01,
252 | 'num_leaves': 2 ** 5 - 1,
253 | 'min_child_samples': 100,
254 | 'max_bin': 100,
255 | 'subsample': 0.8,
256 | 'subsample_freq': 1,
257 | 'colsample_bytree': 0.8,
258 | 'min_child_weight': 0,
259 | 'scale_pos_weight': 25,
260 | 'seed': 2019,
261 | 'nthread': 4,
262 | 'verbose': 0,
263 | }
264 |
265 | # Train and apply the model
266 | model = SBBTree(params=params,\
267 | stacking_num=5,\
268 | bagging_num=5,\
269 | bagging_test_size=0.33,\
270 | num_boost_round=10000,\
271 | early_stopping_rounds=200)
272 | model.fit(X_train, y_train)
273 | print('train is ok')
274 | y_predict = model.predict(X_test)
275 | print('pred test is ok')
276 | # y_train_predict = model.predict(X_train)
277 |
278 |
280 | test_head['pred_prob'] = y_predict
281 | test_head.to_csv('../feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
282 |
--------------------------------------------------------------------------------
/code/sbb4_train2 .py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 | df_train=pd.read_csv('../output/df_train.csv')
12 | df_test=pd.read_csv('../output/df_test.csv')
13 | ## Extra feature files can be loaded and merged here; if df_train changes, note in the output filename which feature files were used
14 | ### Feature flag: 1 = week-1 features only, 12 = weeks 1+2, 123 = weeks 1-3, 2 = week-2 features only
15 |
16 | # df_train_two=pd.read_csv('../output/df_train_two.csv')
17 | # df_test_two=pd.read_csv('../output/df_test_two.csv')
18 | # df_train = df_train.merge(df_train_two,on=['user_id','cate','shop_id'],how='left')
19 | # df_test = df_test.merge(df_test_two,on=['user_id','cate','shop_id'],how='left')
20 |
21 |
22 | df_user=pd.read_csv('../data/jdata_user.csv')
23 | df_comment=pd.read_csv('../data/jdata_comment.csv')
24 | df_shop=pd.read_csv('../data/jdata_shop.csv')
25 |
26 | # 1) Behavior data (jdata_action)
27 | jdata_action = pd.read_csv('../data/jdata_action.csv')
28 |
29 | # 3) Product data (jdata_product)
30 | jdata_product = pd.read_csv('../data/jdata_product.csv')
31 |
32 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
33 | label_flag = 2
34 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-02')
35 | & (jdata_data['action_time']<'2018-04-09')
36 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
37 | train_buy['label'] = 1
38 | # Candidate set: (user, cate, shop) triples with any action during '2018-03-05'-'2018-04-01', the four weeks before the label week
39 | win_size = 4  # candidate window in weeks: 2 = two weeks, 3 = three, 4 = four
40 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-05')
41 | & (jdata_data['action_time']<'2018-04-02')][['user_id','cate','shop_id']].drop_duplicates()
42 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
43 |
44 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
45 |
46 |
47 | def mapper_year(x):
48 | if x is not np.nan:
49 | year = int(x[:4])
50 | return 2018 - year
51 |
52 |
53 | def mapper_month(x):
54 | if x is not np.nan:
55 | year = int(x[:4])
56 | month = int(x[5:7])
57 | return (2018 - year) * 12 + month
58 |
59 |
60 | def mapper_day(x):
61 | if x is not np.nan:
62 | year = int(x[:4])
63 | month = int(x[5:7])
64 | day = int(x[8:10])
65 | return (2018 - year) * 365 + month * 30 + day
66 |
67 |
68 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
69 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
70 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
71 |
72 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
73 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
74 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
75 |
76 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
77 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
78 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
79 |
80 | df_user['age'] = df_user['age'].fillna(5)
81 |
82 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
83 | print('check point ...')
84 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
85 |
86 | df_product_comment = df_product_comment.fillna(0)
87 |
88 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
89 |
90 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
91 |
92 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
93 |
94 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
95 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
96 |
97 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
98 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
99 |
100 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
101 |
102 |
103 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-19') & (jdata_data['action_time'] < '2018-04-16')][
104 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
105 |
106 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
107 |
108 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
109 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
110 |
111 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
112 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
113 |
114 |
115 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
116 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
117 |
118 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
119 |
120 |
121 | ### Keep six weeks of weekly action features: drop the weekly action columns outside this fold's feature window
122 | train_set = train_set.drop([
123 | '2018-04-02-2018-04-09-action_1', '2018-04-02-2018-04-09-action_2',
124 | '2018-04-02-2018-04-09-action_3', '2018-04-02-2018-04-09-action_4',
125 | '2018-02-12-2018-02-19-action_1', '2018-02-12-2018-02-19-action_2',
126 | '2018-02-12-2018-02-19-action_3', '2018-02-12-2018-02-19-action_4',
127 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
128 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
129 |
130 |
131 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
132 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
133 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
134 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
135 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
136 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
137 | '2018-02-12-2018-02-19-action_4'],axis=1)
138 |
139 |
140 |
141 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
142 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
143 |
144 | test_head=test_set[['user_id','cate','shop_id']]
145 | train_head=train_set[['user_id','cate','shop_id']]
146 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
147 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
148 |
149 |
150 | # Prepare training/test arrays
151 | X_train = train_set.drop(['label'],axis=1).values
152 | y_train = train_set['label'].values
153 | X_test = test_set.values
154 |
155 | del train_set
156 | del test_set
157 |
158 | import gc
159 | gc.collect()
160 |
161 | # Model utility: stacking + bagging wrapper around LightGBM
162 | class SBBTree():
163 | """Stacking,Bootstap,Bagging----SBBTree"""
164 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
165 | """
166 | Initializes the SBBTree.
167 | Args:
168 | params : lgb params.
169 | stacking_num : number of stacking (k-fold) splits.
170 | bagging_num : number of bagging rounds.
171 | bagging_test_size : holdout fraction per bagging round.
172 | num_boost_round : max boosting rounds per model.
173 | early_stopping_rounds : early stopping patience.
174 | """
175 | self.params = params
176 | self.stacking_num = stacking_num
177 | self.bagging_num = bagging_num
178 | self.bagging_test_size = bagging_test_size
179 | self.num_boost_round = num_boost_round
180 | self.early_stopping_rounds = early_stopping_rounds
181 |
182 | self.model = lgb
183 | self.stacking_model = []
184 | self.bagging_model = []
185 |
186 | def fit(self, X, y):
187 | """ fit model. """
188 | if self.stacking_num > 1:
189 | layer_train = np.zeros((X.shape[0], 2))
190 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
191 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
192 | X_train = X[train_index]
193 | y_train = y[train_index]
194 | X_test = X[test_index]
195 | y_test = y[test_index]
196 |
197 | lgb_train = lgb.Dataset(X_train, y_train)
198 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
199 |
200 | gbm = lgb.train(self.params,
201 | lgb_train,
202 | num_boost_round=self.num_boost_round,
203 | valid_sets=lgb_eval,
204 | early_stopping_rounds=self.early_stopping_rounds,
205 | verbose_eval=300)
206 |
207 | self.stacking_model.append(gbm)
208 |
209 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
210 | layer_train[test_index, 1] = pred_y
211 |
212 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
213 | else:
214 | pass
215 | for bn in range(self.bagging_num):
216 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
217 |
218 | lgb_train = lgb.Dataset(X_train, y_train)
219 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
220 |
221 | gbm = lgb.train(self.params,
222 | lgb_train,
223 | num_boost_round=10000,
224 | valid_sets=lgb_eval,
225 | early_stopping_rounds=200,
226 | verbose_eval=300)
227 |
228 | self.bagging_model.append(gbm)
229 |
230 | def predict(self, X_pred):
231 | """ predict test data. """
232 | if self.stacking_num > 1:
233 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
234 | for sn,gbm in enumerate(self.stacking_model):
235 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
236 | test_pred[:, sn] = pred
237 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
238 | else:
239 | pass
240 | for bn,gbm in enumerate(self.bagging_model):
241 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
242 | if bn == 0:
243 | pred_out=pred
244 | else:
245 | pred_out+=pred
246 | return pred_out/self.bagging_num
247 |
248 | # LightGBM parameters
249 | params = {
250 | 'boosting_type': 'gbdt',
251 | 'objective': 'binary',
252 | 'metric': 'auc',
253 | 'learning_rate': 0.01,
254 | 'num_leaves': 2 ** 5 - 1,
255 | 'min_child_samples': 100,
256 | 'max_bin': 100,
257 | 'subsample': 0.8,
258 | 'subsample_freq': 1,
259 | 'colsample_bytree': 0.8,
260 | 'min_child_weight': 0,
261 | 'scale_pos_weight': 25,
262 | 'seed': 2019,
263 | 'nthread': 4,
264 | 'verbose': 0,
265 | }
266 |
267 | # Train and apply the model
268 | model = SBBTree(params=params,\
269 | stacking_num=5,\
270 | bagging_num=5,\
271 | bagging_test_size=0.33,\
272 | num_boost_round=10000,\
273 | early_stopping_rounds=200)
274 | model.fit(X_train, y_train)
275 | print('train is ok')
276 | y_predict = model.predict(X_test)
277 | print('pred test is ok')
278 | # y_train_predict = model.predict(X_train)
279 |
281 | test_head['pred_prob'] = y_predict
282 | test_head.to_csv('../feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
283 |
--------------------------------------------------------------------------------
/code/sbb4_train3.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 |
12 | df_train=pd.read_csv('../output/df_train.csv')
13 | df_test=pd.read_csv('../output/df_test.csv')
14 | ## Extra feature files can be loaded and merged here; if df_train changes, note in the output filename which feature files were used
15 | ### Feature flag: 1 = week-1 features only, 12 = weeks 1+2, 123 = weeks 1-3, 2 = week-2 features only
16 |
17 | # df_train_two=pd.read_csv('../output/df_train_two.csv')
18 | # df_test_two=pd.read_csv('../output/df_test_two.csv')
19 | # df_train = df_train.merge(df_train_two,on=['user_id','cate','shop_id'],how='left')
20 | # df_test = df_test.merge(df_test_two,on=['user_id','cate','shop_id'],how='left')
21 |
22 |
23 | df_user=pd.read_csv('../data/jdata_user.csv')
24 | df_comment=pd.read_csv('../data/jdata_comment.csv')
25 | df_shop=pd.read_csv('../data/jdata_shop.csv')
26 |
27 | # 1) Behavior data (jdata_action)
28 | jdata_action = pd.read_csv('../data/jdata_action.csv')
29 |
30 | # 3) Product data (jdata_product)
31 | jdata_product = pd.read_csv('../data/jdata_product.csv')
32 |
33 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
34 | label_flag = 3
35 | train_buy = jdata_data[(jdata_data['action_time']>='2018-03-26')
36 | & (jdata_data['action_time']<'2018-04-02')
37 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
38 | train_buy['label'] = 1
39 | # Candidate set: (user, cate, shop) triples with any action during '2018-02-26'-'2018-03-25', the four weeks before the label week
40 | win_size = 4  # candidate window in weeks: 2 = two weeks, 3 = three, 4 = four
41 | train_set = jdata_data[(jdata_data['action_time']>='2018-02-26')
42 | & (jdata_data['action_time']<'2018-03-26')][['user_id','cate','shop_id']].drop_duplicates()
43 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
44 |
45 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
46 |
47 |
48 | def mapper_year(x):
49 | if x is not np.nan:
50 | year = int(x[:4])
51 | return 2018 - year
52 |
53 |
54 | def mapper_month(x):
55 | if x is not np.nan:
56 | year = int(x[:4])
57 | month = int(x[5:7])
58 | return (2018 - year) * 12 + month
59 |
60 |
61 | def mapper_day(x):
62 | if x is not np.nan:
63 | year = int(x[:4])
64 | month = int(x[5:7])
65 | day = int(x[8:10])
66 | return (2018 - year) * 365 + month * 30 + day
67 |
68 |
69 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
70 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
71 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
72 |
73 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
74 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
75 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
76 |
77 |
78 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
79 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
80 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
81 |
82 | df_user['age'] = df_user['age'].fillna(5)
83 |
84 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
85 | print('check point ...')
86 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
87 |
88 | df_product_comment = df_product_comment.fillna(0)
89 |
90 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
91 |
92 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
93 |
94 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
95 |
96 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
97 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
98 |
99 |
100 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
101 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
102 |
103 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
104 |
105 |
106 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-19') & (jdata_data['action_time'] < '2018-04-16')][
107 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
108 |
109 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
110 |
111 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
112 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
113 |
114 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
115 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
116 |
117 |
118 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
119 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
120 |
121 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
122 |
123 |
124 |
125 | ### Keep six weeks of weekly action features: drop the weekly action columns outside this fold's feature window
126 | train_set = train_set.drop(['2018-04-02-2018-04-09-action_1', '2018-04-02-2018-04-09-action_2',
127 | '2018-04-02-2018-04-09-action_3', '2018-04-02-2018-04-09-action_4',
128 | '2018-03-26-2018-04-02-action_1', '2018-03-26-2018-04-02-action_2',
129 | '2018-03-26-2018-04-02-action_3', '2018-03-26-2018-04-02-action_4',
130 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
131 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
132 |
133 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
134 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
135 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
136 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
137 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
138 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
139 | '2018-02-12-2018-02-19-action_4'],axis=1)
140 |
141 |
142 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
143 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
144 |
145 |
146 |
147 | test_head=test_set[['user_id','cate','shop_id']]
148 | train_head=train_set[['user_id','cate','shop_id']]
149 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
150 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
151 | if(train_set.shape[1]-1==test_set.shape[1]):
152 | print('ok',train_set.shape[1])
153 | else:
154 | exit()
155 |
156 | # Prepare training/test arrays
157 | X_train = train_set.drop(['label'],axis=1).values
158 | y_train = train_set['label'].values
159 | X_test = test_set.values
160 |
161 | del train_set
162 | del test_set
163 |
164 | import gc
165 | gc.collect()
166 |
167 |
168 | # Model utility: stacking + bagging wrapper around LightGBM
169 | class SBBTree():
170 | """Stacking,Bootstap,Bagging----SBBTree"""
171 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
172 | """
173 | Initializes the SBBTree.
174 | Args:
175 | params : lgb params.
176 | stacking_num : number of stacking (k-fold) splits.
177 | bagging_num : number of bagging rounds.
178 | bagging_test_size : holdout fraction per bagging round.
179 | num_boost_round : max boosting rounds per model.
180 | early_stopping_rounds : early stopping patience.
181 | """
182 | self.params = params
183 | self.stacking_num = stacking_num
184 | self.bagging_num = bagging_num
185 | self.bagging_test_size = bagging_test_size
186 | self.num_boost_round = num_boost_round
187 | self.early_stopping_rounds = early_stopping_rounds
188 |
189 | self.model = lgb
190 | self.stacking_model = []
191 | self.bagging_model = []
192 |
193 | def fit(self, X, y):
194 | """ fit model. """
195 | if self.stacking_num > 1:
196 | layer_train = np.zeros((X.shape[0], 2))
197 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
198 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
199 | X_train = X[train_index]
200 | y_train = y[train_index]
201 | X_test = X[test_index]
202 | y_test = y[test_index]
203 |
204 | lgb_train = lgb.Dataset(X_train, y_train)
205 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
206 |
207 | gbm = lgb.train(self.params,
208 | lgb_train,
209 | num_boost_round=self.num_boost_round,
210 | valid_sets=lgb_eval,
211 | early_stopping_rounds=self.early_stopping_rounds,
212 | verbose_eval=300)
213 |
214 | self.stacking_model.append(gbm)
215 |
216 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
217 | layer_train[test_index, 1] = pred_y
218 |
219 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
220 | else:
221 | pass
222 | for bn in range(self.bagging_num):
223 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
224 |
225 | lgb_train = lgb.Dataset(X_train, y_train)
226 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
227 |
228 | gbm = lgb.train(self.params,
229 | lgb_train,
230 | num_boost_round=10000,
231 | valid_sets=lgb_eval,
232 | early_stopping_rounds=200,
233 | verbose_eval=300)
234 |
235 | self.bagging_model.append(gbm)
236 |
237 | def predict(self, X_pred):
238 | """ predict test data. """
239 | if self.stacking_num > 1:
240 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
241 | for sn,gbm in enumerate(self.stacking_model):
242 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
243 | test_pred[:, sn] = pred
244 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
245 | else:
246 | pass
247 | for bn,gbm in enumerate(self.bagging_model):
248 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
249 | if bn == 0:
250 | pred_out=pred
251 | else:
252 | pred_out+=pred
253 | return pred_out/self.bagging_num
254 |
255 | # LightGBM parameters
256 | params = {
257 | 'boosting_type': 'gbdt',
258 | 'objective': 'binary',
259 | 'metric': 'auc',
260 | 'learning_rate': 0.01,
261 | 'num_leaves': 2 ** 5 - 1,
262 | 'min_child_samples': 100,
263 | 'max_bin': 100,
264 | 'subsample': 0.8,
265 | 'subsample_freq': 1,
266 | 'colsample_bytree': 0.8,
267 | 'min_child_weight': 0,
268 | 'scale_pos_weight': 25,
269 | 'seed': 2019,
270 | 'nthread': 4,
271 | 'verbose': 0,
272 | }
273 |
274 | # Train and apply the model
275 | model = SBBTree(params=params,\
276 | stacking_num=5,\
277 | bagging_num=5,\
278 | bagging_test_size=0.33,\
279 | num_boost_round=10000,\
280 | early_stopping_rounds=200)
281 | model.fit(X_train, y_train)
282 | print('train is ok')
283 | y_predict = model.predict(X_test)
284 | print('pred test is ok')
285 | # y_train_predict = model.predict(X_train)
286 |
287 |
289 | test_head['pred_prob'] = y_predict
290 | test_head.to_csv('../feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
291 |
--------------------------------------------------------------------------------
/code/sbb_train1.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 |
12 | df_train=pd.read_csv('../output/df_train.csv')
13 | df_test=pd.read_csv('../output/df_test.csv')
14 | ## Extra feature files can be loaded and merged here; if df_train changes, note in the output filename which feature files were used
15 | ### Feature flag: 1 = week-1 features only, 12 = weeks 1+2, 123 = weeks 1-3, 2 = week-2 features only
16 |
17 | # df_train_two=pd.read_csv('../output/df_train_two.csv')
18 | # df_test_two=pd.read_csv('../output/df_test_two.csv')
19 | # df_train = df_train.merge(df_train_two,on=['user_id','cate','shop_id'],how='left')
20 | # df_test = df_test.merge(df_test_two,on=['user_id','cate','shop_id'],how='left')
21 |
22 |
23 | df_user=pd.read_csv('../data/jdata_user.csv')
24 | df_comment=pd.read_csv('../data/jdata_comment.csv')
25 | df_shop=pd.read_csv('../data/jdata_shop.csv')
26 |
27 | # 1) Behavior data (jdata_action)
28 | jdata_action = pd.read_csv('../data/jdata_action.csv')
29 |
30 | # 3) Product data (jdata_product)
31 | jdata_product = pd.read_csv('../data/jdata_product.csv')
32 |
33 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
34 | label_flag = 1
35 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-09')&
36 | (jdata_data['action_time']<'2018-04-16')& (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
37 | train_buy['label'] = 1
38 | # Candidate set: (user, cate, shop) triples with any action during '2018-03-19'-'2018-04-08', the three weeks before the label week
39 | win_size = 3  # candidate window in weeks: 2 = two weeks, 3 = three, 4 = four
40 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-19')&
41 | (jdata_data['action_time']<'2018-04-09')][['user_id','cate','shop_id']].drop_duplicates()
42 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
43 |
44 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
45 |
46 |
47 | def mapper_year(x):
48 | if x is not np.nan:
49 | year = int(x[:4])
50 | return 2018 - year
51 |
52 |
53 | def mapper_month(x):
54 | if x is not np.nan:
55 | year = int(x[:4])
56 | month = int(x[5:7])
57 | return (2018 - year) * 12 + month
58 |
59 |
60 | def mapper_day(x):
61 | if x is not np.nan:
62 | year = int(x[:4])
63 | month = int(x[5:7])
64 | day = int(x[8:10])
65 | return (2018 - year) * 365 + month * 30 + day
66 |
67 |
68 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
69 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
70 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
71 |
72 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
73 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
74 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
75 |
76 |
77 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
78 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
79 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
80 |
81 | df_user['age'] = df_user['age'].fillna(5)
82 |
83 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
84 | print('check point ...')
85 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
86 |
87 | df_product_comment = df_product_comment.fillna(0)
88 |
89 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
90 |
91 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
92 |
93 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
94 |
95 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
96 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
97 |
98 |
99 |
100 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
101 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
102 |
103 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
104 |
105 |
106 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-26') & (jdata_data['action_time'] < '2018-04-16')][
107 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
108 |
109 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
110 |
111 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
112 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
113 |
114 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
115 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
116 |
117 |
118 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
119 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
120 |
121 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
122 |
123 |
124 |
125 | ### Keep six weeks of weekly action features: drop the weekly action columns outside this fold's feature window
126 | train_set = train_set.drop(['2018-02-19-2018-02-26-action_1', '2018-02-19-2018-02-26-action_2',
127 | '2018-02-19-2018-02-26-action_3', '2018-02-19-2018-02-26-action_4',
128 | '2018-02-12-2018-02-19-action_1', '2018-02-12-2018-02-19-action_2',
129 | '2018-02-12-2018-02-19-action_3', '2018-02-12-2018-02-19-action_4',
130 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
131 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
132 |
133 |
134 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
135 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
136 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
137 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
138 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
139 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
140 | '2018-02-12-2018-02-19-action_4'],axis=1)
141 |
142 |
143 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
144 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
145 |
146 |
147 | test_head=test_set[['user_id','cate','shop_id']]
148 | train_head=train_set[['user_id','cate','shop_id']]
149 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
150 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
151 |
152 |
153 | # Prepare training/test arrays
154 | X_train = train_set.drop(['label'],axis=1).values
155 | y_train = train_set['label'].values
156 | X_test = test_set.values
157 |
158 | ### Optional min-max normalization, left over from an RNN experiment; LightGBM does not need it
159 | def max_min_scaler(data):
160 |     for x in data.columns:
161 |         data[x] = (data[x] - data[x].min()) / (data[x].max() - data[x].min())
162 |     return data
163 | # train_X = max_min_scaler(train_X)  # train_X / test_X are never defined in this script
164 | # test_X = max_min_scaler(test_X)
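# If scaling were actually wanted at this point, X_train / X_test are already numpy
# arrays, so an array version along these lines (illustrative sketch) would apply:
# mins = X_train.min(axis=0)
# rng = X_train.max(axis=0) - mins
# rng[rng == 0] = 1.0
# X_train = (X_train - mins) / rng
# X_test = (X_test - mins) / rng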
165 |
166 |
167 | # Model utility: stacking + bagging wrapper around LightGBM
168 | class SBBTree():
169 | """Stacking,Bootstap,Bagging----SBBTree"""
170 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
171 | """
172 | Initializes the SBBTree.
173 | Args:
174 | params : lgb params.
175 | stacking_num : number of stacking (k-fold) splits.
176 | bagging_num : number of bagging rounds.
177 | bagging_test_size : holdout fraction per bagging round.
178 | num_boost_round : max boosting rounds per model.
179 | early_stopping_rounds : early stopping patience.
180 | """
181 | self.params = params
182 | self.stacking_num = stacking_num
183 | self.bagging_num = bagging_num
184 | self.bagging_test_size = bagging_test_size
185 | self.num_boost_round = num_boost_round
186 | self.early_stopping_rounds = early_stopping_rounds
187 |
188 | self.model = lgb
189 | self.stacking_model = []
190 | self.bagging_model = []
191 |
192 | def fit(self, X, y):
193 | """ fit model. """
194 | if self.stacking_num > 1:
195 | layer_train = np.zeros((X.shape[0], 2))
196 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
197 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
198 | X_train = X[train_index]
199 | y_train = y[train_index]
200 | X_test = X[test_index]
201 | y_test = y[test_index]
202 |
203 | lgb_train = lgb.Dataset(X_train, y_train)
204 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
205 |
206 | gbm = lgb.train(self.params,
207 | lgb_train,
208 | num_boost_round=self.num_boost_round,
209 | valid_sets=lgb_eval,
210 | early_stopping_rounds=self.early_stopping_rounds,
211 | verbose_eval=300)
212 |
213 | self.stacking_model.append(gbm)
214 |
215 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
216 | layer_train[test_index, 1] = pred_y
217 |
218 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
219 | else:
220 | pass
221 | for bn in range(self.bagging_num):
222 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
223 |
224 | lgb_train = lgb.Dataset(X_train, y_train)
225 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
226 |
227 | gbm = lgb.train(self.params,
228 | lgb_train,
229 | num_boost_round=10000,
230 | valid_sets=lgb_eval,
231 | early_stopping_rounds=200,
232 | verbose_eval=300)
233 |
234 | self.bagging_model.append(gbm)
235 |
236 | def predict(self, X_pred):
237 | """ predict test data. """
238 | if self.stacking_num > 1:
239 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
240 | for sn,gbm in enumerate(self.stacking_model):
241 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
242 | test_pred[:, sn] = pred
243 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
244 | else:
245 | pass
246 | for bn,gbm in enumerate(self.bagging_model):
247 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
248 | if bn == 0:
249 | pred_out=pred
250 | else:
251 | pred_out+=pred
252 | return pred_out/self.bagging_num
253 |
254 | # LightGBM parameters
255 | params = {
256 | 'boosting_type': 'gbdt',
257 | 'objective': 'binary',
258 | 'metric': 'auc',
259 | 'learning_rate': 0.01,
260 | 'num_leaves': 2 ** 5 - 1,
261 | 'min_child_samples': 100,
262 | 'max_bin': 100,
263 | 'subsample': 0.8,
264 | 'subsample_freq': 1,
265 | 'colsample_bytree': 0.8,
266 | 'min_child_weight': 0,
267 | 'scale_pos_weight': 25,
268 | 'seed': 2019,
269 | 'nthread': 4,
270 | 'verbose': 0,
271 | }
272 |
273 | # Train and apply the model
275 | model = SBBTree(params=params,\
276 | stacking_num=5,\
277 | bagging_num=5,\
278 | bagging_test_size=0.33,\
279 | num_boost_round=10000,\
280 | early_stopping_rounds=200)
281 | model.fit(X_train, y_train)
282 | print('train is ok')
283 | y_predict = model.predict(X_test)
284 | print('pred test is ok')
285 | # y_train_predict = model.predict(X_train)
286 |
287 |
289 | test_head['pred_prob'] = y_predict
290 | test_head.to_csv('../feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
291 |
--------------------------------------------------------------------------------
/code/sbb_train2.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import lightgbm as lgb
9 | pd.set_option('display.max_columns', None)
10 |
11 |
12 | df_train=pd.read_csv('../output/df_train.csv')
13 | df_test=pd.read_csv('../output/df_test.csv')
14 | ## Extra feature files can be loaded and merged here; if df_train changes, note in the output filename which feature files were used
15 | ### Feature flag: 1 = week-1 features only, 12 = weeks 1+2, 123 = weeks 1-3, 2 = week-2 features only
16 |
17 | # df_train_two=pd.read_csv('../output/df_train_two.csv')
18 | # df_test_two=pd.read_csv('../output/df_test_two.csv')
19 | # df_train = df_train.merge(df_train_two,on=['user_id','cate','shop_id'],how='left')
20 | # df_test = df_test.merge(df_test_two,on=['user_id','cate','shop_id'],how='left')
21 |
22 | df_user=pd.read_csv('../data/jdata_user.csv')
23 | df_comment=pd.read_csv('../data/jdata_comment.csv')
24 | df_shop=pd.read_csv('../data/jdata_shop.csv')
25 |
26 | # 1) Behavior data (jdata_action)
27 | jdata_action = pd.read_csv('../data/jdata_action.csv')
28 |
29 | # 3) Product data (jdata_product)
30 | jdata_product = pd.read_csv('../data/jdata_product.csv')
31 |
32 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
33 | label_flag = 2
34 | train_buy = jdata_data[(jdata_data['action_time']>='2018-04-02')
35 | & (jdata_data['action_time']<'2018-04-09')
36 | & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
37 | train_buy['label'] = 1
38 | # Candidate set: (user, cate, shop) triples with any action during '2018-03-12'-'2018-04-01', the three weeks before the label week
39 | win_size = 3  # candidate window in weeks: 2 = two weeks, 3 = three, 4 = four
40 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-12')
41 | & (jdata_data['action_time']<'2018-04-02')][['user_id','cate','shop_id']].drop_duplicates()
42 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
43 |
44 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
45 |
46 |
47 | def mapper_year(x):
48 | if x is not np.nan:
49 | year = int(x[:4])
50 | return 2018 - year
51 |
52 |
53 | def mapper_month(x):
54 | if x is not np.nan:
55 | year = int(x[:4])
56 | month = int(x[5:7])
57 | return (2018 - year) * 12 + month
58 |
59 |
60 | def mapper_day(x):
61 | if x is not np.nan:
62 | year = int(x[:4])
63 | month = int(x[5:7])
64 | day = int(x[8:10])
65 | return (2018 - year) * 365 + month * 30 + day
66 |
67 |
68 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
69 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
70 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
71 |
72 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
73 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
74 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
75 |
76 |
77 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
78 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
79 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
80 |
81 | df_user['age'] = df_user['age'].fillna(5)
82 |
83 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
84 | print('check point ...')
85 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
86 |
87 | df_product_comment = df_product_comment.fillna(0)
88 |
89 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
90 |
91 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
92 |
93 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
94 |
95 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
96 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
97 |
98 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
99 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
100 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
101 |
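# Editor's note: fans_num and comments can be 0, so the two ratios above may
# come out as inf rather than NaN. LightGBM handles NaN natively, but inf can
# cause trouble downstream; one possible guard (a sketch, not in the original):
#
# train_set[['vip_prob', 'goods_prob']] = (
#     train_set[['vip_prob', 'goods_prob']].replace([np.inf, -np.inf], np.nan))
# (and likewise for test_set further below)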
102 |
103 |
104 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-26') & (jdata_data['action_time'] < '2018-04-16')][
105 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
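# Test candidates mirror the training candidates: triples with any action in
# the three weeks '2018-03-26'-'2018-04-15' immediately before the prediction
# week '2018-04-16'-'2018-04-22'.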
106 |
107 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
108 |
109 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
110 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
111 |
112 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
113 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
114 |
115 |
116 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
117 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
118 |
119 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
120 |
121 |
122 | ### Keep six weeks of features: 2018-02-19 to 2018-04-01 for train, 2018-03-05 to 2018-04-15 for test
123 | train_set = train_set.drop([
124 | '2018-04-02-2018-04-09-action_1', '2018-04-02-2018-04-09-action_2',
125 | '2018-04-02-2018-04-09-action_3', '2018-04-02-2018-04-09-action_4',
126 | '2018-02-12-2018-02-19-action_1', '2018-02-12-2018-02-19-action_2',
127 | '2018-02-12-2018-02-19-action_3', '2018-02-12-2018-02-19-action_4',
128 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
129 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
130 |
131 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
132 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
133 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
134 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
135 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
136 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
137 | '2018-02-12-2018-02-19-action_4'],axis=1)
138 |
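# Editor's sketch: the hard-coded drops above follow the weekly-window column
# pattern '<start>-<end>-action_<type>', so they could also be generated, e.g.:
#
# def weekly_cols(start, end):
#     # all four action-count columns for the window [start, end)
#     return ['{}-{}-action_{}'.format(start, end, t) for t in range(1, 5)]
#
# drop_cols = (weekly_cols('2018-04-02', '2018-04-09')
#              + weekly_cols('2018-02-12', '2018-02-19')
#              + weekly_cols('2018-02-05', '2018-02-12'))
# train_set = train_set.drop(drop_cols, axis=1)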
139 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
140 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
141 |
142 | test_head=test_set[['user_id','cate','shop_id']]
143 | train_head=train_set[['user_id','cate','shop_id']]
144 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
145 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
146 |
147 |
148 | # Prepare the data
149 | X_train = train_set.drop(['label'],axis=1).values
150 | y_train = train_set['label'].values
151 | X_test = test_set.values
152 |
153 |
154 | del train_set
155 | del test_set
156 |
157 |
158 | # Model utility
159 | class SBBTree():
160 | """Stacking,Bootstap,Bagging----SBBTree"""
161 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
162 | """
163 | Initializes the SBBTree.
164 | Args:
165 | params : lgb params.
166 | stacking_num : k-fold stacking folds.
167 | bagging_num : bootstrap num.
168 | bagging_test_size : bootstrap sample rate.
169 | num_boost_round : boost num.
170 | early_stopping_rounds : early_stopping_rounds.
171 | """
172 | self.params = params
173 | self.stacking_num = stacking_num
174 | self.bagging_num = bagging_num
175 | self.bagging_test_size = bagging_test_size
176 | self.num_boost_round = num_boost_round
177 | self.early_stopping_rounds = early_stopping_rounds
178 |
179 | self.model = lgb
180 | self.stacking_model = []
181 | self.bagging_model = []
182 |
183 | def fit(self, X, y):
184 | """ fit model. """
185 | if self.stacking_num > 1:
186 | layer_train = np.zeros((X.shape[0], 2))
187 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
188 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
189 | X_train = X[train_index]
190 | y_train = y[train_index]
191 | X_test = X[test_index]
192 | y_test = y[test_index]
193 |
194 | lgb_train = lgb.Dataset(X_train, y_train)
195 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
196 |
197 | gbm = lgb.train(self.params,
198 | lgb_train,
199 | num_boost_round=self.num_boost_round,
200 | valid_sets=lgb_eval,
201 | early_stopping_rounds=self.early_stopping_rounds,
202 | verbose_eval=300)
203 |
204 | self.stacking_model.append(gbm)
205 |
206 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
207 | layer_train[test_index, 1] = pred_y
208 |
209 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))  # append out-of-fold predictions as a stacking feature
210 | else:
211 | pass
212 | for bn in range(self.bagging_num):
213 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
214 |
215 | lgb_train = lgb.Dataset(X_train, y_train)
216 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
217 |
218 | gbm = lgb.train(self.params,
219 | lgb_train,
220 | num_boost_round=10000,
221 | valid_sets=lgb_eval,
222 | early_stopping_rounds=200,
223 | verbose_eval=300)
224 |
225 | self.bagging_model.append(gbm)
226 |
227 | def predict(self, X_pred):
228 | """ predict test data. """
229 | if self.stacking_num > 1:
230 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
231 | for sn,gbm in enumerate(self.stacking_model):
232 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
233 | test_pred[:, sn] = pred
234 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
235 | else:
236 | pass
237 | for bn,gbm in enumerate(self.bagging_model):
238 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
239 | if bn == 0:
240 | pred_out=pred
241 | else:
242 | pred_out+=pred
243 | return pred_out/self.bagging_num
244 |
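# What fit()/predict() do, step by step: fit() first runs stacking_num-fold
# stacking, appending the out-of-fold predictions to X as one extra feature
# column; it then trains bagging_num LightGBM models on different random
# train/valid splits of the augmented matrix. predict() rebuilds the stacking
# column for the test matrix by averaging the fold models, then averages the
# bagging models' outputs for the final probability.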
245 | # Model parameters
246 | params = {
247 | 'boosting_type': 'gbdt',
248 | 'objective': 'binary',
249 | 'metric': 'auc',
250 | 'learning_rate': 0.01,
251 | 'num_leaves': 2 ** 5 - 1,
252 | 'min_child_samples': 100,
253 | 'max_bin': 100,
254 | 'subsample': 0.8,
255 | 'subsample_freq': 1,
256 | 'colsample_bytree': 0.8,
257 | 'min_child_weight': 0,
258 | 'scale_pos_weight': 25,  # upweight the rare positive (purchase) class
259 | 'seed': 2019,
260 | 'nthread': 4,
261 | 'verbose': 0,
262 | }
263 |
264 | # Fit the model and predict
265 | model = SBBTree(params=params,\
266 | stacking_num=5,\
267 | bagging_num=5,\
268 | bagging_test_size=0.33,\
269 | num_boost_round=10000,\
270 | early_stopping_rounds=200)
271 | model.fit(X_train, y_train)
272 | print('train is ok')
273 | y_predict = model.predict(X_test)
274 | print('pred test is ok')
275 | # y_train_predict = model.predict(X_train)
276 |
277 |
278 | from tqdm import tqdm
279 | test_head['pred_prob'] = y_predict
280 | test_head.to_csv('../feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
281 |
--------------------------------------------------------------------------------
/code/sbb_train3.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | import pandas as pd
8 | import numpy as np
9 | import datetime
10 | from sklearn.metrics import f1_score
11 | from sklearn.model_selection import train_test_split
12 | from sklearn.model_selection import KFold
13 | from sklearn.model_selection import StratifiedKFold
14 | import lightgbm as lgb
15 | pd.set_option('display.max_columns', None)
16 |
17 |
18 | # In[2]:
19 |
20 |
21 | df_train=pd.read_csv('../output/df_train.csv')
22 | df_test=pd.read_csv('../output/df_test.csv')
23 | ## Optionally load and merge extra feature files here; if df_train changes, note in the output file name which feature files were used
24 | ### Feature flag: 1 = one-week features only, 12 = one- plus two-week features, 123 = adds three-week features, 2 = two-week features only
25 |
26 | # df_train_two=pd.read_csv('../output/df_train_two.csv')
27 | # df_test_two=pd.read_csv('../output/df_test_two.csv')
28 | # df_train = df_train.merge(df_train_two,on=['user_id','cate','shop_id'],how='left')
29 | # df_test = df_test.merge(df_test_two,on=['user_id','cate','shop_id'],how='left')
30 |
31 |
32 | # In[15]:
33 |
34 |
35 | df_user=pd.read_csv('../data/jdata_user.csv')
36 | df_comment=pd.read_csv('../data/jdata_comment.csv')
37 | df_shop=pd.read_csv('../data/jdata_shop.csv')
38 |
39 | # 1) Action data (jdata_action)
40 | jdata_action = pd.read_csv('../data/jdata_action.csv')
41 |
42 | # 3) Product data (jdata_product)
43 | jdata_product = pd.read_csv('../data/jdata_product.csv')
44 |
45 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
46 | label_flag = 3
47 | train_buy = jdata_data[(jdata_data['action_time']>='2018-03-26') & (jdata_data['action_time']<'2018-04-02') & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
48 | train_buy['label'] = 1
49 | # Candidate set: (user, cate, shop) triples with any action during '2018-03-05'-'2018-03-25', the three weeks before the label week '2018-03-26'-'2018-04-01'
50 | win_size = 3  # 2 for a two-week candidate window, 3 for three weeks
51 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-05') & (jdata_data['action_time']<'2018-03-26')][['user_id','cate','shop_id']].drop_duplicates()
52 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
53 |
54 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
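# The merges above build the supervised sample: a candidate (user, cate,
# shop) triple gets label=1 if it placed an order (type==2) during the label
# week '2018-03-26'-'2018-04-01' and label=0 otherwise, after which the
# df_train merge attaches the precomputed weekly behaviour features.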
55 |
56 |
57 | # In[17]:
58 |
59 |
60 | def mapper_year(x):
61 | if x is not np.nan:
62 | year = int(x[:4])
63 | return 2018 - year
64 |
65 |
66 | def mapper_month(x):
67 | if x is not np.nan:
68 | year = int(x[:4])
69 | month = int(x[5:7])
70 | return (2018 - year) * 12 + month
71 |
72 |
73 | def mapper_day(x):
74 | if x is not np.nan:
75 | year = int(x[:4])
76 | month = int(x[5:7])
77 | day = int(x[8:10])
78 | return (2018 - year) * 365 + month * 30 + day
79 |
80 |
81 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
82 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
83 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
84 |
85 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
86 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
87 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
88 |
89 |
90 | # In[25]:
91 |
92 |
93 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
94 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
95 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
96 |
97 | df_user['age'] = df_user['age'].fillna(5)
98 |
99 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
100 | print('check point ...')
101 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
102 |
103 | df_product_comment = df_product_comment.fillna(0)
104 |
105 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
106 |
107 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
108 |
109 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
110 |
111 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
112 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
113 |
114 |
115 | # In[30]:
116 |
117 |
118 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
119 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
120 |
121 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
122 |
123 |
124 | # In[35]:
125 |
126 |
127 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-26') & (jdata_data['action_time'] < '2018-04-16')][
128 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
129 |
130 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
131 |
132 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
133 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
134 |
135 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
136 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
137 |
138 |
139 | # In[36]:
140 |
141 |
142 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
143 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
144 |
145 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
146 |
147 |
148 | # In[40]:
149 |
150 |
151 | ### Keep six weeks of features: 2018-02-12 to 2018-03-25 for train, 2018-03-05 to 2018-04-15 for test
152 | train_set = train_set.drop(['2018-04-02-2018-04-09-action_1', '2018-04-02-2018-04-09-action_2',
153 | '2018-04-02-2018-04-09-action_3', '2018-04-02-2018-04-09-action_4',
154 | '2018-03-26-2018-04-02-action_1', '2018-03-26-2018-04-02-action_2',
155 | '2018-03-26-2018-04-02-action_3', '2018-03-26-2018-04-02-action_4',
156 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
157 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
158 |
159 |
160 | # In[41]:
161 |
162 |
163 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
164 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
165 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
166 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
167 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
168 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
169 | '2018-02-12-2018-02-19-action_4'],axis=1)
170 |
171 |
172 | # In[44]:
173 |
174 |
175 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
176 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
177 |
178 |
179 | # In[45]:
180 |
181 |
182 | test_head=test_set[['user_id','cate','shop_id']]
183 | train_head=train_set[['user_id','cate','shop_id']]
184 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
185 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
186 | if train_set.shape[1] - 1 == test_set.shape[1]:  # train should have exactly one extra column: the label
187 |     print('ok', train_set.shape[1])
188 | else:
189 |     exit()
190 | # In[46]:
191 |
192 |
193 |
194 | # Prepare the data
195 | X_train = train_set.drop(['label'],axis=1).values
196 | y_train = train_set['label'].values
197 | X_test = test_set.values
198 |
199 |
200 | # In[ ]:
201 |
202 |
203 | del train_set
204 | del test_set
205 |
206 |
207 | # In[48]:
208 |
209 |
210 | import gc
211 | gc.collect()
212 |
213 |
214 | # In[50]:
215 |
216 |
217 | # Model utility
218 | class SBBTree():
219 | """Stacking,Bootstap,Bagging----SBBTree"""
220 | """ author:Cookly 洪鹏飞 """
221 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
222 | """
223 | Initializes the SBBTree.
224 | Args:
225 | params : lgb params.
226 | stacking_num : k-fold stacking folds.
227 | bagging_num : bootstrap num.
228 | bagging_test_size : bootstrap sample rate.
229 | num_boost_round : boost num.
230 | early_stopping_rounds : early_stopping_rounds.
231 | """
232 | self.params = params
233 | self.stacking_num = stacking_num
234 | self.bagging_num = bagging_num
235 | self.bagging_test_size = bagging_test_size
236 | self.num_boost_round = num_boost_round
237 | self.early_stopping_rounds = early_stopping_rounds
238 |
239 | self.model = lgb
240 | self.stacking_model = []
241 | self.bagging_model = []
242 |
243 | def fit(self, X, y):
244 | """ fit model. """
245 | if self.stacking_num > 1:
246 | layer_train = np.zeros((X.shape[0], 2))
247 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
248 | for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):
249 | X_train = X[train_index]
250 | y_train = y[train_index]
251 | X_test = X[test_index]
252 | y_test = y[test_index]
253 |
254 | lgb_train = lgb.Dataset(X_train, y_train)
255 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
256 |
257 | gbm = lgb.train(self.params,
258 | lgb_train,
259 | num_boost_round=self.num_boost_round,
260 | valid_sets=lgb_eval,
261 | early_stopping_rounds=self.early_stopping_rounds,
262 | verbose_eval=300)
263 |
264 | self.stacking_model.append(gbm)
265 |
266 | pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)
267 | layer_train[test_index, 1] = pred_y
268 |
269 | X = np.hstack((X, layer_train[:,1].reshape((-1,1))))
270 | else:
271 | pass
272 | for bn in range(self.bagging_num):
273 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
274 |
275 | lgb_train = lgb.Dataset(X_train, y_train)
276 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
277 |
278 | gbm = lgb.train(self.params,
279 | lgb_train,
280 | num_boost_round=10000,
281 | valid_sets=lgb_eval,
282 | early_stopping_rounds=200,
283 | verbose_eval=300)
284 |
285 | self.bagging_model.append(gbm)
286 |
287 | def predict(self, X_pred):
288 | """ predict test data. """
289 | if self.stacking_num > 1:
290 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
291 | for sn,gbm in enumerate(self.stacking_model):
292 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
293 | test_pred[:, sn] = pred
294 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))
295 | else:
296 | pass
297 | for bn,gbm in enumerate(self.bagging_model):
298 | pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)
299 | if bn == 0:
300 | pred_out=pred
301 | else:
302 | pred_out+=pred
303 | return pred_out/self.bagging_num
304 |
305 | # Model parameters
306 | params = {
307 | 'boosting_type': 'gbdt',
308 | 'objective': 'binary',
309 | 'metric': 'auc',
310 | 'learning_rate': 0.01,
311 | 'num_leaves': 2 ** 5 - 1,
312 | 'min_child_samples': 100,
313 | 'max_bin': 100,
314 | 'subsample': 0.8,
315 | 'subsample_freq': 1,
316 | 'colsample_bytree': 0.8,
317 | 'min_child_weight': 0,
318 | 'scale_pos_weight': 25,
319 | 'seed': 2019,
320 | 'nthread': 4,
321 | 'verbose': 0,
322 | }
323 |
324 | # Fit the model and predict
325 | model = SBBTree(params=params, stacking_num=5, bagging_num=5, bagging_test_size=0.33, num_boost_round=10000, early_stopping_rounds=200)
326 | model.fit(X_train, y_train)
327 | print('train is ok')
328 | y_predict = model.predict(X_test)
329 | print('pred test is ok')
330 | # y_train_predict = model.predict(X_train)
331 |
332 |
333 | # In[ ]:
334 |
335 |
336 | from tqdm import tqdm
337 | test_head['pred_prob'] = y_predict
338 | test_head.to_csv('../feature/'+str(win_size)+'_sbb_get_'+str(label_flag)+'_test.csv',index=False)
339 |
340 |
341 |
--------------------------------------------------------------------------------
/code/xgb_model/xgb_train3.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from datetime import datetime
4 | from sklearn.metrics import f1_score
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.model_selection import KFold
7 | from sklearn.model_selection import StratifiedKFold
8 | import xgboost as xgb
9 | pd.set_option('display.max_columns', None)
10 |
11 |
12 | ## Downcast columns to reduce memory when reading files; adapted from 鱼佬's Tencent-competition code
13 | def reduce_mem_usage(df, verbose=True):
14 | numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
15 | start_mem = df.memory_usage().sum() / 1024**2
16 | for col in df.columns:
17 | col_type = df[col].dtypes
18 | if col_type in numerics:
19 | c_min = df[col].min()
20 | c_max = df[col].max()
21 | if str(col_type)[:3] == 'int':
22 | if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
23 | df[col] = df[col].astype(np.int8)
24 | elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
25 | df[col] = df[col].astype(np.int16)
26 | elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
27 | df[col] = df[col].astype(np.int32)
28 | elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
29 | df[col] = df[col].astype(np.int64)
30 | else:
31 | if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
32 | df[col] = df[col].astype(np.float16)
33 | elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
34 | df[col] = df[col].astype(np.float32)
35 | else:
36 | df[col] = df[col].astype(np.float64)
37 | end_mem = df.memory_usage().sum() / 1024**2
38 | if verbose:
39 | print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
40 | return df
41 |
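# Editor's note: the float16 branch keeps only about 3 significant decimal
# digits and overflows above ~65504, which can distort ratio features and
# large IDs. A more conservative variant skips float16 (a sketch, not in the
# original):
#
# if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
#     df[col] = df[col].astype(np.float32)
# else:
#     df[col] = df[col].astype(np.float64)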
42 | df_train = reduce_mem_usage(pd.read_csv('./df_train.csv'))
43 | df_test = reduce_mem_usage(pd.read_csv('./df_test.csv'))
44 | ## Optionally load and merge extra feature files here; if df_train changes, note in the output file name which feature files were used
45 | ### Feature flag: 1 = one-week features only, 12 = one- plus two-week features, 123 = adds three-week features, 2 = two-week features only
46 |
47 |
48 | df_user=reduce_mem_usage(pd.read_csv('./jdata_user.csv'))
49 | df_comment=reduce_mem_usage(pd.read_csv('./jdata_comment.csv'))
50 | df_shop=reduce_mem_usage(pd.read_csv('./jdata_shop.csv'))
51 |
52 | # 1) Action data (jdata_action)
53 | jdata_action = reduce_mem_usage(pd.read_csv('./jdata_action.csv'))
54 |
55 | # 3) Product data (jdata_product)
56 | jdata_product = reduce_mem_usage(pd.read_csv('./jdata_product.csv'))
57 |
58 | jdata_data = jdata_action.merge(jdata_product,on=['sku_id'])
59 | label_flag = 3
60 | train_buy = jdata_data[(jdata_data['action_time']>='2018-03-26') & (jdata_data['action_time']<'2018-04-02') & (jdata_data['type']==2)][['user_id','cate','shop_id']].drop_duplicates()
61 | train_buy['label'] = 1
62 | # Candidate set: (user, cate, shop) triples with any action during '2018-03-05'-'2018-03-25', the three weeks before the label week '2018-03-26'-'2018-04-01'
63 | win_size = 3  # 2 for a two-week candidate window, 3 for three weeks
64 | train_set = jdata_data[(jdata_data['action_time']>='2018-03-05') & (jdata_data['action_time']<'2018-03-26')][['user_id','cate','shop_id']].drop_duplicates()
65 | train_set = train_set.merge(train_buy,on=['user_id','cate','shop_id'],how='left').fillna(0)
66 |
67 | train_set = train_set.merge(df_train,on=['user_id','cate','shop_id'],how='left')
68 |
69 |
70 | def mapper_year(x):
71 | if x is not np.nan:
72 | year = int(x[:4])
73 | return 2018 - year
74 |
75 |
76 | def mapper_month(x):
77 | if x is not np.nan:
78 | year = int(x[:4])
79 | month = int(x[5:7])
80 | return (2018 - year) * 12 + month
81 |
82 |
83 | def mapper_day(x):
84 | if x is not np.nan:
85 | year = int(x[:4])
86 | month = int(x[5:7])
87 | day = int(x[8:10])
88 | return (2018 - year) * 365 + month * 30 + day
89 |
90 |
91 | df_user['user_reg_year'] = df_user['user_reg_tm'].apply(lambda x: mapper_year(x))
92 | df_user['user_reg_month'] = df_user['user_reg_tm'].apply(lambda x: mapper_month(x))
93 | df_user['user_reg_day'] = df_user['user_reg_tm'].apply(lambda x: mapper_day(x))
94 |
95 | df_shop['shop_reg_year'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_year(x))
96 | df_shop['shop_reg_month'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_month(x))
97 | df_shop['shop_reg_day'] = df_shop['shop_reg_tm'].apply(lambda x: mapper_day(x))
98 |
99 |
100 |
101 | df_shop['shop_reg_year'] = df_shop['shop_reg_year'].fillna(1)
102 | df_shop['shop_reg_month'] = df_shop['shop_reg_month'].fillna(21)
103 | df_shop['shop_reg_day'] = df_shop['shop_reg_day'].fillna(101)
104 |
105 | df_user['age'] = df_user['age'].fillna(5)
106 |
107 | df_comment = df_comment.groupby(['sku_id'], as_index=False).sum()
108 | print('check point ...')
109 | df_product_comment = pd.merge(jdata_product, df_comment, on='sku_id', how='left')
110 |
111 | df_product_comment = df_product_comment.fillna(0)
112 |
113 | df_product_comment = df_product_comment.groupby(['shop_id'], as_index=False).sum()
114 |
115 | df_product_comment = df_product_comment.drop(['sku_id', 'brand', 'cate'], axis=1)
116 |
117 | df_shop_product_comment = pd.merge(df_shop, df_product_comment, how='left', on='shop_id')
118 |
119 | train_set = pd.merge(train_set, df_user, how='left', on='user_id')
120 | train_set = pd.merge(train_set, df_shop_product_comment, on='shop_id', how='left')
121 |
122 |
123 |
124 | train_set['vip_prob'] = train_set['vip_num']/train_set['fans_num']
125 | train_set['goods_prob'] = train_set['good_comments']/train_set['comments']
126 |
127 | train_set = train_set.drop(['comments','good_comments','bad_comments'],axis=1)
128 |
129 |
130 |
131 |
132 | test_set = jdata_data[(jdata_data['action_time'] >= '2018-03-26') & (jdata_data['action_time'] < '2018-04-16')][
133 | ['user_id', 'cate', 'shop_id']].drop_duplicates()
134 |
135 | test_set = test_set.merge(df_test, on=['user_id', 'cate', 'shop_id'], how='left')
136 |
137 | test_set = pd.merge(test_set, df_user, how='left', on='user_id')
138 | test_set = pd.merge(test_set, df_shop_product_comment, on='shop_id', how='left')
139 |
140 | train_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
141 | test_set.drop(['user_reg_tm', 'shop_reg_tm'], axis=1, inplace=True)
142 |
143 |
144 |
145 | test_set['vip_prob'] = test_set['vip_num']/test_set['fans_num']
146 | test_set['goods_prob'] = test_set['good_comments']/test_set['comments']
147 |
148 | test_set = test_set.drop(['comments','good_comments','bad_comments'],axis=1)
149 |
150 |
151 |
152 | ### Keep six weeks of features: 2018-02-12 to 2018-03-25 for train, 2018-03-05 to 2018-04-15 for test
153 | train_set = train_set.drop(['2018-04-02-2018-04-09-action_1', '2018-04-02-2018-04-09-action_2',
154 | '2018-04-02-2018-04-09-action_3', '2018-04-02-2018-04-09-action_4',
155 | '2018-03-26-2018-04-02-action_1', '2018-03-26-2018-04-02-action_2',
156 | '2018-03-26-2018-04-02-action_3', '2018-03-26-2018-04-02-action_4',
157 | '2018-02-05-2018-02-12-action_1', '2018-02-05-2018-02-12-action_2',
158 | '2018-02-05-2018-02-12-action_3', '2018-02-05-2018-02-12-action_4'],axis=1)
159 |
160 |
161 | test_set = test_set.drop(['2018-02-26-2018-03-05-action_1',
162 | '2018-02-26-2018-03-05-action_2', '2018-02-26-2018-03-05-action_3',
163 | '2018-02-26-2018-03-05-action_4', '2018-02-19-2018-02-26-action_1',
164 | '2018-02-19-2018-02-26-action_2', '2018-02-19-2018-02-26-action_3',
165 | '2018-02-19-2018-02-26-action_4', '2018-02-12-2018-02-19-action_1',
166 | '2018-02-12-2018-02-19-action_2', '2018-02-12-2018-02-19-action_3',
167 | '2018-02-12-2018-02-19-action_4'],axis=1)
168 |
169 |
170 | train_set.rename(columns={'cate_x':'cate'}, inplace = True)
171 | test_set.rename(columns={'cate_x':'cate'}, inplace = True)
172 |
173 |
174 |
175 | test_head=test_set[['user_id','cate','shop_id']]
176 | train_head=train_set[['user_id','cate','shop_id']]
177 | test_set=test_set.drop(['user_id','cate','shop_id'],axis=1)
178 | train_set=train_set.drop(['user_id','cate','shop_id'],axis=1)
179 |
180 |
181 | # Prepare the data
182 | X_train = train_set.drop(['label'],axis=1).values
183 | y_train = train_set['label'].values
184 | X_test = test_set.values
185 |
186 |
187 | del train_set
188 | del test_set
189 |
190 |
191 |
192 |
193 | # Model utility
194 | class SBBTree():
195 | """Stacking,Bootstap,Bagging----SBBTree"""
196 |
197 | def __init__(self, params, stacking_num, bagging_num, bagging_test_size, num_boost_round, early_stopping_rounds):
198 | """
199 | Initializes the SBBTree.
200 | Args:
201 | params : xgb params.
202 | stacking_num : k-fold stacking folds.
203 | bagging_num : bootstrap num.
204 | bagging_test_size : bootstrap sample rate.
205 | num_boost_round : boost num.
206 | early_stopping_rounds : early_stopping_rounds.
207 | """
208 | self.params = params
209 | self.stacking_num = stacking_num
210 | self.bagging_num = bagging_num
211 | self.bagging_test_size = bagging_test_size
212 | self.num_boost_round = num_boost_round
213 | self.early_stopping_rounds = early_stopping_rounds
214 |
215 | self.model = xgb
216 | self.stacking_model = []
217 | self.bagging_model = []
218 |
219 | def fit(self, X, y):
220 | """ fit model. """
221 | if self.stacking_num > 1:
222 | layer_train = np.zeros((X.shape[0], 2))
223 | self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)
224 | for k, (train_index, test_index) in enumerate(self.SK.split(X, y)):
225 | print('fold_{}'.format(k))
226 | X_train = X[train_index]
227 | y_train = y[train_index]
228 | X_test = X[test_index]
229 | y_test = y[test_index]
230 |
231 | xgb_train = xgb.DMatrix(X_train, y_train)
232 | xgb_eval = xgb.DMatrix(X_test, y_test)
233 | watchlist = [(xgb_train, 'train'), (xgb_eval, 'valid')]
234 |
235 | xgb_model = xgb.train(dtrain=xgb_train,
236 | num_boost_round=self.num_boost_round,
237 | evals=watchlist,
238 | early_stopping_rounds=self.early_stopping_rounds,
239 | verbose_eval=300,
240 | params=self.params)
241 | self.stacking_model.append(xgb_model)
242 |
243 | pred_y = xgb_model.predict(xgb.DMatrix(X_test), ntree_limit=xgb_model.best_ntree_limit)
244 | layer_train[test_index, 1] = pred_y
245 |
246 | X = np.hstack((X, layer_train[:, 1].reshape((-1, 1))))
247 | else:
248 | pass
249 | for bn in range(self.bagging_num):
250 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)
251 |
252 | xgb_train = xgb.DMatrix(X_train, y_train)
253 | xgb_eval = xgb.DMatrix(X_test, y_test)
254 | watchlist = [(xgb_train, 'train'), (xgb_eval, 'valid')]
255 |
256 | xgb_model = xgb.train(dtrain=xgb_train,
257 | num_boost_round=10000,
258 | evals=watchlist,
259 | early_stopping_rounds=200,
260 | verbose_eval=300,
261 | params=self.params)
262 |
263 | self.bagging_model.append(xgb_model)
264 |
265 | def predict(self, X_pred):
266 | """ predict test data. """
267 | if self.stacking_num > 1:
268 | test_pred = np.zeros((X_pred.shape[0], self.stacking_num))
269 | for sn, gbm in enumerate(self.stacking_model):
270 | pred = gbm.predict(xgb.DMatrix(X_pred), ntree_limit=gbm.best_ntree_limit)
271 | test_pred[:, sn] = pred
272 | X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1, 1))))
273 | else:
274 | pass
275 | for bn, gbm in enumerate(self.bagging_model):
276 | pred = gbm.predict(xgb.DMatrix(X_pred), ntree_limit=gbm.best_ntree_limit)
277 | if bn == 0:
278 | pred_out = pred
279 | else:
280 | pred_out += pred
281 | return pred_out / self.bagging_num
282 |
283 |
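# Same stacking + bagging scheme as the LightGBM scripts, ported to the native
# xgboost API: DMatrix wraps the data, the watchlist drives early stopping,
# and best_ntree_limit truncates predictions at the early-stopped round.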
284 | # Model parameters
285 | params = {
286 | 'booster': 'gbtree',
287 | 'tree_method': 'exact',
288 | 'eta': 0.01,
289 | 'max_depth': 7,
290 | 'gamma': 0.1,
291 | "min_child_weight": 1.1, # 6 0.06339878
292 | 'subsample': 0.7,
293 | 'colsample_bytree': 0.7, # 0.06349307
294 | 'colsample_bylevel': 0.7,
295 | 'objective': 'binary:logistic',
296 | 'eval_metric': 'auc',
297 | 'silent': True,
298 | 'lambda': 3, # 0.06365710
299 | 'nthread': 24,
300 | 'seed': 42}
301 |
302 | # Fit the model and predict
303 | model = SBBTree(params=params, \
304 | stacking_num=5, \
305 | bagging_num=5, \
306 | bagging_test_size=0.33, \
307 | num_boost_round=10000, \
308 | early_stopping_rounds=200)
309 | model.fit(X_train, y_train)
310 |
311 |
312 | print('train is ok')
313 | y_predict = model.predict(X_test)
314 | print('pred test is ok')
315 | # y_train_predict = model.predict(X_train)
316 |
317 |
318 | # In[ ]:
319 |
320 |
321 | from tqdm import tqdm
322 | test_head['pred_prob'] = y_predict
323 | test_head.to_csv('feature/'+str(win_size)+'_xgb_get_'+str(label_flag)+'_test.csv',index=False)
324 |
--------------------------------------------------------------------------------
/picture/huachuang.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcxanhui/JDATA-2019/7019cf545a88cb14c55d5f4198fa5251a291648c/picture/huachuang.PNG
--------------------------------------------------------------------------------
/picture/time_series.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcxanhui/JDATA-2019/7019cf545a88cb14c55d5f4198fa5251a291648c/picture/time_series.PNG
--------------------------------------------------------------------------------