├── 图片资源
│   ├── 图片1.png
│   ├── 图片2.png
│   ├── 图片3.png
│   ├── 图片4.png
│   ├── 图片5.png
│   ├── 图片6.png
│   ├── 图片7.png
│   ├── 图片8.png
│   ├── 图片9.png
│   ├── .DS_Store
│   ├── 图片10.png
│   ├── 图片11.png
│   ├── 图片12.png
│   ├── 图片13.png
│   ├── 图片14.png
│   └── 图片15.jpg
├── .ipynb_checkpoints
│   ├── 特征工程-checkpoint.ipynb
│   ├── 测试数据特征处理与填充-checkpoint.ipynb
│   ├── 数据探索-checkpoint.ipynb
│   └── 预测建模-checkpoint.ipynb
├── README.md
├── 测试数据特征处理与填充.ipynb
├── 预测建模.ipynb
└── 特征工程.ipynb
/图片资源/图片1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片1.png
--------------------------------------------------------------------------------
/图片资源/图片2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片2.png
--------------------------------------------------------------------------------
/图片资源/图片3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片3.png
--------------------------------------------------------------------------------
/图片资源/图片4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片4.png
--------------------------------------------------------------------------------
/图片资源/图片5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片5.png
--------------------------------------------------------------------------------
/图片资源/图片6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片6.png
--------------------------------------------------------------------------------
/图片资源/图片7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片7.png
--------------------------------------------------------------------------------
/图片资源/图片8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片8.png
--------------------------------------------------------------------------------
/图片资源/图片9.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片9.png
--------------------------------------------------------------------------------
/图片资源/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/.DS_Store
--------------------------------------------------------------------------------
/图片资源/图片10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片10.png
--------------------------------------------------------------------------------
/图片资源/图片11.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片11.png
--------------------------------------------------------------------------------
/图片资源/图片12.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片12.png
--------------------------------------------------------------------------------
/图片资源/图片13.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片13.png
--------------------------------------------------------------------------------
/图片资源/图片14.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片14.png
--------------------------------------------------------------------------------
/图片资源/图片15.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片15.jpg
--------------------------------------------------------------------------------
/.ipynb_checkpoints/特征工程-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [],
3 | "metadata": {},
4 | "nbformat": 4,
5 | "nbformat_minor": 5
6 | }
7 |
--------------------------------------------------------------------------------
/.ipynb_checkpoints/测试数据特征处理与填充-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [],
3 | "metadata": {},
4 | "nbformat": 4,
5 | "nbformat_minor": 5
6 | }
7 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
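# Tmall Repeat Buyer Prediction Competition: Technical Report

Team members: 李航程, 姚远舟, 黄建辉, 刘杭达

## 1. Problem Description

### 1.1 Background

Merchants sometimes run large promotions or issue coupons on specific dates, such as Boxing Day, Black Friday, or Double 11 (November 11), to attract shoppers. Many of the buyers attracted this way are one-time customers, so these promotions may do little for long-term sales growth. To address this, merchants need to identify which shoppers can be converted into repeat buyers. By targeting these potentially loyal customers, merchants can substantially cut promotion costs and improve return on investment.

### 1.2 Data Description

Four data files are provided: training data, test data, user profiles, and user activity logs. The training data contains user-merchant pairs together with a label indicating whether the user is a repeat buyer of that merchant. The user profile dataset gives the age range and gender for each user id. The user activity log records each user's actions at different shops over the past six months, including the action type and timestamp. The test dataset consists of user-merchant pairs for which we must predict whether the user is a repeat buyer of the merchant.

### 1.3 Task

Given these four data files, the test data lists (user id, merchant id) pairs, and the task is to predict the probability that each user becomes a repeat buyer of the corresponding merchant.

## 2. Data Exploration

### 2.1 Loading the Datasets

```python
import pandas as pd

# Load the four competition files (paths are relative to the notebook).
train_data = pd.read_csv("../DataMining/data_format1/train_format1.csv")
test_data = pd.read_csv("../DataMining/data_format1/test_format1.csv")
user_info = pd.read_csv("../DataMining/data_format1/user_info_format1.csv")
user_log = pd.read_csv("../DataMining/data_format1/user_log_format1.csv")
```

### 2.2 Missing Rates of Age and Gender in the User Profiles

```python
# Share of rows whose age_range is NaN
(user_info.shape[0] - user_info["age_range"].count()) / user_info.shape[0]
# Share of rows whose gender is NaN
(user_info.shape[0] - user_info["gender"].count()) / user_info.shape[0]
```

The NaN rate is 0.52% for age and 1.5% for gender. These rates are small, so NaN values should have little effect on the final classification, and we later feed them to the model directly as a feature value (encoded as -1). These figures count NaN only: the exploration notebook also treats age_range == 0 and gender == 2 as unknown, which gives 95,131 users with unknown age and 16,862 with unknown gender out of 424,170.
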
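A minimal sketch of the missing-rate calculation on a toy DataFrame (not the competition data). It computes the NaN-only rate and, as an assumption based on the exploration notebook in this repository, a second rate that also counts age_range == 0 and gender == 2 as unknown. The toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for user_info; the values are illustrative only.
user_info = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "age_range": [3.0, 0.0, np.nan, 4.0, 2.0],
    "gender": [0.0, 1.0, 2.0, np.nan, 0.0],
})

# NaN-only missing rate per column.
nan_rate = user_info[["age_range", "gender"]].isna().mean()

# Missing rate when the "unknown" codes are also counted (assumption:
# age_range == 0 and gender == 2 mean unknown).
unknown_age = user_info["age_range"].isna() | (user_info["age_range"] == 0)
unknown_gender = user_info["gender"].isna() | (user_info["gender"] == 2)

print("age:   ", nan_rate["age_range"], unknown_age.mean())
print("gender:", nan_rate["gender"], unknown_gender.mean())
```

Comparing the two numbers per column shows how much of the missing information is hidden behind sentinel codes rather than NaN.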
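### 2.3 Missing Values in the User Activity Log

```python
user_log.isna().sum()
```

In the user activity log, only brand_id has missing values (91,015 rows); every other column is complete.

### 2.4 Summary Statistics for the User Profiles and Activity Log

```python
user_info.describe()
```

The summary statistics for the user profiles suggest an average user age of around 30 with a large variance, and that buyers are predominantly female. Note that age_range is a coded bucket (mean code about 2.93), not an age in years.

```python
user_log.describe()
```

### 2.5 Label Distribution

The classes are imbalanced: non-repeat buyers far outnumber repeat buyers, so some measure is needed to address the imbalance.
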
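The class balance can be checked with a few lines of pandas. The sketch below uses a made-up toy frame with the same columns as the training data. The negative-to-positive ratio it computes is one common choice for a positive-class weight (for example `scale_pos_weight` in XGBoost). The report does not say which measure, if any, was actually used.

```python
import pandas as pd

# Toy stand-in for train_data; labels are illustrative only.
train_data = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "merchant_id": [10, 10, 11, 11, 12, 12, 13, 13, 14, 14],
    "label": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

counts = train_data["label"].value_counts()
label_share = counts / counts.sum()          # fraction of each class
pos_weight = counts.loc[0] / counts.loc[1]   # negatives per positive

print(label_share)
print("negatives per positive:", pos_weight)
```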
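### 2.6 Plotting the Top 5 Shops

```python
import matplotlib.pyplot as plt
import seaborn as sns
# The five merchants with the most training samples
train_data.merchant_id.value_counts().head(5)
train_data_merchant = train_data.copy()
train_data_merchant["TOP5"] = train_data_merchant["merchant_id"].map(lambda x: 1 if x in [4044, 3828, 4173, 1102, 4976] else 0)
train_data_merchant = train_data_merchant[train_data_merchant["TOP5"] == 1]
plt.figure(figsize=(8, 6))
plt.title("Merchant VS Label")
ax = sns.countplot(x="merchant_id", hue="label", data=train_data_merchant)
```

A count plot of the label distribution for the five largest shops shows that these five shops account for nearly half of the data, and that in each of them repeat purchases are far rarer than non-repeat purchases.

### 2.7 Plotting Merchants' Repeat-Purchase Rates

```python
from scipy import stats

# Repeat-purchase rate per merchant
train_data.groupby(["merchant_id"])["label"].mean()
# Keep merchants whose rate lies in (0, 1]
merchant_repeat_buy = [rate for rate in train_data.groupby(["merchant_id"])["label"].mean() if rate <= 1 and rate > 0]
plt.figure(figsize=(8, 4))
# Left: distribution with a fitted normal curve (sns.distplot is deprecated in recent seaborn releases)
ax = plt.subplot(1, 2, 1)
sns.distplot(merchant_repeat_buy, fit=stats.norm)
# Right: normal probability (Q-Q) plot
ax = plt.subplot(1, 2, 2)
res = stats.probplot(merchant_repeat_buy, plot=plt)
```

Because the features are categorical rather than continuous, interpolation cannot be used to fill the gaps. Since the missing rate is also small, we treat missingness as its own category and fill missing values with -1.
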
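A minimal sketch of the -1 fill described above, on toy data. The column names follow the competition files, but the frames and values are made up. The left merge is included only to show that users without a profile row also end up in the -1 category.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the training pairs and the user profiles.
train = pd.DataFrame({"user_id": [1, 2, 3], "merchant_id": [7, 7, 8], "label": [0, 1, 0]})
user_info = pd.DataFrame({"user_id": [1, 2], "age_range": [3.0, np.nan], "gender": [0.0, 1.0]})

# A left merge keeps every training pair; unmatched users get NaN.
df = train.merge(user_info, on="user_id", how="left")

# Treat "missing" as its own category by filling with -1.
df[["age_range", "gender"]] = df[["age_range", "gender"]].fillna(-1).astype(int)
print(df)
```

Tree-based models can split on the -1 value directly. For a linear model such as logistic regression, -1 is read as a numeric level, so a separate indicator column may be a better fit there.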
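## 3. Feature Engineering

### 3.1 Merging the Datasets

1. Merge the training set df_train with the user profile table user_info_format1.csv on user_id to obtain the new df_train.

```python
df_train = pd.merge(df_train,user_info,on="user_id",how="left")
```

2. Merge df_train with the user activity log user_log_format1.csv (here the aggregated table total_logs_temp) on user_id and merchant_id to obtain the new df_train.

```python
df_train = pd.merge(df_train,total_logs_temp,on=["user_id","merchant_id"],how="left")
```

### 3.2 Feature Generation

1. Features generated by simple aggregation
+ Total number of items each user interacted with at each merchant (repeats counted). ***total_item_id***
+ Number of distinct items each user interacted with at each merchant. ***unique_item_id***
+ Number of distinct categories each user interacted with at each merchant. ***total_cat_id***
+ Number of days on which each user interacted with each merchant. ***total_time_temp***
+ Number of clicks by each user at each merchant. ***clicks***
+ Number of add-to-cart actions by each user at each merchant. ***shopping_cart***
+ Number of purchases by each user at each merchant. ***purchases***
+ Number of add-to-favourites actions by each user at each merchant. ***favourites***

2. Features generated by analysis

+ Monthly usage counts per user

```python
# time_stamp is assumed to be in mmdd form (e.g. 1111), so // 100 yields the month
user_log['month'] = user_log['time_stamp'] // 100
# Count log rows per (user, month)
month_temp = user_log.groupby(['user_id', 'month']).size().reset_index().rename(columns={0: 'cnt'})
# One-hot encode the month, then scale each indicator by that month's count
month_temp = pd.get_dummies(month_temp, columns=['month'], prefix='user_mcnt')
for i in range(5, 12):
    month_temp['user_mcnt_' + str(i)] = month_temp['cnt'] * month_temp['user_mcnt_' + str(i)]
# Collapse to one row per user
month_temp = month_temp.groupby(['user_id']).sum().drop(['cnt'], axis=1).reset_index()
```

Rationale: the number of times a user uses Tmall each month captures how their behavior varies over time. Spending may differ from month to month: for example, it may be higher around year end, Spring Festival, and Double 11, and lower in summer and winter. Per-month usage counts capture these patterns.

+ Merchant features

```python
# `groups` is assumed to be the activity log grouped by merchant_id
# (seller_id renamed to merchant_id); `matrix` is the feature table being built.
temp = groups.size().reset_index().rename(columns={0: 'merchantf1'})
matrix = matrix.merge(temp, on='merchant_id', how='left')
# Distinct users, items, categories and brands per merchant
temp = groups[['user_id', 'item_id', 'cat_id', 'brand_id']].nunique().reset_index().rename(columns={'user_id': 'merchantf2', 'item_id': 'merchantf3', 'cat_id': 'merchantf4', 'brand_id': 'merchantf5'})
matrix = matrix.merge(temp, on='merchant_id', how='left')
# Number of log rows per action type (0-3) per merchant
temp = groups['action_type'].value_counts().unstack().reset_index().rename(columns={0: 'merchantf6', 1: 'merchantf7', 2: 'merchantf8', 3: 'merchantf9'})
matrix = matrix.merge(temp, on='merchant_id', how='left')
```

The number of items and brands a merchant sells reflects how popular those items or brands are, which can in turn influence the customer repurchase rate to some extent.

+ Combined merchant and user features

```python
matrix['ratiof1'] = matrix['userf9']/matrix['userf7'] # user purchase-to-click ratio
matrix['ratiof2'] = matrix['merchantf8']/matrix['merchantf6'] # merchant purchase-to-click ratio
```

The rate at which a user's clicks, or the clicks a merchant receives, convert into purchases is a good indicator of how popular the items are.
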
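The per-pair aggregate features listed under "simple aggregation" above, plus a purchase-to-click ratio, could be computed roughly as follows. This is a sketch on a made-up toy log, not the authors' code. It assumes the log's seller_id column has already been renamed to merchant_id, and that action_type is coded 0 = click, 1 = add to cart, 2 = purchase, 3 = add to favourites. The column name `purchase_click_ratio` is hypothetical.

```python
import pandas as pd

# Toy activity log (illustrative values only).
user_log = pd.DataFrame({
    "user_id":     [1, 1, 1, 1, 2, 2],
    "merchant_id": [7, 7, 7, 7, 7, 7],
    "item_id":     [100, 100, 101, 101, 100, 102],
    "cat_id":      [5, 5, 6, 6, 5, 5],
    "time_stamp":  [1110, 1111, 1111, 1111, 1001, 1001],
    "action_type": [0, 0, 0, 2, 0, 3],
})

pair = user_log.groupby(["user_id", "merchant_id"])

# Counts and distinct counts per (user, merchant) pair.
total_logs_temp = pair.agg(
    total_item_id=("item_id", "count"),
    unique_item_id=("item_id", "nunique"),
    total_cat_id=("cat_id", "nunique"),
    total_time_temp=("time_stamp", "nunique"),
).reset_index()

# One column per action type, 0 where a pair never performed that action.
actions = (
    pair["action_type"].value_counts().unstack(fill_value=0)
    .reindex(columns=[0, 1, 2, 3], fill_value=0)
    .rename(columns={0: "clicks", 1: "shopping_cart", 2: "purchases", 3: "favourites"})
    .reset_index()
)
total_logs_temp = total_logs_temp.merge(actions, on=["user_id", "merchant_id"], how="left")

# Purchase-to-click ratio; pairs with zero clicks get NaN instead of inf.
clicks = total_logs_temp["clicks"].where(total_logs_temp["clicks"] > 0)
total_logs_temp["purchase_click_ratio"] = total_logs_temp["purchases"] / clicks
print(total_logs_temp)
```

Masking zero-click pairs before dividing keeps infinities out of the feature matrix. The resulting NaN values can then be handled in the same way as other missing values.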
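## 4. Candidate Models

1. Logistic regression [1] (LR) is a generalized linear model and one of the most common algorithms for binary classification in machine learning.
2. A decision tree [2] (DT) is a basic method for classification and regression; this report considers only classification trees. The model has a tree structure and, in classification, represents the process of classifying samples based on their features.
3. A random forest [3] (RF) is a classifier that trains multiple decision trees on the samples and combines their predictions. It can be used for both regression and classification, and is an ensemble learning algorithm built on multiple decision trees.
4. Gradient boosting decision trees [4] (GBDT) are a gradient boosting method that uses CART trees as base learners, with an additive model fitted by forward stagewise optimization.
5. XGBoost [5] is an open-source machine learning project developed by Tianqi Chen and others. It implements GBDT efficiently, with many algorithmic and engineering improvements, and is widely used in Kaggle and other machine learning competitions with good results.
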
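For reference, the two model families above that are defined by a formula can be written as follows. The notation is the standard textbook one and is not taken from the report: $w$ and $b$ are the logistic regression weights and bias, $F_m$ is the boosted model after $m$ trees, $h_m$ is the $m$-th CART tree, $\nu$ is the learning rate, and $L$ is the loss.

```latex
% Logistic regression: probability of the positive class
P(y = 1 \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}

% Gradient boosting: additive model built by forward stagewise fitting,
% where each tree h_m is fitted to the negative gradient r_{im} of the loss
F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \qquad
r_{im} = -\left. \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right|_{F = F_{m-1}}
```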
## 5. Comparison of Candidate Models

### 5.1 Loading the Training and Test Data

```python
# Load the engineered training data
df_train = pd.read_csv(r'df_train.csv')
# Load the final test data
test_data = pd.read_csv(r'test_data.csv')
test_data
```

### 5.2 Preprocessing Before Modeling

```python
from sklearn.model_selection import train_test_split

# Separate the label from the features
y = df_train["label"]
X = df_train.drop(["user_id", "merchant_id", "label"], axis=1)
X.head(10)
# Hold out 20% of the rows for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=8)
```

### 5.3 Candidate Model: Logistic Regression

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Logistic regression
Logit = LogisticRegression(solver='liblinear')
Logit.fit(X_train, y_train)
Predict = Logit.predict(X_test)
Predict_proba = Logit.predict_proba(X_test)
print(Predict.shape)
print(Predict[0:20])
print(Predict_proba[:])
Score = accuracy_score(y_test, Predict)
Score
```

```python
# Logistic regression: predict on the final test data
# (df_test holds the user_id/merchant_id pairs read from test_format1.csv)
Logit_Ans_Predict_proba = Logit.predict_proba(test_data)
df_test['prob'] = Logit_Ans_Predict_proba[:, 1]
# Save the submission file
df_test.to_csv("Logit_Ans.csv", index=None)
```

Score after submission: 0.4564939

### 5.4 Candidate Model: Decision Tree

```python
# Decision tree
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)
Predict_proba = tree.predict_proba(X_test)
print(Predict_proba[:])
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))
```

```python
# Decision tree: predict on the final test data
Tree_Ans_Predict_proba = tree.predict_proba(test_data)
df_test['prob'] = Tree_Ans_Predict_proba[:, 1]
# Save the submission file
df_test.to_csv("Tree_Ans.csv", index=None)
```

Score after submission: 0.5833852

### 5.5 Candidate Model: Random Forest

```python
# Random forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=50, random_state=90, max_depth=5)
rfc = rfc.fit(X_train, y_train)
Predict_proba = rfc.predict_proba(X_test)
print(Predict_proba[:])
print("Accuracy on training set: {:.3f}".format(rfc.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(rfc.score(X_test, y_test)))
```

```python
# Random forest: predict on the final test data
RFC_Ans_Predict_proba = rfc.predict_proba(test_data)
df_test['prob'] = RFC_Ans_Predict_proba[:, 1]
# Save the submission file
df_test.to_csv("RFC_Ans.csv", index=None)
```

Score after submission: 0.6252815

### 5.6 Candidate Model: Random Forest Tuning

```python
import numpy as np

# Tune n_estimators (the parameter with the largest effect on a random forest) with a learning curve
score_lt = []
# Fit one random forest every 10 steps and record the score for each n_estimators
for i in range(0, 200, 10):
    print("Progress:", i)
    rfc = RandomForestClassifier(n_estimators=i+1, random_state=90, max_depth=8)
    rfc = rfc.fit(X_train, y_train)
    score = rfc.score(X_test, y_test)
    score_lt.append(score)
score_max = max(score_lt)
print('Best score: {}'.format(score_max), 'number of trees: {}'.format(score_lt.index(score_max)*10+1))
# Plot the learning curve
x = np.arange(1, 201, 10)
plt.subplot(111)
plt.plot(x, score_lt, 'r-')
plt.show()
```

In the resulting plot, the horizontal axis is the value of n_estimators and the vertical axis is the model's accuracy on the held-out test split. n_estimators increases by 10 at each iteration (the loop evaluates 1, 11, ..., 191), and the accuracy at each step is drawn as a line chart. The plot shows that the random forest performs best at around n_estimators=100. After tuning, the score after submission was 0.6256826.

### 5.7 Candidate Model: XGBoost

```python
import xgboost as xgb

def xgb_train(X_train, y_train, X_valid, y_valid, verbose=True):
    model_xgb = xgb.XGBClassifier(
        max_depth=10,  # raw8
        n_estimators=1000,
        min_child_weight=300,
        colsample_bytree=0.8,
        subsample=0.8,
        eta=0.3,
        seed=42
    )
    # Note: recent xgboost releases expect eval_metric and
    # early_stopping_rounds in the constructor rather than in fit().
    model_xgb.fit(
        X_train,
        y_train,
        eval_metric='auc',
        eval_set=[(X_train, y_train), (X_valid, y_valid)],
        verbose=verbose,
        early_stopping_rounds=10  # early stopping: stop if AUC has not improved for 10 rounds
    )
    print(model_xgb.best_score)
    print("Accuracy on training set: {:.3f}".format(model_xgb.score(X_train, y_train)))
    print("Accuracy on validation set: {:.3f}".format(model_xgb.score(X_valid, y_valid)))
    return model_xgb
```

```python
# XGBoost: predict on the final test data.
# X_test/y_test from the split in 5.2 serve as the validation set.
model_xgb = xgb_train(X_train, y_train, X_test, y_test, verbose=False)
prob = model_xgb.predict_proba(test_data)
# `submission` is not defined in this report; it is assumed to hold the test
# user_id/merchant_id pairs plus a helper column named 'origin'.
submission['prob'] = pd.Series(prob[:, 1])
submission.drop(['origin'], axis=1, inplace=True)
submission.to_csv('submission_xgb.csv', index=False)
```

Score after submission: 0.6562986

## 6. Final Score and Ranking

Team members: 李航程, 姚远舟, 黄建辉, 刘杭达

## 7. Summary

Our final score and ranking in this competition were not high. We believe the main causes lie in data preprocessing and feature engineering. In the dataset, roughly 90,000 age and gender values are missing, and this large amount of missing feature data is one of the main reasons for the limited prediction accuracy. The second reason is feature engineering: we extracted features with traditional, relatively simple methods, which also limited accuracy. For modeling we used popular models such as logistic regression, decision trees, random forests, and XGBoost. After training, these models differed little on the training set; by comparison, XGBoost performed best on the test set. As future work we plan to redo the feature engineering and, on the modeling side, to combine several classification algorithms in a bagging-style ensemble to further improve prediction accuracy.

## 8. References

[1] [https://www.cnblogs.com/phyger/p/14188712.html](https://www.cnblogs.com/phyger/p/14188712.html)

[2] [https://blog.csdn.net/qq_34807908/article/details/81539536](https://blog.csdn.net/qq_34807908/article/details/81539536)

[3] [https://blog.csdn.net/lovenankai/article/details/99966142](https://blog.csdn.net/lovenankai/article/details/99966142)

[4] [https://www.jianshu.com/p/d1f696266814](https://www.jianshu.com/p/d1f696266814)

[5] [http://cran.fhcrc.org/web/packages/xgboost/vignettes/xgboost.pdf](http://cran.fhcrc.org/web/packages/xgboost/vignettes/xgboost.pdf)

--------------------------------------------------------------------------------
/.ipynb_checkpoints/数据探索-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "0d97411e",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "import numpy as np\n",
11 | "import pandas as pd\n",
12 | "import matplotlib.pyplot as plt\n",
13 | "import seaborn as sns\n",
14 | "from scipy import stats\n",
15 | "import warnings\n",
16 | "\n",
17 | "warnings.filterwarnings(\"ignore\")"
18 | ]
19 | },
20 | {
21 | "cell_type": "code",
22 | "execution_count": 2,
23 | "id": "30627d48",
24 | "metadata": {},
25 | "outputs": [],
26 | "source": [
27 | "# Load the data\n",
28 | "train_data = pd.read_csv(\"../DataMining/data_format1/train_format1.csv\")\n",
29 | "test_data = pd.read_csv(\"../DataMining/data_format1/test_format1.csv\")\n",
30 | "\n",
31 | "user_info = pd.read_csv(\"../DataMining/data_format1/user_info_format1.csv\")\n",
32 | "user_log = pd.read_csv(\"../DataMining/data_format1/user_log_format1.csv\")"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 3,
38 | "id": "7005f9dd",
39 | "metadata": {},
40 | "outputs": [
41 | {
42 | "data": {
43 | "text/plain": [
44 | "(424170, 3)"
45 | ]
46 | },
47 | "execution_count": 3,
48 | "metadata": {},
49 | "output_type": "execute_result"
50 | }
51 | ],
52 | "source": [
53 | "# 1. Missing values in the user info: age\n",
54 | "# Shape:\n",
55 | "user_info.shape"
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": 4,
61 | "id": "ee333a9d",
62 | "metadata": {},
63 | "outputs": [
64 | {
65 | "data": {
66 | "text/plain": [
67 | "421953"
68 | ]
69 | },
70 | "execution_count": 4,
71 | "metadata": {},
72 | "output_type": "execute_result"
73 | }
74 | ],
75 | "source": [
76 | "# Number of non-null age values:\n",
77 | "user_info[\"age_range\"].count()"
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": 5,
83 | "id": "c2488c3c",
84 | "metadata": {},
85 | "outputs": [
86 | {
87 | "data": {
88 | "text/plain": [
89 | "0.005226677982884221"
90 | ]
91 | },
92 | "execution_count": 5,
93 | "metadata": {},
94 | "output_type": "execute_result"
95 | }
96 | ],
97 | "source": [
98 | "# Missing rate:\n",
99 | "(user_info.shape[0]-user_info[\"age_range\"].count())/user_info.shape[0]"
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": 6,
105 | "id": "8bbe97c9",
106 | "metadata": {},
107 | "outputs": [
108 | {
109 | "data": {
110 | "text/plain": [
111 | "user_id 95131\n",
112 | "age_range 92914\n",
113 | "gender 90664\n",
114 | "dtype: int64"
115 | ]
116 | },
117 | "execution_count": 6,
118 | "metadata": {},
119 | "output_type": "execute_result"
120 | }
121 | ],
122 | "source": [
123 | "## Treat age as missing when it is null or equal to 0\n",
124 | "# Count of missing values:\n",
125 | "user_info[user_info['age_range'].isna()|(user_info['age_range']==0)].count()"
126 | ]
127 | },
128 | {
129 | "cell_type": "code",
130 | "execution_count": 7,
131 | "id": "34af97b9",
132 | "metadata": {},
133 | "outputs": [
134 | {
135 | "data": {
203 | "text/plain": [
204 | " user_id\n",
205 | "age_range \n",
206 | "0.0 92914\n",
207 | "1.0 24\n",
208 | "2.0 52871\n",
209 | "3.0 111654\n",
210 | "4.0 79991\n",
211 | "5.0 40777\n",
212 | "6.0 35464\n",
213 | "7.0 6992\n",
214 | "8.0 1266"
215 | ]
216 | },
217 | "execution_count": 7,
218 | "metadata": {},
219 | "output_type": "execute_result"
220 | }
221 | ],
222 | "source": [
223 | "# Counts by group:\n",
224 | "user_info.groupby(['age_range'])[['user_id']].count()"
225 | ]
226 | },
227 | {
228 | "cell_type": "code",
229 | "execution_count": 8,
230 | "id": "02c03a0f",
231 | "metadata": {},
232 | "outputs": [
233 | {
234 | "data": {
235 | "text/plain": [
236 | "2217"
237 | ]
238 | },
239 | "execution_count": 8,
240 | "metadata": {},
241 | "output_type": "execute_result"
242 | }
243 | ],
244 | "source": [
245 | "# Number of null values:\n",
246 | "user_info.shape[0]-user_info[\"age_range\"].count()"
247 | ]
248 | },
249 | {
250 | "cell_type": "code",
251 | "execution_count": 9,
252 | "id": "b25292e3",
253 | "metadata": {},
254 | "outputs": [
255 | {
256 | "data": {
257 | "text/plain": [
258 | "0.01517316170403376"
259 | ]
260 | },
261 | "execution_count": 9,
262 | "metadata": {},
263 | "output_type": "execute_result"
264 | }
265 | ],
266 | "source": [
267 | "## 2. Missing values in the user info: gender\n",
268 | "# Missing rate:\n",
269 | "(user_info.shape[0] - user_info[\"gender\"].count()) / user_info.shape[0]"
270 | ]
271 | },
272 | {
273 | "cell_type": "code",
274 | "execution_count": 10,
275 | "id": "7d86b971",
276 | "metadata": {},
277 | "outputs": [
278 | {
279 | "data": {
280 | "text/plain": [
281 | "user_id 16862\n",
282 | "age_range 14664\n",
283 | "gender 10426\n",
284 | "dtype: int64"
285 | ]
286 | },
287 | "execution_count": 10,
288 | "metadata": {},
289 | "output_type": "execute_result"
290 | }
291 | ],
292 | "source": [
293 | "# Treat gender as missing when it is null or equal to 2\n",
294 | "# Count of missing values:\n",
295 | "user_info[user_info['gender'].isna() | (user_info['gender'] == 2)].count()"
296 | ]
297 | },
298 | {
299 | "cell_type": "code",
300 | "execution_count": 11,
301 | "id": "f294b0d4",
302 | "metadata": {},
303 | "outputs": [
304 | {
305 | "data": {
349 | "text/plain": [
350 | " user_id\n",
351 | "gender \n",
352 | "0.0 285638\n",
353 | "1.0 121670\n",
354 | "2.0 10426"
355 | ]
356 | },
357 | "execution_count": 11,
358 | "metadata": {},
359 | "output_type": "execute_result"
360 | }
361 | ],
362 | "source": [
363 | "# Counts by group:\n",
364 | "user_info.groupby(['gender'])[['user_id']].count()"
365 | ]
366 | },
367 | {
368 | "cell_type": "code",
369 | "execution_count": 12,
370 | "id": "7cbe3a5f",
371 | "metadata": {},
372 | "outputs": [
373 | {
374 | "data": {
375 | "text/plain": [
376 | "6436"
377 | ]
378 | },
379 | "execution_count": 12,
380 | "metadata": {},
381 | "output_type": "execute_result"
382 | }
383 | ],
384 | "source": [
385 | "# Number of null values:\n",
386 | "user_info.shape[0] - user_info[\"gender\"].count()"
387 | ]
388 | },
389 | {
390 | "cell_type": "code",
391 | "execution_count": 13,
392 | "id": "c6b8e6da",
393 | "metadata": {},
394 | "outputs": [
395 | {
396 | "data": {
397 | "text/plain": [
398 | "user_id 106330\n",
399 | "age_range 104113\n",
400 | "gender 99894\n",
401 | "dtype: int64"
402 | ]
403 | },
404 | "execution_count": 13,
405 | "metadata": {},
406 | "output_type": "execute_result"
407 | }
408 | ],
409 | "source": [
410 | "# Missing values in the user info: age or gender:\n",
411 | "user_info[user_info['age_range'].isna() | (user_info['age_range'] == 0) | user_info['gender'].isna() | (user_info['gender'] == 2)].count()"
412 | ]
413 | },
414 | {
415 | "cell_type": "code",
416 | "execution_count": 14,
417 | "id": "e1b779ef",
418 | "metadata": {},
419 | "outputs": [
420 | {
421 | "data": {
422 | "text/plain": [
423 | "user_id 0\n",
424 | "item_id 0\n",
425 | "cat_id 0\n",
426 | "seller_id 0\n",
427 | "brand_id 91015\n",
428 | "time_stamp 0\n",
429 | "action_type 0\n",
430 | "dtype: int64"
431 | ]
432 | },
433 | "execution_count": 14,
434 | "metadata": {},
435 | "output_type": "execute_result"
436 | }
437 | ],
438 | "source": [
439 | "# 3. Missing values in the user activity log\n",
440 | "user_log.isna().sum()"
441 | ]
442 | },
443 | {
444 | "cell_type": "code",
445 | "execution_count": 15,
446 | "id": "88a6aa39",
447 | "metadata": {},
448 | "outputs": [
449 | {
450 | "data": {
528 | "text/plain": [
529 | " user_id age_range gender\n",
530 | "count 424170.000000 421953.000000 417734.000000\n",
531 | "mean 212085.500000 2.930262 0.341179\n",
532 | "std 122447.476178 1.942978 0.524112\n",
533 | "min 1.000000 0.000000 0.000000\n",
534 | "25% 106043.250000 2.000000 0.000000\n",
535 | "50% 212085.500000 3.000000 0.000000\n",
536 | "75% 318127.750000 4.000000 1.000000\n",
537 | "max 424170.000000 8.000000 2.000000"
538 | ]
539 | },
540 | "execution_count": 15,
541 | "metadata": {},
542 | "output_type": "execute_result"
543 | }
544 | ],
545 | "source": [
546 | "# Summary statistics for user_info:\n",
547 | "user_info.describe()"
548 | ]
549 | },
550 | {
551 | "cell_type": "code",
552 | "execution_count": 16,
553 | "id": "3ff86b41",
554 | "metadata": {},
555 | "outputs": [
556 | {
557 | "data": {
671 | "text/plain": [
672 | " user_id item_id cat_id seller_id brand_id \\\n",
673 | "count 5.492533e+07 5.492533e+07 5.492533e+07 5.492533e+07 5.483432e+07 \n",
674 | "mean 2.121568e+05 5.538613e+05 8.770308e+02 2.470941e+03 4.153348e+03 \n",
675 | "std 1.222872e+05 3.221459e+05 4.486269e+02 1.473310e+03 2.397679e+03 \n",
676 | "min 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 \n",
677 | "25% 1.063360e+05 2.731680e+05 5.550000e+02 1.151000e+03 2.027000e+03 \n",
678 | "50% 2.126540e+05 5.555290e+05 8.210000e+02 2.459000e+03 4.065000e+03 \n",
679 | "75% 3.177500e+05 8.306890e+05 1.252000e+03 3.760000e+03 6.196000e+03 \n",
680 | "max 4.241700e+05 1.113166e+06 1.671000e+03 4.995000e+03 8.477000e+03 \n",
681 | "\n",
682 | " time_stamp action_type \n",
683 | "count 5.492533e+07 5.492533e+07 \n",
684 | "mean 9.230953e+02 2.854458e-01 \n",
685 | "std 1.954305e+02 8.075806e-01 \n",
686 | "min 5.110000e+02 0.000000e+00 \n",
687 | "25% 7.300000e+02 0.000000e+00 \n",
688 | "50% 1.010000e+03 0.000000e+00 \n",
689 | "75% 1.109000e+03 0.000000e+00 \n",
690 | "max 1.112000e+03 3.000000e+00 "
691 | ]
692 | },
693 | "execution_count": 16,
694 | "metadata": {},
695 | "output_type": "execute_result"
696 | }
697 | ],
698 | "source": [
699 | "# Summary statistics for user_log:\n",
700 | "user_log.describe()"
701 | ]
702 | },
703 | {
704 | "cell_type": "code",
705 | "execution_count": null,
706 | "id": "d48b1f40",
707 | "metadata": {},
708 | "outputs": [],
709 | "source": []
710 | }
711 | ],
712 | "metadata": {
713 | "kernelspec": {
714 | "display_name": "Python 3",
715 | "language": "python",
716 | "name": "python3"
717 | },
718 | "language_info": {
719 | "codemirror_mode": {
720 | "name": "ipython",
721 | "version": 3
722 | },
723 | "file_extension": ".py",
724 | "mimetype": "text/x-python",
725 | "name": "python",
726 | "nbconvert_exporter": "python",
727 | "pygments_lexer": "ipython3",
728 | "version": "3.8.8"
729 | }
730 | },
731 | "nbformat": 4,
732 | "nbformat_minor": 5
733 | }
734 |
--------------------------------------------------------------------------------
/测试数据特征处理与填充.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "6c451add",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "# Imports\n",
11 | "import numpy as np\n",
12 | "import pandas as pd\n",
13 | "import matplotlib.pyplot as plt\n",
14 | "plt.rcParams[\"font.sans-serif\"] = \"SimHei\" # Fix garbled Chinese characters in plots\n",
15 | "import seaborn as sns\n",
16 | "import random\n",
17 | "from sklearn.model_selection import train_test_split\n",
18 | "from sklearn.linear_model import LogisticRegression\n",
19 | "from sklearn.preprocessing import LabelEncoder\n",
20 | "from sklearn.metrics import accuracy_score\n",
21 | "from sklearn import model_selection\n",
22 | "from sklearn.neighbors import KNeighborsRegressor"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 2,
28 | "id": "33a0082f",
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "df_test = pd.read_csv('../DataMining/data_format1/test_format1.csv')\n",
33 | "user_info = pd.read_csv('../DataMining/data_format1/user_info_format1.csv')\n",
34 | "user_log = pd.read_csv('../DataMining/data_format1/user_log_format1.csv')"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": 3,
40 | "id": "d2c315d6",
41 | "metadata": {},
42 | "outputs": [
43 | {
44 | "name": "stdout",
45 | "output_type": "stream",
46 | "text": [
47 | "\n",
48 | "RangeIndex: 424170 entries, 0 to 424169\n",
49 | "Data columns (total 3 columns):\n",
50 | " # Column Non-Null Count Dtype \n",
51 | "--- ------ -------------- ----- \n",
52 | " 0 user_id 424170 non-null int64 \n",
53 | " 1 age_range 329039 non-null float64\n",
54 | " 2 gender 407308 non-null float64\n",
55 | "dtypes: float64(2), int64(1)\n",
56 | "memory usage: 9.7 MB\n"
57 | ]
58 | }
59 | ],
60 | "source": [
61 | "# Replace age_range 0 and gender 2 with NaN so they count as missing\n",
62 | "user_info['age_range'] = user_info['age_range'].replace(0.0, np.nan)\n",
63 | "user_info['gender'] = user_info['gender'].replace(2.0, np.nan)\n",
64 | "user_info.info()"
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": 4,
70 | "id": "d5b34bee",
71 | "metadata": {},
72 | "outputs": [],
73 | "source": [
74 | "user_info['age_range'] = user_info['age_range'].fillna(-1)\n",
75 | "user_info['gender'] = user_info['gender'].fillna(-1)\n",
76 | "# user_info['age_range'].replace(np.nan,1,inplace=True)\n",
77 | "# user_info['gender'].replace(np.nan,0,inplace=True)"
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": 5,
83 | "id": "dc6e724d",
84 | "metadata": {},
85 | "outputs": [],
86 | "source": [
87 | "# Build and merge features\n",
88 | "\n",
89 | "df_test = pd.merge(df_test,user_info,on=\"user_id\",how=\"left\")\n",
90 | " \n",
91 | "total_logs_temp = user_log.groupby([user_log[\"user_id\"],user_log[\"seller_id\"]])[\"item_id\"].count().reset_index()\n",
92 | " \n",
93 | "total_logs_temp.rename(columns={\"seller_id\":\"merchant_id\",\"item_id\":\"total_item_id\"},inplace=True)\n",
94 | " \n",
95 | "df_test = pd.merge(df_test,total_logs_temp,on=[\"user_id\",\"merchant_id\"],how=\"left\")\n",
96 | " \n",
97 | "unique_item_id = user_log.groupby([\"user_id\",\"seller_id\",\"item_id\"]).count().reset_index()[[\"user_id\",\"seller_id\",\"item_id\"]]\n",
98 | " \n",
99 | "unique_item_id_cnt = unique_item_id.groupby([\"user_id\",\"seller_id\"]).count().reset_index()\n",
100 | " \n",
101 | "unique_item_id_cnt.rename(columns={\"seller_id\":\"merchant_id\",\"item_id\":\"unique_item_id\"},inplace=True)\n",
102 | " \n",
103 | "df_test = pd.merge(df_test, unique_item_id_cnt, on=[\"user_id\", \"merchant_id\"], how=\"left\")\n",
104 | " \n",
105 | "cat_id_temp = user_log.groupby([\"user_id\", \"seller_id\", \"cat_id\"]).count().reset_index()[[\"user_id\", \"seller_id\", \"cat_id\"]]\n",
106 | " \n",
107 | "cat_id_temp_cnt = cat_id_temp.groupby([\"user_id\", \"seller_id\"]).count().reset_index()\n",
108 | " \n",
109 | "cat_id_temp_cnt.rename(columns={\"seller_id\":\"merchant_id\",\"cat_id\":\"total_cat_id\"},inplace=True)\n",
110 | " \n",
111 | "df_test = pd.merge(df_test, cat_id_temp_cnt, on=[\"user_id\", \"merchant_id\"], how=\"left\")\n",
112 | " \n",
113 | "time_temp = user_log.groupby([\"user_id\", \"seller_id\", \"time_stamp\"]).count().reset_index()[[\"user_id\", \"seller_id\", \"time_stamp\"]]\n",
114 | " \n",
115 | "time_temp_cnt = time_temp.groupby([\"user_id\", \"seller_id\"]).count().reset_index()\n",
116 | " \n",
117 | "time_temp_cnt.rename(columns={\"seller_id\":\"merchant_id\",\"time_stamp\":\"total_time_temp\"},inplace=True)\n",
118 | " \n",
119 | "df_test = pd.merge(df_test, time_temp_cnt, on=[\"user_id\", \"merchant_id\"], how=\"left\")\n",
120 | " \n",
121 | "click_temp = user_log.groupby([\"user_id\", \"seller_id\", \"action_type\"])[\"item_id\"].count().reset_index()\n",
122 | " \n",
123 | "click_temp.rename(columns={\"seller_id\":\"merchant_id\",\"item_id\":\"times\"},inplace=True)\n",
124 | " \n",
125 | "click_temp[\"clicks\"] = click_temp[\"action_type\"] == 0\n",
126 | " \n",
127 | "click_temp[\"clicks\"] = click_temp[\"clicks\"] * click_temp[\"times\"]\n",
128 | " \n",
129 | "click_temp[\"shopping_cart\"] = click_temp[\"action_type\"] == 1\n",
130 | "click_temp[\"shopping_cart\"] = click_temp[\"shopping_cart\"] * click_temp[\"times\"]\n",
131 | " \n",
132 | "click_temp[\"purchases\"] = click_temp[\"action_type\"] == 2\n",
133 | "click_temp[\"purchases\"] = click_temp[\"purchases\"] * click_temp[\"times\"]\n",
134 | " \n",
135 | "click_temp[\"favourites\"] = click_temp[\"action_type\"] == 3\n",
136 | "click_temp[\"favourites\"] = click_temp[\"favourites\"] * click_temp[\"times\"]\n",
137 | " \n",
138 | "four_features = click_temp.groupby([\"user_id\", \"merchant_id\"]).sum().reset_index()\n",
139 | " \n",
140 | "# Drop the helper columns\n",
141 | "four_features = four_features.drop([\"action_type\", \"times\"], axis=1)\n",
142 | " \n",
143 | "# Merge\n",
144 | "df_test = pd.merge(df_test, four_features, on=[\"user_id\", \"merchant_id\"], how=\"left\")\n",
145 | " \n",
146 | "# Forward-fill missing values\n",
147 | "df_test = df_test.ffill()\n",
148 | " "
149 | ]
150 | },
151 | {
152 | "cell_type": "code",
153 | "execution_count": 6,
154 | "id": "43c3bf05",
155 | "metadata": {},
156 | "outputs": [
157 | {
158 | "data": {
375 | "text/plain": [
376 | " user_id merchant_id prob age_range gender total_item_id \\\n",
377 | "0 163968 4605 NaN -1.0 0.0 2 \n",
378 | "1 360576 1581 NaN 2.0 -1.0 10 \n",
379 | "2 98688 1964 NaN 6.0 0.0 6 \n",
380 | "3 98688 3645 NaN 6.0 0.0 11 \n",
381 | "4 295296 3361 NaN 2.0 1.0 50 \n",
382 | "... ... ... ... ... ... ... \n",
383 | "261472 228479 3111 NaN 6.0 0.0 5 \n",
384 | "261473 97919 2341 NaN 8.0 1.0 2 \n",
385 | "261474 97919 3971 NaN 8.0 1.0 16 \n",
386 | "261475 32639 3536 NaN -1.0 0.0 3 \n",
387 | "261476 32639 3319 NaN -1.0 0.0 11 \n",
388 | "\n",
389 | " unique_item_id total_cat_id total_time_temp clicks shopping_cart \\\n",
390 | "0 1 1 1 1 0 \n",
391 | "1 9 4 1 5 0 \n",
392 | "2 1 1 1 5 0 \n",
393 | "3 1 1 1 10 0 \n",
394 | "4 8 4 5 47 0 \n",
395 | "... ... ... ... ... ... \n",
396 | "261472 2 1 2 4 0 \n",
397 | "261473 1 1 1 1 0 \n",
398 | "261474 5 2 3 12 0 \n",
399 | "261475 2 1 1 2 0 \n",
400 | "261476 1 1 2 10 0 \n",
401 | "\n",
402 | " purchases favourites \n",
403 | "0 1 0 \n",
404 | "1 5 0 \n",
405 | "2 1 0 \n",
406 | "3 1 0 \n",
407 | "4 1 2 \n",
408 | "... ... ... \n",
409 | "261472 1 0 \n",
410 | "261473 1 0 \n",
411 | "261474 4 0 \n",
412 | "261475 1 0 \n",
413 | "261476 1 0 \n",
414 | "\n",
415 | "[261477 rows x 13 columns]"
416 | ]
417 | },
418 | "execution_count": 6,
419 | "metadata": {},
420 | "output_type": "execute_result"
421 | }
422 | ],
423 | "source": [
424 | "df_test"
425 | ]
426 | },
427 | {
428 | "cell_type": "code",
429 | "execution_count": 7,
430 | "id": "fa6f95a9",
431 | "metadata": {},
432 | "outputs": [
433 | {
434 | "data": {
615 | "text/plain": [
616 | " age_range gender total_item_id unique_item_id total_cat_id \\\n",
617 | "0 -1.0 0.0 2 1 1 \n",
618 | "1 2.0 -1.0 10 9 4 \n",
619 | "2 6.0 0.0 6 1 1 \n",
620 | "3 6.0 0.0 11 1 1 \n",
621 | "4 2.0 1.0 50 8 4 \n",
622 | "... ... ... ... ... ... \n",
623 | "261472 6.0 0.0 5 2 1 \n",
624 | "261473 8.0 1.0 2 1 1 \n",
625 | "261474 8.0 1.0 16 5 2 \n",
626 | "261475 -1.0 0.0 3 2 1 \n",
627 | "261476 -1.0 0.0 11 1 1 \n",
628 | "\n",
629 | " total_time_temp clicks shopping_cart purchases favourites \n",
630 | "0 1 1 0 1 0 \n",
631 | "1 1 5 0 5 0 \n",
632 | "2 1 5 0 1 0 \n",
633 | "3 1 10 0 1 0 \n",
634 | "4 5 47 0 1 2 \n",
635 | "... ... ... ... ... ... \n",
636 | "261472 2 4 0 1 0 \n",
637 | "261473 1 1 0 1 0 \n",
638 | "261474 3 12 0 4 0 \n",
639 | "261475 1 2 0 1 0 \n",
640 | "261476 2 10 0 1 0 \n",
641 | "\n",
642 | "[261477 rows x 10 columns]"
643 | ]
644 | },
645 | "execution_count": 7,
646 | "metadata": {},
647 | "output_type": "execute_result"
648 | }
649 | ],
650 | "source": [
651 | "# Preprocess the test data\n",
652 | "# y = df_train[\"label\"]\n",
653 | "X = df_test.drop([\"user_id\", \"merchant_id\", \"prob\"], axis=1)\n",
654 | "# X['age_range'].replace(-1,3,inplace=True)\n",
655 | "# X['gender'].replace(-1,0,inplace=True)\n",
656 | "X"
657 | ]
658 | },
659 | {
660 | "cell_type": "code",
661 | "execution_count": 8,
662 | "id": "0d5814e4",
663 | "metadata": {},
664 | "outputs": [],
665 | "source": [
666 | "# Save the constructed features\n",
667 | "X.to_csv(\"test_data.csv\", index=False)"
668 | ]
669 | },
670 | {
671 | "cell_type": "code",
672 | "execution_count": null,
673 | "id": "d54567e6",
674 | "metadata": {},
675 | "outputs": [],
676 | "source": []
677 | }
678 | ],
679 | "metadata": {
680 | "kernelspec": {
681 | "display_name": "Python 3",
682 | "language": "python",
683 | "name": "python3"
684 | },
685 | "language_info": {
686 | "codemirror_mode": {
687 | "name": "ipython",
688 | "version": 3
689 | },
690 | "file_extension": ".py",
691 | "mimetype": "text/x-python",
692 | "name": "python",
693 | "nbconvert_exporter": "python",
694 | "pygments_lexer": "ipython3",
695 | "version": "3.8.8"
696 | }
697 | },
698 | "nbformat": 4,
699 | "nbformat_minor": 5
700 | }
701 |
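Note on the feature-building cell in the notebook above: it derives each per-pair count through its own `groupby` and `merge`. The sketch below shows one way the same eight columns could be produced in a single aggregation plus a cross-tabulation. It is an illustration on toy data, not the author's method; the values in `user_log` here are hypothetical, and `keys`, `counts`, `actions`, and `features` are names introduced only for this example.

```python
import pandas as pd

# Toy interaction log (hypothetical values) with the same columns as user_log.
user_log = pd.DataFrame({
    "user_id":     [1, 1, 1, 1, 2, 2],
    "seller_id":   [10, 10, 10, 10, 20, 20],
    "item_id":     [100, 100, 101, 102, 200, 200],
    "cat_id":      [5, 5, 5, 6, 7, 7],
    "time_stamp":  [1110, 1110, 1111, 1111, 1009, 1111],
    "action_type": [0, 0, 2, 3, 0, 2],
})

keys = ["user_id", "seller_id"]

# Row count plus distinct items, categories and days per (user, seller) pair.
counts = user_log.groupby(keys).agg(
    total_item_id=("item_id", "count"),
    unique_item_id=("item_id", "nunique"),
    total_cat_id=("cat_id", "nunique"),
    total_time_temp=("time_stamp", "nunique"),
)

# One column per action type, using the same code-to-name mapping as the notebook.
actions = (
    pd.crosstab([user_log["user_id"], user_log["seller_id"]], user_log["action_type"])
    .reindex(columns=[0, 1, 2, 3], fill_value=0)
    .rename(columns={0: "clicks", 1: "shopping_cart", 2: "purchases", 3: "favourites"})
)

# Combine both blocks and rename the key to match the test set.
features = (
    counts.join(actions)
    .reset_index()
    .rename(columns={"seller_id": "merchant_id"})
)
print(features)
```

The resulting frame could then be joined to `df_test` with a single `pd.merge(..., on=["user_id", "merchant_id"], how="left")`. Whether this is faster than the step-by-step version on the full log has not been measured here.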
--------------------------------------------------------------------------------
/.ipynb_checkpoints/预测建模-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 10,
6 | "id": "8035b6a2",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "# Imports\n",
11 | "import numpy as np\n",
12 | "import pandas as pd\n",
13 | "import matplotlib.pyplot as plt\n",
14 | "plt.rcParams[\"font.sans-serif\"] = \"SimHei\" # Fix garbled Chinese characters in plots\n",
15 | "import seaborn as sns\n",
16 | "import random\n",
17 | "from sklearn.model_selection import train_test_split\n",
18 | "from sklearn.linear_model import LogisticRegression\n",
19 | "from sklearn.preprocessing import LabelEncoder\n",
20 | "from sklearn.metrics import accuracy_score\n",
21 | "from sklearn import model_selection\n",
22 | "from sklearn.neighbors import KNeighborsRegressor"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 11,
28 | "id": "4e929396",
29 | "metadata": {},
30 | "outputs": [
31 | {
32 | "data": {
129 | "text/plain": [
130 | " user_id merchant_id prob\n",
131 | "0 163968 4605 NaN\n",
132 | "1 360576 1581 NaN\n",
133 | "2 98688 1964 NaN\n",
134 | "3 98688 3645 NaN\n",
135 | "4 295296 3361 NaN\n",
136 | "... ... ... ...\n",
137 | "261472 228479 3111 NaN\n",
138 | "261473 97919 2341 NaN\n",
139 | "261474 97919 3971 NaN\n",
140 | "261475 32639 3536 NaN\n",
141 | "261476 32639 3319 NaN\n",
142 | "\n",
143 | "[261477 rows x 3 columns]"
144 | ]
145 | },
146 | "execution_count": 11,
147 | "metadata": {},
148 | "output_type": "execute_result"
149 | }
150 | ],
151 | "source": [
152 | "# Load the data\n",
153 | "df_train = pd.read_csv('df_train.csv')\n",
154 | "df_test = pd.read_csv('../DataMining/data_format1/test_format1.csv')\n",
155 | "df_test\n"
156 | ]
157 | },
158 | {
159 | "cell_type": "code",
160 | "execution_count": 12,
161 | "id": "16970677",
162 | "metadata": {},
163 | "outputs": [
164 | {
165 | "data": {
332 | "text/plain": [
333 | " age_range gender total_item_id unique_item_id total_cat_id \\\n",
334 | "0 6.0 0.0 39 20 6 \n",
335 | "1 6.0 0.0 14 1 1 \n",
336 | "2 6.0 0.0 18 2 1 \n",
337 | "3 6.0 0.0 2 1 1 \n",
338 | "4 -1.0 0.0 8 1 1 \n",
339 | "5 4.0 1.0 1 1 1 \n",
340 | "6 5.0 0.0 3 2 1 \n",
341 | "7 5.0 0.0 83 48 15 \n",
342 | "8 5.0 0.0 7 4 1 \n",
343 | "9 4.0 1.0 4 1 1 \n",
344 | "\n",
345 | " total_time_temp clicks shopping_cart purchases favourites \n",
346 | "0 9 36 0 1 2 \n",
347 | "1 3 13 0 1 0 \n",
348 | "2 2 12 0 6 0 \n",
349 | "3 1 1 0 1 0 \n",
350 | "4 3 7 0 1 0 \n",
351 | "5 1 0 0 1 0 \n",
352 | "6 1 2 0 1 0 \n",
353 | "7 3 78 0 5 0 \n",
354 | "8 1 6 0 1 0 \n",
355 | "9 2 2 0 1 1 "
356 | ]
357 | },
358 | "execution_count": 12,
359 | "metadata": {},
360 | "output_type": "execute_result"
361 | }
362 | ],
363 | "source": [
364 | "# Preprocessing before modeling\n",
365 | "y = df_train[\"label\"]\n",
366 | "X = df_train.drop([\"user_id\", \"merchant_id\", \"label\"], axis=1)\n",
367 | "X.head(10)\n"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": 13,
373 | "id": "889e9034",
374 | "metadata": {},
375 | "outputs": [],
376 | "source": [
377 | "# Split the data into training and validation sets\n",
378 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=8)"
379 | ]
380 | },
381 | {
382 | "cell_type": "code",
383 | "execution_count": 14,
384 | "id": "b66e524a",
385 | "metadata": {},
386 | "outputs": [
387 | {
388 | "data": {
569 | "text/plain": [
570 | " age_range gender total_item_id unique_item_id total_cat_id \\\n",
571 | "0 -1.0 0.0 2 1 1 \n",
572 | "1 2.0 -1.0 10 9 4 \n",
573 | "2 6.0 0.0 6 1 1 \n",
574 | "3 6.0 0.0 11 1 1 \n",
575 | "4 2.0 1.0 50 8 4 \n",
576 | "... ... ... ... ... ... \n",
577 | "261472 6.0 0.0 5 2 1 \n",
578 | "261473 8.0 1.0 2 1 1 \n",
579 | "261474 8.0 1.0 16 5 2 \n",
580 | "261475 -1.0 0.0 3 2 1 \n",
581 | "261476 -1.0 0.0 11 1 1 \n",
582 | "\n",
583 | " total_time_temp clicks shopping_cart purchases favourites \n",
584 | "0 1 1 0 1 0 \n",
585 | "1 1 5 0 5 0 \n",
586 | "2 1 5 0 1 0 \n",
587 | "3 1 10 0 1 0 \n",
588 | "4 5 47 0 1 2 \n",
589 | "... ... ... ... ... ... \n",
590 | "261472 2 4 0 1 0 \n",
591 | "261473 1 1 0 1 0 \n",
592 | "261474 3 12 0 4 0 \n",
593 | "261475 1 2 0 1 0 \n",
594 | "261476 2 10 0 1 0 \n",
595 | "\n",
596 | "[261477 rows x 10 columns]"
597 | ]
598 | },
599 | "execution_count": 14,
600 | "metadata": {},
601 | "output_type": "execute_result"
602 | }
603 | ],
604 | "source": [
605 | "# Load the final test features\n",
606 | "test_data= pd.read_csv(r'test_data.csv')\n",
607 | "test_data\n"
608 | ]
609 | },
610 | {
611 | "cell_type": "code",
612 | "execution_count": 20,
613 | "id": "bede42d0",
614 | "metadata": {},
615 | "outputs": [
616 | {
617 | "name": "stdout",
618 | "output_type": "stream",
619 | "text": [
620 | "(52173,)\n",
621 | "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n",
622 | "[[0.92242829 0.07757171]\n",
623 | " [0.95384761 0.04615239]\n",
624 | " [0.93995785 0.06004215]\n",
625 | " ...\n",
626 | " [0.94603563 0.05396437]\n",
627 | " [0.86838486 0.13161514]\n",
628 | " [0.95512153 0.04487847]]\n"
629 | ]
630 | },
631 | {
632 | "data": {
633 | "text/plain": [
634 | "0.9391831023709581"
635 | ]
636 | },
637 | "execution_count": 20,
638 | "metadata": {},
639 | "output_type": "execute_result"
640 | }
641 | ],
642 | "source": [
643 | "# Logistic regression\n",
644 | "Logit = LogisticRegression(solver='liblinear')\n",
645 | "Logit.fit(X_train, y_train)\n",
646 | "Predict = Logit.predict(X_test)\n",
647 | "Predict_proba = Logit.predict_proba(X_test)\n",
648 | "print(Predict.shape)\n",
649 | "print(Predict[0:20])\n",
650 | "print(Predict_proba[:])\n",
651 | "print(\"Accuracy on training set: {:.3f}\".format(Logit.score(X_train, y_train)))\n",
652 | "print(\"Accuracy on test set: {:.3f}\".format(Logit.score(X_test, y_test)))\n",
653 | "Score = accuracy_score(y_test, Predict)\n",
654 | "Score"
655 | ]
656 | },
657 | {
658 | "cell_type": "code",
659 | "execution_count": 21,
660 | "id": "65787397",
661 | "metadata": {},
662 | "outputs": [],
663 | "source": [
664 | "# Get the final logistic regression predictions\n",
665 | "Logit_Ans_Predict_proba = Logit.predict_proba(test_data)\n",
666 | "df_test['prob']=Logit_Ans_Predict_proba[:,1]\n",
667 | "# Save the final answers\n",
668 | "df_test.to_csv(\"Logit_Ans.csv\", index=False)"
669 | ]
670 | },
671 | {
672 | "cell_type": "code",
673 | "execution_count": 22,
674 | "id": "a37fd1e5",
675 | "metadata": {},
676 | "outputs": [
677 | {
678 | "name": "stdout",
679 | "output_type": "stream",
680 | "text": [
681 | "[[0.89765569 0.10234431]\n",
682 | " [0.9609094 0.0390906 ]\n",
683 | " [0.93901148 0.06098852]\n",
684 | " ...\n",
685 | " [0.92812445 0.07187555]\n",
686 | " [0.89765569 0.10234431]\n",
687 | " [0.9609094 0.0390906 ]]\n",
688 | "Accuracy on training set: 0.939\n",
689 | "Accuracy on test set: 0.939\n"
690 | ]
691 | }
692 | ],
693 | "source": [
694 | "# Decision tree\n",
695 | "from sklearn.tree import DecisionTreeClassifier\n",
696 | "tree = DecisionTreeClassifier(max_depth=4,random_state=0) \n",
697 | "tree.fit(X_train, y_train)\n",
698 | "Predict_proba = tree.predict_proba(X_test)\n",
699 | "print(Predict_proba[:])\n",
700 | "print(\"Accuracy on training set: {:.3f}\".format(tree.score(X_train, y_train)))\n",
701 | "print(\"Accuracy on test set: {:.3f}\".format(tree.score(X_test, y_test)))"
702 | ]
703 | },
704 | {
705 | "cell_type": "code",
706 | "execution_count": 23,
707 | "id": "5ed0c662",
708 | "metadata": {},
709 | "outputs": [],
710 | "source": [
711 | "# Get the final decision tree predictions\n",
712 | "Tree_Ans_Predict_proba = tree.predict_proba(test_data)\n",
713 | "df_test['prob']=Tree_Ans_Predict_proba[:,1]\n",
714 | "# Save the final answers\n",
715 | "df_test.to_csv(\"Tree_Ans.csv\", index=False)"
716 | ]
717 | },
718 | {
719 | "cell_type": "code",
720 | "execution_count": 28,
721 | "id": "9c002987",
722 | "metadata": {},
723 | "outputs": [
724 | {
725 | "name": "stdout",
726 | "output_type": "stream",
727 | "text": [
728 | "[[0.90345203 0.09654797]\n",
729 | " [0.96242055 0.03757945]\n",
730 | " [0.92398178 0.07601822]\n",
731 | " ...\n",
732 | " [0.91943483 0.08056517]\n",
733 | " [0.86844252 0.13155748]\n",
734 | " [0.9607207 0.0392793 ]]\n",
735 | "Accuracy on training set: 0.939\n",
736 | "Accuracy on test set: 0.939\n"
737 | ]
738 | }
739 | ],
740 | "source": [
741 | "# Random forest\n",
742 | "from sklearn.ensemble import RandomForestClassifier\n",
743 | "rfc = RandomForestClassifier(n_estimators=100,random_state=90,max_depth=8)\n",
744 | "rfc = rfc.fit(X_train, y_train)\n",
745 | "Predict_proba = rfc.predict_proba(X_test)\n",
746 | "print(Predict_proba[:])\n",
747 | "print(\"Accuracy on training set: {:.3f}\".format(rfc.score(X_train, y_train))) \n",
748 | "print(\"Accuracy on test set: {:.3f}\".format(rfc.score(X_test, y_test)))"
749 | ]
750 | },
751 | {
752 | "cell_type": "code",
753 | "execution_count": 29,
754 | "id": "55703385",
755 | "metadata": {},
756 | "outputs": [],
757 | "source": [
758 | "# Final random-forest predictions for the submission set\n",
759 | "RFC_Ans_Predict_proba = rfc.predict_proba(test_data)\n",
760 | "df_test['prob'] = RFC_Ans_Predict_proba[:, 1]\n",
761 | "# Save the submission file without the index column\n",
762 | "df_test.to_csv(\"RFC_Ans.csv\", index=False)"
763 | ]
764 | },
765 | {
766 | "cell_type": "code",
767 | "execution_count": 27,
768 | "id": "54978d26",
769 | "metadata": {},
770 | "outputs": [
771 | {
772 | "name": "stdout",
773 | "output_type": "stream",
774 | "text": [
775 | "Progress: 0\n",
776 | "Progress: 10\n",
777 | "Progress: 20\n",
778 | "Progress: 30\n",
779 | "Progress: 40\n",
780 | "Progress: 50\n",
781 | "Progress: 60\n",
782 | "Progress: 70\n",
783 | "Progress: 80\n",
784 | "Progress: 90\n",
785 | "Progress: 100\n",
786 | "Progress: 110\n",
787 | "Progress: 120\n",
788 | "Progress: 130\n",
789 | "Progress: 140\n",
790 | "Progress: 150\n",
791 | "Progress: 160\n",
792 | "Progress: 170\n",
793 | "Progress: 180\n",
794 | "Progress: 190\n",
795 | "Best score: 0.9394897744043854 Number of trees: 101\n"
796 | ]
797 | },
798 | {
799 | "data": {
800 | "image/png": "\n",
801 | "text/plain": [
802 | ""
803 | ]
804 | },
805 | "metadata": {
806 | "needs_background": "light"
807 | },
808 | "output_type": "display_data"
809 | }
810 | ],
811 | "source": [
812 | "# Tune n_estimators (the parameter with the largest effect on the random forest) by plotting a learning curve\n",
813 | "score_lt = []\n",
814 | "\n",
815 | "# Fit one forest for every 10th value of n_estimators (1, 11, ..., 191) and record its score\n",
816 | "for i in range(0,200,10):\n",
817 | " print(\"Progress:\",i)\n",
818 | " rfc = RandomForestClassifier(n_estimators=i+1,random_state=90,max_depth=8)\n",
819 | " rfc = rfc.fit(X_train, y_train)\n",
820 | " score = rfc.score(X_test, y_test)\n",
821 | " score_lt.append(score)\n",
822 | "score_max = max(score_lt)\n",
823 | "print('Best score: {}'.format(score_max),'Number of trees: {}'.format(score_lt.index(score_max)*10+1))\n",
824 | "\n",
825 | "# Plot the learning curve\n",
826 | "x = np.arange(1,201,10)\n",
827 | "plt.subplot(111)\n",
828 | "plt.plot(x, score_lt, 'r-')\n",
829 | "plt.show()"
830 | ]
831 | },
832 | {
833 | "cell_type": "code",
834 | "execution_count": 15,
835 | "id": "b124dbb8",
836 | "metadata": {},
837 | "outputs": [
838 | {
839 | "name": "stderr",
840 | "output_type": "stream",
841 | "text": [
842 | "D:\\anaconda3\\lib\\site-packages\\xgboost\\sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].\n",
843 | " warnings.warn(label_encoder_deprecation_msg, UserWarning)\n"
844 | ]
845 | },
846 | {
847 | "name": "stdout",
848 | "output_type": "stream",
849 | "text": [
850 | "[15:05:43] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.0/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
851 | "[[0.87493783 0.12506217]\n",
852 | " [0.9712213 0.02877864]\n",
853 | " [0.8449106 0.1550894 ]\n",
854 | " ...\n",
855 | " [0.87651736 0.12348264]\n",
856 | " [0.916159 0.08384103]\n",
857 | " [0.9761114 0.02388861]]\n",
858 | "Accuracy on training set: 0.939\n",
859 | "Accuracy on test set: 0.939\n"
860 | ]
861 | }
862 | ],
863 | "source": [
864 | "# XGBoost\n",
865 | "from sklearn.model_selection import train_test_split\n",
866 | "from sklearn.ensemble import RandomForestClassifier\n",
867 | "from sklearn.linear_model import LinearRegression\n",
868 | "from sklearn.metrics import classification_report\n",
869 | "import xgboost as xgb\n",
870 | "\n",
871 | "model = xgb.XGBClassifier(\n",
872 | " max_depth=8,\n",
873 | " n_estimators=2000,\n",
874 | " min_child_weight=300, \n",
875 | " colsample_bytree=0.8, \n",
876 | " subsample=0.8, \n",
877 | " eta=0.3, \n",
878 | " seed=42 \n",
879 | ")\n",
880 | "# model.fit(\n",
881 | "# X_train, y_train,\n",
882 | "# eval_metric='auc', eval_set=[(X_train, y_train), (X_test, y_test)],\n",
883 | "# verbose=True,\n",
884 | "# # Early stopping: stop if the AUC has not improved for 30 rounds\n",
885 | "# early_stopping_rounds=30 \n",
886 | "# )\n",
887 | "model.fit(X_train, y_train)\n",
888 | "\n",
889 | "\n",
890 | "Predict_proba = model.predict_proba(X_test)\n",
891 | "print(Predict_proba[:])\n",
892 | "print(\"Accuracy on training set: {:.3f}\".format(model.score(X_train, y_train))) \n",
893 | "print(\"Accuracy on test set: {:.3f}\".format(model.score(X_test, y_test)))\n"
894 | ]
895 | },
896 | {
897 | "cell_type": "code",
898 | "execution_count": 16,
899 | "id": "b942defe",
900 | "metadata": {},
901 | "outputs": [],
902 | "source": [
903 | "# Final XGBoost predictions for the submission set\n",
904 | "xgboost_Ans_Predict_proba = model.predict_proba(test_data)\n",
905 | "df_test['prob'] = xgboost_Ans_Predict_proba[:, 1]\n",
906 | "# Save the submission file without the index column\n",
907 | "df_test.to_csv(\"xgboost_Ans.csv\", index=False)"
908 | ]
909 | },
910 | {
911 | "cell_type": "code",
912 | "execution_count": null,
913 | "id": "33f75369",
914 | "metadata": {},
915 | "outputs": [],
916 | "source": []
917 | }
918 | ],
919 | "metadata": {
920 | "kernelspec": {
921 | "display_name": "Python 3",
922 | "language": "python",
923 | "name": "python3"
924 | },
925 | "language_info": {
926 | "codemirror_mode": {
927 | "name": "ipython",
928 | "version": 3
929 | },
930 | "file_extension": ".py",
931 | "mimetype": "text/x-python",
932 | "name": "python",
933 | "nbconvert_exporter": "python",
934 | "pygments_lexer": "ipython3",
935 | "version": "3.8.8"
936 | }
937 | },
938 | "nbformat": 4,
939 | "nbformat_minor": 5
940 | }
941 |
--------------------------------------------------------------------------------
/预测建模.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "8035b6a2",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "# Import packages\n",
11 | "import numpy as np\n",
12 | "import pandas as pd\n",
13 | "import matplotlib.pyplot as plt\n",
14 | "plt.rcParams[\"font.sans-serif\"] = \"SimHei\" # use the SimHei font so Chinese text in plots renders correctly\n",
15 | "import seaborn as sns\n",
16 | "import random\n",
17 | "from sklearn.model_selection import train_test_split\n",
18 | "from sklearn.linear_model import LogisticRegression\n",
19 | "from sklearn.preprocessing import LabelEncoder\n",
20 | "from sklearn.metrics import accuracy_score\n",
21 | "from sklearn import model_selection\n",
22 | "from sklearn.neighbors import KNeighborsRegressor"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 2,
28 | "id": "4e929396",
29 | "metadata": {},
30 | "outputs": [
31 | {
32 | "data": {
129 | "text/plain": [
130 | " user_id merchant_id prob\n",
131 | "0 163968 4605 NaN\n",
132 | "1 360576 1581 NaN\n",
133 | "2 98688 1964 NaN\n",
134 | "3 98688 3645 NaN\n",
135 | "4 295296 3361 NaN\n",
136 | "... ... ... ...\n",
137 | "261472 228479 3111 NaN\n",
138 | "261473 97919 2341 NaN\n",
139 | "261474 97919 3971 NaN\n",
140 | "261475 32639 3536 NaN\n",
141 | "261476 32639 3319 NaN\n",
142 | "\n",
143 | "[261477 rows x 3 columns]"
144 | ]
145 | },
146 | "execution_count": 2,
147 | "metadata": {},
148 | "output_type": "execute_result"
149 | }
150 | ],
151 | "source": [
152 | "# Load the data\n",
153 | "df_train = pd.read_csv(r'df_train.csv')\n",
154 | "df_test = pd.read_csv(r'../DataMining/data_format1/test_format1.csv')\n",
155 | "df_test\n"
156 | ]
157 | },
158 | {
159 | "cell_type": "code",
160 | "execution_count": 3,
161 | "id": "16970677",
162 | "metadata": {},
163 | "outputs": [
164 | {
165 | "data": {
166 | "text/html": [
167 | "\n",
168 | "\n",
181 | "
\n",
182 | " \n",
183 | " \n",
184 | " | \n",
185 | " age_range | \n",
186 | " gender | \n",
187 | " total_item_id | \n",
188 | " unique_item_id | \n",
189 | " total_cat_id | \n",
190 | " total_time_temp | \n",
191 | " clicks | \n",
192 | " shopping_cart | \n",
193 | " purchases | \n",
194 | " favourites | \n",
195 | "
\n",
196 | " \n",
197 | " \n",
198 | " \n",
199 | " | 0 | \n",
200 | " 6.0 | \n",
201 | " 0.0 | \n",
202 | " 39 | \n",
203 | " 20 | \n",
204 | " 6 | \n",
205 | " 9 | \n",
206 | " 36 | \n",
207 | " 0 | \n",
208 | " 1 | \n",
209 | " 2 | \n",
210 | "
\n",
211 | " \n",
212 | " | 1 | \n",
213 | " 6.0 | \n",
214 | " 0.0 | \n",
215 | " 14 | \n",
216 | " 1 | \n",
217 | " 1 | \n",
218 | " 3 | \n",
219 | " 13 | \n",
220 | " 0 | \n",
221 | " 1 | \n",
222 | " 0 | \n",
223 | "
\n",
224 | " \n",
225 | " | 2 | \n",
226 | " 6.0 | \n",
227 | " 0.0 | \n",
228 | " 18 | \n",
229 | " 2 | \n",
230 | " 1 | \n",
231 | " 2 | \n",
232 | " 12 | \n",
233 | " 0 | \n",
234 | " 6 | \n",
235 | " 0 | \n",
236 | "
\n",
237 | " \n",
238 | " | 3 | \n",
239 | " 6.0 | \n",
240 | " 0.0 | \n",
241 | " 2 | \n",
242 | " 1 | \n",
243 | " 1 | \n",
244 | " 1 | \n",
245 | " 1 | \n",
246 | " 0 | \n",
247 | " 1 | \n",
248 | " 0 | \n",
249 | "
\n",
250 | " \n",
251 | " | 4 | \n",
252 | " -1.0 | \n",
253 | " 0.0 | \n",
254 | " 8 | \n",
255 | " 1 | \n",
256 | " 1 | \n",
257 | " 3 | \n",
258 | " 7 | \n",
259 | " 0 | \n",
260 | " 1 | \n",
261 | " 0 | \n",
262 | "
\n",
263 | " \n",
264 | " | 5 | \n",
265 | " 4.0 | \n",
266 | " 1.0 | \n",
267 | " 1 | \n",
268 | " 1 | \n",
269 | " 1 | \n",
270 | " 1 | \n",
271 | " 0 | \n",
272 | " 0 | \n",
273 | " 1 | \n",
274 | " 0 | \n",
275 | "
\n",
276 | " \n",
277 | " | 6 | \n",
278 | " 5.0 | \n",
279 | " 0.0 | \n",
280 | " 3 | \n",
281 | " 2 | \n",
282 | " 1 | \n",
283 | " 1 | \n",
284 | " 2 | \n",
285 | " 0 | \n",
286 | " 1 | \n",
287 | " 0 | \n",
288 | "
\n",
289 | " \n",
290 | " | 7 | \n",
291 | " 5.0 | \n",
292 | " 0.0 | \n",
293 | " 83 | \n",
294 | " 48 | \n",
295 | " 15 | \n",
296 | " 3 | \n",
297 | " 78 | \n",
298 | " 0 | \n",
299 | " 5 | \n",
300 | " 0 | \n",
301 | "
\n",
302 | " \n",
303 | " | 8 | \n",
304 | " 5.0 | \n",
305 | " 0.0 | \n",
306 | " 7 | \n",
307 | " 4 | \n",
308 | " 1 | \n",
309 | " 1 | \n",
310 | " 6 | \n",
311 | " 0 | \n",
312 | " 1 | \n",
313 | " 0 | \n",
314 | "
\n",
315 | " \n",
316 | " | 9 | \n",
317 | " 4.0 | \n",
318 | " 1.0 | \n",
319 | " 4 | \n",
320 | " 1 | \n",
321 | " 1 | \n",
322 | " 2 | \n",
323 | " 2 | \n",
324 | " 0 | \n",
325 | " 1 | \n",
326 | " 1 | \n",
327 | "
\n",
328 | " \n",
329 | "
\n",
330 | "
"
331 | ],
332 | "text/plain": [
333 | " age_range gender total_item_id unique_item_id total_cat_id \\\n",
334 | "0 6.0 0.0 39 20 6 \n",
335 | "1 6.0 0.0 14 1 1 \n",
336 | "2 6.0 0.0 18 2 1 \n",
337 | "3 6.0 0.0 2 1 1 \n",
338 | "4 -1.0 0.0 8 1 1 \n",
339 | "5 4.0 1.0 1 1 1 \n",
340 | "6 5.0 0.0 3 2 1 \n",
341 | "7 5.0 0.0 83 48 15 \n",
342 | "8 5.0 0.0 7 4 1 \n",
343 | "9 4.0 1.0 4 1 1 \n",
344 | "\n",
345 | " total_time_temp clicks shopping_cart purchases favourites \n",
346 | "0 9 36 0 1 2 \n",
347 | "1 3 13 0 1 0 \n",
348 | "2 2 12 0 6 0 \n",
349 | "3 1 1 0 1 0 \n",
350 | "4 3 7 0 1 0 \n",
351 | "5 1 0 0 1 0 \n",
352 | "6 1 2 0 1 0 \n",
353 | "7 3 78 0 5 0 \n",
354 | "8 1 6 0 1 0 \n",
355 | "9 2 2 0 1 1 "
356 | ]
357 | },
358 | "execution_count": 3,
359 | "metadata": {},
360 | "output_type": "execute_result"
361 | }
362 | ],
363 | "source": [
364 | "# Preprocessing before modeling: separate the label and drop the ID columns\n",
365 | "y = df_train[\"label\"]\n",
366 | "X = df_train.drop([\"user_id\", \"merchant_id\", \"label\"], axis=1)\n",
367 | "X.head(10)\n"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": 4,
373 | "id": "889e9034",
374 | "metadata": {},
375 | "outputs": [],
376 | "source": [
377 | "# Split the data into training (80%) and hold-out (20%) sets\n",
378 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=8)"
379 | ]
380 | },
381 | {
382 | "cell_type": "code",
383 | "execution_count": 5,
384 | "id": "b66e524a",
385 | "metadata": {},
386 | "outputs": [
387 | {
388 | "data": {
569 | "text/plain": [
570 | " age_range gender total_item_id unique_item_id total_cat_id \\\n",
571 | "0 -1.0 0.0 2 1 1 \n",
572 | "1 2.0 -1.0 10 9 4 \n",
573 | "2 6.0 0.0 6 1 1 \n",
574 | "3 6.0 0.0 11 1 1 \n",
575 | "4 2.0 1.0 50 8 4 \n",
576 | "... ... ... ... ... ... \n",
577 | "261472 6.0 0.0 5 2 1 \n",
578 | "261473 8.0 1.0 2 1 1 \n",
579 | "261474 8.0 1.0 16 5 2 \n",
580 | "261475 -1.0 0.0 3 2 1 \n",
581 | "261476 -1.0 0.0 11 1 1 \n",
582 | "\n",
583 | " total_time_temp clicks shopping_cart purchases favourites \n",
584 | "0 1 1 0 1 0 \n",
585 | "1 1 5 0 5 0 \n",
586 | "2 1 5 0 1 0 \n",
587 | "3 1 10 0 1 0 \n",
588 | "4 5 47 0 1 2 \n",
589 | "... ... ... ... ... ... \n",
590 | "261472 2 4 0 1 0 \n",
591 | "261473 1 1 0 1 0 \n",
592 | "261474 3 12 0 4 0 \n",
593 | "261475 1 2 0 1 0 \n",
594 | "261476 2 10 0 1 0 \n",
595 | "\n",
596 | "[261477 rows x 10 columns]"
597 | ]
598 | },
599 | "execution_count": 5,
600 | "metadata": {},
601 | "output_type": "execute_result"
602 | }
603 | ],
604 | "source": [
605 | "# Load the final test data used for the submission\n",
606 | "test_data= pd.read_csv(r'test_data.csv')\n",
607 | "test_data\n"
608 | ]
609 | },
610 | {
611 | "cell_type": "code",
612 | "execution_count": 6,
613 | "id": "bede42d0",
614 | "metadata": {},
615 | "outputs": [
616 | {
617 | "name": "stdout",
618 | "output_type": "stream",
619 | "text": [
620 | "(52173,)\n",
621 | "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n",
622 | "[[0.92242829 0.07757171]\n",
623 | " [0.95384761 0.04615239]\n",
624 | " [0.93995785 0.06004215]\n",
625 | " ...\n",
626 | " [0.94603563 0.05396437]\n",
627 | " [0.86838486 0.13161514]\n",
628 | " [0.95512153 0.04487847]]\n",
629 | "Accuracy on training set: 0.939\n",
630 | "Accuracy on test set: 0.939\n"
631 | ]
632 | },
633 | {
634 | "data": {
635 | "text/plain": [
636 | "0.9391831023709581"
637 | ]
638 | },
639 | "execution_count": 6,
640 | "metadata": {},
641 | "output_type": "execute_result"
642 | }
643 | ],
644 | "source": [
645 | "# Logistic regression\n",
646 | "Logit = LogisticRegression(solver='liblinear')\n",
647 | "Logit.fit(X_train, y_train)\n",
648 | "Predict = Logit.predict(X_test)\n",
649 | "Predict_proba = Logit.predict_proba(X_test)\n",
650 | "print(Predict.shape)\n",
651 | "print(Predict[0:20])\n",
652 | "print(Predict_proba[:])\n",
653 | "print(\"Accuracy on training set: {:.3f}\".format(Logit.score(X_train, y_train)))\n",
654 | "print(\"Accuracy on test set: {:.3f}\".format(Logit.score(X_test, y_test)))\n",
655 | "Score = accuracy_score(y_test, Predict)\n",
656 | "Score"
657 | ]
658 | },
659 | {
660 | "cell_type": "code",
661 | "execution_count": 21,
662 | "id": "65787397",
663 | "metadata": {},
664 | "outputs": [],
665 | "source": [
666 | "# Final logistic-regression predictions for the submission set\n",
667 | "Logit_Ans_Predict_proba = Logit.predict_proba(test_data)\n",
668 | "df_test['prob'] = Logit_Ans_Predict_proba[:, 1]\n",
669 | "# Save the submission file without the index column\n",
670 | "df_test.to_csv(\"Logit_Ans.csv\", index=False)"
671 | ]
672 | },
673 | {
674 | "cell_type": "code",
675 | "execution_count": 22,
676 | "id": "a37fd1e5",
677 | "metadata": {},
678 | "outputs": [
679 | {
680 | "name": "stdout",
681 | "output_type": "stream",
682 | "text": [
683 | "[[0.89765569 0.10234431]\n",
684 | " [0.9609094 0.0390906 ]\n",
685 | " [0.93901148 0.06098852]\n",
686 | " ...\n",
687 | " [0.92812445 0.07187555]\n",
688 | " [0.89765569 0.10234431]\n",
689 | " [0.9609094 0.0390906 ]]\n",
690 | "Accuracy on training set: 0.939\n",
691 | "Accuracy on test set: 0.939\n"
692 | ]
693 | }
694 | ],
695 | "source": [
696 | "# Decision tree\n",
697 | "from sklearn.tree import DecisionTreeClassifier\n",
698 | "tree = DecisionTreeClassifier(max_depth=4,random_state=0) \n",
699 | "tree.fit(X_train, y_train)\n",
700 | "Predict_proba = tree.predict_proba(X_test)\n",
701 | "print(Predict_proba[:])\n",
702 | "print(\"Accuracy on training set: {:.3f}\".format(tree.score(X_train, y_train)))\n",
703 | "print(\"Accuracy on test set: {:.3f}\".format(tree.score(X_test, y_test)))"
704 | ]
705 | },
706 | {
707 | "cell_type": "code",
708 | "execution_count": 23,
709 | "id": "5ed0c662",
710 | "metadata": {},
711 | "outputs": [],
712 | "source": [
713 | "# Final decision-tree predictions for the submission set\n",
714 | "Tree_Ans_Predict_proba = tree.predict_proba(test_data)\n",
715 | "df_test['prob'] = Tree_Ans_Predict_proba[:, 1]\n",
716 | "# Save the submission file without the index column\n",
717 | "df_test.to_csv(\"Tree_Ans.csv\", index=False)"
718 | ]
719 | },
720 | {
721 | "cell_type": "code",
722 | "execution_count": 28,
723 | "id": "9c002987",
724 | "metadata": {},
725 | "outputs": [
726 | {
727 | "name": "stdout",
728 | "output_type": "stream",
729 | "text": [
730 | "[[0.90345203 0.09654797]\n",
731 | " [0.96242055 0.03757945]\n",
732 | " [0.92398178 0.07601822]\n",
733 | " ...\n",
734 | " [0.91943483 0.08056517]\n",
735 | " [0.86844252 0.13155748]\n",
736 | " [0.9607207 0.0392793 ]]\n",
737 | "Accuracy on training set: 0.939\n",
738 | "Accuracy on test set: 0.939\n"
739 | ]
740 | }
741 | ],
742 | "source": [
743 | "# Random forest\n",
744 | "from sklearn.ensemble import RandomForestClassifier\n",
745 | "rfc = RandomForestClassifier(n_estimators=100,random_state=90,max_depth=8)\n",
746 | "rfc = rfc.fit(X_train, y_train)\n",
747 | "Predict_proba = rfc.predict_proba(X_test)\n",
748 | "print(Predict_proba[:])\n",
749 | "print(\"Accuracy on training set: {:.3f}\".format(rfc.score(X_train, y_train))) \n",
750 | "print(\"Accuracy on test set: {:.3f}\".format(rfc.score(X_test, y_test)))"
751 | ]
752 | },
753 | {
754 | "cell_type": "code",
755 | "execution_count": 29,
756 | "id": "55703385",
757 | "metadata": {},
758 | "outputs": [],
759 | "source": [
760 | "# Final random-forest predictions for the submission set\n",
761 | "RFC_Ans_Predict_proba = rfc.predict_proba(test_data)\n",
762 | "df_test['prob'] = RFC_Ans_Predict_proba[:, 1]\n",
763 | "# Save the submission file without the index column\n",
764 | "df_test.to_csv(\"RFC_Ans.csv\", index=False)"
765 | ]
766 | },
767 | {
768 | "cell_type": "code",
769 | "execution_count": 27,
770 | "id": "54978d26",
771 | "metadata": {},
772 | "outputs": [
773 | {
774 | "name": "stdout",
775 | "output_type": "stream",
776 | "text": [
777 | "Progress: 0\n",
778 | "Progress: 10\n",
779 | "Progress: 20\n",
780 | "Progress: 30\n",
781 | "Progress: 40\n",
782 | "Progress: 50\n",
783 | "Progress: 60\n",
784 | "Progress: 70\n",
785 | "Progress: 80\n",
786 | "Progress: 90\n",
787 | "Progress: 100\n",
788 | "Progress: 110\n",
789 | "Progress: 120\n",
790 | "Progress: 130\n",
791 | "Progress: 140\n",
792 | "Progress: 150\n",
793 | "Progress: 160\n",
794 | "Progress: 170\n",
795 | "Progress: 180\n",
796 | "Progress: 190\n",
797 | "Best score: 0.9394897744043854 Number of trees: 101\n"
798 | ]
799 | },
800 | {
801 | "data": {
802 | "image/png": "\n",
803 | "text/plain": [
804 | ""
805 | ]
806 | },
807 | "metadata": {
808 | "needs_background": "light"
809 | },
810 | "output_type": "display_data"
811 | }
812 | ],
813 | "source": [
814 | "# Tune n_estimators (the parameter with the largest effect on the random forest) by plotting a learning curve\n",
815 | "score_lt = []\n",
816 | "\n",
817 | "# Fit one forest for every 10th value of n_estimators (1, 11, ..., 191) and record its score\n",
818 | "for i in range(0,200,10):\n",
819 | " print(\"Progress:\",i)\n",
820 | " rfc = RandomForestClassifier(n_estimators=i+1,random_state=90,max_depth=8)\n",
821 | " rfc = rfc.fit(X_train, y_train)\n",
822 | " score = rfc.score(X_test, y_test)\n",
823 | " score_lt.append(score)\n",
824 | "score_max = max(score_lt)\n",
825 | "print('Best score: {}'.format(score_max),'Number of trees: {}'.format(score_lt.index(score_max)*10+1))\n",
826 | "\n",
827 | "# Plot the learning curve\n",
828 | "x = np.arange(1,201,10)\n",
829 | "plt.subplot(111)\n",
830 | "plt.plot(x, score_lt, 'r-')\n",
831 | "plt.show()"
832 | ]
833 | },
834 | {
835 | "cell_type": "code",
836 | "execution_count": 15,
837 | "id": "b124dbb8",
838 | "metadata": {},
839 | "outputs": [
840 | {
841 | "name": "stderr",
842 | "output_type": "stream",
843 | "text": [
844 | "D:\\anaconda3\\lib\\site-packages\\xgboost\\sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].\n",
845 | " warnings.warn(label_encoder_deprecation_msg, UserWarning)\n"
846 | ]
847 | },
848 | {
849 | "name": "stdout",
850 | "output_type": "stream",
851 | "text": [
852 | "[15:05:43] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.0/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
853 | "[[0.87493783 0.12506217]\n",
854 | " [0.9712213 0.02877864]\n",
855 | " [0.8449106 0.1550894 ]\n",
856 | " ...\n",
857 | " [0.87651736 0.12348264]\n",
858 | " [0.916159 0.08384103]\n",
859 | " [0.9761114 0.02388861]]\n",
860 | "Accuracy on training set: 0.939\n",
861 | "Accuracy on test set: 0.939\n"
862 | ]
863 | }
864 | ],
865 | "source": [
866 | "# XGBoost\n",
867 | "from sklearn.model_selection import train_test_split\n",
868 | "from sklearn.ensemble import RandomForestClassifier\n",
869 | "from sklearn.linear_model import LinearRegression\n",
870 | "from sklearn.metrics import classification_report\n",
871 | "import xgboost as xgb\n",
872 | "\n",
873 | "model = xgb.XGBClassifier(\n",
874 | " max_depth=8,\n",
875 | " n_estimators=2000,\n",
876 | " min_child_weight=300, \n",
877 | " colsample_bytree=0.8, \n",
878 | " subsample=0.8, \n",
879 | " eta=0.3, \n",
880 | " seed=42 \n",
881 | ")\n",
882 | "# model.fit(\n",
883 | "# X_train, y_train,\n",
884 | "# eval_metric='auc', eval_set=[(X_train, y_train), (X_test, y_test)],\n",
885 | "# verbose=True,\n",
886 | "# # Early stopping: stop if the AUC has not improved for 30 rounds\n",
887 | "# early_stopping_rounds=30 \n",
888 | "# )\n",
889 | "model.fit(X_train, y_train)\n",
890 | "\n",
891 | "\n",
892 | "Predict_proba = model.predict_proba(X_test)\n",
893 | "print(Predict_proba[:])\n",
894 | "print(\"Accuracy on training set: {:.3f}\".format(model.score(X_train, y_train))) \n",
895 | "print(\"Accuracy on test set: {:.3f}\".format(model.score(X_test, y_test)))\n"
896 | ]
897 | },
898 | {
899 | "cell_type": "code",
900 | "execution_count": 16,
901 | "id": "b942defe",
902 | "metadata": {},
903 | "outputs": [],
904 | "source": [
905 | "# Final XGBoost predictions for the submission set\n",
906 | "xgboost_Ans_Predict_proba = model.predict_proba(test_data)\n",
907 | "df_test['prob'] = xgboost_Ans_Predict_proba[:, 1]\n",
908 | "# Save the submission file without the index column\n",
909 | "df_test.to_csv(\"xgboost_Ans.csv\", index=False)"
910 | ]
911 | },
912 | {
913 | "cell_type": "code",
914 | "execution_count": null,
915 | "id": "33f75369",
916 | "metadata": {},
917 | "outputs": [],
918 | "source": []
919 | }
920 | ],
921 | "metadata": {
922 | "kernelspec": {
923 | "display_name": "Python 3",
924 | "language": "python",
925 | "name": "python3"
926 | },
927 | "language_info": {
928 | "codemirror_mode": {
929 | "name": "ipython",
930 | "version": 3
931 | },
932 | "file_extension": ".py",
933 | "mimetype": "text/x-python",
934 | "name": "python",
935 | "nbconvert_exporter": "python",
936 | "pygments_lexer": "ipython3",
937 | "version": "3.8.8"
938 | }
939 | },
940 | "nbformat": 4,
941 | "nbformat_minor": 5
942 | }
943 |
--------------------------------------------------------------------------------
/特征工程.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "24b1b8d2",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "# Import packages\n",
11 | "import numpy as np\n",
12 | "import pandas as pd\n",
13 | "import matplotlib.pyplot as plt\n",
14 | "plt.rcParams[\"font.sans-serif\"] = \"SimHei\" # use the SimHei font so Chinese text in plots renders correctly\n",
15 | "import seaborn as sns\n",
16 | "import random\n",
17 | "from sklearn.model_selection import train_test_split\n",
18 | "from sklearn.linear_model import LogisticRegression\n",
19 | "from sklearn.preprocessing import LabelEncoder\n",
20 | "from sklearn.metrics import accuracy_score\n",
21 | "from sklearn import model_selection\n",
22 | "from sklearn.neighbors import KNeighborsRegressor"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 2,
28 | "id": "51717a7b",
29 | "metadata": {},
30 | "outputs": [
31 | {
32 | "name": "stdout",
33 | "output_type": "stream",
34 | "text": [
35 | "(261477, 3) (260864, 3)\n",
36 | "(424170, 3) (54925330, 7)\n"
37 | ]
38 | }
39 | ],
40 | "source": [
41 | "# Load the data\n",
42 | "\n",
43 | "df_train = pd.read_csv(r'../DataMining/data_format1/train_format1.csv')\n",
44 | "df_test = pd.read_csv(r'../DataMining/data_format1/test_format1.csv')\n",
45 | "user_info = pd.read_csv(r'../DataMining/data_format1/user_info_format1.csv')\n",
46 | "user_log = pd.read_csv(r'../DataMining/data_format1/user_log_format1.csv')\n",
47 | "\n",
48 | "print(df_test.shape,df_train.shape)\n",
49 | "print(user_info.shape,user_log.shape)"
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": 3,
55 | "id": "e833f4c8",
56 | "metadata": {},
57 | "outputs": [
58 | {
59 | "name": "stdout",
60 | "output_type": "stream",
61 | "text": [
62 | "<class 'pandas.core.frame.DataFrame'>\n",
63 | "RangeIndex: 424170 entries, 0 to 424169\n",
64 | "Data columns (total 3 columns):\n",
65 | " # Column Non-Null Count Dtype \n",
66 | "--- ------ -------------- ----- \n",
67 | " 0 user_id 424170 non-null int64 \n",
68 | " 1 age_range 329039 non-null float64\n",
69 | " 2 gender 407308 non-null float64\n",
70 | "dtypes: float64(2), int64(1)\n",
71 | "memory usage: 9.7 MB\n"
72 | ]
73 | }
74 | ],
75 | "source": [
76 | "# Treat age_range 0 and gender 2 as missing by replacing them with NaN\n",
77 | "user_info['age_range'] = user_info['age_range'].replace(0.0, np.nan)\n",
78 | "user_info['gender'] = user_info['gender'].replace(2.0, np.nan)\n",
79 | "\n",
80 | "user_info.info()"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 4,
86 | "id": "81f6042b",
87 | "metadata": {},
88 | "outputs": [],
89 | "source": [
90 | "user_info['age_range'] = user_info['age_range'].replace(np.nan, -1)\n",
91 | "user_info['gender'] = user_info['gender'].replace(np.nan, -1)\n",
92 | "# user_info['age_range'].replace(np.nan,1,inplace=True)\n",
93 | "# user_info['gender'].replace(np.nan,0,inplace=True)"
94 | ]
95 | },
96 | {
97 | "cell_type": "code",
98 | "execution_count": 5,
99 | "id": "cfe56eb8",
100 | "metadata": {},
101 | "outputs": [
102 | {
103 | "data": {
104 | "text/plain": [
105 | "Text(0.5, 1.0, '用户年龄分布')"
106 | ]
107 | },
108 | "execution_count": 5,
109 | "metadata": {},
110 | "output_type": "execute_result"
111 | },
112 | {
113 | "data": {
114 | "image/png": "\n",
115 | "text/plain": [
116 | ""
117 | ]
118 | },
119 | "metadata": {
120 | "needs_background": "light"
121 | },
122 | "output_type": "display_data"
123 | }
124 | ],
125 | "source": [
126 | "# Visualize the age distribution\n",
127 | "fig = plt.figure(figsize = (10, 6))\n",
128 | "x = np.array([\"NULL\",\"<18\",\"18-24\",\"25-29\",\"30-34\",\"35-39\",\"40-49\",\">=50\"])\n",
129 | "# age_range codes: 1 = under 18; 2 = 18-24; 3 = 25-29; 4 = 30-34; 5 = 35-39; 6 = 40-49; 7 and 8 = 50 and over\n",
130 | "y = np.array([user_info[user_info['age_range'] == -1]['age_range'].count(),\n",
131 | " user_info[user_info['age_range'] == 1]['age_range'].count(),\n",
132 | " user_info[user_info['age_range'] == 2]['age_range'].count(),\n",
133 | " user_info[user_info['age_range'] == 3]['age_range'].count(),\n",
134 | " user_info[user_info['age_range'] == 4]['age_range'].count(),\n",
135 | " user_info[user_info['age_range'] == 5]['age_range'].count(),\n",
136 | " user_info[user_info['age_range'] == 6]['age_range'].count(),\n",
137 | " user_info[user_info['age_range'] == 7]['age_range'].count() + \n",
138 | " user_info[user_info['age_range'] == 8]['age_range'].count()])\n",
139 | "plt.bar(x,y,label='人数')\n",
140 | "plt.legend()\n",
141 | "plt.title('用户年龄分布')"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": 6,
147 | "id": "21ae9565",
148 | "metadata": {},
149 | "outputs": [
150 | {
151 | "data": {
152 | "text/plain": [
153 | "Text(0.5, 1.0, '用户性别分布')"
154 | ]
155 | },
156 | "execution_count": 6,
157 | "metadata": {},
158 | "output_type": "execute_result"
159 | },
160 | {
161 | "data": {
162 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAY4AAAESCAYAAADqoDJEAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAASfklEQVR4nO3de7BdZ13G8e9DkmppqdPSQ6RViYFwUdtYCNBIrScISKSjpTBWBgEtGjoyon8gWlsUEXFEZYqdAY1ULAWrxWmRUZHgpTZgKpyovah06iVtia0G20mMtRfozz/2itk9yUn2m7Mv53R/PzN7zl6/vS7v4vTsh/d911pJVSFJ0qCeMOkGSJKWF4NDktTE4JAkNTE4JElNDA5JUhODQ2qU5KuO8vmTk6wcV3ukcTM4pAZJvhf4vaOstg14+rzt3p7kp4/heM9M8jcLfPbs7ufaJC8eYF9PSPJXSZ7W2g6pn8GhqZPkSUkeSjLXve5Osrtv+UtJLlxg808BT0nyxCMc4iHg4e5Yv5jkvP7avLa8O8k9Sb7QvR6Zt++HgUcOs90rgD9IEqCArUdpE8Bm4IlVdedR1pOOyO60ptHDwD1VtQEgyVuBr66qd3XLv0Pfl3WSXwdeNm8ff9v7zgbgn6rqld2X+HHAo8ArknwEeB7wceDZwKMH1qmqh7ptHwHeXlUf7I61C3g4yQ8A3wa8Z37jkzwB+Dng0urdwftvSf4EuBzY0rfeRuAaYD/wZeCZwD1J/r5/d/S+B95cVTcc5X83CTA4NJ0K+Nokn+mWTweekOTl3fI6el/2B6wGLj7cF2uSWQ5+uX8D8MHu/bcAPwicBXyg20cBrwPuBs4/sAvgtCTf0i2v6moPcpgeSuetwN6q+sO+2s8ANyX5NeCt1bMDWNO18zXARVX10iQfBN5SVQ8ssH/piAwOTaNHgXur6hxYsMfR78sD7A96vYp/Ap4LXA38MDBXVa/ojvHlqrp83ra3A68EXtNtu4MjDCEn+U7gJ4AX9deran83z/Fp4MYkP1xVt3fbrAF+AXhpt/omeiEmHRODQ9NoReP6q47y+YExq6cAn6EXHLuBTwD7jrRhVX0U+GiSB4D1VXVgbuTQgyRPArYCbwd2JDm+O/aBY5xCL1SeD3yl2+apwB8DTwOu7/b7dfR6J0Wvl/S6qvrjo5yj9P8MDk2jrwKemmSuW15Nb6jq/G55DY8dqnoycHWS/znMvlYC/w5QVVcDJHlLt3x9kl/qho+eDFSSNwC3VtXru6uifpPePMcK4E+6L/YbgX+cf6Cq+u8kz+nC5cok7wF2V9X7uuP+KfCFqtraLa8HrgPeB7ytqr61q/8z8MKqerDrXS00JCYdlsGhaXQ6cFNVfQcceagqyQp68xTnVtVtrQeqqkuASxYYqtoF/BC9nsoH6c2JPB94FfAFDvZk+vfX/yX/HcCPzzuvL/Yt7wV+pqp+P8nbjtTMAU9HAgwOTacXAjsHXHeWXo/gkB7AYlXVg8C/dnMTf11VX0zyHOB/6c2rLDh53d1PsqKqbuorP5XeENmB/e+iF05wmHmTJMfR6wkdbQ5HegyDQ9PoQuBdfcur6P4WuuGdpwMPdHeI/zLwa1X16CF7WdhKYEWSVUBV1WO+mLsv7C9X1aNJngn8LHBe9/GJwANVdR1wXZKnM69H0IXG++muzEpyEr2ruB7su8x3vv673Vd253wLvfmRWxrOTTI4NF2SfB2wsqo+21f+Bw5OgL+W3hfpZ4Az6P0/+MsbD7OK3v0cFwEXJem/J+TV9L7EL+4mu68FfqKqbknyDuCNwE91654PvA24o2/79wEvAb6nqg70mi4GXk/vktyFnDavfSuBZ1bVITcXSkcT/wVAaTKSnEDvSqq/7pafQ6+3cWe3vJZeeH36wD0XXdg86Be+JsngkCQ18VlVkqQmBockqcnjfnL81FNPrTVr1ky6GZK0rOzcufNLVTVzuM8e98GxZs0a
5ubmjr6iJOn/JVnw8fsOVUmSmhgckqQmBockqYnBIUlqYnBIkpoYHJKkJgaHJKmJwSFJamJwSJKaPO7vHNf0uOudZ0y6CY973/Czt066CVoC7HFIkpoYHJKkJgaHJKmJwSFJamJwSJKaGBySpCYGhySpicEhSWpicEiSmhgckqQmBockqYnBIUlqYnBIkpoYHJKkJgaHJKmJwSFJamJwSJKaGBySpCYGhySpicEhSWpicEiSmhgckqQmQw+OJF+T5JNJtiW5PslxSe5KckP3OqNb78okO5Jc1rftMdckSeMxih7Ha4H3VtXLgHuBnwauqarZ7nVrkguAFVW1EVibZN1iaiM4B0nSAlYOe4dV9f6+xRngbuC8JJuAW4E3AbPAtd0624BzgLMWUbtj2OchSTq8kc1xJNkInAx8GnhJVb0AWAV8N3ACsLtb9T5g9SJr84+9Jclckrk9e/YM+cwkabqNJDiSnAJcAVwE3FJV93QfzQHrgP3A8V3txK4di6k9RlVtraoNVbVhZmZmiGcmSRrF5PhxwMeAS6rqTuDqJOuTrADOB24GdtIbYgJYD+xaZE2SNCZDn+MA3gg8F7g0yaXAXwJXAwE+UVV/luQkYHuS04DNwNlALaImSRqTUUyOfwD4wLzyz89bZ1+SWeClwHuqai/AYmqSpPEYRY9jIFV1Pwevjlp0TZI0Ht45LklqYnBIkpoYHJKkJgaHJKmJwSFJamJwSJKaGBySpCYGhySpicEhSWpicEiSmhgckqQmBockqYnBIUlqYnBIkpoYHJKkJgaHJKmJwSFJamJwSJKaGBySpCYGhySpicEhSWpicEiSmhgckqQmBockqYnBIUlqYnBIkpoYHJKkJkMPjiRfk+STSbYluT7JcUmuTLIjyWV96w21Jkkaj1H0OF4LvLeqXgbcC3w/sKKqNgJrk6xLcsEwayM4B0nSAlYOe4dV9f6+xRngB4DLu+VtwDnAWcC1Q6zdMeTTkCQtYGRzHEk2AicDdwO7u/J9wGrghCHX5h97S5K5JHN79uwZ4llJkkYSHElOAa4ALgL2A8d3H53YHXPYtceoqq1VtaGqNszMzAzvxCRJI5kcPw74GHBJVd0J7KQ3nASwHtg1gpokaUyGPscBvBF4LnBpkkuBDwGvS3IasBk4Gyhg+xBrkqQxGXqPo6o+UFUnV9Vs97oKmAVuAjZV1d6q2jfM2rDPQZK0sFH0OA5RVfdz8EqokdQkSePhneOSpCYGhySpicEhSWpicEiSmhgckqQmBockqYnBIUlqYnBIkpoYHJKkJgaHJKmJwSFJamJwSJKaGBySpCYGhySpicEhSWpicEiSmhgckqQmBockqYnBIUlqYnBIkpocU3AkOWfYDZEkLQ8DBUeST88r/dII2iJJWgZWHunDJGcCZwGnJ3l9Vz4BeHDUDZMkLU1H63HkMD//C/i+kbVIkrSkHbHHUVU3AzcneVZVfXhMbZIkLWFHDI4+lyf5fuC4AwWDRJKm06BXVf0p8Ax6Q1UHXpKkKTRoj+O/q+pdI22JJGlZGLTHsT3JNUk2Jzk3yblHWjnJ6iTbu/crk9yV5IbudUZXvzLJjiSX9W13zDVJ0ngMGhyPAF8Ang9sAmYXWjHJycBV9C7bBTgTuKaqZrvXrUkuAFZU1UZgbZJ1i6m1n7Yk6VgNGhy7gH8D7ux+7jrCul8BLgT2dctnA+cl+VzXU1hJL3iu7T7fBpyzyNpjJNmSZC7J3J49ewY8RUnSIFoeORLgeOACYMGhqqraV1V7+0qfB15SVS8AVgHfTa83srv7/D5g9SJr89uwtao2VNWGmZmZhlOUJB3NQJPjVXVV3+JvJHl/wzFuqaqHuvdzwDpgP70QAjiRXoAtpiZJGpNBn1V1bt/r1cA3NRzj6iTrk6wAzgduBnZycIhpPb2hr8XUJEljMujluJuA6t4/DLy54RjvBH6X3lDXJ6rqz5KcRO9KrdOAzfTmQWoRNUnSmAw6zPNu4D+A
U4AvAbcfbYOqmu1+3lZVZ1bVGVV1aVfbR2+S+yZgU1XtXUxtwHOQJA3BoMHx2/QmoT8JnA58aLEHrqr7q+raqrp3GDVJ0ngMOlT19VX1uu79p5L81agaJEla2gYNjn9PcgnwN8BGDl4OK0maMoMOVV1ML2ReTe/GvjeNrEWSpCVt0OD4CHBXVf0o8CR6cx6SpCk0aHCcfOAmwKp6N3Dq6JokSVrKBp3j+GKSnwI+R+9Bh/85uiZJkpayQXscPwg8QG+O43+BN4yqQZKkpW3QZ1U9BFwx4rZIkpYBHxAoSWpicEiSmhgckqQmBockqYnBIUlqYnBIkpoMegOgJI3Ui6540aSb8Lj32R/77FD2Y49DktTE4JAkNTE4JElNDA5JUhODQ5LUxOCQJDUxOCRJTQwOSVITg0OS1MTgkCQ1MTgkSU0MDklSk5EER5LVSbb3LV+ZZEeSy0ZVkySNx9CDI8nJwFXACd3yBcCKqtoIrE2ybti1YZ+DJGlho+hxfAW4ENjXLc8C13bvtwHnjKD2GEm2JJlLMrdnz55Fn5Ak6aChB0dV7auqvX2lE4Dd3fv7gNUjqM1vw9aq2lBVG2ZmZoZxWpKkzjgmx/cDx3fvT+yOOeyaJGlMxvGlu5ODw0nrgV0jqEmSxmQc/3Tsx4HtSU4DNgNnAzXkmiRpTEbW46iq2e7nPnoT2jcBm6pq77BrozoHSdKhxtHjoKru5+CVUCOpSZLGw4llSVITg0OS1MTgkCQ1MTgkSU0MDklSE4NDktTE4JAkNTE4JElNDA5JUhODQ5LUxOCQJDUxOCRJTQwOSVITg0OS1MTgkCQ1MTgkSU0MDklSE4NDktTE4JAkNTE4JElNDA5JUhODQ5LUxOCQJDUxOCRJTQwOSVITg0OS1MTgkCQ1GXlwJFmZ5K4kN3SvM5JcmWRHksv61jvmmiRpfMbR4zgTuKaqZqtqFlgHrKiqjcDaJOuSXHCstTG0X5LUZ+UYjnE2cF6STcCtwEPAtd1n24BzgLMWUbtjxO2XJPUZR4/j88BLquoFwCpgM7C7++w+YDVwwiJqh0iyJclckrk9e/YM92wkacqNIzhuqap7uvdzwKnA8d3yiV0b9i+idoiq2lpVG6pqw8zMzBBPRZI0juC4Osn6JCuA84E30xtiAlgP7AJ2LqImSRqjccxxvBP4XSDAJ4CPA9uTnEZv2OpsoBZRkySN0ch7HFV1W1WdWVVnVNWlVbUPmAVuAjZV1d7F1EbdfknSY42jx3GIqrqfg1dHLbomSRof7xyXJDUxOCRJTQwOSVITg0OS1MTgkCQ1MTgkSU0MDklSE4NDktTE4JAkNTE4JElNDA5JUhODQ5LUxOCQJDUxOCRJTQwOSVITg0OS1MTgkCQ1MTgkSU0m8k/HLlXP+8kPT7oJU2Hnr7x+0k2QtAj2OCRJTQwOSVITg0OS1MTgkCQ1MTgkSU0MDklSE4NDktTE4JAkNVm2wZHkyiQ7klw26bZI0jRZlsGR5AJgRVVtBNYmWTfpNknStFiWwQHMAtd277cB50yuKZI0XZbrs6pOAHZ37+8Dntv/YZItwJZucX+S28fYtnE7FfjSpBvRIr/6hkk3YSlZXr+/n8ukW7CULK/fHZC3NP3+nrbQB8s1OPYDx3fvT2Rez6mqtgJbx92oSUgyV1UbJt0OHRt/f8vXNP/ulutQ1U4ODk+tB3ZNrimSNF2Wa4/j48D2JKcBm4GzJ9scSZoey7LHUVX76E2Q3wRsqqq9k23RRE3FkNzjmL+/5Wtqf3epqkm3QZK0jCzLHockaXIMjmUuyeok2yfdDmnaTPPfnsGxjCU5GbiK3n0tWkZ8ZM7yNu1/ewbH8vYV4EJg36QbosH5yJzHhan+21uul+NOpSS/CTyrr/QXVfXOxLt5l5lZDn1kzh0Ta42adVd2Mq1/ewbHMlJVb5p0GzQUR3xkjrTUOVQljd8RH5kjLXX+ByuNn4/M0bLmDYDSmCU5CdgO/DndI3Om/OkH
WmYMDmkCuss5XwrcWFX3Tro9UguDQ5LUxDkOSVITg0OS1MTgkJaAJO9IMjvpdkiDMDgkSU28c1xapCTHA9cBpwD/AtxG727wpwC3VtWbk7wDWAV8O3AS8HLgIeBjwAogwA1Jngh8uH/b7hg3AJ8Hzqyq7xrbyUmHYY9DWrxnA1+kd1PfM4AHgNuq6lzgqUnO7NZ7Rle7DngxsAX4o6raBDzSrbNlgW3PBnYYGloKDA5p8XYDzwNuBN5H70GUr+x6CWuB07v1Ptz9vAs4DvhG4OauNtf9XGjb26rqutGdgjQ4h6qkxXs58AtVdT1Akhngc1X1oSTn0QuKFwL/M2+7u4BvBv4S+FbgU8Dth9kWes+3kpYEexzS4v0dcEWSv0jye/SCYHOSG4GLgbsX2G4r8Kqud3FSV/utAbeVJsY7x6VFSvIjwGvozVM8AvxqVd0w0UZJI2RwSJKaOFQlSWpicEiSmhgckqQmBockqYnBIUlqYnBIkpr8H3c5ZeKklhTFAAAAAElFTkSuQmCC\n",
163 | "text/plain": [
164 | ""
165 | ]
166 | },
167 | "metadata": {
168 | "needs_background": "light"
169 | },
170 | "output_type": "display_data"
171 | }
172 | ],
173 | "source": [
174 | "sns.countplot(x='gender',order=[-1,0,1],data=user_info)\n",
175 | "plt.title('用户性别分布')"
176 | ]
177 | },
178 | {
179 | "cell_type": "code",
180 | "execution_count": 7,
181 | "id": "a9a9558f",
182 | "metadata": {},
183 | "outputs": [
184 | {
185 | "data": {
186 | "text/plain": [
187 | "'\\n1. Many age values are missing; few gender values are missing\\n2. Most users are aged 18-39\\n3. Most users are female\\n'"
188 | ]
189 | },
190 | "execution_count": 7,
191 | "metadata": {},
192 | "output_type": "execute_result"
193 | },
194 | {
195 | "data": {
196 | "image/png": "\n",
197 | "text/plain": [
198 | ""
199 | ]
200 | },
201 | "metadata": {
202 | "needs_background": "light"
203 | },
204 | "output_type": "display_data"
205 | }
206 | ],
207 | "source": [
208 | "sns.countplot(x='age_range',order=[-1,1,2,3,4,5,6,7,8],hue='gender',data=user_info)\n",
209 | "plt.title('用户年龄-性别分布')\n",
210 | "'''\n",
211 | "1. Many age values are missing; few gender values are missing\n",
212 | "2. Most users are aged 18-39\n",
213 | "3. Most users are female\n",
214 | "'''"
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "execution_count": 8,
220 | "id": "dc6ddb76",
221 | "metadata": {},
222 | "outputs": [],
223 | "source": [
224 | "# Merge features into the training set\n",
225 | "\n",
226 | "df_train = pd.merge(df_train,user_info,on=\"user_id\",how=\"left\")\n",
227 | " \n",
228 | "total_logs_temp = user_log.groupby([user_log[\"user_id\"],user_log[\"seller_id\"]])[\"item_id\"].count().reset_index()\n",
229 | " \n",
230 | "total_logs_temp.rename(columns={\"seller_id\":\"merchant_id\",\"item_id\":\"total_item_id\"},inplace=True)\n",
231 | " \n",
232 | "df_train = pd.merge(df_train,total_logs_temp,on=[\"user_id\",\"merchant_id\"],how=\"left\")\n",
233 | " \n",
234 | "unique_item_id = user_log.groupby([\"user_id\",\"seller_id\",\"item_id\"]).count().reset_index()[[\"user_id\",\"seller_id\",\"item_id\"]]\n",
235 | " \n",
236 | "unique_item_id_cnt = unique_item_id.groupby([\"user_id\",\"seller_id\"]).count().reset_index()\n",
237 | " \n",
238 | "unique_item_id_cnt.rename(columns={\"seller_id\":\"merchant_id\",\"item_id\":\"unique_item_id\"},inplace=True)\n",
239 | " \n",
240 | "df_train = pd.merge(df_train, unique_item_id_cnt, on=[\"user_id\", \"merchant_id\"], how=\"left\")\n",
241 | " \n",
242 | "cat_id_temp = user_log.groupby([\"user_id\", \"seller_id\", \"cat_id\"]).count().reset_index()[[\"user_id\", \"seller_id\", \"cat_id\"]]\n",
243 | " \n",
244 | "cat_id_temp_cnt = cat_id_temp.groupby([\"user_id\", \"seller_id\"]).count().reset_index()\n",
245 | " \n",
246 | "cat_id_temp_cnt.rename(columns={\"seller_id\":\"merchant_id\",\"cat_id\":\"total_cat_id\"},inplace=True)\n",
247 | " \n",
248 | "df_train = pd.merge(df_train, cat_id_temp_cnt, on=[\"user_id\", \"merchant_id\"], how=\"left\")\n",
249 | " \n",
250 | "time_temp = user_log.groupby([\"user_id\", \"seller_id\", \"time_stamp\"]).count().reset_index()[[\"user_id\", \"seller_id\", \"time_stamp\"]]\n",
251 | " \n",
252 | "time_temp_cnt = time_temp.groupby([\"user_id\", \"seller_id\"]).count().reset_index()\n",
253 | " \n",
254 | "time_temp_cnt.rename(columns={\"seller_id\":\"merchant_id\",\"time_stamp\":\"total_time_temp\"},inplace=True)\n",
255 | " \n",
256 | "df_train = pd.merge(df_train, time_temp_cnt, on=[\"user_id\", \"merchant_id\"], how=\"left\")\n",
257 | " \n",
258 | "click_temp = user_log.groupby([\"user_id\", \"seller_id\", \"action_type\"])[\"item_id\"].count().reset_index()\n",
259 | " \n",
260 | "click_temp.rename(columns={\"seller_id\":\"merchant_id\",\"item_id\":\"times\"},inplace=True)\n",
261 | " \n",
262 | "click_temp[\"clicks\"] = click_temp[\"action_type\"] == 0\n",
263 | " \n",
264 | "click_temp[\"clicks\"] = click_temp[\"clicks\"] * click_temp[\"times\"]\n",
265 | " \n",
266 | "click_temp[\"shopping_cart\"] = click_temp[\"action_type\"] == 1\n",
267 | "click_temp[\"shopping_cart\"] = click_temp[\"shopping_cart\"] * click_temp[\"times\"]\n",
268 | " \n",
269 | "click_temp[\"purchases\"] = click_temp[\"action_type\"] == 2\n",
270 | "click_temp[\"purchases\"] = click_temp[\"purchases\"] * click_temp[\"times\"]\n",
271 | " \n",
272 | "click_temp[\"favourites\"] = click_temp[\"action_type\"] == 3\n",
273 | "click_temp[\"favourites\"] = click_temp[\"favourites\"] * click_temp[\"times\"]\n",
274 | " \n",
275 | "four_features = click_temp.groupby([\"user_id\", \"merchant_id\"]).sum().reset_index()\n",
276 | " \n",
277 | "# Drop the helper columns\n",
278 | "four_features = four_features.drop([\"action_type\", \"times\"], axis=1)\n",
279 | " \n",
280 | "# Merge the four action-count features\n",
281 | "df_train = pd.merge(df_train, four_features, on=[\"user_id\", \"merchant_id\"], how=\"left\")\n",
282 | " \n",
283 | "# Forward-fill missing values\n",
284 | "df_train = df_train.ffill()\n",
285 | " "
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": 9,
291 | "id": "c0467512",
292 | "metadata": {},
293 | "outputs": [],
294 | "source": [
295 | "# user_info['age_range'].replace(np.nan,1,inplace=True)\n",
296 | "# user_info['gender'].replace(np.nan,0,inplace=True)\n",
297 | "# df_train['age_range'].replace(-1,np.nan,inplace=True)\n",
298 | "# df_train['gender'].replace(-1,np.nan,inplace=True)"
299 | ]
300 | },
301 | {
302 | "cell_type": "code",
303 | "execution_count": 10,
304 | "id": "91ed1463",
305 | "metadata": {},
306 | "outputs": [
307 | {
308 | "data": {
525 | "text/plain": [
526 | " user_id merchant_id label age_range gender total_item_id \\\n",
527 | "0 34176 3906 0 6.0 0.0 39 \n",
528 | "1 34176 121 0 6.0 0.0 14 \n",
529 | "2 34176 4356 1 6.0 0.0 18 \n",
530 | "3 34176 2217 0 6.0 0.0 2 \n",
531 | "4 230784 4818 0 -1.0 0.0 8 \n",
532 | "... ... ... ... ... ... ... \n",
533 | "260859 359807 4325 0 4.0 1.0 20 \n",
534 | "260860 294527 3971 0 -1.0 1.0 17 \n",
535 | "260861 294527 152 0 -1.0 1.0 9 \n",
536 | "260862 294527 2537 0 -1.0 1.0 1 \n",
537 | "260863 229247 4140 0 4.0 -1.0 24 \n",
538 | "\n",
539 | " unique_item_id total_cat_id total_time_temp clicks shopping_cart \\\n",
540 | "0 20 6 9 36 0 \n",
541 | "1 1 1 3 13 0 \n",
542 | "2 2 1 2 12 0 \n",
543 | "3 1 1 1 1 0 \n",
544 | "4 1 1 3 7 0 \n",
545 | "... ... ... ... ... ... \n",
546 | "260859 6 2 1 18 0 \n",
547 | "260860 3 1 2 13 0 \n",
548 | "260861 1 1 1 7 0 \n",
549 | "260862 1 1 1 0 0 \n",
550 | "260863 15 1 2 23 0 \n",
551 | "\n",
552 | " purchases favourites \n",
553 | "0 1 2 \n",
554 | "1 1 0 \n",
555 | "2 6 0 \n",
556 | "3 1 0 \n",
557 | "4 1 0 \n",
558 | "... ... ... \n",
559 | "260859 2 0 \n",
560 | "260860 1 3 \n",
561 | "260861 1 1 \n",
562 | "260862 1 0 \n",
563 | "260863 1 0 \n",
564 | "\n",
565 | "[260864 rows x 13 columns]"
566 | ]
567 | },
568 | "execution_count": 10,
569 | "metadata": {},
570 | "output_type": "execute_result"
571 | }
572 | ],
573 | "source": [
574 | "# print(df_train.shape)\n",
575 | "# df_train_dropnan=df_train.dropna(axis=0,how='any')\n",
576 | "# df_train_dropnan.shape\n",
577 | "df_train"
578 | ]
579 | },
580 | {
581 | "cell_type": "code",
582 | "execution_count": 11,
583 | "id": "becb8596",
584 | "metadata": {},
585 | "outputs": [],
586 | "source": [
587 | "# Save the engineered features\n",
588 | "df_train.to_csv(\"df_train.csv\",index=False)"
589 | ]
590 | },
591 | {
592 | "cell_type": "code",
593 | "execution_count": null,
594 | "id": "9aeeb576",
595 | "metadata": {},
596 | "outputs": [],
597 | "source": []
598 | }
599 | ],
600 | "metadata": {
601 | "kernelspec": {
602 | "display_name": "Python 3",
603 | "language": "python",
604 | "name": "python3"
605 | },
606 | "language_info": {
607 | "codemirror_mode": {
608 | "name": "ipython",
609 | "version": 3
610 | },
611 | "file_extension": ".py",
612 | "mimetype": "text/x-python",
613 | "name": "python",
614 | "nbconvert_exporter": "python",
615 | "pygments_lexer": "ipython3",
616 | "version": "3.8.8"
617 | }
618 | },
619 | "nbformat": 4,
620 | "nbformat_minor": 5
621 | }
622 |
--------------------------------------------------------------------------------