├── 图片资源
    ├── 图片1.png
    ├── 图片2.png
    ├── 图片3.png
    ├── 图片4.png
    ├── 图片5.png
    ├── 图片6.png
    ├── 图片7.png
    ├── 图片8.png
    ├── 图片9.png
    ├── .DS_Store
    ├── 图片10.png
    ├── 图片11.png
    ├── 图片12.png
    ├── 图片13.png
    ├── 图片14.png
    └── 图片15.jpg
├── .ipynb_checkpoints
    ├── 特征工程-checkpoint.ipynb
    ├── 测试数据特征处理与填充-checkpoint.ipynb
    ├── 数据探索-checkpoint.ipynb
    └── 预测建模-checkpoint.ipynb
├── README.md
├── 测试数据特征处理与填充.ipynb
├── 预测建模.ipynb
└── 特征工程.ipynb

/图片资源/图片1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片1.png
--------------------------------------------------------------------------------
/图片资源/图片2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片2.png
--------------------------------------------------------------------------------
/图片资源/图片3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片3.png
--------------------------------------------------------------------------------
/图片资源/图片4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片4.png
--------------------------------------------------------------------------------
/图片资源/图片5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片5.png
--------------------------------------------------------------------------------
/图片资源/图片6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片6.png
--------------------------------------------------------------------------------
/图片资源/图片7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片7.png
--------------------------------------------------------------------------------
/图片资源/图片8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片8.png
--------------------------------------------------------------------------------
/图片资源/图片9.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片9.png
--------------------------------------------------------------------------------
/图片资源/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/.DS_Store
--------------------------------------------------------------------------------
/图片资源/图片10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片10.png
--------------------------------------------------------------------------------
/图片资源/图片11.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片11.png
--------------------------------------------------------------------------------
/图片资源/图片12.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片12.png
--------------------------------------------------------------------------------
/图片资源/图片13.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片13.png
--------------------------------------------------------------------------------
/图片资源/图片14.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片14.png
--------------------------------------------------------------------------------
/图片资源/图片15.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/2017403603/Data_mining/HEAD/图片资源/图片15.jpg
--------------------------------------------------------------------------------
/.ipynb_checkpoints/特征工程-checkpoint.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}
--------------------------------------------------------------------------------
/.ipynb_checkpoints/测试数据特征处理与填充-checkpoint.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

Tmall Repeat-Buyer Prediction Competition: Technical Report

Team members: 李航程, 姚远舟, 黄建辉, 刘杭达

## 1. Problem Description

### 1.1 Background

Merchants sometimes run large promotions or hand out coupons on specific dates, such as Boxing Day, Black Friday, or Double 11 (November 11), to attract shoppers. Many of the buyers drawn in this way are one-time buyers, however, so these promotions may do little for long-term sales growth. To address this, merchants need to identify which shoppers can be converted into repeat buyers. By targeting these potentially loyal customers, merchants can substantially reduce promotion costs and improve return on investment.

### 1.2 Data

Four data files are provided: training data, test data, user profiles, and user activity logs. The training data has three fields: the user, the merchant, and whether the user is a repeat buyer of that merchant (the label). The user profile file gives the age range and gender for each user id. The activity log records each user's actions (of several types) in different shops over the past six months, together with timestamps. The test set consists of user and merchant pairs for which we must predict whether the user is a repeat buyer of the merchant.

### 1.3 Task

Given these four files, predict for each combination of user id and merchant id in the test data the probability that the user makes a repeat purchase from that merchant.

## 2. Data Exploration

### 2.1 Loading the datasets

```python
import pandas as pd

train_data = pd.read_csv("../DataMining/data_format1/train_format1.csv")
test_data = pd.read_csv("../DataMining/data_format1/test_format1.csv")
user_info = pd.read_csv("../DataMining/data_format1/user_info_format1.csv")
user_log = pd.read_csv("../DataMining/data_format1/user_log_format1.csv")
```

### 2.2 Missing rates of age and gender in the user profiles

```python
# Fraction of NaN values in age_range and in gender
(user_info.shape[0] - user_info["age_range"].count()) / user_info.shape[0]
(user_info.shape[0] - user_info["gender"].count()) / user_info.shape[0]
```

The NaN rate is 0.52% for age and 1.5% for gender. These rates are small, so NaN values alone should have little effect on the final classification. Note that they count NaN only: when the codes that stand for "unknown" (`age_range` 0 and `gender` 2) are also treated as missing, as the exploration notebook does, 95,131 users (about 22.4%) have no usable age and 16,862 users (about 4.0%) have no usable gender. Later on, missing values are replaced by -1 and fed to the models as a feature value of their own.

### 2.3 Missing values in the user activity log

```python
user_log.isna().sum()
```

![图片1](图片资源/图片1.png)

In the activity log only `brand_id` has missing values (91,015 rows); all other columns are complete.

### 2.4 Summary statistics of the user profiles and activity log

```python
user_info.describe()
```

![图片2](图片资源/图片2.png)

The summary statistics of the user profiles suggest an average age of around 30 (the mean `age_range` code is about 2.93) with a large variance, and the buyers are mainly female.

```python
user_log.describe()
```

![图片3](图片资源/图片3.png)

### 2.5 Label distribution

![图片4](图片资源/图片4.png)

The classes are imbalanced: non-repeat buyers far outnumber repeat buyers, so some measure is needed to deal with the imbalance.

### 2.6 Plotting the top 5 merchants

```python
import matplotlib.pyplot as plt
import seaborn as sns

train_data.merchant_id.value_counts().head(5)
# Work on a copy so that train_data itself stays unchanged
train_data_merchant = train_data.copy()
train_data_merchant["TOP5"] = train_data_merchant["merchant_id"].map(lambda x: 1 if x in [4044, 3828, 4173, 1102, 4976] else 0)
train_data_merchant = train_data_merchant[train_data_merchant["TOP5"] == 1]
plt.figure(figsize=(8, 6))
plt.title("Merchant VS Label")
ax = sns.countplot(x="merchant_id", hue="label", data=train_data_merchant)
```

![图片5](图片资源/图片5.png)

A count plot of the five largest merchants shows that they account for nearly half of the data, and that for each of them repeat purchases are far rarer than non-repeat purchases.

### 2.7 Plotting the repeat-purchase rate per merchant

```python
from scipy import stats

train_data.groupby(["merchant_id"])["label"].mean()
# Keep merchants whose repeat-purchase rate lies in (0, 1]
merchant_repeat_buy = [rate for rate in train_data.groupby(["merchant_id"])["label"].mean() if 0 < rate <= 1]
plt.figure(figsize=(8, 4))
ax = plt.subplot(1, 2, 1)
sns.distplot(merchant_repeat_buy, fit=stats.norm)  # distplot is deprecated in recent seaborn releases
ax = plt.subplot(1, 2, 2)
res = stats.probplot(merchant_repeat_buy, plot=plt)
```

![图片6](图片资源/图片6.png)

Because the affected features are not continuous, interpolation cannot be used to fill the gaps. Since the missing fraction is also small, we treat "missing" as a category of its own and encode it as -1.

## 3. Feature Engineering

### 3.1 Merging the datasets

1. Merge the training set df_train with the user profiles from user_info_format1.csv on `user_id`, giving a new df_train.

```python
df_train = pd.merge(df_train, user_info, on="user_id", how="left")
```

2. Merge df_train with the aggregates computed from the activity log user_log_format1.csv on `user_id` and `merchant_id`, giving a new df_train.

```python
df_train = pd.merge(df_train, total_logs_temp, on=["user_id", "merchant_id"], how="left")
```

### 3.2 Feature generation

1. Features generated by simple aggregation and merging
   + Total number of item interactions of each user with each merchant (items counted with repetition). ***total_item_id***
   + Number of distinct items each user interacted with at each merchant. ***unique_item_id***
   + Number of distinct categories of the items each user interacted with at each merchant. ***total_cat_id***
   + Number of distinct days on which each user interacted with each merchant. ***total_time_temp***
   + Number of clicks of each user at each merchant. ***clicks***
   + Number of add-to-cart actions of each user at each merchant. ***shopping_cart***
   + Number of purchases of each user at each merchant. ***purchases***
   + Number of add-to-favourites actions of each user at each merchant. ***favourites***

2. Features generated through analysis

   + Monthly usage count per user

```python
# Assumption: time_stamp is an integer in MMDD form, so the month is time_stamp // 100
user_log['month'] = user_log['time_stamp'] // 100
month_temp = user_log.groupby(['user_id', 'month']).size().reset_index().rename(columns={0: 'cnt'})
month_temp = pd.get_dummies(month_temp, columns=['month'], prefix='user_mcnt')
for i in range(5, 12):
    month_temp['user_mcnt_' + str(i)] = month_temp['cnt'] * month_temp['user_mcnt_' + str(i)]
month_temp = month_temp.groupby(['user_id']).sum().drop(['cnt'], axis=1).reset_index()
```

Rationale: the number of times a user uses Tmall each month captures how the user's behaviour varies over time. Spending may differ between months: it may be higher around the end of the year, the Spring Festival, or Double 11, and lower in summer and winter. Monthly usage counts can reflect these patterns.

   + Merchant features

```python
# Assumption: groups = user_log.groupby('merchant_id'), with seller_id already renamed to merchant_id
temp = groups.size().reset_index().rename(columns={0: 'merchantf1'})
matrix = matrix.merge(temp, on='merchant_id', how='left')
# Select several columns with a list; tuple-style selection fails on recent pandas
temp = groups[['user_id', 'item_id', 'cat_id', 'brand_id']].nunique().reset_index().rename(columns={'user_id': 'merchantf2', 'item_id': 'merchantf3', 'cat_id': 'merchantf4', 'brand_id': 'merchantf5'})
matrix = matrix.merge(temp, on='merchant_id', how='left')
temp = groups['action_type'].value_counts().unstack().reset_index().rename(columns={0: 'merchantf6', 1: 'merchantf7', 2: 'merchantf8', 3: 'merchantf9'})
matrix = matrix.merge(temp, on='merchant_id', how='left')
```

The number of items and brands a merchant sells reflects how popular certain items or brands are, which may to some extent influence the repeat-purchase rate of its customers.

   + Combined user and merchant features

```python
matrix['ratiof1'] = matrix['userf9'] / matrix['userf7']  # user purchase-to-click ratio
matrix['ratiof2'] = matrix['merchantf8'] / matrix['merchantf6']  # merchant purchase-to-click ratio
```

The rate at which a user's clicks, or the clicks a merchant receives, turn into purchases is a good indicator of how popular the items are.

## 4. Candidate Models

1. Logistic regression[1] (LR) is a generalized linear model and one of the most common algorithms for binary classification in machine learning.
2. A decision tree[2] (DT) is a basic method for classification and regression. This report only considers classification trees. The model has a tree structure and, in classification, represents the process of classifying samples based on their features.
3. A random forest[3] (RF) is a classifier that trains multiple decision trees on the samples and combines their predictions. It can be used for regression as well as classification, and is an ensemble learning algorithm built on multiple decision trees.
4. Gradient boosting decision trees[4] (Gradient Boosting Decision Tree, GBDT) are a gradient boosting method that uses CART as the base learner together with an additive model and a forward stagewise algorithm.
5. XGBoost[5] is an open-source machine learning project developed by Tianqi Chen and others. It implements GBDT efficiently with many algorithmic and engineering improvements, and has been widely and successfully used in Kaggle and many other machine learning competitions.

## 5. Comparing the Candidate Models

### 5.1 Loading the training and test data

```python
# Load the engineered training data
df_train = pd.read_csv(r'df_train.csv')
# Load the final test data
test_data = pd.read_csv(r'test_data.csv')
test_data
```

![图片7](图片资源/图片7.png)

### 5.2 Preparing the data for modelling

```python
from sklearn.model_selection import train_test_split

# Separate the label from the features
y = df_train["label"]
X = df_train.drop(["user_id", "merchant_id", "label"], axis=1)
X.head(10)
# Split into training and hold-out sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=8)
```

![图片8](图片资源/图片8.png)

### 5.3 Candidate model: logistic regression

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Logistic regression
Logit = LogisticRegression(solver='liblinear')
Logit.fit(X_train, y_train)
Predict = Logit.predict(X_test)
Predict_proba = Logit.predict_proba(X_test)
print(Predict.shape)
print(Predict[0:20])
print(Predict_proba[:])
Score = accuracy_score(y_test, Predict)
Score
```

![图片9](图片资源/图片9.png)

```python
# Logistic regression: predictions for the final test data
Logit_Ans_Predict_proba = Logit.predict_proba(test_data)
df_test['prob'] = Logit_Ans_Predict_proba[:, 1]
# Save the submission
df_test.to_csv("Logit_Ans.csv", index=None)
```

Submission score: 0.4564939

### 5.4 Candidate model: decision tree

```python
# Decision tree
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)
Predict_proba = tree.predict_proba(X_test)
print(Predict_proba[:])
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))
```

![图片10](图片资源/图片10.png)

```python
# Decision tree: predictions for the final test data
Tree_Ans_Predict_proba = tree.predict_proba(test_data)
df_test['prob'] = Tree_Ans_Predict_proba[:, 1]
# Save the submission
df_test.to_csv("Tree_Ans.csv", index=None)
```

Submission score: 0.5833852

### 5.5 Candidate model: random forest

```python
# Random forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=50, random_state=90, max_depth=5)
rfc = rfc.fit(X_train, y_train)
Predict_proba = rfc.predict_proba(X_test)
print(Predict_proba[:])
print("Accuracy on training set: {:.3f}".format(rfc.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(rfc.score(X_test, y_test)))
```

![图片11](图片资源/图片11.png)

```python
# Random forest: predictions for the final test data
RFC_Ans_Predict_proba = rfc.predict_proba(test_data)
df_test['prob'] = RFC_Ans_Predict_proba[:, 1]
# Save the submission
df_test.to_csv("RFC_Ans.csv", index=None)
```

Submission score: 0.6252815

### 5.6 Candidate model: tuning the random forest

```python
import numpy as np
import matplotlib.pyplot as plt

# Tune n_estimators (the parameter that affects the random forest most) with a learning curve
score_lt = []
# Build one forest every 10 steps and record the score for each n_estimators
for i in range(0, 200, 10):
    print("Progress:", i)
    rfc = RandomForestClassifier(n_estimators=i + 1, random_state=90, max_depth=8)
    rfc = rfc.fit(X_train, y_train)
    score = rfc.score(X_test, y_test)
    score_lt.append(score)
score_max = max(score_lt)
print('Best score: {}'.format(score_max), 'number of trees: {}'.format(score_lt.index(score_max) * 10 + 1))
# Plot the learning curve
x = np.arange(1, 201, 10)
plt.subplot(111)
plt.plot(x, score_lt, 'r-')
plt.show()
```

image-20211125145343834

In the figure above, the x-axis is the value of `n_estimators` and the y-axis is the model's accuracy on the hold-out set. `n_estimators` grows by 10 per iteration (the loop evaluates 1, 11, ..., 191), and the line chart shows the accuracy at each step. According to the figure, the random forest performs best at around `n_estimators` = 100. After tuning, the submission scored 0.6256826.

### 5.7 Candidate model: XGBoost

```python
import xgboost as xgb

def xgb_train(X_train, y_train, X_valid, y_valid, verbose=True):
    model_xgb = xgb.XGBClassifier(
        max_depth=10,  # raw8
        n_estimators=1000,
        min_child_weight=300,
        colsample_bytree=0.8,
        subsample=0.8,
        eta=0.3,
        seed=42
    )
    # Passing eval_metric and early_stopping_rounds to fit() was deprecated in
    # xgboost 1.6; recent releases expect them in the XGBClassifier constructor
    model_xgb.fit(
        X_train,
        y_train,
        eval_metric='auc',
        eval_set=[(X_train, y_train), (X_valid, y_valid)],
        verbose=verbose,
        early_stopping_rounds=10  # early stopping: stop if AUC has not improved for 10 rounds
    )
    print(model_xgb.best_score)
    print("Accuracy on training set: {:.3f}".format(model_xgb.score(X_train, y_train)))
    print("Accuracy on validation set: {:.3f}".format(model_xgb.score(X_valid, y_valid)))
    return model_xgb
```

![图片12](图片资源/图片12.png)

```python
# XGBoost: predictions for the final test data
model_xgb = xgb_train(X_train, y_train, X_valid, y_valid, verbose=False)
prob = model_xgb.predict_proba(test_data)
submission['prob'] = pd.Series(prob[:, 1])
submission.drop(['origin'], axis=1, inplace=True)
submission.to_csv('submission_xgb.csv', index=False)
```

Submission score: 0.6562986

## 6. Final Score and Ranking

Team members: 李航程, 姚远舟, 黄建辉, 刘杭达
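
The notebooks in this report print accuracy on the hold-out split, while the XGBoost model is tuned with `eval_metric='auc'` and the submission scores look like AUC values. With heavily imbalanced labels, accuracy can look high even when the ranking of probabilities is poor. The following is a minimal sketch, not the authors' code, of how accuracy and ROC AUC could be compared on a hold-out split before submitting. It assumes synthetic data from `make_classification` in place of the real training table, and the helper `holdout_scores` is a hypothetical name introduced only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the engineered training table: synthetic,
# imbalanced data (roughly 6% positives), NOT the competition data.
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=4, n_redundant=2,
    n_clusters_per_class=1, class_sep=2.0, weights=[0.94, 0.06],
    random_state=8,
)

# Stratify so that both splits keep the same share of positives.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=8
)


def holdout_scores(model, X_train, y_train, X_valid, y_valid):
    """Fit the model and return (accuracy, ROC AUC) on the hold-out split."""
    model.fit(X_train, y_train)
    # ROC AUC is computed from the predicted probability of the positive class.
    proba = model.predict_proba(X_valid)[:, 1]
    acc = accuracy_score(y_valid, model.predict(X_valid))
    auc = roc_auc_score(y_valid, proba)
    return acc, auc


# class_weight="balanced" is one simple way to account for the imbalance.
model = LogisticRegression(solver="liblinear", class_weight="balanced")
acc, auc = holdout_scores(model, X_train, y_train, X_valid, y_valid)

# Accuracy of always predicting the majority class, as a baseline.
majority_acc = float(1.0 - y_valid.mean())

print("accuracy: {:.3f}".format(acc))
print("ROC AUC: {:.3f}".format(auc))
print("majority-class accuracy: {:.3f}".format(majority_acc))
```

Comparing the model's accuracy with the majority-class baseline shows how little accuracy alone says on imbalanced data, whereas ROC AUC depends only on how well positives are ranked above negatives.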

![图片15](图片资源/图片15.jpg)

## 7. Summary

Our final score and ranking in this competition were not very high. We believe the main cause lies in data preprocessing and feature engineering. In the dataset, roughly ninety thousand age and gender values are effectively missing (95,131 users have an `age_range` that is NaN or 0, and 16,862 have a `gender` that is NaN or 2), and this large amount of missing feature data is one of the main reasons for the low prediction quality. The second reason is feature engineering: we extracted features with fairly simple, traditional methods, which also limits model performance. For modelling we used popular models such as logistic regression, decision trees, random forests, and XGBoost. After training, these models performed similarly on the training set, and XGBoost gave the best result on the test set. As future work we plan to redo the feature engineering and, on the modelling side, to combine several classification algorithms with a bagging-style ensemble to further improve prediction quality.

## 8. References

[1] [https://www.cnblogs.com/phyger/p/14188712.html](https://www.cnblogs.com/phyger/p/14188712.html)

[2] [https://blog.csdn.net/qq_34807908/article/details/81539536](https://blog.csdn.net/qq_34807908/article/details/81539536)

[3] [https://blog.csdn.net/lovenankai/article/details/99966142](https://blog.csdn.net/lovenankai/article/details/99966142)

[4] [https://www.jianshu.com/p/d1f696266814](https://www.jianshu.com/p/d1f696266814)

[5] [http://cran.fhcrc.org/web/packages/xgboost/vignettes/xgboost.pdf](http://cran.fhcrc.org/web/packages/xgboost/vignettes/xgboost.pdf)

--------------------------------------------------------------------------------
/.ipynb_checkpoints/数据探索-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "0d97411e", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import numpy as np\n", 11 | "import pandas as pd\n", 12 | "import matplotlib.pyplot as plt\n", 13 | "import seaborn as sns\n", 14 | "from scipy import stats\n", 15 | "import warnings\n", 16 | "\n", 17 | "warnings.filterwarnings(\"ignore\")" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "id": "30627d48", 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "# Load the data\n", 28 | "train_data = pd.read_csv(\"../DataMining/data_format1/train_format1.csv\")\n", 29 | "test_data = pd.read_csv(\"../DataMining/data_format1/test_format1.csv\")\n", 30 | "\n", 31 | "user_info = 
pd.read_csv(\"../DataMining/data_format1/user_info_format1.csv\")\n", 32 | "user_log = pd.read_csv(\"../DataMining/data_format1/user_log_format1.csv\")" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 3, 38 | "id": "7005f9dd", 39 | "metadata": {}, 40 | "outputs": [ 41 | { 42 | "data": { 43 | "text/plain": [ 44 | "(424170, 3)" 45 | ] 46 | }, 47 | "execution_count": 3, 48 | "metadata": {}, 49 | "output_type": "execute_result" 50 | } 51 | ], 52 | "source": [ 53 | "#1.查看用户信息缺失值-年龄值\n", 54 | "#shape大小:\n", 55 | "user_info.shape" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 4, 61 | "id": "ee333a9d", 62 | "metadata": {}, 63 | "outputs": [ 64 | { 65 | "data": { 66 | "text/plain": [ 67 | "421953" 68 | ] 69 | }, 70 | "execution_count": 4, 71 | "metadata": {}, 72 | "output_type": "execute_result" 73 | } 74 | ], 75 | "source": [ 76 | "#年龄数据总个数:\n", 77 | "user_info[\"age_range\"].count()" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 5, 83 | "id": "c2488c3c", 84 | "metadata": {}, 85 | "outputs": [ 86 | { 87 | "data": { 88 | "text/plain": [ 89 | "0.005226677982884221" 90 | ] 91 | }, 92 | "execution_count": 5, 93 | "metadata": {}, 94 | "output_type": "execute_result" 95 | } 96 | ], 97 | "source": [ 98 | "#缺失率查看:\n", 99 | "(user_info.shape[0]-user_info[\"age_range\"].count())/user_info.shape[0]" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 6, 105 | "id": "8bbe97c9", 106 | "metadata": {}, 107 | "outputs": [ 108 | { 109 | "data": { 110 | "text/plain": [ 111 | "user_id 95131\n", 112 | "age_range 92914\n", 113 | "gender 90664\n", 114 | "dtype: int64" 115 | ] 116 | }, 117 | "execution_count": 6, 118 | "metadata": {}, 119 | "output_type": "execute_result" 120 | } 121 | ], 122 | "source": [ 123 | "##当年龄为空或者等于0时默认为缺失\n", 124 | "#缺失值查看:\n", 125 | "user_info[user_info['age_range'].isna()|(user_info['age_range']==0)].count()" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | 
"execution_count": 7, 131 | "id": "34af97b9", 132 | "metadata": {}, 133 | "outputs": [ 134 | { 135 | "data": { 136 | "text/html": [ 137 | "
" 202 | ], 203 | "text/plain": [ 204 | " user_id\n", 205 | "age_range \n", 206 | "0.0 92914\n", 207 | "1.0 24\n", 208 | "2.0 52871\n", 209 | "3.0 111654\n", 210 | "4.0 79991\n", 211 | "5.0 40777\n", 212 | "6.0 35464\n", 213 | "7.0 6992\n", 214 | "8.0 1266" 215 | ] 216 | }, 217 | "execution_count": 7, 218 | "metadata": {}, 219 | "output_type": "execute_result" 220 | } 221 | ], 222 | "source": [ 223 | "#数据分组查看:\n", 224 | "user_info.groupby(['age_range'])[['user_id']].count()" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": 8, 230 | "id": "02c03a0f", 231 | "metadata": {}, 232 | "outputs": [ 233 | { 234 | "data": { 235 | "text/plain": [ 236 | "2217" 237 | ] 238 | }, 239 | "execution_count": 8, 240 | "metadata": {}, 241 | "output_type": "execute_result" 242 | } 243 | ], 244 | "source": [ 245 | "#空值查看:\n", 246 | "user_info.shape[0]-user_info[\"age_range\"].count()" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": 9, 252 | "id": "b25292e3", 253 | "metadata": {}, 254 | "outputs": [ 255 | { 256 | "data": { 257 | "text/plain": [ 258 | "0.01517316170403376" 259 | ] 260 | }, 261 | "execution_count": 9, 262 | "metadata": {}, 263 | "output_type": "execute_result" 264 | } 265 | ], 266 | "source": [ 267 | "##2.查看用户信息数据的缺失——性别值\n", 268 | "#缺失率查看:\n", 269 | "(user_info.shape[0] - user_info[\"gender\"].count()) / user_info.shape[0]" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 10, 275 | "id": "7d86b971", 276 | "metadata": {}, 277 | "outputs": [ 278 | { 279 | "data": { 280 | "text/plain": [ 281 | "user_id 16862\n", 282 | "age_range 14664\n", 283 | "gender 10426\n", 284 | "dtype: int64" 285 | ] 286 | }, 287 | "execution_count": 10, 288 | "metadata": {}, 289 | "output_type": "execute_result" 290 | } 291 | ], 292 | "source": [ 293 | "# 当性别为空或者等于2时默认为缺失\n", 294 | "# 缺失值查看:\n", 295 | "user_info[user_info['gender'].isna() | (user_info['gender'] == 2)].count()" 296 | ] 297 | }, 298 | { 299 | "cell_type": 
"code", 300 | "execution_count": 11, 301 | "id": "f294b0d4", 302 | "metadata": {}, 303 | "outputs": [ 304 | { 305 | "data": { 306 | "text/html": [ 307 | "
" 348 | ], 349 | "text/plain": [ 350 | " user_id\n", 351 | "gender \n", 352 | "0.0 285638\n", 353 | "1.0 121670\n", 354 | "2.0 10426" 355 | ] 356 | }, 357 | "execution_count": 11, 358 | "metadata": {}, 359 | "output_type": "execute_result" 360 | } 361 | ], 362 | "source": [ 363 | "#数据分组查看:\n", 364 | "user_info.groupby(['gender'])[['user_id']].count()" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": 12, 370 | "id": "7cbe3a5f", 371 | "metadata": {}, 372 | "outputs": [ 373 | { 374 | "data": { 375 | "text/plain": [ 376 | "6436" 377 | ] 378 | }, 379 | "execution_count": 12, 380 | "metadata": {}, 381 | "output_type": "execute_result" 382 | } 383 | ], 384 | "source": [ 385 | "#空值查看:\n", 386 | "user_info.shape[0] - user_info[\"gender\"].count()" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": 13, 392 | "id": "c6b8e6da", 393 | "metadata": {}, 394 | "outputs": [ 395 | { 396 | "data": { 397 | "text/plain": [ 398 | "user_id 106330\n", 399 | "age_range 104113\n", 400 | "gender 99894\n", 401 | "dtype: int64" 402 | ] 403 | }, 404 | "execution_count": 13, 405 | "metadata": {}, 406 | "output_type": "execute_result" 407 | } 408 | ], 409 | "source": [ 410 | "# 查看用户信息数据的缺失——年龄或性别:\n", 411 | "user_info[user_info['age_range'].isna() | (user_info['age_range'] == 0) | user_info['gender'].isna() | (user_info['gender'] == 2)].count()" 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": 14, 417 | "id": "e1b779ef", 418 | "metadata": {}, 419 | "outputs": [ 420 | { 421 | "data": { 422 | "text/plain": [ 423 | "user_id 0\n", 424 | "item_id 0\n", 425 | "cat_id 0\n", 426 | "seller_id 0\n", 427 | "brand_id 91015\n", 428 | "time_stamp 0\n", 429 | "action_type 0\n", 430 | "dtype: int64" 431 | ] 432 | }, 433 | "execution_count": 14, 434 | "metadata": {}, 435 | "output_type": "execute_result" 436 | } 437 | ], 438 | "source": [ 439 | "#3.查看用户信息数据的缺失——用户行为日志数据缺失\n", 440 | "user_log.isna().sum()" 441 | ] 442 | }, 443 | { 444 
| "cell_type": "code", 445 | "execution_count": 15, 446 | "id": "88a6aa39", 447 | "metadata": {}, 448 | "outputs": [ 449 | { 450 | "data": { 451 | "text/html": [ 452 | "
" 527 | ], 528 | "text/plain": [ 529 | " user_id age_range gender\n", 530 | "count 424170.000000 421953.000000 417734.000000\n", 531 | "mean 212085.500000 2.930262 0.341179\n", 532 | "std 122447.476178 1.942978 0.524112\n", 533 | "min 1.000000 0.000000 0.000000\n", 534 | "25% 106043.250000 2.000000 0.000000\n", 535 | "50% 212085.500000 3.000000 0.000000\n", 536 | "75% 318127.750000 4.000000 1.000000\n", 537 | "max 424170.000000 8.000000 2.000000" 538 | ] 539 | }, 540 | "execution_count": 15, 541 | "metadata": {}, 542 | "output_type": "execute_result" 543 | } 544 | ], 545 | "source": [ 546 | "#查看user_info基本数据描述:\n", 547 | "user_info.describe()" 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": 16, 553 | "id": "3ff86b41", 554 | "metadata": {}, 555 | "outputs": [ 556 | { 557 | "data": { 558 | "text/html": [ 559 | "
" 670 | ], 671 | "text/plain": [ 672 | " user_id item_id cat_id seller_id brand_id \\\n", 673 | "count 5.492533e+07 5.492533e+07 5.492533e+07 5.492533e+07 5.483432e+07 \n", 674 | "mean 2.121568e+05 5.538613e+05 8.770308e+02 2.470941e+03 4.153348e+03 \n", 675 | "std 1.222872e+05 3.221459e+05 4.486269e+02 1.473310e+03 2.397679e+03 \n", 676 | "min 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 \n", 677 | "25% 1.063360e+05 2.731680e+05 5.550000e+02 1.151000e+03 2.027000e+03 \n", 678 | "50% 2.126540e+05 5.555290e+05 8.210000e+02 2.459000e+03 4.065000e+03 \n", 679 | "75% 3.177500e+05 8.306890e+05 1.252000e+03 3.760000e+03 6.196000e+03 \n", 680 | "max 4.241700e+05 1.113166e+06 1.671000e+03 4.995000e+03 8.477000e+03 \n", 681 | "\n", 682 | " time_stamp action_type \n", 683 | "count 5.492533e+07 5.492533e+07 \n", 684 | "mean 9.230953e+02 2.854458e-01 \n", 685 | "std 1.954305e+02 8.075806e-01 \n", 686 | "min 5.110000e+02 0.000000e+00 \n", 687 | "25% 7.300000e+02 0.000000e+00 \n", 688 | "50% 1.010000e+03 0.000000e+00 \n", 689 | "75% 1.109000e+03 0.000000e+00 \n", 690 | "max 1.112000e+03 3.000000e+00 " 691 | ] 692 | }, 693 | "execution_count": 16, 694 | "metadata": {}, 695 | "output_type": "execute_result" 696 | } 697 | ], 698 | "source": [ 699 | "#查看user_log基本数据描述:\n", 700 | "user_log.describe()" 701 | ] 702 | }, 703 | { 704 | "cell_type": "code", 705 | "execution_count": null, 706 | "id": "d48b1f40", 707 | "metadata": {}, 708 | "outputs": [], 709 | "source": [] 710 | } 711 | ], 712 | "metadata": { 713 | "kernelspec": { 714 | "display_name": "Python 3", 715 | "language": "python", 716 | "name": "python3" 717 | }, 718 | "language_info": { 719 | "codemirror_mode": { 720 | "name": "ipython", 721 | "version": 3 722 | }, 723 | "file_extension": ".py", 724 | "mimetype": "text/x-python", 725 | "name": "python", 726 | "nbconvert_exporter": "python", 727 | "pygments_lexer": "ipython3", 728 | "version": "3.8.8" 729 | } 730 | }, 731 | "nbformat": 4, 732 | 
"nbformat_minor": 5 733 | } 734 | -------------------------------------------------------------------------------- /测试数据特征处理与填充.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "6c451add", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "#导包\n", 11 | "import numpy as np\n", 12 | "import pandas as pd\n", 13 | "import matplotlib.pyplot as plt\n", 14 | "plt.rcParams[\"font.sans-serif\"] = \"SimHei\" #解决中文乱码问题\n", 15 | "import seaborn as sns\n", 16 | "import random\n", 17 | "from sklearn.model_selection import train_test_split\n", 18 | "from sklearn.linear_model import LogisticRegression\n", 19 | "from sklearn.preprocessing import LabelEncoder\n", 20 | "from sklearn.metrics import accuracy_score\n", 21 | "from sklearn import model_selection\n", 22 | "from sklearn.neighbors import KNeighborsRegressor" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 2, 28 | "id": "33a0082f", 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "df_test = pd.read_csv(r'../DataMining/data_format1\\test_format1.csv')\n", 33 | "user_info = pd.read_csv(r'../DataMining/data_format1\\user_info_format1.csv')\n", 34 | "user_log = pd.read_csv(r'../DataMining/data_format1\\user_log_format1.csv')" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 3, 40 | "id": "d2c315d6", 41 | "metadata": {}, 42 | "outputs": [ 43 | { 44 | "name": "stdout", 45 | "output_type": "stream", 46 | "text": [ 47 | "\n", 48 | "RangeIndex: 424170 entries, 0 to 424169\n", 49 | "Data columns (total 3 columns):\n", 50 | " # Column Non-Null Count Dtype \n", 51 | "--- ------ -------------- ----- \n", 52 | " 0 user_id 424170 non-null int64 \n", 53 | " 1 age_range 329039 non-null float64\n", 54 | " 2 gender 407308 non-null float64\n", 55 | "dtypes: float64(2), int64(1)\n", 56 | "memory usage: 9.7 MB\n" 57 | ] 58 | } 59 | ], 60 | "source": [ 
61 | "#使用空值去替换\n", 62 | "user_info['age_range'].replace(0.0,np.nan,inplace=True)\n", 63 | "user_info['gender'].replace(2.0,np.nan,inplace=True)\n", 64 | "user_info.info()" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": 4, 70 | "id": "d5b34bee", 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "user_info['age_range'].replace(np.nan,-1,inplace=True)\n", 75 | "user_info['gender'].replace(np.nan,-1,inplace=True)\n", 76 | "# user_info['age_range'].replace(np.nan,1,inplace=True)\n", 77 | "# user_info['gender'].replace(np.nan,0,inplace=True)" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 5, 83 | "id": "dc6e724d", 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "#特征值合并\n", 88 | "\n", 89 | "df_test = pd.merge(df_test,user_info,on=\"user_id\",how=\"left\")\n", 90 | " \n", 91 | "total_logs_temp = user_log.groupby([user_log[\"user_id\"],user_log[\"seller_id\"]])[\"item_id\"].count().reset_index()\n", 92 | " \n", 93 | "total_logs_temp.rename(columns={\"seller_id\":\"merchant_id\",\"item_id\":\"total_item_id\"},inplace=True)\n", 94 | " \n", 95 | "df_test = pd.merge(df_test,total_logs_temp,on=[\"user_id\",\"merchant_id\"],how=\"left\")\n", 96 | " \n", 97 | "unique_item_id = user_log.groupby([\"user_id\",\"seller_id\",\"item_id\"]).count().reset_index()[[\"user_id\",\"seller_id\",\"item_id\"]]\n", 98 | " \n", 99 | "unique_item_id_cnt = unique_item_id.groupby([\"user_id\",\"seller_id\"]).count().reset_index()\n", 100 | " \n", 101 | "unique_item_id_cnt.rename(columns={\"seller_id\":\"merchant_id\",\"item_id\":\"unique_item_id\"},inplace=True)\n", 102 | " \n", 103 | "df_test = pd.merge(df_test, unique_item_id_cnt, on=[\"user_id\", \"merchant_id\"], how=\"left\")\n", 104 | " \n", 105 | "cat_id_temp = user_log.groupby([\"user_id\", \"seller_id\", \"cat_id\"]).count().reset_index()[[\"user_id\", \"seller_id\", \"cat_id\"]]\n", 106 | " \n", 107 | "cat_id_temp_cnt = cat_id_temp.groupby([\"user_id\", 
\"seller_id\"]).count().reset_index()\n", 108 | " \n", 109 | "cat_id_temp_cnt.rename(columns={\"seller_id\":\"merchant_id\",\"cat_id\":\"total_cat_id\"},inplace=True)\n", 110 | " \n", 111 | "df_test = pd.merge(df_test, cat_id_temp_cnt, on=[\"user_id\", \"merchant_id\"], how=\"left\")\n", 112 | " \n", 113 | "time_temp = user_log.groupby([\"user_id\", \"seller_id\", \"time_stamp\"]).count().reset_index()[[\"user_id\", \"seller_id\", \"time_stamp\"]]\n", 114 | " \n", 115 | "time_temp_cnt = time_temp.groupby([\"user_id\", \"seller_id\"]).count().reset_index()\n", 116 | " \n", 117 | "time_temp_cnt.rename(columns={\"seller_id\":\"merchant_id\",\"time_stamp\":\"total_time_temp\"},inplace=True)\n", 118 | " \n", 119 | "df_test = pd.merge(df_test, time_temp_cnt, on=[\"user_id\", \"merchant_id\"], how=\"left\")\n", 120 | " \n", 121 | "click_temp = user_log.groupby([\"user_id\", \"seller_id\", \"action_type\"])[\"item_id\"].count().reset_index()\n", 122 | " \n", 123 | "click_temp.rename(columns={\"seller_id\":\"merchant_id\",\"item_id\":\"times\"},inplace=True)\n", 124 | " \n", 125 | "click_temp[\"clicks\"] = click_temp[\"action_type\"] == 0\n", 126 | " \n", 127 | "click_temp[\"clicks\"] = click_temp[\"clicks\"] * click_temp[\"times\"]\n", 128 | " \n", 129 | "click_temp[\"shopping_cart\"] = click_temp[\"action_type\"] == 1\n", 130 | "click_temp[\"shopping_cart\"] = click_temp[\"shopping_cart\"] * click_temp[\"times\"]\n", 131 | " \n", 132 | "click_temp[\"purchases\"] = click_temp[\"action_type\"] == 2\n", 133 | "click_temp[\"purchases\"] = click_temp[\"purchases\"] * click_temp[\"times\"]\n", 134 | " \n", 135 | "click_temp[\"favourites\"] = click_temp[\"action_type\"] == 3\n", 136 | "click_temp[\"favourites\"] = click_temp[\"favourites\"] * click_temp[\"times\"]\n", 137 | " \n", 138 | "four_features = click_temp.groupby([\"user_id\", \"merchant_id\"]).sum().reset_index()\n", 139 | " \n", 140 | "#删除相关列\n", 141 | "four_features = four_features.drop([\"action_type\", \"times\"], 
axis=1)\n", 142 | " \n", 143 | "#合并\n", 144 | "df_test = pd.merge(df_test, four_features, on=[\"user_id\", \"merchant_id\"], how=\"left\")\n", 145 | " \n", 146 | "#缺失值向前填充\n", 147 | "df_test = df_test.fillna(method=\"ffill\")\n", 148 | " " 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": 6, 154 | "id": "43c3bf05", 155 | "metadata": {}, 156 | "outputs": [ 157 | { 158 | "data": { 159 | "text/html": [ 160 | "
\n", 161 | "\n", 174 | "\n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " 
\n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | "
user_idmerchant_idprobage_rangegendertotal_item_idunique_item_idtotal_cat_idtotal_time_tempclicksshopping_cartpurchasesfavourites
01639684605NaN-1.00.021111010
13605761581NaN2.0-1.0109415050
2986881964NaN6.00.061115010
3986883645NaN6.00.01111110010
42952963361NaN2.01.05084547012
..........................................
2614722284793111NaN6.00.052124010
261473979192341NaN8.01.021111010
261474979193971NaN8.01.01652312040
261475326393536NaN-1.00.032112010
261476326393319NaN-1.00.01111210010
\n", 372 | "

261477 rows × 13 columns

\n", 373 | "
" 374 | ], 375 | "text/plain": [ 376 | " user_id merchant_id prob age_range gender total_item_id \\\n", 377 | "0 163968 4605 NaN -1.0 0.0 2 \n", 378 | "1 360576 1581 NaN 2.0 -1.0 10 \n", 379 | "2 98688 1964 NaN 6.0 0.0 6 \n", 380 | "3 98688 3645 NaN 6.0 0.0 11 \n", 381 | "4 295296 3361 NaN 2.0 1.0 50 \n", 382 | "... ... ... ... ... ... ... \n", 383 | "261472 228479 3111 NaN 6.0 0.0 5 \n", 384 | "261473 97919 2341 NaN 8.0 1.0 2 \n", 385 | "261474 97919 3971 NaN 8.0 1.0 16 \n", 386 | "261475 32639 3536 NaN -1.0 0.0 3 \n", 387 | "261476 32639 3319 NaN -1.0 0.0 11 \n", 388 | "\n", 389 | " unique_item_id total_cat_id total_time_temp clicks shopping_cart \\\n", 390 | "0 1 1 1 1 0 \n", 391 | "1 9 4 1 5 0 \n", 392 | "2 1 1 1 5 0 \n", 393 | "3 1 1 1 10 0 \n", 394 | "4 8 4 5 47 0 \n", 395 | "... ... ... ... ... ... \n", 396 | "261472 2 1 2 4 0 \n", 397 | "261473 1 1 1 1 0 \n", 398 | "261474 5 2 3 12 0 \n", 399 | "261475 2 1 1 2 0 \n", 400 | "261476 1 1 2 10 0 \n", 401 | "\n", 402 | " purchases favourites \n", 403 | "0 1 0 \n", 404 | "1 5 0 \n", 405 | "2 1 0 \n", 406 | "3 1 0 \n", 407 | "4 1 2 \n", 408 | "... ... ... \n", 409 | "261472 1 0 \n", 410 | "261473 1 0 \n", 411 | "261474 4 0 \n", 412 | "261475 1 0 \n", 413 | "261476 1 0 \n", 414 | "\n", 415 | "[261477 rows x 13 columns]" 416 | ] 417 | }, 418 | "execution_count": 6, 419 | "metadata": {}, 420 | "output_type": "execute_result" 421 | } 422 | ], 423 | "source": [ 424 | "df_test" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": 7, 430 | "id": "fa6f95a9", 431 | "metadata": {}, 432 | "outputs": [ 433 | { 434 | "data": { 435 | "text/html": [ 436 | "
\n", 437 | "\n", 450 | "\n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " 
\n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | "
age_rangegendertotal_item_idunique_item_idtotal_cat_idtotal_time_tempclicksshopping_cartpurchasesfavourites
0-1.00.021111010
12.0-1.0109415050
26.00.061115010
36.00.01111110010
42.01.05084547012
.................................
2614726.00.052124010
2614738.01.021111010
2614748.01.01652312040
261475-1.00.032112010
261476-1.00.01111210010
\n", 612 | "

261477 rows × 10 columns

\n", 613 | "
" 614 | ], 615 | "text/plain": [ 616 | " age_range gender total_item_id unique_item_id total_cat_id \\\n", 617 | "0 -1.0 0.0 2 1 1 \n", 618 | "1 2.0 -1.0 10 9 4 \n", 619 | "2 6.0 0.0 6 1 1 \n", 620 | "3 6.0 0.0 11 1 1 \n", 621 | "4 2.0 1.0 50 8 4 \n", 622 | "... ... ... ... ... ... \n", 623 | "261472 6.0 0.0 5 2 1 \n", 624 | "261473 8.0 1.0 2 1 1 \n", 625 | "261474 8.0 1.0 16 5 2 \n", 626 | "261475 -1.0 0.0 3 2 1 \n", 627 | "261476 -1.0 0.0 11 1 1 \n", 628 | "\n", 629 | " total_time_temp clicks shopping_cart purchases favourites \n", 630 | "0 1 1 0 1 0 \n", 631 | "1 1 5 0 5 0 \n", 632 | "2 1 5 0 1 0 \n", 633 | "3 1 10 0 1 0 \n", 634 | "4 5 47 0 1 2 \n", 635 | "... ... ... ... ... ... \n", 636 | "261472 2 4 0 1 0 \n", 637 | "261473 1 1 0 1 0 \n", 638 | "261474 3 12 0 4 0 \n", 639 | "261475 1 2 0 1 0 \n", 640 | "261476 2 10 0 1 0 \n", 641 | "\n", 642 | "[261477 rows x 10 columns]" 643 | ] 644 | }, 645 | "execution_count": 7, 646 | "metadata": {}, 647 | "output_type": "execute_result" 648 | } 649 | ], 650 | "source": [ 651 | "#测试数据预处理\n", 652 | "# y = df_train[\"label\"]\n", 653 | "X = df_test.drop([\"user_id\", \"merchant_id\", \"prob\"], axis=1)\n", 654 | "# X['age_range'].replace(-1,3,inplace=True)\n", 655 | "# X['gender'].replace(-1,0,inplace=True)\n", 656 | "X" 657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "execution_count": 8, 662 | "id": "0d5814e4", 663 | "metadata": {}, 664 | "outputs": [], 665 | "source": [ 666 | "#将构建好的特征保存\n", 667 | "X.to_csv(\"test_data.csv\",index=None)" 668 | ] 669 | }, 670 | { 671 | "cell_type": "code", 672 | "execution_count": null, 673 | "id": "d54567e6", 674 | "metadata": {}, 675 | "outputs": [], 676 | "source": [] 677 | } 678 | ], 679 | "metadata": { 680 | "kernelspec": { 681 | "display_name": "Python 3", 682 | "language": "python", 683 | "name": "python3" 684 | }, 685 | "language_info": { 686 | "codemirror_mode": { 687 | "name": "ipython", 688 | "version": 3 689 | }, 690 | "file_extension": ".py", 691 | "mimetype": 
"text/x-python", 692 | "name": "python", 693 | "nbconvert_exporter": "python", 694 | "pygments_lexer": "ipython3", 695 | "version": "3.8.8" 696 | } 697 | }, 698 | "nbformat": 4, 699 | "nbformat_minor": 5 700 | } 701 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/预测建模-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 10, 6 | "id": "8035b6a2", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "#导包\n", 11 | "import numpy as np\n", 12 | "import pandas as pd\n", 13 | "import matplotlib.pyplot as plt\n", 14 | "plt.rcParams[\"font.sans-serif\"] = \"SimHei\" #解决中文乱码问题\n", 15 | "import seaborn as sns\n", 16 | "import random\n", 17 | "from sklearn.model_selection import train_test_split\n", 18 | "from sklearn.linear_model import LogisticRegression\n", 19 | "from sklearn.preprocessing import LabelEncoder\n", 20 | "from sklearn.metrics import accuracy_score\n", 21 | "from sklearn import model_selection\n", 22 | "from sklearn.neighbors import KNeighborsRegressor" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 11, 28 | "id": "4e929396", 29 | "metadata": {}, 30 | "outputs": [ 31 | { 32 | "data": { 33 | "text/html": [ 34 | "
\n", 35 | "\n", 48 | "\n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | "
user_idmerchant_idprob
01639684605NaN
13605761581NaN
2986881964NaN
3986883645NaN
42952963361NaN
............
2614722284793111NaN
261473979192341NaN
261474979193971NaN
261475326393536NaN
261476326393319NaN
\n", 126 | "

261477 rows × 3 columns

\n", 127 | "
" 128 | ], 129 | "text/plain": [ 130 | " user_id merchant_id prob\n", 131 | "0 163968 4605 NaN\n", 132 | "1 360576 1581 NaN\n", 133 | "2 98688 1964 NaN\n", 134 | "3 98688 3645 NaN\n", 135 | "4 295296 3361 NaN\n", 136 | "... ... ... ...\n", 137 | "261472 228479 3111 NaN\n", 138 | "261473 97919 2341 NaN\n", 139 | "261474 97919 3971 NaN\n", 140 | "261475 32639 3536 NaN\n", 141 | "261476 32639 3319 NaN\n", 142 | "\n", 143 | "[261477 rows x 3 columns]" 144 | ] 145 | }, 146 | "execution_count": 11, 147 | "metadata": {}, 148 | "output_type": "execute_result" 149 | } 150 | ], 151 | "source": [ 152 | "#读取数据\n", 153 | "df_train = pd.read_csv(r'df_train.csv')\n", 154 | "df_test = pd.read_csv(r'../DataMining/data_format1\\test_format1.csv')\n", 155 | "df_test\n" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 12, 161 | "id": "16970677", 162 | "metadata": {}, 163 | "outputs": [ 164 | { 165 | "data": { 166 | "text/html": [ 167 | "
\n", 168 | "\n", 181 | "\n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | "
age_rangegendertotal_item_idunique_item_idtotal_cat_idtotal_time_tempclicksshopping_cartpurchasesfavourites
06.00.039206936012
16.00.01411313010
26.00.01821212060
36.00.021111010
4-1.00.081137010
54.01.011110010
65.00.032112010
75.00.0834815378050
85.00.074116010
94.01.041122011
\n", 330 | "
" 331 | ], 332 | "text/plain": [ 333 | " age_range gender total_item_id unique_item_id total_cat_id \\\n", 334 | "0 6.0 0.0 39 20 6 \n", 335 | "1 6.0 0.0 14 1 1 \n", 336 | "2 6.0 0.0 18 2 1 \n", 337 | "3 6.0 0.0 2 1 1 \n", 338 | "4 -1.0 0.0 8 1 1 \n", 339 | "5 4.0 1.0 1 1 1 \n", 340 | "6 5.0 0.0 3 2 1 \n", 341 | "7 5.0 0.0 83 48 15 \n", 342 | "8 5.0 0.0 7 4 1 \n", 343 | "9 4.0 1.0 4 1 1 \n", 344 | "\n", 345 | " total_time_temp clicks shopping_cart purchases favourites \n", 346 | "0 9 36 0 1 2 \n", 347 | "1 3 13 0 1 0 \n", 348 | "2 2 12 0 6 0 \n", 349 | "3 1 1 0 1 0 \n", 350 | "4 3 7 0 1 0 \n", 351 | "5 1 0 0 1 0 \n", 352 | "6 1 2 0 1 0 \n", 353 | "7 3 78 0 5 0 \n", 354 | "8 1 6 0 1 0 \n", 355 | "9 2 2 0 1 1 " 356 | ] 357 | }, 358 | "execution_count": 12, 359 | "metadata": {}, 360 | "output_type": "execute_result" 361 | } 362 | ], 363 | "source": [ 364 | "#建模前预处理\n", 365 | "y = df_train[\"label\"]\n", 366 | "X = df_train.drop([\"user_id\", \"merchant_id\", \"label\"], axis=1)\n", 367 | "X.head(10)\n" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": 13, 373 | "id": "889e9034", 374 | "metadata": {}, 375 | "outputs": [], 376 | "source": [ 377 | "#分割数据\n", 378 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=8)" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": 14, 384 | "id": "b66e524a", 385 | "metadata": {}, 386 | "outputs": [ 387 | { 388 | "data": { 389 | "text/html": [ 390 | "
\n", 391 | "\n", 404 | "\n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " 
\n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | "
age_rangegendertotal_item_idunique_item_idtotal_cat_idtotal_time_tempclicksshopping_cartpurchasesfavourites
0-1.00.021111010
12.0-1.0109415050
26.00.061115010
36.00.01111110010
42.01.05084547012
.................................
2614726.00.052124010
2614738.01.021111010
2614748.01.01652312040
261475-1.00.032112010
261476-1.00.01111210010
\n", 566 | "

261477 rows × 10 columns

\n", 567 | "
" 568 | ], 569 | "text/plain": [ 570 | " age_range gender total_item_id unique_item_id total_cat_id \\\n", 571 | "0 -1.0 0.0 2 1 1 \n", 572 | "1 2.0 -1.0 10 9 4 \n", 573 | "2 6.0 0.0 6 1 1 \n", 574 | "3 6.0 0.0 11 1 1 \n", 575 | "4 2.0 1.0 50 8 4 \n", 576 | "... ... ... ... ... ... \n", 577 | "261472 6.0 0.0 5 2 1 \n", 578 | "261473 8.0 1.0 2 1 1 \n", 579 | "261474 8.0 1.0 16 5 2 \n", 580 | "261475 -1.0 0.0 3 2 1 \n", 581 | "261476 -1.0 0.0 11 1 1 \n", 582 | "\n", 583 | " total_time_temp clicks shopping_cart purchases favourites \n", 584 | "0 1 1 0 1 0 \n", 585 | "1 1 5 0 5 0 \n", 586 | "2 1 5 0 1 0 \n", 587 | "3 1 10 0 1 0 \n", 588 | "4 5 47 0 1 2 \n", 589 | "... ... ... ... ... ... \n", 590 | "261472 2 4 0 1 0 \n", 591 | "261473 1 1 0 1 0 \n", 592 | "261474 3 12 0 4 0 \n", 593 | "261475 1 2 0 1 0 \n", 594 | "261476 2 10 0 1 0 \n", 595 | "\n", 596 | "[261477 rows x 10 columns]" 597 | ] 598 | }, 599 | "execution_count": 14, 600 | "metadata": {}, 601 | "output_type": "execute_result" 602 | } 603 | ], 604 | "source": [ 605 | "#加载最终测试数据\n", 606 | "test_data= pd.read_csv(r'test_data.csv')\n", 607 | "test_data\n" 608 | ] 609 | }, 610 | { 611 | "cell_type": "code", 612 | "execution_count": 20, 613 | "id": "bede42d0", 614 | "metadata": {}, 615 | "outputs": [ 616 | { 617 | "name": "stdout", 618 | "output_type": "stream", 619 | "text": [ 620 | "(52173,)\n", 621 | "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n", 622 | "[[0.92242829 0.07757171]\n", 623 | " [0.95384761 0.04615239]\n", 624 | " [0.93995785 0.06004215]\n", 625 | " ...\n", 626 | " [0.94603563 0.05396437]\n", 627 | " [0.86838486 0.13161514]\n", 628 | " [0.95512153 0.04487847]]\n" 629 | ] 630 | }, 631 | { 632 | "data": { 633 | "text/plain": [ 634 | "0.9391831023709581" 635 | ] 636 | }, 637 | "execution_count": 20, 638 | "metadata": {}, 639 | "output_type": "execute_result" 640 | } 641 | ], 642 | "source": [ 643 | "#logistic回归\n", 644 | "Logit = LogisticRegression(solver='liblinear')\n", 645 | "Logit.fit(X_train, 
y_train)\n", 646 | "Predict = Logit.predict(X_test)\n", 647 | "Predict_proba = Logit.predict_proba(X_test)\n", 648 | "print(Predict.shape)\n", 649 | "print(Predict[0:20])\n", 650 | "print(Predict_proba[:])\n", 651 | "print(\"Accuracy on training set: {:.3f}\".format(Logit.score(X_train, y_train)))\n", 652 | "print(\"Accuracy on test set: {:.3f}\".format(Logit.score(X_test, y_test)))\n", 653 | "Score = accuracy_score(y_test, Predict)\n", 654 | "Score" 655 | ] 656 | }, 657 | { 658 | "cell_type": "code", 659 | "execution_count": 21, 660 | "id": "65787397", 661 | "metadata": {}, 662 | "outputs": [], 663 | "source": [ 664 | "#逻辑回归最终结果获取\n", 665 | "Logit_Ans_Predict_proba = Logit.predict_proba(test_data)\n", 666 | "df_test['prob']=Logit_Ans_Predict_proba[:,1]\n", 667 | "#最终答案保存\n", 668 | "df_test.to_csv(\"Logit_Ans.csv\",index=None)" 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": 22, 674 | "id": "a37fd1e5", 675 | "metadata": {}, 676 | "outputs": [ 677 | { 678 | "name": "stdout", 679 | "output_type": "stream", 680 | "text": [ 681 | "[[0.89765569 0.10234431]\n", 682 | " [0.9609094 0.0390906 ]\n", 683 | " [0.93901148 0.06098852]\n", 684 | " ...\n", 685 | " [0.92812445 0.07187555]\n", 686 | " [0.89765569 0.10234431]\n", 687 | " [0.9609094 0.0390906 ]]\n", 688 | "Accuracy on training set: 0.939\n", 689 | "Accuracy on test set: 0.939\n" 690 | ] 691 | } 692 | ], 693 | "source": [ 694 | "#决策树\n", 695 | "from sklearn.tree import DecisionTreeClassifier\n", 696 | "tree = DecisionTreeClassifier(max_depth=4,random_state=0) \n", 697 | "tree.fit(X_train, y_train)\n", 698 | "Predict_proba = tree.predict_proba(X_test)\n", 699 | "print(Predict_proba[:])\n", 700 | "print(\"Accuracy on training set: {:.3f}\".format(tree.score(X_train, y_train)))\n", 701 | "print(\"Accuracy on test set: {:.3f}\".format(tree.score(X_test, y_test)))" 702 | ] 703 | }, 704 | { 705 | "cell_type": "code", 706 | "execution_count": 23, 707 | "id": "5ed0c662", 708 | "metadata": {}, 709 | 
"outputs": [], 710 | "source": [ 711 | "#决策树最终结果获取\n", 712 | "Tree_Ans_Predict_proba = tree.predict_proba(test_data)\n", 713 | "df_test['prob']=Tree_Ans_Predict_proba[:,1]\n", 714 | "#最终答案保存\n", 715 | "df_test.to_csv(\"Tree_Ans.csv\",index=None)" 716 | ] 717 | }, 718 | { 719 | "cell_type": "code", 720 | "execution_count": 28, 721 | "id": "9c002987", 722 | "metadata": {}, 723 | "outputs": [ 724 | { 725 | "name": "stdout", 726 | "output_type": "stream", 727 | "text": [ 728 | "[[0.90345203 0.09654797]\n", 729 | " [0.96242055 0.03757945]\n", 730 | " [0.92398178 0.07601822]\n", 731 | " ...\n", 732 | " [0.91943483 0.08056517]\n", 733 | " [0.86844252 0.13155748]\n", 734 | " [0.9607207 0.0392793 ]]\n", 735 | "Accuracy on training set: 0.939\n", 736 | "Accuracy on test set: 0.939\n" 737 | ] 738 | } 739 | ], 740 | "source": [ 741 | "#随机森林\n", 742 | "from sklearn.ensemble import RandomForestClassifier\n", 743 | "rfc = RandomForestClassifier(n_estimators=100,random_state=90,max_depth=8)\n", 744 | "rfc = rfc.fit(X_train, y_train)\n", 745 | "Predict_proba = rfc.predict_proba(X_test)\n", 746 | "print(Predict_proba[:])\n", 747 | "print(\"Accuracy on training set: {:.3f}\".format(rfc.score(X_train, y_train))) \n", 748 | "print(\"Accuracy on test set: {:.3f}\".format(rfc.score(X_test, y_test)))" 749 | ] 750 | }, 751 | { 752 | "cell_type": "code", 753 | "execution_count": 29, 754 | "id": "55703385", 755 | "metadata": {}, 756 | "outputs": [], 757 | "source": [ 758 | "#随机森林最终结果获取\n", 759 | "RFC_Ans_Predict_proba = rfc.predict_proba(test_data)\n", 760 | "df_test['prob']=RFC_Ans_Predict_proba[:,1]\n", 761 | "#最终答案保存\n", 762 | "df_test.to_csv(\"RFC_Ans.csv\",index=None)" 763 | ] 764 | }, 765 | { 766 | "cell_type": "code", 767 | "execution_count": 27, 768 | "id": "54978d26", 769 | "metadata": {}, 770 | "outputs": [ 771 | { 772 | "name": "stdout", 773 | "output_type": "stream", 774 | "text": [ 775 | "进度: 0\n", 776 | "进度: 10\n", 777 | "进度: 20\n", 778 | "进度: 30\n", 779 | "进度: 40\n", 780 | 
"进度: 50\n", 781 | "进度: 60\n", 782 | "进度: 70\n", 783 | "进度: 80\n", 784 | "进度: 90\n", 785 | "进度: 100\n", 786 | "进度: 110\n", 787 | "进度: 120\n", 788 | "进度: 130\n", 789 | "进度: 140\n", 790 | "进度: 150\n", 791 | "进度: 160\n", 792 | "进度: 170\n", 793 | "进度: 180\n", 794 | "进度: 190\n", 795 | "最大得分:0.9394897744043854 子树数量为:101\n" 796 | ] 797 | }, 798 | { 799 | "data": { 800 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYgAAAD2CAYAAADMHBAjAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAcVUlEQVR4nO3df5BV9X3/8efLVQyCwiIbMo3ijxlo00SxdL8ojdLNN5JgRki+TCtJ/cLU6jh+S5p2nFjMxHQbwaQ4X3Y64zSOfL8k5etoJtSkalErhEEg0yXJQhA12h8z1ThOmbvIIlkwBvX9/eOcZdnLuXvP7t2798J5PWaYPffc9zn3c89e7ms/58fnKCIwMzMrd1ajG2BmZs3JAWFmZpkcEGZmlskBYWZmmRwQZmaW6exGN2Ckpk+fHpdeemmjm2FmdlrZs2fPwYhoG8kyp11AXHrppfT09DS6GWZmpxVJr410Ge9iMjOzTA4IMzPL5IAwM7NMdQ0ISdMkLZQ0vZ6vY2ZmYy9XQEjaIKlb0j0Vnr9M0lOSdklal85rBTYD84Dtktoq1J0t6ReSnkv/XTFG783MzGpQ9SwmSUuBloiYL+nbkmZFxL+Xla0FVkfEbknfk9QBBHBnOq8VmAvcmlF3BPhuRKwau7dlZma1ytOD6AA2pdNbgGszamYDe9PpEjAlInakQbCApBfRnVUHXAPcKOknaU/llNCSdLukHkk9vb29Od+amZnVIs91EJOAN9LpQyQ9gXKPAZ2SdgOLgK8ASBKwDOgDjleo+whwfUT8l6T/B3wGePLklUfEemA9QHt7u8cnt9NLBOzZAz/8IRw7Nvr1tLTANdfAJz4BEyaMXfvMKsgTEP3AxHR6Mhm9johYI+la4C5gY0T0p/MDWClpNbAkq07S/oh4J11VDzCrtrdk1gTeeQeeew6eeAKefBLeSP/Gkka/zoF7t5x/PtxwA3z2s/CZz8DUqbW21ixTnl1MexjcrTQHeLVC3T5gJtAFIGmVpBXpc1OBw1l1wMOS5khqAT4HPJ+z7WbN5fBhePRRWLYM2tpg0SLYuBHmzUt+HjwI778/+n/HjsE//VOy/h074Oabk9e5/np44AH4xS8avQXsDKNqd5STdAGwC9gG3AB8HvjDiLinrO7rwH9ExMPp41aSYxfnAi8CKyMiMuo+BjwKCHgyIr46XHva29vDQ21Y03jttaSX8MQTsHMnvPsuzJgBixcnf+F/8pMwcWL19YzU++/Dj388+NqvvJLMv+qq5HU/+9lkupYei51RJO2JiPYRLZPnlqPpl/1CYGdEHBhl+8aEA8IaKgJ+9rPBL+bn0w7vb/1W8qX8uc8lPYazxvka1H/7t8E2/cu/JO2cOROWLEnatWCBj1sUXN0CopmclgFx7Bhs3QovvAB//Mdw0UXj34b//E946CF4++3xf+0zxdGjsGULvP568pf57/3e4F/rs2c3unWDSiXYvDkJi61bk9/5lCnw6U/Dhz7U6NZZLT75yST0R8EB0Uyy/pMCTJ4Mq1fDF78IZ4/DYLq//jV0dcG998Lx48nr2+i0tMDHP54Ewo03wgc/2OgWVTfwx8kTTyThdvRoo1tktfiLv4DOzlEt6oBotKxu/
\n", 801 | "text/plain": [ 802 | "
" 803 | ] 804 | }, 805 | "metadata": { 806 | "needs_background": "light" 807 | }, 808 | "output_type": "display_data" 809 | } 810 | ], 811 | "source": [ 812 | "# Tuning: plot a learning curve to choose n_estimators (the parameter with the largest effect on a random forest)\n", 813 | "score_lt = []\n", 814 | "\n", 815 | "# Train one random forest per step of 10 and record the score for each n_estimators\n", 816 | "for i in range(0,200,10):\n", 817 | "    print(\"进度:\",i)\n", 818 | "    rfc = RandomForestClassifier(n_estimators=i+1,random_state=90,max_depth=8)\n", 819 | "    rfc = rfc.fit(X_train, y_train)\n", 820 | "    score = rfc.score(X_test, y_test)\n", 821 | "    score_lt.append(score)\n", 822 | "score_max = max(score_lt)\n", 823 | "print('最大得分:{}'.format(score_max),'子树数量为:{}'.format(score_lt.index(score_max)*10+1))\n", 824 | "\n", 825 | "# Plot the learning curve\n", 826 | "x = np.arange(1,201,10)\n", 827 | "plt.subplot(111)\n", 828 | "plt.plot(x, score_lt, 'r-')\n", 829 | "plt.show()" 830 | ] 831 | }, 832 | { 833 | "cell_type": "code", 834 | "execution_count": 15, 835 | "id": "b124dbb8", 836 | "metadata": {}, 837 | "outputs": [ 838 | { 839 | "name": "stderr", 840 | "output_type": "stream", 841 | "text": [ 842 | "D:\\anaconda3\\lib\\site-packages\\xgboost\\sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].\n", 843 | " warnings.warn(label_encoder_deprecation_msg, UserWarning)\n" 844 | ] 845 | }, 846 | { 847 | "name": "stdout", 848 | "output_type": "stream", 849 | "text": [ 850 | "[15:05:43] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.0/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'.
Explicitly set eval_metric if you'd like to restore the old behavior.\n", 851 | "[[0.87493783 0.12506217]\n", 852 | " [0.9712213 0.02877864]\n", 853 | " [0.8449106 0.1550894 ]\n", 854 | " ...\n", 855 | " [0.87651736 0.12348264]\n", 856 | " [0.916159 0.08384103]\n", 857 | " [0.9761114 0.02388861]]\n", 858 | "Accuracy on training set: 0.939\n", 859 | "Accuracy on test set: 0.939\n" 860 | ] 861 | } 862 | ], 863 | "source": [ 864 | "# Use XGBoost\n", 865 | "from sklearn.model_selection import train_test_split\n", 866 | "from sklearn.ensemble import RandomForestClassifier\n", 867 | "from sklearn.linear_model import LinearRegression\n", 868 | "from sklearn.metrics import classification_report\n", 869 | "import xgboost as xgb\n", 870 | "\n", 871 | "model = xgb.XGBClassifier(\n", 872 | "    max_depth=8,\n", 873 | "    n_estimators=2000,\n", 874 | "    min_child_weight=300,\n", 875 | "    colsample_bytree=0.8,\n", 876 | "    subsample=0.8,\n", 877 | "    eta=0.3,\n", 878 | "    seed=42\n", 879 | ")\n", 880 | "# model.fit(\n", 881 | "#     X_train, y_train,\n", 882 | "#     eval_metric='auc', eval_set=[(X_train, y_train), (X_test, y_test)],\n", 883 | "#     verbose=True,\n", 884 | "#     # Early stopping: stop if AUC has not improved for 30 rounds\n", 885 | "#     early_stopping_rounds=30\n", 886 | "# )\n", 887 | "model.fit(X_train, y_train)\n", 888 | "\n", 889 | "\n", 890 | "Predict_proba = model.predict_proba(X_test)\n", 891 | "print(Predict_proba[:])\n", 892 | "print(\"Accuracy on training set: {:.3f}\".format(model.score(X_train, y_train)))\n", 893 | "print(\"Accuracy on test set: {:.3f}\".format(model.score(X_test, y_test)))\n" 894 | ] 895 | }, 896 | { 897 | "cell_type": "code", 898 | "execution_count": 16, 899 | "id": "b942defe", 900 | "metadata": {}, 901 | "outputs": [], 902 | "source": [ 903 | "# Get the final XGBoost predictions\n", 904 | "xgboost_Ans_Predict_proba = model.predict_proba(test_data)\n", 905 | "df_test['prob']=xgboost_Ans_Predict_proba[:,1]\n", 906 | "# Save the final answers\n", 907 | "df_test.to_csv(\"xgboost_Ans.csv\",index=None)" 908 | ] 909 | }, 910 | {
911 | "cell_type": "code", 912 | "execution_count": null, 913 | "id": "33f75369", 914 | "metadata": {}, 915 | "outputs": [], 916 | "source": [] 917 | } 918 | ], 919 | "metadata": { 920 | "kernelspec": { 921 | "display_name": "Python 3", 922 | "language": "python", 923 | "name": "python3" 924 | }, 925 | "language_info": { 926 | "codemirror_mode": { 927 | "name": "ipython", 928 | "version": 3 929 | }, 930 | "file_extension": ".py", 931 | "mimetype": "text/x-python", 932 | "name": "python", 933 | "nbconvert_exporter": "python", 934 | "pygments_lexer": "ipython3", 935 | "version": "3.8.8" 936 | } 937 | }, 938 | "nbformat": 4, 939 | "nbformat_minor": 5 940 | } 941 | -------------------------------------------------------------------------------- /预测建模.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "8035b6a2", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "# Imports\n", 11 | "import numpy as np\n", 12 | "import pandas as pd\n", 13 | "import matplotlib.pyplot as plt\n", 14 | "plt.rcParams[\"font.sans-serif\"] = \"SimHei\" # Use the SimHei font so Chinese text in plots is not garbled\n", 15 | "import seaborn as sns\n", 16 | "import random\n", 17 | "from sklearn.model_selection import train_test_split\n", 18 | "from sklearn.linear_model import LogisticRegression\n", 19 | "from sklearn.preprocessing import LabelEncoder\n", 20 | "from sklearn.metrics import accuracy_score\n", 21 | "from sklearn import model_selection\n", 22 | "from sklearn.neighbors import KNeighborsRegressor" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 2, 28 | "id": "4e929396", 29 | "metadata": {}, 30 | "outputs": [ 31 | { 32 | "data": { 33 | "text/html": [ 34 | "
        user_id  merchant_id  prob
0        163968         4605   NaN
1        360576         1581   NaN
2         98688         1964   NaN
3         98688         3645   NaN
4        295296         3361   NaN
...         ...          ...   ...
261472   228479         3111   NaN
261473    97919         2341   NaN
261474    97919         3971   NaN
261475    32639         3536   NaN
261476    32639         3319   NaN

261477 rows × 3 columns
" 128 | ], 129 | "text/plain": [ 130 | " user_id merchant_id prob\n", 131 | "0 163968 4605 NaN\n", 132 | "1 360576 1581 NaN\n", 133 | "2 98688 1964 NaN\n", 134 | "3 98688 3645 NaN\n", 135 | "4 295296 3361 NaN\n", 136 | "... ... ... ...\n", 137 | "261472 228479 3111 NaN\n", 138 | "261473 97919 2341 NaN\n", 139 | "261474 97919 3971 NaN\n", 140 | "261475 32639 3536 NaN\n", 141 | "261476 32639 3319 NaN\n", 142 | "\n", 143 | "[261477 rows x 3 columns]" 144 | ] 145 | }, 146 | "execution_count": 2, 147 | "metadata": {}, 148 | "output_type": "execute_result" 149 | } 150 | ], 151 | "source": [ 152 | "# Read the data\n", 153 | "df_train = pd.read_csv(r'df_train.csv')\n", 154 | "df_test = pd.read_csv(r'../DataMining/data_format1\\test_format1.csv')\n", 155 | "df_test\n" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 3, 161 | "id": "16970677", 162 | "metadata": {}, 163 | "outputs": [ 164 | { 165 | "data": { 166 | "text/html": [ 167 | "
   age_range  gender  total_item_id  unique_item_id  total_cat_id  total_time_temp  clicks  shopping_cart  purchases  favourites
0        6.0     0.0             39              20             6                9      36              0          1           2
1        6.0     0.0             14               1             1                3      13              0          1           0
2        6.0     0.0             18               2             1                2      12              0          6           0
3        6.0     0.0              2               1             1                1       1              0          1           0
4       -1.0     0.0              8               1             1                3       7              0          1           0
5        4.0     1.0              1               1             1                1       0              0          1           0
6        5.0     0.0              3               2             1                1       2              0          1           0
7        5.0     0.0             83              48            15                3      78              0          5           0
8        5.0     0.0              7               4             1                1       6              0          1           0
9        4.0     1.0              4               1             1                2       2              0          1           1
" 331 | ], 332 | "text/plain": [ 333 | " age_range gender total_item_id unique_item_id total_cat_id \\\n", 334 | "0 6.0 0.0 39 20 6 \n", 335 | "1 6.0 0.0 14 1 1 \n", 336 | "2 6.0 0.0 18 2 1 \n", 337 | "3 6.0 0.0 2 1 1 \n", 338 | "4 -1.0 0.0 8 1 1 \n", 339 | "5 4.0 1.0 1 1 1 \n", 340 | "6 5.0 0.0 3 2 1 \n", 341 | "7 5.0 0.0 83 48 15 \n", 342 | "8 5.0 0.0 7 4 1 \n", 343 | "9 4.0 1.0 4 1 1 \n", 344 | "\n", 345 | " total_time_temp clicks shopping_cart purchases favourites \n", 346 | "0 9 36 0 1 2 \n", 347 | "1 3 13 0 1 0 \n", 348 | "2 2 12 0 6 0 \n", 349 | "3 1 1 0 1 0 \n", 350 | "4 3 7 0 1 0 \n", 351 | "5 1 0 0 1 0 \n", 352 | "6 1 2 0 1 0 \n", 353 | "7 3 78 0 5 0 \n", 354 | "8 1 6 0 1 0 \n", 355 | "9 2 2 0 1 1 " 356 | ] 357 | }, 358 | "execution_count": 3, 359 | "metadata": {}, 360 | "output_type": "execute_result" 361 | } 362 | ], 363 | "source": [ 364 | "# Preprocessing before modeling\n", 365 | "y = df_train[\"label\"]\n", 366 | "X = df_train.drop([\"user_id\", \"merchant_id\", \"label\"], axis=1)\n", 367 | "X.head(10)\n" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": 4, 373 | "id": "889e9034", 374 | "metadata": {}, 375 | "outputs": [], 376 | "source": [ 377 | "# Split the data\n", 378 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=8)" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": 5, 384 | "id": "b66e524a", 385 | "metadata": {}, 386 | "outputs": [ 387 | { 388 | "data": { 389 | "text/html": [ 390 | "
        age_range  gender  total_item_id  unique_item_id  total_cat_id  total_time_temp  clicks  shopping_cart  purchases  favourites
0            -1.0     0.0              2               1             1                1       1              0          1           0
1             2.0    -1.0             10               9             4                1       5              0          5           0
2             6.0     0.0              6               1             1                1       5              0          1           0
3             6.0     0.0             11               1             1                1      10              0          1           0
4             2.0     1.0             50               8             4                5      47              0          1           2
...           ...     ...            ...             ...           ...              ...     ...            ...        ...         ...
261472        6.0     0.0              5               2             1                2       4              0          1           0
261473        8.0     1.0              2               1             1                1       1              0          1           0
261474        8.0     1.0             16               5             2                3      12              0          4           0
261475       -1.0     0.0              3               2             1                1       2              0          1           0
261476       -1.0     0.0             11               1             1                2      10              0          1           0

261477 rows × 10 columns
" 568 | ], 569 | "text/plain": [ 570 | " age_range gender total_item_id unique_item_id total_cat_id \\\n", 571 | "0 -1.0 0.0 2 1 1 \n", 572 | "1 2.0 -1.0 10 9 4 \n", 573 | "2 6.0 0.0 6 1 1 \n", 574 | "3 6.0 0.0 11 1 1 \n", 575 | "4 2.0 1.0 50 8 4 \n", 576 | "... ... ... ... ... ... \n", 577 | "261472 6.0 0.0 5 2 1 \n", 578 | "261473 8.0 1.0 2 1 1 \n", 579 | "261474 8.0 1.0 16 5 2 \n", 580 | "261475 -1.0 0.0 3 2 1 \n", 581 | "261476 -1.0 0.0 11 1 1 \n", 582 | "\n", 583 | " total_time_temp clicks shopping_cart purchases favourites \n", 584 | "0 1 1 0 1 0 \n", 585 | "1 1 5 0 5 0 \n", 586 | "2 1 5 0 1 0 \n", 587 | "3 1 10 0 1 0 \n", 588 | "4 5 47 0 1 2 \n", 589 | "... ... ... ... ... ... \n", 590 | "261472 2 4 0 1 0 \n", 591 | "261473 1 1 0 1 0 \n", 592 | "261474 3 12 0 4 0 \n", 593 | "261475 1 2 0 1 0 \n", 594 | "261476 2 10 0 1 0 \n", 595 | "\n", 596 | "[261477 rows x 10 columns]" 597 | ] 598 | }, 599 | "execution_count": 5, 600 | "metadata": {}, 601 | "output_type": "execute_result" 602 | } 603 | ], 604 | "source": [ 605 | "# Load the final test data\n", 606 | "test_data= pd.read_csv(r'test_data.csv')\n", 607 | "test_data\n" 608 | ] 609 | }, 610 | { 611 | "cell_type": "code", 612 | "execution_count": 6, 613 | "id": "bede42d0", 614 | "metadata": {}, 615 | "outputs": [ 616 | { 617 | "name": "stdout", 618 | "output_type": "stream", 619 | "text": [ 620 | "(52173,)\n", 621 | "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n", 622 | "[[0.92242829 0.07757171]\n", 623 | " [0.95384761 0.04615239]\n", 624 | " [0.93995785 0.06004215]\n", 625 | " ...\n", 626 | " [0.94603563 0.05396437]\n", 627 | " [0.86838486 0.13161514]\n", 628 | " [0.95512153 0.04487847]]\n", 629 | "Accuracy on training set: 0.939\n", 630 | "Accuracy on test set: 0.939\n" 631 | ] 632 | }, 633 | { 634 | "data": { 635 | "text/plain": [ 636 | "0.9391831023709581" 637 | ] 638 | }, 639 | "execution_count": 6, 640 | "metadata": {}, 641 | "output_type": "execute_result" 642 | } 643 | ], 644 | "source": [ 645 | "# Logistic regression\n", 646 | "Logit = LogisticRegression(solver='liblinear')\n", 647 | "Logit.fit(X_train, y_train)\n", 648 | "Predict = Logit.predict(X_test)\n", 649 | "Predict_proba = Logit.predict_proba(X_test)\n", 650 | "print(Predict.shape)\n", 651 | "print(Predict[0:20])\n", 652 | "print(Predict_proba[:])\n", 653 | "print(\"Accuracy on training set: {:.3f}\".format(Logit.score(X_train, y_train)))\n", 654 | "print(\"Accuracy on test set: {:.3f}\".format(Logit.score(X_test, y_test)))\n", 655 | "Score = accuracy_score(y_test, Predict)\n", 656 | "Score" 657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "execution_count": 21, 662 | "id": "65787397", 663 | "metadata": {}, 664 | "outputs": [], 665 | "source": [ 666 | "# Get the final logistic regression predictions\n", 667 | "Logit_Ans_Predict_proba = Logit.predict_proba(test_data)\n", 668 | "df_test['prob']=Logit_Ans_Predict_proba[:,1]\n", 669 | "# Save the final answers\n", 670 | "df_test.to_csv(\"Logit_Ans.csv\",index=None)" 671 | ] 672 | }, 673 | { 674 | "cell_type": "code", 675 | "execution_count": 22, 676 | "id": "a37fd1e5", 677 | "metadata": {}, 678 | "outputs": [ 679 | { 680 | "name": "stdout", 681 | "output_type": "stream", 682 | "text": [ 683 | "[[0.89765569 0.10234431]\n", 684 | " [0.9609094 0.0390906 ]\n", 685 | " [0.93901148 0.06098852]\n", 686 | " ...\n", 687 | " [0.92812445 0.07187555]\n", 688 | " [0.89765569 0.10234431]\n", 689 | " [0.9609094 0.0390906 ]]\n", 690 | "Accuracy on training set: 0.939\n", 691 | "Accuracy on test set: 0.939\n" 692 | ] 693 | } 694 | ], 695 | "source": [ 696 | "# Decision tree\n", 697 | "from sklearn.tree import DecisionTreeClassifier\n", 698 | "tree = DecisionTreeClassifier(max_depth=4,random_state=0)\n", 699 | "tree.fit(X_train, y_train)\n", 700 | "Predict_proba = tree.predict_proba(X_test)\n", 701 | "print(Predict_proba[:])\n", 702 | "print(\"Accuracy on training set: {:.3f}\".format(tree.score(X_train, y_train)))\n", 703 | "print(\"Accuracy on test set: {:.3f}\".format(tree.score(X_test, y_test)))" 704 | ] 705 | }, 706 | { 707 | "cell_type": "code",
708 | "execution_count": 23, 709 | "id": "5ed0c662", 710 | "metadata": {}, 711 | "outputs": [], 712 | "source": [ 713 | "# Get the final decision tree predictions\n", 714 | "Tree_Ans_Predict_proba = tree.predict_proba(test_data)\n", 715 | "df_test['prob']=Tree_Ans_Predict_proba[:,1]\n", 716 | "# Save the final answers\n", 717 | "df_test.to_csv(\"Tree_Ans.csv\",index=None)" 718 | ] 719 | }, 720 | { 721 | "cell_type": "code", 722 | "execution_count": 28, 723 | "id": "9c002987", 724 | "metadata": {}, 725 | "outputs": [ 726 | { 727 | "name": "stdout", 728 | "output_type": "stream", 729 | "text": [ 730 | "[[0.90345203 0.09654797]\n", 731 | " [0.96242055 0.03757945]\n", 732 | " [0.92398178 0.07601822]\n", 733 | " ...\n", 734 | " [0.91943483 0.08056517]\n", 735 | " [0.86844252 0.13155748]\n", 736 | " [0.9607207 0.0392793 ]]\n", 737 | "Accuracy on training set: 0.939\n", 738 | "Accuracy on test set: 0.939\n" 739 | ] 740 | } 741 | ], 742 | "source": [ 743 | "# Random forest\n", 744 | "from sklearn.ensemble import RandomForestClassifier\n", 745 | "rfc = RandomForestClassifier(n_estimators=100,random_state=90,max_depth=8)\n", 746 | "rfc = rfc.fit(X_train, y_train)\n", 747 | "Predict_proba = rfc.predict_proba(X_test)\n", 748 | "print(Predict_proba[:])\n", 749 | "print(\"Accuracy on training set: {:.3f}\".format(rfc.score(X_train, y_train)))\n", 750 | "print(\"Accuracy on test set: {:.3f}\".format(rfc.score(X_test, y_test)))" 751 | ] 752 | }, 753 | { 754 | "cell_type": "code", 755 | "execution_count": 29, 756 | "id": "55703385", 757 | "metadata": {}, 758 | "outputs": [], 759 | "source": [ 760 | "# Get the final random forest predictions\n", 761 | "RFC_Ans_Predict_proba = rfc.predict_proba(test_data)\n", 762 | "df_test['prob']=RFC_Ans_Predict_proba[:,1]\n", 763 | "# Save the final answers\n", 764 | "df_test.to_csv(\"RFC_Ans.csv\",index=None)" 765 | ] 766 | }, 767 | { 768 | "cell_type": "code", 769 | "execution_count": 27, 770 | "id": "54978d26", 771 | "metadata": {}, 772 | "outputs": [ 773 | { 774 | "name": "stdout", 775 | "output_type": "stream", 776 | "text": [ 777 | "进度: 0\n",
778 | "进度: 10\n", 779 | "进度: 20\n", 780 | "进度: 30\n", 781 | "进度: 40\n", 782 | "进度: 50\n", 783 | "进度: 60\n", 784 | "进度: 70\n", 785 | "进度: 80\n", 786 | "进度: 90\n", 787 | "进度: 100\n", 788 | "进度: 110\n", 789 | "进度: 120\n", 790 | "进度: 130\n", 791 | "进度: 140\n", 792 | "进度: 150\n", 793 | "进度: 160\n", 794 | "进度: 170\n", 795 | "进度: 180\n", 796 | "进度: 190\n", 797 | "最大得分:0.9394897744043854 子树数量为:101\n" 798 | ] 799 | }, 800 | { 801 | "data": { 802 | "image/png": "\n", 803 | "text/plain": [ 804 | "
" 805 | ] 806 | }, 807 | "metadata": { 808 | "needs_background": "light" 809 | }, 810 | "output_type": "display_data" 811 | } 812 | ], 813 | "source": [ 814 | "# Tuning: plot a learning curve to choose n_estimators (the parameter with the largest effect on a random forest)\n", 815 | "score_lt = []\n", 816 | "\n", 817 | "# Train one random forest per step of 10 and record the score for each n_estimators\n", 818 | "for i in range(0,200,10):\n", 819 | "    print(\"进度:\",i)\n", 820 | "    rfc = RandomForestClassifier(n_estimators=i+1,random_state=90,max_depth=8)\n", 821 | "    rfc = rfc.fit(X_train, y_train)\n", 822 | "    score = rfc.score(X_test, y_test)\n", 823 | "    score_lt.append(score)\n", 824 | "score_max = max(score_lt)\n", 825 | "print('最大得分:{}'.format(score_max),'子树数量为:{}'.format(score_lt.index(score_max)*10+1))\n", 826 | "\n", 827 | "# Plot the learning curve\n", 828 | "x = np.arange(1,201,10)\n", 829 | "plt.subplot(111)\n", 830 | "plt.plot(x, score_lt, 'r-')\n", 831 | "plt.show()" 832 | ] 833 | }, 834 | { 835 | "cell_type": "code", 836 | "execution_count": 15, 837 | "id": "b124dbb8", 838 | "metadata": {}, 839 | "outputs": [ 840 | { 841 | "name": "stderr", 842 | "output_type": "stream", 843 | "text": [ 844 | "D:\\anaconda3\\lib\\site-packages\\xgboost\\sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].\n", 845 | " warnings.warn(label_encoder_deprecation_msg, UserWarning)\n" 846 | ] 847 | }, 848 | { 849 | "name": "stdout", 850 | "output_type": "stream", 851 | "text": [ 852 | "[15:05:43] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.0/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'.
Explicitly set eval_metric if you'd like to restore the old behavior.\n", 853 | "[[0.87493783 0.12506217]\n", 854 | " [0.9712213 0.02877864]\n", 855 | " [0.8449106 0.1550894 ]\n", 856 | " ...\n", 857 | " [0.87651736 0.12348264]\n", 858 | " [0.916159 0.08384103]\n", 859 | " [0.9761114 0.02388861]]\n", 860 | "Accuracy on training set: 0.939\n", 861 | "Accuracy on test set: 0.939\n" 862 | ] 863 | } 864 | ], 865 | "source": [ 866 | "# Use XGBoost\n", 867 | "from sklearn.model_selection import train_test_split\n", 868 | "from sklearn.ensemble import RandomForestClassifier\n", 869 | "from sklearn.linear_model import LinearRegression\n", 870 | "from sklearn.metrics import classification_report\n", 871 | "import xgboost as xgb\n", 872 | "\n", 873 | "model = xgb.XGBClassifier(\n", 874 | "    max_depth=8,\n", 875 | "    n_estimators=2000,\n", 876 | "    min_child_weight=300,\n", 877 | "    colsample_bytree=0.8,\n", 878 | "    subsample=0.8,\n", 879 | "    eta=0.3,\n", 880 | "    seed=42\n", 881 | ")\n", 882 | "# model.fit(\n", 883 | "#     X_train, y_train,\n", 884 | "#     eval_metric='auc', eval_set=[(X_train, y_train), (X_test, y_test)],\n", 885 | "#     verbose=True,\n", 886 | "#     # Early stopping: stop if AUC has not improved for 30 rounds\n", 887 | "#     early_stopping_rounds=30\n", 888 | "# )\n", 889 | "model.fit(X_train, y_train)\n", 890 | "\n", 891 | "\n", 892 | "Predict_proba = model.predict_proba(X_test)\n", 893 | "print(Predict_proba[:])\n", 894 | "print(\"Accuracy on training set: {:.3f}\".format(model.score(X_train, y_train)))\n", 895 | "print(\"Accuracy on test set: {:.3f}\".format(model.score(X_test, y_test)))\n" 896 | ] 897 | }, 898 | { 899 | "cell_type": "code", 900 | "execution_count": 16, 901 | "id": "b942defe", 902 | "metadata": {}, 903 | "outputs": [], 904 | "source": [ 905 | "# Get the final XGBoost predictions\n", 906 | "xgboost_Ans_Predict_proba = model.predict_proba(test_data)\n", 907 | "df_test['prob']=xgboost_Ans_Predict_proba[:,1]\n", 908 | "# Save the final answers\n", 909 | "df_test.to_csv(\"xgboost_Ans.csv\",index=None)" 910 | ] 911 | }, 912 | {
913 | "cell_type": "code", 914 | "execution_count": null, 915 | "id": "33f75369", 916 | "metadata": {}, 917 | "outputs": [], 918 | "source": [] 919 | } 920 | ], 921 | "metadata": { 922 | "kernelspec": { 923 | "display_name": "Python 3", 924 | "language": "python", 925 | "name": "python3" 926 | }, 927 | "language_info": { 928 | "codemirror_mode": { 929 | "name": "ipython", 930 | "version": 3 931 | }, 932 | "file_extension": ".py", 933 | "mimetype": "text/x-python", 934 | "name": "python", 935 | "nbconvert_exporter": "python", 936 | "pygments_lexer": "ipython3", 937 | "version": "3.8.8" 938 | } 939 | }, 940 | "nbformat": 4, 941 | "nbformat_minor": 5 942 | } 943 | -------------------------------------------------------------------------------- /特征工程.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "24b1b8d2", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "# Imports\n", 11 | "import numpy as np\n", 12 | "import pandas as pd\n", 13 | "import matplotlib.pyplot as plt\n", 14 | "plt.rcParams[\"font.sans-serif\"] = \"SimHei\" # Use the SimHei font so Chinese text in plots is not garbled\n", 15 | "import seaborn as sns\n", 16 | "import random\n", 17 | "from sklearn.model_selection import train_test_split\n", 18 | "from sklearn.linear_model import LogisticRegression\n", 19 | "from sklearn.preprocessing import LabelEncoder\n", 20 | "from sklearn.metrics import accuracy_score\n", 21 | "from sklearn import model_selection\n", 22 | "from sklearn.neighbors import KNeighborsRegressor" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 2, 28 | "id": "51717a7b", 29 | "metadata": {}, 30 | "outputs": [ 31 | { 32 | "name": "stdout", 33 | "output_type": "stream", 34 | "text": [ 35 | "(261477, 3) (260864, 3)\n", 36 | "(424170, 3) (54925330, 7)\n" 37 | ] 38 | } 39 | ], 40 | "source": [ 41 | "# Read the data\n", 42 | "\n", 43 | "df_train =
pd.read_csv(r'../DataMining/data_format1/train_format1.csv')\n", 44 | "df_test = pd.read_csv(r'../DataMining/data_format1/test_format1.csv')\n", 45 | "user_info = pd.read_csv(r'../DataMining/data_format1/user_info_format1.csv')\n", 46 | "user_log = pd.read_csv(r'../DataMining/data_format1/user_log_format1.csv')\n", 47 | "\n", 48 | "print(df_test.shape,df_train.shape)\n", 49 | "print(user_info.shape,user_log.shape)" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 3, 55 | "id": "e833f4c8", 56 | "metadata": {}, 57 | "outputs": [ 58 | { 59 | "name": "stdout", 60 | "output_type": "stream", 61 | "text": [ 62 | "\n", 63 | "RangeIndex: 424170 entries, 0 to 424169\n", 64 | "Data columns (total 3 columns):\n", 65 | " # Column Non-Null Count Dtype \n", 66 | "--- ------ -------------- ----- \n", 67 | " 0 user_id 424170 non-null int64 \n", 68 | " 1 age_range 329039 non-null float64\n", 69 | " 2 gender 407308 non-null float64\n", 70 | "dtypes: float64(2), int64(1)\n", 71 | "memory usage: 9.7 MB\n" 72 | ] 73 | } 74 | ], 75 | "source": [ 76 | "# Replace the placeholder codes (age_range 0.0, gender 2.0) with NaN\n", 77 | "user_info['age_range'] = user_info['age_range'].replace(0.0,np.nan)\n", 78 | "user_info['gender'] = user_info['gender'].replace(2.0,np.nan)\n", 79 | "\n", 80 | "user_info.info()" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 4, 86 | "id": "81f6042b", 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "user_info['age_range'] = user_info['age_range'].replace(np.nan,-1)\n", 91 | "user_info['gender'] = user_info['gender'].replace(np.nan,-1)\n", 92 | "# user_info['age_range'].replace(np.nan,1,inplace=True)\n", 93 | "# user_info['gender'].replace(np.nan,0,inplace=True)" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 5, 99 | "id": "cfe56eb8", 100 | "metadata": {}, 101 | "outputs": [ 102 | { 103 | "data": { 104 | "text/plain": [ 105 | "Text(0.5, 1.0, '用户年龄分布')" 106 | ] 107 | }, 108 | "execution_count": 5, 109 | "metadata": {}, 110 | "output_type": "execute_result" 111 | }, 112 | { 
113 | "data": { 114 | "image/png": "\n", 115 | "text/plain": [ 116 | "
" 117 | ] 118 | }, 119 | "metadata": { 120 | "needs_background": "light" 121 | }, 122 | "output_type": "display_data" 123 | } 124 | ], 125 | "source": [ 126 | "# Visualize the age distribution\n", 127 | "fig = plt.figure(figsize = (10, 6))\n", 128 | "x = np.array([\"NULL\",\"<18\",\"18-24\",\"25-29\",\"30-34\",\"35-39\",\"40-49\",\">=50\"])\n", 129 | "# age_range codes: 1 is under 18; 2 is [18,24]; 3 is [25,29]; 4 is [30,34]; 5 is [35,39]; 6 is [40,49]; 7 and 8 are 50 and over\n", 130 | "y = np.array([user_info[user_info['age_range'] == -1]['age_range'].count(),\n", 131 | " user_info[user_info['age_range'] == 1]['age_range'].count(),\n", 132 | " user_info[user_info['age_range'] == 2]['age_range'].count(),\n", 133 | " user_info[user_info['age_range'] == 3]['age_range'].count(),\n", 134 | " user_info[user_info['age_range'] == 4]['age_range'].count(),\n", 135 | " user_info[user_info['age_range'] == 5]['age_range'].count(),\n", 136 | " user_info[user_info['age_range'] == 6]['age_range'].count(),\n", 137 | " user_info[user_info['age_range'] == 7]['age_range'].count() + \n", 138 | " user_info[user_info['age_range'] == 8]['age_range'].count()])\n", 139 | "plt.bar(x,y,label='人数')\n", 140 | "plt.legend()\n", 141 | "plt.title('用户年龄分布')" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 6, 147 | "id": "21ae9565", 148 | "metadata": {}, 149 | "outputs": [ 150 | { 151 | "data": { 152 | "text/plain": [ 153 | "Text(0.5, 1.0, '用户性别分布')" 154 | ] 155 | }, 156 | "execution_count": 6, 157 | "metadata": {}, 158 | "output_type": "execute_result" 159 | }, 160 | { 161 | "data": { 162 | "image/png": "\n", 163 | "text/plain": [ 164 | "
" 165 | ] 166 | }, 167 | "metadata": { 168 | "needs_background": "light" 169 | }, 170 | "output_type": "display_data" 171 | } 172 | ], 173 | "source": [ 174 | "sns.countplot(x='gender',order=[-1,0,1],data=user_info)\n", 175 | "plt.title('用户性别分布')" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 7, 181 | "id": "a9a9558f", 182 | "metadata": {}, 183 | "outputs": [ 184 | { 185 | "data": { 186 | "text/plain": [ 187 | "'\\n1. Many age values are missing; few gender values are missing\\n2. Most users are between 18 and 39\\n3. Most users are female\\n'" 188 | ] 189 | }, 190 | "execution_count": 7, 191 | "metadata": {}, 192 | "output_type": "execute_result" 193 | }, 194 | { 195 | "data": { 196 | "image/png": "\n", 197 | "text/plain": [ 198 | "
" 199 | ] 200 | }, 201 | "metadata": { 202 | "needs_background": "light" 203 | }, 204 | "output_type": "display_data" 205 | } 206 | ], 207 | "source": [ 208 | "sns.countplot(x='age_range',order=[-1,1,2,3,4,5,6,7,8],hue='gender',data=user_info)\n", 209 | "plt.title('用户年龄-性别分布')\n", 210 | "'''\n", 211 | "1. Many age values are missing; few gender values are missing\n", 212 | "2. Most users are between 18 and 39\n", 213 | "3. Most users are female\n", 214 | "'''" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 8, 220 | "id": "dc6ddb76", 221 | "metadata": {}, 222 | "outputs": [], 223 | "source": [ 224 | "# Build the features and merge them into the training data\n", 225 | "\n", 226 | "df_train = pd.merge(df_train,user_info,on=\"user_id\",how=\"left\")\n", 227 | " \n", 228 | "total_logs_temp = user_log.groupby([user_log[\"user_id\"],user_log[\"seller_id\"]])[\"item_id\"].count().reset_index()\n", 229 | " \n", 230 | "total_logs_temp.rename(columns={\"seller_id\":\"merchant_id\",\"item_id\":\"total_item_id\"},inplace=True)\n", 231 | " \n", 232 | "df_train = pd.merge(df_train,total_logs_temp,on=[\"user_id\",\"merchant_id\"],how=\"left\")\n", 233 | " \n", 234 | "unique_item_id = user_log.groupby([\"user_id\",\"seller_id\",\"item_id\"]).count().reset_index()[[\"user_id\",\"seller_id\",\"item_id\"]]\n", 235 | " \n", 236 | "unique_item_id_cnt = unique_item_id.groupby([\"user_id\",\"seller_id\"]).count().reset_index()\n", 237 | " \n", 238 | "unique_item_id_cnt.rename(columns={\"seller_id\":\"merchant_id\",\"item_id\":\"unique_item_id\"},inplace=True)\n", 239 | " \n", 240 | "df_train = pd.merge(df_train, unique_item_id_cnt, on=[\"user_id\", \"merchant_id\"], how=\"left\")\n", 241 | " \n", 242 | "cat_id_temp = user_log.groupby([\"user_id\", \"seller_id\", \"cat_id\"]).count().reset_index()[[\"user_id\", \"seller_id\", \"cat_id\"]]\n", 243 | " \n", 244 | "cat_id_temp_cnt = cat_id_temp.groupby([\"user_id\", \"seller_id\"]).count().reset_index()\n", 245 | " \n", 246 | "cat_id_temp_cnt.rename(columns={\"seller_id\":\"merchant_id\",\"cat_id\":\"total_cat_id\"},inplace=True)\n", 247 | " 
\n", 248 | "df_train = pd.merge(df_train, cat_id_temp_cnt, on=[\"user_id\", \"merchant_id\"], how=\"left\")\n", 249 | " \n", 250 | "time_temp = user_log.groupby([\"user_id\", \"seller_id\", \"time_stamp\"]).count().reset_index()[[\"user_id\", \"seller_id\", \"time_stamp\"]]\n", 251 | " \n", 252 | "time_temp_cnt = time_temp.groupby([\"user_id\", \"seller_id\"]).count().reset_index()\n", 253 | " \n", 254 | "time_temp_cnt.rename(columns={\"seller_id\":\"merchant_id\",\"time_stamp\":\"total_time_temp\"},inplace=True)\n", 255 | " \n", 256 | "df_train = pd.merge(df_train, time_temp_cnt, on=[\"user_id\", \"merchant_id\"], how=\"left\")\n", 257 | " \n", 258 | "click_temp = user_log.groupby([\"user_id\", \"seller_id\", \"action_type\"])[\"item_id\"].count().reset_index()\n", 259 | " \n", 260 | "click_temp.rename(columns={\"seller_id\":\"merchant_id\",\"item_id\":\"times\"},inplace=True)\n", 261 | " \n", 262 | "click_temp[\"clicks\"] = click_temp[\"action_type\"] == 0\n", 263 | " \n", 264 | "click_temp[\"clicks\"] = click_temp[\"clicks\"] * click_temp[\"times\"]\n", 265 | " \n", 266 | "click_temp[\"shopping_cart\"] = click_temp[\"action_type\"] == 1\n", 267 | "click_temp[\"shopping_cart\"] = click_temp[\"shopping_cart\"] * click_temp[\"times\"]\n", 268 | " \n", 269 | "click_temp[\"purchases\"] = click_temp[\"action_type\"] == 2\n", 270 | "click_temp[\"purchases\"] = click_temp[\"purchases\"] * click_temp[\"times\"]\n", 271 | " \n", 272 | "click_temp[\"favourites\"] = click_temp[\"action_type\"] == 3\n", 273 | "click_temp[\"favourites\"] = click_temp[\"favourites\"] * click_temp[\"times\"]\n", 274 | " \n", 275 | "four_features = click_temp.groupby([\"user_id\", \"merchant_id\"]).sum().reset_index()\n", 276 | " \n", 277 | "# Drop the helper columns\n", 278 | "four_features = four_features.drop([\"action_type\", \"times\"], axis=1)\n", 279 | " \n", 280 | "# Merge the four action-count features\n", 281 | "df_train = pd.merge(df_train, four_features, on=[\"user_id\", \"merchant_id\"], how=\"left\")\n", 282 | " \n", 283 | 
"# Forward-fill missing values\n", 284 | "df_train = df_train.ffill()\n", 285 | " " 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 9, 291 | "id": "c0467512", 292 | "metadata": {}, 293 | "outputs": [], 294 | "source": [ 295 | "# user_info['age_range'].replace(np.nan,1,inplace=True)\n", 296 | "# user_info['gender'].replace(np.nan,0,inplace=True)\n", 297 | "# df_train['age_range'].replace(-1,np.nan,inplace=True)\n", 298 | "# df_train['gender'].replace(-1,np.nan,inplace=True)" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": 10, 304 | "id": "91ed1463", 305 | "metadata": {}, 306 | "outputs": [ 307 | { 308 | "data": { 309 | "text/html": [ 310 | "
" 524 | ], 525 | "text/plain": [ 526 | " user_id merchant_id label age_range gender total_item_id \\\n", 527 | "0 34176 3906 0 6.0 0.0 39 \n", 528 | "1 34176 121 0 6.0 0.0 14 \n", 529 | "2 34176 4356 1 6.0 0.0 18 \n", 530 | "3 34176 2217 0 6.0 0.0 2 \n", 531 | "4 230784 4818 0 -1.0 0.0 8 \n", 532 | "... ... ... ... ... ... ... \n", 533 | "260859 359807 4325 0 4.0 1.0 20 \n", 534 | "260860 294527 3971 0 -1.0 1.0 17 \n", 535 | "260861 294527 152 0 -1.0 1.0 9 \n", 536 | "260862 294527 2537 0 -1.0 1.0 1 \n", 537 | "260863 229247 4140 0 4.0 -1.0 24 \n", 538 | "\n", 539 | " unique_item_id total_cat_id total_time_temp clicks shopping_cart \\\n", 540 | "0 20 6 9 36 0 \n", 541 | "1 1 1 3 13 0 \n", 542 | "2 2 1 2 12 0 \n", 543 | "3 1 1 1 1 0 \n", 544 | "4 1 1 3 7 0 \n", 545 | "... ... ... ... ... ... \n", 546 | "260859 6 2 1 18 0 \n", 547 | "260860 3 1 2 13 0 \n", 548 | "260861 1 1 1 7 0 \n", 549 | "260862 1 1 1 0 0 \n", 550 | "260863 15 1 2 23 0 \n", 551 | "\n", 552 | " purchases favourites \n", 553 | "0 1 2 \n", 554 | "1 1 0 \n", 555 | "2 6 0 \n", 556 | "3 1 0 \n", 557 | "4 1 0 \n", 558 | "... ... ... 
\n", 559 | "260859 2 0 \n", 560 | "260860 1 3 \n", 561 | "260861 1 1 \n", 562 | "260862 1 0 \n", 563 | "260863 1 0 \n", 564 | "\n", 565 | "[260864 rows x 13 columns]" 566 | ] 567 | }, 568 | "execution_count": 10, 569 | "metadata": {}, 570 | "output_type": "execute_result" 571 | } 572 | ], 573 | "source": [ 574 | "# print(df_train.shape)\n", 575 | "# df_train_dropnan=df_train.dropna(axis=0,how='any')\n", 576 | "# df_train_dropnan.shape\n", 577 | "df_train" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": 11, 583 | "id": "becb8596", 584 | "metadata": {}, 585 | "outputs": [], 586 | "source": [ 587 | "# Save the engineered features\n", 588 | "df_train.to_csv(\"df_train.csv\",index=False)" 589 | ] 590 | }, 591 | { 592 | "cell_type": "code", 593 | "execution_count": null, 594 | "id": "9aeeb576", 595 | "metadata": {}, 596 | "outputs": [], 597 | "source": [] 598 | } 599 | ], 600 | "metadata": { 601 | "kernelspec": { 602 | "display_name": "Python 3", 603 | "language": "python", 604 | "name": "python3" 605 | }, 606 | "language_info": { 607 | "codemirror_mode": { 608 | "name": "ipython", 609 | "version": 3 610 | }, 611 | "file_extension": ".py", 612 | "mimetype": "text/x-python", 613 | "name": "python", 614 | "nbconvert_exporter": "python", 615 | "pygments_lexer": "ipython3", 616 | "version": "3.8.8" 617 | } 618 | }, 619 | "nbformat": 4, 620 | "nbformat_minor": 5 621 | } 622 | --------------------------------------------------------------------------------
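Editorial sketch for 特征工程.ipynb: the feature-merging cell builds each per-(user, merchant) count with its own groupby, rename and merge. The block below is a minimal sketch, not the notebook's code, of how the same eight columns could probably be computed in one pass with pandas named aggregation. `build_pair_features` and `toy_log` are hypothetical names introduced only for illustration; the column names and the action_type codes (0 clicks, 1 shopping_cart, 2 purchases, 3 favourites) are taken from the notebook.

```python
import pandas as pd


def build_pair_features(user_log: pd.DataFrame) -> pd.DataFrame:
    """Aggregate the raw log to one row per (user_id, merchant_id) pair."""
    grouped = user_log.groupby(["user_id", "seller_id"])

    # Row count plus distinct items, categories and days per pair.
    counts = grouped.agg(
        total_item_id=("item_id", "count"),
        unique_item_id=("item_id", "nunique"),
        total_cat_id=("cat_id", "nunique"),
        total_time_temp=("time_stamp", "nunique"),
    )

    # One column per action type; reindex guarantees all four exist
    # even when a code never appears in the log.
    actions = (
        user_log.groupby(["user_id", "seller_id", "action_type"])
        .size()
        .unstack(fill_value=0)
        .reindex(columns=[0, 1, 2, 3], fill_value=0)
        .rename(columns={0: "clicks", 1: "shopping_cart",
                         2: "purchases", 3: "favourites"})
    )

    features = counts.join(actions).reset_index()
    return features.rename(columns={"seller_id": "merchant_id"})


# Tiny made-up log, only to show the shape of the result.
toy_log = pd.DataFrame({
    "user_id":     [1, 1, 1, 2],
    "item_id":     [10, 10, 11, 12],
    "cat_id":      [100, 100, 100, 101],
    "seller_id":   [5, 5, 5, 5],
    "time_stamp":  [1111, 1111, 1112, 1111],
    "action_type": [0, 0, 2, 3],
})
features = build_pair_features(toy_log)
print(features)
```

Under these assumptions a single `pd.merge(df_train, features, on=["user_id", "merchant_id"], how="left")` would replace the chain of merges. It would also leave any unmatched pair as NaN in all eight columns at once, which makes it easier to decide on a fill strategy other than forward fill.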