├── README.md
├── baseline_catboost.py
├── baseline_catboost_2.py
├── baseline_keras.py
├── baseline_lgb.py
├── baseline_xgb.py
├── data_helper.py
├── data_load-3.py
├── feature_enginee.py
├── img
│   └── stacking.jpg
├── result_merge.py
└── search_param.py
/README.md:
--------------------------------------------------------------------------------
1 | # ifly-algorithm_challenge
2 | iFLYTEK mobile-ad anti-fraud algorithm competition; the current score is only 94.48.
3 | 
4 | Competition page: http://challenge.xfyun.cn/2019/gamedetail?type=detail/mobileAD
5 | 
6 | ### Overall workflow
7 | ```
8 | | EDA
9 | | Data preprocessing
10 | | Feature construction
11 | | Model building
12 | | Hyperparameter tuning and feature selection
13 | ```
14 | #### 1. EDA
15 | 
16 | In a data competition, the first and most important thing to do once we get the dataset is exploratory data analysis (EDA). Done well, it gives solid guidance for every later stage, so that preprocessing and feature construction are not done blindly. EDA here covers the following:
17 | - Check each feature for missing values, outliers, and erroneous values
18 | - Check the distribution of each feature: whether continuous features are skewed (i.e. not normally distributed), and how the values of categorical features are distributed
19 | - Check the correlations among the ordinary features
20 | - Check the correlation between each feature and the target
21 | 
22 | 
23 | #### 2. Data preprocessing
24 | 
25 | With the initial picture given by EDA, the data is preprocessed based on those findings. The main steps:
26 | 
27 | - Convert features whose values are stored with the wrong type
28 | - Fill missing values with the mode, remove outliers, and correct erroneous values
29 | - For skewed continuous features, apply a ```log``` or ```Box-Cox``` transform so that they are closer to a normal distribution, reducing the impact of the long tail
30 | - Categorical features are usually ```one-hot encoded```, but the categorical features in this dataset take many distinct values; one-hot encoding (OneHotEncoder) would make the data very high-dimensional and slow the models down, so ```label encoding (LabelEncoder)``` is used here instead
31 | - Bin some continuous features to discretize them.
32 | - Split categorical features such as make, model and osv at a finer granularity, e.g. ```osv : 10.0.3 ---> 10 0 3```; for the make (manufacturer) field, the values are cleaned to reduce noise, e.g. ```huawei/HUAWEI/honor/...``` are all represented as ```HUAWEI```.
33 | 
34 | #### One more note (fairly important)
35 | ```
36 | When the data is very large, how do we handle large-scale modeling? There are three basic approaches:
37 | 1. Sample the original data. The problem is that this may unbalance the positive and negative classes and hurt the final model.
38 | 2. Optimize the data structures or dtypes to reduce memory usage:
39 |    (1) When a feature has no negative values, int32 columns can be down-cast to a smaller unsigned type such as uint8 (provided the values fit).
40 |    (2) Down-cast float64 columns to float32.
41 |    (3) When it does not hurt model quality, convert object columns to the category dtype. This suits categorical features with few distinct values, typically when the number of distinct values is below about 5% of the total number of values. This is the approach used here; see the code for the implementation (a short sketch also follows this block).
42 | 3. Use online learning and related methods.
43 | ```
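Point 2(3) above is what this repo relies on: the baseline scripts import a `reduce_mem_usage` helper from `search_param.py`. The snippet below is only a minimal, illustrative sketch of the same dtype-shrinking idea (the function name `shrink_dataframe`, the 5% threshold argument, and the generic DataFrame `df` are assumptions for illustration), not the repo's implementation:

```python
import numpy as np
import pandas as pd

def shrink_dataframe(df: pd.DataFrame, cat_ratio: float = 0.05) -> pd.DataFrame:
    """Illustrative dtype down-casting, in the spirit of reduce_mem_usage."""
    for col in df.columns:
        dtype = df[col].dtype
        if pd.api.types.is_integer_dtype(dtype):
            # pick the smallest integer type that still holds the observed range
            target = "unsigned" if df[col].min() >= 0 else "integer"
            df[col] = pd.to_numeric(df[col], downcast=target)
        elif pd.api.types.is_float_dtype(dtype):
            df[col] = df[col].astype(np.float32)
        elif dtype == object:
            # only low-cardinality string columns are worth turning into category
            if df[col].nunique() / len(df) < cat_ratio:
                df[col] = df[col].astype("category")
    return df

# usage: all_data = shrink_dataframe(all_data)
```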
44 | #### Feature construction
45 | 
46 | ```
47 | The features fall into four groups:
48 | 1. Raw features
49 | 2. Statistical features: count, max, min, std, nunique, mean, ...
50 | 3. Interaction features: combine the most important features with the other features
51 | 4. Leaf-node features: use the lgb and xgb models to generate leaf-index features
52 | ```
53 | 
54 | #### Model selection
55 | ```
56 | Four models are used here: three machine-learning models and one deep-learning model
57 | 1. lightgbm
58 | 2. xgboost
59 | 3. catboost
60 | 4. MLP
61 | CatBoost turned out to work best. CatBoost has a cat_features parameter through which we can pass the indices of the categorical features. It really does work well, but it is also memory-hungry; in the end, without a machine big enough to run the models, many later ideas basically could not be tried....
62 | ```
63 | #### Hyperparameter tuning
64 | See: https://www.cnblogs.com/pinard/p/11114748.html
65 | 
66 | #### Feature selection
67 | ##### Mainly considered from the following angles:
68 | - Features with a high percentage of missing values
69 | - Highly correlated features (if two ordinary variables are strongly correlated, some of them can be dropped)
70 | - Tree-based feature-importance selection
71 | - Low-importance features
72 | - Single-value (constant) features
73 | 
74 | In our modeling, the following methods were actually used for feature selection:
75 | 
76 | - [x] Feature importances from random forest and lgb
77 | - [x] Filter-based chi-squared selection, wrapper-based recursive feature elimination, and the Pearson correlation coefficient
78 | ```
79 | Reference code:
80 | Chi-squared feature selection:
81 | chi_selector = SelectKBest(chi2, k = self.k)
82 | chi_selector.fit(self.X.values, self.y)
83 | chi_support = chi_selector.get_support(indices=True)  # returns the indices of the selected feature columns
84 | _ = chi_selector.get_support()
85 | ```
86 | 
87 | #### Model fusion
88 | In data competitions, the last step is usually to fuse several models to squeeze out extra score. Model fusion follows the ideas of ensemble learning, which comes in four main flavors: bagging, boosting, stacking, and blending. Their ideas and differences in brief:
89 | ##### 1. bagging: sampling with replacement; the weak learners are unrelated and mutually independent, so they can be fitted in parallel
90 | - [x] ```Sampling with replacement```: random (bootstrap) sampling draws a fixed number of samples from the training set, putting each sample back after it is drawn; the number of samples drawn equals the size of the training set.
91 | - [x] ```Out-of-bag (OOB) data```: for a training set of m samples, each sample has probability 1/m of being picked in a single draw and (1 - 1/m) of not being picked, so after m draws the probability of never being picked is ```(1 - 1/m)^m```, which tends to 1/e, roughly 36.8%, as m goes to infinity. That is, about 36.8% of the data is never sampled after m draws; these are the out-of-bag samples, and they are commonly used to check the model's generalization performance. In sklearn, the ```oob_score``` parameter of the ```RandomForestClassifier()``` class means exactly this; it defaults to ```False``` and can be set to ```True``` if needed.
92 | - [x] ```The weak learners are mutually independent```: bagging treats all weak learners as equals; they are independent and do not influence one another
93 | - [x] ```Parallel fitting```: because the weak learners are independent, each one is trained without being constrained by the others, i.e. there is no sequential dependency, so the models can be trained in parallel.
94 | - [x] Bagging places no restriction on the choice of weak learner; decision trees and neural networks are commonly used
95 | - [x] ```Combination strategy```: voting is generally used for classification tasks, weighted averaging for regression tasks
96 | - [x] Because each weak learner is trained on a differently resampled dataset, bagging generalizes well, i.e. has low variance, but fits the training set a bit less tightly, i.e. has higher bias.
97 | - [x] Representative algorithm: ```random forest (RF)``` --- on top of bagging's random row sampling, it also performs ```random sampling of the feature attributes```, which greatly speeds up training, and it uses ```CART``` trees as the weak learners.
98 | 
99 | ##### 2. boosting: sampling without replacement; the weak learners are linked rather than independent, and are fitted sequentially
100 | - [x] ```Sampling without replacement```
101 | - [x] Core idea of boosting: an additive model trained with the forward stagewise algorithm
102 | - [x] Loss functions: the squared error loss for regression problems, the exponential loss for classification problems
103 | - [x] The current strong learner is the combination of the previous round's strong learner and the current weak learner, so the weak learners are not independent of one another; they influence each other and can only be fitted sequentially. This is a drawback of boosting. Among the representative algorithms, XGBoost, for example, uses multiple threads to achieve partial parallelism within each round and speed up training.
104 | - [x] Boosting reduces the fitting error on the training set, but its variance is relatively high.
105 | - [x] Representative algorithms: AdaBoost, GBDT, LightGBM, XGBoost, CatBoost
106 | 
107 | ##### 3. stacking: base (level-0) learners and a meta (level-1) learner
108 | The figure below is borrowed from [this reference](https://www.cnblogs.com/gczr/p/7144508.html):
109 | ![img](https://github.com/jiangzhongkai/ifly-algorithm_challenge/blob/master/img/stacking.jpg)
110 | 
111 | Following the figure, the concrete stacking steps are (a code sketch follows this list):
112 | 
113 | (1) Split TrainingData into 5 folds, which yields exactly 5 models; each model predicts the 1/5 of the training data it did not see, and the pieces are concatenated into a complete set of training Predictions with the same number of rows as TrainingData.
114 | 
115 | (2) For TestData, each of model1-model5 predicts the whole test set, producing 5 complete sets of predictions (the green part); averaging these 5 gives the test-set Predictions.
116 | 
117 | (3) Steps (1) and (2) use only one algorithm; with three algorithms you get three pairs of "training Predictions and test Predictions", which can be viewed as three new feature columns: three for the training set and three for the test set.
118 | 
119 | (4) The 3 columns of training ```Predictions plus TrainingData```'s y values form the new training data; the 3 columns of test Predictions are the new test data.
120 | 
121 | (5) A ```meta model``` (a model on top of the models) is then used: pick one more algorithm, fit it on this new data, and its predictions are the final data to submit.
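A minimal sketch of steps (1)-(5), written against scikit-learn-style estimators with hypothetical arrays `X_train`, `y_train`, `X_test` (an illustration of the procedure, not the repo's code):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

def oof_stack(base_models, meta_model, X_train, y_train, X_test, n_splits=5, seed=2019):
    """Out-of-fold stacking: one meta-feature column per base model."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    train_meta = np.zeros((X_train.shape[0], len(base_models)))
    test_meta = np.zeros((X_test.shape[0], len(base_models)))
    for j, model in enumerate(base_models):
        test_fold_pred = np.zeros((X_test.shape[0], n_splits))
        for k, (tr_idx, val_idx) in enumerate(skf.split(X_train, y_train)):
            model.fit(X_train[tr_idx], y_train[tr_idx])
            # step (1): each fold model predicts the 1/n_splits it did not see
            train_meta[val_idx, j] = model.predict_proba(X_train[val_idx])[:, 1]
            # step (2): each fold model also predicts the whole test set
            test_fold_pred[:, k] = model.predict_proba(X_test)[:, 1]
        test_meta[:, j] = test_fold_pred.mean(axis=1)  # average the n_splits test predictions
    # steps (4)-(5): fit the meta model on the new features and predict the test set
    meta_model.fit(train_meta, y_train)
    return meta_model.predict_proba(test_meta)[:, 1]

# usage (hypothetical):
# final_prob = oof_stack([LGBMClassifier(), XGBClassifier()], LogisticRegression(),
#                        X_train, y_train, X_test)
```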
122 | 
123 | ##### 4. blending
124 | The basic idea: suppose the full dataset has 12500 rows, with 10000 rows of training data and 2500 rows of testing data. Split the training data into 7000 and 3000 rows, train m models on the 7000 rows, and then have the m models predict the 3000 held-out rows; concatenating their predictions gives a 3000 x m block, which together with the corresponding labels serves as the second-level training data. At the same time the m models predict the testing data, and those predictions become the second-level test data. Finally one more model is fitted on the second level and used to predict. (Similar in spirit to stacking, but the differences are still substantial.) The main pros and cons:
125 | ##### Pros:
126 | 
127 | 1. Simpler than stacking (no cross-validation is needed to obtain the new features)
128 | 
129 | 2. Avoids the information-leakage problem
130 | ##### Cons:
131 | 
132 | 1. Uses only a small portion of the data
133 | 
134 | 2. The blender may overfit
135 | 
136 | 3. Stacking, with its repeated CV, is more robust.
137 | 
138 | 
139 | At prediction time the models output probabilities, and a weighted fusion is used here: ```catboost*0.5 + lgb*0.2 + xgb*0.2 + MLP*0.1```, after which the final class labels are produced.
140 | 
141 | #### One last note
142 | 
143 | In a classification task, what do we do if the positive and negative samples are imbalanced?
144 | 
145 | Q: First of all, why does an imbalanced class distribution hurt the final model?
146 | 
147 | A: The root cause is that the objective function optimized during training is inconsistent with the evaluation metric used at test time, mainly because the sample distributions differ and the class weights differ.
148 | 
149 | Q: How do we handle it?
150 | 
151 | A: Usually by ```random resampling```, split into ```oversampling``` and ```undersampling```. Simply duplicating minority-class samples increases model complexity and easily causes overfitting, so the ```SMOTE``` method is usually chosen to generate new samples. For undersampling, simply picking from the majority class a subset the same size as the minority class easily throws away useful information, so the ```Easy Ensemble``` and ```Balance Cascade``` methods are generally used instead. The idea of ```Easy Ensemble``` is to undersample the dataset several times, train one base model per subsample, and combine the outputs of these models into the final result. ```Balance Cascade``` is a cascade structure: at each level a subset E is drawn at random from the majority class S1, E plus S2 (the minority class) is used to train that level's classifier, the samples in S1 that the current classifier judges correctly are then removed, and the process moves on to the next level; repeating this several times yields the cascade (a bit like AdaBoost lowering the weights of correctly classified samples, except that the weight is set straight to 0), and the final result is again a fusion of the per-level classifiers.
152 | 
153 | ### Dataset download: https://pan.baidu.com/s/1l08S5CV4HNoRJh8SU1NGBg
154 | 
155 | ### References
156 | 
157 | - [x] GBDT vs. xgboost comparison:
158 | 
159 | https://wenku.baidu.com/view/f3da60b4951ea76e58fafab069dc5022aaea463e.html
160 | 
161 | - [x] xgboost paper:
162 | 
163 | https://arxiv.org/pdf/1603.02754.pdf
164 | 
165 | - [x] lightgbm paper:
166 | 
167 | http://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
168 | 
169 | - [x] catboost papers:
170 | 
171 | - https://arxiv.org/pdf/1706.09516.pdf
172 | 
173 | - http://learningsys.org/nips17/assets/papers/paper_11.pdf
174 | 
--------------------------------------------------------------------------------
/baseline_catboost.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import numpy as np
3 | import pandas as pd
4 | from pandas import DataFrame as DF
5 | import scipy.spatial.distance as dist
6 | import catboost as cbt
7 | import json
8 | from sklearn.metrics import f1_score
9 | import time
10 | import gc
11 | import math
12 | from tqdm import tqdm
13 | from scipy import stats
14 | from sklearn.cluster import KMeans
15 | from six.moves import reduce
16 | from sklearn.pipeline import Pipeline
17 | from search_param import reduce_mem_usage
18 | 
19 | from decimal import *
20 | import warnings
21 | 
22 | warnings.filterwarnings('ignore')
23 | 
24 | file = "Data/"
25 | 
26 | import json
27 | from sklearn.metrics import f1_score
28 | import time
29 | import gc
30 | import math
31 | from tqdm import tqdm
32 | from scipy import stats
33 | 
34 | from six.moves import reduce
35 | from sklearn.model_selection import KFold, StratifiedKFold, GridSearchCV
36 | from sklearn.preprocessing import LabelEncoder
37 | from collections import Counter
38 | 
39 | from datetime import datetime, timedelta
40 | 
41 | import warnings
42 | 
43 | warnings.filterwarnings('ignore')
44 | 
45 | train = pd.read_table(file + "train_data.txt")
46 | test = pd.read_table(file + "test_data.txt")
47 | test_sid = test["sid"]
48 | 
49 | all_data = train.append(test).reset_index(drop=True)
50 | 
51 | # Time-based features
52 | all_data['time'] = pd.to_datetime(all_data['nginxtime'] * 1e+6) + timedelta(hours=8)
53 | all_data['day'] = all_data['time'].dt.dayofyear
54 | all_data['hour'] = all_data['time'].dt.hour
55 | 
56 | # Features built from sid and time
57 | all_data['sid_time'] = all_data['nginxtime'].apply(lambda x: Decimal(str(x)[4:-2]))
58 | time_min = all_data["sid"].apply(lambda x: x.split("-")).apply(lambda x: Decimal(x[-1][4:]))
59 | 
60 | nginxtime_mean = all_data["nginxtime"].mean()
61 | all_data["sample_weight"] = (all_data['nginxtime'] / nginxtime_mean).fillna(1)
62 | 
63 | all_data["sid_time"] = (time_min - all_data["sid_time"]).apply(lambda x: int(x))
64 | all_data["sid_time" + "_count"] = all_data.groupby(["sid_time"])["sid_time"].transform('count')
65 | all_data["sid_time" + "_rank"] = all_data.groupby(["sid_time"])["sid_time"].transform('count').rank(method='min')
66 | 
67 | all_data["req_ip"] = all_data.groupby("reqrealip")["ip"].transform("count")
68 | 
69 | # Data
Clean 70 | # 全部变成大写,防止oppo 和 OPPO 的出现 71 | all_data['model'].replace('PACM00', "OPPO R15", inplace=True) 72 | all_data['model'].replace('PBAM00', "OPPO A5", inplace=True) 73 | all_data['model'].replace('PBEM00', "OPPO R17", inplace=True) 74 | all_data['model'].replace('PADM00', "OPPO A3", inplace=True) 75 | all_data['model'].replace('PBBM00', "OPPO A7", inplace=True) 76 | all_data['model'].replace('PAAM00', "OPPO R15_1", inplace=True) 77 | all_data['model'].replace('PACT00', "OPPO R15_2", inplace=True) 78 | all_data['model'].replace('PABT00', "OPPO A5_1", inplace=True) 79 | all_data['model'].replace('PBCM10', "OPPO R15x", inplace=True) 80 | 81 | for fea in ['model', 'make', 'lan']: 82 | all_data[fea] = all_data[fea].astype('str') 83 | all_data[fea] = all_data[fea].map(lambda x: x.upper()) 84 | 85 | from urllib.parse import unquote 86 | 87 | 88 | def url_clean(x): 89 | x = unquote(x, 'utf-8').replace('%2B', ' ').replace('%20', ' ').replace('%2F', '/').replace('%3F', '?').replace( 90 | '%25', '%').replace('%23', '#').replace(".", ' ').replace('??', ' '). \ 91 | replace('%26', ' ').replace("%3D", '=').replace('%22', '').replace('_', ' ').replace('+', ' ').replace('-', 92 | ' ').replace( 93 | '__', ' ').replace(' ', ' ').replace(',', ' ') 94 | 95 | if (x[0] == 'V') & (x[-1] == 'A'): 96 | return "VIVO {}".format(x) 97 | elif (x[0] == 'P') & (x[-1] == '0'): 98 | return "OPPO {}".format(x) 99 | elif (len(x) == 5) & (x[0] == 'O'): 100 | return "Smartisan {}".format(x) 101 | elif ('AL00' in x): 102 | return "HW {}".format(x) 103 | else: 104 | return x 105 | 106 | 107 | all_data[fea] = all_data[fea].map(url_clean) 108 | 109 | all_data['big_model'] = all_data['model'].map(lambda x: x.split(' ')[0]) 110 | all_data['model_equal_make'] = (all_data['big_model'] == all_data['make']).astype(int) 111 | 112 | 113 | #TODO:添加部分特征 -- 2019.8.14------------------------------------------------------------------------------------------------------------------- 114 | #TODO:统计特征 115 | #nunique计算 116 | # adid_feat_nunique = ["mediashowid","apptype","city","ip","reqrealip","province","model","dvctype","make","ntt","carrier","osv","lan"] 117 | # 118 | # for feat in adid_feat_nunique: 119 | # gp1 = all_data.groupby("adunitshowid")[feat].nunique().reset_index() 120 | # gp1.columns = ["adunitshowid","adid_nuni_"+feat] 121 | # all_data = all_data.merge(gp1, how = "left",on="adunitshowid") 122 | # 123 | # gp2 = all_data.groupby("mediashowid")["adunitshowid"].nunique().reset_index() 124 | # gp2.columns = ["mediashowid","meid_adid_nuni"] 125 | # all_data = all_data.merge(gp2, how = "left", on = "mediashowid") 126 | # 127 | # gp2 = all_data.groupby("city")["adunitshowid"].nunique().reset_index() 128 | # gp2.columns = ["city","city_adid_nuni"] 129 | # all_data = all_data.merge(gp2, how = "left", on = "city") 130 | # 131 | # gp2 = all_data.groupby("province")["adunitshowid"].nunique().reset_index() 132 | # gp2.columns = ["province","province_adid_nuni"] 133 | # all_data = all_data.merge(gp2, how = "left", on = "province") 134 | # 135 | # gp2 = all_data.groupby("ip")["adunitshowid"].nunique().reset_index() 136 | # gp2.columns = ["ip","ip_adid_nuni"] 137 | # allData = all_data.merge(gp2, how = "left", on = "ip") 138 | # 139 | # gp2 = all_data.groupby("model")["adunitshowid"].nunique().reset_index() 140 | # gp2.columns = ["model","model_adid_nuni"] 141 | # all_data = all_data.merge(gp2, how = "left", on = "model") 142 | # 143 | # gp2 = all_data.groupby("make")["adunitshowid"].nunique().reset_index() 144 | # gp2.columns = 
["make","make_adid_nuni"] 145 | # all_data = all_data.merge(gp2, how = "left", on = "make") 146 | # 147 | # 148 | # del gp1 149 | # del gp2 150 | # gc.collect() 151 | # 152 | # #根据对外媒体id进行类别计数 153 | # meid_feat_nunique = ["adunitshowid","apptype","city","ip","reqrealip","province","model","dvctype","make","ntt","carrier","osv","lan"] 154 | # for feat in meid_feat_nunique: 155 | # gp1 = all_data.groupby("mediashowid")[feat].nunique().reset_index() 156 | # gp1.columns = ["mediashowid","medi_nuni_"+feat] 157 | # all_data = all_data.merge(gp1, how = "left",on="mediashowid") 158 | # gp2 = all_data.groupby("city")["mediashowid"].nunique().reset_index() 159 | # gp2.columns = ["city","city_medi_nuni"] 160 | # all_data = all_data.merge(gp2, how = "left", on = "city") 161 | # 162 | # gp2 = all_data.groupby("ip")["mediashowid"].nunique().reset_index() 163 | # gp2.columns = ["ip","ip_medi_nuni"] 164 | # all_data = all_data.merge(gp2, how = "left", on = "ip") 165 | # 166 | # gp2 = all_data.groupby("province")["mediashowid"].nunique().reset_index() 167 | # gp2.columns = ["province","province_medi_nuni"] 168 | # all_data = all_data.merge(gp2, how = "left", on = "province") 169 | # 170 | # gp2 = all_data.groupby("model")["mediashowid"].nunique().reset_index() 171 | # gp2.columns = ["model","model_medi_nuni"] 172 | # all_data = all_data.merge(gp2, how = "left", on = "model") 173 | # 174 | # gp2 = all_data.groupby("make")["mediashowid"].nunique().reset_index() 175 | # gp2.columns = ["make","make_medi_nuni"] 176 | # all_data = all_data.merge(gp2, how = "left", on = "make") 177 | # 178 | # del gp1 179 | # del gp2 180 | # gc.collect() 181 | # 182 | # #adidmd5 183 | # adidmd5_feat_nunique = ["apptype","city","ip","reqrealip","province","model","dvctype","make","ntt","carrier","osv","lan"] 184 | # for feat in adidmd5_feat_nunique: 185 | # gp1 = all_data.groupby("adidmd5")[feat].nunique().reset_index() 186 | # gp1.columns = ["adidmd5","android_nuni_"+feat] 187 | # all_data =all_data.merge(gp1, how= "left", on = "adidmd5") 188 | # 189 | # 190 | # gp2 = all_data.groupby("city")["adidmd5"].nunique().reset_index() 191 | # gp2.columns = ["city","city_adidmd_nuni"] 192 | # all_data = all_data.merge(gp2, how = "left", on = "city") 193 | # 194 | # gp2 = all_data.groupby("ip")["adidmd5"].nunique().reset_index() 195 | # gp2.columns = ["ip","ip_adidmd_nuni"] 196 | # all_data = all_data.merge(gp2, how = "left", on = "ip") 197 | # 198 | # gp2 = all_data.groupby("province")["adidmd5"].nunique().reset_index() 199 | # gp2.columns = ["province","province_adidmd_nuni"] 200 | # all_data = all_data.merge(gp2, how = "left", on = "province") 201 | # 202 | # gp2 = all_data.groupby("model")["adidmd5"].nunique().reset_index() 203 | # gp2.columns = ["model","model_adidmd_nuni"] 204 | # all_data = all_data.merge(gp2, how = "left", on = "model") 205 | # 206 | # gp2 = all_data.groupby("make")["adidmd5"].nunique().reset_index() 207 | # gp2.columns = ["make","make_adidmd_nuni"] 208 | # all_data = all_data.merge(gp2, how = "left", on = "make") 209 | # 210 | # del gp1 211 | # del gp2 212 | # gc.collect() 213 | 214 | 215 | # feat_1 = ["adunitshowid","mediashowid","adidmd5"] 216 | # feat_2 = ["apptype","city","ip","reqrealip","province","model","dvctype","make","ntt","carrier","osv","lan"] 217 | # cross_feat = [] 218 | # for fe_1 in feat_1: 219 | # for fe_2 in feat_2: 220 | # col_name = "cross_"+fe_1+"_and_"+fe_2 221 | # cross_feat.append(col_name) 222 | # all_data[col_name] = all_data[fe_1].astype(str).values + "_" + 
all_data[fe_2].astype(str).values 223 | # 224 | # #TODO:对交叉特征进行计数 --- 2019.8.7 225 | # for fe in cross_feat: 226 | # locals()[fe+"_cnt"] = all_data[fe].value_counts().to_dict() 227 | # all_data[fe+"_cnt"] = all_data[fe].map(locals()[fe+"_cnt"]) 228 | # 229 | # 230 | # for fe in cross_feat: 231 | # le_feat = LabelEncoder() 232 | # le_feat.fit(all_data[fe]) 233 | # all_data[fe] = le_feat.transform(all_data[fe]) 234 | 235 | # city_cnt = all_data["city"].value_counts().to_dict() 236 | # all_data["city_cnt"] = all_data["city"].map(city_cnt) 237 | # 238 | # model_cnt = all_data["model"].value_counts().to_dict() 239 | # all_data["model_cnt"] = all_data["model"].map(model_cnt) 240 | # 241 | # make_cnt = all_data["make"].value_counts().to_dict() 242 | # all_data["make_cnt"] = all_data["make"].map(make_cnt) 243 | # 244 | # ip_cnt = all_data["ip"].value_counts().to_dict() 245 | # all_data["ip_cnt"] = all_data["ip"].map(ip_cnt) 246 | # 247 | # reqrealip_cnt = all_data["reqrealip"].value_counts().to_dict() 248 | # all_data["reqrealip_cnt"] = all_data["reqrealip"].map(reqrealip_cnt) 249 | # 250 | # osv_cnt = all_data["osv"].value_counts().to_dict() 251 | # all_data["osv_cnt"] = all_data["osv"].map(osv_cnt) 252 | 253 | # #TODO:交叉特征 254 | # feat_1 = ["adunitshowid","mediashowid","adidmd5"] 255 | # feat_2 = ["apptype","city","ip","reqrealip","province","model","dvctype","make","ntt","carrier","osv","lan"] 256 | # cross_feat = [] 257 | # for fe_1 in feat_1: 258 | # for fe_2 in feat_2: 259 | # col_name = "cross_"+fe_1+"_and_"+fe_2 260 | # cross_feat.append(col_name) 261 | # all_data[col_name] = all_data[fe_1].astype(str).values + "_" + all_data[fe_2].astype(str).values 262 | 263 | # #TODO:对交叉特征进行计数 --- 2019.8.7 264 | # for fe in cross_feat: 265 | # locals()[fe+"_cnt"] = all_data[fe].value_counts().to_dict() 266 | # all_data[fe+"_cnt"] = all_data[fe].map(locals()[fe+"_cnt"]) 267 | # 268 | # for fe in cross_feat: 269 | # le_feat = LabelEncoder() 270 | # le_feat.fit(all_data[fe]) 271 | # all_data[fe] = le_feat.transform(all_data[fe]) 272 | 273 | #TODO:---------------------------------------------------------------------------------------------------------------------------------------------------------------------- 274 | 275 | i = "adunitshowid" 276 | all_data[i + "_0"], all_data[i + "_1"], all_data[i + "_2"], all_data[i + "_3"] = all_data[i].apply(lambda x: x[0: 8]), \ 277 | all_data[i].apply(lambda x: x[8: 16]), \ 278 | all_data[i].apply(lambda x: x[16: 24]), \ 279 | all_data[i].apply(lambda x: x[24:32]) 280 | del all_data[i] 281 | 282 | i = "pkgname" 283 | all_data[i + "_1"], all_data[i + "_2"], all_data[i + "_3"] = all_data[i].apply(lambda x: x[8: 16]), all_data[i].apply( 284 | lambda x: x[16: 24]), all_data[i].apply(lambda x: x[24: 32]) 285 | del all_data[i] 286 | 287 | i = "mediashowid" 288 | all_data[i + "_0"], all_data[i + "_1"], all_data[i + "_2"], all_data[i + "_3"] = all_data[i].apply(lambda x: x[0: 8]), \ 289 | all_data[i].apply(lambda x: x[8: 16]), \ 290 | all_data[i].apply(lambda x: x[16: 24]), \ 291 | all_data[i].apply(lambda x: x[24: 32]) 292 | del all_data[i] 293 | 294 | i = "idfamd5" 295 | all_data[i + "_1"], all_data[i + "_2"], all_data[i + "_3"] = all_data[i].apply(lambda x: x[8: 16]), all_data[i].apply( 296 | lambda x: x[16: 24]), all_data[i].apply(lambda x: x[24: 32]) 297 | del all_data[i] 298 | 299 | i = "macmd5" 300 | all_data[i + "_0"], all_data[i + "_1"], all_data[i + "_3"] = all_data[i].apply(lambda x: x[0: 8]), all_data[i].apply( 301 | lambda x: x[8: 16]), \ 302 | 
all_data[i].apply(lambda x: x[24:32]) 303 | del all_data[i] 304 | 305 | # H,W,PPI 306 | all_data['size'] = (np.sqrt(all_data['h'] ** 2 + all_data['w'] ** 2) / 2.54) / 1000 307 | all_data['ratio'] = all_data['h'] / all_data['w'] 308 | all_data['px'] = all_data['ppi'] * all_data['size'] 309 | all_data['mj'] = all_data['h'] * all_data['w'] 310 | 311 | all_data["ver_len"] = all_data["ver"].apply(lambda x: str(x).split(".")).apply(lambda x: len(x)) 312 | osv = all_data["osv"].apply(lambda x: str(x).split(".")) 313 | all_data["osv_len"] = osv.apply(lambda x: len(x)) 314 | 315 | all_data["ip"] = all_data["ip"].map(lambda x: ".".join(x.split(".")[:2])) 316 | 317 | num_col = ['h', 'w', 'size', 'mj', 'ratio', 'px'] 318 | cat_col = [i for i in all_data.select_dtypes(object).columns if (i not in ['sid', 'label'])] 319 | both_col = [] 320 | 321 | rankNot = ["idfamd5_1", "idfamd5_2", "idfamd5_3", "ver_len"] 322 | 323 | countNot = ["idfamd5_1", "idfamd5_2", "idfamd5_3", "macmd5_1", "macmd5_2", "macmd5_3", "ver_len"] 324 | for i in tqdm(cat_col): 325 | lbl = LabelEncoder() 326 | # 327 | if i not in countNot: 328 | all_data[i + "_count"] = all_data.groupby([i])[i].transform('count') 329 | both_col.extend([i + "_count"]) 330 | if i not in rankNot: 331 | all_data[i + "_rank"] = all_data.groupby([i])[i].transform('count').rank(method='min') 332 | both_col.extend([i + "_rank"]) 333 | all_data[i] = lbl.fit_transform(all_data[i].astype(str)) 334 | 335 | for i in tqdm(['w', 'ppi', 'ratio']): 336 | all_data['{}_count'.format(i)] = all_data.groupby(['{}'.format(i)])['sid'].transform('count') 337 | all_data['{}_rank'.format(i)] = all_data['{}_count'.format(i)].rank(method='min') 338 | 339 | class_num = 8 340 | quantile = [] 341 | for i in range(class_num + 1): 342 | quantile.append(all_data["ratio"].quantile(q=i / class_num)) 343 | 344 | all_data["ratio_cat"] = all_data["ratio"] 345 | for i in range(class_num + 1): 346 | if i != class_num: 347 | all_data["ratio_cat"][((all_data["ratio"] < quantile[i + 1]) & (all_data["ratio"] >= quantile[i]))] = i 348 | else: 349 | all_data["ratio_cat"][ 350 | ((all_data["ratio"] == quantile[i]))] = i - 1 351 | all_data["ratio_cat"] = lbl.fit_transform(all_data["ratio_cat"].astype(str)) 352 | 353 | class_num = 10 354 | quantile = [] 355 | for i in range(class_num + 1): 356 | quantile.append(all_data["mj"].quantile(q=i / class_num)) 357 | 358 | all_data["mj_cat"] = all_data["mj"] 359 | for i in range(class_num + 1): 360 | if i != class_num: 361 | all_data["mj_cat"][((all_data["mj"] < quantile[i + 1]) & (all_data["mj"] >= quantile[i]))] = i 362 | else: 363 | all_data["mj_cat"][ 364 | ((all_data["mj"] == quantile[i]))] = i - 1 365 | all_data["mj_cat"] = lbl.fit_transform(all_data["mj_cat"].astype(str)) 366 | 367 | class_num = 10 368 | quantile = [] 369 | for i in range(class_num + 1): 370 | quantile.append(all_data["size"].quantile(q=i / class_num)) 371 | 372 | all_data["size_cat"] = all_data["size"] 373 | for i in range(class_num + 1): 374 | if i != class_num: 375 | all_data["size_cat"][((all_data["size"] < quantile[i + 1]) & (all_data["size"] >= quantile[i]))] = i 376 | else: 377 | all_data["size_cat"][ 378 | ((all_data["size"] == quantile[i]))] = i - 1 379 | all_data["size_cat"] = lbl.fit_transform(all_data["size_cat"].astype(str)) 380 | 381 | all_data["req_ip_std"] = all_data.groupby("reqrealip")["ip"].transform("std") 382 | all_data["req_ip_skew"] = all_data.groupby("reqrealip")["ip"].skew() 383 | 384 | #添加的试试----2019.8.14--- 提高了几个百分点 385 | for col in 
['mediashowid_2_rank','adunitshowid_3_rank','make_count','mediashowid_3_count','make_rank','model_count','model_rank']: 386 | del all_data[col] 387 | 388 | feature_name = [i for i in all_data.columns if i not in ['sid', 'label', 'time', "sample_weight"]] 389 | 390 | #修改类型,降低内存 391 | 392 | # all_data=reduce_mem_usage(all_data,verbose=True) 393 | #进行参数寻优,由于内存受限,先随机进行采样,然后再进行参数寻优 394 | # all_data = all_data.sample(n=100000) 395 | 396 | from sklearn.metrics import roc_auc_score 397 | #获取训练集和测试集 398 | # X_train = all_data[:train.shape[0]] 399 | # X_test = all_data[train.shape[0]:] 400 | 401 | tr_index = ~all_data['label'].isnull() 402 | 403 | ''' 404 | X = all_data[tr_index].dropna().reset_index(drop=True) 405 | X_train = X[list(set(feature_name))].reset_index(drop=True) 406 | y = X[['label']].reset_index(drop=True).astype(int) 407 | ''' 408 | 409 | 410 | X_train = all_data[tr_index].reset_index(drop=True) 411 | y = all_data[tr_index][['label']].reset_index(drop=True).astype(int) 412 | X_test = all_data[~tr_index].reset_index(drop=True) 413 | 414 | print(X_train.shape, X_test.shape) 415 | 416 | print(feature_name) 417 | random_seed = 2019 418 | final_pred = [] 419 | cv_score = [] 420 | 421 | 422 | # [0,2,8,9,16]'hour' 423 | cate_feature = ["ver_len", "apptype", "city", "province", "dvctype", "ntt", "carrier", "lan", "orientation", 424 | "make", "model", "os", ] 425 | cat_list = [] 426 | for i in cat_col: 427 | cat_list.append(feature_name.index(i)) 428 | cat_list.append(feature_name.index("ver_len")) 429 | cat_list.append(feature_name.index("osv_len")) 430 | cat_list.append(feature_name.index("ratio_cat")) 431 | cat_list.append(feature_name.index("mj_cat")) 432 | cat_list.append(feature_name.index("size_cat")) 433 | 434 | # print("===search_params==============") 435 | # 436 | # cv_params = { 437 | # "learning_rate" : [0.1,0.2,0.3,0.01,0.02,0.03,0.04,0.05], 438 | # "max_depth" : [3,4,5,6,7,8,9,10,11] 439 | # } 440 | # 441 | # 442 | # cbt_model = cbt.CatBoostClassifier(iterations=900,task_type="GPU", 443 | # l2_leaf_reg=8, verbose=10, early_stopping_rounds=1000, eval_metric='F1', 444 | # cat_features=cat_list, gpu_ram_part=0.8,boosting_type="Plain",max_bin=129) 445 | # grid = GridSearchCV(estimator=cbt_model,param_grid=cv_params,scoring='f1',n_jobs=-1,cv=3) 446 | # grid.fit(X_train.values, y.values) 447 | # print(grid.best_score_) 448 | # print(grid.best_params_) 449 | # print(grid.best_estimator_) 450 | # 451 | # exit() 452 | 453 | skf = StratifiedKFold(n_splits=8, random_state=random_seed, shuffle=True) 454 | # cv_pred = np.zeros((X_train.shape[0],)) 455 | # test_pred = np.zeros((X_test.shape[0],)) 456 | val_score = [] 457 | cv_pred = [] 458 | 459 | start = time.time() 460 | for index, (train_index, test_index) in enumerate(skf.split(X_train, y)): 461 | start_time = time.time() 462 | print("==========================fold_{}=============================".format(str(index+1))) 463 | train_x, val_x, train_y, val_y = X_train[feature_name].iloc[train_index], X_train[feature_name].iloc[test_index], \ 464 | y.iloc[train_index], y.iloc[test_index] 465 | # cbt_model.fit(train_x[feature_name], train_y,eval_set=(test_x[feature_name],test_y)) 466 | cbt_model = cbt.CatBoostClassifier(iterations=800, learning_rate=0.05, max_depth=7, task_type="GPU", 467 | l2_leaf_reg=8, verbose=10, early_stopping_rounds=1000, eval_metric='F1', 468 | cat_features=cat_list, gpu_ram_part=0.7,boosting_type="Plain",max_bin=129) 469 | cbt_model.fit(train_x[feature_name], train_y, eval_set=(val_x[feature_name], 
val_y),use_best_model=True) 470 | # sample_weight=X_train[["sample_weight"]].iloc[train_index]) 471 | 472 | print(dict(zip(X_train.columns,cbt_model.feature_importances_))) 473 | 474 | val_pred = cbt_model.predict(val_x) 475 | print("f1_score:{}".format(f1_score(val_y,val_pred))) 476 | val_score.append(f1_score(val_y, val_pred)) 477 | 478 | test_pred = cbt_model.predict(X_test[feature_name],verbose=10).astype(int) 479 | cv_pred.append(test_pred) 480 | end_time = time.time() 481 | print("-"*60) 482 | print("finished in {}".format(timedelta(seconds=end_time-start_time))) 483 | print('-'*60) 484 | 485 | # test_pred += cbt_model.predict_proba(X_test[feature_name], verbose=10)[:, 1] / 5 486 | # y_val = cbt_model.predict_proba(test_x[feature_name], verbose=10) 487 | # print(Counter(np.argmax(y_val, axis=1))) 488 | # cv_score.append(f1_score(test_y, np.round(y_val[:, 1]))) 489 | # print(cv_score[-1]) 490 | end = time.time() 491 | print("-"*100) 492 | print(val_score) 493 | print("mean f1_score : {}".format(np.mean(val_score))) 494 | print("Total training fininshed in {}".format(timedelta(seconds=end-start))) 495 | print("-"*100) 496 | 497 | #提交结果 498 | submit= [] 499 | 500 | for line in np.array(cv_pred).transpose(): 501 | submit.append(np.argmax(np.bincount(line))) 502 | result = pd.DataFrame() 503 | result["sid"] = test["sid"].values.tolist() 504 | result["label"] = submit 505 | 506 | result.to_csv("submissionCat{}.csv".format(datetime.now().strftime("%Y%m%d%H%M")),index=False) 507 | 508 | print(result.head()) 509 | print(result["label"].value_counts()) -------------------------------------------------------------------------------- /baseline_catboost_2.py: -------------------------------------------------------------------------------- 1 | """-*- coding: utf-8 -*- 2 | DateTime : 2019/8/12 9:18 3 | Author : Peter_Bonnie 4 | FileName : baseline_catboost_2.py 5 | Software: PyCharm 6 | """ 7 | import numpy as np 8 | import pandas as pd 9 | from baseline_lgb import load_params 10 | from sklearn.model_selection import StratifiedKFold,TimeSeriesSplit 11 | import gc 12 | import time 13 | import datetime 14 | import catboost as cbt 15 | from sklearn.metrics import f1_score 16 | 17 | 18 | #TODO:载入数据 19 | 20 | trainData, testData, trainY, test_sid = load_params() 21 | trainX, testX = trainData.values, testData.values 22 | 23 | #先获取类别特征的索引 24 | cat_list = [] 25 | feature_name = [col for col in trainData.columns if col not in ["time_of_hour", "time_of_min","time_of_sec"]] 26 | 27 | 28 | cat_columns = [col for col in trainData.select_dtypes(object).columns] 29 | # cat_columns = [col for col in trainData.columns if trainDatal[col].dtype == "object"] 30 | print(cat_columns) 31 | 32 | 33 | 34 | for col in cat_columns: 35 | cat_list.append(feature_name.index(col)) 36 | 37 | print("cate index :",cat_list) 38 | #TODO:模型搭建 39 | start = time.time() 40 | n_splits = 7 41 | random_state = 2019 42 | skf = StratifiedKFold(n_splits=n_splits, shuffle=False,random_state=random_state) 43 | 44 | model = cbt.CatBoostClassifier(iterations=10000, learning_rate=0.05, max_depth=8, task_type='GPU', 45 | l2_leaf_reg=8, verbose=10, early_stopping_rounds=3000, eval_metric='F1',cat_features=cat_list) 46 | 47 | cv_pred = [] 48 | cv_score = [] 49 | 50 | for index, (tra_idx, val_idx) in enumerate(skf.split(trainX, trainY)): 51 | start_time = time.time() 52 | print("==============================fold_{}=========================================".format(str(index+1))) 53 | X_train, Y_train = trainX[tra_idx], trainY[tra_idx] 54 | X_val, 
Y_val = trainX[val_idx], trainY[val_idx] 55 | cbt_model = model.fit(X_train, Y_train,eval_set=(X_val, Y_val),early_stopping_rounds=3000,cat_features=cat_list) 56 | print(dict(zip(trainData.columns,cbt_model.feature_importances_))) 57 | val_pred = cbt_model.predict(X_val) 58 | print("fold_{0},f1_score:{1}".format(str(index+1), f1_score(Y_val,val_pred))) 59 | cv_score.append(f1_score(Y_val,val_pred)) 60 | test_pred = cbt_model.predict(testX).astype(int) 61 | cv_pred.append(test_pred) 62 | end_time = time.time() 63 | print("finised in {}".format(datetime.timedelta(seconds=end_time-start_time))) 64 | 65 | end = time.time() 66 | print('-'*60) 67 | print("Training has finished.") 68 | print("Total training time is {}".format(str(datetime.timedelta(seconds=end-start)))) 69 | print(cv_score) 70 | print("mean f1:",np.mean(cv_score)) 71 | print('-'*60) 72 | 73 | #提交结果 74 | submit= [] 75 | 76 | for line in np.array(cv_pred).transpose(): 77 | submit.append(np.argmax(np.bincount(line))) 78 | 79 | result = pd.DataFrame(columns=["sid","label"]) 80 | result["sid"] = list(test_sid.unique()) 81 | result["label"] = submit 82 | 83 | result.to_csv("submissionCat{}.csv".format(datetime.datetime.now().strftime("%Y%m%d%H%M")),index=False) 84 | 85 | print(result.head()) 86 | print(result["label"].value_counts()) 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | -------------------------------------------------------------------------------- /baseline_keras.py: -------------------------------------------------------------------------------- 1 | """-*- coding: utf-8 -*- 2 | DateTime : 2019/8/20 9:17 3 | Author : Peter_Bonnie 4 | FileName : baseline_keras.py 5 | Software: PyCharm 6 | """ 7 | from keras.models import Sequential, Model 8 | from keras.layers import Dense, Dropout, Activation, BatchNormalization, Input 9 | from sklearn.metrics import f1_score 10 | import pandas as pd 11 | import numpy as np 12 | import gc 13 | from sklearn.preprocessing import LabelEncoder 14 | import matplotlib.pyplot as plt 15 | from sklearn.model_selection import KFold,StratifiedKFold 16 | import datetime 17 | import keras 18 | from keras import backend as K 19 | 20 | #load data 21 | all_data = pd.read_csv("Data/allData.csv") 22 | 23 | train_data = pd.read_csv("Data/train_data.txt",delimiter='\t') 24 | test_data = pd.read_csv("Data/test_data.txt",delimiter='\t') 25 | 26 | trainData = all_data[:train_data.shape[0]] 27 | testData = all_data[train_data.shape[0]:] 28 | test_sid = test_data["sid"] 29 | 30 | label = train_data["label"] 31 | del train_data["label"] 32 | 33 | #model 34 | K.clear_session() 35 | 36 | class MLP(object): 37 | 38 | def __init__(self,drop_out,activation,input_units,trainX, testX,trainY,epoch,batchsize,num_class,optimizer,loss,metrics,valX,valY): 39 | 40 | self.drop_out = drop_out 41 | self.activation = activation 42 | self.input_units = input_units 43 | self.trainX = trainX 44 | self.trainY = trainY 45 | self.testX = testX 46 | self.valX = valX 47 | self.valY = valY 48 | self.epoch = epoch 49 | self.batchsize = batchsize 50 | self.num_class = num_class 51 | self.optimizer = optimizer 52 | self.loss = loss 53 | self.metrics = metrics 54 | 55 | def _build(self): 56 | model = Sequential() 57 | #first layer 58 | model.add(Dense(self.input_units,input_dim=self.trainX.shape[1],activation=self.activation)) 59 | model.add(BatchNormalization()) 60 | model.add(Dropout(self.drop_out)) 61 | #second layer 62 | model.add(Dense(self.input_units // 2, activation= self.activation)) 63 | model.add(Dropout(self.drop_out)) 
64 | 65 | #third layer 66 | model.add(Dense(self.input_units // 4, activation= self.activation)) 67 | model.add(Dropout(self.drop_out)) 68 | 69 | #forth layer 70 | model.add(Dense(self.input_units // 8, activation= self.activation)) 71 | model.add(Dropout(self.drop_out / 2)) 72 | 73 | #output 74 | model.add(Dense(self.num_class,activation="sigmoid")) 75 | 76 | model.compile(optimizer=self.optimizer,loss= self.loss,metrics=self.metrics) 77 | model.summary() 78 | 79 | return model 80 | 81 | def _fit(self): 82 | 83 | model = self._build() 84 | history = model.fit(self.trainX, self.trainY,batch_size=self.batchsize,epochs=self.epoch, 85 | callbacks=[keras.callbacks.EarlyStopping(monitor="val_loss",min_delta=0,patience=100,verbose=1,mode="auto"),] 86 | ,shuffle= True,verbose=1) 87 | return model 88 | 89 | def _predict(self): 90 | 91 | val_score = [] 92 | model = self._fit() 93 | valPredict = model.predict(self.valX,batch_size=self.batchsize) 94 | val_score = f1_score(self.valY, valPredict) 95 | print(val_score) 96 | test_predict = model.predict(self.testX,batch_size=self.batchsize) 97 | return test_predict 98 | 99 | 100 | skf = StratifiedKFold(n_splits=7,shuffle=True,random_state=2019) 101 | 102 | trainX = trainData.values 103 | testX = testData.values 104 | trainY = label 105 | 106 | test_pred = [] 107 | for idx, (trx,valx) in enumerate(skf.split(trainX,trainY)): 108 | print("=================fold_{}=====================================".format(str(idx+1))) 109 | X_train = trainX[trx] 110 | Y_train = trainY[trx] 111 | 112 | X_val = trainX[valx] 113 | Y_val = trainY[valx] 114 | 115 | mlp = MLP(drop_out=0.2,activation="relu",input_units=256,trainX=X_train,testX=testX,trainY=Y_train,epoch=1000, 116 | batchsize=128,num_class=1,optimizer="adam",loss="binary_crossentropy",metrics=["accuracy"],valX=X_val,valY=Y_val) 117 | test_predict = mlp._predict() 118 | test_pred.append(test_predict) 119 | 120 | 121 | submit = [] 122 | for line in np.array(test_pred).transpose(): 123 | submit.append(np.argmax(np.bincount(line))) 124 | final_result = pd.DataFrame(columns=["sid","label"]) 125 | final_result["sid"] = list(test_sid.unique()) 126 | final_result["label"] = submit 127 | final_result.to_csv("submitMLP{0}.csv".format(datetime.datetime.now().strftime("%Y%m%d%H%M")),index = False) 128 | print(final_result.head()) 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | -------------------------------------------------------------------------------- /baseline_lgb.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function,division 2 | """-*- coding: utf-8 -*- 3 | DateTime : 2019/7/30 10:11 4 | Author : Peter_Bonnie 5 | FileName : baseline.py 6 | Software: PyCharm 7 | """ 8 | import numpy as np 9 | import pandas as pd 10 | import matplotlib.pyplot as plt 11 | import seaborn as sns 12 | from scipy import sparse 13 | from sklearn.preprocessing import MinMaxScaler,StandardScaler,OneHotEncoder,LabelEncoder 14 | from sklearn.model_selection import GridSearchCV,StratifiedKFold,RandomizedSearchCV,train_test_split,cross_val_score,ShuffleSplit 15 | from sklearn.linear_model import LogisticRegression 16 | from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,ExtraTreesClassifier 17 | from sklearn.metrics import f1_score 18 | from sklearn.feature_selection import SelectFromModel,RFECV,VarianceThreshold,SelectKBest,chi2,RFE,SelectPercentile 19 | from sklearn.feature_extraction.text import CountVectorizer 
20 | import os 21 | import time 22 | import datetime 23 | import lightgbm as lgb 24 | from lightgbm import plot_importance 25 | import collections 26 | import xgboost as xgb 27 | from xgboost import plot_importance,to_graphviz 28 | from data_helper import * 29 | from search_param import Util,reduce_mem_usage 30 | import catboost as cat 31 | from feature_enginee import * 32 | import gc 33 | import warnings 34 | 35 | warnings.filterwarnings("ignore") 36 | sns.set(style = "whitegrid",color_codes = True) 37 | sns.set(font_scale = 1) 38 | 39 | 40 | #TODO:数据的加载 41 | def load_params(): 42 | 43 | print("loding data.....") 44 | start = time.time() 45 | PATH = "Data/" 46 | allData = pd.read_csv(PATH +"allData.csv") 47 | 48 | trainData_1 = pd.read_csv(PATH+"train_data.txt",delimiter='\t') 49 | testData_1 = pd.read_csv(PATH+"test_data.txt",delimiter='\t') 50 | 51 | allData = reduce_mem_usage(allData) 52 | # trainData_1 = reduce_mem_usage(trainData_1) 53 | # testData_1 = reduce_mem_usage(testData_1) 54 | 55 | 56 | all_sid = pd.concat([trainData_1["sid"],testData_1["sid"]],axis=0) 57 | test_sid = testData_1.pop("sid") 58 | label = trainData_1.pop("label") 59 | 60 | ######################################整个数据特征以及处理结束############################################################### 61 | #TODO:数据源的获取 62 | trainData = allData[:trainData_1.shape[0]] 63 | testData = allData[trainData_1.shape[0]:] 64 | trainY = label 65 | 66 | 67 | for col in ["idfamd5_3", "android_nuni_apptype", "idfamd5_cnt", "idfamd5_2", "idfamd5_1", 68 | "cross_adunitshowid_and_dvctype", "cross_mediashowid_and_apptype", "android_nuni_make", "osv_len", 69 | "cross_adunitshowid_and_apptype","macmd5_cnt","pkgname_2_rank","android_nuni_lan","adunitshowid_1_rank","android_nuni_model","pkgname_2_count", 70 | "mediashowid_1_rank","adunitshowid_cnt","mediashowid_2_count","pkgname_3_rank","mediashowid_3_count","adunitshowid_1_count","mediashowid_1_count", 71 | "adunitshowid_3_count","mediashowid_cnt","pkgname_3_count","adunitshowid_2_rank","cross_adunitshowid_and_apptype_cnt","cross_mediashowid_and_apptype_cnt", 72 | "adunitshowid_3_rank","mediashowid_3_rank","adunitshowid_2_count","mediashowid_2_rank"]: 73 | 74 | del trainData[col] 75 | del testData[col] 76 | 77 | # TODO:删出部分特征试试 ----2019.8.11 涨了将近3个百分点 78 | # 分别只取前50,75, 100,125,150,175,200,225,250 进行测试,选最好的特征数量 79 | 80 | all_cate = [('nginxtime', 5198), ('ip', 4713), ('imeimd5', 4590), ('city', 3932), ('model', 3635), ('time_of_min', 3471), ('time_of_sec', 3333), 81 | ('ratio', 3310), ('big_model', 3007), ('time_of_hour', 2700), ('cross_mediashowid_and_model_cnt', 2608), ('cross_adunitshowid_and_city_cnt', 2410), 82 | ('cross_mediashowid_and_city_cnt', 2390), ('cross_adunitshowid_and_model_cnt', 2348), ('big_model_rank', 2151), ('cross_adunitshowid_and_osv_cnt', 2129), 83 | ('macmd5_3', 1930), ('model_medi_nuni', 1841), ('reqrealip', 1808), ('model_adidmd_nuni', 1796), ('model_adid_nuni', 1766), ('ver_cnt', 1736), 84 | ('city_adid_nuni', 1716), ('cross_mediashowid_and_osv_cnt', 1707), ('macmd5', 1683), ('osv', 1647), ('city_medi_nuni', 1599), ('macmd5_1', 1594), 85 | ('cross_adunitshowid_and_reqrealip_cnt', 1576), ('city_cnt', 1575), ('osv_cnt', 1564), ('cross_adunitshowid_and_make_cnt', 1544), ('model_cnt', 1543), 86 | ('adunitshowid_2', 1536), ('city_adidmd_nuni', 1504), ('adunitshowid_3', 1496), ('make', 1464), ('adunitshowid_1', 1434), ('req_ip', 1313), 87 | ('cross_mediashowid_and_make_cnt', 1295), ('ver', 1288), ('cross_mediashowid_and_city', 1270), ('size', 1175), ('h', 1063), 88 | 
('cross_adunitshowid_and_ntt_cnt', 989), ('cross_mediashowid_and_osv', 961), ('cross_mediashowid_and_ip', 948), ('cross_adunitshowid_and_carrier_cnt', 925), 89 | ('mediashowid_3', 915), ('adidmd5', 904), ('cross_mediashowid_and_reqrealip_cnt', 902), ('make_adid_nuni', 878), ('make_adidmd_nuni', 864), 90 | ('adid_nuni_make', 839), ('mediashowid_2', 826), ('mediashowid_1', 810), ('cross_adunitshowid_and_city', 789), ('cross_adidmd5_and_city', 785), 91 | ('adid_nuni_model', 785), ('cross_adunitshowid_and_province_cnt', 769), ('apptype', 746), ('cross_mediashowid_and_model', 733), ('cross_adidmd5_and_ip', 717), 92 | ('adid_nuni_osv', 673), ('cross_mediashowid_and_reqrealip', 671), ('cross_adunitshowid_and_ip', 666), ('cross_mediashowid_and_carrier_cnt', 663), 93 | ('px', 658), ('ip_cnt', 651), ('area', 642), ('cross_adidmd5_and_apptype', 626), ('apptype_cnt', 595), ('ip_adid_nuni', 575), ('cross_mediashowid_and_ntt_cnt', 533), 94 | ('cross_adunitshowid_and_reqrealip', 514), ('cross_adidmd5_and_reqrealip', 514), ('ip_medi_nuni', 503), ('pkgname_2', 502), ('medi_nuni_make', 495), ('cross_mediashowid_and_make', 492), 95 | ('adid_nuni_city', 486), ('pkgname_3', 484), ('ip_adidmd_nuni', 480), ('cross_adidmd5_and_model', 473), ('cross_adunitshowid_and_model', 470), ('cross_adunitshowid_and_osv', 468), ('imeimd5_cnt', 457), 96 | ('pkgname_1', 456), ('openudidmd5', 442), ('medi_nuni_osv', 433), ('cross_mediashowid_and_province_cnt', 428), ('cross_adidmd5_and_city_cnt', 420), ('cross_adidmd5_and_province', 415), ('make_medi_nuni', 403), 97 | ('cost_time', 397), ('adid_nuni_reqrealip', 382), ('medi_nuni_city', 368), ('cross_adunitshowid_and_make', 367), ('pkgname', 347), ('cross_adunitshowid_and_lan_cnt', 339), ('cross_adidmd5_and_make', 337), ('pkgname_1_rank', 335), 98 | ('meid_adid_nuni', 326), ('carrier', 326), ('medi_nuni_model', 323), ('cross_mediashowid_and_carrier', 299), ('cross_adunitshowid_and_ntt', 298), ('medi_nuni_reqrealip', 296), ('cross_adidmd5_and_osv', 295), ('w', 287), ('cross_adidmd5_and_dvctype', 283), 99 | ('cross_adidmd5_and_model_cnt', 278), ('adunitshowid', 276), ('adunitshowid_0_rank', 273), ('cross_mediashowid_and_ip_cnt', 270), ('cross_mediashowid_and_lan_cnt', 268), ('adid_nuni_province', 256), ('cross_mediashowid_and_ntt', 246), ('big_model_count', 245), 100 | ('ntt', 227), ('ppi', 208), ('make_cnt', 200), ('cross_adunitshowid_and_carrier', 193), ('cross_adidmd5_and_reqrealip_cnt', 191), ('macmd5_0', 178), ('ver_len', 174), ('cross_adunitshowid_and_province', 168), ('carrier_cnt', 165), ('macmd5_0_rank', 158), ('lan_cnt', 153), 101 | ('cross_adidmd5_and_lan', 148), ('lan', 145), ('adid_nuni_ntt', 142), ('cross_adidmd5_and_ip_cnt', 139), ('mediashowid_0_rank', 134), ('adid_nuni_carrier', 133), ('cross_adidmd5_and_carrier', 131), ('mediashowid', 129), ('ntt_cnt', 125), ('medi_nuni_carrier', 124), 102 | ('reqrealip_cnt', 120), ('cross_adunitshowid_and_ip_cnt', 120), ('cross_adidmd5_and_ntt', 116), ('medi_nuni_ntt', 101), ('medi_nuni_province', 99), ('cross_mediashowid_and_province', 92), ('cross_mediashowid_and_dvctype_cnt', 86), ('ratio_cat', 86), ('cross_adunitshowid_and_dvctype_cnt', 85), 103 | ('cross_adidmd5_and_osv_cnt', 81), ('dvctype', 75), ('province', 68), ('cross_adidmd5_and_make_cnt', 64), ('dvctype_cnt', 57), ('adid_nuni_ip', 57), ('adid_nuni_lan', 56), ('android_nuni_ip', 50), ('province_adid_nuni', 50), ('macmd5_3_rank', 44), ('medi_nuni_ip', 43), ('android_nuni_reqrealip', 42), 104 | ('adunitshowid_0', 42), ('medi_nuni_dvctype', 42), 
('cross_adidmd5_and_ntt_cnt', 38), ('adunitshowid_0_count', 37), ('android_nuni_city', 36), ('cross_adidmd5_and_apptype_cnt', 32), ('medi_nuni_lan', 31), ('macmd5_1_rank', 29), ('pkgname_1_count', 28), ('cross_adunitshowid_and_lan', 25), 105 | ('mediashowid_0', 20), ('orientation', 19), ('openudidmd5_cnt', 19), ('cross_mediashowid_and_lan', 19), ('macmd5_0_count', 18), ('cross_adidmd5_and_province_cnt', 18), ('adid_nuni_dvctype', 16), ('android_nuni_ntt', 14), ('cross_adidmd5_and_dvctype_cnt', 14), ('area_cat', 13), 106 | ('cross_adidmd5_and_carrier_cnt', 13), ('mediashowid_0_count', 11), ('size_cat', 11), ('cross_adidmd5_and_lan_cnt', 8), ('adidmd5_cnt', 7), ('cross_mediashowid_and_dvctype', 5)] 107 | top_50_feat = all_cate[:50] 108 | top_75_feat = all_cate[:75] 109 | top_100_feat = all_cate[:100] 110 | top_125_feat = all_cate[:125] 111 | top_150_feat = all_cate[:150] 112 | 113 | #只取重要性前50的特征 114 | # for col in all_cate: 115 | # if col not in top_50_feat: 116 | # del trainData[col[0]] 117 | # del testData[col[0]] 118 | 119 | # #只取重要性前75的特征 120 | # for col in all_cate: 121 | # if col not in top_75_feat: 122 | # del trainData[col[0]] 123 | # del testData[col[0]] 124 | # 125 | #只取重要性前100的特征 126 | # for col in all_cate: 127 | # if col not in top_100_feat: 128 | # del trainData[col[0]] 129 | # del testData[col[0]] 130 | # 131 | # for col in all_cate: 132 | # if col not in top_125_feat: 133 | # del trainData[col[0]] 134 | # del testData[col[0]] 135 | # 136 | for col in all_cate: 137 | if col not in top_150_feat: 138 | del trainData[col[0]] 139 | del testData[col[0]] 140 | 141 | gc.collect() 142 | 143 | # if os.path.exists("Data/feature/base_train_csr.npz") and True: 144 | # print("loading csr .....") 145 | # base_train_csr = sparse.load_npz("Data/feature/base_train_csr.npz").tocsr().astype("bool") 146 | # base_test_csr = sparse.load_npz("Data/feature/base_test_csr.npz").tocsr().astype("bool") 147 | # else: 148 | # base_train_csr = sparse.csr_matrix((len(trainData), 0)) 149 | # base_test_csr = sparse.csr_matrix((len(testData), 0)) 150 | # 151 | # enc = OneHotEncoder() 152 | 153 | 154 | # #利用csr进行构造 155 | # cv = CountVectorizer(min_df=5) 156 | # cv.fit(all_sid) 157 | # train_a = cv.transform(trainData_1["sid"]) 158 | # test_a = cv.transform(test_sid) 159 | # 160 | # trainData = sparse.hstack((train_a,trainData),'csr') 161 | # testData = sparse.hstack((test_a,testData),'csr') 162 | 163 | # try: 164 | # #做一个循环遍历找到取得最高值的特征百分点个数 165 | # feature_select = SelectPercentile (chi2, percentile= 98) 166 | # feature_select.fit(trainData, trainY) 167 | # trainData = feature_select.transform(trainData) 168 | # testData = feature_select.transform(testData) 169 | # end = time.time() 170 | # print("chi2 select finish, it cost {}".format(datetime.timedelta(seconds=(end - start)))) 171 | # 172 | # except: 173 | # raise ValueError("error handle....") 174 | 175 | return trainData, testData, trainY, test_sid 176 | 177 | if __name__ == "__main__": 178 | 179 | trainData, testData, trainY, test_sid = load_params() 180 | trainX, testX = trainData.values, testData.values 181 | 182 | #TODO:2019.8.11 利用xgb来生成叶子的节点 183 | #采用全部数据集进行训练,不过选择重要性较高的特征叶子节点特征的生成 184 | feat_importance = [('nginxtime', 5146), ('ip', 4758), ('imeimd5', 4739), ('city', 3998), ('model', 3575), ('time_of_min', 3431), ('time_of_sec', 3360), 185 | ('ratio', 3256), ('big_model', 3124), ('time_of_hour', 2650), ('cross_mediashowid_and_model_cnt', 2575), ('cross_adunitshowid_and_model_cnt', 2420), 186 | ('cross_mediashowid_and_city_cnt', 2374), 
('cross_adunitshowid_and_city_cnt', 2362), ('big_model_rank', 2150), ('cross_adunitshowid_and_osv_cnt', 2028), 187 | ('macmd5_3', 1941), ('reqrealip', 1861), ('model_medi_nuni', 1851), ('city_adid_nuni', 1792), ('cross_mediashowid_and_osv_cnt', 1758), 188 | ('ver_cnt', 1734), ('model_adidmd_nuni', 1717), ('model_adid_nuni', 1691), ('macmd5_1', 1664), ('osv', 1640), ('macmd5', 1624), 189 | ('model_cnt', 1572), ('cross_adunitshowid_and_make_cnt', 1572), ('city_cnt', 1571), ('city_medi_nuni', 1543), ('adunitshowid_1', 1520), 190 | ('cross_adunitshowid_and_reqrealip_cnt', 1503), ('adunitshowid_3', 1497), ('adunitshowid_2', 1474), ('osv_cnt', 1468), ('city_adidmd_nuni', 1459), 191 | ('make', 1430), ('cross_mediashowid_and_make_cnt', 1316), ('ver', 1311), ('req_ip', 1307), ('cross_mediashowid_and_city', 1283), ('size', 1113), 192 | ('h', 1020), ('cross_mediashowid_and_ip', 989), ('cross_mediashowid_and_osv', 973), ('cross_adunitshowid_and_ntt_cnt', 968), 193 | ('cross_adunitshowid_and_carrier_cnt', 915), ('adidmd5', 910), ('cross_mediashowid_and_reqrealip_cnt', 905), ('adid_nuni_make', 887), 194 | ('make_adidmd_nuni', 862), ('cross_adidmd5_and_city', 847), ('mediashowid_3', 845), ('mediashowid_2', 832), ('cross_adidmd5_and_ip', 820), 195 | ('mediashowid_1', 811), ('make_adid_nuni', 809), ('adid_nuni_model', 781), ('cross_mediashowid_and_model', 773), ('apptype', 771), 196 | ('cross_adunitshowid_and_city', 750), ('cross_adunitshowid_and_province_cnt', 743), ('adid_nuni_osv', 734), ('px', 705), ('ip_cnt', 674), 197 | ('cross_mediashowid_and_carrier_cnt', 656), ('cross_adidmd5_and_apptype', 638), ('cross_mediashowid_and_reqrealip', 634), 198 | ('cross_adunitshowid_and_ip', 619), ('apptype_cnt', 589), ('area', 569), ('openudidmd5', 554), ('cross_mediashowid_and_make', 538), 199 | ('cross_adidmd5_and_reqrealip', 535), ('ip_adid_nuni', 520), ('cross_adidmd5_and_model', 516), ('cross_adunitshowid_and_reqrealip', 515), 200 | ('cross_mediashowid_and_ntt_cnt', 514), ('pkgname_2', 514), ('adid_nuni_city', 508), ('cross_adunitshowid_and_model', 506), ('medi_nuni_make', 498), 201 | ('ip_medi_nuni', 490), ('ip_adidmd_nuni', 483), ('cross_adunitshowid_and_osv', 480), ('pkgname_3', 461), ('imeimd5_cnt', 445), 202 | ('cross_adidmd5_and_city_cnt', 442), ('make_medi_nuni', 438), ('cost_time', 436), ('medi_nuni_osv', 435), ('pkgname_1', 421), 203 | ('cross_mediashowid_and_province_cnt', 408), ('cross_adunitshowid_and_make', 401), ('cross_adidmd5_and_make', 396), ('cross_adidmd5_and_province', 387), 204 | ('adid_nuni_reqrealip', 369), ('cross_adidmd5_and_osv', 367), ('pkgname', 352), ('cross_adunitshowid_and_lan_cnt', 327), ('medi_nuni_city', 326), 205 | ('medi_nuni_reqrealip', 316), ('pkgname_1_rank', 314), ('medi_nuni_model', 304), ('cross_adunitshowid_and_ntt', 300), ('w', 300)] 206 | 207 | # feat_col = [col[0] for col in feat_importance] 208 | # no_feat_col = [col for col in trainData.columns if col not in feat_col] 209 | # 210 | # clf = XgboostFeature() 211 | # new_feature,new_test_features = clf.fit_model(X_train = trainData[feat_col].values, y_train= trainY, X_test= testData[feat_col].values) 212 | # print(new_feature,new_test_features) 213 | # 214 | # #然后再将生成的叶子特征与原始特征进行拼接 215 | # trainX = clf.mergeToOne(trainX,new_feature) 216 | # testX = clf.mergeToOne(testX, new_test_features) 217 | 218 | 219 | # #TODO:模型搭建 220 | start = time.time() 221 | model = lgb.LGBMClassifier(boosting_type="gbdt",num_leaves=48, max_depth=-1, learning_rate=0.05, 222 | n_estimators=3000, 
subsample_for_bin=50000,objective="binary",min_split_gain=0, min_child_weight=5, min_child_samples=30, #10 223 | subsample=0.8,subsample_freq=1, colsample_bytree=1, reg_alpha=3,reg_lambda=5, 224 | feature_fraction= 0.9, bagging_fraction = 0.9,#此次添加的 225 | seed= 2019,n_jobs=10,slient=True,num_boost_round=3000) 226 | n_splits = 7 227 | random_seed = 2019 228 | skf = StratifiedKFold(shuffle=True,random_state=random_seed,n_splits=n_splits) 229 | cv_pred= [] 230 | val_score = [] 231 | for idx, (tra_idx, val_idx) in enumerate(skf.split(trainX, trainY)): 232 | startTime = time.time() 233 | print("==================================fold_{}====================================".format(str(idx+1))) 234 | X_train, Y_train = trainX[tra_idx],trainY[tra_idx] 235 | X_val, Y_val = trainX[val_idx], trainY[val_idx] 236 | lgb_model = model.fit(X_train,Y_train,eval_names=["train","valid"],eval_metric=["logloss"],eval_set=[(X_train, Y_train),(X_val,Y_val)],early_stopping_rounds=200) 237 | print(dict(zip(trainData.columns,lgb_model.feature_importances_))) 238 | print(lgb_model.feature_importances_) 239 | 240 | #验证集进行验证 241 | val_pred = lgb_model.predict(X_val,num_iteration=lgb_model.best_iteration_) 242 | val_score.append(f1_score(Y_val,val_pred)) 243 | print("f1_score:",f1_score(Y_val, val_pred)) 244 | test_pred = lgb_model.predict(testX, num_iteration = lgb_model.best_iteration_).astype(int) 245 | cv_pred.append(test_pred) 246 | endTime = time.time() 247 | print("fold_{} finished in {}".format(str(idx+1), datetime.timedelta(seconds= endTime-startTime))) 248 | # # if idx == 0: 249 | # # cv_pred = np.array(test_pred).reshape(-1.1) 250 | # # else: 251 | # # cv_pred = np.hstack((cv_pred,np.array(test_pred).reshape(-1,1))) 252 | end = time.time() 253 | print('-'*60) 254 | print("Training has finished.") 255 | print("Total training time is {}".format(str(datetime.timedelta(seconds=end-start)))) 256 | print(val_score) 257 | print("mean f1:",np.mean(val_score)) 258 | print('-'*60) 259 | 260 | submit = [] 261 | for line in np.array(cv_pred).transpose(): 262 | submit.append(np.argmax(np.bincount(line))) 263 | final_result = pd.DataFrame(columns=["sid","label"]) 264 | final_result["sid"] = list(test_sid.unique()) 265 | final_result["label"] = submit 266 | final_result.to_csv("submitLGB{0}.csv".format(datetime.datetime.now().strftime("%Y%m%d%H%M")),index = False) 267 | print(final_result.head()) -------------------------------------------------------------------------------- /baseline_xgb.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function,division 2 | """-*- coding: utf-8 -*- 3 | DateTime : 2019/7/30 10:11 4 | Author : Peter_Bonnie 5 | FileName : baseline.py 6 | Software: PyCharm 7 | """ 8 | import numpy as np 9 | import pandas as pd 10 | import matplotlib.pyplot as plt 11 | import seaborn as sns 12 | from scipy import sparse 13 | from sklearn.preprocessing import MinMaxScaler,StandardScaler,OneHotEncoder,LabelEncoder 14 | from sklearn.model_selection import GridSearchCV,StratifiedKFold,RandomizedSearchCV,train_test_split 15 | from sklearn.linear_model import LogisticRegression 16 | from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,ExtraTreesClassifier 17 | from sklearn.metrics import f1_score 18 | import time 19 | import datetime 20 | import lightgbm as lgb 21 | from lightgbm import plot_importance 22 | import collections 23 | import xgboost as xgb 24 | from xgboost import plot_importance,to_graphviz 25 | from 
data_helper import * 26 | from search_param import Util,reduce_mem_usage 27 | import catboost as cat 28 | import gc 29 | import warnings 30 | 31 | warnings.filterwarnings("ignore") 32 | sns.set(style = "whitegrid",color_codes = True) 33 | sns.set(font_scale = 1) 34 | 35 | #TODO:数据的加载 36 | PATH = "Data/" 37 | allData = pd.read_csv(PATH +"allData.csv") 38 | trainData = pd.read_csv(PATH+"train_data.txt",delimiter='\t') 39 | testData = pd.read_csv(PATH+"test_data.txt",delimiter='\t') 40 | 41 | #降低内存消耗 42 | allData = reduce_mem_usage(allData) 43 | trainData = reduce_mem_usage(trainData) 44 | testData = reduce_mem_usage(testData) 45 | 46 | test_sid = testData.pop("sid") 47 | label = trainData.pop("label") 48 | 49 | ######################################整个数据特征以及处理结束############################################################### 50 | #TODO:数据源的获取 51 | trainData = allData[:trainData.shape[0]] 52 | testData = allData[trainData.shape[0]:] 53 | trainX, trainY, testX = trainData.values,label, testData.values 54 | 55 | #TODO:特征重要性的选择 56 | def f1_sco(preds,valid): 57 | labels = valid.get_label() 58 | preds = np.argmax(preds.reshape(2, -1), axis=0) 59 | score_vali = f1_score(y_true=labels, y_pred=preds, average='weighted') # 改了下f1_score的计算方式 60 | return 'f1_score', score_vali, True 61 | 62 | # #TODO:模型搭建 63 | xgb_params = { 64 | 'booster': 'gbtree', 65 | 'max_depth': 5, 66 | 'subsample': 0.8, 67 | 'colsample_bytree': 0.8, 68 | 'objective':'binary:logistic', 69 | 'eval_metric': 'logloss', 70 | "learning_rate": 0.05, 71 | "seed" : 2019, 72 | "njob" : -1, 73 | 'silent': True, 74 | } 75 | start = time.time() 76 | model = xgb.XGBClassifier(**xgb_params) 77 | n_splits = 7 78 | random_seed = 2019 79 | skf = StratifiedKFold(shuffle=True,random_state=random_seed,n_splits=n_splits) 80 | cv_pred= [] 81 | val_score = [] 82 | for idx, (tra_idx, val_idx) in enumerate(skf.split(trainX, trainY)): 83 | print("==================================fold_{}====================================".format(str(idx+1))) 84 | X_train, Y_train = trainX[tra_idx],trainY[tra_idx] 85 | X_val, Y_val = trainX[val_idx], trainY[val_idx] 86 | dtrain = xgb.DMatrix(X_train,Y_train) 87 | dval = xgb.DMatrix(X_val, Y_val) 88 | watchlists = [(dtrain,'dtrain'),(dval,'dval')] 89 | bst = xgb.train(dtrain=dtrain, num_boost_round=3000, evals=watchlists, early_stopping_rounds=200, \ 90 | verbose_eval=50, params=xgb_params) 91 | val_pred = bst.predict(xgb.DMatrix(trainX[val_idx]),ntree_limit = bst.best_ntree_limit) 92 | val_pred = [0 if i < 0.5 else 1 for i in val_pred] 93 | val_score.append(f1_score(Y_val,val_pred)) 94 | print("f1_score:", f1_score(Y_val, val_pred)) 95 | test_pred = bst.predict(xgb.DMatrix(testX),ntree_limit = bst.best_ntree_limit) 96 | test_pred = [0 if i < 0.5 else 1 for i in test_pred] 97 | cv_pred.append(test_pred) 98 | # # if idx == 0: 99 | # # cv_pred = np.array(test_pred).reshape(-1.1) 100 | # # else: 101 | # # cv_pred = np.hstack((cv_pred,np.array(test_pred).reshape(-1,1))) 102 | end = time.time() 103 | diff = end - start 104 | print(compute_cost(diff)) 105 | submit = [] 106 | for line in np.array(cv_pred).transpose(): 107 | submit.append(np.argmax(np.bincount(line))) 108 | final_result = pd.DataFrame(columns=["sid","label"]) 109 | final_result["sid"] = list(test_sid.unique()) 110 | final_result["label"] = submit 111 | 112 | final_result.to_csv("submitXGB{0}.csv".format(datetime.datetime.now().strftime("%Y%m%d%H%M")),index = False) 113 | print(val_score) 114 | print("mean f1:",np.mean(val_score)) 115 | print(final_result.head()) 
-------------------------------------------------------------------------------- /data_helper.py: -------------------------------------------------------------------------------- 1 | """-*- coding: utf-8 -*- 2 | DateTime : 2019/7/28 16:03 3 | Author : Peter_Bonnie 4 | FileName : data_helper.py 5 | Software: PyCharm 6 | """ 7 | #数据处理文件 8 | import pandas as pd 9 | import numpy as np 10 | from sklearn.preprocessing import MinMaxScaler,StandardScaler 11 | import time 12 | import re 13 | 14 | 15 | PATH = "Data/" 16 | 17 | #TODO:缺失值填充 18 | 19 | def my_fillna(df): 20 | for i in df.columns: 21 | if df[i].isnull().sum() / df[i].values.shape[0] > 0.0: 22 | if df[i].dtype == "object": 23 | df[i] = df[i].fillna(str(-1)) 24 | df[i] = df[i].replace("nan", str(-1)) 25 | elif df[i].dtype == "float": 26 | df[i] = df[i].fillna(-1.0) 27 | df[i] = df[i].replace("nan", -1.0) 28 | else: 29 | df[i] = df[i].fillna(-1) 30 | df[i] = df[i].replace("nan", -1) 31 | return df 32 | 33 | #TODO:操作系统版本号进行更细致的分解 34 | def split_osv_into_threeParts(x): 35 | 36 | x = list(map(int, x.split('.'))) 37 | while len(x) < 3: 38 | x.append(0) 39 | 40 | return x[0], x[1], x[2] 41 | 42 | 43 | def remove_lowcase(se): 44 | count = dict(se.value_counts()) 45 | se = se.map(lambda x: -1 if count[x] < 5 else x) 46 | return se 47 | 48 | 49 | def clearn_make(make): 50 | if 'oppo' in make: 51 | return 'oppo' 52 | elif 'vivo' in make: 53 | return 'vivo' 54 | elif 'huawei' in make or 'honor' in make: 55 | return 'huawei' 56 | elif 'redmi' in make: 57 | return 'xiaomi' 58 | 59 | strs = make.split() 60 | if len(strs) > 1: 61 | s = strs[0] 62 | if s == 'mi' or s == 'm1' or s == 'm2' or s == 'm3' or s == 'm6': 63 | s = 'xiaomi' 64 | return s 65 | return make 66 | 67 | 68 | def clearn_model(model): 69 | if '%' in model: 70 | return model.split('%')[0] 71 | elif 'vivo' in model: 72 | return 'vivo' 73 | elif 'oppo' in model or 'pb' in model or 'pa' in model or 'pc' in model: 74 | return 'oppo' 75 | elif 'huawei' in model or 'honor' in model: 76 | return 'huawei' 77 | elif 'redmi' in model or 'xiaomi' in model or 'm5' in model or 'm4' in model or 'm7' in model or 'mi' in model or 'm2' in model or 'm3' in model or 'm6' in model: 78 | return 'xiaomi' 79 | elif 'letv' in model: 80 | return 'letv' 81 | elif 'oneplus' in model: 82 | return 'oneplus' 83 | elif 'zte' in model or '中兴' in model: 84 | return 'zte' 85 | # 使用正则来匹配 86 | model = re.compile(r'v.*.[at]').sub('vivo', model) 87 | 88 | return model 89 | 90 | 91 | def padding_empty_osv(osv): 92 | str_osv = osv.strip().split('.') 93 | while len(str_osv) < 3: 94 | str_osv.append('0') 95 | if str_osv[0] == '': 96 | str_osv[0] = '0' 97 | if str_osv[1] == '': 98 | str_osv[1] = '0' 99 | return '.'.join(str_osv) 100 | 101 | 102 | def remove_outlier_osv(osv): 103 | if "吴氏家族版4.4.2" in osv: 104 | osv = "4.4.2" 105 | elif '3.2.0-2-20180726.9015' in osv: 106 | osv = "3.2.0" 107 | elif '21100.0.0' in osv or '21000.0.0' in osv: 108 | osv = "0.0.0" 109 | elif len(osv.split('.')) > 3: 110 | osv = '.'.join(osv.split('.')[:3]) 111 | return osv 112 | 113 | 114 | #计算运行时间 115 | def compute_cost(sec): 116 | hours,secs = divmod(sec, 3600) 117 | mins,secs = divmod(secs, 60) 118 | return "Fininshed, and it cost {0} hours : {1} mins : {2} secs".format(int(hours),int(mins),int(secs)) 119 | 120 | 121 | def making(x): 122 | x = x.lower() 123 | if 'iphone' in x or 'apple' in x or '苹果' in x: 124 | return 'iphone' 125 | elif 'huawei' in x or 'honor' in x or '华为' in x or '荣耀' in x: 126 | return 'huawei' 127 | elif 'xiaomi' in x or '小米' in x 
or 'redmi' in x: 128 | return 'xiaomi' 129 | elif '魅族' in x: 130 | return 'meizu' 131 | elif '金立' in x: 132 | return 'gionee' 133 | elif '三星' in x or 'samsung' in x: 134 | return 'samsung' 135 | elif 'vivo' in x: 136 | return 'vivo' 137 | elif 'oppo' in x: 138 | return 'oppo' 139 | elif 'lenovo' in x or '联想' in x: 140 | return 'lenovo' 141 | elif 'nubia' in x: 142 | return 'nubia' 143 | elif 'oneplus' in x or '一加' in x: 144 | return 'oneplus' 145 | elif 'smartisan' in x or '锤子' in x: 146 | return 'smartisan' 147 | elif '360' in x or '360手机' in x: 148 | return '360' 149 | elif 'zte' in x or '中兴' in x: 150 | return 'zte' 151 | else: 152 | return 'others' 153 | 154 | 155 | # 处理'lan’ 156 | def lan(x): 157 | x = x.lower() 158 | if x in ['zh-cn','zh','cn','zh_cn','zh_cn_#hans','zh-']: 159 | return 'zh-cn' 160 | elif x in ['tw','zh-tw','zh_tw']: 161 | return 'zh-tw' 162 | elif 'en' in x: 163 | return 'en' 164 | elif 'hk' in x: 165 | return 'zh-hk' 166 | else: 167 | return x 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | -------------------------------------------------------------------------------- /data_load-3.py: -------------------------------------------------------------------------------- 1 | """-*- coding: utf-8 -*- 2 | DateTime : 2019/8/5 22:06 3 | Author : Peter_Bonnie 4 | FileName : data_load.py 5 | Software: PyCharm 6 | """ 7 | from sklearn.preprocessing import LabelEncoder,OneHotEncoder 8 | from sklearn.feature_extraction.text import CountVectorizer 9 | import re 10 | from data_helper import * 11 | import time 12 | import datetime 13 | import gc 14 | from scipy import sparse 15 | from multiprocessing import cpu_count 16 | import tqdm 17 | import warnings 18 | warnings.filterwarnings("ignore") 19 | 20 | print("data procesing.....") 21 | start = time.time() 22 | #TODO:数据的加载 23 | PATH = "Data/" 24 | trainData = pd.read_csv(PATH +"train_data.txt", delimiter= '\t') 25 | testData = pd.read_csv(PATH + "test_data.txt", delimiter= '\t') 26 | 27 | # trainData = reduce_mem_usage(trainData, verbose = True) 28 | # testData = reduce_mem_usage(testData, verbose = True) 29 | 30 | #TODO: 对osv,make,以及model 进行了比较细致的划分处理 ----修改时间2019.8.3 31 | 32 | #osv 33 | trainData["osv"].fillna('0',inplace = True) 34 | trainData["osv"] = trainData["osv"].map(lambda x: re.sub('[a-zA-Z]+','',x)) 35 | trainData["osv"] = trainData["osv"].map(lambda x: re.sub('_','.',x)) 36 | trainData["osv"] = trainData["osv"].map(lambda x:padding_empty_osv(x)) 37 | trainData["osv"] = trainData["osv"].map(lambda x:remove_outlier_osv(x)) 38 | 39 | testData["osv"].fillna('0', inplace = True) 40 | testData["osv"] = testData["osv"].map(lambda x:re.sub('[a-zA-Z]+','',x)) 41 | testData["osv"] = testData["osv"].map(lambda x:re.sub('_','.',x)) 42 | testData["osv"] = testData["osv"].map(lambda x:padding_empty_osv(x)) 43 | testData["osv"] = testData["osv"].map(lambda x:remove_outlier_osv(x)) 44 | 45 | 46 | #缺失值填充 47 | trainData = my_fillna(trainData) 48 | testData = my_fillna(testData) 49 | ######################################3#######################数据处理####################################################### 50 | #获取会话开始时间 51 | trainData["begin_time"] = trainData["sid"].apply(lambda x:int(x.split('-')[-1])) 52 | testData["begin_time"] = testData["sid"].apply(lambda x:int(x.split('-')[-1])) 53 | #获取消耗的时间 54 | trainData["cost_time"] = trainData["nginxtime"] - trainData["begin_time"] 55 | testData["cost_time"] = testData["nginxtime"] - 
testData["begin_time"] 56 | #将时间戳转化为时分秒的形式 57 | 58 | del trainData["begin_time"] 59 | del testData["begin_time"] 60 | 61 | #TODO:2019.8.9号添加 62 | trainData["time"] = pd.to_datetime(trainData["nginxtime"], unit='ms') 63 | testData["time"] = pd.to_datetime(testData["nginxtime"], unit='ms') 64 | 65 | trainData["time_of_hour"] = trainData["time"].dt.hour.astype(int) 66 | testData["time_of_hour"] = testData["time"].dt.hour.astype(int) 67 | trainData["time_of_min"] = trainData["time"].dt.minute.astype(int) 68 | testData["time_of_min"] = testData['time'].dt.minute.astype(int) 69 | trainData["time_of_sec"] = trainData["time"].dt.second.astype(int) 70 | testData["time_of_sec"] = testData["time"].dt.second.astype(int) 71 | 72 | del trainData["time"] 73 | del testData["time"] 74 | 75 | #carrier 76 | trainData.carrier[(trainData.carrier == 0) | (trainData.carrier == -1)] = 0 77 | testData.carrier[(testData.carrier == 0) | (testData.carrier == -1)] =0 78 | 79 | #ntt 80 | trainData.ntt[(trainData.ntt == 0) | (trainData.ntt == 7)] = 0 81 | testData.ntt[(testData.ntt == 0) | (testData.ntt == 7)] = 0 82 | 83 | trainData.ntt[(trainData.ntt == 1) | (trainData.ntt == 2)] = 1 84 | testData.ntt[(testData.ntt == 1) | (testData.ntt == 2)] =1 85 | 86 | trainData.ntt[trainData.ntt == 3] = 2 87 | testData.ntt[testData.ntt == 3] = 2 88 | 89 | trainData.ntt[(trainData.ntt >= 4) & (trainData.ntt <= 6)] = 3 90 | testData.ntt[(testData.ntt >= 4) & (testData.ntt <= 6)] = 3 91 | 92 | #orientation 93 | trainData.orientation[(trainData.orientation == 90) |(trainData.orientation == 2)] = 0 94 | testData.orientation[(testData.orientation == 90) | (testData.orientation == 2)] = 0 95 | 96 | """ 97 | 组合特征 98 | """ 99 | """ 100 | 交叉特征 101 | """ 102 | label = trainData.pop("label") 103 | train_sid = trainData.pop("sid") 104 | test_sid = testData.pop("sid") 105 | 106 | #删除部分特征 107 | trainData = trainData.drop(["os"],axis =1) 108 | testData = testData.drop(["os"],axis =1) 109 | 110 | allData = pd.concat([trainData, testData],axis= 0) 111 | 112 | #TODO:2019.8.10号添加 113 | allData["req_ip"] = allData.groupby("reqrealip")["ip"].transform("count") 114 | allData['model'].replace('PACM00', "OPPO", inplace=True) 115 | allData['model'].replace('PBAM00', "OPPO", inplace=True) 116 | allData['model'].replace('PBEM00', "OPPO", inplace=True) 117 | allData['model'].replace('PADM00', "OPPO", inplace=True) 118 | allData['model'].replace('PBBM00', "OPPO", inplace=True) 119 | allData['model'].replace('PAAM00', "OPPO", inplace=True) 120 | allData['model'].replace('PACT00', "OPPO", inplace=True) 121 | allData['model'].replace('PABT00', "OPPO", inplace=True) 122 | allData['model'].replace('PBCM10', "OPPO", inplace=True) 123 | 124 | for feat in ["model","make","lan"]: 125 | allData[feat] = allData[feat].astype(str) 126 | allData[feat] = allData[feat].map(lambda x:x.upper()) 127 | allData["big_model"] = allData["model"].map(lambda x:x.split(' ')[0]) 128 | 129 | feature = "adunitshowid" 130 | allData[feature+"_0"], allData[feature+"_1"], allData[feature+"_2"], allData[feature+"_3"] = allData[feature].apply(lambda x: x[0:8]),\ 131 | allData[feature].apply(lambda x: x[8:16]),\ 132 | allData[feature].apply(lambda x: x[16:24]),\ 133 | allData[feature].apply(lambda x: x[24:32]) 134 | feature = "pkgname" 135 | allData[feature+"_1"], allData[feature+"_2"], allData[feature+"_3"] = allData[feature].apply(lambda x: x[8:16]),\ 136 | allData[feature].apply(lambda x: x[16:24]),\ 137 | allData[feature].apply(lambda x: x[24:32]) 138 | feature = "mediashowid" 139 | 
allData[feature+"_0"], allData[feature+"_1"], allData[feature+"_2"], allData[feature+"_3"] = allData[feature].apply(lambda x: x[0:8]),\ 140 | allData[feature].apply(lambda x: x[8:16]),\ 141 | allData[feature].apply(lambda x: x[16:24]),\ 142 | allData[feature].apply(lambda x: x[24:32]) 143 | feature = "idfamd5" 144 | allData[feature+"_1"], allData[feature+"_2"], allData[feature+"_3"] = allData[feature].apply(lambda x: x[8:16]),\ 145 | allData[feature].apply(lambda x: x[16:24]),\ 146 | allData[feature].apply(lambda x: x[24:32]) 147 | 148 | feature = "macmd5" 149 | allData[feature + "_0"], allData[feature + "_1"], allData[feature + "_3"] = allData[feature].apply(lambda x: x[0: 8]), allData[feature].apply(lambda x: x[8: 16]), \ 150 | allData[feature].apply(lambda x: x[24:32]) 151 | 152 | 153 | #对上述特征进行类别编码 154 | for fe in ["big_model","adunitshowid_0","adunitshowid_1","adunitshowid_2","adunitshowid_3","pkgname_1","pkgname_2","pkgname_3","mediashowid_0","mediashowid_1","mediashowid_2","mediashowid_3","idfamd5_1","idfamd5_2","idfamd5_3",\ 155 | "macmd5_0","macmd5_1","macmd5_3"]: 156 | le = LabelEncoder() 157 | 158 | #2019.8.11 -- 添加 159 | if fe not in ["idfamd5_1", "idfamd5_2", "idfamd5_3", "ver_len"]: 160 | allData[fe+"_rank"] = allData.groupby([fe])[fe].transform("count").rank(method="min") 161 | if fe not in ["idfamd5_1", "idfamd5_2", "idfamd5_3", "macmd5_1", "macmd5_2", "macmd5_3", "ver_len"]: 162 | allData[fe+"_count"] = allData.groupby([fe])[fe].transform("count") 163 | #---------------------------- 164 | le.fit(allData[fe]) 165 | allData[fe] = le.transform(allData[fe].astype("str")) 166 | 167 | allData["size"] = (np.sqrt(allData["h"]**2 + allData["w"] ** 2)/ 2.54) / 1000 168 | allData["ratio"] = allData["h"] / allData["w"] 169 | allData["px"] = allData["ppi"] * allData["size"] 170 | allData["area"] = allData["h"] * allData["w"] 171 | 172 | allData["ver_len"] = allData["ver"].apply(lambda x:str(x).split(".")).apply(lambda x:len(x)) 173 | allData["osv_len"] = allData["osv"].apply(lambda x:str(x).split(".")).apply(lambda x:len(x)) 174 | 175 | 176 | #TODO:2019.8.10=--------------------------------------------------------------------------------------------------------------------------------------------------- 177 | 178 | #TODO:2019.8.11--------------特征构造代码开始-------------------------------------------------------- ------------------------------- 179 | 180 | #主要是对于新生成的特征进行rank,count,skew,cat特征构造,查看效果 181 | #quantile 182 | class_num = 8 183 | quantile = [] 184 | for i in range(class_num+1): 185 | quantile.append(allData["ratio"].quantile(q=i / class_num)) 186 | 187 | allData["ratio_cat"] = allData["ratio"] 188 | for i in range(class_num + 1): 189 | if i != class_num: 190 | allData["ratio_cat"][((allData["ratio"] < quantile[i + 1]) & (allData["ratio"] >= quantile[i]))] = i 191 | else: 192 | allData["ratio_cat"][((allData["ratio"] == quantile[i]))] = i - 1 193 | 194 | allData["ratio_cat"] = allData["ratio_cat"].astype(str).fillna("0.0") 195 | le = LabelEncoder() 196 | le.fit(allData["ratio_cat"]) 197 | allData["ratio_cat"] = le.transform(allData["ratio_cat"]) 198 | del le 199 | 200 | class_num = 10 201 | quantile = [] 202 | for i in range(class_num + 1): 203 | quantile.append(allData["area"].quantile(q=i / class_num)) 204 | 205 | allData["area_cat"] = allData["area"] 206 | for i in range(class_num + 1): 207 | if i != class_num: 208 | allData["area_cat"][((allData["area"] < quantile[i + 1]) & (allData["area"] >= quantile[i]))] = i 209 | else: 210 | allData["area_cat"][((allData["area"] == 
quantile[i]))] = i - 1 211 | 212 | allData["area_cat"] = allData["area_cat"].astype(str).fillna("0.0") 213 | le = LabelEncoder() 214 | le.fit(allData["area_cat"]) 215 | allData["area_cat"] = le.transform(allData["area_cat"]) 216 | del le 217 | 218 | class_num = 10 219 | quantile = [] 220 | for i in range(class_num + 1): 221 | quantile.append(allData["size"].quantile(q=i / class_num)) 222 | 223 | allData["size_cat"] = allData["size"] 224 | for i in range(class_num + 1): 225 | if i != class_num: 226 | allData["size_cat"][((allData["size"] < quantile[i + 1]) & (allData["size"] >= quantile[i]))] = i 227 | else: 228 | allData["size_cat"][((allData["size"] == quantile[i]))] = i - 1 229 | 230 | allData["size_cat"] = allData["size_cat"].astype(str).fillna("0.0") 231 | le = LabelEncoder() 232 | le.fit(allData["size_cat"]) 233 | allData["size_cat"] = le.transform(allData["size_cat"]) 234 | 235 | del le 236 | gc.collect() 237 | 238 | #TODO:2019.8.11--------------特征构造代码结束---------------------------------------------------------------------------------------- 239 | 240 | #TODO:2019.8.12-----利用LGB和XGB来生成叶子节点特征 241 | #使用apply函数进行生成,然后再与原特征进行融合。。。。 242 | 243 | """ 244 | 聚合特征主要尝试以下: 245 | ip 246 | *id 247 | """ 248 | #TODO:统计特征 249 | #nunique计算 250 | adid_feat_nunique = ["mediashowid","apptype","city","ip","reqrealip","province","model","dvctype","make","ntt","carrier","osv","lan"] 251 | 252 | for feat in adid_feat_nunique: 253 | gp1 = allData.groupby("adunitshowid")[feat].nunique().reset_index() 254 | gp1.columns = ["adunitshowid","adid_nuni_"+feat] 255 | allData = allData.merge(gp1, how = "left",on="adunitshowid") 256 | 257 | gp2 = allData.groupby("mediashowid")["adunitshowid"].nunique().reset_index() 258 | gp2.columns = ["mediashowid","meid_adid_nuni"] 259 | allData = allData.merge(gp2, how = "left", on = "mediashowid") 260 | 261 | gp2 = allData.groupby("city")["adunitshowid"].nunique().reset_index() 262 | gp2.columns = ["city","city_adid_nuni"] 263 | allData = allData.merge(gp2, how = "left", on = "city") 264 | 265 | gp2 = allData.groupby("province")["adunitshowid"].nunique().reset_index() 266 | gp2.columns = ["province","province_adid_nuni"] 267 | allData = allData.merge(gp2, how = "left", on = "province") 268 | 269 | gp2 = allData.groupby("ip")["adunitshowid"].nunique().reset_index() 270 | gp2.columns = ["ip","ip_adid_nuni"] 271 | allData = allData.merge(gp2, how = "left", on = "ip") 272 | 273 | gp2 = allData.groupby("model")["adunitshowid"].nunique().reset_index() 274 | gp2.columns = ["model","model_adid_nuni"] 275 | allData = allData.merge(gp2, how = "left", on = "model") 276 | 277 | gp2 = allData.groupby("make")["adunitshowid"].nunique().reset_index() 278 | gp2.columns = ["make","make_adid_nuni"] 279 | allData = allData.merge(gp2, how = "left", on = "make") 280 | 281 | 282 | del gp1 283 | del gp2 284 | gc.collect() 285 | 286 | #根据对外媒体id进行类别计数 287 | meid_feat_nunique = ["adunitshowid","apptype","city","ip","reqrealip","province","model","dvctype","make","ntt","carrier","osv","lan"] 288 | for feat in meid_feat_nunique: 289 | gp1 = allData.groupby("mediashowid")[feat].nunique().reset_index() 290 | gp1.columns = ["mediashowid","medi_nuni_"+feat] 291 | allData = allData.merge(gp1, how = "left",on="mediashowid") 292 | gp2 = allData.groupby("city")["mediashowid"].nunique().reset_index() 293 | gp2.columns = ["city","city_medi_nuni"] 294 | allData = allData.merge(gp2, how = "left", on = "city") 295 | 296 | gp2 = allData.groupby("ip")["mediashowid"].nunique().reset_index() 297 | gp2.columns = 
["ip","ip_medi_nuni"] 298 | allData = allData.merge(gp2, how = "left", on = "ip") 299 | 300 | gp2 = allData.groupby("province")["mediashowid"].nunique().reset_index() 301 | gp2.columns = ["province","province_medi_nuni"] 302 | allData = allData.merge(gp2, how = "left", on = "province") 303 | 304 | gp2 = allData.groupby("model")["mediashowid"].nunique().reset_index() 305 | gp2.columns = ["model","model_medi_nuni"] 306 | allData = allData.merge(gp2, how = "left", on = "model") 307 | 308 | gp2 = allData.groupby("make")["mediashowid"].nunique().reset_index() 309 | gp2.columns = ["make","make_medi_nuni"] 310 | allData = allData.merge(gp2, how = "left", on = "make") 311 | 312 | del gp1 313 | del gp2 314 | gc.collect() 315 | 316 | #adidmd5 317 | adidmd5_feat_nunique = ["apptype","city","ip","reqrealip","province","model","dvctype","make","ntt","carrier","osv","lan"] 318 | for feat in adidmd5_feat_nunique: 319 | gp1 = allData.groupby("adidmd5")[feat].nunique().reset_index() 320 | gp1.columns = ["adidmd5","android_nuni_"+feat] 321 | allData =allData.merge(gp1, how= "left", on = "adidmd5") 322 | 323 | 324 | gp2 = allData.groupby("city")["adidmd5"].nunique().reset_index() 325 | gp2.columns = ["city","city_adidmd_nuni"] 326 | allData = allData.merge(gp2, how = "left", on = "city") 327 | 328 | gp2 = allData.groupby("ip")["adidmd5"].nunique().reset_index() 329 | gp2.columns = ["ip","ip_adidmd_nuni"] 330 | allData = allData.merge(gp2, how = "left", on = "ip") 331 | 332 | gp2 = allData.groupby("province")["adidmd5"].nunique().reset_index() 333 | gp2.columns = ["province","province_adidmd_nuni"] 334 | allData = allData.merge(gp2, how = "left", on = "province") 335 | 336 | gp2 = allData.groupby("model")["adidmd5"].nunique().reset_index() 337 | gp2.columns = ["model","model_adidmd_nuni"] 338 | allData = allData.merge(gp2, how = "left", on = "model") 339 | 340 | gp2 = allData.groupby("make")["adidmd5"].nunique().reset_index() 341 | gp2.columns = ["make","make_adidmd_nuni"] 342 | allData = allData.merge(gp2, how = "left", on = "make") 343 | 344 | del gp1 345 | del gp2 346 | gc.collect() 347 | 348 | #TODO:进行每个特征中不同取值的数量 ----2019.8.5 349 | city_cnt = allData["city"].value_counts().to_dict() 350 | allData["city_cnt"] = allData["city"].map(city_cnt) 351 | 352 | model_cnt = allData["model"].value_counts().to_dict() 353 | allData["model_cnt"] = allData["model"].map(model_cnt) 354 | 355 | make_cnt = allData["make"].value_counts().to_dict() 356 | allData["make_cnt"] = allData["make"].map(make_cnt) 357 | 358 | ip_cnt = allData["ip"].value_counts().to_dict() 359 | allData["ip_cnt"] = allData["ip"].map(ip_cnt) 360 | 361 | reqrealip_cnt = allData["reqrealip"].value_counts().to_dict() 362 | allData["reqrealip_cnt"] = allData["reqrealip"].map(reqrealip_cnt) 363 | 364 | osv_cnt = allData["osv"].value_counts().to_dict() 365 | allData["osv_cnt"] = allData["osv"].map(osv_cnt) 366 | 367 | #TODO:ratio特征构造 368 | 369 | #TODO:关于mean特征的构造 370 | for fe in []: 371 | pass 372 | 373 | 374 | 375 | 376 | 377 | #TODO:关于std特征的构造 378 | 379 | #TODO:交叉特征 380 | feat_1 = ["adunitshowid","mediashowid","adidmd5"] 381 | feat_2 = ["apptype","city","ip","reqrealip","province","model","dvctype","make","ntt","carrier","osv","lan"] 382 | cross_feat = [] 383 | for fe_1 in feat_1: 384 | for fe_2 in feat_2: 385 | col_name = "cross_"+fe_1+"_and_"+fe_2 386 | cross_feat.append(col_name) 387 | allData[col_name] = allData[fe_1].astype(str).values + "_" + allData[fe_2].astype(str).values 388 | 389 | #TODO:对交叉特征进行计数 --- 2019.8.7 390 | for fe in cross_feat: 391 | 
locals()[fe+"_cnt"] = allData[fe].value_counts().to_dict() 392 | allData[fe+"_cnt"] = allData[fe].map(locals()[fe+"_cnt"]) 393 | 394 | 395 | for fe in cross_feat: 396 | le_feat = LabelEncoder() 397 | le_feat.fit(allData[fe]) 398 | allData[fe] = le_feat.transform(allData[fe]) 399 | 400 | #先对连续特征进行归一化等操作 401 | continue_feats = ["h","w","ppi","nginxtime","cost_time"] 402 | #需要进行类别编码的特征 403 | label_encoder_feats = ["ver","lan","pkgname","adunitshowid","mediashowid","apptype","city","province","ip","reqrealip","adidmd5","imeimd5","idfamd5","openudidmd5","macmd5","model","make","osv"] 404 | #需要进行独热编码的特征 405 | cate_feats = ["dvctype","ntt","carrier"] 406 | 407 | #TODO:2019.8.8添加 408 | for cf in list(set(cate_feats+["ver","lan","adunitshowid","mediashowid","apptype","adidmd5","imeimd5","idfamd5","openudidmd5","macmd5"])): 409 | locals()[cf+"_cnt"] = allData[cf].value_counts().to_dict() 410 | allData[cf+"_cnt"] = allData[cf].map(locals()[cf+"_cnt"]) 411 | 412 | #labelencoder转化 413 | for fe in ["pkgname","adunitshowid","mediashowid","apptype","province","ip","reqrealip","adidmd5","imeimd5","idfamd5","openudidmd5","macmd5","ver","lan"]: 414 | le_feat = LabelEncoder() 415 | le_feat.fit(allData[fe]) 416 | allData[fe] = le_feat.transform(allData[fe]) 417 | 418 | #由于city在类别编码时出现错误,所以我们就手动进行编码 419 | city_2_idx = dict(zip(list(set(allData["city"].unique())),range(len(list(set(allData["city"].unique())))))) 420 | 421 | #对model, make,osv 自定义编码方式,因为用类别编码的时候,出现了报错,主要是由于取值中含有不可识别字符,后期再进行处理一下 422 | model_2_idx = dict(zip(list(set(allData["model"].unique())),range(len(list(set(allData["model"].unique())))))) 423 | make_2_idx = dict(zip(list(set(allData["make"].unique())),range(len(list(set(allData["make"].unique())))))) 424 | osv_2_idx = dict(zip(list(set(allData["osv"].unique())),range(len(list(set(allData["osv"].unique())))))) 425 | 426 | allData["city"] = allData["city"].map(city_2_idx) 427 | allData["model"] = allData["model"].map(model_2_idx) 428 | allData["make"] = allData["make"].map(make_2_idx) 429 | allData["osv"] = allData["osv"].map(osv_2_idx) 430 | 431 | # #对连续变量进行简单的归一化处理 432 | for fe in continue_feats: 433 | temp_data = np.reshape(allData[fe].values,[-1,1]) 434 | mm = MinMaxScaler() 435 | mm.fit(temp_data) 436 | allData[fe] = mm.transform(temp_data) 437 | 438 | #对运营商,网络类型进行简单预处理 439 | allData['carrier'].value_counts() 440 | allData["carrier"] = allData["carrier"].map({0.0:0,-1.0:0,46000.0:1,46001.0:2,46003.0:3}).astype(int) 441 | allData["ntt"] = allData["ntt"].astype(int) 442 | allData["orientation"] = allData["orientation"].astype(int) 443 | allData["dvctype"] = allData["dvctype"].astype(int) 444 | 445 | 446 | #删除特征较低的部分特征 447 | allData.pop("idfamd5") 448 | allData.pop("adid_nuni_mediashowid") 449 | allData.pop("adid_nuni_apptype") 450 | allData.pop("medi_nuni_apptype") 451 | 452 | allData.pop('medi_nuni_adunitshowid') 453 | allData.pop('province_adidmd_nuni') 454 | 455 | allData.pop("android_nuni_osv") 456 | 457 | 458 | #2019.8.3添加删除的列 ---- 测试,貌似有点涨分 459 | allData.pop('province_medi_nuni') 460 | allData.pop('android_nuni_province') 461 | allData.pop('android_nuni_dvctype') 462 | allData.pop('android_nuni_carrier') 463 | 464 | #将处理后的数据保存起来 465 | print("saving the file.....") 466 | 467 | allData.to_csv("Data/allData.csv",index=False) 468 | end = time.time() 469 | print("saving finished. 
it costs {}".format(datetime.timedelta(seconds=end-start))) -------------------------------------------------------------------------------- /feature_enginee.py: -------------------------------------------------------------------------------- 1 | """-*- coding: utf-8 -*- 2 | DateTime : 2019/7/29 9:52 3 | Author : Peter_Bonnie 4 | FileName : feature_enginee.py 5 | Software: PyCharm 6 | """ 7 | 8 | import lightgbm as lgb 9 | import numpy as np 10 | import pandas as pd 11 | from sklearn.metrics import f1_score 12 | from sklearn.model_selection import KFold, StratifiedKFold 13 | from sklearn.feature_selection import SelectKBest, RFE, SelectFromModel, chi2 14 | from sklearn.ensemble import RandomForestClassifier 15 | from sklearn.linear_model import LogisticRegression 16 | from sklearn.model_selection import train_test_split 17 | from sklearn import metrics 18 | from xgboost.sklearn import XGBClassifier 19 | 20 | 21 | 22 | def modeling_cross_validation(params, X, y, nr_folds=5): 23 | # oof_preds = np.zeros(X.shape[0]) 24 | # Split data with kfold 25 | folds = KFold(n_splits=nr_folds, shuffle=False, random_state=4096) 26 | 27 | for fold_, (trn_idx, val_idx) in enumerate(folds.split(X, y)): 28 | print("fold n°{}".format(fold_ + 1)) 29 | trn_data = lgb.Dataset(X[trn_idx], y[trn_idx]) 30 | val_data = lgb.Dataset(X[val_idx], y[val_idx]) 31 | 32 | num_round = 20000 33 | clf = lgb.train(params, trn_data, num_round, valid_sets=[trn_data, val_data], verbose_eval=1000, 34 | early_stopping_rounds=100) 35 | val_pred = clf.predict(X[val_idx], num_iteration=clf.best_iteration) 36 | 37 | 38 | score = f1_score(val_pred, y) 39 | # score = mean_squared_error(oof_preds, target) 40 | 41 | return score 42 | 43 | def featureSelect_CV(train,columns,target): 44 | init_cols = columns 45 | params = {'num_leaves': 120, 46 | 'min_data_in_leaf': 30, 47 | 'objective': 'binary', 48 | 'max_depth': -1, 49 | 'learning_rate': 0.05, 50 | "min_child_samples": 30, 51 | "boosting": "gbdt", 52 | "feature_fraction": 0.9, 53 | "bagging_freq": 1, 54 | "bagging_fraction": 0.9, 55 | "bagging_seed": 11, 56 | "lambda_l1": 0.02, 57 | "verbosity": -1} 58 | best_cols = init_cols.copy() 59 | best_score = modeling_cross_validation(params, train[init_cols].values, target.values, nr_folds=5) 60 | print("初始CV score: {:<8.8f}".format(best_score)) 61 | save_remove_feat=[] #用于存储被删除的特征 62 | for f in init_cols: 63 | 64 | best_cols.remove(f) 65 | score = modeling_cross_validation(params, train[best_cols].values, target.values, nr_folds=5) 66 | diff = best_score - score 67 | print('-' * 10) 68 | if diff > 0.00002: 69 | print("当前移除特征: {}, CV score: {:<8.8f}, 最佳cv score: {:<8.8f}, 有效果,删除!!".format(f, score, best_score)) 70 | best_score = score 71 | save_remove_feat.append(f) 72 | else: 73 | print("当前移除特征: {}, CV score: {:<8.8f}, 最佳cv score: {:<8.8f}, 没效果,保留!!".format(f, score, best_score)) 74 | best_cols.append(f) 75 | print('-' * 10) 76 | print("优化后CV score: {:<8.8f}".format(best_score)) 77 | 78 | return best_cols,save_remove_feat 79 | 80 | class SelectFeature(object): 81 | 82 | def __init__(self, X, y, columns): 83 | 84 | self.X = X 85 | self.y = y 86 | self.cols = columns 87 | self.k = 130 88 | 89 | #TODO: 使用随机森林来选择特征 90 | def SelectFeatureByModel_1(self): 91 | lr_selector = SelectFromModel(estimator=LogisticRegression(penalty='l1')) 92 | lr_selector.fit(self.X.values, self.y) 93 | lr_support = lr_selector.get_support(indices=True) 94 | _ = lr_selector.get_support() 95 | 96 | save_feat = [] 97 | for i in list(lr_support): 98 | 
save_feat.append(self.X.columns[i]) 99 | 100 | return save_feat,_ 101 | 102 | def SelectFeatureByModel_2(self): 103 | rf_selector = SelectFromModel(estimator=RandomForestClassifier(n_estimators=100)) 104 | rf_selector.fit(self.X.values, self.y) 105 | _ = rf_selector.get_support() 106 | rf_support = rf_selector.get_support(indices=True) 107 | 108 | save_feat = [] 109 | for i in list(rf_support): 110 | save_feat.append(self.X.columns[i]) 111 | 112 | return save_feat, _ 113 | 114 | #TODO-2:利用方差来选择特征 115 | def SelectFeatureByVariance(self): 116 | pass 117 | 118 | #TODO-3:利用递归特征消除 119 | def SelectFeatureByRFE(self): 120 | rfe_selector = RFE(estimator=LogisticRegression(),n_features_to_select=self.k,step=10,verbose=5) 121 | rfe_selector.fit(self.X.values,self.y) 122 | rfe_support = rfe_selector.get_support(indices=True) 123 | _= rfe_selector.get_support() 124 | save_feat = [] 125 | for i in list(rfe_support): 126 | save_feat.append(self.X.columns[i]) 127 | 128 | return save_feat,_ 129 | 130 | #TODO-4:利用卡方检验来进行特征的选择 131 | def SelectFeatureByK(self): 132 | 133 | """可以根据scores_来查看每个特征的得分,分越高,表示越重要 134 | 或者根据p值,p值越小,表示置信度越高 135 | """ 136 | chi_selector = SelectKBest(chi2, k = self.k) 137 | chi_selector.fit(self.X.values, self.y) 138 | chi_support = chi_selector.get_support(indices=True) #返回被选择的特征所在的列 139 | _ = chi_selector.get_support() 140 | print(chi_support) 141 | save_feat = [] 142 | #获取对应的特征 143 | for i in list(chi_support): 144 | save_feat.append(self.X.columns[i]) 145 | 146 | return save_feat,_ 147 | 148 | #TODO-5 :皮尔逊相关系数 149 | def SelectFeatureByPerson(self): 150 | pass 151 | 152 | #TODO-6: 获取多个方法的交集 153 | def merge_Multi_Method(self, feat_name, chi_support, rf_support, rfe_support, lr_support): 154 | 155 | feature_selection_df = pd.DataFrame(columns={"features":feat_name, "Chi":chi_support,"rf":rf_support,"rfe":rfe_support,"lr":lr_support}) 156 | feature_selection_df["total"] = np.sum(feature_selection_df,axis=1) 157 | feature_selection_df = feature_selection_df.sort_values(["total","features"],ascending=False) 158 | feature_selection_df.index = range(1,len(feature_selection_df)+1) 159 | feature_selection_df.to_csv("drop_cols.csv") 160 | return feature_selection_df 161 | 162 | 163 | 164 | class XgboostFeature(): 165 | ##可以传入xgboost的参数,用来生成叶子节点特征 166 | ##常用传入特征的个数 即树的个数 默认30 167 | def __init__(self,n_estimators=30,learning_rate =0.3,max_depth=3,min_child_weight=1,gamma=0.3,subsample=0.8,colsample_bytree=0.8,objective= 'binary:logistic',nthread=4,scale_pos_weight=1,reg_alpha=1e-05,reg_lambda=1,seed=27): 168 | self.n_estimators=n_estimators 169 | self.learning_rate=learning_rate 170 | self.max_depth=max_depth 171 | self.min_child_weight=min_child_weight 172 | self.gamma=gamma 173 | self.subsample=subsample 174 | self.colsample_bytree=colsample_bytree 175 | self.objective=objective 176 | self.nthread=nthread 177 | self.scale_pos_weight=scale_pos_weight 178 | self.reg_alpha=reg_alpha 179 | self.reg_lambda=reg_lambda 180 | self.seed=seed 181 | print('Xgboost Feature start, new_feature number:',n_estimators) 182 | def mergeToOne(self,X,X2): 183 | X3=[] 184 | for i in range(X.shape[0]): 185 | tmp=np.array([list(X[i]),list(X2[i])]) 186 | X3.append(list(np.hstack(tmp))) 187 | X3=np.array(X3) 188 | return X3 189 | ##切割训练 190 | def fit_model_split(self,X_train,y_train,X_test,y_test): 191 | ##X_train_1用于生成模型 X_train_2用于和新特征组成新训练集合 192 | X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(X_train, y_train, test_size=0.6, random_state=0) 193 | clf = XGBClassifier( 194 | learning_rate 
=self.learning_rate, 195 | n_estimators=self.n_estimators, 196 | max_depth=self.max_depth, 197 | min_child_weight=self.min_child_weight, 198 | gamma=self.gamma, 199 | subsample=self.subsample, 200 | colsample_bytree=self.colsample_bytree, 201 | objective= self.objective, 202 | nthread=self.nthread, 203 | scale_pos_weight=self.scale_pos_weight, 204 | reg_alpha=self.reg_alpha, 205 | reg_lambda=self.reg_lambda, 206 | seed=self.seed) 207 | clf.fit(X_train_1, y_train_1) 208 | y_pre= clf.predict(X_train_2) 209 | y_pro= clf.predict_proba(X_train_2)[:,1] 210 | print("pred_leaf=T AUC Score : %f" % metrics.roc_auc_score(y_train_2, y_pro)) 211 | print("pred_leaf=T Accuracy : %.4g" % metrics.accuracy_score(y_train_2, y_pre)) 212 | new_feature= clf.apply(X_train_2) 213 | X_train_new2=self.mergeToOne(X_train_2,new_feature) 214 | new_feature_test= clf.apply(X_test) 215 | X_test_new=self.mergeToOne(X_test,new_feature_test) 216 | print("Training set of sample size 0.4 fewer than before") 217 | return X_train_new2,y_train_2,X_test_new,y_test 218 | ##整体训练 219 | def fit_model(self,X_train,y_train,X_test): 220 | clf = XGBClassifier( 221 | learning_rate =self.learning_rate, 222 | n_estimators=self.n_estimators, 223 | max_depth=self.max_depth, 224 | min_child_weight=self.min_child_weight, 225 | gamma=self.gamma, 226 | subsample=self.subsample, 227 | colsample_bytree=self.colsample_bytree, 228 | objective= self.objective, 229 | nthread=self.nthread, 230 | scale_pos_weight=self.scale_pos_weight, 231 | reg_alpha=self.reg_alpha, 232 | reg_lambda=self.reg_lambda, 233 | seed=self.seed) 234 | clf.fit(X_train, y_train) 235 | # y_pre= clf.predict(X_test) 236 | # y_pro= clf.predict_proba(X_test)[:,1] 237 | # print("pred_leaf=T AUC Score : %f" % metrics.roc_auc_score(y_test, y_pro)) 238 | # print("pred_leaf=T Accuracy : %.4g" % metrics.accuracy_score(y_test, y_pre)) 239 | new_feature= clf.apply(X_train) 240 | X_train_new=self.mergeToOne(X_train,new_feature) 241 | new_feature_test= clf.apply(X_test) 242 | X_test_new=self.mergeToOne(X_test,new_feature_test) 243 | print("Training set sample number remains the same") 244 | return X_train_new,y_train,X_test_new 245 | 246 | 247 | 248 | 249 | -------------------------------------------------------------------------------- /img/s: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /img/stacking.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jiangzhongkai/ifly-algorithm_challenge/3c8d71290b0e6fae63bfcf16eb82b3176ca941d9/img/stacking.jpg -------------------------------------------------------------------------------- /result_merge.py: -------------------------------------------------------------------------------- 1 | """-*- coding: utf-8 -*- 2 | DateTime : 2019/7/30 12:32 3 | Author : Peter_Bonnie 4 | FileName : result_merge.py 5 | Software: PyCharm 6 | """ 7 | 8 | import numpy as np 9 | import pandas as pd 10 | 11 | 12 | p3 = pd.read_csv("submissionCat201908132309.csv") 13 | p2 = pd.read_csv("submissionCat201908131957.csv") 14 | p1 = pd.read_csv("submissionCat201908141318.csv") 15 | 16 | 17 | 18 | 19 | s_sid = p1.pop("sid") 20 | 21 | cv_result = [] 22 | submit = [] 23 | cv_result = np.array(p1["label"].values.tolist()).reshape(-1,1) 24 | 25 | # cv_result = np.hstack((cv_result,np.array(p2["label"].values.tolist()).reshape(-1,1))) 26 | # cv_result = 
np.hstack((cv_result,np.array(p4["label"].values.tolist()).reshape(-1,1))) 27 | cv_result = np.hstack((cv_result,np.array(p2["label"].values.tolist()).reshape(-1,1))) 28 | cv_result = np.hstack((cv_result,np.array(p3["label"].values.tolist()).reshape(-1,1))) 29 | 30 | print(cv_result) 31 | 32 | for line in cv_result: 33 | submit.append(np.argmax(np.bincount(line))) 34 | 35 | df = pd.DataFrame() 36 | 37 | df["sid"] = s_sid 38 | df["label"] = submit 39 | df.to_csv("submitMerge.csv",index= False) -------------------------------------------------------------------------------- /search_param.py: -------------------------------------------------------------------------------- 1 | """-*- coding: utf-8 -*- 2 | DateTime : 2019/7/30 10:11 3 | Author : Peter_Bonnie 4 | FileName : baseline.py 5 | Software: PyCharm 6 | """ 7 | from __future__ import print_function,division 8 | import numpy as np 9 | import pandas as pd 10 | import matplotlib.pyplot as plt 11 | import seaborn as sns 12 | from sklearn.preprocessing import MinMaxScaler,StandardScaler,OneHotEncoder,LabelEncoder 13 | from sklearn.model_selection import GridSearchCV,StratifiedKFold,RandomizedSearchCV,train_test_split 14 | from sklearn.linear_model import LogisticRegression 15 | from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,ExtraTreesClassifier,\ 16 | AdaBoostClassifier,VotingClassifier 17 | from sklearn.metrics import f1_score 18 | import time 19 | import datetime 20 | import lightgbm as lgb 21 | from lightgbm import plot_importance 22 | import collections 23 | import xgboost as xgb 24 | from xgboost import plot_importance,to_graphviz 25 | from sklearn.feature_selection import SelectFromModel, VarianceThreshold 26 | from data_helper import * 27 | import catboost as cat 28 | import gc 29 | import warnings 30 | 31 | warnings.filterwarnings("ignore") 32 | 33 | """ 34 | 基本操作函数 35 | """ 36 | class Util(object): 37 | 38 | def __init__(self, model, cv_params, trainX, trainY, top_n_features, random_state): 39 | 40 | self.model = model 41 | self.cv_params = cv_params 42 | self.trainX = trainX 43 | self.trainY = trainY 44 | self.top_n_features = top_n_features 45 | self.random_state = random_state 46 | 47 | #TODO:选择最佳参数 48 | def search_best_params(self): 49 | 50 | if self.model == None: 51 | raise ValueError("model cant be None.") 52 | 53 | if self.cv_params == None or not isinstance(self.cv_params,dict): 54 | raise TypeError("the type of cv_params should be dict.") 55 | 56 | grid = GridSearchCV(estimator = self.model,param_grid = self.cv_params,scoring = "f1",n_jobs = -1,iid = True,cv = 3, verbose= 2) 57 | grid.fit(self.trainX, self.trainY) 58 | 59 | return grid.best_estimator_, grid.best_params_, grid.best_score_ 60 | 61 | #TODO:利用多个模型来选择最佳特征 62 | def get_top_n_features(self): 63 | 64 | # 随机森林 65 | rf_est = RandomForestClassifier(self.random_state) 66 | rf_param_grid = {'n_estimators': [500], 'min_samples_split': [2, 3], 'max_depth': [20]} 67 | rf_grid = GridSearchCV(rf_est, rf_param_grid, n_jobs=25, cv=10, verbose=1) 68 | rf_grid.fit(self.trainX, self.trainY) 69 | # 将feature按Importance排序 70 | feature_imp_sorted_rf = pd.DataFrame({'feature': list(self.trainX), 71 | 'importance': rf_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False) 72 | features_top_n_rf = feature_imp_sorted_rf.head(self.top_n_features)['feature'] 73 | print('Sample 25 Features from RF Classifier') 74 | print(str(features_top_n_rf[:60])) 75 | 76 | # AdaBoost 77 | ada_est = AdaBoostClassifier(self.random_state) 78 | 
ada_param_grid = {'n_estimators': [500], 'learning_rate': [0.5, 0.6]} 79 | ada_grid = GridSearchCV(ada_est, ada_param_grid, n_jobs=25, cv=10, verbose=1) 80 | ada_grid.fit(self.trainX, self.trainY) 81 | # 排序 82 | feature_imp_sorted_ada = pd.DataFrame({'feature': list(self.trainX), 83 | 'importance': ada_grid.best_estimator_.feature_importances_}).sort_values( 84 | 'importance', ascending=False) 85 | features_top_n_ada = feature_imp_sorted_ada.head(self.top_n_features)['feature'] 86 | 87 | # ExtraTree 88 | et_est = ExtraTreesClassifier(self.random_state) 89 | et_param_grid = {'n_estimators': [500], 'min_samples_split': [3, 4], 'max_depth': [15]} 90 | et_grid = GridSearchCV(et_est, et_param_grid, n_jobs=25, cv=10, verbose=1) 91 | et_grid.fit(self.trainX, self.trainY) 92 | # 排序 93 | feature_imp_sorted_et = pd.DataFrame({'feature': list(self.trainX), 94 | 'importance': et_grid.best_estimator_.feature_importances_}).sort_values( 95 | 'importance', ascending=False) 96 | features_top_n_et = feature_imp_sorted_et.head(self.top_n_features)['feature'] 97 | print('Sample 25 Features from ET Classifier:') 98 | print(str(features_top_n_et[:60])) 99 | 100 | # 将三个模型挑选出来的前features_top_n_et合并 101 | features_top_n = pd.concat([features_top_n_rf, features_top_n_ada, features_top_n_et], 102 | ignore_index=True).drop_duplicates() 103 | 104 | return features_top_n 105 | 106 | 107 | #TODO:多个模型进行投票 108 | def Vote_Model(self): 109 | 110 | #随机森林 111 | rf_est = RandomForestClassifier(n_estimators = 750, criterion = 'gini', max_features = 'sqrt', 112 | max_depth = 3, min_samples_split = 4, min_samples_leaf = 2, 113 | n_jobs = 50, random_state = 42, verbose = 1) 114 | #梯度增强 115 | gbm_est = GradientBoostingClassifier(n_estimators=900, learning_rate=0.0008, loss='exponential', 116 | min_samples_split=3, min_samples_leaf=2, max_features='sqrt', 117 | max_depth=3, random_state=42, verbose=1) 118 | #extraTree 119 | et_est = ExtraTreesClassifier(n_estimators=750, max_features='sqrt', max_depth=35, n_jobs=50, 120 | criterion='entropy', random_state=42, verbose=1) 121 | 122 | #lgb 123 | lgb_est = lgb.LGBMClassifier(boosting_type="gbdt",num_leaves=48, max_depth=-1, learning_rate=0.05, n_estimators=3000,\ 124 | subsample_for_bin=50000,objective="binary",min_split_gain=0, min_child_weight=5, min_child_samples=10,\ 125 | subsample=0.8,subsample_freq=1, colsample_bytree=1, reg_alpha=3,reg_lambda=5, seed= 2019,n_jobs=10,slient=True,num_boost_round=3000) 126 | 127 | #xgb 128 | # xgb_est = xgb.XGBClassifier() 129 | 130 | #融合模型 131 | voting_est = VotingClassifier(estimators = [('rf', rf_est),('gbm', gbm_est),('et', et_est),('lgb',lgb_est)], 132 | voting = 'soft', weights = [3,1.5,1.5,4], 133 | n_jobs = 50) 134 | voting_est.fit(self.trainX, self.trainY) 135 | 136 | return voting_est 137 | 138 | #利用stacking来进行融合 139 | class Ensemble(object): 140 | 141 | def __init__(self, n_splits, stacker, base_models): 142 | """ 143 | :param n_splits: 交叉选择的次数 144 | :param stacker: 最终融合模型 145 | :param base_models: 基模型 146 | """ 147 | self.n_splits = n_splits 148 | self.stacker = stacker 149 | self.base_models = base_models 150 | self.local_name = locals() #主要是用来生成动态变量 151 | 152 | def fir_predict(self, X, y, T): 153 | """ 154 | :param X: training X set 155 | :param y: training y set 156 | :param T: testing X set 157 | :return: 158 | """ 159 | X = np.array(X) 160 | y = np.array(y) 161 | T = np.array(T) 162 | 163 | folds = list(StratifiedKFold(n_splits = self.n_splits, shuffle = True, random_state = 2019).split(X,y)) 164 | S_train = 
np.zeros((X.shape[0],len(self.base_models))) 165 | S_test = np.zeros((T.shape[0],len(self.base_models))) 166 | 167 | for i ,clf in enumerate(self.base_models): 168 | self.local_name['S_test_%s'%i]= np.zeros((T.shape[0], self.n_splits)) 169 | 170 | for j, (train_idx,test_idx) in enumerate(folds): 171 | X_train = X[train_idx] 172 | y_train = y[train_idx] 173 | X_holdout = X[test_idx] 174 | y_holdout = y[test_idx] 175 | print("Fit Model %d fold %d"%(i,j)) 176 | clf.fit(X_train,y_train) 177 | y_pred = clf.predict(X_holdout) 178 | #查看下每一个模型在每一折的f1分数 179 | print("fold_{0} f1_score is :{1}".format(str(i),f1_score(y_holdout, y_pred))) 180 | S_train[test_idx,i] = y_pred 181 | self.local_name['S_test_%s'%i][:,j] = clf.predict(T) 182 | print(self.local_name['S_test_%s'%i][:j]) 183 | #这里进行投票的原则来选择 184 | temp_res = [] 185 | for line in self.local_name['S_test_%s'%i]: 186 | temp_res.append(np.argmax(np.bincount(line))) 187 | S_test[:,i] = temp_res 188 | del temp_res 189 | 190 | #训练第二层模型并进行预测 191 | self.stacker.fit(S_train, y) 192 | res = self.stacker.predict(S_test) 193 | return res 194 | 195 | #降低内存消耗 196 | def reduce_mem_usage(df, verbose=True): 197 | numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64','object'] 198 | start_mem = df.memory_usage().sum() / 1024**2 199 | for col in df.columns: 200 | col_type = df[col].dtypes 201 | if col_type in numerics: 202 | c_min = df[col].min() 203 | c_max = df[col].max() 204 | if str(col_type)[:3] == 'int': 205 | if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max: 206 | df[col] = df[col].astype(np.int8) 207 | elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max: 208 | df[col] = df[col].astype(np.int16) 209 | elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max: 210 | df[col] = df[col].astype(np.int32) 211 | elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max: 212 | df[col] = df[col].astype(np.int64) 213 | elif str(col_type)[:5] == 'float': 214 | if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max: 215 | df[col] = df[col].astype(np.float16) 216 | elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max: 217 | df[col] = df[col].astype(np.float32) 218 | else: 219 | df[col] = df[col].astype(np.float64) 220 | # else: 221 | # if len(df[col].unique()) / len(df[col]) < 0.05: 222 | # df[col] = df[col].astype("category") 223 | end_mem = df.memory_usage().sum() / 1024**2 224 | if verbose: 225 | print('Mem. 
usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem)) 226 | return df 227 | 228 | 229 | 230 | if __name__ == "__main__": 231 | 232 | trainData = pd.read_csv("Data/train_data.txt",delimiter='\t') 233 | trainData = my_fillna(trainData) 234 | 235 | trainData.pop("os") 236 | sid = trainData.pop("sid") 237 | label = trainData.pop("label") 238 | 239 | # first normalise the continuous features 240 | continue_feats = ["h", "w", "ppi", "nginxtime"] 241 | # features that need label encoding 242 | label_encoder_feats = ["ver", "lan", "pkgname", "adunitshowid", "mediashowid", "apptype", "city", "province", "ip", 243 | "reqrealip", "adidmd5", "imeimd5", "idfamd5", "openudidmd5", "macmd5", "model", "make", "osv"] 244 | # features that need one-hot encoding 245 | cate_feats = ["dvctype", "ntt", "carrier"] 246 | 247 | # LabelEncoder transformation 248 | for fe in ["pkgname", "adunitshowid", "mediashowid", "apptype", "province", "ip", "reqrealip", "adidmd5", "imeimd5", 249 | "idfamd5", "openudidmd5", "macmd5", "ver", "lan"]: 250 | le_feat = LabelEncoder() 251 | le_feat.fit(trainData[fe]) 252 | trainData[fe] = le_feat.transform(trainData[fe]) 253 | 254 | # city raised an error during label encoding, so encode it manually 255 | city_2_idx = dict(zip(list(set(trainData["city"].unique())), range(len(list(set(trainData["city"].unique())))))) 256 | 257 | # custom encoding for model, make and osv: LabelEncoder failed on them because some values contain unrecognisable characters; to be cleaned up later 258 | model_2_idx = dict(zip(list(set(trainData["model"].unique())), range(len(list(set(trainData["model"].unique())))))) 259 | make_2_idx = dict(zip(list(set(trainData["make"].unique())), range(len(list(set(trainData["make"].unique())))))) 260 | osv_2_idx = dict(zip(list(set(trainData["osv"].unique())), range(len(list(set(trainData["osv"].unique())))))) 261 | 262 | trainData["city"] = trainData["city"].map(city_2_idx) 263 | trainData["model"] = trainData["model"].map(model_2_idx) 264 | trainData["make"] = trainData["make"].map(make_2_idx) 265 | trainData["osv"] = trainData["osv"].map(osv_2_idx) 266 | 267 | # # simple min-max scaling of the continuous features 268 | for fe in continue_feats: 269 | temp_data = np.reshape(trainData[fe].values, [-1, 1]) 270 | mm = MinMaxScaler() 271 | mm.fit(temp_data) 272 | trainData[fe] = mm.transform(temp_data) 273 | 274 | # simple preprocessing of carrier and network type 275 | trainData["carrier"] = trainData["carrier"].map({0.0: 0, -1.0: 0, 46000.0: 1, 46001.0: 2, 46003.0: 3}).astype(int) 276 | trainData["ntt"] = trainData["ntt"].astype(int) 277 | trainData["orientation"] = trainData["orientation"].astype(int) 278 | trainData["dvctype"] = trainData["dvctype"].astype(int) 279 | 280 | 281 | trainX = trainData.values 282 | trainY = label 283 | 284 | cv_params = { 285 | "max_depth" : [i for i in range(3,11)], 286 | "n_estimators" : [i for i in range(1000,4001, 200)], 287 | "learning_rate":[0.1,0.2,0.01,0.02,0.03,0.04,0.05,0.001,0.002,0.003,0.004,0.005], 288 | "num_leaves" : [i for i in range(30, 128,8)] 289 | } 290 | 291 | model = lgb.LGBMClassifier(boosting_type="gbdt",learning_rate=0.05,n_estimators=3000, subsample_for_bin=50000,objective="binary",min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=0.8,subsample_freq=1, colsample_bytree=1, reg_alpha=3,reg_lambda=5, seed= 2019,n_jobs=10,silent=True,num_boost_round=3000) 292 | util = Util(model, cv_params=cv_params, trainX=trainX,trainY=trainY,top_n_features=None,random_state=None) 293 | print(util.search_best_params()) 294 | 295 | --------------------------------------------------------------------------------
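A closing note on the parameter search driven by the `__main__` block of search_param.py: the `cv_params` grid above spans roughly 8 × 16 × 12 × 13 ≈ 20,000 combinations, each refit with 3-fold cross-validation by `GridSearchCV`, which is rarely affordable on a dataset of this size. A cheaper option, sketched here with the `RandomizedSearchCV` class that the file already imports, is to sample a fixed number of settings from the same space; the `n_iter=50` budget and the estimator settings are assumptions, not tuned values from this repository.

```python
from sklearn.model_selection import RandomizedSearchCV
import lightgbm as lgb

# Same search space as the cv_params grid in search_param.py's __main__ block.
cv_params = {
    "max_depth": list(range(3, 11)),
    "n_estimators": list(range(1000, 4001, 200)),
    "learning_rate": [0.1, 0.2, 0.01, 0.02, 0.03, 0.04, 0.05, 0.001, 0.002, 0.003, 0.004, 0.005],
    "num_leaves": list(range(30, 128, 8)),
}

searcher = RandomizedSearchCV(
    estimator=lgb.LGBMClassifier(objective="binary", random_state=2019, n_jobs=10),
    param_distributions=cv_params,
    n_iter=50,            # assumed budget: 50 sampled settings instead of the full grid
    scoring="f1",
    cv=3,
    random_state=2019,
    verbose=2,
)
# searcher.fit(trainX, trainY)   # trainX / trainY as prepared in the __main__ block above
# print(searcher.best_params_, searcher.best_score_)
```

Because only a fixed number of settings is sampled, the cost no longer grows with the size of the grid, which makes it practical to widen the learning-rate and num_leaves ranges without re-budgeting the search.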