├── README.md
├── feature_a.py
└── lgb.py

/README.md:
--------------------------------------------------------------------------------
# jdata

Feature analysis and fully commented code for a top-11 finish in the jdata JD/China-Unicom risk-user competition.

This was my second data-mining competition. Looking back as a still-growing beginner, I got many details wrong at the time, so post-competition reflections and possible improvements are included below.

Competition link: https://jdata.jd.com/html/detail.html?id=3

### Feature engineering

Initial features: for each of train, testa, and testb there are three feature tables: sms.txt (SMS), voice.txt (calls), and wa.txt (web access).

#### sms table

##### `opp_num`

This is an anonymized feature that I simply dropped at the time — a real regret. With a little processing it could probably have added some score; for example, counting the n most frequent numbers should make a strong feature for risk users.

##### `opp_head`

Grouped by vid and opp_head to count each user's SMS activity per number prefix.

Post-competition reflection: the prefixes distinguish international from domestic numbers and carry operator information, so they could have been processed further.

##### `opp_len`

Grouped by vid and opp_len to count each user's SMS activity per number length.

##### `start_time`

Since this is a timestamp feature, it was converted to numbers and SMS counts were computed over different time buckets (late night, morning, afternoon, etc.), along with totals and means of sent and received messages.

Post-competition reflection: only in a later competition did I learn the diff trick of time deltas — first- and second-order time differences often work well. I was still too green!

##### `in_out`

A simple ratio of sent to received messages.

Post-competition reflection: more fine-grained statistics could have been built on in/out.

#### voice table

Many of the feature-extraction methods for the voice table are the same as for the `sms` table; below are the operations added on top of it.

`start_time`, `end_time`

From these two features the call duration can be computed, and a series of statistics built on it: for example, longest and shortest call, mean duration, number of calls shorter than some threshold, and so on. Per-user daily averages of call duration and call count can also be computed.

Post-competition reflection: as with sms, time-delta statistics apply here too. Sliding time windows are another trick I only discovered in later competitions. Experience really matters!
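The time-delta and sliding-window tricks mentioned in the reflections above can be sketched as follows. This is a minimal illustration on made-up data, not code from the competition: the column names (`vid`, `start_time`) follow the tables described here, while the 24-hour window, the toy records, and the all-ones `ci` helper column are assumptions for the demo.

```python
import pandas as pd

# Toy call log: vid plus an already-parsed timestamp (made-up data).
df = pd.DataFrame({
    "vid": ["u1", "u1", "u1", "u2", "u2"],
    "start_time": pd.to_datetime([
        "2018-07-01 09:00", "2018-07-01 09:05", "2018-07-02 23:30",
        "2018-07-01 08:00", "2018-07-03 08:00",
    ]),
})
df = df.sort_values(["vid", "start_time"])

# First- and second-order time deltas within each user, in seconds.
df["diff1"] = df.groupby("vid")["start_time"].diff().dt.total_seconds()
df["diff2"] = df.groupby("vid")["diff1"].diff()

# Per-user aggregates of the deltas become model features.
delta_feats = df.groupby("vid")["diff1"].agg(["mean", "std", "min", "max"])

# Sliding-window feature: calls per user within a rolling 24h window.
df["ci"] = 1  # hypothetical all-ones helper column, used only for counting
windowed = (
    df.set_index("start_time")
      .groupby("vid")["ci"]
      .rolling("24h")
      .sum()
)
print(delta_feats)
print(windowed)
```

The per-user aggregates (and the windowed counts, after an `unstack`-style reshape) could then be merged onto the feature table by `vid` like the other group statistics.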
#### wa table

##### `wa_name`

I one-hot encoded this directly at the time; it would have been better to one-hot only the top-n most frequent and top-n least frequent values, which avoids the curse of dimensionality.

##### `visit_cnt`

Sums, variance and similar quantities — standard stuff.

##### `wa_type`

Browsing-mode statistics: ratios, one-hot encoding and the like.

Post-competition reflection: more fine-grained statistics per wa_type might have helped a little, though that is not certain.

The remaining fields get much the same treatment. The code contains various pandas operations and is commented in detail — feel free to browse it.

--------------------------------------------------------------------------------
/feature_a.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Created on Sun Jul 29 21:04:48 2018

@author: YWZQ
"""

import pandas as pd
from pandas import merge
import re

testb_voice = pd.read_csv(r'.\voice_test_b.txt', sep='\t', header=None)
train_voice = pd.read_csv(r'.\voice_train.txt', sep='\t', header=None)

voice_cols = ['vid', 'opp_num', 'opp_head', 'opp_len', 'start_time', 'end_time', 'call_type', 'in_out']
testb_voice.columns = voice_cols
train_voice.columns = voice_cols
train_testb_voice = pd.concat([train_voice, testb_voice])

voice_data = train_testb_voice[['vid']]

# =================================================================
# Parse call start/end times (DDHHMMSS integers) into a call duration
# =================================================================
def minus_voice_time(df):
    start_day = int(df['start_time'] / 1000000)
    start_hour = int(df['start_time'] % 1000000 / 10000)
    start_min = int(df['start_time'] % 10000 / 100)
    start_second = int(df['start_time'] % 100)

    end_day = int(df['end_time'] / 1000000)
    end_hour = int(df['end_time'] % 1000000 / 10000)
    end_min = int(df['end_time'] % 10000 / 100)
    end_second = int(df['end_time'] % 100)

    # Convert both endpoints to total seconds before subtracting, so the
    # duration is correct even across midnight. (The original step-by-step
    # carry logic only added 24h for one day and ignored negative
    # minute/second differences, which could yield wrong durations.)
    start_total = ((start_day * 24 + start_hour) * 60 + start_min) * 60 + start_second
    end_total = ((end_day * 24 + end_hour) * 60 + end_min) * 60 + end_second
    df['time_long'] = end_total - start_total

    # Flag calls that end in the small hours (before 05:00)
    df['voice_xianshi'] = 1 if end_hour < 5 else 0
    return df

voice_longtime = train_testb_voice.apply(minus_voice_time, axis=1)

# =====================================================================
# Some rows contain the anonymized placeholder 'DDD'; clean it with a regex
# =====================================================================
def return_opp_head_class(s):
    s = str(s)
    if re.findall(r'DDD', s):
        return 132
    return s

voice_longtime['opp_head'] = voice_longtime['opp_head'].map(return_opp_head_class)
voice_longtime['opp_head'] = pd.to_numeric(voice_longtime['opp_head'])
print(voice_longtime.info())
voice_data = merge(voice_data, voice_longtime, on='vid', how='left')

# =======================================================================
# Per-user total call time for each opposite-number length
# =======================================================================
x_voice = voice_longtime
group_opp_len = x_voice['time_long'].groupby([x_voice['vid'], x_voice['opp_len']]).sum()
group_opp_len_unstack = group_opp_len.unstack()
print(group_opp_len_unstack)
print(group_opp_len_unstack.info())
group_opp_len_unstack.to_csv(r'.\voice\group_opp_len.csv')
group_opp_len_unstack = group_opp_len_unstack.reset_index('vid')
voice_data = merge(voice_data, group_opp_len_unstack, on='vid', how='left')

# =================================================================================
# Per-user total call time for each opposite-number prefix
# =================================================================================
group_opp_head = x_voice['time_long'].groupby([x_voice['vid'], x_voice['opp_head']]).sum()
group_opp_head_unstack = group_opp_head.unstack()
group_opp_head_unstack.fillna(0, inplace=True)
print(group_opp_head_unstack)
print(group_opp_head_unstack.info())
group_opp_head_unstack = group_opp_head_unstack.reset_index('vid')
group_opp_head_unstack.to_csv(r'.\group_opp_head.csv')
voice_data = merge(voice_data, group_opp_head_unstack, on='vid', how='left')

# ==============================================================================
# Per-user counts of each call type
# ==============================================================================
x_voice['ci'] = 1
group_calltype = x_voice['ci'].groupby([x_voice['vid'], x_voice['call_type']]).sum()
group_calltype_unstack = group_calltype.unstack()
group_calltype_unstack.fillna(0, inplace=True)
print(group_calltype_unstack)
print(group_calltype_unstack.isnull().sum().sum())
group_calltype_unstack.to_csv(r'.\group_calltype.csv')
group_calltype_unstack = group_calltype_unstack.reset_index('vid')

voice_data = merge(voice_data, group_calltype_unstack, on='vid', how='left')

# ================================================================================
# Incoming vs outgoing call counts
# ================================================================================
group_in_out = x_voice['ci'].groupby([x_voice['vid'], x_voice['in_out']]).sum()
group_in_out_unstack = group_in_out.unstack()
group_in_out_unstack.fillna(0, inplace=True)
print(group_in_out_unstack)
print(group_in_out_unstack.isnull().sum().sum())
group_in_out_unstack.to_csv(r'.\group_in_out.csv')
group_in_out_unstack = group_in_out_unstack.reset_index('vid')
voice_data = merge(voice_data, group_in_out_unstack, on='vid', how='left')

# ============================================================================
# Number of outgoing calls shorter than 6 seconds
# ============================================================================
def longtime_less5s(df):
    df['time_less6s'] = 0
    if (df['time_long'] < 6) and (df['in_out'] == 0):
        df['time_less6s'] = 1
    return df

voice_longtime_less5s = voice_longtime.apply(longtime_less5s, axis=1)
print(voice_longtime_less5s['time_less6s'].sum())

voice_less5s = voice_longtime_less5s['time_less6s'].groupby(voice_longtime_less5s['vid']).sum()
voice_less5s_pd = voice_less5s.to_frame().reset_index()
print(voice_less5s_pd)

voice_less5s_pd.to_csv(r'.\group_timelong_less5s.csv', index=None)
voice_data = merge(voice_data, voice_less5s_pd, on='vid', how='left')

# ==========================================================================
# Outgoing-call ratio
# ==========================================================================
def out_rate(df):
    # Columns '0' and '1' come from unstacking the in_out values above
    df['out_rate'] = df['0'] / (df['0'] + df['1'])
    df['pre_label'] = 0
    if (df['out_rate'] > 0.8) and ((df['0'] - df['1']) > 100):
        df['pre_label'] = 1
    return df

voice_out_rate = group_in_out_unstack.apply(out_rate, axis=1)
voice_out_rate.drop(['0', '1'], axis=1, inplace=True)
print(voice_out_rate)
print(voice_out_rate['pre_label'].sum())
voice_out_rate.to_csv(r'.\group_inoutrate.csv', index=None)
voice_data = merge(voice_data, voice_out_rate, on='vid', how='left')

# ==================
# SMS processing
# ==================

sms_train = pd.read_csv(r'.\sms_train.txt', sep='\t', header=None)
sms_testb = pd.read_csv(r'.\sms_test_b.txt', sep='\t', header=None)

sms = pd.concat((sms_train, sms_testb))

sms.columns = ['vid', 'opp_num', 'opp_head', 'opp_len', 'start_time', 'in_out']
sms['ci'] = 1
print(sms)
data_sms = sms[['vid']]  # was the bare list [['vid']], which breaks the merges below
group_opp_head = sms['ci'].groupby([sms['vid'], sms['opp_head']]).sum()
group_opp_head_unstack = group_opp_head.unstack()
group_opp_head_unstack.fillna(0, inplace=True)
print(group_opp_head_unstack)
group_opp_head_unstack.to_csv(r'E:\jdata\testb\sms\group_head.csv')
group_opp_head_unstack = group_opp_head_unstack.reset_index('vid')
data_sms = merge(data_sms, group_opp_head_unstack, on='vid', how='left')


group_opp_len = sms['ci'].groupby([sms['vid'], sms['opp_len']]).sum()
group_opp_len_unstack = group_opp_len.unstack()
group_opp_len_unstack.fillna(0, inplace=True)
print(group_opp_len_unstack)
group_opp_len_unstack.to_csv(r'E:\jdata\testb\sms\group_len.csv')
group_opp_len_unstack = group_opp_len_unstack.reset_index('vid')
data_sms = merge(data_sms, group_opp_len_unstack, on='vid', how='left')

group_in_out = sms['ci'].groupby([sms['vid'], sms['in_out']]).sum()
group_in_out_unstack = group_in_out.unstack()
group_in_out_unstack.fillna(0, inplace=True)
print(group_in_out_unstack)
group_in_out_unstack.to_csv(r'.\group_in_out.csv')
group_in_out_unstack = group_in_out_unstack.reset_index('vid')

data_sms = merge(data_sms, group_in_out_unstack, on='vid', how='left')

# ==================
# Web-access (wa) processing
# ==================
net_train = pd.read_csv(r'E:\jdata\train\wa_train.txt', sep='\t', header=None)
net_testb = pd.read_csv(r'E:\jdata\testb\testb_data\wa_test_b.txt', sep='\t', header=None)
net = pd.concat((net_train, net_testb))
net.columns = ['vid', 'net_name', 'vist_times', 'visit_time_long', 'up_flow', 'down_flos', 'watch_type', 'date']
data_net = net[['vid']]
group_name = net['vist_times'].groupby([net['vid'], net['net_name']]).sum()
group_name_unstack = group_name.unstack()
group_name_unstack.fillna(0, inplace=True)
group_name_unstack.to_csv(r'.\group_name.csv')
group_name_unstack = group_name_unstack.reset_index('vid')
data_net = merge(data_net, group_name_unstack, on='vid', how='left')


net['ci'] = 1
group_type = net['ci'].groupby([net['vid'], net['watch_type']]).sum()
net_group_type_unstack = group_type.unstack()
net_group_type_unstack.fillna(0, inplace=True)
print(net_group_type_unstack)
net_group_type_unstack.to_csv(r'.\group_type.csv')
net_group_type_unstack = net_group_type_unstack.reset_index('vid')
data_net = merge(data_net, net_group_type_unstack, on='vid', how='left')

# Assemble the final feature table: voice + sms + web features per vid
data = merge(voice_data, data_sms, on='vid', how='left')
data = merge(data, data_net, on='vid', how='left')
data.to_csv(r'.\data.csv')
--------------------------------------------------------------------------------
/lgb.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Created on Sun Jul 29 22:07:46 2018

@author: YWZQ
"""

import pandas as pd
from sklearn.model_selection import train_test_split
import lightgbm as lgb


testb_vid = pd.read_csv(r'.\testb_vid.csv', encoding='gbk')

data = pd.read_csv(r'.\train_x_y.csv', encoding='gbk')
testb = pd.read_csv(r'.\test_x.csv', encoding='gbk')


data.fillna(0, inplace=True)
testb.fillna(0, inplace=True)

print(data.isnull().sum().sum())

# .values replaces the long-deprecated DataFrame.as_matrix()
data_matrix = data.values
# Note: despite the "testa" names below, this is the test-B data
testa_matrix = testb.values

X = data_matrix[:, 2:]
y = data_matrix[:, 1]

testa_x = testa_matrix[:, 1:]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=16)

lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train, free_raw_data=False)

params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'num_leaves': 63,
    'learning_rate': 0.01,
    'feature_fraction': 0.6,
    'bagging_fraction': 0.7,
    'bagging_freq': 5,
    'verbose': 0,
    'lambda_l1': 0.5,
    'lambda_l2': 35,
    'min_data_in_leaf': 20,
    'min_split_gain': 0.1
}
print('model begin:\n')
gbm = lgb.train(params, lgb_train, num_boost_round=2642, verbose_eval=True,
                early_stopping_rounds=100, valid_sets=[lgb_train, lgb_eval])

print(len(y))
print('y:\n')
print(y[2400:2600])

# Predict on the full training set to eyeball the fit
y_predict = gbm.predict(X, num_iteration=gbm.best_iteration)
print('gbm_best_iteration:\n')
print(gbm.best_iteration)  # a single integer

print('y_predict:\n')
print(y_predict)
y_predict_list = list(y_predict)
print(y_predict_list[3900:4100])

# Training-set error rate at a 0.4 threshold
y_predict_label = [0 if i <= 0.4 else 1 for i in y_predict_list]
num = 0
y_len = len(y)
for i in range(y_len):
    if y[i] != y_predict_label[i]:
        num += 1
print('difference:\n')
print(num / y_len)


# Predict on test B (still named "testa" here, see note above)
testa_predict = gbm.predict(testa_x, num_iteration=gbm.best_iteration)
print('len of testb_predict:\n')
print(len(testa_predict))

testa_predict_list = list(testa_predict)
dict_testa = {'predict': testa_predict_list}
testa_pd = pd.DataFrame(dict_testa)
testa = pd.concat([testb_vid, testa_pd], axis=1)


testa_predict = testa.sort_values(by='predict', ascending=False)
testa_predict['predict'] = testa_predict['predict'].map(lambda x: 1 if x >= 0.45 else 0)
print(testa_predict)
testa_predict.to_csv(r'.\testa_predict_sort.csv', index=None, header=None)
--------------------------------------------------------------------------------