├── chizhu
│   ├── single_model
│   │   ├── config.py
│   │   ├── .DS_Store
│   │   ├── data
│   │   │   └── .DS_Store
│   │   ├── user_behavior.py
│   │   ├── get_nn_feat.py
│   │   ├── xgb.py
│   │   └── xgb_nb.py
│   ├── stacking
│   │   ├── .DS_Store
│   │   ├── all_feat
│   │   │   └── .DS_Store
│   │   └── nurbs_feat
│   │       ├── .DS_Store
│   │       ├── xgb_22.py
│   │       └── xgb__nurbs_nb.py
│   ├── readme.txt
│   └── util
│       ├── get_nn_res.py
│       └── bagging.py
├── nb_cz_lwl_wcm
│   ├── 运行说明.txt
│   ├── 2_get_feature_brand.py
│   ├── 6_get_feature_device_start_close.py
│   ├── 7_get_feature_w2v.py
│   ├── 13_last_get_all_feature.py
│   ├── 1_get_age_reg.py
│   └── 4_get_feature_device_start_close_tfidf_1_2.py
├── linwangli
│   ├── 融合思路.pptx
│   ├── result
│   │   └── .DS_Store
│   ├── readme.txt
│   └── code
│       ├── utils.py
│       ├── lgb_allfeat_22.py
│       └── lgb_allfeat_condProb.py
├── 2018易观A10大数据应用峰会-RNG_终极版.pptx
├── README.md
└── THLUO
    ├── 代码运行.bat
    ├── 28.final.py
    ├── readme.md
    ├── 24.thluo_22_lgb.py
    ├── 3.w2c_all_emb.py
    ├── 1.w2c_model_start.py
    ├── 2.w2c_model_close.py
    ├── 3.w2c_model_all.py
    ├── 3.device_quchong_start_app_w2c.py
    ├── 11.hcc_device_brand_age_sex.py
    ├── 25.thluo_22_xgb.py
    ├── 14.device_start_GRU_pred_age.py
    ├── 21.tfidf_lr_sex_age_prob_oof.py
    ├── 26.thluo_nb_lgb.py
    ├── 13.device_start_GRU_pred.py
    ├── 15.device_all_GRU_pred.py
    ├── 16.device_start_capsule_pred.py
    ├── 17.device_start_textcnn_pred.py
    ├── 19.device_start_lstm_pred.py
    └── 18.device_start_text_dpcnn_pred.py
/chizhu/single_model/config.py:
--------------------------------------------------------------------------------
1 | path = "/Users/chizhu/data/competition_data/易观/"
2 |
--------------------------------------------------------------------------------
/nb_cz_lwl_wcm/运行说明.txt:
--------------------------------------------------------------------------------
 1 | The Demo folder holds the raw dataset.
 2 | Run the scripts in numeric order (1, 2, 3, ...); the last step writes feature_nurbs.csv into the feature folder.
3 |
--------------------------------------------------------------------------------
/linwangli/融合思路.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chizhu/yiguan_sex_age_predict_1st_solution/HEAD/linwangli/融合思路.pptx
--------------------------------------------------------------------------------
/chizhu/stacking/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chizhu/yiguan_sex_age_predict_1st_solution/HEAD/chizhu/stacking/.DS_Store
--------------------------------------------------------------------------------
/linwangli/result/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chizhu/yiguan_sex_age_predict_1st_solution/HEAD/linwangli/result/.DS_Store
--------------------------------------------------------------------------------
/2018易观A10大数据应用峰会-RNG_终极版.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chizhu/yiguan_sex_age_predict_1st_solution/HEAD/2018易观A10大数据应用峰会-RNG_终极版.pptx
--------------------------------------------------------------------------------
/chizhu/single_model/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chizhu/yiguan_sex_age_predict_1st_solution/HEAD/chizhu/single_model/.DS_Store
--------------------------------------------------------------------------------
/chizhu/single_model/data/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chizhu/yiguan_sex_age_predict_1st_solution/HEAD/chizhu/single_model/data/.DS_Store
--------------------------------------------------------------------------------
/chizhu/stacking/all_feat/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chizhu/yiguan_sex_age_predict_1st_solution/HEAD/chizhu/stacking/all_feat/.DS_Store
--------------------------------------------------------------------------------
/chizhu/stacking/nurbs_feat/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chizhu/yiguan_sex_age_predict_1st_solution/HEAD/chizhu/stacking/nurbs_feat/.DS_Store
--------------------------------------------------------------------------------
/linwangli/readme.txt:
--------------------------------------------------------------------------------
1 | |—— code
2 |    |—— lgb_allfeat_22.py: trains LightGBM on [all features]
3 |    |—— lgb_allfeat_condProb.py: trains LightGBM on [all features + conditional probabilities]
4 |    |—— utils.py: helper functions, e.g. weighted ensembling / correlation evaluation of submissions
5 | |—— dataset
6 |    |—— deviceid_train.tsv: file provided by the organizers
7 |    |—— all_feat.csv: all features extracted by the team
8 | |—— result: holds the various submission files
9 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # yiguan_sex_age_predict_1st_solution
 2 | First-place solution for the 易观 (Analysys) sex & age prediction competition
 3 |
 4 | ##### [Competition link](https://www.tinymind.cn/competitions/43)
 5 | --------
 6 |
 7 | Team members worked separately and merged afterwards, so the feature files overlap across folders. The main approach is stacking different models: the generated features are very high-dimensional, and stacking predictions from different models reduces the dimensionality without losing too much information (a minimal sketch follows this file section).
 8 |
 9 | Run the code in the following order:
10 |
11 | * 1. Generate the feature files
12 |
13 | > Follow the run instructions in the nb_cz_lwl_wcm folder and run all of its scripts to produce feature_one.csv
14 | > Follow the run instructions in the THLUO folder and run its scripts to produce thluo_train_best_feat.csv
15 |
16 | * 2. Weight the models
17 | Note: the model outputs live in the linwangli folder
18 |
19 | > Running all the code in the THLUO folder produces thluo_prob
20 | > With the models in linwangli/code and the feature files above you can produce the corresponding probability files; see the fusion PPT (融合思路.pptx) in the linwangli folder for the weighting scheme
21 |
22 |
23 |
24 |
25 | CONTRIBUTORS:[THLUO](https://github.com/THLUO) [WangliLin](https://github.com/WangliLin) [Puck Wang](https://github.com/PuckWong) [chizhu](https://github.com/chizhu) [NURBS](https://github.com/suncostanx)
26 |
27 |
28 |
29 |
30 |
31 |
32 |
--------------------------------------------------------------------------------
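As a minimal sketch of the stacking step the README describes, the snippet below shows how out-of-fold probabilities from a level-1 model become low-dimensional meta-features for a level-2 model. It is an illustration only, not the repo's exact pipeline; `model`, `X`, `y`, and `X_test` are placeholders.

```python
# Minimal stacking sketch (illustrative; not the exact pipeline in this repo).
import numpy as np
from sklearn.model_selection import StratifiedKFold

def oof_probabilities(model, X, y, X_test, n_classes, n_splits=5, seed=2018):
    """Return out-of-fold train probabilities and fold-averaged test probabilities."""
    oof = np.zeros((len(X), n_classes))
    test_prob = np.zeros((len(X_test), n_classes))
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, va_idx in folds.split(X, y):
        model.fit(X[tr_idx], y[tr_idx])
        oof[va_idx] = model.predict_proba(X[va_idx])         # meta-features for train
        test_prob += model.predict_proba(X_test) / n_splits  # meta-features for test
    return oof, test_prob
```

Each level-1 model contributes only 22 probability columns (one per sex-age class), which is how stacking acts as the dimensionality reduction the README mentions.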
/chizhu/readme.txt:
--------------------------------------------------------------------------------
 1 | |-single_model/
 2 |    |-data/ processed features and data
 3 |    |-model/ model files
 4 |    |-submit model probability files, usable as stacking inputs
 5 |    |-config.py configures the raw-data path
 6 |    |-user_behavior.py builds the user_behavior feature set
 7 |    |-get_nn_feat.py builds the statistical-feature input for the NNs
 8 |    |-lgb.py
 9 |    |-xgb.py
10 |    |-xgb_nb.py conditional probability
11 |    |-cnn.py
12 |    |-deepnn.py
13 |    |-yg_best_nn.py
14 | |-stacking/
15 |    |-all_feat/ conditional-probability xgb on all probability files
16 |    |-nurbs_feat/ 22-class and conditional-probability xgb on the nurbs probability files
17 |       |-xgb_nurbs_nb.py conditional probability
18 |       |-xgb_22.py 22-class classification
19 | |-util/
20 |    |-bagging.py weighted-fusion script
21 |    |-get_nn_res.py builds the NN probability file and a submittable result
22 |
23 |
24 | Usage:
25 | single_model: 1) first set the raw-data path in config.py
26 |               2) run user_behavior.py
27 |               3) run get_nn_feat.py
28 |               4) then run the nn / tree models one by one; the probability files land in submit/
29 |
30 | stacking: cannot be run directly, because it needs the probability files (about 2 GB, not included; ask us for them)
31 | util: for the weighting; average the results of xgb_22.py and xgb_nurbs_nb.py under stacking/nurbs_feat to get xgb_22_nb.csv (see the sketch after this file section)
32 |
33 |
--------------------------------------------------------------------------------
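A minimal sketch of the averaging step described in the readme above. The input file names follow the readme and the xgb_22.py script shown later in this dump (`xgb_nurbs_22.csv`); the name of the conditional-probability output (`xgb_nurbs_nb.csv`) is an assumption.

```python
# Sketch: average the two probability files from stacking/nurbs_feat into xgb_22_nb.csv.
# File names per the readme; the nb output name is assumed. Adjust paths as needed.
import pandas as pd

p22 = pd.read_csv("xgb_nurbs_22.csv")  # output of xgb_22.py (22-class)
pnb = pd.read_csv("xgb_nurbs_nb.csv")  # assumed output of xgb_nurbs_nb.py (conditional probability)
avg = p22.copy()
cols = [c for c in p22.columns if c != "DeviceID"]
avg[cols] = (p22[cols] + pnb[cols]) / 2.0  # simple mean of the 22 class probabilities
avg.to_csv("xgb_22_nb.csv", index=False)
```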
/nb_cz_lwl_wcm/2_get_feature_brand.py:
--------------------------------------------------------------------------------
1 | # -*- coding:utf-8 -*-
2 |
3 | import pandas as pd
4 | import numpy as np
5 | from sklearn import preprocessing
6 |
7 | train = pd.read_csv('Demo/deviceid_train.tsv', sep='\t', header=None)
8 | test = pd.read_csv('Demo/deviceid_test.tsv', sep='\t', header=None)
9 |
10 | data_all = pd.concat([train, test], axis=0)
11 | data_all = data_all.rename({0:'id'}, axis=1)
12 | del data_all[1],data_all[2]
13 | deviced_brand = pd.read_csv('Demo/deviceid_brand.tsv', sep='\t', header=None)
14 | deviced_brand = deviced_brand.rename({0: 'id'}, axis=1)
15 | data_all = pd.merge(data_all, deviced_brand, on='id', how='left')
16 | print(data_all)
17 | # simple label-encoded categorical features
18 |
19 | feature = pd.DataFrame()
20 | label_encoder = preprocessing.LabelEncoder()
21 | feature['phone_type'] = label_encoder.fit_transform(data_all[1].astype(str))  # astype(str) folds the NaNs from the left merge into one category
22 | feature['phone_type_detail'] = label_encoder.fit_transform(data_all[2].astype(str))
23 | feature.to_csv('feature/deviceid_brand_feature.csv', index=False)
--------------------------------------------------------------------------------
/THLUO/代码运行.bat:
--------------------------------------------------------------------------------
1 | python 1.w2c_model_start.py
2 | python 2.w2c_model_close.py
3 | python 3.w2c_model_all.py
4 | python 3.device_quchong_start_app_w2c.py
5 | python 3.w2c_all_emb.py
6 | python 4.device_age_prob_oof.py
7 | python 5.device_sex_prob_oof.py
8 | python 6.start_close_age_prob_oof.py
9 | python 7.start_close_sex_prob_oof.py
10 | python 9.sex_age_bin_prob_oof.py
11 | python 10.age_bin_prob_oof.py
12 | python 11.hcc_device_brand_age_sex.py
13 | python 12.device_age_regression_prob_oof.py
14 | python 13.device_start_GRU_pred.py
15 | python 14.device_start_GRU_pred_age.py
16 | python 15.device_all_GRU_pred.py
17 | python 16.device_start_capsule_pred.py
18 | python 17.device_start_textcnn_pred.py
19 | python 18.device_start_text_dpcnn_pred.py
20 | python 19.device_start_lstm_pred.py
21 | python 20.lgb_sex_age_prob_oof.py
22 | python 21.tfidf_lr_sex_age_prob_oof.py
23 | python 22.base_feat.py
24 | python 23.ATT_v6.py
25 | python 24.thluo_22_lgb.py
26 | python 25.thluo_22_xgb.py
27 | python 26.thluo_nb_lgb.py
28 | python 27.thluo_nb_xgb.py
29 | python 28.final.py
--------------------------------------------------------------------------------
/THLUO/28.final.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | import numpy as np
8 | import pandas as pd
9 |
10 |
11 | # In[2]:
12 |
13 |
14 | th_22_results_lgb = pd.read_csv('th_22_results_lgb.csv')
15 | th_22_results_xgb = pd.read_csv('th_22_results_xgb.csv')
16 | th_lgb_nb = pd.read_csv('th_lgb_nb.csv')
17 | th_xgb_nb = pd.read_csv('th_xgb_nb.csv')
18 |
19 |
20 | # In[5]:
21 |
22 |
23 | # direct 22-class route: blend lgb and xgb with weights 0.55 / 0.45
24 | results_22 = pd.DataFrame(th_22_results_lgb.values[:,1:] * 0.55 + th_22_results_xgb.values[:,1:] * 0.45)
25 | results_22.columns = th_22_results_lgb.columns[1:]
26 | results_22['DeviceID'] = th_22_results_lgb['DeviceID']
27 |
28 |
29 | # In[6]:
30 |
31 |
32 | # conditional-probability route: blend xgb and lgb with weights 0.65 / 0.35
33 | results_nb = pd.DataFrame(th_xgb_nb.values[:,1:] * 0.65 + th_lgb_nb.values[:,1:] * 0.35)
34 | results_nb.columns = th_xgb_nb.columns[1:]
35 | results_nb['DeviceID'] = th_xgb_nb['DeviceID']
36 |
37 |
38 | # In[ ]:
39 |
40 |
41 | # blend the two routes with a final weighting
42 | results_final = pd.DataFrame(results_22.values[:,1:] * 0.65 + results_nb.values[:,1:] * 0.35)
43 | results_final.columns = results_22.columns[1:]
44 | results_final['DeviceID'] = results_22['DeviceID']
45 |
46 |
47 | # In[ ]:
48 |
49 |
50 | results_final.to_csv('result/thluo_final.csv', index=None)
51 |
52 |
--------------------------------------------------------------------------------
/chizhu/util/get_nn_res.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | path = "/Users/chizhu/data/competition_data/易观/"
 3 | res1 = pd.read_csv(path+"res1.csv")
 4 | res2_1 = pd.read_csv(path+"res2_1.csv")
 5 | res2_2 = pd.read_csv(path+"res2_2.csv")
 6 | res1.index = range(len(res1))
 7 | res2_1.index = range(len(res2_1))
 8 | res2_2.index = range(len(res2_2))
 9 | final_1 = res2_1.copy()
10 | final_2 = res2_2.copy()
11 | # combine p(sex) with p(age|sex) for the 11 age buckets: p(sex, age) = p(sex) * p(age|sex)
12 | for i in range(11):
13 |     final_1[str(i)] = res1['sex1']*res2_1[str(i)]
14 |     final_2[str(i)] = res1['sex2']*res2_2[str(i)]
15 | # pred must be read before its DeviceID column is used (it was referenced before assignment)
16 | pred = pd.read_csv(path+"nn_feat_v6.csv")
17 | final = pred[['DeviceID']].copy()
18 | final.index = range(len(final))
19 | final_pred = pd.concat([final_1, final_2], axis=1)
20 | final = pd.concat([final, final_pred], axis=1)
21 | final.columns = ['DeviceID', '1-0', '1-1', '1-2', '1-3', '1-4', '1-5', '1-6',
22 |                  '1-7', '1-8', '1-9', '1-10', '2-0', '2-1', '2-2', '2-3', '2-4',
23 |                  '2-5', '2-6', '2-7', '2-8', '2-9', '2-10']
24 |
25 | final.to_csv(path+'nn_feat_v12.csv', index=False)
26 |
27 | train = pd.read_csv(path+"deviceid_train.tsv", sep="\t",
28 |                     names=["id", "sex", "age"])
29 | test = pd.read_csv(path+"deviceid_test.tsv", sep="\t", names=['DeviceID'])
30 |
31 | sub = pd.merge(test, pred, on="DeviceID", how="left")
32 |
33 | sub.to_csv(path+"nn_v6.csv", index=False)
--------------------------------------------------------------------------------
/chizhu/util/bagging.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pandas as pd
3 | path = "/Users/chizhu/data/competition_data/易观/"
4 | os.listdir(path)
5 |
6 | train = pd.read_csv(path+"deviceid_train.tsv", sep="\t",
7 | names=["id", "sex", "age"])
8 | test = pd.read_csv(path+"deviceid_test.tsv", sep="\t", names=['DeviceID'])
9 | pred = pd.read_csv(path+"nn_feat_v6.csv")
10 |
11 | lgb1 = pd.read_csv(path+"th_results_ems_22_nb_5400.csv") # 576
12 | lgb1 = pd.merge(test, lgb1, on="DeviceID", how="left")
13 | submit = lgb1.copy()
14 |
15 | nn1 = pd.read_csv(path+"xgb_and_nurbs.csv") # 573
16 | nn1 = pd.merge(test, nn1, on="DeviceID", how="left")
17 |
18 | # nn2=pd.read_csv(path+"th_results_ems_2547.csv")##574
19 | # nn2=pd.merge(test,nn2,on="DeviceID",how="left")
20 |
21 | # lgb2=pd.read_csv(path+"th_results_ems_2.549.csv")##570
22 | # lgb2=pd.merge(test,lgb2,on="DeviceID",how="left")
23 |
24 | # lgb3=pd.read_csv(path+"th_results_ems_2547.csv")##547
25 | # lgb3=pd.merge(test,lgb3,on="DeviceID",how="left")
26 |
27 |
28 | for i in['1-0', '1-1', '1-2', '1-3', '1-4', '1-5', '1-6',
29 | '1-7', '1-8', '1-9', '1-10', '2-0', '2-1', '2-2', '2-3', '2-4',
30 | '2-5', '2-6', '2-7', '2-8', '2-9', '2-10']:
31 | # submit[i]=(lgb1[i]+lgb2[i]+nn1[i]+nn2[i])/4.0
32 | submit[i] = 0.75*lgb1[i]+0.25*nn1[i]
33 | # submit[i]=0.1*lgb1[i]+0.1*nn1[i]+0.2*nn2[i]+0.2*lgb2[i]+0.4*lgb3[i]
34 |
35 | submit.to_csv(path+"th_nurbs_7525.csv", index=False)
36 |
--------------------------------------------------------------------------------
/linwangli/code/utils.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 |
4 | def weights_ensemble(results, weights):
 5 |     '''
 6 |     Weighted ensembling of model results for this competition.
 7 |     results: list of paths to the result files to blend
 8 |     weights: list of weights, one per result
 9 |     return: a DataFrame that can be written out with to_csv and submitted directly
10 |     '''
11 | for i in range(len(results)):
12 | if i == 0:
13 | sub = pd.read_csv(results[0])
14 | final_cols = list(sub.columns)
15 | cols = list(sub.columns)
16 | cols[1:] = [col + '_0' for col in cols[1:]]
17 | sub.columns = cols
18 | else:
19 | result = pd.read_csv(results[i])
20 | cols = list(result.columns)
21 | cols[1:] = [col + '_' + str(i) for col in cols[1:]]
22 | result.columns = cols
23 | sub = pd.merge(left=sub, right=result, on='DeviceID')
24 | for i in range(len(weights)):
25 | for col in final_cols[1:]:
26 | if col not in sub.columns:
27 | sub[col] = weights[i] * sub[col + '_' + str(i)]
28 | else:
29 | sub[col] = sub[col] + weights[i] * sub[col + '_' + str(i)]
30 | sub = sub[final_cols]
31 | return sub
32 |
33 | def result_corr(path1, path2):
34 |     '''
35 |     Measures the correlation between two submission files for this competition.
36 |     path1: path to result 1
37 |     path2: path to result 2
38 |     return: the mean per-class correlation between the two submissions
39 |     '''
40 | result_1 = pd.read_csv(path1)
41 | result_2 = pd.read_csv(path2)
42 | result = pd.merge(left=result_1, right=result_2, on='DeviceID', suffixes=('_x', '_y'))
43 | cols = result_1.columns[1:]
44 | col_list = []
45 | for col in cols:
46 | col_pair = [col + '_x', col + '_y']
47 | col_list.append(result[col_pair].corr().loc[col + '_x', col + '_y'])
48 |
49 | return np.mean(col_list)
--------------------------------------------------------------------------------
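For reference, a hypothetical usage of the two helpers above; the csv names and weights are placeholders, not files shipped with the repo.

```python
# Hypothetical usage of weights_ensemble / result_corr; the csv names are placeholders.
sub = weights_ensemble(
    results=["lgb_22.csv", "lgb_condProb.csv", "thluo_final.csv"],
    weights=[0.4, 0.3, 0.3],
)
sub.to_csv("ensemble_submission.csv", index=False)

# A lower correlation between two submissions usually means they blend better.
print(result_corr("lgb_22.csv", "thluo_final.csv"))
```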
/nb_cz_lwl_wcm/6_get_feature_device_start_close.py:
--------------------------------------------------------------------------------
1 | # -*- coding:utf-8 -*-
2 |
3 | import pandas as pd
4 | import numpy as np
5 | from sklearn import preprocessing
6 |
7 | train = pd.read_csv('Demo/deviceid_train.tsv', sep='\t', header=None)
8 | test = pd.read_csv('Demo/deviceid_test.tsv', sep='\t', header=None)
9 |
10 | data_all = pd.concat([train, test], axis=0)
11 | data_all = data_all.rename({0:'id'}, axis=1)
12 | del data_all[1],data_all[2]
13 |
14 | start_close_time = pd.read_csv('Demo/deviceid_package_start_close.tsv', sep='\t', header=None)
15 | start_close_time = start_close_time.rename({0:'id', 1:'app_name', 2:'start_time', 3:'close_time'}, axis=1)
16 |
17 |
18 | start_close_time['diff_time'] = (start_close_time['close_time'] - start_close_time['start_time'])/1000
19 |
20 | print('converting timestamps to hours')
21 | import time
22 | start_close_time['close_time'] = start_close_time['close_time'].apply(lambda row: int(time.localtime(row/1000).tm_hour))
23 | start_close_time['start_time'] = start_close_time['start_time'].apply(lambda row: int(time.localtime(row/1000).tm_hour))
24 |
25 | # total number of start/close events per device
26 | print('counting events per device')
27 | feature = pd.DataFrame()
28 | feature['start_close_count'] = pd.merge(data_all, start_close_time.groupby('id').size().reset_index(), on='id', how='left')[0]
29 |
30 | # usage count between 00:00 and 05:00
31 | temp = start_close_time[(start_close_time['close_time'] >=0)&(start_close_time['close_time'] <=5)]
32 | temp = temp.groupby('id').size().reset_index()
33 | feature['zero_five_count'] = pd.merge(data_all, temp, on='id', how='left').fillna(0)[0]
34 |
35 | # label-encode the app with the longest total usage time
36 | def get_max_label(row):
37 | row_name = list(row['app_name'])
38 | row_diff_time = list(row['diff_time'])
39 | return row_name[np.argmax(row_diff_time)]
40 |
41 | start_close_max_name = start_close_time.groupby('id').apply(lambda row:get_max_label(row)).reset_index()
42 | label_encoder = preprocessing.LabelEncoder()
43 | feature['start_close_max_name'] = label_encoder.fit_transform(pd.merge(data_all, start_close_max_name, on='id', how='left').fillna(0)[0])
44 |
45 | feature.to_csv('feature/feature_start_close.csv', index=False)
--------------------------------------------------------------------------------
/nb_cz_lwl_wcm/7_get_feature_w2v.py:
--------------------------------------------------------------------------------
1 | from gensim.models import Word2Vec
2 | import pandas as pd
3 | path="Demo/"
4 | packages = pd.read_csv(path+"deviceid_packages.tsv",
5 | sep="\t", names=['id', 'app_list'])
6 | packages['app_count'] = packages['app_list'].apply(
7 | lambda x: len(x.split(",")), 1)
8 | documents = packages['app_list'].values.tolist()
9 | texts = [[word for word in str(document).split(',')] for document in documents]
10 | # frequency = defaultdict(int)
11 | # for text in texts:
12 | # for token in text:
13 | # frequency[token] += 1
14 | # texts = [[token for token in text if frequency[token] >= 5] for text in texts]
15 | w2v = Word2Vec(texts, size=128, window=10, iter=45,
16 | workers=12, seed=1017, min_count=5)
17 | w2v.wv.save_word2vec_format('./w2v_128.txt')
18 |
19 | import gensim
20 | import numpy as np
21 |
22 |
23 | def get_w2v_avg(text, w2v_out_path, word2vec_Path):
24 | texts = []
25 | w2v_dim = 128
26 | data = text
27 | # data = pd.read_csv(text_path)
28 | data['app_list'] = data['app_list'].apply(
29 | lambda x: x.strip().split(","), 1)
30 | texts = data['app_list'].values.tolist()
31 |
32 | model = gensim.models.KeyedVectors.load_word2vec_format(
33 | word2vec_Path, binary=False)
34 | vacab = model.vocab.keys()
35 |
36 | w2v_feature = np.zeros((len(texts), w2v_dim))
37 | w2v_feature_avg = np.zeros((len(texts), w2v_dim))
38 |
39 | for i, line in enumerate(texts):
40 | num = 0
41 |         if not line:  # empty app list (the original compared a list against '')
42 | w2v_feature_avg[i, :] = np.zeros(w2v_dim)
43 | else:
44 | for word in line:
45 | num += 1
46 | vec = model[word] if word in vacab else np.zeros(w2v_dim)
47 | w2v_feature[i, :] += vec
48 | w2v_feature_avg[i, :] = w2v_feature[i, :] / num
49 | w2v_avg = pd.DataFrame(w2v_feature_avg)
50 | w2v_avg.columns = ['w2v_avg_' + str(i) for i in w2v_avg.columns]
51 |     w2v_avg['DeviceID'] = data['id']  # named DeviceID so 13_last_get_all_feature.py can drop it
52 | w2v_avg.to_csv(w2v_out_path, encoding='utf-8', index=None)
53 | return w2v_avg
54 |
55 |
56 | w2v_feat = get_w2v_avg(packages, "feature/w2v_avg.csv", "w2v_128.txt")
57 |
58 |
59 |
--------------------------------------------------------------------------------
/nb_cz_lwl_wcm/13_last_get_all_feature.py:
--------------------------------------------------------------------------------
1 | # -*- coding:utf-8 -*-
2 |
3 | import pandas as pd
4 |
5 | df_brand = pd.read_csv('feature/deviceid_brand_feature.csv')
6 | df_lr = pd.read_csv('feature/tfidf_lr_error_single_classfiy.csv')
7 | df_pac = pd.read_csv('feature/tfidf_pac_error_single_classfiy.csv')
8 | df_sgd = pd.read_csv('feature/tfidf_sgd_error_single_classfiy.csv')
9 | df_ridge = pd.read_csv('feature/tfidf_ridge_error_single_classfiy.csv')
10 | df_bnb = pd.read_csv('feature/tfidf_bnb_error_single_classfiy.csv')
11 | df_mnb = pd.read_csv('feature/tfidf_mnb_error_single_classfiy.csv')
12 | df_lsvc = pd.read_csv('feature/tfidf_lsvc_error_single_classfiy.csv')
13 | df_lr_2 = pd.read_csv('feature/tfidf_lr_1_3_error_single_classfiy.csv')
14 | df_pac_2 = pd.read_csv('feature/tfidf_pac_1_3_error_single_classfiy.csv')
15 | df_sgd_2 = pd.read_csv('feature/tfidf_sgd_1_3_error_single_classfiy.csv')
16 | df_ridge_2 = pd.read_csv('feature/tfidf_ridge_1_3_error_single_classfiy.csv')
17 | df_bnb_2 = pd.read_csv('feature/tfidf_bnb_1_3_error_single_classfiy.csv')
18 | df_mnb_2 = pd.read_csv('feature/tfidf_mnb_1_3_error_single_classfiy.csv')
19 | df_lsvc_2 = pd.read_csv('feature/tfidf_lsvc_2_error_single_classfiy.csv')
20 | df_kmeans_2 = pd.read_csv('feature/cluster_2_tfidf_feature.csv')
21 | df_start_close = pd.read_csv('feature/feature_start_close.csv')
22 | df_ling_reg = pd.read_csv('feature/tfidf_ling_reg.csv')
23 | df_par_reg = pd.read_csv('feature/tfidf_par_reg.csv')
24 | df_svr_reg = pd.read_csv('feature/tfidf_svr_reg.csv')
25 | df_w2v = pd.read_csv('feature/w2v_avg.csv')
26 | del df_w2v['DeviceID']
27 | df_best_nn = pd.read_csv('feature/yg_best_nn.csv')
28 | del df_best_nn['DeviceID']
29 | df_chizhu_lgb = pd.read_csv('feature/lgb_feat_chizhu.csv')
30 | del df_chizhu_lgb['DeviceID']
31 | df_chizhu_nn = pd.read_csv('feature/nn_feat.csv')
32 | del df_chizhu_nn['DeviceID']
33 | df_lwl_lgb = pd.read_csv('feature/feat_lwl.csv')
34 | del df_lwl_lgb['DeviceID']
35 | df_feature = pd.concat([
36 | df_brand,
37 | df_lr, df_pac, df_sgd,
38 | df_ridge, df_bnb, df_mnb, df_lsvc,
39 | df_start_close, df_ling_reg, df_par_reg,df_svr_reg,
40 | df_lr_2, df_pac_2, df_sgd_2, df_ridge_2, df_bnb_2, df_mnb_2,
41 |     df_lsvc_2, df_kmeans_2, df_w2v, df_best_nn, df_chizhu_lgb, df_chizhu_nn,
42 |     df_lwl_lgb
43 | ], axis=1)
44 |
45 | df_feature.to_csv('feature/feature_one.csv', encoding='utf8', index=None)
46 |
47 |
--------------------------------------------------------------------------------
/THLUO/readme.md:
--------------------------------------------------------------------------------
 1 | This code was run on Windows 10 with 48 GB RAM and a 1070 Ti GPU. There are quite a few .py files, so a full run takes a long time.
 2 |
 3 | Folder layout:
 4 | > cache holds the model outputs
 5 | > embedding holds the w2c (word2vec) app embeddings
 6 | > input holds the competition data
 7 | > result holds THLUO's final result
 8 |
 9 | What each .py file does:
10 | * 1.w2c_model_start.py sorts each device's apps by open time into an app_list, treats each app as a word and each device_id as a document, and trains app embeddings
11 | * 2.w2c_model_close.py the same, but the apps are sorted by close time
12 | * 3.w2c_model_all.py the same, but the apps are sorted over open and close times combined
13 | * 3.device_quchong_start_app_w2c.py sorts each device's apps by open time into an app_list, deduplicates the list, then trains app embeddings (app = word, device_id = document)
14 | * 4.device_age_prob_oof.py predicts the user's age on its own
15 | * 5.device_sex_prob_oof.py predicts the user's sex on its own
16 | * 6.start_close_age_prob_oof.py predicts the age probabilities from app start/close behaviour
17 | * 7.start_close_sex_prob_oof.py predicts the sex probabilities from app start/close behaviour
18 | * 9.sex_age_bin_prob_oof.py predicts the user's sex-age probabilities with a binary-classification approach
19 | * 10.age_bin_prob_oof.py predicts the user's age probabilities with a binary-classification approach
20 | * 11.hcc_device_brand_age_sex.py phone brand and phone model are high-cardinality categoricals; following the paper "A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems", predicts the sex/age probabilities they imply
21 | * 12.device_age_regression_prob_oof.py predicts the user's age probabilities with a regression approach
22 | * 13.device_start_GRU_pred.py sorts apps by open time into an app_list (app = word, device_id = document) and runs a GRU text model to predict the sex-age probabilities
23 | * 14.device_start_GRU_pred_age.py the same GRU text model, but predicting the age probabilities only
24 | * 15.device_all_GRU_pred.py the same GRU text model over the combined open/close ordering, predicting the sex-age probabilities
25 | * 16.device_start_capsule_pred.py predicts the sex-age probabilities with a capsule model
26 | * 17.device_start_textcnn_pred.py predicts the sex-age probabilities with a TextCNN model
27 | * 18.device_start_text_dpcnn_pred.py predicts the sex-age probabilities with a DPCNN model
28 | * 19.device_start_lstm_pred.py predicts the sex-age probabilities with an LSTM model
29 | * 20.lgb_sex_age_prob_oof.py a basic model that predicts the sex-age probabilities
30 | * 21.tfidf_lr_sex_age_prob_oof.py applies TF-IDF to the apps and trains a logistic-regression model to predict the sex-age probabilities
31 | * 22.base_feat.py builds the basic hand-crafted features plus the probability features produced above
32 | * 23.ATT_v6.py trains an attention model on the features from 22.base_feat.py to predict the sex-age probabilities
33 | * 24.thluo_22_lgb.py trains a 22-class lgb model and writes the test probability file
34 | * 25.thluo_22_xgb.py trains a 22-class xgb model and writes the test probability file
35 | * 26.thluo_nb_lgb.py trains a conditional-probability lgb model and writes the test probability file; the conditional model first predicts p(sex), then p(age|sex), and finally p(sex, age) = p(sex) * p(age|sex) (see the sketch after this file section)
36 | * 27.thluo_nb_xgb.py trains a conditional-probability xgb model and writes the test probability file, with the same factorisation
37 | * 28.final.py linearly weights the results of the four models above into THLUO's final individual result
38 | * TextModel.py contains the text models used in this competition
39 | * util.py contains shared helper functions
40 |
41 |
42 |
43 |
44 | > note: the code was packaged in a hurry at submission time (the competition was mostly run from notebooks), so if anything breaks, please contact the team
45 |
--------------------------------------------------------------------------------
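A small worked sketch of the conditional-probability factorisation used by 26.thluo_nb_lgb.py / 27.thluo_nb_xgb.py, as described in the readme above. The probability values below are placeholders, not model output.

```python
# p(sex, age) = p(sex) * p(age | sex); the numbers are placeholders.
import numpy as np

p_sex = np.array([0.6, 0.4])        # p(sex=1), p(sex=2) for one device
p_age_given_sex = np.array([
    np.full(11, 1 / 11.0),          # p(age=0..10 | sex=1), 11 age buckets
    np.full(11, 1 / 11.0),          # p(age=0..10 | sex=2)
])
# 22 joint probabilities in the submission order 1-0..1-10, 2-0..2-10
p_joint = (p_sex[:, None] * p_age_given_sex).ravel()
assert abs(p_joint.sum() - 1.0) < 1e-9  # a valid joint distribution sums to 1
```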
/THLUO/24.thluo_22_lgb.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | import pandas as pd
8 | import seaborn as sns
9 | import numpy as np
10 | from tqdm import tqdm
11 | from sklearn.decomposition import LatentDirichletAllocation
12 | from sklearn.model_selection import train_test_split
13 | from sklearn.metrics import accuracy_score
14 | import lightgbm as lgb
15 | from datetime import datetime,timedelta
16 | import time
17 | from sklearn.feature_extraction.text import TfidfTransformer
18 | from sklearn.feature_extraction.text import CountVectorizer
19 | from sklearn.preprocessing import LabelEncoder
20 | import gc
21 |
22 |
23 |
24 | # In[24]:
25 |
26 |
27 | df_train_w2v = pd.read_csv('thluo_train_best_feat.csv')
28 | df_att_nn_feat_v6 = pd.read_csv('att_nn_feat_v6.csv')
29 | df_att_nn_feat_v6.columns = ['device_id'] + ['att_nn_feat_' + str(i) for i in range(22)]
30 | df_train_w2v = df_train_w2v.merge(df_att_nn_feat_v6, on='device_id', how='left')
31 |
32 |
33 | # In[ ]:
34 |
35 |
36 | df_train_w2v.to_csv('thluo_train_best_feat.csv', index=None)
37 |
38 |
39 | # In[26]:
40 |
41 |
42 | train = df_train_w2v[df_train_w2v['sex'].notnull()]
43 | test = df_train_w2v[df_train_w2v['sex'].isnull()]
44 |
45 | X = train.drop(['sex','age','sex_age','device_id'],axis=1)
46 | Y = train['sex_age']
47 | Y_CAT = pd.Categorical(Y)
48 | Y = pd.Series(Y_CAT.codes)
49 |
50 |
51 | # In[28]:
52 |
53 |
54 | from sklearn.model_selection import KFold, StratifiedKFold
55 | gc.collect()
56 | seed = 666
57 | num_folds = 5
58 | folds = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed)
59 |
60 | sub_list = []
61 |
62 | cate_feat = ['device_type','device_brand']
63 |
64 | for n_fold, (train_idx, valid_idx) in enumerate(folds.split(X, Y)):
65 | train_x, train_y = X.iloc[train_idx], Y.iloc[train_idx]
66 | valid_x, valid_y = X.iloc[valid_idx], Y.iloc[valid_idx]
67 |
68 | lgb_train=lgb.Dataset(train_x,label=train_y)
69 | lgb_eval = lgb.Dataset(valid_x, valid_y, reference=lgb_train)
70 | params = {
71 | 'boosting_type': 'gbdt',
72 | #'learning_rate' : 0.02,
73 | 'learning_rate' : 0.01,
74 | 'max_depth':5,
75 | 'num_leaves' : 2 ** 4,
76 | 'metric': {'multi_logloss'},
77 | 'num_class' : 22,
78 | 'objective' : 'multiclass',
79 | 'random_state' : 2018,
80 | 'bagging_freq' : 5,
81 | 'feature_fraction' : 0.7,
82 | 'bagging_fraction' : 0.7,
83 | 'min_split_gain' : 0.0970905919552776,
84 | 'min_child_weight' : 9.42012323936088,
85 | }
86 |
87 | gbm = lgb.train(params,
88 | lgb_train,
89 | num_boost_round=1000,
90 | valid_sets=lgb_eval,
91 | early_stopping_rounds=200, verbose_eval=100)
92 |
93 | sub = pd.DataFrame(gbm.predict(test[X.columns.values],num_iteration=gbm.best_iteration))
94 | sub_list.append(sub)
95 |
96 |
97 | # In[29]:
98 |
99 |
100 | sub = (sub_list[0] + sub_list[1] + sub_list[2] + sub_list[3] + sub_list[4]) / num_folds
101 |
102 |
103 | # In[31]:
104 |
105 |
106 | sub.columns=Y_CAT.categories
107 | sub['DeviceID']=test['device_id'].values
108 | sub=sub[['DeviceID', '1-0', '1-1', '1-2', '1-3', '1-4', '1-5', '1-6', '1-7','1-8', '1-9', '1-10', '2-0', '2-1', '2-2', '2-3', '2-4', '2-5', '2-6', '2-7', '2-8', '2-9', '2-10']]
109 |
110 |
111 | # In[32]:
112 |
113 |
114 | sub.to_csv('th_22_results_lgb.csv',index=False)
115 |
116 |
--------------------------------------------------------------------------------
/linwangli/code/lgb_allfeat_22.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding: utf-8
3 |
4 | from catboost import Pool, CatBoostClassifier, cv
5 | import pandas as pd
6 | import seaborn as sns
7 | import numpy as np
8 | from tqdm import tqdm
9 | from sklearn.decomposition import LatentDirichletAllocation
10 | from sklearn.model_selection import train_test_split
11 | from sklearn.metrics import accuracy_score
12 | import lightgbm as lgb
13 | from datetime import datetime,timedelta
14 | import matplotlib.pyplot as plt
15 | import time
16 |
17 | import gc
18 | from sklearn import preprocessing
19 | from sklearn.feature_extraction.text import TfidfVectorizer
20 |
21 | from scipy.sparse import hstack, vstack
22 | from sklearn.model_selection import StratifiedKFold
23 | from sklearn.model_selection import cross_val_score
24 | # from skopt.space import Integer, Categorical, Real  # unused (and skopt.space has no Log10)
25 | # from skopt.utils import use_named_args
26 | # from skopt import gp_minimize
27 | import re
28 |
29 |
30 | train = pd.read_csv('../dataset/deviceid_train.tsv', sep='\t', names=['device_id', 'sex', 'age'])
31 | all_feat = pd.read_csv('../dataset/all_feat.csv')
32 |
33 | train['label'] = train['sex'].astype(str) + '-' + train['age'].astype(str)
34 | label_le = preprocessing.LabelEncoder()
35 | train['label'] = label_le.fit_transform(train['label'])
36 | data_all = pd.merge(left=all_feat, right=train, on='device_id', how='left')
37 |
38 |
39 | train = data_all[:50000]
40 | test = data_all[50000:]
41 | train = train.fillna(-1)
42 | test = test.fillna(-1)
43 | del data_all
44 | gc.collect()
45 |
46 | use_feats = all_feat.columns[1:]
47 | use_feats
48 |
49 | X_train = train[use_feats]
50 | X_test = test[use_feats]
51 | Y = train['label']
52 | kfold = StratifiedKFold(n_splits=5, random_state=10, shuffle=True)
53 | sub = np.zeros((X_test.shape[0], 22))
54 | for i, (train_index, test_index) in enumerate(kfold.split(X_train, Y)):
55 | X_tr, X_vl, y_tr, y_vl = X_train.iloc[train_index], X_train.iloc[test_index], Y.iloc[train_index], Y.iloc[test_index]
56 | dtrain = lgb.Dataset(X_tr, label=y_tr, categorical_feature=[-1])
57 | dvalid = lgb.Dataset(X_vl, y_vl, reference=dtrain)
58 | params = {
59 | 'boosting_type': 'gbdt',
60 | 'max_depth':6,
61 | 'metric': {'multi_logloss'},
62 | 'num_class':22,
63 | 'objective':'multiclass',
64 | 'num_leaves':7,
65 | 'subsample': 0.9,
66 | 'colsample_bytree': 0.2,
67 | 'lambda_l1':0.0001,
68 | 'lambda_l2':0.00111,
69 | 'subsample_freq':12,
70 | 'learning_rate': 0.012,
71 | 'min_child_weight':12
72 |
73 | }
74 |
75 | model = lgb.train(params,
76 | dtrain,
77 | num_boost_round=6000,
78 | valid_sets=dvalid,
79 | early_stopping_rounds=100,
80 | verbose_eval=100)
81 |
82 |
83 | sub += model.predict(X_test, num_iteration=model.best_iteration)/kfold.n_splits
84 |
85 |
86 | sub = pd.DataFrame(sub)
87 | cols = [x for x in range(0, 22)]
88 | cols = label_le.inverse_transform(cols)
89 | sub.columns = cols
90 | sub['DeviceID'] = test['device_id'].values
91 | sub = sub[['DeviceID', '1-0', '1-1', '1-2', '1-3', '1-4', '1-5', '1-6',
92 | '1-7','1-8', '1-9', '1-10', '2-0', '2-1', '2-2', '2-3', '2-4',
93 | '2-5', '2-6', '2-7', '2-8', '2-9', '2-10']]
94 | sub.to_csv('lgb_22.csv', index=False)
95 |
96 |
97 |
98 |
99 |
100 |
--------------------------------------------------------------------------------
/chizhu/single_model/user_behavior.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | import pandas as pd
5 | import seaborn as sns
6 | import numpy as np
7 | from tqdm import tqdm
8 | from sklearn.decomposition import LatentDirichletAllocation
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.metrics import accuracy_score
11 | import lightgbm as lgb
12 | from datetime import datetime, timedelta
13 | import matplotlib.pyplot as plt
14 | import time
15 | from sklearn.feature_extraction.text import TfidfTransformer
16 | from sklearn.feature_extraction.text import CountVectorizer
17 | # %matplotlib inline
18 | from config import path
19 | #add
20 | import gc
21 |
22 | packtime = pd.read_table(path+'deviceid_package_start_close.tsv',
23 | names=['device_id', 'app', 'start', 'close'], low_memory=True)
24 | # packtime.head()
25 | packtime['peroid'] = (packtime['close'] - packtime['start'])/1000
26 | packtime['start'] = pd.to_datetime(packtime['start'], unit='ms')
27 | #packtime['closetime'] = pd.to_datetime(packtime['close'], unit='ms')
28 | del packtime['close']
29 | gc.collect()
30 |
31 | #packtime['day'] = packtime['start'].dt.day
32 | #packtime['month'] = packtime['start'].dt.month
33 | packtime['hour'] = packtime['start'].dt.hour
34 | packtime['date'] = packtime['start'].dt.date
35 | packtime['dayofweek'] = packtime['start'].dt.dayofweek
36 | #packtime['hour'] = pd.cut(packtime['hour'], bins=4).cat.codes
37 |
38 | #平均每天使用设备时间
39 | dtime = packtime.groupby(['device_id', 'date'])['peroid'].agg('sum')
40 | #不同时间段占比
41 | qtime = packtime.groupby(['device_id', 'hour'])['peroid'].agg('sum')
42 | wtime = packtime.groupby(['device_id', 'dayofweek'])['peroid'].agg('sum')
43 | atime = packtime.groupby(['device_id', 'app'])['peroid'].agg('sum')
44 |
45 |
46 | dapp = packtime[['device_id', 'date', 'app']].drop_duplicates().groupby(
47 | ['device_id', 'date'])['app'].agg(' '.join)
48 | dapp = dapp.reset_index()
49 | dapp['app_len'] = dapp['app'].apply(lambda x: x.split(' ')).apply(len)
50 | dapp_stat = dapp.groupby('device_id')['app_len'].agg(
51 |     ['std', 'mean', 'max'])  # list-agg: the dict form was removed from pandas
52 | dapp_stat = dapp_stat.reset_index()
53 | dapp_stat.columns = ['device_id', 'app_len_std', 'app_len_mean', 'app_len_max']
54 | # dapp_stat.head()
55 |
56 | dtime = dtime.reset_index()
57 | dtime_stat = dtime.groupby(['device_id'])['peroid'].agg(
58 |     ['sum', 'mean', 'std', 'max']).reset_index()  # list-agg instead of the removed dict form
59 | dtime_stat.columns = ['device_id', 'date_sum',
60 | 'date_mean', 'date_std', 'date_max']
61 | # dtime_stat.head()
62 |
63 | qtime = qtime.reset_index()
64 | ftime = qtime.pivot(index='device_id', columns='hour',
65 | values='peroid').fillna(0)
66 | ftime.columns = ['h%s' % i for i in range(24)]
67 | ftime.reset_index(inplace=True)
68 | # ftime.head()
69 |
70 | wtime = wtime.reset_index()
71 | weektime = wtime.pivot(
72 | index='device_id', columns='dayofweek', values='peroid').fillna(0)
73 | weektime.columns = ['w0', 'w1', 'w2', 'w3', 'w4', 'w5', 'w6']
74 | weektime.reset_index(inplace=True)
75 | # weektime.head()
76 |
77 | atime = atime.reset_index()
78 | app = atime.groupby(['device_id'])['peroid'].idxmax()
79 |
80 | #dapp_stat.shape, dtime_stat.shape, ftime.shape, weektime.shape, app.shape
81 |
82 | user = pd.merge(dapp_stat, dtime_stat, on='device_id', how='left')
83 | user = pd.merge(user, ftime, on='device_id', how='left')
84 | user = pd.merge(user, weektime, on='device_id', how='left')
85 | user = pd.merge(user, atime.iloc[app], on='device_id', how='left')
86 |
87 | app_cat = pd.read_table(path+'package_label.tsv',
88 | names=['app', 'category', 'app_name'])
89 |
90 | cat_enc = pd.DataFrame(app_cat['category'].value_counts())
91 | cat_enc['idx'] = range(45)
92 |
93 | app_cat['cat_enc'] = app_cat['category'].map(cat_enc['idx'])
94 | app_cat.set_index(['app'], inplace=True)
95 |
96 | atime['app_cat_enc'] = atime['app'].map(app_cat['cat_enc']).fillna(45)
97 |
98 | cat_num = atime.groupby(['device_id', 'app_cat_enc'])[
99 | 'app'].agg('count').reset_index()
100 | cat_time = atime.groupby(['device_id', 'app_cat_enc'])[
101 | 'peroid'].agg('sum').reset_index()
102 |
103 | app_cat_num = cat_num.pivot(
104 | index='device_id', columns='app_cat_enc', values='app').fillna(0)
105 | app_cat_num.columns = ['cat%s' % i for i in range(46)]
106 | app_cat_time = cat_time.pivot(
107 | index='device_id', columns='app_cat_enc', values='peroid').fillna(0)
108 | app_cat_time.columns = ['time%s' % i for i in range(46)]
109 |
110 | user = pd.merge(user, app_cat_num, on='device_id', how='left')
111 | user = pd.merge(user, app_cat_time, on='device_id', how='left')
112 | user.to_csv('data/user_behavior.csv', index=False)
113 |
114 |
115 |
--------------------------------------------------------------------------------
/THLUO/3.w2c_all_emb.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | import pandas as pd
8 | import seaborn as sns
9 | import numpy as np
10 | from tqdm import tqdm
11 | from sklearn.decomposition import LatentDirichletAllocation
12 | from sklearn.model_selection import train_test_split
13 | from sklearn.metrics import accuracy_score
14 | import lightgbm as lgb
15 | from datetime import datetime,timedelta
16 | import time
17 | from sklearn.feature_extraction.text import TfidfTransformer
18 | from sklearn.feature_extraction.text import CountVectorizer
19 | from sklearn.preprocessing import LabelEncoder
20 | import gc
21 |
22 |
23 |
24 | # In[2]:
25 |
26 |
27 | path='input/'
28 | data=pd.DataFrame()
29 | #sex_age=pd.read_excel('./data/性别年龄对照表.xlsx')
30 |
31 |
32 | # In[3]:
33 |
34 |
35 | deviceid_packages=pd.read_csv(path+'deviceid_packages.tsv',sep='\t',names=['device_id','apps'])
36 | deviceid_test=pd.read_csv(path+'deviceid_test.tsv',sep='\t',names=['device_id'])
37 | deviceid_train=pd.read_csv(path+'deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
38 | deviceid_brand = pd.read_csv(path+'deviceid_brand.tsv',sep='\t', names=['device_id','device_brand', 'device_type'])
39 | deviceid_package_start_close = pd.read_csv(path+'deviceid_package_start_close.tsv',sep='\t', names=['device_id','app_id','start_time','close_time'])
40 | package_label = pd.read_csv(path+'package_label.tsv',sep='\t',names=['app_id','app_parent_type', 'app_child_type'])
41 |
42 |
43 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : str(x).split(' ')[0])
44 |
45 | df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
46 | one_time_brand = df_temp[df_temp.brand_counts == 1].device_brand.values
47 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other' if x in one_time_brand else x)
48 |
49 | df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
50 | one_time_brand = df_temp[df_temp.brand_counts == 2].device_brand.values
51 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_2' if x in one_time_brand else x)
52 |
53 | df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
54 | one_time_brand = df_temp[df_temp.brand_counts == 3].device_brand.values
55 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_3' if x in one_time_brand else x)
56 |
57 |
58 | # encode brands as integer ids
59 | lbl = LabelEncoder()
60 | lbl.fit(list(deviceid_brand.device_brand.values))
61 | deviceid_brand['device_brand'] = lbl.transform(list(deviceid_brand.device_brand.values))
62 |
63 | lbl = LabelEncoder()
64 | lbl.fit(list(deviceid_brand.device_type.values))
65 | deviceid_brand['device_type'] = lbl.transform(list(deviceid_brand.device_type.values))
66 |
67 | # encode app types as integer ids
68 | lbl = LabelEncoder()
69 | lbl.fit(list(package_label.app_parent_type.values))
70 | package_label['app_parent_type'] = lbl.transform(list(package_label.app_parent_type.values))
71 |
72 | lbl = LabelEncoder()
73 | lbl.fit(list(package_label.app_child_type.values))
74 | package_label['app_child_type'] = lbl.transform(list(package_label.app_child_type.values))
75 |
76 |
77 | # In[4]:
78 |
79 |
80 | deviceid_package_start = deviceid_package_start_close[['device_id', 'app_id', 'start_time']]
81 | deviceid_package_start.columns = ['device_id', 'app_id', 'all_time']
82 | deviceid_package_close = deviceid_package_start_close[['device_id', 'app_id', 'close_time']]
83 | deviceid_package_close.columns = ['device_id', 'app_id', 'all_time']
84 | deviceid_package_all = pd.concat([deviceid_package_start, deviceid_package_close])
85 |
86 |
87 | # In[6]:
88 |
89 |
90 | df_sorted = deviceid_package_all.sort_values(by='all_time')
91 |
92 |
93 | # In[8]:
94 |
95 |
96 | df_device_start_app_list = df_sorted.groupby('device_id').apply(lambda x : list(x.app_id)).reset_index().rename(columns = {0 : 'app_list'})
97 | df_device_start_app_list
98 |
99 |
100 | # In[9]:
101 |
102 |
103 | app_list = list(df_device_start_app_list.app_list.values)
104 |
105 |
106 | # In[10]:
107 |
108 |
109 | from gensim.test.utils import common_texts, get_tmpfile
110 | from gensim.models import Word2Vec
111 |
112 |
113 | # In[11]:
114 |
115 |
116 | word_dim = 200
117 | model = Word2Vec(app_list, size=word_dim, window=20, min_count=2, workers=4)
118 | model.save("word2vec.model")
119 |
120 |
121 | # In[13]:
122 |
123 |
124 | vocab = list(model.wv.vocab.keys())
125 |
126 | w2c_arr = []
127 |
128 | for v in vocab :
129 | w2c_arr.append(list(model.wv[v]))
130 |
131 |
132 | # In[14]:
133 |
134 |
135 | df_w2c_start = pd.DataFrame()
136 | df_w2c_start['app_id'] = vocab
137 | df_w2c_start = pd.concat([df_w2c_start, pd.DataFrame(w2c_arr)], axis=1)
138 | df_w2c_start.columns = ['app_id'] + ['w2c_all_app_' + str(i) for i in range(word_dim)]
139 |
140 |
141 | # In[16]:
142 |
143 |
144 | df_w2c_start.to_csv('w2c_all_emb.csv', index=None)
145 |
146 |
--------------------------------------------------------------------------------
/chizhu/stacking/nurbs_feat/xgb_22.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[2]:
5 |
6 |
7 | import pandas as pd
8 | import seaborn as sns
9 | import numpy as np
10 | from tqdm import tqdm
11 | from sklearn.decomposition import LatentDirichletAllocation
12 | from sklearn.model_selection import train_test_split
13 | from sklearn.metrics import accuracy_score
14 | import lightgbm as lgb
15 | from datetime import datetime,timedelta
16 | import matplotlib.pyplot as plt
17 | import time
18 | from sklearn.feature_extraction.text import TfidfTransformer
19 | from sklearn.feature_extraction.text import CountVectorizer
20 | # get_ipython().run_line_magic('matplotlib', 'inline')
21 |
22 | #add
23 | import gc
24 | from sklearn import preprocessing
25 | from sklearn.feature_extraction.text import TfidfVectorizer
26 |
27 | from scipy.sparse import hstack, vstack
28 | from sklearn.model_selection import StratifiedKFold
29 | from sklearn.model_selection import cross_val_score
30 | # from skopt.space import Integer, Categorical, Real, Log10
31 | # from skopt.utils import use_named_args
32 | # from skopt import gp_minimize
33 | from gensim.models import Word2Vec, FastText
34 | import gensim
35 | import re
36 | import os
37 | path="./feature/"  ### path to the nurbs probability files
38 | o_path="/dev/shm/chizhu_data/data/"  ### path to the raw data files
39 | os.listdir(path)
40 |
41 |
42 | # In[4]:
43 |
44 |
45 |
46 | all_feat=pd.read_csv(path+"feature_22_all.csv")
47 | train_id=pd.read_csv(o_path+"deviceid_train.tsv",sep="\t",names=['device_id','sex','age'])
48 | test_id=pd.read_csv(o_path+"deviceid_test.tsv",sep="\t",names=['device_id'])
49 | all_id=pd.concat([train_id[['device_id']],test_id[['device_id']]])
50 | all_id.index=range(len(all_id))
51 | all_feat['device_id']=all_id
52 | # deepnn_feat=pd.read_csv(path+"deepnn_fix.csv")
53 | # deepnn_feat['device_id']=deepnn_feat['DeviceID']
54 | # del deepnn_feat['DeviceID']
55 |
56 |
57 | # In[9]:
58 |
59 |
60 | train=pd.merge(train_id,all_feat,on="device_id",how="left")
61 | # train=pd.merge(train,deepnn_feat,on="device_id",how="left")
62 | test=pd.merge(test_id,all_feat,on="device_id",how="left")
63 | # test=pd.merge(test,deepnn_feat,on="device_id",how="left")
64 |
65 |
66 | # In[10]:
67 |
68 |
69 | train['sex-age']=train.apply(lambda x:str(x['sex'])+"-"+str(x['age']),1)
70 |
71 |
72 | # In[11]:
73 |
74 |
75 | features = [x for x in train.columns if x not in ['device_id',"sex",'age','sex-age']]
76 | label="sex-age"
77 |
78 |
79 | # In[12]:
80 |
81 |
82 | Y_CAT=pd.Categorical(train[label])
83 |
84 |
85 | # In[13]:
86 |
87 |
88 | import lightgbm as lgb
89 | import xgboost as xgb
90 | from sklearn.metrics import auc, log_loss, roc_auc_score,f1_score,recall_score,precision_score
91 | # StratifiedKFold is already imported from sklearn.model_selection above
92 |
93 | kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1024)
94 | params={
95 | 'booster':'gbtree',
96 | "tree_method":"gpu_hist",
97 | "gpu_id":"1",
98 | 'objective': 'multi:softprob',
99 | # 'is_unbalance':'True',
100 | # 'scale_pos_weight': 1500.0/13458.0,
101 | 'eval_metric': "mlogloss",
102 | 'num_class':22,
103 | 'gamma':0.1,#0.2 is ok
104 | 'max_depth':6,
105 | # 'lambda':20,
106 | # "alpha":5,
107 | 'subsample':0.7,
108 | 'colsample_bytree':0.4 ,
109 | # 'min_child_weight':2.5,
110 | 'eta': 0.01,
111 | # 'learning_rate':0.01,
112 | "silent":1,
113 | 'seed':1024,
114 | 'nthread':12,
115 |
116 | }
117 | num_round = 3500
118 | early_stopping_rounds = 100
119 |
120 |
121 | # In[14]:
122 |
123 |
124 | aus = []
125 | sub2 = np.zeros((len(test),22 ))
126 | pred_oob2=np.zeros((len(train),22))
127 | models=[]
128 | iters=[]
129 | for i,(train_index,test_index) in enumerate(kf.split(train[features], Y_CAT.codes)):
130 |
131 | tr_x = train[features].reindex(index=train_index, copy=False)
132 | tr_y = Y_CAT.codes[train_index]
133 | te_x = train[features].reindex(index=test_index, copy=False)
134 | te_y = Y_CAT.codes[test_index]
135 |
136 | # tr_y=tr_y.apply(lambda x:1 if x>0 else 0)
137 | # te_y=te_y.apply(lambda x:1 if x>0 else 0)
138 | d_tr = xgb.DMatrix(tr_x, label=tr_y)
139 | d_te = xgb.DMatrix(te_x, label=te_y)
140 | watchlist = [(d_tr,'train'),
141 | (d_te,'val')
142 | ]
143 | model = xgb.train(params, d_tr, num_boost_round=5500,
144 | evals=watchlist,verbose_eval=200,
145 | early_stopping_rounds=100)
146 | models.append(model)
147 | iters.append(model.best_iteration)
148 | pred = model.predict(d_te,ntree_limit=model.best_iteration)
149 | pred_oob2[test_index] =pred
150 | # te_y=te_y.apply(lambda x:1 if x>0 else 0)
151 | a = log_loss(te_y, pred)
152 |
153 | sub2 += model.predict(xgb.DMatrix(test[features]),ntree_limit=model.best_iteration)/5
154 |
155 |
156 | print ("idx: ", i)
157 | print (" loss: %.5f" % a)
158 | # print " gini: %.5f" % g
159 | aus.append(a)
160 |
161 | print ("mean")
162 | print ("loss: %s" % (sum(aus) / 5.0))
163 |
164 |
165 | # In[15]:
166 |
167 |
168 | res=np.vstack((pred_oob2,sub2))
169 | res = pd.DataFrame(res,columns=Y_CAT.categories)
170 | res['DeviceID']=all_id
171 | res=res[['DeviceID', '1-0', '1-1', '1-2', '1-3', '1-4', '1-5', '1-6', '1-7','1-8', '1-9', '1-10', '2-0', '2-1', '2-2', '2-3', '2-4', '2-5', '2-6', '2-7', '2-8', '2-9', '2-10']]
172 |
173 | res.to_csv("xgb_nurbs_22_feat.csv",index=False)
174 |
175 |
176 | # In[16]:
177 |
178 |
179 | test['DeviceID']=test['device_id']
180 | sub=pd.merge(test[['DeviceID']],res,on="DeviceID",how="left")
181 | sub.to_csv("xgb_nurbs_22.csv",index=False)
182 |
183 |
--------------------------------------------------------------------------------
/THLUO/1.w2c_model_start.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | import pandas as pd
8 | import seaborn as sns
9 | import numpy as np
10 | from tqdm import tqdm
11 | from sklearn.decomposition import LatentDirichletAllocation
12 | from sklearn.model_selection import train_test_split
13 | from sklearn.metrics import accuracy_score
14 | import lightgbm as lgb
15 | from datetime import datetime,timedelta
16 | import time
17 | from sklearn.feature_extraction.text import TfidfTransformer
18 | from sklearn.feature_extraction.text import CountVectorizer
19 | from sklearn.preprocessing import LabelEncoder
20 | import gc
21 | from gensim.test.utils import common_texts, get_tmpfile
22 | from gensim.models import Word2Vec
23 |
24 |
25 | # In[2]:
26 |
27 |
28 | path='input/'
29 | data=pd.DataFrame()
30 | print ('1.w2c_model_start.py')
31 |
32 | # In[3]:
33 |
34 |
35 | deviceid_packages=pd.read_csv(path+'deviceid_packages.tsv',sep='\t',names=['device_id','apps'])
36 | deviceid_test=pd.read_csv(path+'deviceid_test.tsv',sep='\t',names=['device_id'])
37 | deviceid_train=pd.read_csv(path+'deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
38 | deviceid_brand = pd.read_csv(path+'deviceid_brand.tsv',sep='\t', names=['device_id','device_brand', 'device_type'])
39 | deviceid_package_start_close = pd.read_csv(path+'deviceid_package_start_close.tsv',sep='\t', names=['device_id','app_id','start_time','close_time'])
40 | package_label = pd.read_csv(path+'package_label.tsv',sep='\t',names=['app_id','app_parent_type', 'app_child_type'])
41 |
42 |
43 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : str(x).split(' ')[0])
44 |
45 | df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
46 | one_time_brand = df_temp[df_temp.brand_counts == 1].device_brand.values
47 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other' if x in one_time_brand else x)
48 |
49 | df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
50 | one_time_brand = df_temp[df_temp.brand_counts == 2].device_brand.values
51 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_2' if x in one_time_brand else x)
52 |
53 | df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
54 | one_time_brand = df_temp[df_temp.brand_counts == 3].device_brand.values
55 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_3' if x in one_time_brand else x)
56 |
57 |
58 | # encode brands as integer ids
59 | lbl = LabelEncoder()
60 | lbl.fit(list(deviceid_brand.device_brand.values))
61 | deviceid_brand['device_brand'] = lbl.transform(list(deviceid_brand.device_brand.values))
62 |
63 | lbl = LabelEncoder()
64 | lbl.fit(list(deviceid_brand.device_type.values))
65 | deviceid_brand['device_type'] = lbl.transform(list(deviceid_brand.device_type.values))
66 |
67 | # encode app types as integer ids
68 | lbl = LabelEncoder()
69 | lbl.fit(list(package_label.app_parent_type.values))
70 | package_label['app_parent_type'] = lbl.transform(list(package_label.app_parent_type.values))
71 |
72 | lbl = LabelEncoder()
73 | lbl.fit(list(package_label.app_child_type.values))
74 | package_label['app_child_type'] = lbl.transform(list(package_label.app_child_type.values))
75 |
76 |
77 | # In[4]:
78 |
79 |
80 | df_sorted = deviceid_package_start_close.sort_values(by='start_time')
81 |
82 |
83 | # In[20]:
84 |
85 |
86 | df_results = df_sorted.groupby('device_id')['app_id'].apply(lambda x:' '.join(x)).reset_index().rename(columns = {'app_id' : 'app_list'})
87 | df_results.to_csv('01.device_click_app_sorted_by_start.csv', index=None)
88 | del df_results
89 |
90 |
91 | # In[5]:
92 |
93 |
94 | df_device_start_app_list = df_sorted.groupby('device_id').apply(lambda x : list(x.app_id)).reset_index().rename(columns = {0 : 'app_list'})
95 |
96 |
97 | # In[7]:
98 |
99 |
100 | app_list = list(df_device_start_app_list.app_list.values)
101 |
102 |
103 | # In[9]:
104 |
105 |
106 | model = Word2Vec(app_list, size=10, window=10, min_count=2, workers=4)
107 | model.save("word2vec.model")
108 |
109 |
110 | # In[10]:
111 |
112 |
113 | vocab = list(model.wv.vocab.keys())
114 |
115 | w2c_arr = []
116 |
117 | for v in vocab :
118 | w2c_arr.append(list(model.wv[v]))
119 |
120 |
121 | # In[11]:
122 |
123 |
124 | df_w2c_start = pd.DataFrame()
125 | df_w2c_start['app_id'] = vocab
126 | df_w2c_start = pd.concat([df_w2c_start, pd.DataFrame(w2c_arr)], axis=1)
127 | df_w2c_start.columns = ['app_id'] + ['w2c_start_app_' + str(i) for i in range(10)]
128 |
129 |
130 | # In[13]:
131 |
132 |
133 | w2c_nums = 10
134 | agg = {}
135 | for l in ['w2c_start_app_' + str(i) for i in range(w2c_nums)] :
136 | agg[l] = ['mean', 'std', 'max', 'min']
137 |
138 |
139 | # In[14]:
140 |
141 |
142 | deviceid_package_start_close = deviceid_package_start_close.merge(df_w2c_start, on='app_id', how='left')
143 |
144 |
145 | # In[15]:
146 |
147 |
148 | df_agg = deviceid_package_start_close.groupby('device_id').agg(agg)
149 | df_agg.columns = pd.Index(['device_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
150 | df_agg = df_agg.reset_index()
151 | df_agg.to_csv('device_start_app_w2c.csv', index=None)
152 |
153 |
154 | # In[16]:
155 |
156 |
157 | df_results = deviceid_package_start_close.groupby(['device_id', 'app_id'])['start_time'].mean().reset_index()
158 | df_results = df_results.merge(df_w2c_start, on='app_id', how='left')
159 |
160 |
161 | # In[18]:
162 |
163 |
164 | df_agg = df_results.groupby('device_id').agg(agg)
165 | df_agg.columns = pd.Index(['device_app_unique_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
166 | df_agg = df_agg.reset_index()
167 |
168 |
169 | # In[24]:
170 |
171 |
172 | df_agg.to_csv('device_app_unique_start_app_w2c.csv', index=None)
173 | print ('success.....')
174 |
--------------------------------------------------------------------------------
/THLUO/2.w2c_model_close.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | import pandas as pd
8 | import seaborn as sns
9 | import numpy as np
10 | from tqdm import tqdm
11 | from sklearn.decomposition import LatentDirichletAllocation
12 | from sklearn.model_selection import train_test_split
13 | from sklearn.metrics import accuracy_score
14 | import lightgbm as lgb
15 | from datetime import datetime,timedelta
16 | import time
17 | from sklearn.feature_extraction.text import TfidfTransformer
18 | from sklearn.feature_extraction.text import CountVectorizer
19 | from sklearn.preprocessing import LabelEncoder
20 | import gc
21 |
22 |
23 |
24 |
25 | # In[2]:
26 | print ('2.w2c_model_close.py')
27 |
28 | path='input/'
29 | data=pd.DataFrame()
30 | #sex_age=pd.read_excel('./data/性别年龄对照表.xlsx')
31 |
32 |
33 | # In[3]:
34 |
35 |
36 | deviceid_packages=pd.read_csv(path+'deviceid_packages.tsv',sep='\t',names=['device_id','apps'])
37 | deviceid_test=pd.read_csv(path+'deviceid_test.tsv',sep='\t',names=['device_id'])
38 | deviceid_train=pd.read_csv(path+'deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
39 | deviceid_brand = pd.read_csv(path+'deviceid_brand.tsv',sep='\t', names=['device_id','device_brand', 'device_type'])
40 | deviceid_package_start_close = pd.read_csv(path+'deviceid_package_start_close.tsv',sep='\t', names=['device_id','app_id','start_time','close_time'])
41 | package_label = pd.read_csv(path+'package_label.tsv',sep='\t',names=['app_id','app_parent_type', 'app_child_type'])
42 |
43 |
44 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : str(x).split(' ')[0])
45 |
46 | df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
47 | one_time_brand = df_temp[df_temp.brand_counts == 1].device_brand.values
48 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other' if x in one_time_brand else x)
49 |
50 | df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
51 | one_time_brand = df_temp[df_temp.brand_counts == 2].device_brand.values
52 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_2' if x in one_time_brand else x)
53 |
54 | df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
55 | one_time_brand = df_temp[df_temp.brand_counts == 3].device_brand.values
56 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_3' if x in one_time_brand else x)
57 |
58 |
59 | # convert categories to their corresponding integer ids
60 | lbl = LabelEncoder()
61 | lbl.fit(list(deviceid_brand.device_brand.values))
62 | deviceid_brand['device_brand'] = lbl.transform(list(deviceid_brand.device_brand.values))
63 |
64 | lbl = LabelEncoder()
65 | lbl.fit(list(deviceid_brand.device_type.values))
66 | deviceid_brand['device_type'] = lbl.transform(list(deviceid_brand.device_type.values))
67 |
68 | # convert categories to their corresponding integer ids
69 | lbl = LabelEncoder()
70 | lbl.fit(list(package_label.app_parent_type.values))
71 | package_label['app_parent_type'] = lbl.transform(list(package_label.app_parent_type.values))
72 |
73 | lbl = LabelEncoder()
74 | lbl.fit(list(package_label.app_child_type.values))
75 | package_label['app_child_type'] = lbl.transform(list(package_label.app_child_type.values))
76 |
77 |
78 | # In[4]:
79 |
80 |
81 | df_sorted = deviceid_package_start_close.sort_values(by='close_time')
82 |
83 |
84 | # In[6]:
85 |
86 |
87 | df_results = df_sorted.groupby('device_id')['app_id'].apply(lambda x:' '.join(x)).reset_index().rename(columns = {'app_id' : 'app_list'})
88 |
89 |
90 | # In[7]:
91 |
92 |
93 | df_results.to_csv('02.device_click_app_sorted_by_close.csv', index=None)
94 |
95 |
96 | # In[6]:
97 |
98 |
99 | df_device_start_app_list = df_sorted.groupby('device_id').apply(lambda x : list(x.app_id)).reset_index().rename(columns = {0 : 'app_list'})
100 |
101 |
102 | # In[7]:
103 |
104 |
105 | app_list = list(df_device_start_app_list.app_list.values)
106 |
107 |
108 | # In[8]:
109 |
110 |
111 | from gensim.test.utils import common_texts, get_tmpfile
112 | from gensim.models import Word2Vec
113 |
114 |
115 | # In[9]:
116 |
117 |
118 | model = Word2Vec(app_list, size=10, window=10, min_count=2, workers=4)
119 | model.save("word2vec.model")
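# Each device's close-time-ordered app sequence is treated as one "sentence": Word2Vec learns
# a 10-dim vector per app from co-occurrence within a window of 10, dropping apps seen fewer than twice.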
120 |
121 |
122 | # In[11]:
123 |
124 |
125 | vocab = list(model.wv.vocab.keys())
126 |
127 | w2c_arr = []
128 |
129 | for v in vocab :
130 | w2c_arr.append(list(model.wv[v]))
131 |
132 |
133 | # In[12]:
134 |
135 |
136 | df_w2c_start = pd.DataFrame()
137 | df_w2c_start['app_id'] = vocab
138 | df_w2c_start = pd.concat([df_w2c_start, pd.DataFrame(w2c_arr)], axis=1)
139 | df_w2c_start.columns = ['app_id'] + ['w2c_close_app_' + str(i) for i in range(10)]
140 |
141 |
142 | # In[ ]:
143 |
144 |
145 | w2c_nums = 10
146 | agg = {}
147 | for l in ['w2c_close_app_' + str(i) for i in range(w2c_nums)] :
148 | agg[l] = ['mean', 'std', 'max', 'min']
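# Aggregation plan: mean/std/max/min over each of the 10 embedding dimensions,
# i.e. 40 per-device features once the groupby below is applied.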
149 |
150 |
151 | # In[14]:
152 |
153 |
154 | deviceid_package_start_close = deviceid_package_start_close.merge(df_w2c_start, on='app_id', how='left')
155 |
156 |
157 | # In[ ]:
158 |
159 |
160 | df_agg = deviceid_package_start_close.groupby('device_id').agg(agg)
161 | df_agg.columns = pd.Index(['device_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
162 | df_agg = df_agg.reset_index()
163 | df_agg.to_csv('device_close_app_w2c.csv', index=None)
164 |
165 |
166 | # In[14]:
167 |
168 |
169 | df_results = deviceid_package_start_close.groupby(['device_id', 'app_id'])['start_time'].mean().reset_index()
170 | df_results = df_results.merge(df_w2c_start, on='app_id', how='left')
171 |
172 |
173 | # In[17]:
174 |
175 |
176 | df_agg = df_results.groupby('device_id').agg(agg)
177 | df_agg.columns = pd.Index(['device_app_unique_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
178 | df_agg = df_agg.reset_index()
179 |
180 |
181 | # In[18]:
182 |
183 |
184 | df_agg.to_csv('device_app_unique_close_app_w2c.csv', index=None)
185 |
186 |
--------------------------------------------------------------------------------
/THLUO/3.w2c_model_all.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | import pandas as pd
8 | import seaborn as sns
9 | import numpy as np
10 | from tqdm import tqdm
11 | from sklearn.decomposition import LatentDirichletAllocation
12 | from sklearn.model_selection import train_test_split  # moved from sklearn.cross_validation (removed in sklearn 0.20)
13 | from sklearn.metrics import accuracy_score
14 | import lightgbm as lgb
15 | from datetime import datetime,timedelta
16 | import time
17 | from sklearn.feature_extraction.text import TfidfTransformer
18 | from sklearn.feature_extraction.text import CountVectorizer
19 | from sklearn.preprocessing import LabelEncoder
20 | from gensim.test.utils import common_texts, get_tmpfile
21 | from gensim.models import Word2Vec
22 | import gc
23 |
24 |
25 |
26 | # In[2]:
27 | print ('3.w2c_model_all.py')
28 |
29 | path='input/'
30 | data=pd.DataFrame()
31 | #sex_age=pd.read_excel('./data/性别年龄对照表.xlsx')
32 |
33 |
34 | # In[3]:
35 |
36 |
37 | deviceid_packages=pd.read_csv(path+'deviceid_packages.tsv',sep='\t',names=['device_id','apps'])
38 | deviceid_test=pd.read_csv(path+'deviceid_test.tsv',sep='\t',names=['device_id'])
39 | deviceid_train=pd.read_csv(path+'deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
40 | deviceid_brand = pd.read_csv(path+'deviceid_brand.tsv',sep='\t', names=['device_id','device_brand', 'device_type'])
41 | deviceid_package_start_close = pd.read_csv(path+'deviceid_package_start_close.tsv',sep='\t', names=['device_id','app_id','start_time','close_time'])
42 | package_label = pd.read_csv(path+'package_label.tsv',sep='\t',names=['app_id','app_parent_type', 'app_child_type'])
43 |
44 |
45 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : str(x).split(' ')[0])
46 |
47 | df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
48 | one_time_brand = df_temp[df_temp.brand_counts == 1].device_brand.values
49 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other' if x in one_time_brand else x)
50 |
51 | df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
52 | one_time_brand = df_temp[df_temp.brand_counts == 2].device_brand.values
53 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_2' if x in one_time_brand else x)
54 |
55 | df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
56 | one_time_brand = df_temp[df_temp.brand_counts == 3].device_brand.values
57 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_3' if x in one_time_brand else x)
58 |
59 |
60 | # convert categories to their corresponding integer ids
61 | lbl = LabelEncoder()
62 | lbl.fit(list(deviceid_brand.device_brand.values))
63 | deviceid_brand['device_brand'] = lbl.transform(list(deviceid_brand.device_brand.values))
64 |
65 | lbl = LabelEncoder()
66 | lbl.fit(list(deviceid_brand.device_type.values))
67 | deviceid_brand['device_type'] = lbl.transform(list(deviceid_brand.device_type.values))
68 |
69 | # convert categories to their corresponding integer ids
70 | lbl = LabelEncoder()
71 | lbl.fit(list(package_label.app_parent_type.values))
72 | package_label['app_parent_type'] = lbl.transform(list(package_label.app_parent_type.values))
73 |
74 | lbl = LabelEncoder()
75 | lbl.fit(list(package_label.app_child_type.values))
76 | package_label['app_child_type'] = lbl.transform(list(package_label.app_child_type.values))
77 |
78 |
79 | # In[4]:
80 |
81 |
82 | deviceid_package_start = deviceid_package_start_close[['device_id', 'app_id', 'start_time']]
83 | deviceid_package_start.columns = ['device_id', 'app_id', 'all_time']
84 | deviceid_package_close = deviceid_package_start_close[['device_id', 'app_id', 'close_time']]
85 | deviceid_package_close.columns = ['device_id', 'app_id', 'all_time']
86 | deviceid_package_all = pd.concat([deviceid_package_start, deviceid_package_close])
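# Start and close events are stacked into one 'all_time' event stream, so every open and
# every close of an app becomes its own token in the sequences built below.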
87 |
88 |
89 | # In[5]:
90 |
91 |
92 | df_sorted = deviceid_package_all.sort_values(by='all_time')
93 |
94 |
95 | # In[7]:
96 |
97 |
98 | df_results = df_sorted.groupby('device_id')['app_id'].apply(lambda x:' '.join(x)).reset_index().rename(columns = {'app_id' : 'app_list'})
99 | df_results.to_csv('03.device_click_app_sorted_by_all.csv', index=None)
100 | del df_results
101 |
102 |
103 | # In[8]:
104 |
105 |
106 | df_device_start_app_list = df_sorted.groupby('device_id').apply(lambda x : list(x.app_id)).reset_index().rename(columns = {0 : 'app_list'})
107 |
108 |
109 | # In[9]:
110 |
111 |
112 | app_list = list(df_device_start_app_list.app_list.values)
113 |
114 |
115 | # In[11]:
116 |
117 |
118 | model = Word2Vec(app_list, size=10, window=50, min_count=2, workers=4)
119 | model.save("word2vec.model")
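# The merged start+close stream is roughly twice as long per device, presumably why a wider
# context window (50) is used here than in the start-only and close-only scripts.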
120 |
121 |
122 | # In[12]:
123 |
124 |
125 | vocab = list(model.wv.vocab.keys())
126 |
127 | w2c_arr = []
128 |
129 | for v in vocab :
130 | w2c_arr.append(list(model.wv[v]))
131 |
132 |
133 | # In[13]:
134 |
135 |
136 | df_w2c_start = pd.DataFrame()
137 | df_w2c_start['app_id'] = vocab
138 | df_w2c_start = pd.concat([df_w2c_start, pd.DataFrame(w2c_arr)], axis=1)
139 | df_w2c_start.columns = ['app_id'] + ['w2c_all_app_' + str(i) for i in range(10)]
140 |
141 |
142 | # In[14]:
143 |
144 |
145 | w2c_nums = 10
146 | agg = {}
147 | for l in ['w2c_all_app_' + str(i) for i in range(w2c_nums)] :
148 | agg[l] = ['mean', 'std', 'max', 'min']
149 |
150 |
151 | # In[15]:
152 |
153 |
154 | deviceid_package_start_close = deviceid_package_start_close.merge(df_w2c_start, on='app_id', how='left')
155 |
156 |
157 | # In[16]:
158 |
159 |
160 | df_agg = deviceid_package_start_close.groupby('device_id').agg(agg)
161 | df_agg.columns = pd.Index(['device_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
162 | df_agg = df_agg.reset_index()
163 | df_agg.to_csv('device_all_app_w2c.csv', index=None)
164 |
165 |
166 | # In[18]:
167 |
168 |
169 | df_results = deviceid_package_start_close.groupby(['device_id', 'app_id'])['start_time'].mean().reset_index()
170 | df_results = df_results.merge(df_w2c_start, on='app_id', how='left')
171 |
172 |
173 | # In[22]:
174 |
175 |
176 | df_agg = df_results.groupby('device_id').agg(agg)
177 | df_agg.columns = pd.Index(['device_app_unique' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
178 | df_agg = df_agg.reset_index()
179 |
180 |
181 | # In[20]:
182 |
183 |
184 | df_agg.to_csv('device_app_unique_all_app_w2c.csv', index=None)
185 |
186 |
--------------------------------------------------------------------------------
/THLUO/3.device_quchong_start_app_w2c.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | import pandas as pd
8 | import seaborn as sns
9 | import numpy as np
10 | from tqdm import tqdm
11 | from sklearn.decomposition import LatentDirichletAllocation
12 | from sklearn.model_selection import train_test_split  # moved from sklearn.cross_validation (removed in sklearn 0.20)
13 | from sklearn.metrics import accuracy_score
14 | import lightgbm as lgb
15 | from datetime import datetime,timedelta
16 | import time
17 | from sklearn.feature_extraction.text import TfidfTransformer
18 | from sklearn.feature_extraction.text import CountVectorizer
19 | from sklearn.preprocessing import LabelEncoder
20 | import gc
21 | from gensim.test.utils import common_texts, get_tmpfile
22 | from gensim.models import Word2Vec
23 |
24 |
25 | # In[2]:
26 |
27 | print ('3.device_quchong_start_app_w2c.py')
28 | path='input/'
29 | data=pd.DataFrame()
30 | #sex_age=pd.read_excel('./data/性别年龄对照表.xlsx')
31 |
32 |
33 | # In[3]:
34 |
35 |
36 | deviceid_packages=pd.read_csv(path+'deviceid_packages.tsv',sep='\t',names=['device_id','apps'])
37 | deviceid_test=pd.read_csv(path+'deviceid_test.tsv',sep='\t',names=['device_id'])
38 | deviceid_train=pd.read_csv(path+'deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
39 | deviceid_brand = pd.read_csv(path+'deviceid_brand.tsv',sep='\t', names=['device_id','device_brand', 'device_type'])
40 | deviceid_package_start_close = pd.read_csv(path+'deviceid_package_start_close.tsv',sep='\t', names=['device_id','app_id','start_time','close_time'])
41 | package_label = pd.read_csv(path+'package_label.tsv',sep='\t',names=['app_id','app_parent_type', 'app_child_type'])
42 |
43 |
44 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : str(x).split(' ')[0])
45 |
46 | df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
47 | one_time_brand = df_temp[df_temp.brand_counts == 1].device_brand.values
48 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other' if x in one_time_brand else x)
49 |
50 | df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
51 | one_time_brand = df_temp[df_temp.brand_counts == 2].device_brand.values
52 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_2' if x in one_time_brand else x)
53 |
54 | df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
55 | one_time_brand = df_temp[df_temp.brand_counts == 3].device_brand.values
56 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_3' if x in one_time_brand else x)
57 |
58 |
59 | # convert categories to their corresponding integer ids
60 | lbl = LabelEncoder()
61 | lbl.fit(list(deviceid_brand.device_brand.values))
62 | deviceid_brand['device_brand'] = lbl.transform(list(deviceid_brand.device_brand.values))
63 |
64 | lbl = LabelEncoder()
65 | lbl.fit(list(deviceid_brand.device_type.values))
66 | deviceid_brand['device_type'] = lbl.transform(list(deviceid_brand.device_type.values))
67 |
68 | # convert categories to their corresponding integer ids
69 | lbl = LabelEncoder()
70 | lbl.fit(list(package_label.app_parent_type.values))
71 | package_label['app_parent_type'] = lbl.transform(list(package_label.app_parent_type.values))
72 |
73 | lbl = LabelEncoder()
74 | lbl.fit(list(package_label.app_child_type.values))
75 | package_label['app_child_type'] = lbl.transform(list(package_label.app_child_type.values))
76 |
77 |
78 | # In[4]:
79 |
80 |
81 | import time
82 |
83 | # convert a millisecond epoch timestamp into a formatted local-time string
84 | def timeStamp(timeNum):
85 | timeStamp = float(timeNum/1000)
86 | timeArray = time.localtime(timeStamp)
87 | otherStyleTime = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
88 | return otherStyleTime
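# A vectorized alternative would be pd.to_datetime(series, unit='ms'), though that yields
# timezone-naive UTC timestamps, whereas time.localtime above applies the machine's local timezone.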
89 |
90 | # parse concrete datetime fields out of the raw timestamps
91 | deviceid_package_start_close['start_date'] = pd.to_datetime(deviceid_package_start_close.start_time.apply(timeStamp))
92 | deviceid_package_start_close['end_date'] = pd.to_datetime(deviceid_package_start_close.close_time.apply(timeStamp))
93 | deviceid_package_start_close['start_hour'] = deviceid_package_start_close.start_date.dt.hour
94 | deviceid_package_start_close['end_hour'] = deviceid_package_start_close.end_date.dt.hour
95 | deviceid_package_start_close['time_gap'] = (deviceid_package_start_close['end_date'] - deviceid_package_start_close['start_date']).astype('timedelta64[s]')
96 |
97 | deviceid_package_start_close = deviceid_package_start_close.merge(package_label, on='app_id', how='left')
98 | deviceid_package_start_close.app_parent_type.fillna(-1, inplace=True)
99 | deviceid_package_start_close.app_child_type.fillna(-1, inplace=True)
100 | deviceid_package_start_close['start_year'] = deviceid_package_start_close.start_date.dt.year
101 | deviceid_package_start_close['end_year'] = deviceid_package_start_close.end_date.dt.year
102 | deviceid_package_start_close['year_gap'] = deviceid_package_start_close['end_year'] - deviceid_package_start_close['start_year']
103 |
104 |
105 | # In[9]:
106 |
107 |
108 | df_temp = deviceid_package_start_close.groupby(['device_id', 'app_id'])['start_hour'].mean().reset_index()
109 | df_temp
110 |
111 |
112 | # In[10]:
113 |
114 |
115 | df_sorted = df_temp.sort_values(by='start_hour')
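# "quchong" (dedup) variant: each (device, app) pair is collapsed to its mean start hour and
# apps are ordered by that hour, so the Word2Vec "sentence" reflects a device's typical daily
# usage order rather than the raw event stream.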
116 |
117 |
118 | # In[13]:
119 |
120 |
121 | df_device_start_app_list = df_sorted.groupby('device_id').apply(lambda x : list(x.app_id)).reset_index().rename(columns = {0 : 'app_list'})
122 |
123 |
124 | # In[17]:
125 |
126 |
127 | app_list = list(df_device_start_app_list.app_list.values)
128 |
129 |
130 | # In[35]:
131 |
132 |
133 | model = Word2Vec(app_list, size=10, window=4, min_count=2, workers=4)
134 | model.save("word2vec.model")
135 |
136 |
137 | # In[37]:
138 |
139 |
140 | vocab = list(model.wv.vocab.keys())
141 |
142 | w2c_arr = []
143 |
144 | for v in vocab :
145 | w2c_arr.append(list(model.wv[v]))
146 |
147 |
148 | # In[38]:
149 |
150 |
151 | df_w2c_start = pd.DataFrame()
152 | df_w2c_start['app_id'] = vocab
153 | df_w2c_start = pd.concat([df_w2c_start, pd.DataFrame(w2c_arr)], axis=1)
154 | df_w2c_start.columns = ['app_id'] + ['w2c_start_app_' + str(i) for i in range(10)]
155 |
156 |
157 | # In[47]:
158 |
159 |
160 | df_sorted = df_sorted.merge(df_w2c_start, on='app_id', how='left')
161 | df_sorted
162 |
163 |
164 | # In[48]:
165 |
166 |
167 | w2c_nums = 10
168 | agg = {}
169 | for l in ['w2c_start_app_' + str(i) for i in range(w2c_nums)] :
170 | agg[l] = ['mean', 'std', 'max', 'min']
171 |
172 |
173 | # In[50]:
174 |
175 |
176 | df_agg = df_sorted.groupby('device_id').agg(agg)
177 | df_agg.columns = pd.Index(['device_quchong' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
178 | df_agg = df_agg.reset_index()
179 |
180 |
181 | # In[52]:
182 |
183 |
184 | df_agg.to_csv('device_quchong_start_app_w2c.csv', index=None)
185 |
186 |
--------------------------------------------------------------------------------
/chizhu/single_model/get_nn_feat.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import seaborn as sns
3 | import numpy as np
4 | from tqdm import tqdm
5 | from sklearn.decomposition import LatentDirichletAllocation
6 | from sklearn.model_selection import train_test_split
7 | from sklearn.metrics import accuracy_score
8 | import lightgbm as lgb
9 | from datetime import datetime, timedelta
10 | import matplotlib.pyplot as plt
11 | import time
12 | from sklearn.feature_extraction.text import TfidfTransformer
13 | from sklearn.feature_extraction.text import CountVectorizer
14 | # %matplotlib inline  (IPython magic; not valid in a plain .py script)
15 |
16 | #add
17 | import gc
18 | from sklearn import preprocessing
19 | from sklearn.feature_extraction.text import TfidfVectorizer
20 |
21 | from scipy.sparse import hstack, vstack
22 | from sklearn.model_selection import StratifiedKFold
23 | from sklearn.model_selection import cross_val_score
24 | # from skopt.space import Integer, Categorical, Real, Log10
25 | # from skopt.utils import use_named_args
26 | # from skopt import gp_minimize
27 | from gensim.models import Word2Vec, FastText
28 | import gensim
29 | import re
30 | from config import path
31 | # path = "/dev/shm/chizhu_data/data/"
32 | ### path points to the raw data files; be sure to change it for your environment
33 |
34 | test = pd.read_csv(path+'deviceid_test.tsv', sep='\t', names=['device_id'])
35 | train = pd.read_csv(path+'deviceid_train.tsv', sep='\t',
36 | names=['device_id', 'sex', 'age'])
37 | brand = pd.read_table(path+'deviceid_brand.tsv',
38 | names=['device_id', 'vendor', 'version'])
39 | packtime = pd.read_table(path+'deviceid_package_start_close.tsv',
40 | names=['device_id', 'app', 'start', 'close'])
41 | packages = pd.read_csv(path+'deviceid_packages.tsv',
42 | sep='\t', names=['device_id', 'apps'])
43 |
44 | packtime['period'] = (packtime['close'] - packtime['start'])/1000
45 | packtime['start'] = pd.to_datetime(packtime['start'], unit='ms')
46 | app_use_time = packtime.groupby(['app'])['period'].agg('sum').reset_index()
47 | app_use_top100 = app_use_time.sort_values(
48 | by='period', ascending=False)[:100]['app']
49 | device_app_use_time = packtime.groupby(['device_id', 'app'])[
50 | 'period'].agg('sum').reset_index()
51 | use_time_top100_statis = device_app_use_time.set_index(
52 | 'app').loc[list(app_use_top100)].reset_index()
53 | top100_statis = use_time_top100_statis.pivot(
54 | index='device_id', columns='app', values='period').reset_index()
55 |
56 | top100_statis = top100_statis.fillna(0)
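# top100_statis: one row per device, one column for each of the 100 apps with the highest total
# usage time, valued with that device's cumulative usage seconds (0 if the device never used the app).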
57 |
58 | # phone-brand preprocessing
59 | brand['vendor'] = brand['vendor'].astype(
60 | str).apply(lambda x: x.split(' ')[0].upper())
61 | brand['ph_ver'] = brand['vendor'] + '_' + brand['version']
62 |
63 | ph_ver = brand['ph_ver'].value_counts()
64 | ph_ver_cnt = pd.DataFrame(ph_ver).reset_index()
65 | ph_ver_cnt.columns = ['ph_ver', 'ph_ver_cnt']
66 |
67 | brand = pd.merge(left=brand, right=ph_ver_cnt, on='ph_ver')
68 |
69 | # a small adjustment for the long-tailed brand distribution
70 | mask = (brand.ph_ver_cnt < 100)
71 | brand.loc[mask, 'ph_ver'] = 'other'
72 |
73 | train = pd.merge(brand[['device_id', 'ph_ver']],
74 | train, on='device_id', how='right')
75 | test = pd.merge(brand[['device_id', 'ph_ver']],
76 | test, on='device_id', how='right')
77 | train['ph_ver'] = train['ph_ver'].astype(str)
78 | test['ph_ver'] = test['ph_ver'].astype(str)
79 |
80 | # label-encode ph_ver
81 | ph_ver_le = preprocessing.LabelEncoder()
82 | train['ph_ver'] = ph_ver_le.fit_transform(train['ph_ver'])
83 | test['ph_ver'] = ph_ver_le.transform(test['ph_ver'])
84 | train['label'] = train['sex'].astype(str) + '-' + train['age'].astype(str)
85 | label_le = preprocessing.LabelEncoder()
86 | train['label'] = label_le.fit_transform(train['label'])
87 |
88 | test['sex'] = -1
89 | test['age'] = -1
90 | test['label'] = -1
91 | data = pd.concat([train, test], ignore_index=True)
92 | # data.shape
93 |
94 | ph_ver_dummy = pd.get_dummies(data['ph_ver'])
95 | ph_ver_dummy.columns = ['ph_ver_' + str(i)
96 | for i in range(ph_ver_dummy.shape[1])]
97 |
98 | data = pd.concat([data, ph_ver_dummy], axis=1)
99 |
100 | del data['ph_ver']
101 |
102 | train = data[data.sex != -1]
103 | test = data[data.sex == -1]
104 | # train.shape, test.shape
105 |
106 | # total number of usage records per app
107 | app_num = packtime['app'].value_counts().reset_index()
108 | app_num.columns = ['app', 'app_num']
109 | packtime = pd.merge(left=packtime, right=app_num, on='app')
110 | # likewise, trim the long tail (no trimming and other cutoffs were tried; the threshold of 100 scored best)
111 | packtime.loc[packtime.app_num < 100, 'app'] = 'other'
112 |
113 | # count the number of apps on each device
114 | df_app = packtime[['device_id', 'app']]
115 | apps = df_app.drop_duplicates().groupby(['device_id'])[
116 | 'app'].apply(' '.join).reset_index()
117 | apps['app_length'] = apps['app'].apply(lambda x: len(x.split(' ')))
118 |
119 | train = pd.merge(train, apps, on='device_id', how='left')
120 | test = pd.merge(test, apps, on='device_id', how='left')
121 |
122 | # packtime['period'] = (packtime['close'] - packtime['start'])/1000
123 | # packtime['start'] = pd.to_datetime(packtime['start'], unit='ms')
124 | packtime['dayofweek'] = packtime['start'].dt.dayofweek
125 | packtime['hour'] = packtime['start'].dt.hour
126 | # packtime = packtime[(packtime['start'] < '2017-03-31 23:59:59') & (packtime['start'] > '2017-03-01 00:00:00')]
127 |
128 | app_use_time = packtime.groupby(['device_id', 'dayofweek'])[
129 | 'period'].agg('sum').reset_index()
130 | week_app_use = app_use_time.pivot_table(
131 | values='period', columns='dayofweek', index='device_id').reset_index()
132 | week_app_use = week_app_use.fillna(0)
133 | week_app_use.columns = ['device_id'] + \
134 | ['week_day_' + str(i) for i in range(0, 7)]
135 |
136 | week_app_use['week_max'] = week_app_use.max(axis=1)
137 | week_app_use['week_min'] = week_app_use.min(axis=1)
138 | week_app_use['week_sum'] = week_app_use.sum(axis=1)
139 | week_app_use['week_std'] = week_app_use.std(axis=1)
140 |
141 | # '''
142 | # for i in range(0, 7):
143 | # week_app_use['week_day_' + str(i)] = week_app_use['week_day_' + str(i)] / week_app_use['week_sum']
144 | # '''
145 |
146 | user_behavior = pd.read_csv('data/user_behavior.csv')
147 | user_behavior['app_len_max'] = user_behavior['app_len_max'].astype(np.float64)
148 | del user_behavior['app']
149 | train = pd.merge(train, user_behavior, on='device_id', how='left')
150 | test = pd.merge(test, user_behavior, on='device_id', how='left')
151 |
152 | train = pd.merge(train, week_app_use, on='device_id', how='left')
153 | test = pd.merge(test, week_app_use, on='device_id', how='left')
154 |
155 | top100_statis.columns = ['device_id'] + \
156 | ['top100_statis_' + str(i) for i in range(0, 100)]
157 | train = pd.merge(train, top100_statis, on='device_id', how='left')
158 | test = pd.merge(test, top100_statis, on='device_id', how='left')
159 |
160 | train.to_csv("data/train_statistic_feat.csv", index=False)
161 | test.to_csv("data/test_statistic_feat.csv", index=False)
162 |
--------------------------------------------------------------------------------
/THLUO/11.hcc_device_brand_age_sex.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | import pandas as pd
8 | import seaborn as sns
9 | import numpy as np
10 | from tqdm import tqdm
11 | from sklearn.decomposition import LatentDirichletAllocation
12 | from sklearn.model_selection import train_test_split  # moved from sklearn.cross_validation (removed in sklearn 0.20)
13 | from sklearn.metrics import accuracy_score
14 | import lightgbm as lgb
15 | from datetime import datetime,timedelta
16 | import time
17 | from sklearn.feature_extraction.text import TfidfTransformer
18 | from sklearn.feature_extraction.text import CountVectorizer
19 | from sklearn.preprocessing import LabelEncoder
20 | import gc
21 | from sklearn.model_selection import StratifiedKFold
22 |
23 |
24 |
25 | # In[2]:
26 |
27 | print ('11.hcc_device_brand_age_sex.py')
28 | path='input/'
29 | data=pd.DataFrame()
30 | #sex_age=pd.read_excel('./data/性别年龄对照表.xlsx')
31 |
32 |
33 | # In[3]:
34 |
35 |
36 | deviceid_packages=pd.read_csv(path+'deviceid_packages.tsv',sep='\t',names=['device_id','apps'])
37 | deviceid_test=pd.read_csv(path+'deviceid_test.tsv',sep='\t',names=['device_id'])
38 | deviceid_train=pd.read_csv(path+'deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
39 | deviceid_brand = pd.read_csv(path+'deviceid_brand.tsv',sep='\t', names=['device_id','device_brand', 'device_type'])
40 | deviceid_package_start_close = pd.read_csv(path+'deviceid_package_start_close.tsv',sep='\t', names=['device_id','app_id','start_time','close_time'])
41 | package_label = pd.read_csv(path+'package_label.tsv',sep='\t',names=['app_id','app_parent_type', 'app_child_type'])
42 |
43 |
44 | #deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : str(x).split(' ')[0])
45 |
46 |
47 | # convert categories to their corresponding integer ids
48 | lbl = LabelEncoder()
49 | lbl.fit(list(deviceid_brand.device_brand.values))
50 | deviceid_brand['device_brand'] = lbl.transform(list(deviceid_brand.device_brand.values))
51 |
52 | lbl = LabelEncoder()
53 | lbl.fit(list(deviceid_brand.device_type.values))
54 | deviceid_brand['device_type'] = lbl.transform(list(deviceid_brand.device_type.values))
55 |
56 | # convert categories to their corresponding integer ids
57 | lbl = LabelEncoder()
58 | lbl.fit(list(package_label.app_parent_type.values))
59 | package_label['app_parent_type'] = lbl.transform(list(package_label.app_parent_type.values))
60 |
61 | lbl = LabelEncoder()
62 | lbl.fit(list(package_label.app_child_type.values))
63 | package_label['app_child_type'] = lbl.transform(list(package_label.app_child_type.values))
64 |
65 |
66 | # In[4]:
67 |
68 |
69 | df_train = deviceid_train.merge(deviceid_brand, how='left', on='device_id')
70 | df_train.fillna(-1, inplace=True)
71 | df_test = deviceid_test.merge(deviceid_brand, how='left', on='device_id')
72 | df_test.fillna(-1, inplace=True)
73 |
74 |
75 | # In[5]:
76 |
77 |
78 | df_train['sex'] = df_train.sex.apply(lambda x : 1 if x == 1 else 0)
79 | df_train = df_train.join(pd.get_dummies(df_train["age"], prefix="age").astype(int))
80 | df_train['sex_age'] = df_train['sex'].map(str) + '_' + df_train['age'].map(str)
81 | Y = df_train['sex_age']
82 | Y_CAT = pd.Categorical(Y)
83 | df_train['sex_age'] = pd.Series(Y_CAT.codes)
84 | df_train = df_train.join(pd.get_dummies(df_train["sex_age"], prefix="sex_age").astype(int))
85 |
86 |
87 | # In[6]:
88 |
89 |
90 | sex_age_columns = ['sex_age_' + str(i) for i in range(22)]
91 | sex_age_prior_set = df_train[sex_age_columns].mean().values
92 | age_columns = ['age_' + str(i) for i in range(11)]
93 | age_prior_set = df_train[age_columns].mean().values
94 | sex_prior_prob= df_train.sex.mean()
95 | sex_prior_prob
96 |
97 |
98 | # In[7]:
99 |
100 |
101 | def hcc_encode(train_df, test_df, variable, target, prior_prob, k=5, f=1, g=1, update_df=None):
102 | """
103 | See "A Preprocessing Scheme for High-Cardinality Categorical Attributes in
104 | Classification and Prediction Problems" by Daniele Micci-Barreca
105 | """
106 | hcc_name = "_".join(["hcc", variable, target])
107 |
108 |     grouped = train_df.groupby(variable)[target].agg(["size", "mean"])  # dict-based renaming agg was removed in pandas 0.25
109 | grouped["lambda"] = 1 / (g + np.exp((k - grouped["size"]) / f))
110 | grouped[hcc_name] = grouped["lambda"] * grouped["mean"] + (1 - grouped["lambda"]) * prior_prob
111 |
112 | df = test_df[[variable]].join(grouped, on=variable, how="left")[hcc_name].fillna(prior_prob)
113 |
114 | if update_df is None: update_df = test_df
115 | if hcc_name not in update_df.columns: update_df[hcc_name] = np.nan
116 | update_df.update(df)
117 | return
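# Shrinkage from the Micci-Barreca paper: for a category seen n times,
#     lambda = 1 / (g + exp((k - n) / f))
# and the encoding is lambda * category_mean + (1 - lambda) * prior_prob, so rare categories
# fall back to the prior while frequent ones keep their observed target mean; categories unseen
# in training get the prior via fillna.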
118 |
119 |
120 | # In[8]:
121 |
122 |
123 | # fit the age targets
124 | # fit the test set
125 | # High-Cardinality Categorical encoding
126 | skf = StratifiedKFold(5)
127 | nums = 11
128 | for variable in ['device_brand', 'device_type'] :
129 | for i in range(nums) :
130 | target = age_columns[i]
131 | age_prior_prob = age_prior_set[i]
132 | print (variable, target, age_prior_prob)
133 | hcc_encode(df_train, df_test, variable, target, age_prior_prob, k=5, f=1, g=1, update_df=None)
134 |     # fit the validation folds (out-of-fold)
135 | for train, test in skf.split(np.zeros(len(df_train)), df_train['age']):
136 | hcc_encode(df_train.iloc[train], df_train.iloc[test], variable, target, age_prior_prob, k=5, update_df=df_train)
137 |
138 |
139 | # In[9]:
140 |
141 |
142 | # fit sex
143 | # fit the test set
144 | # High-Cardinality Categorical encoding
145 | skf = StratifiedKFold(5)
146 | for variable in ['device_brand', 'device_type'] :
147 | target = 'sex'
148 | print (variable, target, sex_prior_prob)
149 | hcc_encode(df_train, df_test, variable, target, sex_prior_prob, k=5, f=1, g=1, update_df=None)
150 |     # fit the validation folds (out-of-fold)
151 | for train, test in skf.split(np.zeros(len(df_train)), df_train['age']):
152 | hcc_encode(df_train.iloc[train], df_train.iloc[test], variable, target, sex_prior_prob, k=5, f=1, g=1, update_df=df_train)
153 |
154 |
155 | # In[10]:
156 |
157 |
158 | # fit the joint sex-age targets
159 | # fit the test set
160 | # High-Cardinality Categorical encoding
161 | skf = StratifiedKFold(5)
162 | nums = 22
163 | for variable in ['device_brand', 'device_type'] :
164 | for i in range(nums) :
165 | target = sex_age_columns[i]
166 | sex_age_prior_prob = sex_age_prior_set[i]
167 | print (variable, target, sex_age_prior_prob)
168 | hcc_encode(df_train, df_test, variable, target, sex_age_prior_prob, k=5, f=1, g=1, update_df=None)
169 |         # fit the validation folds (out-of-fold)
170 | for train, test in skf.split(np.zeros(len(df_train)), df_train['sex_age']):
171 | hcc_encode(df_train.iloc[train], df_train.iloc[test], variable, target, sex_age_prior_prob, k=5, update_df=df_train)
172 |
173 |
174 | # In[14]:
175 |
176 |
177 | hcc_columns = ['device_id'] + ['hcc_device_brand_age_' + str(i) for i in range(11)] + ['hcc_device_brand_sex'] + ['hcc_device_type_age_' + str(i) for i in range(11)] + ['hcc_device_type_sex'] + ['hcc_device_type_sex_age_' + str(i) for i in range(22)]
178 | df_total = pd.concat([df_train[hcc_columns], df_test[hcc_columns]])
179 |
180 |
181 | # In[15]:
182 |
183 |
184 | df_total.to_csv('hcc_device_brand_age_sex.csv', index=None)
185 |
186 |
--------------------------------------------------------------------------------
/nb_cz_lwl_wcm/1_get_age_reg.py:
--------------------------------------------------------------------------------
1 | # -*- coding:utf-8 -*-
2 |
3 |
4 | ####### trying some fancy tricks, targeted at this table alone
5 | import pandas as pd
6 | from sklearn.cluster import KMeans
7 | from sklearn.linear_model import LogisticRegression, SGDClassifier, PassiveAggressiveClassifier, RidgeClassifier, Ridge, \
8 | PassiveAggressiveRegressor
9 | from sklearn.metrics import mean_squared_error
10 | from sklearn.model_selection import KFold
11 | from sklearn.naive_bayes import BernoulliNB, MultinomialNB
12 | from sklearn.svm import LinearSVC, LinearSVR
13 |
14 | train = pd.read_csv('Demo/deviceid_train.tsv', sep='\t', header=None)
15 | test = pd.read_csv('Demo/deviceid_test.tsv', sep='\t', header=None)
16 | test_id = test[0]
17 | def get_label(row):
18 | return row[2]
19 | train['label'] = train.apply(lambda row:get_label(row), axis=1)
20 | data_all = pd.concat([train, test], axis=0)
21 | data_all = data_all.rename({0:'id'}, axis=1)
22 | del data_all[1],data_all[2]
23 |
24 | deviceid_packages = pd.read_csv('Demo/deviceid_packages.tsv', sep='\t', header=None)
25 | deviceid_packages = deviceid_packages.rename({0: 'id', 1: 'packages_names'}, axis=1)
26 | package_label = pd.read_csv('Demo/package_label.tsv', sep='\t', header=None)
27 | package_label = package_label.rename({0:'packages_name', 1:'packages_type'},axis=1)
28 | dict_label = dict(zip(list(package_label['packages_name']), list(package_label['packages_type'])))
29 |
30 | data_all = pd.merge(data_all, deviceid_packages, on='id', how='left')
31 |
32 | feature = pd.DataFrame()
33 |
34 | import numpy as np
35 |
36 | # number of installed apps
37 | # possibly a poison feature?
38 | # feature['app_count'] = data_all['packages_names'].apply(lambda row: len(str(row).split(',')))
39 |
40 | # build CountVectorizer and TfidfVectorizer features on this data and run several learners on the combination
41 | # the derived count and tfidf matrices feed the basic machine-learning models below
42 | data_all['package_str'] = data_all['packages_names'].apply(lambda row: str(row).replace(',', ' '))
43 | def get_more_information(row):
44 | result = ' '
45 | start = True
46 | row_list = row.split(',')
47 | for i in row_list:
48 | try:
49 | if start:
50 | result = dict_label[i]
51 | start = False
52 | else:
53 | result = result + ' ' + dict_label[i]
54 | except KeyError:
55 | pass
56 | return result
57 | data_all['package_str_more_information'] = data_all['packages_names'].apply(lambda row: get_more_information(str(row)))
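# get_more_information maps each installed package to its category label from package_label
# (silently skipping unknown packages), yielding a second "document" of app categories per device.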
58 |
59 | print(data_all)
60 |
61 | from sklearn.feature_extraction.text import CountVectorizer
62 | from sklearn.feature_extraction.text import TfidfVectorizer
63 | import scipy.sparse
64 |
65 | count_vec = CountVectorizer()
66 | count_csr_basic = count_vec.fit_transform(data_all['package_str'])
67 | tfidf_vec = TfidfVectorizer()
68 | tfidf_vec_basic = tfidf_vec.fit_transform(data_all['package_str'])
69 |
70 | count_vec = CountVectorizer()
71 | count_csr_more = count_vec.fit_transform(data_all['package_str_more_information'])
72 |
73 | tfidf_vec = TfidfVectorizer()
74 | tfidf_vec_more = tfidf_vec.fit_transform(data_all['package_str_more_information'])
75 |
76 | data_feature = scipy.sparse.csr_matrix(scipy.sparse.hstack([count_csr_basic, tfidf_vec_basic,
77 | count_csr_more, tfidf_vec_more]))
78 |
79 | train_feature = data_feature[:len(train)]
80 | score = train['label']
81 | test_feature = data_feature[len(train):]
82 | number = len(np.unique(score))
83 |
84 | X = train_feature
85 | test = test_feature
86 | y = score
87 |
88 | n_folds = 5
89 | kf = KFold(n_splits=n_folds,shuffle=True,random_state=1017)
90 | kf = kf.split(X)
91 |
92 | def xx_mse_s(y_true,y_pre):
93 |     # plain MSE on the out-of-fold predictions
94 |     y_pre = pd.DataFrame({'res': list(y_pre)})
95 |     return mean_squared_error(y_true,y_pre['res'].values)
96 |
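# Stacking scheme shared by the three regressors below: 5-fold out-of-fold predictions fill
# `stack` for the train rows, test predictions are averaged across folds, and the two pieces are
# concatenated into a single meta-feature column written to feature/*.csv.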
97 | ######################## ridge reg #########################
98 | cv_pred = []
99 | xx_mse = []
100 | stack = np.zeros((len(y),1))
101 | stack_te = np.zeros((len(test_id),1))
102 | model_1 = Ridge(solver='auto', fit_intercept=True, alpha=0.4, max_iter=250, tol=0.01, random_state=1017)  # normalize=False dropped (it was the default; the argument was removed in sklearn 1.2)
103 | for i ,(train_fold,test_fold) in enumerate(kf):
104 | X_train, X_validate, label_train, label_validate = X[train_fold, :], X[test_fold, :], y[train_fold], y[test_fold]
105 | model_1.fit(X_train, label_train)
106 | val_ = model_1.predict(X=X_validate)
107 | stack[test_fold] = np.array(val_).reshape(len(val_),1)
108 | print(xx_mse_s(label_validate, val_))
109 | cv_pred.append(model_1.predict(test))
110 | xx_mse.append(xx_mse_s(label_validate, val_))
111 | import numpy as np
112 | print('xx_result',np.mean(xx_mse))
113 | s = 0
114 | for i in cv_pred:
115 | s = s+i
116 | s = s/n_folds
117 | print(stack)
118 | print(s)
119 | df_stack1 = pd.DataFrame(stack)
120 | df_stack2 = pd.DataFrame(s)
121 | df_stack = pd.concat([df_stack1, df_stack2], axis=0)
123 | df_stack.to_csv('feature/tfidf_ling_reg.csv', encoding='utf8', index=None)
124 |
125 | ######################## par reg #########################
126 | kf = KFold(n_splits=n_folds,shuffle=True,random_state=1017)
127 | kf = kf.split(X)
128 | cv_pred = []
129 | xx_mse = []
130 | stack = np.zeros((len(y),1))
131 | model_1 = PassiveAggressiveRegressor(fit_intercept=True, max_iter=280, tol=0.01,random_state=1017)
132 | for i ,(train_fold,test_fold) in enumerate(kf):
133 | X_train, X_validate, label_train, label_validate = X[train_fold, :], X[test_fold, :], y[train_fold], y[test_fold]
134 | model_1.fit(X_train, label_train)
135 | val_ = model_1.predict(X=X_validate)
136 | stack[test_fold] = np.array(val_).reshape(len(val_),1)
137 | print(xx_mse_s(label_validate, val_))
138 | cv_pred.append(model_1.predict(test))
139 | xx_mse.append(xx_mse_s(label_validate, val_))
140 | import numpy as np
141 | print('xx_result',np.mean(xx_mse))
142 | s = 0
143 | for i in cv_pred:
144 | s = s+i
145 | s = s/n_folds
146 | print(stack)
147 | print(s)
148 | df_stack1 = pd.DataFrame(stack)
149 | df_stack2 = pd.DataFrame(s)
150 | df_stack = pd.concat([df_stack1, df_stack2], axis=0)
152 | df_stack.to_csv('feature/tfidf_par_reg.csv', encoding='utf8', index=None)
153 |
154 | ######################## svr reg #########################
155 | kf = KFold(n_splits=n_folds,shuffle=True,random_state=1017)
156 | kf = kf.split(X)
157 | cv_pred = []
158 | xx_mse = []
159 | stack = np.zeros((len(y),1))
160 | model_1 = LinearSVR(random_state=1017)
161 | for i ,(train_fold,test_fold) in enumerate(kf):
162 | X_train, X_validate, label_train, label_validate = X[train_fold, :], X[test_fold, :], y[train_fold], y[test_fold]
163 | model_1.fit(X_train, label_train)
164 | val_ = model_1.predict(X=X_validate)
165 | stack[test_fold] = np.array(val_).reshape(len(val_),1)
166 | print(xx_mse_s(label_validate, val_))
167 | cv_pred.append(model_1.predict(test))
168 | xx_mse.append(xx_mse_s(label_validate, val_))
169 | import numpy as np
170 | print('xx_result',np.mean(xx_mse))
171 | s = 0
172 | for i in cv_pred:
173 | s = s+i
174 | s = s/n_folds
175 | print(stack)
176 | print(s)
177 | df_stack1 = pd.DataFrame(stack)
178 | df_stack2 = pd.DataFrame(s)
179 | df_stack = pd.concat([df_stack1, df_stack2], axis=0)
181 | df_stack.to_csv('feature/tfidf_svr_reg.csv', encoding='utf8', index=None)
182 |
183 |
--------------------------------------------------------------------------------
/THLUO/25.thluo_22_xgb.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | import pandas as pd
8 | import seaborn as sns
9 | import numpy as np
10 | from tqdm import tqdm
11 | from sklearn.decomposition import LatentDirichletAllocation
12 | from sklearn.model_selection import train_test_split  # moved from sklearn.cross_validation (removed in sklearn 0.20)
13 | from sklearn.metrics import accuracy_score
14 | import lightgbm as lgb
15 | import xgboost as xgb
16 | from datetime import datetime,timedelta
17 | import time
18 | from sklearn.feature_extraction.text import TfidfTransformer
19 | from sklearn.feature_extraction.text import CountVectorizer
20 | from sklearn.preprocessing import LabelEncoder
21 | import gc
22 | from feat_util import *
23 |
24 |
25 | # In[2]:
26 |
27 | print ('25.thluo_22_xgb.py')
28 | path='input/'
29 | data=pd.DataFrame()
30 | #sex_age=pd.read_excel('./data/性别年龄对照表.xlsx')
31 |
32 |
33 | # In[3]:
34 |
35 |
36 | deviceid_packages=pd.read_csv(path+'deviceid_packages.tsv',sep='\t',names=['device_id','apps'])
37 | deviceid_test=pd.read_csv(path+'deviceid_test.tsv',sep='\t',names=['device_id'])
38 | deviceid_train=pd.read_csv(path+'deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
39 |
40 |
41 | # In[4]:
42 |
43 |
44 | df_train = pd.concat([deviceid_train, deviceid_test])
45 |
46 |
47 | # In[5]:
48 |
49 |
50 | df_train
51 |
52 |
53 | # In[6]:
54 |
55 |
56 | df_sex_prob_oof = pd.read_csv('device_sex_prob_oof.csv')
57 | df_age_prob_oof = pd.read_csv('device_age_prob_oof.csv')
58 | df_start_close_sex_prob_oof = pd.read_csv('start_close_sex_prob_oof.csv')
59 | # the next two do not track between local CV and the leaderboard; they overfit offline
60 | df_start_close_age_prob_oof = pd.read_csv('start_close_age_prob_oof.csv')
61 | df_tfidf_lr_sex_age_prob_oof = pd.read_csv('tfidf_lr_sex_age_prob_oof.csv')
62 | # earlier features that proved useful
63 | df_sex_age_bin_prob_oof = pd.read_csv('sex_age_bin_prob_oof.csv')
64 |
65 | df_age_bin_prob_oof = pd.read_csv('age_bin_prob_oof.csv')
66 | df_hcc_device_brand_age_sex = pd.read_csv('hcc_device_brand_age_sex.csv')
67 | df_device_age_regression_prob_oof = pd.read_csv('device_age_regression_prob_oof.csv')
68 | df_device_start_GRU_pred = pd.read_csv('device_start_GRU_pred.csv')
69 | df_device_start_GRU_pred_age = pd.read_csv('device_start_GRU_pred_age.csv')
70 | df_device_all_GRU_pred = pd.read_csv('device_all_GRU_pred.csv')
71 | df_lgb_sex_age_prob_oof = pd.read_csv('lgb_sex_age_prob_oof.csv')
72 | df_device_start_capsule_pred = pd.read_csv('device_start_capsule_pred.csv')
73 | df_device_start_textcnn_pred = pd.read_csv('device_start_textcnn_pred.csv')
74 | df_device_start_text_dpcnn_pred = pd.read_csv('device_start_text_dpcnn_pred.csv')
75 | df_device_start_lstm_pred = pd.read_csv('device_start_lstm_pred.csv')
76 | df_att_nn_feat_v6 = pd.read_csv('att_nn_feat_v6.csv')
77 | df_att_nn_feat_v6.columns = ['device_id'] + ['att_nn_feat_' + str(i) for i in range(22)]
78 |
79 | # drop the overfitting features
80 | del df_start_close_age_prob_oof['device_app_groupedstart_close_age_prob_oof_4_MEAN']
81 | del df_start_close_sex_prob_oof['device_app_groupedstart_close_sex_prob_oof_MIN']
82 | del df_start_close_sex_prob_oof['device_app_groupedstart_close_sex_prob_oof_MAX']
83 |
84 |
85 | # In[7]:
86 |
87 |
88 | df_train_w2v = df_train.merge(df_sex_prob_oof, on='device_id', how='left')
89 | df_train_w2v = df_train_w2v.merge(df_age_prob_oof, on='device_id', how='left')
90 | df_train_w2v = df_train_w2v.merge(df_start_close_sex_prob_oof, on='device_id', how='left')
91 | df_train_w2v = df_train_w2v.merge(df_start_close_age_prob_oof, on='device_id', how='left')
92 | df_train_w2v = df_train_w2v.merge(df_sex_age_bin_prob_oof, on='device_id', how='left')
93 | df_train_w2v = df_train_w2v.merge(df_age_bin_prob_oof, on='device_id', how='left')
94 | df_train_w2v = df_train_w2v.merge(df_hcc_device_brand_age_sex, on='device_id', how='left')
95 | df_train_w2v = df_train_w2v.merge(df_device_age_regression_prob_oof, on='device_id', how='left')
96 | df_train_w2v = df_train_w2v.merge(df_device_start_GRU_pred, on='device_id', how='left')
97 | df_train_w2v = df_train_w2v.merge(df_device_start_GRU_pred_age, on='device_id', how='left')
98 | df_train_w2v = df_train_w2v.merge(df_device_all_GRU_pred, on='device_id', how='left')
99 | df_train_w2v = df_train_w2v.merge(df_lgb_sex_age_prob_oof, on='device_id', how='left')
100 | df_train_w2v = df_train_w2v.merge(df_device_start_capsule_pred, on='device_id', how='left')
101 | df_train_w2v = df_train_w2v.merge(df_device_start_textcnn_pred, on='device_id', how='left')
102 | df_train_w2v = df_train_w2v.merge(df_device_start_text_dpcnn_pred, on='device_id', how='left')
103 | df_train_w2v = df_train_w2v.merge(df_device_start_lstm_pred, on='device_id', how='left')
104 | df_train_w2v = df_train_w2v.merge(df_att_nn_feat_v6, on='device_id', how='left')
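# Level-2 stacking design matrix: each merge above bolts one level-1 model's out-of-fold
# predictions onto the device table, keyed by device_id.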
105 |
106 |
107 | # In[9]:
108 |
109 |
110 | df_train_w2v['sex'] = df_train_w2v['sex'].apply(lambda x:str(x))
111 | df_train_w2v['age'] = df_train_w2v['age'].apply(lambda x:str(x))
112 | def tool(x):
113 | if x=='nan':
114 | return x
115 | else:
116 | return str(int(float(x)))
117 | df_train_w2v['sex']=df_train_w2v['sex'].apply(tool)
118 | df_train_w2v['age']=df_train_w2v['age'].apply(tool)
119 | df_train_w2v['sex_age']=df_train_w2v['sex']+'-'+df_train_w2v['age']
120 | df_train_w2v = df_train_w2v.replace({'nan':np.NaN,'nan-nan':np.NaN})
121 |
122 |
123 | # In[11]:
124 |
125 |
126 | train = df_train_w2v[df_train_w2v['sex'].notnull()]
127 | test = df_train_w2v[df_train_w2v['sex'].isnull()]
128 |
129 | X = train.drop(['sex','age','sex_age','device_id'],axis=1)
130 | Y = train['sex_age']
131 | Y_CAT = pd.Categorical(Y)
132 | Y = pd.Series(Y_CAT.codes)
133 |
134 |
135 | # In[14]:
136 |
137 |
138 | from sklearn.model_selection import KFold, StratifiedKFold
139 | gc.collect()
140 | #seed = 2048
141 | seed = 666
142 | num_folds = 5
143 | folds = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed)
144 |
145 | sub_list = []
146 |
147 | cate_feat = ['device_type','device_brand']
148 |
149 | for n_fold, (train_idx, valid_idx) in enumerate(folds.split(X, Y)):
150 | train_x, train_y = X.iloc[train_idx], Y.iloc[train_idx]
151 | valid_x, valid_y = X.iloc[valid_idx], Y.iloc[valid_idx]
152 |
153 | xg_train = xgb.DMatrix(train_x, label=train_y)
154 | xg_val = xgb.DMatrix(valid_x, label=valid_y)
155 |
156 | param = {
157 | 'objective' : 'multi:softprob',
158 | 'eta' : 0.03,
159 | 'max_depth' : 3,
160 | 'num_class' : 22,
161 | 'eval_metric' : 'mlogloss',
162 | 'min_child_weight' : 3,
163 | 'subsample' : 0.7,
164 | 'colsample_bytree' : 0.7,
165 | 'seed' : 2006,
166 | 'nthread' : 5
167 | }
168 |
169 | num_rounds = 1000
170 |
171 | watchlist = [ (xg_train,'train'), (xg_val, 'val') ]
172 | model = xgb.train(param, xg_train, num_rounds, watchlist, early_stopping_rounds=100, verbose_eval=50)
173 |
174 | test_matrix = xgb.DMatrix(test[X.columns.values])
175 | sub = pd.DataFrame(model.predict(test_matrix))
176 | sub_list.append(sub)
177 |
178 |
179 | # In[15]:
180 |
181 |
182 | sub = (sub_list[0] + sub_list[1] + sub_list[2] + sub_list[3] + sub_list[4]) / num_folds
183 | sub
184 |
185 |
186 | # In[16]:
187 |
188 |
189 | sub.columns=Y_CAT.categories
190 | sub['DeviceID']=test['device_id'].values
191 | sub=sub[['DeviceID', '1-0', '1-1', '1-2', '1-3', '1-4', '1-5', '1-6', '1-7','1-8', '1-9', '1-10', '2-0', '2-1', '2-2', '2-3', '2-4', '2-5', '2-6', '2-7', '2-8', '2-9', '2-10']]
192 | sub.to_csv('th_22_results_xgb.csv',index=False)
193 |
194 |
--------------------------------------------------------------------------------
/linwangli/code/lgb_allfeat_condProb.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding: utf-8
3 | from catboost import Pool, CatBoostClassifier, cv
4 | import pandas as pd
5 | import seaborn as sns
6 | import numpy as np
7 | from tqdm import tqdm
8 | from sklearn.decomposition import LatentDirichletAllocation
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.metrics import accuracy_score
11 | import lightgbm as lgb
12 | from datetime import datetime,timedelta
13 | import matplotlib.pyplot as plt
14 | import time
15 | from sklearn.feature_extraction.text import TfidfTransformer
16 | from sklearn.feature_extraction.text import CountVectorizer
17 | import gc
18 | from sklearn import preprocessing
19 | from sklearn.feature_extraction.text import TfidfVectorizer
20 |
21 | from scipy.sparse import hstack, vstack
22 | from sklearn.model_selection import StratifiedKFold
23 | from sklearn.model_selection import cross_val_score
24 | # from skopt.space import Integer, Categorical, Real, Log10  # unused in this script
25 | # from skopt.utils import use_named_args
26 | # from skopt import gp_minimize
27 | import re
28 |
29 |
30 | # read in the data
31 | train = pd.read_csv('../dataset/deviceid_train.tsv', sep='\t', names=['device_id', 'sex', 'age'])
32 | all_feat = pd.read_csv('../dataset/all_feat.csv')
33 |
34 |
35 | data_all = pd.merge(left=all_feat, right=train, on='device_id', how='left')
36 | train = data_all[:50000]
37 | test = data_all[50000:]
38 | train = train.fillna(-1)
39 | test = test.fillna(-1)
40 | del data_all
41 | gc.collect()
42 | use_feats = all_feat.columns[1:]
43 | use_feats
44 |
45 |
46 | # P(sex): binary model for sex (Y = sex - 1)
47 |
48 | Y = train['sex'] - 1
49 | X_train = train[use_feats]
50 | X_test = test[use_feats]
51 | kfold = StratifiedKFold(n_splits=5, random_state=10, shuffle=True)
52 | oof_preds1 = np.zeros((X_train.shape[0], ))
53 | sub1 = np.zeros((X_test.shape[0], ))
54 | for i, (train_index, test_index) in enumerate(kfold.split(X_train, Y)):
55 | X_tr, X_vl, y_tr, y_vl = X_train.iloc[train_index], X_train.iloc[test_index], Y.iloc[train_index], Y.iloc[test_index]
56 | dtrain = lgb.Dataset(X_tr, label=y_tr)
57 | dvalid = lgb.Dataset(X_vl, y_vl, reference=dtrain)
58 | params = {
59 | 'boosting_type': 'gbdt',
60 | 'max_depth':6,
61 | 'objective':'binary',
62 | 'num_leaves':31,
63 | 'subsample': 0.85,
64 | 'colsample_bytree': 0.2,
65 | 'lambda_l1':0.00007995302080034896,
66 | 'lambda_l2':0.0003648648811380991,
67 | 'subsample_freq':12,
68 | 'learning_rate': 0.012,
69 | 'min_child_weight':5.5
70 | }
71 |
72 | model = lgb.train(params,
73 | dtrain,
74 | num_boost_round=4000,
75 | valid_sets=dvalid,
76 | early_stopping_rounds=100,
77 | verbose_eval=100)
78 |
79 | oof_preds1[test_index] = model.predict(X_vl, num_iteration=model.best_iteration)
80 | sub1 += model.predict(X_test, num_iteration=model.best_iteration)/kfold.n_splits
81 |
82 |
83 | # P(age|sex = 1)
84 |
85 | train['sex_pred'] = train['sex']
86 | test['sex_pred'] = 1
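# The true sex is unknown for test rows, so the age model is scored twice on the test set:
# once assuming sex = 1 (here) and once assuming sex = 2 (below); the chain rule
# P(sex, age) = P(sex) * P(age | sex) combines the pieces at the end of the script.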
87 |
88 | use_feats = list(train.columns[1:-3])
89 | use_feats = use_feats + ['sex_pred']
90 |
91 | X_train = train[use_feats]
92 | X_test = test[use_feats]
93 |
94 | Y = train['age']
95 | kfold = StratifiedKFold(n_splits=10, random_state=10, shuffle=True)
96 | oof_preds2_1 = np.zeros((X_train.shape[0], 11))
97 | sub2_1 = np.zeros((X_test.shape[0], 11))
98 | for i, (train_index, test_index) in enumerate(kfold.split(X_train, Y)):
99 | X_tr, X_vl, y_tr, y_vl = X_train.iloc[train_index], X_train.iloc[test_index], Y.iloc[train_index], Y.iloc[test_index]
100 |
101 |
102 | dtrain = lgb.Dataset(X_tr, label=y_tr)
103 | dvalid = lgb.Dataset(X_vl, y_vl, reference=dtrain)
104 | params = {
105 | 'boosting_type': 'gbdt',
106 | 'max_depth':6,
107 | 'metric': {'multi_logloss'},
108 | 'num_class':11,
109 | 'objective':'multiclass',
110 | 'num_leaves':31,
111 | 'subsample': 0.9,
112 | 'colsample_bytree': 0.2,
113 | 'lambda_l1':0.0001,
114 | 'lambda_l2':0.00111,
115 | 'subsample_freq':10,
116 | 'learning_rate': 0.012,
117 | 'min_child_weight':10
118 | }
119 |
120 | model = lgb.train(params,
121 | dtrain,
122 | num_boost_round=4000,
123 | valid_sets=dvalid,
124 | early_stopping_rounds=100,
125 | verbose_eval=100)
126 |
127 | oof_preds2_1[test_index] = model.predict(X_vl, num_iteration=model.best_iteration)
128 | sub2_1 += model.predict(X_test, num_iteration=model.best_iteration)/kfold.n_splits
129 |
130 |
131 | # P(age|sex = 2)
132 |
133 | train['sex_pred'] = train['sex']
134 | test['sex_pred'] = 2
135 |
136 | use_feats = list(train.columns[1:-3])
137 | use_feats = use_feats + ['sex_pred']
138 |
139 | X_train = train[use_feats]
140 | X_test = test[use_feats]
141 |
142 |
143 | Y = train['age']
144 | kfold = StratifiedKFold(n_splits=10, random_state=10, shuffle=True)
145 | oof_preds2_2 = np.zeros((X_train.shape[0], 11))
146 | sub2_2 = np.zeros((X_test.shape[0], 11))
147 | for i, (train_index, test_index) in enumerate(kfold.split(X_train, Y)):
148 | X_tr, X_vl, y_tr, y_vl = X_train.iloc[train_index], X_train.iloc[test_index], Y.iloc[train_index], Y.iloc[test_index]
149 |
150 |
151 | dtrain = lgb.Dataset(X_tr, label=y_tr)
152 | dvalid = lgb.Dataset(X_vl, y_vl, reference=dtrain)
153 | params = {
154 | 'boosting_type': 'gbdt',
155 | 'max_depth':6,
156 | 'metric': {'multi_logloss'},
157 | 'num_class':11,
158 | 'objective':'multiclass',
159 | 'num_leaves':31,
160 | 'subsample': 0.9,
161 | 'colsample_bytree': 0.2,
162 | 'lambda_l1':0.0001,
163 | 'lambda_l2':0.00111,
164 | 'subsample_freq':10,
165 | 'learning_rate': 0.012,
166 | 'min_child_weight':10
167 | }
168 |
169 | model = lgb.train(params,
170 | dtrain,
171 | num_boost_round=4000,
172 | valid_sets=dvalid,
173 | early_stopping_rounds=100,
174 | verbose_eval=100)
175 |
176 | oof_preds2_2[test_index] = model.predict(X_vl, num_iteration=model.best_iteration)
177 | sub2_2 += model.predict(X_test, num_iteration=model.best_iteration)/kfold.n_splits
178 |
179 |
180 | # save the prediction results for the test set
181 | sub1 = pd.DataFrame(sub1, columns=['sex2'])
182 | sub1['sex1'] = 1-sub1['sex2']
183 | # pair each sex with the age model conditioned on that sex (sub2_2 was computed above but never used)
184 | sub2_1 = pd.DataFrame(sub2_1, columns=['age%s'%i for i in range(11)])
185 | sub2_2 = pd.DataFrame(sub2_2, columns=['age%s'%i for i in range(11)])
186 | sub = pd.DataFrame(test['device_id'].values, columns=['DeviceID'])
187 | for sex, sub2 in [('sex1', sub2_1), ('sex2', sub2_2)]:
188 |     for j in ['age%s'%i for i in range(11)]:
189 |         sub[sex+'_'+j] = sub1[sex] * sub2[j]
190 | sub.columns = ['DeviceID', '1-0', '1-1', '1-2', '1-3', '1-4', '1-5', '1-6',
191 | '1-7','1-8', '1-9', '1-10', '2-0', '2-1', '2-2', '2-3', '2-4',
192 | '2-5', '2-6', '2-7', '2-8', '2-9', '2-10']
193 |
194 | sub.to_csv('test_pred.csv', index=False)
195 |
196 |
197 | # save the out-of-fold prediction results for the training set
198 | oof_preds1 = pd.DataFrame(oof_preds1, columns=['sex2'])
199 | oof_preds1['sex1'] = 1-oof_preds1['sex2']
200 |
201 | oof_preds2_1 = pd.DataFrame(oof_preds2_1, columns=['age%s'%i for i in range(11)])
202 | oof_preds2_2 = pd.DataFrame(oof_preds2_2, columns=['age%s'%i for i in range(11)])
203 |
204 | oof_preds = train[['device_id']]
205 | oof_preds.columns = ['DeviceID']
206 |
207 | for i in ['age%s'%i for i in range(11)]:
208 | oof_preds['sex1_'+i] = oof_preds1['sex1'] * oof_preds2_1[i]
209 | for i in ['age%s'%i for i in range(11)]:
210 | oof_preds['sex2_'+i] = oof_preds1['sex2'] * oof_preds2_2[i]
211 |
212 | oof_preds.columns = ['DeviceID', '1-0', '1-1', '1-2', '1-3', '1-4', '1-5', '1-6',
213 | '1-7','1-8', '1-9', '1-10', '2-0', '2-1', '2-2', '2-3', '2-4',
214 | '2-5', '2-6', '2-7', '2-8', '2-9', '2-10']
215 |
216 | oof_preds.to_csv('train_pred.csv', index=False)
217 |
218 |
219 |
220 |
221 |
222 |
--------------------------------------------------------------------------------
/THLUO/14.device_start_GRU_pred_age.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | # coding: utf-8
8 | import feather
9 | import os
10 | import re
11 | import sys
12 | import gc
13 | import random
14 | import pandas as pd
15 | import numpy as np
16 | import gensim
17 | from gensim.models import Word2Vec
18 | from gensim.models.word2vec import LineSentence
19 | from scipy import stats
20 | import tensorflow as tf
21 | import keras
22 | from keras.layers import *
23 | from keras.models import *
24 | from keras.optimizers import *
25 | from keras.callbacks import *
26 | from keras.preprocessing import text, sequence
27 | from keras.utils import to_categorical
28 | from keras.engine.topology import Layer
29 | from sklearn.preprocessing import LabelEncoder
30 | from keras.utils import np_utils
31 | from keras.utils.training_utils import multi_gpu_model
32 | from sklearn.model_selection import train_test_split
33 | from sklearn.metrics import f1_score
34 | from sklearn.model_selection import KFold
35 | from sklearn.metrics import accuracy_score
36 | from sklearn.preprocessing import LabelEncoder
37 | from sklearn.metrics import f1_score
38 | from TextModel import *
39 | import warnings
40 | warnings.filterwarnings('ignore')
41 | config = tf.ConfigProto()
42 | config.gpu_options.allow_growth = True
43 | session = tf.Session(config=config)
44 |
45 |
46 | # In[2]:
47 | print('14.device_start_GRU_pred_age.py')
48 |
49 | df_doc = pd.read_csv('01.device_click_app_sorted_by_start.csv')
50 | deviceid_test=pd.read_csv('input/deviceid_test.tsv',sep='\t',names=['device_id'])
51 | deviceid_train=pd.read_csv('input/deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
52 | df_total = pd.concat([deviceid_train, deviceid_test])
53 | df_doc = df_doc.merge(df_total, on='device_id', how='left')
54 |
55 |
56 | df_wv2_all = pd.read_csv('w2c_all_emb.csv')
57 |
58 | dic_w2c_all = {}
59 | for row in df_wv2_all.values :
60 | app_id = row[0]
61 | vector = row[1:]
62 | dic_w2c_all[app_id] = vector
63 |
64 |
65 | # In[3]:
66 |
67 |
68 | train = df_doc[df_doc['age'].notnull()]
69 | test = df_doc[df_doc['age'].isnull()]
70 | train.reset_index(drop=True, inplace=True)
71 | test.reset_index(drop=True, inplace=True)
72 |
73 | lb = LabelEncoder()
74 | train_label = lb.fit_transform(train['age'].values)
75 | train['class'] = train_label
76 |
77 |
78 | # In[5]:
79 |
80 |
81 | column_name="app_list"
82 | word_seq_len = 900
83 | victor_size = 200
84 | num_words = 35000
85 | batch_size = 64
86 | classification = 11
87 | kfold=10
88 |
89 |
90 | # In[6]:
91 |
92 |
93 | from sklearn.metrics import log_loss
94 |
95 | def get_mut_label(y_label) :
96 | results = []
97 | for ele in y_label :
98 | results.append(ele.argmax())
99 | return results
100 |
101 | class RocAucEvaluation(Callback):
102 | def __init__(self, validation_data=(), interval=1):
103 |         super(RocAucEvaluation, self).__init__()
104 |
105 | self.interval = interval
106 | self.X_val, self.y_val = validation_data
107 |
108 | def on_epoch_end(self, epoch, logs={}):
109 | if epoch % self.interval == 0:
110 | y_pred = self.model.predict(self.X_val, verbose=0)
111 | val_y = get_mut_label(self.y_val)
112 | score = log_loss(val_y, y_pred)
113 | print("\n mlogloss - epoch: %d - score: %.6f \n" % (epoch+1, score))
114 |
115 |
116 | # In[7]:
117 |
118 |
119 | # Word vectors
120 | def w2v_pad(df_train,df_test,col, maxlen_,victor_size, num_words):
121 |
122 | tokenizer = text.Tokenizer(num_words=num_words, lower=False,filters="")
123 | tokenizer.fit_on_texts(list(df_train[col].values)+list(df_test[col].values))
124 |
125 | train_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_train[col].values), maxlen=maxlen_)
126 | test_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_test[col].values), maxlen=maxlen_)
127 |
128 | word_index = tokenizer.word_index
129 |
130 | count = 0
131 | nb_words = len(word_index)
132 | print(nb_words)
133 | all_data=pd.concat([df_train[col],df_test[col]])
134 | file_name = 'embedding/' + 'Word2Vec_start_' + col +"_"+ str(victor_size) + '.model'
135 | if not os.path.exists(file_name):
136 | model = Word2Vec([[word for word in document.split(' ')] for document in all_data.values],
137 | size=victor_size, window=5, iter=10, workers=11, seed=2018, min_count=2)
138 | model.save(file_name)
139 | else:
140 | model = Word2Vec.load(file_name)
141 | print("add word2vec finished....")
142 |
143 |
144 |
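    # Fill embedding rows from the locally trained Word2Vec model; tokens
    # missing from its vocabulary get small zero-mean random vectors. The
    # second matrix below assumes every token appears in w2c_all_emb.csv
    # (dic_w2c_all[word] would raise a KeyError otherwise).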
145 | embedding_word2vec_matrix = np.zeros((nb_words + 1, victor_size))
146 | for word, i in word_index.items():
147 | embedding_vector = model[word] if word in model else None
148 | if embedding_vector is not None:
149 | count += 1
150 | embedding_word2vec_matrix[i] = embedding_vector
151 | else:
152 | unk_vec = np.random.random(victor_size) * 0.5
153 | unk_vec = unk_vec - unk_vec.mean()
154 | embedding_word2vec_matrix[i] = unk_vec
155 |
156 | embedding_w2c_all = np.zeros((nb_words + 1, victor_size))
157 | for word, i in word_index.items():
158 | embedding_vector = dic_w2c_all[word]
159 | embedding_w2c_all[i] = embedding_vector
160 |
161 |
162 | #embedding_matrix = np.concatenate((embedding_word2vec_matrix,embedding_w2c_all),axis=1)
163 | embedding_matrix = embedding_word2vec_matrix
164 |
165 | return train_, test_, word_index, embedding_matrix
166 |
167 |
168 | # In[8]:
169 |
170 |
171 | train_, test_,word2idx, word_embedding = w2v_pad(train,test,column_name, word_seq_len,victor_size, num_words)
172 |
173 |
174 | # In[11]:
175 |
176 |
177 | my_opt="bi_gru_model"
178 | # Parameters
179 | Y = train['class'].values
180 |
181 | if not os.path.exists("cache/"+my_opt):
182 | os.mkdir("cache/"+my_opt)
183 |
184 |
185 | # In[17]:
186 |
187 |
188 | from sklearn.model_selection import KFold, StratifiedKFold
189 | gc.collect()
190 | seed = 2006
191 | num_folds = 10
192 | kf = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed).split(train_, Y)
193 |
194 | epochs = 4
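# 'bi_gru_model' is expected to come from the TextModel star import above;
# eval() below resolves the string to that model-builder function.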
195 | my_opt=eval(my_opt)
196 | train_model_pred = np.zeros((train_.shape[0], classification))
197 | test_model_pred = np.zeros((test_.shape[0], classification))
198 | for i, (train_fold, val_fold) in enumerate(kf):
199 | X_train, X_valid, = train_[train_fold, :], train_[val_fold, :]
200 | y_train, y_valid = Y[train_fold], Y[val_fold]
201 |
202 | y_tra = to_categorical(y_train)
203 | y_val = to_categorical(y_valid)
204 |
205 |     # Model
206 | name = str(my_opt.__name__)
207 |
208 | model = my_opt(word_seq_len, word_embedding, classification)
209 |
210 |
211 | RocAuc = RocAucEvaluation(validation_data=(X_valid, y_val), interval=1)
212 |
213 | hist = model.fit(X_train, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_valid, y_val),
214 | callbacks=[RocAuc])
215 |
216 |
217 | train_model_pred[val_fold, :] = model.predict(X_valid)
218 |
219 |
220 | # In[21]:
221 |
222 |
223 | # Model
224 | # Retrain on all the training data and predict the test set
225 | train_label = to_categorical(Y)
226 | name = str(my_opt.__name__)
227 |
228 | model = my_opt(word_seq_len, word_embedding, classification)
229 |
230 |
231 | RocAuc = RocAucEvaluation(validation_data=(train_, train_label), interval=1)
232 |
233 | hist = model.fit(train_, train_label, batch_size=batch_size, epochs=epochs, validation_data=(train_, train_label),
234 | callbacks=[RocAuc])
235 |
236 |
237 | test_model_pred = model.predict(test_)
238 |
239 |
240 | # In[22]:
241 |
242 |
243 | df_train_pred = pd.DataFrame(train_model_pred)
244 | df_test_pred = pd.DataFrame(test_model_pred)
245 | df_train_pred.columns = ['device_start_GRU_pred_age_' + str(i) for i in range(11)]
246 | df_test_pred.columns = ['device_start_GRU_pred_age_' + str(i) for i in range(11)]
247 |
248 |
249 | # In[23]:
250 |
251 |
252 | df_train_pred = pd.concat([train[['device_id']], df_train_pred], axis=1)
253 | df_test_pred = pd.concat([test[['device_id']], df_test_pred], axis=1)
254 |
255 |
256 | # In[24]:
257 |
258 |
259 | df_results = pd.concat([df_train_pred, df_test_pred])
260 | df_results.to_csv('device_start_GRU_pred_age.csv', index=None)
261 |
262 |
--------------------------------------------------------------------------------
/chizhu/single_model/xgb.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | import pandas as pd
8 | import seaborn as sns
9 | import numpy as np
10 | from tqdm import tqdm
11 | from sklearn.decomposition import LatentDirichletAllocation
12 | from sklearn.model_selection import train_test_split
13 | from sklearn.metrics import accuracy_score
14 | import lightgbm as lgb
15 | from datetime import datetime,timedelta
16 | import matplotlib.pyplot as plt
17 | import time
18 | from sklearn.feature_extraction.text import TfidfTransformer
19 | from sklearn.feature_extraction.text import CountVectorizer
20 | # get_ipython().run_line_magic('matplotlib', 'inline')
21 |
22 | #add
23 | import gc
24 | from sklearn import preprocessing
25 | from sklearn.feature_extraction.text import TfidfVectorizer
26 |
27 | from scipy.sparse import hstack, vstack
28 | from sklearn.model_selection import StratifiedKFold
29 | from sklearn.model_selection import cross_val_score
30 | # from skopt.space import Integer, Categorical, Real, Log10
31 | # from skopt.utils import use_named_args
32 | # from skopt import gp_minimize
33 | from gensim.models import Word2Vec, FastText
34 | import gensim
35 | import re
36 | # path="/dev/shm/chizhu_data/data/"
37 |
38 |
39 | # In[2]:
40 |
41 |
42 | tfidf_feat=pd.read_csv("data/tfidf_classfiy.csv")
43 | tf2=pd.read_csv("data/tfidf_classfiy_package.csv")
44 | train_data=pd.read_csv("data/train_data.csv")
45 | test_data=pd.read_csv("data/test_data.csv")
46 |
47 |
48 | # In[3]:
49 |
50 |
51 | train_data = pd.merge(train_data,tfidf_feat,on="device_id",how="left")
52 | train = pd.merge(train_data,tf2,on="device_id",how="left")
53 | test_data = pd.merge(test_data,tfidf_feat,on="device_id",how="left")
54 | test = pd.merge(test_data,tf2,on="device_id",how="left")
55 |
56 |
57 | # In[4]:
58 |
59 |
60 | features = [x for x in train.columns if x not in ['device_id', 'sex',"age","label","app"]]
61 | Y = train['sex'] - 1
62 |
63 |
64 | # In[19]:
65 |
66 |
67 | import lightgbm as lgb
68 | import xgboost as xgb
69 | from sklearn.metrics import auc, log_loss, roc_auc_score,f1_score,recall_score,precision_score
70 | from sklearn.cross_validation import StratifiedKFold
71 |
72 | kf = StratifiedKFold(Y, n_folds=5, shuffle=True, random_state=1024)
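# NOTE: sklearn.cross_validation is the legacy API (removed in scikit-learn
# 0.20). A modern equivalent, as a sketch rather than what the authors ran:
#   from sklearn.model_selection import StratifiedKFold
#   kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1024).split(train[features], Y)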
73 | params={
74 | 'booster':'gbtree',
75 |
76 | 'objective': 'binary:logistic',
77 | # 'is_unbalance':'True',
78 | # 'scale_pos_weight': 1500.0/13458.0,
79 | 'eval_metric': "logloss",
80 |
81 | 'gamma':0.2,#0.2 is ok
82 | 'max_depth':6,
83 | # 'lambda':20,
84 | # "alpha":5,
85 | 'subsample':0.7,
86 | 'colsample_bytree':0.4 ,
87 | # 'min_child_weight':2.5,
88 | 'eta': 0.01,
89 | # 'learning_rate':0.01,
90 | "silent":1,
91 | 'seed':1024,
92 | 'nthread':12,
93 |
94 | }
95 | num_round = 3500
96 | early_stopping_rounds = 100
97 |
98 |
99 | # In[20]:
100 |
101 |
102 | aus = []
103 | sub1 = np.zeros((len(test), ))
104 | pred_oob1=np.zeros((len(train),))
105 | for i,(train_index,test_index) in enumerate(kf):
106 |
107 | tr_x = train[features].reindex(index=train_index, copy=False)
108 | tr_y = Y[train_index]
109 | te_x = train[features].reindex(index=test_index, copy=False)
110 | te_y = Y[test_index]
111 |
112 | # tr_y=tr_y.apply(lambda x:1 if x>0 else 0)
113 | # te_y=te_y.apply(lambda x:1 if x>0 else 0)
114 | d_tr = xgb.DMatrix(tr_x, label=tr_y)
115 | d_te = xgb.DMatrix(te_x, label=te_y)
116 | watchlist = [(d_tr,'train'),
117 | (d_te,'val')
118 | ]
119 | model = xgb.train(params, d_tr, num_boost_round=5500,
120 | evals=watchlist,verbose_eval=200,
121 | early_stopping_rounds=100)
122 | pred = model.predict(d_te)
123 | pred_oob1[test_index] =pred
124 | # te_y=te_y.apply(lambda x:1 if x>0 else 0)
125 | a = log_loss(te_y, pred)
126 |
127 | sub1 += model.predict(xgb.DMatrix(test[features]))/5
128 |
129 |
130 | print ("idx: ", i)
131 | print (" loss: %.5f" % a)
132 | # print " gini: %.5f" % g
133 | aus.append(a)
134 |
135 | print ("mean")
136 | print ("logloss: %s" % (sum(aus) / 5.0))
137 |
138 |
139 | # In[21]:
140 |
141 |
142 | pred_oob1 = pd.DataFrame(pred_oob1, columns=['sex2'])
143 | sub1 = pd.DataFrame(sub1, columns=['sex2'])
144 | res1=pd.concat([pred_oob1,sub1])
145 | res1['sex1'] = 1-res1['sex2']
146 |
147 |
148 | # In[22]:
149 |
150 |
151 | import gc
152 | gc.collect()
153 |
154 |
155 | # In[23]:
156 |
157 |
158 | tfidf_feat=pd.read_csv("data/tfidf_age.csv")
159 | tf2=pd.read_csv("data/pack_tfidf_age.csv")
160 | train_data = pd.merge(train_data,tfidf_feat,on="device_id",how="left")
161 | train = pd.merge(train_data,tf2,on="device_id",how="left")
162 | test_data = pd.merge(test_data,tfidf_feat,on="device_id",how="left")
163 | test = pd.merge(test_data,tf2,on="device_id",how="left")
164 | features = [x for x in train.columns if x not in ['device_id',"age","sex","label","app"]]
165 | Y = train['age']
166 |
167 |
168 | # In[34]:
169 |
170 |
171 | import lightgbm as lgb
172 | import xgboost as xgb
173 | from sklearn.metrics import auc, log_loss, roc_auc_score,f1_score,recall_score,precision_score
174 | from sklearn.cross_validation import StratifiedKFold
175 |
176 | kf = StratifiedKFold(Y, n_folds=5, shuffle=True, random_state=1024)
177 | params={
178 | 'booster':'gbtree',
179 | 'objective': 'multi:softprob',
180 | # 'is_unbalance':'True',
181 | # 'scale_pos_weight': 1500.0/13458.0,
182 | 'eval_metric': "mlogloss",
183 | 'num_class':11,
184 | 'gamma':0.1,#0.2 is ok
185 | 'max_depth':6,
186 | # 'lambda':20,
187 | # "alpha":5,
188 | 'subsample':0.7,
189 | 'colsample_bytree':0.4 ,
190 | # 'min_child_weight':2.5,
191 | 'eta': 0.01,
192 | # 'learning_rate':0.01,
193 | "silent":1,
194 | 'seed':1024,
195 | 'nthread':12,
196 |
197 | }
198 | num_round = 3500
199 | early_stopping_rounds = 100
200 |
201 |
202 | # In[ ]:
203 |
204 |
205 | aus = []
206 | sub2 = np.zeros((len(test),11 ))
207 | pred_oob2=np.zeros((len(train),11))
208 | for i,(train_index,test_index) in enumerate(kf):
209 |
210 | tr_x = train[features].reindex(index=train_index, copy=False)
211 | tr_y = Y[train_index]
212 | te_x = train[features].reindex(index=test_index, copy=False)
213 | te_y = Y[test_index]
214 |
215 | # tr_y=tr_y.apply(lambda x:1 if x>0 else 0)
216 | # te_y=te_y.apply(lambda x:1 if x>0 else 0)
217 | d_tr = xgb.DMatrix(tr_x, label=tr_y)
218 | d_te = xgb.DMatrix(te_x, label=te_y)
219 | watchlist = [(d_tr,'train'),
220 | (d_te,'val')
221 | ]
222 | model = xgb.train(params, d_tr, num_boost_round=5500,
223 | evals=watchlist,verbose_eval=200,
224 | early_stopping_rounds=100)
225 | pred = model.predict(d_te)
226 | pred_oob2[test_index] =pred
227 | # te_y=te_y.apply(lambda x:1 if x>0 else 0)
228 | a = log_loss(te_y, pred)
229 |
230 | sub2 += model.predict(xgb.DMatrix(test[features]))/5
231 |
232 |
233 | print ("idx: ", i)
234 | print (" loss: %.5f" % a)
235 | # print " gini: %.5f" % g
236 | aus.append(a)
237 |
238 | print ("mean")
239 | print ("logloss: %s" % (sum(aus) / 5.0))
240 |
241 |
242 | # In[ ]:
243 |
244 |
245 | res2_1=np.vstack((pred_oob2,sub2))
246 | res2_1 = pd.DataFrame(res2_1)
247 |
248 |
249 | # In[ ]:
250 |
251 |
252 | res1.index=range(len(res1))
253 | res2_1.index=range(len(res2_1))
254 | final_1=res2_1.copy()
255 | final_2=res2_1.copy()
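# The submission needs P(sex=s, age=a) for all 22 sex-age cells; sex and age
# were modelled independently here, so the joint probability is approximated
# as P(sex=s) * P(age=a).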
256 | for i in range(11):
257 | final_1[i]=res1['sex1']*res2_1[i]
258 | final_2[i]=res1['sex2']*res2_1[i]
259 | id_list=pd.concat([train[['device_id']],test[['device_id']]])
260 | final=id_list
261 | final.index=range(len(final))
262 | final.columns= ['DeviceID']
263 | final_pred = pd.concat([final_1,final_2],1)
264 | final=pd.concat([final,final_pred],1)
265 | final.columns = ['DeviceID', '1-0', '1-1', '1-2', '1-3', '1-4', '1-5', '1-6',
266 | '1-7','1-8', '1-9', '1-10', '2-0', '2-1', '2-2', '2-3', '2-4',
267 | '2-5', '2-6', '2-7', '2-8', '2-9', '2-10']
268 |
269 | final.to_csv('submit/xgb_feat_chizhu.csv', index=False)
270 |
271 |
272 | # In[ ]:
273 |
274 |
275 | test['DeviceID']=test['device_id']
276 | sub=pd.merge(test[['DeviceID']],final,on="DeviceID",how="left")
277 | sub.to_csv("submit/xgb_chizhu.csv",index=False)
278 |
279 |
--------------------------------------------------------------------------------
/THLUO/21.tfidf_lr_sex_age_prob_oof.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | import pandas as pd
8 | import seaborn as sns
9 | import numpy as np
10 | from tqdm import tqdm
11 | from sklearn.decomposition import LatentDirichletAllocation
12 | from sklearn.cross_validation import train_test_split
13 | from sklearn.metrics import accuracy_score
14 | import lightgbm as lgb
15 | from datetime import datetime,timedelta
16 | import time
17 | from sklearn.feature_extraction.text import TfidfTransformer
18 | from sklearn.feature_extraction.text import CountVectorizer
19 | from sklearn.linear_model import LogisticRegression
20 | from sklearn.preprocessing import LabelEncoder
21 | import gc
22 | from sklearn.linear_model import LogisticRegression
23 | from sklearn.metrics import log_loss
24 |
25 |
26 |
27 | # In[2]:
28 |
29 | print('21.tfidf_lr_sex_age_prob_oof.py')
30 | path='input/'
31 | data=pd.DataFrame()
32 | #sex_age=pd.read_excel('./data/性别年龄对照表.xlsx')
33 |
34 |
35 | # In[3]:
36 |
37 |
38 | deviceid_packages=pd.read_csv(path+'deviceid_packages.tsv',sep='\t',names=['device_id','apps'])
39 | deviceid_test=pd.read_csv(path+'deviceid_test.tsv',sep='\t',names=['device_id'])
40 | deviceid_train=pd.read_csv(path+'deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
41 | deviceid_brand = pd.read_csv(path+'deviceid_brand.tsv',sep='\t', names=['device_id','device_brand', 'device_type'])
42 | deviceid_package_start_close = pd.read_csv(path+'deviceid_package_start_close.tsv',sep='\t', names=['device_id','app_id','start_time','close_time'])
43 | package_label = pd.read_csv(path+'package_label.tsv',sep='\t',names=['app_id','app_parent_type', 'app_child_type'])
44 |
45 |
46 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : str(x).split(' ')[0])
47 |
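# Collapse rare brands: brands seen on only 1, 2 or 3 devices are bucketed
# into 'other', 'other_2' and 'other_3' to cut cardinality before encoding.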
48 | df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
49 | one_time_brand = df_temp[df_temp.brand_counts == 1].device_brand.values
50 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other' if x in one_time_brand else x)
51 |
52 | df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
53 | one_time_brand = df_temp[df_temp.brand_counts == 2].device_brand.values
54 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_2' if x in one_time_brand else x)
55 |
56 | df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
57 | one_time_brand = df_temp[df_temp.brand_counts == 3].device_brand.values
58 | deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_3' if x in one_time_brand else x)
59 |
60 |
61 | # Encode categories as integers
62 | lbl = LabelEncoder()
63 | lbl.fit(list(deviceid_brand.device_brand.values))
64 | deviceid_brand['device_brand'] = lbl.transform(list(deviceid_brand.device_brand.values))
65 |
66 | lbl = LabelEncoder()
67 | lbl.fit(list(deviceid_brand.device_type.values))
68 | deviceid_brand['device_type'] = lbl.transform(list(deviceid_brand.device_type.values))
69 |
70 | # Encode categories as integers
71 | lbl = LabelEncoder()
72 | lbl.fit(list(package_label.app_parent_type.values))
73 | package_label['app_parent_type'] = lbl.transform(list(package_label.app_parent_type.values))
74 |
75 | lbl = LabelEncoder()
76 | lbl.fit(list(package_label.app_child_type.values))
77 | package_label['app_child_type'] = lbl.transform(list(package_label.app_child_type.values))
78 |
79 | deviceid_train = pd.concat([deviceid_train, deviceid_test])
80 |
81 |
82 | # In[4]:
83 |
84 |
85 | deviceid_package_start = deviceid_package_start_close[['device_id', 'app_id', 'start_time']]
86 | deviceid_package_start.columns = ['device_id', 'app_id', 'all_time']
87 | deviceid_package_close = deviceid_package_start_close[['device_id', 'app_id', 'close_time']]
88 | deviceid_package_close.columns = ['device_id', 'app_id', 'all_time']
89 | deviceid_package_all = pd.concat([deviceid_package_start, deviceid_package_close])
90 | deviceid_package_all = deviceid_package_all.sort_values(by='all_time')
91 | #deviceid_package_all = deviceid_package_all.merge(deviceid_train, on='device_id', how='left')
92 |
93 |
94 | # In[5]:
95 |
96 |
97 | df = deviceid_package_all.groupby('device_id').apply(lambda x : list(x.app_id)).reset_index().rename(columns = {0 : 'app_list'})
98 |
99 |
100 | # In[6]:
101 |
102 |
103 | df_sex_prob_oof = pd.read_csv('device_sex_prob_oof.csv')
104 | df_age_prob_oof = pd.read_csv('device_age_prob_oof.csv')
105 | df_start_close_sex_prob_oof = pd.read_csv('start_close_sex_prob_oof.csv')
106 | df_start_close_age_prob_oof = pd.read_csv('start_close_age_prob_oof.csv')
107 | df_start_close_sex_age_prob_oof = pd.read_csv('start_close_sex_age_prob_oof.csv')
108 |
109 |
110 | gc.collect()
111 | df = df.merge(df_sex_prob_oof, on='device_id', how='left')
112 | df = df.merge(df_age_prob_oof, on='device_id', how='left')
113 | df = df.merge(df_start_close_sex_prob_oof, on='device_id', how='left')
114 | df = df.merge(df_start_close_age_prob_oof, on='device_id', how='left')
115 | df = df.merge(df_start_close_sex_age_prob_oof, on='device_id', how='left')
116 | df.fillna(0, inplace=True)
117 | apps = df['app_list'].apply(lambda x:' '.join(x)).tolist()
118 | del df['app_list']
119 |
120 |
121 | df = df.merge(deviceid_train, on='device_id', how='left')
122 |
123 |
124 | # In[8]:
125 |
126 |
127 | vectorizer=CountVectorizer()
128 | transformer=TfidfTransformer()
129 | cntTf = vectorizer.fit_transform(apps)
130 | tfidf=transformer.fit_transform(cntTf)
131 | word=vectorizer.get_feature_names()
132 | weight=tfidf.toarray()
133 | df_weight=pd.DataFrame(weight)
134 | feature=df_weight.columns
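# Note: toarray() densifies the full device x app TF-IDF matrix, which can
# be memory-heavy for large vocabularies.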
135 |
136 |
137 | # In[9]:
138 |
139 |
140 | for i in df.columns.values:
141 |     # copy device ids, labels and the merged oof probability features
142 |     df_weight[i] = df[i]
143 |
144 |
145 | # In[11]:
146 |
147 |
148 | df_weight['sex'] = df_weight['sex'].apply(lambda x:str(x))
149 | df_weight['age'] = df_weight['age'].apply(lambda x:str(x))
150 | def tool(x):
151 | if x == 'nan':
152 | return x
153 | else:
154 | return str(int(float(x)))
155 | df_weight['sex'] = df_weight['sex'].apply(tool)
156 | df_weight['age'] = df_weight['age'].apply(tool)
157 | df_weight['sex_age'] = df_weight['sex']+'-'+df_weight['age']
158 | df_weight['sex_age'] = df_weight.sex_age.replace({'nan':np.NaN,'nan-nan':np.NaN})
159 |
160 |
161 | # In[12]:
162 |
163 |
164 | train = df_weight[df_weight.sex_age.notnull()]
165 | train.reset_index(drop=True, inplace=True)
166 | test = df_weight[df_weight.sex_age.isnull()]
167 | test.reset_index(drop=True, inplace=True)
168 | gc.collect()
169 |
170 |
171 | # In[16]:
172 |
173 |
174 | X = train.drop(['sex','age','sex_age','device_id'],axis=1)
175 | Y = train['sex_age']
176 | Y_CAT = pd.Categorical(Y)
177 | Y = pd.Series(Y_CAT.codes)
178 |
179 |
180 | # In[18]:
181 |
182 |
183 | from sklearn.model_selection import KFold, StratifiedKFold
184 | gc.collect()
185 | seed = 666
186 | num_folds = 5
187 | folds = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed)
188 |
189 | oof_preds = np.zeros([train.shape[0], 22])
190 |
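# 22 classes = 2 sexes x 11 age groups; the out-of-fold probabilities are
# saved below for use as stacking features.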
191 | for n_fold, (train_idx, valid_idx) in enumerate(folds.split(X, Y)):
192 | train_x, train_y = X.iloc[train_idx], Y.iloc[train_idx]
193 | valid_x, valid_y = X.iloc[valid_idx], Y.iloc[valid_idx]
194 |
195 |
196 | clf = LogisticRegression(C=4)
197 | clf.fit(train_x, train_y)
198 | valid_preds=clf.predict_proba(valid_x)
199 | train_preds=clf.predict_proba(train_x)
200 |
201 | oof_preds[valid_idx] = valid_preds
202 |
203 | print (log_loss(train_y.values, train_preds), log_loss(valid_y.values, valid_preds))
204 |
205 |
206 | oof_train = pd.DataFrame(oof_preds)
207 | oof_train.columns = ['tfidf_lr_sex_age_prob_oof_' + str(i) for i in range(22)]
208 | train_temp = pd.concat([train[['device_id']], oof_train], axis=1)
209 |
210 |
211 | # In[20]:
212 |
213 |
214 | # Refit on the full training set and predict the test set
215 | clf = LogisticRegression(C=4)
216 | clf.fit(X, Y)
217 | train_preds=clf.predict_proba(X)
218 | test_preds=clf.predict_proba(test[X.columns])
219 | print (log_loss(Y.values, train_preds))
220 |
221 | oof_test = pd.DataFrame(test_preds)
222 | oof_test.columns = ['tfidf_lr_sex_age_prob_oof_' + str(i) for i in range(22)]
223 |
224 |
225 | # In[24]:
226 |
227 |
228 | oof_test
229 |
230 |
231 | # In[25]:
232 |
233 |
234 | test_temp = pd.concat([test[['device_id']], oof_test], axis=1)
235 | test_temp
236 |
237 |
238 | # In[26]:
239 |
240 |
241 | sex_age_oof = pd.concat([train_temp, test_temp])
242 | sex_age_oof
243 |
244 |
245 | # In[29]:
246 |
247 |
248 | sex_age_oof.to_csv('tfidf_lr_sex_age_prob_oof.csv', index=None)
249 |
250 |
--------------------------------------------------------------------------------
/THLUO/26.thluo_nb_lgb.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | # coding: utf-8
8 |
9 | # In[1]:
10 |
11 | from sklearn.metrics import log_loss
12 | import pandas as pd
13 | import seaborn as sns
14 | import numpy as np
15 | from tqdm import tqdm
16 | from sklearn.decomposition import LatentDirichletAllocation
17 | from sklearn.model_selection import train_test_split
18 | from sklearn.metrics import accuracy_score
19 | import lightgbm as lgb
20 | from datetime import datetime,timedelta
21 | import time
22 | from sklearn.feature_extraction.text import TfidfTransformer
23 | from sklearn.feature_extraction.text import CountVectorizer
24 | # get_ipython().run_line_magic('matplotlib', 'inline')  # notebook magic; raises NameError in a plain script
25 |
26 | #add
27 | import gc
28 | from sklearn import preprocessing
29 | from sklearn.feature_extraction.text import TfidfVectorizer
30 |
31 | from scipy.sparse import hstack, vstack
32 | from sklearn.model_selection import StratifiedKFold
33 | from sklearn.model_selection import cross_val_score
34 | # from skopt.space import Integer, Categorical, Real, Log10
35 | # from skopt.utils import use_named_args
36 | # from skopt import gp_minimize
37 | from gensim.models import Word2Vec, FastText
38 | import gensim
39 | import re
40 | import os
41 | path="./"
42 | os.listdir(path)
43 |
44 |
45 | # In[2]:
46 | print ('26.thluo_nb_lgb.py')
47 |
48 | train_id=pd.read_csv("input/deviceid_train.tsv",sep="\t",names=['device_id','sex','age'])
49 | test_id=pd.read_csv("input/deviceid_test.tsv",sep="\t",names=['device_id'])
50 | all_id=pd.concat([train_id[['device_id']],test_id[['device_id']]])
51 | #nurbs=pd.read_csv("nurbs_feature_all.csv")
52 | #nurbs.columns=["nurbs_"+str(i) for i in nurbs.columns]
53 | thluo = pd.read_csv("thluo_train_best_feat.csv")
54 | del thluo['age']
55 | del thluo['sex']
56 | del thluo['sex_age']
57 |
58 |
59 | # In[7]:
60 |
61 |
62 | feat = thluo.copy()
63 |
64 |
65 | # In[8]:
66 |
67 |
68 | train=pd.merge(train_id,feat,on="device_id",how="left")
69 | test=pd.merge(test_id,feat,on="device_id",how="left")
70 |
71 |
72 | # In[11]:
73 |
74 |
75 | features = [x for x in train.columns if x not in ['device_id', 'sex',"age",]]
76 | Y = train['sex'] - 1
77 |
78 |
79 | # In[12]:
80 |
81 |
82 | from sklearn.model_selection import KFold, StratifiedKFold
83 | gc.collect()
84 | seed = 1024
85 | num_folds = 5
86 | folds = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed)
87 |
88 |
89 | # In[13]:
90 |
91 |
92 | params = {
93 | 'boosting_type': 'gbdt',
94 | 'learning_rate' : 0.02,
95 | #'max_depth':5,
96 | 'num_leaves' : 2 ** 5,
97 | 'metric': {'binary_logloss'},
98 | #'num_class' : 22,
99 | 'objective' : 'binary',
100 | 'random_state' : 6666,
101 | 'bagging_freq' : 5,
102 | 'feature_fraction' : 0.7,
103 | 'bagging_fraction' : 0.7,
104 | 'min_split_gain' : 0.0970905919552776,
105 | 'min_child_weight' : 9.42012323936088,
106 | }
107 |
108 |
109 | # In[14]:
110 |
111 |
112 | # Predict sex (binary)
113 | aus = []
114 | sub1 = np.zeros((len(test), ))
115 | pred_oob1=np.zeros((len(train),))
116 | for i,(train_index,test_index) in enumerate(folds.split(train[features], Y)):
117 |
118 | tr_x = train[features].reindex(index=train_index, copy=False)
119 | tr_y = Y[train_index]
120 | te_x = train[features].reindex(index=test_index, copy=False)
121 | te_y = Y[test_index]
122 |
123 | lgb_train=lgb.Dataset(tr_x,label=tr_y)
124 | lgb_eval = lgb.Dataset(te_x, te_y, reference=lgb_train)
125 |
126 | gbm = lgb.train(params, lgb_train, num_boost_round=300,
127 | valid_sets=[lgb_train, lgb_eval], verbose_eval=100)
128 |
129 | pred = gbm.predict(te_x[tr_x.columns.values])
130 | pred_oob1[test_index] =pred
131 | # te_y=te_y.apply(lambda x:1 if x>0 else 0)
132 | a = log_loss(te_y, pred)
133 |
134 |
135 | print ("idx: ", i)
136 | print (" loss: %.5f" % a)
137 | # print " gini: %.5f" % g
138 | aus.append(a)
139 |
140 | print ("mean")
141 | print ("logloss: %s" % (sum(aus) / 5.0))
142 |
143 |
144 | # In[15]:
145 |
146 |
147 | # Train a single LGBM on the full training set
148 | # and use it to predict the test set
149 | lgb_train = lgb.Dataset(train[features],label=Y)
150 |
151 | gbm = lgb.train(params, lgb_train, num_boost_round=300, valid_sets=lgb_train, verbose_eval=100)
152 |
153 | sub1 = gbm.predict(test[features])
154 |
155 |
156 | # In[16]:
157 |
158 |
159 | pred_oob1 = pd.DataFrame(pred_oob1, columns=['sex2'])
160 | sub1 = pd.DataFrame(sub1, columns=['sex2'])
161 | res1=pd.concat([pred_oob1,sub1])
162 | res1['sex1'] = 1-res1['sex2']
163 |
164 |
165 | # In[18]:
166 |
167 |
168 |
169 | # In[50]:
170 |
171 |
172 | features = [x for x in train.columns if x not in ['device_id',"age"]]
173 | Y = train['age']
174 |
175 |
176 | # In[51]:
177 |
178 |
179 | import lightgbm as lgb
180 | import xgboost as xgb
181 | from sklearn.metrics import auc, log_loss, roc_auc_score,f1_score,recall_score,precision_score
182 |
183 |
184 | # In[19]:
185 |
186 |
187 | from sklearn.model_selection import KFold, StratifiedKFold
188 | gc.collect()
189 | seed = 1024
190 | num_folds = 5
191 | folds = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed)
192 |
193 |
194 | # In[20]:
195 |
196 |
197 | params = {
198 | 'boosting_type': 'gbdt',
199 | 'learning_rate' : 0.02,
200 | #'max_depth':5,
201 | 'num_leaves' : 2 ** 5,
202 | 'metric': {'multi_logloss'},
203 | 'num_class' : 11,
204 | 'objective' : 'multiclass',
205 | 'random_state' : 6666,
206 | 'bagging_freq' : 5,
207 | 'feature_fraction' : 0.7,
208 | 'bagging_fraction' : 0.7,
209 | 'min_split_gain' : 0.0970905919552776,
210 | 'min_child_weight' : 9.42012323936088,
211 | }
212 |
213 |
214 | # In[22]:
215 |
216 |
217 | # Predict age (11 classes)
218 | aus = []
219 | sub2 = np.zeros((len(test),11 ))
220 | pred_oob2=np.zeros((len(train),11))
221 | models=[]
222 | iters=[]
223 | for i,(train_index,test_index) in enumerate(folds.split(train[features], Y)):
224 |
225 | tr_x = train[features].reindex(index=train_index, copy=False)
226 | tr_y = Y[train_index]
227 | te_x = train[features].reindex(index=test_index, copy=False)
228 | te_y = Y[test_index]
229 |
230 | lgb_train=lgb.Dataset(tr_x,label=tr_y)
231 | lgb_eval = lgb.Dataset(te_x, te_y, reference=lgb_train)
232 |
233 | gbm = lgb.train(params, lgb_train, num_boost_round=430,
234 | valid_sets=[lgb_train, lgb_eval], verbose_eval=100)
235 |
236 | pred = gbm.predict(te_x[tr_x.columns.values])
237 | pred_oob2[test_index] = pred
238 | # te_y=te_y.apply(lambda x:1 if x>0 else 0)
239 | a = log_loss(te_y, pred)
240 |
241 | #sub2 += gbm.predict(test[features], num_iteration=gbm.best_iteration) / 5
242 |
243 | models.append(gbm)
244 | iters.append(gbm.best_iteration)
245 |
246 | print ("idx: ", i)
247 | print (" loss: %.5f" % a)
248 | # print " gini: %.5f" % g
249 | aus.append(a)
250 |
251 | print ("mean")
252 | print ("logloss: %s" % (sum(aus) / 5.0))
253 |
254 |
255 | # In[23]:
256 |
257 |
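# The age model includes 'sex' among its features, so forcing test['sex'] to
# 1 and then 2 yields conditional predictions P(age | sex=s); these are later
# multiplied by the sex model's P(sex=s) to form the 22 joint columns.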
258 | # Predict conditional probabilities
259 | #### sex = 1
260 | test['sex']=1
261 | # Train a single LGBM on the full training set
262 | # and use it to predict the test set
263 | lgb_train = lgb.Dataset(train[features],label=Y)
264 |
265 | gbm = lgb.train(params, lgb_train, num_boost_round=430, valid_sets=lgb_train, verbose_eval=100)
266 | sub2 = gbm.predict(test[features])
267 |
268 | res2_1=np.vstack((pred_oob2,sub2))
269 | res2_1 = pd.DataFrame(res2_1)
270 |
271 |
272 | # In[24]:
273 |
274 |
275 | ### sex = 2
276 | # Predict conditional probabilities
277 | test['sex']=2
278 |
279 | sub2 = np.zeros((len(test),11))
280 | sub2 = gbm.predict(test[features], num_iteration = gbm.best_iteration)
281 | res2_2=np.vstack((pred_oob2,sub2))
282 | res2_2 = pd.DataFrame(res2_2)
283 |
284 |
285 | # In[27]:
286 |
287 |
288 | res1.index=range(len(res1))
289 | res2_1.index=range(len(res2_1))
290 | res2_2.index=range(len(res2_2))
291 | final_1=res2_1.copy()
292 | final_2=res2_2.copy()
293 |
294 |
295 | # In[28]:
296 |
297 |
298 | for i in range(11):
299 | final_1[i]=res1['sex1'] * res2_1[i]
300 | final_2[i]=res1['sex2'] * res2_2[i]
301 | id_list = pd.concat([train[['device_id']],test[['device_id']]])
302 | final = id_list
303 | final.index = range(len(final))
304 | final.columns = ['DeviceID']
305 | final_pred = pd.concat([final_1,final_2], 1)
306 | final = pd.concat([final,final_pred],1)
307 | final.columns = ['DeviceID', '1-0', '1-1', '1-2', '1-3', '1-4', '1-5', '1-6',
308 | '1-7','1-8', '1-9', '1-10', '2-0', '2-1', '2-2', '2-3', '2-4',
309 | '2-5', '2-6', '2-7', '2-8', '2-9', '2-10']
310 |
311 |
312 | # In[30]:
313 |
314 |
315 | test['DeviceID']=test['device_id']
316 | sub=pd.merge(test[['DeviceID']],final,on="DeviceID",how="left")
317 | sub.to_csv("th_lgb_nb.csv",index=False)
318 |
319 |
--------------------------------------------------------------------------------
/THLUO/13.device_start_GRU_pred.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | # coding: utf-8
8 | import feather
9 | import os
10 | import re
11 | import sys
12 | import gc
13 | import random
14 | import pandas as pd
15 | import numpy as np
16 | import gensim
17 | from gensim.models import Word2Vec
18 | from gensim.models.word2vec import LineSentence
19 | from scipy import stats
20 | import tensorflow as tf
21 | import keras
22 | from keras.layers import *
23 | from keras.models import *
24 | from keras.optimizers import *
25 | from keras.callbacks import *
26 | from keras.preprocessing import text, sequence
27 | from keras.utils import to_categorical
28 | from keras.engine.topology import Layer
29 | from sklearn.preprocessing import LabelEncoder
30 | from keras.utils import np_utils
31 | from keras.utils.training_utils import multi_gpu_model
32 | from sklearn.model_selection import train_test_split
33 | from sklearn.metrics import f1_score
34 | from sklearn.model_selection import KFold
35 | from sklearn.metrics import accuracy_score
36 | from sklearn.preprocessing import LabelEncoder
37 | from sklearn.metrics import f1_score
38 | import warnings
39 | from TextModel import *
40 | warnings.filterwarnings('ignore')
41 | config = tf.ConfigProto()
42 | config.gpu_options.allow_growth = True
43 | session = tf.Session(config=config)
44 |
45 |
46 | # In[2]:
47 | print ('13.device_start_GRU_pred.py')
48 |
49 | df_doc = pd.read_csv('01.device_click_app_sorted_by_start.csv')
50 | deviceid_test=pd.read_csv('input/deviceid_test.tsv',sep='\t',names=['device_id'])
51 | deviceid_train=pd.read_csv('input/deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
52 | df_total = pd.concat([deviceid_train, deviceid_test])
53 | df_doc = df_doc.merge(df_total, on='device_id', how='left')
54 |
55 |
56 | df_wv2_all = pd.read_csv('w2c_all_emb.csv')
57 |
58 | dic_w2c_all = {}
59 | for row in df_wv2_all.values :
60 | app_id = row[0]
61 | vector = row[1:]
62 | dic_w2c_all[app_id] = vector
63 |
64 |
65 | # In[3]:
66 |
67 |
68 | df_doc['sex'] = df_doc['sex'].apply(lambda x:str(x))
69 | df_doc['age'] = df_doc['age'].apply(lambda x:str(x))
70 | def tool(x):
71 | if x=='nan':
72 | return x
73 | else:
74 | return str(int(float(x)))
75 | df_doc['sex']=df_doc['sex'].apply(tool)
76 | df_doc['age']=df_doc['age'].apply(tool)
77 | df_doc['sex_age']=df_doc['sex']+'-'+df_doc['age']
78 | df_doc = df_doc.replace({'nan':np.NaN,'nan-nan':np.NaN})
79 | train = df_doc[df_doc['sex_age'].notnull()]
80 | test = df_doc[df_doc['sex_age'].isnull()]
81 | train.reset_index(drop=True, inplace=True)
82 | test.reset_index(drop=True, inplace=True)
83 |
84 | lb = LabelEncoder()
85 | train_label = lb.fit_transform(train['sex_age'].values)
86 | train['class'] = train_label
87 |
88 |
89 | # In[5]:
90 |
91 |
92 | column_name="app_list"
93 | word_seq_len = 900
94 | victor_size = 200
95 | num_words = 35000
96 | batch_size = 64
97 | classification = 22
98 | kfold=10
99 |
100 |
101 | # In[6]:
102 |
103 |
104 | from sklearn.metrics import log_loss
105 |
106 | def get_mut_label(y_label) :
107 | results = []
108 | for ele in y_label :
109 | results.append(ele.argmax())
110 | return results
111 |
112 | class RocAucEvaluation(Callback):
113 | def __init__(self, validation_data=(), interval=1):
114 | super(Callback, self).__init__()
115 |
116 | self.interval = interval
117 | self.X_val, self.y_val = validation_data
118 |
119 | def on_epoch_end(self, epoch, logs={}):
120 | if epoch % self.interval == 0:
121 | y_pred = self.model.predict(self.X_val, verbose=0)
122 | val_y = get_mut_label(self.y_val)
123 | score = log_loss(val_y, y_pred)
124 | print("\n mlogloss - epoch: %d - score: %.6f \n" % (epoch+1, score))
125 |
126 |
127 | # In[7]:
128 |
129 |
130 | # Word vectors
131 | def w2v_pad(df_train,df_test,col, maxlen_,victor_size, num_words):
132 |
133 | tokenizer = text.Tokenizer(num_words=num_words, lower=False,filters="")
134 | tokenizer.fit_on_texts(list(df_train[col].values)+list(df_test[col].values))
135 |
136 | train_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_train[col].values), maxlen=maxlen_)
137 | test_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_test[col].values), maxlen=maxlen_)
138 |
139 | word_index = tokenizer.word_index
140 |
141 | count = 0
142 | nb_words = len(word_index)
143 | print(nb_words)
144 | all_data=pd.concat([df_train[col],df_test[col]])
145 | file_name = 'embedding/' + 'Word2Vec_start_' + col +"_"+ str(victor_size) + '.model'
146 | if not os.path.exists(file_name):
147 | model = Word2Vec([[word for word in document.split(' ')] for document in all_data.values],
148 | size=victor_size, window=5, iter=10, workers=11, seed=2018, min_count=2)
149 | model.save(file_name)
150 | else:
151 | model = Word2Vec.load(file_name)
152 | print("add word2vec finished....")
153 |
154 |
155 |
156 | embedding_word2vec_matrix = np.zeros((nb_words + 1, victor_size))
157 | for word, i in word_index.items():
158 | embedding_vector = model[word] if word in model else None
159 | if embedding_vector is not None:
160 | count += 1
161 | embedding_word2vec_matrix[i] = embedding_vector
162 | else:
163 | unk_vec = np.random.random(victor_size) * 0.5
164 | unk_vec = unk_vec - unk_vec.mean()
165 | embedding_word2vec_matrix[i] = unk_vec
166 |
167 | embedding_w2c_all = np.zeros((nb_words + 1, victor_size))
168 | for word, i in word_index.items():
169 | embedding_vector = dic_w2c_all[word]
170 | embedding_w2c_all[i] = embedding_vector
171 |
172 |
173 | #embedding_matrix = np.concatenate((embedding_word2vec_matrix,embedding_w2c_all),axis=1)
174 | embedding_matrix = embedding_word2vec_matrix
175 |
176 | return train_, test_, word_index, embedding_matrix
177 |
178 |
179 | # In[8]:
180 |
181 |
182 | train_, test_,word2idx, word_embedding = w2v_pad(train,test,column_name, word_seq_len,victor_size, num_words)
183 |
184 |
185 | # In[11]:
186 |
187 |
188 | my_opt="bi_gru_model"
189 | # Parameters
190 | Y = train['class'].values
191 |
192 | if not os.path.exists("cache/"+my_opt):
193 | os.mkdir("cache/"+my_opt)
194 |
195 |
196 | # In[12]:
197 |
198 |
199 | from sklearn.model_selection import KFold, StratifiedKFold
200 | gc.collect()
201 | seed = 2006
202 | num_folds = 10
203 | kf = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed).split(train_, Y)
204 |
205 |
206 | # In[13]:
207 |
208 |
209 | epochs = 4
210 | my_opt=eval(my_opt)
211 | train_model_pred = np.zeros((train_.shape[0], classification))
212 | test_model_pred = np.zeros((test_.shape[0], classification))
213 | for i, (train_fold, val_fold) in enumerate(kf):
214 | X_train, X_valid, = train_[train_fold, :], train_[val_fold, :]
215 | y_train, y_valid = Y[train_fold], Y[val_fold]
216 |
217 | y_tra = to_categorical(y_train)
218 | y_val = to_categorical(y_valid)
219 |
220 |     # Model
221 | name = str(my_opt.__name__)
222 |
223 | model = my_opt(word_seq_len, word_embedding, classification)
224 |
225 |
226 | RocAuc = RocAucEvaluation(validation_data=(X_valid, y_val), interval=1)
227 |
228 | hist = model.fit(X_train, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_valid, y_val),
229 | callbacks=[RocAuc])
230 |
231 |
232 | train_model_pred[val_fold, :] = model.predict(X_valid)
233 |
234 |
235 | # In[26]:
236 |
237 |
238 | # Model
239 | # Retrain on all the training data and predict the test set
240 | train_label = to_categorical(Y)
241 | name = str(my_opt.__name__)
242 |
243 | model = my_opt(word_seq_len, word_embedding, classification)
244 |
245 |
246 | RocAuc = RocAucEvaluation(validation_data=(train_, train_label), interval=1)
247 |
248 | hist = model.fit(train_, train_label, batch_size=batch_size, epochs=epochs, validation_data=(train_, train_label),
249 | callbacks=[RocAuc])
250 |
251 |
252 | test_model_pred = model.predict(test_)
253 |
254 |
255 | # In[27]:
256 |
257 |
258 | df_train_pred = pd.DataFrame(train_model_pred)
259 | df_test_pred = pd.DataFrame(test_model_pred)
260 | df_train_pred.columns = ['device_start_GRU_pred_' + str(i) for i in range(22)]
261 | df_test_pred.columns = ['device_start_GRU_pred_' + str(i) for i in range(22)]
262 |
263 |
264 | # In[35]:
265 |
266 |
267 | df_train_pred = pd.concat([train[['device_id']], df_train_pred], axis=1)
268 | df_test_pred = pd.concat([test[['device_id']], df_test_pred], axis=1)
269 |
270 |
271 | # In[37]:
272 |
273 |
274 | df_results = pd.concat([df_train_pred, df_test_pred])
275 | df_results.to_csv('device_start_GRU_pred.csv', index=None)
276 |
277 |
--------------------------------------------------------------------------------
/THLUO/15.device_all_GRU_pred.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | # coding: utf-8
8 | import feather
9 | import os
10 | import re
11 | import sys
12 | import gc
13 | import random
14 | import pandas as pd
15 | import numpy as np
16 | import gensim
17 | from gensim.models import Word2Vec
18 | from gensim.models.word2vec import LineSentence
19 | from scipy import stats
20 | import tensorflow as tf
21 | import keras
22 | from keras.layers import *
23 | from keras.models import *
24 | from keras.optimizers import *
25 | from keras.callbacks import *
26 | from keras.preprocessing import text, sequence
27 | from keras.utils import to_categorical
28 | from keras.engine.topology import Layer
29 | from sklearn.preprocessing import LabelEncoder
30 | from keras.utils import np_utils
31 | from keras.utils.training_utils import multi_gpu_model
32 | from sklearn.model_selection import train_test_split
33 | from sklearn.metrics import f1_score
34 | from sklearn.model_selection import KFold
35 | from sklearn.metrics import accuracy_score
36 | from sklearn.preprocessing import LabelEncoder
37 | from TextModel import *
38 | from sklearn.metrics import f1_score
39 | import warnings
40 | warnings.filterwarnings('ignore')
41 | config = tf.ConfigProto()
42 | config.gpu_options.allow_growth = True
43 | session = tf.Session(config=config)
44 |
45 |
46 | # In[2]:
47 | print('15.device_all_GRU_pred.py')
48 |
49 | df_doc = pd.read_csv('03.device_click_app_sorted_by_all.csv')
50 | deviceid_test=pd.read_csv('input/deviceid_test.tsv',sep='\t',names=['device_id'])
51 | deviceid_train=pd.read_csv('input/deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
52 | df_total = pd.concat([deviceid_train, deviceid_test])
53 | df_doc = df_doc.merge(df_total, on='device_id', how='left')
54 |
55 |
56 | df_wv2_all = pd.read_csv('w2c_all_emb.csv')
57 |
58 | dic_w2c_all = {}
59 | for row in df_wv2_all.values :
60 | app_id = row[0]
61 | vector = row[1:]
62 | dic_w2c_all[app_id] = vector
63 |
64 | # In[3]:
65 |
66 |
67 | df_doc['sex'] = df_doc['sex'].apply(lambda x:str(x))
68 | df_doc['age'] = df_doc['age'].apply(lambda x:str(x))
69 | def tool(x):
70 | if x=='nan':
71 | return x
72 | else:
73 | return str(int(float(x)))
74 | df_doc['sex']=df_doc['sex'].apply(tool)
75 | df_doc['age']=df_doc['age'].apply(tool)
76 | df_doc['sex_age']=df_doc['sex']+'-'+df_doc['age']
77 | df_doc = df_doc.replace({'nan':np.NaN,'nan-nan':np.NaN})
78 | train = df_doc[df_doc['sex_age'].notnull()]
79 | test = df_doc[df_doc['sex_age'].isnull()]
80 | train.reset_index(drop=True, inplace=True)
81 | test.reset_index(drop=True, inplace=True)
82 |
83 | lb = LabelEncoder()
84 | train_label = lb.fit_transform(train['sex_age'].values)
85 | train['class'] = train_label
86 |
87 |
88 | # In[6]:
89 |
90 |
91 | column_name="app_list"
92 | word_seq_len = 1800
93 | victor_size = 200
94 | num_words = 35000
95 | batch_size = 64
96 | classification = 22
97 | kfold=10
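# This script uses the 'all' sequences (presumably start and close events
# interleaved, per 03.device_click_app_sorted_by_all.csv), hence the doubled
# word_seq_len (1800 vs 900 in the start-only scripts) and the larger
# Word2Vec window (30) below.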
98 |
99 |
100 | # In[7]:
101 |
102 |
103 | from sklearn.metrics import log_loss
104 |
105 | def get_mut_label(y_label) :
106 | results = []
107 | for ele in y_label :
108 | results.append(ele.argmax())
109 | return results
110 |
111 | class RocAucEvaluation(Callback):
112 | def __init__(self, validation_data=(), interval=1):
113 | super(Callback, self).__init__()
114 |
115 | self.interval = interval
116 | self.X_val, self.y_val = validation_data
117 |
118 | def on_epoch_end(self, epoch, logs={}):
119 | if epoch % self.interval == 0:
120 | y_pred = self.model.predict(self.X_val, verbose=0)
121 | val_y = get_mut_label(self.y_val)
122 | score = log_loss(val_y, y_pred)
123 | print("\n mlogloss - epoch: %d - score: %.6f \n" % (epoch+1, score))
124 |
125 |
126 | # In[14]:
127 |
128 |
129 | # Word vectors
130 | def w2v_pad(df_train,df_test,col, maxlen_,victor_size, num_words):
131 |
132 | tokenizer = text.Tokenizer(num_words=num_words, lower=False,filters="")
133 | tokenizer.fit_on_texts(list(df_train[col].values)+list(df_test[col].values))
134 |
135 | train_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_train[col].values), maxlen=maxlen_)
136 | test_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_test[col].values), maxlen=maxlen_)
137 |
138 | word_index = tokenizer.word_index
139 |
140 | count = 0
141 | nb_words = len(word_index)
142 | print(nb_words)
143 | all_data=pd.concat([df_train[col],df_test[col]])
144 | file_name = 'embedding/' + 'Word2Vec_all' + col +"_"+ str(victor_size) + '.model'
145 | if not os.path.exists(file_name):
146 | model = Word2Vec([[word for word in document.split(' ')] for document in all_data.values],
147 | size=victor_size, window=30, iter=10, workers=11, seed=2018, min_count=2)
148 | model.save(file_name)
149 | else:
150 | model = Word2Vec.load(file_name)
151 | print("add word2vec finished....")
152 |
153 |
154 |
155 | embedding_word2vec_matrix = np.zeros((nb_words + 1, victor_size))
156 | for word, i in word_index.items():
157 | embedding_vector = model[word] if word in model else None
158 | if embedding_vector is not None:
159 | count += 1
160 | embedding_word2vec_matrix[i] = embedding_vector
161 | else:
162 | unk_vec = np.random.random(victor_size) * 0.5
163 | unk_vec = unk_vec - unk_vec.mean()
164 | embedding_word2vec_matrix[i] = unk_vec
165 |
166 | embedding_w2c_all = np.zeros((nb_words + 1, victor_size))
167 | for word, i in word_index.items():
168 | embedding_vector = dic_w2c_all[word]
169 | embedding_w2c_all[i] = embedding_vector
170 |
171 |
172 | #embedding_matrix = np.concatenate((embedding_word2vec_matrix,embedding_w2c_all),axis=1)
173 | embedding_matrix = embedding_word2vec_matrix
174 |
175 | return train_, test_, word_index, embedding_matrix
176 |
177 |
178 | # In[15]:
179 |
180 |
181 | train_, test_,word2idx, word_embedding = w2v_pad(train,test,column_name, word_seq_len,victor_size, num_words)
182 |
183 |
184 | # In[21]:
185 |
186 |
187 | my_opt="bi_gru_model"
188 | # Parameters
189 | Y = train['class'].values
190 |
191 | if not os.path.exists("cache/"+my_opt):
192 | os.mkdir("cache/"+my_opt)
193 |
194 |
195 | # In[22]:
196 |
197 |
198 | from sklearn.model_selection import KFold, StratifiedKFold
199 | gc.collect()
200 | seed = 2006
201 | num_folds = 10
202 | kf = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed).split(train_, Y)
203 |
204 |
205 | # In[23]:
206 |
207 |
208 | epochs = 4
209 | my_opt=eval(my_opt)
210 | train_model_pred = np.zeros((train_.shape[0], classification))
211 | test_model_pred = np.zeros((test_.shape[0], classification))
212 | for i, (train_fold, val_fold) in enumerate(kf):
213 | X_train, X_valid, = train_[train_fold, :], train_[val_fold, :]
214 | y_train, y_valid = Y[train_fold], Y[val_fold]
215 |
216 | y_tra = to_categorical(y_train)
217 | y_val = to_categorical(y_valid)
218 |
219 |     # Model
220 | name = str(my_opt.__name__)
221 |
222 | model = my_opt(word_seq_len, word_embedding, classification)
223 |
224 |
225 | RocAuc = RocAucEvaluation(validation_data=(X_valid, y_val), interval=1)
226 |
227 | hist = model.fit(X_train, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_valid, y_val),
228 | callbacks=[RocAuc])
229 |
230 |
231 | train_model_pred[val_fold, :] = model.predict(X_valid)
232 |
233 | del model
234 | del hist
235 | gc.collect()
236 |
237 |
238 | # In[27]:
239 |
240 |
241 | # Model
242 | # Retrain on all the training data and predict the test set
243 | train_label = to_categorical(Y)
244 | name = str(my_opt.__name__)
245 |
246 | model = my_opt(word_seq_len, word_embedding, classification)
247 |
248 |
249 | RocAuc = RocAucEvaluation(validation_data=(train_, train_label), interval=1)
250 |
251 | hist = model.fit(train_, train_label, batch_size=batch_size, epochs=epochs, validation_data=(train_, train_label),
252 | callbacks=[RocAuc])
253 |
254 |
255 | test_model_pred = model.predict(test_)
256 |
257 |
258 | # In[28]:
259 |
260 |
261 | df_train_pred = pd.DataFrame(train_model_pred)
262 | df_test_pred = pd.DataFrame(test_model_pred)
263 | df_train_pred.columns = ['device_all_GRU_pred_' + str(i) for i in range(22)]
264 | df_test_pred.columns = ['device_all_GRU_pred_' + str(i) for i in range(22)]
265 |
266 |
267 | # In[29]:
268 |
269 |
270 | df_train_pred = pd.concat([train[['device_id']], df_train_pred], axis=1)
271 | df_test_pred = pd.concat([test[['device_id']], df_test_pred], axis=1)
272 |
273 |
274 | # In[30]:
275 |
276 |
277 | df_results = pd.concat([df_train_pred, df_test_pred])
278 | df_results.to_csv('device_all_GRU_pred.csv', index=None)
279 |
280 |
--------------------------------------------------------------------------------
/THLUO/16.device_start_capsule_pred.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | # coding: utf-8
8 | import feather
9 | import os
10 | import re
11 | import sys
12 | import gc
13 | import random
14 | import pandas as pd
15 | import numpy as np
16 | import gensim
17 | from gensim.models import Word2Vec
18 | from gensim.models.word2vec import LineSentence
19 | from scipy import stats
20 | import tensorflow as tf
21 | import keras
22 | from keras.layers import *
23 | from keras.models import *
24 | from keras.optimizers import *
25 | from keras.callbacks import *
26 | from keras.preprocessing import text, sequence
27 | from keras.utils import to_categorical
28 | from keras.engine.topology import Layer
29 | from sklearn.preprocessing import LabelEncoder
30 | from keras.utils import np_utils
31 | from keras.utils.training_utils import multi_gpu_model
32 | from sklearn.model_selection import train_test_split
33 | from sklearn.metrics import f1_score
34 | from sklearn.model_selection import KFold
35 | from sklearn.metrics import accuracy_score
36 | from sklearn.preprocessing import LabelEncoder
37 | from sklearn.metrics import f1_score
38 | import warnings
39 | warnings.filterwarnings('ignore')
40 | config = tf.ConfigProto()
41 | config.gpu_options.allow_growth = True
42 | session = tf.Session(config=config)
43 |
44 |
45 | # In[2]:
46 | print ('16.device_start_capsule_pred.py')
47 |
48 | df_doc = pd.read_csv('01.device_click_app_sorted_by_start.csv')
49 | deviceid_test=pd.read_csv('input/deviceid_test.tsv',sep='\t',names=['device_id'])
50 | deviceid_train=pd.read_csv('input/deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
51 | df_total = pd.concat([deviceid_train, deviceid_test])
52 | df_doc = df_doc.merge(df_total, on='device_id', how='left')
53 |
54 |
55 | df_wv2_all = pd.read_csv('w2c_all_emb.csv')
56 |
57 | dic_w2c_all = {}
58 | for row in df_wv2_all.values :
59 | app_id = row[0]
60 | vector = row[1:]
61 | dic_w2c_all[app_id] = vector
62 |
63 |
64 | # In[3]:
65 |
66 |
67 | df_doc['sex'] = df_doc['sex'].apply(lambda x:str(x))
68 | df_doc['age'] = df_doc['age'].apply(lambda x:str(x))
69 | def tool(x):
70 | if x=='nan':
71 | return x
72 | else:
73 | return str(int(float(x)))
74 | df_doc['sex']=df_doc['sex'].apply(tool)
75 | df_doc['age']=df_doc['age'].apply(tool)
76 | df_doc['sex_age']=df_doc['sex']+'-'+df_doc['age']
77 | df_doc = df_doc.replace({'nan':np.NaN,'nan-nan':np.NaN})
78 | train = df_doc[df_doc['sex_age'].notnull()]
79 | test = df_doc[df_doc['sex_age'].isnull()]
80 | train.reset_index(drop=True, inplace=True)
81 | test.reset_index(drop=True, inplace=True)
82 |
83 | lb = LabelEncoder()
84 | train_label = lb.fit_transform(train['sex_age'].values)
85 | train['class'] = train_label
86 |
87 |
88 | # In[5]:
89 |
90 |
91 | column_name="app_list"
92 | word_seq_len = 900
93 | victor_size = 200
94 | num_words = 35000
95 | batch_size = 64
96 | classification = 22
97 | kfold=10
98 |
99 |
100 | # In[6]:
101 |
102 |
103 | from sklearn.metrics import log_loss
104 |
105 | def get_mut_label(y_label) :
106 | results = []
107 | for ele in y_label :
108 | results.append(ele.argmax())
109 | return results
110 |
111 | class RocAucEvaluation(Callback):
112 | def __init__(self, validation_data=(), interval=1):
113 | super(Callback, self).__init__()
114 |
115 | self.interval = interval
116 | self.X_val, self.y_val = validation_data
117 |
118 | def on_epoch_end(self, epoch, logs={}):
119 | if epoch % self.interval == 0:
120 | y_pred = self.model.predict(self.X_val, verbose=0)
121 | val_y = get_mut_label(self.y_val)
122 | score = log_loss(val_y, y_pred)
123 | print("\n mlogloss - epoch: %d - score: %.6f \n" % (epoch+1, score))
124 |
125 |
126 | # In[7]:
127 |
128 |
129 | # Word vectors
130 | def w2v_pad(df_train,df_test,col, maxlen_,victor_size, num_words):
131 |
132 | tokenizer = text.Tokenizer(num_words=num_words, lower=False,filters="")
133 | tokenizer.fit_on_texts(list(df_train[col].values)+list(df_test[col].values))
134 |
135 | train_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_train[col].values), maxlen=maxlen_)
136 | test_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_test[col].values), maxlen=maxlen_)
137 |
138 | word_index = tokenizer.word_index
139 |
140 | count = 0
141 | nb_words = len(word_index)
142 | print(nb_words)
143 | all_data=pd.concat([df_train[col],df_test[col]])
144 | file_name = 'embedding/' + 'Word2Vec_start_' + col +"_"+ str(victor_size) + '.model'
145 | if not os.path.exists(file_name):
146 | model = Word2Vec([[word for word in document.split(' ')] for document in all_data.values],
147 | size=victor_size, window=5, iter=10, workers=11, seed=2018, min_count=2)
148 | model.save(file_name)
149 | else:
150 | model = Word2Vec.load(file_name)
151 | print("add word2vec finished....")
152 |
153 |
154 |
155 | embedding_word2vec_matrix = np.zeros((nb_words + 1, victor_size))
156 | for word, i in word_index.items():
157 | embedding_vector = model[word] if word in model else None
158 | if embedding_vector is not None:
159 | count += 1
160 | embedding_word2vec_matrix[i] = embedding_vector
161 | else:
162 | unk_vec = np.random.random(victor_size) * 0.5
163 | unk_vec = unk_vec - unk_vec.mean()
164 | embedding_word2vec_matrix[i] = unk_vec
165 |
166 | embedding_w2c_all = np.zeros((nb_words + 1, victor_size))
167 | for word, i in word_index.items():
168 | embedding_vector = dic_w2c_all[word]
169 | embedding_w2c_all[i] = embedding_vector
170 |
171 |
172 | #embedding_matrix = np.concatenate((embedding_word2vec_matrix,embedding_w2c_all),axis=1)
173 | embedding_matrix = embedding_word2vec_matrix
174 |
175 | return train_, test_, word_index, embedding_matrix
176 |
177 |
178 | # In[8]:
179 |
180 |
181 | train_, test_,word2idx, word_embedding = w2v_pad(train,test,column_name, word_seq_len,victor_size, num_words)
182 |
183 |
184 | # In[10]:
185 |
186 |
187 | from TextModel import *
188 |
189 |
190 | # In[18]:
191 |
192 |
193 | my_opt="get_text_capsule"
194 | # Parameters
195 | Y = train['class'].values
196 |
197 | if not os.path.exists("cache/"+my_opt):
198 | os.mkdir("cache/"+my_opt)
199 |
200 |
201 |
202 | # In[19]:
203 |
204 |
205 | from sklearn.model_selection import KFold, StratifiedKFold
206 | gc.collect()
207 | seed = 2006
208 | num_folds = 5
209 | kf = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed).split(train_, Y)
210 |
211 |
212 | # In[20]:
213 |
214 |
215 | epochs = 10
216 | my_opt=eval(my_opt)
217 | train_model_pred = np.zeros((train_.shape[0], classification))
218 | test_model_pred = np.zeros((test_.shape[0], classification))
219 | for i, (train_fold, val_fold) in enumerate(kf):
220 | X_train, X_valid, = train_[train_fold, :], train_[val_fold, :]
221 | y_train, y_valid = Y[train_fold], Y[val_fold]
222 |
223 | y_tra = to_categorical(y_train)
224 | y_val = to_categorical(y_valid)
225 |
226 |     # Model
227 | name = str(my_opt.__name__)
228 |
229 | model = my_opt(word_seq_len, word_embedding, classification)
230 |
231 |
232 | RocAuc = RocAucEvaluation(validation_data=(X_valid, y_val), interval=1)
233 |
234 | hist = model.fit(X_train, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_valid, y_val),
235 | callbacks=[RocAuc])
236 |
237 |
238 | train_model_pred[val_fold, :] = model.predict(X_valid)
239 |
240 |
241 | # In[24]:
242 |
243 |
244 | # Model
245 | # Retrain on all the training data and predict the test set
246 | train_label = to_categorical(Y)
247 | name = str(my_opt.__name__)
248 |
249 | model = my_opt(word_seq_len, word_embedding, classification)
250 |
251 |
252 | RocAuc = RocAucEvaluation(validation_data=(train_, train_label), interval=1)
253 |
254 | hist = model.fit(train_, train_label, batch_size=batch_size, epochs=epochs, validation_data=(train_, train_label),
255 | callbacks=[RocAuc])
256 |
257 |
258 | test_model_pred = model.predict(test_)
259 |
260 |
261 | # In[25]:
262 |
263 |
264 | df_train_pred = pd.DataFrame(train_model_pred)
265 | df_test_pred = pd.DataFrame(test_model_pred)
266 | df_train_pred.columns = ['device_start_capsule_pred_' + str(i) for i in range(22)]
267 | df_test_pred.columns = ['device_start_capsule_pred_' + str(i) for i in range(22)]
268 |
269 |
270 | # In[26]:
271 |
272 |
273 | df_train_pred = pd.concat([train[['device_id']], df_train_pred], axis=1)
274 | df_test_pred = pd.concat([test[['device_id']], df_test_pred], axis=1)
275 |
276 |
277 | # In[27]:
278 |
279 |
280 | df_results = pd.concat([df_train_pred, df_test_pred])
281 | df_results.to_csv('device_start_capsule_pred.csv', index=None)
282 |
283 |
--------------------------------------------------------------------------------
/THLUO/17.device_start_textcnn_pred.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | # coding: utf-8
8 | import feather
9 | import os
10 | import re
11 | import sys
12 | import gc
13 | import random
14 | import pandas as pd
15 | import numpy as np
16 | import gensim
17 | from gensim.models import Word2Vec
18 | from gensim.models.word2vec import LineSentence
19 | from scipy import stats
20 | import tensorflow as tf
21 | import keras
22 | from keras.layers import *
23 | from keras.models import *
24 | from keras.optimizers import *
25 | from keras.callbacks import *
26 | from keras.preprocessing import text, sequence
27 | from keras.utils import to_categorical
28 | from keras.engine.topology import Layer
29 | from sklearn.preprocessing import LabelEncoder
30 | from keras.utils import np_utils
31 | from keras.utils.training_utils import multi_gpu_model
32 | from sklearn.model_selection import train_test_split
33 | from sklearn.metrics import f1_score
34 | from sklearn.model_selection import KFold
35 | from sklearn.metrics import accuracy_score
36 | from sklearn.preprocessing import LabelEncoder
37 | from sklearn.metrics import f1_score
38 | import warnings
39 | warnings.filterwarnings('ignore')
40 | config = tf.ConfigProto()
41 | config.gpu_options.allow_growth = True
42 | session = tf.Session(config=config)
43 |
44 |
45 | # In[2]:
46 | print ('17.device_start_textcnn_pred.py')
47 |
48 | df_doc = pd.read_csv('01.device_click_app_sorted_by_start.csv')
49 | deviceid_test=pd.read_csv('input/deviceid_test.tsv',sep='\t',names=['device_id'])
50 | deviceid_train=pd.read_csv('input/deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
51 | df_total = pd.concat([deviceid_train, deviceid_test])
52 | df_doc = df_doc.merge(df_total, on='device_id', how='left')
53 |
54 |
55 | df_wv2_all = pd.read_csv('w2c_all_emb.csv')
56 |
57 | dic_w2c_all = {}
58 | for row in df_wv2_all.values :
59 | app_id = row[0]
60 | vector = row[1:]
61 | dic_w2c_all[app_id] = vector
62 |
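    | # w2c_all_emb.csv (presumably produced by 3.w2c_all_emb.py) holds one pretrained
    | # word2vec vector per app id; dic_w2c_all maps app_id -> vector for the alternative
    | # embedding matrix built in w2v_pad (currently unused, see the commented-out
    | # concatenate there).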
63 |
64 | # In[3]:
65 |
66 |
67 | df_doc['sex'] = df_doc['sex'].apply(lambda x:str(x))
68 | df_doc['age'] = df_doc['age'].apply(lambda x:str(x))
69 | def tool(x):
70 | if x=='nan':
71 | return x
72 | else:
73 | return str(int(float(x)))
74 | df_doc['sex']=df_doc['sex'].apply(tool)
75 | df_doc['age']=df_doc['age'].apply(tool)
76 | df_doc['sex_age']=df_doc['sex']+'-'+df_doc['age']
77 | df_doc = df_doc.replace({'nan':np.NaN,'nan-nan':np.NaN})
78 | train = df_doc[df_doc['sex_age'].notnull()]
79 | test = df_doc[df_doc['sex_age'].isnull()]
80 | train.reset_index(drop=True, inplace=True)
81 | test.reset_index(drop=True, inplace=True)
82 |
83 | lb = LabelEncoder()
84 | train_label = lb.fit_transform(train['sex_age'].values)
85 | train['class'] = train_label
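    | # 2 sex values x 11 age buckets give the 22 "sex-age" combinations; LabelEncoder maps
    | # them to the integer ids stored in 'class', matching classification = 22 below.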
86 |
87 |
88 | # In[5]:
89 |
90 |
91 | column_name="app_list"
92 | word_seq_len = 900
93 | victor_size = 200
94 | num_words = 35000
95 | batch_size = 64
96 | classification = 22
97 | kfold=10  # note: unused; the CV below uses num_folds = 5
98 |
99 |
100 | # In[6]:
101 |
102 |
103 | from sklearn.metrics import log_loss
104 |
105 | def get_mut_label(y_label) :
106 | results = []
107 | for ele in y_label :
108 | results.append(ele.argmax())
109 | return results
110 |
111 | class RocAucEvaluation(Callback):  # despite its name, this callback reports multiclass log loss
112 | def __init__(self, validation_data=(), interval=1):
113 | super(Callback, self).__init__()
114 |
115 | self.interval = interval
116 | self.X_val, self.y_val = validation_data
117 |
118 | def on_epoch_end(self, epoch, logs={}):
119 | if epoch % self.interval == 0:
120 | y_pred = self.model.predict(self.X_val, verbose=0)
121 | val_y = get_mut_label(self.y_val)
122 | score = log_loss(val_y, y_pred)
123 | print("\n mlogloss - epoch: %d - score: %.6f \n" % (epoch+1, score))
124 |
125 |
126 | # In[7]:
127 |
128 |
129 | # word embeddings: tokenize the app lists, pad to maxlen_, and build the embedding matrix
130 | def w2v_pad(df_train,df_test,col, maxlen_,victor_size, num_words):
131 |
132 | tokenizer = text.Tokenizer(num_words=num_words, lower=False,filters="")
133 | tokenizer.fit_on_texts(list(df_train[col].values)+list(df_test[col].values))
134 |
135 | train_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_train[col].values), maxlen=maxlen_)
136 | test_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_test[col].values), maxlen=maxlen_)
137 |
138 | word_index = tokenizer.word_index
139 |
140 | count = 0
141 | nb_words = len(word_index)
142 | print(nb_words)
143 | all_data=pd.concat([df_train[col],df_test[col]])
144 | file_name = 'embedding/' + 'Word2Vec_start_' + col +"_"+ str(victor_size) + '.model'
145 | if not os.path.exists(file_name):
146 | model = Word2Vec([[word for word in document.split(' ')] for document in all_data.values],
147 | size=victor_size, window=5, iter=10, workers=11, seed=2018, min_count=2)
148 | model.save(file_name)
149 | else:
150 | model = Word2Vec.load(file_name)
151 | print("add word2vec finished....")
152 |
153 |
154 |
155 | embedding_word2vec_matrix = np.zeros((nb_words + 1, victor_size))
156 | for word, i in word_index.items():
157 | embedding_vector = model[word] if word in model else None
158 | if embedding_vector is not None:
159 | count += 1
160 | embedding_word2vec_matrix[i] = embedding_vector
161 | else:
162 | unk_vec = np.random.random(victor_size) * 0.5
163 | unk_vec = unk_vec - unk_vec.mean()
164 | embedding_word2vec_matrix[i] = unk_vec
165 |
166 | embedding_w2c_all = np.zeros((nb_words + 1, victor_size))
167 | for word, i in word_index.items():
168 | embedding_vector = dic_w2c_all[word]
169 | embedding_w2c_all[i] = embedding_vector
170 |
171 |
172 | #embedding_matrix = np.concatenate((embedding_word2vec_matrix,embedding_w2c_all),axis=1)
173 | embedding_matrix = embedding_word2vec_matrix
174 |
175 | return train_, test_, word_index, embedding_matrix
176 |
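    | # The Word2Vec model is cached as embedding/Word2Vec_start_app_list_200.model; the
    | # capsule/textcnn/lstm/dpcnn scripts all build the same file name, so the embedding
    | # appears to be trained once and reloaded by whichever script runs next.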
177 |
178 | # In[8]:
179 |
180 |
181 | train_, test_,word2idx, word_embedding = w2v_pad(train,test,column_name, word_seq_len,victor_size, num_words)
182 |
183 |
184 | # In[10]:
185 |
186 |
187 | from TextModel import *
188 |
189 |
190 | # In[19]:
191 |
192 |
193 | my_opt="get_text_cnn2"
194 | # parameters
195 | Y = train['class'].values
196 |
197 | if not os.path.exists("cache/"+my_opt):
198 |     os.makedirs("cache/"+my_opt)  # makedirs also creates the parent "cache/" dir if missing
199 |
200 |
201 |
202 | # In[20]:
203 |
204 |
205 | from sklearn.model_selection import KFold, StratifiedKFold
206 | gc.collect()
207 | seed = 2006
208 | num_folds = 5
209 | kf = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed).split(train_, Y)
210 |
211 |
212 | # In[21]:
213 |
214 |
215 | epochs = 6
216 | my_opt=eval(my_opt)
217 | train_model_pred = np.zeros((train_.shape[0], classification))
218 | test_model_pred = np.zeros((test_.shape[0], classification))
219 | for i, (train_fold, val_fold) in enumerate(kf):
220 | X_train, X_valid, = train_[train_fold, :], train_[val_fold, :]
221 | y_train, y_valid = Y[train_fold], Y[val_fold]
222 |
223 | y_tra = to_categorical(y_train)
224 | y_val = to_categorical(y_valid)
225 |
226 |     # model
227 | name = str(my_opt.__name__)
228 |
229 | model = my_opt(word_seq_len, word_embedding, classification)
230 |
231 |
232 | RocAuc = RocAucEvaluation(validation_data=(X_valid, y_val), interval=1)
233 |
234 | hist = model.fit(X_train, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_valid, y_val),
235 | callbacks=[RocAuc])
236 |
237 |
238 | train_model_pred[val_fold, :] = model.predict(X_valid)
239 |
240 |
241 | # In[25]:
242 |
243 |
244 | # model
245 | # retrain on the full training set and predict the test set
246 | train_label = to_categorical(Y)
247 | name = str(my_opt.__name__)
248 |
249 | model = my_opt(word_seq_len, word_embedding, classification)
250 |
251 |
252 | RocAuc = RocAucEvaluation(validation_data=(train_, train_label), interval=1)
253 |
254 | hist = model.fit(train_, train_label, batch_size=batch_size, epochs=epochs, validation_data=(train_, train_label),
255 | callbacks=[RocAuc])
256 |
257 |
258 | test_model_pred = model.predict(test_)
259 |
260 |
261 | # In[26]:
262 |
263 |
264 | df_train_pred = pd.DataFrame(train_model_pred)
265 | df_test_pred = pd.DataFrame(test_model_pred)
266 | df_train_pred.columns = ['device_start_textcnn_pred_' + str(i) for i in range(22)]
267 | df_test_pred.columns = ['device_start_textcnn_pred_' + str(i) for i in range(22)]
268 |
269 |
270 | # In[27]:
271 |
272 |
273 | df_train_pred = pd.concat([train[['device_id']], df_train_pred], axis=1)
274 | df_test_pred = pd.concat([test[['device_id']], df_test_pred], axis=1)
275 |
276 |
277 | # In[28]:
278 |
279 |
280 | df_results = pd.concat([df_train_pred, df_test_pred])
281 | df_results.to_csv('device_start_textcnn_pred.csv', index=None)
282 |
283 |
--------------------------------------------------------------------------------
/THLUO/19.device_start_lstm_pred.py:
--------------------------------------------------------------------------------
1 | import feather
2 | import os
3 | import re
4 | import sys
5 | import gc
6 | import random
7 | import pandas as pd
8 | import numpy as np
9 | import gensim
10 | from gensim.models import Word2Vec
11 | from gensim.models.word2vec import LineSentence
12 | from scipy import stats
13 | import tensorflow as tf
14 | import keras
15 | from keras.layers import *
16 | from keras.models import *
17 | from keras.optimizers import *
18 | from keras.callbacks import *
19 | from keras.preprocessing import text, sequence
20 | from keras.utils import to_categorical
21 | from keras.engine.topology import Layer
22 | from sklearn.preprocessing import LabelEncoder
23 | from keras.utils import np_utils
24 | from keras.utils.training_utils import multi_gpu_model
25 | from sklearn.model_selection import train_test_split
26 | from sklearn.metrics import f1_score
27 | from sklearn.model_selection import KFold
28 | from sklearn.metrics import accuracy_score
29 | from sklearn.preprocessing import LabelEncoder
30 | from sklearn.metrics import f1_score
31 | import warnings
32 | warnings.filterwarnings('ignore')
33 | config = tf.ConfigProto()
34 | config.gpu_options.allow_growth = True
35 | session = tf.Session(config=config)
36 |
37 | print ('19.device_start_lstm_pred.py')
38 | # In[2]:
39 |
40 |
41 | df_doc = pd.read_csv('01.device_click_app_sorted_by_start.csv')
42 | deviceid_test=pd.read_csv('input/deviceid_test.tsv',sep='\t',names=['device_id'])
43 | deviceid_train=pd.read_csv('input/deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
44 | df_total = pd.concat([deviceid_train, deviceid_test])
45 | df_doc = df_doc.merge(df_total, on='device_id', how='left')
46 |
47 | df_wv2_all = pd.read_csv('w2c_all_emb.csv')
48 |
49 | dic_w2c_all = {}
50 | for row in df_wv2_all.values :
51 | app_id = row[0]
52 | vector = row[1:]
53 | dic_w2c_all[app_id] = vector
54 |
55 |
56 | # In[3]:
57 |
58 |
59 | df_doc['sex'] = df_doc['sex'].apply(lambda x:str(x))
60 | df_doc['age'] = df_doc['age'].apply(lambda x:str(x))
61 | def tool(x):
62 | if x=='nan':
63 | return x
64 | else:
65 | return str(int(float(x)))
66 | df_doc['sex']=df_doc['sex'].apply(tool)
67 | df_doc['age']=df_doc['age'].apply(tool)
68 | df_doc['sex_age']=df_doc['sex']+'-'+df_doc['age']
69 | df_doc = df_doc.replace({'nan':np.NaN,'nan-nan':np.NaN})
70 | train = df_doc[df_doc['sex_age'].notnull()]
71 | test = df_doc[df_doc['sex_age'].isnull()]
72 | train.reset_index(drop=True, inplace=True)
73 | test.reset_index(drop=True, inplace=True)
74 |
75 | lb = LabelEncoder()
76 | train_label = lb.fit_transform(train['sex_age'].values)
77 | train['class'] = train_label
78 |
79 |
80 | # In[5]:
81 |
82 |
83 | column_name="app_list"
84 | word_seq_len = 900
85 | victor_size = 200
86 | num_words = 35000
87 | batch_size = 64
88 | classification = 22
89 | kfold=10
90 |
91 |
92 | # In[6]:
93 |
94 |
95 | from sklearn.metrics import log_loss
96 |
97 | def get_mut_label(y_label) :
98 | results = []
99 | for ele in y_label :
100 | results.append(ele.argmax())
101 | return results
102 |
103 | class RocAucEvaluation(Callback):
104 | def __init__(self, validation_data=(), interval=1):
105 | super(Callback, self).__init__()
106 |
107 | self.interval = interval
108 | self.X_val, self.y_val = validation_data
109 |
110 | def on_epoch_end(self, epoch, logs={}):
111 | if epoch % self.interval == 0:
112 | y_pred = self.model.predict(self.X_val, verbose=0)
113 | val_y = get_mut_label(self.y_val)
114 | score = log_loss(val_y, y_pred)
115 | print("\n mlogloss - epoch: %d - score: %.6f \n" % (epoch+1, score))
116 |
117 |
118 | # In[7]:
119 |
120 |
121 | # word embeddings
122 | def w2v_pad(df_train,df_test,col, maxlen_,victor_size, num_words):
123 |
124 | tokenizer = text.Tokenizer(num_words=num_words, lower=False,filters="")
125 | tokenizer.fit_on_texts(list(df_train[col].values)+list(df_test[col].values))
126 |
127 | train_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_train[col].values), maxlen=maxlen_)
128 | test_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_test[col].values), maxlen=maxlen_)
129 |
130 | word_index = tokenizer.word_index
131 |
132 | count = 0
133 | nb_words = len(word_index)
134 | print(nb_words)
135 | all_data=pd.concat([df_train[col],df_test[col]])
136 | file_name = 'embedding/' + 'Word2Vec_start_' + col +"_"+ str(victor_size) + '.model'
137 | if not os.path.exists(file_name):
138 | model = Word2Vec([[word for word in document.split(' ')] for document in all_data.values],
139 | size=victor_size, window=5, iter=10, workers=11, seed=2018, min_count=2)
140 | model.save(file_name)
141 | else:
142 | model = Word2Vec.load(file_name)
143 | print("add word2vec finished....")
144 |
145 |
146 |
147 | embedding_word2vec_matrix = np.zeros((nb_words + 1, victor_size))
148 | for word, i in word_index.items():
149 | embedding_vector = model[word] if word in model else None
150 | if embedding_vector is not None:
151 | count += 1
152 | embedding_word2vec_matrix[i] = embedding_vector
153 | else:
154 | unk_vec = np.random.random(victor_size) * 0.5
155 | unk_vec = unk_vec - unk_vec.mean()
156 | embedding_word2vec_matrix[i] = unk_vec
157 |
158 | embedding_w2c_all = np.zeros((nb_words + 1, victor_size))
159 | for word, i in word_index.items():
160 | embedding_vector = dic_w2c_all[word]
161 | embedding_w2c_all[i] = embedding_vector
162 |
163 |
164 | #embedding_matrix = np.concatenate((embedding_word2vec_matrix,embedding_w2c_all),axis=1)
165 | embedding_matrix = embedding_word2vec_matrix
166 |
167 | return train_, test_, word_index, embedding_matrix
168 |
169 |
170 | # In[8]:
171 |
172 |
173 | train_, test_,word2idx, word_embedding = w2v_pad(train,test,column_name, word_seq_len,victor_size, num_words)
174 |
175 |
176 | # In[10]:
177 |
178 |
179 | from TextModel import *
180 |
181 |
182 | # In[13]:
183 |
184 |
185 | my_opt="get_text_lstm1"
186 | # parameters
187 | Y = train['class'].values
188 |
189 | if not os.path.exists("cache/"+my_opt):
190 |     os.makedirs("cache/"+my_opt)  # makedirs also creates the parent "cache/" dir if missing
191 |
192 |
193 |
194 | # In[14]:
195 |
196 |
197 | from sklearn.model_selection import KFold, StratifiedKFold
198 | gc.collect()
199 | seed = 2006
200 | num_folds = 5
201 | kf = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed).split(train_, Y)
202 |
203 |
204 | # In[15]:
205 |
206 |
207 | from keras import backend as K
208 |
209 | epochs = 6
210 | my_opt=eval(my_opt)
211 | train_model_pred = np.zeros((train_.shape[0], classification))
212 | test_model_pred = np.zeros((test_.shape[0], classification))
213 | for i, (train_fold, val_fold) in enumerate(kf):
214 | X_train, X_valid, = train_[train_fold, :], train_[val_fold, :]
215 | y_train, y_valid = Y[train_fold], Y[val_fold]
216 |
217 | y_tra = to_categorical(y_train)
218 | y_val = to_categorical(y_valid)
219 |
220 |     # model
221 | name = str(my_opt.__name__)
222 |
223 | model = my_opt(word_seq_len, word_embedding, classification)
224 |
225 |
226 | RocAuc = RocAucEvaluation(validation_data=(X_valid, y_val), interval=1)
227 |
228 | hist = model.fit(X_train, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_valid, y_val),
229 | callbacks=[RocAuc])
230 |
231 |
232 | train_model_pred[val_fold, :] = model.predict(X_valid)
233 |
234 |
235 |     del model
236 |     del hist
237 |     gc.collect()
238 |     K.clear_session()  # release the Keras/TF graph between folds so GPU memory is freed
239 |     tf.reset_default_graph()
240 |
241 |
242 |
243 | # In[19]:
244 |
245 |
246 | # model
247 | # retrain on the full training set and predict the test set
248 | train_label = to_categorical(Y)
249 | name = str(my_opt.__name__)
250 |
251 | model = my_opt(word_seq_len, word_embedding, classification)
252 |
253 |
254 | RocAuc = RocAucEvaluation(validation_data=(train_, train_label), interval=1)
255 |
256 | hist = model.fit(train_, train_label, batch_size=batch_size, epochs=epochs, validation_data=(train_, train_label),
257 | callbacks=[RocAuc])
258 |
259 |
260 | test_model_pred = model.predict(test_)
261 |
262 |
263 | # In[20]:
264 |
265 |
266 | df_train_pred = pd.DataFrame(train_model_pred)
267 | df_test_pred = pd.DataFrame(test_model_pred)
268 | df_train_pred.columns = ['device_start_lstm_pred_' + str(i) for i in range(22)]
269 | df_test_pred.columns = ['device_start_lstm_pred_' + str(i) for i in range(22)]
270 |
271 |
272 | # In[21]:
273 |
274 |
275 | df_train_pred = pd.concat([train[['device_id']], df_train_pred], axis=1)
276 | df_test_pred = pd.concat([test[['device_id']], df_test_pred], axis=1)
277 |
278 |
279 | # In[22]:
280 |
281 |
282 | df_results = pd.concat([df_train_pred, df_test_pred])
283 | df_results.to_csv('device_start_lstm_pred.csv', index=None)
284 |
285 |
--------------------------------------------------------------------------------
/THLUO/18.device_start_text_dpcnn_pred.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | # coding: utf-8
8 | import feather
9 | import os
10 | import re
11 | import sys
12 | import gc
13 | import random
14 | import pandas as pd
15 | import numpy as np
16 | import gensim
17 | from gensim.models import Word2Vec
18 | from gensim.models.word2vec import LineSentence
19 | from scipy import stats
20 | import tensorflow as tf
21 | import keras
22 | from keras.layers import *
23 | from keras.models import *
24 | from keras.optimizers import *
25 | from keras.callbacks import *
26 | from keras.preprocessing import text, sequence
27 | from keras.utils import to_categorical
28 | from keras.engine.topology import Layer
29 | from sklearn.preprocessing import LabelEncoder
30 | from keras.utils import np_utils
31 | from keras.utils.training_utils import multi_gpu_model
32 | from sklearn.model_selection import train_test_split
33 | from sklearn.metrics import f1_score
34 | from sklearn.model_selection import KFold
35 | from sklearn.metrics import accuracy_score
36 | from sklearn.preprocessing import LabelEncoder
37 | from sklearn.metrics import f1_score
38 | import warnings
39 | warnings.filterwarnings('ignore')
40 | config = tf.ConfigProto()
41 | config.gpu_options.allow_growth = True
42 | session = tf.Session(config=config)
43 |
44 |
45 | # In[2]:
46 | print ('18.device_start_text_dpcnn_pred.py')
47 |
48 | df_doc = pd.read_csv('01.device_click_app_sorted_by_start.csv')
49 | deviceid_test=pd.read_csv('input/deviceid_test.tsv',sep='\t',names=['device_id'])
50 | deviceid_train=pd.read_csv('input/deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
51 | df_total = pd.concat([deviceid_train, deviceid_test])
52 | df_doc = df_doc.merge(df_total, on='device_id', how='left')
53 |
54 |
55 | df_wv2_all = pd.read_csv('w2c_all_emb.csv')
56 |
57 | dic_w2c_all = {}
58 | for row in df_wv2_all.values :
59 | app_id = row[0]
60 | vector = row[1:]
61 | dic_w2c_all[app_id] = vector
62 |
63 |
64 | # In[3]:
65 |
66 |
67 | df_doc['sex'] = df_doc['sex'].apply(lambda x:str(x))
68 | df_doc['age'] = df_doc['age'].apply(lambda x:str(x))
69 | def tool(x):
70 | if x=='nan':
71 | return x
72 | else:
73 | return str(int(float(x)))
74 | df_doc['sex']=df_doc['sex'].apply(tool)
75 | df_doc['age']=df_doc['age'].apply(tool)
76 | df_doc['sex_age']=df_doc['sex']+'-'+df_doc['age']
77 | df_doc = df_doc.replace({'nan':np.NaN,'nan-nan':np.NaN})
78 | train = df_doc[df_doc['sex_age'].notnull()]
79 | test = df_doc[df_doc['sex_age'].isnull()]
80 | train.reset_index(drop=True, inplace=True)
81 | test.reset_index(drop=True, inplace=True)
82 |
83 | lb = LabelEncoder()
84 | train_label = lb.fit_transform(train['sex_age'].values)
85 | train['class'] = train_label
86 |
87 |
88 | # In[5]:
89 |
90 |
91 | column_name="app_list"
92 | word_seq_len = 900
93 | victor_size = 200
94 | num_words = 35000
95 | batch_size = 64
96 | classification = 22
97 | kfold=10
98 |
99 |
100 | # In[6]:
101 |
102 |
103 | from sklearn.metrics import log_loss
104 |
105 | def get_mut_label(y_label) :
106 | results = []
107 | for ele in y_label :
108 | results.append(ele.argmax())
109 | return results
110 |
111 | class RocAucEvaluation(Callback):
112 | def __init__(self, validation_data=(), interval=1):
113 | super(Callback, self).__init__()
114 |
115 | self.interval = interval
116 | self.X_val, self.y_val = validation_data
117 |
118 | def on_epoch_end(self, epoch, logs={}):
119 | if epoch % self.interval == 0:
120 | y_pred = self.model.predict(self.X_val, verbose=0)
121 | val_y = get_mut_label(self.y_val)
122 | score = log_loss(val_y, y_pred)
123 | print("\n mlogloss - epoch: %d - score: %.6f \n" % (epoch+1, score))
124 |
125 |
126 | # In[7]:
127 |
128 |
129 | # word embeddings
130 | def w2v_pad(df_train,df_test,col, maxlen_,victor_size, num_words):
131 |
132 | tokenizer = text.Tokenizer(num_words=num_words, lower=False,filters="")
133 | tokenizer.fit_on_texts(list(df_train[col].values)+list(df_test[col].values))
134 |
135 | train_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_train[col].values), maxlen=maxlen_)
136 | test_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_test[col].values), maxlen=maxlen_)
137 |
138 | word_index = tokenizer.word_index
139 |
140 | count = 0
141 | nb_words = len(word_index)
142 | print(nb_words)
143 | all_data=pd.concat([df_train[col],df_test[col]])
144 | file_name = 'embedding/' + 'Word2Vec_start_' + col +"_"+ str(victor_size) + '.model'
145 | if not os.path.exists(file_name):
146 | model = Word2Vec([[word for word in document.split(' ')] for document in all_data.values],
147 | size=victor_size, window=5, iter=10, workers=11, seed=2018, min_count=2)
148 | model.save(file_name)
149 | else:
150 | model = Word2Vec.load(file_name)
151 | print("add word2vec finished....")
152 |
153 |
154 |
155 | embedding_word2vec_matrix = np.zeros((nb_words + 1, victor_size))
156 | for word, i in word_index.items():
157 | embedding_vector = model[word] if word in model else None
158 | if embedding_vector is not None:
159 | count += 1
160 | embedding_word2vec_matrix[i] = embedding_vector
161 | else:
162 | unk_vec = np.random.random(victor_size) * 0.5
163 | unk_vec = unk_vec - unk_vec.mean()
164 | embedding_word2vec_matrix[i] = unk_vec
165 |
166 | embedding_w2c_all = np.zeros((nb_words + 1, victor_size))
167 | for word, i in word_index.items():
168 | embedding_vector = dic_w2c_all[word]
169 | embedding_w2c_all[i] = embedding_vector
170 |
171 |
172 | #embedding_matrix = np.concatenate((embedding_word2vec_matrix,embedding_w2c_all),axis=1)
173 | embedding_matrix = embedding_word2vec_matrix
174 |
175 | return train_, test_, word_index, embedding_matrix
176 |
177 |
178 | # In[8]:
179 |
180 |
181 | train_, test_,word2idx, word_embedding = w2v_pad(train,test,column_name, word_seq_len,victor_size, num_words)
182 |
183 |
184 | # In[10]:
185 |
186 |
187 | from TextModel import *
188 |
189 |
190 | # In[12]:
191 |
192 |
193 | my_opt="get_text_dpcnn"
194 | # parameters
195 | Y = train['class'].values
196 |
197 | if not os.path.exists("cache/"+my_opt):
198 |     os.makedirs("cache/"+my_opt)  # makedirs also creates the parent "cache/" dir if missing
199 |
200 |
201 |
202 | # In[13]:
203 |
204 |
205 | from sklearn.model_selection import KFold, StratifiedKFold
206 | gc.collect()
207 | seed = 2006
208 | num_folds = 5
209 | kf = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed).split(train_, Y)
210 |
211 |
212 | # In[14]:
213 |
214 |
215 | from keras import backend as K
216 |
217 | epochs = 6
218 | my_opt=eval(my_opt)
219 | train_model_pred = np.zeros((train_.shape[0], classification))
220 | test_model_pred = np.zeros((test_.shape[0], classification))
221 | for i, (train_fold, val_fold) in enumerate(kf):
222 | X_train, X_valid, = train_[train_fold, :], train_[val_fold, :]
223 | y_train, y_valid = Y[train_fold], Y[val_fold]
224 |
225 | y_tra = to_categorical(y_train)
226 | y_val = to_categorical(y_valid)
227 |
228 |     # model
229 | name = str(my_opt.__name__)
230 |
231 | model = my_opt(word_seq_len, word_embedding, classification)
232 |
233 |
234 | RocAuc = RocAucEvaluation(validation_data=(X_valid, y_val), interval=1)
235 |
236 | hist = model.fit(X_train, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_valid, y_val),
237 | callbacks=[RocAuc])
238 |
239 |
240 | train_model_pred[val_fold, :] = model.predict(X_valid)
241 |
242 |
243 | del model
244 | del hist
245 | gc.collect()
246 | K.clear_session()
247 | tf.reset_default_graph()
248 |
249 |
250 |
251 | # In[15]:
252 |
253 |
254 | # model
255 | # retrain on the full training set and predict the test set
256 | train_label = to_categorical(Y)
257 | name = str(my_opt.__name__)
258 |
259 | model = my_opt(word_seq_len, word_embedding, classification)
260 |
261 |
262 | RocAuc = RocAucEvaluation(validation_data=(train_, train_label), interval=1)
263 |
264 | hist = model.fit(train_, train_label, batch_size=batch_size, epochs=epochs, validation_data=(train_, train_label),
265 | callbacks=[RocAuc])
266 |
267 |
268 | test_model_pred = model.predict(test_)
269 |
270 |
271 | # In[16]:
272 |
273 |
274 | df_train_pred = pd.DataFrame(train_model_pred)
275 | df_test_pred = pd.DataFrame(test_model_pred)
276 | df_train_pred.columns = ['device_start_text_dpcnn_pred_' + str(i) for i in range(22)]
277 | df_test_pred.columns = ['device_start_text_dpcnn_pred_' + str(i) for i in range(22)]
278 |
279 |
280 | # In[17]:
281 |
282 |
283 | df_train_pred = pd.concat([train[['device_id']], df_train_pred], axis=1)
284 | df_test_pred = pd.concat([test[['device_id']], df_test_pred], axis=1)
285 |
286 |
287 | # In[18]:
288 |
289 |
290 | df_results = pd.concat([df_train_pred, df_test_pred])
291 | df_results.to_csv('device_start_text_dpcnn_pred.csv', index=None)
292 |
293 |
--------------------------------------------------------------------------------
/chizhu/stacking/nurbs_feat/xgb__nurbs_nb.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | import pandas as pd
8 | import seaborn as sns
9 | import numpy as np
10 | from tqdm import tqdm
11 | from sklearn.decomposition import LatentDirichletAllocation
12 | from sklearn.model_selection import train_test_split
13 | from sklearn.metrics import accuracy_score
14 | import lightgbm as lgb
15 | from datetime import datetime,timedelta
16 | import matplotlib.pyplot as plt
17 | import time
18 | from sklearn.feature_extraction.text import TfidfTransformer
19 | from sklearn.feature_extraction.text import CountVectorizer
20 | # get_ipython().run_line_magic('matplotlib', 'inline')  # IPython-only magic; commented out so the script also runs standalone
21 |
22 | #add
23 | import gc
24 | from sklearn import preprocessing
25 | from sklearn.feature_extraction.text import TfidfVectorizer
26 |
27 | from scipy.sparse import hstack, vstack
28 | from sklearn.model_selection import StratifiedKFold
29 | from sklearn.model_selection import cross_val_score
30 | # from skopt.space import Integer, Categorical, Real, Log10
31 | # from skopt.utils import use_named_args
32 | # from skopt import gp_minimize
33 | from gensim.models import Word2Vec, FastText
34 | import gensim
35 | import re
36 | import os
37 | path="./feature/"  ## path to the nurbs probability-feature files
38 | o_path="/dev/shm/chizhu_data/data/"  ### path to the raw data files
39 | os.listdir(path)
40 |
41 |
42 | # In[2]:
43 |
44 |
45 | sex_feat=pd.read_csv(path+"feature_sex_all.csv")
46 | age_feat=pd.read_csv(path+"feature_age_all.csv")
47 | # all_feat=pd.read_csv(path+"feature_22_all.csv")
48 | train_id=pd.read_csv(o_path+"deviceid_train.tsv",sep="\t",names=['device_id','sex','age'])
49 | test_id=pd.read_csv(o_path+"deviceid_test.tsv",sep="\t",names=['device_id'])
50 | all_id=pd.concat([train_id[['device_id']],test_id[['device_id']]])
51 | all_id.index=range(len(all_id))
52 | sex_feat['device_id']=all_id
53 | age_feat['device_id']=all_id
54 | # deepnn_feat=pd.read_csv(path+"deepnn_fix.csv")
55 | # deepnn_feat['device_id']=deepnn_feat['DeviceID']
56 | # del deepnn_feat['DeviceID']
57 |
58 |
59 | # In[3]:
60 |
61 |
62 | train=pd.merge(train_id,sex_feat,on="device_id",how="left")
63 | # train=pd.merge(train,deepnn_feat,on="device_id",how="left")
64 | test=pd.merge(test_id,sex_feat,on="device_id",how="left")
65 | # test=pd.merge(test,deepnn_feat,on="device_id",how="left")
66 |
67 |
68 | # In[4]:
69 |
70 |
71 | features = [x for x in train.columns if x not in ['device_id', 'sex',"age",]]
72 | Y = train['sex'] - 1
73 |
74 |
75 | # In[5]:
76 |
77 |
78 |
79 | import xgboost as xgb
80 | from sklearn.metrics import auc, log_loss, roc_auc_score,f1_score,recall_score,precision_score
81 | from sklearn.cross_validation import StratifiedKFold
82 |
83 | kf = StratifiedKFold(Y, n_folds=10, shuffle=True, random_state=1024)
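    | # sklearn.cross_validation is the pre-0.18 API: StratifiedKFold takes the labels as
    | # its first argument and the object itself iterates over (train_index, test_index)
    | # pairs. A minimal sketch of the equivalent on modern sklearn (>= 0.20), if porting:
    | #   from sklearn.model_selection import StratifiedKFold
    | #   kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1024).split(train[features], Y)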
84 | params={
85 | 'booster':'gbtree',
86 | "tree_method":"gpu_hist",
87 | "gpu_id":"2",
88 | 'objective': 'binary:logistic',
89 | # 'is_unbalance':'True',
90 | # 'scale_pos_weight': 1500.0/13458.0,
91 | 'eval_metric': "logloss",
92 |
93 | 'gamma':0.2,#0.2 is ok
94 | 'max_depth':6,
95 | # 'lambda':20,
96 | # "alpha":5,
97 | 'subsample':0.7,
98 | 'colsample_bytree':0.4 ,
99 | # 'min_child_weight':2.5,
100 | 'eta': 0.01,
101 | # 'learning_rate':0.01,
102 | "silent":1,
103 | 'seed':1024,
104 | 'nthread':12,
105 |
106 | }
107 | num_round = 3500
108 | early_stopping_rounds = 100
109 |
110 |
111 | # In[6]:
112 |
113 |
114 | aus = []
115 | sub1 = np.zeros((len(test), ))
116 | pred_oob1=np.zeros((len(train),))
117 | for i,(train_index,test_index) in enumerate(kf):
118 |
119 | tr_x = train[features].reindex(index=train_index, copy=False)
120 | tr_y = Y[train_index]
121 | te_x = train[features].reindex(index=test_index, copy=False)
122 | te_y = Y[test_index]
123 |
124 | # tr_y=tr_y.apply(lambda x:1 if x>0 else 0)
125 | # te_y=te_y.apply(lambda x:1 if x>0 else 0)
126 | d_tr = xgb.DMatrix(tr_x, label=tr_y)
127 | d_te = xgb.DMatrix(te_x, label=te_y)
128 | watchlist = [(d_tr,'train'),
129 | (d_te,'val')
130 | ]
131 | model = xgb.train(params, d_tr, num_boost_round=5500,
132 | evals=watchlist,verbose_eval=200,
133 | early_stopping_rounds=100)
134 | pred = model.predict(d_te,ntree_limit=model.best_iteration)
135 | pred_oob1[test_index] =pred
136 | # te_y=te_y.apply(lambda x:1 if x>0 else 0)
137 | a = log_loss(te_y, pred)
138 |
139 | sub1 += model.predict(xgb.DMatrix(test[features]),ntree_limit=model.best_iteration)/10
140 |
141 |
142 | print ("idx: ", i)
143 | print (" loss: %.5f" % a)
144 | # print " gini: %.5f" % g
145 | aus.append(a)
146 |
147 | print ("mean")
148 | print ("logloss: %s" % (sum(aus) / 10.0))
149 |
150 |
151 | # In[7]:
152 |
153 |
154 | pred_oob1 = pd.DataFrame(pred_oob1, columns=['sex2'])
155 | sub1 = pd.DataFrame(sub1, columns=['sex2'])
156 | res1=pd.concat([pred_oob1,sub1])
157 | res1['sex1'] = 1-res1['sex2']
158 |
159 |
160 | # In[8]:
161 |
162 |
163 | import gc
164 | gc.collect()
165 |
166 |
167 | # In[9]:
168 |
169 |
170 | train=pd.merge(train_id,age_feat,on="device_id",how="left")
171 | # train=pd.merge(train,deepnn_feat,on="device_id",how="left")
172 | test=pd.merge(test_id,age_feat,on="device_id",how="left")
173 | # test=pd.merge(test,deepnn_feat,on="device_id",how="left")
174 |
175 |
176 | # In[10]:
177 |
178 |
179 | ####sex1
180 | test['sex']=1
181 |
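    | # Conditional-probability trick: P(sex, age) = P(sex) * P(age | sex). 'sex' stays in
    | # the age model's feature list (it is not excluded below), so setting test['sex'] to 1
    | # here, and to 2 further down, yields P(age | sex) for each sex; the final block then
    | # multiplies these by the sex model's probabilities.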
182 |
183 | # In[11]:
184 |
185 |
186 | features = [x for x in train.columns if x not in ['device_id',"age"]]
187 | Y = train['age']
188 |
189 |
190 | # In[12]:
191 |
192 |
193 | import lightgbm as lgb
194 | import xgboost as xgb
195 | from sklearn.metrics import auc, log_loss, roc_auc_score,f1_score,recall_score,precision_score
196 | from sklearn.cross_validation import StratifiedKFold
197 |
198 | kf = StratifiedKFold(Y, n_folds=10, shuffle=True, random_state=1024)
199 | params={
200 | 'booster':'gbtree',
201 | "tree_method":"gpu_hist",
202 | "gpu_id":"2",
203 | 'objective': 'multi:softprob',
204 | # 'is_unbalance':'True',
205 | # 'scale_pos_weight': 1500.0/13458.0,
206 | 'eval_metric': "mlogloss",
207 | 'num_class':11,
208 | 'gamma':0.1,#0.2 is ok
209 | 'max_depth':6,
210 | # 'lambda':20,
211 | # "alpha":5,
212 | 'subsample':0.7,
213 | 'colsample_bytree':0.4 ,
214 | # 'min_child_weight':2.5,
215 | 'eta': 0.01,
216 | # 'learning_rate':0.01,
217 | "silent":1,
218 | 'seed':1024,
219 | 'nthread':12,
220 |
221 | }
222 | num_round = 3500
223 | early_stopping_rounds = 100
224 |
225 |
226 | # In[13]:
227 |
228 |
229 | aus = []
230 | sub2 = np.zeros((len(test),11 ))
231 | pred_oob2=np.zeros((len(train),11))
232 | models=[]
233 | iters=[]
234 | for i,(train_index,test_index) in enumerate(kf):
235 |
236 | tr_x = train[features].reindex(index=train_index, copy=False)
237 | tr_y = Y[train_index]
238 | te_x = train[features].reindex(index=test_index, copy=False)
239 | te_y = Y[test_index]
240 |
241 | # tr_y=tr_y.apply(lambda x:1 if x>0 else 0)
242 | # te_y=te_y.apply(lambda x:1 if x>0 else 0)
243 | d_tr = xgb.DMatrix(tr_x, label=tr_y)
244 | d_te = xgb.DMatrix(te_x, label=te_y)
245 | watchlist = [(d_tr,'train'),
246 | (d_te,'val')
247 | ]
248 | model = xgb.train(params, d_tr, num_boost_round=5500,
249 | evals=watchlist,verbose_eval=200,
250 | early_stopping_rounds=100)
251 | models.append(model)
252 | iters.append(model.best_iteration)
253 | pred = model.predict(d_te,ntree_limit=model.best_iteration)
254 | pred_oob2[test_index] =pred
255 | # te_y=te_y.apply(lambda x:1 if x>0 else 0)
256 | a = log_loss(te_y, pred)
257 |
258 | sub2 += model.predict(xgb.DMatrix(test[features]),ntree_limit=model.best_iteration)/10
259 |
260 |
261 | print ("idx: ", i)
262 | print (" loss: %.5f" % a)
263 | # print " gini: %.5f" % g
264 | aus.append(a)
265 |
266 | print ("mean")
267 | print ("logloss: %s" % (sum(aus) / 10.0))
268 |
269 |
270 | # In[14]:
271 |
272 |
273 | res2_1=np.vstack((pred_oob2,sub2))
274 | res2_1 = pd.DataFrame(res2_1)
275 |
276 |
277 | # In[15]:
278 |
279 |
280 | ###sex2
281 | test['sex']=2
282 | features = [x for x in train.columns if x not in ['device_id',"age"]]
283 | Y = train['age']
284 |
285 |
286 | # In[16]:
287 |
288 |
289 | aus = []
290 | sub2 = np.zeros((len(test),11 ))
291 | for model,it in zip(models,iters):
292 | sub2 += model.predict(xgb.DMatrix(test[features]),ntree_limit=it)/10
293 | res2_2=np.vstack((pred_oob2,sub2))
294 | res2_2 = pd.DataFrame(res2_2)
295 |
296 |
297 | # In[17]:
298 |
299 |
300 | res1.index=range(len(res1))
301 | res2_1.index=range(len(res2_1))
302 | res2_2.index=range(len(res2_2))
303 | final_1=res2_1.copy()
304 | final_2=res2_2.copy()
305 | for i in range(11):
306 | final_1[i]=res1['sex1']*res2_1[i]
307 | final_2[i]=res1['sex2']*res2_2[i]
308 | id_list=pd.concat([train[['device_id']],test[['device_id']]])
309 | final=id_list
310 | final.index=range(len(final))
311 | final.columns= ['DeviceID']
312 | final_pred = pd.concat([final_1,final_2],1)
313 | final=pd.concat([final,final_pred],1)
314 | final.columns = ['DeviceID', '1-0', '1-1', '1-2', '1-3', '1-4', '1-5', '1-6',
315 | '1-7','1-8', '1-9', '1-10', '2-0', '2-1', '2-2', '2-3', '2-4',
316 | '2-5', '2-6', '2-7', '2-8', '2-9', '2-10']
317 |
318 | final.to_csv('xgb_feat_nurbs_nb_10fold.csv', index=False)
319 |
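    | # Output layout: one row per device (train OOF rows first, then test), with columns
    | # '1-0' ... '2-10' holding the joint probabilities P(sex-age) in the competition's
    | # submission format.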
320 |
321 | # In[18]:
322 |
323 |
324 | test['DeviceID']=test['device_id']
325 | sub=pd.merge(test[['DeviceID']],final,on="DeviceID",how="left")
326 | sub.to_csv("xgb_nurbs_nb_10fold.csv",index=False)
327 |
328 |
--------------------------------------------------------------------------------
/nb_cz_lwl_wcm/4_get_feature_device_start_close_tfidf_1_2.py:
--------------------------------------------------------------------------------
1 | # -*- coding:utf-8 -*-
2 |
3 | import pandas as pd
4 | import scipy.sparse
5 | import numpy as np
6 | from sklearn import preprocessing
7 | from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
8 |
9 | train = pd.read_csv('Demo/deviceid_train.tsv', sep='\t', header=None)
10 | test = pd.read_csv('Demo/deviceid_test.tsv', sep='\t', header=None)
11 |
12 | data_all = pd.concat([train, test], axis=0)
13 | data_all = data_all.rename({0:'id'}, axis=1)
14 | del data_all[1],data_all[2]
15 |
16 | start_close_time = pd.read_csv('Demo/deviceid_package_start_close.tsv', sep='\t', header=None)
17 | start_close_time = start_close_time.rename({0:'id', 1:'app_name', 2:'start_time', 3:'close_time'}, axis=1)
18 |
19 | start_close_time = start_close_time.sort_values(by='start_time')
20 |
21 | start_close_time['start_time'] = (start_close_time['start_time'] / 1000).astype(int)  # convert ms to s
22 | start_close_time['close_time'] = (start_close_time['close_time'] / 1000).astype(int)
23 |
24 | unique_app_name = np.unique(start_close_time['app_name'])
25 | dict_label = dict(zip(list(unique_app_name), list(np.arange(0, len(unique_app_name), 1))))
26 | import time
27 | start_close_time['app_name'] = start_close_time['app_name'].apply(lambda row: str(dict_label[row]))
28 |
29 | del start_close_time['start_time'], start_close_time['close_time']
30 |
31 | from tqdm import tqdm, tqdm_pandas
32 | tqdm_pandas(tqdm())
33 | def dealed_row(row):
34 | app_name_list = list(row['app_name'])
35 | return ' '.join(app_name_list)
36 |
37 | data_feature = start_close_time.groupby('id').progress_apply(lambda row:dealed_row(row)).reset_index()
38 | data_feature = pd.merge(data_all, data_feature, on='id', how='left')
39 | del data_feature['id']
40 |
41 | count_vec = CountVectorizer(ngram_range=(1,3))
42 | count_csr_basic = count_vec.fit_transform(data_feature[0])
43 | tfidf_vec = TfidfVectorizer(ngram_range=(1,3))
44 | tfidf_vec_basic = tfidf_vec.fit_transform(data_feature[0])
45 |
46 | data_feature = scipy.sparse.csr_matrix(scipy.sparse.hstack([count_csr_basic, tfidf_vec_basic]))
47 |
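    | # The final feature matrix is the 1-3 gram count and tfidf representations stacked
    | # side by side, kept as a sparse CSR matrix so the linear models below can consume
    | # it without densifying.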
48 |
49 | from sklearn.cluster import KMeans
50 | from sklearn.linear_model import LogisticRegression, SGDClassifier, PassiveAggressiveClassifier, RidgeClassifier
51 | from sklearn.metrics import mean_squared_error
52 | from sklearn.naive_bayes import BernoulliNB, MultinomialNB
53 | from sklearn.svm import LinearSVC
54 | from sklearn.cross_validation import StratifiedKFold
55 |
56 | train = pd.read_csv('Demo/deviceid_train.tsv', sep='\t', header=None)
57 | test = pd.read_csv('Demo/deviceid_test.tsv', sep='\t', header=None)
58 | def get_label(row):
59 | if row[1] == 1:
60 | return row[2]
61 | else:
62 | return row[2] + 11
63 | train['label'] = train.apply(lambda row:get_label(row), axis=1)
64 | data_all = pd.concat([train, test], axis=0)
65 | data_all = data_all.rename({0:'id'}, axis=1)
66 | del data_all[1],data_all[2]
67 |
68 | train_feature = data_feature[:len(train)]
69 | score = train['label']
70 | test_feature = data_feature[len(train):]
71 | number = len(np.unique(score))
72 |
73 | # 5-fold cross-validation
74 | n_folds = 5
75 | print('preprocessing done')
76 |
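    | # Stacking scheme: each of the six linear / naive-Bayes base models below runs 5-fold
    | # CV on the sparse features; the out-of-fold class probabilities (22 columns) are
    | # saved under feature/ as meta-features. Two caveats: the printed "score" is the MSE
    | # between class ids and hard predictions, so it is only a rough progress indicator,
    | # and this old StratifiedKFold only uses random_state when shuffle=True is passed.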
77 | ########################### lr(LogisticRegression) ################################
78 | print('lr stacking')
79 | stack_train = np.zeros((len(train), number))
80 | stack_test = np.zeros((len(test), number))
81 | score_va = 0
82 |
83 | for i, (tr, va) in enumerate(StratifiedKFold(score, n_folds=n_folds, random_state=1017)):
84 | print('stack:%d/%d' % ((i + 1), n_folds))
85 | clf = LogisticRegression(random_state=1017, C=8)
86 | clf.fit(train_feature[tr], score[tr])
87 | score_va = clf.predict_proba(train_feature[va])
88 | score_te = clf.predict_proba(test_feature)
89 |     print('score ' + str(mean_squared_error(score[va], clf.predict(train_feature[va]))))
90 | stack_train[va] += score_va
91 | stack_test += score_te
92 | stack_test /= n_folds
93 | stack = np.vstack([stack_train, stack_test])
94 | df_stack = pd.DataFrame()
95 | for i in range(stack.shape[1]):
96 | df_stack['tfidf_lr_2_classfiy_{}'.format(i)] = np.around(stack[:, i], 6)
97 | df_stack.to_csv('feature/tfidf_lr_1_3_error_single_classfiy.csv', index=None, encoding='utf8')
98 | print('lr features saved\n')
99 |
100 | ########################### SGD (stochastic gradient descent) ################################
101 | print('sgd stacking')
102 | stack_train = np.zeros((len(train), number))
103 | stack_test = np.zeros((len(test), number))
104 | score_va = 0
105 |
106 | for i, (tr, va) in enumerate(StratifiedKFold(score, n_folds=n_folds, random_state=1017)):
107 | print('stack:%d/%d' % ((i + 1), n_folds))
108 | sgd = SGDClassifier(random_state=1017, loss='log')
109 | sgd.fit(train_feature[tr], score[tr])
110 | score_va = sgd.predict_proba(train_feature[va])
111 | score_te = sgd.predict_proba(test_feature)
112 |     print('score ' + str(mean_squared_error(score[va], sgd.predict(train_feature[va]))))
113 | stack_train[va] += score_va
114 | stack_test += score_te
115 | stack_test /= n_folds
116 | stack = np.vstack([stack_train, stack_test])
117 | df_stack = pd.DataFrame()
118 | for i in range(stack.shape[1]):
119 | df_stack['tfidf_2_sgd_classfiy_{}'.format(i)] = np.around(stack[:, i], 6)
120 | df_stack.to_csv('feature/tfidf_sgd_1_3_error_single_classfiy.csv', index=None, encoding='utf8')
121 | print('sgd features saved\n')
122 |
123 | ########################### pac(PassiveAggressiveClassifier) ################################
124 | print('PAC stacking')
125 | stack_train = np.zeros((len(train), number))
126 | stack_test = np.zeros((len(test), number))
127 | score_va = 0
128 |
129 | for i, (tr, va) in enumerate(StratifiedKFold(score, n_folds=n_folds, random_state=1017)):
130 | print('stack:%d/%d' % ((i + 1), n_folds))
131 | pac = PassiveAggressiveClassifier(random_state=1017)
132 | pac.fit(train_feature[tr], score[tr])
133 | score_va = pac._predict_proba_lr(train_feature[va])
134 | score_te = pac._predict_proba_lr(test_feature)
135 | print(score_va)
136 |     print('score ' + str(mean_squared_error(score[va], pac.predict(train_feature[va]))))
137 | stack_train[va] += score_va
138 | stack_test += score_te
139 | stack_test /= n_folds
140 | stack = np.vstack([stack_train, stack_test])
141 | df_stack = pd.DataFrame()
142 | for i in range(stack.shape[1]):
143 | df_stack['tfidf_pac_classfiy_{}'.format(i)] = np.around(stack[:, i], 6)
144 | df_stack.to_csv('feature/tfidf_pac_1_3_error_single_classfiy.csv', index=None, encoding='utf8')
145 | print('pac features saved\n')
146 |
147 |
148 | ########################### ridge (RidgeClassifier) ################################
149 | print('RidgeClassifier stacking')
150 | stack_train = np.zeros((len(train), number))
151 | stack_test = np.zeros((len(test), number))
152 | score_va = 0
153 |
154 | for i, (tr, va) in enumerate(StratifiedKFold(score, n_folds=n_folds, random_state=1017)):
155 | print('stack:%d/%d' % ((i + 1), n_folds))
156 | ridge = RidgeClassifier(random_state=1017)
157 | ridge.fit(train_feature[tr], score[tr])
158 | score_va = ridge._predict_proba_lr(train_feature[va])
159 | score_te = ridge._predict_proba_lr(test_feature)
160 | print(score_va)
161 |     print('score ' + str(mean_squared_error(score[va], ridge.predict(train_feature[va]))))
162 | stack_train[va] += score_va
163 | stack_test += score_te
164 | stack_test /= n_folds
165 | stack = np.vstack([stack_train, stack_test])
166 | df_stack = pd.DataFrame()
167 | for i in range(stack.shape[1]):
168 | df_stack['tfidf_ridge_classfiy_{}'.format(i)] = np.around(stack[:, i], 6)
169 | df_stack.to_csv('feature/tfidf_ridge_1_3_error_single_classfiy.csv', index=None, encoding='utf8')
170 | print('ridge features saved\n')
171 |
172 |
173 | ########################### bnb(BernoulliNB) ################################
174 | print('BernoulliNB stacking')
175 | stack_train = np.zeros((len(train), number))
176 | stack_test = np.zeros((len(test), number))
177 | score_va = 0
178 |
179 | for i, (tr, va) in enumerate(StratifiedKFold(score, n_folds=n_folds, random_state=1017)):
180 | print('stack:%d/%d' % ((i + 1), n_folds))
181 | bnb = BernoulliNB()
182 | bnb.fit(train_feature[tr], score[tr])
183 | score_va = bnb.predict_proba(train_feature[va])
184 | score_te = bnb.predict_proba(test_feature)
185 | print(score_va)
186 |     print('score ' + str(mean_squared_error(score[va], bnb.predict(train_feature[va]))))
187 | stack_train[va] += score_va
188 | stack_test += score_te
189 | stack_test /= n_folds
190 | stack = np.vstack([stack_train, stack_test])
191 | df_stack = pd.DataFrame()
192 | for i in range(stack.shape[1]):
193 | df_stack['tfidf_bnb_classfiy_{}'.format(i)] = np.around(stack[:, i], 6)
194 | df_stack.to_csv('feature/tfidf_bnb_1_3_error_single_classfiy.csv', index=None, encoding='utf8')
195 | print('BernoulliNB features saved\n')
196 |
197 | ########################### mnb(MultinomialNB) ################################
198 | print('MultinomialNB stacking')
199 | stack_train = np.zeros((len(train), number))
200 | stack_test = np.zeros((len(test), number))
201 | score_va = 0
202 |
203 | for i, (tr, va) in enumerate(StratifiedKFold(score, n_folds=n_folds, random_state=1017)):
204 | print('stack:%d/%d' % ((i + 1), n_folds))
205 | mnb = MultinomialNB()
206 | mnb.fit(train_feature[tr], score[tr])
207 | score_va = mnb.predict_proba(train_feature[va])
208 | score_te = mnb.predict_proba(test_feature)
209 | print(score_va)
210 |     print('score ' + str(mean_squared_error(score[va], mnb.predict(train_feature[va]))))
211 | stack_train[va] += score_va
212 | stack_test += score_te
213 | stack_test /= n_folds
214 | stack = np.vstack([stack_train, stack_test])
215 | df_stack = pd.DataFrame()
216 | for i in range(stack.shape[1]):
217 | df_stack['tfidf_mnb_classfiy_{}'.format(i)] = np.around(stack[:, i], 6)
218 | df_stack.to_csv('feature/tfidf_mnb_1_3_error_single_classfiy.csv', index=None, encoding='utf8')
219 | print('MultinomialNB features saved\n')
220 |
--------------------------------------------------------------------------------
/chizhu/single_model/xgb_nb.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | import pandas as pd
8 | import seaborn as sns
9 | import numpy as np
10 | from tqdm import tqdm
11 | from sklearn.decomposition import LatentDirichletAllocation
12 | from sklearn.model_selection import train_test_split
13 | from sklearn.metrics import accuracy_score
14 | import lightgbm as lgb
15 | from datetime import datetime,timedelta
16 | import matplotlib.pyplot as plt
17 | import time
18 | from sklearn.feature_extraction.text import TfidfTransformer
19 | from sklearn.feature_extraction.text import CountVectorizer
20 | # get_ipython().run_line_magic('matplotlib', 'inline')
21 |
22 | #add
23 | import gc
24 | from sklearn import preprocessing
25 | from sklearn.feature_extraction.text import TfidfVectorizer
26 |
27 | from scipy.sparse import hstack, vstack
28 | from sklearn.model_selection import StratifiedKFold
29 | from sklearn.model_selection import cross_val_score
30 | # from skopt.space import Integer, Categorical, Real, Log10
31 | # from skopt.utils import use_named_args
32 | # from skopt import gp_minimize
33 | from gensim.models import Word2Vec, FastText
34 | import gensim
35 | import re
36 | # path="/dev/shm/chizhu_data/data/"
37 |
38 |
39 | # In[2]:
40 |
41 |
42 | tfidf_feat=pd.read_csv("data/tfidf_classfiy.csv")
43 | tf2=pd.read_csv("data/tfidf_classfiy_package.csv")
44 | train_data=pd.read_csv("data/train_data.csv")
45 | test_data=pd.read_csv("data/test_data.csv")
46 |
47 |
48 | # In[4]:
49 |
50 |
51 | train_data = pd.merge(train_data,tfidf_feat,on="device_id",how="left")
52 | train = pd.merge(train_data,tf2,on="device_id",how="left")
53 | test_data = pd.merge(test_data,tfidf_feat,on="device_id",how="left")
54 | test = pd.merge(test_data,tf2,on="device_id",how="left")
55 |
56 |
57 | # In[5]:
58 |
59 |
60 | features = [x for x in train.columns if x not in ['device_id', 'sex',"age","label","app"]]
61 | Y = train['sex'] - 1
62 |
63 |
64 | # In[5]:
65 |
66 |
67 | import lightgbm as lgb
68 | import xgboost as xgb
69 | from sklearn.metrics import auc, log_loss, roc_auc_score,f1_score,recall_score,precision_score
70 | from sklearn.cross_validation import StratifiedKFold
71 |
72 | kf = StratifiedKFold(Y, n_folds=5, shuffle=True, random_state=1024)
73 | params={
74 | 'booster':'gbtree',
75 | 'objective': 'binary:logistic',
76 | # 'is_unbalance':'True',
77 | # 'scale_pos_weight': 1500.0/13458.0,
78 | 'eval_metric': "logloss",
79 |
80 | 'gamma':0.2,#0.2 is ok
81 | 'max_depth':6,
82 | # 'lambda':20,
83 | # "alpha":5,
84 | 'subsample':0.7,
85 | 'colsample_bytree':0.4 ,
86 | # 'min_child_weight':2.5,
87 | 'eta': 0.01,
88 | # 'learning_rate':0.01,
89 | "silent":1,
90 | 'seed':1024,
91 | 'nthread':12,
92 |
93 | }
94 | num_round = 3500
95 | early_stopping_rounds = 100
96 |
97 |
98 | # In[6]:
99 |
100 |
101 | aus = []
102 | sub1 = np.zeros((len(test), ))
103 | pred_oob1=np.zeros((len(train),))
104 | for i,(train_index,test_index) in enumerate(kf):
105 |
106 | tr_x = train[features].reindex(index=train_index, copy=False)
107 | tr_y = Y[train_index]
108 | te_x = train[features].reindex(index=test_index, copy=False)
109 | te_y = Y[test_index]
110 |
111 | # tr_y=tr_y.apply(lambda x:1 if x>0 else 0)
112 | # te_y=te_y.apply(lambda x:1 if x>0 else 0)
113 | d_tr = xgb.DMatrix(tr_x, label=tr_y)
114 | d_te = xgb.DMatrix(te_x, label=te_y)
115 | watchlist = [(d_tr,'train'),
116 | (d_te,'val')
117 | ]
118 | model = xgb.train(params, d_tr, num_boost_round=5500,
119 | evals=watchlist,verbose_eval=200,
120 | early_stopping_rounds=100)
121 | pred = model.predict(d_te)
122 | pred_oob1[test_index] =pred
123 | # te_y=te_y.apply(lambda x:1 if x>0 else 0)
124 | a = log_loss(te_y, pred)
125 |
126 | sub1 += model.predict(xgb.DMatrix(test[features]))/5
127 |
128 |
129 | print ("idx: ", i)
130 | print (" loss: %.5f" % a)
131 | # print " gini: %.5f" % g
132 | aus.append(a)
133 |
134 | print ("mean")
135 | print ("logloss: %s" % (sum(aus) / 5.0))
136 |
137 |
138 | # In[7]:
139 |
140 |
141 | pred_oob1 = pd.DataFrame(pred_oob1, columns=['sex2'])
142 | sub1 = pd.DataFrame(sub1, columns=['sex2'])
143 | res1=pd.concat([pred_oob1,sub1])
144 | res1['sex1'] = 1-res1['sex2']
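    | # res1 stacks the out-of-fold train predictions on top of the fold-averaged test
    | # predictions, matching the train-then-test row order of id_list built at the end.
    | # Since Y = train['sex'] - 1, the raw output is P(sex == 2), hence the 'sex2' name.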
145 |
146 |
147 | # In[8]:
148 |
149 |
150 | import gc
151 | gc.collect()
152 |
153 |
154 | # In[9]:
155 |
156 |
157 | tfidf_feat=pd.read_csv("data/tfidf_age.csv")
158 | tf2=pd.read_csv("data/pack_tfidf_age.csv")
159 | train_data = pd.merge(train_data,tfidf_feat,on="device_id",how="left")
160 | train = pd.merge(train_data,tf2,on="device_id",how="left")
161 | test_data = pd.merge(test_data,tfidf_feat,on="device_id",how="left")
162 | test = pd.merge(test_data,tf2,on="device_id",how="left")
163 |
164 |
165 | # In[10]:
166 |
167 |
168 | ####sex1
169 | test['sex']=1  # condition the age model on sex = 1
170 |
171 |
172 | # In[11]:
173 |
174 |
175 | features = [x for x in train.columns if x not in ['device_id',"age","label","app"]]
176 | Y = train['age']
177 |
178 |
179 | # In[12]:
180 |
181 |
182 | import lightgbm as lgb
183 | import xgboost as xgb
184 | from sklearn.metrics import auc, log_loss, roc_auc_score,f1_score,recall_score,precision_score
185 | from sklearn.cross_validation import StratifiedKFold
186 |
187 | kf = StratifiedKFold(Y, n_folds=5, shuffle=True, random_state=1024)
188 | params={
189 | 'booster':'gbtree',
190 | 'objective': 'multi:softprob',
191 | # 'is_unbalance':'True',
192 | # 'scale_pos_weight': 1500.0/13458.0,
193 | 'eval_metric': "mlogloss",
194 | 'num_class':11,
195 | 'gamma':0.1,#0.2 is ok
196 | 'max_depth':6,
197 | # 'lambda':20,
198 | # "alpha":5,
199 | 'subsample':0.7,
200 | 'colsample_bytree':0.4 ,
201 | # 'min_child_weight':2.5,
202 | 'eta': 0.01,
203 | # 'learning_rate':0.01,
204 | "silent":1,
205 | 'seed':1024,
206 | 'nthread':12,
207 |
208 | }
209 | num_round = 3500
210 | early_stopping_rounds = 100
211 |
212 |
213 | # In[13]:
214 |
215 |
216 | aus = []
217 | sub2 = np.zeros((len(test),11 ))
218 | pred_oob2=np.zeros((len(train),11))
219 | for i,(train_index,test_index) in enumerate(kf):
220 |
221 | tr_x = train[features].reindex(index=train_index, copy=False)
222 | tr_y = Y[train_index]
223 | te_x = train[features].reindex(index=test_index, copy=False)
224 | te_y = Y[test_index]
225 |
226 | # tr_y=tr_y.apply(lambda x:1 if x>0 else 0)
227 | # te_y=te_y.apply(lambda x:1 if x>0 else 0)
228 | d_tr = xgb.DMatrix(tr_x, label=tr_y)
229 | d_te = xgb.DMatrix(te_x, label=te_y)
230 | watchlist = [(d_tr,'train'),
231 | (d_te,'val')
232 | ]
233 | model = xgb.train(params, d_tr, num_boost_round=5500,
234 | evals=watchlist,verbose_eval=200,
235 | early_stopping_rounds=100)
236 | pred = model.predict(d_te)
237 | pred_oob2[test_index] =pred
238 | # te_y=te_y.apply(lambda x:1 if x>0 else 0)
239 | a = log_loss(te_y, pred)
240 |
241 | sub2 += model.predict(xgb.DMatrix(test[features]))/5
242 |
243 |
244 | print ("idx: ", i)
245 | print (" loss: %.5f" % a)
246 | # print " gini: %.5f" % g
247 | aus.append(a)
248 |
249 | print ("mean")
250 | print ("logloss: %s" % (sum(aus) / 5.0))
251 |
252 |
253 | # In[14]:
254 |
255 |
256 | res2_1=np.vstack((pred_oob2,sub2))
257 | res2_1 = pd.DataFrame(res2_1)
258 |
259 |
260 | # In[ ]:
261 |
262 |
263 | ###sex2
264 | test['sex']=2  # condition the age model on sex = 2
265 | features = [x for x in train.columns if x not in ['device_id',"age","label","app"]]
266 | Y = train['age']
267 |
268 |
269 | # In[ ]:
270 |
271 |
272 | aus = []
273 | sub2 = np.zeros((len(test),11 ))
274 | pred_oob2=np.zeros((len(train),11))
275 | for i,(train_index,test_index) in enumerate(kf):
276 |
277 | tr_x = train[features].reindex(index=train_index, copy=False)
278 | tr_y = Y[train_index]
279 | te_x = train[features].reindex(index=test_index, copy=False)
280 | te_y = Y[test_index]
281 |
282 | # tr_y=tr_y.apply(lambda x:1 if x>0 else 0)
283 | # te_y=te_y.apply(lambda x:1 if x>0 else 0)
284 | d_tr = xgb.DMatrix(tr_x, label=tr_y)
285 | d_te = xgb.DMatrix(te_x, label=te_y)
286 | watchlist = [(d_tr,'train'),
287 | (d_te,'val')
288 | ]
289 | model = xgb.train(params, d_tr, num_boost_round=5500,
290 | evals=watchlist,verbose_eval=200,
291 | early_stopping_rounds=100)
292 | pred = model.predict(d_te)
293 | pred_oob2[test_index] =pred
294 | # te_y=te_y.apply(lambda x:1 if x>0 else 0)
295 | a = log_loss(te_y, pred)
296 |
297 | sub2 += model.predict(xgb.DMatrix(test[features]))/5
298 |
299 |
300 | print ("idx: ", i)
301 | print (" loss: %.5f" % a)
302 | # print " gini: %.5f" % g
303 | aus.append(a)
304 |
305 | print ("mean")
306 | print ("logloss: %s" % (sum(aus) / 5.0))
307 |
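    | # Unlike the 10-fold nurbs script, which reuses the trained fold models to score the
    | # test set with sex = 2, this script reruns the full 5-fold CV; the training rows keep
    | # their true sex, so only the test-side predictions change in any meaningful way.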
308 |
309 | # In[ ]:
310 |
311 |
312 | res2_2=np.vstack((pred_oob2,sub2))
313 | res2_2 = pd.DataFrame(res2_2)
314 |
315 |
316 | # In[ ]:
317 |
318 |
319 | res1.index=range(len(res1))
320 | res2_1.index=range(len(res2_1))
321 | res2_2.index=range(len(res2_2))
322 | final_1=res2_1.copy()
323 | final_2=res2_2.copy()
324 | for i in range(11):
325 | final_1[i]=res1['sex1']*res2_1[i]
326 | final_2[i]=res1['sex2']*res2_2[i]
327 | id_list=pd.concat([train[['device_id']],test[['device_id']]])
328 | final=id_list
329 | final.index=range(len(final))
330 | final.columns= ['DeviceID']
331 | final_pred = pd.concat([final_1,final_2],1)
332 | final=pd.concat([final,final_pred],1)
333 | final.columns = ['DeviceID', '1-0', '1-1', '1-2', '1-3', '1-4', '1-5', '1-6',
334 | '1-7','1-8', '1-9', '1-10', '2-0', '2-1', '2-2', '2-3', '2-4',
335 | '2-5', '2-6', '2-7', '2-8', '2-9', '2-10']
336 |
337 | final.to_csv('submit/xgb_feat_chizhu_nb.csv', index=False)
338 |
339 |
340 | # In[ ]:
341 |
342 |
343 | test['DeviceID']=test['device_id']
344 | sub=pd.merge(test[['DeviceID']],final,on="DeviceID",how="left")
345 | sub.to_csv("submit/xgb_chizhu_nb.csv",index=False)
346 |
347 |
--------------------------------------------------------------------------------