├── .idea
│   └── vcs.xml
├── README.md
├── gbdt_run.sh
├── models
│   ├── lgb.py
│   ├── run_tf_nfm.py
│   ├── run_tf_nfm_by_part.py
│   ├── tf_NFM.py
│   └── xgb.py
├── run.sh
├── src
│   └── converters
│       ├── combine_and_shuffle.py
│       ├── ensemble.py
│       ├── lgb_analyze_importance.py
│       ├── make_submission.py
│       ├── norm_ensemble.py
│       ├── pre-csv.py
│       ├── pre-dnn.py
│       └── pre-gbdt.py
└── utils
    ├── __init__.py
    ├── args.py
    ├── args2.py
    ├── args3.py
    ├── donal_args.py
    ├── nzc_args.py
    └── tencent_data_func.py

/.idea/vcs.xml:
--------------------------------------------------------------------------------
1 | 2 | 3 | 4 | 5 | 6 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Runtime environment: Ubuntu 14.04, at least 256 GB of RAM, at least 1 TB of disk, an NVIDIA 1080 Ti, Python 2.7
Python packages: sklearn, lightgbm, tensorflow-gpu, xgboost

File descriptions:
The src folder contains the converters folder.
The converters folder contains:
(1) pre-csv.py: joins the raw data files into combined tables
(2) combine_and_shuffle.py: concatenates the preliminary-round and final-round data and shuffles them randomly
(3) pre-dnn.py: generates the DNN feature data files
(4) pre-gbdt.py: generates the GBDT feature data files
(5) lgb_analyze_importance.py: analyzes feature importance with LightGBM
(6) ensemble.py: takes a weighted combination of several result files
(7) norm_ensemble.py: normalizes several result files and then takes a weighted combination

The data folder is used to store the data.

The models folder stores the models and contains:
(1) lgb.py: trains the LightGBM model
(2) tf_NFM.py: defines the nffm model
(3) run_tf_nfm.py: trains the nffm model with all data loaded into memory at once
(4) run_tf_nfm_by_part.py: trains the nffm model while loading the data into memory in parts
(5) xgb.py: trains the XGBoost model

The utils folder stores parameter files and helper functions:
(1) args.py, args2.py, args3.py, donal_args.py and nzc_args.py are all parameter files, used to train the model with different parameter sets
(2) tencent_data_func.py defines a large number of helper functions for feature engineering and other data operations; after many iterations it has become somewhat messy

gbdt_run.sh generates the GBDT feature data files and trains the LightGBM model. Note that generating the GBDT feature files takes a very long time:
I used 5 servers for nearly 4 days to finish generating all of them, while the LightGBM model improved my score very little; it scores about 0.760
and contributed only about 0.0001 to my final score. If you are short on machines, I do not recommend generating the GBDT features. They also occupy roughly 500 GB of disk space.

run.sh generates the DNN feature data files and trains the nffm model. Note that this script runs the DNN model only once. In practice the DNN model usually converges within one epoch
and is very sensitive to the data distribution: with the same parameters, shuffling the sample order, training several times and averaging the results gains about 0.004-0.005. A single run of my DNN model scores
0.770-0.771; running it 5-10 times and averaging gets close to 0.774-0.775 on the B leaderboard. Adding changes such as different parameters and up-weighting the positive samples, then averaging several runs,
can reach above 0.775.

How to run:
Put the raw data into the data folder and run run.sh.
Note that the preliminary-round data files must carry the "chusai_" prefix. Generating the vector features (e.g. interest1) on a single server can take about two days.
Training the DNN model takes roughly 2.5-3 hours.
Training the LightGBM model takes roughly 10 hours.
Training XGBoost takes a bit over 30 hours.
If the XGBoost loss function is set to pairwise ranking, it can take more than 2 days.


Feature description:
The DNN features fall into 6 categories:
(1) Raw one-hot features, e.g. aid, age, gender.
(2) Vector features, e.g. interest1, interest2, topic1, kw1.
(3) Vector-length statistics: the lengths of interest1, interest2 and interest5.
(4) uid statistics: how many times a uid appears, how many of its samples are positive, and the co-occurrence counts and positive-sample counts of uid combined with the ad features.
(5) uid sequence features (see the sketch after this section). For example, if uid=1 appears 5 times in total with label sequence [-1,1,-1,-1,-1], then
    on its 1st occurrence the feature is []
    on its 2nd occurrence the feature is [-1]
    on its 3rd occurrence the feature is [-1,1]
    on its 4th occurrence the feature is [-1,1,-1]
    on its 5th occurrence the feature is [-1,1,-1,-1]
(6) Combination features: age with aid, gender with aid, interest1 with aid, interest2 with aid.
Note that the above is only the overall taxonomy. In practice I built a large pool of features, trained on a smaller feature subset each time and blended the results at the end,
but the gain did not seem significant, so only the feature engineering behind the best single score is described here.
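To make feature category (5) concrete, below is a minimal sketch (not the repository's actual code) of how such a per-uid label-history prefix could be built from time-ordered training rows with pandas. The function name build_uid_sequence_feature, the uid/label column names and the max_len cap are illustrative assumptions; the real implementation lives in utils/tencent_data_func.py and may differ in details such as encoding and truncation.

import pandas as pd

def build_uid_sequence_feature(df, max_len=5):
    # df: time-ordered rows with columns 'uid' and 'label' (label in {-1, 1}).
    # Returns one string per row holding the labels of that uid's *previous*
    # occurrences, e.g. "-1,1,-1" (empty string on the first occurrence).
    history = {}       # uid -> labels seen so far
    seq_feature = []
    for uid, label in zip(df['uid'], df['label']):
        past = history.setdefault(uid, [])
        seq_feature.append(','.join(str(x) for x in past[-max_len:]))
        past.append(label)   # only earlier rows contribute to later features
    return seq_feature

# Reproduces the README example above:
# df = pd.DataFrame({'uid': [1] * 5, 'label': [-1, 1, -1, -1, -1]})
# build_uid_sequence_feature(df)
# -> ['', '-1', '-1,1', '-1,1,-1', '-1,1,-1,-1']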
The GBDT features mainly include:
(1) Conversion rates of the raw features; for a vector feature, the maximum conversion rate over its values is used as the representation.
(2) Conversion rates of feature combinations, both combinations within the user features and combinations of user features with ad features.
(3) Occurrence counts of the raw features; for a vector feature, the maximum count over its values is used as the representation.
(4) Occurrence counts of feature combinations, both within the user features and between user and ad features.
(5) Occurrence counts and conversion rates of uid, and of uid combined with the ad features.
Note that the GBDT features are first screened on a small dataset and ranked by importance, and then the top-200 and top-400 feature sets are used for training on the full data.
With the top 200 features the online score is about 0.759.
With the top 400 features the data has to be split into two parts, trained separately and averaged, giving a score of about 0.761.

Model description:
My best model is last year's champion's nffm model, with some filtering conditions on the feature crosses, e.g. features inside the ad are not crossed with each other. I also tried several other models,
but none of them beat nffm as a single model. When blending, I did mix in results trained with other DNN models, but since there was no controlled comparison (the ensemble also included
results from nffm trained with different parameters, as well as results from nffm with the positive samples up-weighted), I cannot say how much the blend actually helped; it may even have hurt.
--------------------------------------------------------------------------------
/gbdt_run.sh:
--------------------------------------------------------------------------------
1 | python src/converters/pre-gbdt.py 1
2 | python src/converters/pre-gbdt.py 2
3 | python src/converters/pre-gbdt.py 3
4 | python src/converters/pre-gbdt.py 4
5 | python src/converters/pre-gbdt.py 5
6 | python src/converters/pre-gbdt.py 6
7 | python src/converters/pre-gbdt.py 7
8 | python src/converters/pre-gbdt.py 8
9 | python src/converters/pre-gbdt.py 9
10 | python src/converters/pre-gbdt.py 10
11 | python src/converters/pre-gbdt.py 11
12 | python src/converters/pre-gbdt.py 12
13 | python src/converters/pre-gbdt.py 13
14 |
15 | python src/converters/lgb_analyze_importance.py
16 |
17 | python models/lgb.py
18 |
19 |
20 |
--------------------------------------------------------------------------------
/models/lgb.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | import pandas as pd
3 | import lightgbm as lgb
4 | from sklearn.model_selection import train_test_split
5 | from sklearn.feature_extraction.text import CountVectorizer
6 | from sklearn.preprocessing import OneHotEncoder,LabelEncoder
7 | from scipy import sparse
8 | import os
9 | import numpy as np
10 | import sys
11 | from time import time
12 |
13 | sys.path.append('../')
14 | from utils.donal_args import args
15 |
16 | def get_top_count_feature(path, num=50):
17 |     col_names = ['label']
18 |     f = open(path, 'rb')
19 |     f.readline()
20 |     i = 0
21 |     for line in f:
22 |         i += 1
23 |         if i > num:
24 |             break
25 |         col_names.append(line.strip().split(',')[0])
26 |     return col_names
27 |
28 | def read_data_from_csv(path, beg, end):
29 |     data_arr = []
30 |     f = open(path,'rb')
31 |     num = 0
32 |     for line in f:
33 |         num += 1
34 |         if num <= beg:
35 |             continue
36 |         if num > end:
37 |             break
38 |         data_arr.append(float(line.strip()))
39 |     f.close()
40 |     return np.array(data_arr).reshape([-1,1])
41 |
42 | def read_data_from_bin(path, col_name, beg, end, part):
43 |     data_arr = []
44 |     for i in range(part):
45 |         res = np.fromfile(path+col_name+'_' + str(i)+'.bin', dtype=np.int).astype(np.float).reshape([-1,1])
46 |         data_arr.append(res)
47 |     return np.concatenate(data_arr,axis=0)[beg:end]
48 |
49 | def read_batch_data_from_csv(path, col_names, beg = 0, end = 7000000):
50 |     data_arr = []
51 |     num = 0
52 |     for col_name in col_names:
53 |         num += 1
54 |         t = read_data_from_csv(path+col_name+'.csv', beg, end)
55 |         print num, col_name, t.shape
56 |         data_arr.append(t)
57 |     return np.concatenate(data_arr,axis=1)
58 |
59 | def read_batch_data_from_bin(path, col_names, beg = 0, end = 7000000, part=9):
60 |     data_arr = []
61 |     num = 0
62 |     for col_name in col_names:
63 |         num += 1
64 |         t = read_data_from_bin(path, col_name, beg, end, part)
65 |         print num, col_name, t.shape
66 |         data_arr.append(t)
67 |     return np.concatenate(data_arr,axis=1)
68 |
69 |
70 | col_names = get_top_count_feature(args.gbdt_data_path+'analyze/importance_rank.csv',400)
71 | uid_count_feature = ['uid_uid_pos_times_count_5_fold_all','uid_adCategoryId_pos_times_count_5_fold_all',
72 |                      'uid_advertiserId_pos_times_count_5_fold_all','uid_campaignId_pos_times_count_5_fold_all',
73 |                      'uid_creativeId_pos_times_count_5_fold_all','uid_creativeSize_pos_times_count_5_fold_all',
74 |                      'uid_productId_pos_times_count_5_fold_all','uid_productType_pos_times_count_5_fold_all',
75 |
'uid_adCategoryId_times_count','uid_advertiserId_times_count', 76 | 'uid_campaignId_times_count','uid_creativeId_times_count', 77 | 'uid_creativeSize_times_count','uid_productType_times_count', 78 | 'uid_productId_times_count'] 79 | uid_count_feature_with_chusai = ['log_uid_uid_pos_times_count_5_fold_with_chusai','log_uid_adCategoryId_pos_times_count_5_fold_with_chusai', 80 | 'log_uid_advertiserId_pos_times_count_5_fold_with_chusai','log_uid_campaignId_pos_times_count_5_fold_with_chusai', 81 | 'log_uid_creativeId_pos_times_count_5_fold_with_chusai','log_uid_creativeSize_pos_times_count_5_fold_with_chusai', 82 | 'log_uid_productId_pos_times_count_5_fold_with_chusai','log_uid_productType_pos_times_count_5_fold_with_chusai', 83 | 'log_uid_adCategoryId_times_count_with_chusai','log_uid_advertiserId_times_count_with_chusai', 84 | 'log_uid_campaignId_times_count_with_chusai','log_uid_creativeId_times_count_with_chusai', 85 | 'log_uid_creativeSize_times_count_with_chusai','log_uid_productType_times_count_with_chusai', 86 | 'log_uid_productId_times_count_with_chusai'] 87 | beg = 0 88 | end = 44000000 89 | train_train_bin_datas = read_batch_data_from_bin(args.dnn_data_path+'train/train-', uid_count_feature_with_chusai, beg, end, part=9) 90 | beg = 44000000 91 | end = 46000000 92 | train_test_bin_datas = read_batch_data_from_bin(args.dnn_data_path+'train/train-', uid_count_feature_with_chusai, beg, end, part=9) 93 | beg = 0 94 | end = 34000000 95 | test2_bin_datas = read_batch_data_from_bin(args.dnn_data_path+'test2/test-', uid_count_feature_with_chusai, beg, end, part=3) 96 | 97 | beg = 0 98 | end = 44000000 99 | train_train_labels = read_batch_data_from_csv(args.gbdt_data_path+'train/', col_names[0:1], beg, end) 100 | train_train_datas = read_batch_data_from_csv(args.gbdt_data_path+'train/', col_names[1:], beg, end) 101 | beg = 44000000 102 | end = 46000000 103 | train_test_labels = read_batch_data_from_csv(args.gbdt_data_path+'train/', col_names[0:1], beg, end) 104 | train_test_datas = read_batch_data_from_csv(args.gbdt_data_path+'train/', col_names[1:], beg, end) 105 | 106 | # beg = 0 107 | # end = 34000000 108 | # test1_datas = read_batch_data_from_csv(args.gbdt_data_path+'test1/', col_names[1:], beg, end) 109 | 110 | beg = 0 111 | end = 34000000 112 | test2_datas = read_batch_data_from_csv(args.gbdt_data_path+'test2/', col_names[1:], beg, end) 113 | 114 | train_train_datas = np.concatenate([train_train_datas,train_train_bin_datas],axis=1) 115 | train_test_datas = np.concatenate([train_test_datas,train_test_bin_datas], axis=1) 116 | test2_datas = np.concatenate([test2_datas, test2_bin_datas], axis=1) 117 | 118 | print train_train_datas.shape 119 | 120 | clf = lgb.LGBMClassifier( 121 | boosting_type='gbdt', num_leaves=63, reg_alpha=0.0, reg_lambda=1, 122 | max_depth=-1, n_estimators=1200, objective='binary', 123 | subsample=0.8, colsample_bytree=0.8, subsample_freq=1, feature_fraction=0.8, 124 | learning_rate=0.05, min_child_weight=50 125 | ) 126 | clf.fit(train_train_datas, train_train_labels, eval_set=[(train_train_datas, train_train_labels),(train_test_datas, train_test_labels)], eval_metric='auc',early_stopping_rounds=100) 127 | 128 | # test1_res = clf.predict_proba(test1_datas)[:,1] 129 | # f = open(args.gbdt_data_path+'result/test1/lgb_test1_res.csv','wb') 130 | # for r in test1_res: 131 | # f.write('%.6f' % (r)+'\n') 132 | # f.close() 133 | 134 | test2_res = clf.predict_proba(test2_datas)[:,1] 135 | f = open(args.gbdt_data_path+'result/test2/lgb_test2_res_2.csv','wb') 136 | for r in 
test2_res: 137 | f.write('%.6f' % (r)+'\n') 138 | f.close() 139 | -------------------------------------------------------------------------------- /models/run_tf_nfm.py: -------------------------------------------------------------------------------- 1 | # -*- coding:utf-8 -*- 2 | 3 | import os 4 | import numpy as np 5 | from sklearn.base import BaseEstimator, TransformerMixin 6 | from sklearn.metrics import roc_auc_score 7 | from time import time 8 | import sys 9 | import itertools 10 | 11 | 12 | from tf_NFM import NFM 13 | 14 | test_out_name = sys.argv[1] 15 | args_name = sys.argv[2] 16 | assert args_name in ['donal_args','args','args2','args3','nzc_args'] 17 | 18 | sys.path.append('../') 19 | from utils.tencent_data_func import * 20 | if args_name == 'donal_args': 21 | from utils.donal_args import args 22 | elif args_name == 'args': 23 | from utils.args import args 24 | elif args_name == 'args2': 25 | from utils.args2 import args 26 | elif args_name == 'args3': 27 | from utils.args3 import args 28 | elif args_name == 'nzc_args': 29 | from utils.nzc_args import args 30 | 31 | 32 | 33 | #param 34 | field_sizes = [len(args.static_features), len(args.dynamic_features)] 35 | dynamic_max_len_dict = args.dynamic_features_max_len_dict 36 | learning_rate = args.lr 37 | epochs = args.epochs 38 | batch_size = args.batch_size 39 | 40 | train_parts = [0,1,2,3,4,5,6,7,8] 41 | test_parts = [7,8] 42 | test1_parts = [0,1,2] 43 | test2_parts = [0,1,2] 44 | 45 | y, static_index, dynamic_index, dynamic_lengths, extern_lr_index = \ 46 | load_concatenate_tencent_data_to_dict(args.dnn_data_path+'train/train-', train_parts,args.static_features, 47 | args.dynamic_features, is_csv=False) 48 | 49 | # for key in args.static_features: 50 | # print key, static_index[key].shape 51 | # 52 | # for key in args.dynamic_features: 53 | # print key, dynamic_index[key].shape 54 | 55 | # valid_y, valid_static_index, valid_dynamic_index, valid_dynamic_lengths, valid_exctern_lr_index = \ 56 | # load_concatenate_tencent_data_to_dict(args.dnn_data_path+'train/train-', test_parts) 57 | 58 | dynamic_total_size_dict = load_dynamic_total_size_dict(args.dnn_data_path+'dict/',args.dynamic_features) 59 | static_total_size_dict = load_static_total_size_dict(args.dnn_data_path+'dict/', args.static_features) 60 | 61 | 62 | exclusive_cols = args.exclusive_cols 63 | 64 | # test1_y, test1_static_index, test1_dynamic_index, test1_dynamic_lengths, test1_extern_lr_index = \ 65 | # load_concatenate_tencent_data_to_dict(args.dnn_data_path+'test1/test-', test1_parts, is_csv=False) 66 | 67 | test2_y, test2_static_index, test2_dynamic_index, test2_dynamic_lengths, test2_extern_lr_index = \ 68 | load_concatenate_tencent_data_to_dict(args.dnn_data_path+'test2/test-', test2_parts,args.static_features, 69 | args.dynamic_features) 70 | 71 | 72 | # y1_pred = np.array([0.0] * len(test1_y)) 73 | y2_pred = np.array([0.0] * len(test2_y)) 74 | 75 | for i in range(1): 76 | dfm = NFM(field_sizes=field_sizes, static_total_size_dict=static_total_size_dict, dynamic_total_size_dict=dynamic_total_size_dict, 77 | dynamic_max_len_dict=dynamic_max_len_dict, exclusive_cols=args.exclusive_cols,learning_rate=learning_rate, 78 | deep_layers=[64, 64], epoch=1, batch_size=batch_size,optimizer_type='adam') 79 | # dfm.fit(static_index, dynamic_index, dynamic_lengths, y, 80 | # valid_static_index, valid_dynamic_index, valid_dynamic_lengths, valid_y, combine=False) 81 | dfm.fit(static_index, dynamic_index, dynamic_lengths, y, show_eval=False, is_shuffle=True) 82 | # y1_pred += 
dfm.predict(test1_static_index, test1_dynamic_index, test1_dynamic_lengths) 83 | y2_pred += dfm.predict(test2_static_index, test2_dynamic_index, test2_dynamic_lengths) 84 | 85 | # y1_pred /= 3.0 86 | # y2_pred /= 3.0 87 | 88 | # f = open(args.dnn_data_path+'result/test1/'+test_out_name+'.csv', 'wb') 89 | # for y in y1_pred: 90 | # f.write('%.6f' % (y) + '\n') 91 | # f.close() 92 | 93 | f = open(args.dnn_data_path+'result/test2/'+test_out_name+'.csv', 'wb') 94 | for y in y2_pred: 95 | f.write('%.6f' % (y) + '\n') 96 | f.close() 97 | 98 | 99 | 100 | 101 | 102 | 103 | -------------------------------------------------------------------------------- /models/run_tf_nfm_by_part.py: -------------------------------------------------------------------------------- 1 | # -*- coding:utf-8 -*- 2 | 3 | import os 4 | import numpy as np 5 | from sklearn.base import BaseEstimator, TransformerMixin 6 | from sklearn.metrics import roc_auc_score 7 | from time import time 8 | import sys 9 | import itertools 10 | 11 | 12 | from tf_NFM import NFM 13 | 14 | test_out_name = sys.argv[1] 15 | args_name = sys.argv[2] 16 | assert args_name in ['donal_args','args','args2','args3','nzc_args'] 17 | 18 | sys.path.append('../') 19 | from utils.tencent_data_func import * 20 | if args_name == 'donal_args': 21 | from utils.donal_args import args 22 | elif args_name == 'args': 23 | from utils.args import args 24 | elif args_name == 'args2': 25 | from utils.args2 import args 26 | elif args_name == 'args3': 27 | from utils.args3 import args 28 | elif args_name == 'nzc_args': 29 | from utils.nzc_args import args 30 | 31 | 32 | 33 | #param 34 | field_sizes = [len(args.static_features), len(args.dynamic_features)] 35 | dynamic_max_len_dict = args.dynamic_features_max_len_dict 36 | learning_rate = args.lr 37 | epochs = args.epochs 38 | batch_size = args.batch_size 39 | 40 | 41 | dnn_data_path = args.root_data_path + 'nzc_dnn/' 42 | # for key in args.static_features: 43 | # print key, static_index[key].shape 44 | # 45 | # for key in args.dynamic_features: 46 | # print key, dynamic_index[key].shape 47 | 48 | # valid_y, valid_static_index, valid_dynamic_index, valid_dynamic_lengths, valid_exctern_lr_index = \ 49 | # load_concatenate_tencent_data_to_dict(args.dnn_data_path+'train/train-', test_parts) 50 | 51 | dynamic_total_size_dict = load_dynamic_total_size_dict(dnn_data_path+'dict_with_chusai/', args.dynamic_features) 52 | static_total_size_dict = load_static_total_size_dict(dnn_data_path+'dict_with_chusai/', args.static_features) 53 | print field_sizes, len(static_total_size_dict), len(dynamic_total_size_dict) 54 | print args.static_features 55 | print args.dynamic_features 56 | exclusive_cols = args.exclusive_cols 57 | 58 | y2_pred = np.array([0.0] * 11727304) 59 | 60 | # valid_parts = [8] 61 | # valid_y, valid_static_index, valid_dynamic_index, valid_dynamic_lengths, valid_exctern_lr_index = \ 62 | # load_concatenate_tencent_data_from_npy_to_dict(dnn_data_path+'train_npz/train-', valid_parts, 63 | # args.static_features, args.dynamic_features) 64 | 65 | for i in range(1): 66 | dfm = NFM(field_sizes=field_sizes, static_total_size_dict=static_total_size_dict, dynamic_total_size_dict=dynamic_total_size_dict, 67 | dynamic_max_len_dict=dynamic_max_len_dict, exclusive_cols=exclusive_cols,learning_rate=learning_rate, epoch=1, batch_size=batch_size,optimizer_type='adam') 68 | # dfm.fit(static_index, dynamic_index, dynamic_lengths, y, 69 | # valid_static_index, valid_dynamic_index, valid_dynamic_lengths, valid_y, combine=False) 70 | 
train_parts = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] 71 | np.random.shuffle(train_parts) 72 | for p in train_parts: 73 | tmp_parts = [p] 74 | y, static_index, dynamic_index, dynamic_lengths, extern_lr_index = \ 75 | load_concatenate_tencent_data_to_dict(dnn_data_path + 'train_with_chusai/train-', tmp_parts, 76 | args.static_features, args.dynamic_features,is_csv=False) 77 | dfm.fit(static_index, dynamic_index, dynamic_lengths, y, 78 | show_eval=False, is_shuffle=True) 79 | test2_parts = [0, 1, 2] 80 | tmp_arr = [] 81 | for p in test2_parts: 82 | tmp_parts = [p] 83 | test2_y, test2_static_index, test2_dynamic_index, test2_dynamic_lengths, test2_extern_lr_index = \ 84 | load_concatenate_tencent_data_to_dict(dnn_data_path + 'test2_with_chusai/test-', tmp_parts, 85 | args.static_features, args.dynamic_features,is_csv= False) 86 | tmp_arr.append(dfm.predict(test2_static_index, test2_dynamic_index, test2_dynamic_lengths)) 87 | y2_pred += np.concatenate(tmp_arr,axis=0) 88 | 89 | # y2_pred /= 3.0 90 | 91 | f = open(dnn_data_path+'result/test2/'+test_out_name+'.csv', 'wb') 92 | for y in y2_pred: 93 | f.write('%.6f' % (y) + '\n') 94 | f.close() 95 | 96 | 97 | 98 | 99 | 100 | 101 | -------------------------------------------------------------------------------- /models/tf_NFM.py: -------------------------------------------------------------------------------- 1 | """ 2 | Created on Dec 10, 2017 3 | @author: jachin,Nie 4 | 5 | A tf implementation of NFM 6 | 7 | Reference: 8 | [1] Neural Factorization Machines for Sparse Predictive Analytics 9 | Xiangnan He,School of Computing,National University of Singapore,Singapore 117417,dcshex@nus.edu.sg 10 | Tat-Seng Chua,School of Computing,National University of Singapore,Singapore 117417,dcscts@nus.edu.sg 11 | """ 12 | 13 | import numpy as np 14 | import tensorflow as tf 15 | from sklearn.base import BaseEstimator, TransformerMixin 16 | from sklearn.metrics import roc_auc_score 17 | from time import time 18 | from tensorflow.contrib.layers.python.layers import batch_norm as batch_norm 19 | import itertools 20 | import random 21 | 22 | 23 | class NFM(BaseEstimator, TransformerMixin): 24 | def __init__(self, field_sizes, 25 | static_total_size_dict, dynamic_total_size_dict, dynamic_max_len_dict, exclusive_cols, extern_lr_size = 0, extern_lr_feature_size = 0, 26 | embedding_size=8, dropout_fm=[1.0, 1.0], out = False, reduce = False, 27 | deep_layers=[256, 128], dropout_deep=[1.0, 1.0, 1.0], 28 | deep_layers_activation=tf.nn.relu, 29 | epoch=10, batch_size=256, 30 | learning_rate=0.001, optimizer_type="adam", 31 | batch_norm=1, batch_norm_decay=0.995, 32 | verbose=True, random_seed=950104, 33 | loss_type="logloss", eval_metric=roc_auc_score, 34 | l2_reg=0.0, greater_is_better=True, model_path=None): 35 | assert loss_type in ["logloss", "mse"], \ 36 | "loss_type can be either 'logloss' for classification task or 'mse' for regression task" 37 | 38 | assert field_sizes[0] == len(static_total_size_dict) and field_sizes[1] == len(dynamic_total_size_dict) 39 | assert len(static_total_size_dict) > 0 40 | 41 | self.field_sizes = field_sizes 42 | self.total_field_size = field_sizes[0] + field_sizes[1] 43 | self.dynamic_total_size_dict = dynamic_total_size_dict 44 | self.static_total_size_dict = static_total_size_dict 45 | self.embedding_size = embedding_size 46 | self.dynamic_max_len_dict = dynamic_max_len_dict 47 | self.exclusive_cols = exclusive_cols 48 | self.extern_lr_size = extern_lr_size 49 | self.extern_lr_feature_size = extern_lr_feature_size 50 | 
self.dynamic_features = list(self.dynamic_total_size_dict.keys()) 51 | self.static_features = list(self.static_total_size_dict.keys()) 52 | self.total_features = list(self.static_total_size_dict.keys()) + list(self.dynamic_total_size_dict.keys()) 53 | 54 | self.dropout_fm = dropout_fm 55 | self.deep_layers = deep_layers 56 | 57 | self.out = out 58 | self.reduce = reduce 59 | 60 | self.dropout_deep = dropout_deep 61 | self.deep_layers_activation = deep_layers_activation 62 | self.l2_reg = l2_reg 63 | 64 | self.epoch = epoch 65 | self.batch_size = batch_size 66 | self.learning_rate = learning_rate 67 | self.optimizer_type = optimizer_type 68 | 69 | self.batch_norm = batch_norm 70 | self.batch_norm_decay = batch_norm_decay 71 | 72 | self.verbose = verbose 73 | self.random_seed = random_seed + int(random.random() * 100) 74 | self.loss_type = loss_type 75 | self.eval_metric = eval_metric 76 | self.greater_is_better = greater_is_better 77 | self.model_path = model_path 78 | #self.train_result, self.valid_result = [], [] 79 | 80 | self._init_graph() 81 | 82 | 83 | def _init_graph(self): 84 | self.graph = tf.Graph() 85 | with self.graph.as_default(): 86 | 87 | tf.set_random_seed(self.random_seed) 88 | #static part input 89 | self.static_index_dict = {} 90 | for key in self.static_total_size_dict: 91 | self.static_index_dict[key] = tf.placeholder(tf.int32, shape=[None], 92 | name=key+"_st_index") # None 93 | #dynamic part input 94 | self.dynamic_index_dict = {} 95 | self.dynamic_lengths_dict = {} 96 | for key in self.dynamic_total_size_dict: 97 | self.dynamic_index_dict[key] = tf.placeholder(tf.int32, shape=[None, self.dynamic_max_len_dict[key]], 98 | name=key+"_dy_index") # None * max_len 99 | self.dynamic_lengths_dict[key] = tf.placeholder(tf.int32, shape=[None], 100 | name=key+"_dy_length") # None 101 | #others input 102 | self.label = tf.placeholder(tf.float32, shape=[None, 1], name="label") # None * 1 103 | self.dropout_keep_fm = tf.placeholder(tf.float32, shape=[None], name="dropout_keep_fm") 104 | self.dropout_keep_deep = tf.placeholder(tf.float32, shape=[None], name="dropout_keep_deep") 105 | self.train_phase = tf.placeholder(tf.bool, name="train_phase") 106 | 107 | self.weights = self._initialize_weights() 108 | 109 | # lr part 110 | self.static_lr_embs = [tf.gather(self.weights["static_lr_embeddings_dict"][key], 111 | self.static_index_dict[key]) for key in self.static_features] # static_feature_size * None * 1 112 | self.static_lr_embs = tf.concat(self.static_lr_embs, axis=1) # None * static_feature_size 113 | self.dynamic_lr_embs = [tf.reduce_sum(tf.gather(self.weights["dynamic_lr_embeddings_dict"][key], 114 | self.dynamic_index_dict[key]), axis=1) for key in self.dynamic_features] # dynamic_feature_size * None * 1 115 | self.dynamic_lr_embs = tf.concat(self.dynamic_lr_embs, axis=1) # None * dynamic_feature_size 116 | self.dynamic_lengths = tf.concat([tf.reshape(self.dynamic_lengths_dict[key],[-1,1]) for key in self.dynamic_features], axis=1)# None * dynamic_feature_size 117 | self.dynamic_lr_embs = tf.div(self.dynamic_lr_embs, tf.to_float(self.dynamic_lengths)) # None * dynamic_feature_size 118 | 119 | # ffm part 120 | embed_var_raw_dict = {} 121 | embed_var_dict = {} 122 | for key in self.static_features: 123 | embed_var_raw = tf.gather(self.weights["static_ffm_embeddings_dict"][key], 124 | self.static_index_dict[key]) # None * [k * F] 125 | embed_var_raw_dict[key] = tf.reshape(embed_var_raw, [-1, self.total_field_size, self.embedding_size]) 126 | for key in self.dynamic_features: 127 
| embed_var_raw = tf.gather(self.weights["dynamic_ffm_embeddings_dict"][key], 128 | self.dynamic_index_dict[key]) # None * max_len * [k * F] 129 | ffm_mask = tf.sequence_mask(self.dynamic_lengths_dict[key], maxlen=self.dynamic_max_len_dict[key]) # None * max_len 130 | ffm_mask = tf.expand_dims(ffm_mask, axis=-1) # None * max_len * 1 131 | ffm_mask = tf.concat([ffm_mask for i in range(self.embedding_size * self.total_field_size)], 132 | axis=-1) # None * max_len * [k * F] 133 | embed_var_raw = tf.multiply(embed_var_raw, tf.to_float(ffm_mask)) # None * max_len * [k * F] 134 | embed_var_raw = tf.reduce_sum(embed_var_raw, axis = 1) # None * [k*F] 135 | padding_lengths = tf.concat([tf.expand_dims(self.dynamic_lengths_dict[key], axis=-1) 136 | for i in range(self.embedding_size * self.total_field_size)], axis=-1) # None * [k*F] 137 | embed_var_raw = tf.div(embed_var_raw, tf.to_float(padding_lengths)) # None * [k*F] 138 | embed_var_raw_dict[key] = tf.reshape(embed_var_raw, [-1, self.total_field_size, self.embedding_size]) 139 | 140 | for (i1, i2) in itertools.combinations(list(range(0, self.total_field_size)), 2): 141 | c1, c2 = self.total_features[i1], self.total_features[i2] 142 | if (c1, c2) in self.exclusive_cols: 143 | continue 144 | embed_var_dict.setdefault(c1, {})[c2] = embed_var_raw_dict[c1][:, i2, :] # None * k 145 | embed_var_dict.setdefault(c2, {})[c1] = embed_var_raw_dict[c2][:, i1, :] # None * k 146 | 147 | x_mat = [] 148 | y_mat = [] 149 | input_size = 0 150 | for (c1, c2) in itertools.combinations(embed_var_dict.keys(), 2): 151 | if (c1, c2) in self.exclusive_cols: 152 | continue 153 | input_size += 1 154 | x_mat.append(embed_var_dict[c1][c2]) #input_size * None * k 155 | y_mat.append(embed_var_dict[c2][c1]) #input_size * None * k 156 | x_mat = tf.transpose(x_mat, perm=[1, 0, 2]) # None * input_size * k 157 | y_mat = tf.transpose(y_mat, perm=[1, 0, 2]) # None * input_size * k 158 | 159 | if self.out: 160 | x_mat = tf.expand_dims(x_mat, 3) 161 | y_mat = tf.expand_dims(y_mat, 2) 162 | x = tf.matmul(x_mat, y_mat) 163 | x = tf.reshape(x, [-1, input_size, self.embedding_size * self.embedding_size]) 164 | else: 165 | x = tf.multiply(x_mat, y_mat) 166 | 167 | if self.reduce: 168 | flat_vars = tf.reshape(tf.reduce_mean(x, axis=2), [-1, input_size]) 169 | elif self.out: 170 | flat_vars = tf.reshape(x, [-1, input_size * self.embedding_size * self.embedding_size]) 171 | else: 172 | flat_vars = tf.reshape(x, [-1, input_size * self.embedding_size]) 173 | 174 | # ---------- Deep component ---------- 175 | self.y_deep = flat_vars # 176 | self.y_deep = tf.nn.dropout(self.y_deep, self.dropout_keep_deep[0]) 177 | for i in range(0, len(self.deep_layers)): 178 | self.y_deep = tf.matmul(self.y_deep, self.weights["layer_%d" % i]) 179 | #self.y_deep = tf.add(tf.matmul(self.y_deep, self.weights["layer_%d" %i]), self.weights["bias_%d"%i]) # None * layer[i] * 1 180 | if self.batch_norm: 181 | self.y_deep = self.batch_norm_layer(self.y_deep, train_phase=self.train_phase, scope_bn="bn_%d" %i) # None * layer[i] * 1 182 | self.y_deep = self.deep_layers_activation(self.y_deep) 183 | self.y_deep = tf.nn.dropout(self.y_deep, self.dropout_keep_deep[1+i]) # dropout at each Deep layer 184 | 185 | # ---------- NFFM ---------- 186 | #concat_input = tf.concat([self.static_lr_embs, self.dynamic_lr_embs, self.y_deep], axis=1) 187 | self.out = tf.add(tf.matmul(self.y_deep, self.weights["concat_projection"]), self.weights["concat_bias"]) 188 | #self.out = tf.add(tf.reshape(tf.reduce_sum(self.out,axis=1),[-1,1]), 
self.weights['concat_bias']) 189 | self.out = tf.add(self.out, tf.reshape(tf.reduce_sum(self.static_lr_embs,axis=1),[-1,1])) 190 | self.out = tf.add(self.out, tf.reshape(tf.reduce_sum(self.dynamic_lr_embs,axis=1),[-1,1])) 191 | 192 | # loss 193 | if self.loss_type == "logloss": 194 | self.out = tf.nn.sigmoid(self.out) 195 | self.loss = tf.losses.log_loss(self.label, self.out) 196 | elif self.loss_type == "mse": 197 | self.loss = tf.nn.l2_loss(tf.subtract(self.label, self.out)) 198 | # l2 regularization on weights 199 | if self.l2_reg > 0: 200 | self.loss += tf.contrib.layers.l2_regularizer( 201 | self.l2_reg)(self.weights["concat_projection"]) 202 | for i in range(len(self.deep_layers)): 203 | self.loss += tf.contrib.layers.l2_regularizer( 204 | self.l2_reg)(self.weights["layer_%d"%i]) 205 | 206 | # optimizer 207 | if self.optimizer_type == "adam": 208 | self.optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate, beta1=0.9, beta2=0.999, 209 | epsilon=1e-8).minimize(self.loss) 210 | elif self.optimizer_type == "adagrad": 211 | self.optimizer = tf.train.AdagradOptimizer(learning_rate=self.learning_rate, 212 | initial_accumulator_value=1e-8).minimize(self.loss) 213 | elif self.optimizer_type == "gd": 214 | self.optimizer = tf.train.GradientDescentOptimizer(learning_rate=self.learning_rate).minimize(self.loss) 215 | elif self.optimizer_type == "momentum": 216 | self.optimizer = tf.train.MomentumOptimizer(learning_rate=self.learning_rate, momentum=0.95).minimize( 217 | self.loss) 218 | 219 | 220 | # init 221 | self.saver = tf.train.Saver() 222 | init = tf.global_variables_initializer() 223 | self.sess = self._init_session() 224 | if not self.model_path: 225 | self.sess.run(init) 226 | else: 227 | self.load_model(self.model_path) 228 | 229 | 230 | def _init_session(self): 231 | config = tf.ConfigProto() 232 | #config.gpu_options.allow_growth = True 233 | return tf.Session(config=config) 234 | 235 | 236 | def _initialize_weights(self): 237 | weights = dict() 238 | # lr part 239 | weights["static_lr_embeddings_dict"] = {} 240 | for key in self.static_total_size_dict: 241 | weights["static_lr_embeddings_dict"][key] = tf.Variable( 242 | tf.truncated_normal([self.static_total_size_dict[key], 1], 0.0, 0.0001), 243 | name=key + '_lr_embeddings') 244 | 245 | weights["dynamic_lr_embeddings_dict"] = {} 246 | for key in self.dynamic_total_size_dict: 247 | weights["dynamic_lr_embeddings_dict"][key] = tf.Variable( 248 | tf.truncated_normal([self.dynamic_total_size_dict[key], 1], 0.0, 0.0001), 249 | name=key+'_lr_embeddings') 250 | 251 | if self.extern_lr_size: 252 | weights["extern_lr_embeddings"] = tf.Variable( 253 | tf.truncated_normal([self.extern_lr_size, 1], 0.0, 0.0001), 254 | name="extern_lr_embeddings") 255 | 256 | # embeddings 257 | weights["static_ffm_embeddings_dict"] = {} 258 | for key in self.static_total_size_dict: 259 | weights["static_ffm_embeddings_dict"][key] = tf.Variable( 260 | tf.truncated_normal([self.static_total_size_dict[key], 261 | self.embedding_size * self.total_field_size], 0.0, 0.0001), 262 | name=key + '_ffm_embeddings') # static_feature_size * [K * F] 263 | 264 | weights["dynamic_ffm_embeddings_dict"] = {} 265 | for key in self.dynamic_total_size_dict: 266 | weights["dynamic_ffm_embeddings_dict"][key] = tf.Variable( 267 | tf.truncated_normal([self.dynamic_total_size_dict[key], 268 | self.embedding_size * self.total_field_size], 0.0, 0.0001), 269 | name=key + '_ffm_embeddings') # dynamic_feature_size * [K * F] 270 | 271 | # deep layers 272 | num_layer = 
len(self.deep_layers) 273 | input_size = 0 274 | features = self.total_features 275 | for (i1, i2) in itertools.combinations(list(range(0, len(features))), 2): 276 | c1, c2 = features[i1], features[i2] 277 | if (c1, c2) in self.exclusive_cols: 278 | continue 279 | input_size += 1 280 | if self.out: 281 | input_size *= self.embedding_size * self.embedding_size 282 | elif not self.reduce: 283 | input_size *= self.embedding_size 284 | #input_size = self.total_field_size * (self.total_field_size - 1) / 2 * self.embedding_size 285 | glorot = np.sqrt(2.0 / (input_size + self.deep_layers[0])) 286 | weights["layer_0"] = tf.Variable( 287 | np.random.normal(loc=0, scale=glorot, size=(input_size, self.deep_layers[0])), dtype=np.float32) 288 | weights["bias_0"] = tf.Variable(np.random.normal(loc=0, scale=glorot, size=(1, self.deep_layers[0])), 289 | dtype=np.float32) # 1 * layers[0] 290 | for i in range(1, num_layer): 291 | glorot = np.sqrt(2.0 / (self.deep_layers[i-1] + self.deep_layers[i])) 292 | weights["layer_%d" % i] = tf.Variable( 293 | np.random.normal(loc=0, scale=glorot, size=(self.deep_layers[i-1], self.deep_layers[i])), 294 | dtype=np.float32) # layers[i-1] * layers[i] 295 | # weights["bias_%d" % i] = tf.Variable( 296 | # np.random.normal(loc=0, scale=glorot, size=(1, self.deep_layers[i])), 297 | # dtype=np.float32) # 1 * layer[i] 298 | 299 | # final concat projection layer 300 | input_size = self.deep_layers[-1] 301 | if self.extern_lr_size: 302 | input_size += self.extern_lr_feature_size 303 | glorot = np.sqrt(2.0 / (input_size + 1)) 304 | weights["concat_projection"] = tf.Variable( 305 | np.random.normal(loc=0, scale=glorot, size=(input_size, 1)), 306 | dtype=np.float32) # layers[i-1]*layers[i] 307 | weights["concat_bias"] = tf.Variable(tf.constant(-3.5), dtype=np.float32) 308 | 309 | return weights 310 | 311 | 312 | def batch_norm_layer(self, x, train_phase, scope_bn): 313 | bn_train = batch_norm(x, decay=self.batch_norm_decay, center=True, scale=True, updates_collections=None, 314 | is_training=True, reuse=None, trainable=True, scope=scope_bn) 315 | bn_inference = batch_norm(x, decay=self.batch_norm_decay, center=True, scale=True, updates_collections=None, 316 | is_training=False, reuse=True, trainable=True, scope=scope_bn) 317 | z = tf.cond(train_phase, lambda: bn_train, lambda: bn_inference) 318 | return z 319 | 320 | 321 | def get_batch(self, static_index_dict, dynamic_index_dict, dynamic_lengths_dict, y, batch_size, index): 322 | start = index * batch_size 323 | end = (index+1) * batch_size 324 | end = end if end < len(y) else len(y) 325 | batch_static_index_dict = {} 326 | batch_dynamic_index_dict = {} 327 | batch_dynamic_lengths_dict = {} 328 | for key in static_index_dict: 329 | batch_static_index_dict[key] = static_index_dict[key][start:end] 330 | for key in dynamic_index_dict: 331 | batch_dynamic_index_dict[key] = dynamic_index_dict[key][start:end] 332 | for key in dynamic_lengths_dict: 333 | batch_dynamic_lengths_dict[key] = dynamic_lengths_dict[key][start:end] 334 | return batch_static_index_dict, batch_dynamic_index_dict, batch_dynamic_lengths_dict,\ 335 | [[y_] for y_ in y[start:end]] 336 | 337 | 338 | # shuffle four lists simutaneously 339 | def shuffle_in_unison_scary(self, a, b, c, d): 340 | rng_state = np.random.get_state() 341 | for key in a: 342 | np.random.set_state(rng_state) 343 | np.random.shuffle(a[key]) 344 | for key in b: 345 | np.random.set_state(rng_state) 346 | np.random.shuffle(b[key]) 347 | for key in c: 348 | np.random.set_state(rng_state) 349 | 
np.random.shuffle(c[key]) 350 | np.random.set_state(rng_state) 351 | np.random.shuffle(d) 352 | 353 | 354 | def fit_on_batch(self, static_index_dict, dynamic_index_dict, dynamic_lengths_dict, y): 355 | feed_dict = {self.label: y, 356 | self.dropout_keep_fm: self.dropout_fm, 357 | self.dropout_keep_deep: self.dropout_deep, 358 | self.train_phase: True} 359 | for key in self.static_features: 360 | feed_dict[self.static_index_dict[key]] = static_index_dict[key] 361 | for key in self.dynamic_features: 362 | feed_dict[self.dynamic_index_dict[key]] = dynamic_index_dict[key] 363 | feed_dict[self.dynamic_lengths_dict[key]] = dynamic_lengths_dict[key] 364 | loss, opt = self.sess.run((self.loss, self.optimizer), feed_dict=feed_dict) 365 | return loss 366 | 367 | 368 | def fit(self, train_static_index_dict, train_dynamic_index_dict, train_dynamic_lengths_dict, train_y, 369 | valid_static_index_dict=None, valid_dynamic_index_dict=None, valid_dynamic_lengths_dict=None, valid_y=None, 370 | combine=False, show_eval = True, is_shuffle = True): 371 | """ 372 | :param train_static_index: 373 | :param train_dynamic_index: 374 | :param train_dynamic_lengths: 375 | :param train_y: 376 | :param valid_static_index: 377 | :param valid_dynamic_index: 378 | :param valid_dynamic_lengths: 379 | :param valid_y: 380 | :return: 381 | """ 382 | print "fit begin" 383 | has_valid = valid_static_index_dict is not None 384 | if has_valid and combine: 385 | for key in train_static_index_dict: 386 | train_static_index_dict[key] = np.concatenate([train_static_index_dict[key], 387 | valid_static_index_dict[key]], axis=0) 388 | for key in train_dynamic_index_dict: 389 | train_dynamic_index_dict[key] = np.concatenate([train_dynamic_index_dict[key], 390 | valid_dynamic_index_dict[key]], axis=0) 391 | train_dynamic_lengths_dict[key] = np.concatenate([train_dynamic_lengths_dict[key], 392 | valid_dynamic_lengths_dict[key]], axis=0) 393 | train_y = np.concatenate([train_y, valid_y], axis=0) 394 | 395 | for epoch in range(self.epoch): 396 | total_loss = 0.0 397 | total_size = 0.0 398 | batch_begin_time = time() 399 | t1 = time() 400 | if is_shuffle: 401 | self.shuffle_in_unison_scary(train_static_index_dict, train_dynamic_index_dict, 402 | train_dynamic_lengths_dict, train_y) 403 | print "shuffle data cost %.1f" %(time()-t1) 404 | 405 | total_batch = int(len(train_y) / self.batch_size) 406 | for i in range(total_batch): 407 | offset = i * self.batch_size 408 | end = (i+1) * self.batch_size 409 | end = end if end < len(train_y) else len(train_y) 410 | static_index_batch_dict, dynamic_index_batch_dict, dynamic_lengths_batch_dict, y_batch\ 411 | = self.get_batch(train_static_index_dict, train_dynamic_index_dict, train_dynamic_lengths_dict, 412 | train_y, self.batch_size, i) 413 | batch_loss = self.fit_on_batch(static_index_batch_dict, dynamic_index_batch_dict, dynamic_lengths_batch_dict, y_batch) 414 | total_loss += batch_loss * (end - offset) 415 | total_size += end - offset 416 | if i % 100 == 99: 417 | print('[%d, %5d] loss: %.6f time: %.1f s' % 418 | (epoch + 1, i + 1, total_loss / total_size, time() - batch_begin_time)) 419 | total_loss = 0.0 420 | total_size = 0.0 421 | batch_begin_time = time() 422 | 423 | # evaluate training and validation datasets 424 | if not combine and show_eval: 425 | train_result = self.evaluate(train_static_index_dict, train_dynamic_index_dict, 426 | train_dynamic_lengths_dict, train_y) 427 | #self.train_result.append(train_result) 428 | if has_valid and not combine: 429 | valid_result = 
self.evaluate(valid_static_index_dict, valid_dynamic_index_dict, 430 | valid_dynamic_lengths_dict, valid_y) 431 | # self.valid_result.append(valid_result) 432 | if self.verbose > 0 and not combine and show_eval: 433 | if has_valid and not combine: 434 | print("[%d] train-result=%.6f, valid-result=%.6f [%.1f s]" 435 | % (epoch + 1, train_result, valid_result, time() - t1)) 436 | else: 437 | print("[%d] train-result=%.6f [%.1f s]" 438 | % (epoch + 1, train_result, time() - t1)) 439 | 440 | print "fit end" 441 | 442 | 443 | def predict(self, static_index_dict, dynamic_index_dict, dynamic_lengths_dict, y = []): 444 | """ 445 | :param static_index: 446 | :param dynamic_index: 447 | :param dynamic_lengths: 448 | :return: 449 | """ 450 | print "predict begin" 451 | # dummy y 452 | if len(y) == 0: 453 | dummy_y = [1] * len(static_index_dict[self.static_features[0]]) 454 | else: 455 | dummy_y = y 456 | batch_index = 0 457 | batch_size = 1024 458 | static_index_dict_batch, dynamic_index_dict_batch, dynamic_lengths_dict_batch, y_batch\ 459 | = self.get_batch(static_index_dict, dynamic_index_dict, dynamic_lengths_dict, dummy_y, batch_size, batch_index) 460 | y_pred = None 461 | total_loss = 0.0 462 | total_size = 0.0 463 | while len(static_index_dict_batch[self.static_features[0]]) > 0: 464 | num_batch = len(y_batch) 465 | feed_dict = { 466 | self.label: y_batch, 467 | self.dropout_keep_fm: [1.0] * len(self.dropout_fm), 468 | self.dropout_keep_deep: [1.0] * len(self.dropout_deep), 469 | self.train_phase: False} 470 | for key in self.static_features: 471 | feed_dict[self.static_index_dict[key]] = static_index_dict_batch[key] 472 | for key in self.dynamic_features: 473 | feed_dict[self.dynamic_index_dict[key]] = dynamic_index_dict_batch[key] 474 | feed_dict[self.dynamic_lengths_dict[key]] = dynamic_lengths_dict_batch[key] 475 | batch_out, batch_loss = self.sess.run((self.out, self.loss), feed_dict=feed_dict) 476 | total_loss += batch_loss * num_batch 477 | total_size += num_batch 478 | if batch_index == 0: 479 | y_pred = np.reshape(batch_out, (num_batch,)) 480 | else: 481 | y_pred = np.concatenate((y_pred, np.reshape(batch_out, (num_batch,)))) 482 | 483 | batch_index += 1 484 | static_index_dict_batch, dynamic_index_dict_batch, dynamic_lengths_dict_batch, y_batch \ 485 | = self.get_batch(static_index_dict, dynamic_index_dict, dynamic_lengths_dict, dummy_y, batch_size, 486 | batch_index) 487 | print "valid logloss is %.6f" % (total_loss / total_size) 488 | print "predict end" 489 | return y_pred 490 | 491 | 492 | def evaluate(self, static_index_dict, dynamic_index_dict, dynamic_lengths_dict, y): 493 | """ 494 | :param static_index: 495 | :param dynamic_index: 496 | :param dynamic_lengths: 497 | :param y: 498 | :return: 499 | """ 500 | print "evaluate begin" 501 | print "predicting ing" 502 | b_time = time() 503 | y_pred = self.predict(static_index_dict, dynamic_index_dict, dynamic_lengths_dict, y) 504 | print "predicting costs %.1f" %(time()- b_time) 505 | print "counting eval ing" 506 | b_time = time() 507 | res = self.eval_metric(y, y_pred) 508 | print "counting eval cost %.1f" %(time()- b_time) 509 | print "evaluate end" 510 | return res 511 | 512 | def save_model(self, path, i): 513 | self.saver.save(self.sess, path, global_step=i) 514 | 515 | def load_model(self, path): 516 | model_file = tf.train.latest_checkpoint(path) 517 | print model_file,"model file" 518 | self.saver.restore(self.sess, path) -------------------------------------------------------------------------------- /models/xgb.py: 
-------------------------------------------------------------------------------- 1 | import xgboost as xgb 2 | import sys 3 | from sklearn.metrics import roc_auc_score 4 | from random import random 5 | import numpy as np 6 | 7 | sys.path.append('../') 8 | from utils.donal_args import args 9 | 10 | def get_top_count_feature(path, num=50): 11 | col_names = ['label'] 12 | f = open(path, 'rb') 13 | f.readline() 14 | i = 0 15 | for line in f: 16 | i += 1 17 | if i > num: 18 | break 19 | col_names.append(line.strip().split(',')[0]) 20 | return col_names 21 | 22 | def read_data_from_csv(path, beg, end): 23 | data_arr = [] 24 | f = open(path,'rb') 25 | num = 0 26 | for line in f: 27 | num += 1 28 | if num <= beg: 29 | continue 30 | if num > end: 31 | break 32 | data_arr.append(float(line.strip())) 33 | f.close() 34 | return np.array(data_arr).reshape([-1,1]) 35 | 36 | def read_data_from_bin(path, col_name, beg, end, part): 37 | data_arr = [] 38 | for i in range(part): 39 | res = np.fromfile(path+col_name+'_' + str(i)+'.bin', dtype=np.int).astype(np.float).reshape([-1,1]) 40 | data_arr.append(res) 41 | return np.concatenate(data_arr,axis=0)[beg:end] 42 | 43 | def read_batch_data_from_csv(path, col_names, beg = 0, end = 7000000): 44 | data_arr = [] 45 | num = 0 46 | for col_name in col_names: 47 | num += 1 48 | t = read_data_from_csv(path+col_name+'.csv', beg, end) 49 | print num, col_name, t.shape 50 | data_arr.append(t) 51 | return np.concatenate(data_arr,axis=1) 52 | 53 | def read_batch_data_from_bin(path, col_names, beg = 0, end = 7000000, part=9): 54 | data_arr = [] 55 | num = 0 56 | for col_name in col_names: 57 | num += 1 58 | t = read_data_from_bin(path, col_name, beg, end, part) 59 | print num, col_name, t.shape 60 | data_arr.append(t) 61 | return np.concatenate(data_arr,axis=1) 62 | 63 | 64 | col_names = get_top_count_feature(args.gbdt_data_path+'analyze/importance_rank.csv',200) 65 | uid_count_feature = ['uid_uid_pos_times_count_5_fold_all','uid_adCategoryId_pos_times_count_5_fold_all', 66 | 'uid_advertiserId_pos_times_count_5_fold_all','uid_campaignId_pos_times_count_5_fold_all', 67 | 'uid_creativeId_pos_times_count_5_fold_all','uid_creativeSize_pos_times_count_5_fold_all', 68 | 'uid_productId_pos_times_count_5_fold_all','uid_productType_pos_times_count_5_fold_all', 69 | 'uid_adCategoryId_times_count','uid_advertiserId_times_count', 70 | 'uid_campaignId_times_count','uid_creativeId_times_count', 71 | 'uid_creativeSize_times_count','uid_productType_times_count', 72 | 'uid_productId_times_count'] 73 | beg = 0 74 | end = 44000000 75 | train_train_bin_datas = read_batch_data_from_bin(args.dnn_data_path+'train/train-', uid_count_feature, beg, end, part=9) 76 | beg = 44000000 77 | end = 46000000 78 | train_test_bin_datas = read_batch_data_from_bin(args.dnn_data_path+'train/train-', uid_count_feature, beg, end, part=9) 79 | beg = 0 80 | end = 34000000 81 | test2_bin_datas = read_batch_data_from_bin(args.dnn_data_path+'test2/test-', uid_count_feature, beg, end, part=3) 82 | 83 | beg = 0 84 | end = 44000000 85 | train_train_labels = read_batch_data_from_csv(args.gbdt_data_path+'train/', col_names[0:1], beg, end) 86 | train_train_datas = read_batch_data_from_csv(args.gbdt_data_path+'train/', col_names[1:], beg, end) 87 | beg = 44000000 88 | end = 46000000 89 | train_test_labels = read_batch_data_from_csv(args.gbdt_data_path+'train/', col_names[0:1], beg, end) 90 | train_test_datas = read_batch_data_from_csv(args.gbdt_data_path+'train/', col_names[1:], beg, end) 91 | 92 | # beg = 0 93 | # end = 
34000000 94 | # test1_datas = read_batch_data_from_csv(args.gbdt_data_path+'test1/', col_names[1:], beg, end) 95 | 96 | beg = 0 97 | end = 34000000 98 | test2_datas = read_batch_data_from_csv(args.gbdt_data_path+'test2/', col_names[1:], beg, end) 99 | 100 | train_train_datas = np.concatenate([train_train_datas,train_train_bin_datas],axis=1) 101 | train_test_datas = np.concatenate([train_test_datas,train_test_bin_datas], axis=1) 102 | test2_datas = np.concatenate([test2_datas, test2_bin_datas], axis=1) 103 | 104 | print train_train_datas.shape, train_test_datas.shape,test2_datas.shape 105 | 106 | dtrain = xgb.DMatrix(train_train_datas, train_train_labels) 107 | dtest = xgb.DMatrix(train_test_datas, train_test_labels) 108 | 109 | 110 | param = {} 111 | param['objective'] = 'binary:logistic' 112 | param['booster'] = 'gbtree' 113 | param['eta'] = 0.1 114 | param['max_depth'] = 8 115 | param['silent'] = 1 116 | param['nthread'] = 16 117 | param['subsample'] = 0.8 118 | param['colsample_bytree'] = 0.8 119 | param['eval_metric'] = 'auc' 120 | # param['tree_method'] = 'exact' 121 | # param['scale_pos_weight'] = 1 122 | num_round = 800 123 | 124 | watchlist = [(dtrain, 'train'), (dtest, 'test')] 125 | bst = xgb.train(param, dtrain, num_round, watchlist, early_stopping_rounds=5) 126 | 127 | test_datas = xgb.DMatrix(test2_datas) 128 | res = bst.predict(test_datas, ntree_limit=bst.best_ntree_limit) 129 | 130 | f = open(args.gbdt_data_path+'result/test2/xgb_test2_res.csv','wb') 131 | for r in res: 132 | f.write(str(r)+'\n') 133 | -------------------------------------------------------------------------------- /run.sh: -------------------------------------------------------------------------------- 1 | python src/converters/pre-csv.py 0 2 | python src/converters/pre-csv.py 1 3 | 4 | python src/converters/combine_and_shuffle.py 5 | 6 | python src/converters/pre-dnn.py 0 7 | python src/converters/pre-dnn.py 1 8 | python src/converters/pre-dnn.py 2 9 | python src/converters/pre-dnn.py 3 10 | python src/converters/pre-dnn.py 4 11 | python src/converters/pre-dnn.py 5 12 | python src/converters/pre-dnn.py 6 13 | python src/converters/pre-dnn.py 7 14 | 15 | python models/run_tf_nfm_by_part.py raw_donal donal_args 16 | python src/converters/make_submission.py data/test2.csv data/dnn/result/test2/raw_donal.csv 17 | 18 | 19 | -------------------------------------------------------------------------------- /src/converters/combine_and_shuffle.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | 3 | """ 4 | 组合初赛数据 和随机打乱 5 | """ 6 | 7 | import os 8 | import sys 9 | import numpy as np 10 | from sklearn.base import BaseEstimator, TransformerMixin 11 | from sklearn.metrics import roc_auc_score 12 | from time import time 13 | import math 14 | import pandas as pd 15 | 16 | sys.path.append('../') 17 | from utils.tencent_data_func import * 18 | from utils.donal_args import args 19 | 20 | 21 | nums = 100000000 22 | print "reading train data" 23 | train_dict, train_num = read_raw_data(args.combine_train_path, nums) 24 | nums = 100000000 25 | chusai_train_dict, chusai_train_num = read_raw_data(args.chusai_combine_train_path, nums) 26 | 27 | rng_state = np.random.get_state() 28 | 29 | for key in train_dict: 30 | train_dict[key].extend(chusai_train_dict[key]) 31 | np.random.set_state(rng_state) 32 | np.random.shuffle(train_dict[key]) 33 | 34 | 35 | def write_data(data_dict, path): 36 | f = open(path,'wb') 37 | headers = [key for key in data_dict] 38 | 
f.write(','.join(headers)+'\n') 39 | for i,d in enumerate(data_dict['label']): 40 | row = [] 41 | for key in headers: 42 | row.append(data_dict[key][i]) 43 | f.write(','.join(row)+'\n') 44 | f.close() 45 | 46 | write_data(train_dict, args.random_combine_train_path_with_chusai) 47 | print "combine and shuffle data finished" 48 | 49 | 50 | 51 | 52 | 53 | -------------------------------------------------------------------------------- /src/converters/ensemble.py: -------------------------------------------------------------------------------- 1 | import argparse, csv, sys, pickle, collections, math 2 | 3 | def logistic_func(x): 4 | return 1/(1+math.exp(-x)) 5 | 6 | def inv_logistic_func(x): 7 | return math.log(x/(1-x)) 8 | 9 | path_1 = sys.argv[1] 10 | path_2 = sys.argv[2] 11 | path_3 = sys.argv[3] 12 | weight_1 = float(sys.argv[4]) 13 | weight_2 = float(sys.argv[5]) 14 | data_1 = [] 15 | data_2 = [] 16 | 17 | f = open(path_1,'rb') 18 | for line in f: 19 | data_1.append(float(line)) 20 | f.close() 21 | 22 | f = open(path_2,'rb') 23 | for line in f: 24 | data_2.append(float(line)) 25 | f.close() 26 | 27 | assert len(data_1) == len(data_2) 28 | 29 | f = open(path_3,'wb') 30 | for i, d in enumerate(data_1): 31 | #t1 = inv_logistic_func(d) 32 | #t2 = inv_logistic_func(data_2[i]) 33 | #val = logistic_func(t1*weight_1+t2*weight_2) 34 | val = (d*weight_1+data_2[i]*weight_2) 35 | f.write('%.6f'%(val)+'\n') 36 | f.close() 37 | -------------------------------------------------------------------------------- /src/converters/lgb_analyze_importance.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | import pandas as pd 3 | import lightgbm as lgb 4 | from sklearn.model_selection import train_test_split 5 | from sklearn.feature_extraction.text import CountVectorizer 6 | from sklearn.preprocessing import OneHotEncoder,LabelEncoder 7 | from scipy import sparse 8 | import os 9 | import numpy as np 10 | from time import time 11 | import sys 12 | 13 | sys.path.append('../') 14 | from utils.donal_args import args 15 | 16 | 17 | col_names = ['label'] 18 | len_feature = ['interest1','interest2','interest3','interest4','interest5', 19 | 'kw1','kw2','kw3','topic1','topic2','topic3','appIdAction','appIdInstall'] 20 | for f in len_feature: 21 | col_names.append(f + '_len') 22 | 23 | user_one_feature = ['LBS','age','carrier','consumptionAbility','education','gender','house','os','ct','interest1', 24 | 'interest2','interest3','interest4','interest5','marriageStatus','topic1','topic2','topic3', 25 | 'kw1','kw2','kw3','appIdAction','appIdInstall'] 26 | ad_one_feature = ['aid','advertiserId','campaignId', 'creativeId','creativeSize', 27 | 'adCategoryId', 'productId', 'productType'] 28 | for f in user_one_feature: 29 | col_names.append(f+'_rate_count') 30 | for f in ad_one_feature: 31 | col_names.append(f + '_rate_count') 32 | 33 | user_combine_feature = ['LBS','age','carrier','consumptionAbility','education','gender','house','os','ct','interest1', 34 | 'interest2','interest3','interest4','interest5','marriageStatus','topic1','topic2','topic3', 35 | 'kw1','kw2','kw3'] 36 | ad_combine_feature = ['aid','advertiserId','campaignId', 'creativeId','creativeSize', 37 | 'adCategoryId', 'productId', 'productType'] 38 | for i, f1 in enumerate(user_combine_feature): 39 | if f1 == 'interest1': 40 | break 41 | for f2 in user_combine_feature[i+1:]: 42 | col_names.append(f1+'-'+f2+'_rate_count') 43 | col_names.extend(['interest1-interest2_rate_count', 'interest1-interest3_rate_count', 
44 | 'interest1-interest4_rate_count', 'interest1-interest5_rate_count']) 45 | col_names.extend(['interest4-interest5_rate_count', 'interest4-marriageStatus_rate_count', 46 | 'interest4-topic1_rate_count', 'interest4-topic2_rate_count', 47 | 'interest4-topic3_rate_count', 'interest4-kw1_rate_count', 48 | 'interest4-kw2_rate_count', 'interest4-kw3_rate_count', 49 | 'interest5-marriageStatus_rate_count', 'interest5-topic1_rate_count', 50 | 'interest5-topic2_rate_count', 'interest5-topic3_rate_count', 51 | 'topic1-topic2_rate_count', 'topic1-topic3_rate_count']) 52 | for i, f1 in enumerate(user_combine_feature): 53 | for f2 in ad_combine_feature: 54 | col_names.append(f1+'-'+f2+'_rate_count') 55 | 56 | user_time_feature = ['uid','LBS','age','carrier','consumptionAbility','education','gender','house','os','ct','interest1', 57 | 'interest2','interest3','interest4','interest5','marriageStatus','topic1','topic2','topic3', 58 | 'kw1','kw2','kw3','appIdAction','appIdInstall'] 59 | ad_time_feature = ['aid','advertiserId','campaignId', 'creativeId','creativeSize', 60 | 'adCategoryId', 'productId', 'productType'] 61 | for f in user_time_feature: 62 | col_names.append(f+'_times_count') 63 | for f in ad_time_feature: 64 | col_names.append(f+'_times_count') 65 | 66 | user_combine_feature = ['LBS','age','carrier','consumptionAbility','education','gender','house','os','ct','interest1', 67 | 'interest2','interest3','interest4','interest5','marriageStatus','topic1','topic2','topic3', 68 | 'kw1','kw2','kw3'] 69 | ad_combine_feature = ['aid','advertiserId','campaignId', 'creativeId','creativeSize', 70 | 'adCategoryId', 'productId', 'productType'] 71 | for i, f1 in enumerate(user_combine_feature): 72 | for f2 in user_combine_feature[i+1:]: 73 | col_names.append(f1+'-'+f2+'_times_count') 74 | for i, f1 in enumerate(user_combine_feature): 75 | for f2 in ad_combine_feature: 76 | col_names.append(f1 + '-' + f2 + '_times_count') 77 | 78 | def read_data_from_csv(path, beg, end): 79 | data_arr = [] 80 | f = open(path,'rb') 81 | num = 0 82 | for line in f: 83 | num += 1 84 | if num < beg: 85 | continue 86 | if num > end: 87 | break 88 | data_arr.append(float(line.strip())) 89 | f.close() 90 | return np.array(data_arr).reshape([-1,1]) 91 | 92 | 93 | def read_batch_data_from_csv(path, col_names, beg = 0, end = 7000000): 94 | data_arr = [] 95 | num = 0 96 | for col_name in col_names: 97 | num += 1 98 | t = read_data_from_csv(path+col_name+'.csv', beg, end) 99 | print num, col_name, t.shape 100 | data_arr.append(t) 101 | return np.concatenate(data_arr,axis=1) 102 | 103 | beg = 0 104 | end = 20000000 105 | train_labels = read_batch_data_from_csv(args.gbdt_data_path+'train/', col_names[0:1], beg, end) 106 | train_datas = read_batch_data_from_csv(args.gbdt_data_path+'train/', col_names[1:], beg, end) 107 | 108 | print train_datas.shape 109 | 110 | 111 | 112 | 113 | clf = lgb.LGBMClassifier( 114 | boosting_type='gbdt', num_leaves=63, reg_alpha=0.0, reg_lambda=1, 115 | max_depth=-1, n_estimators=1000, objective='binary', 116 | subsample=0.8, colsample_bytree=0.8, subsample_freq=1, feature_fraction=0.8, 117 | learning_rate=0.05, min_child_weight=50 118 | ) 119 | clf.fit(train_datas, train_labels, eval_set=[(train_datas, train_labels)], eval_metric='logloss',early_stopping_rounds=100) 120 | 121 | dict = {} 122 | for i,col_name in enumerate(col_names[1:]): 123 | dict[col_name] = clf.feature_importances_[i] * 100 124 | 125 | f = open(args.gbdt_data_path+'analyze/' + 'importance_rank.csv','wb') 126 | f.write('name,value\n') 127 | 
sort_dict = sorted(dict.items(),key = lambda x:x[1],reverse = True) 128 | for name, val in sort_dict: 129 | f.write(name + ',' + str(val) + '\n') 130 | f.close() 131 | 132 | 133 | 134 | -------------------------------------------------------------------------------- /src/converters/make_submission.py: -------------------------------------------------------------------------------- 1 | import sys 2 | sys.path.append('../../') 3 | from utils.donal_args import args 4 | path_1 = sys.argv[1] 5 | path_2 = sys.argv[2] 6 | f1 = open(path_1) 7 | f2 = open(path_2) 8 | f = open('submission.csv','wb') 9 | 10 | f.write('aid,uid,score\n') 11 | f1.readline() 12 | for line in f1: 13 | line = line.strip() +','+ f2.readline() 14 | f.write(line) 15 | -------------------------------------------------------------------------------- /src/converters/norm_ensemble.py: -------------------------------------------------------------------------------- 1 | import argparse, csv, sys, pickle, collections, math 2 | import numpy as np 3 | 4 | def logistic_func(x): 5 | return 1/(1+math.exp(-x)) 6 | 7 | def inv_logistic_func(x): 8 | return math.log(x/(1-x)) 9 | 10 | def Normalize(data): 11 | mx = max(data) 12 | mn = min(data) 13 | return [(float(i) - mn) / (mx - mn) for i in data] 14 | 15 | path_1 = sys.argv[1] 16 | path_2 = sys.argv[2] 17 | path_3 = sys.argv[3] 18 | weight_1 = float(sys.argv[4]) 19 | weight_2 = float(sys.argv[5]) 20 | data_1 = [] 21 | data_2 = [] 22 | 23 | f = open(path_1,'rb') 24 | for line in f: 25 | data_1.append(float(line)) 26 | f.close() 27 | 28 | f = open(path_2,'rb') 29 | for line in f: 30 | data_2.append(float(line)) 31 | f.close() 32 | 33 | data_1 = Normalize(data_1) 34 | data_2 = Normalize(data_2) 35 | assert len(data_1) == len(data_2) 36 | 37 | f = open(path_3,'wb') 38 | for i, d in enumerate(data_1): 39 | #t1 = inv_logistic_func(d) 40 | #t2 = inv_logistic_func(data_2[i]) 41 | #val = logistic_func(t1*weight_1+t2*weight_2) 42 | val = (d*weight_1+data_2[i]*weight_2) 43 | f.write('%.6f'%(val)+'\n') 44 | f.close() 45 | -------------------------------------------------------------------------------- /src/converters/pre-csv.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | """ 3 | 拼表和生成csv文件 4 | 5 | """ 6 | 7 | import numpy as np 8 | import pandas as pd 9 | import os 10 | import sys 11 | 12 | from time import time 13 | 14 | sys.path.append('../../') 15 | from utils.donal_args import args 16 | 17 | state = int(sys.argv[1]) # state=0:复赛数据拼表; state!=0:初赛数据拼表 18 | if state == 0: 19 | ad_feature_path = args.ad_feature_path 20 | user_feature_path = args.user_feature_path 21 | raw_train_path = args.raw_train_path 22 | raw_test1_path = args.raw_test1_path 23 | raw_test2_path = args.raw_test2_path 24 | combine_train_path = args.combine_train_path 25 | combine_test1_path = args.combine_test1_path 26 | combine_test2_path = args.combine_test2_path 27 | else: 28 | ad_feature_path = args.chusai_ad_feature_path 29 | user_feature_path = args.chusai_user_feature_path 30 | raw_train_path = args.chusai_raw_train_path 31 | raw_test1_path = args.chusai_raw_test1_path 32 | raw_test2_path = args.chusai_raw_test2_path 33 | combine_train_path = args.chusai_combine_train_path 34 | combine_test1_path = args.chusai_combine_test1_path 35 | combine_test2_path = args.chusai_combine_test2_path 36 | 37 | ad_feature=pd.read_csv(ad_feature_path) 38 | 39 | userFeature_data = [] 40 | user_feature = None 41 | with open(user_feature_path, 'r') as f: 42 | for i, line in 
enumerate(f): 43 | line = line.strip().split('|') 44 | userFeature_dict = {} 45 | for each in line: 46 | each_list = each.split(' ') 47 | userFeature_dict[each_list[0]] = ' '.join(each_list[1:]) 48 | userFeature_data.append(userFeature_dict) 49 | if i % 100000 == 0: 50 | print(i) 51 | user_feature = pd.DataFrame(userFeature_data) 52 | user_feature['uid'] = user_feature['uid'].apply(int) 53 | 54 | train=pd.read_csv(raw_train_path) 55 | predict1=pd.read_csv(raw_test1_path) 56 | predict2 = pd.read_csv(raw_test2_path) 57 | train.loc[train['label']==-1,'label']=0 58 | predict1['label']=-1 59 | predict2['label']=-2 60 | data=pd.concat([train,predict1,predict2]) 61 | data=pd.merge(data,ad_feature,on='aid',how='left') 62 | data=pd.merge(data,user_feature,on='uid',how='left') 63 | data=data.fillna('-1') 64 | 65 | train = data[data.label != -1][data.label != -2] 66 | 67 | test1 = data[data.label == -1] 68 | test2 = data[data.label == -2] 69 | 70 | train.to_csv(combine_train_path, index=False) 71 | test1.to_csv(combine_test1_path, index=False) 72 | test2.to_csv(combine_test2_path, index=False) 73 | 74 | 75 | 76 | 77 | -------------------------------------------------------------------------------- /src/converters/pre-dnn.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | 3 | import os 4 | import sys 5 | import numpy as np 6 | from sklearn.base import BaseEstimator, TransformerMixin 7 | from sklearn.metrics import roc_auc_score 8 | from time import time 9 | import math 10 | import pandas as pd 11 | part = int(sys.argv[1]) 12 | 13 | sys.path.append('../../') 14 | from utils.tencent_data_func import * 15 | from utils.donal_args import args 16 | 17 | nums = 100000000 18 | print "reading train data" 19 | print args.random_combine_train_path_with_chusai 20 | train_dict, train_num = read_raw_data(args.random_combine_train_path_with_chusai, nums) 21 | print "reading test1 data" 22 | test1_dict, test1_num = read_raw_data(args.combine_test1_path, nums) 23 | print "reading test2 data" 24 | test2_dict, test2_num = read_raw_data(args.combine_test2_path, nums) 25 | 26 | 27 | one_hot_feature=['aid', 'advertiserId','campaignId', 'creativeId','creativeSize','adCategoryId', 28 | 'productId', 'productType', 'LBS','age','carrier','consumptionAbility','education','gender','house'] 29 | combine_onehot_feature = ['LBS','age','carrier','consumptionAbility','education','gender','house'] 30 | vector_feature=['os','ct', 'marriageStatus', 'interest1','interest2','interest3','interest4','interest5', 31 | 'kw1','kw2','kw3','topic1','topic2','topic3'] 32 | combine_vector_feature = ['os','ct', 'marriageStatus', 'interest1','interest2','interest3','interest4','interest5'] 33 | app_feature = ['appIdAction','appIdInstall'] 34 | 35 | if part == 0: 36 | print "label preparing" 37 | labels = [] 38 | for d in train_dict['label']: 39 | labels.append(int(d)) 40 | labels = np.array(labels) 41 | write_data_into_parts(labels, args.dnn_data_path + 'train_with_chusai/train-label') 42 | # labels = [] 43 | # for d in test1_dict['label']: 44 | # labels.append(int(d)) 45 | # labels = np.array(labels) 46 | # write_data_into_parts(labels, args.dnn_data_path+'test1/test-label') 47 | labels = [] 48 | for d in test2_dict['label']: 49 | labels.append(int(d)) 50 | labels = np.array(labels) 51 | write_data_into_parts(labels, args.dnn_data_path + 'test2_with_chusai/test-label') 52 | 53 | begin_num = 1 54 | for feature in one_hot_feature: 55 | print feature, "preparing" 56 | train_res, test1_res, 
test2_res, f_dict = onehot_feature_process(train_dict[feature], test1_dict[feature], 57 | test2_dict[feature], begin_num) 58 | write_dict(args.dnn_data_path + 'dict_with_chusai/' + feature + '.csv', f_dict) 59 | write_data_into_parts(train_res, args.dnn_data_path + 'train_with_chusai/train-' + feature) 60 | # write_data_into_parts(test1_res, args.dnn_data_path + 'test1/test-' + feature) 61 | write_data_into_parts(test2_res, args.dnn_data_path + 'test2_with_chusai/test-' + feature) 62 | 63 | for feature in combine_onehot_feature: 64 | print feature + '_aid', "preparing" 65 | train_res, test1_res, test2_res, f_dict = onehot_combine_process(train_dict[feature], train_dict['aid'], 66 | test1_dict[feature], test1_dict['aid'], 67 | test2_dict[feature], test2_dict['aid'], 68 | begin_num) 69 | write_dict(args.dnn_data_path + 'dict_with_chusai/' + feature + '_aid.csv', f_dict) 70 | write_data_into_parts(train_res, args.dnn_data_path + 'train_with_chusai/train-' + feature + '_aid') 71 | # write_data_into_parts(test1_res, args.dnn_data_path + 'test1/test-' + feature + '_aid') 72 | write_data_into_parts(test2_res, args.dnn_data_path + 'test2_with_chusai/test-' + feature + '_aid') 73 | 74 | print "static max_len :", begin_num 75 | begin_num = 1 76 | if part == 1: 77 | for feature in vector_feature: 78 | print feature, "preparing" 79 | max_len = args.dynamic_dict[feature] 80 | train_res, test1_res, test2_res, f_dict = vector_feature_process(train_dict[feature], test1_dict[feature], 81 | test2_dict[feature], 82 | begin_num, max_len) 83 | write_dict(args.dnn_data_path + 'dict_with_chusai/' + feature + '.csv', f_dict) 84 | write_data_into_parts(train_res, args.dnn_data_path + 'train_with_chusai/train-' + feature) 85 | # write_data_into_parts(test1_res, args.dnn_data_path + 'test1/test-' + feature) 86 | write_data_into_parts(test2_res, args.dnn_data_path + 'test2_with_chusai/test-' + feature) 87 | train_res_lengths = get_vector_feature_len(train_res) 88 | # test1_res_lengths = get_vector_feature_len(test1_res) 89 | test2_res_lengths = get_vector_feature_len(test2_res) 90 | write_data_into_parts(train_res_lengths, args.dnn_data_path + 'train_with_chusai/train-' + feature + '_lengths') 91 | # write_data_into_parts(test1_res_lengths, args.dnn_data_path + 'test1_with_chusai/test-' + feature + '_lengths') 92 | write_data_into_parts(test2_res_lengths, args.dnn_data_path + 'test2_with_chusai/test-' + feature + '_lengths') 93 | 94 | if part == 2: 95 | for feature in ['interest1', 'interest2','marriageStatus']: 96 | print feature + '_aid', "preparing" 97 | max_len = args.dynamic_dict[feature] 98 | train_res, test1_res, test2_res, f_dict = vector_combine_process(train_dict[feature], train_dict['aid'], 99 | test1_dict[feature], test1_dict['aid'], 100 | test2_dict[feature], test2_dict['aid'], 101 | begin_num, max_len) 102 | write_dict(args.dnn_data_path + 'dict_with_chusai/' + feature + '_aid.csv', f_dict) 103 | write_data_into_parts(train_res, args.dnn_data_path + 'train_with_chusai/train-' + feature + '_aid') 104 | # write_data_into_parts(test1_res, args.dnn_data_path + 'test1_with_chusai/test-' + feature + '_aid') 105 | write_data_into_parts(test2_res, args.dnn_data_path + 'test2_with_chusai/test-' + feature + '_aid') 106 | train_res_lengths = get_vector_feature_len(train_res) 107 | # test1_res_lengths = get_vector_feature_len(test1_res) 108 | test2_res_lengths = get_vector_feature_len(test2_res) 109 | write_data_into_parts(train_res_lengths, args.dnn_data_path + 'train_with_chusai/train-' + feature + '_aid' + 
'_lengths') 110 | # write_data_into_parts(test1_res_lengths, args.dnn_data_path + 'test1_with_chusai/test-' + feature + '_aid' + '_lengths') 111 | write_data_into_parts(test2_res_lengths, args.dnn_data_path + 'test2_with_chusai/test-' + feature + '_aid' + '_lengths') 112 | 113 | print "dynamic max_len", begin_num 114 | 115 | 116 | 117 | if part == 3: 118 | """ 119 | len features 120 | """ 121 | len_features = ['interest1', 'interest2', 'interest5'] 122 | for f in len_features: 123 | print f + '_len adding' 124 | train_res, test1_res, test2_res, f_dict = add_len(train_dict[f], test1_dict[f], test2_dict[f]) 125 | write_dict(args.dnn_data_path + 'dict_with_chusai/' + f + '_len' + '.csv', f_dict) 126 | write_data_into_parts(train_res, args.dnn_data_path + 'train_with_chusai/train-' + f + '_len') 127 | # write_data_into_parts(test1_res, args.dnn_data_path + 'test1/test-' + f + '_len') 128 | write_data_into_parts(test2_res, args.dnn_data_path + 'test2_with_chusai/test-' + f + '_len') 129 | 130 | """ 131 | uid combine counts 132 | """ 133 | print "uid counts adding" 134 | ad_feartures = ['uid', 'advertiserId', 'campaignId', 'creativeId', 'creativeSize', 135 | 'adCategoryId', 'productId', 'productType'] 136 | 137 | dnn_data_path = args.root_data_path + 'dnn/' 138 | for f in ad_feartures: 139 | print f, 'uid times counting' 140 | train_res, test1_res, test2_res, f_dict = count_combine_feature_times(train_dict['uid'], train_dict[f], 141 | test1_dict['uid'], test1_dict[f], 142 | test2_dict['uid'], test2_dict[f]) 143 | write_dict(dnn_data_path + 'dict_with_chusai/' + 'uid_' + f + '_times_count.csv', f_dict) 144 | write_data_into_parts(train_res, dnn_data_path + 'train_with_chusai/train-' + 'uid_' + f + '_times_count') 145 | # write_data_into_parts(test1_res, dnn_data_path + 'test1/test-' + 'uid_' + f + '_times_count') 146 | write_data_into_parts(test2_res, dnn_data_path + 'test2_with_chusai/test-' + 'uid_' + f + '_times_count') 147 | 148 | print "log uid counts adding" 149 | log_train_res = [] 150 | # log_test1_res = [] 151 | log_test2_res = [] 152 | log_f_dict = {} 153 | for val in train_res: 154 | log_train_res.append(int(math.log(1 + val * val))) 155 | # for val in test1_res: 156 | # log_test1_res.append(int(math.log(1 + val * val))) 157 | for val in test2_res: 158 | log_test2_res.append(int(math.log(1 + val * val))) 159 | for key in f_dict: 160 | new_key = int(math.log(1 + key * key)) 161 | if not log_f_dict.has_key(new_key): 162 | log_f_dict[new_key] = 0 163 | log_f_dict[new_key] += f_dict[key] 164 | log_train_res = np.array(log_train_res) 165 | # log_test1_res = np.array(log_test1_res) 166 | log_test2_res = np.array(log_test2_res) 167 | write_dict(dnn_data_path + 'dict_with_chusai/' + 'log_uid_' + f + '_times_count.csv', log_f_dict) 168 | write_data_into_parts(log_train_res, dnn_data_path + 'train_with_chusai/train-' + 'log_uid_' + f + '_times_count') 169 | # write_data_into_parts(log_test1_res, dnn_data_path + 'test1/test-' + 'log_uid_' + f + '_times_count') 170 | write_data_into_parts(log_test2_res, dnn_data_path + 'test2_with_chusai/test-' + 'log_uid_' + f + '_times_count') 171 | 172 | if part == 4: 173 | """ 174 | uid pos cimbine counts 175 | """ 176 | print "uid pos combine counts" 177 | ad_feartures = ['uid','advertiserId', 'campaignId', 'creativeId', 'creativeSize', 178 | 'adCategoryId', 'productId', 'productType'] 179 | dnn_data_path = args.root_data_path + 'dnn/' 180 | print "uid pos combine counts adding" 181 | for f in ad_feartures: 182 | print f,'uid pos counting' 183 | 
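        # combine_to_one() and count_pos_feature() come from utils/tencent_data_func;
        # judging by the converters of the same name in pre-gbdt.py, combine_to_one()
        # presumably fuses each row's uid and ad-side value into a single 'uid|value'
        # key, and count_pos_feature() counts how many positive rows carry each key.
        # With is_val=False the count for a training row is taken from the other four
        # of the five folds, so a row never contributes its own label; test rows use
        # counts from the whole training set.  Part 5 below repeats this with
        # is_val=True, where (in the pre-gbdt.py version of the helper) only the first
        # 5100000 * 8 rows are counted and the remaining tail is treated like test data.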
new_train_data = combine_to_one(train_dict['uid'], train_dict[f]) 184 | new_test1_data = combine_to_one(test1_dict['uid'], test1_dict[f]) 185 | new_test2_data = combine_to_one(test2_dict['uid'], test2_dict[f]) 186 | train_res, test1_res, test2_res, f_dict = count_pos_feature(new_train_data, new_test1_data, 187 | new_test2_data, train_dict['label'], 5, 188 | is_val=False) 189 | write_dict(dnn_data_path + 'dict_with_chusai/' + 'uid_' + f + '_pos_times_count_5_fold_all.csv', f_dict) 190 | write_data_into_parts(train_res, dnn_data_path + 'train_with_chusai/train-' + 'uid_'+ f +'_pos_times_count_5_fold_all') 191 | # write_data_into_parts(test1_res, dnn_data_path + 'test1/test-' + 'uid_'+ f + '_pos_times_count_5_fold_all') 192 | write_data_into_parts(test2_res, dnn_data_path + 'test2_with_chusai/test-' + 'uid_'+ f + '_pos_times_count_5_fold_all') 193 | 194 | print "log uid counts adding" 195 | log_train_res = [] 196 | # log_test1_res = [] 197 | log_test2_res = [] 198 | log_f_dict = {} 199 | for val in train_res: 200 | log_train_res.append(int(math.log(1 + val * val))) 201 | # for val in test1_res: 202 | # log_test1_res.append(int(math.log(1 + val * val))) 203 | for val in test2_res: 204 | log_test2_res.append(int(math.log(1 + val * val))) 205 | for key in f_dict: 206 | new_key = int(math.log(1 + key * key)) 207 | if key == -1: 208 | new_key = -1 209 | if not log_f_dict.has_key(new_key): 210 | log_f_dict[new_key] = 0 211 | log_f_dict[new_key] += f_dict[key] 212 | log_train_res = np.array(log_train_res) 213 | # log_test1_res = np.array(log_test1_res) 214 | log_test2_res = np.array(log_test2_res) 215 | write_dict(dnn_data_path + 'dict_with_chusai/' + 'log_uid_' + f + '_pos_times_count_5_fold_all.csv', f_dict) 216 | write_data_into_parts(log_train_res, 217 | dnn_data_path + 'train_with_chusai/train-' + 'log_uid_' + f + '_pos_times_count_5_fold_all') 218 | # write_data_into_parts(test1_res, 219 | # dnn_data_path + 'test1_with_chusai/test-' + 'log_uid_' + f + '_pos_times_count_5_fold_all') 220 | write_data_into_parts(log_test2_res, 221 | dnn_data_path + 'test2_with_chusai/test-' + 'log_uid_' + f + '_pos_times_count_5_fold_all') 222 | 223 | if part == 5: 224 | """ 225 | uid pos combine counts sample 226 | """ 227 | print "uid pos combine counts" 228 | ad_feartures = ['uid','advertiserId', 'campaignId', 'creativeId', 'creativeSize', 229 | 'adCategoryId', 'productId', 'productType'] 230 | dnn_data_path = args.root_data_path + 'dnn/' 231 | print "uid pos combine counts sample adding" 232 | for f in ad_feartures: 233 | print f, 'uid pos counting' 234 | new_train_data = combine_to_one(train_dict['uid'], train_dict[f]) 235 | new_test1_data = combine_to_one(test1_dict['uid'], test1_dict[f]) 236 | new_test2_data = combine_to_one(test2_dict['uid'], test2_dict[f]) 237 | train_res, test1_res, test2_res, f_dict = count_pos_feature(new_train_data, new_test1_data, 238 | new_test2_data, train_dict['label'], 5, 239 | is_val=True) 240 | write_dict(dnn_data_path + 'dict_with_chusai/' + 'uid_' + f + '_pos_times_count_5_fold.csv', f_dict) 241 | write_data_into_parts(train_res, dnn_data_path + 'train_with_chusai/train-' + 'uid_'+ f +'_pos_times_count_5_fold') 242 | # write_data_into_parts(test1_res, dnn_data_path + 'test1_with_chusai/test-' + 'uid_'+ f + '_pos_times_count_5_fold') 243 | write_data_into_parts(test2_res, dnn_data_path + 'test2_with_chusai/test-' + 'uid_'+ f + '_pos_times_count_5_fold') 244 | 245 | print "log uid counts adding" 246 | log_train_res = [] 247 | # log_test1_res = [] 248 | log_test2_res = [] 249 | 
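        # The raw counts are then compressed with bucket = int(math.log(1 + v * v)),
        # roughly 2*ln(v) for large v, so only the order of magnitude survives and the
        # embedding vocabulary stays small, e.g.
        #     v:      0  1  2  3  10  100  1000
        #     bucket: 0  0  1  2   4    9    13
        # (the special key -1 for missing values is kept as -1 a few lines further down).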
log_f_dict = {} 250 | for val in train_res: 251 | log_train_res.append(int(math.log(1 + val * val))) 252 | # for val in test1_res: 253 | # log_test1_res.append(int(math.log(1 + val * val))) 254 | for val in test2_res: 255 | log_test2_res.append(int(math.log(1 + val * val))) 256 | for key in f_dict: 257 | new_key = int(math.log(1 + key * key)) 258 | if key == -1: 259 | new_key = -1 260 | if not log_f_dict.has_key(new_key): 261 | log_f_dict[new_key] = 0 262 | log_f_dict[new_key] += f_dict[key] 263 | log_train_res = np.array(log_train_res) 264 | # log_test1_res = np.array(log_test1_res) 265 | log_test2_res = np.array(log_test2_res) 266 | write_dict(dnn_data_path + 'dict_with_chusai/' + 'log_uid_' + f + '_pos_times_count_5_fold.csv', f_dict) 267 | write_data_into_parts(log_train_res, 268 | dnn_data_path + 'train_with_chusai/train-' + 'log_uid_' + f + '_pos_times_count_5_fold') 269 | # write_data_into_parts(log_test1_res, 270 | # dnn_data_path + 'test1/test-' + 'log_uid_' + f + '_pos_times_count_5_fold') 271 | write_data_into_parts(log_test2_res, 272 | dnn_data_path + 'test2_with_chusai/test-' + 'log_uid_' + f + '_pos_times_count_5_fold') 273 | 274 | combine_features = [('topic1','topic2'),('LBS','kw1'),('interest2','aid'),('kw2','creativeSize'), 275 | ('kw2','aid'),('marriageStatus','aid'),('interest1','aid'),('LBS','kw2'), 276 | ('topic2','aid'),('LBS','aid'),('interest3','aid'),('interest4','aid'),('interest5','aid'), 277 | ('topic1','aid'),('kw3','aid'),('topic3','aid'),('kw1','aid')] 278 | if part == 6: 279 | """ 280 | seq feat counting 281 | """ 282 | print "uid seq feat counts" 283 | dnn_data_path = args.root_data_path + 'dnn/' 284 | # raw_train_dict, raw_train_num = read_raw_data(args.raw_train_path, nums) 285 | train_res, test1_res, test2_res, f_dict = uid_seq_feature(train_dict['uid'], test1_dict['uid'], 286 | test2_dict['uid'], train_dict['label']) 287 | # new_raw_train_data = combine_to_one(raw_train_dict['uid'], raw_train_dict['aid']) 288 | # new_train_data = combine_to_one(train_dict['uid'], train_dict['aid']) 289 | # train_res = resort_data(new_raw_train_data, raw_train_res, new_train_data) 290 | 291 | write_dict(dnn_data_path + 'dict_with_chusai/' + 'uid_seq.csv',f_dict) 292 | write_data_into_parts(train_res, dnn_data_path + 'train_with_chusai/train-' + 'uid_seq') 293 | # write_data_into_parts(test1_res, dnn_data_path + 'test1/test-' + 'uid_seq') 294 | write_data_into_parts(test2_res, dnn_data_path + 'test2_with_chusai/test-' + 'uid_seq') 295 | 296 | for f1 in ['age', 'gender']: 297 | for f2 in ['advertiserId','campaignId', 'creativeId','creativeSize','adCategoryId','productId', 'productType']: 298 | print f1 + '_' + f2, "preparing" 299 | train_res, test1_res, test2_res, f_dict = onehot_combine_process(train_dict[f1], train_dict[f2], 300 | test1_dict[f1], test1_dict[f2], 301 | test2_dict[f1], test2_dict[f2], 302 | begin_num) 303 | write_dict(args.dnn_data_path + 'dict_with_chusai/' + f1 + '_' + f2 + '.csv', f_dict) 304 | write_data_into_parts(train_res, args.dnn_data_path + 'train_with_chusai/train-' + f1 + '_' + f2) 305 | # write_data_into_parts(test1_res, args.dnn_data_path + 'test1/test-' + feature + '_aid') 306 | write_data_into_parts(test2_res, args.dnn_data_path + 'test2_with_chusai/test-' + f1 + '_' + f2) 307 | 308 | 309 | if part == 7: 310 | """ 311 | uid combine counts 312 | """ 313 | print "uid counts adding" 314 | ad_feartures = ['uid','advertiserId', 'campaignId', 'creativeId', 'creativeSize', 315 | 'adCategoryId', 'productId', 'productType'] 316 | print "loading 
chusai test1" 317 | chusai_test1_dict, test1_num_chusai = read_raw_data(args.chusai_combine_test1_path, nums) 318 | print "loading chusai test2" 319 | chusai_test2_dict, test2_num_chusai = read_raw_data(args.chusai_combine_test2_path, nums) 320 | for key in chusai_test1_dict: 321 | chusai_test1_dict[key].extend(chusai_test2_dict[key]) 322 | 323 | dnn_data_path = args.root_data_path + 'dnn/' 324 | for f in ad_feartures: 325 | print f, 'uid times counting' 326 | train_res, test1_res, test2_res, f_dict = count_combine_feature_times_with_chusai(train_dict['uid'], train_dict[f], 327 | test1_dict['uid'], test1_dict[f], 328 | test2_dict['uid'], test2_dict[f], 329 | chusai_test1_dict['uid'], 330 | chusai_test1_dict[f]) 331 | write_dict(dnn_data_path + 'dict_with_chusai/' + 'uid_' + f + '_times_count_with_chusai_test.csv', f_dict) 332 | write_data_into_parts(train_res, dnn_data_path + 'train_with_chusai/train-' + 'uid_' + f + '_times_count_with_chusai_test') 333 | # write_data_into_parts(test1_res, dnn_data_path + 'test1_with_chusai/test-' + 'uid_' + f + '_times_count_with_chusai_test') 334 | write_data_into_parts(test2_res, dnn_data_path + 'test2_with_chusai/test-' + 'uid_' + f + '_times_count_with_chusai_test') 335 | 336 | print "log uid counts adding" 337 | log_train_res = [] 338 | log_test1_res = [] 339 | log_test2_res = [] 340 | log_f_dict = {} 341 | for val in train_res: 342 | log_train_res.append(int(math.log(1 + val * val))) 343 | for val in test1_res: 344 | log_test1_res.append(int(math.log(1 + val * val))) 345 | for val in test2_res: 346 | log_test2_res.append(int(math.log(1 + val * val))) 347 | for key in f_dict: 348 | new_key = int(math.log(1 + key * key)) 349 | if not log_f_dict.has_key(new_key): 350 | log_f_dict[new_key] = 0 351 | log_f_dict[new_key] += f_dict[key] 352 | log_train_res = np.array(log_train_res) 353 | log_test1_res = np.array(log_test1_res) 354 | log_test2_res = np.array(log_test2_res) 355 | write_dict(dnn_data_path + 'dict_with_chusai/' + 'log_uid_' + f + '_times_count_with_chusai_test.csv', log_f_dict) 356 | write_data_into_parts(log_train_res, dnn_data_path + 'train_with_chusai/train-' + 'log_uid_' + f + '_times_count_with_chusai_test') 357 | # write_data_into_parts(log_test1_res, dnn_data_path + 'test1_with_chusai/test-' + 'log_uid_' + f + '_times_count_with_chusai_test') 358 | write_data_into_parts(log_test2_res, dnn_data_path + 'test2_with_chusai/test-' + 'log_uid_' + f + '_times_count_with_chusai_test') 359 | 360 | 361 | -------------------------------------------------------------------------------- /src/converters/pre-gbdt.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | """ 3 | 生成gbdt的特征 需要多台服务器分块生成 4 | 5 | """ 6 | 7 | import numpy as np 8 | import pandas as pd 9 | import os 10 | import sys 11 | from random import random 12 | 13 | import time 14 | 15 | sys.path.append('../../') 16 | from utils.donal_args import args 17 | 18 | part = int(sys.argv[1]) 19 | 20 | """ 21 | 辅助函数 22 | """ 23 | 24 | def read_data(path): 25 | f = open(path, 'rb') 26 | features = f.readline().strip().split(',') 27 | count_dict = {} 28 | for feature in features: 29 | count_dict[feature] = [] 30 | num = 0 31 | for line in f: 32 | # if num > 1000: 33 | # break 34 | datas = line.strip().split(',') 35 | for i, d in enumerate(datas): 36 | count_dict[features[i]].append(d) 37 | num += 1 38 | f.close() 39 | return count_dict,num 40 | 41 | def shuffle_data(data_dict): 42 | rng_state = np.random.get_state() 43 | for key in data_dict: 44 | 
np.random.set_state(rng_state) 45 | np.random.shuffle(data_dict[key]) 46 | 47 | def clean(train_data, test_data): 48 | train_s = set() 49 | test_s = set() 50 | for i, data in enumerate(train_data): 51 | xs = data.split(' ') 52 | for x in xs: 53 | train_s.add(x) 54 | for i, data in enumerate(test_data): 55 | xs = data.split(' ') 56 | for x in xs: 57 | test_s.add(x) 58 | new_train_data = [] 59 | new_test_data = [] 60 | for i, data in enumerate(train_data): 61 | xs = data.split(' ') 62 | nxs = [] 63 | for x in xs: 64 | if x in test_s: 65 | nxs.append(x) 66 | if len(nxs) == 0: 67 | nxs = ['-1'] 68 | new_train_data.append(' '.join(nxs)) 69 | for i, data in enumerate(test_data): 70 | xs = data.split(' ') 71 | nxs = [] 72 | for x in xs: 73 | if x in train_s: 74 | nxs.append(x) 75 | if len(nxs) == 0: 76 | nxs = ['-1'] 77 | new_test_data.append(' '.join(nxs)) 78 | return new_train_data, new_test_data 79 | 80 | def gen_count_dict(data, labels, begin, end): 81 | total_dict = {} 82 | pos_dict = {} 83 | for i, d in enumerate(data): 84 | if i >= begin and i < end: 85 | continue 86 | xs = d.split(' ') 87 | for x in xs: 88 | if not total_dict.has_key(x): 89 | total_dict[x] = 0.0 90 | if not pos_dict.has_key(x): 91 | pos_dict[x] = 0.0 92 | total_dict[x] += 1 93 | if labels[i] == '1': 94 | pos_dict[x] += 1 95 | return total_dict, pos_dict 96 | 97 | def gen_combine_count_dict(data1, data2, labels, begin, end): 98 | total_dict = {} 99 | pos_dict = {} 100 | for i, d in enumerate(data1): 101 | if i >= begin and i < end: 102 | continue 103 | xs = d.split(' ') 104 | xs2 = data2[i].split(' ') 105 | for x1 in xs: 106 | for x2 in xs2: 107 | k = x1+'|'+x2 108 | if not total_dict.has_key(k): 109 | total_dict[k] = 0.0 110 | if not pos_dict.has_key(k): 111 | pos_dict[k] = 0.0 112 | total_dict[k] += 1 113 | if labels[i] == '1': 114 | pos_dict[k] += 1 115 | return total_dict, pos_dict 116 | 117 | def count_feature(train_data, test1_data, test2_data, labels, k, test_only= False): 118 | nums = len(train_data) 119 | interval = nums // k 120 | split_points = [] 121 | for i in range(k): 122 | split_points.append(i * interval) 123 | split_points.append(nums) 124 | 125 | s = set() 126 | for d in train_data: 127 | xs = d.split(' ') 128 | for x in xs: 129 | s.add(x) 130 | b = nums // len(s) 131 | a = b*1.0 / 20 132 | 133 | train_res = [] 134 | if not test_only: 135 | for i in range(k): 136 | tmp = [] 137 | total_dict, pos_dict = gen_count_dict(train_data, labels, split_points[i],split_points[i+1]) 138 | for j in range(split_points[i],split_points[i+1]): 139 | xs = train_data[j].split(' ') 140 | t = [] 141 | for x in xs: 142 | if not total_dict.has_key(x): 143 | t.append(0.05) 144 | continue 145 | t.append((a + pos_dict[x]) / (b + total_dict[x])) 146 | tmp.append(max(t)) 147 | train_res.extend(tmp) 148 | 149 | total_dict, pos_dict = gen_count_dict(train_data, labels, 1, 0) 150 | test1_res = [] 151 | for d in test1_data: 152 | xs = d.split(' ') 153 | t = [] 154 | for x in xs: 155 | if not total_dict.has_key(x): 156 | t.append(0.05) 157 | continue 158 | t.append((a + pos_dict[x]) / (b + total_dict[x])) 159 | test1_res.append(max(t)) 160 | 161 | test2_res = [] 162 | for d in test2_data: 163 | xs = d.split(' ') 164 | t = [] 165 | for x in xs: 166 | if not total_dict.has_key(x): 167 | t.append(0.05) 168 | continue 169 | t.append((a + pos_dict[x]) / (b + total_dict[x])) 170 | test2_res.append(max(t)) 171 | 172 | return train_res, test1_res, test2_res 173 | 174 | def count_combine_feature(train_data1, train_data2, test1_data1, 
test1_data2, 175 | test2_data1, test2_data2, labels, k, test_only = False): 176 | nums = len(train_data1) 177 | interval = nums // k 178 | split_points = [] 179 | for i in range(k): 180 | split_points.append(i * interval) 181 | split_points.append(nums) 182 | 183 | s = set() 184 | for i, d in enumerate(train_data1): 185 | xs = d.split(' ') 186 | xs2 = train_data2[i].split(' ') 187 | for x1 in xs: 188 | for x2 in xs2: 189 | ke = x1 + '|' + x2 190 | s.add(ke) 191 | b = nums // len(s) 192 | a = b*1.0 / 20 193 | 194 | train_res = [] 195 | if not test_only: 196 | for i in range(k): 197 | tmp = [] 198 | total_dict, pos_dict = gen_combine_count_dict(train_data1, train_data2, labels, 199 | split_points[i],split_points[i+1]) 200 | for j in range(split_points[i],split_points[i+1]): 201 | xs = train_data1[j].split(' ') 202 | xs2 = train_data2[j].split(' ') 203 | t = [] 204 | for x1 in xs: 205 | for x2 in xs2: 206 | ke = x1 + '|' + x2 207 | if not total_dict.has_key(ke): 208 | t.append(0.05) 209 | continue 210 | t.append((a + pos_dict[ke]) / (b + total_dict[ke])) 211 | tmp.append(max(t)) 212 | train_res.extend(tmp) 213 | 214 | total_dict, pos_dict = gen_combine_count_dict(train_data1, train_data2, labels, 1, 0) 215 | test1_res = [] 216 | for i,d in enumerate(test1_data1): 217 | xs = d.split(' ') 218 | xs2 = test1_data2[i].split(' ') 219 | t = [] 220 | for x1 in xs: 221 | for x2 in xs2: 222 | ke = x1 + '|' + x2 223 | if not total_dict.has_key(ke): 224 | t.append(0.05) 225 | continue 226 | t.append((a + pos_dict[ke]) / (b + total_dict[ke])) 227 | test1_res.append(max(t)) 228 | 229 | test2_res = [] 230 | for i, d in enumerate(test2_data1): 231 | xs = d.split(' ') 232 | xs2 = test2_data2[i].split(' ') 233 | t = [] 234 | for x1 in xs: 235 | for x2 in xs2: 236 | ke = x1 + '|' + x2 237 | if not total_dict.has_key(ke): 238 | t.append(0.05) 239 | continue 240 | t.append((a + pos_dict[ke]) / (b + total_dict[ke])) 241 | test2_res.append(max(t)) 242 | 243 | return train_res, test1_res, test2_res 244 | 245 | def count_feature_times(train_data, test1_data, test2_data): 246 | total_dict = {} 247 | for i, d in enumerate(train_data): 248 | xs = d.split(' ') 249 | for x in xs: 250 | if not total_dict.has_key(x): 251 | total_dict[x] = 0.0 252 | total_dict[x] += 1 253 | for i, d in enumerate(test1_data): 254 | xs = d.split(' ') 255 | for x in xs: 256 | if not total_dict.has_key(x): 257 | total_dict[x] = 0.0 258 | total_dict[x] += 1 259 | for i, d in enumerate(test2_data): 260 | xs = d.split(' ') 261 | for x in xs: 262 | if not total_dict.has_key(x): 263 | total_dict[x] = 0.0 264 | total_dict[x] += 1 265 | train_res = [] 266 | for d in train_data: 267 | xs = d.split(' ') 268 | t = [] 269 | for x in xs: 270 | t.append(total_dict[x]) 271 | train_res.append(max(t)) 272 | test1_res = [] 273 | for d in test1_data: 274 | xs = d.split(' ') 275 | t = [] 276 | for x in xs: 277 | t.append(total_dict[x]) 278 | test1_res.append(max(t)) 279 | test2_res = [] 280 | for d in test2_data: 281 | xs = d.split(' ') 282 | t = [] 283 | for x in xs: 284 | t.append(total_dict[x]) 285 | test2_res.append(max(t)) 286 | return train_res, test1_res, test2_res 287 | 288 | def count_combine_feature_times(train_data1, train_data2, test1_data1, test1_data2, 289 | test2_data1, test2_data2): 290 | total_dict = {} 291 | for i, d in enumerate(train_data1): 292 | xs = d.split(' ') 293 | xs2 = train_data2[i].split(' ') 294 | for x1 in xs: 295 | for x2 in xs2: 296 | ke = x1 + '|' + x2 297 | if not total_dict.has_key(ke): 298 | total_dict[ke] = 0.0 299 | 
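                # total_dict accumulates, over train + test1 + test2 together, how often
                # each combined 'x1|x2' key occurs; the loops further down then keep,
                # per row, the largest count among all token pairs of the two
                # (possibly multi-valued) fields.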
total_dict[ke] += 1 300 | for i, d in enumerate(test1_data1): 301 | xs = d.split(' ') 302 | xs2 = test1_data2[i].split(' ') 303 | for x1 in xs: 304 | for x2 in xs2: 305 | ke = x1 + '|' + x2 306 | if not total_dict.has_key(ke): 307 | total_dict[ke] = 0.0 308 | total_dict[ke] += 1 309 | for i, d in enumerate(test2_data1): 310 | xs = d.split(' ') 311 | xs2 = test2_data2[i].split(' ') 312 | for x1 in xs: 313 | for x2 in xs2: 314 | ke = x1 + '|' + x2 315 | if not total_dict.has_key(ke): 316 | total_dict[ke] = 0.0 317 | total_dict[ke] += 1 318 | train_res = [] 319 | for i, d in enumerate(train_data1): 320 | xs = d.split(' ') 321 | xs2 = train_data2[i].split(' ') 322 | t = [] 323 | for x1 in xs: 324 | for x2 in xs2: 325 | ke = x1 + '|' + x2 326 | t.append(total_dict[ke]) 327 | train_res.append(max(t)) 328 | test1_res = [] 329 | for i, d in enumerate(test1_data1): 330 | xs = d.split(' ') 331 | xs2 = test1_data2[i].split(' ') 332 | t = [] 333 | for x1 in xs: 334 | for x2 in xs2: 335 | ke = x1 + '|' + x2 336 | t.append(total_dict[ke]) 337 | test1_res.append(max(t)) 338 | test2_res = [] 339 | for i, d in enumerate(test2_data1): 340 | xs = d.split(' ') 341 | xs2 = test2_data2[i].split(' ') 342 | t = [] 343 | for x1 in xs: 344 | for x2 in xs2: 345 | ke = x1 + '|' + x2 346 | t.append(total_dict[ke]) 347 | test2_res.append(max(t)) 348 | return train_res, test1_res, test2_res 349 | 350 | def count_pos_feature(train_data, test1_data, test2_data, labels, k, test_only= False, is_val = False): 351 | nums = len(train_data) 352 | last = nums 353 | if is_val: 354 | last = 5100000 * 8 355 | interval = last // k 356 | split_points = [] 357 | for i in range(k): 358 | split_points.append(i * interval) 359 | split_points.append(last) 360 | count_train_data = train_data[0:last] 361 | count_labels = labels[0:last] 362 | 363 | train_res = [] 364 | if not test_only: 365 | for i in range(k): 366 | print i,"part counting" 367 | print split_points[i], split_points[i+1] 368 | tmp = [] 369 | total_dict, pos_dict = gen_count_dict(count_train_data, count_labels, split_points[i],split_points[i+1]) 370 | for j in range(split_points[i],split_points[i+1]): 371 | xs = train_data[j].split(' ') 372 | t = [] 373 | for x in xs: 374 | if not pos_dict.has_key(x): 375 | t.append(0) 376 | continue 377 | t.append(pos_dict[x] + 1) 378 | tmp.append(max(t)) 379 | train_res.extend(tmp) 380 | 381 | total_dict, pos_dict = gen_count_dict(count_train_data, count_labels, 1, 0) 382 | count_dict = {-1:0} 383 | for key in pos_dict: 384 | if not count_dict.has_key(pos_dict[key]): 385 | count_dict[pos_dict[key]] = 0 386 | count_dict[pos_dict[key]] += 1 387 | 388 | if is_val: 389 | for i in range(last, nums): 390 | xs = train_data[i].split(' ') 391 | t = [] 392 | for x in xs: 393 | if not total_dict.has_key(x): 394 | t.append(0) 395 | continue 396 | t.append(pos_dict[x] + 1) 397 | train_res.append(max(t)) 398 | 399 | test1_res = [] 400 | for d in test1_data: 401 | xs = d.split(' ') 402 | t = [] 403 | for x in xs: 404 | if not pos_dict.has_key(x): 405 | t.append(0) 406 | continue 407 | t.append(pos_dict[x] + 1) 408 | test1_res.append(max(t)) 409 | 410 | test2_res = [] 411 | for d in test2_data: 412 | xs = d.split(' ') 413 | t = [] 414 | for x in xs: 415 | if not pos_dict.has_key(x): 416 | t.append(0) 417 | continue 418 | t.append(pos_dict[x] + 1) 419 | test2_res.append(max(t)) 420 | 421 | return train_res, test1_res, test2_res 422 | 423 | def combine_to_one(data1, data2): 424 | assert len(data1) == len(data2) 425 | new_res = [] 426 | for i, d in 
enumerate(data1): 427 | x1 = data1[i] 428 | x2 = data2[i] 429 | new_x = x1 + '|' + x2 430 | new_res.append(new_x) 431 | return new_res 432 | 433 | def add_len(data): 434 | res = [] 435 | for i, d in enumerate(data): 436 | if d=='-1': 437 | res.append(0) 438 | else: 439 | xs = d.split(' ') 440 | res.append(len(xs)) 441 | return res 442 | 443 | def write_data(path, data, is_six = True): 444 | f = open(path, 'wb') 445 | for d in data: 446 | if is_six: 447 | f.write("%.6f" % (d) + '\n') 448 | else: 449 | f.write("%.1f" % (d) + '\n') 450 | f.close() 451 | 452 | def write_label(path, data): 453 | f = open(path, 'wb') 454 | for d in data: 455 | f.write(d+'\n') 456 | f.close() 457 | 458 | 459 | 460 | print "reading train data" 461 | train_dict, train_num = read_data(args.random_combine_train_path_with_chusai) 462 | print "reading test1 data" 463 | test1_dict, test1_num = read_data(args.combine_test1_path) 464 | print "reading test2 data" 465 | test2_dict, test2_num = read_data(args.combine_test2_path) 466 | print train_num, test1_num, test2_num 467 | print "\n\n" 468 | 469 | 470 | if part == 1: 471 | print "writing labels" 472 | write_label(args.gbdt_data_path + 'train/label.csv', train_dict['label']) 473 | write_label(args.gbdt_data_path + 'test1/label.csv', test1_dict['label']) 474 | write_label(args.gbdt_data_path + 'test2/label.csv', test2_dict['label']) 475 | print "\n\n" 476 | 477 | print "adding len feature" 478 | len_feature = ['interest1','interest2','interest3','interest4','interest5', 479 | 'kw1','kw2','kw3','topic1','topic2','topic3','appIdAction','appIdInstall'] 480 | if part == 1: 481 | for f in len_feature: 482 | print f + '_len adding' 483 | b_time = time.time() 484 | train_res = add_len(train_dict[f]) 485 | test1_res = add_len(test1_dict[f]) 486 | test2_res = add_len(test2_dict[f]) 487 | write_data(args.gbdt_data_path + 'train/' + f + '_len.csv', train_res) 488 | write_data(args.gbdt_data_path + 'test1/' + f + '_len.csv', test1_res) 489 | write_data(args.gbdt_data_path + 'test2/' + f + '_len.csv', test2_res) 490 | print "costs %.1f s" % (time.time() - b_time) 491 | print "\n\n" 492 | 493 | print "counting one rate feature" 494 | user_one_feature = ['LBS','age','carrier','consumptionAbility','education','gender','house','os','ct','interest1', 495 | 'interest2','interest3','interest4','interest5','marriageStatus','topic1','topic2','topic3', 496 | 'kw1','kw2','kw3','appIdAction','appIdInstall'] 497 | ad_one_feature = ['aid','advertiserId','campaignId', 'creativeId','creativeSize', 498 | 'adCategoryId', 'productId', 'productType'] 499 | if part == 1: 500 | for f in user_one_feature: 501 | print f, "rate preparing" 502 | b_time = time.time() 503 | train_res, test1_res, test2_res = count_feature(train_dict[f], test1_dict[f], test2_dict[f], train_dict['label'], 5) 504 | write_data(args.gbdt_data_path + 'train/' + f + '_rate_count.csv', train_res) 505 | write_data(args.gbdt_data_path + 'test1/' + f + '_rate_count.csv', test1_res) 506 | write_data(args.gbdt_data_path + 'test2/' + f + '_rate_count.csv', test2_res) 507 | print "costs %.1f s" % (time.time() - b_time) 508 | for f in ad_one_feature: 509 | print f, "rate preparing" 510 | b_time = time.time() 511 | train_res, test1_res, test2_res = count_feature(train_dict[f], test1_dict[f], test2_dict[f], train_dict['label'], 5) 512 | write_data(args.gbdt_data_path + 'train/' + f + '_rate_count.csv', train_res) 513 | write_data(args.gbdt_data_path + 'test1/' + f + '_rate_count.csv', test1_res) 514 | write_data(args.gbdt_data_path + 'test2/' + f 
+ '_rate_count.csv', test2_res) 515 | print "costs %.1f s" % (time.time() - b_time) 516 | print "\n\n" 517 | 518 | print "counting combine rate" 519 | user_combine_feature = ['LBS','age','carrier','consumptionAbility','education','gender','house','os','ct','interest1', 520 | 'interest2','interest3','interest4','interest5','marriageStatus','topic1','topic2','topic3', 521 | 'kw1','kw2','kw3'] 522 | finsh_feature_2 = ['LBS'] 523 | finsh_feature_3 = ['LBS','age','carrier','consumptionAbility','education','gender'] 524 | 525 | ad_combine_feature = ['aid','advertiserId','campaignId', 'creativeId','creativeSize', 526 | 'adCategoryId', 'productId', 'productType'] 527 | if part == 2 or part == 7 or part == 8 or part == 9 or part == 10 or part == 11: 528 | for i, f1 in enumerate(user_combine_feature): 529 | if part == 2 and f1 not in ['age','carrier']: 530 | continue 531 | if part == 7 and f1 not in ['consumptionAbility','education']: 532 | continue 533 | if part == 8 and f1 not in ['gender','house','os']: 534 | continue 535 | if part == 9 and f1 not in ['ct','interest1','interest2','interest3']: 536 | continue 537 | if part == 10 and f1 not in ['interest4','interest5','marriageStatus']: 538 | continue 539 | if part == 11 and f1 not in ['topic1','topic2','topic3','kw1','kw2','kw3']: 540 | continue 541 | for f2 in user_combine_feature[i+1:]: 542 | print f1+'-'+f2+" counting preparing" 543 | b_time = time.time() 544 | train_res, test1_res, test2_res = count_combine_feature(train_dict[f1], train_dict[f2], 545 | test1_dict[f1], test1_dict[f2], 546 | test2_dict[f1], test2_dict[f2], 547 | train_dict['label'],5) 548 | write_data(args.gbdt_data_path + 'train/' + f1 + '-' + f2 + '_rate_count.csv', train_res) 549 | write_data(args.gbdt_data_path + 'test1/' + f1 + '-' + f2 + '_rate_count.csv', test1_res) 550 | write_data(args.gbdt_data_path + 'test2/' + f1 + '-' + f2 + '_rate_count.csv', test2_res) 551 | print "costs %.1f s" % (time.time() - b_time) 552 | 553 | if part == 3 or part == 12 or part == 13: 554 | for i, f1 in enumerate(user_combine_feature): 555 | if part == 3 and f1 not in ['house','os','ct','interest1','interest2']: 556 | continue 557 | if part == 12 and f1 not in ['interest3','interest4','interest5','marriageStatus','topic1']: 558 | continue 559 | if part == 13 and f1 not in ['topic2','topic3','kw1','kw2','kw3']: 560 | continue 561 | for f2 in ad_combine_feature: 562 | print f1 + '-' + f2 + " counting preparing" 563 | b_time = time.time() 564 | train_res, test1_res, test2_res = count_combine_feature(train_dict[f1], train_dict[f2], 565 | test1_dict[f1], test1_dict[f2], 566 | test2_dict[f1], test2_dict[f2], 567 | train_dict['label'],5) 568 | write_data(args.gbdt_data_path + 'train/' + f1 + '-' + f2 + '_rate_count.csv', train_res) 569 | write_data(args.gbdt_data_path + 'test1/' + f1 + '-' + f2 + '_rate_count.csv', test1_res) 570 | write_data(args.gbdt_data_path + 'test2/' + f1 + '-' + f2 + '_rate_count.csv', test2_res) 571 | print "costs %.1f s" % (time.time() - b_time) 572 | print "\n\n" 573 | 574 | print "counting one times" 575 | user_time_feature = ['uid','LBS','age','carrier','consumptionAbility','education','gender','house','os','ct','interest1', 576 | 'interest2','interest3','interest4','interest5','marriageStatus','topic1','topic2','topic3', 577 | 'kw1','kw2','kw3','appIdAction','appIdInstall'] 578 | ad_time_feature = ['aid','advertiserId','campaignId', 'creativeId','creativeSize', 579 | 'adCategoryId', 'productId', 'productType'] 580 | if part == 4: 581 | for f in user_time_feature: 582 | 
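        # Part 4 writes the plain frequency features ('*_times_count'): for every user-
        # and ad-side field, count_feature_times() tallies how often each value occurs
        # across train, test1 and test2 combined and keeps, per row, the largest count
        # among the row's tokens.  These CSVs become the '*_times_count' columns that
        # models/lgb.py appends to col_names.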
print f, "times preparing" 583 | b_time = time.time() 584 | train_res, test1_res, test2_res = count_feature_times(train_dict[f], test1_dict[f], test2_dict[f]) 585 | write_data(args.gbdt_data_path + 'train/' + f + '_times_count.csv', train_res, False) 586 | write_data(args.gbdt_data_path + 'test1/' + f + '_times_count.csv', test1_res, False) 587 | write_data(args.gbdt_data_path + 'test2/' + f + '_times_count.csv', test2_res, False) 588 | print "costs %.1f s" % (time.time() - b_time) 589 | for f in ad_time_feature: 590 | print f, "times preparing" 591 | b_time = time.time() 592 | train_res, test1_res, test2_res = count_feature_times(train_dict[f], test1_dict[f], test2_dict[f]) 593 | write_data(args.gbdt_data_path + 'train/' + f + '_times_count.csv', train_res, False) 594 | write_data(args.gbdt_data_path + 'test1/' + f + '_times_count.csv', test1_res, False) 595 | write_data(args.gbdt_data_path + 'test2/' + f + '_times_count.csv', test2_res, False) 596 | print "costs %.1f s" % (time.time() - b_time) 597 | print "\n\n" 598 | 599 | print "counting combine times" 600 | user_combine_feature = ['LBS','age','carrier','consumptionAbility','education','gender','house','os','ct','interest1', 601 | 'interest2','interest3','interest4','interest5','marriageStatus','topic1','topic2','topic3', 602 | 'kw1','kw2','kw3'] 603 | finsh_feature_5 = ['LBS','age','carrier','consumptionAbility','education','gender'] 604 | finsh_feature_5_2 = ['os','ct','interest1'] 605 | finsh_feature_6 = ['LBS','age','carrier','consumptionAbility','education','gender','house','os','ct','interest1', 606 | 'interest2','interest3','interest4'] 607 | ad_combine_feature = ['aid','advertiserId','campaignId', 'creativeId','creativeSize', 608 | 'adCategoryId', 'productId', 'productType'] 609 | if part == 5: 610 | for i, f1 in enumerate(user_combine_feature): 611 | for f2 in user_combine_feature[i+1:]: 612 | print f1+'-'+f2+" times preparing" 613 | b_time = time.time() 614 | train_res, test1_res, test2_res = count_combine_feature_times(train_dict[f1], train_dict[f2], 615 | test1_dict[f1], test1_dict[f2], 616 | test2_dict[f1], test2_dict[f2]) 617 | write_data(args.gbdt_data_path + 'train/' + f1 + '-' + f2 + '_times_count.csv', train_res, False) 618 | write_data(args.gbdt_data_path + 'test1/' + f1 + '-' + f2 + '_times_count.csv', test1_res, False) 619 | write_data(args.gbdt_data_path + 'test2/' + f1 + '-' + f2 + '_times_count.csv', test2_res, False) 620 | print "costs %.1f s" % (time.time() - b_time) 621 | if part == 6: 622 | for i, f1 in enumerate(user_combine_feature): 623 | for f2 in ad_combine_feature: 624 | print f1+'-'+f2+" times preparing" 625 | b_time = time.time() 626 | train_res, test1_res, test2_res = count_combine_feature_times(train_dict[f1], train_dict[f2], 627 | test1_dict[f1], test1_dict[f2], 628 | test2_dict[f1], test2_dict[f2]) 629 | write_data(args.gbdt_data_path + 'train/' + f1 + '-' + f2 + '_times_count.csv', train_res, False) 630 | write_data(args.gbdt_data_path + 'test1/' + f1 + '-' + f2 + '_times_count.csv', test1_res, False) 631 | write_data(args.gbdt_data_path + 'test2/' + f1 + '-' + f2 + '_times_count.csv', test2_res, False) 632 | print "costs %.1f s" % (time.time() - b_time) 633 | 634 | -------------------------------------------------------------------------------- /utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nzc/tencent-contest/da87f68c1ca1c63cd11cb069079bcbe0375d3897/utils/__init__.py 
-------------------------------------------------------------------------------- /utils/args.py: -------------------------------------------------------------------------------- 1 | """ 2 | args parameters 3 | """ 4 | import itertools 5 | 6 | 7 | 8 | class args: 9 | 10 | root_data_path = '../data/' 11 | 12 | """ 13 | converters args 14 | """ 15 | ad_feature_path = root_data_path + 'adFeature.csv' 16 | user_feature_path = root_data_path + 'userFeature.data' 17 | raw_train_path = root_data_path + 'train.csv' 18 | raw_test1_path = root_data_path + 'test1.csv' 19 | raw_test2_path = root_data_path + 'test2.csv' 20 | combine_train_path = root_data_path + 'combine_train.csv' 21 | combine_test1_path = root_data_path + 'combine_test1.csv' 22 | combine_test2_path = root_data_path + 'combine_test2.csv' 23 | random_combine_train_path = root_data_path + 'random_combine_train.csv' 24 | random_combine_train_path_with_chusai = root_data_path + 'random_combine_train_with_chusai.csv' 25 | 26 | gbdt_data_path = root_data_path + 'gbdt/' 27 | 28 | dnn_data_path = root_data_path + 'dnn/' 29 | 30 | """ 31 | raw feature 32 | """ 33 | dynamic_dict = {'interest1':61, 'interest2':33, 'interest3':10, 34 | 'interest4':10, 'interest5':86, 'kw1':5, 'kw2':5, 'kw3':5, 35 | 'topic1':5, 'topic2':5, 'topic3':5, 'appIdInstall':908, 36 | 'appIdAction':823, 'ct':4, 'os':2, 'marriageStatus':3} 37 | 38 | user_static_features = ['uid', 'house', 'education', 'LBS', 'consumptionAbility', 39 | 'gender', 'age', 'carrier'] 40 | ad_static_features = ['aid', 'advertiserId', 'campaignId', 'creativeId', 'creativeSize', 41 | 'adCategoryId','productId', 'productType'] 42 | user_dynamic_features = [key for key in dynamic_dict] 43 | 44 | """ 45 | dnn args 46 | """ 47 | # data processing part 48 | dynamic_max_len = 30 49 | 50 | static_features = ['interest1_len','interest2_len','interest5_len','aid', 'advertiserId', 'campaignId', 'creativeId', 'creativeSize', 'adCategoryId', 51 | 'productId', 'productType', 'LBS', 'age', 'carrier', 52 | 'consumptionAbility', 'education', 'gender', 'house','age_aid','gender_aid','uid_seq'] 53 | uid_count_feature = ['uid_uid_times_count','uid_uid_pos_times_count_5_fold_all','uid_adCategoryId_pos_times_count_5_fold_all', 54 | 'uid_advertiserId_pos_times_count_5_fold_all','uid_campaignId_pos_times_count_5_fold_all', 55 | 'uid_creativeId_pos_times_count_5_fold_all','uid_creativeSize_pos_times_count_5_fold_all', 56 | 'uid_productId_pos_times_count_5_fold_all','uid_productType_pos_times_count_5_fold_all', 57 | 'uid_adCategoryId_times_count','uid_advertiserId_times_count', 58 | 'uid_campaignId_times_count','uid_creativeId_times_count', 59 | 'uid_creativeSize_times_count','uid_productType_times_count', 60 | 'uid_productId_times_count'] 61 | static_features.extend(uid_count_feature) 62 | 63 | dynamic_features = ['kw1', 'kw2','kw3', 'topic1', 'topic2', 'topic3','marriageStatus', 64 | 'os', 'ct', 'interest1', 'interest2', 65 | 'interest3', 'interest4', 'interest5','interest1_aid','interest2_aid','marriageStatus_aid'] 66 | 67 | dynamic_features_max_len_dict = {'interest1':61, 'interest2':33, 'interest3':10, 68 | 'interest4':10, 'interest5':86, 'kw1':5, 'kw2':5, 'kw3':5, 69 | 'topic1':5, 'topic2':5, 'topic3':5, 'ct':4, 'os':2, 'marriageStatus':3, 70 | 'os_aid':2, 'ct_aid':4, 'marriageStatus_aid':3,"interest1_aid":61, 71 | 'interest2_aid':33, 'interest3_aid':10, 'interest4_aid':10, 'interest5_aid':86, 72 | 'LBS_kw1':5, 'topic1_topic2':25} 73 | aid_combine_feature = 
['LBS','age','carrier','consumptionAbility','education','gender','house', 74 | 'os', 'ct', 'marriageStatus', 'interest1', 'interest2', 'interest3', 'interest4', 'interest5'] 75 | uid_pos_feature = ['uid_pos_times_count_5_fold','log_uid_pos_times_count_5_fold', 76 | 'uid_pos_times_count_5_fold_all','log_uid_pos_times_count_5_fold_all', 77 | ] 78 | exclusive_cols = [] 79 | exclusive_cols.extend(itertools.permutations(['aid', 'advertiserId', 'campaignId', 'creativeId', 80 | 'creativeSize', 'adCategoryId', 'productId', 'productType'], 2)) 81 | for f in aid_combine_feature: 82 | exclusive_cols.append(('aid',f+'_aid')) 83 | exclusive_cols.append((f + '_aid', 'aid')) 84 | exclusive_cols.append((f, f + '_aid')) 85 | exclusive_cols.append((f + '_aid', f)) 86 | # for f1 in uid_pos_feature: 87 | # for f2 in aid_combine_feature: 88 | # exclusive_cols.extend([(f1,f2+'_aid'),(f2+'_aid',f1)]) 89 | # exclusive_cols.extend([(f1,'LBS_kw1'), ('LBS_kw1', f1), ('topic1_topic2',f1),(f1,'topic1_topic2')]) 90 | exclusive_cols.extend([('kw1', 'LBS_kw1'), ('LBS_kw1', 'kw1'), ('LBS','LBS_kw1'), ('LBS_kw1','LBS')]) 91 | exclusive_cols.extend([('topic1','topic1_topic2'),('topic1_topic2','topic1'),('topic2','topic1_topic2'),('topic1_topic2','topic2')]) 92 | # exclusive_cols.extend([('interest1_len','interest1'),('interest1','interest1_len'),('interest2','interest2_len'),('interest2_len','interest2')]) 93 | # exclusive_cols.extend([('advertiserId','log_uid_advertiserId_times_count'),('log_uid_advertiserId_times_count','advertiserId')]) 94 | # exclusive_cols.extend([('advertiserId', 'uid_advertiserId_times_count'), ('uid_advertiserId_times_count', 'advertiserId')]) 95 | exclusive_cols.extend([('LBS_kw1','uid_pos_times_count_5_fold'),('uid_pos_times_count_5_fold','LBS_kw1')]) 96 | 97 | 98 | 99 | extern_lr_features = [] 100 | 101 | batch_size = 1024 102 | epochs = 1 103 | 104 | lr = 0.0002 105 | 106 | -------------------------------------------------------------------------------- /utils/args2.py: -------------------------------------------------------------------------------- 1 | 2 | """ 3 | args parameters 4 | """ 5 | import itertools 6 | 7 | 8 | 9 | class args: 10 | 11 | root_data_path = '../data/' 12 | 13 | """ 14 | converters args 15 | """ 16 | ad_feature_path = root_data_path + 'adFeature.csv' 17 | user_feature_path = root_data_path + 'userFeature.data' 18 | raw_train_path = root_data_path + 'train.csv' 19 | raw_test1_path = root_data_path + 'test1.csv' 20 | raw_test2_path = root_data_path + 'test2.csv' 21 | combine_train_path = root_data_path + 'combine_train.csv' 22 | combine_test1_path = root_data_path + 'combine_test1.csv' 23 | combine_test2_path = root_data_path + 'combine_test2.csv' 24 | random_combine_train_path = root_data_path + 'random_combine_train.csv' 25 | random_combine_train_path_with_chusai = root_data_path + 'random_combine_train_with_chusai.csv' 26 | 27 | gbdt_data_path = root_data_path + 'gbdt/' 28 | 29 | dnn_data_path = root_data_path + 'dnn/' 30 | 31 | """ 32 | raw feature 33 | """ 34 | dynamic_dict = {'interest1':61, 'interest2':33, 'interest3':10, 35 | 'interest4':10, 'interest5':86, 'kw1':5, 'kw2':5, 'kw3':5, 36 | 'topic1':5, 'topic2':5, 'topic3':5, 'appIdInstall':908, 37 | 'appIdAction':823, 'ct':4, 'os':2, 'marriageStatus':3} 38 | 39 | user_static_features = ['uid', 'house', 'education', 'LBS', 'consumptionAbility', 40 | 'gender', 'age', 'carrier'] 41 | ad_static_features = ['aid', 'advertiserId', 'campaignId', 'creativeId', 'creativeSize', 42 | 'adCategoryId','productId', 'productType'] 43 
| user_dynamic_features = [key for key in dynamic_dict] 44 | 45 | """ 46 | dnn args 47 | """ 48 | # data processing part 49 | dynamic_max_len = 30 50 | 51 | static_features = ['interest1_len','interest2_len','interest5_len','aid', 'advertiserId', 'campaignId', 'creativeId', 'creativeSize', 'adCategoryId', 52 | 'productId', 'productType', 'LBS', 'age', 'carrier', 53 | 'consumptionAbility', 'education', 'gender', 'house','age_aid','gender_aid','uid_seq'] 54 | uid_count_feature = ['uid_uid_times_count_with_chusai_test','log_uid_uid_pos_times_count_5_fold_all','log_uid_adCategoryId_pos_times_count_5_fold_all', 55 | 'log_uid_advertiserId_pos_times_count_5_fold_all','log_uid_campaignId_pos_times_count_5_fold_all', 56 | 'log_uid_creativeId_pos_times_count_5_fold_all','log_uid_creativeSize_pos_times_count_5_fold_all', 57 | 'log_uid_productId_pos_times_count_5_fold_all','log_uid_productType_pos_times_count_5_fold_all', 58 | 'uid_adCategoryId_times_count_with_chusai_test','uid_advertiserId_times_count_with_chusai_test', 59 | 'uid_campaignId_times_count_with_chusai_test','uid_creativeId_times_count_with_chusai_test', 60 | 'uid_creativeSize_times_count_with_chusai_test','uid_productType_times_count_with_chusai_test', 61 | 'uid_productId_times_count_with_chusai_test'] 62 | static_features.extend(uid_count_feature) 63 | 64 | dynamic_features = ['kw1', 'kw2','kw3', 'topic1', 'topic2', 'topic3','marriageStatus', 65 | 'os', 'ct', 'interest1', 'interest2', 66 | 'interest3', 'interest4', 'interest5','interest1_aid','interest2_aid','marriageStatus_aid'] 67 | 68 | dynamic_features_max_len_dict = {'interest1':61, 'interest2':33, 'interest3':10, 69 | 'interest4':10, 'interest5':86, 'kw1':5, 'kw2':5, 'kw3':5, 70 | 'topic1':5, 'topic2':5, 'topic3':5, 'ct':4, 'os':2, 'marriageStatus':3, 71 | 'os_aid':2, 'ct_aid':4, 'marriageStatus_aid':3,"interest1_aid":61, 72 | 'interest2_aid':33, 'interest3_aid':10, 'interest4_aid':10, 'interest5_aid':86, 73 | 'LBS_kw1':5, 'topic1_topic2':25} 74 | aid_combine_feature = ['LBS','age','carrier','consumptionAbility','education','gender','house', 75 | 'os', 'ct', 'marriageStatus', 'interest1', 'interest2', 'interest3', 'interest4', 'interest5'] 76 | uid_pos_feature = ['uid_pos_times_count_5_fold','log_uid_pos_times_count_5_fold', 77 | 'uid_pos_times_count_5_fold_all','log_uid_pos_times_count_5_fold_all', 78 | ] 79 | exclusive_cols = [] 80 | exclusive_cols.extend(itertools.permutations(['aid', 'advertiserId', 'campaignId', 'creativeId', 81 | 'creativeSize', 'adCategoryId', 'productId', 'productType'], 2)) 82 | for f in aid_combine_feature: 83 | exclusive_cols.append(('aid',f+'_aid')) 84 | exclusive_cols.append((f + '_aid', 'aid')) 85 | exclusive_cols.append((f, f + '_aid')) 86 | exclusive_cols.append((f + '_aid', f)) 87 | # for f1 in uid_pos_feature: 88 | # for f2 in aid_combine_feature: 89 | # exclusive_cols.extend([(f1,f2+'_aid'),(f2+'_aid',f1)]) 90 | # exclusive_cols.extend([(f1,'LBS_kw1'), ('LBS_kw1', f1), ('topic1_topic2',f1),(f1,'topic1_topic2')]) 91 | exclusive_cols.extend([('kw1', 'LBS_kw1'), ('LBS_kw1', 'kw1'), ('LBS','LBS_kw1'), ('LBS_kw1','LBS')]) 92 | exclusive_cols.extend([('topic1','topic1_topic2'),('topic1_topic2','topic1'),('topic2','topic1_topic2'),('topic1_topic2','topic2')]) 93 | # exclusive_cols.extend([('interest1_len','interest1'),('interest1','interest1_len'),('interest2','interest2_len'),('interest2_len','interest2')]) 94 | # exclusive_cols.extend([('advertiserId','log_uid_advertiserId_times_count'),('log_uid_advertiserId_times_count','advertiserId')]) 95 | # 
exclusive_cols.extend([('advertiserId', 'uid_advertiserId_times_count'), ('uid_advertiserId_times_count', 'advertiserId')]) 96 | exclusive_cols.extend([('LBS_kw1','uid_pos_times_count_5_fold'),('uid_pos_times_count_5_fold','LBS_kw1')]) 97 | 98 | 99 | 100 | extern_lr_features = [] 101 | 102 | batch_size = 1024 103 | epochs = 1 104 | 105 | lr = 0.0002 106 | -------------------------------------------------------------------------------- /utils/args3.py: -------------------------------------------------------------------------------- 1 | """ 2 | args parameters 3 | """ 4 | import itertools 5 | 6 | 7 | 8 | class args: 9 | 10 | root_data_path = '../data/' 11 | 12 | """ 13 | converters args 14 | """ 15 | ad_feature_path = root_data_path + 'adFeature.csv' 16 | user_feature_path = root_data_path + 'userFeature.data' 17 | raw_train_path = root_data_path + 'train.csv' 18 | raw_test1_path = root_data_path + 'test1.csv' 19 | raw_test2_path = root_data_path + 'test2.csv' 20 | combine_train_path = root_data_path + 'combine_train.csv' 21 | combine_test1_path = root_data_path + 'combine_test1.csv' 22 | combine_test2_path = root_data_path + 'combine_test2.csv' 23 | random_combine_train_path = root_data_path + 'random_combine_train.csv' 24 | 25 | gbdt_data_path = root_data_path + 'gbdt/' 26 | 27 | dnn_data_path = root_data_path + 'nzc_dnn/' 28 | 29 | """ 30 | raw feature 31 | """ 32 | dynamic_dict = {'interest1':61, 'interest2':33, 'interest3':10, 33 | 'interest4':10, 'interest5':86, 'kw1':5, 'kw2':5, 'kw3':5, 34 | 'topic1':5, 'topic2':5, 'topic3':5, 'appIdInstall':908, 35 | 'appIdAction':823, 'ct':4, 'os':2, 'marriageStatus':3} 36 | 37 | user_static_features = ['uid', 'house', 'education', 'LBS', 'consumptionAbility', 38 | 'gender', 'age', 'carrier'] 39 | ad_static_features = ['aid', 'advertiserId', 'campaignId', 'creativeId', 'creativeSize', 40 | 'adCategoryId','productId', 'productType'] 41 | user_dynamic_features = [key for key in dynamic_dict] 42 | 43 | """ 44 | dnn args 45 | """ 46 | # data processing part 47 | dynamic_max_len = 30 48 | 49 | static_features = ['uid_uid_times_count','interest1_len','interest2_len','interest5_len','aid', 'advertiserId', 'campaignId', 'creativeId', 'creativeSize', 'adCategoryId', 50 | 'productId', 'productType', 'LBS', 'age', 'carrier', 51 | 'consumptionAbility', 'education', 'gender', 'house'] 52 | uid_count_feature = ['log_uid_uid_pos_times_count_5_fold_all', 'log_uid_adCategoryId_pos_times_count_5_fold_all', 53 | 'log_uid_advertiserId_pos_times_count_5_fold_all', 'log_uid_campaignId_pos_times_count_5_fold_all', 54 | 'log_uid_creativeId_pos_times_count_5_fold_all', 'log_uid_creativeSize_pos_times_count_5_fold_all', 55 | 'log_uid_productId_pos_times_count_5_fold_all', 'log_uid_productType_pos_times_count_5_fold_all', 56 | 'uid_adCategoryId_times_count_with_chusai_test', 'uid_advertiserId_times_count_with_chusai_test', 57 | 'uid_campaignId_times_count_with_chusai_test', 'uid_creativeId_times_count_with_chusai_test', 58 | 'uid_creativeSize_times_count_with_chusai_test', 'uid_productType_times_count_with_chusai_test', 59 | 'uid_productId_times_count_with_chusai_test'] 60 | static_features.extend(uid_count_feature) 61 | 62 | dynamic_features = ['kw1', 'kw2', 'kw3', 'topic1', 'topic2', 'topic3','marriageStatus', 63 | 'os', 'ct', 'interest1', 'interest2', 64 | 'interest3', 'interest4', 'interest5'] 65 | 66 | dynamic_features_max_len_dict = {'interest1':61, 'interest2':33, 'interest3':10, 67 | 'interest4':10, 'interest5':86, 'kw1':5, 'kw2':5, 'kw3':5, 68 | 'topic1':5, 
'topic2':5, 'topic3':5, 'ct':4, 'os':2, 'marriageStatus':3, 69 | 'os_aid':2, 'ct_aid':4, 'marriageStatus_aid':3,"interest1_aid":61, 70 | 'interest2_aid':33, 'interest3_aid':10, 'interest4_aid':10, 'interest5_aid':86, 71 | 'LBS_kw1':5, 'topic1_topic2':25} 72 | aid_combine_feature = ['LBS','age','carrier','consumptionAbility','education','gender','house', 73 | 'os', 'ct', 'marriageStatus', 'interest1', 'interest2', 'interest3', 'interest4', 'interest5'] 74 | uid_pos_feature = ['uid_pos_times_count_5_fold','log_uid_pos_times_count_5_fold', 75 | 'uid_pos_times_count_10_fold','log_uid_pos_times_count_10_fold', 76 | 'uid_pos_times_count_5_fold_all','log_uid_pos_times_count_5_fold_all', 77 | 'uid_pos_times_count_10_fold_all', 'log_uid_pos_times_count_10_fold_all'] 78 | exclusive_cols = [] 79 | exclusive_cols.extend(itertools.permutations(['aid', 'advertiserId', 'campaignId', 'creativeId', 80 | 'creativeSize', 'adCategoryId', 'productId', 'productType'], 2)) 81 | for f in aid_combine_feature[9:]: 82 | exclusive_cols.append(('aid',f+'_aid')) 83 | exclusive_cols.append((f + '_aid', 'aid')) 84 | exclusive_cols.append((f, f + '_aid')) 85 | exclusive_cols.append((f + '_aid', f)) 86 | for f1 in uid_pos_feature: 87 | for f2 in aid_combine_feature: 88 | exclusive_cols.extend([(f1,f2+'_aid'),(f2+'_aid',f1)]) 89 | exclusive_cols.extend([(f1,'LBS_kw1'), ('LBS_kw1', f1), ('topic1_topic2',f1),(f1,'topic1_topic2')]) 90 | exclusive_cols.extend([('kw1', 'LBS_kw1'), ('LBS_kw1', 'kw1'), ('LBS','LBS_kw1'), ('LBS_kw1','LBS')]) 91 | exclusive_cols.extend([('topic1','topic1_topic2'),('topic1_topic2','topic1'),('topic2','topic1_topic2'),('topic1_topic2','topic2')]) 92 | exclusive_cols.extend([('interest1_len','interest1'),('interest1','interest1_len'),('interest2','interest2_len'),('interest2_len','interest2')]) 93 | exclusive_cols.extend([('advertiserId','log_uid_advertiserId_times_count'),('log_uid_advertiserId_times_count','advertiserId')]) 94 | exclusive_cols.extend([('advertiserId', 'uid_advertiserId_times_count'), ('uid_advertiserId_times_count', 'advertiserId')]) 95 | exclusive_cols.extend([('LBS_kw1','uid_pos_times_count_5_fold'),('uid_pos_times_count_5_fold','LBS_kw1')]) 96 | 97 | 98 | 99 | extern_lr_features = [] 100 | 101 | batch_size = 1024 102 | epochs = 1 103 | 104 | lr = 0.0002 105 | -------------------------------------------------------------------------------- /utils/donal_args.py: -------------------------------------------------------------------------------- 1 | """ 2 | args parameters 3 | """ 4 | import itertools 5 | 6 | 7 | 8 | class args: 9 | 10 | root_data_path = 'data/' 11 | 12 | """ 13 | converters args 14 | """ 15 | ad_feature_path = root_data_path + 'adFeature.csv' 16 | user_feature_path = root_data_path + 'userFeature.data' 17 | raw_train_path = root_data_path + 'train.csv' 18 | raw_test1_path = root_data_path + 'test1.csv' 19 | raw_test2_path = root_data_path + 'test2.csv' 20 | combine_train_path = root_data_path + 'combine_train.csv' 21 | combine_test1_path = root_data_path + 'combine_test1.csv' 22 | combine_test2_path = root_data_path + 'combine_test2.csv' 23 | chusai_ad_feature_path = root_data_path + 'chusai_adFeature.csv' 24 | chusai_user_feature_path = root_data_path + 'chusai_userFeature.data' 25 | chusai_raw_train_path = root_data_path + 'chusai_train.csv' 26 | chusai_raw_test1_path = root_data_path + 'chusai_test1.csv' 27 | chusai_raw_test2_path = root_data_path + 'chusai_test2.csv' 28 | chusai_combine_train_path = root_data_path + 'chusai_combine_train.csv' 29 | 
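    # These chusai_* paths mirror the final-round files above and are the ones
    # pre-csv.py switches to when its state argument is non-zero; the merged and
    # shuffled training file (presumably written by combine_and_shuffle.py) is
    # random_combine_train_path_with_chusai, defined a few lines below.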
chusai_combine_test1_path = root_data_path + 'chusai_combine_test1.csv' 30 | chusai_combine_test2_path = root_data_path + 'chusai_combine_test2.csv' 31 | random_combine_train_path_with_chusai = root_data_path + 'random_combine_train_with_chusai.csv' 32 | 33 | gbdt_data_path = root_data_path + 'gbdt/' 34 | 35 | dnn_data_path = root_data_path + 'dnn/' 36 | 37 | """ 38 | raw feature 39 | """ 40 | dynamic_dict = {'interest1':61, 'interest2':33, 'interest3':10, 41 | 'interest4':10, 'interest5':86, 'kw1':5, 'kw2':5, 'kw3':5, 42 | 'topic1':5, 'topic2':5, 'topic3':5, 'appIdInstall':908, 43 | 'appIdAction':823, 'ct':4, 'os':2, 'marriageStatus':3} 44 | 45 | user_static_features = ['uid', 'house', 'education', 'LBS', 'consumptionAbility', 46 | 'gender', 'age', 'carrier'] 47 | ad_static_features = ['aid', 'advertiserId', 'campaignId', 'creativeId', 'creativeSize', 48 | 'adCategoryId','productId', 'productType'] 49 | user_dynamic_features = [key for key in dynamic_dict] 50 | 51 | """ 52 | dnn args 53 | """ 54 | # data processing part 55 | dynamic_max_len = 30 56 | 57 | static_features = ['interest1_len','interest2_len','interest5_len','aid', 'advertiserId', 'campaignId', 'creativeId', 'creativeSize', 'adCategoryId', 58 | 'productId', 'productType', 'LBS', 'age', 'carrier', 59 | 'consumptionAbility', 'education', 'gender', 'house','age_aid','gender_aid','uid_seq'] 60 | uid_count_feature = ['log_uid_uid_times_count','log_uid_uid_pos_times_count_5_fold_all','log_uid_adCategoryId_pos_times_count_5_fold_all', 61 | 'log_uid_advertiserId_pos_times_count_5_fold_all','log_uid_campaignId_pos_times_count_5_fold_all', 62 | 'log_uid_creativeId_pos_times_count_5_fold_all','log_uid_creativeSize_pos_times_count_5_fold_all', 63 | 'log_uid_productId_pos_times_count_5_fold_all','log_uid_productType_pos_times_count_5_fold_all', 64 | 'log_uid_adCategoryId_times_count','log_uid_advertiserId_times_count', 65 | 'log_uid_campaignId_times_count','log_uid_creativeId_times_count', 66 | 'log_uid_creativeSize_times_count','log_uid_productType_times_count', 67 | 'log_uid_productId_times_count'] 68 | static_features.extend(uid_count_feature) 69 | 70 | dynamic_features = ['kw1', 'kw2','kw3', 'topic1', 'topic2', 'topic3','marriageStatus', 71 | 'os', 'ct', 'interest1', 'interest2', 72 | 'interest3', 'interest4', 'interest5','interest1_aid','interest2_aid'] 73 | 74 | dynamic_features_max_len_dict = {'interest1':61, 'interest2':33, 'interest3':10, 75 | 'interest4':10, 'interest5':86, 'kw1':5, 'kw2':5, 'kw3':5, 76 | 'topic1':5, 'topic2':5, 'topic3':5, 'ct':4, 'os':2, 'marriageStatus':3, 77 | 'os_aid':2, 'ct_aid':4, 'marriageStatus_aid':3,"interest1_aid":61, 78 | 'interest2_aid':33, 'interest3_aid':10, 'interest4_aid':10, 'interest5_aid':86, 79 | 'LBS_kw1':5, 'topic1_topic2':25} 80 | aid_combine_feature = ['LBS','age','carrier','consumptionAbility','education','gender','house', 81 | 'os', 'ct', 'marriageStatus', 'interest1', 'interest2', 'interest3', 'interest4', 'interest5'] 82 | uid_pos_feature = ['uid_pos_times_count_5_fold','log_uid_pos_times_count_5_fold', 83 | 'uid_pos_times_count_5_fold_all','log_uid_pos_times_count_5_fold_all', 84 | ] 85 | exclusive_cols = [] 86 | exclusive_cols.extend(itertools.permutations(['aid', 'advertiserId', 'campaignId', 'creativeId', 87 | 'creativeSize', 'adCategoryId', 'productId', 'productType'], 2)) 88 | for f in aid_combine_feature: 89 | exclusive_cols.append(('aid',f+'_aid')) 90 | exclusive_cols.append((f + '_aid', 'aid')) 91 | exclusive_cols.append((f, f + '_aid')) 92 | exclusive_cols.append((f + 
'_aid', f)) 93 | # for f1 in uid_pos_feature: 94 | # for f2 in aid_combine_feature: 95 | # exclusive_cols.extend([(f1,f2+'_aid'),(f2+'_aid',f1)]) 96 | # exclusive_cols.extend([(f1,'LBS_kw1'), ('LBS_kw1', f1), ('topic1_topic2',f1),(f1,'topic1_topic2')]) 97 | exclusive_cols.extend([('kw1', 'LBS_kw1'), ('LBS_kw1', 'kw1'), ('LBS','LBS_kw1'), ('LBS_kw1','LBS')]) 98 | exclusive_cols.extend([('topic1','topic1_topic2'),('topic1_topic2','topic1'),('topic2','topic1_topic2'),('topic1_topic2','topic2')]) 99 | # exclusive_cols.extend([('interest1_len','interest1'),('interest1','interest1_len'),('interest2','interest2_len'),('interest2_len','interest2')]) 100 | # exclusive_cols.extend([('advertiserId','log_uid_advertiserId_times_count'),('log_uid_advertiserId_times_count','advertiserId')]) 101 | # exclusive_cols.extend([('advertiserId', 'uid_advertiserId_times_count'), ('uid_advertiserId_times_count', 'advertiserId')]) 102 | exclusive_cols.extend([('LBS_kw1','uid_pos_times_count_5_fold'),('uid_pos_times_count_5_fold','LBS_kw1')]) 103 | 104 | 105 | 106 | extern_lr_features = [] 107 | 108 | batch_size = 1024 109 | epochs = 1 110 | 111 | lr = 0.0002 112 | -------------------------------------------------------------------------------- /utils/nzc_args.py: -------------------------------------------------------------------------------- 1 | """ 2 | args parameters 3 | """ 4 | import itertools 5 | 6 | 7 | 8 | class args: 9 | 10 | root_data_path = '../data/' 11 | 12 | """ 13 | converters args 14 | """ 15 | ad_feature_path = root_data_path + 'adFeature.csv' 16 | user_feature_path = root_data_path + 'userFeature.data' 17 | raw_train_path = root_data_path + 'train.csv' 18 | raw_test1_path = root_data_path + 'test1.csv' 19 | raw_test2_path = root_data_path + 'test2.csv' 20 | combine_train_path = root_data_path + 'combine_train.csv' 21 | combine_test1_path = root_data_path + 'combine_test1.csv' 22 | combine_test2_path = root_data_path + 'combine_test2.csv' 23 | random_combine_train_path = root_data_path + 'random_combine_train.csv' 24 | 25 | gbdt_data_path = root_data_path + 'gbdt/' 26 | 27 | dnn_data_path = root_data_path + 'nzc_dnn/' 28 | 29 | """ 30 | raw feature 31 | """ 32 | dynamic_dict = {'interest1':61, 'interest2':33, 'interest3':10, 33 | 'interest4':10, 'interest5':86, 'kw1':5, 'kw2':5, 'kw3':5, 34 | 'topic1':5, 'topic2':5, 'topic3':5, 'appIdInstall':908, 35 | 'appIdAction':823, 'ct':4, 'os':2, 'marriageStatus':3} 36 | 37 | user_static_features = ['uid', 'house', 'education', 'LBS', 'consumptionAbility', 38 | 'gender', 'age', 'carrier'] 39 | ad_static_features = ['aid', 'advertiserId', 'campaignId', 'creativeId', 'creativeSize', 40 | 'adCategoryId','productId', 'productType'] 41 | user_dynamic_features = [key for key in dynamic_dict] 42 | 43 | """ 44 | dnn args 45 | """ 46 | # data processing part 47 | 48 | static_features = ['log_uid_uid_times_count','interest1_len','interest2_len','interest5_len','aid', 'advertiserId', 'campaignId', 'creativeId', 'creativeSize', 'adCategoryId', 49 | 'productId', 'productType', 'LBS', 'age', 'carrier', 50 | 'consumptionAbility', 'education', 'gender', 'house'] 51 | uid_count_feature = ['log_uid_uid_pos_times_count_5_fold_all_set', 'log_uid_adCategoryId_pos_times_count_5_fold_all_set', 52 | 'log_uid_advertiserId_pos_times_count_5_fold_all_set', 'log_uid_campaignId_pos_times_count_5_fold_all_set', 53 | 'log_uid_creativeId_pos_times_count_5_fold_all_set', 'log_uid_creativeSize_pos_times_count_5_fold_all_set', 54 | 'log_uid_productId_pos_times_count_5_fold_all_set', 
'log_uid_productType_pos_times_count_5_fold_all_set', 55 | 'log_uid_adCategoryId_times_count', 'log_uid_advertiserId_times_count', 56 | 'log_uid_campaignId_times_count', 'log_uid_creativeId_times_count', 57 | 'log_uid_creativeSize_times_count', 'log_uid_productType_times_count', 58 | 'log_uid_productId_times_count'] 59 | static_features.extend(uid_count_feature) 60 | 61 | dynamic_features = ['kw1', 'kw2', 'kw3', 'topic1', 'topic2', 'topic3','marriageStatus', 62 | 'os', 'ct', 'interest1', 'interest2', 63 | 'interest3', 'interest4', 'interest5'] 64 | 65 | dynamic_features_max_len_dict = {'interest1':61, 'interest2':33, 'interest3':10, 66 | 'interest4':10, 'interest5':86, 'kw1':5, 'kw2':5, 'kw3':5, 67 | 'topic1':5, 'topic2':5, 'topic3':5, 'ct':4, 'os':2, 'marriageStatus':3, 68 | 'os_aid':2, 'ct_aid':4, 'marriageStatus_aid':3,"interest1_aid":61, 69 | 'interest2_aid':33, 'interest3_aid':10, 'interest4_aid':10, 'interest5_aid':86, 70 | 'LBS_kw1':5, 'topic1_topic2':25} 71 | aid_combine_feature = ['LBS','age','carrier','consumptionAbility','education','gender','house', 72 | 'os', 'ct', 'marriageStatus', 'interest1', 'interest2', 'interest3', 'interest4', 'interest5'] 73 | uid_pos_feature = ['uid_pos_times_count_5_fold','log_uid_pos_times_count_5_fold', 74 | 'uid_pos_times_count_10_fold','log_uid_pos_times_count_10_fold', 75 | 'uid_pos_times_count_5_fold_all','log_uid_pos_times_count_5_fold_all', 76 | 'uid_pos_times_count_10_fold_all', 'log_uid_pos_times_count_10_fold_all'] 77 | exclusive_cols = [] 78 | exclusive_cols.extend(itertools.permutations(['aid', 'advertiserId', 'campaignId', 'creativeId', 79 | 'creativeSize', 'adCategoryId', 'productId', 'productType'], 2)) 80 | for f in aid_combine_feature[9:]: 81 | exclusive_cols.append(('aid',f+'_aid')) 82 | exclusive_cols.append((f + '_aid', 'aid')) 83 | exclusive_cols.append((f, f + '_aid')) 84 | exclusive_cols.append((f + '_aid', f)) 85 | for f1 in uid_pos_feature: 86 | for f2 in aid_combine_feature: 87 | exclusive_cols.extend([(f1,f2+'_aid'),(f2+'_aid',f1)]) 88 | exclusive_cols.extend([(f1,'LBS_kw1'), ('LBS_kw1', f1), ('topic1_topic2',f1),(f1,'topic1_topic2')]) 89 | exclusive_cols.extend([('kw1', 'LBS_kw1'), ('LBS_kw1', 'kw1'), ('LBS','LBS_kw1'), ('LBS_kw1','LBS')]) 90 | exclusive_cols.extend([('topic1','topic1_topic2'),('topic1_topic2','topic1'),('topic2','topic1_topic2'),('topic1_topic2','topic2')]) 91 | exclusive_cols.extend([('interest1_len','interest1'),('interest1','interest1_len'),('interest2','interest2_len'),('interest2_len','interest2')]) 92 | exclusive_cols.extend([('advertiserId','log_uid_advertiserId_times_count'),('log_uid_advertiserId_times_count','advertiserId')]) 93 | exclusive_cols.extend([('advertiserId', 'uid_advertiserId_times_count'), ('uid_advertiserId_times_count', 'advertiserId')]) 94 | exclusive_cols.extend([('LBS_kw1','uid_pos_times_count_5_fold'),('uid_pos_times_count_5_fold','LBS_kw1')]) 95 | 96 | 97 | 98 | extern_lr_features = [] 99 | 100 | batch_size = 1024 101 | epochs = 1 102 | 103 | lr = 0.0002 104 | -------------------------------------------------------------------------------- /utils/tencent_data_func.py: -------------------------------------------------------------------------------- 1 | # -*- coding:utf-8 -*- 2 | 3 | import os 4 | import numpy as np 5 | from sklearn.base import BaseEstimator, TransformerMixin 6 | from sklearn.metrics import roc_auc_score 7 | from time import time 8 | import random 9 | 10 | from donal_args import args 11 | 12 | """ 13 | some functions to process tencent data 14 | """ 15 | 16 | 
def read_raw_data(path, max_num): 17 | f = open(path, 'rb') 18 | features = f.readline().strip().split(',') 19 | dict = {} 20 | num = 0 21 | for line in f: 22 | if num >= max_num: 23 | break 24 | datas = line.strip().split(',') 25 | for i, d in enumerate(datas): 26 | if not dict.has_key(features[i]): 27 | dict[features[i]] = [] 28 | dict[features[i]].append(d) 29 | num += 1 30 | f.close() 31 | return dict,num 32 | 33 | def read_raw_data_2(path, begin, end): 34 | f = open(path, 'rb') 35 | features = f.readline().strip().split(',') 36 | dict = {} 37 | num = 0 38 | for line in f: 39 | num += 1 40 | if num <= begin: 41 | continue 42 | if num > end: 43 | break 44 | datas = line.strip().split(',') 45 | for i, d in enumerate(datas): 46 | if not dict.has_key(features[i]): 47 | dict[features[i]] = [] 48 | dict[features[i]].append(d) 49 | f.close() 50 | return dict,num 51 | 52 | def onehot_feature_process(train_data, test1_data, test2_data, begin_num, filter_num = 100): 53 | count_dict = {} 54 | index_dict = {} 55 | filter_set = set() 56 | begin_index = begin_num 57 | for d in train_data: 58 | if not count_dict.has_key(d): 59 | count_dict[d] = 0 60 | count_dict[d] += 1 61 | for key in count_dict: 62 | if count_dict[key] < filter_num: 63 | filter_set.add(key) 64 | train_res = [] 65 | for d in train_data: 66 | if d in filter_set: 67 | d = '-2' 68 | if not index_dict.has_key(d): 69 | index_dict[d] = begin_index 70 | begin_index += 1 71 | train_res.append(index_dict[d]) 72 | if not index_dict.has_key('-2'): 73 | index_dict['-2'] = begin_index 74 | test1_res = [] 75 | for d in test1_data: 76 | if d in filter_set or not index_dict.has_key(d): 77 | d = '-2' 78 | test1_res.append(index_dict[d]) 79 | test2_res = [] 80 | for d in test2_data: 81 | if d in filter_set or not index_dict.has_key(d): 82 | d = '-2' 83 | test2_res.append(index_dict[d]) 84 | return np.array(train_res), np.array(test1_res), np.array(test2_res), index_dict 85 | 86 | def vector_feature_process(train_data, test1_data, test2_data, begin_num, max_len = 30, filter_num = 100): 87 | count_dict = {} 88 | index_dict = {} 89 | filter_set = set() 90 | begin_index = begin_num 91 | print "dict counting" 92 | for d in train_data: 93 | xs = d.split(' ') 94 | for x in xs: 95 | if not count_dict.has_key(x): 96 | count_dict[x] = 0 97 | count_dict[x] += 1 98 | for key in count_dict: 99 | if count_dict[key] < filter_num: 100 | filter_set.add(key) 101 | train_res = [] 102 | for d in train_data: 103 | xs = d.split(' ') 104 | row = [0] * max_len 105 | for i, x in enumerate(xs): 106 | if x in filter_set: 107 | x = '-2' 108 | if not index_dict.has_key(x): 109 | index_dict[x] = begin_index 110 | begin_index += 1 111 | row[i] = index_dict[x] 112 | train_res.append(row) 113 | if not index_dict.has_key('-2'): 114 | index_dict['-2'] = begin_index 115 | test1_res = [] 116 | for d in test1_data: 117 | row = [0] * max_len 118 | xs = d.split(' ') 119 | for i, x in enumerate(xs): 120 | if x in filter_set or not index_dict.has_key(x): 121 | x = '-2' 122 | row[i] = index_dict[x] 123 | test1_res.append(row) 124 | test2_res = [] 125 | for d in test2_data: 126 | row = [0] * max_len 127 | xs = d.split(' ') 128 | for i, x in enumerate(xs): 129 | if x in filter_set or not index_dict.has_key(x): 130 | x = '-2' 131 | row[i] = index_dict[x] 132 | test2_res.append(row) 133 | return np.array(train_res), np.array(test1_res), np.array(test2_res), index_dict 134 | 135 | def onehot_combine_process(train_data_1, train_data_2, test1_data_1, test1_data_2, test2_data_1, test2_data_2, 
begin_num, filter_num = 100): 136 | count_dict = {} 137 | index_dict = {} 138 | filter_set = set() 139 | begin_index = begin_num 140 | for i, d in enumerate(train_data_1): 141 | id_1 = train_data_1[i] 142 | id_2 = train_data_2[i] 143 | t_id = id_1 + '|' + id_2 144 | if not count_dict.has_key(t_id): 145 | count_dict[t_id] = 0 146 | count_dict[t_id] += 1 147 | for key in count_dict: 148 | if count_dict[key] < filter_num: 149 | filter_set.add(key) 150 | train_res = [] 151 | for i, d in enumerate(train_data_1): 152 | id_1 = train_data_1[i] 153 | id_2 = train_data_2[i] 154 | t_id = id_1 + '|' + id_2 155 | if t_id in filter_set: 156 | t_id = '-2' 157 | if not index_dict.has_key(t_id): 158 | index_dict[t_id] = begin_index 159 | begin_index += 1 160 | train_res.append(index_dict[t_id]) 161 | if not index_dict.has_key('-2'): 162 | index_dict['-2'] = begin_index 163 | test1_res = [] 164 | for i, d in enumerate(test1_data_1): 165 | id_1 = test1_data_1[i] 166 | id_2 = test1_data_2[i] 167 | t_id = id_1 + '|' + id_2 168 | if t_id in filter_set or not index_dict.has_key(t_id): 169 | t_id = '-2' 170 | test1_res.append(index_dict[t_id]) 171 | test2_res = [] 172 | for i, d in enumerate(test2_data_1): 173 | id_1 = test2_data_1[i] 174 | id_2 = test2_data_2[i] 175 | t_id = id_1 + '|' + id_2 176 | if t_id in filter_set or not index_dict.has_key(t_id): 177 | t_id = '-2' 178 | test2_res.append(index_dict[t_id]) 179 | return np.array(train_res), np.array(test1_res), np.array(test2_res), index_dict 180 | 181 | def vector_combine_process(train_data_1, train_data_2, test1_data_1, test1_data_2, 182 | test2_data_1, test2_data_2, begin_num, max_len = 30, filter_num=100): 183 | count_dict = {} 184 | index_dict = {} 185 | filter_set = set() 186 | begin_index = begin_num 187 | for i, d in enumerate(train_data_1): 188 | xs_1 = train_data_1[i].split(' ') 189 | xs_2 = train_data_2[i].split(' ') 190 | for x_1 in xs_1: 191 | for x_2 in xs_2: 192 | t_id = x_1 + '|' + x_2 193 | if not count_dict.has_key(t_id): 194 | count_dict[t_id] = 0 195 | count_dict[t_id] += 1 196 | for key in count_dict: 197 | if count_dict[key] < filter_num: 198 | filter_set.add(key) 199 | train_res = [] 200 | for i, d in enumerate(train_data_1): 201 | xs_1 = train_data_1[i].split(' ') 202 | xs_2 = train_data_2[i].split(' ') 203 | row = [0] * max_len 204 | j = 0 205 | for x_1 in xs_1: 206 | for x_2 in xs_2: 207 | t_id = x_1 + '|' + x_2 208 | if t_id in filter_set: 209 | t_id = '-2' 210 | if not index_dict.has_key(t_id): 211 | index_dict[t_id] = begin_index 212 | begin_index += 1 213 | row[j] = index_dict[t_id] 214 | j += 1 215 | train_res.append(row) 216 | if not index_dict.has_key('-2'): 217 | index_dict['-2'] = begin_index 218 | test1_res = [] 219 | for i, d in enumerate(test1_data_1): 220 | xs_1 = test1_data_1[i].split(' ') 221 | xs_2 = test1_data_2[i].split(' ') 222 | row = [0]*max_len 223 | j = 0 224 | for x_1 in xs_1: 225 | for x_2 in xs_2: 226 | t_id = x_1 + '|' + x_2 227 | if t_id in filter_set or not index_dict.has_key(t_id): 228 | t_id = '-2' 229 | row[j] = index_dict[t_id] 230 | j += 1 231 | test1_res.append(row) 232 | test2_res = [] 233 | for i, d in enumerate(test2_data_1): 234 | xs_1 = test2_data_1[i].split(' ') 235 | xs_2 = test2_data_2[i].split(' ') 236 | row = [0] * max_len 237 | j = 0 238 | for x_1 in xs_1: 239 | for x_2 in xs_2: 240 | t_id = x_1 + '|' + x_2 241 | if t_id in filter_set or not index_dict.has_key(t_id): 242 | t_id = '-2' 243 | row[j] = index_dict[t_id] 244 | j += 1 245 | test2_res.append(row) 246 | return 
np.array(train_res), np.array(test1_res), np.array(test2_res), index_dict 247 | 248 | def read_indexs(path): 249 | f = open(path,'rb') 250 | res = [] 251 | for line in f: 252 | res.append(int(line.strip())-1) 253 | return res 254 | 255 | def write_data_into_parts(data, root_path, nums = 5100000): 256 | l = data.shape[0] // nums 257 | for i in range(l+1): 258 | begin = i * nums 259 | end = min(nums*(i+1), data.shape[0]) 260 | t_data = data[begin:end] 261 | t_data.tofile(root_path+'_'+str(i)+'.bin') 262 | 263 | def write_data_into_parts_to_npy(data, root_path, nums = 5100000): 264 | l = data.shape[0] // nums 265 | for i in range(l+1): 266 | begin = i * nums 267 | end = min(nums*(i+1), data.shape[0]) 268 | t_data = data[begin:end] 269 | np.save(root_path+'_'+str(i)+'.npy',t_data) 270 | 271 | def get_vector_feature_len(data): 272 | res = [] 273 | for d in data: 274 | cnt = 0 275 | for item in d: 276 | if item != 0: 277 | cnt += 1 278 | res.append(cnt) 279 | return np.array(res) 280 | 281 | def read_csv_helper(path): 282 | data = [] 283 | with open(path) as f: 284 | tmp = [] 285 | for line in f: 286 | items = line.strip().split(',') 287 | for item in items: 288 | tmp.append(int(item)) 289 | data.append(tmp) 290 | return np.array(data) 291 | 292 | def read_data_from_bin(path, col_names, part, sizes): 293 | data_arr = [] 294 | for col_name in col_names: 295 | data_arr.append(np.fromfile(path + col_name + '_' + str(part) +'.bin', dtype=np.int).reshape(sizes)) 296 | return np.concatenate(data_arr,axis=1) 297 | 298 | def read_data_from_npy(path, col_names, part, sizes): 299 | data_arr = [] 300 | for col_name in col_names: 301 | data_arr.append(np.load(path + col_name + '_' + str(part) +'.npy').reshape(sizes)) 302 | return np.concatenate(data_arr,axis=1) 303 | 304 | def read_data_from_csv(path, col_names, part, sizes): 305 | data_arr = [] 306 | for col_name in col_names: 307 | data_arr.append(read_csv_helper(path + col_name + '_' + str(part) +'.csv').reshape(sizes)) 308 | return np.concatenate(data_arr,axis=1) 309 | 310 | def read_data_from_bin_to_dict(path, col_names, part, sizes=None): 311 | data_arr = {} 312 | t_sizes = sizes 313 | for col_name in col_names: 314 | if sizes == None: 315 | t_sizes = [-1, args.dynamic_features_max_len_dict[col_name]] 316 | data_arr[col_name] = np.fromfile(path + col_name + '_' + str(part) + '.bin', dtype=np.int).reshape(t_sizes) 317 | return data_arr 318 | 319 | def read_data_from_npy_to_dict(path, col_names, part, sizes=None): 320 | data_arr = {} 321 | t_sizes = sizes 322 | bin_cols = ['log_uid_adCategoryId_times_count','log_uid_advertiserId_times_count', 323 | 'log_uid_campaignId_times_count','log_uid_creativeId_times_count', 324 | 'log_uid_creativeSize_times_count','log_uid_productId_times_count', 325 | 'log_uid_productType_times_count'] 326 | for col_name in col_names: 327 | if sizes == None: 328 | t_sizes = [-1, args.dynamic_features_max_len_dict[col_name]] 329 | if col_name not in bin_cols: 330 | data_arr[col_name] = np.load(path + col_name + '_' + str(part) + '.npy').reshape(t_sizes) 331 | else: 332 | data_arr[col_name] = np.fromfile(path + col_name + '_' + str(part) + '.bin', dtype=np.int).reshape(t_sizes) 333 | return data_arr 334 | 335 | def read_data_from_csv_to_dict(path, col_names, part, sizes=None): 336 | data_arr = {} 337 | t_sizes = sizes 338 | for col_name in col_names: 339 | if sizes == None: 340 | t_sizes = [-1, args.dynamic_features_max_len_dict[col_name]] 341 | data_arr[col_name] = read_csv_helper(path + col_name + '_' + str(part) + 
'.csv').reshape(t_sizes) 342 | return data_arr 343 | 344 | def load_tencent_data_to_dict(path, part): 345 | static_features = args.static_features 346 | dynamic_features = args.dynamic_features 347 | dynamic_lengths = [] 348 | for f in dynamic_features: 349 | dynamic_lengths.append(f + '_lengths') 350 | labels = read_data_from_bin(path, ['label'], part, [-1, 1]).reshape([-1]) 351 | static_index_dict = read_data_from_bin_to_dict(path, static_features, part, [-1]) 352 | dynamic_index_dict = read_data_from_bin_to_dict(path, dynamic_features, part) 353 | dynamic_lengths_dict = read_data_from_bin_to_dict(path, dynamic_lengths, part, [-1]) 354 | return labels, static_index_dict, dynamic_index_dict, dynamic_lengths_dict, None 355 | 356 | def load_concatenate_tencent_data_to_dict(data_root_path, parts, static_features, dynamic_features, is_csv=False): 357 | labels = [] 358 | static_ids = [] 359 | dynamic_ids = [] 360 | dynamic_lengths = [] 361 | extern_lr_ids = None 362 | dynamic_lens = [] 363 | for f in dynamic_features: 364 | dynamic_lens.append(f + '_lengths') 365 | for part in parts: 366 | print part, "part loading" 367 | b_time = time() 368 | extern_lr_features = args.extern_lr_features 369 | if not is_csv: 370 | labels.append(read_data_from_bin(data_root_path, ['label'], part, [-1, 1]).reshape([-1])) 371 | else: 372 | labels.append(read_data_from_csv(data_root_path, ['label'], part, [-1, 1]).reshape([-1])) 373 | if not is_csv: 374 | static_ids.append(read_data_from_bin_to_dict(data_root_path, static_features, part, [-1])) 375 | dynamic_ids.append(read_data_from_bin_to_dict(data_root_path, dynamic_features, part)) 376 | dynamic_lengths.append(read_data_from_bin_to_dict(data_root_path, dynamic_lens, part, [-1])) 377 | else: 378 | static_ids.append(read_data_from_csv_to_dict(data_root_path, static_features, part, [-1])) 379 | dynamic_ids.append(read_data_from_csv_to_dict(data_root_path, dynamic_features, part)) 380 | dynamic_lengths.append(read_data_from_csv_to_dict(data_root_path, dynamic_lens, part, [-1])) 381 | extern_lr_ids = None 382 | print "%d part loading costs %.1f s" % (part, time() - b_time) 383 | if len(extern_lr_features) != 0: 384 | extern_lr_ids = read_data_from_bin(data_root_path, extern_lr_features, part) 385 | static_index_dict = {} 386 | dynamic_index_dict = {} 387 | dynamic_lengths_dict = {} 388 | for key in static_features: 389 | static_index_dict[key] = np.concatenate([item[key] for item in static_ids], axis = 0) 390 | for key in dynamic_features: 391 | dynamic_index_dict[key] = np.concatenate([item[key] for item in dynamic_ids], axis = 0) 392 | dynamic_lengths_dict[key] = np.concatenate([item[key+'_lengths'] for item in dynamic_lengths], axis = 0) 393 | return np.concatenate(labels, axis=0), static_index_dict, \ 394 | dynamic_index_dict, dynamic_lengths_dict, \ 395 | extern_lr_ids 396 | 397 | def load_concatenate_tencent_data_from_npy_to_dict(data_root_path, parts, static_features,dynamic_features,is_csv=False): 398 | labels = [] 399 | static_ids = [] 400 | dynamic_ids = [] 401 | dynamic_lengths = [] 402 | extern_lr_ids = None 403 | dynamic_lens = [] 404 | for f in dynamic_features: 405 | dynamic_lens.append(f + '_lengths') 406 | for part in parts: 407 | print part, "part loading" 408 | b_time = time() 409 | extern_lr_features = args.extern_lr_features 410 | if not is_csv: 411 | labels.append(read_data_from_npy(data_root_path, ['label'], part, [-1, 1]).reshape([-1])) 412 | else: 413 | labels.append(read_data_from_csv(data_root_path, ['label'], part, [-1, 
1]).reshape([-1])) 414 | if not is_csv: 415 | static_ids.append(read_data_from_npy_to_dict(data_root_path, static_features, part, [-1])) 416 | dynamic_ids.append(read_data_from_npy_to_dict(data_root_path, dynamic_features, part)) 417 | dynamic_lengths.append(read_data_from_npy_to_dict(data_root_path, dynamic_lens, part, [-1])) 418 | else: 419 | static_ids.append(read_data_from_csv_to_dict(data_root_path, static_features, part, [-1])) 420 | dynamic_ids.append(read_data_from_csv_to_dict(data_root_path, dynamic_features, part)) 421 | dynamic_lengths.append(read_data_from_csv_to_dict(data_root_path, dynamic_lens, part, [-1])) 422 | extern_lr_ids = None 423 | print "%d part loading costs %.1f s" % (part, time() - b_time) 424 | if len(extern_lr_features) != 0: 425 | extern_lr_ids = read_data_from_bin(data_root_path, extern_lr_features, part) 426 | static_index_dict = {} 427 | dynamic_index_dict = {} 428 | dynamic_lengths_dict = {} 429 | for key in static_features: 430 | static_index_dict[key] = np.concatenate([item[key] for item in static_ids], axis = 0) 431 | for key in dynamic_features: 432 | dynamic_index_dict[key] = np.concatenate([item[key] for item in dynamic_ids], axis = 0) 433 | dynamic_lengths_dict[key] = np.concatenate([item[key+'_lengths'] for item in dynamic_lengths], axis = 0) 434 | return np.concatenate(labels, axis=0), static_index_dict, \ 435 | dynamic_index_dict, dynamic_lengths_dict, \ 436 | extern_lr_ids 437 | 438 | def load_dynamic_total_size_dict(path, dynamic_features): 439 | dynamic_max_len_dict = {} 440 | for key in dynamic_features: 441 | f = open(path + key+'.csv', 'rb') 442 | dynamic_max_len_dict[key] = len(f.readlines()) + 1 443 | f.close() 444 | return dynamic_max_len_dict 445 | 446 | def load_static_total_size_dict(path, static_features): 447 | static_max_len_dict = {} 448 | for key in static_features: 449 | f = open(path + key+'.csv', 'rb') 450 | static_max_len_dict[key] = len(f.readlines()) + 1 451 | f.close() 452 | return static_max_len_dict 453 | 454 | def load_tencent_data(data_root_path, part, test=False): 455 | static_features = args.static_features 456 | dynamic_features = args.dynamic_features 457 | dynamic_lengths = [] 458 | for f in dynamic_features: 459 | dynamic_lengths.append(f+'_lengths') 460 | extern_lr_features = args.extern_lr_features 461 | if not test: 462 | labels = read_data_from_bin(data_root_path, ['label'], part, [-1,1]).reshape([-1]) 463 | static_ids = read_data_from_bin(data_root_path, static_features, part,[-1,1]) 464 | dynamic_ids = read_data_from_bin(data_root_path, dynamic_features, part, [-1,args.dynamic_max_len]) 465 | dynamic_lengths = read_data_from_bin(data_root_path, dynamic_lengths, part, [-1,1]) 466 | extern_lr_ids = None 467 | if len(extern_lr_features) != 0: 468 | extern_lr_ids = read_data_from_bin(data_root_path, extern_lr_features, part) 469 | if test: 470 | labels = np.array([0] * static_ids.shape[0]) 471 | return labels, static_ids, dynamic_ids, dynamic_lengths, extern_lr_ids 472 | 473 | 474 | def load_concatenate_tencent_data(data_root_path, parts, test=False): 475 | labels = [] 476 | static_ids = [] 477 | dynamic_ids = [] 478 | dynamic_lengths = [] 479 | extern_lr_ids = None 480 | num = 0 481 | for part in range(parts): 482 | print part, "part loading" 483 | b_time = time() 484 | static_features = args.static_features 485 | dynamic_features = args.dynamic_features 486 | dynamic_lens = [] 487 | for f in dynamic_features: 488 | dynamic_lens.append(f+'_lengths') 489 | extern_lr_features = args.extern_lr_features 490 | if 
not test: 491 | labels.append(read_data_from_bin(data_root_path, ['label'], part, [-1,1]).reshape([-1])) 492 | static_ids.append(read_data_from_bin(data_root_path, static_features, part, [-1, 1])) 493 | dynamic_ids.append(read_data_from_bin(data_root_path, dynamic_features, part, [-1, args.dynamic_max_len])) 494 | dynamic_lengths.append(read_data_from_bin(data_root_path, dynamic_lens, part, [-1, 1])) 495 | num += len(static_ids[part]) 496 | extern_lr_ids = None 497 | print "%d part loading costs %.1f s" % (part, time()-b_time) 498 | if len(extern_lr_features) != 0: 499 | extern_lr_ids = read_data_from_bin(data_root_path, extern_lr_features, part) 500 | if not test: 501 | return np.concatenate(labels,axis=0), np.concatenate(static_ids,axis=0), \ 502 | np.concatenate(dynamic_ids, axis=0), np.concatenate(dynamic_lengths, axis=0),\ 503 | extern_lr_ids 504 | else: 505 | return np.array([0]*num), np.concatenate(static_ids,axis=0), \ 506 | np.concatenate(dynamic_ids, axis=0), np.concatenate(dynamic_lengths, axis=0),\ 507 | extern_lr_ids 508 | 509 | 510 | def bin_to_libffm(data_root_path, dir_data_root_path, parts, test=False): 511 | fw = open(dir_data_root_path, 'wb') 512 | for part in range(parts): 513 | print part, "preparing" 514 | labels, static_ids, dynamic_ids, dynamic_lengths, extern_lr_ids = load_tencent_data(data_root_path, part, test) 515 | for i, d in enumerate(labels): 516 | row = [] 517 | row.append(str(labels[i])) 518 | st_size = static_ids.shape[1] 519 | for j in range(static_ids.shape[1]): 520 | row.append(str(j) + ':' + str(static_ids[i][j]) +':1') 521 | st_total_feature_size = 54634 522 | for j in range(dynamic_ids.shape[1]): 523 | ind = j // args.dynamic_max_len + st_size 524 | if dynamic_ids[i][j] != 0: 525 | row.append(str(ind) + ':' + str(dynamic_ids[i][j] + st_total_feature_size) + ':1') 526 | fw.write(' '.join(row) + '\n') 527 | fw.close() 528 | 529 | 530 | 531 | def write_dict(data_path, data): 532 | fw = open(data_path, 'wb') 533 | for key in data: 534 | fw.write(str(key)+','+str(data[key])+'\n') 535 | fw.close() 536 | 537 | def count_feature_times(train_data, test1_data, test2_data): 538 | total_dict = {} 539 | count_dict = {} 540 | for i, d in enumerate(train_data): 541 | xs = d.split(' ') 542 | for x in xs: 543 | if not total_dict.has_key(x): 544 | total_dict[x] = 0 545 | total_dict[x] += 1 546 | for i, d in enumerate(test1_data): 547 | xs = d.split(' ') 548 | for x in xs: 549 | if not total_dict.has_key(x): 550 | total_dict[x] = 0 551 | total_dict[x] += 1 552 | for i, d in enumerate(test2_data): 553 | xs = d.split(' ') 554 | for x in xs: 555 | if not total_dict.has_key(x): 556 | total_dict[x] = 0 557 | total_dict[x] += 1 558 | for key in total_dict: 559 | if not count_dict.has_key(total_dict[key]): 560 | count_dict[total_dict[key]] = 0 561 | count_dict[total_dict[key]] += 1 562 | train_res = [] 563 | for d in train_data: 564 | xs = d.split(' ') 565 | t = [] 566 | for x in xs: 567 | t.append(total_dict[x]) 568 | train_res.append(max(t)) 569 | test1_res = [] 570 | for d in test1_data: 571 | xs = d.split(' ') 572 | t = [] 573 | for x in xs: 574 | t.append(total_dict[x]) 575 | test1_res.append(max(t)) 576 | test2_res = [] 577 | for d in test2_data: 578 | xs = d.split(' ') 579 | t = [] 580 | for x in xs: 581 | t.append(total_dict[x]) 582 | test2_res.append(max(t)) 583 | return np.array(train_res), np.array(test1_res), np.array(test2_res), count_dict 584 | 585 | def count_combine_feature_times(train_data_1, train_data_2, test1_data_1, test1_data_2, test2_data_1, 
test2_data_2): 586 | total_dict = {} 587 | count_dict = {} 588 | for i, d in enumerate(train_data_1): 589 | xs1 = d.split(' ') 590 | xs2 = train_data_2[i].split(',') 591 | for x1 in xs1: 592 | for x2 in xs2: 593 | ke = x1+'|'+x2 594 | if not total_dict.has_key(ke): 595 | total_dict[ke] = 0 596 | total_dict[ke] += 1 597 | for i, d in enumerate(test1_data_1): 598 | xs1 = d.split(' ') 599 | xs2 = test1_data_2[i].split(',') 600 | for x1 in xs1: 601 | for x2 in xs2: 602 | ke = x1+'|'+x2 603 | if not total_dict.has_key(ke): 604 | total_dict[ke] = 0 605 | total_dict[ke] += 1 606 | for i, d in enumerate(test2_data_1): 607 | xs1 = d.split(' ') 608 | xs2 = test2_data_2[i].split(',') 609 | for x1 in xs1: 610 | for x2 in xs2: 611 | ke = x1+'|'+x2 612 | if not total_dict.has_key(ke): 613 | total_dict[ke] = 0 614 | total_dict[ke] += 1 615 | for key in total_dict: 616 | if not count_dict.has_key(total_dict[key]): 617 | count_dict[total_dict[key]] = 0 618 | count_dict[total_dict[key]] += 1 619 | 620 | train_res = [] 621 | for i, d in enumerate(train_data_1): 622 | t = [] 623 | xs1 = d.split(' ') 624 | xs2 = train_data_2[i].split(',') 625 | for x1 in xs1: 626 | for x2 in xs2: 627 | ke = x1 + '|' + x2 628 | t.append(total_dict[ke]) 629 | train_res.append(max(t)) 630 | test1_res = [] 631 | for i, d in enumerate(test1_data_1): 632 | t = [] 633 | xs1 = d.split(' ') 634 | xs2 = test1_data_2[i].split(',') 635 | for x1 in xs1: 636 | for x2 in xs2: 637 | ke = x1 + '|' + x2 638 | t.append(total_dict[ke]) 639 | test1_res.append(max(t)) 640 | test2_res = [] 641 | for i, d in enumerate(test2_data_1): 642 | t = [] 643 | xs1 = d.split(' ') 644 | xs2 = test2_data_2[i].split(',') 645 | for x1 in xs1: 646 | for x2 in xs2: 647 | ke = x1 + '|' + x2 648 | t.append(total_dict[ke]) 649 | test2_res.append(max(t)) 650 | return np.array(train_res), np.array(test1_res), np.array(test2_res), count_dict 651 | 652 | def add_len(train_data, test1_data, test2_data): 653 | count_dict = {} 654 | train_res = [] 655 | for i, d in enumerate(train_data): 656 | if d=='-1': 657 | train_res.append(0) 658 | if not count_dict.has_key(0): 659 | count_dict[0] = 0 660 | count_dict[0] += 1 661 | else: 662 | xs = d.split(' ') 663 | train_res.append(len(xs)) 664 | if not count_dict.has_key(len(xs)): 665 | count_dict[len(xs)] = 0 666 | count_dict[len(xs)] += 1 667 | test1_res = [] 668 | for i, d in enumerate(test1_data): 669 | if d=='-1': 670 | test1_res.append(0) 671 | else: 672 | xs = d.split(' ') 673 | test1_res.append(len(xs)) 674 | test2_res = [] 675 | for i, d in enumerate(test2_data): 676 | if d == '-1': 677 | test2_res.append(0) 678 | else: 679 | xs = d.split(' ') 680 | test2_res.append(len(xs)) 681 | return np.array(train_res), np.array(test1_res), np.array(test2_res), count_dict 682 | 683 | def gen_count_dict(data, labels, begin, end): 684 | total_dict = {} 685 | pos_dict = {} 686 | for i, d in enumerate(data): 687 | if i >= begin and i < end: 688 | continue 689 | xs = d.split(' ') 690 | for x in xs: 691 | if not total_dict.has_key(x): 692 | total_dict[x] = 0 693 | if not pos_dict.has_key(x): 694 | pos_dict[x] = 0 695 | total_dict[x] += 1 696 | if labels[i] == '1': 697 | pos_dict[x] += 1 698 | return total_dict, pos_dict 699 | 700 | def combine_to_one(data1, data2): 701 | assert len(data1) == len(data2) 702 | new_res = [] 703 | for i, d in enumerate(data1): 704 | x1 = data1[i] 705 | x2 = data2[i] 706 | new_x = x1 + '|' + x2 707 | new_res.append(new_x) 708 | return new_res 709 | 710 | def count_pos_feature(train_data, test1_data, test2_data, 
labels, k, test_only= False, is_val = False): 711 | nums = len(train_data) 712 | last = nums 713 | if is_val: 714 | last = nums-4739700 715 | assert last > 0 716 | interval = last // k 717 | split_points = [] 718 | for i in range(k): 719 | split_points.append(i * interval) 720 | split_points.append(last) 721 | count_train_data = train_data[0:last] 722 | count_labels = labels[0:last] 723 | 724 | train_res = [] 725 | if not test_only: 726 | for i in range(k): 727 | print i,"part counting" 728 | print split_points[i], split_points[i+1] 729 | tmp = [] 730 | total_dict, pos_dict = gen_count_dict(count_train_data, count_labels, split_points[i],split_points[i+1]) 731 | for j in range(split_points[i],split_points[i+1]): 732 | xs = train_data[j].split(' ') 733 | t = [] 734 | for x in xs: 735 | if not pos_dict.has_key(x): 736 | t.append(0) 737 | continue 738 | t.append(pos_dict[x] + 1) 739 | tmp.append(max(t)) 740 | train_res.extend(tmp) 741 | 742 | total_dict, pos_dict = gen_count_dict(count_train_data, count_labels, 1, 0) 743 | count_dict = {-1:0} 744 | for key in pos_dict: 745 | if not count_dict.has_key(pos_dict[key]): 746 | count_dict[pos_dict[key]] = 0 747 | count_dict[pos_dict[key]] += 1 748 | 749 | if is_val: 750 | for i in range(last, nums): 751 | xs = train_data[i].split(' ') 752 | t = [] 753 | for x in xs: 754 | if not total_dict.has_key(x): 755 | t.append(0) 756 | continue 757 | t.append(pos_dict[x] + 1) 758 | train_res.append(max(t)) 759 | 760 | test1_res = [] 761 | for d in test1_data: 762 | xs = d.split(' ') 763 | t = [] 764 | for x in xs: 765 | if not pos_dict.has_key(x): 766 | t.append(0) 767 | continue 768 | t.append(pos_dict[x] + 1) 769 | test1_res.append(max(t)) 770 | 771 | test2_res = [] 772 | for d in test2_data: 773 | xs = d.split(' ') 774 | t = [] 775 | for x in xs: 776 | if not pos_dict.has_key(x): 777 | t.append(0) 778 | continue 779 | t.append(pos_dict[x] + 1) 780 | test2_res.append(max(t)) 781 | 782 | return np.array(train_res), np.array(test1_res), np.array(test2_res), count_dict 783 | 784 | def count_pos_feature_2(train_data, test1_data, test2_data, labels, k, test_only= False, is_val = False): 785 | nums = len(train_data) 786 | last = nums 787 | if is_val: 788 | last = nums - 4739700 789 | interval = last // k 790 | split_points = [] 791 | for i in range(k): 792 | split_points.append(i * interval) 793 | split_points.append(last) 794 | count_train_data = train_data[0:last] 795 | count_labels = labels[0:last] 796 | 797 | train_res = [] 798 | if not test_only: 799 | for i in range(k): 800 | print i,"part counting" 801 | print split_points[i], split_points[i+1] 802 | tmp = [] 803 | total_dict, pos_dict = gen_count_dict(count_train_data, count_labels, split_points[i],split_points[i+1]) 804 | for j in range(split_points[i],split_points[i+1]): 805 | xs = count_train_data[j].split(' ') 806 | t = [] 807 | for x in xs: 808 | if not total_dict.has_key(x): 809 | t.append(0) 810 | continue 811 | t.append(pos_dict[x]+1) 812 | tmp.append(max(t)) 813 | train_res.extend(tmp) 814 | 815 | rng_state = np.random.get_state() 816 | np.random.set_state(rng_state) 817 | np.random.shuffle(count_train_data) 818 | np.random.set_state(rng_state) 819 | np.random.shuffle(count_labels) 820 | e = last*(k-1)/k 821 | 822 | total_dict, pos_dict = gen_count_dict(count_train_data[0:e], count_labels[0:e], 1, 0) 823 | count_dict = {-1:0} 824 | for key in pos_dict: 825 | if not count_dict.has_key(pos_dict[key]): 826 | count_dict[pos_dict[key]] = 0 827 | count_dict[pos_dict[key]] += 1 828 | 829 | if is_val: 
830 | for i in range(last, nums): 831 | xs = train_data[i].split(' ') 832 | t = [] 833 | for x in xs: 834 | if not total_dict.has_key(x): 835 | t.append(0) 836 | continue 837 | t.append(pos_dict[x] + 1) 838 | train_res.append(max(t)) 839 | 840 | 841 | test1_res = [] 842 | for d in test1_data: 843 | xs = d.split(' ') 844 | t = [] 845 | for x in xs: 846 | if not total_dict.has_key(x): 847 | t.append(0) 848 | continue 849 | t.append(pos_dict[x] + 1) 850 | test1_res.append(max(t)) 851 | 852 | test2_res = [] 853 | for d in test2_data: 854 | xs = d.split(' ') 855 | t = [] 856 | for x in xs: 857 | if not total_dict.has_key(x): 858 | t.append(0) 859 | continue 860 | t.append(pos_dict[x] + 1) 861 | test2_res.append(max(t)) 862 | 863 | return np.array(train_res), np.array(test1_res), np.array(test2_res), count_dict 864 | 865 | 866 | def uid_seq_feature(train_data, test1_data, test2_data, label): 867 | count_dict = {} 868 | seq_dict = {} 869 | seq_emb_dict = {} 870 | train_seq = [] 871 | ind = 0 872 | for i, d in enumerate(train_data): 873 | if not count_dict.has_key(d): 874 | count_dict[d] = [] 875 | seq_key = ' '.join(count_dict[d][-4:]) 876 | if not seq_dict.has_key(seq_key): 877 | seq_dict[seq_key] = 0 878 | seq_emb_dict[seq_key] = ind 879 | ind += 1 880 | seq_dict[seq_key] += 1 881 | train_seq.append(seq_emb_dict[seq_key]) 882 | count_dict[d].append(label[i]) 883 | test1_seq = [] 884 | for d in test1_data: 885 | if not count_dict.has_key(d): 886 | seq_key = '' 887 | else: 888 | seq_key = ' '.join(count_dict[d][-4:]) 889 | if seq_emb_dict.has_key(seq_key): 890 | key = seq_emb_dict[seq_key] 891 | else: 892 | key = 0 893 | test1_seq.append(key) 894 | test2_seq = [] 895 | for d in test2_data: 896 | if not count_dict.has_key(d): 897 | seq_key = '' 898 | else: 899 | seq_key = ' '.join(count_dict[d][-4:]) 900 | if seq_emb_dict.has_key(seq_key): 901 | key = seq_emb_dict[seq_key] 902 | else: 903 | key = 0 904 | test2_seq.append(key) 905 | 906 | return np.array(train_seq), np.array(test1_seq), np.array(test2_seq), seq_emb_dict 907 | 908 | def resort_data(raw_train_data, train_feat, ran_train_data): 909 | feat_dict = {} 910 | for i,d in enumerate(raw_train_data): 911 | feat_dict[d] = train_feat[i] 912 | train_res = [] 913 | for d in ran_train_data: 914 | train_res.append(feat_dict[d]) 915 | return np.array(train_res) 916 | 917 | def count_combine_feature_times_with_chusai(train_data_1, train_data_2, 918 | test1_data_1, test1_data_2, 919 | test2_data_1, test2_data_2, 920 | chusai_train_data_1, chusai_train_data_2): 921 | total_dict = {} 922 | count_dict = {} 923 | for i, d in enumerate(chusai_train_data_1): 924 | xs1 = d.split(' ') 925 | xs2 = chusai_train_data_2[i].split(',') 926 | for x1 in xs1: 927 | for x2 in xs2: 928 | ke = x1+'|'+x2 929 | if not total_dict.has_key(ke): 930 | total_dict[ke] = 0 931 | total_dict[ke] += 1 932 | 933 | for i, d in enumerate(train_data_1): 934 | xs1 = d.split(' ') 935 | xs2 = train_data_2[i].split(',') 936 | for x1 in xs1: 937 | for x2 in xs2: 938 | ke = x1+'|'+x2 939 | if not total_dict.has_key(ke): 940 | total_dict[ke] = 0 941 | total_dict[ke] += 1 942 | for i, d in enumerate(test1_data_1): 943 | xs1 = d.split(' ') 944 | xs2 = test1_data_2[i].split(',') 945 | for x1 in xs1: 946 | for x2 in xs2: 947 | ke = x1+'|'+x2 948 | if not total_dict.has_key(ke): 949 | total_dict[ke] = 0 950 | total_dict[ke] += 1 951 | for i, d in enumerate(test2_data_1): 952 | xs1 = d.split(' ') 953 | xs2 = test2_data_2[i].split(',') 954 | for x1 in xs1: 955 | for x2 in xs2: 956 | ke = x1+'|'+x2 957 
| if not total_dict.has_key(ke): 958 | total_dict[ke] = 0 959 | total_dict[ke] += 1 960 | for key in total_dict: 961 | if not count_dict.has_key(total_dict[key]): 962 | count_dict[total_dict[key]] = 0 963 | count_dict[total_dict[key]] += 1 964 | 965 | train_res = [] 966 | for i, d in enumerate(train_data_1): 967 | t = [] 968 | xs1 = d.split(' ') 969 | xs2 = train_data_2[i].split(',') 970 | for x1 in xs1: 971 | for x2 in xs2: 972 | ke = x1 + '|' + x2 973 | t.append(total_dict[ke]) 974 | train_res.append(max(t)) 975 | test1_res = [] 976 | for i, d in enumerate(test1_data_1): 977 | t = [] 978 | xs1 = d.split(' ') 979 | xs2 = test1_data_2[i].split(',') 980 | for x1 in xs1: 981 | for x2 in xs2: 982 | ke = x1 + '|' + x2 983 | t.append(total_dict[ke]) 984 | test1_res.append(max(t)) 985 | test2_res = [] 986 | for i, d in enumerate(test2_data_1): 987 | t = [] 988 | xs1 = d.split(' ') 989 | xs2 = test2_data_2[i].split(',') 990 | for x1 in xs1: 991 | for x2 in xs2: 992 | ke = x1 + '|' + x2 993 | t.append(total_dict[ke]) 994 | test2_res.append(max(t)) 995 | return np.array(train_res), np.array(test1_res), np.array(test2_res), count_dict 996 | 997 | def count_feature_times_with_chusai(train_data, test1_data, test2_data, chusai_train_data): 998 | total_dict = {} 999 | count_dict = {} 1000 | for i, d in enumerate(chusai_train_data): 1001 | xs = d.split(' ') 1002 | for x in xs: 1003 | if not total_dict.has_key(x): 1004 | total_dict[x] = 0 1005 | total_dict[x] += 1 1006 | for i, d in enumerate(train_data): 1007 | xs = d.split(' ') 1008 | for x in xs: 1009 | if not total_dict.has_key(x): 1010 | total_dict[x] = 0 1011 | total_dict[x] += 1 1012 | for i, d in enumerate(test1_data): 1013 | xs = d.split(' ') 1014 | for x in xs: 1015 | if not total_dict.has_key(x): 1016 | total_dict[x] = 0 1017 | total_dict[x] += 1 1018 | for i, d in enumerate(test2_data): 1019 | xs = d.split(' ') 1020 | for x in xs: 1021 | if not total_dict.has_key(x): 1022 | total_dict[x] = 0 1023 | total_dict[x] += 1 1024 | for key in total_dict: 1025 | if not count_dict.has_key(total_dict[key]): 1026 | count_dict[total_dict[key]] = 0 1027 | count_dict[total_dict[key]] += 1 1028 | train_res = [] 1029 | for d in train_data: 1030 | xs = d.split(' ') 1031 | t = [] 1032 | for x in xs: 1033 | t.append(total_dict[x]) 1034 | train_res.append(max(t)) 1035 | test1_res = [] 1036 | for d in test1_data: 1037 | xs = d.split(' ') 1038 | t = [] 1039 | for x in xs: 1040 | t.append(total_dict[x]) 1041 | test1_res.append(max(t)) 1042 | test2_res = [] 1043 | for d in test2_data: 1044 | xs = d.split(' ') 1045 | t = [] 1046 | for x in xs: 1047 | t.append(total_dict[x]) 1048 | test2_res.append(max(t)) 1049 | return np.array(train_res), np.array(test1_res), np.array(test2_res), count_dict 1050 | 1051 | def count_pos_feature_with_chusai(train_data, test1_data, test2_data, chusai_train_data, 1052 | chusai_labels, labels, k, test_only= False, is_val = False): 1053 | nums = len(train_data) 1054 | last = nums 1055 | if is_val: 1056 | last = 5100000 * 8 1057 | interval = last // k 1058 | split_points = [] 1059 | for i in range(k): 1060 | split_points.append(i * interval) 1061 | split_points.append(last) 1062 | count_train_data = train_data[0:last] 1063 | count_labels = labels[0:last] 1064 | 1065 | train_res = [] 1066 | if not test_only: 1067 | for i in range(k): 1068 | print i,"part counting" 1069 | print split_points[i], split_points[i+1] 1070 | tmp = [] 1071 | total_dict, pos_dict = gen_count_dict(count_train_data, count_labels, split_points[i],split_points[i+1]) 1072 | 
for d in chusai_train_data: 1073 | xs = d.split(' ') 1074 | for x in xs: 1075 | if not total_dict.has_key(x): 1076 | total_dict[x] = 0 1077 | if not pos_dict.has_key(x): 1078 | pos_dict[x] = 0 1079 | total_dict[x] += 1 1080 | if chusai_labels[i] == '1': 1081 | pos_dict[x] += 1 1082 | for j in range(split_points[i],split_points[i+1]): 1083 | xs = train_data[j].split(' ') 1084 | t = [] 1085 | for x in xs: 1086 | if not pos_dict.has_key(x): 1087 | t.append(0) 1088 | continue 1089 | t.append(pos_dict[x] + 1) 1090 | tmp.append(max(t)) 1091 | train_res.extend(tmp) 1092 | 1093 | total_dict, pos_dict = gen_count_dict(count_train_data, count_labels, 1, 0) 1094 | for d in chusai_train_data: 1095 | xs = d.split(' ') 1096 | for x in xs: 1097 | if not total_dict.has_key(x): 1098 | total_dict[x] = 0 1099 | if not pos_dict.has_key(x): 1100 | pos_dict[x] = 0 1101 | total_dict[x] += 1 1102 | if chusai_labels[i] == '1': 1103 | pos_dict[x] += 1 1104 | count_dict = {-1:0} 1105 | for key in pos_dict: 1106 | if not count_dict.has_key(pos_dict[key]): 1107 | count_dict[pos_dict[key]] = 0 1108 | count_dict[pos_dict[key]] += 1 1109 | 1110 | if is_val: 1111 | for i in range(last, nums): 1112 | xs = train_data[i].split(' ') 1113 | t = [] 1114 | for x in xs: 1115 | if not total_dict.has_key(x): 1116 | t.append(0) 1117 | continue 1118 | t.append(pos_dict[x] + 1) 1119 | train_res.append(max(t)) 1120 | 1121 | test1_res = [] 1122 | for d in test1_data: 1123 | xs = d.split(' ') 1124 | t = [] 1125 | for x in xs: 1126 | if not pos_dict.has_key(x): 1127 | t.append(0) 1128 | continue 1129 | t.append(pos_dict[x] + 1) 1130 | test1_res.append(max(t)) 1131 | 1132 | test2_res = [] 1133 | for d in test2_data: 1134 | xs = d.split(' ') 1135 | t = [] 1136 | for x in xs: 1137 | if not pos_dict.has_key(x): 1138 | t.append(0) 1139 | continue 1140 | t.append(pos_dict[x] + 1) 1141 | test2_res.append(max(t)) 1142 | 1143 | return np.array(train_res), np.array(test1_res), np.array(test2_res), count_dict 1144 | 1145 | def count_pos_feature_all_set(train_data, test1_data, test2_data, labels, k, test_only= False, is_val = False): 1146 | nums = len(train_data) 1147 | last = nums 1148 | if is_val: 1149 | last = nums-4739700 1150 | interval = last // k 1151 | split_points = [] 1152 | for i in range(k): 1153 | split_points.append(i * interval) 1154 | split_points.append(last) 1155 | count_train_data = train_data[0:last] 1156 | count_labels = labels[0:last] 1157 | 1158 | total_dict, pos_dict = gen_count_dict(count_train_data, count_labels, 1, 0) 1159 | train_res = [] 1160 | for i ,d in enumerate(count_train_data): 1161 | xs = train_data[i].split(' ') 1162 | t = [] 1163 | for x in xs: 1164 | if not total_dict.has_key(x): 1165 | t.append(0) 1166 | else: 1167 | sub = 0 1168 | if count_labels[i] == '1': 1169 | sub = 1 1170 | t.append(pos_dict[x] - sub) 1171 | train_res.append(max(t)) 1172 | count_dict = {} 1173 | for key in pos_dict: 1174 | if not count_dict.has_key(pos_dict[key]): 1175 | count_dict[pos_dict[key]] = 0 1176 | count_dict[pos_dict[key]] += 1 1177 | 1178 | if is_val: 1179 | for i in range(last, nums): 1180 | xs = train_data[i].split(' ') 1181 | t = [] 1182 | for x in xs: 1183 | if not total_dict.has_key(x): 1184 | t.append(0) 1185 | continue 1186 | t.append(pos_dict[x]) 1187 | train_res.append(max(t)) 1188 | 1189 | test1_res = [] 1190 | for d in test1_data: 1191 | xs = d.split(' ') 1192 | t = [] 1193 | for x in xs: 1194 | if not pos_dict.has_key(x): 1195 | t.append(0) 1196 | continue 1197 | t.append(pos_dict[x]) 1198 | 
test1_res.append(max(t)) 1199 | 1200 | test2_res = [] 1201 | for d in test2_data: 1202 | xs = d.split(' ') 1203 | t = [] 1204 | for x in xs: 1205 | if not pos_dict.has_key(x): 1206 | t.append(0) 1207 | continue 1208 | t.append(pos_dict[x]) 1209 | test2_res.append(max(t)) 1210 | 1211 | return np.array(train_res), np.array(test1_res), np.array(test2_res), count_dict 1212 | 1213 | 1214 | def count_pos_feature_by_history(train_data, test1_data, test2_data, labels, k, test_only= False, is_val = False): 1215 | nums = len(train_data) 1216 | last = nums 1217 | if is_val: 1218 | last = nums-4739700 1219 | interval = last // k 1220 | split_points = [] 1221 | for i in range(k): 1222 | split_points.append(i * interval) 1223 | split_points.append(last) 1224 | count_train_data = train_data[0:last] 1225 | count_labels = labels[0:last] 1226 | 1227 | train_res = [] 1228 | if not test_only: 1229 | for i in range(k): 1230 | print i,"part counting" 1231 | print split_points[i], split_points[i+1] 1232 | tmp = [] 1233 | total_dict, pos_dict = gen_count_dict(count_train_data, count_labels, split_points[i],split_points[i+1]) 1234 | for j in range(split_points[i],split_points[i+1]): 1235 | xs = train_data[j].split(' ') 1236 | t = [] 1237 | for x in xs: 1238 | if not pos_dict.has_key(x): 1239 | t.append(0) 1240 | continue 1241 | t.append(pos_dict[x] + 1) 1242 | tmp.append(max(t)) 1243 | train_res.extend(tmp) 1244 | 1245 | total_dict, pos_dict = gen_count_dict(count_train_data, count_labels, 1, 0) 1246 | count_dict = {-1:0} 1247 | for key in pos_dict: 1248 | if not count_dict.has_key(pos_dict[key]): 1249 | count_dict[pos_dict[key]] = 0 1250 | count_dict[pos_dict[key]] += 1 1251 | 1252 | if is_val: 1253 | for i in range(last, nums): 1254 | xs = train_data[i].split(' ') 1255 | t = [] 1256 | for x in xs: 1257 | if not total_dict.has_key(x): 1258 | t.append(0) 1259 | continue 1260 | t.append(pos_dict[x] + 1) 1261 | train_res.append(max(t)) 1262 | 1263 | test1_res = [] 1264 | for d in test1_data: 1265 | xs = d.split(' ') 1266 | t = [] 1267 | for x in xs: 1268 | if not pos_dict.has_key(x): 1269 | t.append(0) 1270 | continue 1271 | t.append(pos_dict[x] + 1) 1272 | test1_res.append(max(t)) 1273 | 1274 | test2_res = [] 1275 | for d in test2_data: 1276 | xs = d.split(' ') 1277 | t = [] 1278 | for x in xs: 1279 | if not pos_dict.has_key(x): 1280 | t.append(0) 1281 | continue 1282 | t.append(pos_dict[x] + 1) 1283 | test2_res.append(max(t)) 1284 | 1285 | return np.array(train_res), np.array(test1_res), np.array(test2_res), count_dict --------------------------------------------------------------------------------
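
Usage note (appended): the leave-one-fold-out positive-count encoding in utils/tencent_data_func.py (gen_count_dict plus the count_pos_feature family) is the least obvious part of the pipeline. For each of the k folds, token counts are built from the out-of-fold training rows only; every in-fold row is then encoded as the max of (positive count + 1) over its space-separated tokens (0 for tokens never seen out of fold), and both test sets are encoded with counts from the full training data. The toy driver below is a minimal illustrative sketch, not part of the original repo; it assumes Python 2.7 as stated in the README, that it is run from the repository root, and it adds utils/ to sys.path only so that the module's own `from donal_args import args` import resolves.

# toy_count_pos_feature.py -- illustrative sketch only, not part of the original repo
import sys
sys.path.append('utils')  # lets `from donal_args import args` inside tencent_data_func resolve

from tencent_data_func import count_pos_feature

# Space-separated token strings, in the same format the converters pass in;
# labels are compared against the string '1' inside gen_count_dict.
train_data = ['a', 'a b', 'b', 'a', 'b', 'a b']
labels     = ['1', '0', '1', '0', '0', '1']
test1_data = ['a', 'b']
test2_data = ['a b', 'c']   # 'c' never appears in training, so it encodes to 0

# With k=2, rows 0-2 are encoded from counts over rows 3-5 and vice versa;
# test1/test2 are encoded from counts over all six training rows.
train_res, test1_res, test2_res, count_dict = count_pos_feature(
    train_data, test1_data, test2_data, labels, k=2)

print(train_res)   # per-row max over tokens of (out-of-fold positive count + 1), 0 if unseen
print(test1_res)
print(test2_res)
print(count_dict)  # histogram of full-data positive counts per token (seeded with a -1 entry)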