├── LICENSE ├── README.md ├── config_deepmcp.py ├── ctr_funcs.py ├── data └── data_readme.md ├── deepcp.py ├── deepmcp.py ├── deepmp.py └── dnn.py /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Wentao Ouyang 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Deep Matching, Correlation and Prediction (DeepMCP) Model 2 | 3 | DeepMCP is a model for click-through rate (CTR) prediction. Most existing methods mainly model the feature-CTR relationship and suffer from the data sparsity issue. In contrast, DeepMCP models other types of relationships in order to learn more informative and statistically reliable feature representations, and in consequence to improve the performance of CTR prediction. In particular, DeepMCP contains three parts: a matching subnet, a correlation subnet and a prediction subnet. These subnets model the user-ad, ad-ad and feature-CTR relationship respectively. When these subnets are jointly optimized under the supervision of the target labels, the learned feature representations have both good prediction powers and good representation abilities. 4 | 5 | If you use this code, please cite the following paper: 6 | * **Representation Learning-Assisted Click-Through Rate Prediction. In IJCAI, 2019.** 7 | 8 | arXiv: https://arxiv.org/abs/1906.04365 [Extended version] 9 | 10 | IJCAI: https://www.ijcai.org/proceedings/2019/634 11 | 12 | #### Bibtex 13 | ``` 14 | @inproceedings{ouyang2019representation, 15 | title={Representation Learning-Assisted Click-Through Rate Prediction}, 16 | author={Ouyang, Wentao and Zhang, Xiuwu and Ren, Shukui and Qi, Chao and Liu, Zhaojie and Du, Yanlong}, 17 | booktitle={IJCAI}, 18 | pages={4561--4567}, 19 | year={2019} 20 | } 21 | ``` 22 | 23 | #### TensorFlow (TF) version 24 | 1.3.0 25 | 26 | #### Abbreviation 27 | ft - feature, slot == field 28 | 29 | ## Data Preparation (DeepMP) 30 | Data is in the "csv" format, where each row contains an instance.\ 31 | Assume there are N unique fts. Fts need to be indexed from 1 to N. Use 0 for missing values or for padding. 32 | 33 | We categorize fts as i) **one-hot** or **univalent** (e.g., user id, city) and ii) **mul-hot** or **multivalent** (e.g., words in ad title). 34 | 35 | csv data format 36 | * \\\ 37 | 38 | We also need to define the max number of features per mul-hot ft slot (through the "max_len_per_slot" parameter) and perform trimming or padding accordingly. Please refer to the following example for more detail. 39 | 40 | ### Example 41 | 1. original fts (ft_name:ft_value) 42 | * label:0, gender:male, age:27, query:apple, title:apple, title:fruit, title:fresh 43 | * label:1, gender:female, age:35, query:shoes, query:winter, title:shoes, title:winter, title:warm, title:sales 44 | 45 | 2. csv fts (not converted to ft index yet) 46 | * 0, male, 27, apple, 0, 0, apple, fruit, fresh 47 | * 1, female, 35, shoes, winter, 0, shoes, winter, warm 48 | 49 | #### Explanation 50 | csv format settings:\ 51 | n_one_hot_slot = 2 # num of one-hot ft slots (gender, age)\ 52 | n_mul_hot_slot = 2 # num of mul-hot ft slots (query, title)\ 53 | max_len_per_slot = 3 # max num of fts per mul-hot ft slot 54 | 55 | For the first instance, the mul-hot ft slot "query" contains only 1 ft "apple". We thus pad (max_len_per_slot - 1) zeros, resulting in "apple, 0, 0".\ 56 | For the second instance, the mul-hot ft slot "title" contains 4 fts. We thus only keep the first max_len_per_slot fts. 57 | 58 | ## Data Preparation (DeepCP/DeepMCP) 59 | DeepCP/DeepMCP needs two datasets as input. Both are in the "csv" format.\ 60 | The first dataset is the same as that for DeepMP.\ 61 | The second dataset should contain a target ad, a context ad and N negative ads per row. 62 | 63 | csv data format 64 | * \\\\\\...\\ 65 | 66 | csv format settings:\ 67 | n_one_hot_slot_s = 2 # num of one-hot ft slots per ad in the second dataset\ 68 | n_mul_hot_slot_s = 2 # num of mul-hot ft slots per ad in the second dataset\ 69 | max_len_per_slot_s = 3 # max num of fts per mul-hot ft slot in the second dataset 70 | 71 | ## Source Code 72 | 1. **DeepMP** achieves the best tradeoff between prediction performance and model complexity. It needs only 1 dataset. (configs of the second dataset are useless) \[**_Recommended_**\] 73 | 2. DeepCP needs 2 datasets. Its performance is not as good as DeepMP. 74 | 3. DeepMCP also needs 2 datasets. It is the most complex and leads to the best performance. 75 | 76 | * config_deepmcp.py -- config file 77 | * ctr_funcs.py -- functions 78 | * deepmp.py -- Deep Matching and Prediction (DeepMP) model 79 | * deepcp.py -- Deep Correlation and Prediction (DeepCP) model 80 | * deepmcp.py -- Deep Matching, Correlation and Prediction (DeepMCP) model 81 | 82 | ## Run the Code 83 | First revise the config file, and then run the code 84 | ```bash 85 | nohup python deepmp.py > [output_file_name] 2>&1 & 86 | ``` 87 | -------------------------------------------------------------------------------- /config_deepmcp.py: -------------------------------------------------------------------------------- 1 | ''' 2 | config file 3 | ''' 4 | # first dataset 5 | n_one_hot_slot = 25 # num of one-hot slots in the 1st dataset 6 | n_mul_hot_slot = 2 # num of mul-hot slots in the 1st dataset 7 | max_len_per_slot = 5 # max num of fts per mul-hot slot in the 1st dataset 8 | n_ft = 42301586 # num of unique fts in the 1st dataset 9 | num_csv_col = 561 # num of cols in the csv file (1st dataset) 10 | # total_n_slot = n_one_hot_slot + n_mul_hot_slot = 25+2 = 27 11 | # the following indices are w.r.t. these total_n_slot(=27) slots, starting from slot idx 0 12 | # in the sample csv data, slot idx 0 is bias; it does not belong to user or ad fts 13 | user_ft_idx = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 25] # idx of user (& query) fts 14 | ad_ft_idx = [1, 2, 19, 20, 21, 22, 23, 24, 26] # idx of ad fts 15 | 16 | pre = './data/' 17 | suf = '.csv' 18 | train_file_name = [pre+'day_1'+suf, pre+'day_2'+suf] # can contain multiple file names 19 | val_file_name = [pre+'day_3'+suf] # should contain only 1 file name 20 | test_file_name = [pre+'day_4'+suf] # should contain only 1 file name 21 | 22 | time_style = '%Y-%m-%d %H:%M:%S' 23 | output_file_name = '0311_1430' # part of file and folder names for recording the output model and result 24 | k = 10 # embedding dim for each ft 25 | alpha = 5 # balancing para for the matching subnet 26 | beta = 0.01 # balancing para for the correlation subnet 27 | batch_size = 128 # batch size of the 1st dataset 28 | kp_prob = 1.0 # keep prob in dropout; set to 1.0 if n_epoch = 1 29 | opt_alg = 'Adagrad' # 'Adam' 30 | eta = 0.05 # learning rate 31 | max_num_lower_ct = 100 # early stop if the metric does not improve over the validation set after max_num_lower_ct times 32 | n_epoch = 1 # number of times to loop over the 1st dataset 33 | record_step_size = 200 # record auc and loss on the validation set after record_step_size times of mini_batch 34 | layer_dim = [512, 256, 1] # prediction subnet FC layer dims, the last is the output layer, must be included 35 | layer_dim_match = [512, 256] # matching subnet FC layer dims 36 | 37 | # second dataset 38 | train_file_name_corr = ['./data/corr.csv'] 39 | batch_size_corr = 128 # batch size of the 2nd dataset 40 | layer_dim_corr = [512, 256] # correlation subnet FC layer dims 41 | n_neg_used_corr = 4 # num of neg ads used for each target ad in the 2nd dataset 42 | n_one_hot_slot_corr = 10 # num of one-hot slots per ad in the 2nd dataset 43 | n_mul_hot_slot_corr = 2 # num of mul-hot slots per ad in the 2nd dataset 44 | max_len_per_slot_corr = 10 # max num of fts per mul-hot slot in the 2nd dataset 45 | num_csv_col_corr = 180 # num of cols in the csv file (2nd dataset) 46 | n_epoch_corr = 2 # number of times to loop over the 2nd dataset 47 | -------------------------------------------------------------------------------- /ctr_funcs.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | import datetime 4 | from sklearn import metrics 5 | 6 | def cal_auc(pred_score, label): 7 | fpr, tpr, thresholds = metrics.roc_curve(label, pred_score, pos_label=1) 8 | auc_val = metrics.auc(fpr, tpr) 9 | return auc_val, fpr, tpr 10 | 11 | def cal_rmse(pred_score, label): 12 | mse = metrics.mean_squared_error(label, pred_score) 13 | rmse = np.sqrt(mse) 14 | return rmse 15 | 16 | def cal_rectified_rmse(pred_score, label, sample_rate): 17 | for idx, item in enumerate(pred_score): 18 | pred_score[idx] = item/(item + (1-item)/sample_rate) 19 | mse = metrics.mean_squared_error(label, pred_score) 20 | rmse = np.sqrt(mse) 21 | return rmse 22 | 23 | # only works for 2D list 24 | def list_flatten(input_list): 25 | output_list = [yy for xx in input_list for yy in xx] 26 | return output_list 27 | 28 | 29 | def count_lines(file_name): 30 | num_lines = sum(1 for line in open(file_name, 'rt')) 31 | return num_lines 32 | 33 | # this func is only for avito data 34 | def tf_read_data(file_name_queue, label_col_idx, record_defaults): 35 | reader = tf.TextLineReader() 36 | key, value = reader.read(file_name_queue) 37 | 38 | # Default values, in case of empty columns. Also specifies the type of the decoded result. 39 | cols = tf.decode_csv(value, record_defaults=record_defaults) 40 | # you can only process the data using tf ops 41 | label = cols.pop(label_col_idx) 42 | feature = cols 43 | # Retrieve a single instance 44 | return feature, label 45 | 46 | def tf_read_data_wo_label(file_name_queue, record_defaults): 47 | reader = tf.TextLineReader() 48 | key, value = reader.read(file_name_queue) 49 | # Default values, in case of empty columns. Also specifies the type of the decoded result. 50 | cols = tf.decode_csv(value, record_defaults=record_defaults) 51 | # you can only process the data using tf ops 52 | feature = cols 53 | # Retrieve a single instance 54 | return feature 55 | 56 | # load training data 57 | record_defaults = [[0]]*141 58 | record_defaults[0] = [0.0] 59 | def tf_input_pipeline(file_names, batch_size, num_epochs=1, label_col_idx=0, record_defaults=record_defaults): 60 | # shuffle over files 61 | file_name_queue = tf.train.string_input_producer(file_names, num_epochs=num_epochs, shuffle=True) 62 | feature, label = tf_read_data(file_name_queue, label_col_idx, record_defaults) 63 | # min_after_dequeue defines how big a buffer we will randomly sample from 64 | # capacity must be larger than min_after_dequeue and the amount larger determines the max we 65 | # will prefetch 66 | min_after_dequeue = 5000 67 | capacity = min_after_dequeue + 3*batch_size 68 | feature_batch, label_batch = tf.train.shuffle_batch([feature, label], \ 69 | batch_size=batch_size, capacity=capacity, min_after_dequeue=min_after_dequeue) 70 | return feature_batch, label_batch 71 | 72 | # without label 73 | def tf_input_pipeline_wo_label(file_names, batch_size, num_epochs=1, record_defaults=record_defaults): 74 | # shuffle over files 75 | file_name_queue = tf.train.string_input_producer(file_names, num_epochs=num_epochs, shuffle=True) 76 | feature = tf_read_data_wo_label(file_name_queue, record_defaults) 77 | # min_after_dequeue defines how big a buffer we will randomly sample from 78 | # capacity must be larger than min_after_dequeue and the amount larger determines the max we 79 | # will prefetch 80 | min_after_dequeue = 5000 81 | capacity = min_after_dequeue + 3*batch_size 82 | feature_batch = tf.train.shuffle_batch([feature], \ 83 | batch_size=batch_size, capacity=capacity, min_after_dequeue=min_after_dequeue) 84 | return feature_batch 85 | 86 | def tf_input_pipeline_test(file_names, batch_size, num_epochs=1, label_col_idx=0, record_defaults=record_defaults): 87 | # shuffle over files 88 | file_name_queue = tf.train.string_input_producer(file_names, num_epochs=num_epochs, shuffle=True) 89 | feature, label = tf_read_data(file_name_queue, label_col_idx, record_defaults) 90 | # min_after_dequeue defines how big a buffer we will randomly sample from 91 | # capacity must be larger than min_after_dequeue and the amount larger determines the max we 92 | # will prefetch 93 | min_after_dequeue = 5000 94 | capacity = min_after_dequeue + 3*batch_size 95 | feature_batch, label_batch = tf.train.batch([feature, label], \ 96 | batch_size=batch_size, capacity=capacity) 97 | return feature_batch, label_batch 98 | 99 | time_style = '%Y-%m-%d %H:%M:%S' 100 | def print_time(): 101 | now = datetime.datetime.now() 102 | time_str = now.strftime(time_style) 103 | print(time_str) 104 | 105 | -------------------------------------------------------------------------------- /data/data_readme.md: -------------------------------------------------------------------------------- 1 | Please reuse the data in project: https://github.com/oywtece/dstn. 2 | 3 | Although the data are prepared for the DSTN model, we can use the part of the label and the target ad for the DeepMP model. 4 | 5 | Please put the "day_1.csv", "day_2.csv" ... files under this "data" folder. 6 | 7 | The config_deepmcp.py file has been updated such that you can run "dnn.py" and "deepmp.py" successfully. 8 | 9 | You can check https://github.com/oywtece/dstn/issues/4 for the meaning of the columns in the csv data files. 10 | -------------------------------------------------------------------------------- /deepcp.py: -------------------------------------------------------------------------------- 1 | # DeepCP - Deep Correlation and Prediction model 2 | 3 | import numpy as np 4 | import tensorflow as tf 5 | import datetime 6 | import ctr_funcs as func 7 | import config_deepmcp as cfg 8 | import os 9 | import shutil 10 | 11 | # config 12 | str_txt = cfg.output_file_name 13 | base_path = './tmp' 14 | model_saving_addr = base_path + '/deepcp_' + str_txt + '/' 15 | output_file_name = base_path + '/deepcp_' + str_txt + '.txt' 16 | num_csv_col = cfg.num_csv_col 17 | train_file_name = cfg.train_file_name 18 | val_file_name = cfg.val_file_name 19 | test_file_name = cfg.test_file_name 20 | batch_size = cfg.batch_size 21 | n_ft = cfg.n_ft 22 | k = cfg.k 23 | kp_prob = cfg.kp_prob 24 | n_epoch = cfg.n_epoch 25 | max_num_lower_ct = cfg.max_num_lower_ct 26 | record_step_size = cfg.record_step_size 27 | layer_dim = cfg.layer_dim 28 | layer_dim_match = cfg.layer_dim_match 29 | eta = cfg.eta # learning rate 30 | opt_alg = cfg.opt_alg 31 | n_one_hot_slot = cfg.n_one_hot_slot 32 | n_mul_hot_slot = cfg.n_mul_hot_slot 33 | max_len_per_slot = cfg.max_len_per_slot 34 | beta = cfg.beta # for correlation loss 35 | label_col_idx = 0 36 | record_defaults = [[0]]*num_csv_col 37 | record_defaults[0] = [0.0] 38 | total_num_ft_col = num_csv_col - 1 39 | 40 | ## corr dataset - no test data for this dataset 41 | train_file_name_corr = cfg.train_file_name_corr 42 | batch_size_corr = cfg.batch_size_corr 43 | layer_dim_corr = cfg.layer_dim_corr 44 | n_one_hot_slot_corr = cfg.n_one_hot_slot_corr 45 | n_mul_hot_slot_corr = cfg.n_mul_hot_slot_corr 46 | max_len_per_slot_corr = cfg.max_len_per_slot_corr 47 | n_epoch_corr = cfg.n_epoch_corr 48 | n_neg_used_corr = cfg.n_neg_used_corr 49 | # no label 50 | num_csv_col_corr = cfg.num_csv_col_corr 51 | record_defaults_corr = [[0]]*num_csv_col_corr 52 | total_num_ft_col_corr = num_csv_col_corr 53 | 54 | # create dir 55 | if not os.path.exists(base_path): 56 | os.mkdir(base_path) 57 | 58 | # remove dir 59 | if os.path.isdir(model_saving_addr): 60 | shutil.rmtree(model_saving_addr) 61 | 62 | # for DNN 63 | idx_1 = n_one_hot_slot 64 | idx_2 = idx_1 + n_mul_hot_slot*max_len_per_slot 65 | 66 | ########################################################### 67 | ########################################################### 68 | print('Loading data start!') 69 | tf.set_random_seed(123) 70 | 71 | # load training data 72 | train_ft, train_label = func.tf_input_pipeline(train_file_name, batch_size, n_epoch, label_col_idx, record_defaults) 73 | 74 | n_val_inst = func.count_lines(val_file_name[0]) 75 | val_ft, val_label = func.tf_input_pipeline(val_file_name, n_val_inst, 1, label_col_idx, record_defaults) 76 | n_val_batch = n_val_inst//batch_size 77 | 78 | # load test data 79 | test_ft, test_label = func.tf_input_pipeline_test(test_file_name, batch_size, 1, label_col_idx, record_defaults) 80 | print('Loading data set 1 done!') 81 | 82 | # load training data 83 | train_ft_corr = func.tf_input_pipeline_wo_label(train_file_name_corr, batch_size_corr, n_epoch_corr, record_defaults_corr) 84 | print('Loading data set 2 done!') 85 | 86 | ######################################################################## 87 | # partition input for correlation loss 88 | def partition_input_corr(x_input_corr): 89 | # generate idx_list 90 | len_list = [] 91 | 92 | # 2 - tar & ctxt 93 | for i in range(n_neg_used_corr+2): 94 | len_list.append(n_one_hot_slot_corr) 95 | len_list.append(n_mul_hot_slot_corr*max_len_per_slot_corr) 96 | 97 | len_list = np.array(len_list) 98 | idx_list = np.cumsum(len_list) 99 | 100 | x_tar_one_hot_corr = x_input_corr[:, 0:idx_list[0]] 101 | x_tar_mul_hot_corr = x_input_corr[:, idx_list[0]:idx_list[1]] 102 | # shape=[None, n_mul_hot_slot, max_len_per_slot] 103 | x_tar_mul_hot_corr = tf.reshape(x_tar_mul_hot_corr, (-1, n_mul_hot_slot_corr, max_len_per_slot_corr)) 104 | 105 | x_input_one_hot_dict_corr = {} 106 | x_input_mul_hot_dict_corr = {} 107 | 108 | for i in range(n_neg_used_corr+1): 109 | x_input_one_hot_dict_corr[i] = x_input_corr[:, idx_list[2*i+1]:idx_list[2*i+2]] 110 | temp = x_input_corr[:, idx_list[2*i+2]:idx_list[2*i+3]] 111 | x_input_mul_hot_dict_corr[i] = tf.reshape(temp, (-1, n_mul_hot_slot_corr, max_len_per_slot_corr)) 112 | 113 | return x_tar_one_hot_corr, x_tar_mul_hot_corr, x_input_one_hot_dict_corr, x_input_mul_hot_dict_corr 114 | 115 | # add mask 116 | def get_masked_one_hot(x_input_one_hot): 117 | data_mask = tf.cast(tf.greater(x_input_one_hot, 0), tf.float32) 118 | data_mask = tf.expand_dims(data_mask, axis = 2) 119 | data_mask = tf.tile(data_mask, (1,1,k)) 120 | # output: (?, n_one_hot_slot, k) 121 | data_embed_one_hot = tf.nn.embedding_lookup(emb_mat, x_input_one_hot) 122 | data_embed_one_hot_masked = tf.multiply(data_embed_one_hot, data_mask) 123 | return data_embed_one_hot_masked 124 | 125 | def get_masked_mul_hot(x_input_mul_hot): 126 | data_mask = tf.cast(tf.greater(x_input_mul_hot, 0), tf.float32) 127 | data_mask = tf.expand_dims(data_mask, axis = 3) 128 | data_mask = tf.tile(data_mask, (1,1,1,k)) 129 | # output: (?, n_mul_hot_slot, max_len_per_slot, k) 130 | data_embed_mul_hot = tf.nn.embedding_lookup(emb_mat, x_input_mul_hot) 131 | data_embed_mul_hot_masked = tf.multiply(data_embed_mul_hot, data_mask) 132 | # output: (?, n_mul_hot_slot, k) 133 | data_embed_mul_hot_masked = tf.reduce_sum(data_embed_mul_hot_masked, 2) 134 | return data_embed_mul_hot_masked 135 | 136 | # output: (?, n_one_hot_slot + n_mul_hot_slot, k) 137 | def get_concate_embed(x_input_one_hot, x_input_mul_hot): 138 | data_embed_one_hot = get_masked_one_hot(x_input_one_hot) 139 | data_embed_mul_hot = get_masked_mul_hot(x_input_mul_hot) 140 | data_embed_concat = tf.concat([data_embed_one_hot, data_embed_mul_hot], 1) 141 | return data_embed_concat 142 | 143 | # input: (?, n_slot*k) 144 | # output: (?, 1) 145 | def get_pred_output(data_embed_concat): 146 | # include output layer 147 | n_layer = len(layer_dim) 148 | data_embed_dnn = tf.reshape(data_embed_concat, [-1, (n_one_hot_slot + n_mul_hot_slot)*k]) 149 | cur_layer = data_embed_dnn 150 | # loop to create DNN struct 151 | for i in range(0, n_layer): 152 | # output layer, linear activation 153 | if i == n_layer - 1: 154 | cur_layer = tf.matmul(cur_layer, weight_dict[i]) + bias_dict[i] 155 | else: 156 | cur_layer = tf.nn.relu(tf.matmul(cur_layer, weight_dict[i]) + bias_dict[i]) 157 | cur_layer = tf.nn.dropout(cur_layer, keep_prob) 158 | 159 | y_hat = cur_layer 160 | return y_hat 161 | 162 | # correlation loss input 163 | def get_corr_output(x_input_corr): 164 | x_tar_one_hot_corr, x_tar_mul_hot_corr, x_input_one_hot_dict_corr, x_input_mul_hot_dict_corr = \ 165 | partition_input_corr(x_input_corr) 166 | 167 | data_embed_tar = get_concate_embed(x_tar_one_hot_corr, x_tar_mul_hot_corr) 168 | data_vec_tar = tf.reshape(data_embed_tar, [-1, (n_one_hot_slot_corr + n_mul_hot_slot_corr)*k]) 169 | 170 | n_layer_corr = len(layer_dim_corr) 171 | cur_layer = data_vec_tar 172 | for i in range(0, n_layer_corr): 173 | if i == n_layer_corr - 1: 174 | cur_layer = tf.nn.tanh(tf.matmul(cur_layer, weight_dict_corr[i]) + bias_dict_corr[i]) 175 | else: 176 | cur_layer = tf.nn.relu(tf.matmul(cur_layer, weight_dict_corr[i]) + bias_dict_corr[i]) 177 | data_rep_tar = cur_layer 178 | 179 | # idx 0 - pos, idx 1 -- neg 180 | inner_prod_dict = {} 181 | for mm in range(n_neg_used_corr + 1): 182 | cur_data_embed = get_concate_embed(x_input_one_hot_dict_corr[mm], \ 183 | x_input_mul_hot_dict_corr[mm]) 184 | cur_data_vec = tf.reshape(cur_data_embed, [-1, (n_one_hot_slot_corr + n_mul_hot_slot_corr)*k]) 185 | cur_layer = cur_data_vec 186 | for i in range(0, n_layer_corr): 187 | if i == n_layer_corr - 1: 188 | cur_layer = tf.nn.tanh(tf.matmul(cur_layer, weight_dict_corr[i]) + bias_dict_corr[i]) 189 | else: 190 | cur_layer = tf.nn.relu(tf.matmul(cur_layer, weight_dict_corr[i]) + bias_dict_corr[i]) 191 | cur_data_rep = cur_layer 192 | # each ele - None*1 193 | inner_prod_dict[mm] = tf.reduce_sum(tf.multiply(data_rep_tar, cur_data_rep), 1, \ 194 | keep_dims=True) 195 | 196 | return inner_prod_dict 197 | 198 | ########################################################### 199 | ########################################################### 200 | # input for l1 - prediction loss 201 | x_input = tf.placeholder(tf.int32, shape=[None, total_num_ft_col]) 202 | # shape=[None, n_one_hot_slot] 203 | x_input_one_hot = x_input[:, 0:idx_1] 204 | x_input_mul_hot = x_input[:, idx_1:idx_2] 205 | # shape=[None, n_mul_hot_slot, max_len_per_slot] 206 | x_input_mul_hot = tf.reshape(x_input_mul_hot, (-1, n_mul_hot_slot, max_len_per_slot)) 207 | 208 | # input for corr loss 209 | x_input_corr = tf.placeholder(tf.int32, shape=[None, total_num_ft_col_corr]) 210 | 211 | # target vec for l1 212 | y_target = tf.placeholder(tf.float32, shape=[None, 1]) 213 | 214 | # dropout keep prob 215 | keep_prob = tf.placeholder(tf.float32) 216 | # emb_mat dim add 1 -> for padding (idx = 0) 217 | with tf.device('/cpu:0'): 218 | emb_mat = tf.Variable(tf.random_normal([n_ft + 1, k], stddev=0.01)) 219 | 220 | ################################ 221 | # prediction subnet FC layers, including output layer 222 | n_layer = len(layer_dim) 223 | in_dim = (n_one_hot_slot + n_mul_hot_slot)*k 224 | weight_dict = {} 225 | bias_dict = {} 226 | 227 | # loop to create DNN vars 228 | for i in range(0, n_layer): 229 | out_dim = layer_dim[i] 230 | weight_dict[i] = tf.Variable(tf.random_normal(shape=[in_dim, out_dim], stddev=np.sqrt(2.0/(in_dim+out_dim)))) 231 | bias_dict[i] = tf.Variable(tf.constant(0.0, shape=[out_dim])) 232 | in_dim = layer_dim[i] 233 | 234 | ################################ 235 | # correlation subnet FC layers 236 | n_layer_corr = len(layer_dim_corr) 237 | in_dim_corr = (n_one_hot_slot_corr + n_mul_hot_slot_corr)*k 238 | weight_dict_corr = {} 239 | bias_dict_corr = {} 240 | 241 | for i in range(0, n_layer_corr): 242 | out_dim_corr = layer_dim_corr[i] 243 | weight_dict_corr[i] = tf.Variable(tf.random_normal(shape=[in_dim_corr, out_dim_corr],\ 244 | stddev=np.sqrt(2.0/(in_dim_corr+out_dim_corr)))) 245 | bias_dict_corr[i] = tf.Variable(tf.constant(0.0, shape=[out_dim_corr])) 246 | in_dim_corr = layer_dim_corr[i] 247 | ################################ 248 | 249 | data_embed_concat = get_concate_embed(x_input_one_hot, x_input_mul_hot) 250 | y_hat = get_pred_output(data_embed_concat) 251 | inner_prod_dict_corr = get_corr_output(x_input_corr) 252 | 253 | loss_ctr = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=y_hat, labels=y_target)) 254 | # logloss 255 | y_corr_cast_1 = tf.ones_like(inner_prod_dict_corr[0]) 256 | y_corr_cast_0 = tf.zeros_like(inner_prod_dict_corr[0]) 257 | # pos 258 | loss_corr = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=inner_prod_dict_corr[0], \ 259 | labels=y_corr_cast_1)) 260 | # neg 261 | for i in range(n_neg_used_corr): 262 | loss_corr += tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=inner_prod_dict_corr[i+1], \ 263 | labels=y_corr_cast_0)) 264 | 265 | loss = loss_ctr + beta*loss_corr 266 | 267 | ############################# 268 | # prediction 269 | ############################# 270 | pred_score = tf.sigmoid(y_hat) 271 | 272 | if opt_alg == 'Adam': 273 | optimizer = tf.train.AdamOptimizer(eta).minimize(loss) 274 | else: 275 | # default 276 | optimizer = tf.train.AdagradOptimizer(eta).minimize(loss) 277 | 278 | ######################################## 279 | # Launch the graph. 280 | config = tf.ConfigProto(log_device_placement=False) 281 | config.gpu_options.allow_growth = True 282 | config.gpu_options.per_process_gpu_memory_fraction = 0.3 283 | 284 | with tf.Session(config=config) as sess: 285 | sess.run(tf.global_variables_initializer()) 286 | sess.run(tf.local_variables_initializer()) 287 | coord = tf.train.Coordinator() 288 | threads = tf.train.start_queue_runners(sess, coord) 289 | 290 | saver_val = tf.train.Saver() 291 | train_loss_list = [] 292 | val_auc_list = [] 293 | best_n_round = 0 294 | best_val_auc = 0 295 | lower_ct = 0 296 | early_stop_flag = 0 297 | 298 | val_ft_inst, val_label_inst = sess.run([val_ft, val_label]) 299 | 300 | func.print_time() 301 | print('Start train loop') 302 | 303 | epoch = -1 304 | try: 305 | while not coord.should_stop(): 306 | epoch += 1 307 | train_ft_inst, train_label_inst = sess.run([train_ft, train_label]) 308 | train_label_inst = np.transpose([train_label_inst]) 309 | 310 | train_ft_corr_inst = sess.run(train_ft_corr) 311 | 312 | # training 313 | sess.run(optimizer, feed_dict={x_input:train_ft_inst, y_target:train_label_inst, \ 314 | x_input_corr:train_ft_corr_inst, keep_prob:kp_prob}) 315 | 316 | # record loss and accuracy every step_size generations 317 | if (epoch+1)%record_step_size == 0: 318 | train_loss_temp = sess.run(loss, feed_dict={ \ 319 | x_input:train_ft_inst, y_target:train_label_inst, \ 320 | x_input_corr:train_ft_corr_inst, keep_prob:1.0}) 321 | train_loss_list.append(train_loss_temp) 322 | 323 | val_pred_score_all = [] 324 | val_label_all = [] 325 | 326 | for iii in range(n_val_batch): 327 | # get batch 328 | start_idx = iii*batch_size 329 | end_idx = (iii+1)*batch_size 330 | cur_val_ft = val_ft_inst[start_idx: end_idx] 331 | cur_val_label = val_label_inst[start_idx: end_idx] 332 | # pred score 333 | cur_val_pred_score = sess.run(pred_score, feed_dict={ \ 334 | x_input:cur_val_ft, keep_prob:1.0}) 335 | val_pred_score_all.append(cur_val_pred_score.flatten()) 336 | val_label_all.append(cur_val_label) 337 | 338 | # calculate auc 339 | val_pred_score_re = func.list_flatten(val_pred_score_all) 340 | val_label_re = func.list_flatten(val_label_all) 341 | val_auc_temp, _, _ = func.cal_auc(val_pred_score_re, val_label_re) 342 | # record all val results 343 | val_auc_list.append(val_auc_temp) 344 | 345 | # record best and save models 346 | if val_auc_temp > best_val_auc: 347 | best_val_auc = val_auc_temp 348 | best_n_round = epoch 349 | # Save the variables to disk 350 | save_path = saver_val.save(sess, model_saving_addr) 351 | print("Model saved in: %s" % save_path) 352 | # count of consecutive lower 353 | if val_auc_temp < best_val_auc: 354 | lower_ct += 1 355 | # once higher or equal, set to 0 356 | else: 357 | lower_ct = 0 358 | 359 | if lower_ct >= max_num_lower_ct: 360 | early_stop_flag = 1 361 | 362 | auc_and_loss = [epoch+1, train_loss_temp, val_auc_temp] 363 | # round to given number of decimals 364 | auc_and_loss = [np.round(xx,4) for xx in auc_and_loss] 365 | func.print_time() 366 | print('Generation # {}. Train Loss: {:.4f}. Val Avg AUC: {:.4f}.'\ 367 | .format(*auc_and_loss)) 368 | 369 | # stop while loop 370 | if early_stop_flag == 1: 371 | break 372 | 373 | except tf.errors.OutOfRangeError: 374 | func.print_time() 375 | print('Done training -- epoch limit reached') 376 | 377 | # restore model 378 | saver_val.restore(sess, model_saving_addr) 379 | print("Model restored.") 380 | 381 | # load test data 382 | test_pred_score_all = [] 383 | test_label_all = [] 384 | test_loss_all = [] 385 | try: 386 | while True: 387 | test_ft_inst, test_label_inst = sess.run([test_ft, test_label]) 388 | cur_test_pred_score = sess.run(pred_score, feed_dict={ \ 389 | x_input:test_ft_inst, keep_prob:1.0}) 390 | test_pred_score_all.append(cur_test_pred_score.flatten()) 391 | test_label_all.append(test_label_inst) 392 | 393 | cur_test_loss = sess.run(loss_ctr, feed_dict={ \ 394 | x_input:test_ft_inst, \ 395 | y_target: np.transpose([test_label_inst]), keep_prob:1.0}) 396 | test_loss_all.append(cur_test_loss) 397 | 398 | except tf.errors.OutOfRangeError: 399 | func.print_time() 400 | print('Done testing -- epoch limit reached') 401 | finally: 402 | coord.request_stop() 403 | 404 | coord.join(threads) 405 | 406 | # calculate auc 407 | test_pred_score_re = func.list_flatten(test_pred_score_all) 408 | test_label_re = func.list_flatten(test_label_all) 409 | test_auc, _, _ = func.cal_auc(test_pred_score_re, test_label_re) 410 | test_rmse = func.cal_rmse(test_pred_score_re, test_label_re) 411 | test_loss = np.mean(test_loss_all) 412 | 413 | # rounding 414 | test_auc = np.round(test_auc, 4) 415 | test_rmse = np.round(test_rmse, 4) 416 | test_loss = np.round(test_loss, 5) 417 | train_loss_list = [np.round(xx,4) for xx in train_loss_list] 418 | val_auc_list = [np.round(xx,4) for xx in val_auc_list] 419 | 420 | print('test_auc = ', test_auc) 421 | print('test_rmse =', test_rmse) 422 | print('test_loss =', test_loss) 423 | print('train_loss_list =', train_loss_list) 424 | print('val_auc_list =', val_auc_list) 425 | 426 | # write output to file 427 | with open(output_file_name, 'a') as f: 428 | now = datetime.datetime.now() 429 | time_str = now.strftime(cfg.time_style) 430 | f.write(time_str + '\n') 431 | f.write('train_file_name = ' + train_file_name[0] + '\n') 432 | f.write('learning_rate = ' + str(eta) \ 433 | + ', beta = ' + str(beta) \ 434 | + ', n_epoch = ' + str(n_epoch) \ 435 | + ', emb_dize = ' + str(k) + '\n') 436 | f.write('test_auc = ' + str(test_auc) + '\n') 437 | f.write('test_rmse = ' + str(test_rmse) + '\n') 438 | f.write('test_loss = ' + str(test_loss) + '\n') 439 | f.write('train_loss_list =' + str(train_loss_list) + '\n') 440 | f.write('val_auc_list =' + str(val_auc_list) + '\n') 441 | f.write('-'*50 + '\n') 442 | -------------------------------------------------------------------------------- /deepmcp.py: -------------------------------------------------------------------------------- 1 | # DeepMCP - Deep Matching, Correlation and Prediction model 2 | 3 | import numpy as np 4 | import tensorflow as tf 5 | import datetime 6 | import ctr_funcs as func 7 | import config_deepmcp as cfg 8 | import os 9 | import shutil 10 | 11 | # config 12 | str_txt = cfg.output_file_name 13 | base_path = './tmp' 14 | model_saving_addr = base_path + '/deepmcp_' + str_txt + '/' 15 | output_file_name = base_path + '/deepmcp_' + str_txt + '.txt' 16 | num_csv_col = cfg.num_csv_col 17 | train_file_name = cfg.train_file_name 18 | val_file_name = cfg.val_file_name 19 | test_file_name = cfg.test_file_name 20 | batch_size = cfg.batch_size 21 | n_ft = cfg.n_ft 22 | k = cfg.k 23 | kp_prob = cfg.kp_prob 24 | n_epoch = cfg.n_epoch 25 | max_num_lower_ct = cfg.max_num_lower_ct 26 | record_step_size = cfg.record_step_size 27 | layer_dim = cfg.layer_dim 28 | layer_dim_match = cfg.layer_dim_match 29 | eta = cfg.eta # learning rate 30 | opt_alg = cfg.opt_alg 31 | n_one_hot_slot = cfg.n_one_hot_slot 32 | n_mul_hot_slot = cfg.n_mul_hot_slot 33 | max_len_per_slot = cfg.max_len_per_slot 34 | alpha = cfg.alpha # for matching loss 35 | beta = cfg.beta # for correlation loss 36 | user_ft_idx = cfg.user_ft_idx 37 | ad_ft_idx = cfg.ad_ft_idx 38 | n_user_ft = len(user_ft_idx) 39 | n_ad_ft = len(ad_ft_idx) 40 | label_col_idx = 0 41 | record_defaults = [[0]]*num_csv_col 42 | record_defaults[0] = [0.0] 43 | total_num_ft_col = num_csv_col - 1 44 | 45 | ## corr dataset - no test data for this dataset 46 | train_file_name_corr = cfg.train_file_name_corr 47 | batch_size_corr = cfg.batch_size_corr 48 | layer_dim_corr = cfg.layer_dim_corr 49 | n_one_hot_slot_corr = cfg.n_one_hot_slot_corr 50 | n_mul_hot_slot_corr = cfg.n_mul_hot_slot_corr 51 | max_len_per_slot_corr = cfg.max_len_per_slot_corr 52 | n_epoch_corr = cfg.n_epoch_corr 53 | n_neg_used_corr = cfg.n_neg_used_corr 54 | # no label 55 | num_csv_col_corr = cfg.num_csv_col_corr 56 | record_defaults_corr = [[0]]*num_csv_col_corr 57 | total_num_ft_col_corr = num_csv_col_corr 58 | 59 | # create dir 60 | if not os.path.exists(base_path): 61 | os.mkdir(base_path) 62 | 63 | # remove dir 64 | if os.path.isdir(model_saving_addr): 65 | shutil.rmtree(model_saving_addr) 66 | 67 | # for DNN 68 | idx_1 = n_one_hot_slot 69 | idx_2 = idx_1 + n_mul_hot_slot*max_len_per_slot 70 | 71 | ########################################################### 72 | ########################################################### 73 | print('Loading data start!') 74 | tf.set_random_seed(123) 75 | 76 | # load training data 77 | train_ft, train_label = func.tf_input_pipeline(train_file_name, batch_size, n_epoch, label_col_idx, record_defaults) 78 | 79 | n_val_inst = func.count_lines(val_file_name[0]) 80 | val_ft, val_label = func.tf_input_pipeline(val_file_name, n_val_inst, 1, label_col_idx, record_defaults) 81 | n_val_batch = n_val_inst//batch_size 82 | 83 | # load test data 84 | test_ft, test_label = func.tf_input_pipeline_test(test_file_name, batch_size, 1, label_col_idx, record_defaults) 85 | print('Loading data set 1 done!') 86 | 87 | # load training data 88 | train_ft_corr = func.tf_input_pipeline_wo_label(train_file_name_corr, batch_size_corr, n_epoch_corr, record_defaults_corr) 89 | print('Loading data set 2 done!') 90 | 91 | ######################################################################## 92 | # partition input for correlation loss 93 | def partition_input_corr(x_input_corr): 94 | # generate idx_list 95 | len_list = [] 96 | 97 | # 2 - tar & ctxt 98 | for i in range(n_neg_used_corr+2): 99 | len_list.append(n_one_hot_slot_corr) 100 | len_list.append(n_mul_hot_slot_corr*max_len_per_slot_corr) 101 | 102 | len_list = np.array(len_list) 103 | idx_list = np.cumsum(len_list) 104 | 105 | x_tar_one_hot_corr = x_input_corr[:, 0:idx_list[0]] 106 | x_tar_mul_hot_corr = x_input_corr[:, idx_list[0]:idx_list[1]] 107 | # shape=[None, n_mul_hot_slot, max_len_per_slot] 108 | x_tar_mul_hot_corr = tf.reshape(x_tar_mul_hot_corr, (-1, n_mul_hot_slot_corr, max_len_per_slot_corr)) 109 | 110 | x_input_one_hot_dict_corr = {} 111 | x_input_mul_hot_dict_corr = {} 112 | 113 | for i in range(n_neg_used_corr+1): 114 | x_input_one_hot_dict_corr[i] = x_input_corr[:, idx_list[2*i+1]:idx_list[2*i+2]] 115 | temp = x_input_corr[:, idx_list[2*i+2]:idx_list[2*i+3]] 116 | x_input_mul_hot_dict_corr[i] = tf.reshape(temp, (-1, n_mul_hot_slot_corr, max_len_per_slot_corr)) 117 | 118 | return x_tar_one_hot_corr, x_tar_mul_hot_corr, x_input_one_hot_dict_corr, x_input_mul_hot_dict_corr 119 | 120 | # add mask 121 | def get_masked_one_hot(x_input_one_hot): 122 | data_mask = tf.cast(tf.greater(x_input_one_hot, 0), tf.float32) 123 | data_mask = tf.expand_dims(data_mask, axis = 2) 124 | data_mask = tf.tile(data_mask, (1,1,k)) 125 | # output: (?, n_one_hot_slot, k) 126 | data_embed_one_hot = tf.nn.embedding_lookup(emb_mat, x_input_one_hot) 127 | data_embed_one_hot_masked = tf.multiply(data_embed_one_hot, data_mask) 128 | return data_embed_one_hot_masked 129 | 130 | def get_masked_mul_hot(x_input_mul_hot): 131 | data_mask = tf.cast(tf.greater(x_input_mul_hot, 0), tf.float32) 132 | data_mask = tf.expand_dims(data_mask, axis = 3) 133 | data_mask = tf.tile(data_mask, (1,1,1,k)) 134 | # output: (?, n_mul_hot_slot, max_len_per_slot, k) 135 | data_embed_mul_hot = tf.nn.embedding_lookup(emb_mat, x_input_mul_hot) 136 | data_embed_mul_hot_masked = tf.multiply(data_embed_mul_hot, data_mask) 137 | # output: (?, n_mul_hot_slot, k) 138 | data_embed_mul_hot_masked = tf.reduce_sum(data_embed_mul_hot_masked, 2) 139 | return data_embed_mul_hot_masked 140 | 141 | # output: (?, n_one_hot_slot + n_mul_hot_slot, k) 142 | def get_concate_embed(x_input_one_hot, x_input_mul_hot): 143 | data_embed_one_hot = get_masked_one_hot(x_input_one_hot) 144 | data_embed_mul_hot = get_masked_mul_hot(x_input_mul_hot) 145 | data_embed_concat = tf.concat([data_embed_one_hot, data_embed_mul_hot], 1) 146 | return data_embed_concat 147 | 148 | # input: (?, n_slot*k) 149 | # output: (?, 1) 150 | def get_pred_output(data_embed_concat): 151 | # include output layer 152 | n_layer = len(layer_dim) 153 | data_embed_dnn = tf.reshape(data_embed_concat, [-1, (n_one_hot_slot + n_mul_hot_slot)*k]) 154 | cur_layer = data_embed_dnn 155 | # loop to create DNN struct 156 | for i in range(0, n_layer): 157 | # output layer, linear activation 158 | if i == n_layer - 1: 159 | cur_layer = tf.matmul(cur_layer, weight_dict[i]) + bias_dict[i] 160 | else: 161 | cur_layer = tf.nn.relu(tf.matmul(cur_layer, weight_dict[i]) + bias_dict[i]) 162 | cur_layer = tf.nn.dropout(cur_layer, keep_prob) 163 | 164 | y_hat = cur_layer 165 | return y_hat 166 | 167 | # matching loss input 168 | def get_match_output(data_embed_concat): 169 | cur_idx = user_ft_idx[0] 170 | user_ft_cols = data_embed_concat[:, cur_idx:cur_idx+1, :] 171 | for i in range(1, len(user_ft_idx)): 172 | cur_idx = user_ft_idx[i] 173 | cur_x = data_embed_concat[:, cur_idx:cur_idx+1, :] 174 | user_ft_cols = tf.concat([user_ft_cols, cur_x], 1) 175 | 176 | cur_idx = ad_ft_idx[0] 177 | ad_ft_cols = data_embed_concat[:, cur_idx:cur_idx+1, :] 178 | for i in range(1, len(ad_ft_idx)): 179 | cur_idx = ad_ft_idx[i] 180 | cur_x = data_embed_concat[:, cur_idx:cur_idx+1, :] 181 | ad_ft_cols = tf.concat([ad_ft_cols, cur_x], 1) 182 | 183 | user_ft_vec = tf.reshape(user_ft_cols, [-1, n_user_ft*k]) 184 | ad_ft_vec = tf.reshape(ad_ft_cols, [-1, n_ad_ft*k]) 185 | 186 | n_layer_match = len(layer_dim_match) 187 | cur_layer = user_ft_vec 188 | for i in range(0, n_layer_match): 189 | if i == n_layer_match - 1: 190 | cur_layer = tf.nn.tanh(tf.matmul(cur_layer, weight_dict_user[i]) + bias_dict_user[i]) 191 | else: 192 | cur_layer = tf.nn.relu(tf.matmul(cur_layer, weight_dict_user[i]) + bias_dict_user[i]) 193 | user_rep = cur_layer 194 | 195 | cur_layer = ad_ft_vec 196 | for i in range(0, n_layer_match): 197 | if i == n_layer_match - 1: 198 | cur_layer = tf.nn.tanh(tf.matmul(cur_layer, weight_dict_ad[i]) + bias_dict_ad[i]) 199 | else: 200 | cur_layer = tf.nn.relu(tf.matmul(cur_layer, weight_dict_ad[i]) + bias_dict_ad[i]) 201 | ad_rep = cur_layer 202 | 203 | # (?*mk) x (?*mk) -> (?*1) 204 | inner_prod = tf.reduce_sum(tf.multiply(user_rep, ad_rep), 1, keep_dims=True) 205 | return inner_prod 206 | 207 | # correlation loss input 208 | def get_corr_output(x_input_corr): 209 | x_tar_one_hot_corr, x_tar_mul_hot_corr, x_input_one_hot_dict_corr, x_input_mul_hot_dict_corr = \ 210 | partition_input_corr(x_input_corr) 211 | 212 | data_embed_tar = get_concate_embed(x_tar_one_hot_corr, x_tar_mul_hot_corr) 213 | data_vec_tar = tf.reshape(data_embed_tar, [-1, (n_one_hot_slot_corr + n_mul_hot_slot_corr)*k]) 214 | 215 | n_layer_corr = len(layer_dim_corr) 216 | cur_layer = data_vec_tar 217 | for i in range(0, n_layer_corr): 218 | if i == n_layer_corr - 1: 219 | cur_layer = tf.nn.tanh(tf.matmul(cur_layer, weight_dict_corr[i]) + bias_dict_corr[i]) 220 | else: 221 | cur_layer = tf.nn.relu(tf.matmul(cur_layer, weight_dict_corr[i]) + bias_dict_corr[i]) 222 | data_rep_tar = cur_layer 223 | 224 | # idx 0 - pos, idx 1 -- neg 225 | inner_prod_dict = {} 226 | for mm in range(n_neg_used_corr + 1): 227 | cur_data_embed = get_concate_embed(x_input_one_hot_dict_corr[mm], \ 228 | x_input_mul_hot_dict_corr[mm]) 229 | cur_data_vec = tf.reshape(cur_data_embed, [-1, (n_one_hot_slot_corr + n_mul_hot_slot_corr)*k]) 230 | cur_layer = cur_data_vec 231 | for i in range(0, n_layer_corr): 232 | if i == n_layer_corr - 1: 233 | cur_layer = tf.nn.tanh(tf.matmul(cur_layer, weight_dict_corr[i]) + bias_dict_corr[i]) 234 | else: 235 | cur_layer = tf.nn.relu(tf.matmul(cur_layer, weight_dict_corr[i]) + bias_dict_corr[i]) 236 | cur_data_rep = cur_layer 237 | # each ele - None*1 238 | inner_prod_dict[mm] = tf.reduce_sum(tf.multiply(data_rep_tar, cur_data_rep), 1, \ 239 | keep_dims=True) 240 | 241 | return inner_prod_dict 242 | 243 | ########################################################### 244 | ########################################################### 245 | # input for l1 - prediction loss 246 | x_input = tf.placeholder(tf.int32, shape=[None, total_num_ft_col]) 247 | # shape=[None, n_one_hot_slot] 248 | x_input_one_hot = x_input[:, 0:idx_1] 249 | x_input_mul_hot = x_input[:, idx_1:idx_2] 250 | # shape=[None, n_mul_hot_slot, max_len_per_slot] 251 | x_input_mul_hot = tf.reshape(x_input_mul_hot, (-1, n_mul_hot_slot, max_len_per_slot)) 252 | 253 | # input for corr loss 254 | x_input_corr = tf.placeholder(tf.int32, shape=[None, total_num_ft_col_corr]) 255 | 256 | # target vec for l1 257 | y_target = tf.placeholder(tf.float32, shape=[None, 1]) 258 | 259 | # dropout keep prob 260 | keep_prob = tf.placeholder(tf.float32) 261 | # emb_mat dim add 1 -> for padding (idx = 0) 262 | with tf.device('/cpu:0'): 263 | emb_mat = tf.Variable(tf.random_normal([n_ft + 1, k], stddev=0.01)) 264 | 265 | ################################ 266 | # prediction subnet FC layers, including output layer 267 | n_layer = len(layer_dim) 268 | in_dim = (n_one_hot_slot + n_mul_hot_slot)*k 269 | weight_dict = {} 270 | bias_dict = {} 271 | 272 | # loop to create DNN vars 273 | for i in range(0, n_layer): 274 | out_dim = layer_dim[i] 275 | weight_dict[i] = tf.Variable(tf.random_normal(shape=[in_dim, out_dim], stddev=np.sqrt(2.0/(in_dim+out_dim)))) 276 | bias_dict[i] = tf.Variable(tf.constant(0.0, shape=[out_dim])) 277 | in_dim = layer_dim[i] 278 | 279 | ################################ 280 | # matching subnet FC layers 281 | n_layer_match = len(layer_dim_match) 282 | in_dim_user = n_user_ft*k 283 | weight_dict_user={} 284 | bias_dict_user={} 285 | 286 | in_dim_ad = n_ad_ft*k 287 | weight_dict_ad={} 288 | bias_dict_ad={} 289 | 290 | for i in range(0, n_layer_match): 291 | out_dim_user = layer_dim_match[i] 292 | weight_dict_user[i] = tf.Variable(tf.random_normal(shape=[in_dim_user, out_dim_user],\ 293 | stddev=np.sqrt(2.0/(in_dim_user+out_dim_user)))) 294 | bias_dict_user[i] = tf.Variable(tf.constant(0.0, shape=[out_dim_user])) 295 | in_dim_user = layer_dim_match[i] 296 | 297 | for i in range(0, n_layer_match): 298 | out_dim_ad = layer_dim_match[i] 299 | weight_dict_ad[i] = tf.Variable(tf.random_normal(shape=[in_dim_ad, out_dim_ad],\ 300 | stddev=np.sqrt(2.0/(in_dim_ad+out_dim_ad)))) 301 | bias_dict_ad[i] = tf.Variable(tf.constant(0.0, shape=[out_dim_ad])) 302 | in_dim_ad = layer_dim_match[i] 303 | 304 | 305 | ################################ 306 | # correlation subnet FC layers 307 | n_layer_corr = len(layer_dim_corr) 308 | in_dim_corr = (n_one_hot_slot_corr + n_mul_hot_slot_corr)*k 309 | weight_dict_corr = {} 310 | bias_dict_corr = {} 311 | 312 | for i in range(0, n_layer_corr): 313 | out_dim_corr = layer_dim_corr[i] 314 | weight_dict_corr[i] = tf.Variable(tf.random_normal(shape=[in_dim_corr, out_dim_corr],\ 315 | stddev=np.sqrt(2.0/(in_dim_corr+out_dim_corr)))) 316 | bias_dict_corr[i] = tf.Variable(tf.constant(0.0, shape=[out_dim_corr])) 317 | in_dim_corr = layer_dim_corr[i] 318 | ################################ 319 | 320 | data_embed_concat = get_concate_embed(x_input_one_hot, x_input_mul_hot) 321 | y_hat = get_pred_output(data_embed_concat) 322 | y_hat_match = get_match_output(data_embed_concat) 323 | inner_prod_dict_corr = get_corr_output(x_input_corr) 324 | 325 | loss_ctr = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=y_hat, labels=y_target)) 326 | loss_match = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=y_hat_match, labels=y_target)) 327 | 328 | # logloss 329 | y_corr_cast_1 = tf.ones_like(inner_prod_dict_corr[0]) 330 | y_corr_cast_0 = tf.zeros_like(inner_prod_dict_corr[0]) 331 | # pos 332 | loss_corr = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=inner_prod_dict_corr[0], \ 333 | labels=y_corr_cast_1)) 334 | # neg 335 | for i in range(n_neg_used_corr): 336 | loss_corr += tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=inner_prod_dict_corr[i+1], \ 337 | labels=y_corr_cast_0)) 338 | 339 | loss = loss_ctr + alpha*loss_match + beta*loss_corr 340 | 341 | ############################# 342 | # prediction 343 | ############################# 344 | pred_score = tf.sigmoid(y_hat) 345 | 346 | if opt_alg == 'Adam': 347 | optimizer = tf.train.AdamOptimizer(eta).minimize(loss) 348 | else: 349 | # default 350 | optimizer = tf.train.AdagradOptimizer(eta).minimize(loss) 351 | 352 | ######################################## 353 | # Launch the graph. 354 | config = tf.ConfigProto(log_device_placement=False) 355 | config.gpu_options.allow_growth = True 356 | config.gpu_options.per_process_gpu_memory_fraction = 0.3 357 | 358 | with tf.Session(config=config) as sess: 359 | sess.run(tf.global_variables_initializer()) 360 | sess.run(tf.local_variables_initializer()) 361 | coord = tf.train.Coordinator() 362 | threads = tf.train.start_queue_runners(sess, coord) 363 | 364 | saver_val = tf.train.Saver() 365 | train_loss_list = [] 366 | val_auc_list = [] 367 | best_n_round = 0 368 | best_val_auc = 0 369 | lower_ct = 0 370 | early_stop_flag = 0 371 | 372 | val_ft_inst, val_label_inst = sess.run([val_ft, val_label]) 373 | 374 | func.print_time() 375 | print('Start train loop') 376 | 377 | epoch = -1 378 | try: 379 | while not coord.should_stop(): 380 | epoch += 1 381 | train_ft_inst, train_label_inst = sess.run([train_ft, train_label]) 382 | train_label_inst = np.transpose([train_label_inst]) 383 | 384 | train_ft_corr_inst = sess.run(train_ft_corr) 385 | 386 | # training 387 | sess.run(optimizer, feed_dict={x_input:train_ft_inst, y_target:train_label_inst, \ 388 | x_input_corr:train_ft_corr_inst, keep_prob:kp_prob}) 389 | 390 | # record loss and accuracy every step_size generations 391 | if (epoch+1)%record_step_size == 0: 392 | train_loss_temp = sess.run(loss, feed_dict={ \ 393 | x_input:train_ft_inst, y_target:train_label_inst, \ 394 | x_input_corr:train_ft_corr_inst, keep_prob:1.0}) 395 | train_loss_list.append(train_loss_temp) 396 | 397 | val_pred_score_all = [] 398 | val_label_all = [] 399 | 400 | for iii in range(n_val_batch): 401 | # get batch 402 | start_idx = iii*batch_size 403 | end_idx = (iii+1)*batch_size 404 | cur_val_ft = val_ft_inst[start_idx: end_idx] 405 | cur_val_label = val_label_inst[start_idx: end_idx] 406 | # pred score 407 | cur_val_pred_score = sess.run(pred_score, feed_dict={ \ 408 | x_input:cur_val_ft, keep_prob:1.0}) 409 | val_pred_score_all.append(cur_val_pred_score.flatten()) 410 | val_label_all.append(cur_val_label) 411 | 412 | # calculate auc 413 | val_pred_score_re = func.list_flatten(val_pred_score_all) 414 | val_label_re = func.list_flatten(val_label_all) 415 | val_auc_temp, _, _ = func.cal_auc(val_pred_score_re, val_label_re) 416 | # record all val results 417 | val_auc_list.append(val_auc_temp) 418 | 419 | # record best and save models 420 | if val_auc_temp > best_val_auc: 421 | best_val_auc = val_auc_temp 422 | best_n_round = epoch 423 | # Save the variables to disk 424 | save_path = saver_val.save(sess, model_saving_addr) 425 | print("Model saved in: %s" % save_path) 426 | # count of consecutive lower 427 | if val_auc_temp < best_val_auc: 428 | lower_ct += 1 429 | # once higher or equal, set to 0 430 | else: 431 | lower_ct = 0 432 | 433 | if lower_ct >= max_num_lower_ct: 434 | early_stop_flag = 1 435 | 436 | auc_and_loss = [epoch+1, train_loss_temp, val_auc_temp] 437 | # round to given number of decimals 438 | auc_and_loss = [np.round(xx,4) for xx in auc_and_loss] 439 | func.print_time() 440 | print('Generation # {}. Train Loss: {:.4f}. Val Avg AUC: {:.4f}.'\ 441 | .format(*auc_and_loss)) 442 | 443 | # stop while loop 444 | if early_stop_flag == 1: 445 | break 446 | 447 | except tf.errors.OutOfRangeError: 448 | func.print_time() 449 | print('Done training -- epoch limit reached') 450 | 451 | # restore model 452 | saver_val.restore(sess, model_saving_addr) 453 | print("Model restored.") 454 | 455 | # load test data 456 | test_pred_score_all = [] 457 | test_label_all = [] 458 | test_loss_all = [] 459 | try: 460 | while True: 461 | test_ft_inst, test_label_inst = sess.run([test_ft, test_label]) 462 | cur_test_pred_score = sess.run(pred_score, feed_dict={ \ 463 | x_input:test_ft_inst, keep_prob:1.0}) 464 | test_pred_score_all.append(cur_test_pred_score.flatten()) 465 | test_label_all.append(test_label_inst) 466 | 467 | cur_test_loss = sess.run(loss_ctr, feed_dict={ \ 468 | x_input:test_ft_inst, \ 469 | y_target: np.transpose([test_label_inst]), keep_prob:1.0}) 470 | test_loss_all.append(cur_test_loss) 471 | 472 | except tf.errors.OutOfRangeError: 473 | func.print_time() 474 | print('Done testing -- epoch limit reached') 475 | finally: 476 | coord.request_stop() 477 | 478 | coord.join(threads) 479 | 480 | # calculate auc 481 | test_pred_score_re = func.list_flatten(test_pred_score_all) 482 | test_label_re = func.list_flatten(test_label_all) 483 | test_auc, _, _ = func.cal_auc(test_pred_score_re, test_label_re) 484 | test_rmse = func.cal_rmse(test_pred_score_re, test_label_re) 485 | test_loss = np.mean(test_loss_all) 486 | 487 | # rounding 488 | test_auc = np.round(test_auc, 4) 489 | test_rmse = np.round(test_rmse, 4) 490 | test_loss = np.round(test_loss, 5) 491 | train_loss_list = [np.round(xx,4) for xx in train_loss_list] 492 | val_auc_list = [np.round(xx,4) for xx in val_auc_list] 493 | 494 | print('test_auc = ', test_auc) 495 | print('test_rmse =', test_rmse) 496 | print('test_loss =', test_loss) 497 | print('train_loss_list =', train_loss_list) 498 | print('val_auc_list =', val_auc_list) 499 | 500 | # write output to file 501 | with open(output_file_name, 'a') as f: 502 | now = datetime.datetime.now() 503 | time_str = now.strftime(cfg.time_style) 504 | f.write(time_str + '\n') 505 | f.write('train_file_name = ' + train_file_name[0] + '\n') 506 | f.write('learning_rate = ' + str(eta) + ', alpha = ' + str(alpha) \ 507 | + ', beta = ' + str(beta) \ 508 | + ', n_epoch = ' + str(n_epoch) \ 509 | + ', emb_dize = ' + str(k) + '\n') 510 | f.write('test_auc = ' + str(test_auc) + '\n') 511 | f.write('test_rmse = ' + str(test_rmse) + '\n') 512 | f.write('test_loss = ' + str(test_loss) + '\n') 513 | f.write('train_loss_list =' + str(train_loss_list) + '\n') 514 | f.write('val_auc_list =' + str(val_auc_list) + '\n') 515 | f.write('-'*50 + '\n') 516 | -------------------------------------------------------------------------------- /deepmp.py: -------------------------------------------------------------------------------- 1 | # DeepMP - Deep Matching and Prediction model 2 | 3 | import numpy as np 4 | import tensorflow as tf 5 | import datetime 6 | import ctr_funcs as func 7 | import config_deepmcp as cfg 8 | import os 9 | import shutil 10 | 11 | # config 12 | str_txt = cfg.output_file_name 13 | base_path = './tmp' 14 | model_saving_addr = base_path + '/deepmp_' + str_txt + '/' 15 | output_file_name = base_path + '/deepmp_' + str_txt + '.txt' 16 | num_csv_col = cfg.num_csv_col 17 | train_file_name = cfg.train_file_name 18 | val_file_name = cfg.val_file_name 19 | test_file_name = cfg.test_file_name 20 | batch_size = cfg.batch_size 21 | n_ft = cfg.n_ft 22 | k = cfg.k 23 | kp_prob = cfg.kp_prob 24 | n_epoch = cfg.n_epoch 25 | max_num_lower_ct = cfg.max_num_lower_ct 26 | record_step_size = cfg.record_step_size 27 | layer_dim = cfg.layer_dim 28 | layer_dim_match = cfg.layer_dim_match 29 | eta = cfg.eta # learning rate 30 | opt_alg = cfg.opt_alg 31 | n_one_hot_slot = cfg.n_one_hot_slot 32 | n_mul_hot_slot = cfg.n_mul_hot_slot 33 | max_len_per_slot = cfg.max_len_per_slot 34 | alpha = cfg.alpha # for matching loss 35 | user_ft_idx = cfg.user_ft_idx 36 | ad_ft_idx = cfg.ad_ft_idx 37 | n_user_ft = len(user_ft_idx) 38 | n_ad_ft = len(ad_ft_idx) 39 | label_col_idx = 0 40 | record_defaults = [[0]]*num_csv_col 41 | record_defaults[0] = [0.0] 42 | total_num_ft_col = num_csv_col - 1 43 | 44 | # create dir 45 | if not os.path.exists(base_path): 46 | os.mkdir(base_path) 47 | 48 | # remove dir 49 | if os.path.isdir(model_saving_addr): 50 | shutil.rmtree(model_saving_addr) 51 | 52 | # for DNN 53 | idx_1 = n_one_hot_slot 54 | idx_2 = idx_1 + n_mul_hot_slot*max_len_per_slot 55 | 56 | ########################################################### 57 | ########################################################### 58 | print('Loading data start!') 59 | tf.set_random_seed(123) 60 | 61 | # load training data 62 | train_ft, train_label = func.tf_input_pipeline(train_file_name, batch_size, n_epoch, label_col_idx, record_defaults) 63 | 64 | n_val_inst = func.count_lines(val_file_name[0]) 65 | val_ft, val_label = func.tf_input_pipeline(val_file_name, n_val_inst, 1, label_col_idx, record_defaults) 66 | n_val_batch = n_val_inst//batch_size 67 | 68 | # load test data 69 | test_ft, test_label = func.tf_input_pipeline_test(test_file_name, batch_size, 1, label_col_idx, record_defaults) 70 | print('Loading data set 1 done!') 71 | 72 | ######################################################################## 73 | 74 | # add mask 75 | def get_masked_one_hot(x_input_one_hot): 76 | data_mask = tf.cast(tf.greater(x_input_one_hot, 0), tf.float32) 77 | data_mask = tf.expand_dims(data_mask, axis = 2) 78 | data_mask = tf.tile(data_mask, (1,1,k)) 79 | # output: (?, n_one_hot_slot, k) 80 | data_embed_one_hot = tf.nn.embedding_lookup(emb_mat, x_input_one_hot) 81 | data_embed_one_hot_masked = tf.multiply(data_embed_one_hot, data_mask) 82 | return data_embed_one_hot_masked 83 | 84 | def get_masked_mul_hot(x_input_mul_hot): 85 | data_mask = tf.cast(tf.greater(x_input_mul_hot, 0), tf.float32) 86 | data_mask = tf.expand_dims(data_mask, axis = 3) 87 | data_mask = tf.tile(data_mask, (1,1,1,k)) 88 | # output: (?, n_mul_hot_slot, max_len_per_slot, k) 89 | data_embed_mul_hot = tf.nn.embedding_lookup(emb_mat, x_input_mul_hot) 90 | data_embed_mul_hot_masked = tf.multiply(data_embed_mul_hot, data_mask) 91 | # output: (?, n_mul_hot_slot, k) 92 | data_embed_mul_hot_masked = tf.reduce_sum(data_embed_mul_hot_masked, 2) 93 | return data_embed_mul_hot_masked 94 | 95 | # output: (?, n_one_hot_slot + n_mul_hot_slot, k) 96 | def get_concate_embed(x_input_one_hot, x_input_mul_hot): 97 | data_embed_one_hot = get_masked_one_hot(x_input_one_hot) 98 | data_embed_mul_hot = get_masked_mul_hot(x_input_mul_hot) 99 | data_embed_concat = tf.concat([data_embed_one_hot, data_embed_mul_hot], 1) 100 | return data_embed_concat 101 | 102 | # input: (?, n_slot*k) 103 | # output: (?, 1) 104 | def get_pred_output(data_embed_concat): 105 | # include output layer 106 | n_layer = len(layer_dim) 107 | data_embed_dnn = tf.reshape(data_embed_concat, [-1, (n_one_hot_slot + n_mul_hot_slot)*k]) 108 | cur_layer = data_embed_dnn 109 | # loop to create DNN struct 110 | for i in range(0, n_layer): 111 | # output layer, linear activation 112 | if i == n_layer - 1: 113 | cur_layer = tf.matmul(cur_layer, weight_dict[i]) + bias_dict[i] 114 | else: 115 | cur_layer = tf.nn.relu(tf.matmul(cur_layer, weight_dict[i]) + bias_dict[i]) 116 | cur_layer = tf.nn.dropout(cur_layer, keep_prob) 117 | 118 | y_hat = cur_layer 119 | return y_hat 120 | 121 | # matching loss input 122 | def get_match_output(data_embed_concat): 123 | cur_idx = user_ft_idx[0] 124 | user_ft_cols = data_embed_concat[:, cur_idx:cur_idx+1, :] 125 | for i in range(1, len(user_ft_idx)): 126 | cur_idx = user_ft_idx[i] 127 | cur_x = data_embed_concat[:, cur_idx:cur_idx+1, :] 128 | user_ft_cols = tf.concat([user_ft_cols, cur_x], 1) 129 | 130 | cur_idx = ad_ft_idx[0] 131 | ad_ft_cols = data_embed_concat[:, cur_idx:cur_idx+1, :] 132 | for i in range(1, len(ad_ft_idx)): 133 | cur_idx = ad_ft_idx[i] 134 | cur_x = data_embed_concat[:, cur_idx:cur_idx+1, :] 135 | ad_ft_cols = tf.concat([ad_ft_cols, cur_x], 1) 136 | 137 | user_ft_vec = tf.reshape(user_ft_cols, [-1, n_user_ft*k]) 138 | ad_ft_vec = tf.reshape(ad_ft_cols, [-1, n_ad_ft*k]) 139 | 140 | n_layer_match = len(layer_dim_match) 141 | cur_layer = user_ft_vec 142 | for i in range(0, n_layer_match): 143 | if i == n_layer_match - 1: 144 | cur_layer = tf.nn.tanh(tf.matmul(cur_layer, weight_dict_user[i]) + bias_dict_user[i]) 145 | else: 146 | cur_layer = tf.nn.relu(tf.matmul(cur_layer, weight_dict_user[i]) + bias_dict_user[i]) 147 | user_rep = cur_layer 148 | 149 | cur_layer = ad_ft_vec 150 | for i in range(0, n_layer_match): 151 | if i == n_layer_match - 1: 152 | cur_layer = tf.nn.tanh(tf.matmul(cur_layer, weight_dict_ad[i]) + bias_dict_ad[i]) 153 | else: 154 | cur_layer = tf.nn.relu(tf.matmul(cur_layer, weight_dict_ad[i]) + bias_dict_ad[i]) 155 | ad_rep = cur_layer 156 | 157 | # (?*mk) x (?*mk) -> (?*1) 158 | inner_prod = tf.reduce_sum(tf.multiply(user_rep, ad_rep), 1, keep_dims=True) 159 | return inner_prod 160 | 161 | ########################################################### 162 | ########################################################### 163 | # input for l1 - prediction loss 164 | x_input = tf.placeholder(tf.int32, shape=[None, total_num_ft_col]) 165 | # shape=[None, n_one_hot_slot] 166 | x_input_one_hot = x_input[:, 0:idx_1] 167 | x_input_mul_hot = x_input[:, idx_1:idx_2] 168 | # shape=[None, n_mul_hot_slot, max_len_per_slot] 169 | x_input_mul_hot = tf.reshape(x_input_mul_hot, (-1, n_mul_hot_slot, max_len_per_slot)) 170 | 171 | # target vec for l1 172 | y_target = tf.placeholder(tf.float32, shape=[None, 1]) 173 | 174 | # dropout keep prob 175 | keep_prob = tf.placeholder(tf.float32) 176 | # emb_mat dim add 1 -> for padding (idx = 0) 177 | with tf.device('/cpu:0'): 178 | emb_mat = tf.Variable(tf.random_normal([n_ft + 1, k], stddev=0.01)) 179 | 180 | ################################ 181 | # prediction subnet FC layers, including output layer 182 | n_layer = len(layer_dim) 183 | in_dim = (n_one_hot_slot + n_mul_hot_slot)*k 184 | weight_dict = {} 185 | bias_dict = {} 186 | 187 | # loop to create DNN vars 188 | for i in range(0, n_layer): 189 | out_dim = layer_dim[i] 190 | weight_dict[i] = tf.Variable(tf.random_normal(shape=[in_dim, out_dim], stddev=np.sqrt(2.0/(in_dim+out_dim)))) 191 | bias_dict[i] = tf.Variable(tf.constant(0.0, shape=[out_dim])) 192 | in_dim = layer_dim[i] 193 | 194 | ################################ 195 | # matching subnet FC layers 196 | n_layer_match = len(layer_dim_match) 197 | in_dim_user = n_user_ft*k 198 | weight_dict_user={} 199 | bias_dict_user={} 200 | 201 | in_dim_ad = n_ad_ft*k 202 | weight_dict_ad={} 203 | bias_dict_ad={} 204 | 205 | for i in range(0, n_layer_match): 206 | out_dim_user = layer_dim_match[i] 207 | weight_dict_user[i] = tf.Variable(tf.random_normal(shape=[in_dim_user, out_dim_user],\ 208 | stddev=np.sqrt(2.0/(in_dim_user+out_dim_user)))) 209 | bias_dict_user[i] = tf.Variable(tf.constant(0.0, shape=[out_dim_user])) 210 | in_dim_user = layer_dim_match[i] 211 | 212 | for i in range(0, n_layer_match): 213 | out_dim_ad = layer_dim_match[i] 214 | weight_dict_ad[i] = tf.Variable(tf.random_normal(shape=[in_dim_ad, out_dim_ad],\ 215 | stddev=np.sqrt(2.0/(in_dim_ad+out_dim_ad)))) 216 | bias_dict_ad[i] = tf.Variable(tf.constant(0.0, shape=[out_dim_ad])) 217 | in_dim_ad = layer_dim_match[i] 218 | 219 | ################################ 220 | data_embed_concat = get_concate_embed(x_input_one_hot, x_input_mul_hot) 221 | y_hat = get_pred_output(data_embed_concat) 222 | y_hat_match = get_match_output(data_embed_concat) 223 | 224 | loss_ctr = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=y_hat, labels=y_target)) 225 | loss_match = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=y_hat_match, labels=y_target)) 226 | 227 | loss = loss_ctr + alpha*loss_match 228 | 229 | ############################# 230 | # prediction 231 | ############################# 232 | pred_score = tf.sigmoid(y_hat) 233 | 234 | if opt_alg == 'Adam': 235 | optimizer = tf.train.AdamOptimizer(eta).minimize(loss) 236 | else: 237 | # default 238 | optimizer = tf.train.AdagradOptimizer(eta).minimize(loss) 239 | 240 | ######################################## 241 | # Launch the graph. 242 | config = tf.ConfigProto(log_device_placement=False) 243 | config.gpu_options.allow_growth = True 244 | config.gpu_options.per_process_gpu_memory_fraction = 0.3 245 | 246 | with tf.Session(config=config) as sess: 247 | sess.run(tf.global_variables_initializer()) 248 | sess.run(tf.local_variables_initializer()) 249 | coord = tf.train.Coordinator() 250 | threads = tf.train.start_queue_runners(sess, coord) 251 | 252 | saver_val = tf.train.Saver() 253 | train_loss_list = [] 254 | val_auc_list = [] 255 | best_n_round = 0 256 | best_val_auc = 0 257 | lower_ct = 0 258 | early_stop_flag = 0 259 | 260 | val_ft_inst, val_label_inst = sess.run([val_ft, val_label]) 261 | 262 | func.print_time() 263 | print('Start train loop') 264 | 265 | epoch = -1 266 | try: 267 | while not coord.should_stop(): 268 | epoch += 1 269 | train_ft_inst, train_label_inst = sess.run([train_ft, train_label]) 270 | train_label_inst = np.transpose([train_label_inst]) 271 | 272 | # training 273 | sess.run(optimizer, feed_dict={x_input:train_ft_inst, y_target:train_label_inst, \ 274 | keep_prob:kp_prob}) 275 | 276 | # record loss and accuracy every step_size generations 277 | if (epoch+1)%record_step_size == 0: 278 | train_loss_temp = sess.run(loss, feed_dict={ \ 279 | x_input:train_ft_inst, y_target:train_label_inst, \ 280 | keep_prob:1.0}) 281 | train_loss_list.append(train_loss_temp) 282 | 283 | val_pred_score_all = [] 284 | val_label_all = [] 285 | 286 | for iii in range(n_val_batch): 287 | # get batch 288 | start_idx = iii*batch_size 289 | end_idx = (iii+1)*batch_size 290 | cur_val_ft = val_ft_inst[start_idx: end_idx] 291 | cur_val_label = val_label_inst[start_idx: end_idx] 292 | # pred score 293 | cur_val_pred_score = sess.run(pred_score, feed_dict={ \ 294 | x_input:cur_val_ft, keep_prob:1.0}) 295 | val_pred_score_all.append(cur_val_pred_score.flatten()) 296 | val_label_all.append(cur_val_label) 297 | 298 | # calculate auc 299 | val_pred_score_re = func.list_flatten(val_pred_score_all) 300 | val_label_re = func.list_flatten(val_label_all) 301 | val_auc_temp, _, _ = func.cal_auc(val_pred_score_re, val_label_re) 302 | # record all val results 303 | val_auc_list.append(val_auc_temp) 304 | 305 | # record best and save models 306 | if val_auc_temp > best_val_auc: 307 | best_val_auc = val_auc_temp 308 | best_n_round = epoch 309 | # Save the variables to disk 310 | save_path = saver_val.save(sess, model_saving_addr) 311 | print("Model saved in: %s" % save_path) 312 | # count of consecutive lower 313 | if val_auc_temp < best_val_auc: 314 | lower_ct += 1 315 | # once higher or equal, set to 0 316 | else: 317 | lower_ct = 0 318 | 319 | if lower_ct >= max_num_lower_ct: 320 | early_stop_flag = 1 321 | 322 | auc_and_loss = [epoch+1, train_loss_temp, val_auc_temp] 323 | # round to given number of decimals 324 | auc_and_loss = [np.round(xx,4) for xx in auc_and_loss] 325 | func.print_time() 326 | print('Generation # {}. Train Loss: {:.4f}. Val Avg AUC: {:.4f}.'\ 327 | .format(*auc_and_loss)) 328 | 329 | # stop while loop 330 | if early_stop_flag == 1: 331 | break 332 | 333 | except tf.errors.OutOfRangeError: 334 | func.print_time() 335 | print('Done training -- epoch limit reached') 336 | 337 | # restore model 338 | saver_val.restore(sess, model_saving_addr) 339 | print("Model restored.") 340 | 341 | # load test data 342 | test_pred_score_all = [] 343 | test_label_all = [] 344 | test_loss_all = [] 345 | try: 346 | while True: 347 | test_ft_inst, test_label_inst = sess.run([test_ft, test_label]) 348 | cur_test_pred_score = sess.run(pred_score, feed_dict={ \ 349 | x_input:test_ft_inst, keep_prob:1.0}) 350 | test_pred_score_all.append(cur_test_pred_score.flatten()) 351 | test_label_all.append(test_label_inst) 352 | 353 | cur_test_loss = sess.run(loss_ctr, feed_dict={ \ 354 | x_input:test_ft_inst, \ 355 | y_target: np.transpose([test_label_inst]), keep_prob:1.0}) 356 | test_loss_all.append(cur_test_loss) 357 | 358 | except tf.errors.OutOfRangeError: 359 | func.print_time() 360 | print('Done testing -- epoch limit reached') 361 | finally: 362 | coord.request_stop() 363 | 364 | coord.join(threads) 365 | 366 | # calculate auc 367 | test_pred_score_re = func.list_flatten(test_pred_score_all) 368 | test_label_re = func.list_flatten(test_label_all) 369 | test_auc, _, _ = func.cal_auc(test_pred_score_re, test_label_re) 370 | test_rmse = func.cal_rmse(test_pred_score_re, test_label_re) 371 | test_loss = np.mean(test_loss_all) 372 | 373 | # rounding 374 | test_auc = np.round(test_auc, 4) 375 | test_rmse = np.round(test_rmse, 4) 376 | test_loss = np.round(test_loss, 5) 377 | train_loss_list = [np.round(xx,4) for xx in train_loss_list] 378 | val_auc_list = [np.round(xx,4) for xx in val_auc_list] 379 | 380 | print('test_auc = ', test_auc) 381 | print('test_rmse =', test_rmse) 382 | print('test_loss =', test_loss) 383 | print('train_loss_list =', train_loss_list) 384 | print('val_auc_list =', val_auc_list) 385 | 386 | # write output to file 387 | with open(output_file_name, 'a') as f: 388 | now = datetime.datetime.now() 389 | time_str = now.strftime(cfg.time_style) 390 | f.write(time_str + '\n') 391 | f.write('train_file_name = ' + train_file_name[0] + '\n') 392 | f.write('learning_rate = ' + str(eta) + ', alpha = ' + str(alpha) \ 393 | + ', n_epoch = ' + str(n_epoch) \ 394 | + ', emb_dize = ' + str(k) + '\n') 395 | f.write('test_auc = ' + str(test_auc) + '\n') 396 | f.write('test_rmse = ' + str(test_rmse) + '\n') 397 | f.write('test_loss = ' + str(test_loss) + '\n') 398 | f.write('train_loss_list =' + str(train_loss_list) + '\n') 399 | f.write('val_auc_list =' + str(val_auc_list) + '\n') 400 | f.write('-'*50 + '\n') 401 | -------------------------------------------------------------------------------- /dnn.py: -------------------------------------------------------------------------------- 1 | # DNN (prediction) 2 | 3 | import numpy as np 4 | import tensorflow as tf 5 | import datetime 6 | import ctr_funcs as func 7 | import config_deepmcp as cfg 8 | import os 9 | import shutil 10 | 11 | # config 12 | str_txt = cfg.output_file_name 13 | base_path = './tmp' 14 | model_saving_addr = base_path + '/dnn_' + str_txt + '/' 15 | output_file_name = base_path + '/dnn_' + str_txt + '.txt' 16 | num_csv_col = cfg.num_csv_col 17 | train_file_name = cfg.train_file_name 18 | val_file_name = cfg.val_file_name 19 | test_file_name = cfg.test_file_name 20 | batch_size = cfg.batch_size 21 | n_ft = cfg.n_ft 22 | k = cfg.k 23 | kp_prob = cfg.kp_prob 24 | n_epoch = cfg.n_epoch 25 | max_num_lower_ct = cfg.max_num_lower_ct 26 | record_step_size = cfg.record_step_size 27 | layer_dim = cfg.layer_dim 28 | layer_dim_match = cfg.layer_dim_match 29 | eta = cfg.eta # learning rate 30 | opt_alg = cfg.opt_alg 31 | n_one_hot_slot = cfg.n_one_hot_slot 32 | n_mul_hot_slot = cfg.n_mul_hot_slot 33 | max_len_per_slot = cfg.max_len_per_slot 34 | label_col_idx = 0 35 | record_defaults = [[0]]*num_csv_col 36 | record_defaults[0] = [0.0] 37 | total_num_ft_col = num_csv_col - 1 38 | 39 | # create dir 40 | if not os.path.exists(base_path): 41 | os.mkdir(base_path) 42 | 43 | # remove dir 44 | if os.path.isdir(model_saving_addr): 45 | shutil.rmtree(model_saving_addr) 46 | 47 | # for DNN 48 | idx_1 = n_one_hot_slot 49 | idx_2 = idx_1 + n_mul_hot_slot*max_len_per_slot 50 | 51 | ########################################################### 52 | ########################################################### 53 | print('Loading data start!') 54 | tf.set_random_seed(123) 55 | 56 | # load training data 57 | train_ft, train_label = func.tf_input_pipeline(train_file_name, batch_size, n_epoch, label_col_idx, record_defaults) 58 | 59 | n_val_inst = func.count_lines(val_file_name[0]) 60 | val_ft, val_label = func.tf_input_pipeline(val_file_name, n_val_inst, 1, label_col_idx, record_defaults) 61 | n_val_batch = n_val_inst//batch_size 62 | 63 | # load test data 64 | test_ft, test_label = func.tf_input_pipeline_test(test_file_name, batch_size, 1, label_col_idx, record_defaults) 65 | print('Loading data set 1 done!') 66 | 67 | ######################################################################## 68 | 69 | # add mask 70 | def get_masked_one_hot(x_input_one_hot): 71 | data_mask = tf.cast(tf.greater(x_input_one_hot, 0), tf.float32) 72 | data_mask = tf.expand_dims(data_mask, axis = 2) 73 | data_mask = tf.tile(data_mask, (1,1,k)) 74 | # output: (?, n_one_hot_slot, k) 75 | data_embed_one_hot = tf.nn.embedding_lookup(emb_mat, x_input_one_hot) 76 | data_embed_one_hot_masked = tf.multiply(data_embed_one_hot, data_mask) 77 | return data_embed_one_hot_masked 78 | 79 | def get_masked_mul_hot(x_input_mul_hot): 80 | data_mask = tf.cast(tf.greater(x_input_mul_hot, 0), tf.float32) 81 | data_mask = tf.expand_dims(data_mask, axis = 3) 82 | data_mask = tf.tile(data_mask, (1,1,1,k)) 83 | # output: (?, n_mul_hot_slot, max_len_per_slot, k) 84 | data_embed_mul_hot = tf.nn.embedding_lookup(emb_mat, x_input_mul_hot) 85 | data_embed_mul_hot_masked = tf.multiply(data_embed_mul_hot, data_mask) 86 | # output: (?, n_mul_hot_slot, k) 87 | data_embed_mul_hot_masked = tf.reduce_sum(data_embed_mul_hot_masked, 2) 88 | return data_embed_mul_hot_masked 89 | 90 | # output: (?, n_one_hot_slot + n_mul_hot_slot, k) 91 | def get_concate_embed(x_input_one_hot, x_input_mul_hot): 92 | data_embed_one_hot = get_masked_one_hot(x_input_one_hot) 93 | data_embed_mul_hot = get_masked_mul_hot(x_input_mul_hot) 94 | data_embed_concat = tf.concat([data_embed_one_hot, data_embed_mul_hot], 1) 95 | return data_embed_concat 96 | 97 | # input: (?, n_slot*k) 98 | # output: (?, 1) 99 | def get_pred_output(data_embed_concat): 100 | # include output layer 101 | n_layer = len(layer_dim) 102 | data_embed_dnn = tf.reshape(data_embed_concat, [-1, (n_one_hot_slot + n_mul_hot_slot)*k]) 103 | cur_layer = data_embed_dnn 104 | # loop to create DNN struct 105 | for i in range(0, n_layer): 106 | # output layer, linear activation 107 | if i == n_layer - 1: 108 | cur_layer = tf.matmul(cur_layer, weight_dict[i]) + bias_dict[i] 109 | else: 110 | cur_layer = tf.nn.relu(tf.matmul(cur_layer, weight_dict[i]) + bias_dict[i]) 111 | cur_layer = tf.nn.dropout(cur_layer, keep_prob) 112 | 113 | y_hat = cur_layer 114 | return y_hat 115 | 116 | ########################################################### 117 | ########################################################### 118 | # input for prediction loss 119 | x_input = tf.placeholder(tf.int32, shape=[None, total_num_ft_col]) 120 | # shape=[None, n_one_hot_slot] 121 | x_input_one_hot = x_input[:, 0:idx_1] 122 | x_input_mul_hot = x_input[:, idx_1:idx_2] 123 | # shape=[None, n_mul_hot_slot, max_len_per_slot] 124 | x_input_mul_hot = tf.reshape(x_input_mul_hot, (-1, n_mul_hot_slot, max_len_per_slot)) 125 | 126 | # target vec for l1 127 | y_target = tf.placeholder(tf.float32, shape=[None, 1]) 128 | 129 | # dropout keep prob 130 | keep_prob = tf.placeholder(tf.float32) 131 | # emb_mat dim add 1 -> for padding (idx = 0) 132 | with tf.device('/cpu:0'): 133 | emb_mat = tf.Variable(tf.random_normal([n_ft + 1, k], stddev=0.01)) 134 | 135 | ################################ 136 | # prediction subnet FC layers, including output layer 137 | n_layer = len(layer_dim) 138 | in_dim = (n_one_hot_slot + n_mul_hot_slot)*k 139 | weight_dict = {} 140 | bias_dict = {} 141 | 142 | # loop to create DNN vars 143 | for i in range(0, n_layer): 144 | out_dim = layer_dim[i] 145 | weight_dict[i] = tf.Variable(tf.random_normal(shape=[in_dim, out_dim], stddev=np.sqrt(2.0/(in_dim+out_dim)))) 146 | bias_dict[i] = tf.Variable(tf.constant(0.0, shape=[out_dim])) 147 | in_dim = layer_dim[i] 148 | 149 | ################################ 150 | data_embed_concat = get_concate_embed(x_input_one_hot, x_input_mul_hot) 151 | y_hat = get_pred_output(data_embed_concat) 152 | 153 | loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=y_hat, labels=y_target)) 154 | 155 | ############################# 156 | # prediction 157 | ############################# 158 | pred_score = tf.sigmoid(y_hat) 159 | 160 | if opt_alg == 'Adam': 161 | optimizer = tf.train.AdamOptimizer(eta).minimize(loss) 162 | else: 163 | # default 164 | optimizer = tf.train.AdagradOptimizer(eta).minimize(loss) 165 | 166 | ######################################## 167 | # Launch the graph. 168 | config = tf.ConfigProto(log_device_placement=False) 169 | config.gpu_options.allow_growth = True 170 | config.gpu_options.per_process_gpu_memory_fraction = 0.3 171 | 172 | with tf.Session(config=config) as sess: 173 | sess.run(tf.global_variables_initializer()) 174 | sess.run(tf.local_variables_initializer()) 175 | coord = tf.train.Coordinator() 176 | threads = tf.train.start_queue_runners(sess, coord) 177 | 178 | saver_val = tf.train.Saver() 179 | train_loss_list = [] 180 | val_auc_list = [] 181 | best_n_round = 0 182 | best_val_auc = 0 183 | lower_ct = 0 184 | early_stop_flag = 0 185 | 186 | val_ft_inst, val_label_inst = sess.run([val_ft, val_label]) 187 | 188 | func.print_time() 189 | print('Start train loop') 190 | 191 | epoch = -1 192 | try: 193 | while not coord.should_stop(): 194 | epoch += 1 195 | train_ft_inst, train_label_inst = sess.run([train_ft, train_label]) 196 | train_label_inst = np.transpose([train_label_inst]) 197 | 198 | # training 199 | sess.run(optimizer, feed_dict={x_input:train_ft_inst, y_target:train_label_inst, \ 200 | keep_prob:kp_prob}) 201 | 202 | # record loss and accuracy every step_size generations 203 | if (epoch+1)%record_step_size == 0: 204 | train_loss_temp = sess.run(loss, feed_dict={ \ 205 | x_input:train_ft_inst, y_target:train_label_inst, \ 206 | keep_prob:1.0}) 207 | train_loss_list.append(train_loss_temp) 208 | 209 | val_pred_score_all = [] 210 | val_label_all = [] 211 | 212 | for iii in range(n_val_batch): 213 | # get batch 214 | start_idx = iii*batch_size 215 | end_idx = (iii+1)*batch_size 216 | cur_val_ft = val_ft_inst[start_idx: end_idx] 217 | cur_val_label = val_label_inst[start_idx: end_idx] 218 | # pred score 219 | cur_val_pred_score = sess.run(pred_score, feed_dict={ \ 220 | x_input:cur_val_ft, keep_prob:1.0}) 221 | val_pred_score_all.append(cur_val_pred_score.flatten()) 222 | val_label_all.append(cur_val_label) 223 | 224 | # calculate auc 225 | val_pred_score_re = func.list_flatten(val_pred_score_all) 226 | val_label_re = func.list_flatten(val_label_all) 227 | val_auc_temp, _, _ = func.cal_auc(val_pred_score_re, val_label_re) 228 | # record all val results 229 | val_auc_list.append(val_auc_temp) 230 | 231 | # record best and save models 232 | if val_auc_temp > best_val_auc: 233 | best_val_auc = val_auc_temp 234 | best_n_round = epoch 235 | # Save the variables to disk 236 | save_path = saver_val.save(sess, model_saving_addr) 237 | print("Model saved in: %s" % save_path) 238 | # count of consecutive lower 239 | if val_auc_temp < best_val_auc: 240 | lower_ct += 1 241 | # once higher or equal, set to 0 242 | else: 243 | lower_ct = 0 244 | 245 | if lower_ct >= max_num_lower_ct: 246 | early_stop_flag = 1 247 | 248 | auc_and_loss = [epoch+1, train_loss_temp, val_auc_temp] 249 | # round to given number of decimals 250 | auc_and_loss = [np.round(xx,4) for xx in auc_and_loss] 251 | func.print_time() 252 | print('Generation # {}. Train Loss: {:.4f}. Val Avg AUC: {:.4f}.'\ 253 | .format(*auc_and_loss)) 254 | 255 | # stop while loop 256 | if early_stop_flag == 1: 257 | break 258 | 259 | except tf.errors.OutOfRangeError: 260 | func.print_time() 261 | print('Done training -- epoch limit reached') 262 | 263 | # restore model 264 | saver_val.restore(sess, model_saving_addr) 265 | print("Model restored.") 266 | 267 | # load test data 268 | test_pred_score_all = [] 269 | test_label_all = [] 270 | test_loss_all = [] 271 | try: 272 | while True: 273 | test_ft_inst, test_label_inst = sess.run([test_ft, test_label]) 274 | cur_test_pred_score = sess.run(pred_score, feed_dict={ \ 275 | x_input:test_ft_inst, keep_prob:1.0}) 276 | test_pred_score_all.append(cur_test_pred_score.flatten()) 277 | test_label_all.append(test_label_inst) 278 | 279 | cur_test_loss = sess.run(loss, feed_dict={ \ 280 | x_input:test_ft_inst, \ 281 | y_target: np.transpose([test_label_inst]), keep_prob:1.0}) 282 | test_loss_all.append(cur_test_loss) 283 | 284 | except tf.errors.OutOfRangeError: 285 | func.print_time() 286 | print('Done testing -- epoch limit reached') 287 | finally: 288 | coord.request_stop() 289 | 290 | coord.join(threads) 291 | 292 | # calculate auc 293 | test_pred_score_re = func.list_flatten(test_pred_score_all) 294 | test_label_re = func.list_flatten(test_label_all) 295 | test_auc, _, _ = func.cal_auc(test_pred_score_re, test_label_re) 296 | test_rmse = func.cal_rmse(test_pred_score_re, test_label_re) 297 | test_loss = np.mean(test_loss_all) 298 | 299 | # rounding 300 | test_auc = np.round(test_auc, 4) 301 | test_rmse = np.round(test_rmse, 4) 302 | test_loss = np.round(test_loss, 5) 303 | train_loss_list = [np.round(xx,4) for xx in train_loss_list] 304 | val_auc_list = [np.round(xx,4) for xx in val_auc_list] 305 | 306 | print('test_auc = ', test_auc) 307 | print('test_rmse =', test_rmse) 308 | print('test_loss =', test_loss) 309 | print('train_loss_list =', train_loss_list) 310 | print('val_auc_list =', val_auc_list) 311 | 312 | # write output to file 313 | with open(output_file_name, 'a') as f: 314 | now = datetime.datetime.now() 315 | time_str = now.strftime(cfg.time_style) 316 | f.write(time_str + '\n') 317 | f.write('train_file_name = ' + train_file_name[0] + '\n') 318 | f.write('learning_rate = ' + str(eta) \ 319 | + ', n_epoch = ' + str(n_epoch) \ 320 | + ', emb_dize = ' + str(k) + '\n') 321 | f.write('test_auc = ' + str(test_auc) + '\n') 322 | f.write('test_rmse = ' + str(test_rmse) + '\n') 323 | f.write('test_loss = ' + str(test_loss) + '\n') 324 | f.write('train_loss_list =' + str(train_loss_list) + '\n') 325 | f.write('val_auc_list =' + str(val_auc_list) + '\n') 326 | f.write('-'*50 + '\n') 327 | --------------------------------------------------------------------------------