├── .gitignore ├── README.md ├── attributes ├── README.md ├── attribute.py ├── comb_attribute.py ├── embed_attribute.py ├── input_attribute.py └── mulhot_index.py ├── examples ├── README.md ├── dataset │ ├── README.md │ ├── i.csv │ ├── i_attr.csv │ ├── obs_te.csv │ ├── obs_tr.csv │ ├── obs_va.csv │ ├── u.csv │ └── u_attr.csv ├── run_hmf.sh ├── run_lstm.sh └── run_w2v.sh ├── hmf ├── hmf_model.py └── run_hmf.py ├── lstm ├── best_buckets.py ├── data_iterator.py ├── generate_jobs.py ├── run.py └── seqModel.py ├── utils ├── eval_metrics.py ├── evaluate.py ├── load_data.py ├── pandatools.py ├── prepare_train.py ├── preprocess.py └── submit.py └── word2vec ├── cbow_model.py ├── data_iterator.py ├── linear_seq.py ├── run_w2v.py └── skipgram_model.py /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | *~ 3 | [#].[#] 4 | .[#]* 5 | *[#] 6 | *pyc 7 | mf/data/* 8 | mf/test/* 9 | mf/*pyc 10 | mf/log/* 11 | */*pyc 12 | mf/data* 13 | mf/test* 14 | mf/log* 15 | mf/*.sh 16 | mf/*.pbs 17 | dataset/* 18 | jobs/* 19 | mf/*.json 20 | *.json 21 | *.csv 22 | raw_data/*/* 23 | baselines/* 24 | data/* 25 | */*/*.npy 26 | *.npy 27 | */*.npy 28 | res/* 29 | tmp/* 30 | others/* 31 | paper/*/* 32 | preprocess/* 33 | */*.sh 34 | cache/* 35 | train/* 36 | examples/cache/ 37 | examples/train/ 38 | ae-hmf/* 39 | rank/* 40 | ae-word2vec/* 41 | dlstm/* 42 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # A-RecSys: A TensorFlow Toolkit for Implicit Recommendation Tasks 2 | 3 | ## A-RecSys 4 | A-RecSys implements implicit recommendation algorithms and is designed for large-scale recommendation settings. It extends traditional matrix factorization algorithms and focuses on attribute embedding and sequence modeling. 5 | 6 | Works implemented by this toolkit include: 7 | 8 | + A Batch Learning Framework for Scalable Personalized Ranking. AAAI 18. [arXiv](https://arxiv.org/abs/1711.04019) 9 | + Sequential heterogeneous attribute embedding for item recommendation. ICDM 17 SERecsys Workshop. 10 | + Temporal Learning and Sequence Modeling for a Job Recommender System. RecSys Challenge 16 [pdf](https://arxiv.org/abs/1608.03333) 11 | 12 | 13 | The models and features supported by A-RecSys include: 14 | 15 | #### Models 16 | + Hybrid matrix factorization model (with deep layer extensions) 17 | + Linear sequence models based on CBOW and skip-gram 18 | + LSTM-based seq2seq model 19 | 20 | 21 | #### Features 22 | + Recommendation with implicit feedback 23 | + Heterogeneous attribute embedding (see attributes/README.md for details) 24 | + Objective functions including cross-entropy and Weighted Margin Rank Batch loss 25 | 26 | ## How to use 27 | 28 | ### Input data 29 | CSV-formatted (sep=\t) input files include: 30 | 31 | u.csv: user file. User id and attribute values. 32 | i.csv: item file. Item id and attribute values. 33 | obs_tr.csv: implicit feedback for training. First two columns are user-id, item-id. Third column (optional) is for timestamp. 34 | obs_va.csv: implicit feedback for development. 35 | obs_te.csv: implicit feedback for testing.
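For a quick sanity check of the expected layout, here is a minimal sketch (not part of the toolkit; it assumes the interaction files are tab-separated with no header row, following the description above) that reads the training interactions with the Python standard library:

```
import csv

# Columns as described above: user-id, item-id, optional timestamp.
with open('examples/dataset/obs_tr.csv') as f:
    for row in csv.reader(f, delimiter='\t'):
        user_id, item_id = row[0], row[1]
        timestamp = row[2] if len(row) > 2 else None
```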
36 | 37 | **An example** (adapted from MovieLens 1m) is given at ./examples/dataset/ 38 | 39 | ### Train models 40 | **Example scripts** are provided at ./examples/ to start running the code. 41 | 42 | To train the hybrid matrix factorization model on the provided MovieLens 1m dataset: 43 | 44 | ``` 45 | cd examples/ 46 | bash run_hmf.sh 32 1 False 100 False 47 | ``` 48 | 49 | To train the LSTM model: 50 | 51 | ``` 52 | cd examples/ 53 | bash run_lstm.sh 64 1 False 54 | ``` 55 | 56 | (Code has been tested on TF 0.8 and above.) 57 | 58 | ### Recommend 59 | You can switch from "training" to "recommend" mode by setting the flag *recommend* to 'True'. In the above HMF example, it would be: 60 | 61 | ``` 62 | cd examples/ 63 | bash run_hmf.sh 32 1 False 100 True 64 | ``` 65 | 66 | By default, the code generates a ground-truth interaction file *res_T_test.csv* from *obs_te.csv*, and performs recommendation for all users that appear in *res_T_test.csv*. You can supply your own *res_T_test.csv* to narrow down the set of users for which recommendations are generated. 67 | 68 | 69 | ### Dependencies 70 | The code now supports TensorFlow v1.0. During our development, the code was tested with versions 0.8, 0.9, 0.11, and 0.12. 71 | 72 | ## Cite 73 | Please cite the following if you find this toolkit helpful. 74 | 75 | @inproceedings{liu2017wmrb, 76 | title={WMRB: learning to rank in a scalable batch training approach}, 77 | author={Liu, Kuan and Natarajan, Prem}, 78 | booktitle={Proceedings of the Recommender Systems Poster}, 79 | year={2017}, 80 | organization={ACM} 81 | } 82 | 83 | @inproceedings{liu2016temporal, 84 | title={Temporal learning and sequence modeling for a job recommender system}, 85 | author={Liu, Kuan and Shi, Xing and Kumar, Anoop and Zhu, Linhong and Natarajan, Prem}, 86 | booktitle={Proceedings of the Recommender Systems Challenge}, 87 | pages={7}, 88 | year={2016}, 89 | organization={ACM} 90 | } 91 | 92 | 93 | ## Feedback 94 | Your comments and suggestions are more than welcome! We really appreciate them! 95 | 96 | Kuan Liu kuanl@usc.edu 97 | Xing Shi xingshi@usc.edu 98 | -------------------------------------------------------------------------------- /attributes/README.md: -------------------------------------------------------------------------------- 1 | # attribute embedding 2 | 3 | ## what it does 4 | 5 | * Preprocess attribute files, tokenize user/item attributes 6 | * Combine heterogeneous attributes: MIX or HET. 7 | * Embed attributes in the input and output layers of models. 8 | * Loss functions 9 | 10 | 11 | ## heterogeneous attribute embedding 12 | 13 | #### illustration figure 14 | 15 | ![alt text](http://www-scf.usc.edu/~kuanl/papers/seq_het.png =700x) 16 | ## How to use 17 | 18 | Various flags are used to control the attribute embedding strategies.
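As a concrete illustration, the snippet below is a minimal sketch (paths are hypothetical; argument names follow `input_attribute.read_data` in this repository) showing how the `combine_att` setting selects between the two strategies:

```
# assumes attributes/ is on sys.path, as the runner scripts arrange
from input_attribute import read_data

# combine_att='mix' merges all attributes into one shared vocabulary;
# combine_att='het' keeps a separate vocabulary per attribute.
(data_tr, data_va, u_attr, i_attr, item_ind2logit_ind,
 logit_ind2item_ind, user_index, item_index) = read_data(
    raw_data_dir='../examples/dataset/',
    data_dir='../examples/cache/ml1m/',
    combine_att='het',
    use_user_feature=True,
    use_item_feature=True)
```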
19 | 20 | (TODO: complete readme file) 21 | 22 | -------------------------------------------------------------------------------- /attributes/attribute.py: -------------------------------------------------------------------------------- 1 | 2 | from __future__ import absolute_import 3 | from __future__ import division 4 | from __future__ import print_function 5 | 6 | 7 | class Attributes(object): 8 | def __init__(self, num_feature_cat=0, feature_cat=None, 9 | num_text_feat=0, feature_mulhot=None, mulhot_max_length=None, 10 | mulhot_starts=None, mulhot_lengths=None, 11 | v_sizes_cat=None, v_sizes_mulhot=None, 12 | embedding_size_list_cat=None): 13 | self.num_features_cat = num_feature_cat 14 | self.num_features_mulhot = num_text_feat 15 | self.features_cat = feature_cat 16 | self.features_mulhot = feature_mulhot 17 | # self.mulhot_max_length = mulhot_max_length 18 | self.mulhot_starts = mulhot_starts 19 | self.mulhot_lengths = mulhot_lengths 20 | self._embedding_classes_list_cat = v_sizes_cat 21 | self._embedding_classes_list_mulhot = v_sizes_mulhot 22 | return 23 | 24 | def set_model_size(self, sizes, opt=0): 25 | if isinstance(sizes, list): 26 | if opt == 0: 27 | assert(len(sizes) == self.num_features_cat) 28 | self._embedding_size_list_cat = sizes 29 | else: 30 | assert(len(sizes) == self.num_features_mulhot) 31 | self._embedding_size_list_mulhot = sizes 32 | elif isinstance(sizes, int): 33 | self._embedding_size_list_cat = [sizes] * self.num_features_cat 34 | self._embedding_size_list_mulhot = [sizes] * self.num_features_mulhot 35 | else: 36 | print('error: sizes need to be list or int') 37 | exit(0) 38 | return 39 | 40 | def set_target_prediction(self, features_cat_tr, full_values_tr, 41 | full_segids_tr, full_lengths_tr): 42 | # TODO: move these indices outside this class 43 | self.full_cat_tr = features_cat_tr 44 | self.full_values_tr = full_values_tr 45 | self.full_segids_tr = full_segids_tr 46 | self.full_lengths_tr = full_lengths_tr 47 | return 48 | 49 | # def get_item_last_index(self): 50 | # return len(self.features_cat[0]) - 1 51 | 52 | def overview(self, out=None): 53 | def p(val): 54 | if out: 55 | out(val) 56 | else: 57 | print(val) 58 | p('# of categorical attributes: {}'.format(self.num_features_cat)) 59 | p('# of multi-hot attributes: {}'.format(self.num_features_mulhot)) 60 | p('====attributes values===') 61 | if self.num_features_cat > 0: 62 | p('\tinput categorical:') 63 | p('\t{}'.format(self.features_cat)) 64 | if hasattr(self, 'full_cat_tr'): 65 | p('\toutput categorical:') 66 | p('\t{}'.format(self.full_cat_tr)) 67 | if self.num_features_mulhot > 0: 68 | p('\tinput multi-hot:') 69 | p('\t values: {}'.format(self.features_mulhot)) 70 | p('\t starts:{}'.format(self.mulhot_starts)) 71 | p('\t length:{}'.format(self.mulhot_lengths)) 72 | if hasattr(self, 'full_values_tr'): 73 | p('\toutput multi-hot:') 74 | p('\t values:{}'.format(self.full_values_tr)) 75 | p('\t starts:{}'.format(self.full_segids_tr)) 76 | p('\t length:{}'.format(self.full_lengths_tr)) 77 | p('\n') 78 | -------------------------------------------------------------------------------- /attributes/comb_attribute.py: -------------------------------------------------------------------------------- 1 | from preprocess import create_dictionary, create_dictionary_mix, tokenize_attribute_map, filter_cat, filter_mulhot, pickle_save 2 | import numpy as np 3 | import attribute 4 | 5 | 6 | class Comb_Attributes(object): 7 | def __init__(self): 8 | return 9 | 10 | def get_attributes(self, users, items, data_tr, 
user_features, item_features): 11 | # create_dictionary 12 | user_feature_names, user_feature_types = user_features 13 | item_feature_names, item_feature_types = item_features 14 | 15 | u_inds = [p[0] for p in data_tr] 16 | self.create_dictionary(self.data_dir, u_inds, users, user_feature_types, 17 | user_feature_names, self.max_vocabulary_size, self.logits_size_tr, 18 | prefix='user', threshold=self.threshold) 19 | 20 | # create user feature map 21 | (num_features_cat, features_cat, num_features_mulhot, features_mulhot, 22 | mulhot_max_leng, mulhot_starts, mulhot_lengs, v_sizes_cat, 23 | v_sizes_mulhot) = tokenize_attribute_map(self.data_dir, users, 24 | user_feature_types, self.max_vocabulary_size, self.logits_size_tr, 25 | prefix='user') 26 | 27 | u_attributes = attribute.Attributes(num_features_cat, features_cat, 28 | num_features_mulhot, features_mulhot, mulhot_max_leng, mulhot_starts, 29 | mulhot_lengs, v_sizes_cat, v_sizes_mulhot) 30 | 31 | # create_dictionary 32 | i_inds_tr = [p[1] for p in data_tr] 33 | self.create_dictionary(self.data_dir, i_inds_tr, items, item_feature_types, 34 | item_feature_names, self.max_vocabulary_size, self.logits_size_tr, 35 | prefix='item', threshold=self.threshold) 36 | 37 | # create item feature map 38 | items_cp = np.copy(items) 39 | (num_features_cat2, features_cat2, num_features_mulhot2, features_mulhot2, 40 | mulhot_max_leng2, mulhot_starts2, mulhot_lengs2, v_sizes_cat2, 41 | v_sizes_mulhot2) = tokenize_attribute_map(self.data_dir, 42 | items_cp, item_feature_types, self.max_vocabulary_size, self.logits_size_tr, 43 | prefix='item') 44 | 45 | ''' 46 | create an (item-index <--> classification output) mapping 47 | there are more than one valid mapping as long as 1 to 1 48 | ''' 49 | item2fea0 = features_cat2[0] if len(features_cat2) > 0 else None 50 | item_ind2logit_ind, logit_ind2item_ind = self.index_mapping(item2fea0, 51 | i_inds_tr, len(items)) 52 | 53 | i_attributes = attribute.Attributes(num_features_cat2, features_cat2, 54 | num_features_mulhot2, features_mulhot2, mulhot_max_leng2, mulhot_starts2, 55 | mulhot_lengs2, v_sizes_cat2, v_sizes_mulhot2) 56 | 57 | # set target prediction indices 58 | features_cat2_tr = filter_cat(num_features_cat2, features_cat2, 59 | logit_ind2item_ind) 60 | 61 | (full_values, full_values_tr, full_segids, full_lengths, full_segids_tr, 62 | full_lengths_tr) = filter_mulhot(self.data_dir, items, 63 | item_feature_types, self.max_vocabulary_size, logit_ind2item_ind, 64 | prefix='item') 65 | 66 | i_attributes.set_target_prediction(features_cat2_tr, full_values_tr, 67 | full_segids_tr, full_lengths_tr) 68 | 69 | return u_attributes, i_attributes, item_ind2logit_ind, logit_ind2item_ind 70 | 71 | 72 | class MIX(Comb_Attributes): 73 | 74 | def __init__(self, data_dir, max_vocabulary_size=500000, logits_size_tr=50000, 75 | threshold=2): 76 | self.data_dir = data_dir 77 | self.max_vocabulary_size = max_vocabulary_size 78 | self.logits_size_tr = logits_size_tr 79 | self.threshold = threshold 80 | self.create_dictionary = create_dictionary_mix 81 | return 82 | 83 | def index_mapping(self, item2fea0, i_inds, M=None): 84 | item_ind2logit_ind = {} 85 | logit_ind2item_ind = {} 86 | 87 | item_ind_count = {} 88 | for i_ind in i_inds: 89 | item_ind_count[i_ind] = item_ind_count[i_ind] + 1 if i_ind in item_ind_count else 1 90 | ind_list = sorted(item_ind_count, key=item_ind_count.get, reverse=True) 91 | assert(self.logits_size_tr <= len(ind_list)), 'Item_vocab_size should be smaller than # of appeared items' 92 | ind_list = 
ind_list[:self.logits_size_tr] 93 | 94 | for index, elem in enumerate(ind_list): 95 | item_ind2logit_ind[elem] = index 96 | logit_ind2item_ind[index] = elem 97 | 98 | return item_ind2logit_ind, logit_ind2item_ind 99 | 100 | def mix_attr(self, users, items, user_features, item_features): 101 | user_feature_names, user_feature_types = user_features 102 | item_feature_names, item_feature_types = item_features 103 | user_feature_names[0] = 'uid' 104 | 105 | # user 106 | n = len(users) 107 | users2 = np.zeros((n, 1), dtype=object) 108 | for i in range(n): 109 | v = [] 110 | user = users[i, :] 111 | for j in range(len(user_feature_types)): 112 | t = user_feature_types[j] 113 | n = user_feature_names[j] 114 | if t == 0: 115 | v.append(n + str(user[j])) 116 | elif t == 1: 117 | v.extend([n + s for s in user[j].split(',')]) 118 | else: 119 | continue 120 | users2[i, 0] = ','.join(v) 121 | 122 | # item 123 | n = len(items) 124 | items2 = np.zeros((n, 1), dtype=object) 125 | for i in range(n): 126 | v = [] 127 | item = items[i, :] 128 | for j in range(len(item_feature_types)): 129 | t = item_feature_types[j] 130 | n = item_feature_names[j] 131 | if t == 0: 132 | v.append(n + str(item[j])) 133 | elif t == 1: 134 | v.extend([n + s for s in item[j].split(',')]) 135 | else: 136 | continue 137 | items2[i, 0] = ','.join(v) 138 | 139 | # modify attribute names and types 140 | if len(user_feature_types) == 1 and user_feature_types[0] == 0: 141 | user_features = ([['mix'], [0]]) 142 | else: 143 | user_features = ([['mix'], [1]]) 144 | if len(item_feature_types) == 1 and item_feature_types[0] == 0: 145 | item_features = ([['mix'], [0]]) 146 | else: 147 | item_features = ([['mix'], [1]]) 148 | return users2, items2, user_features, item_features 149 | 150 | 151 | class HET(Comb_Attributes): 152 | 153 | def __init__(self, data_dir, max_vocabulary_size=50000, logits_size_tr=50000, 154 | threshold=2): 155 | self.data_dir = data_dir 156 | self.max_vocabulary_size = max_vocabulary_size 157 | self.logits_size_tr = logits_size_tr 158 | self.threshold = threshold 159 | self.create_dictionary = create_dictionary 160 | return 161 | 162 | def index_mapping(self, item2fea0, i_inds, M): 163 | item_ind2logit_ind = {} 164 | logit_ind2item_ind = {} 165 | ind = 0 166 | for i in range(M): 167 | fea0 = item2fea0[i] 168 | if fea0 != 0: 169 | item_ind2logit_ind[i] = ind 170 | ind += 1 171 | assert(ind == self.logits_size_tr), 'Item_vocab_size %d too large! 
need to be no greater than %d\nFix: --item_vocab_size [smaller item_vocab_size]\n' % (self.logits_size_tr, ind) 172 | 173 | logit_ind2item_ind = {} 174 | for k, v in item_ind2logit_ind.items(): 175 | logit_ind2item_ind[v] = k 176 | return item_ind2logit_ind, logit_ind2item_ind 177 | 178 | -------------------------------------------------------------------------------- /attributes/input_attribute.py: -------------------------------------------------------------------------------- 1 | from comb_attribute import HET, MIX 2 | import attribute 3 | import cPickle as pickle 4 | import sys, os 5 | 6 | sys.path.insert(0, '../utils') 7 | from load_data import load_raw_data 8 | 9 | 10 | def read_data(raw_data_dir='../raw_data/data/', data_dir='../cache/data/', 11 | combine_att='mix', logits_size_tr='10000', 12 | thresh=2, use_user_feature=True, use_item_feature=True, no_user_id=False, 13 | test=False, mylog=None): 14 | 15 | if not mylog: 16 | def mylog(val): 17 | print(val) 18 | 19 | data_filename = os.path.join(data_dir, 'data') 20 | if os.path.isfile(data_filename): 21 | mylog("data file {} exists! loading cached data. \nCaution: change cached data dir (--data_dir) if new data (or new preprocessing) is used.".format(data_filename)) 22 | (data_tr, data_va, u_attr, i_attr, item_ind2logit_ind, 23 | logit_ind2item_ind, user_index, item_index) = pickle.load( 24 | open(data_filename, 'rb')) 25 | # u_attr.overview(mylog) 26 | # i_attr.overview(mylog) 27 | 28 | else: 29 | if not os.path.exists(data_dir): 30 | os.mkdir(data_dir) 31 | _submit = 1 if test else 0 32 | (users, items, data_tr, data_va, user_features, item_features, 33 | user_index, item_index) = load_raw_data(data_dir=raw_data_dir, _submit=_submit) 34 | if not use_user_feature: 35 | n = len(users) 36 | users = users[:, 0].reshape(n, 1) 37 | user_features = ([user_features[0][0]], [user_features[1][0]]) 38 | if not use_item_feature: 39 | m = len(items) 40 | items = items[:, 0].reshape(m, 1) 41 | item_features = ([item_features[0][0]], [item_features[1][0]]) 42 | 43 | if no_user_id: 44 | users[:, 0] = 0 45 | 46 | if combine_att == 'het': 47 | het = HET(data_dir=data_dir, logits_size_tr=logits_size_tr, threshold=thresh) 48 | u_attr, i_attr, item_ind2logit_ind, logit_ind2item_ind = het.get_attributes( 49 | users, items, data_tr, user_features, item_features) 50 | elif combine_att == 'mix': 51 | mix = MIX(data_dir=data_dir, logits_size_tr=logits_size_tr, 52 | threshold=thresh) 53 | users2, items2, user_features, item_features = mix.mix_attr(users, items, 54 | user_features, item_features) 55 | (u_attr, i_attr, item_ind2logit_ind, 56 | logit_ind2item_ind) = mix.get_attributes(users2, items2, data_tr, 57 | user_features, item_features) 58 | 59 | mylog("saving data format to data directory") 60 | from preprocess import pickle_save 61 | pickle_save((data_tr, data_va, u_attr, i_attr, 62 | item_ind2logit_ind, logit_ind2item_ind, user_index, item_index), data_filename) 63 | 64 | mylog('length of item_ind2logit_ind: {}'.format(len(item_ind2logit_ind))) 65 | 66 | # if FLAGS.dataset in ['ml', 'yelp']: 67 | # mylog('disabling the lstm-rec fake feature') 68 | # u_attr.num_features_cat = 1 69 | 70 | return (data_tr, data_va, u_attr, i_attr, item_ind2logit_ind, 71 | logit_ind2item_ind, user_index, item_index) 72 | 73 | -------------------------------------------------------------------------------- /attributes/mulhot_index.py: -------------------------------------------------------------------------------- 1 | 2 | from __future__ import absolute_import 3 | from 
__future__ import division 4 | from __future__ import print_function 5 | import tensorflow as tf 6 | 7 | def concat_versions(axis, value): 8 | if tf.__version__.startswith('0'): 9 | return tf.concat(axis, value) 10 | else: 11 | return tf.concat(value, axis) 12 | 13 | def batch_slice(target, begin, size, l): 14 | b = tf.unstack(begin) 15 | s = tf.unstack(size) 16 | res = [] 17 | for i in range(l): 18 | res.append(tf.slice(target, [b[i]], [s[i]])) 19 | return concat_versions(0, res) 20 | 21 | def batch_segids(size, l): 22 | s = tf.unstack(size) 23 | res = [] 24 | for i in range(l): 25 | ok = tf.tile([i], [s[i]]) 26 | res.append(ok) 27 | return concat_versions(0, res) 28 | 29 | def batch_slice_segids(target, begin, size, l): 30 | b = tf.unstack(begin) 31 | s = tf.unstack(size) 32 | res = [] 33 | res2 = [] 34 | for i in range(l): 35 | res.append(tf.slice(target, [b[i]], [s[i]])) 36 | res2.append(tf.tile([i], [s[i]])) 37 | return concat_versions(0, res), concat_versions(0, res2) 38 | 39 | def batch_slice20(target, b, s, l): 40 | res1, res2 = [], [] 41 | h = int(l/2) 42 | assert(l/2 == h) 43 | for i in range(h): 44 | res1.append(tf.slice(target, [b[i]], [s[i]])) 45 | res2.append(tf.slice(target, [b[i+h]], [s[i+h]])) 46 | return concat_versions(0, res1+res2) 47 | 48 | def batch_slice2(target, b, s, l): 49 | res = [] 50 | for i in range(l): 51 | res.append(tf.slice(target, [b[i]], [s[i]])) 52 | return concat_versions(0, res) 53 | 54 | def batch_segids20(s, l): 55 | res1, res2 = [], [] 56 | h = int(l/2) 57 | for i in range(h): 58 | res1.append(tf.tile([i], [s[i]])) 59 | res2.append(tf.tile([i+h], [s[i+h]])) 60 | return concat_versions(0, res1 + res2) 61 | 62 | def batch_segids2(s, l): 63 | res = [] 64 | for i in range(l): 65 | ok = tf.tile([i], [s[i]]) 66 | res.append(ok) 67 | return concat_versions(0, res) 68 | 69 | -------------------------------------------------------------------------------- /examples/README.md: -------------------------------------------------------------------------------- 1 | # Example scripts 2 | 3 | ### ./dataset 4 | 5 | Example dataset adapted from MovieLens 1m. 6 | 7 | ### run_hmf.sh 8 | 9 | Example script to run the HMF model on the ML1m data. 10 | 11 | ### run_lstm.sh 12 | 13 | Example script to run the LSTM model on the ML1m data. 14 | 15 | -------------------------------------------------------------------------------- /examples/dataset/README.md: -------------------------------------------------------------------------------- 1 | # Example Dataset 2 | 3 | ### source 4 | 5 | MovieLens 1m dataset (https://grouplens.org/datasets/movielens/1m/). 6 | 7 | ### filtering 8 | 9 | Reviews rated 4.0 or higher are used as implicit feedback. We filtered out users with fewer than 10 movie reviews. 10 | 11 | ### train/validation/test splitting 12 | 13 | Movie reviews associated with each user are sorted in chronological order and then split 3:1:1.
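The toolkit's own preprocessing lives under utils/; purely as an illustration of the per-user chronological 3:1:1 split described above, a standalone sketch (assuming interactions are given as (user, item, timestamp) tuples) could look like:

```
from collections import defaultdict

def split_3_1_1(interactions):
    # interactions: iterable of (user, item, timestamp) tuples
    by_user = defaultdict(list)
    for u, i, t in interactions:
        by_user[u].append((t, i))
    train, valid, test = [], [], []
    for u, events in by_user.items():
        events.sort()                        # chronological order per user
        n = len(events)
        n_tr, n_va = 3 * n // 5, 4 * n // 5  # 3:1:1 cut points
        train += [(u, i, t) for t, i in events[:n_tr]]
        valid += [(u, i, t) for t, i in events[n_tr:n_va]]
        test += [(u, i, t) for t, i in events[n_va:]]
    return train, valid, test
```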
14 | 15 | 16 | -------------------------------------------------------------------------------- /examples/dataset/i.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/skywaLKer518/A-Recsys/a1413234b4fff0321ae6391051fe9b4b600ec76a/examples/dataset/i.csv -------------------------------------------------------------------------------- /examples/dataset/i_attr.csv: -------------------------------------------------------------------------------- 1 | id genres title 2 | 0 1 1 3 | -------------------------------------------------------------------------------- /examples/dataset/u_attr.csv: -------------------------------------------------------------------------------- 1 | id gender age job 2 | 0 0 0 0 3 | -------------------------------------------------------------------------------- /examples/run_hmf.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # default values 4 | dfh=32 5 | dflr=1 6 | dfte=False 7 | dfn=1000 8 | dfrec=False 9 | 10 | # hyper-parameters 11 | h=${1:-$dfh} 12 | lr=${2:-$dflr} 13 | te=${3:-$dfte} 14 | n=${4:-$dfn} 15 | rec=${5:-$dfrec} 16 | 17 | if [ $# -ne 5 ] 18 | then 19 | echo "Number of arguments should be 5!" 20 | echo "Usage: bash run_hmf.sh [model_size (e.g. 32)] [learning-rate (e.g. 1)] [test (True or False)] [num_epoch (e.g. 50)] [recommend (True or False)]" 21 | if [ $# -gt 5 ] 22 | then 23 | exit 24 | fi 25 | echo "Run with default values" 26 | fi 27 | 28 | if [ ! -d "./cache" ] 29 | then 30 | mkdir ./cache 31 | fi 32 | if [ ! -d "./train" ] 33 | then 34 | mkdir ./train 35 | fi 36 | 37 | cd ../hmf/ 38 | 39 | python run_hmf.py --dataset ml1m --raw_data ../examples/dataset/ --data_dir ../examples/cache/ml1m --train_dir ../examples/train/hmf_ml1m_h${h}lr${lr}te${te} --dataset ml1m --raw_data ../examples/dataset/ --item_vocab_size 3100 --vocab_min_thresh 1 --steps_per_checkpoint 300 --loss ce --learning_rate ${lr} --size $h --n_epoch $n --test ${te} --recommend ${rec} 40 | 41 | echo 'finished!' 42 | -------------------------------------------------------------------------------- /examples/run_lstm.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # default values 4 | dfh=64 5 | dflr=1 6 | dfrec=False 7 | 8 | # hyper-parameters 9 | h=${1:-$dfh} 10 | lr=${2:-$dflr} 11 | rec=${3:-$dfrec} 12 | 13 | if [ $# -ne 3 ] 14 | then 15 | echo "Number of arguments should be 3!" 16 | echo "Usage: bash run_lstm.sh [model_size (e.g. 64)] [learning-rate (e.g. 1)] [recommend (True or False)]" 17 | if [ $# -gt 3 ] 18 | then 19 | echo "too many arguments. exit." 20 | exit 21 | fi 22 | echo "Not enough arguments. Run with default values" 23 | fi 24 | 25 | if [ ! -d "./cache" ] 26 | then 27 | mkdir ./cache 28 | fi 29 | if [ ! -d "./train" ] 30 | then 31 | mkdir ./train 32 | fi 33 | 34 | cd ../lstm/ 35 | 36 | python run.py --dataset ml1m --raw_data ../examples/dataset/ --data_dir ../examples/cache/ml1m --train_dir ../examples/train/lstm_ml1m_h${h}lr${lr} --item_vocab_size 3100 --vocab_min_thresh 1 --steps_per_checkpoint 5 --loss ce --learning_rate ${lr} --size $h --recommend ${rec} 37 | 38 | 39 | echo 'finished!' 
40 | -------------------------------------------------------------------------------- /examples/run_w2v.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # default values 4 | dfh=32 5 | dflr=1 6 | dw=5 7 | dni=3 8 | dfte=False 9 | dfn=1000 10 | dfrec=False 11 | dmodel=cbow 12 | dskips=3 13 | # hyper-parameters 14 | h=${1:-$dfh} 15 | lr=${2:-$dflr} 16 | w=${3:-$dw} 17 | ni=${4:-$dni} 18 | te=${5:-$dfte} 19 | n=${6:-$dfn} 20 | rec=${7:-$dfrec} 21 | 22 | if [ $# -ne 7 ] 23 | then 24 | echo "Number of arguments should be 5!" 25 | echo "Usage: bash run_w2v.sh [model_size (e.g. 32)] [learning-rate (e.g. 1)] [window size (e.g. 5)] [ni (e.g. 3)] [test (True or False)] [num_epoch (e.g. 50)] [recommend (True or False)]" 26 | if [ $# -gt 7 ] 27 | then 28 | exit 29 | fi 30 | echo "Run with default values" 31 | fi 32 | 33 | if [ ! -d "./cache" ] 34 | then 35 | mkdir ./cache 36 | fi 37 | if [ ! -d "./train" ] 38 | then 39 | mkdir ./train 40 | fi 41 | 42 | cd ../word2vec/ 43 | 44 | python run_w2v.py --model ${dmodel} --dataset ml1m --raw_data ../examples/dataset/ --data_dir ../examples/cache/ml1m --train_dir ../examples/train/${dmodel}_ml1m_h${h}lr${lr}w${w}ni${ni}n${n}te${te} --dataset ml1m --raw_data ../examples/dataset/ --item_vocab_size 3100 --vocab_min_thresh 1 --steps_per_checkpoint 300 --loss ce --learning_rate ${lr} --size $h --n_epoch $n --skip_window $w --ni ${ni} --num_skips ${dskips} --test ${te} --recommend ${rec} 45 | 46 | echo 'finished!' 47 | -------------------------------------------------------------------------------- /hmf/hmf_model.py: -------------------------------------------------------------------------------- 1 | 2 | from __future__ import absolute_import 3 | from __future__ import division 4 | from __future__ import print_function 5 | 6 | import random, math 7 | 8 | import numpy as np 9 | from six.moves import xrange # pylint: disable=redefined-builtin 10 | import tensorflow as tf 11 | 12 | import time 13 | import sys 14 | import itertools 15 | 16 | sys.path.insert(0, '../attributes') 17 | import embed_attribute 18 | 19 | class LatentProductModel(object): 20 | def __init__(self, user_size, item_size, size, 21 | num_layers, batch_size, learning_rate, 22 | learning_rate_decay_factor, user_attributes=None, 23 | item_attributes=None, item_ind2logit_ind=None, 24 | logit_ind2item_ind=None, loss_function='ce', GPU=None, 25 | logit_size_test=None, nonlinear=None, dropout=1.0, 26 | n_sampled=None, indices_item=None, dtype=tf.float32, 27 | top_N_items=100, hidden_size=500, loss_func='log', 28 | loss_exp_p = 1.005): 29 | 30 | self.user_size = user_size 31 | self.item_size = item_size 32 | self.top_N_items = top_N_items 33 | 34 | if user_attributes is not None: 35 | user_attributes.set_model_size(size) 36 | self.user_attributes = user_attributes 37 | if item_attributes is not None: 38 | item_attributes.set_model_size(size) 39 | self.item_attributes = item_attributes 40 | 41 | self.item_ind2logit_ind = item_ind2logit_ind 42 | self.logit_ind2item_ind = logit_ind2item_ind 43 | if logit_ind2item_ind is not None: 44 | self.logit_size = len(logit_ind2item_ind) 45 | if indices_item is not None: 46 | self.indices_item = indices_item 47 | else: 48 | self.indices_item = range(self.logit_size) 49 | self.logit_size_test = logit_size_test 50 | 51 | self.nonlinear = nonlinear 52 | self.loss_function = loss_function 53 | self.n_sampled = n_sampled 54 | self.batch_size = batch_size 55 | 56 | self.learning_rate = tf.Variable(float(learning_rate), 
trainable=False) 57 | self.learning_rate_decay_op = self.learning_rate.assign( 58 | self.learning_rate * learning_rate_decay_factor) 59 | self.global_step = tf.Variable(0, trainable=False) 60 | self.att_emb = None 61 | self.dtype=dtype 62 | 63 | self.data_length = None 64 | self.train_permutation = None 65 | self.start_index = None 66 | 67 | mb = self.batch_size 68 | ''' this is mapped item target ''' 69 | self.item_target = tf.placeholder(tf.int32, shape = [mb], name = "item") 70 | self.item_id_target = tf.placeholder(tf.int32, shape = [mb], name = "item_id") 71 | 72 | self.dropout = dropout 73 | self.keep_prob = tf.placeholder(tf.float32, name='keep_prob') 74 | 75 | m = embed_attribute.EmbeddingAttribute(user_attributes, item_attributes, mb, 76 | self.n_sampled, 0, False, item_ind2logit_ind, logit_ind2item_ind) 77 | self.att_emb = m 78 | embedded_user, user_b = m.get_batch_user(self.keep_prob, False) 79 | 80 | if self.nonlinear in ['relu', 'tanh']: 81 | act = tf.nn.relu if self.nonlinear == 'relu' else tf.tanh 82 | w1 = tf.get_variable('w1', [size, hidden_size], dtype=self.dtype) 83 | b1 = tf.get_variable('b1', [hidden_size], dtype=self.dtype) 84 | w2 = tf.get_variable('w2', [hidden_size, size], dtype=self.dtype) 85 | b2 = tf.get_variable('b2', [size], dtype=self.dtype) 86 | 87 | embedded_user, user_b = m.get_batch_user(1.0, False) 88 | h0 = tf.nn.dropout(act(embedded_user), self.keep_prob) 89 | 90 | h1 = act(tf.matmul(h0, w1) + b1) 91 | h1 = tf.nn.dropout(h1, self.keep_prob) 92 | 93 | h2 = act(tf.matmul(h1, w2) + b2) 94 | embedded_user = tf.nn.dropout(h2, self.keep_prob) 95 | 96 | pos_embs_item, pos_item_b = m.get_batch_item('pos', batch_size) 97 | pos_embs_item = tf.reduce_mean(pos_embs_item, 0) 98 | 99 | neg_embs_item, neg_item_b = m.get_batch_item('neg', batch_size) 100 | neg_embs_item = tf.reduce_mean(neg_embs_item, 0) 101 | # print('debug: user, item dim', embedded_user.get_shape(), neg_embs_item.get_shape()) 102 | 103 | print("construct postive/negative items/scores \n(for bpr loss, AUC)") 104 | self.pos_score = tf.reduce_sum(tf.multiply(embedded_user, pos_embs_item), 1) + pos_item_b 105 | self.neg_score = tf.reduce_sum(tf.multiply(embedded_user, neg_embs_item), 1) + neg_item_b 106 | neg_pos = self.neg_score - self.pos_score 107 | self.auc = 0.5 - 0.5 * tf.reduce_mean(tf.sign(neg_pos)) 108 | 109 | # mini batch version 110 | if self.n_sampled is not None: 111 | print("sampled prediction") 112 | sampled_logits = m.get_prediction(embedded_user, 'sampled') 113 | # embedded_item, item_b = m.get_sampled_item(self.n_sampled) 114 | # sampled_logits = tf.matmul(embedded_user, tf.transpose(embedded_item)) + item_b 115 | target_score = m.get_target_score(embedded_user, self.item_id_target) 116 | 117 | print("non-sampled prediction") 118 | logits = m.get_prediction(embedded_user) 119 | 120 | loss = self.loss_function 121 | if loss in ['warp', 'ce', 'rs', 'rs-sig', 'rs-sig2', 'bbpr']: 122 | batch_loss = m.compute_loss(logits, self.item_target, loss, 123 | loss_func=loss_func, exp_p=loss_exp_p) 124 | elif loss in ['warp_eval']: 125 | batch_loss, batch_rank = m.compute_loss(logits, self.item_target, loss) 126 | 127 | elif loss in ['mw']: 128 | # batch_loss = m.compute_loss(sampled_logits, self.pos_score, loss) 129 | batch_loss = m.compute_loss(sampled_logits, target_score, loss) 130 | batch_loss_eval = m.compute_loss(logits, self.item_target, 'warp') 131 | 132 | elif loss in ['bpr', 'bpr-hinge']: 133 | batch_loss = m.compute_loss(neg_pos, self.item_target, loss) 134 | else: 135 | print("not 
implemented!") 136 | exit(-1) 137 | if loss in ['warp', 'warp_eval', 'mw', 'rs', 'rs-sig', 'rs-sig2', 'bbpr']: 138 | self.set_mask, self.reset_mask = m.get_warp_mask() 139 | 140 | self.loss = tf.reduce_mean(batch_loss) 141 | self.batch_loss = batch_loss 142 | if loss in ['warp_eval']: 143 | self.batch_rank = batch_rank 144 | self.loss_eval = tf.reduce_mean(batch_loss_eval) if loss == 'mw' else self.loss 145 | # Gradients and SGD update operation for training the model. 146 | params = tf.trainable_variables() 147 | opt = tf.train.AdagradOptimizer(self.learning_rate) 148 | # opt = tf.train.AdamOptimizer(self.learning_rate) 149 | gradients = tf.gradients(self.loss, params) 150 | self.updates = opt.apply_gradients( 151 | zip(gradients, params), global_step=self.global_step) 152 | 153 | self.output = logits 154 | values, self.indices= tf.nn.top_k(self.output, self.top_N_items, sorted=True) 155 | # self.saver = tf.train.Saver(tf.global_variables()) 156 | self.saver = tf.train.Saver(tf.global_variables()) 157 | 158 | def prepare_warp(self, pos_item_set, pos_item_set_eval): 159 | self.att_emb.prepare_warp(pos_item_set, pos_item_set_eval) 160 | return 161 | 162 | def step(self, session, user_input, item_input, neg_item_input=None, 163 | item_sampled = None, item_sampled_id2idx = None, 164 | forward_only=False, recommend=False, recommend_new = False, loss=None, 165 | run_op=None, run_meta=None): 166 | input_feed = {} 167 | if forward_only or recommend: 168 | input_feed[self.keep_prob.name] = 1.0 169 | else: 170 | input_feed[self.keep_prob.name] = self.dropout 171 | 172 | if recommend == False: 173 | targets = self.att_emb.target_mapping([item_input]) 174 | input_feed[self.item_target.name] = targets[0] 175 | if loss in ['mw']: 176 | input_feed[self.item_id_target.name] = item_input 177 | 178 | # if loss in ['mw', 'mce'] and recommend == False: 179 | # input_feed[self.item_target.name] = [item_sampled_id2idx[v] for v in item_input] 180 | 181 | if self.att_emb is not None: 182 | (update_sampled, input_feed_sampled, 183 | input_feed_warp) = self.att_emb.add_input(input_feed, user_input, 184 | item_input, neg_item_input=neg_item_input, 185 | item_sampled = item_sampled, item_sampled_id2idx = item_sampled_id2idx, 186 | forward_only=forward_only, recommend=recommend, loss = loss) 187 | 188 | if not recommend: 189 | if not forward_only: 190 | # output_feed = [self.updates, self.loss, self.auc] 191 | output_feed = [self.updates, self.loss] 192 | # output_feed = [self.embedded_user, self.pos_embs_item] 193 | else: 194 | # output_feed = [self.loss_eval, self.auc] 195 | output_feed = [self.loss_eval] 196 | else: 197 | if recommend_new: 198 | output_feed = [self.indices_test] 199 | else: 200 | output_feed = [self.indices] 201 | 202 | # for warp_eval 203 | if loss in ['warp_eval']: 204 | output_feed = [self.batch_loss, self.batch_rank] 205 | 206 | if item_sampled is not None and loss in ['mw', 'mce']: 207 | session.run(update_sampled, input_feed_sampled) 208 | 209 | if (loss in ['warp', 'warp_eval', 'rs', 'rs-sig', 'rs-sig2', 'bbpr', 'mw']) and recommend is False: 210 | session.run(self.set_mask[loss], input_feed_warp) 211 | 212 | if run_op is not None and run_meta is not None: 213 | outputs = session.run(output_feed, input_feed, options=run_op, run_metadata=run_meta) 214 | else: 215 | outputs = session.run(output_feed, input_feed) 216 | 217 | if (loss in ['warp', 'warp_eval', 'rs', 'rs-sig', 'rs-sig2', 'bbpr', 'mw']) and recommend is False: 218 | session.run(self.reset_mask[loss], input_feed_warp) 219 | 220 | 
if loss in ['warp_eval']: 221 | return outputs 222 | if not recommend: 223 | if not forward_only: 224 | return outputs[1]#, outputs[2]#, outputs[3] #, outputs[3], outputs[4] 225 | else: 226 | return outputs[0]#, outputs[1] 227 | else: 228 | return outputs[0] 229 | 230 | def get_batch(self, data, loss = 'ce', hist = None): 231 | batch_user_input, batch_item_input = [], [] 232 | batch_neg_item_input = [] 233 | 234 | count = 0 235 | while count < self.batch_size: 236 | u, i, _ = random.choice(data) 237 | batch_user_input.append(u) 238 | batch_item_input.append(i) 239 | count += 1 240 | 241 | return batch_user_input, batch_item_input, batch_neg_item_input 242 | 243 | def get_permuted_batch(self, data): 244 | batch_user_input, batch_item_input = [], [] 245 | if self.data_length == None: 246 | self.data_length = len(data) 247 | self.start_index = 0 248 | self.train_permutation = np.random.permutation(self.data_length) 249 | if self.start_index + self.batch_size >= self.data_length: 250 | self.start_index = 0 251 | self.train_permutation = np.random.permutation(self.data_length) 252 | 253 | indices = range(self.start_index, self.start_index + self.batch_size) 254 | indices = self.train_permutation[indices] 255 | self.start_index += self.batch_size 256 | for j in indices: 257 | u, i , _ = data[j] 258 | batch_user_input.append(u) 259 | batch_item_input.append(i) 260 | return batch_user_input, batch_item_input, None 261 | 262 | 263 | # def get_eval_batch(self, loss, users, items, hist = None): 264 | # neg_items = [] 265 | # l, i = len(users), 0 266 | # while i < l: 267 | # u = users[i] 268 | # i2 = random.choice(self.indices_item) 269 | # while i2 in hist[u]: 270 | # i2 = random.choice(self.indices_item) 271 | # neg_items.append(i2) 272 | # i += 1 273 | 274 | # return neg_items #, None, None 275 | -------------------------------------------------------------------------------- /hmf/run_hmf.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import math, os, sys 6 | import random, time 7 | import logging 8 | import numpy as np 9 | from six.moves import xrange # pylint: disable=redefined-builtin 10 | import tensorflow as tf 11 | sys.path.insert(0, '../utils') 12 | sys.path.insert(0, '../attributes') 13 | 14 | from input_attribute import read_data 15 | from prepare_train import positive_items, item_frequency, sample_items 16 | 17 | # datasets, paths, and preprocessing 18 | tf.app.flags.DEFINE_string("dataset", "xing", ".") 19 | tf.app.flags.DEFINE_string("raw_data", "../raw_data", "input data directory") 20 | tf.app.flags.DEFINE_string("data_dir", "./cache0", "Cached data directory") 21 | tf.app.flags.DEFINE_string("train_dir", "./tmp", "Training directory.") 22 | tf.app.flags.DEFINE_boolean("test", False, "Test on test splits") 23 | tf.app.flags.DEFINE_string("combine_att", 'mix', "method to combine attributes: het or mix") 24 | tf.app.flags.DEFINE_boolean("use_user_feature", True, "RT") 25 | tf.app.flags.DEFINE_boolean("use_item_feature", True, "RT") 26 | tf.app.flags.DEFINE_integer("user_vocab_size", 150000, "User vocabulary size.") 27 | tf.app.flags.DEFINE_integer("item_vocab_size", 50000, "Item vocabulary size.") 28 | tf.app.flags.DEFINE_integer("item_vocab_min_thresh", 2, "filter inactive tokens.") 29 | 30 | # tuning hypers 31 | tf.app.flags.DEFINE_string("loss", 'ce', "loss function: ce, warp, (mw, mce, bpr)") 32 | 
tf.app.flags.DEFINE_string("loss_func", 'log', "loss function: log, exp, poly") 33 | tf.app.flags.DEFINE_float("loss_exp_p", 1.0005, "p in 1-p^{-x}; or in x^p") 34 | tf.app.flags.DEFINE_float("learning_rate", 0.1, "Learning rate.") 35 | tf.app.flags.DEFINE_float("keep_prob", 0.5, "dropout rate.") 36 | tf.app.flags.DEFINE_float("learning_rate_decay_factor", 1.0, 37 | "Learning rate decays by this much.") 38 | tf.app.flags.DEFINE_integer("batch_size", 64, 39 | "Batch size to use during training.") 40 | tf.app.flags.DEFINE_integer("size", 20, "Size of each embedding.") 41 | tf.app.flags.DEFINE_integer("patience", 20, 42 | "exit if the model can't improve for $patience evals") 43 | tf.app.flags.DEFINE_integer("n_epoch", 1000, "How many epochs to train.") 44 | tf.app.flags.DEFINE_integer("steps_per_checkpoint", 4000, 45 | "How many training steps to do per checkpoint.") 46 | 47 | # to recommend 48 | tf.app.flags.DEFINE_boolean("recommend", False, 49 | "Set to True for recommend items.") 50 | tf.app.flags.DEFINE_string("saverec", False, "") 51 | tf.app.flags.DEFINE_integer("top_N_items", 100, 52 | "number of items output") 53 | tf.app.flags.DEFINE_boolean("recommend_new", False, 54 | "Set to True for recommend new items that were not used to train.") 55 | 56 | # nonlinear 57 | tf.app.flags.DEFINE_string("nonlinear", 'linear', "nonlinear activation") 58 | tf.app.flags.DEFINE_integer("hidden_size", 500, "when nonlinear proj used") 59 | tf.app.flags.DEFINE_integer("num_layers", 1, "Number of layers in the model.") 60 | 61 | # algorithms with sampling 62 | tf.app.flags.DEFINE_float("power", 0.5, "related to sampling rate.") 63 | tf.app.flags.DEFINE_integer("n_resample", 50, "iterations before resample.") 64 | tf.app.flags.DEFINE_integer("n_sampled", 1024, "sampled softmax/warp loss.") 65 | 66 | tf.app.flags.DEFINE_string("sample_type", 'random', "random, sweep, permute") 67 | tf.app.flags.DEFINE_float("user_sample", 1.0, "user sample rate.") 68 | tf.app.flags.DEFINE_integer("seed", 0, "mini batch sampling random seed.") 69 | 70 | # 71 | tf.app.flags.DEFINE_integer("gpu", -1, "gpu card number") 72 | tf.app.flags.DEFINE_boolean("profile", False, "False = no profile, True = profile") 73 | tf.app.flags.DEFINE_boolean("device_log", False, 74 | "Set to True for logging device usages.") 75 | tf.app.flags.DEFINE_boolean("eval", True, 76 | "Set to True for evaluation.") 77 | tf.app.flags.DEFINE_boolean("use_more_train", False, 78 | "Set true if use non-appearred items to train.") 79 | tf.app.flags.DEFINE_string("model_option", 'loss', 80 | "model to evaluation") 81 | 82 | # tf.app.flags.DEFINE_integer("max_train_data_size", 0, 83 | # "Limit on the size of training data (0: no limit).") 84 | # Xing related 85 | # tf.app.flags.DEFINE_integer("ta", 0, "target_active") 86 | 87 | 88 | 89 | FLAGS = tf.app.flags.FLAGS 90 | 91 | def mylog(msg): 92 | print(msg) 93 | logging.info(msg) 94 | return 95 | 96 | def create_model(session, u_attributes=None, i_attributes=None, 97 | item_ind2logit_ind=None, logit_ind2item_ind=None, 98 | loss = FLAGS.loss, logit_size_test=None, ind_item = None): 99 | gpu = None if FLAGS.gpu == -1 else FLAGS.gpu 100 | n_sampled = FLAGS.n_sampled if FLAGS.loss in ['mw', 'mce'] else None 101 | import hmf_model 102 | model = hmf_model.LatentProductModel(FLAGS.user_vocab_size, 103 | FLAGS.item_vocab_size, FLAGS.size, FLAGS.num_layers, 104 | FLAGS.batch_size, FLAGS.learning_rate, 105 | FLAGS.learning_rate_decay_factor, u_attributes, i_attributes, 106 | item_ind2logit_ind, logit_ind2item_ind, 
loss_function = loss, GPU=gpu, 107 | logit_size_test=logit_size_test, nonlinear=FLAGS.nonlinear, 108 | dropout=FLAGS.keep_prob, n_sampled=n_sampled, indices_item=ind_item, 109 | top_N_items=FLAGS.top_N_items, hidden_size=FLAGS.hidden_size, 110 | loss_func= FLAGS.loss_func, loss_exp_p = FLAGS.loss_exp_p) 111 | 112 | if not os.path.isdir(FLAGS.train_dir): 113 | os.mkdir(FLAGS.train_dir) 114 | ckpt = tf.train.get_checkpoint_state(FLAGS.train_dir) 115 | 116 | if ckpt: 117 | print("Reading model parameters from %s" % ckpt.model_checkpoint_path) 118 | logging.info("Reading model parameters from %s" % ckpt.model_checkpoint_path) 119 | model.saver.restore(session, ckpt.model_checkpoint_path) 120 | else: 121 | print("Created model with fresh parameters.") 122 | logging.info("Created model with fresh parameters.") 123 | # session.run(tf.global_variables_initializer()) 124 | session.run(tf.global_variables_initializer()) 125 | return model 126 | 127 | def train(raw_data=FLAGS.raw_data, train_dir=FLAGS.train_dir, mylog=mylog, 128 | data_dir=FLAGS.data_dir, combine_att=FLAGS.combine_att, test=FLAGS.test, 129 | logits_size_tr=FLAGS.item_vocab_size, thresh=FLAGS.item_vocab_min_thresh, 130 | use_user_feature=FLAGS.use_user_feature, 131 | use_item_feature=FLAGS.use_item_feature, 132 | batch_size=FLAGS.batch_size, steps_per_checkpoint=FLAGS.steps_per_checkpoint, 133 | loss_func=FLAGS.loss, max_patience=FLAGS.patience, go_test=FLAGS.test, 134 | max_epoch=FLAGS.n_epoch, sample_type=FLAGS.sample_type, power=FLAGS.power, 135 | use_more_train=FLAGS.use_more_train, profile=FLAGS.profile, 136 | device_log=FLAGS.device_log): 137 | with tf.Session(config=tf.ConfigProto(allow_soft_placement=True, 138 | log_device_placement=device_log)) as sess: 139 | run_options = None 140 | run_metadata = None 141 | if profile: 142 | # in order to profile 143 | from tensorflow.python.client import timeline 144 | run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) 145 | run_metadata = tf.RunMetadata() 146 | steps_per_checkpoint = 30 147 | 148 | mylog("reading data") 149 | (data_tr, data_va, u_attributes, i_attributes,item_ind2logit_ind, 150 | logit_ind2item_ind, _, _) = read_data( 151 | raw_data_dir=raw_data, 152 | data_dir=data_dir, 153 | combine_att=combine_att, 154 | logits_size_tr=logits_size_tr, 155 | thresh=thresh, 156 | use_user_feature=use_user_feature, 157 | use_item_feature=use_item_feature, 158 | test=test, 159 | mylog=mylog) 160 | 161 | mylog("train/dev size: %d/%d" %(len(data_tr),len(data_va))) 162 | 163 | ''' 164 | remove some rare items in both train and valid set 165 | this helps make train/valid set distribution similar 166 | to each other 167 | ''' 168 | mylog("original train/dev size: %d/%d" %(len(data_tr),len(data_va))) 169 | data_tr = [p for p in data_tr if (p[1] in item_ind2logit_ind)] 170 | data_va = [p for p in data_va if (p[1] in item_ind2logit_ind)] 171 | mylog("new train/dev size: %d/%d" %(len(data_tr),len(data_va))) 172 | 173 | random.seed(FLAGS.seed) 174 | 175 | item_pop, p_item = item_frequency(data_tr, power) 176 | 177 | if use_more_train: 178 | item_population = range(len(item_ind2logit_ind)) 179 | else: 180 | item_population = item_pop 181 | 182 | model = create_model(sess, u_attributes, i_attributes, item_ind2logit_ind, 183 | logit_ind2item_ind, loss=loss_func, ind_item=item_population) 184 | 185 | pos_item_list, pos_item_list_val = None, None 186 | if loss_func in ['warp', 'mw', 'rs', 'rs-sig', 'rs-sig2', 'bbpr']: 187 | pos_item_list, pos_item_list_val = positive_items(data_tr, 
data_va) 188 | model.prepare_warp(pos_item_list, pos_item_list_val) 189 | 190 | mylog('started training') 191 | step_time, loss, current_step, auc = 0.0, 0.0, 0, 0.0 192 | 193 | repeat = 5 if loss_func.startswith('bpr') else 1 194 | patience = max_patience 195 | 196 | if os.path.isfile(os.path.join(train_dir, 'auc_train.npy')): 197 | auc_train = list(np.load(os.path.join(train_dir, 'auc_train.npy'))) 198 | auc_dev = list(np.load(os.path.join(train_dir, 'auc_dev.npy'))) 199 | previous_losses = list(np.load(os.path.join(train_dir, 200 | 'loss_train.npy'))) 201 | losses_dev = list(np.load(os.path.join(train_dir, 'loss_dev.npy'))) 202 | best_auc = max(auc_dev) 203 | best_loss = min(losses_dev) 204 | else: 205 | previous_losses, auc_train, auc_dev, losses_dev = [], [], [], [] 206 | best_auc, best_loss = -1, 1000000 207 | 208 | item_sampled, item_sampled_id2idx = None, None 209 | 210 | if sample_type == 'random': 211 | get_next_batch = model.get_batch 212 | elif sample_type == 'permute': 213 | get_next_batch = model.get_permuted_batch 214 | else: 215 | print('not implemented!') 216 | exit() 217 | 218 | train_total_size = float(len(data_tr)) 219 | n_epoch = max_epoch 220 | steps_per_epoch = int(1.0 * train_total_size / batch_size) 221 | total_steps = steps_per_epoch * n_epoch 222 | 223 | mylog("Train:") 224 | mylog("total: {}".format(train_total_size)) 225 | mylog("Steps_per_epoch: {}".format(steps_per_epoch)) 226 | mylog("Total_steps:{}".format(total_steps)) 227 | mylog("Dev:") 228 | mylog("total: {}".format(len(data_va))) 229 | 230 | mylog("\n\ntraining start!") 231 | while True: 232 | ranndom_number_01 = np.random.random_sample() 233 | start_time = time.time() 234 | (user_input, item_input, neg_item_input) = get_next_batch(data_tr) 235 | 236 | if loss_func in ['mw', 'mce'] and current_step % FLAGS.n_resample == 0: 237 | item_sampled, item_sampled_id2idx = sample_items(item_population, 238 | FLAGS.n_sampled, p_item) 239 | else: 240 | item_sampled = None 241 | 242 | step_loss = model.step(sess, user_input, item_input, 243 | neg_item_input, item_sampled, item_sampled_id2idx, loss=loss_func, 244 | run_op=run_options, run_meta=run_metadata) 245 | 246 | step_time += (time.time() - start_time) / steps_per_checkpoint 247 | loss += step_loss / steps_per_checkpoint 248 | current_step += 1 249 | if model.global_step.eval() > total_steps: 250 | mylog("Training reaches maximum steps. Terminating...") 251 | break 252 | 253 | if current_step % steps_per_checkpoint == 0: 254 | 255 | if loss_func in ['ce', 'mce']: 256 | perplexity = math.exp(loss) if loss < 300 else float('inf') 257 | mylog("global step %d learning rate %.4f step-time %.4f perplexity %.2f" % (model.global_step.eval(), model.learning_rate.eval(), step_time, perplexity)) 258 | else: 259 | mylog("global step %d learning rate %.4f step-time %.4f loss %.3f" % (model.global_step.eval(), model.learning_rate.eval(), step_time, loss)) 260 | if profile: 261 | # Create the Timeline object, and write it to a json 262 | tl = timeline.Timeline(run_metadata.step_stats) 263 | ctf = tl.generate_chrome_trace_format() 264 | with open('timeline.json', 'w') as f: 265 | f.write(ctf) 266 | exit() 267 | 268 | # Decrease learning rate if no improvement was seen over last 3 times. 269 | if len(previous_losses) > 2 and loss > max(previous_losses[-3:]): 270 | sess.run(model.learning_rate_decay_op) 271 | previous_losses.append(loss) 272 | auc_train.append(auc) 273 | 274 | # Reset timer and loss. 
275 | step_time, loss, auc = 0.0, 0.0, 0.0 276 | 277 | if not FLAGS.eval: 278 | continue 279 | 280 | 281 | # Run evals on development set and print their loss. 282 | l_va = len(data_va) 283 | eval_loss, eval_auc = 0.0, 0.0 284 | count_va = 0 285 | start_time = time.time() 286 | for idx_s in range(0, l_va, batch_size): 287 | idx_e = idx_s + batch_size 288 | if idx_e > l_va: 289 | break 290 | lt = data_va[idx_s:idx_e] 291 | user_va = [x[0] for x in lt] 292 | item_va = [x[1] for x in lt] 293 | for _ in range(repeat): 294 | item_va_neg = None 295 | the_loss = 'warp' if loss_func == 'mw' else loss_func 296 | eval_loss0 = model.step(sess, user_va, item_va, item_va_neg, 297 | None, None, forward_only=True, 298 | loss=the_loss) 299 | eval_loss += eval_loss0 300 | count_va += 1 301 | eval_loss /= count_va 302 | eval_auc /= count_va 303 | step_time = (time.time() - start_time) / count_va 304 | if loss_func in ['ce', 'mce']: 305 | eval_ppx = math.exp(eval_loss) if eval_loss < 300 else float('inf') 306 | mylog(" dev: perplexity %.2f eval_auc(not computed) %.4f step-time %.4f" % ( 307 | eval_ppx, eval_auc, step_time)) 308 | else: 309 | mylog(" dev: loss %.3f eval_auc(not computed) %.4f step-time %.4f" % (eval_loss, 310 | eval_auc, step_time)) 311 | sys.stdout.flush() 312 | 313 | if eval_loss < best_loss and not go_test: 314 | best_loss = eval_loss 315 | patience = max_patience 316 | checkpoint_path = os.path.join(train_dir, "best.ckpt") 317 | mylog('Saving best model...') 318 | model.saver.save(sess, checkpoint_path, 319 | global_step=0, write_meta_graph = False) 320 | 321 | if go_test: 322 | checkpoint_path = os.path.join(train_dir, "best.ckpt") 323 | mylog('Saving best model...') 324 | model.saver.save(sess, checkpoint_path, 325 | global_step=0, write_meta_graph = False) 326 | 327 | if eval_loss > best_loss: 328 | patience -= 1 329 | 330 | auc_dev.append(eval_auc) 331 | losses_dev.append(eval_loss) 332 | 333 | if patience < 0 and not go_test: 334 | mylog("no improvement for too long.. 
terminating..") 335 | mylog("best loss %.4f" % best_loss) 336 | sys.stdout.flush() 337 | break 338 | return 339 | 340 | def recommend(target_uids=[], raw_data=FLAGS.raw_data, data_dir=FLAGS.data_dir, 341 | combine_att=FLAGS.combine_att, logits_size_tr=FLAGS.item_vocab_size, 342 | item_vocab_min_thresh=FLAGS.item_vocab_min_thresh, loss=FLAGS.loss, 343 | top_n=FLAGS.top_N_items, test=FLAGS.test, mylog=mylog, 344 | use_user_feature=FLAGS.use_user_feature, 345 | use_item_feature=FLAGS.use_item_feature, 346 | batch_size=FLAGS.batch_size, device_log=FLAGS.device_log): 347 | 348 | with tf.Session(config=tf.ConfigProto(allow_soft_placement=True, 349 | log_device_placement=device_log)) as sess: 350 | mylog("reading data") 351 | (_, _, u_attributes, i_attributes, item_ind2logit_ind, 352 | logit_ind2item_ind, user_index, item_index) = read_data( 353 | raw_data_dir=raw_data, 354 | data_dir=data_dir, 355 | combine_att=combine_att, 356 | logits_size_tr=logits_size_tr, 357 | thresh=item_vocab_min_thresh, 358 | use_user_feature=use_user_feature, 359 | use_item_feature=use_item_feature, 360 | test=test, 361 | mylog=mylog) 362 | 363 | model = create_model(sess, u_attributes, i_attributes, item_ind2logit_ind, 364 | logit_ind2item_ind, loss=loss, ind_item=None) 365 | 366 | Uinds = [user_index[v] for v in target_uids] 367 | 368 | N = len(Uinds) 369 | mylog("%d target users to recommend" % N) 370 | rec = np.zeros((N, top_n), dtype=int) 371 | 372 | count = 0 373 | time_start = time.time() 374 | for idx_s in range(0, N, batch_size): 375 | count += 1 376 | if count % 100 == 0: 377 | mylog("idx: %d, c: %d" % (idx_s, count)) 378 | 379 | idx_e = idx_s + batch_size 380 | if idx_e <= N: 381 | users = Uinds[idx_s: idx_e] 382 | recs = model.step(sess, users, None, None, forward_only=True, 383 | recommend=True) 384 | rec[idx_s:idx_e, :] = recs 385 | else: 386 | users = range(idx_s, N) + [0] * (idx_e - N) 387 | users = [Uinds[t] for t in users] 388 | recs = model.step(sess, users, None, None, forward_only=True, 389 | recommend=True) 390 | idx_e = N 391 | rec[idx_s:idx_e, :] = recs[:(idx_e-idx_s),:] 392 | 393 | time_end = time.time() 394 | mylog("Time used %.1f" % (time_end - time_start)) 395 | 396 | # transform result to a dictionary 397 | # R[user_id] = [item_id1, item_id2, ...] 
398 | 399 | ind2id = {} 400 | for iid in item_index: 401 | uind = item_index[iid] 402 | assert(uind not in ind2id) 403 | ind2id[uind] = iid 404 | R = {} 405 | for i in xrange(N): 406 | uid = target_uids[i] 407 | R[uid] = [ind2id[logit_ind2item_ind[v]] for v in list(rec[i, :])] 408 | 409 | return R 410 | 411 | def compute_scores(raw_data_dir=FLAGS.raw_data, data_dir=FLAGS.data_dir, 412 | dataset=FLAGS.dataset, save_recommendation=FLAGS.saverec, 413 | train_dir=FLAGS.train_dir, test=FLAGS.test): 414 | 415 | from evaluate import Evaluation as Evaluate 416 | evaluation = Evaluate(raw_data_dir, test=test) 417 | 418 | R = recommend(evaluation.get_uids(), data_dir=data_dir) 419 | 420 | evaluation.eval_on(R) 421 | scores_self, scores_ex = evaluation.get_scores() 422 | mylog("====evaluation scores (NDCG, RECALL, PRECISION, MAP) @ 2,5,10,20,30====") 423 | mylog("METRIC_FORMAT (self): {}".format(scores_self)) 424 | mylog("METRIC_FORMAT (ex ): {}".format(scores_ex)) 425 | if save_recommendation: 426 | name_inds = os.path.join(train_dir, "indices.npy") 427 | np.save(name_inds, rec) 428 | 429 | def main(_): 430 | 431 | if FLAGS.test: 432 | if FLAGS.data_dir[-1] == '/': 433 | FLAGS.data_dir = FLAGS.data_dir[:-1] + '_test' 434 | else: 435 | FLAGS.data_dir = FLAGS.data_dir + '_test' 436 | 437 | if not os.path.exists(FLAGS.train_dir): 438 | os.mkdir(FLAGS.train_dir) 439 | if not FLAGS.recommend: 440 | print('train') 441 | log_path = os.path.join(FLAGS.train_dir,"log.txt") 442 | logging.basicConfig(filename=log_path,level=logging.DEBUG) 443 | train(data_dir=FLAGS.data_dir) 444 | else: 445 | print('recommend') 446 | log_path = os.path.join(FLAGS.train_dir,"log.recommend.txt") 447 | logging.basicConfig(filename=log_path,level=logging.DEBUG) 448 | compute_scores(data_dir=FLAGS.data_dir) 449 | return 450 | 451 | if __name__ == "__main__": 452 | tf.app.run() 453 | 454 | -------------------------------------------------------------------------------- /lstm/best_buckets.py: -------------------------------------------------------------------------------- 1 | def calculate_buckets(array, max_length, max_buckets): 2 | d = {} 3 | for u,ll in array: 4 | length = len(ll) 5 | if not length in d: 6 | d[length] = 0 7 | d[length] += 1 8 | 9 | dd = [(x, d[x]) for x in d] 10 | dd = sorted(dd, key = lambda x: x[0]) 11 | running_sum = [] 12 | s = 0 13 | for l, n in dd: 14 | s += n 15 | running_sum.append((l,s)) 16 | 17 | def best_point(ll): 18 | # return index so that l[:index+1] and l[index+1:] 19 | index = 0 20 | maxv = 0 21 | base = ll[0][1] 22 | for i in xrange(len(ll)): 23 | l,n = ll[i] 24 | v = (ll[-1][0] - l) * (n-base) 25 | if v > maxv: 26 | maxv = v 27 | index = i 28 | return index, maxv 29 | 30 | def arg_max(array,key): 31 | maxv = -10000 32 | index = -1 33 | for i in xrange(len(array)): 34 | item = array[i] 35 | v = key(item) 36 | if v > maxv: 37 | maxv = v 38 | index = i 39 | return index 40 | 41 | end_index = 0 42 | for i in xrange(len(running_sum)-1,-1,-1): 43 | if running_sum[i][0] <= max_length: 44 | end_index = i+1 45 | break 46 | 47 | print running_sum 48 | 49 | if end_index <= max_buckets: 50 | buckets = [x[0] for x in running_sum[:end_index]] 51 | else: 52 | buckets = [] 53 | # (array, maxv, index) 54 | states = [(running_sum[:end_index],0,end_index-1)] 55 | while len(buckets) < max_buckets: 56 | index = arg_max(states, lambda x: x[1]) 57 | state = states[index] 58 | del states[index] 59 | #split state 60 | array = state[0] 61 | split_index = state[2] 62 | buckets.append(array[split_index][0]) 63 | array1 = 
array[:split_index+1] 64 | array2 = array[split_index+1:] 65 | if len(array1) > 0: 66 | id1, maxv1 = best_point(array1) 67 | states.append((array1,maxv1,id1)) 68 | if len(array2) > 0: 69 | id2, maxv2 = best_point(array2) 70 | states.append((array2,maxv2,id2)) 71 | return buckets 72 | 73 | def main(): 74 | 75 | import random 76 | a = [] 77 | for i in xrange(1000): 78 | l = random.randint(1,50) 79 | a.append([0]*l) 80 | max_length = 40 81 | max_buckets = 4 82 | print calculate_buckets(a,max_length, max_buckets) 83 | 84 | if __name__ == "__main__": 85 | main() 86 | -------------------------------------------------------------------------------- /lstm/data_iterator.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | PAD_ID = 0 4 | START_ID = 1 5 | 6 | class DataIterator: 7 | def __init__(self, model, data_set, n_bucket, batch_size, train_buckets_scale): 8 | self.data_set = data_set 9 | self.n_bucket = n_bucket 10 | self.batch_size = batch_size 11 | self.train_buckets_scale = train_buckets_scale 12 | self.model = model 13 | 14 | def next_random(self): 15 | while True: 16 | random_number_01 = np.random.random_sample() 17 | bucket_id = min([i for i in xrange(len(self.train_buckets_scale)) 18 | if self.train_buckets_scale[i] > random_number_01]) 19 | 20 | users, inputs, outputs, weights, _ = self.model.get_batch(self.data_set, bucket_id) 21 | yield users, inputs, outputs, weights, bucket_id 22 | 23 | def next_sequence(self, stop=False, recommend = False): 24 | bucket_id = 0 25 | while True: 26 | if bucket_id >= self.n_bucket: 27 | if stop: 28 | break 29 | bucket_id = 0 30 | start_id = 0 31 | while True: 32 | get_batch_func = self.model.get_batch 33 | if recommend: 34 | get_batch_func = self.model.get_batch_recommend 35 | users, inputs, outputs, weights, finished = get_batch_func(self.data_set, bucket_id, start_id = start_id) 36 | yield users, inputs, outputs, weights, bucket_id 37 | if finished: 38 | break 39 | start_id += self.batch_size 40 | bucket_id += 1 41 | 42 | 43 | -------------------------------------------------------------------------------- /lstm/generate_jobs.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | 4 | head_hpc1=""" 5 | #!/bin/bash 6 | #PBS -q isi 7 | #PBS -l walltime=300:00:00 8 | #PBS -l nodes=1:ppn=16:gpus=2:shared 9 | 10 | source $NLGHOME/sh/init_tensorflow.sh 11 | cd /home/nlg-05/xingshi/lstm/tensorflow/recsys/lstm/ 12 | 13 | data_part=/home/nlg-05/xingshi/lstm/tensorflow/recsys/data/data_part 14 | data_full=/home/nlg-05/xingshi/lstm/tensorflow/recsys/data/data_full 15 | data_ml=/home/nlg-05/xingshi/lstm/tensorflow/recsys/data/data_ml 16 | train_dir=/home/nlg-05/xingshi/lstm/tensorflow/recsys/train/ 17 | 18 | __cmd__ 19 | """ 20 | 21 | head_hpc2=""" 22 | #!/bin/bash 23 | #PBS -l walltime=23:59:59 24 | #PBS -l nodes=1:ppn=16:gpus=2:shared 25 | #PBS -M kuanl@usc.edu -p 1023 26 | 27 | source /usr/usc/tensorflow/0.9.0/setup.sh 28 | cd /home/rcf-proj/pn3/kuanl/recsys/lstm/ 29 | 30 | data_part=/home/rcf-proj/pn3/kuanl/recsys/data/data_part/ 31 | data_full=/home/rcf-proj/pn3/kuanl/recsys/data/data_full/ 32 | data_ml=/home/rcf-proj/pn3/kuanl/recsys/data/data_ml/ 33 | train_dir=/home/rcf-proj/pn3/kuanl/recsys/train/ 34 | 35 | __cmd__ 36 | """ 37 | 38 | def main(acct=0): 39 | 40 | def data_dir(val): 41 | return "", "--data_dir {}".format(val) 42 | 43 | def train_dir(val): 44 | return "", "--train_dir {}".format(val) 45 | 46 | def batch_size(val): 47 
| return "m{}".format(val), "--batch_size {}".format(val) 48 | 49 | def size(val): 50 | return "h{}".format(val), "--size {}".format(val) 51 | 52 | def dropout(val): 53 | return "d{}".format(val), "--keep_prob {}".format(val) 54 | 55 | def learning_rate(val): 56 | return "l{}".format(val), "--learning_rate {}".format(val) 57 | 58 | def n_epoch(val): 59 | return "", "--n_epoch {}".format(val) 60 | 61 | def loss(val): 62 | return "{}".format(val), "--loss {}".format(val) 63 | 64 | def ta(val): 65 | return "Ta{}".format(val), "--ta {}".format(val) 66 | 67 | def num_layers(val): 68 | return "n{}".format(val), "--num_layers {}".format(val) 69 | 70 | def L(val): 71 | return "L{}".format(val), "--L {}".format(val) 72 | 73 | def N(val): 74 | return "", "--N {}".format(val) 75 | 76 | def dataset(val): 77 | if val == 'xing': 78 | return "", "--dataset xing" 79 | elif val == "ml": 80 | return "Ml", "--dataset ml --after40 False" 81 | 82 | def use_concat(val): 83 | if val: 84 | name = "Cc" 85 | else: 86 | name = "Mn" 87 | return name, "--use_concat {}".format(val) 88 | 89 | def item_vocab_size(val): 90 | if item_vocab_size == 50000: 91 | return "", "" 92 | else: 93 | return "", "--item_vocab_size {}".format(val) 94 | 95 | def fromScratch(val): 96 | if not val: 97 | return "","--fromScratch False" 98 | else: 99 | return "","" 100 | 101 | # whether or not to use separte embedding for input/output items 102 | def use_out_items(val): 103 | if val: 104 | return "oT", "--use_sep_item True" 105 | else: 106 | return "", "--use_sep_item False" 107 | 108 | funcs = [data_dir, batch_size, size, #0 109 | dropout, learning_rate, n_epoch, #3 110 | loss, ta, num_layers, #6 111 | L, N, use_concat, #9 112 | dataset, item_vocab_size, fromScratch, #12 113 | use_out_items] 114 | 115 | template = ["$data_full", 64, 128, 0.5, 0.5, 150, "ce", 0, 1, 30, "001",False,'xing',50000, False, False] 116 | template_ml = ["$data_ml", 64, 128, 0.5, 0.5, 150, "ce", 0, 1, 200, "000",False,'ml',13000, True, False] 117 | params = [] 118 | 119 | # for xing 120 | _h = [256] 121 | _dropout = [0.4,0.6] 122 | _learning_rate = [0.5, 1.0] 123 | for lr in _learning_rate: 124 | for dr in _dropout: 125 | for h in _h: 126 | temp = list(template) 127 | temp[4] = lr 128 | temp[3] = dr 129 | temp[2] = h 130 | params.append(temp) 131 | 132 | # for ml 133 | _h = [128] 134 | _dropout = [0.4,0.6,0.8] 135 | _learning_rate = [0.5, 1.0] 136 | for lr in _learning_rate: 137 | for dr in _dropout: 138 | for h in _h: 139 | temp = list(template_ml) 140 | temp[4] = lr 141 | temp[3] = dr 142 | temp[2] = h 143 | params.append(temp) 144 | 145 | 146 | def get_name_cmd(para): 147 | name = "" 148 | cmd = [] 149 | for func, para in zip(funcs,para): 150 | n, c = func(para) 151 | name += n 152 | cmd.append(c) 153 | 154 | name = name.replace(".",'') 155 | n, c = train_dir("${train_dir}/"+name) 156 | cmd.append(c) 157 | 158 | cmd = " ".join(cmd) 159 | return name, cmd 160 | 161 | head = head_hpc1 if acct == 0 else head_hpc2 162 | # train 163 | for para in params: 164 | name, cmd = get_name_cmd(para) 165 | cmd = "python run.py " + cmd 166 | fn = "../jobs/{}.sh".format(name) 167 | f = open(fn,'w') 168 | content = head.replace("__cmd__",cmd) 169 | f.write(content) 170 | f.close() 171 | 172 | # decode 173 | for para in params: 174 | name, cmd = get_name_cmd(para) 175 | name = name + ".decode" 176 | cmd += " --recommend True" 177 | cmd = "python run.py " + cmd 178 | fn = "../jobs/{}.sh".format(name) 179 | f = open(fn,'w') 180 | content = head.replace("__cmd__",cmd) 181 | content = 
content.replace("23:59:59", "0:59:59") 182 | f.write(content) 183 | f.close() 184 | 185 | 186 | if __name__ == "__main__": 187 | if len(sys.argv) > 1 and int(sys.argv[1]) == 1: # hpc2 188 | main(2) 189 | else: 190 | main() 191 | 192 | 193 | 194 | 195 | 196 | -------------------------------------------------------------------------------- /lstm/run.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import math 6 | import os 7 | import random 8 | import sys 9 | import time 10 | 11 | import numpy as np 12 | from six.moves import xrange # pylint: disable=redefined-builtin 13 | import tensorflow as tf 14 | import logging 15 | 16 | from seqModel import SeqModel 17 | 18 | # import pandas as pd 19 | # import configparser 20 | # import env 21 | 22 | sys.path.insert(0, '../utils') 23 | sys.path.insert(0, '../attributes') 24 | 25 | 26 | import embed_attribute 27 | from input_attribute import read_data as read_attributed_data 28 | 29 | 30 | import data_iterator 31 | from data_iterator import DataIterator 32 | from best_buckets import * 33 | from tensorflow.python.client import timeline 34 | from prepare_train import positive_items, item_frequency, sample_items, to_week 35 | 36 | # datasets, paths, and preprocessing 37 | tf.app.flags.DEFINE_string("dataset", "xing", "dataset name") 38 | tf.app.flags.DEFINE_string("raw_data", "../raw_data", "input data directory") 39 | tf.app.flags.DEFINE_string("data_dir", "./cache0/", "Data directory") 40 | tf.app.flags.DEFINE_string("train_dir", "./train", "Training directory.") 41 | tf.app.flags.DEFINE_boolean("test", True, "fix to be True as we split non-test part by users.") 42 | tf.app.flags.DEFINE_string("combine_att", 'mix', "method to combine attributes: het or mix") 43 | tf.app.flags.DEFINE_boolean("use_item_feature", True, "RT") 44 | tf.app.flags.DEFINE_boolean("use_user_feature", True, "RT") 45 | tf.app.flags.DEFINE_integer("item_vocab_size", 50000, "Item vocabulary size.") 46 | tf.app.flags.DEFINE_integer("vocab_min_thresh", 2, "filter inactive tokens.") 47 | 48 | # tuning hypers 49 | tf.app.flags.DEFINE_string("loss", 'ce', "loss function: ce, warp") 50 | tf.app.flags.DEFINE_float("learning_rate", 0.5, "Learning rate.") 51 | tf.app.flags.DEFINE_float("learning_rate_decay_factor", 0.83, 52 | "Learning rate decays by this much.") 53 | tf.app.flags.DEFINE_float("max_gradient_norm", 5.0, 54 | "Clip gradients to this norm.") 55 | tf.app.flags.DEFINE_float("keep_prob", 0.5, "dropout rate.") 56 | tf.app.flags.DEFINE_float("power", 0.5, "related to sampling rate.") 57 | tf.app.flags.DEFINE_integer("batch_size", 64, 58 | "Batch size to use during training/evaluation.") 59 | tf.app.flags.DEFINE_integer("size", 128, "Size of each model layer.") 60 | tf.app.flags.DEFINE_integer("num_layers", 1, "Number of layers in the model.") 61 | tf.app.flags.DEFINE_integer("n_epoch", 500, 62 | "Maximum number of epochs in training.") 63 | tf.app.flags.DEFINE_integer("L", 30,"max length") 64 | tf.app.flags.DEFINE_integer("n_bucket", 10, 65 | "num of buckets to run.") 66 | tf.app.flags.DEFINE_integer("patience", 10,"exit if the model can't improve for $patence evals") 67 | #tf.app.flags.DEFINE_integer("steps_per_checkpoint", 200,"How many training steps to do per checkpoint.") 68 | 69 | # recommendation 70 | tf.app.flags.DEFINE_boolean("recommend", False, "Set to True for recommending.") 71 | 
tf.app.flags.DEFINE_boolean("recommend_new", False, "TODO.") 72 | tf.app.flags.DEFINE_integer("topk", 100, "recommend items with the topk values") 73 | 74 | # for ensemble 75 | tf.app.flags.DEFINE_boolean("ensemble", False, "to ensemble") 76 | tf.app.flags.DEFINE_string("ensemble_suffix", "", "multiple models suffix: 1,2,3,4,5") 77 | tf.app.flags.DEFINE_integer("seed", 0, "dev split random seed.") 78 | 79 | # attribute model variants 80 | tf.app.flags.DEFINE_integer("output_feat", 1, "0: no use, 1: use, mean-mulhot, 2: use, max-pool") 81 | tf.app.flags.DEFINE_boolean("use_sep_item", False, "use separate embedding parameters for output items.") 82 | tf.app.flags.DEFINE_boolean("no_input_item_feature", False, "not using attributes at input layer") 83 | tf.app.flags.DEFINE_boolean("use_concat", False, "use concat or mean") 84 | tf.app.flags.DEFINE_boolean("no_user_id", True, "use user id or not") 85 | 86 | # devices 87 | tf.app.flags.DEFINE_string("N", "000", "GPU layer distribution: [input_embedding, lstm, output_embedding]") 88 | 89 | # 90 | tf.app.flags.DEFINE_boolean("withAdagrad", True, 91 | "withAdagrad.") 92 | tf.app.flags.DEFINE_boolean("fromScratch", True, 93 | "withAdagrad.") 94 | tf.app.flags.DEFINE_boolean("saveCheckpoint", False, 95 | "save Model at each checkpoint.") 96 | tf.app.flags.DEFINE_boolean("profile", False, "False = no profile, True = profile") 97 | 98 | # others... 99 | tf.app.flags.DEFINE_integer("ta", 1, "part = 1, full = 0") 100 | tf.app.flags.DEFINE_float("user_sample", 1.0, "user sample rate.") 101 | 102 | tf.app.flags.DEFINE_boolean("after40", False, 103 | "whether use items after week 40 only.") 104 | tf.app.flags.DEFINE_string("split", "last", "last: last maxlen only; overlap: overlap 1 / 3 of maxlen") 105 | 106 | tf.app.flags.DEFINE_integer("n_sampled", 1024, "sampled softmax/warp loss.") 107 | tf.app.flags.DEFINE_integer("n_resample", 30, "iterations before resample.") 108 | 109 | 110 | 111 | # for beam_search 112 | tf.app.flags.DEFINE_boolean("beam_search", False, "to beam_search") 113 | tf.app.flags.DEFINE_integer("beam_size", 10,"the beam size") 114 | 115 | tf.app.flags.DEFINE_integer("max_train_data_size", 0, 116 | "Limit on the size of training data (0: no limit).") 117 | tf.app.flags.DEFINE_boolean("old_att", False, "tmp: use attribute_0.8.csv") 118 | 119 | FLAGS = tf.app.flags.FLAGS 120 | 121 | _buckets = [] 122 | 123 | def mylog(msg): 124 | print(msg) 125 | sys.stdout.flush() 126 | logging.info(msg) 127 | 128 | def split_buckets(array,buckets): 129 | """ 130 | array : [(user,[items])] 131 | return: 132 | d : [[(user, [items])]] 133 | """ 134 | d = [[] for i in xrange(len(buckets))] 135 | for u, items in array: 136 | index = get_buckets_id(len(items), buckets) 137 | if index >= 0: 138 | d[index].append((u,items)) 139 | return d 140 | 141 | def get_buckets_id(l, buckets): 142 | id = -1 143 | for i in xrange(len(buckets)): 144 | if l <= buckets[i]: 145 | id = i 146 | break 147 | return id 148 | 149 | def form_sequence_prediction(data, uids, maxlen, START_ID): 150 | """ 151 | Args: 152 | data = [(user_id,[item_id])] 153 | uids = [user_id] 154 | Return: 155 | d : [(user_id,[item_id])] 156 | """ 157 | d = [] 158 | m = {} 159 | for uid, items in data: 160 | m[uid] = items 161 | for uid in uids: 162 | if uid in m: 163 | items = [START_ID] + m[uid][-(maxlen-1):] 164 | else: 165 | items = [START_ID] 166 | d.append((uid, items)) 167 | 168 | return d 169 | 170 | def form_sequence(data, maxlen = 100): 171 | """ 172 | Args: 173 | data = [(u,i,week)] 174 | Return: 
175 | d : [(user_id, [item_id])] 176 | """ 177 | 178 | users = [] 179 | items = [] 180 | d = {} # d[u] = [(i,week)] 181 | for u,i,week in data: 182 | if not u in d: 183 | d[u] = [] 184 | d[u].append((i,week)) 185 | 186 | dd = [] 187 | n_all_item = 0 188 | n_rest_item = 0 189 | for u in d: 190 | tmp = sorted(d[u],key = lambda x: x[1]) 191 | n_all_item += len(tmp) 192 | while True: 193 | new_tmp = [x[0] for x in tmp][:maxlen] 194 | n_rest_item += len(new_tmp) 195 | # make sure every sequence has at least one item 196 | if len(new_tmp) > 0: 197 | dd.append((u,new_tmp)) 198 | if len(tmp) <= maxlen: 199 | break 200 | else: 201 | if len(tmp) - maxlen <=7: 202 | tmp = tmp[maxlen-10:] 203 | else: 204 | tmp = tmp[maxlen:] 205 | 206 | # count below not valid any more 207 | # mylog("All item: {} Rest item: {} Remove item: {}".format(n_all_item, n_rest_item, n_all_item - n_rest_item)) 208 | 209 | return dd 210 | 211 | def prepare_warp(embAttr, data_tr, data_va): 212 | pos_item_list, pos_item_list_val = {}, {} 213 | for t in data_tr: 214 | u, i_list = t 215 | pos_item_list[u] = list(set(i_list)) 216 | for t in data_va: 217 | u, i_list = t 218 | pos_item_list_val[u] = list(set(i_list)) 219 | embAttr.prepare_warp(pos_item_list, pos_item_list_val) 220 | 221 | def get_device_address(s): 222 | add = [] 223 | if s == "": 224 | for i in xrange(3): 225 | add.append("/cpu:0") 226 | else: 227 | add = ["/gpu:{}".format(int(x)) for x in s] 228 | print(add) 229 | return add 230 | 231 | def split_train_dev(seq_all, ratio = 0.05): 232 | random.seed(FLAGS.seed) 233 | seq_tr, seq_va = [],[] 234 | for item in seq_all: 235 | r = random.random() 236 | if r < ratio: 237 | seq_va.append(item) 238 | else: 239 | seq_tr.append(item) 240 | return seq_tr, seq_va 241 | 242 | 243 | def get_data(raw_data, data_dir=FLAGS.data_dir, combine_att=FLAGS.combine_att, 244 | logits_size_tr=FLAGS.item_vocab_size, thresh=FLAGS.vocab_min_thresh, 245 | use_user_feature=FLAGS.use_user_feature, test=FLAGS.test, mylog=mylog, 246 | use_item_feature=FLAGS.use_item_feature, no_user_id=FLAGS.no_user_id, 247 | recommend = False): 248 | 249 | (data_tr, data_va, u_attr, i_attr, item_ind2logit_ind, logit_ind2item_ind, 250 | user_index, item_index) = read_attributed_data( 251 | raw_data_dir=raw_data, 252 | data_dir=data_dir, 253 | combine_att=combine_att, 254 | logits_size_tr=logits_size_tr, 255 | thresh=thresh, 256 | use_user_feature=use_user_feature, 257 | use_item_feature=use_item_feature, 258 | no_user_id=no_user_id, 259 | test=test, 260 | mylog=mylog) 261 | 262 | # remove unk 263 | data_tr = [p for p in data_tr if (p[1] in item_ind2logit_ind)] 264 | 265 | # remove items before week 40 266 | if FLAGS.after40: 267 | data_tr = [p for p in data_tr if (to_week(p[2]) >= 40)] 268 | 269 | # item frequency (for sampling) 270 | item_population, p_item = item_frequency(data_tr, FLAGS.power) 271 | 272 | # UNK and START 273 | # print(len(item_ind2logit_ind)) 274 | # print(len(logit_ind2item_ind)) 275 | # print(len(item_index)) 276 | START_ID = len(item_index) 277 | # START_ID = i_attr.get_item_last_index() 278 | item_ind2logit_ind[START_ID] = 0 279 | seq_all = form_sequence(data_tr, maxlen = FLAGS.L) 280 | seq_tr0, seq_va0 = split_train_dev(seq_all,ratio=0.05) 281 | 282 | 283 | # calculate buckets 284 | global _buckets 285 | _buckets = calculate_buckets(seq_tr0+seq_va0, FLAGS.L, FLAGS.n_bucket) 286 | _buckets = sorted(_buckets) 287 | 288 | # split_buckets 289 | seq_tr = split_buckets(seq_tr0,_buckets) 290 | seq_va = split_buckets(seq_va0,_buckets) 291 | 292 | # 
get test data 293 | if recommend: 294 | from evaluate import Evaluation as Evaluate 295 | evaluation = Evaluate(raw_data, test=test) 296 | uinds = evaluation.get_uinds() 297 | seq_test = form_sequence_prediction(seq_all, uinds, FLAGS.L, START_ID) 298 | _buckets = calculate_buckets(seq_test, FLAGS.L, FLAGS.n_bucket) 299 | _buckets = sorted(_buckets) 300 | seq_test = split_buckets(seq_test,_buckets) 301 | else: 302 | seq_test = [] 303 | evaluation = None 304 | uinds = [] 305 | 306 | # create embedAttr 307 | 308 | devices = get_device_address(FLAGS.N) 309 | with tf.device(devices[0]): 310 | u_attr.set_model_size(FLAGS.size) 311 | i_attr.set_model_size(FLAGS.size) 312 | 313 | embAttr = embed_attribute.EmbeddingAttribute(u_attr, i_attr, FLAGS.batch_size, FLAGS.n_sampled, _buckets[-1], FLAGS.use_sep_item, item_ind2logit_ind, logit_ind2item_ind, devices=devices) 314 | 315 | if FLAGS.loss in ["warp", 'mw']: 316 | prepare_warp(embAttr, seq_tr0, seq_va0) 317 | 318 | return seq_tr, seq_va, seq_test, embAttr, START_ID, item_population, p_item, evaluation, uinds, user_index, item_index, logit_ind2item_ind 319 | 320 | 321 | def create_model(session,embAttr,START_ID, run_options, run_metadata): 322 | devices = get_device_address(FLAGS.N) 323 | dtype = tf.float32 324 | model = SeqModel(_buckets, 325 | FLAGS.size, 326 | FLAGS.num_layers, 327 | FLAGS.max_gradient_norm, 328 | FLAGS.batch_size, 329 | FLAGS.learning_rate, 330 | FLAGS.learning_rate_decay_factor, 331 | embAttr, 332 | withAdagrad = FLAGS.withAdagrad, 333 | num_samples = FLAGS.n_sampled, 334 | dropoutRate = FLAGS.keep_prob, 335 | START_ID = START_ID, 336 | loss = FLAGS.loss, 337 | dtype = dtype, 338 | devices = devices, 339 | use_concat = FLAGS.use_concat, 340 | no_user_id=False, # to remove this argument 341 | output_feat = FLAGS.output_feat, 342 | no_input_item_feature = FLAGS.no_input_item_feature, 343 | topk_n = FLAGS.topk, 344 | run_options = run_options, 345 | run_metadata = run_metadata 346 | ) 347 | 348 | ckpt = tf.train.get_checkpoint_state(FLAGS.train_dir) 349 | # if FLAGS.recommend or (not FLAGS.fromScratch) and ckpt and tf.gfile.Exists(ckpt.model_checkpoint_path): 350 | 351 | if FLAGS.recommend or FLAGS.beam_search or FLAGS.ensemble or (not FLAGS.fromScratch) and ckpt: 352 | mylog("Reading model parameters from %s" % ckpt.model_checkpoint_path) 353 | model.saver.restore(session, ckpt.model_checkpoint_path) 354 | else: 355 | mylog("Created model with fresh parameters.") 356 | session.run(tf.global_variables_initializer()) 357 | return model 358 | 359 | 360 | def show_all_variables(): 361 | all_vars = tf.global_variables() 362 | for var in all_vars: 363 | mylog(var.name) 364 | 365 | 366 | def train(raw_data=FLAGS.raw_data): 367 | 368 | # Read Data 369 | mylog("Reading Data...") 370 | train_set, dev_set, test_set, embAttr, START_ID, item_population, p_item, _, _, _, _, _ = get_data(raw_data,data_dir=FLAGS.data_dir) 371 | n_targets_train = np.sum([np.sum([len(items) for uid, items in x]) for x in train_set]) 372 | train_bucket_sizes = [len(train_set[b]) for b in xrange(len(_buckets))] 373 | train_total_size = float(sum(train_bucket_sizes)) 374 | train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size for i in xrange(len(train_bucket_sizes))] 375 | dev_bucket_sizes = [len(dev_set[b]) for b in xrange(len(_buckets))] 376 | dev_total_size = int(sum(dev_bucket_sizes)) 377 | 378 | 379 | # steps 380 | batch_size = FLAGS.batch_size 381 | n_epoch = FLAGS.n_epoch 382 | steps_per_epoch = int(train_total_size / batch_size) 383 | 
steps_per_dev = int(dev_total_size / batch_size) 384 | 385 | steps_per_checkpoint = int(steps_per_epoch / 2) 386 | total_steps = steps_per_epoch * n_epoch 387 | 388 | # reports 389 | mylog(_buckets) 390 | mylog("Train:") 391 | mylog("total: {}".format(train_total_size)) 392 | mylog("bucket sizes: {}".format(train_bucket_sizes)) 393 | mylog("Dev:") 394 | mylog("total: {}".format(dev_total_size)) 395 | mylog("bucket sizes: {}".format(dev_bucket_sizes)) 396 | mylog("") 397 | mylog("Steps_per_epoch: {}".format(steps_per_epoch)) 398 | mylog("Total_steps:{}".format(total_steps)) 399 | mylog("Steps_per_checkpoint: {}".format(steps_per_checkpoint)) 400 | 401 | # with tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement = False, device_count={'CPU':8, 'GPU':1})) as sess: 402 | with tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement = False)) as sess: 403 | 404 | # runtime profile 405 | if FLAGS.profile: 406 | run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) 407 | run_metadata = tf.RunMetadata() 408 | else: 409 | run_options = None 410 | run_metadata = None 411 | 412 | mylog("Creating Model.. (this can take a few minutes)") 413 | model = create_model(sess, embAttr, START_ID, run_options, run_metadata) 414 | show_all_variables() 415 | 416 | # Data Iterators 417 | dite = DataIterator(model, train_set, len(train_buckets_scale), batch_size, train_buckets_scale) 418 | 419 | iteType = 0 420 | if iteType == 0: 421 | mylog("withRandom") 422 | ite = dite.next_random() 423 | elif iteType == 1: 424 | mylog("withSequence") 425 | ite = dite.next_sequence() 426 | 427 | # statistics during training 428 | step_time, loss = 0.0, 0.0 429 | current_step = 0 430 | previous_losses = [] 431 | his = [] 432 | low_ppx = float("inf") 433 | low_ppx_step = 0 434 | steps_per_report = 30 435 | n_targets_report = 0 436 | report_time = 0 437 | n_valid_sents = 0 438 | patience = FLAGS.patience 439 | item_sampled, item_sampled_id2idx = None, None 440 | 441 | while current_step < total_steps: 442 | 443 | # start 444 | start_time = time.time() 445 | 446 | # re-sample every once a while 447 | if FLAGS.loss in ['mw', 'mce'] and current_step % FLAGS.n_resample == 0 : 448 | item_sampled, item_sampled_id2idx = sample_items(item_population, FLAGS.n_sampled, p_item) 449 | else: 450 | item_sampled = None 451 | 452 | # data and train 453 | users, inputs, outputs, weights, bucket_id = ite.next() 454 | 455 | L = model.step(sess, users, inputs, outputs, weights, bucket_id, item_sampled=item_sampled, item_sampled_id2idx=item_sampled_id2idx) 456 | 457 | # loss and time 458 | step_time += (time.time() - start_time) / steps_per_checkpoint 459 | 460 | loss += L 461 | current_step += 1 462 | n_valid_sents += np.sum(np.sign(weights[0])) 463 | 464 | # for report 465 | report_time += (time.time() - start_time) 466 | n_targets_report += np.sum(weights) 467 | 468 | if current_step % steps_per_report == 0: 469 | mylog("--------------------"+"Report"+str(current_step)+"-------------------") 470 | mylog("StepTime: {} Speed: {} targets / sec in total {} targets".format(report_time/steps_per_report, n_targets_report*1.0 / report_time, n_targets_train)) 471 | 472 | report_time = 0 473 | n_targets_report = 0 474 | 475 | # Create the Timeline object, and write it to a json 476 | if FLAGS.profile: 477 | tl = timeline.Timeline(run_metadata.step_stats) 478 | ctf = tl.generate_chrome_trace_format() 479 | with open('timeline.json', 'w') as f: 480 | f.write(ctf) 481 | exit() 482 | 483 | 484 | 485 
| if current_step % steps_per_checkpoint == 0: 486 | mylog("--------------------"+"TRAIN"+str(current_step)+"-------------------") 487 | # Print statistics for the previous epoch. 488 | 489 | loss = loss / n_valid_sents 490 | perplexity = math.exp(float(loss)) if loss < 300 else float("inf") 491 | mylog("global step %d learning rate %.4f step-time %.2f perplexity " "%.2f" % (model.global_step.eval(), model.learning_rate.eval(), step_time, perplexity)) 492 | 493 | train_ppx = perplexity 494 | 495 | # Save checkpoint and zero timer and loss. 496 | step_time, loss, n_valid_sents = 0.0, 0.0, 0 497 | 498 | # dev data 499 | mylog("--------------------" + "DEV" + str(current_step) + "-------------------") 500 | eval_loss, eval_ppx = evaluate(sess, model, dev_set, item_sampled_id2idx=item_sampled_id2idx) 501 | mylog("dev: ppx: {}".format(eval_ppx)) 502 | 503 | his.append([current_step, train_ppx, eval_ppx]) 504 | 505 | if eval_ppx < low_ppx: 506 | patience = FLAGS.patience 507 | low_ppx = eval_ppx 508 | low_ppx_step = current_step 509 | checkpoint_path = os.path.join(FLAGS.train_dir, "best.ckpt") 510 | mylog("Saving best model....") 511 | s = time.time() 512 | model.saver.save(sess, checkpoint_path, global_step=0, write_meta_graph = False) 513 | mylog("Best model saved using {} sec".format(time.time()-s)) 514 | else: 515 | patience -= 1 516 | 517 | if patience <= 0: 518 | mylog("Training finished. Running out of patience.") 519 | break 520 | 521 | sys.stdout.flush() 522 | 523 | def evaluate(sess, model, data_set, item_sampled_id2idx=None): 524 | # Run evals on development set and print their perplexity/loss. 525 | dropoutRateRaw = FLAGS.keep_prob 526 | sess.run(model.dropout10_op) 527 | 528 | start_id = 0 529 | loss = 0.0 530 | n_steps = 0 531 | n_valids = 0 532 | batch_size = FLAGS.batch_size 533 | 534 | dite = DataIterator(model, data_set, len(_buckets), batch_size, None) 535 | ite = dite.next_sequence(stop = True) 536 | 537 | for users, inputs, outputs, weights, bucket_id in ite: 538 | L = model.step(sess, users, inputs, outputs, weights, bucket_id, forward_only = True) 539 | loss += L 540 | n_steps += 1 541 | n_valids += np.sum(np.sign(weights[0])) 542 | 543 | loss = loss/(n_valids) 544 | ppx = math.exp(loss) if loss < 300 else float("inf") 545 | 546 | sess.run(model.dropoutAssign_op) 547 | 548 | return loss, ppx 549 | 550 | def recommend(raw_data=FLAGS.raw_data): 551 | 552 | # Read Data 553 | mylog("recommend") 554 | mylog("Reading Data...") 555 | _, _, test_set, embAttr, START_ID, _, _, evaluation, uinds, user_index, item_index, logit_ind2item_ind = get_data( 556 | raw_data, data_dir=FLAGS.data_dir, recommend =True) 557 | test_bucket_sizes = [len(test_set[b]) for b in xrange(len(_buckets))] 558 | test_total_size = int(sum(test_bucket_sizes)) 559 | 560 | # reports 561 | mylog(_buckets) 562 | mylog("Test:") 563 | mylog("total: {}".format(test_total_size)) 564 | mylog("buckets: {}".format(test_bucket_sizes)) 565 | 566 | with tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement = False)) as sess: 567 | 568 | # runtime profile 569 | if FLAGS.profile: 570 | run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) 571 | run_metadata = tf.RunMetadata() 572 | else: 573 | run_options = None 574 | run_metadata = None 575 | 576 | mylog("Creating Model") 577 | model = create_model(sess, embAttr, START_ID, run_options, run_metadata) 578 | show_all_variables() 579 | 580 | sess.run(model.dropoutRate.assign(1.0)) 581 | 582 | start_id = 0 583 | n_steps = 0 584 | batch_size 
= FLAGS.batch_size 585 | 586 | dite = DataIterator(model, test_set, len(_buckets), batch_size, None) 587 | ite = dite.next_sequence(stop = True, recommend = True) 588 | 589 | n_total_user = len(uinds) 590 | n_recommended = 0 591 | uind2rank = {} 592 | for r, uind in enumerate(uinds): 593 | uind2rank[uind] = r 594 | rec = np.zeros((n_total_user,FLAGS.topk), dtype = int) 595 | rec_value = np.zeros((n_total_user,FLAGS.topk), dtype = float) 596 | start = time.time() 597 | 598 | for users, inputs, positions, valids, bucket_id in ite: 599 | results = model.step_recommend(sess, users, inputs, positions, bucket_id) 600 | for i, valid in enumerate(valids): 601 | if valid == 1: 602 | n_recommended += 1 603 | if n_recommended % 1000 == 0: 604 | mylog("Evaluating n {} bucket_id {}".format(n_recommended, bucket_id)) 605 | uind, topk_values, topk_indexes = results[i] 606 | rank= uind2rank[uind] 607 | rec[rank,:] = topk_indexes 608 | rec_value[rank,:] = topk_values 609 | n_steps += 1 610 | end = time.time() 611 | mylog("Time used {} sec for {} steps {} users ".format(end-start, n_steps, n_recommended)) 612 | 613 | ind2id = {} 614 | for iid in item_index: 615 | iind = item_index[iid] 616 | assert(iind not in ind2id) 617 | ind2id[iind] = iid 618 | 619 | uind2id = {} 620 | for uid in user_index: 621 | uind = user_index[uid] 622 | assert(uind not in uind2id) 623 | uind2id[uind] = uid 624 | 625 | 626 | R = {} 627 | for i in xrange(n_total_user): 628 | uid = uind2id[uinds[i]] 629 | R[uid] = [ind2id[logit_ind2item_ind[v]] for v in list(rec[i, :])] 630 | 631 | evaluation.eval_on(R) 632 | 633 | scores_self, scores_ex = evaluation.get_scores() 634 | mylog("====evaluation scores (NDCG, RECALL, PRECISION, MAP) @ 2,5,10,20,30====") 635 | mylog("METRIC_FORMAT (self): {}".format(scores_self)) 636 | mylog("METRIC_FORMAT (ex ): {}".format(scores_ex)) 637 | 638 | # save the two matrix 639 | np.save(os.path.join(FLAGS.train_dir,"top{}_index.npy".format(FLAGS.topk)),rec) 640 | np.save(os.path.join(FLAGS.train_dir,"top{}_value.npy".format(FLAGS.topk)),rec_value) 641 | 642 | 643 | def ensemble(raw_data=FLAGS.raw_data): 644 | # Read Data 645 | mylog("Ensemble {} {}".format(FLAGS.train_dir, FLAGS.ensemble_suffix)) 646 | mylog("Reading Data...") 647 | # task = Task(FLAGS.dataset) 648 | _, _, test_set, embAttr, START_ID, _, _, evaluation, uinds, user_index, item_index, logit_ind2item_ind = get_data( 649 | raw_data, data_dir=FLAGS.data_dir, recommend =True) 650 | test_bucket_sizes = [len(test_set[b]) for b in xrange(len(_buckets))] 651 | test_total_size = int(sum(test_bucket_sizes)) 652 | 653 | # reports 654 | mylog(_buckets) 655 | mylog("Test:") 656 | mylog("total: {}".format(test_total_size)) 657 | mylog("buckets: {}".format(test_bucket_sizes)) 658 | 659 | # load top_index, and top_value 660 | suffixes = FLAGS.ensemble_suffix.split(',') 661 | top_indexes = [] 662 | top_values = [] 663 | for suffix in suffixes: 664 | # dir_path = FLAGS.train_dir+suffix 665 | dir_path = FLAGS.train_dir.replace('seed', 'seed' + suffix) 666 | mylog("Loading results from {}".format(dir_path)) 667 | index_path = os.path.join(dir_path,"top{}_index.npy".format(FLAGS.topk)) 668 | value_path = os.path.join(dir_path,"top{}_value.npy".format(FLAGS.topk)) 669 | top_index = np.load(index_path) 670 | top_value = np.load(value_path) 671 | top_indexes.append(top_index) 672 | top_values.append(top_value) 673 | 674 | # ensemble 675 | rec = np.zeros(top_indexes[0].shape) 676 | for row in xrange(rec.shape[0]): 677 | v = {} 678 | for i in xrange(len(suffixes)): 679 | 
for col in xrange(rec.shape[1]): 680 | index = top_indexes[i][row,col] 681 | value = top_values[i][row,col] 682 | if index not in v: 683 | v[index] = 0 684 | v[index] += value 685 | items = [(index,v[index]/len(suffixes)) for index in v ] 686 | items = sorted(items,key = lambda x: -x[1]) 687 | rec[row:] = [x[0] for x in items][:FLAGS.topk] 688 | if row % 1000 == 0: 689 | mylog("Ensembling n {}".format(row)) 690 | 691 | ind2id = {} 692 | for iid in item_index: 693 | uind = item_index[iid] 694 | assert(uind not in ind2id) 695 | ind2id[uind] = iid 696 | 697 | uind2id = {} 698 | for uid in user_index: 699 | uind = user_index[uid] 700 | assert(uind not in uind2id) 701 | uind2id[uind] = uid 702 | 703 | R = {} 704 | for i in xrange(n_total_user): 705 | uind = uinds[i] 706 | uid = uind2id[uind] 707 | R[uid] = [ind2id[logit_ind2item_ind[v]] for v in list(rec[i, :])] 708 | 709 | evaluation.eval_on(R) 710 | 711 | scores_self, scores_ex = evaluation.get_scores() 712 | mylog("====evaluation scores (NDCG, RECALL, PRECISION, MAP) @ 2,5,10,20,30====") 713 | mylog("METRIC_FORMAT (self): {}".format(scores_self)) 714 | mylog("METRIC_FORMAT (ex ): {}".format(scores_ex)) 715 | 716 | def beam_search(): 717 | mylog("Reading Data...") 718 | task = Task(FLAGS.dataset) 719 | _, _, test_set, embAttr, START_ID, _, _, evaluation, uids = read_data(task, test = True ) 720 | test_bucket_sizes = [len(test_set[b]) for b in xrange(len(_buckets))] 721 | test_total_size = int(sum(test_bucket_sizes)) 722 | 723 | # reports 724 | mylog(_buckets) 725 | mylog("Test:") 726 | mylog("total: {}".format(test_total_size)) 727 | mylog("buckets: {}".format(test_bucket_sizes)) 728 | 729 | with tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement = False)) as sess: 730 | 731 | # runtime profile 732 | if FLAGS.profile: 733 | run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) 734 | run_metadata = tf.RunMetadata() 735 | else: 736 | run_options = None 737 | run_metadata = None 738 | 739 | mylog("Creating Model") 740 | model = create_model(sess, embAttr, START_ID, run_options, run_metadata) 741 | show_all_variables() 742 | model.init_beam_decoder() 743 | 744 | sess.run(model.dropoutRate.assign(1.0)) 745 | 746 | start_id = 0 747 | n_steps = 0 748 | batch_size = FLAGS.batch_size 749 | 750 | dite = DataIterator(model, test_set, len(_buckets), batch_size, None) 751 | ite = dite.next_sequence(stop = True, recommend = True) 752 | 753 | n_total_user = len(uids) 754 | n_recommended = 0 755 | uid2rank = {} 756 | for r, uid in enumerate(uids): 757 | uid2rank[uid] = r 758 | rec = np.zeros((n_total_user,FLAGS.topk), dtype = int) 759 | rec_value = np.zeros((n_total_user,FLAGS.topk), dtype = float) 760 | start = time.time() 761 | 762 | for users, inputs, positions, valids, bucket_id in ite: 763 | print(inputs) 764 | print(positions) 765 | results = model.beam_step(sess, index=0, user_input = users, item_inputs = inputs, sequence_length = positions, bucket_id = bucket_id) 766 | break 767 | 768 | 769 | 770 | 771 | def main(_): 772 | print("V 2017-03-22") 773 | if FLAGS.test: 774 | if FLAGS.data_dir[-1] == '/': 775 | FLAGS.data_dir = FLAGS.data_dir[:-1] + '_test' 776 | else: 777 | FLAGS.data_dir = FLAGS.data_dir + '_test' 778 | 779 | if not os.path.exists(FLAGS.train_dir): 780 | os.mkdir(FLAGS.train_dir) 781 | 782 | if FLAGS.beam_search: 783 | FLAGS.batch_size = 1 784 | FLAGS.n_bucket = 1 785 | beam_search() 786 | return 787 | 788 | if FLAGS.ensemble: 789 | suffixes = FLAGS.ensemble_suffix.split(',') 790 | log_path = 
os.path.join(FLAGS.train_dir.replace('seed', 'seed'+suffixes[0]),"log.ensemble.txt.{}".format(FLAGS.topk)) 791 | logging.basicConfig(filename=log_path,level=logging.DEBUG, filemode ="w") 792 | ensemble() 793 | return 794 | 795 | if FLAGS.recommend: 796 | log_path = os.path.join(FLAGS.train_dir,"log.recommend.txt.{}".format(FLAGS.topk)) 797 | logging.basicConfig(filename=log_path,level=logging.DEBUG, filemode ="w") 798 | recommend() 799 | return 800 | else: 801 | # train 802 | log_path = os.path.join(FLAGS.train_dir,"log.txt") 803 | if FLAGS.fromScratch: 804 | filemode = "w" 805 | else: 806 | filemode = "a" 807 | logging.basicConfig(filename=log_path,level=logging.DEBUG, filemode = filemode) 808 | train() 809 | 810 | if __name__ == "__main__": 811 | tf.app.run() 812 | -------------------------------------------------------------------------------- /lstm/seqModel.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import random 6 | 7 | import numpy as np 8 | from six.moves import xrange # pylint: disable=redefined-builtin 9 | import tensorflow as tf 10 | from tensorflow.python.ops import variable_scope 11 | 12 | from tensorflow.python.framework import ops 13 | from tensorflow.python.ops import array_ops 14 | from tensorflow.python.ops import control_flow_ops 15 | from tensorflow.python.ops import embedding_ops 16 | from tensorflow.python.ops import math_ops 17 | from tensorflow.python.ops import nn_ops 18 | from tensorflow.python.ops import rnn 19 | from tensorflow.python.ops import variable_scope 20 | 21 | import data_iterator 22 | # import env 23 | 24 | class SeqModel(object): 25 | 26 | def __init__(self, 27 | buckets, 28 | size, 29 | num_layers, 30 | max_gradient_norm, 31 | batch_size, 32 | learning_rate, 33 | learning_rate_decay_factor, 34 | embeddingAttribute, 35 | withAdagrad = True, 36 | num_samples=512, 37 | forward_only=False, 38 | dropoutRate = 1.0, 39 | START_ID = 0, 40 | loss = "ce", 41 | devices = "", 42 | run_options = None, 43 | run_metadata = None, 44 | use_concat = True, 45 | output_feat = 1, 46 | no_input_item_feature = False, 47 | no_user_id = True, 48 | topk_n = 30, 49 | dtype=tf.float32): 50 | """Create the model. 51 | 52 | Args: 53 | buckets: a list of pairs (I, O), where I specifies maximum input length 54 | that will be processed in that bucket, and O specifies maximum output 55 | length. Training instances that have inputs longer than I or outputs 56 | longer than O will be pushed to the next bucket and padded accordingly. 57 | We assume that the list is sorted, e.g., [(2, 4), (8, 16)]. 58 | size: number of units in each layer of the model. 59 | num_layers: number of layers in the model. 60 | max_gradient_norm: gradients will be clipped to maximally this norm. 61 | batch_size: the size of the batches used during training; 62 | the model construction is independent of batch_size, so it can be 63 | changed after initialization if this is convenient, e.g., for decoding. 64 | 65 | learning_rate: learning rate to start with. 66 | learning_rate_decay_factor: decay learning rate by this much when needed. 67 | 68 | num_samples: number of samples for sampled softmax. 69 | forward_only: if set, we do not construct the backward pass in the model. 70 | dtype: the data type to use to store internal variables. 
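      embeddingAttribute: EmbeddingAttribute object providing the user/item
        attribute embeddings used at the input and output layers.
      loss: training objective; the callers in this repo pass "ce", "warp",
        or "mw".
      devices: list of three device strings, assigned to the input embedding,
        the LSTM, and the output embedding respectively (see the N flag in
        run.py).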
71 | """ 72 | self.embeddingAttribute = embeddingAttribute 73 | self.buckets = buckets 74 | self.START_ID = START_ID 75 | self.PAD_ID = START_ID 76 | self.USER_PAD_ID = 0 77 | self.batch_size = batch_size 78 | self.loss = loss 79 | self.devices = devices 80 | self.run_options = run_options 81 | self.run_metadata = run_metadata 82 | self.output_feat = output_feat 83 | self.no_input_item_feature = no_input_item_feature 84 | self.topk_n = topk_n 85 | self.dtype = dtype 86 | 87 | with tf.device(devices[0]): 88 | self.dropoutRate = tf.Variable( 89 | float(dropoutRate), trainable=False, dtype=dtype) 90 | self.dropoutAssign_op = self.dropoutRate.assign(dropoutRate) 91 | self.dropout10_op = self.dropoutRate.assign(1.0) 92 | self.learning_rate = tf.Variable( 93 | float(learning_rate), trainable=False, dtype=dtype) 94 | self.learning_rate_decay_op = self.learning_rate.assign( 95 | self.learning_rate * learning_rate_decay_factor) 96 | self.global_step = tf.Variable(0, trainable=False) 97 | 98 | with tf.device(devices[1]): 99 | single_cell = tf.contrib.rnn.core_rnn_cell.LSTMCell(size, state_is_tuple=True) 100 | single_cell = tf.contrib.rnn.core_rnn_cell.DropoutWrapper(single_cell,input_keep_prob = self.dropoutRate) 101 | if num_layers >= 1: 102 | single_cell = tf.contrib.rnn.core_rnn_cell.MultiRNNCell([single_cell] * num_layers, state_is_tuple=True) 103 | single_cell = tf.contrib.rnn.core_rnn_cell.DropoutWrapper(single_cell, output_keep_prob = self.dropoutRate) 104 | 105 | self.single_cell = single_cell 106 | 107 | 108 | # Feeds for inputs. 109 | with tf.device(devices[2]): 110 | self.targets = [] 111 | self.target_ids = [] 112 | self.target_weights = [] 113 | 114 | # target: 1 2 3 4 115 | # inputs: go 1 2 3 116 | # weights:1 1 1 1 117 | 118 | for i in xrange(buckets[-1]): 119 | self.targets.append(tf.placeholder(tf.int32, 120 | shape=[self.batch_size], name = "target{}".format(i))) 121 | self.target_ids.append(tf.placeholder(tf.int32, 122 | shape=[self.batch_size], name = "target_id{}".format(i))) 123 | self.target_weights.append(tf.placeholder(dtype, 124 | shape = [self.batch_size], name="target_weight{}".format(i))) 125 | 126 | with tf.device(devices[0]): 127 | 128 | self.inputs = [] 129 | 130 | if use_concat: 131 | user_embed, _ = self.embeddingAttribute.get_batch_user(1.0,concat = True, no_id = no_user_id) 132 | user_embed_size = self.embeddingAttribute.get_user_model_size( 133 | no_id = no_user_id, concat = True) 134 | item_embed_size = self.embeddingAttribute.get_item_model_size( 135 | concat=True) 136 | w_input_user = tf.get_variable("w_input_user",[user_embed_size, size], dtype = dtype) 137 | w_input_item = tf.get_variable("w_input_item",[item_embed_size, size], dtype = dtype) 138 | user_embed_transform = tf.matmul(user_embed, w_input_user) 139 | 140 | for i in xrange(buckets[-1]): 141 | name = "input{}".format(i) 142 | item_embed, _ = self.embeddingAttribute.get_batch_item(name, 143 | self.batch_size, concat = True, no_attribute=self.no_input_item_feature) 144 | item_embed_transform = tf.matmul(item_embed, w_input_item) 145 | input_embed = user_embed_transform + item_embed_transform 146 | self.inputs.append(input_embed) 147 | else: 148 | user_embed, _ = self.embeddingAttribute.get_batch_user(1.0,concat = False, no_id = no_user_id) 149 | 150 | for i in xrange(buckets[-1]): 151 | name = "input{}".format(i) 152 | item_embed, _ = self.embeddingAttribute.get_batch_item(name, 153 | self.batch_size, concat = False, no_attribute=self.no_input_item_feature) 154 | item_embed = 
tf.reduce_mean(item_embed, 0) 155 | input_embed = tf.reduce_mean([user_embed, item_embed], 0) 156 | self.inputs.append(input_embed) 157 | 158 | self.outputs, self.losses, self.outputs_full, self.losses_full, self.topk_values, self.topk_indexes = self.model_with_buckets(self.inputs,self.targets, self.target_weights, self.buckets, single_cell,self.embeddingAttribute, dtype, devices = devices) 159 | 160 | # for warp 161 | if self.loss in ["warp", "mw"]: 162 | self.set_mask, self.reset_mask = self.embeddingAttribute.get_warp_mask(device = self.devices[2]) 163 | 164 | #with tf.device(devices[0]): 165 | # train 166 | with tf.device(devices[0]): 167 | params = tf.trainable_variables() 168 | if not forward_only: 169 | self.gradient_norms = [] 170 | self.updates = [] 171 | self.gradient_norms = [] 172 | self.updates = [] 173 | if withAdagrad: 174 | opt = tf.train.AdagradOptimizer(self.learning_rate) 175 | else: 176 | opt = tf.train.GradientDescentOptimizer(self.learning_rate) 177 | 178 | for b in xrange(len(buckets)): 179 | gradients = tf.gradients(self.losses[b], params, colocate_gradients_with_ops=True) 180 | clipped_gradients, norm = tf.clip_by_global_norm(gradients, max_gradient_norm) 181 | self.gradient_norms.append(norm) 182 | self.updates.append(opt.apply_gradients(zip(clipped_gradients, params), global_step=self.global_step)) 183 | 184 | self.saver = tf.train.Saver(tf.global_variables()) 185 | 186 | 187 | def init_beam_decoder(self,beam_size=10, max_steps = 30): 188 | 189 | # a non bucket design 190 | # 191 | # how to feed in: 192 | # user_history = [1,2,3,4] 193 | # inputs = [GO, 1, 2, 3], sequene_length = [4-1] 194 | 195 | self.beam_size = beam_size 196 | 197 | init_state = self.single_cell.zero_state(1, self.dtype) 198 | self.before_state = [] 199 | self.after_state = [] 200 | print(init_state) 201 | shape = [self.beam_size, init_state[0].c.get_shape()[1]] 202 | 203 | with tf.device(self.devices[0]): 204 | 205 | with tf.variable_scope("beam_search"): 206 | 207 | # two variable: before_state, after_state 208 | for i, state_tuple in enumerate(init_state): 209 | cb = tf.get_variable("before_c_{}".format(i), shape, initializer=tf.constant_initializer(0.0), trainable = False) 210 | hb = tf.get_variable("before_h_{}".format(i), shape, initializer=tf.constant_initializer(0.0), trainable = False) 211 | sb = tf.contrib.rnn.core_rnn_cell.LSTMStateTuple(cb,hb) 212 | ca = tf.get_variable("after_c_{}".format(i), shape, initializer=tf.constant_initializer(0.0), trainable = False) 213 | ha = tf.get_variable("after_h_{}".format(i), shape, initializer=tf.constant_initializer(0.0), trainable = False) 214 | sa = tf.contrib.rnn.core_rnn_cell.LSTMStateTuple(ca,ha) 215 | self.before_state.append(sb) 216 | self.after_state.append(sa) 217 | 218 | # a new place holder for sequence_length 219 | self.sequence_length = tf.placeholder(tf.int32, shape=[1], name = "sequence_length") 220 | 221 | # the final_state after processing the start state 222 | with tf.variable_scope("",reuse=True): 223 | _, self.beam_final_state = tf.contrib.rnn.static_rnn(self.single_cell,self.inputs,initial_state = init_state, sequence_length = self.sequence_length) 224 | 225 | with tf.variable_scope("beam_search"): 226 | # copy the final_state to before_state 227 | self.final2before_ops = [] # an operation sequence 228 | for i in xrange(len(self.before_state)): 229 | final_c = self.beam_final_state[i].c 230 | final_h = self.beam_final_state[i].h 231 | final_c_expand = tf.nn.embedding_lookup(final_c,[0] * self.beam_size) 232 | final_h_expand 
= tf.nn.embedding_lookup(final_h,[0] * self.beam_size) 233 | copy_c = self.before_state[i].c.assign(final_c_expand) 234 | copy_h = self.before_state[i].h.assign(final_h_expand) 235 | self.final2before_ops.append(copy_c) 236 | self.final2before_ops.append(copy_h) 237 | 238 | # operation: copy after_state to before_state according to a ma 239 | self.beam_parent = tf.placeholder(tf.int32, shape=[self.beam_size], name = "beam_parent") 240 | self.after2before_ops = [] # an operation sequence 241 | for i in xrange(len(self.before_state)): 242 | after_c = self.after_state[i].c 243 | after_h = self.after_state[i].h 244 | after_c_expand = tf.nn.embedding_lookup(after_c,self.beam_parent) 245 | after_h_expand = tf.nn.embedding_lookup(after_h,self.beam_parent) 246 | copy_c = self.before_state[i].c.assign(after_c_expand) 247 | copy_h = self.before_state[i].h.assign(after_h_expand) 248 | self.after2before_ops.append(copy_c) 249 | self.after2before_ops.append(copy_h) 250 | 251 | 252 | # operation: one step RNN 253 | with tf.variable_scope("",reuse=True): 254 | self.beam_step_outputs, self.beam_step_state = tf.contrib.rnn.static_rnn(self.single_cell,self.beam_step_inputs,initial_state = self.before_state) 255 | 256 | with tf.variable_scope("beam_search"): 257 | # operate: copy beam_step_state to after_state 258 | self.beam2after_ops = [] # an operation sequence 259 | for i in xrange(len(self.after_state)): 260 | copy_c = self.after_state[i].c.assign(self.beam_step_state[i].c) 261 | copy_h = self.after_state[i].h.assign(self.beam_step_state[i].h) 262 | self.beam2after_ops.append(copy_c) 263 | self.beam2after_ops.append(copy_h) 264 | 265 | 266 | def show_before_state(self): 267 | for i in xrange(self.before_state): 268 | print(self.before_state[i].c.eval()) 269 | print(self.before_state[i].h.eval()) 270 | 271 | def beam_step(self, session, index = 0, beam_input = None, user_input=None, item_inputs=None,sequence_length = None, bucket_id = 0): 272 | if index == 0: 273 | length = self.buckets[bucket_id] 274 | 275 | input_feed = {} 276 | (update_sampled, input_feed_sampled, input_feed_warp) = self.embeddingAttribute.add_input(input_feed, user_input, item_inputs, forward_only = True, recommend = True) 277 | input_feed[self.sequence_length.name] = sequence_length 278 | 279 | output_feed = [self.final2before.ops] 280 | 281 | self.show_before_state() 282 | _ = session.run(output_feed, input_feed) 283 | self.show_before_state() 284 | 285 | else: 286 | pass 287 | 288 | 289 | def step(self,session, user_input, item_inputs, targets, target_weights, 290 | bucket_id, item_sampled=None, item_sampled_id2idx = None, forward_only = False, recommend = False): 291 | 292 | length = self.buckets[bucket_id] 293 | 294 | targets_mapped = self.embeddingAttribute.target_mapping(targets) 295 | input_feed = {} 296 | for l in xrange(length): 297 | input_feed[self.targets[l].name] = targets_mapped[l] 298 | input_feed[self.target_weights[l].name] = target_weights[l] 299 | if self.loss in ['mw', 'ce']: 300 | input_feed[self.target_ids[l].name] = targets[l] 301 | 302 | (update_sampled, input_feed_sampled, input_feed_warp) = self.embeddingAttribute.add_input(input_feed, user_input, item_inputs, forward_only = forward_only, recommend = recommend, loss = self.loss, item_sampled_id2idx=item_sampled_id2idx) 303 | if self.loss in ["warp", "mw"]: 304 | session.run(self.set_mask[self.loss], input_feed_warp) 305 | 306 | if item_sampled is not None and self.loss in ['mw', 'mce']: 307 | session.run(update_sampled, input_feed_sampled) 308 | 309 | # 
output_feed 310 | if forward_only: 311 | output_feed = [self.losses_full[bucket_id]] 312 | else: 313 | output_feed = [self.losses[bucket_id]] 314 | output_feed += [self.updates[bucket_id], self.gradient_norms[bucket_id]] 315 | 316 | if self.loss in ["warp", "mw"]: 317 | session.run(self.set_mask[self.loss], input_feed_warp) 318 | 319 | outputs = session.run(output_feed, input_feed, options = self.run_options, run_metadata = self.run_metadata) 320 | 321 | if self.loss in ["warp", "mw"]: 322 | session.run(self.reset_mask[self.loss], input_feed_warp) 323 | 324 | return outputs[0] 325 | 326 | def step_recommend(self,session, user_input, item_inputs, positions, bucket_id): 327 | length = self.buckets[bucket_id] 328 | if bucket_id == 0: 329 | pre_length = 0 330 | else: 331 | pre_length = self.buckets[bucket_id - 1] 332 | 333 | input_feed = {} 334 | 335 | (update_sampled, input_feed_sampled, input_feed_warp) = self.embeddingAttribute.add_input(input_feed, user_input, item_inputs, forward_only = True, recommend = True, loss = self.loss) 336 | 337 | # output_feed 338 | output_feed = {} 339 | 340 | for pos in range(pre_length,length): 341 | output_feed[pos] = [self.topk_values[bucket_id][pos], self.topk_indexes[bucket_id][pos]] 342 | 343 | outputs = session.run(output_feed, input_feed, options = self.run_options, run_metadata = self.run_metadata) 344 | 345 | # results = [(uid, [value], [index])] 346 | results = [] 347 | for i, pos in enumerate(positions): 348 | uid = user_input[i] 349 | values = outputs[pos][0][i,:] 350 | indexes = outputs[pos][1][i,:] 351 | results.append((uid,values,indexes)) 352 | 353 | return results 354 | 355 | 356 | def get_batch(self, data_set, bucket_id, start_id = None): 357 | length = self.buckets[bucket_id] 358 | 359 | users, item_inputs,item_outputs, weights = [], [], [], [] 360 | 361 | for i in xrange(self.batch_size): 362 | if start_id == None: 363 | user, item_seq = random.choice(data_set[bucket_id]) 364 | else: 365 | if start_id + i < len(data_set[bucket_id]): 366 | user, item_seq = data_set[bucket_id][start_id + i] 367 | else: 368 | user = self.USER_PAD_ID 369 | item_seq = [] 370 | 371 | pad_seq = [self.PAD_ID] * (length - len(item_seq)) 372 | if len(item_seq) == 0: 373 | item_input_seq = [self.START_ID] + pad_seq[1:] 374 | else: 375 | item_input_seq = [self.START_ID] + item_seq[:-1] + pad_seq 376 | item_output_seq = item_seq + pad_seq 377 | target_weight = [1.0] * len(item_seq) + [0.0] * len(pad_seq) 378 | 379 | users.append(user) 380 | item_inputs.append(item_input_seq) 381 | item_outputs.append(item_output_seq) 382 | weights.append(target_weight) 383 | 384 | # Now we create batch-major vectors from the data selected above. 
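    # batch_major (defined just below) transposes the per-user lists into
    # per-timestep lists, so that entry t of the result holds position t of
    # every sequence in the batch. Illustrative example with two sequences:
    #   [[a1, a2, a3],
    #    [b1, b2, b3]]  ->  [[a1, b1], [a2, b2], [a3, b3]]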
385 | def batch_major(l): 386 | output = [] 387 | for i in xrange(len(l[0])): 388 | temp = [] 389 | for j in xrange(self.batch_size): 390 | temp.append(l[j][i]) 391 | output.append(temp) 392 | return output 393 | 394 | batch_user = users 395 | batch_item_inputs = batch_major(item_inputs) 396 | batch_item_outputs = batch_major(item_outputs) 397 | batch_weights = batch_major(weights) 398 | 399 | finished = False 400 | if start_id != None and start_id + self.batch_size >= len(data_set[bucket_id]): 401 | finished = True 402 | 403 | 404 | return batch_user, batch_item_inputs, batch_item_outputs, batch_weights, finished 405 | 406 | 407 | def get_batch_recommend(self, data_set, bucket_id, start_id = None): 408 | length = self.buckets[bucket_id] 409 | 410 | users, item_inputs, positions, valids = [], [], [], [] 411 | 412 | for i in xrange(self.batch_size): 413 | if start_id == None: 414 | user, item_seq = random.choice(data_set[bucket_id]) 415 | valid = 1 416 | position = len(item_seq) - 1 417 | else: 418 | if start_id + i < len(data_set[bucket_id]): 419 | user, item_seq = data_set[bucket_id][start_id + i] 420 | valid = 1 421 | position = len(item_seq) - 1 422 | else: 423 | user = self.USER_PAD_ID 424 | item_seq = [] 425 | valid = 0 426 | position = length-1 427 | 428 | pad_seq = [self.PAD_ID] * (length - len(item_seq)) 429 | item_input_seq = item_seq + pad_seq 430 | valids.append(valid) 431 | users.append(user) 432 | positions.append(position) 433 | item_inputs.append(item_input_seq) 434 | 435 | # Now we create batch-major vectors from the data selected above. 436 | def batch_major(l): 437 | output = [] 438 | for i in xrange(len(l[0])): 439 | temp = [] 440 | for j in xrange(self.batch_size): 441 | temp.append(l[j][i]) 442 | output.append(temp) 443 | return output 444 | 445 | batch_item_inputs = batch_major(item_inputs) 446 | 447 | finished = False 448 | if start_id != None and start_id + self.batch_size >= len(data_set[bucket_id]): 449 | finished = True 450 | 451 | return users, batch_item_inputs, positions, valids, finished 452 | 453 | 454 | def model_with_buckets(self, inputs, targets, weights, 455 | buckets, cell, embeddingAttribute, dtype, 456 | per_example_loss=False, name=None, devices = None): 457 | 458 | all_inputs = inputs + targets + weights 459 | losses = [] 460 | losses_full = [] 461 | outputs = [] 462 | outputs_full = [] 463 | topk_values = [] 464 | topk_indexes = [] 465 | softmax_loss_function = lambda x,y: self.embeddingAttribute.compute_loss(x ,y, loss=self.loss, device = devices[2]) 466 | 467 | with tf.device(devices[1]): 468 | init_state = cell.zero_state(self.batch_size, dtype) 469 | 470 | 471 | with tf.name_scope(name, "model_with_buckets", all_inputs): 472 | # with ops.op_scope(all_inputs, name, "model_with_buckets"): 473 | for j, bucket in enumerate(buckets): 474 | with variable_scope.variable_scope(variable_scope.get_variable_scope(),reuse=True if j > 0 else None): 475 | 476 | with tf.device(devices[1]): 477 | bucket_outputs, _ = tf.contrib.rnn.static_rnn(cell,inputs[:bucket],initial_state = init_state) 478 | with tf.device(devices[2]): 479 | 480 | bucket_outputs_full = [self.embeddingAttribute.get_prediction(x, device=devices[2], output_feat=self.output_feat) for x in bucket_outputs] 481 | 482 | if self.loss in ['warp', 'ce']: 483 | t = targets 484 | bucket_outputs = [self.embeddingAttribute.get_prediction(x, device=devices[2], output_feat=self.output_feat) for x in bucket_outputs] 485 | elif self.loss in ['mw']: 486 | # bucket_outputs0 = 
[self.embeddingAttribute.get_prediction(x, pool='sampled', device=devices[2]) for x in bucket_outputs] 487 | t, bucket_outputs0 = [], [] 488 | 489 | for i in xrange(len(bucket_outputs)): 490 | x = bucket_outputs[i] 491 | ids = self.target_ids[i] 492 | bucket_outputs0.append(self.embeddingAttribute.get_prediction(x, pool='sampled', device=devices[2], output_feat=self.output_feat)) 493 | t.append(self.embeddingAttribute.get_target_score(x, ids, device=devices[2])) 494 | bucket_outputs = bucket_outputs0 495 | 496 | outputs.append(bucket_outputs) 497 | outputs_full.append(bucket_outputs_full) 498 | 499 | if per_example_loss: 500 | losses.append(sequence_loss_by_example( 501 | outputs[-1], t[:bucket], weights[:bucket], 502 | softmax_loss_function=softmax_loss_function)) 503 | losses_full.append(sequence_loss_by_example( 504 | outputs_full[-1], t[:bucket], weights[:bucket], 505 | softmax_loss_function=softmax_loss_function)) 506 | else: 507 | losses.append(sequence_loss( 508 | outputs[-1], t[:bucket], weights[:bucket], 509 | softmax_loss_function=softmax_loss_function)) 510 | losses_full.append(sequence_loss( 511 | outputs_full[-1], t[:bucket], weights[:bucket],softmax_loss_function=softmax_loss_function)) 512 | topk_value, topk_index = [], [] 513 | 514 | for full_logits in outputs_full[-1]: 515 | value, index = tf.nn.top_k(tf.nn.softmax(full_logits), self.topk_n, sorted = True) 516 | topk_value.append(value) 517 | topk_index.append(index) 518 | topk_values.append(topk_value) 519 | topk_indexes.append(topk_index) 520 | 521 | return outputs, losses, outputs_full, losses_full, topk_values, topk_indexes 522 | 523 | 524 | def sequence_loss_by_example(logits, targets, weights, 525 | average_across_timesteps=True, 526 | softmax_loss_function=None, name=None): 527 | """Weighted cross-entropy loss for a sequence of logits (per example). 528 | 529 | Args: 530 | logits: List of 2D Tensors of shape [batch_size x num_decoder_symbols]. 531 | targets: List of 1D batch-sized int32 Tensors of the same length as logits. 532 | weights: List of 1D batch-sized float-Tensors of the same length as logits. 533 | average_across_timesteps: If set, divide the returned cost by the total 534 | label weight. 535 | softmax_loss_function: Function (inputs-batch, labels-batch) -> loss-batch 536 | to be used instead of the standard softmax (the default if this is None). 537 | name: Optional name for this operation, default: "sequence_loss_by_example". 538 | 539 | Returns: 540 | 1D batch-sized float Tensor: The log-perplexity for each sequence. 541 | 542 | Raises: 543 | ValueError: If len(logits) is different from len(targets) or len(weights). 544 | """ 545 | if len(targets) != len(logits) or len(weights) != len(logits): 546 | raise ValueError("Lengths of logits, weights, and targets must be the same " 547 | "%d, %d, %d." % (len(logits), len(weights), len(targets))) 548 | with tf.name_scope(name, "sequence_loss_by_example", logits + targets + weights): 549 | # with ops.op_scope(logits + targets + weights,name, "sequence_loss_by_example"): 550 | log_perp_list = [] 551 | for logit, target, weight in zip(logits, targets, weights): 552 | if softmax_loss_function is None: 553 | # TODO(irving,ebrevdo): This reshape is needed because 554 | # sequence_loss_by_example is called with scalars sometimes, which 555 | # violates our general scalar strictness policy. 
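        # Note: TF 1.0 and later expect this call to use keyword arguments
        # (labels=target, logits=logit); the positional form below matches the
        # pre-1.0 releases this code was originally developed against.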
556 | target = array_ops.reshape(target, [-1]) 557 | crossent = nn_ops.sparse_softmax_cross_entropy_with_logits( 558 | logit, target) 559 | else: 560 | crossent = softmax_loss_function(logit, target) 561 | log_perp_list.append(crossent * weight) 562 | 563 | log_perps = math_ops.add_n(log_perp_list) 564 | if average_across_timesteps: 565 | total_size = math_ops.add_n(weights) 566 | total_size += 1e-12 # Just to avoid division by 0 for all-0 weights. 567 | log_perps /= total_size 568 | return log_perps 569 | 570 | 571 | def sequence_loss(logits, targets, weights, 572 | average_across_timesteps=True, average_across_batch=False, 573 | softmax_loss_function=None, name=None): 574 | """Weighted cross-entropy loss for a sequence of logits, batch-collapsed. 575 | 576 | Args: 577 | logits: List of 2D Tensors of shape [batch_size x num_decoder_symbols]. 578 | targets: List of 1D batch-sized int32 Tensors of the same length as logits. 579 | weights: List of 1D batch-sized float-Tensors of the same length as logits. 580 | average_across_timesteps: If set, divide the returned cost by the total 581 | label weight. 582 | average_across_batch: If set, divide the returned cost by the batch size. 583 | softmax_loss_function: Function (inputs-batch, labels-batch) -> loss-batch 584 | to be used instead of the standard softmax (the default if this is None). 585 | name: Optional name for this operation, defaults to "sequence_loss". 586 | 587 | Returns: 588 | A scalar float Tensor: The average log-perplexity per symbol (weighted). 589 | 590 | Raises: 591 | ValueError: If len(logits) is different from len(targets) or len(weights). 592 | """ 593 | 594 | with tf.name_scope(name, "sequence_loss", logits + targets + weights): 595 | # with ops.op_scope(logits + targets + weights, name, "sequence_loss"): 596 | cost = math_ops.reduce_sum(sequence_loss_by_example( 597 | logits, targets, weights, 598 | average_across_timesteps=average_across_timesteps, 599 | softmax_loss_function=softmax_loss_function)) 600 | if average_across_batch: 601 | total_size = tf.reduce_sum(tf.sign(weights[0])) 602 | return cost / math_ops.cast(total_size, cost.dtype) 603 | else: 604 | return cost 605 | -------------------------------------------------------------------------------- /utils/eval_metrics.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | def metrics(X, T, Ns=[2,5,10,20,30], metrics=['prec', 'recall', 'map', 'ndcg']): 4 | n_users = float(len(T)) 5 | N_pos = len(Ns) 6 | funcs = {'prec':PRECISION, 'recall':RECALL, 'map':MAP, 'ndcg':NDCG} 7 | res = {} 8 | for m in metrics: 9 | re = [] 10 | for n in Ns: 11 | re.append(0.0) 12 | res[m] = re 13 | 14 | for u, t in T.items(): 15 | t = set(t) 16 | if u not in X: 17 | continue 18 | pred = X[u] 19 | correct = [int(r in t) for r in pred] # correct or not at each position 20 | cumsum_x = np.cumsum(correct) 21 | for m in metrics: 22 | f = funcs[m] 23 | s = f(correct, t, Ns, cumsum_x) 24 | for i in range(N_pos): 25 | res[m][i] += s[i] 26 | for m in metrics: 27 | for i in range(N_pos): 28 | res[m][i] = res[m][i] / n_users 29 | return res 30 | 31 | def PRECISION(X, T, Ns=[1,2,5,10,20,30], cumsum_x=None): 32 | ''' return PRECISION@N ''' 33 | # assert(Ns[-1] <= len(X)) 34 | l = len(cumsum_x) 35 | if l == 0: 36 | return [0.0 for n in Ns] 37 | 38 | return [cumsum_x[min(n-1, l-1)] * 1.0 / min(n,l) for n in Ns] 39 | 40 | def RECALL(X, T, Ns=[1,2,5,10,20,30], cumsum_x=None): 41 | ''' return RECALL@N ''' 42 | # assert (len(T) > 0) 43 | n_t = 
len(T) 44 | l = len(cumsum_x) 45 | if l == 0: 46 | return [0.0 for n in Ns] 47 | 48 | return [cumsum_x[min(n-1, l-1)] * 1.0 / n_t for n in Ns] 49 | 50 | 51 | def MAP(X, T, Ns=[1,2,5,10,20,30], cumsum_x=None): 52 | ''' return MAP@N N = 2, 5, 10, 20, 30 ''' 53 | l = len(X) 54 | if l == 0: 55 | return [0.0 for n in Ns] 56 | n_t = len(T) 57 | ap_i = [X[i] * 1.0 * cumsum_x[i] / (i+1) for i in range(l)] 58 | ap = np.cumsum(ap_i) 59 | return [ap[min(n-1, l-1)] / min(min(n_t, n),l) for n in Ns] 60 | 61 | N_max = 30 62 | discounts = [1.0 / np.log2(2+n) for n in range(N_max)] 63 | def NDCG(X, T, Ns=[1,2,5,10,20,30], cumsum_x=None): 64 | ''' return NDCG@N N = 2, 5, 10, 20, 30 ''' 65 | l = len(X) 66 | if l == 0: 67 | return [0.0 for n in Ns] 68 | n_max = l if l < N_max else N_max 69 | disc = discounts[:n_max] 70 | DCGs = [x_d[0] * x_d[1] for x_d in zip(X, disc)] 71 | IDCGs = discounts[:len(T)] + [0.0] * (n_max - len(T)) 72 | cumsum_DCG = np.cumsum(DCGs) 73 | cumsum_IDCG = np.cumsum(IDCGs) 74 | return [cumsum_DCG[min(n-1, l-1)] / cumsum_IDCG[min(n-1, l-1)] for n in Ns] 75 | 76 | 77 | '''deprecated''' 78 | 79 | def eval_P5(X, T, K=5): 80 | score = 0 81 | for uid in T: 82 | if uid not in X: 83 | continue 84 | t = set(T[uid]) 85 | pred = set(X[uid][:K]) 86 | score += len(pred.intersection(t)) * 1.0 / len(pred) 87 | return score * 1.0 / len(T) 88 | 89 | def eval_R20(X, T): 90 | ''' 91 | actually it is the success rate of 20 92 | ''' 93 | POS = 20 94 | success = 0 95 | for uid in T: 96 | if uid not in X: 97 | continue 98 | suc = 1 if len(set(X[uid][:POS]).intersection(T[uid])) > 0 else 0 99 | success += suc 100 | return 1.0 * success / (len(T)) 101 | 102 | -------------------------------------------------------------------------------- /utils/evaluate.py: -------------------------------------------------------------------------------- 1 | from eval_metrics import metrics 2 | from submit import load_submit, combine_sub, format_submit 3 | from load_data import load_users, load_items, load_interactions 4 | from os.path import isfile, join 5 | 6 | 7 | class Evaluation(object): 8 | 9 | def __init__(self, raw_data_dir, test=False): 10 | 11 | res_filename = 'res_T_test.csv' if test else 'res_T.csv' 12 | if not isfile(join(raw_data_dir, res_filename)): 13 | print('eval file does not exist. creating ... 
') 14 | self.create_eval_file(raw_data_dir) 15 | self.T = load_submit(res_filename, submit_dir=raw_data_dir) 16 | hist_filename = 'historical_train_test.csv' if test else 'historical_train.csv' 17 | self.hist = load_submit(hist_filename, submit_dir=raw_data_dir) 18 | 19 | self.Iatt, _, self.Iid2ind = load_items(raw_data_dir) 20 | self.Uatt, _, self.Uid2ind = load_users(raw_data_dir) 21 | 22 | self.Uids = self.get_uids() 23 | self.Uinds = [self.Uid2ind[v] for v in self.Uids] 24 | self.combine_sub = combine_sub 25 | return 26 | 27 | def get_user_n(self): 28 | return len(self.Uinds) 29 | 30 | def get_uids(self): 31 | return list(self.T.keys()) 32 | 33 | def get_uinds(self): 34 | return self.Uinds 35 | 36 | def eval_on(self, rec): 37 | 38 | self.res = rec 39 | 40 | tmp_filename = 'rec' 41 | for k, v in rec.items(): 42 | rec[k] = [str(v) for v in rec[k]] 43 | # if isinstance(R[k], list): 44 | 45 | # R[k] = ','.join(str(xx) for xx in v) 46 | # for k, v in rec.items(): 47 | # rec[k] = v.split(',') 48 | 49 | r_ex = self.combine_sub(self.hist, rec, 1, users = self.Uatt) 50 | 51 | result = metrics(rec, self.T) 52 | l = result.values() 53 | self.s_self = [item for sublist in l for item in sublist] 54 | l = metrics(r_ex, self.T).values() 55 | self.s_ex = [item for sublist in l for item in sublist] 56 | return 57 | 58 | def get_scores(self): 59 | return self.s_self, self.s_ex 60 | 61 | def set_uinds(self, uinds): 62 | self.Uinds = uinds 63 | 64 | def create_eval_file(self, raw_data): 65 | DIR = raw_data 66 | interact, names = load_interactions(data_dir=DIR) 67 | 68 | interact_tr, interact_va, interact_te = interact 69 | 70 | data_tr = zip(list(interact_tr[:, 0]), list(interact_tr[:, 1]), list(interact_tr[:, 2])) 71 | data_va = zip(list(interact_va[:, 0]), list(interact_va[:, 1]), list(interact_va[:, 2])) 72 | data_te = zip(list(interact_te[:, 0]), list(interact_te[:, 1]), list(interact_te[:, 2])) 73 | 74 | seq_tr, seq_va, seq_te = {}, {}, {} 75 | 76 | for u, i , t in data_tr: 77 | if u not in seq_tr: 78 | seq_tr[u] = [] 79 | seq_tr[u].append((i, t)) 80 | 81 | for u, i , t in data_va: 82 | if u not in seq_va: 83 | seq_va[u] = [] 84 | seq_va[u].append((i,t)) 85 | 86 | for u, i , t in data_te: 87 | if u not in seq_te: 88 | seq_te[u] = [] 89 | seq_te[u].append(i) 90 | 91 | for u, v in seq_tr.items(): 92 | l = sorted(v, key = lambda x:x[1], reverse=True) 93 | seq_tr[u] = ','.join([str(p[0]) for p in l]) 94 | 95 | for u, v in seq_va.items(): 96 | l = sorted(v, key = lambda x:x[1], reverse=True) 97 | seq_va[u] = ','.join(str(p[0]) for p in l) 98 | 99 | for u, v in seq_te.items(): 100 | seq_te[u] = ','.join(str(p) for p in seq_te[u]) 101 | 102 | 103 | format_submit(seq_tr, 'historical_train.csv', submit_dir=DIR) 104 | format_submit(seq_va, 'res_T.csv', submit_dir=DIR) 105 | format_submit(seq_te, 'res_T_test.csv', submit_dir=DIR) 106 | 107 | seq_va_tr = seq_va 108 | for u in seq_tr: 109 | if u in seq_va: 110 | seq_va_tr[u] = seq_va[u] +','+ seq_tr[u] 111 | format_submit(seq_va_tr, 'historical_train_test.csv', submit_dir=DIR) 112 | return 113 | 114 | 115 | -------------------------------------------------------------------------------- /utils/load_data.py: -------------------------------------------------------------------------------- 1 | from os.path import join, isfile 2 | import pandas as pd 3 | import numpy as np 4 | 5 | def build_index(values): 6 | count, index = 0, {} 7 | opt = 1 if values.shape[1] else 0 8 | for v in values: 9 | if opt == 1: 10 | index[v[0]] = count 11 | elif opt == 0: 12 | index[v] = 
count 13 | count += 1 14 | return index 15 | 16 | def load_csv(filename, indexing=True, sep = '\t', header=0): 17 | if not isfile(filename): 18 | return [],None, None if indexing else [], None 19 | data = pd.read_csv(filename, delimiter=sep, header=header) 20 | columns = list(data.columns) 21 | values = data.values 22 | if indexing: 23 | index = build_index(values) 24 | return values, columns, index 25 | else: 26 | return values, columns 27 | 28 | def file_check(filename): 29 | if not isfile(filename): 30 | print("Error: user file {} does not exit!".format(filename)) 31 | exit(1) 32 | return 33 | 34 | def load_users(data_dir, sep='\t'): 35 | filename = join(data_dir, 'u.csv') 36 | file_check(filename) 37 | users, attr_names, user_index = load_csv(filename) 38 | filename = join(data_dir, 'u_attr.csv') 39 | if isfile(filename): 40 | vals, _ = load_csv(filename, False) 41 | attr_types = vals.flatten().tolist() 42 | else: 43 | attr_types = [0] * len(attr_names) 44 | return users, (attr_names, attr_types), user_index 45 | 46 | def load_items(data_dir, sep='\t'): 47 | filename = join(data_dir, 'i.csv') 48 | file_check(filename) 49 | items, attr_names, item_index = load_csv(filename) 50 | filename = join(data_dir, 'i_attr.csv') 51 | if isfile(filename): 52 | vals, _ = load_csv(filename, False) 53 | attr_types = vals.flatten().tolist() 54 | else: 55 | attr_types = [0] * len(attr_names) 56 | return items, (attr_names, attr_types), item_index 57 | 58 | def load_interactions(data_dir, sep='\t'): 59 | filename0 = join(data_dir, 'obs_') 60 | suffix = ['tr.csv', 'va.csv', 'te.csv'] 61 | ints, names = [], [] 62 | for s in suffix: 63 | filename = filename0 + s 64 | interact, name = load_csv(filename, False) 65 | assert(interact.shape[1] >= 2) 66 | if interact.shape[1] == 2: 67 | l = interact.shape[0] 68 | interact = np.append(interact, np.zeros((l, 1), dtype=int), 1) 69 | ints.append(interact) 70 | names.append(name) 71 | return ints, names[0] 72 | 73 | def load_raw_data(data_dir, _submit=0): 74 | users, u_attr, user_index = load_users(data_dir) 75 | items, i_attr, item_index = load_items(data_dir) 76 | ints, names = load_interactions(data_dir) 77 | for v in ints: 78 | for i in range(len(v)): 79 | v[i][0] = user_index[v[i][0]] 80 | v[i][1] = item_index[v[i][1]] 81 | interact_tr, interact_va, interact_te = ints 82 | 83 | data_va, data_te = None, None 84 | if _submit == 1: 85 | interact_tr = np.append(interact_tr, interact_va, 0) 86 | data_tr = zip(list(interact_tr[:, 0]), list(interact_tr[:, 1]), 87 | list(interact_tr[:, 2])) 88 | data_va = zip(list(interact_te[:, 0]), list(interact_te[:, 1]), 89 | list(interact_te[:, 2])) 90 | else: 91 | data_tr = zip(list(interact_tr[:, 0]), list(interact_tr[:, 1]), 92 | list(interact_tr[:, 2])) 93 | data_va = zip(list(interact_va[:, 0]), list(interact_va[:, 1]), 94 | list(interact_va[:, 2])) 95 | return users, items, data_tr, data_va, u_attr, i_attr, user_index, item_index 96 | 97 | -------------------------------------------------------------------------------- /utils/pandatools.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from scipy.sparse import coo_matrix 4 | 5 | def load_csv(filename, sep = '\t', types = 1, header=0): 6 | data = pd.read_csv(filename, delimiter=sep, header=header) 7 | if types == 0: 8 | return data 9 | columns = list(data.columns) 10 | values = data.values 11 | return values, columns 12 | 13 | def write_csv(x, filename, header, columns=None): 14 | 
x.to_csv(path_or_buf=filename, sep='\t', 15 | index=False, header = header, columns = columns) 16 | return 17 | 18 | def build_index(values, opt = 1): 19 | count, index = 0, {} 20 | for v in values: 21 | if opt == 1: 22 | index[v[0]] = count 23 | elif opt == 0: 24 | index[v] = count 25 | count += 1 26 | return index 27 | 28 | def matrix2tsv(M, filename): 29 | x = pd.DataFrame(M) 30 | write_csv(x, filename, None) 31 | return 32 | 33 | def dict2tsv(D, filename): 34 | l, ind = 0, 0 35 | for k in D: 36 | l += len(D[k]) 37 | x = np.zeros((l, 3), dtype=object) 38 | 39 | for k in D: 40 | for p in D[k]: 41 | x[ind, :] = (k, p, D[k][p]) 42 | ind += 1 43 | assert(ind == l) 44 | x = pd.DataFrame(x) 45 | write_csv(x, filename, None) 46 | return 47 | 48 | def sparse2tsv(M, filename): 49 | ''' 50 | from sparse matrix 2 tsv file 51 | ''' 52 | M = M.todok() 53 | pos = np.array(M.keys()) 54 | val = np.array(M.values()).reshape(pos.shape[0],1) 55 | x = np.append(pos, val, 1) 56 | x = x.astype(int) 57 | x = pd.DataFrame(x) 58 | write_csv(x, filename, None) 59 | return 60 | 61 | def sparse2tsv2(A, filename, matlab): 62 | a, b = A.nonzero() 63 | d = [i for sublist in A.data for i in sublist] 64 | l = len(d) 65 | A2 = np.zeros((l, 3), dtype=int) 66 | for i in range(l): 67 | A2[i, 0] = a[i]+matlab 68 | A2[i, 1] = b[i]+matlab 69 | A2[i, 2] = d[i] 70 | matrix2tsv(A2, filename) 71 | 72 | def tsv2dict(filename): 73 | x = pd.read_csv(filename, delimiter='\t', header=None).values 74 | D = {} 75 | assert(x.shape[1] == 3) 76 | for i in range(len(x)): 77 | uid, iid, sim = x[i,:] 78 | uid, iid = int(uid), int(iid) 79 | if uid not in D: 80 | D[uid] = {} 81 | D[uid][iid] = sim 82 | return D 83 | 84 | def tsv2matrix(filename, opt=1): 85 | x = pd.read_csv(filename, delimiter='\t', header=None).values 86 | if opt == 0: 87 | return x 88 | d = int(x[:,1].max() + 1) 89 | n = int(x[:,0].max() + 1) 90 | p = coo_matrix( (x[:,2], (x[:,0], x[:,1])), shape=(n,d)) 91 | p = p.todense() 92 | return p 93 | -------------------------------------------------------------------------------- /utils/prepare_train.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from datetime import datetime 3 | 4 | def to_week(t): 5 | return datetime.fromtimestamp(t).isocalendar()[1] 6 | 7 | def sample_items(items, n, p=None, replace=False): 8 | if p: 9 | item_sampled = np.random.choice(items, n, replace=replace, p=p) 10 | else: 11 | item_sampled = np.random.choice(items, n, replace=replace) 12 | item_sampled_id2idx = {} 13 | i = 0 14 | for item in item_sampled: 15 | item_sampled_id2idx[item] = i 16 | i += 1 17 | return item_sampled, item_sampled_id2idx 18 | 19 | def item_frequency(data_tr, power): 20 | ''' count item frequency and compute sampling prob''' 21 | item_counts = {} 22 | item_population = set([]) 23 | for u, i, _ in data_tr: 24 | item_counts[i] = 1 if i not in item_counts else item_counts[i] + 1 25 | item_population.add(i) 26 | item_population = list(item_population) 27 | counts = [item_counts[v] for v in item_population] 28 | # print(len(item_population)) 29 | 30 | count_sum = sum(counts) * 1.0 31 | 32 | p_item_unormalized = [np.power(c / count_sum, power) for c in counts] 33 | p_item_sum = sum(p_item_unormalized) 34 | p_item = [f / p_item_sum for f in p_item_unormalized] 35 | return item_population, p_item 36 | 37 | def positive_items(data_tr, data_va): 38 | hist, hist_va = {}, {} 39 | for u, i, _ in data_tr: 40 | if u not in hist: 41 | hist[u] = set([i]) 42 | else: 43 | hist[u].add(i) 
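  # hist now maps each training user to the set of items they interacted with;
  # the same pass over data_va below builds the validation positives. The
  # returned pos_item_list / pos_item_list_val are typically handed to the
  # models' prepare_warp(), so that a user's known positives can be masked out
  # when the WARP-style losses sample negative items.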
44 | for u, i, _ in data_va: 45 | if u not in hist_va: 46 | hist_va[u] = set([i]) 47 | else: 48 | hist_va[u].add(i) 49 | 50 | pos_item_list = {} 51 | pos_item_list_val = {} 52 | for u in hist: 53 | pos_item_list[u] = list(hist[u]) 54 | for u in hist_va: 55 | pos_item_list_val[u] = list(hist_va[u]) 56 | 57 | return pos_item_list, pos_item_list_val 58 | -------------------------------------------------------------------------------- /utils/preprocess.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from os import listdir, mkdir, path, rename 3 | from os.path import isfile, join 4 | from tensorflow.python.platform import gfile 5 | 6 | 7 | _UNK = "_UNK" 8 | _START = "_START" 9 | # _PHANTOM = "_PHANTOM" 10 | 11 | UNK_ID = 0 12 | START_ID = 1 13 | # REST_ID = 1 14 | 15 | _START_VOCAB = [_UNK, _START ] #, _PHANTOM] 16 | 17 | 18 | def pickle_save(m, filename): 19 | import cPickle as pickle 20 | pickle.dump(m, open(filename, 'wb'), protocol=pickle.HIGHEST_PROTOCOL) 21 | 22 | def initialize_vocabulary(vocabulary_path): 23 | """Initialize vocabulary from file. 24 | 25 | We assume the vocabulary is stored one-item-per-line, so a file: 26 | dog 27 | cat 28 | will result in a vocabulary {"dog": 0, "cat": 1}, and this function will 29 | also return the reversed-vocabulary ["dog", "cat"]. 30 | 31 | Args: 32 | vocabulary_path: path to the file containing the vocabulary. 33 | 34 | Returns: 35 | a pair: the vocabulary (a dictionary mapping string to integers), and 36 | the reversed vocabulary (a list, which reverses the vocabulary mapping). 37 | 38 | Raises: 39 | ValueError: if the provided vocabulary_path does not exist. 40 | """ 41 | if gfile.Exists(vocabulary_path): 42 | rev_vocab = [] 43 | with gfile.GFile(vocabulary_path, mode="rb") as f: 44 | rev_vocab.extend(f.readlines()) 45 | rev_vocab = [line.strip() for line in rev_vocab] 46 | vocab = dict([(x, y) for (y, x) in enumerate(rev_vocab)]) 47 | return vocab, rev_vocab 48 | else: 49 | raise ValueError("Vocabulary file %s not found.", vocabulary_path) 50 | 51 | def create_dictionary(data_dir, inds, features, feature_types, feature_names, 52 | max_vocabulary_size=50000, logits_size_tr = 50000, threshold = 2, 53 | prefix='user'): 54 | filename = 'vocab0_%d' % max_vocabulary_size 55 | if isfile(join(data_dir, filename)): 56 | print("vocabulary exists!") 57 | return 58 | vocab_counts = {} 59 | num_uf = len(feature_names) 60 | assert(len(feature_types) == num_uf), 'length of feature_types should be the same length of feature_names {} vs {}'.format(len(feature_types), num_uf) 61 | for ind in range(num_uf): 62 | name = feature_names[ind] 63 | vocab_counts[name] = {} 64 | 65 | for u in inds: # u index 66 | uf = features[u, :] 67 | for ii in range(num_uf): 68 | name = feature_names[ii] 69 | if feature_types[ii] == 0: 70 | vocab_counts[name][uf[ii]] = vocab_counts[name][uf[ii]] + 1 if uf[ii] in vocab_counts[name] else 1 71 | elif feature_types[ii] == 1: 72 | if not isinstance(uf[ii], list): 73 | if not isinstance(uf[ii], str): 74 | uf[ii] = str(uf[ii]) 75 | uf[ii] = uf[ii].split(',') 76 | for t in uf[ii]: 77 | vocab_counts[name][t] = vocab_counts[name][t] + 1 if t in vocab_counts[name] else 1 78 | 79 | minimum_occurance = [] 80 | for i in range(num_uf): 81 | name = feature_names[i] 82 | if feature_types[i] > 1: 83 | continue 84 | vocab_list = _START_VOCAB + sorted(vocab_counts[name], 85 | key=vocab_counts[name].get, reverse=True) 86 | if prefix == 'item' and i == 0: 87 | max_size = logits_size_tr + 
len(_START_VOCAB) 88 | elif prefix == 'user' and i == 0: 89 | max_size = len(features) 90 | max_size = max_vocabulary_size # looks still better to filter first 91 | else: 92 | max_size = max_vocabulary_size 93 | 94 | # max_size += len(_START_VOCAB) 95 | 96 | # if len(vocab_list) > max_size: 97 | # vocab_list= vocab_list[:max_size] 98 | with gfile.GFile(join(data_dir, ("%s_vocab%d_%d"% (prefix, i, 99 | max_size))), mode="wb") as vocab_file: 100 | 101 | if prefix == 'user' and i == 0: 102 | vocab_list2 = [v for v in vocab_list if v in _START_VOCAB or 103 | vocab_counts[name][v] >= threshold] 104 | else: 105 | vocab_list2 = [v for v in vocab_list if v in _START_VOCAB or 106 | vocab_counts[name][v] >= threshold] 107 | if len(vocab_list2) > max_size: 108 | print("vocabulary {}_{} longer than max_vocabulary_size {}. Truncate the tail".format(prefix, len(vocab_list2), max_size)) 109 | vocab_list2= vocab_list2[:max_size] 110 | for w in vocab_list2: 111 | vocab_file.write(str(w) + b"\n") 112 | minimum_occurance.append(vocab_counts[name][vocab_list2[-1]]) 113 | with gfile.GFile(join(data_dir, "%s_minimum_occurance_%d" %(prefix, 114 | max_size)), mode="wb") as sum_file: 115 | sum_file.write('\n'.join([str(v) for v in minimum_occurance])) 116 | 117 | return 118 | 119 | def create_dictionary_mix(data_dir, inds, features, feature_types, 120 | feature_names, max_vocabulary_size=50000, logits_size_tr = 50000, 121 | threshold = 2, prefix='user'): 122 | filename = 'vocab0_%d' % max_vocabulary_size 123 | if isfile(join(data_dir, filename)): 124 | print("vocabulary exists!") 125 | return 126 | vocab_counts = {} 127 | num_uf = len(feature_names) 128 | assert(len(feature_types) == num_uf), 'length of feature_types should be the same length of feature_names {} vs {}'.format(len(feature_types), num_uf) 129 | 130 | vocab_uid = {} 131 | vocab = {} 132 | for u in inds: # u index 133 | uf = features[u, 0] 134 | 135 | if not isinstance(uf, list): 136 | uf = uf.split(',') 137 | for t in uf: 138 | if t.startswith('uid'): 139 | vocab_uid[t] = vocab_uid[t] + 1 if t in vocab_uid else 1 140 | else: 141 | vocab[t] = vocab[t] + 1 if t in vocab else 1 142 | 143 | minimum_occurance = [] 144 | 145 | vocab_list = _START_VOCAB + vocab_uid.keys() + sorted(vocab, 146 | key=vocab.get, reverse=True) 147 | 148 | max_size = max_vocabulary_size 149 | 150 | with gfile.GFile(join(data_dir, ("%s_vocab%d_%d"% (prefix, 0, 151 | max_size))), mode="wb") as vocab_file: 152 | 153 | vocab_list2 = [v for v in vocab_list if v in _START_VOCAB or (v in vocab and 154 | vocab[v] >= threshold) or (v in vocab_uid and vocab_uid[v] >= threshold)] 155 | if len(vocab_list2) > max_size: 156 | print("vocabulary {}_{} longer than max_vocabulary_size {}. 
Truncate the tail".format(prefix, len(vocab_list2), max_size)) 157 | vocab_list2 = vocab_list2[:max_size] 158 | 159 | for w in vocab_list2: 160 | vocab_file.write(str(w) + b"\n") 161 | min_occurance = vocab[vocab_list2[-1]] if vocab_list2[-1] in vocab else vocab_uid[vocab_list2[-1]] 162 | minimum_occurance.append(min_occurance) 163 | with gfile.GFile(join(data_dir, "%s_minimum_occurance_%d" %(prefix, 164 | max_size)), mode="wb") as sum_file: 165 | sum_file.write('\n'.join([str(v) for v in minimum_occurance])) 166 | 167 | return 168 | 169 | def tokenize_attribute_map(data_dir, features, feature_types, max_vocabulary_size, 170 | logits_size_tr=50000, prefix='user'): 171 | """ 172 | read feature maps and tokenize with loaded vocabulary 173 | output required format for Attributes 174 | """ 175 | features_cat, features_mulhot = [], [] 176 | v_sizes_cat, v_sizes_mulhot = [], [] 177 | mulhot_max_leng, mulhot_starts, mulhot_lengs = [], [], [] 178 | # logit_ind2item_ind = {} 179 | for i in range(len(feature_types)): 180 | ut = feature_types[i] 181 | if feature_types[i] > 1: 182 | continue 183 | 184 | path = "%s_vocab%d_" %(prefix, i) 185 | vocabulary_paths = [f for f in listdir(data_dir) if f.startswith(path)] 186 | assert(len(vocabulary_paths) == 1) 187 | vocabulary_path = join(data_dir, vocabulary_paths[0]) 188 | 189 | vocab, _ = initialize_vocabulary(vocabulary_path) 190 | 191 | N = len(features) 192 | users2 = np.copy(features) 193 | uf = features[:, i] 194 | if ut == 0: 195 | v_sizes_cat.append(len(vocab)) 196 | for n in range(N): 197 | uf[n] = vocab.get(str(uf[n]), UNK_ID) 198 | uf = np.append(uf, START_ID) 199 | features_cat.append(uf) 200 | else: 201 | mtl = 0 202 | idx = 0 203 | starts, lengs, vals = [idx], [], [] 204 | v_sizes_mulhot.append(len(vocab)) 205 | for n in range(N): 206 | elem = uf[n] 207 | if not isinstance(elem, list): 208 | if not isinstance(elem, str): 209 | elem = str(elem) 210 | elem = elem.split(',') 211 | val = [vocab.get(str(v), UNK_ID) for v in elem] 212 | val_ = [v for v in val if v != UNK_ID] 213 | if len(val_) == 0: 214 | val_ = [UNK_ID] 215 | 216 | vals.extend(val_) 217 | l_mulhot = len(val_) 218 | mtl = max(mtl, l_mulhot) 219 | idx += l_mulhot 220 | starts.append(idx) 221 | lengs.append(l_mulhot) 222 | 223 | vals.append(START_ID) 224 | idx += 1 225 | starts.append(idx) 226 | lengs.append(1) 227 | 228 | mulhot_max_leng.append(mtl) 229 | mulhot_starts.append(np.array(starts)) 230 | mulhot_lengs.append(np.array(lengs)) 231 | features_mulhot.append(np.array(vals)) 232 | 233 | num_features_cat = sum(v == 0 for v in feature_types) 234 | num_features_mulhot= sum(v == 1 for v in feature_types) 235 | assert(num_features_cat + num_features_mulhot <= len(feature_types)) 236 | return (num_features_cat, features_cat, num_features_mulhot, features_mulhot, 237 | mulhot_max_leng, mulhot_starts, mulhot_lengs, v_sizes_cat, 238 | v_sizes_mulhot) 239 | 240 | def filter_cat(num_features_cat, features_cat, logit_ind2item_ind): 241 | ''' 242 | create mapping from logits index [0, logits_size) to features 243 | ''' 244 | features_cat_tr = [] 245 | size = len(logit_ind2item_ind) 246 | for i in xrange(num_features_cat): 247 | feat_cat = features_cat[i] 248 | feat_cat_tr = [] 249 | for j in xrange(size): 250 | item_index = logit_ind2item_ind[j] 251 | feat_cat_tr.append(feat_cat[item_index]) 252 | features_cat_tr.append(np.array(feat_cat_tr)) 253 | 254 | return features_cat_tr 255 | 256 | 257 | def filter_mulhot(data_dir, items, feature_types, max_vocabulary_size, 258 | logit_ind2item_ind, 
prefix='item'): 259 | full_values, full_values_tr= [], [] 260 | full_segids, full_lengths = [], [] 261 | full_segids_tr, full_lengths_tr = [], [] 262 | 263 | L = len(logit_ind2item_ind) 264 | N = len(items) 265 | for i in range(len(feature_types)): 266 | full_index, full_index_tr = [], [] 267 | lengs, lengs_tr = [], [] 268 | ut = feature_types[i] 269 | if feature_types[i] == 1: 270 | 271 | path = "%s_vocab%d_" %(prefix, i) 272 | vocabulary_paths = [f for f in listdir(data_dir) if f.startswith(path)] 273 | assert(len(vocabulary_paths)==1), 'more than one dictionaries found! delete unnecessary ones to fix this.' 274 | vocabulary_path = join(data_dir, vocabulary_paths[0]) 275 | 276 | vocab, _ = initialize_vocabulary(vocabulary_path) 277 | 278 | uf = items[:, i] 279 | mtl, idx, vals = 0, 0, [] 280 | segids = [] 281 | 282 | for n in xrange(N): 283 | elem = uf[n] 284 | if not isinstance(elem, list): 285 | if not isinstance(elem, str): 286 | elem = str(elem) 287 | elem = elem.split(',') 288 | 289 | val = [vocab.get(v, UNK_ID) for v in elem] 290 | val_ = [v for v in val if v != UNK_ID] 291 | if len(val_) == 0: 292 | val_ = [UNK_ID] 293 | vals.extend(val_) 294 | l_mulhot = len(val_) 295 | segids.extend([n] * l_mulhot) 296 | lengs.append([l_mulhot * 1.0]) 297 | 298 | full_values.append(vals) 299 | full_segids.append(segids) 300 | full_lengths.append(lengs) 301 | 302 | idx2, vals2 = 0, [] 303 | segids_tr = [] 304 | for n in xrange(L): 305 | i_ind = logit_ind2item_ind[n] 306 | elem = uf[i_ind] 307 | if not isinstance(elem, list): 308 | if not isinstance(elem, str): 309 | elem = str(elem) 310 | elem = elem.split(',') 311 | 312 | val = [vocab.get(v, UNK_ID) for v in elem] 313 | val_ = [v for v in val if v != UNK_ID] 314 | if len(val_) == 0: 315 | val_ = [UNK_ID] 316 | vals2.extend(val_) 317 | l_mulhot = len(val_) 318 | lengs_tr.append([l_mulhot * 1.0]) 319 | segids_tr.extend([n] * l_mulhot) 320 | 321 | full_values_tr.append(vals2) 322 | full_segids_tr.append(segids_tr) 323 | full_lengths_tr.append(lengs_tr) 324 | 325 | return (full_values, full_values_tr, full_segids, full_lengths, 326 | full_segids_tr, full_lengths_tr) 327 | 328 | -------------------------------------------------------------------------------- /utils/submit.py: -------------------------------------------------------------------------------- 1 | from pandatools import write_csv, load_csv, pd 2 | from os.path import join 3 | 4 | 5 | def load_submit(sub_id, submit_dir = '../submissions/'): 6 | 7 | filename = join(submit_dir, sub_id) 8 | data = load_csv(filename, types=0) 9 | x = data.set_index('user_id').to_dict()['items'] 10 | for _, key in enumerate(x): 11 | l = x[key] 12 | if isinstance(l, str): 13 | x[key] = l.split(',') 14 | elif isinstance(l, int): 15 | x[key] = [str(l)] 16 | else: 17 | x[key] = [] 18 | return x 19 | 20 | def format_submit(X, sub_id, submit_dir = '../submissions/'): 21 | ''' 22 | save recommendation result to submission file 23 | input: 24 | X : dict. Ex: X[1400] = 1232,1123,5325 25 | sub_id: submission id 26 | ''' 27 | 28 | 29 | header = ['user_id', 'items'] 30 | for pos, key in enumerate(X): 31 | l = X[key] 32 | if isinstance(l, list): 33 | X[key] = ','.join(str(xx) for xx in l) 34 | else: 35 | # print 'not a list. No need to convert.' 
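      # Values in X are assumed homogeneous: hitting a non-list entry means
      # they are already serialized strings, so stop converting here.
      # Example round trip (hypothetical ids):
      #   format_submit({1400: [12, 7, 99]}, 'rec.csv', submit_dir='.')
      #   load_submit('rec.csv', submit_dir='.')  # -> {1400: ['12', '7', '99']}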
36 | break 37 | x = pd.DataFrame(X.items()) 38 | write_csv(x, join(submit_dir, sub_id), header) 39 | return 40 | 41 | 42 | # def combine_sub(r1, r2, opt = 0, old=False, users=None): 43 | def combine_sub(r1, r2, opt = 0, users=None): 44 | rec = {} 45 | for i in range(len(users)): 46 | uid = users[i, 0] 47 | if uid not in r1 and uid not in r2: 48 | continue 49 | i_set = set() 50 | rec[uid] = [] 51 | if uid in r1: 52 | for iid in r1[uid]: 53 | if iid not in i_set: 54 | i_set.add(iid) 55 | if opt == 0: 56 | rec[uid].append(iid) 57 | if uid in r2: 58 | for iid in r2[uid]: 59 | if iid not in i_set: 60 | i_set.add(iid) 61 | rec[uid].append(iid) 62 | return rec -------------------------------------------------------------------------------- /word2vec/cbow_model.py: -------------------------------------------------------------------------------- 1 | 2 | from __future__ import absolute_import 3 | from __future__ import division 4 | from __future__ import print_function 5 | 6 | 7 | import tensorflow as tf 8 | import sys 9 | from linear_seq import LinearSeq 10 | sys.path.insert(0, '../attributes') 11 | import embed_attribute 12 | 13 | 14 | class Model(LinearSeq): 15 | def __init__(self, user_size, item_size, size, 16 | batch_size, learning_rate, 17 | learning_rate_decay_factor, 18 | user_attributes=None, 19 | item_attributes=None, 20 | item_ind2logit_ind=None, 21 | logit_ind2item_ind=None, 22 | n_input_items=0, 23 | loss_function='ce', 24 | logit_size_test=None, 25 | dropout=1.0, 26 | top_N_items = 100, 27 | use_sep_item=True, 28 | n_sampled=None, 29 | output_feat=1, 30 | indices_item=None, 31 | dtype=tf.float32): 32 | 33 | self.user_size = user_size 34 | self.item_size = item_size 35 | self.top_N_items = top_N_items 36 | 37 | if user_attributes is not None: 38 | user_attributes.set_model_size(size) 39 | self.user_attributes = user_attributes 40 | if item_attributes is not None: 41 | item_attributes.set_model_size(size) 42 | self.item_attributes = item_attributes 43 | 44 | self.item_ind2logit_ind = item_ind2logit_ind 45 | self.logit_ind2item_ind = logit_ind2item_ind 46 | if logit_ind2item_ind is not None: 47 | self.logit_size = len(logit_ind2item_ind) 48 | if indices_item is not None: 49 | self.indices_item = indices_item 50 | else: 51 | self.indices_item = range(self.logit_size) 52 | self.logit_size_test = logit_size_test 53 | 54 | self.loss_function = loss_function 55 | self.n_input_items = n_input_items 56 | self.n_sampled = n_sampled 57 | self.batch_size = batch_size 58 | 59 | self.learning_rate = tf.Variable(float(learning_rate), trainable=False) 60 | self.learning_rate_decay_op = self.learning_rate.assign( 61 | self.learning_rate * learning_rate_decay_factor) 62 | self.global_step = tf.Variable(0, trainable=False) 63 | 64 | self.att_emb = None 65 | self.dtype=dtype 66 | 67 | mb = self.batch_size 68 | ''' this is mapped item target ''' 69 | self.item_target = tf.placeholder(tf.int32, shape = [mb], name = "item") 70 | self.item_id_target = tf.placeholder(tf.int32, shape = [mb], name = "item_id") 71 | 72 | self.dropout = dropout 73 | self.keep_prob = tf.constant(dropout, dtype=dtype) 74 | # tf.placeholder(tf.float32, name='keep_prob') 75 | 76 | n_input = max(n_input_items, 1) 77 | m = embed_attribute.EmbeddingAttribute(user_attributes, item_attributes, mb, 78 | self.n_sampled, n_input, use_sep_item, item_ind2logit_ind, logit_ind2item_ind) 79 | self.att_emb = m 80 | 81 | embedded_user, _ = m.get_batch_user(1.0, False) 82 | embedded_items = [] 83 | for i in range(n_input): 84 | embedded_item, _ = 
m.get_batch_item('input{}'.format(i), batch_size) 85 | embedded_item = tf.reduce_mean(embedded_item, 0) 86 | embedded_items.append(embedded_item) 87 | embedded_items = tf.reduce_mean(embedded_items, 0) 88 | 89 | print("non-sampled prediction") 90 | input_embed = tf.reduce_mean([embedded_user, embedded_items], 0) 91 | input_embed = tf.nn.dropout(input_embed, self.keep_prob) 92 | logits = m.get_prediction(input_embed, output_feat=output_feat) 93 | 94 | if self.n_input_items == 0: 95 | input_embed_test= embedded_user 96 | else: 97 | # including two cases: 1, n items. 2, end_line item 98 | # input_embed_test = [embedded_user] + embedded_items 99 | # input_embed_test = tf.reduce_mean(input_embed_test, 0) 100 | 101 | input_embed_test = [embedded_user] + [embedded_items] 102 | input_embed_test = tf.reduce_mean(input_embed_test, 0) 103 | logits_test = m.get_prediction(input_embed_test, output_feat=output_feat) 104 | 105 | # mini batch version 106 | print("sampled prediction") 107 | if self.n_sampled is not None: 108 | sampled_logits = m.get_prediction(input_embed, 'sampled', output_feat=output_feat) 109 | # embedded_item, item_b = m.get_sampled_item(self.n_sampled) 110 | # sampled_logits = tf.matmul(embedded_user, tf.transpose(embedded_item)) + item_b 111 | target_score = m.get_target_score(input_embed, self.item_id_target) 112 | 113 | loss = self.loss_function 114 | if loss in ['warp', 'ce', 'bbpr']: 115 | batch_loss = m.compute_loss(logits, self.item_target, loss) 116 | batch_loss_test = m.compute_loss(logits_test, self.item_target, loss) 117 | elif loss in ['mw']: 118 | batch_loss = m.compute_loss(sampled_logits, target_score, loss) 119 | batch_loss_eval = m.compute_loss(logits, self.item_target, 'warp') 120 | else: 121 | print("not implemented!") 122 | exit(-1) 123 | if loss in ['warp', 'mw', 'bbpr']: 124 | self.set_mask, self.reset_mask = m.get_warp_mask() 125 | 126 | self.loss = tf.reduce_mean(batch_loss) 127 | # self.loss_eval = tf.reduce_mean(batch_loss_eval) if loss == 'mw' else self.loss 128 | self.loss_test = tf.reduce_mean(batch_loss_test) 129 | 130 | # Gradients and SGD update operation for training the model. 
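    # Plain Adagrad over all trainable variables (the Adam optimizer left
    # commented out below is an alternative); no gradient clipping is applied.
    # At recommendation time the model simply takes tf.nn.top_k of the test
    # logits to produce the top_N_items indices.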
131 | params = tf.trainable_variables() 132 | opt = tf.train.AdagradOptimizer(self.learning_rate) 133 | # opt = tf.train.AdamOptimizer(self.learning_rate) 134 | gradients = tf.gradients(self.loss, params) 135 | self.updates = opt.apply_gradients( 136 | zip(gradients, params), global_step=self.global_step) 137 | 138 | self.output = logits_test 139 | values, self.indices= tf.nn.top_k(self.output, self.top_N_items, sorted=True) 140 | self.saver = tf.train.Saver(tf.all_variables()) 141 | -------------------------------------------------------------------------------- /word2vec/data_iterator.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import collections 3 | 4 | class DataIterator: 5 | def __init__(self, seq, end_ind, batch_size, n_skips, window, sequence): 6 | self.seq = seq 7 | self.l_seq = len(seq) 8 | self.end_ind = end_ind 9 | self.batch_size = batch_size 10 | self.num_skips = n_skips 11 | self.skip_window = window 12 | self.index = 0 13 | self.sequence = sequence 14 | 15 | def get_next(self): 16 | seq = self.seq 17 | users = np.ndarray(shape=[self.batch_size], dtype=np.int32) 18 | i_items = np.ndarray(shape=[self.batch_size], dtype=np.int32) 19 | o_items = np.ndarray(shape=[self.batch_size], dtype=np.int32) 20 | span = 2 * self.skip_window + 1 21 | b = collections.deque(maxlen=span) 22 | center = self.skip_window 23 | 24 | for _ in range(span): 25 | b.append(seq[self.index]) 26 | self.index = (self.index + 1) % self.l_seq 27 | 28 | while True: 29 | if self.sequence: 30 | self.index = np.random.randint(0, self.l_seq) 31 | for _ in range(span): 32 | b.append(seq[self.index]) 33 | self.index = (self.index + 1) % self.l_seq 34 | 35 | ind = 0 36 | while ind < self.batch_size: 37 | u = b[center][0] 38 | i_i = b[center][1] 39 | hi = center if i_i == self.end_ind else span 40 | targets_to_avoid = [center] 41 | 42 | for _ in range(self.num_skips): 43 | t = np.random.randint(0, hi) 44 | while t in targets_to_avoid: 45 | t = np.random.randint(0, hi) 46 | o_i = b[t][1] 47 | if b[t][0] != u or o_i == self.end_ind: 48 | continue 49 | targets_to_avoid.append(t) 50 | users[ind] = u 51 | i_items[ind] = i_i 52 | o_items[ind] = o_i 53 | ind += 1 54 | if ind >= self.batch_size: 55 | break 56 | b.append(seq[self.index]) 57 | self.index = (self.index + 1) % self.l_seq 58 | yield users, i_items, o_items 59 | 60 | def get_next_sg(self): 61 | # only predict future based on history 62 | seq = self.seq 63 | users = np.ndarray(shape=[self.batch_size], dtype=np.int32) 64 | i_items = np.ndarray(shape=[self.batch_size], dtype=np.int32) 65 | o_items = np.ndarray(shape=[self.batch_size], dtype=np.int32) 66 | span = 2 * self.skip_window + 1 67 | b = collections.deque(maxlen=span) 68 | # center = self.skip_window 69 | center = 0 70 | 71 | for _ in range(span): 72 | b.append(seq[self.index]) 73 | self.index = (self.index + 1) % self.l_seq 74 | 75 | while True: 76 | if self.sequence: 77 | self.index = np.random.randint(0, self.l_seq) 78 | for _ in range(span): 79 | b.append(seq[self.index]) 80 | self.index = (self.index + 1) % self.l_seq 81 | 82 | ind = 0 83 | while ind < self.batch_size: 84 | u = b[center][0] 85 | i_i = b[center][1] 86 | hi = span 87 | # hi = center if i_i == self.end_ind else span 88 | targets_to_avoid = [center] 89 | 90 | for _ in range(self.num_skips): 91 | t = np.random.randint(0, hi) 92 | while t in targets_to_avoid: 93 | t = np.random.randint(0, hi) 94 | o_i = b[t][1] 95 | if b[t][0] != u or o_i == self.end_ind: 96 | continue 97 | 
targets_to_avoid.append(t) 98 | users[ind] = u 99 | i_items[ind] = i_i 100 | o_items[ind] = o_i 101 | ind += 1 102 | if ind >= self.batch_size: 103 | break 104 | b.append(seq[self.index]) 105 | self.index = (self.index + 1) % self.l_seq 106 | yield users, [i_items], o_items 107 | 108 | def get_next_cbow(self): 109 | # cbow, only predict current based on history 110 | seq = self.seq 111 | users = np.ndarray(shape=[self.batch_size], dtype=np.int32) 112 | # i_items = np.ndarray(shape=[self.batch_size], dtype=np.int32) 113 | i_items = [ [0] * self.num_skips] * self.batch_size 114 | o_items = np.ndarray(shape=[self.batch_size], dtype=np.int32) 115 | span = self.skip_window + 1 116 | b = collections.deque(maxlen=span) 117 | center = span - 1 118 | 119 | for _ in range(span): 120 | b.append(seq[self.index]) 121 | self.index = (self.index + 1) % self.l_seq 122 | 123 | u_seq_len = -1 124 | u, o_i = b[center] 125 | if o_i == self.end_ind: 126 | u_seq_len = 0 127 | else: 128 | for ii in range(center): 129 | uu = b[ii][0] 130 | if uu == u: 131 | u_seq_len += 1 132 | 133 | while True: 134 | if self.sequence: 135 | print('error: not implemented') 136 | exit(1) 137 | self.index = np.random.randint(0, self.l_seq) 138 | for _ in range(span): 139 | b.append(seq[self.index]) 140 | self.index = (self.index + 1) % self.l_seq 141 | 142 | ind = 0 143 | while ind < self.batch_size: 144 | u = b[center][0] 145 | o_i = b[center][1] 146 | if o_i == self.end_ind: # actually the start 147 | b.append(seq[self.index]) 148 | self.index = (self.index + 1) % self.l_seq 149 | u_seq_len = 0 150 | continue 151 | u_seq_len += 1 152 | u_seq_len = min(u_seq_len, center) 153 | 154 | with_replacement = (u_seq_len < self.num_skips) 155 | i_samples = np.random.choice(center, self.num_skips, with_replacement) 156 | 157 | i_is = [b[j][1] for j in i_samples] 158 | 159 | i_items[ind] = i_is 160 | users[ind] = u 161 | o_items[ind] = o_i 162 | ind += 1 163 | 164 | b.append(seq[self.index]) 165 | self.index = (self.index + 1) % self.l_seq 166 | 167 | i_items_input = batch_major(i_items, self.batch_size, self.num_skips) 168 | 169 | yield users, i_items_input, o_items 170 | 171 | 172 | def batch_major(l, m, n): 173 | output = [] 174 | for i in range(n): 175 | tmp = [] 176 | for j in range(m): 177 | tmp.append(l[j][i]) 178 | output.append(tmp) 179 | return output -------------------------------------------------------------------------------- /word2vec/linear_seq.py: -------------------------------------------------------------------------------- 1 | 2 | from __future__ import absolute_import 3 | from __future__ import division 4 | from __future__ import print_function 5 | 6 | 7 | import tensorflow as tf 8 | import sys 9 | sys.path.insert(0, '../attributes') 10 | import embed_attribute 11 | 12 | class LinearSeq(object): 13 | def __init__(self, user_size, item_size, size, 14 | batch_size, learning_rate, 15 | learning_rate_decay_factor, 16 | user_attributes=None, 17 | item_attributes=None, 18 | item_ind2logit_ind=None, 19 | logit_ind2item_ind=None, 20 | n_input_items=0, 21 | loss_function='ce', 22 | logit_size_test=None, 23 | dropout=1.0, 24 | use_sep_item=True, 25 | n_sampled=None, 26 | output_feat=1, 27 | indices_item=None, 28 | dtype=tf.float32): 29 | 30 | self.user_size = user_size 31 | self.item_size = item_size 32 | 33 | if user_attributes is not None: 34 | user_attributes.set_model_size(size) 35 | self.user_attributes = user_attributes 36 | if item_attributes is not None: 37 | item_attributes.set_model_size(size) 38 | self.item_attributes 
= item_attributes 39 | 40 | self.item_ind2logit_ind = item_ind2logit_ind 41 | self.logit_ind2item_ind = logit_ind2item_ind 42 | if logit_ind2item_ind is not None: 43 | self.logit_size = len(logit_ind2item_ind) 44 | if indices_item is not None: 45 | self.indices_item = indices_item 46 | else: 47 | self.indices_item = range(self.logit_size) 48 | self.logit_size_test = logit_size_test 49 | 50 | self.loss_function = loss_function 51 | self.n_input_items = n_input_items 52 | self.n_sampled = n_sampled 53 | self.batch_size = batch_size 54 | 55 | self.learning_rate = tf.Variable(float(learning_rate), trainable=False) 56 | self.learning_rate_decay_op = self.learning_rate.assign( 57 | self.learning_rate * learning_rate_decay_factor) 58 | self.global_step = tf.Variable(0, trainable=False) 59 | 60 | self.att_emb = None 61 | self.dtype=dtype 62 | 63 | mb = self.batch_size 64 | ''' this is mapped item target ''' 65 | 66 | def prepare_warp(self, pos_item_set, pos_item_set_eval): 67 | self.att_emb.prepare_warp(pos_item_set, pos_item_set_eval) 68 | return 69 | 70 | def step(self, session, user_input, item_input=None, item_output=None, 71 | item_sampled = None, item_sampled_id2idx = None, 72 | forward_only=False, recommend=False, recommend_new = False, loss=None, 73 | run_op=None, run_meta=None): 74 | input_feed = {} 75 | 76 | if recommend == False: 77 | targets = self.att_emb.target_mapping([item_output]) 78 | input_feed[self.item_target.name] = targets[0] 79 | if loss in ['mw']: 80 | input_feed[self.item_id_target.name] = item_output 81 | 82 | if self.att_emb is not None: 83 | (update_sampled, input_feed_sampled, 84 | input_feed_warp) = self.att_emb.add_input(input_feed, user_input, 85 | item_input, neg_item_input=None, item_sampled = item_sampled, 86 | item_sampled_id2idx = item_sampled_id2idx, 87 | forward_only=forward_only, recommend=recommend, loss=loss) 88 | 89 | if not recommend: 90 | if not forward_only: 91 | output_feed = [self.updates, self.loss] 92 | else: 93 | output_feed = [self.loss_test] 94 | else: 95 | if recommend_new: 96 | output_feed = [self.indices_test] 97 | else: 98 | output_feed = [self.indices] 99 | 100 | if item_sampled is not None and loss in ['mw', 'mce']: 101 | session.run(update_sampled, input_feed_sampled) 102 | 103 | if (loss in ['warp', 'bbpr', 'mw']) and recommend is False: 104 | session.run(self.set_mask[loss], input_feed_warp) 105 | 106 | if run_op is not None and run_meta is not None: 107 | outputs = session.run(output_feed, input_feed, options=run_op, run_metadata=run_meta) 108 | else: 109 | outputs = session.run(output_feed, input_feed) 110 | 111 | if (loss in ['warp', 'bbpr', 'mw']) and recommend is False: 112 | session.run(self.reset_mask[loss], input_feed_warp) 113 | 114 | if not recommend: 115 | if not forward_only: 116 | return outputs[1] 117 | else: 118 | return outputs[0] 119 | else: 120 | return outputs[0] 121 | 122 | -------------------------------------------------------------------------------- /word2vec/run_w2v.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import math, os, sys 6 | import random, time 7 | import logging 8 | 9 | import numpy as np 10 | from six.moves import xrange # pylint: disable=redefined-builtin 11 | import tensorflow as tf 12 | 13 | sys.path.insert(0, '../utils') 14 | sys.path.insert(0, '../attributes') 15 | 16 | 17 | from input_attribute import read_data as 
read_attributed_data 18 | from prepare_train import positive_items, item_frequency, sample_items, to_week 19 | from data_iterator import DataIterator 20 | # in order to profile 21 | from tensorflow.python.client import timeline 22 | 23 | # models 24 | tf.app.flags.DEFINE_string("model", "cbow", "cbow or skipgram") 25 | 26 | # datasets, paths, and preprocessing 27 | tf.app.flags.DEFINE_string("dataset", "xing", ".") 28 | tf.app.flags.DEFINE_string("raw_data", "../raw_data", "input data directory") 29 | tf.app.flags.DEFINE_string("data_dir", "./data0", "Data directory") 30 | tf.app.flags.DEFINE_string("train_dir", "./test0", "Training directory.") 31 | tf.app.flags.DEFINE_boolean("test", False, "Test on test splits") 32 | tf.app.flags.DEFINE_string("combine_att", 'mix', "method to combine attributes: het or mix") 33 | tf.app.flags.DEFINE_boolean("use_user_feature", True, "RT") 34 | tf.app.flags.DEFINE_boolean("use_item_feature", True, "RT") 35 | tf.app.flags.DEFINE_integer("user_vocab_size", 150000, "User vocabulary size.") 36 | tf.app.flags.DEFINE_integer("item_vocab_size", 50000, "Item vocabulary size.") 37 | tf.app.flags.DEFINE_integer("vocab_min_thresh", 2, "filter inactive tokens.") 38 | 39 | # tuning hypers 40 | tf.app.flags.DEFINE_string("loss", 'ce', "loss function") 41 | tf.app.flags.DEFINE_float("learning_rate", 0.1, "Learning rate.") 42 | tf.app.flags.DEFINE_float("keep_prob", 0.5, "dropout rate.") 43 | tf.app.flags.DEFINE_float("learning_rate_decay_factor", 1.0, "Learning rate decays by this much.") 44 | tf.app.flags.DEFINE_integer("batch_size", 64, 45 | "Batch size to use during training.") 46 | tf.app.flags.DEFINE_integer("size", 20, "Size of each model layer.") 47 | tf.app.flags.DEFINE_integer("patience", 20, 48 | "exit if the model can't improve for $patience evals") 49 | tf.app.flags.DEFINE_integer("n_epoch", 1000, "How many epochs to train.") 50 | tf.app.flags.DEFINE_integer("steps_per_checkpoint", 4000, 51 | "How many training steps to do per checkpoint.") 52 | 53 | # to recommend 54 | tf.app.flags.DEFINE_boolean("recommend", False, 55 | "Set to True for recommend items.") 56 | tf.app.flags.DEFINE_integer("top_N_items", 100, 57 | "number of items output") 58 | tf.app.flags.DEFINE_boolean("recommend_new", False, 59 | "Set to True for recommend new items that were not used to train.") 60 | 61 | # algorithms with sampling 62 | tf.app.flags.DEFINE_float("power", 0.5, "related to sampling rate.") 63 | tf.app.flags.DEFINE_integer("n_resample", 50, "iterations before resample.") 64 | tf.app.flags.DEFINE_integer("n_sampled", 1024, "sampled softmax/warp loss.") 65 | tf.app.flags.DEFINE_float("user_sample", 1.0, "user sample rate.") 66 | 67 | 68 | # attribute model variants 69 | tf.app.flags.DEFINE_integer("output_feat", 1, "0: no use, 1: use, mean-mulhot, 2: use, max-pool") 70 | tf.app.flags.DEFINE_boolean("use_sep_item", True, "use separate embedding parameters for output items.") 71 | tf.app.flags.DEFINE_boolean("no_user_id", False, "use user id or not") 72 | 73 | # word2vec hypers 74 | tf.app.flags.DEFINE_integer("ni", 2, "# of input context items.") 75 | tf.app.flags.DEFINE_integer("num_skips", 3, "# of output context items.") 76 | tf.app.flags.DEFINE_integer("skip_window", 5, "Size of each model layer.") 77 | 78 | 79 | # others 80 | tf.app.flags.DEFINE_boolean("device_log", False, 81 | "Set to True for logging device usages.") 82 | tf.app.flags.DEFINE_boolean("eval", True, 83 | "Set to True for evaluation.") 84 | tf.app.flags.DEFINE_boolean("use_more_train", False, 85 | "Set 
true if use non-appearred items to train.") 86 | tf.app.flags.DEFINE_boolean("profile", False, "False = no profile, True = profile") 87 | tf.app.flags.DEFINE_boolean("after40", False, 88 | "whether use items after week 40 only.") 89 | # tf.app.flags.DEFINE_string("model_option", 'loss', 90 | # "model to evaluation") 91 | # tf.app.flags.DEFINE_integer("max_train_data_size", 0, 92 | # "Limit on the size of training data (0: no limit).") 93 | 94 | 95 | FLAGS = tf.app.flags.FLAGS 96 | 97 | def mylog(msg): 98 | print(msg) 99 | logging.info(msg) 100 | return 101 | 102 | def get_user_items_seq(data): 103 | # group (u,i) by user and sort by time 104 | d = {} 105 | for u, i, t in data: 106 | if FLAGS.after40 and to_week(t) < 40: 107 | continue 108 | if u not in d: 109 | d[u] = [] 110 | d[u].append((i,t)) 111 | for u in d: 112 | tmp = sorted(d[u], key=lambda x:x[1]) 113 | tmp = [x[0] for x in tmp] 114 | assert(len(tmp)>0) 115 | d[u] = tmp 116 | return d 117 | 118 | def form_train_seq(x, pad_token, opt=1): 119 | # train corpus 120 | seq = [] 121 | p = pad_token 122 | for u in x: 123 | l = [(u,i) for i in x[u]] 124 | if opt == 0: 125 | seq.extend(l) 126 | seq.append((u, p)) 127 | else: 128 | seq.append((u, p)) 129 | seq.extend(l) 130 | 131 | return seq 132 | 133 | def prepare_valid(data_va, u_i_seq_tr, end_ind, n=0): 134 | res = {} 135 | processed = set([]) 136 | for u, _, _ in data_va: 137 | if u in processed: 138 | continue 139 | processed.add(u) 140 | 141 | if u in u_i_seq_tr: 142 | if n == 0: 143 | res[u] = [] 144 | elif n == -1 : 145 | res[u] = [end_ind] 146 | else: 147 | items = u_i_seq_tr[u][-n:] 148 | l = len(items) 149 | if l < n: 150 | items += [end_ind] * (n-l) 151 | res[u] = items 152 | else: 153 | if n == -1: 154 | res[u] = [end_ind] 155 | else: 156 | res[u] = [end_ind] * n 157 | return res 158 | 159 | def get_data(raw_data, data_dir=FLAGS.data_dir, combine_att=FLAGS.combine_att, 160 | logits_size_tr=FLAGS.item_vocab_size, thresh=FLAGS.vocab_min_thresh, 161 | use_user_feature=FLAGS.use_user_feature, test=FLAGS.test, mylog=mylog, 162 | use_item_feature=FLAGS.use_item_feature, no_user_id=FLAGS.no_user_id): 163 | 164 | (data_tr0, data_va0, u_attr, i_attr, item_ind2logit_ind, logit_ind2item_ind, 165 | user_index, item_index) = read_attributed_data( 166 | raw_data_dir=raw_data, 167 | data_dir=data_dir, 168 | combine_att=combine_att, 169 | logits_size_tr=logits_size_tr, 170 | thresh=thresh, 171 | use_user_feature=use_user_feature, 172 | use_item_feature=use_item_feature, 173 | no_user_id=no_user_id, 174 | test=test, 175 | mylog=mylog) 176 | mylog('length of item_ind2logit_ind: {}'.format(len(item_ind2logit_ind))) 177 | 178 | #remove some rare items in both train and valid set 179 | #this helps make train/valid set distribution similar 180 | #to each other 181 | 182 | mylog("original train/dev size: %d/%d" %(len(data_tr0),len(data_va0))) 183 | data_tr = [p for p in data_tr0 if (p[1] in item_ind2logit_ind)] 184 | data_va = [p for p in data_va0 if (p[1] in item_ind2logit_ind)] 185 | mylog("new train/dev size: %d/%d" %(len(data_tr),len(data_va))) 186 | 187 | u_i_seq_tr = get_user_items_seq(data_tr) 188 | 189 | PAD_ID = len(item_index) 190 | seq_tr = form_train_seq(u_i_seq_tr, PAD_ID) 191 | items_dev = prepare_valid(data_va0, u_i_seq_tr, PAD_ID, FLAGS.ni) 192 | 193 | return (seq_tr, items_dev, data_tr, data_va, u_attr, i_attr, 194 | item_ind2logit_ind, logit_ind2item_ind, PAD_ID, user_index, item_index) 195 | 196 | def create_model(session, u_attributes=None, i_attributes=None, 197 | 
item_ind2logit_ind=None, logit_ind2item_ind=None, 198 | loss = FLAGS.loss, logit_size_test=None, ind_item = None): 199 | n_sampled = FLAGS.n_sampled if FLAGS.loss in ['mw', 'mce'] else None 200 | if FLAGS.model == 'cbow': 201 | import cbow_model as w2v_model 202 | elif FLAGS.model == 'sg': 203 | import skipgram_model as w2v_model 204 | else: 205 | mylog('not implemented error') 206 | error(1) 207 | 208 | model = w2v_model.Model(FLAGS.user_vocab_size, 209 | FLAGS.item_vocab_size, FLAGS.size, 210 | FLAGS.batch_size, FLAGS.learning_rate, 211 | FLAGS.learning_rate_decay_factor, u_attributes, i_attributes, 212 | item_ind2logit_ind, logit_ind2item_ind, loss_function=loss, 213 | n_input_items=FLAGS.ni, use_sep_item=FLAGS.use_sep_item, 214 | dropout=FLAGS.keep_prob, top_N_items=FLAGS.top_N_items, 215 | output_feat=FLAGS.output_feat, 216 | n_sampled=n_sampled) 217 | 218 | if not os.path.isdir(FLAGS.train_dir): 219 | os.mkdir(FLAGS.train_dir) 220 | ckpt = tf.train.get_checkpoint_state(FLAGS.train_dir) 221 | 222 | if ckpt: 223 | mylog("Reading model parameters from %s" % ckpt.model_checkpoint_path) 224 | logging.info("Reading model parameters from %s" % ckpt.model_checkpoint_path) 225 | model.saver.restore(session, ckpt.model_checkpoint_path) 226 | else: 227 | mylog("Created model with fresh parameters.") 228 | logging.info("Created model with fresh parameters.") 229 | session.run(tf.initialize_all_variables()) 230 | return model 231 | 232 | def train(raw_data=FLAGS.raw_data): 233 | with tf.Session(config=tf.ConfigProto(allow_soft_placement=True, 234 | log_device_placement=FLAGS.device_log)) as sess: 235 | run_options = None 236 | run_metadata = None 237 | if FLAGS.profile: 238 | run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) 239 | run_metadata = tf.RunMetadata() 240 | FLAGS.steps_per_checkpoint = 30 241 | 242 | mylog("reading data") 243 | 244 | (seq_tr, items_dev, data_tr, data_va, u_attributes, i_attributes, 245 | item_ind2logit_ind, logit_ind2item_ind, end_ind, _, _) = get_data(raw_data, 246 | data_dir=FLAGS.data_dir) 247 | 248 | power = FLAGS.power 249 | item_pop, p_item = item_frequency(data_tr, power) 250 | 251 | if FLAGS.use_more_train: 252 | item_population = range(len(item_ind2logit_ind)) 253 | else: 254 | item_population = item_pop 255 | 256 | model = create_model(sess, u_attributes, i_attributes, item_ind2logit_ind, 257 | logit_ind2item_ind, loss=FLAGS.loss, ind_item=item_population) 258 | 259 | # data iterators 260 | n_skips = FLAGS.ni if FLAGS.model == 'cbow' else FLAGS.num_skips 261 | dite = DataIterator(seq_tr, end_ind, FLAGS.batch_size, n_skips, 262 | FLAGS.skip_window, False) 263 | if FLAGS.model == 'sg': 264 | ite = dite.get_next_sg() 265 | else: 266 | ite = dite.get_next_cbow() 267 | 268 | mylog('started training') 269 | step_time, loss, current_step, auc = 0.0, 0.0, 0, 0.0 270 | 271 | repeat = 5 if FLAGS.loss.startswith('bpr') else 1 272 | patience = FLAGS.patience 273 | 274 | if os.path.isfile(os.path.join(FLAGS.train_dir, 'auc_train.npy')): 275 | auc_train = list(np.load(os.path.join(FLAGS.train_dir, 'auc_train.npy'))) 276 | auc_dev = list(np.load(os.path.join(FLAGS.train_dir, 'auc_dev.npy'))) 277 | previous_losses = list(np.load(os.path.join(FLAGS.train_dir, 278 | 'loss_train.npy'))) 279 | losses_dev = list(np.load(os.path.join(FLAGS.train_dir, 'loss_dev.npy'))) 280 | best_auc = max(auc_dev) 281 | best_loss = min(losses_dev) 282 | else: 283 | previous_losses, auc_train, auc_dev, losses_dev = [], [], [], [] 284 | best_auc, best_loss = -1, 1000000 285 | 286 | 
item_sampled, item_sampled_id2idx = None, None 287 | 288 | train_total_size = float(len(data_tr)) 289 | n_epoch = FLAGS.n_epoch 290 | steps_per_epoch = int(1.0 * train_total_size / FLAGS.batch_size) 291 | total_steps = steps_per_epoch * n_epoch 292 | 293 | mylog("Train:") 294 | mylog("total: {}".format(train_total_size)) 295 | mylog("Steps_per_epoch: {}".format(steps_per_epoch)) 296 | mylog("Total_steps:{}".format(total_steps)) 297 | mylog("Dev:") 298 | mylog("total: {}".format(len(data_va))) 299 | 300 | while True: 301 | 302 | start_time = time.time() 303 | # generate batch of training 304 | (user_input, input_items, output_items) = ite.next() 305 | if current_step < 5: 306 | mylog("current step is {}".format(current_step)) 307 | mylog('user') 308 | mylog(user_input) 309 | mylog('input_item') 310 | mylog(input_items) 311 | mylog('output_item') 312 | mylog(output_items) 313 | 314 | 315 | if FLAGS.loss in ['mw', 'mce'] and current_step % FLAGS.n_resample == 0: 316 | item_sampled, item_sampled_id2idx = sample_items(item_population, 317 | FLAGS.n_sampled, p_item) 318 | else: 319 | item_sampled = None 320 | 321 | step_loss = model.step(sess, user_input, input_items, 322 | output_items, item_sampled, item_sampled_id2idx, loss=FLAGS.loss,run_op=run_options, run_meta=run_metadata) 323 | # step_loss = 0 324 | 325 | step_time += (time.time() - start_time) / FLAGS.steps_per_checkpoint 326 | loss += step_loss / FLAGS.steps_per_checkpoint 327 | # auc += step_auc / FLAGS.steps_per_checkpoint 328 | current_step += 1 329 | if current_step > total_steps: 330 | mylog("Training reaches maximum steps. Terminating...") 331 | break 332 | 333 | if current_step % FLAGS.steps_per_checkpoint == 0: 334 | 335 | if FLAGS.loss in ['ce', 'mce']: 336 | perplexity = math.exp(loss) if loss < 300 else float('inf') 337 | mylog("global step %d learning rate %.4f step-time %.4f perplexity %.2f" % (model.global_step.eval(), model.learning_rate.eval(), step_time, perplexity)) 338 | else: 339 | mylog("global step %d learning rate %.4f step-time %.4f loss %.3f" % (model.global_step.eval(), model.learning_rate.eval(), step_time, loss)) 340 | if FLAGS.profile: 341 | # Create the Timeline object, and write it to a json 342 | tl = timeline.Timeline(run_metadata.step_stats) 343 | ctf = tl.generate_chrome_trace_format() 344 | with open('timeline.json', 'w') as f: 345 | f.write(ctf) 346 | exit() 347 | 348 | # Decrease learning rate if no improvement was seen over last 3 times. 349 | if len(previous_losses) > 2 and loss > max(previous_losses[-3:]): 350 | sess.run(model.learning_rate_decay_op) 351 | previous_losses.append(loss) 352 | auc_train.append(auc) 353 | step_time, loss, auc = 0.0, 0.0, 0.0 354 | 355 | if not FLAGS.eval: 356 | continue 357 | # # Save checkpoint and zero timer and loss. 358 | # checkpoint_path = os.path.join(FLAGS.train_dir, "go.ckpt") 359 | # current_model = model.saver.save(sess, checkpoint_path, 360 | # global_step=model.global_step) 361 | 362 | # Run evals on development set and print their loss/auc. 
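      # The dev loop below walks data_va in batches of FLAGS.batch_size and
      # drops the incomplete tail batch. When training with the 'mw' loss the
      # dev loss is computed with 'warp' instead, presumably because the
      # sampled candidate set is only maintained during training.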
363 | l_va = len(data_va) 364 | eval_loss, eval_auc = 0.0, 0.0 365 | count_va = 0 366 | start_time = time.time() 367 | for idx_s in range(0, l_va, FLAGS.batch_size): 368 | idx_e = idx_s + FLAGS.batch_size 369 | if idx_e > l_va: 370 | break 371 | lt = data_va[idx_s:idx_e] 372 | user_va = [x[0] for x in lt] 373 | item_va_input = [items_dev[x[0]] for x in lt] 374 | item_va_input = map(list, zip(*item_va_input)) 375 | item_va = [x[1] for x in lt] 376 | 377 | the_loss = 'warp' if FLAGS.loss == 'mw' else FLAGS.loss 378 | eval_loss0 = model.step(sess, user_va, item_va_input, item_va, 379 | forward_only=True, loss=the_loss) 380 | eval_loss += eval_loss0 381 | count_va += 1 382 | eval_loss /= count_va 383 | eval_auc /= count_va 384 | step_time = (time.time() - start_time) / count_va 385 | if FLAGS.loss in ['ce', 'mce']: 386 | eval_ppx = math.exp(eval_loss) if eval_loss < 300 else float('inf') 387 | mylog(" dev: perplexity %.2f eval_auc %.4f step-time %.4f" % ( 388 | eval_ppx, eval_auc, step_time)) 389 | else: 390 | mylog(" dev: loss %.3f eval_auc %.4f step-time %.4f" % (eval_loss, 391 | eval_auc, step_time)) 392 | sys.stdout.flush() 393 | 394 | if eval_loss < best_loss and not FLAGS.test: 395 | best_loss = eval_loss 396 | patience = FLAGS.patience 397 | 398 | # Save checkpoint and zero timer and loss. 399 | checkpoint_path = os.path.join(FLAGS.train_dir, "best.ckpt") 400 | model.saver.save(sess, checkpoint_path, 401 | global_step=0, write_meta_graph = False) 402 | mylog('Saving best model...') 403 | 404 | if FLAGS.test: 405 | checkpoint_path = os.path.join(FLAGS.train_dir, "best.ckpt") 406 | model.saver.save(sess, checkpoint_path, 407 | global_step=0, write_meta_graph = False) 408 | mylog('Saving current model...') 409 | 410 | if eval_loss > best_loss: # and eval_auc < best_auc: 411 | patience -= 1 412 | 413 | auc_dev.append(eval_auc) 414 | losses_dev.append(eval_loss) 415 | 416 | if patience < 0 and not FLAGS.test: 417 | mylog("no improvement for too long.. 
terminating..") 418 | mylog("best auc %.4f" % best_auc) 419 | mylog("best loss %.4f" % best_loss) 420 | sys.stdout.flush() 421 | break 422 | return 423 | 424 | def recommend(raw_data=FLAGS.raw_data, test=FLAGS.test, loss=FLAGS.loss, 425 | batch_size=FLAGS.batch_size, topN=FLAGS.top_N_items, 426 | device_log=FLAGS.device_log): 427 | 428 | with tf.Session(config=tf.ConfigProto(allow_soft_placement=True, 429 | log_device_placement=device_log)) as sess: 430 | mylog("reading data") 431 | 432 | (_, items_dev, _, _, u_attributes, i_attributes, item_ind2logit_ind, 433 | logit_ind2item_ind, _, user_index, item_index) = get_data(raw_data, 434 | data_dir=FLAGS.data_dir) 435 | 436 | from evaluate import Evaluation as Evaluate 437 | 438 | evaluation = Evaluate(raw_data, test=test) 439 | 440 | model = create_model(sess, u_attributes, i_attributes, item_ind2logit_ind, 441 | logit_ind2item_ind, loss=loss, ind_item=None) 442 | 443 | Uinds = evaluation.get_uinds() 444 | N = len(Uinds) 445 | mylog("N = %d" % N) 446 | Uinds = [p for p in Uinds if p in items_dev] 447 | mylog("new N = {}, (reduced from original {})".format(len(Uinds), N)) 448 | if len(Uinds) < N: 449 | evaluation.set_uinds(Uinds) 450 | N = len(Uinds) 451 | rec = np.zeros((N, topN), dtype=int) 452 | count = 0 453 | time_start = time.time() 454 | for idx_s in range(0, N, batch_size): 455 | count += 1 456 | if count % 100 == 0: 457 | mylog("idx: %d, c: %d" % (idx_s, count)) 458 | 459 | idx_e = idx_s + batch_size 460 | if idx_e <= N: 461 | users = Uinds[idx_s: idx_e] 462 | items_input = [items_dev[u] for u in users] 463 | items_input = map(list, zip(*items_input)) 464 | recs = model.step(sess, users, items_input, forward_only=True, 465 | recommend = True, recommend_new = FLAGS.recommend_new) 466 | rec[idx_s:idx_e, :] = recs 467 | else: 468 | users = range(idx_s, N) + [0] * (idx_e - N) 469 | users = [Uinds[t] for t in users] 470 | items_input = [items_dev[u] for u in users] 471 | items_input = map(list, zip(*items_input)) 472 | recs = model.step(sess, users, items_input, forward_only=True, 473 | recommend = True, recommend_new = FLAGS.recommend_new) 474 | idx_e = N 475 | rec[idx_s:idx_e, :] = recs[:(idx_e-idx_s),:] 476 | # return rec: i: uinds[i] --> logid 477 | 478 | time_end = time.time() 479 | mylog("Time used %.1f" % (time_end - time_start)) 480 | 481 | ind2id = {} 482 | for iid in item_index: 483 | uind = item_index[iid] 484 | assert(uind not in ind2id) 485 | ind2id[uind] = iid 486 | 487 | uids = evaluation.get_uids() 488 | R = {} 489 | for i in xrange(N): 490 | uid = uids[i] 491 | R[uid] = [ind2id[logit_ind2item_ind[v]] for v in list(rec[i, :])] 492 | 493 | evaluation.eval_on(R) 494 | scores_self, scores_ex = evaluation.get_scores() 495 | mylog("====evaluation scores (NDCG, RECALL, PRECISION, MAP) @ 2,5,10,20,30====") 496 | mylog("METRIC_FORMAT (self): {}".format(scores_self)) 497 | mylog("METRIC_FORMAT (ex ): {}".format(scores_ex)) 498 | 499 | return 500 | 501 | def main(_): 502 | 503 | # logging.debug('This message should go to the log file') 504 | # logging.info('So should this') 505 | # logging.warning('And this, too') 506 | if FLAGS.test: 507 | if FLAGS.data_dir[-1] == '/': 508 | FLAGS.data_dir = FLAGS.data_dir[:-1] + '_test' 509 | else: 510 | FLAGS.data_dir = FLAGS.data_dir + '_test' 511 | 512 | if not os.path.exists(FLAGS.train_dir): 513 | os.mkdir(FLAGS.train_dir) 514 | if not FLAGS.recommend: 515 | log_path = os.path.join(FLAGS.train_dir,"log.txt") 516 | logging.basicConfig(filename=log_path,level=logging.DEBUG) 517 | train() 518 | 
else: 519 | log_path = os.path.join(FLAGS.train_dir,"log.recommend.txt") 520 | logging.basicConfig(filename=log_path,level=logging.DEBUG) 521 | recommend() 522 | return 523 | 524 | if __name__ == "__main__": 525 | tf.app.run() 526 | 527 | -------------------------------------------------------------------------------- /word2vec/skipgram_model.py: -------------------------------------------------------------------------------- 1 | 2 | from __future__ import absolute_import 3 | from __future__ import division 4 | from __future__ import print_function 5 | 6 | import tensorflow as tf 7 | import sys 8 | from linear_seq import LinearSeq 9 | sys.path.insert(0, '../attributes') 10 | import embed_attribute 11 | 12 | 13 | class Model(LinearSeq): 14 | def __init__(self, user_size, item_size, size, 15 | batch_size, learning_rate, 16 | learning_rate_decay_factor, 17 | user_attributes=None, 18 | item_attributes=None, 19 | item_ind2logit_ind=None, 20 | logit_ind2item_ind=None, 21 | n_input_items=0, 22 | loss_function='ce', 23 | logit_size_test=None, 24 | dropout=1.0, 25 | top_N_items=100, 26 | use_sep_item=True, 27 | n_sampled=None, 28 | output_feat=1, 29 | indices_item=None, 30 | dtype=tf.float32): 31 | 32 | self.user_size = user_size 33 | self.item_size = item_size 34 | self.top_N_items = top_N_items 35 | if user_attributes is not None: 36 | user_attributes.set_model_size(size) 37 | self.user_attributes = user_attributes 38 | if item_attributes is not None: 39 | item_attributes.set_model_size(size) 40 | self.item_attributes = item_attributes 41 | 42 | self.item_ind2logit_ind = item_ind2logit_ind 43 | self.logit_ind2item_ind = logit_ind2item_ind 44 | if logit_ind2item_ind is not None: 45 | self.logit_size = len(logit_ind2item_ind) 46 | if indices_item is not None: 47 | self.indices_item = indices_item 48 | else: 49 | self.indices_item = range(self.logit_size) 50 | self.logit_size_test = logit_size_test 51 | 52 | self.loss_function = loss_function 53 | self.n_input_items = n_input_items 54 | self.n_sampled = n_sampled 55 | self.batch_size = batch_size 56 | 57 | self.learning_rate = tf.Variable(float(learning_rate), trainable=False) 58 | self.learning_rate_decay_op = self.learning_rate.assign( 59 | self.learning_rate * learning_rate_decay_factor) 60 | self.global_step = tf.Variable(0, trainable=False) 61 | 62 | self.att_emb = None 63 | self.dtype=dtype 64 | 65 | mb = self.batch_size 66 | ''' this is mapped item target ''' 67 | self.item_target = tf.placeholder(tf.int32, shape = [mb], name = "item") 68 | self.item_id_target = tf.placeholder(tf.int32, shape = [mb], name = "item_id") 69 | 70 | self.dropout = dropout 71 | self.keep_prob = tf.constant(dropout, dtype=dtype) 72 | # tf.placeholder(tf.float32, name='keep_prob') 73 | 74 | n_input = max(n_input_items, 1) 75 | m = embed_attribute.EmbeddingAttribute(user_attributes, item_attributes, mb, 76 | self.n_sampled, n_input, use_sep_item, item_ind2logit_ind, logit_ind2item_ind) 77 | self.att_emb = m 78 | 79 | embedded_user, _ = m.get_batch_user(1.0, False) 80 | embedded_items = [] 81 | for i in range(n_input): 82 | embedded_item, _ = m.get_batch_item('input{}'.format(i), batch_size) 83 | embedded_item = tf.reduce_mean(embedded_item, 0) 84 | embedded_items.append(embedded_item) 85 | 86 | print("non-sampled prediction") 87 | input_embed = tf.reduce_mean([embedded_user, embedded_items[0]], 0) 88 | input_embed = tf.nn.dropout(input_embed, self.keep_prob) 89 | logits = m.get_prediction(input_embed, output_feat=output_feat) 90 | 91 | if self.n_input_items == 0: 92 | 
input_embed_test = embedded_user 93 | else: 94 | # including two cases: 1, n items. 2, end_line item 95 | # input_embed_test = [embedded_user] + embedded_items 96 | # input_embed_test = tf.reduce_mean(input_embed_test, 0) 97 | 98 | input_embed_test = [embedded_user] + [tf.reduce_mean(embedded_items, 0)] 99 | input_embed_test = tf.reduce_mean(input_embed_test, 0) 100 | logits_test = m.get_prediction(input_embed_test, output_feat=output_feat) 101 | 102 | # mini batch version 103 | print("sampled prediction") 104 | if self.n_sampled is not None: 105 | sampled_logits = m.get_prediction(input_embed, 'sampled', output_feat=output_feat) 106 | # embedded_item, item_b = m.get_sampled_item(self.n_sampled) 107 | # sampled_logits = tf.matmul(embedded_user, tf.transpose(embedded_item)) + item_b 108 | target_score = m.get_target_score(input_embed, self.item_id_target) 109 | 110 | loss = self.loss_function 111 | if loss in ['warp', 'ce', 'bbpr']: 112 | batch_loss = m.compute_loss(logits, self.item_target, loss) 113 | batch_loss_test = m.compute_loss(logits_test, self.item_target, loss) 114 | elif loss in ['mw']: 115 | batch_loss = m.compute_loss(sampled_logits, target_score, loss) 116 | batch_loss_test = m.compute_loss(logits_test, self.item_target, 'warp') # dev/test loss for 'mw' uses warp, matching run_w2v.py 117 | else: 118 | print("loss function not implemented: %s" % loss) 119 | sys.exit(-1) 120 | if loss in ['warp', 'mw', 'bbpr']: 121 | self.set_mask, self.reset_mask = m.get_warp_mask() 122 | 123 | self.loss = tf.reduce_mean(batch_loss) 124 | # self.loss_eval = tf.reduce_mean(batch_loss_test) if loss == 'mw' else self.loss 125 | self.loss_test = tf.reduce_mean(batch_loss_test) 126 | 127 | # Gradients and SGD update operation for training the model. 128 | params = tf.trainable_variables() 129 | opt = tf.train.AdagradOptimizer(self.learning_rate) 130 | # opt = tf.train.AdamOptimizer(self.learning_rate) 131 | gradients = tf.gradients(self.loss, params) 132 | self.updates = opt.apply_gradients( 133 | zip(gradients, params), global_step=self.global_step) 134 | 135 | self.output = logits_test 136 | values, self.indices = tf.nn.top_k(self.output, self.top_N_items, sorted=True) 137 | self.saver = tf.train.Saver(tf.all_variables()) 138 | --------------------------------------------------------------------------------
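For the `mw`/`mce` losses, run_w2v.py periodically resamples candidate items through `item_frequency(data_tr, power)` and `sample_items(item_population, FLAGS.n_sampled, p_item)`, helpers defined elsewhere in the toolkit and not shown above. The sketch below is only a plausible reading of what they compute, a popularity^power sampling distribution in the spirit of word2vec's unigram^0.75 trick; the actual implementations in utils/ may differ in detail.

```python
import numpy as np

def item_frequency_sketch(data_tr, power):
    # data_tr rows look like (user_ind, item_ind, ...); count item occurrences
    counts = np.bincount([row[1] for row in data_tr]).astype(np.float64)
    p_item = counts ** power          # power < 1 flattens the popularity distribution
    p_item /= p_item.sum()
    return list(range(len(counts))), p_item

def sample_items_sketch(item_population, n_sampled, p_item):
    # draw n_sampled candidate items (popular items more often) and return a
    # map from item index to its position in the sample; assumes n_sampled is
    # small relative to the number of items with non-zero probability
    sampled = np.random.choice(item_population, size=n_sampled,
                               replace=False, p=p_item)
    id2idx = {int(it): pos for pos, it in enumerate(sampled)}
    return list(sampled), id2idx
```

In run_w2v.py the distribution `p_item` is computed once per run, while the sample is refreshed every `FLAGS.n_resample` steps inside the training loop.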
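For readers unfamiliar with the overall shape of these models, here is a minimal, self-contained TensorFlow 1.x sketch of a skip-gram-style recommender objective with sampled softmax over items. It deliberately leaves out what `skipgram_model.Model` actually adds on top (attribute embeddings via `EmbeddingAttribute`, the `warp`/`mw`/`bbpr` losses, dropout, and the separate train/test prediction heads), and all sizes and names below are invented for illustration.

```python
import tensorflow as tf

n_users, n_items, dim, n_sampled = 1000, 500, 32, 64   # made-up sizes

users = tf.placeholder(tf.int32, [None], name="users")        # user indices
pos_items = tf.placeholder(tf.int64, [None, 1], name="items") # target item indices

user_emb = tf.Variable(tf.random_uniform([n_users, dim], -0.1, 0.1))
item_emb = tf.Variable(tf.random_uniform([n_items, dim], -0.1, 0.1))
item_bias = tf.Variable(tf.zeros([n_items]))

u = tf.nn.embedding_lookup(user_emb, users)                   # [batch, dim]

# sampled-softmax training loss over the item vocabulary
loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=item_emb, biases=item_bias,
    labels=pos_items, inputs=u,
    num_sampled=n_sampled, num_classes=n_items))
train_op = tf.train.AdagradOptimizer(0.1).minimize(loss)

# full logits for top-N recommendation, as in Model.output and tf.nn.top_k above
logits = tf.matmul(u, item_emb, transpose_b=True) + item_bias
_, top_items = tf.nn.top_k(logits, k=10, sorted=True)
```

The real model computes its prediction scores from attribute-aware user and item embeddings and switches between full and sampled logits depending on the configured loss, as seen in the listing above.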