├── README.md
├── args.py
├── build_submit.py
├── build_submit_py3.py
├── data_set_phase1
│   ├── ._profiles.csv
│   ├── ._test_plans.csv
│   ├── ._test_queries.csv
│   ├── ._train_clicks.csv
│   └── ._train_queries.csv
├── generate_test.py
├── generate_test_py3.py
├── infer.py
├── local_train.py
├── local_train_py3.py
├── map_reader.py
├── map_reader_mmh.py
├── network_confv6.py
├── networks
│   ├── network_conf.py
│   ├── network_confv4.py
│   └── network_confv6.py
├── out
│   └── readme
├── pre_process_test.py
├── pre_test_dense.py
├── preprocess.py
├── preprocess_dense.py
├── submit
│   └── readme
├── testres
│   └── readme
└── weather.json

/README.md:
--------------------------------------------------------------------------------
1 | # Paddle_baseline_KDD2019
2 | ## More information: https://github.com/PaddlePaddle/models/tree/develop/PaddleRec/ctr/Paddle_baseline_KDD2019
3 | Paddle baseline for the KDD2019 "Context-Aware Multi-Modal Transportation Recommendation" competition (https://dianshi.baidu.com/competition/29/question)
4 |
5 | This repository contains the demo code for the KDD2019 "Context-Aware Multi-Modal Transportation Recommendation" competition, written in Python with PaddlePaddle. Note that this repository is under active development, and everyone is welcome to contribute. The current baseline solution scores about 0.68 - 0.69 on online submission; as an example, my submission based on the networks programmed here with PaddlePaddle scored 0.6898.
6 | This baseline is published to encourage people to use PaddlePaddle and to build powerful recommendation models with it.
7 |
8 | The example code runs on Linux, Python 2.7, on a single machine with CPU. There were some compatibility issues with Python 3 (UPDATE: the code can now be run with Python 3; see the section "RUN ON Python3" below). Note that distributed training options are not provided here; if you want to learn more about them, please check the other model examples at https://github.com/PaddlePaddle/models. Regarding training speed: with a batch size of 1000, one epoch over all training instances generated from the raw data takes about 8 minutes with the SGD optimizer (relatively longer with the Adam optimizer).
9 |
10 | The configuration and training process of all the networks are fundamental; many optimizations can be applied on top of them to achieve better results, e.g. a better cost function, more powerful feature engineering, proper model validation, and NN optimization tricks.
11 |
12 | The code is rough and comes from my daily use; it will be cleaned up soon.
13 | ## Install PaddlePaddle
14 | Please visit the official PaddlePaddle site (http://www.paddlepaddle.org/documentation/docs/zh/1.4/beginners_guide/install/index_cn.html)
15 | ## preprocess feature
16 | ```python
17 | python preprocess_dense.py # change for a different feature strategy
18 | python pre_test_dense.py
19 | cd out
20 | split -a 2 -d -l 200000 normed_train.txt normed_train
21 | ```
22 | preprocess.py and preprocess_dense.py are the scripts for preprocessing the raw training data; two versions are provided, one for all-sparse features and one for sparse plus dense features. Correspondingly, pre_process_test.py and pre_test_dense.py are the scripts to preprocess the raw test data. The training instances are saved as JSON, one object per line, which makes it easy to add new features. In this demo, all features are generated from the provided raw data except the weather features, which are generated from open weather records (weather.json).
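For reference, each line of the generated training file is a single JSON instance. A minimal sketch of the fields that map_reader.py consumes is shown below (field names are taken from map_reader.py; the values are illustrative, not real data):
```python
{
    "session_id": "0",      # carried through to the test instances; used by build_submit.py
    "pid": "",              # empty profile ids are mapped to 0 by the reader
    "label": 1,             # 1 if this plan's transport_mode was the clicked one
    "plan": {"distance": 0.32, "price": 0.12, "eta": 0.56, "transport_mode": 1},
    "query": {"weekday": 3, "hour": 9, "o1": 116.3, "o2": 39.9, "d1": 116.4, "d2": 40.0},
    "plan_rank": 1, "whole_rank": 1, "price_rank": 1, "eta_rank": 1, "distance_rank": 1,
    "mode_rank1": 1, "mode_rank2": 2, "mode_rank3": 3, "mode_rank4": 4, "mode_rank5": 5,
    "weather": {"max_temp": 20, "min_temp": 10, "wea": 1, "wind": 2}
}
```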
23 | Note that the features generated in this step need to match the model's input format, so make sure you use the matching version. In the demo code, the sparse plus dense features are used with network_confv6.py.
24 |
25 | ## build the network
26 | The main network logic is in network_confv?.py. The networks are based on FM- and deep-learning-related algorithms. I tried several networks and publish some of them here. There may be some defects in the networks, but all of them are functional.
27 |
28 | ## train the network
29 | ```python
30 | python local_train.py
31 | ```
32 | local_train.py and map_reader.py use the dataset API, so you need to download the corresponding .whl package or build PaddlePaddle from the develop branch. The reason for using it is that data feeding is much faster.
33 | Note that the input format fed into the network is self-defined; make sure training and test build the same format.
34 |
35 | ## test results
36 | ```python
37 | python generate_test.py
38 | python build_submit.py
39 | ```
40 | In generate_test.py and build_submit.py, for convenience, the whole training set is used to train the network, and the trained network is then tested on the provided unlabeled data.
41 | ## RUN ON Python3
42 | To run on Python 3, use the Python files with the _py3 postfix below, and keep everything else the same as in Python 2.
43 | ```python
44 | python local_train_py3.py
45 | python generate_test_py3.py
46 | python build_submit_py3.py
47 | ```
48 |
49 |
50 |
51 |
52 |
--------------------------------------------------------------------------------
/args.py:
--------------------------------------------------------------------------------
1 | import argparse
2 |
3 | def parse_args():
4 | parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
5 | parser.add_argument(
6 | '--train_data_path',
7 | type=str,
8 | default='./data/raw/train.txt',
9 | help="The path of the training dataset")
10 | parser.add_argument(
11 | '--test_data_path',
12 | type=str,
13 | default='./data/raw/valid.txt',
14 | help="The path of the testing dataset")
15 | parser.add_argument(
16 | '--batch_size',
17 | type=int,
18 | default=1000,
19 | help="The size of the mini-batch (default: 1000)")
20 | parser.add_argument(
21 | '--embedding_size',
22 | type=int,
23 | default=16,
24 | help="The size of the embedding layer (default: 16)")
25 | parser.add_argument(
26 | '--num_passes',
27 | type=int,
28 | default=10,
29 | help="The number of passes to train (default: 10)")
30 | parser.add_argument(
31 | '--model_output_dir',
32 | type=str,
33 | default='models',
34 | help='The path to store the model (default: models)')
35 | parser.add_argument(
36 | '--sparse_feature_dim',
37 | type=int,
38 | default=1000001,
39 | help='Sparse feature hashing space for index processing')
40 | parser.add_argument(
41 | '--is_local',
42 | type=int,
43 | default=1,
44 | help='Local train or distributed train (default: 1)')
45 | parser.add_argument(
46 | '--cloud_train',
47 | type=int,
48 | default=0,
49 | help='Local train or distributed train on paddlecloud (default: 0)')
50 | parser.add_argument(
51 | '--async_mode',
52 | action='store_true',
53 | default=False,
54 | help='Whether to start pserver in async mode to support ASGD')
55 | parser.add_argument(
56 | '--no_split_var',
57 | action='store_true',
58 | default=False,
59 | help='Whether to split variables into blocks when update_method is pserver')
60 | parser.add_argument(
61 | '--role',
62 | type=str,
63 | default='pserver', # trainer or pserver
64 | help='The role of the current node: trainer or pserver (default: pserver)')
65 | parser.add_argument(
66 | '--endpoints',
67 | type=str,
68 | default='127.0.0.1:6000',
69 | help='The pserver endpoints, like: 127.0.0.1:6000,127.0.0.1:6001')
70 | parser.add_argument(
71 | '--current_endpoint',
72 | type=str,
73 | default='127.0.0.1:6000',
74 | help='The endpoint of the current pserver (default: 127.0.0.1:6000)')
75 | parser.add_argument(
76 | '--trainer_id',
77 | type=int,
78 | default=0,
79 | help='The id of the current trainer (default: 0)')
80 | parser.add_argument(
81 | '--trainers',
82 | type=int,
83 | default=1,
84 | help='The number of trainers (default: 1)')
85 | return parser.parse_args()
86 |
--------------------------------------------------------------------------------
/build_submit.py:
--------------------------------------------------------------------------------
1 | import json
2 | import csv
3 | import io
4 |
5 |
6 | def build():
7 | submit_map = {}
8 | with io.open('./submit/submit.csv', 'wb') as csv_file:
9 | writer = csv.writer(csv_file, delimiter=',')
10 | writer.writerow(['sid', 'recommend_mode'])
11 | # choose the result file you want to build the submission from
12 | with open('./out/normed_test_session.txt', 'r') as f1:
13 | with open('./testres/res8', 'r') as f2:
14 | cur_session = ''
15 | for x, y in zip(f1.readlines(), f2.readlines()):
16 | m1 = json.loads(x)
17 | session_id = m1["session_id"]
18 | if cur_session == '':
19 | cur_session = session_id
20 |
21 | transport_mode = m1["plan"]["transport_mode"]
22 |
23 | if cur_session != session_id:
24 | writer.writerow([str(cur_session), str(submit_map[cur_session]["transport_mode"])])
25 | cur_session = session_id
26 | if session_id not in submit_map:
27 |
submit_map[session_id] = {} 28 | submit_map[session_id]["transport_mode"] = transport_mode 29 | submit_map[session_id]["probability"] = y 30 | #if int(submit_map[session_id]["transport_mode"]) == 0 and submit_map[session_id]["probability"] > 0.02: 31 | #submit_map[session_id]["probability"] = 0.99 32 | else: 33 | if float(y) > float(submit_map[session_id]["probability"]): 34 | submit_map[session_id]["transport_mode"] = transport_mode 35 | submit_map[session_id]["probability"] = y 36 | #if int(submit_map[session_id]["transport_mode"]) == 0 and submit_map[session_id]["probability"] > 0.02: 37 | #submit_map[session_id]["transport_mode"] = 0 38 | #submit_map[session_id]["probability"] = 0.99 39 | 40 | 41 | writer.writerow([cur_session, submit_map[cur_session]["transport_mode"]]) 42 | 43 | 44 | 45 | if __name__ == "__main__": 46 | build() 47 | -------------------------------------------------------------------------------- /data_set_phase1/._profiles.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yaoxuefeng6/Paddle_baseline_KDD2019/dd7f8f6016f8457cac06ae0bdfb006bc0682457d/data_set_phase1/._profiles.csv -------------------------------------------------------------------------------- /data_set_phase1/._test_plans.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yaoxuefeng6/Paddle_baseline_KDD2019/dd7f8f6016f8457cac06ae0bdfb006bc0682457d/data_set_phase1/._test_plans.csv -------------------------------------------------------------------------------- /data_set_phase1/._test_queries.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yaoxuefeng6/Paddle_baseline_KDD2019/dd7f8f6016f8457cac06ae0bdfb006bc0682457d/data_set_phase1/._test_queries.csv -------------------------------------------------------------------------------- /data_set_phase1/._train_clicks.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yaoxuefeng6/Paddle_baseline_KDD2019/dd7f8f6016f8457cac06ae0bdfb006bc0682457d/data_set_phase1/._train_clicks.csv -------------------------------------------------------------------------------- /data_set_phase1/._train_queries.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yaoxuefeng6/Paddle_baseline_KDD2019/dd7f8f6016f8457cac06ae0bdfb006bc0682457d/data_set_phase1/._train_queries.csv -------------------------------------------------------------------------------- /generate_test.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
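# generate_test.py loads an inference model saved by local_train.py
# (<model_path>/epoch<k+1>.model), runs it over ./out/normed_test_session.txt,
# and writes the predicted click probability of each candidate plan, one per
# line, to ./testres/res<k>; build_submit.py then keeps, for every session,
# the transport mode with the highest probability.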
14 | 15 | 16 | import argparse 17 | import logging 18 | import numpy as np 19 | # disable gpu training for this example 20 | import os 21 | 22 | os.environ["CUDA_VISIBLE_DEVICES"] = "" 23 | import paddle 24 | import paddle.fluid as fluid 25 | logging.basicConfig( 26 | format='%(asctime)s - %(levelname)s - %(message)s') 27 | logger = logging.getLogger("fluid") 28 | logger.setLevel(logging.INFO) 29 | num_context_feature = 22 30 | 31 | def parse_args(): 32 | parser = argparse.ArgumentParser(description="PaddlePaddle DeepFM example") 33 | parser.add_argument( 34 | '--model_path', 35 | type=str, 36 | #required=True, 37 | default='models', 38 | help="The path of model parameters gz file") 39 | parser.add_argument( 40 | '--data_path', 41 | type=str, 42 | required=False, 43 | help="The path of the dataset to infer") 44 | parser.add_argument( 45 | '--embedding_size', 46 | type=int, 47 | default=16, 48 | help="The size for embedding layer (default:10)") 49 | parser.add_argument( 50 | '--sparse_feature_dim', 51 | type=int, 52 | default=1000001, 53 | help="The size for embedding layer (default:1000001)") 54 | parser.add_argument( 55 | '--batch_size', 56 | type=int, 57 | default=1000, 58 | help="The size of mini-batch (default:1000)") 59 | 60 | return parser.parse_args() 61 | 62 | def to_lodtensor(data, place): 63 | seq_lens = [len(seq) for seq in data] 64 | cur_len = 0 65 | lod = [cur_len] 66 | for l in seq_lens: 67 | cur_len += l 68 | lod.append(cur_len) 69 | flattened_data = np.concatenate(data, axis=0).astype("int64") 70 | flattened_data = flattened_data.reshape([len(flattened_data), 1]) 71 | res = fluid.LoDTensor() 72 | res.set(flattened_data, place) 73 | res.set_lod([lod]) 74 | 75 | 76 | return res 77 | 78 | 79 | def data2tensor(data, place): 80 | feed_dict = {} 81 | dense = data[0] 82 | sparse = data[1:-1] 83 | y = data[-1] 84 | #user_data = np.array([x[0] for x in data]).astype("float32") 85 | #user_data = user_data.reshape([-1, 10]) 86 | #feed_dict["user_profile"] = user_data 87 | dense_data = np.array([x[0] for x in data]).astype("float32") 88 | dense_data = dense_data.reshape([-1, 3]) 89 | feed_dict["dense_feature"] = dense_data 90 | for i in range(num_context_feature): 91 | sparse_data = to_lodtensor([x[1 + i] for x in data], place) 92 | feed_dict["context" + str(i)] = sparse_data 93 | 94 | context_fm = to_lodtensor(np.array([x[-2] for x in data]).astype("float32"), place) 95 | 96 | feed_dict["context_fm"] = context_fm 97 | y_data = np.array([x[-1] for x in data]).astype("int64") 98 | y_data = y_data.reshape([-1, 1]) 99 | feed_dict["label"] = y_data 100 | return feed_dict 101 | 102 | def test(): 103 | args = parse_args() 104 | 105 | place = fluid.CPUPlace() 106 | test_scope = fluid.core.Scope() 107 | 108 | # filelist = ["%s/%s" % (args.data_path, x) for x in os.listdir(args.data_path)] 109 | from map_reader import MapDataset 110 | map_dataset = MapDataset() 111 | map_dataset.setup(args.sparse_feature_dim) 112 | exe = fluid.Executor(place) 113 | 114 | whole_filelist = ["./out/normed_test_session.txt"] 115 | test_files = whole_filelist[int(0.0 * len(whole_filelist)):int(1.0 * len(whole_filelist))] 116 | 117 | #set how many epochs runing for infer 118 | epochs = 1 119 | 120 | for i in range(epochs): 121 | cur_model_path = args.model_path + "/epoch" + str(i + 1) + ".model" 122 | with open("./testres/res" + str(i), 'w') as r: 123 | with fluid.scope_guard(test_scope): 124 | [inference_program, feed_target_names, fetch_targets] = \ 125 | fluid.io.load_inference_model(cur_model_path, exe) 126 | 127 
| test_reader = map_dataset.test_reader(test_files, 1000, 100000) 128 | k = 0 129 | for batch_id, data in enumerate(test_reader()): 130 | print(len(data[0])) 131 | feed_dict = data2tensor(data, place) 132 | loss_val, auc_val, accuracy, predict, _ = exe.run(inference_program, 133 | feed=feed_dict, 134 | fetch_list=fetch_targets, return_numpy=False) 135 | 136 | x = np.array(predict) 137 | for j in range(x.shape[0]): 138 | r.write(str(x[j][1])) 139 | r.write("\n") 140 | 141 | 142 | if __name__ == '__main__': 143 | test() 144 | -------------------------------------------------------------------------------- /generate_test_py3.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | 16 | import argparse 17 | import logging 18 | import numpy as np 19 | # disable gpu training for this example 20 | import os, sys 21 | 22 | #this is set for python3 environment and the paddle install path is only use for jarvis platform 23 | sys.path.append("/usr/local/cuda-9.2/conda/envs/py36-paddle/lib/python3.6/site-packages/paddle/fluid") 24 | sys.path.append("/usr/local/cuda-9.2/conda/envs/py36-paddle/lib/python3.6/site-packages/paddle/fluid/proto") 25 | 26 | os.environ["CUDA_VISIBLE_DEVICES"] = "" 27 | import paddle 28 | import paddle.fluid as fluid 29 | logging.basicConfig( 30 | format='%(asctime)s - %(levelname)s - %(message)s') 31 | logger = logging.getLogger("fluid") 32 | logger.setLevel(logging.INFO) 33 | num_context_feature = 22 34 | 35 | def parse_args(): 36 | parser = argparse.ArgumentParser(description="PaddlePaddle DeepFM example") 37 | parser.add_argument( 38 | '--model_path', 39 | type=str, 40 | #required=True, 41 | default='models', 42 | help="The path of model parameters gz file") 43 | parser.add_argument( 44 | '--data_path', 45 | type=str, 46 | required=False, 47 | help="The path of the dataset to infer") 48 | parser.add_argument( 49 | '--embedding_size', 50 | type=int, 51 | default=16, 52 | help="The size for embedding layer (default:10)") 53 | parser.add_argument( 54 | '--sparse_feature_dim', 55 | type=int, 56 | default=1000001, 57 | help="The size for embedding layer (default:1000001)") 58 | parser.add_argument( 59 | '--batch_size', 60 | type=int, 61 | default=1000, 62 | help="The size of mini-batch (default:1000)") 63 | 64 | return parser.parse_args() 65 | 66 | def to_lodtensor(data, place): 67 | seq_lens = [len(seq) for seq in data] 68 | cur_len = 0 69 | lod = [cur_len] 70 | for l in seq_lens: 71 | cur_len += l 72 | lod.append(cur_len) 73 | flattened_data = np.concatenate(data, axis=0).astype("int64") 74 | flattened_data = flattened_data.reshape([len(flattened_data), 1]) 75 | res = fluid.LoDTensor() 76 | res.set(flattened_data, place) 77 | res.set_lod([lod]) 78 | 79 | 80 | return res 81 | 82 | 83 | def data2tensor(data, place): 84 | feed_dict = {} 85 | dense = data[0] 86 | sparse = data[1:-1] 87 | y = 
data[-1] 88 | #user_data = np.array([x[0] for x in data]).astype("float32") 89 | #user_data = user_data.reshape([-1, 10]) 90 | #feed_dict["user_profile"] = user_data 91 | dense_data = np.array([x[0] for x in data]).astype("float32") 92 | dense_data = dense_data.reshape([-1, 3]) 93 | feed_dict["dense_feature"] = dense_data 94 | for i in range(num_context_feature): 95 | sparse_data = to_lodtensor([x[1 + i] for x in data], place) 96 | feed_dict["context" + str(i)] = sparse_data 97 | 98 | context_fm = to_lodtensor(np.array([x[-2] for x in data]).astype("float32"), place) 99 | 100 | feed_dict["context_fm"] = context_fm 101 | y_data = np.array([x[-1] for x in data]).astype("int64") 102 | y_data = y_data.reshape([-1, 1]) 103 | feed_dict["label"] = y_data 104 | return feed_dict 105 | 106 | def test(): 107 | args = parse_args() 108 | 109 | place = fluid.CPUPlace() 110 | test_scope = fluid.core.Scope() 111 | 112 | # filelist = ["%s/%s" % (args.data_path, x) for x in os.listdir(args.data_path)] 113 | from map_reader_mmh import MapDataset 114 | map_dataset = MapDataset() 115 | map_dataset.setup(args.sparse_feature_dim) 116 | exe = fluid.Executor(place) 117 | 118 | whole_filelist = ["./out/normed_test_session.txt"] 119 | test_files = whole_filelist[int(0.0 * len(whole_filelist)):int(1.0 * len(whole_filelist))] 120 | 121 | # set how many epochs runing for infer 122 | epochs = 1 123 | 124 | for i in range(epochs): 125 | cur_model_path = args.model_path + "/epoch" + str(i + 1) + ".model" 126 | with open("./testres/res" + str(i), 'w') as r: 127 | with fluid.scope_guard(test_scope): 128 | [inference_program, feed_target_names, fetch_targets] = \ 129 | fluid.io.load_inference_model(cur_model_path, exe) 130 | 131 | test_reader = map_dataset.test_reader(test_files, 1000, 100000) 132 | k = 0 133 | for batch_id, data in enumerate(test_reader()): 134 | print(len(data[0])) 135 | feed_dict = data2tensor(data, place) 136 | loss_val, auc_val, accuracy, predict, _ = exe.run(inference_program, 137 | feed=feed_dict, 138 | fetch_list=fetch_targets, return_numpy=False) 139 | 140 | x = np.array(predict) 141 | for j in range(x.shape[0]): 142 | r.write(str(x[j][1])) 143 | r.write("\n") 144 | 145 | 146 | if __name__ == '__main__': 147 | test() 148 | -------------------------------------------------------------------------------- /infer.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import logging 3 | 4 | import numpy as np 5 | # disable gpu training for this example 6 | import os 7 | 8 | os.environ["CUDA_VISIBLE_DEVICES"] = "" 9 | import paddle 10 | import paddle.fluid as fluid 11 | 12 | import map_reader 13 | from network_conf import ctr_deepfm_dataset 14 | 15 | logging.basicConfig( 16 | format='%(asctime)s - %(levelname)s - %(message)s') 17 | logger = logging.getLogger("fluid") 18 | logger.setLevel(logging.INFO) 19 | 20 | 21 | def parse_args(): 22 | parser = argparse.ArgumentParser(description="PaddlePaddle DeepFM example") 23 | parser.add_argument( 24 | '--model_path', 25 | type=str, 26 | #required=True, 27 | default='models', 28 | help="The path of model parameters gz file") 29 | parser.add_argument( 30 | '--data_path', 31 | type=str, 32 | required=False, 33 | help="The path of the dataset to infer") 34 | parser.add_argument( 35 | '--embedding_size', 36 | type=int, 37 | default=16, 38 | help="The size for embedding layer (default:10)") 39 | parser.add_argument( 40 | '--sparse_feature_dim', 41 | type=int, 42 | default=1000001, 43 | help="The size for embedding 
layer (default:1000001)") 44 | parser.add_argument( 45 | '--batch_size', 46 | type=int, 47 | default=1000, 48 | help="The size of mini-batch (default:1000)") 49 | 50 | return parser.parse_args() 51 | 52 | 53 | def to_lodtensor(data, place): 54 | seq_lens = [len(seq) for seq in data] 55 | cur_len = 0 56 | lod = [cur_len] 57 | for l in seq_lens: 58 | cur_len += l 59 | lod.append(cur_len) 60 | flattened_data = np.concatenate(data, axis=0).astype("int64") 61 | flattened_data = flattened_data.reshape([len(flattened_data), 1]) 62 | res = fluid.LoDTensor() 63 | res.set(flattened_data, place) 64 | res.set_lod([lod]) 65 | return res 66 | 67 | 68 | def data2tensor(data, place): 69 | feed_dict = {} 70 | test_dict = {} 71 | dense = data[0] 72 | sparse = data[1:-1] 73 | y = data[-1] 74 | dense_data = np.array([x[0] for x in data]).astype("float32") 75 | dense_data = dense_data.reshape([-1, 65]) 76 | feed_dict["user_profile"] = dense_data 77 | for i in range(10): 78 | sparse_data = to_lodtensor([x[1 + i] for x in data], place) 79 | feed_dict["context" + str(i)] = sparse_data 80 | 81 | y_data = np.array([x[-1] for x in data]).astype("int64") 82 | y_data = y_data.reshape([-1, 1]) 83 | feed_dict["label"] = y_data 84 | test_dict["test"] = [1] 85 | return feed_dict, test_dict 86 | 87 | 88 | def infer(): 89 | args = parse_args() 90 | 91 | place = fluid.CPUPlace() 92 | inference_scope = fluid.core.Scope() 93 | 94 | filelist = ["%s/%s" % (args.data_path, x) for x in os.listdir(args.data_path)] 95 | from map_reader import MapDataset 96 | map_dataset = MapDataset() 97 | map_dataset.setup(args.sparse_feature_dim) 98 | exe = fluid.Executor(place) 99 | 100 | whole_filelist = ["raw_data/part-%d" % x for x in range(len(os.listdir("raw_data")))] 101 | #whole_filelist = ["./out/normed_train09", "./out/normed_train10", "./out/normed_train11"] 102 | test_files = whole_filelist[int(0.0 * len(whole_filelist)):int(1.0 * len(whole_filelist))] 103 | 104 | # file_groups = [whole_filelist[i:i+train_thread_num] for i in range(0, len(whole_filelist), train_thread_num)] 105 | 106 | def set_zero(var_name): 107 | param = inference_scope.var(var_name).get_tensor() 108 | param_array = np.zeros(param._get_dims()).astype("int64") 109 | param.set(param_array, place) 110 | 111 | epochs = 2 112 | for i in range(epochs): 113 | cur_model_path = args.model_path + "/epoch" + str(i + 1) + ".model" 114 | with fluid.scope_guard(inference_scope): 115 | [inference_program, feed_target_names, fetch_targets] = \ 116 | fluid.io.load_inference_model(cur_model_path, exe) 117 | auc_states_names = ['_generated_var_2', '_generated_var_3'] 118 | for name in auc_states_names: 119 | set_zero(name) 120 | 121 | test_reader = map_dataset.infer_reader(test_files, 1000, 100000) 122 | for batch_id, data in enumerate(test_reader()): 123 | loss_val, auc_val, accuracy, predict, label = exe.run(inference_program, 124 | feed=data2tensor(data, place), 125 | fetch_list=fetch_targets, return_numpy=False) 126 | 127 | #print(np.array(predict)) 128 | #x = np.array(predict) 129 | #print(.shape)x 130 | #print("train_pass_%d, test_pass_%d\t%f\t" % (i - 1, i, auc_val)) 131 | 132 | 133 | if __name__ == '__main__': 134 | infer() 135 | -------------------------------------------------------------------------------- /local_train.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | 3 | from args import parse_args 4 | import os 5 | import paddle.fluid as fluid 6 | import sys 7 | from network_confv6 import 
ctr_deepfm_dataset
8 |
9 |
10 | NUM_CONTEXT_FEATURE = 22
11 | DIM_USER_PROFILE = 10
12 | DIM_DENSE_FEATURE = 3
13 | PYTHON_PATH = "/home/yaoxuefeng/whls/paddle_release_home/python/bin/python" # this is my local python path; change it to yours
14 |
15 | def train():
16 | args = parse_args()
17 | if not os.path.isdir(args.model_output_dir):
18 | os.mkdir(args.model_output_dir)
19 |
20 | #set the input format for our model. Note that you need to modify these carefully when defining a new network
21 | #user_profile = fluid.layers.data(
22 | #name="user_profile", shape=[DIM_USER_PROFILE], dtype='int64', lod_level=1)
23 | dense_feature = fluid.layers.data(
24 | name="dense_feature", shape=[DIM_DENSE_FEATURE], dtype='float32')
25 | context_feature = [
26 | fluid.layers.data(name="context" + str(i), shape=[1], lod_level=1, dtype="int64")
27 | for i in range(0, NUM_CONTEXT_FEATURE)]
28 | context_feature_fm = fluid.layers.data(
29 | name="context_fm", shape=[1], dtype='int64', lod_level=1)
30 | label = fluid.layers.data(name='label', shape=[1], dtype='int64')
31 |
32 | print("ready to network")
33 | #self-defined network
34 | loss, auc_var, batch_auc_var, accuracy, predict = ctr_deepfm_dataset(dense_feature, context_feature, context_feature_fm, label,
35 | args.embedding_size, args.sparse_feature_dim)
36 |
37 | print("ready to optimize")
38 | optimizer = fluid.optimizer.SGD(learning_rate=1e-4)
39 | optimizer.minimize(loss)
40 | #single machine CPU training; for more training options please visit the PaddlePaddle site
41 | exe = fluid.Executor(fluid.CPUPlace())
42 | exe.run(fluid.default_startup_program())
43 | #use the dataset api for much faster data feeding
44 | dataset = fluid.DatasetFactory().create_dataset()
45 | dataset.set_use_var([dense_feature] + context_feature + [context_feature_fm] + [label])
46 | #self-define how to process the generated training instances in map_reader.py
47 | pipe_command = PYTHON_PATH + " map_reader.py %d" % args.sparse_feature_dim
48 | dataset.set_pipe_command(pipe_command)
49 | dataset.set_batch_size(args.batch_size)
50 | thread_num = 1
51 | dataset.set_thread(thread_num)
52 | #self-define how the training files are split, e.g. "split -a 2 -d -l 200000 normed_train.txt normed_train"
53 | whole_filelist = ["./out/normed_train%d" % x for x in range(len(os.listdir("out")))]
54 | whole_filelist = ["./out/normed_train00", "./out/normed_train01", "./out/normed_train02", "./out/normed_train03",
55 | "./out/normed_train04", "./out/normed_train05", "./out/normed_train06", "./out/normed_train07",
56 | "./out/normed_train08",
57 | "./out/normed_train09", "./out/normed_train10", "./out/normed_train11"]
58 | print("ready to epochs")
59 | epochs = 10
60 | for i in range(epochs):
61 | print("start %dth epoch" % i)
62 | dataset.set_filelist(whole_filelist[:int(len(whole_filelist))])
63 | #print the information you want by setting fetch_list and fetch_info
64 | exe.train_from_dataset(program=fluid.default_main_program(),
65 | dataset=dataset,
66 | fetch_list=[auc_var, accuracy, predict, label],
67 | fetch_info=["auc", "accuracy", "predict", "label"],
68 | debug=False)
69 | model_dir = args.model_output_dir + '/epoch' + str(i + 1) + ".model"
70 | sys.stderr.write("epoch%d finished" % (i + 1))
71 | #save model
72 | fluid.io.save_inference_model(model_dir, [dense_feature.name] + [x.name for x in context_feature] + [context_feature_fm.name] + [label.name],
73 | [loss, auc_var, accuracy, predict, label], exe)
74 |
75 |
76 | if __name__ == '__main__':
77 | train()
78 |
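The pipe command above streams each training file line-by-line through map_reader.py, which turns one JSON instance into the dataset's slot format. As a quick sanity check of that reader outside the dataset pipeline, a minimal sketch (assuming the split files already exist under ./out/):

```python
from map_reader import MapDataset

reader = MapDataset()
reader.setup(1000001)  # same value as --sparse_feature_dim in args.py
with open("./out/normed_train00") as f:
    dense, sparse, sparse_fm, label = reader._process_line(next(f))
print(len(dense), len(sparse), len(sparse_fm), label)  # expect: 3 22 22 [0] or [1]
```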
-------------------------------------------------------------------------------- /local_train_py3.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | 3 | from args import parse_args 4 | import os 5 | import paddle.fluid as fluid 6 | import sys 7 | from network_confv6 import ctr_deepfm_dataset 8 | 9 | #this is set for python3 environment and the paddle install path is only use for jarvis platform 10 | sys.path.append("/usr/local/cuda-9.2/conda/envs/py36-paddle/lib/python3.6/site-packages/paddle/fluid") 11 | sys.path.append("/usr/local/cuda-9.2/conda/envs/py36-paddle/lib/python3.6/site-packages/paddle/fluid/proto") 12 | 13 | NUM_CONTEXT_FEATURE = 22 14 | DIM_USER_PROFILE = 10 15 | DIM_DENSE_FEATURE = 3 16 | #PYTHON_PATH = "/home/yaoxuefeng/whls/paddle_release_home/python/bin/python" # this is mine change your own python path 17 | PYTHON_PATH = "python" 18 | def train(): 19 | args = parse_args() 20 | if not os.path.isdir(args.model_output_dir): 21 | os.mkdir(args.model_output_dir) 22 | 23 | #set the input format for our model. Note that you need to carefully modify them when you define a new network 24 | #user_profile = fluid.layers.data( 25 | #name="user_profile", shape=[DIM_USER_PROFILE], dtype='int64', lod_level=1) 26 | dense_feature = fluid.layers.data( 27 | name="dense_feature", shape=[DIM_DENSE_FEATURE], dtype='float32') 28 | context_feature = [ 29 | fluid.layers.data(name="context" + str(i), shape=[1], lod_level=1, dtype="int64") 30 | for i in range(0, NUM_CONTEXT_FEATURE)] 31 | context_feature_fm = fluid.layers.data( 32 | name="context_fm", shape=[1], dtype='int64', lod_level=1) 33 | label = fluid.layers.data(name='label', shape=[1], dtype='int64') 34 | 35 | print("ready to network") 36 | #self define network 37 | loss, auc_var, batch_auc_var, accuracy, predict = ctr_deepfm_dataset(dense_feature, context_feature, context_feature_fm, label, 38 | args.embedding_size, args.sparse_feature_dim) 39 | 40 | print("ready to optimize") 41 | optimizer = fluid.optimizer.SGD(learning_rate=1e-4) 42 | optimizer.minimize(loss) 43 | #single machine CPU training. 
for more training options please visit the PaddlePaddle site
44 | exe = fluid.Executor(fluid.CPUPlace())
45 | exe.run(fluid.default_startup_program())
46 | #use the dataset api for much faster data feeding
47 | dataset = fluid.DatasetFactory().create_dataset()
48 | dataset.set_use_var([dense_feature] + context_feature + [context_feature_fm] + [label])
49 | #self-define how to process the generated training instances in map_reader_mmh.py
50 | pipe_command = PYTHON_PATH + " map_reader_mmh.py %d" % args.sparse_feature_dim
51 | dataset.set_pipe_command(pipe_command)
52 | dataset.set_batch_size(args.batch_size)
53 | #set thread num no larger than the length of the filelist for multi-threaded data loading
54 | thread_num = 12
55 | dataset.set_thread(thread_num)
56 | #self-define how the training files are split, e.g. "split -a 2 -d -l 200000 normed_train.txt normed_train"
57 | whole_filelist = ["./out/normed_train%d" % x for x in range(len(os.listdir("out")))]
58 | whole_filelist = ["./out/normed_train00", "./out/normed_train01", "./out/normed_train02", "./out/normed_train03",
59 | "./out/normed_train04", "./out/normed_train05", "./out/normed_train06", "./out/normed_train07",
60 | "./out/normed_train08",
61 | "./out/normed_train09", "./out/normed_train10", "./out/normed_train11"]
62 | print("ready to epochs")
63 | epochs = 10
64 | for i in range(epochs):
65 | print("start %dth epoch" % i)
66 | dataset.set_filelist(whole_filelist[:int(len(whole_filelist))])
67 | #print the information you want by setting fetch_list and fetch_info
68 | exe.train_from_dataset(program=fluid.default_main_program(),
69 | dataset=dataset,
70 | fetch_list=[auc_var, accuracy, predict, label],
71 | fetch_info=["auc", "accuracy", "predict", "label"],
72 | debug=False)
73 | model_dir = args.model_output_dir + '/epoch' + str(i + 1) + ".model"
74 | sys.stderr.write("epoch%d finished" % (i + 1))
75 | #save model
76 | fluid.io.save_inference_model(model_dir, [dense_feature.name] + [x.name for x in context_feature] + [context_feature_fm.name] + [label.name],
77 | [loss, auc_var, accuracy, predict, label], exe)
78 |
79 |
80 | if __name__ == '__main__':
81 | train()
82 |
--------------------------------------------------------------------------------
/map_reader.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
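# NOTE: this reader hashes raw feature strings with Python's built-in hash().
# Under Python 3, string hashing is salted per process (PYTHONHASHSEED), so
# feature ids would not be stable across runs; the Python 3 pipeline therefore
# uses map_reader_mmh.py, which hashes with mmh3 instead.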
14 | 15 | import sys 16 | import json 17 | import paddle.fluid.incubate.data_generator as dg 18 | 19 | 20 | class MapDataset(dg.MultiSlotDataGenerator): 21 | def setup(self, sparse_feature_dim): 22 | self.profile_length = 65 23 | self.dense_length = 3 24 | #feature names 25 | self.dense_feature_list = ["distance", "price", "eta"] 26 | 27 | self.pid_list = ["pid"] 28 | self.query_feature_list = ["weekday", "hour", "o1", "o2", "d1", "d2"] 29 | self.plan_feature_list = ["transport_mode"] 30 | self.rank_feature_list = ["plan_rank", "whole_rank", "price_rank", "eta_rank", "distance_rank"] 31 | self.rank_whole_pic_list = ["mode_rank1", "mode_rank2", "mode_rank3", "mode_rank4", 32 | "mode_rank5"] 33 | self.weather_feature_list = ["max_temp", "min_temp", "wea", "wind"] 34 | self.hash_dim = 1000001 35 | self.train_idx_ = 2000000 36 | #carefully set if you change the features 37 | self.categorical_range_ = range(0, 22) 38 | 39 | #process one instance 40 | def _process_line(self, line): 41 | instance = json.loads(line) 42 | """ 43 | profile = instance["profile"] 44 | len_profile = len(profile) 45 | if len_profile >= 10: 46 | user_profile_feature = profile[0:10] 47 | else: 48 | profile.extend([0]*(10-len_profile)) 49 | user_profile_feature = profile 50 | 51 | if len(profile) > 1 or (len(profile) == 1 and profile[0] != 0): 52 | for p in profile: 53 | if p >= 1 and p <= 65: 54 | user_profile_feature[p - 1] = 1 55 | """ 56 | context_feature = [] 57 | context_feature_fm = [] 58 | dense_feature = [0] * self.dense_length 59 | plan = instance["plan"] 60 | for i, val in enumerate(self.dense_feature_list): 61 | dense_feature[i] = plan[val] 62 | 63 | if (instance["pid"] == ""): 64 | instance["pid"] = 0 65 | 66 | query = instance["query"] 67 | weather_dic = instance["weather"] 68 | for fea in self.pid_list: 69 | context_feature.append([hash(fea + str(instance[fea])) % self.hash_dim]) 70 | context_feature_fm.append(hash(fea + str(instance[fea])) % self.hash_dim) 71 | for fea in self.query_feature_list: 72 | context_feature.append([hash(fea + str(query[fea])) % self.hash_dim]) 73 | context_feature_fm.append(hash(fea + str(query[fea])) % self.hash_dim) 74 | for fea in self.plan_feature_list: 75 | context_feature.append([hash(fea + str(plan[fea])) % self.hash_dim]) 76 | context_feature_fm.append(hash(fea + str(plan[fea])) % self.hash_dim) 77 | for fea in self.rank_feature_list: 78 | context_feature.append([hash(fea + str(instance[fea])) % self.hash_dim]) 79 | context_feature_fm.append(hash(fea + str(instance[fea])) % self.hash_dim) 80 | for fea in self.rank_whole_pic_list: 81 | context_feature.append([hash(fea + str(instance[fea])) % self.hash_dim]) 82 | context_feature_fm.append(hash(fea + str(instance[fea])) % self.hash_dim) 83 | for fea in self.weather_feature_list: 84 | context_feature.append([hash(fea + str(weather_dic[fea])) % self.hash_dim]) 85 | context_feature_fm.append(hash(fea + str(weather_dic[fea])) % self.hash_dim) 86 | 87 | label = [int(instance["label"])] 88 | 89 | return dense_feature, context_feature, context_feature_fm, label 90 | 91 | def infer_reader(self, filelist, batch, buf_size): 92 | print(filelist) 93 | 94 | def local_iter(): 95 | for fname in filelist: 96 | with open(fname.strip(), "r") as fin: 97 | for line in fin: 98 | dense_feature, sparse_feature, sparse_feature_fm, label = self._process_line(line) 99 | yield [dense_feature] + sparse_feature + [sparse_feature_fm] + [label] 100 | 101 | import paddle 102 | batch_iter = paddle.batch( 103 | paddle.reader.shuffle( 104 | local_iter, 
buf_size=buf_size),
105 | batch_size=batch)
106 | return batch_iter
107 |
108 | #generate inputs for testing
109 | def test_reader(self, filelist, batch, buf_size):
110 | print(filelist)
111 |
112 | def local_iter():
113 | for fname in filelist:
114 | with open(fname.strip(), "r") as fin:
115 | for line in fin:
116 | dense_feature, sparse_feature, sparse_feature_fm, label = self._process_line(line)
117 | yield [dense_feature] + sparse_feature + [sparse_feature_fm] + [label]
118 |
119 | import paddle
120 | batch_iter = paddle.batch(
121 | paddle.reader.buffered(
122 | local_iter, size=buf_size),
123 | batch_size=batch)
124 | return batch_iter
125 |
126 | #generate inputs for training
127 | def generate_sample(self, line):
128 | def data_iter():
129 | dense_feature, sparse_feature, sparse_feature_fm, label = self._process_line(line)
130 | #feature_name = ["user_profile"]
131 | feature_name = []
132 | feature_name.append("dense_feature")
133 | for idx in self.categorical_range_:
134 | feature_name.append("context" + str(idx))
135 | feature_name.append("context_fm")
136 | feature_name.append("label")
137 | yield zip(feature_name, [dense_feature] + sparse_feature + [sparse_feature_fm] + [label])
138 |
139 | return data_iter
140 |
141 |
142 | if __name__ == "__main__":
143 | map_dataset = MapDataset()
144 | map_dataset.setup(int(sys.argv[1]))
145 | map_dataset.run_from_stdin()
146 |
--------------------------------------------------------------------------------
/map_reader_mmh.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
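# NOTE: this variant of the reader hashes feature strings with mmh3
# (MurmurHash3) rather than the built-in hash(), so the feature ids stay
# deterministic under Python 3's salted string hashing. The mmh3 package is
# an extra dependency: pip install mmh3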
14 | 15 | import sys 16 | import json 17 | import paddle.fluid.incubate.data_generator as dg 18 | import mmh3 19 | 20 | 21 | class MapDataset(dg.MultiSlotDataGenerator): 22 | def setup(self, sparse_feature_dim): 23 | self.profile_length = 65 24 | self.dense_length = 3 25 | # feature names 26 | self.dense_feature_list = ["distance", "price", "eta"] 27 | 28 | self.pid_list = ["pid"] 29 | self.query_feature_list = ["weekday", "hour", "o1", "o2", "d1", "d2"] 30 | self.plan_feature_list = ["transport_mode"] 31 | self.rank_feature_list = ["plan_rank", "whole_rank", "price_rank", "eta_rank", "distance_rank"] 32 | self.rank_whole_pic_list = ["mode_rank1", "mode_rank2", "mode_rank3", "mode_rank4", 33 | "mode_rank5"] 34 | self.weather_feature_list = ["max_temp", "min_temp", "wea", "wind"] 35 | self.hash_dim = 1000001 36 | self.train_idx_ = 2000000 37 | # carefully set if you change the features 38 | self.categorical_range_ = range(0, 22) 39 | 40 | # process one instance 41 | def _process_line(self, line): 42 | instance = json.loads(line) 43 | """ 44 | profile = instance["profile"] 45 | len_profile = len(profile) 46 | if len_profile >= 10: 47 | user_profile_feature = profile[0:10] 48 | else: 49 | profile.extend([0]*(10-len_profile)) 50 | user_profile_feature = profile 51 | 52 | if len(profile) > 1 or (len(profile) == 1 and profile[0] != 0): 53 | for p in profile: 54 | if p >= 1 and p <= 65: 55 | user_profile_feature[p - 1] = 1 56 | """ 57 | context_feature = [] 58 | context_feature_fm = [] 59 | dense_feature = [0] * self.dense_length 60 | plan = instance["plan"] 61 | for i, val in enumerate(self.dense_feature_list): 62 | dense_feature[i] = plan[val] 63 | 64 | if (instance["pid"] == ""): 65 | instance["pid"] = 0 66 | 67 | query = instance["query"] 68 | weather_dic = instance["weather"] 69 | for fea in self.pid_list: 70 | context_feature.append([mmh3.hash(fea + str(instance[fea])) % self.hash_dim]) 71 | context_feature_fm.append(mmh3.hash(fea + str(instance[fea])) % self.hash_dim) 72 | for fea in self.query_feature_list: 73 | context_feature.append([mmh3.hash(fea + str(query[fea])) % self.hash_dim]) 74 | context_feature_fm.append(mmh3.hash(fea + str(query[fea])) % self.hash_dim) 75 | for fea in self.plan_feature_list: 76 | context_feature.append([mmh3.hash(fea + str(plan[fea])) % self.hash_dim]) 77 | context_feature_fm.append(mmh3.hash(fea + str(plan[fea])) % self.hash_dim) 78 | for fea in self.rank_feature_list: 79 | context_feature.append([mmh3.hash(fea + str(instance[fea])) % self.hash_dim]) 80 | context_feature_fm.append(mmh3.hash(fea + str(instance[fea])) % self.hash_dim) 81 | for fea in self.rank_whole_pic_list: 82 | context_feature.append([mmh3.hash(fea + str(instance[fea])) % self.hash_dim]) 83 | context_feature_fm.append(mmh3.hash(fea + str(instance[fea])) % self.hash_dim) 84 | for fea in self.weather_feature_list: 85 | context_feature.append([mmh3.hash(fea + str(weather_dic[fea])) % self.hash_dim]) 86 | context_feature_fm.append(mmh3.hash(fea + str(weather_dic[fea])) % self.hash_dim) 87 | 88 | label = [int(instance["label"])] 89 | 90 | return dense_feature, context_feature, context_feature_fm, label 91 | 92 | def infer_reader(self, filelist, batch, buf_size): 93 | print(filelist) 94 | 95 | def local_iter(): 96 | for fname in filelist: 97 | with open(fname.strip(), "r") as fin: 98 | for line in fin: 99 | dense_feature, sparse_feature, sparse_feature_fm, label = self._process_line(line) 100 | yield [dense_feature] + sparse_feature + [sparse_feature_fm] + [label] 101 | 102 | import paddle 103 | 
batch_iter = paddle.batch(
104 | paddle.reader.shuffle(
105 | local_iter, buf_size=buf_size),
106 | batch_size=batch)
107 | return batch_iter
108 |
109 | # generate inputs for testing
110 | def test_reader(self, filelist, batch, buf_size):
111 | print(filelist)
112 |
113 | def local_iter():
114 | for fname in filelist:
115 | with open(fname.strip(), "r") as fin:
116 | for line in fin:
117 | dense_feature, sparse_feature, sparse_feature_fm, label = self._process_line(line)
118 | yield [dense_feature] + sparse_feature + [sparse_feature_fm] + [label]
119 |
120 | import paddle
121 | batch_iter = paddle.batch(
122 | paddle.reader.buffered(
123 | local_iter, size=buf_size),
124 | batch_size=batch)
125 | return batch_iter
126 |
127 | # generate inputs for training
128 | def generate_sample(self, line):
129 | def data_iter():
130 | dense_feature, sparse_feature, sparse_feature_fm, label = self._process_line(line)
131 | # feature_name = ["user_profile"]
132 | feature_name = []
133 | feature_name.append("dense_feature")
134 | for idx in self.categorical_range_:
135 | feature_name.append("context" + str(idx))
136 | feature_name.append("context_fm")
137 | feature_name.append("label")
138 | yield list(zip(feature_name, [dense_feature] + sparse_feature + [sparse_feature_fm] + [label]))
139 |
140 | return data_iter
141 |
142 |
143 | if __name__ == "__main__":
144 | map_dataset = MapDataset()
145 | map_dataset.setup(int(sys.argv[1]))
146 | map_dataset.run_from_stdin()
147 |
--------------------------------------------------------------------------------
/network_confv6.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
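# The two FM layers below compute the second-order interaction term with the
# standard factorization-machine identity
#   sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * ((sum_i x_i v_i)^2 - sum_i (x_i v_i)^2),
# evaluated per factor dimension: the "*_square" tensors are exactly the two
# halves of this difference, for the dense input and the sparse embeddings.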
14 | 15 | import paddle.fluid as fluid 16 | import math 17 | 18 | user_profile_dim = 65 19 | dense_feature_dim = 3 20 | 21 | def ctr_deepfm_dataset(dense_feature, context_feature, context_feature_fm, label, 22 | embedding_size, sparse_feature_dim): 23 | def dense_fm_layer(input, emb_dict_size, factor_size, fm_param_attr): 24 | 25 | first_order = fluid.layers.fc(input=input, size=1) 26 | emb_table = fluid.layers.create_parameter(shape=[emb_dict_size, factor_size], 27 | dtype='float32', attr=fm_param_attr) 28 | 29 | input_mul_factor = fluid.layers.matmul(input, emb_table) 30 | input_mul_factor_square = fluid.layers.square(input_mul_factor) 31 | input_square = fluid.layers.square(input) 32 | factor_square = fluid.layers.square(emb_table) 33 | input_square_mul_factor_square = fluid.layers.matmul(input_square, factor_square) 34 | 35 | second_order = 0.5 * (input_mul_factor_square - input_square_mul_factor_square) 36 | return first_order, second_order 37 | 38 | 39 | dense_fm_param_attr = fluid.param_attr.ParamAttr(name="DenseFeatFactors", 40 | initializer=fluid.initializer.Normal( 41 | scale=1 / math.sqrt(dense_feature_dim))) 42 | dense_fm_first, dense_fm_second = dense_fm_layer( 43 | dense_feature, dense_feature_dim, 16, dense_fm_param_attr) 44 | 45 | 46 | def sparse_fm_layer(input, emb_dict_size, factor_size, fm_param_attr): 47 | 48 | first_embeddings = fluid.layers.embedding( 49 | input=input, dtype='float32', size=[emb_dict_size, 1], is_sparse=True) 50 | first_order = fluid.layers.sequence_pool(input=first_embeddings, pool_type='sum') 51 | 52 | nonzero_embeddings = fluid.layers.embedding( 53 | input=input, dtype='float32', size=[emb_dict_size, factor_size], 54 | param_attr=fm_param_attr, is_sparse=True) 55 | summed_features_emb = fluid.layers.sequence_pool(input=nonzero_embeddings, pool_type='sum') 56 | summed_features_emb_square = fluid.layers.square(summed_features_emb) 57 | 58 | squared_features_emb = fluid.layers.square(nonzero_embeddings) 59 | squared_sum_features_emb = fluid.layers.sequence_pool( 60 | input=squared_features_emb, pool_type='sum') 61 | 62 | second_order = 0.5 * (summed_features_emb_square - squared_sum_features_emb) 63 | return first_order, second_order 64 | 65 | sparse_fm_param_attr = fluid.param_attr.ParamAttr(name="SparseFeatFactors", 66 | initializer=fluid.initializer.Normal( 67 | scale=1 / math.sqrt(sparse_feature_dim))) 68 | 69 | #data = fluid.layers.data(name='ids', shape=[1], dtype='float32') 70 | sparse_fm_first, sparse_fm_second = sparse_fm_layer( 71 | context_feature_fm, sparse_feature_dim, 16, sparse_fm_param_attr) 72 | 73 | def embedding_layer(input): 74 | return fluid.layers.embedding( 75 | input=input, 76 | is_sparse=True, 77 | # you need to patch https://github.com/PaddlePaddle/Paddle/pull/14190 78 | # if you want to set is_distributed to True 79 | is_distributed=False, 80 | size=[sparse_feature_dim, embedding_size], 81 | param_attr=fluid.ParamAttr(name="SparseFeatFactors", 82 | initializer=fluid.initializer.Uniform())) 83 | 84 | sparse_embed_seq = list(map(embedding_layer, context_feature)) 85 | 86 | concated_ori = fluid.layers.concat(sparse_embed_seq + [dense_feature], axis=1) 87 | concated = fluid.layers.batch_norm(input=concated_ori, name="bn", epsilon=1e-4) 88 | 89 | deep = deep_net(concated) 90 | 91 | predict = fluid.layers.fc(input=[deep, sparse_fm_first, sparse_fm_second, dense_fm_first, dense_fm_second], size=2, act="softmax", 92 | param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal( 93 | scale=1 / math.sqrt(deep.shape[1])), 
learning_rate=0.01)) 94 | 95 | #similarity_norm = fluid.layers.sigmoid(fluid.layers.clip(predict, min=-15.0, max=15.0), name="similarity_norm") 96 | 97 | cost = fluid.layers.cross_entropy(input=predict, label=label) 98 | 99 | avg_cost = fluid.layers.reduce_sum(cost) 100 | accuracy = fluid.layers.accuracy(input=predict, label=label) 101 | auc_var, batch_auc_var, auc_states = \ 102 | fluid.layers.auc(input=predict, label=label, num_thresholds=2 ** 12, slide_steps=20) 103 | return avg_cost, auc_var, batch_auc_var, accuracy, predict 104 | 105 | 106 | def deep_net(concated, lr_x=0.0001): 107 | fc_layers_input = [concated] 108 | fc_layers_size = [400, 400, 400] 109 | fc_layers_act = ["relu"] * (len(fc_layers_size)) 110 | 111 | for i in range(len(fc_layers_size)): 112 | fc = fluid.layers.fc( 113 | input=fc_layers_input[-1], 114 | size=fc_layers_size[i], 115 | act=fc_layers_act[i], 116 | param_attr=fluid.ParamAttr(learning_rate=lr_x * 0.5)) 117 | 118 | fc_layers_input.append(fc) 119 | #w_res = fluid.layers.create_parameter(shape=[353, 16], dtype='float32', name="w_res") 120 | #high_path = fluid.layers.matmul(concated, w_res) 121 | 122 | #return fluid.layers.elementwise_add(high_path, fc_layers_input[-1]) 123 | return fc_layers_input[-1] 124 | -------------------------------------------------------------------------------- /networks/network_conf.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
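# A minimal standalone check (numpy only; not needed by this module) of the FM
# identity that fm() below relies on:
#
#   import numpy as np
#   rng = np.random.default_rng(0)
#   x = rng.normal(size=(1, 6))    # one input row
#   v = rng.normal(size=(6, 4))    # factor table with factor_size = 4
#   fast = 0.5 * ((x @ v) ** 2 - (x ** 2) @ (v ** 2))   # what fm() computes
#   slow = sum(x[0, i] * x[0, j] * v[i] * v[j]
#              for i in range(6) for j in range(6) if i < j)
#   assert np.allclose(fast[0], slow)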
14 | 15 | 16 | import paddle.fluid as fluid 17 | import math 18 | 19 | user_profile_dim = 65 20 | num_context = 25 21 | dim_fm_vector = 16 22 | dim_concated = user_profile_dim + dim_fm_vector * (num_context) 23 | 24 | 25 | def ctr_deepfm_dataset(user_profile, context_feature, label, 26 | embedding_size, sparse_feature_dim): 27 | def embedding_layer(input): 28 | return fluid.layers.embedding( 29 | input=input, 30 | is_sparse=True, 31 | # you need to patch https://github.com/PaddlePaddle/Paddle/pull/14190 32 | # if you want to set is_distributed to True 33 | is_distributed=False, 34 | size=[sparse_feature_dim, embedding_size], 35 | param_attr=fluid.ParamAttr(name="SparseFeatFactors", 36 | initializer=fluid.initializer.Uniform())) 37 | 38 | sparse_embed_seq = list(map(embedding_layer, context_feature)) 39 | 40 | w = fluid.layers.create_parameter( 41 | shape=[65, 65], dtype='float32', 42 | name="w_fm") 43 | user_profile_emb = fluid.layers.matmul(user_profile, w) 44 | 45 | concated_ori = fluid.layers.concat(sparse_embed_seq + [user_profile_emb], axis=1) 46 | concated = fluid.layers.batch_norm(input=concated_ori, name="bn", epsilon=1e-4) 47 | 48 | deep = deep_net(concated) 49 | linear_term, second_term = fm(concated, dim_concated, 8) #depend on the number of context feature 50 | 51 | predict = fluid.layers.fc(input=[deep, linear_term, second_term], size=2, act="softmax", 52 | param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal( 53 | scale=1 / math.sqrt(deep.shape[1])), learning_rate=0.01)) 54 | 55 | #similarity_norm = fluid.layers.sigmoid(fluid.layers.clip(predict, min=-15.0, max=15.0), name="similarity_norm") 56 | 57 | 58 | cost = fluid.layers.cross_entropy(input=predict, label=label) 59 | 60 | avg_cost = fluid.layers.reduce_sum(cost) 61 | accuracy = fluid.layers.accuracy(input=predict, label=label) 62 | auc_var, batch_auc_var, auc_states = \ 63 | fluid.layers.auc(input=predict, label=label, num_thresholds=2 ** 12, slide_steps=20) 64 | return avg_cost, auc_var, batch_auc_var, accuracy, predict 65 | 66 | 67 | def deep_net(concated, lr_x=0.0001): 68 | fc_layers_input = [concated] 69 | fc_layers_size = [128, 64, 32, 16] 70 | fc_layers_act = ["relu"] * (len(fc_layers_size)) 71 | 72 | for i in range(len(fc_layers_size)): 73 | fc = fluid.layers.fc( 74 | input=fc_layers_input[-1], 75 | size=fc_layers_size[i], 76 | act=fc_layers_act[i], 77 | param_attr=fluid.ParamAttr(learning_rate=lr_x * 0.5)) 78 | 79 | fc_layers_input.append(fc) 80 | 81 | return fc_layers_input[-1] 82 | 83 | 84 | def fm(concated, emb_dict_size, factor_size, lr_x=0.0001): 85 | linear_term = fluid.layers.fc(input=concated, size=8, act=None, param_attr=fluid.ParamAttr(learning_rate=lr_x)) 86 | 87 | emb_table = fluid.layers.create_parameter(shape=[emb_dict_size, factor_size], 88 | dtype='float32') 89 | 90 | input_mul_factor = fluid.layers.matmul(concated, emb_table) 91 | input_mul_factor_square = fluid.layers.square(input_mul_factor) 92 | input_square = fluid.layers.square(concated) 93 | factor_square = fluid.layers.square(emb_table) 94 | input_square_mul_factor_square = fluid.layers.matmul(input_square, factor_square) 95 | 96 | second_term = 0.5 * (input_mul_factor_square - input_square_mul_factor_square) 97 | 98 | return linear_term, second_term 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | -------------------------------------------------------------------------------- /networks/network_confv4.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2016 PaddlePaddle 
Authors. All Rights Reserved 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | import paddle.fluid as fluid 16 | import math 17 | 18 | user_profile_dim = 65 19 | slot_1 = [0, 1, 2, 3, 4, 5] 20 | slot_2 = [6] 21 | slot_3 = [7, 8, 9, 10, 11] 22 | slot_4 = [12, 13, 14, 15, 16] 23 | slot_5 = [17, 18, 19, 20] 24 | num_context = 25 25 | num_slots_pair = 5 26 | dim_fm_vector = 16 27 | dim_concated = user_profile_dim + dim_fm_vector * (num_context + num_slots_pair) 28 | 29 | def ctr_deepfm_dataset(user_profile, dense_feature, context_feature, label, 30 | embedding_size, sparse_feature_dim): 31 | def embedding_layer(input): 32 | return fluid.layers.embedding( 33 | input=input, 34 | is_sparse=True, 35 | # you need to patch https://github.com/PaddlePaddle/Paddle/pull/14190 36 | # if you want to set is_distributed to True 37 | is_distributed=False, 38 | size=[sparse_feature_dim, embedding_size], 39 | param_attr=fluid.ParamAttr(name="SparseFeatFactors", 40 | initializer=fluid.initializer.Uniform())) 41 | 42 | sparse_embed_seq = list(map(embedding_layer, context_feature)) 43 | 44 | w = fluid.layers.create_parameter( 45 | shape=[65, 65], dtype='float32', 46 | name="w_fm") 47 | 48 | user_emb_list = [] 49 | user_profile_emb = fluid.layers.matmul(user_profile, w) 50 | user_emb_list.append(user_profile_emb) 51 | user_emb_list.append(dense_feature) 52 | 53 | w1 = fluid.layers.create_parameter(shape=[65, dim_fm_vector], dtype='float32', name="w_1") 54 | w2 = fluid.layers.create_parameter(shape=[65, dim_fm_vector], dtype='float32', name="w_2") 55 | w3 = fluid.layers.create_parameter(shape=[65, dim_fm_vector], dtype='float32', name="w_3") 56 | w4 = fluid.layers.create_parameter(shape=[65, dim_fm_vector], dtype='float32', name="w_4") 57 | w5 = fluid.layers.create_parameter(shape=[65, dim_fm_vector], dtype='float32', name="w_5") 58 | user_profile_emb_1 = fluid.layers.matmul(user_profile, w1) 59 | user_profile_emb_2 = fluid.layers.matmul(user_profile, w2) 60 | user_profile_emb_3 = fluid.layers.matmul(user_profile, w3) 61 | user_profile_emb_4 = fluid.layers.matmul(user_profile, w4) 62 | user_profile_emb_5 = fluid.layers.matmul(user_profile, w5) 63 | 64 | sparse_embed_seq_1 = embedding_layer(context_feature[slot_1[0]]) 65 | sparse_embed_seq_2 = embedding_layer(context_feature[slot_2[0]]) 66 | sparse_embed_seq_3 = embedding_layer(context_feature[slot_3[0]]) 67 | sparse_embed_seq_4 = embedding_layer(context_feature[slot_4[0]]) 68 | sparse_embed_seq_5 = embedding_layer(context_feature[slot_5[0]]) 69 | for i in slot_1[1:-1]: 70 | sparse_embed_seq_1 = fluid.layers.elementwise_add(sparse_embed_seq_1, embedding_layer(context_feature[i])) 71 | for i in slot_2[1:-1]: 72 | sparse_embed_seq_2 = fluid.layers.elementwise_add(sparse_embed_seq_2, embedding_layer(context_feature[i])) 73 | for i in slot_3[1:-1]: 74 | sparse_embed_seq_3 = fluid.layers.elementwise_add(sparse_embed_seq_3, embedding_layer(context_feature[i])) 75 | for i in slot_4[1:-1]: 76 | sparse_embed_seq_4 = 
fluid.layers.elementwise_add(sparse_embed_seq_4, embedding_layer(context_feature[i])) 77 | for i in slot_5[1:-1]: 78 | sparse_embed_seq_5 = fluid.layers.elementwise_add(sparse_embed_seq_5, embedding_layer(context_feature[i])) 79 | 80 | ele_product_1 = fluid.layers.elementwise_mul(user_profile_emb_1, sparse_embed_seq_1) 81 | user_emb_list.append(ele_product_1) 82 | ele_product_2 = fluid.layers.elementwise_mul(user_profile_emb_2, sparse_embed_seq_2) 83 | user_emb_list.append(ele_product_2) 84 | ele_product_3 = fluid.layers.elementwise_mul(user_profile_emb_3, sparse_embed_seq_3) 85 | user_emb_list.append(ele_product_3) 86 | ele_product_4 = fluid.layers.elementwise_mul(user_profile_emb_4, sparse_embed_seq_4) 87 | user_emb_list.append(ele_product_4) 88 | ele_product_5 = fluid.layers.elementwise_mul(user_profile_emb_5, sparse_embed_seq_5) 89 | user_emb_list.append(ele_product_5) 90 | 91 | ffm_1 = fluid.layers.reduce_sum(ele_product_1, dim=1, keep_dim=True) 92 | ffm_2 = fluid.layers.reduce_sum(ele_product_2, dim=1, keep_dim=True) 93 | ffm_3 = fluid.layers.reduce_sum(ele_product_3, dim=1, keep_dim=True) 94 | ffm_4 = fluid.layers.reduce_sum(ele_product_4, dim=1, keep_dim=True) 95 | ffm_5 = fluid.layers.reduce_sum(ele_product_5, dim=1, keep_dim=True) 96 | 97 | 98 | concated_ori = fluid.layers.concat(sparse_embed_seq + user_emb_list, axis=1) 99 | concated = fluid.layers.batch_norm(input=concated_ori, name="bn", epsilon=1e-4) 100 | 101 | deep = deep_net(concated) 102 | linear_term, second_term = fm(concated, dim_concated, 8) #depend on the number of context feature 103 | 104 | predict = fluid.layers.fc(input=[deep, linear_term, second_term, ffm_1, ffm_2, ffm_3, ffm_4, ffm_5], size=2, act="softmax", 105 | param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal( 106 | scale=1 / math.sqrt(deep.shape[1])), learning_rate=0.01)) 107 | 108 | #similarity_norm = fluid.layers.sigmoid(fluid.layers.clip(predict, min=-15.0, max=15.0), name="similarity_norm") 109 | 110 | 111 | cost = fluid.layers.cross_entropy(input=predict, label=label) 112 | 113 | avg_cost = fluid.layers.reduce_sum(cost) 114 | accuracy = fluid.layers.accuracy(input=predict, label=label) 115 | auc_var, batch_auc_var, auc_states = \ 116 | fluid.layers.auc(input=predict, label=label, num_thresholds=2 ** 12, slide_steps=20) 117 | return avg_cost, auc_var, batch_auc_var, accuracy, predict 118 | 119 | 120 | def deep_net(concated, lr_x=0.0001): 121 | fc_layers_input = [concated] 122 | fc_layers_size = [256, 128, 64, 32, 16] 123 | fc_layers_act = ["relu"] * (len(fc_layers_size)) 124 | 125 | for i in range(len(fc_layers_size)): 126 | fc = fluid.layers.fc( 127 | input=fc_layers_input[-1], 128 | size=fc_layers_size[i], 129 | act=fc_layers_act[i], 130 | param_attr=fluid.ParamAttr(learning_rate=lr_x * 0.5)) 131 | 132 | fc_layers_input.append(fc) 133 | w_res = fluid.layers.create_parameter(shape=[dim_concated, 16], dtype='float32', name="w_res") 134 | high_path = fluid.layers.matmul(concated, w_res) 135 | 136 | return fluid.layers.elementwise_add(high_path, fc_layers_input[-1]) 137 | #return fc_layers_input[-1] 138 | 139 | 140 | def fm(concated, emb_dict_size, factor_size, lr_x=0.0001): 141 | linear_term = fluid.layers.fc(input=concated, size=8, act=None, param_attr=fluid.ParamAttr(learning_rate=lr_x)) 142 | 143 | emb_table = fluid.layers.create_parameter(shape=[emb_dict_size, factor_size], 144 | dtype='float32') 145 | 146 | input_mul_factor = fluid.layers.matmul(concated, emb_table) 147 | input_mul_factor_square = fluid.layers.square(input_mul_factor) 
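# Note: the surrounding lines implement the classic FM factorization trick. For input x and
# factor table V, sum_{i<j} <V_i, V_j> * x_i * x_j == 0.5 * ((xV)^2 - (x^2)(V^2)) summed over
# the factor dimension, so the pairwise interactions cost two matmuls instead of an explicit
# loop over all feature pairs.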
148 | input_square = fluid.layers.square(concated) 149 | factor_square = fluid.layers.square(emb_table) 150 | input_square_mul_factor_square = fluid.layers.matmul(input_square, factor_square) 151 | 152 | second_term = 0.5 * (input_mul_factor_square - input_square_mul_factor_square) 153 | 154 | return linear_term, second_term -------------------------------------------------------------------------------- /networks/network_confv6.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | import paddle.fluid as fluid 16 | import math 17 | 18 | user_profile_dim = 65 19 | dense_feature_dim = 3 20 | 21 | def ctr_deepfm_dataset(dense_feature, context_feature, context_feature_fm, label, 22 | embedding_size, sparse_feature_dim): 23 | def dense_fm_layer(input, emb_dict_size, factor_size, fm_param_attr): 24 | 25 | first_order = fluid.layers.fc(input=input, size=1) 26 | emb_table = fluid.layers.create_parameter(shape=[emb_dict_size, factor_size], 27 | dtype='float32', attr=fm_param_attr) 28 | 29 | input_mul_factor = fluid.layers.matmul(input, emb_table) 30 | input_mul_factor_square = fluid.layers.square(input_mul_factor) 31 | input_square = fluid.layers.square(input) 32 | factor_square = fluid.layers.square(emb_table) 33 | input_square_mul_factor_square = fluid.layers.matmul(input_square, factor_square) 34 | 35 | second_order = 0.5 * (input_mul_factor_square - input_square_mul_factor_square) 36 | return first_order, second_order 37 | 38 | 39 | dense_fm_param_attr = fluid.param_attr.ParamAttr(name="DenseFeatFactors", 40 | initializer=fluid.initializer.Normal( 41 | scale=1 / math.sqrt(dense_feature_dim))) 42 | dense_fm_first, dense_fm_second = dense_fm_layer( 43 | dense_feature, dense_feature_dim, 16, dense_fm_param_attr) 44 | 45 | 46 | def sparse_fm_layer(input, emb_dict_size, factor_size, fm_param_attr): 47 | 48 | first_embeddings = fluid.layers.embedding( 49 | input=input, dtype='float32', size=[emb_dict_size, 1], is_sparse=True) 50 | first_order = fluid.layers.sequence_pool(input=first_embeddings, pool_type='sum') 51 | 52 | nonzero_embeddings = fluid.layers.embedding( 53 | input=input, dtype='float32', size=[emb_dict_size, factor_size], 54 | param_attr=fm_param_attr, is_sparse=True) 55 | summed_features_emb = fluid.layers.sequence_pool(input=nonzero_embeddings, pool_type='sum') 56 | summed_features_emb_square = fluid.layers.square(summed_features_emb) 57 | 58 | squared_features_emb = fluid.layers.square(nonzero_embeddings) 59 | squared_sum_features_emb = fluid.layers.sequence_pool( 60 | input=squared_features_emb, pool_type='sum') 61 | 62 | second_order = 0.5 * (summed_features_emb_square - squared_sum_features_emb) 63 | return first_order, second_order 64 | 65 | sparse_fm_param_attr = fluid.param_attr.ParamAttr(name="SparseFeatFactors", 66 | initializer=fluid.initializer.Normal( 67 | scale=1 / 
math.sqrt(sparse_feature_dim)))
68 |
69 | #data = fluid.layers.data(name='ids', shape=[1], dtype='float32')
70 | sparse_fm_first, sparse_fm_second = sparse_fm_layer(
71 | context_feature_fm, sparse_feature_dim, 16, sparse_fm_param_attr)
72 |
73 | def embedding_layer(input):
74 | return fluid.layers.embedding(
75 | input=input,
76 | is_sparse=True,
77 | # you need to patch https://github.com/PaddlePaddle/Paddle/pull/14190
78 | # if you want to set is_distributed to True
79 | is_distributed=False,
80 | size=[sparse_feature_dim, embedding_size],
81 | param_attr=fluid.ParamAttr(name="SparseFeatFactors",
82 | initializer=fluid.initializer.Uniform()))
83 |
84 | sparse_embed_seq = list(map(embedding_layer, context_feature))
85 |
86 | concated_ori = fluid.layers.concat(sparse_embed_seq + [dense_feature], axis=1)
87 | concated = fluid.layers.batch_norm(input=concated_ori, name="bn", epsilon=1e-4)
88 |
89 | deep = deep_net(concated)
90 |
91 | predict = fluid.layers.fc(input=[deep, sparse_fm_first, sparse_fm_second, dense_fm_first, dense_fm_second], size=2, act="softmax",
92 | param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
93 | scale=1 / math.sqrt(deep.shape[1])), learning_rate=0.01))
94 |
95 | #similarity_norm = fluid.layers.sigmoid(fluid.layers.clip(predict, min=-15.0, max=15.0), name="similarity_norm")
96 |
97 | cost = fluid.layers.cross_entropy(input=predict, label=label)
98 |
99 | avg_cost = fluid.layers.reduce_sum(cost)
100 | accuracy = fluid.layers.accuracy(input=predict, label=label)
101 | auc_var, batch_auc_var, auc_states = \
102 | fluid.layers.auc(input=predict, label=label, num_thresholds=2 ** 12, slide_steps=20)
103 | return avg_cost, auc_var, batch_auc_var, accuracy, predict
104 |
105 |
106 | def deep_net(concated, lr_x=0.0001):
107 | fc_layers_input = [concated]
108 | fc_layers_size = [400, 400, 400]
109 | fc_layers_act = ["relu"] * (len(fc_layers_size))
110 |
111 | for i in range(len(fc_layers_size)):
112 | fc = fluid.layers.fc(
113 | input=fc_layers_input[-1],
114 | size=fc_layers_size[i],
115 | act=fc_layers_act[i],
116 | param_attr=fluid.ParamAttr(learning_rate=lr_x * 0.5))
117 |
118 | fc_layers_input.append(fc)
119 | #w_res = fluid.layers.create_parameter(shape=[353, 16], dtype='float32', name="w_res")
120 | #high_path = fluid.layers.matmul(concated, w_res)
121 |
122 | #return fluid.layers.elementwise_add(high_path, fc_layers_input[-1])
123 | return fc_layers_input[-1]
--------------------------------------------------------------------------------
/out/readme:
--------------------------------------------------------------------------------
1 | This folder is for the preprocessed data
2 |
--------------------------------------------------------------------------------
/pre_process_test.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
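A quick sanity check of the FM identity that fm() in networks/network_conf.py and networks/network_confv4.py and dense_fm_layer() in networks/network_confv6.py all rely on; the NumPy sketch below uses made-up sizes and is independent of the repository code:
```python
import numpy as np

# Verify: sum_{i<j} <V_i, V_j> x_i x_j == 0.5 * sum_f ((xV)_f^2 - (x^2 . V^2)_f)
rng = np.random.RandomState(0)
n_features, factor_size = 6, 4  # arbitrary demo sizes
x = rng.rand(n_features)
V = rng.rand(n_features, factor_size)

# Explicit O(n^2) pairwise interaction sum.
explicit = sum(np.dot(V[i], V[j]) * x[i] * x[j]
               for i in range(n_features)
               for j in range(i + 1, n_features))

# Vectorized O(n*k) form, exactly what the network code computes with matmul/square.
vectorized = 0.5 * (np.square(x.dot(V)) - np.square(x).dot(np.square(V))).sum()

assert np.isclose(explicit, vectorized)
```
The network files keep the per-factor vector (without the final sum over factors) and let the top fc layer learn how to weight it.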
14 | 15 | import os, sys, time, random, csv, datetime, json 16 | import pandas as pd 17 | import numpy as np 18 | import argparse 19 | import logging 20 | import time 21 | 22 | logging.basicConfig( 23 | format='%(asctime)s - %(levelname)s - %(message)s') 24 | logger = logging.getLogger("preprocess") 25 | logger.setLevel(logging.INFO) 26 | 27 | TEST_QUERIES_PATH = "./data_set_phase1/test_queries.csv" 28 | TEST_PLANS_PATH = "./data_set_phase1/test_plans.csv" 29 | TRAIN_CLICK_PATH = "./data_set_phase1/train_clicks.csv" 30 | PROFILES_PATH = "./data_set_phase1/profiles.csv" 31 | OUT_NORM_TEST_PATH = "./out/normed_test_session.txt" 32 | OUT_RAW_TEST_PATH = "./out/test_session.txt" 33 | 34 | O1_MIN = 115.47 35 | O1_MAX = 117.29 36 | 37 | O2_MIN = 39.46 38 | O2_MAX = 40.97 39 | 40 | D1_MIN = 115.44 41 | D1_MAX = 117.37 42 | 43 | D2_MIN = 39.46 44 | D2_MAX = 40.96 45 | SCALE_OD = 0.02 46 | 47 | DISTANCE_MIN = 1.0 48 | DISTANCE_MAX = 225864.0 49 | THRESHOLD_DIS = 40000.0 50 | SCALE_DIS = 500 51 | 52 | PRICE_MIN = 200.0 53 | PRICE_MAX = 92300.0 54 | THRESHOLD_PRICE = 20000 55 | SCALE_PRICE = 100 56 | 57 | ETA_MIN = 1.0 58 | ETA_MAX = 72992.0 59 | THRESHOLD_ETA = 10800.0 60 | SCALE_ETA = 120 61 | 62 | 63 | def build_norm_feature(): 64 | with open(OUT_NORM_TEST_PATH, 'w') as nf: 65 | with open(OUT_RAW_TEST_PATH, 'r') as f: 66 | for line in f: 67 | cur_map = json.loads(line) 68 | 69 | if cur_map["plan"]["distance"] > THRESHOLD_DIS: 70 | cur_map["plan"]["distance"] = int(THRESHOLD_DIS) 71 | elif cur_map["plan"]["distance"] > 0: 72 | cur_map["plan"]["distance"] = int(cur_map["plan"]["distance"] / SCALE_DIS) 73 | 74 | if cur_map["plan"]["price"] and cur_map["plan"]["price"] > THRESHOLD_PRICE: 75 | cur_map["plan"]["price"] = int(THRESHOLD_PRICE) 76 | elif not cur_map["plan"]["price"] or cur_map["plan"]["price"] < 0: 77 | cur_map["plan"]["price"] = 0 78 | else: 79 | cur_map["plan"]["price"] = int(cur_map["plan"]["price"] / SCALE_PRICE) 80 | 81 | if cur_map["plan"]["eta"] > THRESHOLD_ETA: 82 | cur_map["plan"]["eta"] = int(THRESHOLD_ETA) 83 | elif cur_map["plan"]["eta"] > 0: 84 | cur_map["plan"]["eta"] = int(cur_map["plan"]["eta"] / SCALE_ETA) 85 | 86 | # o1 87 | if cur_map["query"]["o1"] > O1_MAX: 88 | cur_map["query"]["o1"] = int((O1_MAX - O1_MIN) / SCALE_OD + 1) 89 | elif cur_map["query"]["o1"] < O1_MIN: 90 | cur_map["query"]["o1"] = 0 91 | else: 92 | cur_map["query"]["o1"] = int((cur_map["query"]["o1"] - O1_MIN) / 0.02) 93 | 94 | # o2 95 | if cur_map["query"]["o2"] > O2_MAX: 96 | cur_map["query"]["o2"] = int((O2_MAX - O2_MIN) / SCALE_OD + 1) 97 | elif cur_map["query"]["o2"] < O2_MIN: 98 | cur_map["query"]["o2"] = 0 99 | else: 100 | cur_map["query"]["o2"] = int((cur_map["query"]["o2"] - O2_MIN) / 0.02) 101 | 102 | # d1 103 | if cur_map["query"]["d1"] > D1_MAX: 104 | cur_map["query"]["d1"] = int((D1_MAX - D1_MIN) / SCALE_OD + 1) 105 | elif cur_map["query"]["d1"] < D1_MIN: 106 | cur_map["query"]["d1"] = 0 107 | else: 108 | cur_map["query"]["d1"] = int((cur_map["query"]["d1"] - D1_MIN) / SCALE_OD) 109 | 110 | # d2 111 | if cur_map["query"]["d2"] > D2_MAX: 112 | cur_map["query"]["d2"] = int((D2_MAX - D2_MIN) / SCALE_OD + 1) 113 | elif cur_map["query"]["d2"] < D2_MIN: 114 | cur_map["query"]["d2"] = 0 115 | else: 116 | cur_map["query"]["d2"] = int((cur_map["query"]["d2"] - D2_MIN) / SCALE_OD) 117 | 118 | cur_json_instance = json.dumps(cur_map) 119 | nf.write(cur_json_instance + '\n') 120 | 121 | 122 | def preprocess(): 123 | """ 124 | Construct the train data indexed by session id and mode id jointly. 
Convert some of the raw features (user profile,
125 | od pair, req time, click time, eta, price, distance, transport mode) to one-hot ids used for
126 | embedding. We split the one-hot features into two categories, user features and context features, for a
127 | better understanding of the FM algorithm.
128 | Note that the user profile is already provided in one-hot encoded form; we convert it back to
129 | ids for consistency with the context features and easy use of the PaddlePaddle embedding layer. Given the
130 | train clicks data, we label each train instance with 1 or 0 depending on whether the instance was clicked or
131 | not.
132 | :return:
133 | """
134 |
135 | train_data_dict = {}
136 | with open("./weather.json", 'r') as f:
137 | weather_dict = json.load(f)
138 |
139 | with open(TEST_QUERIES_PATH, 'r') as f:
140 | csv_reader = csv.reader(f, delimiter=',')
141 | train_index_list = []
142 | for k, line in enumerate(csv_reader):
143 | if k == 0: continue
144 | if line[0] == "": continue
145 | if line[1] == "":
146 | train_index_list.append(line[0] + "_0")
147 | else:
148 | train_index_list.append(line[0] + "_" + line[1])
149 |
150 | train_index = line[0]
151 | train_data_dict[train_index] = {}
152 | train_data_dict[train_index]["pid"] = line[1]
153 | train_data_dict[train_index]["query"] = {}
154 |
155 | reqweekday = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%w")
156 | reqhour = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%H")
157 |
158 | date_key = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%m-%d")
159 | train_data_dict[train_index]["weather"] = {}
160 | train_data_dict[train_index]["weather"].update({"max_temp": weather_dict[date_key]["max_temp"]})
161 | train_data_dict[train_index]["weather"].update({"min_temp": weather_dict[date_key]["min_temp"]})
162 | train_data_dict[train_index]["weather"].update({"wea": weather_dict[date_key]["weather"]})
163 | train_data_dict[train_index]["weather"].update({"wind": weather_dict[date_key]["wind"]})
164 |
165 | train_data_dict[train_index]["query"].update({"weekday":reqweekday})
166 | train_data_dict[train_index]["query"].update({"hour":reqhour})
167 |
168 | o = line[3].split(',')
169 | o_first = o[0]
170 | o_second = o[1]
171 | train_data_dict[train_index]["query"].update({"o1":float(o_first)})
172 | train_data_dict[train_index]["query"].update({"o2":float(o_second)})
173 |
174 | d = line[4].split(',')
175 | d_first = d[0]
176 | d_second = d[1]
177 | train_data_dict[train_index]["query"].update({"d1":float(d_first)})
178 | train_data_dict[train_index]["query"].update({"d2":float(d_second)})
179 |
180 | plan_map = {}
181 | plan_data = pd.read_csv(TEST_PLANS_PATH)
182 | for index, row in plan_data.iterrows():
183 | plans_str = row['plans']
184 | plans_list = json.loads(plans_str)
185 | session_id = str(row['sid'])
186 | # train_data_dict[session_id]["plans"] = []
187 | plan_map[session_id] = plans_list
188 |
189 | profile_map = {}
190 | with open(PROFILES_PATH, 'r') as f:
191 | csv_reader = csv.reader(f, delimiter=',')
192 | for k, line in enumerate(csv_reader):
193 | if k == 0: continue
194 | profile_map[line[0]] = [i for i in range(len(line)) if line[i] == "1.0"]
195 |
196 | session_click_map = {}
197 | with open(TRAIN_CLICK_PATH, 'r') as f:
198 | csv_reader = csv.reader(f, delimiter=',')
199 | for k, line in enumerate(csv_reader):
200 | if k == 0: continue
201 | if line[0] == "" or line[1] == "" or line[2] == "":
202 | continue
203 | session_click_map[line[0]] = line[2]
204 | #return
train_data_dict, profile_map, session_click_map, plan_map 205 | generate_sparse_features(train_data_dict, profile_map, session_click_map, plan_map) 206 | 207 | 208 | def generate_sparse_features(train_data_dict, profile_map, session_click_map, plan_map): 209 | if not os.path.isdir("./out/"): 210 | os.mkdir("./out/") 211 | with open(os.path.join("./out/", "test_session.txt"), 'w') as f_train: 212 | for session_id, plan_list in plan_map.items(): 213 | if session_id not in train_data_dict: 214 | continue 215 | cur_map = train_data_dict[session_id] 216 | cur_map["session_id"] = session_id 217 | if cur_map["pid"] != "": 218 | cur_map["profile"] = profile_map[cur_map["pid"]] 219 | else: 220 | cur_map["profile"] = [0] 221 | del cur_map["pid"] 222 | whole_rank = 0 223 | for plan in plan_list: 224 | whole_rank += 1 225 | cur_map["mode_rank" + str(whole_rank)] = plan["transport_mode"] 226 | 227 | if whole_rank < 5: 228 | for r in range(whole_rank + 1, 6): 229 | cur_map["mode_rank" + str(r)] = -1 230 | 231 | cur_map["whole_rank"] = whole_rank 232 | flag_click = False 233 | rank = 1 234 | 235 | price_list = [] 236 | eta_list = [] 237 | distance_list = [] 238 | for plan in plan_list: 239 | if not plan["price"]: 240 | price_list.append(0) 241 | else: 242 | price_list.append(int(plan["price"])) 243 | eta_list.append(int(plan["eta"])) 244 | distance_list.append(int(plan["distance"])) 245 | price_list.sort(reverse=False) 246 | eta_list.sort(reverse=False) 247 | distance_list.sort(reverse=False) 248 | 249 | for plan in plan_list: 250 | if plan["price"] and int(plan["price"]) == price_list[0]: 251 | cur_map["mode_min_price"] = plan["transport_mode"] 252 | if plan["price"] and int(plan["price"]) == price_list[-1]: 253 | cur_map["mode_max_price"] = plan["transport_mode"] 254 | if int(plan["eta"]) == eta_list[0]: 255 | cur_map["mode_min_eta"] = plan["transport_mode"] 256 | if int(plan["eta"]) == eta_list[-1]: 257 | cur_map["mode_max_eta"] = plan["transport_mode"] 258 | if int(plan["distance"]) == distance_list[0]: 259 | cur_map["mode_min_distance"] = plan["transport_mode"] 260 | if int(plan["distance"]) == distance_list[-1]: 261 | cur_map["mode_max_distance"] = plan["transport_mode"] 262 | if "mode_min_price" not in cur_map: 263 | cur_map["mode_min_price"] = -1 264 | if "mode_max_price" not in cur_map: 265 | cur_map["mode_max_price"] = -1 266 | 267 | 268 | for plan in plan_list: 269 | cur_price = int(plan["price"]) if plan["price"] else 0 270 | cur_eta = int(plan["eta"]) 271 | cur_distance = int(plan["distance"]) 272 | cur_map["price_rank"] = price_list.index(cur_price) + 1 273 | cur_map["eta_rank"] = eta_list.index(cur_eta) + 1 274 | cur_map["distance_rank"] = distance_list.index(cur_distance) + 1 275 | 276 | if ("transport_mode" in plan) and (session_id in session_click_map) and ( 277 | int(plan["transport_mode"]) == int(session_click_map[session_id])): 278 | cur_map["plan"] = plan 279 | cur_map["label"] = 1 280 | flag_click = True 281 | # print("label is 1") 282 | else: 283 | cur_map["plan"] = plan 284 | cur_map["label"] = 0 285 | 286 | cur_map["plan_rank"] = rank 287 | rank += 1 288 | cur_json_instance = json.dumps(cur_map) 289 | f_train.write(cur_json_instance + '\n') 290 | 291 | cur_map["plan"]["distance"] = -1 292 | cur_map["plan"]["price"] = -1 293 | cur_map["plan"]["eta"] = -1 294 | cur_map["plan"]["transport_mode"] = 0 295 | cur_map["plan_rank"] = 0 296 | cur_map["price_rank"] = 0 297 | cur_map["eta_rank"] = 0 298 | cur_map["plan_rank"] = 0 299 | cur_map["label"] = 1 300 | cur_json_instance = 
json.dumps(cur_map)
301 | f_train.write(cur_json_instance + '\n')
302 |
303 | build_norm_feature()
304 |
305 |
306 | if __name__ == "__main__":
307 | preprocess()
--------------------------------------------------------------------------------
/pre_test_dense.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 |
16 | import os, sys, time, random, csv, datetime, json
17 | import pandas as pd
18 | import numpy as np
19 | import argparse
20 | import logging
21 | import time
22 |
23 | logging.basicConfig(
24 | format='%(asctime)s - %(levelname)s - %(message)s')
25 | logger = logging.getLogger("preprocess")
26 | logger.setLevel(logging.INFO)
27 |
28 | TRAIN_QUERIES_PATH = "./data_set_phase1/test_queries.csv"
29 | TRAIN_PLANS_PATH = "./data_set_phase1/test_plans.csv"
30 | TRAIN_CLICK_PATH = "./data_set_phase1/train_clicks.csv"
31 | PROFILES_PATH = "./data_set_phase1/profiles.csv"
32 |
33 | O1_MIN = 115.47
34 | O1_MAX = 117.29
35 |
36 | O2_MIN = 39.46
37 | O2_MAX = 40.97
38 |
39 | D1_MIN = 115.44
40 | D1_MAX = 117.37
41 |
42 | D2_MIN = 39.46
43 | D2_MAX = 40.96
44 |
45 | DISTANCE_MIN = 1.0
46 | DISTANCE_MAX = 225864.0
47 | THRESHOLD_DIS = 200000.0
48 |
49 | PRICE_MIN = 200.0
50 | PRICE_MAX = 92300.0
51 | THRESHOLD_PRICE = 20000
52 |
53 | ETA_MIN = 1.0
54 | ETA_MAX = 72992.0
55 | THRESHOLD_ETA = 10800.0
56 |
57 |
58 | def build_norm_feature():
59 | with open("./out/normed_test_session.txt", 'w') as nf:
60 | with open("./out/test_session.txt", 'r') as f:
61 | for line in f:
62 | cur_map = json.loads(line)
63 |
64 | cur_map["plan"]["distance"] = (cur_map["plan"]["distance"] - DISTANCE_MIN) / (DISTANCE_MAX - DISTANCE_MIN)
65 |
66 | if cur_map["plan"]["price"]:
67 | cur_map["plan"]["price"] = (cur_map["plan"]["price"] - PRICE_MIN) / (PRICE_MAX - PRICE_MIN)
68 | else:
69 | cur_map["plan"]["price"] = 0.0
70 |
71 | cur_map["plan"]["eta"] = (cur_map["plan"]["eta"] - ETA_MIN) / (ETA_MAX - ETA_MIN)
72 |
73 | cur_json_instance = json.dumps(cur_map)
74 | nf.write(cur_json_instance + '\n')
75 |
76 |
77 | def preprocess():
78 | """
79 | Construct the train data indexed by session id and mode id jointly. Convert all the raw features (user profile,
80 | od pair, req time, click time, eta, price, distance, transport mode) to one-hot ids used for
81 | embedding. We split the one-hot features into two categories, user features and context features, for a
82 | better understanding of the FFM algorithm.
83 | Note that the user profile is already provided in one-hot encoded form; we convert it back to
84 | ids for consistency with the context features and easy use of the PaddlePaddle embedding layer. Given the
85 | train clicks data, we label each train instance with 1 or 0 depending on whether the instance was clicked or
86 | not.
87 | :return: 88 | """ 89 | #args = parse_args() 90 | 91 | train_data_dict = {} 92 | with open("./weather.json", 'r') as f: 93 | weather_dict = json.load(f) 94 | 95 | with open(TRAIN_QUERIES_PATH, 'r') as f: 96 | csv_reader = csv.reader(f, delimiter=',') 97 | train_index_list = [] 98 | for k, line in enumerate(csv_reader): 99 | if k == 0: continue 100 | if line[0] == "": continue 101 | if line[1] == "": 102 | train_index_list.append(line[0] + "_0") 103 | else: 104 | train_index_list.append(line[0] + "_" + line[1]) 105 | 106 | train_index = line[0] 107 | train_data_dict[train_index] = {} 108 | train_data_dict[train_index]["pid"] = line[1] 109 | train_data_dict[train_index]["query"] = {} 110 | 111 | reqweekday = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%w") 112 | reqhour = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%H") 113 | 114 | date_key = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%m-%d") 115 | train_data_dict[train_index]["weather"] = {} 116 | train_data_dict[train_index]["weather"].update({"max_temp": weather_dict[date_key]["max_temp"]}) 117 | train_data_dict[train_index]["weather"].update({"min_temp": weather_dict[date_key]["min_temp"]}) 118 | train_data_dict[train_index]["weather"].update({"wea": weather_dict[date_key]["weather"]}) 119 | train_data_dict[train_index]["weather"].update({"wind": weather_dict[date_key]["wind"]}) 120 | 121 | train_data_dict[train_index]["query"].update({"weekday":reqweekday}) 122 | train_data_dict[train_index]["query"].update({"hour":reqhour}) 123 | 124 | o = line[3].split(',') 125 | o_first = o[0] 126 | o_second = o[1] 127 | train_data_dict[train_index]["query"].update({"o1":float(o_first)}) 128 | train_data_dict[train_index]["query"].update({"o2":float(o_second)}) 129 | 130 | d = line[4].split(',') 131 | d_first = d[0] 132 | d_second = d[1] 133 | train_data_dict[train_index]["query"].update({"d1":float(d_first)}) 134 | train_data_dict[train_index]["query"].update({"d2":float(d_second)}) 135 | 136 | plan_map = {} 137 | plan_data = pd.read_csv(TRAIN_PLANS_PATH) 138 | for index, row in plan_data.iterrows(): 139 | plans_str = row['plans'] 140 | plans_list = json.loads(plans_str) 141 | session_id = str(row['sid']) 142 | # train_data_dict[session_id]["plans"] = [] 143 | plan_map[session_id] = plans_list 144 | 145 | profile_map = {} 146 | with open(PROFILES_PATH, 'r') as f: 147 | csv_reader = csv.reader(f, delimiter=',') 148 | for k, line in enumerate(csv_reader): 149 | if k == 0: continue 150 | profile_map[line[0]] = [i for i in range(len(line)) if line[i] == "1.0"] 151 | 152 | session_click_map = {} 153 | with open(TRAIN_CLICK_PATH, 'r') as f: 154 | csv_reader = csv.reader(f, delimiter=',') 155 | for k, line in enumerate(csv_reader): 156 | if k == 0: continue 157 | if line[0] == "" or line[1] == "" or line[2] == "": 158 | continue 159 | session_click_map[line[0]] = line[2] 160 | #return train_data_dict, profile_map, session_click_map, plan_map 161 | generate_sparse_features(train_data_dict, profile_map, session_click_map, plan_map) 162 | 163 | 164 | def generate_sparse_features(train_data_dict, profile_map, session_click_map, plan_map): 165 | if not os.path.isdir("./out/"): 166 | os.mkdir("./out/") 167 | with open(os.path.join("./out/", "test_session.txt"), 'w') as f_train: 168 | for session_id, plan_list in plan_map.items(): 169 | if session_id not in train_data_dict: 170 | continue 171 | cur_map = train_data_dict[session_id] 172 | cur_map["session_id"] = session_id 173 | if 
cur_map["pid"] != "": 174 | cur_map["profile"] = profile_map[cur_map["pid"]] 175 | else: 176 | cur_map["profile"] = [0] 177 | # del cur_map["pid"] 178 | whole_rank = 0 179 | for plan in plan_list: 180 | whole_rank += 1 181 | cur_map["mode_rank" + str(whole_rank)] = plan["transport_mode"] 182 | 183 | if whole_rank < 5: 184 | for r in range(whole_rank + 1, 6): 185 | cur_map["mode_rank" + str(r)] = -1 186 | 187 | cur_map["whole_rank"] = whole_rank 188 | rank = 1 189 | 190 | price_list = [] 191 | eta_list = [] 192 | distance_list = [] 193 | for plan in plan_list: 194 | if not plan["price"]: 195 | price_list.append(0) 196 | else: 197 | price_list.append(int(plan["price"])) 198 | eta_list.append(int(plan["eta"])) 199 | distance_list.append(int(plan["distance"])) 200 | price_list.sort(reverse=False) 201 | eta_list.sort(reverse=False) 202 | distance_list.sort(reverse=False) 203 | 204 | for plan in plan_list: 205 | if plan["price"] and int(plan["price"]) == price_list[0]: 206 | cur_map["mode_min_price"] = plan["transport_mode"] 207 | if plan["price"] and int(plan["price"]) == price_list[-1]: 208 | cur_map["mode_max_price"] = plan["transport_mode"] 209 | if int(plan["eta"]) == eta_list[0]: 210 | cur_map["mode_min_eta"] = plan["transport_mode"] 211 | if int(plan["eta"]) == eta_list[-1]: 212 | cur_map["mode_max_eta"] = plan["transport_mode"] 213 | if int(plan["distance"]) == distance_list[0]: 214 | cur_map["mode_min_distance"] = plan["transport_mode"] 215 | if int(plan["distance"]) == distance_list[-1]: 216 | cur_map["mode_max_distance"] = plan["transport_mode"] 217 | if "mode_min_price" not in cur_map: 218 | cur_map["mode_min_price"] = -1 219 | if "mode_max_price" not in cur_map: 220 | cur_map["mode_max_price"] = -1 221 | 222 | for plan in plan_list: 223 | cur_price = int(plan["price"]) if plan["price"] else 0 224 | cur_eta = int(plan["eta"]) 225 | cur_distance = int(plan["distance"]) 226 | cur_map["price_rank"] = price_list.index(cur_price) + 1 227 | cur_map["eta_rank"] = eta_list.index(cur_eta) + 1 228 | cur_map["distance_rank"] = distance_list.index(cur_distance) + 1 229 | 230 | if ("transport_mode" in plan) and (session_id in session_click_map) and ( 231 | int(plan["transport_mode"]) == int(session_click_map[session_id])): 232 | cur_map["plan"] = plan 233 | cur_map["label"] = 1 234 | else: 235 | cur_map["plan"] = plan 236 | cur_map["label"] = 0 237 | 238 | cur_map["plan_rank"] = rank 239 | rank += 1 240 | cur_json_instance = json.dumps(cur_map) 241 | f_train.write(cur_json_instance + '\n') 242 | 243 | cur_map["plan"]["distance"] = -1 244 | cur_map["plan"]["price"] = -1 245 | cur_map["plan"]["eta"] = -1 246 | cur_map["plan"]["transport_mode"] = 0 247 | cur_map["plan_rank"] = 0 248 | cur_map["price_rank"] = 0 249 | cur_map["eta_rank"] = 0 250 | cur_map["plan_rank"] = 0 251 | cur_map["label"] = 1 252 | cur_json_instance = json.dumps(cur_map) 253 | f_train.write(cur_json_instance + '\n') 254 | 255 | 256 | build_norm_feature() 257 | 258 | 259 | if __name__ == "__main__": 260 | preprocess() -------------------------------------------------------------------------------- /preprocess.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 
5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | import os, sys, time, random, csv, datetime, json 16 | import pandas as pd 17 | import numpy as np 18 | import argparse 19 | import logging 20 | import time 21 | 22 | logging.basicConfig( 23 | format='%(asctime)s - %(levelname)s - %(message)s') 24 | logger = logging.getLogger("preprocess") 25 | logger.setLevel(logging.INFO) 26 | 27 | TRAIN_QUERIES_PATH = "./data_set_phase1/train_queries.csv" 28 | TRAIN_PLANS_PATH = "./data_set_phase1/train_plans.csv" 29 | TRAIN_CLICK_PATH = "./data_set_phase1/train_clicks.csv" 30 | PROFILES_PATH = "./data_set_phase1/profiles.csv" 31 | OUT_NORM_TRAIN_PATH = "./out/normed_train.txt" 32 | OUT_RAW_TRAIN_PATH = "./out/train.txt" 33 | 34 | OUT_DIR = "./out" 35 | 36 | 37 | O1_MIN = 115.47 38 | O1_MAX = 117.29 39 | 40 | O2_MIN = 39.46 41 | O2_MAX = 40.97 42 | 43 | D1_MIN = 115.44 44 | D1_MAX = 117.37 45 | 46 | D2_MIN = 39.46 47 | D2_MAX = 40.96 48 | SCALE_OD = 0.02 49 | 50 | DISTANCE_MIN = 1.0 51 | DISTANCE_MAX = 225864.0 52 | THRESHOLD_DIS = 40000.0 53 | SCALE_DIS = 500 54 | 55 | PRICE_MIN = 200.0 56 | PRICE_MAX = 92300.0 57 | THRESHOLD_PRICE = 20000 58 | SCALE_PRICE = 100 59 | 60 | ETA_MIN = 1.0 61 | ETA_MAX = 72992.0 62 | THRESHOLD_ETA = 10800.0 63 | SCALE_ETA = 120 64 | 65 | 66 | def build_norm_feature(): 67 | with open(OUT_NORM_TRAIN_PATH, 'w') as nf: 68 | with open(OUT_RAW_TRAIN_PATH, 'r') as f: 69 | for line in f: 70 | cur_map = json.loads(line) 71 | 72 | if cur_map["plan"]["distance"] > THRESHOLD_DIS: 73 | cur_map["plan"]["distance"] = int(THRESHOLD_DIS) 74 | elif cur_map["plan"]["distance"] > 0: 75 | cur_map["plan"]["distance"] = int(cur_map["plan"]["distance"] / SCALE_DIS) 76 | 77 | if cur_map["plan"]["price"] and cur_map["plan"]["price"] > THRESHOLD_PRICE: 78 | cur_map["plan"]["price"] = int(THRESHOLD_PRICE) 79 | elif not cur_map["plan"]["price"] or cur_map["plan"]["price"] < 0: 80 | cur_map["plan"]["price"] = 0 81 | else: 82 | cur_map["plan"]["price"] = int(cur_map["plan"]["price"] / SCALE_PRICE) 83 | 84 | if cur_map["plan"]["eta"] > THRESHOLD_ETA: 85 | cur_map["plan"]["eta"] = int(THRESHOLD_ETA) 86 | elif cur_map["plan"]["eta"] > 0: 87 | cur_map["plan"]["eta"] = int(cur_map["plan"]["eta"] / SCALE_ETA) 88 | 89 | # o1 90 | if cur_map["query"]["o1"] > O1_MAX: 91 | cur_map["query"]["o1"] = int((O1_MAX - O1_MIN) / SCALE_OD + 1) 92 | elif cur_map["query"]["o1"] < O1_MIN: 93 | cur_map["query"]["o1"] = 0 94 | else: 95 | cur_map["query"]["o1"] = int((cur_map["query"]["o1"] - O1_MIN) / 0.02) 96 | 97 | # o2 98 | if cur_map["query"]["o2"] > O2_MAX: 99 | cur_map["query"]["o2"] = int((O2_MAX - O2_MIN) / SCALE_OD + 1) 100 | elif cur_map["query"]["o2"] < O2_MIN: 101 | cur_map["query"]["o2"] = 0 102 | else: 103 | cur_map["query"]["o2"] = int((cur_map["query"]["o2"] - O2_MIN) / 0.02) 104 | 105 | # d1 106 | if cur_map["query"]["d1"] > D1_MAX: 107 | cur_map["query"]["d1"] = int((D1_MAX - D1_MIN) / SCALE_OD + 1) 108 | elif cur_map["query"]["d1"] < D1_MIN: 109 | cur_map["query"]["d1"] = 0 110 | else: 111 | cur_map["query"]["d1"] = int((cur_map["query"]["d1"] - D1_MIN) / SCALE_OD) 112 | 113 | # d2 114 | if 
cur_map["query"]["d2"] > D2_MAX: 115 | cur_map["query"]["d2"] = int((D2_MAX - D2_MIN) / SCALE_OD + 1) 116 | elif cur_map["query"]["d2"] < D2_MIN: 117 | cur_map["query"]["d2"] = 0 118 | else: 119 | cur_map["query"]["d2"] = int((cur_map["query"]["d2"] - D2_MIN) / SCALE_OD) 120 | 121 | cur_json_instance = json.dumps(cur_map) 122 | nf.write(cur_json_instance + '\n') 123 | 124 | 125 | def preprocess(): 126 | """ 127 | Construct the train data indexed by session id and mode id jointly. Convert all the raw features (user profile, 128 | od pair, req time, click time, eta, price, distance, transport mode) to one-hot ids used for 129 | embedding. We split the one-hot features into two categories: user feature and context feature for 130 | better understanding of FM algorithm. 131 | Note that the user profile is already provided by one-hot encoded form, we treat it as embedded vector 132 | for unity with the context feature and easily using of PaddlePaddle embedding layer. Given the 133 | train clicks data, we label each train instance with 1 or 0 depend on if this instance is clicked or 134 | not include non-click case. 135 | :return: 136 | """ 137 | 138 | train_data_dict = {} 139 | with open(TRAIN_QUERIES_PATH, 'r') as f: 140 | csv_reader = csv.reader(f, delimiter=',') 141 | train_index_list = [] 142 | for k, line in enumerate(csv_reader): 143 | if k == 0: continue 144 | if line[0] == "": continue 145 | if line[1] == "": 146 | train_index_list.append(line[0] + "_0") 147 | else: 148 | train_index_list.append(line[0] + "_" + line[1]) 149 | 150 | train_index = line[0] 151 | train_data_dict[train_index] = {} 152 | train_data_dict[train_index]["pid"] = line[1] 153 | train_data_dict[train_index]["query"] = {} 154 | 155 | reqweekday = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%w") 156 | reqhour = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%H") 157 | 158 | train_data_dict[train_index]["query"].update({"weekday":reqweekday}) 159 | train_data_dict[train_index]["query"].update({"hour":reqhour}) 160 | 161 | o = line[3].split(',') 162 | o_first = o[0] 163 | o_second = o[1] 164 | train_data_dict[train_index]["query"].update({"o1":float(o_first)}) 165 | train_data_dict[train_index]["query"].update({"o2":float(o_second)}) 166 | 167 | d = line[4].split(',') 168 | d_first = d[0] 169 | d_second = d[1] 170 | train_data_dict[train_index]["query"].update({"d1":float(d_first)}) 171 | train_data_dict[train_index]["query"].update({"d2":float(d_second)}) 172 | 173 | plan_map = {} 174 | plan_data = pd.read_csv(TRAIN_PLANS_PATH) 175 | for index, row in plan_data.iterrows(): 176 | plans_str = row['plans'] 177 | plans_list = json.loads(plans_str) 178 | session_id = str(row['sid']) 179 | # train_data_dict[session_id]["plans"] = [] 180 | plan_map[session_id] = plans_list 181 | 182 | profile_map = {} 183 | with open(PROFILES_PATH, 'r') as f: 184 | csv_reader = csv.reader(f, delimiter=',') 185 | for k, line in enumerate(csv_reader): 186 | if k == 0: continue 187 | profile_map[line[0]] = [i for i in range(len(line)) if line[i] == "1.0"] 188 | 189 | session_click_map = {} 190 | with open(TRAIN_CLICK_PATH, 'r') as f: 191 | csv_reader = csv.reader(f, delimiter=',') 192 | for k, line in enumerate(csv_reader): 193 | if k == 0: continue 194 | if line[0] == "" or line[1] == "" or line[2] == "": 195 | continue 196 | session_click_map[line[0]] = line[2] 197 | #return train_data_dict, profile_map, session_click_map, plan_map 198 | generate_sparse_features(train_data_dict, profile_map, 
session_click_map, plan_map) 199 | 200 | 201 | def generate_sparse_features(train_data_dict, profile_map, session_click_map, plan_map): 202 | if not os.path.isdir(OUT_DIR): 203 | os.mkdir(OUT_DIR) 204 | with open(os.path.join("./out/", "train.txt"), 'w') as f_train: 205 | for session_id, plan_list in plan_map.items(): 206 | if session_id not in train_data_dict: 207 | continue 208 | cur_map = train_data_dict[session_id] 209 | if cur_map["pid"] != "": 210 | cur_map["profile"] = profile_map[cur_map["pid"]] 211 | else: 212 | cur_map["profile"] = [0] 213 | del cur_map["pid"] 214 | whole_rank = 0 215 | for plan in plan_list: 216 | whole_rank += 1 217 | cur_map["whole_rank"] = whole_rank 218 | flag_click = False 219 | rank = 1 220 | 221 | 222 | for plan in plan_list: 223 | 224 | if ("transport_mode" in plan) and (session_id in session_click_map) and ( 225 | int(plan["transport_mode"]) == int(session_click_map[session_id])): 226 | cur_map["plan"] = plan 227 | cur_map["label"] = 1 228 | flag_click = True 229 | # print("label is 1") 230 | else: 231 | cur_map["plan"] = plan 232 | cur_map["label"] = 0 233 | 234 | cur_map["rank"] = rank 235 | rank += 1 236 | cur_json_instance = json.dumps(cur_map) 237 | f_train.write(cur_json_instance + '\n') 238 | if not flag_click: 239 | cur_map["plan"]["distance"] = -1 240 | cur_map["plan"]["price"] = -1 241 | cur_map["plan"]["eta"] = -1 242 | cur_map["plan"]["transport_mode"] = 0 243 | cur_map["rank"] = 0 244 | cur_map["label"] = 1 245 | cur_json_instance = json.dumps(cur_map) 246 | f_train.write(cur_json_instance + '\n') 247 | else: 248 | cur_map["plan"]["distance"] = -1 249 | cur_map["plan"]["price"] = -1 250 | cur_map["plan"]["eta"] = -1 251 | cur_map["plan"]["transport_mode"] = 0 252 | cur_map["rank"] = 0 253 | cur_map["label"] = 0 254 | cur_json_instance = json.dumps(cur_map) 255 | f_train.write(cur_json_instance + '\n') 256 | 257 | 258 | build_norm_feature() 259 | 260 | 261 | if __name__ == "__main__": 262 | preprocess() 263 | -------------------------------------------------------------------------------- /preprocess_dense.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
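preprocess.py above boils down to a simple labeling rule in generate_sparse_features(): one instance per recommended plan, labeled 1 only for the clicked transport mode, plus one synthetic transport mode 0 instance per session that is labeled 1 exactly when the session has no click. A runnable distillation with hypothetical data (session ids, modes, and dict keys trimmed to the essentials):
```python
import json

session_id = "1"
plan_list = [{"transport_mode": 3}, {"transport_mode": 7}]  # hypothetical plans
session_click_map = {"1": "7"}                              # sid -> clicked mode

instances = []
for rank, plan in enumerate(plan_list, start=1):
    clicked = int(plan["transport_mode"]) == int(session_click_map.get(session_id, -1))
    instances.append({"plan": plan, "rank": rank, "label": int(clicked)})

# The appended no-click instance: mode 0, rank 0, positive only if nothing was clicked.
flag_click = any(inst["label"] == 1 for inst in instances)
instances.append({"plan": {"transport_mode": 0, "distance": -1, "price": -1, "eta": -1},
                  "rank": 0, "label": 0 if flag_click else 1})

for inst in instances:
    print(json.dumps(inst))
```
preprocess_dense.py below keeps the same rule but subsamples the mode 0 instances of clicked sessions with THRESHOLD_LABEL.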
14 |
15 | import os, random, csv, datetime, json
16 | import pandas as pd
17 | import numpy as np
18 | import argparse
19 | import logging
20 | import time
21 |
22 | logging.basicConfig(
23 | format='%(asctime)s - %(levelname)s - %(message)s')
24 | logger = logging.getLogger("preprocess")
25 | logger.setLevel(logging.INFO)
26 |
27 | TRAIN_QUERIES_PATH = "./data_set_phase1/train_queries.csv"
28 | TRAIN_PLANS_PATH = "./data_set_phase1/train_plans.csv"
29 | TRAIN_CLICK_PATH = "./data_set_phase1/train_clicks.csv"
30 | PROFILES_PATH = "./data_set_phase1/profiles.csv"
31 |
32 | OUT_DIR = "./out"
33 | ORI_TRAIN_PATH = "train.txt"
34 | NORM_TRAIN_PATH = "normed_train.txt"
35 | # variable to control the ratio of positive and negative instances of transport mode 0, which is the original label for the no-click case
36 | THRESHOLD_LABEL = 0.5
37 |
38 |
39 |
40 | O1_MIN = 115.47
41 | O1_MAX = 117.29
42 |
43 | O2_MIN = 39.46
44 | O2_MAX = 40.97
45 |
46 | D1_MIN = 115.44
47 | D1_MAX = 117.37
48 |
49 | D2_MIN = 39.46
50 | D2_MAX = 40.96
51 |
52 | DISTANCE_MIN = 1.0
53 | DISTANCE_MAX = 225864.0
54 | THRESHOLD_DIS = 200000.0
55 |
56 | PRICE_MIN = 200.0
57 | PRICE_MAX = 92300.0
58 | THRESHOLD_PRICE = 20000
59 |
60 | ETA_MIN = 1.0
61 | ETA_MAX = 72992.0
62 | THRESHOLD_ETA = 10800.0
63 |
64 |
65 | def build_norm_feature():
66 | with open(os.path.join(OUT_DIR, NORM_TRAIN_PATH), 'w') as nf:
67 | with open(os.path.join(OUT_DIR, ORI_TRAIN_PATH), 'r') as f:
68 | for line in f:
69 | cur_map = json.loads(line)
70 |
71 | cur_map["plan"]["distance"] = (cur_map["plan"]["distance"] - DISTANCE_MIN) / (DISTANCE_MAX - DISTANCE_MIN)
72 |
73 | if cur_map["plan"]["price"]:
74 | cur_map["plan"]["price"] = (cur_map["plan"]["price"] - PRICE_MIN) / (PRICE_MAX - PRICE_MIN)
75 | else:
76 | cur_map["plan"]["price"] = 0.0
77 |
78 | cur_map["plan"]["eta"] = (cur_map["plan"]["eta"] - ETA_MIN) / (ETA_MAX - ETA_MIN)
79 |
80 | cur_json_instance = json.dumps(cur_map)
81 | nf.write(cur_json_instance + '\n')
82 |
83 |
84 | def preprocess():
85 | """
86 | Construct the train data indexed by session id and mode id jointly. Convert all the raw features (user profile,
87 | od pair, req time, click time, eta, price, distance, transport mode) to one-hot ids used for
88 | embedding. We split the one-hot features into two categories, user features and context features, for a
89 | better understanding of the FM algorithm.
90 | Note that the user profile is already provided in one-hot encoded form; we treat it as an embedded vector
91 | for consistency with the context features and easy use of the PaddlePaddle embedding layer. Given the
92 | train clicks data, we label each train instance with 1 or 0 depending on whether it was clicked, and we
93 | additionally generate transport mode 0 instances to cover the non-click case (this scheme is still to be changed).
94 | :return:
95 | """
96 |
97 | train_data_dict = {}
98 |
99 | with open("./weather.json", 'r') as f:
100 | weather_dict = json.load(f)
101 |
102 | with open(TRAIN_QUERIES_PATH, 'r') as f:
103 | csv_reader = csv.reader(f, delimiter=',')
104 | train_index_list = []
105 | for k, line in enumerate(csv_reader):
106 | if k == 0: continue
107 | if line[0] == "": continue
108 | if line[1] == "":
109 | train_index_list.append(line[0] + "_0")
110 | else:
111 | train_index_list.append(line[0] + "_" + line[1])
112 |
113 | train_index = line[0]
114 | train_data_dict[train_index] = {}
115 | train_data_dict[train_index]["pid"] = line[1]
116 | train_data_dict[train_index]["query"] = {}
117 | train_data_dict[train_index]["weather"] = {}
118 |
119 | reqweekday = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%w")
120 | reqhour = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%H")
121 |
122 | # weather-related features; of little use so far, though more detailed weather information might help
123 | date_key = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%m-%d")
124 | train_data_dict[train_index]["weather"] = {}
125 | train_data_dict[train_index]["weather"].update({"max_temp": weather_dict[date_key]["max_temp"]})
126 | train_data_dict[train_index]["weather"].update({"min_temp": weather_dict[date_key]["min_temp"]})
127 | train_data_dict[train_index]["weather"].update({"wea": weather_dict[date_key]["weather"]})
128 | train_data_dict[train_index]["weather"].update({"wind": weather_dict[date_key]["wind"]})
129 |
130 | train_data_dict[train_index]["query"].update({"weekday":reqweekday})
131 | train_data_dict[train_index]["query"].update({"hour":reqhour})
132 |
133 | o = line[3].split(',')
134 | o_first = o[0]
135 | o_second = o[1]
136 | train_data_dict[train_index]["query"].update({"o1":float(o_first)})
137 | train_data_dict[train_index]["query"].update({"o2":float(o_second)})
138 |
139 | d = line[4].split(',')
140 | d_first = d[0]
141 | d_second = d[1]
142 | train_data_dict[train_index]["query"].update({"d1":float(d_first)})
143 | train_data_dict[train_index]["query"].update({"d2":float(d_second)})
144 |
145 | plan_map = {}
146 | plan_data = pd.read_csv(TRAIN_PLANS_PATH)
147 | for index, row in plan_data.iterrows():
148 | plans_str = row['plans']
149 | plans_list = json.loads(plans_str)
150 | session_id = str(row['sid'])
151 | # train_data_dict[session_id]["plans"] = []
152 | plan_map[session_id] = plans_list
153 |
154 | profile_map = {}
155 | with open(PROFILES_PATH, 'r') as f:
156 | csv_reader = csv.reader(f, delimiter=',')
157 | for k, line in enumerate(csv_reader):
158 | if k == 0: continue
159 | profile_map[line[0]] = [i for i in range(len(line)) if line[i] == "1.0"]
160 |
161 | session_click_map = {}
162 | with open(TRAIN_CLICK_PATH, 'r') as f:
163 | csv_reader = csv.reader(f, delimiter=',')
164 | for k, line in enumerate(csv_reader):
165 | if k == 0: continue
166 | if line[0] == "" or line[1] == "" or line[2] == "":
167 | continue
168 | session_click_map[line[0]] = line[2]
169 | #return train_data_dict, profile_map, session_click_map, plan_map
170 | generate_sparse_features(train_data_dict, profile_map, session_click_map, plan_map)
171 |
172 |
173 | def generate_sparse_features(train_data_dict, profile_map, session_click_map, plan_map):
174 | if not os.path.isdir(OUT_DIR):
175 | os.mkdir(OUT_DIR)
176 | with open(os.path.join(OUT_DIR, ORI_TRAIN_PATH), 'w') as f_train:
177 | for session_id, plan_list in plan_map.items():
178 | if session_id
not in train_data_dict: 179 | continue 180 | cur_map = train_data_dict[session_id] 181 | if cur_map["pid"] != "": 182 | cur_map["profile"] = profile_map[cur_map["pid"]] 183 | else: 184 | cur_map["profile"] = [0] 185 | 186 | #rank information related feature 187 | whole_rank = 0 188 | for plan in plan_list: 189 | whole_rank += 1 190 | cur_map["mode_rank" + str(whole_rank)] = plan["transport_mode"] 191 | 192 | if whole_rank < 5: 193 | for r in range(whole_rank + 1, 6): 194 | cur_map["mode_rank" + str(r)] = -1 195 | 196 | cur_map["whole_rank"] = whole_rank 197 | flag_click = False 198 | rank = 1 199 | 200 | price_list = [] 201 | eta_list = [] 202 | distance_list = [] 203 | for plan in plan_list: 204 | if not plan["price"]: 205 | price_list.append(0) 206 | else: 207 | price_list.append(int(plan["price"])) 208 | eta_list.append(int(plan["eta"])) 209 | distance_list.append(int(plan["distance"])) 210 | price_list.sort(reverse=False) 211 | eta_list.sort(reverse=False) 212 | distance_list.sort(reverse=False) 213 | 214 | for plan in plan_list: 215 | if plan["price"] and int(plan["price"]) == price_list[0]: 216 | cur_map["mode_min_price"] = plan["transport_mode"] 217 | if plan["price"] and int(plan["price"]) == price_list[-1]: 218 | cur_map["mode_max_price"] = plan["transport_mode"] 219 | if int(plan["eta"]) == eta_list[0]: 220 | cur_map["mode_min_eta"] = plan["transport_mode"] 221 | if int(plan["eta"]) == eta_list[-1]: 222 | cur_map["mode_max_eta"] = plan["transport_mode"] 223 | if int(plan["distance"]) == distance_list[0]: 224 | cur_map["mode_min_distance"] = plan["transport_mode"] 225 | if int(plan["distance"]) == distance_list[-1]: 226 | cur_map["mode_max_distance"] = plan["transport_mode"] 227 | if "mode_min_price" not in cur_map: 228 | cur_map["mode_min_price"] = -1 229 | if "mode_max_price" not in cur_map: 230 | cur_map["mode_max_price"] = -1 231 | 232 | for plan in plan_list: 233 | if ("transport_mode" in plan) and (session_id in session_click_map) and ( 234 | int(plan["transport_mode"]) == int(session_click_map[session_id])): 235 | flag_click = True 236 | if flag_click: 237 | 238 | for plan in plan_list: 239 | cur_price = int(plan["price"]) if plan["price"] else 0 240 | cur_eta = int(plan["eta"]) 241 | cur_distance = int(plan["distance"]) 242 | cur_map["price_rank"] = price_list.index(cur_price) + 1 243 | cur_map["eta_rank"] = eta_list.index(cur_eta) + 1 244 | cur_map["distance_rank"] = distance_list.index(cur_distance) + 1 245 | 246 | if ("transport_mode" in plan) and (session_id in session_click_map) and ( 247 | int(plan["transport_mode"]) == int(session_click_map[session_id])): 248 | cur_map["plan"] = plan 249 | cur_map["label"] = 1 250 | else: 251 | cur_map["plan"] = plan 252 | cur_map["label"] = 0 253 | 254 | cur_map["plan_rank"] = rank 255 | rank += 1 256 | cur_json_instance = json.dumps(cur_map) 257 | f_train.write(cur_json_instance + '\n') 258 | 259 | cur_map["plan"] = {} 260 | #since we define a new ctr task from original task, we use a basic way to generate instances of transport mode 0. 
261 | # There should be an optimal strategy to generate instances of transport mode 0
262 | if not flag_click:
263 | cur_map["plan"]["distance"] = -1
264 | cur_map["plan"]["price"] = -1
265 | cur_map["plan"]["eta"] = -1
266 | cur_map["plan"]["transport_mode"] = 0
267 | cur_map["plan_rank"] = 0
268 | cur_map["price_rank"] = 0
269 | cur_map["eta_rank"] = 0
270 | cur_map["distance_rank"] = 0
271 | cur_map["label"] = 1
272 | cur_json_instance = json.dumps(cur_map)
273 | f_train.write(cur_json_instance + '\n')
274 | else:
275 | if random.random() < THRESHOLD_LABEL:
276 | cur_map["plan"]["distance"] = -1
277 | cur_map["plan"]["price"] = -1
278 | cur_map["plan"]["eta"] = -1
279 | cur_map["plan"]["transport_mode"] = 0
280 | cur_map["plan_rank"] = 0
281 | cur_map["price_rank"] = 0
282 | cur_map["eta_rank"] = 0
283 | cur_map["distance_rank"] = 0
284 | cur_map["label"] = 0
285 | cur_json_instance = json.dumps(cur_map)
286 | f_train.write(cur_json_instance + '\n')
287 |
288 |
289 |
290 | build_norm_feature()
291 |
292 |
293 | if __name__ == "__main__":
294 | preprocess()
295 |
--------------------------------------------------------------------------------
/submit/readme:
--------------------------------------------------------------------------------
1 | This is the folder for the submit files
2 |
--------------------------------------------------------------------------------
/testres/readme:
--------------------------------------------------------------------------------
1 | This folder is for the inference results
2 |
--------------------------------------------------------------------------------
/weather.json:
--------------------------------------------------------------------------------
1 | {"10-01": {"max_temp": "24", "min_temp": "12", "weather": "q", "wind": "45"}, "10-02": {"max_temp": "24", "min_temp": "11", "weather": "q", "wind": "12"}, "10-03": {"max_temp": "25", "min_temp": "10", "weather": "q", "wind": "12"}, "10-04": {"max_temp": "25", "min_temp": "12", "weather": "q", "wind": "12"}, "10-05": {"max_temp": "24", "min_temp": "14", "weather": "dy", "wind": "12"}, "10-06": {"max_temp": "20", "min_temp": "8", "weather": "q", "wind": "45"}, "10-07": {"max_temp": "21", "min_temp": "7", "weather": "q", "wind": "12"}, "10-08": {"max_temp": "21", "min_temp": "8", "weather": "dy", "wind": "12"}, "10-09": {"max_temp": "15", "min_temp": "4", "weather": "dyq", "wind": "45"}, "10-10": {"max_temp": "17", "min_temp": "4", "weather": "dyq", "wind": "12"}, "10-11": {"max_temp": "18", "min_temp": "5", "weather": "qdy", "wind": "12"}, "10-12": {"max_temp": "20", "min_temp": "5", "weather": "dyq", "wind": "12"}, "10-13": {"max_temp": "20", "min_temp": "8", "weather": "dy", "wind": "12"}, "10-14": {"max_temp": "21", "min_temp": "10", "weather": "dy", "wind": "12"}, "10-15": {"max_temp": "17", "min_temp": "11", "weather": "xq", "wind": "12"}, "10-16": {"max_temp": "17", "min_temp": "7", "weather": "dyq", "wind": "12"}, "10-17": {"max_temp": "17", "min_temp": "5", "weather": "q", "wind": "12"}, "10-18": {"max_temp": "18", "min_temp": "5", "weather": "q", "wind": "12"}, "10-19": {"max_temp": "19", "min_temp": "7", "weather": "dy", "wind": "12"}, "10-20": {"max_temp": "18", "min_temp": "7", "weather": "dy", "wind": "12"}, "10-21": {"max_temp": "18", "min_temp": "7", "weather": "dy", "wind": "12"}, "10-22": {"max_temp": "19", "min_temp": "5", "weather": "dyq", "wind": "12"}, "10-23": {"max_temp": "19", "min_temp": "4", "weather": "q", "wind": "34"}, "10-24": {"max_temp": "20", "min_temp": "6", "weather": "qdy", "wind": "12"},
"10-25": {"max_temp": "15", "min_temp": "8", "weather": "dy", "wind": "12"}, "10-26": {"max_temp": "14", "min_temp": "3", "weather": "q", "wind": "45"}, "10-27": {"max_temp": "17", "min_temp": "5", "weather": "dy", "wind": "12"}, "10-28": {"max_temp": "17", "min_temp": "4", "weather": "dyq", "wind": "45"}, "10-29": {"max_temp": "15", "min_temp": "3", "weather": "q", "wind": "34"}, "10-30": {"max_temp": "16", "min_temp": "1", "weather": "q", "wind": "12"}, "10-31": {"max_temp": "17", "min_temp": "3", "weather": "q", "wind": "12"}, "11-01": {"max_temp": "17", "min_temp": "3", "weather": "q", "wind": "12"}, "11-02": {"max_temp": "18", "min_temp": "4", "weather": "q", "wind": "12"}, "11-03": {"max_temp": "16", "min_temp": "6", "weather": "dy", "wind": "12"}, "11-04": {"max_temp": "10", "min_temp": "2", "weather": "xydy", "wind": "34"}, "11-05": {"max_temp": "10", "min_temp": "2", "weather": "dy", "wind": "12"}, "11-06": {"max_temp": "12", "min_temp": "0", "weather": "dy", "wind": "12"}, "11-07": {"max_temp": "13", "min_temp": "3", "weather": "dy", "wind": "12"}, "11-08": {"max_temp": "14", "min_temp": "2", "weather": "dy", "wind": "12"}, "11-09": {"max_temp": "15", "min_temp": "1", "weather": "qdy", "wind": "34"}, "11-10": {"max_temp": "11", "min_temp": "0", "weather": "dy", "wind": "12"}, "11-11": {"max_temp": "13", "min_temp": "1", "weather": "dyq", "wind": "12"}, "11-12": {"max_temp": "14", "min_temp": "2", "weather": "q", "wind": "12"}, "11-13": {"max_temp": "13", "min_temp": "5", "weather": "dy", "wind": "12"}, "11-14": {"max_temp": "13", "min_temp": "5", "weather": "dy", "wind": "12"}, "11-15": {"max_temp": "8", "min_temp": "1", "weather": "xydy", "wind": "34"}, "11-16": {"max_temp": "8", "min_temp": "-1", "weather": "q", "wind": "12"}, "11-17": {"max_temp": "9", "min_temp": "-2", "weather": "dyq", "wind": "12"}, "11-18": {"max_temp": "11", "min_temp": "-3", "weather": "q", "wind": "34"}, "11-19": {"max_temp": "10", "min_temp": "-2", "weather": "qdy", "wind": "12"}, "11-20": {"max_temp": "9", "min_temp": "-1", "weather": "dy", "wind": "12"}, "11-21": {"max_temp": "9", "min_temp": "-3", "weather": "q", "wind": "2"}, "11-22": {"max_temp": "8", "min_temp": "-3", "weather": "qdy", "wind": "1"}, "11-23": {"max_temp": "7", "min_temp": "0", "weather": "dy", "wind": "2"}, "11-24": {"max_temp": "9", "min_temp": "-3", "weather": "qdy", "wind": "2"}, "11-25": {"max_temp": "10", "min_temp": "-3", "weather": "q", "wind": "1"}, "11-26": {"max_temp": "10", "min_temp": "0", "weather": "dy", "wind": "1"}, "11-27": {"max_temp": "9", "min_temp": "-3", "weather": "qdy", "wind": "2"}, "11-28": {"max_temp": "8", "min_temp": "-3", "weather": "q", "wind": "1"}, "11-29": {"max_temp": "7", "min_temp": "-4", "weather": "q", "wind": "1"}, "11-30": {"max_temp": "8", "min_temp": "-3", "weather": "q", "wind": "1"}, "12-01": {"max_temp": "7", "min_temp": "0", "weather": "dy", "wind": "1"}, "12-02": {"max_temp": "9", "min_temp": "2", "weather": "dy", "wind": "1"}, "12-03": {"max_temp": "8", "min_temp": "-3", "weather": "dyq", "wind": "3"}, "12-04": {"max_temp": "4", "min_temp": "-6", "weather": "qdy", "wind": "2"}, "12-05": {"max_temp": "1", "min_temp": "-4", "weather": "dy", "wind": "1"}, "12-06": {"max_temp": "-2", "min_temp": "-9", "weather": "q", "wind": "3"}, "12-07": {"max_temp": "-4", "min_temp": "-10", "weather": "q", "wind": "3"}, "12-08": {"max_temp": "-2", "min_temp": "-10", "weather": "qdy", "wind": "2"}, "12-09": {"max_temp": "-1", "min_temp": "-10", "weather": "dyq", "wind": "1"}} 
--------------------------------------------------------------------------------
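A note on weather.json: each key is a month-day string ("MM-DD") and each record stores max_temp, min_temp, a weather-condition code, and a wind value. The condition codes look like pinyin abbreviations (e.g. "q" for sunny, "dy" for cloudy) and the wind values like collapsed ranges ("12" for 1-2), but these readings are inferences from the data, not documented facts. The sketch below shows one plausible way to join these records onto a training instance by request date; the helper name add_weather_feature, the WEATHER_MAP encoding, and the req_time format are illustrative assumptions, not the repository's actual preprocessing code.
```python
import json

# Hypothetical categorical encoding of the condition strings found in weather.json.
WEATHER_MAP = {"q": 1, "dy": 2, "dyq": 3, "qdy": 4, "xq": 5, "xydy": 6}

def add_weather_feature(cur_map, req_time, weather_dict):
    """Attach per-day weather fields to one instance dict (illustrative only)."""
    day_key = req_time[5:10]  # "MM-DD" slice of e.g. "2018-10-01 08:15:23"
    record = weather_dict.get(day_key)
    if record is None:  # date missing from weather.json
        cur_map["max_temp"] = cur_map["min_temp"] = -1
        cur_map["wea"] = cur_map["wind"] = -1
        return cur_map
    cur_map["max_temp"] = int(record["max_temp"])
    cur_map["min_temp"] = int(record["min_temp"])
    cur_map["wea"] = WEATHER_MAP.get(record["weather"], 0)
    cur_map["wind"] = int(record["wind"])
    return cur_map

if __name__ == "__main__":
    with open("weather.json") as f:
        weather_dict = json.load(f)
    print(add_weather_feature({}, "2018-10-01 08:15:23", weather_dict))
```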