├── README.md
├── args.py
├── build_submit.py
├── build_submit_py3.py
├── data_set_phase1
    ├── ._profiles.csv
    ├── ._test_plans.csv
    ├── ._test_queries.csv
    ├── ._train_clicks.csv
    └── ._train_queries.csv
├── generate_test.py
├── generate_test_py3.py
├── infer.py
├── local_train.py
├── local_train_py3.py
├── map_reader.py
├── map_reader_mmh.py
├── network_confv6.py
├── networks
    ├── network_conf.py
    ├── network_confv4.py
    └── network_confv6.py
├── out
    └── readme
├── pre_process_test.py
├── pre_test_dense.py
├── preprocess.py
├── preprocess_dense.py
├── submit
    └── readme
├── testres
    └── readme
└── weather.json


/README.md:
--------------------------------------------------------------------------------
 1 | # Paddle_baseline_KDD2019
 2 | ## More Information Go To https://github.com/PaddlePaddle/models/tree/develop/PaddleRec/ctr/Paddle_baseline_KDD2019
 3 | Paddle baseline for KDD2019 "Context-Aware Multi-Modal Transportation Recommendation"(https://dianshi.baidu.com/competition/29/question)
 4 | 
 5 | This repository is the demo codes for the  KDD2019 "Context-Aware Multi-Modal Transportation Recommendation" competition using PaddlePaddle. It is written by python and uses PaddlePaddle to solve the task. Note that this repository is on developing and welcome everyone to contribute. The current baseline solution codes can get 0.68 - 0.69 score of online submission. As an example, my submission based on these networks programmed by PaddlePaddle is 0.6898.
 6 | The reason of the publication of this baseline codes is to encourage us to use PaddlePaddle and build the most powerful recommendation model via PaddlePaddle. 
 7 | 
 8 | The example codes are ran on Linux, python2.7, single machine with CPU. Currently, There are some Compatibility issues while using python3 (UPDATE: Currently, The codes can be run using python3, Please refer following instruction: "RUN ON Python3"). Note that distributed train options are not provided here, if you want to learn more about this, please check more modes examples on https://github.com/PaddlePaddle/models. About the speed of training, for one epoch, 1000 batch size, it would take about 8 mins to train the whole training instances generated from raw data using SGD optimizer (it would take relatively longer using Adam optimizer). 
 9 | 
10 | The configuration and process of all the networks are fundamental, a lot of optimizations can be done based on them to achieve better results e.g. better cost function, more powerful feature engineering, designed model validation, NN optimization tricks...
11 | 
12 | The code is rough and from my daily use. They will be trimmed these days...
13 | ## Install PaddlePaddle
14 | please visit the official site of PaddlePaddle(http://www.paddlepaddle.org/documentation/docs/zh/1.4/beginners_guide/install/index_cn.html) 
15 | ## preprocess feature
16 | ```python
17 | python preprocess_dense.py # change for different feature strategy
18 | python pre_test_dense.py
19 | #cd out
20 | split -a 2 -d -l 200000 normed_train.txt normed_train
21 | ```
22 | preprocess.py and preprocess_dense.py is the code for preprocessing the raw data. Two versions are provided to deal with all sparse features and sparse plus dense features. Correspondingly, pre_process_test.py and pre_test_dense.py are the codes to preproccess test raw data. The training instances are saved in json. It is very easy to add new features. In our demo, all features are generated from provided raw data except for weather feature, which is gengerated from open weather records.
23 | Note that the feature generated in this step need to fit in the input of the model input. Make sure we use the right version. In demo codes, The sparse plus dense features are used for network_confv6. 
24 | 
25 | ## build the network
26 | main network logic is in network_confv?.py. The networks are base on fm & deep related algorithms. I try several networks and public some of them. There may be some defects in the networks but all of them are functional. 
27 | 
28 | ## train the network
29 | ```python
30 | python local_train.py
31 | ```
32 | In local_train.py and map_reader.py, I use dataset API, so we need to download the corresponding .whl package or clone codes on develop branch of PaddlePaddle. The reason to use this is the speed of feeding data is much faster.
33 | Note that the input format feed into the network is self-defined. make sure we build the same format between training and test.  
34 | 
35 | ## test results
36 | ```python
37 | python generate_test.py
38 | python build_submit.py
39 | ```
40 | In generate_test.py and build_submit, for convenience, I use the whole train data to train the network and test the network with provided data without label
41 | ## RUN ON Python3
42 | Running on python3, run the following python files with _py3 postfix, and keep the same for the rest in python2
43 | ```python
44 | python local_train_py3.py
45 | python generate_test_py3.py
46 | python build_submit_py3.py
47 | ```
48 | 
49 | 
50 | 
51 | 
52 | 


--------------------------------------------------------------------------------
/args.py:
--------------------------------------------------------------------------------
 1 | import argparse
 2 | 
 3 | def parse_args():
 4 |         parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
 5 |         parser.add_argument(
 6 |             '--train_data_path',
 7 |             type=str,
 8 |             default='./data/raw/train.txt',
 9 |             help="The path of training dataset")
10 |         parser.add_argument(
11 |             '--test_data_path',
12 |             type=str,
13 |             default='./data/raw/valid.txt',
14 |             help="The path of testing dataset")
15 |         parser.add_argument(
16 |             '--batch_size',
17 |             type=int,
18 |             default=1000,
19 |             help="The size of mini-batch (default:1000)")
20 |         parser.add_argument(
21 |             '--embedding_size',
22 |             type=int,
23 |             default=16,
24 |             help="The size for embedding layer (default:10)")
25 |         parser.add_argument(
26 |             '--num_passes',
27 |             type=int,
28 |             default=10,
29 |             help="The number of passes to train (default: 10)")
30 |         parser.add_argument(
31 |             '--model_output_dir',
32 |             type=str,
33 |             default='models',
34 |             help='The path for model to store (default: models)')
35 |         parser.add_argument(
36 |             '--sparse_feature_dim',
37 |             type=int,
38 |             default=1000001,
39 |             help='sparse feature hashing space for index processing')
40 |         parser.add_argument(
41 |             '--is_local',
42 |             type=int,
43 |             default=1,
44 |             help='Local train or distributed train (default: 1)')
45 |         parser.add_argument(
46 |             '--cloud_train',
47 |             type=int,
48 |             default=0,
49 |             help='Local train or distributed train on paddlecloud (default: 0)')
50 |         parser.add_argument(
51 |             '--async_mode',
52 |             action='store_true',
53 |             default=False,
54 |             help='Whether start pserver in async mode to support ASGD')
55 |         parser.add_argument(
56 |             '--no_split_var',
57 |             action='store_true',
58 |             default=False,
59 |             help='Whether split variables into blocks when update_method is pserver')
60 |         parser.add_argument(
61 |             '--role',
62 |             type=str,
63 |             default='pserver', # trainer or pserver
64 |             help='The path for model to store (default: models)')
65 |         parser.add_argument(
66 |             '--endpoints',
67 |             type=str,
68 |             default='127.0.0.1:6000',
69 |             help='The pserver endpoints, like: 127.0.0.1:6000,127.0.0.1:6001')
70 |         parser.add_argument(
71 |             '--current_endpoint',
72 |             type=str,
73 |             default='127.0.0.1:6000',
74 |             help='The path for model to store (default: 127.0.0.1:6000)')
75 |         parser.add_argument(
76 |             '--trainer_id',
77 |             type=int,
78 |             default=0,
79 |             help='The path for model to store (default: models)')
80 |         parser.add_argument(
81 |             '--trainers',
82 |             type=int,
83 |             default=1,
84 |             help='The num of trianers, (default: 1)')
85 |         return parser.parse_args()
86 | 


--------------------------------------------------------------------------------
/build_submit.py:
--------------------------------------------------------------------------------
 1 | import json
 2 | import csv
 3 | import io
 4 | 
 5 | 
 6 | def build():
 7 |     submit_map = {}
 8 |     with io.open('./submit/submit.csv', 'wb') as csv_file:
 9 |         writer = csv.writer(csv_file, delimiter=',')
10 |         writer.writerow(['sid', 'recommend_mode'])
11 |         # choose the res file you want to build submit file
12 |         with open('./out/normed_test_session.txt', 'r') as f1:
13 |             with open('./testres/res8', 'r') as f2:
14 |                 cur_session =''
15 |                 for x, y in zip(f1.readlines(), f2.readlines()):
16 |                     m1 = json.loads(x)
17 |                     session_id = m1["session_id"]
18 |                     if cur_session == '':
19 |                         cur_session = session_id
20 | 
21 |                     transport_mode = m1["plan"]["transport_mode"]
22 | 
23 |                     if cur_session != session_id:
24 |                         writer.writerow([str(cur_session), str(submit_map[cur_session]["transport_mode"])])
25 |                         cur_session = session_id
26 |                     if session_id not in submit_map:
27 |                         submit_map[session_id] = {}
28 |                         submit_map[session_id]["transport_mode"] = transport_mode
29 |                         submit_map[session_id]["probability"] = y
30 |                         #if int(submit_map[session_id]["transport_mode"]) == 0 and submit_map[session_id]["probability"] > 0.02:
31 |                             #submit_map[session_id]["probability"] = 0.99
32 |                     else:
33 |                         if float(y) > float(submit_map[session_id]["probability"]):
34 |                             submit_map[session_id]["transport_mode"] = transport_mode
35 |                             submit_map[session_id]["probability"] = y
36 |                             #if int(submit_map[session_id]["transport_mode"]) == 0 and submit_map[session_id]["probability"] > 0.02:
37 |                                 #submit_map[session_id]["transport_mode"] = 0
38 |                                 #submit_map[session_id]["probability"] = 0.99
39 | 
40 | 
41 |         writer.writerow([cur_session, submit_map[cur_session]["transport_mode"]])
42 | 
43 | 
44 | 
45 | if __name__ == "__main__":
46 |     build()
47 | 


--------------------------------------------------------------------------------
/build_submit_py3.py:
--------------------------------------------------------------------------------
 1 | import json
 2 | import csv
 3 | import io
 4 | 
 5 | 
 6 | def build():
 7 |     submit_map = {}
 8 |     with io.open('./submit/submit.csv', 'w') as csv_file:
 9 |         writer = csv.writer(csv_file, delimiter=',')
10 |         writer.writerow(['sid', 'recommend_mode'])
11 |         # choose the res file you want to build submit file 
12 |         with open('./out/normed_test_session.txt', 'r') as f1:
13 |             with open('./testres/res0', 'r') as f2:
14 |                 cur_session =''
15 |                 for x, y in zip(f1.readlines(), f2.readlines()):
16 |                     m1 = json.loads(x)
17 |                     session_id = m1["session_id"]
18 |                     if cur_session == '':
19 |                         cur_session = session_id
20 | 
21 |                     transport_mode = m1["plan"]["transport_mode"]
22 | 
23 |                     if cur_session != session_id:
24 |                         writer.writerow([str(cur_session), str(submit_map[cur_session]["transport_mode"])])
25 |                         cur_session = session_id
26 |                     if session_id not in submit_map:
27 |                         submit_map[session_id] = {}
28 |                         submit_map[session_id]["transport_mode"] = transport_mode
29 |                         submit_map[session_id]["probability"] = y
30 |                         #if int(submit_map[session_id]["transport_mode"]) == 0 and submit_map[session_id]["probability"] > 0.02:
31 |                             #submit_map[session_id]["probability"] = 0.99
32 |                     else:
33 |                         if float(y) > float(submit_map[session_id]["probability"]):
34 |                             submit_map[session_id]["transport_mode"] = transport_mode
35 |                             submit_map[session_id]["probability"] = y
36 |                             #if int(submit_map[session_id]["transport_mode"]) == 0 and submit_map[session_id]["probability"] > 0.02:
37 |                                 #submit_map[session_id]["transport_mode"] = 0
38 |                                 #submit_map[session_id]["probability"] = 0.99
39 | 
40 | 
41 |         writer.writerow([cur_session, submit_map[cur_session]["transport_mode"]])
42 | 
43 | 
44 | 
45 | if __name__ == "__main__":
46 |     build()
47 | 


--------------------------------------------------------------------------------
/data_set_phase1/._profiles.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yaoxuefeng6/Paddle_baseline_KDD2019/dd7f8f6016f8457cac06ae0bdfb006bc0682457d/data_set_phase1/._profiles.csv


--------------------------------------------------------------------------------
/data_set_phase1/._test_plans.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yaoxuefeng6/Paddle_baseline_KDD2019/dd7f8f6016f8457cac06ae0bdfb006bc0682457d/data_set_phase1/._test_plans.csv


--------------------------------------------------------------------------------
/data_set_phase1/._test_queries.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yaoxuefeng6/Paddle_baseline_KDD2019/dd7f8f6016f8457cac06ae0bdfb006bc0682457d/data_set_phase1/._test_queries.csv


--------------------------------------------------------------------------------
/data_set_phase1/._train_clicks.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yaoxuefeng6/Paddle_baseline_KDD2019/dd7f8f6016f8457cac06ae0bdfb006bc0682457d/data_set_phase1/._train_clicks.csv


--------------------------------------------------------------------------------
/data_set_phase1/._train_queries.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yaoxuefeng6/Paddle_baseline_KDD2019/dd7f8f6016f8457cac06ae0bdfb006bc0682457d/data_set_phase1/._train_queries.csv


--------------------------------------------------------------------------------
/generate_test.py:
--------------------------------------------------------------------------------
  1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | 
 16 | import argparse
 17 | import logging
 18 | import numpy as np
 19 | # disable gpu training for this example
 20 | import os
 21 | 
 22 | os.environ["CUDA_VISIBLE_DEVICES"] = ""
 23 | import paddle
 24 | import paddle.fluid as fluid
 25 | logging.basicConfig(
 26 |     format='%(asctime)s - %(levelname)s - %(message)s')
 27 | logger = logging.getLogger("fluid")
 28 | logger.setLevel(logging.INFO)
 29 | num_context_feature = 22
 30 | 
 31 | def parse_args():
 32 |     parser = argparse.ArgumentParser(description="PaddlePaddle DeepFM example")
 33 |     parser.add_argument(
 34 |         '--model_path',
 35 |         type=str,
 36 |         #required=True,
 37 |         default='models',
 38 |         help="The path of model parameters gz file")
 39 |     parser.add_argument(
 40 |         '--data_path',
 41 |         type=str,
 42 |         required=False,
 43 |         help="The path of the dataset to infer")
 44 |     parser.add_argument(
 45 |         '--embedding_size',
 46 |         type=int,
 47 |         default=16,
 48 |         help="The size for embedding layer (default:10)")
 49 |     parser.add_argument(
 50 |         '--sparse_feature_dim',
 51 |         type=int,
 52 |         default=1000001,
 53 |         help="The size for embedding layer (default:1000001)")
 54 |     parser.add_argument(
 55 |         '--batch_size',
 56 |         type=int,
 57 |         default=1000,
 58 |         help="The size of mini-batch (default:1000)")
 59 | 
 60 |     return parser.parse_args()
 61 | 
 62 | def to_lodtensor(data, place):
 63 |     seq_lens = [len(seq) for seq in data]
 64 |     cur_len = 0
 65 |     lod = [cur_len]
 66 |     for l in seq_lens:
 67 |         cur_len += l
 68 |         lod.append(cur_len)
 69 |     flattened_data = np.concatenate(data, axis=0).astype("int64")
 70 |     flattened_data = flattened_data.reshape([len(flattened_data), 1])
 71 |     res = fluid.LoDTensor()
 72 |     res.set(flattened_data, place)
 73 |     res.set_lod([lod])
 74 | 
 75 | 
 76 |     return res
 77 | 
 78 | 
 79 | def data2tensor(data, place):
 80 |     feed_dict = {}
 81 |     dense = data[0]
 82 |     sparse = data[1:-1]
 83 |     y = data[-1]
 84 |     #user_data = np.array([x[0] for x in data]).astype("float32")
 85 |     #user_data = user_data.reshape([-1, 10])
 86 |     #feed_dict["user_profile"] = user_data
 87 |     dense_data = np.array([x[0] for x in data]).astype("float32")
 88 |     dense_data = dense_data.reshape([-1, 3])
 89 |     feed_dict["dense_feature"] = dense_data
 90 |     for i in range(num_context_feature):
 91 |         sparse_data = to_lodtensor([x[1 + i] for x in data], place)
 92 |         feed_dict["context" + str(i)] = sparse_data
 93 | 
 94 |     context_fm = to_lodtensor(np.array([x[-2] for x in data]).astype("float32"), place)
 95 | 
 96 |     feed_dict["context_fm"] = context_fm
 97 |     y_data = np.array([x[-1] for x in data]).astype("int64")
 98 |     y_data = y_data.reshape([-1, 1])
 99 |     feed_dict["label"] = y_data
100 |     return feed_dict
101 | 
102 | def test():
103 |     args = parse_args()
104 | 
105 |     place = fluid.CPUPlace()
106 |     test_scope = fluid.core.Scope()
107 | 
108 |     # filelist = ["%s/%s" % (args.data_path, x) for x in os.listdir(args.data_path)]
109 |     from map_reader import MapDataset
110 |     map_dataset = MapDataset()
111 |     map_dataset.setup(args.sparse_feature_dim)
112 |     exe = fluid.Executor(place)
113 | 
114 |     whole_filelist = ["./out/normed_test_session.txt"]
115 |     test_files = whole_filelist[int(0.0 * len(whole_filelist)):int(1.0 * len(whole_filelist))]
116 | 
117 |     #set how many epochs runing for infer
118 |     epochs = 1
119 | 
120 |     for i in range(epochs):
121 |         cur_model_path = args.model_path + "/epoch" + str(i + 1) + ".model"
122 |         with open("./testres/res" + str(i), 'w') as r:
123 |             with fluid.scope_guard(test_scope):
124 |                 [inference_program, feed_target_names, fetch_targets] = \
125 |                     fluid.io.load_inference_model(cur_model_path, exe)
126 | 
127 |                 test_reader = map_dataset.test_reader(test_files, 1000, 100000)
128 |                 k = 0
129 |                 for batch_id, data in enumerate(test_reader()):
130 |                     print(len(data[0]))
131 |                     feed_dict = data2tensor(data, place)
132 |                     loss_val, auc_val, accuracy, predict, _ = exe.run(inference_program,
133 |                                                 feed=feed_dict,
134 |                                                 fetch_list=fetch_targets, return_numpy=False)
135 | 
136 |                     x = np.array(predict)
137 |                     for j in range(x.shape[0]):
138 |                         r.write(str(x[j][1]))
139 |                         r.write("\n")
140 | 
141 | 
142 | if __name__ == '__main__':
143 |     test()
144 | 


--------------------------------------------------------------------------------
/generate_test_py3.py:
--------------------------------------------------------------------------------
  1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | 
 16 | import argparse
 17 | import logging
 18 | import numpy as np
 19 | # disable gpu training for this example
 20 | import os, sys
 21 | 
 22 | #this is set for python3 environment and the paddle install path is only use for jarvis platform
 23 | sys.path.append("/usr/local/cuda-9.2/conda/envs/py36-paddle/lib/python3.6/site-packages/paddle/fluid")
 24 | sys.path.append("/usr/local/cuda-9.2/conda/envs/py36-paddle/lib/python3.6/site-packages/paddle/fluid/proto")
 25 | 
 26 | os.environ["CUDA_VISIBLE_DEVICES"] = ""
 27 | import paddle
 28 | import paddle.fluid as fluid
 29 | logging.basicConfig(
 30 |     format='%(asctime)s - %(levelname)s - %(message)s')
 31 | logger = logging.getLogger("fluid")
 32 | logger.setLevel(logging.INFO)
 33 | num_context_feature = 22
 34 | 
 35 | def parse_args():
 36 |     parser = argparse.ArgumentParser(description="PaddlePaddle DeepFM example")
 37 |     parser.add_argument(
 38 |         '--model_path',
 39 |         type=str,
 40 |         #required=True,
 41 |         default='models',
 42 |         help="The path of model parameters gz file")
 43 |     parser.add_argument(
 44 |         '--data_path',
 45 |         type=str,
 46 |         required=False,
 47 |         help="The path of the dataset to infer")
 48 |     parser.add_argument(
 49 |         '--embedding_size',
 50 |         type=int,
 51 |         default=16,
 52 |         help="The size for embedding layer (default:10)")
 53 |     parser.add_argument(
 54 |         '--sparse_feature_dim',
 55 |         type=int,
 56 |         default=1000001,
 57 |         help="The size for embedding layer (default:1000001)")
 58 |     parser.add_argument(
 59 |         '--batch_size',
 60 |         type=int,
 61 |         default=1000,
 62 |         help="The size of mini-batch (default:1000)")
 63 | 
 64 |     return parser.parse_args()
 65 | 
 66 | def to_lodtensor(data, place):
 67 |     seq_lens = [len(seq) for seq in data]
 68 |     cur_len = 0
 69 |     lod = [cur_len]
 70 |     for l in seq_lens:
 71 |         cur_len += l
 72 |         lod.append(cur_len)
 73 |     flattened_data = np.concatenate(data, axis=0).astype("int64")
 74 |     flattened_data = flattened_data.reshape([len(flattened_data), 1])
 75 |     res = fluid.LoDTensor()
 76 |     res.set(flattened_data, place)
 77 |     res.set_lod([lod])
 78 | 
 79 | 
 80 |     return res
 81 | 
 82 | 
 83 | def data2tensor(data, place):
 84 |     feed_dict = {}
 85 |     dense = data[0]
 86 |     sparse = data[1:-1]
 87 |     y = data[-1]
 88 |     #user_data = np.array([x[0] for x in data]).astype("float32")
 89 |     #user_data = user_data.reshape([-1, 10])
 90 |     #feed_dict["user_profile"] = user_data
 91 |     dense_data = np.array([x[0] for x in data]).astype("float32")
 92 |     dense_data = dense_data.reshape([-1, 3])
 93 |     feed_dict["dense_feature"] = dense_data
 94 |     for i in range(num_context_feature):
 95 |         sparse_data = to_lodtensor([x[1 + i] for x in data], place)
 96 |         feed_dict["context" + str(i)] = sparse_data
 97 | 
 98 |     context_fm = to_lodtensor(np.array([x[-2] for x in data]).astype("float32"), place)
 99 | 
100 |     feed_dict["context_fm"] = context_fm
101 |     y_data = np.array([x[-1] for x in data]).astype("int64")
102 |     y_data = y_data.reshape([-1, 1])
103 |     feed_dict["label"] = y_data
104 |     return feed_dict
105 | 
106 | def test():
107 |     args = parse_args()
108 | 
109 |     place = fluid.CPUPlace()
110 |     test_scope = fluid.core.Scope()
111 | 
112 |     # filelist = ["%s/%s" % (args.data_path, x) for x in os.listdir(args.data_path)]
113 |     from map_reader_mmh import MapDataset
114 |     map_dataset = MapDataset()
115 |     map_dataset.setup(args.sparse_feature_dim)
116 |     exe = fluid.Executor(place)
117 | 
118 |     whole_filelist = ["./out/normed_test_session.txt"]
119 |     test_files = whole_filelist[int(0.0 * len(whole_filelist)):int(1.0 * len(whole_filelist))]
120 | 
121 |     # set how many epochs runing for infer
122 |     epochs = 1
123 | 
124 |     for i in range(epochs):
125 |         cur_model_path = args.model_path + "/epoch" + str(i + 1) + ".model"
126 |         with open("./testres/res" + str(i), 'w') as r:
127 |             with fluid.scope_guard(test_scope):
128 |                 [inference_program, feed_target_names, fetch_targets] = \
129 |                     fluid.io.load_inference_model(cur_model_path, exe)
130 | 
131 |                 test_reader = map_dataset.test_reader(test_files, 1000, 100000)
132 |                 k = 0
133 |                 for batch_id, data in enumerate(test_reader()):
134 |                     print(len(data[0]))
135 |                     feed_dict = data2tensor(data, place)
136 |                     loss_val, auc_val, accuracy, predict, _ = exe.run(inference_program,
137 |                                                 feed=feed_dict,
138 |                                                 fetch_list=fetch_targets, return_numpy=False)
139 | 
140 |                     x = np.array(predict)
141 |                     for j in range(x.shape[0]):
142 |                         r.write(str(x[j][1]))
143 |                         r.write("\n")
144 | 
145 | 
146 | if __name__ == '__main__':
147 |     test()
148 | 


--------------------------------------------------------------------------------
/infer.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | import logging
  3 | 
  4 | import numpy as np
  5 | # disable gpu training for this example
  6 | import os
  7 | 
  8 | os.environ["CUDA_VISIBLE_DEVICES"] = ""
  9 | import paddle
 10 | import paddle.fluid as fluid
 11 | 
 12 | import map_reader
 13 | from network_conf import ctr_deepfm_dataset
 14 | 
 15 | logging.basicConfig(
 16 |     format='%(asctime)s - %(levelname)s - %(message)s')
 17 | logger = logging.getLogger("fluid")
 18 | logger.setLevel(logging.INFO)
 19 | 
 20 | 
 21 | def parse_args():
 22 |     parser = argparse.ArgumentParser(description="PaddlePaddle DeepFM example")
 23 |     parser.add_argument(
 24 |         '--model_path',
 25 |         type=str,
 26 |         #required=True,
 27 |         default='models',
 28 |         help="The path of model parameters gz file")
 29 |     parser.add_argument(
 30 |         '--data_path',
 31 |         type=str,
 32 |         required=False,
 33 |         help="The path of the dataset to infer")
 34 |     parser.add_argument(
 35 |         '--embedding_size',
 36 |         type=int,
 37 |         default=16,
 38 |         help="The size for embedding layer (default:10)")
 39 |     parser.add_argument(
 40 |         '--sparse_feature_dim',
 41 |         type=int,
 42 |         default=1000001,
 43 |         help="The size for embedding layer (default:1000001)")
 44 |     parser.add_argument(
 45 |         '--batch_size',
 46 |         type=int,
 47 |         default=1000,
 48 |         help="The size of mini-batch (default:1000)")
 49 | 
 50 |     return parser.parse_args()
 51 | 
 52 | 
 53 | def to_lodtensor(data, place):
 54 |     seq_lens = [len(seq) for seq in data]
 55 |     cur_len = 0
 56 |     lod = [cur_len]
 57 |     for l in seq_lens:
 58 |         cur_len += l
 59 |         lod.append(cur_len)
 60 |     flattened_data = np.concatenate(data, axis=0).astype("int64")
 61 |     flattened_data = flattened_data.reshape([len(flattened_data), 1])
 62 |     res = fluid.LoDTensor()
 63 |     res.set(flattened_data, place)
 64 |     res.set_lod([lod])
 65 |     return res
 66 | 
 67 | 
 68 | def data2tensor(data, place):
 69 |     feed_dict = {}
 70 |     test_dict = {}
 71 |     dense = data[0]
 72 |     sparse = data[1:-1]
 73 |     y = data[-1]
 74 |     dense_data = np.array([x[0] for x in data]).astype("float32")
 75 |     dense_data = dense_data.reshape([-1, 65])
 76 |     feed_dict["user_profile"] = dense_data
 77 |     for i in range(10):
 78 |         sparse_data = to_lodtensor([x[1 + i] for x in data], place)
 79 |         feed_dict["context" + str(i)] = sparse_data
 80 | 
 81 |     y_data = np.array([x[-1] for x in data]).astype("int64")
 82 |     y_data = y_data.reshape([-1, 1])
 83 |     feed_dict["label"] = y_data
 84 |     test_dict["test"] = [1]
 85 |     return feed_dict, test_dict
 86 | 
 87 | 
 88 | def infer():
 89 |     args = parse_args()
 90 | 
 91 |     place = fluid.CPUPlace()
 92 |     inference_scope = fluid.core.Scope()
 93 | 
 94 |     filelist = ["%s/%s" % (args.data_path, x) for x in os.listdir(args.data_path)]
 95 |     from map_reader import MapDataset
 96 |     map_dataset = MapDataset()
 97 |     map_dataset.setup(args.sparse_feature_dim)
 98 |     exe = fluid.Executor(place)
 99 | 
100 |     whole_filelist = ["raw_data/part-%d" % x for x in range(len(os.listdir("raw_data")))]
101 |     #whole_filelist = ["./out/normed_train09",  "./out/normed_train10",  "./out/normed_train11"]
102 |     test_files = whole_filelist[int(0.0 * len(whole_filelist)):int(1.0 * len(whole_filelist))]
103 | 
104 |     # file_groups = [whole_filelist[i:i+train_thread_num] for i in range(0, len(whole_filelist), train_thread_num)]
105 | 
106 |     def set_zero(var_name):
107 |         param = inference_scope.var(var_name).get_tensor()
108 |         param_array = np.zeros(param._get_dims()).astype("int64")
109 |         param.set(param_array, place)
110 | 
111 |     epochs = 2
112 |     for i in range(epochs):
113 |         cur_model_path = args.model_path + "/epoch" + str(i + 1) + ".model"
114 |         with fluid.scope_guard(inference_scope):
115 |             [inference_program, feed_target_names, fetch_targets] = \
116 |                 fluid.io.load_inference_model(cur_model_path, exe)
117 |             auc_states_names = ['_generated_var_2', '_generated_var_3']
118 |             for name in auc_states_names:
119 |                 set_zero(name)
120 | 
121 |             test_reader = map_dataset.infer_reader(test_files, 1000, 100000)
122 |             for batch_id, data in enumerate(test_reader()):
123 |                 loss_val, auc_val, accuracy, predict, label = exe.run(inference_program,
124 |                                             feed=data2tensor(data, place),
125 |                                             fetch_list=fetch_targets, return_numpy=False)
126 | 
127 |                 #print(np.array(predict))
128 |                 #x = np.array(predict)
129 |                 #print(.shape)x
130 |             #print("train_pass_%d, test_pass_%d\t%f\t" % (i - 1, i, auc_val))
131 | 
132 | 
133 | if __name__ == '__main__':
134 |     infer()
135 | 


--------------------------------------------------------------------------------
/local_train.py:
--------------------------------------------------------------------------------
 1 | from __future__ import print_function
 2 | 
 3 | from args import parse_args
 4 | import os
 5 | import paddle.fluid as fluid
 6 | import sys
 7 | from network_confv6 import ctr_deepfm_dataset
 8 | 
 9 | 
10 | NUM_CONTEXT_FEATURE = 22
11 | DIM_USER_PROFILE = 10
12 | DIM_DENSE_FEATURE = 3
13 | PYTHON_PATH = "/home/yaoxuefeng/whls/paddle_release_home/python/bin/python" # this is mine change yours
14 | 
15 | def train():
16 |     args = parse_args()
17 |     if not os.path.isdir(args.model_output_dir):
18 |         os.mkdir(args.model_output_dir)
19 |     
20 |     #set the input format for our model. Note that you need to carefully modify them when you define a new network
21 |     #user_profile = fluid.layers.data(
22 |         #name="user_profile", shape=[DIM_USER_PROFILE], dtype='int64', lod_level=1)
23 |     dense_feature = fluid.layers.data(
24 |         name="dense_feature", shape=[DIM_DENSE_FEATURE], dtype='float32')
25 |     context_feature = [
26 |         fluid.layers.data(name="context" + str(i), shape=[1], lod_level=1, dtype="int64")
27 |         for i in range(0, NUM_CONTEXT_FEATURE)]
28 |     context_feature_fm = fluid.layers.data(
29 |         name="context_fm", shape=[1], dtype='int64', lod_level=1)
30 |     label = fluid.layers.data(name='label', shape=[1], dtype='int64')
31 | 
32 |     print("ready to network")
33 |     #self define network 
34 |     loss, auc_var, batch_auc_var, accuracy, predict = ctr_deepfm_dataset(dense_feature, context_feature, context_feature_fm, label,
35 |                                                         args.embedding_size, args.sparse_feature_dim)
36 | 
37 |     print("ready to optimize")
38 |     optimizer = fluid.optimizer.SGD(learning_rate=1e-4)
39 |     optimizer.minimize(loss)
40 |     #single machine CPU training. more options on trainig please visit PaddlePaddle site
41 |     exe = fluid.Executor(fluid.CPUPlace())
42 |     exe.run(fluid.default_startup_program())
43 |     #use dataset api for much faster speed
44 |     dataset = fluid.DatasetFactory().create_dataset()
45 |     dataset.set_use_var([dense_feature] + context_feature + [context_feature_fm] + [label])
46 |     #self define how to process generated training insatnces in map_reader.py
47 |     pipe_command = PYTHON_PATH + "  map_reader.py %d" % args.sparse_feature_dim
48 |     dataset.set_pipe_command(pipe_command)
49 |     dataset.set_batch_size(args.batch_size)
50 |     thread_num = 1
51 |     dataset.set_thread(thread_num)
52 |     #self define how to split training files for example:"split -a 2 -d -l 200000 normed_train.txt normed_train"
53 |     whole_filelist = ["./out/normed_train%d" % x for x in range(len(os.listdir("out")))]
54 |     whole_filelist = ["./out/normed_train00", "./out/normed_train01", "./out/normed_train02", "./out/normed_train03",
55 |                       "./out/normed_train04", "./out/normed_train05", "./out/normed_train06", "./out/normed_train07",
56 |                       "./out/normed_train08",
57 |                       "./out/normed_train09", "./out/normed_train10", "./out/normed_train11"]
58 |     print("ready to epochs")
59 |     epochs = 10
60 |     for i in range(epochs):
61 |         print("start %dth epoch" % i)
62 |         dataset.set_filelist(whole_filelist[:int(len(whole_filelist))])
63 |         #print the informations you want by setting fetch_list and fetch_info
64 |         exe.train_from_dataset(program=fluid.default_main_program(),
65 |                                dataset=dataset,
66 |                                fetch_list=[auc_var, accuracy, predict, label],
67 |                                fetch_info=["auc", "accuracy", "predict", "label"],
68 |                                debug=False)
69 |         model_dir = args.model_output_dir + '/epoch' + str(i + 1) + ".model"
70 |         sys.stderr.write("epoch%d finished" % (i + 1))
71 |         #save model
72 |         fluid.io.save_inference_model(model_dir, [dense_feature.name] + [x.name for x in context_feature] + [context_feature_fm.name] + [label.name],
73 |                                       [loss, auc_var, accuracy, predict, label], exe)
74 | 
75 | 
76 | if __name__ == '__main__':
77 |     train()
78 | 


--------------------------------------------------------------------------------
/local_train_py3.py:
--------------------------------------------------------------------------------
 1 | from __future__ import print_function
 2 | 
 3 | from args import parse_args
 4 | import os
 5 | import paddle.fluid as fluid
 6 | import sys
 7 | from network_confv6 import ctr_deepfm_dataset
 8 | 
 9 | #this is set for python3 environment and the paddle install path is only use for jarvis platform
10 | sys.path.append("/usr/local/cuda-9.2/conda/envs/py36-paddle/lib/python3.6/site-packages/paddle/fluid")
11 | sys.path.append("/usr/local/cuda-9.2/conda/envs/py36-paddle/lib/python3.6/site-packages/paddle/fluid/proto")
12 | 
13 | NUM_CONTEXT_FEATURE = 22
14 | DIM_USER_PROFILE = 10
15 | DIM_DENSE_FEATURE = 3
16 | #PYTHON_PATH = "/home/yaoxuefeng/whls/paddle_release_home/python/bin/python" # this is mine change your own python path
17 | PYTHON_PATH = "python"
18 | def train():
19 |     args = parse_args()
20 |     if not os.path.isdir(args.model_output_dir):
21 |         os.mkdir(args.model_output_dir)
22 |     
23 |     #set the input format for our model. Note that you need to carefully modify them when you define a new network
24 |     #user_profile = fluid.layers.data(
25 |         #name="user_profile", shape=[DIM_USER_PROFILE], dtype='int64', lod_level=1)
26 |     dense_feature = fluid.layers.data(
27 |         name="dense_feature", shape=[DIM_DENSE_FEATURE], dtype='float32')
28 |     context_feature = [
29 |         fluid.layers.data(name="context" + str(i), shape=[1], lod_level=1, dtype="int64")
30 |         for i in range(0, NUM_CONTEXT_FEATURE)]
31 |     context_feature_fm = fluid.layers.data(
32 |         name="context_fm", shape=[1], dtype='int64', lod_level=1)
33 |     label = fluid.layers.data(name='label', shape=[1], dtype='int64')
34 | 
35 |     print("ready to network")
36 |     #self define network 
37 |     loss, auc_var, batch_auc_var, accuracy, predict = ctr_deepfm_dataset(dense_feature, context_feature, context_feature_fm, label,
38 |                                                         args.embedding_size, args.sparse_feature_dim)
39 | 
40 |     print("ready to optimize")
41 |     optimizer = fluid.optimizer.SGD(learning_rate=1e-4)
42 |     optimizer.minimize(loss)
43 |     #single machine CPU training. more options on trainig please visit PaddlePaddle site
44 |     exe = fluid.Executor(fluid.CPUPlace())
45 |     exe.run(fluid.default_startup_program())
46 |     #use dataset api for much faster speed
47 |     dataset = fluid.DatasetFactory().create_dataset()
48 |     dataset.set_use_var([dense_feature] + context_feature + [context_feature_fm] + [label])
49 |     #self define how to process generated training insatnces in map_reader_mmh.py
50 |     pipe_command = PYTHON_PATH + "  map_reader_mmh.py %d" % args.sparse_feature_dim
51 |     dataset.set_pipe_command(pipe_command)
52 |     dataset.set_batch_size(args.batch_size)
53 |     #set thread num not larger than the length of filelist for muli-thread loading data
54 |     thread_num = 12
55 |     dataset.set_thread(thread_num)
56 |     #self define how to split training files for example:"split -a 2 -d -l 200000 normed_train.txt normed_train"
57 |     whole_filelist = ["./out/normed_train%d" % x for x in range(len(os.listdir("out")))]
58 |     whole_filelist = ["./out/normed_train00", "./out/normed_train01", "./out/normed_train02", "./out/normed_train03",
59 |                       "./out/normed_train04", "./out/normed_train05", "./out/normed_train06", "./out/normed_train07",
60 |                       "./out/normed_train08",
61 |                       "./out/normed_train09", "./out/normed_train10", "./out/normed_train11"]
62 |     print("ready to epochs")
63 |     epochs = 10
64 |     for i in range(epochs):
65 |         print("start %dth epoch" % i)
66 |         dataset.set_filelist(whole_filelist[:int(len(whole_filelist))])
67 |         #print the informations you want by setting fetch_list and fetch_info
68 |         exe.train_from_dataset(program=fluid.default_main_program(),
69 |                                dataset=dataset,
70 |                                fetch_list=[auc_var, accuracy, predict, label],
71 |                                fetch_info=["auc", "accuracy", "predict", "label"],
72 |                                debug=False)
73 |         model_dir = args.model_output_dir + '/epoch' + str(i + 1) + ".model"
74 |         sys.stderr.write("epoch%d finished" % (i + 1))
75 |         #save model
76 |         fluid.io.save_inference_model(model_dir, [dense_feature.name] + [x.name for x in context_feature] + [context_feature_fm.name] + [label.name],
77 |                                       [loss, auc_var, accuracy, predict, label], exe)
78 | 
79 | 
80 | if __name__ == '__main__':
81 |     train()
82 | 


--------------------------------------------------------------------------------
/map_reader.py:
--------------------------------------------------------------------------------
  1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | import sys
 16 | import json
 17 | import paddle.fluid.incubate.data_generator as dg
 18 | 
 19 | 
 20 | class MapDataset(dg.MultiSlotDataGenerator):
 21 |     def setup(self, sparse_feature_dim):
 22 |         self.profile_length = 65
 23 |         self.dense_length = 3
 24 |         #feature names
 25 |         self.dense_feature_list = ["distance", "price", "eta"]
 26 | 
 27 |         self.pid_list = ["pid"]
 28 |         self.query_feature_list = ["weekday", "hour", "o1", "o2", "d1", "d2"]
 29 |         self.plan_feature_list = ["transport_mode"]
 30 |         self.rank_feature_list = ["plan_rank", "whole_rank", "price_rank", "eta_rank", "distance_rank"]
 31 |         self.rank_whole_pic_list = ["mode_rank1", "mode_rank2", "mode_rank3", "mode_rank4",
 32 |                                     "mode_rank5"]
 33 |         self.weather_feature_list = ["max_temp", "min_temp", "wea", "wind"]
 34 |         self.hash_dim = 1000001
 35 |         self.train_idx_ = 2000000
 36 |         #carefully set if you change the features 
 37 |         self.categorical_range_ = range(0, 22)
 38 | 
 39 |     #process one instance
 40 |     def _process_line(self, line):
 41 |         instance = json.loads(line)
 42 |         """
 43 |         profile = instance["profile"]
 44 |         len_profile = len(profile)
 45 |         if len_profile >= 10:
 46 |             user_profile_feature = profile[0:10]
 47 |         else:
 48 |             profile.extend([0]*(10-len_profile))
 49 |             user_profile_feature = profile
 50 |         
 51 |         if len(profile) > 1 or (len(profile) == 1 and profile[0] != 0):
 52 |             for p in profile:
 53 |                 if p >= 1 and p <= 65:
 54 |                     user_profile_feature[p - 1] = 1
 55 |         """
 56 |         context_feature = []
 57 |         context_feature_fm = []
 58 |         dense_feature = [0] * self.dense_length
 59 |         plan = instance["plan"]
 60 |         for i, val in enumerate(self.dense_feature_list):
 61 |             dense_feature[i] = plan[val]
 62 | 
 63 |         if (instance["pid"] == ""):
 64 |             instance["pid"] = 0
 65 | 
 66 |         query = instance["query"]
 67 |         weather_dic = instance["weather"]
 68 |         for fea in self.pid_list:
 69 |             context_feature.append([hash(fea + str(instance[fea])) % self.hash_dim])
 70 |             context_feature_fm.append(hash(fea + str(instance[fea])) % self.hash_dim)
 71 |         for fea in self.query_feature_list:
 72 |             context_feature.append([hash(fea + str(query[fea])) % self.hash_dim])
 73 |             context_feature_fm.append(hash(fea + str(query[fea])) % self.hash_dim)
 74 |         for fea in self.plan_feature_list:
 75 |             context_feature.append([hash(fea + str(plan[fea])) % self.hash_dim])
 76 |             context_feature_fm.append(hash(fea + str(plan[fea])) % self.hash_dim)
 77 |         for fea in self.rank_feature_list:
 78 |             context_feature.append([hash(fea + str(instance[fea])) % self.hash_dim])
 79 |             context_feature_fm.append(hash(fea + str(instance[fea])) % self.hash_dim)
 80 |         for fea in self.rank_whole_pic_list:
 81 |             context_feature.append([hash(fea + str(instance[fea])) % self.hash_dim])
 82 |             context_feature_fm.append(hash(fea + str(instance[fea])) % self.hash_dim)
 83 |         for fea in self.weather_feature_list:
 84 |             context_feature.append([hash(fea + str(weather_dic[fea])) % self.hash_dim])
 85 |             context_feature_fm.append(hash(fea + str(weather_dic[fea])) % self.hash_dim)
 86 | 
 87 |         label = [int(instance["label"])]
 88 | 
 89 |         return dense_feature, context_feature, context_feature_fm, label
 90 | 
 91 |     def infer_reader(self, filelist, batch, buf_size):
 92 |         print(filelist)
 93 | 
 94 |         def local_iter():
 95 |             for fname in filelist:
 96 |                 with open(fname.strip(), "r") as fin:
 97 |                     for line in fin:
 98 |                         dense_feature, sparse_feature, sparse_feature_fm, label = self._process_line(line)
 99 |                         yield [dense_feature] + sparse_feature + [sparse_feature_fm] + [label]
100 | 
101 |         import paddle
102 |         batch_iter = paddle.batch(
103 |             paddle.reader.shuffle(
104 |                 local_iter, buf_size=buf_size),
105 |             batch_size=batch)
106 |         return batch_iter
107 | 
108 |     #generat inputs for testing
109 |     def test_reader(self, filelist, batch, buf_size):
110 |         print(filelist)
111 | 
112 |         def local_iter():
113 |             for fname in filelist:
114 |                 with open(fname.strip(), "r") as fin:
115 |                     for line in fin:
116 |                         dense_feature, sparse_feature, sparse_feature_fm, label = self._process_line(line)
117 |                         yield [dense_feature] + sparse_feature + [sparse_feature_fm] + [label]
118 | 
119 |         import paddle
120 |         batch_iter = paddle.batch(
121 |             paddle.reader.buffered(
122 |                 local_iter, size=buf_size),
123 |             batch_size=batch)
124 |         return batch_iter
125 | 
126 |     #generate inputs for trainig 
127 |     def generate_sample(self, line):
128 |         def data_iter():
129 |             dense_feature, sparse_feature, sparse_feature_fm, label = self._process_line(line)
130 |             #feature_name = ["user_profile"]
131 |             feature_name = []
132 |             feature_name.append("dense_feature")
133 |             for idx in self.categorical_range_:
134 |                 feature_name.append("context" + str(idx))
135 |             feature_name.append("context_fm")
136 |             feature_name.append("label")
137 |             yield zip(feature_name, [dense_feature] + sparse_feature + [sparse_feature_fm] + [label])
138 | 
139 |         return data_iter
140 | 
141 | 
142 | if __name__ == "__main__":
143 |     map_dataset = MapDataset()
144 |     map_dataset.setup(int(sys.argv[1]))
145 |     map_dataset.run_from_stdin()
146 | 


--------------------------------------------------------------------------------
/map_reader_mmh.py:
--------------------------------------------------------------------------------
  1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | import sys
 16 | import json
 17 | import paddle.fluid.incubate.data_generator as dg
 18 | import mmh3
 19 | 
 20 | 
 21 | class MapDataset(dg.MultiSlotDataGenerator):
 22 |     def setup(self, sparse_feature_dim):
 23 |         self.profile_length = 65
 24 |         self.dense_length = 3
 25 |         # feature names
 26 |         self.dense_feature_list = ["distance", "price", "eta"]
 27 | 
 28 |         self.pid_list = ["pid"]
 29 |         self.query_feature_list = ["weekday", "hour", "o1", "o2", "d1", "d2"]
 30 |         self.plan_feature_list = ["transport_mode"]
 31 |         self.rank_feature_list = ["plan_rank", "whole_rank", "price_rank", "eta_rank", "distance_rank"]
 32 |         self.rank_whole_pic_list = ["mode_rank1", "mode_rank2", "mode_rank3", "mode_rank4",
 33 |                                     "mode_rank5"]
 34 |         self.weather_feature_list = ["max_temp", "min_temp", "wea", "wind"]
 35 |         self.hash_dim = 1000001
 36 |         self.train_idx_ = 2000000
 37 |         # carefully set if you change the features
 38 |         self.categorical_range_ = range(0, 22)
 39 | 
 40 |     # process one instance
 41 |     def _process_line(self, line):
 42 |         instance = json.loads(line)
 43 |         """
 44 |         profile = instance["profile"]
 45 |         len_profile = len(profile)
 46 |         if len_profile >= 10:
 47 |             user_profile_feature = profile[0:10]
 48 |         else:
 49 |             profile.extend([0]*(10-len_profile))
 50 |             user_profile_feature = profile
 51 | 
 52 |         if len(profile) > 1 or (len(profile) == 1 and profile[0] != 0):
 53 |             for p in profile:
 54 |                 if p >= 1 and p <= 65:
 55 |                     user_profile_feature[p - 1] = 1
 56 |         """
 57 |         context_feature = []
 58 |         context_feature_fm = []
 59 |         dense_feature = [0] * self.dense_length
 60 |         plan = instance["plan"]
 61 |         for i, val in enumerate(self.dense_feature_list):
 62 |             dense_feature[i] = plan[val]
 63 | 
 64 |         if (instance["pid"] == ""):
 65 |             instance["pid"] = 0
 66 | 
 67 |         query = instance["query"]
 68 |         weather_dic = instance["weather"]
 69 |         for fea in self.pid_list:
 70 |             context_feature.append([mmh3.hash(fea + str(instance[fea])) % self.hash_dim])
 71 |             context_feature_fm.append(mmh3.hash(fea + str(instance[fea])) % self.hash_dim)
 72 |         for fea in self.query_feature_list:
 73 |             context_feature.append([mmh3.hash(fea + str(query[fea])) % self.hash_dim])
 74 |             context_feature_fm.append(mmh3.hash(fea + str(query[fea])) % self.hash_dim)
 75 |         for fea in self.plan_feature_list:
 76 |             context_feature.append([mmh3.hash(fea + str(plan[fea])) % self.hash_dim])
 77 |             context_feature_fm.append(mmh3.hash(fea + str(plan[fea])) % self.hash_dim)
 78 |         for fea in self.rank_feature_list:
 79 |             context_feature.append([mmh3.hash(fea + str(instance[fea])) % self.hash_dim])
 80 |             context_feature_fm.append(mmh3.hash(fea + str(instance[fea])) % self.hash_dim)
 81 |         for fea in self.rank_whole_pic_list:
 82 |             context_feature.append([mmh3.hash(fea + str(instance[fea])) % self.hash_dim])
 83 |             context_feature_fm.append(mmh3.hash(fea + str(instance[fea])) % self.hash_dim)
 84 |         for fea in self.weather_feature_list:
 85 |             context_feature.append([mmh3.hash(fea + str(weather_dic[fea])) % self.hash_dim])
 86 |             context_feature_fm.append(mmh3.hash(fea + str(weather_dic[fea])) % self.hash_dim)
 87 | 
 88 |         label = [int(instance["label"])]
 89 | 
 90 |         return dense_feature, context_feature, context_feature_fm, label
 91 | 
 92 |     def infer_reader(self, filelist, batch, buf_size):
 93 |         print(filelist)
 94 | 
 95 |         def local_iter():
 96 |             for fname in filelist:
 97 |                 with open(fname.strip(), "r") as fin:
 98 |                     for line in fin:
 99 |                         dense_feature, sparse_feature, sparse_feature_fm, label = self._process_line(line)
100 |                         yield [dense_feature] + sparse_feature + [sparse_feature_fm] + [label]
101 | 
102 |         import paddle
103 |         batch_iter = paddle.batch(
104 |             paddle.reader.shuffle(
105 |                 local_iter, buf_size=buf_size),
106 |             batch_size=batch)
107 |         return batch_iter
108 | 
109 |     # generat inputs for testing
110 |     def test_reader(self, filelist, batch, buf_size):
111 |         print(filelist)
112 | 
113 |         def local_iter():
114 |             for fname in filelist:
115 |                 with open(fname.strip(), "r") as fin:
116 |                     for line in fin:
117 |                         dense_feature, sparse_feature, sparse_feature_fm, label = self._process_line(line)
118 |                         yield [dense_feature] + sparse_feature + [sparse_feature_fm] + [label]
119 | 
120 |         import paddle
121 |         batch_iter = paddle.batch(
122 |             paddle.reader.buffered(
123 |                 local_iter, size=buf_size),
124 |             batch_size=batch)
125 |         return batch_iter
126 | 
127 |     # generate inputs for trainig
128 |     def generate_sample(self, line):
129 |         def data_iter():
130 |             dense_feature, sparse_feature, sparse_feature_fm, label = self._process_line(line)
131 |             # feature_name = ["user_profile"]
132 |             feature_name = []
133 |             feature_name.append("dense_feature")
134 |             for idx in self.categorical_range_:
135 |                 feature_name.append("context" + str(idx))
136 |             feature_name.append("context_fm")
137 |             feature_name.append("label")
138 |             yield list(zip(feature_name, [dense_feature] + sparse_feature + [sparse_feature_fm] + [label]))
139 | 
140 |         return data_iter
141 | 
142 | 
143 | if __name__ == "__main__":
144 |     map_dataset = MapDataset()
145 |     map_dataset.setup(int(sys.argv[1]))
146 |     map_dataset.run_from_stdin()
147 | 


--------------------------------------------------------------------------------
/network_confv6.py:
--------------------------------------------------------------------------------
  1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | import paddle.fluid as fluid
 16 | import math
 17 | 
 18 | user_profile_dim = 65
 19 | dense_feature_dim = 3
 20 | 
 21 | def ctr_deepfm_dataset(dense_feature, context_feature, context_feature_fm, label,
 22 |                        embedding_size, sparse_feature_dim):
 23 |     def dense_fm_layer(input, emb_dict_size, factor_size, fm_param_attr):
 24 | 
 25 |         first_order = fluid.layers.fc(input=input, size=1)
 26 |         emb_table = fluid.layers.create_parameter(shape=[emb_dict_size, factor_size],
 27 |                                                   dtype='float32', attr=fm_param_attr)
 28 | 
 29 |         input_mul_factor = fluid.layers.matmul(input, emb_table)
 30 |         input_mul_factor_square = fluid.layers.square(input_mul_factor)
 31 |         input_square = fluid.layers.square(input)
 32 |         factor_square = fluid.layers.square(emb_table)
 33 |         input_square_mul_factor_square = fluid.layers.matmul(input_square, factor_square)
 34 | 
 35 |         second_order = 0.5 * (input_mul_factor_square - input_square_mul_factor_square)
 36 |         return first_order, second_order
 37 | 
 38 | 
 39 |     dense_fm_param_attr = fluid.param_attr.ParamAttr(name="DenseFeatFactors",
 40 |                                                      initializer=fluid.initializer.Normal(
 41 |                                                          scale=1 / math.sqrt(dense_feature_dim)))
 42 |     dense_fm_first, dense_fm_second = dense_fm_layer(
 43 |         dense_feature, dense_feature_dim, 16, dense_fm_param_attr)
 44 | 
 45 | 
 46 |     def sparse_fm_layer(input, emb_dict_size, factor_size, fm_param_attr):
 47 | 
 48 |         first_embeddings = fluid.layers.embedding(
 49 |             input=input, dtype='float32', size=[emb_dict_size, 1], is_sparse=True)
 50 |         first_order = fluid.layers.sequence_pool(input=first_embeddings, pool_type='sum')
 51 | 
 52 |         nonzero_embeddings = fluid.layers.embedding(
 53 |             input=input, dtype='float32', size=[emb_dict_size, factor_size],
 54 |             param_attr=fm_param_attr, is_sparse=True)
 55 |         summed_features_emb = fluid.layers.sequence_pool(input=nonzero_embeddings, pool_type='sum')
 56 |         summed_features_emb_square = fluid.layers.square(summed_features_emb)
 57 | 
 58 |         squared_features_emb = fluid.layers.square(nonzero_embeddings)
 59 |         squared_sum_features_emb = fluid.layers.sequence_pool(
 60 |             input=squared_features_emb, pool_type='sum')
 61 | 
 62 |         second_order = 0.5 * (summed_features_emb_square - squared_sum_features_emb)
 63 |         return first_order, second_order
 64 | 
 65 |     sparse_fm_param_attr = fluid.param_attr.ParamAttr(name="SparseFeatFactors",
 66 |                                                       initializer=fluid.initializer.Normal(
 67 |                                                           scale=1 / math.sqrt(sparse_feature_dim)))
 68 | 
 69 |     #data = fluid.layers.data(name='ids', shape=[1], dtype='float32')
 70 |     sparse_fm_first, sparse_fm_second = sparse_fm_layer(
 71 |         context_feature_fm, sparse_feature_dim, 16, sparse_fm_param_attr)
 72 | 
 73 |     def embedding_layer(input):
 74 |         return fluid.layers.embedding(
 75 |             input=input,
 76 |             is_sparse=True,
 77 |             # you need to patch https://github.com/PaddlePaddle/Paddle/pull/14190
 78 |             # if you want to set is_distributed to True
 79 |             is_distributed=False,
 80 |             size=[sparse_feature_dim, embedding_size],
 81 |             param_attr=fluid.ParamAttr(name="SparseFeatFactors",
 82 |                                        initializer=fluid.initializer.Uniform()))
 83 | 
 84 |     sparse_embed_seq = list(map(embedding_layer, context_feature))
 85 | 
 86 |     concated_ori = fluid.layers.concat(sparse_embed_seq + [dense_feature], axis=1)
 87 |     concated = fluid.layers.batch_norm(input=concated_ori, name="bn", epsilon=1e-4)
 88 | 
 89 |     deep = deep_net(concated)
 90 | 
 91 |     predict = fluid.layers.fc(input=[deep, sparse_fm_first, sparse_fm_second, dense_fm_first, dense_fm_second], size=2, act="softmax",
 92 |                               param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
 93 |                                   scale=1 / math.sqrt(deep.shape[1])), learning_rate=0.01))
 94 | 
 95 |     #similarity_norm = fluid.layers.sigmoid(fluid.layers.clip(predict, min=-15.0, max=15.0), name="similarity_norm")
 96 | 
 97 |     cost = fluid.layers.cross_entropy(input=predict, label=label)
 98 | 
 99 |     avg_cost = fluid.layers.reduce_sum(cost)
100 |     accuracy = fluid.layers.accuracy(input=predict, label=label)
101 |     auc_var, batch_auc_var, auc_states = \
102 |         fluid.layers.auc(input=predict, label=label, num_thresholds=2 ** 12, slide_steps=20)
103 |     return avg_cost, auc_var, batch_auc_var, accuracy, predict
104 | 
105 | 
106 | def deep_net(concated, lr_x=0.0001):
107 |     fc_layers_input = [concated]
108 |     fc_layers_size = [400, 400, 400]
109 |     fc_layers_act = ["relu"] * (len(fc_layers_size))
110 | 
111 |     for i in range(len(fc_layers_size)):
112 |         fc = fluid.layers.fc(
113 |             input=fc_layers_input[-1],
114 |             size=fc_layers_size[i],
115 |             act=fc_layers_act[i],
116 |             param_attr=fluid.ParamAttr(learning_rate=lr_x * 0.5))
117 | 
118 |         fc_layers_input.append(fc)
119 |     #w_res = fluid.layers.create_parameter(shape=[353, 16], dtype='float32', name="w_res")
120 |     #high_path = fluid.layers.matmul(concated, w_res)
121 | 
122 |     #return fluid.layers.elementwise_add(high_path, fc_layers_input[-1])
123 |     return fc_layers_input[-1]
124 | 


--------------------------------------------------------------------------------
/networks/network_conf.py:
--------------------------------------------------------------------------------
  1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | 
 16 | import paddle.fluid as fluid
 17 | import math
 18 | 
 19 | user_profile_dim = 65
 20 | num_context = 25
 21 | dim_fm_vector = 16
 22 | dim_concated = user_profile_dim + dim_fm_vector * (num_context)
 23 | 
 24 | 
 25 | def ctr_deepfm_dataset(user_profile, context_feature, label,
 26 |                        embedding_size, sparse_feature_dim):
 27 |     def embedding_layer(input):
 28 |         return fluid.layers.embedding(
 29 |             input=input,
 30 |             is_sparse=True,
 31 |             # you need to patch https://github.com/PaddlePaddle/Paddle/pull/14190
 32 |             # if you want to set is_distributed to True
 33 |             is_distributed=False,
 34 |             size=[sparse_feature_dim, embedding_size],
 35 |             param_attr=fluid.ParamAttr(name="SparseFeatFactors",
 36 |                                        initializer=fluid.initializer.Uniform()))
 37 | 
 38 |     sparse_embed_seq = list(map(embedding_layer, context_feature))
 39 | 
 40 |     w = fluid.layers.create_parameter(
 41 |         shape=[65, 65], dtype='float32',
 42 |         name="w_fm")
 43 |     user_profile_emb = fluid.layers.matmul(user_profile, w)
 44 | 
 45 |     concated_ori = fluid.layers.concat(sparse_embed_seq + [user_profile_emb], axis=1)
 46 |     concated = fluid.layers.batch_norm(input=concated_ori, name="bn", epsilon=1e-4)
 47 | 
 48 |     deep = deep_net(concated)
 49 |     linear_term, second_term = fm(concated, dim_concated, 8) #depend on the number of context feature
 50 | 
 51 |     predict = fluid.layers.fc(input=[deep, linear_term, second_term], size=2, act="softmax",
 52 |                               param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
 53 |                                   scale=1 / math.sqrt(deep.shape[1])), learning_rate=0.01))
 54 | 
 55 |     #similarity_norm = fluid.layers.sigmoid(fluid.layers.clip(predict, min=-15.0, max=15.0), name="similarity_norm")
 56 | 
 57 | 
 58 |     cost = fluid.layers.cross_entropy(input=predict, label=label)
 59 | 
 60 |     avg_cost = fluid.layers.reduce_sum(cost)
 61 |     accuracy = fluid.layers.accuracy(input=predict, label=label)
 62 |     auc_var, batch_auc_var, auc_states = \
 63 |         fluid.layers.auc(input=predict, label=label, num_thresholds=2 ** 12, slide_steps=20)
 64 |     return avg_cost, auc_var, batch_auc_var, accuracy, predict
 65 | 
 66 | 
 67 | def deep_net(concated, lr_x=0.0001):
 68 |     fc_layers_input = [concated]
 69 |     fc_layers_size = [128, 64, 32, 16]
 70 |     fc_layers_act = ["relu"] * (len(fc_layers_size))
 71 | 
 72 |     for i in range(len(fc_layers_size)):
 73 |         fc = fluid.layers.fc(
 74 |             input=fc_layers_input[-1],
 75 |             size=fc_layers_size[i],
 76 |             act=fc_layers_act[i],
 77 |             param_attr=fluid.ParamAttr(learning_rate=lr_x * 0.5))
 78 | 
 79 |         fc_layers_input.append(fc)
 80 | 
 81 |     return fc_layers_input[-1]
 82 | 
 83 | 
 84 | def fm(concated, emb_dict_size, factor_size, lr_x=0.0001):
 85 |     linear_term = fluid.layers.fc(input=concated, size=8, act=None, param_attr=fluid.ParamAttr(learning_rate=lr_x))
 86 | 
 87 |     emb_table = fluid.layers.create_parameter(shape=[emb_dict_size, factor_size],
 88 |                                                   dtype='float32')
 89 | 
 90 |     input_mul_factor = fluid.layers.matmul(concated, emb_table)
 91 |     input_mul_factor_square = fluid.layers.square(input_mul_factor)
 92 |     input_square = fluid.layers.square(concated)
 93 |     factor_square = fluid.layers.square(emb_table)
 94 |     input_square_mul_factor_square = fluid.layers.matmul(input_square, factor_square)
 95 | 
 96 |     second_term = 0.5 * (input_mul_factor_square - input_square_mul_factor_square)
 97 | 
 98 |     return linear_term, second_term
 99 | 
100 | 
101 | 
102 | 
103 | 
104 | 
105 | 
106 | 
107 | 


--------------------------------------------------------------------------------
/networks/network_confv4.py:
--------------------------------------------------------------------------------
  1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | import paddle.fluid as fluid
 16 | import math
 17 | 
 18 | user_profile_dim = 65
 19 | slot_1 = [0, 1, 2, 3, 4, 5]
 20 | slot_2 = [6]
 21 | slot_3 = [7, 8, 9, 10, 11]
 22 | slot_4 = [12, 13, 14, 15, 16]
 23 | slot_5 = [17, 18, 19, 20]
 24 | num_context = 25
 25 | num_slots_pair = 5
 26 | dim_fm_vector = 16
 27 | dim_concated = user_profile_dim + dim_fm_vector * (num_context + num_slots_pair)
 28 | 
 29 | def ctr_deepfm_dataset(user_profile, dense_feature, context_feature, label,
 30 |                        embedding_size, sparse_feature_dim):
 31 |     def embedding_layer(input):
 32 |         return fluid.layers.embedding(
 33 |             input=input,
 34 |             is_sparse=True,
 35 |             # you need to patch https://github.com/PaddlePaddle/Paddle/pull/14190
 36 |             # if you want to set is_distributed to True
 37 |             is_distributed=False,
 38 |             size=[sparse_feature_dim, embedding_size],
 39 |             param_attr=fluid.ParamAttr(name="SparseFeatFactors",
 40 |                                        initializer=fluid.initializer.Uniform()))
 41 | 
 42 |     sparse_embed_seq = list(map(embedding_layer, context_feature))
 43 | 
 44 |     w = fluid.layers.create_parameter(
 45 |         shape=[65, 65], dtype='float32',
 46 |         name="w_fm")
 47 | 
 48 |     user_emb_list = []
 49 |     user_profile_emb = fluid.layers.matmul(user_profile, w)
 50 |     user_emb_list.append(user_profile_emb)
 51 |     user_emb_list.append(dense_feature)
 52 | 
 53 |     w1 = fluid.layers.create_parameter(shape=[65, dim_fm_vector], dtype='float32', name="w_1")
 54 |     w2 = fluid.layers.create_parameter(shape=[65, dim_fm_vector], dtype='float32', name="w_2")
 55 |     w3 = fluid.layers.create_parameter(shape=[65, dim_fm_vector], dtype='float32', name="w_3")
 56 |     w4 = fluid.layers.create_parameter(shape=[65, dim_fm_vector], dtype='float32', name="w_4")
 57 |     w5 = fluid.layers.create_parameter(shape=[65, dim_fm_vector], dtype='float32', name="w_5")
 58 |     user_profile_emb_1 = fluid.layers.matmul(user_profile, w1)
 59 |     user_profile_emb_2 = fluid.layers.matmul(user_profile, w2)
 60 |     user_profile_emb_3 = fluid.layers.matmul(user_profile, w3)
 61 |     user_profile_emb_4 = fluid.layers.matmul(user_profile, w4)
 62 |     user_profile_emb_5 = fluid.layers.matmul(user_profile, w5)
 63 | 
 64 |     sparse_embed_seq_1 = embedding_layer(context_feature[slot_1[0]])
 65 |     sparse_embed_seq_2 = embedding_layer(context_feature[slot_2[0]])
 66 |     sparse_embed_seq_3 = embedding_layer(context_feature[slot_3[0]])
 67 |     sparse_embed_seq_4 = embedding_layer(context_feature[slot_4[0]])
 68 |     sparse_embed_seq_5 = embedding_layer(context_feature[slot_5[0]])
 69 |     for i in slot_1[1:-1]:
 70 |         sparse_embed_seq_1 = fluid.layers.elementwise_add(sparse_embed_seq_1, embedding_layer(context_feature[i]))
 71 |     for i in slot_2[1:-1]:
 72 |         sparse_embed_seq_2 = fluid.layers.elementwise_add(sparse_embed_seq_2, embedding_layer(context_feature[i]))
 73 |     for i in slot_3[1:-1]:
 74 |         sparse_embed_seq_3 = fluid.layers.elementwise_add(sparse_embed_seq_3, embedding_layer(context_feature[i]))
 75 |     for i in slot_4[1:-1]:
 76 |         sparse_embed_seq_4 = fluid.layers.elementwise_add(sparse_embed_seq_4, embedding_layer(context_feature[i]))
 77 |     for i in slot_5[1:-1]:
 78 |         sparse_embed_seq_5 = fluid.layers.elementwise_add(sparse_embed_seq_5, embedding_layer(context_feature[i]))
 79 | 
 80 |     ele_product_1 = fluid.layers.elementwise_mul(user_profile_emb_1, sparse_embed_seq_1)
 81 |     user_emb_list.append(ele_product_1)
 82 |     ele_product_2 = fluid.layers.elementwise_mul(user_profile_emb_2, sparse_embed_seq_2)
 83 |     user_emb_list.append(ele_product_2)
 84 |     ele_product_3 = fluid.layers.elementwise_mul(user_profile_emb_3, sparse_embed_seq_3)
 85 |     user_emb_list.append(ele_product_3)
 86 |     ele_product_4 = fluid.layers.elementwise_mul(user_profile_emb_4, sparse_embed_seq_4)
 87 |     user_emb_list.append(ele_product_4)
 88 |     ele_product_5 = fluid.layers.elementwise_mul(user_profile_emb_5, sparse_embed_seq_5)
 89 |     user_emb_list.append(ele_product_5)
 90 | 
 91 |     ffm_1 = fluid.layers.reduce_sum(ele_product_1, dim=1, keep_dim=True)
 92 |     ffm_2 = fluid.layers.reduce_sum(ele_product_2, dim=1, keep_dim=True)
 93 |     ffm_3 = fluid.layers.reduce_sum(ele_product_3, dim=1, keep_dim=True)
 94 |     ffm_4 = fluid.layers.reduce_sum(ele_product_4, dim=1, keep_dim=True)
 95 |     ffm_5 = fluid.layers.reduce_sum(ele_product_5, dim=1, keep_dim=True)
 96 | 
 97 | 
 98 |     concated_ori = fluid.layers.concat(sparse_embed_seq + user_emb_list, axis=1)
 99 |     concated = fluid.layers.batch_norm(input=concated_ori, name="bn", epsilon=1e-4)
100 | 
101 |     deep = deep_net(concated)
102 |     linear_term, second_term = fm(concated, dim_concated, 8) #depend on the number of context feature
103 | 
104 |     predict = fluid.layers.fc(input=[deep, linear_term, second_term, ffm_1, ffm_2, ffm_3, ffm_4, ffm_5], size=2, act="softmax",
105 |                               param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
106 |                                   scale=1 / math.sqrt(deep.shape[1])), learning_rate=0.01))
107 | 
108 |     #similarity_norm = fluid.layers.sigmoid(fluid.layers.clip(predict, min=-15.0, max=15.0), name="similarity_norm")
109 | 
110 | 
111 |     cost = fluid.layers.cross_entropy(input=predict, label=label)
112 | 
113 |     avg_cost = fluid.layers.reduce_sum(cost)
114 |     accuracy = fluid.layers.accuracy(input=predict, label=label)
115 |     auc_var, batch_auc_var, auc_states = \
116 |         fluid.layers.auc(input=predict, label=label, num_thresholds=2 ** 12, slide_steps=20)
117 |     return avg_cost, auc_var, batch_auc_var, accuracy, predict
118 | 
119 | 
120 | def deep_net(concated, lr_x=0.0001):
121 |     fc_layers_input = [concated]
122 |     fc_layers_size = [256, 128, 64, 32, 16]
123 |     fc_layers_act = ["relu"] * (len(fc_layers_size))
124 | 
125 |     for i in range(len(fc_layers_size)):
126 |         fc = fluid.layers.fc(
127 |             input=fc_layers_input[-1],
128 |             size=fc_layers_size[i],
129 |             act=fc_layers_act[i],
130 |             param_attr=fluid.ParamAttr(learning_rate=lr_x * 0.5))
131 | 
132 |         fc_layers_input.append(fc)
133 |     w_res = fluid.layers.create_parameter(shape=[dim_concated, 16], dtype='float32', name="w_res")
134 |     high_path = fluid.layers.matmul(concated, w_res)
135 | 
136 |     return fluid.layers.elementwise_add(high_path, fc_layers_input[-1])
137 |     #return fc_layers_input[-1]
138 | 
139 | 
140 | def fm(concated, emb_dict_size, factor_size, lr_x=0.0001):
141 |     linear_term = fluid.layers.fc(input=concated, size=8, act=None, param_attr=fluid.ParamAttr(learning_rate=lr_x))
142 | 
143 |     emb_table = fluid.layers.create_parameter(shape=[emb_dict_size, factor_size],
144 |                                                   dtype='float32')
145 | 
146 |     input_mul_factor = fluid.layers.matmul(concated, emb_table)
147 |     input_mul_factor_square = fluid.layers.square(input_mul_factor)
148 |     input_square = fluid.layers.square(concated)
149 |     factor_square = fluid.layers.square(emb_table)
150 |     input_square_mul_factor_square = fluid.layers.matmul(input_square, factor_square)
151 | 
152 |     second_term = 0.5 * (input_mul_factor_square - input_square_mul_factor_square)
153 | 
154 |     return linear_term, second_term


--------------------------------------------------------------------------------
/networks/network_confv6.py:
--------------------------------------------------------------------------------
  1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | import paddle.fluid as fluid
 16 | import math
 17 | 
 18 | user_profile_dim = 65
 19 | dense_feature_dim = 3
 20 | 
 21 | def ctr_deepfm_dataset(dense_feature, context_feature, context_feature_fm, label,
 22 |                        embedding_size, sparse_feature_dim):
 23 |     def dense_fm_layer(input, emb_dict_size, factor_size, fm_param_attr):
 24 | 
 25 |         first_order = fluid.layers.fc(input=input, size=1)
 26 |         emb_table = fluid.layers.create_parameter(shape=[emb_dict_size, factor_size],
 27 |                                                   dtype='float32', attr=fm_param_attr)
 28 | 
 29 |         input_mul_factor = fluid.layers.matmul(input, emb_table)
 30 |         input_mul_factor_square = fluid.layers.square(input_mul_factor)
 31 |         input_square = fluid.layers.square(input)
 32 |         factor_square = fluid.layers.square(emb_table)
 33 |         input_square_mul_factor_square = fluid.layers.matmul(input_square, factor_square)
 34 | 
 35 |         second_order = 0.5 * (input_mul_factor_square - input_square_mul_factor_square)
 36 |         return first_order, second_order
 37 | 
 38 | 
 39 |     dense_fm_param_attr = fluid.param_attr.ParamAttr(name="DenseFeatFactors",
 40 |                                                      initializer=fluid.initializer.Normal(
 41 |                                                          scale=1 / math.sqrt(dense_feature_dim)))
 42 |     dense_fm_first, dense_fm_second = dense_fm_layer(
 43 |         dense_feature, dense_feature_dim, 16, dense_fm_param_attr)
 44 | 
 45 | 
 46 |     def sparse_fm_layer(input, emb_dict_size, factor_size, fm_param_attr):
 47 | 
 48 |         first_embeddings = fluid.layers.embedding(
 49 |             input=input, dtype='float32', size=[emb_dict_size, 1], is_sparse=True)
 50 |         first_order = fluid.layers.sequence_pool(input=first_embeddings, pool_type='sum')
 51 | 
 52 |         nonzero_embeddings = fluid.layers.embedding(
 53 |             input=input, dtype='float32', size=[emb_dict_size, factor_size],
 54 |             param_attr=fm_param_attr, is_sparse=True)
 55 |         summed_features_emb = fluid.layers.sequence_pool(input=nonzero_embeddings, pool_type='sum')
 56 |         summed_features_emb_square = fluid.layers.square(summed_features_emb)
 57 | 
 58 |         squared_features_emb = fluid.layers.square(nonzero_embeddings)
 59 |         squared_sum_features_emb = fluid.layers.sequence_pool(
 60 |             input=squared_features_emb, pool_type='sum')
 61 | 
 62 |         second_order = 0.5 * (summed_features_emb_square - squared_sum_features_emb)
 63 |         return first_order, second_order
 64 | 
 65 |     sparse_fm_param_attr = fluid.param_attr.ParamAttr(name="SparseFeatFactors",
 66 |                                                       initializer=fluid.initializer.Normal(
 67 |                                                           scale=1 / math.sqrt(sparse_feature_dim)))
 68 | 
 69 |     #data = fluid.layers.data(name='ids', shape=[1], dtype='float32')
 70 |     sparse_fm_first, sparse_fm_second = sparse_fm_layer(
 71 |         context_feature_fm, sparse_feature_dim, 16, sparse_fm_param_attr)
 72 | 
 73 |     def embedding_layer(input):
 74 |         return fluid.layers.embedding(
 75 |             input=input,
 76 |             is_sparse=True,
 77 |             # you need to patch https://github.com/PaddlePaddle/Paddle/pull/14190
 78 |             # if you want to set is_distributed to True
 79 |             is_distributed=False,
 80 |             size=[sparse_feature_dim, embedding_size],
 81 |             param_attr=fluid.ParamAttr(name="SparseFeatFactors",
 82 |                                        initializer=fluid.initializer.Uniform()))
 83 | 
 84 |     sparse_embed_seq = list(map(embedding_layer, context_feature))
 85 | 
 86 |     concated_ori = fluid.layers.concat(sparse_embed_seq + [dense_feature], axis=1)
 87 |     concated = fluid.layers.batch_norm(input=concated_ori, name="bn", epsilon=1e-4)
 88 | 
 89 |     deep = deep_net(concated)
 90 | 
 91 |     predict = fluid.layers.fc(input=[deep, sparse_fm_first, sparse_fm_second, dense_fm_first, dense_fm_second], size=2, act="softmax",
 92 |                               param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
 93 |                                   scale=1 / math.sqrt(deep.shape[1])), learning_rate=0.01))
 94 | 
 95 |     #similarity_norm = fluid.layers.sigmoid(fluid.layers.clip(predict, min=-15.0, max=15.0), name="similarity_norm")
 96 | 
 97 |     cost = fluid.layers.cross_entropy(input=predict, label=label)
 98 | 
 99 |     avg_cost = fluid.layers.reduce_sum(cost)
100 |     accuracy = fluid.layers.accuracy(input=predict, label=label)
101 |     auc_var, batch_auc_var, auc_states = \
102 |         fluid.layers.auc(input=predict, label=label, num_thresholds=2 ** 12, slide_steps=20)
103 |     return avg_cost, auc_var, batch_auc_var, accuracy, predict
104 | 
105 | 
106 | def deep_net(concated, lr_x=0.0001):
107 |     fc_layers_input = [concated]
108 |     fc_layers_size = [400, 400, 400]
109 |     fc_layers_act = ["relu"] * (len(fc_layers_size))
110 | 
111 |     for i in range(len(fc_layers_size)):
112 |         fc = fluid.layers.fc(
113 |             input=fc_layers_input[-1],
114 |             size=fc_layers_size[i],
115 |             act=fc_layers_act[i],
116 |             param_attr=fluid.ParamAttr(learning_rate=lr_x * 0.5))
117 | 
118 |         fc_layers_input.append(fc)
119 |     #w_res = fluid.layers.create_parameter(shape=[353, 16], dtype='float32', name="w_res")
120 |     #high_path = fluid.layers.matmul(concated, w_res)
121 | 
122 |     #return fluid.layers.elementwise_add(high_path, fc_layers_input[-1])
123 |     return fc_layers_input[-1]


--------------------------------------------------------------------------------
/out/readme:
--------------------------------------------------------------------------------
1 | this folder for preprocessed data
2 | 


--------------------------------------------------------------------------------
/pre_process_test.py:
--------------------------------------------------------------------------------
  1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | import os, sys, time, random, csv, datetime, json
 16 | import pandas as pd
 17 | import numpy as np
 18 | import argparse
 19 | import logging
 20 | import time
 21 | 
 22 | logging.basicConfig(
 23 |     format='%(asctime)s - %(levelname)s - %(message)s')
 24 | logger = logging.getLogger("preprocess")
 25 | logger.setLevel(logging.INFO)
 26 | 
 27 | TEST_QUERIES_PATH = "./data_set_phase1/test_queries.csv"
 28 | TEST_PLANS_PATH = "./data_set_phase1/test_plans.csv"
 29 | TRAIN_CLICK_PATH = "./data_set_phase1/train_clicks.csv"
 30 | PROFILES_PATH = "./data_set_phase1/profiles.csv"
 31 | OUT_NORM_TEST_PATH = "./out/normed_test_session.txt"
 32 | OUT_RAW_TEST_PATH = "./out/test_session.txt"
 33 | 
 34 | O1_MIN = 115.47
 35 | O1_MAX = 117.29
 36 | 
 37 | O2_MIN = 39.46
 38 | O2_MAX = 40.97
 39 | 
 40 | D1_MIN = 115.44
 41 | D1_MAX = 117.37
 42 | 
 43 | D2_MIN = 39.46
 44 | D2_MAX = 40.96
 45 | SCALE_OD = 0.02
 46 | 
 47 | DISTANCE_MIN = 1.0
 48 | DISTANCE_MAX = 225864.0
 49 | THRESHOLD_DIS = 40000.0
 50 | SCALE_DIS = 500
 51 | 
 52 | PRICE_MIN = 200.0
 53 | PRICE_MAX = 92300.0
 54 | THRESHOLD_PRICE = 20000
 55 | SCALE_PRICE = 100
 56 | 
 57 | ETA_MIN = 1.0
 58 | ETA_MAX = 72992.0
 59 | THRESHOLD_ETA = 10800.0
 60 | SCALE_ETA = 120
 61 | 
 62 | 
 63 | def build_norm_feature():
 64 |     with open(OUT_NORM_TEST_PATH, 'w') as nf:
 65 |         with open(OUT_RAW_TEST_PATH, 'r') as f:
 66 |             for line in f:
 67 |                 cur_map = json.loads(line)
 68 | 
 69 |                 if cur_map["plan"]["distance"] > THRESHOLD_DIS:
 70 |                     cur_map["plan"]["distance"] = int(THRESHOLD_DIS)
 71 |                 elif cur_map["plan"]["distance"] > 0:
 72 |                     cur_map["plan"]["distance"] = int(cur_map["plan"]["distance"] / SCALE_DIS)
 73 | 
 74 |                 if cur_map["plan"]["price"] and cur_map["plan"]["price"] > THRESHOLD_PRICE:
 75 |                     cur_map["plan"]["price"] = int(THRESHOLD_PRICE)
 76 |                 elif not cur_map["plan"]["price"] or cur_map["plan"]["price"] < 0:
 77 |                     cur_map["plan"]["price"] = 0
 78 |                 else:
 79 |                     cur_map["plan"]["price"] = int(cur_map["plan"]["price"] / SCALE_PRICE)
 80 | 
 81 |                 if cur_map["plan"]["eta"] > THRESHOLD_ETA:
 82 |                     cur_map["plan"]["eta"] = int(THRESHOLD_ETA)
 83 |                 elif cur_map["plan"]["eta"] > 0:
 84 |                     cur_map["plan"]["eta"] = int(cur_map["plan"]["eta"] / SCALE_ETA)
 85 | 
 86 |                 # o1
 87 |                 if cur_map["query"]["o1"] > O1_MAX:
 88 |                     cur_map["query"]["o1"] = int((O1_MAX - O1_MIN) / SCALE_OD + 1)
 89 |                 elif cur_map["query"]["o1"] < O1_MIN:
 90 |                     cur_map["query"]["o1"] = 0
 91 |                 else:
 92 |                     cur_map["query"]["o1"] = int((cur_map["query"]["o1"] - O1_MIN) / 0.02)
 93 | 
 94 |                 # o2
 95 |                 if cur_map["query"]["o2"] > O2_MAX:
 96 |                     cur_map["query"]["o2"] = int((O2_MAX - O2_MIN) / SCALE_OD + 1)
 97 |                 elif cur_map["query"]["o2"] < O2_MIN:
 98 |                     cur_map["query"]["o2"] = 0
 99 |                 else:
100 |                     cur_map["query"]["o2"] = int((cur_map["query"]["o2"] - O2_MIN) / 0.02)
101 | 
102 |                 # d1
103 |                 if cur_map["query"]["d1"] > D1_MAX:
104 |                     cur_map["query"]["d1"] = int((D1_MAX - D1_MIN) / SCALE_OD + 1)
105 |                 elif cur_map["query"]["d1"] < D1_MIN:
106 |                     cur_map["query"]["d1"] = 0
107 |                 else:
108 |                     cur_map["query"]["d1"] = int((cur_map["query"]["d1"] - D1_MIN) / SCALE_OD)
109 | 
110 |                 # d2
111 |                 if cur_map["query"]["d2"] > D2_MAX:
112 |                     cur_map["query"]["d2"] = int((D2_MAX - D2_MIN) / SCALE_OD + 1)
113 |                 elif cur_map["query"]["d2"] < D2_MIN:
114 |                     cur_map["query"]["d2"] = 0
115 |                 else:
116 |                     cur_map["query"]["d2"] = int((cur_map["query"]["d2"] - D2_MIN) / SCALE_OD)
117 | 
118 |                 cur_json_instance = json.dumps(cur_map)
119 |                 nf.write(cur_json_instance + '\n')
120 | 
121 | 
122 | def preprocess():
123 |     """
124 |     Construct the train data indexed by session id and mode id jointly. Convert some of the raw features (user profile,
125 |     od pair, req time, click time, eta, price, distance, transport mode) to one-hot ids used for
126 |     embedding. We split the one-hot features into two categories: user feature and context feature for
127 |     better understanding of FM algorithm.
128 |     Note that the user profile is already provided by one-hot encoded form, we convert it back to the
129 |     ids for unity with the context feature and easily using of PaddlePaddle embedding layer. Given the
130 |     train clicks data, we label each train instance with 1 or 0 depend on if this instance is clicked or
131 |     not.
132 |     :return:
133 |     """
134 | 
135 |     train_data_dict = {}
136 |     with open("./weather.json", 'r') as f:
137 |         weather_dict = json.load(f)
138 | 
139 |     with open(TEST_QUERIES_PATH, 'r') as f:
140 |         csv_reader = csv.reader(f, delimiter=',')
141 |         train_index_list = []
142 |         for k, line in enumerate(csv_reader):
143 |             if k == 0: continue
144 |             if line[0] == "": continue
145 |             if line[1] == "":
146 |                 train_index_list.append(line[0] + "_0")
147 |             else:
148 |                 train_index_list.append(line[0] + "_" + line[1])
149 | 
150 |             train_index = line[0]
151 |             train_data_dict[train_index] = {}
152 |             train_data_dict[train_index]["pid"] = line[1]
153 |             train_data_dict[train_index]["query"] = {}
154 | 
155 |             reqweekday = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%w")
156 |             reqhour = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%H")
157 | 
158 |             date_key = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%m-%d")
159 |             train_data_dict[train_index]["weather"] = {}
160 |             train_data_dict[train_index]["weather"].update({"max_temp": weather_dict[date_key]["max_temp"]})
161 |             train_data_dict[train_index]["weather"].update({"min_temp": weather_dict[date_key]["min_temp"]})
162 |             train_data_dict[train_index]["weather"].update({"wea": weather_dict[date_key]["weather"]})
163 |             train_data_dict[train_index]["weather"].update({"wind": weather_dict[date_key]["wind"]})
164 | 
165 |             train_data_dict[train_index]["query"].update({"weekday":reqweekday})
166 |             train_data_dict[train_index]["query"].update({"hour":reqhour})
167 | 
168 |             o = line[3].split(',')
169 |             o_first = o[0]
170 |             o_second = o[1]
171 |             train_data_dict[train_index]["query"].update({"o1":float(o_first)})
172 |             train_data_dict[train_index]["query"].update({"o2":float(o_second)})
173 | 
174 |             d = line[4].split(',')
175 |             d_first = d[0]
176 |             d_second = d[1]
177 |             train_data_dict[train_index]["query"].update({"d1":float(d_first)})
178 |             train_data_dict[train_index]["query"].update({"d2":float(d_second)})
179 | 
180 |     plan_map = {}
181 |     plan_data = pd.read_csv(TEST_PLANS_PATH)
182 |     for index, row in plan_data.iterrows():
183 |         plans_str = row['plans']
184 |         plans_list = json.loads(plans_str)
185 |         session_id = str(row['sid'])
186 |         # train_data_dict[session_id]["plans"] = []
187 |         plan_map[session_id] = plans_list
188 | 
189 |     profile_map = {}
190 |     with open(PROFILES_PATH, 'r') as f:
191 |         csv_reader = csv.reader(f, delimiter=',')
192 |         for k, line in enumerate(csv_reader):
193 |             if k == 0: continue
194 |             profile_map[line[0]] = [i for i in range(len(line)) if line[i] == "1.0"]
195 | 
196 |     session_click_map = {}
197 |     with open(TRAIN_CLICK_PATH, 'r') as f:
198 |         csv_reader = csv.reader(f, delimiter=',')
199 |         for k, line in enumerate(csv_reader):
200 |             if k == 0: continue
201 |             if line[0] == "" or line[1] == "" or line[2] == "":
202 |                 continue
203 |             session_click_map[line[0]] = line[2]
204 |     #return train_data_dict, profile_map, session_click_map, plan_map
205 |     generate_sparse_features(train_data_dict, profile_map, session_click_map, plan_map)
206 | 
207 | 
208 | def generate_sparse_features(train_data_dict, profile_map, session_click_map, plan_map):
209 |     if not os.path.isdir("./out/"):
210 |         os.mkdir("./out/")
211 |     with open(os.path.join("./out/", "test_session.txt"), 'w') as f_train:
212 |         for session_id, plan_list in plan_map.items():
213 |             if session_id not in train_data_dict:
214 |                 continue
215 |             cur_map = train_data_dict[session_id]
216 |             cur_map["session_id"] = session_id
217 |             if cur_map["pid"] != "":
218 |                 cur_map["profile"] = profile_map[cur_map["pid"]]
219 |             else:
220 |                 cur_map["profile"] = [0]
221 |             del cur_map["pid"]
222 |             whole_rank = 0
223 |             for plan in plan_list:
224 |                 whole_rank += 1
225 |                 cur_map["mode_rank" + str(whole_rank)] = plan["transport_mode"]
226 | 
227 |             if whole_rank < 5:
228 |                 for r in range(whole_rank + 1, 6):
229 |                     cur_map["mode_rank" + str(r)] = -1
230 | 
231 |             cur_map["whole_rank"] = whole_rank
232 |             flag_click = False
233 |             rank = 1
234 | 
235 |             price_list = []
236 |             eta_list = []
237 |             distance_list = []
238 |             for plan in plan_list:
239 |                 if not plan["price"]:
240 |                     price_list.append(0)
241 |                 else:
242 |                     price_list.append(int(plan["price"]))
243 |                 eta_list.append(int(plan["eta"]))
244 |                 distance_list.append(int(plan["distance"]))
245 |             price_list.sort(reverse=False)
246 |             eta_list.sort(reverse=False)
247 |             distance_list.sort(reverse=False)
248 | 
249 |             for plan in plan_list:
250 |                 if plan["price"] and int(plan["price"]) == price_list[0]:
251 |                     cur_map["mode_min_price"] = plan["transport_mode"]
252 |                 if plan["price"] and int(plan["price"]) == price_list[-1]:
253 |                     cur_map["mode_max_price"] = plan["transport_mode"]
254 |                 if int(plan["eta"]) == eta_list[0]:
255 |                     cur_map["mode_min_eta"] = plan["transport_mode"]
256 |                 if int(plan["eta"]) == eta_list[-1]:
257 |                     cur_map["mode_max_eta"] = plan["transport_mode"]
258 |                 if int(plan["distance"]) == distance_list[0]:
259 |                     cur_map["mode_min_distance"] = plan["transport_mode"]
260 |                 if int(plan["distance"]) == distance_list[-1]:
261 |                     cur_map["mode_max_distance"] = plan["transport_mode"]
262 |             if "mode_min_price" not in cur_map:
263 |                 cur_map["mode_min_price"] = -1
264 |             if "mode_max_price" not in cur_map:
265 |                 cur_map["mode_max_price"] = -1
266 | 
267 | 
268 |             for plan in plan_list:
269 |                 cur_price = int(plan["price"]) if plan["price"] else 0
270 |                 cur_eta = int(plan["eta"])
271 |                 cur_distance = int(plan["distance"])
272 |                 cur_map["price_rank"] = price_list.index(cur_price) + 1
273 |                 cur_map["eta_rank"] = eta_list.index(cur_eta) + 1
274 |                 cur_map["distance_rank"] = distance_list.index(cur_distance) + 1
275 | 
276 |                 if ("transport_mode" in plan) and (session_id in session_click_map) and (
277 |                         int(plan["transport_mode"]) == int(session_click_map[session_id])):
278 |                     cur_map["plan"] = plan
279 |                     cur_map["label"] = 1
280 |                     flag_click = True
281 |                     # print("label is 1")
282 |                 else:
283 |                     cur_map["plan"] = plan
284 |                     cur_map["label"] = 0
285 | 
286 |                 cur_map["plan_rank"] = rank
287 |                 rank += 1
288 |                 cur_json_instance = json.dumps(cur_map)
289 |                 f_train.write(cur_json_instance + '\n')
290 | 
291 |             cur_map["plan"]["distance"] = -1
292 |             cur_map["plan"]["price"] = -1
293 |             cur_map["plan"]["eta"] = -1
294 |             cur_map["plan"]["transport_mode"] = 0
295 |             cur_map["plan_rank"] = 0
296 |             cur_map["price_rank"] = 0
297 |             cur_map["eta_rank"] = 0
298 |             cur_map["plan_rank"] = 0
299 |             cur_map["label"] = 1
300 |             cur_json_instance = json.dumps(cur_map)
301 |             f_train.write(cur_json_instance + '\n')
302 | 
303 |     build_norm_feature()
304 | 
305 | 
306 | if __name__ == "__main__":
307 |     preprocess()


--------------------------------------------------------------------------------
/pre_test_dense.py:
--------------------------------------------------------------------------------
  1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | 
 16 | import os, sys, time, random, csv, datetime, json
 17 | import pandas as pd
 18 | import numpy as np
 19 | import argparse
 20 | import logging
 21 | import time
 22 | 
 23 | logging.basicConfig(
 24 |     format='%(asctime)s - %(levelname)s - %(message)s')
 25 | logger = logging.getLogger("preprocess")
 26 | logger.setLevel(logging.INFO)
 27 | 
 28 | TRAIN_QUERIES_PATH = "./data_set_phase1/test_queries.csv"
 29 | TRAIN_PLANS_PATH = "./data_set_phase1/test_plans.csv"
 30 | TRAIN_CLICK_PATH = "./data_set_phase1/train_clicks.csv"
 31 | PROFILES_PATH = "./data_set_phase1/profiles.csv"
 32 | 
 33 | O1_MIN = 115.47
 34 | O1_MAX = 117.29
 35 | 
 36 | O2_MIN = 39.46
 37 | O2_MAX = 40.97
 38 | 
 39 | D1_MIN = 115.44
 40 | D1_MAX = 117.37
 41 | 
 42 | D2_MIN = 39.46
 43 | D2_MAX = 40.96
 44 | 
 45 | DISTANCE_MIN = 1.0
 46 | DISTANCE_MAX = 225864.0
 47 | THRESHOLD_DIS = 200000.0
 48 | 
 49 | PRICE_MIN = 200.0
 50 | PRICE_MAX = 92300.0
 51 | THRESHOLD_PRICE = 20000
 52 | 
 53 | ETA_MIN = 1.0
 54 | ETA_MAX = 72992.0
 55 | THRESHOLD_ETA = 10800.0
 56 | 
 57 | 
 58 | def build_norm_feature():
 59 |     with open("./out/normed_test_session.txt", 'w') as nf:
 60 |         with open("./out/test_session.txt", 'r') as f:
 61 |             for line in f:
 62 |                 cur_map = json.loads(line)
 63 | 
 64 |                 cur_map["plan"]["distance"] = (cur_map["plan"]["distance"] - DISTANCE_MIN) / (DISTANCE_MAX - DISTANCE_MIN)
 65 | 
 66 |                 if cur_map["plan"]["price"]:
 67 |                     cur_map["plan"]["price"] = (cur_map["plan"]["price"] - PRICE_MIN) / (PRICE_MAX - PRICE_MIN)
 68 |                 else:
 69 |                     cur_map["plan"]["price"] = 0.0
 70 | 
 71 |                 cur_map["plan"]["eta"] = (cur_map["plan"]["eta"] - ETA_MIN) / (ETA_MAX - ETA_MIN)
 72 | 
 73 |                 cur_json_instance = json.dumps(cur_map)
 74 |                 nf.write(cur_json_instance + '\n')
 75 | 
 76 | 
 77 | def preprocess():
 78 |     """
 79 |     Construct the train data indexed by session id and mode id jointly. Convert all the raw features (user profile,
 80 |     od pair, req time, click time, eta, price, distance, transport mode) to one-hot ids used for
 81 |     embedding. We split the one-hot features into two categories: user feature and context feature for
 82 |     better understanding of FFM algorithm.
 83 |     Note that the user profile is already provided by one-hot encoded form, we convert it back to the
 84 |     ids for unity with the context feature and easily using of PaddlePaddle embedding layer. Given the
 85 |     train clicks data, we label each train instance with 1 or 0 depend on if this instance is clicked or
 86 |     not.
 87 |     :return:
 88 |     """
 89 |     #args = parse_args()
 90 | 
 91 |     train_data_dict = {}
 92 |     with open("./weather.json", 'r') as f:
 93 |         weather_dict = json.load(f)
 94 | 
 95 |     with open(TRAIN_QUERIES_PATH, 'r') as f:
 96 |         csv_reader = csv.reader(f, delimiter=',')
 97 |         train_index_list = []
 98 |         for k, line in enumerate(csv_reader):
 99 |             if k == 0: continue
100 |             if line[0] == "": continue
101 |             if line[1] == "":
102 |                 train_index_list.append(line[0] + "_0")
103 |             else:
104 |                 train_index_list.append(line[0] + "_" + line[1])
105 | 
106 |             train_index = line[0]
107 |             train_data_dict[train_index] = {}
108 |             train_data_dict[train_index]["pid"] = line[1]
109 |             train_data_dict[train_index]["query"] = {}
110 | 
111 |             reqweekday = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%w")
112 |             reqhour = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%H")
113 | 
114 |             date_key = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%m-%d")
115 |             train_data_dict[train_index]["weather"] = {}
116 |             train_data_dict[train_index]["weather"].update({"max_temp": weather_dict[date_key]["max_temp"]})
117 |             train_data_dict[train_index]["weather"].update({"min_temp": weather_dict[date_key]["min_temp"]})
118 |             train_data_dict[train_index]["weather"].update({"wea": weather_dict[date_key]["weather"]})
119 |             train_data_dict[train_index]["weather"].update({"wind": weather_dict[date_key]["wind"]})
120 | 
121 |             train_data_dict[train_index]["query"].update({"weekday":reqweekday})
122 |             train_data_dict[train_index]["query"].update({"hour":reqhour})
123 | 
124 |             o = line[3].split(',')
125 |             o_first = o[0]
126 |             o_second = o[1]
127 |             train_data_dict[train_index]["query"].update({"o1":float(o_first)})
128 |             train_data_dict[train_index]["query"].update({"o2":float(o_second)})
129 | 
130 |             d = line[4].split(',')
131 |             d_first = d[0]
132 |             d_second = d[1]
133 |             train_data_dict[train_index]["query"].update({"d1":float(d_first)})
134 |             train_data_dict[train_index]["query"].update({"d2":float(d_second)})
135 | 
136 |     plan_map = {}
137 |     plan_data = pd.read_csv(TRAIN_PLANS_PATH)
138 |     for index, row in plan_data.iterrows():
139 |         plans_str = row['plans']
140 |         plans_list = json.loads(plans_str)
141 |         session_id = str(row['sid'])
142 |         # train_data_dict[session_id]["plans"] = []
143 |         plan_map[session_id] = plans_list
144 | 
145 |     profile_map = {}
146 |     with open(PROFILES_PATH, 'r') as f:
147 |         csv_reader = csv.reader(f, delimiter=',')
148 |         for k, line in enumerate(csv_reader):
149 |             if k == 0: continue
150 |             profile_map[line[0]] = [i for i in range(len(line)) if line[i] == "1.0"]
151 | 
152 |     session_click_map = {}
153 |     with open(TRAIN_CLICK_PATH, 'r') as f:
154 |         csv_reader = csv.reader(f, delimiter=',')
155 |         for k, line in enumerate(csv_reader):
156 |             if k == 0: continue
157 |             if line[0] == "" or line[1] == "" or line[2] == "":
158 |                 continue
159 |             session_click_map[line[0]] = line[2]
160 |     #return train_data_dict, profile_map, session_click_map, plan_map
161 |     generate_sparse_features(train_data_dict, profile_map, session_click_map, plan_map)
162 | 
163 | 
164 | def generate_sparse_features(train_data_dict, profile_map, session_click_map, plan_map):
165 |     if not os.path.isdir("./out/"):
166 |         os.mkdir("./out/")
167 |     with open(os.path.join("./out/", "test_session.txt"), 'w') as f_train:
168 |         for session_id, plan_list in plan_map.items():
169 |             if session_id not in train_data_dict:
170 |                 continue
171 |             cur_map = train_data_dict[session_id]
172 |             cur_map["session_id"] = session_id
173 |             if cur_map["pid"] != "":
174 |                 cur_map["profile"] = profile_map[cur_map["pid"]]
175 |             else:
176 |                 cur_map["profile"] = [0]
177 |             # del cur_map["pid"]
178 |             whole_rank = 0
179 |             for plan in plan_list:
180 |                 whole_rank += 1
181 |                 cur_map["mode_rank" + str(whole_rank)] = plan["transport_mode"]
182 | 
183 |             if whole_rank < 5:
184 |                 for r in range(whole_rank + 1, 6):
185 |                     cur_map["mode_rank" + str(r)] = -1
186 | 
187 |             cur_map["whole_rank"] = whole_rank
188 |             rank = 1
189 | 
190 |             price_list = []
191 |             eta_list = []
192 |             distance_list = []
193 |             for plan in plan_list:
194 |                 if not plan["price"]:
195 |                     price_list.append(0)
196 |                 else:
197 |                     price_list.append(int(plan["price"]))
198 |                 eta_list.append(int(plan["eta"]))
199 |                 distance_list.append(int(plan["distance"]))
200 |             price_list.sort(reverse=False)
201 |             eta_list.sort(reverse=False)
202 |             distance_list.sort(reverse=False)
203 | 
204 |             for plan in plan_list:
205 |                 if plan["price"] and int(plan["price"]) == price_list[0]:
206 |                     cur_map["mode_min_price"] = plan["transport_mode"]
207 |                 if plan["price"] and int(plan["price"]) == price_list[-1]:
208 |                     cur_map["mode_max_price"] = plan["transport_mode"]
209 |                 if int(plan["eta"]) == eta_list[0]:
210 |                     cur_map["mode_min_eta"] = plan["transport_mode"]
211 |                 if int(plan["eta"]) == eta_list[-1]:
212 |                     cur_map["mode_max_eta"] = plan["transport_mode"]
213 |                 if int(plan["distance"]) == distance_list[0]:
214 |                     cur_map["mode_min_distance"] = plan["transport_mode"]
215 |                 if int(plan["distance"]) == distance_list[-1]:
216 |                     cur_map["mode_max_distance"] = plan["transport_mode"]
217 |             if "mode_min_price" not in cur_map:
218 |                 cur_map["mode_min_price"] = -1
219 |             if "mode_max_price" not in cur_map:
220 |                 cur_map["mode_max_price"] = -1
221 | 
222 |             for plan in plan_list:
223 |                 cur_price = int(plan["price"]) if plan["price"] else 0
224 |                 cur_eta = int(plan["eta"])
225 |                 cur_distance = int(plan["distance"])
226 |                 cur_map["price_rank"] = price_list.index(cur_price) + 1
227 |                 cur_map["eta_rank"] = eta_list.index(cur_eta) + 1
228 |                 cur_map["distance_rank"] = distance_list.index(cur_distance) + 1
229 | 
230 |                 if ("transport_mode" in plan) and (session_id in session_click_map) and (
231 |                         int(plan["transport_mode"]) == int(session_click_map[session_id])):
232 |                     cur_map["plan"] = plan
233 |                     cur_map["label"] = 1
234 |                 else:
235 |                     cur_map["plan"] = plan
236 |                     cur_map["label"] = 0
237 | 
238 |                 cur_map["plan_rank"] = rank
239 |                 rank += 1
240 |                 cur_json_instance = json.dumps(cur_map)
241 |                 f_train.write(cur_json_instance + '\n')
242 | 
243 |             cur_map["plan"]["distance"] = -1
244 |             cur_map["plan"]["price"] = -1
245 |             cur_map["plan"]["eta"] = -1
246 |             cur_map["plan"]["transport_mode"] = 0
247 |             cur_map["plan_rank"] = 0
248 |             cur_map["price_rank"] = 0
249 |             cur_map["eta_rank"] = 0
250 |             cur_map["plan_rank"] = 0
251 |             cur_map["label"] = 1
252 |             cur_json_instance = json.dumps(cur_map)
253 |             f_train.write(cur_json_instance + '\n')
254 | 
255 | 
256 |     build_norm_feature()
257 | 
258 | 
259 | if __name__ == "__main__":
260 |     preprocess()


--------------------------------------------------------------------------------
/preprocess.py:
--------------------------------------------------------------------------------
  1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | import os, sys, time, random, csv, datetime, json
 16 | import pandas as pd
 17 | import numpy as np
 18 | import argparse
 19 | import logging
 20 | import time
 21 | 
 22 | logging.basicConfig(
 23 |     format='%(asctime)s - %(levelname)s - %(message)s')
 24 | logger = logging.getLogger("preprocess")
 25 | logger.setLevel(logging.INFO)
 26 | 
 27 | TRAIN_QUERIES_PATH = "./data_set_phase1/train_queries.csv"
 28 | TRAIN_PLANS_PATH = "./data_set_phase1/train_plans.csv"
 29 | TRAIN_CLICK_PATH = "./data_set_phase1/train_clicks.csv"
 30 | PROFILES_PATH = "./data_set_phase1/profiles.csv"
 31 | OUT_NORM_TRAIN_PATH = "./out/normed_train.txt"
 32 | OUT_RAW_TRAIN_PATH = "./out/train.txt"
 33 | 
 34 | OUT_DIR = "./out"
 35 | 
 36 | 
 37 | O1_MIN = 115.47
 38 | O1_MAX = 117.29
 39 | 
 40 | O2_MIN = 39.46
 41 | O2_MAX = 40.97
 42 | 
 43 | D1_MIN = 115.44
 44 | D1_MAX = 117.37
 45 | 
 46 | D2_MIN = 39.46
 47 | D2_MAX = 40.96
 48 | SCALE_OD = 0.02
 49 | 
 50 | DISTANCE_MIN = 1.0
 51 | DISTANCE_MAX = 225864.0
 52 | THRESHOLD_DIS = 40000.0
 53 | SCALE_DIS = 500
 54 | 
 55 | PRICE_MIN = 200.0
 56 | PRICE_MAX = 92300.0
 57 | THRESHOLD_PRICE = 20000
 58 | SCALE_PRICE = 100
 59 | 
 60 | ETA_MIN = 1.0
 61 | ETA_MAX = 72992.0
 62 | THRESHOLD_ETA = 10800.0
 63 | SCALE_ETA = 120
 64 | 
 65 | 
 66 | def build_norm_feature():
 67 |     with open(OUT_NORM_TRAIN_PATH, 'w') as nf:
 68 |         with open(OUT_RAW_TRAIN_PATH, 'r') as f:
 69 |             for line in f:
 70 |                 cur_map = json.loads(line)
 71 | 
 72 |                 if cur_map["plan"]["distance"] > THRESHOLD_DIS:
 73 |                     cur_map["plan"]["distance"] = int(THRESHOLD_DIS)
 74 |                 elif cur_map["plan"]["distance"] > 0:
 75 |                     cur_map["plan"]["distance"] = int(cur_map["plan"]["distance"] / SCALE_DIS)
 76 | 
 77 |                 if cur_map["plan"]["price"] and cur_map["plan"]["price"] > THRESHOLD_PRICE:
 78 |                     cur_map["plan"]["price"] = int(THRESHOLD_PRICE)
 79 |                 elif not cur_map["plan"]["price"] or cur_map["plan"]["price"] < 0:
 80 |                     cur_map["plan"]["price"] = 0
 81 |                 else:
 82 |                     cur_map["plan"]["price"] = int(cur_map["plan"]["price"] / SCALE_PRICE)
 83 | 
 84 |                 if cur_map["plan"]["eta"] > THRESHOLD_ETA:
 85 |                     cur_map["plan"]["eta"] = int(THRESHOLD_ETA)
 86 |                 elif cur_map["plan"]["eta"] > 0:
 87 |                     cur_map["plan"]["eta"] = int(cur_map["plan"]["eta"] / SCALE_ETA)
 88 | 
 89 |                 # o1
 90 |                 if cur_map["query"]["o1"] > O1_MAX:
 91 |                     cur_map["query"]["o1"] = int((O1_MAX - O1_MIN) / SCALE_OD + 1)
 92 |                 elif cur_map["query"]["o1"] < O1_MIN:
 93 |                     cur_map["query"]["o1"] = 0
 94 |                 else:
 95 |                     cur_map["query"]["o1"] = int((cur_map["query"]["o1"] - O1_MIN) / 0.02)
 96 | 
 97 |                 # o2
 98 |                 if cur_map["query"]["o2"] > O2_MAX:
 99 |                     cur_map["query"]["o2"] = int((O2_MAX - O2_MIN) / SCALE_OD + 1)
100 |                 elif cur_map["query"]["o2"] < O2_MIN:
101 |                     cur_map["query"]["o2"] = 0
102 |                 else:
103 |                     cur_map["query"]["o2"] = int((cur_map["query"]["o2"] - O2_MIN) / 0.02)
104 | 
105 |                 # d1
106 |                 if cur_map["query"]["d1"] > D1_MAX:
107 |                     cur_map["query"]["d1"] = int((D1_MAX - D1_MIN) / SCALE_OD + 1)
108 |                 elif cur_map["query"]["d1"] < D1_MIN:
109 |                     cur_map["query"]["d1"] = 0
110 |                 else:
111 |                     cur_map["query"]["d1"] = int((cur_map["query"]["d1"] - D1_MIN) / SCALE_OD)
112 | 
113 |                 # d2
114 |                 if cur_map["query"]["d2"] > D2_MAX:
115 |                     cur_map["query"]["d2"] = int((D2_MAX - D2_MIN) / SCALE_OD + 1)
116 |                 elif cur_map["query"]["d2"] < D2_MIN:
117 |                     cur_map["query"]["d2"] = 0
118 |                 else:
119 |                     cur_map["query"]["d2"] = int((cur_map["query"]["d2"] - D2_MIN) / SCALE_OD)
120 | 
121 |                 cur_json_instance = json.dumps(cur_map)
122 |                 nf.write(cur_json_instance + '\n')
123 | 
124 | 
125 | def preprocess():
126 |     """
127 |     Construct the train data indexed by session id and mode id jointly. Convert all the raw features (user profile,
128 |     od pair, req time, click time, eta, price, distance, transport mode) to one-hot ids used for
129 |     embedding. We split the one-hot features into two categories: user feature and context feature for
130 |     better understanding of FM algorithm.
131 |     Note that the user profile is already provided by one-hot encoded form, we treat it as embedded vector
132 |     for unity with the context feature and easily using of PaddlePaddle embedding layer. Given the
133 |     train clicks data, we label each train instance with 1 or 0 depend on if this instance is clicked or
134 |     not include non-click case.
135 |     :return:
136 |     """
137 | 
138 |     train_data_dict = {}
139 |     with open(TRAIN_QUERIES_PATH, 'r') as f:
140 |         csv_reader = csv.reader(f, delimiter=',')
141 |         train_index_list = []
142 |         for k, line in enumerate(csv_reader):
143 |             if k == 0: continue
144 |             if line[0] == "": continue
145 |             if line[1] == "":
146 |                 train_index_list.append(line[0] + "_0")
147 |             else:
148 |                 train_index_list.append(line[0] + "_" + line[1])
149 | 
150 |             train_index = line[0]
151 |             train_data_dict[train_index] = {}
152 |             train_data_dict[train_index]["pid"] = line[1]
153 |             train_data_dict[train_index]["query"] = {}
154 | 
155 |             reqweekday = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%w")
156 |             reqhour = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%H")
157 | 
158 |             train_data_dict[train_index]["query"].update({"weekday":reqweekday})
159 |             train_data_dict[train_index]["query"].update({"hour":reqhour})
160 | 
161 |             o = line[3].split(',')
162 |             o_first = o[0]
163 |             o_second = o[1]
164 |             train_data_dict[train_index]["query"].update({"o1":float(o_first)})
165 |             train_data_dict[train_index]["query"].update({"o2":float(o_second)})
166 | 
167 |             d = line[4].split(',')
168 |             d_first = d[0]
169 |             d_second = d[1]
170 |             train_data_dict[train_index]["query"].update({"d1":float(d_first)})
171 |             train_data_dict[train_index]["query"].update({"d2":float(d_second)})
172 | 
173 |     plan_map = {}
174 |     plan_data = pd.read_csv(TRAIN_PLANS_PATH)
175 |     for index, row in plan_data.iterrows():
176 |         plans_str = row['plans']
177 |         plans_list = json.loads(plans_str)
178 |         session_id = str(row['sid'])
179 |         # train_data_dict[session_id]["plans"] = []
180 |         plan_map[session_id] = plans_list
181 | 
182 |     profile_map = {}
183 |     with open(PROFILES_PATH, 'r') as f:
184 |         csv_reader = csv.reader(f, delimiter=',')
185 |         for k, line in enumerate(csv_reader):
186 |             if k == 0: continue
187 |             profile_map[line[0]] = [i for i in range(len(line)) if line[i] == "1.0"]
188 | 
189 |     session_click_map = {}
190 |     with open(TRAIN_CLICK_PATH, 'r') as f:
191 |         csv_reader = csv.reader(f, delimiter=',')
192 |         for k, line in enumerate(csv_reader):
193 |             if k == 0: continue
194 |             if line[0] == "" or line[1] == "" or line[2] == "":
195 |                 continue
196 |             session_click_map[line[0]] = line[2]
197 |     #return train_data_dict, profile_map, session_click_map, plan_map
198 |     generate_sparse_features(train_data_dict, profile_map, session_click_map, plan_map)
199 | 
200 | 
201 | def generate_sparse_features(train_data_dict, profile_map, session_click_map, plan_map):
202 |     if not os.path.isdir(OUT_DIR):
203 |         os.mkdir(OUT_DIR)
204 |     with open(os.path.join("./out/", "train.txt"), 'w') as f_train:
205 |         for session_id, plan_list in plan_map.items():
206 |             if session_id not in train_data_dict:
207 |                 continue
208 |             cur_map = train_data_dict[session_id]
209 |             if cur_map["pid"] != "":
210 |                 cur_map["profile"] = profile_map[cur_map["pid"]]
211 |             else:
212 |                 cur_map["profile"] = [0]
213 |             del cur_map["pid"]
214 |             whole_rank = 0
215 |             for plan in plan_list:
216 |                 whole_rank += 1
217 |             cur_map["whole_rank"] = whole_rank
218 |             flag_click = False
219 |             rank = 1
220 | 
221 | 
222 |             for plan in plan_list:
223 | 
224 |                 if ("transport_mode" in plan) and (session_id in session_click_map) and (
225 |                         int(plan["transport_mode"]) == int(session_click_map[session_id])):
226 |                     cur_map["plan"] = plan
227 |                     cur_map["label"] = 1
228 |                     flag_click = True
229 |                     # print("label is 1")
230 |                 else:
231 |                     cur_map["plan"] = plan
232 |                     cur_map["label"] = 0
233 | 
234 |                 cur_map["rank"] = rank
235 |                 rank += 1
236 |                 cur_json_instance = json.dumps(cur_map)
237 |                 f_train.write(cur_json_instance + '\n')
238 |             if not flag_click:
239 |                 cur_map["plan"]["distance"] = -1
240 |                 cur_map["plan"]["price"] = -1
241 |                 cur_map["plan"]["eta"] = -1
242 |                 cur_map["plan"]["transport_mode"] = 0
243 |                 cur_map["rank"] = 0
244 |                 cur_map["label"] = 1
245 |                 cur_json_instance = json.dumps(cur_map)
246 |                 f_train.write(cur_json_instance + '\n')
247 |             else:
248 |                 cur_map["plan"]["distance"] = -1
249 |                 cur_map["plan"]["price"] = -1
250 |                 cur_map["plan"]["eta"] = -1
251 |                 cur_map["plan"]["transport_mode"] = 0
252 |                 cur_map["rank"] = 0
253 |                 cur_map["label"] = 0
254 |                 cur_json_instance = json.dumps(cur_map)
255 |                 f_train.write(cur_json_instance + '\n')
256 | 
257 | 
258 |     build_norm_feature()
259 | 
260 | 
261 | if __name__ == "__main__":
262 |     preprocess()
263 | 


--------------------------------------------------------------------------------
/preprocess_dense.py:
--------------------------------------------------------------------------------
  1 | # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | import os, random, csv, datetime, json
 16 | import pandas as pd
 17 | import numpy as np
 18 | import argparse
 19 | import logging
 20 | import time
 21 | 
 22 | logging.basicConfig(
 23 |     format='%(asctime)s - %(levelname)s - %(message)s')
 24 | logger = logging.getLogger("preprocess")
 25 | logger.setLevel(logging.INFO)
 26 | 
 27 | TRAIN_QUERIES_PATH = "./data_set_phase1/train_queries.csv"
 28 | TRAIN_PLANS_PATH = "./data_set_phase1/train_plans.csv"
 29 | TRAIN_CLICK_PATH = "./data_set_phase1/train_clicks.csv"
 30 | PROFILES_PATH = "./data_set_phase1/profiles.csv"
 31 | 
 32 | OUT_DIR = "./out"
 33 | ORI_TRAIN_PATH = "train.txt"
 34 | NORM_TRAIN_PATH = "normed_train.txt"
 35 | #variable to control the ratio of positive and negative instance of transmode 0 which is original label of no click
 36 | THRESHOLD_LABEL = 0.5
 37 | 
 38 | 
 39 | 
 40 | O1_MIN = 115.47
 41 | O1_MAX = 117.29
 42 | 
 43 | O2_MIN = 39.46
 44 | O2_MAX = 40.97
 45 | 
 46 | D1_MIN = 115.44
 47 | D1_MAX = 117.37
 48 | 
 49 | D2_MIN = 39.46
 50 | D2_MAX = 40.96
 51 | 
 52 | DISTANCE_MIN = 1.0
 53 | DISTANCE_MAX = 225864.0
 54 | THRESHOLD_DIS = 200000.0
 55 | 
 56 | PRICE_MIN = 200.0
 57 | PRICE_MAX = 92300.0
 58 | THRESHOLD_PRICE = 20000
 59 | 
 60 | ETA_MIN = 1.0
 61 | ETA_MAX = 72992.0
 62 | THRESHOLD_ETA = 10800.0
 63 | 
 64 | 
 65 | def build_norm_feature():
 66 |     with open(os.path.join(OUT_DIR, NORM_TRAIN_PATH), 'w') as nf:
 67 |         with open(os.path.join(OUT_DIR, ORI_TRAIN_PATH), 'r') as f:
 68 |             for line in f:
 69 |                 cur_map = json.loads(line)
 70 | 
 71 |                 cur_map["plan"]["distance"] = (cur_map["plan"]["distance"] - DISTANCE_MIN) / (DISTANCE_MAX - DISTANCE_MIN)
 72 | 
 73 |                 if cur_map["plan"]["price"]:
 74 |                     cur_map["plan"]["price"] = (cur_map["plan"]["price"] - PRICE_MIN) / (PRICE_MAX - PRICE_MIN)
 75 |                 else:
 76 |                     cur_map["plan"]["price"] = 0.0
 77 | 
 78 |                 cur_map["plan"]["eta"] = (cur_map["plan"]["eta"] - ETA_MIN) / (ETA_MAX - ETA_MIN)
 79 | 
 80 |                 cur_json_instance = json.dumps(cur_map)
 81 |                 nf.write(cur_json_instance + '\n')
 82 | 
 83 | 
 84 | def preprocess():
 85 |     """
 86 |     Construct the train data indexed by session id and mode id jointly. Convert all the raw features (user profile,
 87 |     od pair, req time, click time, eta, price, distance, transport mode) to one-hot ids used for
 88 |     embedding. We split the one-hot features into two categories: user feature and context feature for
 89 |     better understanding of FM algorithm.
 90 |     Note that the user profile is already provided by one-hot encoded form, we treat it as embedded vector
 91 |     for unity with the context feature and easily using of PaddlePaddle embedding layer. Given the
 92 |     train clicks data, we label each train instance with 1 or 0 depend on if this instance is clicked or
 93 |     not include non-click case. To Be Changed
 94 |     :return:
 95 |     """
 96 | 
 97 |     train_data_dict = {}
 98 | 
 99 |     with open("./weather.json", 'r') as f:
100 |         weather_dict = json.load(f)
101 | 
102 |     with open(TRAIN_QUERIES_PATH, 'r') as f:
103 |         csv_reader = csv.reader(f, delimiter=',')
104 |         train_index_list = []
105 |         for k, line in enumerate(csv_reader):
106 |             if k == 0: continue
107 |             if line[0] == "": continue
108 |             if line[1] == "":
109 |                 train_index_list.append(line[0] + "_0")
110 |             else:
111 |                 train_index_list.append(line[0] + "_" + line[1])
112 | 
113 |             train_index = line[0]
114 |             train_data_dict[train_index] = {}
115 |             train_data_dict[train_index]["pid"] = line[1]
116 |             train_data_dict[train_index]["query"] = {}
117 |             train_data_dict[train_index]["weather"] = {}
118 | 
119 |             reqweekday = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%w")
120 |             reqhour = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%H")
121 | 
122 |             # weather related features, no big use, maybe more detailed weather information is better
123 |             date_key = datetime.datetime.strptime(line[2], '%Y-%m-%d %H:%M:%S').strftime("%m-%d")
124 |             train_data_dict[train_index]["weather"] = {}
125 |             train_data_dict[train_index]["weather"].update({"max_temp": weather_dict[date_key]["max_temp"]})
126 |             train_data_dict[train_index]["weather"].update({"min_temp": weather_dict[date_key]["min_temp"]})
127 |             train_data_dict[train_index]["weather"].update({"wea": weather_dict[date_key]["weather"]})
128 |             train_data_dict[train_index]["weather"].update({"wind": weather_dict[date_key]["wind"]})
129 | 
130 |             train_data_dict[train_index]["query"].update({"weekday":reqweekday})
131 |             train_data_dict[train_index]["query"].update({"hour":reqhour})
132 | 
133 |             o = line[3].split(',')
134 |             o_first = o[0]
135 |             o_second = o[1]
136 |             train_data_dict[train_index]["query"].update({"o1":float(o_first)})
137 |             train_data_dict[train_index]["query"].update({"o2":float(o_second)})
138 | 
139 |             d = line[4].split(',')
140 |             d_first = d[0]
141 |             d_second = d[1]
142 |             train_data_dict[train_index]["query"].update({"d1":float(d_first)})
143 |             train_data_dict[train_index]["query"].update({"d2":float(d_second)})
144 | 
145 |     plan_map = {}
146 |     plan_data = pd.read_csv(TRAIN_PLANS_PATH)
147 |     for index, row in plan_data.iterrows():
148 |         plans_str = row['plans']
149 |         plans_list = json.loads(plans_str)
150 |         session_id = str(row['sid'])
151 |         # train_data_dict[session_id]["plans"] = []
152 |         plan_map[session_id] = plans_list
153 | 
154 |     profile_map = {}
155 |     with open(PROFILES_PATH, 'r') as f:
156 |         csv_reader = csv.reader(f, delimiter=',')
157 |         for k, line in enumerate(csv_reader):
158 |             if k == 0: continue
159 |             profile_map[line[0]] = [i for i in range(len(line)) if line[i] == "1.0"]
160 | 
161 |     session_click_map = {}
162 |     with open(TRAIN_CLICK_PATH, 'r') as f:
163 |         csv_reader = csv.reader(f, delimiter=',')
164 |         for k, line in enumerate(csv_reader):
165 |             if k == 0: continue
166 |             if line[0] == "" or line[1] == "" or line[2] == "":
167 |                 continue
168 |             session_click_map[line[0]] = line[2]
169 |     #return train_data_dict, profile_map, session_click_map, plan_map
170 |     generate_sparse_features(train_data_dict, profile_map, session_click_map, plan_map)
171 | 
172 | 
173 | def generate_sparse_features(train_data_dict, profile_map, session_click_map, plan_map):
174 |     if not os.path.isdir(OUT_DIR):
175 |         os.mkdir(OUT_DIR)
176 |     with open(os.path.join(OUT_DIR, ORI_TRAIN_PATH), 'w') as f_train:
177 |         for session_id, plan_list in plan_map.items():
178 |             if session_id not in train_data_dict:
179 |                 continue
180 |             cur_map = train_data_dict[session_id]
181 |             if cur_map["pid"] != "":
182 |                 cur_map["profile"] = profile_map[cur_map["pid"]]
183 |             else:
184 |                 cur_map["profile"] = [0]
185 |             
186 |             #rank information related feature 
187 |             whole_rank = 0
188 |             for plan in plan_list:
189 |                 whole_rank += 1
190 |                 cur_map["mode_rank" + str(whole_rank)] = plan["transport_mode"]
191 | 
192 |             if whole_rank < 5:
193 |                 for r in range(whole_rank + 1, 6):
194 |                     cur_map["mode_rank" + str(r)] = -1
195 | 
196 |             cur_map["whole_rank"] = whole_rank
197 |             flag_click = False
198 |             rank = 1
199 | 
200 |             price_list = []
201 |             eta_list = []
202 |             distance_list = []
203 |             for plan in plan_list:
204 |                 if not plan["price"]:
205 |                     price_list.append(0)
206 |                 else:
207 |                     price_list.append(int(plan["price"]))
208 |                 eta_list.append(int(plan["eta"]))
209 |                 distance_list.append(int(plan["distance"]))
210 |             price_list.sort(reverse=False)
211 |             eta_list.sort(reverse=False)
212 |             distance_list.sort(reverse=False)
213 | 
214 |             for plan in plan_list:
215 |                 if plan["price"] and int(plan["price"]) == price_list[0]:
216 |                     cur_map["mode_min_price"] = plan["transport_mode"]
217 |                 if plan["price"] and int(plan["price"]) == price_list[-1]:
218 |                     cur_map["mode_max_price"] = plan["transport_mode"]
219 |                 if int(plan["eta"]) == eta_list[0]:
220 |                     cur_map["mode_min_eta"] = plan["transport_mode"]
221 |                 if int(plan["eta"]) == eta_list[-1]:
222 |                     cur_map["mode_max_eta"] = plan["transport_mode"]
223 |                 if int(plan["distance"]) == distance_list[0]:
224 |                     cur_map["mode_min_distance"] = plan["transport_mode"]
225 |                 if int(plan["distance"]) == distance_list[-1]:
226 |                     cur_map["mode_max_distance"] = plan["transport_mode"]
227 |             if "mode_min_price" not in cur_map:
228 |                 cur_map["mode_min_price"] = -1
229 |             if "mode_max_price" not in cur_map:
230 |                 cur_map["mode_max_price"] = -1
231 | 
232 |             for plan in plan_list:
233 |                 if ("transport_mode" in plan) and (session_id in session_click_map) and (
234 |                         int(plan["transport_mode"]) == int(session_click_map[session_id])):
235 |                     flag_click = True
236 |             if flag_click:
237 | 
238 |                 for plan in plan_list:
239 |                     cur_price = int(plan["price"]) if plan["price"] else 0
240 |                     cur_eta = int(plan["eta"])
241 |                     cur_distance = int(plan["distance"])
242 |                     cur_map["price_rank"] = price_list.index(cur_price) + 1
243 |                     cur_map["eta_rank"] = eta_list.index(cur_eta) + 1
244 |                     cur_map["distance_rank"] = distance_list.index(cur_distance) + 1
245 | 
246 |                     if ("transport_mode" in plan) and (session_id in session_click_map) and (
247 |                             int(plan["transport_mode"]) == int(session_click_map[session_id])):
248 |                         cur_map["plan"] = plan
249 |                         cur_map["label"] = 1
250 |                     else:
251 |                         cur_map["plan"] = plan
252 |                         cur_map["label"] = 0
253 | 
254 |                     cur_map["plan_rank"] = rank
255 |                     rank += 1
256 |                     cur_json_instance = json.dumps(cur_map)
257 |                     f_train.write(cur_json_instance + '\n')
258 |             
259 |             cur_map["plan"] = {}
260 |             #since we define a new ctr task from original task, we use a basic way to generate instances of transport mode 0.
261 |             #There should be a optimal strategy to generate instances of transport mode 0
262 |             if not flag_click:
263 |                 cur_map["plan"]["distance"] = -1
264 |                 cur_map["plan"]["price"] = -1
265 |                 cur_map["plan"]["eta"] = -1
266 |                 cur_map["plan"]["transport_mode"] = 0
267 |                 cur_map["plan_rank"] = 0
268 |                 cur_map["price_rank"] = 0
269 |                 cur_map["eta_rank"] = 0
270 |                 cur_map["distance_rank"] = 0
271 |                 cur_map["label"] = 1
272 |                 cur_json_instance = json.dumps(cur_map)
273 |                 f_train.write(cur_json_instance + '\n')
274 |             else:
275 |                 if random.random() < THRESHOLD_LABEL:
276 |                     cur_map["plan"]["distance"] = -1
277 |                     cur_map["plan"]["price"] = -1
278 |                     cur_map["plan"]["eta"] = -1
279 |                     cur_map["plan"]["transport_mode"] = 0
280 |                     cur_map["plan_rank"] = 0
281 |                     cur_map["price_rank"] = 0
282 |                     cur_map["eta_rank"] = 0
283 |                     cur_map["distance_rank"] = 0
284 |                     cur_map["label"] = 0
285 |                     cur_json_instance = json.dumps(cur_map)
286 |                     f_train.write(cur_json_instance + '\n')
287 | 
288 | 
289 | 
290 |     build_norm_feature()
291 | 
292 | 
293 | if __name__ == "__main__":
294 |     preprocess()
295 | 


--------------------------------------------------------------------------------
/submit/readme:
--------------------------------------------------------------------------------
1 | this is the folder for submit file
2 | 


--------------------------------------------------------------------------------
/testres/readme:
--------------------------------------------------------------------------------
1 | This folder for infered res
2 | 


--------------------------------------------------------------------------------
/weather.json:
--------------------------------------------------------------------------------
1 | {"10-01": {"max_temp": "24", "min_temp": "12", "weather": "q", "wind": "45"}, "10-02": {"max_temp": "24", "min_temp": "11", "weather": "q", "wind": "12"}, "10-03": {"max_temp": "25", "min_temp": "10", "weather": "q", "wind": "12"}, "10-04": {"max_temp": "25", "min_temp": "12", "weather": "q", "wind": "12"}, "10-05": {"max_temp": "24", "min_temp": "14", "weather": "dy", "wind": "12"}, "10-06": {"max_temp": "20", "min_temp": "8", "weather": "q", "wind": "45"}, "10-07": {"max_temp": "21", "min_temp": "7", "weather": "q", "wind": "12"}, "10-08": {"max_temp": "21", "min_temp": "8", "weather": "dy", "wind": "12"}, "10-09": {"max_temp": "15", "min_temp": "4", "weather": "dyq", "wind": "45"}, "10-10": {"max_temp": "17", "min_temp": "4", "weather": "dyq", "wind": "12"}, "10-11": {"max_temp": "18", "min_temp": "5", "weather": "qdy", "wind": "12"}, "10-12": {"max_temp": "20", "min_temp": "5", "weather": "dyq", "wind": "12"}, "10-13": {"max_temp": "20", "min_temp": "8", "weather": "dy", "wind": "12"}, "10-14": {"max_temp": "21", "min_temp": "10", "weather": "dy", "wind": "12"}, "10-15": {"max_temp": "17", "min_temp": "11", "weather": "xq", "wind": "12"}, "10-16": {"max_temp": "17", "min_temp": "7", "weather": "dyq", "wind": "12"}, "10-17": {"max_temp": "17", "min_temp": "5", "weather": "q", "wind": "12"}, "10-18": {"max_temp": "18", "min_temp": "5", "weather": "q", "wind": "12"}, "10-19": {"max_temp": "19", "min_temp": "7", "weather": "dy", "wind": "12"}, "10-20": {"max_temp": "18", "min_temp": "7", "weather": "dy", "wind": "12"}, "10-21": {"max_temp": "18", "min_temp": "7", "weather": "dy", "wind": "12"}, "10-22": {"max_temp": "19", "min_temp": "5", "weather": "dyq", "wind": "12"}, "10-23": {"max_temp": "19", "min_temp": "4", "weather": "q", "wind": "34"}, "10-24": {"max_temp": "20", "min_temp": "6", "weather": "qdy", "wind": "12"}, "10-25": {"max_temp": "15", "min_temp": "8", "weather": "dy", "wind": "12"}, "10-26": {"max_temp": "14", "min_temp": "3", "weather": "q", "wind": "45"}, "10-27": {"max_temp": "17", "min_temp": "5", "weather": "dy", "wind": "12"}, "10-28": {"max_temp": "17", "min_temp": "4", "weather": "dyq", "wind": "45"}, "10-29": {"max_temp": "15", "min_temp": "3", "weather": "q", "wind": "34"}, "10-30": {"max_temp": "16", "min_temp": "1", "weather": "q", "wind": "12"}, "10-31": {"max_temp": "17", "min_temp": "3", "weather": "q", "wind": "12"}, "11-01": {"max_temp": "17", "min_temp": "3", "weather": "q", "wind": "12"}, "11-02": {"max_temp": "18", "min_temp": "4", "weather": "q", "wind": "12"}, "11-03": {"max_temp": "16", "min_temp": "6", "weather": "dy", "wind": "12"}, "11-04": {"max_temp": "10", "min_temp": "2", "weather": "xydy", "wind": "34"}, "11-05": {"max_temp": "10", "min_temp": "2", "weather": "dy", "wind": "12"}, "11-06": {"max_temp": "12", "min_temp": "0", "weather": "dy", "wind": "12"}, "11-07": {"max_temp": "13", "min_temp": "3", "weather": "dy", "wind": "12"}, "11-08": {"max_temp": "14", "min_temp": "2", "weather": "dy", "wind": "12"}, "11-09": {"max_temp": "15", "min_temp": "1", "weather": "qdy", "wind": "34"}, "11-10": {"max_temp": "11", "min_temp": "0", "weather": "dy", "wind": "12"}, "11-11": {"max_temp": "13", "min_temp": "1", "weather": "dyq", "wind": "12"}, "11-12": {"max_temp": "14", "min_temp": "2", "weather": "q", "wind": "12"}, "11-13": {"max_temp": "13", "min_temp": "5", "weather": "dy", "wind": "12"}, "11-14": {"max_temp": "13", "min_temp": "5", "weather": "dy", "wind": "12"}, "11-15": {"max_temp": "8", "min_temp": "1", "weather": "xydy", "wind": "34"}, "11-16": {"max_temp": "8", "min_temp": "-1", "weather": "q", "wind": "12"}, "11-17": {"max_temp": "9", "min_temp": "-2", "weather": "dyq", "wind": "12"}, "11-18": {"max_temp": "11", "min_temp": "-3", "weather": "q", "wind": "34"}, "11-19": {"max_temp": "10", "min_temp": "-2", "weather": "qdy", "wind": "12"}, "11-20": {"max_temp": "9", "min_temp": "-1", "weather": "dy", "wind": "12"}, "11-21": {"max_temp": "9", "min_temp": "-3", "weather": "q", "wind": "2"}, "11-22": {"max_temp": "8", "min_temp": "-3", "weather": "qdy", "wind": "1"}, "11-23": {"max_temp": "7", "min_temp": "0", "weather": "dy", "wind": "2"}, "11-24": {"max_temp": "9", "min_temp": "-3", "weather": "qdy", "wind": "2"}, "11-25": {"max_temp": "10", "min_temp": "-3", "weather": "q", "wind": "1"}, "11-26": {"max_temp": "10", "min_temp": "0", "weather": "dy", "wind": "1"}, "11-27": {"max_temp": "9", "min_temp": "-3", "weather": "qdy", "wind": "2"}, "11-28": {"max_temp": "8", "min_temp": "-3", "weather": "q", "wind": "1"}, "11-29": {"max_temp": "7", "min_temp": "-4", "weather": "q", "wind": "1"}, "11-30": {"max_temp": "8", "min_temp": "-3", "weather": "q", "wind": "1"}, "12-01": {"max_temp": "7", "min_temp": "0", "weather": "dy", "wind": "1"}, "12-02": {"max_temp": "9", "min_temp": "2", "weather": "dy", "wind": "1"}, "12-03": {"max_temp": "8", "min_temp": "-3", "weather": "dyq", "wind": "3"}, "12-04": {"max_temp": "4", "min_temp": "-6", "weather": "qdy", "wind": "2"}, "12-05": {"max_temp": "1", "min_temp": "-4", "weather": "dy", "wind": "1"}, "12-06": {"max_temp": "-2", "min_temp": "-9", "weather": "q", "wind": "3"}, "12-07": {"max_temp": "-4", "min_temp": "-10", "weather": "q", "wind": "3"}, "12-08": {"max_temp": "-2", "min_temp": "-10", "weather": "qdy", "wind": "2"}, "12-09": {"max_temp": "-1", "min_temp": "-10", "weather": "dyq", "wind": "1"}}


--------------------------------------------------------------------------------