├── .gitignore
├── README.md
├── config
│   ├── avazu
│   │   ├── config_sparse.json
│   │   └── config_template.json
│   ├── criteo
│   │   ├── config_dense.json
│   │   └── config_template.json
│   └── kdd12
│       ├── config_dense.json
│       └── config_template.json
├── data
│   └── criteo
│       └── k_fold
│           ├── part0
│           │   └── train.txt
│           ├── part1
│           │   └── train.txt
│           ├── part2
│           │   └── train.txt
│           ├── part3
│           │   └── train.txt
│           ├── part4
│           │   └── train.txt
│           ├── part5
│           │   └── train.txt
│           ├── part6
│           │   └── train.txt
│           ├── part7
│           │   └── train.txt
│           ├── part8
│           │   └── train.txt
│           └── part9
│               └── train.txt
├── rec_alg
│   ├── __init__.py
│   ├── common
│   │   ├── __init__.py
│   │   ├── batch_generator.py
│   │   ├── constants.py
│   │   ├── data_loader.py
│   │   ├── tf_utils.py
│   │   └── utils.py
│   ├── components
│   │   ├── __init__.py
│   │   ├── inputs.py
│   │   ├── layers.py
│   │   ├── multi_hash_codebook_kif_layer.py
│   │   └── multi_hash_codebook_layer.py
│   ├── model
│   │   ├── __init__.py
│   │   ├── base_model.py
│   │   ├── fibinet
│   │   │   ├── __init__.py
│   │   │   ├── fibinet_model.py
│   │   │   └── run_fibinet.py
│   │   └── memonet
│   │       ├── __init__.py
│   │       ├── memonet_model.py
│   │       └── run_memonet.py
│   └── preprocessing
│       ├── __init__.py
│       ├── avazu
│       │   ├── __init__.py
│       │   ├── avazu_process.py
│       │   └── raw_data_process.py
│       ├── base_process.py
│       ├── criteo
│       │   ├── __init__.py
│       │   └── criteo_process.py
│       ├── dense_process.py
│       ├── kdd12
│       │   ├── __init__.py
│       │   ├── kdd12_process.py
│       │   └── raw_data_process.py
│       ├── kfold_process.py
│       └── sparse_process.py
├── script
│   ├── run_avazu_process.sh
│   ├── run_criteo_process.sh
│   ├── run_fibinet_model.sh
│   ├── run_kdd12_process.sh
│   └── run_memonet_model.sh
└── setup.py
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
*.txt
*.xml
*.iml
*.pyc
*.log
*.swp
logs/
tmp/
data/
.vscode/
.DS_Store
*.egg-info/
nohup.out
.coverage
.idea/
build/
venv/
*.model
*.json
test/

config_concat.json
config_dense.json
config_dense_normal.json
config_sparse.json
config_varlen.json
config_dense_discrete.json
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# End-to-end Recommendation Algorithm Package

The currently supported algorithms:

1. MemoNet: [MemoNet: Memorizing All Cross Features' Representations Efficiently via Multi-Hash Codebook Network for CTR Prediction](https://arxiv.org/abs/2211.01334)
2. FiBiNet++: [FiBiNet++: Reducing Model Size by Low Rank Feature Interaction Layer for CTR Prediction](https://arxiv.org/abs/2209.05016)
3. FiBiNet: [FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction](https://arxiv.org/abs/1905.09433)

The following algorithms are planned to be supported:

1. GateNet: [GateNet: Gating-Enhanced Deep Network for Click-Through Rate Prediction](https://arxiv.org/abs/2007.03519)
2. ContextNet: [ContextNet: A Click-Through Rate Prediction Framework Using Contextual information to Refine Feature Embedding](https://arxiv.org/abs/2107.12025)
3. MaskNet: [MaskNet: Introducing Feature-Wise Multiplication to CTR Ranking Models by Instance-Guided Mask](https://arxiv.org/abs/2102.07619)

## Prerequisites

- Python >= 3.6.8
- TensorFlow-GPU == 1.14

## Getting Started

### Installation

- Install TensorFlow-GPU 1.14

- Clone this repo

### Dataset

- Links to the datasets:

  - https://www.kaggle.com/c/criteo-display-ad-challenge
  - https://www.kaggle.com/c/avazu-ctr-prediction
  - https://www.kaggle.com/c/kddcup2012-track2

- You can download the original datasets and preprocess them yourself. Run `python -u -m rec_alg.preprocessing.{dataset_name}.{dataset_name}_process` to preprocess a dataset, where `dataset_name` can be `criteo`, `avazu`, or `kdd12`.

- This repo also contains a Criteo demo dataset with 100,000 samples that has already been preprocessed. It is used to demonstrate the models here.

### Training

#### MemoNet

You can use `python -u -m rec_alg.model.memonet.run_memonet` to train the model on a dataset. Parameters can be found in the code.

#### FiBiNet & FiBiNet++

You can use `python -u -m rec_alg.model.fibinet.run_fibinet --version {version} --config {config_path}` to train a specific model on a dataset.

Some important parameters are listed below; the other hyper-parameters can be found in the code.

- version: model version; supports `v1`, `++`, and `custom`, and defaults to `++`. With `custom`, you can adjust all parameter values flexibly.
- config: specifies the paths of the input/output files and the fields of the dataset. It is generated during dataset preprocessing. Supported values: `./config/criteo/config_dense.json`, `./config/avazu/config_sparse.json`.
- mode: running mode; supports `train`, `retrain`, and `test`.

## Acknowledgement

Part of the code comes from [DeepCTR](https://github.com/shenweichen/DeepCTR).
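
The dataset config passed via `--config` can also be inspected programmatically with the repo's `DataLoader`. A minimal sketch (the path assumes the preprocessed Criteo config; adjust to your dataset):

```python
from rec_alg.common.data_loader import DataLoader

# Load the dataset config produced by preprocessing (path is illustrative).
config = DataLoader.load_config_dict("./config/criteo/config_dense.json")

print(config["base_info"]["k_fold"])   # number of folds, e.g. 10
print(config["model"]["data_prefix"])  # where the k-fold parts live
for feature in config["features"]:
    # Each entry describes a label, sparse, or dense field.
    print(feature["name"], feature["type"])
```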
57 | 58 | -------------------------------------------------------------------------------- /config/avazu/config_sparse.json: -------------------------------------------------------------------------------- 1 | { 2 | "base_info": { 3 | "train_path": "./data/avazu/sparse/", 4 | "concat_path": "./data/avazu/concat/", 5 | "sparse_path": "./data/avazu/sparse/", 6 | "dense_path": "./data/avazu/dense/", 7 | "k_fold_path": "./data/avazu/k_fold/", 8 | "k_fold": 10, 9 | "random_seed": 2018 10 | }, 11 | "model": { 12 | "data_prefix": "./data/avazu/k_fold/", 13 | "results_prefix": "./data/model/", 14 | "train_results_file": "train_results.txt", 15 | "test_results_file": "test_results.txt" 16 | }, 17 | "features": [ 18 | { 19 | "name": "label", 20 | "type": "label", 21 | "dtype": "int32" 22 | }, 23 | { 24 | "name": "C1", 25 | "type": "sparse", 26 | "dimension": 7, 27 | "use_hash": false, 28 | "dtype": "int32", 29 | "embedding": true, 30 | "feature_info_gain": 0.001316, 31 | "feature_ig": 0.002621, 32 | "feature_attention": -0.2713, 33 | "feature_origin_num": 0.0714 34 | }, 35 | { 36 | "name": "banner_pos", 37 | "type": "sparse", 38 | "dimension": 7, 39 | "use_hash": false, 40 | "dtype": "int32", 41 | "embedding": true, 42 | "feature_info_gain": 0.000491, 43 | "feature_ig": 0.000562, 44 | "feature_attention": 1.5827, 45 | "feature_origin_num": 0.0769 46 | }, 47 | { 48 | "name": "site_id", 49 | "type": "sparse", 50 | "dimension": 3564, 51 | "use_hash": false, 52 | "dtype": "int32", 53 | "embedding": true, 54 | "feature_info_gain": 0.051313, 55 | "feature_ig": 0.010732, 56 | "feature_attention": -8.6402, 57 | "feature_origin_num": 0.0769 58 | }, 59 | { 60 | "name": "site_domain", 61 | "type": "sparse", 62 | "dimension": 4325, 63 | "use_hash": false, 64 | "dtype": "int32", 65 | "embedding": true, 66 | "feature_info_gain": 0.045302, 67 | "feature_ig": 0.010239, 68 | "feature_attention": -8.6832, 69 | "feature_origin_num": 0.0769 70 | }, 71 | { 72 | "name": "site_category", 73 | "type": "sparse", 74 | "dimension": 25, 75 | "use_hash": false, 76 | "dtype": "int32", 77 | "embedding": true, 78 | "feature_info_gain": 0.01131, 79 | "feature_ig": 0.005802, 80 | "feature_attention": 1.2455, 81 | "feature_origin_num": 0.087 82 | }, 83 | { 84 | "name": "app_id", 85 | "type": "sparse", 86 | "dimension": 5066, 87 | "use_hash": false, 88 | "dtype": "int32", 89 | "embedding": true, 90 | "feature_info_gain": 0.03229, 91 | "feature_ig": 0.009737, 92 | "feature_attention": 1.0956, 93 | "feature_origin_num": 0.1429 94 | }, 95 | { 96 | "name": "app_domain", 97 | "type": "sparse", 98 | "dimension": 307, 99 | "use_hash": false, 100 | "dtype": "int32", 101 | "embedding": true, 102 | "feature_info_gain": 0.016844, 103 | "feature_ig": 0.008697, 104 | "feature_attention": -2.1024, 105 | "feature_origin_num": 0.0833 106 | }, 107 | { 108 | "name": "app_category", 109 | "type": "sparse", 110 | "dimension": 31, 111 | "use_hash": false, 112 | "dtype": "int32", 113 | "embedding": true, 114 | "feature_info_gain": 0.011741, 115 | "feature_ig": 0.007834, 116 | "feature_attention": -1.6797, 117 | "feature_origin_num": 0.0741 118 | }, 119 | { 120 | "name": "device_id", 121 | "type": "sparse", 122 | "dimension": 278182, 123 | "use_hash": false, 124 | "dtype": "int32", 125 | "embedding": true, 126 | "feature_info_gain": 0.020215, 127 | "feature_ig": 0.008577, 128 | "feature_attention": 3.8125, 129 | "feature_origin_num": 0.2222 130 | }, 131 | { 132 | "name": "device_ip", 133 | "type": "sparse", 134 | "dimension": 1242892, 135 | "use_hash": 
false, 136 | "dtype": "int32", 137 | "embedding": true, 138 | "feature_info_gain": 0.084573, 139 | "feature_ig": 0.005821, 140 | "feature_attention": 13.54, 141 | "feature_origin_num": 0.5 142 | }, 143 | { 144 | "name": "device_model", 145 | "type": "sparse", 146 | "dimension": 6592, 147 | "use_hash": false, 148 | "dtype": "int32", 149 | "embedding": true, 150 | "feature_info_gain": 0.024494, 151 | "feature_ig": 0.002885, 152 | "feature_attention": 17.3369, 153 | "feature_origin_num": 0.5 154 | }, 155 | { 156 | "name": "device_type", 157 | "type": "sparse", 158 | "dimension": 5, 159 | "use_hash": false, 160 | "dtype": "int32", 161 | "embedding": true, 162 | "feature_info_gain": 0.00118, 163 | "feature_ig": 0.002495, 164 | "feature_attention": -4.9315, 165 | "feature_origin_num": 0.0541 166 | }, 167 | { 168 | "name": "device_conn_type", 169 | "type": "sparse", 170 | "dimension": 4, 171 | "use_hash": false, 172 | "dtype": "int32", 173 | "embedding": true, 174 | "feature_info_gain": 0.00711, 175 | "feature_ig": 0.009912, 176 | "feature_attention": -6.868, 177 | "feature_origin_num": 0.05 178 | }, 179 | { 180 | "name": "C14", 181 | "type": "sparse", 182 | "dimension": 2483, 183 | "use_hash": false, 184 | "dtype": "int32", 185 | "embedding": true, 186 | "feature_info_gain": 0.051731, 187 | "feature_ig": 0.006223, 188 | "feature_attention": 8.7072, 189 | "feature_origin_num": 0.1818 190 | }, 191 | { 192 | "name": "C15", 193 | "type": "sparse", 194 | "dimension": 8, 195 | "use_hash": false, 196 | "dtype": "int32", 197 | "embedding": true, 198 | "feature_info_gain": 0.009486, 199 | "feature_ig": 0.023491, 200 | "feature_attention": -0.4061, 201 | "feature_origin_num": 0.069 202 | }, 203 | { 204 | "name": "C16", 205 | "type": "sparse", 206 | "dimension": 9, 207 | "use_hash": false, 208 | "dtype": "int32", 209 | "embedding": true, 210 | "feature_info_gain": 0.012149, 211 | "feature_ig": 0.032473, 212 | "feature_attention": 4.8553, 213 | "feature_origin_num": 0.0952 214 | }, 215 | { 216 | "name": "C17", 217 | "type": "sparse", 218 | "dimension": 431, 219 | "use_hash": false, 220 | "dtype": "int32", 221 | "embedding": true, 222 | "feature_info_gain": 0.04872, 223 | "feature_ig": 0.007264, 224 | "feature_attention": 14.9213, 225 | "feature_origin_num": 0.2 226 | }, 227 | { 228 | "name": "C18", 229 | "type": "sparse", 230 | "dimension": 4, 231 | "use_hash": false, 232 | "dtype": "int32", 233 | "embedding": true, 234 | "feature_info_gain": 0.022289, 235 | "feature_ig": 0.01268, 236 | "feature_attention": -3.7916, 237 | "feature_origin_num": 0.0556 238 | }, 239 | { 240 | "name": "C19", 241 | "type": "sparse", 242 | "dimension": 68, 243 | "use_hash": false, 244 | "dtype": "int32", 245 | "embedding": true, 246 | "feature_info_gain": 0.024317, 247 | "feature_ig": 0.006413, 248 | "feature_attention": -0.3239, 249 | "feature_origin_num": 0.087 250 | }, 251 | { 252 | "name": "C20", 253 | "type": "sparse", 254 | "dimension": 169, 255 | "use_hash": false, 256 | "dtype": "int32", 257 | "embedding": true, 258 | "feature_info_gain": 0.015021, 259 | "feature_ig": 0.004064, 260 | "feature_attention": -3.8178, 261 | "feature_origin_num": 0.0741 262 | }, 263 | { 264 | "name": "C21", 265 | "type": "sparse", 266 | "dimension": 60, 267 | "use_hash": false, 268 | "dtype": "int32", 269 | "embedding": true, 270 | "feature_info_gain": 0.031306, 271 | "feature_ig": 0.007526, 272 | "feature_attention": 7.5865, 273 | "feature_origin_num": 0.1176 274 | } 275 | ] 276 | } 
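
Each `sparse` entry's `dimension` above is the feature's vocabulary size, which is what bounds its embedding table. Below is a hedged sketch of how a table could be sized from one of these entries; the helper is illustrative, not the repo's actual model code:

```python
import tensorflow as tf

def build_embedding_table(feature, embedding_dim=16):
    """Illustrative helper: size an embedding table from a config entry."""
    # The vocabulary size comes straight from the config's "dimension" field,
    # e.g. 1242892 for device_ip in the file above.
    return tf.keras.layers.Embedding(input_dim=feature["dimension"],
                                     output_dim=embedding_dim,
                                     name="emb_{}".format(feature["name"]))
```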
--------------------------------------------------------------------------------
/config/avazu/config_template.json:
--------------------------------------------------------------------------------
{
  "base_info": {
    "train_path": "./data/avazu/data/",
    "concat_path": "./data/avazu/concat/",
    "sparse_path": "./data/avazu/sparse/",
    "dense_path": "./data/avazu/dense/",
    "k_fold_path": "./data/avazu/k_fold/",
    "k_fold": 10,
    "random_seed": 2018
  },
  "model": {
    "data_prefix": "./data/avazu/k_fold/",
    "results_prefix": "./data/model/",
    "train_results_file": "train_results.txt",
    "test_results_file": "test_results.txt"
  },
  "features": [
    {"name": "label", "type": "label", "dtype": "int32"},
    {"name": "C1", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.001316, "feature_ig": 0.002621, "feature_attention": -0.2713, "feature_origin_num": 0.0714},
    {"name": "banner_pos", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.000491, "feature_ig": 0.000562, "feature_attention": 1.5827, "feature_origin_num": 0.0769},
    {"name": "site_id", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.051313, "feature_ig": 0.010732, "feature_attention": -8.6402, "feature_origin_num": 0.0769},
    {"name": "site_domain", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.045302, "feature_ig": 0.010239, "feature_attention": -8.6832, "feature_origin_num": 0.0769},
    {"name": "site_category", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.011310, "feature_ig": 0.005802, "feature_attention": 1.2455, "feature_origin_num": 0.0870},
    {"name": "app_id", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.032290, "feature_ig": 0.009737, "feature_attention": 1.0956, "feature_origin_num": 0.1429},
    {"name": "app_domain", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.016844, "feature_ig": 0.008697, "feature_attention": -2.1024, "feature_origin_num": 0.0833},
    {"name": "app_category", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.011741, "feature_ig": 0.007834, "feature_attention": -1.6797, "feature_origin_num": 0.0741},
    {"name": "device_id", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.020215, "feature_ig": 0.008577, "feature_attention": 3.8125, "feature_origin_num": 0.2222},
    {"name": "device_ip", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.084573, "feature_ig": 0.005821, "feature_attention": 13.5400, "feature_origin_num": 0.5000},
    {"name": "device_model", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.024494, "feature_ig": 0.002885, "feature_attention": 17.3369, "feature_origin_num": 0.5000},
    {"name": "device_type", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.001180, "feature_ig": 0.002495, "feature_attention": -4.9315, "feature_origin_num": 0.0541},
    {"name": "device_conn_type", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.007110, "feature_ig": 0.009912, "feature_attention": -6.8680, "feature_origin_num": 0.0500},
    {"name": "C14", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.051731, "feature_ig": 0.006223, "feature_attention": 8.7072, "feature_origin_num": 0.1818},
    {"name": "C15", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.009486, "feature_ig": 0.023491, "feature_attention": -0.4061, "feature_origin_num": 0.0690},
    {"name": "C16", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.012149, "feature_ig": 0.032473, "feature_attention": 4.8553, "feature_origin_num": 0.0952},
    {"name": "C17", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.048720, "feature_ig": 0.007264, "feature_attention": 14.9213, "feature_origin_num": 0.2000},
    {"name": "C18", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.022289, "feature_ig": 0.012680, "feature_attention": -3.7916, "feature_origin_num": 0.0556},
    {"name": "C19", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.024317, "feature_ig": 0.006413, "feature_attention": -0.3239, "feature_origin_num": 0.0870},
    {"name": "C20", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.015021, "feature_ig": 0.004064, "feature_attention": -3.8178, "feature_origin_num": 0.0741},
    {"name": "C21", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.031306, "feature_ig": 0.007526, "feature_attention": 7.5865, "feature_origin_num": 0.1176}
  ]
}
--------------------------------------------------------------------------------
/config/criteo/config_dense.json:
--------------------------------------------------------------------------------
{
  "base_info": {
    "train_path": "./data/criteo/dense/",
    "sparse_path": "./data/criteo/sparse/",
    "dense_path": "./data/criteo/dense/",
    "k_fold_path": "./data/criteo/k_fold/",
    "k_fold": 10,
    "random_seed": 2018
  },
  "model": {
    "data_prefix": "./data/criteo/k_fold/",
    "results_prefix": "./data/model/",
    "train_results_file": "train_results.txt",
    "test_results_file": "test_results.txt"
  },
  "features": [
    {"name": "label", "type": "label", "dtype": "int32"},
    {"name": "I1", "type": "dense", "dtype": "float32", "feature_num": 648, "feature_info_gain": 0.027131, "feature_ig": 0.013217, "feature_attention": 12.6426, "feature_origin_num": 0.0465},
    {"name": "I2", "type": "dense", "dtype": "float32", "feature_num": 9364, "feature_info_gain": 0.005273, "feature_ig": 0.000875, "feature_attention": 92.2337, "feature_origin_num": 0.1053},
    {"name": "I3", "type": "dense", "dtype": "float32", "feature_num": 14745, "feature_info_gain": 0.008674, "feature_ig": 0.00179, "feature_attention": -29.9626, "feature_origin_num": 0.05},
    {"name": "I4", "type": "dense", "dtype": "float32", "feature_num": 489, "feature_info_gain": 0.003057, "feature_ig": 0.000768, "feature_attention": 42.9781, "feature_origin_num": 0.0541},
    {"name": "I5", "type": "dense", "dtype": "float32", "feature_num": 476706, "feature_info_gain": 0.033112, "feature_ig": 0.002542, "feature_attention": -10.0345, "feature_origin_num": 0.08},
    {"name": "I6", "type": "dense", "dtype": "float32", "feature_num": 11617, "feature_info_gain": 0.028335, "feature_ig": 0.00431, "feature_attention": -29.4932, "feature_origin_num": 0.0488},
    {"name": "I7", "type": "dense", "dtype": "float32", "feature_num": 4141, "feature_info_gain": 0.035279, "feature_ig": 0.007616, "feature_attention": 84.6759, "feature_origin_num": 0.0833},
    {"name": "I8", "type": "dense", "dtype": "float32", "feature_num": 1372, "feature_info_gain": 0.00199, "feature_ig": 0.000395, "feature_attention": -38.7957, "feature_origin_num": 0.0364},
    {"name": "I9", "type": "dense", "dtype": "float32", "feature_num": 7274, "feature_info_gain": 0.001249, "feature_ig": 0.000163, "feature_attention": 0.3921, "feature_origin_num": 0.0556},
    {"name": "I10", "type": "dense", "dtype": "float32", "feature_num": 12, "feature_info_gain": 0.025483, "feature_ig": 0.023777, "feature_attention": 12.6983, "feature_origin_num": 0.037},
    {"name": "I11", "type": "dense", "dtype": "float32", "feature_num": 168, "feature_info_gain": 0.037279, "feature_ig": 0.012875, "feature_attention": -33.1952, "feature_origin_num": 0.0328},
    {"name": "I12", "type": "dense", "dtype": "float32", "feature_num": 406, "feature_info_gain": 0.003773, "feature_ig": 0.007959, "feature_attention": 38.1309, "feature_origin_num": 0.0513},
    {"name": "I13", "type": "dense", "dtype": "float32", "feature_num": 1375, "feature_info_gain": 0.013456, "feature_ig": 0.003311, "feature_attention": 21.2417, "feature_origin_num": 0.0541},
    {"name": "C1", "type": "sparse", "dimension": 1443, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 3.5e-05, "feature_ig": 1.2e-05, "feature_attention": -74.9407, "feature_origin_num": 0.0357},
    {"name": "C2", "type": "sparse", "dimension": 554, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.025349, "feature_ig": 0.003849, "feature_attention": 72.2962, "feature_origin_num": 0.0588},
    {"name": "C3", "type": "sparse", "dimension": 175781, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.05659, "feature_ig": 0.0061, "feature_attention": -24.5854, "feature_origin_num": 0.0667},
    {"name": "C4", "type": "sparse", "dimension": 128509, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.061085, "feature_ig": 0.00569, "feature_attention": -77.3841, "feature_origin_num": 0.0476},
    {"name": "C5", "type": "sparse", "dimension": 305, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 7e-06, "feature_ig": 4e-06, "feature_attention": -7.4469, "feature_origin_num": 0.0392},
    {"name": "C6", "type": "sparse", "dimension": 19, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.001007, "feature_ig": 0.000449, "feature_attention": 33.7929, "feature_origin_num": 0.0444},
    {"name": "C7", "type": "sparse", "dimension": 11930, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.057185, "feature_ig": 0.005103, "feature_attention": -22.6924, "feature_origin_num": 0.0526},
    {"name": "C8", "type": "sparse", "dimension": 629, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 1.4e-05, "feature_ig": 6e-06, "feature_attention": 8.9243, "feature_origin_num": 0.0444},
    {"name": "C9", "type": "sparse", "dimension": 3, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.008301, "feature_ig": 0.017487, "feature_attention": -8.8341, "feature_origin_num": 0.0323},
    {"name": "C10", "type": "sparse", "dimension": 41224, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.029038, "feature_ig": 0.002937, "feature_attention": 17.8221, "feature_origin_num": 0.08},
    {"name": "C11", "type": "sparse", "dimension": 5160, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.048004, "feature_ig": 0.004788, "feature_attention": -73.8346, "feature_origin_num": 0.04},
    {"name": "C12", "type": "sparse", "dimension": 174835, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.057472, "feature_ig": 0.006003, "feature_attention": 22.0109, "feature_origin_num": 0.1176},
    {"name": "C13", "type": "sparse", "dimension": 3175, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.045485, "feature_ig": 0.004734, "feature_attention": 12.7543, "feature_origin_num": 0.0556},
    {"name": "C14", "type": "sparse", "dimension": 27, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.011647, "feature_ig": 0.004945, "feature_attention": 121.0908, "feature_origin_num": 0.0556},
    {"name": "C15", "type": "sparse", "dimension": 11254, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.055275, "feature_ig": 0.005483, "feature_attention": -101.924, "feature_origin_num": 0.04},
    {"name": "C16", "type": "sparse", "dimension": 165206, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.058913, "feature_ig": 0.005759, "feature_attention": 65.1193, "feature_origin_num": 0.1429},
    {"name": "C17", "type": "sparse", "dimension": 10, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.02037, "feature_ig": 0.008121, "feature_attention": 380.3994, "feature_origin_num": 0.0526},
    {"name": "C18", "type": "sparse", "dimension": 4605, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.042858, "feature_ig": 0.004977, "feature_attention": -42.2779, "feature_origin_num": 0.04},
    {"name": "C19", "type": "sparse", "dimension": 2017, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.005209, "feature_ig": 0.001604, "feature_attention": 277.917, "feature_origin_num": 0.087},
    {"name": "C20", "type": "sparse", "dimension": 4, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.001781, "feature_ig": 0.000949, "feature_attention": -119.6272, "feature_origin_num": 0.0263},
    {"name": "C21", "type": "sparse", "dimension": 172322, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.058825, "feature_ig": 0.005973, "feature_attention": 80.1146, "feature_origin_num": 0.1818},
    {"name": "C22", "type": "sparse", "dimension": 18, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.000895, "feature_ig": 0.000798, "feature_attention": -20.5837, "feature_origin_num": 0.0333},
    {"name": "C23", "type": "sparse", "dimension": 15, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.01374, "feature_ig": 0.005619, "feature_attention": 31.1082, "feature_origin_num": 0.0417},
    {"name": "C24", "type": "sparse", "dimension": 56456, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.047887, "feature_ig": 0.005095, "feature_attention": -219.7682, "feature_origin_num": 0.0435},
    {"name": "C25", "type": "sparse", "dimension": 86, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.003809, "feature_ig": 0.001331, "feature_attention": -18.8173, "feature_origin_num": 0.0357},
    {"name": "C26", "type": "sparse", "dimension": 43356, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.033743, "feature_ig": 0.005086, "feature_attention": 314.5245, "feature_origin_num": 0.2}
  ]
}
--------------------------------------------------------------------------------
/config/criteo/config_template.json:
--------------------------------------------------------------------------------
{
  "base_info": {
    "train_path": "./data/criteo/data/",
    "sparse_path": "./data/criteo/sparse/",
    "dense_path": "./data/criteo/dense/",
    "k_fold_path": "./data/criteo/k_fold/",
    "k_fold": 10,
    "random_seed": 2018
  },
  "model": {
    "data_prefix": "./data/criteo/k_fold/",
    "results_prefix": "./data/model/",
    "train_results_file": "train_results.txt",
    "test_results_file": "test_results.txt"
  },
  "features": [
    {"name": "label", "type": "label", "dtype": "int32"},
    {"name": "I1", "type": "dense", "dtype": "float32", "feature_num": 648, "feature_info_gain": 0.027131, "feature_ig": 0.013217, "feature_attention": 12.6426, "feature_origin_num": 0.0465},
    {"name": "I2", "type": "dense", "dtype": "float32", "feature_num": 9364, "feature_info_gain": 0.005273, "feature_ig": 0.000875, "feature_attention": 92.2337, "feature_origin_num": 0.1053},
    {"name": "I3", "type": "dense", "dtype": "float32", "feature_num": 14745, "feature_info_gain": 0.008674, "feature_ig": 0.001790, "feature_attention": -29.9626, "feature_origin_num": 0.0500},
    {"name": "I4", "type": "dense", "dtype": "float32", "feature_num": 489, "feature_info_gain": 0.003057, "feature_ig": 0.000768, "feature_attention": 42.9781, "feature_origin_num": 0.0541},
    {"name": "I5", "type": "dense", "dtype": "float32", "feature_num": 476706, "feature_info_gain": 0.033112, "feature_ig": 0.002542, "feature_attention": -10.0345, "feature_origin_num": 0.0800},
    {"name": "I6", "type": "dense", "dtype": "float32", "feature_num": 11617, "feature_info_gain": 0.028335, "feature_ig": 0.004310, "feature_attention": -29.4932, "feature_origin_num": 0.0488},
    {"name": "I7", "type": "dense", "dtype": "float32", "feature_num": 4141, "feature_info_gain": 0.035279, "feature_ig": 0.007616, "feature_attention": 84.6759, "feature_origin_num": 0.0833},
    {"name": "I8", "type": "dense", "dtype": "float32", "feature_num": 1372, "feature_info_gain": 0.001990, "feature_ig": 0.000395, "feature_attention": -38.7957, "feature_origin_num": 0.0364},
    {"name": "I9", "type": "dense", "dtype": "float32", "feature_num": 7274, "feature_info_gain": 0.001249, "feature_ig": 0.000163, "feature_attention": 0.3921, "feature_origin_num": 0.0556},
    {"name": "I10", "type": "dense", "dtype": "float32", "feature_num": 12, "feature_info_gain": 0.025483, "feature_ig": 0.023777, "feature_attention": 12.6983, "feature_origin_num": 0.0370},
    {"name": "I11", "type": "dense", "dtype": "float32", "feature_num": 168, "feature_info_gain": 0.037279, "feature_ig": 0.012875, "feature_attention": -33.1952, "feature_origin_num": 0.0328},
    {"name": "I12", "type": "dense", "dtype": "float32", "feature_num": 406, "feature_info_gain": 0.003773, "feature_ig": 0.007959, "feature_attention": 38.1309, "feature_origin_num": 0.0513},
    {"name": "I13", "type": "dense", "dtype": "float32", "feature_num": 1375, "feature_info_gain": 0.013456, "feature_ig": 0.003311, "feature_attention": 21.2417, "feature_origin_num": 0.0541},
    {"name": "C1", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.000035, "feature_ig": 0.000012, "feature_attention": -74.9407, "feature_origin_num": 0.0357},
    {"name": "C2", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.025349, "feature_ig": 0.003849, "feature_attention": 72.2962, "feature_origin_num": 0.0588},
    {"name": "C3", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.056590, "feature_ig": 0.006100, "feature_attention": -24.5854, "feature_origin_num": 0.0667},
    {"name": "C4", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.061085, "feature_ig": 0.005690, "feature_attention": -77.3841, "feature_origin_num": 0.0476},
    {"name": "C5", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.000007, "feature_ig": 0.000004, "feature_attention": -7.4469, "feature_origin_num": 0.0392},
    {"name": "C6", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.001007, "feature_ig": 0.000449, "feature_attention": 33.7929, "feature_origin_num": 0.0444},
    {"name": "C7", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.057185, "feature_ig": 0.005103, "feature_attention": -22.6924, "feature_origin_num": 0.0526},
    {"name": "C8", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.000014, "feature_ig": 0.000006, "feature_attention": 8.9243, "feature_origin_num": 0.0444},
    {"name": "C9", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.008301, "feature_ig": 0.017487, "feature_attention": -8.8341, "feature_origin_num": 0.0323},
    {"name": "C10", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.029038, "feature_ig": 0.002937, "feature_attention": 17.8221, "feature_origin_num": 0.0800},
    {"name": "C11", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.048004, "feature_ig": 0.004788, "feature_attention": -73.8346, "feature_origin_num": 0.0400},
    {"name": "C12", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.057472, "feature_ig": 0.006003, "feature_attention": 22.0109, "feature_origin_num": 0.1176},
    {"name": "C13", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.045485, "feature_ig": 0.004734, "feature_attention": 12.7543, "feature_origin_num": 0.0556},
    {"name": "C14", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.011647, "feature_ig": 0.004945, "feature_attention": 121.0908, "feature_origin_num": 0.0556},
    {"name": "C15", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.055275, "feature_ig": 0.005483, "feature_attention": -101.9240, "feature_origin_num": 0.0400},
    {"name": "C16", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.058913, "feature_ig": 0.005759, "feature_attention": 65.1193, "feature_origin_num": 0.1429},
    {"name": "C17", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.020370, "feature_ig": 0.008121, "feature_attention": 380.3994, "feature_origin_num": 0.0526},
    {"name": "C18", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.042858, "feature_ig": 0.004977, "feature_attention": -42.2779, "feature_origin_num": 0.0400},
    {"name": "C19", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.005209, "feature_ig": 0.001604, "feature_attention": 277.9170, "feature_origin_num": 0.0870},
    {"name": "C20", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.001781, "feature_ig": 0.000949, "feature_attention": -119.6272, "feature_origin_num": 0.0263},
    {"name": "C21", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.058825, "feature_ig": 0.005973, "feature_attention": 80.1146, "feature_origin_num": 0.1818},
    {"name": "C22", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.000895, "feature_ig": 0.000798, "feature_attention": -20.5837, "feature_origin_num": 0.0333},
    {"name": "C23", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.013740, "feature_ig": 0.005619, "feature_attention": 31.1082, "feature_origin_num": 0.0417},
    {"name": "C24", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.047887, "feature_ig": 0.005095, "feature_attention": -219.7682, "feature_origin_num": 0.0435},
    {"name": "C25", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.003809, "feature_ig": 0.001331, "feature_attention": -18.8173, "feature_origin_num": 0.0357},
    {"name": "C26", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.033743, "feature_ig": 0.005086, "feature_attention": 314.5245, "feature_origin_num": 0.2000}
  ]
}
--------------------------------------------------------------------------------
/config/kdd12/config_dense.json:
--------------------------------------------------------------------------------
{
  "base_info": {
    "train_path": "./data/kdd12/dense/",
    "concat_path": "./data/kdd12/concat/",
    "sparse_path": "./data/kdd12/sparse/",
    "dense_path": "./data/kdd12/dense/",
    "k_fold_path": "./data/kdd12/k_fold/",
    "k_fold": 10,
    "random_seed": 2018
  },
  "model": {
    "data_prefix": "./data/kdd12/k_fold/",
    "train_paths": ["part0", "part1", "part2", "part3", "part4", "part5", "part6", "part7"],
    "valid_paths": ["part8"],
    "test_paths": ["part9"],
    "results_prefix": "./data/model/",
    "train_results_file": "train_results.txt",
    "test_results_file": "test_results.txt"
  },
  "features": [
    {"name": "label", "type": "label", "dtype": "int32"},
    {"name": "Impression", "type": "dense", "dtype": "float32", "feature_num": 3357, "feature_info_gain": 0.003125, "feature_ig": 0.003485, "feature_attention": 16.3908, "feature_origin_num": 0.1538},
    {"name": "DisplayURL", "type": "sparse", "dimension": 22804, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.012621, "feature_ig": 0.001278, "feature_attention": -1.3437, "feature_origin_num": 0.1333},
    {"name": "AdID", "type": "sparse", "dimension": 297532, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.023368, "feature_ig": 0.001626, "feature_attention": 24.8785, "feature_origin_num": 0.2222},
    {"name": "AdvertiserID", "type": "sparse", "dimension": 14446, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.012569, "feature_ig": 0.001292, "feature_attention": -15.1511, "feature_origin_num": 0.1053},
    {"name": "Depth", "type": "dense", "dtype": "float32", "feature_num": 3, "feature_info_gain": 0.001303, "feature_ig": 0.000858, "feature_attention": -34.9466, "feature_origin_num": 0.08},
    {"name": "Position", "type": "dense", "dtype": "float32", "feature_num": 3, "feature_info_gain": 0.0036, "feature_ig": 0.002832, "feature_attention": -7.0461, "feature_origin_num": 0.0909},
    {"name": "QueryID", "type": "sparse", "dimension": 1192895, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.027631, "feature_ig": 0.002421, "feature_attention": 4.0461, "feature_origin_num": 0.25},
    {"name": "KeywordID", "type": "sparse", "dimension": 387509, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.02303, "feature_ig": 0.0016, "feature_attention": 0.7922, "feature_origin_num": 0.1667},
    {"name": "TitleID", "type": "sparse", "dimension": 845284, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.028599, "feature_ig": 0.001867, "feature_attention": -10.1859, "feature_origin_num": 0.1538},
    {"name": "DescriptionID", "type": "sparse", "dimension": 679550, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.025758, "feature_ig": 0.001747, "feature_attention": 25.7454, "feature_origin_num": 0.3333},
    {"name": "UserID", "type": "sparse", "dimension": 2022305, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.025656, "feature_ig": 0.0028, "feature_attention": 6.0242, "feature_origin_num": 0.3333},
    {"name": "Gender", "type": "sparse", "dimension": 4, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.001086, "feature_ig": 0.000609, "feature_attention": -20.1411, "feature_origin_num": 0.087},
    {"name": "Age", "type": "sparse", "dimension": 8, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.001241, "feature_ig": 0.000444, "feature_attention": 28.7298, "feature_origin_num": 0.1818}
  ]
}
--------------------------------------------------------------------------------
/config/kdd12/config_template.json:
--------------------------------------------------------------------------------
{
  "base_info": {
    "train_path": "./data/kdd12/data/",
    "concat_path": "./data/kdd12/concat/",
    "sparse_path": "./data/kdd12/sparse/",
    "dense_path": "./data/kdd12/dense/",
    "k_fold_path": "./data/kdd12/k_fold/",
    "k_fold": 10,
    "random_seed": 2018
  },
  "model": {
    "data_prefix": "./data/kdd12/k_fold/",
    "train_paths": ["part0", "part1", "part2", "part3", "part4", "part5", "part6", "part7"],
    "valid_paths": ["part8"],
    "test_paths": ["part9"],
    "results_prefix": "./data/model/",
    "train_results_file": "train_results.txt",
    "test_results_file": "test_results.txt"
  },
  "features": [
    {"name": "label", "type": "label", "dtype": "int32"},
    {"name": "Impression", "type": "dense", "dtype": "float32", "feature_num": 3357, "feature_info_gain": 0.003125, "feature_ig": 0.003485, "feature_attention": 16.3908, "feature_origin_num": 0.1538},
    {"name": "DisplayURL", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.012621, "feature_ig": 0.001278, "feature_attention": -1.3437, "feature_origin_num": 0.1333},
    {"name": "AdID", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.023368, "feature_ig": 0.001626, "feature_attention": 24.8785, "feature_origin_num": 0.2222},
    {"name": "AdvertiserID", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.012569, "feature_ig": 0.001292, "feature_attention": -15.1511, "feature_origin_num": 0.1053},
    {"name": "Depth", "type": "dense", "dtype": "float32", "feature_num": 3, "feature_info_gain": 0.001303, "feature_ig": 0.000858, "feature_attention": -34.9466, "feature_origin_num": 0.0800},
    {"name": "Position", "type": "dense", "dtype": "float32", "feature_num": 3, "feature_info_gain": 0.003600, "feature_ig": 0.002832, "feature_attention": -7.0461, "feature_origin_num": 0.0909},
    {"name": "QueryID", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.027631, "feature_ig": 0.002421, "feature_attention": 4.0461, "feature_origin_num": 0.2500},
    {"name": "KeywordID", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.023030, "feature_ig": 0.001600, "feature_attention": 0.7922, "feature_origin_num": 0.1667},
    {"name": "TitleID", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.028599, "feature_ig": 0.001867, "feature_attention": -10.1859, "feature_origin_num": 0.1538},
    {"name": "DescriptionID", "type": "sparse", "dimension": 0, "use_hash": false, "dtype": "int32", "embedding": true, "feature_info_gain": 0.025758, "feature_ig": 0.001747, "feature_attention": 25.7454, "feature_origin_num": 0.3333},
"feature_attention":6.0242, 150 | "feature_origin_num": 0.3333 151 | }, 152 | { 153 | "name": "Gender", 154 | "type": "sparse", 155 | "dimension": 0, 156 | "use_hash": false, 157 | "dtype": "int32", 158 | "embedding": true, 159 | "feature_info_gain": 0.001086, 160 | "feature_ig": 0.000609, 161 | "feature_attention":-20.1411, 162 | "feature_origin_num": 0.0870 163 | }, 164 | { 165 | "name": "Age", 166 | "type": "sparse", 167 | "dimension": 0, 168 | "use_hash": false, 169 | "dtype": "int32", 170 | "embedding": true, 171 | "feature_info_gain": 0.001241, 172 | "feature_ig": 0.000444, 173 | "feature_attention":28.7298, 174 | "feature_origin_num": 0.1818 175 | } 176 | ] 177 | } -------------------------------------------------------------------------------- /rec_alg/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | -------------------------------------------------------------------------------- /rec_alg/common/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | -------------------------------------------------------------------------------- /rec_alg/common/batch_generator.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | from rec_alg.common.data_loader import DataLoader 4 | from rec_alg.components.inputs import VarLenSparseFeat 5 | 6 | 7 | class BatchGenerator(object): 8 | @staticmethod 9 | def get_batch(samples, batch_size, index, features): 10 | start = index * batch_size 11 | end = (index + 1) * batch_size 12 | end = end if end < len(samples) else len(samples) 13 | 14 | batch = samples[start:end].copy() 15 | x = [] 16 | y = [batch[:, 0]] 17 | column_index = 1 18 | for feature in features: 19 | if isinstance(feature, VarLenSparseFeat): 20 | x.append(batch[:, column_index:(column_index + feature.maxlen)]) 21 | column_index += feature.maxlen 22 | else: 23 | x.append(batch[:, column_index]) 24 | column_index += 1 25 | return x, y # labels, samples 26 | 27 | @staticmethod 28 | def generate_arrays_from_file(paths, batch_size=32, drop_remainder=True, features=None, shuffle=True): 29 | if not features: 30 | features = {} 31 | pre_path = "" 32 | samples = None 33 | while True: 34 | for path in paths: 35 | if pre_path != path: 36 | samples = DataLoader.smart_load_data(path) 37 | pre_path = path 38 | if len(samples) == 0: 39 | continue 40 | if shuffle: 41 | np.random.shuffle(samples) 42 | total_batch = int(len(samples) / batch_size) if drop_remainder else int( 43 | (len(samples) - 1) / batch_size) + 1 44 | for index in range(total_batch): 45 | batch_features, labels = BatchGenerator.get_batch(samples, batch_size, index, features) 46 | yield batch_features, labels 47 | 48 | @staticmethod 49 | def get_dataset_length(paths, batch_size=32, drop_remainder=True): 50 | length = 0 51 | for path in paths: 52 | samples = DataLoader.smart_load_data(path) 53 | if len(samples) == 0: 54 | continue 55 | total_batch = int(len(samples) / batch_size) if drop_remainder else int((len(samples) - 1) / batch_size) + 1 56 | length += total_batch 57 | return length 58 | 59 | @staticmethod 60 | def get_txt_dataset_length(paths, batch_size=32, drop_remainder=True): 61 | length = 0 62 | for path in paths: 63 | file_length = DataLoader.get_file_len(path) 64 | total_batch = int(file_length / batch_size) if drop_remainder else int( 65 | (file_length - 1) / batch_size) + 1 
            if total_batch == 0:
                continue
            length += total_batch
        return length
--------------------------------------------------------------------------------
/rec_alg/common/constants.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding:utf-8 -*-


class Constants(object):
    """
    Constants
    """
    LOGGER_DEFAULT = "default"

    # feature type
    FEATURE_TYPE_LABEL = "label"
    FEATURE_TYPE_SPARSE = "sparse"
    FEATURE_TYPE_DENSE = "dense"
    FEATURE_TYPE_VARLENSPARSE = "VarLenSparse"

    # mode
    MODE_TRAIN = "train"
    MODE_RETRAIN = "retrain"
    MODE_TEST = "test"

    # date
    DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
--------------------------------------------------------------------------------
/rec_alg/common/data_loader.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import collections
import json
import os

import numpy as np
import pandas as pd


class DataLoader(object):
    @staticmethod
    def load_config_dict(config_path):
        with open(config_path) as config_file:
            # keep key order
            config = json.loads(config_file.read(), object_pairs_hook=collections.OrderedDict)
        return config

    @staticmethod
    def get_files(prefix, paths):
        abs_paths = [os.path.join(prefix, path) for path in paths]
        files = DataLoader.list_multi_dir_files(abs_paths)
        return files

    @staticmethod
    def smart_load_data(path):
        names = os.path.splitext(path)
        if len(names) == 1:
            return False
        ext = names[1]
        if ext == ".json":
            return DataLoader.load_data_json(path)
        elif ext == ".txt":
            return DataLoader.load_data_txt(path)
        elif ext == ".csv":
            return DataLoader.load_data_txt(path, sep=",")
        elif ext == ".npy":
            return DataLoader.load_data_npy(path)
        return False

    @staticmethod
    def load_data_txt(path, sep="\t", names=None):
        df = pd.read_csv(path, names=names, sep=sep, index_col=False, header=None)
        return df.values

    @staticmethod
    def load_data_txt_as_df(path, sep="\t", names=None, usecols=None):
        df = pd.read_csv(path, names=names, sep=sep, index_col=False, header=None, usecols=usecols)
        return df

    @staticmethod
    def load_data_npy(path):
        data = np.load(path)
        return data

    @staticmethod
    def load_data_json(path):
        with open(path) as f:
            lines = f.read()
            config = json.loads(lines, object_pairs_hook=collections.OrderedDict)
        return config

    @staticmethod
    def list_dir_files(path):
        files = []
        for filename in os.listdir(path):
            abs_path = os.path.abspath(os.path.join(path, filename))
            if os.path.isfile(abs_path) and not os.path.basename(abs_path).startswith("."):
                files.append(abs_path)
        files.sort()
        return files

    @staticmethod
    def rmdirs(path):
        if not os.path.exists(path):
            return True
        for root, dirs, files in os.walk(path, topdown=False):
            for name in files:
                os.remove(os.path.join(root, name))
            for name in dirs:
                os.rmdir(os.path.join(root, name))
        return True

    @staticmethod
    def list_multi_dir_files(paths):
        files = []
        for path in paths:
            files.extend(DataLoader.list_dir_files(path))
        return files

    @staticmethod
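    # Creates the directory (including missing parents) if it does not exist yet.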
def validate_or_create_dir(path): 93 | if not os.path.exists(path): 94 | os.makedirs(path) 95 | return True 96 | 97 | @staticmethod 98 | def get_file_len(filename): 99 | cnt = 0 100 | with open(filename) as f: 101 | for line in f: 102 | line = line.strip() 103 | if line: 104 | cnt += 1 105 | return cnt 106 | -------------------------------------------------------------------------------- /rec_alg/common/tf_utils.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from rec_alg.components.inputs import SparseFeat, DenseFeat, VarLenSparseFeat 3 | 4 | 5 | def get_top_inputs_embeddings(feature_columns, features, embeddings, feature_importance_metric="dimension", 6 | feature_importance_top_k=-1, return_feature_index=False): 7 | """ 8 | Get the top K inputs and embeddings according to the importance metric 9 | 10 | Note: the order of features and embeddings is different 11 | :param feature_columns: feature columns in dataset 12 | :param features: dict, key is feature name, value is the Input of a feature to a model 13 | :param embeddings: embeddings of features 14 | :param feature_importance_metric: metric of feature importance 15 | :param feature_importance_top_k: top K 16 | :param return_feature_index: boolean 17 | :return: 18 | """ 19 | # By default, use all inputs and embeddings 20 | if feature_importance_top_k == -1: 21 | feature_importance_top_k = len(feature_columns) 22 | 23 | # Get the TopK most important feature names 24 | sorted_feature_columns = [] 25 | print("Feature columns: ") 26 | for fc in feature_columns: 27 | sorted_feature_columns.append((fc.name, getattr(fc, feature_importance_metric))) 28 | sorted_feature_columns.sort(key=lambda f: f[1], reverse=True) 29 | print("sorted_feature_columns: ", sorted_feature_columns) 30 | selected_feature_columns = set([sorted_feature_columns[i][0] for i in range(feature_importance_top_k)]) 31 | print("selected_feature_columns: ", selected_feature_columns) 32 | 33 | # Get the TopK most important feature inputs in order of [sparse inputs, var len inputs, dense inputs] 34 | top_inputs = [] 35 | for idx, fc in enumerate(feature_columns): 36 | if isinstance(fc, SparseFeat) and fc.name in selected_feature_columns: 37 | top_inputs.append(features[fc.name]) 38 | for idx, fc in enumerate(feature_columns): 39 | if isinstance(fc, VarLenSparseFeat) and fc.name in selected_feature_columns: 40 | top_inputs.append(features[fc.name]) 41 | for idx, fc in enumerate(feature_columns): 42 | if isinstance(fc, DenseFeat) and fc.name in selected_feature_columns: 43 | top_inputs.append(features[fc.name]) 44 | 45 | count_sparse_features = 0 46 | offsets_sparse_features = [] 47 | count_var_len_features = 0 48 | offsets_var_len_features = [] 49 | count_dense_features = 0 50 | offsets_dense_features = [] 51 | for idx, fc in enumerate(feature_columns): 52 | if isinstance(fc, SparseFeat): 53 | if fc.name in selected_feature_columns: 54 | offsets_sparse_features.append(count_sparse_features) 55 | count_sparse_features += 1 56 | elif isinstance(fc, VarLenSparseFeat): 57 | if fc.name in selected_feature_columns: 58 | offsets_var_len_features.append(count_var_len_features) 59 | count_var_len_features += 1 60 | else: 61 | if fc.name in selected_feature_columns: 62 | offsets_dense_features.append(count_dense_features) 63 | count_dense_features += 1 64 | # embeddings = [sparse_embeddings, var_len_embeddings, dense_embeddings] 65 | selected_features_indexes = [idx for idx in offsets_sparse_features] 66 | base_var_len_features 
= count_sparse_features 67 | for offset in offsets_var_len_features: 68 | selected_features_indexes.append(base_var_len_features + offset) 69 | base_dense_features = count_sparse_features + count_var_len_features 70 | for offset in offsets_dense_features: 71 | selected_features_indexes.append(base_dense_features + offset) 72 | top_embeddings = tf.gather(embeddings, selected_features_indexes, axis=1) 73 | if return_feature_index: 74 | return top_inputs, top_embeddings, selected_features_indexes 75 | return top_inputs, top_embeddings 76 | -------------------------------------------------------------------------------- /rec_alg/common/utils.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | import argparse 4 | import ast 5 | 6 | import numpy as np 7 | import tensorflow as tf 8 | import tensorflow.keras.backend as K 9 | 10 | 11 | class Utils: 12 | @staticmethod 13 | def count_parameters(model): 14 | """ 15 | Counting model parameters 16 | :return: total_count, trainable_count, non_trainable_count 17 | """ 18 | trainable_count = np.sum([K.count_params(w) for w in model.trainable_weights]) 19 | non_trainable_count = np.sum([K.count_params(w) for w in model.non_trainable_weights]) 20 | total_count = trainable_count + non_trainable_count 21 | 22 | print('Total params: {:,}'.format(total_count)) 23 | print('Trainable params: {:,}'.format(trainable_count)) 24 | print('Non-trainable params: {:,}'.format(non_trainable_count)) 25 | 26 | return total_count, trainable_count, non_trainable_count 27 | 28 | @staticmethod 29 | def str2list(v): 30 | try: 31 | v = v.split(',') 32 | v = [int(_.strip('[]')) for _ in v] 33 | except: 34 | v = [] 35 | return v 36 | 37 | @staticmethod 38 | def str_to_type(v): 39 | try: 40 | v = ast.literal_eval(v) 41 | except: 42 | v = [] 43 | return v 44 | 45 | @staticmethod 46 | def str2bool(v): 47 | if v.lower() in ['yes', 'true', 't', 'y', '1']: 48 | return True 49 | elif v.lower() in ['no', 'false', 'f', 'n', '0']: 50 | return False 51 | else: 52 | raise argparse.ArgumentTypeError('Unsupported value encountered.') 53 | 54 | @staticmethod 55 | def get_float(value, default_value=0.0): 56 | try: 57 | return float(value) 58 | except: 59 | return default_value 60 | 61 | @staticmethod 62 | def str2liststr(v): 63 | v = v.split(",") 64 | v = [s.strip() for s in v] 65 | return v 66 | 67 | @staticmethod 68 | def get_upper_triangular_indices(n): 69 | indices = [] 70 | for i in range(0, n): 71 | for j in range(i + 1, n): 72 | indices.append([i, j]) 73 | return indices 74 | 75 | @staticmethod 76 | def concat_func(inputs, axis=-1, name=None): 77 | if len(inputs) == 1: 78 | return inputs[0] 79 | else: 80 | return tf.keras.layers.Concatenate(axis=axis, name=name)(inputs) 81 | -------------------------------------------------------------------------------- /rec_alg/components/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | -------------------------------------------------------------------------------- /rec_alg/components/inputs.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | from collections import OrderedDict, namedtuple 5 | 6 | from tensorflow.python import GlorotNormal, GlorotUniform 7 | from tensorflow.python.keras.initializers import RandomNormal, TruncatedNormal, RandomUniform 8 | from 
tensorflow.python.keras.layers import Embedding, Input 9 | from tensorflow.python.keras.regularizers import l2 10 | 11 | from .layers import Hash, Linear 12 | from rec_alg.common.utils import Utils 13 | 14 | class SparseFeat(namedtuple('SparseFeat', ['name', 'dimension', 'use_hash', 'dtype','embedding_name','embedding', 'feature_num', 'feature_origin_num', 'feature_info_gain', 'feature_ig', 'feature_attention'])): 15 | __slots__ = () 16 | 17 | def __new__(cls, name, dimension, use_hash=False, dtype="int32", embedding_name=None, embedding=True, feature_num=None, feature_origin_num=None, feature_info_gain=None, feature_ig=None, feature_attention=None): 18 | if embedding and embedding_name is None: 19 | embedding_name = name 20 | if feature_num is None: 21 | feature_num = dimension 22 | if feature_origin_num is None: 23 | feature_origin_num = dimension 24 | if feature_info_gain is None: 25 | feature_info_gain = dimension 26 | if feature_ig is None: 27 | feature_ig = dimension 28 | if feature_attention is None: 29 | feature_attention = dimension 30 | return super(SparseFeat, cls).__new__(cls, name, dimension, use_hash, dtype, embedding_name, embedding, feature_num, feature_origin_num, feature_info_gain, feature_ig, feature_attention) 31 | 32 | 33 | class DenseFeat(namedtuple('DenseFeat', ['name', 'dimension', 'dtype', 'feature_num', 'feature_origin_num', 'feature_info_gain', 'feature_ig', 'feature_attention'])): 34 | __slots__ = () 35 | 36 | def __new__(cls, name, dimension=1, dtype="float32", feature_num=None, feature_origin_num=None, feature_info_gain=None, feature_ig=None, feature_attention=None): 37 | if feature_num is None: 38 | feature_num = dimension 39 | if feature_origin_num is None: 40 | feature_origin_num = dimension 41 | if feature_info_gain is None: 42 | feature_info_gain = dimension 43 | if feature_ig is None: 44 | feature_ig = dimension 45 | if feature_attention is None: 46 | feature_attention = dimension 47 | return super(DenseFeat, cls).__new__(cls, name, dimension, dtype, feature_num, feature_origin_num, feature_info_gain, feature_ig, feature_attention) 48 | 49 | 50 | class VarLenSparseFeat(namedtuple('VarLenFeat', ['name', 'dimension', 'maxlen', 'combiner', 'use_hash', 'dtype','embedding_name','embedding', 'feature_num', 'feature_origin_num', 'feature_info_gain', 'feature_ig', 'feature_attention'])): 51 | __slots__ = () 52 | 53 | def __new__(cls, name, dimension, maxlen, combiner="mean", use_hash=False, dtype="float32", embedding_name=None, embedding=True, feature_num=None, feature_origin_num=None, feature_info_gain=None, feature_ig=None, feature_attention=None): 54 | if embedding_name is None: 55 | embedding_name = name 56 | if feature_num is None: 57 | feature_num = dimension 58 | if feature_origin_num is None: 59 | feature_origin_num = dimension 60 | if feature_info_gain is None: 61 | feature_info_gain = dimension 62 | if feature_ig is None: 63 | feature_ig = dimension 64 | if feature_attention is None: 65 | feature_attention = dimension 66 | return super(VarLenSparseFeat, cls).__new__(cls, name, dimension, maxlen, combiner, use_hash, dtype, embedding_name, embedding, feature_num, feature_origin_num, feature_info_gain, feature_ig, feature_attention) 67 | 68 | 69 | def build_input_features(feature_columns, mask_zero=True, prefix=''): 70 | input_features = OrderedDict() 71 | for fc in feature_columns: 72 | if isinstance(fc, SparseFeat): 73 | input_features[fc.name] = Input( 74 | shape=(1,), name=prefix + fc.name, dtype=fc.dtype) 75 | elif isinstance(fc, DenseFeat): 76 | 
input_features[fc.name] = Input( 77 | shape=(fc.dimension,), name=prefix + fc.name, dtype=fc.dtype) 78 | elif isinstance(fc, VarLenSparseFeat): 79 | input_features[fc.name] = Input(shape=(fc.maxlen,), name=prefix + fc.name, 80 | dtype=fc.dtype) 81 | if not mask_zero: 82 | input_features[fc.name + "_seq_length"] = Input(shape=( 83 | 1,), name=prefix + 'seq_length_' + fc.name) 84 | input_features[fc.name + "_seq_max_length"] = fc.maxlen 85 | else: 86 | raise TypeError("Invalid feature column type,got", type(fc)) 87 | 88 | return input_features 89 | 90 | 91 | def create_embedding_dict(sparse_feature_columns, varlen_sparse_feature_columns, embedding_size, init_std, seed, l2_reg, 92 | prefix='sparse_', seq_mask_zero=True, initializer_mode="random_normal"): 93 | if embedding_size == 'auto': 94 | print("Notice:Do not use auto embedding in models other than DCN") 95 | sparse_embedding = {feat.embedding_name: Embedding(feat.dimension, 6 * int(pow(feat.dimension, 0.25)), 96 | embeddings_initializer=get_embedding_initializer( 97 | initializer_mode=initializer_mode, 98 | mean=0.0, 99 | stddev=init_std, 100 | seed=seed), 101 | embeddings_regularizer=l2(l2_reg), 102 | name=prefix + '_emb_' + feat.name) for feat in 103 | sparse_feature_columns} 104 | else: 105 | 106 | sparse_embedding = {feat.embedding_name: Embedding(feat.dimension, embedding_size, 107 | embeddings_initializer=get_embedding_initializer( 108 | initializer_mode=initializer_mode, 109 | mean=0.0, 110 | stddev=init_std, 111 | seed=seed), 112 | embeddings_regularizer=l2( 113 | l2_reg), 114 | name=prefix + '_emb_' + feat.name) for feat in 115 | sparse_feature_columns} 116 | 117 | if varlen_sparse_feature_columns and len(varlen_sparse_feature_columns) > 0: 118 | for feat in varlen_sparse_feature_columns: 119 | # if feat.name not in sparse_embedding: 120 | if embedding_size == "auto": 121 | sparse_embedding[feat.embedding_name] = Embedding(feat.dimension, 6 * int(pow(feat.dimension, 0.25)), 122 | embeddings_initializer=get_embedding_initializer( 123 | initializer_mode=initializer_mode, 124 | mean=0.0, 125 | stddev=init_std, 126 | seed=seed), 127 | embeddings_regularizer=l2( 128 | l2_reg), 129 | name=prefix + '_seq_emb_' + feat.name, 130 | mask_zero=seq_mask_zero) 131 | 132 | else: 133 | sparse_embedding[feat.embedding_name] = Embedding(feat.dimension, embedding_size, 134 | embeddings_initializer=get_embedding_initializer( 135 | initializer_mode=initializer_mode, 136 | mean=0.0, 137 | stddev=init_std, 138 | seed=seed), 139 | embeddings_regularizer=l2( 140 | l2_reg), 141 | name=prefix + '_seq_emb_' + feat.name, 142 | mask_zero=seq_mask_zero) 143 | return sparse_embedding 144 | 145 | 146 | def get_embedding_initializer(initializer_mode="random_normal", mean=0.0, stddev=0.01, minval=-0.05, maxval=0.05, 147 | seed=1024): 148 | if initializer_mode == "random_normal": 149 | initializer = RandomNormal(mean=mean, stddev=stddev, seed=seed) 150 | elif initializer_mode == "truncated_normal": 151 | initializer = TruncatedNormal(mean=mean, stddev=stddev, seed=seed) 152 | elif initializer_mode == "glorot_normal": 153 | initializer = GlorotNormal(seed=seed) 154 | elif initializer_mode == "random_uniform": 155 | initializer = RandomUniform(minval=minval, maxval=maxval, seed=seed) 156 | elif initializer_mode == "glorot_uniform": 157 | initializer = GlorotUniform(seed=seed) 158 | else: 159 | raise Exception("Don't support embedding initializer_mode: ", initializer_mode) 160 | return initializer 161 | 162 | 163 | def create_embedding_matrix(feature_columns, l2_reg, 
init_std, seed, embedding_size, prefix="", seq_mask_zero=True, 164 | initializer_mode="random_normal"): 165 | sparse_feature_columns = list( 166 | filter(lambda x: isinstance(x, SparseFeat) and x.embedding, feature_columns)) if feature_columns else [] 167 | varlen_sparse_feature_columns = list( 168 | filter(lambda x: isinstance(x, VarLenSparseFeat) and x.embedding, feature_columns)) if feature_columns else [] 169 | sparse_emb_dict = create_embedding_dict(sparse_feature_columns, varlen_sparse_feature_columns, embedding_size, 170 | init_std, seed, l2_reg, prefix=prefix + 'sparse', 171 | seq_mask_zero=seq_mask_zero, 172 | initializer_mode=initializer_mode) 173 | return sparse_emb_dict 174 | 175 | 176 | def get_linear_logit(features, feature_columns, units=1, use_bias=True, l2_reg=0, init_std=0.0001, seed=1024, 177 | prefix='linear'): 178 | linear_emb_list = [ 179 | input_from_feature_columns(features, feature_columns, 1, l2_reg, init_std, seed, prefix=prefix + str(i))[0] for 180 | i in range(units)] 181 | _, dense_input_list = input_from_feature_columns(features, feature_columns, 1, l2_reg, init_std, seed, 182 | prefix=prefix) 183 | 184 | linear_logit_list = [] 185 | for i in range(units): 186 | if len(linear_emb_list[0]) > 0 and len(dense_input_list) > 0: 187 | sparse_input = Utils.concat_func(linear_emb_list[i]) 188 | dense_input = Utils.concat_func(dense_input_list) 189 | linear_logit = Linear(l2_reg, mode=2, use_bias=use_bias)([sparse_input, dense_input]) 190 | elif len(linear_emb_list[0]) > 0: 191 | sparse_input = Utils.concat_func(linear_emb_list[i]) 192 | linear_logit = Linear(l2_reg, mode=0, use_bias=use_bias)(sparse_input) 193 | elif len(dense_input_list) > 0: 194 | dense_input = Utils.concat_func(dense_input_list) 195 | linear_logit = Linear(l2_reg, mode=1, use_bias=use_bias)(dense_input) 196 | else: 197 | raise NotImplementedError 198 | linear_logit_list.append(linear_logit) 199 | 200 | return Utils.concat_func(linear_logit_list) 201 | 202 | 203 | def embedding_lookup(sparse_embedding_dict, sparse_input_dict, sparse_feature_columns, return_feat_list=(), 204 | mask_feat_list=()): 205 | embedding_vec_list = [] 206 | for fc in sparse_feature_columns: 207 | feature_name = fc.name 208 | embedding_name = fc.embedding_name 209 | if (len(return_feat_list) == 0 or feature_name in return_feat_list) and fc.embedding: 210 | if fc.use_hash: 211 | lookup_idx = Hash(fc.dimension, mask_zero=(feature_name in mask_feat_list))( 212 | sparse_input_dict[feature_name]) 213 | else: 214 | lookup_idx = sparse_input_dict[feature_name] 215 | 216 | embedding_vec_list.append(sparse_embedding_dict[embedding_name](lookup_idx)) 217 | 218 | return embedding_vec_list 219 | 220 | 221 | def get_dense_input(features, feature_columns): 222 | dense_feature_columns = list(filter(lambda x: isinstance(x, DenseFeat), feature_columns)) if feature_columns else [] 223 | dense_input_list = [] 224 | for fc in dense_feature_columns: 225 | dense_input_list.append(features[fc.name]) 226 | return dense_input_list 227 | 228 | 229 | def input_from_feature_columns(features, feature_columns, embedding_size, l2_reg, init_std, seed, prefix='', 230 | seq_mask_zero=True, support_dense=True, initializer_mode="random_normal"): 231 | sparse_feature_columns = list( 232 | filter(lambda x: isinstance(x, SparseFeat), feature_columns)) if feature_columns else [] 233 | 234 | embedding_dict = create_embedding_matrix(feature_columns, l2_reg, init_std, seed, embedding_size, prefix=prefix, 235 | seq_mask_zero=seq_mask_zero, 
initializer_mode=initializer_mode) 236 | sparse_embedding_list = embedding_lookup( 237 | embedding_dict, features, sparse_feature_columns) 238 | dense_value_list = get_dense_input(features, feature_columns) 239 | if not support_dense and len(dense_value_list) > 0: 240 | raise ValueError("DenseFeat is not supported in dnn_feature_columns") 241 | 242 | return sparse_embedding_list, dense_value_list 243 | -------------------------------------------------------------------------------- /rec_alg/components/multi_hash_codebook_layer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | import copy 4 | import itertools 5 | 6 | import tensorflow as tf 7 | from tensorflow.python.keras.layers import Layer, Embedding 8 | from tensorflow.python.keras.regularizers import l2 9 | 10 | from rec_alg.common.utils import Utils 11 | from rec_alg.components.inputs import get_embedding_initializer 12 | from rec_alg.components.layers import StrongHash, SENETLayer 13 | 14 | 15 | class MultiHashCodebookLayer(Layer): 16 | def __init__(self, num_buckets, embedding_size, bucket_mode="hash-share", initializer_mode="random_normal", 17 | init_std=0.01, l2_reg=0.0, seed=1024, num_hash=1, merge_mode="concat", output_dims=0, params={}, 18 | hash_float_precision=12, interact_orders=(2,), interact_modes=("senetsum",), **kwargs): 19 | """ 20 | Implementation of the multi-hash codebook network (HCNet) in MemoNet; only supports taking all features as input 21 | :param num_buckets: number of codewords 22 | :param embedding_size: dimension of codeword 23 | :param bucket_mode: mode of codeword, supports hash-share, hash-private 24 | :param initializer_mode: initializer of the codebook 25 | :param init_std: init std of the codebook 26 | :param l2_reg: L2 regularization strength of the codebook 27 | :param seed: seed 28 | :param num_hash: number of hash functions 29 | :param merge_mode: merge mode of the different codewords of a feature, supports concat, senetorigin 30 | :param output_dims: output dim of HCNet 31 | :param params: dict of extra params 32 | :param hash_float_precision: precision for float inputs 33 | :param interact_orders: orders of interaction. 
For example [2,3] means 2-order and 3-order interactions 34 | :param interact_modes: mode of interaction, supports "sum", "senetsum" 35 | :param kwargs: 36 | """ 37 | self.num_buckets = num_buckets 38 | self.embedding_size = embedding_size 39 | self.bucket_mode = bucket_mode 40 | self.initializer_mode = initializer_mode 41 | self.init_std = init_std 42 | self.l2_reg = l2_reg 43 | self.seed = seed 44 | self.num_hash = num_hash 45 | self.merge_mode = merge_mode 46 | self.params = copy.deepcopy(params) 47 | self.hash_float_precision = hash_float_precision 48 | self.interact_orders = interact_orders 49 | self.interact_modes = list(interact_modes) + [interact_modes[-1] for _ in 50 | range(len(interact_modes), len(interact_orders))] 51 | self.output_dims = self._get_output_dims(output_dims=output_dims) 52 | 53 | self.interact_indexes = None # [order_n=[field_x=[target_idx]]] 54 | self.senet_layer = None 55 | self.field_interaction_idx = None 56 | # Max 11 Hash Functions 57 | self.hash_keys = [[7744, 1822], 58 | [423, 6649], 59 | [3588, 8319], 60 | [8220, 7283], 61 | [1965, 9209], 62 | [4472, 1935], 63 | [3987, 4403], 64 | [2379, 2870], 65 | [5638, 2954], 66 | [2211, 2], 67 | [6075, 9105]] 68 | self.field_size = None 69 | self.hash_layers = [] 70 | self.embedding_layer = None 71 | self.hash_merge_layer = None 72 | self.interact_mode_layers = [] 73 | self.transform_layer = None 74 | self.field_tokens = [] 75 | self._init() 76 | super(MultiHashCodebookLayer, self).__init__(**kwargs) 77 | 78 | def _init(self): 79 | self.outer_interact_mode = self.params.get("interact_mode", None) 80 | return 81 | 82 | def build(self, input_shape): 83 | self.field_size = len(input_shape[0]) 84 | self.interact_indexes = [self.get_field_interaction_idx(order_n, self.field_size) 85 | for order_n in self.interact_orders] 86 | 87 | # Hash Layers 88 | strong_hash = True if self.num_hash > 1 else False 89 | for i in range(self.num_hash): 90 | hash_layer = StrongHash(num_buckets=self.num_buckets, mask_zero=False, strong=strong_hash, 91 | key=self.hash_keys[i]) 92 | self.hash_layers.append(hash_layer) 93 | 94 | # Codebooks 95 | self.embedding_layer = [] 96 | num_embeddings = 1 97 | if self.bucket_mode == "hash-share": 98 | num_embeddings = 1 99 | elif self.bucket_mode == "hash-private": 100 | num_embeddings = self.num_hash 101 | for _ in self.interact_orders: 102 | self.embedding_layer.append([Embedding(input_dim=self.num_buckets, output_dim=self.embedding_size, 103 | embeddings_initializer=get_embedding_initializer( 104 | initializer_mode=self.initializer_mode, 105 | mean=0.0, 106 | stddev=self.init_std, 107 | seed=self.seed), 108 | embeddings_regularizer=l2(self.l2_reg)) 109 | for _ in range(num_embeddings)]) 110 | 111 | # Linear Memory Restoring (LMR) and Attentive Memory Restoring (AMR) 112 | if "senetorigin" in self.merge_mode: 113 | self.senet_layer = [SENETLayer( 114 | senet_squeeze_mode="bit", senet_reduction_ratio=1.0, senet_excitation_mode="bit", 115 | senet_activation="none", seed=self.seed, 116 | output_weights=True, output_field_size=self.num_hash, output_embedding_size=self.embedding_size 117 | ) for _ in self.interact_orders] 118 | self.transform_layer = tf.keras.layers.Dense(self.output_dims, activation=None, 119 | use_bias=False, name="hash_merge_final_transform") 120 | 121 | # Feature Shrinking-Global Attentive Shrinking(GAS) 122 | for idx, interact_mode in enumerate(self.interact_modes): 123 | num_interact = len(list(itertools.combinations(range(self.field_size), self.interact_orders[idx]))) 124 | 
interact_mode_layer = SENETLayer( 125 | senet_squeeze_mode="bit", senet_reduction_ratio=1.0, senet_excitation_mode="vector", 126 | senet_activation="none", seed=self.seed, 127 | output_weights=True, output_field_size=num_interact) 128 | self.interact_mode_layers.append(interact_mode_layer) 129 | 130 | for i in range(self.field_size): 131 | self.field_tokens.append(tf.constant(str(i), dtype=tf.string, shape=(1, 1))) 132 | super(MultiHashCodebookLayer, self).build(input_shape) # Be sure to call this somewhere! 133 | 134 | def call(self, inputs, training=None, **kwargs): 135 | """ 136 | :param inputs: list of [placeholder_inputs, origin_embeddings] 137 | placeholder_inputs: raw feature inputs, each of shape (?, length); origin_embeddings: original feature embeddings 138 | :param training: 139 | :param kwargs: 140 | :return: (outputs, interact_field_weights) 141 | """ 142 | placeholder_inputs, origin_embeddings = inputs[0], inputs[1] 143 | input_list = [] 144 | batch_size = tf.shape(placeholder_inputs[0])[0] 145 | # 1. Multi-Hash Addressing 146 | # 1.1. Obtain all cross features 147 | for i in range(self.field_size): 148 | field_token = tf.tile(self.field_tokens[i], [batch_size, 1]) 149 | if placeholder_inputs[i].dtype == tf.float32 or placeholder_inputs[i].dtype == tf.float16 or \ 150 | placeholder_inputs[i].dtype == tf.float64: 151 | item = tf.strings.as_string(placeholder_inputs[i], precision=self.hash_float_precision) 152 | else: 153 | item = tf.strings.as_string(placeholder_inputs[i]) 154 | # Process VarLenSparse Feature 155 | if item.shape[-1].value > 1: 156 | item = tf.expand_dims(tf.strings.reduce_join(item, axis=-1, separator="-"), axis=1) 157 | field_item = tf.strings.reduce_join([field_token, item], axis=0, separator="_") 158 | input_list.append(field_item) 159 | 160 | interact_tokens = [] 161 | for order_n in self.interact_orders: 162 | tokens = self.get_high_order_tokens(input_list=input_list, order_n=order_n, field_size=self.field_size) 163 | interact_tokens.append(tokens) 164 | 165 | interact_embeddings = [] 166 | interact_field_weights = [] 167 | for idx, tokens in enumerate(interact_tokens): 168 | # 2. Multi-Hash Addressing && Memory Restoring 169 | embeddings = self.get_embeddings_from_tokens(tokens, origin_embeddings, interact_order_idx=idx) 170 | # 3. 
Feature Shrinking 171 | if self.interact_modes[idx] == "sum": 172 | # [batch, fields, field_interact_num, embedding_size] 173 | field_embeddings = tf.gather(embeddings, self.interact_indexes[idx], axis=1) 174 | # [batch,fields, embedding_size] 175 | embeddings = tf.reduce_sum(field_embeddings, axis=-2, keepdims=False) 176 | elif "senetsum" in self.interact_modes[idx]: 177 | # [batch, fields, field_interact_num, embedding_size] 178 | field_embeddings = tf.gather(embeddings, self.interact_indexes[idx], axis=1) 179 | # [batch, num_interact, vector=1|bit=output_dims] 180 | weights = self.interact_mode_layers[idx](origin_embeddings) 181 | # [batch, fields, field_interact_num, 1] 182 | field_weights = tf.gather(weights, self.interact_indexes[idx], axis=1) 183 | if "softmax" in self.interact_modes[idx]: 184 | temperature = self.get_float_from_param(self.interact_modes[idx], default_value=1.0) 185 | field_weights = tf.nn.softmax(field_weights / temperature, axis=-2) 186 | interact_field_weights.append(field_weights) 187 | # [batch, fields, embedding_size] 188 | weighted_field_embeddings = field_embeddings * field_weights 189 | # [batch,fields, embedding_size] 190 | embeddings = tf.reduce_sum(weighted_field_embeddings, axis=-2, keepdims=False) 191 | interact_embeddings.append(embeddings) 192 | 193 | outputs = tf.concat(interact_embeddings, axis=1) 194 | return outputs, interact_field_weights 195 | 196 | def get_float_from_param(self, param, default_value=0.0): 197 | value = default_value 198 | try: 199 | value = float(param.split('-')[-1]) 200 | except: 201 | pass 202 | return value 203 | 204 | @staticmethod 205 | def get_high_order_tokens(input_list, order_n, field_size): 206 | """ 207 | Get high order tokens from input_list 208 | :param input_list: 209 | :param order_n: 210 | :param field_size: 211 | :return: 212 | """ 213 | interact_token_list = [] 214 | for idx_tuples in itertools.combinations(range(field_size), order_n): 215 | input_items = [input_list[idx] for idx in idx_tuples] 216 | interact_token = tf.strings.reduce_join(input_items, axis=0, separator="_") 217 | interact_token_list.append(interact_token) 218 | tokens = Utils.concat_func(interact_token_list, axis=1) 219 | return tokens 220 | 221 | def get_embeddings_from_tokens(self, tokens, origin_embeddings, interact_order_idx=0): 222 | # Multi-Hash Addressing. 
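# How the multi-hash addressing below works: every cross-feature token is
# hashed by num_hash independent hash functions, and each resulting index
# addresses one row (codeword) of a codebook Embedding. As an illustration
# with made-up values: for num_hash=2 and num_buckets=1000, a token such as
# "0_3_1_7" might map to rows 417 and 88, i.e. two codeword embeddings per
# token. In "hash-share" mode all hash functions read the same codebook;
# in "hash-private" mode each hash function owns a separate codebook.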
223 | hash_embedding_list = [] 224 | for i in range(self.num_hash): 225 | hash_idx = self.hash_layers[i](tokens) 226 | if self.bucket_mode == "hash-share": 227 | hash_embedding = self.embedding_layer[interact_order_idx][0](hash_idx) 228 | elif self.bucket_mode == "hash-private": 229 | hash_embedding = self.embedding_layer[interact_order_idx][i](hash_idx) 230 | else: 231 | raise Exception("Unknown bucket_mode: {}".format(self.bucket_mode)) 232 | hash_embedding_list.append(hash_embedding) # [(batch, num_interact, embeddings)] 233 | 234 | # Memory Restoring 235 | if "concat" in self.merge_mode: 236 | embeddings = Utils.concat_func(hash_embedding_list, axis=-1) # [batch, num_interact, num_hash*embeddings] 237 | elif "senetorigin" in self.merge_mode: 238 | reweight_embeddings = tf.stack(hash_embedding_list, axis=-2) # [batch, num_interacts, num_hash, embeddings] 239 | 240 | embeddings = self._merge_by_senet_origin(input_embeddings=origin_embeddings, # [batch, fields, embeddings] 241 | reweight_embeddings=reweight_embeddings, 242 | interact_order_idx=interact_order_idx) 243 | else: 244 | raise Exception("MultiHashCodebookLayer: unknown hash_merge_mode: {}".format(self.merge_mode)) 245 | 246 | # Used to decrease outputs dimensions 247 | embeddings = self.transform_layer(embeddings) 248 | return embeddings 249 | 250 | def _merge_by_senet_origin(self, input_embeddings, reweight_embeddings, interact_order_idx=0): 251 | num_interacts = reweight_embeddings.shape[1].value 252 | # [batch*num_interact, num_hash, embeddings] 253 | reweight_embeddings = tf.reshape(reweight_embeddings, (-1, self.num_hash, self.embedding_size)) 254 | origin_embedding_size = input_embeddings.shape[-1].value 255 | split_embeddings = tf.split(input_embeddings, self.field_size, axis=1) # [(batch, 1, embeddings)] 256 | interact_embedding_list = [] 257 | for idx_tuples in itertools.combinations(range(self.field_size), self.interact_orders[interact_order_idx]): 258 | # [batch, order_n, embeddings] 259 | interact_input_embeddings = [split_embeddings[idx] for idx in idx_tuples] 260 | interact_embedding = Utils.concat_func(interact_input_embeddings, axis=1) 261 | interact_embedding_list.append(interact_embedding) 262 | interact_embeddings = tf.stack(interact_embedding_list, axis=1) # [batch, num_interact, order_n, embeddings] 263 | 264 | # Inputs of weights net, [batch*num_interacts, order_n, embeddings] 265 | inputs = tf.reshape(interact_embeddings, (-1, interact_embeddings.shape[-2], origin_embedding_size)) 266 | # Weights 267 | weights = self.senet_layer[interact_order_idx](inputs) # [batch*num_interact, fields, fields=1|bit=embeddings] 268 | # Reweight 269 | reweight_outputs = reweight_embeddings * weights 270 | # Reshape 271 | reshape_outputs = tf.reshape(reweight_outputs, (-1, num_interacts, self.num_hash, self.embedding_size)) 272 | # Concat [batch, num_interact, num_hash*embeddings] 273 | outputs = tf.reshape(reshape_outputs, (-1, num_interacts, self.num_hash * self.embedding_size)) 274 | return outputs 275 | 276 | def _get_output_dims(self, output_dims): 277 | output_dims = output_dims if output_dims and output_dims > 0 else self.embedding_size 278 | return output_dims 279 | 280 | @staticmethod 281 | def get_field_interaction_idx(order_n, field_size): 282 | # 2-order: [(1,2), (1,3), ..., (1, N), ..., (N-1,N)] 283 | field_interaction_idx = [[] for _ in range(field_size)] 284 | 285 | idx_dict = dict() 286 | target_idx = 0 287 | for idx_tuples in itertools.combinations(range(field_size), order_n): 288 | for idx in idx_tuples: 
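# A concrete trace of this indexing (order_n=2, field_size=4): the
# combinations come out as (0,1),(0,2),(0,3),(1,2),(1,3),(2,3) with
# target_idx 0..5, so field 0 collects interaction indices [0, 1, 2]
# and field 3 collects [2, 4, 5].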
289 | field_interaction_idx[idx].append(target_idx) 290 | idx_dict[target_idx] = idx_tuples 291 | target_idx += 1 292 | 293 | # print(idx_dict) 294 | for i in range(field_size): 295 | field_interaction_idx[i] = sorted(field_interaction_idx[i]) 296 | return field_interaction_idx 297 | 298 | def compute_output_shape(self, input_shape): 299 | output_dims = self.field_size * (self.field_size - 1) // 2 300 | if "concat" in self.merge_mode: 301 | shape = (input_shape[0][0], int(output_dims * self.num_hash), self.embedding_size) 302 | else: 303 | shape = (input_shape[0][0], output_dims, self.embedding_size) 304 | return shape 305 | 306 | def get_config(self, ): 307 | config = {'num_buckets': self.num_buckets, 'embedding_size': self.embedding_size, 308 | 'initializer_mode': self.initializer_mode, 'init_std': self.init_std, 'l2_reg': self.l2_reg, 309 | 'seed': self.seed, "num_hash": self.num_hash, "hash_merge_mode": self.merge_mode} 310 | base_config = super(MultiHashCodebookLayer, self).get_config() 311 | return dict(list(base_config.items()) + list(config.items())) 312 | -------------------------------------------------------------------------------- /rec_alg/model/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | -------------------------------------------------------------------------------- /rec_alg/model/base_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | 4 | import numpy as np 5 | import pandas as pd 6 | import tensorflow as tf 7 | from tensorflow import keras 8 | 9 | from rec_alg.common.batch_generator import BatchGenerator 10 | from rec_alg.common.constants import Constants 11 | from rec_alg.common.data_loader import DataLoader 12 | from rec_alg.components.inputs import SparseFeat, DenseFeat, VarLenSparseFeat 13 | 14 | 15 | class BaseModel(object): 16 | """ 17 | BaseModel 18 | """ 19 | CHECKPOINT_TEMPLATE = "cp-{epoch:04d}.ckpt" 20 | CHECKPOINT_RE_TEMPLATE = "^cp-(.*).ckpt" 21 | 22 | def __init__(self, config_path, **kwargs): 23 | # Load config for config_path 24 | self.config = DataLoader.load_config_dict(config_path=config_path) 25 | self.model_suffix = kwargs.get("model_path", "") 26 | self.model_config = self.config.get("model", {}) 27 | self.sequence_max_len = kwargs.get("sequence_max_len", 0) 28 | # Load dataset feature info 29 | self._load_feature_info() 30 | # Create dirs 31 | self._create_dirs() 32 | # Get input/output files 33 | self._get_input_output_files() 34 | 35 | self.args = getattr(self, "args", None) 36 | return 37 | 38 | def _load_feature_info(self): 39 | """ 40 | Load feature info of a dataset from config 41 | :return: 42 | """ 43 | self.features = [] 44 | self.sparse_features = [] 45 | self.dense_features = [] 46 | self.varlen_features = [] 47 | for feature in self.config["features"]: 48 | if feature["type"] == Constants.FEATURE_TYPE_SPARSE: 49 | sparse_feature = SparseFeat(name=feature["name"], dimension=feature["dimension"], 50 | use_hash=feature["use_hash"], 51 | dtype=feature["dtype"], embedding=feature["embedding"], 52 | embedding_name=feature.get("embedding_name", None), 53 | feature_num=feature.get("feature_num", None), 54 | feature_origin_num=feature.get("feature_origin_num", None), 55 | feature_info_gain=feature.get("feature_info_gain", None), 56 | feature_ig=feature.get("feature_ig", None), 57 | feature_attention=feature.get("feature_attention", None)) 58 | 
self.sparse_features.append(sparse_feature) 59 | self.features.append(sparse_feature) 60 | elif feature["type"] == Constants.FEATURE_TYPE_DENSE: 61 | dense_feature = DenseFeat(name=feature["name"], dimension=feature.get("dimension", 1), 62 | dtype=feature["dtype"], 63 | feature_num=feature.get("feature_num", None), 64 | feature_origin_num=feature.get("feature_origin_num", None), 65 | feature_info_gain=feature.get("feature_info_gain", None), 66 | feature_ig=feature.get("feature_ig", None), 67 | feature_attention=feature.get("feature_attention", None)) 68 | self.dense_features.append(dense_feature) 69 | self.features.append(dense_feature) 70 | elif feature["type"] == Constants.FEATURE_TYPE_VARLENSPARSE: 71 | varlen_feature = VarLenSparseFeat( 72 | name=feature["name"], dimension=feature["dimension"], 73 | maxlen=feature["maxlen"] if self.sequence_max_len <= 0 else self.sequence_max_len, 74 | combiner=feature["combiner"], 75 | use_hash=feature["use_hash"], 76 | dtype=feature["dtype"], embedding=feature["embedding"], 77 | embedding_name=feature.get("embedding_name", None), 78 | feature_num=feature.get("feature_num", None), 79 | feature_origin_num=feature.get("feature_origin_num", None), 80 | feature_info_gain=feature.get("feature_info_gain", None), 81 | feature_ig=feature.get("feature_ig", None), 82 | feature_attention=feature.get("feature_attention", None)) 83 | self.varlen_features.append(varlen_feature) 84 | self.features.append(varlen_feature) 85 | return True 86 | 87 | def _create_dirs(self): 88 | self.checkpoint_dir = os.path.join(self.model_config.get("results_prefix", "./data/model/"), 89 | self.args.model_path, "checkpoint") 90 | self.model_file_dir = os.path.join(self.model_config.get("model_file_path", "./data/model/"), 91 | self.args.model_path, "model") 92 | DataLoader.validate_or_create_dir(self.checkpoint_dir) 93 | DataLoader.validate_or_create_dir(self.model_file_dir) 94 | return 95 | 96 | def _get_input_output_files(self): 97 | train_paths = self.args.train_paths if self.args.train_paths else self.model_config["train_paths"] 98 | valid_paths = self.args.valid_paths if self.args.valid_paths else self.model_config["valid_paths"] 99 | test_paths = self.args.test_paths if self.args.test_paths else self.model_config["test_paths"] 100 | print("Train paths: ", self.model_config["data_prefix"], train_paths) 101 | print("Valid paths: ", self.model_config["data_prefix"], valid_paths) 102 | print("Test paths: ", self.model_config["data_prefix"], test_paths) 103 | self.train_files = DataLoader.get_files(self.model_config["data_prefix"], train_paths) 104 | self.valid_files = DataLoader.get_files(self.model_config["data_prefix"], valid_paths) 105 | self.test_files = DataLoader.get_files(self.model_config["data_prefix"], test_paths) 106 | self.train_results_file = os.path.join(self.model_config["results_prefix"], 107 | self.args.model_path, 108 | self.model_config["train_results_file"]) 109 | self.test_results_file = os.path.join(self.model_config["results_prefix"], 110 | self.args.model_path, 111 | self.model_config["test_results_file"]) 112 | return 113 | 114 | def create_model(self): 115 | """ 116 | Create the concrete keras.Model; to be implemented by subclasses 117 | :return: 118 | """ 119 | raise NotImplementedError() 120 | 121 | def create_checkpoint_callback(self, save_weights_only=True, period=1): 122 | """ 123 | Create the checkpoint callback 124 | :return: callback 125 | """ 126 | checkpoint_path = "{checkpoint_dir}/{name}".format(checkpoint_dir=self.checkpoint_dir, 127 | name=self.CHECKPOINT_TEMPLATE) 128 | 
cp_callback = tf.keras.callbacks.ModelCheckpoint( 129 | filepath=checkpoint_path, 130 | verbose=1, 131 | save_weights_only=save_weights_only, 132 | period=period, ) 133 | return cp_callback 134 | 135 | def create_earlystopping_callback(self, monitor='val_loss', min_delta=0, patience=5, verbose=1, mode='auto', 136 | baseline=None, restore_best_weights=True): 137 | """ 138 | Create early stopping callback 139 | :return: callback 140 | """ 141 | es_callback = tf.keras.callbacks.EarlyStopping(monitor=monitor, min_delta=min_delta, patience=patience, 142 | verbose=verbose, mode=mode, baseline=baseline, 143 | restore_best_weights=restore_best_weights) 144 | return es_callback 145 | 146 | def restore_model_from_checkpoint(self, restore_epoch=-1): 147 | """ 148 | Restore model weights from checkpoint created by restore_epoch 149 | Notice: these are only weights in checkpoint file, model structure should be created by create_model 150 | :param restore_epoch: from which checkpoint to load weights 151 | :return: model, latest_epoch 152 | """ 153 | # Get checkpoint path 154 | latest_checkpoint_path = tf.train.latest_checkpoint(self.checkpoint_dir) 155 | latest_epoch = self._get_latest_epoch_from_checkpoint(latest_checkpoint_path) 156 | checkpoint_path = os.path.join(self.checkpoint_dir, self.CHECKPOINT_TEMPLATE.format( 157 | epoch=restore_epoch)) if 0 < restore_epoch <= latest_epoch else latest_checkpoint_path 158 | 159 | # create model and load weights from checkpoint 160 | model = self.create_model() 161 | model.load_weights(checkpoint_path).expect_partial() 162 | print("BaseModel::restore_model_from_checkpoint: restore model from checkpoint: ", checkpoint_path) 163 | return model, latest_epoch 164 | 165 | def _get_latest_epoch_from_checkpoint(self, latest_checkpoint): 166 | """ 167 | Get latest epoch from checkpoint path 168 | :param latest_checkpoint: 169 | :return: 170 | """ 171 | latest_epoch = 0 172 | regular = re.compile(self.CHECKPOINT_RE_TEMPLATE) 173 | try: 174 | checkpoint = os.path.basename(latest_checkpoint) 175 | match_result = regular.match(checkpoint) 176 | latest_epoch = int(match_result.group(1)) 177 | except Exception as e: 178 | print(e) 179 | return latest_epoch 180 | 181 | def run(self): 182 | if self.args.mode in (Constants.MODE_TRAIN, Constants.MODE_RETRAIN): 183 | self.train_model() 184 | self.test_model() 185 | elif self.args.mode == Constants.MODE_TEST: 186 | self.test_model() 187 | return True 188 | 189 | def train_model(self): 190 | """ 191 | Train model 192 | :return: history 193 | """ 194 | if self.args.mode == Constants.MODE_RETRAIN: 195 | model, latest_epoch = self.restore_model_from_checkpoint() 196 | else: 197 | model = self.create_model() 198 | latest_epoch = 0 199 | callbacks = [self.create_checkpoint_callback(), self.create_earlystopping_callback(), ] 200 | 201 | # 1. 
Get data from generator (single process & thread) 202 | train_steps = BatchGenerator.get_txt_dataset_length(self.train_files, batch_size=self.args.batch_size, 203 | drop_remainder=True) 204 | val_steps = BatchGenerator.get_txt_dataset_length(self.valid_files, batch_size=self.args.batch_size, 205 | drop_remainder=False) 206 | train_generator = BatchGenerator.generate_arrays_from_file(self.train_files, batch_size=self.args.batch_size, 207 | drop_remainder=True, features=self.features, 208 | shuffle=True) 209 | val_generator = BatchGenerator.generate_arrays_from_file(self.valid_files, batch_size=self.args.batch_size, 210 | drop_remainder=False, features=self.features, 211 | shuffle=False) 212 | print("Train files: ", self.train_files) 213 | print("Train steps: ", train_steps) 214 | print("Valid files: ", self.valid_files) 215 | print("Valid steps: ", val_steps) 216 | 217 | history = model.fit_generator(train_generator, steps_per_epoch=train_steps, epochs=self.args.epochs, 218 | verbose=self.args.verbose, validation_data=val_generator, 219 | validation_steps=val_steps, 220 | callbacks=callbacks, max_queue_size=10, workers=1, use_multiprocessing=False, 221 | shuffle=False, initial_epoch=latest_epoch) 222 | self._save_train_results(latest_epoch, history) 223 | return history 224 | 225 | def _save_train_results(self, latest_epoch, history): 226 | df = pd.DataFrame(history.history) 227 | df.insert(0, "epoch", range(latest_epoch + 1, latest_epoch + len(df) + 1)) 228 | if len(df) > 0: 229 | df.to_csv(self.train_results_file, sep="\t", float_format="%.5f", index=False, encoding="utf-8", mode="a") 230 | return 231 | 232 | def test_model(self): 233 | """ 234 | Test the model; supports testing from a specific checkpoint 235 | :return: 236 | """ 237 | restore_epochs = [] 238 | if not isinstance(self.args.restore_epochs, list) or len(self.args.restore_epochs) == 0: 239 | restore_epochs = np.arange(1, self.args.epochs + 1) 240 | elif len(self.args.restore_epochs) == 1: 241 | restore_epochs = np.arange(1, self.args.restore_epochs[0]) 242 | elif len(self.args.restore_epochs) == 2: 243 | restore_epochs = np.arange(self.args.restore_epochs[0], self.args.restore_epochs[1]) 244 | elif len(self.args.restore_epochs) >= 3: 245 | restore_epochs = np.arange(self.args.restore_epochs[0], self.args.restore_epochs[1], 246 | self.args.restore_epochs[2]) 247 | print("BaseModel::test_model: restore_epochs: {}".format(restore_epochs)) 248 | for restore_epoch in restore_epochs: 249 | self.test_model_from_checkpoint(restore_epoch) 250 | return True 251 | 252 | def test_model_from_checkpoint(self, restore_epoch=-1): 253 | model, latest_epoch = self.restore_model_from_checkpoint(restore_epoch=restore_epoch) 254 | test_steps = BatchGenerator.get_dataset_length(self.test_files, batch_size=self.args.batch_size, 255 | drop_remainder=False) 256 | test_generator = BatchGenerator.generate_arrays_from_file(self.test_files, batch_size=self.args.batch_size, 257 | features=self.features, drop_remainder=False, 258 | shuffle=False) 259 | print("Test files: ", self.test_files) 260 | print("Test steps: ", test_steps) 261 | predict_ans = model.evaluate_generator(test_generator, steps=test_steps, verbose=self.args.verbose) 262 | results_dict = dict(zip(model.metrics_names, predict_ans)) 263 | print("BaseModel::test_model_from_checkpoint: Epoch {} Evaluation results: {}".format(restore_epoch, 264 | results_dict)) 265 | self._save_test_results(restore_epoch, results_dict) 266 | return 267 | 268 | def _save_test_results(self, restore_epoch, 
results_dict): 269 | df = pd.DataFrame(columns=results_dict.keys()) 270 | df.loc[0] = list(results_dict.values()) 271 | df.insert(0, "epoch", "{}".format(restore_epoch)) 272 | if len(df) > 0: 273 | df.to_csv(self.test_results_file, sep="\t", float_format="%.5f", index=False, encoding="utf-8", mode="a") 274 | return 275 | -------------------------------------------------------------------------------- /rec_alg/model/fibinet/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | -------------------------------------------------------------------------------- /rec_alg/model/fibinet/fibinet_model.py: -------------------------------------------------------------------------------- 1 | import copy 2 | 3 | import tensorflow as tf 4 | from tensorflow.python.keras.layers import Dense, Add, Flatten, BatchNormalization 5 | 6 | from rec_alg.common.utils import Utils 7 | from rec_alg.components.inputs import build_input_features, get_linear_logit, input_from_feature_columns 8 | from rec_alg.components.layers import BilinearInteraction, DNNLayer, SENETLayer 9 | from rec_alg.components.layers import DenseEmbeddingLayer 10 | from rec_alg.components.layers import PredictionLayer 11 | from rec_alg.components.layers import SimpleLayerNormalization 12 | 13 | 14 | class FiBiNetModel(object): 15 | def __init__(self, params, feature_columns, embedding_size=10, embedding_l2_reg=0.0, embedding_dropout=0.0, 16 | sparse_embedding_norm_type="bn", dense_embedding_norm_type="layer_norm", 17 | dense_embedding_share_params=False, senet_squeeze_mode="group_mean_max", 18 | senet_squeeze_group_num=2, senet_squeeze_topk=1, senet_reduction_ratio=3.0, 19 | senet_excitation_mode="bit", senet_activation="none", senet_use_skip_connection=True, 20 | senet_reweight_norm_type="ln", origin_bilinear_type='all_ip', origin_bilinear_dnn_units=(50,), 21 | origin_bilinear_dnn_activation="linear", senet_bilinear_type='none', 22 | dnn_hidden_units=(400, 400, 400), dnn_l2_reg=0.0, dnn_use_bn=False, dnn_dropout=0.0, 23 | dnn_activation='relu', enable_linear=False, linear_l2_reg=0.0, init_std=0.01, seed=1024, 24 | task='binary'): 25 | """Instantiates the Feature Importance and Bilinear feature Interaction NETwork architecture. 26 | 27 | :param params: dict, contains all command line parameters 28 | :param feature_columns: An iterable containing all the features used by the model. 29 | :param embedding_size: positive integer,sparse feature embedding_size 30 | :param embedding_l2_reg: float. 
L2 regularize strength applied to embedding vector 31 | :param embedding_dropout: float, dropout probability applied to embedding vector 32 | :param sparse_embedding_norm_type: str, norm type for sparse fields, supports none,bn 33 | :param dense_embedding_norm_type: str, norm type for dense fields, supports none,layer_norm 34 | :param dense_embedding_share_params: bool, whether sharing parameters when using layer norm 35 | :param senet_reduction_ratio: integer in [1,inf), reduction ratio used in SENET Layer, default to 3 36 | :param senet_squeeze_mode: str, squeeze mode in SENet, support mean, max, min, topk 37 | :param senet_squeeze_topk: int, topk value when squeeze mode is topk 38 | :param senet_activation: str, activation for senet excitation, supports none, relu 39 | :param senet_use_skip_connection: bool, whether use skip connection in re-weights of SENet 40 | :param origin_bilinear_type: str, bilinear function type used in Bilinear Interaction Layer for the original 41 | embeddings, can be all, each, interaction 42 | :param senet_bilinear_type: str, bilinear function type used in Bilinear Interaction Layer for the senet 43 | embeddings, can be all, each, interaction 44 | :param dnn_hidden_units: list, list of positive integer or empty list, the layer number and units in each layer 45 | of DNN 46 | :param dnn_l2_reg: float. L2 regularize strength applied to DNN 47 | :param dnn_use_bn: bool, whether to use batch norm in DNN 48 | :param dnn_dropout: float in [0,1), the probability we will drop out a given DNN coordinate. 49 | :param dnn_activation: Activation function to use in DNN 50 | :param linear_l2_reg: float. L2 regularize strength applied to wide part 51 | :param init_std: float,to use as the initialize std of embedding vector 52 | :param seed: integer ,to use as random seed. 
53 | :param task: str, ``"binary"`` for binary log_loss or ``"regression"`` for regression loss 54 | :return: A FiBiNetModel instance 55 | """ 56 | super(FiBiNetModel, self).__init__() 57 | tf.compat.v1.set_random_seed(seed=seed) 58 | 59 | self.params = copy.deepcopy(params) 60 | self.feature_columns = feature_columns 61 | self.field_size = len(feature_columns) 62 | 63 | self.embedding_size = embedding_size 64 | self.embedding_l2_reg = embedding_l2_reg 65 | self.embedding_dropout = embedding_dropout 66 | self.sparse_embedding_norm_type = sparse_embedding_norm_type 67 | self.dense_embedding_norm_type = dense_embedding_norm_type 68 | self.dense_embedding_share_params = dense_embedding_share_params 69 | 70 | self.senet_squeeze_mode = senet_squeeze_mode 71 | self.senet_squeeze_group_num = senet_squeeze_group_num 72 | self.senet_squeeze_topk = senet_squeeze_topk 73 | self.senet_reduction_ratio = senet_reduction_ratio 74 | self.senet_excitation_mode = senet_excitation_mode 75 | self.senet_activation = senet_activation 76 | self.senet_use_skip_connection = senet_use_skip_connection 77 | self.senet_reweight_norm_type = senet_reweight_norm_type 78 | 79 | self.origin_bilinear_type = origin_bilinear_type 80 | self.origin_bilinear_dnn_units = origin_bilinear_dnn_units 81 | self.origin_bilinear_dnn_activation = origin_bilinear_dnn_activation 82 | self.senet_bilinear_type = senet_bilinear_type 83 | 84 | self.dnn_hidden_units = dnn_hidden_units 85 | self.dnn_activation = dnn_activation 86 | self.dnn_l2_reg = dnn_l2_reg 87 | self.dnn_use_bn = dnn_use_bn 88 | self.dnn_dropout = dnn_dropout 89 | 90 | self.enable_linear = enable_linear 91 | self.linear_l2_reg = linear_l2_reg 92 | 93 | self.init_std = init_std 94 | self.seed = seed 95 | self.task = task 96 | 97 | self.features = None 98 | self.inputs_list = None 99 | self.embeddings = None 100 | self.senet_embeddings = None 101 | self.origin_bilinear_out = None 102 | self.senet_bilinear_out = None 103 | self.output = None 104 | return 105 | 106 | def get_model(self): 107 | # Inputs 108 | self.features, self.inputs_list = self.get_inputs() 109 | 110 | # Embeddings 111 | self.embeddings = self.get_embeddings() 112 | self.senet_embeddings = self.get_senet_embeddings() 113 | 114 | # Bilinear interaction 115 | self.origin_bilinear_out = BilinearInteraction( 116 | bilinear_type=self.origin_bilinear_type, dnn_units=self.origin_bilinear_dnn_units, 117 | dnn_activation=self.origin_bilinear_dnn_activation, seed=self.seed)(self.embeddings) # [batch, 1, dim] 118 | self.senet_bilinear_out = BilinearInteraction(bilinear_type=self.senet_bilinear_type, seed=self.seed)( 119 | self.senet_embeddings) 120 | 121 | # DNN part 122 | dnn_input = Utils.concat_func([self.origin_bilinear_out, self.senet_bilinear_out]) 123 | flatten_dnn_input = Flatten()(dnn_input) 124 | dnn_out = DNNLayer(self.dnn_hidden_units, self.dnn_activation, self.dnn_l2_reg, self.dnn_dropout, 125 | self.dnn_use_bn, self.seed)(flatten_dnn_input) 126 | dnn_logit = Dense(1, use_bias=False, activation=None)(dnn_out) 127 | 128 | # Model 129 | if self.enable_linear: 130 | # Linear part 131 | linear_logit = get_linear_logit(self.features, self.feature_columns, l2_reg=self.linear_l2_reg, 132 | init_std=self.init_std, use_bias=False, 133 | seed=self.seed, prefix='linear') 134 | final_logit = Add()([linear_logit, dnn_logit]) 135 | else: 136 | final_logit = dnn_logit 137 | self.output = PredictionLayer(self.task, use_bias=True)(final_logit) 138 | model = tf.keras.models.Model(inputs=self.inputs_list, outputs=self.output) 
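# A summary of the graph assembled above: feature inputs -> embeddings
# (batch norm on sparse fields / layer norm on dense fields, as configured)
# -> SENet re-weighting -> two BilinearInteraction branches (one over the
# original embeddings, one over the SENet embeddings) -> concat -> flatten
# -> DNN -> 1-unit logit (plus an optional linear logit) -> PredictionLayer.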
139 | return model 140 | 141 | def get_inputs(self): 142 | features = build_input_features(self.feature_columns) 143 | inputs_list = list(features.values()) 144 | return features, inputs_list 145 | 146 | def get_embeddings(self, name_prefix=""): 147 | """ 148 | Get sparse+dense feature embeddings 149 | :return: 150 | """ 151 | # 1. sparse embedding 152 | sparse_embedding_list, dense_value_list = input_from_feature_columns(self.features, self.feature_columns, 153 | self.embedding_size, 154 | self.embedding_l2_reg, 155 | self.init_std, 156 | self.seed, 157 | prefix=name_prefix) 158 | sparse_embeddings = Utils.concat_func(sparse_embedding_list, axis=1) 159 | 160 | if self.sparse_embedding_norm_type in {"bn"}: 161 | sparse_embedding_norm_layer = BatchNormalization(axis=-1) 162 | sparse_embeddings = sparse_embedding_norm_layer(sparse_embeddings) 163 | 164 | # 2. dense embedding 165 | dense_embeddings = None 166 | if len(dense_value_list) > 0: 167 | dense_values = tf.stack(dense_value_list, axis=1) 168 | dense_embeddings = DenseEmbeddingLayer(embedding_size=self.embedding_size, init_std=self.init_std, 169 | embedding_l2_reg=self.embedding_l2_reg, seed=self.seed)(dense_values) 170 | # DenseEmbeddingNorm 171 | if self.dense_embedding_norm_type in {"layer_norm"}: 172 | dense_embedding_norm_layer = SimpleLayerNormalization(self.dense_embedding_share_params) 173 | dense_embeddings = dense_embedding_norm_layer(dense_embeddings) 174 | 175 | # 3. get the whole embeddings 176 | if len(dense_value_list) > 0: 177 | embeddings = Utils.concat_func([sparse_embeddings, dense_embeddings], axis=1) 178 | else: 179 | embeddings = sparse_embeddings 180 | return embeddings 181 | 182 | def get_senet_embeddings(self): 183 | outputs = SENETLayer(senet_squeeze_mode=self.senet_squeeze_mode, 184 | senet_squeeze_group_num=self.senet_squeeze_group_num, 185 | senet_squeeze_topk=self.senet_squeeze_topk, 186 | senet_reduction_ratio=self.senet_reduction_ratio, 187 | senet_excitation_mode=self.senet_excitation_mode, 188 | senet_activation=self.senet_activation, 189 | senet_use_skip_connection=self.senet_use_skip_connection, 190 | senet_reweight_norm_type=self.senet_reweight_norm_type, 191 | seed=self.seed)(self.embeddings) 192 | return outputs 193 | -------------------------------------------------------------------------------- /rec_alg/model/fibinet/run_fibinet.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | 4 | import tensorflow as tf 5 | from tensorflow import keras 6 | 7 | from rec_alg.common.data_loader import DataLoader 8 | from rec_alg.common.utils import Utils 9 | from rec_alg.model.base_model import BaseModel 10 | from rec_alg.model.fibinet.fibinet_model import FiBiNetModel 11 | 12 | 13 | # tf.enable_eager_execution() 14 | 15 | 16 | class FiBiNetRunner(BaseModel): 17 | """ 18 | train & test FiBiNetModel with supported args 19 | """ 20 | CHECKPOINT_TEMPLATE = "cp-{epoch:04d}.ckpt" 21 | CHECKPOINT_RE_TEMPLATE = "^cp-(.*).ckpt" 22 | 23 | def __init__(self): 24 | # Get args 25 | self.args = self.parse_args() 26 | self._update_parameters() 27 | print("Args: ", self.args.__dict__) 28 | args_dict = self.args.__dict__.copy() 29 | config_path = args_dict.pop("config_path") 30 | super(FiBiNetRunner, self).__init__(config_path=config_path, **args_dict) 31 | # Get input/output files 32 | self._get_input_output_files() 33 | return 34 | 35 | def _update_parameters(self): 36 | if self.args.version == "v1": 37 | parameters = { 38 | "sparse_embedding_norm_type": 
"none", 39 | "dense_embedding_norm_type": "none", 40 | "senet_squeeze_mode": "mean", 41 | "senet_squeeze_group_num": 1, 42 | "senet_excitation_mode": "vector", 43 | "senet_activation": "relu", 44 | "senet_use_skip_connection": False, 45 | "senet_reweight_norm_type": "none", 46 | "origin_bilinear_type": "all", 47 | "origin_bilinear_dnn_units": [], 48 | "origin_bilinear_dnn_activation": "linear", 49 | "senet_bilinear_type": "all", 50 | "enable_linear": True 51 | } 52 | self.args.__dict__.update(parameters) 53 | elif self.args.version == "++": 54 | parameters = { 55 | "sparse_embedding_norm_type": "bn", 56 | "dense_embedding_norm_type": "layer_norm", 57 | "senet_squeeze_mode": "group_mean_max", 58 | "senet_squeeze_group_num": 2, 59 | "senet_excitation_mode": "bit", 60 | "senet_activation": "none", 61 | "senet_use_skip_connection": True, 62 | "senet_reweight_norm_type": "ln", 63 | "origin_bilinear_type": "all_ip", 64 | "origin_bilinear_dnn_units": [50], 65 | "origin_bilinear_dnn_activation": "linear", 66 | "senet_bilinear_type": "none", 67 | "enable_linear": False, 68 | } 69 | self.args.__dict__.update(parameters) 70 | return 71 | 72 | def _get_input_output_files(self): 73 | train_paths = self.args.train_paths if self.args.train_paths else self.model_config["train_paths"] 74 | valid_paths = self.args.valid_paths if self.args.valid_paths else self.model_config["valid_paths"] 75 | test_paths = self.args.test_paths if self.args.test_paths else self.model_config["test_paths"] 76 | print("Train paths: ", self.model_config["data_prefix"], train_paths) 77 | print("Valid paths: ", self.model_config["data_prefix"], valid_paths) 78 | print("Test paths: ", self.model_config["data_prefix"], test_paths) 79 | self.train_files = DataLoader.get_files(self.model_config["data_prefix"], train_paths) 80 | self.valid_files = DataLoader.get_files(self.model_config["data_prefix"], valid_paths) 81 | self.test_files = DataLoader.get_files(self.model_config["data_prefix"], test_paths) 82 | self.train_results_file = os.path.join(self.model_config["results_prefix"], 83 | self.args.model_path, 84 | self.model_config["train_results_file"]) 85 | self.test_results_file = os.path.join(self.model_config["results_prefix"], 86 | self.args.model_path, 87 | self.model_config["test_results_file"]) 88 | return 89 | 90 | def parse_args(self): 91 | parser = argparse.ArgumentParser() 92 | # 1. Run setup 93 | parser.add_argument('--version', type=str, default="++", 94 | help="#version: version of fibinet model, support v1, ++ and custom") 95 | parser.add_argument('--config_path', type=str, default="./config/criteo/config_dense.json", 96 | help="#config path: path of config file which includes info of dataset features") 97 | parser.add_argument('--train_paths', type=Utils.str2liststr, 98 | default="part0,part1,part2,part3,part4,part5,part6,part7", 99 | help='#train_paths: training directories split with comma') 100 | parser.add_argument('--valid_paths', type=Utils.str2liststr, default="part8", 101 | help='#valid_paths: validation directories split with comma') 102 | parser.add_argument('--test_paths', type=Utils.str2liststr, default="part9", 103 | help='#test_paths: testing directories split with comma') 104 | # 2. Model architecture 105 | # 2.1. 
Embeddings 106 | parser.add_argument('--embedding_size', type=int, default=10, 107 | help='#embedding_size: feature embedding size') 108 | parser.add_argument('--embedding_l2_reg', type=float, default=0.0, 109 | help='#embedding_l2_reg: L2 regularizer strength applied to embedding') 110 | parser.add_argument('--embedding_dropout', type=float, default=0.0, 111 | help='#embedding_dropout: the probability of dropping out on embedding') 112 | parser.add_argument('--sparse_embedding_norm_type', type=str, default='bn', 113 | help='#sparse_embedding_norm_type: str, support: none, bn') 114 | parser.add_argument('--dense_embedding_norm_type', type=str, default='layer_norm', 115 | help='#dense_embedding_norm_type: str, support: none, layer_norm') 116 | parser.add_argument('--dense_embedding_share_params', type=Utils.str2bool, default=False, 117 | help='#dense_embedding_share_params: whether to share params among different fields') 118 | 119 | # 2.2. SENet 120 | parser.add_argument('--senet_squeeze_mode', type=str, default='group_mean_max', 121 | help='#senet_squeeze_mode: support: mean, max, topk, and group modes') 122 | parser.add_argument('--senet_squeeze_group_num', type=int, default=2, 123 | help='#senet_squeeze_group_num: effective only in group squeeze modes') 124 | parser.add_argument('--senet_squeeze_topk', type=int, default=1, 125 | help='#senet_squeeze_topk: positive integer, topk value') 126 | parser.add_argument('--senet_reduction_ratio', type=float, default=3.0, 127 | help='#senet_reduction_ratio: senet reduction ratio') 128 | parser.add_argument('--senet_excitation_mode', type=str, default="bit", 129 | help='#senet_excitation_mode: str, support: none(=squeeze_mode), vector|group|bit') 130 | parser.add_argument('--senet_activation', type=str, default='none', 131 | help='#senet_activation: activation function used in SENet Layer 2') 132 | parser.add_argument('--senet_use_skip_connection', type=Utils.str2bool, default=True, 133 | help='#senet_use_skip_connection: bool.') 134 | parser.add_argument('--senet_reweight_norm_type', type=str, default='ln', 135 | help='#senet_reweight_norm_type: none, ln') 136 | 137 | # 2.3. Bilinear type 138 | parser.add_argument('--origin_bilinear_type', type=str, default='all_ip', 139 | help='#origin_bilinear_type: bilinear type applied to original embeddings') 140 | parser.add_argument('--origin_bilinear_dnn_units', type=Utils.str2list, default=[50], 141 | help='#origin_bilinear_dnn_units: list') 142 | parser.add_argument('--origin_bilinear_dnn_activation', type=str, default='linear', 143 | help='#origin_bilinear_dnn_activation: Activation function to use in DNN') 144 | parser.add_argument('--senet_bilinear_type', type=str, default='none', 145 | help='#senet_bilinear_type: bilinear type applied to senet embeddings') 146 | 147 | # 2.4.
DNN part 148 | parser.add_argument('--dnn_hidden_units', type=Utils.str2list, default=[400, 400, 400], 149 | help='#dnn_hidden_units: layer number and units in each layer of DNN') 150 | parser.add_argument('--dnn_activation', type=str, default='relu', 151 | help='#dnn_activation: activation function used in DNN') 152 | parser.add_argument('--dnn_l2_reg', type=float, default=0.0, 153 | help='#dnn_l2_reg: L2 regularizer strength applied to DNN') 154 | parser.add_argument('--dnn_use_bn', type=Utils.str2bool, default=False, 155 | help='#dnn_use_bn: whether to use BatchNormalization before activation or not in DNN') 156 | parser.add_argument('--dnn_dropout', type=float, default=0.0, 157 | help='#dnn_dropout: the probability of dropping out on each layer of DNN') 158 | 159 | # 2.5. Linear part 160 | parser.add_argument('--enable_linear', type=Utils.str2bool, default=False, 161 | help='#enable_linear: bool, whether to use the linear part in the model') 162 | parser.add_argument('--linear_l2_reg', type=float, default=0.0, 163 | help='#linear_l2_reg: L2 regularizer strength applied to linear') 164 | 165 | # 3. Train/Valid/Test setup 166 | parser.add_argument('--seed', type=int, default=1024, help='#seed: integer, to use as random seed') 167 | parser.add_argument('--epochs', type=int, default=5) 168 | parser.add_argument('--batch_size', type=int, default=1024) 169 | parser.add_argument('--learning_rate', type=float, default=0.0001) 170 | parser.add_argument('--init_std', type=float, default=0.01) 171 | parser.add_argument('--verbose', type=int, default=1) 172 | parser.add_argument("--mode", type=str, default="train", help="support: train, retrain, test") 173 | parser.add_argument('--restore_epochs', type=Utils.str2list, default=[], 174 | help="restore weights from checkpoint, format like np.arange(), e.g.
[1, 5, 1]") 175 | parser.add_argument("--early_stopping", type=Utils.str2bool, default=True, help="enable early stopping") 176 | parser.add_argument("--model_path", type=str, default="fibinet", help="model_path, to avoid being covered") 177 | return parser.parse_args() 178 | 179 | def create_model(self): 180 | """ 181 | Create FiBiNet model 182 | :return: instance of FiBiNet model: tf.keras.Model 183 | """ 184 | fibinet = FiBiNetModel(params=self.args.__dict__, 185 | feature_columns=self.features, 186 | embedding_size=self.args.embedding_size, 187 | embedding_l2_reg=self.args.embedding_l2_reg, 188 | embedding_dropout=self.args.embedding_dropout, 189 | sparse_embedding_norm_type=self.args.sparse_embedding_norm_type, 190 | dense_embedding_norm_type=self.args.dense_embedding_norm_type, 191 | dense_embedding_share_params=self.args.dense_embedding_share_params, 192 | senet_squeeze_mode=self.args.senet_squeeze_mode, 193 | senet_squeeze_group_num=self.args.senet_squeeze_group_num, 194 | senet_squeeze_topk=self.args.senet_squeeze_topk, 195 | senet_reduction_ratio=self.args.senet_reduction_ratio, 196 | senet_excitation_mode=self.args.senet_excitation_mode, 197 | senet_activation=self.args.senet_activation, 198 | senet_use_skip_connection=self.args.senet_use_skip_connection, 199 | senet_reweight_norm_type=self.args.senet_reweight_norm_type, 200 | origin_bilinear_type=self.args.origin_bilinear_type, 201 | origin_bilinear_dnn_units=self.args.origin_bilinear_dnn_units, 202 | origin_bilinear_dnn_activation=self.args.origin_bilinear_dnn_activation, 203 | senet_bilinear_type=self.args.senet_bilinear_type, 204 | dnn_hidden_units=self.args.dnn_hidden_units, 205 | dnn_activation=self.args.dnn_activation, 206 | dnn_l2_reg=self.args.dnn_l2_reg, 207 | dnn_use_bn=self.args.dnn_use_bn, 208 | dnn_dropout=self.args.dnn_dropout, 209 | enable_linear=self.args.enable_linear, 210 | linear_l2_reg=self.args.linear_l2_reg, 211 | init_std=self.args.init_std, 212 | seed=self.args.seed, ) 213 | model = fibinet.get_model() 214 | # optimizer & loss & metrics 215 | optimizer = tf.keras.optimizers.Adam(learning_rate=self.args.learning_rate, beta_1=0.9, beta_2=0.999, 216 | epsilon=1e-8) 217 | loss = tf.keras.losses.BinaryCrossentropy() 218 | metrics = ["AUC", "binary_crossentropy"] 219 | model.compile(optimizer, loss, metrics=metrics) 220 | # model.run_eagerly = True 221 | # Print Info 222 | model.summary() 223 | tf.keras.utils.plot_model(model, os.path.join(self.model_file_dir, "fibinet.png"), show_shapes=True, 224 | show_layer_names=True) 225 | return model 226 | 227 | 228 | if __name__ == "__main__": 229 | runner = FiBiNetRunner() 230 | runner.run() 231 | pass 232 | -------------------------------------------------------------------------------- /rec_alg/model/memonet/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | -------------------------------------------------------------------------------- /rec_alg/model/memonet/memonet_model.py: -------------------------------------------------------------------------------- 1 | # -*- coding:utf-8 -*- 2 | import copy 3 | 4 | import tensorflow as tf 5 | 6 | from rec_alg.common import tf_utils 7 | from rec_alg.common.utils import Utils 8 | from rec_alg.components.inputs import build_input_features, input_from_feature_columns 9 | from rec_alg.components.layers import DenseEmbeddingLayer, PredictionLayer, DNNLayer 10 | from rec_alg.components.multi_hash_codebook_layer import 
MultiHashCodebookLayer 11 | from rec_alg.components.multi_hash_codebook_kif_layer import MultiHashCodebookKIFLayer 12 | 13 | 14 | class MemoNetModel(object): 15 | def __init__(self, feature_columns, params, embedding_size, embedding_l2_reg=0.0, embedding_dropout=0, 16 | dnn_hidden_units=(), dnn_activation='relu', dnn_l2_reg=0.0, dnn_use_bn=False, 17 | dnn_dropout=0.0, init_std=0.01, task='binary', seed=1024): 18 | super(MemoNetModel, self).__init__() 19 | tf.compat.v1.set_random_seed(seed=seed) 20 | 21 | self.feature_columns = feature_columns 22 | self.field_size = len(feature_columns) 23 | self.params = copy.deepcopy(params) 24 | 25 | self.embedding_size = embedding_size 26 | self.embedding_l2_reg = embedding_l2_reg 27 | self.embedding_dropout = embedding_dropout 28 | 29 | self.dnn_hidden_units = dnn_hidden_units 30 | self.dnn_activation = dnn_activation 31 | self.dnn_l2_reg = dnn_l2_reg 32 | self.dnn_use_bn = dnn_use_bn 33 | self.dnn_dropout = dnn_dropout 34 | 35 | self.init_std = init_std 36 | self.task = task 37 | self.seed = seed 38 | 39 | self.features = None 40 | self.inputs_list = None 41 | self.embeddings = None 42 | self.outputs = None 43 | 44 | self._init() 45 | return 46 | 47 | def _init(self): 48 | self.interact_mode = self.params.get("interact_mode", "fullhcnet") 49 | self.interaction_hash_embedding_buckets = self.params.get("interaction_hash_embedding_buckets", 100000) 50 | self.interaction_hash_embedding_size = self.params.get("interaction_hash_embedding_size", self.embedding_size) 51 | self.interaction_hash_embedding_bucket_mode = self.params.get("interaction_hash_embedding_bucket_mode", 52 | "hash-share") 53 | self.interaction_hash_embedding_num_hash = self.params.get("interaction_hash_embedding_num_hash", 2) 54 | self.interaction_hash_embedding_merge_mode = self.params.get("interaction_hash_embedding_merge_mode", "concat") 55 | self.interaction_hash_output_dims = self.params.get("interaction_hash_output_dims", 0) 56 | self.interaction_hash_embedding_float_precision = self.params.get( 57 | "interaction_hash_embedding_float_precision", 12) 58 | self.interaction_hash_embedding_interact_orders = self.params.get( 59 | "interaction_hash_embedding_interact_orders", (2,)) 60 | self.interaction_hash_embedding_interact_modes = self.params.get( 61 | "interaction_hash_embedding_interact_modes", ("none",)) 62 | self.interaction_hash_embedding_feature_metric = self.params.get("interaction_hash_embedding_feature_metric", 63 | "dimension") 64 | self.interaction_hash_embedding_feature_top_k = self.params.get("interaction_hash_embedding_feature_top_k", -1) 65 | return 66 | 67 | def get_model(self): 68 | self.features, self.inputs_list = self.get_inputs() 69 | self.embeddings = self.get_embeddings() 70 | 71 | interact_embeddings = [self.embeddings] 72 | if "fullhcnet" in self.interact_mode: 73 | multi_hash_codebook_layer = MultiHashCodebookLayer( 74 | name="multi_hash_codebook_layer", 75 | num_buckets=self.interaction_hash_embedding_buckets, 76 | embedding_size=self.interaction_hash_embedding_size, 77 | bucket_mode=self.interaction_hash_embedding_bucket_mode, 78 | init_std=self.init_std, 79 | l2_reg=self.embedding_l2_reg, 80 | seed=self.seed, 81 | num_hash=self.interaction_hash_embedding_num_hash, 82 | merge_mode=self.interaction_hash_embedding_merge_mode, 83 | output_dims=self.interaction_hash_output_dims, 84 | params=self.params, 85 | hash_float_precision=self.interaction_hash_embedding_float_precision, 86 | interact_orders=self.interaction_hash_embedding_interact_orders, 87 | 
interact_modes=self.interaction_hash_embedding_interact_modes, 88 | ) 89 | 90 | top_inputs_list, top_embeddings = tf_utils.get_top_inputs_embeddings( 91 | feature_columns=self.feature_columns, features=self.features, embeddings=self.embeddings, 92 | feature_importance_metric=self.interaction_hash_embedding_feature_metric, 93 | feature_importance_top_k=self.interaction_hash_embedding_feature_top_k) 94 | 95 | interaction_hash_embeddings, interact_field_weights = multi_hash_codebook_layer( 96 | [top_inputs_list, top_embeddings]) 97 | interact_embeddings.append(interaction_hash_embeddings) 98 | if "subsethcnet" in self.interact_mode: 99 | print("-----------------GetTopInputsAndEmbeddings------------------") 100 | top_inputs_list, top_embeddings, top_field_indexes = tf_utils.get_top_inputs_embeddings( 101 | feature_columns=self.feature_columns, features=self.features, embeddings=self.embeddings, 102 | feature_importance_metric=self.interaction_hash_embedding_feature_metric, 103 | feature_importance_top_k=self.interaction_hash_embedding_feature_top_k, 104 | return_feature_index=True, 105 | ) 106 | 107 | multi_hash_codebook_kif_layer = MultiHashCodebookKIFLayer( 108 | name="multi_hash_codebook_kif_layer", 109 | field_size=self.field_size, 110 | top_field_indexes=top_field_indexes, 111 | num_buckets=self.interaction_hash_embedding_buckets, 112 | embedding_size=self.interaction_hash_embedding_size, 113 | bucket_mode=self.interaction_hash_embedding_bucket_mode, 114 | init_std=self.init_std, 115 | l2_reg=self.embedding_l2_reg, 116 | seed=self.seed, 117 | num_hash=self.interaction_hash_embedding_num_hash, 118 | merge_mode=self.interaction_hash_embedding_merge_mode, 119 | output_dims=self.interaction_hash_output_dims, 120 | params=self.params, 121 | hash_float_precision=self.interaction_hash_embedding_float_precision, 122 | interact_orders=self.interaction_hash_embedding_interact_orders, 123 | interact_modes=self.interaction_hash_embedding_interact_modes, 124 | ) 125 | 126 | print("-----------------GetAllInputsAndEmbeddings------------------") 127 | all_inputs_list, all_embeddings, all_feature_indexes = tf_utils.get_top_inputs_embeddings( 128 | feature_columns=self.feature_columns, features=self.features, embeddings=self.embeddings, 129 | feature_importance_metric=self.interaction_hash_embedding_feature_metric, 130 | feature_importance_top_k=-1, 131 | return_feature_index=True, 132 | ) 133 | 134 | interaction_hash_embeddings, interact_field_weights = multi_hash_codebook_kif_layer( 135 | [all_inputs_list, all_embeddings]) 136 | interact_embeddings.append(interaction_hash_embeddings) 137 | 138 | self.outputs = self.to_predict(interact_embeddings) 139 | model = tf.keras.models.Model(inputs=self.inputs_list, outputs=self.outputs) 140 | return model 141 | 142 | def get_inputs(self): 143 | """ 144 | Build the inputs of the keras model 145 | :return: features dict and the list of keras input tensors 146 | """ 147 | features = build_input_features(self.feature_columns) 148 | inputs_list = list(features.values()) 149 | return features, inputs_list 150 | 151 | def get_embeddings(self, name_prefix=""): 152 | """ 153 | Build sparse & dense feature embeddings 154 | :return: concatenated per-field embeddings 155 | """ 156 | init_std = self.init_std * (self.embedding_size ** -0.5) 157 | sparse_embedding_list, dense_value_list = input_from_feature_columns(self.features, self.feature_columns, 158 | self.embedding_size, 159 | self.embedding_l2_reg, 160 | init_std, 161 | self.seed, 162 | prefix=name_prefix, ) 163 | sparse_embeddings = Utils.concat_func(sparse_embedding_list, axis=1) 164 | 165 | dense_embeddings = None
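# Annotation (not from the original source): init_std is scaled down by
# embedding_size ** -0.5 above, and the concatenated embeddings are scaled back
# up by embedding_size ** 0.5 below; this appears to mirror the scaled-embedding
# trick of Transformer-style models, keeping the scale of each embedding entry
# at init_std regardless of the configured embedding_size.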
166 | if len(dense_value_list) > 0: 167 | dense_values = tf.stack(dense_value_list, axis=1) 168 | dense_embeddings = DenseEmbeddingLayer( 169 | embedding_size=self.embedding_size, 170 | init_std=init_std, 171 | embedding_l2_reg=self.embedding_l2_reg)(dense_values) 172 | 173 | if len(dense_value_list) > 0: 174 | embeddings = Utils.concat_func([sparse_embeddings, dense_embeddings], axis=1) 175 | else: 176 | embeddings = sparse_embeddings 177 | embeddings *= self.embedding_size ** 0.5 178 | 179 | # dropout on the merged embeddings 180 | embeddings = tf.keras.layers.Dropout(self.embedding_dropout, name="origin_embeddings")(embeddings) 181 | return embeddings 182 | 183 | def to_predict(self, interact_embeddings): 184 | interact_embeddings = [tf.keras.layers.Flatten()(embeddings) for embeddings in interact_embeddings] 185 | dnn_inputs = Utils.concat_func(interact_embeddings, axis=1) 186 | dnn_outputs = DNNLayer(self.dnn_hidden_units, self.dnn_activation, self.dnn_l2_reg, self.dnn_dropout, 187 | self.dnn_use_bn, self.seed)(dnn_inputs) 188 | final_logit = tf.keras.layers.Dense(1, use_bias=True, activation=None, )(dnn_outputs) 189 | predict_outputs = PredictionLayer(self.task, use_bias=False)(final_logit) 190 | return predict_outputs 191 | -------------------------------------------------------------------------------- /rec_alg/model/memonet/run_memonet.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | 4 | import tensorflow as tf 5 | from tensorflow import keras 6 | 7 | 8 | from rec_alg.common.utils import Utils 9 | from rec_alg.model.base_model import BaseModel 10 | from rec_alg.model.memonet.memonet_model import MemoNetModel 11 | 12 | 13 | # tf.enable_eager_execution() 14 | 15 | 16 | class MemoNetRunner(BaseModel): 17 | 18 | def __init__(self): 19 | # Get args 20 | self.args = self.parse_args() 21 | print("Args: ", self.args.__dict__) 22 | args_dict = self.args.__dict__.copy() 23 | config_path = args_dict.pop("config_path") 24 | super(MemoNetRunner, self).__init__(config_path=config_path, **args_dict) 25 | return 26 | 27 | def parse_args(self): 28 | parser = argparse.ArgumentParser() 29 | # 1. Run setup 30 | parser.add_argument('--config_path', type=str, default="./config/criteo/config_dense.json", 31 | help="#config path: path of config file which includes info of dataset features") 32 | parser.add_argument('--train_paths', type=Utils.str2liststr, 33 | default="part0,part1,part2,part3,part4,part5,part6,part7", 34 | help='#train_paths: training directories split with comma') 35 | parser.add_argument('--valid_paths', type=Utils.str2liststr, default="part8", 36 | help='#valid_paths: validation directories split with comma') 37 | parser.add_argument('--test_paths', type=Utils.str2liststr, default="part9", 38 | help='#test_paths: testing directories split with comma') 39 | 40 | # 2. Model architecture 41 | # 2.1. Embedding 42 | parser.add_argument('--embedding_size', type=int, default=10, 43 | help='#embedding_size: feature embedding size') 44 | parser.add_argument('--embedding_l2_reg', type=float, default=0.0, 45 | help='#embedding_l2_reg: L2 regularizer strength applied to embedding') 46 | parser.add_argument('--embedding_dropout', type=float, default=0.0, 47 | help='#embedding_dropout: the probability of dropping out on embedding') 48 | 49 | # 2.2.
DNN part 50 | parser.add_argument('--dnn_hidden_units', type=Utils.str2list, default=[400, 400], 51 | help='#dnn_hidden_units: layer number and units in each layer of DNN') 52 | parser.add_argument('--dnn_activation', type=str, default='relu', 53 | help='#dnn_activation: activation function used in DNN') 54 | parser.add_argument('--dnn_l2_reg', type=float, default=0.0, 55 | help='#dnn_l2_reg: L2 regularizer strength applied to DNN') 56 | parser.add_argument('--dnn_use_bn', type=Utils.str2bool, default=False, 57 | help='#dnn_use_bn: whether to use BatchNormalization before activation or not in DNN') 58 | parser.add_argument('--dnn_dropout', type=float, default=0.0, 59 | help='#dnn_dropout: the probability of dropping out on each layer of DNN') 60 | 61 | # 2.3. Interact-mode 62 | parser.add_argument('--interact_mode', type=str, default='fullhcnet', 63 | help='#interact_mode: str, support: fullhcnet, subsethcnet') 64 | 65 | # 2.4. experience-embedding-hash (HCNet) 66 | parser.add_argument('--interaction_hash_embedding_buckets', type=int, default=100000, 67 | help='#interaction_hash_embedding_buckets: int') 68 | parser.add_argument('--interaction_hash_embedding_size', type=int, default=10, 69 | help='#interaction_hash_embedding_size: int') 70 | parser.add_argument('--interaction_hash_embedding_bucket_mode', type=str, default="hash-share", 71 | help='#interaction_hash_embedding_bucket_mode: str') 72 | parser.add_argument('--interaction_hash_embedding_num_hash', type=int, default=2, 73 | help='#interaction_hash_embedding_num_hash: int') 74 | parser.add_argument('--interaction_hash_embedding_merge_mode', type=str, default="concat", 75 | help='#interaction_hash_embedding_merge_mode: str') 76 | parser.add_argument('--interaction_hash_output_dims', type=int, default=0, 77 | help='#interaction_hash_output_dims: int') 78 | parser.add_argument('--interaction_hash_embedding_float_precision', type=int, default=12) 79 | parser.add_argument('--interaction_hash_embedding_interact_orders', type=Utils.str_to_type, default=[2, ], 80 | help='#interaction_hash_embedding_interact_orders: list') 81 | parser.add_argument('--interaction_hash_embedding_interact_modes', type=Utils.str_to_type, 82 | default=["senetsum", ], help='#interaction_hash_embedding_interact_modes: list') 83 | parser.add_argument('--interaction_hash_embedding_feature_metric', type=str, default="dimension") 84 | parser.add_argument('--interaction_hash_embedding_feature_top_k', type=int, default=-1) 85 | 86 | # 3. Train/Valid/Test setup 87 | parser.add_argument('--seed', type=int, default=1024, help='#seed: integer, to use as random seed') 88 | parser.add_argument('--epochs', type=int, default=3) 89 | parser.add_argument('--batch_size', type=int, default=1024) 90 | parser.add_argument('--learning_rate', type=float, default=0.001) 91 | parser.add_argument('--init_std', type=float, default=0.01) 92 | parser.add_argument('--verbose', type=int, default=1) 93 | parser.add_argument("--mode", type=str, default="train", help="support: train, retrain, test") 94 | parser.add_argument('--restore_epochs', type=Utils.str2list, default=[], 95 | help="restore weights from checkpoint, format like np.arange(), e.g.
[1, 5, 1]") 96 | parser.add_argument("--early_stopping", type=Utils.str2bool, default=True, help="enable early stopping") 97 | parser.add_argument("--model_path", type=str, default="rec_alg", help="model_path, to avoid being covered") 98 | return parser.parse_args() 99 | 100 | def create_model(self): 101 | memonet_model = MemoNetModel(feature_columns=self.features, 102 | params=self.args.__dict__, 103 | embedding_size=self.args.embedding_size, 104 | embedding_l2_reg=self.args.embedding_l2_reg, 105 | embedding_dropout=self.args.embedding_dropout, 106 | dnn_hidden_units=self.args.dnn_hidden_units, 107 | dnn_activation=self.args.dnn_activation, 108 | dnn_l2_reg=self.args.dnn_l2_reg, 109 | dnn_use_bn=self.args.dnn_use_bn, 110 | dnn_dropout=self.args.dnn_dropout, 111 | init_std=self.args.init_std, 112 | seed=self.args.seed, ) 113 | model = memonet_model.get_model() 114 | # optimizer & loss & metrics 115 | optimizer = tf.keras.optimizers.Adam(learning_rate=self.args.learning_rate, beta_1=0.9, beta_2=0.999, 116 | epsilon=1e-8) 117 | loss = tf.keras.losses.BinaryCrossentropy() 118 | metrics = ["AUC", "binary_crossentropy"] 119 | # Model compile 120 | model.compile(optimizer, loss, metrics=metrics, ) 121 | # model.run_eagerly = True 122 | # Print Info 123 | model.summary() 124 | keras.utils.plot_model(model, os.path.join(self.model_file_dir, "memonet.png"), show_shapes=True, 125 | show_layer_names=True) 126 | return model 127 | 128 | 129 | if __name__ == "__main__": 130 | runner = MemoNetRunner() 131 | runner.run() 132 | -------------------------------------------------------------------------------- /rec_alg/preprocessing/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding: utf-8 -------------------------------------------------------------------------------- /rec_alg/preprocessing/avazu/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | -------------------------------------------------------------------------------- /rec_alg/preprocessing/avazu/avazu_process.py: -------------------------------------------------------------------------------- 1 | from rec_alg.preprocessing.avazu.raw_data_process import RawDataProcess 2 | from rec_alg.preprocessing.kfold_process import KFoldProcess 3 | from rec_alg.preprocessing.sparse_process import SparseProcess 4 | 5 | 6 | def main(): 7 | chunksize = 10000000 8 | # Process raw data to one file 9 | raw_data = RawDataProcess(config_path="./config/avazu/config_template.json") 10 | raw_data.fit() 11 | raw_data.transform(sep=",", chunksize=chunksize) 12 | 13 | # Preprocessing categorical fields:Fill null and LowFreq value with index-base, and encode label 14 | sparse = SparseProcess(config_path=raw_data.target_config_path) 15 | sparse.fit(min_occurrences=4, index_base=0) 16 | sparse.transform() 17 | 18 | # K-Fold 19 | kfold = KFoldProcess(config_path=sparse.target_config_path) 20 | kfold.fit() 21 | kfold.transform(chunksize=chunksize, mode="fast") 22 | return 23 | 24 | 25 | if __name__ == "__main__": 26 | main() 27 | -------------------------------------------------------------------------------- /rec_alg/preprocessing/avazu/raw_data_process.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | 4 | import pandas as pd 5 | 6 | from rec_alg.common.data_loader import DataLoader 7 | from rec_alg.preprocessing.base_process 
-------------------------------------------------------------------------------- /rec_alg/preprocessing/avazu/raw_data_process.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | 4 | import pandas as pd 5 | 6 | from rec_alg.common.data_loader import DataLoader 7 | from rec_alg.preprocessing.base_process import BaseProcess 8 | 9 | 10 | class RawDataProcess(BaseProcess): 11 | 12 | def __init__(self, config_path, ): 13 | super(RawDataProcess, self).__init__(config_path) 14 | self.concat_path = self.config.get("base_info", {}).get("concat_path", None) 15 | self.target_config_path = "{dir}/{name}".format(dir=os.path.dirname(self.config_path), 16 | name="config_concat.json") 17 | 18 | self._init() 19 | return 20 | 21 | def _init(self): 22 | DataLoader.validate_or_create_dir(self.concat_path) 23 | return 24 | 25 | def fit(self): 26 | return 27 | 28 | def transform(self, sep="\t", chunksize=10000): 29 | target_file = os.path.join(self.concat_path, "train.txt") 30 | if os.path.exists(target_file): 31 | os.remove(target_file) 32 | 33 | label_index = 1 34 | iterator = pd.read_csv(os.path.join(self.train_path, "train.txt"), sep=sep, index_col=None, 35 | chunksize=chunksize, encoding="utf-8") 36 | for n, data_chunk in enumerate(iterator): 37 | print('RawDataProcess::transform: Size of loaded chunk: %i instances, %i features' % data_chunk.shape) 38 | print("RawDataProcess::transform: chunk counter: {}".format(n)) 39 | 40 | # Move the "label" column to the front 41 | cols = list(data_chunk.columns) 42 | cols.insert(0, cols.pop(label_index)) 43 | # Drop the "id" and "hour" columns 44 | cols.pop(1) 45 | cols.pop(1) 46 | df_target = data_chunk.loc[:, cols] 47 | 48 | df_target.to_csv(os.path.join(self.concat_path, "train.txt"), sep='\t', header=None, index=False, mode="a") 49 | 50 | self._update_config() 51 | return 52 | 53 | def _update_config(self): 54 | self.config["base_info"]["train_path"] = self.concat_path 55 | with open(self.target_config_path, 'w', encoding='utf-8') as json_file: 56 | json.dump(self.config, json_file, ensure_ascii=False, indent=4) 57 | return True 58 | -------------------------------------------------------------------------------- /rec_alg/preprocessing/base_process.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | from rec_alg.common.constants import Constants 4 | from rec_alg.common.data_loader import DataLoader 5 | 6 | 7 | class BaseProcess(object): 8 | """ 9 | Base class for dataset processing 10 | """ 11 | 12 | def __init__(self, config_path, **kwargs): 13 | self.config_path = config_path 14 | self.config = DataLoader.load_config_dict(config_path=config_path) 15 | self.logger = logging.getLogger(Constants.LOGGER_DEFAULT) 16 | self.label_index, self.sparse_feature_indexes, self.dense_feature_indexes, self.varlen_feature_origin_indexes, \ 17 | self.varlen_feature_target_indexes = self.get_feature_indexes() 18 | self.train_path = self.config.get("base_info", {}).get("train_path", None) 19 | self.train_files = DataLoader.list_dir_files(self.train_path) 20 | return 21 | 22 | def get_feature_indexes(self): 23 | label_index = 0 24 | sparse_feature_indexes = [] 25 | dense_feature_indexes = [] 26 | varlen_feature_origin_indexes = [] 27 | varlen_feature_target_indexes = [] 28 | features = [feature for feature in self.config.get("features", [])] 29 | if len(features) == 0: 30 | print("BaseProcess::get_feature_indexes: No features defined") 31 | exit(-1) 32 | index = 0 33 | for i, feature in enumerate(features): 34 | feature_type = feature.get("type", "") 35 | if feature_type == Constants.FEATURE_TYPE_LABEL: 36 | label_index = index 37 | index += 1 38 | elif feature_type == Constants.FEATURE_TYPE_SPARSE: 39 | sparse_feature_indexes.append(index) 40 | index += 1 41 | elif feature_type == Constants.FEATURE_TYPE_DENSE: 42 |
dense_feature_indexes.append(index) 43 | index += 1 44 | elif feature_type == Constants.FEATURE_TYPE_VARLENSPARSE: 45 | varlen_feature_origin_indexes.append(i) 46 | for j in range(index, index + feature.get("maxlen", 1)): 47 | varlen_feature_target_indexes.append(j) 48 | index += 1 49 | return label_index, sparse_feature_indexes, dense_feature_indexes, varlen_feature_origin_indexes, \ 50 | varlen_feature_target_indexes 51 | -------------------------------------------------------------------------------- /rec_alg/preprocessing/criteo/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | -------------------------------------------------------------------------------- /rec_alg/preprocessing/criteo/criteo_process.py: -------------------------------------------------------------------------------- 1 | from rec_alg.preprocessing.dense_process import DenseProcess 2 | from rec_alg.preprocessing.kfold_process import KFoldProcess 3 | from rec_alg.preprocessing.sparse_process import SparseProcess 4 | 5 | 6 | def main(config): 7 | chunksize = 10000 8 | # Preprocess categorical fields: encode labels, mapping nulls and low-frequency values to a shared index 9 | sparse = SparseProcess(config_path="./config/criteo/config_template.json") 10 | sparse.fit(min_occurrences=10) 11 | sparse.transform() 12 | 13 | # Preprocess dense fields: fill nulls with 0 and apply the configured scale transform 14 | config_path = sparse.target_config_path 15 | if config.get("enable_dense_scale", True): 16 | dense = DenseProcess(config_path=config_path) 17 | dense_transformer = config.get("dense_transformer", "scale_multi_min_max") 18 | if dense_transformer == "scale_ln2": 19 | dense.fit(dense.scale_ln2) 20 | elif dense_transformer == "scale_multi_min_max": 21 | dense.fit(dense.scale_multi_min_max) 22 | elif dense_transformer == "scale_ln0": 23 | dense.fit(dense.scale_ln0) 24 | else: 25 | raise Exception("Unsupported parameter dense_transformer: {}".format(dense_transformer)) 26 | dense.transform() 27 | config_path = dense.target_config_path 28 | 29 | # K-Fold 30 | # config_path = "./config/criteo/config_dense.json" 31 | kfold = KFoldProcess(config_path=config_path) 32 | kfold.fit() 33 | kfold.transform(chunksize=chunksize, mode="fast") 34 | return 35 | 36 | 37 | if __name__ == "__main__": 38 | config = { 39 | "enable_dense_scale": True, 40 | "dense_transformer": "scale_multi_min_max", 41 | } 42 | main(config) 43 | 44 | -------------------------------------------------------------------------------- /rec_alg/preprocessing/dense_process.py: -------------------------------------------------------------------------------- 1 | import json 2 | import math 3 | import os 4 | from collections import OrderedDict 5 | 6 | import numpy as np 7 | from sklearn.preprocessing import MinMaxScaler 8 | 9 | from rec_alg.common.data_loader import DataLoader 10 | from rec_alg.common.utils import Utils 11 | from rec_alg.preprocessing.base_process import BaseProcess 12 | 13 | 14 | class DenseProcess(BaseProcess): 15 | """ 16 | Preprocessing for dense fields in datasets 17 | """ 18 | 19 | def __init__(self, config_path, **kwargs): 20 | super(DenseProcess, self).__init__(config_path=config_path, **kwargs) 21 | self.dense_path = self.config.get("base_info", {}).get("dense_path", None) 22 | self.transformer = None 23 | self.target_config_path = "{dir}/{name}".format(dir=os.path.dirname(self.config_path), 24 | name="config_dense.json") 25 | 26 | self.min_max_dict = None 27 |
self.min_max_scaler = None 28 | self.min_max_scalers = None 29 | 30 | self._init() 31 | return 32 | 33 | def _init(self): 34 | DataLoader.validate_or_create_dir(self.dense_path) 35 | return 36 | 37 | def fit(self, transformer): 38 | self.transformer = transformer 39 | return True 40 | 41 | def transform(self, sep="\t"): 42 | if not os.path.exists(self.dense_path): 43 | os.mkdir(self.dense_path) 44 | 45 | cnt_train = 0 46 | dense_feature_indexes = set(self.dense_feature_indexes) 47 | for file_path in self.train_files: 48 | fi = open(file_path, 'r', encoding="utf-8") 49 | fo = open(self.dense_path + os.path.basename(file_path), 'w', encoding="utf-8") 50 | for line in fi: 51 | if cnt_train % 10000 == 0: 52 | print('DenseProcess::transform: cnt: %d' % cnt_train) 53 | entry = [] 54 | items = line.strip('\n').split(sep=sep) 55 | for index, item in enumerate(items): 56 | if index in dense_feature_indexes: 57 | entry.append(str(self.transformer(item, index))) 58 | else: 59 | entry.append(item) 60 | fo.write(sep.join(entry) + '\n') 61 | cnt_train += 1 62 | fi.close() 63 | fo.close() 64 | 65 | self._update_config() 66 | return True 67 | 68 | def scale_multi_min_max(self, x, index=-1): 69 | """ 70 | Mode 0: value = (x - min) / (max - min), min-max scaling fitted per dense field 71 | :param x: 72 | :param index: 73 | :return: 74 | """ 75 | if not self.min_max_scalers: 76 | self.min_max_scalers = self._fit_min_max_scaler() 77 | scaler = self.min_max_scalers.get(index) 78 | x = Utils.get_float(x) 79 | transformed_values = scaler.transform(np.reshape(x, (-1, 1))) 80 | value = transformed_values[0][0] 81 | return value 82 | 83 | def _fit_min_max_scaler(self, sep="\t"): 84 | if not self.min_max_dict: 85 | self.min_max_dict = self._get_min_max_dict(sep=sep) 86 | 87 | min_max_scalers = OrderedDict() 88 | for key, value in self.min_max_dict.items(): 89 | scaler = MinMaxScaler() 90 | scaler.fit(np.reshape(value, (-1, 1))) 91 | min_max_scalers[key] = scaler 92 | return min_max_scalers 93 | 94 | def scale_ln2(self, x, index=-1): 95 | """ 96 | Mode 1: value = int[(ln(x))^2] when x > 2; smaller values are returned unchanged 97 | :param x: 98 | :param index: 99 | :return: 100 | """ 101 | if isinstance(x, str): 102 | x = Utils.get_float(x) 103 | if x > 2: 104 | x = int(math.log(float(x)) ** 2) 105 | return x 106 | 107 | def scale_ln0(self, x, index=-1): 108 | """ 109 | Mode 2: value = ln(1+x-min) 110 | :param x: 111 | :param index: 112 | :return: 113 | """ 114 | if not self.min_max_dict: 115 | self.min_max_dict = self._get_min_max_dict() 116 | 117 | x = Utils.get_float(x) 118 | current_min = self.min_max_dict.get(index, [0, 0])[0] 119 | value = math.log(1.0 + (float(x) - current_min)) 120 | return value 121 |
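# Worked examples (illustrative numbers, not taken from the datasets), assuming
# a dense field with observed min=0 and max=100:
#   scale_multi_min_max("50") -> 0.5     # (50 - 0) / (100 - 0)
#   scale_ln2("54")           -> 15      # int(ln(54) ** 2) = int(15.91...)
#   scale_ln0("50")           -> ~3.93   # ln(1 + 50 - 0)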
122 | def _get_min_max_dict(self, sep="\t"): 123 | min_max_dict = OrderedDict() 124 | for i in self.dense_feature_indexes: 125 | min_max_dict[i] = [np.inf, -np.inf] # min/max 126 | cnt_train = 0 127 | for file_path in self.train_files: 128 | fi = open(file_path, 'r', encoding="utf-8") 129 | for line in fi: 130 | if cnt_train % 10000 == 0: 131 | print('DenseProcess::fit: cnt: %d' % cnt_train) 132 | items = line.strip('\n').split(sep=sep) 133 | for index in self.dense_feature_indexes: 134 | item = items[index] 135 | need_update = False 136 | current_min, current_max = min_max_dict.get(index) 137 | item_value = Utils.get_float(item) 138 | if item_value < current_min: 139 | current_min = item_value 140 | need_update = True 141 | if item_value > current_max: 142 | current_max = item_value 143 | need_update = True 144 | if need_update: 145 | min_max_dict[index] = [current_min, current_max] 146 | 147 | cnt_train += 1 148 | return min_max_dict 149 | 150 | def _update_config(self): 151 | self.config["base_info"]["train_path"] = self.config["base_info"]["dense_path"] 152 | with open(self.target_config_path, 'w', encoding='utf-8') as json_file: 153 | json.dump(self.config, json_file, ensure_ascii=False, indent=4) 154 | return True 155 | -------------------------------------------------------------------------------- /rec_alg/preprocessing/kdd12/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- -------------------------------------------------------------------------------- /rec_alg/preprocessing/kdd12/kdd12_process.py: -------------------------------------------------------------------------------- 1 | from rec_alg.preprocessing.dense_process import DenseProcess 2 | from rec_alg.preprocessing.kdd12.raw_data_process import RawDataProcess 3 | from rec_alg.preprocessing.kfold_process import KFoldProcess 4 | from rec_alg.preprocessing.sparse_process import SparseProcess 5 | 6 | 7 | def main(): 8 | chunksize = 10000000 9 | # Process raw data to one file 10 | raw_data = RawDataProcess(config_path="./config/kdd12/config_template.json") 11 | raw_data.fit() 12 | raw_data.transform(sep="\t", chunksize=chunksize) 13 | 14 | # Preprocess categorical fields: encode labels, mapping nulls and low-frequency values to a shared index 15 | sparse = SparseProcess(config_path=raw_data.target_config_path) 16 | sparse.fit(min_occurrences=10, index_base=0) 17 | sparse.transform() 18 | 19 | # Preprocess dense fields: fill nulls with 0 and apply the scale transform 20 | dense = DenseProcess(config_path=sparse.target_config_path) 21 | dense.fit(dense.scale_multi_min_max) 22 | dense.transform() 23 | 24 | # K-Fold 25 | kfold = KFoldProcess(config_path=dense.target_config_path) 26 | kfold.fit() 27 | kfold.transform(chunksize=chunksize, mode="fast") 28 | return 29 | 30 | 31 | if __name__ == "__main__": 32 | main() 33 | -------------------------------------------------------------------------------- /rec_alg/preprocessing/kdd12/raw_data_process.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | 4 | import numpy as np 5 | import pandas as pd 6 | 7 | from rec_alg.common.data_loader import DataLoader 8 | from rec_alg.preprocessing.base_process import BaseProcess 9 | 10 | 11 | class RawDataProcess(BaseProcess): 12 | """ 13 | 1. Concatenate training.txt and userid_profile.txt into a full train.txt file 14 | 2. Convert the label to 0/1 15 | 3.
Requires a large amount of memory (userid_profile.txt is loaded as a whole) 16 | """ 17 | 18 | def __init__(self, config_path, ): 19 | super(RawDataProcess, self).__init__(config_path) 20 | self.concat_path = self.config.get("base_info", {}).get("concat_path", None) 21 | self.target_config_path = "{dir}/{name}".format(dir=os.path.dirname(self.config_path), 22 | name="config_concat.json") 23 | 24 | self._init() 25 | return 26 | 27 | def _init(self): 28 | DataLoader.validate_or_create_dir(self.concat_path) 29 | return 30 | 31 | def fit(self): 32 | return 33 | 34 | def transform(self, sep="\t", chunksize=10000): 35 | # Load data 36 | df_user = DataLoader.load_data_txt_as_df(path=os.path.join(self.train_path, "userid_profile.txt"), sep="\t", ) 37 | iterator = pd.read_csv(os.path.join(self.train_path, "training.txt"), sep=sep, header=None, index_col=None, 38 | chunksize=chunksize, encoding="utf-8") 39 | target_file = os.path.join(self.concat_path, "train.txt") 40 | if os.path.exists(target_file): 41 | os.remove(target_file) 42 | 43 | for n, data_chunk in enumerate(iterator): 44 | print('RawDataProcess::transform: Size of loaded chunk: %i instances, %i features' % data_chunk.shape) 45 | print("RawDataProcess::transform: chunk counter: {}".format(n)) 46 | 47 | # Join the user profile columns onto the chunk 48 | df_target = pd.merge(data_chunk, df_user, left_on=data_chunk.columns[-1], right_on=df_user.columns[0], 49 | how='left') 50 | df_target.drop(columns=df_target.columns[-len(df_user.columns)], inplace=True) 51 | 52 | # Missing value handling 53 | df_target.iloc[:, -2] = df_target.iloc[:, -2].apply( 54 | lambda x: 0 if x is None or x == "" or np.isnan(x) else x) 55 | df_target.iloc[:, -1] = df_target.iloc[:, -1].apply( 56 | lambda x: 0 if x is None or x == "" or np.isnan(x) else x) 57 | 58 | # Binarize the label 59 | df_target.iloc[:, 0] = df_target.iloc[:, 0].apply(lambda x: x if int(x) == 0 else 1) 60 | 61 | # Write out 62 | df_target.to_csv(os.path.join(self.concat_path, "train.txt"), sep='\t', header=None, index=False, mode="a") 63 | 64 | self._update_config() 65 | return 66 | 67 | def _update_config(self): 68 | self.config["base_info"]["train_path"] = self.concat_path 69 | with open(self.target_config_path, 'w', encoding='utf-8') as json_file: 70 | json.dump(self.config, json_file, ensure_ascii=False, indent=4) 71 | return True 72 | -------------------------------------------------------------------------------- /rec_alg/preprocessing/kfold_process.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | import pandas as pd 4 | from sklearn.model_selection import StratifiedKFold 5 | 6 | from .base_process import BaseProcess 7 | from ..common.data_loader import DataLoader 8 | 9 | 10 | class KFoldProcess(BaseProcess): 11 | """ 12 | Split dataset with K-Fold 13 | """ 14 | 15 | def __init__(self, config_path, **kwargs): 16 | super(KFoldProcess, self).__init__(config_path=config_path, **kwargs) 17 | self.k = self.config["base_info"]["k_fold"] 18 | self.k_fold_path = self.config["base_info"]["k_fold_path"] 19 | self.random_seed = self.config["base_info"]["random_seed"] 20 | self._init() 21 | return 22 | 23 | def _init(self): 24 | DataLoader.validate_or_create_dir(self.k_fold_path) 25 | return 26 | 27 | def fit(self): 28 | DataLoader.rmdirs(self.k_fold_path) 29 | return True 30 |
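# Annotation (not from the original source): transform() below applies
# StratifiedKFold independently to every chunk and appends each fold's rows to
# the matching part{i} directory, so memory stays bounded by chunksize while
# the label distribution remains stratified within each chunk.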
31 | def transform(self, sep="\t", chunksize=10000, shuffle=True, mode="fast"): 32 | counter = 0 33 | feature_indexes = self.sparse_feature_indexes + self.dense_feature_indexes + self.varlen_feature_target_indexes 34 | feature_indexes.sort() 35 | for file_path in self.train_files: 36 | print("KFoldProcess::transform: Processing file {} ...".format(file_path)) 37 | filename = os.path.basename(file_path) 38 | iterator = pd.read_csv(file_path, sep=sep, header=None, index_col=None, chunksize=chunksize, 39 | encoding="utf-8", dtype=None) 40 | for n, data_chunk in enumerate(iterator): 41 | print('KFoldProcess::transform: Size of loaded chunk: %i instances, %i features' % data_chunk.shape) 42 | counter += 1 43 | print("KFoldProcess::transform: chunk counter: {}".format(counter)) 44 | label_and_feature = data_chunk.values 45 | train_y = label_and_feature[:, self.label_index] 46 | train_x = label_and_feature[:, feature_indexes] 47 | print("KFoldProcess::transform: shape of train_x", train_x.shape) 48 | folds = list(StratifiedKFold(n_splits=self.k, shuffle=shuffle, 49 | random_state=self.random_seed).split(train_x, train_y)) 50 | print("KFoldProcess::transform: fold num: %d" % (len(folds))) 51 | self._save_fold_data(folds, data_chunk, filename, sep, mode) 52 | print("KFoldProcess::transform: save train_data done") 53 | return True 54 | 55 | def _save_fold_data(self, folds, train_data, filename, sep="\t", mode="fast"): 56 | for i in range(len(folds)): 57 | train_id, valid_id = folds[i] 58 | print("KFoldProcess::transform: now part %d" % i) 59 | file_path = os.path.join(self.k_fold_path, "part" + str(i) + "/") 60 | DataLoader.validate_or_create_dir(file_path) 61 | abs_filename = os.path.join(file_path, filename) 62 | with open(abs_filename, "a", encoding="utf-8") as f: 63 | if mode == "fast": 64 | fold_data = train_data.values[valid_id] 65 | for data in fold_data: 66 | f.write(sep.join([str(item) for item in data]) + "\n") 67 | else: 68 | for row_ix in valid_id: 69 | items = [str(train_data.iat[row_ix, column_ix]) for column_ix in train_data.columns] 70 | f.write(sep.join(items) + "\n") 71 | return True 72 | -------------------------------------------------------------------------------- /rec_alg/preprocessing/sparse_process.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | from collections import OrderedDict 4 | 5 | from rec_alg.common.data_loader import DataLoader 6 | from rec_alg.preprocessing.base_process import BaseProcess 7 | 8 | 9 | class SparseProcess(BaseProcess): 10 | """ 11 | Preprocessing for categorical fields in datasets 12 | """ 13 | 14 | def __init__(self, config_path, **kwargs): 15 | super(SparseProcess, self).__init__(config_path=config_path, **kwargs) 16 | self.label_encoder = OrderedDict() 17 | self.label_counter = OrderedDict() 18 | self.sparse_path = self.config.get("base_info", {}).get("sparse_path", None) 19 | self.target_config_path = "{dir}/{name}".format(dir=os.path.dirname(self.config_path), 20 | name="config_sparse.json") 21 | self._init() 22 | return 23 | 24 | def _init(self): 25 | DataLoader.validate_or_create_dir(self.sparse_path) 26 | return 27 | 28 | def fit(self, sep="\t", min_occurrences=10, index_base=0): 29 | """ 30 | Encode categorical fields 31 | Features appearing no more than min_occurrences times are all mapped to one shared low-frequency 32 | index, while features appearing more often each get their own independent index.
33 | :param sep: separator of different fields 34 | :param min_occurrences: min occurrences of a feature 35 | :param index_base: zero-based or one-based 36 | :return: 37 | """ 38 | cnt_train = 0 39 | fea_freqs = OrderedDict() 40 | for feature_index in self.sparse_feature_indexes: 41 | fea_freqs.setdefault(feature_index, OrderedDict()) 42 | for file_path in self.train_files: 43 | with open(file_path, mode="r", encoding="utf-8") as fi: 44 | for line in fi: 45 | items = line.strip('\n').split(sep=sep) 46 | if len(items) == 0: 47 | continue 48 | 49 | cnt_train += 1 50 | if cnt_train % 10000 == 0: 51 | print('SparseProcess::fit: now train cnt: %d' % cnt_train) 52 | 53 | for feature_index in self.sparse_feature_indexes: 54 | cur_freqs = fea_freqs[feature_index] 55 | item = items[feature_index] 56 | if item not in cur_freqs: 57 | cur_freqs[item] = 1 58 | else: 59 | cur_freqs[item] += 1 60 | 61 | print("SparseProcess::fit: build the encoding index within each field") 62 | label_encoder = OrderedDict() 63 | label_counter = OrderedDict() 64 | for feature_index in self.sparse_feature_indexes: 65 | label_encoder.setdefault(feature_index, OrderedDict()) 66 | label_counter.setdefault(feature_index, index_base) 67 | for index, cur_freqs in fea_freqs.items(): 68 | current_encoder = label_encoder[index] 69 | min_occurrences_index = -1 70 | label_index = index_base 71 | for item, freq in cur_freqs.items(): 72 | if freq <= min_occurrences: 73 | if min_occurrences_index < 0: 74 | min_occurrences_index = label_index 75 | label_index += 1 76 | current_encoder[item] = [min_occurrences_index, freq] 77 | else: 78 | current_encoder[item] = [label_index, freq] 79 | label_index += 1 80 | label_counter[index] = label_index 81 | 82 | print('SparseProcess::fit: number of categorical fields:', len(self.sparse_feature_indexes)) 83 | print('SparseProcess::fit: number of features in every field: {}'.format(label_counter)) 84 | print('SparseProcess::fit: number of all the features: {}'.format(sum(label_counter.values()))) 85 | print('SparseProcess::fit: total entries: %d' % cnt_train) 86 | for feature_index in self.sparse_feature_indexes: 87 | if feature_index != 1: 88 | continue 89 | print("Sparse feature {}: ".format(feature_index)) 90 | for key, value in label_encoder[feature_index].items(): 91 | print("key:{}\tindex:{}\tcount:{}".format(key, value[0], value[1])) 92 | 93 | self.label_encoder = label_encoder 94 | self.label_counter = label_counter 95 | 96 | self._update_config() 97 | return True 98 | 99 | def transform(self, sep="\t"): 100 | sparse_feature_indexes = set(self.sparse_feature_indexes) 101 | cnt_train = 0 102 | for file_path in self.train_files: 103 | fi = open(file_path, 'r', encoding="utf-8") 104 | fo = open(self.sparse_path + os.path.basename(file_path), 'w', encoding="utf-8") 105 | print('SparseProcess::transform: remake training data...') 106 | for line in fi: 107 | cnt_train += 1 108 | if cnt_train % 10000 == 0: 109 | print('SparseProcess::transform: now train cnt: %d' % cnt_train) 110 | entry = [] 111 | items = line.strip('\n').split(sep=sep) 112 | for index, item in enumerate(items): 113 | if index in sparse_feature_indexes: 114 | entry.append(str(self.label_encoder[index][item][0])) 115 | else: 116 | entry.append(item) 117 | fo.write(sep.join(entry) + '\n') 118 | fo.close() 119 | fi.close() 120 | return True 121 |
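# Worked example (hypothetical field values, min_occurrences=2, index_base=0):
# given one field with counts {"a": 5, "b": 1, "c": 3, "d": 1}, fit() encodes
#   a -> 0, c -> 2, and b, d -> 1 (the shared low-frequency index),
# so label_counter for that field ends at 3.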
122 | def _update_config(self): 123 | for index, feature in enumerate(self.config["features"]): 124 | if feature["type"] == "sparse": 125 | feature["dimension"] = self.label_counter[index] 126 | 127 | self.config["base_info"]["train_path"] = self.config["base_info"]["sparse_path"] 128 | 129 | with open(self.target_config_path, 'w', encoding='utf-8') as json_file: 130 | json.dump(self.config, json_file, ensure_ascii=False, indent=4) 131 | return True 132 | -------------------------------------------------------------------------------- /script/run_avazu_process.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | nohup python -u -m rec_alg.preprocessing.avazu.avazu_process > process_avazu.log 2>&1 & 4 | -------------------------------------------------------------------------------- /script/run_criteo_process.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | nohup python -u -m rec_alg.preprocessing.criteo.criteo_process > process_criteo.log 2>&1 & 4 | -------------------------------------------------------------------------------- /script/run_fibinet_model.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -x 3 | 4 | #export CUDA_VISIBLE_DEVICES=1 5 | export TF_FORCE_GPU_ALLOW_GROWTH=true 6 | 7 | dt=$(date "+%Y%m%d") 8 | 9 | cmd=$1 10 | echo "*****cmd:$cmd" 11 | 12 | ########################################### 13 | # Parameters 14 | ########################################### 15 | version="++" 16 | config_path="./config/criteo/config_dense.json" 17 | train_paths="part0,part1,part2,part3,part4,part5,part6,part7" 18 | valid_paths="part8" 19 | test_paths="part9" 20 | 21 | # Embedding 22 | embedding_size=10 23 | embedding_l2_reg=0.0 24 | embedding_dropout=0.0 25 | sparse_embedding_norm_type="bn" 26 | dense_embedding_norm_type="layer_norm" 27 | dense_embedding_share_params=False 28 | 29 | # SENet 30 | senet_squeeze_mode="group_mean_max" 31 | senet_squeeze_group_num=2 32 | senet_squeeze_topk=1 33 | senet_reduction_ratio=3.0 34 | senet_excitation_mode="bit" 35 | senet_activation="none" 36 | senet_use_skip_connection=True 37 | senet_reweight_norm_type="ln" 38 | 39 | # Bilinear-Interaction 40 | origin_bilinear_type="all_ip" 41 | origin_bilinear_dnn_units="[50]" 42 | origin_bilinear_dnn_activation="linear" 43 | senet_bilinear_type="none" 44 | 45 | # MLP 46 | dnn_hidden_units="[400,400,400]" 47 | dnn_activation="relu" 48 | dnn_l2_reg=0.0 49 | dnn_use_bn=False 50 | dnn_dropout=0.0 51 | 52 | # Linear Part 53 | enable_linear=False 54 | linear_l2_reg=0.0 55 | 56 | # Train/Test setup 57 | seed=1024 58 | epochs=3 59 | batch_size=1024 60 | learning_rate=0.0001 61 | init_std=0.01 62 | verbose=1 63 | mode="train" 64 | restore_epochs=[] 65 | early_stopping=False 66 | model_path="fibinet_model/"${cmd} 67 | 68 | ########################################### 69 | # Different runs 70 | ########################################### 71 | if [ $cmd == "fibinet_v1_criteo" ]; then 72 | version="v1" 73 | elif [ $cmd == "fibinet++_criteo" ]; then 74 | version="++" 75 | elif [ $cmd == "fibinet_v1_avazu" ]; then 76 | embedding_size=50 77 | learning_rate=0.001 78 | config_path="./config/avazu/config_sparse.json" 79 | version="v1" 80 | elif [ $cmd == "fibinet++_avazu" ]; then 81 | embedding_size=50 82 | learning_rate=0.001 83 | config_path="./config/avazu/config_sparse.json" 84 | version="++" 85 | else 86 | echo "****ERROR unknown cmd: $cmd..." 87 | exit 1 88 | fi 89 | 90 | if [ !
-d "./logs" ]; then 91 | mkdir logs 92 | fi 93 | 94 | python -u -m rec_alg.model.fibinet.run_fibinet \ 95 | --version ${version} \ 96 | --config_path ${config_path} \ 97 | --train_paths ${train_paths} \ 98 | --valid_paths ${valid_paths} \ 99 | --test_paths ${test_paths} \ 100 | --embedding_size ${embedding_size} \ 101 | --embedding_l2_reg ${embedding_l2_reg} \ 102 | --embedding_dropout ${embedding_dropout} \ 103 | --sparse_embedding_norm_type ${sparse_embedding_norm_type} \ 104 | --dense_embedding_norm_type ${dense_embedding_norm_type} \ 105 | --dense_embedding_share_params ${dense_embedding_share_params} \ 106 | --senet_squeeze_mode ${senet_squeeze_mode} \ 107 | --senet_squeeze_group_num ${senet_squeeze_group_num} \ 108 | --senet_squeeze_topk ${senet_squeeze_topk} \ 109 | --senet_reduction_ratio ${senet_reduction_ratio} \ 110 | --senet_excitation_mode ${senet_excitation_mode} \ 111 | --senet_activation ${senet_activation} \ 112 | --senet_use_skip_connection ${senet_use_skip_connection} \ 113 | --senet_reweight_norm_type ${senet_reweight_norm_type} \ 114 | --origin_bilinear_type ${origin_bilinear_type} \ 115 | --origin_bilinear_dnn_units ${origin_bilinear_dnn_units} \ 116 | --origin_bilinear_dnn_activation ${origin_bilinear_dnn_activation} \ 117 | --senet_bilinear_type ${senet_bilinear_type} \ 118 | --dnn_hidden_units ${dnn_hidden_units} \ 119 | --dnn_activation ${dnn_activation} \ 120 | --dnn_l2_reg ${dnn_l2_reg} \ 121 | --dnn_use_bn ${dnn_use_bn} \ 122 | --dnn_dropout ${dnn_dropout} \ 123 | --enable_linear ${enable_linear} \ 124 | --linear_l2_reg ${linear_l2_reg} \ 125 | --seed ${seed} \ 126 | --epochs ${epochs} \ 127 | --batch_size ${batch_size} \ 128 | --learning_rate ${learning_rate} \ 129 | --init_std ${init_std} \ 130 | --verbose ${verbose} \ 131 | --mode ${mode} \ 132 | --restore_epochs ${restore_epochs} \ 133 | --early_stopping ${early_stopping} \ 134 | --model_path ${model_path} \ 135 | >>./logs/fibinet_${dt}_${cmd}.log 2>&1 136 | 137 | echo "running ..." 
138 | 
--------------------------------------------------------------------------------
/script/run_kdd12_process.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 | 
3 | nohup python -u -m rec_alg.preprocessing.kdd12.kdd12_process > process_kdd12.log 2>&1 &
4 | 
--------------------------------------------------------------------------------
/script/run_memonet_model.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | set -x
3 | 
4 | #export CUDA_VISIBLE_DEVICES=1
5 | export TF_FORCE_GPU_ALLOW_GROWTH=true
6 | 
7 | dt=$(date "+%Y%m%d")
8 | 
9 | cmd=$1
10 | echo "*****cmd: $cmd"
11 | 
12 | ###########################################
13 | # Parameters
14 | ###########################################
15 | config_path="./config/criteo/config_dense.json"
16 | train_paths="part0,part1,part2,part3,part4,part5,part6,part7"
17 | valid_paths="part8"
18 | test_paths="part9"
19 | 
20 | # Embedding
21 | embedding_size=10
22 | embedding_l2_reg=0.0
23 | embedding_dropout=0.0
24 | 
25 | # MLP
26 | dnn_hidden_units="[400,400]"
27 | dnn_activation="relu"
28 | dnn_l2_reg=0.0
29 | dnn_use_bn=False
30 | dnn_dropout=0.0
31 | 
32 | # Interact mode
33 | interact_mode="fullhcnet"
34 | 
35 | # HCNet
36 | interaction_hash_embedding_buckets=1000000  # number of hash buckets in the codebook
37 | interaction_hash_embedding_size=${embedding_size}
38 | interaction_hash_embedding_bucket_mode="hash-share"
39 | interaction_hash_embedding_num_hash=2  # hash functions per cross feature
40 | interaction_hash_embedding_merge_mode="concat"  # how the per-hash embeddings are combined
41 | interaction_hash_output_dims=0
42 | interaction_hash_embedding_float_precision=12
43 | interaction_hash_embedding_interact_orders="[2]"
44 | interaction_hash_embedding_interact_modes="['senetsum']"
45 | interaction_hash_embedding_feature_metric="dimension"
46 | interaction_hash_embedding_feature_top_k=-1
47 | 
48 | # Train/Test setup
49 | seed=1024
50 | epochs=3
51 | batch_size=1024
52 | learning_rate=0.001
53 | init_std=0.01
54 | verbose=1
55 | mode="train"
56 | restore_epochs=[]
57 | early_stopping=True
58 | model_path="memonet_model/"${cmd}
59 | 
60 | # Each branch below overrides the defaults above for one specific run.
61 | ###########################################
62 | # Different runs
63 | ###########################################
64 | if [ "$cmd" == "memonet_criteo_hcnet-full-1e6-10-2-concat_output-10_orders-2-senetsum" ]; then
65 |   interact_mode="fullhcnet"
66 |   interaction_hash_embedding_buckets=1000000
67 |   interaction_hash_embedding_size=10
68 |   interaction_hash_embedding_num_hash=2
69 |   interaction_hash_embedding_merge_mode="concat"
70 |   interaction_hash_output_dims=10
71 |   interaction_hash_embedding_interact_orders="[2]"
72 |   interaction_hash_embedding_interact_modes="['senetsum']"
73 |   epochs=1
74 | elif [ "$cmd" == "memonet_criteo_hcnet-full-1e6-10-2-senetorigin_output-10_orders-2-senetsum" ]; then
75 |   interact_mode="fullhcnet"
76 |   interaction_hash_embedding_buckets=1000000
77 |   interaction_hash_embedding_size=10
78 |   interaction_hash_embedding_num_hash=2
79 |   interaction_hash_embedding_merge_mode="senetorigin"
80 |   interaction_hash_output_dims=10
81 |   interaction_hash_embedding_interact_orders="[2]"
82 |   interaction_hash_embedding_interact_modes="['senetsum']"
83 |   epochs=1
84 | elif [ "$cmd" == "memonet_avazu_hcnet-full-1e6-10-2-concat_output-10_orders-2-senetsum" ]; then
85 |   config_path="./config/avazu/config_sparse.json"
86 |   embedding_size=50
87 |   interact_mode="fullhcnet"
88 |   interaction_hash_embedding_buckets=1000000
89 |   interaction_hash_embedding_size=50
90 |   interaction_hash_embedding_num_hash=2
91 |   interaction_hash_embedding_merge_mode="concat"
92 |   interaction_hash_output_dims=50
93 |   interaction_hash_embedding_interact_orders="[2]"
94 |   interaction_hash_embedding_interact_modes="['senetsum']"
95 |   learning_rate=0.0001
96 |   epochs=1
97 | elif [ "$cmd" == "memonet_avazu_hcnet-full-1e6-10-2-senetorigin_output-10_orders-2-senetsum" ]; then
98 |   config_path="./config/avazu/config_sparse.json"
99 |   embedding_size=50
100 |   interact_mode="fullhcnet"
101 |   interaction_hash_embedding_buckets=1000000
102 |   interaction_hash_embedding_size=50
103 |   interaction_hash_embedding_num_hash=2
104 |   interaction_hash_embedding_merge_mode="senetorigin"
105 |   interaction_hash_output_dims=50
106 |   interaction_hash_embedding_interact_orders="[2]"
107 |   interaction_hash_embedding_interact_modes="['senetsum']"
108 |   learning_rate=0.0001
109 |   epochs=1
110 | elif [ "$cmd" == "memonet_kdd12_hcnet-full-5e5-10-2-concat_output-10_orders-2-senetsum" ]; then
111 |   config_path="./config/kdd12/config_dense.json"
112 |   embedding_size=10
113 |   interact_mode="fullhcnet"
114 |   interaction_hash_embedding_buckets=500000
115 |   interaction_hash_embedding_size=10
116 |   interaction_hash_embedding_num_hash=2
117 |   interaction_hash_embedding_merge_mode="concat"
118 |   interaction_hash_output_dims=10
119 |   interaction_hash_embedding_interact_orders="[2]"
120 |   interaction_hash_embedding_interact_modes="['senetsum']"
121 |   learning_rate=0.001
122 |   epochs=2
123 | elif [ "$cmd" == "memonet_kdd12_hcnet-full-5e5-10-2-senetorigin_output-10_orders-2-senetsum" ]; then
124 |   config_path="./config/kdd12/config_dense.json"
125 |   embedding_size=10
126 |   interact_mode="fullhcnet"
127 |   interaction_hash_embedding_buckets=500000
128 |   interaction_hash_embedding_size=10
129 |   interaction_hash_embedding_num_hash=2
130 |   interaction_hash_embedding_merge_mode="senetorigin"
131 |   interaction_hash_output_dims=10
132 |   interaction_hash_embedding_interact_orders="[2]"
133 |   interaction_hash_embedding_interact_modes="['senetsum']"
134 |   learning_rate=0.001
135 |   epochs=2
136 | else
137 |   echo "****ERROR unknown cmd: $cmd..."
138 |   exit 1
139 | fi
140 | 
141 | if [ ! -d "./logs" ]; then
-d "./logs" ]; then 142 | mkdir logs 143 | fi 144 | 145 | python -u -m rec_alg.model.memonet.run_memonet \ 146 | --config_path ${config_path} \ 147 | --train_paths ${train_paths} \ 148 | --valid_paths ${valid_paths} \ 149 | --test_paths ${test_paths} \ 150 | --embedding_size ${embedding_size} \ 151 | --embedding_l2_reg ${embedding_l2_reg} \ 152 | --embedding_dropout ${embedding_dropout} \ 153 | --dnn_hidden_units ${dnn_hidden_units} \ 154 | --dnn_activation ${dnn_activation} \ 155 | --dnn_l2_reg ${dnn_l2_reg} \ 156 | --dnn_use_bn ${dnn_use_bn} \ 157 | --dnn_dropout ${dnn_dropout} \ 158 | --interact_mode ${interact_mode} \ 159 | --interaction_hash_embedding_buckets ${interaction_hash_embedding_buckets} \ 160 | --interaction_hash_embedding_size ${interaction_hash_embedding_size} \ 161 | --interaction_hash_embedding_bucket_mode ${interaction_hash_embedding_bucket_mode} \ 162 | --interaction_hash_embedding_num_hash ${interaction_hash_embedding_num_hash} \ 163 | --interaction_hash_embedding_merge_mode ${interaction_hash_embedding_merge_mode} \ 164 | --interaction_hash_output_dims ${interaction_hash_output_dims} \ 165 | --interaction_hash_embedding_float_precision ${interaction_hash_embedding_float_precision} \ 166 | --interaction_hash_embedding_interact_orders ${interaction_hash_embedding_interact_orders} \ 167 | --interaction_hash_embedding_interact_modes ${interaction_hash_embedding_interact_modes} \ 168 | --interaction_hash_embedding_feature_metric ${interaction_hash_embedding_feature_metric} \ 169 | --interaction_hash_embedding_feature_top_k ${interaction_hash_embedding_feature_top_k} \ 170 | --seed ${seed} \ 171 | --epochs ${epochs} \ 172 | --batch_size ${batch_size} \ 173 | --learning_rate ${learning_rate} \ 174 | --init_std ${init_std} \ 175 | --verbose ${verbose} \ 176 | --mode ${mode} \ 177 | --restore_epochs ${restore_epochs} \ 178 | --early_stopping ${early_stopping} \ 179 | --model_path ${model_path} \ 180 | >>./logs/memonet_${dt}_${cmd}.log 2>&1 181 | 182 | echo "running ..." 183 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | # -*- coding:utf-8 -*- 2 | from setuptools import setup 3 | 4 | setup( 5 | name='rec_alg', 6 | version='0.1', 7 | description='end-to-end recommendation algorithm package with dataset processing', 8 | author='Pengtao Zhang', 9 | author_email='zpt1986@126.com', 10 | packages=['rec_alg'], 11 | install_requires=["tensorflow==1.14", "pydot", "graphviz", "numpy", "pandas", "scikit-learn==0.21.3"], 12 | ) 13 | --------------------------------------------------------------------------------