├── LICENSE ├── README.md ├── coarse ├── dataset.py ├── eval_dssm.py ├── eval_dssm_auxiliary_ranking.py ├── eval_dssm_data_dist_shift_all.py ├── eval_dssm_data_dist_shift_sampling.py ├── eval_dssm_fsltr.py ├── eval_dssm_ubm.py ├── file.txt ├── metrics.py ├── models.py ├── run_dssm.py ├── run_dssm.sh ├── run_dssm_auxiliary_ranking.py ├── run_dssm_auxiliary_ranking.sh ├── run_dssm_data_dist_shift_all.py ├── run_dssm_data_dist_shift_all.sh ├── run_dssm_data_dist_shift_sampling.py ├── run_dssm_data_dist_shift_sampling.sh ├── run_dssm_fsltr.py ├── run_dssm_fsltr.sh ├── run_dssm_ubm.py ├── run_dssm_ubm.sh └── utils.py ├── data ├── .DS_Store ├── all_stage │ └── example.feather ├── others │ ├── coarse_rank_test.feather │ ├── id_cnt.pkl │ ├── rank_test.feather │ ├── realshow_video_info.feather │ └── retrieval_test.feather ├── realshow │ └── example.feather ├── request_id_dict │ └── example.pkl └── seq_effective_50_dict │ └── example.pkl ├── rank ├── dataset.py ├── eval_din.py ├── eval_din_auxiliary_ranking.py ├── eval_din_data_dist_shift_all.py ├── eval_din_data_dist_shift_sampling.py ├── eval_din_fsltr.py ├── eval_din_ubm.py ├── file.txt ├── metrics.py ├── models.py ├── run_din.py ├── run_din.sh ├── run_din_auxiliary_ranking.py ├── run_din_auxiliary_ranking.sh ├── run_din_data_dist_shift_all.py ├── run_din_data_dist_shift_all.sh ├── run_din_data_dist_shift_sampling.py ├── run_din_data_dist_shift_sampling.sh ├── run_din_fsltr.py ├── run_din_fsltr.sh ├── run_din_ubm.py ├── run_din_ubm.sh └── utils.py ├── recflow.jpg └── retrieval ├── dataset.py ├── eval_sasrec.py ├── eval_sasrec_fsltr.py ├── eval_sasrec_hardnegmining.py ├── file.txt ├── metrics.py ├── models.py ├── modules.py ├── run_sasrec.py ├── run_sasrec.sh ├── run_sasrec_fsltr.py ├── run_sasrec_fsltr.sh ├── run_sasrec_hardnegmining.py ├── run_sasrec_hardnegmining.sh └── utils.py /README.md: -------------------------------------------------------------------------------- 1 | # RecFlow: An Industrial Full Flow Recommendation Dataset 2 | 3 | [![LICENSE](https://img.shields.io/badge/license-CC%20BY--SA%204.0-green)](https://github.com/RecFlow-nips24/RecFlow-nips24/blob/main/LICENSE) 4 | 5 | ### Download the data 6 | 7 | Download manually through the following link: 8 | 9 | - link: [Drive](https://rec.ustc.edu.cn/share/f8e5adc0-2e57-11ef-bea5-3b4cac9d110e) 10 | 11 | --- 12 | 13 | ### Motivation 14 | To provide the recommendation systems (RS) research community with an industrial full flow dataset, we propose RecFlow, which includes samples from the exposure space as well as unexposed items filtered at each stage of Kuaishou's multi-stage RS. Compared with all existing public RS datasets, RecFlow can be leveraged not only to optimize conventional recommendation tasks but also to study challenges such as the interplay of different stages, data distribution shift, auxiliary ranking tasks, and user behavior sequence modeling. It is the first public RS dataset that allows researchers to study a real industrial multi-stage RS. 15 | 16 | The following figure illustrates the process of RecFlow's data collection. 17 | 18 | ![kuaidata](./recflow.jpg) 19 | 20 | ### Usage 21 | RecFlow can be applied to the following tasks. (1) By recording items from the serving space, RecFlow enables the study of how to alleviate the discrepancy between training and serving for specific stages during both the learning and evaluation processes.
(2) RecFlow also records the stage information of each sample, facilitating research on the joint modeling of multiple stages, such as stage consistency or optimal multi-stage RS. (3) The positive and negative samples from the exposure space are suitable for classical click-through rate prediction or sequential recommendation tasks. (4) RecFlow stores multiple types of positive feedback (e.g., effective view, long view, like, follow, forward, comment), supporting research on multi-task recommendation. (5) Information about the video duration and playing time of each exposed video allows the study of learning through implicit feedback, such as predicting playing time. (6) RecFlow includes a request identifier feature, which can contribute to studying the re-ranking problem. (7) Timestamps for each sample enable the aggregation of user feedback in chronological order, facilitating the study of user behavior sequence modeling algorithms. (8) RecFlow incorporates context, user, and video features beyond identity features (e.g., user ID and video ID), making it suitable for context-based recommendation. (9) The rich information recorded about the RS and user feedback allows the construction of more accurate RS simulators or user models in feed scenarios. (10) Rich stage data may help estimate selection bias more accurately and design better debiased algorithms. 22 | 23 | --- 24 | 25 | ### Dataset Organization 26 | 27 | The *RecFlow* dataset has the following folders. **all_stage** contains data from all stages. **realshow** contains data from the exposure space. **seq_effective_50_dict** contains each user's effective_view behavior sequence of length 50. **request_id_dict** stores the data from all stages in a first_level_key-second_level_key-value structure: the first_level_key is the *request_id*, the second_level_key is the stage label (i.e., *realshow, rerank_pos, rerank_neg, rank_pos, rank_neg, coarse_neg, prerank_neg*), and the value is the videos of that stage. **ubm_seq_request_id_dict** is for user behavior sequence modeling tasks and holds the same structure as **request_id_dict**. **id_cnt.pkl** records the number of unique IDs in each feature field. **retrieval_test.feather** is the test set for the retrieval experiments. **coarse_rank_test.feather** is the test set for the coarse ranking experiments. **rank_test.feather** is the test set for the ranking experiments. **realshow_video_info.feather** contains video information from the exposure space. **realshow_video_info_daily** contains the accumulated daily video information from the exposure space. 28 | ``` 29 | RecFlow 30 | ├── all_stage 31 | | ├──2024-01-13.feather 32 | | ├──2024-01-14.feather 33 | | ├──... 34 | | └──2024-02-18.feather 35 | | 36 | ├── realshow 37 | | ├──2024-01-13.feather 38 | | ├──2024-01-14.feather 39 | | ├──... 40 | | └──2024-02-18.feather 41 | | 42 | ├── seq_effective_50_dict 43 | | ├──2024-01-13.pkl 44 | | ├──2024-01-14.pkl 45 | | ├──... 46 | | └──2024-02-18.pkl 47 | | 48 | ├── request_id_dict 49 | | ├──2024-01-13.pkl 50 | | ├──2024-01-14.pkl 51 | | ├──... 52 | | └──2024-02-18.pkl 53 | | 54 | ├── ubm_seq_request_id_dict 55 | | ├──2024-01-13.pkl 56 | | ├──2024-01-14.pkl 57 | | ├──... 58 | | └──2024-02-18.pkl 59 | | 60 | └── others 61 | ├──id_cnt.pkl 62 | ├──retrieval_test.feather 63 | ├──coarse_rank_test.feather 64 | ├──rank_test.feather 65 | ├──realshow_video_info.feather 66 | └──realshow_video_info_daily 67 | ├──2024-01-13.feather 68 | ├──...
70 | └──2024-02-18.feather 71 | ``` 72 | 73 | #### Descriptions of the feature fields in RecFlow 74 | 75 | | Field Name | Description | Type | 76 | | -------------- | -------------------------------------------------------- | ------- | 77 | | request\_id | The unique ID of each recommendation request. | Integer | 78 | | request\_timestamp | The timestamp of each recommendation request. | Integer | 79 | | user\_id | The unique ID of each user. | Integer | 80 | | device\_id | The unique ID of each device. | Integer | 81 | | age | The user's age. | Integer | 82 | | gender | The user's gender. | Integer | 83 | | province | The user's province. | Integer | 84 | | video\_id | The unique ID of each video. | Integer | 85 | | author\_id | The unique ID of each author. | Integer | 86 | | category\_level\_one | The first-level category ID of each video. | Integer | 87 | | category\_level\_two | The second-level category ID of each video. | Integer | 88 | | upload\_type | The upload type ID of each video. | Integer | 89 | | upload\_timestamp | The upload timestamp of each video. | Integer | 90 | | duration | The duration of each video in milliseconds. | Integer | 91 | | realshow | A binary feedback signal indicating the video is exposed to the user. | Integer | 92 | | rerank\_pos | A binary feedback signal indicating the video ranks in the top-10 at the rerank stage. | Integer | 93 | | rerank\_neg | A binary feedback signal indicating the video ranks outside the top-10 at the rerank stage. | Integer | 94 | | rank\_pos | A binary feedback signal indicating the video ranks in the top-10 at the rank stage. | Integer | 95 | | rank\_neg | A binary feedback signal indicating the video ranks outside the top-10 at the rank stage. | Integer | 96 | | coarse\_neg | A binary feedback signal indicating the video ranks outside the top-500 at the coarse rank stage. | Integer | 97 | | prerank\_neg | A binary feedback signal indicating the video ranks outside the top-500 at the pre-rank stage. | Integer | 98 | | rank\_index | The rank position of the video at the rank stage. | Integer | 99 | | rerank\_index | The rank position of the video at the rerank stage. | Integer | 100 | | playing\_time | The length of time the user watched the video. | Integer | 101 | | effective\_view | A binary feedback signal indicating the user watches at least 30\% of the video. | Integer | 102 | | long\_view | A binary feedback signal indicating the user watches at least 100\% of the video. | Integer | 103 | | like | A binary feedback signal indicating the user hits the like button. | Integer | 104 | | follow | A binary feedback signal indicating the user follows the author of the video. | Integer | 105 | | forward | A binary feedback signal indicating the user forwards the video. | Integer | 106 | | comment | A binary feedback signal indicating the user writes a comment in the comment section of the video. | Integer | 107 | 108 | --- 109 | 110 | ### Code 111 | To run the code in this repository, download the data from [Drive](https://rec.ustc.edu.cn/share/883adf20-7e44-11ef-90e2-9beaf2bdc778) and place it in the `data` folder following the organization described above.
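Before launching the full pipelines, it can help to sanity-check the download. The snippet below is a minimal sketch of how the daily files can be loaded; it assumes the full data has been placed under `./data` as organized above (the repository itself only ships small `example.*` placeholders), and `2024-01-13` stands for any day in the covered date range.

```python
import pickle

import pandas as pd  # feather support requires pyarrow, see Requirements below

# Daily samples from the exposure space (one row per exposed video).
realshow_df = pd.read_feather("./data/realshow/2024-01-13.feather")
print(realshow_df.shape)

# Each user's effective_view behavior sequence (length 50).
with open("./data/seq_effective_50_dict/2024-01-13.pkl", "rb") as f:
    seq_dict = pickle.load(f)

# Stage data per request: request_id -> stage label -> videos of that stage.
with open("./data/request_id_dict/2024-01-13.pkl", "rb") as f:
    request_id_dict = pickle.load(f)

# Number of unique IDs per feature field, used to size embedding tables.
with open("./data/others/id_cnt.pkl", "rb") as f:
    id_cnt_dict = pickle.load(f)
print(id_cnt_dict)
```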
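The stage flags and feedback signals in the table combine naturally into training labels. The sketch below illustrates one way to slice a daily all_stage file; it assumes the columns are exposed exactly as named in the table, and the day and filters should be adapted to the experiment at hand.

```python
import pandas as pd

df = pd.read_feather("./data/all_stage/2024-01-13.feather")

# Exposure-space positives/negatives for a CTR-style task:
# exposed videos with vs. without an effective view.
exposed = df[df["realshow"] == 1]
positives = exposed[exposed["effective_view"] == 1]
negatives = exposed[exposed["effective_view"] == 0]

# Stage-filtered negatives, e.g. videos the rank stage scored outside its
# top-10, which can serve as extra negatives for a coarse ranking model.
rank_stage_negatives = df[df["rank_neg"] == 1]

print(len(positives), len(negatives), len(rank_stage_negatives))
```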
112 | 113 | #### Retrieval 114 | 115 | Baseline 116 | ``` 117 | bash ./retrieval/run_sasrec.sh 118 | ``` 119 | 120 | Hard Negative Mining 121 | ``` 122 | bash ./retrieval/run_sasrec_hardnegmining.sh 123 | ``` 124 | 125 | Interplay between Retrieval and Subsequent Stages 126 | ``` 127 | bash ./retrieval/run_sasrec_fsltr.sh 128 | ``` 129 | 130 | 131 | #### Coarse Ranking 132 | 133 | Baseline 134 | ``` 135 | bash ./coarse/run_dssm.sh 136 | ``` 137 | 138 | Data Distribution Shift 139 | ``` 140 | bash ./coarse/run_dssm_data_dist_shift_sampling.sh 141 | bash ./coarse/run_dssm_data_dist_shift_all.sh 142 | ``` 143 | 144 | Interplay between Retrieval and Subsequent Stages 145 | ``` 146 | bash ./coarse/run_dssm_fsltr.sh 147 | ``` 148 | 149 | Auxiliary Ranking 150 | ``` 151 | bash ./coarse/run_dssm_auxiliary_ranking.sh 152 | ``` 153 | 154 | User Behavior Sequence Modeling 155 | ``` 156 | bash ./coarse/run_dssm_ubm.sh 157 | ``` 158 | 159 | 160 | #### Ranking 161 | 162 | Baseline 163 | ``` 164 | bash ./rank/run_din.sh 165 | ``` 166 | 167 | Data Distribution Shift 168 | ``` 169 | bash ./rank/run_din_data_dist_shift_sampling.sh 170 | bash ./rank/run_din_data_dist_shift_all.sh 171 | ``` 172 | 173 | Interplay between Retrieval and Subsequent Stages 174 | ``` 175 | bash ./rank/run_din_fsltr.sh 176 | ``` 177 | 178 | Auxiliary Ranking 179 | ``` 180 | bash ./rank/run_din_auxiliary_ranking.sh 181 | ``` 182 | 183 | User Behavior Sequence Modeling 184 | ``` 185 | bash ./rank/run_din_ubm.sh 186 | ``` 187 | 188 | ### Requirements 189 | ``` 190 | python=3.7 191 | numpy=1.19.2 192 | pandas=1.3.5 193 | pyarrow=8.0.0 194 | scikit-learn=1.0.2 195 | pytorch=1.6 196 | faiss-gpu=1.7.1 197 | ``` -------------------------------------------------------------------------------- /coarse/eval_dssm.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DSSM 8 | from dataset import Prerank_Train_Dataset,Prerank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | return parser.parse_args() 31 | 32 | 33 | if __name__ == '__main__': 34 | args = parse_args() 35 | 36 | for k,v in vars(args).items(): 37 | print(f"{k}:{v}") 38 | 39 | #prepare data 40 | prefix = "../data" 41 | 42 | realshow_prefix = os.path.join(prefix, "realshow") 43 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 44 | print("testing file:") 45 | print(path_to_test_csv) 46 |
47 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 48 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 49 | print("testing seq file:") 50 | print(path_to_test_seq_pkl) 51 | 52 | others_prefix = os.path.join(prefix, "others") 53 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 54 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 55 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 56 | for k,v in id_cnt_dict.items(): 57 | print(f"{k}:{v}") 58 | 59 | path_to_test_pkl = os.path.join(others_prefix, "prerank_test.feather") 60 | print(f"path_to_test_pkl: {path_to_test_pkl}") 61 | 62 | #prepare model 63 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 64 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 65 | print(f"device: {device}") 66 | 67 | model = DSSM(args.emb_dim, args.seq_len, device, id_cnt_dict).to(device) 68 | 69 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.tag}.pkl" 70 | 71 | state_dict = torch.load(path_to_save_model) 72 | model.load_state_dict(state_dict) 73 | 74 | print("testing: realshow") 75 | 76 | test_realshow_dataset = Prerank_Train_Dataset( 77 | path_to_test_csv, 78 | args.seq_len, 79 | path_to_test_seq_pkl, 80 | ) 81 | 82 | test_realshow_loader = DataLoader( 83 | dataset=test_realshow_dataset, 84 | batch_size=args.infer_realshow_batch_size, 85 | shuffle=False, 86 | num_workers=0, 87 | drop_last=True 88 | ) 89 | 90 | print_str = evaluate(model, test_realshow_loader, device) 91 | 92 | print("testing: recall") 93 | 94 | test_recall_dataset = Prerank_Test_Dataset( 95 | path_to_test_pkl, 96 | args.seq_len, 97 | path_to_test_seq_pkl, 98 | max_candidate_cnt=470 99 | ) 100 | 101 | test_recall_loader = DataLoader( 102 | dataset=test_recall_dataset, 103 | batch_size=args.infer_recall_batch_size, 104 | shuffle=False, 105 | num_workers=0, 106 | drop_last=True 107 | ) 108 | 109 | target_print = evaluate_recall(model, test_recall_loader, device) 110 | 111 | print("realshow") 112 | print(print_str) 113 | 114 | print("recall") 115 | print(target_print[0]) 116 | print(target_print[1]) -------------------------------------------------------------------------------- /coarse/eval_dssm_auxiliary_ranking.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DSSM_AuxRanking 8 | from dataset import Prerank_Train_Dataset,Prerank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 |
parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | # flow param 31 | parser.add_argument('--flows', type=str, default="", help='stage flows.') 32 | 33 | parser.add_argument('--rank_loss_weight', type=float, default=1e-2, help='weight of the auxiliary ranking loss.') 34 | 35 | return parser.parse_args() 36 | 37 | 38 | if __name__ == '__main__': 39 | args = parse_args() 40 | 41 | for k,v in vars(args).items(): 42 | print(f"{k}:{v}") 43 | 44 | #prepare data 45 | prefix = "../data" 46 | 47 | realshow_prefix = os.path.join(prefix, "realshow") 48 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 49 | print("testing file:") 50 | print(path_to_test_csv) 51 | 52 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 53 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 54 | print("testing seq file:") 55 | print(path_to_test_seq_pkl) 56 | 57 | others_prefix = os.path.join(prefix, "others") 58 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 59 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 60 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 61 | for k,v in id_cnt_dict.items(): 62 | print(f"{k}:{v}") 63 | 64 | path_to_test_pkl = os.path.join(others_prefix, "prerank_test.feather") 65 | print(f"path_to_test_pkl: {path_to_test_pkl}") 66 | 67 | #prepare model 68 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 69 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 70 | print(f"device: {device}") 71 | 72 | model = DSSM_AuxRanking( 73 | args.emb_dim, args.seq_len, 74 | device, id_cnt_dict 75 | ).to(device) 76 | 77 | path_to_save_model=f"./checkpoints/{args.batch_size}_{args.lr}_{args.flows}_{args.rank_loss_weight}_{args.tag}.pkl" 78 | 79 | state_dict = torch.load(path_to_save_model) 80 | 81 | model.load_state_dict(state_dict) 82 | 83 | print("testing: realshow") 84 | 85 | test_realshow_dataset = Prerank_Train_Dataset( 86 | path_to_test_csv, 87 | args.seq_len, 88 | path_to_test_seq_pkl, 89 | ) 90 | 91 | test_realshow_loader = DataLoader( 92 | dataset=test_realshow_dataset, 93 | batch_size=args.infer_realshow_batch_size, 94 | shuffle=False, 95 | num_workers=0, 96 | drop_last=True 97 | ) 98 | print_str = evaluate(model, test_realshow_loader, device) 99 | 100 | print("testing: recall") 101 | 102 | test_recall_dataset = Prerank_Test_Dataset( 103 | path_to_test_pkl, 104 | args.seq_len, 105 | path_to_test_seq_pkl, 106 | max_candidate_cnt=470 107 | ) 108 | 109 | test_recall_loader = DataLoader( 110 | dataset=test_recall_dataset, 111 | batch_size=args.infer_recall_batch_size, 112 | shuffle=False, 113 | num_workers=0, 114 | drop_last=True 115 | ) 116 | target_print = evaluate_recall(model, test_recall_loader, device) 117 | 118 | print("realshow") 119 | print(print_str) 120 | 121 | print("recall") 122 | print(target_print[0]) 123 | print(target_print[1]) -------------------------------------------------------------------------------- /coarse/eval_dssm_data_dist_shift_all.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DSSM 8 | from dataset import Prerank_Train_Dataset,Prerank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size',
type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='stage flows.') 31 | 32 | return parser.parse_args() 33 | 34 | 35 | if __name__ == '__main__': 36 | args = parse_args() 37 | 38 | for k,v in vars(args).items(): 39 | print(f"{k}:{v}") 40 | 41 | #prepare data 42 | prefix = "../data" 43 | 44 | realshow_prefix = os.path.join(prefix, "realshow") 45 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 46 | print("testing file:") 47 | print(path_to_test_csv) 48 | 49 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 50 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 51 | print("testing seq file:") 52 | print(path_to_test_seq_pkl) 53 | 54 | others_prefix = os.path.join(prefix, "others") 55 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 56 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 57 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 58 | for k,v in id_cnt_dict.items(): 59 | print(f"{k}:{v}") 60 | 61 | path_to_test_pkl = os.path.join(others_prefix, "prerank_test.feather") 62 | print(f"path_to_test_pkl: {path_to_test_pkl}") 63 | 64 | #prepare model 65 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 66 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 67 | print(f"device: {device}") 68 | 69 | model = DSSM(args.emb_dim, args.seq_len, device, id_cnt_dict).to(device) 70 | 71 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}_{args.tag}.pkl" 72 | 73 | state_dict = torch.load(path_to_save_model) 74 | 75 | model.load_state_dict(state_dict) 76 | 77 | print("testing: realshow") 78 | 79 | test_realshow_dataset = Prerank_Train_Dataset( 80 | path_to_test_csv, 81 | args.seq_len, 82 | path_to_test_seq_pkl, 83 | ) 84 | 85 | test_realshow_loader = DataLoader( 86 | dataset=test_realshow_dataset, 87 | batch_size=args.infer_realshow_batch_size, 88 | shuffle=False, 89 | num_workers=0, 90 | drop_last=True 91 | ) 92 | print_str = evaluate(model, test_realshow_loader, device) 93 | 94 | print("testing: recall") 95 | 96 | test_recall_dataset = Prerank_Test_Dataset( 97 | path_to_test_pkl, 98 | args.seq_len, 99 | path_to_test_seq_pkl, 100 | max_candidate_cnt=470 101 | ) 102 | 103 | test_recall_loader = DataLoader( 104 | dataset=test_recall_dataset, 105 | batch_size=args.infer_recall_batch_size, 106 | shuffle=False, 107 | num_workers=0, 108 | drop_last=True 109 | ) 110 | target_print = evaluate_recall(model, test_recall_loader, device) 111 | 112 | print("realshow") 113 | print(print_str) 114 | 115 | print("recall") 116 | print(target_print[0]) 117 | print(target_print[1]) -------------------------------------------------------------------------------- /coarse/eval_dssm_data_dist_shift_sampling.py:
-------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DSSM 8 | from dataset import Prerank_Train_Dataset,Prerank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='stage flows.') 31 | parser.add_argument('--k_flow_negs', type=str, default="", help='number of flow negatives.') 32 | 33 | return parser.parse_args() 34 | 35 | 36 | if __name__ == '__main__': 37 | args = parse_args() 38 | 39 | for k,v in vars(args).items(): 40 | print(f"{k}:{v}") 41 | 42 | #prepare data 43 | prefix = "../data" 44 | 45 | realshow_prefix = os.path.join(prefix, "realshow") 46 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 47 | print("testing file:") 48 | print(path_to_test_csv) 49 | 50 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 51 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 52 | print("testing seq file:") 53 | print(path_to_test_seq_pkl) 54 | 55 | others_prefix = os.path.join(prefix, "others") 56 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 57 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 58 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 59 | for k,v in id_cnt_dict.items(): 60 | print(f"{k}:{v}") 61 | 62 | path_to_test_pkl = os.path.join(others_prefix, "prerank_test.feather") 63 | print(f"path_to_test_pkl: {path_to_test_pkl}") 64 | 65 | #prepare model 66 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 67 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 68 | print(f"device: {device}") 69 | 70 | model = DSSM(args.emb_dim, args.seq_len, device, id_cnt_dict).to(device) 71 | 72 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}-{args.k_flow_negs}_{args.tag}.pkl" 73 | 74 | state_dict = torch.load(path_to_save_model) 75 | 76 | model.load_state_dict(state_dict) 77 | 78 | print("testing: realshow") 79 | 80 | test_realshow_dataset = Prerank_Train_Dataset( 81 | path_to_test_csv, 82 | args.seq_len, 83 | path_to_test_seq_pkl, 84 | ) 85 | 86 | test_realshow_loader = DataLoader( 87 | dataset=test_realshow_dataset, 88 | batch_size=args.infer_realshow_batch_size, 89 | shuffle=False, 90 | num_workers=0, 91 | drop_last=True 92 | ) 93 | print_str = evaluate(model, test_realshow_loader, device) 94 | 95 | print("testing: recall") 96 | 97 | test_recall_dataset =
Prerank_Test_Dataset( 98 | path_to_test_pkl, 99 | args.seq_len, 100 | path_to_test_seq_pkl, 101 | max_candidate_cnt=470 102 | ) 103 | 104 | test_recall_loader = DataLoader( 105 | dataset=test_recall_dataset, 106 | batch_size=args.infer_recall_batch_size, 107 | shuffle=False, 108 | num_workers=0, 109 | drop_last=True 110 | ) 111 | target_print = evaluate_recall(model, test_recall_loader, device) 112 | 113 | print("realshow") 114 | print(print_str) 115 | 116 | print("recall") 117 | print(target_print[0]) 118 | print(target_print[1]) -------------------------------------------------------------------------------- /coarse/eval_dssm_fsltr.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DSSM 8 | from dataset import Prerank_Train_Dataset,Prerank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default='mcd_prerank_neg', help='stage flows.') 31 | 32 | parser.add_argument('--flow_nums', type=str, default="1", help='number of negative samples.') 33 | 34 | parser.add_argument('--flow_weights', type=str, default="1.0", help='weights for each flow.') 35 | 36 | return parser.parse_args() 37 | 38 | 39 | if __name__ == '__main__': 40 | args = parse_args() 41 | 42 | for k,v in vars(args).items(): 43 | print(f"{k}:{v}") 44 | 45 | #prepare data 46 | prefix = "../data" 47 | 48 | realshow_prefix = os.path.join(prefix, "realshow") 49 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 50 | print("testing file:") 51 | print(path_to_test_csv) 52 | 53 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 54 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 55 | print("testing seq file:") 56 | print(path_to_test_seq_pkl) 57 | 58 | others_prefix = os.path.join(prefix, "others") 59 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 60 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 61 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 62 | for k,v in id_cnt_dict.items(): 63 | print(f"{k}:{v}") 64 | 65 | path_to_test_pkl = os.path.join(others_prefix, "prerank_test.feather") 66 | print(f"path_to_test_pkl: {path_to_test_pkl}") 67 | 68 | #prepare model 69 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 70 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 71 | print(f"device: {device}") 72 | 73 | model = DSSM(args.emb_dim,
args.seq_len, device, id_cnt_dict).to(device) 74 | 75 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}_{args.flow_nums}_{args.flow_weights}_{args.tag}.pkl" 76 | 77 | state_dict = torch.load(path_to_save_model) 78 | 79 | model.load_state_dict(state_dict) 80 | 81 | print("testing: realshow") 82 | test_realshow_dataset = Prerank_Train_Dataset( 83 | path_to_test_csv, 84 | args.seq_len, 85 | path_to_test_seq_pkl, 86 | ) 87 | 88 | test_realshow_loader = DataLoader( 89 | dataset=test_realshow_dataset, 90 | batch_size=args.infer_realshow_batch_size, 91 | shuffle=False, 92 | num_workers=0, 93 | drop_last=True 94 | ) 95 | print_str = evaluate(model, test_realshow_loader, device) 96 | 97 | print("testing: recall") 98 | 99 | test_recall_dataset = Prerank_Test_Dataset( 100 | path_to_test_pkl, 101 | args.seq_len, 102 | path_to_test_seq_pkl, 103 | max_candidate_cnt=470 104 | ) 105 | 106 | test_recall_loader = DataLoader( 107 | dataset=test_recall_dataset, 108 | batch_size=args.infer_recall_batch_size, 109 | shuffle=False, 110 | num_workers=0, 111 | drop_last=True 112 | ) 113 | target_print = evaluate_recall(model, test_recall_loader, device) 114 | 115 | print("realshow") 116 | print(print_str) 117 | 118 | print("recall") 119 | print(target_print[0]) 120 | print(target_print[1]) -------------------------------------------------------------------------------- /coarse/eval_dssm_ubm.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DSSM_UBM 8 | from dataset import Prerank_Train_UBM_Dataset,Prerank_Test_UBM_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='stage flows.') 31 | 32 | return parser.parse_args() 33 | 34 | 35 | if __name__ == '__main__': 36 | args = parse_args() 37 | 38 | #prepare data 39 | prefix = "../data" 40 | 41 | realshow_prefix = os.path.join(prefix, "realshow") 42 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 43 | print(f"testing file: {path_to_test_csv}") 44 | 45 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 46 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 47 | print(f"testing seq file: {path_to_test_seq_pkl}") 48 | 49 | request_id_prefix = os.path.join(prefix, "ubm_seq_request_id_dict") 50 | path_to_request_id_pkl = os.path.join(request_id_prefix,
"2024-02-18.pkl") 51 | print(f"testing request_id file: {path_to_request_id_pkl}") 52 | 53 | others_prefix = os.path.join(prefix, "others") 54 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 55 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 56 | 57 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 58 | for k,v in id_cnt_dict.items(): 59 | print(f"{k}:{v}") 60 | 61 | path_to_test_pkl = os.path.join(others_prefix, "prerank_test.feather") 62 | print(f"path_to_test_pkl: {path_to_test_pkl}") 63 | 64 | #prepare model 65 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 66 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 67 | print(f"device: {device}") 68 | 69 | per_flow_seq_len = 10 70 | n_flows = len(args.flows.split(',')) 71 | flow_seq_len = per_flow_seq_len * n_flows 72 | 73 | max_candidate_cnt = 430 74 | 75 | model = DSSM_UBM( 76 | args.emb_dim, 77 | args.seq_len, 78 | device, 79 | per_flow_seq_len,flow_seq_len, 80 | id_cnt_dict 81 | ).to(device) 82 | 83 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_flows-{args.flows}_{args.tag}.pkl" 84 | 85 | state_dict = torch.load(path_to_save_model) 86 | 87 | model.load_state_dict(state_dict) 88 | 89 | print("testing: realshow") 90 | 91 | test_realshow_dataset = Prerank_Train_UBM_Dataset( 92 | path_to_test_csv, 93 | args.seq_len, 94 | path_to_test_seq_pkl, 95 | path_to_request_id_pkl, 96 | args.flows, 97 | per_flow_seq_len 98 | ) 99 | 100 | test_realshow_loader = DataLoader( 101 | dataset=test_realshow_dataset, 102 | batch_size=args.infer_realshow_batch_size, 103 | shuffle=False, 104 | num_workers=0, 105 | drop_last=True 106 | ) 107 | print_str = evaluate(model, test_realshow_loader, device) 108 | 109 | print("testing: recall") 110 | 111 | test_recall_dataset = Prerank_Test_UBM_Dataset( 112 | path_to_test_pkl, 113 | args.seq_len, 114 | path_to_test_seq_pkl, 115 | path_to_request_id_pkl, 116 | args.flows, per_flow_seq_len, 117 | max_candidate_cnt 118 | ) 119 | 120 | test_recall_loader = DataLoader( 121 | dataset=test_recall_dataset, 122 | batch_size=args.infer_recall_batch_size, 123 | shuffle=False, 124 | num_workers=0, 125 | drop_last=True 126 | ) 127 | target_print = evaluate_recall(model, test_recall_loader, device) 128 | 129 | print("realshow") 130 | print(print_str) 131 | 132 | print("recall") 133 | print(target_print[0]) 134 | print(target_print[1]) -------------------------------------------------------------------------------- /coarse/file.txt: -------------------------------------------------------------------------------- 1 | 2024-01-13 2 | 2024-01-14 3 | 2024-01-15 4 | 2024-01-16 5 | 2024-01-17 6 | 2024-01-18 7 | 2024-01-19 8 | 2024-01-20 9 | 2024-01-21 10 | 2024-01-22 11 | 2024-01-23 12 | 2024-01-24 13 | 2024-01-25 14 | 2024-01-26 15 | 2024-01-27 16 | 2024-01-28 17 | 2024-01-29 18 | 2024-01-30 19 | 2024-01-31 20 | 2024-02-01 21 | 2024-02-02 22 | 2024-02-03 23 | 2024-02-04 24 | 2024-02-05 25 | 2024-02-06 26 | 2024-02-07 27 | 2024-02-08 28 | 2024-02-09 29 | 2024-02-10 30 | 2024-02-11 31 | 2024-02-12 32 | 2024-02-13 33 | 2024-02-14 34 | 2024-02-15 35 | 2024-02-16 36 | 2024-02-17 -------------------------------------------------------------------------------- /coarse/metrics.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | from sklearn.metrics import roc_auc_score, log_loss 4 | 5 | def evaluate(model, data_loader, device): 6 | model.eval() 7 | 8 | logits_lst = np.zeros(shape=(962560,), 
dtype=np.float32) 9 | label_lst = np.zeros(shape=(962560,), dtype=np.float32) 10 | 11 | with torch.no_grad(): 12 | 13 | start_index = 0 14 | end_index = 0 15 | 16 | for inputs in data_loader: 17 | 18 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]] 19 | 20 | logits = model(inputs_LongTensor) #b 21 | 22 | logits = torch.sigmoid(logits) 23 | 24 | end_index += inputs[-1].size(0) 25 | 26 | label_lst[start_index:end_index] = inputs[-1].numpy().astype(np.float32) 27 | 28 | logits_lst[start_index:end_index] = logits.cpu().numpy().astype(np.float32) 29 | 30 | start_index = end_index 31 | 32 | test_auc = roc_auc_score(label_lst, logits_lst) 33 | test_logloss = log_loss(label_lst, logits_lst) 34 | 35 | print_str = f"auc\tlogloss: {test_auc:.6f}\t{test_logloss:.6f}" 36 | 37 | return print_str 38 | 39 | 40 | def evaluate_recall(model, data_loader, device): 41 | model.eval() 42 | 43 | target_top_k = [50,100,200] 44 | 45 | total_target_cnt = 0.0 46 | 47 | target_recall_lst = [0.0 for _ in range(len(target_top_k))] 48 | target_ndcg_lst = [0.0 for _ in range(len(target_top_k))] 49 | 50 | with torch.no_grad(): 51 | 52 | for inputs in data_loader: 53 | 54 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-2]] 55 | 56 | logits = model.forward_recall(inputs_LongTensor) #b*500 57 | 58 | logits = logits.cpu().numpy() 59 | 60 | labels = inputs[-2].numpy().astype(np.float32) #b*500 61 | 62 | n_photos = inputs[-1].numpy() #b 63 | 64 | for i in range(n_photos.shape[0]): 65 | 66 | n_photo = n_photos[i] 67 | 68 | logit = logits[i,:n_photo] 69 | label = labels[i,:n_photo] 70 | 71 | logit_descending_index = np.argsort(logit*-1.0) #descending order 72 | logit_descending_rank = np.argsort(logit_descending_index) #descending order 73 | 74 | #target metric 75 | if np.sum(label) > 0 and np.sum(label)!=n_photo: 76 | target_pos_index = np.nonzero(label)[0] 77 | target_pos_rank = logit_descending_rank[target_pos_index] 78 | for i in range(len(target_top_k)): 79 | target_recall_lst[i] += np.sum(target_pos_rank < target_top_k[i]) ... -------------------------------------------------------------------------------- /coarse/run_dssm.py: -------------------------------------------------------------------------------- ... 120 | if iter_step % args.print_freq == 0: 121 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}") 122 | 123 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.tag}.pkl" 124 | 125 | torch.save(model.state_dict(), path_to_save_model) 126 | 127 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /coarse/run_dssm.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=dssm-1st 6 | 7 | python -B -u run_dssm.py \ 8 | --epochs=1 \ 9 | --batch_size=1024 \ 10 | --infer_realshow_batch_size=1024 \ 11 | --infer_recall_batch_size=900 \ 12 | --emb_dim=8 \ 13 | --lr=1e-2 \ 14 | --seq_len=50 \ 15 | --cuda='0' \ 16 | --print_freq=100 \ 17 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${tag}.log" 2>&1 18 | 19 | python -B -u eval_dssm.py \ 20 | --epochs=1 \ 21 | --batch_size=1024 \ 22 | --infer_realshow_batch_size=1024 \ 23 | --infer_recall_batch_size=900 \ 24 | --emb_dim=8 \ 25 | --lr=1e-2 \ 26 | --seq_len=50 \ 27 | --cuda='0' \ 28 | --print_freq=100 \ 29 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /coarse/run_dssm_auxiliary_ranking.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | import torch.nn.functional as F 7
| from torch.utils.data import DataLoader 8 | 9 | from models import DSSM_AuxRanking 10 | from dataset import Prerank_Train_Auxiliary_Ranking_Dataset 11 | 12 | from utils import load_pkl 13 | 14 | def parse_args(): 15 | parser = argparse.ArgumentParser() 16 | 17 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 18 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 19 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 21 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 22 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 23 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 24 | 25 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 26 | 27 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 28 | 29 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 30 | 31 | # flow param 32 | parser.add_argument('--flows', type=str, default="", help='stage flows.') 33 | 34 | parser.add_argument('--rank_loss_weight', type=float, default=1e-2, help='weight of the auxiliary ranking loss.') 35 | 36 | return parser.parse_args() 37 | 38 | 39 | if __name__ == '__main__': 40 | args = parse_args() 41 | 42 | for k,v in vars(args).items(): 43 | print(f"{k}:{v}") 44 | 45 | #prepare data 46 | prefix = "../data" 47 | 48 | realshow_prefix = os.path.join(prefix, "realshow") 49 | path_to_train_csv_lst = [] 50 | with open("./file.txt", mode='r') as f: 51 | lines = f.readlines() 52 | for line in lines: 53 | tmp_csv_path = os.path.join(realshow_prefix, line.strip()+'.feather') 54 | path_to_train_csv_lst.append(tmp_csv_path) 55 | 56 | num_of_train_csv = len(path_to_train_csv_lst) 57 | print("training files:") 58 | print(f"number of train_csv: {num_of_train_csv}") 59 | for idx, filepath in enumerate(path_to_train_csv_lst): 60 | print(f"{idx}: {filepath}") 61 | 62 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 63 | path_to_train_seq_pkl_lst = [] 64 | with open("./file.txt", mode='r') as f: 65 | lines = f.readlines() 66 | for line in lines: 67 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 68 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 69 | 70 | print("training seq files:") 71 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 72 | print(f"{idx}: {filepath}") 73 | 74 | request_id_prefix = os.path.join(prefix, "request_id_dict") 75 | path_to_train_request_pkl_lst = [] 76 | with open("./file.txt", mode='r') as f: 77 | lines = f.readlines() 78 | for line in lines: 79 | tmp_request_pkl_path = os.path.join(request_id_prefix, line.strip()+".pkl") 80 | path_to_train_request_pkl_lst.append(tmp_request_pkl_path) 81 | 82 | print("training request files:") 83 | for idx, filepath in enumerate(path_to_train_request_pkl_lst): 84 | print(f"{idx}: {filepath}") 85 | 86 | others_prefix = os.path.join(prefix, "others") 87 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 88 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 89 | 90 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 91 | for k,v in id_cnt_dict.items(): 92 | print(f"{k}:{v}") 93 | 94 | #prepare model 95 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 96 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 97 |
print(f"device: {device}") 98 | 99 | model = DSSM_AuxRanking(args.emb_dim, args.seq_len, device, id_cnt_dict).to(device) 100 | 101 | loss_fn = nn.BCEWithLogitsLoss().to(device) 102 | 103 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 104 | 105 | padding_num = -2**30 + 1 106 | 107 | n_flows = len(args.flows.split(',')) 108 | n_flow_photos = n_flows * 10 109 | 110 | #training 111 | for epoch in range(args.epochs): 112 | for n_day in range(num_of_train_csv): 113 | 114 | train_dataset = Prerank_Train_Auxiliary_Ranking_Dataset( 115 | path_to_train_csv_lst[n_day], 116 | args.seq_len, 117 | path_to_train_seq_pkl_lst[n_day], 118 | path_to_train_request_pkl_lst[n_day], 119 | args.flows 120 | ) 121 | 122 | train_loader = DataLoader( 123 | dataset=train_dataset, 124 | batch_size=args.batch_size, 125 | shuffle=True, 126 | num_workers=1, 127 | drop_last=True 128 | ) 129 | 130 | for iter_step, inputs in enumerate(train_loader): 131 | 132 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-2]] 133 | 134 | label = torch.FloatTensor(inputs[-1].numpy()).to(device) #b 135 | 136 | logits, flow_logits = model.forward_train(inputs_LongTensor) #b*1,b*p 137 | 138 | loss = loss_fn(logits.squeeze(), label) 139 | 140 | flow_mask = torch.FloatTensor(inputs[-2].numpy()).to(device) # b*p 141 | 142 | flow_logits = torch.where( 143 | flow_mask > 0, 144 | flow_logits, 145 | torch.full_like(flow_logits, fill_value=padding_num) 146 | ) 147 | 148 | logits_repeat = logits.repeat([1,n_flow_photos]) #b*p 149 | 150 | bpr_logits = logits_repeat - flow_logits 151 | 152 | rank_loss = F.binary_cross_entropy_with_logits( 153 | bpr_logits, 154 | torch.ones_like(bpr_logits), 155 | weight=label.unsqueeze(1).repeat([1,n_flow_photos]), 156 | reduction='sum' 157 | ) / label.unsqueeze(1).repeat([1,n_flow_photos]).sum() 158 | 159 | all_loss = loss + args.rank_loss_weight * rank_loss 160 | 161 | optimizer.zero_grad() 162 | 163 | all_loss.backward() 164 | 165 | optimizer.step() 166 | 167 | if iter_step % args.print_freq == 0: 168 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tall_loss:{all_loss.detach().cpu().item():.6f}\tloss:{loss.detach().cpu().item():.6f}\trank_loss:{rank_loss.detach().cpu().item():.6f}") 169 | 170 | path_to_save_model=f"./checkpoints/{args.batch_size}_{args.lr}_{args.flows}_{args.rank_loss_weight}_{args.tag}.pkl" 171 | 172 | torch.save(model.state_dict(), path_to_save_model) 173 | 174 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /coarse/run_dssm_auxiliary_ranking.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=dssm_auxiliary_ranking-1st 6 | 7 | flows=rerank_pos,rerank_neg,rank_pos,rank_neg 8 | 9 | rank_loss_weight=0.1 10 | 11 | python -B -u run_dssm_auxiliary_ranking.py \ 12 | --epochs=1 \ 13 | --batch_size=1024 \ 14 | --infer_realshow_batch_size=1024 \ 15 | --infer_recall_batch_size=900 \ 16 | --emb_dim=8 \ 17 | --lr=1e-2 \ 18 | --seq_len=50 \ 19 | --cuda='0' \ 20 | --print_freq=100 \ 21 | --flows=${flows} \ 22 | --rank_loss_weight=${rank_loss_weight} \ 23 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${flows}_${rank_loss_weight}_${tag}.log" 2>&1 24 | 25 | python -B -u eval_dssm_auxiliary_ranking.py \ 26 | --epochs=1 \ 27 | --batch_size=1024 \ 28 | --infer_realshow_batch_size=1024 \ 29 | --infer_recall_batch_size=900 \ 30 | --emb_dim=8 \ 31 | --lr=1e-2 \ 32 | --seq_len=50 \ 33 | 
--cuda='0' \ 34 | --print_freq=100 \ 35 | --flows=${flows} \ 36 | --rank_loss_weight=${rank_loss_weight} \ 37 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${flows}_${rank_loss_weight}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /coarse/run_dssm_data_dist_shift_all.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | from torch.utils.data import DataLoader 7 | 8 | from models import DSSM 9 | from dataset import Prerank_Train_Data_Dist_Shift_All_Dataset 10 | 11 | from utils import load_pkl 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='stage flows.') 31 | 32 | return parser.parse_args() 33 | 34 | 35 | if __name__ == '__main__': 36 | args = parse_args() 37 | 38 | for k,v in vars(args).items(): 39 | print(f"{k}:{v}") 40 | 41 | #prepare data 42 | prefix = "../data" 43 | 44 | remap_daily_prefix = os.path.join(prefix, "remap_daily") 45 | path_to_train_csv_lst = [] 46 | with open("./file.txt", mode='r') as f: 47 | lines = f.readlines() 48 | for line in lines: 49 | tmp_csv_path = os.path.join(remap_daily_prefix, line.strip()+'.feather') 50 | path_to_train_csv_lst.append(tmp_csv_path) 51 | 52 | num_of_train_csv = len(path_to_train_csv_lst) 53 | print("training files:") 54 | print(f"number of train_csv: {num_of_train_csv}") 55 | for idx, filepath in enumerate(path_to_train_csv_lst): 56 | print(f"{idx}: {filepath}") 57 | 58 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 59 | path_to_train_seq_pkl_lst = [] 60 | with open("./file.txt", mode='r') as f: 61 | lines = f.readlines() 62 | for line in lines: 63 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 64 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 65 | 66 | print("training seq files:") 67 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 68 | print(f"{idx}: {filepath}") 69 | 70 | others_prefix = os.path.join(prefix, "others") 71 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 72 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 73 | 74 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 75 | for k,v in id_cnt_dict.items(): 76 | print(f"{k}:{v}") 77 | 78 | #prepare model 79 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 80 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 81 | print(f"device: {device}") 82 | 83 | model = DSSM(args.emb_dim, args.seq_len, device, id_cnt_dict).to(device) 84 | 85 | loss_fn =
nn.BCEWithLogitsLoss().to(device) 86 | 87 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 88 | 89 | #training 90 | for epoch in range(args.epochs): 91 | for n_day in range(num_of_train_csv): 92 | train_dataset = Prerank_Train_Data_Dist_Shift_All_Dataset( 93 | path_to_train_csv_lst[n_day], 94 | args.seq_len, 95 | path_to_train_seq_pkl_lst[n_day], 96 | args.flows 97 | ) 98 | 99 | train_loader = DataLoader( 100 | dataset=train_dataset, 101 | batch_size=args.batch_size, 102 | shuffle=True, 103 | num_workers=1, 104 | drop_last=True 105 | ) 106 | 107 | for iter_step, inputs in enumerate(train_loader): 108 | 109 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]] 110 | 111 | label = torch.FloatTensor(inputs[-1].numpy()).to(device) #b*k 112 | 113 | logits = model(inputs_LongTensor) #b 114 | 115 | loss = loss_fn(logits, label) 116 | 117 | optimizer.zero_grad() 118 | 119 | loss.backward() 120 | 121 | optimizer.step() 122 | 123 | if iter_step % args.print_freq == 0: 124 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}") 125 | 126 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}_{args.tag}.pkl" 127 | 128 | torch.save(model.state_dict(), path_to_save_model) 129 | 130 | print(f"save model to {path_to_save_model} DONE.") 131 | -------------------------------------------------------------------------------- /coarse/run_dssm_data_dist_shift_all.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=dssm_data_dist_shift_all-1st 6 | 7 | flows=rank_neg 8 | 9 | python -B -u run_dssm_data_dist_shift_all.py \ 10 | --epochs=1 \ 11 | --batch_size=1024 \ 12 | --infer_realshow_batch_size=1024 \ 13 | --infer_recall_batch_size=900 \ 14 | --emb_dim=8 \ 15 | --lr=1e-2 \ 16 | --seq_len=50 \ 17 | --cuda='0' \ 18 | --print_freq=100 \ 19 | --flows=${flows} \ 20 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${flows}_${tag}.log" 2>&1 21 | 22 | python -B -u eval_dssm_data_dist_shift_all.py \ 23 | --epochs=1 \ 24 | --batch_size=1024 \ 25 | --infer_realshow_batch_size=1024 \ 26 | --infer_recall_batch_size=900 \ 27 | --emb_dim=8 \ 28 | --lr=1e-2 \ 29 | --seq_len=50 \ 30 | --cuda='0' \ 31 | --print_freq=100 \ 32 | --flows=${flows} \ 33 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${flows}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /coarse/run_dssm_data_dist_shift_sampling.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | from torch.utils.data import DataLoader 7 | 8 | from models import DSSM 9 | from dataset import Prerank_Train_Data_Dist_Shift_Sampling_Dataset 10 | 11 | from utils import load_pkl 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', 
type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='stage flows.') 31 | parser.add_argument('--k_flow_negs', type=str, default="", help='number of flow negatives.') 32 | 33 | return parser.parse_args() 34 | 35 | 36 | if __name__ == '__main__': 37 | args = parse_args() 38 | 39 | for k,v in vars(args).items(): 40 | print(f"{k}:{v}") 41 | 42 | #prepare data 43 | prefix = "../data" 44 | 45 | all_stage_prefix = os.path.join(prefix, "all_stage") 46 | path_to_train_csv_lst = [] 47 | with open("./file.txt", mode='r') as f: 48 | lines = f.readlines() 49 | for line in lines: 50 | tmp_csv_path = os.path.join(all_stage_prefix, line.strip()+'.feather') 51 | path_to_train_csv_lst.append(tmp_csv_path) 52 | 53 | num_of_train_csv = len(path_to_train_csv_lst) 54 | print("training files:") 55 | print(f"number of train_csv: {num_of_train_csv}") 56 | for idx, filepath in enumerate(path_to_train_csv_lst): 57 | print(f"{idx}: {filepath}") 58 | 59 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 60 | path_to_train_seq_pkl_lst = [] 61 | with open("./file.txt", mode='r') as f: 62 | lines = f.readlines() 63 | for line in lines: 64 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 65 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 66 | 67 | print("training seq files:") 68 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 69 | print(f"{idx}: {filepath}") 70 | 71 | others_prefix = os.path.join(prefix, "others") 72 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 73 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 74 | 75 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 76 | for k,v in id_cnt_dict.items(): 77 | print(f"{k}:{v}") 78 | 79 | #prepare model 80 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 81 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 82 | print(f"device: {device}") 83 | 84 | model = DSSM(args.emb_dim, args.seq_len, device, id_cnt_dict).to(device) 85 | 86 | loss_fn = nn.BCEWithLogitsLoss().to(device) 87 | 88 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 89 | 90 | #training 91 | for epoch in range(args.epochs): 92 | for n_day in range(num_of_train_csv): 93 | train_dataset = Prerank_Train_Data_Dist_Shift_Sampling_Dataset( 94 | path_to_train_csv_lst[n_day], 95 | args.seq_len, 96 | path_to_train_seq_pkl_lst[n_day], 97 | args.flows, 98 | args.k_flow_negs 99 | ) 100 | 101 | train_loader = DataLoader( 102 | dataset=train_dataset, 103 | batch_size=args.batch_size, 104 | shuffle=True, 105 | num_workers=1, 106 | drop_last=True 107 | ) 108 | 109 | for iter_step, inputs in enumerate(train_loader): 110 | 111 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]] 112 | 113 | label = torch.FloatTensor(inputs[-1].numpy()).to(device) #b 114 | 115 | logits = model(inputs_LongTensor) #b 116 | 117 | loss = loss_fn(logits, label) 118 | 119 | optimizer.zero_grad() 120 | 121 | loss.backward() 122 | 123 | optimizer.step() 124 | 125 | if iter_step % args.print_freq == 0: 126 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}") 127 | 128 |
path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}-{args.k_flow_negs}_{args.tag}.pkl" 129 | 130 | torch.save(model.state_dict(), path_to_save_model) 131 | 132 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /coarse/run_dssm_data_dist_shift_sampling.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=dssm_data_dist_shift_sampling-1st 6 | 7 | flows=rank_neg 8 | k_flow_negs=1 9 | 10 | python -B -u run_dssm_data_dist_shift_sampling.py \ 11 | --epochs=1 \ 12 | --batch_size=1024 \ 13 | --infer_realshow_batch_size=1024 \ 14 | --infer_recall_batch_size=900 \ 15 | --emb_dim=8 \ 16 | --lr=1e-2 \ 17 | --seq_len=50 \ 18 | --cuda='0' \ 19 | --print_freq=100 \ 20 | --flows=${flows} \ 21 | --k_flow_negs=${k_flow_negs} \ 22 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${flows}_${k_flow_negs}_${tag}.log" 2>&1 23 | 24 | python -B -u eval_dssm_data_dist_shift_sampling.py \ 25 | --epochs=1 \ 26 | --batch_size=1024 \ 27 | --infer_realshow_batch_size=1024 \ 28 | --infer_recall_batch_size=900 \ 29 | --emb_dim=8 \ 30 | --lr=1e-2 \ 31 | --seq_len=50 \ 32 | --cuda='0' \ 33 | --print_freq=100 \ 34 | --flows=${flows} \ 35 | --k_flow_negs=${k_flow_negs} \ 36 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${flows}_${k_flow_negs}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /coarse/run_dssm_fsltr.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn.functional as F 6 | from torch.utils.data import DataLoader 7 | 8 | from models import DSSM 9 | from dataset import Prerank_Train_FSLTR_Dataset 10 | 11 | from utils import load_pkl 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behaivor sequence') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default='request_prerank_neg', help='model name.') 31 | 32 | parser.add_argument('--flow_nums', type=str, default="1", help='number of negative samples') 33 | 34 | parser.add_argument('--flow_weights', type=str, default="1.0", help='learning rate.') 35 | 36 | return parser.parse_args() 37 | 38 | 39 | if __name__ == '__main__': 40 | args = parse_args() 41 | 42 | for k,v in vars(args).items(): 43 | print(f"{k}:{v}") 44 | 45 | #prepare data 46 | prefix = "../data" 47 | 48 | realshow_prefix = os.path.join(prefix, "all_stage") 49 | path_to_train_csv_lst = [] 50 | with open("./file.txt", mode='r') as f: 51 | lines = f.readlines() 52 | for line 
in lines: 53 | tmp_csv_path = os.path.join(realshow_prefix, line.strip()+'.feather') 54 | path_to_train_csv_lst.append(tmp_csv_path) 55 | 56 | num_of_train_csv = len(path_to_train_csv_lst) 57 | print("training files:") 58 | print(f"number of train_csv: {num_of_train_csv}") 59 | for idx, filepath in enumerate(path_to_train_csv_lst): 60 | print(f"{idx}: {filepath}") 61 | 62 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 63 | path_to_train_seq_pkl_lst = [] 64 | with open("./file.txt", mode='r') as f: 65 | lines = f.readlines() 66 | for line in lines: 67 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 68 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 69 | 70 | print("training seq files:") 71 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 72 | print(f"{idx}: {filepath}") 73 | 74 | request_id_prefix = os.path.join(prefix, "request_id_dict") 75 | path_to_train_request_pkl_lst = [] 76 | with open("./file.txt", mode='r') as f: 77 | lines = f.readlines() 78 | for line in lines: 79 | tmp_request_pkl_path = os.path.join(request_id_prefix, line.strip()+".pkl") 80 | path_to_train_request_pkl_lst.append(tmp_request_pkl_path) 81 | 82 | print("training request files") 83 | for idx, filepath in enumerate(path_to_train_request_pkl_lst): 84 | print(f"{idx}: {filepath}") 85 | 86 | others_prefix = os.path.join(prefix, "others") 87 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 88 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 89 | 90 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 91 | for k,v in id_cnt_dict.items(): 92 | print(f"{k}:{v}") 93 | 94 | #prepare model 95 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 96 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 97 | print(f"device: {device}") 98 | 99 | model = DSSM(args.emb_dim, args.seq_len, device, id_cnt_dict).to(device) 100 | 101 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 102 | 103 | sum_num = 0 104 | weight_lst = [] 105 | 106 | nums = args.flow_nums.split(',') 107 | weights = args.flow_weights.split(',') 108 | 109 | for idx,num in enumerate(nums): 110 | sum_num += int(num) 111 | weight_lst.extend([float(weights[idx])]*int(num)) 112 | 113 | loss_weight_gpu = torch.tensor(weight_lst, dtype=torch.float32, device=device).reshape([1,-1,1]) #1*p*1 114 | 115 | padding_num = -2**30 + 1 116 | 117 | #training 118 | for epoch in range(args.epochs): 119 | for n_day in range(num_of_train_csv): 120 | train_dataset = Prerank_Train_FSLTR_Dataset( 121 | path_to_train_csv_lst[n_day], 122 | args.seq_len, 123 | path_to_train_seq_pkl_lst[n_day], 124 | path_to_train_request_pkl_lst[n_day], 125 | args.flows, 126 | args.flow_nums 127 | ) 128 | 129 | train_loader = DataLoader( 130 | dataset=train_dataset, 131 | batch_size=args.batch_size, 132 | shuffle=True, 133 | num_workers=1, 134 | drop_last=True 135 | ) 136 | 137 | for iter_step, inputs in enumerate(train_loader): 138 | 139 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]] 140 | 141 | logits = model.forward_fsltr(inputs_LongTensor) #b 142 | 143 | priority = torch.FloatTensor(inputs[-1].numpy()).to(device) #b*p 144 | 145 | weight = torch.gt( 146 | priority.unsqueeze(-1), priority.unsqueeze(1) 147 | ) #b*p*p 148 | 149 | logits_diff = logits.unsqueeze(-1) - logits.unsqueeze(1) 150 | 151 | loss = F.binary_cross_entropy_with_logits( 152 | logits_diff, 153 | torch.ones_like(logits_diff), 154 | weight=weight*loss_weight_gpu, 155 | reduction='sum') / weight.sum() 156 | 
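# Shape note: model.forward_fsltr returns one logit per list position, so logits is b*p
# here, matching priority. The loss above is a RankNet-style pairwise objective: torch.gt
# builds a b*p*p mask selecting pairs (i,j) whose stage priority satisfies
# priority_i > priority_j, and BCE-with-logits against an all-ones target penalizes
# softplus(logit_j - logit_i), pushing earlier-stage items above later-stage ones;
# loss_weight_gpu (1*p*1) re-weights each pair by the flow weight of its higher-priority
# side. A tiny self-contained check of the same computation (illustrative values, b=1, p=3):
#
#   import torch
#   import torch.nn.functional as F
#   lg = torch.tensor([[2.0, 1.0, 0.0]])                      # b*p logits
#   pr = torch.tensor([[3.0, 2.0, 1.0]])                      # b*p priorities, higher = earlier stage
#   w = torch.gt(pr.unsqueeze(-1), pr.unsqueeze(1)).float()   # b*p*p valid-pair mask
#   diff = lg.unsqueeze(-1) - lg.unsqueeze(1)                 # b*p*p logit differences
#   loss = F.binary_cross_entropy_with_logits(
#       diff, torch.ones_like(diff), weight=w, reduction='sum') / w.sum()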
157 | optimizer.zero_grad() 158 | 159 | loss.backward() 160 | 161 | optimizer.step() 162 | 163 | if iter_step % args.print_freq == 0: 164 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}") 165 | 166 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}_{args.flow_nums}_{args.flow_weights}_{args.tag}.pkl" 167 | 168 | torch.save(model.state_dict(), path_to_save_model) 169 | 170 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /coarse/run_dssm_fsltr.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=dssm_fsltr-1st 6 | 7 | flows=click,realshow,rerank_pos,rerank_neg,rank_pos,rank_neg,coarse_neg 8 | flow_nums=6,6,10,10,10,10,10 9 | flow_weights=1.0,1.0,1.0,1.0,1.0,1.0,0.0 10 | 11 | python -B -u run_dssm_fsltr.py \ 12 | --epochs=1 \ 13 | --batch_size=1024 \ 14 | --infer_realshow_batch_size=1024 \ 15 | --infer_recall_batch_size=900 \ 16 | --emb_dim=8 \ 17 | --lr=1e-2 \ 18 | --seq_len=50 \ 19 | --cuda='0' \ 20 | --print_freq=100 \ 21 | --flows=${flows} \ 22 | --flow_nums=${flow_nums} \ 23 | --flow_weights=${flow_weights} \ 24 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${flows}_${flow_nums}_${flow_weights}_${tag}.log" 2>&1 25 | 26 | python -B -u eval_dssm_fsltr.py \ 27 | --epochs=1 \ 28 | --batch_size=1024 \ 29 | --infer_realshow_batch_size=1024 \ 30 | --infer_recall_batch_size=900 \ 31 | --emb_dim=8 \ 32 | --lr=1e-2 \ 33 | --seq_len=50 \ 34 | --cuda='0' \ 35 | --print_freq=100 \ 36 | --flows=${flows} \ 37 | --flow_nums=${flow_nums} \ 38 | --flow_weights=${flow_weights} \ 39 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${flows}_${flow_nums}_${flow_weights}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /coarse/run_dssm_ubm.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | from torch.utils.data import DataLoader 7 | 8 | from models import DSSM_UBM 9 | from dataset import Prerank_Train_UBM_Dataset 10 | 11 | from utils import load_pkl 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='comma-separated stage flows used as extra behavior sequences.') 31 | 32 | return parser.parse_args() 33 | 34 | 35 | if __name__ == '__main__': 36 | args = parse_args() 37 | 38 | for k,v in vars(args).items(): 39 | print(f"{k}:{v}") 40 | 41 | #prepare data 42 | prefix = 
"../data" 43 | 44 | realshow_prefix = os.path.join(prefix, "realshow") 45 | path_to_train_csv_lst = [] 46 | with open("./file.txt", mode='r') as f: 47 | lines = f.readlines() 48 | for line in lines: 49 | tmp_csv_path = os.path.join(realshow_prefix, line.strip()+'.feather') 50 | path_to_train_csv_lst.append(tmp_csv_path) 51 | 52 | num_of_train_csv = len(path_to_train_csv_lst) 53 | print("training files:") 54 | print(f"number of train_csv: {num_of_train_csv}") 55 | for idx, filepath in enumerate(path_to_train_csv_lst): 56 | print(f"{idx}: {filepath}") 57 | 58 | 59 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 60 | path_to_train_seq_pkl_lst = [] 61 | with open("./file.txt", mode='r') as f: 62 | lines = f.readlines() 63 | for line in lines: 64 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 65 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 66 | 67 | print("training seq files:") 68 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 69 | print(f"{idx}: {filepath}") 70 | 71 | request_id_prefix = os.path.join(prefix, "ubm_seq_request_id_dict") 72 | path_to_train_request_pkl_lst = [] 73 | with open("./file.txt", mode='r') as f: 74 | lines = f.readlines() 75 | for line in lines: 76 | tmp_request_pkl_path = os.path.join(request_id_prefix, line.strip()+".pkl") 77 | path_to_train_request_pkl_lst.append(tmp_request_pkl_path) 78 | 79 | print("training request files") 80 | for idx, filepath in enumerate(path_to_train_request_pkl_lst): 81 | print(f"{idx}: {filepath}") 82 | 83 | others_prefix = os.path.join(prefix, "others") 84 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 85 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 86 | 87 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 88 | for k,v in id_cnt_dict.items(): 89 | print(f"{k}:{v}") 90 | 91 | #prepare model 92 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 93 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 94 | print(f"device: {device}") 95 | 96 | per_flow_seq_len = 10 97 | n_flows = len(args.flows.split(',')) 98 | flow_seq_len = per_flow_seq_len * n_flows 99 | 100 | max_candidate_cnt = 430 101 | 102 | model = DSSM_UBM( 103 | args.emb_dim, 104 | args.seq_len, 105 | device, 106 | per_flow_seq_len, flow_seq_len, 107 | id_cnt_dict 108 | ).to(device) 109 | 110 | loss_fn = nn.BCEWithLogitsLoss().to(device) 111 | 112 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 113 | 114 | #training 115 | for epoch in range(args.epochs): 116 | for n_day in range(num_of_train_csv): 117 | 118 | train_dataset = Prerank_Train_UBM_Dataset( 119 | path_to_train_csv_lst[n_day], 120 | args.seq_len, 121 | path_to_train_seq_pkl_lst[n_day], 122 | path_to_train_request_pkl_lst[n_day], 123 | args.flows, 124 | per_flow_seq_len 125 | ) 126 | 127 | train_loader = DataLoader( 128 | dataset=train_dataset, 129 | batch_size=args.batch_size, 130 | shuffle=True, 131 | num_workers=1, 132 | drop_last=True 133 | ) 134 | 135 | for iter_step, inputs in enumerate(train_loader): 136 | 137 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]] 138 | 139 | label = torch.FloatTensor(inputs[-1].numpy()).to(device) #b 140 | 141 | logits = model(inputs_LongTensor) #b 142 | 143 | loss = loss_fn(logits, label) 144 | 145 | optimizer.zero_grad() 146 | 147 | loss.backward() 148 | 149 | optimizer.step() 150 | 151 | if iter_step % args.print_freq == 0: 152 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}") 153 | 
154 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_flows-{args.flows}_{args.tag}.pkl" 155 | 156 | torch.save(model.state_dict(), path_to_save_model) 157 | 158 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /coarse/run_dssm_ubm.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=dssm_ubm-1st 6 | 7 | flows=rank_pos 8 | 9 | python -B -u run_dssm_ubm.py \ 10 | --epochs=1 \ 11 | --batch_size=1024 \ 12 | --infer_realshow_batch_size=1024 \ 13 | --infer_recall_batch_size=900 \ 14 | --emb_dim=8 \ 15 | --lr=1e-2 \ 16 | --seq_len=50 \ 17 | --cuda='0' \ 18 | --print_freq=100 \ 19 | --flows=${flows} \ 20 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${flows}_${tag}.log" 2>&1 21 | 22 | python -B -u eval_dssm_ubm.py \ 23 | --epochs=1 \ 24 | --batch_size=1024 \ 25 | --infer_realshow_batch_size=1024 \ 26 | --infer_recall_batch_size=900 \ 27 | --emb_dim=8 \ 28 | --lr=1e-2 \ 29 | --seq_len=50 \ 30 | --cuda='0' \ 31 | --print_freq=100 \ 32 | --flows=${flows} \ 33 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${flows}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /coarse/utils.py: -------------------------------------------------------------------------------- 1 | import pickle as pkl 2 | from collections import defaultdict 3 | 4 | def defaultdict_tuple(): 5 | return defaultdict(tuple) 6 | 7 | def defaultdict_str(): 8 | return defaultdict(str) 9 | 10 | def load_pkl(filename): 11 | with open(filename, 'rb') as f: 12 | return pkl.load(f) -------------------------------------------------------------------------------- /data/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/.DS_Store -------------------------------------------------------------------------------- /data/all_stage/example.feather: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/all_stage/example.feather -------------------------------------------------------------------------------- /data/others/coarse_rank_test.feather: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/others/coarse_rank_test.feather -------------------------------------------------------------------------------- /data/others/id_cnt.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/others/id_cnt.pkl -------------------------------------------------------------------------------- /data/others/rank_test.feather: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/others/rank_test.feather -------------------------------------------------------------------------------- /data/others/realshow_video_info.feather: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/others/realshow_video_info.feather -------------------------------------------------------------------------------- /data/others/retrieval_test.feather: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/others/retrieval_test.feather -------------------------------------------------------------------------------- /data/realshow/example.feather: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/realshow/example.feather -------------------------------------------------------------------------------- /data/request_id_dict/example.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/request_id_dict/example.pkl -------------------------------------------------------------------------------- /data/seq_effective_50_dict/example.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/seq_effective_50_dict/example.pkl -------------------------------------------------------------------------------- /rank/eval_din.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DIN 8 | from dataset import Rank_Train_Dataset,Rank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | return parser.parse_args() 31 | 32 | 33 | if __name__ == '__main__': 34 | args = parse_args() 35 | 36 | for k,v in vars(args).items(): 37 | print(f"{k}:{v}") 38 | 39 | #prepare data 40 | prefix = "../data" 41 | 42 | realshow_prefix = os.path.join(prefix, "realshow") 43 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 44 | print(f"testing file: {path_to_test_csv}") 45 | 46 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 47 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 48 | print(f"testing seq file: {path_to_test_seq_pkl}") 49 | 50 | others_prefix = os.path.join(prefix, "others") 51 | 
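# id_cnt.pkl maps each id-type feature field to its vocabulary size; DIN presumably
# uses these counts to size its embedding tables (see models.py), so the same dict
# must be supplied at training and evaluation time for the checkpoint to load cleanly.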
path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 52 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 53 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 54 | for k,v in id_cnt_dict.items(): 55 | print(f"{k}:{v}") 56 | 57 | path_to_test_pkl = os.path.join(others_prefix, "rank_test.feather") 58 | print(f"path_to_test_pkl: {path_to_test_pkl}") 59 | 60 | #prepare model 61 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 62 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 63 | print(f"device: {device}") 64 | 65 | max_candidate_cnt = 430 66 | 67 | model = DIN( 68 | args.emb_dim, 69 | args.seq_len, 70 | device, 71 | max_candidate_cnt, 72 | id_cnt_dict 73 | ).to(device) 74 | 75 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.tag}.pkl" 76 | 77 | state_dict = torch.load(path_to_save_model) 78 | 79 | model.load_state_dict(state_dict) 80 | 81 | print("testing: realshow") 82 | 83 | test_realshow_dataset = Rank_Train_Dataset( 84 | path_to_test_csv, 85 | args.seq_len, 86 | path_to_test_seq_pkl, 87 | ) 88 | 89 | test_realshow_loader = DataLoader( 90 | dataset=test_realshow_dataset, 91 | batch_size=args.infer_realshow_batch_size, 92 | shuffle=False, 93 | num_workers=0, 94 | drop_last=True 95 | ) 96 | print_str = evaluate(model, test_realshow_loader, device) 97 | 98 | print("testing: recall") 99 | 100 | test_recall_dataset = Rank_Test_Dataset( 101 | path_to_test_pkl, 102 | args.seq_len, 103 | path_to_test_seq_pkl, 104 | max_candidate_cnt 105 | ) 106 | 107 | test_recall_loader = DataLoader( 108 | dataset=test_recall_dataset, 109 | batch_size=args.infer_recall_batch_size, 110 | shuffle=False, 111 | num_workers=0, 112 | drop_last=True 113 | ) 114 | 115 | target_print = evaluate_recall(model, test_recall_loader, device) 116 | 117 | print("realshow") 118 | print(print_str) 119 | 120 | print("recall") 121 | print(target_print[0]) 122 | print(target_print[1]) -------------------------------------------------------------------------------- /rank/eval_din_auxiliary_ranking.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DIN_AuxRanking 8 | from dataset import Rank_Train_Dataset,Rank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | # flow param 31 | parser.add_argument('--flows', type=str, default="", help='comma-separated stage flows used for auxiliary ranking.') 32 | 33 | 
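# The auxiliary ranking loss (see run_din_auxiliary_ranking.py) adds a pairwise term
# that pushes the clicked item's logit above the logits of items filtered by the chosen
# flows; rank_loss_weight below scales that term.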
parser.add_argument('--rank_loss_weight', type=float, default=1e-2, help='weight of the auxiliary ranking loss.') 34 | 35 | return parser.parse_args() 36 | 37 | 38 | if __name__ == '__main__': 39 | args = parse_args() 40 | 41 | for k,v in vars(args).items(): 42 | print(f"{k}:{v}") 43 | 44 | #prepare data 45 | prefix = "../data" 46 | 47 | realshow_prefix = os.path.join(prefix, "realshow") 48 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 49 | print(f"testing file: {path_to_test_csv}") 50 | 51 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 52 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 53 | print(f"testing seq file: {path_to_test_seq_pkl}") 54 | 55 | others_prefix = os.path.join(prefix, "others") 56 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 57 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 58 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 59 | for k,v in id_cnt_dict.items(): 60 | print(f"{k}:{v}") 61 | 62 | path_to_test_pkl = os.path.join(others_prefix, "rank_test.feather") 63 | print(f"path_to_test_pkl: {path_to_test_pkl}") 64 | 65 | #prepare model 66 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 67 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 68 | print(f"device: {device}") 69 | 70 | max_candidate_cnt = 430 71 | 72 | model = DIN_AuxRanking( 73 | args.emb_dim, args.seq_len, 74 | device, max_candidate_cnt, id_cnt_dict 75 | ).to(device) 76 | 77 | path_to_save_model=f"./checkpoints/{args.batch_size}_{args.lr}_{args.flows}_{args.rank_loss_weight}_{args.tag}.pkl" 78 | 79 | state_dict = torch.load(path_to_save_model) 80 | 81 | model.load_state_dict(state_dict) 82 | 83 | print("testing: realshow") 84 | 85 | test_realshow_dataset = Rank_Train_Dataset( 86 | path_to_test_csv, 87 | args.seq_len, 88 | path_to_test_seq_pkl, 89 | ) 90 | 91 | test_realshow_loader = DataLoader( 92 | dataset=test_realshow_dataset, 93 | batch_size=args.infer_realshow_batch_size, 94 | shuffle=False, 95 | num_workers=0, 96 | drop_last=True 97 | ) 98 | print_str = evaluate(model, test_realshow_loader, device) 99 | 100 | print("testing: recall") 101 | 102 | test_recall_dataset = Rank_Test_Dataset( 103 | path_to_test_pkl, 104 | args.seq_len, 105 | path_to_test_seq_pkl, 106 | max_candidate_cnt 107 | ) 108 | 109 | test_recall_loader = DataLoader( 110 | dataset=test_recall_dataset, 111 | batch_size=args.infer_recall_batch_size, 112 | shuffle=False, 113 | num_workers=0, 114 | drop_last=True 115 | ) 116 | target_print = evaluate_recall(model, test_recall_loader, device) 117 | 118 | print("realshow") 119 | print(print_str) 120 | 121 | print("recall") 122 | print(target_print[0]) 123 | print(target_print[1]) -------------------------------------------------------------------------------- /rank/eval_din_data_dist_shift_all.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DIN 8 | from dataset import Rank_Train_Dataset,Rank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | 
parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='comma-separated stage flows used as extra training samples.') 31 | 32 | return parser.parse_args() 33 | 34 | 35 | if __name__ == '__main__': 36 | args = parse_args() 37 | 38 | for k,v in vars(args).items(): 39 | print(f"{k}:{v}") 40 | 41 | #prepare data 42 | prefix = "../data" 43 | 44 | realshow_prefix = os.path.join(prefix, "realshow") 45 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 46 | print(f"testing file: {path_to_test_csv}") 47 | 48 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 49 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 50 | print(f"testing seq file: {path_to_test_seq_pkl}") 51 | 52 | others_prefix = os.path.join(prefix, "others") 53 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 54 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 55 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 56 | for k,v in id_cnt_dict.items(): 57 | print(f"{k}:{v}") 58 | 59 | path_to_test_pkl = os.path.join(others_prefix, "rank_test.feather") 60 | print(f"path_to_test_pkl: {path_to_test_pkl}") 61 | 62 | #prepare model 63 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 64 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 65 | print(f"device: {device}") 66 | 67 | max_candidate_cnt = 430 68 | 69 | model = DIN( 70 | args.emb_dim, 71 | args.seq_len, 72 | device, 73 | max_candidate_cnt, id_cnt_dict 74 | ).to(device) 75 | 76 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}_{args.tag}.pkl" 77 | 78 | state_dict = torch.load(path_to_save_model) 79 | 80 | model.load_state_dict(state_dict) 81 | 82 | print("testing: realshow") 83 | 84 | test_realshow_dataset = Rank_Train_Dataset( 85 | path_to_test_csv, 86 | args.seq_len, 87 | path_to_test_seq_pkl, 88 | ) 89 | 90 | test_realshow_loader = DataLoader( 91 | dataset=test_realshow_dataset, 92 | batch_size=args.infer_realshow_batch_size, 93 | shuffle=False, 94 | num_workers=0, 95 | drop_last=True 96 | ) 97 | print_str = evaluate(model, test_realshow_loader, device) 98 | 99 | print("testing: recall") 100 | 101 | test_recall_dataset = Rank_Test_Dataset( 102 | path_to_test_pkl, 103 | args.seq_len, 104 | path_to_test_seq_pkl, 105 | max_candidate_cnt=max_candidate_cnt 106 | ) 107 | 108 | test_recall_loader = DataLoader( 109 | dataset=test_recall_dataset, 110 | batch_size=args.infer_recall_batch_size, 111 | shuffle=False, 112 | num_workers=0, 113 | drop_last=True 114 | ) 115 | target_print = evaluate_recall(model, test_recall_loader, device) 116 | 117 | print("realshow") 118 | print(print_str) 119 | 120 | print("recall") 121 | print(target_print[0]) 122 | print(target_print[1]) -------------------------------------------------------------------------------- /rank/eval_din_data_dist_shift_sampling.py: -------------------------------------------------------------------------------- 1 | import os 2 | 
import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DIN 8 | from dataset import Rank_Train_Dataset,Rank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='comma-separated stage flows to draw negatives from.') 31 | parser.add_argument('--k_flow_negs', type=str, default="", help='comma-separated number of negatives sampled per flow.') 32 | 33 | return parser.parse_args() 34 | 35 | 36 | if __name__ == '__main__': 37 | args = parse_args() 38 | 39 | for k,v in vars(args).items(): 40 | print(f"{k}:{v}") 41 | 42 | #prepare data 43 | prefix = "../data" 44 | 45 | realshow_prefix = os.path.join(prefix, "realshow") 46 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 47 | print(f"testing file: {path_to_test_csv}") 48 | 49 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 50 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 51 | print(f"testing seq file: {path_to_test_seq_pkl}") 52 | 53 | others_prefix = os.path.join(prefix, "others") 54 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 55 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 56 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 57 | for k,v in id_cnt_dict.items(): 58 | print(f"{k}:{v}") 59 | 60 | path_to_test_pkl = os.path.join(others_prefix, "rank_test.feather") 61 | print(f"path_to_test_pkl: {path_to_test_pkl}") 62 | 63 | #prepare model 64 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 65 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 66 | print(f"device: {device}") 67 | 68 | max_candidate_cnt = 430 69 | 70 | model = DIN( 71 | args.emb_dim, 72 | args.seq_len, 73 | device, 74 | max_candidate_cnt, id_cnt_dict 75 | ).to(device) 76 | 77 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}-{args.k_flow_negs}_{args.tag}.pkl" 78 | 79 | state_dict = torch.load(path_to_save_model) 80 | 81 | model.load_state_dict(state_dict) 82 | 83 | print("testing: realshow") 84 | 85 | test_realshow_dataset = Rank_Train_Dataset( 86 | path_to_test_csv, 87 | args.seq_len, 88 | path_to_test_seq_pkl, 89 | ) 90 | 91 | test_realshow_loader = DataLoader( 92 | dataset=test_realshow_dataset, 93 | batch_size=args.infer_realshow_batch_size, 94 | shuffle=False, 95 | num_workers=0, 96 | drop_last=True 97 | ) 98 | print_str = evaluate(model, test_realshow_loader, device) 99 | 100 | print("testing: recall") 101 | 102 | test_recall_dataset = Rank_Test_Dataset( 103 | path_to_test_pkl, 104 
| args.seq_len, 105 | path_to_test_seq_pkl, 106 | max_candidate_cnt=max_candidate_cnt 107 | ) 108 | 109 | test_recall_loader = DataLoader( 110 | dataset=test_recall_dataset, 111 | batch_size=args.infer_recall_batch_size, 112 | shuffle=False, 113 | num_workers=0, 114 | drop_last=True 115 | ) 116 | 117 | target_print = evaluate_recall(model, test_recall_loader, device) 118 | 119 | print("realshow") 120 | print(print_str) 121 | 122 | print("recall") 123 | print(target_print[0]) 124 | print(target_print[1]) -------------------------------------------------------------------------------- /rank/eval_din_fsltr.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DIN 8 | from dataset import Rank_Train_Dataset,Rank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--click_rank_loss_w', type=float, default=1e-2, help='weight of the click pairwise ranking loss.') 31 | parser.add_argument('--realshow_rank_loss_w', type=float, default=1e-2, help='weight of the realshow pairwise ranking loss.') 32 | parser.add_argument('--rerank_pos_rank_loss_w', type=float, default=1e-2, help='weight of the rerank_pos pairwise ranking loss.') 33 | parser.add_argument('--rank_pos_rank_loss_w', type=float, default=1e-2, help='weight of the rank_pos pairwise ranking loss.') 34 | 35 | return parser.parse_args() 36 | 37 | 38 | if __name__ == '__main__': 39 | args = parse_args() 40 | 41 | for k,v in vars(args).items(): 42 | print(f"{k}:{v}") 43 | 44 | #prepare data 45 | prefix = "../data" 46 | 47 | realshow_prefix = os.path.join(prefix, "realshow") 48 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 49 | print(f"testing file: {path_to_test_csv}") 50 | 51 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 52 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 53 | print(f"testing seq file: {path_to_test_seq_pkl}") 54 | 55 | others_prefix = os.path.join(prefix, "others") 56 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 57 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 58 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 59 | for k,v in id_cnt_dict.items(): 60 | print(f"{k}:{v}") 61 | 62 | path_to_test_pkl = os.path.join(others_prefix, "rank_test.feather") 63 | print(f"path_to_test_pkl: {path_to_test_pkl}") 64 | 65 | #prepare model 66 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 67 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 68 | print(f"device: 
{device}") 69 | 70 | max_candidate_cnt = 430 71 | 72 | model = DIN( 73 | args.emb_dim, 74 | args.seq_len, 75 | device, 76 | max_candidate_cnt, id_cnt_dict 77 | ).to(device) 78 | 79 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.click_rank_loss_w}-{args.realshow_rank_loss_w}-{args.rerank_pos_rank_loss_w}-{args.rank_pos_rank_loss_w}_{args.tag}.pkl" 80 | state_dict = torch.load(path_to_save_model) 81 | model.load_state_dict(state_dict) 82 | 83 | print("testing: realshow") 84 | 85 | test_realshow_dataset = Rank_Train_Dataset( 86 | path_to_test_csv, 87 | args.seq_len, 88 | path_to_test_seq_pkl, 89 | ) 90 | 91 | test_realshow_loader = DataLoader( 92 | dataset=test_realshow_dataset, 93 | batch_size=args.infer_realshow_batch_size, 94 | shuffle=False, 95 | num_workers=0, 96 | drop_last=True 97 | ) 98 | print_str = evaluate(model, test_realshow_loader, device) 99 | 100 | print("testing: recall") 101 | 102 | test_recall_dataset = Rank_Test_Dataset( 103 | path_to_test_pkl, 104 | args.seq_len, 105 | path_to_test_seq_pkl, 106 | max_candidate_cnt=max_candidate_cnt 107 | ) 108 | 109 | test_recall_loader = DataLoader( 110 | dataset=test_recall_dataset, 111 | batch_size=args.infer_recall_batch_size, 112 | shuffle=False, 113 | num_workers=0, 114 | drop_last=True 115 | ) 116 | target_print = evaluate_recall(model, test_recall_loader, device) 117 | 118 | print("realshow") 119 | print(print_str) 120 | 121 | print("recall") 122 | print(target_print[0]) 123 | print(target_print[1]) -------------------------------------------------------------------------------- /rank/eval_din_ubm.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DIN_UBM 8 | from dataset import Rank_Train_UBM_Dataset,Rank_Test_UBM_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behaivor sequence') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='exp tag.') 31 | 32 | return parser.parse_args() 33 | 34 | 35 | if __name__ == '__main__': 36 | args = parse_args() 37 | 38 | #prepare data 39 | prefix = "../data" 40 | 41 | realshow_prefix = os.path.join(prefix, "realshow") 42 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 43 | print(f"testing file: {path_to_test_csv}") 44 | 45 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 46 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 47 | print(f"testing seq 
file: {path_to_test_seq_pkl}") 48 | 49 | request_id_prefix = os.path.join(prefix, "ubm_seq_request_id_dict") 50 | path_to_request_id_pkl = os.path.join(request_id_prefix, "2024-02-18.pkl") 51 | print(f"testing request_id file: {path_to_test_seq_pkl}") 52 | 53 | others_prefix = os.path.join(prefix, "others") 54 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 55 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 56 | 57 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 58 | for k,v in id_cnt_dict.items(): 59 | print(f"{k}:{v}") 60 | 61 | path_to_test_pkl = os.path.join(others_prefix, "rank_test.feather") 62 | print(f"path_to_test_pkl: {path_to_test_pkl}") 63 | 64 | #prepare model 65 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 66 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 67 | print(f"device: {device}") 68 | 69 | per_flow_seq_len = 10 70 | n_flows = len(args.flows.split(',')) 71 | flow_seq_len = per_flow_seq_len * n_flows 72 | 73 | max_candidate_cnt = 430 74 | 75 | model = DIN_UBM( 76 | args.emb_dim, 77 | args.seq_len, 78 | device, 79 | max_candidate_cnt, 80 | per_flow_seq_len,flow_seq_len, 81 | id_cnt_dict 82 | ).to(device) 83 | 84 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}_{args.tag}.pkl" 85 | 86 | state_dict = torch.load(path_to_save_model) 87 | 88 | model.load_state_dict(state_dict) 89 | 90 | print("testing: realshow") 91 | 92 | test_realshow_dataset = Rank_Train_UBM_Dataset( 93 | path_to_test_csv, 94 | args.seq_len, 95 | path_to_test_seq_pkl, 96 | path_to_request_id_pkl, 97 | args.flows, 98 | per_flow_seq_len 99 | ) 100 | 101 | test_realshow_loader = DataLoader( 102 | dataset=test_realshow_dataset, 103 | batch_size=args.infer_realshow_batch_size, 104 | shuffle=False, 105 | num_workers=0, 106 | drop_last=True 107 | ) 108 | 109 | print_str = evaluate(model, test_realshow_loader, device) 110 | 111 | print("testing: recall") 112 | 113 | test_recall_dataset = Rank_Test_UBM_Dataset( 114 | path_to_test_pkl, 115 | args.seq_len, 116 | path_to_test_seq_pkl, 117 | path_to_request_id_pkl, 118 | args.flows, per_flow_seq_len, 119 | max_candidate_cnt 120 | ) 121 | 122 | test_recall_loader = DataLoader( 123 | dataset=test_recall_dataset, 124 | batch_size=args.infer_recall_batch_size, 125 | shuffle=False, 126 | num_workers=0, 127 | drop_last=True 128 | ) 129 | 130 | target_print = evaluate_recall(model, test_recall_loader, device) 131 | 132 | print("realshow") 133 | print(print_str) 134 | 135 | print("recall") 136 | print(target_print[0]) 137 | print(target_print[1]) -------------------------------------------------------------------------------- /rank/file.txt: -------------------------------------------------------------------------------- 1 | 2024-01-13 2 | 2024-01-14 3 | 2024-01-15 4 | 2024-01-16 5 | 2024-01-17 6 | 2024-01-18 7 | 2024-01-19 8 | 2024-01-20 9 | 2024-01-21 10 | 2024-01-22 11 | 2024-01-23 12 | 2024-01-24 13 | 2024-01-25 14 | 2024-01-26 15 | 2024-01-27 16 | 2024-01-28 17 | 2024-01-29 18 | 2024-01-30 19 | 2024-01-31 20 | 2024-02-01 21 | 2024-02-02 22 | 2024-02-03 23 | 2024-02-04 24 | 2024-02-05 25 | 2024-02-06 26 | 2024-02-07 27 | 2024-02-08 28 | 2024-02-09 29 | 2024-02-10 30 | 2024-02-11 31 | 2024-02-12 32 | 2024-02-13 33 | 2024-02-14 34 | 2024-02-15 35 | 2024-02-16 36 | 2024-02-17 -------------------------------------------------------------------------------- /rank/metrics.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as 
np 3 | from sklearn.metrics import roc_auc_score, log_loss 4 | 5 | def evaluate(model, data_loader, device): 6 | model.eval() 7 | 8 | logits_lst = np.zeros(shape=(962560,), dtype=np.float32) 9 | label_lst = np.zeros(shape=(962560,), dtype=np.float32) 10 | 11 | with torch.no_grad(): 12 | start_index = 0 13 | end_index = 0 14 | 15 | for inputs in data_loader: 16 | 17 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]] 18 | 19 | logits = model(inputs_LongTensor) #b 20 | 21 | logits = torch.sigmoid(logits) 22 | 23 | end_index += inputs[-1].size(0) 24 | 25 | label_lst[start_index:end_index] = inputs[-1].numpy().astype(np.float32) 26 | 27 | logits_lst[start_index:end_index] = logits.cpu().numpy().astype(np.float32) 28 | 29 | start_index = end_index 30 | 31 | test_auc = roc_auc_score(label_lst, logits_lst) 32 | test_logloss = log_loss(label_lst, logits_lst) 33 | 34 | print_str = f"Target: auc \t logloss: {test_auc:.6f} \t {test_logloss:.6f}" 35 | 36 | return print_str 37 | 38 | 39 | def evaluate_recall(model, data_loader, device): 40 | model.eval() 41 | 42 | target_top_k = [50,100,200] 43 | 44 | total_target_cnt = 0.0 45 | 46 | target_recall_lst = [0.0 for _ in range(len(target_top_k))] 47 | target_ndcg_lst = [0.0 for _ in range(len(target_top_k))] 48 | 49 | with torch.no_grad(): 50 | 51 | for inputs in data_loader: 52 | 53 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-2]] 54 | 55 | logits = model.forward_recall(inputs_LongTensor) #b*430 56 | logits = logits.cpu().numpy() 57 | 58 | labels = inputs[-2].numpy().astype(np.float32) #b*430 59 | n_photos = inputs[-1].numpy() #b 60 | 61 | for i in range(n_photos.shape[0]): 62 | 63 | n_photo = n_photos[i] 64 | 65 | logit = logits[i,:n_photo] 66 | label = labels[i,:n_photo] 67 | 68 | logit_descending_index = np.argsort(logit*-1.0) #descending order 69 | logit_descending_rank = np.argsort(logit_descending_index) #rank of each item, 0 = highest logit 70 | 71 | #target metric 72 | if np.sum(label) > 0 and np.sum(label)!=n_photo: 73 | target_pos_index = np.nonzero(label)[0] 74 | target_pos_rank = logit_descending_rank[target_pos_index] 75 | 76 | for j in range(len(target_top_k)): 77 | target_recall_lst[j] += np.sum(target_pos_rank < target_top_k[j]) 123 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}") 124 | 125 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.tag}.pkl" 126 | 127 | torch.save(model.state_dict(), path_to_save_model) 128 | 129 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /rank/run_din.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=din-1st 6 | 7 | python -B -u run_din.py \ 8 | --epochs=1 \ 9 | --batch_size=1024 \ 10 | --infer_realshow_batch_size=1024 \ 11 | --infer_recall_batch_size=512 \ 12 | --emb_dim=8 \ 13 | --lr=1e-2 \ 14 | --seq_len=50 \ 15 | --cuda='0' \ 16 | --print_freq=100 \ 17 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${tag}.log" 2>&1 18 | 19 | python -B -u eval_din.py \ 20 | --epochs=1 \ 21 | --batch_size=1024 \ 22 | --infer_realshow_batch_size=1024 \ 23 | --infer_recall_batch_size=512 \ 24 | --emb_dim=8 \ 25 | --lr=1e-2 \ 26 | --seq_len=50 \ 27 | --cuda='0' \ 28 | --print_freq=100 \ 29 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /rank/run_din_auxiliary_ranking.py: 
-------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | import torch.nn.functional as F 7 | from torch.utils.data import DataLoader 8 | 9 | from models import DIN_AuxRanking 10 | from dataset import Rank_Train_Auxiliary_Ranking_Dataset 11 | 12 | from utils import load_pkl 13 | 14 | def parse_args(): 15 | parser = argparse.ArgumentParser() 16 | 17 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 18 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 19 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 21 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 22 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 23 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence') 24 | 25 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 26 | 27 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 28 | 29 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 30 | 31 | # flow param 32 | parser.add_argument('--flows', type=str, default="", help='comma-separated stage flows used for auxiliary ranking.') 33 | 34 | parser.add_argument('--rank_loss_weight', type=float, default=1e-2, help='weight of the auxiliary ranking loss.') 35 | 36 | return parser.parse_args() 37 | 38 | 39 | if __name__ == '__main__': 40 | args = parse_args() 41 | 42 | for k,v in vars(args).items(): 43 | print(f"{k}:{v}") 44 | 45 | #prepare data 46 | prefix = "../data" 47 | 48 | realshow_prefix = os.path.join(prefix, "realshow") 49 | path_to_train_csv_lst = [] 50 | with open("./file.txt", mode='r') as f: 51 | lines = f.readlines() 52 | for line in lines: 53 | tmp_csv_path = os.path.join(realshow_prefix, line.strip()+'.feather') 54 | path_to_train_csv_lst.append(tmp_csv_path) 55 | 56 | num_of_train_csv = len(path_to_train_csv_lst) 57 | print("training files:") 58 | print(f"number of train_csv: {num_of_train_csv}") 59 | for idx, filepath in enumerate(path_to_train_csv_lst): 60 | print(f"{idx}: {filepath}") 61 | 62 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 63 | path_to_train_seq_pkl_lst = [] 64 | with open("./file.txt", mode='r') as f: 65 | lines = f.readlines() 66 | for line in lines: 67 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 68 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 69 | 70 | print("training seq files:") 71 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 72 | print(f"{idx}: {filepath}") 73 | 74 | request_id_prefix = os.path.join(prefix, "request_id_dict") 75 | path_to_train_request_pkl_lst = [] 76 | with open("./file.txt", mode='r') as f: 77 | lines = f.readlines() 78 | for line in lines: 79 | tmp_request_pkl_path = os.path.join(request_id_prefix, line.strip()+".pkl") 80 | path_to_train_request_pkl_lst.append(tmp_request_pkl_path) 81 | 82 | print("training request files") 83 | for idx, filepath in enumerate(path_to_train_request_pkl_lst): 84 | print(f"{idx}: {filepath}") 85 | 86 | others_prefix = os.path.join(prefix, "others") 87 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 88 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 89 | 90 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 91 | for k,v in id_cnt_dict.items(): 
92 | print(f"{k}:{v}") 93 | 94 | #prepare model 95 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 96 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 97 | print(f"device: {device}") 98 | 99 | max_candidate_cnt = 430 100 | 101 | model = DIN_AuxRanking( 102 | args.emb_dim, args.seq_len, 103 | device, max_candidate_cnt, id_cnt_dict 104 | ).to(device) 105 | 106 | loss_fn = nn.BCEWithLogitsLoss().to(device) 107 | 108 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 109 | 110 | padding_num = -2**30 + 1 111 | 112 | k_per_flow = 10 113 | 114 | n_flows = len(args.flows.split(',')) 115 | 116 | n_flow_photos = n_flows * k_per_flow 117 | 118 | #training 119 | for epoch in range(args.epochs): 120 | for n_day in range(num_of_train_csv): 121 | 122 | train_dataset = Rank_Train_Auxiliary_Ranking_Dataset( 123 | path_to_train_csv_lst[n_day], 124 | args.seq_len, 125 | path_to_train_seq_pkl_lst[n_day], 126 | path_to_train_request_pkl_lst[n_day], 127 | args.flows 128 | ) 129 | 130 | train_loader = DataLoader( 131 | dataset=train_dataset, 132 | batch_size=args.batch_size, 133 | shuffle=True, 134 | num_workers=1, 135 | drop_last=True 136 | ) 137 | 138 | for iter_step, inputs in enumerate(train_loader): 139 | 140 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-2]] 141 | 142 | label = torch.FloatTensor(inputs[-1].numpy()).to(device) #b 143 | 144 | logits, aux_logits, flow_logits = model.forward_train(inputs_LongTensor) #b*1,b*1,b*p 145 | 146 | loss = loss_fn(logits.squeeze(), label) 147 | 148 | flow_mask = torch.FloatTensor(inputs[-2].numpy()).to(device) # b*p 149 | 150 | flow_logits = torch.where( 151 | flow_mask > 0, 152 | flow_logits, 153 | torch.full_like(flow_logits, fill_value=padding_num) 154 | ) 155 | 156 | aux_logits_repeat = aux_logits.repeat([1,n_flow_photos]) #b*p 157 | 158 | bpr_logits = aux_logits_repeat - flow_logits 159 | 160 | rank_loss = F.binary_cross_entropy_with_logits( 161 | bpr_logits, 162 | torch.ones_like(bpr_logits), 163 | weight=label.unsqueeze(1).repeat([1,n_flow_photos]), 164 | reduction='sum' 165 | ) / label.unsqueeze(1).repeat([1,n_flow_photos]).sum() 166 | 167 | all_loss = loss + args.rank_loss_weight * rank_loss 168 | 169 | optimizer.zero_grad() 170 | 171 | all_loss.backward() 172 | 173 | optimizer.step() 174 | 175 | if iter_step % args.print_freq == 0: 176 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tall_loss:{all_loss.detach().cpu().item():.6f}\tloss:{loss.detach().cpu().item():.6f}\trank_loss:{rank_loss.detach().cpu().item():.6f}") 177 | 178 | path_to_save_model=f"./checkpoints/{args.batch_size}_{args.lr}_{args.flows}_{args.rank_loss_weight}_{args.tag}.pkl" 179 | 180 | torch.save(model.state_dict(), path_to_save_model) 181 | 182 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /rank/run_din_auxiliary_ranking.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=din_auxiliary_ranking-1st 6 | 7 | flows=rerank_pos,rerank_neg,rank_pos,rank_neg 8 | 9 | rank_loss_weight=0.1 10 | 11 | python -B -u run_din_auxiliary_ranking.py \ 12 | --epochs=1 \ 13 | --batch_size=1024 \ 14 | --infer_realshow_batch_size=1024 \ 15 | --infer_recall_batch_size=512 \ 16 | --emb_dim=8 \ 17 | --lr=1e-2 \ 18 | --seq_len=50 \ 19 | --cuda='0' \ 20 | --print_freq=100 \ 21 | --flows=${flows} \ 22 | --rank_loss_weight=${rank_loss_weight} \ 23 | 
--tag=${tag} > "./logs/bs-1024_lr-1e-2_${flows}_${rank_loss_weight}_${tag}.log" 2>&1 24 | 25 | python -B -u eval_din_auxiliary_ranking.py \ 26 | --epochs=1 \ 27 | --batch_size=1024 \ 28 | --infer_realshow_batch_size=1024 \ 29 | --infer_recall_batch_size=512 \ 30 | --emb_dim=8 \ 31 | --lr=1e-2 \ 32 | --seq_len=50 \ 33 | --cuda='0' \ 34 | --print_freq=100 \ 35 | --flows=${flows} \ 36 | --rank_loss_weight=${rank_loss_weight} \ 37 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${flows}_${rank_loss_weight}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /rank/run_din_data_dist_shift_all.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | from torch.utils.data import DataLoader 7 | 8 | from models import DIN 9 | from dataset import Rank_Train_Data_Dist_Shift_All_Dataset 10 | 11 | from utils import load_pkl 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behaivor sequence') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='exp tag.') 31 | 32 | return parser.parse_args() 33 | 34 | 35 | if __name__ == '__main__': 36 | args = parse_args() 37 | 38 | for k,v in vars(args).items(): 39 | print(f"{k}:{v}") 40 | 41 | #prepare data 42 | prefix = "../data" 43 | 44 | realshow_prefix = os.path.join(prefix, "all_stage") 45 | path_to_train_csv_lst = [] 46 | with open("./file.txt", mode='r') as f: 47 | lines = f.readlines() 48 | for line in lines: 49 | tmp_csv_path = os.path.join(realshow_prefix, line.strip()+'.feather') 50 | path_to_train_csv_lst.append(tmp_csv_path) 51 | 52 | num_of_train_csv = len(path_to_train_csv_lst) 53 | print("training files:") 54 | print(f"number of train_csv: {num_of_train_csv}") 55 | for idx, filepath in enumerate(path_to_train_csv_lst): 56 | print(f"{idx}: {filepath}") 57 | 58 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 59 | path_to_train_seq_pkl_lst = [] 60 | with open("./file.txt", mode='r') as f: 61 | lines = f.readlines() 62 | for line in lines: 63 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 64 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 65 | 66 | print("training seq files:") 67 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 68 | print(f"{idx}: {filepath}") 69 | 70 | others_prefix = os.path.join(prefix, "others") 71 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 72 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 73 | 74 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 75 | for k,v in id_cnt_dict.items(): 76 | 
print(f"{k}:{v}") 77 | 78 | #prepare model 79 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 80 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 81 | print(f"device: {device}") 82 | 83 | max_candidate_cnt = 430 84 | 85 | model = DIN( 86 | args.emb_dim, args.seq_len, 87 | device, max_candidate_cnt, id_cnt_dict 88 | ).to(device) 89 | 90 | loss_fn = nn.BCEWithLogitsLoss().to(device) 91 | 92 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 93 | 94 | #training 95 | for epoch in range(args.epochs): 96 | for n_day in range(num_of_train_csv): 97 | 98 | train_dataset = Rank_Train_Data_Dist_Shift_All_Dataset( 99 | path_to_train_csv_lst[n_day], 100 | args.seq_len, 101 | path_to_train_seq_pkl_lst[n_day], 102 | args.flows 103 | ) 104 | 105 | train_loader = DataLoader( 106 | dataset=train_dataset, 107 | batch_size=args.batch_size, 108 | shuffle=True, 109 | num_workers=1, 110 | drop_last=True 111 | ) 112 | 113 | for iter_step, inputs in enumerate(train_loader): 114 | 115 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]] 116 | 117 | label = torch.FloatTensor(inputs[-1].numpy()).to(device) #b*k 118 | 119 | logits = model(inputs_LongTensor) #b 120 | 121 | loss = loss_fn(logits, label) 122 | 123 | optimizer.zero_grad() 124 | 125 | loss.backward() 126 | 127 | optimizer.step() 128 | 129 | if iter_step % args.print_freq == 0: 130 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}") 131 | 132 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}_{args.tag}.pkl" 133 | 134 | torch.save(model.state_dict(), path_to_save_model) 135 | 136 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /rank/run_din_data_dist_shift_all.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=din_data_dist_shift_all-1st 6 | 7 | flows=rank_neg 8 | 9 | python -B -u run_din_data_dist_shift_all.py \ 10 | --epochs=1 \ 11 | --batch_size=1024 \ 12 | --infer_realshow_batch_size=1024 \ 13 | --infer_recall_batch_size=512 \ 14 | --emb_dim=8 \ 15 | --lr=1e-2 \ 16 | --seq_len=50 \ 17 | --cuda='0' \ 18 | --print_freq=100 \ 19 | --flows=${flows} \ 20 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${flows}_${tag}.log" 2>&1 21 | 22 | python -B -u eval_din_data_dist_shift_all.py \ 23 | --epochs=1 \ 24 | --batch_size=1024 \ 25 | --infer_realshow_batch_size=1024 \ 26 | --infer_recall_batch_size=512 \ 27 | --emb_dim=8 \ 28 | --lr=1e-2 \ 29 | --seq_len=50 \ 30 | --cuda='0' \ 31 | --print_freq=100 \ 32 | --flows=${flows} \ 33 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${flows}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /rank/run_din_data_dist_shift_sampling.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | from torch.utils.data import DataLoader 7 | 8 | from models import DIN 9 | from dataset import Rank_Train_Data_Dist_Shift_Sampling_Dataset 10 | 11 | from utils import load_pkl 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', 
type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='comma-separated stage flows.') 31 | parser.add_argument('--k_flow_negs', type=str, default="", help='comma-separated numbers of flow negatives.') 32 | 33 | return parser.parse_args() 34 | 35 | 36 | if __name__ == '__main__': 37 | args = parse_args() 38 | 39 | for k,v in vars(args).items(): 40 | print(f"{k}:{v}") 41 | 42 | #prepare data 43 | prefix = "../data" 44 | 45 | realshow_prefix = os.path.join(prefix, "all_stage") 46 | path_to_train_csv_lst = [] 47 | with open("./file.txt", mode='r') as f: 48 | lines = f.readlines() 49 | for line in lines: 50 | tmp_csv_path = os.path.join(realshow_prefix, line.strip()+'.feather') 51 | path_to_train_csv_lst.append(tmp_csv_path) 52 | 53 | num_of_train_csv = len(path_to_train_csv_lst) 54 | print("training files:") 55 | print(f"number of train_csv: {num_of_train_csv}") 56 | for idx, filepath in enumerate(path_to_train_csv_lst): 57 | print(f"{idx}: {filepath}") 58 | 59 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 60 | path_to_train_seq_pkl_lst = [] 61 | with open("./file.txt", mode='r') as f: 62 | lines = f.readlines() 63 | for line in lines: 64 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 65 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 66 | 67 | print("training seq files:") 68 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 69 | print(f"{idx}: {filepath}") 70 | 71 | others_prefix = os.path.join(prefix, "others") 72 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 73 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 74 | 75 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 76 | for k,v in id_cnt_dict.items(): 77 | print(f"{k}:{v}") 78 | 79 | #prepare model 80 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 81 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 82 | print(f"device: {device}") 83 | 84 | max_candidate_cnt = 430 85 | 86 | model = DIN( 87 | args.emb_dim, args.seq_len, 88 | device, max_candidate_cnt, id_cnt_dict 89 | ).to(device) 90 | 91 | loss_fn = nn.BCEWithLogitsLoss().to(device) 92 | 93 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 94 | 95 | #training 96 | for epoch in range(args.epochs): 97 | for n_day in range(num_of_train_csv): 98 | train_dataset = Rank_Train_Data_Dist_Shift_Sampling_Dataset( 99 | path_to_train_csv_lst[n_day], 100 | args.seq_len, 101 | path_to_train_seq_pkl_lst[n_day], 102 | args.flows, 103 | args.k_flow_negs 104 | ) 105 | 106 | train_loader = DataLoader( 107 | dataset=train_dataset, 108 | batch_size=args.batch_size, 109 | shuffle=True, 110 | num_workers=1, 111 | drop_last=True 112 | ) 113 | 114 | for iter_step, inputs in enumerate(train_loader): 115 | 116 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]] 117 | 118 | label =
torch.FloatTensor(inputs[-1].numpy()).to(device) #b*k 119 | 120 | logits = model(inputs_LongTensor) #b 121 | 122 | loss = loss_fn(logits, label) 123 | 124 | optimizer.zero_grad() 125 | 126 | loss.backward() 127 | 128 | optimizer.step() 129 | 130 | if iter_step % args.print_freq == 0: 131 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}") 132 | 133 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}-{args.k_flow_negs}_{args.tag}.pkl" 134 | 135 | torch.save(model.state_dict(), path_to_save_model) 136 | 137 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /rank/run_din_data_dist_shift_sampling.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=din_data_dist_shift_sampling-1st 6 | 7 | flows=rank_neg 8 | k_flow_negs=1 9 | 10 | python -B -u run_din_data_dist_shift_sampling.py \ 11 | --epochs=1 \ 12 | --batch_size=1024 \ 13 | --infer_realshow_batch_size=1024 \ 14 | --infer_recall_batch_size=512 \ 15 | --emb_dim=8 \ 16 | --lr=1e-2 \ 17 | --seq_len=50 \ 18 | --cuda='0' \ 19 | --print_freq=100 \ 20 | --flows=${flows} \ 21 | --k_flow_negs=${k_flow_negs} \ 22 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${flows}_${k_flow_negs}_${tag}.log" 2>&1 23 | 24 | python -B -u eval_din_data_dist_shift_sampling.py \ 25 | --epochs=1 \ 26 | --batch_size=1024 \ 27 | --infer_realshow_batch_size=1024 \ 28 | --infer_recall_batch_size=512 \ 29 | --emb_dim=8 \ 30 | --lr=1e-2 \ 31 | --seq_len=50 \ 32 | --cuda='0' \ 33 | --print_freq=100 \ 34 | --flows=${flows} \ 35 | --k_flow_negs=${k_flow_negs} \ 36 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${flows}_${k_flow_negs}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /rank/run_din_fsltr.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | import torch.nn.functional as F 7 | from torch.utils.data import DataLoader 8 | 9 | from models import DIN 10 | from dataset import Rank_Train_FSLTR_Dataset 11 | 12 | from utils import load_pkl 13 | 14 | def parse_args(): 15 | parser = argparse.ArgumentParser() 16 | 17 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 18 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 19 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 21 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 22 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 23 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 24 | 25 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 26 | 27 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 28 | 29 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 30 | 31 | parser.add_argument('--click_rank_loss_w', type=float, default=1e-2, help='weight of click rank loss.') 32 | parser.add_argument('--realshow_rank_loss_w', type=float, default=1e-2, help='weight of realshow rank loss.') 33 | parser.add_argument('--rerank_pos_rank_loss_w', type=float,
default=1e-2, help='weight of rerank_pos rank loss.') 34 | parser.add_argument('--rank_pos_rank_loss_w', type=float, default=1e-2, help='weight of rank_pos rank loss.') 35 | 36 | return parser.parse_args() 37 | 38 | 39 | if __name__ == '__main__': 40 | args = parse_args() 41 | 42 | for k,v in vars(args).items(): 43 | print(f"{k}:{v}") 44 | 45 | #prepare data 46 | prefix = "../data" 47 | 48 | realshow_prefix = os.path.join(prefix, "all_stage") 49 | path_to_train_csv_lst = [] 50 | with open("./file.txt", mode='r') as f: 51 | lines = f.readlines() 52 | for line in lines: 53 | tmp_csv_path = os.path.join(realshow_prefix, line.strip()+'.feather') 54 | path_to_train_csv_lst.append(tmp_csv_path) 55 | 56 | num_of_train_csv = len(path_to_train_csv_lst) 57 | print("training files:") 58 | print(f"number of train_csv: {num_of_train_csv}") 59 | for idx, filepath in enumerate(path_to_train_csv_lst): 60 | print(f"{idx}: {filepath}") 61 | 62 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 63 | path_to_train_seq_pkl_lst = [] 64 | with open("./file.txt", mode='r') as f: 65 | lines = f.readlines() 66 | for line in lines: 67 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 68 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 69 | 70 | print("training seq files:") 71 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 72 | print(f"{idx}: {filepath}") 73 | 74 | request_id_prefix = os.path.join(prefix, "request_id_dict") 75 | path_to_train_request_pkl_lst = [] 76 | with open("./file.txt", mode='r') as f: 77 | lines = f.readlines() 78 | for line in lines: 79 | tmp_request_pkl_path = os.path.join(request_id_prefix, line.strip()+".pkl") 80 | path_to_train_request_pkl_lst.append(tmp_request_pkl_path) 81 | 82 | print("training request files") 83 | for idx, filepath in enumerate(path_to_train_request_pkl_lst): 84 | print(f"{idx}: {filepath}") 85 | 86 | others_prefix = os.path.join(prefix, "others") 87 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 88 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 89 | 90 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 91 | for k,v in id_cnt_dict.items(): 92 | print(f"{k}:{v}") 93 | 94 | #prepare model 95 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 96 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 97 | print(f"device: {device}") 98 | 99 | max_candidate_cnt = 430 100 | 101 | model = DIN( 102 | args.emb_dim, args.seq_len, 103 | device, max_candidate_cnt, id_cnt_dict 104 | ).to(device) 105 | 106 | loss_fn = nn.CrossEntropyLoss(ignore_index=1) 107 | 108 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 109 | 110 | #training 111 | for epoch in range(args.epochs): 112 | for n_day in range(num_of_train_csv): 113 | 114 | train_dataset = Rank_Train_FSLTR_Dataset( 115 | path_to_train_csv_lst[n_day], 116 | args.seq_len, 117 | path_to_train_seq_pkl_lst[n_day], 118 | path_to_train_request_pkl_lst[n_day] 119 | ) 120 | 121 | train_loader = DataLoader( 122 | dataset=train_dataset, 123 | batch_size=args.batch_size, 124 | shuffle=True, 125 | num_workers=1, 126 | drop_last=True 127 | ) 128 | 129 | for iter_step, inputs in enumerate(train_loader): 130 | 131 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-6]] 132 | 133 | click_logits, realshow_logits, \ 134 | rerank_pos_logits, rerank_neg_logits, \ 135 | rank_pos_logits, rank_neg_logits = model.forward_fsltr(inputs_LongTensor) #b 136 | 137 | tmp_logits = torch.cat([realshow_logits,rerank_pos_logits,rerank_neg_logits, rank_pos_logits,
rank_neg_logits], dim=1) 138 | click_bpr_logits = click_logits.unsqueeze(-1) - tmp_logits.unsqueeze(1) #b*6*46 139 | click_label = torch.FloatTensor(inputs[-6].numpy()).to(device).unsqueeze(-1) #b*6*1 140 | click_rank_loss = F.binary_cross_entropy_with_logits( 141 | click_bpr_logits, 142 | torch.ones(click_bpr_logits.size(), dtype=torch.float, device=device), 143 | weight=click_label, 144 | reduction='sum') / (46*torch.sum(click_label)) 145 | 146 | tmp_logits = torch.cat([rerank_pos_logits,rerank_neg_logits, rank_pos_logits, rank_neg_logits], dim=1) 147 | realshow_bpr_logits = realshow_logits.unsqueeze(-1) - tmp_logits.unsqueeze(1) #b*6*40 148 | realshow_label = torch.FloatTensor(inputs[-5].numpy()).to(device).unsqueeze(-1) #b*6*1 149 | realshow_rank_loss = F.binary_cross_entropy_with_logits( 150 | realshow_bpr_logits, 151 | torch.ones(realshow_bpr_logits.size(), dtype=torch.float, device=device), 152 | weight=realshow_label, 153 | reduction='sum') / (40*torch.sum(realshow_label)) 154 | 155 | tmp_logits = torch.cat([rerank_neg_logits, rank_neg_logits], dim=1) 156 | rerank_pos_bpr_logits = rerank_pos_logits.unsqueeze(-1) - tmp_logits.unsqueeze(1) #b*6*20 157 | rerank_pos_label = torch.FloatTensor(inputs[-4].numpy()).to(device).unsqueeze(-1) #b*10*1 158 | rerank_pos_rank_loss = F.binary_cross_entropy_with_logits( 159 | rerank_pos_bpr_logits, 160 | torch.ones(rerank_pos_bpr_logits.size(), dtype=torch.float, device=device), 161 | weight=rerank_pos_label, 162 | reduction='sum') / (20*torch.sum(rerank_pos_label)) 163 | 164 | tmp_logits = torch.cat([rerank_neg_logits, rank_neg_logits], dim=1) 165 | rank_pos_bpr_logits = rank_pos_logits.unsqueeze(-1) - tmp_logits.unsqueeze(1) #b*6*20 166 | rank_pos_label = torch.FloatTensor(inputs[-2].numpy()).to(device).unsqueeze(-1) #b*10*1 167 | rank_pos_rank_loss = F.binary_cross_entropy_with_logits( 168 | rank_pos_bpr_logits, 169 | torch.ones(rank_pos_bpr_logits.size(), dtype=torch.float, device=device), 170 | weight=rank_pos_label, 171 | reduction='sum') / (20*torch.sum(rank_pos_label)) 172 | 173 | loss = click_rank_loss * args.click_rank_loss_w + \ 174 | realshow_rank_loss * args.realshow_rank_loss_w + \ 175 | rerank_pos_rank_loss * args.rerank_pos_rank_loss_w + \ 176 | rank_pos_rank_loss * args.rank_pos_rank_loss_w 177 | 178 | optimizer.zero_grad() 179 | 180 | loss.backward() 181 | 182 | optimizer.step() 183 | 184 | if iter_step % args.print_freq == 0: 185 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f} \tclick_rank_loss:{click_rank_loss.detach().cpu().item():.6f} \trealshow_rank_loss:{realshow_rank_loss.detach().cpu().item():.6f}\trerank_pos_rank_loss:{rerank_pos_rank_loss.detach().cpu().item():.6f}\trank_pos_rank_loss:{rank_pos_rank_loss.detach().cpu().item():.6f}") 186 | 187 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.click_rank_loss_w}-{args.realshow_rank_loss_w}-{args.rerank_pos_rank_loss_w}-{args.rank_pos_rank_loss_w}_{args.tag}.pkl" 188 | 189 | torch.save(model.state_dict(), path_to_save_model) 190 | 191 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /rank/run_din_fsltr.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=din_fsltr-1st 6 | 7 | click_rank_loss_w=1.0 8 | realshow_rank_loss_w=0.5 9 | rerank_pos_rank_loss_w=0.05 10 | rank_pos_rank_loss_w=0.05 11 | 12 | python -B -u run_din_fsltr.py 
\ 13 | --epochs=1 \ 14 | --batch_size=1024 \ 15 | --infer_realshow_batch_size=1024 \ 16 | --infer_recall_batch_size=512 \ 17 | --emb_dim=8 \ 18 | --lr=1e-2 \ 19 | --seq_len=50 \ 20 | --cuda='0' \ 21 | --print_freq=100 \ 22 | --click_rank_loss_w=${click_rank_loss_w} \ 23 | --realshow_rank_loss_w=${realshow_rank_loss_w} \ 24 | --rerank_pos_rank_loss_w=${rerank_pos_rank_loss_w} \ 25 | --rank_pos_rank_loss_w=${rank_pos_rank_loss_w} \ 26 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${click_rank_loss_w}_${realshow_rank_loss_w}_${rerank_pos_rank_loss_w}_${rank_pos_rank_loss_w}_${tag}.log" 2>&1 27 | 28 | python -B -u eval_din_fsltr.py \ 29 | --epochs=1 \ 30 | --batch_size=1024 \ 31 | --infer_realshow_batch_size=1024 \ 32 | --infer_recall_batch_size=512 \ 33 | --emb_dim=8 \ 34 | --lr=1e-2 \ 35 | --seq_len=50 \ 36 | --cuda='0' \ 37 | --print_freq=100 \ 38 | --click_rank_loss_w=${click_rank_loss_w} \ 39 | --realshow_rank_loss_w=${realshow_rank_loss_w} \ 40 | --rerank_pos_rank_loss_w=${rerank_pos_rank_loss_w} \ 41 | --rank_pos_rank_loss_w=${rank_pos_rank_loss_w} \ 42 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${click_rank_loss_w}_${realshow_rank_loss_w}_${rerank_pos_rank_loss_w}_${rank_pos_rank_loss_w}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /rank/run_din_ubm.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | from torch.utils.data import DataLoader 7 | 8 | from models import DIN_UBM 9 | from dataset import Rank_Train_UBM_Dataset 10 | 11 | from utils import load_pkl 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='comma-separated stage flows.') 31 | 32 | return parser.parse_args() 33 | 34 | 35 | if __name__ == '__main__': 36 | args = parse_args() 37 | 38 | for k,v in vars(args).items(): 39 | print(f"{k}:{v}") 40 | 41 | #prepare data 42 | prefix = "../data" 43 | 44 | realshow_prefix = os.path.join(prefix, "realshow") 45 | path_to_train_csv_lst = [] 46 | with open("./file.txt", mode='r') as f: 47 | lines = f.readlines() 48 | for line in lines: 49 | tmp_csv_path = os.path.join(realshow_prefix, line.strip()+'.feather') 50 | path_to_train_csv_lst.append(tmp_csv_path) 51 | 52 | num_of_train_csv = len(path_to_train_csv_lst) 53 | print("training files:") 54 | print(f"number of train_csv: {num_of_train_csv}") 55 | for idx, filepath in enumerate(path_to_train_csv_lst): 56 | print(f"{idx}: {filepath}") 57 | 58 | 59 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 60 |
path_to_train_seq_pkl_lst = [] 61 | with open("./file.txt", mode='r') as f: 62 | lines = f.readlines() 63 | for line in lines: 64 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 65 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 66 | 67 | print("training seq files:") 68 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 69 | print(f"{idx}: {filepath}") 70 | 71 | request_id_prefix = os.path.join(prefix, "ubm_seq_request_id_dict") 72 | path_to_train_request_pkl_lst = [] 73 | with open("./file.txt", mode='r') as f: 74 | lines = f.readlines() 75 | for line in lines: 76 | tmp_request_pkl_path = os.path.join(request_id_prefix, line.strip()+".pkl") 77 | path_to_train_request_pkl_lst.append(tmp_request_pkl_path) 78 | 79 | print("training request files") 80 | for idx, filepath in enumerate(path_to_train_request_pkl_lst): 81 | print(f"{idx}: {filepath}") 82 | 83 | others_prefix = os.path.join(prefix, "others") 84 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 85 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 86 | 87 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 88 | for k,v in id_cnt_dict.items(): 89 | print(f"{k}:{v}") 90 | 91 | #prepare model 92 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 93 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 94 | print(f"device: {device}") 95 | 96 | per_flow_seq_len = 10 97 | n_flows = len(args.flows.split(',')) 98 | flow_seq_len = per_flow_seq_len * n_flows 99 | 100 | max_candidate_cnt = 430 101 | 102 | model = DIN_UBM( 103 | args.emb_dim, 104 | args.seq_len, 105 | device, 106 | max_candidate_cnt, 107 | per_flow_seq_len, flow_seq_len, 108 | id_cnt_dict 109 | ).to(device) 110 | 111 | loss_fn = nn.BCEWithLogitsLoss().to(device) 112 | 113 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 114 | 115 | #training 116 | for epoch in range(args.epochs): 117 | for n_day in range(num_of_train_csv): 118 | 119 | train_dataset = Rank_Train_UBM_Dataset( 120 | path_to_train_csv_lst[n_day], 121 | args.seq_len, 122 | path_to_train_seq_pkl_lst[n_day], 123 | path_to_train_request_pkl_lst[n_day], 124 | args.flows, 125 | per_flow_seq_len 126 | ) 127 | 128 | train_loader = DataLoader( 129 | dataset=train_dataset, 130 | batch_size=args.batch_size, 131 | shuffle=True, 132 | num_workers=1, 133 | drop_last=True 134 | ) 135 | 136 | for iter_step, inputs in enumerate(train_loader): 137 | 138 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]] 139 | 140 | label = torch.FloatTensor(inputs[-1].numpy()).to(device) #b 141 | 142 | logits = model(inputs_LongTensor) #b 143 | 144 | loss = loss_fn(logits, label) 145 | 146 | optimizer.zero_grad() 147 | 148 | loss.backward() 149 | 150 | optimizer.step() 151 | 152 | if iter_step % args.print_freq == 0: 153 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}") 154 | 155 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}_{args.tag}.pkl" 156 | 157 | torch.save(model.state_dict(), path_to_save_model) 158 | 159 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /rank/run_din_ubm.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=din_ubm-1st 6 | 7 | flows=rank_pos 8 | 9 | python -B -u run_din_ubm.py \ 10 | --epochs=1 \ 11 | --batch_size=1024 \ 12 | 
--infer_realshow_batch_size=1024 \ 13 | --infer_recall_batch_size=512 \ 14 | --emb_dim=8 \ 15 | --lr=1e-2 \ 16 | --seq_len=50 \ 17 | --cuda='0' \ 18 | --print_freq=100 \ 19 | --flows=${flows} \ 20 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${flows}_${tag}.log" 2>&1 21 | 22 | python -B -u eval_din_ubm.py \ 23 | --epochs=1 \ 24 | --batch_size=1024 \ 25 | --infer_realshow_batch_size=1024 \ 26 | --infer_recall_batch_size=512 \ 27 | --emb_dim=8 \ 28 | --lr=1e-2 \ 29 | --seq_len=50 \ 30 | --cuda='0' \ 31 | --print_freq=100 \ 32 | --flows=${flows} \ 33 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${flows}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /rank/utils.py: -------------------------------------------------------------------------------- 1 | import pickle as pkl 2 | from collections import defaultdict 3 | 4 | def defaultdict_tuple(): 5 | return defaultdict(tuple) 6 | 7 | def defaultdict_str(): 8 | return defaultdict(str) 9 | 10 | def load_pkl(filename): 11 | with open(filename, 'rb') as f: 12 | return pkl.load(f) -------------------------------------------------------------------------------- /recflow.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/recflow.jpg -------------------------------------------------------------------------------- /retrieval/dataset.py: -------------------------------------------------------------------------------- 1 | import os 2 | import gc 3 | import time 4 | 5 | import numpy as np 6 | import pandas as pd 7 | from torch.utils.data import Dataset 8 | 9 | from utils import load_pkl 10 | 11 | class Recall_Train_SASRec_Dataset(Dataset): 12 | def __init__( 13 | self, 14 | path_to_csv, 15 | seq_len, neg_num, 16 | path_to_seq, 17 | video_corpus 18 | ): 19 | t1 = time.time() 20 | 21 | raw_df = pd.read_feather(path_to_csv) 22 | 23 | df = raw_df[raw_df['effective_view']==1][["request_id", "video_id"]] 24 | 25 | self.data = df.to_numpy().copy() 26 | 27 | self.seq_len = seq_len 28 | 29 | self.neg_num = neg_num 30 | 31 | self.today_seq = load_pkl(path_to_seq) 32 | 33 | video_corpus_df = pd.read_feather(video_corpus) 34 | self.video_corpus = video_corpus_df['video_id'].unique().copy() + 1 35 | del video_corpus_df 36 | 37 | self.n_video_corpus = self.video_corpus.shape[0] 38 | 39 | del raw_df 40 | del df 41 | 42 | gc.collect() 43 | 44 | t2 = time.time() 45 | print(f'init data time: {t2-t1}') 46 | 47 | def __len__(self): 48 | return self.data.shape[0] 49 | 50 | def negative_sampling(self, tgt_video, neg_num): 51 | cnt = 0 52 | negs_index = np.random.randint(self.n_video_corpus, size=neg_num) 53 | while tgt_video in self.video_corpus[negs_index]: 54 | negs_index = np.random.randint(self.n_video_corpus, size=neg_num) 55 | cnt += 1 56 | if cnt >= 10: 57 | break 58 | return self.video_corpus[negs_index] 59 | 60 | def __getitem__(self, idx): 61 | request_id = self.data[idx][0] 62 | vid = self.data[idx][1] + 1 63 | 64 | seq_full = self.today_seq[request_id][:,[0,7]].copy() 65 | 66 | seq_mask = (seq_full[:,1] > 0).astype(np.int8) 67 | 68 | seq_len = np.sum(seq_mask) 69 | 70 | seq_arr = seq_full[:,0] 71 | 72 | if seq_len > 0: 73 | seq_arr[-seq_len:] += 1 74 | 75 | neg_vids = self.negative_sampling(vid, self.neg_num) 76 | 77 | return seq_arr, seq_mask, vid, neg_vids 78 | 79 | #public 80 | class Recall_Train_SASRec_HardNegMining_Dataset(Dataset): 81 | def __init__( 82 | self, 83 | path_to_csv, 84 | seq_len, 
neg_num, 85 | path_to_seq, 86 | path_to_request_id_pkl, 87 | video_corpus, 88 | flow_negs, 89 | flow_neg_nums 90 | ): 91 | t1 = time.time() 92 | 93 | self.flow_negs = flow_negs.split(',') 94 | 95 | self.flow_neg_nums = list(map(int, flow_neg_nums.split(","))) 96 | 97 | raw_df = pd.read_feather(path_to_csv) 98 | 99 | df = raw_df[raw_df['effective_view']==1][["request_id", "video_id"]] 100 | 101 | self.data = df.to_numpy().copy() 102 | 103 | self.seq_len = seq_len 104 | 105 | self.neg_num = neg_num 106 | 107 | self.random_neg_nums = self.neg_num - sum(self.flow_neg_nums) 108 | 109 | self.today_seq = load_pkl(path_to_seq) 110 | 111 | video_corpus_df = pd.read_feather(video_corpus) 112 | self.video_corpus = video_corpus_df['video_id'].unique().copy() + 1 113 | del video_corpus_df 114 | 115 | self.n_video_corpus = self.video_corpus.shape[0] 116 | 117 | self.request_dict = load_pkl(path_to_request_id_pkl) 118 | 119 | del df 120 | del raw_df 121 | 122 | gc.collect() 123 | 124 | t2 = time.time() 125 | print(f'init data time: {t2-t1}') 126 | 127 | def __len__(self): 128 | return self.data.shape[0] 129 | 130 | def random_negative_sampling(self, tgt_video, neg_num): 131 | cnt = 0 132 | negs_index = np.random.randint(self.n_video_corpus, size=neg_num) 133 | while tgt_video in self.video_corpus[negs_index]: 134 | negs_index = np.random.randint(self.n_video_corpus, size=neg_num) 135 | cnt += 1 136 | if cnt >= 10: 137 | break 138 | return self.video_corpus[negs_index] 139 | 140 | def flow_negative_sampling(self, tgt_video, request_id): 141 | 142 | flow_neg_lst = [] 143 | 144 | for idx, flow_neg in enumerate(self.flow_negs): 145 | if flow_neg in self.request_dict[request_id]: 146 | flow_arr = self.request_dict[request_id][flow_neg][:,0] + 1 147 | flow_arr_shape = flow_arr.shape[0] 148 | cnt = 0 149 | tmp_neg_arr = flow_arr[np.random.randint(flow_arr_shape,size=self.flow_neg_nums[idx])] 150 | while tgt_video in tmp_neg_arr: 151 | tmp_neg_arr = flow_arr[np.random.randint(flow_arr_shape,size=self.flow_neg_nums[idx])] 152 | cnt += 1 153 | if cnt >= 10: 154 | break 155 | else: 156 | tmp_neg_arr = np.zeros(self.flow_neg_nums[idx], dtype=np.int64) 157 | 158 | flow_neg_lst.extend(tmp_neg_arr) 159 | 160 | return np.reshape(np.concatenate([np.reshape(x,[-1,1]) for x in flow_neg_lst]), [-1]) 161 | 162 | def __getitem__(self, idx): 163 | request_id = self.data[idx][0] 164 | vid = self.data[idx][1] + 1 165 | 166 | # 0: padding, 1: behavior 167 | seq_full = self.today_seq[request_id][:,[0,7]].copy() 168 | 169 | seq_mask = (seq_full[:,1] > 0).astype(np.int8) #50 170 | 171 | seq_len = np.sum(seq_mask) 172 | 173 | seq_arr = seq_full[:,0] 174 | 175 | if seq_len > 0: 176 | seq_arr[-seq_len:] += 1 #50 177 | 178 | #negative sampling 179 | random_neg_vids = self.random_negative_sampling(vid, self.random_neg_nums) 180 | 181 | flow_neg_vids = self.flow_negative_sampling(vid, request_id) 182 | 183 | neg_vids = np.append(random_neg_vids, flow_neg_vids) 184 | 185 | return seq_arr, seq_mask, vid, neg_vids 186 | 187 | #public 188 | class Recall_Train_SASRec_FSLTR_Dataset(Dataset): 189 | def __init__( 190 | self, 191 | path_to_csv, 192 | seq_len, neg_num, 193 | path_to_seq, 194 | path_to_request_id_pkl, 195 | video_corpus, 196 | flow_negs, 197 | flow_neg_nums 198 | ): 199 | t1 = time.time() 200 | 201 | self.priority = { 202 | "click":6, 203 | "realshow":5, 204 | "rerank_pos":4, 205 | "rank_pos":4, 206 | "rerank_neg":3, 207 | "rank_neg":3, 208 | "coarse_neg":2, 209 | "prerank_neg":1 210 | } 211 | 212 | self.flow_negs = 
flow_negs.split(',') 213 | 214 | self.flow_neg_nums = list(map(int, flow_neg_nums.split(","))) 215 | 216 | raw_df = pd.read_feather(path_to_csv) 217 | 218 | df = raw_df[raw_df['effective_view']==1][["request_id", "video_id"]] 219 | 220 | self.data = df.to_numpy().copy() 221 | 222 | self.seq_len = seq_len 223 | 224 | self.neg_num = neg_num 225 | 226 | self.random_neg_nums = self.neg_num - sum(self.flow_neg_nums) 227 | 228 | self.today_seq = load_pkl(path_to_seq) 229 | 230 | video_corpus_df = pd.read_feather(video_corpus) 231 | self.video_corpus = video_corpus_df['video_id'].unique().copy() + 1 232 | del video_corpus_df 233 | 234 | self.n_video_corpus = self.video_corpus.shape[0] 235 | 236 | self.request_dict = load_pkl(path_to_request_id_pkl) 237 | 238 | del df 239 | del raw_df 240 | 241 | gc.collect() 242 | 243 | t2 = time.time() 244 | print(f'init data time: {t2-t1}') 245 | 246 | def __len__(self): 247 | return self.data.shape[0] 248 | 249 | def random_negative_sampling(self, tgt_video, neg_num): 250 | cnt = 0 251 | negs_index = np.random.randint(self.n_video_corpus, size=neg_num) 252 | while tgt_video in self.video_corpus[negs_index]: 253 | negs_index = np.random.randint(self.n_video_corpus, size=neg_num) 254 | cnt += 1 255 | if cnt >= 10: 256 | break 257 | return self.video_corpus[negs_index] 258 | 259 | def flow_negative_sampling(self, tgt_video, request_id): 260 | 261 | flow_neg_lst = [] 262 | flow_neg_priority_lst = [] 263 | 264 | flow_dict = self.request_dict[request_id] 265 | for idx, flow_neg in enumerate(self.flow_negs): 266 | if flow_neg in flow_dict: 267 | flow_arr = flow_dict[flow_neg][:,0] + 1 268 | flow_arr_shape = flow_arr.shape[0] 269 | cnt = 0 270 | tmp_neg_arr = flow_arr[np.random.randint(flow_arr_shape,size=self.flow_neg_nums[idx])] 271 | while tgt_video in tmp_neg_arr: 272 | tmp_neg_arr = flow_arr[np.random.randint(flow_arr_shape,size=self.flow_neg_nums[idx])] 273 | cnt += 1 274 | if cnt >= 10: 275 | break 276 | tmp_priority_arr = np.ones(self.flow_neg_nums[idx], dtype=np.float32)*self.priority[flow_neg] 277 | else: 278 | tmp_neg_arr = np.zeros(self.flow_neg_nums[idx], dtype=np.int64) 279 | tmp_priority_arr = np.zeros(self.flow_neg_nums[idx], dtype=np.float32) 280 | 281 | flow_neg_lst.append(tmp_neg_arr) 282 | flow_neg_priority_lst.append(tmp_priority_arr) 283 | 284 | return np.concatenate(flow_neg_lst), np.concatenate(flow_neg_priority_lst) 285 | 286 | def __getitem__(self, idx): 287 | request_id = self.data[idx][0] 288 | vid = self.data[idx][1] + 1 289 | pos_priority = np.ones(1, dtype=np.float32) * self.priority['click'] 290 | 291 | # 0: padding, 1: behavior 292 | seq_full = self.today_seq[request_id][:,[0,7]].copy() 293 | 294 | seq_mask = (seq_full[:,1] > 0).astype(np.int8) #50 295 | 296 | seq_len = np.sum(seq_mask) 297 | 298 | seq_arr = seq_full[:,0] 299 | 300 | if seq_len > 0: 301 | seq_arr[-seq_len:] += 1 #50 302 | 303 | #negative sampling 304 | random_neg_vids = self.random_negative_sampling(vid, self.random_neg_nums) 305 | random_neg_priority = np.zeros(self.random_neg_nums, dtype=np.float32) 306 | 307 | flow_neg_vids,flow_neg_priority = self.flow_negative_sampling(vid, request_id) 308 | 309 | vids = np.concatenate([np.atleast_1d(vid),random_neg_vids, flow_neg_vids]) 310 | prioritys = np.concatenate([pos_priority,random_neg_priority,flow_neg_priority]) 311 | 312 | return seq_arr, seq_mask, vids, prioritys 313 | 314 | #public 315 | class Recall_Test_SASRec_Recall_Dataset(Dataset): 316 | def __init__( 317 | self, 318 | path_to_test_feather, 319 | seq_len, 320 | 
path_to_seq, 321 | max_candidate_cnt=30 322 | ): 323 | t1 = time.time() 324 | 325 | raw_df = pd.read_feather(path_to_test_feather) 326 | 327 | data = raw_df[["request_id", "video_id", "effective_view"]] 328 | 329 | self.request_ids = data['request_id'].unique() 330 | 331 | self.seq_len = seq_len 332 | 333 | self.today_seq = load_pkl(path_to_seq) 334 | 335 | self.max_candidate_cnt = max_candidate_cnt 336 | 337 | self.data_group = data.copy().groupby('request_id') 338 | 339 | del data 340 | del raw_df 341 | 342 | gc.collect() 343 | 344 | t2 = time.time() 345 | print(f'init data time: {t2-t1}') 346 | 347 | def __len__(self): 348 | return len(self.request_ids) 349 | 350 | def __getitem__(self, idx): 351 | request_id = self.request_ids[idx] 352 | 353 | request_id_df = self.data_group.get_group(request_id)[["video_id","effective_view"]] 354 | 355 | request_id_arr = request_id_df.to_numpy().copy() 356 | 357 | n_video = request_id_arr.shape[0] 358 | 359 | n_complent = self.max_candidate_cnt - n_video 360 | 361 | complent_arr = np.zeros(shape=(n_complent,2), dtype=np.int64) 362 | 363 | request_id_arr = np.concatenate([request_id_arr, complent_arr], axis=0) 364 | 365 | vid = request_id_arr[:,0] + 1 366 | 367 | seq_full = self.today_seq[request_id][:,[0,7]].copy() 368 | 369 | seq_mask = (seq_full[:,1] > 0).astype(np.int8) 370 | 371 | seq_len = np.sum(seq_mask) 372 | 373 | seq_arr = seq_full[:,0] 374 | 375 | if seq_len > 0: 376 | seq_arr[-seq_len:] += 1 377 | 378 | effective = request_id_arr[:,1] 379 | 380 | return seq_arr, seq_mask, vid, effective, n_video -------------------------------------------------------------------------------- /retrieval/eval_sasrec.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import SASRec 8 | from dataset import Recall_Test_SASRec_Recall_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 20 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 21 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 22 | 23 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 24 | 25 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 26 | 27 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 28 | 29 | parser.add_argument('--neg_num', type=int, default=50, help='number of negative samples') 30 | 31 | return parser.parse_args() 32 | 33 | 34 | if __name__ == '__main__': 35 | args = parse_args() 36 | 37 | #prepare data 38 | prefix = "../data" 39 | 40 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 41 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 42 | print(f"testing seq file: {path_to_test_seq_pkl}") 43 | 44 | others_prefix = os.path.join(prefix, "others") 45 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 46 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 47 | 48 | id_cnt_dict =
load_pkl(path_to_id_cnt_pkl) 49 | for k,v in id_cnt_dict.items(): 50 | print(f"{k}:{v}") 51 | 52 | #prepare negatives 53 | path_to_realshow_video_corpus_feather = os.path.join(others_prefix, "realshow_video_info.feather") 54 | print(f"path_to_realshow_video_corpus_feather: {path_to_realshow_video_corpus_feather}") 55 | 56 | #prepare retrieval_test 57 | path_to_recall_test_feather = os.path.join(others_prefix, "retrieval_test.feather") 58 | print(f"path_to_recall_test_feather: {path_to_recall_test_feather}") 59 | 60 | #prepare model 61 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 62 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 63 | print(f"device: {device}") 64 | 65 | model = SASRec(args.emb_dim, args.seq_len, args.neg_num, device, id_cnt_dict).to(device) 66 | 67 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.neg_num}_{args.tag}.pkl" 68 | 69 | state_dict = torch.load(path_to_save_model) 70 | 71 | model.load_state_dict(state_dict) 72 | 73 | print("testing: recall") 74 | 75 | test_recall_dataset = Recall_Test_SASRec_Recall_Dataset( 76 | path_to_recall_test_feather, 77 | args.seq_len, 78 | path_to_test_seq_pkl, 79 | max_candidate_cnt=30 80 | ) 81 | 82 | test_recall_loader = DataLoader( 83 | dataset=test_recall_dataset, 84 | batch_size=args.infer_batch_size, 85 | shuffle=False, 86 | num_workers=0, 87 | drop_last=True 88 | ) 89 | 90 | target_print = evaluate_recall( 91 | model, 92 | test_recall_loader, 93 | device, 94 | path_to_realshow_video_corpus_feather 95 | ) 96 | 97 | print(target_print[0]) 98 | print(target_print[1]) -------------------------------------------------------------------------------- /retrieval/eval_sasrec_fsltr.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import SASRec 8 | from dataset import Recall_Test_SASRec_Recall_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 20 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 21 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 22 | 23 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 24 | 25 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 26 | 27 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 28 | 29 | parser.add_argument('--neg_num', type=int, default=3, help='number of negative samples') 30 | 31 | parser.add_argument('--flow_negs', type=str, default='mcd_prerank_neg', help='comma-separated flow negative stages.') 32 | 33 | parser.add_argument('--flow_neg_nums', type=str, default='3', help='comma-separated numbers of flow negatives.') 34 | 35 | return parser.parse_args() 36 | 37 | 38 | if __name__ == '__main__': 39 | args = parse_args() 40 | 41 | #prepare data 42 | prefix = "../data" 43 | 44 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 45 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 46 | print(f"testing seq file:
{path_to_test_seq_pkl}") 47 | 48 | others_prefix = os.path.join(prefix, "others") 49 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 50 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 51 | 52 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 53 | for k,v in id_cnt_dict.items(): 54 | print(f"{k}:{v}") 55 | 56 | #prepare negatives 57 | path_to_realshow_video_corpus_feather = os.path.join(others_prefix, "realshow_video_info.feather") 58 | print(f"path_to_realshow_video_corpus_feather: {path_to_realshow_video_corpus_feather}") 59 | 60 | #prepare retrieval_test 61 | path_to_recall_test_feather = os.path.join(others_prefix, "retrieval_test.feather") 62 | print(f"path_to_recall_test_feather: {path_to_recall_test_feather}") 63 | 64 | #prepare model 65 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 66 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 67 | print(f"device: {device}") 68 | 69 | model = SASRec( 70 | args.emb_dim, args.seq_len, 71 | args.neg_num, 72 | device, id_cnt_dict 73 | ).to(device) 74 | 75 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.neg_num}_{args.flow_negs}_{args.flow_neg_nums}_{args.tag}.pkl" 76 | 77 | state_dict = torch.load(path_to_save_model) 78 | 79 | model.load_state_dict(state_dict) 80 | 81 | print("testing: recall") 82 | 83 | test_recall_dataset = Recall_Test_SASRec_Recall_Dataset( 84 | path_to_recall_test_feather, 85 | args.seq_len, 86 | path_to_test_seq_pkl, 87 | max_candidate_cnt=30 88 | ) 89 | 90 | test_recall_loader = DataLoader( 91 | dataset=test_recall_dataset, 92 | batch_size=args.infer_batch_size, 93 | shuffle=False, 94 | num_workers=0, 95 | drop_last=True 96 | ) 97 | 98 | target_print = evaluate_recall( 99 | model, 100 | test_recall_loader, 101 | device, 102 | path_to_realshow_video_corpus_feather 103 | ) 104 | 105 | print(target_print[0]) 106 | print(target_print[1]) -------------------------------------------------------------------------------- /retrieval/eval_sasrec_hardnegmining.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import SASRec 8 | from dataset import Recall_Test_SASRec_Recall_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 20 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 21 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 22 | 23 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 24 | 25 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 26 | 27 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 28 | 29 | parser.add_argument('--neg_num', type=int, default=3, help='number of negative samples') 30 | 31 | parser.add_argument('--flow_negs', type=str, default='mcd_prerank_neg', help='comma-separated flow negative stages.') 32 | 33 | parser.add_argument('--flow_neg_nums', type=str, default='3', help='comma-separated numbers of flow negatives.') 34 | 35 | return
parser.parse_args() 36 | 37 | 38 | if __name__ == '__main__': 39 | args = parse_args() 40 | 41 | #prepare data 42 | prefix = "../data" 43 | 44 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 45 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 46 | print(f"testing seq file: {path_to_test_seq_pkl}") 47 | 48 | others_prefix = os.path.join(prefix, "others") 49 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 50 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 51 | 52 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 53 | for k,v in id_cnt_dict.items(): 54 | print(f"{k}:{v}") 55 | 56 | #prepare negatives 57 | path_to_realshow_video_corpus_feather = os.path.join(others_prefix, "realshow_video_info.feather") 58 | print(f"path_to_realshow_video_corpus_feather: {path_to_realshow_video_corpus_feather}") 59 | 60 | #prepare retrieval_test 61 | path_to_recall_test_feather = os.path.join(others_prefix, "retrieval_test.feather") 62 | print(f"path_to_recall_test_feather: {path_to_recall_test_feather}") 63 | 64 | #prepare model 65 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 66 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 67 | print(f"device: {device}") 68 | 69 | model = SASRec( 70 | args.emb_dim, args.seq_len, 71 | args.neg_num, 72 | device, id_cnt_dict 73 | ).to(device) 74 | 75 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_neg_num-{args.neg_num}_flow_negs-{args.flow_negs}_flow_neg_nums-{args.flow_neg_nums}_{args.tag}.pkl" 76 | 77 | state_dict = torch.load(path_to_save_model) 78 | 79 | model.load_state_dict(state_dict) 80 | 81 | print("testing: recall") 82 | 83 | test_recall_dataset = Recall_Test_SASRec_Recall_Dataset( 84 | path_to_recall_test_feather, 85 | args.seq_len, 86 | path_to_test_seq_pkl, 87 | max_candidate_cnt=30 88 | ) 89 | 90 | test_recall_loader = DataLoader( 91 | dataset=test_recall_dataset, 92 | batch_size=args.infer_batch_size, 93 | shuffle=False, 94 | num_workers=0, 95 | drop_last=True 96 | ) 97 | 98 | target_print = evaluate_recall( 99 | model, 100 | test_recall_loader, 101 | device, 102 | path_to_realshow_video_corpus_feather 103 | ) 104 | 105 | print(target_print[0]) 106 | print(target_print[1]) -------------------------------------------------------------------------------- /retrieval/file.txt: -------------------------------------------------------------------------------- 1 | 2024-01-13 2 | 2024-01-14 3 | 2024-01-15 4 | 2024-01-16 5 | 2024-01-17 6 | 2024-01-18 7 | 2024-01-19 8 | 2024-01-20 9 | 2024-01-21 10 | 2024-01-22 11 | 2024-01-23 12 | 2024-01-24 13 | 2024-01-25 14 | 2024-01-26 15 | 2024-01-27 16 | 2024-01-28 17 | 2024-01-29 18 | 2024-01-30 19 | 2024-01-31 20 | 2024-02-01 21 | 2024-02-02 22 | 2024-02-03 23 | 2024-02-04 24 | 2024-02-05 25 | 2024-02-06 26 | 2024-02-07 27 | 2024-02-08 28 | 2024-02-09 29 | 2024-02-10 30 | 2024-02-11 31 | 2024-02-12 32 | 2024-02-13 33 | 2024-02-14 34 | 2024-02-15 35 | 2024-02-16 36 | 2024-02-17 -------------------------------------------------------------------------------- /retrieval/metrics.py: -------------------------------------------------------------------------------- 1 | import gc 2 | import faiss 3 | 4 | import torch 5 | import numpy as np 6 | import pandas as pd 7 | 8 | def evaluate_recall( 9 | model, data_loader, device, 10 | path_to_realshow_video_corpus_feather 11 | ): 12 | 13 | model.eval() 14 | 15 | realshow_video_corpus_df = pd.read_feather(path_to_realshow_video_corpus_feather) 16 | 17 | realshow_video_corpus =
realshow_video_corpus_df['video_id'].unique().copy() + 1 18 | 19 | del realshow_video_corpus_df 20 | 21 | gc.collect() 22 | 23 | target_top_k = [50, 100, 500, 1000] 24 | 25 | total_target_cnt = 0.0 26 | 27 | target_realshow_recall_lst = [0.0 for _ in range(len(target_top_k))] 28 | target_realshow_ndcg_lst = [0.0 for _ in range(len(target_top_k))] 29 | 30 | with torch.no_grad(): 31 | 32 | #construct realshow embedding 33 | realshow_faiss_obj = faiss.StandardGpuResources() 34 | realshow_flat_config = faiss.GpuIndexFlatConfig() 35 | realshow_flat_config.device = 0 36 | realshow_index_flat = faiss.GpuIndexFlatIP(realshow_faiss_obj, 8, realshow_flat_config) #index dim 8 must match emb_dim 37 | realshow_index_flat.add(model.vid_emb.weight.cpu().numpy()[realshow_video_corpus]) 38 | 39 | for idx,inputs in enumerate(data_loader): 40 | 41 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-3]] 42 | 43 | user_emb = model.forward_recall(inputs_LongTensor) #b*d 44 | 45 | _, topk_realshow_logits_index = realshow_index_flat.search(user_emb.cpu().numpy(), k=1000) 46 | 47 | topk_realshow_videos = realshow_video_corpus[topk_realshow_logits_index] #b*k 48 | 49 | vids = inputs[-3].numpy().astype(np.int64) #b*30 50 | labels = inputs[-2].numpy().astype(np.float32) #b*30 51 | 52 | n_videos = inputs[-1].numpy() #b 53 | 54 | for i in range(n_videos.shape[0]): 55 | 56 | n_video = n_videos[i] 57 | 58 | topk_realshow_video = topk_realshow_videos[i] 59 | 60 | vid = vids[i,:n_video] 61 | label = labels[i,:n_video] 62 | 63 | #target metric 64 | if np.sum(label) > 0: 65 | target_pos_index = np.nonzero(label)[0] 66 | target_pos_vid = vid[target_pos_index] 67 | 68 | target_pos_realshow_rank = np.where(topk_realshow_video == target_pos_vid[:,None])[1] 69 | if target_pos_realshow_rank.shape[0] > 0: 70 | for i in range(len(target_top_k)): 71 | target_realshow_recall_lst[i] += np.sum(target_pos_realshow_rank < target_top_k[i]) -------------------------------------------------------------------------------- /retrieval/models.py: -------------------------------------------------------------------------------- 66 | 67 | neg_logits = neg_logits.view(-1) #bn 68 | 69 | return tgt_logits, neg_logits 70 | 71 | 72 | def forward_fsltr(self, inputs): 73 | seq, seq_mask, vids = inputs 74 | 75 | seq_emb = self.vid_emb(seq) #b*t*d 76 | vids_emb = self.vid_emb(vids) #b*d 77 | 78 | position_emb = self.position(torch.arange(self.seq_len, dtype=torch.int64, device=self.device)) #t*d 79 | 80 | seq_emb = seq_emb + position_emb #b*t*d 81 | 82 | mask = torch.ne(seq_mask, 0).float().unsqueeze(-1) #b*t*1 83 | 84 | seq_emb *= mask 85 | 86 | seq_emb_ln = self.ln_1(seq_emb) 87 | 88 | mh_attn_out = self.mh_attn(seq_emb_ln, seq_emb) 89 | 90 | ff_out = self.feed_forward(self.ln_2(mh_attn_out)) 91 | 92 | ff_out *= mask 93 | 94 | ff_out = self.ln_3(ff_out) #b*t*d 95 | 96 | final_state = ff_out[:,-1,:] #b*d 97 | 98 | logits = torch.bmm(vids_emb, final_state.unsqueeze(-1)).squeeze() 99 | 100 | return logits 101 | 102 | 103 | def forward_recall(self, inputs): 104 | 105 | seq, seq_mask = inputs 106 | 107 | seq_emb = self.vid_emb(seq) #b*t*d 108 | 109 | position_emb = self.position(torch.arange(self.seq_len, dtype=torch.int64, device=self.device)) #t*d 110 | 111 | seq_emb += position_emb 112 | 113 | mask = torch.ne(seq_mask, 0).float().unsqueeze(-1) #b*seq_len*1 114 | 115 | seq_emb *= mask 116 | 117 | seq_emb_ln = self.ln_1(seq_emb) 118 | 119 | mh_attn_out = self.mh_attn(seq_emb_ln, seq_emb) 120 | 121 | ff_out = self.feed_forward(self.ln_2(mh_attn_out)) 122 | 123 | ff_out *= mask 124 | 125 | ff_out = self.ln_3(ff_out) 126 | 127 | final_state = ff_out[:,-1,:] #b*d 128 | 129 | return final_state #b*d
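The `evaluate_recall` routine in retrieval/metrics.py pairs `forward_recall` user embeddings with a faiss inner-product index built over the realshow video corpus. Below is a minimal, self-contained sketch of that retrieval-and-recall@k flow, assuming toy sizes, random vectors, and CPU faiss (`IndexFlatIP`) in place of the repo's `GpuIndexFlatIP`; `pos_id` and the array shapes are illustrative stand-ins, not values from the dataset.

```python
# Minimal sketch of the faiss-based recall@k evaluation (assumptions: toy
# sizes, random embeddings, CPU faiss; not the repo's actual data or model).
import faiss
import numpy as np

d = 8                                                   # must match emb_dim (hardcoded as 8 in metrics.py)
item_emb = np.random.rand(1000, d).astype(np.float32)   # stand-in for model.vid_emb.weight rows
user_emb = np.random.rand(4, d).astype(np.float32)      # stand-in for model.forward_recall output

index = faiss.IndexFlatIP(d)          # exact inner-product search
index.add(item_emb)                   # one row per candidate video
_, topk = index.search(user_emb, 100) # topk: (4, 100) item row ids per user

pos_id = 42                           # hypothetical positive item per user
recall_at_100 = np.mean([pos_id in row for row in topk])
print(f"recall@100: {recall_at_100:.4f}")
```

The same pattern underlies `target_realshow_recall_lst` above: a positive contributes to recall@k exactly when its rank in the index search result is below k.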
-------------------------------------------------------------------------------- /retrieval/modules.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | 5 | class PositionwiseFeedForward(nn.Module): 6 | def __init__(self, d_in, d_hid, dropout=0.1): 7 | super().__init__() 8 | self.w_1 = nn.Conv1d(d_in, d_hid, 1) 9 | self.w_2 = nn.Conv1d(d_hid, d_in, 1) 10 | # self.layer_norm = nn.LayerNorm(d_in) 11 | self.dropout = nn.Dropout(dropout) 12 | 13 | def forward(self, x): 14 | residual = x 15 | output = x.transpose(1, 2) 16 | output = self.w_2(F.relu(self.w_1(output))) 17 | output = output.transpose(1, 2) 18 | output = self.dropout(output) 19 | # output = self.layer_norm(output + residual) 20 | output = output + residual 21 | return output 22 | 23 | class MultiHeadAttention(nn.Module): 24 | def __init__(self, hidden_size, num_units, num_heads, dropout_rate): 25 | super().__init__() 26 | self.hidden_size = hidden_size 27 | self.num_heads = num_heads 28 | assert hidden_size % num_heads == 0 29 | 30 | self.linear_q = nn.Linear(hidden_size, num_units) 31 | self.linear_k = nn.Linear(hidden_size, num_units) 32 | self.linear_v = nn.Linear(hidden_size, num_units) 33 | self.dropout = nn.Dropout(dropout_rate) 34 | self.softmax = nn.Softmax(dim=-1) 35 | 36 | 37 | def forward(self, queries, keys): 38 | """ 39 | :param queries: A 3d tensor with shape of [N, T_q, C_q] 40 | :param keys: A 3d tensor with shape of [N, T_k, C_k] 41 | 42 | :return: A 3d tensor with shape of (N, T_q, C) 43 | 44 | """ 45 | Q = self.linear_q(queries) # (N, T_q, C) 46 | K = self.linear_k(keys) # (N, T_k, C) 47 | V = self.linear_v(keys) # (N, T_k, C) 48 | 49 | # Split and Concat 50 | split_size = self.hidden_size // self.num_heads 51 | Q_ = torch.cat(torch.split(Q, split_size, dim=2), dim=0) # (h*N, T_q, C/h) 52 | K_ = torch.cat(torch.split(K, split_size, dim=2), dim=0) # (h*N, T_k, C/h) 53 | V_ = torch.cat(torch.split(V, split_size, dim=2), dim=0) # (h*N, T_k, C/h) 54 | 55 | # Multiplication 56 | matmul_output = torch.bmm(Q_, K_.transpose(1, 2)) / self.hidden_size ** 0.5 # (h*N, T_q, T_k) 57 | 58 | # Key Masking 59 | key_mask = torch.sign(torch.abs(keys.sum(dim=-1))).repeat(self.num_heads, 1) # (h*N, T_k) 60 | key_mask_reshaped = key_mask.unsqueeze(1).repeat(1, queries.shape[1], 1) # (h*N, T_q, T_k) 61 | key_paddings = torch.ones_like(matmul_output) * (-2 ** 32 + 1) 62 | matmul_output_m1 = torch.where(torch.eq(key_mask_reshaped, 0), key_paddings, matmul_output) # (h*N, T_q, T_k) 63 | 64 | # Causality - Future Blinding 65 | diag_vals = torch.ones_like(matmul_output[0, :, :]) # (T_q, T_k) 66 | tril = torch.tril(diag_vals) # (T_q, T_k) 67 | causality_mask = tril.unsqueeze(0).repeat(matmul_output.shape[0], 1, 1) # (h*N, T_q, T_k) 68 | causality_paddings = torch.ones_like(causality_mask) * (-2 ** 32 + 1) 69 | matmul_output_m2 = torch.where(torch.eq(causality_mask, 0), causality_paddings, matmul_output_m1) # (h*N, T_q, T_k) 70 | 71 | # Activation 72 | matmul_output_sm = self.softmax(matmul_output_m2) # (h*N, T_q, T_k) 73 | 74 | # Query Masking 75 | query_mask = torch.sign(torch.abs(queries.sum(dim=-1))).repeat(self.num_heads, 1) # (h*N, T_q) 76 | query_mask = query_mask.unsqueeze(-1).repeat(1, 1, keys.shape[1]) # (h*N, T_q, T_k) 77 | matmul_output_qm = matmul_output_sm * query_mask 78 | 79 | # Dropout 80 | matmul_output_dropout = self.dropout(matmul_output_qm) 81 | 82 | # Weighted Sum 83 | output_ws =
torch.bmm(matmul_output_dropout, V_) # ( h*N, T_q, C/h) 84 | 85 | # Restore Shape 86 | output = torch.cat(torch.split(output_ws, output_ws.shape[0] // self.num_heads, dim=0), dim=2) # (N, T_q, C) 87 | 88 | # Residual Connection 89 | output_res = output + queries 90 | 91 | return output_res -------------------------------------------------------------------------------- /retrieval/run_sasrec.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | from torch.utils.data import DataLoader 7 | 8 | from models import SASRec 9 | from dataset import Recall_Train_SASRec_Dataset 10 | 11 | from utils import load_pkl 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 20 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 21 | parser.add_argument('--seq_len', type=int, default=3, help='length of behaivor sequence') 22 | 23 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 24 | 25 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 26 | 27 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 28 | 29 | parser.add_argument('--neg_num', type=int, default=50, help='number of negative samples') 30 | 31 | return parser.parse_args() 32 | 33 | 34 | if __name__ == '__main__': 35 | args = parse_args() 36 | 37 | for k,v in vars(args).items(): 38 | print(f"{k}:{v}") 39 | 40 | #prepare data 41 | prefix = "../data" 42 | 43 | realshow_prefix = os.path.join(prefix, "realshow") 44 | path_to_train_csv_lst = [] 45 | with open("./file.txt", mode='r') as f: 46 | lines = f.readlines() 47 | for line in lines: 48 | tmp_csv_path = os.path.join(realshow_prefix, line.strip()+'.feather') 49 | path_to_train_csv_lst.append(tmp_csv_path) 50 | 51 | num_of_train_csv = len(path_to_train_csv_lst) 52 | print("training files:") 53 | print(f"number of train_csv: {num_of_train_csv}") 54 | for idx, filepath in enumerate(path_to_train_csv_lst): 55 | print(f"{idx}: {filepath}") 56 | 57 | #prepare seq 58 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 59 | path_to_train_seq_pkl_lst = [] 60 | with open("./file.txt", mode='r') as f: 61 | lines = f.readlines() 62 | for line in lines: 63 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 64 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 65 | 66 | print("training seq files:") 67 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 68 | print(f"{idx}: {filepath}") 69 | 70 | #prepare id_cnt 71 | others_prefix = os.path.join(prefix, "others") 72 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 73 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 74 | 75 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 76 | for k,v in id_cnt_dict.items(): 77 | print(f"{k}:{v}") 78 | 79 | #prepare negatives 80 | video_prefix = os.path.join(others_prefix, "realshow_video_info_daily") 81 | path_to_video_info_feather_lst = [] 82 | with open("./file.txt", mode='r') as f: 83 | lines = f.readlines() 84 | for line in lines: 85 | tmp_video_feather_path = os.path.join(video_prefix, 
--------------------------------------------------------------------------------
/retrieval/run_sasrec.sh:
--------------------------------------------------------------------------------
set -x
set -e
set -o pipefail

tag="sasrec-1st"

bs=4096
lr=1e-1
neg_num=200

mkdir -p logs  # the log redirects below fail if ./logs does not exist

python -B -u run_sasrec.py \
--epochs=1 \
--batch_size=${bs} \
--infer_batch_size=900 \
--emb_dim=8 \
--lr=${lr} \
--seq_len=50 \
--cuda="0" \
--print_freq=100 \
--neg_num=${neg_num} \
--tag=${tag} > "./logs/bs-${bs}_lr-${lr}_${neg_num}_${tag}.log" 2>&1

python -B -u eval_sasrec.py \
--epochs=1 \
--batch_size=${bs} \
--infer_recall_batch_size=900 \
--emb_dim=8 \
--lr=${lr} \
--seq_len=50 \
--cuda="0" \
--print_freq=100 \
--neg_num=${neg_num} \
--tag=${tag} >> "./logs/bs-${bs}_lr-${lr}_${neg_num}_${tag}.log" 2>&1
--------------------------------------------------------------------------------
/retrieval/run_sasrec_fsltr.py:
--------------------------------------------------------------------------------
import os
import argparse

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

from models import SASRec
from dataset import Recall_Train_SASRec_FSLTR_Dataset

from utils import load_pkl

def parse_args():
    parser = argparse.ArgumentParser()

    parser.add_argument('--epochs', type=int, default=1, help='epochs.')
    parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.')
    parser.add_argument('--infer_batch_size', type=int, default=1024, help='inference batch size.')
    parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.')
    parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.')
    parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.')

    parser.add_argument('--cuda', type=int, default=0, help='cuda device.')

    parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.')

    parser.add_argument('--tag', type=str, default="1st", help='exp tag.')

    parser.add_argument('--neg_num', type=int, default=3, help='number of negative samples.')

    parser.add_argument('--flow_negs', type=str, default='mcd_prerank_neg', help='comma-separated stage names used as flow negatives.')

    parser.add_argument('--flow_neg_nums', type=str, default='3', help='comma-separated number of negatives per stage.')

    return parser.parse_args()


if __name__ == '__main__':
    args = parse_args()

    for k, v in vars(args).items():
        print(f"{k}:{v}")

    # prepare data
    prefix = "../data"

    realshow_prefix = os.path.join(prefix, "realshow")
    path_to_train_csv_lst = []
    with open("./file.txt", mode='r') as f:
        lines = f.readlines()
        for line in lines:
            tmp_csv_path = os.path.join(realshow_prefix, line.strip() + '.feather')
            path_to_train_csv_lst.append(tmp_csv_path)

    num_of_train_csv = len(path_to_train_csv_lst)
    print("training files:")
    print(f"number of train_csv: {num_of_train_csv}")
    for idx, filepath in enumerate(path_to_train_csv_lst):
        print(f"{idx}: {filepath}")

    # prepare seq
    seq_prefix = os.path.join(prefix, "seq_effective_50_dict")
    path_to_train_seq_pkl_lst = []
    with open("./file.txt", mode='r') as f:
        lines = f.readlines()
        for line in lines:
            tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip() + '.pkl')
            path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path)

    print("training seq files:")
    for idx, filepath in enumerate(path_to_train_seq_pkl_lst):
        print(f"{idx}: {filepath}")

    # prepare request_id
    request_id_prefix = os.path.join(prefix, "request_id_dict")
    path_to_request_id_pkl_lst = []
    with open("./file.txt", mode='r') as f:
        lines = f.readlines()
        for line in lines:
            tmp_request_id_pkl_path = os.path.join(request_id_prefix, line.strip() + '.pkl')
            path_to_request_id_pkl_lst.append(tmp_request_id_pkl_path)

    print("training request_id files:")
    for idx, filepath in enumerate(path_to_request_id_pkl_lst):
        print(f"{idx}: {filepath}")

    # prepare id_cnt
    others_prefix = os.path.join(prefix, "others")
    path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl")
    print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}")

    id_cnt_dict = load_pkl(path_to_id_cnt_pkl)
    for k, v in id_cnt_dict.items():
        print(f"{k}:{v}")

    # prepare negatives
    video_prefix = os.path.join(others_prefix, "realshow_video_info_daily")
    path_to_video_info_feather_lst = []
    with open("./file.txt", mode='r') as f:
        lines = f.readlines()
        for line in lines:
            tmp_video_feather_path = os.path.join(video_prefix, line.strip() + '.feather')
            path_to_video_info_feather_lst.append(tmp_video_feather_path)

    print("realshow daily negative")
    for idx, filepath in enumerate(path_to_video_info_feather_lst):
        print(f"{idx}: {filepath}")

    # prepare model
    os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda)
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print(f"device: {device}")

    model = SASRec(args.emb_dim, args.seq_len, args.neg_num, device, id_cnt_dict).to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)

    # training
    for epoch in range(args.epochs):
        for n_day in range(num_of_train_csv):

            train_dataset = Recall_Train_SASRec_FSLTR_Dataset(
                path_to_train_csv_lst[n_day],
                args.seq_len, args.neg_num,
                path_to_train_seq_pkl_lst[n_day],
                path_to_request_id_pkl_lst[n_day],
                path_to_video_info_feather_lst[n_day],
                args.flow_negs,
                args.flow_neg_nums
            )

            train_loader = DataLoader(
                dataset=train_dataset,
                batch_size=args.batch_size,
                shuffle=True,
                num_workers=1,
                drop_last=False
            )

            for iter_step, inputs in enumerate(train_loader):

                # The last element of each batch is the float stage-priority tensor.
                inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]]

                logits = model.forward_fsltr(inputs_LongTensor)  # (b, k)

                priority = torch.FloatTensor(inputs[-1].numpy()).to(device)  # (b, k)

                # Pairwise preference weights: pair (i, j) contributes only if
                # item i reached a later (higher-priority) stage than item j.
                weight = torch.gt(
                    priority.unsqueeze(-1), priority.unsqueeze(1)
                ).float()

                # Push logit_i above logit_j for every weighted pair.
                logits_diff = logits.unsqueeze(-1) - logits.unsqueeze(1)

                loss = F.binary_cross_entropy_with_logits(logits_diff, torch.ones_like(logits_diff), weight=weight, reduction='sum') / weight.sum()

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                if iter_step % args.print_freq == 0:
                    print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}")

    os.makedirs("./checkpoints", exist_ok=True)  # torch.save fails if the directory is missing
    path_to_save_model = f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.neg_num}_{args.flow_negs}_{args.flow_neg_nums}_{args.tag}.pkl"

    torch.save(model.state_dict(), path_to_save_model)

    print(f"save model to {path_to_save_model} DONE.")
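
# --- Illustrative sketch (editor's addition, not part of the repo): the FSLTR
# loss above compares every pair of candidates in a request, keeping only the
# pairs (i, j) where item i survived to a later stage than item j. Made-up
# priorities and scores for k=3 candidates in one request:
import torch
import torch.nn.functional as F

priority = torch.tensor([[3.0, 1.0, 2.0]])   # (b=1, k=3); higher = later stage
logits = torch.tensor([[0.5, 0.2, 0.9]])     # model scores for the same items
weight = torch.gt(priority.unsqueeze(-1), priority.unsqueeze(1)).float()  # (1, k, k)
logits_diff = logits.unsqueeze(-1) - logits.unsqueeze(1)                  # (1, k, k)
loss = F.binary_cross_entropy_with_logits(
    logits_diff, torch.ones_like(logits_diff), weight=weight, reduction='sum') / weight.sum()
print(loss.item())  # averages over the 3 ordered pairs: (3>1), (3>2), (2>1)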
--------------------------------------------------------------------------------
/retrieval/run_sasrec_fsltr.sh:
--------------------------------------------------------------------------------
set -x
set -e
set -o pipefail

tag="sasrec-fsltr-1st"

bs=4096
lr=1e-1
neg_num=200

flow_negs=realshow,coarse_neg,prerank_neg
flow_neg_nums=1,1,1

mkdir -p logs  # the log redirects below fail if ./logs does not exist

python -B -u run_sasrec_fsltr.py \
--epochs=1 \
--batch_size=${bs} \
--infer_batch_size=900 \
--emb_dim=8 \
--lr=${lr} \
--seq_len=50 \
--cuda="0" \
--print_freq=100 \
--neg_num=${neg_num} \
--flow_negs=${flow_negs} \
--flow_neg_nums=${flow_neg_nums} \
--tag=${tag} > "./logs/bs-${bs}_lr-${lr}_${neg_num}_${flow_negs}_${flow_neg_nums}_${tag}.log" 2>&1

python -B -u eval_sasrec_fsltr.py \
--epochs=1 \
--batch_size=${bs} \
--infer_recall_batch_size=900 \
--emb_dim=8 \
--lr=${lr} \
--seq_len=50 \
--cuda="0" \
--print_freq=100 \
--neg_num=${neg_num} \
--flow_negs=${flow_negs} \
--flow_neg_nums=${flow_neg_nums} \
--tag=${tag} >> "./logs/bs-${bs}_lr-${lr}_${neg_num}_${flow_negs}_${flow_neg_nums}_${tag}.log" 2>&1
--------------------------------------------------------------------------------
/retrieval/run_sasrec_hardnegmining.py:
--------------------------------------------------------------------------------
import os
import argparse

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

from models import SASRec
from dataset import Recall_Train_SASRec_HardNegMining_Dataset

from utils import load_pkl

def parse_args():
    parser = argparse.ArgumentParser()

    parser.add_argument('--epochs', type=int, default=1, help='epochs.')
    parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.')
    parser.add_argument('--infer_batch_size', type=int, default=1024, help='inference batch size.')
    parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.')
    parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.')
    parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.')

    parser.add_argument('--cuda', type=int, default=0, help='cuda device.')

    parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.')

    parser.add_argument('--tag', type=str, default="1st", help='exp tag.')

    parser.add_argument('--neg_num', type=int, default=3, help='number of negative samples.')

    parser.add_argument('--flow_negs', type=str, default='mcd_prerank_neg', help='comma-separated stage names used as hard negatives.')

    parser.add_argument('--flow_neg_nums', type=str, default='3', help='comma-separated number of negatives per stage.')

    return parser.parse_args()


if __name__ == '__main__':
    args = parse_args()

    for k, v in vars(args).items():
        print(f"{k}:{v}")

    # prepare data
    prefix = "../data"

    realshow_prefix = os.path.join(prefix, "realshow")
    path_to_train_csv_lst = []
    with open("./file.txt", mode='r') as f:
        lines = f.readlines()
        for line in lines:
            tmp_csv_path = os.path.join(realshow_prefix, line.strip() + '.feather')
            path_to_train_csv_lst.append(tmp_csv_path)

    num_of_train_csv = len(path_to_train_csv_lst)
    print("training files:")
    print(f"number of train_csv: {num_of_train_csv}")
    for idx, filepath in enumerate(path_to_train_csv_lst):
        print(f"{idx}: {filepath}")

    # prepare seq
    seq_prefix = os.path.join(prefix, "seq_effective_50_dict")
    path_to_train_seq_pkl_lst = []
    with open("./file.txt", mode='r') as f:
        lines = f.readlines()
        for line in lines:
            tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip() + '.pkl')
            path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path)

    print("training seq files:")
    for idx, filepath in enumerate(path_to_train_seq_pkl_lst):
        print(f"{idx}: {filepath}")

    # prepare request_id
    request_id_prefix = os.path.join(prefix, "request_id_dict")
    path_to_request_id_pkl_lst = []
    with open("./file.txt", mode='r') as f:
        lines = f.readlines()
        for line in lines:
            tmp_request_id_pkl_path = os.path.join(request_id_prefix, line.strip() + '.pkl')
            path_to_request_id_pkl_lst.append(tmp_request_id_pkl_path)

    print("training request_id files:")
    for idx, filepath in enumerate(path_to_request_id_pkl_lst):
        print(f"{idx}: {filepath}")

    # prepare id_cnt
    others_prefix = os.path.join(prefix, "others")
    path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl")
    print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}")

    id_cnt_dict = load_pkl(path_to_id_cnt_pkl)
    for k, v in id_cnt_dict.items():
        print(f"{k}:{v}")

    # prepare negatives
    video_prefix = os.path.join(others_prefix, "realshow_video_info_daily")
    path_to_video_info_feather_lst = []
    with open("./file.txt", mode='r') as f:
        lines = f.readlines()
        for line in lines:
            tmp_video_feather_path = os.path.join(video_prefix, line.strip() + '.feather')
            path_to_video_info_feather_lst.append(tmp_video_feather_path)

    print("realshow daily negative")
    for idx, filepath in enumerate(path_to_video_info_feather_lst):
        print(f"{idx}: {filepath}")

    # prepare model
    os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda)
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print(f"device: {device}")

    model = SASRec(args.emb_dim, args.seq_len, args.neg_num, device, id_cnt_dict).to(device)

    loss_fn = nn.LogSigmoid().to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)

    # training
    for epoch in range(args.epochs):
        for n_day in range(num_of_train_csv):

            # Same BPR objective as run_sasrec.py, but the dataset mixes in
            # hard negatives drawn from the requested flow stages.
            train_dataset = Recall_Train_SASRec_HardNegMining_Dataset(
                path_to_train_csv_lst[n_day],
                args.seq_len, args.neg_num,
                path_to_train_seq_pkl_lst[n_day],
                path_to_request_id_pkl_lst[n_day],
                path_to_video_info_feather_lst[n_day],
                args.flow_negs,
                args.flow_neg_nums
            )

            train_loader = DataLoader(
                dataset=train_dataset,
                batch_size=args.batch_size,
                shuffle=True,
                num_workers=1,
                drop_last=False
            )

            for iter_step, inputs in enumerate(train_loader):

                inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs]

                tgt_logits, neg_logits = model(inputs_LongTensor)  # tgt: (b,), neg: (b*neg_num,)

                # Pair each positive logit with each of its neg_num negatives.
                pos_logits_expand = tgt_logits.repeat_interleave(args.neg_num)

                loss = -loss_fn(pos_logits_expand - neg_logits).mean()

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                if iter_step % args.print_freq == 0:
                    print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}")

    os.makedirs("./checkpoints", exist_ok=True)  # torch.save fails if the directory is missing
    path_to_save_model = f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_neg_num-{args.neg_num}_flow_negs-{args.flow_negs}_flow_neg_nums-{args.flow_neg_nums}_{args.tag}.pkl"

    torch.save(model.state_dict(), path_to_save_model)

    print(f"save model to {path_to_save_model} DONE.")
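
# --- Illustrative sketch (editor's addition; an assumption about how the
# Dataset consumes its arguments, not confirmed by the repo): the paired
# --flow_negs / --flow_neg_nums strings are comma-separated and presumably
# parsed into per-stage hard-negative counts, e.g.:
flow_negs = "realshow,coarse_neg,prerank_neg"
flow_neg_nums = "1,1,1"
stage_to_num = dict(zip(flow_negs.split(','), map(int, flow_neg_nums.split(','))))
print(stage_to_num)  # {'realshow': 1, 'coarse_neg': 1, 'prerank_neg': 1}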
--------------------------------------------------------------------------------
/retrieval/run_sasrec_hardnegmining.sh:
--------------------------------------------------------------------------------
set -x
set -e
set -o pipefail

tag="sasrec-hardnegmining-1st"

bs=4096
lr=1e-1
neg_num=200

flow_negs=prerank_neg
flow_neg_nums=1

mkdir -p logs  # the log redirects below fail if ./logs does not exist

python -B -u run_sasrec_hardnegmining.py \
--epochs=1 \
--batch_size=${bs} \
--infer_batch_size=900 \
--emb_dim=8 \
--lr=${lr} \
--seq_len=50 \
--cuda="0" \
--print_freq=100 \
--neg_num=${neg_num} \
--flow_negs=${flow_negs} \
--flow_neg_nums=${flow_neg_nums} \
--tag=${tag} > "./logs/bs-${bs}_lr-${lr}_${neg_num}_${flow_negs}_${flow_neg_nums}_${tag}.log" 2>&1

python -B -u eval_sasrec_hardnegmining.py \
--epochs=1 \
--batch_size=${bs} \
--infer_recall_batch_size=900 \
--emb_dim=8 \
--lr=${lr} \
--seq_len=50 \
--cuda="0" \
--print_freq=100 \
--neg_num=${neg_num} \
--flow_negs=${flow_negs} \
--flow_neg_nums=${flow_neg_nums} \
--tag=${tag} >> "./logs/bs-${bs}_lr-${lr}_${neg_num}_${flow_negs}_${flow_neg_nums}_${tag}.log" 2>&1
--------------------------------------------------------------------------------
/retrieval/utils.py:
--------------------------------------------------------------------------------
import pickle as pkl
from collections import defaultdict

# Module-level factory functions (rather than lambdas) so that nested
# defaultdicts built with them remain picklable.
def defaultdict_tuple():
    return defaultdict(tuple)

def defaultdict_str():
    return defaultdict(str)

def load_pkl(filename):
    with open(filename, 'rb') as f:
        return pkl.load(f)
--------------------------------------------------------------------------------
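
# --- Illustrative sketch (editor's addition, not part of the repo): why the
# named factory functions in utils.py matter. pickle can serialize a
# defaultdict whose default_factory is a module-level function, but not one
# built from a lambda. Assumes utils.py is importable from the working dir.
import pickle
from collections import defaultdict
from utils import defaultdict_tuple

d = defaultdict(defaultdict_tuple)
d["request_id"]["realshow"] = (1, 2, 3)
blob = pickle.dumps(d)  # works: the factory is a named, importable function
# pickle.dumps(defaultdict(lambda: defaultdict(tuple)))  # would raise an error
print(pickle.loads(blob)["request_id"]["realshow"])  # (1, 2, 3)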