├── LICENSE ├── README.md ├── coarse ├── dataset.py ├── eval_dssm.py ├── eval_dssm_auxiliary_ranking.py ├── eval_dssm_data_dist_shift_all.py ├── eval_dssm_data_dist_shift_sampling.py ├── eval_dssm_fsltr.py ├── eval_dssm_ubm.py ├── file.txt ├── metrics.py ├── models.py ├── run_dssm.py ├── run_dssm.sh ├── run_dssm_auxiliary_ranking.py ├── run_dssm_auxiliary_ranking.sh ├── run_dssm_data_dist_shift_all.py ├── run_dssm_data_dist_shift_all.sh ├── run_dssm_data_dist_shift_sampling.py ├── run_dssm_data_dist_shift_sampling.sh ├── run_dssm_fsltr.py ├── run_dssm_fsltr.sh ├── run_dssm_ubm.py ├── run_dssm_ubm.sh └── utils.py ├── data ├── .DS_Store ├── all_stage │ └── example.feather ├── others │ ├── coarse_rank_test.feather │ ├── id_cnt.pkl │ ├── rank_test.feather │ ├── realshow_video_info.feather │ └── retrieval_test.feather ├── realshow │ └── example.feather ├── request_id_dict │ └── example.pkl └── seq_effective_50_dict │ └── example.pkl ├── rank ├── dataset.py ├── eval_din.py ├── eval_din_auxiliary_ranking.py ├── eval_din_data_dist_shift_all.py ├── eval_din_data_dist_shift_sampling.py ├── eval_din_fsltr.py ├── eval_din_ubm.py ├── file.txt ├── metrics.py ├── models.py ├── run_din.py ├── run_din.sh ├── run_din_auxiliary_ranking.py ├── run_din_auxiliary_ranking.sh ├── run_din_data_dist_shift_all.py ├── run_din_data_dist_shift_all.sh ├── run_din_data_dist_shift_sampling.py ├── run_din_data_dist_shift_sampling.sh ├── run_din_fsltr.py ├── run_din_fsltr.sh ├── run_din_ubm.py ├── run_din_ubm.sh └── utils.py ├── recflow.jpg └── retrieval ├── dataset.py ├── eval_sasrec.py ├── eval_sasrec_fsltr.py ├── eval_sasrec_hardnegmining.py ├── file.txt ├── metrics.py ├── models.py ├── modules.py ├── run_sasrec.py ├── run_sasrec.sh ├── run_sasrec_fsltr.py ├── run_sasrec_fsltr.sh ├── run_sasrec_hardnegmining.py ├── run_sasrec_hardnegmining.sh └── utils.py /README.md: -------------------------------------------------------------------------------- 1 | # RecFlow: An Industrial Full Flow Recommendation Dataset 2 | 3 | [![LICENSE](https://img.shields.io/badge/license-CC%20BY--SA%204.0-green)](https://github.com/RecFlow-nips24/RecFlow-nips24/blob/main/LICENSE) 4 | 5 | ### Download the data 6 | 7 | Download manually through the following link: 8 | 9 | - link: [Drive](https://rec.ustc.edu.cn/share/f8e5adc0-2e57-11ef-bea5-3b4cac9d110e) 10 | 11 | --- 12 | 13 | ### Motivation 14 | To provide the recommendation systems (RS) research community with an industrial full flow dataset, we propose RecFlow, which includes samples from the exposure space as well as unexposed items filtered at each stage of Kuaishou's multi-stage RS. Compared with all existing public RS datasets, RecFlow can be leveraged not only to optimize conventional recommendation tasks but also to study challenges such as the interplay of different stages, data distribution shift, auxiliary ranking tasks, and user behavior sequence modeling. It is the first public RS dataset that allows researchers to study a real industrial multi-stage RS. 15 | 16 | The following figure illustrates the process of RecFlow's data collection. 17 | 18 | ![kuaidata](./recflow.jpg) 19 | 20 | ### Usage 21 | RecFlow can be applied to the following tasks. (1) By recording items from the serving space, RecFlow enables the study of how to alleviate the discrepancy between training and serving for specific stages during both the learning and evaluation processes.
(2) RecFlow also records the stage information of each sample, facilitating research on the joint modeling of multiple stages, such as stage consistency or optimal multi-stage RS. (3) The positive and negative samples from the exposure space are suitable for classical click-through rate prediction or sequential recommendation tasks. (4) RecFlow stores multiple types of positive feedback (e.g., effective view, long view, like, follow, forward, comment), supporting research on multi-task recommendation. (5) Information about the video duration and playing time of each exposed video allows the study of learning through implicit feedback, such as predicting playing time. (6) RecFlow includes a request identifier feature, which can contribute to studying the re-ranking problem. (7) Timestamps for each sample enable the aggregation of user feedback in chronological order, facilitating the study of user behavior sequence modeling algorithms. (8) RecFlow incorporates context, user, and video features beyond identity features (e.g., user ID and video ID), making it suitable for context-based recommendation. (9) The rich information recorded about the RS and user feedback allows the construction of more accurate RS simulators or user models in feed scenarios. (10) Rich stage data may help estimate selection bias more accurately and design better debiased algorithms. 22 | 23 | --- 24 | 25 | ### Dataset Organization 26 | 27 | The *RecFlow* dataset has the following folders. **all_stage** contains data from all stages. **realshow** contains data from the exposure space. **seq_effective_50_dict** contains each user's effective_view behavior sequence of length 50. **request_id_dict** stores the data from all stages in a first_level_key-second_level_key-value structure: the first_level_key is the *request_id*, the second_level_key is the stage label (i.e., *realshow, rerank_pos, rerank_neg, rank_pos, rank_neg, coarse_neg, prerank_neg*), and the value is the videos of that stage. **ubm_seq_request_id_dict** is for user behavior sequence modeling tasks and holds the same structure as **request_id_dict**. **id_cnt.pkl** records the number of unique IDs in each feature field. **retrieval_test.feather** is the test set for the retrieval experiments. **coarse_rank_test.feather** is the test set for the coarse ranking experiments. **rank_test.feather** is the test set for the ranking experiments. **realshow_video_info.feather** contains video information from the exposure space. **realshow_video_info_daily** contains the accumulated daily video information from the exposure space. 28 | ``` 29 | RecFlow 30 | ├── all_stage 31 | | ├──2024-01-13.feather 32 | | ├──2024-01-14.feather 33 | | ├──... 34 | | └──2024-02-18.feather 35 | | 36 | ├── realshow 37 | | ├──2024-01-13.feather 38 | | ├──2024-01-14.feather 39 | | ├──... 40 | | └──2024-02-18.feather 41 | | 42 | ├── seq_effective_50_dict 43 | | ├──2024-01-13.pkl 44 | | ├──2024-01-14.pkl 45 | | ├──... 46 | | └──2024-02-18.pkl 47 | | 48 | ├── request_id_dict 49 | | ├──2024-01-13.pkl 50 | | ├──2024-01-14.pkl 51 | | ├──... 52 | | └──2024-02-18.pkl 53 | | 54 | ├── ubm_seq_request_id_dict 55 | | ├──2024-01-13.pkl 56 | | ├──2024-01-14.pkl 57 | | ├──... 58 | | └──2024-02-18.pkl 59 | | 60 | └── others 61 | ├──id_cnt.pkl 62 | ├──retrieval_test.feather 63 | ├──coarse_rank_test.feather 64 | ├──rank_test.feather 65 | ├──realshow_video_info.feather 66 | └──realshow_video_info_daily 67 | ├──2024-01-13.feather 68 | ├──...
70 | └──2024-02-18.feather 71 | ``` 72 | 73 | #### Descriptions of the feature fields in RecFlow 74 | 75 | | Field Name | Description | Type | 76 | | -------------- | -------------------------------------------------------- | ------- | 77 | | request\_id | The unique ID of each recommendation request. | Integer | 78 | | request\_timestamp | The timestamp of each recommendation request. | Integer | 79 | | user\_id | The unique ID of each user. | Integer | 80 | | device\_id | The unique ID of each device. | Integer | 81 | | age | The user's age. | Integer | 82 | | gender | The user's gender. | Integer | 83 | | province | The user's province. | Integer | 84 | | video\_id | The unique ID of each video. | Integer | 85 | | author\_id | The unique ID of each author. | Integer | 86 | | category\_level\_one | The first-level category ID of each video. | Integer | 87 | | category\_level\_two | The second-level category ID of each video. | Integer | 88 | | upload\_type | The upload type ID of each video. | Integer | 89 | | upload\_timestamp | The upload timestamp of each video. | Integer | 90 | | duration | The duration of each video in milliseconds. | Integer | 91 | | realshow | A binary feedback signal indicating the video is exposed to the user. | Integer | 92 | | rerank\_pos | A binary feedback signal indicating the video ranks in the top-10 at the rerank stage. | Integer | 93 | | rerank\_neg | A binary feedback signal indicating the video ranks outside the top-10 at the rerank stage. | Integer | 94 | | rank\_pos | A binary feedback signal indicating the video ranks in the top-10 at the rank stage. | Integer | 95 | | rank\_neg | A binary feedback signal indicating the video ranks outside the top-10 at the rank stage. | Integer | 96 | | coarse\_neg | A binary feedback signal indicating the video ranks outside the top-500 at the coarse rank stage. | Integer | 97 | | prerank\_neg | A binary feedback signal indicating the video ranks outside the top-500 at the pre-rank stage. | Integer | 98 | | rank\_index | The rank position of the video at the rank stage. | Integer | 99 | | rerank\_index | The rank position of the video at the rerank stage. | Integer | 100 | | playing\_time | The length of time the user watched the video. | Integer | 101 | | effective\_view | A binary feedback signal indicating the user watches at least 30\% of the video. | Integer | 102 | | long\_view | A binary feedback signal indicating the user watches at least 100\% of the video. | Integer | 103 | | like | A binary feedback signal indicating the user hits the like button. | Integer | 104 | | follow | A binary feedback signal indicating the user follows the author of the video. | Integer | 105 | | forward | A binary feedback signal indicating the user forwards the video. | Integer | 106 | | comment | A binary feedback signal indicating the user writes a comment in the comment section of the video. | Integer | 107 | 108 | --- 109 | 110 | ### Code 111 | To run the code in this repository, download the data from [Drive](https://rec.ustc.edu.cn/share/883adf20-7e44-11ef-90e2-9beaf2bdc778) and place it in the `data` folder following the organization described above.
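Before launching the full pipelines, it can help to sanity-check the download. The snippet below is a minimal sketch of how the daily files can be loaded; it assumes the full data has been placed under `./data` as organized above (the repository itself only ships small `example.*` placeholders), and `2024-01-13` stands for any day in the covered date range.

```python
import pickle

import pandas as pd  # feather support requires pyarrow, see Requirements below

# Daily samples from the exposure space (one row per exposed video).
realshow_df = pd.read_feather("./data/realshow/2024-01-13.feather")
print(realshow_df.shape)

# Each user's effective_view behavior sequence (length 50).
with open("./data/seq_effective_50_dict/2024-01-13.pkl", "rb") as f:
    seq_dict = pickle.load(f)

# Stage data per request: request_id -> stage label -> videos of that stage.
with open("./data/request_id_dict/2024-01-13.pkl", "rb") as f:
    request_id_dict = pickle.load(f)

# Number of unique IDs per feature field, used to size embedding tables.
with open("./data/others/id_cnt.pkl", "rb") as f:
    id_cnt_dict = pickle.load(f)
print(id_cnt_dict)
```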
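The stage flags and feedback signals in the table combine naturally into training labels. The sketch below illustrates one way to slice a daily all_stage file; it assumes the columns are exposed exactly as named in the table, and the day and filters should be adapted to the experiment at hand.

```python
import pandas as pd

df = pd.read_feather("./data/all_stage/2024-01-13.feather")

# Exposure-space positives/negatives for a CTR-style task:
# exposed videos with vs. without an effective view.
exposed = df[df["realshow"] == 1]
positives = exposed[exposed["effective_view"] == 1]
negatives = exposed[exposed["effective_view"] == 0]

# Stage-filtered negatives, e.g. videos the rank stage scored outside its
# top-10, which can serve as extra negatives for a coarse ranking model.
rank_stage_negatives = df[df["rank_neg"] == 1]

print(len(positives), len(negatives), len(rank_stage_negatives))
```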
112 | 113 | #### Retrieval 114 | 115 | Baseline 116 | ``` 117 | bash ./retrieval/run_sasrec.sh 118 | ``` 119 | 120 | Hard Negative Mining 121 | ``` 122 | bash ./retrieval/run_sasrec_hardnegmining.sh 123 | ``` 124 | 125 | Interplay between Retrieval and Subsequent Stages 126 | ``` 127 | bash ./retrieval/run_sasrec_fsltr.sh 128 | ``` 129 | 130 | 131 | #### Coarse Ranking 132 | 133 | Baseline 134 | ``` 135 | bash ./coarse/run_dssm.sh 136 | ``` 137 | 138 | Data Distribution Shift 139 | ``` 140 | bash ./coarse/run_dssm_data_dist_shift_sampling.sh 141 | bash ./coarse/run_dssm_data_dist_shift_all.sh 142 | ``` 143 | 144 | Interplay between Retrieval and Subsequent Stages 145 | ``` 146 | bash ./coarse/run_dssm_fsltr.sh 147 | ``` 148 | 149 | Auxiliary Ranking 150 | ``` 151 | bash ./coarse/run_dssm_auxiliary_ranking.sh 152 | ``` 153 | 154 | User Behavior Sequence Modeling 155 | ``` 156 | bash ./coarse/run_dssm_ubm.sh 157 | ``` 158 | 159 | 160 | #### Ranking 161 | 162 | Baseline 163 | ``` 164 | bash ./rank/run_din.sh 165 | ``` 166 | 167 | Data Distribution Shift 168 | ``` 169 | bash ./rank/run_din_data_dist_shift_sampling.sh 170 | bash ./rank/run_din_data_dist_shift_all.sh 171 | ``` 172 | 173 | Interplay between Retrieval and Subsequent Stages 174 | ``` 175 | bash ./rank/run_din_fsltr.sh 176 | ``` 177 | 178 | Auxiliary Ranking 179 | ``` 180 | bash ./rank/run_din_auxiliary_ranking.sh 181 | ``` 182 | 183 | User Behavior Sequence Modeling 184 | ``` 185 | bash ./rank/run_din_ubm.sh 186 | ``` 187 | 188 | ### Requirements 189 | ``` 190 | python=3.7 191 | numpy=1.19.2 192 | pandas=1.3.5 193 | pyarrow=8.0.0 194 | scikit-learn=1.0.2 195 | pytorch=1.6 196 | faiss-gpu=1.7.1 197 | ``` -------------------------------------------------------------------------------- /coarse/eval_dssm.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DSSM 8 | from dataset import Prerank_Train_Dataset,Prerank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | return parser.parse_args() 31 | 32 | 33 | if __name__ == '__main__': 34 | args = parse_args() 35 | 36 | for k,v in vars(args).items(): 37 | print(f"{k}:{v}") 38 | 39 | #prepare data 40 | prefix = "../data" 41 | 42 | realshow_prefix = os.path.join(prefix, "realshow") 43 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 44 | print("testing file:") 45 | print(path_to_test_csv) 46 |
47 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 48 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 49 | print("testing seq file:") 50 | print(path_to_test_seq_pkl) 51 | 52 | others_prefix = os.path.join(prefix, "others") 53 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 54 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 55 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 56 | for k,v in id_cnt_dict.items(): 57 | print(f"{k}:{v}") 58 | 59 | path_to_test_pkl = os.path.join(others_prefix, "prerank_test.feather") 60 | print(f"path_to_test_pkl: {path_to_test_pkl}") 61 | 62 | #prepare model 63 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 64 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 65 | print(f"device: {device}") 66 | 67 | model = DSSM(args.emb_dim, args.seq_len, device, id_cnt_dict).to(device) 68 | 69 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.tag}.pkl" 70 | 71 | state_dict = torch.load(path_to_save_model) 72 | model.load_state_dict(state_dict) 73 | 74 | print("testing: realshow") 75 | 76 | test_realshow_dataset = Prerank_Train_Dataset( 77 | path_to_test_csv, 78 | args.seq_len, 79 | path_to_test_seq_pkl, 80 | ) 81 | 82 | test_realshow_loader = DataLoader( 83 | dataset=test_realshow_dataset, 84 | batch_size=args.infer_realshow_batch_size, 85 | shuffle=False, 86 | num_workers=0, 87 | drop_last=True 88 | ) 89 | 90 | print_str = evaluate(model, test_realshow_loader, device) 91 | 92 | print("testing: recall") 93 | 94 | test_recall_dataset = Prerank_Test_Dataset( 95 | path_to_test_pkl, 96 | args.seq_len, 97 | path_to_test_seq_pkl, 98 | max_candidate_cnt=470 99 | ) 100 | 101 | test_recall_loader = DataLoader( 102 | dataset=test_recall_dataset, 103 | batch_size=args.infer_recall_batch_size, 104 | shuffle=False, 105 | num_workers=0, 106 | drop_last=True 107 | ) 108 | 109 | target_print = evaluate_recall(model, test_recall_loader, device) 110 | 111 | print("realshow") 112 | print(print_str) 113 | 114 | print("recall") 115 | print(target_print[0]) 116 | print(target_print[1]) -------------------------------------------------------------------------------- /coarse/eval_dssm_auxiliary_ranking.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DSSM_AuxRanking 8 | from dataset import Prerank_Train_Dataset,Prerank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 |
parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | # flow param 31 | parser.add_argument('--flows', type=str, default="", help='stage flows.') 32 | 33 | parser.add_argument('--rank_loss_weight', type=float, default=1e-2, help='weight of the auxiliary ranking loss.') 34 | 35 | return parser.parse_args() 36 | 37 | 38 | if __name__ == '__main__': 39 | args = parse_args() 40 | 41 | for k,v in vars(args).items(): 42 | print(f"{k}:{v}") 43 | 44 | #prepare data 45 | prefix = "../data" 46 | 47 | realshow_prefix = os.path.join(prefix, "realshow") 48 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 49 | print("testing file:") 50 | print(path_to_test_csv) 51 | 52 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 53 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 54 | print("testing seq file:") 55 | print(path_to_test_seq_pkl) 56 | 57 | others_prefix = os.path.join(prefix, "others") 58 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 59 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 60 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 61 | for k,v in id_cnt_dict.items(): 62 | print(f"{k}:{v}") 63 | 64 | path_to_test_pkl = os.path.join(others_prefix, "prerank_test.feather") 65 | print(f"path_to_test_pkl: {path_to_test_pkl}") 66 | 67 | #prepare model 68 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 69 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 70 | print(f"device: {device}") 71 | 72 | model = DSSM_AuxRanking( 73 | args.emb_dim, args.seq_len, 74 | device, id_cnt_dict 75 | ).to(device) 76 | 77 | path_to_save_model=f"./checkpoints/{args.batch_size}_{args.lr}_{args.flows}_{args.rank_loss_weight}_{args.tag}.pkl" 78 | 79 | state_dict = torch.load(path_to_save_model) 80 | 81 | model.load_state_dict(state_dict) 82 | 83 | print("testing: realshow") 84 | 85 | test_realshow_dataset = Prerank_Train_Dataset( 86 | path_to_test_csv, 87 | args.seq_len, 88 | path_to_test_seq_pkl, 89 | ) 90 | 91 | test_realshow_loader = DataLoader( 92 | dataset=test_realshow_dataset, 93 | batch_size=args.infer_realshow_batch_size, 94 | shuffle=False, 95 | num_workers=0, 96 | drop_last=True 97 | ) 98 | print_str = evaluate(model, test_realshow_loader, device) 99 | 100 | print("testing: recall") 101 | 102 | test_recall_dataset = Prerank_Test_Dataset( 103 | path_to_test_pkl, 104 | args.seq_len, 105 | path_to_test_seq_pkl, 106 | max_candidate_cnt=470 107 | ) 108 | 109 | test_recall_loader = DataLoader( 110 | dataset=test_recall_dataset, 111 | batch_size=args.infer_recall_batch_size, 112 | shuffle=False, 113 | num_workers=0, 114 | drop_last=True 115 | ) 116 | target_print = evaluate_recall(model, test_recall_loader, device) 117 | 118 | print("realshow") 119 | print(print_str) 120 | 121 | print("recall") 122 | print(target_print[0]) 123 | print(target_print[1]) -------------------------------------------------------------------------------- /coarse/eval_dssm_data_dist_shift_all.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DSSM 8 | from dataset import Prerank_Train_Dataset,Prerank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size',
type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='stage flows.') 31 | 32 | return parser.parse_args() 33 | 34 | 35 | if __name__ == '__main__': 36 | args = parse_args() 37 | 38 | for k,v in vars(args).items(): 39 | print(f"{k}:{v}") 40 | 41 | #prepare data 42 | prefix = "../data" 43 | 44 | realshow_prefix = os.path.join(prefix, "realshow") 45 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 46 | print("testing file:") 47 | print(path_to_test_csv) 48 | 49 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 50 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 51 | print("testing seq file:") 52 | print(path_to_test_seq_pkl) 53 | 54 | others_prefix = os.path.join(prefix, "others") 55 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 56 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 57 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 58 | for k,v in id_cnt_dict.items(): 59 | print(f"{k}:{v}") 60 | 61 | path_to_test_pkl = os.path.join(others_prefix, "prerank_test.feather") 62 | print(f"path_to_test_pkl: {path_to_test_pkl}") 63 | 64 | #prepare model 65 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 66 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 67 | print(f"device: {device}") 68 | 69 | model = DSSM(args.emb_dim, args.seq_len, device, id_cnt_dict).to(device) 70 | 71 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}_{args.tag}.pkl" 72 | 73 | state_dict = torch.load(path_to_save_model) 74 | 75 | model.load_state_dict(state_dict) 76 | 77 | print("testing: realshow") 78 | 79 | test_realshow_dataset = Prerank_Train_Dataset( 80 | path_to_test_csv, 81 | args.seq_len, 82 | path_to_test_seq_pkl, 83 | ) 84 | 85 | test_realshow_loader = DataLoader( 86 | dataset=test_realshow_dataset, 87 | batch_size=args.infer_realshow_batch_size, 88 | shuffle=False, 89 | num_workers=0, 90 | drop_last=True 91 | ) 92 | print_str = evaluate(model, test_realshow_loader, device) 93 | 94 | print("testing: recall") 95 | 96 | test_recall_dataset = Prerank_Test_Dataset( 97 | path_to_test_pkl, 98 | args.seq_len, 99 | path_to_test_seq_pkl, 100 | max_candidate_cnt=470 101 | ) 102 | 103 | test_recall_loader = DataLoader( 104 | dataset=test_recall_dataset, 105 | batch_size=args.infer_recall_batch_size, 106 | shuffle=False, 107 | num_workers=0, 108 | drop_last=True 109 | ) 110 | target_print = evaluate_recall(model, test_recall_loader, device) 111 | 112 | print("realshow") 113 | print(print_str) 114 | 115 | print("recall") 116 | print(target_print[0]) 117 | print(target_print[1]) -------------------------------------------------------------------------------- /coarse/eval_dssm_data_dist_shift_sampling.py:
-------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DSSM 8 | from dataset import Prerank_Train_Dataset,Prerank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='stage flows.') 31 | parser.add_argument('--k_flow_negs', type=str, default="", help='number of flow negatives.') 32 | 33 | return parser.parse_args() 34 | 35 | 36 | if __name__ == '__main__': 37 | args = parse_args() 38 | 39 | for k,v in vars(args).items(): 40 | print(f"{k}:{v}") 41 | 42 | #prepare data 43 | prefix = "../data" 44 | 45 | realshow_prefix = os.path.join(prefix, "realshow") 46 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 47 | print("testing file:") 48 | print(path_to_test_csv) 49 | 50 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 51 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 52 | print("testing seq file:") 53 | print(path_to_test_seq_pkl) 54 | 55 | others_prefix = os.path.join(prefix, "others") 56 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 57 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 58 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 59 | for k,v in id_cnt_dict.items(): 60 | print(f"{k}:{v}") 61 | 62 | path_to_test_pkl = os.path.join(others_prefix, "prerank_test.feather") 63 | print(f"path_to_test_pkl: {path_to_test_pkl}") 64 | 65 | #prepare model 66 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 67 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 68 | print(f"device: {device}") 69 | 70 | model = DSSM(args.emb_dim, args.seq_len, device, id_cnt_dict).to(device) 71 | 72 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}-{args.k_flow_negs}_{args.tag}.pkl" 73 | 74 | state_dict = torch.load(path_to_save_model) 75 | 76 | model.load_state_dict(state_dict) 77 | 78 | print("testing: realshow") 79 | 80 | test_realshow_dataset = Prerank_Train_Dataset( 81 | path_to_test_csv, 82 | args.seq_len, 83 | path_to_test_seq_pkl, 84 | ) 85 | 86 | test_realshow_loader = DataLoader( 87 | dataset=test_realshow_dataset, 88 | batch_size=args.infer_realshow_batch_size, 89 | shuffle=False, 90 | num_workers=0, 91 | drop_last=True 92 | ) 93 | print_str = evaluate(model, test_realshow_loader, device) 94 | 95 | print("testing: recall") 96 | 97 | test_recall_dataset =
Prerank_Test_Dataset( 98 | path_to_test_pkl, 99 | args.seq_len, 100 | path_to_test_seq_pkl, 101 | max_candidate_cnt=470 102 | ) 103 | 104 | test_recall_loader = DataLoader( 105 | dataset=test_recall_dataset, 106 | batch_size=args.infer_recall_batch_size, 107 | shuffle=False, 108 | num_workers=0, 109 | drop_last=True 110 | ) 111 | target_print = evaluate_recall(model, test_recall_loader, device) 112 | 113 | print("realshow") 114 | print(print_str) 115 | 116 | print("recall") 117 | print(target_print[0]) 118 | print(target_print[1]) -------------------------------------------------------------------------------- /coarse/eval_dssm_fsltr.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DSSM 8 | from dataset import Prerank_Train_Dataset,Prerank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default='mcd_prerank_neg', help='stage flows.') 31 | 32 | parser.add_argument('--flow_nums', type=str, default="1", help='number of negative samples.') 33 | 34 | parser.add_argument('--flow_weights', type=str, default="1.0", help='weights for each flow.') 35 | 36 | return parser.parse_args() 37 | 38 | 39 | if __name__ == '__main__': 40 | args = parse_args() 41 | 42 | for k,v in vars(args).items(): 43 | print(f"{k}:{v}") 44 | 45 | #prepare data 46 | prefix = "../data" 47 | 48 | realshow_prefix = os.path.join(prefix, "realshow") 49 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 50 | print("testing file:") 51 | print(path_to_test_csv) 52 | 53 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 54 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 55 | print("testing seq file:") 56 | print(path_to_test_seq_pkl) 57 | 58 | others_prefix = os.path.join(prefix, "others") 59 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 60 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 61 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 62 | for k,v in id_cnt_dict.items(): 63 | print(f"{k}:{v}") 64 | 65 | path_to_test_pkl = os.path.join(others_prefix, "prerank_test.feather") 66 | print(f"path_to_test_pkl: {path_to_test_pkl}") 67 | 68 | #prepare model 69 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 70 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 71 | print(f"device: {device}") 72 | 73 | model = DSSM(args.emb_dim,
args.seq_len, device, id_cnt_dict).to(device) 74 | 75 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}_{args.flow_nums}_{args.flow_weights}_{args.tag}.pkl" 76 | 77 | state_dict = torch.load(path_to_save_model) 78 | 79 | model.load_state_dict(state_dict) 80 | 81 | print("testing: realshow") 82 | test_realshow_dataset = Prerank_Train_Dataset( 83 | path_to_test_csv, 84 | args.seq_len, 85 | path_to_test_seq_pkl, 86 | ) 87 | 88 | test_realshow_loader = DataLoader( 89 | dataset=test_realshow_dataset, 90 | batch_size=args.infer_realshow_batch_size, 91 | shuffle=False, 92 | num_workers=0, 93 | drop_last=True 94 | ) 95 | print_str = evaluate(model, test_realshow_loader, device) 96 | 97 | print("testing: recall") 98 | 99 | test_recall_dataset = Prerank_Test_Dataset( 100 | path_to_test_pkl, 101 | args.seq_len, 102 | path_to_test_seq_pkl, 103 | max_candidate_cnt=470 104 | ) 105 | 106 | test_recall_loader = DataLoader( 107 | dataset=test_recall_dataset, 108 | batch_size=args.infer_recall_batch_size, 109 | shuffle=False, 110 | num_workers=0, 111 | drop_last=True 112 | ) 113 | target_print = evaluate_recall(model, test_recall_loader, device) 114 | 115 | print("realshow") 116 | print(print_str) 117 | 118 | print("recall") 119 | print(target_print[0]) 120 | print(target_print[1]) -------------------------------------------------------------------------------- /coarse/eval_dssm_ubm.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DSSM_UBM 8 | from dataset import Prerank_Train_UBM_Dataset,Prerank_Test_UBM_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='stage flows.') 31 | 32 | return parser.parse_args() 33 | 34 | 35 | if __name__ == '__main__': 36 | args = parse_args() 37 | 38 | #prepare data 39 | prefix = "../data" 40 | 41 | realshow_prefix = os.path.join(prefix, "realshow") 42 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 43 | print(f"testing file: {path_to_test_csv}") 44 | 45 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 46 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 47 | print(f"testing seq file: {path_to_test_seq_pkl}") 48 | 49 | request_id_prefix = os.path.join(prefix, "ubm_seq_request_id_dict") 50 | path_to_request_id_pkl = os.path.join(request_id_prefix,
"2024-02-18.pkl") 51 | print(f"testing request_id file: {path_to_request_id_pkl}") 52 | 53 | others_prefix = os.path.join(prefix, "others") 54 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 55 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 56 | 57 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 58 | for k,v in id_cnt_dict.items(): 59 | print(f"{k}:{v}") 60 | 61 | path_to_test_pkl = os.path.join(others_prefix, "prerank_test.feather") 62 | print(f"path_to_test_pkl: {path_to_test_pkl}") 63 | 64 | #prepare model 65 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 66 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 67 | print(f"device: {device}") 68 | 69 | per_flow_seq_len = 10 70 | n_flows = len(args.flows.split(',')) 71 | flow_seq_len = per_flow_seq_len * n_flows 72 | 73 | max_candidate_cnt = 430 74 | 75 | model = DSSM_UBM( 76 | args.emb_dim, 77 | args.seq_len, 78 | device, 79 | per_flow_seq_len,flow_seq_len, 80 | id_cnt_dict 81 | ).to(device) 82 | 83 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_flows-{args.flows}_{args.tag}.pkl" 84 | 85 | state_dict = torch.load(path_to_save_model) 86 | 87 | model.load_state_dict(state_dict) 88 | 89 | print("testing: realshow") 90 | 91 | test_realshow_dataset = Prerank_Train_UBM_Dataset( 92 | path_to_test_csv, 93 | args.seq_len, 94 | path_to_test_seq_pkl, 95 | path_to_request_id_pkl, 96 | args.flows, 97 | per_flow_seq_len 98 | ) 99 | 100 | test_realshow_loader = DataLoader( 101 | dataset=test_realshow_dataset, 102 | batch_size=args.infer_realshow_batch_size, 103 | shuffle=False, 104 | num_workers=0, 105 | drop_last=True 106 | ) 107 | print_str = evaluate(model, test_realshow_loader, device) 108 | 109 | print("testing: recall") 110 | 111 | test_recall_dataset = Prerank_Test_UBM_Dataset( 112 | path_to_test_pkl, 113 | args.seq_len, 114 | path_to_test_seq_pkl, 115 | path_to_request_id_pkl, 116 | args.flows, per_flow_seq_len, 117 | max_candidate_cnt 118 | ) 119 | 120 | test_recall_loader = DataLoader( 121 | dataset=test_recall_dataset, 122 | batch_size=args.infer_recall_batch_size, 123 | shuffle=False, 124 | num_workers=0, 125 | drop_last=True 126 | ) 127 | target_print = evaluate_recall(model, test_recall_loader, device) 128 | 129 | print("realshow") 130 | print(print_str) 131 | 132 | print("recall") 133 | print(target_print[0]) 134 | print(target_print[1]) -------------------------------------------------------------------------------- /coarse/file.txt: -------------------------------------------------------------------------------- 1 | 2024-01-13 2 | 2024-01-14 3 | 2024-01-15 4 | 2024-01-16 5 | 2024-01-17 6 | 2024-01-18 7 | 2024-01-19 8 | 2024-01-20 9 | 2024-01-21 10 | 2024-01-22 11 | 2024-01-23 12 | 2024-01-24 13 | 2024-01-25 14 | 2024-01-26 15 | 2024-01-27 16 | 2024-01-28 17 | 2024-01-29 18 | 2024-01-30 19 | 2024-01-31 20 | 2024-02-01 21 | 2024-02-02 22 | 2024-02-03 23 | 2024-02-04 24 | 2024-02-05 25 | 2024-02-06 26 | 2024-02-07 27 | 2024-02-08 28 | 2024-02-09 29 | 2024-02-10 30 | 2024-02-11 31 | 2024-02-12 32 | 2024-02-13 33 | 2024-02-14 34 | 2024-02-15 35 | 2024-02-16 36 | 2024-02-17 -------------------------------------------------------------------------------- /coarse/metrics.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | from sklearn.metrics import roc_auc_score, log_loss 4 | 5 | def evaluate(model, data_loader, device): 6 | model.eval() 7 | 8 | logits_lst = np.zeros(shape=(962560,), 
dtype=np.float32) 9 | label_lst = np.zeros(shape=(962560,), dtype=np.float32) 10 | 11 | with torch.no_grad(): 12 | 13 | start_index = 0 14 | end_index = 0 15 | 16 | for inputs in data_loader: 17 | 18 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]] 19 | 20 | logits = model(inputs_LongTensor) #b 21 | 22 | logits = torch.sigmoid(logits) 23 | 24 | end_index += inputs[-1].size(0) 25 | 26 | label_lst[start_index:end_index] = inputs[-1].numpy().astype(np.float32) 27 | 28 | logits_lst[start_index:end_index] = logits.cpu().numpy().astype(np.float32) 29 | 30 | start_index = end_index 31 | 32 | test_auc = roc_auc_score(label_lst, logits_lst) 33 | test_logloss = log_loss(label_lst, logits_lst) 34 | 35 | print_str = f"auc\tlogloss: {test_auc:.6f}\t{test_logloss:.6f}" 36 | 37 | return print_str 38 | 39 | 40 | def evaluate_recall(model, data_loader, device): 41 | model.eval() 42 | 43 | target_top_k = [50,100,200] 44 | 45 | total_target_cnt = 0.0 46 | 47 | target_recall_lst = [0.0 for _ in range(len(target_top_k))] 48 | target_ndcg_lst = [0.0 for _ in range(len(target_top_k))] 49 | 50 | with torch.no_grad(): 51 | 52 | for inputs in data_loader: 53 | 54 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-2]] 55 | 56 | logits = model.forward_recall(inputs_LongTensor) #b*500 57 | 58 | logits = logits.cpu().numpy() 59 | 60 | labels = inputs[-2].numpy().astype(np.float32) #b*500 61 | 62 | n_photos = inputs[-1].numpy() #b 63 | 64 | for i in range(n_photos.shape[0]): 65 | 66 | n_photo = n_photos[i] 67 | 68 | logit = logits[i,:n_photo] 69 | label = labels[i,:n_photo] 70 | 71 | logit_descending_index = np.argsort(logit*-1.0) #descending order 72 | logit_descending_rank = np.argsort(logit_descending_index) #descending order 73 | 74 | #target metric 75 | if np.sum(label) > 0 and np.sum(label)!=n_photo: 76 | target_pos_index = np.nonzero(label)[0] 77 | target_pos_rank = logit_descending_rank[target_pos_index] 78 | for i in range(len(target_top_k)): 79 | target_recall_lst[i] += np.sum(target_pos_rank < target_top_k[i]) ... -------------------------------------------------------------------------------- /coarse/run_dssm.py: -------------------------------------------------------------------------------- ... 120 | if iter_step % args.print_freq == 0: 121 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}") 122 | 123 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.tag}.pkl" 124 | 125 | torch.save(model.state_dict(), path_to_save_model) 126 | 127 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /coarse/run_dssm.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=dssm-1st 6 | 7 | python -B -u run_dssm.py \ 8 | --epochs=1 \ 9 | --batch_size=1024 \ 10 | --infer_realshow_batch_size=1024 \ 11 | --infer_recall_batch_size=900 \ 12 | --emb_dim=8 \ 13 | --lr=1e-2 \ 14 | --seq_len=50 \ 15 | --cuda='0' \ 16 | --print_freq=100 \ 17 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${tag}.log" 2>&1 18 | 19 | python -B -u eval_dssm.py \ 20 | --epochs=1 \ 21 | --batch_size=1024 \ 22 | --infer_realshow_batch_size=1024 \ 23 | --infer_recall_batch_size=900 \ 24 | --emb_dim=8 \ 25 | --lr=1e-2 \ 26 | --seq_len=50 \ 27 | --cuda='0' \ 28 | --print_freq=100 \ 29 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /coarse/run_dssm_auxiliary_ranking.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | import torch.nn.functional as F 7
| from torch.utils.data import DataLoader 8 | 9 | from models import DSSM_AuxRanking 10 | from dataset import Prerank_Train_Auxiliary_Ranking_Dataset 11 | 12 | from utils import load_pkl 13 | 14 | def parse_args(): 15 | parser = argparse.ArgumentParser() 16 | 17 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 18 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 19 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 21 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 22 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 23 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 24 | 25 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 26 | 27 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 28 | 29 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 30 | 31 | # flow param 32 | parser.add_argument('--flows', type=str, default="", help='stage flows.') 33 | 34 | parser.add_argument('--rank_loss_weight', type=float, default=1e-2, help='weight of the auxiliary ranking loss.') 35 | 36 | return parser.parse_args() 37 | 38 | 39 | if __name__ == '__main__': 40 | args = parse_args() 41 | 42 | for k,v in vars(args).items(): 43 | print(f"{k}:{v}") 44 | 45 | #prepare data 46 | prefix = "../data" 47 | 48 | realshow_prefix = os.path.join(prefix, "realshow") 49 | path_to_train_csv_lst = [] 50 | with open("./file.txt", mode='r') as f: 51 | lines = f.readlines() 52 | for line in lines: 53 | tmp_csv_path = os.path.join(realshow_prefix, line.strip()+'.feather') 54 | path_to_train_csv_lst.append(tmp_csv_path) 55 | 56 | num_of_train_csv = len(path_to_train_csv_lst) 57 | print("training files:") 58 | print(f"number of train_csv: {num_of_train_csv}") 59 | for idx, filepath in enumerate(path_to_train_csv_lst): 60 | print(f"{idx}: {filepath}") 61 | 62 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 63 | path_to_train_seq_pkl_lst = [] 64 | with open("./file.txt", mode='r') as f: 65 | lines = f.readlines() 66 | for line in lines: 67 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 68 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 69 | 70 | print("training seq files:") 71 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 72 | print(f"{idx}: {filepath}") 73 | 74 | request_id_prefix = os.path.join(prefix, "request_id_dict") 75 | path_to_train_request_pkl_lst = [] 76 | with open("./file.txt", mode='r') as f: 77 | lines = f.readlines() 78 | for line in lines: 79 | tmp_request_pkl_path = os.path.join(request_id_prefix, line.strip()+".pkl") 80 | path_to_train_request_pkl_lst.append(tmp_request_pkl_path) 81 | 82 | print("training request files:") 83 | for idx, filepath in enumerate(path_to_train_request_pkl_lst): 84 | print(f"{idx}: {filepath}") 85 | 86 | others_prefix = os.path.join(prefix, "others") 87 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 88 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 89 | 90 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 91 | for k,v in id_cnt_dict.items(): 92 | print(f"{k}:{v}") 93 | 94 | #prepare model 95 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 96 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 97 |
print(f"device: {device}") 98 | 99 | model = DSSM_AuxRanking(args.emb_dim, args.seq_len, device, id_cnt_dict).to(device) 100 | 101 | loss_fn = nn.BCEWithLogitsLoss().to(device) 102 | 103 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 104 | 105 | padding_num = -2**30 + 1 106 | 107 | n_flows = len(args.flows.split(',')) 108 | n_flow_photos = n_flows * 10 109 | 110 | #training 111 | for epoch in range(args.epochs): 112 | for n_day in range(num_of_train_csv): 113 | 114 | train_dataset = Prerank_Train_Auxiliary_Ranking_Dataset( 115 | path_to_train_csv_lst[n_day], 116 | args.seq_len, 117 | path_to_train_seq_pkl_lst[n_day], 118 | path_to_train_request_pkl_lst[n_day], 119 | args.flows 120 | ) 121 | 122 | train_loader = DataLoader( 123 | dataset=train_dataset, 124 | batch_size=args.batch_size, 125 | shuffle=True, 126 | num_workers=1, 127 | drop_last=True 128 | ) 129 | 130 | for iter_step, inputs in enumerate(train_loader): 131 | 132 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-2]] 133 | 134 | label = torch.FloatTensor(inputs[-1].numpy()).to(device) #b 135 | 136 | logits, flow_logits = model.forward_train(inputs_LongTensor) #b*1,b*p 137 | 138 | loss = loss_fn(logits.squeeze(), label) 139 | 140 | flow_mask = torch.FloatTensor(inputs[-2].numpy()).to(device) # b*p 141 | 142 | flow_logits = torch.where( 143 | flow_mask > 0, 144 | flow_logits, 145 | torch.full_like(flow_logits, fill_value=padding_num) 146 | ) 147 | 148 | logits_repeat = logits.repeat([1,n_flow_photos]) #b*p 149 | 150 | bpr_logits = logits_repeat - flow_logits 151 | 152 | rank_loss = F.binary_cross_entropy_with_logits( 153 | bpr_logits, 154 | torch.ones_like(bpr_logits), 155 | weight=label.unsqueeze(1).repeat([1,n_flow_photos]), 156 | reduction='sum' 157 | ) / label.unsqueeze(1).repeat([1,n_flow_photos]).sum() 158 | 159 | all_loss = loss + args.rank_loss_weight * rank_loss 160 | 161 | optimizer.zero_grad() 162 | 163 | all_loss.backward() 164 | 165 | optimizer.step() 166 | 167 | if iter_step % args.print_freq == 0: 168 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tall_loss:{all_loss.detach().cpu().item():.6f}\tloss:{loss.detach().cpu().item():.6f}\trank_loss:{rank_loss.detach().cpu().item():.6f}") 169 | 170 | path_to_save_model=f"./checkpoints/{args.batch_size}_{args.lr}_{args.flows}_{args.rank_loss_weight}_{args.tag}.pkl" 171 | 172 | torch.save(model.state_dict(), path_to_save_model) 173 | 174 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /coarse/run_dssm_auxiliary_ranking.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=dssm_auxiliary_ranking-1st 6 | 7 | flows=rerank_pos,rerank_neg,rank_pos,rank_neg 8 | 9 | rank_loss_weight=0.1 10 | 11 | python -B -u run_dssm_auxiliary_ranking.py \ 12 | --epochs=1 \ 13 | --batch_size=1024 \ 14 | --infer_realshow_batch_size=1024 \ 15 | --infer_recall_batch_size=900 \ 16 | --emb_dim=8 \ 17 | --lr=1e-2 \ 18 | --seq_len=50 \ 19 | --cuda='0' \ 20 | --print_freq=100 \ 21 | --flows=${flows} \ 22 | --rank_loss_weight=${rank_loss_weight} \ 23 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${flows}_${rank_loss_weight}_${tag}.log" 2>&1 24 | 25 | python -B -u eval_dssm_auxiliary_ranking.py \ 26 | --epochs=1 \ 27 | --batch_size=1024 \ 28 | --infer_realshow_batch_size=1024 \ 29 | --infer_recall_batch_size=900 \ 30 | --emb_dim=8 \ 31 | --lr=1e-2 \ 32 | --seq_len=50 \ 33 | 
--cuda='0' \ 34 | --print_freq=100 \ 35 | --flows=${flows} \ 36 | --rank_loss_weight=${rank_loss_weight} \ 37 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${flows}_${rank_loss_weight}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /coarse/run_dssm_data_dist_shift_all.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | from torch.utils.data import DataLoader 7 | 8 | from models import DSSM 9 | from dataset import Prerank_Train_Data_Dist_Shift_All_Dataset 10 | 11 | from utils import load_pkl 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='stage flows.') 31 | 32 | return parser.parse_args() 33 | 34 | 35 | if __name__ == '__main__': 36 | args = parse_args() 37 | 38 | for k,v in vars(args).items(): 39 | print(f"{k}:{v}") 40 | 41 | #prepare data 42 | prefix = "../data" 43 | 44 | remap_daily_prefix = os.path.join(prefix, "remap_daily") 45 | path_to_train_csv_lst = [] 46 | with open("./file.txt", mode='r') as f: 47 | lines = f.readlines() 48 | for line in lines: 49 | tmp_csv_path = os.path.join(remap_daily_prefix, line.strip()+'.feather') 50 | path_to_train_csv_lst.append(tmp_csv_path) 51 | 52 | num_of_train_csv = len(path_to_train_csv_lst) 53 | print("training files:") 54 | print(f"number of train_csv: {num_of_train_csv}") 55 | for idx, filepath in enumerate(path_to_train_csv_lst): 56 | print(f"{idx}: {filepath}") 57 | 58 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 59 | path_to_train_seq_pkl_lst = [] 60 | with open("./file.txt", mode='r') as f: 61 | lines = f.readlines() 62 | for line in lines: 63 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 64 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 65 | 66 | print("training seq files:") 67 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 68 | print(f"{idx}: {filepath}") 69 | 70 | others_prefix = os.path.join(prefix, "others") 71 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 72 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 73 | 74 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 75 | for k,v in id_cnt_dict.items(): 76 | print(f"{k}:{v}") 77 | 78 | #prepare model 79 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 80 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 81 | print(f"device: {device}") 82 | 83 | model = DSSM(args.emb_dim, args.seq_len, device, id_cnt_dict).to(device) 84 | 85 | loss_fn =
nn.BCEWithLogitsLoss().to(device) 86 | 87 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 88 | 89 | #training 90 | for epoch in range(args.epochs): 91 | for n_day in range(num_of_train_csv): 92 | train_dataset = Prerank_Train_Data_Dist_Shift_All_Dataset( 93 | path_to_train_csv_lst[n_day], 94 | args.seq_len, 95 | path_to_train_seq_pkl_lst[n_day], 96 | args.flows 97 | ) 98 | 99 | train_loader = DataLoader( 100 | dataset=train_dataset, 101 | batch_size=args.batch_size, 102 | shuffle=True, 103 | num_workers=1, 104 | drop_last=True 105 | ) 106 | 107 | for iter_step, inputs in enumerate(train_loader): 108 | 109 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]] 110 | 111 | label = torch.FloatTensor(inputs[-1].numpy()).to(device) #b*k 112 | 113 | logits = model(inputs_LongTensor) #b 114 | 115 | loss = loss_fn(logits, label) 116 | 117 | optimizer.zero_grad() 118 | 119 | loss.backward() 120 | 121 | optimizer.step() 122 | 123 | if iter_step % args.print_freq == 0: 124 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}") 125 | 126 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}_{args.tag}.pkl" 127 | 128 | torch.save(model.state_dict(), path_to_save_model) 129 | 130 | print(f"save model to {path_to_save_model} DONE.") 131 | -------------------------------------------------------------------------------- /coarse/run_dssm_data_dist_shift_all.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=dssm_data_dist_shift_all-1st 6 | 7 | flows=rank_neg 8 | 9 | python -B -u run_dssm_data_dist_shift_all.py \ 10 | --epochs=1 \ 11 | --batch_size=1024 \ 12 | --infer_realshow_batch_size=1024 \ 13 | --infer_recall_batch_size=900 \ 14 | --emb_dim=8 \ 15 | --lr=1e-2 \ 16 | --seq_len=50 \ 17 | --cuda='0' \ 18 | --print_freq=100 \ 19 | --flows=${flows} \ 20 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${flows}_${tag}.log" 2>&1 21 | 22 | python -B -u eval_dssm_data_dist_shift_all.py \ 23 | --epochs=1 \ 24 | --batch_size=1024 \ 25 | --infer_realshow_batch_size=1024 \ 26 | --infer_recall_batch_size=900 \ 27 | --emb_dim=8 \ 28 | --lr=1e-2 \ 29 | --seq_len=50 \ 30 | --cuda='0' \ 31 | --print_freq=100 \ 32 | --flows=${flows} \ 33 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${flows}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /coarse/run_dssm_data_dist_shift_sampling.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | from torch.utils.data import DataLoader 7 | 8 | from models import DSSM 9 | from dataset import Prerank_Train_Data_Dist_Shift_Sampling_Dataset 10 | 11 | from utils import load_pkl 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', 
type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='stage flows.') 31 | parser.add_argument('--k_flow_negs', type=str, default="", help='number of flow negatives.') 32 | 33 | return parser.parse_args() 34 | 35 | 36 | if __name__ == '__main__': 37 | args = parse_args() 38 | 39 | for k,v in vars(args).items(): 40 | print(f"{k}:{v}") 41 | 42 | #prepare data 43 | prefix = "../data" 44 | 45 | all_stage_prefix = os.path.join(prefix, "all_stage") 46 | path_to_train_csv_lst = [] 47 | with open("./file.txt", mode='r') as f: 48 | lines = f.readlines() 49 | for line in lines: 50 | tmp_csv_path = os.path.join(all_stage_prefix, line.strip()+'.feather') 51 | path_to_train_csv_lst.append(tmp_csv_path) 52 | 53 | num_of_train_csv = len(path_to_train_csv_lst) 54 | print("training files:") 55 | print(f"number of train_csv: {num_of_train_csv}") 56 | for idx, filepath in enumerate(path_to_train_csv_lst): 57 | print(f"{idx}: {filepath}") 58 | 59 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 60 | path_to_train_seq_pkl_lst = [] 61 | with open("./file.txt", mode='r') as f: 62 | lines = f.readlines() 63 | for line in lines: 64 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 65 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 66 | 67 | print("training seq files:") 68 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 69 | print(f"{idx}: {filepath}") 70 | 71 | others_prefix = os.path.join(prefix, "others") 72 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 73 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 74 | 75 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 76 | for k,v in id_cnt_dict.items(): 77 | print(f"{k}:{v}") 78 | 79 | #prepare model 80 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 81 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 82 | print(f"device: {device}") 83 | 84 | model = DSSM(args.emb_dim, args.seq_len, device, id_cnt_dict).to(device) 85 | 86 | loss_fn = nn.BCEWithLogitsLoss().to(device) 87 | 88 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 89 | 90 | #training 91 | for epoch in range(args.epochs): 92 | for n_day in range(num_of_train_csv): 93 | train_dataset = Prerank_Train_Data_Dist_Shift_Sampling_Dataset( 94 | path_to_train_csv_lst[n_day], 95 | args.seq_len, 96 | path_to_train_seq_pkl_lst[n_day], 97 | args.flows, 98 | args.k_flow_negs 99 | ) 100 | 101 | train_loader = DataLoader( 102 | dataset=train_dataset, 103 | batch_size=args.batch_size, 104 | shuffle=True, 105 | num_workers=1, 106 | drop_last=True 107 | ) 108 | 109 | for iter_step, inputs in enumerate(train_loader): 110 | 111 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]] 112 | 113 | label = torch.FloatTensor(inputs[-1].numpy()).to(device) #b 114 | 115 | logits = model(inputs_LongTensor) #b 116 | 117 | loss = loss_fn(logits, label) 118 | 119 | optimizer.zero_grad() 120 | 121 | loss.backward() 122 | 123 | optimizer.step() 124 | 125 | if iter_step % args.print_freq == 0: 126 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}") 127 | 128 |
path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}-{args.k_flow_negs}_{args.tag}.pkl" 129 | 130 | torch.save(model.state_dict(), path_to_save_model) 131 | 132 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /coarse/run_dssm_data_dist_shift_sampling.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=dssm_data_dist_shift_sampling-1st 6 | 7 | flows=rank_neg 8 | k_flow_negs=1 9 | 10 | python -B -u run_dssm_data_dist_shift_sampling.py \ 11 | --epochs=1 \ 12 | --batch_size=1024 \ 13 | --infer_realshow_batch_size=1024 \ 14 | --infer_recall_batch_size=900 \ 15 | --emb_dim=8 \ 16 | --lr=1e-2 \ 17 | --seq_len=50 \ 18 | --cuda='0' \ 19 | --print_freq=100 \ 20 | --flows=${flows} \ 21 | --k_flow_negs=${k_flow_negs} \ 22 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${flows}_${k_flow_negs}_${tag}.log" 2>&1 23 | 24 | python -B -u eval_dssm_data_dist_shift_sampling.py \ 25 | --epochs=1 \ 26 | --batch_size=1024 \ 27 | --infer_realshow_batch_size=1024 \ 28 | --infer_recall_batch_size=900 \ 29 | --emb_dim=8 \ 30 | --lr=1e-2 \ 31 | --seq_len=50 \ 32 | --cuda='0' \ 33 | --print_freq=100 \ 34 | --flows=${flows} \ 35 | --k_flow_negs=${k_flow_negs} \ 36 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${flows}_${k_flow_negs}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /coarse/run_dssm_fsltr.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn.functional as F 6 | from torch.utils.data import DataLoader 7 | 8 | from models import DSSM 9 | from dataset import Prerank_Train_FSLTR_Dataset 10 | 11 | from utils import load_pkl 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behaivor sequence') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default='request_prerank_neg', help='model name.') 31 | 32 | parser.add_argument('--flow_nums', type=str, default="1", help='number of negative samples') 33 | 34 | parser.add_argument('--flow_weights', type=str, default="1.0", help='learning rate.') 35 | 36 | return parser.parse_args() 37 | 38 | 39 | if __name__ == '__main__': 40 | args = parse_args() 41 | 42 | for k,v in vars(args).items(): 43 | print(f"{k}:{v}") 44 | 45 | #prepare data 46 | prefix = "../data" 47 | 48 | realshow_prefix = os.path.join(prefix, "all_stage") 49 | path_to_train_csv_lst = [] 50 | with open("./file.txt", mode='r') as f: 51 | lines = f.readlines() 52 | for line 
in lines: 53 | tmp_csv_path = os.path.join(realshow_prefix, line.strip()+'.feather') 54 | path_to_train_csv_lst.append(tmp_csv_path) 55 | 56 | num_of_train_csv = len(path_to_train_csv_lst) 57 | print("training files:") 58 | print(f"number of train_csv: {num_of_train_csv}") 59 | for idx, filepath in enumerate(path_to_train_csv_lst): 60 | print(f"{idx}: {filepath}") 61 | 62 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 63 | path_to_train_seq_pkl_lst = [] 64 | with open("./file.txt", mode='r') as f: 65 | lines = f.readlines() 66 | for line in lines: 67 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 68 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 69 | 70 | print("training seq files:") 71 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 72 | print(f"{idx}: {filepath}") 73 | 74 | request_id_prefix = os.path.join(prefix, "request_id_dict") 75 | path_to_train_request_pkl_lst = [] 76 | with open("./file.txt", mode='r') as f: 77 | lines = f.readlines() 78 | for line in lines: 79 | tmp_request_pkl_path = os.path.join(request_id_prefix, line.strip()+".pkl") 80 | path_to_train_request_pkl_lst.append(tmp_request_pkl_path) 81 | 82 | print("training request files") 83 | for idx, filepath in enumerate(path_to_train_request_pkl_lst): 84 | print(f"{idx}: {filepath}") 85 | 86 | others_prefix = os.path.join(prefix, "others") 87 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 88 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 89 | 90 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 91 | for k,v in id_cnt_dict.items(): 92 | print(f"{k}:{v}") 93 | 94 | #prepare model 95 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 96 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 97 | print(f"device: {device}") 98 | 99 | model = DSSM(args.emb_dim, args.seq_len, device, id_cnt_dict).to(device) 100 | 101 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 102 | 103 | sum_num = 0 104 | weight_lst = [] 105 | 106 | nums = args.flow_nums.split(',') 107 | weights = args.flow_weights.split(',') 108 | 109 | for idx,num in enumerate(nums): 110 | sum_num += int(num) 111 | weight_lst.extend([float(weights[idx])]*int(num)) 112 | 113 | loss_weight_gpu = torch.tensor(weight_lst, dtype=torch.float32, device=device).reshape([1,-1,1]) #1*p*1 114 | 115 | padding_num = -2**30 + 1 116 | 117 | #training 118 | for epoch in range(args.epochs): 119 | for n_day in range(num_of_train_csv): 120 | train_dataset = Prerank_Train_FSLTR_Dataset( 121 | path_to_train_csv_lst[n_day], 122 | args.seq_len, 123 | path_to_train_seq_pkl_lst[n_day], 124 | path_to_train_request_pkl_lst[n_day], 125 | args.flows, 126 | args.flow_nums 127 | ) 128 | 129 | train_loader = DataLoader( 130 | dataset=train_dataset, 131 | batch_size=args.batch_size, 132 | shuffle=True, 133 | num_workers=1, 134 | drop_last=True 135 | ) 136 | 137 | for iter_step, inputs in enumerate(train_loader): 138 | 139 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]] 140 | 141 | logits = model.forward_fsltr(inputs_LongTensor) #b 142 | 143 | priority = torch.FloatTensor(inputs[-1].numpy()).to(device) #b*p 144 | 145 | weight = torch.gt( 146 | priority.unsqueeze(-1), priority.unsqueeze(1) 147 | ) #b*p*p 148 | 149 | logits_diff = logits.unsqueeze(-1) - logits.unsqueeze(1) 150 | 151 | loss = F.binary_cross_entropy_with_logits( 152 | logits_diff, 153 | torch.ones_like(logits_diff), 154 | weight=weight*loss_weight_gpu, 155 | reduction='sum') / weight.sum() 156 | 
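# Shape note: model.forward_fsltr returns one logit per list position, so logits is b*p
# here, matching priority. The loss above is a RankNet-style pairwise objective: torch.gt
# builds a b*p*p mask selecting pairs (i,j) whose stage priority satisfies
# priority_i > priority_j, and BCE-with-logits against an all-ones target penalizes
# softplus(logit_j - logit_i), pushing earlier-stage items above later-stage ones;
# loss_weight_gpu (1*p*1) re-weights each pair by the flow weight of its higher-priority
# side. A tiny self-contained check of the same computation (illustrative values, b=1, p=3):
#
#   import torch
#   import torch.nn.functional as F
#   lg = torch.tensor([[2.0, 1.0, 0.0]])                      # b*p logits
#   pr = torch.tensor([[3.0, 2.0, 1.0]])                      # b*p priorities, higher = earlier stage
#   w = torch.gt(pr.unsqueeze(-1), pr.unsqueeze(1)).float()   # b*p*p valid-pair mask
#   diff = lg.unsqueeze(-1) - lg.unsqueeze(1)                 # b*p*p logit differences
#   loss = F.binary_cross_entropy_with_logits(
#       diff, torch.ones_like(diff), weight=w, reduction='sum') / w.sum()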
157 | optimizer.zero_grad() 158 | 159 | loss.backward() 160 | 161 | optimizer.step() 162 | 163 | if iter_step % args.print_freq == 0: 164 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}") 165 | 166 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}_{args.flow_nums}_{args.flow_weights}_{args.tag}.pkl" 167 | 168 | torch.save(model.state_dict(), path_to_save_model) 169 | 170 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /coarse/run_dssm_fsltr.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=dssm_fsltr-1st 6 | 7 | flows=click,realshow,rerank_pos,rerank_neg,rank_pos,rank_neg,coarse_neg 8 | flow_nums=6,6,10,10,10,10,10 9 | flow_weights=1.0,1.0,1.0,1.0,1.0,1.0,0.0 10 | 11 | python -B -u run_dssm_fsltr.py \ 12 | --epochs=1 \ 13 | --batch_size=1024 \ 14 | --infer_realshow_batch_size=1024 \ 15 | --infer_recall_batch_size=900 \ 16 | --emb_dim=8 \ 17 | --lr=1e-2 \ 18 | --seq_len=50 \ 19 | --cuda='0' \ 20 | --print_freq=100 \ 21 | --flows=${flows} \ 22 | --flow_nums=${flow_nums} \ 23 | --flow_weights=${flow_weights} \ 24 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${flows}_${flow_nums}_${flow_weights}_${tag}.log" 2>&1 25 | 26 | python -B -u eval_dssm_fsltr.py \ 27 | --epochs=1 \ 28 | --batch_size=1024 \ 29 | --infer_realshow_batch_size=1024 \ 30 | --infer_recall_batch_size=900 \ 31 | --emb_dim=8 \ 32 | --lr=1e-2 \ 33 | --seq_len=50 \ 34 | --cuda='0' \ 35 | --print_freq=100 \ 36 | --flows=${flows} \ 37 | --flow_nums=${flow_nums} \ 38 | --flow_weights=${flow_weights} \ 39 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${flows}_${flow_nums}_${flow_weights}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /coarse/run_dssm_ubm.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | from torch.utils.data import DataLoader 7 | 8 | from models import DSSM_UBM 9 | from dataset import Prerank_Train_UBM_Dataset 10 | 11 | from utils import load_pkl 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='comma-separated stage flows used as extra behavior sequences.') 31 | 32 | return parser.parse_args() 33 | 34 | 35 | if __name__ == '__main__': 36 | args = parse_args() 37 | 38 | for k,v in vars(args).items(): 39 | print(f"{k}:{v}") 40 | 41 | #prepare data 42 | prefix = 
"../data" 43 | 44 | realshow_prefix = os.path.join(prefix, "realshow") 45 | path_to_train_csv_lst = [] 46 | with open("./file.txt", mode='r') as f: 47 | lines = f.readlines() 48 | for line in lines: 49 | tmp_csv_path = os.path.join(realshow_prefix, line.strip()+'.feather') 50 | path_to_train_csv_lst.append(tmp_csv_path) 51 | 52 | num_of_train_csv = len(path_to_train_csv_lst) 53 | print("training files:") 54 | print(f"number of train_csv: {num_of_train_csv}") 55 | for idx, filepath in enumerate(path_to_train_csv_lst): 56 | print(f"{idx}: {filepath}") 57 | 58 | 59 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 60 | path_to_train_seq_pkl_lst = [] 61 | with open("./file.txt", mode='r') as f: 62 | lines = f.readlines() 63 | for line in lines: 64 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 65 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 66 | 67 | print("training seq files:") 68 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 69 | print(f"{idx}: {filepath}") 70 | 71 | request_id_prefix = os.path.join(prefix, "ubm_seq_request_id_dict") 72 | path_to_train_request_pkl_lst = [] 73 | with open("./file.txt", mode='r') as f: 74 | lines = f.readlines() 75 | for line in lines: 76 | tmp_request_pkl_path = os.path.join(request_id_prefix, line.strip()+".pkl") 77 | path_to_train_request_pkl_lst.append(tmp_request_pkl_path) 78 | 79 | print("training request files") 80 | for idx, filepath in enumerate(path_to_train_request_pkl_lst): 81 | print(f"{idx}: {filepath}") 82 | 83 | others_prefix = os.path.join(prefix, "others") 84 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 85 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 86 | 87 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 88 | for k,v in id_cnt_dict.items(): 89 | print(f"{k}:{v}") 90 | 91 | #prepare model 92 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 93 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 94 | print(f"device: {device}") 95 | 96 | per_flow_seq_len = 10 97 | n_flows = len(args.flows.split(',')) 98 | flow_seq_len = per_flow_seq_len * n_flows 99 | 100 | max_candidate_cnt = 430 101 | 102 | model = DSSM_UBM( 103 | args.emb_dim, 104 | args.seq_len, 105 | device, 106 | per_flow_seq_len, flow_seq_len, 107 | id_cnt_dict 108 | ).to(device) 109 | 110 | loss_fn = nn.BCEWithLogitsLoss().to(device) 111 | 112 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 113 | 114 | #training 115 | for epoch in range(args.epochs): 116 | for n_day in range(num_of_train_csv): 117 | 118 | train_dataset = Prerank_Train_UBM_Dataset( 119 | path_to_train_csv_lst[n_day], 120 | args.seq_len, 121 | path_to_train_seq_pkl_lst[n_day], 122 | path_to_train_request_pkl_lst[n_day], 123 | args.flows, 124 | per_flow_seq_len 125 | ) 126 | 127 | train_loader = DataLoader( 128 | dataset=train_dataset, 129 | batch_size=args.batch_size, 130 | shuffle=True, 131 | num_workers=1, 132 | drop_last=True 133 | ) 134 | 135 | for iter_step, inputs in enumerate(train_loader): 136 | 137 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]] 138 | 139 | label = torch.FloatTensor(inputs[-1].numpy()).to(device) #b 140 | 141 | logits = model(inputs_LongTensor) #b 142 | 143 | loss = loss_fn(logits, label) 144 | 145 | optimizer.zero_grad() 146 | 147 | loss.backward() 148 | 149 | optimizer.step() 150 | 151 | if iter_step % args.print_freq == 0: 152 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}") 153 | 
154 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_flows-{args.flows}_{args.tag}.pkl" 155 | 156 | torch.save(model.state_dict(), path_to_save_model) 157 | 158 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /coarse/run_dssm_ubm.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=dssm_ubm-1st 6 | 7 | flows=rank_pos 8 | 9 | python -B -u run_dssm_ubm.py \ 10 | --epochs=1 \ 11 | --batch_size=1024 \ 12 | --infer_realshow_batch_size=1024 \ 13 | --infer_recall_batch_size=900 \ 14 | --emb_dim=8 \ 15 | --lr=1e-2 \ 16 | --seq_len=50 \ 17 | --cuda='0' \ 18 | --print_freq=100 \ 19 | --flows=${flows} \ 20 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${flows}_${tag}.log" 2>&1 21 | 22 | python -B -u eval_dssm_ubm.py \ 23 | --epochs=1 \ 24 | --batch_size=1024 \ 25 | --infer_realshow_batch_size=1024 \ 26 | --infer_recall_batch_size=900 \ 27 | --emb_dim=8 \ 28 | --lr=1e-2 \ 29 | --seq_len=50 \ 30 | --cuda='0' \ 31 | --print_freq=100 \ 32 | --flows=${flows} \ 33 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${flows}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /coarse/utils.py: -------------------------------------------------------------------------------- 1 | import pickle as pkl 2 | from collections import defaultdict 3 | 4 | def defaultdict_tuple(): 5 | return defaultdict(tuple) 6 | 7 | def defaultdict_str(): 8 | return defaultdict(str) 9 | 10 | def load_pkl(filename): 11 | with open(filename, 'rb') as f: 12 | return pkl.load(f) -------------------------------------------------------------------------------- /data/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/.DS_Store -------------------------------------------------------------------------------- /data/all_stage/example.feather: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/all_stage/example.feather -------------------------------------------------------------------------------- /data/others/coarse_rank_test.feather: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/others/coarse_rank_test.feather -------------------------------------------------------------------------------- /data/others/id_cnt.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/others/id_cnt.pkl -------------------------------------------------------------------------------- /data/others/rank_test.feather: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/others/rank_test.feather -------------------------------------------------------------------------------- /data/others/realshow_video_info.feather: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/others/realshow_video_info.feather -------------------------------------------------------------------------------- /data/others/retrieval_test.feather: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/others/retrieval_test.feather -------------------------------------------------------------------------------- /data/realshow/example.feather: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/realshow/example.feather -------------------------------------------------------------------------------- /data/request_id_dict/example.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/request_id_dict/example.pkl -------------------------------------------------------------------------------- /data/seq_effective_50_dict/example.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/data/seq_effective_50_dict/example.pkl -------------------------------------------------------------------------------- /rank/eval_din.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DIN 8 | from dataset import Rank_Train_Dataset,Rank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | return parser.parse_args() 31 | 32 | 33 | if __name__ == '__main__': 34 | args = parse_args() 35 | 36 | for k,v in vars(args).items(): 37 | print(f"{k}:{v}") 38 | 39 | #prepare data 40 | prefix = "../data" 41 | 42 | realshow_prefix = os.path.join(prefix, "realshow") 43 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 44 | print(f"testing file: {path_to_test_csv}") 45 | 46 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 47 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 48 | print(f"testing seq file: {path_to_test_seq_pkl}") 49 | 50 | others_prefix = os.path.join(prefix, "others") 51 | 
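# id_cnt.pkl maps each id-type feature field to its vocabulary size; DIN presumably
# uses these counts to size its embedding tables (see models.py), so the same dict
# must be supplied at training and evaluation time for the checkpoint to load cleanly.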
path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 52 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 53 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 54 | for k,v in id_cnt_dict.items(): 55 | print(f"{k}:{v}") 56 | 57 | path_to_test_pkl = os.path.join(others_prefix, "rank_test.feather") 58 | print(f"path_to_test_pkl: {path_to_test_pkl}") 59 | 60 | #prepare model 61 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 62 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 63 | print(f"device: {device}") 64 | 65 | max_candidate_cnt = 430 66 | 67 | model = DIN( 68 | args.emb_dim, 69 | args.seq_len, 70 | device, 71 | max_candidate_cnt, 72 | id_cnt_dict 73 | ).to(device) 74 | 75 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.tag}.pkl" 76 | 77 | state_dict = torch.load(path_to_save_model) 78 | 79 | model.load_state_dict(state_dict) 80 | 81 | print("testing: realshow") 82 | 83 | test_realshow_dataset = Rank_Train_Dataset( 84 | path_to_test_csv, 85 | args.seq_len, 86 | path_to_test_seq_pkl, 87 | ) 88 | 89 | test_realshow_loader = DataLoader( 90 | dataset=test_realshow_dataset, 91 | batch_size=args.infer_realshow_batch_size, 92 | shuffle=False, 93 | num_workers=0, 94 | drop_last=True 95 | ) 96 | print_str = evaluate(model, test_realshow_loader, device) 97 | 98 | print("testing: recall") 99 | 100 | test_recall_dataset = Rank_Test_Dataset( 101 | path_to_test_pkl, 102 | args.seq_len, 103 | path_to_test_seq_pkl, 104 | max_candidate_cnt 105 | ) 106 | 107 | test_recall_loader = DataLoader( 108 | dataset=test_recall_dataset, 109 | batch_size=args.infer_recall_batch_size, 110 | shuffle=False, 111 | num_workers=0, 112 | drop_last=True 113 | ) 114 | 115 | target_print = evaluate_recall(model, test_recall_loader, device) 116 | 117 | print("realshow") 118 | print(print_str) 119 | 120 | print("recall") 121 | print(target_print[0]) 122 | print(target_print[1]) -------------------------------------------------------------------------------- /rank/eval_din_auxiliary_ranking.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DIN_AuxRanking 8 | from dataset import Rank_Train_Dataset,Rank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | # flow param 31 | parser.add_argument('--flows', type=str, default="", help='comma-separated stage flows used for auxiliary ranking.') 32 | 33 | 
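# The auxiliary ranking loss (see run_din_auxiliary_ranking.py) adds a pairwise term
# that pushes the clicked item's logit above the logits of items filtered by the chosen
# flows; rank_loss_weight below scales that term.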
parser.add_argument('--rank_loss_weight', type=float, default=1e-2, help='weight of the auxiliary ranking loss.') 34 | 35 | return parser.parse_args() 36 | 37 | 38 | if __name__ == '__main__': 39 | args = parse_args() 40 | 41 | for k,v in vars(args).items(): 42 | print(f"{k}:{v}") 43 | 44 | #prepare data 45 | prefix = "../data" 46 | 47 | realshow_prefix = os.path.join(prefix, "realshow") 48 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 49 | print(f"testing file: {path_to_test_csv}") 50 | 51 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 52 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 53 | print(f"testing seq file: {path_to_test_seq_pkl}") 54 | 55 | others_prefix = os.path.join(prefix, "others") 56 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 57 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 58 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 59 | for k,v in id_cnt_dict.items(): 60 | print(f"{k}:{v}") 61 | 62 | path_to_test_pkl = os.path.join(others_prefix, "rank_test.feather") 63 | print(f"path_to_test_pkl: {path_to_test_pkl}") 64 | 65 | #prepare model 66 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 67 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 68 | print(f"device: {device}") 69 | 70 | max_candidate_cnt = 430 71 | 72 | model = DIN_AuxRanking( 73 | args.emb_dim, args.seq_len, 74 | device, max_candidate_cnt, id_cnt_dict 75 | ).to(device) 76 | 77 | path_to_save_model=f"./checkpoints/{args.batch_size}_{args.lr}_{args.flows}_{args.rank_loss_weight}_{args.tag}.pkl" 78 | 79 | state_dict = torch.load(path_to_save_model) 80 | 81 | model.load_state_dict(state_dict) 82 | 83 | print("testing: realshow") 84 | 85 | test_realshow_dataset = Rank_Train_Dataset( 86 | path_to_test_csv, 87 | args.seq_len, 88 | path_to_test_seq_pkl, 89 | ) 90 | 91 | test_realshow_loader = DataLoader( 92 | dataset=test_realshow_dataset, 93 | batch_size=args.infer_realshow_batch_size, 94 | shuffle=False, 95 | num_workers=0, 96 | drop_last=True 97 | ) 98 | print_str = evaluate(model, test_realshow_loader, device) 99 | 100 | print("testing: recall") 101 | 102 | test_recall_dataset = Rank_Test_Dataset( 103 | path_to_test_pkl, 104 | args.seq_len, 105 | path_to_test_seq_pkl, 106 | max_candidate_cnt 107 | ) 108 | 109 | test_recall_loader = DataLoader( 110 | dataset=test_recall_dataset, 111 | batch_size=args.infer_recall_batch_size, 112 | shuffle=False, 113 | num_workers=0, 114 | drop_last=True 115 | ) 116 | target_print = evaluate_recall(model, test_recall_loader, device) 117 | 118 | print("realshow") 119 | print(print_str) 120 | 121 | print("recall") 122 | print(target_print[0]) 123 | print(target_print[1]) -------------------------------------------------------------------------------- /rank/eval_din_data_dist_shift_all.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DIN 8 | from dataset import Rank_Train_Dataset,Rank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | 
parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='comma-separated stage flows used as extra training samples.') 31 | 32 | return parser.parse_args() 33 | 34 | 35 | if __name__ == '__main__': 36 | args = parse_args() 37 | 38 | for k,v in vars(args).items(): 39 | print(f"{k}:{v}") 40 | 41 | #prepare data 42 | prefix = "../data" 43 | 44 | realshow_prefix = os.path.join(prefix, "realshow") 45 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 46 | print(f"testing file: {path_to_test_csv}") 47 | 48 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 49 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 50 | print(f"testing seq file: {path_to_test_seq_pkl}") 51 | 52 | others_prefix = os.path.join(prefix, "others") 53 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 54 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 55 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 56 | for k,v in id_cnt_dict.items(): 57 | print(f"{k}:{v}") 58 | 59 | path_to_test_pkl = os.path.join(others_prefix, "rank_test.feather") 60 | print(f"path_to_test_pkl: {path_to_test_pkl}") 61 | 62 | #prepare model 63 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 64 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 65 | print(f"device: {device}") 66 | 67 | max_candidate_cnt = 430 68 | 69 | model = DIN( 70 | args.emb_dim, 71 | args.seq_len, 72 | device, 73 | max_candidate_cnt, id_cnt_dict 74 | ).to(device) 75 | 76 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}_{args.tag}.pkl" 77 | 78 | state_dict = torch.load(path_to_save_model) 79 | 80 | model.load_state_dict(state_dict) 81 | 82 | print("testing: realshow") 83 | 84 | test_realshow_dataset = Rank_Train_Dataset( 85 | path_to_test_csv, 86 | args.seq_len, 87 | path_to_test_seq_pkl, 88 | ) 89 | 90 | test_realshow_loader = DataLoader( 91 | dataset=test_realshow_dataset, 92 | batch_size=args.infer_realshow_batch_size, 93 | shuffle=False, 94 | num_workers=0, 95 | drop_last=True 96 | ) 97 | print_str = evaluate(model, test_realshow_loader, device) 98 | 99 | print("testing: recall") 100 | 101 | test_recall_dataset = Rank_Test_Dataset( 102 | path_to_test_pkl, 103 | args.seq_len, 104 | path_to_test_seq_pkl, 105 | max_candidate_cnt=max_candidate_cnt 106 | ) 107 | 108 | test_recall_loader = DataLoader( 109 | dataset=test_recall_dataset, 110 | batch_size=args.infer_recall_batch_size, 111 | shuffle=False, 112 | num_workers=0, 113 | drop_last=True 114 | ) 115 | target_print = evaluate_recall(model, test_recall_loader, device) 116 | 117 | print("realshow") 118 | print(print_str) 119 | 120 | print("recall") 121 | print(target_print[0]) 122 | print(target_print[1]) -------------------------------------------------------------------------------- /rank/eval_din_data_dist_shift_sampling.py: -------------------------------------------------------------------------------- 1 | import os 2 | 
import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DIN 8 | from dataset import Rank_Train_Dataset,Rank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='comma-separated stage flows to draw negatives from.') 31 | parser.add_argument('--k_flow_negs', type=str, default="", help='comma-separated number of negatives sampled per flow.') 32 | 33 | return parser.parse_args() 34 | 35 | 36 | if __name__ == '__main__': 37 | args = parse_args() 38 | 39 | for k,v in vars(args).items(): 40 | print(f"{k}:{v}") 41 | 42 | #prepare data 43 | prefix = "../data" 44 | 45 | realshow_prefix = os.path.join(prefix, "realshow") 46 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 47 | print(f"testing file: {path_to_test_csv}") 48 | 49 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 50 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 51 | print(f"testing seq file: {path_to_test_seq_pkl}") 52 | 53 | others_prefix = os.path.join(prefix, "others") 54 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 55 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 56 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 57 | for k,v in id_cnt_dict.items(): 58 | print(f"{k}:{v}") 59 | 60 | path_to_test_pkl = os.path.join(others_prefix, "rank_test.feather") 61 | print(f"path_to_test_pkl: {path_to_test_pkl}") 62 | 63 | #prepare model 64 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 65 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 66 | print(f"device: {device}") 67 | 68 | max_candidate_cnt = 430 69 | 70 | model = DIN( 71 | args.emb_dim, 72 | args.seq_len, 73 | device, 74 | max_candidate_cnt, id_cnt_dict 75 | ).to(device) 76 | 77 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}-{args.k_flow_negs}_{args.tag}.pkl" 78 | 79 | state_dict = torch.load(path_to_save_model) 80 | 81 | model.load_state_dict(state_dict) 82 | 83 | print("testing: realshow") 84 | 85 | test_realshow_dataset = Rank_Train_Dataset( 86 | path_to_test_csv, 87 | args.seq_len, 88 | path_to_test_seq_pkl, 89 | ) 90 | 91 | test_realshow_loader = DataLoader( 92 | dataset=test_realshow_dataset, 93 | batch_size=args.infer_realshow_batch_size, 94 | shuffle=False, 95 | num_workers=0, 96 | drop_last=True 97 | ) 98 | print_str = evaluate(model, test_realshow_loader, device) 99 | 100 | print("testing: recall") 101 | 102 | test_recall_dataset = Rank_Test_Dataset( 103 | path_to_test_pkl, 104 
| args.seq_len, 105 | path_to_test_seq_pkl, 106 | max_candidate_cnt=max_candidate_cnt 107 | ) 108 | 109 | test_recall_loader = DataLoader( 110 | dataset=test_recall_dataset, 111 | batch_size=args.infer_recall_batch_size, 112 | shuffle=False, 113 | num_workers=0, 114 | drop_last=True 115 | ) 116 | 117 | target_print = evaluate_recall(model, test_recall_loader, device) 118 | 119 | print("realshow") 120 | print(print_str) 121 | 122 | print("recall") 123 | print(target_print[0]) 124 | print(target_print[1]) -------------------------------------------------------------------------------- /rank/eval_din_fsltr.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DIN 8 | from dataset import Rank_Train_Dataset,Rank_Test_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--click_rank_loss_w', type=float, default=1e-2, help='weight of the click pairwise ranking loss.') 31 | parser.add_argument('--realshow_rank_loss_w', type=float, default=1e-2, help='weight of the realshow pairwise ranking loss.') 32 | parser.add_argument('--rerank_pos_rank_loss_w', type=float, default=1e-2, help='weight of the rerank_pos pairwise ranking loss.') 33 | parser.add_argument('--rank_pos_rank_loss_w', type=float, default=1e-2, help='weight of the rank_pos pairwise ranking loss.') 34 | 35 | return parser.parse_args() 36 | 37 | 38 | if __name__ == '__main__': 39 | args = parse_args() 40 | 41 | for k,v in vars(args).items(): 42 | print(f"{k}:{v}") 43 | 44 | #prepare data 45 | prefix = "../data" 46 | 47 | realshow_prefix = os.path.join(prefix, "realshow") 48 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 49 | print(f"testing file: {path_to_test_csv}") 50 | 51 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 52 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 53 | print(f"testing seq file: {path_to_test_seq_pkl}") 54 | 55 | others_prefix = os.path.join(prefix, "others") 56 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 57 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 58 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 59 | for k,v in id_cnt_dict.items(): 60 | print(f"{k}:{v}") 61 | 62 | path_to_test_pkl = os.path.join(others_prefix, "rank_test.feather") 63 | print(f"path_to_test_pkl: {path_to_test_pkl}") 64 | 65 | #prepare model 66 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 67 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 68 | print(f"device: 
{device}") 69 | 70 | max_candidate_cnt = 430 71 | 72 | model = DIN( 73 | args.emb_dim, 74 | args.seq_len, 75 | device, 76 | max_candidate_cnt, id_cnt_dict 77 | ).to(device) 78 | 79 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.click_rank_loss_w}-{args.realshow_rank_loss_w}-{args.rerank_pos_rank_loss_w}-{args.rank_pos_rank_loss_w}_{args.tag}.pkl" 80 | state_dict = torch.load(path_to_save_model) 81 | model.load_state_dict(state_dict) 82 | 83 | print("testing: realshow") 84 | 85 | test_realshow_dataset = Rank_Train_Dataset( 86 | path_to_test_csv, 87 | args.seq_len, 88 | path_to_test_seq_pkl, 89 | ) 90 | 91 | test_realshow_loader = DataLoader( 92 | dataset=test_realshow_dataset, 93 | batch_size=args.infer_realshow_batch_size, 94 | shuffle=False, 95 | num_workers=0, 96 | drop_last=True 97 | ) 98 | print_str = evaluate(model, test_realshow_loader, device) 99 | 100 | print("testing: recall") 101 | 102 | test_recall_dataset = Rank_Test_Dataset( 103 | path_to_test_pkl, 104 | args.seq_len, 105 | path_to_test_seq_pkl, 106 | max_candidate_cnt=max_candidate_cnt 107 | ) 108 | 109 | test_recall_loader = DataLoader( 110 | dataset=test_recall_dataset, 111 | batch_size=args.infer_recall_batch_size, 112 | shuffle=False, 113 | num_workers=0, 114 | drop_last=True 115 | ) 116 | target_print = evaluate_recall(model, test_recall_loader, device) 117 | 118 | print("realshow") 119 | print(print_str) 120 | 121 | print("recall") 122 | print(target_print[0]) 123 | print(target_print[1]) -------------------------------------------------------------------------------- /rank/eval_din_ubm.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import DIN_UBM 8 | from dataset import Rank_Train_UBM_Dataset,Rank_Test_UBM_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate,evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behaivor sequence') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='exp tag.') 31 | 32 | return parser.parse_args() 33 | 34 | 35 | if __name__ == '__main__': 36 | args = parse_args() 37 | 38 | #prepare data 39 | prefix = "../data" 40 | 41 | realshow_prefix = os.path.join(prefix, "realshow") 42 | path_to_test_csv = os.path.join(realshow_prefix, "2024-02-18.feather") 43 | print(f"testing file: {path_to_test_csv}") 44 | 45 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 46 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 47 | print(f"testing seq 
file: {path_to_test_seq_pkl}") 48 | 49 | request_id_prefix = os.path.join(prefix, "ubm_seq_request_id_dict") 50 | path_to_request_id_pkl = os.path.join(request_id_prefix, "2024-02-18.pkl") 51 | print(f"testing request_id file: {path_to_test_seq_pkl}") 52 | 53 | others_prefix = os.path.join(prefix, "others") 54 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 55 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 56 | 57 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 58 | for k,v in id_cnt_dict.items(): 59 | print(f"{k}:{v}") 60 | 61 | path_to_test_pkl = os.path.join(others_prefix, "rank_test.feather") 62 | print(f"path_to_test_pkl: {path_to_test_pkl}") 63 | 64 | #prepare model 65 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 66 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 67 | print(f"device: {device}") 68 | 69 | per_flow_seq_len = 10 70 | n_flows = len(args.flows.split(',')) 71 | flow_seq_len = per_flow_seq_len * n_flows 72 | 73 | max_candidate_cnt = 430 74 | 75 | model = DIN_UBM( 76 | args.emb_dim, 77 | args.seq_len, 78 | device, 79 | max_candidate_cnt, 80 | per_flow_seq_len,flow_seq_len, 81 | id_cnt_dict 82 | ).to(device) 83 | 84 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}_{args.tag}.pkl" 85 | 86 | state_dict = torch.load(path_to_save_model) 87 | 88 | model.load_state_dict(state_dict) 89 | 90 | print("testing: realshow") 91 | 92 | test_realshow_dataset = Rank_Train_UBM_Dataset( 93 | path_to_test_csv, 94 | args.seq_len, 95 | path_to_test_seq_pkl, 96 | path_to_request_id_pkl, 97 | args.flows, 98 | per_flow_seq_len 99 | ) 100 | 101 | test_realshow_loader = DataLoader( 102 | dataset=test_realshow_dataset, 103 | batch_size=args.infer_realshow_batch_size, 104 | shuffle=False, 105 | num_workers=0, 106 | drop_last=True 107 | ) 108 | 109 | print_str = evaluate(model, test_realshow_loader, device) 110 | 111 | print("testing: recall") 112 | 113 | test_recall_dataset = Rank_Test_UBM_Dataset( 114 | path_to_test_pkl, 115 | args.seq_len, 116 | path_to_test_seq_pkl, 117 | path_to_request_id_pkl, 118 | args.flows, per_flow_seq_len, 119 | max_candidate_cnt 120 | ) 121 | 122 | test_recall_loader = DataLoader( 123 | dataset=test_recall_dataset, 124 | batch_size=args.infer_recall_batch_size, 125 | shuffle=False, 126 | num_workers=0, 127 | drop_last=True 128 | ) 129 | 130 | target_print = evaluate_recall(model, test_recall_loader, device) 131 | 132 | print("realshow") 133 | print(print_str) 134 | 135 | print("recall") 136 | print(target_print[0]) 137 | print(target_print[1]) -------------------------------------------------------------------------------- /rank/file.txt: -------------------------------------------------------------------------------- 1 | 2024-01-13 2 | 2024-01-14 3 | 2024-01-15 4 | 2024-01-16 5 | 2024-01-17 6 | 2024-01-18 7 | 2024-01-19 8 | 2024-01-20 9 | 2024-01-21 10 | 2024-01-22 11 | 2024-01-23 12 | 2024-01-24 13 | 2024-01-25 14 | 2024-01-26 15 | 2024-01-27 16 | 2024-01-28 17 | 2024-01-29 18 | 2024-01-30 19 | 2024-01-31 20 | 2024-02-01 21 | 2024-02-02 22 | 2024-02-03 23 | 2024-02-04 24 | 2024-02-05 25 | 2024-02-06 26 | 2024-02-07 27 | 2024-02-08 28 | 2024-02-09 29 | 2024-02-10 30 | 2024-02-11 31 | 2024-02-12 32 | 2024-02-13 33 | 2024-02-14 34 | 2024-02-15 35 | 2024-02-16 36 | 2024-02-17 -------------------------------------------------------------------------------- /rank/metrics.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as 
np 3 | from sklearn.metrics import roc_auc_score, log_loss 4 | 5 | def evaluate(model, data_loader, device): 6 | model.eval() 7 | 8 | logits_lst = np.zeros(shape=(962560,), dtype=np.float32) 9 | label_lst = np.zeros(shape=(962560,), dtype=np.float32) 10 | 11 | with torch.no_grad(): 12 | start_index = 0 13 | end_index = 0 14 | 15 | for inputs in data_loader: 16 | 17 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]] 18 | 19 | logits = model(inputs_LongTensor) #b 20 | 21 | logits = torch.sigmoid(logits) 22 | 23 | end_index += inputs[-1].size(0) 24 | 25 | label_lst[start_index:end_index] = inputs[-1].numpy().astype(np.float32) 26 | 27 | logits_lst[start_index:end_index] = logits.cpu().numpy().astype(np.float32) 28 | 29 | start_index = end_index 30 | 31 | test_auc = roc_auc_score(label_lst, logits_lst) 32 | test_logloss = log_loss(label_lst, logits_lst) 33 | 34 | print_str = f"Target: auc \t logloss: {test_auc:.6f} \t {test_logloss:.6f}" 35 | 36 | return print_str 37 | 38 | 39 | def evaluate_recall(model, data_loader, device): 40 | model.eval() 41 | 42 | target_top_k = [50,100,200] 43 | 44 | total_target_cnt = 0.0 45 | 46 | target_recall_lst = [0.0 for _ in range(len(target_top_k))] 47 | target_ndcg_lst = [0.0 for _ in range(len(target_top_k))] 48 | 49 | with torch.no_grad(): 50 | 51 | for inputs in data_loader: 52 | 53 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-2]] 54 | 55 | logits = model.forward_recall(inputs_LongTensor) #b*430 56 | logits = logits.cpu().numpy() 57 | 58 | labels = inputs[-2].numpy().astype(np.float32) #b*430 59 | n_photos = inputs[-1].numpy() #b 60 | 61 | for i in range(n_photos.shape[0]): 62 | 63 | n_photo = n_photos[i] 64 | 65 | logit = logits[i,:n_photo] 66 | label = labels[i,:n_photo] 67 | 68 | logit_descending_index = np.argsort(logit*-1.0) #descending order 69 | logit_descending_rank = np.argsort(logit_descending_index) #rank of each item, 0 = highest logit 70 | 71 | #target metric 72 | if np.sum(label) > 0 and np.sum(label)!=n_photo: 73 | target_pos_index = np.nonzero(label)[0] 74 | target_pos_rank = logit_descending_rank[target_pos_index] 75 | 76 | for j in range(len(target_top_k)): 77 | target_recall_lst[j] += np.sum(target_pos_rank < target_top_k[j]) 123 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}") 124 | 125 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.tag}.pkl" 126 | 127 | torch.save(model.state_dict(), path_to_save_model) 128 | 129 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /rank/run_din.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=din-1st 6 | 7 | python -B -u run_din.py \ 8 | --epochs=1 \ 9 | --batch_size=1024 \ 10 | --infer_realshow_batch_size=1024 \ 11 | --infer_recall_batch_size=512 \ 12 | --emb_dim=8 \ 13 | --lr=1e-2 \ 14 | --seq_len=50 \ 15 | --cuda='0' \ 16 | --print_freq=100 \ 17 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${tag}.log" 2>&1 18 | 19 | python -B -u eval_din.py \ 20 | --epochs=1 \ 21 | --batch_size=1024 \ 22 | --infer_realshow_batch_size=1024 \ 23 | --infer_recall_batch_size=512 \ 24 | --emb_dim=8 \ 25 | --lr=1e-2 \ 26 | --seq_len=50 \ 27 | --cuda='0' \ 28 | --print_freq=100 \ 29 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /rank/run_din_auxiliary_ranking.py: 
-------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | import torch.nn.functional as F 7 | from torch.utils.data import DataLoader 8 | 9 | from models import DIN_AuxRanking 10 | from dataset import Rank_Train_Auxiliary_Ranking_Dataset 11 | 12 | from utils import load_pkl 13 | 14 | def parse_args(): 15 | parser = argparse.ArgumentParser() 16 | 17 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 18 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 19 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 21 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 22 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 23 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence') 24 | 25 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 26 | 27 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 28 | 29 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 30 | 31 | # flow param 32 | parser.add_argument('--flows', type=str, default="", help='comma-separated stage flows used for auxiliary ranking.') 33 | 34 | parser.add_argument('--rank_loss_weight', type=float, default=1e-2, help='weight of the auxiliary ranking loss.') 35 | 36 | return parser.parse_args() 37 | 38 | 39 | if __name__ == '__main__': 40 | args = parse_args() 41 | 42 | for k,v in vars(args).items(): 43 | print(f"{k}:{v}") 44 | 45 | #prepare data 46 | prefix = "../data" 47 | 48 | realshow_prefix = os.path.join(prefix, "realshow") 49 | path_to_train_csv_lst = [] 50 | with open("./file.txt", mode='r') as f: 51 | lines = f.readlines() 52 | for line in lines: 53 | tmp_csv_path = os.path.join(realshow_prefix, line.strip()+'.feather') 54 | path_to_train_csv_lst.append(tmp_csv_path) 55 | 56 | num_of_train_csv = len(path_to_train_csv_lst) 57 | print("training files:") 58 | print(f"number of train_csv: {num_of_train_csv}") 59 | for idx, filepath in enumerate(path_to_train_csv_lst): 60 | print(f"{idx}: {filepath}") 61 | 62 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 63 | path_to_train_seq_pkl_lst = [] 64 | with open("./file.txt", mode='r') as f: 65 | lines = f.readlines() 66 | for line in lines: 67 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 68 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 69 | 70 | print("training seq files:") 71 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 72 | print(f"{idx}: {filepath}") 73 | 74 | request_id_prefix = os.path.join(prefix, "request_id_dict") 75 | path_to_train_request_pkl_lst = [] 76 | with open("./file.txt", mode='r') as f: 77 | lines = f.readlines() 78 | for line in lines: 79 | tmp_request_pkl_path = os.path.join(request_id_prefix, line.strip()+".pkl") 80 | path_to_train_request_pkl_lst.append(tmp_request_pkl_path) 81 | 82 | print("training request files") 83 | for idx, filepath in enumerate(path_to_train_request_pkl_lst): 84 | print(f"{idx}: {filepath}") 85 | 86 | others_prefix = os.path.join(prefix, "others") 87 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 88 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 89 | 90 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 91 | for k,v in id_cnt_dict.items(): 
92 | print(f"{k}:{v}") 93 | 94 | #prepare model 95 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 96 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 97 | print(f"device: {device}") 98 | 99 | max_candidate_cnt = 430 100 | 101 | model = DIN_AuxRanking( 102 | args.emb_dim, args.seq_len, 103 | device, max_candidate_cnt, id_cnt_dict 104 | ).to(device) 105 | 106 | loss_fn = nn.BCEWithLogitsLoss().to(device) 107 | 108 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 109 | 110 | padding_num = -2**30 + 1 111 | 112 | k_per_flow = 10 113 | 114 | n_flows = len(args.flows.split(',')) 115 | 116 | n_flow_photos = n_flows * k_per_flow 117 | 118 | #training 119 | for epoch in range(args.epochs): 120 | for n_day in range(num_of_train_csv): 121 | 122 | train_dataset = Rank_Train_Auxiliary_Ranking_Dataset( 123 | path_to_train_csv_lst[n_day], 124 | args.seq_len, 125 | path_to_train_seq_pkl_lst[n_day], 126 | path_to_train_request_pkl_lst[n_day], 127 | args.flows 128 | ) 129 | 130 | train_loader = DataLoader( 131 | dataset=train_dataset, 132 | batch_size=args.batch_size, 133 | shuffle=True, 134 | num_workers=1, 135 | drop_last=True 136 | ) 137 | 138 | for iter_step, inputs in enumerate(train_loader): 139 | 140 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-2]] 141 | 142 | label = torch.FloatTensor(inputs[-1].numpy()).to(device) #b 143 | 144 | logits, aux_logits, flow_logits = model.forward_train(inputs_LongTensor) #b*1,b*1,b*p 145 | 146 | loss = loss_fn(logits.squeeze(), label) 147 | 148 | flow_mask = torch.FloatTensor(inputs[-2].numpy()).to(device) # b*p 149 | 150 | flow_logits = torch.where( 151 | flow_mask > 0, 152 | flow_logits, 153 | torch.full_like(flow_logits, fill_value=padding_num) 154 | ) 155 | 156 | aux_logits_repeat = aux_logits.repeat([1,n_flow_photos]) #b*p 157 | 158 | bpr_logits = aux_logits_repeat - flow_logits 159 | 160 | rank_loss = F.binary_cross_entropy_with_logits( 161 | bpr_logits, 162 | torch.ones_like(bpr_logits), 163 | weight=label.unsqueeze(1).repeat([1,n_flow_photos]), 164 | reduction='sum' 165 | ) / label.unsqueeze(1).repeat([1,n_flow_photos]).sum() 166 | 167 | all_loss = loss + args.rank_loss_weight * rank_loss 168 | 169 | optimizer.zero_grad() 170 | 171 | all_loss.backward() 172 | 173 | optimizer.step() 174 | 175 | if iter_step % args.print_freq == 0: 176 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tall_loss:{all_loss.detach().cpu().item():.6f}\tloss:{loss.detach().cpu().item():.6f}\trank_loss:{rank_loss.detach().cpu().item():.6f}") 177 | 178 | path_to_save_model=f"./checkpoints/{args.batch_size}_{args.lr}_{args.flows}_{args.rank_loss_weight}_{args.tag}.pkl" 179 | 180 | torch.save(model.state_dict(), path_to_save_model) 181 | 182 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /rank/run_din_auxiliary_ranking.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=din_auxiliary_ranking-1st 6 | 7 | flows=rerank_pos,rerank_neg,rank_pos,rank_neg 8 | 9 | rank_loss_weight=0.1 10 | 11 | python -B -u run_din_auxiliary_ranking.py \ 12 | --epochs=1 \ 13 | --batch_size=1024 \ 14 | --infer_realshow_batch_size=1024 \ 15 | --infer_recall_batch_size=512 \ 16 | --emb_dim=8 \ 17 | --lr=1e-2 \ 18 | --seq_len=50 \ 19 | --cuda='0' \ 20 | --print_freq=100 \ 21 | --flows=${flows} \ 22 | --rank_loss_weight=${rank_loss_weight} \ 23 | 
--tag=${tag} > "./logs/bs-1024_lr-1e-2_${flows}_${rank_loss_weight}_${tag}.log" 2>&1 24 | 25 | python -B -u eval_din_auxiliary_ranking.py \ 26 | --epochs=1 \ 27 | --batch_size=1024 \ 28 | --infer_realshow_batch_size=1024 \ 29 | --infer_recall_batch_size=512 \ 30 | --emb_dim=8 \ 31 | --lr=1e-2 \ 32 | --seq_len=50 \ 33 | --cuda='0' \ 34 | --print_freq=100 \ 35 | --flows=${flows} \ 36 | --rank_loss_weight=${rank_loss_weight} \ 37 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${flows}_${rank_loss_weight}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /rank/run_din_data_dist_shift_all.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | from torch.utils.data import DataLoader 7 | 8 | from models import DIN 9 | from dataset import Rank_Train_Data_Dist_Shift_All_Dataset 10 | 11 | from utils import load_pkl 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behaivor sequence') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='exp tag.') 31 | 32 | return parser.parse_args() 33 | 34 | 35 | if __name__ == '__main__': 36 | args = parse_args() 37 | 38 | for k,v in vars(args).items(): 39 | print(f"{k}:{v}") 40 | 41 | #prepare data 42 | prefix = "../data" 43 | 44 | realshow_prefix = os.path.join(prefix, "all_stage") 45 | path_to_train_csv_lst = [] 46 | with open("./file.txt", mode='r') as f: 47 | lines = f.readlines() 48 | for line in lines: 49 | tmp_csv_path = os.path.join(realshow_prefix, line.strip()+'.feather') 50 | path_to_train_csv_lst.append(tmp_csv_path) 51 | 52 | num_of_train_csv = len(path_to_train_csv_lst) 53 | print("training files:") 54 | print(f"number of train_csv: {num_of_train_csv}") 55 | for idx, filepath in enumerate(path_to_train_csv_lst): 56 | print(f"{idx}: {filepath}") 57 | 58 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 59 | path_to_train_seq_pkl_lst = [] 60 | with open("./file.txt", mode='r') as f: 61 | lines = f.readlines() 62 | for line in lines: 63 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 64 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 65 | 66 | print("training seq files:") 67 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 68 | print(f"{idx}: {filepath}") 69 | 70 | others_prefix = os.path.join(prefix, "others") 71 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 72 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 73 | 74 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 75 | for k,v in id_cnt_dict.items(): 76 | 
print(f"{k}:{v}") 77 | 78 | #prepare model 79 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 80 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 81 | print(f"device: {device}") 82 | 83 | max_candidate_cnt = 430 84 | 85 | model = DIN( 86 | args.emb_dim, args.seq_len, 87 | device, max_candidate_cnt, id_cnt_dict 88 | ).to(device) 89 | 90 | loss_fn = nn.BCEWithLogitsLoss().to(device) 91 | 92 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 93 | 94 | #training 95 | for epoch in range(args.epochs): 96 | for n_day in range(num_of_train_csv): 97 | 98 | train_dataset = Rank_Train_Data_Dist_Shift_All_Dataset( 99 | path_to_train_csv_lst[n_day], 100 | args.seq_len, 101 | path_to_train_seq_pkl_lst[n_day], 102 | args.flows 103 | ) 104 | 105 | train_loader = DataLoader( 106 | dataset=train_dataset, 107 | batch_size=args.batch_size, 108 | shuffle=True, 109 | num_workers=1, 110 | drop_last=True 111 | ) 112 | 113 | for iter_step, inputs in enumerate(train_loader): 114 | 115 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]] 116 | 117 | label = torch.FloatTensor(inputs[-1].numpy()).to(device) #b*k 118 | 119 | logits = model(inputs_LongTensor) #b 120 | 121 | loss = loss_fn(logits, label) 122 | 123 | optimizer.zero_grad() 124 | 125 | loss.backward() 126 | 127 | optimizer.step() 128 | 129 | if iter_step % args.print_freq == 0: 130 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}") 131 | 132 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}_{args.tag}.pkl" 133 | 134 | torch.save(model.state_dict(), path_to_save_model) 135 | 136 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /rank/run_din_data_dist_shift_all.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=din_data_dist_shift_all-1st 6 | 7 | flows=rank_neg 8 | 9 | python -B -u run_din_data_dist_shift_all.py \ 10 | --epochs=1 \ 11 | --batch_size=1024 \ 12 | --infer_realshow_batch_size=1024 \ 13 | --infer_recall_batch_size=512 \ 14 | --emb_dim=8 \ 15 | --lr=1e-2 \ 16 | --seq_len=50 \ 17 | --cuda='0' \ 18 | --print_freq=100 \ 19 | --flows=${flows} \ 20 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${flows}_${tag}.log" 2>&1 21 | 22 | python -B -u eval_din_data_dist_shift_all.py \ 23 | --epochs=1 \ 24 | --batch_size=1024 \ 25 | --infer_realshow_batch_size=1024 \ 26 | --infer_recall_batch_size=512 \ 27 | --emb_dim=8 \ 28 | --lr=1e-2 \ 29 | --seq_len=50 \ 30 | --cuda='0' \ 31 | --print_freq=100 \ 32 | --flows=${flows} \ 33 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${flows}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /rank/run_din_data_dist_shift_sampling.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | from torch.utils.data import DataLoader 7 | 8 | from models import DIN 9 | from dataset import Rank_Train_Data_Dist_Shift_Sampling_Dataset 10 | 11 | from utils import load_pkl 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', 
type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='comma-separated stage flows.') 31 | parser.add_argument('--k_flow_negs', type=str, default="", help='comma-separated numbers of flow negatives.') 32 | 33 | return parser.parse_args() 34 | 35 | 36 | if __name__ == '__main__': 37 | args = parse_args() 38 | 39 | for k,v in vars(args).items(): 40 | print(f"{k}:{v}") 41 | 42 | #prepare data 43 | prefix = "../data" 44 | 45 | realshow_prefix = os.path.join(prefix, "all_stage") 46 | path_to_train_csv_lst = [] 47 | with open("./file.txt", mode='r') as f: 48 | lines = f.readlines() 49 | for line in lines: 50 | tmp_csv_path = os.path.join(realshow_prefix, line.strip()+'.feather') 51 | path_to_train_csv_lst.append(tmp_csv_path) 52 | 53 | num_of_train_csv = len(path_to_train_csv_lst) 54 | print("training files:") 55 | print(f"number of train_csv: {num_of_train_csv}") 56 | for idx, filepath in enumerate(path_to_train_csv_lst): 57 | print(f"{idx}: {filepath}") 58 | 59 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 60 | path_to_train_seq_pkl_lst = [] 61 | with open("./file.txt", mode='r') as f: 62 | lines = f.readlines() 63 | for line in lines: 64 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 65 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 66 | 67 | print("training seq files:") 68 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 69 | print(f"{idx}: {filepath}") 70 | 71 | others_prefix = os.path.join(prefix, "others") 72 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 73 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 74 | 75 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 76 | for k,v in id_cnt_dict.items(): 77 | print(f"{k}:{v}") 78 | 79 | #prepare model 80 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 81 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 82 | print(f"device: {device}") 83 | 84 | max_candidate_cnt = 430 85 | 86 | model = DIN( 87 | args.emb_dim, args.seq_len, 88 | device, max_candidate_cnt, id_cnt_dict 89 | ).to(device) 90 | 91 | loss_fn = nn.BCEWithLogitsLoss().to(device) 92 | 93 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 94 | 95 | #training 96 | for epoch in range(args.epochs): 97 | for n_day in range(num_of_train_csv): 98 | train_dataset = Rank_Train_Data_Dist_Shift_Sampling_Dataset( 99 | path_to_train_csv_lst[n_day], 100 | args.seq_len, 101 | path_to_train_seq_pkl_lst[n_day], 102 | args.flows, 103 | args.k_flow_negs 104 | ) 105 | 106 | train_loader = DataLoader( 107 | dataset=train_dataset, 108 | batch_size=args.batch_size, 109 | shuffle=True, 110 | num_workers=1, 111 | drop_last=True 112 | ) 113 | 114 | for iter_step, inputs in enumerate(train_loader): 115 | 116 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]] 117 | 118 | label =
torch.FloatTensor(inputs[-1].numpy()).to(device) #b*k 119 | 120 | logits = model(inputs_LongTensor) #b 121 | 122 | loss = loss_fn(logits, label) 123 | 124 | optimizer.zero_grad() 125 | 126 | loss.backward() 127 | 128 | optimizer.step() 129 | 130 | if iter_step % args.print_freq == 0: 131 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}") 132 | 133 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}-{args.k_flow_negs}_{args.tag}.pkl" 134 | 135 | torch.save(model.state_dict(), path_to_save_model) 136 | 137 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /rank/run_din_data_dist_shift_sampling.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=din_data_dist_shift_sampling-1st 6 | 7 | flows=rank_neg 8 | k_flow_negs=1 9 | 10 | python -B -u run_din_data_dist_shift_sampling.py \ 11 | --epochs=1 \ 12 | --batch_size=1024 \ 13 | --infer_realshow_batch_size=1024 \ 14 | --infer_recall_batch_size=512 \ 15 | --emb_dim=8 \ 16 | --lr=1e-2 \ 17 | --seq_len=50 \ 18 | --cuda='0' \ 19 | --print_freq=100 \ 20 | --flows=${flows} \ 21 | --k_flow_negs=${k_flow_negs} \ 22 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${flows}_${k_flow_negs}_${tag}.log" 2>&1 23 | 24 | python -B -u eval_din_data_dist_shift_sampling.py \ 25 | --epochs=1 \ 26 | --batch_size=1024 \ 27 | --infer_realshow_batch_size=1024 \ 28 | --infer_recall_batch_size=512 \ 29 | --emb_dim=8 \ 30 | --lr=1e-2 \ 31 | --seq_len=50 \ 32 | --cuda='0' \ 33 | --print_freq=100 \ 34 | --flows=${flows} \ 35 | --k_flow_negs=${k_flow_negs} \ 36 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${flows}_${k_flow_negs}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /rank/run_din_fsltr.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | import torch.nn.functional as F 7 | from torch.utils.data import DataLoader 8 | 9 | from models import DIN 10 | from dataset import Rank_Train_FSLTR_Dataset 11 | 12 | from utils import load_pkl 13 | 14 | def parse_args(): 15 | parser = argparse.ArgumentParser() 16 | 17 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 18 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 19 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 21 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 22 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 23 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 24 | 25 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 26 | 27 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 28 | 29 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 30 | 31 | parser.add_argument('--click_rank_loss_w', type=float, default=1e-2, help='weight of click rank loss.') 32 | parser.add_argument('--realshow_rank_loss_w', type=float, default=1e-2, help='weight of realshow rank loss.') 33 | parser.add_argument('--rerank_pos_rank_loss_w', type=float,
default=1e-2, help='weight of rerank_pos rank loss.') 34 | parser.add_argument('--rank_pos_rank_loss_w', type=float, default=1e-2, help='weight of rank_pos rank loss.') 35 | 36 | return parser.parse_args() 37 | 38 | 39 | if __name__ == '__main__': 40 | args = parse_args() 41 | 42 | for k,v in vars(args).items(): 43 | print(f"{k}:{v}") 44 | 45 | #prepare data 46 | prefix = "../data" 47 | 48 | realshow_prefix = os.path.join(prefix, "all_stage") 49 | path_to_train_csv_lst = [] 50 | with open("./file.txt", mode='r') as f: 51 | lines = f.readlines() 52 | for line in lines: 53 | tmp_csv_path = os.path.join(realshow_prefix, line.strip()+'.feather') 54 | path_to_train_csv_lst.append(tmp_csv_path) 55 | 56 | num_of_train_csv = len(path_to_train_csv_lst) 57 | print("training files:") 58 | print(f"number of train_csv: {num_of_train_csv}") 59 | for idx, filepath in enumerate(path_to_train_csv_lst): 60 | print(f"{idx}: {filepath}") 61 | 62 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 63 | path_to_train_seq_pkl_lst = [] 64 | with open("./file.txt", mode='r') as f: 65 | lines = f.readlines() 66 | for line in lines: 67 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 68 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 69 | 70 | print("training seq files:") 71 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 72 | print(f"{idx}: {filepath}") 73 | 74 | request_id_prefix = os.path.join(prefix, "request_id_dict") 75 | path_to_train_request_pkl_lst = [] 76 | with open("./file.txt", mode='r') as f: 77 | lines = f.readlines() 78 | for line in lines: 79 | tmp_request_pkl_path = os.path.join(request_id_prefix, line.strip()+".pkl") 80 | path_to_train_request_pkl_lst.append(tmp_request_pkl_path) 81 | 82 | print("training request files") 83 | for idx, filepath in enumerate(path_to_train_request_pkl_lst): 84 | print(f"{idx}: {filepath}") 85 | 86 | others_prefix = os.path.join(prefix, "others") 87 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 88 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 89 | 90 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 91 | for k,v in id_cnt_dict.items(): 92 | print(f"{k}:{v}") 93 | 94 | #prepare model 95 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 96 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 97 | print(f"device: {device}") 98 | 99 | max_candidate_cnt = 430 100 | 101 | model = DIN( 102 | args.emb_dim, args.seq_len, 103 | device, max_candidate_cnt, id_cnt_dict 104 | ).to(device) 105 | 106 | loss_fn = nn.CrossEntropyLoss(ignore_index=1) 107 | 108 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 109 | 110 | #training 111 | for epoch in range(args.epochs): 112 | for n_day in range(num_of_train_csv): 113 | 114 | train_dataset = Rank_Train_FSLTR_Dataset( 115 | path_to_train_csv_lst[n_day], 116 | args.seq_len, 117 | path_to_train_seq_pkl_lst[n_day], 118 | path_to_train_request_pkl_lst[n_day] 119 | ) 120 | 121 | train_loader = DataLoader( 122 | dataset=train_dataset, 123 | batch_size=args.batch_size, 124 | shuffle=True, 125 | num_workers=1, 126 | drop_last=True 127 | ) 128 | 129 | for iter_step, inputs in enumerate(train_loader): 130 | 131 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-6]] 132 | 133 | click_logits, realshow_logits, \ 134 | rerank_pos_logits, rerank_neg_logits, \ 135 | rank_pos_logits, rank_neg_logits = model.forward_fsltr(inputs_LongTensor) #b 136 | 137 | tmp_logits = torch.cat([realshow_logits,rerank_pos_logits,rerank_neg_logits, rank_pos_logits,
rank_neg_logits], dim=1) 138 | click_bpr_logits = click_logits.unsqueeze(-1) - tmp_logits.unsqueeze(1) #b*6*46 139 | click_label = torch.FloatTensor(inputs[-6].numpy()).to(device).unsqueeze(-1) #b*6*1 140 | click_rank_loss = F.binary_cross_entropy_with_logits( 141 | click_bpr_logits, 142 | torch.ones(click_bpr_logits.size(), dtype=torch.float, device=device), 143 | weight=click_label, 144 | reduction='sum') / (46*torch.sum(click_label)) 145 | 146 | tmp_logits = torch.cat([rerank_pos_logits,rerank_neg_logits, rank_pos_logits, rank_neg_logits], dim=1) 147 | realshow_bpr_logits = realshow_logits.unsqueeze(-1) - tmp_logits.unsqueeze(1) #b*6*40 148 | realshow_label = torch.FloatTensor(inputs[-5].numpy()).to(device).unsqueeze(-1) #b*6*1 149 | realshow_rank_loss = F.binary_cross_entropy_with_logits( 150 | realshow_bpr_logits, 151 | torch.ones(realshow_bpr_logits.size(), dtype=torch.float, device=device), 152 | weight=realshow_label, 153 | reduction='sum') / (40*torch.sum(realshow_label)) 154 | 155 | tmp_logits = torch.cat([rerank_neg_logits, rank_neg_logits], dim=1) 156 | rerank_pos_bpr_logits = rerank_pos_logits.unsqueeze(-1) - tmp_logits.unsqueeze(1) #b*6*20 157 | rerank_pos_label = torch.FloatTensor(inputs[-4].numpy()).to(device).unsqueeze(-1) #b*10*1 158 | rerank_pos_rank_loss = F.binary_cross_entropy_with_logits( 159 | rerank_pos_bpr_logits, 160 | torch.ones(rerank_pos_bpr_logits.size(), dtype=torch.float, device=device), 161 | weight=rerank_pos_label, 162 | reduction='sum') / (20*torch.sum(rerank_pos_label)) 163 | 164 | tmp_logits = torch.cat([rerank_neg_logits, rank_neg_logits], dim=1) 165 | rank_pos_bpr_logits = rank_pos_logits.unsqueeze(-1) - tmp_logits.unsqueeze(1) #b*6*20 166 | rank_pos_label = torch.FloatTensor(inputs[-2].numpy()).to(device).unsqueeze(-1) #b*10*1 167 | rank_pos_rank_loss = F.binary_cross_entropy_with_logits( 168 | rank_pos_bpr_logits, 169 | torch.ones(rank_pos_bpr_logits.size(), dtype=torch.float, device=device), 170 | weight=rank_pos_label, 171 | reduction='sum') / (20*torch.sum(rank_pos_label)) 172 | 173 | loss = click_rank_loss * args.click_rank_loss_w + \ 174 | realshow_rank_loss * args.realshow_rank_loss_w + \ 175 | rerank_pos_rank_loss * args.rerank_pos_rank_loss_w + \ 176 | rank_pos_rank_loss * args.rank_pos_rank_loss_w 177 | 178 | optimizer.zero_grad() 179 | 180 | loss.backward() 181 | 182 | optimizer.step() 183 | 184 | if iter_step % args.print_freq == 0: 185 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f} \tclick_rank_loss:{click_rank_loss.detach().cpu().item():.6f} \trealshow_rank_loss:{realshow_rank_loss.detach().cpu().item():.6f}\trerank_pos_rank_loss:{rerank_pos_rank_loss.detach().cpu().item():.6f}\trank_pos_rank_loss:{rank_pos_rank_loss.detach().cpu().item():.6f}") 186 | 187 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.click_rank_loss_w}-{args.realshow_rank_loss_w}-{args.rerank_pos_rank_loss_w}-{args.rank_pos_rank_loss_w}_{args.tag}.pkl" 188 | 189 | torch.save(model.state_dict(), path_to_save_model) 190 | 191 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /rank/run_din_fsltr.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=din_fsltr-1st 6 | 7 | click_rank_loss_w=1.0 8 | realshow_rank_loss_w=0.5 9 | rerank_pos_rank_loss_w=0.05 10 | rank_pos_rank_loss_w=0.05 11 | 12 | python -B -u run_din_fsltr.py 
\ 13 | --epochs=1 \ 14 | --batch_size=1024 \ 15 | --infer_realshow_batch_size=1024 \ 16 | --infer_recall_batch_size=512 \ 17 | --emb_dim=8 \ 18 | --lr=1e-2 \ 19 | --seq_len=50 \ 20 | --cuda='0' \ 21 | --print_freq=100 \ 22 | --click_rank_loss_w=${click_rank_loss_w} \ 23 | --realshow_rank_loss_w=${realshow_rank_loss_w} \ 24 | --rerank_pos_rank_loss_w=${rerank_pos_rank_loss_w} \ 25 | --rank_pos_rank_loss_w=${rank_pos_rank_loss_w} \ 26 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${click_rank_loss_w}_${realshow_rank_loss_w}_${rerank_pos_rank_loss_w}_${rank_pos_rank_loss_w}_${tag}.log" 2>&1 27 | 28 | python -B -u eval_din_fsltr.py \ 29 | --epochs=1 \ 30 | --batch_size=1024 \ 31 | --infer_realshow_batch_size=1024 \ 32 | --infer_recall_batch_size=512 \ 33 | --emb_dim=8 \ 34 | --lr=1e-2 \ 35 | --seq_len=50 \ 36 | --cuda='0' \ 37 | --print_freq=100 \ 38 | --click_rank_loss_w=${click_rank_loss_w} \ 39 | --realshow_rank_loss_w=${realshow_rank_loss_w} \ 40 | --rerank_pos_rank_loss_w=${rerank_pos_rank_loss_w} \ 41 | --rank_pos_rank_loss_w=${rank_pos_rank_loss_w} \ 42 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${click_rank_loss_w}_${realshow_rank_loss_w}_${rerank_pos_rank_loss_w}_${rank_pos_rank_loss_w}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /rank/run_din_ubm.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | from torch.utils.data import DataLoader 7 | 8 | from models import DIN_UBM 9 | from dataset import Rank_Train_UBM_Dataset 10 | 11 | from utils import load_pkl 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_realshow_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--infer_recall_batch_size', type=int, default=1024, help='inference batch size.') 20 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 21 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 22 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 23 | 24 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 25 | 26 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 27 | 28 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 29 | 30 | parser.add_argument('--flows', type=str, default="", help='comma-separated stage flows.') 31 | 32 | return parser.parse_args() 33 | 34 | 35 | if __name__ == '__main__': 36 | args = parse_args() 37 | 38 | for k,v in vars(args).items(): 39 | print(f"{k}:{v}") 40 | 41 | #prepare data 42 | prefix = "../data" 43 | 44 | realshow_prefix = os.path.join(prefix, "realshow") 45 | path_to_train_csv_lst = [] 46 | with open("./file.txt", mode='r') as f: 47 | lines = f.readlines() 48 | for line in lines: 49 | tmp_csv_path = os.path.join(realshow_prefix, line.strip()+'.feather') 50 | path_to_train_csv_lst.append(tmp_csv_path) 51 | 52 | num_of_train_csv = len(path_to_train_csv_lst) 53 | print("training files:") 54 | print(f"number of train_csv: {num_of_train_csv}") 55 | for idx, filepath in enumerate(path_to_train_csv_lst): 56 | print(f"{idx}: {filepath}") 57 | 58 | 59 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 60 |
path_to_train_seq_pkl_lst = [] 61 | with open("./file.txt", mode='r') as f: 62 | lines = f.readlines() 63 | for line in lines: 64 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 65 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 66 | 67 | print("training seq files:") 68 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 69 | print(f"{idx}: {filepath}") 70 | 71 | request_id_prefix = os.path.join(prefix, "ubm_seq_request_id_dict") 72 | path_to_train_request_pkl_lst = [] 73 | with open("./file.txt", mode='r') as f: 74 | lines = f.readlines() 75 | for line in lines: 76 | tmp_request_pkl_path = os.path.join(request_id_prefix, line.strip()+".pkl") 77 | path_to_train_request_pkl_lst.append(tmp_request_pkl_path) 78 | 79 | print("training request files") 80 | for idx, filepath in enumerate(path_to_train_request_pkl_lst): 81 | print(f"{idx}: {filepath}") 82 | 83 | others_prefix = os.path.join(prefix, "others") 84 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 85 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 86 | 87 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 88 | for k,v in id_cnt_dict.items(): 89 | print(f"{k}:{v}") 90 | 91 | #prepare model 92 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 93 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 94 | print(f"device: {device}") 95 | 96 | per_flow_seq_len = 10 97 | n_flows = len(args.flows.split(',')) 98 | flow_seq_len = per_flow_seq_len * n_flows 99 | 100 | max_candidate_cnt = 430 101 | 102 | model = DIN_UBM( 103 | args.emb_dim, 104 | args.seq_len, 105 | device, 106 | max_candidate_cnt, 107 | per_flow_seq_len, flow_seq_len, 108 | id_cnt_dict 109 | ).to(device) 110 | 111 | loss_fn = nn.BCEWithLogitsLoss().to(device) 112 | 113 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr) 114 | 115 | #training 116 | for epoch in range(args.epochs): 117 | for n_day in range(num_of_train_csv): 118 | 119 | train_dataset = Rank_Train_UBM_Dataset( 120 | path_to_train_csv_lst[n_day], 121 | args.seq_len, 122 | path_to_train_seq_pkl_lst[n_day], 123 | path_to_train_request_pkl_lst[n_day], 124 | args.flows, 125 | per_flow_seq_len 126 | ) 127 | 128 | train_loader = DataLoader( 129 | dataset=train_dataset, 130 | batch_size=args.batch_size, 131 | shuffle=True, 132 | num_workers=1, 133 | drop_last=True 134 | ) 135 | 136 | for iter_step, inputs in enumerate(train_loader): 137 | 138 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]] 139 | 140 | label = torch.FloatTensor(inputs[-1].numpy()).to(device) #b 141 | 142 | logits = model(inputs_LongTensor) #b 143 | 144 | loss = loss_fn(logits, label) 145 | 146 | optimizer.zero_grad() 147 | 148 | loss.backward() 149 | 150 | optimizer.step() 151 | 152 | if iter_step % args.print_freq == 0: 153 | print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}") 154 | 155 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.flows}_{args.tag}.pkl" 156 | 157 | torch.save(model.state_dict(), path_to_save_model) 158 | 159 | print(f"save model to {path_to_save_model} DONE.") -------------------------------------------------------------------------------- /rank/run_din_ubm.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | set -e 3 | set -o pipefail 4 | 5 | tag=din_ubm-1st 6 | 7 | flows=rank_pos 8 | 9 | python -B -u run_din_ubm.py \ 10 | --epochs=1 \ 11 | --batch_size=1024 \ 12 | 
--infer_realshow_batch_size=1024 \ 13 | --infer_recall_batch_size=512 \ 14 | --emb_dim=8 \ 15 | --lr=1e-2 \ 16 | --seq_len=50 \ 17 | --cuda='0' \ 18 | --print_freq=100 \ 19 | --flows=${flows} \ 20 | --tag=${tag} > "./logs/bs-1024_lr-1e-2_${flows}_${tag}.log" 2>&1 21 | 22 | python -B -u eval_din_ubm.py \ 23 | --epochs=1 \ 24 | --batch_size=1024 \ 25 | --infer_realshow_batch_size=1024 \ 26 | --infer_recall_batch_size=512 \ 27 | --emb_dim=8 \ 28 | --lr=1e-2 \ 29 | --seq_len=50 \ 30 | --cuda='0' \ 31 | --print_freq=100 \ 32 | --flows=${flows} \ 33 | --tag=${tag} >> "./logs/bs-1024_lr-1e-2_${flows}_${tag}.log" 2>&1 -------------------------------------------------------------------------------- /rank/utils.py: -------------------------------------------------------------------------------- 1 | import pickle as pkl 2 | from collections import defaultdict 3 | 4 | def defaultdict_tuple(): 5 | return defaultdict(tuple) 6 | 7 | def defaultdict_str(): 8 | return defaultdict(str) 9 | 10 | def load_pkl(filename): 11 | with open(filename, 'rb') as f: 12 | return pkl.load(f) -------------------------------------------------------------------------------- /recflow.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RecFlow-ICLR/RecFlow/6b9eb4f3dd1651788041704c388f9df1a39cf294/recflow.jpg -------------------------------------------------------------------------------- /retrieval/dataset.py: -------------------------------------------------------------------------------- 1 | import os 2 | import gc 3 | import time 4 | 5 | import numpy as np 6 | import pandas as pd 7 | from torch.utils.data import Dataset 8 | 9 | from utils import load_pkl 10 | 11 | class Recall_Train_SASRec_Dataset(Dataset): 12 | def __init__( 13 | self, 14 | path_to_csv, 15 | seq_len, neg_num, 16 | path_to_seq, 17 | video_corpus 18 | ): 19 | t1 = time.time() 20 | 21 | raw_df = pd.read_feather(path_to_csv) 22 | 23 | df = raw_df[raw_df['effective_view']==1][["request_id", "video_id"]] 24 | 25 | self.data = df.to_numpy().copy() 26 | 27 | self.seq_len = seq_len 28 | 29 | self.neg_num = neg_num 30 | 31 | self.today_seq = load_pkl(path_to_seq) 32 | 33 | video_corpus_df = pd.read_feather(video_corpus) 34 | self.video_corpus = video_corpus_df['video_id'].unique().copy() + 1 35 | del video_corpus_df 36 | 37 | self.n_video_corpus = self.video_corpus.shape[0] 38 | 39 | del raw_df 40 | del df 41 | 42 | gc.collect() 43 | 44 | t2 = time.time() 45 | print(f'init data time: {t2-t1}') 46 | 47 | def __len__(self): 48 | return self.data.shape[0] 49 | 50 | def negative_sampling(self, tgt_video, neg_num): 51 | cnt = 0 52 | negs_index = np.random.randint(self.n_video_corpus, size=neg_num) 53 | while tgt_video in self.video_corpus[negs_index]: 54 | negs_index = np.random.randint(self.n_video_corpus, size=neg_num) 55 | cnt += 1 56 | if cnt >= 10: 57 | break 58 | return self.video_corpus[negs_index] 59 | 60 | def __getitem__(self, idx): 61 | request_id = self.data[idx][0] 62 | vid = self.data[idx][1] + 1 63 | 64 | seq_full = self.today_seq[request_id][:,[0,7]].copy() 65 | 66 | seq_mask = (seq_full[:,1] > 0).astype(np.int8) 67 | 68 | seq_len = np.sum(seq_mask) 69 | 70 | seq_arr = seq_full[:,0] 71 | 72 | if seq_len > 0: 73 | seq_arr[-seq_len:] += 1 74 | 75 | neg_vids = self.negative_sampling(vid, self.neg_num) 76 | 77 | return seq_arr, seq_mask, vid, neg_vids 78 | 79 | #public 80 | class Recall_Train_SASRec_HardNegMining_Dataset(Dataset): 81 | def __init__( 82 | self, 83 | path_to_csv, 84 | seq_len, 
neg_num, 85 | path_to_seq, 86 | path_to_request_id_pkl, 87 | video_corpus, 88 | flow_negs, 89 | flow_neg_nums 90 | ): 91 | t1 = time.time() 92 | 93 | self.flow_negs = flow_negs.split(',') 94 | 95 | self.flow_neg_nums = list(map(int, flow_neg_nums.split(","))) 96 | 97 | raw_df = pd.read_feather(path_to_csv) 98 | 99 | df = raw_df[raw_df['effective_view']==1][["request_id", "video_id"]] 100 | 101 | self.data = df.to_numpy().copy() 102 | 103 | self.seq_len = seq_len 104 | 105 | self.neg_num = neg_num 106 | 107 | self.random_neg_nums = self.neg_num - sum(self.flow_neg_nums) 108 | 109 | self.today_seq = load_pkl(path_to_seq) 110 | 111 | video_corpus_df = pd.read_feather(video_corpus) 112 | self.video_corpus = video_corpus_df['video_id'].unique().copy() + 1 113 | del video_corpus_df 114 | 115 | self.n_video_corpus = self.video_corpus.shape[0] 116 | 117 | self.request_dict = load_pkl(path_to_request_id_pkl) 118 | 119 | del df 120 | del raw_df 121 | 122 | gc.collect() 123 | 124 | t2 = time.time() 125 | print(f'init data time: {t2-t1}') 126 | 127 | def __len__(self): 128 | return self.data.shape[0] 129 | 130 | def random_negative_sampling(self, tgt_video, neg_num): 131 | cnt = 0 132 | negs_index = np.random.randint(self.n_video_corpus, size=neg_num) 133 | while tgt_video in self.video_corpus[negs_index]: 134 | negs_index = np.random.randint(self.n_video_corpus, size=neg_num) 135 | cnt += 1 136 | if cnt >= 10: 137 | break 138 | return self.video_corpus[negs_index] 139 | 140 | def flow_negative_sampling(self, tgt_video, request_id): 141 | 142 | flow_neg_lst = [] 143 | 144 | for idx, flow_neg in enumerate(self.flow_negs): 145 | if flow_neg in self.request_dict[request_id]: 146 | flow_arr = self.request_dict[request_id][flow_neg][:,0] + 1 147 | flow_arr_shape = flow_arr.shape[0] 148 | cnt = 0 149 | tmp_neg_arr = flow_arr[np.random.randint(flow_arr_shape,size=self.flow_neg_nums[idx])] 150 | while tgt_video in tmp_neg_arr: 151 | tmp_neg_arr = flow_arr[np.random.randint(flow_arr_shape,size=self.flow_neg_nums[idx])] 152 | cnt += 1 153 | if cnt >= 10: 154 | break 155 | else: 156 | tmp_neg_arr = np.zeros(self.flow_neg_nums[idx], dtype=np.int64) 157 | 158 | flow_neg_lst.extend(tmp_neg_arr) 159 | 160 | return np.reshape(np.concatenate([np.reshape(x,[-1,1]) for x in flow_neg_lst]), [-1]) 161 | 162 | def __getitem__(self, idx): 163 | request_id = self.data[idx][0] 164 | vid = self.data[idx][1] + 1 165 | 166 | # 0: padding, 1: behavior 167 | seq_full = self.today_seq[request_id][:,[0,7]].copy() 168 | 169 | seq_mask = (seq_full[:,1] > 0).astype(np.int8) #50 170 | 171 | seq_len = np.sum(seq_mask) 172 | 173 | seq_arr = seq_full[:,0] 174 | 175 | if seq_len > 0: 176 | seq_arr[-seq_len:] += 1 #50 177 | 178 | #negative sampling 179 | random_neg_vids = self.random_negative_sampling(vid, self.random_neg_nums) 180 | 181 | flow_neg_vids = self.flow_negative_sampling(vid, request_id) 182 | 183 | neg_vids = np.append(random_neg_vids, flow_neg_vids) 184 | 185 | return seq_arr, seq_mask, vid, neg_vids 186 | 187 | #public 188 | class Recall_Train_SASRec_FSLTR_Dataset(Dataset): 189 | def __init__( 190 | self, 191 | path_to_csv, 192 | seq_len, neg_num, 193 | path_to_seq, 194 | path_to_request_id_pkl, 195 | video_corpus, 196 | flow_negs, 197 | flow_neg_nums 198 | ): 199 | t1 = time.time() 200 | 201 | self.priority = { 202 | "click":6, 203 | "realshow":5, 204 | "rerank_pos":4, 205 | "rank_pos":4, 206 | "rerank_neg":3, 207 | "rank_neg":3, 208 | "coarse_neg":2, 209 | "prerank_neg":1 210 | } 211 | 212 | self.flow_negs = 
flow_negs.split(',') 213 | 214 | self.flow_neg_nums = list(map(int, flow_neg_nums.split(","))) 215 | 216 | raw_df = pd.read_feather(path_to_csv) 217 | 218 | df = raw_df[raw_df['effective_view']==1][["request_id", "video_id"]] 219 | 220 | self.data = df.to_numpy().copy() 221 | 222 | self.seq_len = seq_len 223 | 224 | self.neg_num = neg_num 225 | 226 | self.random_neg_nums = self.neg_num - sum(self.flow_neg_nums) 227 | 228 | self.today_seq = load_pkl(path_to_seq) 229 | 230 | video_corpus_df = pd.read_feather(video_corpus) 231 | self.video_corpus = video_corpus_df['video_id'].unique().copy() + 1 232 | del video_corpus_df 233 | 234 | self.n_video_corpus = self.video_corpus.shape[0] 235 | 236 | self.request_dict = load_pkl(path_to_request_id_pkl) 237 | 238 | del df 239 | del raw_df 240 | 241 | gc.collect() 242 | 243 | t2 = time.time() 244 | print(f'init data time: {t2-t1}') 245 | 246 | def __len__(self): 247 | return self.data.shape[0] 248 | 249 | def random_negative_sampling(self, tgt_video, neg_num): 250 | cnt = 0 251 | negs_index = np.random.randint(self.n_video_corpus, size=neg_num) 252 | while tgt_video in self.video_corpus[negs_index]: 253 | negs_index = np.random.randint(self.n_video_corpus, size=neg_num) 254 | cnt += 1 255 | if cnt >= 10: 256 | break 257 | return self.video_corpus[negs_index] 258 | 259 | def flow_negative_sampling(self, tgt_video, request_id): 260 | 261 | flow_neg_lst = [] 262 | flow_neg_priority_lst = [] 263 | 264 | flow_dict = self.request_dict[request_id] 265 | for idx, flow_neg in enumerate(self.flow_negs): 266 | if flow_neg in flow_dict: 267 | flow_arr = flow_dict[flow_neg][:,0] + 1 268 | flow_arr_shape = flow_arr.shape[0] 269 | cnt = 0 270 | tmp_neg_arr = flow_arr[np.random.randint(flow_arr_shape,size=self.flow_neg_nums[idx])] 271 | while tgt_video in tmp_neg_arr: 272 | tmp_neg_arr = flow_arr[np.random.randint(flow_arr_shape,size=self.flow_neg_nums[idx])] 273 | cnt += 1 274 | if cnt >= 10: 275 | break 276 | tmp_priority_arr = np.ones(self.flow_neg_nums[idx], dtype=np.float32)*self.priority[flow_neg] 277 | else: 278 | tmp_neg_arr = np.zeros(self.flow_neg_nums[idx], dtype=np.int64) 279 | tmp_priority_arr = np.zeros(self.flow_neg_nums[idx], dtype=np.float32) 280 | 281 | flow_neg_lst.append(tmp_neg_arr) 282 | flow_neg_priority_lst.append(tmp_priority_arr) 283 | 284 | return np.concatenate(flow_neg_lst), np.concatenate(flow_neg_priority_lst) 285 | 286 | def __getitem__(self, idx): 287 | request_id = self.data[idx][0] 288 | vid = self.data[idx][1] + 1 289 | pos_priority = np.ones(1, dtype=np.float32) * self.priority['click'] 290 | 291 | # 0: padding, 1: behavior 292 | seq_full = self.today_seq[request_id][:,[0,7]].copy() 293 | 294 | seq_mask = (seq_full[:,1] > 0).astype(np.int8) #50 295 | 296 | seq_len = np.sum(seq_mask) 297 | 298 | seq_arr = seq_full[:,0] 299 | 300 | if seq_len > 0: 301 | seq_arr[-seq_len:] += 1 #50 302 | 303 | #negative sampling 304 | random_neg_vids = self.random_negative_sampling(vid, self.random_neg_nums) 305 | random_neg_priority = np.zeros(self.random_neg_nums, dtype=np.float32) 306 | 307 | flow_neg_vids,flow_neg_priority = self.flow_negative_sampling(vid, request_id) 308 | 309 | vids = np.concatenate([np.atleast_1d(vid),random_neg_vids, flow_neg_vids]) 310 | prioritys = np.concatenate([pos_priority,random_neg_priority,flow_neg_priority]) 311 | 312 | return seq_arr, seq_mask, vids, prioritys 313 | 314 | #public 315 | class Recall_Test_SASRec_Recall_Dataset(Dataset): 316 | def __init__( 317 | self, 318 | path_to_test_feather, 319 | seq_len, 320 | 
path_to_seq, 321 | max_candidate_cnt=30 322 | ): 323 | t1 = time.time() 324 | 325 | raw_df = pd.read_feather(path_to_test_feather) 326 | 327 | data = raw_df[["request_id", "video_id", "effective_view"]] 328 | 329 | self.request_ids = data['request_id'].unique() 330 | 331 | self.seq_len = seq_len 332 | 333 | self.today_seq = load_pkl(path_to_seq) 334 | 335 | self.max_candidate_cnt = max_candidate_cnt 336 | 337 | self.data_group = data.copy().groupby('request_id') 338 | 339 | del data 340 | del raw_df 341 | 342 | gc.collect() 343 | 344 | t2 = time.time() 345 | print(f'init data time: {t2-t1}') 346 | 347 | def __len__(self): 348 | return len(self.request_ids) 349 | 350 | def __getitem__(self, idx): 351 | request_id = self.request_ids[idx] 352 | 353 | request_id_df = self.data_group.get_group(request_id)[["video_id","effective_view"]] 354 | 355 | request_id_arr = request_id_df.to_numpy().copy() 356 | 357 | n_video = request_id_arr.shape[0] 358 | 359 | n_complent = self.max_candidate_cnt - n_video 360 | 361 | complent_arr = np.zeros(shape=(n_complent,2), dtype=np.int64) 362 | 363 | request_id_arr = np.concatenate([request_id_arr, complent_arr], axis=0) 364 | 365 | vid = request_id_arr[:,0] + 1 366 | 367 | seq_full = self.today_seq[request_id][:,[0,7]].copy() 368 | 369 | seq_mask = (seq_full[:,1] > 0).astype(np.int8) 370 | 371 | seq_len = np.sum(seq_mask) 372 | 373 | seq_arr = seq_full[:,0] 374 | 375 | if seq_len > 0: 376 | seq_arr[-seq_len:] += 1 377 | 378 | effective = request_id_arr[:,1] 379 | 380 | return seq_arr, seq_mask, vid, effective, n_video -------------------------------------------------------------------------------- /retrieval/eval_sasrec.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import SASRec 8 | from dataset import Recall_Test_SASRec_Recall_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 20 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 21 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 22 | 23 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 24 | 25 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 26 | 27 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 28 | 29 | parser.add_argument('--neg_num', type=int, default=50, help='number of negative samples') 30 | 31 | return parser.parse_args() 32 | 33 | 34 | if __name__ == '__main__': 35 | args = parse_args() 36 | 37 | #prepare data 38 | prefix = "../data" 39 | 40 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 41 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 42 | print(f"testing seq file: {path_to_test_seq_pkl}") 43 | 44 | others_prefix = os.path.join(prefix, "others") 45 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 46 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 47 | 48 | id_cnt_dict =
load_pkl(path_to_id_cnt_pkl) 49 | for k,v in id_cnt_dict.items(): 50 | print(f"{k}:{v}") 51 | 52 | #prepare negatives 53 | path_to_realshow_video_corpus_feather = os.path.join(others_prefix, "realshow_video_info.feather") 54 | print(f"path_to_realshow_video_corpus_feather: {path_to_realshow_video_corpus_feather}") 55 | 56 | #prepare retrieval_test 57 | path_to_recall_test_feather = os.path.join(others_prefix, "retrieval_test.feather") 58 | print(f"path_to_recall_test_feather: {path_to_recall_test_feather}") 59 | 60 | #prepare model 61 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 62 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 63 | print(f"device: {device}") 64 | 65 | model = SASRec(args.emb_dim, args.seq_len, args.neg_num, device, id_cnt_dict).to(device) 66 | 67 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.neg_num}_{args.tag}.pkl" 68 | 69 | state_dict = torch.load(path_to_save_model) 70 | 71 | model.load_state_dict(state_dict) 72 | 73 | print("testing: recall") 74 | 75 | test_recall_dataset = Recall_Test_SASRec_Recall_Dataset( 76 | path_to_recall_test_feather, 77 | args.seq_len, 78 | path_to_test_seq_pkl, 79 | max_candidate_cnt=30 80 | ) 81 | 82 | test_recall_loader = DataLoader( 83 | dataset=test_recall_dataset, 84 | batch_size=args.infer_batch_size, 85 | shuffle=False, 86 | num_workers=0, 87 | drop_last=True 88 | ) 89 | 90 | target_print = evaluate_recall( 91 | model, 92 | test_recall_loader, 93 | device, 94 | path_to_realshow_video_corpus_feather 95 | ) 96 | 97 | print(target_print[0]) 98 | print(target_print[1]) -------------------------------------------------------------------------------- /retrieval/eval_sasrec_fsltr.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import SASRec 8 | from dataset import Recall_Test_SASRec_Recall_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 20 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 21 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 22 | 23 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 24 | 25 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 26 | 27 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 28 | 29 | parser.add_argument('--neg_num', type=int, default=3, help='number of negative samples') 30 | 31 | parser.add_argument('--flow_negs', type=str, default='mcd_prerank_neg', help='comma-separated flow negative stages.') 32 | 33 | parser.add_argument('--flow_neg_nums', type=str, default='3', help='comma-separated numbers of flow negatives.') 34 | 35 | return parser.parse_args() 36 | 37 | 38 | if __name__ == '__main__': 39 | args = parse_args() 40 | 41 | #prepare data 42 | prefix = "../data" 43 | 44 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 45 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 46 | print(f"testing seq file:
{path_to_test_seq_pkl}") 47 | 48 | others_prefix = os.path.join(prefix, "others") 49 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 50 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 51 | 52 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 53 | for k,v in id_cnt_dict.items(): 54 | print(f"{k}:{v}") 55 | 56 | #prepare negatives 57 | path_to_realshow_video_corpus_feather = os.path.join(others_prefix, "realshow_video_info.feather") 58 | print(f"path_to_realshow_video_corpus_feather: {path_to_realshow_video_corpus_feather}") 59 | 60 | #prepare retrieval_test 61 | path_to_recall_test_feather = os.path.join(others_prefix, "retrieval_test.feather") 62 | print(f"path_to_recall_test_feather: {path_to_recall_test_feather}") 63 | 64 | #prepare model 65 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 66 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 67 | print(f"device: {device}") 68 | 69 | model = SASRec( 70 | args.emb_dim, args.seq_len, 71 | args.neg_num, 72 | device, id_cnt_dict 73 | ).to(device) 74 | 75 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.neg_num}_{args.flow_negs}_{args.flow_neg_nums}_{args.tag}.pkl" 76 | 77 | state_dict = torch.load(path_to_save_model) 78 | 79 | model.load_state_dict(state_dict) 80 | 81 | print("testing: recall") 82 | 83 | test_recall_dataset = Recall_Test_SASRec_Recall_Dataset( 84 | path_to_recall_test_feather, 85 | args.seq_len, 86 | path_to_test_seq_pkl, 87 | max_candidate_cnt=30 88 | ) 89 | 90 | test_recall_loader = DataLoader( 91 | dataset=test_recall_dataset, 92 | batch_size=args.infer_batch_size, 93 | shuffle=False, 94 | num_workers=0, 95 | drop_last=True 96 | ) 97 | 98 | target_print = evaluate_recall( 99 | model, 100 | test_recall_loader, 101 | device, 102 | path_to_realshow_video_corpus_feather 103 | ) 104 | 105 | print(target_print[0]) 106 | print(target_print[1]) -------------------------------------------------------------------------------- /retrieval/eval_sasrec_hardnegmining.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | from torch.utils.data import DataLoader 6 | 7 | from models import SASRec 8 | from dataset import Recall_Test_SASRec_Recall_Dataset 9 | 10 | from utils import load_pkl 11 | from metrics import evaluate_recall 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 20 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 21 | parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.') 22 | 23 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 24 | 25 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 26 | 27 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 28 | 29 | parser.add_argument('--neg_num', type=int, default=3, help='number of negative samples') 30 | 31 | parser.add_argument('--flow_negs', type=str, default='mcd_prerank_neg', help='comma-separated flow negative stages.') 32 | 33 | parser.add_argument('--flow_neg_nums', type=str, default='3', help='comma-separated numbers of flow negatives.') 34 | 35 | return
parser.parse_args() 36 | 37 | 38 | if __name__ == '__main__': 39 | args = parse_args() 40 | 41 | #prepare data 42 | prefix = "../data" 43 | 44 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 45 | path_to_test_seq_pkl = os.path.join(seq_prefix, "2024-02-18.pkl") 46 | print(f"testing seq file: {path_to_test_seq_pkl}") 47 | 48 | others_prefix = os.path.join(prefix, "others") 49 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 50 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 51 | 52 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 53 | for k,v in id_cnt_dict.items(): 54 | print(f"{k}:{v}") 55 | 56 | #prepare negatives 57 | path_to_realshow_video_corpus_feather = os.path.join(others_prefix, "realshow_video_info.feather") 58 | print(f"path_to_realshow_video_corpus_feather: {path_to_realshow_video_corpus_feather}") 59 | 60 | #prepare retrieval_test 61 | path_to_recall_test_feather = os.path.join(others_prefix, "retrieval_test.feather") 62 | print(f"path_to_recall_test_feather: {path_to_recall_test_feather}") 63 | 64 | #prepare model 65 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 66 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 67 | print(f"device: {device}") 68 | 69 | model = SASRec( 70 | args.emb_dim, args.seq_len, 71 | args.neg_num, 72 | device, id_cnt_dict 73 | ).to(device) 74 | 75 | path_to_save_model=f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_neg_num-{args.neg_num}_flow_negs-{args.flow_negs}_flow_neg_nums-{args.flow_neg_nums}_{args.tag}.pkl" 76 | 77 | state_dict = torch.load(path_to_save_model) 78 | 79 | model.load_state_dict(state_dict) 80 | 81 | print("testing: recall") 82 | 83 | test_recall_dataset = Recall_Test_SASRec_Recall_Dataset( 84 | path_to_recall_test_feather, 85 | args.seq_len, 86 | path_to_test_seq_pkl, 87 | max_candidate_cnt=30 88 | ) 89 | 90 | test_recall_loader = DataLoader( 91 | dataset=test_recall_dataset, 92 | batch_size=args.infer_batch_size, 93 | shuffle=False, 94 | num_workers=0, 95 | drop_last=True 96 | ) 97 | 98 | target_print = evaluate_recall( 99 | model, 100 | test_recall_loader, 101 | device, 102 | path_to_realshow_video_corpus_feather 103 | ) 104 | 105 | print(target_print[0]) 106 | print(target_print[1]) -------------------------------------------------------------------------------- /retrieval/file.txt: -------------------------------------------------------------------------------- 1 | 2024-01-13 2 | 2024-01-14 3 | 2024-01-15 4 | 2024-01-16 5 | 2024-01-17 6 | 2024-01-18 7 | 2024-01-19 8 | 2024-01-20 9 | 2024-01-21 10 | 2024-01-22 11 | 2024-01-23 12 | 2024-01-24 13 | 2024-01-25 14 | 2024-01-26 15 | 2024-01-27 16 | 2024-01-28 17 | 2024-01-29 18 | 2024-01-30 19 | 2024-01-31 20 | 2024-02-01 21 | 2024-02-02 22 | 2024-02-03 23 | 2024-02-04 24 | 2024-02-05 25 | 2024-02-06 26 | 2024-02-07 27 | 2024-02-08 28 | 2024-02-09 29 | 2024-02-10 30 | 2024-02-11 31 | 2024-02-12 32 | 2024-02-13 33 | 2024-02-14 34 | 2024-02-15 35 | 2024-02-16 36 | 2024-02-17 -------------------------------------------------------------------------------- /retrieval/metrics.py: -------------------------------------------------------------------------------- 1 | import gc 2 | import faiss 3 | 4 | import torch 5 | import numpy as np 6 | import pandas as pd 7 | 8 | def evaluate_recall( 9 | model, data_loader, device, 10 | path_to_realshow_video_corpus_feather 11 | ): 12 | 13 | model.eval() 14 | 15 | realshow_video_corpus_df = pd.read_feather(path_to_realshow_video_corpus_feather) 16 | 17 | realshow_video_corpus =
realshow_video_corpus_df['video_id'].unique().copy() + 1 18 | 19 | del realshow_video_corpus_df 20 | 21 | gc.collect() 22 | 23 | target_top_k = [50, 100, 500, 1000] 24 | 25 | total_target_cnt = 0.0 26 | 27 | target_realshow_recall_lst = [0.0 for _ in range(len(target_top_k))] 28 | target_realshow_ndcg_lst = [0.0 for _ in range(len(target_top_k))] 29 | 30 | with torch.no_grad(): 31 | 32 | #construct realshow embedding 33 | realshow_faiss_obj = faiss.StandardGpuResources() 34 | realshow_flat_config = faiss.GpuIndexFlatConfig() 35 | realshow_flat_config.device = 0 36 | realshow_index_flat = faiss.GpuIndexFlatIP(realshow_faiss_obj, 8, realshow_flat_config) #index dim 8 must match emb_dim 37 | realshow_index_flat.add(model.vid_emb.weight.cpu().numpy()[realshow_video_corpus]) 38 | 39 | for idx,inputs in enumerate(data_loader): 40 | 41 | inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-3]] 42 | 43 | user_emb = model.forward_recall(inputs_LongTensor) #b*d 44 | 45 | _, topk_realshow_logits_index = realshow_index_flat.search(user_emb.cpu().numpy(), k=1000) 46 | 47 | topk_realshow_videos = realshow_video_corpus[topk_realshow_logits_index] #b*k 48 | 49 | vids = inputs[-3].numpy().astype(np.int64) #b*30 50 | labels = inputs[-2].numpy().astype(np.float32) #b*30 51 | 52 | n_videos = inputs[-1].numpy() #b 53 | 54 | for i in range(n_videos.shape[0]): 55 | 56 | n_video = n_videos[i] 57 | 58 | topk_realshow_video = topk_realshow_videos[i] 59 | 60 | vid = vids[i,:n_video] 61 | label = labels[i,:n_video] 62 | 63 | #target metric 64 | if np.sum(label) > 0: 65 | target_pos_index = np.nonzero(label)[0] 66 | target_pos_vid = vid[target_pos_index] 67 | 68 | target_pos_realshow_rank = np.where(topk_realshow_video == target_pos_vid[:,None])[1] 69 | if target_pos_realshow_rank.shape[0] > 0: 70 | for i in range(len(target_top_k)): 71 | target_realshow_recall_lst[i] += np.sum(target_pos_realshow_rank < target_top_k[i]) -------------------------------------------------------------------------------- /retrieval/models.py: -------------------------------------------------------------------------------- 66 | 67 | neg_logits = neg_logits.view(-1) #bn 68 | 69 | return tgt_logits, neg_logits 70 | 71 | 72 | def forward_fsltr(self, inputs): 73 | seq, seq_mask, vids = inputs 74 | 75 | seq_emb = self.vid_emb(seq) #b*t*d 76 | vids_emb = self.vid_emb(vids) #b*d 77 | 78 | position_emb = self.position(torch.arange(self.seq_len, dtype=torch.int64, device=self.device)) #t*d 79 | 80 | seq_emb = seq_emb + position_emb #b*t*d 81 | 82 | mask = torch.ne(seq_mask, 0).float().unsqueeze(-1) #b*t*1 83 | 84 | seq_emb *= mask 85 | 86 | seq_emb_ln = self.ln_1(seq_emb) 87 | 88 | mh_attn_out = self.mh_attn(seq_emb_ln, seq_emb) 89 | 90 | ff_out = self.feed_forward(self.ln_2(mh_attn_out)) 91 | 92 | ff_out *= mask 93 | 94 | ff_out = self.ln_3(ff_out) #b*t*d 95 | 96 | final_state = ff_out[:,-1,:] #b*d 97 | 98 | logits = torch.bmm(vids_emb, final_state.unsqueeze(-1)).squeeze() 99 | 100 | return logits 101 | 102 | 103 | def forward_recall(self, inputs): 104 | 105 | seq, seq_mask = inputs 106 | 107 | seq_emb = self.vid_emb(seq) #b*t*d 108 | 109 | position_emb = self.position(torch.arange(self.seq_len, dtype=torch.int64, device=self.device)) #t*d 110 | 111 | seq_emb += position_emb 112 | 113 | mask = torch.ne(seq_mask, 0).float().unsqueeze(-1) #b*seq_len*1 114 | 115 | seq_emb *= mask 116 | 117 | seq_emb_ln = self.ln_1(seq_emb) 118 | 119 | mh_attn_out = self.mh_attn(seq_emb_ln, seq_emb) 120 | 121 | ff_out = self.feed_forward(self.ln_2(mh_attn_out)) 122 | 123 | ff_out *= mask 124 | 125 | ff_out = self.ln_3(ff_out) 126 | 127 | final_state = ff_out[:,-1,:] #b*d 128 | 129 | return final_state #b*d
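The `evaluate_recall` routine in retrieval/metrics.py pairs `forward_recall` user embeddings with a faiss inner-product index built over the realshow video corpus. Below is a minimal, self-contained sketch of that retrieval-and-recall@k flow, assuming toy sizes, random vectors, and CPU faiss (`IndexFlatIP`) in place of the repo's `GpuIndexFlatIP`; `pos_id` and the array shapes are illustrative stand-ins, not values from the dataset.

```python
# Minimal sketch of the faiss-based recall@k evaluation (assumptions: toy
# sizes, random embeddings, CPU faiss; not the repo's actual data or model).
import faiss
import numpy as np

d = 8                                                   # must match emb_dim (hardcoded as 8 in metrics.py)
item_emb = np.random.rand(1000, d).astype(np.float32)   # stand-in for model.vid_emb.weight rows
user_emb = np.random.rand(4, d).astype(np.float32)      # stand-in for model.forward_recall output

index = faiss.IndexFlatIP(d)          # exact inner-product search
index.add(item_emb)                   # one row per candidate video
_, topk = index.search(user_emb, 100) # topk: (4, 100) item row ids per user

pos_id = 42                           # hypothetical positive item per user
recall_at_100 = np.mean([pos_id in row for row in topk])
print(f"recall@100: {recall_at_100:.4f}")
```

The same pattern underlies `target_realshow_recall_lst` above: a positive contributes to recall@k exactly when its rank in the index search result is below k.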
-------------------------------------------------------------------------------- /retrieval/modules.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | 5 | class PositionwiseFeedForward(nn.Module): 6 | def __init__(self, d_in, d_hid, dropout=0.1): 7 | super().__init__() 8 | self.w_1 = nn.Conv1d(d_in, d_hid, 1) 9 | self.w_2 = nn.Conv1d(d_hid, d_in, 1) 10 | # self.layer_norm = nn.LayerNorm(d_in) 11 | self.dropout = nn.Dropout(dropout) 12 | 13 | def forward(self, x): 14 | residual = x 15 | output = x.transpose(1, 2) 16 | output = self.w_2(F.relu(self.w_1(output))) 17 | output = output.transpose(1, 2) 18 | output = self.dropout(output) 19 | # output = self.layer_norm(output + residual) 20 | output = output + residual 21 | return output 22 | 23 | class MultiHeadAttention(nn.Module): 24 | def __init__(self, hidden_size, num_units, num_heads, dropout_rate): 25 | super().__init__() 26 | self.hidden_size = hidden_size 27 | self.num_heads = num_heads 28 | assert hidden_size % num_heads == 0 29 | 30 | self.linear_q = nn.Linear(hidden_size, num_units) 31 | self.linear_k = nn.Linear(hidden_size, num_units) 32 | self.linear_v = nn.Linear(hidden_size, num_units) 33 | self.dropout = nn.Dropout(dropout_rate) 34 | self.softmax = nn.Softmax(dim=-1) 35 | 36 | 37 | def forward(self, queries, keys): 38 | """ 39 | :param queries: A 3d tensor with shape of [N, T_q, C_q] 40 | :param keys: A 3d tensor with shape of [N, T_k, C_k] 41 | 42 | :return: A 3d tensor with shape of (N, T_q, C) 43 | 44 | """ 45 | Q = self.linear_q(queries) # (N, T_q, C) 46 | K = self.linear_k(keys) # (N, T_k, C) 47 | V = self.linear_v(keys) # (N, T_k, C) 48 | 49 | # Split and Concat 50 | split_size = self.hidden_size // self.num_heads 51 | Q_ = torch.cat(torch.split(Q, split_size, dim=2), dim=0) # (h*N, T_q, C/h) 52 | K_ = torch.cat(torch.split(K, split_size, dim=2), dim=0) # (h*N, T_k, C/h) 53 | V_ = torch.cat(torch.split(V, split_size, dim=2), dim=0) # (h*N, T_k, C/h) 54 | 55 | # Multiplication 56 | matmul_output = torch.bmm(Q_, K_.transpose(1, 2)) / self.hidden_size ** 0.5 # (h*N, T_q, T_k) 57 | 58 | # Key Masking 59 | key_mask = torch.sign(torch.abs(keys.sum(dim=-1))).repeat(self.num_heads, 1) # (h*N, T_k) 60 | key_mask_reshaped = key_mask.unsqueeze(1).repeat(1, queries.shape[1], 1) # (h*N, T_q, T_k) 61 | key_paddings = torch.ones_like(matmul_output) * (-2 ** 32 + 1) 62 | matmul_output_m1 = torch.where(torch.eq(key_mask_reshaped, 0), key_paddings, matmul_output) # (h*N, T_q, T_k) 63 | 64 | # Causality - Future Blinding 65 | diag_vals = torch.ones_like(matmul_output[0, :, :]) # (T_q, T_k) 66 | tril = torch.tril(diag_vals) # (T_q, T_k) 67 | causality_mask = tril.unsqueeze(0).repeat(matmul_output.shape[0], 1, 1) # (h*N, T_q, T_k) 68 | causality_paddings = torch.ones_like(causality_mask) * (-2 ** 32 + 1) 69 | matmul_output_m2 = torch.where(torch.eq(causality_mask, 0), causality_paddings, matmul_output_m1) # (h*N, T_q, T_k) 70 | 71 | # Activation 72 | matmul_output_sm = self.softmax(matmul_output_m2) # (h*N, T_q, T_k) 73 | 74 | # Query Masking 75 | query_mask = torch.sign(torch.abs(queries.sum(dim=-1))).repeat(self.num_heads, 1) # (h*N, T_q) 76 | query_mask = query_mask.unsqueeze(-1).repeat(1, 1, keys.shape[1]) # (h*N, T_q, T_k) 77 | matmul_output_qm = matmul_output_sm * query_mask 78 | 79 | # Dropout 80 | matmul_output_dropout = self.dropout(matmul_output_qm) 81 | 82 | # Weighted Sum 83 | output_ws =
torch.bmm(matmul_output_dropout, V_) # ( h*N, T_q, C/h) 84 | 85 | # Restore Shape 86 | output = torch.cat(torch.split(output_ws, output_ws.shape[0] // self.num_heads, dim=0), dim=2) # (N, T_q, C) 87 | 88 | # Residual Connection 89 | output_res = output + queries 90 | 91 | return output_res -------------------------------------------------------------------------------- /retrieval/run_sasrec.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | 4 | import torch 5 | import torch.nn as nn 6 | from torch.utils.data import DataLoader 7 | 8 | from models import SASRec 9 | from dataset import Recall_Train_SASRec_Dataset 10 | 11 | from utils import load_pkl 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--epochs', type=int, default=1, help='epochs.') 17 | parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.') 18 | parser.add_argument('--infer_batch_size', type=int, default=1024, help='inference batch size.') 19 | parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.') 20 | parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.') 21 | parser.add_argument('--seq_len', type=int, default=3, help='length of behaivor sequence') 22 | 23 | parser.add_argument('--cuda', type=int, default=0, help='cuda device.') 24 | 25 | parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.') 26 | 27 | parser.add_argument('--tag', type=str, default="1st", help='exp tag.') 28 | 29 | parser.add_argument('--neg_num', type=int, default=50, help='number of negative samples') 30 | 31 | return parser.parse_args() 32 | 33 | 34 | if __name__ == '__main__': 35 | args = parse_args() 36 | 37 | for k,v in vars(args).items(): 38 | print(f"{k}:{v}") 39 | 40 | #prepare data 41 | prefix = "../data" 42 | 43 | realshow_prefix = os.path.join(prefix, "realshow") 44 | path_to_train_csv_lst = [] 45 | with open("./file.txt", mode='r') as f: 46 | lines = f.readlines() 47 | for line in lines: 48 | tmp_csv_path = os.path.join(realshow_prefix, line.strip()+'.feather') 49 | path_to_train_csv_lst.append(tmp_csv_path) 50 | 51 | num_of_train_csv = len(path_to_train_csv_lst) 52 | print("training files:") 53 | print(f"number of train_csv: {num_of_train_csv}") 54 | for idx, filepath in enumerate(path_to_train_csv_lst): 55 | print(f"{idx}: {filepath}") 56 | 57 | #prepare seq 58 | seq_prefix = os.path.join(prefix, "seq_effective_50_dict") 59 | path_to_train_seq_pkl_lst = [] 60 | with open("./file.txt", mode='r') as f: 61 | lines = f.readlines() 62 | for line in lines: 63 | tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip()+'.pkl') 64 | path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path) 65 | 66 | print("training seq files:") 67 | for idx, filepath in enumerate(path_to_train_seq_pkl_lst): 68 | print(f"{idx}: {filepath}") 69 | 70 | #prepare id_cnt 71 | others_prefix = os.path.join(prefix, "others") 72 | path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl") 73 | print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}") 74 | 75 | id_cnt_dict = load_pkl(path_to_id_cnt_pkl) 76 | for k,v in id_cnt_dict.items(): 77 | print(f"{k}:{v}") 78 | 79 | #prepare negatives 80 | video_prefix = os.path.join(others_prefix, "realshow_video_info_daily") 81 | path_to_video_info_feather_lst = [] 82 | with open("./file.txt", mode='r') as f: 83 | lines = f.readlines() 84 | for line in lines: 85 | tmp_video_feather_path = os.path.join(video_prefix, 
--------------------------------------------------------------------------------
/retrieval/run_sasrec.sh:
--------------------------------------------------------------------------------
set -x
set -e
set -o pipefail

tag="sasrec-1st"

bs=4096
lr=1e-1
neg_num=200

mkdir -p logs  # the log redirects below fail if ./logs does not exist

python -B -u run_sasrec.py \
--epochs=1 \
--batch_size=${bs} \
--infer_batch_size=900 \
--emb_dim=8 \
--lr=${lr} \
--seq_len=50 \
--cuda="0" \
--print_freq=100 \
--neg_num=${neg_num} \
--tag=${tag} > "./logs/bs-${bs}_lr-${lr}_${neg_num}_${tag}.log" 2>&1

python -B -u eval_sasrec.py \
--epochs=1 \
--batch_size=${bs} \
--infer_recall_batch_size=900 \
--emb_dim=8 \
--lr=${lr} \
--seq_len=50 \
--cuda="0" \
--print_freq=100 \
--neg_num=${neg_num} \
--tag=${tag} >> "./logs/bs-${bs}_lr-${lr}_${neg_num}_${tag}.log" 2>&1
--------------------------------------------------------------------------------
/retrieval/run_sasrec_fsltr.py:
--------------------------------------------------------------------------------
import os
import argparse

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

from models import SASRec
from dataset import Recall_Train_SASRec_FSLTR_Dataset

from utils import load_pkl

def parse_args():
    parser = argparse.ArgumentParser()

    parser.add_argument('--epochs', type=int, default=1, help='epochs.')
    parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.')
    parser.add_argument('--infer_batch_size', type=int, default=1024, help='inference batch size.')
    parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.')
    parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.')
    parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.')

    parser.add_argument('--cuda', type=int, default=0, help='cuda device.')

    parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.')

    parser.add_argument('--tag', type=str, default="1st", help='exp tag.')

    parser.add_argument('--neg_num', type=int, default=3, help='number of negative samples.')

    parser.add_argument('--flow_negs', type=str, default='mcd_prerank_neg', help='comma-separated stage names used as flow negatives.')

    parser.add_argument('--flow_neg_nums', type=str, default='3', help='comma-separated number of negatives per stage.')

    return parser.parse_args()


if __name__ == '__main__':
    args = parse_args()

    for k, v in vars(args).items():
        print(f"{k}:{v}")

    # prepare data
    prefix = "../data"

    realshow_prefix = os.path.join(prefix, "realshow")
    path_to_train_csv_lst = []
    with open("./file.txt", mode='r') as f:
        lines = f.readlines()
        for line in lines:
            tmp_csv_path = os.path.join(realshow_prefix, line.strip() + '.feather')
            path_to_train_csv_lst.append(tmp_csv_path)

    num_of_train_csv = len(path_to_train_csv_lst)
    print("training files:")
    print(f"number of train_csv: {num_of_train_csv}")
    for idx, filepath in enumerate(path_to_train_csv_lst):
        print(f"{idx}: {filepath}")

    # prepare seq
    seq_prefix = os.path.join(prefix, "seq_effective_50_dict")
    path_to_train_seq_pkl_lst = []
    with open("./file.txt", mode='r') as f:
        lines = f.readlines()
        for line in lines:
            tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip() + '.pkl')
            path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path)

    print("training seq files:")
    for idx, filepath in enumerate(path_to_train_seq_pkl_lst):
        print(f"{idx}: {filepath}")

    # prepare request_id
    request_id_prefix = os.path.join(prefix, "request_id_dict")
    path_to_request_id_pkl_lst = []
    with open("./file.txt", mode='r') as f:
        lines = f.readlines()
        for line in lines:
            tmp_request_id_pkl_path = os.path.join(request_id_prefix, line.strip() + '.pkl')
            path_to_request_id_pkl_lst.append(tmp_request_id_pkl_path)

    print("training request_id files:")
    for idx, filepath in enumerate(path_to_request_id_pkl_lst):
        print(f"{idx}: {filepath}")

    # prepare id_cnt
    others_prefix = os.path.join(prefix, "others")
    path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl")
    print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}")

    id_cnt_dict = load_pkl(path_to_id_cnt_pkl)
    for k, v in id_cnt_dict.items():
        print(f"{k}:{v}")

    # prepare negatives
    video_prefix = os.path.join(others_prefix, "realshow_video_info_daily")
    path_to_video_info_feather_lst = []
    with open("./file.txt", mode='r') as f:
        lines = f.readlines()
        for line in lines:
            tmp_video_feather_path = os.path.join(video_prefix, line.strip() + '.feather')
            path_to_video_info_feather_lst.append(tmp_video_feather_path)

    print("realshow daily negative")
    for idx, filepath in enumerate(path_to_video_info_feather_lst):
        print(f"{idx}: {filepath}")

    # prepare model
    os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda)
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print(f"device: {device}")

    model = SASRec(args.emb_dim, args.seq_len, args.neg_num, device, id_cnt_dict).to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)

    # training
    for epoch in range(args.epochs):
        for n_day in range(num_of_train_csv):

            train_dataset = Recall_Train_SASRec_FSLTR_Dataset(
                path_to_train_csv_lst[n_day],
                args.seq_len, args.neg_num,
                path_to_train_seq_pkl_lst[n_day],
                path_to_request_id_pkl_lst[n_day],
                path_to_video_info_feather_lst[n_day],
                args.flow_negs,
                args.flow_neg_nums
            )

            train_loader = DataLoader(
                dataset=train_dataset,
                batch_size=args.batch_size,
                shuffle=True,
                num_workers=1,
                drop_last=False
            )

            for iter_step, inputs in enumerate(train_loader):

                # The last element of each batch is the float stage-priority tensor.
                inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs[:-1]]

                logits = model.forward_fsltr(inputs_LongTensor)  # (b, k)

                priority = torch.FloatTensor(inputs[-1].numpy()).to(device)  # (b, k)

                # Pairwise preference weights: pair (i, j) contributes only if
                # item i reached a later (higher-priority) stage than item j.
                weight = torch.gt(
                    priority.unsqueeze(-1), priority.unsqueeze(1)
                ).float()

                # Push logit_i above logit_j for every weighted pair.
                logits_diff = logits.unsqueeze(-1) - logits.unsqueeze(1)

                loss = F.binary_cross_entropy_with_logits(logits_diff, torch.ones_like(logits_diff), weight=weight, reduction='sum') / weight.sum()

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                if iter_step % args.print_freq == 0:
                    print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}")

    os.makedirs("./checkpoints", exist_ok=True)  # torch.save fails if the directory is missing
    path_to_save_model = f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_{args.neg_num}_{args.flow_negs}_{args.flow_neg_nums}_{args.tag}.pkl"

    torch.save(model.state_dict(), path_to_save_model)

    print(f"save model to {path_to_save_model} DONE.")
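
# --- Illustrative sketch (editor's addition, not part of the repo): the FSLTR
# loss above compares every pair of candidates in a request, keeping only the
# pairs (i, j) where item i survived to a later stage than item j. Made-up
# priorities and scores for k=3 candidates in one request:
import torch
import torch.nn.functional as F

priority = torch.tensor([[3.0, 1.0, 2.0]])   # (b=1, k=3); higher = later stage
logits = torch.tensor([[0.5, 0.2, 0.9]])     # model scores for the same items
weight = torch.gt(priority.unsqueeze(-1), priority.unsqueeze(1)).float()  # (1, k, k)
logits_diff = logits.unsqueeze(-1) - logits.unsqueeze(1)                  # (1, k, k)
loss = F.binary_cross_entropy_with_logits(
    logits_diff, torch.ones_like(logits_diff), weight=weight, reduction='sum') / weight.sum()
print(loss.item())  # averages over the 3 ordered pairs: (3>1), (3>2), (2>1)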
--------------------------------------------------------------------------------
/retrieval/run_sasrec_fsltr.sh:
--------------------------------------------------------------------------------
set -x
set -e
set -o pipefail

tag="sasrec-fsltr-1st"

bs=4096
lr=1e-1
neg_num=200

flow_negs=realshow,coarse_neg,prerank_neg
flow_neg_nums=1,1,1

mkdir -p logs  # the log redirects below fail if ./logs does not exist

python -B -u run_sasrec_fsltr.py \
--epochs=1 \
--batch_size=${bs} \
--infer_batch_size=900 \
--emb_dim=8 \
--lr=${lr} \
--seq_len=50 \
--cuda="0" \
--print_freq=100 \
--neg_num=${neg_num} \
--flow_negs=${flow_negs} \
--flow_neg_nums=${flow_neg_nums} \
--tag=${tag} > "./logs/bs-${bs}_lr-${lr}_${neg_num}_${flow_negs}_${flow_neg_nums}_${tag}.log" 2>&1

python -B -u eval_sasrec_fsltr.py \
--epochs=1 \
--batch_size=${bs} \
--infer_recall_batch_size=900 \
--emb_dim=8 \
--lr=${lr} \
--seq_len=50 \
--cuda="0" \
--print_freq=100 \
--neg_num=${neg_num} \
--flow_negs=${flow_negs} \
--flow_neg_nums=${flow_neg_nums} \
--tag=${tag} >> "./logs/bs-${bs}_lr-${lr}_${neg_num}_${flow_negs}_${flow_neg_nums}_${tag}.log" 2>&1
--------------------------------------------------------------------------------
/retrieval/run_sasrec_hardnegmining.py:
--------------------------------------------------------------------------------
import os
import argparse

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

from models import SASRec
from dataset import Recall_Train_SASRec_HardNegMining_Dataset

from utils import load_pkl

def parse_args():
    parser = argparse.ArgumentParser()

    parser.add_argument('--epochs', type=int, default=1, help='epochs.')
    parser.add_argument('--batch_size', type=int, default=1024, help='train batch size.')
    parser.add_argument('--infer_batch_size', type=int, default=1024, help='inference batch size.')
    parser.add_argument('--emb_dim', type=int, default=8, help='embedding dimension.')
    parser.add_argument('--lr', type=float, default=1e-3, help='learning rate.')
    parser.add_argument('--seq_len', type=int, default=3, help='length of behavior sequence.')

    parser.add_argument('--cuda', type=int, default=0, help='cuda device.')

    parser.add_argument('--print_freq', type=int, default=200, help='frequency of print.')

    parser.add_argument('--tag', type=str, default="1st", help='exp tag.')

    parser.add_argument('--neg_num', type=int, default=3, help='number of negative samples.')

    parser.add_argument('--flow_negs', type=str, default='mcd_prerank_neg', help='comma-separated stage names used as hard negatives.')

    parser.add_argument('--flow_neg_nums', type=str, default='3', help='comma-separated number of negatives per stage.')

    return parser.parse_args()


if __name__ == '__main__':
    args = parse_args()

    for k, v in vars(args).items():
        print(f"{k}:{v}")

    # prepare data
    prefix = "../data"

    realshow_prefix = os.path.join(prefix, "realshow")
    path_to_train_csv_lst = []
    with open("./file.txt", mode='r') as f:
        lines = f.readlines()
        for line in lines:
            tmp_csv_path = os.path.join(realshow_prefix, line.strip() + '.feather')
            path_to_train_csv_lst.append(tmp_csv_path)

    num_of_train_csv = len(path_to_train_csv_lst)
    print("training files:")
    print(f"number of train_csv: {num_of_train_csv}")
    for idx, filepath in enumerate(path_to_train_csv_lst):
        print(f"{idx}: {filepath}")

    # prepare seq
    seq_prefix = os.path.join(prefix, "seq_effective_50_dict")
    path_to_train_seq_pkl_lst = []
    with open("./file.txt", mode='r') as f:
        lines = f.readlines()
        for line in lines:
            tmp_seq_pkl_path = os.path.join(seq_prefix, line.strip() + '.pkl')
            path_to_train_seq_pkl_lst.append(tmp_seq_pkl_path)

    print("training seq files:")
    for idx, filepath in enumerate(path_to_train_seq_pkl_lst):
        print(f"{idx}: {filepath}")

    # prepare request_id
    request_id_prefix = os.path.join(prefix, "request_id_dict")
    path_to_request_id_pkl_lst = []
    with open("./file.txt", mode='r') as f:
        lines = f.readlines()
        for line in lines:
            tmp_request_id_pkl_path = os.path.join(request_id_prefix, line.strip() + '.pkl')
            path_to_request_id_pkl_lst.append(tmp_request_id_pkl_path)

    print("training request_id files:")
    for idx, filepath in enumerate(path_to_request_id_pkl_lst):
        print(f"{idx}: {filepath}")

    # prepare id_cnt
    others_prefix = os.path.join(prefix, "others")
    path_to_id_cnt_pkl = os.path.join(others_prefix, "id_cnt.pkl")
    print(f"path_to_id_cnt_pkl: {path_to_id_cnt_pkl}")

    id_cnt_dict = load_pkl(path_to_id_cnt_pkl)
    for k, v in id_cnt_dict.items():
        print(f"{k}:{v}")

    # prepare negatives
    video_prefix = os.path.join(others_prefix, "realshow_video_info_daily")
    path_to_video_info_feather_lst = []
    with open("./file.txt", mode='r') as f:
        lines = f.readlines()
        for line in lines:
            tmp_video_feather_path = os.path.join(video_prefix, line.strip() + '.feather')
            path_to_video_info_feather_lst.append(tmp_video_feather_path)

    print("realshow daily negative")
    for idx, filepath in enumerate(path_to_video_info_feather_lst):
        print(f"{idx}: {filepath}")

    # prepare model
    os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda)
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print(f"device: {device}")

    model = SASRec(args.emb_dim, args.seq_len, args.neg_num, device, id_cnt_dict).to(device)

    loss_fn = nn.LogSigmoid().to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)

    # training
    for epoch in range(args.epochs):
        for n_day in range(num_of_train_csv):

            # Same BPR objective as run_sasrec.py, but the dataset mixes in
            # hard negatives drawn from the requested flow stages.
            train_dataset = Recall_Train_SASRec_HardNegMining_Dataset(
                path_to_train_csv_lst[n_day],
                args.seq_len, args.neg_num,
                path_to_train_seq_pkl_lst[n_day],
                path_to_request_id_pkl_lst[n_day],
                path_to_video_info_feather_lst[n_day],
                args.flow_negs,
                args.flow_neg_nums
            )

            train_loader = DataLoader(
                dataset=train_dataset,
                batch_size=args.batch_size,
                shuffle=True,
                num_workers=1,
                drop_last=False
            )

            for iter_step, inputs in enumerate(train_loader):

                inputs_LongTensor = [torch.LongTensor(inp.numpy()).to(device) for inp in inputs]

                tgt_logits, neg_logits = model(inputs_LongTensor)  # tgt: (b,), neg: (b*neg_num,)

                # Pair each positive logit with each of its neg_num negatives.
                pos_logits_expand = tgt_logits.repeat_interleave(args.neg_num)

                loss = -loss_fn(pos_logits_expand - neg_logits).mean()

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                if iter_step % args.print_freq == 0:
                    print(f"Day:{n_day}\t[Epoch/iter]:{epoch:>3}/{iter_step:<4}\tloss:{loss.detach().cpu().item():.6f}")

    os.makedirs("./checkpoints", exist_ok=True)  # torch.save fails if the directory is missing
    path_to_save_model = f"./checkpoints/bs-{args.batch_size}_lr-{args.lr}_neg_num-{args.neg_num}_flow_negs-{args.flow_negs}_flow_neg_nums-{args.flow_neg_nums}_{args.tag}.pkl"

    torch.save(model.state_dict(), path_to_save_model)

    print(f"save model to {path_to_save_model} DONE.")
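
# --- Illustrative sketch (editor's addition; an assumption about how the
# Dataset consumes its arguments, not confirmed by the repo): the paired
# --flow_negs / --flow_neg_nums strings are comma-separated and presumably
# parsed into per-stage hard-negative counts, e.g.:
flow_negs = "realshow,coarse_neg,prerank_neg"
flow_neg_nums = "1,1,1"
stage_to_num = dict(zip(flow_negs.split(','), map(int, flow_neg_nums.split(','))))
print(stage_to_num)  # {'realshow': 1, 'coarse_neg': 1, 'prerank_neg': 1}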
--------------------------------------------------------------------------------
/retrieval/run_sasrec_hardnegmining.sh:
--------------------------------------------------------------------------------
set -x
set -e
set -o pipefail

tag="sasrec-hardnegmining-1st"

bs=4096
lr=1e-1
neg_num=200

flow_negs=prerank_neg
flow_neg_nums=1

mkdir -p logs  # the log redirects below fail if ./logs does not exist

python -B -u run_sasrec_hardnegmining.py \
--epochs=1 \
--batch_size=${bs} \
--infer_batch_size=900 \
--emb_dim=8 \
--lr=${lr} \
--seq_len=50 \
--cuda="0" \
--print_freq=100 \
--neg_num=${neg_num} \
--flow_negs=${flow_negs} \
--flow_neg_nums=${flow_neg_nums} \
--tag=${tag} > "./logs/bs-${bs}_lr-${lr}_${neg_num}_${flow_negs}_${flow_neg_nums}_${tag}.log" 2>&1

python -B -u eval_sasrec_hardnegmining.py \
--epochs=1 \
--batch_size=${bs} \
--infer_recall_batch_size=900 \
--emb_dim=8 \
--lr=${lr} \
--seq_len=50 \
--cuda="0" \
--print_freq=100 \
--neg_num=${neg_num} \
--flow_negs=${flow_negs} \
--flow_neg_nums=${flow_neg_nums} \
--tag=${tag} >> "./logs/bs-${bs}_lr-${lr}_${neg_num}_${flow_negs}_${flow_neg_nums}_${tag}.log" 2>&1
--------------------------------------------------------------------------------
/retrieval/utils.py:
--------------------------------------------------------------------------------
import pickle as pkl
from collections import defaultdict

# Module-level factory functions (rather than lambdas) so that nested
# defaultdicts built with them remain picklable.
def defaultdict_tuple():
    return defaultdict(tuple)

def defaultdict_str():
    return defaultdict(str)

def load_pkl(filename):
    with open(filename, 'rb') as f:
        return pkl.load(f)
--------------------------------------------------------------------------------
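
# --- Illustrative sketch (editor's addition, not part of the repo): why the
# named factory functions in utils.py matter. pickle can serialize a
# defaultdict whose default_factory is a module-level function, but not one
# built from a lambda. Assumes utils.py is importable from the working dir.
import pickle
from collections import defaultdict
from utils import defaultdict_tuple

d = defaultdict(defaultdict_tuple)
d["request_id"]["realshow"] = (1, 2, 3)
blob = pickle.dumps(d)  # works: the factory is a named, importable function
# pickle.dumps(defaultdict(lambda: defaultdict(tuple)))  # would raise an error
print(pickle.loads(blob)["request_id"]["realshow"])  # (1, 2, 3)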