├── README.txt
├── README_CN.md
├── code
│   ├── 1_DataPreprocessing
│   │   ├── 01_Generate_Offline_Dataset_origin.py
│   │   ├── 02_Generate_Model1_Dataset_origin.py
│   │   ├── 03_Create_Model1_Answer.py
│   │   ├── 03_Create_Offline_Answer.py
│   │   ├── 04_TransformDateTime-Copy1.py
│   │   ├── 05_Generate_img_txt_vec.py
│   │   └── ipynb_file.zip
│   ├── 2_Similarity
│   │   ├── 01_itemCF_Mundane_model1.py
│   │   ├── 01_itemCF_Mundane_offline.py
│   │   ├── 01_itemCF_Mundane_online.py
│   │   ├── RA_Wu_model1.py
│   │   ├── RA_Wu_offline.py
│   │   ├── RA_Wu_online.py
│   │   ├── deep_node_model.py
│   │   └── ipynb_file.zip
│   ├── 3_NN
│   │   ├── ItemFeat2.py
│   │   ├── Readme
│   │   ├── config.py
│   │   ├── model2.py
│   │   ├── modules.py
│   │   ├── sampler2.py
│   │   ├── sas_rec.py
│   │   └── util.py
│   ├── 3_Recall
│   │   ├── 01_Recall-Wu-model1.py
│   │   ├── 01_Recall-Wu-offline.py
│   │   ├── 01_Recall-Wu-online.py
│   │   └── ipynb_file.zip
│   ├── 4_RankFeature
│   │   ├── 01_sim_feature_model1.py
│   │   ├── 01_sim_feature_model1_RA_AA.py
│   │   ├── 01_sim_feature_offline.py
│   │   ├── 01_sim_feature_offline_RA_AA.py
│   │   ├── 01_sim_feature_online.py
│   │   ├── 01_sim_feature_online_RA_AA.py
│   │   ├── 02_itemtime_feature_model1.py
│   │   ├── 02_itemtime_feature_offline.py
│   │   ├── 02_itemtime_feature_online.py
│   │   ├── 03_count_feature_model1.py
│   │   ├── 03_count_feature_offline.py
│   │   ├── 03_count_feature_online.py
│   │   ├── 04_NN_feature_model1.py
│   │   ├── 04_NN_feature_offline.py
│   │   ├── 04_NN_feature_online.csv.py
│   │   ├── 05_txt_feature_model1.py
│   │   ├── 05_txt_feature_offline.py
│   │   ├── 05_txt_feature_online.py
│   │   ├── 06_interactive_model1.py
│   │   ├── 06_interactive_offline.py
│   │   ├── 06_interactive_online.py
│   │   ├── 07_count_detail_model1.py
│   │   ├── 07_count_detail_offline.py
│   │   ├── 07_count_detail_online.py
│   │   ├── 08_user_feature_model1.py
│   │   ├── 08_user_feature_offline.py
│   │   ├── 08_user_feature_online.py
│   │   ├── 09_partial_sim_feature_model1.py
│   │   ├── 09_partial_sim_feature_offline.py
│   │   ├── 09_partial_sim_feature_online.py
│   │   ├── 10_emergency_feature_model1.py
│   │   ├── 10_emergency_feature_offline.py
│   │   ├── 10_emergency_feature_online.py
│   │   ├── 10_紧急feature_model1.py
│   │   ├── 10_紧急feature_offline.py
│   │   ├── 10_紧急feature_online.py
│   │   └── 4_RankFeature.zip
│   └── 5_Modeling
│       ├── Model_Offline.py
│       ├── Model_Online.py
│       └── ipynb_file.zip
├── feature_list.csv
├── main.sh
├── project_structure.txt
└── requirements.txt

/README.txt:
--------------------------------------------------------------------------------
1 | This repository contains the 6th place solution to the KDD Cup 2020 Challenges
2 | for Modern E-Commerce Platform: Debiasing Challenge.
3 | 
4 | skewcy@gmail.com
--------------------------------------------------------------------------------
/README_CN.md:
--------------------------------------------------------------------------------
1 | # KDDCUP-2020
2 | KDD Cup 2020, Debiasing track: 6th place solution
3 | 
4 | This repository contains the 6th place solution to the KDD Cup 2020 Challenges for Modern E-Commerce Platform: Debiasing Challenge.
5 | 
6 | Competition link: https://tianchi.aliyun.com/competition/entrance/231785/introduction
7 | 
8 | Solution write-up (blog, in Chinese): https://zhuanlan.zhihu.com/p/149424540
9 | 
10 | Dataset download links:
11 | underexpose_train.zip 271.62MB http://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/231785/underexpose_train.zip
12 | underexpose_test.zip 3.27MB http://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/231785/underexpose_test.zip
13 | 
14 | Dataset unzip passwords:
15 | 
16 | 7c2d2b8a636cbd790ff12a007907b2ba underexpose_train_click-1
17 | ea0ec486b76ae41ed836a8059726aa85 underexpose_train_click-2
18 | 65255c3677a40bf4d341b0c739ad6dff underexpose_train_click-3
19 | c8376f1c4ed07b901f7fe5c60362ad7b underexpose_train_click-4
20 | 63b326dc07d39c9afc65ed81002ff2ab underexpose_train_click-5
21 | f611f3e477b458b718223248fd0d1b55 underexpose_train_click-6
22 | ec191ea68e0acc367da067133869dd60 underexpose_train_click-7
23 | 90129a980cb0a4ba3879fb9a4b177cd2 underexpose_train_click-8
24 | f4ff091ab62d849ba1e6ea6f7c4fb717 underexpose_train_click-9
25 | 
26 | 96d071a532e801423be614e9e8414992 underexpose_test_click-1
27 | 503bf7a5882d3fac5ca9884d9010078c underexpose_test_click-2
28 | dd3de82d0b3a7fe9c55e0b260027f50f underexpose_test_click-3
29 | 04e966e4f6c7b48f1272a53d8f9ade5d underexpose_test_click-4
30 | 13a14563bf5528121b8aaccfa7a0dd73 underexpose_test_click-5
31 | dee22d5e4a7b1e3c409ea0719aa0a715 underexpose_test_click-6
32 | 69416eedf810b56f8a01439e2061e26d underexpose_test_click-7
33 | 55588c1cddab2fa5c63abe5c4bf020e5 underexpose_test_click-8
34 | caacb2c58d01757f018d6b9fee0c8095 underexpose_test_click-9
35 | 
36 | 
37 | 
38 | ## Solution
39 | 1. As shown in the file structure below, we first preprocess the data ("1_DataPreprocessing"): each test user's second-to-last click is held out as the answer to build the offline training set (stored in user_data/model_1), and the last click is held out as the answer to build the offline validation set (stored in user_data/offline); the online data to be predicted is stored in user_data/dataset. Based on the periodic pattern of the click counts, we convert time into calendar dates (04_TransformDateTime-Copy1.py), and we also generate the text-similarity and image-similarity files (05_Generate_img_txt_vec.py).
40 | 
41 | 2. We train deepwalk and node2vec models on the click logs of the offline training set, the offline validation set, and the online data in turn ("deep_node_model.py"). We then modify the ItemCF algorithm by fusing text similarity, deepwalk, and node2vec, and compute and store the item-item similarities ("01_itemCF_Mundane_model1.py" etc.). In addition, we build an item similarity network on top of the recalled item similarities, and compute and store second-order similarities such as RA, AA, CN, HDI, HPI, and LHN1 ("RA_Wu_model1.py" etc.).
42 | 
43 | 3. We implement a Self-Attentive Sequential model to predict the probability that each recalled user-item pair will be clicked ("3_NN").
44 | 
45 | 4. Based on the stored item similarities, we recall 1,000 candidate items for each user to be predicted ("3_Recall").
46 | 5. We generate ranking features for the recalled user-item pairs ("4_RankFeature"); a scoring sketch follows this list.
47 | 
48 | 6. We treat the recalled user-item pairs that were actually clicked as positive samples, randomly sample negatives from the recall list at a 1:5 positive-to-negative ratio, and generate six datasets. We then train catboost and lightgbm models, assigning larger weights to items with few clicks, blend the models with arithmetic, geometric, and harmonic means, and post-process the predictions according to item click counts ("5_Modeling").
49 | 
50 | **Our final solution ranked 1st on Track A and 6th on Track B.**
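To make the similarity-based recall and ranking in steps 2, 4, and 5 concrete, here is a minimal sketch (with illustrative names and toy data; the actual implementation lives in "3_Recall" and "4_RankFeature") of how a candidate item is scored against a user's click history: every clicked item contributes its similarity to the candidate, decayed by recency position (a factor of 0.7 per step back in the sequence) and scaled by a time-based weight.

```python
# Minimal sketch of similarity-based candidate scoring; names and data are illustrative.
def score_candidate(sim, clicked_items, candidate, time_weight, decay=0.7):
    """sim: {item: {related_item: similarity}}; clicked_items: most recent first."""
    score = 0.0
    for loc, item in enumerate(clicked_items):
        if item in sim and candidate in sim[item]:
            # Recent clicks dominate: position decay 0.7**loc, plus a per-click
            # time weight (items missing from time_weight fall back to 0.5).
            score += sim[item][candidate] * (decay ** loc) * time_weight.get(item, 0.5)
    return score

sim = {1: {2: 0.9, 3: 0.2}, 4: {2: 0.4}}
print(score_candidate(sim, clicked_items=[1, 4], candidate=2, time_weight={1: 1.0}))  # 0.9 + 0.4*0.7*0.5
```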
58 | ## File structure
59 | The data can be downloaded from the official competition website; create the folders and place the data according to the paths below.
60 | 
61 | │  feature_list.csv            # List the features we used in ranking process
62 | │  main.sh                     # Run this script to start the whole process
63 | │  project_structure.txt       # The tree structure of this project
64 | │
65 | ├─code
66 | │  │  __init__.py
67 | │  │
68 | │  ├─1_DataPreprocessing       # Generate validation-set, create timestamp and generate item feature vectors
69 | │  │      01_Generate_Offline_Dataset_origin.py
70 | │  │      02_Generate_Model1_Dataset_origin.py
71 | │  │      03_Create_Model1_Answer.py
72 | │  │      03_Create_Offline_Answer.py
73 | │  │      04_TransformDateTime-Copy1.py
74 | │  │      05_Generate_img_txt_vec.py
75 | │  │      ipynb_file.zip
76 | │  │
77 | │  ├─2_Similarity              # Generate item-item similarity matrix
78 | │  │      01_itemCF_Mundane_model1.py
79 | │  │      01_itemCF_Mundane_offline.py
80 | │  │      01_itemCF_Mundane_online.py
81 | │  │      deep_node_model.py
82 | │  │      ipynb_file.zip
83 | │  │      RA_Wu_model1.py
84 | │  │      RA_Wu_offline.py
85 | │  │      RA_Wu_online.py
86 | │  │
87 | │  ├─3_NN                      # Generate deep-learning based result
88 | │  │      config.py
89 | │  │      ItemFeat2.py
90 | │  │      model2.py
91 | │  │      modules.py
92 | │  │      Readme
93 | │  │      sampler2.py
94 | │  │      sas_rec.py
95 | │  │      util.py
96 | │  │
97 | │  ├─3_Recall                  # Recall candidates
98 | │  │      01_Recall-Wu-model1.py
99 | │  │      01_Recall-Wu-offline.py
100 | │  │      01_Recall-Wu-online.py
101 | │  │      ipynb_file.zip
102 | │  │
103 | │  ├─4_RankFeature             # Generate feature for ranking
104 | │  │      01_sim_feature_model1.py
105 | │  │      01_sim_feature_model1_RA_AA.py
106 | │  │      01_sim_feature_offline.py
107 | │  │      01_sim_feature_offline_RA_AA.py
108 | │  │      ……
109 | │  │      10_emergency_feature_offline.py
110 | │  │      10_emergency_feature_online.py
111 | │  │      4_RankFeature.zip
112 | │  │
113 | │  └─5_Modeling                # Build Catboost and LightGBM model
114 | │         ipynb_file.zip
115 | │         Model_Offline.py
116 | │         Model_Online.py
117 | │
118 | ├─data                         # Origin dataset
119 | │  ├─underexpose_test
120 | │  └─underexpose_train
121 | ├─prediction_result
122 | └─user_data                    # Containing intermediate files
123 |     ├─dataset
124 |     │  ├─new_recall
125 |     │  ├─new_similarity
126 |     │  └─nn
127 |     ├─model_1
128 |     │  ├─new_recall
129 |     │  ├─new_similarity
130 |     │  └─nn
131 |     └─offline
132 |         ├─new_recall
133 |         ├─new_similarity
134 |         └─nn
135 | 
136 | ## Python dependencies
137 | lightgbm==2.2.1
138 | tensorflow==1.13.1
139 | joblib==0.15.1
140 | gensim==3.4.0
141 | pandas==0.25.1
142 | numpy==1.16.3
143 | networkx==2.4
144 | tqdm==4.46.0
145 | 
146 | ## Disclaimer
147 | This repository holds the code of our solution to the KDD Cup 2020 challenge; all code is provided for learning and reference purposes only.
148 | 
149 | If you have any issue please feel free to contact me at cs_xcy@126.com
150 | 
151 | Tianchi IDs: GrandRookie,
152 | BruceQD,
153 | 七里z,
154 | 青禹小生,
155 | 蓝绿黄红,
156 | LSH123,
157 | XMNG,
158 | wenwen_123,
159 | **小雨姑娘**,
160 | wbbhcb
--------------------------------------------------------------------------------
/code/1_DataPreprocessing/01_Generate_Offline_Dataset_origin.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding: utf-8
3 | 
4 | # In[29]:
5 | 
6 | 
7 | import pandas as pd
8 | import numpy as np
9 | from tqdm import tqdm
10 | import os
11 | import warnings
12 | warnings.filterwarnings("ignore")
13 | 
14 | 
15 | # In[30]:
16 | 
17 | 
18 | current_stage = 9
19 | path = './data/'
20 | output_path = 
'./user_data/offline/' 21 | input_header = 'underexpose_' 22 | output_header = 'offline_' 23 | 24 | #path = 'offline/' 25 | #input_header = 'offline_' 26 | #output_header = 'model1/model_1_' 27 | 28 | 29 | # In[31]: 30 | 31 | 32 | df_train_list = [pd.read_csv(path+'underexpose_train/'+input_header+'train_click-%d.csv'%x, 33 | header=None, 34 | names=['user_id', 'item_id', 'time']) for x in range(current_stage + 1)] 35 | for x, df_train in enumerate(df_train_list): 36 | df_train.to_csv('./user_data/dataset/' + input_header + 'train_click-%d.csv'%x, index=False,header=None) 37 | 38 | df_train = pd.concat(df_train_list) 39 | df_train = df_train.drop_duplicates(subset=['user_id','item_id','time'],keep='last') 40 | df_train = df_train.reset_index(drop=True) 41 | 42 | 43 | # In[32]: 44 | 45 | 46 | df_test_list = [pd.read_csv(path+'underexpose_test/'+input_header+'test_click-%d.csv'%x, 47 | header=None, 48 | names=['user_id', 'item_id', 'time']) for x in range(current_stage + 1)] 49 | for x, df_test in enumerate(df_test_list): 50 | df_test.to_csv('./user_data/dataset/' + input_header + 'test_click-%d.csv'%x, index=False,header=None) 51 | df_test = pd.concat(df_test_list) 52 | df_test = df_test.drop_duplicates(subset=['user_id','item_id','time'],keep='last') 53 | df_test = df_test.reset_index(drop=True) 54 | 55 | 56 | # In[33]: 57 | 58 | 59 | df = pd.concat([df_train,df_test]) 60 | df = df.drop_duplicates(subset=['user_id','item_id','time'],keep='last') 61 | df = df.reset_index(drop=True) 62 | 63 | 64 | # In[34]: 65 | 66 | 67 | # if you are generating the offline dataset please use the comment sentense 68 | 69 | # df_pred_list = [pd.read_csv(path+input_header+'test_qtime-%d.csv'%x, 70 | # header=None, 71 | # names=['user_id','item_id','time']) for x in range(current_stage + 1)] 72 | 73 | #online 74 | df_pred_list = [pd.read_csv(path+'underexpose_test/'+input_header+'test_qtime-%d.csv'%x, 75 | header=None, 76 | names=['user_id','time']) for x in range(current_stage + 1)] 77 | 78 | for x, df_pred in enumerate(df_pred_list): 79 | df_pred.to_csv('./user_data/dataset/' + input_header + 'test_qtime-%d.csv'%x, index=False,header=None) 80 | 81 | 82 | # In[35]: 83 | 84 | 85 | for i in range(current_stage + 1): 86 | if 'item_id' in df_pred_list[i].columns: 87 | df_pred_list[i] = df_pred_list[i][['user_id','time']] 88 | 89 | 90 | # In[36]: 91 | 92 | 93 | df_list = [] 94 | 95 | for i in range(current_stage + 1): 96 | df_0 = pd.concat([df_train_list[i], df_test_list[i],df_pred_list[i]]) 97 | df_0 = df_0.sort_values(by=['time']) 98 | df_0 = df_0.reset_index(drop=True) 99 | df_list.append(df_0) 100 | 101 | 102 | # In[37]: 103 | 104 | 105 | for i in range(current_stage + 1): 106 | count_log = [] 107 | for index, row in df_pred_list[i].iterrows(): 108 | count_log.append(sum((df_list[i]['user_id']==row['user_id']) & (df_list[i]['time'] 1: 129 | row_tmp = df_list[each_stage_out].loc[df_tmp[ (df_tmp['time']==max(df_tmp['time']) ) & (~np.isnan(df_tmp['item_id'] )) ].index[0]] 130 | user_id_tmp = row_tmp['user_id'] 131 | item_id_tmp = row_tmp['item_id'] 132 | time_tmp = row_tmp['time'] 133 | fout.write(str(int(user_id_tmp)) + ',' + str(int(item_id_tmp)) + ',' + str(time_tmp) + '\n') 134 | else: 135 | row_tmp = df_list[each_stage_out].loc[df_tmp.index[-2]] 136 | user_id_tmp = row_tmp['user_id'] 137 | item_id_tmp = row_tmp['item_id'] 138 | time_tmp = row_tmp['time'] 139 | fout.write(str(int(user_id_tmp)) + ',' + str(int(item_id_tmp)) + ',' + str(time_tmp) + '\n') 140 | 141 | for each_stage_in in range(current_stage + 1): 
142 | list_train_list[each_stage_in] += list(df_train_list[each_stage_in][(df_train_list[each_stage_in]['user_id']==row['user_id']) 143 | &(df_train_list[each_stage_in]['item_id']==item_id_tmp)].index) 144 | 145 | list_test_list[each_stage_in] += list(df_test_list[each_stage_in][(df_test_list[each_stage_in]['user_id']==row['user_id']) 146 | &(df_test_list[each_stage_in]['item_id']==item_id_tmp)].index) 147 | fout.close() 148 | 149 | 150 | # In[ ]: 151 | 152 | 153 | 154 | 155 | 156 | # In[39]: 157 | 158 | 159 | df_train_list = [x.drop(labels=list_train_list[i],axis=0) for i,x in enumerate(df_train_list)] 160 | 161 | 162 | # In[40]: 163 | 164 | 165 | df_test_list = [x.drop(labels=list_test_list[i],axis=0) for i,x in enumerate(df_test_list)] 166 | 167 | 168 | # In[41]: 169 | 170 | 171 | df_train_list = [x.reset_index(drop=True) for x in df_train_list] 172 | df_test_list = [x.reset_index(drop=True) for x in df_test_list] 173 | 174 | 175 | # In[42]: 176 | 177 | 178 | for i in range(current_stage + 1): 179 | df_train_list[i].to_csv(output_path + output_header+'train_click-%d.csv'%i,index=False,header=None) 180 | df_test_list[i].to_csv(output_path + output_header+'test_click-%d.csv'%i,index=False,header=None) 181 | 182 | # In[ ]: 183 | 184 | 185 | 186 | 187 | -------------------------------------------------------------------------------- /code/1_DataPreprocessing/02_Generate_Model1_Dataset_origin.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # In[10]: 5 | 6 | 7 | import pandas as pd 8 | import numpy as np 9 | from tqdm import tqdm 10 | import os 11 | import warnings 12 | warnings.filterwarnings("ignore") 13 | 14 | 15 | # In[23]: 16 | 17 | 18 | current_stage = 9 19 | #path = 'dataset/' 20 | #input_header = 'underexpose_' 21 | #output_header = 'offline/offline_' 22 | 23 | path = './user_data/offline/' 24 | output_path = './user_data/model_1/' 25 | input_header = 'offline_' 26 | output_header = 'model_1_' 27 | 28 | 29 | # In[12]: 30 | 31 | 32 | df_train_list = [pd.read_csv(path+input_header+'train_click-%d.csv'%x, 33 | header=None, 34 | names=['user_id', 'item_id', 'time']) for x in range(current_stage + 1)] 35 | df_train = pd.concat(df_train_list) 36 | df_train = df_train.drop_duplicates(subset=['user_id','item_id','time'],keep='last') 37 | df_train = df_train.reset_index(drop=True) 38 | 39 | 40 | # In[13]: 41 | 42 | 43 | df_test_list = [pd.read_csv(path+input_header+'test_click-%d.csv'%x, 44 | header=None, 45 | names=['user_id', 'item_id', 'time']) for x in range(current_stage + 1)] 46 | df_test = pd.concat(df_test_list) 47 | df_test = df_test.drop_duplicates(subset=['user_id','item_id','time'],keep='last') 48 | df_test = df_test.reset_index(drop=True) 49 | 50 | 51 | # In[14]: 52 | 53 | 54 | df = pd.concat([df_train,df_test]) 55 | df = df.drop_duplicates(subset=['user_id','item_id','time'],keep='last') 56 | df = df.reset_index(drop=True) 57 | 58 | 59 | # In[15]: 60 | 61 | 62 | # if you are generating the offline dataset please use the comment sentense 63 | 64 | df_pred_list = [pd.read_csv(path+input_header+'test_qtime-%d.csv'%x, 65 | header=None, 66 | names=['user_id','item_id','time']) for x in range(current_stage + 1)] 67 | 68 | #online 69 | #df_pred_list = [pd.read_csv(path+input_header+'test_qtime-%d.csv'%x, 70 | # header=None, 71 | # names=['user_id','time']) for x in range(current_stage + 1)] 72 | 73 | 74 | # In[16]: 75 | 76 | 77 | for i in range(current_stage + 1): 78 | if 'item_id' in 
df_pred_list[i].columns: 79 | df_pred_list[i] = df_pred_list[i][['user_id','time']] 80 | 81 | 82 | # In[17]: 83 | 84 | 85 | df_list = [] 86 | 87 | for i in range(current_stage + 1): 88 | df_0 = pd.concat([df_train_list[i], df_test_list[i],df_pred_list[i]]) 89 | df_0 = df_0.sort_values(by=['time']) 90 | df_0 = df_0.reset_index(drop=True) 91 | df_list.append(df_0) 92 | 93 | 94 | # In[18]: 95 | 96 | 97 | for i in range(current_stage + 1): 98 | count_log = [] 99 | for index, row in df_pred_list[i].iterrows(): 100 | count_log.append(sum((df_list[i]['user_id']==row['user_id']) & (df_list[i]['time'] 1: 121 | row_tmp = df_list[each_stage_out].loc[df_tmp[ (df_tmp['time']==max(df_tmp['time']) ) & (~np.isnan(df_tmp['item_id'] )) ].index[0]] 122 | user_id_tmp = row_tmp['user_id'] 123 | item_id_tmp = row_tmp['item_id'] 124 | time_tmp = row_tmp['time'] 125 | fout.write(str(int(user_id_tmp)) + ',' + str(int(item_id_tmp)) + ',' + str(time_tmp) + '\n') 126 | else: 127 | row_tmp = df_list[each_stage_out].loc[df_tmp.index[-2]] 128 | user_id_tmp = row_tmp['user_id'] 129 | item_id_tmp = row_tmp['item_id'] 130 | time_tmp = row_tmp['time'] 131 | fout.write(str(int(user_id_tmp)) + ',' + str(int(item_id_tmp)) + ',' + str(time_tmp) + '\n') 132 | 133 | for each_stage_in in range(current_stage + 1): 134 | list_train_list[each_stage_in] += list(df_train_list[each_stage_in][(df_train_list[each_stage_in]['user_id']==row['user_id']) 135 | &(df_train_list[each_stage_in]['item_id']==item_id_tmp)].index) 136 | 137 | list_test_list[each_stage_in] += list(df_test_list[each_stage_in][(df_test_list[each_stage_in]['user_id']==row['user_id']) 138 | &(df_test_list[each_stage_in]['item_id']==item_id_tmp)].index) 139 | fout.close() 140 | 141 | 142 | # In[ ]: 143 | 144 | 145 | 146 | 147 | 148 | # In[25]: 149 | 150 | 151 | df_train_list = [x.drop(labels=list_train_list[i],axis=0) for i,x in enumerate(df_train_list)] 152 | 153 | 154 | # In[26]: 155 | 156 | 157 | df_test_list = [x.drop(labels=list_test_list[i],axis=0) for i,x in enumerate(df_test_list)] 158 | 159 | 160 | # In[27]: 161 | 162 | 163 | df_train_list = [x.reset_index(drop=True) for x in df_train_list] 164 | df_test_list = [x.reset_index(drop=True) for x in df_test_list] 165 | 166 | 167 | # In[28]: 168 | 169 | 170 | for i in range(current_stage + 1): 171 | df_train_list[i].to_csv(output_path+output_header+'train_click-%d.csv'%i,index=False,header=None) 172 | df_test_list[i].to_csv(output_path+output_header+'test_click-%d.csv'%i,index=False,header=None) 173 | 174 | -------------------------------------------------------------------------------- /code/1_DataPreprocessing/03_Create_Model1_Answer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # In[6]: 5 | 6 | 7 | from collections import defaultdict 8 | 9 | current_phases = 9 10 | number = 1 11 | 12 | def _create_answer_file_for_evaluation(answer_fname='debias_track_answer.csv'): 13 | 14 | 15 | train = './user_data/model_'+str(number)+'/model_'+str(number)+'_train_click-%d.csv' 16 | test = './user_data/model_'+str(number)+'/model_'+str(number)+'_test_click-%d.csv' 17 | 18 | 19 | answer = './user_data/model_'+str(number)+'/model_'+str(number)+'_test_qtime-%d.csv' 20 | 21 | item_deg = defaultdict(lambda: 0) 22 | with open(answer_fname, 'w') as fout: 23 | for phase_id in range(current_phases+1): 24 | with open(train % phase_id) as fin: 25 | for line in fin: 26 | user_id, item_id, timestamp = line.split(',') 27 | user_id, item_id, timestamp = ( 
28 | int(user_id), int(item_id), float(timestamp)) 29 | item_deg[item_id] += 1 30 | with open(test % phase_id) as fin: 31 | for line in fin: 32 | user_id, item_id, timestamp = line.split(',') 33 | user_id, item_id, timestamp = ( 34 | int(user_id), int(item_id), float(timestamp)) 35 | item_deg[item_id] += 1 36 | with open(answer % phase_id) as fin: 37 | for line in fin: 38 | user_id, item_id, timestamp = line.split(',') 39 | user_id, item_id, timestamp = ( 40 | int(user_id), int(item_id), float(timestamp)) 41 | assert user_id % 11 == phase_id 42 | print(phase_id, user_id, item_id, item_deg[item_id], 43 | sep=',', file=fout) 44 | 45 | 46 | # In[7]: 47 | 48 | 49 | _create_answer_file_for_evaluation('./user_data/model_'+str(number)+'/model_'+str(number)+'_debias_track_answer.csv') 50 | 51 | 52 | # In[ ]: 53 | 54 | 55 | 56 | 57 | -------------------------------------------------------------------------------- /code/1_DataPreprocessing/03_Create_Offline_Answer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # In[1]: 5 | 6 | 7 | from collections import defaultdict 8 | 9 | current_phases = 9 10 | 11 | def _create_answer_file_for_evaluation(answer_fname='debias_track_answer.csv'): 12 | train = './user_data/offline/offline_train_click-%d.csv' 13 | test = './user_data/offline/offline_test_click-%d.csv' 14 | 15 | 16 | # train = 'model'+str(number)+'/model_'+str(number)+'_train_click-%d.csv' 17 | # test = 'model'+str(number)+'/model_'+str(number)+'_test_click-%d.csv' 18 | 19 | 20 | # underexpose_test_qtime-T.csv contains only 21 | # underexpose_test_qtime_with_answer-T.csv contains 22 | #answer = 'model/model_test_qtime-%d.csv' # not released 23 | 24 | answer = './user_data/offline/offline_test_qtime-%d.csv' 25 | 26 | # answer = 'model'+str(number)+'/model_'+str(number)+'_test_qtime-%d.csv' 27 | 28 | item_deg = defaultdict(lambda: 0) 29 | with open(answer_fname, 'w') as fout: 30 | for phase_id in range(current_phases+1): 31 | with open(train % phase_id) as fin: 32 | for line in fin: 33 | user_id, item_id, timestamp = line.split(',') 34 | user_id, item_id, timestamp = ( 35 | int(user_id), int(item_id), float(timestamp)) 36 | item_deg[item_id] += 1 37 | with open(test % phase_id) as fin: 38 | for line in fin: 39 | user_id, item_id, timestamp = line.split(',') 40 | user_id, item_id, timestamp = ( 41 | int(user_id), int(item_id), float(timestamp)) 42 | item_deg[item_id] += 1 43 | with open(answer % phase_id) as fin: 44 | for line in fin: 45 | user_id, item_id, timestamp = line.split(',') 46 | user_id, item_id, timestamp = ( 47 | int(user_id), int(item_id), float(timestamp)) 48 | assert user_id % 11 == phase_id 49 | print(phase_id, user_id, item_id, item_deg[item_id], 50 | sep=',', file=fout) 51 | 52 | 53 | # In[2]: 54 | 55 | 56 | _create_answer_file_for_evaluation('./user_data/offline/offline_debias_track_answer.csv') 57 | 58 | 59 | # In[ ]: 60 | 61 | 62 | 63 | 64 | 65 | # In[3]: 66 | 67 | 68 | # _create_answer_file_for_evaluation('model'+str(number)+'/model_'+str(number)+'_debias_track_answer.csv') 69 | 70 | -------------------------------------------------------------------------------- /code/1_DataPreprocessing/04_TransformDateTime-Copy1.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # In[1]: 5 | 6 | 7 | import pandas as pd 8 | from tqdm import tqdm 9 | from collections import defaultdict 10 | import math 11 | import numpy 
as np 12 | import datetime 13 | 14 | 15 | # In[2]: 16 | 17 | 18 | random_number_1 = 41152582 19 | random_number_2 = 1570909091 20 | 21 | 22 | # In[3]: 23 | 24 | 25 | train_path = './user_data/offline/' 26 | test_path = './user_data/offline/' 27 | 28 | now_phase = 9 29 | for c in range(now_phase + 1): 30 | print('phase:', c) 31 | click_train = pd.read_csv(train_path + '/offline_train_click-{}.csv'.format(c), header=None, names=['user_id', 'item_id', 'time']) 32 | click_test = pd.read_csv(test_path + '/offline_test_click-{}.csv'.format(c), header=None, names=['user_id', 'item_id', 'time']) 33 | click_query = pd.read_csv(test_path + '/offline_test_qtime-{}.csv'.format(c), header=None, names=['user_id', 'item_id', 'time']) 34 | 35 | click_train['unix_time'] = click_train['time'].apply(lambda x: x * random_number_2 + random_number_1) 36 | click_train['datetime'] = click_train['unix_time'].apply(lambda x: datetime.datetime.fromtimestamp(x)) 37 | 38 | click_train.to_csv(train_path+'/offline_train_click_{}_time.csv'.format(c),index=False) 39 | 40 | click_test['unix_time'] = click_test['time'].apply(lambda x: x * random_number_2 + random_number_1) 41 | click_test['datetime'] = click_test['unix_time'].apply(lambda x: datetime.datetime.fromtimestamp(x)) 42 | 43 | click_test.to_csv(test_path+'/offline_test_click_{}_time.csv'.format(c),index=False) 44 | 45 | click_query['unix_time'] = click_query['time'].apply(lambda x: x * random_number_2 + random_number_1) 46 | click_query['datetime'] = click_query['unix_time'].apply(lambda x: datetime.datetime.fromtimestamp(x)) 47 | 48 | click_query.to_csv(test_path+'/offline_test_qtime_{}_time.csv'.format(c),index=False) 49 | 50 | 51 | 52 | # In[4]: 53 | 54 | 55 | num = 1 56 | train_path = './user_data/model_'+str(num) 57 | test_path = './user_data/model_'+str(num) 58 | 59 | now_phase = 9 60 | for c in range(now_phase + 1): 61 | print('phase:', c) 62 | click_train = pd.read_csv(train_path + '/model_'+str(num)+'_train_click-{}.csv'.format(c), header=None, names=['user_id', 'item_id', 'time']) 63 | click_test = pd.read_csv(test_path + '/model_'+str(num)+'_test_click-{}.csv'.format(c), header=None, names=['user_id', 'item_id', 'time']) 64 | click_query = pd.read_csv(test_path + '/model_'+str(num)+'_test_qtime-{}.csv'.format(c), header=None, names=['user_id', 'item_id', 'time']) 65 | 66 | click_train['unix_time'] = click_train['time'].apply(lambda x: x * random_number_2 + random_number_1) 67 | click_train['datetime'] = click_train['unix_time'].apply(lambda x: datetime.datetime.fromtimestamp(x)) 68 | 69 | click_train.to_csv(train_path+'/model_'+str(num)+'_train_click_{}_time.csv'.format(c),index=False) 70 | 71 | click_test['unix_time'] = click_test['time'].apply(lambda x: x * random_number_2 + random_number_1) 72 | click_test['datetime'] = click_test['unix_time'].apply(lambda x: datetime.datetime.fromtimestamp(x)) 73 | 74 | click_test.to_csv(test_path+'/model_'+str(num)+'_test_click_{}_time.csv'.format(c),index=False) 75 | 76 | click_query['unix_time'] = click_query['time'].apply(lambda x: x * random_number_2 + random_number_1) 77 | click_query['datetime'] = click_query['unix_time'].apply(lambda x: datetime.datetime.fromtimestamp(x)) 78 | 79 | click_query.to_csv(test_path+'/model_'+str(num)+'_test_qtime_{}_time.csv'.format(c),index=False) 80 | 81 | 82 | 83 | # In[5]: 84 | 85 | 86 | train_path = './user_data/dataset' 87 | test_path = './user_data/dataset' 88 | 89 | now_phase = 9 90 | for c in range(now_phase + 1): 91 | print('phase:', c) 92 | click_train = 
pd.read_csv(train_path + '/underexpose_train_click-{}.csv'.format(c), header=None, names=['user_id', 'item_id', 'time']) 93 | click_test = pd.read_csv(test_path + '/underexpose_test_click-{}.csv'.format(c), header=None, names=['user_id', 'item_id', 'time']) 94 | click_query = pd.read_csv(test_path + '/underexpose_test_qtime-{}.csv'.format(c), header=None, names=['user_id', 'time']) 95 | 96 | click_train['unix_time'] = click_train['time'].apply(lambda x: x * random_number_2 + random_number_1) 97 | click_train['datetime'] = click_train['unix_time'].apply(lambda x: datetime.datetime.fromtimestamp(x)) 98 | 99 | click_train.to_csv(train_path+'/underexpose_train_click_{}_time.csv'.format(c),index=False) 100 | 101 | click_test['unix_time'] = click_test['time'].apply(lambda x: x * random_number_2 + random_number_1) 102 | click_test['datetime'] = click_test['unix_time'].apply(lambda x: datetime.datetime.fromtimestamp(x)) 103 | 104 | click_test.to_csv(test_path+'/underexpose_test_click_{}_time.csv'.format(c),index=False) 105 | 106 | click_query['unix_time'] = click_query['time'].apply(lambda x: x * random_number_2 + random_number_1) 107 | click_query['datetime'] = click_query['unix_time'].apply(lambda x: datetime.datetime.fromtimestamp(x)) 108 | 109 | click_query.to_csv(test_path+'/underexpose_test_qtime_{}_time.csv'.format(c),index=False) 110 | 111 | 112 | -------------------------------------------------------------------------------- /code/1_DataPreprocessing/05_Generate_img_txt_vec.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from gensim.models import KeyedVectors 3 | 4 | 5 | 6 | train_path = './data/underexpose_train/' 7 | item = pd.read_csv(train_path+'underexpose_item_feat.csv',header=None) 8 | 9 | item[1] = item[1].apply(lambda x: float(str(x).replace('[', ''))) 10 | item[256] = item[256].apply(lambda x: float(str(x).replace(']', ''))) 11 | item[128] = item[128].apply(lambda x: float(str(x).replace(']', ''))) 12 | item[129] = item[129].apply(lambda x: float(str(x).replace('[', ''))) 13 | item.columns = ['item_id'] + ['txt_vec_{}'.format(f) for f in range(0, 128)] + ['img_vec_{}'.format(f) for f in 14 | range(0, 128)] 15 | item_nun=item['item_id'].nunique() 16 | 17 | item[['item_id'] + ['img_vec_{}'.format(f) for f in range(0, 128)]].to_csv("user_data/w2v_img_vec.txt", sep=" ", 18 | header=[str(item_nun), '128'] + [""] * 127, 19 | index=False, 20 | encoding='UTF-8') 21 | 22 | item[['item_id'] + ['txt_vec_{}'.format(f) for f in range(0, 128)]].to_csv("user_data/w2v_txt_vec.txt", 23 | sep=" ", 24 | header=[str(item_nun), '128'] + [""] * 127, 25 | index=False, 26 | encoding='UTF-8') 27 | 28 | txt_vec_model = KeyedVectors.load_word2vec_format("./user_data/" + 'w2v_txt_vec.txt', binary=False) 29 | txt_vec_model = KeyedVectors.load_word2vec_format("./user_data/" + 'w2v_img_vec.txt', binary=False) -------------------------------------------------------------------------------- /code/1_DataPreprocessing/ipynb_file.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ChuanyuXue/KDDCUP-2020/b675a1b01ba430845f9e23a466bc82501f57abb2/code/1_DataPreprocessing/ipynb_file.zip -------------------------------------------------------------------------------- /code/2_Similarity/RA_Wu_model1.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # In[1]: 5 | 6 | 7 | import pandas as pd 8 | import 
numpy as np 9 | from tqdm import tqdm 10 | import os 11 | from collections import defaultdict 12 | import math 13 | import json 14 | from sys import stdout 15 | import pickle 16 | 17 | 18 | # In[ ]: 19 | 20 | 21 | 22 | 23 | 24 | # In[ ]: 25 | 26 | 27 | 28 | 29 | 30 | # # RA、AA一起运行的 31 | 32 | # In[ ]: 33 | 34 | 35 | 36 | 37 | 38 | # In[2]: 39 | 40 | 41 | now_phase = 9 42 | 43 | input_path = './user_data/model_1/new_similarity/' 44 | out_path = './user_data/model_1/new_similarity/' 45 | 46 | 47 | 48 | for num in range(now_phase+1): 49 | 50 | # 获取itemCF相似度 51 | with open(input_path+'itemCF_new'+str(num)+'.pkl','rb') as f: 52 | item_sim_list_tmp = pickle.load(f) 53 | 54 | item_sim = {} 55 | for item in item_sim_list_tmp: 56 | item_sim.setdefault(item, {}) 57 | for related_item in item_sim_list_tmp[item]: 58 | if item_sim_list_tmp[item][related_item] > 0.005: 59 | item_sim[item][related_item] = item_sim_list_tmp[item][related_item] 60 | 61 | item_sim_list_tmp = [] 62 | 63 | strengh_dict = dict() 64 | print('Counting degree') 65 | for item in tqdm(item_sim): 66 | strengh_dict[item] = sum(item_sim[item].values()) 67 | 68 | strengh_AA_dict = dict() 69 | print('Counting degree') 70 | for item in tqdm(item_sim): 71 | strengh_AA_dict[item] = math.log(1+sum(item_sim[item].values()) ) 72 | 73 | 74 | #RA 75 | RA_sim = dict() 76 | for item in tqdm(item_sim): 77 | neighbors = list(set(item_sim[item].keys())) 78 | for item1 in neighbors: 79 | if item in item_sim[item1]: 80 | RA_sim.setdefault(item1, {}) 81 | for item2 in neighbors: 82 | if item1 != item2: 83 | RA_sim[item1].setdefault(item2, 0) 84 | RA_sim[item1][item2] += item_sim[item1][item] * item_sim[item][item2]/strengh_dict[item] 85 | 86 | 87 | new_RA = dict() 88 | for item1 in tqdm(RA_sim): 89 | new_RA[item1] = {i: int(x * 1e3) / 1e3 for i, x in RA_sim[item1].items() if x > 1e-3} 90 | 91 | RA_sim = [] 92 | print('Saving') 93 | write_file = open(out_path+'RA_P'+str(num)+'_new.pkl', 'wb') 94 | pickle.dump(new_RA, write_file) 95 | write_file.close() 96 | 97 | 98 | new_RA = [] 99 | 100 | 101 | #RA 102 | AA_sim = dict() 103 | for item in tqdm(item_sim): 104 | neighbors = list(set(item_sim[item].keys())) 105 | for item1 in neighbors: 106 | if item in item_sim[item1]: 107 | AA_sim.setdefault(item1, {}) 108 | for item2 in neighbors: 109 | if item1 != item2: 110 | AA_sim[item1].setdefault(item2, 0) 111 | AA_sim[item1][item2] += item_sim[item1][item] * item_sim[item][item2]/strengh_AA_dict[item] 112 | 113 | 114 | new_AA = dict() 115 | for item1 in tqdm(AA_sim): 116 | new_AA[item1] = {i: int(x * 1e3) / 1e3 for i, x in AA_sim[item1].items() if x > 1e-3} 117 | 118 | AA_sim = [] 119 | print('Saving') 120 | write_file = open(out_path+'AA_P'+str(num)+'_new.pkl', 'wb') 121 | pickle.dump(new_AA, write_file) 122 | write_file.close() 123 | 124 | 125 | new_AA = [] 126 | 127 | 128 | 129 | 130 | 131 | 132 | # In[ ]: 133 | 134 | 135 | 136 | 137 | 138 | # In[ ]: 139 | 140 | 141 | 142 | 143 | 144 | # In[ ]: 145 | 146 | 147 | 148 | 149 | 150 | # In[ ]: 151 | 152 | 153 | 154 | 155 | 156 | # In[ ]: 157 | 158 | 159 | 160 | 161 | 162 | # In[ ]: 163 | 164 | 165 | 166 | 167 | 168 | # In[ ]: 169 | 170 | 171 | 172 | 173 | 174 | # In[ ]: 175 | 176 | 177 | 178 | 179 | 180 | # In[ ]: 181 | 182 | 183 | 184 | 185 | 186 | # In[ ]: 187 | 188 | 189 | 190 | 191 | 192 | # In[ ]: 193 | 194 | 195 | 196 | 197 | 198 | # In[ ]: 199 | 200 | 201 | 202 | 203 | 204 | # In[ ]: 205 | 206 | 207 | 208 | 209 | 210 | # In[ ]: 211 | 212 | 213 | 214 | 215 | 216 | # # CN、HPI、HDI、LHN1是一起运行的 217 | 218 | # In[3]: 
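# ---- Illustrative sketch (added for readability; not part of the original
# ---- pipeline). The cell below derives second-order scores from the weighted
# ---- itemCF graph: for items a and b with a shared neighbour z,
# ----     CN(a, b)   = sum_z w(a, z) * w(z, b)     (weighted common neighbours)
# ----     HPI(a, b)  = CN(a, b) / min(s(a), s(b))  (s = node strength)
# ----     HDI(a, b)  = CN(a, b) / max(s(a), s(b))
# ----     LHN1(a, b) = CN(a, b) / (s(a) * s(b))
# ---- A toy, self-contained version of the CN loops used below:
def _toy_weighted_common_neighbours(sim):
    # sim: {item: {neighbour: weight}} -> {a: {b: CN(a, b)}}
    cn = {}
    for z in sim:                           # z plays the role of the shared neighbour
        for a in sim[z]:
            if z not in sim.get(a, {}):     # mirrors the "if item in item_sim[item1]" guard
                continue
            for b in sim[z]:
                if a != b:
                    cn.setdefault(a, {}).setdefault(b, 0.0)
                    cn[a][b] += sim[a][z] * sim[z][b]
    return cn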
219 | 220 | 221 | now_phase = 9 222 | 223 | input_path = './user_data/model_1/new_similarity/' 224 | out_path = './user_data/model_1/new_similarity/' 225 | 226 | 227 | 228 | for num in range(now_phase+1): 229 | 230 | # 获取itemCF相似度 231 | with open(input_path+'itemCF_new'+str(num)+'.pkl','rb') as f: 232 | item_sim_list_tmp = pickle.load(f) 233 | 234 | item_sim = {} 235 | for item in item_sim_list_tmp: 236 | item_sim.setdefault(item, {}) 237 | for related_item in item_sim_list_tmp[item]: 238 | if item_sim_list_tmp[item][related_item] > 0.005: 239 | item_sim[item][related_item] = item_sim_list_tmp[item][related_item] 240 | 241 | item_sim_list_tmp = [] 242 | 243 | #CN 244 | CN_sim = dict() 245 | for item in tqdm(item_sim): 246 | neighbors = list(set(item_sim[item].keys())) 247 | for item1 in neighbors: 248 | if item in item_sim[item1]: 249 | CN_sim.setdefault(item1, {}) 250 | for item2 in neighbors: 251 | if item1 != item2: 252 | CN_sim[item1].setdefault(item2, 0) 253 | CN_sim[item1][item2] += item_sim[item1][item] * item_sim[item][item2] 254 | 255 | 256 | new_CN = dict() 257 | for item1 in tqdm(CN_sim): 258 | new_CN[item1] = {i: int(x * 1e3) / 1e3 for i, x in CN_sim[item1].items() if x > 1e-3} 259 | 260 | CN_sim = [] 261 | print('Saving') 262 | write_file = open(out_path+'CN_P'+str(num)+'_new.pkl', 'wb') 263 | pickle.dump(new_CN, write_file) 264 | write_file.close() 265 | 266 | strengh_dict = dict() 267 | print('Counting degree') 268 | for item in tqdm(item_sim): 269 | strengh_dict[item] = sum(item_sim[item].values()) 270 | 271 | #HPI 272 | HPI_sim = dict() 273 | for item in tqdm(new_CN): 274 | HPI_sim.setdefault(item,{}) 275 | for related_item in new_CN[item]: 276 | HPI_sim[item][related_item] = new_CN[item][related_item]/max(0.005,min(strengh_dict[item],strengh_dict[related_item])) 277 | 278 | print('Saving') 279 | write_file = open(out_path+'HPI_P'+str(num)+'_new.pkl', 'wb') 280 | pickle.dump(HPI_sim, write_file) 281 | write_file.close() 282 | 283 | HPI_sim = [] 284 | 285 | 286 | #HDI 287 | HDI_sim = dict() 288 | for item in tqdm(new_CN): 289 | HDI_sim.setdefault(item,{}) 290 | for related_item in new_CN[item]: 291 | HDI_sim[item][related_item] = new_CN[item][related_item]/max(strengh_dict[item],strengh_dict[related_item]) 292 | 293 | print('Saving') 294 | write_file = open(out_path+'HDI_P'+str(num)+'_new.pkl', 'wb') 295 | pickle.dump(HDI_sim, write_file) 296 | write_file.close() 297 | HDI_sim = [] 298 | 299 | 300 | 301 | #LHN1 302 | LHN1_sim = dict() 303 | for item in tqdm(new_CN): 304 | LHN1_sim.setdefault(item,{}) 305 | for related_item in new_CN[item]: 306 | LHN1_sim[item][related_item] = new_CN[item][related_item]/( max(0.005,strengh_dict[item]) * max(0.005,strengh_dict[related_item])) 307 | 308 | print('Saving') 309 | write_file = open(out_path+'LHN1_P'+str(num)+'_new.pkl', 'wb') 310 | pickle.dump(LHN1_sim, write_file) 311 | write_file.close() 312 | LHN1_sim = [] 313 | 314 | 315 | new_CN = [] 316 | 317 | 318 | # In[ ]: 319 | 320 | 321 | 322 | 323 | 324 | # In[ ]: 325 | 326 | 327 | 328 | 329 | 330 | # In[ ]: 331 | 332 | 333 | 334 | 335 | -------------------------------------------------------------------------------- /code/2_Similarity/RA_Wu_offline.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # In[1]: 5 | 6 | 7 | import pandas as pd 8 | import numpy as np 9 | from tqdm import tqdm 10 | import os 11 | from collections import defaultdict 12 | import math 13 | import json 14 | from sys import 
stdout 15 | import pickle 16 | 17 | 18 | # In[ ]: 19 | 20 | 21 | 22 | 23 | 24 | # In[ ]: 25 | 26 | 27 | 28 | 29 | 30 | # In[ ]: 31 | 32 | 33 | 34 | 35 | 36 | # In[ ]: 37 | 38 | 39 | 40 | 41 | 42 | # In[ ]: 43 | 44 | 45 | 46 | 47 | 48 | # # RA、AA一起运行的 49 | 50 | # In[2]: 51 | 52 | 53 | now_phase = 9 54 | 55 | input_path = './user_data/offline/new_similarity/' 56 | out_path = './user_data/offline/new_similarity/' 57 | 58 | 59 | for num in range(now_phase+1): 60 | 61 | # 获取itemCF相似度 62 | with open(input_path+'itemCF_new'+str(num)+'.pkl','rb') as f: 63 | item_sim_list_tmp = pickle.load(f) 64 | 65 | item_sim = {} 66 | for item in item_sim_list_tmp: 67 | item_sim.setdefault(item, {}) 68 | for related_item in item_sim_list_tmp[item]: 69 | if item_sim_list_tmp[item][related_item] > 0.005: 70 | item_sim[item][related_item] = item_sim_list_tmp[item][related_item] 71 | 72 | item_sim_list_tmp = [] 73 | 74 | strengh_dict = dict() 75 | print('Counting degree') 76 | for item in tqdm(item_sim): 77 | strengh_dict[item] = sum(item_sim[item].values()) 78 | 79 | strengh_AA_dict = dict() 80 | print('Counting degree') 81 | for item in tqdm(item_sim): 82 | strengh_AA_dict[item] = math.log(1+sum(item_sim[item].values()) ) 83 | 84 | 85 | #RA 86 | RA_sim = dict() 87 | for item in tqdm(item_sim): 88 | neighbors = list(set(item_sim[item].keys())) 89 | for item1 in neighbors: 90 | if item in item_sim[item1]: 91 | RA_sim.setdefault(item1, {}) 92 | for item2 in neighbors: 93 | if item1 != item2: 94 | RA_sim[item1].setdefault(item2, 0) 95 | RA_sim[item1][item2] += item_sim[item1][item] * item_sim[item][item2]/strengh_dict[item] 96 | 97 | 98 | new_RA = dict() 99 | for item1 in tqdm(RA_sim): 100 | new_RA[item1] = {i: int(x * 1e3) / 1e3 for i, x in RA_sim[item1].items() if x > 1e-3} 101 | 102 | RA_sim = [] 103 | print('Saving') 104 | write_file = open(out_path+'RA_P'+str(num)+'_new.pkl', 'wb') 105 | pickle.dump(new_RA, write_file) 106 | write_file.close() 107 | 108 | 109 | new_RA = [] 110 | 111 | 112 | #RA 113 | AA_sim = dict() 114 | for item in tqdm(item_sim): 115 | neighbors = list(set(item_sim[item].keys())) 116 | for item1 in neighbors: 117 | if item in item_sim[item1]: 118 | AA_sim.setdefault(item1, {}) 119 | for item2 in neighbors: 120 | if item1 != item2: 121 | AA_sim[item1].setdefault(item2, 0) 122 | AA_sim[item1][item2] += item_sim[item1][item] * item_sim[item][item2]/strengh_AA_dict[item] 123 | 124 | 125 | new_AA = dict() 126 | for item1 in tqdm(AA_sim): 127 | new_AA[item1] = {i: int(x * 1e3) / 1e3 for i, x in AA_sim[item1].items() if x > 1e-3} 128 | 129 | AA_sim = [] 130 | print('Saving') 131 | write_file = open(out_path+'AA_P'+str(num)+'_new.pkl', 'wb') 132 | pickle.dump(new_AA, write_file) 133 | write_file.close() 134 | 135 | 136 | new_AA = [] 137 | 138 | 139 | 140 | 141 | # In[ ]: 142 | 143 | 144 | 145 | 146 | 147 | # In[ ]: 148 | 149 | 150 | 151 | 152 | 153 | # In[ ]: 154 | 155 | 156 | 157 | 158 | 159 | # In[ ]: 160 | 161 | 162 | 163 | 164 | 165 | # In[ ]: 166 | 167 | 168 | 169 | 170 | 171 | # In[ ]: 172 | 173 | 174 | 175 | 176 | 177 | # # CN、HPI、HDI、LHN1是一起运行的 178 | 179 | # In[3]: 180 | 181 | 182 | now_phase = 9 183 | 184 | input_path = './user_data/offline/new_similarity/' 185 | out_path = './user_data/offline/new_similarity/' 186 | 187 | 188 | 189 | for num in range(now_phase+1): 190 | 191 | # 获取itemCF相似度 192 | with open(input_path+'itemCF_new'+str(num)+'.pkl','rb') as f: 193 | item_sim_list_tmp = pickle.load(f) 194 | 195 | item_sim = {} 196 | for item in item_sim_list_tmp: 197 | item_sim.setdefault(item, 
{}) 198 | for related_item in item_sim_list_tmp[item]: 199 | if item_sim_list_tmp[item][related_item] > 0.005: 200 | item_sim[item][related_item] = item_sim_list_tmp[item][related_item] 201 | 202 | item_sim_list_tmp = [] 203 | 204 | #CN 205 | CN_sim = dict() 206 | for item in tqdm(item_sim): 207 | neighbors = list(set(item_sim[item].keys())) 208 | for item1 in neighbors: 209 | if item in item_sim[item1]: 210 | CN_sim.setdefault(item1, {}) 211 | for item2 in neighbors: 212 | if item1 != item2: 213 | CN_sim[item1].setdefault(item2, 0) 214 | CN_sim[item1][item2] += item_sim[item1][item] * item_sim[item][item2] 215 | 216 | 217 | new_CN = dict() 218 | for item1 in tqdm(CN_sim): 219 | new_CN[item1] = {i: int(x * 1e3) / 1e3 for i, x in CN_sim[item1].items() if x > 1e-3} 220 | 221 | CN_sim = [] 222 | print('Saving') 223 | write_file = open(out_path+'CN_P'+str(num)+'_new.pkl', 'wb') 224 | pickle.dump(new_CN, write_file) 225 | write_file.close() 226 | 227 | strengh_dict = dict() 228 | print('Counting degree') 229 | for item in tqdm(item_sim): 230 | strengh_dict[item] = sum(item_sim[item].values()) 231 | 232 | #HPI 233 | HPI_sim = dict() 234 | for item in tqdm(new_CN): 235 | HPI_sim.setdefault(item,{}) 236 | for related_item in new_CN[item]: 237 | HPI_sim[item][related_item] = new_CN[item][related_item]/min(strengh_dict[item],strengh_dict[related_item]) 238 | 239 | print('Saving') 240 | write_file = open(out_path+'HPI_P'+str(num)+'_new.pkl', 'wb') 241 | pickle.dump(HPI_sim, write_file) 242 | write_file.close() 243 | 244 | HPI_sim = [] 245 | 246 | 247 | #HDI 248 | HDI_sim = dict() 249 | for item in tqdm(new_CN): 250 | HDI_sim.setdefault(item,{}) 251 | for related_item in new_CN[item]: 252 | HDI_sim[item][related_item] = new_CN[item][related_item]/max(strengh_dict[item],strengh_dict[related_item]) 253 | 254 | print('Saving') 255 | write_file = open(out_path+'HDI_P'+str(num)+'_new.pkl', 'wb') 256 | pickle.dump(HDI_sim, write_file) 257 | write_file.close() 258 | HDI_sim = [] 259 | 260 | 261 | 262 | #LHN1 263 | LHN1_sim = dict() 264 | for item in tqdm(new_CN): 265 | LHN1_sim.setdefault(item,{}) 266 | for related_item in new_CN[item]: 267 | LHN1_sim[item][related_item] = new_CN[item][related_item]/(strengh_dict[item]*strengh_dict[related_item]) 268 | 269 | print('Saving') 270 | write_file = open(out_path+'LHN1_P'+str(num)+'_new.pkl', 'wb') 271 | pickle.dump(LHN1_sim, write_file) 272 | write_file.close() 273 | LHN1_sim = [] 274 | 275 | 276 | new_CN = [] 277 | 278 | 279 | # In[ ]: 280 | 281 | 282 | 283 | 284 | 285 | # In[ ]: 286 | 287 | 288 | 289 | 290 | 291 | # In[ ]: 292 | 293 | 294 | 295 | 296 | -------------------------------------------------------------------------------- /code/2_Similarity/RA_Wu_online.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # In[1]: 5 | 6 | 7 | import pandas as pd 8 | import numpy as np 9 | from tqdm import tqdm 10 | import os 11 | from collections import defaultdict 12 | import math 13 | import json 14 | from sys import stdout 15 | import pickle 16 | 17 | 18 | # In[ ]: 19 | 20 | 21 | 22 | 23 | 24 | # In[2]: 25 | 26 | 27 | now_phase = 9 28 | 29 | input_path = './user_data/dataset/new_similarity/' 30 | out_path = './user_data/dataset/new_similarity/' 31 | 32 | 33 | for num in range(now_phase+1): 34 | 35 | # 获取itemCF相似度 36 | with open(input_path+'itemCF_new'+str(num)+'.pkl','rb') as f: 37 | item_sim_list_tmp = pickle.load(f) 38 | 39 | item_sim = {} 40 | for item in item_sim_list_tmp: 41 | 
item_sim.setdefault(item, {}) 42 | for related_item in item_sim_list_tmp[item]: 43 | if item_sim_list_tmp[item][related_item] > 0.005: 44 | item_sim[item][related_item] = item_sim_list_tmp[item][related_item] 45 | 46 | item_sim_list_tmp = [] 47 | 48 | strengh_dict = dict() 49 | print('Counting degree') 50 | for item in tqdm(item_sim): 51 | strengh_dict[item] = sum(item_sim[item].values()) 52 | 53 | strengh_AA_dict = dict() 54 | print('Counting degree') 55 | for item in tqdm(item_sim): 56 | strengh_AA_dict[item] = math.log(1+sum(item_sim[item].values()) ) 57 | 58 | 59 | #RA 60 | RA_sim = dict() 61 | for item in tqdm(item_sim): 62 | neighbors = list(set(item_sim[item].keys())) 63 | for item1 in neighbors: 64 | if item in item_sim[item1]: 65 | RA_sim.setdefault(item1, {}) 66 | for item2 in neighbors: 67 | if item1 != item2: 68 | RA_sim[item1].setdefault(item2, 0) 69 | RA_sim[item1][item2] += item_sim[item1][item] * item_sim[item][item2]/strengh_dict[item] 70 | 71 | 72 | new_RA = dict() 73 | for item1 in tqdm(RA_sim): 74 | new_RA[item1] = {i: int(x * 1e3) / 1e3 for i, x in RA_sim[item1].items() if x > 1e-3} 75 | 76 | RA_sim = [] 77 | print('Saving') 78 | write_file = open(out_path+'RA_P'+str(num)+'_new.pkl', 'wb') 79 | pickle.dump(new_RA, write_file) 80 | write_file.close() 81 | 82 | 83 | new_RA = [] 84 | 85 | 86 | #RA 87 | AA_sim = dict() 88 | for item in tqdm(item_sim): 89 | neighbors = list(set(item_sim[item].keys())) 90 | for item1 in neighbors: 91 | if item in item_sim[item1]: 92 | AA_sim.setdefault(item1, {}) 93 | for item2 in neighbors: 94 | if item1 != item2: 95 | AA_sim[item1].setdefault(item2, 0) 96 | AA_sim[item1][item2] += item_sim[item1][item] * item_sim[item][item2]/strengh_AA_dict[item] 97 | 98 | 99 | new_AA = dict() 100 | for item1 in tqdm(AA_sim): 101 | new_AA[item1] = {i: int(x * 1e3) / 1e3 for i, x in AA_sim[item1].items() if x > 1e-3} 102 | 103 | AA_sim = [] 104 | print('Saving') 105 | write_file = open(out_path+'AA_P'+str(num)+'_new.pkl', 'wb') 106 | pickle.dump(new_AA, write_file) 107 | write_file.close() 108 | 109 | 110 | new_AA = [] 111 | 112 | 113 | 114 | 115 | # In[ ]: 116 | 117 | 118 | 119 | 120 | 121 | # In[ ]: 122 | 123 | 124 | 125 | 126 | 127 | # In[ ]: 128 | 129 | 130 | 131 | 132 | 133 | # In[ ]: 134 | 135 | 136 | 137 | 138 | 139 | # In[ ]: 140 | 141 | 142 | 143 | 144 | 145 | # In[ ]: 146 | 147 | 148 | 149 | 150 | 151 | # In[ ]: 152 | 153 | 154 | 155 | 156 | 157 | # In[3]: 158 | 159 | 160 | now_phase = 9 161 | 162 | input_path = './user_data/dataset/new_similarity/' 163 | out_path = './user_data/dataset/new_similarity/' 164 | 165 | 166 | 167 | for num in range(now_phase+1): 168 | 169 | # 获取itemCF相似度 170 | with open(input_path+'itemCF_new'+str(num)+'.pkl','rb') as f: 171 | item_sim_list_tmp = pickle.load(f) 172 | 173 | item_sim = {} 174 | for item in item_sim_list_tmp: 175 | item_sim.setdefault(item, {}) 176 | for related_item in item_sim_list_tmp[item]: 177 | if item_sim_list_tmp[item][related_item] > 0.005: 178 | item_sim[item][related_item] = item_sim_list_tmp[item][related_item] 179 | 180 | item_sim_list_tmp = [] 181 | 182 | #CN 183 | CN_sim = dict() 184 | for item in tqdm(item_sim): 185 | neighbors = list(set(item_sim[item].keys())) 186 | for item1 in neighbors: 187 | if item in item_sim[item1]: 188 | CN_sim.setdefault(item1, {}) 189 | for item2 in neighbors: 190 | if item1 != item2: 191 | CN_sim[item1].setdefault(item2, 0) 192 | CN_sim[item1][item2] += item_sim[item1][item] * item_sim[item][item2] 193 | 194 | 195 | new_CN = dict() 196 | for item1 in 
tqdm(CN_sim): 197 | new_CN[item1] = {i: int(x * 1e3) / 1e3 for i, x in CN_sim[item1].items() if x > 1e-3} 198 | 199 | CN_sim = [] 200 | print('Saving') 201 | write_file = open(out_path+'CN_P'+str(num)+'_new.pkl', 'wb') 202 | pickle.dump(new_CN, write_file) 203 | write_file.close() 204 | 205 | strengh_dict = dict() 206 | print('Counting degree') 207 | for item in tqdm(item_sim): 208 | strengh_dict[item] = sum(item_sim[item].values()) 209 | 210 | #HPI 211 | HPI_sim = dict() 212 | for item in tqdm(new_CN): 213 | HPI_sim.setdefault(item,{}) 214 | for related_item in new_CN[item]: 215 | HPI_sim[item][related_item] = new_CN[item][related_item]/min(strengh_dict[item],strengh_dict[related_item]) 216 | 217 | print('Saving') 218 | write_file = open(out_path+'HPI_P'+str(num)+'_new.pkl', 'wb') 219 | pickle.dump(HPI_sim, write_file) 220 | write_file.close() 221 | 222 | HPI_sim = [] 223 | 224 | 225 | #HDI 226 | HDI_sim = dict() 227 | for item in tqdm(new_CN): 228 | HDI_sim.setdefault(item,{}) 229 | for related_item in new_CN[item]: 230 | HDI_sim[item][related_item] = new_CN[item][related_item]/max(strengh_dict[item],strengh_dict[related_item]) 231 | 232 | print('Saving') 233 | write_file = open(out_path+'HDI_P'+str(num)+'_new.pkl', 'wb') 234 | pickle.dump(HDI_sim, write_file) 235 | write_file.close() 236 | HDI_sim = [] 237 | 238 | 239 | 240 | #LHN1 241 | LHN1_sim = dict() 242 | for item in tqdm(new_CN): 243 | LHN1_sim.setdefault(item,{}) 244 | for related_item in new_CN[item]: 245 | LHN1_sim[item][related_item] = new_CN[item][related_item]/(strengh_dict[item]*strengh_dict[related_item]) 246 | 247 | print('Saving') 248 | write_file = open(out_path+'LHN1_P'+str(num)+'_new.pkl', 'wb') 249 | pickle.dump(LHN1_sim, write_file) 250 | write_file.close() 251 | LHN1_sim = [] 252 | 253 | 254 | new_CN = [] 255 | 256 | 257 | # In[ ]: 258 | 259 | 260 | 261 | 262 | 263 | # In[ ]: 264 | 265 | 266 | 267 | 268 | 269 | # In[ ]: 270 | 271 | 272 | 273 | 274 | 275 | # In[ ]: 276 | 277 | 278 | 279 | 280 | 281 | # In[ ]: 282 | 283 | 284 | 285 | 286 | 287 | # In[ ]: 288 | 289 | 290 | 291 | 292 | 293 | # In[ ]: 294 | 295 | 296 | 297 | 298 | 299 | # In[ ]: 300 | 301 | 302 | 303 | 304 | 305 | # In[ ]: 306 | 307 | 308 | 309 | 310 | 311 | # In[ ]: 312 | 313 | 314 | 315 | 316 | 317 | # In[ ]: 318 | 319 | 320 | 321 | 322 | -------------------------------------------------------------------------------- /code/2_Similarity/ipynb_file.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ChuanyuXue/KDDCUP-2020/b675a1b01ba430845f9e23a466bc82501f57abb2/code/2_Similarity/ipynb_file.zip -------------------------------------------------------------------------------- /code/3_NN/ItemFeat2.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Wed Apr 29 10:01:01 2020 4 | @author: hcb 5 | """ 6 | 7 | import pandas as pd 8 | import os 9 | from config import config 10 | 11 | 12 | def get_feat(now_phase=3, base_path=None): 13 | 14 | # if base_path is None: 15 | # train_path = 'underexpose_train' 16 | # test_path = 'underexpose_test' 17 | # else: 18 | # train_path = os.path.join(base_path, 'underexpose_train') 19 | # test_path = os.path.join(base_path, 'underexpose_test') 20 | train_path = config.train_path 21 | test_path = config.test_path 22 | click_train = pd.DataFrame() 23 | click_test = pd.DataFrame() 24 | for c in range(now_phase + 1): 25 | click_tmp = pd.read_csv(train_path + 
f'/underexpose_train_click-{c}.csv', header=None, 26 | names=['user_id', 'item_id', 'time']) 27 | click_tmp['user_id'] = '1_{}_'.format(c) + click_tmp['user_id'].astype(str) 28 | click_test_tmp = pd.read_csv(test_path + f'/underexpose_test_click-{c}.csv', header=None, 29 | names=['user_id', 'item_id', 'time']) 30 | click_test_tmp['user_id'] = '0_{}_'.format(c) + click_test_tmp['user_id'].astype(str) 31 | click_train = click_train.append(click_tmp) 32 | click_test = click_test.append(click_test_tmp) 33 | all_click = click_train.append(click_test) 34 | print(all_click['item_id'].nunique()) 35 | item_df = all_click.groupby('item_id')['time'].count().reset_index() 36 | item_df.columns = ['item_id', 'degree'] 37 | 38 | feat = pd.read_csv('./data/underexpose_train/underexpose_item_feat.csv', header=None) 39 | feat[1] = feat[1].apply(lambda x:x[1:]).astype(float) 40 | feat[128] = feat[128].apply(lambda x:x[:-1]).astype(float) 41 | feat[129] = feat[129].apply(lambda x:x[1:]).astype(float) 42 | feat[256] = feat[256].apply(lambda x:x[:-1]).astype(float) 43 | feat.columns = ['item_id'] + ['feat'+str(i) for i in range(256)] 44 | 45 | item_df = item_df.merge(feat, on='item_id', how='left') 46 | print(item_df['item_id'].nunique()) 47 | def transform(x): 48 | if x > 150 and x <400: 49 | x = (x-150) // 25 * 25 +150 50 | elif x>=400: 51 | x = 400 52 | return x 53 | 54 | item_df['degree'] = item_df['degree'].apply(lambda x: transform(x)) 55 | degree_df = item_df.groupby('degree')[['feat'+str(i) for i in range(256)]].mean().reset_index() 56 | na_df = item_df[item_df['feat0'].isna()][['item_id', 'degree']].merge(degree_df, on='degree', how='left') 57 | item_df.dropna(inplace=True) 58 | item_df = pd.concat((item_df, na_df)) 59 | 60 | item_df.to_csv('item_feat.csv', index=None) 61 | 62 | if __name__ == '__main__': 63 | get_feat(now_phase=9) -------------------------------------------------------------------------------- /code/3_NN/Readme: -------------------------------------------------------------------------------- 1 | pandas==0.25.1 2 | numpy==1.17.2 3 | tensorflow-gpu==1.13.1 4 | tqdm 5 | argparse 6 | cudatoolkit==9.0 7 | cudnn==7.6.5 -------------------------------------------------------------------------------- /code/3_NN/config.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | """ 4 | Created on Thu Jun 11 12:36:15 2020 5 | 6 | @author: hcb 7 | """ 8 | 9 | class config: 10 | train_path = './user_data/dataset' 11 | test_path = './user_data/dataset' 12 | offline_path = './user_data/offline' 13 | model1_path = './user_data/model_1' 14 | 15 | save_path_offline = './user_data/offline/nn/nn_offline.csv' 16 | save_path_online = './user_data/dataset/nn/nn_underexpose.csv' 17 | save_path_model1 = './user_data/model_1/nn/nn_model_1.csv' 18 | 19 | online_item_file = './user_data/dataset/new_recall/user_item_index.csv' 20 | offline_item_file = './user_data/offline/new_recall/user_item_index.csv' 21 | model1_item_file = './user_data/model_1/new_recall/user_item_index.csv' 22 | # online_path = '' -------------------------------------------------------------------------------- /code/3_NN/sampler2.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from multiprocessing import Process, Queue 3 | import random 4 | 5 | def random_neq(l, r, s, num_neg): 6 | negs = [] 7 | for i in range(num_neg): 8 | t = np.random.randint(l, r) 9 | while t in s: 10 | t = 
np.random.randint(l, r) 11 | negs.append(t) 12 | 13 | return negs 14 | 15 | 16 | def sample_function(user_train, usernum, itemnum, batch_size, maxlen, num_neg, 17 | id2user, user2idmap2, result_queue, SEED): 18 | def sample(): 19 | 20 | # num_neg = 2 21 | user = np.random.randint(1, usernum + 1) 22 | while len(user_train[user]) <= 1: user = np.random.randint(1, usernum + 1) 23 | 24 | seq = np.zeros([maxlen], dtype=np.int32) 25 | pos = np.zeros([maxlen], dtype=np.int32) 26 | neg = np.zeros([maxlen, num_neg], dtype=np.int32) 27 | # nxt = user_train[user][-1] 28 | idx = maxlen - 1 29 | 30 | seq_ = user_train[user] 31 | st = 0 32 | if len(seq_) > (maxlen+1) : 33 | st = np.random.randint(0, len(seq_)-maxlen-1) 34 | seq_ = seq_[st:st+(maxlen+1)] 35 | nxt = seq_[-1] 36 | # nexts = [nxt] 37 | ts = set(seq_) 38 | 39 | for i in reversed(seq_[:-1]): 40 | seq[idx] = i 41 | pos[idx] = nxt 42 | if nxt != 0: neg[idx, :] = random_neq(1, itemnum + 1, ts, num_neg) 43 | nxt = i 44 | # nexts.append(i) 45 | # nxt = random.choice(nexts) 46 | 47 | idx -= 1 48 | if idx == -1: break 49 | 50 | user = id2user[user] 51 | # user = user2idmap2[int(user.split('_')[-1])] 52 | user = user2idmap2[user[2:]] 53 | return (user, seq, pos, neg) 54 | 55 | np.random.seed(SEED) 56 | while True: 57 | one_batch = [] 58 | for i in range(batch_size): 59 | one_batch.append(sample()) 60 | 61 | result_queue.put(zip(*one_batch)) 62 | 63 | 64 | class WarpSampler(object): 65 | def __init__(self, User, usernum, itemnum, id2user, user2idmap2, 66 | num_neg=20, batch_size=64, maxlen=10, n_workers=1): 67 | self.result_queue = Queue(maxsize=n_workers * 10) 68 | self.processors = [] 69 | for i in range(n_workers): 70 | self.processors.append( 71 | Process(target=sample_function, args=(User, 72 | usernum, 73 | itemnum, 74 | batch_size, 75 | maxlen, 76 | num_neg, 77 | id2user, user2idmap2, 78 | self.result_queue, 79 | np.random.randint(2e9) 80 | ))) 81 | self.processors[-1].daemon = True 82 | self.processors[-1].start() 83 | 84 | def next_batch(self): 85 | return self.result_queue.get() 86 | 87 | def close(self): 88 | for p in self.processors: 89 | p.terminate() 90 | p.join() 91 | -------------------------------------------------------------------------------- /code/3_Recall/ipynb_file.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ChuanyuXue/KDDCUP-2020/b675a1b01ba430845f9e23a466bc82501f57abb2/code/3_Recall/ipynb_file.zip -------------------------------------------------------------------------------- /code/4_RankFeature/01_sim_feature_model1_RA_AA.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # In[ ]: 5 | 6 | 7 | import pandas as pd 8 | import numpy as np 9 | from tqdm import tqdm 10 | import os 11 | from collections import defaultdict 12 | import math 13 | import json 14 | from sys import stdout 15 | import pickle 16 | 17 | 18 | # In[ ]: 19 | 20 | 21 | 22 | 23 | 24 | def ReComputeSim(sim_cor,candidate_item_list,interacted_items,item_weight_dict,flag=False): 25 | 26 | sim_list = [] 27 | for j in candidate_item_list: 28 | sim_tmp = 0 29 | for loc, i in enumerate(interacted_items): 30 | #Just for RA gernerated by offline 31 | if i not in sim_cor or j not in sim_cor[i]: 32 | continue 33 | if i in item_weight_dict: 34 | sim_tmp += sim_cor[i][j][0] * (0.7**loc) * item_weight_dict[i] if flag else sim_cor[i][j] * (0.7**loc) * item_weight_dict[i] 35 | else: 36 | sim_tmp += sim_cor[i][j][0] * 
(0.7**loc) * 0.5 if flag else sim_cor[i][j] * (0.7**loc) * 0.5 37 | 38 | sim_list.append(sim_tmp) 39 | 40 | return sim_list 41 | 42 | 43 | # In[ ]: 44 | 45 | 46 | file_name = 'recall_0531_addsim' 47 | 48 | offline = pd.read_csv('./user_data/model_1/new_recall/' + file_name + '.csv') 49 | 50 | now_phase = 9 51 | 52 | 53 | train_path = './user_data/model_1/' 54 | test_path = './user_data/model_1/' 55 | header = 'model_1' 56 | out_path = './user_data/model_1/new_similarity/' 57 | 58 | recom_item = [] 59 | 60 | whole_click = pd.DataFrame() 61 | 62 | 63 | user_id_list = [] 64 | item_id_list = [] 65 | 66 | 67 | ra_sim_list = [] 68 | aa_sim_list = [] 69 | 70 | 71 | 72 | for c in range(now_phase + 1): 73 | print('phase:', c) 74 | click_train = pd.read_csv(train_path + header + '_train_click_{}_time.csv'.format(c)) 75 | click_test = pd.read_csv(test_path + header + '_test_click_{}_time.csv'.format(c)) 76 | click_query = pd.read_csv(test_path + header + '_test_qtime_{}_time.csv'.format(c)) 77 | 78 | 79 | click_train['datetime'] = pd.to_datetime(click_train['datetime']) 80 | click_test['datetime'] = pd.to_datetime(click_test['datetime']) 81 | click_query['datetime'] = pd.to_datetime(click_query['datetime']) 82 | 83 | 84 | 85 | click_train['timestamp'] = click_train['datetime'].dt.day + ( click_train['datetime'].dt.hour + 86 | (click_train['datetime'].dt.minute + click_train['datetime'].dt.second/60)/float(60) )/float(24) 87 | 88 | click_test['timestamp'] = click_test['datetime'].dt.day + ( click_test['datetime'].dt.hour + 89 | (click_test['datetime'].dt.minute + click_test['datetime'].dt.second/60)/float(60) )/float(24) 90 | 91 | click_query['timestamp'] = click_query['datetime'].dt.day + ( click_query['datetime'].dt.hour + 92 | (click_query['datetime'].dt.minute + click_query['datetime'].dt.second/60)/float(60) )/float(24) 93 | 94 | 95 | all_click = click_train.append(click_test) 96 | 97 | 98 | with open(out_path+'user2item_new'+str(c)+'.pkl','rb') as f: 99 | user_item_tmp = pickle.load(f) 100 | 101 | with open(out_path+'RA_P'+str(c)+'_new.pkl','rb') as f: 102 | RA_sim_list_new = pickle.load(f) 103 | 104 | 105 | for i, row in click_query.iterrows(): 106 | offline_tmp = offline[offline['user_id']==row['user_id']] 107 | candidate_item_list = list(offline_tmp['item_id']) 108 | 109 | time_min = min(all_click['timestamp']) 110 | time_max = row['timestamp'] 111 | 112 | df_tmp = all_click[all_click['user_id']==row['user_id']] 113 | df_tmp = df_tmp.reset_index(drop=True) 114 | df_tmp['weight'] = 1 - (time_max-df_tmp['timestamp']+0.01) / (time_max-time_min+0.01) 115 | item_weight_dict = dict(zip(df_tmp['item_id'], df_tmp['weight'])) 116 | 117 | interacted_items = user_item_tmp[row['user_id']] 118 | interacted_items = interacted_items[::-1] 119 | 120 | sim_list_tmp = ReComputeSim(RA_sim_list_new,candidate_item_list,interacted_items,item_weight_dict) 121 | ra_sim_list += sim_list_tmp 122 | 123 | item_id_list += candidate_item_list 124 | user_id_list += [row['user_id'] for x in candidate_item_list] 125 | 126 | RA_sim_list_new = [] 127 | 128 | 129 | 130 | with open(out_path+'AA_P'+str(c)+'_new.pkl','rb') as f: 131 | AA_sim_list_new = pickle.load(f) 132 | 133 | 134 | for i, row in click_query.iterrows(): 135 | offline_tmp = offline[offline['user_id']==row['user_id']] 136 | candidate_item_list = list(offline_tmp['item_id']) 137 | 138 | time_min = min(all_click['timestamp']) 139 | time_max = row['timestamp'] 140 | 141 | df_tmp = all_click[all_click['user_id']==row['user_id']] 142 | df_tmp = 
df_tmp.reset_index(drop=True) 143 | df_tmp['weight'] = 1 - (time_max-df_tmp['timestamp']+0.01) / (time_max-time_min+0.01) 144 | item_weight_dict = dict(zip(df_tmp['item_id'], df_tmp['weight'])) 145 | 146 | interacted_items = user_item_tmp[row['user_id']] 147 | interacted_items = interacted_items[::-1] 148 | 149 | sim_list_tmp = ReComputeSim(AA_sim_list_new,candidate_item_list,interacted_items,item_weight_dict) 150 | aa_sim_list += sim_list_tmp 151 | 152 | 153 | AA_sim_list_new = [] 154 | 155 | 156 | 157 | 158 | 159 | # In[ ]: 160 | 161 | 162 | 163 | 164 | 165 | # In[ ]: 166 | 167 | 168 | offline.shape 169 | 170 | 171 | # In[ ]: 172 | 173 | 174 | 175 | 176 | 177 | # In[ ]: 178 | 179 | 180 | 181 | 182 | 183 | # In[ ]: 184 | 185 | 186 | sim_df = pd.DataFrame() 187 | sim_df['user_id'] = user_id_list 188 | sim_df['item_id'] = item_id_list 189 | sim_df['ra_sim'] = ra_sim_list 190 | sim_df['aa_sim'] = aa_sim_list 191 | 192 | 193 | # In[ ]: 194 | 195 | 196 | sim_df.shape 197 | 198 | 199 | # In[ ]: 200 | 201 | 202 | offline = offline.merge(sim_df,on=['user_id','item_id']) 203 | 204 | 205 | # In[ ]: 206 | 207 | 208 | 209 | 210 | 211 | # In[ ]: 212 | 213 | 214 | offline.to_csv('./user_data/model_1/new_recall/'+ file_name + '_addAA_RA.csv',index=False) 215 | 216 | 217 | # In[ ]: 218 | 219 | 220 | 221 | 222 | 223 | # In[ ]: 224 | 225 | 226 | 227 | 228 | 229 | # In[ ]: 230 | 231 | 232 | 233 | 234 | 235 | # In[ ]: 236 | 237 | 238 | 239 | 240 | 241 | # In[ ]: 242 | 243 | 244 | 245 | 246 | 247 | # In[ ]: 248 | 249 | 250 | 251 | 252 | 253 | # In[ ]: 254 | 255 | 256 | 257 | 258 | 259 | # In[ ]: 260 | 261 | 262 | 263 | 264 | 265 | # In[ ]: 266 | 267 | 268 | 269 | 270 | 271 | # In[ ]: 272 | 273 | 274 | 275 | 276 | 277 | # In[ ]: 278 | 279 | 280 | 281 | 282 | 283 | # In[ ]: 284 | 285 | 286 | 287 | 288 | 289 | # In[ ]: 290 | 291 | 292 | 293 | 294 | 295 | # In[ ]: 296 | 297 | 298 | 299 | 300 | 301 | # In[ ]: 302 | 303 | 304 | 305 | 306 | -------------------------------------------------------------------------------- /code/4_RankFeature/01_sim_feature_offline.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # In[1]: 5 | 6 | 7 | import pandas as pd 8 | import numpy as np 9 | from tqdm import tqdm 10 | import os 11 | from collections import defaultdict 12 | import math 13 | import json 14 | from sys import stdout 15 | import pickle 16 | 17 | 18 | # In[2]: 19 | 20 | 21 | 22 | 23 | 24 | def ReComputeSim(sim_cor,candidate_item_list,interacted_items,item_weight_dict,flag=False): 25 | 26 | sim_list = [] 27 | for j in candidate_item_list: 28 | sim_tmp = 0 29 | for loc, i in enumerate(interacted_items): 30 | #Just for RA gernerated by offline 31 | if i not in sim_cor or j not in sim_cor[i]: 32 | continue 33 | if i in item_weight_dict: 34 | sim_tmp += sim_cor[i][j][0] * (0.7**loc) * item_weight_dict[i] if flag else sim_cor[i][j] * (0.7**loc) * item_weight_dict[i] 35 | else: 36 | sim_tmp += sim_cor[i][j][0] * (0.7**loc) * 0.5 if flag else sim_cor[i][j] * (0.7**loc) * 0.5 37 | 38 | sim_list.append(sim_tmp) 39 | 40 | return sim_list 41 | 42 | 43 | # In[3]: 44 | 45 | 46 | file_name = 'recall_0531' 47 | 48 | offline = pd.read_csv('./user_data/offline/new_recall/' + file_name + '.csv') 49 | 50 | now_phase = 9 51 | 52 | 53 | train_path = './user_data/offline/' 54 | test_path = './user_data/offline/' 55 | header = 'offline' 56 | out_path = './user_data/offline/new_similarity/' 57 | 58 | recom_item = [] 59 | 60 | whole_click = pd.DataFrame() 
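# How ReComputeSim (defined above) scores each recalled candidate j for a user:
#     score(j) = sum over loc of  sim_cor[i_loc][j] * 0.7**loc * w(i_loc)
# where i_loc is the user's loc-th most recent clicked item (the click history is
# reversed before the call), 0.7**loc geometrically down-weights older clicks, and
# w(i) is the linear recency weight built below from the click timestamps; items
# missing from item_weight_dict fall back to a weight of 0.5.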
61 | 62 | 63 | user_id_list = [] 64 | item_id_list = [] 65 | 66 | item_sim_list = [] 67 | ra_sim_list = [] 68 | aa_sim_list = [] 69 | cn_sim_list = [] 70 | txt_sim_list = [] 71 | 72 | hdi_sim_list = [] 73 | hpi_sim_list = [] 74 | lhn1_sim_list = [] 75 | 76 | 77 | for c in range(now_phase + 1): 78 | print('phase:', c) 79 | click_train = pd.read_csv(train_path + header + '_train_click_{}_time.csv'.format(c)) 80 | click_test = pd.read_csv(test_path + header + '_test_click_{}_time.csv'.format(c)) 81 | click_query = pd.read_csv(test_path + header + '_test_qtime_{}_time.csv'.format(c)) 82 | 83 | 84 | click_train['datetime'] = pd.to_datetime(click_train['datetime']) 85 | click_test['datetime'] = pd.to_datetime(click_test['datetime']) 86 | click_query['datetime'] = pd.to_datetime(click_query['datetime']) 87 | 88 | 89 | 90 | click_train['timestamp'] = click_train['datetime'].dt.day + ( click_train['datetime'].dt.hour + 91 | (click_train['datetime'].dt.minute + click_train['datetime'].dt.second/60)/float(60) )/float(24) 92 | 93 | click_test['timestamp'] = click_test['datetime'].dt.day + ( click_test['datetime'].dt.hour + 94 | (click_test['datetime'].dt.minute + click_test['datetime'].dt.second/60)/float(60) )/float(24) 95 | 96 | click_query['timestamp'] = click_query['datetime'].dt.day + ( click_query['datetime'].dt.hour + 97 | (click_query['datetime'].dt.minute + click_query['datetime'].dt.second/60)/float(60) )/float(24) 98 | 99 | 100 | all_click = click_train.append(click_test) 101 | 102 | 103 | with open(out_path+'user2item_new'+str(c)+'.pkl','rb') as f: 104 | user_item_tmp = pickle.load(f) 105 | 106 | with open(out_path+'CN_P'+str(c)+'_new.pkl','rb') as f: 107 | CN_sim_list_new = pickle.load(f) 108 | 109 | 110 | for i, row in click_query.iterrows(): 111 | offline_tmp = offline[offline['user_id']==row['user_id']] 112 | candidate_item_list = list(offline_tmp['item_id']) 113 | 114 | time_min = min(all_click['timestamp']) 115 | time_max = row['timestamp'] 116 | 117 | df_tmp = all_click[all_click['user_id']==row['user_id']] 118 | df_tmp = df_tmp.reset_index(drop=True) 119 | df_tmp['weight'] = 1 - (time_max-df_tmp['timestamp']+0.01) / (time_max-time_min+0.01) 120 | item_weight_dict = dict(zip(df_tmp['item_id'], df_tmp['weight'])) 121 | 122 | interacted_items = user_item_tmp[row['user_id']] 123 | interacted_items = interacted_items[::-1] 124 | 125 | sim_list_tmp = ReComputeSim(CN_sim_list_new,candidate_item_list,interacted_items,item_weight_dict) 126 | cn_sim_list += sim_list_tmp 127 | 128 | item_id_list += candidate_item_list 129 | user_id_list += [row['user_id'] for x in candidate_item_list] 130 | 131 | CN_sim_list_new = [] 132 | 133 | 134 | 135 | with open(out_path+'HDI_P'+str(c)+'_new.pkl','rb') as f: 136 | HDI_sim_list_new = pickle.load(f) 137 | 138 | 139 | for i, row in click_query.iterrows(): 140 | offline_tmp = offline[offline['user_id']==row['user_id']] 141 | candidate_item_list = list(offline_tmp['item_id']) 142 | 143 | time_min = min(all_click['timestamp']) 144 | time_max = row['timestamp'] 145 | 146 | df_tmp = all_click[all_click['user_id']==row['user_id']] 147 | df_tmp = df_tmp.reset_index(drop=True) 148 | df_tmp['weight'] = 1 - (time_max-df_tmp['timestamp']+0.01) / (time_max-time_min+0.01) 149 | item_weight_dict = dict(zip(df_tmp['item_id'], df_tmp['weight'])) 150 | 151 | interacted_items = user_item_tmp[row['user_id']] 152 | interacted_items = interacted_items[::-1] 153 | 154 | sim_list_tmp = ReComputeSim(HDI_sim_list_new,candidate_item_list,interacted_items,item_weight_dict) 155 | 
hdi_sim_list += sim_list_tmp 156 | 157 | 158 | HDI_sim_list_new = [] 159 | 160 | 161 | with open(out_path+'HPI_P'+str(c)+'_new.pkl','rb') as f: 162 | HPI_sim_list_new = pickle.load(f) 163 | 164 | 165 | for i, row in click_query.iterrows(): 166 | offline_tmp = offline[offline['user_id']==row['user_id']] 167 | candidate_item_list = list(offline_tmp['item_id']) 168 | 169 | time_min = min(all_click['timestamp']) 170 | time_max = row['timestamp'] 171 | 172 | df_tmp = all_click[all_click['user_id']==row['user_id']] 173 | df_tmp = df_tmp.reset_index(drop=True) 174 | df_tmp['weight'] = 1 - (time_max-df_tmp['timestamp']+0.01) / (time_max-time_min+0.01) 175 | item_weight_dict = dict(zip(df_tmp['item_id'], df_tmp['weight'])) 176 | 177 | interacted_items = user_item_tmp[row['user_id']] 178 | interacted_items = interacted_items[::-1] 179 | 180 | sim_list_tmp = ReComputeSim(HPI_sim_list_new,candidate_item_list,interacted_items,item_weight_dict) 181 | hpi_sim_list += sim_list_tmp 182 | 183 | 184 | HPI_sim_list_new = [] 185 | 186 | 187 | with open(out_path+'LHN1_P'+str(c)+'_new.pkl','rb') as f: 188 | LHN1_sim_list_new = pickle.load(f) 189 | 190 | 191 | for i, row in click_query.iterrows(): 192 | offline_tmp = offline[offline['user_id']==row['user_id']] 193 | candidate_item_list = list(offline_tmp['item_id']) 194 | 195 | time_min = min(all_click['timestamp']) 196 | time_max = row['timestamp'] 197 | 198 | df_tmp = all_click[all_click['user_id']==row['user_id']] 199 | df_tmp = df_tmp.reset_index(drop=True) 200 | df_tmp['weight'] = 1 - (time_max-df_tmp['timestamp']+0.01) / (time_max-time_min+0.01) 201 | item_weight_dict = dict(zip(df_tmp['item_id'], df_tmp['weight'])) 202 | 203 | interacted_items = user_item_tmp[row['user_id']] 204 | interacted_items = interacted_items[::-1] 205 | 206 | sim_list_tmp = ReComputeSim(LHN1_sim_list_new,candidate_item_list,interacted_items,item_weight_dict) 207 | lhn1_sim_list += sim_list_tmp 208 | 209 | 210 | LHN1_sim_list_new = [] 211 | 212 | 213 | 214 | # In[ ]: 215 | 216 | 217 | 218 | 219 | 220 | # In[4]: 221 | 222 | 223 | offline.shape 224 | 225 | 226 | # In[5]: 227 | 228 | 229 | len(lhn1_sim_list) 230 | 231 | 232 | # In[ ]: 233 | 234 | 235 | 236 | 237 | 238 | # In[6]: 239 | 240 | 241 | sim_df = pd.DataFrame() 242 | sim_df['user_id'] = user_id_list 243 | sim_df['item_id'] = item_id_list 244 | sim_df['cn_sim'] = cn_sim_list 245 | sim_df['hpi_sim'] = hpi_sim_list 246 | sim_df['hdi_sim'] = hdi_sim_list 247 | sim_df['lhn1_sim'] = lhn1_sim_list 248 | 249 | 250 | # In[7]: 251 | 252 | 253 | sim_df.shape 254 | 255 | 256 | # In[8]: 257 | 258 | 259 | offline = offline.merge(sim_df,on=['user_id','item_id']) 260 | 261 | 262 | # In[ ]: 263 | 264 | 265 | 266 | 267 | 268 | # In[9]: 269 | 270 | 271 | offline.to_csv('./user_data/offline/new_recall/'+ file_name + '_addsim.csv',index=False) 272 | 273 | 274 | # In[ ]: 275 | 276 | 277 | 278 | 279 | 280 | # In[ ]: 281 | 282 | 283 | 284 | 285 | 286 | # In[ ]: 287 | 288 | 289 | 290 | 291 | -------------------------------------------------------------------------------- /code/4_RankFeature/01_sim_feature_offline_RA_AA.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # In[1]: 5 | 6 | 7 | import pandas as pd 8 | import numpy as np 9 | from tqdm import tqdm 10 | import os 11 | from collections import defaultdict 12 | import math 13 | import json 14 | from sys import stdout 15 | import pickle 16 | 17 | 18 | # In[2]: 19 | 20 | 21 | 22 | 23 | 24 | def 
ReComputeSim(sim_cor,candidate_item_list,interacted_items,item_weight_dict,flag=False): 25 | 26 | sim_list = [] 27 | for j in candidate_item_list: 28 | sim_tmp = 0 29 | for loc, i in enumerate(interacted_items): 30 | #Just for RA gernerated by offline 31 | if i not in sim_cor or j not in sim_cor[i]: 32 | continue 33 | if i in item_weight_dict: 34 | sim_tmp += sim_cor[i][j][0] * (0.7**loc) * item_weight_dict[i] if flag else sim_cor[i][j] * (0.7**loc) * item_weight_dict[i] 35 | else: 36 | sim_tmp += sim_cor[i][j][0] * (0.7**loc) * 0.5 if flag else sim_cor[i][j] * (0.7**loc) * 0.5 37 | 38 | sim_list.append(sim_tmp) 39 | 40 | return sim_list 41 | 42 | 43 | # In[3]: 44 | 45 | 46 | file_name = 'recall_0531_addsim' 47 | 48 | offline = pd.read_csv('./user_data/offline/new_recall/' + file_name + '.csv') 49 | 50 | now_phase = 9 51 | 52 | 53 | train_path = './user_data/offline/' 54 | test_path = './user_data/offline/' 55 | header = 'offline' 56 | out_path = './user_data/offline/new_similarity/' 57 | 58 | recom_item = [] 59 | 60 | whole_click = pd.DataFrame() 61 | 62 | 63 | user_id_list = [] 64 | item_id_list = [] 65 | 66 | 67 | ra_sim_list = [] 68 | aa_sim_list = [] 69 | 70 | 71 | 72 | for c in range(now_phase + 1): 73 | print('phase:', c) 74 | click_train = pd.read_csv(train_path + header + '_train_click_{}_time.csv'.format(c)) 75 | click_test = pd.read_csv(test_path + header + '_test_click_{}_time.csv'.format(c)) 76 | click_query = pd.read_csv(test_path + header + '_test_qtime_{}_time.csv'.format(c)) 77 | 78 | 79 | click_train['datetime'] = pd.to_datetime(click_train['datetime']) 80 | click_test['datetime'] = pd.to_datetime(click_test['datetime']) 81 | click_query['datetime'] = pd.to_datetime(click_query['datetime']) 82 | 83 | 84 | 85 | click_train['timestamp'] = click_train['datetime'].dt.day + ( click_train['datetime'].dt.hour + 86 | (click_train['datetime'].dt.minute + click_train['datetime'].dt.second/60)/float(60) )/float(24) 87 | 88 | click_test['timestamp'] = click_test['datetime'].dt.day + ( click_test['datetime'].dt.hour + 89 | (click_test['datetime'].dt.minute + click_test['datetime'].dt.second/60)/float(60) )/float(24) 90 | 91 | click_query['timestamp'] = click_query['datetime'].dt.day + ( click_query['datetime'].dt.hour + 92 | (click_query['datetime'].dt.minute + click_query['datetime'].dt.second/60)/float(60) )/float(24) 93 | 94 | 95 | all_click = click_train.append(click_test) 96 | 97 | 98 | with open(out_path+'user2item_new'+str(c)+'.pkl','rb') as f: 99 | user_item_tmp = pickle.load(f) 100 | 101 | with open(out_path+'RA_P'+str(c)+'_new.pkl','rb') as f: 102 | RA_sim_list_new = pickle.load(f) 103 | 104 | 105 | for i, row in click_query.iterrows(): 106 | offline_tmp = offline[offline['user_id']==row['user_id']] 107 | candidate_item_list = list(offline_tmp['item_id']) 108 | 109 | time_min = min(all_click['timestamp']) 110 | time_max = row['timestamp'] 111 | 112 | df_tmp = all_click[all_click['user_id']==row['user_id']] 113 | df_tmp = df_tmp.reset_index(drop=True) 114 | df_tmp['weight'] = 1 - (time_max-df_tmp['timestamp']+0.01) / (time_max-time_min+0.01) 115 | item_weight_dict = dict(zip(df_tmp['item_id'], df_tmp['weight'])) 116 | 117 | interacted_items = user_item_tmp[row['user_id']] 118 | interacted_items = interacted_items[::-1] 119 | 120 | sim_list_tmp = ReComputeSim(RA_sim_list_new,candidate_item_list,interacted_items,item_weight_dict) 121 | ra_sim_list += sim_list_tmp 122 | 123 | item_id_list += candidate_item_list 124 | user_id_list += [row['user_id'] for x in candidate_item_list] 
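# The loaded RA similarity structure is cleared just below before the AA pickle is
# read, so only one phase's second-order similarity table sits in memory at a time.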
125 | 126 | RA_sim_list_new = [] 127 | 128 | 129 | 130 | with open(out_path+'AA_P'+str(c)+'_new.pkl','rb') as f: 131 | AA_sim_list_new = pickle.load(f) 132 | 133 | 134 | for i, row in click_query.iterrows(): 135 | offline_tmp = offline[offline['user_id']==row['user_id']] 136 | candidate_item_list = list(offline_tmp['item_id']) 137 | 138 | time_min = min(all_click['timestamp']) 139 | time_max = row['timestamp'] 140 | 141 | df_tmp = all_click[all_click['user_id']==row['user_id']] 142 | df_tmp = df_tmp.reset_index(drop=True) 143 | df_tmp['weight'] = 1 - (time_max-df_tmp['timestamp']+0.01) / (time_max-time_min+0.01) 144 | item_weight_dict = dict(zip(df_tmp['item_id'], df_tmp['weight'])) 145 | 146 | interacted_items = user_item_tmp[row['user_id']] 147 | interacted_items = interacted_items[::-1] 148 | 149 | sim_list_tmp = ReComputeSim(AA_sim_list_new,candidate_item_list,interacted_items,item_weight_dict) 150 | aa_sim_list += sim_list_tmp 151 | 152 | 153 | AA_sim_list_new = [] 154 | 155 | 156 | 157 | 158 | 159 | # In[ ]: 160 | 161 | 162 | 163 | 164 | 165 | # In[4]: 166 | 167 | 168 | offline.shape 169 | 170 | 171 | # In[ ]: 172 | 173 | 174 | 175 | 176 | 177 | # In[ ]: 178 | 179 | 180 | 181 | 182 | 183 | # In[5]: 184 | 185 | 186 | sim_df = pd.DataFrame() 187 | sim_df['user_id'] = user_id_list 188 | sim_df['item_id'] = item_id_list 189 | sim_df['ra_sim'] = ra_sim_list 190 | sim_df['aa_sim'] = aa_sim_list 191 | 192 | 193 | # In[6]: 194 | 195 | 196 | sim_df.shape 197 | 198 | 199 | # In[7]: 200 | 201 | 202 | offline = offline.merge(sim_df,on=['user_id','item_id']) 203 | 204 | 205 | # In[ ]: 206 | 207 | 208 | 209 | 210 | 211 | # In[8]: 212 | 213 | 214 | offline.to_csv('./user_data/offline/new_recall/'+ file_name + '_addAA_RA.csv',index=False) 215 | 216 | 217 | # In[ ]: 218 | 219 | 220 | 221 | 222 | 223 | # In[ ]: 224 | 225 | 226 | 227 | 228 | 229 | # In[ ]: 230 | 231 | 232 | 233 | 234 | 235 | # In[ ]: 236 | 237 | 238 | 239 | 240 | 241 | # In[ ]: 242 | 243 | 244 | 245 | 246 | 247 | # In[ ]: 248 | 249 | 250 | 251 | 252 | 253 | # In[ ]: 254 | 255 | 256 | 257 | 258 | 259 | # In[ ]: 260 | 261 | 262 | 263 | 264 | 265 | # In[ ]: 266 | 267 | 268 | 269 | 270 | 271 | # In[ ]: 272 | 273 | 274 | 275 | 276 | 277 | # In[ ]: 278 | 279 | 280 | 281 | 282 | 283 | # In[ ]: 284 | 285 | 286 | 287 | 288 | 289 | # In[ ]: 290 | 291 | 292 | 293 | 294 | 295 | # In[ ]: 296 | 297 | 298 | 299 | 300 | 301 | # In[ ]: 302 | 303 | 304 | 305 | 306 | -------------------------------------------------------------------------------- /code/4_RankFeature/01_sim_feature_online.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # In[1]: 5 | 6 | 7 | import pandas as pd 8 | import numpy as np 9 | from tqdm import tqdm 10 | import os 11 | from collections import defaultdict 12 | import math 13 | import json 14 | from sys import stdout 15 | import pickle 16 | 17 | 18 | # In[2]: 19 | 20 | 21 | 22 | 23 | 24 | def ReComputeSim(sim_cor,candidate_item_list,interacted_items,item_weight_dict,flag=False): 25 | 26 | sim_list = [] 27 | for j in candidate_item_list: 28 | sim_tmp = 0 29 | for loc, i in enumerate(interacted_items): 30 | #Just for RA gernerated by offline 31 | if i not in sim_cor or j not in sim_cor[i]: 32 | continue 33 | if i in item_weight_dict: 34 | sim_tmp += sim_cor[i][j][0] * (0.7**loc) * item_weight_dict[i] if flag else sim_cor[i][j] * (0.7**loc) * item_weight_dict[i] 35 | else: 36 | sim_tmp += sim_cor[i][j][0] * (0.7**loc) * 0.5 if flag else 
sim_cor[i][j] * (0.7**loc) * 0.5 37 | 38 | sim_list.append(sim_tmp) 39 | 40 | return sim_list 41 | 42 | 43 | # In[3]: 44 | 45 | 46 | file_name = 'recall_0531' 47 | 48 | offline = pd.read_csv('./user_data/dataset/new_recall/' + file_name + '.csv') 49 | 50 | now_phase = 9 51 | 52 | 53 | train_path = './user_data/dataset/' 54 | test_path = './user_data/dataset/' 55 | header = 'underexpose' 56 | out_path = './user_data/dataset/new_similarity/' 57 | 58 | recom_item = [] 59 | 60 | whole_click = pd.DataFrame() 61 | 62 | 63 | user_id_list = [] 64 | item_id_list = [] 65 | 66 | item_sim_list = [] 67 | ra_sim_list = [] 68 | aa_sim_list = [] 69 | cn_sim_list = [] 70 | txt_sim_list = [] 71 | 72 | hdi_sim_list = [] 73 | hpi_sim_list = [] 74 | lhn1_sim_list = [] 75 | 76 | 77 | for c in range(now_phase + 1): 78 | print('phase:', c) 79 | click_train = pd.read_csv(train_path + header + '_train_click_{}_time.csv'.format(c)) 80 | click_test = pd.read_csv(test_path + header + '_test_click_{}_time.csv'.format(c)) 81 | click_query = pd.read_csv(test_path + header + '_test_qtime_{}_time.csv'.format(c)) 82 | 83 | 84 | click_train['datetime'] = pd.to_datetime(click_train['datetime']) 85 | click_test['datetime'] = pd.to_datetime(click_test['datetime']) 86 | click_query['datetime'] = pd.to_datetime(click_query['datetime']) 87 | 88 | 89 | 90 | click_train['timestamp'] = click_train['datetime'].dt.day + ( click_train['datetime'].dt.hour + 91 | (click_train['datetime'].dt.minute + click_train['datetime'].dt.second/60)/float(60) )/float(24) 92 | 93 | click_test['timestamp'] = click_test['datetime'].dt.day + ( click_test['datetime'].dt.hour + 94 | (click_test['datetime'].dt.minute + click_test['datetime'].dt.second/60)/float(60) )/float(24) 95 | 96 | click_query['timestamp'] = click_query['datetime'].dt.day + ( click_query['datetime'].dt.hour + 97 | (click_query['datetime'].dt.minute + click_query['datetime'].dt.second/60)/float(60) )/float(24) 98 | 99 | 100 | all_click = click_train.append(click_test) 101 | 102 | 103 | with open(out_path+'user2item_new'+str(c)+'.pkl','rb') as f: 104 | user_item_tmp = pickle.load(f) 105 | 106 | with open(out_path+'CN_P'+str(c)+'_new.pkl','rb') as f: 107 | CN_sim_list_new = pickle.load(f) 108 | 109 | 110 | for i, row in click_query.iterrows(): 111 | offline_tmp = offline[offline['user_id']==row['user_id']] 112 | candidate_item_list = list(offline_tmp['item_id']) 113 | 114 | time_min = min(all_click['timestamp']) 115 | time_max = row['timestamp'] 116 | 117 | df_tmp = all_click[all_click['user_id']==row['user_id']] 118 | df_tmp = df_tmp.reset_index(drop=True) 119 | df_tmp['weight'] = 1 - (time_max-df_tmp['timestamp']+0.01) / (time_max-time_min+0.01) 120 | item_weight_dict = dict(zip(df_tmp['item_id'], df_tmp['weight'])) 121 | 122 | interacted_items = user_item_tmp[row['user_id']] 123 | interacted_items = interacted_items[::-1] 124 | 125 | sim_list_tmp = ReComputeSim(CN_sim_list_new,candidate_item_list,interacted_items,item_weight_dict) 126 | cn_sim_list += sim_list_tmp 127 | 128 | item_id_list += candidate_item_list 129 | user_id_list += [row['user_id'] for x in candidate_item_list] 130 | 131 | CN_sim_list_new = [] 132 | 133 | 134 | 135 | with open(out_path+'HDI_P'+str(c)+'_new.pkl','rb') as f: 136 | HDI_sim_list_new = pickle.load(f) 137 | 138 | 139 | for i, row in click_query.iterrows(): 140 | offline_tmp = offline[offline['user_id']==row['user_id']] 141 | candidate_item_list = list(offline_tmp['item_id']) 142 | 143 | time_min = min(all_click['timestamp']) 144 | time_max = row['timestamp'] 
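# Per-click recency weight computed below:
#     w = 1 - (t_query - t_click + 0.01) / (t_query - t_min + 0.01)
# A click made at the query time gets w close to 1, the oldest click in the log
# gets w close to 0, and the 0.01 terms guard against division by zero.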
145 | 146 | df_tmp = all_click[all_click['user_id']==row['user_id']] 147 | df_tmp = df_tmp.reset_index(drop=True) 148 | df_tmp['weight'] = 1 - (time_max-df_tmp['timestamp']+0.01) / (time_max-time_min+0.01) 149 | item_weight_dict = dict(zip(df_tmp['item_id'], df_tmp['weight'])) 150 | 151 | interacted_items = user_item_tmp[row['user_id']] 152 | interacted_items = interacted_items[::-1] 153 | 154 | sim_list_tmp = ReComputeSim(HDI_sim_list_new,candidate_item_list,interacted_items,item_weight_dict) 155 | hdi_sim_list += sim_list_tmp 156 | 157 | 158 | HDI_sim_list_new = [] 159 | 160 | 161 | with open(out_path+'HPI_P'+str(c)+'_new.pkl','rb') as f: 162 | HPI_sim_list_new = pickle.load(f) 163 | 164 | 165 | for i, row in click_query.iterrows(): 166 | offline_tmp = offline[offline['user_id']==row['user_id']] 167 | candidate_item_list = list(offline_tmp['item_id']) 168 | 169 | time_min = min(all_click['timestamp']) 170 | time_max = row['timestamp'] 171 | 172 | df_tmp = all_click[all_click['user_id']==row['user_id']] 173 | df_tmp = df_tmp.reset_index(drop=True) 174 | df_tmp['weight'] = 1 - (time_max-df_tmp['timestamp']+0.01) / (time_max-time_min+0.01) 175 | item_weight_dict = dict(zip(df_tmp['item_id'], df_tmp['weight'])) 176 | 177 | interacted_items = user_item_tmp[row['user_id']] 178 | interacted_items = interacted_items[::-1] 179 | 180 | sim_list_tmp = ReComputeSim(HPI_sim_list_new,candidate_item_list,interacted_items,item_weight_dict) 181 | hpi_sim_list += sim_list_tmp 182 | 183 | 184 | HPI_sim_list_new = [] 185 | 186 | 187 | with open(out_path+'LHN1_P'+str(c)+'_new.pkl','rb') as f: 188 | LHN1_sim_list_new = pickle.load(f) 189 | 190 | 191 | for i, row in click_query.iterrows(): 192 | offline_tmp = offline[offline['user_id']==row['user_id']] 193 | candidate_item_list = list(offline_tmp['item_id']) 194 | 195 | time_min = min(all_click['timestamp']) 196 | time_max = row['timestamp'] 197 | 198 | df_tmp = all_click[all_click['user_id']==row['user_id']] 199 | df_tmp = df_tmp.reset_index(drop=True) 200 | df_tmp['weight'] = 1 - (time_max-df_tmp['timestamp']+0.01) / (time_max-time_min+0.01) 201 | item_weight_dict = dict(zip(df_tmp['item_id'], df_tmp['weight'])) 202 | 203 | interacted_items = user_item_tmp[row['user_id']] 204 | interacted_items = interacted_items[::-1] 205 | 206 | sim_list_tmp = ReComputeSim(LHN1_sim_list_new,candidate_item_list,interacted_items,item_weight_dict) 207 | lhn1_sim_list += sim_list_tmp 208 | 209 | 210 | LHN1_sim_list_new = [] 211 | 212 | 213 | 214 | # In[ ]: 215 | 216 | 217 | 218 | 219 | 220 | # In[4]: 221 | 222 | 223 | offline.shape 224 | 225 | 226 | # In[5]: 227 | 228 | 229 | len(lhn1_sim_list) 230 | 231 | 232 | # In[ ]: 233 | 234 | 235 | 236 | 237 | 238 | # In[6]: 239 | 240 | 241 | sim_df = pd.DataFrame() 242 | sim_df['user_id'] = user_id_list 243 | sim_df['item_id'] = item_id_list 244 | sim_df['cn_sim'] = cn_sim_list 245 | sim_df['hpi_sim'] = hpi_sim_list 246 | sim_df['hdi_sim'] = hdi_sim_list 247 | sim_df['lhn1_sim'] = lhn1_sim_list 248 | 249 | 250 | # In[7]: 251 | 252 | 253 | sim_df.shape 254 | 255 | 256 | # In[8]: 257 | 258 | 259 | offline = offline.merge(sim_df,on=['user_id','item_id']) 260 | 261 | 262 | # In[ ]: 263 | 264 | 265 | 266 | 267 | 268 | # In[9]: 269 | 270 | 271 | offline.to_csv('./user_data/dataset/new_recall/'+ file_name + '_addsim.csv',index=False) 272 | 273 | 274 | # In[10]: 275 | 276 | 277 | offline.shape 278 | 279 | 280 | # In[ ]: 281 | 282 | 283 | 284 | 285 | 286 | # In[ ]: 287 | 288 | 289 | 290 | 291 | 
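All of the 01_sim_feature_* scripts above apply the same ReComputeSim rule when turning
the pickled second-order similarities (RA, AA, CN, HDI, HPI, LHN1) into ranking features.
A minimal self-contained sketch of that weighted re-scoring with toy data (the names
recompute_sim, candidates, and weights are illustrative, not from the repository):

    def recompute_sim(sim_cor, candidates, interacted, weights):
        """Re-score candidate items against a click history, most recent first.

        sim_cor    -- {item_i: {item_j: similarity}}, as unpickled from *_P{c}_new.pkl
        interacted -- the user's clicked items, most recent first
        weights    -- per-item recency weight; items without one fall back to 0.5
        """
        scores = []
        for j in candidates:
            s = 0.0
            for loc, i in enumerate(interacted):
                if i in sim_cor and j in sim_cor[i]:
                    s += sim_cor[i][j] * (0.7 ** loc) * weights.get(i, 0.5)
            scores.append(s)
        return scores

    # Toy check: the user clicked item 1, then item 2 (so item 2 is most recent).
    sim = {1: {10: 0.8, 11: 0.2}, 2: {10: 0.1, 11: 0.9}}
    print(recompute_sim(sim, [10, 11], [2, 1], {1: 0.4, 2: 1.0}))
    # -> roughly [0.324, 0.956]; e.g. item 10: 0.1*1*1.0 + 0.8*0.7*0.4 = 0.324

This mirrors the flag=False path of ReComputeSim; with flag=True the scripts index
sim_cor[i][j][0] instead, for similarity pickles whose values are tuples.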
-------------------------------------------------------------------------------- /code/4_RankFeature/01_sim_feature_online_RA_AA.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # In[1]: 5 | 6 | 7 | import pandas as pd 8 | import numpy as np 9 | from tqdm import tqdm 10 | import os 11 | from collections import defaultdict 12 | import math 13 | import json 14 | from sys import stdout 15 | import pickle 16 | 17 | 18 | # In[2]: 19 | 20 | 21 | 22 | 23 | 24 | def ReComputeSim(sim_cor,candidate_item_list,interacted_items,item_weight_dict,flag=False): 25 | 26 | sim_list = [] 27 | for j in candidate_item_list: 28 | sim_tmp = 0 29 | for loc, i in enumerate(interacted_items): 30 | #Just for RA gernerated by offline 31 | if i not in sim_cor or j not in sim_cor[i]: 32 | continue 33 | if i in item_weight_dict: 34 | sim_tmp += sim_cor[i][j][0] * (0.7**loc) * item_weight_dict[i] if flag else sim_cor[i][j] * (0.7**loc) * item_weight_dict[i] 35 | else: 36 | sim_tmp += sim_cor[i][j][0] * (0.7**loc) * 0.5 if flag else sim_cor[i][j] * (0.7**loc) * 0.5 37 | 38 | sim_list.append(sim_tmp) 39 | 40 | return sim_list 41 | 42 | 43 | # In[ ]: 44 | 45 | 46 | file_name = 'recall_0531_addsim' 47 | 48 | offline = pd.read_csv('./user_data/dataset/new_recall/' + file_name + '.csv') 49 | 50 | now_phase = 9 51 | 52 | 53 | train_path = './user_data/dataset/' 54 | test_path = './user_data/dataset/' 55 | header = 'underexpose' 56 | out_path = './user_data/dataset/new_similarity/' 57 | 58 | recom_item = [] 59 | 60 | whole_click = pd.DataFrame() 61 | 62 | 63 | user_id_list = [] 64 | item_id_list = [] 65 | 66 | 67 | ra_sim_list = [] 68 | aa_sim_list = [] 69 | 70 | 71 | 72 | for c in range(now_phase + 1): 73 | print('phase:', c) 74 | click_train = pd.read_csv(train_path + header + '_train_click_{}_time.csv'.format(c)) 75 | click_test = pd.read_csv(test_path + header + '_test_click_{}_time.csv'.format(c)) 76 | click_query = pd.read_csv(test_path + header + '_test_qtime_{}_time.csv'.format(c)) 77 | 78 | 79 | click_train['datetime'] = pd.to_datetime(click_train['datetime']) 80 | click_test['datetime'] = pd.to_datetime(click_test['datetime']) 81 | click_query['datetime'] = pd.to_datetime(click_query['datetime']) 82 | 83 | 84 | 85 | click_train['timestamp'] = click_train['datetime'].dt.day + ( click_train['datetime'].dt.hour + 86 | (click_train['datetime'].dt.minute + click_train['datetime'].dt.second/60)/float(60) )/float(24) 87 | 88 | click_test['timestamp'] = click_test['datetime'].dt.day + ( click_test['datetime'].dt.hour + 89 | (click_test['datetime'].dt.minute + click_test['datetime'].dt.second/60)/float(60) )/float(24) 90 | 91 | click_query['timestamp'] = click_query['datetime'].dt.day + ( click_query['datetime'].dt.hour + 92 | (click_query['datetime'].dt.minute + click_query['datetime'].dt.second/60)/float(60) )/float(24) 93 | 94 | 95 | all_click = click_train.append(click_test) 96 | 97 | 98 | with open(out_path+'user2item_new'+str(c)+'.pkl','rb') as f: 99 | user_item_tmp = pickle.load(f) 100 | 101 | with open(out_path+'RA_P'+str(c)+'_new.pkl','rb') as f: 102 | RA_sim_list_new = pickle.load(f) 103 | 104 | 105 | for i, row in click_query.iterrows(): 106 | offline_tmp = offline[offline['user_id']==row['user_id']] 107 | candidate_item_list = list(offline_tmp['item_id']) 108 | 109 | time_min = min(all_click['timestamp']) 110 | time_max = row['timestamp'] 111 | 112 | df_tmp = all_click[all_click['user_id']==row['user_id']] 113 | df_tmp = 
df_tmp.reset_index(drop=True) 114 | df_tmp['weight'] = 1 - (time_max-df_tmp['timestamp']+0.01) / (time_max-time_min+0.01) 115 | item_weight_dict = dict(zip(df_tmp['item_id'], df_tmp['weight'])) 116 | 117 | interacted_items = user_item_tmp[row['user_id']] 118 | interacted_items = interacted_items[::-1] 119 | 120 | sim_list_tmp = ReComputeSim(RA_sim_list_new,candidate_item_list,interacted_items,item_weight_dict) 121 | ra_sim_list += sim_list_tmp 122 | 123 | item_id_list += candidate_item_list 124 | user_id_list += [row['user_id'] for x in candidate_item_list] 125 | 126 | RA_sim_list_new = [] 127 | 128 | 129 | 130 | with open(out_path+'AA_P'+str(c)+'_new.pkl','rb') as f: 131 | AA_sim_list_new = pickle.load(f) 132 | 133 | 134 | for i, row in click_query.iterrows(): 135 | offline_tmp = offline[offline['user_id']==row['user_id']] 136 | candidate_item_list = list(offline_tmp['item_id']) 137 | 138 | time_min = min(all_click['timestamp']) 139 | time_max = row['timestamp'] 140 | 141 | df_tmp = all_click[all_click['user_id']==row['user_id']] 142 | df_tmp = df_tmp.reset_index(drop=True) 143 | df_tmp['weight'] = 1 - (time_max-df_tmp['timestamp']+0.01) / (time_max-time_min+0.01) 144 | item_weight_dict = dict(zip(df_tmp['item_id'], df_tmp['weight'])) 145 | 146 | interacted_items = user_item_tmp[row['user_id']] 147 | interacted_items = interacted_items[::-1] 148 | 149 | sim_list_tmp = ReComputeSim(AA_sim_list_new,candidate_item_list,interacted_items,item_weight_dict) 150 | aa_sim_list += sim_list_tmp 151 | 152 | 153 | AA_sim_list_new = [] 154 | 155 | 156 | 157 | 158 | 159 | # In[ ]: 160 | 161 | 162 | 163 | 164 | 165 | # In[ ]: 166 | 167 | 168 | offline.shape 169 | 170 | 171 | # In[ ]: 172 | 173 | 174 | 175 | 176 | 177 | # In[ ]: 178 | 179 | 180 | 181 | 182 | 183 | # In[ ]: 184 | 185 | 186 | sim_df = pd.DataFrame() 187 | sim_df['user_id'] = user_id_list 188 | sim_df['item_id'] = item_id_list 189 | sim_df['ra_sim'] = ra_sim_list 190 | sim_df['aa_sim'] = aa_sim_list 191 | 192 | 193 | # In[ ]: 194 | 195 | 196 | sim_df.shape 197 | 198 | 199 | # In[ ]: 200 | 201 | 202 | offline = offline.merge(sim_df,on=['user_id','item_id']) 203 | 204 | 205 | # In[ ]: 206 | 207 | 208 | 209 | 210 | 211 | # In[ ]: 212 | 213 | 214 | offline.to_csv('./user_data/dataset/new_recall/'+ file_name + '_addAA_RA.csv',index=False) 215 | 216 | 217 | # In[ ]: 218 | 219 | 220 | 221 | 222 | 223 | # In[ ]: 224 | 225 | 226 | 227 | 228 | 229 | # In[ ]: 230 | 231 | 232 | 233 | 234 | 235 | # In[ ]: 236 | 237 | 238 | 239 | 240 | 241 | # In[ ]: 242 | 243 | 244 | 245 | 246 | 247 | # In[ ]: 248 | 249 | 250 | 251 | 252 | 253 | # In[ ]: 254 | 255 | 256 | 257 | 258 | 259 | # In[ ]: 260 | 261 | 262 | 263 | 264 | 265 | # In[ ]: 266 | 267 | 268 | 269 | 270 | 271 | # In[ ]: 272 | 273 | 274 | 275 | 276 | 277 | # In[ ]: 278 | 279 | 280 | 281 | 282 | 283 | # In[ ]: 284 | 285 | 286 | 287 | 288 | 289 | # In[ ]: 290 | 291 | 292 | 293 | 294 | 295 | # In[ ]: 296 | 297 | 298 | 299 | 300 | 301 | # In[ ]: 302 | 303 | 304 | 305 | 306 | -------------------------------------------------------------------------------- /code/4_RankFeature/02_itemtime_feature_model1.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # In[1]: 5 | 6 | 7 | import pandas as pd 8 | import numpy as np 9 | 10 | 11 | # In[2]: 12 | 13 | 14 | def extractItemCount(df, df_qTime, df_click, intervals, col_name): 15 | 16 | 17 | 18 | df_click = getTimeInterval(df_click,intervals) 19 | 20 | if 'time_interval' not in 
df.columns:
21 |         df_qTime = getTimeInterval(df_qTime, intervals)
22 |         df = df.merge(df_qTime[['user_id','time_interval']])
23 | 
24 |     df_click_sta = df_click[['user_id','item_id','time_interval']].groupby(by=['item_id','time_interval'], as_index=False).count()
25 |     df_click_sta.columns = ['item_id','time_interval',col_name]
26 | 
27 |     df = df.merge(df_click_sta, on=['item_id','time_interval'], how='left')
28 | 
29 |     return df
30 | 
31 | 
32 | 
33 | # In[3]:
34 | 
35 | 
36 | def getTimeInterval(df, intervals):
37 |     df['hour_minute'] = (df['datetime'].dt.hour + df['datetime'].dt.minute/60)/24
38 | 
39 |     time_interval_list = np.linspace(0,1,intervals)
40 | 
41 |     df['time_interval'] = df['hour_minute'].apply(lambda x: np.where(x
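The dump cuts off inside getTimeInterval. For reference, a hypothetical completion of
the bucketing step, inferred from the surrounding code and not the repository's own
line (the name get_time_interval_sketch is illustrative):

    import numpy as np

    def get_time_interval_sketch(x, intervals):
        # Hypothetical reconstruction: map a normalized time-of-day x in [0, 1]
        # to the index of the last edge in linspace(0, 1, intervals) it reaches.
        edges = np.linspace(0, 1, intervals)
        return int(np.where(x >= edges)[0][-1])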