├── README.md
├── news2.py
├── news1.py
├── news5.py
├── news4.py
└── news3.py
/README.md:
--------------------------------------------------------------------------------
# Recommended-News
基于天池新闻推荐赛数据集实现的新闻推荐

## 赛题简介
此次比赛是新闻推荐场景下的用户行为预测挑战赛。该赛题以新闻 APP 中的新闻推荐为背景,旨在依据用户历史浏览、点击新闻文章的数据信息,预测用户未来的点击行为,即用户最后一次点击的新闻文章。其设计初衷是引导参赛者了解推荐系统的业务背景,以解决实际问题。

## 数据概况
数据来源于某新闻 APP 平台的用户交互数据,涵盖 30 万用户、近 300 万次点击,涉及 36 万多篇不同的新闻文章,且每篇新闻文章均有对应的 embedding 向量表示。为确保比赛公平性,从中抽取 20 万用户的点击日志数据作为训练集,5 万用户的点击日志数据作为测试集 A,5 万用户的点击日志数据作为测试集 B。

### 数据表
- **train_click_log.csv**:训练集用户点击日志
- **testA_click_log.csv**:测试集用户点击日志
- **articles.csv**:新闻文章信息数据表
- **articles_emb.csv**:新闻文章 embedding 向量表示
- **sample_submit.csv**:提交样例文件

### 字段表
| Field | Description |
|--|--|
| user_id | 用户 id |
| click_article_id | 点击文章 id |
| click_timestamp | 点击时间戳 |
| click_environment | 点击环境 |
| click_deviceGroup | 点击设备组 |
| click_os | 点击操作系统 |
| click_country | 点击国家 |
| click_region | 点击地区 |
| click_referrer_type | 点击来源类型 |
| article_id | 文章 id,与 click_article_id 相对应 |
| category_id | 文章类型 id |
| created_at_ts | 文章创建时间戳 |
| words_count | 文章字数 |
| emb_1,emb_2,…,emb_249 | 文章 embedding 向量表示 |

## 评价方式理解
结合最后的提交文件 sample_submit.csv 来理解评价方式。最终提交的格式要求针对每个用户给出五篇文章的推荐结果,并按照点击概率从前往后排序。而真实情况是每个用户最后一次点击的文章仅有一篇真实答案,通过查看推荐的五篇文章中是否命中真实答案来进行评价。例如对于 user1,提交格式为:user1,article1,article2,article3,article4,article5。

## 文件简介
### news1
主要功能是实现基于物品的协同过滤(ItemCF)推荐基线,并生成提交文件。
- 内存优化函数 reduce_mem:将 DataFrame 中的数值型列转换为更低精度的数据类型,从而减少内存使用。
- 数据采样 get_all_click_sample:从训练数据中随机抽取部分用户的点击记录,用于调试。
- 数据读取 get_all_click_df:根据运行模式(线上或线下),加载并合并训练和测试数据,生成全量的用户点击记录表。
- 用户点击行为字典 get_user_item_time:为每个用户生成点击文章的时间序列,便于后续计算物品相似度。
- 热门文章提取 get_item_topk_click:获取点击次数最多的前 K 篇文章,用于推荐结果的补全。

ItemCF 推荐算法:
- 物品相似度计算 itemcf_sim:基于用户的点击行为,计算文章之间的相似度矩阵;使用用户点击的文章序列,通过统计共现频率和惩罚因子(对点击文章数较多的用户进行平滑)计算相似度,并将生成的物品相似度矩阵保存到本地以供后续使用。
- 基于 ItemCF 的推荐 item_based_recommend:根据用户的历史点击文章,利用物品相似度矩阵为用户推荐相关的文章;若推荐的文章数量不足,则使用热门文章进行补全。

### news2
实现新闻推荐数据的可视化分析。

#### 用户点击日志分析
- 查看训练集数据基本信息(head()、info()、describe())。
- 确认用户总数、点击文章数量分布等统计信息。
- 直方图可视化:展示点击日志的各个列(如 click_article_id、click_environment、rank)的分布。

#### 文章数据分析
- 查看文章基本属性(如 words_count、category_id)的分布。
- 统计文章主题数量(category_id)及其出现频次。
- 分析嵌入向量数据的结构和维度。

#### 数据整合与用户行为分析
用户重复点击行为:
- 合并训练和测试集点击日志。
- 分析用户是否重复点击同一篇文章,统计重复点击次数的分布。

用户点击环境变化:
- 随机采样用户,分析其点击环境属性(如设备类型、操作系统、地区等)的变化。
- 通过条形图直观展示用户点击环境的分布。

点击次数分布:
- 用户点击文章的总次数分布。
- 新闻文章被点击次数的分布,特别是高频和低频新闻。

#### 新闻共现与属性分布分析
新闻共现频次:
- 统计用户连续点击的新闻对(共现)的次数。
- 可视化两篇新闻共现次数的分布。

新闻属性分布:
- 不同新闻主题(category_id)的分布及低频主题情况。
- 新闻字数(words_count)的描述性统计及分布。

用户新闻偏好:
- 用户点击的新闻主题数量分布(度量兴趣广泛性)。
- 用户点击新闻平均字数的分布(长文或短文偏好)。

#### 点击时间与相似度分析
点击时间分析:
- 使用 MinMaxScaler 对点击时间和文章创建时间归一化,便于比较。
- 计算每个用户两次点击的时间差和前后点击文章创建时间差,分析用户的点击行为模式。

文章相似性分析:
- 利用文章嵌入向量,计算用户前后点击的两篇文章之间的余弦相似度。
- 随机采样用户,通过折线图展示点击文章相似度的分布。

通过一系列分布图、条形图、折线图和散点图,展示用户点击行为和文章属性的多维度特征。

### news3
定义多种工具函数:
- 获取用户-文章-时间函数
- 获取文章-用户-时间函数
- 获取历史和最后一次点击
- 获取文章属性特征
- 获取用户历史点击的文章信息
- 获取点击次数最多的 Top-k 个文章
- 定义多路召回字典
- 召回效果评估(评估思路可参考下面的示意代码)

计算相似度矩阵:
- itemCF i2i_sim
- userCF u2u_sim
- item embedding sim

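下面给出一段召回命中率(hit rate@k)评估的示意代码,仅用于说明"召回效果评估"的思路:假设召回结果是 {user_id: [(article_id, score), ...]} 形式的字典(与 news4 中 recall_dict_2_df 处理的格式一致),真实答案取每个用户的最后一次点击;函数名 hit_rate_at_k 及参数均为示例自拟,并非仓库中的原始实现。

```python
def hit_rate_at_k(user_recall_items_dict, last_click_df, topk_list=(10, 20, 30, 40, 50)):
    """示意:统计每个用户的最后一次点击是否出现在其召回列表的前 k 个之中"""
    # 真实答案:{user_id: 最后一次点击的 article_id}
    last_click_dict = dict(zip(last_click_df['user_id'], last_click_df['click_article_id']))

    user_num = len(user_recall_items_dict)
    for k in topk_list:
        hit_num = 0
        for user, item_score_list in user_recall_items_dict.items():
            # 只取前 k 个召回的文章 id(召回列表元素为 (article_id, score))
            topk_items = [item for item, score in item_score_list[:k]]
            if last_click_dict.get(user) in topk_items:
                hit_num += 1
        print('topk: %d, 命中数: %d, 命中率: %.5f, 用户数: %d' % (k, hit_num, hit_num / user_num, user_num))
```

使用时可传入任意一路召回的结果字典和线下划分出的最后一次点击数据(例如 news4 中 trn_val_split 得到的 val_ans),即可对比各路召回的效果。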
实现召回评估,包括 127 | YoutubeDNN召回 128 | itemCF recall 129 | itemCF sim召回 130 | embedding sim 召回 131 | userCF召回 132 | userCF sim召回 133 | user embedding sim召回 134 | 135 | 解决冷启动问题。实现多路召回合并 136 | 137 | ### news4 138 | 实现特征向量提取,包括: 139 | 对训练数据做负采样 140 | 将召回数据转换成字典 141 | 用户历史行为相关特征 142 | 用户和文章特征 143 | 用户相关特征 144 | 用户特征直接读入 145 | 文章的特征直接读入 146 | 召回文章的主题是否在用户的爱好里面 147 | 148 | ### news5 149 | 使用LGB模型,DIN模型是进行排序,进行加权融合 150 | -------------------------------------------------------------------------------- /news2.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import matplotlib.pyplot as plt 4 | import seaborn as sns 5 | import gc 6 | import warnings 7 | from sklearn.preprocessing import MinMaxScaler 8 | 9 | # 设置显示中文字体 10 | plt.rc('font', family='SimHei', size=13) 11 | warnings.filterwarnings("ignore") 12 | 13 | # 定义内存释放函数 14 | def release_memory(): 15 | plt.clf() 16 | gc.collect() 17 | 18 | # 读取数据 19 | path = '../data/' 20 | trn_click = pd.read_csv(path + 'train_click_log.csv') 21 | item_df = pd.read_csv(path + 'articles.csv') 22 | item_df = item_df.rename(columns={'article_id': 'click_article_id'}) # 重命名,方便后续match 23 | item_emb_df = pd.read_csv(path + 'articles_emb.csv') 24 | tst_click = pd.read_csv(path + 'testA_click_log.csv') 25 | 26 | # 数据预处理 27 | trn_click['rank'] = trn_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int) 28 | tst_click['rank'] = tst_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int) 29 | trn_click['click_cnts'] = trn_click.groupby(['user_id'])['click_timestamp'].transform('count') 30 | tst_click['click_cnts'] = tst_click.groupby(['user_id'])['click_timestamp'].transform('count') 31 | 32 | # 合并 item_df 数据 33 | trn_click = trn_click.merge(item_df, how='left', on=['click_article_id']) 34 | tst_click = tst_click.merge(item_df, how='left', on=['click_article_id']) 35 | 36 | # 用户点击日志信息 37 | print(trn_click.info()) 38 | print(trn_click.describe()) 39 | 40 | # 限制绘制的数据量为前 20 条 41 | plt.figure(figsize=(15, 20), dpi=80) 42 | i = 1 43 | for col in ['click_article_id', 'click_timestamp', 'click_environment', 'click_deviceGroup', 'click_os', 44 | 'click_country', 'click_region', 'click_referrer_type', 'rank', 'click_cnts']: 45 | plt.subplot(5, 2, i) 46 | i += 1 47 | v = trn_click[col].value_counts().reset_index().head(20) 48 | v.columns = ['index', 'count'] 49 | fig = sns.barplot(x=v['index'], y=v['count']) 50 | for item in fig.get_xticklabels(): 51 | item.set_rotation(90) 52 | plt.title(col) 53 | 54 | plt.tight_layout() 55 | plt.show() 56 | 57 | release_memory() 58 | 59 | # 测试集用户点击日志 60 | print(tst_click.describe()) 61 | 62 | # 用户点击日志合并 63 | user_click_merge = pd.concat([trn_click, tst_click], ignore_index=True) 64 | 65 | # 新闻点击次数分析 66 | item_click_count = sorted(user_click_merge.groupby('click_article_id')['user_id'].count(), reverse=True) 67 | plt.plot(item_click_count[:100]) 68 | plt.show() 69 | 70 | release_memory() 71 | 72 | # 新闻共现频次:两篇新闻连续出现的次数 73 | tmp = user_click_merge.sort_values('click_timestamp') 74 | tmp['next_item'] = tmp.groupby(['user_id'])['click_article_id'].shift(-1) 75 | union_item = tmp.groupby(['click_article_id', 'next_item'])['click_timestamp'].size().reset_index(name='count').sort_values('count', ascending=False) 76 | print(union_item[['count']].describe()) 77 | plt.scatter(union_item['click_article_id'], union_item['count']) 78 | plt.show() 79 | 80 | release_memory() 81 | 82 | # 点击时间差的平均值 83 | mm = MinMaxScaler() 84 | 
user_click_merge[['click_timestamp', 'created_at_ts']] = mm.fit_transform(user_click_merge[['click_timestamp', 'created_at_ts']]) 85 | user_click_merge = user_click_merge.sort_values('click_timestamp') 86 | 87 | def mean_diff_time_func(df, col): 88 | df = pd.DataFrame(df, columns=[col]) 89 | df['time_shift1'] = df[col].shift(1).fillna(0) 90 | df['diff_time'] = abs(df[col] - df['time_shift1']) 91 | return df['diff_time'].mean() 92 | 93 | mean_diff_click_time = user_click_merge.groupby('user_id').apply(lambda x: mean_diff_time_func(x, 'click_timestamp')) 94 | plt.plot(sorted(mean_diff_click_time.values, reverse=True)) 95 | plt.show() 96 | 97 | release_memory() 98 | 99 | # 新闻embedding向量表示 100 | item_emb_df = item_emb_df.rename(columns={'article_id': 'click_article_id'}) 101 | item_idx_2_rawid_dict = dict(zip(item_emb_df['click_article_id'], item_emb_df.index)) 102 | del item_emb_df['click_article_id'] 103 | item_emb_np = np.ascontiguousarray(item_emb_df.values, dtype=np.float32) 104 | 105 | # 用户点击新闻相似度分析 106 | sub_user_ids = np.random.choice(user_click_merge.user_id.unique(), size=15, replace=False) 107 | sub_user_info = user_click_merge[user_click_merge['user_id'].isin(sub_user_ids)] 108 | 109 | def get_item_sim_list(df): 110 | sim_list = [] 111 | item_list = df['click_article_id'].values 112 | for i in range(0, len(item_list)-1): 113 | emb1 = item_emb_np[item_idx_2_rawid_dict[item_list[i]]] 114 | emb2 = item_emb_np[item_idx_2_rawid_dict[item_list[i+1]]] 115 | sim_list.append(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))) 116 | return sim_list 117 | 118 | for _, user_df in sub_user_info.groupby('user_id'): 119 | item_sim_list = get_item_sim_list(user_df) 120 | plt.plot(item_sim_list) 121 | plt.show() 122 | 123 | release_memory() 124 | 125 | # 随机用户的点击环境分布 126 | def plot_envs(df, cols, r, c): 127 | plt.figure(figsize=(10, 5), dpi=80) 128 | i = 1 129 | for col in cols: 130 | plt.subplot(r, c, i) 131 | i += 1 132 | v = df[col].value_counts().reset_index().head(20) 133 | v.columns = ['category', 'count'] 134 | fig = sns.barplot(x=v['category'], y=v['count']) 135 | for item in fig.get_xticklabels(): 136 | item.set_rotation(90) 137 | plt.title(col) 138 | plt.tight_layout() 139 | plt.show() 140 | 141 | sample_user_ids = np.random.choice(tst_click['user_id'].unique(), size=5, replace=False) 142 | sample_users = user_click_merge[user_click_merge['user_id'].isin(sample_user_ids)] 143 | cols = ['click_environment', 'click_deviceGroup', 'click_os', 'click_country', 'click_region', 'click_referrer_type'] 144 | for _, user_df in sample_users.groupby('user_id'): 145 | plot_envs(user_df, cols, 2, 3) 146 | release_memory() 147 | 148 | # 用户点击新闻数量的分布 149 | user_click_item_count = sorted(user_click_merge.groupby('user_id')['click_article_id'].count(), reverse=True) 150 | plt.plot(user_click_item_count[:50]) 151 | plt.show() 152 | 153 | release_memory() 154 | -------------------------------------------------------------------------------- /news1.py: -------------------------------------------------------------------------------- 1 | # 导包 2 | # import packages 3 | import time, math, os 4 | 5 | import collections 6 | from tqdm import tqdm 7 | import gc 8 | import pickle 9 | import random 10 | from datetime import datetime 11 | from operator import itemgetter 12 | import numpy as np 13 | import pandas as pd 14 | import warnings 15 | from collections import defaultdict 16 | 17 | warnings.filterwarnings('ignore') 18 | data_path = '../data/' 19 | save_path = '../result/' 20 | 21 | 22 | # df节省内存函数 
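# (补充注释)下面的 reduce_mem 只处理数值列:根据每列的最小/最大值,
# 把整数列降级为能容纳取值范围的最小整数类型(int8/int16/int32/int64),
# 把浮点列降级为 float16/float32/float64,非数值列保持原样。
# 另外补充一点提示(非原代码的约束):float16 只有约 3~4 位有效数字,
# 若后续特征计算对精度敏感,可以考虑只降级到 float32。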
23 | # 节约内存的一个标配函数 24 | def reduce_mem(df): 25 | starttime = time.time() 26 | numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64'] 27 | start_mem = df.memory_usage().sum() / 1024 ** 2 28 | for col in df.columns: 29 | col_type = df[col].dtypes 30 | if col_type in numerics: 31 | c_min = df[col].min() 32 | c_max = df[col].max() 33 | if pd.isnull(c_min) or pd.isnull(c_max): 34 | continue 35 | if str(col_type)[:3] == 'int': 36 | if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max: 37 | df[col] = df[col].astype(np.int8) 38 | elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max: 39 | df[col] = df[col].astype(np.int16) 40 | elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max: 41 | df[col] = df[col].astype(np.int32) 42 | elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max: 43 | df[col] = df[col].astype(np.int64) 44 | else: 45 | if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max: 46 | df[col] = df[col].astype(np.float16) 47 | elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max: 48 | df[col] = df[col].astype(np.float32) 49 | else: 50 | df[col] = df[col].astype(np.float64) 51 | end_mem = df.memory_usage().sum() / 1024 ** 2 52 | print('-- Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction),time spend:{:2.2f} min'.format(end_mem, 53 | 100 * ( 54 | start_mem - end_mem) / start_mem, 55 | ( 56 | time.time() - starttime) / 60)) 57 | return df 58 | 59 | 60 | # 读取采样或全量数据 61 | # debug模式:从训练集中划出一部分数据来调试代码 62 | def get_all_click_sample(data_path, sample_nums=10000): 63 | """ 64 | 训练集中采样一部分数据调试 65 | data_path: 原数据的存储路径 66 | sample_nums: 采样数目(这里由于机器的内存限制,可以采样用户做) 67 | """ 68 | all_click = pd.read_csv(data_path + 'train_click_log.csv') 69 | all_user_ids = all_click.user_id.unique() 70 | 71 | sample_user_ids = np.random.choice(all_user_ids, size=sample_nums, replace=False) 72 | all_click = all_click[all_click['user_id'].isin(sample_user_ids)] 73 | 74 | all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp'])) 75 | return all_click 76 | 77 | 78 | # 读取点击数据,这里分成线上和线下,如果是为了获取线上提交结果应该讲测试集中的点击数据合并到总的数据中 79 | # 如果是为了线下验证模型的有效性或者特征的有效性,可以只使用训练集 80 | def get_all_click_df(data_path='../data/', offline=True): 81 | if offline: 82 | all_click = pd.read_csv(data_path + 'train_click_log.csv') 83 | else: 84 | trn_click = pd.read_csv(data_path + 'train_click_log.csv') 85 | tst_click = pd.read_csv(data_path + 'testA_click_log.csv') 86 | all_click = pd.concat([trn_click, tst_click]) 87 | # all_click = trn_click.append(tst_click) 88 | 89 | 90 | all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp'])) 91 | return all_click 92 | 93 | 94 | # 全量训练集 95 | all_click_df = get_all_click_df(offline=False) 96 | 97 | 98 | # 获取 99 | # 用户 - 文章 - 点击时间字典 100 | # 根据点击时间获取用户的点击文章序列 {user1: [(item1: time1), (item2: time2)..]...} 101 | def get_user_item_time(click_df): 102 | click_df = click_df.sort_values('click_timestamp') 103 | 104 | def make_item_time_pair(df): 105 | return list(zip(df['click_article_id'], df['click_timestamp'])) 106 | 107 | # user_item_time_df = click_df.groupby('user_id')['click_article_id', 'click_timestamp'].apply( 108 | # lambda x: make_item_time_pair(x)) \ 109 | # .reset_index().rename(columns={0: 'item_time_list'}) 110 | user_item_time_df = click_df.groupby('user_id')[['click_article_id', 'click_timestamp']].apply( 111 | lambda x: make_item_time_pair(x) 112 | ).reset_index().rename(columns={0: 'item_time_list'}) 113 | 114 | 
user_item_time_dict = dict(zip(user_item_time_df['user_id'], user_item_time_df['item_time_list'])) 115 | 116 | return user_item_time_dict 117 | 118 | 119 | # 获取点击最多的Topk个文章 120 | # 获取近期点击最多的文章 121 | def get_item_topk_click(click_df, k): 122 | topk_click = click_df['click_article_id'].value_counts().index[:k] 123 | return topk_click 124 | 125 | 126 | # itemCF的物品相似度计算 127 | def itemcf_sim(df): 128 | """ 129 | 文章与文章之间的相似性矩阵计算 130 | :param df: 数据表 131 | :item_created_time_dict: 文章创建时间的字典 132 | return : 文章与文章的相似性矩阵 133 | 思路: 基于物品的协同过滤(详细请参考上一期推荐系统基础的组队学习), 在多路召回部分会加上关联规则的召回策略 134 | """ 135 | 136 | user_item_time_dict = get_user_item_time(df) 137 | 138 | # 计算物品相似度 139 | i2i_sim = {} 140 | item_cnt = defaultdict(int) 141 | for user, item_time_list in tqdm(user_item_time_dict.items()): 142 | # 在基于商品的协同过滤优化的时候可以考虑时间因素 143 | for i, i_click_time in item_time_list: 144 | item_cnt[i] += 1 145 | i2i_sim.setdefault(i, {}) 146 | for j, j_click_time in item_time_list: 147 | if (i == j): 148 | continue 149 | i2i_sim[i].setdefault(j, 0) 150 | 151 | i2i_sim[i][j] += 1 / math.log(len(item_time_list) + 1) 152 | 153 | i2i_sim_ = i2i_sim.copy() 154 | for i, related_items in i2i_sim.items(): 155 | for j, wij in related_items.items(): 156 | i2i_sim_[i][j] = wij / math.sqrt(item_cnt[i] * item_cnt[j]) 157 | 158 | # 将得到的相似性矩阵保存到本地 159 | pickle.dump(i2i_sim_, open(save_path + 'itemcf_i2i_sim.pkl', 'wb')) 160 | 161 | return i2i_sim_ 162 | 163 | 164 | i2i_sim = itemcf_sim(all_click_df) 165 | 166 | 167 | # itemCF的文章推荐 168 | # 基于商品的召回i2i 169 | def item_based_recommend(user_id, user_item_time_dict, i2i_sim, sim_item_topk, recall_item_num, item_topk_click): 170 | """ 171 | 基于文章协同过滤的召回 172 | :param user_id: 用户id 173 | :param user_item_time_dict: 字典, 根据点击时间获取用户的点击文章序列[(item1: time1), (item2: time2)..] 174 | :param i2i_sim: 字典,文章相似性矩阵 175 | :param sim_item_topk: 整数, 选择与当前文章最相似的前k篇文章 176 | :param recall_item_num: 整数, 最后的召回文章数量 177 | :param item_topk_click: 列表,点击次数最多的文章列表,用户召回补全 178 | return: 召回的文章列表 [item1:score1, item2: score2...] 
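    (补充说明:实际返回的是 [(item1, score1), (item2, score2), ...] 形式的元组列表;热门补全项的分数是人为设定的负数,仅用于排序占位)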
179 | 注意: 基于物品的协同过滤(详细请参考上一期推荐系统基础的组队学习), 在多路召回部分会加上关联规则的召回策略 180 | """ 181 | 182 | # 获取用户历史交互的文章 183 | user_hist_items = user_item_time_dict[user_id] # 注意,此时获取得到的是一个元组列表,需要将里面的user_id提取出来 184 | user_hist_items_ = {user_id for user_id, _ in user_hist_items} 185 | 186 | item_rank = {} 187 | for loc, (i, click_time) in enumerate(user_hist_items): 188 | for j, wij in sorted(i2i_sim[i].items(), key=lambda x: x[1], reverse=True)[:sim_item_topk]: 189 | if j in user_hist_items_: 190 | continue 191 | 192 | item_rank.setdefault(j, 0) 193 | item_rank[j] += wij 194 | 195 | # 不足10个,用热门商品补全 196 | if len(item_rank) < recall_item_num: 197 | for i, item in enumerate(item_topk_click): 198 | if item in item_rank.items(): # 填充的item应该不在原来的列表中 199 | continue 200 | item_rank[item] = - i - 100 # 随便给个负数就行 201 | if len(item_rank) == recall_item_num: 202 | break 203 | 204 | item_rank = sorted(item_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num] 205 | 206 | return item_rank 207 | 208 | 209 | # 给每个用户根据物品的协同过滤推荐文章 210 | # 定义 211 | user_recall_items_dict = collections.defaultdict(dict) 212 | 213 | # 获取 用户 - 文章 - 点击时间的字典 214 | user_item_time_dict = get_user_item_time(all_click_df) 215 | 216 | # 去取文章相似度 217 | i2i_sim = pickle.load(open(save_path + 'itemcf_i2i_sim.pkl', 'rb')) 218 | 219 | # 相似文章的数量 220 | sim_item_topk = 10 221 | 222 | # 召回文章数量 223 | recall_item_num = 10 224 | 225 | # 用户热度补全 226 | item_topk_click = get_item_topk_click(all_click_df, k=50) 227 | 228 | for user in tqdm(all_click_df['user_id'].unique()): 229 | user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, i2i_sim, 230 | sim_item_topk, recall_item_num, item_topk_click) 231 | # 召回字典转换成df 232 | # 将字典的形式转换成df 233 | user_item_score_list = [] 234 | 235 | for user, items in tqdm(user_recall_items_dict.items()): 236 | for item, score in items: 237 | user_item_score_list.append([user, item, score]) 238 | 239 | recall_df = pd.DataFrame(user_item_score_list, columns=['user_id', 'click_article_id', 'pred_score']) 240 | 241 | 242 | # 生成提交文件 243 | def submit(recall_df, topk=5, model_name=None): 244 | recall_df = recall_df.sort_values(by=['user_id', 'pred_score']) 245 | recall_df['rank'] = recall_df.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first') 246 | 247 | # 判断是不是每个用户都有5篇文章及以上 248 | tmp = recall_df.groupby('user_id').apply(lambda x: x['rank'].max()) 249 | assert tmp.min() >= topk 250 | 251 | del recall_df['pred_score'] 252 | submit = recall_df[recall_df['rank'] <= topk].set_index(['user_id', 'rank']).unstack(-1).reset_index() 253 | 254 | submit.columns = [int(col) if isinstance(col, int) else col for col in submit.columns.droplevel(0)] 255 | # 按照提交格式定义列名 256 | submit = submit.rename(columns={'': 'user_id', 1: 'article_1', 2: 'article_2', 257 | 3: 'article_3', 4: 'article_4', 5: 'article_5'}) 258 | 259 | save_name = save_path + model_name + '_' + datetime.today().strftime('%m-%d') + '.csv' 260 | submit.to_csv(save_name, index=False, header=True) 261 | 262 | 263 | # 获取测试集 264 | tst_click = pd.read_csv(data_path + 'testA_click_log.csv') 265 | tst_users = tst_click['user_id'].unique() 266 | 267 | # 从所有的召回数据中将测试集中的用户选出来 268 | tst_recall = recall_df[recall_df['user_id'].isin(tst_users)] 269 | 270 | # 生成提交文件 271 | submit(tst_recall, topk=5, model_name='itemcf_baseline') 272 | -------------------------------------------------------------------------------- /news5.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import 
pickle 4 | from tqdm import tqdm 5 | import gc, os 6 | import time 7 | from datetime import datetime 8 | import lightgbm as lgb 9 | from sklearn.preprocessing import MinMaxScaler 10 | import warnings 11 | warnings.filterwarnings('ignore') 12 | 13 | data_path = '../data_raw/' 14 | save_path = '../temp_results/' 15 | offline = False 16 | 17 | # 重新读取数据的时候,发现click_article_id是一个浮点数,所以将其转换成int类型 18 | trn_user_item_feats_df = pd.read_csv(save_path + 'trn_user_item_feats_df.csv') 19 | trn_user_item_feats_df['click_article_id'] = trn_user_item_feats_df['click_article_id'].astype(int) 20 | 21 | if offline: 22 | val_user_item_feats_df = pd.read_csv(save_path + 'val_user_item_feats_df.csv') 23 | val_user_item_feats_df['click_article_id'] = val_user_item_feats_df['click_article_id'].astype(int) 24 | else: 25 | val_user_item_feats_df = None 26 | 27 | tst_user_item_feats_df = pd.read_csv(save_path + 'tst_user_item_feats_df.csv') 28 | tst_user_item_feats_df['click_article_id'] = tst_user_item_feats_df['click_article_id'].astype(int) 29 | 30 | # 做特征的时候为了方便,给测试集也打上了一个无效的标签,这里直接删掉就行 31 | del tst_user_item_feats_df['label'] 32 | 33 | 34 | def submit(recall_df, topk=5, model_name=None): 35 | recall_df = recall_df.sort_values(by=['user_id', 'pred_score']) 36 | recall_df['rank'] = recall_df.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first') 37 | 38 | # 判断是不是每个用户都有5篇文章及以上 39 | tmp = recall_df.groupby('user_id').apply(lambda x: x['rank'].max()) 40 | assert tmp.min() >= topk 41 | 42 | del recall_df['pred_score'] 43 | submit = recall_df[recall_df['rank'] <= topk].set_index(['user_id', 'rank']).unstack(-1).reset_index() 44 | 45 | submit.columns = [int(col) if isinstance(col, int) else col for col in submit.columns.droplevel(0)] 46 | # 按照提交格式定义列名 47 | submit = submit.rename(columns={'': 'user_id', 1: 'article_1', 2: 'article_2', 48 | 3: 'article_3', 4: 'article_4', 5: 'article_5'}) 49 | 50 | save_name = save_path + model_name + '_' + datetime.today().strftime('%m-%d') + '.csv' 51 | submit.to_csv(save_name, index=False, header=True) 52 | 53 | # 排序结果归一化 54 | def norm_sim(sim_df, weight=0.0): 55 | # print(sim_df.head()) 56 | min_sim = sim_df.min() 57 | max_sim = sim_df.max() 58 | if max_sim == min_sim: 59 | sim_df = sim_df.apply(lambda sim: 1.0) 60 | else: 61 | sim_df = sim_df.apply(lambda sim: 1.0 * (sim - min_sim) / (max_sim - min_sim)) 62 | 63 | sim_df = sim_df.apply(lambda sim: sim + weight) # plus one 64 | return sim_df 65 | 66 | 67 | # 防止中间出错之后重新读取数据 68 | trn_user_item_feats_df_rank_model = trn_user_item_feats_df.copy() 69 | 70 | if offline: 71 | val_user_item_feats_df_rank_model = val_user_item_feats_df.copy() 72 | 73 | tst_user_item_feats_df_rank_model = tst_user_item_feats_df.copy() 74 | 75 | # 定义特征列 76 | lgb_cols = ['sim0', 'time_diff0', 'word_diff0','sim_max', 'sim_min', 'sim_sum', 77 | 'sim_mean', 'score','click_size', 'time_diff_mean', 'active_level', 78 | 'click_environment','click_deviceGroup', 'click_os', 'click_country', 79 | 'click_region','click_referrer_type', 'user_time_hob1', 'user_time_hob2', 80 | 'words_hbo', 'category_id', 'created_at_ts','words_count'] 81 | 82 | # 排序模型分组 83 | trn_user_item_feats_df_rank_model.sort_values(by=['user_id'], inplace=True) 84 | g_train = trn_user_item_feats_df_rank_model.groupby(['user_id'], as_index=False).count()["label"].values 85 | 86 | if offline: 87 | val_user_item_feats_df_rank_model.sort_values(by=['user_id'], inplace=True) 88 | g_val = val_user_item_feats_df_rank_model.groupby(['user_id'], as_index=False).count()["label"].values 89 | 90 
| # 排序模型定义 91 | lgb_ranker = lgb.LGBMRanker(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1, 92 | max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1, 93 | learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16) 94 | 95 | # 排序模型训练 96 | if offline: 97 | lgb_ranker.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'], group=g_train, 98 | eval_set=[(val_user_item_feats_df_rank_model[lgb_cols], val_user_item_feats_df_rank_model['label'])], 99 | eval_group= [g_val], eval_at=[1, 2, 3, 4, 5], eval_metric=['ndcg', ], early_stopping_rounds=50, ) 100 | else: 101 | lgb_ranker.fit(trn_user_item_feats_df[lgb_cols], trn_user_item_feats_df['label'], group=g_train) 102 | 103 | # 模型预测 104 | tst_user_item_feats_df['pred_score'] = lgb_ranker.predict(tst_user_item_feats_df[lgb_cols], num_iteration=lgb_ranker.best_iteration_) 105 | 106 | # 将这里的排序结果保存一份,用户后面的模型融合 107 | tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'lgb_ranker_score.csv', index=False) 108 | 109 | # 预测结果重新排序, 及生成提交结果 110 | rank_results = tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']] 111 | rank_results['click_article_id'] = rank_results['click_article_id'].astype(int) 112 | submit(rank_results, topk=5, model_name='lgb_ranker') 113 | 114 | # 预测结果重新排序, 及生成提交结果 115 | rank_results = tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']] 116 | rank_results['click_article_id'] = rank_results['click_article_id'].astype(int) 117 | submit(rank_results, topk=5, model_name='lgb_ranker') 118 | 119 | 120 | # 五折交叉验证,这里的五折交叉是以用户为目标进行五折划分 121 | # 这一部分与前面的单独训练和验证是分开的 122 | def get_kfold_users(trn_df, n=5): 123 | user_ids = trn_df['user_id'].unique() 124 | user_set = [user_ids[i::n] for i in range(n)] 125 | return user_set 126 | 127 | 128 | k_fold = 5 129 | trn_df = trn_user_item_feats_df_rank_model 130 | user_set = get_kfold_users(trn_df, n=k_fold) 131 | 132 | score_list = [] 133 | score_df = trn_df[['user_id', 'click_article_id', 'label']] 134 | sub_preds = np.zeros(tst_user_item_feats_df_rank_model.shape[0]) 135 | 136 | # 五折交叉验证,并将中间结果保存用于staking 137 | for n_fold, valid_user in enumerate(user_set): 138 | train_idx = trn_df[~trn_df['user_id'].isin(valid_user)] # add slide user 139 | valid_idx = trn_df[trn_df['user_id'].isin(valid_user)] 140 | 141 | # 训练集与验证集的用户分组 142 | train_idx.sort_values(by=['user_id'], inplace=True) 143 | g_train = train_idx.groupby(['user_id'], as_index=False).count()["label"].values 144 | 145 | valid_idx.sort_values(by=['user_id'], inplace=True) 146 | g_val = valid_idx.groupby(['user_id'], as_index=False).count()["label"].values 147 | 148 | # 定义模型 149 | lgb_ranker = lgb.LGBMRanker(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1, 150 | max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1, 151 | learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs=16) 152 | # 训练模型 153 | lgb_ranker.fit(train_idx[lgb_cols], train_idx['label'], group=g_train, 154 | eval_set=[(valid_idx[lgb_cols], valid_idx['label'])], eval_group=[g_val], 155 | eval_at=[1, 2, 3, 4, 5], eval_metric=['ndcg', ], early_stopping_rounds=50, ) 156 | 157 | # 预测验证集结果 158 | valid_idx['pred_score'] = lgb_ranker.predict(valid_idx[lgb_cols], num_iteration=lgb_ranker.best_iteration_) 159 | 160 | # 对输出结果进行归一化 161 | valid_idx['pred_score'] = valid_idx[['pred_score']].transform(lambda x: norm_sim(x)) 162 | 163 | valid_idx.sort_values(by=['user_id', 
'pred_score']) 164 | valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first') 165 | 166 | # 将验证集的预测结果放到一个列表中,后面进行拼接 167 | score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']]) 168 | 169 | # 如果是线上测试,需要计算每次交叉验证的结果相加,最后求平均 170 | if not offline: 171 | sub_preds += lgb_ranker.predict(tst_user_item_feats_df_rank_model[lgb_cols], lgb_ranker.best_iteration_) 172 | 173 | score_df_ = pd.concat(score_list, axis=0) 174 | score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id']) 175 | # 保存训练集交叉验证产生的新特征 176 | score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv( 177 | save_path + 'trn_lgb_ranker_feats.csv', index=False) 178 | 179 | # 测试集的预测结果,多次交叉验证求平均,将预测的score和对应的rank特征保存,可以用于后面的staking,这里还可以构造其他更多的特征 180 | tst_user_item_feats_df_rank_model['pred_score'] = sub_preds / k_fold 181 | tst_user_item_feats_df_rank_model['pred_score'] = tst_user_item_feats_df_rank_model['pred_score'].transform( 182 | lambda x: norm_sim(x)) 183 | tst_user_item_feats_df_rank_model.sort_values(by=['user_id', 'pred_score']) 184 | tst_user_item_feats_df_rank_model['pred_rank'] = tst_user_item_feats_df_rank_model.groupby(['user_id'])[ 185 | 'pred_score'].rank(ascending=False, method='first') 186 | 187 | # 保存测试集交叉验证的新特征 188 | tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score', 'pred_rank']].to_csv( 189 | save_path + 'tst_lgb_ranker_feats.csv', index=False) 190 | 191 | # 预测结果重新排序, 及生成提交结果 192 | # 单模型生成提交结果 193 | rank_results = tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score']] 194 | rank_results['click_article_id'] = rank_results['click_article_id'].astype(int) 195 | submit(rank_results, topk=5, model_name='lgb_ranker') 196 | 197 | # 模型及参数的定义 198 | lgb_Classfication = lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1, 199 | max_depth=-1, n_estimators=500, subsample=0.7, colsample_bytree=0.7, subsample_freq=1, 200 | learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16, verbose=10) 201 | 202 | # 模型及参数的定义 203 | lgb_Classfication = lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1, 204 | max_depth=-1, n_estimators=500, subsample=0.7, colsample_bytree=0.7, subsample_freq=1, 205 | learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16, verbose=10) 206 | 207 | # 模型及参数的定义 208 | lgb_Classfication = lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1, 209 | max_depth=-1, n_estimators=500, subsample=0.7, colsample_bytree=0.7, subsample_freq=1, 210 | learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16, verbose=10) 211 | 212 | # 模型训练 213 | if offline: 214 | lgb_Classfication.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'], 215 | eval_set=[(val_user_item_feats_df_rank_model[lgb_cols], val_user_item_feats_df_rank_model['label'])], 216 | eval_metric=['auc', ],early_stopping_rounds=50, ) 217 | else: 218 | lgb_Classfication.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label']) 219 | 220 | # 模型预测 221 | tst_user_item_feats_df['pred_score'] = lgb_Classfication.predict_proba(tst_user_item_feats_df[lgb_cols])[:,1] 222 | 223 | # 将这里的排序结果保存一份,用户后面的模型融合 224 | tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'lgb_cls_score.csv', index=False) 225 | 226 | # 预测结果重新排序, 及生成提交结果 227 | 
rank_results = tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']] 228 | rank_results['click_article_id'] = rank_results['click_article_id'].astype(int) 229 | submit(rank_results, topk=5, model_name='lgb_cls') 230 | 231 | 232 | # 五折交叉验证,这里的五折交叉是以用户为目标进行五折划分 233 | # 这一部分与前面的单独训练和验证是分开的 234 | def get_kfold_users(trn_df, n=5): 235 | user_ids = trn_df['user_id'].unique() 236 | user_set = [user_ids[i::n] for i in range(n)] 237 | return user_set 238 | 239 | 240 | k_fold = 5 241 | trn_df = trn_user_item_feats_df_rank_model 242 | user_set = get_kfold_users(trn_df, n=k_fold) 243 | 244 | score_list = [] 245 | score_df = trn_df[['user_id', 'click_article_id', 'label']] 246 | sub_preds = np.zeros(tst_user_item_feats_df_rank_model.shape[0]) 247 | 248 | # 五折交叉验证,并将中间结果保存用于staking 249 | for n_fold, valid_user in enumerate(user_set): 250 | train_idx = trn_df[~trn_df['user_id'].isin(valid_user)] # add slide user 251 | valid_idx = trn_df[trn_df['user_id'].isin(valid_user)] 252 | 253 | # 模型及参数的定义 254 | lgb_Classfication = lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1, 255 | max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, 256 | subsample_freq=1, 257 | learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs=16, 258 | verbose=10) 259 | # 训练模型 260 | lgb_Classfication.fit(train_idx[lgb_cols], train_idx['label'], eval_set=[(valid_idx[lgb_cols], valid_idx['label'])], 261 | eval_metric=['auc', ], early_stopping_rounds=50, ) 262 | 263 | # 预测验证集结果 264 | valid_idx['pred_score'] = lgb_Classfication.predict_proba(valid_idx[lgb_cols], 265 | num_iteration=lgb_Classfication.best_iteration_)[:, 1] 266 | 267 | # 对输出结果进行归一化 分类模型输出的值本身就是一个概率值不需要进行归一化 268 | # valid_idx['pred_score'] = valid_idx[['pred_score']].transform(lambda x: norm_sim(x)) 269 | 270 | valid_idx.sort_values(by=['user_id', 'pred_score']) 271 | valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first') 272 | 273 | # 将验证集的预测结果放到一个列表中,后面进行拼接 274 | score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']]) 275 | 276 | # 如果是线上测试,需要计算每次交叉验证的结果相加,最后求平均 277 | if not offline: 278 | sub_preds += lgb_Classfication.predict_proba(tst_user_item_feats_df_rank_model[lgb_cols], 279 | num_iteration=lgb_Classfication.best_iteration_)[:, 1] 280 | 281 | score_df_ = pd.concat(score_list, axis=0) 282 | score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id']) 283 | # 保存训练集交叉验证产生的新特征 284 | score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv( 285 | save_path + 'trn_lgb_cls_feats.csv', index=False) 286 | 287 | # 测试集的预测结果,多次交叉验证求平均,将预测的score和对应的rank特征保存,可以用于后面的staking,这里还可以构造其他更多的特征 288 | tst_user_item_feats_df_rank_model['pred_score'] = sub_preds / k_fold 289 | tst_user_item_feats_df_rank_model['pred_score'] = tst_user_item_feats_df_rank_model['pred_score'].transform( 290 | lambda x: norm_sim(x)) 291 | tst_user_item_feats_df_rank_model.sort_values(by=['user_id', 'pred_score']) 292 | tst_user_item_feats_df_rank_model['pred_rank'] = tst_user_item_feats_df_rank_model.groupby(['user_id'])[ 293 | 'pred_score'].rank(ascending=False, method='first') 294 | 295 | # 保存测试集交叉验证的新特征 296 | tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score', 'pred_rank']].to_csv( 297 | save_path + 'tst_lgb_cls_feats.csv', index=False) 298 | 299 | # 预测结果重新排序, 及生成提交结果 300 | rank_results = tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score']] 
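# (补充注释)此处的 pred_score 是上面五折交叉验证对测试集预测结果的平均值,并已用 norm_sim 归一化;
# submit() 内部会按 pred_score 对每个用户重新排序取前 5,因此分数只需保证相对大小正确即可。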
301 | rank_results['click_article_id'] = rank_results['click_article_id'].astype(int) 302 | submit(rank_results, topk=5, model_name='lgb_cls') 303 | 304 | if offline: 305 | all_data = pd.read_csv('./data_raw/train_click_log.csv') 306 | else: 307 | trn_data = pd.read_csv('./data_raw/train_click_log.csv') 308 | tst_data = pd.read_csv('./data_raw/testA_click_log.csv') 309 | all_data = trn_data.append(tst_data) 310 | 311 | hist_click =all_data[['user_id', 'click_article_id']].groupby('user_id').agg({list}).reset_index() 312 | his_behavior_df = pd.DataFrame() 313 | his_behavior_df['user_id'] = hist_click['user_id'] 314 | his_behavior_df['hist_click_article_id'] = hist_click['click_article_id'] 315 | 316 | trn_user_item_feats_df_din_model = trn_user_item_feats_df.copy() 317 | 318 | if offline: 319 | val_user_item_feats_df_din_model = val_user_item_feats_df.copy() 320 | else: 321 | val_user_item_feats_df_din_model = None 322 | 323 | tst_user_item_feats_df_din_model = tst_user_item_feats_df.copy() 324 | 325 | trn_user_item_feats_df_din_model = trn_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id') 326 | 327 | if offline: 328 | val_user_item_feats_df_din_model = val_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id') 329 | else: 330 | val_user_item_feats_df_din_model = None 331 | 332 | tst_user_item_feats_df_din_model = tst_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id') 333 | 334 | 335 | 336 | # 读取多个模型的排序结果文件 337 | lgb_ranker = pd.read_csv(save_path + 'lgb_ranker_score.csv') 338 | lgb_cls = pd.read_csv(save_path + 'lgb_cls_score.csv') 339 | din_ranker = pd.read_csv(save_path + 'din_rank_score.csv') 340 | 341 | # 这里也可以换成交叉验证输出的测试结果进行加权融合 342 | rank_model = {'lgb_ranker': lgb_ranker, 343 | 'lgb_cls': lgb_cls, 344 | 'din_ranker': din_ranker} 345 | 346 | 347 | def get_ensumble_predict_topk(rank_model, topk=5): 348 | final_recall = rank_model['lgb_cls'].append(rank_model['din_ranker']) 349 | rank_model['lgb_ranker']['pred_score'] = rank_model['lgb_ranker']['pred_score'].transform(lambda x: norm_sim(x)) 350 | 351 | final_recall = final_recall.append(rank_model['lgb_ranker']) 352 | final_recall = final_recall.groupby(['user_id', 'click_article_id'])['pred_score'].sum().reset_index() 353 | 354 | submit(final_recall, topk=topk, model_name='ensemble_fuse') 355 | 356 | get_ensumble_predict_topk(rank_model) 357 | 358 | # 读取多个模型的交叉验证生成的结果文件 359 | # 训练集 360 | trn_lgb_ranker_feats = pd.read_csv(save_path + 'trn_lgb_ranker_feats.csv') 361 | trn_lgb_cls_feats = pd.read_csv(save_path + 'trn_lgb_cls_feats.csv') 362 | trn_din_cls_feats = pd.read_csv(save_path + 'trn_din_cls_feats.csv') 363 | 364 | # 测试集 365 | tst_lgb_ranker_feats = pd.read_csv(save_path + 'tst_lgb_ranker_feats.csv') 366 | tst_lgb_cls_feats = pd.read_csv(save_path + 'tst_lgb_cls_feats.csv') 367 | tst_din_cls_feats = pd.read_csv(save_path + 'tst_din_cls_feats.csv') 368 | 369 | # 将多个模型输出的特征进行拼接 370 | 371 | finall_trn_ranker_feats = trn_lgb_ranker_feats[['user_id', 'click_article_id', 'label']] 372 | finall_tst_ranker_feats = tst_lgb_ranker_feats[['user_id', 'click_article_id']] 373 | 374 | for idx, trn_model in enumerate([trn_lgb_ranker_feats, trn_lgb_cls_feats, trn_din_cls_feats]): 375 | for feat in [ 'pred_score', 'pred_rank']: 376 | col_name = feat + '_' + str(idx) 377 | finall_trn_ranker_feats[col_name] = trn_model[feat] 378 | 379 | for idx, tst_model in enumerate([tst_lgb_ranker_feats, tst_lgb_cls_feats, tst_din_cls_feats]): 380 | for feat in [ 'pred_score', 'pred_rank']: 381 | col_name = feat + '_' + 
str(idx) 382 | finall_tst_ranker_feats[col_name] = tst_model[feat] 383 | 384 | # 定义一个逻辑回归模型再次拟合交叉验证产生的特征对测试集进行预测 385 | # 这里需要注意的是,在做交叉验证的时候可以构造多一些与输出预测值相关的特征,来丰富这里简单模型的特征 386 | from sklearn.linear_model import LogisticRegression 387 | 388 | feat_cols = ['pred_score_0', 'pred_rank_0', 'pred_score_1', 'pred_rank_1', 'pred_score_2', 'pred_rank_2'] 389 | 390 | trn_x = finall_trn_ranker_feats[feat_cols] 391 | trn_y = finall_trn_ranker_feats['label'] 392 | 393 | tst_x = finall_tst_ranker_feats[feat_cols] 394 | 395 | # 定义模型 396 | lr = LogisticRegression() 397 | 398 | # 模型训练 399 | lr.fit(trn_x, trn_y) 400 | 401 | # 模型预测 402 | finall_tst_ranker_feats['pred_score'] = lr.predict_proba(tst_x)[:, 1] 403 | 404 | # 预测结果重新排序, 及生成提交结果 405 | rank_results = finall_tst_ranker_feats[['user_id', 'click_article_id', 'pred_score']] 406 | submit(rank_results, topk=5, model_name='ensumble_staking') 407 | -------------------------------------------------------------------------------- /news4.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import pickle 4 | 5 | from faiss import Index 6 | from tqdm import tqdm 7 | import gc, os 8 | import logging 9 | import time 10 | import lightgbm as lgb 11 | from gensim.models import Word2Vec 12 | from sklearn.preprocessing import MinMaxScaler 13 | import warnings 14 | 15 | from news.news3 import all_click_df 16 | 17 | warnings.filterwarnings('ignore') 18 | 19 | 20 | # 节省内存的一个函数 21 | # 减少内存 22 | def reduce_mem(df): 23 | starttime = time.time() 24 | numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64'] 25 | start_mem = df.memory_usage().sum() / 1024 ** 2 26 | for col in df.columns: 27 | col_type = df[col].dtypes 28 | if col_type in numerics: 29 | c_min = df[col].min() 30 | c_max = df[col].max() 31 | if pd.isnull(c_min) or pd.isnull(c_max): 32 | continue 33 | if str(col_type)[:3] == 'int': 34 | if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max: 35 | df[col] = df[col].astype(np.int8) 36 | elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max: 37 | df[col] = df[col].astype(np.int16) 38 | elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max: 39 | df[col] = df[col].astype(np.int32) 40 | elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max: 41 | df[col] = df[col].astype(np.int64) 42 | else: 43 | if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max: 44 | df[col] = df[col].astype(np.float16) 45 | elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max: 46 | df[col] = df[col].astype(np.float32) 47 | else: 48 | df[col] = df[col].astype(np.float64) 49 | end_mem = df.memory_usage().sum() / 1024 ** 2 50 | print('-- Mem. 
usage decreased to {:5.2f} Mb ({:.1f}% reduction),time spend:{:2.2f} min'.format(end_mem, 51 | 100 * ( 52 | start_mem - end_mem) / start_mem, 53 | ( 54 | time.time() - starttime) / 60)) 55 | return df 56 | 57 | 58 | data_path = '../data_raw/' 59 | save_path = '../temp_results/' 60 | 61 | sample_user_nums = 100000 62 | 63 | 64 | # all_click_df指的是训练集 65 | # sample_user_nums 采样作为验证集的用户数量 66 | def trn_val_split(all_click_df, sample_user_nums): 67 | all_click = all_click_df 68 | all_user_ids = all_click.user_id.unique() 69 | 70 | # replace=True表示可以重复抽样,反之不可以 71 | sample_user_ids = np.random.choice(all_user_ids, size=sample_user_nums, replace=False) 72 | 73 | click_val = all_click[all_click['user_id'].isin(sample_user_ids)] 74 | click_trn = all_click[~all_click['user_id'].isin(sample_user_ids)] 75 | 76 | # 将验证集中的最后一次点击给抽取出来作为答案 77 | click_val = click_val.sort_values(['user_id', 'click_timestamp']) 78 | val_ans = click_val.groupby('user_id').tail(1) 79 | 80 | click_val = click_val.groupby('user_id').apply(lambda x: x[:-1]).reset_index(drop=True) 81 | 82 | # 去除val_ans中某些用户只有一个点击数据的情况,如果该用户只有一个点击数据,又被分到ans中, 83 | # 那么训练集中就没有这个用户的点击数据,出现用户冷启动问题,给自己模型验证带来麻烦 84 | val_ans = val_ans[val_ans.user_id.isin(click_val.user_id.unique())] # 保证答案中出现的用户再验证集中还有 85 | click_val = click_val[click_val.user_id.isin(val_ans.user_id.unique())] 86 | 87 | return click_trn, click_val, val_ans 88 | 89 | 90 | # 获取当前数据的历史点击和最后一次点击 91 | def get_hist_and_last_click(all_click): 92 | all_click = all_click.sort_values(by=['user_id', 'click_timestamp']) 93 | click_last_df = all_click.groupby('user_id').tail(1) 94 | 95 | # 如果用户只有一个点击,hist为空了,会导致训练的时候这个用户不可见,此时默认泄露一下 96 | def hist_func(user_df): 97 | if len(user_df) == 1: 98 | return user_df 99 | else: 100 | return user_df[:-1] 101 | 102 | click_hist_df = all_click.groupby('user_id').apply(hist_func).reset_index(drop=True) 103 | 104 | return click_hist_df, click_last_df 105 | 106 | 107 | def get_trn_val_tst_data(data_path, offline=True): 108 | if offline: 109 | click_trn_data = pd.read_csv(data_path + 'train_click_log.csv') # 训练集用户点击日志 110 | click_trn_data = reduce_mem(click_trn_data) 111 | click_trn, click_val, val_ans = trn_val_split(all_click_df, sample_user_nums) 112 | else: 113 | click_trn = pd.read_csv(data_path + 'train_click_log.csv') 114 | click_trn = reduce_mem(click_trn) 115 | click_val = None 116 | val_ans = None 117 | 118 | click_tst = pd.read_csv(data_path + 'testA_click_log.csv') 119 | 120 | return click_trn, click_val, click_tst, val_ans 121 | 122 | 123 | # 返回多路召回列表或者单路召回 124 | def get_recall_list(save_path, single_recall_model=None, multi_recall=False): 125 | if multi_recall: 126 | return pickle.load(open(save_path + 'final_recall_items_dict.pkl', 'rb')) 127 | 128 | if single_recall_model == 'i2i_itemcf': 129 | return pickle.load(open(save_path + 'itemcf_recall_dict.pkl', 'rb')) 130 | elif single_recall_model == 'i2i_emb_itemcf': 131 | return pickle.load(open(save_path + 'itemcf_emb_dict.pkl', 'rb')) 132 | elif single_recall_model == 'user_cf': 133 | return pickle.load(open(save_path + 'youtubednn_usercf_dict.pkl', 'rb')) 134 | elif single_recall_model == 'youtubednn': 135 | return pickle.load(open(save_path + 'youtube_u2i_dict.pkl', 'rb')) 136 | 137 | 138 | def trian_item_word2vec(click_df, embed_size=64, save_name='item_w2v_emb.pkl', split_char=' '): 139 | click_df = click_df.sort_values('click_timestamp') 140 | # 只有转换成字符串才可以进行训练 141 | click_df['click_article_id'] = click_df['click_article_id'].astype(str) 142 | # 转换成句子的形式 143 | docs = 
click_df.groupby(['user_id'])['click_article_id'].apply(lambda x: list(x)).reset_index() 144 | docs = docs['click_article_id'].values.tolist() 145 | 146 | # 为了方便查看训练的进度,这里设定一个log信息 147 | logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s', level=logging.INFO) 148 | 149 | # 这里的参数对训练得到的向量影响也很大,默认负采样为5 150 | w2v = Word2Vec(docs, size=16, sg=1, window=5, seed=2020, workers=24, min_count=1, iter=1) 151 | 152 | # 保存成字典的形式 153 | item_w2v_emb_dict = {k: w2v[k] for k in click_df['click_article_id']} 154 | pickle.dump(item_w2v_emb_dict, open(save_path + 'item_w2v_emb.pkl', 'wb')) 155 | 156 | return item_w2v_emb_dict 157 | 158 | 159 | # 可以通过字典查询对应的item的Embedding 160 | def get_embedding(save_path, all_click_df): 161 | if os.path.exists(save_path + 'item_content_emb.pkl'): 162 | item_content_emb_dict = pickle.load(open(save_path + 'item_content_emb.pkl', 'rb')) 163 | else: 164 | print('item_content_emb.pkl 文件不存在...') 165 | 166 | # w2v Embedding是需要提前训练好的 167 | if os.path.exists(save_path + 'item_w2v_emb.pkl'): 168 | item_w2v_emb_dict = pickle.load(open(save_path + 'item_w2v_emb.pkl', 'rb')) 169 | else: 170 | item_w2v_emb_dict = trian_item_word2vec(all_click_df) 171 | 172 | if os.path.exists(save_path + 'item_youtube_emb.pkl'): 173 | item_youtube_emb_dict = pickle.load(open(save_path + 'item_youtube_emb.pkl', 'rb')) 174 | else: 175 | print('item_youtube_emb.pkl 文件不存在...') 176 | 177 | if os.path.exists(save_path + 'user_youtube_emb.pkl'): 178 | user_youtube_emb_dict = pickle.load(open(save_path + 'user_youtube_emb.pkl', 'rb')) 179 | else: 180 | print('user_youtube_emb.pkl 文件不存在...') 181 | 182 | return item_content_emb_dict, item_w2v_emb_dict, item_youtube_emb_dict, user_youtube_emb_dict 183 | 184 | 185 | def get_article_info_df(): 186 | article_info_df = pd.read_csv(data_path + 'articles.csv') 187 | article_info_df = reduce_mem(article_info_df) 188 | 189 | return article_info_df 190 | 191 | 192 | # 这里offline的online的区别就是验证集是否为空 193 | click_trn, click_val, click_tst, val_ans = get_trn_val_tst_data(data_path, offline=False) 194 | 195 | click_trn_hist, click_trn_last = get_hist_and_last_click(click_trn) 196 | 197 | if click_val is not None: 198 | click_val_hist, click_val_last = click_val, val_ans 199 | else: 200 | click_val_hist, click_val_last = None, None 201 | 202 | click_tst_hist = click_tst 203 | 204 | 205 | # 将召回列表转换成df的形式 206 | def recall_dict_2_df(recall_list_dict): 207 | df_row_list = [] # [user, item, score] 208 | for user, recall_list in tqdm(recall_list_dict.items()): 209 | for item, score in recall_list: 210 | df_row_list.append([user, item, score]) 211 | 212 | col_names = ['user_id', 'sim_item', 'score'] 213 | recall_list_df = pd.DataFrame(df_row_list, columns=col_names) 214 | 215 | return recall_list_df 216 | 217 | 218 | # 负采样函数,这里可以控制负采样时的比例, 这里给了一个默认的值 219 | def neg_sample_recall_data(recall_items_df, sample_rate=0.001): 220 | pos_data = recall_items_df[recall_items_df['label'] == 1] 221 | neg_data = recall_items_df[recall_items_df['label'] == 0] 222 | 223 | print('pos_data_num:', len(pos_data), 'neg_data_num:', len(neg_data), 'pos/neg:', len(pos_data) / len(neg_data)) 224 | 225 | # 分组采样函数 226 | def neg_sample_func(group_df): 227 | neg_num = len(group_df) 228 | sample_num = max(int(neg_num * sample_rate), 1) # 保证最少有一个 229 | sample_num = min(sample_num, 5) # 保证最多不超过5个,这里可以根据实际情况进行选择 230 | return group_df.sample(n=sample_num, replace=True) 231 | 232 | # 对用户进行负采样,保证所有用户都在采样后的数据中 233 | neg_data_user_sample = neg_data.groupby('user_id', group_keys=False).apply(neg_sample_func) 234 
| # 对文章进行负采样,保证所有文章都在采样后的数据中 235 | neg_data_item_sample = neg_data.groupby('sim_item', group_keys=False).apply(neg_sample_func) 236 | 237 | # 将上述两种情况下的采样数据合并 238 | neg_data_new = neg_data_user_sample.append(neg_data_item_sample) 239 | # 由于上述两个操作是分开的,可能将两个相同的数据给重复选择了,所以需要对合并后的数据进行去重 240 | neg_data_new = neg_data_new.sort_values(['user_id', 'score']).drop_duplicates(['user_id', 'sim_item'], keep='last') 241 | 242 | # 将正样本数据合并 243 | data_new = pd.concat([pos_data, neg_data_new], ignore_index=True) 244 | 245 | return data_new 246 | 247 | 248 | # 召回数据打标签 249 | def get_rank_label_df(recall_list_df, label_df, is_test=False): 250 | # 测试集是没有标签了,为了后面代码同一一些,这里直接给一个负数替代 251 | if is_test: 252 | recall_list_df['label'] = -1 253 | return recall_list_df 254 | 255 | label_df = label_df.rename(columns={'click_article_id': 'sim_item'}) 256 | recall_list_df_ = recall_list_df.merge(label_df[['user_id', 'sim_item', 'click_timestamp']], \ 257 | how='left', on=['user_id', 'sim_item']) 258 | recall_list_df_['label'] = recall_list_df_['click_timestamp'].apply(lambda x: 0.0 if np.isnan(x) else 1.0) 259 | del recall_list_df_['click_timestamp'] 260 | 261 | return recall_list_df_ 262 | 263 | 264 | def get_user_recall_item_label_df(click_trn_hist, click_val_hist, click_tst_hist, click_trn_last, click_val_last, 265 | recall_list_df): 266 | # 获取训练数据的召回列表 267 | trn_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_trn_hist['user_id'].unique())] 268 | # 训练数据打标签 269 | trn_user_item_label_df = get_rank_label_df(trn_user_items_df, click_trn_last, is_test=False) 270 | # 训练数据负采样 271 | trn_user_item_label_df = neg_sample_recall_data(trn_user_item_label_df) 272 | 273 | if click_val is not None: 274 | val_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_val_hist['user_id'].unique())] 275 | val_user_item_label_df = get_rank_label_df(val_user_items_df, click_val_last, is_test=False) 276 | val_user_item_label_df = neg_sample_recall_data(val_user_item_label_df) 277 | else: 278 | val_user_item_label_df = None 279 | 280 | # 测试数据不需要进行负采样,直接对所有的召回商品进行打-1标签 281 | tst_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_tst_hist['user_id'].unique())] 282 | tst_user_item_label_df = get_rank_label_df(tst_user_items_df, None, is_test=True) 283 | 284 | return trn_user_item_label_df, val_user_item_label_df, tst_user_item_label_df 285 | 286 | 287 | # 读取召回列表 288 | recall_list_dict = get_recall_list(save_path, single_recall_model='i2i_itemcf') # 这里只选择了单路召回的结果,也可以选择多路召回结果 289 | # 将召回数据转换成df 290 | recall_list_df = recall_dict_2_df(recall_list_dict) 291 | 292 | # 给训练验证数据打标签,并负采样(这一部分时间比较久) 293 | trn_user_item_label_df, val_user_item_label_df, tst_user_item_label_df = get_user_recall_item_label_df(click_trn_hist, 294 | click_val_hist, 295 | click_tst_hist, 296 | click_trn_last, 297 | click_val_last, 298 | recall_list_df) 299 | 300 | 301 | # 将最终的召回的df数据转换成字典的形式做排序特征 302 | def make_tuple_func(group_df): 303 | row_data = [] 304 | for name, row_df in group_df.iterrows(): 305 | row_data.append((row_df['sim_item'], row_df['score'], row_df['label'])) 306 | 307 | return row_data 308 | 309 | 310 | trn_user_item_label_tuples = trn_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index() 311 | trn_user_item_label_tuples_dict = dict(zip(trn_user_item_label_tuples['user_id'], trn_user_item_label_tuples[0])) 312 | 313 | if val_user_item_label_df is not None: 314 | val_user_item_label_tuples = val_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index() 315 | 
val_user_item_label_tuples_dict = dict(zip(val_user_item_label_tuples['user_id'], val_user_item_label_tuples[0])) 316 | else: 317 | val_user_item_label_tuples_dict = None 318 | 319 | tst_user_item_label_tuples = tst_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index() 320 | tst_user_item_label_tuples_dict = dict(zip(tst_user_item_label_tuples['user_id'], tst_user_item_label_tuples[0])) 321 | 322 | 323 | # 下面基于data做历史相关的特征 324 | def create_feature(users_id, recall_list, click_hist_df, articles_info, articles_emb, user_emb=None, N=1): 325 | """ 326 | 基于用户的历史行为做相关特征 327 | :param users_id: 用户id 328 | :param recall_list: 对于每个用户召回的候选文章列表 329 | :param click_hist_df: 用户的历史点击信息 330 | :param articles_info: 文章信息 331 | :param articles_emb: 文章的embedding向量, 这个可以用item_content_emb, item_w2v_emb, item_youtube_emb 332 | :param user_emb: 用户的embedding向量, 这个是user_youtube_emb, 如果没有也可以不用, 但要注意如果要用的话, articles_emb就要用item_youtube_emb的形式, 这样维度才一样 333 | :param N: 最近的N次点击 由于testA日志里面很多用户只存在一次历史点击, 所以为了不产生空值,默认是1 334 | """ 335 | 336 | # 建立一个二维列表保存结果, 后面要转成DataFrame 337 | all_user_feas = [] 338 | i = 0 339 | for user_id in tqdm(users_id): 340 | # 该用户的最后N次点击 341 | hist_user_items = click_hist_df[click_hist_df['user_id'] == user_id]['click_article_id'][-N:] 342 | 343 | # 遍历该用户的召回列表 344 | for rank, (article_id, score, label) in enumerate(recall_list[user_id]): 345 | # 该文章建立时间, 字数 346 | a_create_time = articles_info[articles_info['article_id'] == article_id]['created_at_ts'].values[0] 347 | a_words_count = articles_info[articles_info['article_id'] == article_id]['words_count'].values[0] 348 | single_user_fea = [user_id, article_id] 349 | # 计算与最后点击的商品的相似度的和, 最大值和最小值, 均值 350 | sim_fea = [] 351 | time_fea = [] 352 | word_fea = [] 353 | # 遍历用户的最后N次点击文章 354 | for hist_item in hist_user_items: 355 | b_create_time = articles_info[articles_info['article_id'] == hist_item]['created_at_ts'].values[0] 356 | b_words_count = articles_info[articles_info['article_id'] == hist_item]['words_count'].values[0] 357 | 358 | sim_fea.append(np.dot(articles_emb[hist_item], articles_emb[article_id])) 359 | time_fea.append(abs(a_create_time - b_create_time)) 360 | word_fea.append(abs(a_words_count - b_words_count)) 361 | 362 | single_user_fea.extend(sim_fea) # 相似性特征 363 | single_user_fea.extend(time_fea) # 时间差特征 364 | single_user_fea.extend(word_fea) # 字数差特征 365 | single_user_fea.extend([max(sim_fea), min(sim_fea), sum(sim_fea), sum(sim_fea) / len(sim_fea)]) # 相似性的统计特征 366 | 367 | if user_emb: # 如果用户向量有的话, 这里计算该召回文章与用户的相似性特征 368 | single_user_fea.append(np.dot(user_emb[user_id], articles_emb[article_id])) 369 | 370 | single_user_fea.extend([score, rank, label]) 371 | # 加入到总的表中 372 | all_user_feas.append(single_user_fea) 373 | 374 | # 定义列名 375 | id_cols = ['user_id', 'click_article_id'] 376 | sim_cols = ['sim' + str(i) for i in range(N)] 377 | time_cols = ['time_diff' + str(i) for i in range(N)] 378 | word_cols = ['word_diff' + str(i) for i in range(N)] 379 | sat_cols = ['sim_max', 'sim_min', 'sim_sum', 'sim_mean'] 380 | user_item_sim_cols = ['user_item_sim'] if user_emb else [] 381 | user_score_rank_label = ['score', 'rank', 'label'] 382 | cols = id_cols + sim_cols + time_cols + word_cols + sat_cols + user_item_sim_cols + user_score_rank_label 383 | 384 | # 转成DataFrame 385 | df = pd.DataFrame(all_user_feas, columns=cols) 386 | 387 | return df 388 | 389 | 390 | article_info_df = get_article_info_df() 391 | all_click = click_trn.append(click_tst) 392 | item_content_emb_dict, item_w2v_emb_dict, item_youtube_emb_dict, 
user_youtube_emb_dict = get_embedding(save_path, 393 | all_click) 394 | 395 | # 获取训练验证及测试数据中召回列文章相关特征 396 | trn_user_item_feats_df = create_feature(trn_user_item_label_tuples_dict.keys(), trn_user_item_label_tuples_dict, \ 397 | click_trn_hist, article_info_df, item_content_emb_dict) 398 | 399 | if val_user_item_label_tuples_dict is not None: 400 | val_user_item_feats_df = create_feature(val_user_item_label_tuples_dict.keys(), val_user_item_label_tuples_dict, \ 401 | click_val_hist, article_info_df, item_content_emb_dict) 402 | else: 403 | val_user_item_feats_df = None 404 | 405 | tst_user_item_feats_df = create_feature(tst_user_item_label_tuples_dict.keys(), tst_user_item_label_tuples_dict, \ 406 | click_tst_hist, article_info_df, item_content_emb_dict) 407 | 408 | # 保存一份省的每次都要重新跑,每次跑的时间都比较长 409 | trn_user_item_feats_df.to_csv(save_path + 'trn_user_item_feats_df.csv', index=False) 410 | 411 | if val_user_item_feats_df is not None: 412 | val_user_item_feats_df.to_csv(save_path + 'val_user_item_feats_df.csv', index=False) 413 | 414 | tst_user_item_feats_df.to_csv(save_path + 'tst_user_item_feats_df.csv', index=False) 415 | 416 | click_tst.head() 417 | 418 | # 读取文章特征 419 | articles = pd.read_csv(data_path + 'articles.csv') 420 | articles = reduce_mem(articles) 421 | 422 | # 日志数据,就是前面的所有数据 423 | if click_val is not None: 424 | all_data = click_trn.append(click_val) 425 | all_data = click_trn.append(click_tst) 426 | all_data = reduce_mem(all_data) 427 | 428 | # 拼上文章信息 429 | all_data = all_data.merge(articles, left_on='click_article_id', right_on='article_id') 430 | 431 | all_data.shape 432 | 433 | 434 | def active_level(all_data, cols): 435 | """ 436 | 制作区分用户活跃度的特征 437 | :param all_data: 数据集 438 | :param cols: 用到的特征列 439 | """ 440 | data = all_data[cols] 441 | data.sort_values(['user_id', 'click_timestamp'], inplace=True) 442 | user_act = pd.DataFrame(data.groupby('user_id', as_index=False)[['click_article_id', 'click_timestamp']]. \ 443 | agg({'click_article_id': np.size, 'click_timestamp': {list}}).values, 444 | columns=['user_id', 'click_size', 'click_timestamp']) 445 | 446 | # 计算时间间隔的均值 447 | def time_diff_mean(l): 448 | if len(l) == 1: 449 | return 1 450 | else: 451 | return np.mean([j - i for i, j in list(zip(l[:-1], l[1:]))]) 452 | 453 | user_act['time_diff_mean'] = user_act['click_timestamp'].apply(lambda x: time_diff_mean(x)) 454 | 455 | # 点击次数取倒数 456 | user_act['click_size'] = 1 / user_act['click_size'] 457 | 458 | # 两者归一化 459 | user_act['click_size'] = (user_act['click_size'] - user_act['click_size'].min()) / ( 460 | user_act['click_size'].max() - user_act['click_size'].min()) 461 | user_act['time_diff_mean'] = (user_act['time_diff_mean'] - user_act['time_diff_mean'].min()) / ( 462 | user_act['time_diff_mean'].max() - user_act['time_diff_mean'].min()) 463 | user_act['active_level'] = user_act['click_size'] + user_act['time_diff_mean'] 464 | 465 | user_act['user_id'] = user_act['user_id'].astype('int') 466 | del user_act['click_timestamp'] 467 | 468 | return user_act 469 | 470 | 471 | user_act_fea = active_level(all_data, ['user_id', 'click_article_id', 'click_timestamp']) 472 | 473 | user_act_fea.head() 474 | 475 | 476 | def hot_level(all_data, cols): 477 | """ 478 | 制作衡量文章热度的特征 479 | :param all_data: 数据集 480 | :param cols: 用到的特征列 481 | """ 482 | data = all_data[cols] 483 | data.sort_values(['click_article_id', 'click_timestamp'], inplace=True) 484 | article_hot = pd.DataFrame(data.groupby('click_article_id', as_index=False)[['user_id', 'click_timestamp']]. 
\ 485 | agg({'user_id': np.size, 'click_timestamp': {list}}).values, 486 | columns=['click_article_id', 'user_num', 'click_timestamp']) 487 | 488 | # 计算被点击时间间隔的均值 489 | def time_diff_mean(l): 490 | if len(l) == 1: 491 | return 1 492 | else: 493 | return np.mean([j - i for i, j in list(zip(l[:-1], l[1:]))]) 494 | 495 | article_hot['time_diff_mean'] = article_hot['click_timestamp'].apply(lambda x: time_diff_mean(x)) 496 | 497 | # 点击次数取倒数 498 | article_hot['user_num'] = 1 / article_hot['user_num'] 499 | 500 | # 两者归一化 501 | article_hot['user_num'] = (article_hot['user_num'] - article_hot['user_num'].min()) / ( 502 | article_hot['user_num'].max() - article_hot['user_num'].min()) 503 | article_hot['time_diff_mean'] = (article_hot['time_diff_mean'] - article_hot['time_diff_mean'].min()) / ( 504 | article_hot['time_diff_mean'].max() - article_hot['time_diff_mean'].min()) 505 | article_hot['hot_level'] = article_hot['user_num'] + article_hot['time_diff_mean'] 506 | 507 | article_hot['click_article_id'] = article_hot['click_article_id'].astype('int') 508 | 509 | del article_hot['click_timestamp'] 510 | 511 | return article_hot 512 | 513 | 514 | article_hot_fea = hot_level(all_data, ['user_id', 'click_article_id', 'click_timestamp']) 515 | 516 | article_hot_fea.head() 517 | 518 | 519 | def device_fea(all_data, cols): 520 | """ 521 | 制作用户的设备特征 522 | :param all_data: 数据集 523 | :param cols: 用到的特征列 524 | """ 525 | user_device_info = all_data[cols] 526 | 527 | # 用众数来表示每个用户的设备信息 528 | user_device_info = user_device_info.groupby('user_id').agg(lambda x: x.value_counts().index[0]).reset_index() 529 | 530 | return user_device_info 531 | 532 | 533 | # 设备特征(这里时间会比较长) 534 | device_cols = ['user_id', 'click_environment', 'click_deviceGroup', 'click_os', 'click_country', 'click_region', 535 | 'click_referrer_type'] 536 | user_device_info = device_fea(all_data, device_cols) 537 | 538 | user_device_info.head() 539 | 540 | 541 | def user_time_hob_fea(all_data, cols): 542 | """ 543 | 制作用户的时间习惯特征 544 | :param all_data: 数据集 545 | :param cols: 用到的特征列 546 | """ 547 | user_time_hob_info = all_data[cols] 548 | 549 | # 先把时间戳进行归一化 550 | mm = MinMaxScaler() 551 | user_time_hob_info['click_timestamp'] = mm.fit_transform(user_time_hob_info[['click_timestamp']]) 552 | user_time_hob_info['created_at_ts'] = mm.fit_transform(user_time_hob_info[['created_at_ts']]) 553 | 554 | user_time_hob_info = user_time_hob_info.groupby('user_id').agg('mean').reset_index() 555 | 556 | user_time_hob_info.rename(columns={'click_timestamp': 'user_time_hob1', 'created_at_ts': 'user_time_hob2'}, 557 | inplace=True) 558 | return user_time_hob_info 559 | 560 | 561 | user_time_hob_cols = ['user_id', 'click_timestamp', 'created_at_ts'] 562 | user_time_hob_info = user_time_hob_fea(all_data, user_time_hob_cols) 563 | 564 | 565 | def user_cat_hob_fea(all_data, cols): 566 | """ 567 | 用户的主题爱好 568 | :param all_data: 数据集 569 | :param cols: 用到的特征列 570 | """ 571 | user_category_hob_info = all_data[cols] 572 | user_category_hob_info = user_category_hob_info.groupby('user_id').agg({list}).reset_index() 573 | 574 | user_cat_hob_info = pd.DataFrame() 575 | user_cat_hob_info['user_id'] = user_category_hob_info['user_id'] 576 | user_cat_hob_info['cate_list'] = user_category_hob_info['category_id'] 577 | 578 | return user_cat_hob_info 579 | 580 | 581 | user_category_hob_cols = ['user_id', 'category_id'] 582 | user_cat_hob_info = user_cat_hob_fea(all_data, user_category_hob_cols) 583 | 584 | user_wcou_info = 
all_data.groupby('user_id')['words_count'].agg('mean').reset_index() 585 | user_wcou_info.rename(columns={'words_count': 'words_hbo'}, inplace=True) 586 | 587 | # 所有表进行合并 588 | user_info = pd.merge(user_act_fea, user_device_info, on='user_id') 589 | user_info = user_info.merge(user_time_hob_info, on='user_id') 590 | user_info = user_info.merge(user_cat_hob_info, on='user_id') 591 | user_info = user_info.merge(user_wcou_info, on='user_id') 592 | 593 | # 这样用户特征以后就可以直接读取了 594 | user_info.to_csv(save_path + 'user_info.csv', index=False) 595 | 596 | # 把用户信息直接读入进来 597 | user_info = pd.read_csv(save_path + 'user_info.csv') 598 | 599 | if os.path.exists(save_path + 'trn_user_item_feats_df.csv'): 600 | trn_user_item_feats_df = pd.read_csv(save_path + 'trn_user_item_feats_df.csv') 601 | 602 | if os.path.exists(save_path + 'tst_user_item_feats_df.csv'): 603 | tst_user_item_feats_df = pd.read_csv(save_path + 'tst_user_item_feats_df.csv') 604 | 605 | if os.path.exists(save_path + 'val_user_item_feats_df.csv'): 606 | val_user_item_feats_df = pd.read_csv(save_path + 'val_user_item_feats_df.csv') 607 | else: 608 | val_user_item_feats_df = None 609 | 610 | # 拼上用户特征 611 | # 下面是线下验证的 612 | trn_user_item_feats_df = trn_user_item_feats_df.merge(user_info, on='user_id', how='left') 613 | 614 | if val_user_item_feats_df is not None: 615 | val_user_item_feats_df = val_user_item_feats_df.merge(user_info, on='user_id', how='left') 616 | else: 617 | val_user_item_feats_df = None 618 | 619 | tst_user_item_feats_df = tst_user_item_feats_df.merge(user_info, on='user_id', how='left') 620 | 621 | trn_user_item_feats_df.columns 622 | 623 | Index(['user_id', 'click_article_id', 'sim0', 'time_diff0', 'word_diff0', 624 | 'sim_max', 'sim_min', 'sim_sum', 'sim_mean', 'score', 'rank', 'label', 625 | 'click_size', 'time_diff_mean', 'active_level', 'click_environment', 626 | 'click_deviceGroup', 'click_os', 'click_country', 'click_region', 627 | 'click_referrer_type', 'user_time_hob1', 'user_time_hob2', 'cate_list', 628 | 'words_hbo'], 629 | dtype='object') 630 | 631 | articles = pd.read_csv(data_path + 'articles.csv') 632 | articles = reduce_mem(articles) 633 | 634 | # 拼上文章特征 635 | trn_user_item_feats_df = trn_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id') 636 | 637 | if val_user_item_feats_df is not None: 638 | val_user_item_feats_df = val_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id') 639 | else: 640 | val_user_item_feats_df = None 641 | 642 | tst_user_item_feats_df = tst_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id') 643 | 644 | trn_user_item_feats_df['is_cat_hab'] = trn_user_item_feats_df.apply( 645 | lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1) 646 | if val_user_item_feats_df is not None: 647 | val_user_item_feats_df['is_cat_hab'] = val_user_item_feats_df.apply( 648 | lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1) 649 | else: 650 | val_user_item_feats_df = None 651 | tst_user_item_feats_df['is_cat_hab'] = tst_user_item_feats_df.apply( 652 | lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1) 653 | 654 | # 线下验证 655 | del trn_user_item_feats_df['cate_list'] 656 | 657 | if val_user_item_feats_df is not None: 658 | del val_user_item_feats_df['cate_list'] 659 | else: 660 | val_user_item_feats_df = None 661 | 662 | del tst_user_item_feats_df['cate_list'] 663 | 664 | del trn_user_item_feats_df['article_id'] 665 | 666 | if val_user_item_feats_df is not None: 667 | 
del val_user_item_feats_df['article_id'] 668 | else: 669 | val_user_item_feats_df = None 670 | 671 | del tst_user_item_feats_df['article_id'] 672 | 673 | # 训练验证特征 674 | trn_user_item_feats_df.to_csv(save_path + 'trn_user_item_feats_df.csv', index=False) 675 | if val_user_item_feats_df is not None: 676 | val_user_item_feats_df.to_csv(save_path + 'val_user_item_feats_df.csv', index=False) 677 | tst_user_item_feats_df.to_csv(save_path + 'tst_user_item_feats_df.csv', index=False) 678 | -------------------------------------------------------------------------------- /news3.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from tqdm import tqdm 4 | from collections import defaultdict 5 | import os, math, warnings, math, pickle 6 | from tqdm import tqdm 7 | import faiss 8 | import collections 9 | import random 10 | from sklearn.preprocessing import MinMaxScaler 11 | from sklearn.preprocessing import LabelEncoder 12 | from datetime import datetime 13 | from deepctr.feature_column import SparseFeat, VarLenSparseFeat 14 | from sklearn.preprocessing import LabelEncoder 15 | from tensorflow.python.keras import backend as K 16 | # from tensorflow.python.keras.models import Model 17 | # from tensorflow.python.keras.preprocessing.sequence import pad_sequences 18 | from tensorflow.keras.models import Model 19 | from tensorflow.keras.preprocessing.sequence import pad_sequences 20 | from deepmatch.utils import sampledsoftmaxloss, NegativeSampler 21 | from tensorflow.keras.utils import plot_model 22 | 23 | from collections import Counter 24 | from deepmatch.models import * 25 | from deepmatch.utils import sampledsoftmaxloss 26 | 27 | warnings.filterwarnings('ignore') 28 | 29 | data_path = '../data_raw/' 30 | save_path = '../temp_results/' 31 | # 做召回评估的一个标志, 如果不进行评估就是直接使用全量数据进行召回 32 | metric_recall = False 33 | 34 | 35 | # debug模式: 从训练集中划出一部分数据来调试代码 36 | def get_all_click_sample(data_path, sample_nums=10000): 37 | """ 38 | 训练集中采样一部分数据调试 39 | data_path: 原数据的存储路径 40 | sample_nums: 采样数目(这里由于机器的内存限制,可以采样用户做) 41 | """ 42 | all_click = pd.read_csv(data_path + 'train_click_log.csv') 43 | all_user_ids = all_click.user_id.unique() 44 | 45 | sample_user_ids = np.random.choice(all_user_ids, size=sample_nums, replace=False) 46 | all_click = all_click[all_click['user_id'].isin(sample_user_ids)] 47 | 48 | all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp'])) 49 | return all_click 50 | 51 | 52 | # 读取点击数据,这里分成线上和线下,如果是为了获取线上提交结果应该讲测试集中的点击数据合并到总的数据中 53 | # 如果是为了线下验证模型的有效性或者特征的有效性,可以只使用训练集 54 | # def get_all_click_df(data_path='../data_raw/', offline=True): 55 | # if offline: 56 | # all_click = pd.read_csv(data_path + 'train_click_log.csv') 57 | # else: 58 | # trn_click = pd.read_csv(data_path + 'train_click_log.csv') 59 | # tst_click = pd.read_csv(data_path + 'testA_click_log.csv') 60 | # 61 | # all_click = trn_click.append(tst_click) 62 | # 63 | # all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp'])) 64 | # return all_click 65 | 66 | def get_all_click_df(data_path='../data_raw/', offline=True, sample_size=100000): 67 | if offline: 68 | all_click = pd.read_csv(data_path + 'train_click_log.csv') 69 | else: 70 | trn_click = pd.read_csv(data_path + 'train_click_log.csv') 71 | tst_click = pd.read_csv(data_path + 'testA_click_log.csv') 72 | all_click = trn_click.append(tst_click) 73 | 74 | # 去重 75 | all_click = all_click.drop_duplicates(['user_id', 'click_article_id', 
'click_timestamp']) 76 | 77 | # 进行随机采样 78 | if sample_size and len(all_click) > sample_size: 79 | all_click = all_click.sample(n=sample_size, random_state=1) 80 | 81 | return all_click 82 | 83 | 84 | # 读取文章的基本属性 85 | def get_item_info_df(data_path): 86 | item_info_df = pd.read_csv(data_path + 'articles.csv') 87 | 88 | # 为了方便与训练集中的click_article_id拼接,需要把article_id修改成click_article_id 89 | item_info_df = item_info_df.rename(columns={'article_id': 'click_article_id'}) 90 | 91 | return item_info_df 92 | 93 | 94 | # 读取文章的Embedding数据 95 | def get_item_emb_dict(data_path): 96 | item_emb_df = pd.read_csv(data_path + 'articles_emb.csv') 97 | 98 | item_emb_cols = [x for x in item_emb_df.columns if 'emb' in x] 99 | item_emb_np = np.ascontiguousarray(item_emb_df[item_emb_cols]) 100 | # 进行归一化 101 | item_emb_np = item_emb_np / np.linalg.norm(item_emb_np, axis=1, keepdims=True) 102 | 103 | item_emb_dict = dict(zip(item_emb_df['article_id'], item_emb_np)) 104 | pickle.dump(item_emb_dict, open(save_path + 'item_content_emb.pkl', 'wb')) 105 | 106 | return item_emb_dict 107 | 108 | 109 | # 采样数据 110 | # all_click_df = get_all_click_sample(data_path) 111 | 112 | # 全量训练集 113 | all_click_df = get_all_click_df(offline=False) 114 | 115 | max_min_scaler = lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)) 116 | 117 | # 对时间戳进行归一化,用于在关联规则的时候计算权重 118 | all_click_df['click_timestamp'] = all_click_df[['click_timestamp']].apply(max_min_scaler) 119 | 120 | item_info_df = get_item_info_df(data_path) 121 | item_emb_dict = get_item_emb_dict(data_path) 122 | 123 | 124 | # 根据点击时间获取用户的点击文章序列 {user1: [(item1, time1), (item2, time2)..]...} 125 | def get_user_item_time(click_df): 126 | click_df = click_df.sort_values('click_timestamp') 127 | 128 | def make_item_time_pair(df): 129 | return list(zip(df['click_article_id'], df['click_timestamp'])) 130 | 131 | user_item_time_df = click_df.groupby('user_id')['click_article_id', 'click_timestamp'].apply( 132 | lambda x: make_item_time_pair(x)) \ 133 | .reset_index().rename(columns={0: 'item_time_list'}) 134 | user_item_time_dict = dict(zip(user_item_time_df['user_id'], user_item_time_df['item_time_list'])) 135 | 136 | return user_item_time_dict 137 | 138 | 139 | # 根据时间获取商品被点击的用户序列 {item1: [(user1, time1), (user2, time2)...]...} 140 | # 这里的时间是用户点击当前商品的时间,好像没有直接的关系。 141 | def get_item_user_time_dict(click_df): 142 | def make_user_time_pair(df): 143 | return list(zip(df['user_id'], df['click_timestamp'])) 144 | 145 | click_df = click_df.sort_values('click_timestamp') 146 | item_user_time_df = click_df.groupby('click_article_id')['user_id', 'click_timestamp'].apply( 147 | lambda x: make_user_time_pair(x)) \ 148 | .reset_index().rename(columns={0: 'user_time_list'}) 149 | 150 | item_user_time_dict = dict(zip(item_user_time_df['click_article_id'], item_user_time_df['user_time_list'])) 151 | return item_user_time_dict 152 | 153 | 154 | # 获取当前数据的历史点击和最后一次点击 155 | def get_hist_and_last_click(all_click): 156 | all_click = all_click.sort_values(by=['user_id', 'click_timestamp']) 157 | click_last_df = all_click.groupby('user_id').tail(1) 158 | 159 | # 如果用户只有一个点击,hist为空了,会导致训练的时候这个用户不可见,此时默认泄露一下 160 | def hist_func(user_df): 161 | if len(user_df) == 1: 162 | return user_df 163 | else: 164 | return user_df[:-1] 165 | 166 | click_hist_df = all_click.groupby('user_id').apply(hist_func).reset_index(drop=True) 167 | 168 | return click_hist_df, click_last_df 169 | 170 | 171 | # 获取文章id对应的基本属性,保存成字典的形式,方便后面召回阶段,冷启动阶段直接使用 172 | def get_item_info_dict(item_info_df): 173 | max_min_scaler = lambda x: (x - 
np.min(x)) / (np.max(x) - np.min(x)) 174 | item_info_df['created_at_ts'] = item_info_df[['created_at_ts']].apply(max_min_scaler) 175 | 176 | item_type_dict = dict(zip(item_info_df['click_article_id'], item_info_df['category_id'])) 177 | item_words_dict = dict(zip(item_info_df['click_article_id'], item_info_df['words_count'])) 178 | item_created_time_dict = dict(zip(item_info_df['click_article_id'], item_info_df['created_at_ts'])) 179 | 180 | return item_type_dict, item_words_dict, item_created_time_dict 181 | 182 | 183 | def get_user_hist_item_info_dict(all_click): 184 | # 获取user_id对应的用户历史点击文章类型的集合字典 185 | user_hist_item_typs = all_click.groupby('user_id')['category_id'].agg(set).reset_index() 186 | user_hist_item_typs_dict = dict(zip(user_hist_item_typs['user_id'], user_hist_item_typs['category_id'])) 187 | 188 | # 获取user_id对应的用户点击文章的集合 189 | user_hist_item_ids_dict = all_click.groupby('user_id')['click_article_id'].agg(set).reset_index() 190 | user_hist_item_ids_dict = dict(zip(user_hist_item_ids_dict['user_id'], user_hist_item_ids_dict['click_article_id'])) 191 | 192 | # 获取user_id对应的用户历史点击的文章的平均字数字典 193 | user_hist_item_words = all_click.groupby('user_id')['words_count'].agg('mean').reset_index() 194 | user_hist_item_words_dict = dict(zip(user_hist_item_words['user_id'], user_hist_item_words['words_count'])) 195 | 196 | # 获取user_id对应的用户最后一次点击的文章的创建时间 197 | all_click_ = all_click.sort_values('click_timestamp') 198 | user_last_item_created_time = all_click_.groupby('user_id')['created_at_ts'].apply( 199 | lambda x: x.iloc[-1]).reset_index() 200 | 201 | max_min_scaler = lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)) 202 | user_last_item_created_time['created_at_ts'] = user_last_item_created_time[['created_at_ts']].apply(max_min_scaler) 203 | 204 | user_last_item_created_time_dict = dict(zip(user_last_item_created_time['user_id'], \ 205 | user_last_item_created_time['created_at_ts'])) 206 | 207 | return user_hist_item_typs_dict, user_hist_item_ids_dict, user_hist_item_words_dict, user_last_item_created_time_dict 208 | 209 | 210 | # 获取近期点击最多的文章 211 | def get_item_topk_click(click_df, k): 212 | topk_click = click_df['click_article_id'].value_counts().index[:k] 213 | return topk_click 214 | 215 | 216 | # 获取文章的属性信息,保存成字典的形式方便查询 217 | item_type_dict, item_words_dict, item_created_time_dict = get_item_info_dict(item_info_df) 218 | 219 | # 定义一个多路召回的字典,将各路召回的结果都保存在这个字典当中 220 | user_multi_recall_dict = {'itemcf_sim_itemcf_recall': {}, 221 | 'embedding_sim_item_recall': {}, 222 | 'youtubednn_recall': {}, 223 | 'youtubednn_usercf_recall': {}, 224 | 'cold_start_recall': {}} 225 | 226 | # 提取最后一次点击作为召回评估,如果不需要做召回评估直接使用全量的训练集进行召回(线下验证模型) 227 | # 如果不是召回评估,直接使用全量数据进行召回,不用将最后一次提取出来 228 | trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df) 229 | 230 | 231 | # 依次评估召回的前10, 20, 30, 40, 50个文章中的击中率 232 | def metrics_recall(user_recall_items_dict, trn_last_click_df, topk=5): 233 | last_click_item_dict = dict(zip(trn_last_click_df['user_id'], trn_last_click_df['click_article_id'])) 234 | user_num = len(user_recall_items_dict) 235 | 236 | for k in range(10, topk + 1, 10): 237 | hit_num = 0 238 | for user, item_list in user_recall_items_dict.items(): 239 | # 获取前k个召回的结果 240 | tmp_recall_items = [x[0] for x in user_recall_items_dict[user][:k]] 241 | if last_click_item_dict[user] in set(tmp_recall_items): 242 | hit_num += 1 243 | 244 | hit_rate = round(hit_num * 1.0 / user_num, 5) 245 | print(' topk: ', k, ' : ', 'hit_num: ', hit_num, 'hit_rate: ', hit_rate, 'user_num : ', user_num) 246 | 
247 | 248 | def itemcf_sim(df, item_created_time_dict): 249 | """ 250 | 文章与文章之间的相似性矩阵计算 251 | :param df: 数据表 252 | :item_created_time_dict: 文章创建时间的字典 253 | return : 文章与文章的相似性矩阵 254 | 255 | 思路: 基于物品的协同过滤(详细请参考上一期推荐系统基础的组队学习) + 关联规则 256 | """ 257 | 258 | user_item_time_dict = get_user_item_time(df) 259 | 260 | # 计算物品相似度 261 | i2i_sim = {} 262 | item_cnt = defaultdict(int) 263 | for user, item_time_list in tqdm(user_item_time_dict.items()): 264 | # 在基于商品的协同过滤优化的时候可以考虑时间因素 265 | for loc1, (i, i_click_time) in enumerate(item_time_list): 266 | item_cnt[i] += 1 267 | i2i_sim.setdefault(i, {}) 268 | for loc2, (j, j_click_time) in enumerate(item_time_list): 269 | if (i == j): 270 | continue 271 | 272 | # 考虑文章的正向顺序点击和反向顺序点击 273 | loc_alpha = 1.0 if loc2 > loc1 else 0.7 274 | # 位置信息权重,其中的参数可以调节 275 | loc_weight = loc_alpha * (0.9 ** (np.abs(loc2 - loc1) - 1)) 276 | # 点击时间权重,其中的参数可以调节 277 | click_time_weight = np.exp(0.7 ** np.abs(i_click_time - j_click_time)) 278 | # 两篇文章创建时间的权重,其中的参数可以调节 279 | created_time_weight = np.exp(0.8 ** np.abs(item_created_time_dict[i] - item_created_time_dict[j])) 280 | i2i_sim[i].setdefault(j, 0) 281 | # 考虑多种因素的权重计算最终的文章之间的相似度 282 | i2i_sim[i][j] += loc_weight * click_time_weight * created_time_weight / math.log( 283 | len(item_time_list) + 1) 284 | 285 | i2i_sim_ = i2i_sim.copy() 286 | for i, related_items in i2i_sim.items(): 287 | for j, wij in related_items.items(): 288 | i2i_sim_[i][j] = wij / math.sqrt(item_cnt[i] * item_cnt[j]) 289 | 290 | # 将得到的相似性矩阵保存到本地 291 | pickle.dump(i2i_sim_, open(save_path + 'itemcf_i2i_sim.pkl', 'wb')) 292 | 293 | return i2i_sim_ 294 | 295 | 296 | def get_user_activate_degree_dict(all_click_df): 297 | all_click_df_ = all_click_df.groupby('user_id')['click_article_id'].count().reset_index() 298 | 299 | # 用户活跃度归一化 300 | mm = MinMaxScaler() 301 | all_click_df_['click_article_id'] = mm.fit_transform(all_click_df_[['click_article_id']]) 302 | user_activate_degree_dict = dict(zip(all_click_df_['user_id'], all_click_df_['click_article_id'])) 303 | 304 | return user_activate_degree_dict 305 | 306 | 307 | def usercf_sim(all_click_df, user_activate_degree_dict): 308 | """ 309 | 用户相似性矩阵计算 310 | :param all_click_df: 数据表 311 | :param user_activate_degree_dict: 用户活跃度的字典 312 | return 用户相似性矩阵 313 | 314 | 思路: 基于用户的协同过滤(详细请参考上一期推荐系统基础的组队学习) + 关联规则 315 | """ 316 | item_user_time_dict = get_item_user_time_dict(all_click_df) 317 | 318 | u2u_sim = {} 319 | user_cnt = defaultdict(int) 320 | for item, user_time_list in tqdm(item_user_time_dict.items()): 321 | for u, click_time in user_time_list: 322 | user_cnt[u] += 1 323 | u2u_sim.setdefault(u, {}) 324 | for v, click_time in user_time_list: 325 | u2u_sim[u].setdefault(v, 0) 326 | if u == v: 327 | continue 328 | # 用户平均活跃度作为活跃度的权重,这里的式子也可以改善 329 | activate_weight = 100 * 0.5 * (user_activate_degree_dict[u] + user_activate_degree_dict[v]) 330 | u2u_sim[u][v] += activate_weight / math.log(len(user_time_list) + 1) 331 | 332 | u2u_sim_ = u2u_sim.copy() 333 | for u, related_users in u2u_sim.items(): 334 | for v, wij in related_users.items(): 335 | u2u_sim_[u][v] = wij / math.sqrt(user_cnt[u] * user_cnt[v]) 336 | 337 | # 将得到的相似性矩阵保存到本地 338 | pickle.dump(u2u_sim_, open(save_path + 'usercf_u2u_sim.pkl', 'wb')) 339 | 340 | return u2u_sim_ 341 | 342 | 343 | # 由于usercf计算时候太耗费内存了,这里就不直接运行了 344 | # 如果是采样的话,是可以运行的 345 | user_activate_degree_dict = get_user_activate_degree_dict(all_click_df) 346 | u2u_sim = usercf_sim(all_click_df, user_activate_degree_dict) 347 | 348 | 349 | # 向量检索相似度计算 350 | # topk指的是每个item, 
faiss搜索后返回最相似的topk个item 351 | def embdding_sim(click_df, item_emb_df, save_path, topk): 352 | """ 353 | 基于内容的文章embedding相似性矩阵计算 354 | :param click_df: 数据表 355 | :param item_emb_df: 文章的embedding 356 | :param save_path: 保存路径 357 | :patam topk: 找最相似的topk篇 358 | return 文章相似性矩阵 359 | 360 | 思路: 对于每一篇文章, 基于embedding的相似性返回topk个与其最相似的文章, 只不过由于文章数量太多,这里用了faiss进行加速 361 | """ 362 | 363 | # 文章索引与文章id的字典映射 364 | item_idx_2_rawid_dict = dict(zip(item_emb_df.index, item_emb_df['article_id'])) 365 | 366 | item_emb_cols = [x for x in item_emb_df.columns if 'emb' in x] 367 | item_emb_np = np.ascontiguousarray(item_emb_df[item_emb_cols].values, dtype=np.float32) 368 | # 向量进行单位化 369 | item_emb_np = item_emb_np / np.linalg.norm(item_emb_np, axis=1, keepdims=True) 370 | 371 | # 建立faiss索引 372 | item_index = faiss.IndexFlatIP(item_emb_np.shape[1]) 373 | item_index.add(item_emb_np) 374 | # 相似度查询,给每个索引位置上的向量返回topk个item以及相似度 375 | sim, idx = item_index.search(item_emb_np, topk) # 返回的是列表 376 | 377 | # 将向量检索的结果保存成原始id的对应关系 378 | item_sim_dict = collections.defaultdict(dict) 379 | for target_idx, sim_value_list, rele_idx_list in tqdm(zip(range(len(item_emb_np)), sim, idx)): 380 | target_raw_id = item_idx_2_rawid_dict[target_idx] 381 | # 从1开始是为了去掉商品本身, 所以最终获得的相似商品只有topk-1 382 | for rele_idx, sim_value in zip(rele_idx_list[1:], sim_value_list[1:]): 383 | rele_raw_id = item_idx_2_rawid_dict[rele_idx] 384 | item_sim_dict[target_raw_id][rele_raw_id] = item_sim_dict.get(target_raw_id, {}).get(rele_raw_id, 385 | 0) + sim_value 386 | 387 | # 保存i2i相似度矩阵 388 | pickle.dump(item_sim_dict, open(save_path + 'emb_i2i_sim.pkl', 'wb')) 389 | 390 | return item_sim_dict 391 | 392 | 393 | # 获取双塔召回时的训练验证数据 394 | # negsample指的是通过滑窗构建样本的时候,负样本的数量 395 | def gen_data_set(data, negsample=0): 396 | data.sort_values("click_timestamp", inplace=True) 397 | item_ids = data['click_article_id'].unique() 398 | 399 | train_set = [] 400 | test_set = [] 401 | for reviewerID, hist in tqdm(data.groupby('user_id')): 402 | pos_list = hist['click_article_id'].tolist() 403 | 404 | if negsample > 0: 405 | candidate_set = list(set(item_ids) - set(pos_list)) # 用户没看过的文章里面选择负样本 406 | neg_list = np.random.choice(candidate_set, size=len(pos_list) * negsample, replace=True) # 对于每个正样本,选择n个负样本 407 | 408 | # 长度只有一个的时候,需要把这条数据也放到训练集中,不然的话最终学到的embedding就会有缺失 409 | if len(pos_list) == 1: 410 | train_set.append((reviewerID, [pos_list[0]], pos_list[0], 1, len(pos_list))) 411 | test_set.append((reviewerID, [pos_list[0]], pos_list[0], 1, len(pos_list))) 412 | 413 | # 滑窗构造正负样本 414 | for i in range(1, len(pos_list)): 415 | hist = pos_list[:i] 416 | 417 | if i != len(pos_list) - 1: 418 | train_set.append((reviewerID, hist[::-1], pos_list[i], 1, 419 | len(hist[::-1]))) # 正样本 [user_id, his_item, pos_item, label, len(his_item)] 420 | for negi in range(negsample): 421 | train_set.append((reviewerID, hist[::-1], neg_list[i * negsample + negi], 0, 422 | len(hist[::-1]))) # 负样本 [user_id, his_item, neg_item, label, len(his_item)] 423 | else: 424 | # 将最长的那一个序列长度作为测试数据 425 | test_set.append((reviewerID, hist[::-1], pos_list[i], 1, len(hist[::-1]))) 426 | 427 | random.shuffle(train_set) 428 | random.shuffle(test_set) 429 | 430 | return train_set, test_set 431 | 432 | 433 | # 将输入的数据进行padding,使得序列特征的长度都一致 434 | def gen_model_input(train_set, user_profile, seq_max_len): 435 | train_uid = np.array([line[0] for line in train_set]) 436 | train_seq = [line[1] for line in train_set] 437 | train_iid = np.array([line[2] for line in train_set]) 438 | train_label = np.array([line[3] for line in 
train_set]) 439 | train_hist_len = np.array([line[4] for line in train_set]) 440 | 441 | train_seq_pad = pad_sequences(train_seq, maxlen=seq_max_len, padding='post', truncating='post', value=0) 442 | train_model_input = {"user_id": train_uid, "click_article_id": train_iid, "hist_article_id": train_seq_pad, 443 | "hist_len": train_hist_len} 444 | 445 | return train_model_input, train_label 446 | 447 | 448 | def youtubednn_u2i_dict(data, topk=20): 449 | sparse_features = ["click_article_id", "user_id"] 450 | SEQ_LEN = 30 # 用户点击序列的长度,短的填充,长的截断 451 | 452 | user_profile_ = data[["user_id"]].drop_duplicates('user_id') 453 | item_profile_ = data[["click_article_id"]].drop_duplicates('click_article_id') 454 | 455 | # 类别编码 456 | features = ["click_article_id", "user_id"] 457 | feature_max_idx = {} 458 | 459 | for feature in features: 460 | lbe = LabelEncoder() 461 | data[feature] = lbe.fit_transform(data[feature]) 462 | feature_max_idx[feature] = data[feature].max() + 1 463 | 464 | # 提取user和item的画像,这里具体选择哪些特征还需要进一步的分析和考虑 465 | user_profile = data[["user_id"]].drop_duplicates('user_id') 466 | item_profile = data[["click_article_id"]].drop_duplicates('click_article_id') 467 | 468 | user_index_2_rawid = dict(zip(user_profile['user_id'], user_profile_['user_id'])) 469 | item_index_2_rawid = dict(zip(item_profile['click_article_id'], item_profile_['click_article_id'])) 470 | 471 | # 划分训练和测试集 472 | # 由于深度学习需要的数据量通常都是非常大的,所以为了保证召回的效果,往往会通过滑窗的形式扩充训练样本 473 | train_set, test_set = gen_data_set(data, 0) 474 | # 整理输入数据,具体的操作可以看上面的函数 475 | train_model_input, train_label = gen_model_input(train_set, user_profile, SEQ_LEN) 476 | test_model_input, test_label = gen_model_input(test_set, user_profile, SEQ_LEN) 477 | 478 | # 确定Embedding的维度 479 | embedding_dim = 16 480 | 481 | # 将数据整理成模型可以直接输入的形式 482 | user_feature_columns = [SparseFeat('user_id', feature_max_idx['user_id'], embedding_dim), 483 | VarLenSparseFeat( 484 | SparseFeat('hist_article_id', feature_max_idx['click_article_id'], embedding_dim, 485 | embedding_name="click_article_id"), SEQ_LEN, 'mean', 'hist_len'), ] 486 | item_feature_columns = [SparseFeat('click_article_id', feature_max_idx['click_article_id'], embedding_dim)] 487 | 488 | # 模型的定义 489 | # num_sampled: 负采样时的样本数量 490 | # model = YoutubeDNN(user_feature_columns, item_feature_columns, num_sampled=5, 491 | # user_dnn_hidden_units=(64, embedding_dim)) 492 | train_counter = Counter(train_model_input['click_article_id']) 493 | item_count = [train_counter.get(i, 0) for i in range(item_feature_columns[0].vocabulary_size)] 494 | sampler_config = NegativeSampler('frequency', num_sampled=5, item_name="click_article_id", item_count=item_count) 495 | 496 | import tensorflow as tf 497 | print(tf.__version__) 498 | if tf.__version__ >= '2.0.0': 499 | tf.compat.v1.disable_eager_execution() 500 | 501 | # Assuming model is already created and defined elsewhere 502 | # Initialize variables (for TensorFlow 1.x) 503 | if tf.__version__ < '2.0.0': 504 | sess = tf.compat.v1.Session() 505 | sess.run(tf.compat.v1.global_variables_initializer()) 506 | 507 | model = YoutubeDNN(user_feature_columns, item_feature_columns, user_dnn_hidden_units=(64, 16, embedding_dim), 508 | sampler_config=sampler_config) 509 | # 模型编译 510 | model.compile(optimizer="adam", loss=sampledsoftmaxloss) 511 | 512 | # 模型训练,这里可以定义验证集的比例,如果设置为0的话就是全量数据直接进行训练 513 | history = model.fit(train_model_input, train_label, batch_size=256, epochs=1, verbose=1, validation_split=0.0) 514 | 515 | # 训练完模型之后,提取训练的Embedding,包括user端和item端 516 | test_user_model_input = 
test_model_input
517 |     all_item_model_input = {"click_article_id": item_profile['click_article_id'].values}
518 |
519 |     user_embedding_model = Model(inputs=model.user_input, outputs=model.user_embedding)
520 |     item_embedding_model = Model(inputs=model.item_input, outputs=model.item_embedding)
521 |
522 |     # 保存当前的item_embedding 和 user_embedding 排序的时候可能能够用到,但是需要注意保存的时候需要和原始的id对应
523 |     # 注意取用户向量要用user_embedding_model, 直接用整个model做predict得到的并不是user embedding
524 |     user_embs = user_embedding_model.predict(test_user_model_input, batch_size=2 ** 12)
525 |     item_embs = item_embedding_model.predict(all_item_model_input, batch_size=2 ** 12)
526 |
527 |     # embedding保存之前归一化一下
528 |     user_embs = user_embs / np.linalg.norm(user_embs, axis=1, keepdims=True)
529 |     item_embs = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
530 |
531 |     # 将Embedding转换成字典的形式方便查询
532 |     raw_user_id_emb_dict = {user_index_2_rawid[k]: \
533 |                                 v for k, v in zip(user_profile['user_id'], user_embs)}
534 |     raw_item_id_emb_dict = {item_index_2_rawid[k]: \
535 |                                 v for k, v in zip(item_profile['click_article_id'], item_embs)}
536 |     # 将Embedding保存到本地
537 |     pickle.dump(raw_user_id_emb_dict, open(save_path + 'user_youtube_emb.pkl', 'wb'))
538 |     pickle.dump(raw_item_id_emb_dict, open(save_path + 'item_youtube_emb.pkl', 'wb'))
539 |
540 |     # faiss近邻搜索,通过user_embedding 搜索与其相似性最高的topk个item
541 |     index = faiss.IndexFlatIP(embedding_dim)
542 |
558 |
559 |     # 上面已经进行了归一化,这里可以不进行归一化了
560 |     # faiss.normalize_L2(user_embs)
561 |     # faiss.normalize_L2(item_embs)
562 |     index.add(item_embs)  # 将item向量构建索引
563 |     sim, idx = index.search(np.ascontiguousarray(user_embs), topk)  # 通过user去查询最相似的topk个item
566 |
567 |     user_recall_items_dict = collections.defaultdict(dict)
568 |     for target_idx, sim_value_list, rele_idx_list in tqdm(zip(test_user_model_input['user_id'], sim, idx)):
569 |         target_raw_id = user_index_2_rawid[target_idx]
570 |         # 从1开始是为了去掉商品本身, 所以最终获得的相似商品只有topk-1
571 |         for rele_idx, sim_value in zip(rele_idx_list[1:], sim_value_list[1:]):
572 |             if rele_idx in item_index_2_rawid:
573 |                 rele_raw_id = item_index_2_rawid[rele_idx]
574 |                 user_recall_items_dict[target_raw_id][rele_raw_id] = user_recall_items_dict.get(target_raw_id, {}) \
575 |                                                                          .get(rele_raw_id, 0) + sim_value
576 |             else:
577 |                 print(f"Warning: {rele_idx} not found in item_index_2_rawid")
578 |
579 |     user_recall_items_dict = {k: sorted(v.items(), key=lambda x: x[1], reverse=True) for k, v in
580 |                               user_recall_items_dict.items()}
581 |     # 将召回的结果进行排序
582 |
583 |     # 保存召回的结果
584 |     # 这里是直接通过向量的方式得到了召回结果,相比于上面的召回方法,上面的只是得到了i2i及u2u的相似性矩阵,还需要进行协同过滤召回才能得到召回结果
585 |     # 可以直接对这个召回结果进行评估,为了方便可以统一写一个评估函数对所有的召回结果进行评估
586 |     pickle.dump(user_recall_items_dict, open(save_path + 'youtube_u2i_dict.pkl', 'wb'))
587 |     return user_recall_items_dict
588 |
589 |
590 | # 如果需要做召回评估, 就将训练集中的最后一次点击提取出来
591 | if not metric_recall:
592 |     user_multi_recall_dict['youtubednn_recall'] = youtubednn_u2i_dict(all_click_df, topk=20)
593 | else:
594 |     trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)
595 |     user_multi_recall_dict['youtubednn_recall'] = youtubednn_u2i_dict(trn_hist_click_df, topk=20)
596 |     # 召回效果评估
597 |     metrics_recall(user_multi_recall_dict['youtubednn_recall'], trn_last_click_df, topk=20)
598 |
599 |
600 | # 基于商品的召回i2i
601 | def item_based_recommend(user_id, user_item_time_dict, i2i_sim, sim_item_topk, recall_item_num, item_topk_click,
602 |                          item_created_time_dict, emb_i2i_sim):
603 |     """
604 |     基于文章协同过滤的召回
605 |     :param user_id: 用户id
606 |     :param user_item_time_dict: 字典, 根据点击时间获取用户的点击文章序列 {user1: [(item1, time1), (item2, time2)..]...}
607 |     :param i2i_sim: 字典,文章相似性矩阵
608 |     :param sim_item_topk: 整数, 选择与当前文章最相似的前k篇文章
609 |     :param recall_item_num: 整数, 最后的召回文章数量
610 |     :param item_topk_click: 列表,点击次数最多的文章列表,用于召回补全
611 |     :param item_created_time_dict: 字典, 文章创建时间映射
612 |     :param emb_i2i_sim: 字典, 基于内容embedding计算的文章相似性矩阵
613 |     :return: 召回的文章列表 [(item1, score1), (item2, score2)...]
614 |
615 |     """
616 |     # 获取用户历史交互的文章
617 |     user_hist_items = user_item_time_dict[user_id]
618 |     user_hist_items_ = {item_id for item_id, _ in user_hist_items}
619 |
622 |
623 |     item_rank = {}
624 |     for loc, (i, click_time) in enumerate(user_hist_items):
625 |         try:
626 |             for j, wij in sorted(i2i_sim[i].items(), key=lambda x: x[1], reverse=True)[:sim_item_topk]:
627 |                 if j in user_hist_items_:
628 |                     continue
629 |
630 |                 # 文章创建时间差权重
631 |                 created_time_weight = np.exp(0.8 ** np.abs(item_created_time_dict[i] - item_created_time_dict[j]))
632 |                 # 相似文章和历史点击文章序列中历史文章所在的位置权重
633 |                 loc_weight = (0.9 ** (len(user_hist_items) - loc))
634 |
635 |                 content_weight = 1.0
636 |                 if emb_i2i_sim.get(i, {}).get(j, None) is not None:
637 |                     content_weight += emb_i2i_sim[i][j]
638 |                 if emb_i2i_sim.get(j, {}).get(i, None) is not None:
639 |                     content_weight += emb_i2i_sim[j][i]
640 |
641 |                 item_rank.setdefault(j, 0)
642 |                 item_rank[j] += created_time_weight * loc_weight * content_weight * wij
643 |         except KeyError:
644 |             continue  # 相似度矩阵中没有文章i, 跳过
645 |     # 不足recall_item_num个,用热门商品补全
646 |     if len(item_rank) < recall_item_num:
647 |         for i, item in enumerate(item_topk_click):
648 |             if item in item_rank:  # 填充的item应该不在原来的列表中
649 |                 continue
650 |             item_rank[item] = - i - 100  # 随便给个负数就行
651 |             if len(item_rank) == recall_item_num:
652 |                 break
653 |
654 |     item_rank = sorted(item_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]
655 |
656 |     return item_rank
657 |
658 |
659 | # 先进行itemcf召回, 为了召回评估,所以提取最后一次点击
660 |
661 | if metric_recall:
662 |     trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)
663 | else:
664 |     trn_hist_click_df = all_click_df
665 |
666 | user_recall_items_dict = collections.defaultdict(dict)
667 | user_item_time_dict = get_user_item_time(trn_hist_click_df)
668 |
669 | i2i_sim = pickle.load(open(save_path + 'itemcf_i2i_sim.pkl', 'rb'))
670 | emb_i2i_sim = pickle.load(open(save_path + 'emb_i2i_sim.pkl', 'rb'))
671 |
672 | sim_item_topk = 20
673 | recall_item_num = 10
674 | item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)
675 |
676 | for user in tqdm(trn_hist_click_df['user_id'].unique()):
677 |     user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, i2i_sim, sim_item_topk,
678 |                                                         recall_item_num, item_topk_click, item_created_time_dict,
679 |                                                         emb_i2i_sim)
680 |
681 | user_multi_recall_dict['itemcf_sim_itemcf_recall'] = user_recall_items_dict
682 | pickle.dump(user_multi_recall_dict['itemcf_sim_itemcf_recall'], open(save_path + 'itemcf_recall_dict.pkl', 'wb'))
683 |
684 | if metric_recall:
685 |     # 召回效果评估
686 |     metrics_recall(user_multi_recall_dict['itemcf_sim_itemcf_recall'], trn_last_click_df, topk=recall_item_num)
687 |
688 | # 基于embedding相似度(emb_i2i_sim)的召回, 这里是为了召回评估,所以提取最后一次点击
689 | if metric_recall:
690 |     trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)
691 | else:
692 |     trn_hist_click_df = all_click_df
693 |
694 | user_recall_items_dict = collections.defaultdict(dict)
695 | user_item_time_dict = get_user_item_time(trn_hist_click_df)
696 | i2i_sim = pickle.load(open(save_path + 'emb_i2i_sim.pkl', 'rb'))
697 |
698 | sim_item_topk = 20
699 | recall_item_num = 10
700 |
701 | item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)
702 |
703 | for user in tqdm(trn_hist_click_df['user_id'].unique()):
704 |     user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, i2i_sim, sim_item_topk,
705 |                                                         recall_item_num, item_topk_click, item_created_time_dict,
706 |                                                         emb_i2i_sim)
707 |
708 | user_multi_recall_dict['embedding_sim_item_recall'] = user_recall_items_dict
709 | pickle.dump(user_multi_recall_dict['embedding_sim_item_recall'],
710 |             open(save_path + 'embedding_sim_item_recall.pkl', 'wb'))
711 |
712 | if metric_recall:
713 |     # 召回效果评估
714 |     metrics_recall(user_multi_recall_dict['embedding_sim_item_recall'], trn_last_click_df, topk=recall_item_num)
715 |
743 |
744 |
745 | # 基于用户的召回 u2u2i
746 | def user_based_recommend(user_id, user_item_time_dict, u2u_sim, sim_user_topk, recall_item_num,
747 |                          item_topk_click, item_created_time_dict, emb_i2i_sim):
748 |     """
749 |     基于用户协同过滤的召回
750 |     :param user_id: 用户id
751 |     :param user_item_time_dict: 字典, 根据点击时间获取用户的点击文章序列 {user1: [(item1, time1), (item2, time2)..]...}
752 |     :param u2u_sim: 字典,用户相似性矩阵
753 |     :param sim_user_topk: 整数, 选择与当前用户最相似的前k个用户
754 |     :param recall_item_num: 整数, 最后的召回文章数量
755 |     :param item_topk_click: 列表,点击次数最多的文章列表,用于召回补全
756 |     :param item_created_time_dict: 字典, 文章创建时间映射
757 |     :param emb_i2i_sim: 字典, 基于内容embedding计算的文章相似性矩阵
758 |
759 |     :return: 召回的文章列表 [(item1, score1), (item2, score2)...]
760 |     """
761 |
762 |     # 历史交互
763 |     user_item_time_list = user_item_time_dict[user_id]  # [(item1, time1), (item2, time2)...]
764 |     user_hist_items = set([i for i, t in user_item_time_list])  # 存在一个用户与某篇文章的多次交互, 这里得去重
765 |
766 |     items_rank = {}
767 |     try:
768 |         for sim_u, wuv in sorted(u2u_sim[user_id].items(), key=lambda x: x[1], reverse=True)[:sim_user_topk]:
769 |             for i, click_time in user_item_time_dict[sim_u]:
770 |                 if i in user_hist_items:
771 |                     continue
772 |                 items_rank.setdefault(i, 0)
773 |
774 |                 loc_weight = 1.0
775 |                 content_weight = 1.0
776 |                 created_time_weight = 1.0
777 |
778 |                 # 当前文章与该用户看的历史文章进行一个权重交互
779 |                 for loc, (j, click_time) in enumerate(user_item_time_list):
780 |                     # 点击时的相对位置权重
781 |                     loc_weight += 0.9 ** (len(user_item_time_list) - loc)
782 |                     # 内容相似性权重
783 |                     if emb_i2i_sim.get(i, {}).get(j, None) is not None:
784 |                         content_weight += emb_i2i_sim[i][j]
785 |                     if emb_i2i_sim.get(j, {}).get(i, None) is not None:
786 |                         content_weight += emb_i2i_sim[j][i]
787 |
788 |                     # 创建时间差权重
789 |                     created_time_weight += np.exp(0.8 * np.abs(item_created_time_dict[i] - item_created_time_dict[j]))
790 |
791 |                 items_rank[i] += loc_weight * content_weight * created_time_weight * wuv
792 |     except KeyError:
793 |         pass  # u2u_sim中没有该用户, 直接走热度补全
794 |
795 |     # 热度补全
796 |     if len(items_rank) < recall_item_num:
797 |         for i, item in enumerate(item_topk_click):
798 |             if item in items_rank:  # 填充的item应该不在原来的列表中
799 |                 continue
800 |             items_rank[item] = - i - 100  # 随便给个负数就行
801 |             if len(items_rank) == recall_item_num:
802 |                 break
803 |
804 |     items_rank = sorted(items_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]
805 |
806 |     return items_rank
807 |
808 |
809 | # 这里是为了召回评估,所以提取最后一次点击
810 | # 由于usercf中计算user之间的相似度的过程太费内存了,全量数据这里就没有跑,跑了一个采样之后的数据
811 | if metric_recall:
812 |     trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)
813 | else:
814 |     trn_hist_click_df = all_click_df
815 |
816 | user_recall_items_dict = collections.defaultdict(dict)
817 | user_item_time_dict = get_user_item_time(trn_hist_click_df)
818 |
819 | u2u_sim = pickle.load(open(save_path + 'usercf_u2u_sim.pkl', 'rb'))
820 |
821 | sim_user_topk = 20
822 | recall_item_num = 10
823 | item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)
824 |
825 | for user in tqdm(trn_hist_click_df['user_id'].unique()):
826 |     user_recall_items_dict[user] = user_based_recommend(user, user_item_time_dict, u2u_sim, sim_user_topk, \
827 |                                                         recall_item_num, item_topk_click, item_created_time_dict,
828 |                                                         emb_i2i_sim)
829 |
830 | pickle.dump(user_recall_items_dict, open(save_path + 'usercf_u2u2i_recall.pkl', 'wb'))
831 |
832 | if metric_recall:
833 |     # 召回效果评估
834 |     metrics_recall(user_recall_items_dict, trn_last_click_df, topk=recall_item_num)
835 |
836 |
837 | # 使用Embedding的方式获取u2u的相似性矩阵
838 | # topk指的是每个user, faiss搜索后返回最相似的topk个user
839 | def u2u_embdding_sim(click_df, user_emb_dict, save_path, topk):
840 |     user_list = []
841 |     user_emb_list = []
842 |     for user_id, user_emb in user_emb_dict.items():
843 |         user_list.append(user_id)
844 |         user_emb_list.append(user_emb)
845 |
846 |     user_index_2_rawid_dict = {k: v for k, v in zip(range(len(user_list)), user_list)}
847 |
848 |     user_emb_np = np.array(user_emb_list, dtype=np.float32)
849 |
850 |     # 建立faiss索引
851 |     user_index = faiss.IndexFlatIP(user_emb_np.shape[1])
852 |     user_index.add(user_emb_np)
853 |     # 相似度查询,给每个索引位置上的向量返回topk个user以及相似度
854 |     sim, idx = user_index.search(user_emb_np, topk)  # 返回的是列表
855 |
856 |     # 将向量检索的结果保存成原始id的对应关系
857 |     user_sim_dict = collections.defaultdict(dict)
858 
| for target_idx, sim_value_list, rele_idx_list in tqdm(zip(range(len(user_emb_np)), sim, idx)): 859 | target_raw_id = user_index_2_rawid_dict[target_idx] 860 | # 从1开始是为了去掉商品本身, 所以最终获得的相似商品只有topk-1 861 | try: 862 | for rele_idx, sim_value in zip(rele_idx_list[1:], sim_value_list[1:]): 863 | rele_raw_id = user_index_2_rawid_dict[rele_idx] 864 | user_sim_dict[target_raw_id][rele_raw_id] = user_sim_dict.get(target_raw_id, {}).get(rele_raw_id, 865 | 0) + sim_value 866 | 867 | except KeyError as e: 868 | print() 869 | 870 | # 保存i2i相似度矩阵 871 | pickle.dump(user_sim_dict, open(save_path + 'youtube_u2u_sim.pkl', 'wb')) 872 | return user_sim_dict 873 | 874 | 875 | # 读取YoutubeDNN过程中产生的user embedding, 然后使用faiss计算用户之间的相似度 876 | # 这里需要注意,这里得到的user embedding其实并不是很好,因为YoutubeDNN中使用的是用户点击序列来训练的user embedding, 877 | # 如果序列普遍都比较短的话,其实效果并不是很好 878 | user_emb_dict = pickle.load(open(save_path + 'user_youtube_emb.pkl', 'rb')) 879 | u2u_sim = u2u_embdding_sim(all_click_df, user_emb_dict, save_path, topk=10) 880 | 881 | # 使用召回评估函数验证当前召回方式的效果 882 | if metric_recall: 883 | trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df) 884 | else: 885 | trn_hist_click_df = all_click_df 886 | 887 | user_recall_items_dict = collections.defaultdict(dict) 888 | user_item_time_dict = get_user_item_time(trn_hist_click_df) 889 | u2u_sim = pickle.load(open(save_path + 'youtube_u2u_sim.pkl', 'rb')) 890 | 891 | sim_user_topk = 20 892 | recall_item_num = 10 893 | 894 | item_topk_click = get_item_topk_click(trn_hist_click_df, k=50) 895 | for user in tqdm(trn_hist_click_df['user_id'].unique()): 896 | user_recall_items_dict[user] = user_based_recommend(user, user_item_time_dict, u2u_sim, sim_user_topk, \ 897 | recall_item_num, item_topk_click, item_created_time_dict, 898 | emb_i2i_sim) 899 | 900 | user_multi_recall_dict['youtubednn_usercf_recall'] = user_recall_items_dict 901 | pickle.dump(user_multi_recall_dict['youtubednn_usercf_recall'], open(save_path + 'youtubednn_usercf_recall.pkl', 'wb')) 902 | 903 | if metric_recall: 904 | # 召回效果评估 905 | metrics_recall(user_multi_recall_dict['youtubednn_usercf_recall'], trn_last_click_df, topk=recall_item_num) 906 | 907 | # 先进行itemcf召回,这里不需要做召回评估,这里只是一种策略 908 | trn_hist_click_df = all_click_df 909 | 910 | user_recall_items_dict = collections.defaultdict(dict) 911 | user_item_time_dict = get_user_item_time(trn_hist_click_df) 912 | i2i_sim = pickle.load(open(save_path + 'emb_i2i_sim.pkl', 'rb')) 913 | 914 | sim_item_topk = 150 915 | recall_item_num = 100 # 稍微召回多一点文章,便于后续的规则筛选 916 | 917 | item_topk_click = get_item_topk_click(trn_hist_click_df, k=50) 918 | for user in tqdm(trn_hist_click_df['user_id'].unique()): 919 | user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, i2i_sim, sim_item_topk, 920 | recall_item_num, item_topk_click, item_created_time_dict, 921 | emb_i2i_sim) 922 | pickle.dump(user_recall_items_dict, open(save_path + 'cold_start_items_raw_dict.pkl', 'wb')) 923 | 924 | 925 | # 基于规则进行文章过滤 926 | # 保留文章主题与用户历史浏览主题相似的文章 927 | # 保留文章字数与用户历史浏览文章字数相差不大的文章 928 | # 保留最后一次点击当天的文章 929 | # 按照相似度返回最终的结果 930 | 931 | def get_click_article_ids_set(all_click_df): 932 | return set(all_click_df.click_article_id.values) 933 | 934 | 935 | def cold_start_items(user_recall_items_dict, user_hist_item_typs_dict, user_hist_item_words_dict, \ 936 | user_last_item_created_time_dict, item_type_dict, item_words_dict, 937 | item_created_time_dict, click_article_ids_set, recall_item_num): 938 | """ 939 | 冷启动的情况下召回一些文章 940 | :param user_recall_items_dict: 
基于内容embedding相似性召回来的很多文章, 字典, {user1: [item1, item2, ..], } 941 | :param user_hist_item_typs_dict: 字典, 用户点击的文章的主题映射 942 | :param user_hist_item_words_dict: 字典, 用户点击的历史文章的字数映射 943 | :param user_last_item_created_time_idct: 字典,用户点击的历史文章创建时间映射 944 | :param item_tpye_idct: 字典,文章主题映射 945 | :param item_words_dict: 字典,文章字数映射 946 | :param item_created_time_dict: 字典, 文章创建时间映射 947 | :param click_article_ids_set: 集合,用户点击过得文章, 也就是日志里面出现过的文章 948 | :param recall_item_num: 召回文章的数量, 这个指的是没有出现在日志里面的文章数量 949 | """ 950 | 951 | cold_start_user_items_dict = {} 952 | for user, item_list in tqdm(user_recall_items_dict.items()): 953 | cold_start_user_items_dict.setdefault(user, []) 954 | for item, score in item_list: 955 | # 获取历史文章信息 956 | hist_item_type_set = user_hist_item_typs_dict[user] 957 | hist_mean_words = user_hist_item_words_dict[user] 958 | hist_last_item_created_time = user_last_item_created_time_dict[user] 959 | hist_last_item_created_time = datetime.fromtimestamp(hist_last_item_created_time) 960 | 961 | # 获取当前召回文章的信息 962 | curr_item_type = item_type_dict[item] 963 | curr_item_words = item_words_dict[item] 964 | curr_item_created_time = item_created_time_dict[item] 965 | curr_item_created_time = datetime.fromtimestamp(curr_item_created_time) 966 | 967 | # 首先,文章不能出现在用户的历史点击中, 然后根据文章主题,文章单词数,文章创建时间进行筛选 968 | if curr_item_type not in hist_item_type_set or \ 969 | item in click_article_ids_set or \ 970 | abs(curr_item_words - hist_mean_words) > 200 or \ 971 | abs((curr_item_created_time - hist_last_item_created_time).days) > 90: 972 | continue 973 | 974 | cold_start_user_items_dict[user].append((item, score)) # {user1: [(item1, score1), (item2, score2)..]...} 975 | 976 | # 需要控制一下冷启动召回的数量 977 | cold_start_user_items_dict = {k: sorted(v, key=lambda x: x[1], reverse=True)[:recall_item_num] \ 978 | for k, v in cold_start_user_items_dict.items()} 979 | 980 | pickle.dump(cold_start_user_items_dict, open(save_path + 'cold_start_user_items_dict.pkl', 'wb')) 981 | 982 | return cold_start_user_items_dict 983 | 984 | 985 | all_click_df_ = all_click_df.copy() 986 | all_click_df_ = all_click_df_.merge(item_info_df, how='left', on='click_article_id') 987 | user_hist_item_typs_dict, user_hist_item_ids_dict, user_hist_item_words_dict, user_last_item_created_time_dict = get_user_hist_item_info_dict( 988 | all_click_df_) 989 | click_article_ids_set = get_click_article_ids_set(all_click_df) 990 | # 需要注意的是 991 | # 这里使用了很多规则来筛选冷启动的文章,所以前面再召回的阶段就应该尽可能的多召回一些文章,否则很容易被删掉 992 | cold_start_user_items_dict = cold_start_items(user_recall_items_dict, user_hist_item_typs_dict, 993 | user_hist_item_words_dict, \ 994 | user_last_item_created_time_dict, item_type_dict, item_words_dict, \ 995 | item_created_time_dict, click_article_ids_set, recall_item_num) 996 | 997 | user_multi_recall_dict['cold_start_recall'] = cold_start_user_items_dict 998 | 999 | 1000 | def combine_recall_results(user_multi_recall_dict, weight_dict=None, topk=25): 1001 | final_recall_items_dict = {} 1002 | 1003 | # 对每一种召回结果按照用户进行归一化,方便后面多种召回结果,相同用户的物品之间权重相加 1004 | def norm_user_recall_items_sim(sorted_item_list): 1005 | # 如果冷启动中没有文章或者只有一篇文章,直接返回,出现这种情况的原因可能是冷启动召回的文章数量太少了, 1006 | # 基于规则筛选之后就没有文章了, 这里还可以做一些其他的策略性的筛选 1007 | if len(sorted_item_list) < 2: 1008 | return sorted_item_list 1009 | 1010 | min_sim = sorted_item_list[-1][1] 1011 | max_sim = sorted_item_list[0][1] 1012 | 1013 | norm_sorted_item_list = [] 1014 | for item, score in sorted_item_list: 1015 | if max_sim > 0: 1016 | norm_score = 1.0 * (score - min_sim) / (max_sim - min_sim) if max_sim > min_sim else 
1.0 1017 | else: 1018 | norm_score = 0.0 1019 | norm_sorted_item_list.append((item, norm_score)) 1020 | 1021 | return norm_sorted_item_list 1022 | 1023 | print('多路召回合并...') 1024 | for method, user_recall_items in tqdm(user_multi_recall_dict.items()): 1025 | print(method + '...') 1026 | # 在计算最终召回结果的时候,也可以为每一种召回结果设置一个权重 1027 | if weight_dict == None: 1028 | recall_method_weight = 1 1029 | else: 1030 | recall_method_weight = weight_dict[method] 1031 | 1032 | for user_id, sorted_item_list in user_recall_items.items(): # 进行归一化 1033 | user_recall_items[user_id] = norm_user_recall_items_sim(sorted_item_list) 1034 | 1035 | for user_id, sorted_item_list in user_recall_items.items(): 1036 | # print('user_id') 1037 | final_recall_items_dict.setdefault(user_id, {}) 1038 | for item, score in sorted_item_list: 1039 | final_recall_items_dict[user_id].setdefault(item, 0) 1040 | final_recall_items_dict[user_id][item] += recall_method_weight * score 1041 | 1042 | final_recall_items_dict_rank = {} 1043 | # 多路召回时也可以控制最终的召回数量 1044 | for user, recall_item_dict in final_recall_items_dict.items(): 1045 | final_recall_items_dict_rank[user] = sorted(recall_item_dict.items(), key=lambda x: x[1], reverse=True)[:topk] 1046 | 1047 | # 将多路召回后的最终结果字典保存到本地 1048 | pickle.dump(final_recall_items_dict, open(os.path.join(save_path, 'final_recall_items_dict.pkl'), 'wb')) 1049 | 1050 | return final_recall_items_dict_rank 1051 | 1052 | 1053 | # 这里直接对多路召回的权重给了一个相同的值,其实可以根据前面召回的情况来调整参数的值 1054 | weight_dict = {'itemcf_sim_itemcf_recall': 1.0, 1055 | 'embedding_sim_item_recall': 1.0, 1056 | 'youtubednn_recall': 1.0, 1057 | 'youtubednn_usercf_recall': 1.0, 1058 | 'cold_start_recall': 1.0} 1059 | 1060 | # 最终合并之后每个用户召回150个商品进行排序 1061 | final_recall_items_dict_rank = combine_recall_results(user_multi_recall_dict, weight_dict, topk=150) 1062 | --------------------------------------------------------------------------------
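# ----------------------------------------------------------------------
# 补充示例(示意代码, 非教程原文): combine_recall_results 返回的
# final_recall_items_dict_rank 形如 {user_id: [(article_id, score), ...]},
# 每个用户按合并后的分数降序最多保留 topk 篇。下面演示把它整理成
# (user_id, click_article_id, recall_score) 的 DataFrame, 方便后续排序阶段使用;
# 其中列名只是示例, 并非教程规定的格式。
import pandas as pd

recall_rows = []
for user_id, item_score_list in final_recall_items_dict_rank.items():
    for article_id, score in item_score_list:
        recall_rows.append([user_id, article_id, score])

recall_df = pd.DataFrame(recall_rows, columns=['user_id', 'click_article_id', 'recall_score'])
print(recall_df.head())
# ----------------------------------------------------------------------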