├── .gitignore
├── LICENSE
├── README.md
├── example.py
├── setup.py
├── source_data
│   ├── newsid_content.csv
│   └── userid_newsid.csv
└── ucas_dm
    ├── __init__.py
    ├── prediction_algorithms
    │   ├── __init__.py
    │   ├── base_algo.py
    │   ├── baseline.py
    │   ├── collaborate_based_algo.py
    │   ├── content_based_algo.py
    │   ├── nmf.py
    │   ├── surprise_base_algo.py
    │   └── topic_based_algo.py
    ├── preprocess
    │   ├── __init__.py
    │   ├── preprocess.py
    │   └── stop_words
    │       └── stop.txt
    └── utils.py
/.gitignore:
--------------------------------------------------------------------------------
1 | auto_encoder.py
2 | backup.py
3 | knn.py
4 | .*
5 | *.log
6 | !.gitignore
7 | **/__pycache__/**
8 | data.txt
9 | .idea
10 | models/
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2018 Unknow
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # UCAS-DM
2 | ![](https://img.shields.io/badge/version-1.0.0-green.svg)
3 | ![](https://img.shields.io/badge/docs-passing-green.svg)
4 | ![](https://img.shields.io/badge/python-3.x-blue.svg)
5 | ![](https://img.shields.io/badge/License-MIT-blue.svg)
6 | 
7 | Introduction
8 | --------
9 | UCAS-DM (UCAS-DataMining) is a fairly simple recommendation-algorithm library designed specifically for the **news recommendation** course project of the Web Data Mining course at UCAS. It consists of three parts: data preprocessing, recommendation algorithms, and performance evaluation. The library aims to provide convenient interfaces for data processing and analysis, so that users can focus on analyzing the data and tuning the algorithms.
10 | 
11 | The full API documentation is available here: [Docs](http://YLonely.github.io/web-data-mining)
12 | 
13 | Requirements
14 | --------
15 | ### System
16 | Since this library uses [Faiss](https://github.com/facebookresearch/faiss) to speed up the recommendation algorithms, and Faiss currently only supports Linux and OS X, please use this library in a *nix environment.
17 | ### Python version
18 | This package has been tested under Python 3.6.5 and does not support Python 2.x; using it under Python 3.5~3.6 is recommended.
19 | 
20 | ### Recommended setup :white_check_mark:
21 | 1. Install the matching version of Anaconda to get a Python environment with the required packages (e.g. numpy, pandas). Click [here](https://repo.continuum.io/archive/) for installers of earlier releases.
22 | 2. Install the Faiss package (the CPU version is sufficient) via conda, following the official [guide](https://github.com/facebookresearch/faiss/blob/master/INSTALL.md).
23 | 
24 | Installation
25 | ------
26 | Simply install it via pip:
27 | ```
28 | pip install ucas-dm
29 | ```
30 | Installing this library will automatically check for and install (if missing) `numpy`, `pandas`, `gensim`, `jieba`, `scikit-surprise` and other dependencies, but it will not check for or install `Faiss`.
31 | 
32 | Quick tutorial
33 | ----
34 | ### Data preprocessing
35 | ```python
36 | import pandas as pd
37 | from ucas_dm.preprocess import PreProcessor
38 | 
39 | path_to_source_data = ".."
40 | pp = PreProcessor(path_to_source_data)
41 | news_id_and_its_content = pp.extract_news()
42 | news_id_and_its_content.to_csv(path_or_buf = "./news_id_and_its_content.csv", index = False)
43 | 
44 | user_logs = pp.extract_logs()
45 | user_logs.to_csv(path_or_buf = "./user_logs.csv", index = False)
46 | ```
47 | The `PreProcessor` class provides the `extract_news`, `generate_tokens`, `build_tf_idf` and `extract_logs` methods. Calling `generate_tokens` and `build_tf_idf` directly is not recommended; they are invoked internally by the algorithms in the recommendation package. `extract_news` and `extract_logs` extract the news content and the users' browsing history from the raw data respectively, and both return `pandas.DataFrame` objects. Both methods assume that the raw data has the following layout:
48 | 
49 | | user id | news id | view time | title | news content | publish time |
50 | | :----: | :----: | :------: | :---: | :------: | :------: |
51 | | …… | …… | …… | …… | …… | …… |
52 | 
53 | The preprocessing part is not very reusable, so if the raw data changes, the data or the preprocessing code has to be adapted. The *source_data* folder of this project already contains the news content and the user browsing records extracted from the raw data, which can be used directly.
54 | 
55 | ### Using the recommendation algorithms
56 | ```python
57 | import pandas as pd
58 | from ucas_dm.prediction_algorithms import CollaborateBasedAlgo, TopicBasedAlgo
59 | 
60 | k_items = 35
61 | user_logs = pd.read_csv("./user_logs.csv")
62 | k_neighbors = 10
63 | cb = CollaborateBasedAlgo(user_based = True, k = k_neighbors)
64 | cb.train(user_logs)
65 | recommend1 = cb.top_k_recommend(u_id = 12345, k = k_items)[1]
66 | cb.save("./cb.model")
67 | 
68 | news_id_and_its_content = pd.read_csv("./news_id_and_its_content.csv")
69 | initial_params = TopicBasedAlgo.preprocess(news_id_and_its_content)
70 | lsi = TopicBasedAlgo(initial_params = initial_params, topic_n = 100, topic_type = 'lsi', chunksize = 100)
71 | lsi.train(user_logs)
72 | recommend2 = lsi.top_k_recommend(u_id = 12345, k = k_items)[1]
73 | ```
74 | All recommendation algorithms are collected in the `prediction_algorithms` package. The algorithms that can be used directly are BaseLineAlgo (random recommendation, useful as a baseline), CollaborateBasedAlgo (collaborative filtering), NMF (collaborative filtering based on non-negative matrix factorization) and TopicBasedAlgo (content-based recommendation using topic models). They all implement the BaseAlgo interface, directly or indirectly. After initialization and before use, every algorithm has to be trained by calling `train` with the users' browsing history; `top_k_recommend` then returns the top `k` recommendations for a given user. `top_k_recommend` returns a list of recommendation scores and a list of the ids of the top `k` recommended items, in that order (hence the `[1]` in the example above to get the item ids). The meaning of the score depends on the algorithm: in collaborative filtering it is the product of the similarity and the ratings of similar users or items, while in content-based recommendation it is the similarity between the recommended item and the user's interest model. TopicBasedAlgo is used slightly differently: `preprocess` has to be called first to further process the data and produce `initial_params`, a parameter required at initialization. This parameter can be saved to and loaded from a file; its actual type is `prediction_algorithms.InitialParams`. All algorithm models can be stored and restored with `save` and `load`.
75 | 
76 | ### Evaluating an algorithm
77 | ```python
78 | from ucas_dm.utils import Evaluator
79 | from ucas_dm.prediction_algorithms import CollaborateBasedAlgo
80 | import pandas as pd
81 | 
82 | k_list = [5, 10, 15]
83 | k_neighbors = 30
84 | user_logs = pd.read_csv("./user_logs.csv")
85 | eva = Evaluator(user_logs)
86 | cb = CollaborateBasedAlgo(user_based = True, k = k_neighbors)
87 | eva.evaluate(algo = cb, k = k_list, n_jobs = 2, split_date = '2014-3-21', auto_log = True)
88 | ```
89 | The `Evaluator` in the `utils` package makes it easy to measure the performance of different recommendation models on different data sets. Evaluation is done with its `evaluate` method; the `k` parameter specifies how many items are recommended to each user and may be a list, which makes it convenient to evaluate several settings at once. At the moment `Evaluator` can only split the data into a training set and a test set by view time. If `auto_log` is enabled, `evaluate` automatically records the performance metrics of the algorithm and appends them in JSON format to the file `./performance.log`.
90 | 
91 | Feedback
92 | -----
93 | Feedback is welcome; if there are bugs I will try to fix them :blush:
--------------------------------------------------------------------------------
/example.py: -------------------------------------------------------------------------------- 1 | from ucas_dm.prediction_algorithms import BaseLineAlgo 2 | from ucas_dm.prediction_algorithms import CollaborateBasedAlgo 3 | from ucas_dm.prediction_algorithms import TopicBasedAlgo 4 | from ucas_dm.prediction_algorithms import NMF 5 | from ucas_dm.prediction_algorithms import InitialParams 6 | from ucas_dm.utils import Evaluator 7 | import pandas as pd 8 | 9 | data_path = "./source_data/" 10 | id_content = pd.read_csv(data_path + "newsid_content.csv") 11 | user_log = pd.read_csv(data_path + 'userid_newsid.csv') 12 | eva = Evaluator(user_log) 13 | initial_params = TopicBasedAlgo.preprocess(id_content) 14 | initial_params.save('./tmp/initial_p') 15 | # initial_params = InitialParams.load('/tmp/initial_p') 16 | lsi = TopicBasedAlgo(initial_params=initial_params, topic_n=100, topic_type='lsi', chunksize=1000) 17 | lsi.save('./tmp/lsi_model') 18 | # lsi = TopicBasedAlgo.load('./tmp/lsi_model') 19 | k_list = [5, 10, 15] 20 | n_jobs = 6 21 | eva.evaluate(algo=lsi, k=k_list, n_jobs=n_jobs, split_date='2014-3-21', auto_log=True, debug=True) 22 | 23 | for factor in [10, 15, 20]: 24 | nmf = NMF(n_factors=factor, random_state=112) 25 | eva.evaluate(algo=nmf, k=k_list, n_jobs=n_jobs, split_date='2014-3-21', auto_log=True) 26 | 27 | for k_neighbor in [5, 10, 15, 20]: 28 | cb = CollaborateBasedAlgo(user_based=True, k=k_neighbor) 29 | eva.evaluate(algo=cb, k=k_list, n_jobs=n_jobs, split_date='2014-3-21', auto_log=True) 30 | 31 | for k_neighbor in [5, 10, 15, 20]: 32 | cb = CollaborateBasedAlgo(user_based=False, k=k_neighbor, sim_func='pearson') 33 | eva.evaluate(algo=cb, k=k_list, n_jobs=6, split_date='2014-3-21', auto_log=True, debug=False) 34 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | setup( 4 | name='ucas_dm', 5 | version='1.0.0', 6 | description="ucas_dm is a simple library provides preprocess and some recommend algorithms for UCAS's web data " 7 | "mining homework project", 8 | author='YLonely', 9 | license='MIT', 10 | author_email='loneybw@gmail.com', 11 | url='https://github.com/YLonely/web-data-mining', 12 | packages=find_packages(), 13 | data_files=[('ucas_dm/preprocess/stop_words', 14 | ['ucas_dm/preprocess/stop_words/stop.txt'])], 15 | platforms=['MacOS', 'Linux'], 16 | classifiers=[ 17 | 'Development Status :: 4 - Beta', 18 | 'Intended Audience :: Developers', 19 | 'License :: OSI Approved :: MIT License', 20 | 'Programming Language :: Python :: 3 :: Only', 21 | 'Topic :: Scientific/Engineering :: Information Analysis' 22 | ], 23 | install_requires=['pandas', 24 | 'numpy', 25 | 'gensim', 26 | 'jieba', 27 | 'scikit-surprise'] 28 | ) 29 | -------------------------------------------------------------------------------- /ucas_dm/__init__.py: -------------------------------------------------------------------------------- 1 | from .prediction_algorithms import BaseAlgo 2 | from .prediction_algorithms import BaseLineAlgo 3 | from .prediction_algorithms import CollaborateBasedAlgo 4 | from .prediction_algorithms import ContentBasedAlgo 5 | from .prediction_algorithms import NMF 6 | from .prediction_algorithms import SurpriseBaseAlgo 7 | from .prediction_algorithms import TopicBasedAlgo 8 | from .prediction_algorithms import InitialParams 9 | 10 | from .preprocess import PreProcessor 11 | from .utils import Evaluator 
12 | 13 | __all__ = ['BaseAlgo', 'BaseLineAlgo', 'CollaborateBasedAlgo', 'ContentBasedAlgo', 'NMF', 'SurpriseBaseAlgo', 14 | 'TopicBasedAlgo', 'PreProcessor', 'Evaluator', 'InitialParams'] 15 | __version__ = '1.0.0' 16 | -------------------------------------------------------------------------------- /ucas_dm/prediction_algorithms/__init__.py: -------------------------------------------------------------------------------- 1 | from .base_algo import BaseAlgo 2 | from .baseline import BaseLineAlgo 3 | from .collaborate_based_algo import CollaborateBasedAlgo 4 | from .content_based_algo import ContentBasedAlgo 5 | from .nmf import NMF 6 | from .surprise_base_algo import SurpriseBaseAlgo 7 | from .topic_based_algo import TopicBasedAlgo 8 | from .topic_based_algo import InitialParams 9 | 10 | __all__ = ['BaseLineAlgo', 'BaseAlgo', 'CollaborateBasedAlgo', 'ContentBasedAlgo', 'NMF', 'SurpriseBaseAlgo', 11 | 'TopicBasedAlgo', 'InitialParams'] 12 | -------------------------------------------------------------------------------- /ucas_dm/prediction_algorithms/base_algo.py: -------------------------------------------------------------------------------- 1 | import pickle as pic 2 | 3 | 4 | class BaseAlgo: 5 | """ 6 | Do not use this class directly. 7 | The interface of all recommend algorithms. 8 | """ 9 | 10 | def __init__(self): 11 | pass 12 | 13 | def train(self, train_set): 14 | """ 15 | Do some train-set-dependent work here: for example calculate sims between users or items 16 | 17 | :param train_set: A pandas.DataFrame contains two attributes: user_id and item_id,which \ 18 | represents the user view record during a period of time. 19 | :return: return a model that is ready to give recommend 20 | """ 21 | raise NotImplementedError() 22 | 23 | def top_k_recommend(self, u_id, k): 24 | """ 25 | Calculate the top-K recommend items 26 | 27 | :param u_id: users' identity (user's id) 28 | :param k: the number of the items that the recommender should return 29 | :return: (v,id) v is a list contains predict rate or distance, id is a list contains top-k highest rated or \ 30 | nearest items 31 | """ 32 | raise NotImplementedError() 33 | 34 | def to_dict(self): 35 | """ 36 | Convert algorithm model to a dict which contains algorithm's type and it's main hyper-parameters. 37 | 38 | :return: A dict contains type and hyper-parameters. 39 | """ 40 | raise NotImplementedError() 41 | 42 | @classmethod 43 | def load(cls, fname): 44 | """ 45 | Load an object previously saved from a file 46 | 47 | :param fname: file path 48 | :return: object loaded from file 49 | """ 50 | with open(fname, 'rb') as f: 51 | obj = pic.load(f) 52 | return obj 53 | 54 | def save(self, fname, ignore=None): 55 | """ 56 | Save an object to a file. 57 | 58 | :param fname: file path 59 | :param ignore: a set of attributes that should't be saved by super class, but subclass may have to handle \ 60 | these special attributes. 61 | """ 62 | if ignore is not None: 63 | for attr in ignore: 64 | if hasattr(self, attr): 65 | setattr(self, attr, None) 66 | with open(fname, 'wb') as f: 67 | pic.dump(self, f) 68 | -------------------------------------------------------------------------------- /ucas_dm/prediction_algorithms/baseline.py: -------------------------------------------------------------------------------- 1 | from .base_algo import BaseAlgo 2 | import numpy as np 3 | import pandas as pd 4 | 5 | 6 | class BaseLineAlgo(BaseAlgo): 7 | """ 8 | A simple recommend algorithm that recommend items in random. 9 | Use it as a base-line. 
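
    A minimal usage sketch (hypothetical toy data, not from the project's dataset)::

        import pandas as pd
        from ucas_dm.prediction_algorithms import BaseLineAlgo

        logs = pd.DataFrame({'user_id': [1, 1, 2], 'item_id': [10, 11, 10]})
        algo = BaseLineAlgo()
        algo.train(logs)
        rates, items = algo.top_k_recommend(u_id=2, k=1)  # random scores, unseen items only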
10 | """ 11 | 12 | def __init__(self): 13 | super().__init__() 14 | self._user_log = None 15 | 16 | def train(self, train_set): 17 | self._user_log = pd.DataFrame(train_set) 18 | self._user_log.columns = ['user_id', 'item_id'] 19 | 20 | def top_k_recommend(self, u_id, k): 21 | specific_user_log = self._user_log[self._user_log['user_id'] == u_id] 22 | viewed_num = specific_user_log.shape[0] 23 | assert (viewed_num != 0), "User id doesn't exist" 24 | predict_rate_log = self._user_log.copy() 25 | predict_rate_log = predict_rate_log[['item_id']].drop_duplicates() 26 | predict_rate_log = predict_rate_log[~predict_rate_log['item_id'].isin(specific_user_log['item_id'])] 27 | predict_rate_log['prate'] = np.random.rand(predict_rate_log.shape[0]) 28 | predict_rate_log = predict_rate_log.sort_values(by=['prate'], ascending=False) 29 | predict_rate_log = predict_rate_log[:k] 30 | top_k_rate = predict_rate_log['prate'].values.tolist() 31 | top_k_item = predict_rate_log['item_id'].values.tolist() 32 | return top_k_rate, top_k_item 33 | 34 | def to_dict(self): 35 | """ 36 | See :meth:`BaseAlgo.to_dict ` for more details. 37 | """ 38 | return {'type': 'BaseLineAlgo'} 39 | -------------------------------------------------------------------------------- /ucas_dm/prediction_algorithms/collaborate_based_algo.py: -------------------------------------------------------------------------------- 1 | from .surprise_base_algo import SurpriseBaseAlgo 2 | from surprise.prediction_algorithms import KNNBasic 3 | import pandas as pd 4 | import math 5 | from surprise import Dataset, Reader 6 | 7 | 8 | class CollaborateBasedAlgo(SurpriseBaseAlgo): 9 | """ 10 | Collaborative filtering algorithm. 11 | """ 12 | 13 | def __init__(self, sim_func='cosine', user_based=True, k=1): 14 | """ 15 | :param sim_func: similarity function: 'cosine','msd','pearson','pearson_baseline' 16 | :param user_based: True--> user-user filtering strategy;False--> item-item filtering strategy 17 | :param k: The (max) number of neighbors to take into account for aggregation 18 | """ 19 | super().__init__() 20 | self._user_based = user_based 21 | self._sim_func = sim_func 22 | self._k = k 23 | 24 | def train(self, train_set): 25 | # News recommendation is a typical case that use users' implicit feedback to give recommendations, train set 26 | # only contains binary or unary data (1 for seen, 0 for unseen). According to some papers, normalizing user 27 | # vectors to unit vectors will increase the accuracy of recommending with binary data. 
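        # (Added note, not in the original implementation:) with the weighting used below, a user
        # who viewed n distinct items contributes a rating of 1/sqrt(n) for each of them, so every
        # user's implicit-feedback vector ends up with unit L2 norm, e.g. n = 4 -> each rating 0.5.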
28 | if self._surprise_model is None: 29 | self._surprise_model = self._init_surprise_model() 30 | train_set = pd.DataFrame(train_set) 31 | train_set.columns = ['user_id', 'item_id'] 32 | self._user_log = train_set.copy() 33 | train_set = train_set.drop_duplicates() 34 | groups = train_set.groupby(['user_id']) 35 | id_to_group_size = {} 36 | for user_id, group in groups: 37 | id_to_group_size[user_id] = group.shape[0] 38 | train_set['rate'] = 1 39 | train_set['rate'] = train_set.apply(lambda row: 1 / math.sqrt(id_to_group_size[row['user_id']]), axis=1) 40 | reader = Reader(rating_scale=(0, 1)) 41 | train_s = Dataset.load_from_df(train_set, reader) 42 | ''' train surprise-framework based model ''' 43 | self._surprise_model.fit(train_s.build_full_trainset()) 44 | return self 45 | 46 | def _init_surprise_model(self): 47 | sim_options = {'name': self._sim_func, 'user_based': self._user_based} 48 | return KNNBasic(k=self._k, sim_options=sim_options) 49 | 50 | def to_dict(self): 51 | """ 52 | See :meth:`BaseAlgo.to_dict ` for more details. 53 | """ 54 | return {'type': 'Collaborative filtering', 'user_based': self._user_based, 'sim_fun': self._sim_func, 55 | 'k': self._k} 56 | -------------------------------------------------------------------------------- /ucas_dm/prediction_algorithms/content_based_algo.py: -------------------------------------------------------------------------------- 1 | from .base_algo import BaseAlgo 2 | import numpy as np 3 | import pandas as pd 4 | import faiss 5 | import multiprocessing as mp 6 | from multiprocessing import cpu_count 7 | 8 | 9 | class ContentBasedAlgo(BaseAlgo): 10 | """ 11 | Content-based prediction algorithm 12 | """ 13 | 14 | def __init__(self, item_vector, dimension): 15 | """ 16 | :param item_vector: Should be a pd.DataFrame contains item_id(integer) and its vector([float]) | id | vector | 17 | :param dimension: Vector's dimensions. 18 | """ 19 | super().__init__() 20 | self._dimension = dimension 21 | self._item_vector = pd.DataFrame(item_vector) 22 | self._item_vector.columns = ['item_id', 'vec'] 23 | self._retrieval_model = self._generate_retrieval_model() 24 | self._user_vector = {} 25 | self._user_log = None 26 | 27 | def train(self, train_set): 28 | """ 29 | Main job is calculating user model for every user. Use multi-process to speed up the training. 30 | 31 | See :meth:`BaseAlog.train ` for more details. 
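
        A minimal end-to-end sketch (hypothetical two-dimensional item vectors; requires Faiss)::

            import pandas as pd
            from ucas_dm.prediction_algorithms import ContentBasedAlgo

            item_vecs = pd.DataFrame({'item_id': [1, 2, 3],
                                      'vec': [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]})
            algo = ContentBasedAlgo(item_vector=item_vecs, dimension=2)
            algo.train(pd.DataFrame({'user_id': [7, 7], 'item_id': [1, 3]}))
            distances, items = algo.top_k_recommend(u_id=7, k=1)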
32 | """ 33 | 34 | class TrainJob(mp.Process): 35 | def __init__(self, func, result_list, *args): 36 | super().__init__() 37 | self.func = func 38 | self.args = args 39 | self.res = result_list 40 | 41 | def run(self): 42 | self.res.append(self.func(*self.args)) 43 | 44 | self._user_log = pd.DataFrame(train_set) 45 | self._user_log.columns = ['user_id', 'item_id'] 46 | self._user_log.drop_duplicates(inplace=True) 47 | '''Calculate user model''' 48 | manager = mp.Manager() 49 | res_list = manager.list() 50 | user_ids = self._user_log['user_id'].drop_duplicates().values.tolist() 51 | part = 2 52 | cpus = cpu_count() 53 | job_list = [] 54 | jobs = int(cpus / part) # Use 1/2 of the cpus 55 | if jobs <= 0: 56 | jobs = 1 57 | part_ids_num = int((len(user_ids) + jobs - 1) / jobs) 58 | for i in range(jobs): 59 | part_ids = user_ids[i * part_ids_num:i * part_ids_num + part_ids_num] 60 | j = TrainJob(self._build_user_model, res_list, part_ids) 61 | job_list.append(j) 62 | j.start() 63 | for job in job_list: 64 | job.join() 65 | for ids_dict in res_list: 66 | for key in ids_dict.keys(): 67 | self._user_vector[key] = ids_dict[key] 68 | return self 69 | 70 | def top_k_recommend(self, u_id, k): 71 | """ 72 | See :meth:`BaseAlog.top_k_recommend ` for more details. 73 | """ 74 | if self._retrieval_model is None: 75 | raise RuntimeError('Run method train() first.') 76 | specific_user_log = self._user_log[self._user_log['user_id'] == u_id] 77 | viewed_num = specific_user_log.shape[0] 78 | assert (viewed_num != 0), "User id doesn't exist." 79 | specific_user_vec = self._user_vector[u_id] 80 | normal_specific_user_vec = ContentBasedAlgo._vector_normalize(np.array([specific_user_vec]).astype('float32')) 81 | ''' k+viewed_num make sure that we have at least k unseen items ''' 82 | distance, index = self._retrieval_model.search(normal_specific_user_vec, k + viewed_num) 83 | item_res = self._item_vector.loc[index[0]] 84 | res = pd.DataFrame({'dis': distance[0], 'item_id': item_res['item_id']}) 85 | res = res[~res['item_id'].isin(specific_user_log['item_id'])] 86 | res = res[:k] 87 | ''' return top-k smallest cosine distance and the ids of the items which hold that distance. ''' 88 | return res['dis'].values.tolist(), res['item_id'].values.tolist() 89 | 90 | @classmethod 91 | def load(cls, fname): 92 | """ 93 | See :meth:`BaseAlog.load ` for more details. 94 | """ 95 | res = super(ContentBasedAlgo, cls).load(fname) 96 | assert (hasattr(res, '_retrieval_model')), 'Not a standard ContentBasedAlgo class.' 97 | setattr(res, '_retrieval_model', faiss.read_index(fname + ".retrieval")) 98 | return res 99 | 100 | def save(self, fname, ignore=None): 101 | """ 102 | See :meth:`BaseAlog.save ` for more details. 
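
        A minimal sketch (hypothetical path). Note that the Faiss index is written to a separate
        file whose name is the given path plus the ".retrieval" suffix::

            algo.save('./content_algo.model')      # also writes ./content_algo.model.retrieval
            restored = ContentBasedAlgo.load('./content_algo.model')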
103 | """ 104 | if ignore is None: 105 | ignore = [] 106 | ignore.append('_retrieval_model') 107 | faiss.write_index(self._retrieval_model, fname + ".retrieval") 108 | super().save(fname, ignore) 109 | 110 | def _generate_retrieval_model(self): 111 | """ 112 | Use the retrieval model(faiss) to speed up the vector indexing 113 | 114 | :return: Ready-to-work retrieval model from faiss 115 | """ 116 | real_vecs = self._item_vector['vec'].values.tolist() 117 | item_vector_array = np.array(real_vecs) 118 | item_vector_array = ContentBasedAlgo._vector_normalize(item_vector_array.astype('float32')) 119 | retrieval_model = faiss.IndexFlatIP(self._dimension) 120 | retrieval_model.add(item_vector_array) 121 | return retrieval_model 122 | 123 | def _build_user_model(self, user_ids): 124 | """ 125 | This method will calculate user model for all users in user_ids. 126 | 127 | :param user_ids: users' id list 128 | :return: A dict contains user's id and vector. 129 | """ 130 | res_dict = {} 131 | for user_id in user_ids: 132 | specific_user_log = self._user_log[self._user_log['user_id'] == user_id] 133 | log_vecs = pd.merge(specific_user_log, self._item_vector, how='left', on=['item_id']) 134 | assert (sum(log_vecs['vec'].notnull()) == log_vecs.shape[0]), 'Item vector sheet has null values' 135 | res_dict[user_id] = ContentBasedAlgo._calc_dim_average(np.array(log_vecs['vec'].values.tolist())) 136 | return res_dict 137 | 138 | def to_dict(self): 139 | pass 140 | 141 | @staticmethod 142 | def _calc_dim_average(vectors_array): 143 | """ 144 | This func calculate the average value on every dimension of vectors_array, but it only count none-zero values. 145 | 146 | :param vectors_array: np.array contains a list of vectors. 147 | :return: A vector has the average value in every dimension. 148 | """ 149 | array = np.array(vectors_array) 150 | threshold = 0.001 151 | res = array.sum(axis=0, dtype='float32') 152 | valid_count = (array > threshold).sum(axis=0) 153 | valid_count[valid_count == 0] = 1 154 | res /= valid_count 155 | return res 156 | 157 | @staticmethod 158 | def _vector_normalize(vectors_array): 159 | vector_len_list = np.sqrt((vectors_array ** 2).sum(axis=1, keepdims=True)) 160 | # handle all-zero vectors 161 | vector_len_list[vector_len_list == 0] = 1 162 | res = vectors_array / vector_len_list 163 | return res 164 | -------------------------------------------------------------------------------- /ucas_dm/prediction_algorithms/nmf.py: -------------------------------------------------------------------------------- 1 | from .surprise_base_algo import SurpriseBaseAlgo 2 | from surprise.prediction_algorithms import matrix_factorization 3 | 4 | 5 | class NMF(SurpriseBaseAlgo): 6 | """ 7 | A collaborative filtering algorithm based on Non-negative Matrix Factorization. 8 | """ 9 | 10 | def __init__(self, n_factors=15, n_epochs=80, random_state=0, reg_pu=0.05, reg_qi=0.05): 11 | """ 12 | :param n_factors: The number of factors. Default is 20. 13 | :param n_epochs: The number of iteration of the SGD procedure. Default is 20. 14 | :param random_state: random_state (int, RandomState instance from numpy, or None) – Determines the RNG that \ 15 | will be used for initialization. If int, random_state will be used as a seed for a new RNG. This is useful to \ 16 | get the same initialization over multiple calls to fit(). If RandomState instance, this same instance is used \ 17 | as RNG. If None, the current RNG from numpy is used. Default is 0. 18 | :param reg_pu: The regularization term for users λu. 
Default is 0.05. 19 | :param reg_qi: The regularization term for items λi. Default is 0.05. 20 | """ 21 | super().__init__() 22 | self._n_factors = n_factors 23 | self._n_epochs = n_epochs 24 | self._random_state = random_state 25 | self._reg_pu = reg_pu 26 | self._reg_qi = reg_qi 27 | self._surprise_model = self._init_surprise_model() 28 | 29 | def _init_surprise_model(self): 30 | return matrix_factorization.NMF(n_factors=self._n_factors, random_state=self._random_state, 31 | n_epochs=self._n_epochs, reg_pu=self._reg_pu, reg_qi=self._reg_qi) 32 | 33 | def to_dict(self): 34 | """ 35 | See :meth:`BaseAlgo.to_dict ` for more details. 36 | """ 37 | return {'type': 'NMF', 'factors': self._n_factors, 'epochs': self._n_epochs, 38 | 'random_state': self._random_state, 'reg_pu': self._reg_pu, 'reg_qi': self._reg_qi} -------------------------------------------------------------------------------- /ucas_dm/prediction_algorithms/surprise_base_algo.py: -------------------------------------------------------------------------------- 1 | from surprise import Dataset, Reader, dump 2 | from .base_algo import BaseAlgo 3 | import pandas as pd 4 | 5 | 6 | class SurpriseBaseAlgo(BaseAlgo): 7 | """ 8 | Do not use this class directly. 9 | This is the base class for all other sub-class which use the algorithms from 10 | Python recommend package--'Surprise'. 11 | Inherit from this base class will obtain some basic features. 12 | """ 13 | 14 | def __init__(self): 15 | super().__init__() 16 | self._user_log = None 17 | self._surprise_model = None 18 | 19 | def train(self, train_set): 20 | if self._surprise_model is None: 21 | self._surprise_model = self._init_surprise_model() # Initialize prediction model 22 | self._user_log = pd.DataFrame(train_set) 23 | self._user_log.columns = ['user_id', 'item_id'] 24 | ''' Cause there is no rate in this situation, so just simply set rate to 1''' 25 | rate_log = self._user_log.copy() 26 | rate_log = rate_log.drop_duplicates() 27 | rate_log['rate'] = 1 28 | reader = Reader(rating_scale=(0, 1)) 29 | train_s = Dataset.load_from_df(rate_log, reader) 30 | ''' train surprise-framework based model ''' 31 | self._surprise_model.fit(train_s.build_full_trainset()) 32 | return self 33 | 34 | def _init_surprise_model(self): 35 | """ 36 | Sub-class should implement this method which return a prediction algorithm from package 'Surprise'. 
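
        A minimal sketch of such an override (mirrors what CollaborateBasedAlgo does; the
        hyper-parameter values are only placeholders)::

            def _init_surprise_model(self):
                from surprise.prediction_algorithms import KNNBasic
                return KNNBasic(k=10, sim_options={'name': 'cosine', 'user_based': True})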
37 | 
38 |         :return: A surprise-based recommend model
39 |         """
40 |         raise NotImplementedError()
41 | 
42 |     def top_k_recommend(self, u_id, k):
43 |         specific_user_log = self._user_log[self._user_log['user_id'] == u_id]
44 |         viewed_num = specific_user_log.shape[0]
45 |         assert (viewed_num != 0), "User id doesn't exist"
46 |         predict_rate_log = self._user_log.copy()
47 |         predict_rate_log = predict_rate_log[['item_id']].drop_duplicates()
48 |         predict_rate_log = predict_rate_log[~predict_rate_log['item_id'].isin(specific_user_log['item_id'])]
49 |         predict_rate_log['prate'] = predict_rate_log.apply(lambda row: self.predict(u_id, row['item_id']), axis=1)
50 |         predict_rate_log = predict_rate_log.sort_values(by=['prate'], ascending=False)
51 |         predict_rate_log = predict_rate_log[:k]
52 |         top_k_rate = predict_rate_log['prate'].values.tolist()
53 |         top_k_item = predict_rate_log['item_id'].values.tolist()
54 |         return top_k_rate, top_k_item
55 | 
56 |     def predict(self, u_id, i_id):
57 |         """
58 |         Predict the rating that user 'u_id' would give to item 'i_id'
59 | 
60 |         :param u_id: user id
61 |         :param i_id: item id
62 |         :return: rate value
63 |         """
64 |         _, _, _, est, _ = self._surprise_model.predict(u_id, i_id)
65 |         return est
66 | 
67 |     def to_dict(self):
68 |         raise NotImplementedError()
69 | 
70 |     @classmethod
71 |     def load(cls, fname):
72 |         res = super(SurpriseBaseAlgo, cls).load(fname)
73 |         assert (hasattr(res, '_surprise_model')), 'Not a standard SurpriseBaseAlgo class.'
74 |         setattr(res, '_surprise_model', dump.load(fname + '.surprise')[1])  # dump.load returns (predictions, algo)
75 |         return res
76 | 
77 |     def save(self, fname, *args):
78 |         if len(args) == 0:
79 |             ignore = ['_surprise_model']
80 |         else:
81 |             ignore = list(args[0]) + ['_surprise_model']
82 |         dump.dump(fname + '.surprise', algo=self._surprise_model)
83 |         super().save(fname, ignore)
84 | 
--------------------------------------------------------------------------------
/ucas_dm/prediction_algorithms/topic_based_algo.py:
--------------------------------------------------------------------------------
1 | from .content_based_algo import ContentBasedAlgo
2 | from .base_algo import BaseAlgo
3 | from ..preprocess import PreProcessor
4 | from gensim import models
5 | import pandas as pd
6 | import numpy as np
7 | import pickle as pic
8 | 
9 | 
10 | class InitialParams:
11 |     """
12 |     This class contains some necessary data for the initialization of class TopicBasedAlgo.
13 |     """
14 | 
15 |     def __init__(self, **kwargs):
16 |         self.ids = kwargs['ids']
17 |         self.id2word = kwargs['id2word']
18 |         self.corpus = kwargs['corpus']
19 | 
20 |     def save(self, fname):
21 |         """
22 |         This method saves the initial params to a file
23 | 
24 |         :param fname: file path
25 |         """
26 |         with open(fname, 'wb') as f:
27 |             pic.dump({"ids": self.ids, "id2word": self.id2word, "corpus": self.corpus}, f)
28 | 
29 |     @classmethod
30 |     def load(cls, fname):
31 |         """
32 |         Load an object previously saved from a file
33 | 
34 |         :param fname: file path
35 | 
36 |         :return: object loaded from file
37 |         """
38 |         with open(fname, 'rb') as f:
39 |             obj = pic.load(f)
40 |         return InitialParams(ids=obj['ids'], id2word=obj['id2word'], corpus=obj['corpus'])
41 | 
42 | 
43 | class TopicBasedAlgo(BaseAlgo):
44 |     """
45 |     Content-based algorithm which uses "topic model" algorithms (LSI or LDA).
46 | Use delegation strategy 47 | """ 48 | 49 | def __init__(self, initial_params, topic_n=100, chunksize=100, topic_type='lda', power_iters=2, 50 | extra_samples=100, passes=1): 51 | """ 52 | :param initial_params: An instance of InitialParams generated by :meth:`preprocess \ 53 | ` 54 | :param topic_n: The number of requested latent topics to be extracted from the training corpus. 55 | :param chunksize: Number of documents to be used in each training chunk. 56 | :param topic_type: 'lsi' or 'lda' 57 | :param power_iters: (**LSI parameter**)Number of power iteration steps to be used. Increasing the number \ 58 | of power iterations improves accuracy, but lowers performance. 59 | :param extra_samples: (**LSI parameter**)Extra samples to be used besides the rank k. Can improve accuracy. 60 | :param passes: (**LDA parameter**)Number of passes through the corpus during training. 61 | """ 62 | super().__init__() 63 | self._item_ids = initial_params.ids 64 | self._id2word = initial_params.id2word 65 | self._corpus = initial_params.corpus 66 | self._topic_n = topic_n 67 | self._topic = topic_type 68 | self._topic_model = None 69 | self._chunksize = chunksize 70 | self._power_iters = power_iters 71 | self._extra_samples = extra_samples 72 | self._passes = passes 73 | self._content_algo = ContentBasedAlgo(self._generate_item_vector(), self._topic_n) 74 | 75 | @classmethod 76 | def preprocess(cls, raw_data): 77 | """ 78 | Call this method to process raw data which contain item id and its content before initializing TopicBasedAlgo \ 79 | instance. 80 | 81 | :param raw_data: A pandas.DataFrame contains item id and content \| id \| content \| 82 | 83 | :return: A :meth:`InitialParams ` instance, a necessary parameter in the \ 84 | initialization of TopicBasedAlgo. 85 | """ 86 | id_tokens = PreProcessor.generate_tokens(raw_data) 87 | tf_res = PreProcessor.build_tf_idf(id_tokens) 88 | raw_data.columns = ['id', 'content'] 89 | return InitialParams(ids=raw_data['id'].values.tolist(), id2word=tf_res['gensim_pack']['id2word'], 90 | corpus=tf_res['gensim_pack']['corpus']) 91 | 92 | def train(self, train_set): 93 | self._content_algo.train(train_set) 94 | return self 95 | 96 | def top_k_recommend(self, u_id, k): 97 | return self._content_algo.top_k_recommend(u_id, k) 98 | 99 | @classmethod 100 | def load(cls, fname): 101 | res = super(TopicBasedAlgo, cls).load(fname) 102 | assert (hasattr(res, '_topic')), 'Not a standard TopicBasedAlgo class.' 103 | topic = getattr(res, '_topic') 104 | if topic == 'lsi': 105 | model_fname = '.'.join([fname, 'lsi']) 106 | setattr(res, '_topic_model', models.LsiModel.load(model_fname)) 107 | elif topic == 'lda': 108 | model_fname = '.'.join([fname, 'lda']) 109 | setattr(res, '_topic_model', models.LdaModel.load(model_fname)) 110 | setattr(res, '_content_algo', ContentBasedAlgo.load('.'.join([fname, 'content_base']))) 111 | return res 112 | 113 | def save(self, fname, *args): 114 | ignore = ['_topic_model', '_content_algo'] 115 | if self._topic_model is not None: 116 | self._topic_model.save('.'.join([fname, self._topic])) 117 | self._content_algo.save('.'.join([fname, 'content_base'])) 118 | super().save(fname, ignore) 119 | 120 | def _generate_item_vector(self): 121 | """ 122 | Use LDA or LSI algorithm to process TF-IDF vector and generate new item vectors. 
123 | 124 | :return: DataFrame contains item id and it's new vector 125 | """ 126 | if self._topic == 'lsi': 127 | self._topic_model = models.LsiModel(corpus=self._corpus, num_topics=self._topic_n, 128 | id2word=self._id2word, chunksize=self._chunksize, 129 | power_iters=self._power_iters, extra_samples=self._extra_samples) 130 | elif self._topic == 'lda': 131 | self._topic_model = models.LdaModel(corpus=self._corpus, num_topics=self._topic_n, 132 | id2word=self._id2word, chunksize=self._chunksize, 133 | update_every=1, passes=self._passes, dtype=np.float64) 134 | else: 135 | raise ValueError(self._topic) 136 | vecs = self._topic_model[self._corpus] 137 | pure_vecs = [] 138 | for vec in vecs: 139 | if len(vec) != self._topic_n: 140 | pure_vecs.append(TopicBasedAlgo._rebuild_vector(vec, self._topic_n)) 141 | else: 142 | pure_vecs.append([v for (index, v) in vec]) 143 | return pd.DataFrame({'item_id': self._item_ids, 'vec': pure_vecs}) 144 | 145 | def to_dict(self): 146 | """ 147 | See :meth:`BaseAlgo.to_dict ` for more details. 148 | """ 149 | res = {'type': self._topic, 'topic_num': self._topic_n, 'chunksize': self._chunksize} 150 | if self._topic == 'lsi': 151 | res['power_iters'] = self._power_iters 152 | res['extra_samples'] = self._extra_samples 153 | else: 154 | res['passes'] = self._passes 155 | return res 156 | 157 | @staticmethod 158 | def _rebuild_vector(partial_vector, dim): 159 | res = [0] * dim 160 | for (index, value) in partial_vector: 161 | res[index] = value 162 | return res 163 | -------------------------------------------------------------------------------- /ucas_dm/preprocess/__init__.py: -------------------------------------------------------------------------------- 1 | from .preprocess import PreProcessor 2 | 3 | __all__ = ['PreProcessor'] 4 | -------------------------------------------------------------------------------- /ucas_dm/preprocess/preprocess.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | import pandas as pd 3 | import jieba.posseg as pseg 4 | import codecs 5 | import os 6 | from gensim import corpora, models 7 | from ast import literal_eval 8 | 9 | 10 | class PreProcessor: 11 | 12 | def __init__(self, source_data_path): 13 | self.__source_data_path = source_data_path 14 | 15 | def extract_news(self): 16 | """ 17 | This method extract news from data and save them to a csv file. 18 | 19 | :return: A pandas.DataFrame with two attributes: news_id and content 20 | """ 21 | data = pd.read_csv(filepath_or_buffer=self.__source_data_path, sep="\\t", names=[ 22 | 'user_id', 'news_id', 'view_time', 'title', 'content', 'publish_time'], encoding="utf-8") 23 | data = data[['news_id', 'title', 'content']] 24 | data = data.fillna('') 25 | data = data.drop_duplicates().reset_index(drop=True) 26 | data['content'] = data['title'] + data['content'] 27 | id_content = data[['news_id', 'content']] 28 | return id_content 29 | 30 | @classmethod 31 | def generate_tokens(cls, id_content): 32 | """ 33 | This method generate tokens for news. 
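
        A minimal usage sketch (hypothetical one-row input; tokenization relies on jieba)::

            import pandas as pd
            from ucas_dm.preprocess import PreProcessor

            id_content = pd.DataFrame({'news_id': [1], 'content': ['一条用于演示的新闻标题和正文']})
            id_tokens = PreProcessor.generate_tokens(id_content)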
34 | 35 | :param id_content: A pandas.DataFrame of news id(integer) and its content(string) \ 36 | \|column1: news_id\|column2: content\| 37 | 38 | :return: A pd.DataFrame of news id and its tokens 39 | """ 40 | dir_path = os.path.split(__file__)[0] 41 | stop_words_path = dir_path + "/stop_words/stop.txt" 42 | id_content.columns = ['news_id', 'content'] 43 | stop_words = codecs.open(stop_words_path, encoding='utf8').readlines() 44 | stop_words = [w.strip() for w in stop_words] 45 | stop_flags = ['x', 'c', 'u', 'd', 'p', 't', 'uj', 'm', 'f', 'r'] 46 | 47 | def _tokenization(text): 48 | result = [] 49 | words = pseg.cut(text) 50 | for word, flag in words: 51 | if flag not in stop_flags and word not in stop_words: 52 | result.append(word) 53 | return result 54 | 55 | res = [] 56 | for i in range(id_content.shape[0]): 57 | content = id_content.loc[i, 'content'] 58 | result = _tokenization(content) 59 | res.append(result) 60 | 61 | assert (id_content.shape[0] == len( 62 | res)), "The number of id_content's rows doesn't match the length of tokenization result." 63 | id_tokens = pd.DataFrame({'news_id': id_content['news_id'], 'tokens': res}) 64 | return id_tokens 65 | 66 | @classmethod 67 | def build_tf_idf(cls, id_tokens): 68 | """ 69 | This method builds TF-IDF vectors for news. 70 | 71 | :param id_tokens: A pandas.DataFrame contains news id and its tokens. \|column1: news_id\|column2: tokens\| 72 | :return: A dict - {"id_tfvec": A pandas.DataFrame contains news id and its tf-idf vector \ 73 | \|column1: news_id\|column2: tf_vec\| ,"gensim_pack":{"word2dict": important parameter if package "gensim" is \ 74 | used for further process, "corpus": important parameter if package "gensim" is used for further process}} 75 | """ 76 | id_tokens.columns = ['news_id', 'tokens'] 77 | pure_tokens = id_tokens['tokens'].values.tolist() 78 | if isinstance(pure_tokens[0], str): 79 | pure_tokens = [literal_eval(t) for t in pure_tokens] # transform list-like strings to list 80 | word_dict = corpora.Dictionary(pure_tokens) # Used in LSA or LDA algorithm 81 | news_bow = [word_dict.doc2bow(t) for t in pure_tokens] 82 | algo = models.TfidfModel(news_bow) 83 | corpus_tfidf = algo[news_bow] 84 | news_vec = [] 85 | for t in corpus_tfidf: 86 | news_vec.append([v for (_, v) in t]) 87 | id_tfvec = pd.DataFrame({'news_id': id_tokens['news_id'], 'tf_vec': news_vec}) 88 | return {"id_tfvec": id_tfvec, "gensim_pack": {"id2word": word_dict, "corpus": corpus_tfidf}} 89 | 90 | def extract_logs(self): 91 | """ 92 | This method extract user's browsing history from source data and save it to a csv file. 
93 | 94 | :return: A pandas.DataFrame with 3 attributes: user_id, news_id, view_time 95 | """ 96 | data = pd.read_csv(filepath_or_buffer=self.__source_data_path, sep="\\t", names=[ 97 | 'user_id', 'news_id', 'view_time', 'title', 'content', 'publish_time'], encoding="utf-8") 98 | user_log = data[['user_id', 'news_id', 'view_time']] 99 | user_log['view_time'] = pd.to_datetime(user_log['view_time'], unit='s') 100 | user_log = user_log.drop_duplicates().reset_index(drop=True) 101 | return user_log 102 | -------------------------------------------------------------------------------- /ucas_dm/preprocess/stop_words/stop.txt: -------------------------------------------------------------------------------- 1 | 打开天窗说亮话 2 | 到目前为止 3 | 赶早不赶晚 4 | 常言说得好 5 | 何乐而不为 6 | 毫无保留地 7 | 由此可见 8 | 这就是说 9 | 这么点儿 10 | 综上所述 11 | 总的来看 12 | 总的来说 13 | 总的说来 14 | 总而言之 15 | 相对而言 16 | 除此之外 17 | 反过来说 18 | 恰恰相反 19 | 如上所述 20 | 换句话说 21 | 具体地说 22 | 具体说来 23 | 另一方面 24 | 与此同时 25 | 一则通过 26 | 毫无例外 27 | 不然的话 28 | 从此以后 29 | 从古到今 30 | 从古至今 31 | 从今以后 32 | 大张旗鼓 33 | 从无到有 34 | 从早到晚 35 | 弹指之间 36 | 不亦乐乎 37 | 不知不觉 38 | 不止一次 39 | 不择手段 40 | 不可开交 41 | 不可抗拒 42 | 不仅仅是 43 | 不管怎样 44 | 挨家挨户 45 | 长此下去 46 | 长话短说 47 | 除此而外 48 | 除此以外 49 | 除此之外 50 | 得天独厚 51 | 川流不息 52 | 长期以来 53 | 挨门挨户 54 | 挨门逐户 55 | 多多少少 56 | 多多益善 57 | 二话不说 58 | 更进一步 59 | 二话没说 60 | 分期分批 61 | 风雨无阻 62 | 归根到底 63 | 归根结底 64 | 反之亦然 65 | 大面儿上 66 | 倒不如说 67 | 成年累月 68 | 换句话说 69 | 或多或少 70 | 简而言之 71 | 接连不断 72 | 尽如人意 73 | 尽心竭力 74 | 尽心尽力 75 | 尽管如此 76 | 据我所知 77 | 具体地说 78 | 具体来说 79 | 具体说来 80 | 近几年来 81 | 每时每刻 82 | 屡次三番 83 | 三番两次 84 | 三番五次 85 | 三天两头 86 | 另一方面 87 | 老老实实 88 | 年复一年 89 | 恰恰相反 90 | 顷刻之间 91 | 穷年累月 92 | 千万千万 93 | 日复一日 94 | 如此等等 95 | 如前所述 96 | 如上所述 97 | 一方面 98 | 切不可 99 | 顷刻间 100 | 全身心 101 | 另方面 102 | 另一个 103 | 猛然间 104 | 默默地 105 | 就是说 106 | 近年来 107 | 尽可能 108 | 接下来 109 | 简言之 110 | 急匆匆 111 | 即是说 112 | 基本上 113 | 换言之 114 | 充其极 115 | 充其量 116 | 暗地里 117 | 反之则 118 | 比如说 119 | 背地里 120 | 背靠背 121 | 并没有 122 | 不得不 123 | 不得了 124 | 不得已 125 | 不仅仅 126 | 不经意 127 | 不能不 128 | 不外乎 129 | 不由得 130 | 不怎么 131 | 不至于 132 | 策略地 133 | 差不多 134 | 常言道 135 | 常言说 136 | 多年来 137 | 多年前 138 | 差一点 139 | 敞开儿 140 | 抽冷子 141 | 大不了 142 | 反倒是 143 | 反过来 144 | 大体上 145 | 当口儿 146 | 倒不如 147 | 怪不得 148 | 动不动 149 | 看起来 150 | 看上去 151 | 看样子 152 | 够瞧的 153 | 到了儿 154 | 呆呆地 155 | 来不及 156 | 来得及 157 | 到头来 158 | 连日来 159 | 于是乎 160 | 为什么 161 | 这会儿 162 | 换言之 163 | 那会儿 164 | 那么些 165 | 那么样 166 | 什么样 167 | 反过来 168 | 紧接着 169 | 就是说 170 | 要不然 171 | 要不是 172 | 一方面 173 | 以至于 174 | 自个儿 175 | 自各儿 176 | 之所以 177 | 这么些 178 | 这么样 179 | 怎么办 180 | 怎么样 181 | 谁知 182 | 顺着 183 | 似的 184 | 虽然 185 | 虽说 186 | 虽则 187 | 随着 188 | 所以 189 | 他们 190 | 他人 191 | 它们 192 | 她们 193 | 倘或 194 | 倘然 195 | 倘若 196 | 倘使 197 | 要么 198 | 要是 199 | 也罢 200 | 也好 201 | 以便 202 | 依照 203 | 以及 204 | 以免 205 | 以至 206 | 以致 207 | 抑或 208 | 因此 209 | 因而 210 | 因为 211 | 由于 212 | 有的 213 | 有关 214 | 有些 215 | 于是 216 | 与否 217 | 与其 218 | 越是 219 | 云云 220 | 一般 221 | 一旦 222 | 一来 223 | 一切 224 | 一样 225 | 同时 226 | 万一 227 | 为何 228 | 为了 229 | 为着 230 | 嗡嗡 231 | 我们 232 | 呜呼 233 | 乌乎 234 | 无论 235 | 无宁 236 | 沿着 237 | 毋宁 238 | 向着 239 | 照着 240 | 怎么 241 | 咱们 242 | 在下 243 | 再说 244 | 再者 245 | 怎样 246 | 这边 247 | 这儿 248 | 这个 249 | 这里 250 | 这么 251 | 这时 252 | 这些 253 | 这样 254 | 正如 255 | 之类 256 | 之一 257 | 只是 258 | 只限 259 | 只要 260 | 只有 261 | 至于 262 | 诸位 263 | 着呢 264 | 纵令 265 | 纵然 266 | 纵使 267 | 遵照 268 | 作为 269 | 喔唷 270 | 自从 271 | 自己 272 | 自家 273 | 自身 274 | 总之 275 | 要不 276 | 哎呀 277 | 哎哟 278 | 俺们 279 | 按照 280 | 吧哒 281 | 罢了 282 | 本着 283 | 比方 284 | 比如 285 | 鄙人 286 | 彼此 287 | 别的 288 | 别说 289 | 并且 290 | 不比 291 | 不成 292 | 不单 293 | 不但 294 | 不独 295 | 不管 296 | 不光 297 | 
不过 298 | 不仅 299 | 不拘 300 | 不论 301 | 不怕 302 | 不然 303 | 不如 304 | 不特 305 | 不惟 306 | 不问 307 | 不只 308 | 朝着 309 | 趁着 310 | 除非 311 | 除了 312 | 此间 313 | 此外 314 | 从而 315 | 但是 316 | 当着 317 | 的话 318 | 等等 319 | 叮咚 320 | 对于 321 | 多少 322 | 而况 323 | 而且 324 | 而是 325 | 而外 326 | 而言 327 | 而已 328 | 尔后 329 | 反之 330 | 非但 331 | 非徒 332 | 否则 333 | 嘎登 334 | 各个 335 | 各位 336 | 各种 337 | 各自 338 | 根据 339 | 故此 340 | 固然 341 | 关于 342 | 果然 343 | 果真 344 | 哈哈 345 | 何处 346 | 何况 347 | 何时 348 | 哼唷 349 | 呼哧 350 | 还是 351 | 还有 352 | 或是 353 | 或者 354 | 极了 355 | 及其 356 | 及至 357 | 即便 358 | 即或 359 | 即令 360 | 即若 361 | 即使 362 | 既然 363 | 既是 364 | 继而 365 | 加之 366 | 假如 367 | 假若 368 | 假使 369 | 鉴于 370 | 几时 371 | 较之 372 | 接着 373 | 结果 374 | 进而 375 | 尽管 376 | 经过 377 | 就是 378 | 可见 379 | 可是 380 | 可以 381 | 况且 382 | 开始 383 | 开外 384 | 来着 385 | 例如 386 | 连同 387 | 两者 388 | 另外 389 | 慢说 390 | 漫说 391 | 每当 392 | 莫若 393 | 某个 394 | 某些 395 | 哪边 396 | 哪儿 397 | 哪个 398 | 哪里 399 | 哪年 400 | 哪怕 401 | 哪天 402 | 哪些 403 | 哪样 404 | 那边 405 | 那儿 406 | 那个 407 | 那里 408 | 那么 409 | 那时 410 | 那些 411 | 那样 412 | 乃至 413 | 宁可 414 | 宁肯 415 | 宁愿 416 | 你们 417 | 啪达 418 | 旁人 419 | 凭借 420 | 其次 421 | 其二 422 | 其他 423 | 其它 424 | 其一 425 | 其余 426 | 其中 427 | 起见 428 | 起见 429 | 岂但 430 | 前后 431 | 前者 432 | 然而 433 | 然后 434 | 然则 435 | 人家 436 | 任何 437 | 任凭 438 | 如此 439 | 如果 440 | 如何 441 | 如其 442 | 如若 443 | 若非 444 | 若是 445 | 上下 446 | 尚且 447 | 设若 448 | 设使 449 | 甚而 450 | 甚么 451 | 甚至 452 | 省得 453 | 时候 454 | 什么 455 | 使得 456 | 是的 457 | 首先 458 | 首先 459 | 其次 460 | 再次 461 | 最后 462 | 您们 463 | 它们 464 | 她们 465 | 他们 466 | 我们 467 | 你是 468 | 您是 469 | 我是 470 | 他是 471 | 她是 472 | 它是 473 | 不是 474 | 你们 475 | 啊哈 476 | 啊呀 477 | 啊哟 478 | 挨次 479 | 挨个 480 | 挨着 481 | 哎呀 482 | 哎哟 483 | 俺们 484 | 按理 485 | 按期 486 | 默然 487 | 按时 488 | 按说 489 | 按照 490 | 暗中 491 | 暗自 492 | 昂然 493 | 八成 494 | 倍感 495 | 倍加 496 | 本人 497 | 本身 498 | 本着 499 | 并非 500 | 别人 501 | 必定 502 | 比起 503 | 比如 504 | 比照 505 | 鄙人 506 | 毕竟 507 | 必将 508 | 必须 509 | 并肩 510 | 并没 511 | 并排 512 | 并且 513 | 并无 514 | 勃然 515 | 不必 516 | 不常 517 | 不大 518 | 不单 519 | 不但 520 | 而且 521 | 不得 522 | 不迭 523 | 不定 524 | 不独 525 | 不对 526 | 不妨 527 | 不管 528 | 不光 529 | 不过 530 | 不会 531 | 不仅 532 | 不拘 533 | 不力 534 | 不了 535 | 不料 536 | 不论 537 | 不满 538 | 不免 539 | 不起 540 | 不巧 541 | 不然 542 | 不日 543 | 不少 544 | 不胜 545 | 不时 546 | 不是 547 | 不同 548 | 不能 549 | 不要 550 | 不外 551 | 不下 552 | 不限 553 | 不消 554 | 不已 555 | 不再 556 | 不曾 557 | 不止 558 | 不只 559 | 才能 560 | 彻夜 561 | 趁便 562 | 趁机 563 | 趁热 564 | 趁势 565 | 趁早 566 | 趁着 567 | 成心 568 | 乘机 569 | 乘势 570 | 乘隙 571 | 乘虚 572 | 诚然 573 | 迟早 574 | 充分 575 | 出来 576 | 出去 577 | 除此 578 | 除非 579 | 除开 580 | 除了 581 | 除去 582 | 除却 583 | 除外 584 | 处处 585 | 传说 586 | 传闻 587 | 纯粹 588 | 此后 589 | 此间 590 | 此外 591 | 此中 592 | 次第 593 | 匆匆 594 | 从不 595 | 从此 596 | 从而 597 | 从宽 598 | 从来 599 | 从轻 600 | 从速 601 | 从头 602 | 从未 603 | 从小 604 | 从新 605 | 从严 606 | 从优 607 | 从中 608 | 从重 609 | 凑巧 610 | 存心 611 | 达旦 612 | 打从 613 | 大大 614 | 大抵 615 | 大都 616 | 大多 617 | 大凡 618 | 大概 619 | 大家 620 | 大举 621 | 大略 622 | 大约 623 | 大致 624 | 待到 625 | 单纯 626 | 单单 627 | 但是 628 | 但愿 629 | 当场 630 | 当儿 631 | 当即 632 | 当然 633 | 当庭 634 | 当头 635 | 当下 636 | 当真 637 | 当中 638 | 当着 639 | 倒是 640 | 到处 641 | 到底 642 | 到头 643 | 得起 644 | 的话 645 | 的确 646 | 等到 647 | 等等 648 | 顶多 649 | 动辄 650 | 陡然 651 | 独自 652 | 断然 653 | 对于 654 | 顿时 655 | 多次 656 | 多多 657 | 多亏 658 | 而后 659 | 而论 660 | 而且 661 | 而是 662 | 而外 663 | 而言 664 | 而已 665 | 而又 666 | 尔等 667 | 反倒 668 | 反而 669 | 反手 670 | 反之 671 | 方才 672 | 方能 673 | 非常 674 | 非但 675 | 非得 676 | 分头 677 | 奋勇 678 | 愤然 679 | 更为 680 | 更加 681 | 根据 682 | 个人 683 | 各式 684 | 刚才 685 | 敢情 686 | 该当 687 | 嘎嘎 688 | 否则 689 | 赶快 690 | 敢于 691 | 刚好 692 | 
刚巧 693 | 高低 694 | 格外 695 | 隔日 696 | 隔夜 697 | 公然 698 | 过于 699 | 果然 700 | 果真 701 | 光是 702 | 关于 703 | 共总 704 | 姑且 705 | 故此 706 | 故而 707 | 故意 708 | 固然 709 | 惯常 710 | 毫不 711 | 毫无 712 | 很多 713 | 何须 714 | 好在 715 | 何必 716 | 何尝 717 | 何妨 718 | 何苦 719 | 何况 720 | 何止 721 | 很少 722 | 轰然 723 | 后来 724 | 呼啦 725 | 哗啦 726 | 互相 727 | 忽地 728 | 忽然 729 | 话说 730 | 或是 731 | 伙同 732 | 豁然 733 | 恍然 734 | 还是 735 | 或许 736 | 或者 737 | 基本 738 | 基于 739 | 极大 740 | 极度 741 | 极端 742 | 极力 743 | 极其 744 | 极为 745 | 即便 746 | 即将 747 | 及其 748 | 及至 749 | 即刻 750 | 即令 751 | 即使 752 | 几度 753 | 几番 754 | 几乎 755 | 几经 756 | 既然 757 | 继而 758 | 继之 759 | 加上 760 | 加以 761 | 加之 762 | 假如 763 | 假若 764 | 假使 765 | 间或 766 | 将才 767 | 简直 768 | 鉴于 769 | 将近 770 | 将要 771 | 交口 772 | 较比 773 | 较为 774 | 较之 775 | 皆可 776 | 截然 777 | 截至 778 | 藉以 779 | 借此 780 | 借以 781 | 届时 782 | 尽快 783 | 近来 784 | 进而 785 | 进来 786 | 进去 787 | 尽管 788 | 尽量 789 | 尽然 790 | 就算 791 | 居然 792 | 就此 793 | 就地 794 | 竟然 795 | 究竟 796 | 经常 797 | 尽早 798 | 精光 799 | 经过 800 | 就是 801 | 局外 802 | 举凡 803 | 据称 804 | 据此 805 | 据实 806 | 据说 807 | 可好 808 | 看来 809 | 开外 810 | 绝不 811 | 决不 812 | 据悉 813 | 决非 814 | 绝顶 815 | 绝对 816 | 绝非 817 | 可见 818 | 可能 819 | 可是 820 | 可以 821 | 恐怕 822 | 来讲 823 | 来看 824 | 快要 825 | 况且 826 | 拦腰 827 | 牢牢 828 | 老是 829 | 累次 830 | 累年 831 | 理当 832 | 理该 833 | 理应 834 | 例如 835 | 立地 836 | 立刻 837 | 立马 838 | 立时 839 | 联袂 840 | 连连 841 | 连日 842 | 路经 843 | 临到 844 | 连声 845 | 连同 846 | 连袂 847 | 另外 848 | 另行 849 | 屡次 850 | 屡屡 851 | 缕缕 852 | 率尔 853 | 率然 854 | 略加 855 | 略微 856 | 略为 857 | 论说 858 | 马上 859 | 猛然 860 | 没有 861 | 每当 862 | 每逢 863 | 每每 864 | 莫不 865 | 莫非 866 | 莫如 867 | 莫若 868 | 哪怕 869 | 那么 870 | 那末 871 | 那些 872 | 乃至 873 | 难道 874 | 难得 875 | 难怪 876 | 难说 877 | 你们 878 | 凝神 879 | 宁可 880 | 宁肯 881 | 宁愿 882 | 偶而 883 | 偶尔 884 | 碰巧 885 | 譬如 886 | 偏偏 887 | 平素 888 | 迫于 889 | 扑通 890 | 其次 891 | 其后 892 | 其实 893 | 其它 894 | 起初 895 | 起来 896 | 起首 897 | 起头 898 | 起先 899 | 岂但 900 | 岂非 901 | 岂止 902 | 恰逢 903 | 恰好 904 | 恰恰 905 | 恰巧 906 | 恰如 907 | 恰似 908 | 前后 909 | 前者 910 | 切莫 911 | 切切 912 | 切勿 913 | 亲口 914 | 亲身 915 | 亲手 916 | 亲眼 917 | 亲自 918 | 顷刻 919 | 请勿 920 | 取道 921 | 权时 922 | 全都 923 | 全力 924 | 全年 925 | 全然 926 | 然而 927 | 然后 928 | 人家 929 | 人人 930 | 仍旧 931 | 仍然 932 | 日见 933 | 日渐 934 | 日益 935 | 日臻 936 | 如常 937 | 如次 938 | 如果 939 | 如今 940 | 如期 941 | 如若 942 | 如上 943 | 如下 944 | 上来 945 | 上去 946 | 瑟瑟 947 | 沙沙 948 | 啊 949 | 哎 950 | 唉 951 | 俺 952 | 按 953 | 吧 954 | 把 955 | 甭 956 | 别 957 | 嘿 958 | 很 959 | 乎 960 | 会 961 | 或 962 | 既 963 | 及 964 | 啦 965 | 了 966 | 们 967 | 你 968 | 您 969 | 哦 970 | 砰 971 | 啊 972 | 你 973 | 我 974 | 他 975 | 她 976 | 它 977 | -------------------------------------------------------------------------------- /ucas_dm/utils.py: -------------------------------------------------------------------------------- 1 | from .prediction_algorithms.base_algo import BaseAlgo 2 | import pandas as pd 3 | import json 4 | import multiprocessing as mp 5 | import time 6 | import os 7 | 8 | 9 | class Evaluator: 10 | """ 11 | This class provide some methods to evaluate the performance of a recommend algorithm. 12 | Two measures are supported for now: Recall and Precision 13 | """ 14 | 15 | def __init__(self, data_set): 16 | """ 17 | :param data_set: User view log. 
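
        A minimal sketch (hypothetical three-column log; the third column must be parseable by
        ``pandas.to_datetime``)::

            import pandas as pd
            from ucas_dm.utils import Evaluator

            logs = pd.DataFrame({'user_id': [1, 1, 2],
                                 'item_id': [10, 11, 10],
                                 'view_time': ['2014-03-10', '2014-03-22', '2014-03-22']})
            eva = Evaluator(logs)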
18 | """ 19 | self.__data_set = pd.DataFrame(data_set) 20 | self.__data_set.columns = ['user_id', 'item_id', 'view_time'] 21 | self.__data_set['view_time'] = pd.to_datetime(self.__data_set['view_time']) 22 | self.__data_set.drop_duplicates(inplace=True) 23 | 24 | def evaluate(self, algo=BaseAlgo(), k=[], n_jobs=1, split_date='2000-1-1', debug=False, verbose=False, 25 | auto_log=False): 26 | """ 27 | :param algo: recommend algorithm 28 | :param k: list of integers represent the number of recommended items. 29 | :param n_jobs: The maximum number of evaluating in parallel. Use multi-thread to speed up the evaluating. 30 | :param split_date: on which date we split the log data into train and test. 31 | :param debug: if true, the evaluator will use 5000 instances in data set to run the test. 32 | :param verbose: whether to print the total time that evaluation cost. 33 | :param auto_log: if true, Evaluator will automatically save performance data to './performance.log' 34 | :return: average recall and precision 35 | """ 36 | assert (n_jobs > 0), 'n_jobs must be greater than 0.' 37 | if debug: 38 | data_set = self.__data_set[:5000] 39 | else: 40 | data_set = self.__data_set 41 | train_set = data_set[data_set['view_time'] < split_date][['user_id', 'item_id']] 42 | test_set = data_set[data_set['view_time'] >= split_date][['user_id', 'item_id']] 43 | res = [] 44 | start_time = time.time() 45 | algo.train(train_set) 46 | end_time1 = time.time() 47 | for _k in k: 48 | s_time = time.time() 49 | recall, precision = Evaluator._job_dispatch(algo, _k, train_set, test_set, n_jobs) 50 | e_time = time.time() 51 | res.append((recall, precision)) 52 | if verbose: 53 | print("Totally cost: %.1f(s)" % (e_time - s_time + end_time1 - start_time)) 54 | if auto_log: 55 | Evaluator._log_to_file(algo.to_dict(), recall=recall, precision=precision, k_recommend=_k, 56 | train_time=end_time1 - start_time, recommend_time=e_time - s_time, n_jobs=n_jobs, 57 | debug=debug) 58 | return res 59 | 60 | @staticmethod 61 | def _log_to_file(algo_dict, **kwargs): 62 | """ 63 | This func will save algorithm's dict data and it's performance data to ./performance.log in json format. 64 | 65 | :param algo_dict: algorithm's dict format. 66 | :param kwargs: Performance data of the algorithm. 
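
        Records are appended as "record1", "record2", and so on. An entry might look like this
        (illustrative values only, with some algorithm fields omitted)::

            {"record1": {"type": "NMF", "factors": 10, "recall": 0.12, "precision": 0.05,
                         "k_recommend": 10, "train_time": 35.2, "recommend_time": 118.7,
                         "n_jobs": 2, "debug": false}}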
67 | """ 68 | file_path = "./performance.log" 69 | print("Saving to " + file_path) 70 | for k in kwargs.keys(): 71 | algo_dict[k] = kwargs[k] 72 | if os.path.exists(file_path): 73 | with open(file_path, 'r+', encoding='utf-8') as f: 74 | logs = json.load(f) 75 | f.seek(0) 76 | keys = logs.keys() 77 | last_index = int(list(keys)[-1].replace("record", '')) 78 | new_key = "record" + str(last_index + 1) 79 | logs[new_key] = algo_dict 80 | json.dump(logs, f, ensure_ascii=False, indent=4) 81 | else: 82 | with open(file_path, 'w', encoding='utf-8') as f: 83 | logs = {"record1": algo_dict} 84 | json.dump(logs, f, ensure_ascii=False, indent=4) 85 | 86 | @staticmethod 87 | def _job_dispatch(algo, k, train_set, test_set, n_jobs): 88 | class TestJob(mp.Process): 89 | def __init__(self, func, result_list, *args): 90 | super().__init__() 91 | self.func = func 92 | self.args = args 93 | self.res = result_list 94 | 95 | def run(self): 96 | self.res.append(self.func(*self.args)) 97 | 98 | manager = mp.Manager() 99 | res_list = manager.list() 100 | user_ids = train_set[['user_id']].drop_duplicates() 101 | user_num = user_ids.shape[0] 102 | part_num = int((user_num + n_jobs - 1) / n_jobs) 103 | job_list = [] 104 | recall = [] 105 | precision = [] 106 | for i in range(n_jobs): 107 | part_users = user_ids[i * part_num:i * part_num + part_num] 108 | part_train_set = train_set[train_set['user_id'].isin(part_users['user_id'])] 109 | part_test_set = test_set[test_set['user_id'].isin(part_users['user_id'])] 110 | j = TestJob(Evaluator._one_round_test, res_list, algo, k, part_train_set, part_test_set) 111 | job_list.append(j) 112 | j.start() 113 | for job in job_list: 114 | job.join() 115 | # All processes finished here. 116 | for i in range(len(res_list)): 117 | recall.append(res_list[i][0]) 118 | precision.append(res_list[i][1]) 119 | return sum(recall) / len(recall), sum(precision) / len(precision) 120 | 121 | @staticmethod 122 | def _one_round_test(algo, k, train_set, test_set): 123 | user_id_series = train_set['user_id'].drop_duplicates() 124 | recall_log = [] 125 | precision_log = [] 126 | for user_id in user_id_series: 127 | user_viewed_in_test = test_set[test_set['user_id'] == user_id].drop_duplicates() 128 | if user_viewed_in_test.shape[0] == 0: 129 | continue 130 | _, recommend = algo.top_k_recommend(user_id, k) 131 | recommend = pd.DataFrame({'item_id': recommend}) 132 | assert (recommend.shape[0] > 0), 'Recommend error' 133 | cover_number = (recommend[recommend['item_id'].isin(user_viewed_in_test['item_id'])]).shape[0] 134 | recall_log.append(cover_number / user_viewed_in_test.shape[0]) 135 | precision_log.append(cover_number / recommend.shape[0]) 136 | recall = sum(recall_log) / len(recall_log) 137 | precision = sum(precision_log) / len(precision_log) 138 | return recall, precision 139 | --------------------------------------------------------------------------------