├── .gitignore ├── README.md ├── UseCases.ipynb ├── data └── .keepdir ├── inference_hp.py ├── logging_config.py ├── main.py ├── main_inference.py ├── main_train.py ├── models └── .keepdir ├── outputs └── .keepdir ├── plots └── .keepdir ├── presplit.py ├── requirements.txt └── src ├── builder.py ├── evaluation.py ├── metrics.py ├── model.py ├── sampling.py ├── train └── run.py ├── utils.py ├── utils_data.py ├── utils_inference.py └── utils_vizualization.py /.gitignore: -------------------------------------------------------------------------------- 1 | #Added by user 2 | .idea/ 3 | .DS_Store 4 | 5 | # Byte-compiled / optimized / DLL files 6 | __pycache__/ 7 | *.py[cod] 8 | *$py.class 9 | 10 | # C extensions 11 | *.so 12 | 13 | # Distribution / packaging 14 | .Python 15 | build/ 16 | develop-eggs/ 17 | dist/ 18 | downloads/ 19 | eggs/ 20 | .eggs/ 21 | lib/ 22 | lib64/ 23 | parts/ 24 | sdist/ 25 | var/ 26 | wheels/ 27 | pip-wheel-metadata/ 28 | share/python-wheels/ 29 | *.egg-info/ 30 | .installed.cfg 31 | *.egg 32 | MANIFEST 33 | 34 | # PyInstaller 35 | # Usually these files are written by a python script from a template 36 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 37 | *.manifest 38 | *.spec 39 | 40 | # Installer logs 41 | pip-log.txt 42 | pip-delete-this-directory.txt 43 | 44 | # Unit test / coverage reports 45 | htmlcov/ 46 | .tox/ 47 | .nox/ 48 | .coverage 49 | .coverage.* 50 | .cache 51 | nosetests.xml 52 | coverage.xml 53 | *.cover 54 | *.py,cover 55 | .hypothesis/ 56 | .pytest_cache/ 57 | 58 | # Translations 59 | *.mo 60 | *.pot 61 | 62 | # Django stuff: 63 | *.log 64 | local_settings.py 65 | db.sqlite3 66 | db.sqlite3-journal 67 | 68 | # Flask stuff: 69 | instance/ 70 | .webassets-cache 71 | 72 | # Scrapy stuff: 73 | .scrapy 74 | 75 | # Sphinx documentation 76 | docs/_build/ 77 | 78 | # PyBuilder 79 | target/ 80 | 81 | # Jupyter Notebook 82 | .ipynb_checkpoints 83 | 84 | # IPython 85 | profile_default/ 86 | ipython_config.py 87 | 88 | # pyenv 89 | .python-version 90 | 91 | # pipenv 92 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 93 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 94 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 95 | # install all needed dependencies. 96 | #Pipfile.lock 97 | 98 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 99 | __pypackages__/ 100 | 101 | # Celery stuff 102 | celerybeat-schedule 103 | celerybeat.pid 104 | 105 | # SageMath parsed files 106 | *.sage.py 107 | 108 | # Environments 109 | .env 110 | .venv 111 | env/ 112 | venv/ 113 | ENV/ 114 | env.bak/ 115 | venv.bak/ 116 | 117 | # Spyder project settings 118 | .spyderproject 119 | .spyproject 120 | 121 | # Rope project settings 122 | .ropeproject 123 | 124 | # mkdocs documentation 125 | /site 126 | 127 | # mypy 128 | .mypy_cache/ 129 | .dmypy.json 130 | dmypy.json 131 | 132 | # Pyre type checker 133 | .pyre/ 134 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # GNN-RecSys 2 | *This project was presented in a [40min talk + Q&A available on Youtube](https://www.youtube.com/watch?v=hvTawbQnK_w) and in a [Medium blog post](https://medium.com/decathlondevelopers/building-a-recommender-system-using-graph-neural-networks-2ee5fc4e706d)* 3 | 4 | **Graph Neural Networks for Recommender Systems**\ 5 | This repository contains code to train and test GNN models for recommendation, mainly using the Deep Graph Library 6 | ([DGL](https://docs.dgl.ai/)). 7 | 8 | 9 | **What kind of recommendation?**\ 10 | For example, an organisation might want to recommend items of interest to all users of its ecommerce platforms. 11 | 12 | **How can this repository be used?**\ 13 | This repository is aimed at helping users who wish to experiment with GNNs for recommendation, by giving a real example of code 14 | to build a GNN model, train it and serve recommendations. 15 | 16 | No training data, experiment logs, or trained models are available in this repository. 17 | 18 | **What should the data look like?**\ 19 | To run the code, users need multiple data sources, notably interaction data between users and items, as well as features of users and items. 20 | 21 | The interaction data sources should be adjacency lists. Here is an example: 22 | 23 | | customer_id | item_id | timestamp | click | purchase | 24 | |-------------------|------------------|------------|-------|----------| 25 | | imbvblxwvtiywunh | 3384934262863770 | 2018-01-01 | 0 | 1 | 26 | | nzhrkquelkgflone | 8321263216904593 | 2018-01-01 | 1 | 0 | 27 | | ... | ... | ... | ... | ... | 28 | | cgatomzvjiizvctb | 2756920171861146 | 2019-12-31 | 1 | 0 | 29 | | cnspkotxubxnxtzk | 5150255386059428 | 2019-12-31 | 0 | 1 | 30 | 31 | The feature data should contain a node identifier and node features: 32 | | customer_id | is_male | is_female | 33 | |-------------------|---------|-----------| 34 | | imbvblxwvtiywunh | 0 | 1 | 35 | | nzhrkquelkgflone | 1 | 0 | 36 | | ... | ... | ... | 37 | | cgatomzvjiizvctb | 0 | 1 | 38 | | cnspkotxubxnxtzk | 0 | 1 | 39 | 40 | ## Run the code 41 | There are 3 different usages of the code: hyperparametrization, training and inference. 42 | Examples of how to run the code are presented in UseCases.ipynb. 43 | 44 | All 3 usages require specific files to be available. Please refer to the docstrings to 45 | see which files are required. 46 | 47 | ### Hyperparametrization 48 | 49 | Hyperparametrization is done using the main.py file. 50 | Going through the space of hyperparameters, the loop builds a GNN model, trains it on a sample of training data, and computes its performance metrics. 51 | The metrics are reported in a result txt file, and the best model's parameters are saved in the models directory; the sketch below outlines this search loop.
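Under the hood, the search is a Bayesian optimisation run with scikit-optimize's `gp_minimize`. The snippet below is a simplified excerpt of main.py, not a standalone script: it assumes the `data`, `fixed_params` and `data_paths` objects that main.py builds before launching the search.

```python
from skopt import gp_minimize
from skopt.callbacks import CheckpointSaver
from skopt.utils import use_named_args

searchable_params = SearchableHyperparameters()  # search space defined in main.py

@use_named_args(dimensions=searchable_params.dimensions)
def fitness(**params):
    # Build the graph, train a ConvModel and evaluate it for this hyperparameter combination
    recall = train(data=data, fixed_params=fixed_params, data_paths=data_paths,
                   visualization=True, check_embedding=True, **params)
    return -recall  # skopt minimizes, so a better recall means a lower objective

checkpoint_saver = CheckpointSaver('checkpoint.pkl', compress=9)  # allows resuming the search later
search_result = gp_minimize(func=fitness,
                            dimensions=searchable_params.dimensions,
                            n_calls=200,
                            acq_func='EI',
                            x0=searchable_params.default_parameters,
                            callback=[checkpoint_saver],
                            random_state=46)
```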
52 | Plots of the training experiments are saved in the plots directory. 53 | Examples of recommendations are saved in the outputs directory. 54 | ```bash 55 | python main.py --from_beginning -v --visualization --check_embedding --remove 0.85 --num_epochs 100 --patience 5 --edge_batch_size 1024 --item_id_type 'ITEM IDENTIFIER' --duplicates 'keep_all' 56 | ``` 57 | Refer to the docstrings of main.py for details on parameters. 58 | 59 | ### Training 60 | 61 | Once the hyperparameters are selected, it is possible to train the chosen GNN model on all the available data. 62 | This process saves the trained model in the models directory. Plots, training logs, and examples of recommendations are also saved. 63 | ```bash 64 | python main_train.py --fixed_params_path test/fixed_params_example.pkl --params_path test/params_example.pkl --visualization --check_embedding --remove .85 --edge_batch_size 512 65 | ``` 66 | Refer to the docstrings of main_train.py for details on parameters. 67 | 68 | ### Inference 69 | With a trained model, it is possible to generate recommendations for all users or for specific users. 70 | Examples of recommendations are printed. 71 | ```bash 72 | python main_inference.py --params_path test/final_params_example.pkl --user_ids 123456 \ 73 | --user_ids 654321 --user_ids 999 \ 74 | --trained_model_path test/final_model_trained_example.pth --k 10 --remove .99 75 | ``` 76 | Refer to the docstrings of main_inference.py for details on parameters. 77 | 78 | -------------------------------------------------------------------------------- /UseCases.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "oFIozaUExVEL" 7 | }, 8 | "source": [ 9 | "# Mount the drive & download required packages\n", 10 | "This notebook was made for Colab usage. If running locally, the next cell can be omitted."
11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "!pip install dgl-cu101\n", 20 | "!pip install scikit-optimize\n", 21 | "!pip install boto3\n", 22 | "from google.colab import drive\n", 23 | "drive.mount('/content/drive')\n", 24 | "import sys\n", 25 | "sys.path.append('/content/drive/My Drive/Code/')\n", 26 | "%cd /content/drive/My\\ Drive/Code/\n", 27 | "\n", 28 | "from torch.multiprocessing import Pool, Process, set_start_method\n", 29 | "try:\n", 30 | " set_start_method('spawn')\n", 31 | "except RuntimeError:\n", 32 | " pass" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": { 38 | "id": "0vYgFk5KxHBJ" 39 | }, 40 | "source": [ 41 | "# Use case 1 : Hyperparametrization" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "!python main.py --from_beginning -v --visualization --check_embedding --remove 0.85 --num_epochs 100 --patience 5 --edge_batch_size 1024 --item_id_type 'ITEM IDENTIFIER' --duplicates 'keep_all'" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": { 56 | "id": "BHekS5cQxjGZ" 57 | }, 58 | "source": [ 59 | "# Use case 2 : Full training" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "!python main_train.py --fixed_params_path test/fixed_params_example.pkl --params_path test/params_example.pkl --visualization --check_embedding --remove .85 --edge_batch_size 512" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": { 74 | "id": "vZeLHdtTxjfT" 75 | }, 76 | "source": [ 77 | "# Use case 3 : Inference" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": { 83 | "id": "eGkv8ffZ4y26" 84 | }, 85 | "source": [ 86 | "## 3.1 : Specific users, creating the graph" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": null, 92 | "metadata": {}, 93 | "outputs": [], 94 | "source": [ 95 | "!python main_inference.py --params_path test/final_params_example.pkl --user_ids 123456 \\\n", 96 | "--user_ids 654321 --user_ids 999 \\\n", 97 | "--trained_model_path test/final_model_trained_example.pth --k 10 --remove .99" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": { 103 | "id": "qlZ-rbWW46Ue" 104 | }, 105 | "source": [ 106 | "## 3.1 : All users, importing the graph" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": { 113 | "pycharm": { 114 | "name": "#%%\n" 115 | } 116 | }, 117 | "outputs": [], 118 | "source": [ 119 | "!python main_inference.py --params_path test/final_params_example.pkl \\\n", 120 | "--user_ids all --use_saved_graph --graph_path test/final_graph_example.bin --ctm_id_path test/final_ctm_id_example.pkl \\\n", 121 | "--pdt_id_path test/final_pdt_id_example.pkl --trained_model_path test/final_model_trained_example.pth \\\n", 122 | "--k 10 --remove 0" 123 | ] 124 | } 125 | ], 126 | "metadata": { 127 | "accelerator": "GPU", 128 | "colab": { 129 | "collapsed_sections": [], 130 | "machine_shape": "hm", 131 | "name": "UseCases.ipynb", 132 | "provenance": [] 133 | }, 134 | "kernelspec": { 135 | "display_name": "Python 3", 136 | "language": "python", 137 | "name": "python3" 138 | }, 139 | "language_info": { 140 | "codemirror_mode": { 141 | "name": "ipython", 142 | "version": 3 143 | }, 144 | "file_extension": ".py", 145 | "mimetype": "text/x-python", 146 | 
"name": "python", 147 | "nbconvert_exporter": "python", 148 | "pygments_lexer": "ipython3", 149 | "version": "3.8.3" 150 | } 151 | }, 152 | "nbformat": 4, 153 | "nbformat_minor": 4 154 | } -------------------------------------------------------------------------------- /data/.keepdir: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /inference_hp.py: -------------------------------------------------------------------------------- 1 | import math 2 | 3 | import numpy as np 4 | import torch 5 | 6 | from src.utils_data import DataLoader, DataPaths, assign_graph_features 7 | 8 | from src.builder import (create_graph) 9 | from src.model import ConvModel 10 | from src.sampling import train_valid_split, generate_dataloaders 11 | from src.metrics import get_metrics_at_k 12 | from src.train.run import get_embeddings 13 | from src.utils import save_txt, read_data 14 | 15 | cuda = torch.cuda.is_available() 16 | device = torch.device('cuda') if cuda else torch.device('cpu') 17 | num_workers = 4 if cuda else 0 18 | 19 | def inference_fn(trained_model, 20 | remove, 21 | fixed_params, 22 | overwrite_fixed_params=False, 23 | days_of_purchases=710, 24 | days_of_clicks=710, 25 | lifespan_of_items=710, 26 | **params): 27 | """ 28 | Function to run inference inside the hyperparameter loop and calculate metrics. 29 | 30 | Parameters 31 | ---------- 32 | trained_model: 33 | Model trained during training of hyperparameter loop. 34 | remove: 35 | Percentage of data removed. See src.utils_data for more details. 36 | fixed_params: 37 | All parameters used during training of hyperparameter loop. See src.utils_data for more details. 38 | overwrite_fixed_params: 39 | If true, training parameters will overwritten by the parameters below. Can be useful if need to test the model 40 | on different parameters, e.g. that includes older clicks or purchases. 41 | days_of_purchases, days_of_clicks, lifespan_of_items: 42 | All parameters that can overwrite the training parameters. Only useful if overwrite_fixed_params is True. 43 | params: 44 | All other parameters used during training. 45 | 46 | Returns 47 | ------- 48 | recall: 49 | Recall on the test set. Relevant to compare with recall computed on hyperparametrization test set (since 50 | parameters like 'remove' and all overwritable parameters are different) 51 | 52 | Saves to file 53 | ------------- 54 | Metrics computed on the test set. 
55 | """ 56 | # Import parameters 57 | if isinstance(fixed_params, str): 58 | path = fixed_params 59 | fixed_params = read_data(path) 60 | class objectview(object): 61 | def __init__(self, d): 62 | self.__dict__ = d 63 | fixed_params = objectview(fixed_params) 64 | 65 | if 'params' in params.keys(): 66 | # if isinstance(params['params'], str): 67 | path = params['params'] 68 | params = read_data(path) 69 | 70 | # Initialize data 71 | data_paths = DataPaths() 72 | fixed_params.remove = remove 73 | if overwrite_fixed_params: 74 | fixed_params.days_of_purchases = days_of_purchases 75 | fixed_params.days_of_clicks = days_of_clicks 76 | fixed_params.lifespan_of_items = lifespan_of_items 77 | data = DataLoader(data_paths, fixed_params) 78 | 79 | # Get graph 80 | valid_graph = create_graph( 81 | data.graph_schema, 82 | ) 83 | valid_graph = assign_graph_features(valid_graph, 84 | fixed_params, 85 | data, 86 | **params, 87 | ) 88 | 89 | dim_dict = {'user': valid_graph.nodes['user'].data['features'].shape[1], 90 | 'item': valid_graph.nodes['item'].data['features'].shape[1], 91 | 'out': params['out_dim'], 92 | 'hidden': params['hidden_dim']} 93 | 94 | all_sids = None 95 | if 'sport' in valid_graph.ntypes: 96 | dim_dict['sport'] = valid_graph.nodes['sport'].data['features'].shape[1] 97 | all_sids = np.arange(valid_graph.num_nodes('sport')) 98 | 99 | # get training and test ids 100 | ( 101 | train_graph, 102 | train_eids_dict, 103 | valid_eids_dict, 104 | subtrain_uids, 105 | valid_uids, 106 | test_uids, 107 | all_iids, 108 | ground_truth_subtrain, 109 | ground_truth_valid, 110 | all_eids_dict 111 | ) = train_valid_split( 112 | valid_graph, 113 | data.ground_truth_test, 114 | fixed_params.etype, 115 | fixed_params.subtrain_size, 116 | fixed_params.valid_size, 117 | fixed_params.reverse_etype, 118 | fixed_params.train_on_clicks, 119 | fixed_params.remove_train_eids, 120 | params['clicks_sample'], 121 | params['purchases_sample'], 122 | ) 123 | ( 124 | edgeloader_train, 125 | edgeloader_valid, 126 | nodeloader_subtrain, 127 | nodeloader_valid, 128 | nodeloader_test 129 | ) = generate_dataloaders(valid_graph, 130 | train_graph, 131 | train_eids_dict, 132 | valid_eids_dict, 133 | subtrain_uids, 134 | valid_uids, 135 | test_uids, 136 | all_iids, 137 | fixed_params, 138 | num_workers, 139 | all_sids, 140 | embedding_layer=params['embedding_layer'], 141 | n_layers=params['n_layers'], 142 | neg_sample_size=params['neg_sample_size'], 143 | ) 144 | 145 | num_batches_test = math.ceil((len(test_uids) + len(all_iids)) / fixed_params.node_batch_size) 146 | 147 | # Import model 148 | if isinstance(trained_model, str): 149 | path = trained_model 150 | trained_model = ConvModel(valid_graph, 151 | params['n_layers'], 152 | dim_dict, 153 | params['norm'], 154 | params['dropout'], 155 | params['aggregator_type'], 156 | fixed_params.pred, 157 | params['aggregator_hetero'], 158 | params['embedding_layer'], 159 | ) 160 | trained_model.load_state_dict(torch.load(path, map_location=device)) 161 | if cuda: 162 | trained_model = trained_model.to(device) 163 | 164 | trained_model.eval() 165 | with torch.no_grad(): 166 | embeddings = get_embeddings(valid_graph, 167 | params['out_dim'], 168 | trained_model, 169 | nodeloader_test, 170 | num_batches_test, 171 | cuda, 172 | device, 173 | params['embedding_layer'], 174 | ) 175 | 176 | for ground_truth in [data.ground_truth_purchase_test, data.ground_truth_test]: 177 | precision, recall, coverage = get_metrics_at_k( 178 | embeddings, 179 | valid_graph, 180 | trained_model, 181 | 
params['out_dim'], 182 | ground_truth, 183 | all_eids_dict[('user', 'buys', 'item')], 184 | fixed_params.k, 185 | True, # Remove already bought 186 | cuda, 187 | device, 188 | fixed_params.pred, 189 | params['use_popularity'], 190 | params['weight_popularity'], 191 | ) 192 | 193 | sentence = ("TEST Precision " 194 | "{:.3f}% | Recall {:.3f}% | Coverage {:.2f}%" 195 | .format(precision * 100, 196 | recall * 100, 197 | coverage * 100)) 198 | 199 | print(sentence) 200 | save_txt(sentence, data_paths.result_filepath, mode='a') 201 | 202 | return recall 203 | -------------------------------------------------------------------------------- /logging_config.py: -------------------------------------------------------------------------------- 1 | """ 2 | config_logging.py 3 | 4 | This module aims to define the logger object. 5 | """ 6 | import logging 7 | 8 | 9 | def get_logger(name): 10 | """ 11 | This function aims to define the logger object from a (file) name. 12 | 13 | input: 14 | :name: str, file name 15 | output: 16 | :logger: logger object to do logging 17 | """ 18 | logger = logging.getLogger(name) 19 | 20 | if not logger.handlers: 21 | logger.propagate = False 22 | logger.setLevel(logging.DEBUG) 23 | # stream handler 24 | ch = logging.StreamHandler() 25 | ch.setLevel(logging.INFO) 26 | formatter = logging.Formatter('%(asctime)s-%(name)s-%(levelname)s: %(message)s') 27 | ch.setFormatter(formatter) 28 | logger.addHandler(ch) 29 | return logger -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | from datetime import timedelta 3 | import logging 4 | import math 5 | import time 6 | 7 | import click 8 | import numpy as np 9 | from skopt import gp_minimize 10 | from skopt.space import Real, Integer, Categorical 11 | from skopt.utils import use_named_args 12 | from skopt.callbacks import CheckpointSaver 13 | from skopt import load 14 | import torch 15 | 16 | from src.builder import create_graph, import_features 17 | from src.model import ConvModel, max_margin_loss 18 | from src.sampling import train_valid_split, generate_dataloaders 19 | from src.metrics import (create_already_bought, create_ground_truth, 20 | get_metrics_at_k, get_recs) 21 | from src.train.run import train_model, get_embeddings 22 | from src.evaluation import explore_recs, explore_sports, check_coverage 23 | from src.utils import save_txt, save_outputs, get_last_checkpoint 24 | from src.utils_data import DataLoader, FixedParameters, DataPaths, assign_graph_features 25 | from src.utils_vizualization import plot_train_loss 26 | import inference_hp 27 | 28 | from logging_config import get_logger 29 | 30 | log = get_logger(__name__) 31 | 32 | global cuda 33 | 34 | cuda = torch.cuda.is_available() 35 | device = torch.device('cuda') 36 | if not cuda: 37 | num_workers = 0 38 | else: 39 | num_workers = 4 40 | 41 | 42 | def train(data, fixed_params, data_paths, 43 | visualization, check_embedding, **params): 44 | """ 45 | Function to find the best hyperparameter combination. 46 | 47 | Files needed to run 48 | ------------------- 49 | All the files in the src.utils_data.DataPaths: 50 | It includes all the interactions between user, sport and items, as well as features for user, sport and items. 51 | If starting hyperparametrization from a checkpoint: 52 | The checkpoint file, generated by skopt during a previous hyperparametrization. The most recent file of 53 | the root folder will be fetched. 
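        For example, launching main.py without the --from_beginning flag (e.g. python main.py -v --visualization)
        resumes the skopt search from that most recent checkpoint.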
54 | 55 | Parameters 56 | ---------- 57 | data : 58 | Object of class DataLoader, containing multiple arguments such as user_item_train dataframe, graph schema, etc. 59 | fixed_params : 60 | All parameters that are fixed, i.e. not part of the hyperparametrization. 61 | data_paths : 62 | All data paths (mainly csv). # Note: currently, only paths.result_filepath is used here. 63 | visualization : 64 | Visualize results or not. # Note: currently not used, visualization is always on or controlled by fixed_params. 65 | check_embedding : 66 | Visualize recommendations or not. # Note: currently not used, controlled by fixed_params. 67 | **params : 68 | Mainly params that come from the hyperparametrization loop, controlled by skopt. 69 | 70 | Returns 71 | ------- 72 | recall : 73 | Recall on the test set for the current combination of hyperparameters. 74 | 75 | Saves to files 76 | -------------- 77 | logging of all experiments: 78 | All training logs are saved to result_filepath, including losses, metrics and examples of recommendations 79 | Plots of the evolution of losses and metrics are saved to the folder 'plots' 80 | best models: 81 | All models, fixed_params and params that yielded recall higher than 8% on specific item identifier or 20% on 82 | generic item identifier are saved to the folder 'models' 83 | """ 84 | # Establish hyperparameters 85 | # Dimensions 86 | out_dim = {'Very Small': 32, 'Small': 96, 'Medium': 128, 'Large': 192, 'Very Large': 256} 87 | hidden_dim = {'Very Small': 64, 'Small': 192, 'Medium': 256, 'Large': 384, 'Very Large': 512} 88 | params['out_dim'] = out_dim[params['embed_dim']] 89 | params['hidden_dim'] = hidden_dim[params['embed_dim']] 90 | 91 | # Popularity 92 | use_popularity = {'No': False, 'Small': True, 'Medium': True, 'Large': True} 93 | weight_popularity = {'No': 0, 'Small': .01, 'Medium': .05, 'Large': .1} 94 | days_popularity = {'No': 0, 'Small': 7, 'Medium': 7, 'Large': 7} 95 | params['use_popularity'] = use_popularity[params['popularity_importance']] 96 | params['weight_popularity'] = weight_popularity[params['popularity_importance']] 97 | params['days_popularity'] = days_popularity[params['popularity_importance']] 98 | 99 | if fixed_params.duplicates == 'count_occurrence': 100 | params['aggregator_type'] += '_edge' 101 | 102 | # Make sure graph data is consistent with message passing parameters 103 | if fixed_params.duplicates == 'count_occurrence': 104 | assert params['aggregator_type'].endswith('edge') 105 | else: 106 | assert not params['aggregator_type'].endswith('edge') 107 | 108 | valid_graph = create_graph( 109 | data.graph_schema, 110 | ) 111 | valid_graph = assign_graph_features(valid_graph, 112 | fixed_params, 113 | data, 114 | **params, 115 | ) 116 | 117 | dim_dict = {'user': valid_graph.nodes['user'].data['features'].shape[1], 118 | 'item': valid_graph.nodes['item'].data['features'].shape[1], 119 | 'out': params['out_dim'], 120 | 'hidden': params['hidden_dim']} 121 | 122 | all_sids = None 123 | if 'sport' in valid_graph.ntypes: 124 | dim_dict['sport'] = valid_graph.nodes['sport'].data['features'].shape[1] 125 | all_sids = np.arange(valid_graph.num_nodes('sport')) 126 | 127 | # get training and test ids 128 | ( 129 | train_graph, 130 | train_eids_dict, 131 | valid_eids_dict, 132 | subtrain_uids, 133 | valid_uids, 134 | test_uids, 135 | all_iids, 136 | ground_truth_subtrain, 137 | ground_truth_valid, 138 | all_eids_dict 139 | ) = train_valid_split( 140 | valid_graph, 141 | data.ground_truth_test, 142 | fixed_params.etype, 143 | 
fixed_params.subtrain_size, 144 | fixed_params.valid_size, 145 | fixed_params.reverse_etype, 146 | fixed_params.train_on_clicks, 147 | fixed_params.remove_train_eids, 148 | params['clicks_sample'], 149 | params['purchases_sample'], 150 | ) 151 | 152 | ( 153 | edgeloader_train, 154 | edgeloader_valid, 155 | nodeloader_subtrain, 156 | nodeloader_valid, 157 | nodeloader_test 158 | ) = generate_dataloaders(valid_graph, 159 | train_graph, 160 | train_eids_dict, 161 | valid_eids_dict, 162 | subtrain_uids, 163 | valid_uids, 164 | test_uids, 165 | all_iids, 166 | fixed_params, 167 | num_workers, 168 | all_sids, 169 | embedding_layer=params['embedding_layer'], 170 | n_layers=params['n_layers'], 171 | neg_sample_size=params['neg_sample_size'], 172 | ) 173 | 174 | train_eids_len = 0 175 | valid_eids_len = 0 176 | for etype in train_eids_dict.keys(): 177 | train_eids_len += len(train_eids_dict[etype]) 178 | valid_eids_len += len(valid_eids_dict[etype]) 179 | num_batches_train = math.ceil(train_eids_len / fixed_params.edge_batch_size) 180 | num_batches_subtrain = math.ceil( 181 | (len(subtrain_uids) + len(all_iids)) / fixed_params.node_batch_size 182 | ) 183 | num_batches_val_loss = math.ceil(valid_eids_len / fixed_params.edge_batch_size) 184 | num_batches_val_metrics = math.ceil( 185 | (len(valid_uids) + len(all_iids)) / fixed_params.node_batch_size 186 | ) 187 | num_batches_test = math.ceil( 188 | (len(test_uids) + len(all_iids)) / fixed_params.node_batch_size 189 | ) 190 | 191 | if fixed_params.neighbor_sampler == 'partial': 192 | params['n_layers'] = 3 193 | 194 | model = ConvModel(valid_graph, 195 | params['n_layers'], 196 | dim_dict, 197 | params['norm'], 198 | params['dropout'], 199 | params['aggregator_type'], 200 | fixed_params.pred, 201 | params['aggregator_hetero'], 202 | params['embedding_layer'], 203 | ) 204 | if cuda: 205 | model = model.to(device) 206 | 207 | hp_sentence = params 208 | hp_sentence.update(vars(fixed_params)) 209 | hp_sentence.update( 210 | { 211 | 'cuda': cuda, 212 | } 213 | ) 214 | hp_sentence = f'{str(hp_sentence)[1: -1]} \n' 215 | 216 | save_txt(f'\n \n START - Hyperparameters \n{hp_sentence}', data_paths.result_filepath, "a") 217 | 218 | start_time = time.time() 219 | 220 | # Train model 221 | trained_model, viz, best_metrics = train_model( 222 | model, 223 | fixed_params.num_epochs, 224 | num_batches_train, 225 | num_batches_val_loss, 226 | edgeloader_train, 227 | edgeloader_valid, 228 | max_margin_loss, 229 | params['delta'], 230 | params['neg_sample_size'], 231 | params['use_recency'], 232 | cuda, 233 | device, 234 | fixed_params.optimizer, 235 | params['lr'], 236 | get_metrics=True, 237 | train_graph=train_graph, 238 | valid_graph=valid_graph, 239 | nodeloader_valid=nodeloader_valid, 240 | nodeloader_subtrain=nodeloader_subtrain, 241 | k=fixed_params.k, 242 | out_dim=params['out_dim'], 243 | num_batches_val_metrics=num_batches_val_metrics, 244 | num_batches_subtrain=num_batches_subtrain, 245 | bought_eids=train_eids_dict[('user', 'buys', 'item')], 246 | ground_truth_subtrain=ground_truth_subtrain, 247 | ground_truth_valid=ground_truth_valid, 248 | remove_already_bought=True, 249 | result_filepath=data_paths.result_filepath, 250 | start_epoch=fixed_params.start_epoch, 251 | patience=fixed_params.patience, 252 | pred=params['pred'], 253 | use_popularity=params['use_popularity'], 254 | weight_popularity=params['weight_popularity'], 255 | remove_false_negative=fixed_params.remove_false_negative, 256 | embedding_layer=params['embedding_layer'], 257 | ) 258 | elapsed = 
time.time() - start_time 259 | result_to_save = f'\n {timedelta(seconds=elapsed)} \n END' 260 | save_txt(result_to_save, data_paths.result_filepath, mode='a') 261 | 262 | if visualization: 263 | plot_train_loss(hp_sentence, viz) 264 | 265 | # Report performance on validation set 266 | sentence = ("BEST VALIDATION Precision " 267 | "{:.3f}% | Recall {:.3f}% | Coverage {:.2f}%" 268 | .format(best_metrics['precision'] * 100, 269 | best_metrics['recall'] * 100, 270 | best_metrics['coverage'] * 100)) 271 | 272 | log.info(sentence) 273 | save_txt(sentence, data_paths.result_filepath, mode='a') 274 | 275 | # Report performance on test set 276 | log.debug('Test metrics start ...') 277 | trained_model.eval() 278 | with torch.no_grad(): 279 | embeddings = get_embeddings(valid_graph, 280 | params['out_dim'], 281 | trained_model, 282 | nodeloader_test, 283 | num_batches_test, 284 | cuda, 285 | device, 286 | params['embedding_layer'], 287 | ) 288 | 289 | for ground_truth in [data.ground_truth_purchase_test, data.ground_truth_test]: 290 | precision, recall, coverage = get_metrics_at_k( 291 | embeddings, 292 | valid_graph, 293 | trained_model, 294 | params['out_dim'], 295 | ground_truth, 296 | all_eids_dict[('user', 'buys', 'item')], 297 | fixed_params.k, 298 | True, # Remove already bought 299 | cuda, 300 | device, 301 | fixed_params.pred, 302 | params['use_popularity'], 303 | params['weight_popularity'], 304 | ) 305 | 306 | sentence = ("TEST Precision " 307 | "{:.3f}% | Recall {:.3f}% | Coverage {:.2f}%" 308 | .format(precision * 100, 309 | recall * 100, 310 | coverage * 100)) 311 | log.info(sentence) 312 | save_txt(sentence, data_paths.result_filepath, mode='a') 313 | 314 | if check_embedding: 315 | trained_model.eval() 316 | with torch.no_grad(): 317 | log.debug('ANALYSIS OF RECOMMENDATIONS') 318 | if 'sport' in train_graph.ntypes: 319 | result_sport = explore_sports(embeddings, 320 | data.sport_feat_df, 321 | data.spt_id, 322 | fixed_params.num_choices) 323 | 324 | save_txt(result_sport, data_paths.result_filepath, mode='a') 325 | 326 | already_bought_dict = create_already_bought(valid_graph, 327 | all_eids_dict[('user', 'buys', 'item')], 328 | ) 329 | already_clicked_dict = None 330 | if fixed_params.discern_clicks: 331 | already_clicked_dict = create_already_bought(valid_graph, 332 | all_eids_dict[('user', 'clicks', 'item')], 333 | etype='clicks', 334 | ) 335 | 336 | users, items = data.ground_truth_test 337 | ground_truth_dict = create_ground_truth(users, items) 338 | user_ids = np.unique(users).tolist() 339 | recs = get_recs(valid_graph, 340 | embeddings, 341 | trained_model, 342 | params['out_dim'], 343 | fixed_params.k, 344 | user_ids, 345 | already_bought_dict, 346 | remove_already_bought=True, 347 | pred=fixed_params.pred, 348 | use_popularity=params['use_popularity'], 349 | weight_popularity=params['weight_popularity']) 350 | 351 | users, items = data.ground_truth_purchase_test 352 | ground_truth_purchase_dict = create_ground_truth(users, items) 353 | explore_recs(recs, 354 | already_bought_dict, 355 | already_clicked_dict, 356 | ground_truth_dict, 357 | ground_truth_purchase_dict, 358 | data.item_feat_df, 359 | fixed_params.num_choices, 360 | data.pdt_id, 361 | fixed_params.item_id_type, 362 | data_paths.result_filepath) 363 | 364 | if fixed_params.item_id_type == 'SPECIFIC ITEM_IDENTIFIER': 365 | coverage_metrics = check_coverage(data.user_item_train, 366 | data.item_feat_df, 367 | data.pdt_id, 368 | recs) 369 | 370 | sentence = ( 371 | "COVERAGE \n|| All transactions : " 372 | "Generic 
{:.1f}% | Junior {:.1f}% | Male {:.1f}% | Female {:.1f}% | Eco {:.1f}% " 373 | "\n|| Recommendations : " 374 | "Generic {:.1f}% | Junior {:.1f}% | Male {:.1f}% | Female {:.1f} | Eco {:.1f}%%" 375 | .format( 376 | coverage_metrics['generic_mean_whole'] * 100, 377 | coverage_metrics['junior_mean_whole'] * 100, 378 | coverage_metrics['male_mean_whole'] * 100, 379 | coverage_metrics['female_mean_whole'] * 100, 380 | coverage_metrics['eco_mean_whole'] * 100, 381 | coverage_metrics['generic_mean_recs'] * 100, 382 | coverage_metrics['junior_mean_recs'] * 100, 383 | coverage_metrics['male_mean_recs'] * 100, 384 | coverage_metrics['female_mean_recs'] * 100, 385 | coverage_metrics['eco_mean_recs'] * 100, 386 | ) 387 | ) 388 | log.info(sentence) 389 | save_txt(sentence, data_paths.result_filepath, mode='a') 390 | 391 | save_outputs( 392 | { 393 | 'embeddings': embeddings, 394 | 'already_bought': already_bought_dict, 395 | 'already_clicked': already_bought_dict, 396 | 'ground_truth': ground_truth_dict, 397 | 'recs': recs, 398 | }, 399 | 'outputs/' 400 | ) 401 | 402 | del params['remove'] 403 | # Save model if the recall is greater than 8% 404 | if (recall > 0.08) & (fixed_params.item_id_type == 'SPECIFIC ITEM_IDENTIFIER') \ 405 | or (recall > 0.2) & (fixed_params.item_id_type == 'GENERAL ITEM_IDENTIFIER'): 406 | date = str(datetime.datetime.now())[:-10].replace(' ', '') 407 | torch.save(trained_model.state_dict(), f'models/HP_Recall_{recall * 100:.2f}_{date}.pth') 408 | # Save all necessary params 409 | save_outputs( 410 | { 411 | f'{date}_params': params, 412 | f'{date}_fixed_params': vars(fixed_params), 413 | }, 414 | 'models/' 415 | ) 416 | 417 | # Inference on different users 418 | if fixed_params.run_inference > 0: 419 | with torch.no_grad(): 420 | print('On normal params') 421 | inference_recall = inference_hp.inference_fn(trained_model, 422 | remove=fixed_params.remove_on_inference, 423 | fixed_params=fixed_params, 424 | overwrite_fixed_params=False, 425 | **params) 426 | if fixed_params.run_inference > 1: 427 | print('For all users') 428 | del params['days_of_purchases'], params['days_of_clicks'], params['lifespan_of_items'] 429 | all_users_inference_recall = inference_hp.inference_fn(trained_model, 430 | remove=fixed_params.remove_on_inference, 431 | fixed_params=fixed_params, 432 | overwrite_fixed_params=True, 433 | days_of_purchases=710, 434 | days_of_clicks=710, 435 | lifespan_of_items=710, 436 | **params) 437 | 438 | recap = f"BEST RECALL on 1) Validation set : {best_metrics['recall'] * 100:.2f}%" \ 439 | f'\n2) Test set : {recall * 100:.2f}%' 440 | if fixed_params.run_inference == 1: 441 | recap += f'\n3) On random users of {fixed_params.remove_on_inference} removed : {inference_recall * 100:.2f}' 442 | recap += f"\nLoop took {timedelta(seconds=elapsed)} for {len(viz['train_loss_list'])} epochs, an average of " \ 443 | f"{timedelta(seconds=elapsed / len(viz['train_loss_list']))} per epoch" 444 | print(recap) 445 | save_txt(recap, data_paths.result_filepath, mode='a') 446 | 447 | return recall # This is the 'test set' recall, on both purchases & clicks 448 | 449 | 450 | class SearchableHyperparameters: 451 | """ 452 | All hyperparameters to optimize. 453 | 454 | Attributes 455 | ---------- 456 | Aggregator_hetero : 457 | How to aggregate messages from different types of edge relations. Choices : 'sum', 'max', 458 | 'min', 'mean', 'stack'. More info here 459 | https://docs.dgl.ai/_modules/dgl/nn/pytorch/hetero.html 460 | Aggregator_type : 461 | How to aggregate neighborhood messages. 
Choices : 'mean', 'pool' for max pooling or 'lstm' 462 | Clicks_sample : 463 | Proportion of all clicks edges that should be used for training. Only relevant if 464 | fixed_params.train_on_clicks == True 465 | Days_popularity : 466 | Number of days considered in Use_popularity 467 | Dropout : 468 | Dropout used on nodes features (at all layers of the GNN) 469 | Embedding_layer : 470 | Create an explicit embedding layer that projects user & item features into and embedding 471 | of hidden_size dimension. If false, the embedding is done in the first layer of the GNN 472 | model. 473 | Purchases_sample : 474 | Proportion of all purchase (i.e. 'buys') edges that should be used for training. If 475 | fixed_params.discern_clicks == False, then 'clicks' edges are considered as 'purchases' 476 | Norm : 477 | Perform normalization after message aggregation 478 | Use_popularity : 479 | When computing ratings, add a score for items that were recent in the last X days 480 | Use_recency : 481 | When computing the loss, give more weights to more recent transactions 482 | Weight_popularity : 483 | Weight of the popularity score 484 | """ 485 | def __init__(self): 486 | self.aggregator_hetero = Categorical(categories=['mean', 'sum', 'max'], name='aggregator_hetero') 487 | self.aggregator_type = Categorical(categories=['mean', 'mean_nn', 'pool_nn'], name='aggregator_type') # LSTM? 488 | self.clicks_sample = Categorical(categories=[.2, .3, .4], name='clicks_sample') 489 | self.delta = Real(low=0.15, high=0.35, prior='log-uniform', 490 | name='delta') 491 | self.dropout = Real(low=0., high=0.8, prior='uniform', 492 | name='dropout') 493 | self.embed_dim = Categorical(categories=['Very Small', 'Small', 'Medium', 'Large', 'Very Large'], 494 | name='embed_dim') 495 | self.embedding_layer = Categorical(categories=[True, False], name='embedding_layer') 496 | self.lr = Real(low=1e-4, high=1e-2, prior='log-uniform', name='lr') 497 | self.n_layers = Integer(low=3, high=5, name='n_layers') 498 | self.neg_sample_size = Integer(low=700, high=3000, 499 | name='neg_sample_size') 500 | self.norm = Categorical(categories=[True, False], name='norm') 501 | self.popularity_importance = Categorical(categories=['No', 'Small', 'Medium', 'Large'], 502 | name='popularity_importance') 503 | self.purchases_sample = Categorical(categories=[.4, .5, .6], name='purchases_sample') 504 | self.use_recency = Categorical(categories=[True, False], name='use_recency') 505 | 506 | # List all the attributes in a list. 507 | # This is equivalent to [self.hidden_dim_HP, self.out_dim_HP ...] 508 | self.dimensions = [self.__getattribute__(attr) 509 | for attr in dir(self) if '__' not in attr] 510 | self.default_parameters = ['sum', 'mean_nn', .3, 0.266, .5, 'Medium', False, 511 | 0.00565, 3, 2500, True, 'No', .5, True] 512 | 513 | 514 | searchable_params = SearchableHyperparameters() 515 | fitness_params = None 516 | 517 | @use_named_args(dimensions=searchable_params.dimensions) 518 | def fitness(**params): 519 | """ 520 | Function used by skopt to find the best hyperparameter combination. 521 | 522 | The function calls the train function defined earlier, with all needed parameters. The recall that is returned 523 | is then multiplied by -1, since skopt is minimizing metrics. 
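    For example, a hyperparameter combination reaching a test recall of 0.10 makes fitness return -0.10,
    so better combinations correspond to lower objective values for gp_minimize.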
524 | """ 525 | recall = train(**{**fitness_params, **params}) 526 | return -recall 527 | 528 | 529 | @click.command() 530 | @click.option('--from_beginning', count=True, 531 | help='Continue with last trained model or not') 532 | @click.option('-v', '--verbose', count=True, help='Verbosity') 533 | @click.option('-viz', '--visualization', count=True, help='Visualize result') 534 | @click.option('--check_embedding', count=True, help='Explore embedding result') 535 | @click.option('--remove', default=.95, help='Data remove percentage') 536 | @click.option('--num_epochs', default=10, help='Number of epochs') 537 | @click.option('--start_epoch', default=0, help='Start epoch') 538 | @click.option('--patience', default=3, help='Patience for early stopping') 539 | @click.option('--edge_batch_size', default=2048, help='Number of edges in a train / validation batch') 540 | @click.option('--item_id_type', default='SPECIFIC ITEM IDENTIFIER', 541 | help='Identifier for the item. This code allows 2 types: SPECIFIC (e.g. item SKU' 542 | 'or GENERAL (e.g. item family)') 543 | @click.option('--duplicates', default='keep_all', 544 | help='How to handle duplicates. Choices: keep_all, keep_last, count_occurrence') 545 | def main(from_beginning, verbose, visualization, check_embedding, 546 | remove, num_epochs, start_epoch, patience, edge_batch_size, 547 | item_id_type, duplicates): 548 | """ 549 | Main function that loads data and parameters, then runs hyperparameter loop with the fitness function. 550 | 551 | """ 552 | if verbose: 553 | log.setLevel(logging.DEBUG) 554 | else: 555 | log.setLevel(logging.INFO) 556 | 557 | data_paths = DataPaths() 558 | fixed_params = FixedParameters(num_epochs, start_epoch, patience, edge_batch_size, 559 | remove, item_id_type, duplicates) 560 | 561 | checkpoint_saver = CheckpointSaver( 562 | f'checkpoint{str(datetime.datetime.now())[:-10]}.pkl', 563 | compress=9 564 | ) 565 | 566 | data = DataLoader(data_paths, fixed_params) 567 | 568 | global fitness_params 569 | fitness_params = { 570 | 'data': data, 571 | 'fixed_params': fixed_params, 572 | 'data_paths': data_paths, 573 | 'visualization': visualization, 574 | 'check_embedding': check_embedding, 575 | } 576 | if from_beginning: 577 | search_result = gp_minimize( 578 | func=fitness, 579 | dimensions=searchable_params.dimensions, 580 | n_calls=200, 581 | acq_func='EI', 582 | x0=searchable_params.default_parameters, 583 | callback=[checkpoint_saver], 584 | random_state=46, 585 | ) 586 | 587 | if not from_beginning: 588 | checkpoint_path = None 589 | if checkpoint_path is None: 590 | checkpoint_path = get_last_checkpoint() 591 | res = load(checkpoint_path) 592 | 593 | x0 = res.x_iters 594 | y0 = res.func_vals 595 | 596 | search_result = gp_minimize( 597 | func=fitness, 598 | dimensions=searchable_params.dimensions, 599 | n_calls=200, 600 | n_initial_points=-len(x0), # Workaround suggested to correct the error when resuming training 601 | acq_func='EI', 602 | x0=x0, 603 | y0=y0, 604 | callback=[checkpoint_saver], 605 | random_state=46 606 | ) 607 | log.info(search_result) 608 | 609 | 610 | if __name__ == '__main__': 611 | main() 612 | -------------------------------------------------------------------------------- /main_inference.py: -------------------------------------------------------------------------------- 1 | import math 2 | 3 | import click 4 | import dgl 5 | import numpy as np 6 | import torch 7 | 8 | from src.builder import create_graph 9 | from src.model import ConvModel 10 | from src.utils_data import DataPaths, 
DataLoader, FixedParameters, assign_graph_features 11 | from src.utils_inference import read_graph, fetch_uids, postprocess_recs 12 | from src.train.run import get_embeddings 13 | from src.metrics import get_recs, create_already_bought 14 | from src.utils import read_data 15 | 16 | cuda = torch.cuda.is_available() 17 | device = torch.device('cuda') if cuda else torch.device('cpu') 18 | num_workers = 4 if cuda else 0 19 | 20 | def inference_ondemand(user_ids, # List or 'all' 21 | use_saved_graph: bool, 22 | trained_model_path: str, 23 | use_saved_already_bought: bool, 24 | graph_path=None, 25 | ctm_id_path=None, 26 | pdt_id_path=None, 27 | already_bought_path=None, 28 | k=10, 29 | remove=.99, 30 | **params, 31 | ): 32 | """ 33 | Given a fully trained model, return recommendations specific to each user. 34 | 35 | Files needed to run 36 | ------------------- 37 | Params used when training the model: 38 | Those params will indicate how to run inference on the model. Usually, they are outputted during training 39 | (and hyperparametrization). 40 | If using a saved already bought dict: 41 | The already bought dict: the dict includes all previous purchases of all user ids for which recommendations 42 | were requested. If not using a saved dict, it will be created using the graph. 43 | Using a saved already bought dict is not necessary, but might make the inference 44 | process faster. 45 | A) If using a saved graph: 46 | The saved graph: the graph that must include all user ids for which recommendations were requested. Usually, 47 | it is outputted during training. It could also be created by another independent function. 48 | ID mapping: ctm_id and pdt_id mapping that allows to associate real-world information, e.g. item and customer 49 | identifier, to actual nodes in the graph. They are usually saved when generating a graph. 50 | B) If not using a saved graph: 51 | The graph will be generated on demand, using all the files in DataPaths of src.utils_data. All those files will 52 | be needed. 53 | 54 | Parameters 55 | ---------- 56 | See click options below for details. 57 | 58 | Returns 59 | ------- 60 | Recommendations for all user ids. 
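    Example
    -------
    A typical call through the CLI wrapper below, taken from the README (paths are placeholders):

        python main_inference.py --params_path test/final_params_example.pkl \
            --user_ids 123456 --user_ids 654321 --user_ids 999 \
            --trained_model_path test/final_model_trained_example.pth --k 10 --remove .99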
61 | 62 | """ 63 | # Load & preprocess data 64 | ## Graph 65 | if use_saved_graph: 66 | graph = read_graph(graph_path) 67 | ctm_id_df = read_data(ctm_id_path) 68 | pdt_id_df = read_data(pdt_id_path) 69 | else: 70 | # Create graph 71 | data_paths = DataPaths() 72 | fixed_params = FixedParameters(num_epochs=0, start_epoch=0, # Not used (only used in training) 73 | patience=0, edge_batch_size=0, # Not used (only used in training) 74 | remove=remove, item_id_type=params['item_id_type'], 75 | duplicates=params['duplicates']) 76 | data = DataLoader(data_paths, fixed_params) 77 | ctm_id_df = data.ctm_id 78 | pdt_id_df = data.pdt_id 79 | 80 | graph = create_graph( 81 | data.graph_schema, 82 | ) 83 | graph = assign_graph_features(graph, 84 | fixed_params, 85 | data, 86 | **params, 87 | ) 88 | ## Preprocess: fetch right user ids 89 | if user_ids[0] == 'all': 90 | test_uids = np.arange(graph.num_nodes('user')) 91 | else: 92 | test_uids = fetch_uids(user_ids, 93 | ctm_id_df) 94 | ## Remove already bought 95 | if use_saved_already_bought: 96 | already_bought_dict = read_data(already_bought_path) 97 | else: 98 | bought_eids = graph.out_edges(u=test_uids, form='eid', etype='buys') 99 | already_bought_dict = create_already_bought(graph, bought_eids) 100 | 101 | # Load model 102 | dim_dict = {'user': graph.nodes['user'].data['features'].shape[1], 103 | 'item': graph.nodes['item'].data['features'].shape[1], 104 | 'out': params['out_dim'], 105 | 'hidden': params['hidden_dim']} 106 | if 'sport' in graph.ntypes: 107 | dim_dict['sport'] = graph.nodes['sport'].data['features'].shape[1] 108 | trained_model = ConvModel( 109 | graph, 110 | params['n_layers'], 111 | dim_dict, 112 | params['norm'], 113 | params['dropout'], 114 | params['aggregator_type'], 115 | params['pred'], 116 | params['aggregator_hetero'], 117 | params['embedding_layer'], 118 | ) 119 | trained_model.load_state_dict(torch.load(trained_model_path, map_location=device)) 120 | if cuda: 121 | trained_model = trained_model.to(device) 122 | 123 | # Create dataloader 124 | all_iids = np.arange(graph.num_nodes('item')) 125 | test_node_ids = {'user': test_uids, 'item': all_iids} 126 | n_layers = params['n_layers'] 127 | if params['embedding_layer']: 128 | n_layers = n_layers - 1 129 | sampler = dgl.dataloading.MultiLayerFullNeighborSampler(n_layers) 130 | nodeloader_test = dgl.dataloading.NodeDataLoader( 131 | graph, 132 | test_node_ids, 133 | sampler, 134 | batch_size=128, 135 | shuffle=True, 136 | drop_last=False, 137 | num_workers=num_workers 138 | ) 139 | num_batches_test = math.ceil((len(test_uids) + len(all_iids)) / 128) 140 | 141 | # Fetch recs 142 | trained_model.eval() 143 | with torch.no_grad(): 144 | embeddings = get_embeddings(graph, 145 | params['out_dim'], 146 | trained_model, 147 | nodeloader_test, 148 | num_batches_test, 149 | cuda, 150 | device, 151 | params['embedding_layer'], 152 | ) 153 | recs = get_recs(graph, 154 | embeddings, 155 | trained_model, 156 | params['out_dim'], 157 | k, 158 | test_uids, 159 | already_bought_dict, 160 | remove_already_bought=True, 161 | cuda=cuda, 162 | device=device, 163 | pred=params['pred'], 164 | use_popularity=params['use_popularity'], 165 | weight_popularity=params['weight_popularity'] 166 | ) 167 | 168 | # Postprocess: user & item ids 169 | processed_recs = postprocess_recs(recs, 170 | pdt_id_df, 171 | ctm_id_df, 172 | params['item_id_type'], 173 | params['ctm_id_type']) 174 | print(processed_recs) 175 | return processed_recs 176 | 177 | 178 | 179 | @click.command() 180 | 
@click.option('--params_path', default='params.pkl', 181 | help='Path where the optimal hyperparameters found in the hyperparametrization were saved.') 182 | @click.option('--user_ids', multiple=True, default=['all'], 183 | help="IDs of users for which to generate recommendations. Either list of user ids, or 'all'.") 184 | @click.option('--use_saved_graph', count=True, 185 | help='If true, will use graph that was saved on disk. Need to import ID mapping for users & items.') 186 | @click.option('--trained_model_path', default='model.pth', 187 | help='Path where fully trained model is saved.') 188 | @click.option('--use_saved_already_bought', count=True, 189 | help='If true, will use already bought dict that was saved on disk.') 190 | @click.option('--graph_path', default='graph.bin', 191 | help='Path where the graph was saved. Mandatory if use_saved_graph is True.') 192 | @click.option('--ctm_id_path', default='ctm_id.pkl', 193 | help='Path where the mapping for customer was save. Mandatory if use_saved_graph is True.') 194 | @click.option('--pdt_id_path', default='pdt_id.pkl', 195 | help='Path where the mapping for items was save. Mandatory if use_saved_graph is True.') 196 | @click.option('--already_bought_path', default='already_bought.pkl', 197 | help='Path where the already bought dict was saved. Mandatory if use_saved_already_bought is True.') 198 | @click.option('--k', default=10, 199 | help='Number of recs to generate for each user.') 200 | @click.option('--remove', default=.99, 201 | help='Percentage of users to remove from graph if used_saved_graph = True. If more than 0, user_ids might' 202 | ' not be in the graph. However, higher "remove" allows for faster inference.') 203 | def main(params_path, user_ids, use_saved_graph, trained_model_path, 204 | use_saved_already_bought, graph_path, ctm_id_path, pdt_id_path, 205 | already_bought_path, k, remove): 206 | params = read_data(params_path) 207 | params.pop('k', None) 208 | params.pop('remove', None) 209 | 210 | 211 | inference_ondemand(user_ids=user_ids, # List or 'all' 212 | use_saved_graph=use_saved_graph, 213 | trained_model_path=trained_model_path, 214 | use_saved_already_bought=use_saved_already_bought, 215 | graph_path=graph_path, 216 | ctm_id_path=ctm_id_path, 217 | pdt_id_path=pdt_id_path, 218 | already_bought_path=already_bought_path, 219 | k=k, 220 | remove=remove, 221 | **params, 222 | ) 223 | 224 | 225 | if __name__ == '__main__': 226 | main() 227 | 228 | 229 | -------------------------------------------------------------------------------- /main_train.py: -------------------------------------------------------------------------------- 1 | import math 2 | import datetime 3 | 4 | import click 5 | import numpy as np 6 | import torch 7 | from dgl.data.utils import save_graphs 8 | 9 | from src.builder import create_graph 10 | from src.utils_data import DataLoader, assign_graph_features 11 | from src.utils import read_data, save_txt, save_outputs 12 | from src.model import ConvModel, max_margin_loss 13 | from src.sampling import train_valid_split, generate_dataloaders 14 | from src.train.run import train_model, get_embeddings 15 | from src.utils_vizualization import plot_train_loss 16 | from src.metrics import (create_already_bought, create_ground_truth, 17 | get_metrics_at_k, get_recs) 18 | from src.evaluation import explore_recs, explore_sports, check_coverage 19 | from presplit import presplit_data 20 | 21 | from logging_config import get_logger 22 | 23 | log = get_logger(__name__) 24 | 25 | cuda = 
torch.cuda.is_available() 26 | device = torch.device('cuda') if cuda else torch.device('cpu') 27 | num_workers = 4 if cuda else 0 28 | 29 | 30 | class TrainDataPaths: 31 | def __init__(self): 32 | self.result_filepath = 'TXT FILE WHERE TO LOG THE RESULTS .txt' 33 | self.sport_feat_path = 'FEATURE DATASET, SPORTS (sport names) .csv' 34 | self.full_interaction_path = 'INTERACTION LIST, USER-ITEM (Full dataset, not splitted between train & test).csv' 35 | self.item_sport_path = 'INTERACTION LIST, ITEM-SPORT .csv' 36 | self.user_sport_path = 'INTERACTION LIST, USER-SPORT .csv' 37 | self.sport_sportg_path = 'INTERACTION LIST, SPORT-SPORT .csv' 38 | self.item_feat_path = 'FEATURE DATASET, ITEMS .csv' 39 | self.user_feat_path = 'FEATURE DATASET, USERS.csv' 40 | self.sport_onehot_path = 'FEATURE DATASET, SPORTS (one-hot vectors) .csv' 41 | 42 | 43 | def train_full_model(fixed_params_path, 44 | visualization, 45 | check_embedding, 46 | remove, 47 | edge_batch_size, 48 | **params,): 49 | """ 50 | Given the best hyperparameter combination, function to train the model on all available data. 51 | 52 | Files needed to run 53 | ------------------- 54 | All the files in the TrainDataPaths: 55 | It includes all the interactions between user, sport and items, as well as features for user, sport and items. 56 | Fixed_params and params found in hyperparametrization: 57 | Those params will indicate how to train the model. Usually, they are found when running the hyperparametrization 58 | loop. 59 | 60 | Parameters 61 | ---------- 62 | See click options below for details. 63 | 64 | 65 | Saves to files 66 | -------------- 67 | trained_model with its fixed parameters and hyperparameters: 68 | The trained model with all parameters are saved to the folder 'models'. 69 | graph and ID mapping: 70 | When doing inference, it might be useful to import an already built graph (and the mapping that allows to 71 | associate node ID with personal information such as CUSTOMER IDENTIFIER or ITEM IDENTIFIER). Thus, the graph and ID mapping are saved to 72 | folder 'models'. 
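    Example
    -------
    A typical call through the CLI wrapper below, reusing the parameter files produced by the
    hyperparametrization (paths taken from the README as placeholders):

        python main_train.py --fixed_params_path test/fixed_params_example.pkl \
            --params_path test/params_example.pkl --visualization --check_embedding \
            --remove .85 --edge_batch_size 512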
73 | """ 74 | # Load parameters 75 | fixed_params = read_data(fixed_params_path) 76 | class objectview(object): 77 | def __init__(self, d): 78 | self.__dict__ = d 79 | fixed_params = objectview(fixed_params) 80 | fixed_params.remove = remove 81 | fixed_params.subtrain_size = 0.01 82 | fixed_params.valid_size = 0.01 83 | fixed_params.edge_batch_size = edge_batch_size 84 | 85 | # Create full train set 86 | train_data_paths = TrainDataPaths() 87 | presplit_item_feat = read_data(train_data_paths.item_feat_path) 88 | full_interaction_data = read_data(train_data_paths.full_interaction_path) 89 | train_df, test_df = presplit_data(presplit_item_feat, 90 | full_interaction_data, 91 | num_min=3, 92 | remove_unk=True, 93 | sort=True, 94 | test_size_days=1, 95 | item_id_type='ITEM IDENTIFIER', 96 | ctm_id_type='CUSTOMER IDENTIFIER', ) 97 | train_data_paths.train_path = train_df 98 | train_data_paths.test_path = test_df 99 | data = DataLoader(train_data_paths, fixed_params) 100 | 101 | # Initialize graph & features 102 | valid_graph = create_graph( 103 | data.graph_schema, 104 | ) 105 | valid_graph = assign_graph_features(valid_graph, 106 | fixed_params, 107 | data, 108 | **params, 109 | ) 110 | 111 | dim_dict = {'user': valid_graph.nodes['user'].data['features'].shape[1], 112 | 'item': valid_graph.nodes['item'].data['features'].shape[1], 113 | 'out': params['out_dim'], 114 | 'hidden': params['hidden_dim']} 115 | 116 | all_sids = None 117 | if 'sport' in valid_graph.ntypes: 118 | dim_dict['sport'] = valid_graph.nodes['sport'].data['features'].shape[1] 119 | all_sids = np.arange(valid_graph.num_nodes('sport')) 120 | 121 | # Initialize model 122 | model = ConvModel(valid_graph, 123 | params['n_layers'], 124 | dim_dict, 125 | params['norm'], 126 | params['dropout'], 127 | params['aggregator_type'], 128 | params['pred'], 129 | params['aggregator_hetero'], 130 | params['embedding_layer'], 131 | ) 132 | if cuda: 133 | model = model.to(device) 134 | 135 | # Initialize dataloaders 136 | # get training and test ids 137 | ( 138 | train_graph, 139 | train_eids_dict, 140 | valid_eids_dict, 141 | subtrain_uids, 142 | valid_uids, 143 | test_uids, 144 | all_iids, 145 | ground_truth_subtrain, 146 | ground_truth_valid, 147 | all_eids_dict 148 | ) = train_valid_split( 149 | valid_graph, 150 | data.ground_truth_test, 151 | fixed_params.etype, 152 | fixed_params.subtrain_size, 153 | fixed_params.valid_size, 154 | fixed_params.reverse_etype, 155 | fixed_params.train_on_clicks, 156 | fixed_params.remove_train_eids, 157 | params['clicks_sample'], 158 | params['purchases_sample'], 159 | ) 160 | 161 | ( 162 | edgeloader_train, 163 | edgeloader_valid, 164 | nodeloader_subtrain, 165 | nodeloader_valid, 166 | nodeloader_test 167 | ) = generate_dataloaders(valid_graph, 168 | train_graph, 169 | train_eids_dict, 170 | valid_eids_dict, 171 | subtrain_uids, 172 | valid_uids, 173 | test_uids, 174 | all_iids, 175 | fixed_params, 176 | num_workers, 177 | all_sids, 178 | embedding_layer=params['embedding_layer'], 179 | n_layers=params['n_layers'], 180 | neg_sample_size=params['neg_sample_size'], 181 | ) 182 | 183 | train_eids_len = 0 184 | valid_eids_len = 0 185 | for etype in train_eids_dict.keys(): 186 | train_eids_len += len(train_eids_dict[etype]) 187 | valid_eids_len += len(valid_eids_dict[etype]) 188 | num_batches_train = math.ceil(train_eids_len / fixed_params.edge_batch_size) 189 | num_batches_subtrain = math.ceil( 190 | (len(subtrain_uids) + len(all_iids)) / fixed_params.node_batch_size 191 | ) 192 | num_batches_val_loss = 
math.ceil(valid_eids_len / fixed_params.edge_batch_size) 193 | num_batches_val_metrics = math.ceil( 194 | (len(valid_uids) + len(all_iids)) / fixed_params.node_batch_size 195 | ) 196 | num_batches_test = math.ceil( 197 | (len(test_uids) + len(all_iids)) / fixed_params.node_batch_size 198 | ) 199 | 200 | # Run model 201 | hp_sentence = params 202 | hp_sentence.update(vars(fixed_params)) 203 | hp_sentence = f'{str(hp_sentence)[1: -1]} \n' 204 | save_txt(f'\n \n START - Hyperparameters \n{hp_sentence}', train_data_paths.result_filepath, "a") 205 | trained_model, viz, best_metrics = train_model( 206 | model, 207 | fixed_params.num_epochs, 208 | num_batches_train, 209 | num_batches_val_loss, 210 | edgeloader_train, 211 | edgeloader_valid, 212 | max_margin_loss, 213 | params['delta'], 214 | params['neg_sample_size'], 215 | params['use_recency'], 216 | cuda, 217 | device, 218 | fixed_params.optimizer, 219 | params['lr'], 220 | get_metrics=True, 221 | train_graph=train_graph, 222 | valid_graph=valid_graph, 223 | nodeloader_valid=nodeloader_valid, 224 | nodeloader_subtrain=nodeloader_subtrain, 225 | k=fixed_params.k, 226 | out_dim=params['out_dim'], 227 | num_batches_val_metrics=num_batches_val_metrics, 228 | num_batches_subtrain=num_batches_subtrain, 229 | bought_eids=train_eids_dict[('user', 'buys', 'item')], 230 | ground_truth_subtrain=ground_truth_subtrain, 231 | ground_truth_valid=ground_truth_valid, 232 | remove_already_bought=True, 233 | result_filepath=train_data_paths.result_filepath, 234 | start_epoch=fixed_params.start_epoch, 235 | patience=fixed_params.patience, 236 | pred=params['pred'], 237 | use_popularity=params['use_popularity'], 238 | weight_popularity=params['weight_popularity'], 239 | remove_false_negative=fixed_params.remove_false_negative, 240 | embedding_layer=params['embedding_layer'], 241 | ) 242 | 243 | # Get viz & metrics 244 | if visualization: 245 | plot_train_loss(hp_sentence, viz) 246 | 247 | # Report performance on validation set 248 | sentence = ("BEST VALIDATION Precision " 249 | "{:.3f}% | Recall {:.3f}% | Coverage {:.2f}%" 250 | .format(best_metrics['precision'] * 100, 251 | best_metrics['recall'] * 100, 252 | best_metrics['coverage'] * 100)) 253 | 254 | log.info(sentence) 255 | save_txt(sentence, train_data_paths.result_filepath, mode='a') 256 | 257 | # Report performance on test set 258 | log.debug('Test metrics start ...') 259 | trained_model.eval() 260 | with torch.no_grad(): 261 | embeddings = get_embeddings(valid_graph, 262 | params['out_dim'], 263 | trained_model, 264 | nodeloader_test, 265 | num_batches_test, 266 | cuda, 267 | device, 268 | params['embedding_layer'], 269 | ) 270 | 271 | for ground_truth in [data.ground_truth_purchase_test, data.ground_truth_test]: 272 | precision, recall, coverage = get_metrics_at_k( 273 | embeddings, 274 | valid_graph, 275 | trained_model, 276 | params['out_dim'], 277 | ground_truth, 278 | all_eids_dict[('user', 'buys', 'item')], 279 | fixed_params.k, 280 | True, # Remove already bought 281 | cuda, 282 | device, 283 | params['pred'], 284 | params['use_popularity'], 285 | params['weight_popularity'], 286 | ) 287 | 288 | sentence = ("TEST Precision " 289 | "{:.3f}% | Recall {:.3f}% | Coverage {:.2f}%" 290 | .format(precision * 100, 291 | recall * 100, 292 | coverage * 100)) 293 | log.info(sentence) 294 | save_txt(sentence, train_data_paths.result_filepath, mode='a') 295 | 296 | if check_embedding: 297 | trained_model.eval() 298 | with torch.no_grad(): 299 | log.debug('ANALYSIS OF RECOMMENDATIONS') 300 | if 'sport' in 
train_graph.ntypes: 301 | result_sport = explore_sports(embeddings, 302 | data.sport_feat_df, 303 | data.spt_id, 304 | fixed_params.num_choices) 305 | 306 | save_txt(result_sport, train_data_paths.result_filepath, mode='a') 307 | 308 | already_bought_dict = create_already_bought(valid_graph, 309 | all_eids_dict[('user', 'buys', 'item')], 310 | ) 311 | already_clicked_dict = None 312 | if fixed_params.discern_clicks: 313 | already_clicked_dict = create_already_bought(valid_graph, 314 | all_eids_dict[('user', 'clicks', 'item')], 315 | etype='clicks', 316 | ) 317 | 318 | users, items = data.ground_truth_test 319 | ground_truth_dict = create_ground_truth(users, items) 320 | user_ids = np.unique(users).tolist() 321 | recs = get_recs(valid_graph, 322 | embeddings, 323 | trained_model, 324 | params['out_dim'], 325 | fixed_params.k, 326 | user_ids, 327 | already_bought_dict, 328 | remove_already_bought=True, 329 | pred=params['pred'], 330 | use_popularity=params['use_popularity'], 331 | weight_popularity=params['weight_popularity']) 332 | 333 | users, items = data.ground_truth_purchase_test 334 | ground_truth_purchase_dict = create_ground_truth(users, items) 335 | explore_recs(recs, 336 | already_bought_dict, 337 | already_clicked_dict, 338 | ground_truth_dict, 339 | ground_truth_purchase_dict, 340 | data.item_feat_df, 341 | fixed_params.num_choices, 342 | data.pdt_id, 343 | fixed_params.item_id_type, 344 | train_data_paths.result_filepath) 345 | 346 | if fixed_params.item_id_type == 'SPECIFIC ITEM IDENTIFIER': 347 | coverage_metrics = check_coverage(data.user_item_train, 348 | data.item_feat_df, 349 | data.pdt_id, 350 | recs) 351 | 352 | sentence = ( 353 | "COVERAGE \n|| All transactions : " 354 | "Generic {:.1f}% | Junior {:.1f}% | Male {:.1f}% | Female {:.1f}% | Eco {:.1f}% " 355 | "\n|| Recommendations : " 356 | "Generic {:.1f}% | Junior {:.1f}% | Male {:.1f}% | Female {:.1f}% | Eco {:.1f}%" 357 | .format( 358 | coverage_metrics['generic_mean_whole'] * 100, 359 | coverage_metrics['junior_mean_whole'] * 100, 360 | coverage_metrics['male_mean_whole'] * 100, 361 | coverage_metrics['female_mean_whole'] * 100, 362 | coverage_metrics['eco_mean_whole'] * 100, 363 | coverage_metrics['generic_mean_recs'] * 100, 364 | coverage_metrics['junior_mean_recs'] * 100, 365 | coverage_metrics['male_mean_recs'] * 100, 366 | coverage_metrics['female_mean_recs'] * 100, 367 | coverage_metrics['eco_mean_recs'] * 100, 368 | ) 369 | ) 370 | log.info(sentence) 371 | save_txt(sentence, train_data_paths.result_filepath, mode='a') 372 | 373 | save_outputs( 374 | { 375 | 'embeddings': embeddings, 376 | 'already_bought': already_bought_dict, 377 | 'already_clicked': already_clicked_dict, 378 | 'ground_truth': ground_truth_dict, 379 | 'recs': recs, 380 | }, 381 | 'outputs/' 382 | ) 383 | 384 | # Save model 385 | date = str(datetime.datetime.now())[:-10].replace(' ', '') 386 | torch.save(trained_model.state_dict(), f'models/FULL_Recall_{recall * 100:.2f}_{date}.pth') 387 | # Save all necessary params 388 | save_outputs( 389 | { 390 | f'{date}_params': params, 391 | f'{date}_fixed_params': vars(fixed_params), 392 | }, 393 | 'models/' 394 | ) 395 | print("Saved model & parameters to disk.") 396 | 397 | # Save graph & ID mapping 398 | save_graphs(f'models/{date}_graph.bin', [valid_graph]) 399 | save_outputs( 400 | { 401 | f'{date}_ctm_id': data.ctm_id, 402 | f'{date}_pdt_id': data.pdt_id, 403 | }, 404 | 'models/' 405 | ) 406 | print("Saved graph & ID mapping to disk.") 407 | 408 | 409 | @click.command() 410 | 
@click.option('--fixed_params_path', default='fixed_params.pkl', 411 | help='Path where the fixed parameters used in the hyperparametrization were saved.') 412 | @click.option('--params_path', default='params.pkl', 413 | help='Path where the optimal hyperparameters found in the hyperparametrization were saved.') 414 | @click.option('-viz', '--visualization', count=True, help='Visualize result') 415 | @click.option('--check_embedding', count=True, help='Explore embedding result') 416 | @click.option('--remove', default=.99, help='Percentage of users to remove from train set. Ideally,' 417 | ' remove would be 0. However, higher "remove" accelerates training.') 418 | @click.option('--edge_batch_size', default=2048, help='Number of edges in a train / validation batch') 419 | def main(fixed_params_path, params_path, visualization, check_embedding, remove, edge_batch_size): 420 | params = read_data(params_path) 421 | params.pop('remove', None) 422 | params.pop('edge_batch_size', None) 423 | train_full_model(fixed_params_path=fixed_params_path, 424 | visualization=visualization, 425 | check_embedding=check_embedding, 426 | remove=remove, 427 | edge_batch_size=edge_batch_size, 428 | **params) 429 | 430 | if __name__ == '__main__': 431 | main() 432 | -------------------------------------------------------------------------------- /models/.keepdir: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /outputs/.keepdir: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /plots/.keepdir: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /presplit.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime, timedelta 2 | 3 | import numpy as np 4 | 5 | from logging_config import get_logger 6 | 7 | logger = get_logger(__file__) 8 | 9 | 10 | def presplit_data(item_feature_data, 11 | user_item_interaction_data, 12 | num_min=3, 13 | remove_unk=True, 14 | sort=True, 15 | test_size_days=14, 16 | item_id_type='ITEM IDENTIFIER', 17 | ctm_id_type='CUSTOMER IDENTIFIER'): 18 | """ 19 | Split data into train and test set. 20 | 21 | Parameters 22 | ---------- 23 | num_min: 24 | Minimal number of interactions (transactions or clicks) for a customer to be included in the dataset 25 | (interactions can be both in train and test sets) 26 | remove_unk: 27 | Remove items in the interaction set that are not in the item features set, e.g. "items" that are services 28 | like skate sharpening 29 | sort: 30 | Sort the dataset by date before splitting in train/test set, thus having a test set that is succeeding 31 | the train set 32 | test_size_days: 33 | Number of days that should be in the test set. The rest will be in the training set. 34 | ctm_id_type: 35 | Unique identifier for the customers. 36 | item_id_type: 37 | Unique identifier for the items. 38 | 39 | Returns 40 | ------- 41 | train_set: 42 | Pandas dataframe of all training interactions. 43 | test_set: 44 | Pandas dataframe of all testing interactions. 
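Example
-------
Illustrative usage; the dataframe names are placeholders, and the interaction dataframe is expected to carry the 'hit_date' / 'hit_timestamp' columns used below:

    train_df, test_df = presplit_data(item_feat_df,
                                      user_item_interaction_df,
                                      num_min=3,
                                      remove_unk=True,
                                      sort=True,
                                      test_size_days=14)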
45 | """ 46 | 47 | np.random.seed(11) 48 | 49 | if num_min > 0: 50 | user_item_interaction_data = user_item_interaction_data[ 51 | user_item_interaction_data[ctm_id_type].map( 52 | user_item_interaction_data[ctm_id_type].value_counts() 53 | ) >= num_min 54 | ] 55 | 56 | if remove_unk: 57 | known_items = item_feature_data[item_id_type].unique().tolist() 58 | user_item_interaction_data = user_item_interaction_data[user_item_interaction_data[item_id_type].isin(known_items)] 59 | 60 | if sort: 61 | user_item_interaction_data.sort_values(by=['hit_timestamp'], 62 | axis=0, 63 | inplace=True) 64 | # Split into train & test sets 65 | most_recent_date = datetime.strptime(max(user_item_interaction_data.hit_date), '%Y-%m-%d') 66 | limit_date = datetime.strftime( 67 | (most_recent_date - timedelta(days=int(test_size_days))), 68 | format='%Y-%m-%d' 69 | ) 70 | train_set = user_item_interaction_data[user_item_interaction_data['hit_date'] <= limit_date] 71 | test_set = user_item_interaction_data[user_item_interaction_data['hit_date'] > limit_date] 72 | 73 | else: 74 | most_recent_date = datetime.strptime(max(user_item_interaction_data.hit_date), '%Y-%m-%d') 75 | oldest_date = datetime.strptime(min(user_item_interaction_data.hit_date), '%Y-%m-%d') 76 | total_days = timedelta(days=(most_recent_date - oldest_date)) # To be tested 77 | test_size = test_size_days / total_days 78 | test_set = user_item_interaction_data.sample(frac=test_size, random_state=200) 79 | train_set = user_item_interaction_data.drop(test_set.index) 80 | 81 | # Keep only users in train set 82 | ctm_list = train_set[ctm_id_type].unique() 83 | test_set = test_set[test_set[ctm_id_type].isin(ctm_list)] 84 | return train_set, test_set 85 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | click~=7.1.2 2 | dgl==0.5.2 3 | matplotlib~=3.3.2 4 | numpy==1.19.2 5 | pandas==1.1.2 6 | scikit-learn==0.23.2 7 | torch==1.6.0 8 | scikit-optimize==0.8.1 -------------------------------------------------------------------------------- /src/builder.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime, timedelta 2 | from typing import Tuple 3 | 4 | import dgl 5 | import numpy as np 6 | import pandas as pd 7 | import torch 8 | 9 | from src.utils import read_data 10 | 11 | 12 | def format_dfs( 13 | train_path, # str (path) or pd.Dataframe directly (df) 14 | test_path, # str (path) or pd.Dataframe directly (df) 15 | item_sport_path: str, 16 | user_sport_path: str, 17 | sport_sportg_path: str, 18 | item_feat_path: str, 19 | user_feat_path: str, 20 | sport_feat_path: str, 21 | sport_onehot_path: str, 22 | remove: float = 0., 23 | ctm_id_type: str = 'CUSTOMER IDENTIFIER', 24 | item_id_type: str = 'SPECIFIC ITEM IDENTIFIER', 25 | days_of_purchases: int = 710, 26 | days_of_clicks: int = 710, 27 | lifespan_of_items: int = 710, 28 | report_model_coverage: bool = False, 29 | ): 30 | """ 31 | Import all dfs from csv paths and preprocess interactions to sample interactions and remove old users and items. 32 | 33 | Parameters 34 | ---------- 35 | train_path, test_path: 36 | Paths of interaction files, between user and items (in the train set and the test set). To accommodate a wider 37 | range of utilisation, train_path and test_path can be directly dataframes instead of strings. 
All files with 38 | user and items must include a column named with the specified ctm_id_type or item_id_type. 39 | item_sport_path, user_sport_path, sport_sportg_path: 40 | Paths of interaction files, between item and sport, user and sport, sport and sport group. All files with user 41 | and items must include a column named with the specified ctm_id_type or item_id_type. 42 | item_feat_path, user_feat_path, sport_feat_path: 43 | Paths of feature files, for item, user and sports. Item features include textual descriptions and junior, male, 44 | female and eco indicators. User features include male and female indicator. Sport features include only name of 45 | sport. All files with user and items must include a column named with the specified ctm_id_type or item_id_type. 46 | sport_onehot_path: 47 | Path for a csv matrix containing the sport_id and a one-hot vector, unique per sport. 48 | remove: 49 | Removes a proportion of users from the dataset randomly. 50 | ctm_id_type : 51 | Identifier for the customers. 52 | item_id_type : 53 | Identifier for the items. Can be SPECIFIC ITEM IDENTIFIER (e.g. item SKU) 54 | or GENERAL ITEM IDENTIFIER (e.g. item family identifier) 55 | days_of_purchases (Days_of_clicks) : 56 | Number of days of purchases (clicks) that should be kept in the dataset. 57 | Intuition is that interactions of 12+ months ago might not be relevant. Max is 710 days 58 | Those that do not have any remaining interactions will be fed recommendations from another 59 | model. 60 | lifespan_of_items : 61 | Number of days since most recent transactions for an item to be considered by the 62 | model. Max is 710 days. Won't make a difference is it is > Days_of_interaction. 63 | report_model_coverage : bool 64 | Computes how many users are included by these parameters (and would thus receive a recommendation by this GNN 65 | model). 66 | 67 | Returns 68 | ------- 69 | user_item_train, user_item_test, user_sport_interaction, item_sport_interaction, sport_sportg_interaction: 70 | Dataframes of interactions. 71 | item_feat_df, user_feat_df, sport_feat_df, sport_onehot_df: 72 | Dataframes of features. 73 | """ 74 | np.random.seed(11) 75 | 76 | # User, item and sport features 77 | item_feat_df = read_data(item_feat_path) 78 | user_feat_df = read_data(user_feat_path) 79 | sport_feat_df = read_data(sport_feat_path) 80 | sport_onehot_df = read_data(sport_onehot_path) 81 | 82 | # User-item interaction. We allow direct df instead of path: check which was passed. 83 | if isinstance(train_path, str): 84 | user_item_train = read_data(train_path) 85 | elif isinstance(train_path, pd.DataFrame): 86 | user_item_train = train_path 87 | else: 88 | raise TypeError(f'Type of {train_path} not recognized. Should be str or pd.DataFrame') 89 | if isinstance(test_path, str): 90 | user_item_test = read_data(test_path) 91 | elif isinstance(test_path, pd.DataFrame): 92 | user_item_test = test_path 93 | else: 94 | raise TypeError(f'Type of {test_path} not recognized. 
Should be str or pd.DataFrame') 95 | 96 | if days_of_purchases < 710: 97 | most_recent_date = datetime.strptime(max(user_item_train.hit_date), '%Y-%m-%d') 98 | limit_date = datetime.strftime( 99 | (most_recent_date - timedelta(days=int(days_of_purchases))), 100 | format='%Y-%m-%d' 101 | ) 102 | user_item_train = user_item_train[(user_item_train.hit_date >= limit_date) | (user_item_train.buy == 0)] 103 | 104 | if days_of_clicks < 710: 105 | most_recent_date = datetime.strptime(max(user_item_train.hit_date), '%Y-%m-%d') 106 | limit_date = datetime.strftime( 107 | (most_recent_date - timedelta(days=int(days_of_clicks))), 108 | format='%Y-%m-%d' 109 | ) 110 | user_item_train = user_item_train[(user_item_train.hit_date >= limit_date) | (user_item_train.buy == 1)] 111 | 112 | if lifespan_of_items < days_of_purchases: 113 | most_recent_date = datetime.strptime(max(user_item_train.hit_date), '%Y-%m-%d') 114 | limit_date = datetime.strftime( 115 | (most_recent_date - timedelta(days=int(lifespan_of_items))), 116 | format='%Y-%m-%d' 117 | ) 118 | item_list = user_item_train[user_item_train.hit_date >= limit_date]['SPECIFIC ITEM IDENTIFIER'].unique() 119 | user_item_train = user_item_train[user_item_train['SPECIFIC ITEM IDENTIFIER'].isin(item_list)] 120 | 121 | if remove > 0: 122 | ctm_list = user_item_train[ctm_id_type].unique() 123 | np.random.shuffle(ctm_list) 124 | ctm_list = ctm_list[:int(len(ctm_list) * (1 - remove))] 125 | user_item_train = user_item_train[user_item_train[ctm_id_type].isin(ctm_list)] 126 | user_item_test = user_item_test[user_item_test[ctm_id_type].isin(ctm_list)] 127 | 128 | if remove == 0: 129 | # Make sure that if no observations were removed by days of clicks / purchases, no user is only in test set 130 | user_item_test = user_item_test[user_item_test[ctm_id_type].isin(user_item_train[ctm_id_type].unique())] 131 | 132 | if item_id_type == 'GENERAL ITEM IDENTIFIER': 133 | user_item_train = user_item_train.merge( 134 | item_feat_df[['SPECIFIC ITEM IDENTIFIER', 'GENERAL ITEM IDENTIFIER']].drop_duplicates(), 135 | how='left', 136 | on='SPECIFIC ITEM IDENTIFIER') 137 | user_item_test = user_item_test.merge( 138 | item_feat_df[['SPECIFIC ITEM IDENTIFIER', 'GENERAL ITEM IDENTIFIER']].drop_duplicates(), 139 | how='left', 140 | on='SPECIFIC ITEM IDENTIFIER') 141 | assert user_item_train.general_item_identifier.isna().sum() == 0 142 | assert user_item_test.general_item_identifier.isna().sum() == 0 143 | 144 | 145 | # Item-sport interaction 146 | item_sport_interaction = read_data(item_sport_path) 147 | if lifespan_of_items < days_of_purchases: 148 | item_sport_interaction = item_sport_interaction[item_sport_interaction['SPECIFIC ITEM IDENTIFIER'].isin( 149 | item_list)] 150 | if item_id_type == 'GENERAL ITEM IDENTIFIER': 151 | item_sport_interaction = item_sport_interaction.merge( 152 | item_feat_df[['SPECIFIC ITEM IDENTIFIER', 'GENERAL ITEM IDENTIFIER']], 153 | how='left', 154 | on='SPECIFIC ITEM IDENTIFIER') 155 | # Drop duplicates if not item_id_type not model number 156 | item_sport_interaction.drop_duplicates(inplace=True) 157 | 158 | 159 | # User-sport interaction 160 | user_sport_interaction = read_data(user_sport_path) 161 | if remove > 0: 162 | user_sport_interaction = user_sport_interaction[user_sport_interaction[ctm_id_type].isin(ctm_list)] 163 | 164 | # Sport-sportgroups interaction 165 | sport_sportg_interaction = read_data(sport_sportg_path) 166 | 167 | if report_model_coverage: 168 | train_users = user_item_train[ctm_id_type].unique().tolist() 169 | test_users = 
user_item_test[ctm_id_type].unique().tolist() 170 | sport_users = user_sport_interaction[ctm_id_type].unique().tolist() 171 | unseen_users = [uid for uid in test_users if uid not in train_users] 172 | print(f'There are {len(unseen_users)} users with no interactions') 173 | train_users.extend(sport_users) 174 | unseen_users = [uid for uid in test_users if uid not in train_users] 175 | print(f'and {len(unseen_users)} with also no sports associated') 176 | print(f'out of {len(test_users)}') 177 | 178 | return user_item_train, user_item_test, item_sport_interaction, user_sport_interaction, \ 179 | sport_sportg_interaction, item_feat_df, user_feat_df, sport_feat_df, sport_onehot_df 180 | 181 | 182 | def create_ids(user_item_train: pd.DataFrame, 183 | user_sport_interaction: pd.DataFrame, 184 | sport_sportg_interaction: pd.DataFrame, 185 | item_feat_df, 186 | item_id_type: str = 'SPECIFIC ITEM IDENTIFIER', 187 | ctm_id_type: str = 'CUSTOMER IDENTIFIER', 188 | spt_id_type: str = 'sport_id', 189 | ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: 190 | """ 191 | Create ids needed for creating the graph (nodes cannot have arbitrary ids, i.e. it couldn't be directly 192 | the item identifier). 193 | 194 | Parameters 195 | ---------- 196 | See parameters and outputs of format_dfs for details. 197 | 198 | Returns 199 | ------- 200 | ctm_id, pdt_id, spt_id: 201 | Mapping between Organisation info (e.g. customer, item and sport ID) and new node ID. 202 | 203 | """ 204 | 205 | # Create user ids 206 | ctm_id = pd.DataFrame(user_item_train[ctm_id_type].unique(), 207 | columns=[ctm_id_type]) 208 | ctm_id['ctm_new_id'] = ctm_id.index 209 | 210 | # Create item ids 211 | train_pdt = user_item_train[item_id_type].unique().tolist() 212 | all_pdt = item_feat_df[item_id_type].unique().tolist() 213 | unseen_pdt = [pdt for pdt in all_pdt if pdt not in train_pdt] 214 | train_pdt.extend(unseen_pdt) # DGL requires that node IDs are continuous; unseen are at the end 215 | pdt_id = pd.DataFrame(train_pdt, 216 | columns=[item_id_type]) 217 | pdt_id['pdt_new_id'] = pdt_id.index 218 | 219 | # Create sport ids 220 | unique_sports = np.append(sport_sportg_interaction.sports_id.unique(), 221 | sport_sportg_interaction.sportsgroup_id.unique()) 222 | unique_sports = np.unique(np.append(unique_sports, 223 | user_sport_interaction[spt_id_type].unique())) 224 | spt_id = pd.DataFrame(unique_sports, columns=[spt_id_type]) 225 | spt_id['spt_new_id'] = spt_id.index 226 | 227 | return ctm_id, pdt_id, spt_id 228 | 229 | 230 | def df_to_adjacency_list(user_item_train: pd.DataFrame, 231 | user_item_test: pd.DataFrame, 232 | item_sport_interaction: pd.DataFrame, 233 | user_sport_interaction: pd.DataFrame, 234 | sport_sportg_interaction: pd.DataFrame, 235 | ctm_id: pd.DataFrame, 236 | pdt_id: pd.DataFrame, 237 | spt_id: pd.DataFrame, 238 | item_id_type: str, 239 | ctm_id_type: str, 240 | spt_id_type: str, 241 | discern_clicks: bool = False, 242 | duplicates: str = 'keep_all' 243 | ): 244 | """ 245 | Takes dataframes & ids for the nodes, and return adjacency lists (in the form of src nodes and dst nodes.) 246 | 247 | Parameters 248 | ---------- 249 | discern_clicks, duplicates: 250 | See utils_data for details. 251 | all other parameters: 252 | See parameters & outputs of other functions in this file for details. 253 | 254 | Returns 255 | ------- 256 | adjacency_dict: 257 | This will be used to build the graph. It contains id of source and destination nodes for all edge types. 
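For illustration only (array contents are placeholders): with discern_clicks=True it takes the form {'clicks_src': ..., 'clicks_dst': ..., 'purchases_src': ..., 'purchases_dst': ..., 'item_sport_src': ..., 'item_sport_dst': ..., 'user_sport_src': ..., 'user_sport_dst': ..., 'sport_sportg_src': ..., 'sport_sportg_dst': ...}, plus 'clicks_num' / 'purchases_num' occurrence counts when duplicates are grouped; with discern_clicks=False the user-item keys collapse to 'user_item_src' / 'user_item_dst' (and 'user_item_num').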
258 | ground_truth_test, ground_truth_purchase_test: 259 | This will be used to compute metrics (i.e. check if recommended items can be found in the ground_truth). It 260 | contains user and item ids for all interactions in the test set. 261 | user_item_train: 262 | In this function, if duplicates == 'count_occurrence' or 'keep_last', some grouping manipulations are done on 263 | the user_item_train dataframe. Returning it will allow to attribute features to "grouped" edges. 264 | 265 | """ 266 | adjacency_dict = {} 267 | # User item : join new ids with old ids 268 | user_item_train = user_item_train.merge(ctm_id, 269 | how='left', 270 | on=ctm_id_type) 271 | user_item_train = user_item_train.merge(pdt_id, 272 | how='left', 273 | on=item_id_type) 274 | 275 | if duplicates in ['keep_last', 'count_occurrence']: 276 | grouped_df = user_item_train.groupby(['buy', 'ctm_new_id', 'pdt_new_id']).specific_item_identifier.count() 277 | grouped_df = pd.DataFrame(grouped_df).reset_index() 278 | grouped_df.columns = ['buy', 'ctm_new_id', 'pdt_new_id', 'num_interaction'] 279 | 280 | user_item_train.drop_duplicates(subset=['buy', 'ctm_new_id', 'pdt_new_id'], 281 | keep='last', 282 | inplace=True) # Keep last interaction 283 | user_item_train.sort_values(by=['buy', 'ctm_new_id', 'pdt_new_id'], 284 | ignore_index=True, 285 | inplace=True) # Have same order as grouped_df 286 | assert len(user_item_train) == len(grouped_df) 287 | user_item_train['num_interaction'] = grouped_df.num_interaction.values 288 | user_item_train.sort_values(by='hit_timestamp', 289 | ignore_index=True, 290 | inplace=True) # Reorder by date to keep sequential order 291 | if discern_clicks: 292 | adjacency_dict.update( 293 | { 294 | 'clicks_num': user_item_train[user_item_train.buy == 0].num_interaction.values, 295 | 'purchases_num': user_item_train[user_item_train.buy == 1].num_interaction.values 296 | } 297 | ) 298 | else: 299 | adjacency_dict.update( 300 | { 301 | 'user_item_num': user_item_train.num_interaction.values 302 | } 303 | ) 304 | 305 | if discern_clicks: 306 | adjacency_dict.update( 307 | { 308 | 'clicks_src': user_item_train[user_item_train.buy == 0].ctm_new_id.values, 309 | 'clicks_dst': user_item_train[user_item_train.buy == 0].pdt_new_id.values, 310 | 'purchases_src': user_item_train[user_item_train.buy == 1].ctm_new_id.values, 311 | 'purchases_dst': user_item_train[user_item_train.buy == 1].pdt_new_id.values, 312 | } 313 | ) 314 | 315 | else: 316 | adjacency_dict.update( 317 | { 318 | 'user_item_src': user_item_train.ctm_new_id.values, 319 | 'user_item_dst': user_item_train.pdt_new_id.values, 320 | } 321 | ) 322 | 323 | user_item_test = user_item_test.merge(ctm_id, 324 | how='left', 325 | on=ctm_id_type) 326 | user_item_test = user_item_test.merge(pdt_id, 327 | how='left', 328 | on=item_id_type) 329 | test_purchase_src = user_item_test[user_item_test.buy == 1].ctm_new_id.values 330 | test_purchase_dst = user_item_test[user_item_test.buy == 1].pdt_new_id.values 331 | ground_truth_purchase_test = (test_purchase_src, test_purchase_dst) 332 | 333 | test_src = user_item_test.ctm_new_id.values 334 | test_dst = user_item_test.pdt_new_id.values 335 | ground_truth_test = (test_src, test_dst) 336 | 337 | # Item sport : merge new ids with old ids 338 | item_sport_interaction = item_sport_interaction.merge(spt_id, 339 | how='left', 340 | on=spt_id_type) 341 | item_sport_interaction = item_sport_interaction.merge(pdt_id, 342 | how='left', 343 | on=item_id_type) 344 | item_sport_interaction.dropna(inplace=True) # drop items with 
no sports associated 345 | 346 | adjacency_dict['item_sport_src'] = item_sport_interaction.pdt_new_id.values 347 | adjacency_dict['item_sport_dst'] = item_sport_interaction.spt_new_id.values 348 | 349 | # User sport : merge new ids with old ids 350 | user_sport_interaction = user_sport_interaction.merge(spt_id, 351 | how='left', 352 | on=spt_id_type) 353 | user_sport_interaction = user_sport_interaction.merge(ctm_id, 354 | how='left', 355 | on=ctm_id_type) 356 | user_sport_interaction.dropna(inplace=True) 357 | 358 | adjacency_dict['user_sport_src'] = user_sport_interaction.ctm_new_id.values 359 | adjacency_dict['user_sport_dst'] = user_sport_interaction.spt_new_id.values 360 | 361 | # Sport sportgroups 362 | sport_sportg_interaction = sport_sportg_interaction.merge(spt_id, 363 | how='left', 364 | left_on='sports_id', 365 | right_on=spt_id_type) 366 | sport_sportg_interaction = sport_sportg_interaction.merge(spt_id, 367 | how='left', 368 | left_on='sportsgroup_id', 369 | right_on=spt_id_type) 370 | 371 | adjacency_dict['sport_sportg_src'] = sport_sportg_interaction.spt_new_id_x.values 372 | adjacency_dict['sport_sportg_dst'] = sport_sportg_interaction.spt_new_id_y.values 373 | 374 | return adjacency_dict, ground_truth_test, ground_truth_purchase_test, user_item_train 375 | 376 | 377 | def create_graph(graph_schema, 378 | ) -> dgl.DGLHeteroGraph: 379 | """ 380 | Create graph based on adjacency list. 381 | """ 382 | g = dgl.heterograph(graph_schema) 383 | return g 384 | 385 | 386 | def import_features(g: dgl.DGLHeteroGraph, 387 | user_feat_df, 388 | item_feat_df, 389 | sport_onehot_df, 390 | ctm_id: pd.DataFrame, 391 | pdt_id: pd.DataFrame, 392 | spt_id: pd.DataFrame, 393 | user_item_train, 394 | get_popularity: bool, 395 | num_days_pop: int, 396 | item_id_type: str, 397 | ctm_id_type: str, 398 | spt_id_type: str, 399 | ): 400 | """ 401 | Import features to a dict for all node types. 402 | 403 | For user and item, initializes feature arrays with only 0, then fills the values if they are available. 404 | 405 | Parameters 406 | ---------- 407 | get_popularity, num_days_pop: 408 | The recommender system can be enhanced by giving score boost for items that were popular. If get_popularity, 409 | popularity of the items will be computed. Num_days_pop defines the number of days to include in the 410 | computation. 411 | item_id_type, ctm_id_type, spt_id_type: 412 | See utils_data for details. 413 | all other parameters: 414 | See other functions in this file for details. 415 | 416 | Returns 417 | ------- 418 | features_dict: 419 | Dictionary with all the features imported here. 
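For illustration, the returned dictionary is roughly of the form {'user_feat': FloatTensor of shape (n_users, 2), 'item_feat': FloatTensor of shape (n_items, 4), 'sport_feat': FloatTensor of one-hot sport vectors (only if 'sport' is a node type), 'item_pop': FloatTensor of shape (n_items, 1) (only if get_popularity)}; exact sizes depend on the graph and follow the feature construction below.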
420 | """ 421 | features_dict = {} 422 | # User 423 | user_feat_df = user_feat_df.merge(ctm_id, how='inner', on=ctm_id_type) 424 | 425 | ids = user_feat_df.ctm_new_id.values.astype(int) 426 | feats = np.stack((user_feat_df.is_male.values, 427 | user_feat_df.is_female.values), 428 | axis=1) 429 | 430 | user_feat = np.zeros((g.number_of_nodes('user'), 2)) 431 | user_feat[ids] = feats 432 | 433 | user_feat = torch.tensor(user_feat).float() 434 | features_dict['user_feat'] = user_feat 435 | 436 | # Item 437 | if item_id_type in ['SPECIFIC ITEM IDENTIFIER']: 438 | item_feat_df = item_feat_df.merge(pdt_id, 439 | how='left', 440 | on=item_id_type) 441 | item_feat_df = item_feat_df[item_feat_df.pdt_new_id < g.number_of_nodes('item')] # Only IDs that are in graph 442 | 443 | ids = item_feat_df.pdt_new_id.values.astype(int) 444 | feats = np.stack((item_feat_df.is_junior.values, 445 | item_feat_df.is_male.values, 446 | item_feat_df.is_female.values, 447 | item_feat_df.eco_design.values, 448 | ), 449 | axis=1) 450 | 451 | item_feat = np.zeros((g.number_of_nodes('item'), feats.shape[1])) 452 | item_feat[ids] = feats 453 | item_feat = torch.tensor(item_feat).float() 454 | elif item_id_type in ['GENERAL ITEM IDENTIFIER']: 455 | item_feat = torch.zeros((g.number_of_nodes('item'), 4)) 456 | else: 457 | raise KeyError(f'Item ID {item_id_type} not recognized.') 458 | 459 | features_dict['item_feat'] = item_feat 460 | 461 | # Sport one-hot 462 | if 'sport' in g.ntypes: 463 | sport_onehot_df = sport_onehot_df.merge(spt_id, how='inner', on=spt_id_type) 464 | sport_onehot_df.sort_values(by='spt_new_id', 465 | inplace=True) # Values need to be sorted by node id to align with g.nodes['sport'] 466 | feats = sport_onehot_df.drop(labels=[spt_id_type, 'spt_new_id'], axis=1).values 467 | assert feats.shape[0] == g.num_nodes('sport') 468 | sport_feat = torch.tensor(feats).float() 469 | features_dict['sport_feat'] = sport_feat 470 | 471 | # Popularity 472 | if get_popularity: 473 | item_popularity = np.zeros((g.number_of_nodes('item'), 1)) 474 | pop_df = user_item_train.merge(pdt_id, 475 | how='left', 476 | on=item_id_type) 477 | most_recent_date = datetime.strptime(max(pop_df.hit_date), '%Y-%m-%d') 478 | limit_date = datetime.strftime( 479 | (most_recent_date - timedelta(days=num_days_pop)), 480 | format='%Y-%m-%d' 481 | ) 482 | pop_df = pop_df[pop_df.hit_date >= limit_date] 483 | pop_df = pd.DataFrame(pop_df.pdt_new_id.value_counts()) 484 | pop_df.columns = ['purchases'] 485 | pop_df['score'] = pop_df.purchases / pop_df.purchases.sum() 486 | pop_df.sort_index(inplace=True) 487 | ids = pop_df.index.values.astype(int) 488 | scores = pop_df.score.values 489 | item_popularity[ids] = np.expand_dims(scores, axis=1) 490 | item_popularity = torch.tensor(item_popularity).float() 491 | features_dict['item_pop'] = item_popularity 492 | 493 | return features_dict 494 | -------------------------------------------------------------------------------- /src/evaluation.py: -------------------------------------------------------------------------------- 1 | import random 2 | 3 | import numpy as np 4 | import pandas as pd 5 | from sklearn.metrics.pairwise import cosine_similarity 6 | 7 | from src.utils import save_txt 8 | 9 | 10 | def get_item_by_id(iid: int, 11 | pdt_id: pd.DataFrame, 12 | item_feat: pd.DataFrame, 13 | item_id_type: str): 14 | """ 15 | Fetch information about the item, given its node_id. 16 | 17 | The info need to be available in the item features dataset. 
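Illustrative call (the node id and identifier type are placeholders):
    info1, info2, info3 = get_item_by_id(42, pdt_id, item_feat_df, 'SPECIFIC ITEM IDENTIFIER')
where pdt_id is the node-id mapping built in src/builder.py and the item feature dataframe exposes the info1 / info2 / info3 columns read below.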
18 | """ 19 | # fetch old iid 20 | old_iid = pdt_id[item_id_type][pdt_id.pdt_new_id == iid].item() 21 | # fetch info 22 | info1 = item_feat.info1[item_feat[item_id_type] == old_iid].tolist()[0] 23 | info2 = item_feat.info2[item_feat[item_id_type] == old_iid].tolist()[0] 24 | info3 = item_feat.info3[item_feat[item_id_type] == old_iid].tolist()[0] 25 | return info1, info2, info3 26 | 27 | 28 | def fetch_recs_for_users(user, 29 | user_dict, 30 | pdt_id, 31 | item_feat_df, 32 | item_id_type, 33 | result_filepath, 34 | ground_truth_purchase_dict=None): 35 | """ 36 | For all items in a dict (of recs, or already_bought, or ground_truth), fetch information. 37 | 38 | """ 39 | for iid in user_dict[user]: 40 | try: 41 | info1, info2, info3 = get_item_by_id(iid, pdt_id, item_feat_df, item_id_type) 42 | sentence = info1 + ', ' + info2 + info3 43 | if ground_truth_purchase_dict is not None: 44 | if iid in ground_truth_purchase_dict[user]: 45 | count_purchases = len([item for item in ground_truth_purchase_dict[user] if item == iid]) 46 | sentence += f' ----- BOUGHT {count_purchases} TIME(S)' 47 | except: 48 | sentence = 'No name' 49 | save_txt(sentence, result_filepath, mode='a') 50 | 51 | 52 | def explore_recs(recs: dict, 53 | already_bought_dict: dict, 54 | already_clicked_dict, 55 | ground_truth_dict: dict, 56 | ground_truth_purchase_dict: dict, 57 | item_feat_df: pd.DataFrame, 58 | num_choices: int, 59 | pdt_id: pd.DataFrame, 60 | item_id_type: str, 61 | result_filepath: str): 62 | """ 63 | For a random sample of users, fetch information about what items were clicked/bought, recommended and ground truth. 64 | 65 | Users with only 1 previous click or purchase are explored at the end. 66 | """ 67 | choices = random.sample(recs.keys(), num_choices) 68 | 69 | for user in choices: 70 | save_txt('\nCustomer bought', result_filepath, mode='a') 71 | try: 72 | fetch_recs_for_users(user, 73 | already_bought_dict, 74 | pdt_id, 75 | item_feat_df, 76 | item_id_type, 77 | result_filepath) 78 | except: 79 | save_txt('Nothing', result_filepath, mode='a') 80 | 81 | save_txt('\nCustomer clicked on', result_filepath, mode='a') 82 | try: 83 | fetch_recs_for_users(user, 84 | already_clicked_dict, 85 | pdt_id, 86 | item_feat_df, 87 | item_id_type, 88 | result_filepath) 89 | except: 90 | save_txt('No click data', result_filepath, mode='a') 91 | 92 | save_txt('\nGot recommended', result_filepath, mode='a') 93 | fetch_recs_for_users(user, 94 | recs, 95 | pdt_id, 96 | item_feat_df, 97 | item_id_type, 98 | result_filepath) 99 | 100 | save_txt('\nGround truth', result_filepath, mode='a') 101 | fetch_recs_for_users(user, 102 | ground_truth_dict, 103 | pdt_id, 104 | item_feat_df, 105 | item_id_type, 106 | result_filepath, 107 | ground_truth_purchase_dict) 108 | 109 | # user with 1 item 110 | choices = random.sample([uid for uid, v in already_bought_dict.items() if len(v) == 1 and uid in recs.keys()], 2) 111 | for user in choices: 112 | save_txt('\nCustomer bought', result_filepath, mode='a') 113 | try: 114 | fetch_recs_for_users(user, 115 | already_bought_dict, 116 | pdt_id, 117 | item_feat_df, 118 | item_id_type, 119 | result_filepath) 120 | except: 121 | save_txt('Nothing', result_filepath, mode='a') 122 | 123 | save_txt('\nCustomer clicked on', result_filepath, mode='a') 124 | try: 125 | fetch_recs_for_users(user, 126 | already_clicked_dict, 127 | pdt_id, 128 | item_feat_df, 129 | item_id_type, 130 | result_filepath) 131 | except: 132 | save_txt('No click data', result_filepath, mode='a') 133 | 134 | save_txt('\nGot 
recommended', result_filepath, mode='a') 135 | fetch_recs_for_users(user, 136 | recs, 137 | pdt_id, 138 | item_feat_df, 139 | item_id_type, 140 | result_filepath) 141 | 142 | save_txt('\nGround truth', result_filepath, mode='a') 143 | fetch_recs_for_users(user, 144 | ground_truth_dict, 145 | pdt_id, 146 | item_feat_df, 147 | item_id_type, 148 | result_filepath, 149 | ground_truth_purchase_dict) 150 | 151 | 152 | def explore_sports(h, 153 | sport_feat_df: pd.DataFrame, 154 | spt_id: pd.DataFrame, 155 | num_choices: int, 156 | ): 157 | """ 158 | For a random sample of sport, fetch name of 5 most similar sports. 159 | """ 160 | sport_h = h['sport'] 161 | sim_matrix = cosine_similarity(sport_h.detach().cpu()) 162 | choices = random.sample(range(sport_h.shape[0]), num_choices) 163 | sentence = '' 164 | for sid in choices: 165 | # fetch name of sport id 166 | try: 167 | old_sid = spt_id.sport_id[spt_id.spt_new_id == sid].item() 168 | chosen_name = sport_feat_df.sport_label[sport_feat_df.sport_id == old_sid].item() 169 | except: 170 | chosen_name = 'N/A' 171 | # fetch most similar sports 172 | top = np.argpartition(sim_matrix[sid], -5)[-5:] 173 | top_list = spt_id.sport_id[spt_id.spt_new_id.isin(top.tolist())].tolist() 174 | top_names = sport_feat_df.sport_label[sport_feat_df.sport_id.isin(top_list)].unique() 175 | sentence += 'For sport {}, top similar sports are {} \n'.format(chosen_name, top_names) 176 | return sentence 177 | 178 | 179 | def check_coverage(user_item_interaction, 180 | item_feat_df, 181 | pdt_id, 182 | recs): 183 | """ 184 | Check the repartition of types of items in the purchases vs recommendations (generic vs female vs male vs junior). 185 | 186 | Also checks repartition of eco-design products in purchases vs recommendations. 187 | """ 188 | coverage_metrics = {} 189 | 190 | # remove all 'unknown' items 191 | known_items = item_feat_df.item_identifier.unique().tolist() 192 | user_item_interaction = user_item_interaction[user_item_interaction.item_identifier.isin(known_items)] 193 | 194 | # count number of types in original dataset 195 | df = user_item_interaction.merge(item_feat_df, 196 | how='left', 197 | on='ITEM IDENTIFIER') 198 | df['is_generic'] = (df.is_junior + df.is_male + df.is_female).astype(bool) * -1 + 1 199 | 200 | coverage_metrics['generic_mean_whole'] = df.is_generic.mean() 201 | coverage_metrics['junior_mean_whole'] = df.is_junior.mean() 202 | coverage_metrics['male_mean_whole'] = df.is_male.mean() 203 | coverage_metrics['female_mean_whole'] = df.is_female.mean() 204 | coverage_metrics['eco_mean_whole'] = df.eco_design.mean() 205 | 206 | # count in 'recs' 207 | recs_df = pd.DataFrame(recs.items()) 208 | recs_df.columns = ['uid', 'iid'] 209 | recs_df = recs_df.explode('iid') 210 | recs_df = recs_df.merge(pdt_id, 211 | how='left', 212 | left_on='iid', 213 | right_on='pdt_new_id') 214 | recs_df = recs_df.merge(item_feat_df, 215 | how='left', 216 | on='ITEM IDENTIFIER') 217 | 218 | recs_df['is_generic'] = (recs_df.is_junior + recs_df.is_male + recs_df.is_female).astype(bool) * -1 + 1 219 | 220 | coverage_metrics['generic_mean_recs'] = recs_df.is_generic.mean() 221 | coverage_metrics['junior_mean_recs'] = recs_df.is_junior.mean() 222 | coverage_metrics['male_mean_recs'] = recs_df.is_male.mean() 223 | coverage_metrics['female_mean_recs'] = recs_df.is_female.mean() 224 | coverage_metrics['eco_mean_recs'] = recs_df.eco_design.mean() 225 | 226 | return coverage_metrics 227 | -------------------------------------------------------------------------------- 
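Note on the shared data structures: src/evaluation.py (above) and src/metrics.py (below) pass recommendations and ground truth around as plain Python dictionaries keyed by user node id. A minimal sketch with made-up node ids:

    # user node id -> top-k recommended item node ids (as returned by get_recs)
    recs = {0: [12, 7, 431], 1: [5, 12, 98]}
    # user node id -> item node ids the user actually interacted with (as built by create_ground_truth)
    ground_truth_dict = {0: [7], 1: [98, 3]}
    # already_bought_dict / already_clicked_dict follow the same user-id -> item-ids layout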
/src/metrics.py: -------------------------------------------------------------------------------- 1 | from src.utils import softmax 2 | from collections import defaultdict 3 | 4 | import numpy as np 5 | import torch 6 | import torch.nn as nn 7 | 8 | def create_ground_truth(users, items): 9 | """ 10 | Creates a dictionary, where the keys are user ids and the values are item ids that the user actually bought. 11 | """ 12 | ground_truth_arr = np.stack((np.asarray(users), np.asarray(items)), axis=1) 13 | ground_truth_dict = defaultdict(list) 14 | for key, val in ground_truth_arr: 15 | ground_truth_dict[key].append(val) 16 | return ground_truth_dict 17 | 18 | 19 | def create_already_bought(g, bought_eids, etype='buys'): 20 | """ 21 | Creates a dictionary, where the keys are user ids and the values are item ids that the user already bought. 22 | """ 23 | users_train, items_train = g.find_edges(bought_eids, etype=etype) 24 | already_bought_arr = np.stack((np.asarray(users_train), np.asarray(items_train)), axis=1) 25 | already_bought_dict = defaultdict(list) 26 | for key, val in already_bought_arr: 27 | already_bought_dict[key].append(val) 28 | return already_bought_dict 29 | 30 | 31 | def get_recs(g, 32 | h, 33 | model, 34 | embed_dim, 35 | k, 36 | user_ids, 37 | already_bought_dict, 38 | remove_already_bought=True, 39 | cuda=False, 40 | device=None, 41 | pred: str = 'cos', 42 | use_popularity: bool = False, 43 | weight_popularity=1 44 | ): 45 | """ 46 | Computes K recommendation for all users, given hidden states, the model and what they already bought. 47 | """ 48 | if cuda: # model is already in cuda? 49 | model = model.to(device) 50 | print('Computing recommendations on {} users, for {} items'.format(len(user_ids), g.num_nodes('item'))) 51 | recs = {} 52 | for user in user_ids: 53 | user_emb = h['user'][user] 54 | already_bought = already_bought_dict[user] 55 | user_emb_rpt = torch.cat(g.num_nodes('item')*[user_emb]).reshape(-1, embed_dim) 56 | 57 | if pred == 'cos': 58 | cos = nn.CosineSimilarity(dim=1, eps=1e-6) 59 | ratings = cos(user_emb_rpt, h['item']) 60 | 61 | elif pred == 'nn': 62 | cat_embed = torch.cat((user_emb_rpt, h['item']), 1) 63 | ratings = model.pred_fn.layer_nn(cat_embed) 64 | 65 | else: 66 | raise KeyError(f'Prediction function {pred} not recognized.') 67 | 68 | ratings_formatted = ratings.cpu().detach().numpy().reshape(g.num_nodes('item'),) 69 | if use_popularity: 70 | softmax_ratings = softmax(ratings_formatted) 71 | popularity_scores = g.ndata['popularity']['item'].numpy().reshape(g.num_nodes('item'),) 72 | ratings_formatted = np.add(softmax_ratings, popularity_scores * weight_popularity) 73 | order = np.argsort(-ratings_formatted) 74 | if remove_already_bought: 75 | order = [item for item in order if item not in already_bought] 76 | rec = order[:k] 77 | recs[user] = rec 78 | return recs 79 | 80 | 81 | def recs_to_metrics(recs, ground_truth_dict, g): 82 | """ 83 | Given the recommendations and the ground_truth, computes precision, recall & coverage. 
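Definitions, matching the accumulation below (micro-averaged over all users):
    precision = (# recommended items found in the user's ground truth) / (# recommendations made)
    recall    = (# ground-truth interactions recovered by the recommendations) / (# ground-truth interactions)
    coverage  = (# distinct items recommended at least once) / (# item nodes in the graph)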
84 | """ 85 | # precision 86 | k_relevant = 0 87 | k_total = 0 88 | for uid, iids in recs.items(): 89 | k_total += len(iids) 90 | k_relevant += len([id_ for id_ in iids if id_ in ground_truth_dict[uid]]) 91 | precision = k_relevant/k_total 92 | 93 | # recall 94 | k_relevant = 0 95 | k_total = 0 96 | for uid, iids in recs.items(): 97 | k_total += len(ground_truth_dict[uid]) 98 | k_relevant += len([id_ for id_ in ground_truth_dict[uid] if id_ in iids]) 99 | recall = k_relevant/k_total 100 | 101 | # coverage 102 | nb_total = g.num_nodes('item') 103 | recs_flatten = [item for sublist in list(recs.values()) for item in sublist] 104 | nb_recommended = len(set(recs_flatten)) 105 | coverage = nb_recommended / nb_total 106 | 107 | return precision, recall, coverage 108 | 109 | 110 | def get_metrics_at_k(h, 111 | g, 112 | model, 113 | embed_dim, 114 | ground_truth, 115 | bought_eids, 116 | k, 117 | remove_already_bought=True, 118 | cuda=False, 119 | device=None, 120 | pred='cos', 121 | use_popularity=False, 122 | weight_popularity=1): 123 | """ 124 | Function combining all previous functions: create already_bought & ground_truth dict, get recs and compute metrics. 125 | """ 126 | already_bought_dict = create_already_bought(g, bought_eids) 127 | users, items = ground_truth 128 | user_ids = np.unique(users).tolist() 129 | ground_truth_dict = create_ground_truth(users, items) 130 | recs = get_recs(g, h, model, embed_dim, k, user_ids, already_bought_dict, 131 | remove_already_bought, cuda, device, pred, use_popularity, weight_popularity) 132 | precision, recall, coverage = recs_to_metrics(recs, ground_truth_dict, g) 133 | 134 | return precision, recall, coverage 135 | 136 | 137 | def MRR_neg_edges(model, 138 | blocks, 139 | pos_g, 140 | neg_g, 141 | etype, 142 | neg_sample_size): 143 | """ 144 | (Currently not used.) Computes Mean Reciprocal Rank for the positive edge, out of all negative edges considered. 145 | 146 | Since it computes only on negative edges considered, it is an heuristic of the real MRR. 147 | However, if you put neg_sample_size as the total number of items, can be considered as MRR 148 | (because it will create as many edges as there are items). 149 | """ 150 | input_features = blocks[0].srcdata['features'] 151 | _, pos_score, neg_score = model(blocks, 152 | input_features, 153 | pos_g, neg_g, 154 | etype) 155 | neg_score = neg_score.reshape(-1, neg_sample_size) 156 | rankings = torch.sum(neg_score >= pos_score, dim=1) + 1 157 | return np.mean(1.0 / rankings.cpu().numpy()) 158 | -------------------------------------------------------------------------------- /src/model.py: -------------------------------------------------------------------------------- 1 | from typing import Tuple 2 | 3 | import torch 4 | import torch.nn as nn 5 | import torch.nn.functional as F 6 | import dgl.nn.pytorch as dglnn 7 | import dgl.function as fn 8 | 9 | 10 | class NodeEmbedding(nn.Module): 11 | """ 12 | Projects the node features into embedding space. 13 | """ 14 | def __init__(self, 15 | in_feats, 16 | out_feats, 17 | ): 18 | super().__init__() 19 | self.proj_feats = nn.Linear(in_feats, out_feats) 20 | 21 | def forward(self, 22 | node_feats): 23 | x = self.proj_feats(node_feats) 24 | return x 25 | 26 | 27 | class ConvLayer(nn.Module): 28 | """ 29 | 1 layer of message passing & aggregation, specific to an edge type. 30 | 31 | Similar to SAGEConv layer in DGL library. 32 | 33 | Methods 34 | ------- 35 | reset_parameters: 36 | Intialize parameters for all neural networks in the layer. 
37 | _lstm_reducer: 38 | Aggregate messages of neighborhood nodes using LSTM technique. (All other message aggregation are builtin 39 | functions of DGL). 40 | forward: 41 | Actual message passing & aggregation, & update of nodes messages. 42 | 43 | """ 44 | 45 | def reset_parameters(self): 46 | gain = nn.init.calculate_gain('relu') 47 | nn.init.xavier_uniform_(self.fc_self.weight, gain=gain) 48 | nn.init.xavier_uniform_(self.fc_neigh.weight, gain=gain) 49 | # nn.init.xavier_uniform_(self.fc_edge.weight, gain=gain) 50 | if self._aggre_type in ['pool_nn', 'pool_nn_edge', 'mean_nn', 'mean_nn_edge']: 51 | nn.init.xavier_uniform_(self.fc_preagg.weight, gain=gain) 52 | if self._aggre_type == 'lstm': 53 | self.lstm.reset_parameters() 54 | 55 | def __init__(self, 56 | in_feats: Tuple[int, int], 57 | out_feats: int, 58 | dropout: float, 59 | aggregator_type: str, 60 | norm, 61 | ): 62 | """ 63 | Initialize the layer with parameters. 64 | 65 | Parameters 66 | ---------- 67 | in_feats: 68 | Dimension of the message (aka features) of the node type neighbor and of the node type. E.g. if the 69 | ConvLayer is initialized for the edge type (user, clicks, item), in_feats should be 70 | (dimension_of_item_features, dimension_of_user_features). Note that usually features will first go 71 | through embedding layer, thus dimension might be equal to the embedding dimension. 72 | out_feats: 73 | Dimension that the output (aka the updated node message) should take. E.g. if the layer is a hidden layer, 74 | out_feats should be hidden_dimension, whereas if the layer is the output layer, out_feats should be 75 | output_dimension. 76 | dropout: 77 | Traditional dropout applied to input features. 78 | aggregator_type: 79 | This is the main parameter of ConvLayer; it defines how messages are passed and aggregated. Multiple 80 | choices: 81 | 'mean' : messages are passed normally, and aggregated by doing the mean of all neighbor messages. 82 | 'mean_nn' : messages are passed through a neural network before being passed to neighbors, and 83 | aggregated by doing the mean of all neighbor messages. 84 | 'pool_nn' : messages are passed through a neural network before being passed to neighbors, and 85 | aggregated by doing the max of all neighbor messages. 86 | 'lstm' : messages are passed normally, and aggregared using _lstm_reducer. 87 | All choices have also their equivalent that ends with _edge (e.g. mean_nn_edge). Those version include 88 | the edge in the message passing, i.e. the message will be multiplied by the value of the edge. 89 | norm: 90 | Apply normalization 91 | """ 92 | super().__init__() 93 | self._in_neigh_feats, self._in_self_feats = in_feats 94 | self._out_feats = out_feats 95 | self._aggre_type = aggregator_type 96 | self.dropout_fn = nn.Dropout(dropout) 97 | self.norm = norm 98 | self.fc_self = nn.Linear(self._in_self_feats, out_feats, bias=False) 99 | self.fc_neigh = nn.Linear(self._in_neigh_feats, out_feats, bias=False) 100 | # self.fc_edge = nn.Linear(1, 1, bias=True) # Projecting recency days 101 | if aggregator_type in ['pool_nn', 'pool_nn_edge', 'mean_nn', 'mean_nn_edge']: 102 | self.fc_preagg = nn.Linear(self._in_neigh_feats, self._in_neigh_feats, bias=False) 103 | if aggregator_type == 'lstm': 104 | self.lstm = nn.LSTM(self._in_neigh_feats, self._in_neigh_feats, batch_first=True) 105 | self.reset_parameters() 106 | 107 | def _lstm_reducer(self, nodes): 108 | """ 109 | Aggregates the neighborhood messages using LSTM technique. 110 | 111 | This was taken from DGL docs. 
For computation optimization, when 'batching' nodes, all nodes 112 | with the same degrees are batched together, i.e. at first all nodes with 1 in-neighbor are computed 113 | (thus m.shape = [number of nodes in the batchs, 1, dimensionality of h]), then all nodes with 2 in- 114 | neighbors, etc. 115 | """ 116 | m = nodes.mailbox['m'] 117 | batch_size = m.shape[0] 118 | h = (m.new_zeros((1, batch_size, self._in_neigh_feats)), 119 | m.new_zeros((1, batch_size, self._in_neigh_feats))) 120 | _, (rst, _) = self.lstm(m, h) 121 | return {'neigh': rst.squeeze(0)} 122 | 123 | def forward(self, 124 | graph, 125 | x): 126 | """ 127 | Message passing & aggregation, & update of node messages. 128 | 129 | Process is the following: 130 | - Messages (h_neigh and h_self) are extracted from x 131 | - Dropout is applied 132 | - Messages are passed and aggregated using the _aggre_type specified (see __init__ for details), which 133 | return updated h_neigh 134 | - h_self passes through neural network & updated h_neigh passes through neural network, and are then summed 135 | up 136 | - The sum (z) passes through Relu 137 | - Normalization is applied 138 | """ 139 | h_neigh, h_self = x 140 | h_neigh = self.dropout_fn(h_neigh) 141 | h_self = self.dropout_fn(h_self) 142 | 143 | if self._aggre_type == 'mean': 144 | graph.srcdata['h'] = h_neigh 145 | graph.update_all( 146 | fn.copy_src('h', 'm'), 147 | fn.mean('m', 'neigh')) 148 | h_neigh = graph.dstdata['neigh'] 149 | 150 | elif self._aggre_type == 'mean_nn': 151 | graph.srcdata['h'] = F.relu(self.fc_preagg(h_neigh)) 152 | graph.update_all( 153 | fn.copy_src('h', 'm'), 154 | fn.mean('m', 'neigh')) 155 | h_neigh = graph.dstdata['neigh'] 156 | 157 | elif self._aggre_type == 'pool_nn': 158 | graph.srcdata['h'] = F.relu(self.fc_preagg(h_neigh)) 159 | graph.update_all( 160 | fn.copy_src('h', 'm'), 161 | fn.max('m', 'neigh')) 162 | h_neigh = graph.dstdata['neigh'] 163 | 164 | elif self._aggre_type == 'lstm': 165 | graph.srcdata['h'] = h_neigh 166 | graph.update_all( 167 | fn.copy_src('h', 'm'), 168 | self._lstm_reducer) 169 | h_neigh = graph.dstdata['neigh'] 170 | 171 | elif self._aggre_type == 'mean_edge': 172 | graph.srcdata['h'] = h_neigh 173 | if graph.canonical_etypes[0][0] in ['user', 'item'] and graph.canonical_etypes[0][2] in ['user', 'item']: 174 | graph.edata['h'] = graph.edata['occurrence'].float().unsqueeze(1) 175 | graph.update_all( 176 | fn.u_mul_e('h', 'h', 'm'), 177 | fn.mean('m', 'neigh')) 178 | else: 179 | graph.update_all( 180 | fn.copy_src('h', 'm'), 181 | fn.mean('m', 'neigh')) 182 | h_neigh = graph.dstdata['neigh'] 183 | 184 | elif self._aggre_type == 'mean_nn_edge': 185 | graph.srcdata['h'] = F.relu(self.fc_preagg(h_neigh)) 186 | if graph.canonical_etypes[0][0] in ['user', 'item'] and graph.canonical_etypes[0][2] in ['user', 'item']: 187 | graph.edata['h'] = graph.edata['occurrence'].float().unsqueeze(1) 188 | graph.update_all( 189 | fn.u_mul_e('h', 'h', 'm'), 190 | fn.mean('m', 'neigh')) 191 | else: 192 | graph.update_all( 193 | fn.copy_src('h', 'm'), 194 | fn.mean('m', 'neigh')) 195 | h_neigh = graph.dstdata['neigh'] 196 | 197 | elif self._aggre_type == 'pool_nn_edge': 198 | graph.srcdata['h'] = F.relu(self.fc_preagg(h_neigh)) 199 | if graph.canonical_etypes[0][0] in ['user', 'item'] and graph.canonical_etypes[0][2] in ['user', 'item']: 200 | graph.edata['h'] = graph.edata['occurrence'].float().unsqueeze(1) 201 | graph.update_all( 202 | fn.u_mul_e('h', 'h', 'm'), 203 | fn.max('m', 'neigh')) 204 | else: 205 | graph.update_all( 206 | 
fn.copy_src('h', 'm'), 207 | fn.max('m', 'neigh')) 208 | h_neigh = graph.dstdata['neigh'] 209 | 210 | elif self._aggre_type == 'lstm_edge': 211 | graph.srcdata['h'] = h_neigh 212 | if graph.canonical_etypes[0][0] in ['user', 'item'] and graph.canonical_etypes[0][2] in ['user', 'item']: 213 | graph.edata['h'] = graph.edata['occurrence'].float().unsqueeze(1) 214 | graph.update_all( 215 | fn.u_mul_e('h', 'h', 'm'), 216 | self._lstm_reducer) 217 | else: 218 | graph.update_all( 219 | fn.copy_src('h', 'm'), 220 | self._lstm_reducer) 221 | h_neigh = graph.dstdata['neigh'] 222 | 223 | else: 224 | raise KeyError('Aggregator type {} not recognized.'.format(self._aggre_type)) 225 | 226 | z = self.fc_self(h_self) + self.fc_neigh(h_neigh) 227 | z = F.relu(z) 228 | 229 | # normalization 230 | if self.norm: 231 | z_norm = z.norm(2, 1, keepdim=True) 232 | z_norm = torch.where(z_norm == 0, 233 | torch.tensor(1.).to(z_norm), 234 | z_norm) 235 | z = z / z_norm 236 | 237 | return z 238 | 239 | 240 | class PredictingLayer(nn.Module): 241 | """ 242 | Scoring function that uses a neural network to compute similarity between user and item. 243 | 244 | Only used if fixed_params.pred == 'nn'. 245 | Given the concatenated hidden states of heads and tails vectors, passes them through neural network and 246 | returns sigmoid ratings. 247 | """ 248 | 249 | def reset_parameters(self): 250 | gain_relu = nn.init.calculate_gain('relu') 251 | gain_sigmoid = nn.init.calculate_gain('sigmoid') 252 | nn.init.xavier_uniform_(self.hidden_1.weight, gain=gain_relu) 253 | nn.init.xavier_uniform_(self.hidden_2.weight, gain=gain_relu) 254 | nn.init.xavier_uniform_(self.output.weight, gain=gain_sigmoid) 255 | 256 | def __init__(self, embed_dim: int): 257 | super(PredictingLayer, self).__init__() 258 | self.hidden_1 = nn.Linear(embed_dim * 2, 128) 259 | self.hidden_2 = nn.Linear(128, 32) 260 | self.output = nn.Linear(32, 1) 261 | self.relu = nn.ReLU() 262 | self.sigmoid = nn.Sigmoid() 263 | self.reset_parameters() 264 | 265 | def forward(self, x): 266 | x = self.hidden_1(x) 267 | x = self.relu(x) 268 | x = self.hidden_2(x) 269 | x = self.relu(x) 270 | x = self.output(x) 271 | x = self.sigmoid(x) 272 | return x 273 | 274 | 275 | class PredictingModule(nn.Module): 276 | """ 277 | Predicting module that incorporate the predicting layer defined earlier. 278 | 279 | Only used if fixed_params.pred == 'nn'. 280 | Process: 281 | - Fetches hidden states of 'heads' and 'tails' of the edges. 282 | - Concatenates them, then passes them through the predicting layer. 283 | - Returns ratings (sigmoid function). 284 | """ 285 | 286 | def __init__(self, predicting_layer, embed_dim: int): 287 | super(PredictingModule, self).__init__() 288 | self.layer_nn = predicting_layer(embed_dim) 289 | 290 | def forward(self, 291 | graph, 292 | h 293 | ): 294 | ratings_dict = {} 295 | for etype in graph.canonical_etypes: 296 | if etype[0] in ['user', 'item'] and etype[2] in ['user', 'item']: 297 | utype, _, vtype = etype 298 | src_nid, dst_nid = graph.all_edges(etype=etype) 299 | emb_heads = h[utype][src_nid] 300 | emb_tails = h[vtype][dst_nid] 301 | cat_embed = torch.cat((emb_heads, emb_tails), 1) 302 | ratings = self.layer_nn(cat_embed) 303 | ratings_dict[etype] = torch.flatten(ratings) 304 | ratings_dict = {key: torch.unsqueeze(ratings_dict[key], 1) for key in ratings_dict.keys()} 305 | return ratings_dict 306 | 307 | 308 | class CosinePrediction(nn.Module): 309 | """ 310 | Scoring function that uses cosine similarity to compute similarity between user and item. 
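For an edge (u, v), the score is the cosine similarity of the two hidden states, cos(h_u, h_v) = h_u · h_v / (||h_u|| ||h_v||), implemented below by L2-normalizing both embeddings and taking their dot product with fn.u_dot_v.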
311 | 312 | Only used if fixed_params.pred == 'cos'. 313 | """ 314 | def __init__(self): 315 | super().__init__() 316 | 317 | def forward(self, graph, h): 318 | with graph.local_scope(): 319 | for etype in graph.canonical_etypes: 320 | try: 321 | graph.nodes[etype[0]].data['norm_h'] = F.normalize(h[etype[0]], p=2, dim=-1) 322 | graph.nodes[etype[2]].data['norm_h'] = F.normalize(h[etype[2]], p=2, dim=-1) 323 | graph.apply_edges(fn.u_dot_v('norm_h', 'norm_h', 'cos'), etype=etype) 324 | except KeyError: 325 | pass # For etypes that are not in training eids, thus have no 'h' 326 | ratings = graph.edata['cos'] 327 | return ratings 328 | 329 | 330 | class ConvModel(nn.Module): 331 | """ 332 | Assembles embedding layers, multiple ConvLayers and chosen predicting function into a full model. 333 | 334 | """ 335 | 336 | def __init__(self, 337 | g, 338 | n_layers: int, 339 | dim_dict, 340 | norm: bool = True, 341 | dropout: float = 0.0, 342 | aggregator_type: str = 'mean', 343 | pred: str = 'cos', 344 | aggregator_hetero: str = 'sum', 345 | embedding_layer: bool = True, 346 | ): 347 | """ 348 | Initialize the ConvModel. 349 | 350 | Parameters 351 | ---------- 352 | g: 353 | Graph, only used to query graph metastructure (fetch node types and edge types). 354 | n_layers: 355 | Number of ConvLayer. 356 | dim_dict: 357 | Dictionary with dimension for all input nodes, hidden dimension (aka embedding dimension), output dimension. 358 | norm, dropout, aggregator_type: 359 | See ConvLayer for details. 360 | aggregator_hetero: 361 | Since we are working with heterogeneous graph, all nodes will have messages coming from different types of 362 | nodes. However, the neighborhood messages are specific to a node type. Thus, we have to aggregate 363 | neighborhood messages from different edge types. 364 | Choices are 'mean', 'sum', 'max'. 365 | embedding_layer: 366 | Some GNN papers explicitly define an embedding layer, whereas other papers consider the first ConvLayer 367 | as the "embedding" layer. If true, an explicit embedding layer will be defined (using NodeEmbedding). If 368 | false, the first ConvLayer will have input dimensions equal to node features. 
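        As an illustration of the expected format of dim_dict (the numbers are hypothetical and depend on the
        dataset):

            dim_dict = {'user': 128,    # number of user input features
                        'item': 94,     # number of item input features
                        'sport': 32,    # only needed if 'sport' nodes are present in the graph
                        'hidden': 256,  # hidden (embedding) dimension
                        'out': 128}     # dimension of the final node representations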
369 | 370 | """ 371 | super().__init__() 372 | self.embedding_layer = embedding_layer 373 | if embedding_layer: 374 | self.user_embed = NodeEmbedding(dim_dict['user'], dim_dict['hidden']) 375 | self.item_embed = NodeEmbedding(dim_dict['item'], dim_dict['hidden']) 376 | if 'sport' in g.ntypes: 377 | self.sport_embed = NodeEmbedding(dim_dict['sport'], dim_dict['hidden']) 378 | 379 | self.layers = nn.ModuleList() 380 | 381 | # input layer 382 | if not embedding_layer: 383 | self.layers.append( 384 | dglnn.HeteroGraphConv( 385 | {etype[1]: ConvLayer((dim_dict[etype[0]], dim_dict[etype[2]]), dim_dict['hidden'], dropout, 386 | aggregator_type, norm) 387 | for etype in g.canonical_etypes}, 388 | aggregate=aggregator_hetero) 389 | ) 390 | 391 | # hidden layers 392 | for i in range(n_layers - 2): 393 | self.layers.append( 394 | dglnn.HeteroGraphConv( 395 | {etype[1]: ConvLayer((dim_dict['hidden'], dim_dict['hidden']), dim_dict['hidden'], dropout, 396 | aggregator_type, norm) 397 | for etype in g.canonical_etypes}, 398 | aggregate=aggregator_hetero)) 399 | 400 | # output layer 401 | self.layers.append( 402 | dglnn.HeteroGraphConv( 403 | {etype[1]: ConvLayer((dim_dict['hidden'], dim_dict['hidden']), dim_dict['out'], dropout, 404 | aggregator_type, norm) 405 | for etype in g.canonical_etypes}, 406 | aggregate=aggregator_hetero)) 407 | 408 | if pred == 'cos': 409 | self.pred_fn = CosinePrediction() 410 | elif pred == 'nn': 411 | self.pred_fn = PredictingModule(PredictingLayer, dim_dict['out']) 412 | else: 413 | raise KeyError('Prediction function {} not recognized.'.format(pred)) 414 | 415 | def get_repr(self, 416 | blocks, 417 | h): 418 | for i in range(len(blocks)): 419 | layer = self.layers[i] 420 | h = layer(blocks[i], h) 421 | return h 422 | 423 | def forward(self, 424 | blocks, 425 | h, 426 | pos_g, 427 | neg_g, 428 | embedding_layer: bool = True, 429 | ): 430 | """ 431 | Full pass through the ConvModel. 432 | 433 | Process: 434 | - Embedding layer 435 | - get_repr: As many ConvLayer as wished. All "Layers" are composed of as many ConvLayer as there 436 | are edge types. 437 | - Predicting layer predicts score for all positive examples and all negative examples. 438 | 439 | Parameters 440 | ---------- 441 | blocks: 442 | Computational blocks. Can be thought of as subgraphs. There are as many blocks as there are layers. 443 | h: 444 | Initial state of all nodes. 445 | pos_g: 446 | Positive graph, generated by the EdgeDataLoader. Contains all positive examples of the batch that need to 447 | be scored. 448 | neg_g: 449 | Negative graph, generated by the EdgeDataLoader. For all positive pairs in the pos_g, multiple negative 450 | pairs were generated. They are all scored. 451 | 452 | Returns 453 | ------- 454 | h: 455 | Updated state of all nodes 456 | pos_score: 457 | All scores between positive examples (aka positive pairs). 458 | neg_score: 459 | All score between negative examples. 
460 |
461 | """
462 | if embedding_layer:
463 | h['user'] = self.user_embed(h['user'])
464 | h['item'] = self.item_embed(h['item'])
465 | if 'sport' in h.keys():
466 | h['sport'] = self.sport_embed(h['sport'])
467 | h = self.get_repr(blocks, h)
468 | pos_score = self.pred_fn(pos_g, h)
469 | neg_score = self.pred_fn(neg_g, h)
470 | return h, pos_score, neg_score
471 |
472 |
473 | def max_margin_loss(pos_score,
474 | neg_score,
475 | delta: float,
476 | neg_sample_size: int,
477 | use_recency: bool = False,
478 | recency_scores=None,
479 | remove_false_negative: bool = False,
480 | negative_mask=None,
481 | cuda=False,
482 | device=None
483 | ):
484 | """
485 | Simple max margin loss.
486 |
487 | Parameters
488 | ----------
489 | pos_score:
490 | All similarity scores for positive examples.
491 | neg_score:
492 | All similarity scores for negative examples.
493 | delta:
494 | Margin by which pos_score should be higher than each of its corresponding neg_score.
495 | neg_sample_size:
496 | Number of negative examples to generate for each positive example.
497 | See main.SearchableHyperparameters for more details.
498 | use_recency:
499 | If true, loss will be divided by the recency, i.e. more recent positive examples will be given a
500 | greater weight in the total loss.
501 | See main.SearchableHyperparameters for more details.
502 | recency_scores:
503 | Recency scores for all training edges. The loss is divided by them if use_recency == True.
504 | remove_false_negative:
505 | When generating negative examples, it is possible that a random negative example is actually in the graph,
506 | i.e. it should not be a negative example. If true, those will be removed.
507 | negative_mask:
508 | For each negative example, an indicator of whether it is a false negative.
509 | """
510 | all_scores = torch.empty(0)
511 | if cuda:
512 | all_scores = all_scores.to(device)
513 | for etype in pos_score.keys():
514 | neg_score_tensor = neg_score[etype]
515 | pos_score_tensor = pos_score[etype]
516 | neg_score_tensor = neg_score_tensor.reshape(-1, neg_sample_size)
517 | if remove_false_negative:
518 | negative_mask_tensor = negative_mask[etype].reshape(-1, neg_sample_size)
519 | else:
520 | negative_mask_tensor = torch.zeros(size=neg_score_tensor.shape)
521 | if cuda:
522 | negative_mask_tensor = negative_mask_tensor.to(device)
523 | scores = neg_score_tensor + delta - pos_score_tensor - negative_mask_tensor
524 | relu = nn.ReLU()
525 | scores = relu(scores)
526 | if use_recency:
527 | try:
528 | recency_scores_tensor = recency_scores[etype]
529 | scores = scores / torch.unsqueeze(recency_scores_tensor, 1)
530 | except KeyError: # Not all edge types have recency. Only training edges have recency (i.e. clicks & buys)
531 | pass
532 | all_scores = torch.cat((all_scores, scores), 0)
533 | return torch.mean(all_scores)
534 |
-------------------------------------------------------------------------------- /src/sampling.py: --------------------------------------------------------------------------------
1 | import dgl
2 | import numpy as np
3 |
4 |
5 | def train_valid_split(valid_graph: dgl.DGLHeteroGraph,
6 | ground_truth_test,
7 | etypes,
8 | subtrain_size,
9 | valid_size,
10 | reverse_etype,
11 | train_on_clicks,
12 | remove_train_eids,
13 | clicks_sample=1,
14 | purchases_sample=1,
15 | ):
16 | """
17 | Using the full graph, sample train_graph and eids of edges for train & validation, as well as nids for test.
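    As a rough numerical illustration of the split implemented below (hypothetical numbers: valid_size=0.05 and
    1000 chronologically ordered edges of a given type):

        all_eids = np.arange(1000)
        valid_eids = all_eids[int(len(all_eids) * (1 - 0.05)):]  # the 50 most recent edges go to validation
        # the remaining 950 edges stay available for the training graph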
18 |
19 | Process:
20 | - Validation
21 | - valid_eids are the most recent X edges of all eids of the graph (based on valid_size)
22 | - valid_uids and valid_iids are the user ids and item ids associated with those edges (and together they form the
23 | ground_truth)
24 | - Training graph & eids
25 | - All edges and reverse edges of valid_eids are removed from the full graph.
26 | - train_eids are all remaining edges.
27 | - Sampling of training eids
28 | - It might be relevant to have numerous edges in the training graph to do message passing, but to
29 | optimize the model to give high scores only to recent interactions (to help with seasonality)
30 | - Thus, if purchases_sample or clicks_sample are < 1, only the most recent X edges are kept in the
31 | train_eids dict
32 | - An extra option is available to ensure that no information leakage appears: remove_train_eids. If true,
33 | all eids in the train_eids dict will be removed from the graph. (Otherwise, information leakage is still
34 | handled by the EdgeDataLoader: sampled edges are removed from the computation blocks.) Based on
35 | experience, it is best to set remove_train_eids to False.
36 | - Computing metrics on training set: subtrain nids
37 | - To compute metrics on the training set, we sample a "subtrain set". We need the ground_truth for
38 | the subtrain, as well as node ids for all users and items in the subtrain set.
39 | - Computing metrics on test set
40 | - We need node ids for all users and items in the test set (so we can fetch their embeddings during
41 | recommendations)
42 |
43 | """
44 | np.random.seed(11)
45 |
46 | all_eids_dict = {}
47 | valid_eids_dict = {}
48 | train_eids_dict = {}
49 | valid_uids_all = []
50 | valid_iids_all = []
51 | for etype in etypes:
52 | all_eids = np.arange(valid_graph.number_of_edges(etype))
53 | valid_eids = all_eids[int(len(all_eids) * (1 - valid_size)):]
54 | valid_uids, valid_iids = valid_graph.find_edges(valid_eids, etype=etype)
55 | valid_uids_all.extend(valid_uids.tolist())
56 | valid_iids_all.extend(valid_iids.tolist())
57 | all_eids_dict[etype] = all_eids
58 | if (etype == ('user', 'buys', 'item')) or (etype == ('user', 'clicks', 'item') and train_on_clicks):
59 | valid_eids_dict[etype] = valid_eids
60 | ground_truth_valid = (np.array(valid_uids_all), np.array(valid_iids_all))
61 | valid_uids = np.array(np.unique(valid_uids_all))
62 |
63 | # Create partial graph
64 | train_graph = valid_graph.clone()
65 | for etype in etypes:
66 | if (etype == ('user', 'buys', 'item')) or (etype == ('user', 'clicks', 'item') and train_on_clicks):
67 | train_graph.remove_edges(valid_eids_dict[etype], etype=etype)
68 | train_graph.remove_edges(valid_eids_dict[etype], etype=reverse_etype[etype])
69 | train_eids = np.arange(train_graph.number_of_edges(etype))
70 | train_eids_dict[etype] = train_eids
71 |
72 | if purchases_sample != 1:
73 | eids = train_eids_dict[('user', 'buys', 'item')]
74 | train_eids_dict[('user', 'buys', 'item')] = eids[int(len(eids) * (1 - purchases_sample)):]
75 | eids = valid_eids_dict[('user', 'buys', 'item')]
76 | valid_eids_dict[('user', 'buys', 'item')] = eids[int(len(eids) * (1 - purchases_sample)):]
77 |
78 | if clicks_sample != 1 and ('user', 'clicks', 'item') in train_eids_dict.keys():
79 | eids = train_eids_dict[('user', 'clicks', 'item')]
80 | train_eids_dict[('user', 'clicks', 'item')] = eids[int(len(eids) * (1 - clicks_sample)):]
81 | eids = valid_eids_dict[('user', 'clicks', 'item')]
82 | valid_eids_dict[('user', 'clicks', 'item')] = eids[int(len(eids) * (1 -
clicks_sample)):] 83 | 84 | if remove_train_eids: 85 | train_graph.remove_edges(train_eids_dict[etype], etype=etype) 86 | train_graph.remove_edges(train_eids_dict[etype], etype=reverse_etype[etype]) 87 | 88 | # Generate inference nodes for subtrain & ground truth for subtrain 89 | ## Choose the subsample of training set. For now, only users with purchases are included. 90 | train_uids, train_iids = valid_graph.find_edges(train_eids_dict[etypes[0]], etype=etypes[0]) 91 | unique_train_uids = np.unique(train_uids) 92 | subtrain_uids = np.random.choice(unique_train_uids, int(len(unique_train_uids) * subtrain_size), replace=False) 93 | ## Fetch uids and iids of subtrain sample for all etypes 94 | subtrain_uids_all = [] 95 | subtrain_iids_all = [] 96 | for etype in train_eids_dict.keys(): 97 | train_uids, train_iids = valid_graph.find_edges(train_eids_dict[etype], etype=etype) 98 | subtrain_eids = [] 99 | for i in range(len(train_eids_dict[etype])): 100 | if train_uids[i].item() in subtrain_uids: 101 | subtrain_eids.append(train_eids_dict[etype][i].item()) 102 | subtrain_uids, subtrain_iids = valid_graph.find_edges(subtrain_eids, etype=etype) 103 | subtrain_uids_all.extend(subtrain_uids.tolist()) 104 | subtrain_iids_all.extend(subtrain_iids.tolist()) 105 | ground_truth_subtrain = (np.array(subtrain_uids_all), np.array(subtrain_iids_all)) 106 | subtrain_uids = np.array(np.unique(subtrain_uids_all)) 107 | 108 | # Generate inference nodes for test 109 | test_uids, _ = ground_truth_test 110 | test_uids = np.unique(test_uids) 111 | all_iids = np.arange(valid_graph.num_nodes('item')) 112 | 113 | return train_graph, train_eids_dict, valid_eids_dict, subtrain_uids, valid_uids, test_uids, \ 114 | all_iids, ground_truth_subtrain, ground_truth_valid, all_eids_dict 115 | 116 | 117 | def generate_dataloaders(valid_graph, 118 | train_graph, 119 | train_eids_dict, 120 | valid_eids_dict, 121 | subtrain_uids, 122 | valid_uids, 123 | test_uids, 124 | all_iids, 125 | fixed_params, 126 | num_workers, 127 | all_sids=None, 128 | embedding_layer: bool = True, 129 | **params, 130 | ): 131 | """ 132 | Since data is large, it is fed to the model in batches. This creates batches for train, valid & test. 133 | 134 | Process: 135 | - Set up 136 | - Fix the number of layers. If there is an explicit embedding layer, we need 1 less layer in the blocks. 137 | - The sampler will generate computation blocks. Currently, only 'full' sampler is used, meaning that all 138 | nodes have all their neighbors, but one could specify 'partial' neighborhood to have only message passing 139 | with a limited number of neighbors. 140 | - The negative sampler generates K negative samples for all positive examples in the batch. 141 | - Edgeloader_train 142 | - All train_eids will be batched, using the training graph. Sampled edge and their reverse etype will be 143 | removed from computation blocks. 144 | - If remove_train_eids, the graph used for sampling will not have the train_eids as edges. (Thus, a 145 | different graph as g_sampling) 146 | - Edgeloader_valid 147 | - All valid_eids will be batched. 148 | - Nodeloaders 149 | - When computing metrics, we want to compute embeddings for all nodes of interest. Thus, we use 150 | a NodeDataLoader instead of an EdgeDataLoader. 151 | - We have a nodeloader for subtrain, validation and test. 
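    As a sketch of how the returned loaders are consumed downstream (see src/train/run.py): edge loaders yield a
    positive graph, a negative graph and the computation blocks per batch, while node loaders yield the input
    nodes, the output nodes and the blocks.

        for _, pos_g, neg_g, blocks in edgeloader_train:
            ...  # score positive & negative edges, compute the loss
        for input_nodes, output_nodes, blocks in nodeloader_test:
            ...  # compute embeddings for output_nodes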
152 | """ 153 | n_layers = params['n_layers'] 154 | if embedding_layer: 155 | n_layers = n_layers - 1 156 | if fixed_params.neighbor_sampler == 'full': 157 | sampler = dgl.dataloading.MultiLayerFullNeighborSampler(n_layers) 158 | elif fixed_params.neighbor_sampler == 'partial': 159 | sampler = dgl.dataloading.MultiLayerNeighborSampler([1, 1, 1], replace=False) 160 | else: 161 | raise KeyError('Neighbor sampler {} not recognized.'.format(fixed_params.neighbor_sampler)) 162 | 163 | sampler_n = dgl.dataloading.negative_sampler.Uniform( 164 | params['neg_sample_size'] 165 | ) 166 | 167 | if fixed_params.remove_train_eids: 168 | edgeloader_train = dgl.dataloading.EdgeDataLoader( 169 | valid_graph, 170 | train_eids_dict, 171 | sampler, 172 | g_sampling=train_graph, 173 | negative_sampler=sampler_n, 174 | batch_size=fixed_params.edge_batch_size, 175 | shuffle=True, 176 | drop_last=False, # Drop last batch if non-full 177 | pin_memory=True, # Helps the transfer to GPU 178 | num_workers=num_workers, 179 | ) 180 | else: 181 | edgeloader_train = dgl.dataloading.EdgeDataLoader( 182 | train_graph, 183 | train_eids_dict, 184 | sampler, 185 | exclude='reverse_types', 186 | reverse_etypes={'buys': 'bought-by', 'bought-by': 'buys', 187 | 'clicks': 'clicked-by', 'clicked-by': 'clicks'}, 188 | negative_sampler=sampler_n, 189 | batch_size=fixed_params.edge_batch_size, 190 | shuffle=True, 191 | drop_last=False, 192 | pin_memory=True, 193 | num_workers=num_workers, 194 | ) 195 | 196 | edgeloader_valid = dgl.dataloading.EdgeDataLoader( 197 | valid_graph, 198 | valid_eids_dict, 199 | sampler, 200 | g_sampling=train_graph, 201 | negative_sampler=sampler_n, 202 | batch_size=fixed_params.edge_batch_size, 203 | shuffle=True, 204 | drop_last=False, 205 | pin_memory=True, 206 | num_workers=num_workers, 207 | ) 208 | 209 | nodeloader_subtrain = dgl.dataloading.NodeDataLoader( 210 | train_graph, 211 | {'user': subtrain_uids, 'item': all_iids}, 212 | sampler, 213 | batch_size=fixed_params.node_batch_size, 214 | shuffle=True, 215 | drop_last=False, 216 | num_workers=num_workers, 217 | ) 218 | 219 | nodeloader_valid = dgl.dataloading.NodeDataLoader( 220 | train_graph, 221 | {'user': valid_uids, 'item': all_iids}, 222 | sampler, 223 | batch_size=fixed_params.node_batch_size, 224 | shuffle=True, 225 | drop_last=False, 226 | num_workers=num_workers, 227 | ) 228 | 229 | test_node_ids = {'user': test_uids, 'item': all_iids} 230 | if 'sport' in valid_graph.ntypes: 231 | test_node_ids['sport'] = all_sids 232 | 233 | nodeloader_test = dgl.dataloading.NodeDataLoader( 234 | valid_graph, 235 | test_node_ids, 236 | sampler, 237 | batch_size=fixed_params.node_batch_size, 238 | shuffle=True, 239 | drop_last=False, 240 | num_workers=num_workers 241 | ) 242 | 243 | return edgeloader_train, edgeloader_valid, nodeloader_subtrain, nodeloader_valid, nodeloader_test 244 | -------------------------------------------------------------------------------- /src/train/run.py: -------------------------------------------------------------------------------- 1 | from datetime import timedelta 2 | import time 3 | 4 | import dgl 5 | import torch 6 | 7 | from src.metrics import get_metrics_at_k 8 | from src.utils import save_txt 9 | 10 | 11 | def train_model(model, 12 | num_epochs, 13 | num_batches_train, 14 | num_batches_val_loss, 15 | edgeloader_train, 16 | edgeloader_valid, 17 | loss_fn, 18 | delta, 19 | neg_sample_size, 20 | use_recency=False, 21 | cuda=False, 22 | device=None, 23 | optimizer=torch.optim.Adam, 24 | lr=0.001, 25 | get_metrics=False, 26 
| train_graph=None, 27 | valid_graph=None, 28 | nodeloader_valid=None, 29 | nodeloader_subtrain=None, 30 | k=None, 31 | out_dim=None, 32 | num_batches_val_metrics=None, 33 | num_batches_subtrain=None, 34 | bought_eids=None, 35 | ground_truth_subtrain=None, 36 | ground_truth_valid=None, 37 | remove_already_bought=True, 38 | result_filepath=None, 39 | start_epoch=0, 40 | patience=5, 41 | pred=None, 42 | use_popularity=False, 43 | weight_popularity=1, 44 | remove_false_negative=False, 45 | embedding_layer=True, 46 | ): 47 | """ 48 | Main function to train a GNN, using max margin loss on positive and negative examples. 49 | 50 | Process: 51 | - A full training epoch 52 | - Batch by batch. 1 batch is composed of multiple computational blocks, required to compute embeddings 53 | for all the nodes related to the edges in the batch. 54 | - Input the initial features. Compute the embeddings & the positive and negative scores 55 | - Also compute other considerations for the loss function: negative mask, recency scores 56 | - Loss is returned, then backward, then step. 57 | - Metrics are computed on the subtraining set (using nodeloader) 58 | - Validation set 59 | - Loss is computed (in model.eval() mode) for validation edge for early stopping purposes 60 | - Also, metrics are computed on the validation set (using nodeloader) 61 | - Logging & early stopping 62 | - Everything is logged, best metrics are saved. 63 | - Using the patience parameter, early stopping is applied when val_loss stops going down. 64 | """ 65 | model.train_loss_list = [] 66 | model.train_precision_list = [] 67 | model.train_recall_list = [] 68 | model.train_coverage_list = [] 69 | model.val_loss_list = [] 70 | model.val_precision_list = [] 71 | model.val_recall_list = [] 72 | model.val_coverage_list = [] 73 | best_metrics = {} # For visualization 74 | max_metric = -0.1 75 | patience_counter = 0 # For early stopping 76 | min_loss = 1.1 77 | 78 | opt = optimizer(model.parameters(), 79 | lr=lr) 80 | 81 | # TRAINING 82 | print('Starting training.') 83 | for epoch in range(start_epoch, num_epochs): 84 | start_time = time.time() 85 | print('TRAINING LOSS') 86 | model.train() # Because if not, after eval, dropout would be still be inactive 87 | i = 0 88 | total_loss = 0 89 | for _, pos_g, neg_g, blocks in edgeloader_train: 90 | opt.zero_grad() 91 | 92 | # Negative mask 93 | negative_mask = {} 94 | if remove_false_negative: 95 | nids = neg_g.ndata[dgl.NID] 96 | for etype in pos_g.canonical_etypes: 97 | neg_src, neg_dst = neg_g.edges(etype=etype) 98 | neg_src = nids[etype[0]][neg_src] 99 | neg_dst = nids[etype[2]][neg_dst] 100 | negative_mask_tensor = valid_graph.has_edges_between(neg_src, neg_dst, etype=etype) 101 | negative_mask[etype] = negative_mask_tensor.type(torch.float) 102 | if cuda: 103 | negative_mask[etype] = negative_mask[etype].to(device) 104 | if cuda: 105 | blocks = [b.to(device) for b in blocks] 106 | pos_g = pos_g.to(device) 107 | neg_g = neg_g.to(device) 108 | 109 | i += 1 110 | if i % 10 == 0: 111 | print("Edge batch {} out of {}".format(i, num_batches_train)) 112 | input_features = blocks[0].srcdata['features'] 113 | # recency (TO BE CLEANED) 114 | recency_scores = None 115 | if use_recency: 116 | recency_scores = pos_g.edata['recency'] 117 | 118 | _, pos_score, neg_score = model(blocks, 119 | input_features, 120 | pos_g, 121 | neg_g, 122 | embedding_layer, 123 | ) 124 | loss = loss_fn(pos_score, 125 | neg_score, 126 | delta, 127 | neg_sample_size, 128 | use_recency=use_recency, 129 | recency_scores=recency_scores, 
130 | remove_false_negative=remove_false_negative, 131 | negative_mask=negative_mask, 132 | cuda=cuda, 133 | device=device, 134 | ) 135 | 136 | if epoch > 0: # For the epoch 0, no training (just report loss) 137 | loss.backward() 138 | opt.step() 139 | total_loss += loss.item() 140 | 141 | if epoch == 0 and i > 10: 142 | break # For the epoch 0, report loss on only subset 143 | 144 | train_avg_loss = total_loss / i 145 | model.train_loss_list.append(train_avg_loss) 146 | 147 | print('VALIDATION LOSS') 148 | model.eval() 149 | with torch.no_grad(): 150 | total_loss = 0 151 | i = 0 152 | for _, pos_g, neg_g, blocks in edgeloader_valid: 153 | i += 1 154 | if i % 10 == 0: 155 | print("Edge batch {} out of {}".format(i, num_batches_val_loss)) 156 | 157 | # Negative mask 158 | negative_mask = {} 159 | if remove_false_negative: 160 | nids = neg_g.ndata[dgl.NID] 161 | for etype in pos_g.canonical_etypes: 162 | neg_src, neg_dst = neg_g.edges(etype=etype) 163 | neg_src = nids[etype[0]][neg_src] 164 | neg_dst = nids[etype[2]][neg_dst] 165 | negative_mask_tensor = valid_graph.has_edges_between(neg_src, neg_dst, etype=etype) 166 | negative_mask[etype] = negative_mask_tensor.type(torch.float) 167 | if cuda: 168 | negative_mask[etype] = negative_mask[etype].to(device) 169 | 170 | if cuda: 171 | blocks = [b.to(device) for b in blocks] 172 | pos_g = pos_g.to(device) 173 | neg_g = neg_g.to(device) 174 | 175 | input_features = blocks[0].srcdata['features'] 176 | _, pos_score, neg_score = model(blocks, 177 | input_features, 178 | pos_g, 179 | neg_g, 180 | embedding_layer, 181 | ) 182 | # recency (TO BE CLEANED) 183 | recency_scores = None 184 | if use_recency: 185 | recency_scores = pos_g.edata['recency'] 186 | 187 | val_loss = loss_fn(pos_score, 188 | neg_score, 189 | delta, 190 | neg_sample_size, 191 | use_recency=use_recency, 192 | recency_scores=recency_scores, 193 | remove_false_negative=remove_false_negative, 194 | negative_mask=negative_mask, 195 | cuda=cuda, 196 | device=device, 197 | ) 198 | total_loss += val_loss.item() 199 | # print(val_loss.item()) 200 | val_avg_loss = total_loss / i 201 | model.val_loss_list.append(val_avg_loss) 202 | 203 | ############ 204 | # METRICS PER EPOCH 205 | if get_metrics and epoch % 10 == 1: 206 | model.eval() 207 | with torch.no_grad(): 208 | # training metrics 209 | print('TRAINING METRICS') 210 | y = get_embeddings(train_graph, 211 | out_dim, 212 | model, 213 | nodeloader_subtrain, 214 | num_batches_subtrain, 215 | cuda, 216 | device, 217 | embedding_layer, 218 | ) 219 | 220 | train_precision, train_recall, train_coverage = get_metrics_at_k(y, 221 | train_graph, 222 | model, 223 | out_dim, 224 | ground_truth_subtrain, 225 | bought_eids, 226 | k, 227 | False, # Remove already bought 228 | cuda, 229 | device, 230 | pred, 231 | use_popularity, 232 | weight_popularity) 233 | 234 | # validation metrics 235 | print('VALIDATION METRICS') 236 | y = get_embeddings(valid_graph, 237 | out_dim, 238 | model, 239 | nodeloader_valid, 240 | num_batches_val_metrics, 241 | cuda, 242 | device, 243 | embedding_layer, 244 | ) 245 | 246 | val_precision, val_recall, val_coverage = get_metrics_at_k(y, 247 | valid_graph, 248 | model, 249 | out_dim, 250 | ground_truth_valid, 251 | bought_eids, 252 | k, 253 | remove_already_bought, 254 | cuda, 255 | device, 256 | pred, 257 | use_popularity, 258 | weight_popularity 259 | ) 260 | sentence = '''Epoch {:05d} || TRAINING Loss {:.5f} | Precision {:.3f}% | Recall {:.3f}% | Coverage {:.2f}% 261 | || VALIDATION Loss {:.5f} | Precision {:.3f}% | 
Recall {:.3f}% | Coverage {:.2f}% '''.format( 262 | epoch, train_avg_loss, train_precision * 100, train_recall * 100, train_coverage * 100, 263 | val_avg_loss, val_precision * 100, val_recall * 100, val_coverage * 100) 264 | print(sentence) 265 | save_txt(sentence, result_filepath, mode='a') 266 | 267 | model.train_precision_list.append(train_precision * 100) 268 | model.train_recall_list.append(train_recall * 100) 269 | model.train_coverage_list.append(train_coverage * 10) 270 | model.val_precision_list.append(val_precision * 100) 271 | model.val_recall_list.append(val_recall * 100) 272 | model.val_coverage_list.append(val_coverage * 10) # just *10 for viz purposes 273 | 274 | # Visualization of best metric 275 | if val_recall > max_metric: 276 | max_metric = val_recall 277 | best_metrics = {'recall': val_recall, 'precision': val_precision, 'coverage': val_coverage} 278 | 279 | else: 280 | sentence = "Epoch {:05d} | Training Loss {:.5f} | Validation Loss {:.5f} | ".format( 281 | epoch, train_avg_loss, val_avg_loss) 282 | print(sentence) 283 | save_txt(sentence, result_filepath, mode='a') 284 | 285 | if val_avg_loss < min_loss: 286 | min_loss = val_avg_loss 287 | patience_counter = 0 288 | else: 289 | patience_counter += 1 290 | if patience_counter == patience: 291 | break 292 | 293 | elapsed = time.time() - start_time 294 | result_to_save = f'Epoch took {timedelta(seconds=elapsed)} \n' 295 | print(result_to_save) 296 | save_txt(result_to_save, result_filepath, mode='a') 297 | 298 | viz = {'train_loss_list': model.train_loss_list, 299 | 'train_precision_list': model.train_precision_list, 300 | 'train_recall_list': model.train_recall_list, 301 | 'train_coverage_list': model.train_coverage_list, 302 | 'val_loss_list': model.val_loss_list, 303 | 'val_precision_list': model.val_precision_list, 304 | 'val_recall_list': model.val_recall_list, 305 | 'val_coverage_list': model.val_coverage_list} 306 | 307 | print('Training completed.') 308 | return model, viz, best_metrics # model will already be to 'cuda' device? 309 | 310 | 311 | def get_embeddings(g, 312 | out_dim: int, 313 | trained_model, 314 | nodeloader_test, 315 | num_batches_valid: int, 316 | cuda: bool = False, 317 | device=None, 318 | embedding_layer: bool = True): 319 | """ 320 | Fetch the embeddings for all the nodes in the nodeloader. 321 | 322 | Nodeloader is preferable when computing embeddings because we can specify which nodes to compute the embedding for, 323 | and only have relevant nodes in the computational blocks. Whereas Edgeloader is preferable for training, because 324 | we generate negative edges also. 325 | """ 326 | if cuda: # model is already on device? 
327 | trained_model = trained_model.to(device) 328 | i2 = 0 329 | y = {ntype: torch.zeros(g.num_nodes(ntype), out_dim) 330 | for ntype in g.ntypes} 331 | if cuda: # not sure if I need to put the 'result' tensor to device 332 | y = {ntype: torch.zeros(g.num_nodes(ntype), out_dim).to(device) 333 | for ntype in g.ntypes} 334 | for input_nodes, output_nodes, blocks in nodeloader_test: 335 | i2 += 1 336 | if i2 % 10 == 0: 337 | print("Computing embeddings: Batch {} out of {}".format(i2, num_batches_valid)) 338 | if cuda: 339 | blocks = [b.to(device) for b in blocks] 340 | input_features = blocks[0].srcdata['features'] 341 | if embedding_layer: 342 | input_features['user'] = trained_model.user_embed(input_features['user']) 343 | input_features['item'] = trained_model.item_embed(input_features['item']) 344 | if 'sport' in input_features.keys(): 345 | input_features['sport'] = trained_model.sport_embed(input_features['sport']) 346 | h = trained_model.get_repr(blocks, input_features) 347 | for ntype in h.keys(): 348 | y[ntype][output_nodes[ntype]] = h[ntype] 349 | return y 350 | -------------------------------------------------------------------------------- /src/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | import pandas as pd 4 | import pickle 5 | 6 | 7 | def save_txt(data_to_save, filepath, mode='a'): 8 | """ 9 | Save text to a file. 10 | """ 11 | with open(filepath, mode) as text_file: 12 | text_file.write(data_to_save + '\n') 13 | 14 | 15 | def save_outputs(files_to_save: dict, 16 | folder_path): 17 | """ 18 | Save objects as pickle files, in a given folder. 19 | """ 20 | for name, file in files_to_save.items(): 21 | with open(folder_path + name + '.pkl', 'wb') as f: 22 | pickle.dump(file, f) 23 | 24 | 25 | def get_last_checkpoint(): 26 | """ 27 | Fetch path of last checkpoint available in the root folder, based on the date in the filename. 28 | """ 29 | logdir = '.' 30 | logfiles = sorted([f for f in os.listdir(logdir) if f.startswith('checkpoint')]) 31 | checkpoint_path = logfiles[-1] 32 | return checkpoint_path 33 | 34 | 35 | def read_data(file_path): 36 | """ 37 | Generic function to read any kind of data. Extensions supported: '.gz', '.csv', '.pkl' 38 | """ 39 | if file_path.endswith('.gz'): 40 | obj = pd.read_csv(file_path, compression='gzip', 41 | header=0, sep=';', quotechar='"', 42 | error_bad_lines=False) 43 | elif file_path.endswith('.csv'): 44 | obj = pd.read_csv(file_path) 45 | elif file_path.endswith('.pkl'): 46 | with open(file_path, 'rb') as handle: 47 | obj = pickle.load(handle) 48 | else: 49 | raise KeyError('File extension of {} not recognized.'.format(file_path)) 50 | return obj 51 | 52 | 53 | def softmax(x): 54 | """ 55 | (Currently not used.) Compute softmax values for each sets of scores in x. 
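    For example (values rounded):

        softmax(np.array([1.0, 2.0]))  # -> array([0.269, 0.731])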
56 | """
57 | e_x = np.exp(x - np.max(x))
58 | return e_x / e_x.sum()
59 |
-------------------------------------------------------------------------------- /src/utils_data.py: --------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import torch
4 |
5 | from src.builder import (create_ids, df_to_adjacency_list,
6 | format_dfs, import_features)
7 |
8 |
9 |
10 | class DataPaths:
11 | def __init__(self):
12 | self.result_filepath = 'TXT FILE WHERE TO LOG THE RESULTS .txt'
13 | self.sport_feat_path = 'FEATURE DATASET, SPORTS (sport names) .csv'
14 | self.train_path = 'INTERACTION LIST, USER-ITEM (Train dataset).csv'
15 | self.test_path = 'INTERACTION LIST, USER-ITEM (Test dataset).csv'
16 | self.item_sport_path = 'INTERACTION LIST, ITEM-SPORT .csv'
17 | self.user_sport_path = 'INTERACTION LIST, USER-SPORT .csv'
18 | self.sport_sportg_path = 'INTERACTION LIST, SPORT-SPORT .csv'
19 | self.item_feat_path = 'FEATURE DATASET, ITEMS .csv'
20 | self.user_feat_path = 'FEATURE DATASET, USERS.csv'
21 | self.sport_onehot_path = 'FEATURE DATASET, SPORTS (one-hot vectors) .csv'
22 |
23 | class FixedParameters:
24 | def __init__(self, num_epochs, start_epoch, patience, edge_batch_size,
25 | remove, item_id_type, duplicates):
26 | """
27 | All parameters that are fixed, i.e. not part of the hyperparametrization.
28 |
29 | Attributes
30 | ----------
31 | ctm_id_type :
32 | Identifier for the customers.
33 | Days_of_purchases (Days_of_clicks) :
34 | Number of days of purchases (clicks) that should be kept in the dataset.
35 | Intuition is that interactions of 12+ months ago might not be relevant. Max is 710 days.
36 | Users that do not have any remaining interactions will be fed recommendations from another
37 | model.
38 | Discern_clicks :
39 | Clicks and purchases will be considered as 2 different edge types
40 | Duplicates :
41 | Determines how to handle duplicates in the training set. 'count_occurrence' will drop all
42 | duplicates except the last, and the number of interactions will be stored in the edge feature.
43 | If duplicates == 'count_occurrence', aggregator_type needs to handle the edge feature. 'keep_last'
44 | will drop all duplicates except the last. 'keep_all' will keep all duplicates.
45 | Explore :
46 | Print examples of recommendations and of similar sports
47 | Include_sport :
48 | Sports will be included in the graph, with 6 more relation types: user-practices-sport,
49 | item-utilized-for-sport, sport-belongs-to-sport (and all their reverse relation types)
50 | item_id_type :
51 | Identifier for the items. Can be SPECIFIC ITEM IDENTIFIER (e.g. item SKU) or GENERIC ITEM IDENTIFIER
52 | (e.g. item family ID)
53 | Lifespan_of_items :
54 | Number of days since its most recent transaction for an item to be considered by the
55 | model. Max is 710 days. Won't make a difference if it is > Days_of_purchases.
56 | Num_choices :
57 | Number of examples of recommendations and similar sports to print
58 | Patience :
59 | Number of epochs to wait for early stopping
60 | Pred :
61 | Function that takes the embeddings of a user and an item as input, and outputs a rating. Choices: 'cos' for cosine
62 | similarity, 'nn' for a multilayer perceptron with a sigmoid function at the end
63 | Start_epoch :
64 | Load model from a previous epoch
65 | Train_on_clicks :
66 | When parametrizing the GNN, edges of purchases are always included.
If true, clicks will also 67 | be included 68 | """ 69 | self.ctm_id_type = 'CUSTOMER IDENTIFIER' 70 | self.days_of_purchases = 365 # Max is 710 71 | self.days_of_clicks = 30 # Max is 710 72 | self.discern_clicks = True 73 | self.duplicates = duplicates # 'keep_last', 'keep_all', 'count_occurrence' 74 | self.edge_batch_size = edge_batch_size 75 | self.etype = [('user', 'buys', 'item')] 76 | if self.discern_clicks: 77 | self.etype.append(('user', 'clicks', 'item')) 78 | self.explore = True 79 | self.include_sport = True 80 | self.item_id_type = item_id_type 81 | self.k = 10 82 | self.lifespan_of_items = 180 83 | self.neighbor_sampler = 'full' 84 | self.node_batch_size = 128 85 | self.num_choices = 10 86 | self.num_epochs = num_epochs 87 | self.optimizer = torch.optim.Adam 88 | self.patience = patience 89 | self.pred = 'cos' 90 | self.remove = remove 91 | self.remove_false_negative = True 92 | self.remove_on_inference = .7 93 | self.remove_train_eids = False 94 | self.report_model_coverage = False 95 | self.reverse_etype = {('user', 'buys', 'item'): ('item', 'bought-by', 'user')} 96 | if self.discern_clicks: 97 | self.reverse_etype[('user', 'clicks', 'item')] = ('item', 'clicked-by', 'user') 98 | self.run_inference = 1 99 | self.spt_id_type = 'sport_id' 100 | self.start_epoch = start_epoch 101 | self.subtrain_size = 0.05 102 | self.train_on_clicks = True 103 | self.valid_size = 0.05 104 | # self.dropout = .5 # HP 105 | # self.norm = False # HP 106 | # self.use_popularity = False # HP 107 | # self.days_popularity = 0 # HP 108 | # self.weight_popularity = 0. # HP 109 | # self.use_recency = False # HP 110 | # self.aggregator_type = 'mean_nn_edge' # HP 111 | # self.aggregator_hetero = 'sum' # HP 112 | # self.purchases_sample = .5 # HP 113 | # self.clicks_sample = .4 # HP 114 | # self.embedding_layer = False # HP 115 | # self.edge_update = True # Removed implementation; not useful 116 | # self.automatic_precision = False # Removed implementation; not useful 117 | 118 | 119 | class DataLoader: 120 | """Data loading, cleaning and pre-processing.""" 121 | 122 | def __init__(self, data_paths, fixed_params): 123 | self.data_paths = data_paths 124 | ( 125 | self.user_item_train, 126 | self.user_item_test, 127 | self.item_sport_interaction, 128 | self.user_sport_interaction, 129 | self.sport_sportg_interaction, 130 | self.item_feat_df, 131 | self.user_feat_df, 132 | self.sport_feat_df, 133 | self.sport_onehot_df, 134 | ) = format_dfs( 135 | self.data_paths.train_path, 136 | self.data_paths.test_path, 137 | self.data_paths.item_sport_path, 138 | self.data_paths.user_sport_path, 139 | self.data_paths.sport_sportg_path, 140 | self.data_paths.item_feat_path, 141 | self.data_paths.user_feat_path, 142 | self.data_paths.sport_feat_path, 143 | self.data_paths.sport_onehot_path, 144 | fixed_params.remove, 145 | fixed_params.ctm_id_type, 146 | fixed_params.item_id_type, 147 | fixed_params.days_of_purchases, 148 | fixed_params.days_of_clicks, 149 | fixed_params.lifespan_of_items, 150 | fixed_params.report_model_coverage, 151 | ) 152 | if fixed_params.report_model_coverage: 153 | print('Reporting model coverage') 154 | (_, _, _, _, _, _, _, _ 155 | ) = format_dfs( 156 | self.data_paths.train_path, 157 | self.data_paths.test_path, 158 | self.data_paths.item_sport_path, 159 | self.data_paths.user_sport_path, 160 | self.data_paths.sport_sportg_path, 161 | self.data_paths.item_feat_path, 162 | self.data_paths.user_feat_path, 163 | self.data_paths.sport_feat_path, 164 | 0, # remove 0 165 | fixed_params.ctm_id_type, 166 
| fixed_params.item_id_type, 167 | fixed_params.days_of_purchases, 168 | fixed_params.days_of_clicks, 169 | fixed_params.lifespan_of_items, 170 | fixed_params.report_model_coverage, 171 | ) 172 | 173 | self.ctm_id, self.pdt_id, self.spt_id = create_ids( 174 | self.user_item_train, 175 | self.user_sport_interaction, 176 | self.sport_sportg_interaction, 177 | self.item_feat_df, 178 | item_id_type=fixed_params.item_id_type, 179 | ctm_id_type=fixed_params.ctm_id_type, 180 | spt_id_type=fixed_params.spt_id_type, 181 | ) 182 | 183 | ( 184 | self.adjacency_dict, 185 | self.ground_truth_test, 186 | self.ground_truth_purchase_test, 187 | self.user_item_train_grouped, # Will be grouped if duplicates != 'keep_all'. Used for recency edge feature 188 | ) = df_to_adjacency_list( 189 | self.user_item_train, 190 | self.user_item_test, 191 | self.item_sport_interaction, 192 | self.user_sport_interaction, 193 | self.sport_sportg_interaction, 194 | self.ctm_id, 195 | self.pdt_id, 196 | self.spt_id, 197 | item_id_type=fixed_params.item_id_type, 198 | ctm_id_type=fixed_params.ctm_id_type, 199 | spt_id_type=fixed_params.spt_id_type, 200 | discern_clicks=fixed_params.discern_clicks, 201 | duplicates=fixed_params.duplicates, 202 | ) 203 | 204 | if fixed_params.discern_clicks: 205 | self.graph_schema = { 206 | ('user', 'buys', 'item'): 207 | list(zip(self.adjacency_dict['purchases_src'], self.adjacency_dict['purchases_dst'])), 208 | ('item', 'bought-by', 'user'): 209 | list(zip(self.adjacency_dict['purchases_dst'], self.adjacency_dict['purchases_src'])), 210 | ('user', 'clicks', 'item'): 211 | list(zip(self.adjacency_dict['clicks_src'], self.adjacency_dict['clicks_dst'])), 212 | ('item', 'clicked-by', 'user'): 213 | list(zip(self.adjacency_dict['clicks_dst'], self.adjacency_dict['clicks_src'])), 214 | } 215 | else: 216 | self.graph_schema = { 217 | ('user', 'buys', 'item'): 218 | list(zip(self.adjacency_dict['user_item_src'], self.adjacency_dict['user_item_dst'])), 219 | ('item', 'bought-by', 'user'): 220 | list(zip(self.adjacency_dict['user_item_dst'], self.adjacency_dict['user_item_src'])), 221 | } 222 | if fixed_params.include_sport: 223 | self.graph_schema.update( 224 | { 225 | ('item', 'utilized-for', 'sport'): 226 | list(zip(self.adjacency_dict['item_sport_src'], self.adjacency_dict['item_sport_dst'])), 227 | ('sport', 'utilizes', 'item'): 228 | list(zip(self.adjacency_dict['item_sport_dst'], self.adjacency_dict['item_sport_src'])), 229 | ('user', 'practices', 'sport'): 230 | list(zip(self.adjacency_dict['user_sport_src'], self.adjacency_dict['user_sport_dst'])), 231 | ('sport', 'practiced-by', 'user'): 232 | list(zip(self.adjacency_dict['user_sport_dst'], self.adjacency_dict['user_sport_src'])), 233 | ('sport', 'belongs-to', 'sport'): 234 | list(zip(self.adjacency_dict['sport_sportg_src'], self.adjacency_dict['sport_sportg_dst'])), 235 | ('sport', 'includes', 'sport'): 236 | list(zip(self.adjacency_dict['sport_sportg_dst'], self.adjacency_dict['sport_sportg_src'])), 237 | } 238 | ) 239 | 240 | 241 | def assign_graph_features(graph, 242 | fixed_params, 243 | data, 244 | **params, 245 | ): 246 | """ 247 | Assigns features to graph nodes and edges, based on data previously provided in the dataloader. 248 | 249 | Parameters 250 | ---------- 251 | graph: 252 | Graph of type dgl.DGLGraph, with all the nodes & edges. 253 | fixed_params: 254 | All fixed parameters. The only fixed params used are related to id types and occurrences. 
255 | data: 256 | Object that contains node feature dataframes, ID mapping dataframes and user item interactions. 257 | params: 258 | Parameters used in this function include popularity & recency hyperparameters. 259 | 260 | Returns 261 | ------- 262 | graph: 263 | The input graph but with features assigned to its nodes and edges. 264 | """ 265 | # Assign features 266 | features_dict = import_features( 267 | graph, 268 | data.user_feat_df, 269 | data.item_feat_df, 270 | data.sport_onehot_df, 271 | data.ctm_id, 272 | data.pdt_id, 273 | data.spt_id, 274 | data.user_item_train, 275 | params['use_popularity'], 276 | params['days_popularity'], 277 | fixed_params.item_id_type, 278 | fixed_params.ctm_id_type, 279 | fixed_params.spt_id_type, 280 | ) 281 | 282 | graph.nodes['user'].data['features'] = features_dict['user_feat'] 283 | graph.nodes['item'].data['features'] = features_dict['item_feat'] 284 | if 'sport' in graph.ntypes: 285 | graph.nodes['sport'].data['features'] = features_dict['sport_feat'] 286 | 287 | # add date as edge feature 288 | if params['use_recency']: 289 | df = data.user_item_train_grouped 290 | df['max_date'] = max(df.hit_date) 291 | df['days_recency'] = (pd.to_datetime(df.max_date) - pd.to_datetime(df.hit_date)).dt.days + 1 292 | if fixed_params.discern_clicks: 293 | recency_tensor_buys = torch.tensor(df[df.buy == 1].days_recency.values) 294 | recency_tensor_clicks = torch.tensor(df[df.buy == 0].days_recency.values) 295 | graph.edges['buys'].data['recency'] = recency_tensor_buys 296 | graph.edges['bought-by'].data['recency'] = recency_tensor_buys 297 | graph.edges['clicks'].data['recency'] = recency_tensor_clicks 298 | graph.edges['clicked-by'].data['recency'] = recency_tensor_clicks 299 | else: 300 | recency_tensor = torch.tensor(df.days_recency.values) 301 | graph.edges['buys'].data['recency'] = recency_tensor 302 | graph.edges['bought-by'].data['recency'] = recency_tensor 303 | 304 | if params['use_popularity']: 305 | graph.nodes['item'].data['popularity'] = features_dict['item_pop'] 306 | 307 | if fixed_params.duplicates == 'count_occurrence': 308 | if fixed_params.discern_clicks: 309 | graph.edges['clicks'].data['occurrence'] = torch.tensor(data.adjacency_dict['clicks_num']) 310 | graph.edges['clicked-by'].data['occurrence'] = torch.tensor(data.adjacency_dict['clicks_num']) 311 | graph.edges['buys'].data['occurrence'] = torch.tensor(data.adjacency_dict['purchases_num']) 312 | graph.edges['bought-by'].data['occurrence'] = torch.tensor(data.adjacency_dict['purchases_num']) 313 | else: 314 | graph.edges['buys'].data['occurrence'] = torch.tensor(data.adjacency_dict['user_item_num']) 315 | graph.edges['bought-by'].data['occurrence'] = torch.tensor(data.adjacency_dict['user_item_num']) 316 | 317 | return graph 318 | 319 | 320 | 321 | 322 | 323 | -------------------------------------------------------------------------------- /src/utils_inference.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | from dgl.data.utils import load_graphs 4 | 5 | 6 | def read_graph(graph_path): 7 | """ 8 | Read graph data from path. 9 | """ 10 | graph_list, _ = load_graphs(graph_path) 11 | graph = graph_list[0] 12 | return graph 13 | 14 | 15 | def fetch_uids(user_ids, 16 | ctm_id_df): 17 | """ 18 | Maps the Organisation user_ids into node_ids that are used in the graph. 
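    For illustration, assuming a hypothetical ctm_id_df with columns 'CUSTOMER IDENTIFIER' and 'ctm_new_id'
    (e.g. 'user_abc' -> 0, 'user_def' -> 1):

        fetch_uids(['user_abc', 'user_unknown'], ctm_id_df)
        # -> array([0]), and prints that 1 user id provided had no node id in the graph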
19 | """ 20 | user_df = pd.DataFrame(user_ids, columns=['old_id']) 21 | user_df = user_df.merge(ctm_id_df, how='inner', left_on='old_id', right_on='CUSTOMER IDENTIFIER') 22 | new_uids_list = user_df.ctm_new_id.values 23 | if len(user_ids) != len(new_uids_list): 24 | print(f'{len(user_ids)-len(new_uids_list)} user ids provided had no node ids in the graph.') 25 | return new_uids_list 26 | 27 | 28 | def postprocess_recs(recs, 29 | pdt_id_df, 30 | ctm_id_df, 31 | pdt_id_type, 32 | ctm_id_type, ): 33 | """ 34 | Transforms node_ids for user and item into Organisation user_ids and item_ids 35 | (e.g.CUSTOMER IDENTIFIER and ITEM IDENTIFIER) 36 | """ 37 | processed_recs = {ctm_id_df[ctm_id_df.ctm_new_id == key][ctm_id_type].item(): 38 | [pdt_id_df[pdt_id_df.pdt_new_id == iid][pdt_id_type].item() for iid in value_list] 39 | for key, value_list in recs.items()} 40 | return processed_recs 41 | -------------------------------------------------------------------------------- /src/utils_vizualization.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | from datetime import datetime 3 | import textwrap 4 | 5 | import numpy as np 6 | 7 | 8 | def plot_train_loss(hp_sentence, viz): 9 | """ 10 | Visualize train & validation loss & metrics. hp_sentence is used as the title of the plot. 11 | 12 | Saves plots in the plots folder. 13 | """ 14 | if 'val_loss_list' in viz.keys(): 15 | fig = plt.figure() 16 | x = np.arange(len(viz['train_loss_list'])) 17 | plt.title('\n'.join(textwrap.wrap(hp_sentence, 60))) 18 | fig.tight_layout() 19 | plt.rcParams["axes.titlesize"] = 6 20 | plt.plot(x, viz['train_loss_list']) 21 | plt.plot(x, viz['val_loss_list']) 22 | plt.legend(['training loss', 'valid loss'], loc='upper left') 23 | plt.savefig('plots/' + str(datetime.now())[:-10] + 'loss.png') 24 | plt.close(fig) 25 | 26 | if 'val_recall_list' in viz.keys(): 27 | fig = plt.figure() 28 | x = np.arange(len(viz['train_precision_list'])) 29 | plt.title('\n'.join(textwrap.wrap(hp_sentence, 60))) 30 | fig.tight_layout() 31 | plt.rcParams["axes.titlesize"] = 6 32 | plt.plot(x, viz['train_precision_list']) 33 | plt.plot(x, viz['train_recall_list']) 34 | plt.plot(x, viz['train_coverage_list']) 35 | plt.plot(x, viz['val_precision_list']) 36 | plt.plot(x, viz['val_recall_list']) 37 | plt.plot(x, viz['val_coverage_list']) 38 | plt.legend(['training precision', 'training recall', 'training coverage/10', 39 | 'valid precision', 'valid recall', 'valid coverage/10'], loc='upper left') 40 | plt.savefig('plots/' + str(datetime.now())[:-10] + 'metrics.png') 41 | plt.close(fig) 42 | --------------------------------------------------------------------------------
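A minimal usage sketch for plot_train_loss, with made-up values (in practice, viz is the dictionary returned by
train_model in src/train/run.py, the hp_sentence string is free text used as the plot title, and figures are
written to the plots/ folder):

from src.utils_vizualization import plot_train_loss

viz = {'train_loss_list': [0.9, 0.7, 0.6],
       'val_loss_list': [0.95, 0.82, 0.78],
       'train_precision_list': [1.2, 1.8, 2.1],
       'train_recall_list': [3.0, 4.5, 5.2],
       'train_coverage_list': [0.4, 0.6, 0.7],
       'val_precision_list': [1.0, 1.5, 1.9],
       'val_recall_list': [2.8, 4.1, 4.9],
       'val_coverage_list': [0.3, 0.5, 0.6]}
plot_train_loss('aggregator_type=mean, hidden_dim=256, lr=0.001', viz)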