├── .gitignore ├── README.md ├── UseCases.ipynb ├── data └── .keepdir ├── inference_hp.py ├── logging_config.py ├── main.py ├── main_inference.py ├── main_train.py ├── models └── .keepdir ├── outputs └── .keepdir ├── plots └── .keepdir ├── presplit.py ├── requirements.txt └── src ├── builder.py ├── evaluation.py ├── metrics.py ├── model.py ├── sampling.py ├── train └── run.py ├── utils.py ├── utils_data.py ├── utils_inference.py └── utils_vizualization.py /.gitignore: -------------------------------------------------------------------------------- 1 | #Added by user 2 | .idea/ 3 | .DS_Store 4 | 5 | # Byte-compiled / optimized / DLL files 6 | __pycache__/ 7 | *.py[cod] 8 | *$py.class 9 | 10 | # C extensions 11 | *.so 12 | 13 | # Distribution / packaging 14 | .Python 15 | build/ 16 | develop-eggs/ 17 | dist/ 18 | downloads/ 19 | eggs/ 20 | .eggs/ 21 | lib/ 22 | lib64/ 23 | parts/ 24 | sdist/ 25 | var/ 26 | wheels/ 27 | pip-wheel-metadata/ 28 | share/python-wheels/ 29 | *.egg-info/ 30 | .installed.cfg 31 | *.egg 32 | MANIFEST 33 | 34 | # PyInstaller 35 | # Usually these files are written by a python script from a template 36 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 37 | *.manifest 38 | *.spec 39 | 40 | # Installer logs 41 | pip-log.txt 42 | pip-delete-this-directory.txt 43 | 44 | # Unit test / coverage reports 45 | htmlcov/ 46 | .tox/ 47 | .nox/ 48 | .coverage 49 | .coverage.* 50 | .cache 51 | nosetests.xml 52 | coverage.xml 53 | *.cover 54 | *.py,cover 55 | .hypothesis/ 56 | .pytest_cache/ 57 | 58 | # Translations 59 | *.mo 60 | *.pot 61 | 62 | # Django stuff: 63 | *.log 64 | local_settings.py 65 | db.sqlite3 66 | db.sqlite3-journal 67 | 68 | # Flask stuff: 69 | instance/ 70 | .webassets-cache 71 | 72 | # Scrapy stuff: 73 | .scrapy 74 | 75 | # Sphinx documentation 76 | docs/_build/ 77 | 78 | # PyBuilder 79 | target/ 80 | 81 | # Jupyter Notebook 82 | .ipynb_checkpoints 83 | 84 | # IPython 85 | profile_default/ 86 | ipython_config.py 87 | 88 | # pyenv 89 | .python-version 90 | 91 | # pipenv 92 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 93 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 94 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 95 | # install all needed dependencies. 96 | #Pipfile.lock 97 | 98 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 99 | __pypackages__/ 100 | 101 | # Celery stuff 102 | celerybeat-schedule 103 | celerybeat.pid 104 | 105 | # SageMath parsed files 106 | *.sage.py 107 | 108 | # Environments 109 | .env 110 | .venv 111 | env/ 112 | venv/ 113 | ENV/ 114 | env.bak/ 115 | venv.bak/ 116 | 117 | # Spyder project settings 118 | .spyderproject 119 | .spyproject 120 | 121 | # Rope project settings 122 | .ropeproject 123 | 124 | # mkdocs documentation 125 | /site 126 | 127 | # mypy 128 | .mypy_cache/ 129 | .dmypy.json 130 | dmypy.json 131 | 132 | # Pyre type checker 133 | .pyre/ 134 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # GNN-RecSys 2 | *This project was presented in a [40min talk + Q&A available on Youtube](https://www.youtube.com/watch?v=hvTawbQnK_w) and in a [Medium blog post](https://medium.com/decathlondevelopers/building-a-recommender-system-using-graph-neural-networks-2ee5fc4e706d)* 3 | 4 | **Graph Neural Networks for Recommender Systems**\ 5 | This repository contains code to train and test GNN models for recommendation, mainly using the Deep Graph Library 6 | ([DGL](https://docs.dgl.ai/)). 7 | 8 | 9 | **What kind of recommendation?**\ 10 | For example, an organisation might want to recommend items of interest to all users of its ecommerce platforms. 11 | 12 | **How can this repository be used?**\ 13 | This repository is aimed at helping users who wish to experiment with GNNs for recommendation, by giving a real example of code 14 | to build a GNN model, train it and serve recommendations. 15 | 16 | No training data, experiment logs, or trained models are available in this repository. 17 | 18 | **What should the data look like?**\ 19 | To run the code, users need multiple data sources, notably interaction data between users and items, as well as features of users and items. 20 | 21 | The interaction data sources should be adjacency lists. Here is an example: 22 | 23 | | customer_id | item_id | timestamp | click | purchase | 24 | |-------------------|------------------|------------|-------|----------| 25 | | imbvblxwvtiywunh | 3384934262863770 | 2018-01-01 | 0 | 1 | 26 | | nzhrkquelkgflone | 8321263216904593 | 2018-01-01 | 1 | 0 | 27 | | ... | ... | ... | ... | ... | 28 | | cgatomzvjiizvctb | 2756920171861146 | 2019-12-31 | 1 | 0 | 29 | | cnspkotxubxnxtzk | 5150255386059428 | 2019-12-31 | 0 | 1 | 30 | 31 | The feature data should contain a node identifier and node features: 32 | | customer_id | is_male | is_female | 33 | |-------------------|---------|-----------| 34 | | imbvblxwvtiywunh | 0 | 1 | 35 | | nzhrkquelkgflone | 1 | 0 | 36 | | ... | ... | ... | 37 | | cgatomzvjiizvctb | 0 | 1 | 38 | | cnspkotxubxnxtzk | 0 | 1 | 39 | 40 | ## Run the code 41 | There are 3 different usages of the code: hyperparametrization, training and inference. 42 | Examples of how to run the code are presented in UseCases.ipynb. 43 | 44 | All 3 usages require specific files to be available. Please refer to the docstrings to 45 | see which files are required. 46 | 47 | ### Hyperparametrization 48 | 49 | Hyperparametrization is done using the main.py file. 50 | Going through the space of hyperparameters, the loop builds a GNN model, trains it on a sample of training data, and computes its performance metrics. 51 | The metrics are reported in a result txt file, and the best model's parameters are saved in the models directory; the sketch below outlines this search loop.
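Under the hood, the search is a Bayesian optimisation run with scikit-optimize's `gp_minimize`. The snippet below is a simplified excerpt of main.py, not a standalone script: it assumes the `data`, `fixed_params` and `data_paths` objects that main.py builds before launching the search.

```python
from skopt import gp_minimize
from skopt.callbacks import CheckpointSaver
from skopt.utils import use_named_args

searchable_params = SearchableHyperparameters()  # search space defined in main.py

@use_named_args(dimensions=searchable_params.dimensions)
def fitness(**params):
    # Build the graph, train a ConvModel and evaluate it for this hyperparameter combination
    recall = train(data=data, fixed_params=fixed_params, data_paths=data_paths,
                   visualization=True, check_embedding=True, **params)
    return -recall  # skopt minimizes, so a better recall means a lower objective

checkpoint_saver = CheckpointSaver('checkpoint.pkl', compress=9)  # allows resuming the search later
search_result = gp_minimize(func=fitness,
                            dimensions=searchable_params.dimensions,
                            n_calls=200,
                            acq_func='EI',
                            x0=searchable_params.default_parameters,
                            callback=[checkpoint_saver],
                            random_state=46)
```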
52 | Plots of the training experiments are saved in the plots directory. 53 | Examples of recommendations are saved in the outputs directory. 54 | ```bash 55 | python main.py --from_beginning -v --visualization --check_embedding --remove 0.85 --num_epochs 100 --patience 5 --edge_batch_size 1024 --item_id_type 'ITEM IDENTIFIER' --duplicates 'keep_all' 56 | ``` 57 | Refer to the docstrings of main.py for details on parameters. 58 | 59 | ### Training 60 | 61 | Once the hyperparameters are selected, it is possible to train the chosen GNN model on all the available data. 62 | This process saves the trained model in the models directory. Plots, training logs, and examples of recommendations are also saved. 63 | ```bash 64 | python main_train.py --fixed_params_path test/fixed_params_example.pkl --params_path test/params_example.pkl --visualization --check_embedding --remove .85 --edge_batch_size 512 65 | ``` 66 | Refer to the docstrings of main_train.py for details on parameters. 67 | 68 | ### Inference 69 | With a trained model, it is possible to generate recommendations for all users or for specific users. 70 | Examples of recommendations are printed. 71 | ```bash 72 | python main_inference.py --params_path test/final_params_example.pkl --user_ids 123456 \ 73 | --user_ids 654321 --user_ids 999 \ 74 | --trained_model_path test/final_model_trained_example.pth --k 10 --remove .99 75 | ``` 76 | Refer to the docstrings of main_inference.py for details on parameters. 77 | 78 | -------------------------------------------------------------------------------- /UseCases.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "oFIozaUExVEL" 7 | }, 8 | "source": [ 9 | "# Mount the drive & download required packages\n", 10 | "This notebook was made for Colab usage. If running locally, the next cell can be omitted."
11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "!pip install dgl-cu101\n", 20 | "!pip install scikit-optimize\n", 21 | "!pip install boto3\n", 22 | "from google.colab import drive\n", 23 | "drive.mount('/content/drive')\n", 24 | "import sys\n", 25 | "sys.path.append('/content/drive/My Drive/Code/')\n", 26 | "%cd /content/drive/My\\ Drive/Code/\n", 27 | "\n", 28 | "from torch.multiprocessing import Pool, Process, set_start_method\n", 29 | "try:\n", 30 | " set_start_method('spawn')\n", 31 | "except RuntimeError:\n", 32 | " pass" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": { 38 | "id": "0vYgFk5KxHBJ" 39 | }, 40 | "source": [ 41 | "# Use case 1 : Hyperparametrization" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "!python main.py --from_beginning -v --visualization --check_embedding --remove 0.85 --num_epochs 100 --patience 5 --edge_batch_size 1024 --item_id_type 'ITEM IDENTIFIER' --duplicates 'keep_all'" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": { 56 | "id": "BHekS5cQxjGZ" 57 | }, 58 | "source": [ 59 | "# Use case 2 : Full training" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "!python main_train.py --fixed_params_path test/fixed_params_example.pkl --params_path test/params_example.pkl --visualization --check_embedding --remove .85 --edge_batch_size 512" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": { 74 | "id": "vZeLHdtTxjfT" 75 | }, 76 | "source": [ 77 | "# Use case 3 : Inference" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": { 83 | "id": "eGkv8ffZ4y26" 84 | }, 85 | "source": [ 86 | "## 3.1 : Specific users, creating the graph" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": null, 92 | "metadata": {}, 93 | "outputs": [], 94 | "source": [ 95 | "!python main_inference.py --params_path test/final_params_example.pkl --user_ids 123456 \\\n", 96 | "--user_ids 654321 --user_ids 999 \\\n", 97 | "--trained_model_path test/final_model_trained_example.pth --k 10 --remove .99" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": { 103 | "id": "qlZ-rbWW46Ue" 104 | }, 105 | "source": [ 106 | "## 3.1 : All users, importing the graph" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": { 113 | "pycharm": { 114 | "name": "#%%\n" 115 | } 116 | }, 117 | "outputs": [], 118 | "source": [ 119 | "!python main_inference.py --params_path test/final_params_example.pkl \\\n", 120 | "--user_ids all --use_saved_graph --graph_path test/final_graph_example.bin --ctm_id_path test/final_ctm_id_example.pkl \\\n", 121 | "--pdt_id_path test/final_pdt_id_example.pkl --trained_model_path test/final_model_trained_example.pth \\\n", 122 | "--k 10 --remove 0" 123 | ] 124 | } 125 | ], 126 | "metadata": { 127 | "accelerator": "GPU", 128 | "colab": { 129 | "collapsed_sections": [], 130 | "machine_shape": "hm", 131 | "name": "UseCases.ipynb", 132 | "provenance": [] 133 | }, 134 | "kernelspec": { 135 | "display_name": "Python 3", 136 | "language": "python", 137 | "name": "python3" 138 | }, 139 | "language_info": { 140 | "codemirror_mode": { 141 | "name": "ipython", 142 | "version": 3 143 | }, 144 | "file_extension": ".py", 145 | "mimetype": "text/x-python", 146 | 
"name": "python", 147 | "nbconvert_exporter": "python", 148 | "pygments_lexer": "ipython3", 149 | "version": "3.8.3" 150 | } 151 | }, 152 | "nbformat": 4, 153 | "nbformat_minor": 4 154 | } -------------------------------------------------------------------------------- /data/.keepdir: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /inference_hp.py: -------------------------------------------------------------------------------- 1 | import math 2 | 3 | import numpy as np 4 | import torch 5 | 6 | from src.utils_data import DataLoader, DataPaths, assign_graph_features 7 | 8 | from src.builder import (create_graph) 9 | from src.model import ConvModel 10 | from src.sampling import train_valid_split, generate_dataloaders 11 | from src.metrics import get_metrics_at_k 12 | from src.train.run import get_embeddings 13 | from src.utils import save_txt, read_data 14 | 15 | cuda = torch.cuda.is_available() 16 | device = torch.device('cuda') if cuda else torch.device('cpu') 17 | num_workers = 4 if cuda else 0 18 | 19 | def inference_fn(trained_model, 20 | remove, 21 | fixed_params, 22 | overwrite_fixed_params=False, 23 | days_of_purchases=710, 24 | days_of_clicks=710, 25 | lifespan_of_items=710, 26 | **params): 27 | """ 28 | Function to run inference inside the hyperparameter loop and calculate metrics. 29 | 30 | Parameters 31 | ---------- 32 | trained_model: 33 | Model trained during training of hyperparameter loop. 34 | remove: 35 | Percentage of data removed. See src.utils_data for more details. 36 | fixed_params: 37 | All parameters used during training of hyperparameter loop. See src.utils_data for more details. 38 | overwrite_fixed_params: 39 | If true, training parameters will overwritten by the parameters below. Can be useful if need to test the model 40 | on different parameters, e.g. that includes older clicks or purchases. 41 | days_of_purchases, days_of_clicks, lifespan_of_items: 42 | All parameters that can overwrite the training parameters. Only useful if overwrite_fixed_params is True. 43 | params: 44 | All other parameters used during training. 45 | 46 | Returns 47 | ------- 48 | recall: 49 | Recall on the test set. Relevant to compare with recall computed on hyperparametrization test set (since 50 | parameters like 'remove' and all overwritable parameters are different) 51 | 52 | Saves to file 53 | ------------- 54 | Metrics computed on the test set. 
55 | """ 56 | # Import parameters 57 | if isinstance(fixed_params, str): 58 | path = fixed_params 59 | fixed_params = read_data(path) 60 | class objectview(object): 61 | def __init__(self, d): 62 | self.__dict__ = d 63 | fixed_params = objectview(fixed_params) 64 | 65 | if 'params' in params.keys(): 66 | # if isinstance(params['params'], str): 67 | path = params['params'] 68 | params = read_data(path) 69 | 70 | # Initialize data 71 | data_paths = DataPaths() 72 | fixed_params.remove = remove 73 | if overwrite_fixed_params: 74 | fixed_params.days_of_purchases = days_of_purchases 75 | fixed_params.days_of_clicks = days_of_clicks 76 | fixed_params.lifespan_of_items = lifespan_of_items 77 | data = DataLoader(data_paths, fixed_params) 78 | 79 | # Get graph 80 | valid_graph = create_graph( 81 | data.graph_schema, 82 | ) 83 | valid_graph = assign_graph_features(valid_graph, 84 | fixed_params, 85 | data, 86 | **params, 87 | ) 88 | 89 | dim_dict = {'user': valid_graph.nodes['user'].data['features'].shape[1], 90 | 'item': valid_graph.nodes['item'].data['features'].shape[1], 91 | 'out': params['out_dim'], 92 | 'hidden': params['hidden_dim']} 93 | 94 | all_sids = None 95 | if 'sport' in valid_graph.ntypes: 96 | dim_dict['sport'] = valid_graph.nodes['sport'].data['features'].shape[1] 97 | all_sids = np.arange(valid_graph.num_nodes('sport')) 98 | 99 | # get training and test ids 100 | ( 101 | train_graph, 102 | train_eids_dict, 103 | valid_eids_dict, 104 | subtrain_uids, 105 | valid_uids, 106 | test_uids, 107 | all_iids, 108 | ground_truth_subtrain, 109 | ground_truth_valid, 110 | all_eids_dict 111 | ) = train_valid_split( 112 | valid_graph, 113 | data.ground_truth_test, 114 | fixed_params.etype, 115 | fixed_params.subtrain_size, 116 | fixed_params.valid_size, 117 | fixed_params.reverse_etype, 118 | fixed_params.train_on_clicks, 119 | fixed_params.remove_train_eids, 120 | params['clicks_sample'], 121 | params['purchases_sample'], 122 | ) 123 | ( 124 | edgeloader_train, 125 | edgeloader_valid, 126 | nodeloader_subtrain, 127 | nodeloader_valid, 128 | nodeloader_test 129 | ) = generate_dataloaders(valid_graph, 130 | train_graph, 131 | train_eids_dict, 132 | valid_eids_dict, 133 | subtrain_uids, 134 | valid_uids, 135 | test_uids, 136 | all_iids, 137 | fixed_params, 138 | num_workers, 139 | all_sids, 140 | embedding_layer=params['embedding_layer'], 141 | n_layers=params['n_layers'], 142 | neg_sample_size=params['neg_sample_size'], 143 | ) 144 | 145 | num_batches_test = math.ceil((len(test_uids) + len(all_iids)) / fixed_params.node_batch_size) 146 | 147 | # Import model 148 | if isinstance(trained_model, str): 149 | path = trained_model 150 | trained_model = ConvModel(valid_graph, 151 | params['n_layers'], 152 | dim_dict, 153 | params['norm'], 154 | params['dropout'], 155 | params['aggregator_type'], 156 | fixed_params.pred, 157 | params['aggregator_hetero'], 158 | params['embedding_layer'], 159 | ) 160 | trained_model.load_state_dict(torch.load(path, map_location=device)) 161 | if cuda: 162 | trained_model = trained_model.to(device) 163 | 164 | trained_model.eval() 165 | with torch.no_grad(): 166 | embeddings = get_embeddings(valid_graph, 167 | params['out_dim'], 168 | trained_model, 169 | nodeloader_test, 170 | num_batches_test, 171 | cuda, 172 | device, 173 | params['embedding_layer'], 174 | ) 175 | 176 | for ground_truth in [data.ground_truth_purchase_test, data.ground_truth_test]: 177 | precision, recall, coverage = get_metrics_at_k( 178 | embeddings, 179 | valid_graph, 180 | trained_model, 181 | 
params['out_dim'], 182 | ground_truth, 183 | all_eids_dict[('user', 'buys', 'item')], 184 | fixed_params.k, 185 | True, # Remove already bought 186 | cuda, 187 | device, 188 | fixed_params.pred, 189 | params['use_popularity'], 190 | params['weight_popularity'], 191 | ) 192 | 193 | sentence = ("TEST Precision " 194 | "{:.3f}% | Recall {:.3f}% | Coverage {:.2f}%" 195 | .format(precision * 100, 196 | recall * 100, 197 | coverage * 100)) 198 | 199 | print(sentence) 200 | save_txt(sentence, data_paths.result_filepath, mode='a') 201 | 202 | return recall 203 | -------------------------------------------------------------------------------- /logging_config.py: -------------------------------------------------------------------------------- 1 | """ 2 | config_logging.py 3 | 4 | This module aims to define the logger object. 5 | """ 6 | import logging 7 | 8 | 9 | def get_logger(name): 10 | """ 11 | This function aims to define the logger object from a (file) name. 12 | 13 | input: 14 | :name: str, file name 15 | output: 16 | :logger: logger object to do logging 17 | """ 18 | logger = logging.getLogger(name) 19 | 20 | if not logger.handlers: 21 | logger.propagate = False 22 | logger.setLevel(logging.DEBUG) 23 | # stream handler 24 | ch = logging.StreamHandler() 25 | ch.setLevel(logging.INFO) 26 | formatter = logging.Formatter('%(asctime)s-%(name)s-%(levelname)s: %(message)s') 27 | ch.setFormatter(formatter) 28 | logger.addHandler(ch) 29 | return logger -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | from datetime import timedelta 3 | import logging 4 | import math 5 | import time 6 | 7 | import click 8 | import numpy as np 9 | from skopt import gp_minimize 10 | from skopt.space import Real, Integer, Categorical 11 | from skopt.utils import use_named_args 12 | from skopt.callbacks import CheckpointSaver 13 | from skopt import load 14 | import torch 15 | 16 | from src.builder import create_graph, import_features 17 | from src.model import ConvModel, max_margin_loss 18 | from src.sampling import train_valid_split, generate_dataloaders 19 | from src.metrics import (create_already_bought, create_ground_truth, 20 | get_metrics_at_k, get_recs) 21 | from src.train.run import train_model, get_embeddings 22 | from src.evaluation import explore_recs, explore_sports, check_coverage 23 | from src.utils import save_txt, save_outputs, get_last_checkpoint 24 | from src.utils_data import DataLoader, FixedParameters, DataPaths, assign_graph_features 25 | from src.utils_vizualization import plot_train_loss 26 | import inference_hp 27 | 28 | from logging_config import get_logger 29 | 30 | log = get_logger(__name__) 31 | 32 | global cuda 33 | 34 | cuda = torch.cuda.is_available() 35 | device = torch.device('cuda') 36 | if not cuda: 37 | num_workers = 0 38 | else: 39 | num_workers = 4 40 | 41 | 42 | def train(data, fixed_params, data_paths, 43 | visualization, check_embedding, **params): 44 | """ 45 | Function to find the best hyperparameter combination. 46 | 47 | Files needed to run 48 | ------------------- 49 | All the files in the src.utils_data.DataPaths: 50 | It includes all the interactions between user, sport and items, as well as features for user, sport and items. 51 | If starting hyperparametrization from a checkpoint: 52 | The checkpoint file, generated by skopt during a previous hyperparametrization. The most recent file of 53 | the root folder will be fetched. 
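        For example, launching main.py without the --from_beginning flag (e.g. python main.py -v --visualization)
        resumes the skopt search from that most recent checkpoint.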
54 | 55 | Parameters 56 | ---------- 57 | data : 58 | Object of class DataLoader, containing multiple arguments such as user_item_train dataframe, graph schema, etc. 59 | fixed_params : 60 | All parameters that are fixed, i.e. not part of the hyperparametrization. 61 | data_paths : 62 | All data paths (mainly csv). # Note: currently, only paths.result_filepath is used here. 63 | visualization : 64 | Visualize results or not. # Note: currently not used, visualization is always on or controlled by fixed_params. 65 | check_embedding : 66 | Visualize recommendations or not. # Note: currently not used, controlled by fixed_params. 67 | **params : 68 | Mainly params that come from the hyperparametrization loop, controlled by skopt. 69 | 70 | Returns 71 | ------- 72 | recall : 73 | Recall on the test set for the current combination of hyperparameters. 74 | 75 | Saves to files 76 | -------------- 77 | logging of all experiments: 78 | All training logs are saved to result_filepath, including losses, metrics and examples of recommendations 79 | Plots of the evolution of losses and metrics are saved to the folder 'plots' 80 | best models: 81 | All models, fixed_params and params that yielded recall higher than 8% on specific item identifier or 20% on 82 | generic item identifier are saved to the folder 'models' 83 | """ 84 | # Establish hyperparameters 85 | # Dimensions 86 | out_dim = {'Very Small': 32, 'Small': 96, 'Medium': 128, 'Large': 192, 'Very Large': 256} 87 | hidden_dim = {'Very Small': 64, 'Small': 192, 'Medium': 256, 'Large': 384, 'Very Large': 512} 88 | params['out_dim'] = out_dim[params['embed_dim']] 89 | params['hidden_dim'] = hidden_dim[params['embed_dim']] 90 | 91 | # Popularity 92 | use_popularity = {'No': False, 'Small': True, 'Medium': True, 'Large': True} 93 | weight_popularity = {'No': 0, 'Small': .01, 'Medium': .05, 'Large': .1} 94 | days_popularity = {'No': 0, 'Small': 7, 'Medium': 7, 'Large': 7} 95 | params['use_popularity'] = use_popularity[params['popularity_importance']] 96 | params['weight_popularity'] = weight_popularity[params['popularity_importance']] 97 | params['days_popularity'] = days_popularity[params['popularity_importance']] 98 | 99 | if fixed_params.duplicates == 'count_occurrence': 100 | params['aggregator_type'] += '_edge' 101 | 102 | # Make sure graph data is consistent with message passing parameters 103 | if fixed_params.duplicates == 'count_occurrence': 104 | assert params['aggregator_type'].endswith('edge') 105 | else: 106 | assert not params['aggregator_type'].endswith('edge') 107 | 108 | valid_graph = create_graph( 109 | data.graph_schema, 110 | ) 111 | valid_graph = assign_graph_features(valid_graph, 112 | fixed_params, 113 | data, 114 | **params, 115 | ) 116 | 117 | dim_dict = {'user': valid_graph.nodes['user'].data['features'].shape[1], 118 | 'item': valid_graph.nodes['item'].data['features'].shape[1], 119 | 'out': params['out_dim'], 120 | 'hidden': params['hidden_dim']} 121 | 122 | all_sids = None 123 | if 'sport' in valid_graph.ntypes: 124 | dim_dict['sport'] = valid_graph.nodes['sport'].data['features'].shape[1] 125 | all_sids = np.arange(valid_graph.num_nodes('sport')) 126 | 127 | # get training and test ids 128 | ( 129 | train_graph, 130 | train_eids_dict, 131 | valid_eids_dict, 132 | subtrain_uids, 133 | valid_uids, 134 | test_uids, 135 | all_iids, 136 | ground_truth_subtrain, 137 | ground_truth_valid, 138 | all_eids_dict 139 | ) = train_valid_split( 140 | valid_graph, 141 | data.ground_truth_test, 142 | fixed_params.etype, 143 | 
fixed_params.subtrain_size, 144 | fixed_params.valid_size, 145 | fixed_params.reverse_etype, 146 | fixed_params.train_on_clicks, 147 | fixed_params.remove_train_eids, 148 | params['clicks_sample'], 149 | params['purchases_sample'], 150 | ) 151 | 152 | ( 153 | edgeloader_train, 154 | edgeloader_valid, 155 | nodeloader_subtrain, 156 | nodeloader_valid, 157 | nodeloader_test 158 | ) = generate_dataloaders(valid_graph, 159 | train_graph, 160 | train_eids_dict, 161 | valid_eids_dict, 162 | subtrain_uids, 163 | valid_uids, 164 | test_uids, 165 | all_iids, 166 | fixed_params, 167 | num_workers, 168 | all_sids, 169 | embedding_layer=params['embedding_layer'], 170 | n_layers=params['n_layers'], 171 | neg_sample_size=params['neg_sample_size'], 172 | ) 173 | 174 | train_eids_len = 0 175 | valid_eids_len = 0 176 | for etype in train_eids_dict.keys(): 177 | train_eids_len += len(train_eids_dict[etype]) 178 | valid_eids_len += len(valid_eids_dict[etype]) 179 | num_batches_train = math.ceil(train_eids_len / fixed_params.edge_batch_size) 180 | num_batches_subtrain = math.ceil( 181 | (len(subtrain_uids) + len(all_iids)) / fixed_params.node_batch_size 182 | ) 183 | num_batches_val_loss = math.ceil(valid_eids_len / fixed_params.edge_batch_size) 184 | num_batches_val_metrics = math.ceil( 185 | (len(valid_uids) + len(all_iids)) / fixed_params.node_batch_size 186 | ) 187 | num_batches_test = math.ceil( 188 | (len(test_uids) + len(all_iids)) / fixed_params.node_batch_size 189 | ) 190 | 191 | if fixed_params.neighbor_sampler == 'partial': 192 | params['n_layers'] = 3 193 | 194 | model = ConvModel(valid_graph, 195 | params['n_layers'], 196 | dim_dict, 197 | params['norm'], 198 | params['dropout'], 199 | params['aggregator_type'], 200 | fixed_params.pred, 201 | params['aggregator_hetero'], 202 | params['embedding_layer'], 203 | ) 204 | if cuda: 205 | model = model.to(device) 206 | 207 | hp_sentence = params 208 | hp_sentence.update(vars(fixed_params)) 209 | hp_sentence.update( 210 | { 211 | 'cuda': cuda, 212 | } 213 | ) 214 | hp_sentence = f'{str(hp_sentence)[1: -1]} \n' 215 | 216 | save_txt(f'\n \n START - Hyperparameters \n{hp_sentence}', data_paths.result_filepath, "a") 217 | 218 | start_time = time.time() 219 | 220 | # Train model 221 | trained_model, viz, best_metrics = train_model( 222 | model, 223 | fixed_params.num_epochs, 224 | num_batches_train, 225 | num_batches_val_loss, 226 | edgeloader_train, 227 | edgeloader_valid, 228 | max_margin_loss, 229 | params['delta'], 230 | params['neg_sample_size'], 231 | params['use_recency'], 232 | cuda, 233 | device, 234 | fixed_params.optimizer, 235 | params['lr'], 236 | get_metrics=True, 237 | train_graph=train_graph, 238 | valid_graph=valid_graph, 239 | nodeloader_valid=nodeloader_valid, 240 | nodeloader_subtrain=nodeloader_subtrain, 241 | k=fixed_params.k, 242 | out_dim=params['out_dim'], 243 | num_batches_val_metrics=num_batches_val_metrics, 244 | num_batches_subtrain=num_batches_subtrain, 245 | bought_eids=train_eids_dict[('user', 'buys', 'item')], 246 | ground_truth_subtrain=ground_truth_subtrain, 247 | ground_truth_valid=ground_truth_valid, 248 | remove_already_bought=True, 249 | result_filepath=data_paths.result_filepath, 250 | start_epoch=fixed_params.start_epoch, 251 | patience=fixed_params.patience, 252 | pred=params['pred'], 253 | use_popularity=params['use_popularity'], 254 | weight_popularity=params['weight_popularity'], 255 | remove_false_negative=fixed_params.remove_false_negative, 256 | embedding_layer=params['embedding_layer'], 257 | ) 258 | elapsed = 
time.time() - start_time 259 | result_to_save = f'\n {timedelta(seconds=elapsed)} \n END' 260 | save_txt(result_to_save, data_paths.result_filepath, mode='a') 261 | 262 | if visualization: 263 | plot_train_loss(hp_sentence, viz) 264 | 265 | # Report performance on validation set 266 | sentence = ("BEST VALIDATION Precision " 267 | "{:.3f}% | Recall {:.3f}% | Coverage {:.2f}%" 268 | .format(best_metrics['precision'] * 100, 269 | best_metrics['recall'] * 100, 270 | best_metrics['coverage'] * 100)) 271 | 272 | log.info(sentence) 273 | save_txt(sentence, data_paths.result_filepath, mode='a') 274 | 275 | # Report performance on test set 276 | log.debug('Test metrics start ...') 277 | trained_model.eval() 278 | with torch.no_grad(): 279 | embeddings = get_embeddings(valid_graph, 280 | params['out_dim'], 281 | trained_model, 282 | nodeloader_test, 283 | num_batches_test, 284 | cuda, 285 | device, 286 | params['embedding_layer'], 287 | ) 288 | 289 | for ground_truth in [data.ground_truth_purchase_test, data.ground_truth_test]: 290 | precision, recall, coverage = get_metrics_at_k( 291 | embeddings, 292 | valid_graph, 293 | trained_model, 294 | params['out_dim'], 295 | ground_truth, 296 | all_eids_dict[('user', 'buys', 'item')], 297 | fixed_params.k, 298 | True, # Remove already bought 299 | cuda, 300 | device, 301 | fixed_params.pred, 302 | params['use_popularity'], 303 | params['weight_popularity'], 304 | ) 305 | 306 | sentence = ("TEST Precision " 307 | "{:.3f}% | Recall {:.3f}% | Coverage {:.2f}%" 308 | .format(precision * 100, 309 | recall * 100, 310 | coverage * 100)) 311 | log.info(sentence) 312 | save_txt(sentence, data_paths.result_filepath, mode='a') 313 | 314 | if check_embedding: 315 | trained_model.eval() 316 | with torch.no_grad(): 317 | log.debug('ANALYSIS OF RECOMMENDATIONS') 318 | if 'sport' in train_graph.ntypes: 319 | result_sport = explore_sports(embeddings, 320 | data.sport_feat_df, 321 | data.spt_id, 322 | fixed_params.num_choices) 323 | 324 | save_txt(result_sport, data_paths.result_filepath, mode='a') 325 | 326 | already_bought_dict = create_already_bought(valid_graph, 327 | all_eids_dict[('user', 'buys', 'item')], 328 | ) 329 | already_clicked_dict = None 330 | if fixed_params.discern_clicks: 331 | already_clicked_dict = create_already_bought(valid_graph, 332 | all_eids_dict[('user', 'clicks', 'item')], 333 | etype='clicks', 334 | ) 335 | 336 | users, items = data.ground_truth_test 337 | ground_truth_dict = create_ground_truth(users, items) 338 | user_ids = np.unique(users).tolist() 339 | recs = get_recs(valid_graph, 340 | embeddings, 341 | trained_model, 342 | params['out_dim'], 343 | fixed_params.k, 344 | user_ids, 345 | already_bought_dict, 346 | remove_already_bought=True, 347 | pred=fixed_params.pred, 348 | use_popularity=params['use_popularity'], 349 | weight_popularity=params['weight_popularity']) 350 | 351 | users, items = data.ground_truth_purchase_test 352 | ground_truth_purchase_dict = create_ground_truth(users, items) 353 | explore_recs(recs, 354 | already_bought_dict, 355 | already_clicked_dict, 356 | ground_truth_dict, 357 | ground_truth_purchase_dict, 358 | data.item_feat_df, 359 | fixed_params.num_choices, 360 | data.pdt_id, 361 | fixed_params.item_id_type, 362 | data_paths.result_filepath) 363 | 364 | if fixed_params.item_id_type == 'SPECIFIC ITEM_IDENTIFIER': 365 | coverage_metrics = check_coverage(data.user_item_train, 366 | data.item_feat_df, 367 | data.pdt_id, 368 | recs) 369 | 370 | sentence = ( 371 | "COVERAGE \n|| All transactions : " 372 | "Generic 
{:.1f}% | Junior {:.1f}% | Male {:.1f}% | Female {:.1f}% | Eco {:.1f}% " 373 | "\n|| Recommendations : " 374 | "Generic {:.1f}% | Junior {:.1f}% | Male {:.1f}% | Female {:.1f} | Eco {:.1f}%%" 375 | .format( 376 | coverage_metrics['generic_mean_whole'] * 100, 377 | coverage_metrics['junior_mean_whole'] * 100, 378 | coverage_metrics['male_mean_whole'] * 100, 379 | coverage_metrics['female_mean_whole'] * 100, 380 | coverage_metrics['eco_mean_whole'] * 100, 381 | coverage_metrics['generic_mean_recs'] * 100, 382 | coverage_metrics['junior_mean_recs'] * 100, 383 | coverage_metrics['male_mean_recs'] * 100, 384 | coverage_metrics['female_mean_recs'] * 100, 385 | coverage_metrics['eco_mean_recs'] * 100, 386 | ) 387 | ) 388 | log.info(sentence) 389 | save_txt(sentence, data_paths.result_filepath, mode='a') 390 | 391 | save_outputs( 392 | { 393 | 'embeddings': embeddings, 394 | 'already_bought': already_bought_dict, 395 | 'already_clicked': already_bought_dict, 396 | 'ground_truth': ground_truth_dict, 397 | 'recs': recs, 398 | }, 399 | 'outputs/' 400 | ) 401 | 402 | del params['remove'] 403 | # Save model if the recall is greater than 8% 404 | if (recall > 0.08) & (fixed_params.item_id_type == 'SPECIFIC ITEM_IDENTIFIER') \ 405 | or (recall > 0.2) & (fixed_params.item_id_type == 'GENERAL ITEM_IDENTIFIER'): 406 | date = str(datetime.datetime.now())[:-10].replace(' ', '') 407 | torch.save(trained_model.state_dict(), f'models/HP_Recall_{recall * 100:.2f}_{date}.pth') 408 | # Save all necessary params 409 | save_outputs( 410 | { 411 | f'{date}_params': params, 412 | f'{date}_fixed_params': vars(fixed_params), 413 | }, 414 | 'models/' 415 | ) 416 | 417 | # Inference on different users 418 | if fixed_params.run_inference > 0: 419 | with torch.no_grad(): 420 | print('On normal params') 421 | inference_recall = inference_hp.inference_fn(trained_model, 422 | remove=fixed_params.remove_on_inference, 423 | fixed_params=fixed_params, 424 | overwrite_fixed_params=False, 425 | **params) 426 | if fixed_params.run_inference > 1: 427 | print('For all users') 428 | del params['days_of_purchases'], params['days_of_clicks'], params['lifespan_of_items'] 429 | all_users_inference_recall = inference_hp.inference_fn(trained_model, 430 | remove=fixed_params.remove_on_inference, 431 | fixed_params=fixed_params, 432 | overwrite_fixed_params=True, 433 | days_of_purchases=710, 434 | days_of_clicks=710, 435 | lifespan_of_items=710, 436 | **params) 437 | 438 | recap = f"BEST RECALL on 1) Validation set : {best_metrics['recall'] * 100:.2f}%" \ 439 | f'\n2) Test set : {recall * 100:.2f}%' 440 | if fixed_params.run_inference == 1: 441 | recap += f'\n3) On random users of {fixed_params.remove_on_inference} removed : {inference_recall * 100:.2f}' 442 | recap += f"\nLoop took {timedelta(seconds=elapsed)} for {len(viz['train_loss_list'])} epochs, an average of " \ 443 | f"{timedelta(seconds=elapsed / len(viz['train_loss_list']))} per epoch" 444 | print(recap) 445 | save_txt(recap, data_paths.result_filepath, mode='a') 446 | 447 | return recall # This is the 'test set' recall, on both purchases & clicks 448 | 449 | 450 | class SearchableHyperparameters: 451 | """ 452 | All hyperparameters to optimize. 453 | 454 | Attributes 455 | ---------- 456 | Aggregator_hetero : 457 | How to aggregate messages from different types of edge relations. Choices : 'sum', 'max', 458 | 'min', 'mean', 'stack'. More info here 459 | https://docs.dgl.ai/_modules/dgl/nn/pytorch/hetero.html 460 | Aggregator_type : 461 | How to aggregate neighborhood messages. 
Choices : 'mean', 'pool' for max pooling or 'lstm' 462 | Clicks_sample : 463 | Proportion of all clicks edges that should be used for training. Only relevant if 464 | fixed_params.train_on_clicks == True 465 | Days_popularity : 466 | Number of days considered in Use_popularity 467 | Dropout : 468 | Dropout used on nodes features (at all layers of the GNN) 469 | Embedding_layer : 470 | Create an explicit embedding layer that projects user & item features into and embedding 471 | of hidden_size dimension. If false, the embedding is done in the first layer of the GNN 472 | model. 473 | Purchases_sample : 474 | Proportion of all purchase (i.e. 'buys') edges that should be used for training. If 475 | fixed_params.discern_clicks == False, then 'clicks' edges are considered as 'purchases' 476 | Norm : 477 | Perform normalization after message aggregation 478 | Use_popularity : 479 | When computing ratings, add a score for items that were recent in the last X days 480 | Use_recency : 481 | When computing the loss, give more weights to more recent transactions 482 | Weight_popularity : 483 | Weight of the popularity score 484 | """ 485 | def __init__(self): 486 | self.aggregator_hetero = Categorical(categories=['mean', 'sum', 'max'], name='aggregator_hetero') 487 | self.aggregator_type = Categorical(categories=['mean', 'mean_nn', 'pool_nn'], name='aggregator_type') # LSTM? 488 | self.clicks_sample = Categorical(categories=[.2, .3, .4], name='clicks_sample') 489 | self.delta = Real(low=0.15, high=0.35, prior='log-uniform', 490 | name='delta') 491 | self.dropout = Real(low=0., high=0.8, prior='uniform', 492 | name='dropout') 493 | self.embed_dim = Categorical(categories=['Very Small', 'Small', 'Medium', 'Large', 'Very Large'], 494 | name='embed_dim') 495 | self.embedding_layer = Categorical(categories=[True, False], name='embedding_layer') 496 | self.lr = Real(low=1e-4, high=1e-2, prior='log-uniform', name='lr') 497 | self.n_layers = Integer(low=3, high=5, name='n_layers') 498 | self.neg_sample_size = Integer(low=700, high=3000, 499 | name='neg_sample_size') 500 | self.norm = Categorical(categories=[True, False], name='norm') 501 | self.popularity_importance = Categorical(categories=['No', 'Small', 'Medium', 'Large'], 502 | name='popularity_importance') 503 | self.purchases_sample = Categorical(categories=[.4, .5, .6], name='purchases_sample') 504 | self.use_recency = Categorical(categories=[True, False], name='use_recency') 505 | 506 | # List all the attributes in a list. 507 | # This is equivalent to [self.hidden_dim_HP, self.out_dim_HP ...] 508 | self.dimensions = [self.__getattribute__(attr) 509 | for attr in dir(self) if '__' not in attr] 510 | self.default_parameters = ['sum', 'mean_nn', .3, 0.266, .5, 'Medium', False, 511 | 0.00565, 3, 2500, True, 'No', .5, True] 512 | 513 | 514 | searchable_params = SearchableHyperparameters() 515 | fitness_params = None 516 | 517 | @use_named_args(dimensions=searchable_params.dimensions) 518 | def fitness(**params): 519 | """ 520 | Function used by skopt to find the best hyperparameter combination. 521 | 522 | The function calls the train function defined earlier, with all needed parameters. The recall that is returned 523 | is then multiplied by -1, since skopt is minimizing metrics. 
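    For example, a hyperparameter combination reaching a test recall of 0.10 makes fitness return -0.10,
    so better combinations correspond to lower objective values for gp_minimize.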
524 | """ 525 | recall = train(**{**fitness_params, **params}) 526 | return -recall 527 | 528 | 529 | @click.command() 530 | @click.option('--from_beginning', count=True, 531 | help='Continue with last trained model or not') 532 | @click.option('-v', '--verbose', count=True, help='Verbosity') 533 | @click.option('-viz', '--visualization', count=True, help='Visualize result') 534 | @click.option('--check_embedding', count=True, help='Explore embedding result') 535 | @click.option('--remove', default=.95, help='Data remove percentage') 536 | @click.option('--num_epochs', default=10, help='Number of epochs') 537 | @click.option('--start_epoch', default=0, help='Start epoch') 538 | @click.option('--patience', default=3, help='Patience for early stopping') 539 | @click.option('--edge_batch_size', default=2048, help='Number of edges in a train / validation batch') 540 | @click.option('--item_id_type', default='SPECIFIC ITEM IDENTIFIER', 541 | help='Identifier for the item. This code allows 2 types: SPECIFIC (e.g. item SKU' 542 | 'or GENERAL (e.g. item family)') 543 | @click.option('--duplicates', default='keep_all', 544 | help='How to handle duplicates. Choices: keep_all, keep_last, count_occurrence') 545 | def main(from_beginning, verbose, visualization, check_embedding, 546 | remove, num_epochs, start_epoch, patience, edge_batch_size, 547 | item_id_type, duplicates): 548 | """ 549 | Main function that loads data and parameters, then runs hyperparameter loop with the fitness function. 550 | 551 | """ 552 | if verbose: 553 | log.setLevel(logging.DEBUG) 554 | else: 555 | log.setLevel(logging.INFO) 556 | 557 | data_paths = DataPaths() 558 | fixed_params = FixedParameters(num_epochs, start_epoch, patience, edge_batch_size, 559 | remove, item_id_type, duplicates) 560 | 561 | checkpoint_saver = CheckpointSaver( 562 | f'checkpoint{str(datetime.datetime.now())[:-10]}.pkl', 563 | compress=9 564 | ) 565 | 566 | data = DataLoader(data_paths, fixed_params) 567 | 568 | global fitness_params 569 | fitness_params = { 570 | 'data': data, 571 | 'fixed_params': fixed_params, 572 | 'data_paths': data_paths, 573 | 'visualization': visualization, 574 | 'check_embedding': check_embedding, 575 | } 576 | if from_beginning: 577 | search_result = gp_minimize( 578 | func=fitness, 579 | dimensions=searchable_params.dimensions, 580 | n_calls=200, 581 | acq_func='EI', 582 | x0=searchable_params.default_parameters, 583 | callback=[checkpoint_saver], 584 | random_state=46, 585 | ) 586 | 587 | if not from_beginning: 588 | checkpoint_path = None 589 | if checkpoint_path is None: 590 | checkpoint_path = get_last_checkpoint() 591 | res = load(checkpoint_path) 592 | 593 | x0 = res.x_iters 594 | y0 = res.func_vals 595 | 596 | search_result = gp_minimize( 597 | func=fitness, 598 | dimensions=searchable_params.dimensions, 599 | n_calls=200, 600 | n_initial_points=-len(x0), # Workaround suggested to correct the error when resuming training 601 | acq_func='EI', 602 | x0=x0, 603 | y0=y0, 604 | callback=[checkpoint_saver], 605 | random_state=46 606 | ) 607 | log.info(search_result) 608 | 609 | 610 | if __name__ == '__main__': 611 | main() 612 | -------------------------------------------------------------------------------- /main_inference.py: -------------------------------------------------------------------------------- 1 | import math 2 | 3 | import click 4 | import dgl 5 | import numpy as np 6 | import torch 7 | 8 | from src.builder import create_graph 9 | from src.model import ConvModel 10 | from src.utils_data import DataPaths, 
DataLoader, FixedParameters, assign_graph_features 11 | from src.utils_inference import read_graph, fetch_uids, postprocess_recs 12 | from src.train.run import get_embeddings 13 | from src.metrics import get_recs, create_already_bought 14 | from src.utils import read_data 15 | 16 | cuda = torch.cuda.is_available() 17 | device = torch.device('cuda') if cuda else torch.device('cpu') 18 | num_workers = 4 if cuda else 0 19 | 20 | def inference_ondemand(user_ids, # List or 'all' 21 | use_saved_graph: bool, 22 | trained_model_path: str, 23 | use_saved_already_bought: bool, 24 | graph_path=None, 25 | ctm_id_path=None, 26 | pdt_id_path=None, 27 | already_bought_path=None, 28 | k=10, 29 | remove=.99, 30 | **params, 31 | ): 32 | """ 33 | Given a fully trained model, return recommendations specific to each user. 34 | 35 | Files needed to run 36 | ------------------- 37 | Params used when training the model: 38 | Those params will indicate how to run inference on the model. Usually, they are outputted during training 39 | (and hyperparametrization). 40 | If using a saved already bought dict: 41 | The already bought dict: the dict includes all previous purchases of all user ids for which recommendations 42 | were requested. If not using a saved dict, it will be created using the graph. 43 | Using a saved already bought dict is not necessary, but might make the inference 44 | process faster. 45 | A) If using a saved graph: 46 | The saved graph: the graph that must include all user ids for which recommendations were requested. Usually, 47 | it is outputted during training. It could also be created by another independent function. 48 | ID mapping: ctm_id and pdt_id mapping that allows to associate real-world information, e.g. item and customer 49 | identifier, to actual nodes in the graph. They are usually saved when generating a graph. 50 | B) If not using a saved graph: 51 | The graph will be generated on demand, using all the files in DataPaths of src.utils_data. All those files will 52 | be needed. 53 | 54 | Parameters 55 | ---------- 56 | See click options below for details. 57 | 58 | Returns 59 | ------- 60 | Recommendations for all user ids. 
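    Example
    -------
    A typical call through the CLI wrapper below, taken from the README (paths are placeholders):

        python main_inference.py --params_path test/final_params_example.pkl \
            --user_ids 123456 --user_ids 654321 --user_ids 999 \
            --trained_model_path test/final_model_trained_example.pth --k 10 --remove .99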
61 | 62 | """ 63 | # Load & preprocess data 64 | ## Graph 65 | if use_saved_graph: 66 | graph = read_graph(graph_path) 67 | ctm_id_df = read_data(ctm_id_path) 68 | pdt_id_df = read_data(pdt_id_path) 69 | else: 70 | # Create graph 71 | data_paths = DataPaths() 72 | fixed_params = FixedParameters(num_epochs=0, start_epoch=0, # Not used (only used in training) 73 | patience=0, edge_batch_size=0, # Not used (only used in training) 74 | remove=remove, item_id_type=params['item_id_type'], 75 | duplicates=params['duplicates']) 76 | data = DataLoader(data_paths, fixed_params) 77 | ctm_id_df = data.ctm_id 78 | pdt_id_df = data.pdt_id 79 | 80 | graph = create_graph( 81 | data.graph_schema, 82 | ) 83 | graph = assign_graph_features(graph, 84 | fixed_params, 85 | data, 86 | **params, 87 | ) 88 | ## Preprocess: fetch right user ids 89 | if user_ids[0] == 'all': 90 | test_uids = np.arange(graph.num_nodes('user')) 91 | else: 92 | test_uids = fetch_uids(user_ids, 93 | ctm_id_df) 94 | ## Remove already bought 95 | if use_saved_already_bought: 96 | already_bought_dict = read_data(already_bought_path) 97 | else: 98 | bought_eids = graph.out_edges(u=test_uids, form='eid', etype='buys') 99 | already_bought_dict = create_already_bought(graph, bought_eids) 100 | 101 | # Load model 102 | dim_dict = {'user': graph.nodes['user'].data['features'].shape[1], 103 | 'item': graph.nodes['item'].data['features'].shape[1], 104 | 'out': params['out_dim'], 105 | 'hidden': params['hidden_dim']} 106 | if 'sport' in graph.ntypes: 107 | dim_dict['sport'] = graph.nodes['sport'].data['features'].shape[1] 108 | trained_model = ConvModel( 109 | graph, 110 | params['n_layers'], 111 | dim_dict, 112 | params['norm'], 113 | params['dropout'], 114 | params['aggregator_type'], 115 | params['pred'], 116 | params['aggregator_hetero'], 117 | params['embedding_layer'], 118 | ) 119 | trained_model.load_state_dict(torch.load(trained_model_path, map_location=device)) 120 | if cuda: 121 | trained_model = trained_model.to(device) 122 | 123 | # Create dataloader 124 | all_iids = np.arange(graph.num_nodes('item')) 125 | test_node_ids = {'user': test_uids, 'item': all_iids} 126 | n_layers = params['n_layers'] 127 | if params['embedding_layer']: 128 | n_layers = n_layers - 1 129 | sampler = dgl.dataloading.MultiLayerFullNeighborSampler(n_layers) 130 | nodeloader_test = dgl.dataloading.NodeDataLoader( 131 | graph, 132 | test_node_ids, 133 | sampler, 134 | batch_size=128, 135 | shuffle=True, 136 | drop_last=False, 137 | num_workers=num_workers 138 | ) 139 | num_batches_test = math.ceil((len(test_uids) + len(all_iids)) / 128) 140 | 141 | # Fetch recs 142 | trained_model.eval() 143 | with torch.no_grad(): 144 | embeddings = get_embeddings(graph, 145 | params['out_dim'], 146 | trained_model, 147 | nodeloader_test, 148 | num_batches_test, 149 | cuda, 150 | device, 151 | params['embedding_layer'], 152 | ) 153 | recs = get_recs(graph, 154 | embeddings, 155 | trained_model, 156 | params['out_dim'], 157 | k, 158 | test_uids, 159 | already_bought_dict, 160 | remove_already_bought=True, 161 | cuda=cuda, 162 | device=device, 163 | pred=params['pred'], 164 | use_popularity=params['use_popularity'], 165 | weight_popularity=params['weight_popularity'] 166 | ) 167 | 168 | # Postprocess: user & item ids 169 | processed_recs = postprocess_recs(recs, 170 | pdt_id_df, 171 | ctm_id_df, 172 | params['item_id_type'], 173 | params['ctm_id_type']) 174 | print(processed_recs) 175 | return processed_recs 176 | 177 | 178 | 179 | @click.command() 180 | 
@click.option('--params_path', default='params.pkl', 181 | help='Path where the optimal hyperparameters found in the hyperparametrization were saved.') 182 | @click.option('--user_ids', multiple=True, default=['all'], 183 | help="IDs of users for which to generate recommendations. Either list of user ids, or 'all'.") 184 | @click.option('--use_saved_graph', count=True, 185 | help='If true, will use graph that was saved on disk. Need to import ID mapping for users & items.') 186 | @click.option('--trained_model_path', default='model.pth', 187 | help='Path where fully trained model is saved.') 188 | @click.option('--use_saved_already_bought', count=True, 189 | help='If true, will use already bought dict that was saved on disk.') 190 | @click.option('--graph_path', default='graph.bin', 191 | help='Path where the graph was saved. Mandatory if use_saved_graph is True.') 192 | @click.option('--ctm_id_path', default='ctm_id.pkl', 193 | help='Path where the mapping for customer was save. Mandatory if use_saved_graph is True.') 194 | @click.option('--pdt_id_path', default='pdt_id.pkl', 195 | help='Path where the mapping for items was save. Mandatory if use_saved_graph is True.') 196 | @click.option('--already_bought_path', default='already_bought.pkl', 197 | help='Path where the already bought dict was saved. Mandatory if use_saved_already_bought is True.') 198 | @click.option('--k', default=10, 199 | help='Number of recs to generate for each user.') 200 | @click.option('--remove', default=.99, 201 | help='Percentage of users to remove from graph if used_saved_graph = True. If more than 0, user_ids might' 202 | ' not be in the graph. However, higher "remove" allows for faster inference.') 203 | def main(params_path, user_ids, use_saved_graph, trained_model_path, 204 | use_saved_already_bought, graph_path, ctm_id_path, pdt_id_path, 205 | already_bought_path, k, remove): 206 | params = read_data(params_path) 207 | params.pop('k', None) 208 | params.pop('remove', None) 209 | 210 | 211 | inference_ondemand(user_ids=user_ids, # List or 'all' 212 | use_saved_graph=use_saved_graph, 213 | trained_model_path=trained_model_path, 214 | use_saved_already_bought=use_saved_already_bought, 215 | graph_path=graph_path, 216 | ctm_id_path=ctm_id_path, 217 | pdt_id_path=pdt_id_path, 218 | already_bought_path=already_bought_path, 219 | k=k, 220 | remove=remove, 221 | **params, 222 | ) 223 | 224 | 225 | if __name__ == '__main__': 226 | main() 227 | 228 | 229 | -------------------------------------------------------------------------------- /main_train.py: -------------------------------------------------------------------------------- 1 | import math 2 | import datetime 3 | 4 | import click 5 | import numpy as np 6 | import torch 7 | from dgl.data.utils import save_graphs 8 | 9 | from src.builder import create_graph 10 | from src.utils_data import DataLoader, assign_graph_features 11 | from src.utils import read_data, save_txt, save_outputs 12 | from src.model import ConvModel, max_margin_loss 13 | from src.sampling import train_valid_split, generate_dataloaders 14 | from src.train.run import train_model, get_embeddings 15 | from src.utils_vizualization import plot_train_loss 16 | from src.metrics import (create_already_bought, create_ground_truth, 17 | get_metrics_at_k, get_recs) 18 | from src.evaluation import explore_recs, explore_sports, check_coverage 19 | from presplit import presplit_data 20 | 21 | from logging_config import get_logger 22 | 23 | log = get_logger(__name__) 24 | 25 | cuda = 
torch.cuda.is_available() 26 | device = torch.device('cuda') if cuda else torch.device('cpu') 27 | num_workers = 4 if cuda else 0 28 | 29 | 30 | class TrainDataPaths: 31 | def __init__(self): 32 | self.result_filepath = 'TXT FILE WHERE TO LOG THE RESULTS .txt' 33 | self.sport_feat_path = 'FEATURE DATASET, SPORTS (sport names) .csv' 34 | self.full_interaction_path = 'INTERACTION LIST, USER-ITEM (Full dataset, not splitted between train & test).csv' 35 | self.item_sport_path = 'INTERACTION LIST, ITEM-SPORT .csv' 36 | self.user_sport_path = 'INTERACTION LIST, USER-SPORT .csv' 37 | self.sport_sportg_path = 'INTERACTION LIST, SPORT-SPORT .csv' 38 | self.item_feat_path = 'FEATURE DATASET, ITEMS .csv' 39 | self.user_feat_path = 'FEATURE DATASET, USERS.csv' 40 | self.sport_onehot_path = 'FEATURE DATASET, SPORTS (one-hot vectors) .csv' 41 | 42 | 43 | def train_full_model(fixed_params_path, 44 | visualization, 45 | check_embedding, 46 | remove, 47 | edge_batch_size, 48 | **params,): 49 | """ 50 | Given the best hyperparameter combination, function to train the model on all available data. 51 | 52 | Files needed to run 53 | ------------------- 54 | All the files in the TrainDataPaths: 55 | It includes all the interactions between user, sport and items, as well as features for user, sport and items. 56 | Fixed_params and params found in hyperparametrization: 57 | Those params will indicate how to train the model. Usually, they are found when running the hyperparametrization 58 | loop. 59 | 60 | Parameters 61 | ---------- 62 | See click options below for details. 63 | 64 | 65 | Saves to files 66 | -------------- 67 | trained_model with its fixed parameters and hyperparameters: 68 | The trained model with all parameters are saved to the folder 'models'. 69 | graph and ID mapping: 70 | When doing inference, it might be useful to import an already built graph (and the mapping that allows to 71 | associate node ID with personal information such as CUSTOMER IDENTIFIER or ITEM IDENTIFIER). Thus, the graph and ID mapping are saved to 72 | folder 'models'. 
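    Example
    -------
    A typical call through the CLI wrapper below, reusing the parameter files produced by the
    hyperparametrization (paths taken from the README as placeholders):

        python main_train.py --fixed_params_path test/fixed_params_example.pkl \
            --params_path test/params_example.pkl --visualization --check_embedding \
            --remove .85 --edge_batch_size 512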
73 | """ 74 | # Load parameters 75 | fixed_params = read_data(fixed_params_path) 76 | class objectview(object): 77 | def __init__(self, d): 78 | self.__dict__ = d 79 | fixed_params = objectview(fixed_params) 80 | fixed_params.remove = remove 81 | fixed_params.subtrain_size = 0.01 82 | fixed_params.valid_size = 0.01 83 | fixed_params.edge_batch_size = edge_batch_size 84 | 85 | # Create full train set 86 | train_data_paths = TrainDataPaths() 87 | presplit_item_feat = read_data(train_data_paths.item_feat_path) 88 | full_interaction_data = read_data(train_data_paths.full_interaction_path) 89 | train_df, test_df = presplit_data(presplit_item_feat, 90 | full_interaction_data, 91 | num_min=3, 92 | remove_unk=True, 93 | sort=True, 94 | test_size_days=1, 95 | item_id_type='ITEM IDENTIFIER', 96 | ctm_id_type='CUSTOMER IDENTIFIER', ) 97 | train_data_paths.train_path = train_df 98 | train_data_paths.test_path = test_df 99 | data = DataLoader(train_data_paths, fixed_params) 100 | 101 | # Initialize graph & features 102 | valid_graph = create_graph( 103 | data.graph_schema, 104 | ) 105 | valid_graph = assign_graph_features(valid_graph, 106 | fixed_params, 107 | data, 108 | **params, 109 | ) 110 | 111 | dim_dict = {'user': valid_graph.nodes['user'].data['features'].shape[1], 112 | 'item': valid_graph.nodes['item'].data['features'].shape[1], 113 | 'out': params['out_dim'], 114 | 'hidden': params['hidden_dim']} 115 | 116 | all_sids = None 117 | if 'sport' in valid_graph.ntypes: 118 | dim_dict['sport'] = valid_graph.nodes['sport'].data['features'].shape[1] 119 | all_sids = np.arange(valid_graph.num_nodes('sport')) 120 | 121 | # Initialize model 122 | model = ConvModel(valid_graph, 123 | params['n_layers'], 124 | dim_dict, 125 | params['norm'], 126 | params['dropout'], 127 | params['aggregator_type'], 128 | params['pred'], 129 | params['aggregator_hetero'], 130 | params['embedding_layer'], 131 | ) 132 | if cuda: 133 | model = model.to(device) 134 | 135 | # Initialize dataloaders 136 | # get training and test ids 137 | ( 138 | train_graph, 139 | train_eids_dict, 140 | valid_eids_dict, 141 | subtrain_uids, 142 | valid_uids, 143 | test_uids, 144 | all_iids, 145 | ground_truth_subtrain, 146 | ground_truth_valid, 147 | all_eids_dict 148 | ) = train_valid_split( 149 | valid_graph, 150 | data.ground_truth_test, 151 | fixed_params.etype, 152 | fixed_params.subtrain_size, 153 | fixed_params.valid_size, 154 | fixed_params.reverse_etype, 155 | fixed_params.train_on_clicks, 156 | fixed_params.remove_train_eids, 157 | params['clicks_sample'], 158 | params['purchases_sample'], 159 | ) 160 | 161 | ( 162 | edgeloader_train, 163 | edgeloader_valid, 164 | nodeloader_subtrain, 165 | nodeloader_valid, 166 | nodeloader_test 167 | ) = generate_dataloaders(valid_graph, 168 | train_graph, 169 | train_eids_dict, 170 | valid_eids_dict, 171 | subtrain_uids, 172 | valid_uids, 173 | test_uids, 174 | all_iids, 175 | fixed_params, 176 | num_workers, 177 | all_sids, 178 | embedding_layer=params['embedding_layer'], 179 | n_layers=params['n_layers'], 180 | neg_sample_size=params['neg_sample_size'], 181 | ) 182 | 183 | train_eids_len = 0 184 | valid_eids_len = 0 185 | for etype in train_eids_dict.keys(): 186 | train_eids_len += len(train_eids_dict[etype]) 187 | valid_eids_len += len(valid_eids_dict[etype]) 188 | num_batches_train = math.ceil(train_eids_len / fixed_params.edge_batch_size) 189 | num_batches_subtrain = math.ceil( 190 | (len(subtrain_uids) + len(all_iids)) / fixed_params.node_batch_size 191 | ) 192 | num_batches_val_loss = 
math.ceil(valid_eids_len / fixed_params.edge_batch_size) 193 | num_batches_val_metrics = math.ceil( 194 | (len(valid_uids) + len(all_iids)) / fixed_params.node_batch_size 195 | ) 196 | num_batches_test = math.ceil( 197 | (len(test_uids) + len(all_iids)) / fixed_params.node_batch_size 198 | ) 199 | 200 | # Run model 201 | hp_sentence = params 202 | hp_sentence.update(vars(fixed_params)) 203 | hp_sentence = f'{str(hp_sentence)[1: -1]} \n' 204 | save_txt(f'\n \n START - Hyperparameters \n{hp_sentence}', train_data_paths.result_filepath, "a") 205 | trained_model, viz, best_metrics = train_model( 206 | model, 207 | fixed_params.num_epochs, 208 | num_batches_train, 209 | num_batches_val_loss, 210 | edgeloader_train, 211 | edgeloader_valid, 212 | max_margin_loss, 213 | params['delta'], 214 | params['neg_sample_size'], 215 | params['use_recency'], 216 | cuda, 217 | device, 218 | fixed_params.optimizer, 219 | params['lr'], 220 | get_metrics=True, 221 | train_graph=train_graph, 222 | valid_graph=valid_graph, 223 | nodeloader_valid=nodeloader_valid, 224 | nodeloader_subtrain=nodeloader_subtrain, 225 | k=fixed_params.k, 226 | out_dim=params['out_dim'], 227 | num_batches_val_metrics=num_batches_val_metrics, 228 | num_batches_subtrain=num_batches_subtrain, 229 | bought_eids=train_eids_dict[('user', 'buys', 'item')], 230 | ground_truth_subtrain=ground_truth_subtrain, 231 | ground_truth_valid=ground_truth_valid, 232 | remove_already_bought=True, 233 | result_filepath=train_data_paths.result_filepath, 234 | start_epoch=fixed_params.start_epoch, 235 | patience=fixed_params.patience, 236 | pred=params['pred'], 237 | use_popularity=params['use_popularity'], 238 | weight_popularity=params['weight_popularity'], 239 | remove_false_negative=fixed_params.remove_false_negative, 240 | embedding_layer=params['embedding_layer'], 241 | ) 242 | 243 | # Get viz & metrics 244 | if visualization: 245 | plot_train_loss(hp_sentence, viz) 246 | 247 | # Report performance on validation set 248 | sentence = ("BEST VALIDATION Precision " 249 | "{:.3f}% | Recall {:.3f}% | Coverage {:.2f}%" 250 | .format(best_metrics['precision'] * 100, 251 | best_metrics['recall'] * 100, 252 | best_metrics['coverage'] * 100)) 253 | 254 | log.info(sentence) 255 | save_txt(sentence, train_data_paths.result_filepath, mode='a') 256 | 257 | # Report performance on test set 258 | log.debug('Test metrics start ...') 259 | trained_model.eval() 260 | with torch.no_grad(): 261 | embeddings = get_embeddings(valid_graph, 262 | params['out_dim'], 263 | trained_model, 264 | nodeloader_test, 265 | num_batches_test, 266 | cuda, 267 | device, 268 | params['embedding_layer'], 269 | ) 270 | 271 | for ground_truth in [data.ground_truth_purchase_test, data.ground_truth_test]: 272 | precision, recall, coverage = get_metrics_at_k( 273 | embeddings, 274 | valid_graph, 275 | trained_model, 276 | params['out_dim'], 277 | ground_truth, 278 | all_eids_dict[('user', 'buys', 'item')], 279 | fixed_params.k, 280 | True, # Remove already bought 281 | cuda, 282 | device, 283 | params['pred'], 284 | params['use_popularity'], 285 | params['weight_popularity'], 286 | ) 287 | 288 | sentence = ("TEST Precision " 289 | "{:.3f}% | Recall {:.3f}% | Coverage {:.2f}%" 290 | .format(precision * 100, 291 | recall * 100, 292 | coverage * 100)) 293 | log.info(sentence) 294 | save_txt(sentence, train_data_paths.result_filepath, mode='a') 295 | 296 | if check_embedding: 297 | trained_model.eval() 298 | with torch.no_grad(): 299 | log.debug('ANALYSIS OF RECOMMENDATIONS') 300 | if 'sport' in 
train_graph.ntypes: 301 | result_sport = explore_sports(embeddings, 302 | data.sport_feat_df, 303 | data.spt_id, 304 | fixed_params.num_choices) 305 | 306 | save_txt(result_sport, train_data_paths.result_filepath, mode='a') 307 | 308 | already_bought_dict = create_already_bought(valid_graph, 309 | all_eids_dict[('user', 'buys', 'item')], 310 | ) 311 | already_clicked_dict = None 312 | if fixed_params.discern_clicks: 313 | already_clicked_dict = create_already_bought(valid_graph, 314 | all_eids_dict[('user', 'clicks', 'item')], 315 | etype='clicks', 316 | ) 317 | 318 | users, items = data.ground_truth_test 319 | ground_truth_dict = create_ground_truth(users, items) 320 | user_ids = np.unique(users).tolist() 321 | recs = get_recs(valid_graph, 322 | embeddings, 323 | trained_model, 324 | params['out_dim'], 325 | fixed_params.k, 326 | user_ids, 327 | already_bought_dict, 328 | remove_already_bought=True, 329 | pred=params['pred'], 330 | use_popularity=params['use_popularity'], 331 | weight_popularity=params['weight_popularity']) 332 | 333 | users, items = data.ground_truth_purchase_test 334 | ground_truth_purchase_dict = create_ground_truth(users, items) 335 | explore_recs(recs, 336 | already_bought_dict, 337 | already_clicked_dict, 338 | ground_truth_dict, 339 | ground_truth_purchase_dict, 340 | data.item_feat_df, 341 | fixed_params.num_choices, 342 | data.pdt_id, 343 | fixed_params.item_id_type, 344 | train_data_paths.result_filepath) 345 | 346 | if fixed_params.item_id_type == 'SPECIFIC ITEM IDENTIFIER': 347 | coverage_metrics = check_coverage(data.user_item_train, 348 | data.item_feat_df, 349 | data.pdt_id, 350 | recs) 351 | 352 | sentence = ( 353 | "COVERAGE \n|| All transactions : " 354 | "Generic {:.1f}% | Junior {:.1f}% | Male {:.1f}% | Female {:.1f}% | Eco {:.1f}% " 355 | "\n|| Recommendations : " 356 | "Generic {:.1f}% | Junior {:.1f}% | Male {:.1f}% | Female {:.1f}% | Eco {:.1f}%" 357 | .format( 358 | coverage_metrics['generic_mean_whole'] * 100, 359 | coverage_metrics['junior_mean_whole'] * 100, 360 | coverage_metrics['male_mean_whole'] * 100, 361 | coverage_metrics['female_mean_whole'] * 100, 362 | coverage_metrics['eco_mean_whole'] * 100, 363 | coverage_metrics['generic_mean_recs'] * 100, 364 | coverage_metrics['junior_mean_recs'] * 100, 365 | coverage_metrics['male_mean_recs'] * 100, 366 | coverage_metrics['female_mean_recs'] * 100, 367 | coverage_metrics['eco_mean_recs'] * 100, 368 | ) 369 | ) 370 | log.info(sentence) 371 | save_txt(sentence, train_data_paths.result_filepath, mode='a') 372 | 373 | save_outputs( 374 | { 375 | 'embeddings': embeddings, 376 | 'already_bought': already_bought_dict, 377 | 'already_clicked': already_clicked_dict, 378 | 'ground_truth': ground_truth_dict, 379 | 'recs': recs, 380 | }, 381 | 'outputs/' 382 | ) 383 | 384 | # Save model 385 | date = str(datetime.datetime.now())[:-10].replace(' ', '') 386 | torch.save(trained_model.state_dict(), f'models/FULL_Recall_{recall * 100:.2f}_{date}.pth') 387 | # Save all necessary params 388 | save_outputs( 389 | { 390 | f'{date}_params': params, 391 | f'{date}_fixed_params': vars(fixed_params), 392 | }, 393 | 'models/' 394 | ) 395 | print("Saved model & parameters to disk.") 396 | 397 | # Save graph & ID mapping 398 | save_graphs(f'models/{date}_graph.bin', [valid_graph]) 399 | save_outputs( 400 | { 401 | f'{date}_ctm_id': data.ctm_id, 402 | f'{date}_pdt_id': data.pdt_id, 403 | }, 404 | 'models/' 405 | ) 406 | print("Saved graph & ID mapping to disk.") 407 | 408 | 409 | @click.command() 410 | 
@click.option('--fixed_params_path', default='fixed_params.pkl', 411 | help='Path where the fixed parameters used in the hyperparametrization were saved.') 412 | @click.option('--params_path', default='params.pkl', 413 | help='Path where the optimal hyperparameters found in the hyperparametrization were saved.') 414 | @click.option('-viz', '--visualization', count=True, help='Visualize result') 415 | @click.option('--check_embedding', count=True, help='Explore embedding result') 416 | @click.option('--remove', default=.99, help='Percentage of users to remove from train set. Ideally,' 417 | ' remove would be 0. However, higher "remove" accelerates training.') 418 | @click.option('--edge_batch_size', default=2048, help='Number of edges in a train / validation batch') 419 | def main(fixed_params_path, params_path, visualization, check_embedding, remove, edge_batch_size): 420 | params = read_data(params_path) 421 | params.pop('remove', None) 422 | params.pop('edge_batch_size', None) 423 | train_full_model(fixed_params_path=fixed_params_path, 424 | visualization=visualization, 425 | check_embedding=check_embedding, 426 | remove=remove, 427 | edge_batch_size=edge_batch_size, 428 | **params) 429 | 430 | if __name__ == '__main__': 431 | main() 432 | -------------------------------------------------------------------------------- /models/.keepdir: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /outputs/.keepdir: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /plots/.keepdir: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /presplit.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime, timedelta 2 | 3 | import numpy as np 4 | 5 | from logging_config import get_logger 6 | 7 | logger = get_logger(__file__) 8 | 9 | 10 | def presplit_data(item_feature_data, 11 | user_item_interaction_data, 12 | num_min=3, 13 | remove_unk=True, 14 | sort=True, 15 | test_size_days=14, 16 | item_id_type='ITEM IDENTIFIER', 17 | ctm_id_type='CUSTOMER IDENTIFIER'): 18 | """ 19 | Split data into train and test set. 20 | 21 | Parameters 22 | ---------- 23 | num_min: 24 | Minimal number of interactions (transactions or clicks) for a customer to be included in the dataset 25 | (interactions can be both in train and test sets) 26 | remove_unk: 27 | Remove items in the interaction set that are not in the item features set, e.g. "items" that are services 28 | like skate sharpening 29 | sort: 30 | Sort the dataset by date before splitting in train/test set, thus having a test set that is succeeding 31 | the train set 32 | test_size_days: 33 | Number of days that should be in the test set. The rest will be in the training set. 34 | ctm_id_type: 35 | Unique identifier for the customers. 36 | item_id_type: 37 | Unique identifier for the items. 38 | 39 | Returns 40 | ------- 41 | train_set: 42 | Pandas dataframe of all training interactions. 43 | test_set: 44 | Pandas dataframe of all testing interactions. 
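Example
-------
Illustrative usage; the dataframe names are placeholders, and the interaction dataframe is expected to carry the 'hit_date' / 'hit_timestamp' columns used below:

    train_df, test_df = presplit_data(item_feat_df,
                                      user_item_interaction_df,
                                      num_min=3,
                                      remove_unk=True,
                                      sort=True,
                                      test_size_days=14)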
45 | """ 46 | 47 | np.random.seed(11) 48 | 49 | if num_min > 0: 50 | user_item_interaction_data = user_item_interaction_data[ 51 | user_item_interaction_data[ctm_id_type].map( 52 | user_item_interaction_data[ctm_id_type].value_counts() 53 | ) >= num_min 54 | ] 55 | 56 | if remove_unk: 57 | known_items = item_feature_data[item_id_type].unique().tolist() 58 | user_item_interaction_data = user_item_interaction_data[user_item_interaction_data[item_id_type].isin(known_items)] 59 | 60 | if sort: 61 | user_item_interaction_data.sort_values(by=['hit_timestamp'], 62 | axis=0, 63 | inplace=True) 64 | # Split into train & test sets 65 | most_recent_date = datetime.strptime(max(user_item_interaction_data.hit_date), '%Y-%m-%d') 66 | limit_date = datetime.strftime( 67 | (most_recent_date - timedelta(days=int(test_size_days))), 68 | format='%Y-%m-%d' 69 | ) 70 | train_set = user_item_interaction_data[user_item_interaction_data['hit_date'] <= limit_date] 71 | test_set = user_item_interaction_data[user_item_interaction_data['hit_date'] > limit_date] 72 | 73 | else: 74 | most_recent_date = datetime.strptime(max(user_item_interaction_data.hit_date), '%Y-%m-%d') 75 | oldest_date = datetime.strptime(min(user_item_interaction_data.hit_date), '%Y-%m-%d') 76 | total_days = timedelta(days=(most_recent_date - oldest_date)) # To be tested 77 | test_size = test_size_days / total_days 78 | test_set = user_item_interaction_data.sample(frac=test_size, random_state=200) 79 | train_set = user_item_interaction_data.drop(test_set.index) 80 | 81 | # Keep only users in train set 82 | ctm_list = train_set[ctm_id_type].unique() 83 | test_set = test_set[test_set[ctm_id_type].isin(ctm_list)] 84 | return train_set, test_set 85 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | click~=7.1.2 2 | dgl==0.5.2 3 | matplotlib~=3.3.2 4 | numpy==1.19.2 5 | pandas==1.1.2 6 | scikit-learn==0.23.2 7 | torch==1.6.0 8 | scikit-optimize==0.8.1 -------------------------------------------------------------------------------- /src/builder.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime, timedelta 2 | from typing import Tuple 3 | 4 | import dgl 5 | import numpy as np 6 | import pandas as pd 7 | import torch 8 | 9 | from src.utils import read_data 10 | 11 | 12 | def format_dfs( 13 | train_path, # str (path) or pd.Dataframe directly (df) 14 | test_path, # str (path) or pd.Dataframe directly (df) 15 | item_sport_path: str, 16 | user_sport_path: str, 17 | sport_sportg_path: str, 18 | item_feat_path: str, 19 | user_feat_path: str, 20 | sport_feat_path: str, 21 | sport_onehot_path: str, 22 | remove: float = 0., 23 | ctm_id_type: str = 'CUSTOMER IDENTIFIER', 24 | item_id_type: str = 'SPECIFIC ITEM IDENTIFIER', 25 | days_of_purchases: int = 710, 26 | days_of_clicks: int = 710, 27 | lifespan_of_items: int = 710, 28 | report_model_coverage: bool = False, 29 | ): 30 | """ 31 | Import all dfs from csv paths and preprocess interactions to sample interactions and remove old users and items. 32 | 33 | Parameters 34 | ---------- 35 | train_path, test_path: 36 | Paths of interaction files, between user and items (in the train set and the test set). To accommodate a wider 37 | range of utilisation, train_path and test_path can be directly dataframes instead of strings. 
All files with 38 | user and items must include a column named with the specified ctm_id_type or item_id_type. 39 | item_sport_path, user_sport_path, sport_sportg_path: 40 | Paths of interaction files, between item and sport, user and sport, sport and sport group. All files with user 41 | and items must include a column named with the specified ctm_id_type or item_id_type. 42 | item_feat_path, user_feat_path, sport_feat_path: 43 | Paths of feature files, for item, user and sports. Item features include textual descriptions and junior, male, 44 | female and eco indicators. User features include male and female indicator. Sport features include only name of 45 | sport. All files with user and items must include a column named with the specified ctm_id_type or item_id_type. 46 | sport_onehot_path: 47 | Path for a csv matrix containing the sport_id and a one-hot vector, unique per sport. 48 | remove: 49 | Removes a proportion of users from the dataset randomly. 50 | ctm_id_type : 51 | Identifier for the customers. 52 | item_id_type : 53 | Identifier for the items. Can be SPECIFIC ITEM IDENTIFIER (e.g. item SKU) 54 | or GENERAL ITEM IDENTIFIER (e.g. item family identifier) 55 | days_of_purchases (Days_of_clicks) : 56 | Number of days of purchases (clicks) that should be kept in the dataset. 57 | Intuition is that interactions of 12+ months ago might not be relevant. Max is 710 days 58 | Those that do not have any remaining interactions will be fed recommendations from another 59 | model. 60 | lifespan_of_items : 61 | Number of days since most recent transactions for an item to be considered by the 62 | model. Max is 710 days. Won't make a difference is it is > Days_of_interaction. 63 | report_model_coverage : bool 64 | Computes how many users are included by these parameters (and would thus receive a recommendation by this GNN 65 | model). 66 | 67 | Returns 68 | ------- 69 | user_item_train, user_item_test, user_sport_interaction, item_sport_interaction, sport_sportg_interaction: 70 | Dataframes of interactions. 71 | item_feat_df, user_feat_df, sport_feat_df, sport_onehot_df: 72 | Dataframes of features. 73 | """ 74 | np.random.seed(11) 75 | 76 | # User, item and sport features 77 | item_feat_df = read_data(item_feat_path) 78 | user_feat_df = read_data(user_feat_path) 79 | sport_feat_df = read_data(sport_feat_path) 80 | sport_onehot_df = read_data(sport_onehot_path) 81 | 82 | # User-item interaction. We allow direct df instead of path: check which was passed. 83 | if isinstance(train_path, str): 84 | user_item_train = read_data(train_path) 85 | elif isinstance(train_path, pd.DataFrame): 86 | user_item_train = train_path 87 | else: 88 | raise TypeError(f'Type of {train_path} not recognized. Should be str or pd.DataFrame') 89 | if isinstance(test_path, str): 90 | user_item_test = read_data(test_path) 91 | elif isinstance(test_path, pd.DataFrame): 92 | user_item_test = test_path 93 | else: 94 | raise TypeError(f'Type of {test_path} not recognized. 
Should be str or pd.DataFrame') 95 | 96 | if days_of_purchases < 710: 97 | most_recent_date = datetime.strptime(max(user_item_train.hit_date), '%Y-%m-%d') 98 | limit_date = datetime.strftime( 99 | (most_recent_date - timedelta(days=int(days_of_purchases))), 100 | format='%Y-%m-%d' 101 | ) 102 | user_item_train = user_item_train[(user_item_train.hit_date >= limit_date) | (user_item_train.buy == 0)] 103 | 104 | if days_of_clicks < 710: 105 | most_recent_date = datetime.strptime(max(user_item_train.hit_date), '%Y-%m-%d') 106 | limit_date = datetime.strftime( 107 | (most_recent_date - timedelta(days=int(days_of_clicks))), 108 | format='%Y-%m-%d' 109 | ) 110 | user_item_train = user_item_train[(user_item_train.hit_date >= limit_date) | (user_item_train.buy == 1)] 111 | 112 | if lifespan_of_items < days_of_purchases: 113 | most_recent_date = datetime.strptime(max(user_item_train.hit_date), '%Y-%m-%d') 114 | limit_date = datetime.strftime( 115 | (most_recent_date - timedelta(days=int(lifespan_of_items))), 116 | format='%Y-%m-%d' 117 | ) 118 | item_list = user_item_train[user_item_train.hit_date >= limit_date]['SPECIFIC ITEM IDENTIFIER'].unique() 119 | user_item_train = user_item_train[user_item_train['SPECIFIC ITEM IDENTIFIER'].isin(item_list)] 120 | 121 | if remove > 0: 122 | ctm_list = user_item_train[ctm_id_type].unique() 123 | np.random.shuffle(ctm_list) 124 | ctm_list = ctm_list[:int(len(ctm_list) * (1 - remove))] 125 | user_item_train = user_item_train[user_item_train[ctm_id_type].isin(ctm_list)] 126 | user_item_test = user_item_test[user_item_test[ctm_id_type].isin(ctm_list)] 127 | 128 | if remove == 0: 129 | # Make sure that if no observations were removed by days of clicks / purchases, no user is only in test set 130 | user_item_test = user_item_test[user_item_test[ctm_id_type].isin(user_item_train[ctm_id_type].unique())] 131 | 132 | if item_id_type == 'GENERAL ITEM IDENTIFIER': 133 | user_item_train = user_item_train.merge( 134 | item_feat_df[['SPECIFIC ITEM IDENTIFIER', 'GENERAL ITEM IDENTIFIER']].drop_duplicates(), 135 | how='left', 136 | on='SPECIFIC ITEM IDENTIFIER') 137 | user_item_test = user_item_test.merge( 138 | item_feat_df[['SPECIFIC ITEM IDENTIFIER', 'GENERAL ITEM IDENTIFIER']].drop_duplicates(), 139 | how='left', 140 | on='SPECIFIC ITEM IDENTIFIER') 141 | assert user_item_train.general_item_identifier.isna().sum() == 0 142 | assert user_item_test.general_item_identifier.isna().sum() == 0 143 | 144 | 145 | # Item-sport interaction 146 | item_sport_interaction = read_data(item_sport_path) 147 | if lifespan_of_items < days_of_purchases: 148 | item_sport_interaction = item_sport_interaction[item_sport_interaction['SPECIFIC ITEM IDENTIFIER'].isin( 149 | item_list)] 150 | if item_id_type == 'GENERAL ITEM IDENTIFIER': 151 | item_sport_interaction = item_sport_interaction.merge( 152 | item_feat_df[['SPECIFIC ITEM IDENTIFIER', 'GENERAL ITEM IDENTIFIER']], 153 | how='left', 154 | on='SPECIFIC ITEM IDENTIFIER') 155 | # Drop duplicates if not item_id_type not model number 156 | item_sport_interaction.drop_duplicates(inplace=True) 157 | 158 | 159 | # User-sport interaction 160 | user_sport_interaction = read_data(user_sport_path) 161 | if remove > 0: 162 | user_sport_interaction = user_sport_interaction[user_sport_interaction[ctm_id_type].isin(ctm_list)] 163 | 164 | # Sport-sportgroups interaction 165 | sport_sportg_interaction = read_data(sport_sportg_path) 166 | 167 | if report_model_coverage: 168 | train_users = user_item_train[ctm_id_type].unique().tolist() 169 | test_users = 
user_item_test[ctm_id_type].unique().tolist() 170 | sport_users = user_sport_interaction[ctm_id_type].unique().tolist() 171 | unseen_users = [uid for uid in test_users if uid not in train_users] 172 | print(f'There are {len(unseen_users)} users with no interactions') 173 | train_users.extend(sport_users) 174 | unseen_users = [uid for uid in test_users if uid not in train_users] 175 | print(f'and {len(unseen_users)} with also no sports associated') 176 | print(f'out of {len(test_users)}') 177 | 178 | return user_item_train, user_item_test, item_sport_interaction, user_sport_interaction, \ 179 | sport_sportg_interaction, item_feat_df, user_feat_df, sport_feat_df, sport_onehot_df 180 | 181 | 182 | def create_ids(user_item_train: pd.DataFrame, 183 | user_sport_interaction: pd.DataFrame, 184 | sport_sportg_interaction: pd.DataFrame, 185 | item_feat_df, 186 | item_id_type: str = 'SPECIFIC ITEM IDENTIFIER', 187 | ctm_id_type: str = 'CUSTOMER IDENTIFIER', 188 | spt_id_type: str = 'sport_id', 189 | ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: 190 | """ 191 | Create ids needed for creating the graph (nodes cannot have arbitrary ids, i.e. it couldn't be directly 192 | the item identifier). 193 | 194 | Parameters 195 | ---------- 196 | See parameters and outputs of format_dfs for details. 197 | 198 | Returns 199 | ------- 200 | ctm_id, pdt_id, spt_id: 201 | Mapping between Organisation info (e.g. customer, item and sport ID) and new node ID. 202 | 203 | """ 204 | 205 | # Create user ids 206 | ctm_id = pd.DataFrame(user_item_train[ctm_id_type].unique(), 207 | columns=[ctm_id_type]) 208 | ctm_id['ctm_new_id'] = ctm_id.index 209 | 210 | # Create item ids 211 | train_pdt = user_item_train[item_id_type].unique().tolist() 212 | all_pdt = item_feat_df[item_id_type].unique().tolist() 213 | unseen_pdt = [pdt for pdt in all_pdt if pdt not in train_pdt] 214 | train_pdt.extend(unseen_pdt) # DGL requires that node IDs are continuous; unseen are at the end 215 | pdt_id = pd.DataFrame(train_pdt, 216 | columns=[item_id_type]) 217 | pdt_id['pdt_new_id'] = pdt_id.index 218 | 219 | # Create sport ids 220 | unique_sports = np.append(sport_sportg_interaction.sports_id.unique(), 221 | sport_sportg_interaction.sportsgroup_id.unique()) 222 | unique_sports = np.unique(np.append(unique_sports, 223 | user_sport_interaction[spt_id_type].unique())) 224 | spt_id = pd.DataFrame(unique_sports, columns=[spt_id_type]) 225 | spt_id['spt_new_id'] = spt_id.index 226 | 227 | return ctm_id, pdt_id, spt_id 228 | 229 | 230 | def df_to_adjacency_list(user_item_train: pd.DataFrame, 231 | user_item_test: pd.DataFrame, 232 | item_sport_interaction: pd.DataFrame, 233 | user_sport_interaction: pd.DataFrame, 234 | sport_sportg_interaction: pd.DataFrame, 235 | ctm_id: pd.DataFrame, 236 | pdt_id: pd.DataFrame, 237 | spt_id: pd.DataFrame, 238 | item_id_type: str, 239 | ctm_id_type: str, 240 | spt_id_type: str, 241 | discern_clicks: bool = False, 242 | duplicates: str = 'keep_all' 243 | ): 244 | """ 245 | Takes dataframes & ids for the nodes, and return adjacency lists (in the form of src nodes and dst nodes.) 246 | 247 | Parameters 248 | ---------- 249 | discern_clicks, duplicates: 250 | See utils_data for details. 251 | all other parameters: 252 | See parameters & outputs of other functions in this file for details. 253 | 254 | Returns 255 | ------- 256 | adjacency_dict: 257 | This will be used to build the graph. It contains id of source and destination nodes for all edge types. 
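For illustration only (array contents are placeholders): with discern_clicks=True it takes the form {'clicks_src': ..., 'clicks_dst': ..., 'purchases_src': ..., 'purchases_dst': ..., 'item_sport_src': ..., 'item_sport_dst': ..., 'user_sport_src': ..., 'user_sport_dst': ..., 'sport_sportg_src': ..., 'sport_sportg_dst': ...}, plus 'clicks_num' / 'purchases_num' occurrence counts when duplicates are grouped; with discern_clicks=False the user-item keys collapse to 'user_item_src' / 'user_item_dst' (and 'user_item_num').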
258 | ground_truth_test, ground_truth_purchase_test: 259 | This will be used to compute metrics (i.e. check if recommended items can be found in the ground_truth). It 260 | contains user and item ids for all interactions in the test set. 261 | user_item_train: 262 | In this function, if duplicates == 'count_occurrence' or 'keep_last', some grouping manipulations are done on 263 | the user_item_train dataframe. Returning it will allow to attribute features to "grouped" edges. 264 | 265 | """ 266 | adjacency_dict = {} 267 | # User item : join new ids with old ids 268 | user_item_train = user_item_train.merge(ctm_id, 269 | how='left', 270 | on=ctm_id_type) 271 | user_item_train = user_item_train.merge(pdt_id, 272 | how='left', 273 | on=item_id_type) 274 | 275 | if duplicates in ['keep_last', 'count_occurrence']: 276 | grouped_df = user_item_train.groupby(['buy', 'ctm_new_id', 'pdt_new_id']).specific_item_identifier.count() 277 | grouped_df = pd.DataFrame(grouped_df).reset_index() 278 | grouped_df.columns = ['buy', 'ctm_new_id', 'pdt_new_id', 'num_interaction'] 279 | 280 | user_item_train.drop_duplicates(subset=['buy', 'ctm_new_id', 'pdt_new_id'], 281 | keep='last', 282 | inplace=True) # Keep last interaction 283 | user_item_train.sort_values(by=['buy', 'ctm_new_id', 'pdt_new_id'], 284 | ignore_index=True, 285 | inplace=True) # Have same order as grouped_df 286 | assert len(user_item_train) == len(grouped_df) 287 | user_item_train['num_interaction'] = grouped_df.num_interaction.values 288 | user_item_train.sort_values(by='hit_timestamp', 289 | ignore_index=True, 290 | inplace=True) # Reorder by date to keep sequential order 291 | if discern_clicks: 292 | adjacency_dict.update( 293 | { 294 | 'clicks_num': user_item_train[user_item_train.buy == 0].num_interaction.values, 295 | 'purchases_num': user_item_train[user_item_train.buy == 1].num_interaction.values 296 | } 297 | ) 298 | else: 299 | adjacency_dict.update( 300 | { 301 | 'user_item_num': user_item_train.num_interaction.values 302 | } 303 | ) 304 | 305 | if discern_clicks: 306 | adjacency_dict.update( 307 | { 308 | 'clicks_src': user_item_train[user_item_train.buy == 0].ctm_new_id.values, 309 | 'clicks_dst': user_item_train[user_item_train.buy == 0].pdt_new_id.values, 310 | 'purchases_src': user_item_train[user_item_train.buy == 1].ctm_new_id.values, 311 | 'purchases_dst': user_item_train[user_item_train.buy == 1].pdt_new_id.values, 312 | } 313 | ) 314 | 315 | else: 316 | adjacency_dict.update( 317 | { 318 | 'user_item_src': user_item_train.ctm_new_id.values, 319 | 'user_item_dst': user_item_train.pdt_new_id.values, 320 | } 321 | ) 322 | 323 | user_item_test = user_item_test.merge(ctm_id, 324 | how='left', 325 | on=ctm_id_type) 326 | user_item_test = user_item_test.merge(pdt_id, 327 | how='left', 328 | on=item_id_type) 329 | test_purchase_src = user_item_test[user_item_test.buy == 1].ctm_new_id.values 330 | test_purchase_dst = user_item_test[user_item_test.buy == 1].pdt_new_id.values 331 | ground_truth_purchase_test = (test_purchase_src, test_purchase_dst) 332 | 333 | test_src = user_item_test.ctm_new_id.values 334 | test_dst = user_item_test.pdt_new_id.values 335 | ground_truth_test = (test_src, test_dst) 336 | 337 | # Item sport : merge new ids with old ids 338 | item_sport_interaction = item_sport_interaction.merge(spt_id, 339 | how='left', 340 | on=spt_id_type) 341 | item_sport_interaction = item_sport_interaction.merge(pdt_id, 342 | how='left', 343 | on=item_id_type) 344 | item_sport_interaction.dropna(inplace=True) # drop items with 
no sports associated 345 | 346 | adjacency_dict['item_sport_src'] = item_sport_interaction.pdt_new_id.values 347 | adjacency_dict['item_sport_dst'] = item_sport_interaction.spt_new_id.values 348 | 349 | # User sport : merge new ids with old ids 350 | user_sport_interaction = user_sport_interaction.merge(spt_id, 351 | how='left', 352 | on=spt_id_type) 353 | user_sport_interaction = user_sport_interaction.merge(ctm_id, 354 | how='left', 355 | on=ctm_id_type) 356 | user_sport_interaction.dropna(inplace=True) 357 | 358 | adjacency_dict['user_sport_src'] = user_sport_interaction.ctm_new_id.values 359 | adjacency_dict['user_sport_dst'] = user_sport_interaction.spt_new_id.values 360 | 361 | # Sport sportgroups 362 | sport_sportg_interaction = sport_sportg_interaction.merge(spt_id, 363 | how='left', 364 | left_on='sports_id', 365 | right_on=spt_id_type) 366 | sport_sportg_interaction = sport_sportg_interaction.merge(spt_id, 367 | how='left', 368 | left_on='sportsgroup_id', 369 | right_on=spt_id_type) 370 | 371 | adjacency_dict['sport_sportg_src'] = sport_sportg_interaction.spt_new_id_x.values 372 | adjacency_dict['sport_sportg_dst'] = sport_sportg_interaction.spt_new_id_y.values 373 | 374 | return adjacency_dict, ground_truth_test, ground_truth_purchase_test, user_item_train 375 | 376 | 377 | def create_graph(graph_schema, 378 | ) -> dgl.DGLHeteroGraph: 379 | """ 380 | Create graph based on adjacency list. 381 | """ 382 | g = dgl.heterograph(graph_schema) 383 | return g 384 | 385 | 386 | def import_features(g: dgl.DGLHeteroGraph, 387 | user_feat_df, 388 | item_feat_df, 389 | sport_onehot_df, 390 | ctm_id: pd.DataFrame, 391 | pdt_id: pd.DataFrame, 392 | spt_id: pd.DataFrame, 393 | user_item_train, 394 | get_popularity: bool, 395 | num_days_pop: int, 396 | item_id_type: str, 397 | ctm_id_type: str, 398 | spt_id_type: str, 399 | ): 400 | """ 401 | Import features to a dict for all node types. 402 | 403 | For user and item, initializes feature arrays with only 0, then fills the values if they are available. 404 | 405 | Parameters 406 | ---------- 407 | get_popularity, num_days_pop: 408 | The recommender system can be enhanced by giving score boost for items that were popular. If get_popularity, 409 | popularity of the items will be computed. Num_days_pop defines the number of days to include in the 410 | computation. 411 | item_id_type, ctm_id_type, spt_id_type: 412 | See utils_data for details. 413 | all other parameters: 414 | See other functions in this file for details. 415 | 416 | Returns 417 | ------- 418 | features_dict: 419 | Dictionary with all the features imported here. 
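For illustration, the returned dictionary is roughly of the form {'user_feat': FloatTensor of shape (n_users, 2), 'item_feat': FloatTensor of shape (n_items, 4), 'sport_feat': FloatTensor of one-hot sport vectors (only if 'sport' is a node type), 'item_pop': FloatTensor of shape (n_items, 1) (only if get_popularity)}; exact sizes depend on the graph and follow the feature construction below.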
420 | """ 421 | features_dict = {} 422 | # User 423 | user_feat_df = user_feat_df.merge(ctm_id, how='inner', on=ctm_id_type) 424 | 425 | ids = user_feat_df.ctm_new_id.values.astype(int) 426 | feats = np.stack((user_feat_df.is_male.values, 427 | user_feat_df.is_female.values), 428 | axis=1) 429 | 430 | user_feat = np.zeros((g.number_of_nodes('user'), 2)) 431 | user_feat[ids] = feats 432 | 433 | user_feat = torch.tensor(user_feat).float() 434 | features_dict['user_feat'] = user_feat 435 | 436 | # Item 437 | if item_id_type in ['SPECIFIC ITEM IDENTIFIER']: 438 | item_feat_df = item_feat_df.merge(pdt_id, 439 | how='left', 440 | on=item_id_type) 441 | item_feat_df = item_feat_df[item_feat_df.pdt_new_id < g.number_of_nodes('item')] # Only IDs that are in graph 442 | 443 | ids = item_feat_df.pdt_new_id.values.astype(int) 444 | feats = np.stack((item_feat_df.is_junior.values, 445 | item_feat_df.is_male.values, 446 | item_feat_df.is_female.values, 447 | item_feat_df.eco_design.values, 448 | ), 449 | axis=1) 450 | 451 | item_feat = np.zeros((g.number_of_nodes('item'), feats.shape[1])) 452 | item_feat[ids] = feats 453 | item_feat = torch.tensor(item_feat).float() 454 | elif item_id_type in ['GENERAL ITEM IDENTIFIER']: 455 | item_feat = torch.zeros((g.number_of_nodes('item'), 4)) 456 | else: 457 | raise KeyError(f'Item ID {item_id_type} not recognized.') 458 | 459 | features_dict['item_feat'] = item_feat 460 | 461 | # Sport one-hot 462 | if 'sport' in g.ntypes: 463 | sport_onehot_df = sport_onehot_df.merge(spt_id, how='inner', on=spt_id_type) 464 | sport_onehot_df.sort_values(by='spt_new_id', 465 | inplace=True) # Values need to be sorted by node id to align with g.nodes['sport'] 466 | feats = sport_onehot_df.drop(labels=[spt_id_type, 'spt_new_id'], axis=1).values 467 | assert feats.shape[0] == g.num_nodes('sport') 468 | sport_feat = torch.tensor(feats).float() 469 | features_dict['sport_feat'] = sport_feat 470 | 471 | # Popularity 472 | if get_popularity: 473 | item_popularity = np.zeros((g.number_of_nodes('item'), 1)) 474 | pop_df = user_item_train.merge(pdt_id, 475 | how='left', 476 | on=item_id_type) 477 | most_recent_date = datetime.strptime(max(pop_df.hit_date), '%Y-%m-%d') 478 | limit_date = datetime.strftime( 479 | (most_recent_date - timedelta(days=num_days_pop)), 480 | format='%Y-%m-%d' 481 | ) 482 | pop_df = pop_df[pop_df.hit_date >= limit_date] 483 | pop_df = pd.DataFrame(pop_df.pdt_new_id.value_counts()) 484 | pop_df.columns = ['purchases'] 485 | pop_df['score'] = pop_df.purchases / pop_df.purchases.sum() 486 | pop_df.sort_index(inplace=True) 487 | ids = pop_df.index.values.astype(int) 488 | scores = pop_df.score.values 489 | item_popularity[ids] = np.expand_dims(scores, axis=1) 490 | item_popularity = torch.tensor(item_popularity).float() 491 | features_dict['item_pop'] = item_popularity 492 | 493 | return features_dict 494 | -------------------------------------------------------------------------------- /src/evaluation.py: -------------------------------------------------------------------------------- 1 | import random 2 | 3 | import numpy as np 4 | import pandas as pd 5 | from sklearn.metrics.pairwise import cosine_similarity 6 | 7 | from src.utils import save_txt 8 | 9 | 10 | def get_item_by_id(iid: int, 11 | pdt_id: pd.DataFrame, 12 | item_feat: pd.DataFrame, 13 | item_id_type: str): 14 | """ 15 | Fetch information about the item, given its node_id. 16 | 17 | The info need to be available in the item features dataset. 
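Illustrative call (the node id and identifier type are placeholders):
    info1, info2, info3 = get_item_by_id(42, pdt_id, item_feat_df, 'SPECIFIC ITEM IDENTIFIER')
where pdt_id is the node-id mapping built in src/builder.py and the item feature dataframe exposes the info1 / info2 / info3 columns read below.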
18 | """ 19 | # fetch old iid 20 | old_iid = pdt_id[item_id_type][pdt_id.pdt_new_id == iid].item() 21 | # fetch info 22 | info1 = item_feat.info1[item_feat[item_id_type] == old_iid].tolist()[0] 23 | info2 = item_feat.info2[item_feat[item_id_type] == old_iid].tolist()[0] 24 | info3 = item_feat.info3[item_feat[item_id_type] == old_iid].tolist()[0] 25 | return info1, info2, info3 26 | 27 | 28 | def fetch_recs_for_users(user, 29 | user_dict, 30 | pdt_id, 31 | item_feat_df, 32 | item_id_type, 33 | result_filepath, 34 | ground_truth_purchase_dict=None): 35 | """ 36 | For all items in a dict (of recs, or already_bought, or ground_truth), fetch information. 37 | 38 | """ 39 | for iid in user_dict[user]: 40 | try: 41 | info1, info2, info3 = get_item_by_id(iid, pdt_id, item_feat_df, item_id_type) 42 | sentence = info1 + ', ' + info2 + info3 43 | if ground_truth_purchase_dict is not None: 44 | if iid in ground_truth_purchase_dict[user]: 45 | count_purchases = len([item for item in ground_truth_purchase_dict[user] if item == iid]) 46 | sentence += f' ----- BOUGHT {count_purchases} TIME(S)' 47 | except: 48 | sentence = 'No name' 49 | save_txt(sentence, result_filepath, mode='a') 50 | 51 | 52 | def explore_recs(recs: dict, 53 | already_bought_dict: dict, 54 | already_clicked_dict, 55 | ground_truth_dict: dict, 56 | ground_truth_purchase_dict: dict, 57 | item_feat_df: pd.DataFrame, 58 | num_choices: int, 59 | pdt_id: pd.DataFrame, 60 | item_id_type: str, 61 | result_filepath: str): 62 | """ 63 | For a random sample of users, fetch information about what items were clicked/bought, recommended and ground truth. 64 | 65 | Users with only 1 previous click or purchase are explored at the end. 66 | """ 67 | choices = random.sample(recs.keys(), num_choices) 68 | 69 | for user in choices: 70 | save_txt('\nCustomer bought', result_filepath, mode='a') 71 | try: 72 | fetch_recs_for_users(user, 73 | already_bought_dict, 74 | pdt_id, 75 | item_feat_df, 76 | item_id_type, 77 | result_filepath) 78 | except: 79 | save_txt('Nothing', result_filepath, mode='a') 80 | 81 | save_txt('\nCustomer clicked on', result_filepath, mode='a') 82 | try: 83 | fetch_recs_for_users(user, 84 | already_clicked_dict, 85 | pdt_id, 86 | item_feat_df, 87 | item_id_type, 88 | result_filepath) 89 | except: 90 | save_txt('No click data', result_filepath, mode='a') 91 | 92 | save_txt('\nGot recommended', result_filepath, mode='a') 93 | fetch_recs_for_users(user, 94 | recs, 95 | pdt_id, 96 | item_feat_df, 97 | item_id_type, 98 | result_filepath) 99 | 100 | save_txt('\nGround truth', result_filepath, mode='a') 101 | fetch_recs_for_users(user, 102 | ground_truth_dict, 103 | pdt_id, 104 | item_feat_df, 105 | item_id_type, 106 | result_filepath, 107 | ground_truth_purchase_dict) 108 | 109 | # user with 1 item 110 | choices = random.sample([uid for uid, v in already_bought_dict.items() if len(v) == 1 and uid in recs.keys()], 2) 111 | for user in choices: 112 | save_txt('\nCustomer bought', result_filepath, mode='a') 113 | try: 114 | fetch_recs_for_users(user, 115 | already_bought_dict, 116 | pdt_id, 117 | item_feat_df, 118 | item_id_type, 119 | result_filepath) 120 | except: 121 | save_txt('Nothing', result_filepath, mode='a') 122 | 123 | save_txt('\nCustomer clicked on', result_filepath, mode='a') 124 | try: 125 | fetch_recs_for_users(user, 126 | already_clicked_dict, 127 | pdt_id, 128 | item_feat_df, 129 | item_id_type, 130 | result_filepath) 131 | except: 132 | save_txt('No click data', result_filepath, mode='a') 133 | 134 | save_txt('\nGot 
recommended', result_filepath, mode='a') 135 | fetch_recs_for_users(user, 136 | recs, 137 | pdt_id, 138 | item_feat_df, 139 | item_id_type, 140 | result_filepath) 141 | 142 | save_txt('\nGround truth', result_filepath, mode='a') 143 | fetch_recs_for_users(user, 144 | ground_truth_dict, 145 | pdt_id, 146 | item_feat_df, 147 | item_id_type, 148 | result_filepath, 149 | ground_truth_purchase_dict) 150 | 151 | 152 | def explore_sports(h, 153 | sport_feat_df: pd.DataFrame, 154 | spt_id: pd.DataFrame, 155 | num_choices: int, 156 | ): 157 | """ 158 | For a random sample of sport, fetch name of 5 most similar sports. 159 | """ 160 | sport_h = h['sport'] 161 | sim_matrix = cosine_similarity(sport_h.detach().cpu()) 162 | choices = random.sample(range(sport_h.shape[0]), num_choices) 163 | sentence = '' 164 | for sid in choices: 165 | # fetch name of sport id 166 | try: 167 | old_sid = spt_id.sport_id[spt_id.spt_new_id == sid].item() 168 | chosen_name = sport_feat_df.sport_label[sport_feat_df.sport_id == old_sid].item() 169 | except: 170 | chosen_name = 'N/A' 171 | # fetch most similar sports 172 | top = np.argpartition(sim_matrix[sid], -5)[-5:] 173 | top_list = spt_id.sport_id[spt_id.spt_new_id.isin(top.tolist())].tolist() 174 | top_names = sport_feat_df.sport_label[sport_feat_df.sport_id.isin(top_list)].unique() 175 | sentence += 'For sport {}, top similar sports are {} \n'.format(chosen_name, top_names) 176 | return sentence 177 | 178 | 179 | def check_coverage(user_item_interaction, 180 | item_feat_df, 181 | pdt_id, 182 | recs): 183 | """ 184 | Check the repartition of types of items in the purchases vs recommendations (generic vs female vs male vs junior). 185 | 186 | Also checks repartition of eco-design products in purchases vs recommendations. 187 | """ 188 | coverage_metrics = {} 189 | 190 | # remove all 'unknown' items 191 | known_items = item_feat_df.item_identifier.unique().tolist() 192 | user_item_interaction = user_item_interaction[user_item_interaction.item_identifier.isin(known_items)] 193 | 194 | # count number of types in original dataset 195 | df = user_item_interaction.merge(item_feat_df, 196 | how='left', 197 | on='ITEM IDENTIFIER') 198 | df['is_generic'] = (df.is_junior + df.is_male + df.is_female).astype(bool) * -1 + 1 199 | 200 | coverage_metrics['generic_mean_whole'] = df.is_generic.mean() 201 | coverage_metrics['junior_mean_whole'] = df.is_junior.mean() 202 | coverage_metrics['male_mean_whole'] = df.is_male.mean() 203 | coverage_metrics['female_mean_whole'] = df.is_female.mean() 204 | coverage_metrics['eco_mean_whole'] = df.eco_design.mean() 205 | 206 | # count in 'recs' 207 | recs_df = pd.DataFrame(recs.items()) 208 | recs_df.columns = ['uid', 'iid'] 209 | recs_df = recs_df.explode('iid') 210 | recs_df = recs_df.merge(pdt_id, 211 | how='left', 212 | left_on='iid', 213 | right_on='pdt_new_id') 214 | recs_df = recs_df.merge(item_feat_df, 215 | how='left', 216 | on='ITEM IDENTIFIER') 217 | 218 | recs_df['is_generic'] = (recs_df.is_junior + recs_df.is_male + recs_df.is_female).astype(bool) * -1 + 1 219 | 220 | coverage_metrics['generic_mean_recs'] = recs_df.is_generic.mean() 221 | coverage_metrics['junior_mean_recs'] = recs_df.is_junior.mean() 222 | coverage_metrics['male_mean_recs'] = recs_df.is_male.mean() 223 | coverage_metrics['female_mean_recs'] = recs_df.is_female.mean() 224 | coverage_metrics['eco_mean_recs'] = recs_df.eco_design.mean() 225 | 226 | return coverage_metrics 227 | -------------------------------------------------------------------------------- 
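Note on the shared data structures: src/evaluation.py (above) and src/metrics.py (below) pass recommendations and ground truth around as plain Python dictionaries keyed by user node id. A minimal sketch with made-up node ids:

    # user node id -> top-k recommended item node ids (as returned by get_recs)
    recs = {0: [12, 7, 431], 1: [5, 12, 98]}
    # user node id -> item node ids the user actually interacted with (as built by create_ground_truth)
    ground_truth_dict = {0: [7], 1: [98, 3]}
    # already_bought_dict / already_clicked_dict follow the same user-id -> item-ids layout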
/src/metrics.py: -------------------------------------------------------------------------------- 1 | from src.utils import softmax 2 | from collections import defaultdict 3 | 4 | import numpy as np 5 | import torch 6 | import torch.nn as nn 7 | 8 | def create_ground_truth(users, items): 9 | """ 10 | Creates a dictionary, where the keys are user ids and the values are item ids that the user actually bought. 11 | """ 12 | ground_truth_arr = np.stack((np.asarray(users), np.asarray(items)), axis=1) 13 | ground_truth_dict = defaultdict(list) 14 | for key, val in ground_truth_arr: 15 | ground_truth_dict[key].append(val) 16 | return ground_truth_dict 17 | 18 | 19 | def create_already_bought(g, bought_eids, etype='buys'): 20 | """ 21 | Creates a dictionary, where the keys are user ids and the values are item ids that the user already bought. 22 | """ 23 | users_train, items_train = g.find_edges(bought_eids, etype=etype) 24 | already_bought_arr = np.stack((np.asarray(users_train), np.asarray(items_train)), axis=1) 25 | already_bought_dict = defaultdict(list) 26 | for key, val in already_bought_arr: 27 | already_bought_dict[key].append(val) 28 | return already_bought_dict 29 | 30 | 31 | def get_recs(g, 32 | h, 33 | model, 34 | embed_dim, 35 | k, 36 | user_ids, 37 | already_bought_dict, 38 | remove_already_bought=True, 39 | cuda=False, 40 | device=None, 41 | pred: str = 'cos', 42 | use_popularity: bool = False, 43 | weight_popularity=1 44 | ): 45 | """ 46 | Computes K recommendation for all users, given hidden states, the model and what they already bought. 47 | """ 48 | if cuda: # model is already in cuda? 49 | model = model.to(device) 50 | print('Computing recommendations on {} users, for {} items'.format(len(user_ids), g.num_nodes('item'))) 51 | recs = {} 52 | for user in user_ids: 53 | user_emb = h['user'][user] 54 | already_bought = already_bought_dict[user] 55 | user_emb_rpt = torch.cat(g.num_nodes('item')*[user_emb]).reshape(-1, embed_dim) 56 | 57 | if pred == 'cos': 58 | cos = nn.CosineSimilarity(dim=1, eps=1e-6) 59 | ratings = cos(user_emb_rpt, h['item']) 60 | 61 | elif pred == 'nn': 62 | cat_embed = torch.cat((user_emb_rpt, h['item']), 1) 63 | ratings = model.pred_fn.layer_nn(cat_embed) 64 | 65 | else: 66 | raise KeyError(f'Prediction function {pred} not recognized.') 67 | 68 | ratings_formatted = ratings.cpu().detach().numpy().reshape(g.num_nodes('item'),) 69 | if use_popularity: 70 | softmax_ratings = softmax(ratings_formatted) 71 | popularity_scores = g.ndata['popularity']['item'].numpy().reshape(g.num_nodes('item'),) 72 | ratings_formatted = np.add(softmax_ratings, popularity_scores * weight_popularity) 73 | order = np.argsort(-ratings_formatted) 74 | if remove_already_bought: 75 | order = [item for item in order if item not in already_bought] 76 | rec = order[:k] 77 | recs[user] = rec 78 | return recs 79 | 80 | 81 | def recs_to_metrics(recs, ground_truth_dict, g): 82 | """ 83 | Given the recommendations and the ground_truth, computes precision, recall & coverage. 
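Definitions, matching the accumulation below (micro-averaged over all users):
    precision = (# recommended items found in the user's ground truth) / (# recommendations made)
    recall    = (# ground-truth interactions recovered by the recommendations) / (# ground-truth interactions)
    coverage  = (# distinct items recommended at least once) / (# item nodes in the graph)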
84 | """ 85 | # precision 86 | k_relevant = 0 87 | k_total = 0 88 | for uid, iids in recs.items(): 89 | k_total += len(iids) 90 | k_relevant += len([id_ for id_ in iids if id_ in ground_truth_dict[uid]]) 91 | precision = k_relevant/k_total 92 | 93 | # recall 94 | k_relevant = 0 95 | k_total = 0 96 | for uid, iids in recs.items(): 97 | k_total += len(ground_truth_dict[uid]) 98 | k_relevant += len([id_ for id_ in ground_truth_dict[uid] if id_ in iids]) 99 | recall = k_relevant/k_total 100 | 101 | # coverage 102 | nb_total = g.num_nodes('item') 103 | recs_flatten = [item for sublist in list(recs.values()) for item in sublist] 104 | nb_recommended = len(set(recs_flatten)) 105 | coverage = nb_recommended / nb_total 106 | 107 | return precision, recall, coverage 108 | 109 | 110 | def get_metrics_at_k(h, 111 | g, 112 | model, 113 | embed_dim, 114 | ground_truth, 115 | bought_eids, 116 | k, 117 | remove_already_bought=True, 118 | cuda=False, 119 | device=None, 120 | pred='cos', 121 | use_popularity=False, 122 | weight_popularity=1): 123 | """ 124 | Function combining all previous functions: create already_bought & ground_truth dict, get recs and compute metrics. 125 | """ 126 | already_bought_dict = create_already_bought(g, bought_eids) 127 | users, items = ground_truth 128 | user_ids = np.unique(users).tolist() 129 | ground_truth_dict = create_ground_truth(users, items) 130 | recs = get_recs(g, h, model, embed_dim, k, user_ids, already_bought_dict, 131 | remove_already_bought, cuda, device, pred, use_popularity, weight_popularity) 132 | precision, recall, coverage = recs_to_metrics(recs, ground_truth_dict, g) 133 | 134 | return precision, recall, coverage 135 | 136 | 137 | def MRR_neg_edges(model, 138 | blocks, 139 | pos_g, 140 | neg_g, 141 | etype, 142 | neg_sample_size): 143 | """ 144 | (Currently not used.) Computes Mean Reciprocal Rank for the positive edge, out of all negative edges considered. 145 | 146 | Since it computes only on negative edges considered, it is an heuristic of the real MRR. 147 | However, if you put neg_sample_size as the total number of items, can be considered as MRR 148 | (because it will create as many edges as there are items). 149 | """ 150 | input_features = blocks[0].srcdata['features'] 151 | _, pos_score, neg_score = model(blocks, 152 | input_features, 153 | pos_g, neg_g, 154 | etype) 155 | neg_score = neg_score.reshape(-1, neg_sample_size) 156 | rankings = torch.sum(neg_score >= pos_score, dim=1) + 1 157 | return np.mean(1.0 / rankings.cpu().numpy()) 158 | -------------------------------------------------------------------------------- /src/model.py: -------------------------------------------------------------------------------- 1 | from typing import Tuple 2 | 3 | import torch 4 | import torch.nn as nn 5 | import torch.nn.functional as F 6 | import dgl.nn.pytorch as dglnn 7 | import dgl.function as fn 8 | 9 | 10 | class NodeEmbedding(nn.Module): 11 | """ 12 | Projects the node features into embedding space. 13 | """ 14 | def __init__(self, 15 | in_feats, 16 | out_feats, 17 | ): 18 | super().__init__() 19 | self.proj_feats = nn.Linear(in_feats, out_feats) 20 | 21 | def forward(self, 22 | node_feats): 23 | x = self.proj_feats(node_feats) 24 | return x 25 | 26 | 27 | class ConvLayer(nn.Module): 28 | """ 29 | 1 layer of message passing & aggregation, specific to an edge type. 30 | 31 | Similar to SAGEConv layer in DGL library. 32 | 33 | Methods 34 | ------- 35 | reset_parameters: 36 | Intialize parameters for all neural networks in the layer. 
37 | _lstm_reducer: 38 | Aggregate messages of neighborhood nodes using LSTM technique. (All other message aggregation are builtin 39 | functions of DGL). 40 | forward: 41 | Actual message passing & aggregation, & update of nodes messages. 42 | 43 | """ 44 | 45 | def reset_parameters(self): 46 | gain = nn.init.calculate_gain('relu') 47 | nn.init.xavier_uniform_(self.fc_self.weight, gain=gain) 48 | nn.init.xavier_uniform_(self.fc_neigh.weight, gain=gain) 49 | # nn.init.xavier_uniform_(self.fc_edge.weight, gain=gain) 50 | if self._aggre_type in ['pool_nn', 'pool_nn_edge', 'mean_nn', 'mean_nn_edge']: 51 | nn.init.xavier_uniform_(self.fc_preagg.weight, gain=gain) 52 | if self._aggre_type == 'lstm': 53 | self.lstm.reset_parameters() 54 | 55 | def __init__(self, 56 | in_feats: Tuple[int, int], 57 | out_feats: int, 58 | dropout: float, 59 | aggregator_type: str, 60 | norm, 61 | ): 62 | """ 63 | Initialize the layer with parameters. 64 | 65 | Parameters 66 | ---------- 67 | in_feats: 68 | Dimension of the message (aka features) of the node type neighbor and of the node type. E.g. if the 69 | ConvLayer is initialized for the edge type (user, clicks, item), in_feats should be 70 | (dimension_of_item_features, dimension_of_user_features). Note that usually features will first go 71 | through embedding layer, thus dimension might be equal to the embedding dimension. 72 | out_feats: 73 | Dimension that the output (aka the updated node message) should take. E.g. if the layer is a hidden layer, 74 | out_feats should be hidden_dimension, whereas if the layer is the output layer, out_feats should be 75 | output_dimension. 76 | dropout: 77 | Traditional dropout applied to input features. 78 | aggregator_type: 79 | This is the main parameter of ConvLayer; it defines how messages are passed and aggregated. Multiple 80 | choices: 81 | 'mean' : messages are passed normally, and aggregated by doing the mean of all neighbor messages. 82 | 'mean_nn' : messages are passed through a neural network before being passed to neighbors, and 83 | aggregated by doing the mean of all neighbor messages. 84 | 'pool_nn' : messages are passed through a neural network before being passed to neighbors, and 85 | aggregated by doing the max of all neighbor messages. 86 | 'lstm' : messages are passed normally, and aggregared using _lstm_reducer. 87 | All choices have also their equivalent that ends with _edge (e.g. mean_nn_edge). Those version include 88 | the edge in the message passing, i.e. the message will be multiplied by the value of the edge. 89 | norm: 90 | Apply normalization 91 | """ 92 | super().__init__() 93 | self._in_neigh_feats, self._in_self_feats = in_feats 94 | self._out_feats = out_feats 95 | self._aggre_type = aggregator_type 96 | self.dropout_fn = nn.Dropout(dropout) 97 | self.norm = norm 98 | self.fc_self = nn.Linear(self._in_self_feats, out_feats, bias=False) 99 | self.fc_neigh = nn.Linear(self._in_neigh_feats, out_feats, bias=False) 100 | # self.fc_edge = nn.Linear(1, 1, bias=True) # Projecting recency days 101 | if aggregator_type in ['pool_nn', 'pool_nn_edge', 'mean_nn', 'mean_nn_edge']: 102 | self.fc_preagg = nn.Linear(self._in_neigh_feats, self._in_neigh_feats, bias=False) 103 | if aggregator_type == 'lstm': 104 | self.lstm = nn.LSTM(self._in_neigh_feats, self._in_neigh_feats, batch_first=True) 105 | self.reset_parameters() 106 | 107 | def _lstm_reducer(self, nodes): 108 | """ 109 | Aggregates the neighborhood messages using LSTM technique. 110 | 111 | This was taken from DGL docs. 
For computation optimization, when 'batching' nodes, all nodes 112 | with the same degrees are batched together, i.e. at first all nodes with 1 in-neighbor are computed 113 | (thus m.shape = [number of nodes in the batchs, 1, dimensionality of h]), then all nodes with 2 in- 114 | neighbors, etc. 115 | """ 116 | m = nodes.mailbox['m'] 117 | batch_size = m.shape[0] 118 | h = (m.new_zeros((1, batch_size, self._in_neigh_feats)), 119 | m.new_zeros((1, batch_size, self._in_neigh_feats))) 120 | _, (rst, _) = self.lstm(m, h) 121 | return {'neigh': rst.squeeze(0)} 122 | 123 | def forward(self, 124 | graph, 125 | x): 126 | """ 127 | Message passing & aggregation, & update of node messages. 128 | 129 | Process is the following: 130 | - Messages (h_neigh and h_self) are extracted from x 131 | - Dropout is applied 132 | - Messages are passed and aggregated using the _aggre_type specified (see __init__ for details), which 133 | return updated h_neigh 134 | - h_self passes through neural network & updated h_neigh passes through neural network, and are then summed 135 | up 136 | - The sum (z) passes through Relu 137 | - Normalization is applied 138 | """ 139 | h_neigh, h_self = x 140 | h_neigh = self.dropout_fn(h_neigh) 141 | h_self = self.dropout_fn(h_self) 142 | 143 | if self._aggre_type == 'mean': 144 | graph.srcdata['h'] = h_neigh 145 | graph.update_all( 146 | fn.copy_src('h', 'm'), 147 | fn.mean('m', 'neigh')) 148 | h_neigh = graph.dstdata['neigh'] 149 | 150 | elif self._aggre_type == 'mean_nn': 151 | graph.srcdata['h'] = F.relu(self.fc_preagg(h_neigh)) 152 | graph.update_all( 153 | fn.copy_src('h', 'm'), 154 | fn.mean('m', 'neigh')) 155 | h_neigh = graph.dstdata['neigh'] 156 | 157 | elif self._aggre_type == 'pool_nn': 158 | graph.srcdata['h'] = F.relu(self.fc_preagg(h_neigh)) 159 | graph.update_all( 160 | fn.copy_src('h', 'm'), 161 | fn.max('m', 'neigh')) 162 | h_neigh = graph.dstdata['neigh'] 163 | 164 | elif self._aggre_type == 'lstm': 165 | graph.srcdata['h'] = h_neigh 166 | graph.update_all( 167 | fn.copy_src('h', 'm'), 168 | self._lstm_reducer) 169 | h_neigh = graph.dstdata['neigh'] 170 | 171 | elif self._aggre_type == 'mean_edge': 172 | graph.srcdata['h'] = h_neigh 173 | if graph.canonical_etypes[0][0] in ['user', 'item'] and graph.canonical_etypes[0][2] in ['user', 'item']: 174 | graph.edata['h'] = graph.edata['occurrence'].float().unsqueeze(1) 175 | graph.update_all( 176 | fn.u_mul_e('h', 'h', 'm'), 177 | fn.mean('m', 'neigh')) 178 | else: 179 | graph.update_all( 180 | fn.copy_src('h', 'm'), 181 | fn.mean('m', 'neigh')) 182 | h_neigh = graph.dstdata['neigh'] 183 | 184 | elif self._aggre_type == 'mean_nn_edge': 185 | graph.srcdata['h'] = F.relu(self.fc_preagg(h_neigh)) 186 | if graph.canonical_etypes[0][0] in ['user', 'item'] and graph.canonical_etypes[0][2] in ['user', 'item']: 187 | graph.edata['h'] = graph.edata['occurrence'].float().unsqueeze(1) 188 | graph.update_all( 189 | fn.u_mul_e('h', 'h', 'm'), 190 | fn.mean('m', 'neigh')) 191 | else: 192 | graph.update_all( 193 | fn.copy_src('h', 'm'), 194 | fn.mean('m', 'neigh')) 195 | h_neigh = graph.dstdata['neigh'] 196 | 197 | elif self._aggre_type == 'pool_nn_edge': 198 | graph.srcdata['h'] = F.relu(self.fc_preagg(h_neigh)) 199 | if graph.canonical_etypes[0][0] in ['user', 'item'] and graph.canonical_etypes[0][2] in ['user', 'item']: 200 | graph.edata['h'] = graph.edata['occurrence'].float().unsqueeze(1) 201 | graph.update_all( 202 | fn.u_mul_e('h', 'h', 'm'), 203 | fn.max('m', 'neigh')) 204 | else: 205 | graph.update_all( 206 | 
fn.copy_src('h', 'm'), 207 | fn.max('m', 'neigh')) 208 | h_neigh = graph.dstdata['neigh'] 209 | 210 | elif self._aggre_type == 'lstm_edge': 211 | graph.srcdata['h'] = h_neigh 212 | if graph.canonical_etypes[0][0] in ['user', 'item'] and graph.canonical_etypes[0][2] in ['user', 'item']: 213 | graph.edata['h'] = graph.edata['occurrence'].float().unsqueeze(1) 214 | graph.update_all( 215 | fn.u_mul_e('h', 'h', 'm'), 216 | self._lstm_reducer) 217 | else: 218 | graph.update_all( 219 | fn.copy_src('h', 'm'), 220 | self._lstm_reducer) 221 | h_neigh = graph.dstdata['neigh'] 222 | 223 | else: 224 | raise KeyError('Aggregator type {} not recognized.'.format(self._aggre_type)) 225 | 226 | z = self.fc_self(h_self) + self.fc_neigh(h_neigh) 227 | z = F.relu(z) 228 | 229 | # normalization 230 | if self.norm: 231 | z_norm = z.norm(2, 1, keepdim=True) 232 | z_norm = torch.where(z_norm == 0, 233 | torch.tensor(1.).to(z_norm), 234 | z_norm) 235 | z = z / z_norm 236 | 237 | return z 238 | 239 | 240 | class PredictingLayer(nn.Module): 241 | """ 242 | Scoring function that uses a neural network to compute similarity between user and item. 243 | 244 | Only used if fixed_params.pred == 'nn'. 245 | Given the concatenated hidden states of heads and tails vectors, passes them through neural network and 246 | returns sigmoid ratings. 247 | """ 248 | 249 | def reset_parameters(self): 250 | gain_relu = nn.init.calculate_gain('relu') 251 | gain_sigmoid = nn.init.calculate_gain('sigmoid') 252 | nn.init.xavier_uniform_(self.hidden_1.weight, gain=gain_relu) 253 | nn.init.xavier_uniform_(self.hidden_2.weight, gain=gain_relu) 254 | nn.init.xavier_uniform_(self.output.weight, gain=gain_sigmoid) 255 | 256 | def __init__(self, embed_dim: int): 257 | super(PredictingLayer, self).__init__() 258 | self.hidden_1 = nn.Linear(embed_dim * 2, 128) 259 | self.hidden_2 = nn.Linear(128, 32) 260 | self.output = nn.Linear(32, 1) 261 | self.relu = nn.ReLU() 262 | self.sigmoid = nn.Sigmoid() 263 | self.reset_parameters() 264 | 265 | def forward(self, x): 266 | x = self.hidden_1(x) 267 | x = self.relu(x) 268 | x = self.hidden_2(x) 269 | x = self.relu(x) 270 | x = self.output(x) 271 | x = self.sigmoid(x) 272 | return x 273 | 274 | 275 | class PredictingModule(nn.Module): 276 | """ 277 | Predicting module that incorporate the predicting layer defined earlier. 278 | 279 | Only used if fixed_params.pred == 'nn'. 280 | Process: 281 | - Fetches hidden states of 'heads' and 'tails' of the edges. 282 | - Concatenates them, then passes them through the predicting layer. 283 | - Returns ratings (sigmoid function). 284 | """ 285 | 286 | def __init__(self, predicting_layer, embed_dim: int): 287 | super(PredictingModule, self).__init__() 288 | self.layer_nn = predicting_layer(embed_dim) 289 | 290 | def forward(self, 291 | graph, 292 | h 293 | ): 294 | ratings_dict = {} 295 | for etype in graph.canonical_etypes: 296 | if etype[0] in ['user', 'item'] and etype[2] in ['user', 'item']: 297 | utype, _, vtype = etype 298 | src_nid, dst_nid = graph.all_edges(etype=etype) 299 | emb_heads = h[utype][src_nid] 300 | emb_tails = h[vtype][dst_nid] 301 | cat_embed = torch.cat((emb_heads, emb_tails), 1) 302 | ratings = self.layer_nn(cat_embed) 303 | ratings_dict[etype] = torch.flatten(ratings) 304 | ratings_dict = {key: torch.unsqueeze(ratings_dict[key], 1) for key in ratings_dict.keys()} 305 | return ratings_dict 306 | 307 | 308 | class CosinePrediction(nn.Module): 309 | """ 310 | Scoring function that uses cosine similarity to compute similarity between user and item. 
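For an edge (u, v), the score is the cosine similarity of the two hidden states, cos(h_u, h_v) = h_u · h_v / (||h_u|| ||h_v||), implemented below by L2-normalizing both embeddings and taking their dot product with fn.u_dot_v.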
311 | 312 | Only used if fixed_params.pred == 'cos'. 313 | """ 314 | def __init__(self): 315 | super().__init__() 316 | 317 | def forward(self, graph, h): 318 | with graph.local_scope(): 319 | for etype in graph.canonical_etypes: 320 | try: 321 | graph.nodes[etype[0]].data['norm_h'] = F.normalize(h[etype[0]], p=2, dim=-1) 322 | graph.nodes[etype[2]].data['norm_h'] = F.normalize(h[etype[2]], p=2, dim=-1) 323 | graph.apply_edges(fn.u_dot_v('norm_h', 'norm_h', 'cos'), etype=etype) 324 | except KeyError: 325 | pass # For etypes that are not in training eids, thus have no 'h' 326 | ratings = graph.edata['cos'] 327 | return ratings 328 | 329 | 330 | class ConvModel(nn.Module): 331 | """ 332 | Assembles embedding layers, multiple ConvLayers and chosen predicting function into a full model. 333 | 334 | """ 335 | 336 | def __init__(self, 337 | g, 338 | n_layers: int, 339 | dim_dict, 340 | norm: bool = True, 341 | dropout: float = 0.0, 342 | aggregator_type: str = 'mean', 343 | pred: str = 'cos', 344 | aggregator_hetero: str = 'sum', 345 | embedding_layer: bool = True, 346 | ): 347 | """ 348 | Initialize the ConvModel. 349 | 350 | Parameters 351 | ---------- 352 | g: 353 | Graph, only used to query graph metastructure (fetch node types and edge types). 354 | n_layers: 355 | Number of ConvLayer. 356 | dim_dict: 357 | Dictionary with dimension for all input nodes, hidden dimension (aka embedding dimension), output dimension. 358 | norm, dropout, aggregator_type: 359 | See ConvLayer for details. 360 | aggregator_hetero: 361 | Since we are working with heterogeneous graph, all nodes will have messages coming from different types of 362 | nodes. However, the neighborhood messages are specific to a node type. Thus, we have to aggregate 363 | neighborhood messages from different edge types. 364 | Choices are 'mean', 'sum', 'max'. 365 | embedding_layer: 366 | Some GNN papers explicitly define an embedding layer, whereas other papers consider the first ConvLayer 367 | as the "embedding" layer. If true, an explicit embedding layer will be defined (using NodeEmbedding). If 368 | false, the first ConvLayer will have input dimensions equal to node features. 
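        As an illustration of the expected format of dim_dict (the numbers are hypothetical and depend on the
        dataset):

            dim_dict = {'user': 128,    # number of user input features
                        'item': 94,     # number of item input features
                        'sport': 32,    # only needed if 'sport' nodes are present in the graph
                        'hidden': 256,  # hidden (embedding) dimension
                        'out': 128}     # dimension of the final node representations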
369 | 370 | """ 371 | super().__init__() 372 | self.embedding_layer = embedding_layer 373 | if embedding_layer: 374 | self.user_embed = NodeEmbedding(dim_dict['user'], dim_dict['hidden']) 375 | self.item_embed = NodeEmbedding(dim_dict['item'], dim_dict['hidden']) 376 | if 'sport' in g.ntypes: 377 | self.sport_embed = NodeEmbedding(dim_dict['sport'], dim_dict['hidden']) 378 | 379 | self.layers = nn.ModuleList() 380 | 381 | # input layer 382 | if not embedding_layer: 383 | self.layers.append( 384 | dglnn.HeteroGraphConv( 385 | {etype[1]: ConvLayer((dim_dict[etype[0]], dim_dict[etype[2]]), dim_dict['hidden'], dropout, 386 | aggregator_type, norm) 387 | for etype in g.canonical_etypes}, 388 | aggregate=aggregator_hetero) 389 | ) 390 | 391 | # hidden layers 392 | for i in range(n_layers - 2): 393 | self.layers.append( 394 | dglnn.HeteroGraphConv( 395 | {etype[1]: ConvLayer((dim_dict['hidden'], dim_dict['hidden']), dim_dict['hidden'], dropout, 396 | aggregator_type, norm) 397 | for etype in g.canonical_etypes}, 398 | aggregate=aggregator_hetero)) 399 | 400 | # output layer 401 | self.layers.append( 402 | dglnn.HeteroGraphConv( 403 | {etype[1]: ConvLayer((dim_dict['hidden'], dim_dict['hidden']), dim_dict['out'], dropout, 404 | aggregator_type, norm) 405 | for etype in g.canonical_etypes}, 406 | aggregate=aggregator_hetero)) 407 | 408 | if pred == 'cos': 409 | self.pred_fn = CosinePrediction() 410 | elif pred == 'nn': 411 | self.pred_fn = PredictingModule(PredictingLayer, dim_dict['out']) 412 | else: 413 | raise KeyError('Prediction function {} not recognized.'.format(pred)) 414 | 415 | def get_repr(self, 416 | blocks, 417 | h): 418 | for i in range(len(blocks)): 419 | layer = self.layers[i] 420 | h = layer(blocks[i], h) 421 | return h 422 | 423 | def forward(self, 424 | blocks, 425 | h, 426 | pos_g, 427 | neg_g, 428 | embedding_layer: bool = True, 429 | ): 430 | """ 431 | Full pass through the ConvModel. 432 | 433 | Process: 434 | - Embedding layer 435 | - get_repr: As many ConvLayer as wished. All "Layers" are composed of as many ConvLayer as there 436 | are edge types. 437 | - Predicting layer predicts score for all positive examples and all negative examples. 438 | 439 | Parameters 440 | ---------- 441 | blocks: 442 | Computational blocks. Can be thought of as subgraphs. There are as many blocks as there are layers. 443 | h: 444 | Initial state of all nodes. 445 | pos_g: 446 | Positive graph, generated by the EdgeDataLoader. Contains all positive examples of the batch that need to 447 | be scored. 448 | neg_g: 449 | Negative graph, generated by the EdgeDataLoader. For all positive pairs in the pos_g, multiple negative 450 | pairs were generated. They are all scored. 451 | 452 | Returns 453 | ------- 454 | h: 455 | Updated state of all nodes 456 | pos_score: 457 | All scores between positive examples (aka positive pairs). 458 | neg_score: 459 | All score between negative examples. 
460 |
461 | """
462 | if embedding_layer:
463 | h['user'] = self.user_embed(h['user'])
464 | h['item'] = self.item_embed(h['item'])
465 | if 'sport' in h.keys():
466 | h['sport'] = self.sport_embed(h['sport'])
467 | h = self.get_repr(blocks, h)
468 | pos_score = self.pred_fn(pos_g, h)
469 | neg_score = self.pred_fn(neg_g, h)
470 | return h, pos_score, neg_score
471 |
472 |
473 | def max_margin_loss(pos_score,
474 | neg_score,
475 | delta: float,
476 | neg_sample_size: int,
477 | use_recency: bool = False,
478 | recency_scores=None,
479 | remove_false_negative: bool = False,
480 | negative_mask=None,
481 | cuda=False,
482 | device=None
483 | ):
484 | """
485 | Simple max margin loss.
486 |
487 | Parameters
488 | ----------
489 | pos_score:
490 | All similarity scores for positive examples.
491 | neg_score:
492 | All similarity scores for negative examples.
493 | delta:
494 | Margin by which pos_score should be higher than each of its corresponding neg_score.
495 | neg_sample_size:
496 | Number of negative examples to generate for each positive example.
497 | See main.SearchableHyperparameters for more details.
498 | use_recency:
499 | If true, loss will be divided by the recency, i.e. more recent positive examples will be given a
500 | greater weight in the total loss.
501 | See main.SearchableHyperparameters for more details.
502 | recency_scores:
503 | Recency scores for all training edges. The loss is divided by them if use_recency == True.
504 | remove_false_negative:
505 | When generating negative examples, it is possible that a random negative example is actually in the graph,
506 | i.e. it should not be a negative example. If true, those will be removed.
507 | negative_mask:
508 | For each negative example, an indicator of whether it is a false negative.
509 | """
510 | all_scores = torch.empty(0)
511 | if cuda:
512 | all_scores = all_scores.to(device)
513 | for etype in pos_score.keys():
514 | neg_score_tensor = neg_score[etype]
515 | pos_score_tensor = pos_score[etype]
516 | neg_score_tensor = neg_score_tensor.reshape(-1, neg_sample_size)
517 | if remove_false_negative:
518 | negative_mask_tensor = negative_mask[etype].reshape(-1, neg_sample_size)
519 | else:
520 | negative_mask_tensor = torch.zeros(size=neg_score_tensor.shape)
521 | if cuda:
522 | negative_mask_tensor = negative_mask_tensor.to(device)
523 | scores = neg_score_tensor + delta - pos_score_tensor - negative_mask_tensor
524 | relu = nn.ReLU()
525 | scores = relu(scores)
526 | if use_recency:
527 | try:
528 | recency_scores_tensor = recency_scores[etype]
529 | scores = scores / torch.unsqueeze(recency_scores_tensor, 1)
530 | except KeyError: # Not all edge types have recency. Only training edges have recency (i.e. clicks & buys)
531 | pass
532 | all_scores = torch.cat((all_scores, scores), 0)
533 | return torch.mean(all_scores)
534 |
-------------------------------------------------------------------------------- /src/sampling.py: --------------------------------------------------------------------------------
1 | import dgl
2 | import numpy as np
3 |
4 |
5 | def train_valid_split(valid_graph: dgl.DGLHeteroGraph,
6 | ground_truth_test,
7 | etypes,
8 | subtrain_size,
9 | valid_size,
10 | reverse_etype,
11 | train_on_clicks,
12 | remove_train_eids,
13 | clicks_sample=1,
14 | purchases_sample=1,
15 | ):
16 | """
17 | Using the full graph, sample train_graph and eids of edges for train & validation, as well as nids for test.
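    As a rough numerical illustration of the split implemented below (hypothetical numbers: valid_size=0.05 and
    1000 chronologically ordered edges of a given type):

        all_eids = np.arange(1000)
        valid_eids = all_eids[int(len(all_eids) * (1 - 0.05)):]  # the 50 most recent edges go to validation
        # the remaining 950 edges stay available for the training graph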
18 |
19 | Process:
20 | - Validation
21 | - valid_eids are the most recent X edges of all eids of the graph (based on valid_size)
22 | - valid_uids and valid_iids are the user ids and item ids associated with those edges (and together they form the
23 | ground_truth)
24 | - Training graph & eids
25 | - All edges and reverse edges of valid_eids are removed from the full graph.
26 | - train_eids are all remaining edges.
27 | - Sampling of training eids
28 | - It might be relevant to have numerous edges in the training graph to do message passing, but to
29 | optimize the model to give high scores only to recent interactions (to help with seasonality)
30 | - Thus, if purchases_sample or clicks_sample are < 1, only the most recent X edges are kept in the
31 | train_eids dict
32 | - An extra option is available to ensure that no information leakage appears: remove_train_eids. If true,
33 | all eids in the train_eids dict will be removed from the graph. (Otherwise, information leakage is still
34 | handled by the EdgeDataLoader: sampled edges are removed from the computation blocks.) Based on
35 | experience, it is best to set remove_train_eids to False.
36 | - Computing metrics on training set: subtrain nids
37 | - To compute metrics on the training set, we sample a "subtrain set". We need the ground_truth for
38 | the subtrain, as well as node ids for all users and items in the subtrain set.
39 | - Computing metrics on test set
40 | - We need node ids for all users and items in the test set (so we can fetch their embeddings during
41 | recommendations)
42 |
43 | """
44 | np.random.seed(11)
45 |
46 | all_eids_dict = {}
47 | valid_eids_dict = {}
48 | train_eids_dict = {}
49 | valid_uids_all = []
50 | valid_iids_all = []
51 | for etype in etypes:
52 | all_eids = np.arange(valid_graph.number_of_edges(etype))
53 | valid_eids = all_eids[int(len(all_eids) * (1 - valid_size)):]
54 | valid_uids, valid_iids = valid_graph.find_edges(valid_eids, etype=etype)
55 | valid_uids_all.extend(valid_uids.tolist())
56 | valid_iids_all.extend(valid_iids.tolist())
57 | all_eids_dict[etype] = all_eids
58 | if (etype == ('user', 'buys', 'item')) or (etype == ('user', 'clicks', 'item') and train_on_clicks):
59 | valid_eids_dict[etype] = valid_eids
60 | ground_truth_valid = (np.array(valid_uids_all), np.array(valid_iids_all))
61 | valid_uids = np.array(np.unique(valid_uids_all))
62 |
63 | # Create partial graph
64 | train_graph = valid_graph.clone()
65 | for etype in etypes:
66 | if (etype == ('user', 'buys', 'item')) or (etype == ('user', 'clicks', 'item') and train_on_clicks):
67 | train_graph.remove_edges(valid_eids_dict[etype], etype=etype)
68 | train_graph.remove_edges(valid_eids_dict[etype], etype=reverse_etype[etype])
69 | train_eids = np.arange(train_graph.number_of_edges(etype))
70 | train_eids_dict[etype] = train_eids
71 |
72 | if purchases_sample != 1:
73 | eids = train_eids_dict[('user', 'buys', 'item')]
74 | train_eids_dict[('user', 'buys', 'item')] = eids[int(len(eids) * (1 - purchases_sample)):]
75 | eids = valid_eids_dict[('user', 'buys', 'item')]
76 | valid_eids_dict[('user', 'buys', 'item')] = eids[int(len(eids) * (1 - purchases_sample)):]
77 |
78 | if clicks_sample != 1 and ('user', 'clicks', 'item') in train_eids_dict.keys():
79 | eids = train_eids_dict[('user', 'clicks', 'item')]
80 | train_eids_dict[('user', 'clicks', 'item')] = eids[int(len(eids) * (1 - clicks_sample)):]
81 | eids = valid_eids_dict[('user', 'clicks', 'item')]
82 | valid_eids_dict[('user', 'clicks', 'item')] = eids[int(len(eids) * (1 -
clicks_sample)):] 83 | 84 | if remove_train_eids: 85 | train_graph.remove_edges(train_eids_dict[etype], etype=etype) 86 | train_graph.remove_edges(train_eids_dict[etype], etype=reverse_etype[etype]) 87 | 88 | # Generate inference nodes for subtrain & ground truth for subtrain 89 | ## Choose the subsample of training set. For now, only users with purchases are included. 90 | train_uids, train_iids = valid_graph.find_edges(train_eids_dict[etypes[0]], etype=etypes[0]) 91 | unique_train_uids = np.unique(train_uids) 92 | subtrain_uids = np.random.choice(unique_train_uids, int(len(unique_train_uids) * subtrain_size), replace=False) 93 | ## Fetch uids and iids of subtrain sample for all etypes 94 | subtrain_uids_all = [] 95 | subtrain_iids_all = [] 96 | for etype in train_eids_dict.keys(): 97 | train_uids, train_iids = valid_graph.find_edges(train_eids_dict[etype], etype=etype) 98 | subtrain_eids = [] 99 | for i in range(len(train_eids_dict[etype])): 100 | if train_uids[i].item() in subtrain_uids: 101 | subtrain_eids.append(train_eids_dict[etype][i].item()) 102 | subtrain_uids, subtrain_iids = valid_graph.find_edges(subtrain_eids, etype=etype) 103 | subtrain_uids_all.extend(subtrain_uids.tolist()) 104 | subtrain_iids_all.extend(subtrain_iids.tolist()) 105 | ground_truth_subtrain = (np.array(subtrain_uids_all), np.array(subtrain_iids_all)) 106 | subtrain_uids = np.array(np.unique(subtrain_uids_all)) 107 | 108 | # Generate inference nodes for test 109 | test_uids, _ = ground_truth_test 110 | test_uids = np.unique(test_uids) 111 | all_iids = np.arange(valid_graph.num_nodes('item')) 112 | 113 | return train_graph, train_eids_dict, valid_eids_dict, subtrain_uids, valid_uids, test_uids, \ 114 | all_iids, ground_truth_subtrain, ground_truth_valid, all_eids_dict 115 | 116 | 117 | def generate_dataloaders(valid_graph, 118 | train_graph, 119 | train_eids_dict, 120 | valid_eids_dict, 121 | subtrain_uids, 122 | valid_uids, 123 | test_uids, 124 | all_iids, 125 | fixed_params, 126 | num_workers, 127 | all_sids=None, 128 | embedding_layer: bool = True, 129 | **params, 130 | ): 131 | """ 132 | Since data is large, it is fed to the model in batches. This creates batches for train, valid & test. 133 | 134 | Process: 135 | - Set up 136 | - Fix the number of layers. If there is an explicit embedding layer, we need 1 less layer in the blocks. 137 | - The sampler will generate computation blocks. Currently, only 'full' sampler is used, meaning that all 138 | nodes have all their neighbors, but one could specify 'partial' neighborhood to have only message passing 139 | with a limited number of neighbors. 140 | - The negative sampler generates K negative samples for all positive examples in the batch. 141 | - Edgeloader_train 142 | - All train_eids will be batched, using the training graph. Sampled edge and their reverse etype will be 143 | removed from computation blocks. 144 | - If remove_train_eids, the graph used for sampling will not have the train_eids as edges. (Thus, a 145 | different graph as g_sampling) 146 | - Edgeloader_valid 147 | - All valid_eids will be batched. 148 | - Nodeloaders 149 | - When computing metrics, we want to compute embeddings for all nodes of interest. Thus, we use 150 | a NodeDataLoader instead of an EdgeDataLoader. 151 | - We have a nodeloader for subtrain, validation and test. 
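    As a sketch of how the returned loaders are consumed downstream (see src/train/run.py): edge loaders yield a
    positive graph, a negative graph and the computation blocks per batch, while node loaders yield the input
    nodes, the output nodes and the blocks.

        for _, pos_g, neg_g, blocks in edgeloader_train:
            ...  # score positive & negative edges, compute the loss
        for input_nodes, output_nodes, blocks in nodeloader_test:
            ...  # compute embeddings for output_nodes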
152 | """ 153 | n_layers = params['n_layers'] 154 | if embedding_layer: 155 | n_layers = n_layers - 1 156 | if fixed_params.neighbor_sampler == 'full': 157 | sampler = dgl.dataloading.MultiLayerFullNeighborSampler(n_layers) 158 | elif fixed_params.neighbor_sampler == 'partial': 159 | sampler = dgl.dataloading.MultiLayerNeighborSampler([1, 1, 1], replace=False) 160 | else: 161 | raise KeyError('Neighbor sampler {} not recognized.'.format(fixed_params.neighbor_sampler)) 162 | 163 | sampler_n = dgl.dataloading.negative_sampler.Uniform( 164 | params['neg_sample_size'] 165 | ) 166 | 167 | if fixed_params.remove_train_eids: 168 | edgeloader_train = dgl.dataloading.EdgeDataLoader( 169 | valid_graph, 170 | train_eids_dict, 171 | sampler, 172 | g_sampling=train_graph, 173 | negative_sampler=sampler_n, 174 | batch_size=fixed_params.edge_batch_size, 175 | shuffle=True, 176 | drop_last=False, # Drop last batch if non-full 177 | pin_memory=True, # Helps the transfer to GPU 178 | num_workers=num_workers, 179 | ) 180 | else: 181 | edgeloader_train = dgl.dataloading.EdgeDataLoader( 182 | train_graph, 183 | train_eids_dict, 184 | sampler, 185 | exclude='reverse_types', 186 | reverse_etypes={'buys': 'bought-by', 'bought-by': 'buys', 187 | 'clicks': 'clicked-by', 'clicked-by': 'clicks'}, 188 | negative_sampler=sampler_n, 189 | batch_size=fixed_params.edge_batch_size, 190 | shuffle=True, 191 | drop_last=False, 192 | pin_memory=True, 193 | num_workers=num_workers, 194 | ) 195 | 196 | edgeloader_valid = dgl.dataloading.EdgeDataLoader( 197 | valid_graph, 198 | valid_eids_dict, 199 | sampler, 200 | g_sampling=train_graph, 201 | negative_sampler=sampler_n, 202 | batch_size=fixed_params.edge_batch_size, 203 | shuffle=True, 204 | drop_last=False, 205 | pin_memory=True, 206 | num_workers=num_workers, 207 | ) 208 | 209 | nodeloader_subtrain = dgl.dataloading.NodeDataLoader( 210 | train_graph, 211 | {'user': subtrain_uids, 'item': all_iids}, 212 | sampler, 213 | batch_size=fixed_params.node_batch_size, 214 | shuffle=True, 215 | drop_last=False, 216 | num_workers=num_workers, 217 | ) 218 | 219 | nodeloader_valid = dgl.dataloading.NodeDataLoader( 220 | train_graph, 221 | {'user': valid_uids, 'item': all_iids}, 222 | sampler, 223 | batch_size=fixed_params.node_batch_size, 224 | shuffle=True, 225 | drop_last=False, 226 | num_workers=num_workers, 227 | ) 228 | 229 | test_node_ids = {'user': test_uids, 'item': all_iids} 230 | if 'sport' in valid_graph.ntypes: 231 | test_node_ids['sport'] = all_sids 232 | 233 | nodeloader_test = dgl.dataloading.NodeDataLoader( 234 | valid_graph, 235 | test_node_ids, 236 | sampler, 237 | batch_size=fixed_params.node_batch_size, 238 | shuffle=True, 239 | drop_last=False, 240 | num_workers=num_workers 241 | ) 242 | 243 | return edgeloader_train, edgeloader_valid, nodeloader_subtrain, nodeloader_valid, nodeloader_test 244 | -------------------------------------------------------------------------------- /src/train/run.py: -------------------------------------------------------------------------------- 1 | from datetime import timedelta 2 | import time 3 | 4 | import dgl 5 | import torch 6 | 7 | from src.metrics import get_metrics_at_k 8 | from src.utils import save_txt 9 | 10 | 11 | def train_model(model, 12 | num_epochs, 13 | num_batches_train, 14 | num_batches_val_loss, 15 | edgeloader_train, 16 | edgeloader_valid, 17 | loss_fn, 18 | delta, 19 | neg_sample_size, 20 | use_recency=False, 21 | cuda=False, 22 | device=None, 23 | optimizer=torch.optim.Adam, 24 | lr=0.001, 25 | get_metrics=False, 26 
| train_graph=None, 27 | valid_graph=None, 28 | nodeloader_valid=None, 29 | nodeloader_subtrain=None, 30 | k=None, 31 | out_dim=None, 32 | num_batches_val_metrics=None, 33 | num_batches_subtrain=None, 34 | bought_eids=None, 35 | ground_truth_subtrain=None, 36 | ground_truth_valid=None, 37 | remove_already_bought=True, 38 | result_filepath=None, 39 | start_epoch=0, 40 | patience=5, 41 | pred=None, 42 | use_popularity=False, 43 | weight_popularity=1, 44 | remove_false_negative=False, 45 | embedding_layer=True, 46 | ): 47 | """ 48 | Main function to train a GNN, using max margin loss on positive and negative examples. 49 | 50 | Process: 51 | - A full training epoch 52 | - Batch by batch. 1 batch is composed of multiple computational blocks, required to compute embeddings 53 | for all the nodes related to the edges in the batch. 54 | - Input the initial features. Compute the embeddings & the positive and negative scores 55 | - Also compute other considerations for the loss function: negative mask, recency scores 56 | - Loss is returned, then backward, then step. 57 | - Metrics are computed on the subtraining set (using nodeloader) 58 | - Validation set 59 | - Loss is computed (in model.eval() mode) for validation edge for early stopping purposes 60 | - Also, metrics are computed on the validation set (using nodeloader) 61 | - Logging & early stopping 62 | - Everything is logged, best metrics are saved. 63 | - Using the patience parameter, early stopping is applied when val_loss stops going down. 64 | """ 65 | model.train_loss_list = [] 66 | model.train_precision_list = [] 67 | model.train_recall_list = [] 68 | model.train_coverage_list = [] 69 | model.val_loss_list = [] 70 | model.val_precision_list = [] 71 | model.val_recall_list = [] 72 | model.val_coverage_list = [] 73 | best_metrics = {} # For visualization 74 | max_metric = -0.1 75 | patience_counter = 0 # For early stopping 76 | min_loss = 1.1 77 | 78 | opt = optimizer(model.parameters(), 79 | lr=lr) 80 | 81 | # TRAINING 82 | print('Starting training.') 83 | for epoch in range(start_epoch, num_epochs): 84 | start_time = time.time() 85 | print('TRAINING LOSS') 86 | model.train() # Because if not, after eval, dropout would be still be inactive 87 | i = 0 88 | total_loss = 0 89 | for _, pos_g, neg_g, blocks in edgeloader_train: 90 | opt.zero_grad() 91 | 92 | # Negative mask 93 | negative_mask = {} 94 | if remove_false_negative: 95 | nids = neg_g.ndata[dgl.NID] 96 | for etype in pos_g.canonical_etypes: 97 | neg_src, neg_dst = neg_g.edges(etype=etype) 98 | neg_src = nids[etype[0]][neg_src] 99 | neg_dst = nids[etype[2]][neg_dst] 100 | negative_mask_tensor = valid_graph.has_edges_between(neg_src, neg_dst, etype=etype) 101 | negative_mask[etype] = negative_mask_tensor.type(torch.float) 102 | if cuda: 103 | negative_mask[etype] = negative_mask[etype].to(device) 104 | if cuda: 105 | blocks = [b.to(device) for b in blocks] 106 | pos_g = pos_g.to(device) 107 | neg_g = neg_g.to(device) 108 | 109 | i += 1 110 | if i % 10 == 0: 111 | print("Edge batch {} out of {}".format(i, num_batches_train)) 112 | input_features = blocks[0].srcdata['features'] 113 | # recency (TO BE CLEANED) 114 | recency_scores = None 115 | if use_recency: 116 | recency_scores = pos_g.edata['recency'] 117 | 118 | _, pos_score, neg_score = model(blocks, 119 | input_features, 120 | pos_g, 121 | neg_g, 122 | embedding_layer, 123 | ) 124 | loss = loss_fn(pos_score, 125 | neg_score, 126 | delta, 127 | neg_sample_size, 128 | use_recency=use_recency, 129 | recency_scores=recency_scores, 
130 | remove_false_negative=remove_false_negative, 131 | negative_mask=negative_mask, 132 | cuda=cuda, 133 | device=device, 134 | ) 135 | 136 | if epoch > 0: # For the epoch 0, no training (just report loss) 137 | loss.backward() 138 | opt.step() 139 | total_loss += loss.item() 140 | 141 | if epoch == 0 and i > 10: 142 | break # For the epoch 0, report loss on only subset 143 | 144 | train_avg_loss = total_loss / i 145 | model.train_loss_list.append(train_avg_loss) 146 | 147 | print('VALIDATION LOSS') 148 | model.eval() 149 | with torch.no_grad(): 150 | total_loss = 0 151 | i = 0 152 | for _, pos_g, neg_g, blocks in edgeloader_valid: 153 | i += 1 154 | if i % 10 == 0: 155 | print("Edge batch {} out of {}".format(i, num_batches_val_loss)) 156 | 157 | # Negative mask 158 | negative_mask = {} 159 | if remove_false_negative: 160 | nids = neg_g.ndata[dgl.NID] 161 | for etype in pos_g.canonical_etypes: 162 | neg_src, neg_dst = neg_g.edges(etype=etype) 163 | neg_src = nids[etype[0]][neg_src] 164 | neg_dst = nids[etype[2]][neg_dst] 165 | negative_mask_tensor = valid_graph.has_edges_between(neg_src, neg_dst, etype=etype) 166 | negative_mask[etype] = negative_mask_tensor.type(torch.float) 167 | if cuda: 168 | negative_mask[etype] = negative_mask[etype].to(device) 169 | 170 | if cuda: 171 | blocks = [b.to(device) for b in blocks] 172 | pos_g = pos_g.to(device) 173 | neg_g = neg_g.to(device) 174 | 175 | input_features = blocks[0].srcdata['features'] 176 | _, pos_score, neg_score = model(blocks, 177 | input_features, 178 | pos_g, 179 | neg_g, 180 | embedding_layer, 181 | ) 182 | # recency (TO BE CLEANED) 183 | recency_scores = None 184 | if use_recency: 185 | recency_scores = pos_g.edata['recency'] 186 | 187 | val_loss = loss_fn(pos_score, 188 | neg_score, 189 | delta, 190 | neg_sample_size, 191 | use_recency=use_recency, 192 | recency_scores=recency_scores, 193 | remove_false_negative=remove_false_negative, 194 | negative_mask=negative_mask, 195 | cuda=cuda, 196 | device=device, 197 | ) 198 | total_loss += val_loss.item() 199 | # print(val_loss.item()) 200 | val_avg_loss = total_loss / i 201 | model.val_loss_list.append(val_avg_loss) 202 | 203 | ############ 204 | # METRICS PER EPOCH 205 | if get_metrics and epoch % 10 == 1: 206 | model.eval() 207 | with torch.no_grad(): 208 | # training metrics 209 | print('TRAINING METRICS') 210 | y = get_embeddings(train_graph, 211 | out_dim, 212 | model, 213 | nodeloader_subtrain, 214 | num_batches_subtrain, 215 | cuda, 216 | device, 217 | embedding_layer, 218 | ) 219 | 220 | train_precision, train_recall, train_coverage = get_metrics_at_k(y, 221 | train_graph, 222 | model, 223 | out_dim, 224 | ground_truth_subtrain, 225 | bought_eids, 226 | k, 227 | False, # Remove already bought 228 | cuda, 229 | device, 230 | pred, 231 | use_popularity, 232 | weight_popularity) 233 | 234 | # validation metrics 235 | print('VALIDATION METRICS') 236 | y = get_embeddings(valid_graph, 237 | out_dim, 238 | model, 239 | nodeloader_valid, 240 | num_batches_val_metrics, 241 | cuda, 242 | device, 243 | embedding_layer, 244 | ) 245 | 246 | val_precision, val_recall, val_coverage = get_metrics_at_k(y, 247 | valid_graph, 248 | model, 249 | out_dim, 250 | ground_truth_valid, 251 | bought_eids, 252 | k, 253 | remove_already_bought, 254 | cuda, 255 | device, 256 | pred, 257 | use_popularity, 258 | weight_popularity 259 | ) 260 | sentence = '''Epoch {:05d} || TRAINING Loss {:.5f} | Precision {:.3f}% | Recall {:.3f}% | Coverage {:.2f}% 261 | || VALIDATION Loss {:.5f} | Precision {:.3f}% | 
Recall {:.3f}% | Coverage {:.2f}% '''.format( 262 | epoch, train_avg_loss, train_precision * 100, train_recall * 100, train_coverage * 100, 263 | val_avg_loss, val_precision * 100, val_recall * 100, val_coverage * 100) 264 | print(sentence) 265 | save_txt(sentence, result_filepath, mode='a') 266 | 267 | model.train_precision_list.append(train_precision * 100) 268 | model.train_recall_list.append(train_recall * 100) 269 | model.train_coverage_list.append(train_coverage * 10) 270 | model.val_precision_list.append(val_precision * 100) 271 | model.val_recall_list.append(val_recall * 100) 272 | model.val_coverage_list.append(val_coverage * 10) # just *10 for viz purposes 273 | 274 | # Visualization of best metric 275 | if val_recall > max_metric: 276 | max_metric = val_recall 277 | best_metrics = {'recall': val_recall, 'precision': val_precision, 'coverage': val_coverage} 278 | 279 | else: 280 | sentence = "Epoch {:05d} | Training Loss {:.5f} | Validation Loss {:.5f} | ".format( 281 | epoch, train_avg_loss, val_avg_loss) 282 | print(sentence) 283 | save_txt(sentence, result_filepath, mode='a') 284 | 285 | if val_avg_loss < min_loss: 286 | min_loss = val_avg_loss 287 | patience_counter = 0 288 | else: 289 | patience_counter += 1 290 | if patience_counter == patience: 291 | break 292 | 293 | elapsed = time.time() - start_time 294 | result_to_save = f'Epoch took {timedelta(seconds=elapsed)} \n' 295 | print(result_to_save) 296 | save_txt(result_to_save, result_filepath, mode='a') 297 | 298 | viz = {'train_loss_list': model.train_loss_list, 299 | 'train_precision_list': model.train_precision_list, 300 | 'train_recall_list': model.train_recall_list, 301 | 'train_coverage_list': model.train_coverage_list, 302 | 'val_loss_list': model.val_loss_list, 303 | 'val_precision_list': model.val_precision_list, 304 | 'val_recall_list': model.val_recall_list, 305 | 'val_coverage_list': model.val_coverage_list} 306 | 307 | print('Training completed.') 308 | return model, viz, best_metrics # model will already be to 'cuda' device? 309 | 310 | 311 | def get_embeddings(g, 312 | out_dim: int, 313 | trained_model, 314 | nodeloader_test, 315 | num_batches_valid: int, 316 | cuda: bool = False, 317 | device=None, 318 | embedding_layer: bool = True): 319 | """ 320 | Fetch the embeddings for all the nodes in the nodeloader. 321 | 322 | Nodeloader is preferable when computing embeddings because we can specify which nodes to compute the embedding for, 323 | and only have relevant nodes in the computational blocks. Whereas Edgeloader is preferable for training, because 324 | we generate negative edges also. 325 | """ 326 | if cuda: # model is already on device? 
327 | trained_model = trained_model.to(device) 328 | i2 = 0 329 | y = {ntype: torch.zeros(g.num_nodes(ntype), out_dim) 330 | for ntype in g.ntypes} 331 | if cuda: # not sure if I need to put the 'result' tensor to device 332 | y = {ntype: torch.zeros(g.num_nodes(ntype), out_dim).to(device) 333 | for ntype in g.ntypes} 334 | for input_nodes, output_nodes, blocks in nodeloader_test: 335 | i2 += 1 336 | if i2 % 10 == 0: 337 | print("Computing embeddings: Batch {} out of {}".format(i2, num_batches_valid)) 338 | if cuda: 339 | blocks = [b.to(device) for b in blocks] 340 | input_features = blocks[0].srcdata['features'] 341 | if embedding_layer: 342 | input_features['user'] = trained_model.user_embed(input_features['user']) 343 | input_features['item'] = trained_model.item_embed(input_features['item']) 344 | if 'sport' in input_features.keys(): 345 | input_features['sport'] = trained_model.sport_embed(input_features['sport']) 346 | h = trained_model.get_repr(blocks, input_features) 347 | for ntype in h.keys(): 348 | y[ntype][output_nodes[ntype]] = h[ntype] 349 | return y 350 | -------------------------------------------------------------------------------- /src/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | import pandas as pd 4 | import pickle 5 | 6 | 7 | def save_txt(data_to_save, filepath, mode='a'): 8 | """ 9 | Save text to a file. 10 | """ 11 | with open(filepath, mode) as text_file: 12 | text_file.write(data_to_save + '\n') 13 | 14 | 15 | def save_outputs(files_to_save: dict, 16 | folder_path): 17 | """ 18 | Save objects as pickle files, in a given folder. 19 | """ 20 | for name, file in files_to_save.items(): 21 | with open(folder_path + name + '.pkl', 'wb') as f: 22 | pickle.dump(file, f) 23 | 24 | 25 | def get_last_checkpoint(): 26 | """ 27 | Fetch path of last checkpoint available in the root folder, based on the date in the filename. 28 | """ 29 | logdir = '.' 30 | logfiles = sorted([f for f in os.listdir(logdir) if f.startswith('checkpoint')]) 31 | checkpoint_path = logfiles[-1] 32 | return checkpoint_path 33 | 34 | 35 | def read_data(file_path): 36 | """ 37 | Generic function to read any kind of data. Extensions supported: '.gz', '.csv', '.pkl' 38 | """ 39 | if file_path.endswith('.gz'): 40 | obj = pd.read_csv(file_path, compression='gzip', 41 | header=0, sep=';', quotechar='"', 42 | error_bad_lines=False) 43 | elif file_path.endswith('.csv'): 44 | obj = pd.read_csv(file_path) 45 | elif file_path.endswith('.pkl'): 46 | with open(file_path, 'rb') as handle: 47 | obj = pickle.load(handle) 48 | else: 49 | raise KeyError('File extension of {} not recognized.'.format(file_path)) 50 | return obj 51 | 52 | 53 | def softmax(x): 54 | """ 55 | (Currently not used.) Compute softmax values for each sets of scores in x. 
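    For example (values rounded):

        softmax(np.array([1.0, 2.0]))  # -> array([0.269, 0.731])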
56 | """
57 | e_x = np.exp(x - np.max(x))
58 | return e_x / e_x.sum()
59 |
-------------------------------------------------------------------------------- /src/utils_data.py: --------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import torch
4 |
5 | from src.builder import (create_ids, df_to_adjacency_list,
6 | format_dfs, import_features)
7 |
8 |
9 |
10 | class DataPaths:
11 | def __init__(self):
12 | self.result_filepath = 'TXT FILE WHERE TO LOG THE RESULTS .txt'
13 | self.sport_feat_path = 'FEATURE DATASET, SPORTS (sport names) .csv'
14 | self.train_path = 'INTERACTION LIST, USER-ITEM (Train dataset).csv'
15 | self.test_path = 'INTERACTION LIST, USER-ITEM (Test dataset).csv'
16 | self.item_sport_path = 'INTERACTION LIST, ITEM-SPORT .csv'
17 | self.user_sport_path = 'INTERACTION LIST, USER-SPORT .csv'
18 | self.sport_sportg_path = 'INTERACTION LIST, SPORT-SPORT .csv'
19 | self.item_feat_path = 'FEATURE DATASET, ITEMS .csv'
20 | self.user_feat_path = 'FEATURE DATASET, USERS.csv'
21 | self.sport_onehot_path = 'FEATURE DATASET, SPORTS (one-hot vectors) .csv'
22 |
23 | class FixedParameters:
24 | def __init__(self, num_epochs, start_epoch, patience, edge_batch_size,
25 | remove, item_id_type, duplicates):
26 | """
27 | All parameters that are fixed, i.e. not part of the hyperparametrization.
28 |
29 | Attributes
30 | ----------
31 | ctm_id_type :
32 | Identifier for the customers.
33 | Days_of_purchases (Days_of_clicks) :
34 | Number of days of purchases (clicks) that should be kept in the dataset.
35 | Intuition is that interactions of 12+ months ago might not be relevant. Max is 710 days.
36 | Users that do not have any remaining interactions will be fed recommendations from another
37 | model.
38 | Discern_clicks :
39 | Clicks and purchases will be considered as 2 different edge types
40 | Duplicates :
41 | Determines how to handle duplicates in the training set. 'count_occurrence' will drop all
42 | duplicates except the last, and the number of interactions will be stored in the edge feature.
43 | If duplicates == 'count_occurrence', aggregator_type needs to handle the edge feature. 'keep_last'
44 | will drop all duplicates except the last. 'keep_all' will keep all duplicates.
45 | Explore :
46 | Print examples of recommendations and of similar sports
47 | Include_sport :
48 | Sports will be included in the graph, with 6 more relation types: user-practices-sport,
49 | item-utilized-for-sport, sport-belongs-to-sport (and all their reverse relation types)
50 | item_id_type :
51 | Identifier for the items. Can be SPECIFIC ITEM IDENTIFIER (e.g. item SKU) or GENERIC ITEM IDENTIFIER
52 | (e.g. item family ID)
53 | Lifespan_of_items :
54 | Number of days since its most recent transaction for an item to be considered by the
55 | model. Max is 710 days. Won't make a difference if it is > Days_of_purchases.
56 | Num_choices :
57 | Number of examples of recommendations and similar sports to print
58 | Patience :
59 | Number of epochs to wait for early stopping
60 | Pred :
61 | Function that takes the embeddings of a user and an item as input, and outputs a rating. Choices: 'cos' for cosine
62 | similarity, 'nn' for a multilayer perceptron with a sigmoid function at the end
63 | Start_epoch :
64 | Load model from a previous epoch
65 | Train_on_clicks :
66 | When parametrizing the GNN, edges of purchases are always included.
If true, clicks will also 67 | be included 68 | """ 69 | self.ctm_id_type = 'CUSTOMER IDENTIFIER' 70 | self.days_of_purchases = 365 # Max is 710 71 | self.days_of_clicks = 30 # Max is 710 72 | self.discern_clicks = True 73 | self.duplicates = duplicates # 'keep_last', 'keep_all', 'count_occurrence' 74 | self.edge_batch_size = edge_batch_size 75 | self.etype = [('user', 'buys', 'item')] 76 | if self.discern_clicks: 77 | self.etype.append(('user', 'clicks', 'item')) 78 | self.explore = True 79 | self.include_sport = True 80 | self.item_id_type = item_id_type 81 | self.k = 10 82 | self.lifespan_of_items = 180 83 | self.neighbor_sampler = 'full' 84 | self.node_batch_size = 128 85 | self.num_choices = 10 86 | self.num_epochs = num_epochs 87 | self.optimizer = torch.optim.Adam 88 | self.patience = patience 89 | self.pred = 'cos' 90 | self.remove = remove 91 | self.remove_false_negative = True 92 | self.remove_on_inference = .7 93 | self.remove_train_eids = False 94 | self.report_model_coverage = False 95 | self.reverse_etype = {('user', 'buys', 'item'): ('item', 'bought-by', 'user')} 96 | if self.discern_clicks: 97 | self.reverse_etype[('user', 'clicks', 'item')] = ('item', 'clicked-by', 'user') 98 | self.run_inference = 1 99 | self.spt_id_type = 'sport_id' 100 | self.start_epoch = start_epoch 101 | self.subtrain_size = 0.05 102 | self.train_on_clicks = True 103 | self.valid_size = 0.05 104 | # self.dropout = .5 # HP 105 | # self.norm = False # HP 106 | # self.use_popularity = False # HP 107 | # self.days_popularity = 0 # HP 108 | # self.weight_popularity = 0. # HP 109 | # self.use_recency = False # HP 110 | # self.aggregator_type = 'mean_nn_edge' # HP 111 | # self.aggregator_hetero = 'sum' # HP 112 | # self.purchases_sample = .5 # HP 113 | # self.clicks_sample = .4 # HP 114 | # self.embedding_layer = False # HP 115 | # self.edge_update = True # Removed implementation; not useful 116 | # self.automatic_precision = False # Removed implementation; not useful 117 | 118 | 119 | class DataLoader: 120 | """Data loading, cleaning and pre-processing.""" 121 | 122 | def __init__(self, data_paths, fixed_params): 123 | self.data_paths = data_paths 124 | ( 125 | self.user_item_train, 126 | self.user_item_test, 127 | self.item_sport_interaction, 128 | self.user_sport_interaction, 129 | self.sport_sportg_interaction, 130 | self.item_feat_df, 131 | self.user_feat_df, 132 | self.sport_feat_df, 133 | self.sport_onehot_df, 134 | ) = format_dfs( 135 | self.data_paths.train_path, 136 | self.data_paths.test_path, 137 | self.data_paths.item_sport_path, 138 | self.data_paths.user_sport_path, 139 | self.data_paths.sport_sportg_path, 140 | self.data_paths.item_feat_path, 141 | self.data_paths.user_feat_path, 142 | self.data_paths.sport_feat_path, 143 | self.data_paths.sport_onehot_path, 144 | fixed_params.remove, 145 | fixed_params.ctm_id_type, 146 | fixed_params.item_id_type, 147 | fixed_params.days_of_purchases, 148 | fixed_params.days_of_clicks, 149 | fixed_params.lifespan_of_items, 150 | fixed_params.report_model_coverage, 151 | ) 152 | if fixed_params.report_model_coverage: 153 | print('Reporting model coverage') 154 | (_, _, _, _, _, _, _, _ 155 | ) = format_dfs( 156 | self.data_paths.train_path, 157 | self.data_paths.test_path, 158 | self.data_paths.item_sport_path, 159 | self.data_paths.user_sport_path, 160 | self.data_paths.sport_sportg_path, 161 | self.data_paths.item_feat_path, 162 | self.data_paths.user_feat_path, 163 | self.data_paths.sport_feat_path, 164 | 0, # remove 0 165 | fixed_params.ctm_id_type, 166 
| fixed_params.item_id_type, 167 | fixed_params.days_of_purchases, 168 | fixed_params.days_of_clicks, 169 | fixed_params.lifespan_of_items, 170 | fixed_params.report_model_coverage, 171 | ) 172 | 173 | self.ctm_id, self.pdt_id, self.spt_id = create_ids( 174 | self.user_item_train, 175 | self.user_sport_interaction, 176 | self.sport_sportg_interaction, 177 | self.item_feat_df, 178 | item_id_type=fixed_params.item_id_type, 179 | ctm_id_type=fixed_params.ctm_id_type, 180 | spt_id_type=fixed_params.spt_id_type, 181 | ) 182 | 183 | ( 184 | self.adjacency_dict, 185 | self.ground_truth_test, 186 | self.ground_truth_purchase_test, 187 | self.user_item_train_grouped, # Will be grouped if duplicates != 'keep_all'. Used for recency edge feature 188 | ) = df_to_adjacency_list( 189 | self.user_item_train, 190 | self.user_item_test, 191 | self.item_sport_interaction, 192 | self.user_sport_interaction, 193 | self.sport_sportg_interaction, 194 | self.ctm_id, 195 | self.pdt_id, 196 | self.spt_id, 197 | item_id_type=fixed_params.item_id_type, 198 | ctm_id_type=fixed_params.ctm_id_type, 199 | spt_id_type=fixed_params.spt_id_type, 200 | discern_clicks=fixed_params.discern_clicks, 201 | duplicates=fixed_params.duplicates, 202 | ) 203 | 204 | if fixed_params.discern_clicks: 205 | self.graph_schema = { 206 | ('user', 'buys', 'item'): 207 | list(zip(self.adjacency_dict['purchases_src'], self.adjacency_dict['purchases_dst'])), 208 | ('item', 'bought-by', 'user'): 209 | list(zip(self.adjacency_dict['purchases_dst'], self.adjacency_dict['purchases_src'])), 210 | ('user', 'clicks', 'item'): 211 | list(zip(self.adjacency_dict['clicks_src'], self.adjacency_dict['clicks_dst'])), 212 | ('item', 'clicked-by', 'user'): 213 | list(zip(self.adjacency_dict['clicks_dst'], self.adjacency_dict['clicks_src'])), 214 | } 215 | else: 216 | self.graph_schema = { 217 | ('user', 'buys', 'item'): 218 | list(zip(self.adjacency_dict['user_item_src'], self.adjacency_dict['user_item_dst'])), 219 | ('item', 'bought-by', 'user'): 220 | list(zip(self.adjacency_dict['user_item_dst'], self.adjacency_dict['user_item_src'])), 221 | } 222 | if fixed_params.include_sport: 223 | self.graph_schema.update( 224 | { 225 | ('item', 'utilized-for', 'sport'): 226 | list(zip(self.adjacency_dict['item_sport_src'], self.adjacency_dict['item_sport_dst'])), 227 | ('sport', 'utilizes', 'item'): 228 | list(zip(self.adjacency_dict['item_sport_dst'], self.adjacency_dict['item_sport_src'])), 229 | ('user', 'practices', 'sport'): 230 | list(zip(self.adjacency_dict['user_sport_src'], self.adjacency_dict['user_sport_dst'])), 231 | ('sport', 'practiced-by', 'user'): 232 | list(zip(self.adjacency_dict['user_sport_dst'], self.adjacency_dict['user_sport_src'])), 233 | ('sport', 'belongs-to', 'sport'): 234 | list(zip(self.adjacency_dict['sport_sportg_src'], self.adjacency_dict['sport_sportg_dst'])), 235 | ('sport', 'includes', 'sport'): 236 | list(zip(self.adjacency_dict['sport_sportg_dst'], self.adjacency_dict['sport_sportg_src'])), 237 | } 238 | ) 239 | 240 | 241 | def assign_graph_features(graph, 242 | fixed_params, 243 | data, 244 | **params, 245 | ): 246 | """ 247 | Assigns features to graph nodes and edges, based on data previously provided in the dataloader. 248 | 249 | Parameters 250 | ---------- 251 | graph: 252 | Graph of type dgl.DGLGraph, with all the nodes & edges. 253 | fixed_params: 254 | All fixed parameters. The only fixed params used are related to id types and occurrences. 
255 | data: 256 | Object that contains node feature dataframes, ID mapping dataframes and user item interactions. 257 | params: 258 | Parameters used in this function include popularity & recency hyperparameters. 259 | 260 | Returns 261 | ------- 262 | graph: 263 | The input graph but with features assigned to its nodes and edges. 264 | """ 265 | # Assign features 266 | features_dict = import_features( 267 | graph, 268 | data.user_feat_df, 269 | data.item_feat_df, 270 | data.sport_onehot_df, 271 | data.ctm_id, 272 | data.pdt_id, 273 | data.spt_id, 274 | data.user_item_train, 275 | params['use_popularity'], 276 | params['days_popularity'], 277 | fixed_params.item_id_type, 278 | fixed_params.ctm_id_type, 279 | fixed_params.spt_id_type, 280 | ) 281 | 282 | graph.nodes['user'].data['features'] = features_dict['user_feat'] 283 | graph.nodes['item'].data['features'] = features_dict['item_feat'] 284 | if 'sport' in graph.ntypes: 285 | graph.nodes['sport'].data['features'] = features_dict['sport_feat'] 286 | 287 | # add date as edge feature 288 | if params['use_recency']: 289 | df = data.user_item_train_grouped 290 | df['max_date'] = max(df.hit_date) 291 | df['days_recency'] = (pd.to_datetime(df.max_date) - pd.to_datetime(df.hit_date)).dt.days + 1 292 | if fixed_params.discern_clicks: 293 | recency_tensor_buys = torch.tensor(df[df.buy == 1].days_recency.values) 294 | recency_tensor_clicks = torch.tensor(df[df.buy == 0].days_recency.values) 295 | graph.edges['buys'].data['recency'] = recency_tensor_buys 296 | graph.edges['bought-by'].data['recency'] = recency_tensor_buys 297 | graph.edges['clicks'].data['recency'] = recency_tensor_clicks 298 | graph.edges['clicked-by'].data['recency'] = recency_tensor_clicks 299 | else: 300 | recency_tensor = torch.tensor(df.days_recency.values) 301 | graph.edges['buys'].data['recency'] = recency_tensor 302 | graph.edges['bought-by'].data['recency'] = recency_tensor 303 | 304 | if params['use_popularity']: 305 | graph.nodes['item'].data['popularity'] = features_dict['item_pop'] 306 | 307 | if fixed_params.duplicates == 'count_occurrence': 308 | if fixed_params.discern_clicks: 309 | graph.edges['clicks'].data['occurrence'] = torch.tensor(data.adjacency_dict['clicks_num']) 310 | graph.edges['clicked-by'].data['occurrence'] = torch.tensor(data.adjacency_dict['clicks_num']) 311 | graph.edges['buys'].data['occurrence'] = torch.tensor(data.adjacency_dict['purchases_num']) 312 | graph.edges['bought-by'].data['occurrence'] = torch.tensor(data.adjacency_dict['purchases_num']) 313 | else: 314 | graph.edges['buys'].data['occurrence'] = torch.tensor(data.adjacency_dict['user_item_num']) 315 | graph.edges['bought-by'].data['occurrence'] = torch.tensor(data.adjacency_dict['user_item_num']) 316 | 317 | return graph 318 | 319 | 320 | 321 | 322 | 323 | -------------------------------------------------------------------------------- /src/utils_inference.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | from dgl.data.utils import load_graphs 4 | 5 | 6 | def read_graph(graph_path): 7 | """ 8 | Read graph data from path. 9 | """ 10 | graph_list, _ = load_graphs(graph_path) 11 | graph = graph_list[0] 12 | return graph 13 | 14 | 15 | def fetch_uids(user_ids, 16 | ctm_id_df): 17 | """ 18 | Maps the Organisation user_ids into node_ids that are used in the graph. 
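    For illustration, assuming a hypothetical ctm_id_df with columns 'CUSTOMER IDENTIFIER' and 'ctm_new_id'
    (e.g. 'user_abc' -> 0, 'user_def' -> 1):

        fetch_uids(['user_abc', 'user_unknown'], ctm_id_df)
        # -> array([0]), and prints that 1 user id provided had no node id in the graph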
19 | """ 20 | user_df = pd.DataFrame(user_ids, columns=['old_id']) 21 | user_df = user_df.merge(ctm_id_df, how='inner', left_on='old_id', right_on='CUSTOMER IDENTIFIER') 22 | new_uids_list = user_df.ctm_new_id.values 23 | if len(user_ids) != len(new_uids_list): 24 | print(f'{len(user_ids)-len(new_uids_list)} user ids provided had no node ids in the graph.') 25 | return new_uids_list 26 | 27 | 28 | def postprocess_recs(recs, 29 | pdt_id_df, 30 | ctm_id_df, 31 | pdt_id_type, 32 | ctm_id_type, ): 33 | """ 34 | Transforms node_ids for user and item into Organisation user_ids and item_ids 35 | (e.g.CUSTOMER IDENTIFIER and ITEM IDENTIFIER) 36 | """ 37 | processed_recs = {ctm_id_df[ctm_id_df.ctm_new_id == key][ctm_id_type].item(): 38 | [pdt_id_df[pdt_id_df.pdt_new_id == iid][pdt_id_type].item() for iid in value_list] 39 | for key, value_list in recs.items()} 40 | return processed_recs 41 | -------------------------------------------------------------------------------- /src/utils_vizualization.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | from datetime import datetime 3 | import textwrap 4 | 5 | import numpy as np 6 | 7 | 8 | def plot_train_loss(hp_sentence, viz): 9 | """ 10 | Visualize train & validation loss & metrics. hp_sentence is used as the title of the plot. 11 | 12 | Saves plots in the plots folder. 13 | """ 14 | if 'val_loss_list' in viz.keys(): 15 | fig = plt.figure() 16 | x = np.arange(len(viz['train_loss_list'])) 17 | plt.title('\n'.join(textwrap.wrap(hp_sentence, 60))) 18 | fig.tight_layout() 19 | plt.rcParams["axes.titlesize"] = 6 20 | plt.plot(x, viz['train_loss_list']) 21 | plt.plot(x, viz['val_loss_list']) 22 | plt.legend(['training loss', 'valid loss'], loc='upper left') 23 | plt.savefig('plots/' + str(datetime.now())[:-10] + 'loss.png') 24 | plt.close(fig) 25 | 26 | if 'val_recall_list' in viz.keys(): 27 | fig = plt.figure() 28 | x = np.arange(len(viz['train_precision_list'])) 29 | plt.title('\n'.join(textwrap.wrap(hp_sentence, 60))) 30 | fig.tight_layout() 31 | plt.rcParams["axes.titlesize"] = 6 32 | plt.plot(x, viz['train_precision_list']) 33 | plt.plot(x, viz['train_recall_list']) 34 | plt.plot(x, viz['train_coverage_list']) 35 | plt.plot(x, viz['val_precision_list']) 36 | plt.plot(x, viz['val_recall_list']) 37 | plt.plot(x, viz['val_coverage_list']) 38 | plt.legend(['training precision', 'training recall', 'training coverage/10', 39 | 'valid precision', 'valid recall', 'valid coverage/10'], loc='upper left') 40 | plt.savefig('plots/' + str(datetime.now())[:-10] + 'metrics.png') 41 | plt.close(fig) 42 | --------------------------------------------------------------------------------
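A minimal usage sketch for plot_train_loss, with made-up values (in practice, viz is the dictionary returned by
train_model in src/train/run.py, the hp_sentence string is free text used as the plot title, and figures are
written to the plots/ folder):

from src.utils_vizualization import plot_train_loss

viz = {'train_loss_list': [0.9, 0.7, 0.6],
       'val_loss_list': [0.95, 0.82, 0.78],
       'train_precision_list': [1.2, 1.8, 2.1],
       'train_recall_list': [3.0, 4.5, 5.2],
       'train_coverage_list': [0.4, 0.6, 0.7],
       'val_precision_list': [1.0, 1.5, 1.9],
       'val_recall_list': [2.8, 4.1, 4.9],
       'val_coverage_list': [0.3, 0.5, 0.6]}
plot_train_loss('aggregator_type=mean, hidden_dim=256, lr=0.001', viz)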