├── LICENSE ├── README.md ├── assets ├── cbf_cf.png ├── deepmind_forgoogle_recsys.png ├── mf.png ├── netflix_prize.jpeg ├── recsys_io.png ├── retrieval_ranking.png ├── tensorflow_two_tower.gif └── youtube_retrieval.png └── recommender_system_tutorial.ipynb /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Hamidreza Hosseinkhani 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Build a recommendation system with TensorFlow and Keras 2 | This is a step-by-step tutorial on developing a practical recommendation system (`retrieval` and `ranking` tasks) using [TensorFlow Recommenders](https://www.tensorflow.org/recommenders) and [Keras](https://keras.io/) and deploying it using [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving). 3 | 4 | Here, you can find an introduction to information retrieval and recommendation systems; then you can explore [the Jupyter notebook](https://github.com/xei/recommender_system_tutorial/blob/main/recommender_system_tutorial.ipynb) and run it in [Google Colab](https://colab.research.google.com/github/xei/recommender_system_tutorial/blob/main/recommender_system_tutorial.ipynb) in order to study the code. 5 | 6 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/xei/recommender_system_tutorial/blob/main/recommender_system_tutorial.ipynb) 7 | 8 | Open in Studio 9 | 10 | 11 | In the notebook, we load the [MovieLens dataset](https://grouplens.org/datasets/movielens/) using [TensorFlow Datasets](https://www.tensorflow.org/datasets), preprocess its features using [Keras preprocessing layers](https://keras.io/guides/preprocessing_layers/), build the `retrieval` and `ranking` tasks using [TensorFlow Recommenders](https://www.tensorflow.org/recommenders) and index and search for similar items using [Spotify Annoy](https://github.com/spotify/annoy). 12 | 13 | This tutorial is recommended for both academic and industry enthusiasts. 14 | 15 | ## Introduction 16 | 17 | ### What is a recommendation system?! 
18 | Online services usually provide thousands, millions or even billions of items such as products, video clips, movies, music, news, articles, blog posts, advertisements, etc. For example, the [Google Play Store](https://play.google.com/store) provides millions of apps and [YouTube](https://www.youtube.com/) provides billions of videos. [[1]](https://developers.google.com/machine-learning/recommendation/overview) 19 | 20 | However, users prefer to see a short list of likely items instead of struggling with the full corpus. They can usually search or filter the list to find the best few items, but sometimes they don't even know what they really want (e.g. a birthday gift). In a physical store, an expert seller would help in this case with useful recommendations. So, why not in an online store?! 21 | 22 | A recommendation system can retrieve, filter and recommend the best personalized results for the user - results which the user is likely to buy. So it is one of the major requirements of modern businesses in order to increase their `conversion rate`. On September 21, 2009, Netflix awarded a grand prize of $1,000,000 to a team that bested Netflix's own algorithm for predicting ratings by 10.06%. [[2]](https://web.archive.org/web/20090924184639/http://www.netflixprize.com/community/viewtopic.php?id=1537) 23 | 24 |

25 | 26 |

27 | 28 | A recommendation system is a system that takes a `query (context)`, which is what we know about the user's preferences, and filters the corpus (the full catalog of items) down to a shortlist of `candidates` (items, documents). A query (context) can be a ***user id***, ***user's geographical location*** or ***user's history of previous purchases*** and the resulting candidates can be some new items that we guess are interesting for the user. 29 | 30 | The query can also be an ***item id***, ***its image*** or ***its textual description*** and the candidates can be some similar or related items from the corpus. 31 |

 

32 |

33 | 34 |

35 |

 

36 | 37 | ### Recommendation stages (tasks) 38 | In practice, filtering a large corpus down to a shortlist in a single step is intractable and inefficient. So practical recommender systems have two (or three) filtering phases: 39 | 1. Retrieval (Candidate Generation) 40 | 2. Ranking (Scoring) 41 | 3. Re-ranking or optimization or ... 42 | 43 |
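The retrieval and ranking stages can be sketched in a few lines of plain Python. This is only a minimal illustration, not the notebook's implementation: the item vectors, the `item_popularity` side signal and the helper names `retrieve` and `rank` are assumptions made up for this example. Retrieval scores the whole corpus with a cheap dot product and keeps the top `k` items; ranking then applies a costlier score to that shortlist only.

```python
import heapq


def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_vec, item_vecs, k):
    """Stage 1: cheap similarity over the full corpus, keep the top-k item ids."""
    return heapq.nlargest(
        k, item_vecs, key=lambda item_id: dot(query_vec, item_vecs[item_id])
    )


def rank(query_vec, candidates, item_vecs, item_popularity):
    """Stage 2: a richer (hypothetical) score, computed on the shortlist only."""
    def score(item_id):
        return dot(query_vec, item_vecs[item_id]) + 0.1 * item_popularity[item_id]
    return sorted(candidates, key=score, reverse=True)


# Toy corpus: made-up 2-d embeddings and popularity values.
item_vecs = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
item_popularity = {"a": 0.0, "b": 5.0, "c": 1.0, "d": 9.0}
query = [1.0, 0.0]

candidates = retrieve(query, item_vecs, k=2)  # ["a", "b"]
shortlist = rank(query, candidates, item_vecs, item_popularity)  # ["b", "a"]
print(candidates, shortlist)
```

Because the ranking stage only sees `k` candidates, it can afford a heavier model than the retrieval stage, which has to touch every item in the corpus.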

44 | 45 |

46 | 47 |

48 | 49 |

50 | 51 | ### Content-based Filtering vs Collaborative Filtering 52 | Filtering items is based on similarities. We can filter the list based on similar candidates (`content-based filtering`) or based on the similarity between queries and candidates (`collaborative filtering`). Collaborative filtering algorithms usually perform better than content-based methods. 53 | 54 |
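A small plain-Python sketch may help contrast the two approaches. All data and helper names here (`item_features`, `ratings`, `similar_items`, `recommend_from_neighbor`) are invented for illustration. The content-based part compares items by their own feature vectors, while the collaborative part ignores item features and instead finds the user with the most similar rating behavior.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


# Content-based: items described by made-up genre flags [action, comedy, romance].
item_features = {
    "movie_a": [1, 0, 0],
    "movie_b": [1, 1, 0],
    "movie_c": [0, 0, 1],
}


def similar_items(liked, features):
    """Order the other items by feature similarity to a liked item."""
    others = [i for i in features if i != liked]
    return sorted(
        others, key=lambda i: cosine(features[liked], features[i]), reverse=True
    )


# Collaborative: only the (made-up) rating behavior of users is used.
ratings = {
    "alice": {"movie_a": 5, "movie_b": 4},
    "bob": {"movie_a": 5, "movie_b": 5, "movie_c": 4},
    "carol": {"movie_a": 1, "movie_c": 5},
}


def recommend_from_neighbor(user):
    """Suggest items rated by the most similar user that `user` has not rated."""
    items = sorted({i for r in ratings.values() for i in r})

    def vec(u):
        return [ratings[u].get(i, 0) for i in items]

    neighbors = [u for u in ratings if u != user]
    nearest = max(neighbors, key=lambda u: cosine(vec(user), vec(u)))
    return [i for i in ratings[nearest] if i not in ratings[user]]


print(similar_items("movie_a", item_features))
print(recommend_from_neighbor("alice"))
```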

55 | 56 |

57 | 58 | ### Representation of a query or a candidate 59 | A query or a candidate has lots of different features. For example, a query can be constructed from these features: 60 | - user_id 61 | - user_previous_history 62 | - user_job 63 | - etc. 64 | 65 | And a candidate can have features like: 66 | - item_description 67 | - item_image 68 | - item_price 69 | - posted_time 70 | - etc. 71 | 72 | These obvious features can be `numerical variables`, `categorical variables`, `bitmaps` or `raw texts`. However, these low-level features are not enough and we should extract some more abstract `latent features` from them to represent the query or the candidate as a numerical high-dimensional vector - known as an `Embedding Vector`. 73 | 74 | `Matrix Factorization` (MF) is a classic collaborative filtering method that learns some `latent factors` (latent features) from the `user_id`, `item_id` and `rating` features and represents **users** and **items** by latent (embedding) vectors. 75 | 76 |
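As a rough sketch of how matrix factorization learns such latent vectors, the snippet below fits user and item factors to a handful of made-up `(user, item, rating)` triples with plain stochastic gradient descent. The toy ratings, the hyperparameters and the `train_mf` and `mse` helpers are assumptions for illustration only; the notebook itself builds its models with TensorFlow Recommenders.

```python
import random


def train_mf(ratings, n_users, n_items, dim=2, lr=0.02, reg=0.01,
             epochs=500, seed=0):
    """Learn user factors P and item factors Q so that P[u] . Q[i] ~ rating."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(p * q for p, q in zip(P[u], Q[i]))
            err = r - pred
            for f in range(dim):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def mse(ratings, P, Q):
    """Mean squared error of the factor model on the observed ratings."""
    total = 0.0
    for u, i, r in ratings:
        pred = sum(p * q for p, q in zip(P[u], Q[i]))
        total += (r - pred) ** 2
    return total / len(ratings)


# Made-up observations: (user index, item index, rating).
toy_ratings = [
    (0, 0, 5.0), (0, 1, 3.0),
    (1, 0, 4.0), (1, 2, 1.0),
    (2, 1, 2.0), (2, 2, 5.0),
]

P0, Q0 = train_mf(toy_ratings, n_users=3, n_items=3, epochs=0)  # random init only
P, Q = train_mf(toy_ratings, n_users=3, n_items=3)

initial_mse = mse(toy_ratings, P0, Q0)
final_mse = mse(toy_ratings, P, Q)
print(f"MSE before training: {initial_mse:.3f}, after: {final_mse:.3f}")
```

Each row of `P` is a user embedding and each row of `Q` is an item embedding; a rating for an unseen pair is estimated by their dot product.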

77 | 78 |

79 | 80 | The Matrix Factorization method only uses the `user_id` and `candidate_id` features collaboratively to learn the `latent features`. In fact, it ignores side-features like `candidate_description`, `price`, `user_comment`, etc. 81 | 82 | To involve side-features as well as ids while learning latent features (embeddings), we can use deep neural network (DNN) architectures like `softmax` or `two-tower` neural models. 83 | 84 |

85 | 86 |

87 | 88 | The YouTube two-tower neural model uses side-features to represent queries and candidates as abstract high-dimensional embedding vectors. 89 | 90 |
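To make the two-tower idea more concrete, here is a tiny forward-pass sketch in plain Python. Everything in it is hypothetical: the ids, the single side-feature per tower (`normalized_age`, `normalized_year`), the embedding size and the helper names are invented, and the weights are random and untrained, so the scores carry no meaning. It only shows the structure: each tower turns an id plus side-features into an embedding, and the affinity of a query and a candidate is the dot product of the two embeddings.

```python
import random

rng = random.Random(0)
DIM = 4  # embedding size, chosen arbitrarily for this sketch


def make_embedding_table(keys, dim):
    """One random (untrained) embedding vector per id."""
    return {k: [rng.uniform(-1, 1) for _ in range(dim)] for k in keys}


def make_dense(in_dim, out_dim):
    """Random (untrained) weights of a bias-free linear layer."""
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]


def dense(weights, x):
    """Apply the linear layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


user_id_table = make_embedding_table(["u1", "u2"], DIM)
movie_id_table = make_embedding_table(["m1", "m2", "m3"], DIM)
query_weights = make_dense(DIM + 1, DIM)      # id embedding + 1 side-feature
candidate_weights = make_dense(DIM + 1, DIM)  # id embedding + 1 side-feature


def query_tower(user_id, normalized_age):
    """Query tower: id embedding concatenated with a side-feature, then projected."""
    return dense(query_weights, user_id_table[user_id] + [normalized_age])


def candidate_tower(movie_id, normalized_year):
    """Candidate tower: same structure, separate weights."""
    return dense(candidate_weights, movie_id_table[movie_id] + [normalized_year])


def affinity(q, c):
    """Score of a (query, candidate) pair: dot product of the two embeddings."""
    return sum(a * b for a, b in zip(q, c))


q = query_tower("u1", 0.3)
scores = {
    movie_id: affinity(q, candidate_tower(movie_id, year))
    for movie_id, year in [("m1", -0.5), ("m2", 0.0), ("m3", 1.2)]
}
best = max(scores, key=scores.get)
print(scores, best)
```

Since the candidate tower does not depend on the query, candidate embeddings can be computed once and indexed, which is what makes the retrieval stage fast at serving time.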

91 | 92 |

93 | 94 | ### MovieLens dataset 95 | 96 | The `MovieLens` dataset is a benchmark dataset in the field of recommender system research containing a set of *ratings* given to *movies* by a set of *users*, collected from the [MovieLens website](http://movielens.org/) - a movie recommendation service. 97 | 98 | There are 5 different versions of MovieLens available for different purposes: "25m", "latest-small", "100k", "1m" and "20m". In this tutorial we are going to work with the "100k" version. For more information about the different versions, visit the [official website](https://grouplens.org/datasets/movielens/). 99 | 100 | **movielens/100k-ratings** 101 | 102 | This is the oldest version of the MovieLens dataset, containing 100,000 ratings from 943 users on 1,682 movies. Each user has rated at least 20 movies. Ratings are in whole-star increments. This dataset contains demographic data of users in addition to data on movies and ratings. 103 | 104 | **movielens/100k-movies** 105 | 106 | This dataset contains data of 1,682 movies rated in the ***movielens/100k-ratings*** dataset. 107 | 108 |

 

109 | 110 | ## Explore the Jupyter notebook 111 | 112 | 113 | 116 | 117 | 120 |
114 | View the code on GitHub 115 | 118 | Run the code in Google Colab 119 |
121 | 122 |

 

123 |

 

124 | 125 |

 

126 |

 

127 | 128 | ## Donation 129 | Give a ⭐ if this project helped you! 130 | -------------------------------------------------------------------------------- /assets/cbf_cf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xei/recommender-system-tutorial/ec36e5a3e4066f2b1b297b3495d2ce86290d77dd/assets/cbf_cf.png -------------------------------------------------------------------------------- /assets/deepmind_forgoogle_recsys.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xei/recommender-system-tutorial/ec36e5a3e4066f2b1b297b3495d2ce86290d77dd/assets/deepmind_forgoogle_recsys.png -------------------------------------------------------------------------------- /assets/mf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xei/recommender-system-tutorial/ec36e5a3e4066f2b1b297b3495d2ce86290d77dd/assets/mf.png -------------------------------------------------------------------------------- /assets/netflix_prize.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xei/recommender-system-tutorial/ec36e5a3e4066f2b1b297b3495d2ce86290d77dd/assets/netflix_prize.jpeg -------------------------------------------------------------------------------- /assets/recsys_io.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xei/recommender-system-tutorial/ec36e5a3e4066f2b1b297b3495d2ce86290d77dd/assets/recsys_io.png -------------------------------------------------------------------------------- /assets/retrieval_ranking.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xei/recommender-system-tutorial/ec36e5a3e4066f2b1b297b3495d2ce86290d77dd/assets/retrieval_ranking.png 
-------------------------------------------------------------------------------- /assets/tensorflow_two_tower.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xei/recommender-system-tutorial/ec36e5a3e4066f2b1b297b3495d2ce86290d77dd/assets/tensorflow_two_tower.gif -------------------------------------------------------------------------------- /assets/youtube_retrieval.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xei/recommender-system-tutorial/ec36e5a3e4066f2b1b297b3495d2ce86290d77dd/assets/youtube_retrieval.png -------------------------------------------------------------------------------- /recommender_system_tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | } 12 | }, 13 | "cells": [ 14 | { 15 | "cell_type": "markdown", 16 | "metadata": { 17 | "id": "l-8TmjXRTezi" 18 | }, 19 | "source": [ 20 | "# Build a recommendation system with TensorFlow and Keras\n", 21 | "\n", 22 | "\n", 23 | "\n", 24 | "\n", 25 | " \n", 28 | " \n", 31 | "
\n", 26 | " Run in Google Colab\n", 27 | " \n", 29 | " View source on GitHub\n", 30 | "
\n", 32 | "\n", 33 | "

 

\n", 34 | "

 

\n", 35 | "\n", 36 | "Here, we are going to learn the fundamentals of information retrieval and recommendation systems and build a practical movie recommender service using [TensorFlow Recommenders](https://www.tensorflow.org/recommenders) and [Keras](https://keras.io/) and deploy it using [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving).\n", 37 | "\n", 38 | "This step-by-step tutorial is recommended for both academic and industry enthusiasts." 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": { 44 | "id": "JR7yjVX9UUxs" 45 | }, 46 | "source": [ 47 | "## Introduction\n", 48 | "\n", 49 | "### What is a recommendation system?!\n", 50 | "Online services usually provide thousands, millions or even billions of items such as products, video clips, movies, music, news, articles, blog posts, advertisements, etc. For example, the [Google Play Store](https://play.google.com/store) provides millions of apps and [YouTube](https://www.youtube.com/) provides billions of videos. [[1]](https://developers.google.com/machine-learning/recommendation/overview)\n", 51 | "\n", 52 | "However, users prefer to see a short list of likely items instead of struggling with the full corpus. They can usually search or filter the list to find the best few items, but sometimes they don't even know what they really want (e.g. a birthday gift). In a physical store, an expert seller would help in this case with useful recommendations. So, why not in an online store?!\n", 53 | "\n", 54 | "A recommendation system can retrieve, filter and recommend the best personalized results for the user - results which the user is likely to buy. So it is one of the major requirements of modern businesses in order to increase their `conversion rate`. On September 21, 2009, Netflix awarded a grand prize of $1,000,000 to a team that bested Netflix's own algorithm for predicting ratings by 10.06%. 
[[2]](https://web.archive.org/web/20090924184639/http://www.netflixprize.com/community/viewtopic.php?id=1537)\n", 55 | "\n", 56 | "

\n", 57 | " \n", 58 | "

\n", 59 | "\n", 60 | "A recommendation system is a system that takes a `query (context)`, which is what we know about the user's preferences, and filters the corpus (the full catalog of items) down to a shortlist of `candidates` (items, documents). A query (context) can be a combination of ***user id***, ***user's geographical location***, ***user's history of previous purchases***, etc. and the resulting candidates can be some new items that we guess are interesting for the user.\n", 61 | "\n", 62 | "The query can also be a combination of ***item id***, ***its image***, ***its textual description***, etc. and the candidates can be some similar or related items from the corpus.\n", 63 | "

 

\n", 64 | "

\n", 65 | " \n", 66 | "

\n", 67 | "

 

\n", 68 | "\n", 69 | "### Recommendation stages (tasks)\n", 70 | "In practice, filtering a large corpus down to a shortlist in a single step is intractable and inefficient. So practical recommender systems have two (or three) filtering phases:\n", 71 | "1. Retrieval (Candidate Generation)\n", 72 | "2. Ranking (Scoring)\n", 73 | "3. Re-ranking or optimization or ...\n", 74 | "\n", 75 | "

\n", 76 | " \n", 77 | "

\n", 78 | "\n", 79 | "

\n", 80 | " \n", 81 | "

\n", 82 | "\n", 83 | "### Content-based Filtering vs Collaborative Filtering\n", 84 | "Filtering items is based on similarities. We can filter the list based on similar candidates (`content-based filtering`) or based on the similarity between queries and candidates (`collaborative filtering`). Collaborative filtering algorithms usually perform better than content-based methods.\n", 85 | "\n", 86 | "

\n", 87 | " \n", 88 | "

\n", 89 | "\n", 90 | "### Representation of a query or a candidate\n", 91 | "A query or a candidate has lots of different features. For example, a query can be constructed from these features:\n", 92 | "- user_id\n", 93 | "- user_previous_history\n", 94 | "- user_job\n", 95 | "- etc.\n", 96 | "\n", 97 | "And a candidate can have features like:\n", 98 | "- item_description\n", 99 | "- item_image\n", 100 | "- item_price\n", 101 | "- posted_time\n", 102 | "- etc.\n", 103 | "\n", 104 | "These obvious features can be `numerical variables`, `categorical variables`, `bitmaps` or `raw texts`. However, these low-level features are not enough for any prediction and we should extract some more abstract `latent features` from them to represent the query or the candidate as a numerical high-dimensional vector - known as an `Embedding Vector`.\n", 105 | "\n", 106 | "`Matrix Factorization` (MF) is a classic collaborative filtering method that learns some `latent factors` (latent features) from the `user_id`, `item_id` and `rating` features and represents **users** and **items** by latent (embedding) vectors.\n", 107 | "\n", 108 | "

\n", 109 | " \n", 110 | "

\n", 111 | "\n", 112 | "Matrix Factorization method only uses `user_id` and `candidate_id` features collaboratively to learn the `latent features`. In fact it doesn't care about other side-features like `candidate_description`, `price`, `user_comment`, etc.\n", 113 | "\n", 114 | "To involve side-features as well as ids while learning latent features (embeddings), we can use deep neural network (DNN) architectures like `softmax` or `two-tower` neural models.\n", 115 | "\n", 116 | "

\n", 117 | " \n", 118 | "

\n", 119 | "\n", 120 | "The YouTube two-tower neural model uses side-features to represent queries and candidates as abstract high-dimensional embedding vectors.\n", 121 | "\n", 122 | "

\n", 123 | " \n", 124 | "

\n", 125 | "\n", 126 | "### MovieLens dataset\n", 127 | "\n", 128 | "The `MovieLens` dataset is a benchmark dataset in the field of recommender system research containing a set of *ratings* given to *movies* by a set of *users*, collected from the [MovieLens website](http://movielens.org/) - a movie recommendation service.\n", 129 | "\n", 130 | "There are 5 different versions of MovieLens available for different purposes: \"25m\", \"latest-small\", \"100k\", \"1m\" and \"20m\". In this tutorial we are going to work with the \"100k\" version. For more information about the different versions, visit the [official website](https://grouplens.org/datasets/movielens/).\n", 131 | "\n", 132 | "**movielens/100k-ratings**\n", 133 | "\n", 134 | "This is the oldest version of the MovieLens dataset, containing 100,000 ratings from 943 users on 1,682 movies. Each user has rated at least 20 movies. Ratings are in whole-star increments. This dataset contains demographic data of users in addition to data on movies and ratings.\n", 135 | "\n", 136 | "**movielens/100k-movies**\n", 137 | "\n", 138 | "This dataset contains data of 1,682 movies rated in the ***movielens/100k-ratings*** dataset." 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": { 144 | "id": "NDrq_pdbB6Pd" 145 | }, 146 | "source": [ 147 | "## Make an input pipeline with TensorFlow Datasets\n", 148 | "TensorFlow Datasets ([TFDS library](https://www.tensorflow.org/datasets/overview)) provides a collection of ready-to-use datasets for use with TensorFlow, JAX, and other Machine Learning frameworks.\n", 149 | "\n", 150 | "It handles downloading and preparing the data deterministically and constructing a `tf.data.Dataset` (or `np.array`) in order to enable easy-to-use and high-performance input pipelines." 
151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "source": [ 156 | "import os\n", 157 | "os.environ['TF_USE_LEGACY_KERAS'] = '1'" 158 | ], 159 | "metadata": { 160 | "id": "NLu_As-ji2o0" 161 | }, 162 | "execution_count": 1, 163 | "outputs": [] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "metadata": { 168 | "id": "Sciomsbda9NT" 169 | }, 170 | "source": [ 171 | "# Make sure the latest version of TFDS library is installed.\n", 172 | "# Older versions are not supported here.\n", 173 | "!pip install -q --upgrade tensorflow-datasets\n", 174 | "\n", 175 | "import tensorflow_datasets as tfds" 176 | ], 177 | "execution_count": 2, 178 | "outputs": [] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "metadata": { 183 | "id": "IWY6OFnhLQGx", 184 | "colab": { 185 | "base_uri": "https://localhost:8080/" 186 | }, 187 | "outputId": "faeb623c-3ae3-41fd-cd61-5328c457a8f5" 188 | }, 189 | "source": [ 190 | "# Download the data, save them as `tfrecord` files, load the `tfrecord` files\n", 191 | "# and create the `tf.data.Dataset` object containing the dataset.\n", 192 | "ratings_dataset, ratings_dataset_info = tfds.load(\n", 193 | " name='movielens/latest-small-ratings',\n", 194 | " # MovieLens dataset is not splitted into `train` and `test` sets by default.\n", 195 | " # So TFDS has put it all into `train` split. We load it completely and split\n", 196 | " # it manually.\n", 197 | " split='train',\n", 198 | " # `with_info=True` makes the `load` function return a `tfds.core.DatasetInfo`\n", 199 | " # object containing dataset metadata like version, description, homepage,\n", 200 | " # citation, etc.\n", 201 | " with_info=True\n", 202 | ")\n", 203 | "\n", 204 | "# Calling the `tfds.load()` function in old versions of TFDS won't return an\n", 205 | "# instance of `tf.data.Dataset` type. 
So we can make sure about it.\n", 206 | "import tensorflow as tf\n", 207 | "assert isinstance(ratings_dataset, tf.data.Dataset)\n", 208 | "\n", 209 | "print(\n", 210 | " \"ratings_dataset size: %d\" % ratings_dataset.__len__()\n", 211 | ")\n", 212 | "\n", 213 | "# Use `tfds.as_dataframe()` to convert `tf.data.Dataset` to `pandas.DataFrame`.\n", 214 | "# Add the `tfds.core.DatasetInfo` as second argument of `tfds.as_dataframe` to\n", 215 | "# visualize images, audio, texts, videos, etc. `pandas.DataFrame` will load the\n", 216 | "# full dataset in-memory, and can be very expensive to display. So use it only\n", 217 | "# with take() function.\n", 218 | "print(\n", 219 | " tfds.as_dataframe(ratings_dataset.take(5), ratings_dataset_info)\n", 220 | ")\n", 221 | "\n", 222 | "## Feature selection\n", 223 | "ratings_dataset = ratings_dataset.map(\n", 224 | " lambda rating: {\n", 225 | " # `user_id` is useful as a user identifier.\n", 226 | " 'user_id': rating['user_id'],\n", 227 | " # `movie_id` is useful as a movie identifier.\n", 228 | " 'movie_id': rating['movie_id'],\n", 229 | " # `movie_title` is useful as a textual information about the movie.\n", 230 | " 'movie_title': rating['movie_title'],\n", 231 | " # `user_rating` shows the user's level of interest to a movie.\n", 232 | " 'user_rating': rating['user_rating'],\n", 233 | " # `timestamp` will allow us to model the effect of time.\n", 234 | " 'timestamp': rating['timestamp']\n", 235 | " }\n", 236 | ")\n", 237 | "\n", 238 | "print(\n", 239 | " tfds.as_dataframe(ratings_dataset.take(5), ratings_dataset_info)\n", 240 | ")\n", 241 | "\n", 242 | "## Split dataset randomly (80% for train and 20% for test)\n", 243 | "# `take()` and `skip()` expect an integer count, so truncate explicitly.\n", 244 | "trainset_size = int(0.8 * len(ratings_dataset))\n", 245 | "# In an industrial recommender system, this would most likely be done by time:\n", 246 | "# The data up to time T would be used to predict interactions after T.\n", 247 | "\n", 248 | "# set the global seed:\n", 249 | 
"tf.random.set_seed(42)\n", 249 | "# More info: https://www.tensorflow.org/api_docs/python/tf/random/set_seed\n", 250 | "\n", 251 | "# Shuffle the elements of the dataset randomly.\n", 252 | "ratings_dataset_shuffled = ratings_dataset.shuffle(\n", 253 | " # the new dataset will be sampled from a buffer window of first `buffer_size`\n", 254 | " # elements of the dataset\n", 255 | " buffer_size=100_000,\n", 256 | " # set the random seed that will be used to create the distribution.\n", 257 | " seed=42,\n", 258 | " # `list(dataset.as_numpy_iterator()` yields different result for each call\n", 259 | " # Because reshuffle_each_iteration defaults to True.\n", 260 | " reshuffle_each_iteration=False\n", 261 | ")\n", 262 | "# More info: https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle\n", 263 | "\n", 264 | "ratings_trainset = ratings_dataset_shuffled.take(trainset_size)\n", 265 | "ratings_testset = ratings_dataset_shuffled.skip(trainset_size)\n", 266 | "\n", 267 | "print(\n", 268 | " \"ratings_trainset size: %d\" % ratings_trainset.__len__()\n", 269 | ")\n", 270 | "print(\n", 271 | " \"ratings_testset size: %d\" % ratings_testset.__len__()\n", 272 | ")" 273 | ], 274 | "execution_count": 3, 275 | "outputs": [ 276 | { 277 | "output_type": "stream", 278 | "name": "stdout", 279 | "text": [ 280 | "ratings_dataset size: 100836\n", 281 | " movie_genres movie_id movie_title \\\n", 282 | "0 [7, 8, 13, 15] b'4874' b'K-PAX (2001)' \n", 283 | "1 [7, 18] b'527' b\"Schindler's List (1993)\" \n", 284 | "2 [5, 9] b'7943' b'Killers, The (1946)' \n", 285 | "3 [10, 13, 16] b'1644' b'I Know What You Did Last Summer (1997)' \n", 286 | "4 [1, 2, 3, 4, 12, 14] b'8360' b'Shrek 2 (2004)' \n", 287 | "\n", 288 | " timestamp user_id user_rating \n", 289 | "0 1446749868 b'105' 5.0 \n", 290 | "1 1305696664 b'17' 4.5 \n", 291 | "2 1166068511 b'309' 4.0 \n", 292 | "3 1518640852 b'111' 0.5 \n", 293 | "4 1127221149 b'182' 3.0 \n", 294 | " movie_id movie_title timestamp user_id \\\n", 295 | 
"0 b'4874' b'K-PAX (2001)' 1446749868 b'105' \n", 296 | "1 b'527' b\"Schindler's List (1993)\" 1305696664 b'17' \n", 297 | "2 b'7943' b'Killers, The (1946)' 1166068511 b'309' \n", 298 | "3 b'1644' b'I Know What You Did Last Summer (1997)' 1518640852 b'111' \n", 299 | "4 b'8360' b'Shrek 2 (2004)' 1127221149 b'182' \n", 300 | "\n", 301 | " user_rating \n", 302 | "0 5.0 \n", 303 | "1 4.5 \n", 304 | "2 4.0 \n", 305 | "3 0.5 \n", 306 | "4 3.0 \n", 307 | "ratings_trainset size: 80668\n", 308 | "ratings_testset size: 20168\n" 309 | ] 310 | } 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": { 316 | "id": "2JOGVWBEWiZZ" 317 | }, 318 | "source": [ 319 | "To make a custom `tf.data.Dataset` object including your own dataset visit [this link](https://www.tensorflow.org/datasets/add_dataset).\n", 320 | "\n", 321 | "For more information about `tf.data.Dataset` visit [this link](https://www.tensorflow.org/api_docs/python/tf/data/Dataset)." 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": { 327 | "id": "GnlXCP02UV8K" 328 | }, 329 | "source": [ 330 | "## Preprocess raw features and make embeddings with Keras preprocessing layers\n", 331 | "\n", 332 | "Raw features are usually not be immediately usable in a machine learning model and should be preprocessed in the first place.\n", 333 | "- **Numerical features** (ratings, prices, timestamps, etc) can be far away in terms of scale and need to be `normalized` so that their values lie in a small interval around 0.\n", 334 | "- **Categorical features** (ids, usernames/emails, titles, etc) are usually string features and have to be translated into `embedding vectors` (numerical feature representations) that are adjusted during training the model.\n", 335 | "- **Text features** (descriptions, comments, etc) need to be at first, `tokenized` (split into smaller parts such as individual words known as word pieces) and then translated into embeddings.\n", 336 | "\n", 337 | "[Keras preprocessing 
layers](https://keras.io/guides/preprocessing_layers/) let us build `end-to-end` portable models that accept raw features (raw images or raw structured data) as input; models that handle feature normalization or feature value indexing on their own." 338 | ] 339 | }, 340 | { 341 | "cell_type": "markdown", 342 | "metadata": { 343 | "id": "EHvSx5Vq1dFO" 344 | }, 345 | "source": [ 346 | "Let's first have a look at what features we can use from the MovieLens dataset:\n" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "metadata": { 352 | "id": "iIiKVYmASodk", 353 | "colab": { 354 | "base_uri": "https://localhost:8080/" 355 | }, 356 | "outputId": "d5b6473a-b387-4f91-e649-b139f41a62f4" 357 | }, 358 | "source": [ 359 | "from pprint import pprint\n", 360 | "\n", 361 | "for rating in ratings_trainset.take(1).as_numpy_iterator():\n", 362 | " pprint(rating)" 363 | ], 364 | "execution_count": 4, 365 | "outputs": [ 366 | { 367 | "output_type": "stream", 368 | "name": "stdout", 369 | "text": [ 370 | "{'movie_id': b'3421',\n", 371 | " 'movie_title': b'Animal House (1978)',\n", 372 | " 'timestamp': 975212199,\n", 373 | " 'user_id': b'216',\n", 374 | " 'user_rating': 5.0}\n" 375 | ] 376 | } 377 | ] 378 | }, 379 | { 380 | "cell_type": "markdown", 381 | "metadata": { 382 | "id": "vAIqWlEjsvVO" 383 | }, 384 | "source": [ 385 | "For more information about each field visit [this link](https://www.tensorflow.org/datasets/catalog/movielens)." 386 | ] 387 | }, 388 | { 389 | "cell_type": "markdown", 390 | "metadata": { 391 | "id": "82VLAC1qJHMD" 392 | }, 393 | "source": [ 394 | "### Normalize numerical features\n", 395 | "`timestamp` values are far too large to be used directly in a machine learning model. However, it can be normalized in a small interval around 0. 
Standardization (`Z-score Normalization`) is a common preprocessing transformation that rescales features to normalize their range by subtracting the feature's `mean` and dividing by its `standard deviation`.\n", 396 | "\n" 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "metadata": { 402 | "id": "Ga9_p0_1GpUC", 403 | "colab": { 404 | "base_uri": "https://localhost:8080/" 405 | }, 406 | "outputId": "fdea8df1-9f6a-47c7-a1ae-ed70342b6201" 407 | }, 408 | "source": [ 409 | "# Make a Keras Normalization layer to standardize a numerical feature.\n", 410 | "timestamp_normalization_layer = \\\n", 411 | " tf.keras.layers.Normalization(axis=None)\n", 412 | "\n", 413 | "# Normalization layer is a non-trainable layer and its state (mean and std of\n", 414 | "# feature set) must be set before training in a step called \"adaptation\".\n", 415 | "timestamp_normalization_layer.adapt(\n", 416 | " ratings_trainset.map(\n", 417 | " lambda x: x['timestamp']\n", 418 | " )\n", 419 | ")\n", 420 | "\n", 421 | "for rating in ratings_trainset.take(3).as_numpy_iterator():\n", 422 | " print(\n", 423 | " f\"Raw timestamp: {rating['timestamp']} ->\",\n", 424 | " f\"Normalized timestamp: {timestamp_normalization_layer(rating['timestamp'])}\"\n", 425 | " )" 426 | ], 427 | "execution_count": 5, 428 | "outputs": [ 429 | { 430 | "output_type": "stream", 431 | "name": "stdout", 432 | "text": [ 433 | "Raw timestamp: 975212199 -> Normalized timestamp: -1.0651628971099854\n", 434 | "Raw timestamp: 1167543227 -> Normalized timestamp: -0.1764659434556961\n", 435 | "Raw timestamp: 958881789 -> Normalized timestamp: -1.1406203508377075\n" 436 | ] 437 | } 438 | ] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "metadata": { 443 | "id": "Q_2VmQ4KpOpo" 444 | }, 445 | "source": [ 446 | "### Turning categorical features into embeddings\n", 447 | "\n", 448 | "A categorical feature is a feature that does not express a continuous quantity, but rather takes on one of a set of fixed values. 
Most deep learning models express these features by turning them into high-dimensional embedding vectors which will be adjusted during model training.\n", 449 | "\n", 450 | "Here we represent each user and each movie by an embedding vector. Initially, these embeddings will take on random values, but during training, we will adjust them so that embeddings of users and the movies they watch end up closer together.\n", 451 | "\n", 452 | "Taking raw categorical features and turning them into embeddings is normally a two-step process:\n", 453 | "\n", 454 | "\n", 455 | "1. Build a mapping (called a `\"vocabulary\"`) that maps each raw value (for example \"Postman, The (1997)\") to a unique integer (say, 15).\n", 456 | "2. Turn these integers into embedding vectors.\n", 457 | "\n" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "metadata": { 463 | "id": "2yfWHKFfpPSx", 464 | "colab": { 465 | "base_uri": "https://localhost:8080/" 466 | }, 467 | "outputId": "abb36d70-c592-48bf-a40a-0bcef167b7c3" 468 | }, 469 | "source": [ 470 | "# Make a Keras StringLookup layer as the mapping (lookup)\n", 471 | "user_id_lookup_layer = \\\n", 472 | " tf.keras.layers.StringLookup(mask_token=None)\n", 473 | "\n", 474 | "# StringLookup layer is a non-trainable layer and its state (the vocabulary)\n", 475 | "# must be constructed and set before training in a step called \"adaptation\".\n", 476 | "user_id_lookup_layer.adapt(\n", 477 | " ratings_trainset.map(\n", 478 | " lambda x: x['user_id']\n", 479 | " )\n", 480 | ")\n", 481 | "\n", 482 | "print(\n", 483 | " f\"Vocabulary[:10] -> {user_id_lookup_layer.get_vocabulary()[:10]}\"\n", 484 | " # Vocabulary: ['[UNK]', '405', '655', '13', ...]\n", 485 | " # The vocabulary includes one (or more!) unknown (or \"out of vocabulary\", OOV)\n", 486 | " # tokens. 
So the layer can handle categorical values that are not in the\n", 487 | " # vocabulary and the model can continue to learn about and make\n", 488 | " # recommendations even using features that have not been seen during\n", 489 | " # vocabulary construction.\n", 490 | ")\n", 491 | "\n", 492 | "print(\n", 493 | " \"Mapped integer for user ids: ['-2', '13', '655', 'xxx']\\n\",\n", 494 | " user_id_lookup_layer(\n", 495 | " ['-2', '13', '655', 'xxx']\n", 496 | " )\n", 497 | ")\n", 498 | "\n", 499 | "user_id_embedding_dim = 32\n", 500 | "# The larger it is, the higher the capacity of the model, but the slower it is\n", 501 | "# to fit and serve and more prone to overfitting.\n", 502 | "\n", 503 | "user_id_embedding_layer = tf.keras.layers.Embedding(\n", 504 | " # Size of the vocabulary\n", 505 | " input_dim=user_id_lookup_layer.vocabulary_size(),\n", 506 | " # Dimension of the dense embedding\n", 507 | " output_dim=user_id_embedding_dim\n", 508 | ")\n", 509 | "\n", 510 | "# A model that takes raw string feature values (user_id) in and yields embeddings\n", 511 | "user_id_model = tf.keras.Sequential(\n", 512 | " [\n", 513 | " user_id_lookup_layer,\n", 514 | " user_id_embedding_layer\n", 515 | " ]\n", 516 | ")\n", 517 | "\n", 518 | "print(\n", 519 | " \"Embeddings for user ids: ['-2', '13', '655', 'xxx']\\n\",\n", 520 | " user_id_model(\n", 521 | " tf.convert_to_tensor(['-2', '13', '655', 'xxx'])\n", 522 | " )\n", 523 | ")" 524 | ], 525 | "execution_count": 6, 526 | "outputs": [ 527 | { 528 | "output_type": "stream", 529 | "name": "stdout", 530 | "text": [ 531 | "Vocabulary[:10] -> ['[UNK]', '414', '599', '474', '448', '274', '610', '68', '380', '606']\n", 532 | "Mapped integer for user ids: ['-2', '13', '655', 'xxx']\n", 533 | " tf.Tensor([ 0 493 0 0], shape=(4,), dtype=int64)\n", 534 | "Embeddings for user ids: ['-2', '13', '655', 'xxx']\n", 535 | " tf.Tensor(\n", 536 | "[[ 0.01478275 -0.03399806 -0.04737892 -0.01455851 0.02260116 0.03772945\n", 537 | " -0.04947149 
-0.03576241 0.01529941 -0.01436429 0.03431726 0.00177632\n", 538 | " 0.00778866 0.00945215 -0.02211686 0.04326234 0.00865857 0.03611586\n", 539 | " -0.00819879 -0.01737251 -0.03739591 0.04867731 -0.00277033 0.02783228\n", 540 | " -0.03756921 0.04588716 -0.00360299 -0.00270822 -0.01879736 0.00096357\n", 541 | " 0.04194726 -0.03343551]\n", 542 | " [-0.00042721 0.01578123 0.03790379 0.03472162 -0.04650879 0.02481607\n", 543 | " -0.0494769 0.01245934 0.04805774 0.0247086 0.03865222 0.00027167\n", 544 | " 0.00802004 0.00562286 0.02058078 -0.03171451 0.0408157 0.00758809\n", 545 | " 0.04206653 0.04217685 0.01028284 0.02889507 0.03656546 0.00334147\n", 546 | " -0.00780983 -0.02825767 0.00996434 0.02749718 -0.03737084 -0.0091005\n", 547 | " 0.03157964 0.01717177]\n", 548 | " [ 0.01478275 -0.03399806 -0.04737892 -0.01455851 0.02260116 0.03772945\n", 549 | " -0.04947149 -0.03576241 0.01529941 -0.01436429 0.03431726 0.00177632\n", 550 | " 0.00778866 0.00945215 -0.02211686 0.04326234 0.00865857 0.03611586\n", 551 | " -0.00819879 -0.01737251 -0.03739591 0.04867731 -0.00277033 0.02783228\n", 552 | " -0.03756921 0.04588716 -0.00360299 -0.00270822 -0.01879736 0.00096357\n", 553 | " 0.04194726 -0.03343551]\n", 554 | " [ 0.01478275 -0.03399806 -0.04737892 -0.01455851 0.02260116 0.03772945\n", 555 | " -0.04947149 -0.03576241 0.01529941 -0.01436429 0.03431726 0.00177632\n", 556 | " 0.00778866 0.00945215 -0.02211686 0.04326234 0.00865857 0.03611586\n", 557 | " -0.00819879 -0.01737251 -0.03739591 0.04867731 -0.00277033 0.02783228\n", 558 | " -0.03756921 0.04588716 -0.00360299 -0.00270822 -0.01879736 0.00096357\n", 559 | " 0.04194726 -0.03343551]], shape=(4, 32), dtype=float32)\n" 560 | ] 561 | } 562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "source": [ 567 | "movie_id_lookup_layer = \\\n", 568 | " tf.keras.layers.StringLookup(mask_token=None)\n", 569 | "\n", 570 | "movie_id_lookup_layer.adapt(\n", 571 | " ratings_trainset.map(\n", 572 | " lambda x: x['movie_id']\n", 573 | " 
)\n", 574 | ")\n", 575 | "\n", 576 | "# Same as user_id_embedding_dim to be able to measure the similarity\n", 577 | "movie_id_embedding_dim = 32\n", 578 | "\n", 579 | "movie_id_embedding_layer = tf.keras.layers.Embedding(\n", 580 | " input_dim=movie_id_lookup_layer.vocabulary_size(),\n", 581 | " output_dim=movie_id_embedding_dim\n", 582 | ")\n", 583 | "\n", 584 | "movie_id_model = tf.keras.Sequential(\n", 585 | " [\n", 586 | " movie_id_lookup_layer,\n", 587 | " movie_id_embedding_layer\n", 588 | " ]\n", 589 | ")\n", 590 | "\n", 591 | "print(\n", 592 | " f\"Embedding for the movie 898:\\n {movie_id_model(tf.convert_to_tensor(['898']))}\"\n", 593 | ")" 594 | ], 595 | "metadata": { 596 | "id": "O-Fw4HAuO1kh", 597 | "outputId": "be83a6c3-31f7-4cd7-fc1e-83f94074d213", 598 | "colab": { 599 | "base_uri": "https://localhost:8080/" 600 | } 601 | }, 602 | "execution_count": 7, 603 | "outputs": [ 604 | { 605 | "output_type": "stream", 606 | "name": "stdout", 607 | "text": [ 608 | "Embedding for the movie 898:\n", 609 | " [[ 0.02420554 -0.02993221 -0.03578611 -0.01940017 0.02106949 -0.01936241\n", 610 | " -0.00158745 0.00174867 0.02626561 -0.0172717 0.02571883 -0.03316146\n", 611 | " 0.03160522 0.04094924 -0.01908119 0.01257378 -0.00210525 -0.013066\n", 612 | " -0.03793862 0.01088411 0.02089325 -0.04121248 0.03619212 0.01857083\n", 613 | " 0.02743815 0.03726472 -0.03322572 0.03820366 0.01855929 0.03005471\n", 614 | " -0.01181834 -0.00866457]]\n" 615 | ] 616 | } 617 | ] 618 | }, 619 | { 620 | "cell_type": "markdown", 621 | "metadata": { 622 | "id": "U3hdi7pZS2oC" 623 | }, 624 | "source": [ 625 | "### Tokenize textual features and translate them into embeddings\n", 626 | "Candidates textual description and users' reviews can be useful especially in a `cold-start` or `long-tail` scenario.\n", 627 | "\n", 628 | "While the MovieLens dataset does not give us rich textual features, we can still use movie titles. 
This may help us capture the fact that movies with very similar titles are likely to belong to the same series (for example \"Harry Potter and the Philosopher's Stone\" and \"Harry Potter and the Chamber of Secrets\")." 629 | ] 630 | }, 631 | { 632 | "cell_type": "code", 633 | "metadata": { 634 | "id": "mGF8c2ZKOtuG", 635 | "colab": { 636 | "base_uri": "https://localhost:8080/" 637 | }, 638 | "outputId": "a4d65258-0200-4986-8653-fdf07ba98eb7" 639 | }, 640 | "source": [ 641 | "# Keras TextVectorization layer transforms the raw texts into `word pieces` and\n", 642 | "# map these pieces into tokens.\n", 643 | "movie_title_vectorization_layer = \\\n", 644 | " tf.keras.layers.TextVectorization()\n", 645 | "movie_title_vectorization_layer.adapt(\n", 646 | " ratings_trainset.map(\n", 647 | " lambda rating: rating['movie_title']\n", 648 | " )\n", 649 | ")\n", 650 | "\n", 651 | "# Verify that the tokenization is done correctly\n", 652 | "print(\n", 653 | " \"Vocabulary[40:50] -> \",\n", 654 | " movie_title_vectorization_layer.get_vocabulary()[40:50]\n", 655 | ")\n", 656 | "\n", 657 | "print(\n", 658 | " \"Vectorized title for 'Postman, The (1997)'\\n\",\n", 659 | " movie_title_vectorization_layer('Postman, The (1997)')\n", 660 | ")\n", 661 | "\n", 662 | "movie_title_embedding_dim = 32\n", 663 | "movie_title_embedding_layer = tf.keras.layers.Embedding(\n", 664 | " input_dim=len(movie_title_vectorization_layer.get_vocabulary()),\n", 665 | " output_dim=movie_title_embedding_dim,\n", 666 | " # Whether or not the input value 0 is a MASK token.\n", 667 | " # Keras TextVectorization layer builds the vocabulary with MASK token.\n", 668 | " mask_zero=True\n", 669 | ")\n", 670 | "\n", 671 | "movie_title_model = tf.keras.Sequential(\n", 672 | " [\n", 673 | " movie_title_vectorization_layer,\n", 674 | " movie_title_embedding_layer,\n", 675 | " # each title contains multiple words, so we will get multiple embeddings\n", 676 | " # for each title that should be compressed into a single 
embedding for\n", 677 | " # the text. Sequence models such as RNNs or Transformers are useful here.\n", 678 | " # However, averaging all the words' embeddings together is also a good\n", 679 | " # starting point.\n", 680 | " tf.keras.layers.GlobalAveragePooling1D()\n", 681 | " ]\n", 682 | ")" 683 | ], 684 | "execution_count": 8, 685 | "outputs": [ 686 | { 687 | "output_type": "stream", 688 | "name": "stdout", 689 | "text": [ 690 | "Vocabulary[40:50] -> ['2014', 'ii', '1985', '2013', 'day', 'wars', '2015', '1982', 'for', 'episode']\n", 691 | "Vectorized title for 'Postman, The (1997)'\n", 692 | " tf.Tensor([1119 2 12], shape=(3,), dtype=int64)\n" 693 | ] 694 | } 695 | ] 696 | }, 697 | { 698 | "cell_type": "markdown", 699 | "metadata": { 700 | "id": "PEmOyPxuOcHx" 701 | }, 702 | "source": [ 703 | "For more information about feature preprocessing visit [this link](https://www.tensorflow.org/recommenders/examples/featurization?hl=lt)." 704 | ] 705 | }, 706 | { 707 | "cell_type": "markdown", 708 | "metadata": { 709 | "id": "UChreol5-Aj2" 710 | }, 711 | "source": [ 712 | "## Query and Candidate representation\n", 713 | "We are building a [two-tower retrieval model](https://research.google/pubs/pub48840/): a model made of two separate sub-models (towers), one that transforms raw query features into a query representation (the query tower) and another that transforms raw candidate features into a candidate representation of the same dimensionality (the candidate tower).\n", 714 | "\n", 715 | "The output tensors of the two towers are multiplied together (inner product) to give a query-candidate `affinity score` (similarity measure). Higher scores express a better match between the candidate and the query."
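
The scoring step described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than TensorFlow; the embedding values and the helper names `affinity_score` and `rank_candidates` are made up for the example and are not part of the notebook or of TFRS.

```python
def affinity_score(query_embedding, candidate_embedding):
    """Inner product of two equal-length embedding vectors."""
    if len(query_embedding) != len(candidate_embedding):
        raise ValueError("both towers must output the same dimensionality")
    return sum(q * c for q, c in zip(query_embedding, candidate_embedding))


def rank_candidates(query_embedding, candidate_embeddings):
    """Return candidate ids sorted from highest to lowest affinity score."""
    scores = {
        candidate_id: affinity_score(query_embedding, embedding)
        for candidate_id, embedding in candidate_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical 3-dimensional embeddings (the notebook uses 32 dimensions).
user_embedding = [0.5, -1.0, 2.0]
movie_embeddings = {
    "movie_a": [1.0, 0.0, 1.0],  # score = 0.5 + 0.0 + 2.0 = 2.5
    "movie_b": [0.0, 1.0, 0.0],  # score = -1.0
    "movie_c": [2.0, 1.0, 0.5],  # score = 1.0 - 1.0 + 1.0 = 1.0
}

ranking = rank_candidates(user_embedding, movie_embeddings)
print(ranking)
```

The dimensionality check mirrors the reason both towers must output vectors of the same size: an inner product is only defined for equal-length vectors.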
716 | ] 717 | }, 718 | { 719 | "cell_type": "code", 720 | "metadata": { 721 | "id": "H1jx9RDfoJIq" 722 | }, 723 | "source": [ 724 | "# Query tower\n", 725 | "query_model = user_id_model\n", 726 | "\n", 727 | "# Candidate tower\n", 728 | "candidate_model = movie_id_model\n", 729 | "\n", 730 | "\n", 731 | "# Here we only used query and candidate identifiers to build the towers. This\n", 732 | "# corresponds exactly to a classic matrix factorization approach.\n", 733 | "# https://ieeexplore.ieee.org/abstract/document/4781121\n", 734 | "# However, we can extend `tf.keras.Model` class to an arbitrarily complex model\n", 735 | "# including other features and return the final embedding vector at the end.\n", 736 | "# For example, by using movie metadata in the candidate tower, we can alleviate\n", 737 | "# the cold-start problem.\n", 738 | "# return tf.concat([\n", 739 | "# self.user_embedding(inputs[\"user_id\"]),\n", 740 | "# self.timestamp_embedding(inputs[\"timestamp\"]),\n", 741 | "# self.normalized_timestamp(inputs[\"timestamp\"])\n", 742 | "# ], axis=1)" 743 | ], 744 | "execution_count": 9, 745 | "outputs": [] 746 | }, 747 | { 748 | "cell_type": "markdown", 749 | "metadata": { 750 | "id": "-X0nS9g18q4v" 751 | }, 752 | "source": [ 753 | "## Build the Retrieval (Candidate Generation) task\n", 754 | "\n", 755 | "The retrieval stage selects an initial set of hundreds of candidates from all possible candidates. The main objective of this model is to efficiently weed out all candidates that the user is not interested in. Because the retrieval model may be dealing with millions of candidates, **it has to be computationally efficient**.\n", 756 | "\n", 757 | "A retrieval system is a model that predicts a set of movies from the catalogue that the user is likely to watch. So the training set should express which movies the users watched, and which they did not. 
For example:\n", 758 | "```\n", 759 | "[\n", 760 | " (('user1', 'Star Wars'), POSITIVE),\n", 761 | " (('user1', 'Harry Potter'), NEGATIVE),\n", 762 | " ...\n", 763 | "]\n", 764 | "```\n", 765 | "So we treat MovieLens as an `implicit feedback dataset`, where users' watches tell us which things they prefer to see and which they'd rather not see. This means that every movie a user rated (so watched!), no matter the given rating, is an **implicit positive example**, and every movie they have not rated (not seen!) is an **implicit negative example**." 766 | ] 767 | }, 768 | { 769 | "cell_type": "code", 770 | "metadata": { 771 | "id": "QP5VaGW69c2B" 772 | }, 773 | "source": [ 774 | "# We don't need the rating field for the retrieval task\n", 775 | "retrieval_ratings_trainset = ratings_trainset.map(\n", 776 | " lambda rating: {\n", 777 | " 'user_id': rating['user_id'],\n", 778 | " 'movie_id': rating['movie_id'],\n", 779 | " }\n", 780 | ")\n", 781 | "\n", 782 | "retrieval_ratings_testset = ratings_testset.map(\n", 783 | " lambda rating: {\n", 784 | " 'user_id': rating['user_id'],\n", 785 | " 'movie_id': rating['movie_id'],\n", 786 | " }\n", 787 | ")" 788 | ], 789 | "execution_count": 10, 790 | "outputs": [] 791 | }, 792 | { 793 | "cell_type": "markdown", 794 | "metadata": { 795 | "id": "SqeFh9s99bLw" 796 | }, 797 | "source": [ 798 | "The similarity between the query representation (query embedding vector) and the candidate representation (candidate embedding vector), a.k.a. the `affinity score`, can be calculated by a dot product (or another similarity measure). The K nearest candidates (those with the highest affinity scores) will be chosen for the final list.\n", 799 | "\n", 800 | "In our training data we only have positive (user, movie) pairs. 
To figure out how good our model is, we need to compare the affinity score that the model calculates for this positive pair to the scores of all the other possible candidates: if the score for the positive pair is higher than for all other possible candidates, our model is highly accurate.\n", 801 | "\n", 802 | "To measure the performance of a retrieval task, `factorized top-K categorical accuracy` metrics over a corpus of candidates can be used. These metrics measure how good the model is at picking the true candidate out of **all possible candidates** in the system.\n", 803 | "\n", 804 | "For example, a `top-5 categorical accuracy` metric of `0.2` would tell us that, on average, the true positive is in the top 5 retrieved items 20% of the time." 805 | ] 806 | }, 807 | { 808 | "cell_type": "code", 809 | "metadata": { 810 | "id": "DgwkIUn71_0y", 811 | "colab": { 812 | "base_uri": "https://localhost:8080/" 813 | }, 814 | "outputId": "f868ba6f-c3ae-4536-a977-1557ef45f115" 815 | }, 816 | "source": [ 817 | "# To calculate the factorized top-k categorical accuracy we need the dataset of\n", 818 | "# all possible candidates that are used as implicit negatives for evaluation.\n", 819 | "movies_dataset, movies_dataset_info = tfds.load(\n", 820 | " name='movielens/latest-small-movies',\n", 821 | " split='train',\n", 822 | " with_info=True\n", 823 | ")\n", 824 | "\n", 825 | "print(\n", 826 | " tfds.as_dataframe(movies_dataset.take(5), movies_dataset_info)\n", 827 | ")\n", 828 | "\n", 829 | "# We are using just `movie_id` feature for making the candidates representation\n", 830 | "candidates_corpus_dataset = movies_dataset.map(\n", 831 | " lambda movie: movie['movie_id']\n", 832 | ")" 833 | ], 834 | "execution_count": 11, 835 | "outputs": [ 836 | { 837 | "output_type": "stream", 838 | "name": "stdout", 839 | "text": [ 840 | " movie_genres movie_id movie_title\n", 841 | "0 [4] b'2261' b'One Crazy Summer (1986)'\n", 842 | "1 [10] b'1979' b'Friday the 13th Part VI: Jason Lives 
(1986)'\n", 843 | "2 [4, 5] b'6143' b'Trail of the Pink Panther (1982)'\n", 844 | "3 [4, 7, 14] b'5856' b'Do You Remember Dolly Bell? (Sjecas li se, D...\n", 845 | "4 [0, 4, 7, 16] b'70728' b'Bronson (2009)'\n" 846 | ] 847 | } 848 | ] 849 | }, 850 | { 851 | "cell_type": "markdown", 852 | "metadata": { 853 | "id": "P0aPHewDuQZf" 854 | }, 855 | "source": [ 856 | "[TensorFlow Recommenders](https://www.tensorflow.org/recommenders) (TFRS) is a library to facilitate building and evaluating flexible recommendation models.\n", 857 | "\n", 858 | "It can calculate the `factorized top-k categorical accuracy` through its `FactorizedTopK` metrics, using a dataset of all possible candidate embeddings." 859 | ] 860 | }, 861 | { 862 | "cell_type": "code", 863 | "metadata": { 864 | "id": "YPO73Ihx_xQS" 865 | }, 866 | "source": [ 867 | "!pip install -q scann tensorflow-recommenders\n", 868 | "# We also install the `ScaNN` package as a dependency of the TFRS library.\n", 869 | "# ScaNN is described later, but it has to be installed before importing\n", 870 | "# TFRS.\n", 871 | "import tensorflow_recommenders as tfrs\n", 872 | "\n", 873 | "factorized_top_k_metrics = tfrs.metrics.FactorizedTopK(\n", 874 | " # dataset of candidate embeddings from which candidates should be retrieved\n", 875 | " candidates=candidates_corpus_dataset.batch(128).map(\n", 876 | " candidate_model\n", 877 | " )\n", 878 | ")" 879 | ], 880 | "execution_count": 12, 881 | "outputs": [] 882 | }, 883 | { 884 | "cell_type": "markdown", 885 | "metadata": { 886 | "id": "Ie_v22Od-zY8" 887 | }, 888 | "source": [ 889 | "An `in-batch softmax loss` can be used as the `loss function` to train the system.\n", 890 | "\n", 891 | "TFRS provides a Keras layer named `tfrs.tasks.Retrieval` that takes the query and candidate embeddings as arguments, and returns the computed loss."
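
To make the idea of an in-batch softmax loss concrete, here is a rough sketch in plain Python. It is not the TFRS implementation. It assumes that row `i` of the batch is one positive (query, candidate) pair, that every other candidate in the same batch acts as a negative for query `i`, and that the per-row losses are summed over the batch; the function name `in_batch_softmax_loss` is hypothetical.

```python
import math


def in_batch_softmax_loss(query_embeddings, candidate_embeddings):
    """Sum of softmax cross-entropy losses over a batch of positive pairs.

    Row i of each list forms one (query, positive candidate) pair; all other
    candidates in the batch are treated as negatives for query i.
    """
    total_loss = 0.0
    for i, query in enumerate(query_embeddings):
        # Affinity (inner product) of query i with every candidate in the batch.
        logits = [
            sum(q * c for q, c in zip(query, candidate))
            for candidate in candidate_embeddings
        ]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        # Cross-entropy with the matching candidate (index i) as the label.
        total_loss += log_sum_exp - logits[i]
    return total_loss


# An uninformative model: all embeddings are zero, so every logit is 0 and
# each row contributes ln(batch_size) to the loss.
batch_size = 4
zeros = [[0.0, 0.0] for _ in range(batch_size)]
uninformed_loss = in_batch_softmax_loss(zeros, zeros)

# A model whose queries align with their own candidates gets a lower loss.
aligned = [[5.0, 0.0], [0.0, 5.0]]
aligned_loss = in_batch_softmax_loss(aligned, aligned)
print(uninformed_loss, aligned_loss)
```

Under the summation assumption, an untrained model with a batch of 8192 pairs would score roughly 8192 × ln(8192), which is about 73,800. That is the same order of magnitude as the first-epoch training loss logged further down, so the large loss values there need not indicate a problem.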
892 | ] 893 | }, 894 | { 895 | "cell_type": "code", 896 | "metadata": { 897 | "id": "4E8eOyErNqHY" 898 | }, 899 | "source": [ 900 | "retrieval_task_layer = tfrs.tasks.Retrieval(\n", 901 | " metrics=factorized_top_k_metrics\n", 902 | ")\n", 903 | "\n", 904 | "# The task computes the metrics and returns the in-batch softmax loss.\n", 905 | "# Because the metrics range over the entire candidate set, they are usually much\n", 906 | "# slower to compute. Consider passing `compute_metrics=False` when calling the\n", 907 | "# task during training to save the time spent computing the metrics." 908 | ], 909 | "execution_count": 13, 910 | "outputs": [] 911 | }, 912 | { 913 | "cell_type": "markdown", 914 | "metadata": { 915 | "id": "lVFXh69bPYL9" 916 | }, 917 | "source": [ 918 | "### Create the training loop\n", 919 | "To create an appropriate training loop and train the models we can extend the class `tf.keras.Model` and override the `train_step` and `test_step` functions. [See how](https://keras.io/guides/customizing_what_happens_in_fit/).\n", 920 | "\n", 921 | "However, to keep the focus on modelling and abstract away some of the boilerplate, TFRS exposes the `tfrs.models.Model` base class, which allows us to compute both training and test losses using the same method. All we need to do is set up the components in the `__init__` method and implement the `compute_loss` method, which takes in the raw features and returns a loss value. The base model will then take care of creating the appropriate training loop to fit the model."
922 | ] 923 | }, 924 | { 925 | "cell_type": "code", 926 | "metadata": { 927 | "id": "EGN_rnLNN07m" 928 | }, 929 | "source": [ 930 | "class RetrievalModel(tfrs.models.Model):\n", 931 | " \"\"\"MovieLens candidate generation model\"\"\"\n", 932 | "\n", 933 | " def __init__(self, query_model, candidate_model, retrieval_task_layer):\n", 934 | " super().__init__()\n", 935 | " self.query_model: tf.keras.Model = query_model\n", 936 | " self.candidate_model: tf.keras.Model = candidate_model\n", 937 | " self.retrieval_task_layer: tf.keras.layers.Layer = retrieval_task_layer\n", 938 | "\n", 939 | " #def compute_loss(self, features: Dict[Text, tf.Tensor], training=False):\n", 940 | " def compute_loss(self, features, training=False) -> tf.Tensor:\n", 941 | " query_embeddings = self.query_model(features['user_id'])\n", 942 | " positive_candidate_embeddings = self.candidate_model(features[\"movie_id\"])\n", 943 | "\n", 944 | " loss = self.retrieval_task_layer(\n", 945 | " query_embeddings,\n", 946 | " positive_candidate_embeddings\n", 947 | " # ,compute_metrics=not training # To speed up training\n", 948 | " )\n", 949 | " return loss" 950 | ], 951 | "execution_count": 14, 952 | "outputs": [] 953 | }, 954 | { 955 | "cell_type": "markdown", 956 | "metadata": { 957 | "id": "ylJqYSN8GH68" 958 | }, 959 | "source": [ 960 | "### Fit the model using standard Keras routine" 961 | ] 962 | }, 963 | { 964 | "cell_type": "code", 965 | "metadata": { 966 | "id": "HoyYTDhB9CAP" 967 | }, 968 | "source": [ 969 | "movielens_retrieval_model = RetrievalModel(\n", 970 | " query_model,\n", 971 | " candidate_model,\n", 972 | " retrieval_task_layer\n", 973 | ")\n", 974 | "\n", 975 | "optimizer_step_size = 0.1\n", 976 | "movielens_retrieval_model.compile(\n", 977 | " optimizer=tf.keras.optimizers.Adagrad(\n", 978 | " learning_rate=optimizer_step_size\n", 979 | " )\n", 980 | ")" 981 | ], 982 | "execution_count": 15, 983 | "outputs": [] 984 | }, 985 | { 986 | "cell_type": "code", 987 | "metadata": { 988 | 
"id": "ZNw02KWxE3bI", 989 | "colab": { 990 | "base_uri": "https://localhost:8080/" 991 | }, 992 | "outputId": "f17923c4-38d2-4cda-dff0-1242bfff9eae" 993 | }, 994 | "source": [ 995 | "# Shuffle the training data for each epoch.\n", 996 | "# Batch and cache both the training and evaluation data.\n", 997 | "# `cache()` method caches the elements in the dataset in memory. To caches data\n", 998 | "# in a file pass the `filename` argument to the method: cache(filename='')\n", 999 | "# The first time the dataset is iterated over, its elements will be cached\n", 1000 | "# either in the specified file or in memory. Subsequent iterations will use the\n", 1001 | "# cached data.\n", 1002 | "retrieval_cached_ratings_trainset = \\\n", 1003 | " retrieval_ratings_trainset.shuffle(100_000).batch(8192).cache()\n", 1004 | "retrieval_cached_ratings_testset = \\\n", 1005 | " retrieval_ratings_testset.batch(4096).cache()\n", 1006 | "\n", 1007 | "num_epochs = 5\n", 1008 | "history = movielens_retrieval_model.fit(\n", 1009 | " retrieval_cached_ratings_trainset,\n", 1010 | " validation_data=retrieval_cached_ratings_testset,\n", 1011 | " validation_freq=1,\n", 1012 | " epochs=num_epochs\n", 1013 | ")" 1014 | ], 1015 | "execution_count": 16, 1016 | "outputs": [ 1017 | { 1018 | "output_type": "stream", 1019 | "name": "stdout", 1020 | "text": [ 1021 | "Epoch 1/5\n", 1022 | "10/10 [==============================] - 169s 15s/step - factorized_top_k/top_1_categorical_accuracy: 6.1982e-05 - factorized_top_k/top_5_categorical_accuracy: 0.0013 - factorized_top_k/top_10_categorical_accuracy: 0.0031 - factorized_top_k/top_50_categorical_accuracy: 0.0262 - factorized_top_k/top_100_categorical_accuracy: 0.0574 - loss: 71061.1591 - regularization_loss: 0.0000e+00 - total_loss: 71061.1591 - val_factorized_top_k/top_1_categorical_accuracy: 0.0017 - val_factorized_top_k/top_5_categorical_accuracy: 0.0090 - val_factorized_top_k/top_10_categorical_accuracy: 0.0175 - 
val_factorized_top_k/top_50_categorical_accuracy: 0.0799 - val_factorized_top_k/top_100_categorical_accuracy: 0.1389 - val_loss: 30403.5078 - val_regularization_loss: 0.0000e+00 - val_total_loss: 30403.5078\n", 1023 | "Epoch 2/5\n", 1024 | "10/10 [==============================] - 128s 13s/step - factorized_top_k/top_1_categorical_accuracy: 8.4296e-04 - factorized_top_k/top_5_categorical_accuracy: 0.0079 - factorized_top_k/top_10_categorical_accuracy: 0.0174 - factorized_top_k/top_50_categorical_accuracy: 0.0816 - factorized_top_k/top_100_categorical_accuracy: 0.1432 - loss: 68374.8374 - regularization_loss: 0.0000e+00 - total_loss: 68374.8374 - val_factorized_top_k/top_1_categorical_accuracy: 7.9334e-04 - val_factorized_top_k/top_5_categorical_accuracy: 0.0065 - val_factorized_top_k/top_10_categorical_accuracy: 0.0140 - val_factorized_top_k/top_50_categorical_accuracy: 0.0683 - val_factorized_top_k/top_100_categorical_accuracy: 0.1245 - val_loss: 29863.0508 - val_regularization_loss: 0.0000e+00 - val_total_loss: 29863.0508\n", 1025 | "Epoch 3/5\n", 1026 | "10/10 [==============================] - 128s 13s/step - factorized_top_k/top_1_categorical_accuracy: 0.0012 - factorized_top_k/top_5_categorical_accuracy: 0.0118 - factorized_top_k/top_10_categorical_accuracy: 0.0232 - factorized_top_k/top_50_categorical_accuracy: 0.1039 - factorized_top_k/top_100_categorical_accuracy: 0.1754 - loss: 66258.5078 - regularization_loss: 0.0000e+00 - total_loss: 66258.5078 - val_factorized_top_k/top_1_categorical_accuracy: 4.9583e-04 - val_factorized_top_k/top_5_categorical_accuracy: 0.0050 - val_factorized_top_k/top_10_categorical_accuracy: 0.0111 - val_factorized_top_k/top_50_categorical_accuracy: 0.0597 - val_factorized_top_k/top_100_categorical_accuracy: 0.1075 - val_loss: 29832.7773 - val_regularization_loss: 0.0000e+00 - val_total_loss: 29832.7773\n", 1027 | "Epoch 4/5\n", 1028 | "10/10 [==============================] - 128s 13s/step - 
factorized_top_k/top_1_categorical_accuracy: 0.0012 - factorized_top_k/top_5_categorical_accuracy: 0.0129 - factorized_top_k/top_10_categorical_accuracy: 0.0270 - factorized_top_k/top_50_categorical_accuracy: 0.1152 - factorized_top_k/top_100_categorical_accuracy: 0.1959 - loss: 64616.1264 - regularization_loss: 0.0000e+00 - total_loss: 64616.1264 - val_factorized_top_k/top_1_categorical_accuracy: 4.4625e-04 - val_factorized_top_k/top_5_categorical_accuracy: 0.0035 - val_factorized_top_k/top_10_categorical_accuracy: 0.0085 - val_factorized_top_k/top_50_categorical_accuracy: 0.0501 - val_factorized_top_k/top_100_categorical_accuracy: 0.0933 - val_loss: 30017.0508 - val_regularization_loss: 0.0000e+00 - val_total_loss: 30017.0508\n", 1029 | "Epoch 5/5\n", 1030 | "10/10 [==============================] - 128s 13s/step - factorized_top_k/top_1_categorical_accuracy: 0.0010 - factorized_top_k/top_5_categorical_accuracy: 0.0135 - factorized_top_k/top_10_categorical_accuracy: 0.0288 - factorized_top_k/top_50_categorical_accuracy: 0.1269 - factorized_top_k/top_100_categorical_accuracy: 0.2194 - loss: 63291.1729 - regularization_loss: 0.0000e+00 - total_loss: 63291.1729 - val_factorized_top_k/top_1_categorical_accuracy: 1.4875e-04 - val_factorized_top_k/top_5_categorical_accuracy: 0.0030 - val_factorized_top_k/top_10_categorical_accuracy: 0.0073 - val_factorized_top_k/top_50_categorical_accuracy: 0.0445 - val_factorized_top_k/top_100_categorical_accuracy: 0.0849 - val_loss: 30291.8203 - val_regularization_loss: 0.0000e+00 - val_total_loss: 30291.8203\n" 1031 | ] 1032 | } 1033 | ] 1034 | }, 1035 | { 1036 | "cell_type": "markdown", 1037 | "metadata": { 1038 | "id": "lvDnN_hQJ5fi" 1039 | }, 1040 | "source": [ 1041 | "`factorized_top_k/top_10_categorical_accuracy: 0.0538` would tell us that, on average, the true positive is in the top 10 retrieved items from the entire candidate set 5% of the time.\n", 1042 | "\n", 1043 | "If the candidate set is a large set, turn metric 
calculation off in training, and only run it in evaluation. Because this can be quite slow!" 1044 | ] 1045 | }, 1046 | { 1047 | "cell_type": "code", 1048 | "metadata": { 1049 | "id": "3djrQCxhBZCV", 1050 | "colab": { 1051 | "base_uri": "https://localhost:8080/", 1052 | "height": 472 1053 | }, 1054 | "outputId": "fb0af53f-81de-4fc6-fa68-e8a01649b8b9" 1055 | }, 1056 | "source": [ 1057 | "# Plot changes in model loss during training\n", 1058 | "import matplotlib.pyplot as plt\n", 1059 | "\n", 1060 | "plt.plot(history.history[\"loss\"])\n", 1061 | "plt.plot(history.history[\"val_loss\"])\n", 1062 | "plt.title(\"Model losses during training\")\n", 1063 | "plt.xlabel(\"epoch\")\n", 1064 | "plt.ylabel(\"loss\")\n", 1065 | "plt.legend([\"train\", \"test\"], loc=\"upper right\")\n", 1066 | "plt.show()" 1067 | ], 1068 | "execution_count": 17, 1069 | "outputs": [ 1070 | { 1071 | "output_type": "display_data", 1072 | "data": { 1073 | "text/plain": [ 1074 | "
" 1075 | ], 1076 | "image/png": "\n" 1077 | }, 1078 | "metadata": {} 1079 | } 1080 | ] 1081 | }, 1082 | { 1083 | "cell_type": "code", 1084 | "metadata": { 1085 | "colab": { 1086 | "base_uri": "https://localhost:8080/", 1087 | "height": 472 1088 | }, 1089 | "id": "Fs0nwFpAL7YI", 1090 | "outputId": "c8b3165d-f6cd-4e87-d19d-e89b30a62445" 1091 | }, 1092 | "source": [ 1093 | "# Plot changes in model accuracy during training\n", 1094 | "plt.plot(history.history[\"factorized_top_k/top_100_categorical_accuracy\"])\n", 1095 | "plt.plot(history.history[\"val_factorized_top_k/top_100_categorical_accuracy\"])\n", 1096 | "plt.title(\"Model accuracies during training\")\n", 1097 | "plt.xlabel(\"epoch\")\n", 1098 | "plt.ylabel(\"accuracy\")\n", 1099 | "plt.legend([\"train\", \"test\"], loc=\"upper right\")\n", 1100 | "plt.show()" 1101 | ], 1102 | "execution_count": 18, 1103 | "outputs": [ 1104 | { 1105 | "output_type": "display_data", 1106 | "data": { 1107 | "text/plain": [ 1108 | "
" 1109 | ], 1110 | "image/png": "\n" 1111 | }, 1112 | "metadata": {} 1113 | } 1114 | ] 1115 | }, 1116 | { 1117 | "cell_type": "markdown", 1118 | "metadata": { 1119 | "id": "X25xwNezOn3v" 1120 | }, 1121 | "source": [ 1122 | "As we can see, the model is `overfitted` on the training dataset. It works well for training data, simply because it memorized them. However, the performance on training data is not important. The model should be able to `generalize` to new unseen data.\n", 1123 | "\n", 1124 | "The overfitting phenomenon is especially strong when models have many parameters. It can be mediated by `model regularization` and use of user and movie features that help the model generalize better to unseen data.\n", 1125 | "\n", 1126 | "Moreover, The model is re-recommending some of users' already watched movies. These known-positive watches can crowd out test movies out of top K recommendations and reduce the model accuracy for test data. This can be tackled by excluding previously seen movies from test recommendations in a third stage. This approach is relatively common in the recommender systems literature, but we don't follow it in these tutorials. If not recommending past watches is important, we should expect appropriately specified models to learn this behaviour automatically from past user history and contextual information. Additionally, it is often appropriate to recommend the same item multiple times (say, an evergreen TV series or a regularly purchased item)." 
1127 | ] 1128 | }, 1129 | { 1130 | "cell_type": "markdown", 1131 | "metadata": { 1132 | "id": "vu1I9SGxXIGy" 1133 | }, 1134 | "source": [ 1135 | "### Making predictions\n", 1136 | "To make recommendations, at first, we have to find the representation embedding vectors for all candidates in the corpus and `index` them all for future retrieval.\n", 1137 | "\n", 1138 | "Then, we can get a `query`, pass it through the query tower and find its representation embedding vector.\n", 1139 | "\n", 1140 | "Finally, we can calculate the affinity score between the query and all the candidates, sort them and find `k-nearest` candidates to the query. It is called a `brute-force` search and TFRS exposes a layer named `tfrs.layers.factorized_top_k.BruteForce` to do this." 1141 | ] 1142 | }, 1143 | { 1144 | "cell_type": "code", 1145 | "metadata": { 1146 | "id": "_OqB4kwwS1mR", 1147 | "colab": { 1148 | "base_uri": "https://localhost:8080/" 1149 | }, 1150 | "outputId": "38e0ed4a-d4db-4c52-c5f4-4a1f7e6ea979" 1151 | }, 1152 | "source": [ 1153 | "brute_force_layer = tfrs.layers.factorized_top_k.BruteForce(\n", 1154 | " movielens_retrieval_model.query_model\n", 1155 | ")\n", 1156 | "\n", 1157 | "brute_force_layer.index_from_dataset(\n", 1158 | " tf.data.Dataset.zip(\n", 1159 | " (\n", 1160 | " candidates_corpus_dataset.batch(100),\n", 1161 | " candidates_corpus_dataset.batch(100).map(\n", 1162 | " movielens_retrieval_model.candidate_model\n", 1163 | " )\n", 1164 | " )\n", 1165 | " )\n", 1166 | ")" 1167 | ], 1168 | "execution_count": 19, 1169 | "outputs": [ 1170 | { 1171 | "output_type": "execute_result", 1172 | "data": { 1173 | "text/plain": [ 1174 | "" 1175 | ] 1176 | }, 1177 | "metadata": {}, 1178 | "execution_count": 19 1179 | } 1180 | ] 1181 | }, 1182 | { 1183 | "cell_type": "code", 1184 | "metadata": { 1185 | "id": "ccdAoIjAS0bC", 1186 | "colab": { 1187 | "base_uri": "https://localhost:8080/" 1188 | }, 1189 | "outputId": "f6637372-b3c6-41a2-ecf5-c95a08aa6b1a" 1190 | }, 1191 | 
"source": [ 1192 | "user_id = '42'\n", 1193 | "afinity_scores, movie_ids = brute_force_layer(\n", 1194 | " tf.constant([user_id])\n", 1195 | ")\n", 1196 | "\n", 1197 | "print(f\"Recommendations for user {user_id} using BruteForce: {movie_ids[0, :5]}\")" 1198 | ], 1199 | "execution_count": 20, 1200 | "outputs": [ 1201 | { 1202 | "output_type": "stream", 1203 | "name": "stdout", 1204 | "text": [ 1205 | "Recommendations for user 42 using BruteForce: [b'2771' b'3142' b'3695' b'3120' b'4207']\n" 1206 | ] 1207 | } 1208 | ] 1209 | }, 1210 | { 1211 | "cell_type": "markdown", 1212 | "metadata": { 1213 | "id": "OVlilfUEYPMa" 1214 | }, 1215 | "source": [ 1216 | "Performing a brute-force search in a large corpus can be too slow and impractical in a production environment. In practice we can speed this up by using an `Approximate Nearest Neighbor` search instead of the brute-force search. Approximate Nearest Neighbor (`ANN`) search can greatly outperform brute-force search speed while sacrificing little in terms of accuracy.\n", 1217 | "\n", 1218 | "We can use Google [ScaNN](https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html), Facebook [Faiss](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/), Spotify [Annoy](https://github.com/spotify/annoy), or other approximate vector similarity search libraries to build an `approximate retrieval index`. 
However, TFRS supports the `ScaNN` library directly and exposes the layer `tfrs.layers.factorized_top_k.ScaNN`, which lets us keep the model truly end-to-end.\n", 1219 | "\n", 1220 | "Note: to use the `tfrs.layers.factorized_top_k.ScaNN` layer, install the `ScaNN` library with `pip install scann` before importing the `TFRS` library.\n" 1221 | ] 1222 | }, 1223 | { 1224 | "cell_type": "code", 1225 | "metadata": { 1226 | "colab": { 1227 | "base_uri": "https://localhost:8080/" 1228 | }, 1229 | "id": "2_c77jsdiPp9", 1230 | "outputId": "4b775ddb-a1ba-47b0-d396-571a82896834" 1231 | }, 1232 | "source": [ 1233 | "scann_layer = tfrs.layers.factorized_top_k.ScaNN(\n", 1234 | " movielens_retrieval_model.query_model\n", 1235 | ")\n", 1236 | "\n", 1237 | "scann_layer.index_from_dataset(\n", 1238 | " tf.data.Dataset.zip(\n", 1239 | " (\n", 1240 | " candidates_corpus_dataset.batch(100),\n", 1241 | " candidates_corpus_dataset.batch(100).map(\n", 1242 | " movielens_retrieval_model.candidate_model\n", 1243 | " )\n", 1244 | " )\n", 1245 | " )\n", 1246 | ")\n", 1247 | "\n", 1248 | "user_id = '42'\n", 1249 | "afinity_scores, movie_ids = scann_layer(\n", 1250 | " tf.constant([user_id])\n", 1251 | ")\n", 1252 | "\n", 1253 | "print(f\"Recommendations for user {user_id} using ScaNN: {movie_ids[0, :5]}\")" 1254 | ], 1255 | "execution_count": 21, 1256 | "outputs": [ 1257 | { 1258 | "output_type": "stream", 1259 | "name": "stdout", 1260 | "text": [ 1261 | "Recommendations for user 42 using ScaNN: [b'3142' b'2574' b'3695' b'4565' b'4442']\n" 1262 | ] 1263 | } 1264 | ] 1265 | }, 1266 | { 1267 | "cell_type": "markdown", 1268 | "metadata": { 1269 | "id": "iumhoS_lucaV" 1270 | }, 1271 | "source": [ 1272 | "### Save the model for future use\n", 1273 | "To make it possible to serve the Keras model using [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving), export it in the `SavedModel` format."
1274 | ] 1275 | }, 1276 | { 1277 | "cell_type": "code", 1278 | "metadata": { 1279 | "colab": { 1280 | "base_uri": "https://localhost:8080/" 1281 | }, 1282 | "id": "Fowdeb85j6OS", 1283 | "outputId": "e9426126-6287-4bba-f8b6-e07ef5d53c16" 1284 | }, 1285 | "source": [ 1286 | "import os\n", 1287 | "import tempfile\n", 1288 | "\n", 1289 | "# mkdtemp() keeps the directory until it is removed explicitly, so the next cell can reload the model.\n", 1290 | "tmp_dir = tempfile.mkdtemp()\n", 1291 | "retrieval_model_path = os.path.join(tmp_dir, \"retrieval_model\")\n", 1292 | "\n", 1293 | "scann_layer.save(\n", 1294 | " retrieval_model_path,\n", 1295 | " options=tf.saved_model.SaveOptions(namespace_whitelist=[\"Scann\"])\n", 1296 | ")" 1297 | ], 1298 | "execution_count": 22, 1299 | "outputs": [ 1300 | { 1301 | "output_type": "stream", 1302 | "name": "stderr", 1303 | "text": [ 1304 | "WARNING:tensorflow:Model's `__init__()` arguments contain non-serializable objects. Please implement a `get_config()` method in the subclassed Model for proper saving and loading. Defaulting to empty config.\n", 1305 | "WARNING:tensorflow:Model's `__init__()` arguments contain non-serializable objects. Please implement a `get_config()` method in the subclassed Model for proper saving and loading. Defaulting to empty config.\n", 1306 | "WARNING:tensorflow:Model's `__init__()` arguments contain non-serializable objects. Please implement a `get_config()` method in the subclassed Model for proper saving and loading. Defaulting to empty config.\n", 1307 | "WARNING:tensorflow:Model's `__init__()` arguments contain non-serializable objects. Please implement a `get_config()` method in the subclassed Model for proper saving and loading. 
Defaulting to empty config.\n" 1308 | ] 1309 | } 1310 | ] 1311 | }, 1312 | { 1313 | "cell_type": "code", 1314 | "metadata": { 1315 | "id": "UNET0awMnJ-X", 1316 | "outputId": "ce2eaf05-8e10-44ed-c913-11d1c6e4dc1d", 1317 | "colab": { 1318 | "base_uri": "https://localhost:8080/" 1319 | } 1320 | }, 1321 | "source": [ 1322 | "# Reload the saved model to confirm that it works correctly\n", 1323 | "reloaded_model = tf.keras.models.load_model(retrieval_model_path)\n", 1324 | "afinity_scores, movie_ids = reloaded_model(\n", 1325 | " tf.constant([user_id])\n", 1326 | ")\n", 1327 | "\n", 1328 | "print(f\"Recommendations for user {user_id} using reloaded saved model: {movie_ids[0, :5]}\")" 1329 | ], 1330 | "execution_count": 23, 1331 | "outputs": [ 1332 | { 1333 | "output_type": "stream", 1334 | "name": "stderr", 1335 | "text": [ 1336 | "WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.\n" 1337 | ] 1338 | }, 1339 | { 1340 | "output_type": "stream", 1341 | "name": "stdout", 1342 | "text": [ 1343 | "Recommendations for user 42 using reloaded saved model: [b'3142' b'2574' b'3695' b'4565' b'4442']\n" 1344 | ] 1345 | } 1346 | ] 1347 | }, 1348 | { 1349 | "cell_type": "markdown", 1350 | "metadata": { 1351 | "id": "tnRXsM-3xsGU" 1352 | }, 1353 | "source": [ 1354 | "If the reloaded model works correctly, back it up from the temporary directory to your Google Drive, download it from there, copy it to your server (`retrieval_model/1/saved_model.pb`) and serve it." 1355 | ] 1356 | }, 1357 | { 1358 | "cell_type": "markdown", 1359 | "metadata": { 1360 | "id": "FZznewEYzYh1" 1361 | }, 1362 | "source": [ 1363 | "### Serve the saved model using [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving)\n", 1364 | "\n", 1365 | "We can use the TensorFlow Serving Docker image to run a Docker container that serves the saved model. 
Note that the stock `tensorflow/serving` image doesn't support the `ScaNN` layer, so use the custom `google/tf-serving-scann` image instead.\n", 1366 | "\n", 1367 | "TF Serving uses port 8500 for gRPC and port 8501 for the REST API.\n", 1368 | "\n", 1369 | "```\n", 1370 | "docker run -p 8501:8501 \\\n", 1371 | "--mount type=bind,source=PATH_TO_RETRIEVAL_DIR/retrieval,target=/models/retrieval \\\n", 1372 | "-e MODEL_NAME=retrieval -t google/tf-serving-scann &\n", 1373 | "```\n", 1374 | "\n", 1375 | "Now we can send the `query` as an HTTP request to the server and get the recommended candidates.\n", 1376 | "```\n", 1377 | "curl --location --request POST 'http://SERVER_IP_ADDRESS:8501/v1/models/retrieval:predict' \\\n", 1378 | "--header 'Content-Type: application/json' \\\n", 1379 | "--data-raw '{\n", 1380 | " \"instances\":\n", 1381 | " [\n", 1382 | " \"42\"\n", 1383 | " ]\n", 1384 | "}'\n", 1385 | "```" 1386 | ] 1387 | }, 1388 | { 1389 | "cell_type": "markdown", 1390 | "metadata": { 1391 | "id": "3Ez1GmXvB8L-" 1392 | }, 1393 | "source": [ 1394 | "## Build the Ranking (Scoring) task\n", 1395 | "\n", 1396 | "The `ranking model` takes the outputs of the retrieval model and fine-tunes them to select the best possible handful of recommendations.\n", 1397 | "\n", 1398 | "Its task is to narrow down the set of items the user may be interested in to a shortlist of likely candidates. So the training set should express ***how much the users liked the movies they did watch***. This is a form of `explicit feedback`: given that a user watched a movie, we can tell roughly how much they liked it by looking at the rating they gave.\n", 1399 | "```\n", 1400 | "[\n", 1401 | " (('user1', 'Star Wars'), 4.0),\n", 1402 | " (('user1', 'Harry Potter'), 5.0),\n", 1403 | " ...\n", 1404 | "]\n", 1405 | "```\n", 1406 | "This time the objective is to predict the `user_rating` value. 
So, as in other `regression` problems, we can use `MSE (Mean Squared Error)` as the loss function and `RMSE (Root Mean Squared Error)` as the evaluation metric. The state-of-the-art (`SOTA`) RMSE value for `MovieLens/100k` is `0.909`.\n", 1407 | "\n", 1408 | "The `tfrs.tasks.Ranking` layer takes the predicted ratings and the `ground truth` as input, calculates the metrics and returns the loss value.\n", 1409 | "\n", 1410 | "Ranking models do not face the same efficiency constraints as retrieval models, so we have a bit more freedom in our choice of architecture. A model composed of multiple stacked `Dense` layers is a relatively common architecture for ranking tasks.\n", 1411 | "\n", 1412 | "In most cases, a ranking model can be substantially improved by using more features than just the user and candidate identifiers." 1413 | ] 1414 | }, 1415 | { 1416 | "cell_type": "code", 1417 | "metadata": { 1418 | "id": "TLwaxI00k1xX" 1419 | }, 1420 | "source": [ 1421 | "class RankingModel(tfrs.models.Model):\n", 1422 | " \"\"\"MovieLens ranking model\"\"\"\n", 1423 | "\n", 1424 | " def __init__(self, query_model, candidate_model):\n", 1425 | " super().__init__()\n", 1426 | "\n", 1427 | " self.query_model: tf.keras.Model = query_model\n", 1428 | " self.candidate_model: tf.keras.Model = candidate_model\n", 1429 | " self.rating_model = tf.keras.Sequential(\n", 1430 | " [\n", 1431 | " tf.keras.layers.Dense(256, activation='relu'),\n", 1432 | " tf.keras.layers.Dense(64, activation='relu'),\n", 1433 | " tf.keras.layers.Dense(1)\n", 1434 | " ]\n", 1435 | " )\n", 1436 | " self.ranking_task_layer: tf.keras.layers.Layer = tfrs.tasks.Ranking(\n", 1437 | " loss=tf.keras.losses.MeanSquaredError(),\n", 1438 | " metrics=[\n", 1439 | " tf.keras.metrics.RootMeanSquaredError()\n", 1440 | " ]\n", 1441 | " )\n", 1442 | "\n", 1443 | "\n", 1444 | " def compute_loss(self, features, training=False) -> tf.Tensor:\n", 1445 | " query_embeddings = 
self.query_model(features['user_id'])\n", 1446 | " candidate_embeddings = self.candidate_model(features[\"movie_id\"])\n", 1447 | " rating_predictions = self.rating_model(\n", 1448 | " tf.concat(\n", 1449 | " [query_embeddings, candidate_embeddings],\n", 1450 | " axis=1\n", 1451 | " )\n", 1452 | " # We could use `tf.keras.layers.Concatenate(axis=1)([x, y])`\n", 1453 | " )\n", 1454 | "\n", 1455 | " loss = self.ranking_task_layer(\n", 1456 | " predictions=rating_predictions,\n", 1457 | " labels=features[\"user_rating\"]\n", 1458 | " )\n", 1459 | " return loss" 1460 | ], 1461 | "execution_count": 24, 1462 | "outputs": [] 1463 | }, 1464 | { 1465 | "cell_type": "markdown", 1466 | "metadata": { 1467 | "id": "it4JTHYF1eh-" 1468 | }, 1469 | "source": [ 1470 | "### Fit the ranking model" 1471 | ] 1472 | }, 1473 | { 1474 | "cell_type": "code", 1475 | "metadata": { 1476 | "id": "BijEncJG1iaW" 1477 | }, 1478 | "source": [ 1479 | "movielens_ranking_model = RankingModel(query_model, candidate_model)\n", 1480 | "\n", 1481 | "optimizer_step_size = 0.1\n", 1482 | "movielens_ranking_model.compile(\n", 1483 | " optimizer=tf.keras.optimizers.Adagrad(\n", 1484 | " learning_rate=optimizer_step_size\n", 1485 | " )\n", 1486 | ")" 1487 | ], 1488 | "execution_count": 25, 1489 | "outputs": [] 1490 | }, 1491 | { 1492 | "cell_type": "code", 1493 | "metadata": { 1494 | "colab": { 1495 | "base_uri": "https://localhost:8080/" 1496 | }, 1497 | "id": "JXZykvH6-DvF", 1498 | "outputId": "cf398902-4e5e-4038-d2a3-2d5d3b2370f3" 1499 | }, 1500 | "source": [ 1501 | "ranking_ratings_trainset = ratings_trainset.shuffle(100_000).batch(8192).cache()\n", 1502 | "ranking_ratings_testset = ratings_testset.batch(4096).cache()\n", 1503 | "\n", 1504 | "history = movielens_ranking_model.fit(\n", 1505 | " ranking_ratings_trainset,\n", 1506 | " validation_data=ranking_ratings_testset,\n", 1507 | " validation_freq=1,\n", 1508 | " epochs=5\n", 1509 | ")" 1510 | ], 1511 | "execution_count": 26, 1512 | "outputs": [ 1513 | 
{ 1514 | "output_type": "stream", 1515 | "name": "stdout", 1516 | "text": [ 1517 | "Epoch 1/5\n", 1518 | "10/10 [==============================] - 19s 1s/step - root_mean_squared_error: 2.4954 - loss: 5.6800 - regularization_loss: 0.0000e+00 - total_loss: 5.6800 - val_root_mean_squared_error: 0.9894 - val_loss: 0.9845 - val_regularization_loss: 0.0000e+00 - val_total_loss: 0.9845\n", 1519 | "Epoch 2/5\n", 1520 | "10/10 [==============================] - 1s 91ms/step - root_mean_squared_error: 0.9699 - loss: 0.9353 - regularization_loss: 0.0000e+00 - total_loss: 0.9353 - val_root_mean_squared_error: 0.9536 - val_loss: 0.9175 - val_regularization_loss: 0.0000e+00 - val_total_loss: 0.9175\n", 1521 | "Epoch 3/5\n", 1522 | "10/10 [==============================] - 1s 88ms/step - root_mean_squared_error: 0.9457 - loss: 0.8907 - regularization_loss: 0.0000e+00 - total_loss: 0.8907 - val_root_mean_squared_error: 0.9412 - val_loss: 0.8968 - val_regularization_loss: 0.0000e+00 - val_total_loss: 0.8968\n", 1523 | "Epoch 4/5\n", 1524 | "10/10 [==============================] - 1s 81ms/step - root_mean_squared_error: 0.9343 - loss: 0.8695 - regularization_loss: 0.0000e+00 - total_loss: 0.8695 - val_root_mean_squared_error: 0.9334 - val_loss: 0.8839 - val_regularization_loss: 0.0000e+00 - val_total_loss: 0.8839\n", 1525 | "Epoch 5/5\n", 1526 | "10/10 [==============================] - 1s 76ms/step - root_mean_squared_error: 0.9266 - loss: 0.8554 - regularization_loss: 0.0000e+00 - total_loss: 0.8554 - val_root_mean_squared_error: 0.9277 - val_loss: 0.8744 - val_regularization_loss: 0.0000e+00 - val_total_loss: 0.8744\n" 1527 | ] 1528 | } 1529 | ] 1530 | }, 1531 | { 1532 | "cell_type": "code", 1533 | "metadata": { 1534 | "colab": { 1535 | "base_uri": "https://localhost:8080/", 1536 | "height": 472 1537 | }, 1538 | "id": "1-uNjmoY-gsA", 1539 | "outputId": "8fbd9476-ab44-4a1d-c1ac-b4ed06252b08" 1540 | }, 1541 | "source": [ 1542 | "# Plot changes in model loss during training\n", 
1543 | "plt.plot(history.history[\"loss\"])\n", 1544 | "plt.plot(history.history[\"val_loss\"])\n", 1545 | "plt.title(\"Model losses during training\")\n", 1546 | "plt.xlabel(\"epoch\")\n", 1547 | "plt.ylabel(\"loss\")\n", 1548 | "plt.legend([\"train\", \"test\"], loc=\"upper right\")\n", 1549 | "plt.show()" 1550 | ], 1551 | "execution_count": 27, 1552 | "outputs": [ 1553 | { 1554 | "output_type": "display_data", 1555 | "data": { 1556 | "text/plain": [ 1557 | "
" 1558 | ], 1559 | "image/png": "\n" 1560 | }, 1561 | "metadata": {} 1562 | } 1563 | ] 1564 | }, 1565 | { 1566 | "cell_type": "markdown", 1567 | "metadata": { 1568 | "id": "Haid763z4PL2" 1569 | }, 1570 | "source": [ 1571 | "## Next Step\n", 1572 | "Of course, making a powerful recommender system requires much more effort. You can read these materials to dive in deeper.\n", 1573 | "\n", 1574 | "
\n", 1575 | "\n", 1576 | "[Side Features] https://www.tensorflow.org/recommenders/examples/featurization\n", 1577 | "\n", 1578 | "[Context Features] https://www.tensorflow.org/recommenders/examples/context_features\n", 1579 | "\n", 1580 | "[Deep Recommenders] https://www.tensorflow.org/recommenders/examples/deep_recommenders\n", 1581 | "\n", 1582 | "[Multi-task Recommenders] https://www.tensorflow.org/recommenders/examples/multitask\n", 1583 | "\n", 1584 | "[DCN] https://www.tensorflow.org/recommenders/examples/dcn\n", 1585 | "\n", 1586 | "
\n", 1587 | "\n", 1588 | "Papers:\n", 1589 | "\n", 1590 | "[Two-tower neural network] https://storage.googleapis.com/pub-tools-public-publication-data/pdf/6c8a86c981a62b0126a11896b7f6ae0dae4c3566.pdf\n", 1591 | "\n", 1592 | "[SOTA] https://arxiv.org/pdf/1905.01395v1.pdf\n", 1593 | "\n", 1594 | "[SSL] https://arxiv.org/abs/2007.12865\n", 1595 | "\n", 1596 | "[Multi-task] http://www.jiaqima.com/papers/SNR.pdf\n", 1597 | "\n", 1598 | "
\n", 1599 | "\n", 1600 | "References:\n", 1601 | "\n", 1602 | "[1] https://developers.google.com/machine-learning/recommendation/overview\n", 1603 | "\n", 1604 | "[2] https://www.tensorflow.org/recommenders" 1605 | ] 1606 | }, 1607 | { 1608 | "cell_type": "markdown", 1609 | "metadata": { 1610 | "id": "7TqxYMXOGcrC" 1611 | }, 1612 | "source": [ 1613 | "## Donation\n", 1614 | "Give a ⭐ if this tutorial helped you!\n", 1615 | "\n", 1616 | "https://github.com/xei/recommender-system-tutorial" 1617 | ] 1618 | } 1619 | ] 1620 | } --------------------------------------------------------------------------------