├── .gitignore ├── 01_Data_Preperation_0.ipynb ├── 01_Data_Preperation_Rendered.ipynb ├── 02_Short_Intro_to_Embeddings_with_Tensorflow_0.ipynb ├── 02_Short_Intro_to_Embeddings_with_Tensorflow_Rendered.ipynb ├── 03_Image_Search_Using_AlexNet_0.ipynb ├── 03_Image_Search_Using_AlexNet_Rendered.ipynb ├── 04_Explicit_Feedback_deep_learning_with_Tensorflow_0.ipynb ├── 04_Explicit_Feedback_deep_learning_with_Tensorflow_Rendered.ipynb ├── 05-Implicit_Feedback_with_the_triplet_loss_with_Tensorflow_0.ipynb ├── 05-Implicit_Feedback_with_the_triplet_loss_with_Tensorflow_Rendered.ipynb ├── README.md ├── deep_learning_based_search_and_recommender_system.pdf └── images ├── alexnet_architecture.png ├── rec_archi_1.svg ├── rec_archi_2.svg ├── rec_archi_3.svg ├── rec_archi_implicit_1.svg └── rec_archi_implicit_2.svg /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | data/ 3 | processed/ 4 | saved_models/ 5 | config.yaml 6 | tensorboard/ 7 | *.sh 8 | TODO.txt 9 | run 10 | README_Kubernetes.md 11 | .ipynb_checkpoints/ 12 | Dockerfile.prod 13 | .dockerignore 14 | 15 | 16 | -------------------------------------------------------------------------------- /01_Data_Preperation_0.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Prepare Data" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "If you are using the hosted Jupyter environment, you will already have all raw data in place. Run all the cells to extract all zipped files." 
15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "# import libraries\n", 24 | "import matplotlib.pyplot as plt\n", 25 | "import numpy as np\n", 26 | "import pandas as pd\n", 27 | "import os\n", 28 | "import os.path as op\n", 29 | "from zipfile import ZipFile\n", 30 | "from six.moves import urllib\n", 31 | "import sys\n", 32 | "import shutil\n", 33 | "\n", 34 | "\n", 35 | "\n", 36 | "try:\n", 37 | " from urllib.request import urlretrieve\n", 38 | "except ImportError: # Python 2 compat\n", 39 | " from urllib import urlretrieve\n" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "## Script for Downloading Data " 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "# Modified from the function here:\n", 56 | "# https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/image_retraining/retrain.py\n", 57 | "\n", 58 | "def maybe_download_and_extract(data_url, dest_directory, extract_directory, extract=True, copy_only=False):\n", 59 | " \"\"\"Download a file and optionally extract or copy it.\n", 60 | " If the file does not already exist in dest_directory, this function downloads it\n", 61 | " from data_url; it then unzips it into (or copies it to) extract_directory.\n", 62 | " \"\"\"\n", 63 | " if not os.path.exists(dest_directory):\n", 64 | " os.makedirs(dest_directory)\n", 65 | " filename = data_url.split('/')[-1]\n", 66 | " filepath = os.path.join(dest_directory, filename)\n", 67 | " if not os.path.exists(filepath):\n", 68 | "\n", 69 | " def _progress(count, block_size, total_size):\n", 70 | " sys.stdout.write('\\r>> Downloading %s %.1f%%' %\n", 71 | " (filename,\n", 72 | " float(count * block_size) / float(total_size) * 100.0))\n", 73 | " sys.stdout.flush()\n", 74 | "\n", 75 | " filepath, _ = urllib.request.urlretrieve(data_url,\n", 76 | " filepath,\n", 
77 | " _progress)\n", 78 | " print()\n", 79 | " statinfo = os.stat(filepath)\n", 80 | " print('Successfully downloaded', filename, statinfo.st_size, 'bytes.')\n", 81 | " else:\n", 82 | " print('File Already Exists.')\n", 83 | " \n", 84 | " if extract:\n", 85 | " if not op.exists(extract_directory): \n", 86 | " with ZipFile(filepath, 'r') as z:\n", 87 | " print('Extracting content from {0}'.format(filepath))\n", 88 | " z.extractall(path=extract_directory)\n", 89 | " else:\n", 90 | " print('Folder Already Exists.')\n", 91 | " \n", 92 | " if copy_only:\n", 93 | " if not op.exists(extract_directory):\n", 94 | " shutil.copyfile(filepath, extract_directory)\n", 95 | " print('File Copied.')\n", 96 | " else:\n", 97 | " print('File Already Exists.')\n", 98 | " " 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "# if raw and processed directories are not created, create these directories \n", 108 | "raw_data_directory = op.join(op.curdir, 'data')\n", 109 | "if not os.path.exists(raw_data_directory):\n", 110 | " os.makedirs(raw_data_directory)\n", 111 | " print('data folder created. You are now ready to download raw files.')\n", 112 | "else:\n", 113 | " print('data folder already exists. You are ready to download raw files.')\n", 114 | " \n", 115 | "\n", 116 | "processed_directory = op.join(op.curdir, 'processed')\n", 117 | "if not os.path.exists(processed_directory):\n", 118 | " os.makedirs(processed_directory)\n", 119 | " print('processed folder created. You are now ready to extract raw files.')\n", 120 | "else:\n", 121 | " print('processed folder already exists. 
You are ready to extract raw files.')" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "## Download Movielens-100K dataset" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "# Specify where to download from\n", 138 | "ML_100K_URL = 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'\n", 139 | "# Specify where to save \n", 140 | "dest_directory = op.join(op.curdir, 'data')\n", 141 | "# extract \n", 142 | "extract_directory = op.join(op.curdir, 'processed', 'ml-100k')" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [ 151 | "# Actually download/extract the file!\n", 152 | "maybe_download_and_extract(ML_100K_URL, dest_directory, extract_directory, extract=True)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "## Download Glove Pre-Trained Embeddings" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [ 168 | "# Specify where to download from\n", 169 | "glove_model_url = 'http://nlp.stanford.edu/data/glove.6B.zip'\n", 170 | "# Specify where to save \n", 171 | "dest_directory = op.join(op.curdir, 'data')\n", 172 | "# extract \n", 173 | "extract_directory = op.join(op.curdir, 'processed', 'glove')" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": {}, 180 | "outputs": [], 181 | "source": [ 182 | "# Actually download/extract the file!\n", 183 | "maybe_download_and_extract(glove_model_url, dest_directory, extract_directory, extract=True)" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "## Download AlexNet Pre-Trained Model Weights" 191 | ] 192 | }, 193 | { 194 | 
"cell_type": "code", 195 | "execution_count": null, 196 | "metadata": {}, 197 | "outputs": [], 198 | "source": [ 199 | "# Specify where to download from\n", 200 | "alexnet_weights_URL = 'https://www.cs.toronto.edu/~guerzhoy/tf_alexnet/bvlc_alexnet.npy'\n", 201 | "# Specify where to save \n", 202 | "dest_directory = op.join(op.curdir, 'data')\n", 203 | "# destination file name in processed folder\n", 204 | "extract_directory = op.join(op.curdir, 'processed', 'bvlc_alexnet.npy')" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "# Actually download/extract the file!\n", 214 | "maybe_download_and_extract(alexnet_weights_URL, dest_directory, extract_directory, extract=False, copy_only=True)" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "### Download UT Zappos50K dataset" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "#Specify where to download from\n", 231 | "utzap50k_URL = 'http://vision.cs.utexas.edu/projects/finegrained/utzap50k/ut-zap50k-images-square.zip'\n", 232 | "# Specify where to save \n", 233 | "dest_directory = op.join(op.curdir, 'data')\n", 234 | "# extract \n", 235 | "extract_directory = op.join(op.curdir, 'processed', 'utzap50k')" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": null, 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "# Actually download/extract the file!\n", 245 | "maybe_download_and_extract(utzap50k_URL, dest_directory, extract_directory, extract=True)" 246 | ] 247 | } 248 | ], 249 | "metadata": { 250 | "kernelspec": { 251 | "display_name": "Python 3", 252 | "language": "python", 253 | "name": "python3" 254 | }, 255 | "language_info": { 256 | "codemirror_mode": { 257 | "name": "ipython", 258 | "version": 3 259 | }, 260 | 
"file_extension": ".py", 261 | "mimetype": "text/x-python", 262 | "name": "python", 263 | "nbconvert_exporter": "python", 264 | "pygments_lexer": "ipython3", 265 | "version": "3.6.3" 266 | } 267 | }, 268 | "nbformat": 4, 269 | "nbformat_minor": 2 270 | } 271 | -------------------------------------------------------------------------------- /01_Data_Preperation_Rendered.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Prepare Data" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "If you are using the hosted jupyter environment, You will already have all raw data in place. Run all the cells to extract all zipped files." 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "# import libraries\n", 24 | "import matplotlib.pyplot as plt\n", 25 | "import numpy as np\n", 26 | "import pandas as pd\n", 27 | "import os\n", 28 | "import os.path as op\n", 29 | "from zipfile import ZipFile\n", 30 | "from six.moves import urllib\n", 31 | "import sys\n", 32 | "import shutil\n", 33 | "\n", 34 | "\n", 35 | "\n", 36 | "try:\n", 37 | " from urllib.request import urlretrieve\n", 38 | "except ImportError: # Python 2 compat\n", 39 | " from urllib import urlretrieve\n" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "## Script for Downloading Data " 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 2, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "#Modified function from here\n", 56 | "# https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/image_retraining/retrain.py\n", 57 | "\n", 58 | "def maybe_download_and_extract(data_url, dest_directory, extract_directory, extract=True, copy_only=False):\n", 59 | " \"\"\"Download 
and extract model tar file.\n", 60 | " If the pretrained model we're using doesn't already exist, this function\n", 61 | " downloads it from the TensorFlow.org website and unpacks it into a directory.\n", 62 | " \"\"\"\n", 63 | " if not os.path.exists(dest_directory):\n", 64 | " os.makedirs(dest_directory)\n", 65 | " filename = data_url.split('/')[-1]\n", 66 | " filepath = os.path.join(dest_directory, filename)\n", 67 | " if not os.path.exists(filepath):\n", 68 | "\n", 69 | " def _progress(count, block_size, total_size):\n", 70 | " sys.stdout.write('\\r>> Downloading %s %.1f%%' %\n", 71 | " (filename,\n", 72 | " float(count * block_size) / float(total_size) * 100.0))\n", 73 | " sys.stdout.flush()\n", 74 | "\n", 75 | " filepath, _ = urllib.request.urlretrieve(data_url,\n", 76 | " filepath,\n", 77 | " _progress)\n", 78 | " print()\n", 79 | " statinfo = os.stat(filepath)\n", 80 | " print('Successfully downloaded', filename, statinfo.st_size, 'bytes.')\n", 81 | " else:\n", 82 | " print('File Already Exists.')\n", 83 | " \n", 84 | " if extract:\n", 85 | " if not op.exists(extract_directory): \n", 86 | " with ZipFile(filepath, 'r') as z:\n", 87 | " print('Extracting content from {0}'.format(filepath))\n", 88 | " z.extractall(path=extract_directory)\n", 89 | " else:\n", 90 | " print('Folder Already Exists.')\n", 91 | " \n", 92 | " if copy_only:\n", 93 | " if not op.exists(extract_directory):\n", 94 | " shutil.copyfile(filepath, extract_directory)\n", 95 | " print('File Copied.')\n", 96 | " else:\n", 97 | " print('File Already Exists.')\n", 98 | " " 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 3, 104 | "metadata": {}, 105 | "outputs": [ 106 | { 107 | "name": "stdout", 108 | "output_type": "stream", 109 | "text": [ 110 | "data folder already exists. You are ready to download raw files.\n", 111 | "processed folder already exists. 
You are ready to extract raw files.\n" 112 | ] 113 | } 114 | ], 115 | "source": [ 116 | "# if raw and processed directories are not created, create these directories \n", 117 | "raw_data_directory = op.join(op.curdir, 'data')\n", 118 | "if not os.path.exists(raw_data_directory):\n", 119 | " os.makedirs(raw_data_directory)\n", 120 | " print('data folder created. You are now ready to download raw files.')\n", 121 | "else:\n", 122 | " print('data folder already exists. You are ready to download raw files.')\n", 123 | " \n", 124 | "\n", 125 | "processed_directory = op.join(op.curdir, 'processed')\n", 126 | "if not os.path.exists(processed_directory):\n", 127 | " os.makedirs(processed_directory)\n", 128 | " print('processed folder created. You are now ready to extract raw files.')\n", 129 | "else:\n", 130 | " print('processed folder already exists. You are ready to extract raw files.')" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "## Download Movielens-100K dataset" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 4, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "# Specify where to download from\n", 147 | "ML_100K_URL = 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'\n", 148 | "# Specify where to save \n", 149 | "dest_directory = op.join(op.curdir, 'data')\n", 150 | "# extract \n", 151 | "extract_directory = op.join(op.curdir, 'processed', 'ml-100k')" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 5, 157 | "metadata": {}, 158 | "outputs": [ 159 | { 160 | "name": "stdout", 161 | "output_type": "stream", 162 | "text": [ 163 | "File Already Exists.\n", 164 | "Folder Already Exists.\n" 165 | ] 166 | } 167 | ], 168 | "source": [ 169 | "# Actually download/extract the file!\n", 170 | "maybe_download_and_extract(ML_100K_URL, dest_directory, extract_directory, extract=True)" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 
175 | "metadata": {}, 176 | "source": [ 177 | "## Download Glove Pre-Trained Embeddings" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 6, 183 | "metadata": {}, 184 | "outputs": [], 185 | "source": [ 186 | "# Specify where to download from\n", 187 | "glove_model_url = 'http://nlp.stanford.edu/data/glove.6B.zip'\n", 188 | "# Specify where to save \n", 189 | "dest_directory = op.join(op.curdir, 'data')\n", 190 | "# extract \n", 191 | "extract_directory = op.join(op.curdir, 'processed', 'glove')" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": 7, 197 | "metadata": {}, 198 | "outputs": [ 199 | { 200 | "name": "stdout", 201 | "output_type": "stream", 202 | "text": [ 203 | "File Already Exists.\n", 204 | "Folder Already Exists.\n" 205 | ] 206 | } 207 | ], 208 | "source": [ 209 | "# Actually download/extract the file!\n", 210 | "maybe_download_and_extract(glove_model_url, dest_directory, extract_directory, extract=True)" 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "metadata": {}, 216 | "source": [ 217 | "## Download AlexNet Pre-Trained Model Weights" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": 8, 223 | "metadata": {}, 224 | "outputs": [], 225 | "source": [ 226 | "# Specify where to download from\n", 227 | "alexnet_weights_URL = 'https://www.cs.toronto.edu/~guerzhoy/tf_alexnet/bvlc_alexnet.npy'\n", 228 | "# Specify where to save \n", 229 | "dest_directory = op.join(op.curdir, 'data')\n", 230 | "# destination file name in processed folder\n", 231 | "extract_directory = op.join(op.curdir, 'processed', 'bvlc_alexnet.npy')" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 9, 237 | "metadata": {}, 238 | "outputs": [ 239 | { 240 | "name": "stdout", 241 | "output_type": "stream", 242 | "text": [ 243 | "File Already Exists.\n", 244 | "File Already Exists.\n" 245 | ] 246 | } 247 | ], 248 | "source": [ 249 | "# Actually download/extract the 
file!\n", 250 | "maybe_download_and_extract(alexnet_weights_URL, dest_directory, extract_directory, extract=False, copy_only=True)" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "### Download UT Zappos50K dataset" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": 10, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "#Specify where to download from\n", 267 | "utzap50k_URL = 'http://vision.cs.utexas.edu/projects/finegrained/utzap50k/ut-zap50k-images-square.zip'\n", 268 | "# Specify where to save \n", 269 | "dest_directory = op.join(op.curdir, 'data')\n", 270 | "# extract \n", 271 | "extract_directory = op.join(op.curdir, 'processed', 'utzap50k')" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": 11, 277 | "metadata": {}, 278 | "outputs": [ 279 | { 280 | "name": "stdout", 281 | "output_type": "stream", 282 | "text": [ 283 | "File Already Exists.\n", 284 | "Folder Already Exists.\n" 285 | ] 286 | } 287 | ], 288 | "source": [ 289 | "# Actually download/extract the file!\n", 290 | "maybe_download_and_extract(utzap50k_URL, dest_directory, extract_directory, extract=True)" 291 | ] 292 | } 293 | ], 294 | "metadata": { 295 | "kernelspec": { 296 | "display_name": "Python 3", 297 | "language": "python", 298 | "name": "python3" 299 | }, 300 | "language_info": { 301 | "codemirror_mode": { 302 | "name": "ipython", 303 | "version": 3 304 | }, 305 | "file_extension": ".py", 306 | "mimetype": "text/x-python", 307 | "name": "python", 308 | "nbconvert_exporter": "python", 309 | "pygments_lexer": "ipython3", 310 | "version": "3.6.3" 311 | } 312 | }, 313 | "nbformat": 4, 314 | "nbformat_minor": 2 315 | } 316 | -------------------------------------------------------------------------------- /02_Short_Intro_to_Embeddings_with_Tensorflow_0.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | 
"cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Short Intro To Embeddings with Tensorflow" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Goals\n", 15 | "- Understand Embedding \n", 16 | "- Perform Embedding Lookup using Tensorflow\n", 17 | "- Use Pre-Trained Embedding " 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": null, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "import tensorflow as tf\n", 27 | "import numpy as np\n", 28 | "import os\n", 29 | "print('Tensorflow version : {0}'.format(tf.__version__))" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "### Sample Data" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "embedding_size = 5\n", 46 | "vocabulary_size = 10\n", 47 | "\n", 48 | "# create a sample embedding matrix of size 5 for vocab of size 10\n", 49 | "embedding = np.random.rand(vocabulary_size, embedding_size)\n", 50 | "print(embedding)" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "# create one-hot encoding for one of element in vocabulary\n", 60 | "i = 4\n", 61 | "one_hot = np.zeros(10)\n", 62 | "one_hot[i] = 1.0\n", 63 | "print(one_hot)" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "# embedding vector can be extracted by taking a dot product between the one_hot vector and embedding matrix\n", 73 | "embedding_vector = np.dot(one_hot, embedding)\n", 74 | "print(embedding_vector)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "# cross validate from the embedding matrix\n", 84 | "print(embedding[i])" 85 | ] 86 | }, 87 | { 88 | 
"cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "## Tensorflow Embedding Lookup" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "g = tf.Graph()\n", 101 | "with g.as_default():\n", 102 | " # provide input indices \n", 103 | " x = tf.placeholder(shape=[None], dtype=tf.int32, name='x')\n", 104 | " \n", 105 | " # create a constant initializer\n", 106 | " weights_initializer = tf.constant_initializer(embedding)\n", 107 | " embedding_weights = tf.get_variable(\n", 108 | " name='embedding_weights', \n", 109 | " shape=(vocabulary_size, embedding_size), \n", 110 | " initializer=weights_initializer,\n", 111 | " trainable=False)\n", 112 | " # emebedding Lookup \n", 113 | " embedding_lookup = tf.nn.embedding_lookup(embedding_weights, x)" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "# Getting Single Row\n", 123 | "with tf.Session(graph=g) as sess:\n", 124 | " sess.run(tf.global_variables_initializer())\n", 125 | " print(sess.run(embedding_lookup, feed_dict={x : [4]}))\n" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "# Getting Multiple Rows\n", 135 | "with tf.Session(graph=g) as sess:\n", 136 | " sess.run(tf.global_variables_initializer())\n", 137 | " print(sess.run(embedding_lookup, feed_dict={x : [2,4,6]}))\n", 138 | "\n" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "### Using GloVe Pre-Trained Model " 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": {}, 152 | "outputs": [], 153 | "source": [ 154 | "EMBEDDING_DIMENSION=100 # Available dimensions for 6B data is 50, 100, 200, 300\n", 155 | "glove_weights_file_path = os.path.join('processed','glove', 
'glove.6B.{0}d.txt'.format(EMBEDDING_DIMENSION))\n", 156 | "print('Using the following glove weight file : {0}'.format(glove_weights_file_path))" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "# look at some sample rows\n", 166 | "!head -3 processed/glove/glove.6B.100d.txt" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "glove_weights = []\n", 176 | "word2idx = {}\n", 177 | "vocabulary_size = 40000 # limit vocab to top 40K terms\n", 178 | "vocabulary = []\n", 179 | "\n", 180 | "\n", 181 | "with open(glove_weights_file_path,'r') as file:\n", 182 | " for index, line in enumerate(file):\n", 183 | " values = line.split() # Word and weights separated by space\n", 184 | " word = values[0] # Word is first symbol on each line\n", 185 | " vocabulary.append(word)\n", 186 | " word_weights = np.asarray(values[1:], dtype=np.float32) # Remainder of line is weights for word\n", 187 | " word2idx[word] = index \n", 188 | " glove_weights.append(word_weights)\n", 189 | " \n", 190 | " if index + 1 == vocabulary_size:\n", 191 | " break\n", 192 | "glove_weights = np.asarray(glove_weights, dtype=np.float32)" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "glove_weights.shape" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "words = [\"man\", \"woman\"]\n", 211 | "#words = [\"paris\", \"london\",\"rome\",\"berlin\"]\n", 212 | "words_indices = [word2idx[word] for word in words]\n", 213 | "words_indices" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "g = tf.Graph()\n", 223 | "\n", 224 
| "with g.as_default():\n", 225 | " # provide input indices \n", 226 | " x = tf.placeholder(shape=[None], dtype=tf.int32, name='x')\n", 227 | " \n", 228 | " # create a constant initializer\n", 229 | " weights_initializer = tf.constant_initializer(glove_weights)\n", 230 | " embedding_weights = tf.get_variable(\n", 231 | " name='embedding_weights', \n", 232 | " shape=(vocabulary_size, EMBEDDING_DIMENSION), \n", 233 | " initializer=weights_initializer,\n", 234 | " trainable=False)\n", 235 | " # embedding lookup \n", 236 | " embedding_lookup = tf.nn.embedding_lookup(embedding_weights, x)\n", 237 | " \n", 238 | " # We use cosine similarity: the dot product of L2-normalized embeddings\n", 239 | " norm = tf.sqrt(tf.reduce_sum(tf.square(embedding_weights), 1, keepdims=True))\n", 240 | " normalized_embeddings = embedding_weights / norm\n", 241 | " valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, x)\n", 242 | " similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))\n", 243 | " \n" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": {}, 250 | "outputs": [], 251 | "source": [ 252 | "with tf.Session(graph=g) as sess:\n", 253 | " sess.run(tf.global_variables_initializer())\n", 254 | " result = sess.run(embedding_lookup, feed_dict={x : words_indices})\n", 255 | " sim = sess.run(similarity, feed_dict={x : words_indices})\n", 256 | " print('Shape of Similarity Matrix: {0}'.format(sim.shape))\n", 257 | " for i,word_index in enumerate(words_indices):\n", 258 | " \n", 259 | " top_k = 10 # number of nearest neighbors\n", 260 | " nearest = (-sim[i, :]).argsort()[1:top_k+1]\n", 261 | " log = 'Nearest to {0} :'.format(vocabulary[word_index])\n", 262 | " \n", 263 | " for k in range(top_k):\n", 264 | " \n", 265 | " close_word = vocabulary[nearest[k]]\n", 266 | " log = '{0} {1},'.format(log, close_word)\n", 267 | " print(log)\n" 268 | ] 269 | } 270 | ], 271 | "metadata": { 272 | "kernelspec": { 273 | "display_name": "Python 3", 274 | "language": 
"python", 275 | "name": "python3" 276 | }, 277 | "language_info": { 278 | "codemirror_mode": { 279 | "name": "ipython", 280 | "version": 3 281 | }, 282 | "file_extension": ".py", 283 | "mimetype": "text/x-python", 284 | "name": "python", 285 | "nbconvert_exporter": "python", 286 | "pygments_lexer": "ipython3", 287 | "version": "3.6.3" 288 | } 289 | }, 290 | "nbformat": 4, 291 | "nbformat_minor": 2 292 | } 293 | -------------------------------------------------------------------------------- /02_Short_Intro_to_Embeddings_with_Tensorflow_Rendered.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Short Intro To Embeddings with Tensorflow" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Goals\n", 15 | "- Understand Embedding \n", 16 | "- Perform Embedding Lookup using Tensorflow\n", 17 | "- Use Pre-Trained Embedding " 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": {}, 24 | "outputs": [ 25 | { 26 | "name": "stdout", 27 | "output_type": "stream", 28 | "text": [ 29 | "Tensorflow version : 1.5.0\n" 30 | ] 31 | } 32 | ], 33 | "source": [ 34 | "import tensorflow as tf\n", 35 | "import numpy as np\n", 36 | "import os\n", 37 | "print('Tensorflow version : {0}'.format(tf.__version__))" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "### Sample Data" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 3, 50 | "metadata": {}, 51 | "outputs": [ 52 | { 53 | "name": "stdout", 54 | "output_type": "stream", 55 | "text": [ 56 | "[[0.8474672 0.63517459 0.20856734 0.62158914 0.1706803 ]\n", 57 | " [0.01427033 0.29586182 0.15395211 0.8114454 0.92883602]\n", 58 | " [0.07006209 0.93486385 0.87660798 0.93495023 0.93246372]\n", 59 | " [0.75529191 0.45670366 0.83832113 0.96170286 0.20207429]\n", 60 | " 
[0.60661999 0.79176031 0.84172283 0.90355146 0.82368189]\n", 61 | " [0.9636877 0.76228184 0.55074808 0.30381757 0.82599705]\n", 62 | " [0.34921163 0.17196946 0.79534164 0.39571298 0.43468079]\n", 63 | " [0.8603585 0.87752959 0.3065835 0.02131077 0.3528051 ]\n", 64 | " [0.81300056 0.17322652 0.32041377 0.74049448 0.97602482]\n", 65 | " [0.36918957 0.52890363 0.56712384 0.8195898 0.97569215]]\n" 66 | ] 67 | } 68 | ], 69 | "source": [ 70 | "embedding_size = 5\n", 71 | "vocabulary_size = 10\n", 72 | "\n", 73 | "# create a sample embedding matrix of size 5 for vocab of size 10\n", 74 | "embedding = np.random.rand(vocabulary_size, embedding_size)\n", 75 | "print(embedding)" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 4, 81 | "metadata": {}, 82 | "outputs": [ 83 | { 84 | "name": "stdout", 85 | "output_type": "stream", 86 | "text": [ 87 | "[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]\n" 88 | ] 89 | } 90 | ], 91 | "source": [ 92 | "# create one-hot encoding for one of element in vocabulary\n", 93 | "i = 4\n", 94 | "one_hot = np.zeros(10)\n", 95 | "one_hot[i] = 1.0\n", 96 | "print(one_hot)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 5, 102 | "metadata": {}, 103 | "outputs": [ 104 | { 105 | "name": "stdout", 106 | "output_type": "stream", 107 | "text": [ 108 | "[0.60661999 0.79176031 0.84172283 0.90355146 0.82368189]\n" 109 | ] 110 | } 111 | ], 112 | "source": [ 113 | "# embedding vector can be extracted by taking a dot product between the one_hot vector and embedding matrix\n", 114 | "embedding_vector = np.dot(one_hot, embedding)\n", 115 | "print(embedding_vector)" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 6, 121 | "metadata": {}, 122 | "outputs": [ 123 | { 124 | "name": "stdout", 125 | "output_type": "stream", 126 | "text": [ 127 | "[0.60661999 0.79176031 0.84172283 0.90355146 0.82368189]\n" 128 | ] 129 | } 130 | ], 131 | "source": [ 132 | "# cross validate from the embedding matrix\n", 133 | 
"print(embedding[i])" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "## Tensorflow Embedding Lookup" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 7, 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [ 149 | "g = tf.Graph()\n", 150 | "with g.as_default():\n", 151 | " # provide input indices \n", 152 | " x = tf.placeholder(shape=[None], dtype=tf.int32, name='x')\n", 153 | " \n", 154 | " # create a constant initializer\n", 155 | " weights_initializer = tf.constant_initializer(embedding)\n", 156 | " embedding_weights = tf.get_variable(\n", 157 | " name='embedding_weights', \n", 158 | " shape=(vocabulary_size, embedding_size), \n", 159 | " initializer=weights_initializer,\n", 160 | " trainable=False)\n", 161 | " # emebedding Lookup \n", 162 | " embedding_lookup = tf.nn.embedding_lookup(embedding_weights, x)" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 8, 168 | "metadata": {}, 169 | "outputs": [ 170 | { 171 | "name": "stdout", 172 | "output_type": "stream", 173 | "text": [ 174 | "[[0.60662 0.7917603 0.84172285 0.90355146 0.8236819 ]]\n" 175 | ] 176 | } 177 | ], 178 | "source": [ 179 | "# Getting Single Row\n", 180 | "with tf.Session(graph=g) as sess:\n", 181 | " sess.run(tf.global_variables_initializer())\n", 182 | " print(sess.run(embedding_lookup, feed_dict={x : [4]}))\n" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 9, 188 | "metadata": {}, 189 | "outputs": [ 190 | { 191 | "name": "stdout", 192 | "output_type": "stream", 193 | "text": [ 194 | "[[0.07006209 0.93486387 0.87660795 0.93495023 0.9324637 ]\n", 195 | " [0.60662 0.7917603 0.84172285 0.90355146 0.8236819 ]\n", 196 | " [0.34921163 0.17196946 0.7953417 0.39571297 0.4346808 ]]\n" 197 | ] 198 | } 199 | ], 200 | "source": [ 201 | "# Getting Multiple Rows\n", 202 | "with tf.Session(graph=g) as sess:\n", 203 | " sess.run(tf.global_variables_initializer())\n", 
204 | " print(sess.run(embedding_lookup, feed_dict={x : [2,4,6]}))\n", 205 | "\n" 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "### Using GloVe Pre-Trained Model " 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": 10, 218 | "metadata": {}, 219 | "outputs": [ 220 | { 221 | "name": "stdout", 222 | "output_type": "stream", 223 | "text": [ 224 | "Using the following glove weight file : processed/glove/glove.6B.100d.txt\n" 225 | ] 226 | } 227 | ], 228 | "source": [ 229 | "EMBEDDING_DIMENSION=100 # Available dimensions for 6B data is 50, 100, 200, 300\n", 230 | "glove_weights_file_path = os.path.join('processed','glove', 'glove.6B.{0}d.txt'.format(EMBEDDING_DIMENSION))\n", 231 | "print('Using the following glove weight file : {0}'.format(glove_weights_file_path))" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 11, 237 | "metadata": {}, 238 | "outputs": [ 239 | { 240 | "name": "stdout", 241 | "output_type": "stream", 242 | "text": [ 243 | "the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953 -0.39141 0.3344 -0.57545 0.087459 0.28787 -0.06731 0.30906 -0.26384 -0.13231 -0.20757 0.33395 -0.33848 -0.31743 -0.48336 0.1464 -0.37304 0.34577 0.052041 0.44946 -0.46971 0.02628 -0.54155 -0.15518 -0.14107 -0.039722 0.28277 0.14393 0.23464 -0.31021 0.086173 0.20397 0.52624 0.17164 -0.082378 -0.71787 -0.41531 0.20335 -0.12763 0.41367 0.55187 0.57908 -0.33477 -0.36559 -0.54857 -0.062892 0.26584 0.30205 0.99775 -0.80481 -3.0243 0.01254 -0.36942 2.2167 0.72201 -0.24978 0.92136 0.034514 0.46745 1.1079 -0.19358 -0.074575 0.23353 -0.052062 -0.22044 0.057162 -0.15806 -0.30798 -0.41625 0.37972 0.15006 -0.53212 -0.2055 -1.2526 0.071624 0.70565 0.49744 -0.42063 0.26148 -1.538 -0.30223 -0.073438 -0.28312 0.37104 -0.25217 0.016215 -0.017099 -0.38984 0.87424 -0.72569 -0.51058 -0.52028 -0.1459 0.8278 0.27062\n", 244 | ", -0.10767 0.11053 0.59812 -0.54361 0.67396 0.10663 0.038867 
0.35481 0.06351 -0.094189 0.15786 -0.81665 0.14172 0.21939 0.58505 -0.52158 0.22783 -0.16642 -0.68228 0.3587 0.42568 0.19021 0.91963 0.57555 0.46185 0.42363 -0.095399 -0.42749 -0.16567 -0.056842 -0.29595 0.26037 -0.26606 -0.070404 -0.27662 0.15821 0.69825 0.43081 0.27952 -0.45437 -0.33801 -0.58184 0.22364 -0.5778 -0.26862 -0.20425 0.56394 -0.58524 -0.14365 -0.64218 0.0054697 -0.35248 0.16162 1.1796 -0.47674 -2.7553 -0.1321 -0.047729 1.0655 1.1034 -0.2208 0.18669 0.13177 0.15117 0.7131 -0.35215 0.91348 0.61783 0.70992 0.23955 -0.14571 -0.37859 -0.045959 -0.47368 0.2385 0.20536 -0.18996 0.32507 -1.1112 -0.36341 0.98679 -0.084776 -0.54008 0.11726 -1.0194 -0.24424 0.12771 0.013884 0.080374 -0.35414 0.34951 -0.7226 0.37549 0.4441 -0.99059 0.61214 -0.35111 -0.83155 0.45293 0.082577\n", 245 | ". -0.33979 0.20941 0.46348 -0.64792 -0.38377 0.038034 0.17127 0.15978 0.46619 -0.019169 0.41479 -0.34349 0.26872 0.04464 0.42131 -0.41032 0.15459 0.022239 -0.64653 0.25256 0.043136 -0.19445 0.46516 0.45651 0.68588 0.091295 0.21875 -0.70351 0.16785 -0.35079 -0.12634 0.66384 -0.2582 0.036542 -0.13605 0.40253 0.14289 0.38132 -0.12283 -0.45886 -0.25282 -0.30432 -0.11215 -0.26182 -0.22482 -0.44554 0.2991 -0.85612 -0.14503 -0.49086 0.0082973 -0.17491 0.27524 1.4401 -0.21239 -2.8435 -0.27958 -0.45722 1.6386 0.78808 -0.55262 0.65 0.086426 0.39012 1.0632 -0.35379 0.48328 0.346 0.84174 0.098707 -0.24213 -0.27053 0.045287 -0.40147 0.11395 0.0062226 0.036673 0.018518 -1.0213 -0.20806 0.64072 -0.068763 -0.58635 0.33476 -1.1432 -0.1148 -0.25091 -0.45907 -0.096819 -0.17946 -0.063351 -0.67412 -0.068895 0.53604 -0.87773 0.31802 -0.39242 -0.23394 0.47298 -0.028803\n" 246 | ] 247 | } 248 | ], 249 | "source": [ 250 | "# look at some sample rows\n", 251 | "!head -3 processed/glove/glove.6B.100d.txt" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": 12, 257 | "metadata": {}, 258 | "outputs": [], 259 | "source": [ 260 | "glove_weights = []\n", 261 | "word2idx = {}\n", 262 | 
"vocabulary_size = 40000 # limit vocab to top 40K terms\n", 263 | "vocabulary = []\n", 264 | "\n", 265 | "\n", 266 | "with open(glove_weights_file_path,'r') as file:\n", 267 | " for index, line in enumerate(file):\n", 268 | " values = line.split() # Word and weights separated by space\n", 269 | " word = values[0] # Word is first symbol on each line\n", 270 | " vocabulary.append(word)\n", 271 | " word_weights = np.asarray(values[1:], dtype=np.float32) # Remainder of line is weights for word\n", 272 | " word2idx[word] = index \n", 273 | " glove_weights.append(word_weights)\n", 274 | " \n", 275 | " if index + 1 == vocabulary_size:\n", 276 | " break\n", 277 | "glove_weights = np.asarray(glove_weights, dtype=np.float32)" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 13, 283 | "metadata": {}, 284 | "outputs": [ 285 | { 286 | "data": { 287 | "text/plain": [ 288 | "(40000, 100)" 289 | ] 290 | }, 291 | "execution_count": 13, 292 | "metadata": {}, 293 | "output_type": "execute_result" 294 | } 295 | ], 296 | "source": [ 297 | "glove_weights.shape" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 14, 303 | "metadata": {}, 304 | "outputs": [ 305 | { 306 | "data": { 307 | "text/plain": [ 308 | "[300, 787]" 309 | ] 310 | }, 311 | "execution_count": 14, 312 | "metadata": {}, 313 | "output_type": "execute_result" 314 | } 315 | ], 316 | "source": [ 317 | "words = [\"man\", \"woman\"]\n", 318 | "#words = [\"paris\", \"london\",\"rome\",\"berlin\"]\n", 319 | "words_indices = [word2idx[word] for word in words]\n", 320 | "words_indices" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": 15, 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [ 329 | "g = tf.Graph()\n", 330 | "\n", 331 | "with g.as_default():\n", 332 | " # provide input indices \n", 333 | " x = tf.placeholder(shape=[None], dtype=tf.int32, name='x')\n", 334 | " \n", 335 | " # create a constant initializer\n", 336 | " 
weights_initializer = tf.constant_initializer(glove_weights)\n", 337 | " embedding_weights = tf.get_variable(\n", 338 | " name='embedding_weights', \n", 339 | " shape=(vocabulary_size, EMBEDDING_DIMENSION), \n", 340 | " initializer=weights_initializer,\n", 341 | " trainable=False)\n", 342 | " # embedding lookup \n", 343 | " embedding_lookup = tf.nn.embedding_lookup(embedding_weights, x)\n", 344 | " \n", 345 | " # We use cosine similarity (dot product of L2-normalized embeddings):\n", 346 | " norm = tf.sqrt(tf.reduce_sum(tf.square(embedding_weights), 1, keepdims=True))\n", 347 | " normalized_embeddings = embedding_weights / norm\n", 348 | " valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, x)\n", 349 | " similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))\n", 350 | " \n" 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": 16, 356 | "metadata": {}, 357 | "outputs": [ 358 | { 359 | "name": "stdout", 360 | "output_type": "stream", 361 | "text": [ 362 | "Shape of Similarity Matrix: (2, 40000)\n", 363 | "Nearest to man : woman, boy, one, person, another, old, life, father, turned, who,\n", 364 | "Nearest to woman : girl, man, mother, boy, she, child, wife, her, herself, daughter,\n" 365 | ] 366 | } 367 | ], 368 | "source": [ 369 | "with tf.Session(graph=g) as sess:\n", 370 | " sess.run(tf.global_variables_initializer())\n", 371 | " result = sess.run(embedding_lookup, feed_dict={x : words_indices})\n", 372 | " sim = sess.run(similarity, feed_dict={x : words_indices})\n", 373 | " print('Shape of Similarity Matrix: {0}'.format(sim.shape))\n", 374 | " for i,word_index in enumerate(words_indices):\n", 375 | " \n", 376 | " top_k = 10 # number of nearest neighbors\n", 377 | " nearest = (-sim[i, :]).argsort()[1:top_k+1]\n", 378 | " log = 'Nearest to {0} :'.format(vocabulary[word_index])\n", 379 | " \n", 380 | " for k in range(top_k):\n", 381 | " \n", 382 | " close_word = vocabulary[nearest[k]]\n", 383 | " log = '{0} {1},'.format(log, close_word)\n", 
384 | " print(log)\n" 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": null, 390 | "metadata": {}, 391 | "outputs": [], 392 | "source": [] 393 | } 394 | ], 395 | "metadata": { 396 | "kernelspec": { 397 | "display_name": "Python 3", 398 | "language": "python", 399 | "name": "python3" 400 | }, 401 | "language_info": { 402 | "codemirror_mode": { 403 | "name": "ipython", 404 | "version": 3 405 | }, 406 | "file_extension": ".py", 407 | "mimetype": "text/x-python", 408 | "name": "python", 409 | "nbconvert_exporter": "python", 410 | "pygments_lexer": "ipython3", 411 | "version": "3.6.3" 412 | } 413 | }, 414 | "nbformat": 4, 415 | "nbformat_minor": 2 416 | } 417 | -------------------------------------------------------------------------------- /03_Image_Search_Using_AlexNet_0.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Image Retrieval using AlexNet " 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Goals :\n", 15 | "- Create AlexNet architecture using Pre-Trained Weights\n", 16 | "- Create Image Embeddings using Transfer Learning \n", 17 | "- Create Nearest Neighbour Algorithm \n", 18 | "- Given a query image, retrieve similar images " 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "metadata": { 25 | "collapsed": true 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "import tensorflow as tf\n", 30 | "import numpy as np\n", 31 | "import os\n", 32 | "import os.path as op\n", 33 | "import random\n", 34 | "from scipy import ndimage\n", 35 | "from glob import glob\n", 36 | "import matplotlib.pyplot as plt\n", 37 | "from sklearn.neighbors import NearestNeighbors\n", 38 | "\n", 39 | "%matplotlib inline\n", 40 | "print('Tensorflow version : {0}'.format(tf.__version__))" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | 
"metadata": {}, 46 | "source": [ 47 | "#### AlexNet Architecture" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "![AlexNet](./images/alexnet_architecture.png)" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | " Weights from Michael Guerzhoy and Davi Frossard\n", 62 | " [http://www.cs.toronto.edu/~guerzhoy/tf_alexnet/]" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "### Import Pre-Trained Variables" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "# Path to the numpy weights\n", 79 | "alexnet_path = op.join(op.curdir, 'processed','bvlc_alexnet.npy')" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": { 86 | "collapsed": false 87 | }, 88 | "outputs": [], 89 | "source": [ 90 | "# read weights in numpy\n", 91 | "variable_data = np.load(alexnet_path, encoding='bytes').item()" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "# variable_data is a dictionary\n", 101 | "type(variable_data)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "# keys\n", 111 | "variable_data.keys()" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "# Convolution layer - 1 weights\n", 121 | "conv1_preW = variable_data[\"conv1\"][0]\n", 122 | "conv1_preb = variable_data[\"conv1\"][1]\n", 123 | "print(conv1_preW.shape)\n", 124 | "print(conv1_preb.shape)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": {}, 131 | "outputs": [], 132 | "source": [ 133 | "# Convolution layer - 2 
weights\n", 134 | "conv2_preW = variable_data[\"conv2\"][0]\n", 135 | "conv2_preb = variable_data[\"conv2\"][1]\n", 136 | "print(conv2_preW.shape)\n", 137 | "print(conv2_preb.shape)" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "# Convolution layer-3 weights\n", 147 | "conv3_preW = variable_data[\"conv3\"][0]\n", 148 | "conv3_preb = variable_data[\"conv3\"][1]\n", 149 | "print(conv3_preW.shape)\n", 150 | "print(conv3_preb.shape)" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "# Convolution layer-4 weights\n", 160 | "conv4_preW = variable_data[\"conv4\"][0]\n", 161 | "conv4_preb = variable_data[\"conv4\"][1]\n", 162 | "print(conv4_preW.shape)\n", 163 | "print(conv4_preb.shape)" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [ 172 | "# Convolution layer-5 weights\n", 173 | "conv5_preW = variable_data[\"conv5\"][0]\n", 174 | "conv5_preb = variable_data[\"conv5\"][1]\n", 175 | "print(conv5_preW.shape)\n", 176 | "print(conv5_preb.shape)" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "# Fully connected layer - 1\n", 186 | "fc6_preW = variable_data[\"fc6\"][0]\n", 187 | "fc6_preb = variable_data[\"fc6\"][1]\n", 188 | "print(fc6_preW.shape)\n", 189 | "print(fc6_preb.shape)" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": null, 195 | "metadata": {}, 196 | "outputs": [], 197 | "source": [ 198 | "# Fully connected layer -2 \n", 199 | "fc7_preW = variable_data[\"fc7\"][0]\n", 200 | "fc7_preb = variable_data[\"fc7\"][1]\n", 201 | "print(fc7_preW.shape)\n", 202 | "print(fc7_preb.shape)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | 
"execution_count": null, 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "# Fully connected layer - 3\n", 212 | "fc8_preW = variable_data[\"fc8\"][0]\n", 213 | "fc8_preb = variable_data[\"fc8\"][1]\n", 214 | "print(fc8_preW.shape)\n", 215 | "print(fc8_preb.shape)" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "### Create the AlexNet Network" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": null, 228 | "metadata": { 229 | "collapsed": false 230 | }, 231 | "outputs": [], 232 | "source": [ 233 | "pixel_depth = 255.0\n", 234 | "resized_height = 227\n", 235 | "resized_width = 227\n", 236 | "num_channels = 3\n", 237 | "\n", 238 | "graph = tf.Graph()\n", 239 | "\n", 240 | "with graph.as_default():\n", 241 | " x = tf.placeholder(tf.uint8, [None, None, None, num_channels],\n", 242 | " name='input')\n", 243 | " \n", 244 | " to_float = tf.cast(x, tf.float32)\n", 245 | " resized = tf.image.resize_images(to_float, (resized_height, resized_width))\n", 246 | " \n", 247 | " # Convolution 1\n", 248 | " with tf.name_scope('conv1') as scope:\n", 249 | " kernel = tf.Variable(conv1_preW, name='weights')\n", 250 | " biases = tf.Variable(conv1_preb, name='biases')\n", 251 | " conv = tf.nn.conv2d(resized, kernel, [1, 4, 4, 1], padding=\"SAME\")\n", 252 | " bias = tf.nn.bias_add(conv, biases)\n", 253 | " conv1 = tf.nn.relu(bias, name=scope)\n", 254 | "\n", 255 | " # Local response normalization 2\n", 256 | " radius = 2\n", 257 | " alpha = 2e-05\n", 258 | " beta = 0.75\n", 259 | " bias = 1.0\n", 260 | " lrn1 = tf.nn.local_response_normalization(conv1,\n", 261 | " depth_radius=radius,\n", 262 | " alpha=alpha,\n", 263 | " beta=beta,\n", 264 | " bias=bias)\n", 265 | "\n", 266 | " # Maxpool 1\n", 267 | " pool1 = tf.nn.max_pool(lrn1,\n", 268 | " ksize=[1, 3, 3, 1],\n", 269 | " strides=[1, 2, 2, 1],\n", 270 | " padding='VALID',\n", 271 | " name='pool1')\n", 272 | "\n", 273 | " # Convolution 
2\n", 274 | " with tf.name_scope('conv2') as scope:\n", 275 | "\n", 276 | " kernel = tf.Variable(conv2_preW, name='weights')\n", 277 | " biases = tf.Variable(conv2_preb, name='biases')\n", 278 | "\n", 279 | " input_a, input_b = tf.split(axis=3,\n", 280 | " num_or_size_splits=2,\n", 281 | " value=pool1)\n", 282 | " kernel_a, kernel_b = tf.split(axis=3,\n", 283 | " num_or_size_splits=2,\n", 284 | " value=kernel)\n", 285 | "\n", 286 | " with tf.name_scope('A'):\n", 287 | " conv_a = tf.nn.conv2d(input_a, kernel_a, [1, 1, 1, 1], padding=\"SAME\") \n", 288 | "\n", 289 | " with tf.name_scope('B'):\n", 290 | " conv_b = tf.nn.conv2d(input_b, kernel_b, [1, 1, 1, 1], padding=\"SAME\")\n", 291 | "\n", 292 | " conv = tf.concat(values=[conv_a, conv_b], axis=3)\n", 293 | " bias = tf.nn.bias_add(conv, biases)\n", 294 | " conv2 = tf.nn.relu(bias, name=scope)\n", 295 | "\n", 296 | " # Local response normalization 2\n", 297 | " radius = 2\n", 298 | " alpha = 2e-05\n", 299 | " beta = 0.75\n", 300 | " bias = 1.0\n", 301 | " lrn2 = tf.nn.local_response_normalization(conv2,\n", 302 | " depth_radius=radius,\n", 303 | " alpha=alpha,\n", 304 | " beta=beta,\n", 305 | " bias=bias)\n", 306 | "\n", 307 | " # Maxpool 2\n", 308 | " pool2 = tf.nn.max_pool(lrn2,\n", 309 | " ksize=[1, 3, 3, 1],\n", 310 | " strides=[1, 2, 2, 1],\n", 311 | " padding='VALID',\n", 312 | " name='pool2')\n", 313 | "\n", 314 | " with tf.name_scope('conv3') as scope:\n", 315 | " kernel = tf.Variable(conv3_preW, name='weights')\n", 316 | " biases = tf.Variable(conv3_preb, name='biases')\n", 317 | " conv = tf.nn.conv2d(pool2, kernel, [1, 1, 1, 1], padding=\"SAME\")\n", 318 | " bias = tf.nn.bias_add(conv, biases)\n", 319 | " conv3 = tf.nn.relu(bias, name=scope)\n", 320 | "\n", 321 | "\n", 322 | " with tf.name_scope('conv4') as scope:\n", 323 | "\n", 324 | " kernel = tf.Variable(conv4_preW, name='weights')\n", 325 | " biases = tf.Variable(conv4_preb, name='biases')\n", 326 | "\n", 327 | " input_a, input_b = tf.split(axis=3,\n", 
328 | " num_or_size_splits=2,\n", 329 | " value=conv3)\n", 330 | " kernel_a, kernel_b = tf.split(axis=3,\n", 331 | " num_or_size_splits=2,\n", 332 | " value=kernel)\n", 333 | "\n", 334 | " with tf.name_scope('A'):\n", 335 | " conv_a = tf.nn.conv2d(input_a, kernel_a, [1, 1, 1, 1], padding=\"SAME\") \n", 336 | "\n", 337 | " with tf.name_scope('B'):\n", 338 | " conv_b = tf.nn.conv2d(input_b, kernel_b, [1, 1, 1, 1], padding=\"SAME\")\n", 339 | "\n", 340 | " conv = tf.concat(values=[conv_a, conv_b], axis=3)\n", 341 | " bias = tf.nn.bias_add(conv, biases)\n", 342 | " conv4 = tf.nn.relu(bias, name=scope)\n", 343 | "\n", 344 | "\n", 345 | " with tf.name_scope('conv5') as scope:\n", 346 | "\n", 347 | " kernel = tf.Variable(conv5_preW, name='weights')\n", 348 | " biases = tf.Variable(conv5_preb, name='biases')\n", 349 | "\n", 350 | " input_a, input_b = tf.split(axis=3,\n", 351 | " num_or_size_splits=2,\n", 352 | " value=conv4)\n", 353 | " kernel_a, kernel_b = tf.split(axis=3,\n", 354 | " num_or_size_splits=2,\n", 355 | " value=kernel)\n", 356 | "\n", 357 | " with tf.name_scope('A'):\n", 358 | " conv_a = tf.nn.conv2d(input_a, kernel_a, [1, 1, 1, 1], padding=\"SAME\") \n", 359 | "\n", 360 | " with tf.name_scope('B'):\n", 361 | " conv_b = tf.nn.conv2d(input_b, kernel_b, [1, 1, 1, 1], padding=\"SAME\")\n", 362 | "\n", 363 | " conv = tf.concat(values=[conv_a, conv_b],axis=3)\n", 364 | " bias = tf.nn.bias_add(conv, biases)\n", 365 | " conv5 = tf.nn.relu(bias, name=scope)\n", 366 | "\n", 367 | "\n", 368 | " # Maxpool 5\n", 369 | " pool5 = tf.nn.max_pool(conv5,\n", 370 | " ksize=[1, 3, 3, 1],\n", 371 | " strides=[1, 2, 2, 1],\n", 372 | " padding='VALID',\n", 373 | " name='pool5')\n", 374 | "\n", 375 | " # Fully connected 6\n", 376 | " with tf.name_scope('fc6'):\n", 377 | " weights = tf.Variable(fc6_preW, name='fc6_weights')\n", 378 | " bias = tf.Variable(fc6_preb, name='fc6_bias')\n", 379 | " shape = tf.shape(pool5)\n", 380 | " size = shape[1] * shape[2] * shape[3]\n", 381 | " fc6 = 
tf.nn.relu_layer(tf.reshape(pool5, [-1, size]),\n", 382 | " weights, bias, name='relu')\n", 383 | "\n", 384 | " # Fully connected 7\n", 385 | " with tf.name_scope('fc7'):\n", 386 | " weights = tf.Variable(fc7_preW, name='weights')\n", 387 | " bias = tf.Variable(fc7_preb, name='bias')\n", 388 | " fc7 = tf.nn.relu_layer(fc6, weights, bias, name='relu')\n", 389 | "\n", 390 | " # Fully connected 8\n", 391 | " with tf.name_scope('fc8'):\n", 392 | " weights = tf.Variable(fc8_preW, name='weights')\n", 393 | " bias = tf.Variable(fc8_preb, name='bias')\n", 394 | " # fc8 = tf.matmul(fc7, weights) + bias\n", 395 | " fc8 = tf.nn.xw_plus_b(fc7, weights, bias)\n", 396 | "\n", 397 | " softmax = tf.nn.softmax(fc8)\n", 398 | "\n", 399 | " init = tf.global_variables_initializer()" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": null, 405 | "metadata": { 406 | "collapsed": false 407 | }, 408 | "outputs": [], 409 | "source": [ 410 | "# visualize alexnet graph\n", 411 | "sess = tf.Session(graph=graph)\n", 412 | "sess.run(init)\n", 413 | "\n", 414 | "writer = tf.summary.FileWriter('tensorboard/alexnet', graph=graph)\n", 415 | "writer.close()" 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": null, 421 | "metadata": {}, 422 | "outputs": [], 423 | "source": [ 424 | "#!tensorboard --logdir='tensorboard/alexnet'" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "### Export AlexNet Model ( Graph + Weights )" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": null, 437 | "metadata": { 438 | "collapsed": false 439 | }, 440 | "outputs": [], 441 | "source": [ 442 | "with graph.as_default():\n", 443 | " saver = tf.train.Saver()\n", 444 | " alex_vars_path = op.join(op.curdir, 'saved_models','alex_vars')\n", 445 | " save_path = saver.save(sess, alex_vars_path)" 446 | ] 447 | }, 448 | { 449 | "cell_type": "markdown", 450 | "metadata": {}, 451 | "source": [ 452 | "## 
Import Graph and Weights" 453 | ] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "execution_count": null, 458 | "metadata": { 459 | "collapsed": false 460 | }, 461 | "outputs": [], 462 | "source": [ 463 | "alex_vars_path = op.join(op.curdir, 'saved_models','alex_vars')\n", 464 | "alex_meta_file_path = op.join(op.curdir, 'saved_models','alex_vars.meta')\n", 465 | "graph = tf.Graph()\n", 466 | "with graph.as_default(): \n", 467 | " importer = tf.train.import_meta_graph(alex_meta_file_path)\n", 468 | "\n", 469 | "sess = tf.Session(graph=graph)\n", 470 | "importer.restore(sess, alex_vars_path)" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": null, 476 | "metadata": {}, 477 | "outputs": [], 478 | "source": [ 479 | "graph.get_operations()" 480 | ] 481 | }, 482 | { 483 | "cell_type": "code", 484 | "execution_count": null, 485 | "metadata": {}, 486 | "outputs": [], 487 | "source": [ 488 | "# extract fc7 output\n", 489 | "fc7_op = graph.get_operation_by_name('fc7/relu')\n", 490 | "fc7 = fc7_op.outputs[0]\n", 491 | "x = graph.get_operation_by_name('input').outputs[0]\n", 492 | "init = graph.get_operation_by_name('init')\n", 493 | "\n", 494 | "sess = tf.Session(graph=graph)\n", 495 | "sess.run(init)" 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": null, 501 | "metadata": {}, 502 | "outputs": [], 503 | "source": [ 504 | "image_embedding_size = fc7.get_shape()[1]\n", 505 | "print('Image embedding size : {0}'.format(image_embedding_size))\n" 506 | ] 507 | }, 508 | { 509 | "cell_type": "markdown", 510 | "metadata": {}, 511 | "source": [ 512 | "### Perform Image Embedding" 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": null, 518 | "metadata": {}, 519 | "outputs": [], 520 | "source": [ 521 | "# UTZap50K\n", 522 | "UTZap50K_directory = op.join(op.curdir, 'processed','utzap50k')\n", 523 | "\n", 524 | "all_files = [y for x in os.walk(UTZap50K_directory) for y in glob(os.path.join(x[0], 
'*.jpg'))]\n", 525 | "len(all_files)" 526 | ] 527 | }, 528 | { 529 | "cell_type": "code", 530 | "execution_count": null, 531 | "metadata": {}, 532 | "outputs": [], 533 | "source": [ 534 | "# sample paths\n", 535 | "all_files[:5]" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": null, 541 | "metadata": {}, 542 | "outputs": [], 543 | "source": [ 544 | "# show a sample image\n", 545 | "image = ndimage.imread(all_files[0])\n", 546 | "print('image shape : {0}'.format(image.shape))\n", 547 | "plt.imshow(image)\n", 548 | "plt.show()" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": null, 554 | "metadata": {}, 555 | "outputs": [], 556 | "source": [ 557 | "# Number of images to build Nearest Neighbors Model\n", 558 | "random.shuffle(all_files)\n", 559 | "num_images = 2000 # increase the number of images to increase the image database\n", 560 | "neighbor_list = all_files[:num_images]" 561 | ] 562 | }, 563 | { 564 | "cell_type": "code", 565 | "execution_count": null, 566 | "metadata": {}, 567 | "outputs": [], 568 | "source": [ 569 | "# create empty array with shape : num_images * image_embedding_size\n", 570 | "image_embeddings = np.ndarray((num_images, image_embedding_size))" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": null, 576 | "metadata": {}, 577 | "outputs": [], 578 | "source": [ 579 | "# create image embedding for each image \n", 580 | "for i, filename in enumerate(neighbor_list):\n", 581 | " image = ndimage.imread(filename)\n", 582 | " features = sess.run(fc7, feed_dict={x: [image]})\n", 583 | " image_embeddings[i:i+1] = features\n", 584 | " if i % 250 == 0:\n", 585 | " print(i)" 586 | ] 587 | }, 588 | { 589 | "cell_type": "code", 590 | "execution_count": null, 591 | "metadata": {}, 592 | "outputs": [], 593 | "source": [ 594 | "# shape of image embeddings\n", 595 | "image_embeddings.shape" 596 | ] 597 | }, 598 | { 599 | "cell_type": "markdown", 600 | "metadata": {}, 601 | 
"source": [ 602 | "### Find Similar Images " 603 | ] 604 | }, 605 | { 606 | "cell_type": "code", 607 | "execution_count": null, 608 | "metadata": {}, 609 | "outputs": [], 610 | "source": [ 611 | "g = tf.Graph()\n", 612 | "NUM_IMAGES = len(neighbor_list)\n", 613 | "EMBEDDING_DIMENSION = 4096\n", 614 | "\n", 615 | "with g.as_default():\n", 616 | " # provide input indices \n", 617 | " x = tf.placeholder(shape=[None], dtype=tf.int32, name='x')\n", 618 | " \n", 619 | " # create a constant initializer\n", 620 | " weights_initializer = tf.constant_initializer(image_embeddings)\n", 621 | " embedding_weights = tf.get_variable(\n", 622 | " name='embedding_weights', \n", 623 | " shape=(NUM_IMAGES, EMBEDDING_DIMENSION), \n", 624 | " initializer=weights_initializer,\n", 625 | " trainable=False)\n", 626 | " # emebedding Lookup \n", 627 | " embedding_lookup = tf.nn.embedding_lookup(embedding_weights, x)\n", 628 | " \n", 629 | " # We use the cosine distance:\n", 630 | " norm = tf.sqrt(tf.reduce_sum(tf.square(embedding_weights), 1, keepdims=True))\n", 631 | " normalized_embeddings = embedding_weights / norm\n", 632 | " valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, x)\n", 633 | " similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))\n", 634 | " " 635 | ] 636 | }, 637 | { 638 | "cell_type": "code", 639 | "execution_count": null, 640 | "metadata": {}, 641 | "outputs": [], 642 | "source": [ 643 | "NUM_QUERY_IMAGES = 10\n", 644 | "NUM_NEIGHBORS = 5\n", 645 | "query_indices = np.random.choice(range(len(neighbor_list)), NUM_QUERY_IMAGES, replace=False)\n", 646 | "\n", 647 | "with tf.Session(graph=g) as sess:\n", 648 | " sess.run(tf.global_variables_initializer())\n", 649 | " sim = sess.run(similarity, feed_dict={x : query_indices})\n", 650 | " \n", 651 | " f,ax = plt.subplots(NUM_QUERY_IMAGES,NUM_NEIGHBORS, sharex=True, sharey=True, figsize=(14,20))\n", 652 | " \n", 653 | " print('Shape of Similarity Matrix: {0}'.format(sim.shape))\n", 654 | " 
for i,image_index in enumerate(query_indices):\n", 655 | " \n", 656 | " top_k = NUM_NEIGHBORS # number of nearest neighbors\n", 657 | " nearest = (-sim[i, :]).argsort()[1:top_k+1]\n", 658 | " log = 'Nearest to {0} :'.format(neighbor_list[image_index])\n", 659 | " \n", 660 | " for k in range(top_k):\n", 661 | " close_image = ndimage.imread(neighbor_list[nearest[k]])\n", 662 | " ax[i,k].imshow(close_image)\n", 663 | " plt.show()\n", 664 | "\n" 665 | ] 666 | }, 667 | { 668 | "cell_type": "markdown", 669 | "metadata": {}, 670 | "source": [ 671 | "## Further Extension\n", 672 | "\n", 673 | "**Approximate Nearest Neighbor Search** \n", 674 | "\n", 675 | "Currently we compare each query image with every other image in the database, which is not recommended for production scenarios. Use techniques such as Approximate Nearest Neighbor search to accelerate the process.\n", 676 | "\n", 677 | "- You can explore the ANNOY package by Spotify [ https://github.com/spotify/annoy ]\n", 678 | "- You can also explore LSH ( Locality-Sensitive Hashing ) [ https://graphics.stanford.edu/courses/cs468-06-fall/Slides/aneesh-michael.pdf ]" 679 | ] 680 | } 681 | ], 682 | "metadata": { 683 | "kernelspec": { 684 | "display_name": "Python 3", 685 | "language": "python", 686 | "name": "python3" 687 | }, 688 | "language_info": { 689 | "codemirror_mode": { 690 | "name": "ipython", 691 | "version": 3 692 | }, 693 | "file_extension": ".py", 694 | "mimetype": "text/x-python", 695 | "name": "python", 696 | "nbconvert_exporter": "python", 697 | "pygments_lexer": "ipython3", 698 | "version": "3.6.3" 699 | } 700 | }, 701 | "nbformat": 4, 702 | "nbformat_minor": 2 703 | } 704 | -------------------------------------------------------------------------------- /04_Explicit_Feedback_deep_learning_with_Tensorflow_0.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Explicit Feedback Neural Recommender 
Systems\n", 8 | "\n", 9 | "Goals:\n", 10 | "- Understand recommendation systems \n", 11 | "- Build different model architectures using Tensorflow\n", 12 | "- Retrieve Embeddings and visualize them\n", 13 | "- Add metadata information as input to the model\n", 14 | "\n", 15 | "\n", 16 | "This notebook is inspired by Olivier Grisel's notebook (https://github.com/ogrisel), which used Keras\n", 17 | "for building the models. We will be using basic Tensorflow APIs instead. " 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": null, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "import pandas as pd\n", 27 | "import numpy as np\n", 28 | "import os\n", 29 | "import matplotlib.pyplot as plt\n", 30 | "from sklearn.model_selection import train_test_split\n", 31 | "import tensorflow as tf\n", 32 | "from tensorflow.contrib import layers\n", 33 | "from tensorflow.python.estimator.inputs import numpy_io\n", 34 | "from tensorflow.contrib.learn import *\n", 35 | "%matplotlib inline " 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "print('Tensorflow Version : {0}'.format(tf.__version__))" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "### Ratings file\n", 52 | "\n", 53 | "Each line contains a rated movie: \n", 54 | "- a user\n", 55 | "- an item\n", 56 | "- a rating from 1 to 5 stars" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "# Base Path for MovieLens dataset\n", 66 | "ML_100K_PATH = os.path.join('processed','ml-100k','ml-100k')" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "df_raw_ratings = pd.read_csv(os.path.join(ML_100K_PATH, 'u.data'), sep='\\t',\n", 76 | " names=[\"user_id\", \"item_id\", \"rating\", 
\"timestamp\"])\n", 77 | "df_raw_ratings.head()" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "### Item metadata file\n" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "m_cols = ['item_id', 'title', 'release_date', 'video_release_date', 'imdb_url']\n", 94 | "# Loading only 5 columns\n", 95 | "df_items = pd.read_csv(os.path.join(ML_100K_PATH, 'u.item'), sep='|',\n", 96 | " names=m_cols, usecols=range(5), encoding='latin-1')\n", 97 | "df_items.head()" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "def get_release_year(x):\n", 107 | " splits = str(x).split('-')\n", 108 | " if(len(splits) == 3):\n", 109 | " return int(splits[2])\n", 110 | " else:\n", 111 | " return 1920\n", 112 | " \n", 113 | "df_items['release_year'] = df_items['release_date'].map(lambda x : get_release_year(x))" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "df_items.head()" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "## Merge Rating with Item Metadata" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "df_all_ratings = pd.merge(df_items, df_raw_ratings)" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": null, 144 | "metadata": {}, 145 | "outputs": [], 146 | "source": [ 147 | "df_all_ratings.head()" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "## Data Preprocessing\n", 155 | "\n", 156 | "To understand well the distribution of the data, the following statistics are computed:\n", 157 | "- the number of users\n", 158 | 
"- the number of items\n", 159 | "- the rating distribution" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [ 168 | "# Number of users\n", 169 | "max_user_id = df_all_ratings['user_id'].max()\n", 170 | "max_user_id" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "# Number of items\n", 180 | "max_item_id = df_all_ratings['item_id'].max()\n", 181 | "max_item_id" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": null, 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "df_all_ratings.groupby('rating')['rating'].count().plot(kind='bar', rot=0);" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "metadata": {}, 197 | "outputs": [], 198 | "source": [ 199 | "# ratings\n", 200 | "df_all_ratings['rating'].describe()" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "### Add Popularity" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "popularity = df_all_ratings.groupby('item_id').size().reset_index(name='popularity')" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "Enrich the ratings data with the popularity as an additional metadata." 
224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "metadata": {}, 230 | "outputs": [], 231 | "source": [ 232 | "df_all_ratings = pd.merge(df_all_ratings, popularity)\n", 233 | "df_all_ratings.head()" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "metadata": {}, 240 | "outputs": [], 241 | "source": [ 242 | "df_all_ratings.nlargest(10, 'popularity')" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "Later in the analysis we will assume that this popularity does not come from the ratings themselves but from external metadata, e.g. box office numbers in the month after the release in movie theaters." 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "### Train Test Validation Split" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": null, 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "# Split All ratings into train_val and test\n", 266 | "ratings_train_val, ratings_test = train_test_split(df_all_ratings, test_size=0.2, random_state=0)\n", 267 | "# Split train_val into training and validation set\n", 268 | "ratings_train, ratings_val = train_test_split(ratings_train_val, test_size=0.2, random_state=0)\n", 269 | "\n", 270 | "print('Total rating rows count: {0} '.format(len(df_all_ratings)))\n", 271 | "print('Total training rows count: {0} '.format(len(ratings_train)))\n", 272 | "print('Total validation rows count: {0} '.format(len(ratings_val)))\n", 273 | "print('Total test rows count: {0} '.format(len(ratings_test)))\n" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": null, 279 | "metadata": {}, 280 | "outputs": [], 281 | "source": [ 282 | "ratings_train.info()" 283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": {}, 288 | "source": [ 289 | "# Explicit feedback: supervised ratings 
prediction\n", 290 | "\n", 291 | "For each pair of (user, item) try to predict the rating the user would give to the item.\n", 292 | "\n", 293 | "This is the classical setup for building recommender systems from offline data with explicit supervision signal. " 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "## Predictive ratings as a regression problem\n", 301 | "\n", 302 | "The following code implements the following architecture:\n", 303 | "\n", 304 | "" 305 | ] 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "metadata": {}, 310 | "source": [ 311 | "### Matrix Factorization" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": null, 317 | "metadata": {}, 318 | "outputs": [], 319 | "source": [ 320 | "embedding_size = 30 # embedding size\n", 321 | "reg_param = 0.01 # regularization parameter lambda\n", 322 | "learning_rate = 0.01 # learning rate \n", 323 | "\n", 324 | "\n", 325 | "# create tensorflow graph\n", 326 | "g = tf.Graph()\n", 327 | "with g.as_default():\n", 328 | " # setting up random seed\n", 329 | " tf.set_random_seed(1234)\n", 330 | " \n", 331 | " # placeholders\n", 332 | " users = tf.placeholder(shape=[None], dtype=tf.int64)\n", 333 | " items = tf.placeholder(shape=[None], dtype=tf.int64)\n", 334 | " ratings = tf.placeholder(shape=[None], dtype=tf.float32)\n", 335 | " \n", 336 | " # variables\n", 337 | " with tf.variable_scope(\"embedding\"):\n", 338 | " user_weight = tf.get_variable(\"user_w\"\n", 339 | " , shape=[max_user_id + 1, embedding_size]\n", 340 | " , dtype=tf.float32\n", 341 | " , initializer=layers.xavier_initializer())\n", 342 | "\n", 343 | " item_weight = tf.get_variable(\"item_w\"\n", 344 | " , shape=[max_item_id + 1, embedding_size]\n", 345 | " , dtype=tf.float32\n", 346 | " , initializer=layers.xavier_initializer())\n", 347 | " # prediction\n", 348 | " with tf.name_scope(\"inference\"):\n", 349 | " user_embedding = tf.nn.embedding_lookup(user_weight, 
users)\n", 350 | " item_embedding = tf.nn.embedding_lookup(item_weight, items)\n", 351 | " pred = tf.reduce_sum(tf.multiply(user_embedding, item_embedding), 1) \n", 352 | " \n", 353 | " # loss \n", 354 | " with tf.name_scope(\"loss\"):\n", 355 | " reg_loss = tf.contrib.layers.apply_regularization(layers.l2_regularizer(scale=reg_param),\n", 356 | " weights_list=[user_weight, item_weight])\n", 357 | " loss = tf.nn.l2_loss(pred - ratings) + reg_loss\n", 358 | " train_ops = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)\n", 359 | " rmse = tf.sqrt(tf.reduce_mean(tf.pow(pred - ratings, 2)))\n", 360 | "\n", 361 | " " 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": null, 367 | "metadata": {}, 368 | "outputs": [], 369 | "source": [ 370 | "\n", 371 | "def train_model():\n", 372 | " # Training \n", 373 | " epochs = 1000 # number of iterations \n", 374 | " losses_train = []\n", 375 | " losses_val = []\n", 376 | "\n", 377 | "\n", 378 | "\n", 379 | " with tf.Session(graph=g) as sess:\n", 380 | " # initializer\n", 381 | " sess.run(tf.global_variables_initializer())\n", 382 | "\n", 383 | "\n", 384 | " train_input_dict = { users: ratings_train['user_id']\n", 385 | " , items: ratings_train['item_id']\n", 386 | " , ratings: ratings_train['rating']}\n", 387 | " val_input_dict = { users: ratings_val['user_id']\n", 388 | " , items: ratings_val['item_id']\n", 389 | " , ratings: ratings_val['rating']}\n", 390 | "\n", 391 | " test_input_dict = { users: ratings_test['user_id']\n", 392 | " , items: ratings_test['item_id']\n", 393 | " , ratings: ratings_test['rating']}\n", 394 | "\n", 395 | " def check_overfit(validation_loss):\n", 396 | " n = len(validation_loss)\n", 397 | " if n < 5:\n", 398 | " return False\n", 399 | " count = 0 \n", 400 | " for i in range(n-4, n):\n", 401 | " if validation_loss[i] < validation_loss[i-1]:\n", 402 | " count += 1\n", 403 | " if count >=2:\n", 404 | " return False\n", 405 | " return True\n", 406 | "\n", 407 | " 
for i in range(epochs):\n", 408 | " # run the training operation\n", 409 | " sess.run([train_ops], feed_dict=train_input_dict)\n", 410 | "\n", 411 | " # show intermediate results \n", 412 | " if i % 5 == 0:\n", 413 | " loss_train = sess.run(loss, feed_dict=train_input_dict)\n", 414 | " loss_val = sess.run(loss, feed_dict=val_input_dict)\n", 415 | " losses_train.append(loss_train)\n", 416 | " losses_val.append(loss_val)\n", 417 | "\n", 418 | "\n", 419 | " # check early stopping \n", 420 | " if(check_overfit(losses_val)):\n", 421 | " print('overfit !')\n", 422 | " break\n", 423 | "\n", 424 | " print(\"iteration : {0} train loss: {1:.3f} , valid loss {2:.3f}\".format(i,loss_train, loss_val))\n", 425 | "\n", 426 | " # calculate RMSE on the test dataset\n", 427 | " print('RMSE on test dataset : {0:.4f}'.format(sess.run(rmse, feed_dict=test_input_dict)))\n", 428 | "\n", 429 | " plt.plot(losses_train, label='train')\n", 430 | " plt.plot(losses_val, label='validation')\n", 431 | " #plt.ylim(0, 50000)\n", 432 | " plt.legend(loc='best')\n", 433 | " plt.title('Loss');" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": null, 439 | "metadata": {}, 440 | "outputs": [], 441 | "source": [ 442 | "train_model()" 443 | ] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": {}, 448 | "source": [ 449 | "### Matrix Factorization with Biases" 450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": null, 455 | "metadata": {}, 456 | "outputs": [], 457 | "source": [ 458 | "embedding_size = 30 # embedding size\n", 459 | "reg_param = 0.01 # regularization parameter lambda\n", 460 | "learning_rate = 0.01 # learning rate \n", 461 | "\n", 462 | "\n", 463 | "# create tensorflow graph\n", 464 | "g = tf.Graph()\n", 465 | "with g.as_default():\n", 466 | " \n", 467 | " tf.set_random_seed(1234)\n", 468 | " \n", 469 | " # placeholders\n", 470 | " users = tf.placeholder(shape=[None], dtype=tf.int64)\n", 471 | " items = 
tf.placeholder(shape=[None], dtype=tf.int64)\n", 472 | " ratings = tf.placeholder(shape=[None], dtype=tf.float32)\n", 473 | " \n", 474 | " # variables\n", 475 | " with tf.variable_scope(\"embedding\"):\n", 476 | " user_weight = tf.get_variable(\"user_w\"\n", 477 | " , shape=[max_user_id + 1, embedding_size]\n", 478 | " , dtype=tf.float32\n", 479 | " , initializer=layers.xavier_initializer())\n", 480 | "\n", 481 | " item_weight = tf.get_variable(\"item_w\"\n", 482 | " , shape=[max_item_id + 1, embedding_size]\n", 483 | " , dtype=tf.float32\n", 484 | " , initializer=layers.xavier_initializer())\n", 485 | " \n", 486 | " user_bias = tf.get_variable(\"user_b\"\n", 487 | " , shape=[max_user_id + 1]\n", 488 | " , dtype=tf.float32\n", 489 | " , initializer=tf.zeros_initializer)\n", 490 | " \n", 491 | " item_bias = tf.get_variable(\"item_b\"\n", 492 | " , shape=[max_item_id + 1]\n", 493 | " , dtype=tf.float32\n", 494 | " , initializer=tf.zeros_initializer)\n", 495 | " \n", 496 | " # prediction\n", 497 | " with tf.name_scope(\"inference\"):\n", 498 | " user_embedding = tf.nn.embedding_lookup(user_weight, users)\n", 499 | " item_embedding = tf.nn.embedding_lookup(item_weight, items)\n", 500 | " user_b = tf.nn.embedding_lookup(user_bias, users)\n", 501 | " item_b = tf.nn.embedding_lookup(item_bias, items)\n", 502 | " pred = tf.reduce_sum(tf.multiply(user_embedding, item_embedding), 1) + user_b + item_b\n", 503 | " \n", 504 | " # loss \n", 505 | " with tf.name_scope(\"loss\"):\n", 506 | " reg_loss = tf.contrib.layers.apply_regularization(layers.l2_regularizer(scale=reg_param),\n", 507 | " weights_list=[user_weight, item_weight])\n", 508 | " loss = tf.nn.l2_loss(pred - ratings) + reg_loss\n", 509 | " train_ops = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)\n", 510 | " rmse = tf.sqrt(tf.reduce_mean(tf.pow(pred - ratings, 2)))\n", 511 | "\n", 512 | " " 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": null, 518 | "metadata": {}, 519 
| "outputs": [], 520 | "source": [ 521 | "train_model()" 522 | ] 523 | }, 524 | { 525 | "cell_type": "markdown", 526 | "metadata": {}, 527 | "source": [ 528 | "## A Deep recommender model\n", 529 | "\n", 530 | "We can use deep learning models with multiple layers ( fully connected and dropout ) for the recommendation system.\n", 531 | "\n", 532 | "\n", 533 | "\n", 534 | "To build this model we will need a new kind of layer:" 535 | ] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "execution_count": null, 540 | "metadata": {}, 541 | "outputs": [], 542 | "source": [ 543 | "embedding_size = 50\n", 544 | "reg_param = 0.01\n", 545 | "learning_rate = 0.01\n", 546 | "n_users = max_user_id + 1\n", 547 | "n_items = max_item_id + 1\n", 548 | "\n", 549 | "g = tf.Graph()\n", 550 | "with g.as_default():\n", 551 | " \n", 552 | " tf.set_random_seed(1234)\n", 553 | "\n", 554 | " users = tf.placeholder(shape=[None,1], dtype=tf.int64, name='input_users')\n", 555 | " items = tf.placeholder(shape=[None,1], dtype=tf.int64, name='input_items')\n", 556 | " ratings = tf.placeholder(shape=[None,1], dtype=tf.float32, name='input_ratings')\n", 557 | " \n", 558 | " l2_loss = tf.constant(0.0)\n", 559 | " \n", 560 | " # embeddding layer\n", 561 | " with tf.variable_scope(\"embedding\"):\n", 562 | " user_weights = tf.get_variable(\"user_w\"\n", 563 | " , shape=[n_users, embedding_size]\n", 564 | " , dtype=tf.float32\n", 565 | " , initializer=layers.xavier_initializer())\n", 566 | " \n", 567 | " item_weights = tf.get_variable(\"item_w\"\n", 568 | " , shape=[n_items, embedding_size]\n", 569 | " , dtype=tf.float32\n", 570 | " , initializer=layers.xavier_initializer())\n", 571 | " \n", 572 | " user_embedding = tf.squeeze(tf.nn.embedding_lookup(user_weights, users),axis=1, name='user_embedding')\n", 573 | " item_embedding = tf.squeeze(tf.nn.embedding_lookup(item_weights, items),axis=1, name='item_embedding')\n", 574 | " \n", 575 | " l2_loss += tf.nn.l2_loss(user_weights)\n", 576 | " l2_loss += 
tf.nn.l2_loss(item_weights)\n", 577 | " \n", 578 | " \n", 579 | " print(user_embedding)\n", 580 | " print(item_embedding)\n", 581 | " \n", 582 | " \n", 583 | " # combine inputs\n", 584 | " with tf.name_scope('concatenation'):\n", 585 | " input_vecs = tf.concat([user_embedding, item_embedding], axis=1)\n", 586 | " print(input_vecs)\n", 587 | " \n", 588 | " # fc-1\n", 589 | " num_hidden = 64\n", 590 | " with tf.name_scope(\"fc_1\"):\n", 591 | " W_fc_1 = tf.get_variable(\n", 592 | " \"W_hidden\",\n", 593 | " shape=[2*embedding_size, num_hidden],\n", 594 | " initializer=tf.contrib.layers.xavier_initializer())\n", 595 | " b_fc_1 = tf.Variable(tf.constant(0.1, shape=[num_hidden]), name=\"b\")\n", 596 | " hidden_output = tf.nn.relu(tf.nn.xw_plus_b(input_vecs, W_fc_1, b_fc_1), name='hidden_output')\n", 597 | " l2_loss += tf.nn.l2_loss(W_fc_1)\n", 598 | " print(hidden_output)\n", 599 | " \n", 600 | " # dropout\n", 601 | " with tf.name_scope(\"dropout\"):\n", 602 | " h_drop = tf.nn.dropout(hidden_output, 0.99, name=\"hidden_output_drop\")\n", 603 | " print(h_drop)\n", 604 | " \n", 605 | " # fc-2\n", 606 | " with tf.name_scope(\"fc_2\"):\n", 607 | " W_fc_2 = tf.get_variable(\n", 608 | " \"W_output\",\n", 609 | " shape=[num_hidden,1],\n", 610 | " initializer=tf.contrib.layers.xavier_initializer())\n", 611 | " b_fc_2 = tf.Variable(tf.constant(0.1, shape=[1]), name=\"b\")\n", 612 | " pred = tf.nn.xw_plus_b(h_drop, W_fc_2, b_fc_2, name='pred')\n", 613 | " l2_loss += tf.nn.l2_loss(W_fc_2)\n", 614 | " print(pred)\n", 615 | "\n", 616 | " # loss\n", 617 | " with tf.name_scope(\"loss\"):\n", 618 | " loss = tf.nn.l2_loss(pred - ratings) + reg_param * l2_loss\n", 619 | " train_ops = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)\n", 620 | " rmse = tf.sqrt(tf.reduce_mean(tf.pow(pred - ratings, 2)))\n", 621 | "\n", 622 | " " 623 | ] 624 | }, 625 | { 626 | "cell_type": "code", 627 | "execution_count": null, 628 | "metadata": {}, 629 | "outputs": [], 630 | "source": [ 
631 | "def train_model_deep():\n", 632 | " losses_train = []\n", 633 | " losses_val = []\n", 634 | " epochs = 1000\n", 635 | "\n", 636 | " with tf.Session(graph=g) as sess:\n", 637 | " sess.run(tf.global_variables_initializer())\n", 638 | " train_input_dict = {users: ratings_train['user_id'].values.reshape([-1,1])\n", 639 | " , items: ratings_train['item_id'].values.reshape([-1,1])\n", 640 | " , ratings: ratings_train['rating'].values.reshape([-1,1])}\n", 641 | "\n", 642 | " val_input_dict = {users: ratings_val['user_id'].values.reshape([-1,1])\n", 643 | " , items: ratings_val['item_id'].values.reshape([-1,1])\n", 644 | " , ratings: ratings_val['rating'].values.reshape([-1,1])}\n", 645 | "\n", 646 | " test_input_dict = {users: ratings_test['user_id'].values.reshape([-1,1])\n", 647 | " , items: ratings_test['item_id'].values.reshape([-1,1])\n", 648 | " , ratings: ratings_test['rating'].values.reshape([-1,1])}\n", 649 | "\n", 650 | " def check_overfit(validation_loss):\n", 651 | " n = len(validation_loss)\n", 652 | " if n < 5:\n", 653 | " return False\n", 654 | " count = 0 \n", 655 | " for i in range(n-4, n):\n", 656 | " if validation_loss[i] < validation_loss[i-1]:\n", 657 | " count += 1\n", 658 | " if count >=3:\n", 659 | " return False\n", 660 | " return True\n", 661 | "\n", 662 | "\n", 663 | "\n", 664 | " for i in range(epochs):\n", 665 | " sess.run([train_ops], feed_dict=train_input_dict)\n", 666 | " if i % 10 == 0:\n", 667 | " loss_train = sess.run(loss, feed_dict=train_input_dict)\n", 668 | " loss_val = sess.run(loss, feed_dict=val_input_dict)\n", 669 | " losses_train.append(loss_train)\n", 670 | " losses_val.append(loss_val)\n", 671 | "\n", 672 | " # check early stopping \n", 673 | " if(check_overfit(losses_val)):\n", 674 | " print('overfit !')\n", 675 | " break\n", 676 | "\n", 677 | " print(\"iteration : %d train loss: %.3f , valid loss %.3f\" % (i,loss_train, loss_val))\n", 678 | "\n", 679 | " # calculate RMSE on the test dataset\n", 680 | " print('RMSE on 
test dataset : {0:.4f}'.format(sess.run(rmse, feed_dict=test_input_dict)))\n", 681 | "\n", 682 | " # user and item embedding\n", 683 | " user_embedding_variable = [v for v in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES) if v.name.endswith('embedding/user_w:0')][0]\n", 684 | " item_embedding_variable = [v for v in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES) if v.name.endswith('embedding/item_w:0')][0]\n", 685 | " user_embedding_weights, item_embedding_weights = sess.run([user_embedding_variable,item_embedding_variable])\n", 686 | " \n", 687 | " \n", 688 | " # plot train and validation loss\n", 689 | " plt.plot(losses_train, label='train')\n", 690 | " plt.plot(losses_val, label='validation')\n", 691 | " plt.legend(loc='best')\n", 692 | " plt.title('Loss');\n", 693 | " \n", 694 | " return user_embedding_weights, item_embedding_weights " 695 | ] 696 | }, 697 | { 698 | "cell_type": "code", 699 | "execution_count": null, 700 | "metadata": {}, 701 | "outputs": [], 702 | "source": [ 703 | "user_embedding_weights, item_embedding_weights = train_model_deep()" 704 | ] 705 | }, 706 | { 707 | "cell_type": "markdown", 708 | "metadata": {}, 709 | "source": [ 710 | "### Model Embeddings" 711 | ] 712 | }, 713 | { 714 | "cell_type": "code", 715 | "execution_count": null, 716 | "metadata": {}, 717 | "outputs": [], 718 | "source": [ 719 | "print(\"First item name from metadata:\", df_items.loc[df_items[\"item_id\"] == 1, \"title\"].iloc[0])\n", 720 | "print(\"Embedding vector for the first item:\")\n", 721 | "print(item_embedding_weights[1])\n", 722 | "print(\"shape:\", item_embedding_weights[1].shape)" 723 | ] 724 | }, 725 | { 726 | "cell_type": "markdown", 727 | "metadata": {}, 728 | "source": [ 729 | "### Visualizing embeddings using t-SNE\n", 730 | "\n", 731 | "- We use scikit-learn to visualize item embeddings\n", 732 | "- Try different perplexities, and visualize user embeddings as well\n", 733 | "- Check the impact of different perplexity values. 
Here is a very nice tutorial if you want to know in detail (https://distill.pub/2016/misread-tsne/ )" 734 | ] 735 | }, 736 | { 737 | "cell_type": "code", 738 | "execution_count": null, 739 | "metadata": {}, 740 | "outputs": [], 741 | "source": [ 742 | "from sklearn.manifold import TSNE\n", 743 | "\n", 744 | "item_tsne = TSNE(perplexity=50).fit_transform(item_embedding_weights)" 745 | ] 746 | }, 747 | { 748 | "cell_type": "code", 749 | "execution_count": null, 750 | "metadata": {}, 751 | "outputs": [], 752 | "source": [ 753 | "import matplotlib.pyplot as plt\n", 754 | "\n", 755 | "plt.figure(figsize=(10, 10))\n", 756 | "plt.scatter(item_tsne[:, 0], item_tsne[:, 1]);\n", 757 | "plt.xticks(()); plt.yticks(());\n", 758 | "plt.show()" 759 | ] 760 | }, 761 | { 762 | "cell_type": "markdown", 763 | "metadata": {}, 764 | "source": [ 765 | "## Using item metadata in the model\n", 766 | "\n", 767 | "Using a similar framework as previously, we will build another deep model that can also leverage additional metadata. 
The resulting system is therefore a **Hybrid Recommender System** that does both **Collaborative Filtering** and **Content-based recommendations**.\n", 768 | "\n", 769 | "\n" 770 | ] 771 | }, 772 | { 773 | "cell_type": "code", 774 | "execution_count": null, 775 | "metadata": {}, 776 | "outputs": [], 777 | "source": [ 778 | "embedding_size = 50\n", 779 | "reg_param = 0.01\n", 780 | "learning_rate = 0.01\n", 781 | "n_users = max_user_id + 1\n", 782 | "n_items = max_item_id + 1\n", 783 | "meta_size = 2\n", 784 | "\n", 785 | "g = tf.Graph()\n", 786 | "with g.as_default():\n", 787 | "\n", 788 | " tf.set_random_seed(1234)\n", 789 | " \n", 790 | " users = tf.placeholder(shape=[None,1], dtype=tf.int64, name='input_users')\n", 791 | " items = tf.placeholder(shape=[None,1], dtype=tf.int64, name='input_items')\n", 792 | " meta = tf.placeholder(shape=[None, meta_size], dtype=tf.float32, name='input_metadata')\n", 793 | " ratings = tf.placeholder(shape=[None,1], dtype=tf.float32, name='input_ratings')\n", 794 | " \n", 795 | " l2_loss = tf.constant(0.0)\n", 796 | " \n", 797 | " # embedding layer\n", 798 | " with tf.variable_scope(\"embedding\"):\n", 799 | " user_weights = tf.get_variable(\"user_w\"\n", 800 | " , shape=[n_users, embedding_size]\n", 801 | " , dtype=tf.float32\n", 802 | " , initializer=layers.xavier_initializer())\n", 803 | " \n", 804 | " item_weights = tf.get_variable(\"item_w\"\n", 805 | " , shape=[n_items, embedding_size]\n", 806 | " , dtype=tf.float32\n", 807 | " , initializer=layers.xavier_initializer())\n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " user_embedding = tf.squeeze(tf.nn.embedding_lookup(user_weights, users),axis=1, name='user_embedding')\n", 812 | " item_embedding = tf.squeeze(tf.nn.embedding_lookup(item_weights, items),axis=1, name='item_embedding')\n", 813 | " \n", 814 | " l2_loss += tf.nn.l2_loss(user_weights)\n", 815 | " l2_loss += tf.nn.l2_loss(item_weights)\n", 816 | " \n", 817 | " \n", 818 | " print(user_embedding)\n", 819 | " 
print(item_embedding)\n", 820 | " \n", 821 | " \n", 822 | " # combine inputs\n", 823 | " with tf.name_scope('concatenation'):\n", 824 | " input_vecs = tf.concat([user_embedding, item_embedding, meta], axis=1)\n", 825 | " print(input_vecs)\n", 826 | " \n", 827 | " # fc-1\n", 828 | " num_hidden = 64\n", 829 | " with tf.name_scope(\"fc_1\"):\n", 830 | " W_fc_1 = tf.get_variable(\n", 831 | " \"W_hidden\",\n", 832 | " shape=[2*embedding_size + meta_size, num_hidden],\n", 833 | " initializer=tf.contrib.layers.xavier_initializer())\n", 834 | " b_fc_1 = tf.Variable(tf.constant(0.1, shape=[num_hidden]), name=\"b\")\n", 835 | " hidden_output = tf.nn.relu(tf.nn.xw_plus_b(input_vecs, W_fc_1, b_fc_1), name='hidden_output')\n", 836 | " l2_loss += tf.nn.l2_loss(W_fc_1)\n", 837 | " print(hidden_output)\n", 838 | " \n", 839 | " # dropout\n", 840 | " with tf.name_scope(\"dropout\"):\n", 841 | " h_drop = tf.nn.dropout(hidden_output, 0.99, name=\"hidden_output_drop\")\n", 842 | " print(h_drop)\n", 843 | " \n", 844 | " # fc-2\n", 845 | " with tf.name_scope(\"fc_2\"):\n", 846 | " W_fc_2 = tf.get_variable(\n", 847 | " \"W_output\",\n", 848 | " shape=[num_hidden,1],\n", 849 | " initializer=tf.contrib.layers.xavier_initializer())\n", 850 | " b_fc_2 = tf.Variable(tf.constant(0.1, shape=[1]), name=\"b\")\n", 851 | " pred = tf.nn.xw_plus_b(h_drop, W_fc_2, b_fc_2, name='pred')\n", 852 | " l2_loss += tf.nn.l2_loss(W_fc_2)\n", 853 | " print(pred)\n", 854 | "\n", 855 | " # loss\n", 856 | " with tf.name_scope(\"loss\"):\n", 857 | " loss = tf.nn.l2_loss(pred - ratings) + reg_param * l2_loss\n", 858 | " train_ops = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)\n", 859 | " rmse = tf.sqrt(tf.reduce_mean(tf.pow(pred - ratings, 2)))" 860 | ] 861 | }, 862 | { 863 | "cell_type": "code", 864 | "execution_count": null, 865 | "metadata": {}, 866 | "outputs": [], 867 | "source": [ 868 | "from sklearn.preprocessing import QuantileTransformer\n", 869 | "\n", 870 | "meta_columns = 
['popularity', 'release_year']\n", 871 | "\n", 872 | "scaler = QuantileTransformer()\n", 873 | "item_meta_train = scaler.fit_transform(ratings_train[meta_columns])\n", 874 | "item_meta_val = scaler.transform(ratings_val[meta_columns])\n", 875 | "item_meta_test = scaler.transform(ratings_test[meta_columns])" 876 | ] 877 | }, 878 | { 879 | "cell_type": "code", 880 | "execution_count": null, 881 | "metadata": {}, 882 | "outputs": [], 883 | "source": [ 884 | "def train_model_deep_meta():\n", 885 | "\n", 886 | " losses_train = []\n", 887 | " losses_val = []\n", 888 | " epochs = 1000\n", 889 | "\n", 890 | " with tf.Session(graph=g) as sess:\n", 891 | " sess.run(tf.global_variables_initializer())\n", 892 | " train_input_dict = {users: ratings_train['user_id'].values.reshape([-1,1])\n", 893 | " , items: ratings_train['item_id'].values.reshape([-1,1])\n", 894 | " , ratings: ratings_train['rating'].values.reshape([-1,1])\n", 895 | " ,meta: item_meta_train}\n", 896 | "\n", 897 | " val_input_dict = {users: ratings_val['user_id'].values.reshape([-1,1])\n", 898 | " , items: ratings_val['item_id'].values.reshape([-1,1])\n", 899 | " , ratings: ratings_val['rating'].values.reshape([-1,1])\n", 900 | " ,meta : item_meta_val}\n", 901 | "\n", 902 | " test_input_dict = {users: ratings_test['user_id'].values.reshape([-1,1])\n", 903 | " , items: ratings_test['item_id'].values.reshape([-1,1])\n", 904 | " , ratings: ratings_test['rating'].values.reshape([-1,1])\n", 905 | " ,meta : item_meta_test}\n", 906 | " def check_overfit(validation_loss):\n", 907 | " n = len(validation_loss)\n", 908 | " if n < 5:\n", 909 | " return False\n", 910 | " count = 0 \n", 911 | " for i in range(n-4, n):\n", 912 | " if validation_loss[i] < validation_loss[i-1]:\n", 913 | " count += 1\n", 914 | " if count >=3:\n", 915 | " return False\n", 916 | " return True\n", 917 | "\n", 918 | "\n", 919 | " for i in range(epochs):\n", 920 | " sess.run([train_ops], feed_dict=train_input_dict)\n", 921 | " if i % 10 == 0:\n", 
922 | " loss_train = sess.run(loss, feed_dict=train_input_dict)\n", 923 | " loss_val = sess.run(loss, feed_dict=val_input_dict)\n", 924 | " losses_train.append(loss_train)\n", 925 | " losses_val.append(loss_val)\n", 926 | "\n", 927 | " # check early stopping \n", 928 | " if(check_overfit(losses_val)):\n", 929 | " print('overfit !')\n", 930 | " break\n", 931 | " print(\"iteration : %d train loss: %.3f , valid loss %.3f\" % (i,loss_train, loss_val))\n", 932 | " \n", 933 | " # plot train and validation loss\n", 934 | " plt.plot(losses_train, label='train')\n", 935 | " plt.plot(losses_val, label='validation')\n", 936 | " plt.legend(loc='best')\n", 937 | " plt.title('Loss');\n", 938 | " \n", 939 | " # calculate RMSE on the test dataset\n", 940 | " print('RMSE on test dataset : {0:.4f}'.format(sess.run(rmse, feed_dict=test_input_dict)))\n", 941 | " " 942 | ] 943 | }, 944 | { 945 | "cell_type": "code", 946 | "execution_count": null, 947 | "metadata": {}, 948 | "outputs": [], 949 | "source": [ 950 | "train_model_deep_meta()" 951 | ] 952 | } 953 | ], 954 | "metadata": { 955 | "kernelspec": { 956 | "display_name": "Python 3", 957 | "language": "python", 958 | "name": "python3" 959 | }, 960 | "language_info": { 961 | "codemirror_mode": { 962 | "name": "ipython", 963 | "version": 3 964 | }, 965 | "file_extension": ".py", 966 | "mimetype": "text/x-python", 967 | "name": "python", 968 | "nbconvert_exporter": "python", 969 | "pygments_lexer": "ipython3", 970 | "version": "3.6.3" 971 | } 972 | }, 973 | "nbformat": 4, 974 | "nbformat_minor": 2 975 | } 976 | -------------------------------------------------------------------------------- /05-Implicit_Feedback_with_the_triplet_loss_with_Tensorflow_0.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Triplet Loss for Implicit Feedback Neural Recommender Systems\n", 8 | "\n", 9 | "Goals:\n", 10 | "- 
Understand bi-linear recommender systems that use only positive feedback data\n", 11 | "- Use Margin Based Comparator / Triplet Loss\n", 12 | "- Build a deep learning architecture using similar design principles\n", 13 | "\n", 14 | "This notebook is inspired by Olivier Grisel's notebook, which used Keras\n", 15 | "(https://github.com/ogrisel) to build the models. We will use the basic TensorFlow APIs instead. You can also look into Maciej Kula's work [Recommendations in Keras using triplet loss](\n", 16 | "https://github.com/maciejkula/triplet_recommendations_keras) that uses BPR (Bayesian Personalized Ranking). \n" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "%matplotlib inline\n", 26 | "import matplotlib.pyplot as plt\n", 27 | "import numpy as np\n", 28 | "import pandas as pd\n", 29 | "import os\n", 30 | "import os.path as op\n", 31 | "from sklearn.metrics import roc_auc_score\n", 32 | "from tensorflow.contrib import layers\n", 33 | "import tensorflow as tf\n", 34 | "print(tf.__version__)" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "# Base Path for MovieLens dataset\n", 44 | "ML_100K_PATH = os.path.join('processed','ml-100k','ml-100k')\n", 45 | "\n", 46 | "data_train = pd.read_csv(op.join(ML_100K_PATH, 'ua.base'), sep='\\t',\n", 47 | " names=[\"user_id\", \"item_id\", \"rating\", \"timestamp\"])\n", 48 | "data_test = pd.read_csv(op.join(ML_100K_PATH, 'ua.test'), sep='\\t',\n", 49 | " names=[\"user_id\", \"item_id\", \"rating\", \"timestamp\"])\n", 50 | "\n", 51 | "data_train.describe()" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "def get_release_year(x):\n", 61 | " splits = str(x).split('-')\n", 62 | " if(len(splits) == 3):\n", 63 | " return int(splits[2])\n", 64 | " else:\n", 65 | 
" return 1920\n", 66 | " \n", 67 | "\n", 68 | "m_cols = ['item_id', 'title', 'release_date', 'video_release_date', 'imdb_url']\n", 69 | "items = pd.read_csv(op.join(ML_100K_PATH, 'u.item'), sep='|',\n", 70 | " names=m_cols, usecols=range(5), encoding='latin-1')\n", 71 | "items['release_year'] = items['release_date'].map(get_release_year)\n", 72 | "\n", 73 | "data_train = pd.merge(data_train, items)\n", 74 | "data_test = pd.merge(data_test, items)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "data_train.head()" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [ 92 | "max_user_id = max(data_train['user_id'].max(), data_test['user_id'].max())\n", 93 | "max_item_id = max(data_train['item_id'].max(), data_test['item_id'].max())\n", 94 | "\n", 95 | "n_users = max_user_id + 1\n", 96 | "n_items = max_item_id + 1\n", 97 | "\n", 98 | "print('n_users=%d, n_items=%d' % (n_users, n_items))" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "## Implicit feedback data\n", 106 | "\n", 107 | "Consider ratings >= 4 as positive feed back and ignore the rest:" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "data_train['rating'].plot(kind='hist');\n", 117 | "print(data_train['rating'].mean())" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "pos_data_train = data_train.query(\"rating >= 4\")\n", 127 | "pos_data_test = data_test.query(\"rating >= 4\")" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "Because the mean rating is around 3.5, this cut will remove approximately half of the ratings from the 
datasets:" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "pos_data_train['rating'].count()" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "pos_data_test['rating'].count()" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "## The Triplet Loss\n", 160 | "\n", 161 | "The following section demonstrates how to build a low-rank quadratic interaction model between users and items. The similarity score between a user and an item is defined by the unnormalized dot product of their respective embeddings.\n", 162 | "\n", 163 | "The matching scores can be used to rank items to recommend to a specific user.\n", 164 | "\n", 165 | "Training of the model parameters is achieved by randomly sampling negative items not seen by a pre-selected anchor user. We want the model embedding matrices to be such that the similarity between the user vector and the negative vector is smaller than the similarity between the user vector and the positive item vector. Furthermore, we use a margin to push the negative item further apart from the anchor user.\n", 166 | "\n", 167 | "Here is the architecture of such a triplet model. The triplet name comes from the fact that the loss to optimize is defined for a triple `(anchor_user, positive_item, negative_item)`:\n", 168 | "\n", 169 | "\n", 170 | "\n", 171 | "We call this model a triplet model with bi-linear interactions because the similarity between a user and an item is captured by a dot product of the first level embedding vectors. This is therefore not a deep architecture." 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "Here is the actual code that builds the model(s) with shared weights. 
Note that the TensorFlow code in this notebook scores pairs with the unnormalized dot product; cosine similarity is an alternative (both seem to yield comparable results)." 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "## Quality of Ranked Recommendations\n", 186 | "\n", 187 | "Now that we have a randomly initialized model we can start computing random recommendations. To assess their quality we do the following for each user:\n", 188 | "\n", 189 | "- compute matching scores for items (except the movies that the user has already seen in the training set),\n", 190 | "- compare to the positive feedback actually collected on the test set using the ROC AUC ranking metric,\n", 191 | "- average ROC AUC scores across users to get the average performance of the recommender model on the test set." 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "By default the model should make predictions that rank the items in random order. The **ROC AUC score** is a ranking score that represents the **expected value of correctly ordering uniformly sampled pairs of recommendations**.\n", 199 | "\n", 200 | "A random (untrained) model should yield 0.50 ROC AUC on average. 
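

The pairwise reading of ROC AUC can be checked with a small, dependency-free sketch. The helper `pairwise_auc` and the toy labels and scores are hypothetical and are not part of the notebook, which relies on `sklearn.metrics.roc_auc_score`; the sketch only illustrates that uniformly random scores land near 0.5 while a perfect ranking reaches 1.0.

```python
import random


def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative item")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical toy data: about 10% of 2000 items are positives.
rng = random.Random(0)
labels = [rng.random() < 0.1 for _ in range(2000)]
random_scores = [rng.random() for _ in labels]          # untrained model
perfect_scores = [1.0 if l else 0.0 for l in labels]    # ideal ranking

auc_random = pairwise_auc(labels, random_scores)
auc_perfect = pairwise_auc(labels, perfect_scores)
print("random: %.3f, perfect: %.3f" % (auc_random, auc_perfect))
```

The double loop is quadratic in the number of items, which is fine for a toy check; for the per-user evaluation in this notebook, `roc_auc_score` is the practical choice.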
" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": null, 206 | "metadata": {}, 207 | "outputs": [], 208 | "source": [ 209 | "embedding_size = 64 # embedding size\n", 210 | "reg_param = 0.01 # regularization parameter lambda\n", 211 | "learning_rate = 0.01 # learning rate \n", 212 | "margin = 1.0 # margin \n", 213 | "\n", 214 | "# create tensorflow graph\n", 215 | "g = tf.Graph()\n", 216 | "with g.as_default():\n", 217 | " \n", 218 | " # setting up random seed\n", 219 | " tf.set_random_seed(1234)\n", 220 | " \n", 221 | " # placeholders\n", 222 | " user_input = tf.placeholder(shape=[None], dtype=tf.int64)\n", 223 | " positive_item_input = tf.placeholder(shape=[None], dtype=tf.int64)\n", 224 | " negative_item_input = tf.placeholder(shape=[None], dtype=tf.int64)\n", 225 | " \n", 226 | " # variables\n", 227 | " with tf.variable_scope(\"embedding\"):\n", 228 | " user_weight = tf.get_variable(\"user_w\"\n", 229 | " , shape=[max_user_id + 1, embedding_size]\n", 230 | " , dtype=tf.float32\n", 231 | " , initializer=layers.xavier_initializer())\n", 232 | "\n", 233 | " item_weight = tf.get_variable(\"item_w\"\n", 234 | " , shape=[max_item_id + 1, embedding_size]\n", 235 | " , dtype=tf.float32\n", 236 | " , initializer=layers.xavier_initializer())\n", 237 | " # embedding\n", 238 | " with tf.name_scope(\"embedding\"):\n", 239 | " user_embedding = tf.nn.embedding_lookup(user_weight, user_input)\n", 240 | " positive_item_embedding = tf.nn.embedding_lookup(item_weight, positive_item_input)\n", 241 | " negative_item_embedding = tf.nn.embedding_lookup(item_weight, negative_item_input)\n", 242 | " \n", 243 | " # similarity\n", 244 | " with tf.name_scope(\"similarity\"):\n", 245 | " positive_similarity = tf.reduce_sum(tf.multiply(user_embedding, positive_item_embedding), 1) \n", 246 | " negative_similarity = tf.reduce_sum(tf.multiply(user_embedding, negative_item_embedding), 1) \n", 247 | " \n", 248 | " # loss \n", 249 | " with tf.name_scope(\"loss\"):\n", 
250 | " triplet_loss = tf.maximum(negative_similarity - positive_similarity + margin, 0)\n", 251 | " loss = tf.reduce_mean(triplet_loss)\n", 252 | " train_ops = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)\n", 253 | " \n", 254 | " \n", 255 | " " 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": null, 261 | "metadata": {}, 262 | "outputs": [], 263 | "source": [ 264 | "def sample_triplets(pos_data, max_item_id, random_seed=0):\n", 265 | " \"\"\"Sample negatives at random\"\"\"\n", 266 | " rng = np.random.RandomState(random_seed)\n", 267 | " user_ids = pos_data['user_id'].values\n", 268 | " pos_item_ids = pos_data['item_id'].values\n", 269 | "\n", 270 | " neg_item_ids = rng.randint(low=1, high=max_item_id + 1,\n", 271 | " size=len(user_ids))\n", 272 | " return [ user_ids, pos_item_ids,neg_item_ids]" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "## Training the Triplet Model\n", 280 | "\n", 281 | "Let's now fit the parameters of the model by sampling triplets: for each user, select a movie in the positive feedback set of that user and randomly sample another movie to serve as negative item.\n", 282 | "\n", 283 | "Note that this sampling scheme could be improved by removing items that are marked as positive in the data to remove some label noise. In practice this does not seem to be a problem though." 
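
The improvement mentioned above (not drawing negatives from a user's known positives) can be sketched with rejection sampling. This is a minimal illustration in plain Python, not the notebook's `sample_triplets`: the function name `sample_triplets_filtered`, the `max_tries` parameter, and the toy interactions are assumptions made for the example.

```python
import random


def sample_triplets_filtered(interactions, max_item_id, seed=0, max_tries=100):
    """Return (user, positive_item, negative_item) triplets.

    interactions: list of (user_id, item_id) positive pairs.
    A sampled negative is rejected if it is a known positive for that user.
    """
    rng = random.Random(seed)
    # Index the known positives of each user for fast membership tests.
    positives = {}
    for user, item in interactions:
        positives.setdefault(user, set()).add(item)

    triplets = []
    for user, item in interactions:
        for _ in range(max_tries):
            negative = rng.randint(1, max_item_id)  # item ids 1..max_item_id inclusive
            if negative not in positives[user]:
                triplets.append((user, item, negative))
                break
        else:
            # No valid negative found within max_tries: skip this pair.
            continue
    return triplets


# Hypothetical toy data: user 1 liked items 1-3, user 2 liked item 2.
toy_positives = [(1, 1), (1, 2), (1, 3), (2, 2)]
triplets = sample_triplets_filtered(toy_positives, max_item_id=5)
print(triplets)
```

On MovieLens-100k each user has rated only a small fraction of the catalogue, so rejections are rare, which fits the observation that plain uniform sampling already works well in practice.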
284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "Let's train the triplet model:" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "metadata": {}, 297 | "outputs": [], 298 | "source": [ 299 | "n_epochs = 1000\n", 300 | "losses_train = []\n", 301 | "losses_val = []\n", 302 | " \n", 303 | "with tf.Session(graph=g) as sess:\n", 304 | " # initializer\n", 305 | " sess.run(tf.global_variables_initializer())\n", 306 | " \n", 307 | " def check_overfit(validation_loss):\n", 308 | " n = len(validation_loss)\n", 309 | " if n < 5:\n", 310 | " return False\n", 311 | " count = 0 \n", 312 | " for i in range(n-4, n):\n", 313 | " if validation_loss[i] < validation_loss[i-1]:\n", 314 | " count += 1\n", 315 | " if count >=2:\n", 316 | " return False\n", 317 | " return True\n", 318 | " \n", 319 | " for i in range(n_epochs):\n", 320 | " triplet_inputs_train = sample_triplets(pos_data_train, max_item_id,random_seed=i)\n", 321 | " triplet_inputs_val = sample_triplets(pos_data_test, max_item_id,random_seed=i+1)\n", 322 | " \n", 323 | " train_input_dict = {user_input: triplet_inputs_train[0]\n", 324 | " , positive_item_input: triplet_inputs_train[1]\n", 325 | " , negative_item_input: triplet_inputs_train[2]}\n", 326 | " \n", 327 | " val_input_dict = {user_input: triplet_inputs_val[0]\n", 328 | " , positive_item_input: triplet_inputs_val[1]\n", 329 | " , negative_item_input: triplet_inputs_val[2]}\n", 330 | " sess.run([train_ops], feed_dict=train_input_dict)\n", 331 | " \n", 332 | " if i % 10 == 0:\n", 333 | " loss_train = sess.run(loss, feed_dict=train_input_dict)\n", 334 | " loss_val = sess.run(loss, feed_dict=val_input_dict)\n", 335 | "\n", 336 | " losses_train.append(loss_train)\n", 337 | " losses_val.append(loss_val)\n", 338 | "\n", 339 | " # check early stopping \n", 340 | " if(check_overfit(losses_val)):\n", 341 | " print('overfit !')\n", 342 | " break\n", 343 | "\n", 344 | " \n", 345 | " # 
calculate AUC Score \n", 346 | " \"\"\"Compute the ROC AUC for each user and average over users\"\"\"\n", 347 | " max_user_id = max(pos_data_train['user_id'].max(), pos_data_test['user_id'].max())\n", 348 | " max_item_id = max(pos_data_train['item_id'].max(), pos_data_test['item_id'].max())\n", 349 | " user_auc_scores = []\n", 350 | " for user_id in range(1, max_user_id + 1):\n", 351 | " pos_item_train = pos_data_train[pos_data_train['user_id'] == user_id]\n", 352 | " pos_item_test = pos_data_test[pos_data_test['user_id'] == user_id]\n", 353 | "\n", 354 | " # Rank only the items that are not positives for this user in the training set\n", 355 | " all_item_ids = np.arange(1, max_item_id + 1)\n", 356 | " items_to_rank = np.setdiff1d(all_item_ids, pos_item_train['item_id'].values)\n", 357 | "\n", 358 | " # Ground truth: return 1 for each item positively present in the test set\n", 359 | " # and 0 otherwise.\n", 360 | " expected = np.in1d(items_to_rank, pos_item_test['item_id'].values)\n", 361 | "\n", 362 | " if np.sum(expected) >= 1:\n", 363 | " # At least one positive test value to rank\n", 364 | " repeated_user_id = np.empty_like(items_to_rank)\n", 365 | " repeated_user_id.fill(user_id)\n", 366 | " predicted = sess.run(positive_similarity, feed_dict={user_input : repeated_user_id, \n", 367 | " positive_item_input : items_to_rank})\n", 368 | " user_auc_scores.append(roc_auc_score(expected, predicted))\n", 369 | "\n", 370 | " print(\"iteration : %d train loss: %.3f , valid loss %.3f , ROC auc %.4f\" % (i,loss_train, loss_val,np.mean(user_auc_scores)))\n" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "## Training a Deep Matching Model on Implicit Feedback\n", 378 | "\n", 379 | "\n", 380 | "Instead of using a hard-coded dot-product similarity to predict the match of a `(user_id, item_id)` pair, we can specify a deep neural network based parametrisation of the similarity. 
The parameters of that matching model are also trained with the margin comparator loss:\n", 381 | "\n", 382 | "\n", 383 | "\n", 384 | "\n" 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": null, 390 | "metadata": {}, 391 | "outputs": [], 392 | "source": [ 393 | "user_embedding_size = 32\n", 394 | "item_embedding_size = 64\n", 395 | "num_hidden = 64\n", 396 | "\n", 397 | "reg_param = 0.01\n", 398 | "learning_rate = 0.01\n", 399 | "n_users = max_user_id + 1\n", 400 | "n_items = max_item_id + 1\n", 401 | "\n", 402 | "g = tf.Graph()\n", 403 | "with g.as_default():\n", 404 | " \n", 405 | " # setting up random seed\n", 406 | " tf.set_random_seed(1234)\n", 407 | "\n", 408 | " user_input = tf.placeholder(shape=[None,1], dtype=tf.int64, name='user_input')\n", 409 | " positive_item_input = tf.placeholder(shape=[None,1], dtype=tf.int64, name='positive_item_input')\n", 410 | " negative_item_input = tf.placeholder(shape=[None,1], dtype=tf.int64, name='negative_item_input')\n", 411 | " \n", 412 | " l2_loss = tf.constant(0.0)\n", 413 | " \n", 414 | " # embeddding layer\n", 415 | " with tf.variable_scope(\"embedding\"):\n", 416 | " user_weights = tf.get_variable(\"user_w\"\n", 417 | " , shape=[n_users, user_embedding_size]\n", 418 | " , dtype=tf.float32\n", 419 | " , initializer=layers.xavier_initializer())\n", 420 | " \n", 421 | " item_weights = tf.get_variable(\"item_w\"\n", 422 | " , shape=[n_items, item_embedding_size]\n", 423 | " , dtype=tf.float32\n", 424 | " , initializer=layers.xavier_initializer())\n", 425 | " \n", 426 | " user_embedding = tf.squeeze(tf.nn.embedding_lookup(user_weights, user_input),axis=1, name='user_embedding')\n", 427 | " positive_item_embedding = tf.squeeze(tf.nn.embedding_lookup(item_weights, positive_item_input),axis=1, name='positive_item_embedding')\n", 428 | " negative_item_embedding = tf.squeeze(tf.nn.embedding_lookup(item_weights, negative_item_input),axis=1, name='negative_item_embedding')\n", 429 | " \n", 430 | " 
l2_loss += tf.nn.l2_loss(user_weights)\n", 431 | " l2_loss += tf.nn.l2_loss(item_weights)\n", 432 | " \n", 433 | " \n", 434 | " print(user_embedding)\n", 435 | " print(positive_item_embedding)\n", 436 | " print(negative_item_embedding)\n", 437 | " \n", 438 | " \n", 439 | " # combine inputs\n", 440 | " with tf.name_scope('concatenation'):\n", 441 | " positive_embeddings_pair = tf.concat([user_embedding, positive_item_embedding], axis=1)\n", 442 | " negative_embeddings_pair = tf.concat([user_embedding, negative_item_embedding], axis=1)\n", 443 | " print(positive_embeddings_pair)\n", 444 | " print(negative_embeddings_pair)\n", 445 | " \n", 446 | " # fc-1\n", 447 | " \n", 448 | " with tf.name_scope(\"fc_1\"):\n", 449 | " W_fc_1 = tf.get_variable(\n", 450 | " \"W_hidden\",\n", 451 | " shape=[user_embedding_size + item_embedding_size, num_hidden],\n", 452 | " initializer=tf.contrib.layers.xavier_initializer())\n", 453 | " b_fc_1 = tf.Variable(tf.constant(0.1, shape=[num_hidden]), name=\"b\")\n", 454 | " hidden_output_positive = tf.nn.relu(tf.nn.xw_plus_b(positive_embeddings_pair, W_fc_1, b_fc_1), name='hidden_output_positive')\n", 455 | " hidden_output_negative = tf.nn.relu(tf.nn.xw_plus_b(negative_embeddings_pair, W_fc_1, b_fc_1), name='hidden_output_negative')\n", 456 | " \n", 457 | " l2_loss += tf.nn.l2_loss(W_fc_1)\n", 458 | " print(hidden_output_positive)\n", 459 | " print(hidden_output_negative)\n", 460 | " \n", 461 | " # dropout\n", 462 | " with tf.name_scope(\"dropout\"):\n", 463 | " h_drop_positive = tf.nn.dropout(hidden_output_positive, 0.8, name=\"hidden_output_drop_positive\")\n", 464 | " h_drop_negative = tf.nn.dropout(hidden_output_negative, 0.8, name=\"hidden_output_drop_negative\")\n", 465 | " print(h_drop_positive)\n", 466 | " print(h_drop_negative)\n", 467 | " \n", 468 | " # fc-2\n", 469 | " with tf.name_scope(\"fc_2\"):\n", 470 | " W_fc_2 = tf.get_variable(\n", 471 | " \"W_output\",\n", 472 | " shape=[num_hidden,1],\n", 473 | " 
initializer=tf.contrib.layers.xavier_initializer())\n", 474 | " b_fc_2 = tf.Variable(tf.constant(0.1, shape=[1]), name=\"b\")\n", 475 | " positive_prediction = tf.nn.xw_plus_b(h_drop_positive, W_fc_2, b_fc_2, name='positive_prediction')\n", 476 | " negative_prediction = tf.nn.xw_plus_b(h_drop_negative, W_fc_2, b_fc_2, name='negative_prediction')\n", 477 | " \n", 478 | " l2_loss += tf.nn.l2_loss(W_fc_2)\n", 479 | " print(positive_prediction)\n", 480 | " print(negative_prediction)\n", 481 | "\n", 482 | " # loss\n", 483 | " with tf.name_scope(\"loss\"):\n", 484 | " triplet_loss = tf.maximum(negative_prediction - positive_prediction + margin, 0)\n", 485 | " loss = tf.reduce_mean(triplet_loss) + reg_param * l2_loss\n", 486 | " train_ops = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)\n", 487 | " " 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": null, 493 | "metadata": {}, 494 | "outputs": [], 495 | "source": [ 496 | "n_epochs = 1000\n", 497 | "losses_train = []\n", 498 | "losses_val = []\n", 499 | "\n", 500 | "with tf.Session(graph=g) as sess:\n", 501 | " # initializer\n", 502 | " sess.run(tf.global_variables_initializer())\n", 503 | " \n", 504 | " def check_overfit(validation_loss):\n", 505 | " n = len(validation_loss)\n", 506 | " if n < 5:\n", 507 | " return False\n", 508 | " count = 0 \n", 509 | " for i in range(n-4, n):\n", 510 | " if validation_loss[i] < validation_loss[i-1]:\n", 511 | " count += 1\n", 512 | " if count >=3:\n", 513 | " return False\n", 514 | " return True\n", 515 | " \n", 516 | " for i in range(n_epochs):\n", 517 | " triplet_inputs_train = sample_triplets(pos_data_train, max_item_id,random_seed=i)\n", 518 | " triplet_inputs_val = sample_triplets(pos_data_test, max_item_id,random_seed=i+1)\n", 519 | " \n", 520 | " train_input_dict = {user_input: triplet_inputs_train[0].reshape([-1,1])\n", 521 | " , positive_item_input: triplet_inputs_train[1].reshape([-1,1])\n", 522 | " , negative_item_input: 
triplet_inputs_train[2].reshape([-1,1])}\n", 523 | " \n", 524 | " val_input_dict = {user_input: triplet_inputs_val[0].reshape([-1,1])\n", 525 | " , positive_item_input: triplet_inputs_val[1].reshape([-1,1])\n", 526 | " , negative_item_input: triplet_inputs_val[2].reshape([-1,1])}\n", 527 | " sess.run([train_ops], feed_dict=train_input_dict)\n", 528 | " \n", 529 | " if i % 10 == 0:\n", 530 | " loss_train = sess.run(loss, feed_dict=train_input_dict)\n", 531 | " loss_val = sess.run(loss, feed_dict=val_input_dict)\n", 532 | "\n", 533 | " losses_train.append(loss_train)\n", 534 | " losses_val.append(loss_val)\n", 535 | "\n", 536 | " # check early stopping \n", 537 | " if(check_overfit(losses_val)):\n", 538 | " print('overfit !')\n", 539 | " break\n", 540 | "\n", 541 | " \n", 542 | " # calculate AUC Score \n", 543 | " \"\"\"Compute the ROC AUC for each user and average over users\"\"\"\n", 544 | " max_user_id = max(pos_data_train['user_id'].max(), pos_data_test['user_id'].max())\n", 545 | " max_item_id = max(pos_data_train['item_id'].max(), pos_data_test['item_id'].max())\n", 546 | " user_auc_scores = []\n", 547 | " for user_id in range(1, max_user_id + 1):\n", 548 | " pos_item_train = pos_data_train[pos_data_train['user_id'] == user_id]\n", 549 | " pos_item_test = pos_data_test[pos_data_test['user_id'] == user_id]\n", 550 | "\n", 551 | " # Rank only the items that are not positives for this user in the training set\n", 552 | " all_item_ids = np.arange(1, max_item_id + 1)\n", 553 | " items_to_rank = np.setdiff1d(all_item_ids, pos_item_train['item_id'].values)\n", 554 | "\n", 555 | " # Ground truth: return 1 for each item positively present in the test set\n", 556 | " # and 0 otherwise.\n", 557 | " expected = np.in1d(items_to_rank, pos_item_test['item_id'].values)\n", 558 | "\n", 559 | " if np.sum(expected) >= 1:\n", 560 | " # At least one positive test value to rank\n", 561 | " repeated_user_id = np.empty_like(items_to_rank)\n", 562 | " repeated_user_id.fill(user_id)\n", 563 | " predicted = 
sess.run(positive_prediction, feed_dict={user_input : repeated_user_id.reshape([-1,1]), \n", 564 | " positive_item_input : items_to_rank.reshape([-1,1])})\n", 565 | " user_auc_scores.append(roc_auc_score(expected, predicted))\n", 566 | "\n", 567 | " print(\"iteration : %d train loss: %.3f , valid loss %.3f , ROC auc %.4f\" % (i,loss_train, loss_val,np.mean(user_auc_scores)))\n", 568 | "\n", 569 | "\n", 570 | " \n", 571 | " \n", 572 | "\n", 573 | "\n" 574 | ] 575 | }, 576 | { 577 | "cell_type": "markdown", 578 | "metadata": {}, 579 | "source": [ 580 | "## Possible Extensions\n", 581 | "\n", 582 | "You can implement any of the following ideas if you want to get a deeper understanding of recommender systems.\n", 583 | "\n", 584 | "\n", 585 | "### Leverage User and Item metadata\n", 586 | "\n", 587 | "As we did for the Explicit Feedback model, it's also possible to extend our models to take additional user and item metadata as side information when computing the match score.\n", 588 | "\n", 589 | "\n", 590 | "### Better Ranking Metrics\n", 591 | "\n", 592 | "In this notebook we evaluated the quality of the ranked recommendations using the ROC AUC metric. This score reflects the ability of the model to correctly rank any pair of items (sampled uniformly at random among all possible items).\n", 593 | "\n", 594 | "In practice recommender systems will only display a few recommendations to the user (typically 1 to 10). It is typically more informative to use an evaluation metric that characterizes the quality of the top ranked items and attributes less or no importance to items that are not good recommendations for a specific user. Popular ranking metrics therefore include the **Precision at k** and the **Mean Average Precision**.\n", 595 | "\n", 596 | "\n", 597 | "\n", 598 | "### Hard Negatives Sampling\n", 599 | "\n", 600 | "In this experiment we sampled negative items uniformly at random. 
However, after training the model for a while, it is possible that the vast majority of sampled negatives have a similarity already much lower than the positive pair and that the margin comparator loss sets the majority of the gradients to zero, effectively wasting a lot of computation.\n", 601 | "\n", 602 | "Given the current state of the recsys model we could sample harder negatives with a larger likelihood to train the model better closer to its decision boundary. This strategy is implemented in the WARP loss [1].\n", 603 | "\n", 604 | "The main drawback of hard negative sampling is that it increases the risk of severe overfitting if a significant fraction of the labels are noisy.\n", 605 | "\n" 606 | ] 607 | }, 608 | { 609 | "cell_type": "code", 610 | "execution_count": null, 611 | "metadata": {}, 612 | "outputs": [], 613 | "source": [] 614 | } 615 | ], 616 | "metadata": { 617 | "kernelspec": { 618 | "display_name": "Python 3", 619 | "language": "python", 620 | "name": "python3" 621 | }, 622 | "language_info": { 623 | "codemirror_mode": { 624 | "name": "ipython", 625 | "version": 3 626 | }, 627 | "file_extension": ".py", 628 | "mimetype": "text/x-python", 629 | "name": "python", 630 | "nbconvert_exporter": "python", 631 | "pygments_lexer": "ipython3", 632 | "version": "3.6.3" 633 | } 634 | }, 635 | "nbformat": 4, 636 | "nbformat_minor": 2 637 | } 638 | -------------------------------------------------------------------------------- /05-Implicit_Feedback_with_the_triplet_loss_with_Tensorflow_Rendered.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Triplet Loss for Implicit Feedback Neural Recommender Systems\n", 8 | "\n", 9 | "Goals:\n", 10 | "- Understand a bi-linear recommender system that uses only positive feedback data\n", 11 | "- Use Margin Based Comparator / Triplet Loss\n", 12 | "- Build a deep learning architecture using similar 
design principles\n", 13 | "\n", 14 | "This notebook is inspired by Olivier Grisel's notebook, which used Keras\n", 15 | "https://github.com/ogrisel to build the models. We will use the basic TensorFlow APIs instead. You can also look into Maciej Kula's work [Recommendations in Keras using triplet loss](\n", 16 | "https://github.com/maciejkula/triplet_recommendations_keras), which uses BPR (Bayesian Personalized Ranking). \n" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 2, 22 | "metadata": {}, 23 | "outputs": [ 24 | { 25 | "name": "stdout", 26 | "output_type": "stream", 27 | "text": [ 28 | "1.5.0\n" 29 | ] 30 | } 31 | ], 32 | "source": [ 33 | "%matplotlib inline\n", 34 | "import matplotlib.pyplot as plt\n", 35 | "import numpy as np\n", 36 | "import pandas as pd\n", 37 | "import os\n", 38 | "import os.path as op\n", 39 | "from sklearn.metrics import roc_auc_score\n", 40 | "from tensorflow.contrib import layers\n", 41 | "import tensorflow as tf\n", 42 | "print(tf.__version__)" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 3, 48 | "metadata": {}, 49 | "outputs": [ 50 | { 51 | "data": { 52 | "text/html": [ 53 | "
" 137 | ], 138 | "text/plain": [ 139 | " user_id item_id rating timestamp\n", 140 | "count 90570.000000 90570.000000 90570.000000 9.057000e+04\n", 141 | "mean 461.494038 428.104891 3.523827 8.835073e+08\n", 142 | "std 266.004364 333.088029 1.126073 5.341684e+06\n", 143 | "min 1.000000 1.000000 1.000000 8.747247e+08\n", 144 | "25% 256.000000 174.000000 3.000000 8.794484e+08\n", 145 | "50% 442.000000 324.000000 4.000000 8.828143e+08\n", 146 | "75% 682.000000 636.000000 4.000000 8.882049e+08\n", 147 | "max 943.000000 1682.000000 5.000000 8.932866e+08" 148 | ] 149 | }, 150 | "execution_count": 3, 151 | "metadata": {}, 152 | "output_type": "execute_result" 153 | } 154 | ], 155 | "source": [ 156 | "# Base Path for MovieLens dataset\n", 157 | "ML_100K_PATH = os.path.join('processed','ml-100k','ml-100k')\n", 158 | "\n", 159 | "data_train = pd.read_csv(op.join(ML_100K_PATH, 'ua.base'), sep='\\t',\n", 160 | " names=[\"user_id\", \"item_id\", \"rating\", \"timestamp\"])\n", 161 | "data_test = pd.read_csv(op.join(ML_100K_PATH, 'ua.test'), sep='\\t',\n", 162 | " names=[\"user_id\", \"item_id\", \"rating\", \"timestamp\"])\n", 163 | "\n", 164 | "data_train.describe()" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 4, 170 | "metadata": {}, 171 | "outputs": [], 172 | "source": [ 173 | "def get_release_year(x):\n", 174 | " splits = str(x).split('-')\n", 175 | " if(len(splits) == 3):\n", 176 | " return int(splits[2])\n", 177 | " else:\n", 178 | " return 1920\n", 179 | " \n", 180 | "\n", 181 | "m_cols = ['item_id', 'title', 'release_date', 'video_release_date', 'imdb_url']\n", 182 | "items = pd.read_csv(op.join(ML_100K_PATH, 'u.item'), sep='|',\n", 183 | " names=m_cols, usecols=range(5), encoding='latin-1')\n", 184 | "items['release_year'] = items['release_date'].map(get_release_year)\n", 185 | "\n", 186 | "data_train = pd.merge(data_train, items)\n", 187 | "data_test = pd.merge(data_test, items)" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | 
"execution_count": 5, 193 | "metadata": {}, 194 | "outputs": [ 195 | { 196 | "data": { 197 | "text/html": [ 198 | "
" 291 | ], 292 | "text/plain": [ 293 | " user_id item_id rating timestamp title release_date \\\n", 294 | "0 1 1 5 874965758 Toy Story (1995) 01-Jan-1995 \n", 295 | "1 2 1 4 888550871 Toy Story (1995) 01-Jan-1995 \n", 296 | "2 6 1 4 883599478 Toy Story (1995) 01-Jan-1995 \n", 297 | "3 10 1 4 877888877 Toy Story (1995) 01-Jan-1995 \n", 298 | "4 13 1 3 882140487 Toy Story (1995) 01-Jan-1995 \n", 299 | "\n", 300 | " video_release_date imdb_url \\\n", 301 | "0 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... \n", 302 | "1 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... \n", 303 | "2 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... \n", 304 | "3 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... \n", 305 | "4 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... \n", 306 | "\n", 307 | " release_year \n", 308 | "0 1995 \n", 309 | "1 1995 \n", 310 | "2 1995 \n", 311 | "3 1995 \n", 312 | "4 1995 " 313 | ] 314 | }, 315 | "execution_count": 5, 316 | "metadata": {}, 317 | "output_type": "execute_result" 318 | } 319 | ], 320 | "source": [ 321 | "data_train.head()" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 6, 327 | "metadata": {}, 328 | "outputs": [ 329 | { 330 | "name": "stdout", 331 | "output_type": "stream", 332 | "text": [ 333 | "n_users=944, n_items=1683\n" 334 | ] 335 | } 336 | ], 337 | "source": [ 338 | "max_user_id = max(data_train['user_id'].max(), data_test['user_id'].max())\n", 339 | "max_item_id = max(data_train['item_id'].max(), data_test['item_id'].max())\n", 340 | "\n", 341 | "n_users = max_user_id + 1\n", 342 | "n_items = max_item_id + 1\n", 343 | "\n", 344 | "print('n_users=%d, n_items=%d' % (n_users, n_items))" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": {}, 350 | "source": [ 351 | "## Implicit feedback data\n", 352 | "\n", 353 | "Consider ratings >= 4 as positive feed back and ignore the rest:" 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": 7, 359 
| "metadata": {}, 360 | "outputs": [ 361 | { 362 | "name": "stdout", 363 | "output_type": "stream", 364 | "text": [ 365 | "3.5238268742409184\n" 366 | ] 367 | }, 368 | { 369 | "data": { 370 | "image/png": "(base64-encoded PNG data omitted: histogram of the training-set ratings)\n", 371 | "text/plain": [ 372 | "" 373 | ] 374 | }, 375 | "metadata": {}, 376 | "output_type": "display_data" 377 | } 378 | ], 379 | "source": [ 380 | "data_train['rating'].plot(kind='hist');\n", 381 | "print(data_train['rating'].mean())" 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 8, 387 | "metadata": {}, 388 | "outputs": [], 389 | "source": [ 390 | "pos_data_train = data_train.query(\"rating >= 4\")\n", 391 | "pos_data_test = data_test.query(\"rating >= 4\")" 392 | ] 393 | }, 394 | { 395 | "cell_type": "markdown", 396 | "metadata": {}, 397 | "source": [ 398 | "Because the mean rating is around 3.5, this cut will remove approximately half of the ratings from the datasets:" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": 9, 404 | "metadata": {}, 405 | "outputs": [ 406 | { 407 | "data": { 408 | "text/plain": [ 409 | "49906" 410 | ] 411 | }, 412 | "execution_count": 9, 413 | "metadata": {}, 414 | "output_type": "execute_result" 415 | } 416 | ], 417 | "source": [ 418 | "pos_data_train['rating'].count()" 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": 10, 424 | "metadata": {}, 425 | "outputs": [ 426 | { 427 | "data": { 428 | "text/plain": [ 429 | "5469" 430 | ] 431 | }, 432 | "execution_count": 10, 433 | "metadata": {}, 434 | "output_type": "execute_result" 435 | } 436 | ], 437 | "source": [ 438 | "pos_data_test['rating'].count()" 439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "metadata": {}, 444 | "source": [ 445 | "## The Triplet Loss\n", 446 | "\n", 447 | "The following section demonstrates how to build a low-rank quadratic interaction model between users and items. 
The similarity score between a user and an item is defined by the unnormalized dot product of their respective embeddings.\n", 448 | "\n", 449 | "The matching scores can be used to rank items to recommend to a specific user.\n", 450 | "\n", 451 | "Training of the model parameters is achieved by randomly sampling negative items not seen by a pre-selected anchor user. We want the model embedding matrices to be such that the similarity between the user vector and the negative vector is smaller than the similarity between the user vector and the positive item vector. Furthermore, we use a margin to push the negative item further apart from the anchor user.\n", 452 | "\n", 453 | "Here is the architecture of such a triplet model. The triplet name comes from the fact that the loss to optimize is defined for a triple `(anchor_user, positive_item, negative_item)`:\n", 454 | "\n", 455 | "\n", 456 | "\n", 457 | "We call this model a triplet model with bi-linear interactions because the similarity between a user and an item is captured by a dot product of the first level embedding vectors. This is therefore not a deep architecture." 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": {}, 463 | "source": [ 464 | "Here is the actual code that builds the model(s) with shared weights. Note that the code in this notebook scores each pair with the unnormalized dot product; cosine similarity is an alternative that seems to yield comparable results." 465 | ] 466 | }, 467 | { 468 | "cell_type": "markdown", 469 | "metadata": {}, 470 | "source": [ 471 | "## Quality of Ranked Recommendations\n", 472 | "\n", 473 | "Now that we have a randomly initialized model we can start computing random recommendations. 
To assess their quality we do the following for each user:\n", 474 | "\n", 475 | "- compute matching scores for items (except the movies that the user has already seen in the training set),\n", 476 | "- compare to the positive feedback actually collected on the test set using the ROC AUC ranking metric,\n", 477 | "- average ROC AUC scores across users to get the average performance of the recommender model on the test set." 478 | ] 479 | }, 480 | { 481 | "cell_type": "markdown", 482 | "metadata": {}, 483 | "source": [ 484 | "By default the model should make predictions that rank the items in random order. The **ROC AUC score** is a ranking score that represents the **expected value of correctly ordering uniformly sampled pairs of recommendations**.\n", 485 | "\n", 486 | "A random (untrained) model should yield 0.50 ROC AUC on average. " 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": 11, 492 | "metadata": {}, 493 | "outputs": [], 494 | "source": [ 495 | "embedding_size = 64 # embedding size\n", 496 | "reg_param = 0.01 # regularization parameter lambda\n", 497 | "learning_rate = 0.01 # learning rate \n", 498 | "margin = 1.0 # margin \n", 499 | "\n", 500 | "# create tensorflow graph\n", 501 | "g = tf.Graph()\n", 502 | "with g.as_default():\n", 503 | " \n", 504 | " # setting up random seed\n", 505 | " tf.set_random_seed(1234)\n", 506 | " \n", 507 | " # placeholders\n", 508 | " user_input = tf.placeholder(shape=[None], dtype=tf.int64)\n", 509 | " positive_item_input = tf.placeholder(shape=[None], dtype=tf.int64)\n", 510 | " negative_item_input = tf.placeholder(shape=[None], dtype=tf.int64)\n", 511 | " \n", 512 | " # variables\n", 513 | " with tf.variable_scope(\"embedding\"):\n", 514 | " user_weight = tf.get_variable(\"user_w\"\n", 515 | " , shape=[max_user_id + 1, embedding_size]\n", 516 | " , dtype=tf.float32\n", 517 | " , initializer=layers.xavier_initializer())\n", 518 | "\n", 519 | " item_weight = tf.get_variable(\"item_w\"\n", 520 | " 
, shape=[max_item_id + 1, embedding_size]\n", 521 | " , dtype=tf.float32\n", 522 | " , initializer=layers.xavier_initializer())\n", 523 | " # embedding\n", 524 | " with tf.name_scope(\"embedding\"):\n", 525 | " user_embedding = tf.nn.embedding_lookup(user_weight, user_input)\n", 526 | " positive_item_embedding = tf.nn.embedding_lookup(item_weight, positive_item_input)\n", 527 | " negative_item_embedding = tf.nn.embedding_lookup(item_weight, negative_item_input)\n", 528 | " \n", 529 | " # similarity\n", 530 | " with tf.name_scope(\"similarity\"):\n", 531 | " positive_similarity = tf.reduce_sum(tf.multiply(user_embedding, positive_item_embedding), 1) \n", 532 | " negative_similarity = tf.reduce_sum(tf.multiply(user_embedding, negative_item_embedding), 1) \n", 533 | " \n", 534 | " # loss \n", 535 | " with tf.name_scope(\"loss\"):\n", 536 | " triplet_loss = tf.maximum(negative_similarity - positive_similarity + margin, 0)\n", 537 | " loss = tf.reduce_mean(triplet_loss)\n", 538 | " train_ops = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)\n", 539 | " \n", 540 | " \n", 541 | " " 542 | ] 543 | }, 544 | { 545 | "cell_type": "code", 546 | "execution_count": 12, 547 | "metadata": {}, 548 | "outputs": [], 549 | "source": [ 550 | "def sample_triplets(pos_data, max_item_id, random_seed=0):\n", 551 | " \"\"\"Sample negatives at random\"\"\"\n", 552 | " rng = np.random.RandomState(random_seed)\n", 553 | " user_ids = pos_data['user_id'].values\n", 554 | " pos_item_ids = pos_data['item_id'].values\n", 555 | "\n", 556 | " neg_item_ids = rng.randint(low=1, high=max_item_id + 1,\n", 557 | " size=len(user_ids))\n", 558 | " return [ user_ids, pos_item_ids,neg_item_ids]" 559 | ] 560 | }, 561 | { 562 | "cell_type": "markdown", 563 | "metadata": {}, 564 | "source": [ 565 | "## Training the Triplet Model\n", 566 | "\n", 567 | "Let's now fit the parameters of the model by sampling triplets: for each user, select a movie in the positive feedback set of that user and 
randomly sample another movie to serve as negative item.\n", 568 | "\n", 569 | "Note that this sampling scheme could be improved by removing items that are marked as positive in the data to remove some label noise. In practice this does not seem to be a problem though." 570 | ] 571 | }, 572 | { 573 | "cell_type": "markdown", 574 | "metadata": {}, 575 | "source": [ 576 | "Let's train the triplet model:" 577 | ] 578 | }, 579 | { 580 | "cell_type": "code", 581 | "execution_count": 13, 582 | "metadata": {}, 583 | "outputs": [ 584 | { 585 | "name": "stdout", 586 | "output_type": "stream", 587 | "text": [ 588 | "iteration : 0 train loss: 0.992 , valid loss 1.000 , ROC auc 0.5014\n", 589 | "iteration : 10 train loss: 0.670 , valid loss 0.751 , ROC auc 0.8609\n", 590 | "iteration : 20 train loss: 0.325 , valid loss 0.361 , ROC auc 0.8819\n", 591 | "iteration : 30 train loss: 0.273 , valid loss 0.283 , ROC auc 0.9002\n", 592 | "iteration : 40 train loss: 0.223 , valid loss 0.237 , ROC auc 0.9179\n", 593 | "iteration : 50 train loss: 0.194 , valid loss 0.226 , ROC auc 0.9238\n", 594 | "iteration : 60 train loss: 0.172 , valid loss 0.233 , ROC auc 0.9257\n", 595 | "iteration : 70 train loss: 0.159 , valid loss 0.219 , ROC auc 0.9264\n", 596 | "iteration : 80 train loss: 0.146 , valid loss 0.236 , ROC auc 0.9257\n", 597 | "iteration : 90 train loss: 0.135 , valid loss 0.236 , ROC auc 0.9246\n", 598 | "iteration : 100 train loss: 0.126 , valid loss 0.248 , ROC auc 0.9236\n", 599 | "iteration : 110 train loss: 0.120 , valid loss 0.242 , ROC auc 0.9224\n", 600 | "iteration : 120 train loss: 0.116 , valid loss 0.268 , ROC auc 0.9211\n", 601 | "iteration : 130 train loss: 0.110 , valid loss 0.242 , ROC auc 0.9197\n", 602 | "iteration : 140 train loss: 0.108 , valid loss 0.253 , ROC auc 0.9184\n", 603 | "overfit !\n" 604 | ] 605 | } 606 | ], 607 | "source": [ 608 | "n_epochs = 1000\n", 609 | "losses_train = []\n", 610 | "losses_val = []\n", 611 | " \n", 612 | "with 
tf.Session(graph=g) as sess:\n", 613 | " # initialize all variables\n", 614 | " sess.run(tf.global_variables_initializer())\n", 615 | " \n", 616 | " def check_overfit(validation_loss):\n", 617 | " n = len(validation_loss)\n", 618 | " if n < 5:\n", 619 | " return False\n", 620 | " count = 0 \n", 621 | " for i in range(n-4, n):\n", 622 | " if validation_loss[i] < validation_loss[i-1]:\n", 623 | " count += 1\n", 624 | " if count >=2:\n", 625 | " return False\n", 626 | " return True\n", 627 | " \n", 628 | " for i in range(n_epochs):\n", 629 | " triplet_inputs_train = sample_triplets(pos_data_train, max_item_id,random_seed=i)\n", 630 | " triplet_inputs_val = sample_triplets(pos_data_test, max_item_id,random_seed=i+1)\n", 631 | " \n", 632 | " train_input_dict = {user_input: triplet_inputs_train[0]\n", 633 | " , positive_item_input: triplet_inputs_train[1]\n", 634 | " , negative_item_input: triplet_inputs_train[2]}\n", 635 | " \n", 636 | " val_input_dict = {user_input: triplet_inputs_val[0]\n", 637 | " , positive_item_input: triplet_inputs_val[1]\n", 638 | " , negative_item_input: triplet_inputs_val[2]}\n", 639 | " sess.run([train_ops], feed_dict=train_input_dict)\n", 640 | " \n", 641 | " if i % 10 == 0:\n", 642 | " loss_train = sess.run(loss, feed_dict=train_input_dict)\n", 643 | " loss_val = sess.run(loss, feed_dict=val_input_dict)\n", 644 | "\n", 645 | " losses_train.append(loss_train)\n", 646 | " losses_val.append(loss_val)\n", 647 | "\n", 648 | " # check early stopping \n", 649 | " if(check_overfit(losses_val)):\n", 650 | " print('overfit !')\n", 651 | " break\n", 652 | "\n", 653 | " \n", 654 | " # calculate AUC Score \n", 655 | " # Compute the ROC AUC for each user and average over users\n", 656 | " max_user_id = max(pos_data_train['user_id'].max(), pos_data_test['user_id'].max())\n", 657 | " max_item_id = max(pos_data_train['item_id'].max(), pos_data_test['item_id'].max())\n", 658 | " user_auc_scores = []\n", 659 | " for user_id in range(1, max_user_id + 1):\n", 660 | 
" pos_item_train = pos_data_train[pos_data_train['user_id'] == user_id]\n", 661 | " pos_item_test = pos_data_test[pos_data_test['user_id'] == user_id]\n", 662 | "\n", 663 | " # Rank every item except those the user already rated positively in the training set\n", 664 | " all_item_ids = np.arange(1, max_item_id + 1)\n", 665 | " items_to_rank = np.setdiff1d(all_item_ids, pos_item_train['item_id'].values)\n", 666 | "\n", 667 | " # Ground truth: return 1 for each item positively present in the test set\n", 668 | " # and 0 otherwise.\n", 669 | " expected = np.in1d(items_to_rank, pos_item_test['item_id'].values)\n", 670 | "\n", 671 | " if np.sum(expected) >= 1:\n", 672 | " # At least one positive test value to rank\n", 673 | " repeated_user_id = np.empty_like(items_to_rank)\n", 674 | " repeated_user_id.fill(user_id)\n", 675 | " predicted = sess.run(positive_similarity, feed_dict={user_input : repeated_user_id, \n", 676 | " positive_item_input : items_to_rank})\n", 677 | " user_auc_scores.append(roc_auc_score(expected, predicted))\n", 678 | "\n", 679 | " print(\"iteration : %d train loss: %.3f , valid loss %.3f , ROC auc %.4f\" % (i,loss_train, loss_val,np.mean(user_auc_scores)))\n" 680 | ] 681 | }, 682 | { 683 | "cell_type": "markdown", 684 | "metadata": {}, 685 | "source": [ 686 | "## Training a Deep Matching Model on Implicit Feedback\n", 687 | "\n", 688 | "\n", 689 | "Instead of using a hard-coded dot-product similarity to predict the match of a `(user_id, item_id)` pair, we can specify a deep neural network based parametrisation of the similarity. 
The parameters of that matching model are also trained with the margin comparator loss:\n", 690 | "\n", 691 | "\n", 692 | "\n", 693 | "\n" 694 | ] 695 | }, 696 | { 697 | "cell_type": "code", 698 | "execution_count": 14, 699 | "metadata": {}, 700 | "outputs": [ 701 | { 702 | "name": "stdout", 703 | "output_type": "stream", 704 | "text": [ 705 | "Tensor(\"embedding/user_embedding:0\", shape=(?, 32), dtype=float32)\n", 706 | "Tensor(\"embedding/positive_item_embedding:0\", shape=(?, 64), dtype=float32)\n", 707 | "Tensor(\"embedding/negative_item_embedding:0\", shape=(?, 64), dtype=float32)\n", 708 | "Tensor(\"concatenation/concat:0\", shape=(?, 96), dtype=float32)\n", 709 | "Tensor(\"concatenation/concat_1:0\", shape=(?, 96), dtype=float32)\n", 710 | "Tensor(\"fc_1/hidden_output_positive:0\", shape=(?, 64), dtype=float32)\n", 711 | "Tensor(\"fc_1/hidden_output_negative:0\", shape=(?, 64), dtype=float32)\n", 712 | "Tensor(\"dropout/hidden_output_drop_positive/mul:0\", shape=(?, 64), dtype=float32)\n", 713 | "Tensor(\"dropout/hidden_output_drop_negative/mul:0\", shape=(?, 64), dtype=float32)\n", 714 | "Tensor(\"fc_2/positive_prediction:0\", shape=(?, 1), dtype=float32)\n", 715 | "Tensor(\"fc_2/negative_prediction:0\", shape=(?, 1), dtype=float32)\n" 716 | ] 717 | } 718 | ], 719 | "source": [ 720 | "user_embedding_size = 32\n", 721 | "item_embedding_size = 64\n", 722 | "num_hidden = 64\n", 723 | "\n", 724 | "reg_param = 0.01\n", 725 | "learning_rate = 0.01\n", 726 | "n_users = max_user_id + 1\n", 727 | "n_items = max_item_id + 1\n", 728 | "\n", 729 | "g = tf.Graph()\n", 730 | "with g.as_default():\n", 731 | " \n", 732 | " # setting up random seed\n", 733 | " tf.set_random_seed(1234)\n", 734 | "\n", 735 | " user_input = tf.placeholder(shape=[None,1], dtype=tf.int64, name='user_input')\n", 736 | " positive_item_input = tf.placeholder(shape=[None,1], dtype=tf.int64, name='positive_item_input')\n", 737 | " negative_item_input = tf.placeholder(shape=[None,1], dtype=tf.int64, 
name='negative_item_input')\n", 738 | " \n", 739 | " l2_loss = tf.constant(0.0)\n", 740 | " \n", 741 | " # embedding layer\n", 742 | " with tf.variable_scope(\"embedding\"):\n", 743 | " user_weights = tf.get_variable(\"user_w\"\n", 744 | " , shape=[n_users, user_embedding_size]\n", 745 | " , dtype=tf.float32\n", 746 | " , initializer=layers.xavier_initializer())\n", 747 | " \n", 748 | " item_weights = tf.get_variable(\"item_w\"\n", 749 | " , shape=[n_items, item_embedding_size]\n", 750 | " , dtype=tf.float32\n", 751 | " , initializer=layers.xavier_initializer())\n", 752 | " \n", 753 | " user_embedding = tf.squeeze(tf.nn.embedding_lookup(user_weights, user_input),axis=1, name='user_embedding')\n", 754 | " positive_item_embedding = tf.squeeze(tf.nn.embedding_lookup(item_weights, positive_item_input),axis=1, name='positive_item_embedding')\n", 755 | " negative_item_embedding = tf.squeeze(tf.nn.embedding_lookup(item_weights, negative_item_input),axis=1, name='negative_item_embedding')\n", 756 | " \n", 757 | " l2_loss += tf.nn.l2_loss(user_weights)\n", 758 | " l2_loss += tf.nn.l2_loss(item_weights)\n", 759 | " \n", 760 | " \n", 761 | " print(user_embedding)\n", 762 | " print(positive_item_embedding)\n", 763 | " print(negative_item_embedding)\n", 764 | " \n", 765 | " \n", 766 | " # combine inputs\n", 767 | " with tf.name_scope('concatenation'):\n", 768 | " positive_embeddings_pair = tf.concat([user_embedding, positive_item_embedding], axis=1)\n", 769 | " negative_embeddings_pair = tf.concat([user_embedding, negative_item_embedding], axis=1)\n", 770 | " print(positive_embeddings_pair)\n", 771 | " print(negative_embeddings_pair)\n", 772 | " \n", 773 | " # fc-1: first fully connected layer, shared by the positive and negative branches\n", 774 | " \n", 775 | " with tf.name_scope(\"fc_1\"):\n", 776 | " W_fc_1 = tf.get_variable(\n", 777 | " \"W_hidden\",\n", 778 | " shape=[user_embedding_size + item_embedding_size, num_hidden],\n", 779 | " 
initializer=tf.contrib.layers.xavier_initializer())\n", 780 | " b_fc_1 = tf.Variable(tf.constant(0.1, 
shape=[num_hidden]), name=\"b\")\n", 781 | " hidden_output_positive = tf.nn.relu(tf.nn.xw_plus_b(positive_embeddings_pair, W_fc_1, b_fc_1), name='hidden_output_positive')\n", 782 | " hidden_output_negative = tf.nn.relu(tf.nn.xw_plus_b(negative_embeddings_pair, W_fc_1, b_fc_1), name='hidden_output_negative')\n", 783 | " \n", 784 | " l2_loss += tf.nn.l2_loss(W_fc_1)\n", 785 | " print(hidden_output_positive)\n", 786 | " print(hidden_output_negative)\n", 787 | " \n", 788 | " # dropout\n", 789 | " with tf.name_scope(\"dropout\"):\n", 790 | " h_drop_positive = tf.nn.dropout(hidden_output_positive, 0.8, name=\"hidden_output_drop_positive\")\n", 791 | " h_drop_negative = tf.nn.dropout(hidden_output_negative, 0.8, name=\"hidden_output_drop_negative\")\n", 792 | " print(h_drop_positive)\n", 793 | " print(h_drop_negative)\n", 794 | " \n", 795 | " # fc-2\n", 796 | " with tf.name_scope(\"fc_2\"):\n", 797 | " W_fc_2 = tf.get_variable(\n", 798 | " \"W_output\",\n", 799 | " shape=[num_hidden,1],\n", 800 | " initializer=tf.contrib.layers.xavier_initializer())\n", 801 | " b_fc_2 = tf.Variable(tf.constant(0.1, shape=[1]), name=\"b\")\n", 802 | " positive_prediction = tf.nn.xw_plus_b(h_drop_positive, W_fc_2, b_fc_2, name='positive_prediction')\n", 803 | " negative_prediction = tf.nn.xw_plus_b(h_drop_negative, W_fc_2, b_fc_2, name='negative_prediction')\n", 804 | " \n", 805 | " l2_loss += tf.nn.l2_loss(W_fc_2)\n", 806 | " print(positive_prediction)\n", 807 | " print(negative_prediction)\n", 808 | "\n", 809 | " # loss\n", 810 | " with tf.name_scope(\"loss\"):\n", 811 | " triplet_loss = tf.maximum(negative_prediction - positive_prediction + margin, 0)\n", 812 | " loss = tf.reduce_mean(triplet_loss) + reg_param * l2_loss\n", 813 | " train_ops = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)\n", 814 | " " 815 | ] 816 | }, 817 | { 818 | "cell_type": "code", 819 | "execution_count": 15, 820 | "metadata": {}, 821 | "outputs": [ 822 | { 823 | "name": "stdout", 824 | 
"output_type": "stream", 825 | "text": [ 826 | "iteration : 0 train loss: 1.882 , valid loss 1.881 , ROC auc 0.6317\n", 827 | "iteration : 10 train loss: 0.787 , valid loss 0.773 , ROC auc 0.8665\n", 828 | "iteration : 20 train loss: 0.591 , valid loss 0.593 , ROC auc 0.8591\n", 829 | "iteration : 30 train loss: 0.548 , valid loss 0.531 , ROC auc 0.8638\n", 830 | "iteration : 40 train loss: 0.528 , valid loss 0.511 , ROC auc 0.8635\n", 831 | "iteration : 50 train loss: 0.521 , valid loss 0.512 , ROC auc 0.8635\n", 832 | "overfit !\n" 833 | ] 834 | } 835 | ], 836 | "source": [ 837 | "n_epochs = 1000\n", 838 | "losses_train = []\n", 839 | "losses_val = []\n", 840 | "\n", 841 | "with tf.Session(graph=g) as sess:\n", 842 | " # initializer\n", 843 | " sess.run(tf.global_variables_initializer())\n", 844 | " \n", 845 | " def check_overfit(validation_loss):\n", 846 | " n = len(validation_loss)\n", 847 | " if n < 5:\n", 848 | " return False\n", 849 | " count = 0 \n", 850 | " for i in range(n-4, n):\n", 851 | " if validation_loss[i] < validation_loss[i-1]:\n", 852 | " count += 1\n", 853 | " if count >=3:\n", 854 | " return False\n", 855 | " return True\n", 856 | " \n", 857 | " for i in range(n_epochs):\n", 858 | " triplet_inputs_train = sample_triplets(pos_data_train, max_item_id,random_seed=i)\n", 859 | " triplet_inputs_val = sample_triplets(pos_data_test, max_item_id,random_seed=i+1)\n", 860 | " \n", 861 | " train_input_dict = {user_input: triplet_inputs_train[0].reshape([-1,1])\n", 862 | " , positive_item_input: triplet_inputs_train[1].reshape([-1,1])\n", 863 | " , negative_item_input: triplet_inputs_train[2].reshape([-1,1])}\n", 864 | " \n", 865 | " val_input_dict = {user_input: triplet_inputs_val[0].reshape([-1,1])\n", 866 | " , positive_item_input: triplet_inputs_val[1].reshape([-1,1])\n", 867 | " , negative_item_input: triplet_inputs_val[2].reshape([-1,1])}\n", 868 | " sess.run([train_ops], feed_dict=train_input_dict)\n", 869 | " \n", 870 | " if i % 10 == 0:\n", 871 | 
" loss_train = sess.run(loss, feed_dict=train_input_dict)\n", 872 | " loss_val = sess.run(loss, feed_dict=val_input_dict)\n", 873 | "\n", 874 | " losses_train.append(loss_train)\n", 875 | " losses_val.append(loss_val)\n", 876 | "\n", 877 | " # check early stopping \n", 878 | " if(check_overfit(losses_val)):\n", 879 | " print('overfit !')\n", 880 | " break\n", 881 | "\n", 882 | " \n", 883 | " # calculate AUC Score \n", 884 | " # Compute the ROC AUC for each user and average over users\n", 885 | " max_user_id = max(pos_data_train['user_id'].max(), pos_data_test['user_id'].max())\n", 886 | " max_item_id = max(pos_data_train['item_id'].max(), pos_data_test['item_id'].max())\n", 887 | " user_auc_scores = []\n", 888 | " for user_id in range(1, max_user_id + 1):\n", 889 | " pos_item_train = pos_data_train[pos_data_train['user_id'] == user_id]\n", 890 | " pos_item_test = pos_data_test[pos_data_test['user_id'] == user_id]\n", 891 | "\n", 892 | " # Rank every item except those the user already rated positively in the training set\n", 893 | " all_item_ids = np.arange(1, max_item_id + 1)\n", 894 | " items_to_rank = np.setdiff1d(all_item_ids, pos_item_train['item_id'].values)\n", 895 | "\n", 896 | " # Ground truth: return 1 for each item positively present in the test set\n", 897 | " # and 0 otherwise.\n", 898 | " expected = np.in1d(items_to_rank, pos_item_test['item_id'].values)\n", 899 | "\n", 900 | " if np.sum(expected) >= 1:\n", 901 | " # At least one positive test value to rank\n", 902 | " repeated_user_id = np.empty_like(items_to_rank)\n", 903 | " repeated_user_id.fill(user_id)\n", 904 | " predicted = sess.run(positive_prediction, feed_dict={user_input : repeated_user_id.reshape([-1,1]), \n", 905 | " positive_item_input : items_to_rank.reshape([-1,1])})\n", 906 | " user_auc_scores.append(roc_auc_score(expected, predicted))\n", 907 | "\n", 908 | " print(\"iteration : %d train loss: %.3f , valid loss %.3f , ROC auc %.4f\" % (i,loss_train, loss_val,np.mean(user_auc_scores)))\n", 909 | 
"\n", 
910 | "\n", 911 | " \n", 912 | " \n", 913 | "\n", 914 | "\n" 915 | ] 916 | }, 917 | { 918 | "cell_type": "markdown", 919 | "metadata": {}, 920 | "source": [ 921 | "## Possible Extensions\n", 922 | "\n", 923 | "You can implement any of the following ideas if you want to get a deeper understanding of recommender systems.\n", 924 | "\n", 925 | "\n", 926 | "### Leverage User and Item Metadata\n", 927 | "\n", 928 | "As we did for the Explicit Feedback model, it's also possible to extend our models to take additional user and item metadata as side information when computing the match score.\n", 929 | "\n", 930 | "\n", 931 | "### Better Ranking Metrics\n", 932 | "\n", 933 | "In this notebook we evaluated the quality of the ranked recommendations using the ROC AUC metric. This score reflects the ability of the model to correctly rank any pair of items (sampled uniformly at random among all possible items).\n", 934 | "\n", 935 | "In practice recommender systems will only display a few recommendations to the user (typically 1 to 10). It is typically more informative to use an evaluation metric that characterizes the quality of the top ranked items and attributes less or no importance to items that are not good recommendations for a specific user. Popular ranking metrics therefore include the **Precision at k** and the **Mean Average Precision**.\n", 936 | "\n", 937 | "\n", 938 | "\n", 939 | "### Hard Negative Sampling\n", 940 | "\n", 941 | "In this experiment we sampled negative items uniformly at random. 
However, after training the model for a while, it is possible that the vast majority of sampled negatives already have a similarity much lower than the positive pair, so that the margin comparator loss sets the majority of the gradients to zero, effectively wasting a lot of computation.\n", 942 | "\n", 943 | "Given the current state of the recsys model, we could sample harder negatives with a higher probability so that the model is trained closer to its decision boundary. 
This strategy is implemented in the WARP loss [1].\n", 944 | "\n", 945 | "The main drawback of hard negative sampling is that it increases the risk of severe overfitting if a significant fraction of the labels are noisy.\n", 946 | "\n" 947 | ] 948 | }, 949 | { 950 | "cell_type": "code", 951 | "execution_count": null, 952 | "metadata": {}, 953 | "outputs": [], 954 | "source": [] 955 | } 956 | ], 957 | "metadata": { 958 | "kernelspec": { 959 | "display_name": "Python 3", 960 | "language": "python", 961 | "name": "python3" 962 | }, 963 | "language_info": { 964 | "codemirror_mode": { 965 | "name": "ipython", 966 | "version": 3 967 | }, 968 | "file_extension": ".py", 969 | "mimetype": "text/x-python", 970 | "name": "python", 971 | "nbconvert_exporter": "python", 972 | "pygments_lexer": "ipython3", 973 | "version": "3.6.3" 974 | } 975 | }, 976 | "nbformat": 4, 977 | "nbformat_minor": 2 978 | } 979 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Deep Learning Based Search and Recommendation System 2 | 3 | ### Strata Conference, March 2018, San Jose 4 | 5 | 6 | **Presenters** 7 | - Dr. Vijay Agneeswaran [ LinkedIn : http://bit.ly/vijaysa Twitter : @a_vijaysrinivas ] 8 | - Abhishek Kumar [ LinkedIn : http://bit.ly/kumarabhishek Twitter : @meabhishekkumar ] 9 | 10 | --- 11 | 12 | ### Session Content 13 | 14 | 1. Slides [ PDF : https://github.com/meabhishekkumar/strata-conference-ca-2018/blob/master/deep_learning_based_search_and_recommender_system.pdf ] 15 | 2. 
Notebooks 16 | - Data Preparation : Download required data to your local machine [https://github.com/meabhishekkumar/strata-conference-ca-2018/blob/master/01_data_preperation.ipynb ] 17 | - Short Introduction to Embeddings in TensorFlow 18 | - Image Search using TensorFlow 19 | - Explicit Feedback Based Recommendation System using TensorFlow 20 | - Implicit Feedback Based Recommendation System 21 | 22 | ### Setting up the Environment 23 | 24 | You can easily set up the environment with all required components (data and notebooks) with the help of Docker. 25 | 26 | Here are the steps. 27 | 28 | 1. Install Docker on your local machine. You will find the required documentation on the Docker website [ https://docs.docker.com/install/ ] 29 | 30 | 2. Make sure Docker is working. The following command should print the Docker version without any error: 31 | 32 | ```sh 33 | $ docker --version 34 | ``` 35 | 36 | 3. Download the Docker image and create a container for the tutorial 37 | 38 | ```sh 39 | $ docker run -it --rm -p 8888:8888 -p 0.0.0.0:6006:6006 meabhishekkumar/strata-ca-2018 40 | ``` 41 | 42 | ### Reference Papers 43 | 44 | 1. Restricted Boltzmann Machines for Collaborative Filtering by Ruslan Salakhutdinov. 45 | Source: http://www.machinelearning.org/proceedings/icml2007/papers/407.pdf 46 | 2. Wide & Deep Learning for Recommender Systems by Heng-Tze Cheng. 47 | Source: https://arxiv.org/abs/1606.07792 48 | 3. A Survey and Critique of Deep Learning on Recommender Systems by Lei Zheng. 49 | Source: http://bdsc.lab.uic.edu/docs/survey-critique-deep.pdf 50 | 4. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. IJCAI2017 51 | Source: https://arxiv.org/abs/1703.04247 52 | 5. Deep Neural Networks for YouTube Recommendations by Paul Covington. 
53 | Source: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf 54 | 55 | 56 | --- 57 | Credits : 58 | 59 | - The recommendation system notebooks are inspired by [Olivier Grisel](https://github.com/ogrisel)'s work using Keras 60 | 61 | 62 | 63 | 64 | -------------------------------------------------------------------------------- /deep_learning_based_search_and_recommender_system.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/meabhishekkumar/strata-conference-ca-2018/3a5b4ccf4f0803222f7f184b04e56ffd3928e73e/deep_learning_based_search_and_recommender_system.pdf -------------------------------------------------------------------------------- /images/alexnet_architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/meabhishekkumar/strata-conference-ca-2018/3a5b4ccf4f0803222f7f184b04e56ffd3928e73e/images/alexnet_architecture.png --------------------------------------------------------------------------------