├── .gitignore
├── .travis.yml
├── LICENSE
├── README.md
├── data
├── movies.csv
├── ratings.csv
└── user_item_matrix.npz
├── docs
├── Makefile
├── conf.py
├── index.rst
└── intro.rst
├── images
├── cb-filtering-1.png
├── cb-filtering.png
├── cf-filtering.png
├── knn.png
├── matrix-factorization.png
├── movie-features-matrix.png
├── user-movie-matrix.png
└── utility-matrix.png
├── part-1-item-item-recommender.ipynb
├── part-2-cold-start-problem.ipynb
├── part-3-implicit-feedback-recommender.ipynb
├── presentation
├── images
│ ├── amazon-ecommerce.png
│ ├── amazon-example.png
│ ├── bookstore.png
│ ├── collaborative-filtering.png
│ ├── content-based-filtering.png
│ ├── cosine-sim.png
│ ├── gypsy-musical.png
│ ├── knn.png
│ ├── lamerica.png
│ ├── long-tail-book.png
│ ├── medium-example.png
│ ├── netflix-example.png
│ ├── recommender-examples.png
│ ├── recommender-ml-1.png
│ ├── recommender-ml-2.png
│ ├── recommender-ml-3.png
│ ├── recommender-ml-4.png
│ ├── spotify-example.png
│ ├── tasting-booth.png
│ └── utility-matrix.png
├── intro_to_recommenders_slides.ipynb
└── slides.html
├── recommender-basics.md
└── utils.py
/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints
2 | *.DS_Store
3 | *.npz
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | install:
2 | - pip install jupyter
3 | - cd presentation
4 | - wget https://github.com/hakimel/reveal.js/archive/master.zip
5 | - unzip master.zip
6 | - mv reveal.js-master reveal.js
7 |
8 | script:
9 | - jupyter nbconvert intro_to_recommenders_slides.ipynb --to slides
10 |
11 | after_success: |
12 | if [ -n "$GITHUB_API_KEY" ]; then
13 | git checkout --orphan gh-pages
14 | git rm -rf --cached .
15 | mv intro_to_recommenders_slides.slides.html index.html
16 | git add -f --ignore-errors index.html images reveal.js
17 | git -c user.name='travis' -c user.email='travis' commit -m init
18 | git push -f -q https://$GITHUB_USER:$GITHUB_API_KEY@github.com/$TRAVIS_REPO_SLUG gh-pages
19 | fi
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | BSD 3-Clause License
2 |
3 | Copyright (c) 2018, Jill Cates
4 | All rights reserved.
5 |
6 | Redistribution and use in source and binary forms, with or without
7 | modification, are permitted provided that the following conditions are met:
8 |
9 | * Redistributions of source code must retain the above copyright notice, this
10 | list of conditions and the following disclaimer.
11 |
12 | * Redistributions in binary form must reproduce the above copyright notice,
13 | this list of conditions and the following disclaimer in the documentation
14 | and/or other materials provided with the distribution.
15 |
16 | * Neither the name of the copyright holder nor the names of its
17 | contributors may be used to endorse or promote products derived from
18 | this software without specific prior written permission.
19 |
20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Recommendation Systems 101
2 |
3 | This series of tutorials explores different types of recommendation systems and their implementations. Topics include:
4 |
5 | - collaborative vs. content-based filtering
6 | - implicit vs. explicit feedback
7 | - handling the cold start problem
8 | - recommendation model evaluation
9 |
10 | We will build various recommendation systems using data from the [MovieLens](https://movielens.org/) database. You will need Jupyter Lab to run the notebooks for each part of this series. Alternatively, you can use Google's [Colab platform](https://colab.research.google.com), which lets you run a Jupyter notebook environment in the cloud. You won't need to install any local dependencies; however, you will need a Google account.
11 |
12 | The series is divided into 3 parts:
13 |
14 | 1. [Building an Item-Item Recommender with Collaborative Filtering](#part-1-building-an-item-item-recommender-with-collaborative-filtering)
15 | 2. [Handling the Cold Start Problem with Content-based Filtering](#part-2-handling-the-cold-start-problem-with-content-based-filtering)
16 | 3. [Building an Implicit Feedback Recommender System](#part-3-building-an-implicit-feedback-recommender-system)
17 |
18 |
19 | More information on each part can be found in the descriptions below.
20 |
21 | ### Part 1: Building an Item-Item Recommender with Collaborative Filtering
22 |
23 | | |Description |
24 | |:-----------|:----------|
25 | |Objective|Want to know how Spotify, Amazon, and Netflix generate "similar item" recommendations for users? In this tutorial, we will build an item-item recommendation system by computing similarity using nearest neighbor techniques.|
26 | |Key concepts|collaborative filtering, content-based filtering, k-Nearest neighbors, cosine similarity|
27 | |Requirements|Python 3.6+, Jupyter Lab, numpy, pandas, matplotlib, seaborn, scikit-learn|
28 | |Tutorial link|[Jupyter Notebook](part-1-item-item-recommender.ipynb)|
29 | |Resources|[Item-item collaborative filtering](https://www.wikiwand.com/en/Item-item_collaborative_filtering), [Amazon.com Recommendations](https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf), [Various Implementations of Collaborative Filtering](https://towardsdatascience.com/various-implementations-of-collaborative-filtering-100385c6dfe0) |
30 |
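To make the idea of item-item similarity concrete, here is a minimal sketch that finds the nearest neighbours of a movie by cosine similarity. It uses plain numpy rather than the notebook's own code, and the toy `item_user` matrix and the `similar_items` helper are assumptions made up for illustration.

```python
import numpy as np

# Toy item-user matrix (assumed data): rows are movies, columns are users,
# values are ratings, and 0 means "not rated".
item_user = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def similar_items(item_idx, item_user, k=1):
    """Return the k items whose rating vectors are closest to item_idx by cosine similarity."""
    norms = np.linalg.norm(item_user, axis=1)
    norms[norms == 0] = 1.0            # avoid dividing by zero for unrated items
    unit = item_user / norms[:, None]  # scale every row to unit length
    sims = unit @ unit[item_idx]       # cosine similarity to the query item
    sims[item_idx] = -np.inf           # exclude the query item itself
    return [int(i) for i in np.argsort(-sims)[:k]]

print(similar_items(0, item_user))
```

Movies 0 and 1 are rated highly by the same users, so each is the other's nearest neighbour. On a real dataset you would likely use a sparse matrix and a dedicated nearest-neighbour implementation instead of this brute-force version.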
31 |
32 | ### Part 2: Handling the Cold Start Problem with Content-based Filtering
33 |
34 | | |Description |
35 | |:-----------|:----------|
36 | |Objective|Collaborative filtering fails to incorporate new users who haven't rated yet and new items that don't have any ratings or reviews. This is called the cold start problem. In this tutorial, we will learn about clustering techniques that are used to tackle the cold start problem of collaborative filtering.|
37 | |Requirements|Python 3.6+, Jupyter Lab, numpy, pandas, matplotlib, seaborn, scikit-learn|
38 | |Tutorial link|[Jupyter Notebook](part-2-cold-start-problem.ipynb)|
39 |
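As a rough illustration of why item features help with cold start, the sketch below compares movies by their genres alone, so a title with no ratings can still be matched to similar ones. This is one simple content-based approach and not necessarily the one used in the notebook; the titles, genre strings, and the `most_similar` helper are hypothetical.

```python
import numpy as np

# Hypothetical movies with pipe-separated genre strings (illustrative values).
movies = {
    "Heat (1995)": "Action|Crime|Thriller",
    "Toy Story (1995)": "Adventure|Animation|Children|Comedy",
    "Brand New Heist Film": "Action|Crime",  # newly released, no ratings yet
}

titles = list(movies)
genres = sorted({g for s in movies.values() for g in s.split("|")})

# Binary movie-by-genre feature matrix.
features = np.array(
    [[1.0 if g in movies[t].split("|") else 0.0 for g in genres] for t in titles]
)

def most_similar(title):
    """Return the other title whose genre vector is closest by cosine similarity."""
    idx = titles.index(title)
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    sims[idx] = -np.inf  # ignore the query movie itself
    return titles[int(np.argmax(sims))]

print(most_similar("Brand New Heist Film"))
```

No rating data is involved at any point, which is what lets this kind of model make a recommendation for an unrated item.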
40 |
41 | ### Part 3: Building an Implicit Feedback Recommender System
42 |
43 | | |Description |
44 | |:-----------|:----------|
45 | |Objective|Unlike explicit feedback (e.g., user ratings), implicit feedback infers a user's degree of preference toward an item by looking at their indirect interactions with that item. In this tutorial, we will investigate a recommender model that specifically handles implicit feedback datasets.|
46 | |Requirements|Python 3.6+, Jupyter Lab, numpy, pandas, implicit|
47 | |Tutorial link|[Jupyter Notebook](part-3-implicit-feedback-recommender.ipynb)|
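
To show what "inferring preference from indirect interactions" can look like in practice, here is a small sketch of a common formulation from the implicit-feedback literature: raw counts are split into a binary preference and a confidence weight that grows with the number of interactions. The `plays` matrix, the helper name, and the default `alpha` are assumptions for illustration, not values taken from the notebook.

```python
import numpy as np

# Hypothetical play counts: rows are users, columns are items.
plays = np.array([
    [0, 3, 0],
    [10, 0, 1],
], dtype=float)

def to_preference_confidence(interactions, alpha=40.0):
    """Split raw interaction counts into binary preference and confidence weights."""
    preference = (interactions > 0).astype(float)  # 1 if the user interacted at all
    confidence = 1.0 + alpha * interactions        # more interactions, more confidence
    return preference, confidence

preference, confidence = to_preference_confidence(plays, alpha=2.0)
print(preference)
print(confidence)
```

A missing interaction keeps a confidence of 1 rather than 0, reflecting that "never watched" is weak evidence of dislike rather than no evidence at all.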
--------------------------------------------------------------------------------
/data/user_item_matrix.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/data/user_item_matrix.npz
--------------------------------------------------------------------------------
/docs/Makefile:
--------------------------------------------------------------------------------
1 | # Minimal makefile for Sphinx documentation
2 | #
3 |
4 | # You can set these variables from the command line.
5 | SPHINXOPTS =
6 | SPHINXBUILD = sphinx-build
7 | SPHINXPROJ = recommender-tutorial
8 | SOURCEDIR = .
9 | BUILDDIR = _build
10 |
11 | # Put it first so that "make" without argument is like "make help".
12 | help:
13 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
14 |
15 | .PHONY: help Makefile
16 |
17 | # Catch-all target: route all unknown targets to Sphinx using the new
18 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
19 | %: Makefile
20 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
--------------------------------------------------------------------------------
/docs/conf.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #
3 | # Configuration file for the Sphinx documentation builder.
4 | #
5 | # This file does only contain a selection of the most common options. For a
6 | # full list see the documentation:
7 | # http://www.sphinx-doc.org/en/master/config
8 |
9 | # -- Path setup --------------------------------------------------------------
10 |
11 | # If extensions (or modules to document with autodoc) are in another directory,
12 | # add these directories to sys.path here. If the directory is relative to the
13 | # documentation root, use os.path.abspath to make it absolute, like shown here.
14 | #
15 | # import os
16 | # import sys
17 | # sys.path.insert(0, os.path.abspath('.'))
18 | import nbsphinx
19 |
20 | # -- Project information -----------------------------------------------------
21 |
22 | project = 'recommender-tutorial'
23 | copyright = '2018, Jill Cates'
24 | author = 'Jill Cates'
25 |
26 | # The short X.Y version
27 | version = ''
28 | # The full version, including alpha/beta/rc tags
29 | release = '0.1'
30 |
31 |
32 | # -- General configuration ---------------------------------------------------
33 |
34 | # If your documentation needs a minimal Sphinx version, state it here.
35 | #
36 | # needs_sphinx = '1.0'
37 |
38 | # Add any Sphinx extension module names here, as strings. They can be
39 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
40 | # ones.
41 | extensions = [
42 | 'sphinx.ext.napoleon',
43 | 'sphinx.ext.mathjax',
44 | 'sphinx.ext.githubpages',
45 | 'nbsphinx',
46 | 'sphinx.ext.viewcode'
47 | ]
48 |
49 | # Add any paths that contain templates here, relative to this directory.
50 | templates_path = ['_templates']
51 |
52 | # The suffix(es) of source filenames.
53 | # You can specify multiple suffix as a list of string:
54 | #
55 | # source_suffix = ['.rst', '.md']
56 | source_suffix = '.rst'
57 |
58 | # The master toctree document.
59 | master_doc = 'index'
60 |
61 | # The language for content autogenerated by Sphinx. Refer to documentation
62 | # for a list of supported languages.
63 | #
64 | # This is also used if you do content translation via gettext catalogs.
65 | # Usually you set "language" from the command line for these cases.
66 | language = None
67 |
68 | # List of patterns, relative to source directory, that match files and
69 | # directories to ignore when looking for source files.
70 | # This pattern also affects html_static_path and html_extra_path .
71 | exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
72 |
73 | # The name of the Pygments (syntax highlighting) style to use.
74 | pygments_style = 'sphinx'
75 |
76 |
77 | # -- Options for HTML output -------------------------------------------------
78 |
79 | # The theme to use for HTML and HTML Help pages. See the documentation for
80 | # a list of builtin themes.
81 | #
82 | html_theme = 'sphinx_rtd_theme'
83 |
84 | # Theme options are theme-specific and customize the look and feel of a theme
85 | # further. For a list of options available for each theme, see the
86 | # documentation.
87 | #
88 | # html_theme_options = {}
89 |
90 | # Add any paths that contain custom static files (such as style sheets) here,
91 | # relative to this directory. They are copied after the builtin static files,
92 | # so a file named "default.css" will overwrite the builtin "default.css".
93 | html_static_path = ['_static']
94 |
95 | # Custom sidebar templates, must be a dictionary that maps document names
96 | # to template names.
97 | #
98 | # The default sidebars (for documents that don't match any pattern) are
99 | # defined by theme itself. Builtin themes are using these templates by
100 | # default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
101 | # 'searchbox.html']``.
102 | #
103 | # html_sidebars = {}
104 |
105 |
106 | # -- Options for HTMLHelp output ---------------------------------------------
107 |
108 | # Output file base name for HTML help builder.
109 | htmlhelp_basename = 'recommender-tutorialdoc'
110 |
111 |
112 | # -- Options for LaTeX output ------------------------------------------------
113 |
114 | latex_elements = {
115 | # The paper size ('letterpaper' or 'a4paper').
116 | #
117 | # 'papersize': 'letterpaper',
118 |
119 | # The font size ('10pt', '11pt' or '12pt').
120 | #
121 | # 'pointsize': '10pt',
122 |
123 | # Additional stuff for the LaTeX preamble.
124 | #
125 | # 'preamble': '',
126 |
127 | # Latex figure (float) alignment
128 | #
129 | # 'figure_align': 'htbp',
130 | }
131 |
132 | # Grouping the document tree into LaTeX files. List of tuples
133 | # (source start file, target name, title,
134 | # author, documentclass [howto, manual, or own class]).
135 | latex_documents = [
136 | (master_doc, 'recommender-tutorial.tex', 'recommender-tutorial Documentation',
137 | 'Jill Cates', 'manual'),
138 | ]
139 |
140 |
141 | # -- Options for manual page output ------------------------------------------
142 |
143 | # One entry per manual page. List of tuples
144 | # (source start file, name, description, authors, manual section).
145 | man_pages = [
146 | (master_doc, 'recommender-tutorial', 'recommender-tutorial Documentation',
147 | [author], 1)
148 | ]
149 |
150 |
151 | # -- Options for Texinfo output ----------------------------------------------
152 |
153 | # Grouping the document tree into Texinfo files. List of tuples
154 | # (source start file, target name, title, author,
155 | # dir menu entry, description, category)
156 | texinfo_documents = [
157 | (master_doc, 'recommender-tutorial', 'recommender-tutorial Documentation',
158 | author, 'recommender-tutorial', 'One line description of project.',
159 | 'Miscellaneous'),
160 | ]
161 |
162 |
163 | # -- Extension configuration -------------------------------------------------
--------------------------------------------------------------------------------
/docs/index.rst:
--------------------------------------------------------------------------------
1 | .. recommender-tutorial documentation master file, created by
2 | sphinx-quickstart on Fri Aug 10 16:53:05 2018.
3 | You can adapt this file completely to your liking, but it should at least
4 | contain the root `toctree` directive.
5 |
6 | An Introduction to Recommendation Systems in Python
7 | ===================================================
8 |
9 | Skill level: Intermediate
10 |
11 | Tutorial requirements
12 | ---------------------
13 | - Python 3.6+
14 | - Jupyter Lab
15 | - numpy
16 | - scikit-learn
17 |
18 | Alternatively, you can use Google's `Colab platform <https://colab.research.google.com>`_, which lets you run a Jupyter notebook environment in the cloud. You won't need to install any of the above locally; however, you will need a Google account.
19 |
20 |
21 | Description
22 | -----------
23 | In this tutorial, we will explore the different types of recommendation systems and their implementations. We will also build our own recommendation system using data from the `MovieLens <https://movielens.org/>`_ database.
24 |
25 | .. toctree::
26 | :maxdepth: 1
27 | :caption: Tutorials:
28 |
29 | part-1-building-from-scratch.ipynb
30 |
31 | .. toctree::
32 | :maxdepth: 2
33 | :caption: Contents:
34 |
35 | intro.rst
36 |
--------------------------------------------------------------------------------
/docs/intro.rst:
--------------------------------------------------------------------------------
1 | What is a recommendation system?
2 | ================================
3 |
4 | A recommendation system is an algorithm that matches items to users. Its goal is to predict a user's preference toward an item.
5 |
6 | Examples
7 | +++++++++
8 | - recommending products based on past purchases or product searches (Amazon)
9 | - suggesting TV shows or movies based on prediction of a user's interests (Netflix)
10 | - creating well-curated playlists based on song history (Spotify)
11 | - personalized ads based on "liked" posts or previous websites visited (Facebook)
12 |
13 | The two most commonly used methods for recommendation systems are **collaborative filtering** and **content-based filtering**.
14 |
15 | Collaborative Filtering
16 | ++++++++++++++++++++++++
17 |
18 | Collaborative filtering (CF) is based on the concept of "homophily": similar users like similar things. It uses item preferences from other users to predict which item a particular user will like best. Collaborative filtering uses a user-item matrix to generate recommendations. This matrix is populated with values that indicate a given user's preference towards a given item. It's very unlikely that a user will have interacted with every item, so in most real-life cases, the user-item matrix is very sparse.
19 |
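The user-item matrix described above can be sketched in a few lines. The rating triples and variable names below are hypothetical, and a dense numpy array is used only for readability; a real dataset would normally be stored in a sparse format.

```python
import numpy as np

# Hypothetical (user, item, rating) triples.
ratings = [("u1", "m1", 4.0), ("u1", "m3", 5.0), ("u2", "m2", 3.0)]

users = sorted({u for u, _, _ in ratings})
items = sorted({m for _, m, _ in ratings})
user_index = {u: i for i, u in enumerate(users)}
item_index = {m: j for j, m in enumerate(items)}

# Rows are users, columns are items; 0 means "no interaction".
matrix = np.zeros((len(users), len(items)))
for u, m, r in ratings:
    matrix[user_index[u], item_index[m]] = r

# Fraction of cells that actually hold a rating.
density = np.count_nonzero(matrix) / matrix.size
print(matrix)
print(density)
```

Here half of the cells are filled; with thousands of users and items, the filled fraction is typically far smaller.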
20 |
21 | .. image:: images/utility-matrix.png
22 | :width: 280px
23 |
24 | Collaborative filtering can be further divided into two categories: memory-based and model-based.
25 |
26 | - **Memory-based** algorithms look at item-item, user-item, or user-user similarity using different similarity metrics such as Pearson correlation coefficient, cosine similarity, etc. This approach is easy to apply to your user-item matrix and very interpretable. However, its performance decreases as the dataset becomes more sparse.
27 | - **Model-based** algorithms use matrix factorization techniques such as Singular Value Decomposition (SVD_) and Non-negative Matrix Factorization (NMF_) to extract latent (hidden), meaningful factors from the data.
28 |
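As a rough sketch of the model-based idea, the example below factorizes a small rating matrix with numpy's SVD and keeps only the top ``k`` latent factors. The matrix ``R`` and the helper name are assumptions for illustration, and treating unrated cells as zeros is a simplification that practical recommenders usually avoid.

```python
import numpy as np

# Toy user-item matrix (assumed data) with two clear taste groups.
R = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 5, 4],
    [0, 0, 4, 5],
], dtype=float)

def low_rank_approximation(matrix, k):
    """Factorize matrix with SVD and keep only the k strongest latent factors."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    user_factors = U[:, :k] * s[:k]  # one k-dimensional vector per user
    item_factors = Vt[:k, :]         # one k-dimensional vector per item
    return user_factors, item_factors

user_factors, item_factors = low_rank_approximation(R, k=2)
approx = user_factors @ item_factors
print(np.round(approx, 2))
```

With two factors, the reconstruction keeps the two taste groups and smooths out the differences within each group, which is the sense in which the factors capture hidden structure.
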
29 | A major disadvantage of collaborative filtering is the **cold start problem**. You can only get recommendations for users and items that already have "interactions" in the user-item matrix. Collaborative filtering fails to provide personalized recommendations for brand new users or newly released items.
30 |
31 | .. _NMF: https://www.wikiwand.com/en/Non-negative_matrix_factorization
32 | .. _SVD: https://www.wikiwand.com/en/Singular-value_decomposition
33 |
34 | Content-based Filtering
35 | ++++++++++++++++++++++++
36 |
37 | Content-based filtering is a type of supervised learning that generates recommendations based on user and item features. Given a set of item features (movie genre, release date, country, language, etc.), it predicts how a user will rate an item based on their ratings of previous movies.
38 |
39 | Content-based filtering handles the "cold start" problem because it is able to provide personalized recommendations for brand new users and items.
40 |
41 |
42 | .. image:: images/cb-filtering.png
43 | :width: 550px
44 |
45 |
46 | How do we define a user's "preference" towards an item?
47 | +++++++++++++++++++++++++++++++++++++++++++++++++++++++
48 |
49 | There are two types of feedback data:
50 |
51 | 1. Explicit feedback, which considers a user's direct response to an item (e.g., rating, like/dislike)
52 |
53 | 2. Implicit feedback, which looks at a user's indirect behaviour towards an item (e.g., number of times a user has watched a movie)
54 |
55 | Before using this data in your recommendation system, it is important to perform some data pre-processing. For example, you should normalize ratings of different users to the same scale. More information on how to normalize data in recommendation systems is described `here `_.
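
One simple way to put users on a common scale is mean-centering: subtract each user's average rating from their ratings. The sketch below is a minimal example of that idea, not necessarily the method described in the linked resource; the ``ratings`` array and the ``mean_center`` helper are hypothetical, with ``NaN`` marking missing ratings.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, NaN means "not rated".
ratings = np.array([
    [5.0, 4.0, np.nan],
    [2.0, np.nan, 1.0],
])

def mean_center(ratings):
    """Subtract each user's mean rating, ignoring missing values."""
    user_means = np.nanmean(ratings, axis=1, keepdims=True)
    return ratings - user_means

centered = mean_center(ratings)
print(centered)
```

After centering, a generous rater's 5 and a harsh rater's 2 both become +0.5, so they count as the same degree of "above this user's average".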
--------------------------------------------------------------------------------
/images/cb-filtering-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/images/cb-filtering-1.png
--------------------------------------------------------------------------------
/images/cb-filtering.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/images/cb-filtering.png
--------------------------------------------------------------------------------
/images/cf-filtering.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/images/cf-filtering.png
--------------------------------------------------------------------------------
/images/knn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/images/knn.png
--------------------------------------------------------------------------------
/images/matrix-factorization.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/images/matrix-factorization.png
--------------------------------------------------------------------------------
/images/movie-features-matrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/images/movie-features-matrix.png
--------------------------------------------------------------------------------
/images/user-movie-matrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/images/user-movie-matrix.png
--------------------------------------------------------------------------------
/images/utility-matrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/images/utility-matrix.png
--------------------------------------------------------------------------------
/part-1-item-item-recommender.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Part 1: Building an Item-Item Recommender\n",
8 | "\n",
9 | "If you use Netflix, you will notice that there is a section titled \"Because you watched Movie X\", which provides recommendations for movies based on a recent movie that you've watched. This is a classic example of an item-item recommendation. \n",
10 | "\n",
11 | "In this tutorial, we will generate item-item recommendations using a technique called [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering). Let's get started! "
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "## Step 1: Import the Dependencies\n",
19 | "\n",
20 | "We will be representing our data as a pandas [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). \n",
21 | "\n",
22 | "**What is a DataFrame?**\n",
23 | "\n",
24 | "- a two-dimensional Pandas data structure\n",
25 | "- columns represent features, rows represent items\n",
26 | "- analogous to an Excel spreadsheet or SQL table\n",
27 | "- documentation can be found here\n",
28 | "\n",
29 | "We will also be using two plotting packages: [matplotlib](https://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/) (which is a wrapper of matplotlib) to visualize our data."
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 2,
35 | "metadata": {},
36 | "outputs": [],
37 | "source": [
38 | "import numpy as np\n",
39 | "import pandas as pd\n",
40 | "import sklearn\n",
41 | "import matplotlib.pyplot as plt\n",
42 | "import seaborn as sns\n",
43 | "\n",
44 | "import warnings\n",
45 | "warnings.simplefilter(action='ignore', category=FutureWarning)"
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "## Step 2: Load the Data\n",
53 | "\n",
54 | "Let's download a small version of the [MovieLens](https://www.wikiwand.com/en/MovieLens) dataset. You can access it via the zip file url [here](https://grouplens.org/datasets/movielens/), or directly download [here](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip). We're working with data in `ml-latest-small.zip` and will need to add the following files to our local directory: \n",
55 | "- ratings.csv\n",
56 | "- movies.csv\n",
57 | "\n",
58 | "These are also located in the data folder inside this GitHub repository. \n",
59 | "\n",
60 | "Alternatively, you can access the data here: \n",
61 | "- https://s3-us-west-2.amazonaws.com/recommender-tutorial/movies.csv\n",
62 | "- https://s3-us-west-2.amazonaws.com/recommender-tutorial/ratings.csv\n",
63 | "\n",
64 | "Let's load in our data and take a peek at the structure."
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": 2,
70 | "metadata": {},
71 | "outputs": [
72 | {
73 | "data": {
74 | "text/html": [
75 | "
"
744 | ],
745 | "text/plain": [
746 | " userId movieId rating timestamp title\n",
747 | "93 95 3690 2.0 1043339908 Porky's Revenge (1985)\n",
748 | "122 95 5283 2.0 1043339957 National Lampoon's Van Wilder (2002)\n",
749 | "100 95 4015 2.0 1043339957 Dude, Where's My Car? (2000)\n",
750 | "164 95 7373 1.0 1105401093 Hellboy (2004)\n",
751 | "109 95 4732 1.0 1043339283 Bubble Boy (2001)"
752 | ]
753 | },
754 | "execution_count": 16,
755 | "metadata": {},
756 | "output_type": "execute_result"
757 | }
758 | ],
759 | "source": [
760 | "bottom_5 = user_ratings[user_ratings['rating']<3].tail()\n",
761 | "bottom_5"
762 | ]
763 | },
764 | {
765 | "cell_type": "markdown",
766 | "metadata": {},
767 | "source": [
768 | "Based on their preferences above, we can get a sense that user 95 likes action and crime movies from the early 1990s over light-hearted American comedies from the early 2000s. Let's see what recommendations our model will generate for user 95.\n",
769 | "\n",
770 | "We'll use the `recommend()` method, which takes in the user index of interest and transposed user-item matrix. "
771 | ]
772 | },
773 | {
774 | "cell_type": "code",
775 | "execution_count": 17,
776 | "metadata": {},
777 | "outputs": [
778 | {
779 | "data": {
780 | "text/plain": [
781 | "[(855, 1.127779),\n",
782 | " (1043, 0.98673713),\n",
783 | " (1210, 0.9256185),\n",
784 | " (3633, 0.90900886),\n",
785 | " (1978, 0.8929481),\n",
786 | " (4155, 0.84075284),\n",
787 | " (2979, 0.82858247),\n",
788 | " (3609, 0.78015),\n",
789 | " (4791, 0.7672245),\n",
790 | " (4010, 0.7530525)]"
791 | ]
792 | },
793 | "execution_count": 17,
794 | "metadata": {},
795 | "output_type": "execute_result"
796 | }
797 | ],
798 | "source": [
799 | "X_t = X.T.tocsr()\n",
800 | "\n",
801 | "user_idx = user_mapper[user_id]\n",
802 | "recommendations = model.recommend(user_idx, X_t)\n",
803 | "recommendations"
804 | ]
805 | },
806 | {
807 | "cell_type": "markdown",
808 | "metadata": {},
809 | "source": [
810 | "We can't interpret the results as is since movies are represented by their index. We'll have to loop over the list of recommendations and get the movie title for each movie index. "
811 | ]
812 | },
813 | {
814 | "cell_type": "code",
815 | "execution_count": 18,
816 | "metadata": {},
817 | "outputs": [
818 | {
819 | "name": "stdout",
820 | "output_type": "stream",
821 | "text": [
822 | "Abyss, The (1989)\n",
823 | "Star Trek: First Contact (1996)\n",
824 | "Hunt for Red October, The (1990)\n",
825 | "Lord of the Rings: The Fellowship of the Ring, The (2001)\n",
826 | "Star Wars: Episode I - The Phantom Menace (1999)\n",
827 | "Chicago (2002)\n",
828 | "Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)\n",
829 | "Ocean's Eleven (2001)\n",
830 | "Lord of the Rings: The Return of the King, The (2003)\n",
831 | "Punch-Drunk Love (2002)\n"
832 | ]
833 | }
834 | ],
835 | "source": [
836 | "for r in recommendations:\n",
837 | " recommended_title = get_movie_title(r[0])\n",
838 | " print(recommended_title)"
839 | ]
840 | },
841 | {
842 | "cell_type": "markdown",
843 | "metadata": {},
844 | "source": [
845 | "User 95's recommendations consist mostly of action, adventure, crime, and thriller titles rather than the light-hearted comedies they rated poorly. "
846 | ]
847 | }
848 | ],
849 | "metadata": {
850 | "kernelspec": {
851 | "display_name": "Python 3",
852 | "language": "python",
853 | "name": "python3"
854 | },
855 | "language_info": {
856 | "codemirror_mode": {
857 | "name": "ipython",
858 | "version": 3
859 | },
860 | "file_extension": ".py",
861 | "mimetype": "text/x-python",
862 | "name": "python",
863 | "nbconvert_exporter": "python",
864 | "pygments_lexer": "ipython3",
865 | "version": "3.7.6"
866 | }
867 | },
868 | "nbformat": 4,
869 | "nbformat_minor": 4
870 | }
871 |
--------------------------------------------------------------------------------
/presentation/images/amazon-ecommerce.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/amazon-ecommerce.png
--------------------------------------------------------------------------------
/presentation/images/amazon-example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/amazon-example.png
--------------------------------------------------------------------------------
/presentation/images/bookstore.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/bookstore.png
--------------------------------------------------------------------------------
/presentation/images/collaborative-filtering.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/collaborative-filtering.png
--------------------------------------------------------------------------------
/presentation/images/content-based-filtering.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/content-based-filtering.png
--------------------------------------------------------------------------------
/presentation/images/cosine-sim.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/cosine-sim.png
--------------------------------------------------------------------------------
/presentation/images/gypsy-musical.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/gypsy-musical.png
--------------------------------------------------------------------------------
/presentation/images/knn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/knn.png
--------------------------------------------------------------------------------
/presentation/images/lamerica.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/lamerica.png
--------------------------------------------------------------------------------
/presentation/images/long-tail-book.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/long-tail-book.png
--------------------------------------------------------------------------------
/presentation/images/medium-example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/medium-example.png
--------------------------------------------------------------------------------
/presentation/images/netflix-example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/netflix-example.png
--------------------------------------------------------------------------------
/presentation/images/recommender-examples.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/recommender-examples.png
--------------------------------------------------------------------------------
/presentation/images/recommender-ml-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/recommender-ml-1.png
--------------------------------------------------------------------------------
/presentation/images/recommender-ml-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/recommender-ml-2.png
--------------------------------------------------------------------------------
/presentation/images/recommender-ml-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/recommender-ml-3.png
--------------------------------------------------------------------------------
/presentation/images/recommender-ml-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/recommender-ml-4.png
--------------------------------------------------------------------------------
/presentation/images/spotify-example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/spotify-example.png
--------------------------------------------------------------------------------
/presentation/images/tasting-booth.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/tasting-booth.png
--------------------------------------------------------------------------------
/presentation/images/utility-matrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/topspinj/recommender-tutorial/d643f8b131fae6e2aaf91e5d389c358a1e823426/presentation/images/utility-matrix.png
--------------------------------------------------------------------------------
/recommender-basics.md:
--------------------------------------------------------------------------------
1 | ### What is a recommendation system?
2 |
3 | A recommendation system is an algorithm that predicts a user's preference toward an item. In most cases, its goal is to **drive user engagement**.
4 |
5 | **Examples:**
6 |
7 | - recommending products based on past purchases or product searches (Amazon)
8 | - suggesting TV shows or movies based on prediction of a user's interests (Netflix)
9 | - creating well-curated playlists based on song history (Spotify)
10 | - personalized ads based on "liked" posts or previous websites visited (Facebook)
11 |
12 | The two most common recommendation system techniques are: 1) collaborative filtering, and 2) content-based filtering.
13 |
14 | ### Collaborative Filtering
15 |
16 | Collaborative filtering (CF) is based on the concept of "homophily": similar users like similar things. It uses item preferences from other users to predict which item a particular user will like best. Collaborative filtering uses a user-item matrix to generate recommendations. This matrix is populated with values that indicate a given user's preference towards a given item. It's very unlikely that a user will have interacted with every item, so in most real-life cases, the user-item matrix is very sparse.
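
As a rough illustration of what such a matrix looks like, the sketch below builds a tiny user-item matrix from (user, item, rating) triples and measures its sparsity. The ratings and the `build_user_item_matrix` helper are made up for this example. A dense NumPy array is used only for readability; for real data a sparse format such as `scipy.sparse.csr_matrix` is the more realistic choice.

```python
import numpy as np

def build_user_item_matrix(ratings, n_users, n_items):
    """Return an (n_users, n_items) array where 0 marks 'no interaction'."""
    X = np.zeros((n_users, n_items))
    for user, item, rating in ratings:
        X[user, item] = rating
    return X

# Hypothetical (user index, item index, rating) triples.
ratings = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 0, 2.0), (2, 3, 5.0)]
X = build_user_item_matrix(ratings, n_users=3, n_items=4)

# Sparsity: fraction of user-item pairs with no recorded interaction.
sparsity = 1.0 - np.count_nonzero(X) / X.size
print(f"sparsity: {sparsity:.2f}")
```

Even in this toy case more than half of the cells are empty; in real rating datasets the empty fraction is typically far higher.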
17 |
18 |
19 |
20 |
21 |
22 | Collaborative filtering can be further divided into two categories: memory-based and model-based.
23 |
24 | - **Memory-based** algorithms look at item-item, user-item, or user-user similarity using different similarity metrics such as Pearson correlation coefficient, cosine similarity, etc. This approach is easy to apply to your user-item matrix and very interpretable. However, its performance decreases as the dataset becomes more sparse.
25 | - **Model-based** algorithms use matrix factorization techniques such as Singular Value Decomposition ([SVD](https://www.wikiwand.com/en/Singular-value_decomposition)) and Non-negative Matrix Factorization ([NMF](https://www.wikiwand.com/en/Non-negative_matrix_factorization)) to extract latent/hidden, meaningful factors from the data.
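
The two approaches can be sketched side by side on a toy matrix. This is a minimal illustration under simplifying assumptions, not a production recipe: the ratings are invented, the helper names `item_cosine_similarity` and `low_rank_approximation` are hypothetical, and unrated entries are treated as zeros, which a real model-based recommender would usually handle more carefully.

```python
import numpy as np

# Hypothetical user-item ratings (rows = users, columns = items, 0 = unrated).
X = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

def item_cosine_similarity(X):
    """Memory-based: cosine similarity between every pair of item columns."""
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items with no ratings
    X_unit = X / norms
    return X_unit.T @ X_unit

def low_rank_approximation(X, n_factors):
    """Model-based: rebuild X from its top n_factors singular vectors."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    user_factors = U[:, :n_factors] * s[:n_factors]  # latent user factors
    item_factors = Vt[:n_factors, :]                 # latent item factors
    return user_factors @ item_factors

sim = item_cosine_similarity(X)
X_hat = low_rank_approximation(X, n_factors=2)
```

Here `sim[i, j]` is the similarity between items `i` and `j`, and `X_hat` fills in every cell, including the unrated ones, with a score derived from the two latent factors.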
26 |
27 | A major disadvantage of collaborative filtering is the **cold start problem**. You can only get recommendations for users and items that already have "interactions" in the user-item matrix. Collaborative filtering fails to provide personalized recommendations for brand new users or newly released items.
28 |
29 |
30 | ### Content-based Filtering
31 |
32 | Content-based filtering generates recommendations based on user and item features. Given a set of item features (movie genre, release date, country, language, etc.), it predicts how a user will rate an item based on their ratings of previous movies.
33 |
34 | Content-based filtering handles the "cold start" problem because it can provide personalized recommendations for brand new users and items, as long as their features are available.
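
One simple way to picture this is a user profile built from the features of items the user has already rated, which can then score an item nobody has rated yet. The sketch below assumes binary genre features and uses a rating-weighted average as the profile; the data and the helpers `build_user_profile` and `score_items` are hypothetical, and other profile-building schemes are equally valid.

```python
import numpy as np

# Hypothetical binary genre features: columns = [action, comedy, drama].
item_features = np.array([
    [1.0, 0.0, 0.0],  # item 0: action
    [1.0, 1.0, 0.0],  # item 1: action comedy
    [0.0, 0.0, 1.0],  # item 2: drama
    [1.0, 0.0, 1.0],  # item 3: action drama (new, no ratings yet)
])

def build_user_profile(item_features, rated_items, ratings):
    """Rating-weighted average of the feature vectors of rated items."""
    weights = np.asarray(ratings, dtype=float)
    return weights @ item_features[rated_items] / weights.sum()

def score_items(user_profile, item_features):
    """Cosine similarity between the user profile and each item."""
    item_norms = np.linalg.norm(item_features, axis=1)
    profile_norm = np.linalg.norm(user_profile)
    return item_features @ user_profile / (item_norms * profile_norm)

# The user rated items 0 and 1 highly and item 2 poorly.
profile = build_user_profile(item_features, rated_items=[0, 1, 2], ratings=[5, 4, 1])
scores = score_items(profile, item_features)
```

Item 3 has no ratings at all, yet it still receives a score because only its features are needed.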
35 |
36 |
37 |
38 |
39 | ### How do we define a user's "preference" towards an item?
40 |
41 | There are two types of feedback data:
42 |
43 | 1. Explicit feedback, which considers a user's direct response to an item (e.g., rating, like/dislike)
44 |
45 | 2. Implicit feedback, which looks at a user's indirect behaviour towards an item (e.g., number of times a user has watched a movie)
46 |
47 | Before using this data in your recommendation system, it is important to perform some data pre-processing. For example, you should normalize ratings of different users to the same scale. More information on how to normalize data in recommendation systems is described [here](https://www.cs.purdue.edu/homes/lsi/sigir04-cf-norm.pdf).
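
A simple form of such normalization is mean-centering, where each user's average rating is subtracted from their ratings so that generous and harsh raters become comparable. The sketch below is only one basic option and is not necessarily the method described in the linked paper; the ratings and the `mean_center` helper are invented for illustration.

```python
import numpy as np

# Hypothetical ratings (rows = users, columns = items); NaN marks "not rated".
R = np.array([
    [5.0, 4.0, np.nan, 5.0],  # generous rater
    [2.0, np.nan, 1.0, 3.0],  # harsh rater
])

def mean_center(R):
    """Subtract each user's mean rating, ignoring missing entries."""
    user_means = np.nanmean(R, axis=1, keepdims=True)
    return R - user_means

R_centered = mean_center(R)
```

After centering, a positive value means "above this user's average" regardless of how the user tends to use the rating scale, and missing entries stay missing.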
--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
"""
Utils
=====
"""

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def create_X(df):
    """
    Generates a sparse matrix from ratings dataframe.

    Args:
        df: pandas dataframe with 'userId', 'movieId' and 'rating' columns

    Returns:
        X: sparse matrix of shape (n_movies, n_users)
        user_mapper: dict that maps user id's to user indices
        movie_mapper: dict that maps movie id's to movie indices
        user_inv_mapper: dict that maps user indices to user id's
        movie_inv_mapper: dict that maps movie indices to movie id's
    """
    N = df['userId'].nunique()
    M = df['movieId'].nunique()

    # Map raw ids to contiguous 0-based indices.
    user_mapper = dict(zip(np.unique(df["userId"]), list(range(N))))
    movie_mapper = dict(zip(np.unique(df["movieId"]), list(range(M))))

    # Inverse maps: indices back to raw ids.
    user_inv_mapper = dict(zip(list(range(N)), np.unique(df["userId"])))
    movie_inv_mapper = dict(zip(list(range(M)), np.unique(df["movieId"])))

    user_index = [user_mapper[i] for i in df['userId']]
    item_index = [movie_mapper[i] for i in df['movieId']]

    # Rows are movies and columns are users, so each row is a movie vector.
    X = csr_matrix((df["rating"], (item_index, user_index)), shape=(M, N))

    return X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper

def find_similar_movies(movie_id, X, k, movie_mapper, movie_inv_mapper, metric='cosine', show_distance=False):
    """
    Finds k-nearest neighbours for a given movie id.

    Args:
        movie_id: id of the movie of interest
        X: utility matrix with movies as rows, as returned by create_X
        k: number of similar movies to retrieve
        movie_mapper: dict that maps movie id's to movie indices
        movie_inv_mapper: dict that maps movie indices to movie id's
        metric: distance metric for kNN calculations
        show_distance: passed to kneighbors as return_distance

    Returns:
        list of k similar movie ID's
    """
    neighbour_ids = []

    movie_ind = movie_mapper[movie_id]
    movie_vec = X[movie_ind]
    # Ask for one extra neighbour because the movie itself is returned first.
    k += 1
    kNN = NearestNeighbors(n_neighbors=k, algorithm="brute", metric=metric)
    kNN.fit(X)
    if isinstance(movie_vec, (np.ndarray)):
        movie_vec = movie_vec.reshape(1, -1)
    neighbour = kNN.kneighbors(movie_vec, return_distance=show_distance)
    # With return_distance=True, kneighbors returns (distances, indices);
    # only the indices are needed to look up movie ids.
    indices = neighbour[1] if show_distance else neighbour
    for i in range(0, k):
        n = indices.item(i)
        neighbour_ids.append(movie_inv_mapper[n])
    # Drop the first entry, which is the query movie itself.
    neighbour_ids.pop(0)
    return neighbour_ids
--------------------------------------------------------------------------------