├── .gitignore ├── Code ├── Collab filtering using PySpark (user-based recommendations).ipynb ├── Collab filtering using Surprise (algo list, cv, plot, user-based recommendations).ipynb ├── Collab filtering using pearsonR (item-based recommendations).ipynb ├── Content filtering by vectorizing on full text (tfidf and count) with word cloud.ipynb ├── Data prep - full Goodreads loading files, statistics, distributions.ipynb ├── EDA - full Goodreads authors, works, series, genres, interactions.ipynb └── Text analysis - build, clean, prep review text.ipynb ├── Final Presentation.key ├── Final Report.ipynb ├── Images ├── ALS model with rmse.png ├── Algo results.png ├── Author plots.png ├── Cross validation plot.png ├── Full Goodreads counts.png ├── Goodreads interactions counts.png ├── Hist and scatterplot.png ├── Log-log plots of interactions.png ├── Optimized algo params.png ├── PearsonR code.png ├── Pie chart of genres.png ├── PySpark sparcity.png ├── Tuned ALS model with best rmse.png ├── Word cloud of review text.png ├── count recs.png └── tfidf recs.png ├── README.md ├── books10k.csv └── ratings10k.csv /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | -------------------------------------------------------------------------------- /Code/Collab filtering using PySpark (user-based recommendations).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### starter code found at https://www.kaggle.com/vchulski/tutorial-collaborative-filtering-with-pyspark" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [ 15 | { 16 | "name": "stdout", 17 | "output_type": "stream", 18 | "text": [ 19 | "env: JOBLIB_TEMP_FOLDER=/tmp\n" 20 | ] 21 | } 22 | ], 23 | "source": [ 24 | "# start Jupyter Notebook with this command - jupyter notebook --NotebookApp.iopub_data_rate_limit=100000000\n", 25 | "import numpy as np\n", 26 | "import pandas as pd\n", 27 | "import os\n", 28 | "import gc #??? 
what's this for?\n", 29 | "\n", 30 | "%env JOBLIB_TEMP_FOLDER = /tmp" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 2, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "from pyspark.sql.functions import *\n", 40 | "from pyspark.sql.types import *\n", 41 | "from pyspark.ml.recommendation import ALS, ALSModel\n", 42 | "from pyspark.context import SparkContext\n", 43 | "from pyspark.sql.session import SparkSession\n", 44 | "from pyspark.mllib.evaluation import RegressionMetrics, RankingMetrics\n", 45 | "from pyspark.ml.evaluation import RegressionEvaluator\n", 46 | "from pyspark.ml.tuning import ParamGridBuilder, CrossValidator\n", 47 | "from pyspark import SparkFiles\n", 48 | "\n", 49 | "sc = SparkContext('local')\n", 50 | "spark = SparkSession(sc)" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 3, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "# load in data - interactions for collaborative filtering, books for content filtering (too big?)" 60 | ] 61 | }, 62 | { 63 | "cell_type": "raw", 64 | "metadata": {}, 65 | "source": [ 66 | "sp_interactions = spark.read.csv('goodreads_interactions.csv', header = True)\n", 67 | "sp_interactions.show()" 68 | ] 69 | }, 70 | { 71 | "cell_type": "raw", 72 | "metadata": {}, 73 | "source": [ 74 | "# calculate sparsity\n", 75 | "numerator = sp_interactions.select(\"rating\").count()\n", 76 | "num_users = sp_interactions.select(\"user_id\").distinct().count()\n", 77 | "num_books = sp_interactions.select(\"book_id\").distinct().count()\n", 78 | "denominator = num_users * num_books\n", 79 | "sparsity = (1.0 - (numerator * 1.0)/denominator) * 100\n", 80 | "print(\"The sp_interactions dataframe is \", \"%.2f\" % sparsity + \"% empty.\")" 81 | ] 82 | }, 83 | { 84 | "cell_type": "raw", 85 | "metadata": { 86 | "scrolled": true 87 | }, 88 | "source": [ 89 | "# Avg num ratings per book\n", 90 | "print(\"Avg num ratings per book: \")\n", 91 | 
"sp_interactions.groupBy(\"book_id\").count().select(avg(\"count\")).show()\n", 92 | "\n", 93 | "# Avg num ratings per user\n", 94 | "print(\"Avg num ratings per user: \")\n", 95 | "sp_interactions.groupBy(\"user_id\").count().select(avg(\"count\")).show()" 96 | ] 97 | }, 98 | { 99 | "cell_type": "raw", 100 | "metadata": {}, 101 | "source": [ 102 | "sp_interactions.printSchema()" 103 | ] 104 | }, 105 | { 106 | "cell_type": "raw", 107 | "metadata": {}, 108 | "source": [ 109 | "sp_interactions = sp_interactions.select(sp_interactions.user_id.cast(\"integer\"),\n", 110 | " sp_interactions.book_id.cast(\"integer\"),\n", 111 | " sp_interactions.is_read.cast(\"integer\"),\n", 112 | " sp_interactions.rating.cast(\"double\"),\n", 113 | " sp_interactions.is_reviewed.cast(\"integer\"))\n", 114 | "sp_interactions.printSchema()" 115 | ] 116 | }, 117 | { 118 | "cell_type": "raw", 119 | "metadata": {}, 120 | "source": [ 121 | "(training_data, test_data) = sp_interactions.randomSplit([0.80, 0.20], seed=307)" 122 | ] 123 | }, 124 | { 125 | "cell_type": "raw", 126 | "metadata": {}, 127 | "source": [ 128 | "# continually failed on the full dataset (cv and evaluator are defined further down)\n", 129 | "model = cv.fit(training_data)\n", 130 | "best_model = model.bestModel\n", 131 | "predictions = best_model.transform(test_data)\n", 132 | "rmse = evaluator.evaluate(predictions)\n", 133 | "\n", 134 | "print(\"**Best Model**\")\n", 135 | "print(\"RMSE = \", rmse)\n", 136 | "print(\" Rank: \", best_model.rank)\n", 137 | "print(\" MaxIter: \", best_model._java_obj.parent().getMaxIter())\n", 138 | "print(\" RegParam: \", best_model._java_obj.parent().getRegParam())" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 4, 144 | "metadata": {}, 145 | "outputs": [ 146 | { 147 | "name": "stdout", 148 | "output_type": "stream", 149 | "text": [ 150 | 
"+-------+-----------------+------------+--------+-----------+----------+-----------------+--------------------+-------------------------+--------------------+--------------------+-------------+--------------+-------------+------------------+-----------------------+---------+---------+---------+---------+---------+--------------------+--------------------+\n", 151 | "|book_id|goodreads_book_id|best_book_id| work_id|books_count| isbn| isbn13| authors|original_publication_year| original_title| title|language_code|average_rating|ratings_count|work_ratings_count|work_text_reviews_count|ratings_1|ratings_2|ratings_3|ratings_4|ratings_5| image_url| small_image_url|\n", 152 | "+-------+-----------------+------------+--------+-----------+----------+-----------------+--------------------+-------------------------+--------------------+--------------------+-------------+--------------+-------------+------------------+-----------------------+---------+---------+---------+---------+---------+--------------------+--------------------+\n", 153 | "| 1| 2767052| 2767052| 2792775| 272| 439023483|9.78043902348e+12| Suzanne Collins| 2008.0| The Hunger Games|The Hunger Games ...| eng| 4.34| 4780653| 4942365| 155254| 66715| 127936| 560092| 1481305| 2706317|https://images.gr...|https://images.gr...|\n", 154 | "| 2| 3| 3| 4640799| 491| 439554934|9.78043955493e+12|J.K. 
Rowling, Mar...| 1997.0|Harry Potter and ...|Harry Potter and ...| eng| 4.44| 4602479| 4800065| 75867| 75504| 101676| 455024| 1156318| 3011543|https://images.gr...|https://images.gr...|\n", 155 | "| 3| 41865| 41865| 3212258| 226| 316015849|9.78031601584e+12| Stephenie Meyer| 2005.0| Twilight|Twilight (Twiligh...| en-US| 3.57| 3866839| 3916824| 95009| 456191| 436802| 793319| 875073| 1355439|https://images.gr...|https://images.gr...|\n", 156 | "| 4| 2657| 2657| 3275794| 487| 61120081|9.78006112008e+12| Harper Lee| 1960.0|To Kill a Mocking...|To Kill a Mocking...| eng| 4.25| 3198671| 3340896| 72586| 60427| 117415| 446835| 1001952| 1714267|https://images.gr...|https://images.gr...|\n", 157 | "| 5| 4671| 4671| 245494| 1356| 743273567|9.78074327356e+12| F. Scott Fitzgerald| 1925.0| The Great Gatsby| The Great Gatsby| eng| 3.89| 2683664| 2773745| 51992| 86236| 197621| 606158| 936012| 947718|https://images.gr...|https://images.gr...|\n", 158 | "| 6| 11870085| 11870085|16827462| 226| 525478817|9.78052547881e+12| John Green| 2012.0|The Fault in Our ...|The Fault in Our ...| eng| 4.26| 2346404| 2478609| 140739| 47994| 92723| 327550| 698471| 1311871|https://images.gr...|https://images.gr...|\n", 159 | "| 7| 5907| 5907| 1540236| 969| 618260307| 9.7806182603e+12| J.R.R. Tolkien| 1937.0|The Hobbit or The...| The Hobbit| en-US| 4.25| 2071616| 2196809| 37653| 46023| 76784| 288649| 665635| 1119718|https://images.gr...|https://images.gr...|\n", 160 | "| 8| 5107| 5107| 3036731| 360| 316769177|9.78031676917e+12| J.D. 
Salinger| 1951.0|The Catcher in th...|The Catcher in th...| eng| 3.79| 2044241| 2120637| 44920| 109383| 185520| 455042| 661516| 709176|https://images.gr...|https://images.gr...|\n", 161 | "| 9| 960| 960| 3338963| 311|1416524797|9.78141652479e+12| Dan Brown| 2000.0| Angels & Demons |Angels & Demons ...| en-CA| 3.85| 2001311| 2078754| 25112| 77841| 145740| 458429| 716569| 680175|https://images.gr...|https://images.gr...|\n", 162 | "| 10| 1885| 1885| 3060926| 3455| 679783261|9.78067978327e+12| Jane Austen| 1813.0| Pride and Prejudice| Pride and Prejudice| eng| 4.24| 2035490| 2191465| 49152| 54700| 86485| 284852| 609755| 1155673|https://images.gr...|https://images.gr...|\n", 163 | "| 11| 77203| 77203| 3295919| 283|1594480001| 9.78159448e+12| Khaled Hosseini| 2003.0| The Kite Runner | The Kite Runner| eng| 4.26| 1813044| 1878095| 59730| 34288| 59980| 226062| 628174| 929591|https://images.gr...|https://images.gr...|\n", 164 | "| 12| 13335037| 13335037|13155899| 210| 62024035|9.78006202404e+12| Veronica Roth| 2011.0| Divergent|Divergent (Diverg...| eng| 4.24| 1903563| 2216814| 101023| 36315| 82870| 310297| 673028| 1114304|https://images.gr...|https://images.gr...|\n", 165 | "| 13| 5470| 5470| 153313| 995| 451524934|9.78045152494e+12|George Orwell, Er...| 1949.0|Nineteen Eighty-Four| 1984| eng| 4.14| 1956832| 2053394| 45518| 41845| 86425| 324874| 692021| 908229|https://images.gr...|https://images.gr...|\n", 166 | "| 14| 7613| 7613| 2207778| 896| 452284244|9.78045228424e+12| George Orwell| 1945.0|Animal Farm: A Fa...| Animal Farm| eng| 3.87| 1881700| 1982987| 35472| 66854| 135147| 433432| 698642| 648912|https://images.gr...|https://images.gr...|\n", 167 | "| 15| 48855| 48855| 3532896| 710| 553296981|9.78055329698e+12|Anne Frank, Elean...| 1947.0|Het Achterhuis: D...|The Diary of a Yo...| eng| 4.1| 1972666| 2024493| 20825| 45225| 91270| 355756| 656870| 875372|https://images.gr...|https://images.gr...|\n", 168 | "| 16| 2429135| 2429135| 1708725| 274| 
307269752|9.78030726975e+12|Stieg Larsson, Re...| 2005.0|Män som hatar kvi...|The Girl with the...| eng| 4.11| 1808403| 1929834| 62543| 54835| 86051| 285413| 667485| 836050|https://images.gr...|https://images.gr...|\n", 169 | "| 17| 6148028| 6148028| 6171458| 201| 439023491| 9.7804390235e+12| Suzanne Collins| 2009.0| Catching Fire|Catching Fire (Th...| eng| 4.3| 1831039| 1988079| 88538| 10492| 48030| 262010| 687238| 980309|https://images.gr...|https://images.gr...|\n", 170 | "| 18| 5| 5| 2402163| 376|043965548X|9.78043965548e+12|J.K. Rowling, Mar...| 1999.0|Harry Potter and ...|Harry Potter and ...| eng| 4.53| 1832823| 1969375| 36099| 6716| 20413| 166129| 509447| 1266670|https://images.gr...|https://images.gr...|\n", 171 | "| 19| 34| 34| 3204327| 566| 618346252|9.78061834626e+12| J.R.R. Tolkien| 1954.0| The Fellowship o...|The Fellowship of...| eng| 4.34| 1766803| 1832541| 15333| 38031| 55862| 202332| 493922| 1042394|https://images.gr...|https://images.gr...|\n", 172 | "| 20| 7260188| 7260188| 8812783| 239| 439023513|9.78043902351e+12| Suzanne Collins| 2010.0| Mockingjay|Mockingjay (The H...| eng| 4.03| 1719760| 1870748| 96274| 30144| 110498| 373060| 618271| 738775|https://images.gr...|https://images.gr...|\n", 173 | "+-------+-----------------+------------+--------+-----------+----------+-----------------+--------------------+-------------------------+--------------------+--------------------+-------------+--------------+-------------+------------------+-----------------------+---------+---------+---------+---------+---------+--------------------+--------------------+\n", 174 | "only showing top 20 rows\n", 175 | "\n" 176 | ] 177 | } 178 | ], 179 | "source": [ 180 | "# continued failures led me to trim back on size of dataset - choose 10k\n", 181 | "books10k = spark.read.csv('books10k.csv', header = True)\n", 182 | "books10k.show()" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 5, 188 | "metadata": { 189 | "scrolled": true 190 | }, 
191 | "outputs": [ 192 | { 193 | "name": "stdout", 194 | "output_type": "stream", 195 | "text": [ 196 | "+-------+-------+------+\n", 197 | "|user_id|book_id|rating|\n", 198 | "+-------+-------+------+\n", 199 | "| 1| 258| 5|\n", 200 | "| 2| 4081| 4|\n", 201 | "| 2| 260| 5|\n", 202 | "| 2| 9296| 5|\n", 203 | "| 2| 2318| 3|\n", 204 | "| 2| 26| 4|\n", 205 | "| 2| 315| 3|\n", 206 | "| 2| 33| 4|\n", 207 | "| 2| 301| 5|\n", 208 | "| 2| 2686| 5|\n", 209 | "| 2| 3753| 5|\n", 210 | "| 2| 8519| 5|\n", 211 | "| 4| 70| 4|\n", 212 | "| 4| 264| 3|\n", 213 | "| 4| 388| 4|\n", 214 | "| 4| 18| 5|\n", 215 | "| 4| 27| 5|\n", 216 | "| 4| 21| 5|\n", 217 | "| 4| 2| 5|\n", 218 | "| 4| 23| 5|\n", 219 | "+-------+-------+------+\n", 220 | "only showing top 20 rows\n", 221 | "\n" 222 | ] 223 | } 224 | ], 225 | "source": [ 226 | "# continued failures led me to trim back on size of dataset - choose 10k\n", 227 | "ratings10k = spark.read.csv('ratings10k.csv', header = True)\n", 228 | "ratings10k.show()" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 6, 234 | "metadata": {}, 235 | "outputs": [ 236 | { 237 | "name": "stdout", 238 | "output_type": "stream", 239 | "text": [ 240 | "The ratings10k dataframe is 98.94% empty.\n" 241 | ] 242 | } 243 | ], 244 | "source": [ 245 | "# calculate sparsity\n", 246 | "numerator = ratings10k.select(\"rating\").count()\n", 247 | "num_users = ratings10k.select(\"user_id\").distinct().count()\n", 248 | "num_books = ratings10k.select(\"book_id\").distinct().count()\n", 249 | "denominator = num_users * num_books\n", 250 | "sparsity = (1.0 - (numerator * 1.0)/denominator) * 100\n", 251 | "print(\"The ratings10k dataframe is \", \"%.2f\" % sparsity + \"% empty.\")" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": 7, 257 | "metadata": {}, 258 | "outputs": [ 259 | { 260 | "name": "stdout", 261 | "output_type": "stream", 262 | "text": [ 263 | "Avg num ratings per book: \n", 264 | "+-----------------+\n", 265 | "| 
avg(count)|\n", 266 | "+-----------------+\n", 267 | "|40.15235939404492|\n", 268 | "+-----------------+\n", 269 | "\n", 270 | "Avg num ratings per user: \n", 271 | "+-----------------+\n", 272 | "| avg(count)|\n", 273 | "+-----------------+\n", 274 | "|60.87513199577614|\n", 275 | "+-----------------+\n", 276 | "\n" 277 | ] 278 | } 279 | ], 280 | "source": [ 281 | "# Avg num ratings per book\n", 282 | "print(\"Avg num ratings per book: \")\n", 283 | "ratings10k.groupBy(\"book_id\").count().select(avg(\"count\")).show()\n", 284 | "\n", 285 | "# Avg num ratings per users\n", 286 | "print(\"Avg num ratings per user: \")\n", 287 | "ratings10k.groupBy(\"user_id\").count().select(avg(\"count\")).show()" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": 8, 293 | "metadata": { 294 | "scrolled": true 295 | }, 296 | "outputs": [ 297 | { 298 | "name": "stdout", 299 | "output_type": "stream", 300 | "text": [ 301 | "root\n", 302 | " |-- user_id: integer (nullable = true)\n", 303 | " |-- book_id: integer (nullable = true)\n", 304 | " |-- rating: float (nullable = true)\n", 305 | "\n" 306 | ] 307 | } 308 | ], 309 | "source": [ 310 | "ratings10k = ratings10k.select(ratings10k.user_id.cast(\"integer\"),\n", 311 | " ratings10k.book_id.cast(\"integer\"),\n", 312 | " ratings10k.rating.cast(\"float\"))\n", 313 | "ratings10k.printSchema()" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": 9, 319 | "metadata": {}, 320 | "outputs": [ 321 | { 322 | "name": "stdout", 323 | "output_type": "stream", 324 | "text": [ 325 | "+-------+-------+------+\n", 326 | "|user_id|book_id|rating|\n", 327 | "+-------+-------+------+\n", 328 | "| 463| 471| 4.0|\n", 329 | "| 463| 148| 0.0|\n", 330 | "| 463| 2142| 0.0|\n", 331 | "| 463| 3997| 0.0|\n", 332 | "| 463| 496| 0.0|\n", 333 | "| 463| 1580| 0.0|\n", 334 | "| 463| 2366| 0.0|\n", 335 | "| 463| 463| 0.0|\n", 336 | "| 463| 1238| 0.0|\n", 337 | "| 463| 833| 0.0|\n", 338 | "| 463| 1088| 0.0|\n", 339 | "| 
463| 6620| 0.0|\n", 340 | "| 463| 1591| 0.0|\n", 341 | "| 463| 9852| 0.0|\n", 342 | "| 463| 4101| 0.0|\n", 343 | "| 463| 3918| 0.0|\n", 344 | "| 463| 6397| 0.0|\n", 345 | "| 463| 1342| 0.0|\n", 346 | "| 463| 7253| 0.0|\n", 347 | "| 463| 3794| 0.0|\n", 348 | "+-------+-------+------+\n", 349 | "only showing top 20 rows\n", 350 | "\n" 351 | ] 352 | } 353 | ], 354 | "source": [ 355 | "# fill in every unrated (user, book) pair with a rating of 0\n", 356 | "\n", 357 | "users = ratings10k.select(\"user_id\").distinct()\n", 358 | "books = ratings10k.select(\"book_id\").distinct()\n", 359 | "\n", 360 | "# Cross join users and books, then left-join the known ratings\n", 361 | "cj = users.crossJoin(books)\n", 362 | "ratings = cj.join(ratings10k, [\"user_id\", \"book_id\"], \"left\").fillna(0)\n", 363 | "ratings.show()" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": 11, 369 | "metadata": {}, 370 | "outputs": [], 371 | "source": [ 372 | "(train, test) = ratings.randomSplit([0.80, 0.20], seed=731)" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": 12, 378 | "metadata": {}, 379 | "outputs": [ 380 | { 381 | "name": "stdout", 382 | "output_type": "stream", 383 | "text": [ 384 | "RMSE: \n" 385 | ] 386 | }, 387 | { 388 | "data": { 389 | "text/plain": [ 390 | "(None, 0.39640662340632277)" 391 | ] 392 | }, 393 | "execution_count": 12, 394 | "metadata": {}, 395 | "output_type": "execute_result" 396 | } 397 | ], 398 | "source": [ 399 | "als_model = ALS(userCol = \"user_id\", itemCol = \"book_id\", ratingCol = \"rating\",\n", 400 | " nonnegative = True,\n", 401 | " coldStartStrategy = \"drop\",\n", 402 | " implicitPrefs = False)\n", 403 | "model = als_model.fit(train)\n", 404 | "predictions = model.transform(test)\n", 405 | "evaluator = RegressionEvaluator(metricName = 'rmse', labelCol = 'rating',\n", 406 | " predictionCol = 'prediction')\n", 407 | "rmse = evaluator.evaluate(predictions)\n", 408 | "print(\"RMSE: \"), rmse" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | 
"execution_count": 13, 414 | "metadata": { 415 | "scrolled": false 416 | }, 417 | "outputs": [ 418 | { 419 | "name": "stdout", 420 | "output_type": "stream", 421 | "text": [ 422 | "+-------+-------+------+------------+\n", 423 | "|user_id|book_id|rating| prediction|\n", 424 | "+-------+-------+------+------------+\n", 425 | "| 1645| 148| 4.0| 0.049995024|\n", 426 | "| 3175| 148| 0.0| 0.05377537|\n", 427 | "| 3918| 148| 0.0| 0.1438313|\n", 428 | "| 5300| 148| 0.0| 0.065395646|\n", 429 | "| 1025| 148| 0.0| 0.051750187|\n", 430 | "| 1127| 148| 0.0| 0.07879922|\n", 431 | "| 1507| 148| 0.0| 0.16217102|\n", 432 | "| 2387| 148| 0.0| 0.017485976|\n", 433 | "| 2563| 148| 0.0| 0.01884679|\n", 434 | "| 3475| 148| 0.0| 0.03814276|\n", 435 | "| 4190| 148| 0.0| 0.064677484|\n", 436 | "| 4929| 148| 0.0| 0.03870188|\n", 437 | "| 1143| 148| 0.0| 0.1420587|\n", 438 | "| 3000| 148| 0.0|8.5675006E-4|\n", 439 | "| 808| 148| 0.0| 0.09464113|\n", 440 | "| 1265| 148| 0.0| 0.05013421|\n", 441 | "| 3098| 148| 0.0| 0.03183744|\n", 442 | "| 4078| 148| 0.0| 0.07300787|\n", 443 | "| 4684| 148| 0.0| 0.09861391|\n", 444 | "| 5223| 148| 0.0| 0.15989798|\n", 445 | "+-------+-------+------+------------+\n", 446 | "only showing top 20 rows\n", 447 | "\n" 448 | ] 449 | } 450 | ], 451 | "source": [ 452 | "predictions.show()" 453 | ] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "execution_count": 13, 458 | "metadata": {}, 459 | "outputs": [], 460 | "source": [ 461 | "# tweak model by playing with rank, MaxIter, RegParam, goal = lowest RMSE" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": 14, 467 | "metadata": {}, 468 | "outputs": [ 469 | { 470 | "name": "stdout", 471 | "output_type": "stream", 472 | "text": [ 473 | "RMSE: \n" 474 | ] 475 | }, 476 | { 477 | "data": { 478 | "text/plain": [ 479 | "(None, 0.40559786366728084)" 480 | ] 481 | }, 482 | "execution_count": 14, 483 | "metadata": {}, 484 | "output_type": "execute_result" 485 | } 486 | ], 487 | "source": [ 488 | "# 
change rank to 16 (recommended by the Goodreads paper); regParam is also raised from the default 0.1 to 1\n", 489 | "als_model2 = ALS(userCol = \"user_id\", itemCol = \"book_id\", ratingCol = \"rating\",\n", 490 | " rank = 16, maxIter = 10, regParam = 1,\n", 491 | " nonnegative = True,\n", 492 | " coldStartStrategy = \"drop\",\n", 493 | " implicitPrefs = False)\n", 494 | "model2 = als_model2.fit(train)\n", 495 | "predictions2 = model2.transform(test)\n", 496 | "evaluator = RegressionEvaluator(metricName = 'rmse', labelCol = 'rating',\n", 497 | " predictionCol = 'prediction')\n", 498 | "rmse2 = evaluator.evaluate(predictions2)\n", 499 | "print(\"RMSE: \"), rmse2" 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": 15, 505 | "metadata": {}, 506 | "outputs": [ 507 | { 508 | "name": "stdout", 509 | "output_type": "stream", 510 | "text": [ 511 | "+-------+-------+------+----------+\n", 512 | "|user_id|book_id|rating|prediction|\n", 513 | "+-------+-------+------+----------+\n", 514 | "| 1645| 148| 4.0| 0.0|\n", 515 | "| 3175| 148| 0.0| 0.0|\n", 516 | "| 3918| 148| 0.0| 0.0|\n", 517 | "| 5300| 148| 0.0| 0.0|\n", 518 | "| 1025| 148| 0.0| 0.0|\n", 519 | "| 1127| 148| 0.0| 0.0|\n", 520 | "| 1507| 148| 0.0| 0.0|\n", 521 | "| 2387| 148| 0.0| 0.0|\n", 522 | "| 2563| 148| 0.0| 0.0|\n", 523 | "| 3475| 148| 0.0| 0.0|\n", 524 | "| 4190| 148| 0.0| 0.0|\n", 525 | "| 4929| 148| 0.0| 0.0|\n", 526 | "| 1143| 148| 0.0| 0.0|\n", 527 | "| 3000| 148| 0.0| 0.0|\n", 528 | "| 808| 148| 0.0| 0.0|\n", 529 | "| 1265| 148| 0.0| 0.0|\n", 530 | "| 3098| 148| 0.0| 0.0|\n", 531 | "| 4078| 148| 0.0| 0.0|\n", 532 | "| 4684| 148| 0.0| 0.0|\n", 533 | "| 5223| 148| 0.0| 0.0|\n", 534 | "+-------+-------+------+----------+\n", 535 | "only showing top 20 rows\n", 536 | "\n" 537 | ] 538 | } 539 | ], 540 | "source": [ 541 | "predictions2.show()" 542 | ] 543 | }, 544 | { 545 | "cell_type": "code", 546 | "execution_count": 16, 547 | "metadata": {}, 548 | "outputs": [ 549 | { 550 | "name": "stdout", 551 | "output_type": 
"stream", 552 | "text": [ 553 | "Num models to be tested: 32\n" 554 | ] 555 | } 556 | ], 557 | "source": [ 558 | "param_grid = ParamGridBuilder().addGrid(als_model.rank, [5, 10, 15, 20]).addGrid(\n", 559 | " als_model.maxIter, [5, 10]).addGrid(als_model.regParam, [0.01, 0.05, 0.1, 0.15]).build()\n", 560 | "evaluator = RegressionEvaluator(metricName = \"rmse\", labelCol = \"rating\",\n", 561 | " predictionCol = \"prediction\")\n", 562 | "cv = CrossValidator(estimator = als_model,\n", 563 | " estimatorParamMaps = param_grid,\n", 564 | " evaluator = evaluator,\n", 565 | " numFolds = 5)\n", 566 | "print (\"Num models to be tested: \", len(param_grid))" 567 | ] 568 | }, 569 | { 570 | "cell_type": "code", 571 | "execution_count": 17, 572 | "metadata": {}, 573 | "outputs": [], 574 | "source": [ 575 | "modelcv = cv.fit(train)" 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": 18, 581 | "metadata": {}, 582 | "outputs": [ 583 | { 584 | "name": "stdout", 585 | "output_type": "stream", 586 | "text": [ 587 | "\n" 588 | ] 589 | } 590 | ], 591 | "source": [ 592 | "best_model = modelcv.bestModel\n", 593 | "print(type(best_model))" 594 | ] 595 | }, 596 | { 597 | "cell_type": "code", 598 | "execution_count": 20, 599 | "metadata": {}, 600 | "outputs": [ 601 | { 602 | "name": "stdout", 603 | "output_type": "stream", 604 | "text": [ 605 | "0.3622062135603919\n" 606 | ] 607 | } 608 | ], 609 | "source": [ 610 | "test_predictions = best_model.transform(test)\n", 611 | "rmse = evaluator.evaluate(test_predictions)\n", 612 | "print(rmse)" 613 | ] 614 | }, 615 | { 616 | "cell_type": "code", 617 | "execution_count": 21, 618 | "metadata": { 619 | "scrolled": true 620 | }, 621 | "outputs": [ 622 | { 623 | "name": "stdout", 624 | "output_type": "stream", 625 | "text": [ 626 | "25\n" 627 | ] 628 | } 629 | ], 630 | "source": [ 631 | "print(best_model.rank) # k value (# of latent features)" 632 | ] 633 | }, 634 | { 635 | "cell_type": "code", 636 | "execution_count": 22, 
637 | "metadata": {}, 638 | "outputs": [ 639 | { 640 | "name": "stdout", 641 | "output_type": "stream", 642 | "text": [ 643 | "10\n", 644 | "0.01\n" 645 | ] 646 | } 647 | ], 648 | "source": [ 649 | "print(best_model._java_obj.parent().getMaxIter())\n", 650 | "print(best_model._java_obj.parent().getRegParam())" 651 | ] 652 | }, 653 | { 654 | "cell_type": "code", 655 | "execution_count": null, 656 | "metadata": {}, 657 | "outputs": [], 658 | "source": [ 659 | "# best model: k = 25, maxIter = 10, regParam = 0.01, RMSE = 0.3622" 660 | ] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": 23, 665 | "metadata": { 666 | "scrolled": true 667 | }, 668 | "outputs": [ 669 | { 670 | "name": "stdout", 671 | "output_type": "stream", 672 | "text": [ 673 | "+-------+-------+------+-----------+\n", 674 | "|user_id|book_id|rating| prediction|\n", 675 | "+-------+-------+------+-----------+\n", 676 | "| 1645| 148| 4.0| 0.70989245|\n", 677 | "| 3175| 148| 0.0| 0.15258428|\n", 678 | "| 3918| 148| 0.0| 0.71268547|\n", 679 | "| 5300| 148| 0.0| 0.302438|\n", 680 | "| 1025| 148| 0.0| 0.245731|\n", 681 | "| 1127| 148| 0.0| 0.13814119|\n", 682 | "| 1507| 148| 0.0| 0.27556774|\n", 683 | "| 2387| 148| 0.0| 0.13265418|\n", 684 | "| 2563| 148| 0.0| 0.218646|\n", 685 | "| 3475| 148| 0.0| 0.22585842|\n", 686 | "| 4190| 148| 0.0| 0.69588304|\n", 687 | "| 4929| 148| 0.0| 0.15377276|\n", 688 | "| 1143| 148| 0.0| 0.58321583|\n", 689 | "| 3000| 148| 0.0|0.012677923|\n", 690 | "| 808| 148| 0.0| 0.49552304|\n", 691 | "| 1265| 148| 0.0| 0.0973807|\n", 692 | "| 3098| 148| 0.0| 0.16174576|\n", 693 | "| 4078| 148| 0.0| 0.02927421|\n", 694 | "| 4684| 148| 0.0|0.107607946|\n", 695 | "| 5223| 148| 0.0| 0.9644963|\n", 696 | "+-------+-------+------+-----------+\n", 697 | "only showing top 20 rows\n", 698 | "\n" 699 | ] 700 | } 701 | ], 702 | "source": [ 703 | "test_predictions.show()" 704 | ] 705 | }, 706 | { 707 | "cell_type": "code", 708 | "execution_count": 26, 709 | "metadata": {}, 710 | 
"outputs": [ 711 | { 712 | "name": "stdout", 713 | "output_type": "stream", 714 | "text": [ 715 | "+-------+--------------------+\n", 716 | "|user_id| recommendations|\n", 717 | "+-------+--------------------+\n", 718 | "| 1580|[[11, 1.0400147],...|\n", 719 | "| 5300|[[37, 2.0233123],...|\n", 720 | "| 1591|[[167, 0.19397956...|\n", 721 | "| 4101|[[11, 1.3917072],...|\n", 722 | "| 1342|[[476, 0.63664454...|\n", 723 | "| 2122|[[4, 1.1893904], ...|\n", 724 | "| 463|[[7, 4.087946], [...|\n", 725 | "| 833|[[94, 1.4590017],...|\n", 726 | "| 3794|[[168, 1.8770207]...|\n", 727 | "| 1645|[[11, 2.7405286],...|\n", 728 | "| 3175|[[19, 4.241493], ...|\n", 729 | "| 2366|[[205, 4.1261687]...|\n", 730 | "| 5156|[[65, 3.4123454],...|\n", 731 | "| 3997|[[10, 0.30373782]...|\n", 732 | "| 1238|[[11, 3.7416406],...|\n", 733 | "| 3918|[[50, 4.152341], ...|\n", 734 | "| 4818|[[26, 1.1737348],...|\n", 735 | "| 5518|[[125, 1.6194955]...|\n", 736 | "| 1829|[[50, 1.9392428],...|\n", 737 | "| 3749|[[168, 0.92164904...|\n", 738 | "+-------+--------------------+\n", 739 | "only showing top 20 rows\n", 740 | "\n" 741 | ] 742 | } 743 | ], 744 | "source": [ 745 | "# view recommendations\n", 746 | "userRecs = best_model.recommendForAllUsers(10)\n", 747 | "userRecs.show()" 748 | ] 749 | }, 750 | { 751 | "cell_type": "code", 752 | "execution_count": 36, 753 | "metadata": {}, 754 | "outputs": [ 755 | { 756 | "name": "stdout", 757 | "output_type": "stream", 758 | "text": [ 759 | "User 60's Ratings:\n", 760 | "+-------+-------+------+\n", 761 | "|user_id|book_id|rating|\n", 762 | "+-------+-------+------+\n", 763 | "| 2| 260| 5.0|\n", 764 | "| 2| 3753| 5.0|\n", 765 | "| 2| 9296| 5.0|\n", 766 | "| 2| 8519| 5.0|\n", 767 | "| 2| 2686| 5.0|\n", 768 | "| 2| 301| 5.0|\n", 769 | "| 2| 4081| 4.0|\n", 770 | "| 2| 33| 4.0|\n", 771 | "| 2| 26| 4.0|\n", 772 | "| 2| 2318| 3.0|\n", 773 | "| 2| 315| 3.0|\n", 774 | "| 2| 471| 0.0|\n", 775 | "| 2| 496| 0.0|\n", 776 | "| 2| 148| 0.0|\n", 777 | "| 2| 1580| 0.0|\n", 778 | 
"| 2| 1238| 0.0|\n", 779 | "| 2| 2142| 0.0|\n", 780 | "| 2| 2366| 0.0|\n", 781 | "| 2| 833| 0.0|\n", 782 | "| 2| 3997| 0.0|\n", 783 | "+-------+-------+------+\n", 784 | "only showing top 20 rows\n", 785 | "\n", 786 | "User 60s Recommendations:\n", 787 | "+-------+--------------------+\n", 788 | "|user_id| recommendations|\n", 789 | "+-------+--------------------+\n", 790 | "| 2|[[11, 0.77537215]...|\n", 791 | "+-------+--------------------+\n", 792 | "\n", 793 | "User 63's Ratings:\n", 794 | "+-------+-------+------+\n", 795 | "|user_id|book_id|rating|\n", 796 | "+-------+-------+------+\n", 797 | "| 63| 323| 5.0|\n", 798 | "| 63| 6772| 5.0|\n", 799 | "| 63| 592| 5.0|\n", 800 | "| 63| 7151| 5.0|\n", 801 | "| 63| 4475| 5.0|\n", 802 | "| 63| 8455| 5.0|\n", 803 | "| 63| 80| 5.0|\n", 804 | "| 63| 3913| 4.0|\n", 805 | "| 63| 85| 4.0|\n", 806 | "| 63| 1113| 4.0|\n", 807 | "| 63| 498| 4.0|\n", 808 | "| 63| 4531| 4.0|\n", 809 | "| 63| 6160| 4.0|\n", 810 | "| 63| 709| 4.0|\n", 811 | "| 63| 614| 4.0|\n", 812 | "| 63| 485| 4.0|\n", 813 | "| 63| 162| 4.0|\n", 814 | "| 63| 5374| 4.0|\n", 815 | "| 63| 9858| 4.0|\n", 816 | "| 63| 669| 4.0|\n", 817 | "+-------+-------+------+\n", 818 | "only showing top 20 rows\n", 819 | "\n", 820 | "User 63's Recommendations:\n", 821 | "+-------+--------------------+\n", 822 | "|user_id| recommendations|\n", 823 | "+-------+--------------------+\n", 824 | "| 63|[[58, 2.5218172],...|\n", 825 | "+-------+--------------------+\n", 826 | "\n" 827 | ] 828 | } 829 | ], 830 | "source": [ 831 | "# Look at user 2's ratings\n", 832 | "print(\"User 2's Ratings:\")\n", 833 | "ratings.filter(col(\"user_id\") == 2).sort(\"rating\", ascending = False).show()\n", 834 | "\n", 835 | "# Look at the books recommended to user 2\n", 836 | "print(\"User 2's Recommendations:\")\n", 837 | "userRecs.filter(col(\"user_id\") == 2).show()\n", 838 | "\n", 839 | "# Look at user 63's ratings\n", 840 | "print(\"User 63's Ratings:\")\n", 841 | "ratings.filter(col(\"user_id\") 
== 63).sort(\"rating\", ascending = False).show()\n", 842 | "\n", 843 | "# Look at the books recommended to user 63\n", 844 | "print(\"User 63's Recommendations:\")\n", 845 | "userRecs.filter(col(\"user_id\") == 63).show()" 846 | ] 847 | }, 848 | { 849 | "cell_type": "code", 850 | "execution_count": 50, 851 | "metadata": {}, 852 | "outputs": [ 853 | { 854 | "name": "stdout", 855 | "output_type": "stream", 856 | "text": [ 857 | "+-------+----------------+\n", 858 | "|user_id| BookRec|\n", 859 | "+-------+----------------+\n", 860 | "| 1580| [11, 1.0400147]|\n", 861 | "| 1580|[33, 0.65387654]|\n", 862 | "| 1580|[100, 0.5300547]|\n", 863 | "| 1580|[38, 0.52326894]|\n", 864 | "| 1580| [67, 0.4987593]|\n", 865 | "| 1580|[57, 0.47979054]|\n", 866 | "| 1580|[45, 0.47413552]|\n", 867 | "| 1580| [4, 0.46988013]|\n", 868 | "| 1580| [22, 0.4662242]|\n", 869 | "| 1580|[26, 0.45919847]|\n", 870 | "| 5300| [37, 2.0233123]|\n", 871 | "| 5300| [58, 1.9057353]|\n", 872 | "| 5300| [59, 1.7356001]|\n", 873 | "| 5300| [29, 1.725448]|\n", 874 | "| 5300|[138, 1.6259873]|\n", 875 | "| 5300| [50, 1.6185813]|\n", 876 | "| 5300|[102, 1.6020172]|\n", 877 | "| 5300| [15, 1.5831457]|\n", 878 | "| 5300|[117, 1.5628898]|\n", 879 | "| 5300| [85, 1.5290784]|\n", 880 | "+-------+----------------+\n", 881 | "only showing top 20 rows\n", 882 | "\n" 883 | ] 884 | } 885 | ], 886 | "source": [ 887 | "exploded_recs = spark.sql(\"SELECT user_id, explode(recommendations) AS BookRec FROM ALS_recs_temp\")\n", 888 | "exploded_recs.show()" 889 | ] 890 | } 891 | ], 892 | "metadata": { 893 | "kernelspec": { 894 | "display_name": "Python 3", 895 | "language": "python", 896 | "name": "python3" 897 | }, 898 | "language_info": { 899 | "codemirror_mode": { 900 | "name": "ipython", 901 | "version": 3 902 | }, 903 | "file_extension": ".py", 904 | "mimetype": "text/x-python", 905 | "name": "python", 906 | "nbconvert_exporter": "python", 907 | "pygments_lexer": "ipython3", 908 | "version": "3.8.3" 909 | } 910 | }, 911 | 
"nbformat": 4, 912 | "nbformat_minor": 4 913 | } 914 | -------------------------------------------------------------------------------- /Code/Data prep - full Goodreads loading files, statistics, distributions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Code from https://github.com/MengtingWan/goodreads/blob/master/samples.ipynb" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import gzip\n", 17 | "import json\n", 18 | "import re\n", 19 | "import os\n", 20 | "import sys\n", 21 | "import numpy as np\n", 22 | "import pandas as pd\n", 23 | "import matplotlib.pyplot as plt\n", 24 | "\n", 25 | "pd.options.display.float_format = '{:,}'.format\n", 26 | "%matplotlib inline" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 2, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "def load_data(file_name, head = 500):\n", 36 | " count = 0\n", 37 | " data = []\n", 38 | " with gzip.open(file_name) as fin:\n", 39 | " for l in fin:\n", 40 | " d = json.loads(l)\n", 41 | " count += 1\n", 42 | " data.append(d)\n", 43 | " \n", 44 | " # break if reaches the 100th line\n", 45 | " if (head is not None) and (count > head):\n", 46 | " break\n", 47 | " return data" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 3, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "books = load_data('goodreads_books.json.gz')\n", 57 | "authors = load_data('goodreads_book_authors.json.gz')\n", 58 | "works = load_data('goodreads_book_works.json.gz')\n", 59 | "series = load_data('goodreads_book_series.json.gz')\n", 60 | "genres = load_data('goodreads_book_genres_initial.json.gz')" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 4, 66 | "metadata": {}, 67 | "outputs": [ 68 | { 69 | 
"name": "stdout", 70 | "output_type": "stream", 71 | "text": [ 72 | " == sample record (books) ==\n" 73 | ] 74 | }, 75 | { 76 | "data": { 77 | "text/plain": [ 78 | "{'isbn': '0671015311',\n", 79 | " 'text_reviews_count': '126',\n", 80 | " 'series': ['161229'],\n", 81 | " 'country_code': 'US',\n", 82 | " 'language_code': 'eng',\n", 83 | " 'popular_shelves': [{'count': '1607', 'name': 'to-read'},\n", 84 | " {'count': '164', 'name': 'mystery'},\n", 85 | " {'count': '136', 'name': 'historical-fiction'},\n", 86 | " {'count': '41', 'name': 'historical'},\n", 87 | " {'count': '40', 'name': 'fiction'},\n", 88 | " {'count': '33', 'name': 'historical-mystery'},\n", 89 | " {'count': '25', 'name': 'series'},\n", 90 | " {'count': '24', 'name': 'mysteries'},\n", 91 | " {'count': '16', 'name': 'currently-reading'},\n", 92 | " {'count': '15', 'name': 'tudor'},\n", 93 | " {'count': '14', 'name': 'audiobook'},\n", 94 | " {'count': '14', 'name': 'england'},\n", 95 | " {'count': '13', 'name': 'historical-mysteries'},\n", 96 | " {'count': '10', 'name': 'owned'},\n", 97 | " {'count': '9', 'name': 'crime'},\n", 98 | " {'count': '9', 'name': '16th-century'},\n", 99 | " {'count': '8', 'name': 'first-in-series'},\n", 100 | " {'count': '8', 'name': 'ursula-blanchard'},\n", 101 | " {'count': '7', 'name': 'mystery-thriller'},\n", 102 | " {'count': '7', 'name': 'audio'},\n", 103 | " {'count': '7', 'name': 'mystery-historical'},\n", 104 | " {'count': '7', 'name': 'library'},\n", 105 | " {'count': '7', 'name': 'audiobooks'},\n", 106 | " {'count': '6', 'name': 'kindle'},\n", 107 | " {'count': '6', 'name': 'books-i-own'},\n", 108 | " {'count': '5', 'name': 'favorites'},\n", 109 | " {'count': '5', 'name': 'cozy-mystery'},\n", 110 | " {'count': '5', 'name': 'tudor-fiction'},\n", 111 | " {'count': '5', 'name': 'medieval'},\n", 112 | " {'count': '4', 'name': 'british'},\n", 113 | " {'count': '4', 'name': 'elizabeth'},\n", 114 | " {'count': '4', 'name': 'history'},\n", 115 | " {'count': '4', 'name': 
'elizabethan'},\n", 116 | " {'count': '4', 'name': 'fiction-historical'},\n", 117 | " {'count': '4', 'name': 'fiona-buckley'},\n", 118 | " {'count': '4', 'name': '1500s'},\n", 119 | " {'count': '3', 'name': 'ebooks'},\n", 120 | " {'count': '3', 'name': 'britain'},\n", 121 | " {'count': '3', 'name': 'maybe'},\n", 122 | " {'count': '3', 'name': 'adult-fiction'},\n", 123 | " {'count': '3', 'name': 'mystery-crime'},\n", 124 | " {'count': '3', 'name': 'mystery-suspense'},\n", 125 | " {'count': '3', 'name': 'tudor-england'},\n", 126 | " {'count': '3', 'name': 'detective'},\n", 127 | " {'count': '3', 'name': 'adult'},\n", 128 | " {'count': '3', 'name': 'novels'},\n", 129 | " {'count': '3', 'name': 'tudors'},\n", 130 | " {'count': '3', 'name': 'to-buy'},\n", 131 | " {'count': '3', 'name': 'mystery-series'},\n", 132 | " {'count': '2', 'name': 'hist-myst'},\n", 133 | " {'count': '2', 'name': 'read-in-2017'},\n", 134 | " {'count': '2', 'name': 'policier-historique'},\n", 135 | " {'count': '2', 'name': 'who-done-it'},\n", 136 | " {'count': '2', 'name': 'part-of-a-series'},\n", 137 | " {'count': '2', 'name': 'ebook'},\n", 138 | " {'count': '2', 'name': 'nook-books'},\n", 139 | " {'count': '2', 'name': 'tudor-era'},\n", 140 | " {'count': '2', 'name': 'women'},\n", 141 | " {'count': '2', 'name': 'detectives'},\n", 142 | " {'count': '2', 'name': 'cozy-mysteries'},\n", 143 | " {'count': '2', 'name': 'novel'},\n", 144 | " {'count': '2', 'name': 'review'},\n", 145 | " {'count': '2', 'name': 'my-books'},\n", 146 | " {'count': '2', 'name': 'read-in-2013'},\n", 147 | " {'count': '2', 'name': 'female-protagonist'},\n", 148 | " {'count': '2', 'name': 'british-mystery'},\n", 149 | " {'count': '2', 'name': 'historical-fic'},\n", 150 | " {'count': '2', 'name': 'women-sleuths'},\n", 151 | " {'count': '2', 'name': 'wish-list'},\n", 152 | " {'count': '2', 'name': 'audio-books'},\n", 153 | " {'count': '2', 'name': 'read-in-2011'},\n", 154 | " {'count': '2', 'name': 'audible'},\n", 155 | " 
{'count': '2', 'name': 'british-historical-fiction'},\n", 156 | " {'count': '2', 'name': 'borrowed-from-library'},\n", 157 | " {'count': '2', 'name': 'books-in-a-series'},\n", 158 | " {'count': '2', 'name': 'to-read-historical-fiction'},\n", 159 | " {'count': '2', 'name': 'library-books'},\n", 160 | " {'count': '2', 'name': 'renaissance'},\n", 161 | " {'count': '2', 'name': '2005'},\n", 162 | " {'count': '2', 'name': 'audio-book'},\n", 163 | " {'count': '2', 'name': 'read-2009'},\n", 164 | " {'count': '2', 'name': 'elizabeth-i'},\n", 165 | " {'count': '1', 'name': 'openlibrary-org'},\n", 166 | " {'count': '1', 'name': 'series-paused'},\n", 167 | " {'count': '1', 'name': 'abandoned'},\n", 168 | " {'count': '1', 'name': 'audio-version-read'},\n", 169 | " {'count': '1', 'name': 't-title'},\n", 170 | " {'count': '1', 'name': '000001-series-i-m-reading'},\n", 171 | " {'count': '1', 'name': '0000001-first-one-series'},\n", 172 | " {'count': '1', 'name': 'series-i-m-following'},\n", 173 | " {'count': '1', 'name': 'historical-mystery-monopoly-2017'},\n", 174 | " {'count': '1', 'name': '2017-11'},\n", 175 | " {'count': '1', 'name': 'rensing-center'},\n", 176 | " {'count': '1', 'name': 'cozi-s'},\n", 177 | " {'count': '1', 'name': '2017newauthor'},\n", 178 | " {'count': '1', 'name': '2017library'},\n", 179 | " {'count': '1', 'name': '2017abc'},\n", 180 | " {'count': '1', 'name': 'own-but-need-to-be-read'},\n", 181 | " {'count': '1', 'name': 'not-available'},\n", 182 | " {'count': '1', 'name': 'man-ref'}],\n", 183 | " 'asin': '',\n", 184 | " 'is_ebook': 'false',\n", 185 | " 'average_rating': '3.72',\n", 186 | " 'kindle_asin': 'B00BOTY2JQ',\n", 187 | " 'similar_books': ['612301',\n", 188 | " '713500',\n", 189 | " '166679',\n", 190 | " '27932',\n", 191 | " '4320305',\n", 192 | " '11285472',\n", 193 | " '1625740',\n", 194 | " '88918',\n", 195 | " '630303',\n", 196 | " '140554',\n", 197 | " '838133',\n", 198 | " '1142347'],\n", 199 | " 'description': 'In this compelling debut of 
her historical mystery series, Fiona Buckley introduces Ursula Blanchard, a widowed young mother who has become lady-in-waiting to Queen Elizabeth I. Armed with a sharp eye, dangerous curiosity, and uncanny intelligence, Ursula pledges...\\nTo Shield the Queen\\nRumor has linked Queen Elizabeth I to her master of horse, Robin Dudley. As gossip would have it, only his ailing wife, Amy, prevents marriage between Dudley and the Queen. To quell the idle tongues at court, the Queen dispatches Ursula Blanchard to tend to the sick woman\\'s needs. But not even Ursula can prevent the \"accident\" that takes Amy\\'s life. Did she fall or was she pushed? Was Ursula a pawn of Dudley and the Queen?\\nSuddenly Ursula finds herself at the center of the scandal, trying to protect Elizabeth as she loses her heart to a Frenchman who may be flirting with sedition against her Queen. She can trust no one, neither her lover nor her monarch, as she sets out to find the truth in a glittering court that conceals a wellspring of blood and lies.',\n", 200 | " 'format': 'Paperback',\n", 201 | " 'link': 'https://www.goodreads.com/book/show/388674.To_Shield_the_Queen',\n", 202 | " 'authors': [{'author_id': '33981', 'role': ''}],\n", 203 | " 'publisher': 'Pocket Books',\n", 204 | " 'num_pages': '336',\n", 205 | " 'publication_day': '1',\n", 206 | " 'isbn13': '9780671015312',\n", 207 | " 'publication_month': '10',\n", 208 | " 'edition_information': '',\n", 209 | " 'publication_year': '1998',\n", 210 | " 'url': 'https://www.goodreads.com/book/show/388674.To_Shield_the_Queen',\n", 211 | " 'image_url': 'https://images.gr-assets.com/books/1325787198m/388674.jpg',\n", 212 | " 'book_id': '388674',\n", 213 | " 'ratings_count': '1371',\n", 214 | " 'work_id': '2208431',\n", 215 | " 'title': 'To Shield the Queen (Ursula Blanchard, #1)',\n", 216 | " 'title_without_series': 'To Shield the Queen (Ursula Blanchard, #1)'}" 217 | ] 218 | }, 219 | "metadata": {}, 220 | "output_type": "display_data" 221 | }, 222 
| { 223 | "name": "stdout", 224 | "output_type": "stream", 225 | "text": [ 226 | " == sample record (authors) ==\n" 227 | ] 228 | }, 229 | { 230 | "data": { 231 | "text/plain": [ 232 | "{'average_rating': '3.82',\n", 233 | " 'author_id': '2905297',\n", 234 | " 'text_reviews_count': '43579',\n", 235 | " 'name': 'Lauren Kate',\n", 236 | " 'ratings_count': '907978'}" 237 | ] 238 | }, 239 | "metadata": {}, 240 | "output_type": "display_data" 241 | }, 242 | { 243 | "name": "stdout", 244 | "output_type": "stream", 245 | "text": [ 246 | " == sample record (works) ==\n" 247 | ] 248 | }, 249 | { 250 | "data": { 251 | "text/plain": [ 252 | "{'books_count': '144',\n", 253 | " 'reviews_count': '10794',\n", 254 | " 'original_publication_month': '',\n", 255 | " 'default_description_language_code': '',\n", 256 | " 'text_reviews_count': '238',\n", 257 | " 'best_book_id': '19336',\n", 258 | " 'original_publication_year': '1908',\n", 259 | " 'original_title': 'The Tale of Jemima Puddle-Duck',\n", 260 | " 'rating_dist': '5:3275|4:2562|3:1910|2:366|1:102|total:8215',\n", 261 | " 'default_chaptering_book_id': '',\n", 262 | " 'original_publication_day': '',\n", 263 | " 'original_language_id': '',\n", 264 | " 'ratings_count': '8215',\n", 265 | " 'media_type': 'book',\n", 266 | " 'ratings_sum': '33187',\n", 267 | " 'work_id': '2690022'}" 268 | ] 269 | }, 270 | "metadata": {}, 271 | "output_type": "display_data" 272 | }, 273 | { 274 | "name": "stdout", 275 | "output_type": "stream", 276 | "text": [ 277 | " == sample record (series) ==\n" 278 | ] 279 | }, 280 | { 281 | "data": { 282 | "text/plain": [ 283 | "{'numbered': 'true',\n", 284 | " 'note': '',\n", 285 | " 'description': '',\n", 286 | " 'title': 'Night Runner',\n", 287 | " 'series_works_count': '3',\n", 288 | " 'series_id': '433405',\n", 289 | " 'primary_work_count': '2'}" 290 | ] 291 | }, 292 | "metadata": {}, 293 | "output_type": "display_data" 294 | }, 295 | { 296 | "name": "stdout", 297 | "output_type": "stream", 298 | "text": [ 
299 | " == sample record (genres) ==\n" 300 | ] 301 | }, 302 | { 303 | "data": { 304 | "text/plain": [ 305 | "{'book_id': '2741850',\n", 306 | " 'genres': {'fantasy, paranormal': 1758,\n", 307 | " 'romance': 317,\n", 308 | " 'fiction': 153,\n", 309 | " 'mystery, thriller, crime': 38}}" 310 | ] 311 | }, 312 | "metadata": {}, 313 | "output_type": "display_data" 314 | } 315 | ], 316 | "source": [ 317 | "print(' == sample record (books) ==')\n", 318 | "display(np.random.choice(books))\n", 319 | "print(' == sample record (authors) ==')\n", 320 | "display(np.random.choice(authors))\n", 321 | "print(' == sample record (works) ==')\n", 322 | "display(np.random.choice(works))\n", 323 | "print(' == sample record (series) ==')\n", 324 | "display(np.random.choice(series))\n", 325 | "print(' == sample record (genres) ==')\n", 326 | "display(np.random.choice(genres))" 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "execution_count": 5, 332 | "metadata": {}, 333 | "outputs": [], 334 | "source": [ 335 | "interactions = pd.read_csv('goodreads_interactions.csv')" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": 6, 341 | "metadata": {}, 342 | "outputs": [ 343 | { 344 | "data": { 345 | "text/html": [ 346 | "
" 455 | ], 456 | "text/plain": [ 457 | " user_id book_id is_read rating is_reviewed\n", 458 | "0 0 948 1 5 0\n", 459 | "1 0 947 1 5 1\n", 460 | "2 0 946 1 5 0\n", 461 | "3 0 945 1 5 0\n", 462 | "4 0 944 1 5 0\n", 463 | "5 0 943 1 5 0\n", 464 | "6 0 942 1 5 0\n", 465 | "7 0 941 1 5 0\n", 466 | "8 0 940 1 5 0\n", 467 | "9 0 939 1 5 1" 468 | ] 469 | }, 470 | "execution_count": 6, 471 | "metadata": {}, 472 | "output_type": "execute_result" 473 | } 474 | ], 475 | "source": [ 476 | "interactions.head(10)" 477 | ] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "execution_count": 7, 482 | "metadata": {}, 483 | "outputs": [ 484 | { 485 | "data": { 486 | "text/plain": [ 487 | "{'user_id': '8842281e1d1347389f2ab93d60773d4d',\n", 488 | " 'book_id': '66406',\n", 489 | " 'review_id': '80a09fb75c693a67b392a484eac59d51',\n", 490 | " 'rating': 4,\n", 491 | " 'review_text': 'A very interesting WW2 book that chronicles how a gang of street orphans fought the nazis. I love to hear different perspectives, so this was very enjoyable.',\n", 492 | " 'date_added': 'Fri Feb 09 18:27:09 -0800 2007',\n", 493 | " 'date_updated': 'Wed Mar 22 11:45:16 -0700 2017',\n", 494 | " 'read_at': 'Thu Jul 01 00:00:00 -0700 2004',\n", 495 | " 'started_at': '',\n", 496 | " 'n_votes': 0,\n", 497 | " 'n_comments': 0}" 498 | ] 499 | }, 500 | "execution_count": 7, 501 | "metadata": {}, 502 | "output_type": "execute_result" 503 | } 504 | ], 505 | "source": [ 506 | "reviews = load_data('goodreads_reviews_dedup.json.gz')\n", 507 | "np.random.choice(reviews)" 508 | ] 509 | }, 510 | { 511 | "cell_type": "code", 512 | "execution_count": 8, 513 | "metadata": {}, 514 | "outputs": [ 515 | { 516 | "data": { 517 | "text/plain": [ 518 | "{'user_id': '01ec1a320ffded6b2dd47833f2c8e4fb',\n", 519 | " 'timestamp': '2014-05-04',\n", 520 | " 'review_sentences': [[0, '4.5 stars!'],\n", 521 | " [0, 'Maverick.'],\n", 522 | " [0, 'Yes, please!'],\n", 523 | " [0, 'This novel starts off with a bang and hooked me from the first 
page.'],\n", 524 | " [0,\n", 525 | " 'Maverick is a HOT AS HELL Navy SEAL, a womanizer but determined to change for Windsor.'],\n", 526 | " [0, 'Their relationship is romantic, sweet, scorching hot and challenging.'],\n", 527 | " [0,\n", 528 | " 'There is a great cast of supporting characters and the story is very engaging.'],\n", 529 | " [0, 'There is also just the right amount of angst.'],\n", 530 | " [0,\n", 531 | " 'Crazy Good is the perfect book when you want a steamy read with some sweet romance and emotion thrown in.'],\n", 532 | " [0, 'I LOVED Mav!'],\n", 533 | " [0,\n", 534 | " 'Rachel Robinson is an author to watch and I look forward to reading more from her!']],\n", 535 | " 'rating': 4,\n", 536 | " 'has_spoiler': False,\n", 537 | " 'book_id': '20576134',\n", 538 | " 'review_id': '5d6496c3313da68f28227cfa842a5b1e'}" 539 | ] 540 | }, 541 | "execution_count": 8, 542 | "metadata": {}, 543 | "output_type": "execute_result" 544 | } 545 | ], 546 | "source": [ 547 | "spoilers = load_data('goodreads_reviews_spoiler.json.gz')\n", 548 | "np.random.choice(spoilers)" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": 9, 554 | "metadata": {}, 555 | "outputs": [], 556 | "source": [ 557 | "np_reviews = np.array(reviews)\n", 558 | "np_spoilers = np.array(spoilers)" 559 | ] 560 | }, 561 | { 562 | "cell_type": "code", 563 | "execution_count": 10, 564 | "metadata": {}, 565 | "outputs": [], 566 | "source": [ 567 | "np_books = np.array(books)\n", 568 | "np_authors = np.array(authors)\n", 569 | "np_works = np.array(works)\n", 570 | "np_series = np.array(series)\n", 571 | "np_genres = np.array(genres)" 572 | ] 573 | }, 574 | { 575 | "cell_type": "markdown", 576 | "metadata": {}, 577 | "source": [ 578 | "### Code from https://github.com/MengtingWan/goodreads/blob/master/statistics.ipynb" 579 | ] 580 | }, 581 | { 582 | "cell_type": "code", 583 | "execution_count": 11, 584 | "metadata": {}, 585 | "outputs": [], 586 | "source": [ 587 | "def 
count_lines(file_name):\n", 588 | "    print('counting file:', file_name)\n", 589 | "    count = 0\n", 590 | "    with gzip.open(file_name) as fin:\n", 591 | "        for l in fin:\n", 592 | "            count += 1\n", 593 | "    print('done!')\n", 594 | "    return count" 595 | ] 596 | }, 597 | { 598 | "cell_type": "code", 599 | "execution_count": 12, 600 | "metadata": {}, 601 | "outputs": [ 602 | { 603 | "name": "stdout", 604 | "output_type": "stream", 605 | "text": [ 606 | "counting file: goodreads_books.json.gz\n", 607 | "done!\n", 608 | "counting file: goodreads_book_works.json.gz\n", 609 | "done!\n", 610 | "counting file: goodreads_book_authors.json.gz\n", 611 | "done!\n", 612 | "counting file: goodreads_book_series.json.gz\n", 613 | "done!\n", 614 | "counting file: goodreads_book_genres_initial.json.gz\n", 615 | "done!\n", 616 | "counting file: goodreads_reviews_dedup.json.gz\n", 617 | "done!\n", 618 | "counting file: goodreads_reviews_spoiler.json.gz\n", 619 | "done!\n" 620 | ] 621 | } 622 | ], 623 | "source": [ 624 | "n_books = count_lines('goodreads_books.json.gz')\n", 625 | "n_works = count_lines('goodreads_book_works.json.gz')\n", 626 | "n_authors = count_lines('goodreads_book_authors.json.gz')\n", 627 | "n_series = count_lines('goodreads_book_series.json.gz')\n", 628 | "n_genres = count_lines('goodreads_book_genres_initial.json.gz')\n", 629 | "n_reviews = count_lines('goodreads_reviews_dedup.json.gz')\n", 630 | "n_spoilers = count_lines('goodreads_reviews_spoiler.json.gz')" 631 | ] 632 | }, 633 | { 634 | "cell_type": "code", 635 | "execution_count": 13, 636 | "metadata": {}, 637 | "outputs": [ 638 | { 639 | "data": { 640 | "text/plain": [ 641 | "user_id 228648342\n", 642 | "book_id 228648342\n", 643 | "is_read 228648342\n", 644 | "rating 228648342\n", 645 | "is_reviewed 228648342\n", 646 | "dtype: int64" 647 | ] 648 | }, 649 | "execution_count": 13, 650 | "metadata": {}, 651 | "output_type": "execute_result" 652 | } 653 | ], 654 | "source": [ 655 | "n_interactions = 
interactions.count()\n", 656 | "n_interactions" 657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "execution_count": 14, 662 | "metadata": {}, 663 | "outputs": [ 664 | { 665 | "data": { 666 | "text/html": [ 667 | "
" 720 | ], 721 | "text/plain": [ 722 | " count\n", 723 | "# books 2,360,655.0\n", 724 | "# works 1,521,962.0\n", 725 | "# authors 829,529.0\n", 726 | "# series 400,390.0\n", 727 | "# genres 2,360,655.0\n", 728 | "# reviews 15,739,967.0\n", 729 | "# spoilers 1,378,033.0" 730 | ] 731 | }, 732 | "metadata": {}, 733 | "output_type": "display_data" 734 | } 735 | ], 736 | "source": [ 737 | "df_book_stats = pd.DataFrame([n_books, n_works, n_authors, n_series, n_genres, n_reviews, n_spoilers],\n", 738 | " dtype = float, columns = ['count'],\n", 739 | " index = ['# books', '# works', '# authors', '# series', '# genres', '# reviews', '# spoilers'])\n", 740 | "display(df_book_stats)" 741 | ] 742 | }, 743 | { 744 | "cell_type": "markdown", 745 | "metadata": {}, 746 | "source": [ 747 | "# tweak this to work with interactions.csv? or email for details json?\n", 748 | "genre_list = ['children', 'comics_graphic', 'fantasy_paranormal', 'history_biography',\n", 749 | " 'mystery_thriller_crime', 'poetry', 'romance', 'young_adult']\n", 750 | "\n", 751 | "def count_all_genres(genre_list):\n", 752 | " res = []\n", 753 | " for g in genre_list:\n", 754 | " n_book = count_lines(os.path.join('goodreads_books_'+g+'.json.gz'))\n", 755 | " n_shelve, n_read, n_rate, n_review, n_user = count_interactions(\n", 756 | " os.path.join(DIR_GENRE, 'goodreads_interactions_'+g+'.json.gz'))\n", 757 | " res.append([n_book, n_user, n_shelve, n_read, n_rate, n_review])\n", 758 | " df_stats_by_genre = pd.DataFrame(res, dtype = float, \n", 759 | " columns = ['# book', '# user', '# shelve', '# read', '# rate', '# review'],\n", 760 | " index = genre_list)\n", 761 | " return df_stats_by_genre" 762 | ] 763 | }, 764 | { 765 | "cell_type": "markdown", 766 | "metadata": {}, 767 | "source": [ 768 | "for _t in ['# shelve', '# read', '# rate', '# review']:\n", 769 | " df_stats_by_genre[_t+'/'+'book'] = df_stats_by_genre[_t]/df_stats_by_genre['# book']\n", 770 | " df_stats_by_genre[_t+'/'+'user'] = 
df_stats_by_genre[_t]/df_stats_by_genre['# user']\n", 771 | "display(df_stats_by_genre.round(2).transpose())" 772 | ] 773 | }, 774 | { 775 | "cell_type": "markdown", 776 | "metadata": {}, 777 | "source": [ 778 | "### Code from https://github.com/MengtingWan/goodreads/blob/master/distributions.ipynb" 779 | ] 780 | }, 781 | { 782 | "cell_type": "code", 783 | "execution_count": null, 784 | "metadata": {}, 785 | "outputs": [], 786 | "source": [ 787 | "print('=== first 5 records ===')\n", 788 | "display(interactions.head())\n", 789 | "print('=== duplicated records ===')\n", 790 | "display(interactions[interactions.duplicated(['user_id', 'book_id'], keep=False)])\n", 791 | "print('Ideally no rows are displayed above, which means there are no duplicate (user_id, book_id) records.')" 792 | ] 793 | }, 794 | { 795 | "cell_type": "markdown", 796 | "metadata": {}, 797 | "source": [ 798 | "#### # shelved = total number of records in file\n", 799 | "#### # read = number of records where users read the book\n", 800 | "#### # rated = number of records where users provided rating scores for the book\n", 801 | "#### # reviewed = number of records where the book review text is not empty" 802 | ] 803 | }, 804 | { 805 | "cell_type": "code", 806 | "execution_count": 15, 807 | "metadata": {}, 808 | "outputs": [ 809 | { 810 | "data": { 811 | "text/html": [ 812 | "
" 853 | ], 854 | "text/plain": [ 855 | " count\n", 856 | "# shelved 228,648,342.0\n", 857 | "# read 112,131,203.0\n", 858 | "# rated 104,551,549.0\n", 859 | "# reviewed 16,219,149.0" 860 | ] 861 | }, 862 | "metadata": {}, 863 | "output_type": "display_data" 864 | } 865 | ], 866 | "source": [ 867 | "df_stats = pd.DataFrame([interactions.shape[0],\n", 868 | " interactions['is_read'].sum(),\n", 869 | " (interactions['rating']>0).sum(),\n", 870 | " interactions['is_reviewed'].sum()], dtype = float, \n", 871 | " columns = ['count'],\n", 872 | " index = ['# shelved', '# read', '# rated', '# reviewed'])\n", 873 | "display(df_stats)" 874 | ] 875 | }, 876 | { 877 | "cell_type": "code", 878 | "execution_count": 16, 879 | "metadata": {}, 880 | "outputs": [ 881 | { 882 | "data": { 883 | "text/plain": [ 884 | "0 124,096,793.0\n", 885 | "4 37,497,451.0\n", 886 | "5 35,506,166.0\n", 887 | "3 23,307,457.0\n", 888 | "2 6,189,946.0\n", 889 | "1 2,050,529.0\n", 890 | "Name: rating, dtype: float64" 891 | ] 892 | }, 893 | "metadata": {}, 894 | "output_type": "display_data" 895 | } 896 | ], 897 | "source": [ 898 | "# rating of 0 means no rating provided\n", 899 | "df_rating_count = interactions['rating'].value_counts().astype(float)\n", 900 | "display(df_rating_count)" 901 | ] 902 | }, 903 | { 904 | "cell_type": "code", 905 | "execution_count": 17, 906 | "metadata": {}, 907 | "outputs": [], 908 | "source": [ 909 | "# visualize user/item distributions (zipf's law)\n", 910 | "# count the number of interactions for each user/item\n", 911 | "# count the freq of these numbers (ranks)\n", 912 | "# plot each type of interaction\n", 913 | "shelve_user = interactions['user_id'].value_counts().value_counts().reset_index().sort_values('index').values\n", 914 | "read_user = interactions['user_id'].loc[interactions['is_read']>0].value_counts().value_counts().reset_index().sort_values('index').values\n", 915 | "rate_user = 
interactions['user_id'].loc[interactions['rating']>0].value_counts().value_counts().reset_index().sort_values('index').values\n", 916 | "review_user = interactions['user_id'].loc[interactions['is_reviewed']>0].value_counts().value_counts().reset_index().sort_values('index').values\n", 917 | "\n", 918 | "shelve_book = interactions['book_id'].value_counts().value_counts().reset_index().sort_values('index').values\n", 919 | "read_book = interactions['book_id'].loc[interactions['is_read']>0].value_counts().value_counts().reset_index().sort_values('index').values\n", 920 | "rate_book = interactions['book_id'].loc[interactions['rating']>0].value_counts().value_counts().reset_index().sort_values('index').values\n", 921 | "review_book = interactions['book_id'].loc[interactions['is_reviewed']>0].value_counts().value_counts().reset_index().sort_values('index').values" 922 | ] 923 | }, 924 | { 925 | "cell_type": "code", 926 | "execution_count": 18, 927 | "metadata": {}, 928 | "outputs": [ 929 | { 930 | "data": { 931 | "text/plain": [ 932 | "" 933 | ] 934 | }, 935 | "execution_count": 18, 936 | "metadata": {}, 937 | "output_type": "execute_result" 938 | }, 939 | { 940 | "data": { 941 | "image/png": "\n", 942 | "text/plain": [ 943 | "
" 944 | ] 945 | }, 946 | "metadata": { 947 | "needs_background": "light" 948 | }, 949 | "output_type": "display_data" 950 | } 951 | ], 952 | "source": [ 953 | "plt.figure(figsize=(5,5))\n", 954 | "plt.loglog(shelve_user[:,0], shelve_user[:,1], label='shelve')\n", 955 | "plt.loglog(read_user[:,0], read_user[:,1], label='read')\n", 956 | "plt.loglog(rate_user[:,0], rate_user[:,1], label='rate')\n", 957 | "plt.loglog(review_user[:,0], review_user[:,1], label='review')\n", 958 | "plt.xlabel('rank')\n", 959 | "plt.ylabel('frequency')\n", 960 | "plt.title('Log-Log Plot of the Distribution of Users')\n", 961 | "plt.legend(loc='upper right')" 962 | ] 963 | }, 964 | { 965 | "cell_type": "code", 966 | "execution_count": 19, 967 | "metadata": { 968 | "scrolled": true 969 | }, 970 | "outputs": [ 971 | { 972 | "data": { 973 | "text/plain": [ 974 | "" 975 | ] 976 | }, 977 | "execution_count": 19, 978 | "metadata": {}, 979 | "output_type": "execute_result" 980 | }, 981 | { 982 | "data": { 983 | "image/png": "\n", 984 | "text/plain": [ 985 | "
" 986 | ] 987 | }, 988 | "metadata": { 989 | "needs_background": "light" 990 | }, 991 | "output_type": "display_data" 992 | } 993 | ], 994 | "source": [ 995 | "plt.figure(figsize=(5,5))\n", 996 | "plt.loglog(shelve_book[:,0], shelve_book[:,1], label='shelve')\n", 997 | "plt.loglog(read_book[:,0], read_book[:,1], label='read')\n", 998 | "plt.loglog(rate_book[:,0], rate_book[:,1], label='rate')\n", 999 | "plt.loglog(review_book[:,0], review_book[:,1], label='review')\n", 1000 | "plt.xlabel('rank')\n", 1001 | "plt.ylabel('frequency')\n", 1002 | "plt.title('Log-Log Plot of the Distribution of Books')\n", 1003 | "plt.legend(loc='upper right')" 1004 | ] 1005 | }, 1006 | { 1007 | "cell_type": "markdown", 1008 | "metadata": {}, 1009 | "source": [ 1010 | "### Code from https://github.com/MengtingWan/goodreads/blob/master/reviews.ipynb" 1011 | ] 1012 | }, 1013 | { 1014 | "cell_type": "code", 1015 | "execution_count": 20, 1016 | "metadata": {}, 1017 | "outputs": [], 1018 | "source": [ 1019 | "def count_reviews(file_name):\n", 1020 | " print('counting file:', file_name)\n", 1021 | " n_review = 0\n", 1022 | " book_set, user_set = set(), set()\n", 1023 | " print('current line: ', end='')\n", 1024 | " with gzip.open(file_name) as fin:\n", 1025 | " for l in fin:\n", 1026 | " d = json.loads(l)\n", 1027 | " if n_review % 1000000 == 0:\n", 1028 | " print(n_review, end=',')\n", 1029 | " n_review += 1\n", 1030 | " book_set.add(d['book_id'])\n", 1031 | " user_set.add(d['user_id'])\n", 1032 | " print('complete')\n", 1033 | " print('done!')\n", 1034 | " return n_review, len(book_set), len(user_set)" 1035 | ] 1036 | }, 1037 | { 1038 | "cell_type": "code", 1039 | "execution_count": 21, 1040 | "metadata": {}, 1041 | "outputs": [ 1042 | { 1043 | "name": "stdout", 1044 | "output_type": "stream", 1045 | "text": [ 1046 | "counting file: goodreads_reviews_dedup.json.gz\n", 1047 | "current line: 
0,1000000,2000000,3000000,4000000,5000000,6000000,7000000,8000000,9000000,10000000,11000000,12000000,13000000,14000000,15000000,complete\n", 1048 | "done!\n" 1049 | ] 1050 | }, 1051 | { 1052 | "data": { 1053 | "text/html": [ 1054 | "
" 1091 | ], 1092 | "text/plain": [ 1093 | " count\n", 1094 | "# review 15,739,967.0\n", 1095 | "# book 2,080,190.0\n", 1096 | "# user 465,323.0" 1097 | ] 1098 | }, 1099 | "metadata": {}, 1100 | "output_type": "display_data" 1101 | } 1102 | ], 1103 | "source": [ 1104 | "n_review, n_book, n_user = count_reviews(os.path.join('goodreads_reviews_dedup.json.gz'))\n", 1105 | "df_stats_review = pd.DataFrame([n_review, n_book, n_user], dtype=float,\n", 1106 | " columns=['count'], index=['# review', '# book', '# user'])\n", 1107 | "display(df_stats_review)" 1108 | ] 1109 | }, 1110 | { 1111 | "cell_type": "code", 1112 | "execution_count": 22, 1113 | "metadata": {}, 1114 | "outputs": [], 1115 | "source": [ 1116 | "def count_spoilers(file_name):\n", 1117 | " print('counting file:', file_name)\n", 1118 | " n_review, n_sentence, n_spoiler_review, n_spoiler_sentence = 0, 0, 0, 0\n", 1119 | " book_set, user_set = set(), set()\n", 1120 | " print('current line: ', end='')\n", 1121 | " with gzip.open(file_name) as fin:\n", 1122 | " for l in fin:\n", 1123 | " d = json.loads(l)\n", 1124 | " if n_review % 1000000 == 0:\n", 1125 | " print(n_review, end=',')\n", 1126 | " n_review += 1\n", 1127 | " for _t, _ in d['review_sentences']:\n", 1128 | " n_sentence += 1\n", 1129 | " n_spoiler_sentence += _t\n", 1130 | " n_spoiler_review += int(d['has_spoiler'])\n", 1131 | " book_set.add(d['book_id'])\n", 1132 | " user_set.add(d['user_id'])\n", 1133 | " print('complete')\n", 1134 | " print('done!')\n", 1135 | " return n_review, n_sentence, n_spoiler_review, n_spoiler_sentence, len(book_set), len(user_set)" 1136 | ] 1137 | }, 1138 | { 1139 | "cell_type": "code", 1140 | "execution_count": 23, 1141 | "metadata": {}, 1142 | "outputs": [ 1143 | { 1144 | "name": "stdout", 1145 | "output_type": "stream", 1146 | "text": [ 1147 | "counting file: goodreads_reviews_spoiler.json.gz\n", 1148 | "current line: 0,1000000,complete\n", 1149 | "done!\n" 1150 | ] 1151 | }, 1152 | { 1153 | "data": { 1154 | "text/html": 
[ 1155 | "
" 1204 | ], 1205 | "text/plain": [ 1206 | " count\n", 1207 | "# review 1,378,033.0\n", 1208 | "# sentence 17,672,655.0\n", 1209 | "# spoiler review 89,627.0\n", 1210 | "# spoiler sentence 569,724.0\n", 1211 | "# book 25,475.0\n", 1212 | "# user 18,892.0" 1213 | ] 1214 | }, 1215 | "metadata": {}, 1216 | "output_type": "display_data" 1217 | } 1218 | ], 1219 | "source": [ 1220 | "res = count_spoilers(os.path.join('goodreads_reviews_spoiler.json.gz'))\n", 1221 | "df_stats_spoiler = pd.DataFrame(res, columns=['count'], dtype=float, \n", 1222 | " index=['# review', '# sentence', '# spoiler review', '# spoiler sentence',\n", 1223 | " '# book', '# user'])\n", 1224 | "display(df_stats_spoiler)" 1225 | ] 1226 | }, 1227 | { 1228 | "cell_type": "markdown", 1229 | "metadata": {}, 1230 | "source": [ 1231 | "### Recommendation is to start with genre specific files \n", 1232 | "### https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home" 1233 | ] 1234 | } 1235 | ], 1236 | "metadata": { 1237 | "kernelspec": { 1238 | "display_name": "Python 3", 1239 | "language": "python", 1240 | "name": "python3" 1241 | }, 1242 | "language_info": { 1243 | "codemirror_mode": { 1244 | "name": "ipython", 1245 | "version": 3 1246 | }, 1247 | "file_extension": ".py", 1248 | "mimetype": "text/x-python", 1249 | "name": "python", 1250 | "nbconvert_exporter": "python", 1251 | "pygments_lexer": "ipython3", 1252 | "version": "3.8.3" 1253 | } 1254 | }, 1255 | "nbformat": 4, 1256 | "nbformat_minor": 4 1257 | } 1258 | -------------------------------------------------------------------------------- /Final Presentation.key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/e8af41082a5fbda73197d4655adf3d3fa4716a1e/Final Presentation.key -------------------------------------------------------------------------------- /Images/ALS model with rmse.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/e8af41082a5fbda73197d4655adf3d3fa4716a1e/Images/ALS model with rmse.png -------------------------------------------------------------------------------- /Images/Algo results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/e8af41082a5fbda73197d4655adf3d3fa4716a1e/Images/Algo results.png -------------------------------------------------------------------------------- /Images/Author plots.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/e8af41082a5fbda73197d4655adf3d3fa4716a1e/Images/Author plots.png -------------------------------------------------------------------------------- /Images/Cross validation plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/e8af41082a5fbda73197d4655adf3d3fa4716a1e/Images/Cross validation plot.png -------------------------------------------------------------------------------- /Images/Full Goodreads counts.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/e8af41082a5fbda73197d4655adf3d3fa4716a1e/Images/Full Goodreads counts.png -------------------------------------------------------------------------------- /Images/Goodreads interactions counts.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/e8af41082a5fbda73197d4655adf3d3fa4716a1e/Images/Goodreads interactions counts.png -------------------------------------------------------------------------------- /Images/Hist and scatterplot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/e8af41082a5fbda73197d4655adf3d3fa4716a1e/Images/Hist and scatterplot.png -------------------------------------------------------------------------------- /Images/Log-log plots of interactions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/e8af41082a5fbda73197d4655adf3d3fa4716a1e/Images/Log-log plots of interactions.png -------------------------------------------------------------------------------- /Images/Optimized algo params.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/e8af41082a5fbda73197d4655adf3d3fa4716a1e/Images/Optimized algo params.png -------------------------------------------------------------------------------- /Images/PearsonR code.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/e8af41082a5fbda73197d4655adf3d3fa4716a1e/Images/PearsonR code.png -------------------------------------------------------------------------------- /Images/Pie chart of genres.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/e8af41082a5fbda73197d4655adf3d3fa4716a1e/Images/Pie chart of 
genres.png -------------------------------------------------------------------------------- /Images/PySpark sparcity.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/e8af41082a5fbda73197d4655adf3d3fa4716a1e/Images/PySpark sparcity.png -------------------------------------------------------------------------------- /Images/Tuned ALS model with best rmse.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/e8af41082a5fbda73197d4655adf3d3fa4716a1e/Images/Tuned ALS model with best rmse.png -------------------------------------------------------------------------------- /Images/Word cloud of review text.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/e8af41082a5fbda73197d4655adf3d3fa4716a1e/Images/Word cloud of review text.png -------------------------------------------------------------------------------- /Images/count recs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/e8af41082a5fbda73197d4655adf3d3fa4716a1e/Images/count recs.png -------------------------------------------------------------------------------- /Images/tfidf recs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/e8af41082a5fbda73197d4655adf3d3fa4716a1e/Images/tfidf recs.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 
Building a Book Recommender System using Python 2 | 3 | Recorded presentation: http://youtu.be/rVDA0WA8tcQ?hd=1 4 | 5 | ## Objective 6 | Recommender systems have become a part of daily life for users of Amazon and Netflix and even social media. While some sites might use these systems to improve the customer experience (if you liked movie A, you might like movie B) or increase sales (customers who bought product C also bought product D), others are focused on customized advertising and suggestive marketing. As a book lover and former book store manager, I have always wondered where I can find good book recommendations that are both personalized to my interests and also capable of introducing me to new authors and genres. The purpose of this project is to create just such a recommender system (RS). 7 | 8 | ### Collaborative Filtering vs. Content Filtering 9 | If an RS suggests items to a user based on past interactions between users and items, that system is known as a Collaborative Filtering system. In these recommendation engines, a user-item interactions matrix is created such that every user and item pair has a space in the matrix. That space is either filled with the user's rating of that item or it is left blank. This can be used for matrix factorization or nearest neighbor classification, both of which will be addressed when we develop our models. The important thing to remember with collaborative filtering is that user id, item id, and rating are the only fields required. Collaborative models can be user-based or item-based. 10 | 11 | Content filtering, on the other hand, focuses exclusively on either the item or the user and does not need any information about interactions between the two. Instead, content filtering calculates the similarity between items or users using attributes of the items or users themselves. 
For my book data, I will use book reviews and text analysis to determine which books are most similar to books that I like and thus which books should be recommended (item based). 12 | 13 | ## Data 14 | While there are many book datasets available to use, I decided to work with Goodreads Book data. There are several full Goodreads data sets available at the [UCSD Book Graph site](https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home) and I initially worked with this data to analyze metadata for books, authors, series, genres, reviews, and the interactions between users and items. Once I began building the models, I quickly realized that my dataset was too large. Rather than limit myself to just one genre, I chose to use the [Goodreads 10k data set](https://www.kaggle.com/zygmunt/goodbooks-10k/version/4), a subset of the full Goodreads data sets. This data set contains book metadata, ratings, book tags, and book shelves. 15 | 16 | For full code, view the following files in this GitHub repository: 17 | 18 | [EDA - full Goodreads authors, works, series, genres, interactions.ipynb](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Code/EDA%20-%20full%20Goodreads%20authors%2C%20works%2C%20series%2C%20genres%2C%20interactions.ipynb) 19 | [Data prep - full Goodreads loading files, statistics, distributions.ipynb](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Code/Data%20prep%20-%20full%20Goodreads%20loading%20files%2C%20statistics%2C%20distributions.ipynb) 20 | 21 | ### Collection, Cleaning and Analysis 22 | 23 | ##### Full Goodreads Dataset 24 | ```python 25 | import gzip, json 26 | def parse(path): 27 | g = gzip.open(path, 'r') 28 | for l in g: 29 | yield json.loads(l) # each line of the file is one JSON record 30 | books = parse('Unused data/goodreads_books.json.gz') 31 | next(books) 32 | ``` 33 | This Python generator allowed me to view a full book record in order to understand which fields are represented: 34 | >{'isbn': '0312853122', 35 | 
>'text_reviews_count': '1', 36 | >'series': [], 37 | >'country_code': 'US', 38 | >'language_code': '', 39 | >'popular_shelves': [{'count': '3', 'name': 'to-read'}, 40 | > {'count': '1', 'name': 'p'}, 41 | > {'count': '1', 'name': 'collection'}, 42 | > {'count': '1', 'name': 'w-c-fields'}, 43 | > {'count': '1', 'name': 'biography'}], 44 | >'asin': '', 45 | >'is_ebook': 'false', 46 | >'average_rating': '4.00', 47 | >'kindle_asin': '', 48 | >'similar_books': [], 49 | >'description': '', 50 | >'format': 'Paperback', 51 | >'link': 'https://www.goodreads.com/book/show/5333265-w-c-fields', 52 | >'authors': [{'author_id': '604031', 'role': ''}], 53 | >'publisher': "St. Martin's Press", 54 | >'num_pages': '256', 55 | >'publication_day': '1', 56 | >'isbn13': '9780312853129', 57 | >'publication_month': '9', 58 | >'edition_information': '', 59 | >'publication_year': '1984', 60 | >'url': 'https://www.goodreads.com/book/show/5333265-w-c-fields', 61 | >'image_url': 'https://images.gr-assets.com/books/1310220028m/5333265.jpg', 62 | >'book_id': '5333265', 63 | >'ratings_count': '3', 64 | >'work_id': '5400751', 65 | >'title': 'W.C. Fields: A Life on Film', 66 | >'title_without_series': 'W.C. Fields: A Life on Film'} 67 | 68 | The same can be done for any of the large json files available at the [UCSD Book Graph](https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home) site. 69 | 70 | I conducted basic EDA on the full Goodreads data set by first looking at the size of each file. As is clear from these counts, the data sets are very, very large. 71 | 72 | ![Full Goodreads counts.png](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Images/Full%20Goodreads%20counts.png) 73 | 74 | The interactions file is also quite large and contains entries for shelved books (a Goodreads user can classify a book by adding to a shelf that they create, such as a favorites list or a "to be read" list), read books, rated books, and reviewed books. 
75 | 76 | ![Goodreads interactions counts.png](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Images/Goodreads%20interactions%20counts.png) 77 | 78 | When visualizing the user and item distributions on log-log plots, both appear to follow Zipf's law. Zipf's law is typically used in text analysis and states that the frequency of any word is inversely proportional to its rank in the frequency table. In the case of the Goodreads data, it means that a small number of books and a small number of users account for a large share of the interactions. More information on Zipf's Law can be found [here](https://en.wikipedia.org/wiki/Zipf%27s_law). 79 | 80 | ![Log-log plots of interactions](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Images/Log-log%20plots%20of%20interactions.png) 81 | 82 | The histogram below shows the distribution of the ratings in the interactions file. The scatterplot also indicates a clear relationship between the number of books read by a user and the number of books reviewed by the same user. 83 | 84 | ![Hist and scatterplot](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Images/Hist%20and%20scatterplot.png) 85 | 86 | I conducted a similar analysis of the author file, recognizing that there is quite a bit of overlap between authors who receive high ratings on average and authors who have a large number of text reviews. 87 | 88 | ![Author plots](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Images/Author%20plots.png) 89 | 90 | The genres can be plotted in a pie chart where it becomes clear that fiction is the most prevalent genre. One thing to note is that books can be tagged with multiple genres. 
91 | 92 | ![Pie chart of genres](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Images/Pie%20chart%20of%20genres.png) 93 | 94 | ##### Goodreads 10k dataset 95 | When I switched to the Goodreads 10k dataset for my model building, I conducted EDA using the pandas_profiling functions but the smaller dataset appeared to be representative of the full data. 96 | 97 | ```python 98 | import surprise 99 | import numpy as np 100 | import pandas as pd 101 | import pandas_profiling 102 | from sklearn.model_selection import train_test_split 103 | import matplotlib.pyplot as plt 104 | 105 | books10k = pd.read_csv('Data/books10k.csv') 106 | ratings10k = pd.read_csv('Data/ratings10k.csv') 107 | 108 | pandas_profiling.ProfileReport(books10k) 109 | pandas_profiling.ProfileReport(ratings10k) 110 | ``` 111 | ### Feature Selection and Engineering 112 | For collaborative filtering, the primary features necessary are user_id, book_id, and ratings. These were already present in the Goodreads 10k dataset and could be found in the ratings10k file. 113 | 114 | For content filtering, it was important to include all of the variables that might be used to determine which items are similar to one another. To prepare for this, I had to create a new feature that contained relevant text for each book and then conduct text analysis on that feature. I was prepared to use the review text from each book for this, but I did try out a few different text features. From simplest to most complex, I used book tags only, keywords only, review text only, cleaned review text, and then a full text field that contained review text + title + authors + publication date. As expected, the best results were found with the full text field. 115 | 116 | #### Text Analysis 117 | In order to prep my text data for the content based RS, I followed these steps: 118 | 1. use generator to list reviews 119 | 2. merge reviews with books 120 | 3. 
books have multiple reviews - concat all review_text by book title and group together 121 | 4. clean text (this step is optional and I determined it was best to skip) 122 | 5. add back in book metadata because I had mistakenly dropped too many columns in step 2 (because of large data file) 123 | 124 | For full code, view the following file in this github: 125 | 126 | [Text analysis - build, clean, and prep review text.ipynb](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Code/Text%20analysis%20-%20build%2C%20clean%2C%20prep%20review%20text.ipynb) 127 | 128 | I learned an important lesson when I cleaned and lemmatized the review text. Because many of the reviews contained proper names for book characters or book series, cleaning the text actually led to reduced performance and increased confusion. As a result, I chose not to clean the full text field so that my model could identify these important words and recognize that books with the same proper names are probably similar. 129 | 130 | As an analysis of the full text field, I created the following word cloud: 131 | 132 | ![Word cloud of full text](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Images/Word%20cloud%20of%20review%20text.png) 133 | 134 | ## Building and Tuning the Models 135 | 136 | ### Collaborative Filtering 137 | In Collaborative Filtering, the model is often predicting the user's rating for a given book. Because of this, test and train sets can be created and root mean square error (RMSE) can be used to calculate the error rate of the model (difference between actual rating and predicted rating). The lower the RMSE, the lower the error and the more accurate the model. 138 | 139 | #### PySpark 140 | The PySpark package in Python uses the Alternating Least Squares (ALS) method to build recommendation engines. ALS is a matrix factorization running in a parallel fashion and is built for larger scale problems. 
PySpark was created to support the collaboration of Apache Spark and Python. Because ALS uses matrix factorization, it is comparable to the SVD and SVD++ algorithms in the Surprise package. 141 | 142 | I was able to build a Collaborative Filtering RS using PySpark that performed very well according to the RMSE, but it was quite slow. The original model had a RMSE of 0.396: 143 | 144 | ![ALS model with rmse.png](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Images/ALS%20model%20with%20rmse.png) 145 | 146 | After tuning the model using a gridsearch (which took a very long time), I was able to drop the RMSE to 0.362: 147 | 148 | ![Tuned ALS model with best rmse.png](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Images/Tuned%20ALS%20model%20with%20best%20rmse.png) 149 | 150 | For full code, view the following file in this github: 151 | 152 | [Collab filtering using PySpark (user-based recommendations).ipynb](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Code/Collab%20filtering%20using%20PySpark%20(user-based%20recommendations).ipynb) 153 | 154 | #### Pandas corrwith (pearsonR correlation) 155 | Some Collaborative Filtering RS are built using a memory based method such as correlation. These models are very easy to build and interpret but the accuracy cannot be measured because the model is simply grouping like items together. 
The corrwith function in Pandas uses the Pearson correlation method to output a list of recommendations when a book is input: 156 | 157 | ![PearsonR code](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Images/PearsonR%20code.png) 158 | 159 | For full code, view the following file in this GitHub repository: 160 | 161 | [Collab filtering using pearsonR (item-based recommendations).ipynb](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Code/Collab%20filtering%20using%20pearsonR%20(item-based%20recommendations).ipynb) 162 | 163 | #### Surprise 164 | The Surprise package in Python is newer but provided all the tools I needed to test out multiple algorithms for Collaborative Filtering and then guided me through tuning the parameters and cross validating to determine the optimal model. In order to do this, I chose three versions of the data to analyze. First, I looked at the most popular books only, filtering down to those with at least 20 book ratings and at least 50 user ratings. I then created a list of midlist books by filtering down to at least 2 book ratings and 20 user ratings. Finally, I used the full list to include books that have as few as 1 book rating and 1 user rating. 165 | 166 | ![Algo results](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Images/Algo%20results.png) 167 | 168 | Using the top rated algorithms above, I chose to run a GridSearchCV on SVDpp, KNNBaseline, BaselineOnly, and KNNWithMeans. After completing the grid searches, I ran 10-fold cross validation on each of the tuned models and plotted the results. I was surprised to find that KNNBaseline ultimately performed best, especially considering that the SVDpp algorithm had been a front runner initially. The SVD approach was popularized by Simon Funk during the Netflix Prize competition, so I had expected it to be the most accurate. 
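
The k-fold cross validation and RMSE comparison described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the notebook's Surprise code: the helper names (`rmse`, `k_fold_indices`, `cross_validate_mean_baseline`), the toy ratings, and the "always predict the training mean" model are all hypothetical stand-ins for a tuned algorithm.

```python
import math
import random


def rmse(pairs):
    """Root mean square error over (actual, predicted) pairs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))


def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate_mean_baseline(ratings, k=5, seed=0):
    """Per-fold RMSE of a model that always predicts the training mean.

    `ratings` is a list of (user_id, book_id, rating) tuples. The mean
    predictor is a hypothetical placeholder for a real algorithm.
    """
    scores = []
    for held_out in k_fold_indices(len(ratings), k, seed):
        test_set = set(held_out)
        train = [r for i, (_, _, r) in enumerate(ratings) if i not in test_set]
        mean = sum(train) / len(train)
        scores.append(rmse([(ratings[i][2], mean) for i in held_out]))
    return scores


# Toy data: 6 users x 5 books with ratings between 1 and 5.
toy_ratings = [(u, b, float(1 + (u * b) % 5)) for u in range(6) for b in range(5)]
fold_scores = cross_validate_mean_baseline(toy_ratings, k=5)
print(len(fold_scores), sum(fold_scores) / len(fold_scores))
```

Each rating is held out exactly once, so averaging the per-fold RMSE values gives one score per model that can be compared across algorithms, which is the comparison the cross validation plot summarizes.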
169 | 170 | ![Optimized algo params.png](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Images/Optimized%20algo%20params.png) 171 | ![Cross validation plot.png](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Images/Cross%20validation%20plot.png) 172 | 173 | The KNNBaseline algorithm's lowest RMSE was about 0.85, much higher than the PySpark RMSE of 0.36, but the model was faster and easy to use. One downfall, though, is that the model predicts books by user so it can only be used with current readers. This is known as the 'cold start problem' because the algorithm will not provide any output until a user has built up a profile. 174 | 175 | For full code, view the following file in this github: 176 | 177 | [Collab filtering using Surprise (algo list, cv, plot, user-based recommendations).ipynb](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Code/Collab%20filtering%20using%20Surprise%20(algo%20list%2C%20cv%2C%20plot%2C%20user-based%20recommendations).ipynb) 178 | 179 | ### Content Filtering 180 | Content filtering uses cosine similarity to map attributes (in my case, text) in order to determine which items are most similar to one another. Because of this, there is no way to measure the accuracy of the models and the results are more subjective. However, I do have strong domain knowledge in this area because I used to manage a bookstore and am still fairly well read, so I was able to identify a model that I thought was better than the others. Of course, this is personal preference. 181 | 182 | #### Tfidf and Count Vectorization 183 | Here, features are extracted using term frequency-inverse document frequency (tfidf) or count vectorization. Using cosine similarity, both count and tfidf seem viable but tfidf might be more accurate since it is better at recommending the same authors and series. 
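
To make the idea behind tf-idf weighting and cosine similarity concrete, here is a minimal standard-library sketch. It is an assumption-laden illustration rather than the scikit-learn implementation used in this project: the whitespace tokenizer, the smoothed idf formula, the helper names, and the three toy documents are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; real vectorizers use richer rules.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Term count times a smoothed inverse document frequency.
        vectors.append({t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(docs, query_index, top_n=2):
    """Indices of the documents most similar to docs[query_index]."""
    vectors = tfidf_vectors(docs)
    scores = [(cosine(vectors[query_index], v), i)
              for i, v in enumerate(vectors) if i != query_index]
    return [i for _, i in sorted(scores, reverse=True)[:top_n]]


toy_docs = [
    "kvothe plays the lute at the university",
    "kvothe returns to the university and the lute",
    "a detective solves a murder in london",
]
print(most_similar(toy_docs, 0, top_n=1))  # → [1]
```

Rare terms such as character names receive higher idf weights than common words, which is one plausible reason the tfidf model tends to surface books by the same author or in the same series.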
184 | 185 | The code used to extract features from text: 186 | ```python 187 | pd.set_option('display.max_columns', 100) 188 | ds = pd.read_csv('Data/reviews10k_grouped_full.csv') 189 | 190 | # for tf-idf 191 | from sklearn.feature_extraction.text import TfidfVectorizer 192 | tfidf_vectorizer = TfidfVectorizer() 193 | tfidf_rev = tfidf_vectorizer.fit_transform((ds['full_text'])) 194 | 195 | # for count 196 | from sklearn.feature_extraction.text import CountVectorizer 197 | count_vectorizer = CountVectorizer() 198 | count_rev = count_vectorizer.fit_transform((ds['full_text'])) 199 | ``` 200 | Next, I chose a book ('The Name of the Wind' by Patrick Rothfuss, the same title chosen for the PearsonR Collaborative model) and then calculated the similarity between my chosen book and the rest of the books. From there the RS produced output for both tfidf and count vectorization models. 201 | 202 | ```python 203 | #Choose book by inserting goodreads_book_id 204 | g = 186074 205 | index = np.where(ds['goodreads_book_id'] == g)[0][0] 206 | read_book = ds.iloc[[index]] 207 | 208 | from sklearn.metrics.pairwise import cosine_similarity 209 | book_tfidf = tfidf_vectorizer.transform(read_book['full_text']) 210 | cos_similarity_tfidf = map(lambda x: cosine_similarity(book_tfidf, x), tfidf_rev) 211 | output = list(cos_similarity_tfidf) 212 | 213 | from sklearn.metrics.pairwise import cosine_similarity 214 | book_count = count_vectorizer.transform(read_book['full_text']) 215 | cos_similarity_countv = map(lambda x: cosine_similarity(book_count, x), count_rev) 216 | output2 = list(cos_similarity_countv) 217 | 218 | def get_recommendation(top, ds, scores): 219 | recommendation = pd.DataFrame(columns = ['goodreads_book_id', 'authors', 'title', 'score']) 220 | count = 0 221 | for i in top: 222 | recommendation.at[count, 'goodreads_book_id'] = ds.iloc[i, 2] 223 | recommendation.at[count, 'authors'] = ds.iloc[i, 19] 224 | recommendation.at[count, 'title'] = ds.iloc[i, 8] 225 | 
recommendation.at[count, 'score'] = scores[count] 226 | count += 1 227 | return recommendation 228 | ``` 229 | For the tfidf model: 230 | ```python 231 | # for tfidf 232 | top = sorted(range(len(output)), key=lambda i: output[i], reverse=True)[:10] 233 | list_scores = [output[i][0][0] for i in top] 234 | get_recommendation(top, ds, list_scores) 235 | ``` 236 | ![tfidf recs](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Images/tfidf%20recs.png) 237 | 238 | For the count model: 239 | ```python 240 | # for count 241 | top = sorted(range(len(output2)), key=lambda i: output2[i], reverse=True)[:10] 242 | list_scores = [output2[i][0][0] for i in top] 243 | get_recommendation(top, ds, list_scores) 244 | ``` 245 | ![count recs](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Images/count%20recs.png) 246 | 247 | For full code, view the following file in this github: 248 | 249 | [Content filtering by vectorizing on full text (tfidf and count) with word cloud.ipynb](https://github.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/blob/master/Code/Content%20filtering%20by%20vectorizing%20on%20full%20text%20(tfidf%20and%20count)%20with%20word%20cloud.ipynb) 250 | 251 | 252 | ## Conclusions 253 | ### Collaborative Filtering 254 | I created 2 user-based Collaborative Filtering RS via PySpark and Surprise. PySpark, while slower, had a much better RMSE rate and would thus be my preferred model if I wanted to recommend books by user. PearsonR was the only item-based Collaborative Filtering RS that I built but I was satisfied with the results. If I were to continue with this project, I would further investigate item-based models and look for an alternative method of evaluation. 255 | 256 | ### Content Filtering 257 | This portion of the project was most interesting to me because it allowed me to conduct text analysis. 
Looking at the results of my models, I believe that the tfidf model produces the most relevant recommendations but, as I indicated above, that is a matter of personal preference. It might be interesting to do additional research on the review text for each book, even conducting sentiment analysis to determine what percentage of reviews are positive vs. negative. My interaction with the data led me to believe that most people write positive reviews, but it could be helpful to identify the negative reviews and use them to negatively weight books for recommendations. I could also conduct a live survey in which users try out the content filtering models and rate them, allowing me to better measure their success. One final thought for future investigation involves image classification. Goodreads provides a URL for the cover of most of its books, so it could be interesting to see whether a model could be trained on the images to predict genre or even recommend titles.

In all, I felt that this was a good choice for a practicum project. The problem was interesting to me and also advanced enough to challenge me to learn more about text analysis and the machine learning algorithms used in recommendation engines. The data did involve some cleaning and prep, especially since I had to change datasets in Week 3, but it was not so time-consuming that I could not spend adequate time on building and tuning the models. I was also able to improve my Python skills and learned how to use several packages that were new to me (Surprise, pandas_profiling, and PySpark, with which I had only limited experience).
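
To make the similarity step behind the content filtering models more concrete, here is a minimal standard-library sketch of count-vector cosine similarity and top-n ranking. It is not the notebook's scikit-learn code: the helper names (`count_vector`, `cosine`, `top_similar`) and the toy texts are hypothetical, tokenization is a plain whitespace split, and, unlike the code above, it excludes the chosen book from its own results.

```python
import math
from collections import Counter


def count_vector(text):
    """Bag-of-words term counts for one document (lowercased, whitespace split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def top_similar(query_index, texts, n=10):
    """Return (index, score) pairs for the n texts most similar to the query text."""
    vectors = [count_vector(t) for t in texts]
    query = vectors[query_index]
    scored = [
        (i, cosine(query, v))
        for i, v in enumerate(vectors)
        if i != query_index  # assumption: skip the query book itself
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical toy review texts, not taken from the Goodreads data.
toy_texts = [
    "magic university music story",
    "magic story dragons",
    "cooking recipes kitchen",
]
ranked = top_similar(0, toy_texts, n=2)
print(ranked)
```

A tfidf variant would follow the same ranking logic but reweight each term count by how rare the term is across all documents before the cosine is taken, which is why the two models can rank the same books differently.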

## References:
* https://heartbeat.fritz.ai/recommender-systems-with-python-part-i-content-based-filtering-5df4940bd831
* https://github.com/ArmandDS/jobs_recommendations/blob/master/job_analysis_content_recommendation.ipynb
* https://github.com/MengtingWan/goodreads
* https://github.com/NicolasHug/Surprise/blob/master/examples/top_n_recommendations.py
* https://github.com/nikitaa30/Content-based-Recommender-System/blob/master/recommender_system.py
* https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Recommender%20Systems%20-%20The%20Fundamentals.ipynb
* https://medium.com/@armandj.olivares/building-nlp-content-based-recommender-systems-b104a709c042
* https://medium.com/@chhavi.saluja1401/recommendation-systems-made-simple-b5a79cac8862
* https://stackabuse.com/creating-a-simple-recommender-system-in-python-using-pandas/
* https://stackoverflow.com/questions/39303912/tfidfvectorizer-in-scikit-learn-valueerror-np-nan-is-an-invalid-document
* https://towardsdatascience.com/collaborative-filtering-based-recommendation-systems-exemplified-ecbffe1c20b1
* https://towardsdatascience.com/how-did-we-build-book-recommender-systems-in-an-hour-the-fundamentals-dfee054f978e
* https://towardsdatascience.com/introduction-to-recommender-systems-6c66cf15ada
* https://towardsdatascience.com/my-journey-to-building-book-recommendation-system-5ec959c41847
* https://towardsdatascience.com/recommendation-systems-models-and-evaluation-84944a84fb8e
* https://towardsdatascience.com/various-implementations-of-collaborative-filtering-100385c6dfe0
* https://www.kaggle.com/robottums/hybrid-recommender-systems-with-surprise
* https://www.kaggle.com/vchulski/tutorial-collaborative-filtering-with-pyspark
* https://www.tutorialspoint.com/change-data-type-for-one-or-more-columns-in-pandas-dataframe-1
* Mengting Wan, Julian McAuley, "Item Recommendation on Monotonic Behavior Chains", in RecSys'18.
* Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley, "Fine-Grained Spoiler Detection from Large-Scale Review Corpora", in ACL'19.

--------------------------------------------------------------------------------