├── .gitignore ├── CHANGELOG.md ├── LICENSE ├── README.md ├── assets └── demo.gif ├── goodreads_to_sqlite ├── __init__.py ├── cli.py └── utils.py └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | .venv 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | venv 6 | .eggs 7 | .pytest_cache 8 | *.egg-info 9 | *.db 10 | auth.json 11 | build/ 12 | dist/ 13 | *.bak 14 | -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | ## v0.5 2 | 3 | - Upgrade dependencies. (9bc9f17) 4 | 5 | ## v0.4 6 | 7 | - Add this changelog. (d66b61f) 8 | - Add initial progress message to avoid impression of idleness. (cd75582) 9 | 10 | ## v0.3 11 | 12 | - Always save user IDs in shelves table. (b5fbbbe) 13 | - Accept user ID in URL format. (08aac71) 14 | - Allow to import user profiles from author URLs. (69dd5d8) 15 | 16 | ## v0.2 17 | 18 | - Adds web-scraping to avoid missing "read_at" metadata. (282218a) 19 | - Fixes an import bug. (7cd40af) 20 | - Ignore database column ordering. (54cd85b) 21 | 22 | ## v0.1 23 | 24 | Initial release. 25 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright 2019 Tobias Kunze 2 | 3 | Licensed under the Apache License, Version 2.0 (the "License"); 4 | you may not use this file except in compliance with the License. 5 | You may obtain a copy of the License at 6 | 7 | http://www.apache.org/licenses/LICENSE-2.0 8 | 9 | Unless required by applicable law or agreed to in writing, software 10 | distributed under the License is distributed on an "AS IS" BASIS, 11 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | See the License for the specific language governing permissions and 13 | limitations under the License. 14 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # goodreads-to-sqlite 2 | 3 | [![PyPI](https://img.shields.io/pypi/v/goodreads-to-sqlite.svg)](https://pypi.org/project/goodreads-to-sqlite/) 4 | [![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/rixx/goodreads-to-sqlite/blob/master/LICENSE) 5 | 6 | Save data from Goodreads to a SQLite database. Can save all your public shelves and reviews, and also the public reviews 7 | and shelves of other people. 8 | 9 | ![Demo](./assets/demo.gif) 10 | 11 | ## How to install 12 | 13 | $ pip install goodreads-to-sqlite 14 | 15 | Add the `-U` flag to update. Change notes can be found in the ``CHANGELOG`` file, next to this README. 16 | 17 | ## Authentication 18 | 19 | Create a Goodreads developer token: https://www.goodreads.com/api/keys 20 | 21 | Run this command and paste in your token and your profile URL: 22 | 23 | $ goodreads-to-sqlite auth 24 | 25 | This will create a file called `auth.json` in your current directory containing the required value. To save the file at 26 | a different path or filename, use the `--auth=myauth.json` option. 27 | 28 | ## Retrieving books 29 | 30 | The `books` command retrieves all of the books and reviews/ratings belonging to you: 31 | 32 | $ goodreads-to-sqlite books goodreads.db 33 | 34 | Note that your Goodreads profile must be public in order for this to work - if 35 | it is not already, you can enable this by visiting 36 | https://www.goodreads.com/user/edit?ref=nav_profile_settings and selecting 37 | "anyone (including search engines)" within the "Settings" tab. 38 | 39 | You can also specify the user to target, to fetch books on public shelves of other users. Please provide either the user ID 40 | (the numerical part of a user's profile URL), or the name of their vanity URL. 41 | 42 | $ goodreads-to-sqlite books goodreads.db rixx 43 | 44 | Sometime in 2018 or 2017, Goodreads started leaving out some "read_at" timestamps in their API. If you want to include 45 | these datapoints regardless, you can add the `--scrape` parameter, and the dates will be scraped from the website. 46 | This will take a bit longer, by maybe a minute depending on the size of your library. 47 | 48 | $ goodreads-to-sqlite books goodreads.db --scrape 49 | 50 | The `auth.json` file is used by default for authentication. You can point to a different location of `auth.json` using 51 | `-a`: 52 | 53 | $ goodreads-to-sqlite books goodreads.db rixx -a /path/to/auth.json 54 | 55 | ## Limitations 56 | 57 | - The order of books in shelves is not exposed in the API, so we cannot determine the order of the to-read list. 58 | - Goodreads also offers a CSV export, which is currently not supported as an input format. 59 | - Since the Goodreads API is a bit slow, and we are restricted to one request per second, for larger libraries the 60 | import can take a couple of minutes. 61 | - The script currently re-syncs the entire library instead of just looking at newly changed data, to make sure we don't 62 | lose information after aborted syncs. 63 | 64 | ## Thanks 65 | 66 | This package is heavily inspired by [github-to-sqlite](https://github.com/dogsheep/github-to-sqlite/) by [Simon 67 | Willison](https://simonwillison.net/2019/Oct/7/dogsheep/). 68 | 69 | The terminal recording above was made with [ASCIInema](https://asciinema.org/a/WT6bfxoFP3IlgeX8PO6FHDdDx). 70 | -------------------------------------------------------------------------------- /assets/demo.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rixx/goodreads-to-sqlite/1e438091e1ff2a1eeda48620bf6481719a9208c4/assets/demo.gif -------------------------------------------------------------------------------- /goodreads_to_sqlite/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rixx/goodreads-to-sqlite/1e438091e1ff2a1eeda48620bf6481719a9208c4/goodreads_to_sqlite/__init__.py -------------------------------------------------------------------------------- /goodreads_to_sqlite/cli.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import json 3 | import pathlib 4 | import sys 5 | 6 | import click 7 | import sqlite_utils 8 | 9 | from goodreads_to_sqlite import utils 10 | 11 | 12 | @click.group() 13 | @click.version_option() 14 | def cli(): 15 | "Save data from Goodreads to a SQLite database" 16 | 17 | 18 | @cli.command() 19 | @click.option( 20 | "-a", 21 | "--auth", 22 | type=click.Path(file_okay=True, dir_okay=False, allow_dash=False), 23 | default="auth.json", 24 | help="Path to save tokens to, defaults to ./auth.json.", 25 | ) 26 | def auth(auth): 27 | "Save authentication credentials to a JSON file" 28 | auth_data = {} 29 | if pathlib.Path(auth).exists(): 30 | auth_data = json.load(open(auth)) 31 | saved_user_id = auth_data.get("goodreads_user_id") 32 | click.echo( 33 | "Create a Goodreads developer key at https://www.goodreads.com/api/keys and paste it here:" 34 | ) 35 | personal_token = click.prompt("Developer key") 36 | click.echo() 37 | click.echo( 38 | "Please enter your Goodreads user ID (numeric) or just paste your Goodreads profile URL." 39 | ) 40 | user_id = click.prompt("User-ID or URL", default=saved_user_id) 41 | user_id = user_id.strip("/").split("/")[-1].split("-")[0] 42 | if not user_id.isdigit(): 43 | raise Exception( 44 | "Your user ID has to be a number! {} does not look right".format(user_id) 45 | ) 46 | auth_data["goodreads_personal_token"] = personal_token 47 | auth_data["goodreads_user_id"] = user_id 48 | open(auth, "w").write(json.dumps(auth_data, indent=4) + "\n") 49 | auth_suffix = (" -a " + auth) if auth != "auth.json" else "" 50 | click.echo() 51 | click.echo( 52 | "Your authentication credentials have been saved to {}. You can now import books by running".format( 53 | auth 54 | ) 55 | ) 56 | click.echo() 57 | click.echo(" goodreads-to-sqlite books books.db" + auth_suffix + " [username]") 58 | click.echo() 59 | 60 | 61 | @cli.command() 62 | @click.argument( 63 | "db_path", 64 | type=click.Path(file_okay=True, dir_okay=False, allow_dash=False), 65 | required=True, 66 | ) 67 | @click.option( 68 | "-a", 69 | "--auth", 70 | type=click.Path(file_okay=True, dir_okay=False, allow_dash=False), 71 | default="auth.json", 72 | help="Path to save tokens to, defaults to auth.json", 73 | ) 74 | @click.argument("username", required=False) 75 | @click.option( 76 | "-s", 77 | "--scrape", 78 | is_flag=True, 79 | help="Scrape missing data (like date_read) from the web interface. Slow.", 80 | ) 81 | def books(db_path, auth, username, scrape): 82 | """Save books for a specified user, e.g. rixx""" 83 | db = sqlite_utils.Database(db_path) 84 | try: 85 | data = json.load(open(auth)) 86 | token = data["goodreads_personal_token"] 87 | user_id = data["goodreads_user_id"] 88 | except (KeyError, FileNotFoundError): 89 | utils.error( 90 | "Cannot find authentication data, please run goodreads_to_sqlite auth!" 91 | ) 92 | 93 | click.secho(f"Read credentials for user ID {user_id}.", fg="green") 94 | if username: 95 | user_id = username if username.isdigit() else utils.fetch_user_id(username) 96 | 97 | utils.fetch_user_and_shelves(user_id, token, db=db) 98 | utils.fetch_books(db, user_id, token, scrape=scrape) 99 | -------------------------------------------------------------------------------- /goodreads_to_sqlite/utils.py: -------------------------------------------------------------------------------- 1 | import datetime as dt 2 | import sys 3 | import xml.etree.ElementTree as ET 4 | from contextlib import suppress 5 | 6 | import bs4 7 | import click 8 | import dateutil.parser 9 | import requests 10 | from tqdm import tqdm 11 | 12 | BASE_URL = "https://www.goodreads.com/" 13 | 14 | 15 | def error(message): 16 | click.secho(message, bold=True, fg="red") 17 | sys.exit(-1) 18 | 19 | 20 | def fetch_books(db, user_id, token, scrape=False): 21 | """Fetches a user's books and reviews from the public Goodreads API. 22 | 23 | Technically we are rate-limited to one request per second, but since we are not 24 | running in parallel, and the Goodreads API responds way slower than that, we are 25 | reliably in the clear.""" 26 | url = BASE_URL + "review/list/{}.xml".format(user_id) 27 | params = { 28 | "key": token, 29 | "v": "2", 30 | "per_page": "200", 31 | "sort": "date_updated", 32 | "page": 0, 33 | } 34 | end = -1 35 | total = 0 36 | books = dict() 37 | authors = dict() 38 | reviews = dict() 39 | progress_bar = None 40 | 41 | while end < total: 42 | params["page"] += 1 43 | response = requests.get(url, data=params) 44 | response.raise_for_status() 45 | root = ET.fromstring(response.content.decode()) 46 | review_data = root.find("reviews") 47 | end = int(review_data.attrib["end"]) 48 | total = int(review_data.attrib["total"]) 49 | 50 | if progress_bar is None: 51 | progress_bar = tqdm( 52 | desc="Fetching books", total=int(review_data.attrib.get("total")) 53 | ) 54 | for review in review_data: 55 | book_data = review.find("book") 56 | book_authors = [] 57 | 58 | for author in book_data.find("authors"): 59 | author_id = author.find("id").text 60 | author = _get_author_from_data(author) 61 | authors[author_id] = author 62 | book_authors.append(author) 63 | 64 | book_id = book_data.find("id").text 65 | books[book_id] = _get_book_from_data(book_data, book_authors) 66 | 67 | review_id = review.find("id").text 68 | reviews[review_id] = _get_review_from_data(review, user_id) 69 | progress_bar.update(1) 70 | progress_bar.close() 71 | 72 | if scrape is True: 73 | scrape_data(user_id, reviews) 74 | 75 | save_authors(db, list(authors.values())) 76 | save_books(db, list(books.values())) 77 | save_reviews(db, list(reviews.values())) 78 | 79 | 80 | def scrape_data(user_id, reviews): 81 | relevant_ids = { 82 | review_id 83 | for review_id, review in reviews.items() 84 | if "read_at" not in review 85 | and any(shelf["name"] == "read" for shelf in review["shelves"]) 86 | } 87 | url = BASE_URL + "review/list/{}".format(user_id) 88 | params = { 89 | "utf8": "✓", 90 | "shelf": "read", 91 | "per_page": "100", # Maximum allowed page size 92 | "sort": "date_updated", 93 | "page": 0, 94 | } 95 | date_counter = 0 96 | progress_bar = None 97 | while True: 98 | params["page"] += 1 99 | response = requests.get(url, data=params) 100 | response.raise_for_status() 101 | soup = bs4.BeautifulSoup(response.content.decode(), "html.parser") 102 | if progress_bar is None: 103 | read_shelf = soup.select("a.selectedShelf")[0].text 104 | total = int(read_shelf[read_shelf.find("(") :].strip("()")) 105 | progress_bar = tqdm(desc="Scraping books", total=total) 106 | rows = soup.select("table#books tbody tr") 107 | for row in rows: 108 | review_id = row.attrs["id"][len("review_") :] 109 | if review_id in relevant_ids: 110 | date = row.select(".date_read_value") 111 | if date: 112 | reviews[review_id]["read_at"] = dateutil.parser.parse( 113 | date[0].text, default=dt.date(2019, 1, 1) 114 | ) 115 | date_counter += 1 116 | progress_bar.update(1) 117 | if not soup.select("a[rel=next]") or progress_bar.n >= progress_bar.total: 118 | break 119 | progress_bar.close() 120 | click.echo("Found {} previously missing read dates.".format(date_counter)) 121 | 122 | 123 | def save_authors(db, authors): 124 | total = len(authors) 125 | progress_bar = tqdm(total=total, desc="Saving authors") 126 | db["authors"].insert_all(authors, pk="id", replace=True) 127 | progress_bar.update(total) 128 | progress_bar.close() 129 | 130 | 131 | def save_books(db, books): 132 | authors_table = db.table("authors", pk="id") 133 | for book in tqdm(books, desc="Saving books "): 134 | authors = book.pop("authors", []) 135 | db["books"].insert(book, pk="id", replace=True).m2m(authors_table, authors) 136 | 137 | 138 | def save_reviews(db, reviews): 139 | shelves_table = db.table("shelves", pk="id") 140 | for review in tqdm(reviews, desc="Saving reviews"): 141 | shelves = review.pop("shelves", []) 142 | db["reviews"].insert( 143 | review, 144 | pk="id", 145 | foreign_keys=(("book_id", "books", "id"), ("user_id", "users", "id")), 146 | alter=True, 147 | replace=True, 148 | ).m2m(shelves_table, shelves) 149 | 150 | 151 | def _get_author_from_data(author): 152 | return {"id": author.find("id").text, "name": author.find("name").text} 153 | 154 | 155 | def _get_book_from_data(book, authors): 156 | series = None 157 | series_position = None 158 | title = book.find("title").text 159 | title_series = book.find("title_without_series").text 160 | if title != title_series: 161 | series_with_position = title[len(title_series) :].strip(" ()") 162 | if "#" in series_with_position: 163 | series, series_position = series_with_position.split("#", maxsplit=1) 164 | elif "Book" in series_with_position: 165 | series, series_position = series_with_position.split("Book", maxsplit=1) 166 | else: 167 | series = series_with_position 168 | series_position = "" 169 | series = series.strip(", ") 170 | series_position = series_position.strip(", #") 171 | title = title_series 172 | publication_year = book.find("publication_year").text 173 | publication_date = None 174 | if publication_year: 175 | publication_date = dt.date( 176 | int(book.find("publication_year").text), 177 | int(book.find("publication_month").text or 1), 178 | int(book.find("publication_day").text or 1), 179 | ) 180 | return { 181 | "id": book.find("id").text, 182 | "isbn": book.find("isbn").text, 183 | "isbn13": book.find("isbn13").text, 184 | "title": title, 185 | "series": series, 186 | "series_position": series_position, 187 | "pages": book.find("num_pages").text, 188 | "publisher": book.find("publisher").text, 189 | "publication_date": publication_date, 190 | "description": book.find("description").text, 191 | "image_url": book.find("image_url").text, 192 | "authors": authors, 193 | } 194 | 195 | 196 | def _get_review_from_data(review, user_id): 197 | rating = review.find("rating").text 198 | rating = int(rating) or None if rating else None 199 | result = { 200 | "id": review.find("id").text, 201 | "book_id": review.find("book").find("id").text, 202 | "user_id": user_id, 203 | "rating": rating, 204 | "text": (review.find("body").text or "").strip(), 205 | "shelves": [ 206 | { 207 | "name": shelf.attrib.get("name"), 208 | "id": shelf.attrib.get("id"), 209 | "user_id": user_id, 210 | } 211 | for shelf in (review.find("shelves") or []) 212 | ], 213 | } 214 | for key in ("started_at", "read_at", "date_added", "date_updated"): 215 | date = maybe_date(review.find(key).text) 216 | if date: 217 | result[key] = date 218 | return result 219 | 220 | 221 | def fetch_user_id(username, force_online=False, db=None) -> str: 222 | """We can look up a user ID given a (public vanity) username. 223 | 224 | We go to that profile page, and observe the redirect target. If we have the 225 | user in question in our database, we just return the known value, since 226 | user IDs are assumed to be stable. The vanity URL redirects to a URL ending 227 | in -.""" 228 | if not force_online and db: 229 | user = db["users"].get(username=username) 230 | if user: 231 | return user.id 232 | click.echo("Fetching user details.") 233 | url = username if username.startswith("http") else BASE_URL + username 234 | response = requests.get(url) 235 | response.raise_for_status() 236 | if "/author/" in username: 237 | soup = bs4.BeautifulSoup(response.content.decode(), "html.parser") 238 | url = soup.select("link[rel=alternate][title=Bookshelves]")[0].attrs["href"] 239 | else: 240 | url = response.request.url 241 | result = url.strip("/").split("/")[-1].split("-")[0] 242 | if not result.isdigit(): 243 | error("Cannot find user ID for {}".format(response.request.url)) 244 | return result 245 | 246 | 247 | def fetch_user_and_shelves(user_id, token, db) -> dict: 248 | with suppress(TypeError): 249 | user = db["users"].get(id=user_id) 250 | shelves = db["shelves"].rows_where("user_id = ?", [user_id]) 251 | if user and all(user.values()) and shelves: 252 | user["shelves"] = shelves 253 | return user 254 | click.secho("Fetching shelves.") 255 | response = requests.get( 256 | BASE_URL + "user/show/{}.xml".format(user_id), {"key": token} 257 | ) 258 | response.raise_for_status() 259 | to_root = ET.fromstring(response.content.decode()) 260 | user = to_root.find("user") 261 | shelves = user.find("user_shelves") 262 | if not shelves: 263 | error("This user's shelves and reviews are private, and cannot be fetched.") 264 | user = { 265 | "id": user.find("id").text, 266 | "name": user.find("name").text, 267 | "username": user.find("user_name").text, 268 | "shelves": [ 269 | {"id": shelf.find("id").text, "name": shelf.find("name").text} 270 | for shelf in shelves 271 | ], 272 | } 273 | save_user(db, user) 274 | 275 | 276 | def save_user(db, user): 277 | save_data = {key: user.get(key) for key in ["id", "name", "username"]} 278 | pk = db["users"].insert(save_data, pk="id", alter=True, replace=True).last_pk 279 | for shelf in user.get("shelves", []): 280 | save_shelf(db, shelf, user["id"]) 281 | return pk 282 | 283 | 284 | def save_shelf(db, shelf, user_id): 285 | save_data = {key: shelf.get(key) for key in ["id", "name"]} 286 | save_data["user_id"] = user_id 287 | return ( 288 | db["shelves"] 289 | .insert( 290 | save_data, foreign_keys=(("user_id", "users", "id"),), pk="id", alter=True, replace=True 291 | ) 292 | .last_pk 293 | ) 294 | 295 | 296 | def maybe_date(value): 297 | if value: 298 | return dateutil.parser.parse(value) 299 | return None 300 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from setuptools import setup 4 | 5 | VERSION = "0.5" 6 | 7 | 8 | def get_long_description(): 9 | with open( 10 | os.path.join(os.path.dirname(os.path.abspath(__file__)), "README.md"), 11 | encoding="utf8", 12 | ) as fp: 13 | return fp.read() 14 | 15 | 16 | setup( 17 | name="goodreads-to-sqlite", 18 | description="Save data from Goodreads to a SQLite database", 19 | long_description=get_long_description(), 20 | long_description_content_type="text/markdown", 21 | author="Tobias Kunze", 22 | author_email="r@rixx.de", 23 | url="https://github.com/rixx/goodreads-to-sqlite", 24 | project_urls={ 25 | "Source": "https://github.com/rixx/goodreads-to-sqlite", 26 | "Issues": "https://github.com/rixx/goodreads-to-sqlite/issues", 27 | }, 28 | classifiers=[ 29 | "Development Status :: 5 - Production/Stable", 30 | "Environment :: Console", 31 | "License :: OSI Approved", 32 | "License :: OSI Approved :: Apache Software License", 33 | "Programming Language :: Python :: 3", 34 | "Programming Language :: Python :: 3.6", 35 | "Programming Language :: Python :: 3.7", 36 | "Topic :: Database", 37 | ], 38 | keywords="goodreads books sqlite export dogsheep", 39 | license="Apache License, Version 2.0", 40 | version=VERSION, 41 | packages=["goodreads_to_sqlite"], 42 | entry_points=""" 43 | [console_scripts] 44 | goodreads-to-sqlite=goodreads_to_sqlite.cli:cli 45 | """, 46 | install_requires=[ 47 | "beautifulsoup4~=4.8", 48 | "click", 49 | "python-dateutil", 50 | "requests", 51 | "sqlite-utils~=2.4.4", 52 | "tqdm~=4.36", 53 | ], 54 | extras_require={"test": ["pytest"]}, 55 | tests_require=["goodreads-to-sqlite[test]"], 56 | ) 57 | --------------------------------------------------------------------------------