├── .gitignore
├── CHANGELOG.md
├── LICENSE
├── README.md
├── assets
    └── demo.gif
├── goodreads_to_sqlite
    ├── __init__.py
    ├── cli.py
    └── utils.py
└── setup.py


/.gitignore:
--------------------------------------------------------------------------------
 1 | .venv
 2 | __pycache__/
 3 | *.py[cod]
 4 | *$py.class
 5 | venv
 6 | .eggs
 7 | .pytest_cache
 8 | *.egg-info
 9 | *.db
10 | auth.json
11 | build/
12 | dist/
13 | *.bak
14 | 


--------------------------------------------------------------------------------
/CHANGELOG.md:
--------------------------------------------------------------------------------
 1 | ## v0.5
 2 | 
 3 | - Upgrade dependencies. (9bc9f17)
 4 | 
 5 | ## v0.4
 6 | 
 7 | - Add this changelog. (d66b61f)
 8 | - Add initial progress message to avoid impression of idleness. (cd75582)
 9 | 
10 | ## v0.3
11 | 
12 | - Always save user IDs in shelves table. (b5fbbbe)
13 | - Accept user ID in URL format. (08aac71)
14 | - Allow to import user profiles from author URLs. (69dd5d8)
15 | 
16 | ## v0.2
17 | 
18 | - Adds web-scraping to avoid missing "read_at" metadata. (282218a)
19 | - Fixes an import bug. (7cd40af)
20 | - Ignore database column ordering. (54cd85b)
21 | 
22 | ## v0.1
23 | 
24 | Initial release.
25 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | Copyright 2019 Tobias Kunze <r@rixx.de>
 2 | 
 3 | Licensed under the Apache License, Version 2.0 (the "License");
 4 | you may not use this file except in compliance with the License.
 5 | You may obtain a copy of the License at
 6 | 
 7 |    http://www.apache.org/licenses/LICENSE-2.0
 8 | 
 9 | Unless required by applicable law or agreed to in writing, software
10 | distributed under the License is distributed on an "AS IS" BASIS,
11 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | See the License for the specific language governing permissions and
13 | limitations under the License.
14 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # goodreads-to-sqlite
 2 | 
 3 | [![PyPI](https://img.shields.io/pypi/v/goodreads-to-sqlite.svg)](https://pypi.org/project/goodreads-to-sqlite/)
 4 | [![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/rixx/goodreads-to-sqlite/blob/master/LICENSE)
 5 | 
 6 | Save data from Goodreads to a SQLite database. Can save all your public shelves and reviews, and also the public reviews
 7 | and shelves of other people.
 8 | 
 9 | ![Demo](./assets/demo.gif)
10 | 
11 | ## How to install
12 | 
13 |     $ pip install goodreads-to-sqlite
14 | 
15 | Add the `-U` flag to update. Change notes can be found in the ``CHANGELOG`` file, next to this README.
16 | 
17 | ## Authentication
18 | 
19 | Create a Goodreads developer token: https://www.goodreads.com/api/keys
20 | 
21 | Run this command and paste in your token and your profile URL:
22 | 
23 |     $ goodreads-to-sqlite auth
24 | 
25 | This will create a file called `auth.json` in your current directory containing the required value. To save the file at
26 | a different path or filename, use the `--auth=myauth.json` option.
27 | 
28 | ## Retrieving books
29 | 
30 | The `books` command retrieves all of the books and reviews/ratings belonging to you:
31 | 
32 |     $ goodreads-to-sqlite books goodreads.db
33 |     
34 | Note that your Goodreads profile must be public in order for this to work - if
35 | it is not already, you can enable this by visiting
36 | https://www.goodreads.com/user/edit?ref=nav_profile_settings and selecting
37 | "anyone (including search engines)" within the "Settings" tab.
38 | 
39 | You can also specify the user to target, to fetch books on public shelves of other users. Please provide either the user ID
40 | (the numerical part of a user's profile URL), or the name of their vanity URL.
41 | 
42 |     $ goodreads-to-sqlite books goodreads.db rixx
43 | 
44 | Sometime in 2018 or 2017, Goodreads started leaving out some "read_at" timestamps in their API. If you want to include
45 | these datapoints regardless, you can add the `--scrape` parameter, and the dates will be scraped from the website.
46 | This will take a bit longer, by maybe a minute depending on the size of your library.
47 | 
48 |     $ goodreads-to-sqlite books goodreads.db --scrape
49 | 
50 | The `auth.json` file is used by default for authentication. You can point to a different location of `auth.json` using
51 | `-a`:
52 | 
53 |     $ goodreads-to-sqlite books goodreads.db rixx -a /path/to/auth.json
54 | 
55 | ## Limitations
56 | 
57 | - The order of books in shelves is not exposed in the API, so we cannot determine the order of the to-read list.
58 | - Goodreads also offers a CSV export, which is currently not supported as an input format.
59 | - Since the Goodreads API is a bit slow, and we are restricted to one request per second, for larger libraries the
60 |   import can take a couple of minutes.
61 | - The script currently re-syncs the entire library instead of just looking at newly changed data, to make sure we don't
62 |   lose information after aborted syncs.
63 | 
64 | ## Thanks
65 | 
66 | This package is heavily inspired by [github-to-sqlite](https://github.com/dogsheep/github-to-sqlite/) by [Simon
67 | Willison](https://simonwillison.net/2019/Oct/7/dogsheep/).
68 | 
69 | The terminal recording above was made with [ASCIInema](https://asciinema.org/a/WT6bfxoFP3IlgeX8PO6FHDdDx).
70 | 


--------------------------------------------------------------------------------
/assets/demo.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rixx/goodreads-to-sqlite/1e438091e1ff2a1eeda48620bf6481719a9208c4/assets/demo.gif


--------------------------------------------------------------------------------
/goodreads_to_sqlite/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rixx/goodreads-to-sqlite/1e438091e1ff2a1eeda48620bf6481719a9208c4/goodreads_to_sqlite/__init__.py


--------------------------------------------------------------------------------
/goodreads_to_sqlite/cli.py:
--------------------------------------------------------------------------------
 1 | import csv
 2 | import json
 3 | import pathlib
 4 | import sys
 5 | 
 6 | import click
 7 | import sqlite_utils
 8 | 
 9 | from goodreads_to_sqlite import utils
10 | 
11 | 
12 | @click.group()
13 | @click.version_option()
14 | def cli():
15 |     "Save data from Goodreads to a SQLite database"
16 | 
17 | 
18 | @cli.command()
19 | @click.option(
20 |     "-a",
21 |     "--auth",
22 |     type=click.Path(file_okay=True, dir_okay=False, allow_dash=False),
23 |     default="auth.json",
24 |     help="Path to save tokens to, defaults to ./auth.json.",
25 | )
26 | def auth(auth):
27 |     "Save authentication credentials to a JSON file"
28 |     auth_data = {}
29 |     if pathlib.Path(auth).exists():
30 |         auth_data = json.load(open(auth))
31 |     saved_user_id = auth_data.get("goodreads_user_id")
32 |     click.echo(
33 |         "Create a Goodreads developer key at https://www.goodreads.com/api/keys and paste it here:"
34 |     )
35 |     personal_token = click.prompt("Developer key")
36 |     click.echo()
37 |     click.echo(
38 |         "Please enter your Goodreads user ID (numeric) or just paste your Goodreads profile URL."
39 |     )
40 |     user_id = click.prompt("User-ID or URL", default=saved_user_id)
41 |     user_id = user_id.strip("/").split("/")[-1].split("-")[0]
42 |     if not user_id.isdigit():
43 |         raise Exception(
44 |             "Your user ID has to be a number! {} does not look right".format(user_id)
45 |         )
46 |     auth_data["goodreads_personal_token"] = personal_token
47 |     auth_data["goodreads_user_id"] = user_id
48 |     open(auth, "w").write(json.dumps(auth_data, indent=4) + "\n")
49 |     auth_suffix = (" -a " + auth) if auth != "auth.json" else ""
50 |     click.echo()
51 |     click.echo(
52 |         "Your authentication credentials have been saved to {}. You can now import books by running".format(
53 |             auth
54 |         )
55 |     )
56 |     click.echo()
57 |     click.echo("    goodreads-to-sqlite books books.db" + auth_suffix + " [username]")
58 |     click.echo()
59 | 
60 | 
61 | @cli.command()
62 | @click.argument(
63 |     "db_path",
64 |     type=click.Path(file_okay=True, dir_okay=False, allow_dash=False),
65 |     required=True,
66 | )
67 | @click.option(
68 |     "-a",
69 |     "--auth",
70 |     type=click.Path(file_okay=True, dir_okay=False, allow_dash=False),
71 |     default="auth.json",
72 |     help="Path to save tokens to, defaults to auth.json",
73 | )
74 | @click.argument("username", required=False)
75 | @click.option(
76 |     "-s",
77 |     "--scrape",
78 |     is_flag=True,
79 |     help="Scrape missing data (like date_read) from the web interface. Slow.",
80 | )
81 | def books(db_path, auth, username, scrape):
82 |     """Save books for a specified user, e.g. rixx"""
83 |     db = sqlite_utils.Database(db_path)
84 |     try:
85 |         data = json.load(open(auth))
86 |         token = data["goodreads_personal_token"]
87 |         user_id = data["goodreads_user_id"]
88 |     except (KeyError, FileNotFoundError):
89 |         utils.error(
90 |             "Cannot find authentication data, please run goodreads_to_sqlite auth!"
91 |         )
92 | 
93 |     click.secho(f"Read credentials for user ID {user_id}.", fg="green")
94 |     if username:
95 |         user_id = username if username.isdigit() else utils.fetch_user_id(username)
96 | 
97 |     utils.fetch_user_and_shelves(user_id, token, db=db)
98 |     utils.fetch_books(db, user_id, token, scrape=scrape)
99 | 


--------------------------------------------------------------------------------
/goodreads_to_sqlite/utils.py:
--------------------------------------------------------------------------------
  1 | import datetime as dt
  2 | import sys
  3 | import xml.etree.ElementTree as ET
  4 | from contextlib import suppress
  5 | 
  6 | import bs4
  7 | import click
  8 | import dateutil.parser
  9 | import requests
 10 | from tqdm import tqdm
 11 | 
 12 | BASE_URL = "https://www.goodreads.com/"
 13 | 
 14 | 
 15 | def error(message):
 16 |     click.secho(message, bold=True, fg="red")
 17 |     sys.exit(-1)
 18 | 
 19 | 
 20 | def fetch_books(db, user_id, token, scrape=False):
 21 |     """Fetches a user's books and reviews from the public Goodreads API.
 22 | 
 23 |     Technically we are rate-limited to one request per second, but since we are not
 24 |     running in parallel, and the Goodreads API responds way slower than that, we are
 25 |     reliably in the clear."""
 26 |     url = BASE_URL + "review/list/{}.xml".format(user_id)
 27 |     params = {
 28 |         "key": token,
 29 |         "v": "2",
 30 |         "per_page": "200",
 31 |         "sort": "date_updated",
 32 |         "page": 0,
 33 |     }
 34 |     end = -1
 35 |     total = 0
 36 |     books = dict()
 37 |     authors = dict()
 38 |     reviews = dict()
 39 |     progress_bar = None
 40 | 
 41 |     while end < total:
 42 |         params["page"] += 1
 43 |         response = requests.get(url, data=params)
 44 |         response.raise_for_status()
 45 |         root = ET.fromstring(response.content.decode())
 46 |         review_data = root.find("reviews")
 47 |         end = int(review_data.attrib["end"])
 48 |         total = int(review_data.attrib["total"])
 49 | 
 50 |         if progress_bar is None:
 51 |             progress_bar = tqdm(
 52 |                 desc="Fetching books", total=int(review_data.attrib.get("total"))
 53 |             )
 54 |         for review in review_data:
 55 |             book_data = review.find("book")
 56 |             book_authors = []
 57 | 
 58 |             for author in book_data.find("authors"):
 59 |                 author_id = author.find("id").text
 60 |                 author = _get_author_from_data(author)
 61 |                 authors[author_id] = author
 62 |                 book_authors.append(author)
 63 | 
 64 |             book_id = book_data.find("id").text
 65 |             books[book_id] = _get_book_from_data(book_data, book_authors)
 66 | 
 67 |             review_id = review.find("id").text
 68 |             reviews[review_id] = _get_review_from_data(review, user_id)
 69 |             progress_bar.update(1)
 70 |     progress_bar.close()
 71 | 
 72 |     if scrape is True:
 73 |         scrape_data(user_id, reviews)
 74 | 
 75 |     save_authors(db, list(authors.values()))
 76 |     save_books(db, list(books.values()))
 77 |     save_reviews(db, list(reviews.values()))
 78 | 
 79 | 
 80 | def scrape_data(user_id, reviews):
 81 |     relevant_ids = {
 82 |         review_id
 83 |         for review_id, review in reviews.items()
 84 |         if "read_at" not in review
 85 |         and any(shelf["name"] == "read" for shelf in review["shelves"])
 86 |     }
 87 |     url = BASE_URL + "review/list/{}".format(user_id)
 88 |     params = {
 89 |         "utf8": "✓",
 90 |         "shelf": "read",
 91 |         "per_page": "100",  # Maximum allowed page size
 92 |         "sort": "date_updated",
 93 |         "page": 0,
 94 |     }
 95 |     date_counter = 0
 96 |     progress_bar = None
 97 |     while True:
 98 |         params["page"] += 1
 99 |         response = requests.get(url, data=params)
100 |         response.raise_for_status()
101 |         soup = bs4.BeautifulSoup(response.content.decode(), "html.parser")
102 |         if progress_bar is None:
103 |             read_shelf = soup.select("a.selectedShelf")[0].text
104 |             total = int(read_shelf[read_shelf.find("(") :].strip("()"))
105 |             progress_bar = tqdm(desc="Scraping books", total=total)
106 |         rows = soup.select("table#books tbody tr")
107 |         for row in rows:
108 |             review_id = row.attrs["id"][len("review_") :]
109 |             if review_id in relevant_ids:
110 |                 date = row.select(".date_read_value")
111 |                 if date:
112 |                     reviews[review_id]["read_at"] = dateutil.parser.parse(
113 |                         date[0].text, default=dt.date(2019, 1, 1)
114 |                     )
115 |                     date_counter += 1
116 |             progress_bar.update(1)
117 |         if not soup.select("a[rel=next]") or progress_bar.n >= progress_bar.total:
118 |             break
119 |     progress_bar.close()
120 |     click.echo("Found {} previously missing read dates.".format(date_counter))
121 | 
122 | 
123 | def save_authors(db, authors):
124 |     total = len(authors)
125 |     progress_bar = tqdm(total=total, desc="Saving authors")
126 |     db["authors"].insert_all(authors, pk="id", replace=True)
127 |     progress_bar.update(total)
128 |     progress_bar.close()
129 | 
130 | 
131 | def save_books(db, books):
132 |     authors_table = db.table("authors", pk="id")
133 |     for book in tqdm(books, desc="Saving books  "):
134 |         authors = book.pop("authors", [])
135 |         db["books"].insert(book, pk="id", replace=True).m2m(authors_table, authors)
136 | 
137 | 
138 | def save_reviews(db, reviews):
139 |     shelves_table = db.table("shelves", pk="id")
140 |     for review in tqdm(reviews, desc="Saving reviews"):
141 |         shelves = review.pop("shelves", [])
142 |         db["reviews"].insert(
143 |             review,
144 |             pk="id",
145 |             foreign_keys=(("book_id", "books", "id"), ("user_id", "users", "id")),
146 |             alter=True,
147 |             replace=True,
148 |         ).m2m(shelves_table, shelves)
149 | 
150 | 
151 | def _get_author_from_data(author):
152 |     return {"id": author.find("id").text, "name": author.find("name").text}
153 | 
154 | 
155 | def _get_book_from_data(book, authors):
156 |     series = None
157 |     series_position = None
158 |     title = book.find("title").text
159 |     title_series = book.find("title_without_series").text
160 |     if title != title_series:
161 |         series_with_position = title[len(title_series) :].strip(" ()")
162 |         if "#" in series_with_position:
163 |             series, series_position = series_with_position.split("#", maxsplit=1)
164 |         elif "Book" in series_with_position:
165 |             series, series_position = series_with_position.split("Book", maxsplit=1)
166 |         else:
167 |             series = series_with_position
168 |             series_position = ""
169 |         series = series.strip(", ")
170 |         series_position = series_position.strip(", #")
171 |         title = title_series
172 |     publication_year = book.find("publication_year").text
173 |     publication_date = None
174 |     if publication_year:
175 |         publication_date = dt.date(
176 |             int(book.find("publication_year").text),
177 |             int(book.find("publication_month").text or 1),
178 |             int(book.find("publication_day").text or 1),
179 |         )
180 |     return {
181 |         "id": book.find("id").text,
182 |         "isbn": book.find("isbn").text,
183 |         "isbn13": book.find("isbn13").text,
184 |         "title": title,
185 |         "series": series,
186 |         "series_position": series_position,
187 |         "pages": book.find("num_pages").text,
188 |         "publisher": book.find("publisher").text,
189 |         "publication_date": publication_date,
190 |         "description": book.find("description").text,
191 |         "image_url": book.find("image_url").text,
192 |         "authors": authors,
193 |     }
194 | 
195 | 
196 | def _get_review_from_data(review, user_id):
197 |     rating = review.find("rating").text
198 |     rating = int(rating) or None if rating else None
199 |     result = {
200 |         "id": review.find("id").text,
201 |         "book_id": review.find("book").find("id").text,
202 |         "user_id": user_id,
203 |         "rating": rating,
204 |         "text": (review.find("body").text or "").strip(),
205 |         "shelves": [
206 |             {
207 |                 "name": shelf.attrib.get("name"),
208 |                 "id": shelf.attrib.get("id"),
209 |                 "user_id": user_id,
210 |             }
211 |             for shelf in (review.find("shelves") or [])
212 |         ],
213 |     }
214 |     for key in ("started_at", "read_at", "date_added", "date_updated"):
215 |         date = maybe_date(review.find(key).text)
216 |         if date:
217 |             result[key] = date
218 |     return result
219 | 
220 | 
221 | def fetch_user_id(username, force_online=False, db=None) -> str:
222 |     """We can look up a user ID given a (public vanity) username.
223 | 
224 |     We go to that profile page, and observe the redirect target. If we have the
225 |     user in question in our database, we just return the known value, since
226 |     user IDs are assumed to be stable. The vanity URL redirects to a URL ending
227 |     in <user_id>-<username>."""
228 |     if not force_online and db:
229 |         user = db["users"].get(username=username)
230 |         if user:
231 |             return user.id
232 |     click.echo("Fetching user details.")
233 |     url = username if username.startswith("http") else BASE_URL + username
234 |     response = requests.get(url)
235 |     response.raise_for_status()
236 |     if "/author/" in username:
237 |         soup = bs4.BeautifulSoup(response.content.decode(), "html.parser")
238 |         url = soup.select("link[rel=alternate][title=Bookshelves]")[0].attrs["href"]
239 |     else:
240 |         url = response.request.url
241 |     result = url.strip("/").split("/")[-1].split("-")[0]
242 |     if not result.isdigit():
243 |         error("Cannot find user ID for {}".format(response.request.url))
244 |     return result
245 | 
246 | 
247 | def fetch_user_and_shelves(user_id, token, db) -> dict:
248 |     with suppress(TypeError):
249 |         user = db["users"].get(id=user_id)
250 |         shelves = db["shelves"].rows_where("user_id = ?", [user_id])
251 |         if user and all(user.values()) and shelves:
252 |             user["shelves"] = shelves
253 |             return user
254 |     click.secho("Fetching shelves.")
255 |     response = requests.get(
256 |         BASE_URL + "user/show/{}.xml".format(user_id), {"key": token}
257 |     )
258 |     response.raise_for_status()
259 |     to_root = ET.fromstring(response.content.decode())
260 |     user = to_root.find("user")
261 |     shelves = user.find("user_shelves")
262 |     if not shelves:
263 |         error("This user's shelves and reviews are private, and cannot be fetched.")
264 |     user = {
265 |         "id": user.find("id").text,
266 |         "name": user.find("name").text,
267 |         "username": user.find("user_name").text,
268 |         "shelves": [
269 |             {"id": shelf.find("id").text, "name": shelf.find("name").text}
270 |             for shelf in shelves
271 |         ],
272 |     }
273 |     save_user(db, user)
274 | 
275 | 
276 | def save_user(db, user):
277 |     save_data = {key: user.get(key) for key in ["id", "name", "username"]}
278 |     pk = db["users"].insert(save_data, pk="id", alter=True, replace=True).last_pk
279 |     for shelf in user.get("shelves", []):
280 |         save_shelf(db, shelf, user["id"])
281 |     return pk
282 | 
283 | 
284 | def save_shelf(db, shelf, user_id):
285 |     save_data = {key: shelf.get(key) for key in ["id", "name"]}
286 |     save_data["user_id"] = user_id
287 |     return (
288 |         db["shelves"]
289 |         .insert(
290 |             save_data, foreign_keys=(("user_id", "users", "id"),), pk="id", alter=True, replace=True
291 |         )
292 |         .last_pk
293 |     )
294 | 
295 | 
296 | def maybe_date(value):
297 |     if value:
298 |         return dateutil.parser.parse(value)
299 |     return None
300 | 


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | 
 3 | from setuptools import setup
 4 | 
 5 | VERSION = "0.5"
 6 | 
 7 | 
 8 | def get_long_description():
 9 |     with open(
10 |         os.path.join(os.path.dirname(os.path.abspath(__file__)), "README.md"),
11 |         encoding="utf8",
12 |     ) as fp:
13 |         return fp.read()
14 | 
15 | 
16 | setup(
17 |     name="goodreads-to-sqlite",
18 |     description="Save data from Goodreads to a SQLite database",
19 |     long_description=get_long_description(),
20 |     long_description_content_type="text/markdown",
21 |     author="Tobias Kunze",
22 |     author_email="r@rixx.de",
23 |     url="https://github.com/rixx/goodreads-to-sqlite",
24 |     project_urls={
25 |         "Source": "https://github.com/rixx/goodreads-to-sqlite",
26 |         "Issues": "https://github.com/rixx/goodreads-to-sqlite/issues",
27 |     },
28 |     classifiers=[
29 |         "Development Status :: 5 - Production/Stable",
30 |         "Environment :: Console",
31 |         "License :: OSI Approved",
32 |         "License :: OSI Approved :: Apache Software License",
33 |         "Programming Language :: Python :: 3",
34 |         "Programming Language :: Python :: 3.6",
35 |         "Programming Language :: Python :: 3.7",
36 |         "Topic :: Database",
37 |     ],
38 |     keywords="goodreads books sqlite export dogsheep",
39 |     license="Apache License, Version 2.0",
40 |     version=VERSION,
41 |     packages=["goodreads_to_sqlite"],
42 |     entry_points="""
43 |         [console_scripts]
44 |         goodreads-to-sqlite=goodreads_to_sqlite.cli:cli
45 |     """,
46 |     install_requires=[
47 |         "beautifulsoup4~=4.8",
48 |         "click",
49 |         "python-dateutil",
50 |         "requests",
51 |         "sqlite-utils~=2.4.4",
52 |         "tqdm~=4.36",
53 |     ],
54 |     extras_require={"test": ["pytest"]},
55 |     tests_require=["goodreads-to-sqlite[test]"],
56 | )
57 | 


--------------------------------------------------------------------------------